![](../src/logo.svg)

**© Jesús López**

Ask him any question on **[Twitter](https://twitter.com/jsulopz)** or **[LinkedIn](https://linkedin.com/in/jsulopz)**

# #03 | Train Test Split for Model Selection

<a href="https://colab.research.google.com/github/jsulopz/machine-learning/blob/main/03_Model%20Selection.%20Decision%20Tree%20vs%20Support%20Vector%20Machines%20vs%20Logistic%20Regression/03_model-selection_session_solution.ipynb">
  <img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/>
</a>

## Load the Data

Load the dataset from [CIS](https://www.cis.es/cis/opencms/ES/index.html) by executing the lines of code below.

- The goal is to predict the `internet_usage` of **people** (rows)
- Based on their **sociodemographic characteristics** (columns)

In [6]:
import pandas as pd #!

df_internet = pd.read_excel('../data/internet_usage_spain.xlsx', sheet_name=1, index_col=0)
df_internet

name,internet_usage,sex,age,education
Josefina,0,Female,66,Elementary
Vicki,1,Male,72,Elementary
...,...,...,...,...
Christine,1,Male,31,High School
Kimberly,0,Male,52,Elementary


## Preprocess the Data

In [12]:
df_internet.isna().sum()

internet_usage    0
sex               0
age               0
education         0
dtype: int64

In [11]:
pd.get_dummies(df_internet, drop_first=True)

name,internet_usage,age,sex_Male,education_High School,education_Higher Level,education_No studies,education_PhD,education_University
Josefina,0,66,0,0,0,0,0,0
Vicki,1,72,1,0,0,0,0,0
...,...,...,...,...,...,...,...,...
Christine,1,31,1,1,0,0,0,0
Kimberly,0,52,1,0,0,0,0,0


In [13]:
df_internet = pd.get_dummies(df_internet, drop_first=True)
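
To see what `drop_first=True` does, here is a minimal sketch on a made-up three-row table. The `df_toy` DataFrame and its values are hypothetical and only stand in for the CIS dataset.

```python
import pandas as pd

# Hypothetical toy data, NOT the CIS dataset
df_toy = pd.DataFrame({
    'age': [66, 31, 45],
    'sex': ['Female', 'Male', 'Female'],
    'education': ['Elementary', 'High School', 'University'],
})

# Without drop_first: one dummy column per category
all_dummies = pd.get_dummies(df_toy)

# With drop_first=True: the first category of each variable is dropped,
# because it is implied when all the remaining dummies are 0
df_encoded = pd.get_dummies(df_toy, drop_first=True)

print(all_dummies.columns.tolist())
print(df_encoded.columns.tolist())
```

In this sketch `sex_Female` and `education_Elementary` disappear, which mirrors the columns missing from the encoded table above.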

## Feature Selection

In [14]:
explanatory = df_internet.drop(columns='internet_usage')

In [15]:
target = df_internet.internet_usage

## Build & Compare Models' Scores

### `DecisionTreeClassifier()` Model in Python

In [16]:
from sklearn.tree import DecisionTreeClassifier

In [17]:
model_dt = DecisionTreeClassifier()

In [18]:
model_dt.fit(X=explanatory, y=target)

DecisionTreeClassifier()

In [19]:
model_dt.score(X=explanatory, y=target)

0.859877800407332
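
For a classifier, `score()` returns the accuracy: the share of predictions that match the real values. The sketch below checks this by hand on synthetic data. `make_classification` and the names `X`, `y` are assumptions used here in place of `explanatory` and `target`.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.tree import DecisionTreeClassifier

# Synthetic stand-in for `explanatory` and `target` (assumption)
X, y = make_classification(n_samples=200, n_features=5, random_state=0)

model = DecisionTreeClassifier(max_depth=2, random_state=0)
model.fit(X, y)

# Accuracy by hand: proportion of predictions equal to the real values
predictions = model.predict(X)
manual_accuracy = np.mean(predictions == y)

# Accuracy as returned by the model
builtin_accuracy = model.score(X, y)

print(manual_accuracy, builtin_accuracy)
```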

### `SVC()` Model in Python

In [28]:
from sklearn.svm import SVC

In [29]:
model_svc = SVC(probability=True)

In [30]:
model_svc.fit(X=explanatory, y=target)

SVC(probability=True)

In [31]:
model_svc.score(X=explanatory, y=target)

0.7837067209775967

### `LogisticRegression()` Model in Python

In [32]:
from sklearn.linear_model import LogisticRegression

In [33]:
model_lr = LogisticRegression(max_iter=1000)

In [34]:
model_lr.fit(X=explanatory, y=target)

LogisticRegression(max_iter=1000)

In [35]:
model_lr.score(X=explanatory, y=target)

0.8334012219959267

## Function to Automate Lines of Code

- We kept repeating the same code:

```python
model.fit()
model.score()
```

- Why not turn these lines into a `function()` that **automates the process** of calculating the `accuracy`?

```python
calculate_accuracy(model=model_dt)

calculate_accuracy(model=model_svc)

calculate_accuracy(model=model_lr)
```

### Make a Procedure Sample for `DecisionTreeClassifier()`

In [36]:
model_dt = DecisionTreeClassifier()

In [37]:
model_dt.fit(X=explanatory, y=target)

DecisionTreeClassifier()

In [38]:
model_dt.score(X=explanatory, y=target)

0.859877800407332

### Automate the Procedure into a `function()`

**Code Thinking**

1. Think of the function's `result`
2. Store that `object` in a variable
3. `return` the `result` at the end
4. **Indent the body** of the function to the right
5. `def`ine the `function():`
6. Think of what's gonna change when you execute the function with `different models`
7. Locate the **`variable` that you will change**
8. Turn it into the `parameter` of the `function()`

In [39]:
model_dt = DecisionTreeClassifier()

In [40]:
model_dt.fit(X=explanatory, y=target)

DecisionTreeClassifier()

In [41]:
model_dt.score(X=explanatory, y=target)

0.859877800407332

#### Select all cells and press `[shift] + [M]` to merge them

In [30]:
model_dt = DecisionTreeClassifier()

model_dt.fit(X=explanatory, y=target)

model_dt.score(X=explanatory, y=target)

0.859877800407332

#### Distinguish the line that gives you the `result` you want and put it into a variable

In [31]:
model_dt = DecisionTreeClassifier()

model_dt.fit(X=explanatory, y=target)

result = model_dt.score(X=explanatory, y=target)

#### Add a line with a `return` to tell the function the object you want in the end

In [32]:
model_dt = DecisionTreeClassifier()

model_dt.fit(X=explanatory, y=target)

result = model_dt.score(X=explanatory, y=target)

return result

SyntaxError: 'return' outside function (1025789869.py, line 7)

#### Indent everything to the right

In [33]:
    model_dt = DecisionTreeClassifier()

    model_dt.fit(X=explanatory, y=target)

    result = model_dt.score(X=explanatory, y=target)

    return result

SyntaxError: 'return' outside function (1943077332.py, line 7)

#### Define the function in the first line

In [34]:
def calculate_accuracy():
    
    model_dt = DecisionTreeClassifier()

    model_dt.fit(X=explanatory, y=target)

    result = model_dt.score(X=explanatory, y=target)

    return result

#### What am I gonna change every time I run the function

In [36]:
def calculate_accuracy(model_dt):
    
     = DecisionTreeClassifier()

    model_dt.fit(X=explanatory, y=target)

    result = model_dt.score(X=explanatory, y=target)

    return result

IndentationError: unindent does not match any outer indentation level (<tokenize>, line 5)

#### Delete the line that creates the model inside the function

In [37]:
def calculate_accuracy(model_dt):

    model_dt.fit(X=explanatory, y=target)

    result = model_dt.score(X=explanatory, y=target)

    return result

#### Generalize the name of the parameter

In [38]:
def calculate_accuracy(model):

    model.fit(X=explanatory, y=target)

    result = model.score(X=explanatory, y=target)

    return result

#### Add docstring

In [43]:
def calculate_accuracy(model):
    """
    This function calculates the accuracy for a given model as a parameter
    """
    
    model.fit(X=explanatory, y=target)

    result = model.score(X=explanatory, y=target)

    return result

In [44]:
calculate_accuracy(model_dt)

0.859877800407332

## Calculate Models' Accuracies

### `DecisionTreeClassifier()` Accuracy

In [48]:
accuracies = {}

In [45]:
calculate_accuracy(model_dt)

0.859877800407332

In [50]:
accuracies['dt'] = calculate_accuracy(model_dt)

### `SVC()` Accuracy

In [46]:
calculate_accuracy(model_svc)

0.7837067209775967

In [51]:
accuracies['sv'] = calculate_accuracy(model_svc)

### `LogisticRegression()` Accuracy

In [47]:
calculate_accuracy(model_lr)

0.8334012219959267

In [52]:
accuracies['lr'] = calculate_accuracy(model_lr)

## Which is the Best Model?

In [55]:
pd.Series(accuracies).sort_values(ascending=False)

dt    0.859878
lr    0.833401
sv    0.783707
dtype: float64
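
The three assignments to `accuracies` can also be written as one loop. This is only a sketch: the data comes from `make_classification` (an assumption, not the CIS dataset) and the `models` dictionary is a hypothetical helper.

```python
import pandas as pd
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.svm import SVC
from sklearn.tree import DecisionTreeClassifier

# Synthetic stand-in for `explanatory` and `target` (assumption)
explanatory, target = make_classification(n_samples=300, n_features=6, random_state=0)

def calculate_accuracy(model):
    """Fit the model on all the data and return its accuracy on the same data."""
    model.fit(X=explanatory, y=target)
    return model.score(X=explanatory, y=target)

# Hypothetical dictionary that maps a short name to each model
models = {
    'dt': DecisionTreeClassifier(),
    'sv': SVC(),
    'lr': LogisticRegression(max_iter=1000),
}

# One accuracy per model, computed with the same function
accuracies = {name: calculate_accuracy(model) for name, model in models.items()}

ranking = pd.Series(accuracies).sort_values(ascending=False)
print(ranking)
```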

## University Access Exams Analogy

Let's **imagine**:

1. You have a `math exam` on Saturday
2. Today is Monday
3. You want to **calculate if you need to study more** for the math exam
4. How do you calibrate your `math level`?
5. Well, you've got **100 questions `X` with 100 solutions `y`** from past years' exams
6. You may study the 100 questions with 100 solutions `fit(questions, solutions)`
7. Then, you may do a `mock exam` with the 100 questions `predict(questions)`
8. And compare `your_solutions` with the `real_solutions`
9. You've got **90/100 correct answers** `accuracy` in the mock exam
10. You think you are **prepared for the math exam**
11. And when you do **the real exam on Saturday, the mark is 40/100**
12. Why? How could we have prevented this?
13. **Solution**: split the 100 questions into `70 train` to study & `30 test` for the mock exam.
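
The analogy can be sketched in code. The "questions" and "solutions" below are synthetic (generated with `make_classification`, with `flip_y` adding label noise) and the variable names are hypothetical; the point is only to contrast a mock exam on studied questions with one on unseen questions.

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

# 100 hypothetical questions with noisy solutions (assumption)
questions, solutions = make_classification(
    n_samples=100, n_features=10, flip_y=0.2, random_state=0)

# Study the 100 questions and take the mock exam on the same 100 questions
student = DecisionTreeClassifier(random_state=0)
student.fit(questions, solutions)
mock_same_questions = student.score(questions, solutions)

# Study 70 questions and take the mock exam on the 30 unseen ones
q_train, q_test, s_train, s_test = train_test_split(
    questions, solutions, test_size=0.30, random_state=0)
student = DecisionTreeClassifier(random_state=0)
student.fit(q_train, s_train)
mock_unseen_questions = student.score(q_test, s_test)

print(mock_same_questions, mock_unseen_questions)
```

An unrestricted decision tree can memorize the questions it studied, so the first mark is expected to be the optimistic one.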

## `train_test_split()` the Data

- The documentation of the function contains a typical example.

In [58]:
from sklearn.model_selection import train_test_split

In [62]:
X_train, X_test, y_train, y_test = train_test_split(
    explanatory, target, test_size=0.30, random_state=42)

### What the heck is the function returning?

From all the data:
- 2455 rows
- 8 columns

In [63]:
df_internet

name,internet_usage,age,sex_Male,education_High School,education_Higher Level,education_No studies,education_PhD,education_University
Josefina,0,66,0,0,0,0,0,0
Vicki,1,72,1,0,0,0,0,0
...,...,...,...,...,...,...,...,...
Christine,1,31,1,1,0,0,0,0
Kimberly,0,52,1,0,0,0,0,0


- 1718 rows (70% of all data) → to fit the model
- 7 columns (X: explanatory variables)

In [69]:
X_train

name,age,sex_Male,education_High School,education_Higher Level,education_No studies,education_PhD,education_University
Eileen,54,1,0,0,0,0,0
Lucinda,50,1,0,0,0,1,0
...,...,...,...,...,...,...,...
Corey,52,0,0,0,0,0,0
Robert,46,0,0,0,0,0,0


- 737 rows (30% of all data) → to evaluate the model
- 7 columns (X: explanatory variables)

In [70]:
X_test

name,age,sex_Male,education_High School,education_Higher Level,education_No studies,education_PhD,education_University
Thomas,52,0,0,0,0,0,0
Pedro,48,1,0,1,0,0,0
...,...,...,...,...,...,...,...
William,38,1,0,0,0,0,1
Charles,41,1,0,0,0,1,0


- 1718 rows (70% of all data) → to fit the model
- 1 column (y: target variable)

In [73]:
y_train

name
Eileen     0
Lucinda    1
          ..
Corey      0
Robert     1
Name: internet_usage, Length: 1718, dtype: int64

- 737 rows (30% of all data) → to evaluate the model
- 1 column (y: target variable)

In [75]:
y_test

name
Thomas     0
Pedro      1
          ..
William    1
Charles    1
Name: internet_usage, Length: 737, dtype: int64
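
The sizes of the four objects can be reproduced without the real data. This sketch uses placeholder arrays of 2455 rows and 7 columns (an assumption that only mimics the shape of `explanatory` and `target`) and also checks that rows and targets stay paired after the shuffle.

```python
import numpy as np
from sklearn.model_selection import train_test_split

n_rows = 2455

# Placeholder arrays with the same shape as the real data (assumption)
X = np.arange(n_rows * 7).reshape(n_rows, 7)
y = np.arange(n_rows)  # y[i] identifies row i

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.30, random_state=42)

print(X_train.shape, X_test.shape, y_train.shape, y_test.shape)
# → (1718, 7) (737, 7) (1718,) (737,)

# Each row of X_train still matches its own target in y_train
rows_stay_paired = np.array_equal(X_train[:, 0] // 7, y_train)
print(rows_stay_paired)
```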

### `fit()` the model with Train Data

In [55]:
model_dt.fit(X_train, y_train)

DecisionTreeClassifier()

### Compare the predictions with the real data

In [81]:
model_dt.score(X_test, y_test)

0.8887381275440976

## Optimize All Models & Compare Again

### Make a Procedure Sample for `DecisionTreeClassifier()`

In [85]:
model_dt = DecisionTreeClassifier()

In [86]:
model_dt.fit(X_train, y_train)

DecisionTreeClassifier()

In [87]:
model_dt.score(X_test, y_test)

0.8046132971506106

### Automate the Procedure into a `function()`

**Code Thinking**

1. Think of the function's `result`
2. Store that `object` in a variable
3. `return` the `result` at the end
4. **Indent the body** of the function to the right
5. `def`ine the `function():`
6. Think of what's gonna change when you execute the function with `different models`
7. Locate the **`variable` that you will change**
8. Turn it into the `parameter` of the `function()`

In [88]:
def calculate_accuracy_test(model):

    model.fit(X_train, y_train)

    result = model.score(X_test, y_test)

    return result
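
The function above reads `X_train`, `X_test`, `y_train` and `y_test` from the notebook's global variables. A possible variation, sketched here under the hypothetical name `calculate_accuracy_split` and with synthetic data, receives the four objects as parameters so it can be reused with any split.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split

def calculate_accuracy_split(model, X_train, X_test, y_train, y_test):
    """Fit the model on the train data and return its accuracy on the test data."""
    model.fit(X_train, y_train)
    return model.score(X_test, y_test)

# Synthetic stand-in for `explanatory` and `target` (assumption)
X, y = make_classification(n_samples=300, n_features=6, random_state=0)

# train_test_split returns X_train, X_test, y_train, y_test in this order
splits = train_test_split(X, y, test_size=0.30, random_state=42)

accuracy = calculate_accuracy_split(LogisticRegression(max_iter=1000), *splits)
print(accuracy)
```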

## Calculate Models' Accuracies

### `DecisionTreeClassifier()` Accuracy

In [92]:
accuracies = {}

In [93]:
calculate_accuracy_test(model_dt)

0.8046132971506106

In [94]:
accuracies['dt'] = calculate_accuracy_test(model_dt)

### `SVC()` Accuracy

In [95]:
calculate_accuracy_test(model_svc)

0.7788331071913162

In [96]:
accuracies['sv'] = calculate_accuracy_test(model_svc)

### `LogisticRegression()` Accuracy

In [97]:
calculate_accuracy_test(model_lr)

0.8548168249660787

In [98]:
accuracies['lr'] = calculate_accuracy_test(model_lr)

## Which is the Best Model with `train_test_split()`?

In [99]:
pd.Series(accuracies).sort_values(ascending=False)

lr    0.854817
dt    0.804613
sv    0.778833
dtype: float64
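
Placing the train and test accuracies side by side makes the change in ranking easier to read. The following is a sketch on synthetic data (an assumption, so the numbers will differ from the ones above); the `comparison` table is a hypothetical helper.

```python
import pandas as pd
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.svm import SVC
from sklearn.tree import DecisionTreeClassifier

# Synthetic stand-in for `explanatory` and `target` (assumption)
X, y = make_classification(n_samples=500, n_features=6, flip_y=0.1, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.30, random_state=42)

models = {
    'dt': DecisionTreeClassifier(random_state=0),
    'sv': SVC(),
    'lr': LogisticRegression(max_iter=1000),
}

rows = {}
for name, model in models.items():
    model.fit(X_train, y_train)
    rows[name] = {
        'train': model.score(X_train, y_train),  # data the model has seen
        'test': model.score(X_test, y_test),     # data the model has not seen
    }

# One row per model, sorted by the accuracy on unseen data
comparison = pd.DataFrame.from_dict(rows, orient='index')
comparison = comparison.sort_values('test', ascending=False)
print(comparison)
```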

# Reflect

- Banks deploy models to predict the **probability that a customer repays a loan**
- What would have happened if the bank had used the `DecisionTreeClassifier()` instead of the `LogisticRegression()`?
- Is `train_test_split()` always required to compare models?
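
As a starting point for the last question, scikit-learn also offers `cross_val_score()`, which repeats the train/test idea over several splits. This is a minimal sketch on synthetic data (an assumption); it is not part of the session above.

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import cross_val_score
from sklearn.tree import DecisionTreeClassifier

# Synthetic stand-in for `explanatory` and `target` (assumption)
X, y = make_classification(n_samples=300, n_features=6, random_state=0)

# cv=5: the data is split into 5 parts; each part is used once as test data
scores = cross_val_score(DecisionTreeClassifier(random_state=0), X, y, cv=5)
mean_score = scores.mean()

print(scores)
print(mean_score)
```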

<a rel="license" href="http://creativecommons.org/licenses/by-nc-nd/4.0/"><img alt="Creative Commons License" style="border-width:0" src="https://i.creativecommons.org/l/by-nc-nd/4.0/88x31.png" /></a><br />This work is licensed under a <a rel="license" href="http://creativecommons.org/licenses/by-nc-nd/4.0/">Creative Commons Attribution-NonCommercial-NoDerivatives 4.0 International License</a>.