![](../src/logo.svg)

**© Jesús López**

Ask him any question on **[Twitter](https://twitter.com/jsulopz)** or **[LinkedIn](https://linkedin.com/in/jsulopz)**

# #03 | Train Test Split for Model Selection

<a href="https://colab.research.google.com/github/jsulopz/machine-learning/blob/main/03_Model%20Selection.%20Decision%20Tree%20vs%20Support%20Vector%20Machines%20vs%20Logistic%20Regression/03_model-selection_session_solution.ipynb">
  <img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/>
</a>

## Chapter Importance

Machine Learning models learn a mathematical equation from historical data.

Not all Machine Learning models predict the same way; some models are better than others.

We measure how good a model is by calculating its score (accuracy).

So far, we have calculated the model's score with the same data that was used to fit (train) the mathematical equation. That's cheating: a model can memorize its training data (overfitting) and still predict poorly on new data.

This tutorial compares 3 different models:

- Decision Tree
- Logistic Regression
- Support Vector Machines

We validate the models in 2 different ways:

1. Using the same data that was used during training
2. Using 30% of the data, which is not used during training

This demonstrates how the selection of the best model changes when we validate the models with data not used during training.

For example, the image below shows that the best model, when using the same data for validation, is the Decision Tree (0.86 accuracy). Nevertheless, everything changes when the models are evaluated with data not used during training: the best model is then the Logistic Regression (0.85 accuracy), whereas the Decision Tree only reaches an accuracy of 0.80.

![](df_comp.jpeg)

If we were a bank that loses 1M USD for every 0.01 drop in accuracy, choosing the Decision Tree (0.80) over the Logistic Regression (0.85) would have cost us 5M USD. This is something that happens in real life.
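The arithmetic behind the 5M USD figure can be sketched as follows. The accuracies are the rounded test accuracies from the comparison image, the cost of 1M USD per 0.01 of accuracy is the hypothetical figure used in the text, and the variable names are illustrative only.

```python
# Test accuracies taken from the comparison image (rounded to 2 decimals)
accuracy_logistic_regression = 0.85
accuracy_decision_tree = 0.80

# Hypothetical cost stated in the text: 1M USD per 0.01 of lost accuracy
loss_per_accuracy_point = 1_000_000

# Number of 0.01 "points" separating both models (round() avoids float noise)
points_lost = round((accuracy_logistic_regression - accuracy_decision_tree) / 0.01)

estimated_loss = points_lost * loss_per_accuracy_point
print(f"{estimated_loss:,} USD")
```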

In short, banks are interested in models that predict well for new potential customers, not for historical customers who have already received a loan and whose repayment behaviour the bank already knows.

This tutorial shows you step-by-step how to implement the `train_test_split` technique to detect overfitting, with a practical use case where we want to classify whether a person uses the Internet or not.

## Load the Data

Load the dataset from [CIS](https://www.cis.es/cis/opencms/ES/index.html) by executing the lines of code below.

- The goal of this dataset is to predict `internet_usage` of **people** (rows)
- Based on their **socio-demographic characteristics** (columns)

In [161]:
import pandas as pd

df_internet = pd.read_excel('../data/internet_usage_spain.xlsx', sheet_name=1, index_col=0)
df_internet

| name | internet_usage | sex | age | education |
| --- | --- | --- | --- | --- |
| Josefina | 0 | Female | 66 | Elementary |
| Vicki | 1 | Male | 72 | Elementary |
| ... | ... | ... | ... | ... |
| Christine | 1 | Male | 31 | High School |
| Kimberly | 0 | Male | 52 | Elementary |


## Preprocess the Data

In [162]:
df_internet.isna().sum()

internet_usage    0
sex               0
age               0
education         0
dtype: int64

In [163]:
df_internet = pd.get_dummies(df_internet, drop_first=True)
df_internet

| name | internet_usage | age | sex_Male | education_High School | education_Higher Level | education_No studies | education_PhD | education_University |
| --- | --- | --- | --- | --- | --- | --- | --- | --- |
| Josefina | 0 | 66 | 0 | 0 | 0 | 0 | 0 | 0 |
| Vicki | 1 | 72 | 1 | 0 | 0 | 0 | 0 | 0 |
| ... | ... | ... | ... | ... | ... | ... | ... | ... |
| Christine | 1 | 31 | 1 | 1 | 0 | 0 | 0 | 0 |
| Kimberly | 0 | 52 | 1 | 0 | 0 | 0 | 0 | 0 |
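To see what `pd.get_dummies(..., drop_first=True)` does, here is a minimal sketch on a made-up three-row sample (the rows and the names `df_sample` and `df_encoded` are hypothetical, not part of the dataset).

```python
import pandas as pd

# Hypothetical mini-sample mimicking the structure of the dataset
df_sample = pd.DataFrame({
    'sex': ['Female', 'Male', 'Male'],
    'age': [66, 72, 31],
    'education': ['Elementary', 'University', 'High School'],
})

# Each text column becomes one indicator column per category;
# drop_first=True removes the first category of each column
# (here 'Female' and 'Elementary'), which is implied when all the others are 0
df_encoded = pd.get_dummies(df_sample, drop_first=True)
print(df_encoded.columns.tolist())
```

The numeric column `age` is left untouched, which is why it appears before the new indicator columns.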


## Feature Selection

In [164]:
target = df_internet.internet_usage
features = df_internet.drop(columns='internet_usage')

## Build & Compare Models' Scores

### `DecisionTreeClassifier()` Model in Python

In [165]:
from sklearn.tree import DecisionTreeClassifier

model_dt = DecisionTreeClassifier()
model_dt.fit(X=features, y=target)

model_dt.score(X=features, y=target)

0.859877800407332

### `SVC()` Model in Python

In [166]:
from sklearn.svm import SVC

model_svc = SVC(probability=True)
model_svc.fit(X=features, y=target)

model_svc.score(X=features, y=target)

0.7837067209775967

### `LogisticRegression()` Model in Python

In [167]:
from sklearn.linear_model import LogisticRegression

model_lr = LogisticRegression(max_iter=1000)
model_lr.fit(X=features, y=target)

model_lr.score(X=features, y=target)

0.8334012219959267

## Function to Automate Lines of Code

- We repeated the same code every time:

```python
model.fit()
model.score()
```

- Why not turn these lines into a `function()` to **automate the process** of calculating the `accuracy`?

```python
calculate_accuracy(model=model_dt)
calculate_accuracy(model=model_svc)
calculate_accuracy(model=model_lr)
```

### Make a Procedure Sample for `DecisionTreeClassifier()`

In [168]:
model_dt = DecisionTreeClassifier()
model_dt.fit(X=features, y=target)
model_dt.score(X=features, y=target)

0.859877800407332

### Automate the Procedure into a `function()`

**Code Thinking**

1. Think of the function's `result`
2. Store that `object` to a variable
3. `return` the `result` at the end
4. **Indent the body** of the function to the right
5. `def`ine the `function():`
6. Think of what's gonna change when you execute the function with `different models`
7. Locate the **`variable` that you will change**
8. Turn it into the `parameter` of the `function()`

In [169]:
model_dt = DecisionTreeClassifier()
model_dt.fit(X=features, y=target)
model_dt.score(X=features, y=target)

0.859877800407332

#### Distinguish the line that gives you the `result` you want and put it into a variable

In [170]:
model_dt = DecisionTreeClassifier()
model_dt.fit(X=features, y=target)
result = model_dt.score(X=features, y=target)

#### Add a line with a `return` to tell the function the object you want in the end

In [171]:
model_dt = DecisionTreeClassifier()
model_dt.fit(X=features, y=target)
result = model_dt.score(X=features, y=target)

return result

SyntaxError: 'return' outside function (3200200737.py, line 5)

#### Indent everything to the right

In [172]:
    model_dt = DecisionTreeClassifier()
    model_dt.fit(X=features, y=target)
    result = model_dt.score(X=features, y=target)

    return result

SyntaxError: 'return' outside function (599614144.py, line 5)

#### Define the function in the first line

In [173]:
def calculate_accuracy():

    model_dt = DecisionTreeClassifier()
    model_dt.fit(X=features, y=target)
    result = model_dt.score(X=features, y=target)

    return result

#### What am I gonna change every time I run the function

In [174]:
def calculate_accuracy(model_dt):

    model_dt.fit(X=features, y=target)
    result = model_dt.score(X=features, y=target)

    return result

#### Generalize the name of the parameter

In [175]:
def calculate_accuracy(model):

    model.fit(X=features, y=target)
    result = model.score(X=features, y=target)

    return result

#### Add docstring

In [176]:
def calculate_accuracy(model):
    """
    This function calculates the accuracy for a given model as a parameter
    """
    
    model.fit(X=features, y=target)

    result = model.score(X=features, y=target)

    return result

In [177]:
calculate_accuracy(model_dt)

0.859877800407332
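The function above reads `features` and `target` from the notebook's global scope. A possible variation, sketched here under that observation, receives the data as parameters so it can be reused with any dataset. `calculate_accuracy_on` and `MajorityClassifier` are hypothetical names used only for illustration; the toy class just shows that any object with `fit()` and `score()` methods can be passed in.

```python
from collections import Counter


class MajorityClassifier:
    """Toy model (hypothetical): always predicts the most frequent class seen in fit()."""

    def fit(self, X, y):
        self.majority_ = Counter(y).most_common(1)[0][0]
        return self

    def score(self, X, y):
        # Accuracy: share of rows whose real class equals the majority class
        return sum(label == self.majority_ for label in y) / len(y)


def calculate_accuracy_on(model, X, y):
    """Fit the model on X, y and return its accuracy on the same data."""
    model.fit(X=X, y=y)
    return model.score(X=X, y=y)


# Tiny made-up dataset: 3 of the 4 people use the Internet
toy_features = [[66], [72], [31], [52]]
toy_target = [1, 1, 0, 1]

toy_accuracy = calculate_accuracy_on(MajorityClassifier(), toy_features, toy_target)
print(toy_accuracy)
```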

## Calculate Models' Accuracies

### `DecisionTreeClassifier()` Accuracy

In [178]:
calculate_accuracy(model_dt)

0.859877800407332

In [179]:
dic_accuracy = {}
dic_accuracy['Decision Tree'] = calculate_accuracy(model_dt)

### `SVC()` Accuracy

In [180]:
dic_accuracy['Support Vector Machines'] = calculate_accuracy(model_svc)
dic_accuracy

{'Decision Tree': 0.859877800407332,
 'Support Vector Machines': 0.7837067209775967}

### `LogisticRegression()` Accuracy

In [181]:
dic_accuracy['Logistic Regression'] = calculate_accuracy(model_lr)
dic_accuracy

{'Decision Tree': 0.859877800407332,
 'Support Vector Machines': 0.7837067209775967,
 'Logistic Regression': 0.8334012219959267}

## Which is the Best Model?

In [182]:
sr_accuracy = pd.Series(dic_accuracy).sort_values(ascending=False)
sr_accuracy

Decision Tree              0.859878
Logistic Regression        0.833401
Support Vector Machines    0.783707
dtype: float64
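If only the name of the winner is needed, a plain dictionary is enough. The sketch below copies the accuracies from the output above into a new dictionary (the name `accuracy_same_data` is illustrative) and picks the key with the highest value.

```python
# Accuracies copied from the output above
accuracy_same_data = {
    'Decision Tree': 0.859877800407332,
    'Support Vector Machines': 0.7837067209775967,
    'Logistic Regression': 0.8334012219959267,
}

# max() compares the dictionary keys by their associated accuracy
best_model = max(accuracy_same_data, key=accuracy_same_data.get)
print(best_model, round(accuracy_same_data[best_model], 2))
```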

## University Access Exams Analogy

Let's **imagine**:

1. You have a `math exam` on Saturday
2. Today is Monday
3. You want to **calculate if you need to study more** for the math exam
4. How do you calibrate your `math level`?
5. Well, you've got **100 questions `X` with 100 solutions `y`** from past years' exams
6. You may study the 100 questions with 100 solutions `fit(questions, solutions)`
7. Then, you may do a `mock exam` with the same 100 questions `predict(questions)`
8. And compare `your_solutions` with the `real_solutions`
9. You've got **90/100 correct answers** `accuracy` in the mock exam
10. You think you are **prepared for the math exam**
11. And when you do **the real exam on Saturday, the mark is 40/100**
12. Why? How could we have prevented this?
13. **Solution**: separate the 100 questions into `70 train` to study & `30 test` for the mock exam.
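The analogy can be simulated with a few lines of plain Python. This is only a sketch: `MemorizingStudent` is a hypothetical class that memorizes solutions instead of learning a rule, and the questions and solutions are randomly generated.

```python
import random

random.seed(0)  # fixed seed so the run is repeatable

# 100 hypothetical yes/no questions with their real solutions
questions = list(range(100))
real_solutions = [random.choice([0, 1]) for _ in questions]


class MemorizingStudent:
    """Hypothetical student who memorizes solutions instead of learning a rule."""

    def fit(self, questions, solutions):
        self.memory = dict(zip(questions, solutions))

    def predict(self, questions):
        # Recalls a memorized solution, or answers 0 for a question never seen
        return [self.memory.get(question, 0) for question in questions]


def accuracy(predictions, solutions):
    return sum(p == s for p, s in zip(predictions, solutions)) / len(solutions)


student = MemorizingStudent()

# Mock exam with the SAME 100 questions used to study: a perfect, misleading mark
student.fit(questions, real_solutions)
mark_same_questions = accuracy(student.predict(questions), real_solutions)

# Study 70 questions and keep 30 for the mock exam: a more honest mark
student.fit(questions[:70], real_solutions[:70])
mark_unseen_questions = accuracy(student.predict(questions[70:]), real_solutions[70:])

print(mark_same_questions, mark_unseen_questions)
```

The first mark is always perfect because every question was memorized, while the second one reflects how the student handles questions that were never studied.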

## `train_test_split()` the Data

- The documentation of the function contains a typical example.

In [1]:
from sklearn.model_selection import train_test_split

X_train, X_test, y_train, y_test = train_test_split(
    features, target, test_size=0.30, random_state=42)


### What the heck is the function returning?

From all the data:
- 2455 rows
- 8 columns

In [184]:
df_internet

| name | internet_usage | age | sex_Male | education_High School | education_Higher Level | education_No studies | education_PhD | education_University |
| --- | --- | --- | --- | --- | --- | --- | --- | --- |
| Josefina | 0 | 66 | 0 | 0 | 0 | 0 | 0 | 0 |
| Vicki | 1 | 72 | 1 | 0 | 0 | 0 | 0 | 0 |
| ... | ... | ... | ... | ... | ... | ... | ... | ... |
| Christine | 1 | 31 | 1 | 1 | 0 | 0 | 0 | 0 |
| Kimberly | 0 | 52 | 1 | 0 | 0 | 0 | 0 | 0 |


- 1718 rows (70% of all data) → to fit the model
- 7 columns (X: feature variables)

In [185]:
X_train

| name | age | sex_Male | education_High School | education_Higher Level | education_No studies | education_PhD | education_University |
| --- | --- | --- | --- | --- | --- | --- | --- |
| Eileen | 54 | 1 | 0 | 0 | 0 | 0 | 0 |
| Lucinda | 50 | 1 | 0 | 0 | 0 | 1 | 0 |
| ... | ... | ... | ... | ... | ... | ... | ... |
| Corey | 52 | 0 | 0 | 0 | 0 | 0 | 0 |
| Robert | 46 | 0 | 0 | 0 | 0 | 0 | 0 |


- 737 rows (30% of all data) → to evaluate the model
- 7 columns (X: feature variables)

In [186]:
X_test

| name | age | sex_Male | education_High School | education_Higher Level | education_No studies | education_PhD | education_University |
| --- | --- | --- | --- | --- | --- | --- | --- |
| Thomas | 52 | 0 | 0 | 0 | 0 | 0 | 0 |
| Pedro | 48 | 1 | 0 | 1 | 0 | 0 | 0 |
| ... | ... | ... | ... | ... | ... | ... | ... |
| William | 38 | 1 | 0 | 0 | 0 | 0 | 1 |
| Charles | 41 | 1 | 0 | 0 | 0 | 1 | 0 |


- 1718 rows (70% of all data) → to fit the model
- 1 column (y: target variable)

In [187]:
y_train

name
Eileen     0
Lucinda    1
          ..
Corey      0
Robert     1
Name: internet_usage, Length: 1718, dtype: int64

- 737 rows (30% of all data) → to evaluate the model
- 1 column (y: target variable)

In [188]:
y_test

name
Thomas     0
Pedro      1
          ..
William    1
Charles    1
Name: internet_usage, Length: 737, dtype: int64
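The sizes of the four returned objects can be checked without the original file. The sketch below uses made-up random arrays with the same shape as the tutorial data (2455 rows, 7 feature columns); the `_demo` names are illustrative and the values carry no meaning.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Synthetic stand-ins with the same shape as the tutorial data (values are made up)
n_rows, n_features = 2455, 7
rng = np.random.default_rng(42)
features_demo = rng.integers(0, 2, size=(n_rows, n_features))
target_demo = rng.integers(0, 2, size=n_rows)

X_train_demo, X_test_demo, y_train_demo, y_test_demo = train_test_split(
    features_demo, target_demo, test_size=0.30, random_state=42)

# Rows are shuffled and divided; the columns stay untouched
print(X_train_demo.shape, X_test_demo.shape)
print(y_train_demo.shape, y_test_demo.shape)
```

The train and test pieces always add up to the original number of rows, and `X` and `y` are divided with the same row selection so that each feature row keeps its own target value.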

### `fit()` the model with Train Data

In [189]:
model_dt.fit(X_train, y_train)

DecisionTreeClassifier()

### Compare the predictions with the real data

In [190]:
model_dt.score(X_test, y_test)

0.8032564450474898
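The drop from the train score to the test score is the symptom of overfitting. A minimal sketch on synthetic, noisy data generated with `make_classification` (not the tutorial dataset; all `_demo` names and parameter values are assumptions chosen for illustration) shows the same pattern.

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

# Synthetic, noisy classification data (not the tutorial dataset)
X_demo, y_demo = make_classification(
    n_samples=1000, n_features=7, flip_y=0.2, random_state=0)

X_tr, X_te, y_tr, y_te = train_test_split(
    X_demo, y_demo, test_size=0.30, random_state=42)

model_demo = DecisionTreeClassifier(random_state=0)
model_demo.fit(X_tr, y_tr)

accuracy_train = model_demo.score(X_tr, y_tr)  # data seen during fit()
accuracy_test = model_demo.score(X_te, y_te)   # data never seen by the model

# A large positive gap suggests the tree memorized the training rows
gap = accuracy_train - accuracy_test
print(f"train={accuracy_train:.2f} test={accuracy_test:.2f} gap={gap:.2f}")
```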

## Optimize All Models & Compare Again

### Make a Procedure Sample for `DecisionTreeClassifier()`

In [191]:
model_dt = DecisionTreeClassifier()
model_dt.fit(X_train, y_train)
model_dt.score(X_test, y_test)

0.8032564450474898

### Automate the Procedure into a `function()`

**Code Thinking**

1. Think of the function's `result`
2. Store that `object` to a variable
3. `return` the `result` at the end
4. **Indent the body** of the function to the right
5. `def`ine the `function():`
6. Think of what's gonna change when you execute the function with `different models`
7. Locate the **`variable` that you will change**
8. Turn it into the `parameter` of the `function()`

In [192]:
def calculate_accuracy_test(model):

    model.fit(X_train, y_train)
    result = model.score(X_test, y_test)

    return result
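Once the procedure is a function, all the models can be evaluated in a single loop. The sketch below is one possible way to do it, run on synthetic data so it is self-contained; `calculate_accuracy_split`, `models`, and the `_demo` names are hypothetical, and the function takes the split data as parameters instead of reading it from the global scope.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.svm import SVC
from sklearn.tree import DecisionTreeClassifier

# Synthetic stand-in data (not the tutorial dataset)
X_demo, y_demo = make_classification(n_samples=500, n_features=7, random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(
    X_demo, y_demo, test_size=0.30, random_state=42)


def calculate_accuracy_split(model, X_train, X_test, y_train, y_test):
    """Fit on the train data and return the accuracy on the test data."""
    model.fit(X_train, y_train)
    return model.score(X_test, y_test)


models = {
    'Decision Tree': DecisionTreeClassifier(),
    'Support Vector Machines': SVC(),
    'Logistic Regression': LogisticRegression(max_iter=1000),
}

# One dictionary entry per model, computed with the same function
dic_accuracy_demo = {
    name: calculate_accuracy_split(model, X_tr, X_te, y_tr, y_te)
    for name, model in models.items()
}
print(dic_accuracy_demo)
```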

## Calculate Models' Accuracies on Test Data

### `DecisionTreeClassifier()` Accuracy

In [193]:
dic_accuracy_test = {}

dic_accuracy_test['Decision Tree'] = calculate_accuracy_test(model_dt)
dic_accuracy_test

{'Decision Tree': 0.8046132971506106}

The score differs slightly from the previous cell because `calculate_accuracy_test()` fits the tree again, and a `DecisionTreeClassifier()` without a fixed `random_state` may build a slightly different tree on each fit.

### `SVC()` Accuracy

In [194]:
dic_accuracy_test['Support Vector Machines'] = calculate_accuracy_test(model_svc)
dic_accuracy_test

{'Decision Tree': 0.8046132971506106,
 'Support Vector Machines': 0.7788331071913162}

### `LogisticRegression()` Accuracy

In [195]:
dic_accuracy_test['Logistic Regression'] = calculate_accuracy_test(model_lr)
dic_accuracy_test

{'Decision Tree': 0.8046132971506106,
 'Support Vector Machines': 0.7788331071913162,
 'Logistic Regression': 0.8548168249660787}

## Which is the Best Model with `train_test_split()`?

In [196]:
sr_accuracy_test = pd.Series(dic_accuracy_test).sort_values(ascending=False)
sr_accuracy_test

Logistic Regression        0.854817
Decision Tree              0.804613
Support Vector Machines    0.778833
dtype: float64

In [197]:
import dataframe_image as dfi

In [198]:
df_accuracy = pd.DataFrame({
    'Same Data': sr_accuracy,
    'Test Data': sr_accuracy_test
})

df_accuracy.style.format('{:.2f}').background_gradient()

| | Same Data | Test Data |
| --- | --- | --- |
| Decision Tree | 0.86 | 0.80 |
| Logistic Regression | 0.83 | 0.85 |
| Support Vector Machines | 0.78 | 0.78 |
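A quick way to read this comparison is to compute the gap between both columns. The sketch below copies the accuracies from the outputs above into a new DataFrame (`df_gap` and the `Gap` column are illustrative names).

```python
import pandas as pd

# Accuracies copied from the outputs above (rounded to 6 decimals)
df_gap = pd.DataFrame({
    'Same Data': {'Decision Tree': 0.859878,
                  'Logistic Regression': 0.833401,
                  'Support Vector Machines': 0.783707},
    'Test Data': {'Decision Tree': 0.804613,
                  'Logistic Regression': 0.854817,
                  'Support Vector Machines': 0.778833},
})

# Positive gap: the model scored higher on the data it was trained with
df_gap['Gap'] = df_gap['Same Data'] - df_gap['Test Data']
print(df_gap.sort_values('Gap', ascending=False).round(3))
```

In these results the Decision Tree shows the largest drop, which is consistent with it having fitted the training data too closely.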


![](https://1.cms.s81c.com/sites/default/files/2021-03-03/model-over-fitting.png)

# Reflect

- Banks deploy models to predict the **probability that a customer repays a loan**
- If the bank had used the `DecisionTreeClassifier()` instead of the `LogisticRegression()`, what would have happened?
- Is `train_test_split()` always required to compare models?

<a rel="license" href="http://creativecommons.org/licenses/by-nc-nd/4.0/"><img alt="Creative Commons License" style="border-width:0" src="https://i.creativecommons.org/l/by-nc-nd/4.0/88x31.png" /></a><br />This work is licensed under a <a rel="license" href="http://creativecommons.org/licenses/by-nc-nd/4.0/">Creative Commons Attribution-NonCommercial-NoDerivatives 4.0 International License</a>.