**© Jesús López**

Ask him any questions on **[Twitter](https://twitter.com/jsulopzs)** or **[LinkedIn](https://linkedin.com/in/jsulopzs)**

## Chapter Importance

Machine Learning models learn a mathematical equation from historical data.

Not all Machine Learning models predict the same way; some models are better than others.

We measure how good a model is by calculating its score (accuracy).

So far, we have calculated the model's score using the same data we used to fit (train) the mathematical equation. That's cheating, and it hides overfitting: a model can score well simply by memorizing the data it was trained on.

This tutorial compares 3 different models:

- Decision Tree
- Logistic Regression
- Support Vector Machines

We validate the models in 2 different ways:

1. Using the same data used during training
2. Using 30% of the data, held out and not used during training

This demonstrates how the choice of the best model changes when we validate the models with data not used during training.

For example, the image below shows that, when the same data is used for validation, the best model is the Decision Tree (0.86 accuracy). However, everything changes when the models are evaluated with data not used during training: the best model is then the Logistic Regression (0.85 accuracy), whereas the Decision Tree only reaches 0.80.


![df_comp.jpeg](https://cdn.hashnode.com/res/hashnode/image/upload/v1661356658503/xtMfk_S0n.jpeg align="left")

If we were a bank that loses 1M USD for every 0.01 drop in accuracy, choosing the Decision Tree (0.80) over the Logistic Regression (0.85) would have cost us 5M USD. This is something that happens in real life.
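
The 5M USD figure follows from the accuracy gap between the two models on data not used during training. A quick check of the arithmetic, assuming (as the example does) a loss of 1M USD per 0.01 of accuracy; the variable names are only illustrative:

```python
# Accuracies on data not used during training (from the comparison above)
accuracy_logistic = 0.85
accuracy_tree = 0.80

# Assumption taken from the example: each 0.01 of lost accuracy costs 1M USD
loss_per_point_usd = 1_000_000

# Count the gap in whole hundredths to avoid floating-point noise
points_lost = round((accuracy_logistic - accuracy_tree) * 100)
loss_usd = points_lost * loss_per_point_usd
print(points_lost, loss_usd)
```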

In short, banks want models that predict well for new potential customers, not for historical customers who have already received a loan and whose repayment behaviour the bank already knows.

This tutorial shows you how to implement the `train_test_split` technique to reduce overfitting with a practical use case where we want to classify whether a person used the Internet or not.

## [ ] Load the Data

Load the dataset from [CIS](https://www.cis.es/cis/opencms/ES/index.html), executing the following lines of code:

In [74]:
import pandas as pd #!

df_internet = pd.read_excel('https://github.com/jsulopzs/data/blob/main/internet_usage_spain.xlsx?raw=true', sheet_name=1, index_col=0)
df_internet

| name | internet_usage | sex | age | education |
| --- | --- | --- | --- | --- |
| Josefina | 0 | Female | 66 | Elementary |
| Vicki | 1 | Male | 72 | Elementary |
| David | 1 | Male | 48 | University |
| Curtis | 0 | Male | 59 | PhD |
| Josephine | 1 | Female | 44 | PhD |
| ... | ... | ... | ... | ... |
| Frances | 1 | Male | 43 | Elementary |
| Harry | 1 | Female | 18 | High School |
| Adam | 0 | Female | 54 | Elementary |
| Christine | 1 | Male | 31 | High School |


- The goal with this dataset is to predict the `internet_usage` of **people** (rows) based on their **sociodemographic characteristics** (columns)

## Preprocess the Data

### Missing Data

In [75]:
df_internet.isna().sum()

internet_usage    0
sex               0
age               0
education         0
dtype: int64

### Dummy Variables

In [76]:
df_internet = pd.get_dummies(data=df_internet, drop_first=True)
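
To see what this step does to the text columns, here is a minimal sketch on a tiny made-up sample (the three rows and the name `df_sample` are only for illustration). Each categorical column becomes one indicator column per category, and `drop_first=True` drops the first category of each column so that it acts as the baseline:

```python
import pandas as pd

# Tiny made-up sample with the same column names as the tutorial's data
df_sample = pd.DataFrame({
    'internet_usage': [0, 1, 1],
    'sex': ['Female', 'Male', 'Male'],
    'age': [66, 72, 48],
    'education': ['Elementary', 'Elementary', 'University'],
})

# Text columns are replaced by indicator columns (0/1, or True/False in
# recent pandas versions); numeric columns are left untouched
df_dummies = pd.get_dummies(data=df_sample, drop_first=True)
print(df_dummies.columns.tolist())
```

Here `sex_Female` and `education_Elementary` are the dropped baselines: a row with `sex_Male` equal to 0 is a female respondent.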

## Feature Selection

In [77]:
features = df_internet.drop(columns='internet_usage')
target = df_internet[['internet_usage']]

## [ ] Build & Compare Models' Scores

We should already know that the Machine Learning procedure is always the same:
1. Compute a mathematical equation: **fit**
2. Calculate predictions: **predict**
3. Compare them to reality: **score**

The only element that changes is the `Class()` that contains the code of a specific algorithm (DecisionTreeClassifier, SVC, LogisticRegression).
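
As a minimal sketch of what the third step means, the snippet below runs the three steps on made-up data and checks that `score` for a classifier is simply the share of predictions that match reality. The data, `max_depth=2` and the seeds are arbitrary choices for illustration, not values from this tutorial:

```python
import numpy as np
from sklearn.tree import DecisionTreeClassifier

# Made-up data: 200 rows, 3 numeric features, binary target
rng = np.random.default_rng(0)
X = rng.normal(size=(200, 3))
y = (X[:, 0] + X[:, 1] > 0).astype(int)

model = DecisionTreeClassifier(max_depth=2, random_state=0)
model.fit(X, y)                    # 1. fit: compute the mathematical equation
predictions = model.predict(X)     # 2. predict: calculate predictions

# 3. score: compare predictions to reality
manual_accuracy = (predictions == y).mean()
builtin_accuracy = model.score(X, y)
print(manual_accuracy, builtin_accuracy)
```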

### `DecisionTreeClassifier()` Model in Python

In [78]:
from sklearn.tree import DecisionTreeClassifier

#### Instantiate the Class

In [79]:
model_d = DecisionTreeClassifier()

#### Fit the Model

In [80]:
model_d.fit(X=features, y=target)

In [81]:
model_d.predict(X=features)

array([0, 0, 1, ..., 0, 1, 0], dtype=int64)

In [82]:
model_d.score(X=features, y=target)

0.859877800407332

### `SVC()` Model in Python

In [83]:
from sklearn.svm import SVC

#### Instantiate the Class

In [84]:
model_svc = SVC()

#### Fit the Model

In [85]:
model_svc.fit(X=features, y=target)

  y = column_or_1d(y, warn=True)


In [86]:
model_svc.predict(X=features)

array([0, 0, 1, ..., 0, 1, 0], dtype=int64)

In [87]:
model_svc.score(X=features, y=target)

0.7837067209775967

### `LogisticRegression()` Model in Python

In [88]:
from sklearn.linear_model import LogisticRegression

#### Instantiate the Class

In [89]:
model_Lr = LogisticRegression()

#### Fit the Model

In [90]:
model_Lr.fit(X=features, y=target)

  y = column_or_1d(y, warn=True)
STOP: TOTAL NO. of ITERATIONS REACHED LIMIT.

Increase the number of iterations (max_iter) or scale the data as shown in:
    https://scikit-learn.org/stable/modules/preprocessing.html
Please also refer to the documentation for alternative solver options:
    https://scikit-learn.org/stable/modules/linear_model.html#logistic-regression
  n_iter_i = _check_optimize_result(
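
The output above contains two scikit-learn warnings: one because `y` was passed as a one-column DataFrame instead of a 1-D array, and one because the solver stopped at its iteration limit. The sketch below shows one possible way to address both, on made-up data. The names `features_demo` and `target_demo` and the value `max_iter=1000` are assumptions for illustration, and the scores in this tutorial were obtained without these changes:

```python
import numpy as np
import pandas as pd
from sklearn.linear_model import LogisticRegression

# Made-up data standing in for the tutorial's features and target
rng = np.random.default_rng(0)
age = rng.integers(18, 90, size=300)
noise = rng.normal(scale=10, size=300)
features_demo = pd.DataFrame({'age': age,
                              'sex_Male': rng.integers(0, 2, size=300)})
target_demo = pd.DataFrame({'internet_usage': (age + noise < 55).astype(int)})

# Double brackets give a 2-D DataFrame; single brackets give a 1-D Series
y_1d = target_demo['internet_usage']
print(target_demo.shape, y_1d.shape)

# max_iter=1000 is an arbitrary value for illustration (the default is 100)
model = LogisticRegression(max_iter=1000)
model.fit(X=features_demo, y=y_1d)
accuracy = model.score(X=features_demo, y=y_1d)
print(accuracy)
```

Scaling the features, as the warning text suggests, is another option for helping the solver converge.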


In [91]:
model_Lr.predict(X=features)

array([0, 0, 1, ..., 0, 1, 0], dtype=int64)

In [92]:
model_Lr.score(X=features, y=target)

0.8325865580448065

## [ ] Function to Automate Lines of Code

- We kept repeating the same code:

```Python
model.fit()
model.score()
```

- Why not turn those lines into a `function()` to **automate the process** of calculating the `accuracy`?

```Python
calculate(model_d)
calculate(model_svc)
calculate(model_Lr)
```

### Make a Procedure Sample for `DecisionTreeClassifier()`

In [93]:
model_d = DecisionTreeClassifier()
model_d.fit(X=features, y=target)
model_d.score(X=features, y=target)

0.859877800407332

### Instantiate the Models

In [94]:
model_d = DecisionTreeClassifier()
model_svc = SVC()
model_Lr = LogisticRegression()

### Automate the Procedure into a `function()`

**Code Thinking**

1. Think of the function's `result`
2. Store that `object` to a variable
3. `return` the `result` at the end
4. **Indent the body** of the function to the right
5. `def`ine the `function():`
6. Think of what's gonna change when you execute the function with `different models`
7. Locate the **`variable` that you will change**
8. Turn it into the `parameter` of the `function()`

In [95]:
model_d = DecisionTreeClassifier()
model_d.fit(X=features, y=target)
result = model_d.score(X=features, y=target)
result

### Define the Function

In [96]:
def calculate_accuracy():
    model_d = DecisionTreeClassifier()
    model_d.fit(X=features, y=target)
    result = model_d.score(X=features, y=target)
    return result

### Define a New Function with the Model as a Parameter

In [97]:
def calculate(model):
    model.fit(X=features, y=target)
    result = model.score(X=features, y=target)
    return result

In [98]:
calculate(model_Lr)

0.8325865580448065

## Calculate Models' Accuracies

### `DecisionTreeClassifier()` Accuracy

In [99]:
calculate(model_d)

0.859877800407332

In [100]:
dic_accuracy = {}
dic_accuracy['Decision-tree'] = calculate(model_d)

### `SVC()` Accuracy

In [101]:
calculate(model_svc)

  y = column_or_1d(y, warn=True)


0.7837067209775967

In [103]:

dic_accuracy['SVC'] =calculate(model_svc)
dic_accuracy

  y = column_or_1d(y, warn=True)


{'Decision-tree': 0.859877800407332, 'SVC': 0.7837067209775967}

### `LogisticRegression()` Accuracy

In [104]:
calculate(model_Lr)

0.8325865580448065

In [105]:
dic_accuracy['LogisticRegression'] = calculate(model_Lr)
dic_accuracy

{'Decision-tree': 0.859877800407332,
 'SVC': 0.7837067209775967,
 'LogisticRegression': 0.8325865580448065}
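
The three dictionary assignments follow the same pattern, so they could also be written as one loop. The sketch below uses made-up data and a hypothetical helper `calculate_demo` that receives the data as parameters instead of reading global variables; neither is part of the tutorial's own code:

```python
import numpy as np
import pandas as pd
from sklearn.linear_model import LogisticRegression
from sklearn.svm import SVC
from sklearn.tree import DecisionTreeClassifier

# Made-up data: 200 rows, 3 numeric features, binary target
rng = np.random.default_rng(0)
features_demo = rng.normal(size=(200, 3))
target_demo = (features_demo[:, 0] + features_demo[:, 1] > 0).astype(int)

def calculate_demo(model, X, y):
    """Fit the model and score it on the same data (as in this section)."""
    model.fit(X, y)
    return model.score(X, y)

models = {
    'Decision-tree': DecisionTreeClassifier(),
    'SVC': SVC(),
    'LogisticRegression': LogisticRegression(),
}

# One entry per model, built with a dictionary comprehension
dic_accuracy_demo = {
    name: calculate_demo(model, features_demo, target_demo)
    for name, model in models.items()
}
print(pd.Series(dic_accuracy_demo).sort_values(ascending=False))
```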

## Which is the Best Model?

In [106]:
dic_accuracy

{'Decision-tree': 0.859877800407332,
 'SVC': 0.7837067209775967,
 'LogisticRegression': 0.8325865580448065}

In [107]:
accuracy = pd.Series(dic_accuracy).sort_values(ascending=False)
accuracy

Decision-tree         0.859878
LogisticRegression    0.832587
SVC                   0.783707
dtype: float64

### The Decision Tree Is the Best Model on the Training Data

## [ ] University Access Exams Analogy

Let's **imagine**:

1. You have a `math exam` on Saturday
2. Today is Monday
3. You want to **calibrate your level in case you need to study more** for the math exam
4. How do you calibrate your `math level`?
5. Well, you've got **100 questions `X` with 100 solutions `y`** from past years' exams
6. You may study the 100 questions with 100 solutions `fit(100questions, 100solutions)`
7. Then, you may do a `mock exam` with the 100 questions `predict(100questions)`
8. And compare `your_100solutions` with the `real_100solutions`
9. You've got **90/100 correct answers** `accuracy` in the mock exam
10. You think you are **prepared for the math exam**
11. And when you do **the real exam on Saturday, the mark is 40/100**
12. Why? How could we have prevented this?
13. **Solution**: separate the 100 questions into `70 for train` to study & `30 for test` for the mock exam.
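
The analogy can be sketched in code with made-up data: 100 "questions" with noisy "solutions", 70 used to study and 30 kept for the mock exam. All names, the noise level and the seeds below are invented for illustration:

```python
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

# Made-up "exam": 100 questions (rows) whose solutions are partly noise
rng = np.random.default_rng(0)
questions = rng.normal(size=(100, 5))
solutions = (questions[:, 0] + rng.normal(scale=1.0, size=100) > 0).astype(int)

# 70 questions to study, 30 kept aside for the mock exam
q_study, q_mock, s_study, s_mock = train_test_split(
    questions, solutions, test_size=0.30, random_state=42)

student = DecisionTreeClassifier(random_state=0)
student.fit(q_study, s_study)

score_studied = student.score(q_study, s_study)  # questions already seen
score_mock = student.score(q_mock, s_mock)       # questions never seen
print('studied questions:', score_studied)
print('mock exam:', score_mock)
```

An unrestricted tree memorizes the questions it studied, so its score on them says little about how it will do on new ones; the mock exam score is the honest estimate.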

## `train_test_split()` the Data

In [109]:
from sklearn.model_selection import train_test_split

In [113]:
X_train, X_test, y_train, y_test = train_test_split(
    features, target, test_size=0.30, random_state=42)
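
A minimal sketch of what the call returns, using a made-up table of 10 people (the `_demo` names are hypothetical). The function returns four objects: train and test features, then train and test targets, with rows kept aligned through the index:

```python
import numpy as np
import pandas as pd
from sklearn.model_selection import train_test_split

# Made-up table of 10 people
features_demo = pd.DataFrame({'age': np.arange(20, 30)})
target_demo = pd.DataFrame({'internet_usage': [0, 1] * 5})

X_tr, X_te, y_tr, y_te = train_test_split(
    features_demo, target_demo, test_size=0.30, random_state=42)

# 70% of the rows go to train and 30% to test
print(X_tr.shape, X_te.shape, y_tr.shape, y_te.shape)

# Features and targets keep the same row labels, so they stay aligned
print(X_tr.index.equals(y_tr.index), X_te.index.equals(y_te.index))

# The same random_state reproduces the same split
X_tr_again, _, _, _ = train_test_split(
    features_demo, target_demo, test_size=0.30, random_state=42)
print(X_tr.equals(X_tr_again))
```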

### What Does the Function Return?

In [114]:
df_internet

| name | internet_usage | age | sex_Male | education_High School | education_Higher Level | education_No studies | education_PhD | education_University |
| --- | --- | --- | --- | --- | --- | --- | --- | --- |
| Josefina | 0 | 66 | 0 | 0 | 0 | 0 | 0 | 0 |
| Vicki | 1 | 72 | 1 | 0 | 0 | 0 | 0 | 0 |
| David | 1 | 48 | 1 | 0 | 0 | 0 | 0 | 1 |
| Curtis | 0 | 59 | 1 | 0 | 0 | 0 | 1 | 0 |
| Josephine | 1 | 44 | 0 | 0 | 0 | 0 | 1 | 0 |
| ... | ... | ... | ... | ... | ... | ... | ... | ... |
| Frances | 1 | 43 | 1 | 0 | 0 | 0 | 0 | 0 |
| Harry | 1 | 18 | 0 | 1 | 0 | 0 | 0 | 0 |
| Adam | 0 | 54 | 0 | 0 | 0 | 0 | 0 | 0 |
| Christine | 1 | 31 | 1 | 1 | 0 | 0 | 0 | 0 |


### `fit()` the model with Train Data

In [115]:
X_train

| name | age | sex_Male | education_High School | education_Higher Level | education_No studies | education_PhD | education_University |
| --- | --- | --- | --- | --- | --- | --- | --- |
| Eileen | 54 | 1 | 0 | 0 | 0 | 0 | 0 |
| Lucinda | 50 | 1 | 0 | 0 | 0 | 1 | 0 |
| Ivan | 26 | 1 | 0 | 0 | 0 | 0 | 1 |
| Gilda | 62 | 1 | 0 | 0 | 0 | 0 | 1 |
| Stanley | 86 | 0 | 0 | 0 | 1 | 0 | 0 |
| ... | ... | ... | ... | ... | ... | ... | ... |
| Raymond | 37 | 0 | 1 | 0 | 0 | 0 | 0 |
| Antonio | 35 | 0 | 0 | 1 | 0 | 0 | 0 |
| Alma | 58 | 0 | 0 | 0 | 0 | 0 | 0 |
| Corey | 52 | 0 | 0 | 0 | 0 | 0 | 0 |


In [116]:
X_test

| name | age | sex_Male | education_High School | education_Higher Level | education_No studies | education_PhD | education_University |
| --- | --- | --- | --- | --- | --- | --- | --- |
| Thomas | 52 | 0 | 0 | 0 | 0 | 0 | 0 |
| Pedro | 48 | 1 | 0 | 1 | 0 | 0 | 0 |
| Walter | 53 | 0 | 0 | 0 | 0 | 1 | 0 |
| Pamela | 43 | 1 | 0 | 0 | 0 | 0 | 1 |
| Ernest | 43 | 1 | 0 | 1 | 0 | 0 | 0 |
| ... | ... | ... | ... | ... | ... | ... | ... |
| Christine | 28 | 1 | 0 | 0 | 0 | 1 | 0 |
| Leonard | 35 | 0 | 0 | 1 | 0 | 0 | 0 |
| Juanita | 23 | 1 | 1 | 0 | 0 | 0 | 0 |
| William | 38 | 1 | 0 | 0 | 0 | 0 | 1 |


In [117]:
y_train

| name | internet_usage |
| --- | --- |
| Eileen | 0 |
| Lucinda | 1 |
| Ivan | 1 |
| Gilda | 1 |
| Stanley | 0 |
| ... | ... |
| Raymond | 1 |
| Antonio | 1 |
| Alma | 0 |
| Corey | 0 |


In [118]:
y_test

| name | internet_usage |
| --- | --- |
| Thomas | 0 |
| Pedro | 1 |
| Walter | 1 |
| Pamela | 1 |
| Ernest | 1 |
| ... | ... |
| Christine | 1 |
| Leonard | 1 |
| Juanita | 1 |
| William | 1 |


### `DecisionTreeClassifier()` on Train Data

In [119]:
model_d.fit(X_train,y_train)

In [120]:
model_d.predict(X_train)

array([0, 1, 1, ..., 0, 0, 1], dtype=int64)

In [121]:
model_d.score(X_train,y_train)

0.8562281722933643

### `SVC()` on Train Data

In [122]:
model_svc.fit(X_train,y_train)

  y = column_or_1d(y, warn=True)


In [123]:
model_svc.predict(X_train)


array([0, 1, 1, ..., 0, 0, 1], dtype=int64)

In [124]:
model_svc.score(X_train,y_train)

0.7887077997671711

### `LogisticRegression()` on Train Data

In [126]:
model_Lr.fit(X_train,y_train)

  y = column_or_1d(y, warn=True)


In [128]:
model_Lr.predict(X_train)

array([0, 1, 1, ..., 0, 0, 0], dtype=int64)

In [129]:
model_Lr.score(X_train,y_train)

0.8253783469150174

### Score the Models on Test Data

In [131]:
model_d.score(X_test,y_test)

0.8032564450474898

In [132]:
model_svc.score(X_test, y_test)


0.7788331071913162

In [133]:
model_Lr.score(X_test,y_test)

0.8548168249660787

## [ ] Optimize All Models & Compare Again

### Make a Procedure Sample for `DecisionTreeClassifier()`

In [149]:
model_d=DecisionTreeClassifier()
model_d.fit(X_train,y_train)
model_d.score(X_test,y_test)

0.8032564450474898

### Automate the Procedure into a `function()`

**Code Thinking**

1. Think of the function's `result`
2. Store that `object` to a variable
3. `return` the `result` at the end
4. **Indent the body** of the function to the right
5. `def`ine the `function():`
6. Think of what's gonna change when you execute the function with `different models`
7. Locate the **`variable` that you will change**
8. Turn it into the `parameter` of the `function()`

In [166]:
def models_accuracy(models):
    models.fit(X_train, y_train)
    results = models.score(X_test, y_test)
    return results

In [167]:
models_accuracy(model_Lr)

  y = column_or_1d(y, warn=True)


0.8548168249660787

## Calculate Models' Accuracies

### `DecisionTreeClassifier()` Accuracy

In [168]:
models_accuracy(model_d)

0.8046132971506106
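
The Decision Tree's test score here (0.8046) differs slightly from the one obtained a few cells earlier (0.8033), although the code is the same. One plausible explanation is that `DecisionTreeClassifier` breaks ties between equally good splits at random unless `random_state` is set. A minimal sketch on made-up data, where the seed value 0 and the helper `tree_test_score` are arbitrary choices for illustration:

```python
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

# Made-up data with few distinct values, so many candidate splits are tied
rng = np.random.default_rng(0)
X = rng.integers(0, 3, size=(500, 6))
y = rng.integers(0, 2, size=500)

X_tr, X_te, y_tr, y_te = train_test_split(
    X, y, test_size=0.30, random_state=42)

def tree_test_score(seed):
    """Fit a tree with a fixed seed and return its accuracy on test data."""
    model = DecisionTreeClassifier(random_state=seed)
    model.fit(X_tr, y_tr)
    return model.score(X_te, y_te)

# With the same seed, repeated runs give the same score
score_a = tree_test_score(seed=0)
score_b = tree_test_score(seed=0)
print(score_a == score_b)
```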

In [169]:
dic_accuracy_test ={}
dic_accuracy_test['Decision tree'] =models_accuracy(model_d)
dic_accuracy_test

{'Decision tree': 0.8046132971506106}

### `SVC()` Accuracy

In [170]:
models_accuracy(model_svc)

  y = column_or_1d(y, warn=True)


0.7788331071913162

In [171]:
dic_accuracy_test['svc']=models_accuracy(model_svc)
dic_accuracy_test

  y = column_or_1d(y, warn=True)


{'Decision tree': 0.8046132971506106, 'svc': 0.7788331071913162}

### `LogisticRegression()` Accuracy

In [172]:
models_accuracy(model_Lr)

  y = column_or_1d(y, warn=True)


0.8548168249660787

In [173]:
dic_accuracy_test['logisticRegression']= models_accuracy(model_Lr)
dic_accuracy_test

  y = column_or_1d(y, warn=True)


{'Decision tree': 0.8046132971506106,
 'svc': 0.7788331071913162,
 'logisticRegression': 0.8548168249660787}

## [ ] Which is the Best Model with `train_test_split()`?

In [177]:
sr_accuracy_test =pd.Series(dic_accuracy_test).sort_values(ascending=False)
sr_accuracy_test

logisticRegression    0.854817
Decision tree         0.804613
svc                   0.778833
dtype: float64

In [178]:
# Same function as before, now scoring on the train data for comparison
def models_accuracy(models):
    models.fit(X_train, y_train)
    results = models.score(X_train, y_train)
    return results

In [179]:
dic_accuracy_train ={}
dic_accuracy_train['Decision tree'] =models_accuracy(model_d)
dic_accuracy_train

{'Decision tree': 0.8562281722933643}

In [181]:

dic_accuracy_train['svc'] =models_accuracy(model_svc)
dic_accuracy_train

  y = column_or_1d(y, warn=True)


{'Decision tree': 0.8562281722933643, 'svc': 0.7887077997671711}

In [182]:

dic_accuracy_train['logisticRegression'] = models_accuracy(model_Lr)
dic_accuracy_train

  y = column_or_1d(y, warn=True)


{'Decision tree': 0.8562281722933643,
 'svc': 0.7887077997671711,
 'logisticRegression': 0.8253783469150174}

In [186]:
sr_accuracy = pd.Series(dic_accuracy_train).sort_values(ascending=False)
sr_accuracy

Decision tree         0.856228
logisticRegression    0.825378
svc                   0.788708
dtype: float64

In [188]:
df_accuracy = pd.DataFrame({'TRAIN DATA': sr_accuracy,
                            'TEST DATA': sr_accuracy_test})
df_accuracy.style.format('{:.2f}').background_gradient()

| | TRAIN DATA | TEST DATA |
| --- | --- | --- |
| Decision tree | 0.86 | 0.80 |
| logisticRegression | 0.83 | 0.85 |
| svc | 0.79 | 0.78 |
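
The train and test columns can also be produced in a single pass, which avoids keeping two functions with the same name and two dictionaries in sync. The sketch below uses made-up data and a hypothetical helper `train_and_test_accuracy`; it illustrates the idea and is not the code that produced the table above:

```python
import numpy as np
import pandas as pd
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.svm import SVC
from sklearn.tree import DecisionTreeClassifier

# Made-up data: the target depends on one feature plus noise
rng = np.random.default_rng(0)
X = rng.normal(size=(300, 4))
y = (X[:, 0] + rng.normal(scale=1.0, size=300) > 0).astype(int)
X_tr, X_te, y_tr, y_te = train_test_split(
    X, y, test_size=0.30, random_state=42)

def train_and_test_accuracy(model):
    """Fit on the train data and return the accuracy on both splits."""
    model.fit(X_tr, y_tr)
    return {'TRAIN DATA': model.score(X_tr, y_tr),
            'TEST DATA': model.score(X_te, y_te)}

models = {
    'Decision tree': DecisionTreeClassifier(random_state=0),
    'svc': SVC(),
    'logisticRegression': LogisticRegression(),
}

# One row per model, one column per data split
df_accuracy_demo = pd.DataFrame(
    {name: train_and_test_accuracy(model) for name, model in models.items()}
).T[['TRAIN DATA', 'TEST DATA']]
print(df_accuracy_demo.round(2))
```

A large gap between the two columns for a model is the sign of overfitting that this tutorial is about.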
