![](../src/logo.svg)

**© Jesús López**

Ask him any questions on **[Twitter](https://twitter.com/jsulopz)** or **[LinkedIn](https://linkedin.com/in/jsulopz)**

# 04 | Overfitting & Hyperparameter Tuning with Cross Validation

## Chapter Importance

We have already covered:

1. Regression Models
2. Classification Models
3. Train Test Split for Model Selection

In short, we have computed models to predict numerical and categorical variables with Regression and Classification models, respectively.

However, computing a single model is not enough: we need to compare different models and choose the one whose predictions are closest to reality.

Nevertheless, we cannot evaluate a model on the same data we used to `.fit()` (train) its mathematical equation. We need to separate the data into train and test sets: the former to train the model, the latter to evaluate it.
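The idea above can be sketched as follows. This is a minimal illustration, assuming synthetic data from `make_classification` as a stand-in (it is not the chapter's dataset) and an arbitrary 70/30 split; the variable names are hypothetical.

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

# Synthetic stand-in data (an assumption, not the chapter's dataset)
X, y = make_classification(n_samples=1000, n_features=10, random_state=42)

# Hold out 30% of the rows; the model never sees them during training
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, random_state=42
)

model = DecisionTreeClassifier(random_state=42)
model.fit(X_train, y_train)

# For classifiers, .score() returns the accuracy
train_accuracy = model.score(X_train, y_train)
test_accuracy = model.score(X_test, y_test)
print(f"train accuracy: {train_accuracy:.3f}")
print(f"test accuracy:  {test_accuracy:.3f}")
```

The accuracy on the train set tends to be optimistic because the model has already seen those rows; the accuracy on the test set is the one to use when comparing models.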

Now we add an extra layer of complexity, because we can improve a model (an algorithm) by configuring its hyperparameters. This chapter is all about **computing different combinations of a single model's hyperparameters** to find the best one.
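As a rough sketch of what "different combinations" means, the snippet below fits the same algorithm with several values of one hyperparameter and records the accuracy on both sets. The synthetic data, the choice of `max_depth`, and the candidate values are assumptions made for illustration only.

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

# Synthetic stand-in data with some label noise (an assumption)
X, y = make_classification(
    n_samples=1000, n_features=10, flip_y=0.1, random_state=0
)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, random_state=0
)

results = {}
# max_depth=None lets the tree grow until every leaf is pure
for max_depth in [2, 5, 10, None]:
    model = DecisionTreeClassifier(max_depth=max_depth, random_state=0)
    model.fit(X_train, y_train)
    results[max_depth] = (
        model.score(X_train, y_train),  # accuracy on train data
        model.score(X_test, y_test),    # accuracy on test data
    )

for max_depth, (train_acc, test_acc) in results.items():
    print(f"max_depth={max_depth}: train={train_acc:.3f}, test={test_acc:.3f}")
```

A large gap between the train and test accuracy is the usual symptom of overfitting: the unrestricted tree memorizes the training rows, noise included, which does not necessarily help on unseen data.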

## [ ] Load the [Data](https://www.kaggle.com/datasets/uciml/default-of-credit-card-clients-dataset)

- The goal of this dataset is
    - To predict whether the **bank's customers** (rows) will `default` next month
    - Based on their **socio-demographic characteristics** (columns)

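```python
import pandas as pd

# Show every column when displaying a DataFrame
pd.set_option("display.max_columns", None)

url = 'https://archive.ics.uci.edu/ml/machine-learning-databases/00350/default%20of%20credit%20card%20clients.xls'

# header=1: take the column names from the second row of the sheet
# index_col=0: use the first column (ID) as the index
df_credit = pd.read_excel(io=url, header=1, index_col=0)
df_credit.sample(10)
```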

| ID | LIMIT_BAL | SEX | EDUCATION | MARRIAGE | AGE | PAY_0 | PAY_2 | PAY_3 | PAY_4 | PAY_5 | PAY_6 | BILL_AMT1 | BILL_AMT2 | BILL_AMT3 | BILL_AMT4 | BILL_AMT5 | BILL_AMT6 | PAY_AMT1 | PAY_AMT2 | PAY_AMT3 | PAY_AMT4 | PAY_AMT5 | PAY_AMT6 | default payment next month |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 29660 | 50000 | 1 | 2 | 1 | 37 | 0 | 0 | 0 | 0 | 0 | 0 | 44270 | 42398 | 44346 | 44098 | 48698 | 26955 | 5000 | 3000 | 5000 | 5000 | 5000 | 6000 | 0 |
| 8785 | 50000 | 2 | 1 | 2 | 25 | 0 | 0 | 0 | 0 | 0 | 0 | 34476 | 25857 | 26621 | 27178 | 28038 | 28871 | 2000 | 1500 | 1000 | 1300 | 1300 | 1011 | 0 |
| 17808 | 100000 | 1 | 2 | 2 | 23 | -1 | -1 | -1 | -1 | -1 | -1 | 291 | 291 | 291 | 291 | 291 | 291 | 291 | 291 | 291 | 291 | 291 | 291 | 0 |
| 3269 | 80000 | 2 | 2 | 1 | 34 | 0 | 0 | 0 | 0 | -1 | 0 | 78688 | 80539 | 76681 | 34197 | 27398 | 28646 | 4000 | 2333 | 3032 | 28298 | 2000 | 2000 | 0 |
| 2365 | 210000 | 2 | 2 | 1 | 44 | 0 | 0 | 0 | 0 | 0 | 0 | 73487 | 75456 | 86171 | 87895 | 89022 | 90868 | 3748 | 13000 | 3081 | 3217 | 3312 | 3151 | 0 |
| 6399 | 50000 | 1 | 1 | 2 | 24 | -1 | -1 | -1 | -1 | -1 | -1 | 236 | 4324 | 1861 | 0 | 1780 | 2581 | 4324 | 1861 | 0 | 1780 | 2581 | 2140 | 0 |
| 11288 | 360000 | 1 | 2 | 1 | 32 | 0 | 0 | 0 | -1 | -1 | 0 | 7230 | 9835 | 8217 | 2963 | 15437 | 15722 | 5017 | 2041 | 2980 | 16038 | 8032 | 10095 | 0 |
| 865 | 200000 | 1 | 1 | 2 | 32 | 0 | 0 | 0 | 0 | 0 | 0 | 43515 | 44557 | 45574 | 49022 | 47459 | 49666 | 1739 | 1756 | 2662 | 1726 | 3000 | 2500 | 0 |
| 29893 | 180000 | 1 | 1 | 1 | 37 | 1 | -1 | -1 | -1 | -1 | -1 | 1660 | 2701 | 1832 | 919 | 500 | 7068 | 2701 | 1832 | 919 | 500 | 7068 | 9268 | 0 |
| 11586 | 90000 | 2 | 2 | 1 | 47 | 2 | 2 | 2 | 0 | 0 | 2 | 61066 | 62999 | 61388 | 64376 | 67347 | 74960 | 3500 | 0 | 4000 | 4000 | 8900 | 0 | 1 |


## Preprocess the Data

### Missing Data

### Dummy Variables

## Feature Selection

## Train Test Split

## [ ] `DecisionTreeClassifier()` with Default Hyperparameters

### Accuracy

#### In `train` data

#### In `test` data

### Model Visualization

## `DecisionTreeClassifier()` with Custom Hyperparameters

### 1st Configuration

#### Accuracy

##### In `train` data

##### In `test` data

#### Model Visualization

![](https://www.googleapis.com/download/storage/v1/b/kaggle-forum-message-attachments/o/inbox%2F4756451%2F5724f9841b58cbd7838a851ac6df659b%2Frpqa6.jpg?generation=1608831884903054&alt=media)

### [ ] 2nd Configuration

#### Accuracy

##### In `train` data

##### In `test` data

#### Model Visualization

### 3rd Configuration

### 4th Configuration

### 5th Configuration

## [ ] `GridSearchCV()` to find Best Hyperparameters

<img src="src/grid_search_cross_validation.png" style="margin-top: 100px"/>
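A minimal sketch of how a grid search with cross validation could look. The synthetic data, the hyperparameter names chosen for the grid, and the candidate values are assumptions for illustration, not the configuration used in this chapter.

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import GridSearchCV, train_test_split
from sklearn.tree import DecisionTreeClassifier

# Synthetic stand-in data (an assumption, not the chapter's dataset)
X, y = make_classification(n_samples=500, n_features=10, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, random_state=0
)

# Every combination of these values is tried: 4 * 3 * 2 = 24 candidates
param_grid = {
    "max_depth": [2, 4, 6, None],
    "min_samples_leaf": [1, 10, 50],
    "criterion": ["gini", "entropy"],
}

grid = GridSearchCV(
    estimator=DecisionTreeClassifier(random_state=0),
    param_grid=param_grid,
    cv=5,                # 5-fold cross validation on the train set
    scoring="accuracy",
)
grid.fit(X_train, y_train)

print("best hyperparameters:", grid.best_params_)
print("mean cross-validated accuracy:", grid.best_score_)
# After fitting, the best configuration is refit on the whole train set,
# so the search object can be scored on the held-out test set directly
print("test accuracy:", grid.score(X_test, y_test))
```

Each candidate is scored by averaging its accuracy over the 5 validation folds, so the test set stays untouched until the final check.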

## [ ] Other Models

Now let's try to find the best hyperparameter configuration of other models, which don't have the same hyperparameters as the Decision Tree because their algorithms and mathematical equations are different.
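To illustrate that each algorithm exposes its own hyperparameters, the sketch below runs one grid search per model. The synthetic data and the grids (`C` and `kernel` for `SVC()`, `n_neighbors` and `weights` for `KNeighborsClassifier()`) are assumptions picked for illustration.

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import GridSearchCV
from sklearn.neighbors import KNeighborsClassifier
from sklearn.svm import SVC

# Synthetic stand-in data (an assumption, not the chapter's dataset)
X, y = make_classification(n_samples=300, n_features=8, random_state=0)

# Each model gets a grid built from its own hyperparameters
searches = {
    "SVC": GridSearchCV(
        estimator=SVC(),
        param_grid={"C": [0.1, 1, 10], "kernel": ["linear", "rbf"]},
        cv=5,
    ),
    "KNeighborsClassifier": GridSearchCV(
        estimator=KNeighborsClassifier(),
        param_grid={"n_neighbors": [3, 5, 11], "weights": ["uniform", "distance"]},
        cv=5,
    ),
}

for name, search in searches.items():
    search.fit(X, y)
    print(name, search.best_params_, round(search.best_score_, 3))
```

Comparing `best_score_` across the fitted searches is one way to pick a model once each has been tuned.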

### Support Vector Machines `SVC()`

<iframe width="560" height="315" src="https://www.youtube.com/embed/efR1C6CvhmE" title="YouTube video player" frameborder="0" allow="accelerometer; autoplay; clipboard-write; encrypted-media; gyroscope; picture-in-picture" allowfullscreen></iframe>

### K Nearest Neighbors `KNeighborsClassifier()`

## [ ] Best Model with Best Hyperparameters

<a rel="license" href="http://creativecommons.org/licenses/by-nc-nd/4.0/"><img alt="Creative Commons License" style="border-width:0" src="https://i.creativecommons.org/l/by-nc-nd/4.0/88x31.png" /></a><br />This work is licensed under a <a rel="license" href="http://creativecommons.org/licenses/by-nc-nd/4.0/">Creative Commons Attribution-NonCommercial-NoDerivatives 4.0 International License</a>.