![](../src/logo.svg)

**© Jesús López**

Ask him any questions on **[Twitter](https://twitter.com/jsulopz)** or **[LinkedIn](https://linkedin.com/in/jsulopz)**

# 04 | Hyperparameter Tuning with Cross Validation

# The Challenge

<div class="alert alert-danger">
    Build different Models and choose the best one:
</div>

In [1]:
import pandas as pd
pd.set_option("display.max_columns", None)

url = 'https://archive.ics.uci.edu/ml/machine-learning-databases/00350/default%20of%20credit%20card%20clients.xls'
df_default = pd.read_excel(io=url, header=1, index_col=0)
df_default.sample(5)

ID,LIMIT_BAL,SEX,EDUCATION,MARRIAGE,AGE,PAY_0,PAY_2,PAY_3,PAY_4,PAY_5,PAY_6,BILL_AMT1,BILL_AMT2,BILL_AMT3,BILL_AMT4,BILL_AMT5,BILL_AMT6,PAY_AMT1,PAY_AMT2,PAY_AMT3,PAY_AMT4,PAY_AMT5,PAY_AMT6,default payment next month
28223,180000,2,1,2,30,-1,-1,-1,-1,-2,-1,1650,196,2500,0,0,3650,196,2500,0,0,3650,3859,0
323,50000,1,2,2,24,-1,0,0,0,0,0,1399,2441,3865,7476,8384,8085,1070,1500,3666,2477,1306,1000,0
6579,500000,2,1,1,41,-1,-1,-2,-2,-1,-1,680,0,0,0,74731,0,0,0,0,74731,0,0,0
20199,450000,2,2,2,39,-1,0,0,0,0,-1,217126,43647,39827,43978,389,389,10329,3001,5217,1,390,390,0
6894,370000,2,2,2,29,4,3,2,0,0,0,390509,382898,365461,304436,311426,275628,0,0,10019,11000,10000,10000,1


# The Covered Solution

<div class="alert alert-success">
    and get the following comparisons ↓
</div>

In [55]:
?? #! read the full story to find out the solution

dt    0.859878
lr    0.833401
sv    0.783707
dtype: float64

In [99]:
?? #! read the full story to find out the solution

lr    0.854817
dt    0.804613
sv    0.778833
dtype: float64

# What will we learn?

- How can we improve the accuracy of the same model by changing its hyperparameters?
- How can we know if the model is overfitting the training data?
- How can randomisation improve the statistical estimation of the unknown?
- Why do some models fail to converge when fitting the mathematical equation?
- Why do we need to scale the data for models that compare distances between explanatory variables?
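
One common way to spot overfitting is to compare the accuracy on the train data with the accuracy on the test data. The sketch below uses a synthetic dataset from `make_classification` as a stand-in for the credit card data (an assumption made so the example runs offline), and the helper `train_test_accuracy` is a hypothetical name introduced only for illustration.

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

# Synthetic, noisy stand-in for the real data (assumption for illustration)
X_demo, y_demo = make_classification(
    n_samples=1000, n_features=10, n_informative=4,
    flip_y=0.2, random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(
    X_demo, y_demo, test_size=0.33, random_state=42)

def train_test_accuracy(model):
    """Fit on the train data and return (train accuracy, test accuracy)."""
    model.fit(X_tr, y_tr)
    return model.score(X_tr, y_tr), model.score(X_te, y_te)

# An unrestricted tree can memorise the train data
acc_default = train_test_accuracy(DecisionTreeClassifier(random_state=0))
# A shallow tree is constrained by the max_depth hyperparameter
acc_shallow = train_test_accuracy(
    DecisionTreeClassifier(max_depth=3, random_state=0))

print('default tree (train, test):', acc_default)
print('max_depth=3  (train, test):', acc_shallow)
```

A large gap between the two numbers of a pair suggests that the model has learnt noise from the train data rather than a pattern that generalises.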

# Which concepts will we use?

- Cross Validation
- Hyperparameters of a Model
- MinMaxScaler
- Grid
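
As a rough sketch of the first concept, Cross Validation splits the data into several folds and scores the model once per fold, so the estimate does not depend on a single split. The example assumes a synthetic dataset and an arbitrary `max_depth=3`; neither comes from the notebook.

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import cross_val_score
from sklearn.tree import DecisionTreeClassifier

# Synthetic data used only for illustration
X_demo, y_demo = make_classification(
    n_samples=500, n_features=8, random_state=0)

model = DecisionTreeClassifier(max_depth=3, random_state=0)

# cv=5: train on 4 folds, score on the remaining one, repeated 5 times
scores = cross_val_score(model, X_demo, y_demo, cv=5)

print('accuracy per fold:', scores)
print('mean accuracy    :', scores.mean())
```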

# Requirements?

- Train Test Split for Model Selection

# The starting point

In [None]:
import pandas as pd
pd.set_option("display.max_columns", None)

url = 'https://archive.ics.uci.edu/ml/machine-learning-databases/00350/default%20of%20credit%20card%20clients.xls'
df_default = pd.read_excel(io=url, header=1, index_col=0)
df_default.sample(5)

# Syllabus for the [Notebook](01script_functions.ipynb)

1. Load the Data
2. Preprocess the Data
3. Feature Selection
4. Train Test Split
5. DecisionTreeClassifier() with Default Hyperparameters
    1. Accuracy
        1. In train data
        2. In test data
    2. Model Visualization
6. DecisionTreeClassifier() with Custom Hyperparameters
    1. 1st Configuration
        1. Accuracy
            1. In train data
            2. In test data
        2. Model Visualization
    2. 2nd Configuration
        1. Accuracy
            1. In train data
            2. In test data
        2. Model Visualization
    3. 3rd Configuration
        1. Accuracy
            1. In train data
            2. In test data
        2. Model Visualization
    4. 4th Configuration
        1. Accuracy
            1. In train data
            2. In test data
        2. Model Visualization
    5. 5th Configuration
        1. Accuracy
            1. In train data
            2. In test data
        2. Model Visualization
7. GridSearchCV() to find Best Hyperparameters
8. Other Models
    1. Support Vector Machines SVC()
    2. KNeighborsClassifier()
9. Best Model with Best Hyperparameters

# The Uncovered Solution

In [11]:
import pandas as pd
pd.set_option("display.max_columns", None)

url = 'https://archive.ics.uci.edu/ml/machine-learning-databases/00350/default%20of%20credit%20card%20clients.xls'
df_default = pd.read_excel(io=url, header=1, index_col=0)

df_default = pd.get_dummies(df_default, drop_first=True)

y = df_default.iloc[:, -1]
X = df_default.iloc[:, :-1]

In [13]:
from sklearn.model_selection import train_test_split

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.33, random_state=42)

In [14]:
from sklearn.tree import DecisionTreeClassifier

model_dt = DecisionTreeClassifier()

In [15]:
from sklearn.model_selection import GridSearchCV

param_grid = {
    'max_depth': [None, 2, 3, 4, 5, 10],
    'min_samples_leaf': [1, 50, 100, 200, 400, 800, 1600],
    'criterion': ['gini', 'entropy']
}

cv = GridSearchCV(
    estimator=model_dt, param_grid=param_grid,
    cv=5, verbose=1
)
cv.fit(X_train, y_train)

Fitting 5 folds for each of 84 candidates, totalling 420 fits


GridSearchCV(cv=5, estimator=DecisionTreeClassifier(),
             param_grid={'criterion': ['gini', 'entropy'],
                         'max_depth': [None, 2, 3, 4, 5, 10],
                         'min_samples_leaf': [1, 50, 100, 200, 400, 800, 1600]},
             verbose=1)
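
The message "84 candidates, totalling 420 fits" follows from the size of the grid: every combination of the listed values is one candidate, and each candidate is fitted once per fold. A minimal check with `ParameterGrid`, reusing the same grid as above:

```python
from sklearn.model_selection import ParameterGrid

param_grid = {
    'max_depth': [None, 2, 3, 4, 5, 10],
    'min_samples_leaf': [1, 50, 100, 200, 400, 800, 1600],
    'criterion': ['gini', 'entropy']
}

# 6 depths x 7 leaf sizes x 2 criteria
n_candidates = len(ParameterGrid(param_grid))
# Each candidate is fitted once per fold (cv=5)
n_fits = n_candidates * 5

print(n_candidates, n_fits)  # 84 420
```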

In [77]:
cv.best_estimator_

DecisionTreeClassifier(criterion='entropy', max_depth=5, min_samples_leaf=100)

In [78]:
cv.score(X_test, y_test)

0.8186868686868687
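
Beyond `best_estimator_`, a fitted search also keeps the mean cross-validated score of every candidate in `cv_results_`. The sketch below runs on a synthetic dataset with a smaller grid (both are assumptions, chosen to keep the run short) and shows how the table could be inspected; `best_score_` comes from the folds of the train data, whereas `score(X_te, y_te)` uses data the search never saw.

```python
import pandas as pd
from sklearn.datasets import make_classification
from sklearn.model_selection import GridSearchCV, train_test_split
from sklearn.tree import DecisionTreeClassifier

# Synthetic data used only for illustration
X_demo, y_demo = make_classification(
    n_samples=600, n_features=8, random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(
    X_demo, y_demo, test_size=0.33, random_state=42)

search = GridSearchCV(
    estimator=DecisionTreeClassifier(random_state=0),
    param_grid={'max_depth': [2, 5, None], 'min_samples_leaf': [1, 20]},
    cv=5
)
search.fit(X_tr, y_tr)

# One row per candidate, best candidate first
df_cv = (
    pd.DataFrame(search.cv_results_)
    [['params', 'mean_test_score', 'rank_test_score']]
    .sort_values('rank_test_score')
)

print(df_cv)
print('best params       :', search.best_params_)
print('mean CV accuracy  :', search.best_score_)
print('test accuracy     :', search.score(X_te, y_te))
```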

In [16]:
from sklearn.preprocessing import MinMaxScaler

scaler = MinMaxScaler()
X_norm = pd.DataFrame(scaler.fit_transform(X))
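
With its default range of 0 to 1, `MinMaxScaler` rescales each column as `(x - min) / (max - min)`. The small array below is made up for illustration (two columns on very different scales) and checks the formula against the scaler:

```python
import numpy as np
from sklearn.preprocessing import MinMaxScaler

# Made-up values: one column in the tens of thousands, one in the tens
X_small = np.array([
    [20000.0, 21.0],
    [50000.0, 35.0],
    [500000.0, 60.0],
])

scaled = MinMaxScaler().fit_transform(X_small)

# Same computation by hand, column by column
col_min = X_small.min(axis=0)
col_max = X_small.max(axis=0)
manual = (X_small - col_min) / (col_max - col_min)

print(scaled)
print(np.allclose(scaled, manual))
```

After scaling, both columns span the same range, so neither dominates a distance computed between two rows.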

In [17]:
X_train, X_test, y_train, y_test = train_test_split(
    X_norm, y, test_size=0.33, random_state=42)
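
Note that the scaler above is fitted on the whole `X` before the split, so the minimum and maximum of the test rows take part in the scaling. One possible alternative, sketched here on synthetic data with hypothetical step names `'scaler'` and `'model'`, is to put the scaler inside a `Pipeline` so that it is refitted on the train part of each fold only:

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import GridSearchCV, train_test_split
from sklearn.neighbors import KNeighborsClassifier
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import MinMaxScaler

# Synthetic data used only for illustration
X_demo, y_demo = make_classification(
    n_samples=400, n_features=6, random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(
    X_demo, y_demo, test_size=0.33, random_state=42)

# Step names are arbitrary labels; they prefix the grid keys below
pipe = Pipeline([
    ('scaler', MinMaxScaler()),
    ('model', KNeighborsClassifier()),
])

# '<step name>__<hyperparameter>' targets a hyperparameter of that step
search = GridSearchCV(
    estimator=pipe,
    param_grid={'model__n_neighbors': [3, 5, 10]},
    cv=5
)
search.fit(X_tr, y_tr)

test_score = search.score(X_te, y_te)
print(search.best_params_, test_score)
```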

In [None]:
from sklearn.svm import SVC

sv = SVC()
param_grid = {
    'C': [0.1, 1, 10],
    # 'kernel' is not a valid SVC kernel name; 'rbf' (the SVC default) is assumed here
    'kernel': ['linear', 'rbf']
}

cv_svc = GridSearchCV(estimator=sv, param_grid=param_grid, verbose=1)
cv_svc.fit(X_train, y_train)

In [21]:
cv_svc.best_estimator_

SVC(C=1, kernel='linear')

In [22]:
cv_svc.score(X_test, y_test)

0.8105050505050505

In [None]:
from sklearn.neighbors import KNeighborsClassifier

kn = KNeighborsClassifier()
param_grid = {
    'leaf_size': [10, 20, 30, 50],
    'metric': ['minkowski', 'euclidean', 'manhattan'],
    'n_neighbors': [3, 5, 10, 20]
}

cv_kn = GridSearchCV(estimator=kn, param_grid=param_grid, verbose=2)
cv_kn.fit(X_train, y_train)

Fitting 5 folds for each of 48 candidates, totalling 240 fits
[CV] END ......leaf_size=10, metric=minkowski, n_neighbors=3; total time=   1.5s
[CV] END ......leaf_size=10, metric=minkowski, n_neighbors=3; total time=   1.1s
[CV] END ......leaf_size=10, metric=minkowski, n_neighbors=3; total time=   1.1s
[CV] END ......leaf_size=10, metric=minkowski, n_neighbors=3; total time=   1.0s
[CV] END ......leaf_size=10, metric=minkowski, n_neighbors=3; total time=   1.1s
[CV] END ......leaf_size=10, metric=minkowski, n_neighbors=5; total time=   1.2s
[CV] END ......leaf_size=10, metric=minkowski, n_neighbors=5; total time=   1.2s
[CV] END ......leaf_size=10, metric=minkowski, n_neighbors=5; total time=   1.2s
[CV] END ......leaf_size=10, metric=minkowski, n_neighbors=5; total time=   1.2s
[CV] END ......leaf_size=10, metric=minkowski, n_neighbors=5; total time=   1.2s
[CV] END .....leaf_size=10, metric=minkowski, n_neighbors=10; total time=   1.3s
...

In [None]:
cv_kn.best_estimator_

In [None]:
cv_kn.score(X_test, y_test)

In [126]:
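# Collect the best estimator and the test accuracy of each grid search.
# Caveat: run top to bottom, X_test is now the MinMax-scaled split, while `cv`
# (the decision tree search) was fitted on the unscaled split, which may explain
# why its score here differs from the 0.8187 reported earlier.
dic_results = {
    'model': [
        cv.best_estimator_,
        cv_kn.best_estimator_,
        cv_svc.best_estimator_
    ],
    'score': [
        cv.score(X_test, y_test),
        cv_kn.score(X_test, y_test),
        cv_svc.score(X_test, y_test)
    ]
}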



In [127]:
pd.DataFrame(dic_results)

,model,score
0,"DecisionTreeClassifier(criterion='entropy', ma...",0.78202
1,"KNeighborsClassifier(leaf_size=10, n_neighbors...",0.807677
2,"SVC(C=1, kernel='linear')",0.810505
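
To choose the best model at a glance, the scores can be placed in a `Series` and sorted, which mirrors the layout of the comparison in "The Covered Solution". The values below are copied from the table above; the short labels `dt`, `kn` and `sv` are assumed names for the three searches.

```python
import pandas as pd

# Test accuracies taken from the table above
scores = pd.Series({
    'dt': 0.78202,
    'kn': 0.807677,
    'sv': 0.810505,
})

# Highest accuracy first
ranking = scores.sort_values(ascending=False)
best_model = ranking.index[0]

print(ranking)
print('best model:', best_model)
```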


<a rel="license" href="http://creativecommons.org/licenses/by-nc-nd/4.0/"><img alt="Creative Commons License" style="border-width:0" src="https://i.creativecommons.org/l/by-nc-nd/4.0/88x31.png" /></a><br />This work is licensed under a <a rel="license" href="http://creativecommons.org/licenses/by-nc-nd/4.0/">Creative Commons Attribution-NonCommercial-NoDerivatives 4.0 International License</a>.