# Tuning Hyperparameters of a Machine Learning Model


Data Professor YouTube channel, http://youtube.com/dataprofessor

# 0. Import libraries

In [10]:
from sklearn.datasets import make_classification

import pandas as pd

from sklearn.model_selection import train_test_split

from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import accuracy_score

from sklearn.model_selection import GridSearchCV
import numpy as np



In this Jupyter notebook, we will tune the hyperparameters of a classification model built with the random forest algorithm, using the scikit-learn package in Python.

# 1. Make synthetic dataset
### 1.1. Generate the dataset

In [6]:

X, Y = make_classification(n_samples=200, n_classes=2, n_features=10, n_redundant=0, random_state=1)

The make_classification() function generates a random n-class classification problem. Here it produces 200 samples with 10 features and 2 classes.
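As a quick sanity check on the generated data, the sketch below repeats the same call and counts the samples per class. It relies on the scikit-learn defaults for every argument the notebook does not set (for example `n_informative=2`), so with `n_redundant=0` the remaining features carry only noise.

```python
import numpy as np
from sklearn.datasets import make_classification

# Same call as in the notebook; all other arguments keep their defaults.
X, Y = make_classification(n_samples=200, n_classes=2, n_features=10,
                           n_redundant=0, random_state=1)

# Count how many samples fall into each class label.
labels, counts = np.unique(Y, return_counts=True)
print(X.shape, Y.shape)
print(dict(zip(labels.tolist(), counts.tolist())))
```

Roughly balanced counts are what one would expect here, since no `weights` argument is passed.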

### 1.2. Let's examine the data dimension
We can see that there are 200 rows (samples) and 10 columns (features) for the X variable, and 200 class labels (one per sample) for the Y variable.

In [7]:
X.shape, Y.shape

((200, 10), (200,))

In [8]:
X

array([[-1.51107661,  0.60874908, -0.15323616, ..., -0.86482994,
        -0.20290111, -0.87142207],
       [ 1.44544531,  0.51896937,  0.64515265, ..., -1.04339961,
         0.04854689, -2.62101164],
       [ 0.37167029,  0.51350548, -1.39881282, ...,  0.14225137,
        -1.13283476,  1.85300949],
       ...,
       [-0.95090925, -0.21873346,  1.29354962, ..., -0.04586669,
        -0.97210712, -0.70435033],
       [-0.4466992 ,  0.74488454, -0.9612636 , ...,  0.61223252,
         1.67977906,  0.20437739],
       [ 1.00796648,  1.1253235 ,  0.43499832, ...,  0.44838065,
        -1.75951426,  0.39233491]])

In [9]:

pd.DataFrame(X)

Unnamed: 0,0,1,2,3,4,5,6,7,8,9
0,-1.511077,0.608749,-0.153236,0.507984,-0.324032,-2.432509,1.592056,-0.864830,-0.202901,-0.871422
1,1.445445,0.518969,0.645153,2.038777,-0.396293,1.282142,-2.170249,-1.043400,0.048547,-2.621012
2,0.371670,0.513505,-1.398813,-0.459943,0.644354,0.081768,-1.757065,0.142251,-1.132835,1.853009
3,2.565453,0.145652,1.177052,1.322694,0.194175,-0.641108,0.878631,-0.202694,-1.199798,-0.464115
4,-0.710656,1.050615,0.354602,-1.774596,-0.312230,-0.212373,0.826484,-0.621252,-1.187774,1.131129
...,...,...,...,...,...,...,...,...,...,...
195,-1.098083,-1.277636,0.419595,0.482176,-1.879287,-0.091079,-2.428480,0.032615,1.164204,0.758637
196,0.165211,1.937132,-1.307971,0.074876,-1.786935,1.472396,1.666002,-0.696028,-0.162525,0.976296
197,-0.950909,-0.218733,1.293550,0.590039,-0.679384,-0.438998,-0.188582,-0.045867,-0.972107,-0.704350
198,-0.446699,0.744885,-0.961264,0.494342,-1.494194,-1.458324,2.820244,0.612233,1.679779,0.204377


In [10]:
pd.DataFrame(Y)

Unnamed: 0,0
0,1
1,0
2,0
3,1
4,1
...,...
195,0
196,1
197,0
198,1


# 2. Data split (80/20 ratio)
### 2.1. Data split
A ratio of 80/20 is used for data splitting such that 80% goes to the training subset and 20% to the testing subset.
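The split in this notebook is random, so the exact rows in each subset change from run to run. A minimal sketch of a reproducible variant follows; the `random_state=42` and `stratify=Y` arguments are additions for illustration and are not part of the notebook's own call.

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split

X, Y = make_classification(n_samples=200, n_classes=2, n_features=10,
                           n_redundant=0, random_state=1)

# random_state fixes the shuffle; stratify keeps the class ratio
# similar in both subsets. Both are assumptions added for this sketch.
X_train, X_test, Y_train, Y_test = train_test_split(
    X, Y, test_size=0.2, random_state=42, stratify=Y)

print(X_train.shape, X_test.shape)
```

With 200 samples and `test_size=0.2`, 160 rows go to training and 40 to testing.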

In [11]:
X_train, X_test, Y_train, Y_test = train_test_split(X, Y, test_size=0.2)

### 2.2. Let's examine the data dimension
Here we see that the training set has 160 rows and 10 columns for X_train and 160 class labels for Y_train, while the testing set holds the remaining 40 samples.

In [12]:
X_train.shape, Y_train.shape

((160, 10), (160,))

In [13]:
X_test.shape, Y_test.shape

((40, 10), (40,))

# 3. Building a simple machine learning model using Random Forest
In the following code blocks, we will first build a random forest model. Afterwards, we will explore how to tune the hyperparameters (e.g. n_estimators and max_features) of the random forest algorithm.

We start by assigning the random forest classifier to the rf variable; the necessary libraries were already imported in Section 0.

In [14]:
rf = RandomForestClassifier(max_features=5, n_estimators=100)

Now, we will train the random forest classifier by calling the rf.fit() function on the training data (i.e. X_train and Y_train).

In [15]:
rf.fit(X_train, Y_train)

RandomForestClassifier(max_features=5)

The rf.score() function will be used to calculate the accuracy score of the RF model in predicting the test data (X_test).

In [16]:
rf.score(X_test, Y_test)

0.925

The following 2 code cells also calculate the accuracy score of the RF model in predicting the test data (X_test), but perform it in 2 steps using the rf.predict() and accuracy_score() functions.

In [17]:
Y_pred = rf.predict(X_test)

In [18]:
accuracy_score(Y_test, Y_pred)  # signature is (y_true, y_pred); accuracy is the same either way

0.925

The advantage of using this latter approach is that you have access to the predicted data values.



In [19]:
Y_pred, Y_test

(array([0, 0, 0, 1, 1, 1, 0, 0, 1, 1, 1, 0, 0, 1, 1, 1, 0, 1, 0, 1, 0, 0,
        0, 1, 1, 1, 0, 1, 0, 0, 0, 0, 0, 1, 1, 0, 1, 1, 1, 1]),
 array([0, 0, 0, 1, 1, 1, 0, 0, 1, 1, 1, 0, 0, 1, 1, 1, 0, 1, 0, 1, 0, 0,
        0, 1, 1, 1, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 1, 1, 1, 0]))
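Having the predicted values at hand makes it possible to look beyond a single accuracy number. The sketch below, which is not part of the original notebook, builds a confusion matrix from the predictions; the `random_state=0` arguments are assumptions added only to make the run repeatable.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import accuracy_score, confusion_matrix
from sklearn.model_selection import train_test_split

X, Y = make_classification(n_samples=200, n_classes=2, n_features=10,
                           n_redundant=0, random_state=1)
X_train, X_test, Y_train, Y_test = train_test_split(
    X, Y, test_size=0.2, random_state=0)

rf = RandomForestClassifier(max_features=5, n_estimators=100, random_state=0)
rf.fit(X_train, Y_train)
Y_pred = rf.predict(X_test)

# Rows are the true classes, columns are the predicted classes.
cm = confusion_matrix(Y_test, Y_pred)
print(cm)

# Accuracy is the share of samples that sit on the diagonal.
acc_from_cm = cm.trace() / cm.sum()
print(acc_from_cm, accuracy_score(Y_test, Y_pred))
```

The off-diagonal entries show which class the model confuses for the other, which the accuracy score alone does not reveal.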

# 4. Hyperparameter Tuning
Now we will tune the hyperparameters of the random forest model. The hyperparameters that we will tune are max_features and n_estimators.

Note: Some of the code is adapted from scikit-learn.

Firstly, we need the necessary modules (GridSearchCV and numpy), which were already imported in Section 0.

The GridSearchCV class from scikit-learn will be used to perform the hyperparameter tuning. In addition to fit, score and predict, a GridSearchCV object exposes predict_proba, decision_function, transform and inverse_transform whenever the underlying estimator implements them.
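To illustrate how the search object behaves like a classifier, here is a minimal sketch on a deliberately tiny grid. The names `small_grid` and `search`, the `cv=3` setting and the `random_state=0` arguments are assumptions chosen to keep the example fast and repeatable; they are not the settings used in this notebook.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import GridSearchCV, train_test_split

X, Y = make_classification(n_samples=200, n_classes=2, n_features=10,
                           n_redundant=0, random_state=1)
X_train, X_test, Y_train, Y_test = train_test_split(
    X, Y, test_size=0.2, random_state=0)

# Hypothetical small grid (4 candidates) so the sketch runs quickly.
small_grid = {"max_features": [2, 5], "n_estimators": [10, 50]}
search = GridSearchCV(RandomForestClassifier(random_state=0),
                      param_grid=small_grid, cv=3)
search.fit(X_train, Y_train)

# With the default refit=True, these calls are forwarded to
# search.best_estimator_, the model refitted on all training data.
Y_pred = search.predict(X_test)
Y_proba = search.predict_proba(X_test)
print(Y_pred[:5])
print(Y_proba[:5])

# RandomForestClassifier has no transform() method, so the search
# object does not expose one either.
print(hasattr(search, "transform"))
```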

Secondly, we define variables that are necessary input to the GridSearchCV() function.



In [24]:
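# Candidate values for the two hyperparameters tuned in this notebook
max_features_range = np.arange(1,6,1)
n_estimators_range = np.arange(10,210,10)
param_grid = dict(max_features=max_features_range, n_estimators=n_estimators_range)

rf = RandomForestClassifier()

# Exhaustive search over param_grid with 5-fold cross-validation
grid = GridSearchCV(estimator=rf, param_grid=param_grid, cv=5)

# A second, larger search space over four hyperparameters.
# It is defined here but not fitted in the cells shown in this notebook.
model = RandomForestClassifier()
parameter_space = {'n_estimators': [100, 300, 1000],
                   'max_features': ['sqrt', 0.5, None],
                   'max_depth': [None, 10, 30, 100],
                   'min_samples_leaf': [1, 3, 10]}

# verbose=1 prints progress; n_jobs=-1 uses all available CPU cores
grid_search = GridSearchCV(model,
                           param_grid=parameter_space,
                           verbose=1,
                           n_jobs=-1,
                           cv=5)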


In [22]:
np.arange(1,6,1), np.arange(10,210,10)

(array([1, 2, 3, 4, 5]),
 array([ 10,  20,  30,  40,  50,  60,  70,  80,  90, 100, 110, 120, 130,
        140, 150, 160, 170, 180, 190, 200]))
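The size of the search follows directly from these two ranges. As a rough sketch, scikit-learn's `ParameterGrid` can be used to count the candidate combinations before fitting anything; the variable names `n_candidates` and `n_folds` are introduced here for illustration only.

```python
import numpy as np
from sklearn.model_selection import ParameterGrid

max_features_range = np.arange(1, 6, 1)
n_estimators_range = np.arange(10, 210, 10)
param_grid = dict(max_features=max_features_range,
                  n_estimators=n_estimators_range)

# Every combination of the two ranges is one candidate model.
n_candidates = len(ParameterGrid(param_grid))

# Each candidate is trained once per cross-validation fold (cv=5).
n_folds = 5
print(n_candidates, n_candidates * n_folds)  # 100 candidates, 500 fits
```

This is why the grid search takes noticeably longer than fitting a single model: 5 values of max_features times 20 values of n_estimators gives 100 candidates, each fitted 5 times, plus one final refit on the full training set.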

In [27]:
%%time
grid.fit(X_train,Y_train)

CPU times: user 1min 14s, sys: 368 ms, total: 1min 15s
Wall time: 1min 15s


GridSearchCV(cv=5, estimator=RandomForestClassifier(),
             param_grid={'max_features': array([1, 2, 3, 4, 5]),
                         'n_estimators': array([ 10,  20,  30,  40,  50,  60,  70,  80,  90, 100, 110, 120, 130,
       140, 150, 160, 170, 180, 190, 200])})

In [28]:
print("The best parameters are %s with a score of %0.2f"
      % (grid.best_params_, grid.best_score_))

The best parameters are {'max_features': 3, 'n_estimators': 80} with a score of 0.89
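Note that `best_score_` is the mean cross-validated accuracy on the training data, not the accuracy on the held-out test set. The sketch below shows one way to compare the two; it uses a hypothetical small grid, `cv=3` and fixed `random_state` values purely to stay fast and repeatable.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import GridSearchCV, train_test_split

X, Y = make_classification(n_samples=200, n_classes=2, n_features=10,
                           n_redundant=0, random_state=1)
X_train, X_test, Y_train, Y_test = train_test_split(
    X, Y, test_size=0.2, random_state=0)

# Hypothetical small grid; the notebook's own grid has 100 candidates.
small_grid = {"max_features": [2, 5], "n_estimators": [10, 50]}
search = GridSearchCV(RandomForestClassifier(random_state=0),
                      param_grid=small_grid, cv=3)
search.fit(X_train, Y_train)

# Mean cross-validated accuracy of the winning candidate.
cv_best = search.best_score_

# Accuracy of the refitted best model on data it has never seen.
test_acc = search.score(X_test, Y_test)

print(search.best_params_, cv_best, test_acc)
```

The two numbers usually differ somewhat, and the test-set accuracy is the fairer estimate of how the tuned model generalizes.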


# 5. Dataframe of Grid search parameters and their Accuracy scores
Next, we will collect the grid search parameters and their resulting mean cross-validated accuracy scores in a dataframe.

In [30]:
grid_results = pd.concat(
    [pd.DataFrame(grid.cv_results_["params"]),
     pd.DataFrame(grid.cv_results_["mean_test_score"], columns=["Accuracy"])],
    axis=1)
grid_results

Unnamed: 0,max_features,n_estimators,Accuracy
0,1,10,0.73750
1,1,20,0.79375
2,1,30,0.82500
3,1,40,0.83125
4,1,50,0.83125
...,...,...,...
95,5,160,0.87500
96,5,170,0.88125
97,5,180,0.87500
98,5,190,0.88125
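A dataframe of this shape is easy to rank. The following sketch rebuilds the same three-column layout on a hypothetical small grid (chosen only for speed) and sorts it so the best-scoring combinations appear first; `results` and `top` are illustrative names.

```python
import pandas as pd
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import GridSearchCV

X, Y = make_classification(n_samples=200, n_classes=2, n_features=10,
                           n_redundant=0, random_state=1)

# Hypothetical small grid so the sketch finishes quickly.
small_grid = {"max_features": [2, 5], "n_estimators": [10, 50]}
search = GridSearchCV(RandomForestClassifier(random_state=0),
                      param_grid=small_grid, cv=3)
search.fit(X, Y)

# Same construction as in the notebook: parameters plus mean CV accuracy.
results = pd.concat(
    [pd.DataFrame(search.cv_results_["params"]),
     pd.DataFrame(search.cv_results_["mean_test_score"], columns=["Accuracy"])],
    axis=1)

# Highest accuracy first.
top = results.sort_values("Accuracy", ascending=False).head(3)
print(top)
```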


# 6. Preparing data for making contour plots
Prior to making contour plots, we will have to reshape the data into a compatible format that will be recognized by the contour plot functions.

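Firstly, we will use Pandas' groupby() function to segment the data into groups based on the 2 hyperparameters: max_features and n_estimators. Because each combination of the two occurs exactly once, the mean of each group is simply its accuracy; the main effect is to index the scores by both hyperparameters.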

In [31]:
grid_contour = grid_results.groupby(['max_features','n_estimators']).mean()
grid_contour

Unnamed: 0_level_0,Unnamed: 1_level_0,Accuracy
max_features,n_estimators,Unnamed: 2_level_1
1,10,0.73750
1,20,0.79375
1,30,0.82500
1,40,0.83125
1,50,0.83125
...,...,...
5,160,0.87500
5,170,0.88125
5,180,0.87500
5,190,0.88125


### Pivoting the data
Data is reshaped by pivoting the data into an m by n matrix where rows and columns correspond to the max_features and n_estimators, respectively.

In [32]:
grid_reset = grid_contour.reset_index()
grid_reset.columns = ['max_features', 'n_estimators', 'Accuracy']
# Keyword arguments are required by pandas 2.0+ and also work on older versions
grid_pivot = grid_reset.pivot(index='max_features', columns='n_estimators')
grid_pivot

Unnamed: 0_level_0,Accuracy,Accuracy,Accuracy,Accuracy,Accuracy,Accuracy,Accuracy,Accuracy,Accuracy,Accuracy,Accuracy,Accuracy,Accuracy,Accuracy,Accuracy,Accuracy,Accuracy,Accuracy,Accuracy,Accuracy
n_estimators,10,20,30,40,50,60,70,80,90,100,110,120,130,140,150,160,170,180,190,200
max_features,Unnamed: 1_level_2,Unnamed: 2_level_2,Unnamed: 3_level_2,Unnamed: 4_level_2,Unnamed: 5_level_2,Unnamed: 6_level_2,Unnamed: 7_level_2,Unnamed: 8_level_2,Unnamed: 9_level_2,Unnamed: 10_level_2,Unnamed: 11_level_2,Unnamed: 12_level_2,Unnamed: 13_level_2,Unnamed: 14_level_2,Unnamed: 15_level_2,Unnamed: 16_level_2,Unnamed: 17_level_2,Unnamed: 18_level_2,Unnamed: 19_level_2,Unnamed: 20_level_2
1,0.7375,0.79375,0.825,0.83125,0.83125,0.8125,0.83125,0.825,0.85625,0.84375,0.8375,0.83125,0.8625,0.85,0.85,0.85625,0.84375,0.8625,0.84375,0.85625
2,0.81875,0.86875,0.825,0.86875,0.86875,0.86875,0.8625,0.875,0.86875,0.85625,0.86875,0.875,0.86875,0.875,0.86875,0.875,0.86875,0.875,0.85625,0.875
3,0.81875,0.84375,0.84375,0.86875,0.875,0.88125,0.875,0.8875,0.875,0.86875,0.88125,0.875,0.875,0.875,0.875,0.86875,0.875,0.875,0.875,0.875
4,0.8625,0.8625,0.88125,0.8625,0.86875,0.88125,0.875,0.875,0.88125,0.875,0.875,0.875,0.875,0.875,0.875,0.875,0.875,0.875,0.86875,0.875
5,0.8375,0.86875,0.8875,0.875,0.86875,0.86875,0.86875,0.88125,0.88125,0.88125,0.88125,0.88125,0.88125,0.875,0.875,0.875,0.88125,0.875,0.88125,0.875


Finally, we assign the pivoted data into the respective x, y and z variables.

In [33]:
x = grid_pivot.columns.levels[1].values
y = grid_pivot.index.values
z = grid_pivot.values
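To see how x, y and z line up, here is a small self-contained sketch that pivots only four of the accuracy values listed in Section 5 (max_features 1 and 2, n_estimators 10 and 20). The name `toy` is hypothetical; the pivot and extraction steps mirror the cells above.

```python
import pandas as pd

# Four rows taken from the grid search results table in this notebook.
toy = pd.DataFrame({
    "max_features": [1, 1, 2, 2],
    "n_estimators": [10, 20, 10, 20],
    "Accuracy": [0.73750, 0.79375, 0.81875, 0.86875],
})

# Rows become max_features, columns become n_estimators.
toy_pivot = toy.pivot(index="max_features", columns="n_estimators")

x = toy_pivot.columns.levels[1].values  # n_estimators values
y = toy_pivot.index.values              # max_features values
z = toy_pivot.values                    # accuracy matrix

print(x, y)
print(z)
print(z.shape == (len(y), len(x)))
```

The element `z[i, j]` is the accuracy for `max_features = y[i]` and `n_estimators = x[j]`, which is the row-by-column layout that contour plotting functions generally expect.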