In [1]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns

SEED = 1

## Tuning a CART's hyperparameters

### Hyperparameters
- parameters: learned from data
 - CART example: split-point of a node, split-feature of a node, ...
- hyperparameters: net learned from data, set prior to training
 - CART example: max_depth, min_samples_leaf, splitting criterion, ...

### What is hyperparameter tuning?
- problem: search for a set of optimal hyperparameters for a learning algorithm
- solution: find a set of optimal hyperparameters that results in an optimal model
- optimal model: yields an optimal score
- score: in sklearn defualts to accuracy (classification) and R_square (regression)
- cross validation is used to estimate the generalization performance

### Why tune hyperparameters?
- in sklearn, a model's defualt hyperparameters are not optimal for all problems
- hyperparameters should bbe tuned to obtain the best model performance.

### Approaches to hyperparameter tuning
- Grid Search
- Random Search
- Bayesian Optimization
- Genetic Algorithms
- ...

### Grid search cross validation
- manually set a grid of discrete hyperparameter values
- set a metric for scoring model performance
- search exhaustively through the grid
- for each set of hyperparamters, evaluate each model's CV score
- the optimal hyperparameters are those of the model achieving the best CV score

### Inspecting the hyperparameters of a CART in sklearn (Breast-Cancer dataset)

In [2]:
wbc = pd.read_csv('wbc.csv').drop('Unnamed: 32', axis=1)
y = wbc['diagnosis']
X = wbc.drop(['id', 'diagnosis'], axis=1)

from sklearn.model_selection import train_test_split
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, stratify=y, random_state=SEED)

In [3]:
# Import DecisionTreeClassifier
from sklearn.tree import DecisionTreeClassifier

# Set seed to 1 for reproducibility
SEED = 1

# Instantiate a DecisionTreeClassifier 'dt'
dt = DecisionTreeClassifier(random_state=SEED)

# Print out 'dt's hyperparameters
dt.get_params()

{'ccp_alpha': 0.0,
 'class_weight': None,
 'criterion': 'gini',
 'max_depth': None,
 'max_features': None,
 'max_leaf_nodes': None,
 'min_impurity_decrease': 0.0,
 'min_impurity_split': None,
 'min_samples_leaf': 1,
 'min_samples_split': 2,
 'min_weight_fraction_leaf': 0.0,
 'presort': 'deprecated',
 'random_state': 1,
 'splitter': 'best'}

In [4]:
# Import GridSearchCV
from sklearn.model_selection import GridSearchCV

# Define the grid of hyperparameters 'params_dt'
params_dt = {'max_depth': [3, 4, 5, 6], 
             'min_samples_leaf': [0.04, 0.06, 0.08], 
             'max_features': [0.2, 0.4, 0.6, 0.8]}

# Instantiate a 10-fold CV grid search object 'grid_dt'
grid_dt = GridSearchCV(estimator=dt, 
                       param_grid=params_dt, 
                       scoring='accuracy', 
                       cv=10, 
                       n_jobs=-1)

# Fit 'grid_dt' to the training data
grid_dt.fit(X_train, y_train)

GridSearchCV(cv=10, estimator=DecisionTreeClassifier(random_state=1), n_jobs=-1,
             param_grid={'max_depth': [3, 4, 5, 6],
                         'max_features': [0.2, 0.4, 0.6, 0.8],
                         'min_samples_leaf': [0.04, 0.06, 0.08]},
             scoring='accuracy')

In [5]:
# Extract best hyperparameters from 'grid_dt'
grid_dt.best_params_

{'max_depth': 4, 'max_features': 0.4, 'min_samples_leaf': 0.04}

In [6]:
# Extract best CV score from 'grid_dt'
grid_dt.best_score_

0.9406763285024156

In [7]:
# Extract best model from 'grid_dt'
best_model = grid_dt.best_estimator_
best_model 

DecisionTreeClassifier(max_depth=4, max_features=0.4, min_samples_leaf=0.04,
                       random_state=1)

In [8]:
# Evaluate test set accuracy
best_model.score(X_test, y_test)

0.9473684210526315

### Exercise: Tree hyperparameters
In the following exercises you'll revisit the Indian Liver Patient dataset which was introduced in a previous chapter.

Your task is to tune the hyperparameters of a classification tree. Given that this dataset is imbalanced, you'll be using the ROC AUC score as a metric instead of accuracy.

We have instantiated a DecisionTreeClassifier and assigned to dt with sklearn's default hyperparameters. You can inspect the hyperparameters of dt in your console.

Which of the following is not a hyperparameter of dt?

In [9]:
liver = pd.read_csv('indian_liver_patient.csv')
liver.dropna(axis=0, inplace=True)
liver['Is_male'] = liver['Gender'].astype('category').cat.codes
X = liver.drop(['Dataset', 'Gender'] , axis=1)
y = liver['Dataset']

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=SEED)

In [10]:
# Import DecisionTreeClassifier
from sklearn.tree import DecisionTreeClassifier

# Set seed to 1 for reproducibility
SEED = 1

# Instantiate a DecisionTreeClassifier 'dt'
dt = DecisionTreeClassifier(random_state=SEED)

# Print out 'dt's hyperparameters
dt.get_params()

{'ccp_alpha': 0.0,
 'class_weight': None,
 'criterion': 'gini',
 'max_depth': None,
 'max_features': None,
 'max_leaf_nodes': None,
 'min_impurity_decrease': 0.0,
 'min_impurity_split': None,
 'min_samples_leaf': 1,
 'min_samples_split': 2,
 'min_weight_fraction_leaf': 0.0,
 'presort': 'deprecated',
 'random_state': 1,
 'splitter': 'best'}

### Exercise: Set the tree's hyperparameter grid
In this exercise, you'll manually set the grid of hyperparameters that will be used to tune the classification tree dt and find the optimal classifier in the next exercise.

In [11]:
# Define params_dt
params_dt = {'max_depth': [2, 3, 4], 
             'min_samples_leaf': [0.12, 0.14, 0.16, 0.18]}

### Exercise: Search for the optimal tree
In this exercise, you'll perform grid search using 5-fold cross validation to find dt's optimal hyperparameters. Note that because grid search is an exhaustive process, it may take a lot time to train the model. Here you'll only be instantiating the GridSearchCV object without fitting it to the training set. As discussed in the video, you can train such an object similar to any scikit-learn estimator by using the .fit() method:

grid_object.fit(X_train, y_train)

An untuned classification tree dt as well as the dictionary params_dt that you defined in the previous exercise are available in your workspace.

In [12]:
# Import GridSearchCV
from sklearn.model_selection import GridSearchCV

# Instantiate grid_dt
grid_dt = GridSearchCV(estimator=dt,
                       param_grid=params_dt,
                       scoring='roc_auc',
                       cv=5,
                       n_jobs=-1)

# Fit 'grid_dt' to the training data
grid_dt.fit(X_train, y_train)

GridSearchCV(cv=5, estimator=DecisionTreeClassifier(random_state=1), n_jobs=-1,
             param_grid={'max_depth': [2, 3, 4],
                         'min_samples_leaf': [0.12, 0.14, 0.16, 0.18]},
             scoring='roc_auc')

Awesome! As we said earlier, we will fit the model to the training data for you and in the next exercise you will compute the test set ROC AUC score.

### Exercise: Evaluate the optimal tree
In this exercise, you'll evaluate the test set ROC AUC score of grid_dt's optimal model.

In order to do so, you will first determine the probability of obtaining the positive label for each test set observation. You can use the methodpredict_proba() of an sklearn classifier to compute a 2D array containing the probabilities of the negative and positive class-labels respectively along columns.

The dataset is already loaded and processed for you (numerical features are standardized); it is split into 80% train and 20% test. X_test, y_test are available in your workspace. In addition, we have also loaded the trained GridSearchCV object grid_dt that you instantiated in the previous exercise. Note that grid_dt was trained as follows:

grid_dt.fit(X_train, y_train)

In [13]:
# Import roc_auc_score from sklearn.metrics
from sklearn.metrics import roc_auc_score

# Extract the best estimator
best_model = grid_dt.best_estimator_

# Predict the test set probabilities of the positive class
y_pred_proba = best_model.predict_proba(X_test)[:, 1]

# Compute test_roc_auc
test_roc_auc = roc_auc_score(y_test, y_pred_proba)

# Print test_roc_auc
print('Test set ROC AUC score: {:.3f}'.format(test_roc_auc))

Test set ROC AUC score: 0.731


## Tuning an RF's Hyperparameters

### Random Forests Hyperparameters
- CART hyperparameters
- number of estimators
- bootstrap
- ...

### Hyperparameter Tuning is expensive
- computationally expensive
- sometimes leads to very slight improvement
- so, weight the impact of tuning on the whole project

### Inspecting RF hyperparameters in sklearn (auto dataset)

In [14]:
auto = pd.read_csv('auto.csv')
auto['origin'] = auto['origin'].astype('category')
dummies = pd.get_dummies(auto['origin'], prefix='origin')
auto = pd.concat([auto, dummies], axis=1)

X = auto.drop(['mpg', 'origin'], axis=1)
y = auto['mpg']

In [15]:
# Import RandomForestRegressor
from sklearn.ensemble import RandomForestRegressor

# Set seed for reproducibility
SEED = 1

# Instantiate a random forests regressor 'rf'
rf = RandomForestRegressor(random_state= SEED)

In [16]:
# Inspect rf's hyperparameters
rf.get_params()

{'bootstrap': True,
 'ccp_alpha': 0.0,
 'criterion': 'mse',
 'max_depth': None,
 'max_features': 'auto',
 'max_leaf_nodes': None,
 'max_samples': None,
 'min_impurity_decrease': 0.0,
 'min_impurity_split': None,
 'min_samples_leaf': 1,
 'min_samples_split': 2,
 'min_weight_fraction_leaf': 0.0,
 'n_estimators': 100,
 'n_jobs': None,
 'oob_score': False,
 'random_state': 1,
 'verbose': 0,
 'warm_start': False}

In [17]:
# Basic imports
from sklearn.metrics import mean_squared_error as MSE
from sklearn.model_selection import GridSearchCV

# Define a grid of hyperparameter 'params_rf'
params_rf = {'n_estimators': [300, 400, 500], 
             'max_depth': [4, 6, 8], 
             'min_samples_leaf': [0.1, 0.2], 
             'max_features': ['log2', 'sqrt']}

# Instantiate 'grid_rf'
grid_rf = GridSearchCV(estimator=rf, 
                       param_grid=params_rf, 
                       cv=3, 
                       scoring='neg_mean_squared_error', 
                       verbose=1, 
                       n_jobs=-1)

In [18]:
# Fit 'grid_rf' to the training set
grid_rf.fit(X_train, y_train)

Fitting 3 folds for each of 36 candidates, totalling 108 fits


[Parallel(n_jobs=-1)]: Using backend LokyBackend with 16 concurrent workers.
[Parallel(n_jobs=-1)]: Done  18 tasks      | elapsed:    2.8s
[Parallel(n_jobs=-1)]: Done 108 out of 108 | elapsed:   12.6s finished


GridSearchCV(cv=3, estimator=RandomForestRegressor(random_state=1), n_jobs=-1,
             param_grid={'max_depth': [4, 6, 8],
                         'max_features': ['log2', 'sqrt'],
                         'min_samples_leaf': [0.1, 0.2],
                         'n_estimators': [300, 400, 500]},
             scoring='neg_mean_squared_error', verbose=1)

In [19]:
# Extract the best hyperparameters from 'grid_rf'
grid_rf.best_params_

{'max_depth': 4,
 'max_features': 'log2',
 'min_samples_leaf': 0.1,
 'n_estimators': 500}

In [20]:
# Extract the best model from 'grid_rf'
best_model = grid_rf.best_estimator_
best_model

RandomForestRegressor(max_depth=4, max_features='log2', min_samples_leaf=0.1,
                      n_estimators=500, random_state=1)

In [21]:
# Predict the test set labels
y_pred = best_model.predict(X_test)
y_pred[:5]

array([1.18834338, 1.23605306, 1.4672004 , 1.30955372, 1.23204304])

In [22]:
# Evaluate the test set RMSE
MSE(y_test, y_pred)**(1/2)

0.41163064876435845

### Exercise Random forests hyperparameters
In the following exercises, you'll be revisiting the Bike Sharing Demand dataset that was introduced in a previous chapter. Recall that your task is to predict the bike rental demand using historical weather data from the Capital Bikeshare program in Washington, D.C.. For this purpose, you'll be tuning the hyperparameters of a Random Forests regressor.

We have instantiated a RandomForestRegressor called rf using sklearn's default hyperparameters. You can inspect the hyperparameters of rf in your console.

Which of the following is not a hyperparameter of rf?

In [23]:
bikes = pd.read_csv('bikes.csv')
y = bikes['cnt']
X = bikes.drop('cnt', axis=1)

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=1)

In [24]:
# Import RandomForestRegressor
from sklearn.ensemble import RandomForestRegressor

# Set seed for reproducibility
SEED = 1

# Instantiate a random forests regressor 'rf'
rf = RandomForestRegressor(random_state= SEED)

In [25]:
# Inspect rf's hyperparameters
rf.get_params()

{'bootstrap': True,
 'ccp_alpha': 0.0,
 'criterion': 'mse',
 'max_depth': None,
 'max_features': 'auto',
 'max_leaf_nodes': None,
 'max_samples': None,
 'min_impurity_decrease': 0.0,
 'min_impurity_split': None,
 'min_samples_leaf': 1,
 'min_samples_split': 2,
 'min_weight_fraction_leaf': 0.0,
 'n_estimators': 100,
 'n_jobs': None,
 'oob_score': False,
 'random_state': 1,
 'verbose': 0,
 'warm_start': False}

### Exercise: Set the hyperparameter grid of RF
In this exercise, you'll manually set the grid of hyperparameters that will be used to tune rf's hyperparameters and find the optimal regressor. For this purpose, you will be constructing a grid of hyperparameters and tune the number of estimators, the maximum number of features used when splitting each node and the minimum number of samples (or fraction) per leaf.

In [26]:
# Define a grid of hyperparameter 'params_rf'
params_rf = {'n_estimators': [100, 350, 500], 
             'min_samples_leaf': [2, 10, 30], 
             'max_features': ['log2', 'auto', 'sqrt']}

### Exercise: Search for the optimal forest
In this exercise, you'll perform grid search using 3-fold cross validation to find rf's optimal hyperparameters. To evaluate each model in the grid, you'll be using the negative mean squared error metric.

Note that because grid search is an exhaustive search process, it may take a lot time to train the model. Here you'll only be instantiating the GridSearchCV object without fitting it to the training set. As discussed in the video, you can train such an object similar to any scikit-learn estimator by using the .fit() method:

grid_object.fit(X_train, y_train)

The untuned random forests regressor model rf as well as the dictionary params_rf that you defined in the previous exercise are available in your workspace.

In [27]:
# Import GridSearchCV
from sklearn.model_selection import GridSearchCV

# Instantiate grid_rf
grid_rf = GridSearchCV(estimator=rf,
                       param_grid=params_rf,
                       scoring='neg_mean_squared_error',
                       cv=3,
                       verbose=1,
                       n_jobs=-1)

# Fit 'grid_rf' to the training set
grid_rf.fit(X_train, y_train)

Fitting 3 folds for each of 27 candidates, totalling 81 fits


[Parallel(n_jobs=-1)]: Using backend LokyBackend with 16 concurrent workers.
[Parallel(n_jobs=-1)]: Done  18 tasks      | elapsed:    2.4s
[Parallel(n_jobs=-1)]: Done  81 out of  81 | elapsed:    9.4s finished


GridSearchCV(cv=3, estimator=RandomForestRegressor(random_state=1), n_jobs=-1,
             param_grid={'max_features': ['log2', 'auto', 'sqrt'],
                         'min_samples_leaf': [2, 10, 30],
                         'n_estimators': [100, 350, 500]},
             scoring='neg_mean_squared_error', verbose=1)

In [28]:
# Extract the best hyperparameters from 'grid_rf'
grid_rf.best_params_

{'max_features': 'auto', 'min_samples_leaf': 2, 'n_estimators': 100}

### Exercise: Evaluate the optimal forest
In this last exercise of the course, you'll evaluate the test set RMSE of grid_rf's optimal model.

The dataset is already loaded and processed for you and is split into 80% train and 20% test. In your environment are available X_test, y_test and the function mean_squared_error from sklearn.metrics under the alias MSE. In addition, we have also loaded the trained GridSearchCV object grid_rf that you instantiated in the previous exercise. Note that grid_rf was trained as follows:

grid_rf.fit(X_train, y_train)

In [29]:
# Import mean_squared_error from sklearn.metrics as MSE 
from sklearn.metrics import mean_squared_error as MSE

# Extract the best estimator
best_model = grid_rf.best_estimator_

# Predict test set labels
y_pred = best_model.predict(X_test)

# Compute rmse_test
rmse_test = MSE(y_test, y_pred)**(1/2)

# Print rmse_test
print('Test RMSE of best model: {:.3f}'.format(rmse_test)) 

Test RMSE of best model: 51.779
