# Tuning a CART's hyperparameters

## Hyperparameters
Machine learning model:
* parameters: learned from data
    * CART example: split-point of a node, split-feature of a node, ...
* hyperparameters: not learned from data, set prior to training
    * CART example: max_depth, min_samples_leaf, splitting criterion, ...
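
To make the distinction concrete, the sketch below fits a small classification tree and reads back both kinds of values. The iris dataset and the names `clf`, `root_feature` and `root_threshold` are illustrative choices, not part of the original notebook.

```python
from sklearn.datasets import load_iris
from sklearn.tree import DecisionTreeClassifier

# Illustrative data only: iris ships with scikit-learn (an assumption, not used in this notebook)
X, y = load_iris(return_X_y=True)

# Hyperparameters are set before training
clf = DecisionTreeClassifier(max_depth=2, min_samples_leaf=0.05, random_state=1)
clf.fit(X, y)

# Hyperparameters are not changed by fitting
print(clf.get_params()["max_depth"])

# Parameters are learned from data: split feature and split point of the root node (node 0)
root_feature = clf.tree_.feature[0]
root_threshold = clf.tree_.threshold[0]
print(root_feature, root_threshold)
```

Here `max_depth` stays at the value passed in, while the root node's split feature and threshold only exist after `fit` has seen the data.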

## What is hyperparameter tuning?
* Problem: search for the set of hyperparameters that is optimal for a learning algorithm.
* Solution: find the set of hyperparameters that results in an optimal model.
* Optimal model: the model that yields the best score among the hyperparameter sets tried.
* Score: in sklearn, defaults to accuracy (classification) and $R^2$ (regression).
* Cross-validation is used to estimate the generalization performance.
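
As a minimal sketch of how a CV score is obtained for one fixed set of hyperparameters, the example below uses `cross_val_score` on synthetic data. The dataset from `make_classification` and the chosen `max_depth` are assumptions made purely for illustration.

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import cross_val_score
from sklearn.tree import DecisionTreeClassifier

# Synthetic data, for illustration only
X, y = make_classification(n_samples=300, n_features=10, random_state=1)

# One fixed set of hyperparameters
clf = DecisionTreeClassifier(max_depth=3, random_state=1)

# With no 'scoring' argument, a classifier is scored with its default metric (accuracy)
scores = cross_val_score(clf, X, y, cv=5)

# The mean over the folds is the estimate of generalization performance
print(scores.mean(), scores.std())
```

Each entry of `scores` comes from a model trained on four folds and evaluated on the held-out fifth.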

## Why tune hyperparameters?
* In sklearn, a model's default hyperparameters are not optimal for all problems.
* Hyperparameters should be tuned to obtain the best model performance.

## Approaches to hyperparameter tuning
* Grid Search
* Random Search
* Bayesian Optimization
* Genetic Algorithms
* ...
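
Of the approaches listed, random search is also available in sklearn as `RandomizedSearchCV`. The sketch below is one possible setup, assuming synthetic data and an illustrative search space. It evaluates only `n_iter` randomly sampled combinations instead of the full grid.

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import RandomizedSearchCV
from sklearn.tree import DecisionTreeClassifier

# Synthetic data, for illustration only
X, y = make_classification(n_samples=300, n_features=10, random_state=1)

# Illustrative search space: 5 x 5 = 25 possible combinations
param_distributions = {
    "max_depth": [2, 3, 4, 5, 6],
    "min_samples_leaf": [0.02, 0.04, 0.06, 0.08, 0.1],
}

# Sample 8 of the 25 combinations and score each with 5-fold CV
search = RandomizedSearchCV(
    estimator=DecisionTreeClassifier(random_state=1),
    param_distributions=param_distributions,
    n_iter=8,
    scoring="accuracy",
    cv=5,
    random_state=1,
)
search.fit(X, y)

print(search.best_params_, search.best_score_)
```

The fitted object exposes the same `best_params_`, `best_score_` and `best_estimator_` attributes as `GridSearchCV`.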

## Grid search cross validation
* Manually set a grid of discrete hyperparameter values.
* Set a metric for scoring model performance.
* Search exhaustively through the grid.
* For each set of hyperparameters, evaluate each model's CV score.
* The optimal hyperparameters are those of the model achieving the best CV score.

> Note that grid search suffers from the curse of dimensionality: the number of combinations to evaluate is the product of the number of values per hyperparameter, so every hyperparameter added to the grid multiplies the search time.
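
The growth of the grid can be checked with a few lines of arithmetic. The helper `grid_size` and the grid values below are hypothetical and serve only to show how the number of combinations multiplies.

```python
from math import prod


def grid_size(param_grid):
    """Number of hyperparameter combinations in an exhaustive grid."""
    return prod(len(values) for values in param_grid.values())


# Illustrative grid with two hyperparameters
grid = {"max_depth": [2, 3, 4], "min_samples_leaf": [0.05, 0.1]}
size_two = grid_size(grid)      # 3 * 2 = 6

# Adding a third hyperparameter with 4 values multiplies the size by 4
grid["max_features"] = [0.2, 0.4, 0.6, 0.8]
size_three = grid_size(grid)    # 6 * 4 = 24

# Adding a fourth with 2 values doubles it again
grid["criterion"] = ["gini", "entropy"]
size_four = grid_size(grid)     # 24 * 2 = 48

print(size_two, size_three, size_four)
```

Every combination is then fitted once per CV fold, so the total work is the grid size times the number of folds.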

## Grid search cross validation: example
* Hyperparameter grids:
    * max_depth = {2, 3, 4}
    * min_samples_leaf = {0.05, 0.1}
* hyperparameter space = { (2, 0.05), (2, 0.1), (3, 0.05), ... }
* CV scores = { $score_{(2, 0.05)}$, ... } (using k-fold CV, for example)
* optimal hyperparameters = set of hyperparameters corresponding to the best CV score
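
The procedure above can be written out by hand before handing it over to `GridSearchCV`. This is a sketch under stated assumptions: the data is synthetic and the names `cv_scores` and `best` are illustrative.

```python
from itertools import product

from sklearn.datasets import make_classification
from sklearn.model_selection import cross_val_score
from sklearn.tree import DecisionTreeClassifier

# Synthetic data, for illustration only
X, y = make_classification(n_samples=300, n_features=10, random_state=1)

# The two grids from the example
max_depths = [2, 3, 4]
min_samples_leafs = [0.05, 0.1]

# Evaluate the mean 5-fold CV accuracy of every combination in the grid
cv_scores = {}
for max_depth, min_samples_leaf in product(max_depths, min_samples_leafs):
    clf = DecisionTreeClassifier(
        max_depth=max_depth, min_samples_leaf=min_samples_leaf, random_state=1
    )
    scores = cross_val_score(clf, X, y, cv=5, scoring="accuracy")
    cv_scores[(max_depth, min_samples_leaf)] = scores.mean()

# The optimal hyperparameters are those with the best mean CV score
best = max(cv_scores, key=cv_scores.get)
print(best, cv_scores[best])
```

`GridSearchCV` automates this loop and, by default, refits the best combination on the full training set afterwards.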

## Tuning hyperparameters (Breast Cancer)

### Inspecting the hyperparameters of a CART in sklearn

In [1]:
# Import DecisionTreeClassifier
from sklearn.tree import DecisionTreeClassifier
# Set seed to 1 for reproducibility
SEED = 1
# Instantiate a DecisionTreeClassifier 'dt'
dt = DecisionTreeClassifier(random_state=SEED)
# Print out 'dt's hyperparameters
dt.get_params()

{'ccp_alpha': 0.0,
 'class_weight': None,
 'criterion': 'gini',
 'max_depth': None,
 'max_features': None,
 'max_leaf_nodes': None,
 'min_impurity_decrease': 0.0,
 'min_samples_leaf': 1,
 'min_samples_split': 2,
 'min_weight_fraction_leaf': 0.0,
 'random_state': 1,
 'splitter': 'best'}

In [2]:
import pandas as pd
wbc = pd.read_csv('wbc.zip')
X = wbc.iloc[:, 2:-1]
y = pd.Categorical(wbc['diagnosis']).codes

In [3]:
from sklearn.model_selection import train_test_split
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, stratify=y, random_state=1)

In [4]:
# Import GridSearchCV
from sklearn.model_selection import GridSearchCV

# Define the grid of hyperparameters 'params_dt'
params_dt = {
    "max_depth": [3, 4, 5, 6],
    "min_samples_leaf": [0.04, 0.06, 0.08],
    "max_features": [0.2, 0.4, 0.6, 0.8],
}
# Instantiate a 10-fold CV grid search object 'grid_dt'
grid_dt = GridSearchCV(
    estimator=dt, param_grid=params_dt, scoring="accuracy", cv=10, n_jobs=-1
)
# Fit 'grid_dt' to the training data
grid_dt.fit(X_train, y_train)

### Extracting the best hyperparameters

In [5]:
# Extract best hyperparameters from 'grid_dt'
best_hyperparams = grid_dt.best_params_
print("Best hyperparameters:\n", best_hyperparams)

Best hyperparameters:
 {'max_depth': 4, 'max_features': 0.2, 'min_samples_leaf': 0.06}


In [6]:
# Extract best CV score from 'grid_dt'
best_CV_score = grid_dt.best_score_
print("Best CV accuracy {:.3f}".format(best_CV_score))

Best CV accuracy 0.935
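
For context, `best_score_` is the mean CV score of the winning candidate, and the per-candidate scores are stored in `cv_results_`. The sketch below shows the relationship on a reduced grid. It assumes `load_breast_cancer` from sklearn as a stand-in dataset (the notebook reads its own `wbc.zip`), so the numbers will not match the outputs above.

```python
import numpy as np
from sklearn.datasets import load_breast_cancer
from sklearn.model_selection import GridSearchCV
from sklearn.tree import DecisionTreeClassifier

# Stand-in data bundled with scikit-learn (assumption: not the notebook's wbc.zip)
X, y = load_breast_cancer(return_X_y=True)

# Reduced, illustrative grid: 2 x 2 = 4 candidates
grid = GridSearchCV(
    estimator=DecisionTreeClassifier(random_state=1),
    param_grid={"max_depth": [3, 4], "min_samples_leaf": [0.04, 0.06]},
    scoring="accuracy",
    cv=5,
)
grid.fit(X, y)

# One mean CV score per candidate
mean_scores = grid.cv_results_["mean_test_score"]

# Index of the winning candidate in cv_results_
best_index = grid.best_index_

print(grid.cv_results_["params"][best_index])
print(mean_scores[best_index], grid.best_score_)
```

Inspecting `mean_test_score` for all candidates shows how close the runners-up are, which helps judge whether the chosen hyperparameters matter much.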


### Extracting the best estimator

In [7]:
# Extract best model from 'grid_dt'
best_model = grid_dt.best_estimator_
# Evaluate test set accuracy
test_acc = best_model.score(X_test, y_test)
# Print test set accuracy
print("Test set accuracy of best model: {:.3f}".format(test_acc))

Test set accuracy of best model: 0.906


## Tuning hyperparameters (Indian Liver Patient)

> Your task is to tune the hyperparameters of a classification tree. Given that this dataset is imbalanced, you'll be using the ROC AUC score as a metric instead of accuracy.
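
A small made-up example shows why accuracy can mislead on imbalanced data. The labels below are hypothetical (90 negatives, 10 positives) and the "model" simply predicts the majority class for every sample.

```python
import numpy as np
from sklearn.metrics import accuracy_score, roc_auc_score

# Hypothetical imbalanced labels: 90 negatives, 10 positives
y_true = np.array([0] * 90 + [1] * 10)

# A useless model: always predicts the majority class
y_pred = np.zeros_like(y_true)

# The same model expressed as scores: every sample gets the same score
y_score = np.zeros(len(y_true))

acc = accuracy_score(y_true, y_pred)    # 90 of 100 labels are matched
auc = roc_auc_score(y_true, y_score)    # constant scores cannot rank positives above negatives

print(acc, auc)
```

Accuracy comes out at 0.9 although the model never identifies a positive, while the ROC AUC of 0.5 correctly reports chance-level ranking.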

In [8]:
indian = pd.read_csv('indian_liver_patient.zip').dropna()
indian['Gender'] = pd.Categorical(indian['Gender']).codes
X = indian.iloc[:, :-1]
y = indian['Dataset']

In [9]:
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, stratify=y, random_state=1)

In [10]:
# Define params_dt
params_dt = {'max_depth': [2,3,4], 'min_samples_leaf': [0.12, 0.14, 0.16, 0.18]}

In [11]:
# Import GridSearchCV
from sklearn.model_selection import GridSearchCV

# Instantiate grid_dt
grid_dt = GridSearchCV(estimator=dt,
                       param_grid=params_dt,
                       scoring='roc_auc',
                       cv=5,
                       n_jobs=-1)

grid_dt.fit(X_train, y_train)

In [12]:
# Import roc_auc_score from sklearn.metrics
from sklearn.metrics import roc_auc_score

# Extract the best estimator
best_model = grid_dt.best_estimator_

# Predict the test set probabilities of the positive class
y_pred_proba = best_model.predict_proba(X_test)[:,1]

# Compute test_roc_auc
test_roc_auc = roc_auc_score(y_test, y_pred_proba)

# Print test_roc_auc
print('Test set ROC AUC score: {:.3f}'.format(test_roc_auc))

Test set ROC AUC score: 0.728


# Tuning an (Ensemble Model) RF's Hyperparameters

## Random Forests Hyperparameters
* CART hyperparameters
* number of estimators
* bootstrap
* ...

## Tuning is expensive
Hyperparameter tuning:
* is computationally expensive,
* sometimes leads to only a very slight improvement.

Weigh the impact of tuning on the whole project.

## Tuning hyperparameters (Auto-mpg dataset)

In [13]:
auto = pd.read_csv('auto.zip')
X = auto.iloc[:, 1:]
X['origin'] = pd.Categorical(X['origin']).codes
y = auto['mpg']

In [14]:
# Split dataset into 80% train and 20% test
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=1)

### Inspecting RF Hyperparameters in sklearn

In [15]:
# Import RandomForestRegressor
from sklearn.ensemble import RandomForestRegressor
# Set seed for reproducibility
SEED = 1
# Instantiate a random forest regressor 'rf'
rf = RandomForestRegressor(random_state=SEED)

In [16]:
# Inspect 'rf's hyperparameters
rf.get_params()

{'bootstrap': True,
 'ccp_alpha': 0.0,
 'criterion': 'squared_error',
 'max_depth': None,
 'max_features': 1.0,
 'max_leaf_nodes': None,
 'max_samples': None,
 'min_impurity_decrease': 0.0,
 'min_samples_leaf': 1,
 'min_samples_split': 2,
 'min_weight_fraction_leaf': 0.0,
 'n_estimators': 100,
 'n_jobs': None,
 'oob_score': False,
 'random_state': 1,
 'verbose': 0,
 'warm_start': False}

In [17]:
# Basic imports
from sklearn.metrics import mean_squared_error as MSE
from sklearn.model_selection import GridSearchCV

# Define a grid of hyperparameters 'params_rf'
params_rf = {
    "n_estimators": [300, 400, 500],
    "max_depth": [4, 6, 8],
    "min_samples_leaf": [0.1, 0.2],
    "max_features": ["log2", "sqrt"],
}
# Instantiate 'grid_rf'
grid_rf = GridSearchCV(
    estimator=rf,
    param_grid=params_rf,
    cv=3,
    scoring="neg_mean_squared_error",
    verbose=1,
    n_jobs=-1,
)

### Searching for the best hyperparameters

In [18]:
# Fit 'grid_rf' to the training set
grid_rf.fit(X_train, y_train)

Fitting 3 folds for each of 36 candidates, totalling 108 fits
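
The fit count in this message follows directly from the grid and the number of folds. The sketch below recomputes it with `ParameterGrid`, using the same `params_rf` values as the cell that defines the grid.

```python
from sklearn.model_selection import ParameterGrid

# Same grid as 'params_rf' in the notebook
params_rf = {
    "n_estimators": [300, 400, 500],
    "max_depth": [4, 6, 8],
    "min_samples_leaf": [0.1, 0.2],
    "max_features": ["log2", "sqrt"],
}
cv = 3

# Number of candidates: 3 * 3 * 2 * 2
n_candidates = len(ParameterGrid(params_rf))

# Each candidate is fitted once per fold
n_fits = n_candidates * cv

print(n_candidates, n_fits)
```

With the default `refit=True`, one further fit of the best candidate on the whole training set follows the cross-validation fits.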


### Extracting the best hyperparameters

In [19]:
# Extract the best hyperparameters from 'grid_rf'
best_hyperparams = grid_rf.best_params_
print('Best hyperparameters:\n', best_hyperparams)

Best hyperparameters:
 {'max_depth': 4, 'max_features': 'log2', 'min_samples_leaf': 0.1, 'n_estimators': 300}


### Evaluating the best model performance

In [20]:
# Extract the best model from 'grid_rf'
best_model = grid_rf.best_estimator_
# Predict the test set labels
y_pred = best_model.predict(X_test)
# Evaluate the test set RMSE
rmse_test = MSE(y_test, y_pred)**(1/2)
# Print the test set RMSE
print('Test set RMSE of rf: {:.2f}'.format(rmse_test))

Test set RMSE of rf: 3.86
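
As a cross-check of the RMSE computation, the sketch below compares `MSE(...) ** (1/2)` with the formula written out by hand on a few hypothetical values. It also shows the sign convention behind `scoring="neg_mean_squared_error"`: sklearn scorers treat greater as better, so the MSE is negated. The helper `rmse_from_neg_mse` is a hypothetical name introduced here.

```python
import numpy as np
from sklearn.metrics import mean_squared_error as MSE

# Hypothetical targets and predictions
y_true = np.array([3.0, 5.0, 2.0, 7.0])
y_pred = np.array([2.0, 5.0, 4.0, 8.0])

# Squared errors are 1, 0, 4, 1, so the MSE is 6 / 4 = 1.5
mse = MSE(y_true, y_pred)
rmse = mse ** (1 / 2)

# Same quantity from the definition: square root of the mean squared error
rmse_by_hand = np.sqrt(np.mean((y_true - y_pred) ** 2))


def rmse_from_neg_mse(score):
    """Convert a negated MSE score (greater is better) to an RMSE-scale value."""
    return (-score) ** 0.5


print(rmse, rmse_by_hand, rmse_from_neg_mse(-mse))
```

The same conversion can be applied to `grid_rf.best_score_` to express the CV result on the scale of the target, keeping in mind that it is the square root of a fold-averaged MSE rather than an average of fold RMSEs.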


## Tuning hyperparameters (Bike Sharing Demand)

In [21]:
bikes = pd.read_csv('bikes.zip')
X = bikes.drop(columns=['cnt'])
y = bikes['cnt']

In [22]:
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=1)

In [23]:
# Define the dictionary 'params_rf'
params_rf = {
    "n_estimators": [100, 350, 500],
    "max_features": ["log2", 1.0, "sqrt"],
    "min_samples_leaf": [2, 10, 30],
}

In [24]:
rf = RandomForestRegressor(n_jobs=-1, random_state=2)

In [25]:
# Import GridSearchCV
from sklearn.model_selection import GridSearchCV

# Instantiate grid_rf
grid_rf = GridSearchCV(estimator=rf,
                       param_grid=params_rf,
                       scoring='neg_mean_squared_error',
                       cv=3,
                       verbose=1,
                       n_jobs=-1)

grid_rf.fit(X_train, y_train)

Fitting 3 folds for each of 27 candidates, totalling 81 fits


In [26]:
# Import mean_squared_error from sklearn.metrics as MSE 
from sklearn.metrics import mean_squared_error as MSE
# Extract the best estimator
best_model = grid_rf.best_estimator_
# Predict test set labels
y_pred = best_model.predict(X_test)
# Compute rmse_test
rmse_test = MSE(y_test, y_pred) ** (1/2)
# Print rmse_test
print('Test RMSE of best model: {:.3f}'.format(rmse_test)) 

Test RMSE of best model: 51.755
