# Model Tuning

## Tuning a CART's Hyperparameters


**Grid search cross-validation**

- Manually set a grid of discrete hyperparameter values.
- Set a metric for scoring model performance.
- Search exhaustively through the grid.
- For each set of hyperparameters, evaluate the corresponding model's CV score.
- The optimal hyperparameters are those of the model achieving the best CV score.
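
The steps above can be sketched as an explicit loop. This is only an illustration of the idea, not the notebook's method: it runs on synthetic data from `make_classification`, and the names `X_demo`, `y_demo`, and `demo_grid` are assumptions introduced for the example.

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import ParameterGrid, cross_val_score
from sklearn.tree import DecisionTreeClassifier

# Synthetic stand-in data (an assumption; not the dataset used in this notebook)
X_demo, y_demo = make_classification(n_samples=200, n_features=8, random_state=1)

# Step 1: manually set a grid of discrete hyperparameter values
demo_grid = {'max_depth': [2, 3, 4], 'min_samples_leaf': [0.05, 0.1]}

best_score, best_params = -1.0, None

# Step 3: search exhaustively through every combination in the grid
for params in ParameterGrid(demo_grid):
    model = DecisionTreeClassifier(random_state=1, **params)
    # Steps 2 and 4: score this candidate with 5-fold CV accuracy
    cv_score = cross_val_score(model, X_demo, y_demo,
                               cv=5, scoring='accuracy').mean()
    # Step 5: keep the hyperparameters with the best mean CV score
    if cv_score > best_score:
        best_score, best_params = cv_score, params

print(best_params, round(best_score, 3))
```

`GridSearchCV`, used later in this section, wraps the same loop and can additionally refit the best candidate on the full training set.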

In [1]:
import pandas as pd
breast_cancer = pd.read_csv('datasets/wbc.csv')
breast_cancer.head()

Unnamed: 0,id,diagnosis,radius_mean,texture_mean,perimeter_mean,area_mean,smoothness_mean,compactness_mean,concavity_mean,concave points_mean,symmetry_mean,fractal_dimension_mean,radius_se,texture_se,perimeter_se,area_se,smoothness_se,compactness_se,concavity_se,concave points_se,symmetry_se,fractal_dimension_se,radius_worst,texture_worst,perimeter_worst,area_worst,smoothness_worst,compactness_worst,concavity_worst,concave points_worst,symmetry_worst,fractal_dimension_worst,Unnamed: 32
0,842302,M,17.99,10.38,122.8,1001.0,0.1184,0.2776,0.3001,0.1471,0.2419,0.07871,1.095,0.9053,8.589,153.4,0.006399,0.04904,0.05373,0.01587,0.03003,0.006193,25.38,17.33,184.6,2019.0,0.1622,0.6656,0.7119,0.2654,0.4601,0.1189,
1,842517,M,20.57,17.77,132.9,1326.0,0.08474,0.07864,0.0869,0.07017,0.1812,0.05667,0.5435,0.7339,3.398,74.08,0.005225,0.01308,0.0186,0.0134,0.01389,0.003532,24.99,23.41,158.8,1956.0,0.1238,0.1866,0.2416,0.186,0.275,0.08902,
2,84300903,M,19.69,21.25,130.0,1203.0,0.1096,0.1599,0.1974,0.1279,0.2069,0.05999,0.7456,0.7869,4.585,94.03,0.00615,0.04006,0.03832,0.02058,0.0225,0.004571,23.57,25.53,152.5,1709.0,0.1444,0.4245,0.4504,0.243,0.3613,0.08758,
3,84348301,M,11.42,20.38,77.58,386.1,0.1425,0.2839,0.2414,0.1052,0.2597,0.09744,0.4956,1.156,3.445,27.23,0.00911,0.07458,0.05661,0.01867,0.05963,0.009208,14.91,26.5,98.87,567.7,0.2098,0.8663,0.6869,0.2575,0.6638,0.173,
4,84358402,M,20.29,14.34,135.1,1297.0,0.1003,0.1328,0.198,0.1043,0.1809,0.05883,0.7572,0.7813,5.438,94.44,0.01149,0.02461,0.05688,0.01885,0.01756,0.005115,22.54,16.67,152.2,1575.0,0.1374,0.205,0.4,0.1625,0.2364,0.07678,


In [2]:
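# Features: drop the target and the empty 'Unnamed: 32' column.
# Note: 'Unnamed: 0' and 'id' stay in X even though they are identifiers, not measurements.
X = breast_cancer.drop(['diagnosis','Unnamed: 32'],axis=1)
y = breast_cancer['diagnosis']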

In [3]:
# Import DecisionTreeClassifier 
from sklearn.tree import DecisionTreeClassifier
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split

# Set seed to 1 for reproducibility
SEED = 1

# Split into 80% train, 20% test
X_train, X_test, y_train, y_test = train_test_split(X,y,
                                                   test_size=0.2,
                                                   random_state=SEED)

In [4]:
# Instantiate a DecisionTreeClassifier 'dt'
dt = DecisionTreeClassifier(random_state=SEED)

# Fit dt
dt.fit(X_train,y_train)

# Predict the labels of X_test with dt
y_pred = dt.predict(X_test)

In [5]:
# Evaluate accuracy_score
accuracy_dt = accuracy_score(y_test,y_pred)

# Print accuracy without hyperparameter tuning
print("The accuracy score without hyperparameter tuning on Decision Tree Classifier: {:.3f}".format(accuracy_dt))

The accuracy score without hyperparameter tuning on Decision Tree Classifier: 0.939


In [6]:
# Import GridSearchCV
from sklearn.model_selection import GridSearchCV

# Define the grid of hyperparameters 'params_dt'
params_dt = {
    'max_depth':[3,4,5,6],
    'min_samples_leaf':[0.04,0.06,0.08],
    'max_features':[0.2,0.4,0.6,0.8]
}

# Instantiate a 10-fold CV grid search object 'grid_dt'
grid_dt = GridSearchCV(estimator=dt,
                      param_grid=params_dt,
                      scoring='accuracy',
                      cv=10,
                      n_jobs=-1)
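
To get a feel for the cost of an exhaustive search, the grid size can be worked out before fitting. The sketch below uses only the standard library and the `params_dt` values defined above; the names `candidates`, `n_folds`, and `n_fits` are illustrative.

```python
from itertools import product

params_dt = {
    'max_depth': [3, 4, 5, 6],
    'min_samples_leaf': [0.04, 0.06, 0.08],
    'max_features': [0.2, 0.4, 0.6, 0.8]
}

# Every combination of one value per hyperparameter is one candidate model
candidates = [dict(zip(params_dt, values))
              for values in product(*params_dt.values())]

n_folds = 10                          # matches cv=10 above
n_candidates = len(candidates)        # 4 * 3 * 4
n_fits = n_candidates * n_folds       # one fit per candidate per fold

print(n_candidates, n_fits)  # 48 480
```

The count grows multiplicatively with each added hyperparameter, which is why the grid is kept small and `n_jobs=-1` is used to parallelize the fits.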

In [7]:
# Fit 'grid_dt' to the training data
grid_dt.fit(X_train, y_train)

# Extract best hyperparameters from 'grid_dt'
best_hyperparams = grid_dt.best_params_
print('Best hyperparameters: \n', best_hyperparams)

Best hyperparameters: 
 {'max_depth': 3, 'max_features': 0.6, 'min_samples_leaf': 0.04}


In [8]:
# Extract best CV score from 'grid_dt'
best_CV_score = grid_dt.best_score_
print('Best CV accuracy: {}'.format(best_CV_score))

Best CV accuracy: 0.9408212560386474


In [9]:
# Extract best model from 'grid_dt'
best_model = grid_dt.best_estimator_
print(best_model)

DecisionTreeClassifier(max_depth=3, max_features=0.6, min_samples_leaf=0.04,
                       random_state=1)


In [10]:
# Evaluate test set accuracy for the tuned, best model
test_acc = best_model.score(X_test,y_test)

# Print test set accuracy for the tuned, best model
print("The accuracy score with Hyperparameter Tuning for Decision Tree Classifier: {:.3f}".format(test_acc))

The accuracy score with Hyperparameter Tuning for Decision Tree Classifier: 0.956


Hyperparameter tuning improved test set accuracy: the tuned tree reaches 95.6%, compared with 93.9% for the untuned tree.
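
The tuned model evaluated above is `best_estimator_`, which `GridSearchCV` refits on the whole training set when `refit=True` (the default). As a small sketch on synthetic data (the names `grid_demo`, `X_tr`, `X_te`, and so on are assumptions made for this example), scoring through the fitted search object and scoring through its best estimator should agree:

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import GridSearchCV, train_test_split
from sklearn.tree import DecisionTreeClassifier

# Synthetic stand-in data (an assumption; not the notebook's dataset)
X_demo, y_demo = make_classification(n_samples=300, n_features=8, random_state=1)
X_tr, X_te, y_tr, y_te = train_test_split(X_demo, y_demo,
                                          test_size=0.2, random_state=1)

grid_demo = GridSearchCV(estimator=DecisionTreeClassifier(random_state=1),
                         param_grid={'max_depth': [2, 3, 4]},
                         scoring='accuracy',
                         cv=5)  # refit=True is the default
grid_demo.fit(X_tr, y_tr)

# Both calls score the model refit on all of X_tr with the best hyperparameters
acc_via_grid = grid_demo.score(X_te, y_te)
acc_via_best = grid_demo.best_estimator_.score(X_te, y_te)

print(acc_via_grid == acc_via_best)
```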

## Tuning a RF's Hyperparameters

In [11]:
auto = pd.read_csv('datasets/auto.csv')
auto.head()

Unnamed: 0,mpg,displ,hp,weight,accel,origin,size
0,18.0,250.0,88,3139,14.5,US,15.0
1,9.0,304.0,193,4732,18.5,US,20.0
2,36.1,91.0,60,1800,16.4,Asia,10.0
3,18.5,250.0,98,3525,19.0,US,15.0
4,34.3,97.0,78,2188,15.8,Europe,10.0


In [12]:
# Features: drop the target 'mpg' and the categorical 'origin' column
X = auto.drop(['mpg','origin'],axis=1)
y = auto['mpg']

In [13]:
# Import RandomForestRegressor
from sklearn.ensemble import RandomForestRegressor
from sklearn.metrics import mean_squared_error as MSE
from sklearn.model_selection import train_test_split

# Set seed to 1 for reproducibility
SEED = 1

# Split into 80% train, 20% test
X_train, X_test, y_train, y_test = train_test_split(X,y,
                                                   test_size=0.2,
                                                   random_state=SEED)

In [14]:
# Instantiate a random forest regressor 'rf'
rf = RandomForestRegressor(random_state = SEED)

# Fit rf to train data
rf.fit(X_train,y_train)

# Predict on test data using rf
y_pred = rf.predict(X_test)

In [16]:
#Evaluate RMSE for rf 
rmse_without_hpt = MSE(y_test,y_pred)**(1/2)

# Print RMSE without hyperparameter tuning
print("RMSE without hyperparameter tuning: {:.2f}".format(rmse_without_hpt))

RMSE without hyperparameter tuning: 4.02


In [17]:
# Import GridSearchCV
from sklearn.model_selection import GridSearchCV

# Define a grid of hyperparameters 'params_rf'
params_rf = {
    'n_estimators':[300,400,500],
    'max_depth':[4,6,8],
    'min_samples_leaf':[0.1,0.2],
    'max_features':['log2','sqrt']
}

# Instantiate 'grid_rf' to perform 3-fold cross-validation
grid_rf = GridSearchCV(estimator=rf,
                      param_grid=params_rf,
                      cv=3,
                      scoring='neg_mean_squared_error',
                      verbose=1,  # the higher the value, the more messages are printed during fitting
                      n_jobs=-1)
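
A note on `scoring='neg_mean_squared_error'`: scikit-learn always picks the candidate with the highest score, so error metrics are negated. The stored CV scores are therefore negative, and a sign flip plus a square root is needed to read them as RMSE. The sketch below uses made-up scores in a hypothetical dictionary `cv_scores_demo`; it is not output from this notebook.

```python
import math

# Hypothetical mean CV scores (negated MSE) for three candidates
cv_scores_demo = {'candidate_a': -20.25,
                  'candidate_b': -16.0,
                  'candidate_c': -25.0}

# The highest negated MSE corresponds to the lowest MSE
best_name = max(cv_scores_demo, key=cv_scores_demo.get)
best_neg_mse = cv_scores_demo[best_name]

# Flip the sign to recover MSE, then take the square root for RMSE
best_cv_rmse = math.sqrt(-best_neg_mse)

print(best_name, best_cv_rmse)  # candidate_b 4.0
```

Applied to the fitted search object, the same conversion would be `(-grid_rf.best_score_) ** 0.5`.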

In [18]:
# Fit 'grid_rf' to the training set
grid_rf.fit(X_train, y_train)

Fitting 3 folds for each of 36 candidates, totalling 108 fits


In [19]:
# Extract the best hyperparameters from 'grid_rf'
best_hyperparams = grid_rf.best_params_
print('Best hyperparameters: \n', best_hyperparams)

Best hyperparameters: 
 {'max_depth': 4, 'max_features': 'log2', 'min_samples_leaf': 0.1, 'n_estimators': 300}


In [20]:
# Extract the best model from 'grid_rf'
best_model = grid_rf.best_estimator_
best_model

In [21]:
# Predict the test set target values using the best model
y_pred = best_model.predict(X_test)

# Evaluate the test set RMSE
rmse_test = MSE(y_test, y_pred)**(1/2)

# Print the test set RMSE with hyperparameter tuning
print("RMSE of rf with hyperparameter tuning: {:.2f}".format(rmse_test))

RMSE of rf with hyperparameter tuning: 3.92


Lower RMSE values indicate a better fit, since RMSE measures how far the model's predictions typically fall from the observed response. The tuned random forest reaches a test RMSE of 3.92, compared with 4.02 for the random forest without hyperparameter tuning.
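
For reference, RMSE is the square root of the mean squared difference between observed and predicted values, which is what `MSE(y_test, y_pred)**(1/2)` computes above. A minimal standard-library sketch follows, using made-up numbers rather than values from the auto dataset; the helper name `rmse` is introduced here for illustration.

```python
import math

def rmse(y_true, y_pred):
    """Root mean squared error of two equal-length sequences."""
    squared_errors = [(t - p) ** 2 for t, p in zip(y_true, y_pred)]
    return math.sqrt(sum(squared_errors) / len(squared_errors))

# Made-up observed and predicted values (errors of 1, -1, 7, -7)
y_true_demo = [20.0, 30.0, 25.0, 15.0]
y_pred_demo = [19.0, 31.0, 18.0, 22.0]

print(rmse(y_true_demo, y_pred_demo))  # 5.0
```

Because the errors are squared before averaging, the two large misses dominate the result, so RMSE penalizes occasional large errors more heavily than many small ones.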