# Grid Search with Scikit-Learn
Steps:
1. Choose an algorithm whose hyperparameters we will tune (sometimes called an 'estimator').
2. Define which hyperparameters we will tune.
3. Define a range of values for each hyperparameter.
4. Set a cross-validation scheme.
5. Define a score function so we can decide which square on our grid was 'the best'.
6. Include extra useful information or functions.

In [1]:
import pandas as pd
from sklearn.model_selection import train_test_split

# Load data
credits = pd.read_csv('datasets/credit-card-full.csv')
credits.head()

Unnamed: 0,ID,LIMIT_BAL,SEX,EDUCATION,MARRIAGE,AGE,PAY_0,PAY_2,PAY_3,PAY_4,PAY_5,PAY_6,BILL_AMT1,BILL_AMT2,BILL_AMT3,BILL_AMT4,BILL_AMT5,BILL_AMT6,PAY_AMT1,PAY_AMT2,PAY_AMT3,PAY_AMT4,PAY_AMT5,PAY_AMT6,default payment next month
0,1,20000,2,2,1,24,2,2,-1,-1,-2,-2,3913,3102,689,0,0,0,0,689,0,0,0,0,1
1,2,120000,2,2,2,26,-1,2,0,0,0,2,2682,1725,2682,3272,3455,3261,0,1000,1000,1000,0,2000,1
2,3,90000,2,2,2,34,0,0,0,0,0,0,29239,14027,13559,14331,14948,15549,1518,1500,1000,1000,1000,5000,0
3,4,50000,2,2,1,37,0,0,0,0,0,0,46990,48233,49291,28314,28959,29547,2000,2019,1200,1100,1069,1000,0
4,5,50000,1,2,1,57,-1,0,-1,0,0,0,8617,5670,35835,20940,19146,19131,2000,36681,10000,9000,689,679,0


In [2]:
# Features and labels
X = credits.drop(['ID','default payment next month'], axis=1)
y = credits['default payment next month']

# Split into train and test
X_train, X_test, y_train, y_test = train_test_split(X,y,
                                                   test_size=0.3,
                                                   random_state=42)

In [3]:
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import GridSearchCV

# Create a Random Forest Classifier with specified criterion
rf_class = RandomForestClassifier(criterion='entropy')

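# Create the parameter grid
# Note: 'auto' (equivalent to 'sqrt' for classifiers) was removed in scikit-learn 1.3;
# on newer versions, replace it with another valid value such as 'log2'
param_grid = {'max_depth': [2,4,8,15], 'max_features': ['auto','sqrt']}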

# Create a GridSearchCV object
grid_rf_class = GridSearchCV(
    estimator=rf_class,
    param_grid=param_grid,
    scoring='roc_auc',
    n_jobs=4,
    cv=5,
    refit=True, return_train_score=True)
print(grid_rf_class)

GridSearchCV(cv=5, estimator=RandomForestClassifier(criterion='entropy'),
             n_jobs=4,
             param_grid={'max_depth': [2, 4, 8, 15],
                         'max_features': ['auto', 'sqrt']},
             return_train_score=True, scoring='roc_auc')


# Analyzing the output

The GridSearchCV properties fall into three groups:
- A results log
    - `cv_results_`

- The best results
    - `best_index_`, `best_params_` & `best_score_`

- Extra information
    - `scorer_`, `n_splits_` & `refit_time_`

In [4]:
# Fit the train data
grid_rf_class.fit(X_train, y_train)

In [5]:
# The .cv_results_ property
cv_results_df = pd.DataFrame(grid_rf_class.cv_results_)

print(cv_results_df.shape)

(8, 22)


- The 8 rows correspond to the 8 squares in our grid, i.e. the 8 models we ran (4 values of max_depth × 2 values of max_features).
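
As a quick check on where the 8 comes from, scikit-learn's `ParameterGrid` helper can enumerate the combinations without fitting anything. This is only a sketch: the grid is copied from the `GridSearchCV` cell, and the variable names `combinations`, `n_candidates` and `n_fits` are illustrative.

```python
from sklearn.model_selection import ParameterGrid

# Same grid as in the GridSearchCV cell
param_grid = {'max_depth': [2, 4, 8, 15], 'max_features': ['auto', 'sqrt']}

# ParameterGrid expands the dictionary into every combination of values
# (it only enumerates; it does not validate the values or fit any model)
combinations = list(ParameterGrid(param_grid))
n_candidates = len(combinations)   # 4 depths x 2 feature settings
n_fits = n_candidates * 5          # each candidate is fit once per fold (cv=5)

print(n_candidates, n_fits)
for params in combinations:
    print(params)
```

With `refit=True` there is one additional fit at the end, when the best candidate is retrained on all of the data passed to `fit`.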

In [7]:
# cv_results_df
cv_results_df

Unnamed: 0,mean_fit_time,std_fit_time,mean_score_time,std_score_time,param_max_depth,param_max_features,params,split0_test_score,split1_test_score,split2_test_score,split3_test_score,split4_test_score,mean_test_score,std_test_score,rank_test_score,split0_train_score,split1_train_score,split2_train_score,split3_train_score,split4_train_score,mean_train_score,std_train_score
0,1.711629,0.015337,0.06954,0.020753,2,auto,"{'max_depth': 2, 'max_features': 'auto'}",0.781222,0.772902,0.781252,0.758474,0.754107,0.769591,0.011363,7,0.768849,0.771232,0.77051,0.776376,0.772834,0.77196,0.002552
1,1.682767,0.026882,0.048465,0.024721,2,sqrt,"{'max_depth': 2, 'max_features': 'sqrt'}",0.778152,0.769362,0.7797,0.757376,0.75633,0.768184,0.009906,8,0.767061,0.769399,0.768663,0.775294,0.775033,0.77109,0.003412
2,2.982903,0.071135,0.073356,0.020548,4,auto,"{'max_depth': 4, 'max_features': 'auto'}",0.784801,0.779501,0.78804,0.76097,0.760155,0.774693,0.011858,5,0.778375,0.780127,0.779274,0.784719,0.783317,0.781162,0.002437
3,2.936743,0.050936,0.082027,0.001532,4,sqrt,"{'max_depth': 4, 'max_features': 'sqrt'}",0.782884,0.778736,0.786301,0.76166,0.761372,0.774191,0.010623,6,0.777663,0.779735,0.778793,0.784733,0.784002,0.780985,0.002848
4,5.291219,0.073306,0.091318,0.000186,8,auto,"{'max_depth': 8, 'max_features': 'auto'}",0.793355,0.787294,0.792168,0.768211,0.768369,0.781879,0.01128,1,0.829105,0.830581,0.829874,0.833285,0.833743,0.831318,0.001859
5,5.25994,0.06817,0.081745,0.020399,8,sqrt,"{'max_depth': 8, 'max_features': 'sqrt'}",0.79261,0.785116,0.792305,0.767554,0.769517,0.78142,0.010875,2,0.8308,0.829858,0.828923,0.835348,0.831865,0.831359,0.002221
6,8.408438,0.127746,0.128887,0.03473,15,auto,"{'max_depth': 15, 'max_features': 'auto'}",0.791554,0.784119,0.784921,0.770116,0.767488,0.77964,0.009255,3,0.974398,0.97201,0.974482,0.974509,0.972545,0.973589,0.001085
7,7.562271,0.975023,0.101132,0.037497,15,sqrt,"{'max_depth': 15, 'max_features': 'sqrt'}",0.79055,0.783959,0.786621,0.766272,0.770149,0.77951,0.009541,4,0.973887,0.972101,0.973588,0.973806,0.973431,0.973363,0.000651


- **The .cv_results_ 'time' columns:**
    - The 'time' columns record how long it took, in seconds, to fit and score each model. Because we used 5-fold cross-validation, each model was fit and scored 5 times; the mean and standard deviation of those times are stored.

- **The .cv_results_ 'param_' columns:**
    - The param_ columns contain information on the different parameters that were used in the model. Remember, each row in this DataFrame is about one model. 

- **The .cv_results_ 'params' column:**
    - The params column is a dictionary of all the parameters from the previous 'param_' columns.

- **The .cv_results_ 'test_score' columns:**
    - The test scores for each of the 5 cross-validation folds (splits), followed by the mean and standard deviation across those folds.

- **The .cv_results_ 'rank_test_score' column:**
    - The rank column conveniently ranks the rows by the mean_test_score, with 1 being the best.

- **The .cv_results_ 'train_score' columns:**
    - The same set of columns is repeated for the training scores. These appear only because we set return_train_score to True. There is no ranking column for the training scores, as we only care about performance on the held-out (test) portion of each fold.
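
One use for the training-score columns is to compare them against the test scores: a large gap suggests the model is overfitting. The sketch below does this with the mean scores copied by hand from the `cv_results_df` output, so it runs without refitting. In the notebook itself the same calculation could be applied to `cv_results_df` directly; `train_test_gap` and `gap_by_depth` are illustrative names.

```python
import pandas as pd

# Mean scores copied from the cv_results_df output (values as displayed there)
scores = pd.DataFrame({
    'param_max_depth': [2, 2, 4, 4, 8, 8, 15, 15],
    'param_max_features': ['auto', 'sqrt'] * 4,
    'mean_test_score': [0.769591, 0.768184, 0.774693, 0.774191,
                        0.781879, 0.781420, 0.779640, 0.779510],
    'mean_train_score': [0.771960, 0.771090, 0.781162, 0.780985,
                         0.831318, 0.831359, 0.973589, 0.973363],
})

# Difference between training and cross-validated test ROC-AUC
scores['train_test_gap'] = scores['mean_train_score'] - scores['mean_test_score']

# Average the gap over the two max_features settings for each depth
gap_by_depth = scores.groupby('param_max_depth')['train_test_gap'].mean()
print(gap_by_depth.round(3))
```

In these results the gap widens as max_depth increases, reaching roughly 0.19 at depth 15, while the test score itself peaks at depth 8.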

In [9]:
# Extracting best row
best_row = cv_results_df[cv_results_df['rank_test_score']==1]
best_row

Unnamed: 0,mean_fit_time,std_fit_time,mean_score_time,std_score_time,param_max_depth,param_max_features,params,split0_test_score,split1_test_score,split2_test_score,split3_test_score,split4_test_score,mean_test_score,std_test_score,rank_test_score,split0_train_score,split1_train_score,split2_train_score,split3_train_score,split4_train_score,mean_train_score,std_train_score
4,5.291219,0.073306,0.091318,0.000186,8,auto,"{'max_depth': 8, 'max_features': 'auto'}",0.793355,0.787294,0.792168,0.768211,0.768369,0.781879,0.01128,1,0.829105,0.830581,0.829874,0.833285,0.833743,0.831318,0.001859


In [12]:
# The best_estimator_ 
print(grid_rf_class.best_estimator_)

print()

# The best parameters from the param_grid
print(grid_rf_class.best_params_)

# The best score: the mean cross-validated test score of the best row
print(grid_rf_class.best_score_)

RandomForestClassifier(criterion='entropy', max_depth=8, max_features='auto')

{'max_depth': 8, 'max_features': 'auto'}
0.7818793901937214
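
The 'best' properties all point back at one row of the results log. A minimal sketch of that relationship follows, using a small synthetic dataset from `make_classification` as a stand-in (an assumption, not the credit-card data) and a reduced grid so that it runs quickly. The names `X_demo`, `y_demo` and `demo_grid` are hypothetical.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import GridSearchCV

# Synthetic stand-in data (assumption: not the credit-card dataset)
X_demo, y_demo = make_classification(n_samples=300, n_features=10, random_state=42)

demo_grid = GridSearchCV(
    estimator=RandomForestClassifier(n_estimators=20, random_state=42),
    param_grid={'max_depth': [2, 4, 8], 'max_features': ['sqrt', None]},
    scoring='roc_auc',
    cv=3,
    refit=True)
demo_grid.fit(X_demo, y_demo)

results = demo_grid.cv_results_
best_idx = demo_grid.best_index_

# best_index_ is the row position of the winning candidate in cv_results_
print(results['params'][best_idx] == demo_grid.best_params_)
print(np.isclose(results['mean_test_score'][best_idx], demo_grid.best_score_))
print(results['rank_test_score'][best_idx] == 1)
```

In other words, `best_params_` and `best_score_` are shortcuts for the `params` and `mean_test_score` entries of the row at `best_index_`.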


In [14]:
# The scorer function used
print(grid_rf_class.scorer_)

# The number of cv splits
print(grid_rf_class.n_splits_)

# The number of seconds used for refitting the best model on the whole training set
print(grid_rf_class.refit_time_)

make_scorer(roc_auc_score, needs_threshold=True)
5
3.3065145015716553


# Using the best result

In [16]:
from sklearn.metrics import confusion_matrix, roc_auc_score

# See what type of object the best_estimator_ property is
print(type(grid_rf_class.best_estimator_))

# Create an array of predictions directly using the best_estimator_ property
predictions = grid_rf_class.best_estimator_.predict(X_test)

# Take a look to confirm it worked, this should be an array of 1's and 0's
print(predictions[0:5])

# Now create a confusion matrix 
print("Confusion Matrix \n", confusion_matrix(y_test, predictions))

# Get the ROC-AUC score
predictions_proba = grid_rf_class.best_estimator_.predict_proba(X_test)[:,1]
print("ROC-AUC Score \n", roc_auc_score(y_test, predictions_proba))

<class 'sklearn.ensemble._forest.RandomForestClassifier'>
[0 0 0 0 0]
Confusion Matrix 
 [[6709  331]
 [1312  648]]
ROC-AUC Score 
 0.7735151176948053
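
When `refit=True`, the fitted `GridSearchCV` object also exposes `predict`, `predict_proba` and `score`, which delegate to `best_estimator_`, so going through the property is optional. The sketch below checks this on a small synthetic dataset (an assumption, used only so the example runs on its own). The names `X_tr`, `X_te`, `demo_grid` and so on are hypothetical.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import roc_auc_score
from sklearn.model_selection import GridSearchCV, train_test_split

# Synthetic stand-in data (assumption: not the credit-card dataset)
X_demo, y_demo = make_classification(n_samples=400, n_features=10, random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(X_demo, y_demo,
                                          test_size=0.3, random_state=0)

demo_grid = GridSearchCV(
    estimator=RandomForestClassifier(n_estimators=20, random_state=0),
    param_grid={'max_depth': [2, 4]},
    scoring='roc_auc',
    cv=3,
    refit=True)
demo_grid.fit(X_tr, y_tr)

# Probabilities via the search object and via best_estimator_ should match
via_grid = demo_grid.predict_proba(X_te)[:, 1]
via_best = demo_grid.best_estimator_.predict_proba(X_te)[:, 1]
same_predictions = np.allclose(via_grid, via_best)

# score() applies the scorer chosen in `scoring` (here ROC-AUC) to the refit model
auc_manual = roc_auc_score(y_te, via_best)
auc_via_score = demo_grid.score(X_te, y_te)

print(same_predictions, np.isclose(auc_manual, auc_via_score))
```

Using `best_estimator_` explicitly, as in the cell above, makes it clearer which model is producing the predictions; calling the search object directly is simply shorter.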
