# Grid search
>  This chapter introduces you to a popular automated hyperparameter tuning methodology called Grid Search. You will learn what it is, how it works and practice undertaking a Grid Search using Scikit Learn. You will then learn how to analyze the output of a Grid Search & gain practical experience doing this.

- toc: true 
- badges: true
- comments: true
- author: Lucas Nunes
- categories: [Datacamp]
- image: images/datacamp/___

> Note: This is a summary of the chapter 2 exercises of the course "Hyperparameter Tuning in Python" on DataCamp. <br>[Github repo](https://github.com/lnunesAI/Datacamp/) / [Course link](https://www.datacamp.com/tracks/machine-learning-scientist-with-python)

In [2]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
plt.rcParams['figure.figsize'] = (8, 8)

## Introducing Grid Search

### Build Grid Search functions

<div class=""><p>In data science it is a great idea to try building algorithms, models and processes 'from scratch' so you can really understand what is happening at a deeper level. Of course there are great packages and libraries for this work (and we will get to that very soon!) but building from scratch will give you a great edge in your data science work.</p>
<p>In this exercise, you will create a function to take in 2 hyperparameters, build models and return results. You will use this function in a future exercise.</p>
<p>You will have the <code>X_train</code>, <code>X_test</code>, <code>y_train</code> and <code>y_test</code> datasets available.</p></div>

In [3]:
from sklearn.model_selection import train_test_split
df = pd.read_csv('https://github.com/lnunesAI/Datacamp/raw/main/2-machine-learning-scientist-with-python/20-hyperparameter-tuning-in-python/datasets/credit-card-full.csv')
df = pd.get_dummies(df, columns=['SEX', 'EDUCATION', 'MARRIAGE'], drop_first=True)
X = df.drop(['ID', 'default payment next month'], axis=1)
y = df['default payment next month']
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, shuffle=True)

Instructions
<ul>
<li>Build a function that takes two parameters called <code>learn_rate</code> and <code>max_depth</code> for the learning rate and maximum depth.</li>
<li>Add capability in the function to build a GBM model and fit it to the data with the input hyperparameters.</li>
<li>Have the function return the results of that model and the chosen hyperparameters (<code>learn_rate</code> and <code>max_depth</code>).</li>
</ul>

In [None]:
# Imports needed by the function
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.metrics import accuracy_score

# Create the function
def gbm_grid_search(learn_rate, max_depth):

    # Create the model
    model = GradientBoostingClassifier(learning_rate=learn_rate, max_depth=max_depth)

    # Fit the model and use it to make predictions on the test set
    predictions = model.fit(X_train, y_train).predict(X_test)

    # Return the hyperparameters and score
    return [learn_rate, max_depth, accuracy_score(y_test, predictions)]

**You now have a function you can call to test different combinations of two hyperparameters for the GBM algorithm. In the next exercise we will use it to test some values and analyze the results.**

### Iteratively tune multiple hyperparameters

<div class=""><p>In this exercise, you will build on the function you previously created to take in 2 hyperparameters, build a model and return the results. You will now use that to loop through some values and then extend this function and loop with another hyperparameter.</p>
<p>The function <code>gbm_grid_search(learn_rate, max_depth)</code> is available in this exercise.</p>
<p>If you need to remind yourself of the function, you can run <code>print_func()</code>, which has been created for you.</p></div>

In [None]:
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.metrics import accuracy_score

Instructions 1/3
<li>Write a for-loop to test the values (0.01, 0.1, 0.5) for the <code>learning_rate</code> and (2, 4, 6) for the <code>max_depth</code> using the function you created <code>gbm_grid_search</code> and print the results.</li>

In [None]:
# Create the relevant lists
results_list = []
learn_rate_list = [0.01, 0.1, 0.5]
max_depth_list = [2, 4, 6]

# Create the for loop
for learn_rate in learn_rate_list:
    for max_depth in max_depth_list:
        results_list.append(gbm_grid_search(learn_rate,max_depth))

# Print the results
results_list

[[0.01, 2, 0.8176666666666667],
 [0.01, 4, 0.817],
 [0.01, 6, 0.8146666666666667],
 [0.1, 2, 0.8183333333333334],
 [0.1, 4, 0.8182222222222222],
 [0.1, 6, 0.8145555555555556],
 [0.5, 2, 0.8171111111111111],
 [0.5, 4, 0.801],
 [0.5, 6, 0.7867777777777778]]

Instructions 2/3
<li>Extend the <code>gbm_grid_search</code> function to include the hyperparameter <code>subsample</code>. Name this new function <code>gbm_grid_search_extended</code>.</li>

In [None]:
results_list = []

# Extend the function input
def gbm_grid_search_extended(learn_rate, max_depth, subsample):

    # Extend the model creation section
    model = GradientBoostingClassifier(learning_rate=learn_rate, max_depth=max_depth, subsample=subsample)

    predictions = model.fit(X_train, y_train).predict(X_test)

    # Extend the return part
    return [learn_rate, max_depth, subsample, accuracy_score(y_test, predictions)]

Instructions 3/3
<li>Extend your loop to call <code>gbm_grid_search_extended</code> (available in your console), then test the values [0.4, 0.6] for the <code>subsample</code> hyperparameter and print the results. <code>max_depth_list</code> &amp; <code>learn_rate_list</code> are available in your environment.</li>

In [None]:
results_list = []

# Create the new list to test
subsample_list = [0.4 , 0.6]

for learn_rate in learn_rate_list:
    for max_depth in max_depth_list:

        # Extend the for loop
        for subsample in subsample_list:

            # Extend the results to include the new hyperparameter
            results_list.append(gbm_grid_search_extended(learn_rate, max_depth, subsample))

# Print results
results_list

[[0.01, 2, 0.4, 0.816],
 [0.01, 2, 0.6, 0.8173333333333334],
 [0.01, 4, 0.4, 0.8156666666666667],
 [0.01, 4, 0.6, 0.8168888888888889],
 [0.01, 6, 0.4, 0.8143333333333334],
 [0.01, 6, 0.6, 0.8152222222222222],
 [0.1, 2, 0.4, 0.8181111111111111],
 [0.1, 2, 0.6, 0.8191111111111111],
 [0.1, 4, 0.4, 0.8186666666666667],
 [0.1, 4, 0.6, 0.8178888888888889],
 [0.1, 6, 0.4, 0.8118888888888889],
 [0.1, 6, 0.6, 0.8132222222222222],
 [0.5, 2, 0.4, 0.8127777777777778],
 [0.5, 2, 0.6, 0.8134444444444444],
 [0.5, 4, 0.4, 0.7926666666666666],
 [0.5, 4, 0.6, 0.7917777777777778],
 [0.5, 6, 0.4, 0.7808888888888889],
 [0.5, 6, 0.6, 0.7721111111111111]]

**You have effectively built your own grid search! You went from 2 to 3 hyperparameters and can see how you could extend that to even more values and hyperparameters. That was a lot of effort though. Be warned - we are now entering a world that can get very computationally expensive very fast!**
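The nested loops above need one extra level of indentation for every new hyperparameter. As a rough sketch of how the same idea generalizes, the block below uses `itertools.product` to walk every combination of an arbitrary grid. The synthetic dataset, the `demo_grid` values, the reduced `n_estimators=20` and the helper name `manual_grid_search` are all assumptions made so the example runs quickly on its own; they are not part of the course exercise.

```python
from itertools import product

from sklearn.datasets import make_classification
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split

# Small synthetic dataset (an assumption, used only so the sketch runs quickly)
X_demo, y_demo = make_classification(n_samples=300, n_features=8, random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(X_demo, y_demo, test_size=0.3, random_state=0)

# Hypothetical grid: keys are GradientBoostingClassifier hyperparameter names
demo_grid = {
    'learning_rate': [0.01, 0.1],
    'max_depth': [2, 4],
    'subsample': [0.4, 0.6],
}

def manual_grid_search(grid):
    """Fit one model per combination and return (params, accuracy) pairs."""
    names = list(grid)
    results = []
    # product(...) yields every combination, replacing the nested for-loops
    for values in product(*(grid[name] for name in names)):
        params = dict(zip(names, values))
        model = GradientBoostingClassifier(n_estimators=20, random_state=0, **params)
        predictions = model.fit(X_tr, y_tr).predict(X_te)
        results.append((params, accuracy_score(y_te, predictions)))
    return results

demo_results = manual_grid_search(demo_grid)

# Pick the combination with the highest test accuracy
best_params, best_accuracy = max(demo_results, key=lambda item: item[1])
print(len(demo_results), best_params, best_accuracy)
```

Adding a fourth hyperparameter now only means adding a key to `demo_grid`; the loop itself does not change.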

### How Many Models?

<div class=""><p>Adding more hyperparameters or values increases the number of models created, but the increase is not linear: it is proportional to how many values and hyperparameters you already have, because each new value is combined with every existing combination of the others.</p>
<p>How many models would be created when running a grid search over the following hyperparameters and values for a GBM algorithm?</p>
<ul>
<li>learning_rate = [0.001, 0.01, 0.05, 0.1, 0.2, 0.3, 0.5, 1, 2]</li>
<li>max_depth = [4,6,8,10,12,14,16,18, 20]</li>
<li>subsample = [0.4, 0.6, 0.7, 0.8, 0.9]</li>
<li>max_features = ['auto', 'sqrt', 'log2']</li>
</ul>
<p>These lists are in your console so you can utilize properties of them to help you!</p></div>

In [None]:
9*9*5*3

1215

<pre>
Possible Answers
26
9 of one model, 9 of another
1 large model
<b>1215</b>
</pre>

**For every value of one hyperparameter, we test EVERY value of EVERY other hyperparameter. So you correctly multiplied the number of values (the lengths of the lists).**
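The multiplication can also be checked programmatically. The sketch below counts the combinations in two ways: by multiplying the list lengths, and by letting Scikit Learn's `ParameterGrid` enumerate them. The name `count_grid` is an assumption for illustration, and the `folds = 5` figure is only an example of how cross-validation multiplies the number of fits further. `ParameterGrid` only enumerates combinations; it does not check them against any estimator.

```python
from math import prod

from sklearn.model_selection import ParameterGrid

# The grid from the question above
count_grid = {
    'learning_rate': [0.001, 0.01, 0.05, 0.1, 0.2, 0.3, 0.5, 1, 2],
    'max_depth': [4, 6, 8, 10, 12, 14, 16, 18, 20],
    'subsample': [0.4, 0.6, 0.7, 0.8, 0.9],
    'max_features': ['auto', 'sqrt', 'log2'],
}

# Multiply the lengths of the value lists
n_by_lengths = prod(len(values) for values in count_grid.values())

# Let ParameterGrid enumerate the combinations and count them
n_by_enumeration = len(ParameterGrid(count_grid))

# With k-fold cross-validation every combination is fitted once per fold
folds = 5  # illustrative assumption
n_cv_fits = n_by_lengths * folds

print(n_by_lengths, n_by_enumeration, n_cv_fits)  # 1215 1215 6075
```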

## Grid Search with Scikit Learn

### GridSearchCV inputs

<div class=""><p>Let's test your knowledge of <code>GridSearchCV</code> inputs by answering the question below.</p>
<p>Three <code>GridSearchCV</code> objects are available in the console, named <code>model_1</code>, <code>model_2</code>, <code>model_3</code>. Note that there is no data available to fit these models. Instead, you must answer by looking at how they are constructed.</p>
<p>Which of these <code>GridSearchCV</code> objects would not work when we try to fit it?</p></div>

Model 1:
 GridSearchCV(
    estimator = RandomForestClassifier(),
    param_grid = {'max_depth': [2, 4, 8, 15], 'max_features': ['auto', 'sqrt']},
    scoring='roc_auc',
    n_jobs=4,
    cv=5,
    refit=True, return_train_score=True) 


Model 2:
 GridSearchCV(
    estimator = KNeighborsClassifier(),
    param_grid = {'n_neighbors': [5, 10, 20], 'algorithm': ['ball_tree', 'brute']},
    scoring='accuracy',
    n_jobs=8,
    cv=10,
    refit=False) 


Model 3:
 GridSearchCV(
    estimator = GradientBoostingClassifier(),
    param_grid = {'number_attempts': [2, 4, 6], 'max_depth': [3, 6, 9, 12]},
    scoring='accuracy',
    n_jobs=2,
    cv=7,
    refit=True) 

<pre>
Possible Answers
model_1 would not work when we try to fit it.
model_2 would not work when we try to fit it.
<b>model_3 would not work when we try to fit it.</b>
None - they will all work when we try to fit them.
</pre>

**By looking at the Scikit Learn documentation (or your excellent memory!) you know that number_attempts is not a valid hyperparameter. This GridSearchCV will not fit to our data.**
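If you would rather not rely on memory, one way to spot this kind of mistake before fitting is to compare the grid keys with the names returned by the estimator's `get_params()` method. This is only a sketch: the helper `invalid_grid_keys` is a hypothetical name introduced here, not something provided by Scikit Learn or the course.

```python
from sklearn.ensemble import GradientBoostingClassifier

def invalid_grid_keys(estimator, param_grid):
    """Return the grid keys that are not hyperparameters of the estimator."""
    valid_names = estimator.get_params().keys()
    return [name for name in param_grid if name not in valid_names]

# The parameter grid used by model_3 above
model_3_grid = {'number_attempts': [2, 4, 6], 'max_depth': [3, 6, 9, 12]}

bad_keys = invalid_grid_keys(GradientBoostingClassifier(), model_3_grid)
print(bad_keys)  # ['number_attempts']
```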

### GridSearchCV with Scikit Learn

<div class=""><p>The <code>GridSearchCV</code> class from Scikit Learn provides many useful features to assist with efficiently undertaking a grid search. You will now put your learning into practice by creating a <code>GridSearchCV</code> object with certain parameters.</p>
<p>The desired options are:</p>
<ul>
<li>A Random Forest Estimator, with the split criterion as 'entropy'</li>
<li>5-fold cross validation</li>
<li>The hyperparameters <code>max_depth</code> (2, 4, 8, 15) and <code>max_features</code> ('auto' vs 'sqrt')</li>
<li>Use <code>roc_auc</code> to score the models</li>
<li>Use 4 cores for processing in parallel</li>
<li>Ensure you refit the best model and return training scores</li>
</ul>
<p>You will have available <code>X_train</code>, <code>X_test</code>, <code>y_train</code> &amp; <code>y_test</code> datasets.</p></div>

In [None]:
from sklearn.model_selection import GridSearchCV
from sklearn.ensemble import RandomForestClassifier

Instructions
<ul>
<li>Create a Random Forest estimator as specified in the context above.</li>
<li>Create a parameter grid as specified in the context above.</li>
<li>Create a <code>GridSearchCV</code> object as outlined in the context above, using the two elements created in the previous two instructions.</li>
</ul>

In [None]:
# Create a Random Forest Classifier with specified criterion
rf_class = RandomForestClassifier(criterion='entropy')

# Create the parameter grid
param_grid = {'max_depth': [2, 4, 8, 15], 'max_features': ['auto', 'sqrt']} 

# Create a GridSearchCV object
grid_rf_class = GridSearchCV(
    estimator=rf_class,
    param_grid=param_grid,
    scoring='roc_auc',
    n_jobs=4,
    cv=5,
    refit=True, return_train_score=True)
print(grid_rf_class)

GridSearchCV(cv=5, error_score=nan,
             estimator=RandomForestClassifier(bootstrap=True, ccp_alpha=0.0,
                                              class_weight=None,
                                              criterion='entropy',
                                              max_depth=None,
                                              max_features='auto',
                                              max_leaf_nodes=None,
                                              max_samples=None,
                                              min_impurity_decrease=0.0,
                                              min_impurity_split=None,
                                              min_samples_leaf=1,
                                              min_samples_split=2,
                                              min_weight_fraction_leaf=0.0,
                                              n_estimators=100, n_jobs=None,
                                              oob_score=False,
  

**You now understand all the inputs to a GridSearchCV object and can tune many different hyperparameters and many different values for each on a chosen algorithm!**

## Understanding a grid search output

### Using the best outputs

<p>Which of the following parameters must be set in order to be able to directly use the <code>best_estimator_</code> property for predictions?</p>

<pre>
Possible Answers
return_train_score = True
<b>refit = True</b>
refit = False
verbose = 1
</pre>

**When we set this to True, fitting the grid search object automatically refits the best parameters on the whole training set and creates the best_estimator_ property.**
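To see the effect of `refit` directly, the sketch below fits two otherwise identical searches and checks which attributes exist afterwards. The synthetic data, the tiny grid and the helper `run_search` are assumptions chosen to keep the example fast; they are not taken from the course.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import GridSearchCV

# Small synthetic dataset (assumption, for illustration only)
X_small, y_small = make_classification(n_samples=200, n_features=6, random_state=0)
small_grid = {'max_depth': [2, 4]}

def run_search(refit):
    """Fit a small grid search with the given refit setting."""
    search = GridSearchCV(
        estimator=RandomForestClassifier(n_estimators=10, random_state=0),
        param_grid=small_grid,
        scoring='roc_auc',
        cv=3,
        refit=refit)
    return search.fit(X_small, y_small)

search_refit = run_search(refit=True)
search_no_refit = run_search(refit=False)

# best_estimator_ only exists when the best model was refit
print(hasattr(search_refit, 'best_estimator_'))     # True
print(hasattr(search_no_refit, 'best_estimator_'))  # False

# The winning hyperparameters are still reported without refitting
print(search_no_refit.best_params_)
```

With `refit=False` you would have to build and fit a new estimator from `best_params_` yourself before making predictions.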

### Exploring the grid search results

<div class=""><p>You will now explore the <code>cv_results_</code> property of the GridSearchCV object defined in the video. This is a dictionary that we can read into a pandas DataFrame and contains a lot of useful information about the grid search we just undertook. </p>
<p>A reminder of the different column types in this property:</p>
<ul>
<li><code>time_</code> columns</li>
<li><code>param_</code> columns (one for each hyperparameter) and <strong>the</strong> singular <code>params</code> column (with all hyperparameter settings)</li>
<li>a <code>train_score</code> column for each cv fold including the <code>mean_train_score</code> and <code>std_train_score</code> columns</li>
<li>a <code>test_score</code> column for each cv fold including the <code>mean_test_score</code> and <code>std_test_score</code> columns</li>
<li>a <code>rank_test_score</code> column with a number from 1 to n (number of iterations) ranking the rows based on their <code>mean_test_score</code></li>
</ul></div>

In [None]:
grid_rf_class.fit(X_train, y_train)

GridSearchCV(cv=5, error_score=nan,
             estimator=RandomForestClassifier(bootstrap=True, ccp_alpha=0.0,
                                              class_weight=None,
                                              criterion='entropy',
                                              max_depth=None,
                                              max_features='auto',
                                              max_leaf_nodes=None,
                                              max_samples=None,
                                              min_impurity_decrease=0.0,
                                              min_impurity_split=None,
                                              min_samples_leaf=1,
                                              min_samples_split=2,
                                              min_weight_fraction_leaf=0.0,
                                              n_estimators=100, n_jobs=None,
                                              oob_score=False,
  

Instructions
<ul>
<li>Read the <code>cv_results_</code> property of the <code>grid_rf_class</code> GridSearchCV object into a data frame &amp; print the whole thing out to inspect.</li>
<li>Extract &amp; print the <strong>singular</strong> column containing a dictionary of all hyperparameters used in each iteration of the grid search.</li>
<li>Extract &amp; print the row that had the best mean test score by indexing using the <code>rank_test_score</code> column.</li>
</ul>

In [None]:
# Read the cv_results property into a dataframe & print it out
cv_results_df = pd.DataFrame(grid_rf_class.cv_results_)
print(cv_results_df)

# Extract and print the column with a dictionary of hyperparameters used
column = cv_results_df.loc[:, ['params']]
print(column)

# Extract and print the row that had the best mean test score
best_row = cv_results_df[cv_results_df['rank_test_score'] == 1 ]
print(best_row)

   mean_fit_time  std_fit_time  ...  mean_train_score  std_train_score
0       3.458513      0.285226  ...          0.769632         0.001048
1       3.697389      0.172889  ...          0.769441         0.001575
2       6.008191      0.111924  ...          0.780005         0.001331
3       6.044000      0.153031  ...          0.780220         0.001659
4      11.024294      0.151925  ...          0.830187         0.001384
5      11.009805      0.118585  ...          0.830242         0.001592
6      18.288561      0.181061  ...          0.975750         0.001260
7      16.278501      2.493098  ...          0.974560         0.001332

[8 rows x 22 columns]
                                      params
0   {'max_depth': 2, 'max_features': 'auto'}
1   {'max_depth': 2, 'max_features': 'sqrt'}
2   {'max_depth': 4, 'max_features': 'auto'}
3   {'max_depth': 4, 'max_features': 'sqrt'}
4   {'max_depth': 8, 'max_features': 'auto'}
5   {'max_depth': 8, 'max_features': 'sqrt'}
6  {'max_depth': 15, 'm

**You have built invaluable skills in looking 'under the hood' at what your grid search is doing by extracting and analysing the cv_results_ property.**
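The column families listed in this exercise can also be pulled out by name rather than read off the truncated printout. The following sketch runs a small search on synthetic data and groups the `cv_results_` columns by their prefix or suffix. The dataset, grid values and variable names such as `demo_cv_df` are illustrative assumptions.

```python
import pandas as pd
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import GridSearchCV

# Small synthetic dataset (assumption, for illustration only)
X_cv, y_cv = make_classification(n_samples=200, n_features=6, random_state=0)

demo_search = GridSearchCV(
    estimator=RandomForestClassifier(n_estimators=10, random_state=0),
    param_grid={'max_depth': [2, 4], 'max_features': ['sqrt', 'log2']},
    scoring='roc_auc',
    cv=3,
    return_train_score=True)
demo_search.fit(X_cv, y_cv)

demo_cv_df = pd.DataFrame(demo_search.cv_results_)

# Group the columns into the families described above
time_cols = [c for c in demo_cv_df.columns if c.endswith('_time')]
param_cols = [c for c in demo_cv_df.columns if c.startswith('param_')]
train_cols = [c for c in demo_cv_df.columns if c.endswith('train_score')]
test_cols = [c for c in demo_cv_df.columns
             if c.endswith('test_score') and c != 'rank_test_score']

for label, cols in [('time', time_cols), ('param', param_cols),
                    ('train', train_cols), ('test', test_cols)]:
    print(label, cols)

# One row per hyperparameter combination, ordered from best to worst rank
ranked = demo_cv_df.sort_values('rank_test_score')
print(ranked[['params', 'mean_test_score', 'rank_test_score']])
```

Selecting a handful of columns like this avoids the `...` truncation that pandas applies when printing the full frame.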

### Analyzing the best results

<div class=""><p>At the end of the day, we primarily care about the best performing 'square' in a grid search. Luckily Scikit Learn's <code>GridSearchCV</code> objects have a number of properties that provide key information on just the best square (or row in <code>cv_results_</code>).</p>
<p>Three properties you will explore are:</p>
<ul>
<li><code>best_score_</code> – The score (here ROC_AUC) from the best-performing square.</li>
<li><code>best_index_</code> – The index of the row in <code>cv_results_</code> containing information on the best-performing square.</li>
<li><code>best_params_</code> – A dictionary of the parameters that gave the best score, for example <code>'max_depth': 10</code></li>
</ul>
<p>The grid search object <code>grid_rf_class</code> is available. </p>
<p>A dataframe (<code>cv_results_df</code>) has been created from the <code>cv_results_</code> for you on line 6. This will help you index into the results.</p></div>

Instructions
<ul>
<li>Extract and print out the ROC_AUC <strong>score</strong> from the <strong>best</strong> performing square in <code>grid_rf_class</code>.</li>
<li>Create a variable from the best-performing row by <strong>index</strong>ing into <code>cv_results_df</code>.</li>
<li>Create a variable, <code>best_n_estimators</code> by extracting the <code>n_estimators</code> parameter from the best-performing square in <code>grid_rf_class</code> and print it out.</li>
</ul>

In [None]:
# Print out the ROC_AUC score from the best-performing square
best_score = grid_rf_class.best_score_
print(best_score)

# Create a variable from the row related to the best-performing square
cv_results_df = pd.DataFrame(grid_rf_class.cv_results_)
best_row = cv_results_df.loc[[grid_rf_class.best_index_]]
print(best_row)

# Get the max_depth parameter from the best-performing square and print it
# (the course exercise extracts n_estimators; the grid here only tunes max_depth and max_features)
best_max_depth = grid_rf_class.best_params_["max_depth"]
print(best_max_depth)

0.780853884604309
   mean_fit_time  std_fit_time  ...  mean_train_score  std_train_score
4      11.024294      0.151925  ...          0.830187         0.001384

[1 rows x 22 columns]
8


**Being able to quickly find and prioritize the huge volume of information given back from machine learning modeling output is a great skill. Here you had great practice doing that with cv_results_ by quickly isolating the key information on the best performing square. This will be very important when your grids grow from 12 squares to many more!**
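The three properties used in this exercise are different views of the same row of `cv_results_`. As a small sanity check under assumed inputs (synthetic data, a two-value grid and the name `check_search` are all illustrative), the sketch below confirms that `best_index_` points at the row holding `best_score_` and `best_params_`.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import GridSearchCV

# Small synthetic dataset (assumption, for illustration only)
X_chk, y_chk = make_classification(n_samples=200, n_features=6, random_state=0)

check_search = GridSearchCV(
    estimator=RandomForestClassifier(n_estimators=10, random_state=0),
    param_grid={'max_depth': [2, 4, 8]},
    scoring='roc_auc',
    cv=3)
check_search.fit(X_chk, y_chk)

results = check_search.cv_results_
best_idx = check_search.best_index_

# best_score_ is the mean test score of the row at best_index_
score_matches = check_search.best_score_ == results['mean_test_score'][best_idx]

# best_params_ is the params dictionary of that same row
params_match = check_search.best_params_ == results['params'][best_idx]

# That row is also the one ranked first
rank_is_one = results['rank_test_score'][best_idx] == 1

print(score_matches, params_match, rank_is_one)
```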

### Using the best results

<div class=""><p>While it is interesting to analyze the results of our grid search, our final goal is practical in nature; we want to make predictions on our test set using our estimator object.</p>
<p>We can access this object through the <code>best_estimator_</code> property of our grid search object.</p>
<p>Let's take a look inside the <code>best_estimator_</code> property, make predictions, and generate evaluation scores. We will firstly use the default <code>predict</code> (giving class predictions), but then we will need to use <code>predict_proba</code> rather than <code>predict</code> to generate the roc-auc score as roc-auc needs probability scores for its calculation. We use a slice <code>[:,1]</code> to get probabilities of the positive class.</p>
<p>You have available the <code>X_test</code> and <code>y_test</code> datasets to use and the <code>grid_rf_class</code> object from previous exercises.</p></div>

In [None]:
from sklearn.metrics import confusion_matrix, roc_auc_score

Instructions
<ul>
<li>Check the type of the <code>best_estimator_</code> property.</li>
<li>Use the <code>best_estimator_</code> property to make predictions on our test set.</li>
<li>Generate a confusion matrix and ROC_AUC score from our predictions.</li>
</ul>

In [None]:
# See what type of object the best_estimator_ property is
print(type(grid_rf_class.best_estimator_))

# Create an array of predictions directly using the best_estimator_ property
predictions = grid_rf_class.best_estimator_.predict(X_test)

# Take a look to confirm it worked, this should be an array of 1's and 0's
print(predictions[0:5])

# Now create a confusion matrix 
print("Confusion Matrix \n", confusion_matrix(y_test, predictions))

# Get the ROC-AUC score
predictions_proba = grid_rf_class.best_estimator_.predict_proba(X_test)[:,1]
print("ROC-AUC Score \n", roc_auc_score(y_test, predictions_proba))

<class 'sklearn.ensemble._forest.RandomForestClassifier'>
[0 0 0 0 0]
Confusion Matrix 
 [[6664  315]
 [1343  678]]
ROC-AUC Score 
 0.7763614587311805


**The .best_estimator_ property is a really powerful property to understand for streamlining your machine learning model building process. You now can run a grid search and seamlessly use the best model from that search to make predictions.**
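To make the difference between `predict` and `predict_proba` concrete, here is a minimal sketch on synthetic data. It uses a plain `RandomForestClassifier` in place of `best_estimator_`, and the dataset and variable names are assumptions for illustration. The columns of `predict_proba` follow the order of `classes_`, which is why `[:, 1]` selects the positive class in a 0/1 problem.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import roc_auc_score
from sklearn.model_selection import train_test_split

# Small synthetic binary dataset (assumption, for illustration only)
X_roc, y_roc = make_classification(n_samples=400, n_features=8, random_state=0)
X_tr_roc, X_te_roc, y_tr_roc, y_te_roc = train_test_split(
    X_roc, y_roc, test_size=0.3, random_state=0)

clf = RandomForestClassifier(n_estimators=25, random_state=0).fit(X_tr_roc, y_tr_roc)

# predict returns hard class labels (0 or 1)
hard_labels = clf.predict(X_te_roc)

# predict_proba returns one column per class, ordered as in clf.classes_
proba = clf.predict_proba(X_te_roc)
positive_proba = proba[:, 1]

print(clf.classes_)
print(hard_labels[:5], positive_proba[:5])

# ROC-AUC ranks observations, so it should be given the probabilities;
# hard labels collapse the ranking to a single threshold
auc_from_proba = roc_auc_score(y_te_roc, positive_proba)
auc_from_labels = roc_auc_score(y_te_roc, hard_labels)
print(auc_from_proba, auc_from_labels)
```

Passing hard labels to `roc_auc_score` still runs, but the result reflects only one operating point rather than the full ranking of predictions.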