# Model Tuning
>  The hyperparameters of a machine learning model are parameters that are not learned from data. They should be set prior to fitting the model to the training set. In this chapter, you'll learn how to tune the hyperparameters of a tree-based model using grid search cross validation.

- toc: true 
- badges: true
- comments: true
- author: Lucas Nunes
- categories: [Datacamp, Ensemble Learning]
- image: images/datacamp/___

> Note: This is a summary of the chapter 5 exercises from the course "Machine Learning with Tree-Based Models in Python" at DataCamp. <br>[Github repo](https://github.com/lnunesAI/Datacamp/) / [Course link](https://www.datacamp.com/tracks/machine-learning-scientist-with-python)

In [None]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
plt.rcParams['figure.figsize'] = (8, 8)

## Tuning a CART's Hyperparameters

### Tree hyperparameters

<div class=""><p>In the following exercises you'll revisit the <a href="https://www.kaggle.com/uciml/indian-liver-patient-records" target="_blank" rel="noopener noreferrer">Indian Liver Patient</a> dataset which was introduced in a previous chapter. </p>
<p>Your task is to tune the hyperparameters of a classification tree. Given that this dataset is imbalanced, you'll be using the ROC AUC score as a metric instead of accuracy.</p>
<p>We have instantiated a <code>DecisionTreeClassifier</code> with <code>sklearn</code>'s default hyperparameters and assigned it to <code>dt</code>. You can inspect the hyperparameters of <code>dt</code> in your console.</p>
<p>Which of the following is not a hyperparameter of <code>dt</code>?</p></div>

<pre>
Possible Answers

min_impurity_decrease

min_weight_fraction_leaf

<b>min_features</b>

splitter
</pre>

**Well done! There is no hyperparameter named min_features.**
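
One way to check the quiz answer in your own session is to compare each candidate name against the keys returned by `get_params()`. The following is a minimal sketch, assuming a default `DecisionTreeClassifier`; the `candidates` list is only the four answer options from the quiz, and `not_hyperparameters` is a name chosen here for illustration.

```python
from sklearn.tree import DecisionTreeClassifier

# A tree with scikit-learn's default hyperparameters
dt = DecisionTreeClassifier()

# The four answer options from the quiz
candidates = ['min_impurity_decrease', 'min_weight_fraction_leaf',
              'min_features', 'splitter']

# get_params() returns a dict mapping hyperparameter names to their values
valid = dt.get_params()

# Keep only the names that are NOT keys of that dict
not_hyperparameters = [name for name in candidates if name not in valid]
print(not_hyperparameters)  # ['min_features']
```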

### Set the tree's hyperparameter grid

<p>In this exercise, you'll manually set the grid of hyperparameters that will be used to tune the classification tree <code>dt</code> and find the optimal classifier in the next exercise.</p>

Instructions
<p>Define a grid of hyperparameters corresponding to a Python dictionary called <code>params_dt</code> with:</p>
<ul>
<li><p>the key <code>'max_depth'</code> set to a list of values 2, 3, and 4</p></li>
<li><p>the key <code>'min_samples_leaf'</code> set to a list of values 0.12, 0.14, 0.16, 0.18</p></li></ul>

In [None]:
# Define params_dt
params_dt = {
    'max_depth': [2, 3, 4],
    'min_samples_leaf': [0.12, 0.14, 0.16, 0.18],
    }
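
To get a feel for the cost of an exhaustive search, it can help to count the candidates before fitting anything. The sketch below uses scikit-learn's `ParameterGrid` to enumerate the combinations in `params_dt`; the variable names `n_candidates` and `n_fits` are illustrative, and the 5 folds are the ones used in the next exercise. Note that a float `min_samples_leaf` is interpreted as a fraction of the training samples.

```python
from sklearn.model_selection import ParameterGrid

params_dt = {
    'max_depth': [2, 3, 4],
    'min_samples_leaf': [0.12, 0.14, 0.16, 0.18],
}

# ParameterGrid expands the dict into every combination of values
grid = ParameterGrid(params_dt)

n_candidates = len(grid)   # 3 depths * 4 leaf fractions
n_fits = n_candidates * 5  # each candidate is trained once per CV fold

print(n_candidates, n_fits)  # 12 60
```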

### Search for the optimal tree

<div class=""><p>In this exercise, you'll perform grid search using 5-fold cross validation to find <code>dt</code>'s optimal hyperparameters. Note that because grid search is an exhaustive process, it may take a lot of time to train the model. Here you'll only be instantiating the <code>GridSearchCV</code> object without fitting it to the training set. As discussed in the video, you can train such an object like any scikit-learn estimator by using the <code>.fit()</code> method:</p>
<pre><code>grid_object.fit(X_train, y_train)
</code></pre>
<p>An untuned classification tree <code>dt</code> as well as the dictionary <code>params_dt</code> that you defined in the previous exercise are available in your workspace.</p></div>

In [None]:
from sklearn.tree import DecisionTreeClassifier
dt = DecisionTreeClassifier()
dt.get_params()

{'ccp_alpha': 0.0,
 'class_weight': None,
 'criterion': 'gini',
 'max_depth': None,
 'max_features': None,
 'max_leaf_nodes': None,
 'min_impurity_decrease': 0.0,
 'min_impurity_split': None,
 'min_samples_leaf': 1,
 'min_samples_split': 2,
 'min_weight_fraction_leaf': 0.0,
 'presort': 'deprecated',
 'random_state': None,
 'splitter': 'best'}

Instructions
<ul>
<li><p>Import <code>GridSearchCV</code> from <code>sklearn.model_selection</code>.</p></li>
<li><p>Instantiate a <code>GridSearchCV</code> object using 5-fold CV by setting the parameters:</p>
<ul>
<li><p><code>estimator</code> to <code>dt</code>, <code>param_grid</code> to <code>params_dt</code> and</p></li>
<li><p><code>scoring</code> to <code>'roc_auc'</code>.</p></li></ul></li>
</ul>

In [None]:
# Import GridSearchCV
from sklearn.model_selection import GridSearchCV

# Instantiate grid_dt
grid_dt = GridSearchCV(estimator=dt,
                       param_grid=params_dt,
                       scoring='roc_auc',
                       cv=5,
                       n_jobs=-1)

**As we said earlier, we will fit the model to the training data for you and in the next exercise you will compute the test set ROC AUC score.**
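
Once a `GridSearchCV` object has been fitted, the search results are exposed as attributes. The following sketch fits the same grid on synthetic data generated with `make_classification` (an assumption made so the example is self-contained; it is not the liver dataset) and reads `best_params_`, `best_score_` and `cv_results_`. The names `X_demo`, `y_demo` and `demo_grid` are illustrative.

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import GridSearchCV
from sklearn.tree import DecisionTreeClassifier

# Synthetic, mildly imbalanced stand-in data (not the course dataset)
X_demo, y_demo = make_classification(n_samples=300, n_features=8,
                                     weights=[0.7, 0.3], random_state=0)

demo_grid = GridSearchCV(
    estimator=DecisionTreeClassifier(random_state=0),
    param_grid={'max_depth': [2, 3, 4],
                'min_samples_leaf': [0.12, 0.14, 0.16, 0.18]},
    scoring='roc_auc',
    cv=5,
)
demo_grid.fit(X_demo, y_demo)

# Best combination found and its mean cross-validated ROC AUC
print(demo_grid.best_params_)
print(demo_grid.best_score_)

# One entry per candidate in the grid
print(len(demo_grid.cv_results_['params']))
```

The printed values depend on the synthetic data, so they say nothing about the scores you should expect on the liver dataset.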

### Evaluate the optimal tree

<div class=""><p>In this exercise, you'll evaluate the test set ROC AUC score of <code>grid_dt</code>'s optimal model. </p>
<p>In order to do so, you will first determine the probability of obtaining the positive label for each test set observation. You can use the method <code>predict_proba()</code> of an sklearn classifier to compute a 2D array whose first and second columns contain the probabilities of the negative and positive class labels, respectively.</p>
<p>The dataset is already loaded and processed for you (numerical features are standardized); it is split into 80% train and 20% test. <code>X_test</code>, <code>y_test</code> are available in your workspace. In addition, we have also loaded the trained <code>GridSearchCV</code> object <code>grid_dt</code> that you instantiated in the previous exercise. Note that <code>grid_dt</code> was trained as follows:</p>
<pre><code>grid_dt.fit(X_train, y_train)
</code></pre></div>
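
The column layout of `predict_proba()` can be checked directly: the columns follow the order of the classifier's `classes_` attribute, and each row sums to 1. This is a small sketch on synthetic data (an assumption for self-containment); `clf`, `proba` and `pos_col` are illustrative names.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.tree import DecisionTreeClassifier

# Synthetic binary data with labels 0 and 1 (not the course dataset)
X_demo, y_demo = make_classification(n_samples=200, n_features=6,
                                     random_state=0)
clf = DecisionTreeClassifier(max_depth=3, random_state=0).fit(X_demo, y_demo)

# One row per observation, one column per class
proba = clf.predict_proba(X_demo[:5])
print(proba.shape)   # (5, 2)
print(clf.classes_)  # column order of proba

# Locate the column of the positive class (label 1) instead of hard-coding it
pos_col = list(clf.classes_).index(1)
pos_proba = proba[:, pos_col]
```

With labels 0 and 1 the positive class ends up in column 1, which is why the exercise solution slices with `[:, 1]`.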

In [None]:
from sklearn.model_selection import train_test_split

df = pd.read_csv('https://raw.githubusercontent.com/lnunesAI/Datacamp/main/2-machine-learning-scientist-with-python/04-machine-learning-with-tree-based-models-in-python/datasets/indian_liver_patient_preprocessed.csv', index_col=0)
X = df.drop('Liver_disease', axis='columns')
y = df['Liver_disease']

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2)

In [None]:
from sklearn.model_selection import GridSearchCV
grid_dt = GridSearchCV(estimator=dt, param_grid=params_dt, scoring='roc_auc', cv=5, n_jobs=-1)
grid_dt.fit(X_train, y_train)

GridSearchCV(cv=5, error_score=nan,
             estimator=DecisionTreeClassifier(ccp_alpha=0.0, class_weight=None,
                                              criterion='gini', max_depth=None,
                                              max_features=None,
                                              max_leaf_nodes=None,
                                              min_impurity_decrease=0.0,
                                              min_impurity_split=None,
                                              min_samples_leaf=1,
                                              min_samples_split=2,
                                              min_weight_fraction_leaf=0.0,
                                              presort='deprecated',
                                              random_state=None,
                                              splitter='best'),
             iid='deprecated', n_jobs=-1,
             param_grid={'max_depth': [2, 3, 4],
                         'min_s

Instructions
<ul>
<li><p>Import <code>roc_auc_score</code> from <code>sklearn.metrics</code>. </p></li>
<li><p>Extract the <code>.best_estimator_</code> attribute from <code>grid_dt</code> and assign it to <code>best_model</code>. </p></li>
<li><p>Predict the test set probabilities of obtaining the positive class <code>y_pred_proba</code>. </p></li>
<li><p>Compute the test set ROC AUC score <code>test_roc_auc</code> of <code>best_model</code>.</p></li>
</ul>

In [None]:
# Import roc_auc_score from sklearn.metrics
from sklearn.metrics import roc_auc_score

# Extract the best estimator
best_model = grid_dt.best_estimator_

# Predict the test set probabilities of the positive class
y_pred_proba = best_model.predict_proba(X_test)[:,1]

# Compute test_roc_auc
test_roc_auc = roc_auc_score(y_test, y_pred_proba)

# Print test_roc_auc
print('Test set ROC AUC score: {:.3f}'.format(test_roc_auc))

Test set ROC AUC score: 0.674


**An untuned classification-tree would achieve a ROC AUC score of 0.54!**
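
A comparison like the one above can be reproduced by scoring a default tree and the tuned model on the same held-out set. The sketch below does this on synthetic data, so it will not reproduce the 0.54 or 0.674 figures from the course; the names `untuned`, `search`, `untuned_auc` and `tuned_auc` are illustrative.

```python
from sklearn.datasets import make_classification
from sklearn.metrics import roc_auc_score
from sklearn.model_selection import GridSearchCV, train_test_split
from sklearn.tree import DecisionTreeClassifier

# Synthetic, noisy, imbalanced stand-in data (not the course dataset)
X_demo, y_demo = make_classification(n_samples=500, n_features=10,
                                     weights=[0.7, 0.3], flip_y=0.1,
                                     random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(X_demo, y_demo, test_size=0.2,
                                          stratify=y_demo, random_state=0)

# Baseline: a tree with default hyperparameters
untuned = DecisionTreeClassifier(random_state=0).fit(X_tr, y_tr)
untuned_auc = roc_auc_score(y_te, untuned.predict_proba(X_te)[:, 1])

# Tuned: same grid as in the exercises, selected by cross-validated ROC AUC
search = GridSearchCV(
    estimator=DecisionTreeClassifier(random_state=0),
    param_grid={'max_depth': [2, 3, 4],
                'min_samples_leaf': [0.12, 0.14, 0.16, 0.18]},
    scoring='roc_auc',
    cv=5,
).fit(X_tr, y_tr)

# With the default refit=True, predict_proba is delegated to best_estimator_
tuned_auc = roc_auc_score(y_te, search.predict_proba(X_te)[:, 1])

print('Untuned: {:.3f} | Tuned: {:.3f}'.format(untuned_auc, tuned_auc))
```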

## Tuning a RF's Hyperparameters

### Random forests hyperparameters

<div class=""><p>In the following exercises, you'll be revisiting the <a href="https://www.kaggle.com/c/bike-sharing-demand" target="_blank" rel="noopener noreferrer">Bike Sharing Demand</a> dataset that was introduced in a previous chapter. Recall that your task is to predict the bike rental demand using historical weather data from the Capital Bikeshare program in Washington, D.C. For this purpose, you'll be tuning the hyperparameters of a Random Forests regressor.</p>
<p>We have instantiated a <code>RandomForestRegressor</code> called <code>rf</code> using <code>sklearn</code>'s default hyperparameters. You can inspect the hyperparameters of <code>rf</code> in your console.</p>
<p>Which of the following is not a hyperparameter of <code>rf</code>?</p></div>

<pre>
Possible Answers

min_weight_fraction_leaf

criterion

<b>learning_rate</b>

warm_start
</pre>

In [None]:
from sklearn.ensemble import RandomForestRegressor
rf = RandomForestRegressor(n_estimators=10, n_jobs=-1, random_state=2)
rf.get_params()

{'bootstrap': True,
 'ccp_alpha': 0.0,
 'criterion': 'mse',
 'max_depth': None,
 'max_features': 'auto',
 'max_leaf_nodes': None,
 'max_samples': None,
 'min_impurity_decrease': 0.0,
 'min_impurity_split': None,
 'min_samples_leaf': 1,
 'min_samples_split': 2,
 'min_weight_fraction_leaf': 0.0,
 'n_estimators': 10,
 'n_jobs': -1,
 'oob_score': False,
 'random_state': 2,
 'verbose': 0,
 'warm_start': False}

**There is no hyperparameter named learning_rate.**
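
The reason `learning_rate` is a plausible distractor is that it is a hyperparameter of boosting ensembles rather than of random forests. A quick sketch contrasting the two, assuming default instances of `RandomForestRegressor` and `GradientBoostingRegressor` (`rf_params` and `gb_params` are illustrative names):

```python
from sklearn.ensemble import GradientBoostingRegressor, RandomForestRegressor

rf_params = RandomForestRegressor().get_params()
gb_params = GradientBoostingRegressor().get_params()

# Random forests average independently grown trees: no learning rate
print('learning_rate' in rf_params)  # False

# Gradient boosting shrinks each tree's contribution by learning_rate
print('learning_rate' in gb_params)  # True

# The other three quiz options are genuine random forest hyperparameters
print(all(name in rf_params
          for name in ['min_weight_fraction_leaf', 'criterion', 'warm_start']))  # True
```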

### Set the hyperparameter grid of RF

<p>In this exercise, you'll manually set the grid of hyperparameters that will be used to tune <code>rf</code>'s hyperparameters and find the optimal regressor. For this purpose, you will be constructing a grid of hyperparameters and tuning the number of estimators, the maximum number of features used when splitting each node and the minimum number of samples (or fraction) per leaf.</p>

Instructions
<p>Define a grid of hyperparameters corresponding to a Python dictionary called <code>params_rf</code> with:</p>
<ul>
<li><p>the key <code>'n_estimators'</code> set to a list of values 100, 350, 500</p></li>
<li><p>the key <code>'max_features'</code> set to a list of values 'log2', 'auto', 'sqrt'</p></li>
<li><p>the key <code>'min_samples_leaf'</code> set to a list of values 2, 10, 30</p></li></ul>

In [None]:
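# Define the dictionary 'params_rf'
# Note: 'auto' (all features, for a regressor) was removed from max_features in
# scikit-learn 1.3; on newer versions, 1.0 is the equivalent value.
params_rf = {
    'n_estimators': [100, 350, 500],
    'max_features': ['log2', 'auto', 'sqrt'],
    'min_samples_leaf': [2, 10, 30],
    }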

**Time to perform the grid search.**

### Search for the optimal forest

<div class=""><p>In this exercise, you'll perform grid search using 3-fold cross validation to find <code>rf</code>'s optimal hyperparameters. To evaluate each model in the grid, you'll be using the <a href="http://scikit-learn.org/stable/modules/model_evaluation.html" target="_blank" rel="noopener noreferrer">negative mean squared error</a> metric. </p>
<p>Note that because grid search is an exhaustive search process, it may take a lot of time to train the model. Here you'll only be instantiating the <code>GridSearchCV</code> object without fitting it to the training set. As discussed in the video, you can train such an object like any scikit-learn estimator by using the <code>.fit()</code> method: </p>
<pre><code>grid_object.fit(X_train, y_train)
</code></pre>
<p>The untuned random forests regressor model <code>rf</code> as well as the dictionary <code>params_rf</code> that you defined in the previous exercise are available in your workspace.</p></div>

Instructions
<ul>
<li><p>Import <code>GridSearchCV</code> from <code>sklearn.model_selection</code>.</p></li>
<li><p>Instantiate a <code>GridSearchCV</code> object using 3-fold CV by using negative mean squared error as the scoring metric.</p></li>
</ul>

In [None]:
# Import GridSearchCV
from sklearn.model_selection import GridSearchCV

# Instantiate grid_rf
grid_rf = GridSearchCV(estimator=rf,
                       param_grid=params_rf,
                       scoring='neg_mean_squared_error',
                       cv=3,
                       verbose=1,
                       n_jobs=-1)

**Next comes evaluating the test set RMSE of the best model.**
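
Because scikit-learn scorers follow a "greater is better" convention, `'neg_mean_squared_error'` reports the MSE with its sign flipped, so `best_score_` is negative (or zero). The sketch below shows how to turn it back into a cross-validated RMSE. It uses synthetic data from `make_regression` and a deliberately tiny grid (both assumptions for speed and self-containment); `search`, `cv_mse` and `cv_rmse` are illustrative names.

```python
import numpy as np
from sklearn.datasets import make_regression
from sklearn.ensemble import RandomForestRegressor
from sklearn.model_selection import GridSearchCV

# Synthetic regression data (not the bike sharing dataset)
X_demo, y_demo = make_regression(n_samples=200, n_features=5, noise=10.0,
                                 random_state=0)

search = GridSearchCV(
    estimator=RandomForestRegressor(n_estimators=20, random_state=0),
    param_grid={'min_samples_leaf': [2, 10]},
    scoring='neg_mean_squared_error',
    cv=3,
)
search.fit(X_demo, y_demo)

# best_score_ is the mean of the negated fold MSEs, hence <= 0
print(search.best_score_)

# Flip the sign to recover the MSE, then take the square root
cv_mse = -search.best_score_
cv_rmse = np.sqrt(cv_mse)
print('CV RMSE of best model: {:.3f}'.format(cv_rmse))
```

Note that the square root of the mean fold MSE is generally not identical to the mean of the per-fold RMSEs, so treat `cv_rmse` as a convenient summary rather than an exact average.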

### Evaluate the optimal forest

<div class=""><p>In this last exercise of the course, you'll evaluate the test set RMSE of <code>grid_rf</code>'s optimal model.</p>
<p>The dataset is already loaded and processed for you and is split into 80% train and 20% test. <code>X_test</code>, <code>y_test</code> and the function <code>mean_squared_error</code> from <code>sklearn.metrics</code>, under the alias <code>MSE</code>, are available in your environment. In addition, we have also loaded the trained <code>GridSearchCV</code> object <code>grid_rf</code> that you instantiated in the previous exercise. Note that <code>grid_rf</code> was trained as follows:</p>
<pre><code>grid_rf.fit(X_train, y_train)
</code></pre></div>

In [None]:
df = pd.read_csv('https://raw.githubusercontent.com/lnunesAI/Datacamp/main/2-machine-learning-scientist-with-python/04-machine-learning-with-tree-based-models-in-python/datasets/bikes.csv')
X = df.drop('cnt', axis='columns')
y = df['cnt']
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=1111)

grid_rf.fit(X_train, y_train)

Fitting 3 folds for each of 27 candidates, totalling 81 fits


[Parallel(n_jobs=-1)]: Using backend LokyBackend with 2 concurrent workers.
[Parallel(n_jobs=-1)]: Done  46 tasks      | elapsed:   29.4s
[Parallel(n_jobs=-1)]: Done  81 out of  81 | elapsed:   48.4s finished


GridSearchCV(cv=3, error_score=nan,
             estimator=RandomForestRegressor(bootstrap=True, ccp_alpha=0.0,
                                             criterion='mse', max_depth=None,
                                             max_features='auto',
                                             max_leaf_nodes=None,
                                             max_samples=None,
                                             min_impurity_decrease=0.0,
                                             min_impurity_split=None,
                                             min_samples_leaf=1,
                                             min_samples_split=2,
                                             min_weight_fraction_leaf=0.0,
                                             n_estimators=10, n_jobs=-1,
                                             oob_score=False, random_state=2,
                                             verbose=0, warm_start=False),
             iid='deprecated', n_jobs=-1,

Instructions
<ul>
<li><p>Import <code>mean_squared_error</code> as <code>MSE</code> from <code>sklearn.metrics</code>. </p></li>
<li><p>Extract the best estimator from <code>grid_rf</code> and assign it to <code>best_model</code>. </p></li>
<li><p>Predict <code>best_model</code>'s test set labels and assign the result to <code>y_pred</code>.</p></li>
<li><p>Compute <code>best_model</code>'s test set RMSE.</p></li>
</ul>

In [None]:
# Import mean_squared_error from sklearn.metrics as MSE 
from sklearn.metrics import mean_squared_error as MSE

# Extract the best estimator
best_model = grid_rf.best_estimator_

# Predict test set labels
y_pred = best_model.predict(X_test)

# Compute rmse_test
rmse_test = MSE(y_test, y_pred)**(1/2)


# Print rmse_test
print('Test RMSE of best model: {:.3f}'.format(rmse_test)) 

Test RMSE of best model: 52.882
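
The expression `MSE(y_test, y_pred)**(1/2)` is simply the square root of the mean squared residual. As a sanity check, here is a tiny worked example with made-up numbers (`y_true` and `y_hat` are hypothetical values, not taken from the bike sharing data) comparing it with a manual NumPy computation:

```python
import numpy as np
from sklearn.metrics import mean_squared_error as MSE

# Hypothetical targets and predictions
y_true = np.array([3.0, 5.0, 2.0])
y_hat = np.array([2.0, 5.0, 4.0])

# Residuals are 1, 0 and -2, so the squared errors are 1, 0 and 4
rmse_sklearn = MSE(y_true, y_hat) ** 0.5
rmse_manual = np.sqrt(np.mean((y_true - y_hat) ** 2))

print(round(rmse_sklearn, 3))  # 1.291
print(np.isclose(rmse_sklearn, rmse_manual))  # True
```

RMSE is expressed in the same units as the target, which is why it is usually easier to interpret than the raw MSE.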
