# Selecting the best model with hyperparameter tuning
>  The first three chapters focused on model validation techniques. In chapter 4 we apply these techniques, specifically cross-validation, while learning about hyperparameter tuning. After all, model validation makes tuning possible and helps us select the overall best model.

- toc: true 
- badges: true
- comments: true
- author: Lucas Nunes
- categories: [Python, Datacamp, Machine Learning]
- image: images/datacamp/1_supervised_learning_with_scikit_learn/2_regression.png

> Note: This is a summary of the chapter 4 exercises from the DataCamp course "Model Validation in Python". <br>[Github repo](https://github.com/lnunesAI/Datacamp/) / [Course link](https://www.datacamp.com/tracks/machine-learning-scientist-with-python)

In [1]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
plt.rcParams['figure.figsize'] = (8, 8)

## Introduction to hyperparameter tuning

### Creating Hyperparameters

<div class=""><p>For a school assignment, your professor has asked your class to create a random forest model to predict the average test score for the final exam.</p>
<p>After developing an initial random forest model, you are unsatisfied with the overall accuracy. You realize that there are too many hyperparameters to choose from, and each one has <em>a lot</em> of possible values. You have decided to make a list of possible ranges for the hyperparameters you might use in your next model.</p>
<p>Your professor has provided de-identified data for the last ten quizzes to act as the training data. There are 30 students in your class.</p></div>

In [8]:
from sklearn.ensemble import RandomForestRegressor
rfr = RandomForestRegressor(n_estimators='warn', random_state=1111)

Instructions 1/3
<li>Print <code>.get_params()</code> in the console to review the possible parameters of the model that you can tune.</li>

In [9]:
# Review the parameters of rfr
rfr.get_params()

{'bootstrap': True,
 'ccp_alpha': 0.0,
 'criterion': 'mse',
 'max_depth': None,
 'max_features': 'auto',
 'max_leaf_nodes': None,
 'max_samples': None,
 'min_impurity_decrease': 0.0,
 'min_impurity_split': None,
 'min_samples_leaf': 1,
 'min_samples_split': 2,
 'min_weight_fraction_leaf': 0.0,
 'n_estimators': 'warn',
 'n_jobs': None,
 'oob_score': False,
 'random_state': 1111,
 'verbose': 0,
 'warm_start': False}

Instructions 2/3
<li>Create a maximum depth list, <code>[4, 8, 12]</code> and a minimum samples list <code>[2, 5, 10]</code> that specify possible values for each hyperparameter.</li>

In [10]:
# Maximum Depth
max_depth = [4, 8, 12]

# Minimum samples for a split
min_samples_split = [2, 5, 10]

Instructions 3/3
<li>Create one final list to use for the maximum features.<ul>
<li>Use values 4 through the maximum number of features possible (10), by 2.</li></ul></li>

In [11]:
# Max features 
max_features = [4, 6, 8, 10]

**Hyperparameter tuning requires selecting the parameters to tune, as well as the possible values these parameters can be set to.**
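To get a feel for how quickly these ranges multiply, the sketch below counts every combination of the three lists defined above. It only uses the standard library; the variable name `combinations` is introduced here for illustration.

```python
from itertools import product

# Candidate values from the exercise above
max_depth = [4, 8, 12]
min_samples_split = [2, 5, 10]
max_features = [4, 6, 8, 10]

# Every combination an exhaustive search would have to evaluate
combinations = list(product(max_depth, min_samples_split, max_features))

print(len(combinations))  # 3 * 3 * 4 = 36
print(combinations[0])    # (4, 2, 4)
```

Even with three short lists there are 36 candidate models, which is why sampling a handful of combinations at random can be attractive.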

### Running a model using ranges

<p>You have just finished creating a list of hyperparameters and ranges to use when tuning a predictive model for an assignment. You have used <code>max_depth</code>, <code>min_samples_split</code>, and <code>max_features</code> as your range variable names.</p>

Instructions
<ul>
<li>Randomly select a <code>max_depth</code>, <code>min_samples_split</code>, and <code>max_features</code> using your range variables.</li>
<li>Print out all of the parameters for <code>rfr</code> to see which values were randomly selected.</li>
</ul>

In [16]:
random.sample(max_depth, 1)[0]

12

In [14]:
import random

from sklearn.ensemble import RandomForestRegressor

# Fill in rfr using your variables
rfr = RandomForestRegressor(
    n_estimators=100,
    max_depth=random.choice(max_depth),
    min_samples_split=random.choice(min_samples_split),
    max_features=random.choice(max_features))

# Print out the parameters
rfr.get_params()

{'bootstrap': True,
 'ccp_alpha': 0.0,
 'criterion': 'mse',
 'max_depth': 12,
 'max_features': 4,
 'max_leaf_nodes': None,
 'max_samples': None,
 'min_impurity_decrease': 0.0,
 'min_impurity_split': None,
 'min_samples_leaf': 1,
 'min_samples_split': 2,
 'min_weight_fraction_leaf': 0.0,
 'n_estimators': 100,
 'n_jobs': None,
 'oob_score': False,
 'random_state': None,
 'verbose': 0,
 'warm_start': False}

**Notice that min_samples_split was randomly set to 2. Because neither `random.seed()` nor a `random_state` was set in this cell, rerunning it may select different values; seed the generator first if you need the same draw every time.**
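Whether the randomly chosen values repeat from run to run depends on seeding the generator. The following is a minimal sketch, assuming a hypothetical helper `draw_params` (not part of the course code) that uses its own seeded `random.Random` instance:

```python
import random

max_depth = [4, 8, 12]
min_samples_split = [2, 5, 10]
max_features = [4, 6, 8, 10]

def draw_params(seed):
    """Draw one value per hyperparameter from a dedicated, seeded generator."""
    rng = random.Random(seed)
    return {
        "max_depth": rng.choice(max_depth),
        "min_samples_split": rng.choice(min_samples_split),
        "max_features": rng.choice(max_features),
    }

first = draw_params(1111)
second = draw_params(1111)
print(first == second)  # True: the same seed gives the same draws
```

The resulting dictionary can be unpacked into the model, for example `RandomForestRegressor(n_estimators=100, **first)`.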

## RandomizedSearchCV

### Preparing for RandomizedSearch

<div class=""><p>Last semester your professor challenged your class to build a predictive model to predict final exam test scores. You tried running a few different models by randomly selecting hyperparameters. However, running each model required you to code it individually. </p>
<p>After learning about <code>RandomizedSearchCV()</code>, you're revisiting your professor's challenge to build the best model. In this exercise, you will prepare the three necessary inputs for completing a random search.</p></div>

Instructions
<ul>
<li>Finalize the parameter dictionary by adding a list for the <code>max_depth</code> parameter with options 2, 4, 6, and 8. </li>
<li>Create a random forest regression model with ten trees and a <code>random_state</code> of 1111.</li>
<li>Create a mean squared error scorer to use.</li>
</ul>

In [19]:
from sklearn.ensemble import RandomForestRegressor
from sklearn.metrics import make_scorer, mean_squared_error

# Finish the dictionary by adding the max_depth parameter
param_dist = {"max_depth": [2, 4, 6, 8],
              "max_features": [2, 4, 6, 8, 10],
              "min_samples_split": [2, 4, 8, 16]}

# Create a random forest regression model
rfr = RandomForestRegressor(n_estimators=10, random_state=1111)

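# Create a scorer to use (mean squared error is a loss, so lower is better;
# greater_is_better=False makes the search minimize it instead of maximize it)
scorer = make_scorer(mean_squared_error, greater_is_better=False)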

**To use RandomizedSearchCV(), you need a distribution dictionary, an estimator, and a scorer—once you've got these, you can run a random search to find the best parameters for your model.**
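One detail worth checking when building a scorer from an error metric: `make_scorer` treats its metric as a score to maximize unless told otherwise. The sketch below uses a tiny made-up dataset and a `DummyRegressor` (both chosen only for illustration) to show how `greater_is_better=False` flips the sign so that a search prefers lower error.

```python
import numpy as np
from sklearn.dummy import DummyRegressor
from sklearn.metrics import make_scorer, mean_squared_error

# Toy data (illustrative only): the dummy model always predicts the mean, 2.5
X = np.zeros((4, 1))
y = np.array([1.0, 2.0, 3.0, 4.0])
model = DummyRegressor(strategy="mean").fit(X, y)

as_score = make_scorer(mean_squared_error)                          # maximized as-is
as_loss = make_scorer(mean_squared_error, greater_is_better=False)  # negated

print(as_score(model, X, y))  # 1.25
print(as_loss(model, X, y))   # -1.25
```

With the negated scorer, the parameter set with the smallest error has the largest (least negative) score, which is what the search selects.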

### Implementing RandomizedSearchCV

<div class=""><p>You are hoping that using a random search algorithm will help you improve predictions for a class assignment. Your professor has challenged your class to predict the overall final exam average score. </p>
<p>In preparation for completing a random search, you have created:</p>
<ul>
<li><code>param_dist</code>: the hyperparameter distributions</li>
<li><code>rfr</code>: a random forest regression model</li>
<li><code>scorer</code>: a scoring method to use</li>
</ul></div>

Instructions
<ul>
<li>Load the method for conducting a random search in <code>sklearn</code>.</li>
<li>Complete a random search by filling in the parameters: <code>estimator</code>, <code>param_distributions</code>, and <code>scoring</code>. </li>
<li>Use 5-fold cross validation for this random search.</li>
</ul>

In [23]:
# Import the method for random search
from sklearn.model_selection import RandomizedSearchCV

# Build a random search using param_dist, rfr, and scorer
random_search =\
    RandomizedSearchCV(
        estimator=rfr,
        param_distributions=param_dist,
        n_iter=10,
        cv=5,
        scoring=scorer)

**Although it takes a lot of steps, hyperparameter tuning with random search is well worth it and can improve the accuracy of your models. Plus, you are already using cross-validation to validate your best model.**
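The search object above is only configured, not yet fitted. As a rough sketch of the remaining step, the example below fits it on synthetic data from `make_regression`, which stands in for the quiz data (30 rows, 10 features) and is an assumption rather than the course dataset. The `greater_is_better=False` argument and the `random_state` on the search are also choices made here for illustration.

```python
from sklearn.datasets import make_regression
from sklearn.ensemble import RandomForestRegressor
from sklearn.metrics import make_scorer, mean_squared_error
from sklearn.model_selection import RandomizedSearchCV

# Synthetic stand-in for the quiz data: 30 students, 10 quiz scores (assumption)
X, y = make_regression(n_samples=30, n_features=10, noise=5.0, random_state=1111)

param_dist = {"max_depth": [2, 4, 6, 8],
              "max_features": [2, 4, 6, 8, 10],
              "min_samples_split": [2, 4, 8, 16]}

rfr = RandomForestRegressor(n_estimators=10, random_state=1111)
# Negated MSE so that the search prefers lower error
scorer = make_scorer(mean_squared_error, greater_is_better=False)

random_search = RandomizedSearchCV(
    estimator=rfr, param_distributions=param_dist,
    n_iter=10, cv=5, scoring=scorer, random_state=1111)
random_search.fit(X, y)

print(random_search.best_params_)   # the sampled combination with the lowest mean CV error
print(-random_search.best_score_)   # flip the sign back to a plain MSE
```

Setting `random_state` on the search itself makes the ten sampled combinations repeatable across runs.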

## Selecting your final model

### Best classification accuracy

<div class=""><p>You are in a competition at work to build the best model for predicting the winner of a Tic-Tac-Toe game. You already ran a random search and saved the results of the most accurate model to <code>rs</code>.</p>
<p>Which parameter set produces the best classification accuracy?</p></div>

<pre>
Possible Answers

{'max_depth': 8, 'min_samples_split': 4, 'n_estimators': 10}

{'max_depth': 2, 'min_samples_split': 4, 'n_estimators': 10}

<b>{'max_depth': 12, 'min_samples_split': 4, 'n_estimators': 20}</b>

{'max_depth': 2, 'min_samples_split': 2, 'n_estimators': 50}
 
</pre>

In [21]:
#rs.best_params_

**These parameters do produce the best testing accuracy. Good job! Remember, to reuse this model you can use rs.best_estimator_.**
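To illustrate how `best_params_` and `best_estimator_` fit together, here is a small sketch on synthetic classification data. The data from `make_classification`, the parameter grid, and the search settings are all assumptions for illustration, not the Tic-Tac-Toe setup from the course.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import RandomizedSearchCV

# Synthetic stand-in for the game data (assumption)
X, y = make_classification(n_samples=200, n_features=9, random_state=1111)

rs = RandomizedSearchCV(
    estimator=RandomForestClassifier(random_state=1111),
    param_distributions={"max_depth": [2, 4, 8, 12],
                         "min_samples_split": [2, 4],
                         "n_estimators": [10, 20, 50]},
    n_iter=5, cv=3, scoring="accuracy", random_state=1111)
rs.fit(X, y)

# With the default refit=True, the best parameter set is refit on all of X, y
best_model = rs.best_estimator_
print(rs.best_params_)
print(best_model.predict(X[:5]))
```

Because the best estimator is already refit, it can be used for prediction directly without rebuilding the model from `best_params_`.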

### Selecting the best precision model

<div class=""><p>Your boss has offered to pay for you to see three sports games this year. Of the 41 home games your favorite team plays, you want to ensure you go to three home games that they will <em>definitely</em> win. You build a model to decide which games your team will win. </p>
<p>To do this, you will build a random search algorithm and focus on model precision (to ensure your team wins). You also want to keep track of your best model and best parameters, so that you can use them again next year (if the model does well, of course). You have already decided on using the random forest classification model <code>rfc</code> and generated a parameter distribution <code>param_dist</code>.</p></div>

In [26]:
df = pd.read_csv('https://raw.githubusercontent.com/lnunesAI/Datacamp/main/2-machine-learning-scientist-with-python/11-model-validation-in-python/datasets/sport_preprocessed.csv')
X, y = df.iloc[:, :-1], df.iloc[:, -1]

In [33]:
from sklearn.ensemble import RandomForestClassifier
rfc = RandomForestClassifier(n_estimators='warn', random_state=1111)

param_dist = {
    'max_depth': range(2, 12, 2),
    'min_samples_split': range(2, 12, 2),
    'n_estimators': [10, 25, 50]
}

Instructions
<ul>
<li>Create a precision scorer, <code>precision</code> using <code>make_scorer(&lt;scoring_function&gt;)</code>.</li>
<li>Complete the random search method by using <code>rfc</code> and <code>param_dist</code>. </li>
<li>Use <code>rs.cv_results_</code> to print the mean test scores.</li>
<li>Print the best overall score.</li>
</ul>

In [35]:
from sklearn.metrics import precision_score, make_scorer

# Create a precision scorer
precision = make_scorer(precision_score)
# Finalize the random search
rs = RandomizedSearchCV(
  estimator=rfc, param_distributions=param_dist,
  scoring = precision,
  cv=5, n_iter=10, random_state=1111)
rs.fit(X, y)

# print the mean test scores:
print('The precision for each run was: {}.'.format(rs.cv_results_['mean_test_score']))
# print the best model score:
print('The best precision for a single model was: {}'.format(rs.best_score_))

The precision for each run was: [0.87614978 0.75561877 0.67740077 0.89141614 0.87024051 0.85772772
 0.68244199 0.82867397 0.88717239 0.91980724].
The best precision for a single model was: 0.9198072369317106


**Your model's precision was about 92%! When the best model predicts a win, the team actually wins roughly 92% of the time. If you look at the mean test scores, you can tell that some of the other parameter sets did much worse (as low as about 68%). Also, because these scores come from 5-fold cross-validation, they are a more reliable estimate than a single train/test split. Well done!**
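Since precision and accuracy are easy to mix up, the short sketch below computes both on a handful of made-up labels (illustrative values, not the sports data). Precision only looks at the games the model predicted as wins.

```python
from sklearn.metrics import accuracy_score, precision_score

# Made-up labels: 1 = win, 0 = loss
y_true = [1, 0, 1, 1, 0, 1]
y_pred = [1, 1, 1, 0, 0, 1]

# Precision = true positives / all predicted positives
true_pos = sum(t == 1 and p == 1 for t, p in zip(y_true, y_pred))
pred_pos = sum(p == 1 for p in y_pred)

print(true_pos / pred_pos)               # 0.75 (3 of 4 predicted wins were real wins)
print(precision_score(y_true, y_pred))   # 0.75
print(accuracy_score(y_true, y_pred))    # 4 of 6 predictions correct, about 0.667
```

For the ticket scenario, precision is the relevant metric: a missed win costs nothing, but a predicted win that turns into a loss wastes a ticket.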