<a href="https://colab.research.google.com/github/mohd-faizy/CAREER-TRACK-Machine-Learning-Scientist-with-Python/blob/main/04_Selecting_the_best_model_with_Hyperparameter_tuning.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

--- 
<strong> 
    <h1 align='center'>Selecting the best model with Hyperparameter tuning</h1> 
</strong>

---

The first three chapters focused on model validation techniques. In chapter 4 we apply these techniques, specifically cross-validation, while learning about hyperparameter tuning. After all, model validation makes tuning possible and helps us select the overall best model.

In [1]:
!git clone https://github.com/mohd-faizy/CAREER-TRACK-Machine-Learning-Scientist-with-Python.git

Cloning into 'CAREER-TRACK-Machine-Learning-Scientist-with-Python'...
remote: Enumerating objects: 689, done.
remote: Counting objects: 100% (360/360), done.
remote: Compressing objects: 100% (320/320), done.
remote: Total 689 (delta 93), reused 284 (delta 38), pack-reused 329
Receiving objects: 100% (689/689), 202.36 MiB | 26.10 MiB/s, done.
Resolving deltas: 100% (219/219), done.
Checking out files: 100% (315/315), done.


In [2]:
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
import os

plt.style.use('fivethirtyeight')
#plt.style.use('ggplot')
#sns.set_theme()

%matplotlib inline

In [3]:
os.chdir('/content/CAREER-TRACK-Machine-Learning-Scientist-with-Python/10_Model_Validation_in_Python/_dataset')
cwd = os.getcwd()
print('Current working directory is ', cwd)

Current working directory is  /content/CAREER-TRACK-Machine-Learning-Scientist-with-Python/10_Model_Validation_in_Python/_dataset


In [4]:
ls

candy-data.csv  sports.csv  tic-tac-toe.csv


## Introduction to hyperparameter tuning

**Model Parameters**
> _Model parameters are created as the result of **fitting** a model and are estimated from the input data. They are used to make predictions on new data and are not manually set by the modeler._

```python
from sklearn.linear_model import LinearRegression
lr = LinearRegression()
lr.fit(X, y)
# Here coefficients and intercept are considered model parameters
print(lr.coef_, lr.intercept_)
```
- Learned or estimated from the data
- The result of fitting a model
- Used when making future predictions
- Not manually set

**Model Hyperparameters**
> _Hyperparameters are the values that are set before training occurs. So anytime we refer to a parameter as being manually set, we are referring to hyperparameters._

- Manually set.
- Specify how the training is supposed to happen
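The distinction can be seen on a small example. The sketch below uses synthetic data (an assumption made purely for illustration, not the course dataset): the hyperparameters are fixed when the estimator is constructed, while the fitted trees and feature importances only exist after `fit()`.

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor

# Synthetic data (assumed for illustration): 30 rows, 10 features
rng = np.random.default_rng(1111)
X_demo = rng.normal(size=(30, 10))
y_demo = 2.0 * X_demo[:, 0] + rng.normal(scale=0.1, size=30)

# Hyperparameters: chosen by the modeler before any training happens
rfr_demo = RandomForestRegressor(n_estimators=20, max_depth=4, random_state=1111)
hyperparams = rfr_demo.get_params()
print(hyperparams["n_estimators"], hyperparams["max_depth"])

# Fitting leaves the hyperparameters untouched and creates learned attributes
rfr_demo.fit(X_demo, y_demo)
print(len(rfr_demo.estimators_))             # one fitted tree per n_estimators
print(rfr_demo.feature_importances_.shape)   # one importance value per feature
```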

__Random forest hyperparameters__

| Hyperparameter | Description | Possible values (default) |
|---|---|---|
| `n_estimators` | Number of decision trees in the forest | 2+ (100; 10 before scikit-learn 0.22) |
| `max_depth` | Maximum depth of the decision trees | 2+ (None) |
| `max_features` | Number of features to consider when making a split | See documentation |
| `min_samples_split` | The minimum number of samples required to make a split | 2+ (2) |

**Hyperparameter tuning**

- Select the hyperparameters to test, then run a specific type of model with various values for them
- Create ranges of possible values to select from
- Specify a single **accuracy metric** to compare the runs
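As a rough illustration of these steps, the sketch below tunes a single hyperparameter by hand. The data is synthetic and the range `depth_range` is an arbitrary choice made for this example; the single metric is scikit-learn's built-in `"neg_mean_squared_error"` scorer, where values closer to zero are better.

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor
from sklearn.model_selection import cross_val_score

# Synthetic regression data (assumed for illustration)
rng = np.random.default_rng(0)
X_demo = rng.normal(size=(60, 5))
y_demo = X_demo[:, 0] - X_demo[:, 1] + rng.normal(scale=0.1, size=60)

# 1. Choose a hyperparameter and a range of values to test
depth_range = [2, 4, 8]

# 2. Run the same type of model once per value, scored with one metric
results = {}
for depth in depth_range:
    model = RandomForestRegressor(n_estimators=25, max_depth=depth, random_state=0)
    scores = cross_val_score(model, X_demo, y_demo, cv=5,
                             scoring="neg_mean_squared_error")
    results[depth] = scores.mean()

# 3. Keep the value with the best (highest, i.e. least negative) mean score
best_depth = max(results, key=results.get)
print(results)
print(best_depth)
```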

### Creating Hyperparameters
For a school assignment, your professor has asked your class to create a random forest model to predict the average test score for the final exam.

After developing an initial random forest model, you are unsatisfied with the overall accuracy. You realize that there are too many hyperparameters to choose from, and each one has a lot of possible values. You have decided to make a list of possible ranges for the hyperparameters you might use in your next model.

Your professor has provided de-identified data for the last ten quizzes to act as the training data. There are 30 students in your class.

In [5]:
from sklearn.ensemble import RandomForestRegressor

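# Note: 'warn' and 'auto' are legacy values from older scikit-learn releases.
# get_params() simply echoes them back, but recent releases reject them at fit time.
rfr = RandomForestRegressor(n_estimators='warn',
                            max_features='auto',
                            random_state=1111)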

rfr.get_params()

{'bootstrap': True,
 'ccp_alpha': 0.0,
 'criterion': 'mse',
 'max_depth': None,
 'max_features': 'auto',
 'max_leaf_nodes': None,
 'max_samples': None,
 'min_impurity_decrease': 0.0,
 'min_impurity_split': None,
 'min_samples_leaf': 1,
 'min_samples_split': 2,
 'min_weight_fraction_leaf': 0.0,
 'n_estimators': 'warn',
 'n_jobs': None,
 'oob_score': False,
 'random_state': 1111,
 'verbose': 0,
 'warm_start': False}

In [6]:
# Maximum Depth
max_depth = [4, 8, 12]

# Minimum samples for a split
min_samples_split = [2, 5, 10]

# Max features 
max_features = [4, 6, 8, 10]

Hyperparameter tuning requires selecting the parameters to tune, as well as the possible values these parameters can be set to.

### Running a model using ranges
You have just finished creating a list of hyperparameters and ranges to use when tuning a predictive model for an assignment.

In [7]:
from sklearn.ensemble import RandomForestRegressor
import random

# Fill in rfr using your variables, picking one value at random from each range
rfr = RandomForestRegressor(
    n_estimators=100,
    max_depth=random.choice(max_depth),
    min_samples_split=random.choice(min_samples_split),
    max_features=random.choice(max_features)
)

# Print out the parameters
print(rfr.get_params())

{'bootstrap': True, 'ccp_alpha': 0.0, 'criterion': 'mse', 'max_depth': 4, 'max_features': 6, 'max_leaf_nodes': None, 'max_samples': None, 'min_impurity_decrease': 0.0, 'min_impurity_split': None, 'min_samples_leaf': 1, 'min_samples_split': 2, 'min_weight_fraction_leaf': 0.0, 'n_estimators': 100, 'n_jobs': None, 'oob_score': False, 'random_state': None, 'verbose': 0, 'warm_start': False}


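**Notice** that `min_samples_split` was randomly set to 2. Because neither the `random` module nor the model was given a seed here (`random_state` is `None` in the output), rerunning the cell can select different values; seed the generator first if you need the same choice every time.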
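A minimal sketch of making the random selection repeatable, assuming the standard-library `random` module is the only source of randomness; the seed value `1111` and the variable name `min_samples_split_range` are illustrative choices.

```python
import random

min_samples_split_range = [2, 5, 10]

# Seeding before each draw makes the "random" pick repeatable
random.seed(1111)
first = random.choice(min_samples_split_range)

random.seed(1111)
second = random.choice(min_samples_split_range)

print(first == second)
```

Without the calls to `random.seed()`, the two draws are independent and may differ.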

## RandomizedSearchCV

- **Grid Search**: Grid search is a process that searches exhaustively through a manually specified subset of the hyperparameter space of the targeted algorithm. 
    - *Benefits*
        - Tests every possible combination
    - *Drawbacks*
        - Additional hyperparameters increase training time exponentially

- Two common **alternatives** to grid search each have advantages over it:
    - **Random searching**: randomly selects combinations of hyperparameter values from the lists of possible ranges, testing only a fixed number of them.
    - **Bayesian optimization**: uses the past results of each test to update the hyperparameters for the next run. 
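To make the cost difference concrete, the sketch below counts the combinations a grid search would have to test and then draws a fixed number of them at random. It uses only the standard library; the ranges are illustrative and the budget of 10 candidates is an assumption.

```python
import itertools
import random

# Illustrative ranges for three hyperparameters
param_dict = {
    "max_depth": [2, 4, 6, 8],
    "max_features": [2, 4, 6, 8, 10],
    "min_samples_split": [2, 4, 8, 16],
}

# Grid search: every combination is tested
grid = list(itertools.product(*param_dict.values()))
print(len(grid))  # 4 * 5 * 4 = 80 combinations

# Random search: only a fixed budget of combinations is tested
random.seed(1111)
sampled = random.sample(grid, 10)
candidates = [dict(zip(param_dict.keys(), combo)) for combo in sampled]
print(candidates[0])
```

Adding a fourth hyperparameter with five values would grow the grid to 400 combinations, while the random-search budget stays at whatever number of iterations you choose.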

### Preparing for RandomizedSearch
Last semester your professor challenged your class to build a predictive model to predict final exam test scores. You tried running a few different models by randomly selecting hyperparameters. However, running each model required you to code it individually.

After learning about `RandomizedSearchCV()`, you're revisiting your professor's challenge to build the best model. In this exercise, you will prepare the three necessary inputs for completing a random search.

In [8]:
from sklearn.metrics import make_scorer, mean_squared_error
from sklearn.ensemble import RandomForestRegressor

# Finish the dictionary by adding the max_depth parameter
param_dict = {
    "max_depth": [2, 4, 6, 8],
    "max_features": [2, 4, 6, 8, 10],
    "min_samples_split": [2, 4, 8, 16]
}

# Create a random forest regression model
rfr = RandomForestRegressor(n_estimators=10, random_state=1111)

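# Create a scorer to use (use the mean squared error).
# MSE is an error, so greater_is_better=False makes the search minimize it.
scorer = make_scorer(mean_squared_error, greater_is_better=False)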

To use `RandomizedSearchCV()`, you need a distribution dictionary, an estimator, and a scorer. Once you've got these, you can run a random search to find the best parameters for your model.
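One detail worth checking when the scorer wraps an error metric: scikit-learn search objects always maximize the score, so `make_scorer` needs `greater_is_better=False` for losses such as mean squared error. The sketch below, on made-up data, shows that the resulting scorer returns the negated error.

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.metrics import make_scorer, mean_squared_error

# Made-up data: a straight line plus alternating noise
X_demo = np.arange(10, dtype=float).reshape(-1, 1)
y_demo = 3.0 * X_demo.ravel() + np.array([0.5, -0.5] * 5)

model = LinearRegression().fit(X_demo, y_demo)
mse = mean_squared_error(y_demo, model.predict(X_demo))

# A scorer is called as scorer(estimator, X, y); with greater_is_better=False
# it returns the negated metric, so "higher is better" still holds
loss_scorer = make_scorer(mean_squared_error, greater_is_better=False)
score = loss_scorer(model, X_demo, y_demo)

print(mse, score)
```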

### Implementing RandomizedSearchCV
You are hoping that using a random search algorithm will help you improve predictions for a class assignment. Your professor has challenged your class to predict the overall final exam average score.

In preparation for completing a random search, you have created:

- `param_dict`: the hyperparameter distributions
- `rfr`: a random forest regression model
- `scorer`: a scoring method to use

In [9]:
from sklearn.model_selection import RandomizedSearchCV

# Build a random search using param_dict, rfr, and scorer
random_search = RandomizedSearchCV(estimator=rfr,
                                   param_distributions=param_dict,
                                   n_iter=10,
                                   cv=5,
                                   scoring=scorer
                                  )

Although it takes a lot of steps, hyperparameter tuning with random search is well worth it and can improve the accuracy of your models. Plus, you are already using cross-validation to validate your best model.
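The search object above is built but never fitted in this notebook. The sketch below shows what fitting it might look like, using synthetic quiz scores for 30 students as a stand-in (an assumption; the real quiz data is not included here) and a scorer that negates the mean squared error.

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor
from sklearn.metrics import make_scorer, mean_squared_error
from sklearn.model_selection import RandomizedSearchCV

# Synthetic stand-in data: 30 students, 10 quiz scores each
rng = np.random.default_rng(1111)
X_demo = rng.uniform(50, 100, size=(30, 10))
y_demo = X_demo.mean(axis=1) + rng.normal(scale=2.0, size=30)

param_dict = {
    "max_depth": [2, 4, 6, 8],
    "max_features": [2, 4, 6, 8, 10],
    "min_samples_split": [2, 4, 8, 16],
}

search = RandomizedSearchCV(
    estimator=RandomForestRegressor(n_estimators=10, random_state=1111),
    param_distributions=param_dict,
    n_iter=10,          # number of sampled parameter combinations
    cv=5,               # 5-fold cross-validation for each combination
    scoring=make_scorer(mean_squared_error, greater_is_better=False),
    random_state=1111,  # makes the sampled combinations repeatable
)
search.fit(X_demo, y_demo)

print(search.best_params_)
print(-search.best_score_)  # flip the sign back to read it as an MSE
```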

## Selecting your final model


### Best classification accuracy
You are in a competition at work to build the best model for predicting the winner of a Tic-Tac-Toe game. You already ran a random search and saved the results of the most accurate model to `rs`.

In [10]:
tic_tac_toe = pd.read_csv('tic-tac-toe.csv')

# Create dummy variables using pandas
X = pd.get_dummies(tic_tac_toe.iloc[:, 0:9])
# Encode the target: 1 for 'positive', 0 otherwise
y = tic_tac_toe['Class'].apply(lambda x: 1 if x == 'positive' else 0)

In [11]:
from sklearn.ensemble import RandomForestClassifier

rfc = RandomForestClassifier(max_features='auto')

param_dist = {
    'max_depth': [2, 4, 8, 12],
    'min_samples_split': [2, 4, 6, 8],
    'n_estimators':[10, 20, 30]
}

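# Note: this search samples from param_dict (defined earlier), not the
# param_dist above, which is why max_features appears in the results below.
rs = RandomizedSearchCV(estimator=rfc,
                        param_distributions=param_dict,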
                        n_iter=10, 
                        cv=5,
                        scoring=None,
                        random_state=1111)

In [12]:
rs.fit(X, y)

RandomizedSearchCV(cv=5, error_score=nan,
                   estimator=RandomForestClassifier(bootstrap=True,
                                                    ccp_alpha=0.0,
                                                    class_weight=None,
                                                    criterion='gini',
                                                    max_depth=None,
                                                    max_features='auto',
                                                    max_leaf_nodes=None,
                                                    max_samples=None,
                                                    min_impurity_decrease=0.0,
                                                    min_impurity_split=None,
                                                    min_samples_leaf=1,
                                                    min_samples_split=2,
                                                    min_weight_fraction_leaf=0.0,
               

In [13]:
rs.best_score_

0.8654832024432808

In [14]:
rs.best_params_

{'max_depth': 8, 'max_features': 8, 'min_samples_split': 2}

However we perform hyperparameter tuning, in the end we need to select one final model.

The `best_estimator_` attribute contains the model that performed the best during cross-validation.

In [15]:
rs.best_estimator_

RandomForestClassifier(bootstrap=True, ccp_alpha=0.0, class_weight=None,
                       criterion='gini', max_depth=8, max_features=8,
                       max_leaf_nodes=None, max_samples=None,
                       min_impurity_decrease=0.0, min_impurity_split=None,
                       min_samples_leaf=1, min_samples_split=2,
                       min_weight_fraction_leaf=0.0, n_estimators=100,
                       n_jobs=None, oob_score=False, random_state=None,
                       verbose=0, warm_start=False)

### Using `.best_estimator_`

```python
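# Predict new data (X_new is a placeholder for unseen rows with the same columns as X):
rs.best_estimator_.predict(X_new)

# Check the parameters:
rs.best_estimator_.get_params()

# Save the model for later use (joblib is now a standalone package;
# sklearn.externals.joblib was removed from scikit-learn):
import joblib
joblib.dump(rs.best_estimator_, 'rfr_best_<date>.pkl')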
```
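A saved model can be checked by loading it back and comparing predictions. The sketch below is a self-contained round trip on made-up binary data, writing to a temporary directory; the file name `best_model.pkl` is a hypothetical choice.

```python
import os
import tempfile

import joblib
import numpy as np
from sklearn.ensemble import RandomForestClassifier

# Made-up binary features and target (assumed for illustration)
rng = np.random.default_rng(0)
X_demo = rng.integers(0, 2, size=(40, 6))
y_demo = X_demo[:, 0] | X_demo[:, 1]

model = RandomForestClassifier(n_estimators=10, random_state=0).fit(X_demo, y_demo)

# Save the fitted model, then load it back from disk
with tempfile.TemporaryDirectory() as tmp_dir:
    path = os.path.join(tmp_dir, "best_model.pkl")
    joblib.dump(model, path)
    restored = joblib.load(path)

# The restored model should reproduce the original predictions
same_predictions = bool((restored.predict(X_demo) == model.predict(X_demo)).all())
print(same_predictions)
```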

### Using `.cv_results_`

In [16]:
rs.cv_results_

{'mean_fit_time': array([0.16623254, 0.16883354, 0.15294428, 0.1837472 , 0.16054978,
        0.16837282, 0.16173091, 0.1714088 , 0.14909573, 0.1791543 ]),
 'mean_score_time': array([0.01078253, 0.01067472, 0.01005745, 0.01044941, 0.01048779,
        0.01149135, 0.0113945 , 0.01135135, 0.00982294, 0.01209359]),
 'mean_test_score': array([0.79133944, 0.72969568, 0.65974586, 0.82475458, 0.73278796,
        0.84349913, 0.73908159, 0.79238656, 0.65659359, 0.8654832 ]),
 'param_max_depth': masked_array(data=[6, 4, 2, 6, 4, 8, 4, 6, 2, 8],
              mask=[False, False, False, False, False, False, False, False,
                    False, False],
        fill_value='?',
             dtype=object),
 'param_max_features': masked_array(data=[6, 10, 10, 10, 8, 4, 10, 10, 10, 8],
              mask=[False, False, False, False, False, False, False, False,
                    False, False],
        fill_value='?',
             dtype=object),
 'param_min_samples_split': masked_array(data=[8, 16, 16

In [17]:
rs.cv_results_['mean_test_score']

array([0.79133944, 0.72969568, 0.65974586, 0.82475458, 0.73278796,
       0.84349913, 0.73908159, 0.79238656, 0.65659359, 0.8654832 ])

In [18]:
# The dictionary also contains the key "params",
# which holds the selected parameters for each model run.
rs.cv_results_['params']

[{'max_depth': 6, 'max_features': 6, 'min_samples_split': 8},
 {'max_depth': 4, 'max_features': 10, 'min_samples_split': 16},
 {'max_depth': 2, 'max_features': 10, 'min_samples_split': 16},
 {'max_depth': 6, 'max_features': 10, 'min_samples_split': 8},
 {'max_depth': 4, 'max_features': 8, 'min_samples_split': 4},
 {'max_depth': 8, 'max_features': 4, 'min_samples_split': 4},
 {'max_depth': 4, 'max_features': 10, 'min_samples_split': 8},
 {'max_depth': 6, 'max_features': 10, 'min_samples_split': 16},
 {'max_depth': 2, 'max_features': 10, 'min_samples_split': 2},
 {'max_depth': 8, 'max_features': 8, 'min_samples_split': 2}]

In [19]:
# Group the max depths:
max_depth = [item['max_depth'] for item in rs.cv_results_['params']]
scores = list(rs.cv_results_['mean_test_score'])

d = pd.DataFrame([max_depth, scores]).T
d.columns = ['Max Depth', 'Score']
d.groupby(['Max Depth']).mean()

| Max Depth | Score |
|---|---|
| 2.0 | 0.65817 |
| 4.0 | 0.733855 |
| 6.0 | 0.802827 |
| 8.0 | 0.854491 |


If we look at the output, max depths of $2$ and $4$ produced the lowest mean scores (about $0.66$ and $0.73$), a max depth of $6$ reached about $0.80$, and a max depth of $8$ performed best, averaging about $85\%$ accuracy.
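Another way to inspect the results, sketched below on synthetic data, is to load the whole `cv_results_` dictionary into a DataFrame. The column names (`param_<name>`, `mean_test_score`, `std_test_score`, `rank_test_score`) are the keys scikit-learn writes; the data and parameter ranges are assumptions for illustration.

```python
import numpy as np
import pandas as pd
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import RandomizedSearchCV

# Synthetic binary classification data (assumed for illustration)
rng = np.random.default_rng(0)
X_demo = rng.integers(0, 2, size=(100, 8))
y_demo = (X_demo[:, :3].sum(axis=1) >= 2).astype(int)

search = RandomizedSearchCV(
    RandomForestClassifier(n_estimators=10, random_state=0),
    param_distributions={"max_depth": [2, 4, 8], "min_samples_split": [2, 4, 8]},
    n_iter=5,
    cv=5,
    random_state=0,
)
search.fit(X_demo, y_demo)

# One row per sampled parameter combination
results = pd.DataFrame(search.cv_results_)
columns = ["param_max_depth", "param_min_samples_split",
           "mean_test_score", "std_test_score", "rank_test_score"]
summary = results[columns].sort_values("rank_test_score")
print(summary)

# Mean score per max_depth, comparable to the groupby shown earlier
results["max_depth"] = results["param_max_depth"].astype(int)
by_depth = results.groupby("max_depth")["mean_test_score"].mean()
print(by_depth)
```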

### Selecting the best precision model
Your boss has offered to pay for you to see three sports games this year. Of the 41 home games your favorite team plays, you want to ensure you go to three home games that they will definitely win. You build a model to decide which games your team will win.

To do this, you will build a random search algorithm and focus on model precision (to ensure your team wins). You also want to keep track of your best model and best parameters, so that you can use them again next year (if the model does well, of course).

In [20]:
sports = pd.read_csv('sports.csv')
sports.head()

Unnamed: 0,0,1,2,3,4,5,6,7,8,9,10,11,12,13,14,15,16,17,18,19,20,21,22,23,24,25,26,win
0,0,0,1,0,0,1,0,0,1,0,0,1,0,1,0,0,1,0,0,0,1,0,1,0,0,1,0,1
1,0,0,1,0,0,1,0,0,1,0,0,1,0,1,0,0,1,0,0,1,0,0,0,1,0,1,0,1
2,0,0,1,0,0,1,0,0,1,0,0,1,0,1,0,0,1,0,0,1,0,0,1,0,0,0,1,1
3,0,0,1,0,0,1,0,0,1,0,0,1,0,1,0,0,1,0,0,1,0,1,0,0,1,0,0,1
4,0,0,1,0,0,1,0,0,1,0,0,1,0,1,0,0,1,0,1,0,0,0,1,0,1,0,0,1


In [21]:
X = sports.drop('win', axis=1)
y = sports['win']

In [22]:
rfc = RandomForestClassifier()

param_dist = {
    'max_depth': range(2, 12, 2),
    'min_samples_split': range(2, 12, 2),
    'n_estimators': [10, 25, 50]
}

In [23]:
from sklearn.metrics import precision_score

# Create a precision scorer
precision = make_scorer(precision_score)

# Finalize the random search (as before, it samples from param_dict, not param_dist)
rs = RandomizedSearchCV(estimator=rfc,
                        param_distributions=param_dict,
                        scoring=precision,
                        cv=5,
                        n_iter=10,
                        random_state=1111)

rs.fit(X, y)

# Print the mean test scores:
print('The precision for each run was: {}.'.format(rs.cv_results_['mean_test_score']))
# Print the best model scores:
print('The best precision for a single model was: {}'.format(rs.best_score_))

The precision for each run was: [0.84026653 0.77758575 0.71609262 0.89357232 0.7645119  0.89167668
 0.7673058  0.86132285 0.70377573 0.95290194].
The best precision for a single model was: 0.9529019405138808


> ***Your model's best precision was about 95%: across the cross-validation folds, roughly 95% of the games it predicted as wins were actual wins. The mean test scores show that some of the other parameter sets did much worse (as low as about 0.70). Because these scores come from cross-validation rather than a single split, they are a more reliable estimate of how the model will perform on new games.***
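As a reminder of what the metric measures, precision is the share of predicted wins that were actual wins: true positives divided by all positive predictions. The sketch below works through the arithmetic on eight hypothetical games (the labels are invented for illustration).

```python
# Hypothetical labels for eight games: 1 = win, 0 = loss
y_true = [1, 1, 0, 1, 0, 0, 1, 0]
y_pred = [1, 1, 1, 1, 0, 0, 0, 0]

# True positives: predicted win and actually won
tp = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 1)
# False positives: predicted win but actually lost
fp = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 1)

precision = tp / (tp + fp)
print(precision)  # 3 / (3 + 1) = 0.75
```

The missed win (a false negative) does not lower precision, which is why this metric suits the goal of only attending games the team is very likely to win.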

## __Connect with Me__ 
--- 
[<img align="left" alt="codeSTACKr | Twitter" width="40px" src="https://cdn.jsdelivr.net/npm/simple-icons@v3/icons/twitter.svg" />][twitter] 
[<img align="left" alt="codeSTACKr | LinkedIn" width="40px" src="https://cdn.jsdelivr.net/npm/simple-icons@v3/icons/linkedin.svg" />][linkedin] 
[<img align="left" alt="codeSTACKr.com" width="40px" src="https://raw.githubusercontent.com/iconic/open-iconic/master/svg/globe.svg" />][StackExchange AI] 

[twitter]: https://twitter.com/F4izy 
[linkedin]: https://www.linkedin.com/in/mohd-faizy/ 
[StackExchange AI]: https://mohd-faizy.github.io