# Notebook Instructions
<i>You can run the notebook document sequentially (one cell at a time) by pressing <b>Shift + Enter</b>. While a cell is running, a [*] will display on the left. When it has been run, a number will display indicating the order in which it was run in the notebook, for example [8].</i>

<i>Enter edit mode by pressing <b>`Enter`</b> or using the mouse to click on a cell's editor area. Edit mode is indicated by a green cell border and a prompt showing in the editor area.</i>

# Hyperparameter tuning

Hyperparameters cannot be learned by the model; the user must specify them before training. In this notebook, we will find the best hyperparameters for the random forest model created in the previous section, using the random search and grid search cross-validation techniques.
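
To make the distinction concrete, here is a minimal sketch on made-up data (not the AAPL dataset). It shows that a hyperparameter such as n_estimators is fixed when the model is created, while the trees themselves only exist after fitting. The names `X_toy` and `y_toy` are assumptions introduced purely for illustration.

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor

# Toy data, assumed for illustration only
rng = np.random.RandomState(0)
X_toy = rng.normal(size=(100, 3))
y_toy = X_toy[:, 0] + rng.normal(scale=0.1, size=100)

# The hyperparameter is chosen by the user before training
model = RandomForestRegressor(n_estimators=10, random_state=0)
chosen_n_estimators = model.get_params()['n_estimators']

# Before fitting, no trees have been learned yet
has_trees_before_fit = hasattr(model, 'estimators_')

# Fitting learns the trees; their number equals the hyperparameter
model.fit(X_toy, y_toy)
n_trees_after_fit = len(model.estimators_)

print(chosen_n_estimators, has_trees_before_fit, n_trees_after_fit)
```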

Let's start with the steps below, which you already know!
1. Import the data
2. Define predictor variables and a target variable
3. Split the data into train and test datasets

In [1]:
import pandas as pd
data = pd.read_csv('AAPL.csv')

# Returns
data['ret1'] = data.Adj_Close.pct_change()
# pd.rolling_sum and pd.rolling_std were removed from pandas; use Series.rolling instead
data['ret5'] = data.ret1.rolling(5).sum()
data['ret10'] = data.ret1.rolling(10).sum()
data['ret20'] = data.ret1.rolling(20).sum()
data['ret40'] = data.ret1.rolling(40).sum()

# Standard Deviation
data['std5'] = data.ret1.rolling(5).std()
data['std10'] = data.ret1.rolling(10).std()
data['std20'] = data.ret1.rolling(20).std()
data['std40'] = data.ret1.rolling(40).std()

# Future returns
data['retFut1'] = data.ret1.shift(-1)

# Define predictor variables (X) and a target variable (y)
data = data.dropna()
predictor_list = ['ret1','ret5', 'ret10', 'ret20', 'ret40', 'std5', 'std10', 'std20', 'std40']
X = data[predictor_list]
y = data.retFut1

# Split the data into train and test dataset
train_length = int(len(data)*0.80)
X_train = X[:train_length] 
X_test = X[train_length:]
y_train = y[:train_length]
y_test = y[train_length:]       
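
The rolling windows and the one-step shift used above leave missing values at the start and at the end of the data, which is why dropna() is called before defining X and y. The sketch below illustrates this on a tiny made-up series; `toy_returns` and the window of 3 are assumptions for illustration, not values from the notebook.

```python
import pandas as pd

# Toy series, assumed for illustration only
toy_returns = pd.Series([1.0, 2.0, 3.0, 4.0, 5.0])

# Rolling sum over 3 observations: the first two rows have no full window
rolling_sum_3 = toy_returns.rolling(3).sum()

# Next-period value: the last row has no future observation
next_value = toy_returns.shift(-1)

frame = pd.DataFrame({'ret': toy_returns,
                      'sum3': rolling_sum_3,
                      'next': next_value})

# Only rows with a full window and a known future value survive
clean = frame.dropna()

print(frame)
print(clean)
```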

The key hyperparameters of the random forest method are
- n_estimators,
- max_features, 
- max_depth, 
- min_samples_leaf, 
- and bootstrap.   

Below, we define a range of values for each of these hyperparameters.

In [2]:
import numpy as np

# Number of trees in random forest
n_estimators = [int(x) for x in np.linspace(start = 10, stop = 20, num = 5)]

# Number of features to consider at every split
max_features = [round(x,2) for x in np.linspace(start = 0.3, stop = 1.0, num = 5)]

# Max depth of the tree
# (newer scikit-learn releases may reject float values here; use int(x) in that case)
max_depth = [round(x,2) for x in np.linspace(start = 2, stop = 10, num = 5)]

# Minimum number of samples required at each leaf node
min_samples_leaf = [int(x) for x in np.linspace(start = 300, stop = 600, num = 10)]

# Method of selecting training subset for training each tree
bootstrap = [True, False]

# Save these parameters in a dictionary
param_grid = {'n_estimators': n_estimators,
               'max_features': max_features,
               'max_depth': max_depth,
               'min_samples_leaf': min_samples_leaf,
               'bootstrap': bootstrap
              }

# Print the dictionary
param_grid

{'bootstrap': [True, False],
 'max_depth': [2.0, 4.0, 6.0, 8.0, 10.0],
 'max_features': [0.3, 0.47, 0.65, 0.82, 1.0],
 'min_samples_leaf': [300, 333, 366, 400, 433, 466, 500, 533, 566, 600],
 'n_estimators': [10, 12, 15, 17, 20]}

## Random Search
The RandomizedSearchCV class from the sklearn.model_selection module is used to find the best hyperparameter values.

In [4]:
from sklearn.model_selection import RandomizedSearchCV

# Uncomment the line below to see details about RandomizedSearchCV
# help(RandomizedSearchCV)

In [5]:
# Create the base model to tune
from sklearn.ensemble import RandomForestRegressor
random_forest = RandomForestRegressor()

RandomizedSearchCV takes the following parameters as input:

1. estimator: The base model for which the best hyperparameter values are searched.
2. param_distributions: Dictionary mapping parameter names to the lists of values to try.
3. n_iter: Number of parameter combinations that are sampled and tried.
4. random_state: The random seed value.
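
To get a feel for what n_iter means, the sketch below counts all combinations in the parameter dictionary defined earlier and draws a random sample of them. It uses ParameterGrid and ParameterSampler from sklearn.model_selection only to illustrate the idea; the value `n_iter=50` is an assumed setting for this illustration.

```python
import numpy as np
from sklearn.model_selection import ParameterGrid, ParameterSampler

# Same value ranges as the dictionary defined earlier in this notebook
param_grid = {
    'n_estimators': [int(x) for x in np.linspace(start=10, stop=20, num=5)],
    'max_features': [round(x, 2) for x in np.linspace(start=0.3, stop=1.0, num=5)],
    'max_depth': [round(x, 2) for x in np.linspace(start=2, stop=10, num=5)],
    'min_samples_leaf': [int(x) for x in np.linspace(start=300, stop=600, num=10)],
    'bootstrap': [True, False],
}

# Total number of distinct combinations in the dictionary
total_combinations = len(ParameterGrid(param_grid))

# A random search only tries a sample of them (n_iter=50 is assumed here)
sampled = list(ParameterSampler(param_grid, n_iter=50, random_state=42))

print('All combinations:', total_combinations)
print('Sampled combinations:', len(sampled))
print('First sampled combination:', sampled[0])
```

With these ranges, a random search with 50 iterations covers only a small fraction of the full set of combinations, which is what keeps it fast.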

In [6]:
# Random search of parameters by searching across 50 different combinations
rf_random = RandomizedSearchCV(estimator = random_forest, 
                               param_distributions = param_grid, 
                               n_iter = 50,                               
                               random_state= 42 
                               )

# Fit the model to find the best hyperparameter values
rf_random.fit(X_train, y_train)

RandomizedSearchCV(cv=None, error_score='raise',
          estimator=RandomForestRegressor(bootstrap=True, criterion='mse', max_depth=None,
           max_features='auto', max_leaf_nodes=None,
           min_impurity_decrease=0.0, min_impurity_split=None,
           min_samples_leaf=1, min_samples_split=2,
           min_weight_fraction_leaf=0.0, n_estimators=10, n_jobs=1,
           oob_score=False, random_state=None, verbose=0, warm_start=False),
          fit_params=None, iid=True, n_iter=50, n_jobs=1,
          param_distributions={'n_estimators': [10, 12, 15, 17, 20], 'max_features': [0.3, 0.47, 0.65, 0.82, 1.0], 'bootstrap': [True, False], 'max_depth': [2.0, 4.0, 6.0, 8.0, 10.0], 'min_samples_leaf': [300, 333, 366, 400, 433, 466, 500, 533, 566, 600]},
          pre_dispatch='2*n_jobs', random_state=42, refit=True,
          return_train_score='warn', scoring=None, verbose=0)

The best hyperparameter values found for the random forest model are shown below.

In [7]:
rf_random.best_params_

{'bootstrap': False,
 'max_depth': 2.0,
 'max_features': 0.3,
 'min_samples_leaf': 333,
 'n_estimators': 17}

In this step, we take the model with the best hyperparameter values (best_estimator_), set its random_state, and fit it on the train dataset.

In [8]:
# Assign the best model to best_random_forest
best_random_forest = rf_random.best_estimator_

# Initialize random_state to 42
best_random_forest.random_state = 42

# Fit the best random forest model on train dataset
best_random_forest.fit(X_train, y_train)

RandomForestRegressor(bootstrap=False, criterion='mse', max_depth=2.0,
           max_features=0.3, max_leaf_nodes=None,
           min_impurity_decrease=0.0, min_impurity_split=None,
           min_samples_leaf=333, min_samples_split=2,
           min_weight_fraction_leaf=0.0, n_estimators=17, n_jobs=1,
           oob_score=False, random_state=42, verbose=0, warm_start=False)

## Grid Search

Similarly, we can find the best model using the grid search cross-validation technique. Because this method tries out every possible combination, it is time consuming, so we define fewer hyperparameter values below for illustration purposes only. You may specify more values for each hyperparameter.

In [8]:
# Number of trees in random forest
n_estimators = [int(x) for x in np.linspace(start = 10, stop = 20, num = 3)]

# Number of features to consider at every split
max_features = [round(x,2) for x in np.linspace(start = 0.3, stop = 1.0, num = 3)]

# Minimum number of samples required at each leaf node
min_samples_leaf = [int(x) for x in np.linspace(start = 300, stop = 600, num = 3)]

# Method of selecting training subset for training each tree
bootstrap = [True, False]

# Create the parameter grid
param_grid = {'n_estimators': n_estimators,
               'max_features': max_features,
               'min_samples_leaf': min_samples_leaf,
               'bootstrap': bootstrap
              }

param_grid

{'bootstrap': [True, False],
 'max_features': [0.3, 0.65, 1.0],
 'min_samples_leaf': [300, 450, 600],
 'n_estimators': [10, 15, 20]}
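
Because grid search fits a model for every combination and every cross-validation fold, it helps to estimate the number of fits before running it. The sketch below does this for the grid above. The number of folds, `n_folds = 5`, is an assumption; the default used by GridSearchCV depends on the scikit-learn version.

```python
from sklearn.model_selection import ParameterGrid

# Same values as the grid printed above
grid = {'n_estimators': [10, 15, 20],
        'max_features': [0.3, 0.65, 1.0],
        'min_samples_leaf': [300, 450, 600],
        'bootstrap': [True, False]}

# Every combination in the grid is tried
n_combinations = len(ParameterGrid(grid))

# Assumed number of cross-validation folds
n_folds = 5

# One model fit per combination and per fold
n_fits = n_combinations * n_folds

print('Combinations:', n_combinations)
print('Model fits with', n_folds, 'folds:', n_fits)
```

Adding one more value to any list multiplies the count, so the cost grows quickly as the grid gets finer.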

The code below finds the best hyperparameter values.

In [9]:
from sklearn.model_selection import GridSearchCV

# Uncomment the line below to see details about GridSearchCV
# help(GridSearchCV)

# Grid search of parameters by searching all the possible combinations
rf_grid = GridSearchCV(estimator = random_forest,
                       param_grid = param_grid
                       )

# Fit the model to find the best hyperparameter values
rf_grid.fit(X_train, y_train)

# Best hyperparameter values
rf_grid.best_params_

{'bootstrap': False,
 'max_features': 0.3,
 'min_samples_leaf': 300,
 'n_estimators': 15}

## Practice

You can try for yourself how the random forest models found through RandomizedSearchCV and GridSearchCV perform on the test dataset.
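
As a starting point, here is one possible sketch of such a comparison. It runs on made-up data so that it is self-contained; the arrays, the small grid `small_grid`, and the choice of mean squared error as the metric are all assumptions for illustration, not part of the notebook. In practice you would use X_train, X_test, y_train, y_test and the fitted rf_random and rf_grid objects from the cells above.

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor
from sklearn.metrics import mean_squared_error
from sklearn.model_selection import GridSearchCV, RandomizedSearchCV

# Made-up data, assumed for illustration only
rng = np.random.RandomState(42)
X = rng.normal(size=(400, 4))
y = 0.5 * X[:, 0] + rng.normal(scale=0.1, size=400)

# Same 80/20 ordered split as in the notebook
train_length = int(len(X) * 0.80)
X_train, X_test = X[:train_length], X[train_length:]
y_train, y_test = y[:train_length], y[train_length:]

# Small hypothetical grid so the example runs quickly
small_grid = {'n_estimators': [10, 20],
              'min_samples_leaf': [5, 20],
              'max_depth': [2, 4]}

rf_random = RandomizedSearchCV(RandomForestRegressor(random_state=42),
                               param_distributions=small_grid,
                               n_iter=4, cv=3, random_state=42)
rf_grid = GridSearchCV(RandomForestRegressor(random_state=42),
                       param_grid=small_grid, cv=3)

# Fit each search, then score its best model on the held-out test data
test_mse = {}
for name, search in [('random', rf_random), ('grid', rf_grid)]:
    search.fit(X_train, y_train)
    predictions = search.predict(X_test)  # predicts with the refitted best estimator
    test_mse[name] = mean_squared_error(y_test, predictions)
    print(name, search.best_params_, test_mse[name])
```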