# 3 Methods for Hyperparameter Tuning with Random Forest

In this notebook, we'll explore 3 different approaches for hyperparameter tuning with a Random Forest classifier. These will include:

1. **Grid Search** : cycle through every configuration in a predetermined set of parameter values
2. **Randomized Search** : randomly select configurations from a set of parameter distributions
3. **Bayesian Optimization** : select each new configuration using a probabilistic model fitted to the scores of the configurations evaluated so far

Random Forest has many hyperparameters. A complete listing for the scikit-learn implementation can be found here: https://scikit-learn.org/stable/modules/generated/sklearn.ensemble.RandomForestClassifier.html. For the purpose of our work here, I'll only consider tuning the following:

* *criterion* : function used to measure quality of splits
* *n_estimators* : number of trees to include in the ensemble
* *max_depth* : maximum depth of each tree
* *min_samples_split* : minimum number of samples a node must contain before it can be split
* *min_samples_leaf* : minimum number of samples that must remain in each leaf node after a split
* *max_features* : rule used to determine how many features are considered when looking for the best split
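
Before tuning, it can help to see what values these six hyperparameters take by default. The snippet below is a small sketch that reads them from an unfitted classifier with `get_params()`; the `tuned` list is just a name introduced here for illustration, and the printed defaults depend on the installed scikit-learn version.

```python
from sklearn.ensemble import RandomForestClassifier

# the six hyperparameters tuned in this notebook (illustrative list name)
tuned = [
    "criterion",
    "n_estimators",
    "max_depth",
    "min_samples_split",
    "min_samples_leaf",
    "max_features",
]

# get_params() returns every constructor argument with its current value
defaults = RandomForestClassifier().get_params()

for name in tuned:
    print(f"{name}: {defaults[name]!r}")
```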

For this demonstration, we will create a toy dataset using scikit-learn's *make_classification*. 

We can start by importing the packages necessary here:

In [1]:
# imports
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import GridSearchCV, RandomizedSearchCV
from sklearn.metrics import (
    accuracy_score,
    precision_score,
    recall_score,
    f1_score,
    roc_auc_score
)
from skopt import BayesSearchCV
from skopt.space import Integer
from scipy.stats import poisson, randint
import numpy as np
from typing import Callable

Now let's create our dataset, and do a train-test split:

In [2]:
# load in and prepare data
X, y = make_classification(n_samples=5000, 
                           n_features=100, 
                           n_informative=50,
                           n_classes=2, 
                           weights=[0.6,0.4],
                           random_state=42)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

In [3]:
# helper function: print classification metrics on a held-out set
def print_results(clf: RandomForestClassifier, X_test: np.ndarray, y_test: np.ndarray) -> None:
    """Print accuracy, precision, recall, F1 and ROC AUC for a fitted classifier."""
    y_pred = clf.predict(X_test)
    # ROC AUC needs the predicted probability of the positive class (column 1)
    y_prob = clf.predict_proba(X_test)[:, 1]
    print(f"accuracy score: {accuracy_score(y_test, y_pred):.2f}")
    print(f"precision score: {precision_score(y_test, y_pred):.2f}")
    print(f"recall score: {recall_score(y_test, y_pred):.2f}")
    print(f"f1 score: {f1_score(y_test, y_pred):.2f}")
    print(f"ROC AUC score: {roc_auc_score(y_test, y_prob)}")

## Baseline

To be able to measure the effects of our tuning, let's first measure how well Random Forest does on the test set with all default hyperparameter values:

In [4]:
%%time

# fit a model with default parameters
clf = RandomForestClassifier(random_state=42)
clf.fit(X_train, y_train)

# compute performance on test set
print_results(clf, X_test, y_test)

accuracy score: 0.88
precision score: 0.96
recall score: 0.72
f1 score: 0.83
ROC AUC score: 0.9547358510304426
CPU times: user 1.79 s, sys: 7.59 ms, total: 1.8 s
Wall time: 1.79 s


## Grid Search

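Grid Search is a brute-force approach to hyperparameter tuning: it evaluates every combination of the listed parameter values. Each parameter configuration will be validated using 5-fold cross-validation. Afterwards, the best configuration will be refit on the full training set, and tested against our held-out test set.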
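
To get a feel for the cost of this approach before running it, the sketch below counts the configurations in the grid used in this section and the number of model fits that 5-fold cross-validation implies. It uses only the standard library; the variable names are illustrative.

```python
from itertools import product
from math import prod

# same grid as the one searched in this section
parameters = {
    "criterion": ["gini", "entropy", "log_loss"],
    "n_estimators": [50, 100, 500],
    "max_depth": [1, 5, 10, 15],
    "min_samples_split": [2, 4, 6, 8],
    "min_samples_leaf": [1, 2, 3, 4],
    "max_features": ["sqrt", "log2"],
}
cv_folds = 5

# the grid is the Cartesian product of all value lists
n_configs = prod(len(values) for values in parameters.values())
n_fits = n_configs * cv_folds

print(f"configurations: {n_configs}")       # 1152
print(f"cross-validation fits: {n_fits}")   # 5760

# first configuration the product would yield, as a dict
first = dict(zip(parameters, next(product(*parameters.values()))))
print(first)
```

Note that in scikit-learn `"entropy"` and `"log_loss"` both refer to the Shannon information gain, so a third of this grid repeats work; dropping one of them would cut the search to 768 configurations.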

In [5]:
%%time

# setup parameter space
parameters = {
    'criterion':["gini", "entropy", "log_loss"],
    'n_estimators':[50, 100, 500],
    'max_depth':[1, 5, 10, 15],
    'min_samples_split':[2, 4, 6, 8],
    'min_samples_leaf':[1, 2, 3, 4],
    'max_features':["sqrt", "log2"]
}

# create an instance of the grid search object
g = GridSearchCV(RandomForestClassifier(random_state=42), parameters, cv=5, n_jobs=-1)

# conduct grid search over the parameter space
g.fit(X_train, y_train)

# show best parameter configuration found for classifier
cls_params = g.best_params_
cls_params

CPU times: user 17 s, sys: 1.96 s, total: 18.9 s
Wall time: 32min 36s


{'criterion': 'entropy',
 'max_depth': 15,
 'max_features': 'sqrt',
 'min_samples_leaf': 3,
 'min_samples_split': 2,
 'n_estimators': 500}

In [6]:
# compute performance on test set
print_results(g.best_estimator_, X_test, y_test)

accuracy score: 0.88
precision score: 0.97
recall score: 0.73
f1 score: 0.83
ROC AUC score: 0.9658667224605083


## Randomized Search

Randomized Search evaluates a fixed number of randomly drawn configurations. Non-categorical hyperparameters are sampled from probability distributions, while categorical ones are sampled uniformly from a list. Each parameter configuration will be validated using 5-fold cross-validation. Afterwards, the best configuration will be refit on the full training set, and tested against our held-out test set.
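
The sketch below shows what sampling from such a search space looks like, using scikit-learn's `ParameterSampler` on a reduced version of the space (the `space`, `samples`, `draws` and `p_zero` names are illustrative). It also checks two properties of the distributions that are easy to overlook: `scipy.stats.randint` excludes its upper bound, and a Poisson distribution can return 0, which is not a valid `max_depth`.

```python
from scipy.stats import poisson, randint
from sklearn.model_selection import ParameterSampler

# reduced search space, for illustration only
space = {
    "max_depth": poisson(mu=10),
    "min_samples_split": randint(low=2, high=5),
    "max_features": ["sqrt", "log2"],
}

# draw 5 configurations the same way a randomized search would
samples = list(ParameterSampler(space, n_iter=5, random_state=42))
for sample in samples:
    print(sample)

# randint(low=2, high=5) only produces 2, 3 or 4: the upper bound is exclusive
draws = randint(low=2, high=5).rvs(size=1000, random_state=0)
print(sorted(set(draws.tolist())))

# probability that poisson(mu=10) returns 0 (small, but not impossible)
p_zero = poisson(mu=10).pmf(0)
print(f"P(max_depth == 0) = {p_zero:.2e}")
```

If a draw of 0 is a concern, one option is to shift the distribution, for example `poisson(mu=9, loc=1)`, so that the smallest possible value is 1.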

In [7]:
%%time

# setup parameter space
parameters = {
    'criterion':["gini", "entropy", "log_loss"],
    'n_estimators':poisson(mu=500),
    'max_depth':poisson(mu=10),
    'min_samples_split':randint(low=2, high=5),
    'min_samples_leaf':randint(low=1, high=5),
    'max_features':["sqrt", "log2"]
}

# create an instance of the randomized search object
r = RandomizedSearchCV(RandomForestClassifier(random_state=42), parameters, cv=5, n_iter=10, random_state=42, n_jobs=-1)

# conduct randomized search over the parameter space
r.fit(X_train,y_train)

# show best parameter configuration found for classifier
cls_params2 = r.best_params_
cls_params2

CPU times: user 7.89 s, sys: 57.9 ms, total: 7.95 s
Wall time: 52.1 s


{'criterion': 'gini',
 'max_depth': 14,
 'max_features': 'sqrt',
 'min_samples_leaf': 4,
 'min_samples_split': 4,
 'n_estimators': 487}

In [8]:
# compute performance on test set
print_results(r.best_estimator_, X_test, y_test)

accuracy score: 0.89
precision score: 0.98
recall score: 0.73
f1 score: 0.84
ROC AUC score: 0.9621090072183283


## Bayesian Optimization

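The final method we'll try is Bayesian optimization. Instead of choosing configurations independently, it fits a probabilistic surrogate model to the scores of the configurations evaluated so far, and uses that model to decide which configuration to try next. Like before, the search space for non-categorical hyperparameters is defined by a set of distributions, here given as a range plus a prior (`uniform` or `log-uniform`) for each parameter. Care will be needed when selecting these prior distributions. Each parameter configuration will be validated using 5-fold cross-validation. Afterwards, the best configuration will be refit on the full training set, and tested against our held-out test set.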
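
To make the idea concrete, here is a minimal sketch of a Bayesian optimization loop on a one-dimensional toy problem. It is not how `BayesSearchCV` is implemented internally: the `objective` function is a hypothetical stand-in for a cross-validated score, and the Gaussian process surrogate with an expected-improvement rule is just one common choice, assumed here for illustration.

```python
import numpy as np
from scipy.stats import norm
from sklearn.gaussian_process import GaussianProcessRegressor
from sklearn.gaussian_process.kernels import Matern


def objective(x):
    """Hypothetical stand-in for a cross-validated score (higher is better)."""
    return -(x - 0.3) ** 2


def expected_improvement(candidates, gp, best_score, xi=0.01):
    """Expected improvement of each candidate over the best score seen so far."""
    mean, std = gp.predict(candidates.reshape(-1, 1), return_std=True)
    std = np.maximum(std, 1e-9)  # avoid division by zero at observed points
    z = (mean - best_score - xi) / std
    return (mean - best_score - xi) * norm.cdf(z) + std * norm.pdf(z)


rng = np.random.default_rng(42)
candidates = np.linspace(0.0, 1.0, 201)

# start from a few random evaluations, as a randomized search would
x_obs = list(rng.uniform(0.0, 1.0, size=3))
y_obs = [objective(x) for x in x_obs]

for _ in range(7):
    # fit the surrogate model to everything evaluated so far
    gp = GaussianProcessRegressor(kernel=Matern(nu=2.5), alpha=1e-6, normalize_y=True)
    gp.fit(np.array(x_obs).reshape(-1, 1), np.array(y_obs))

    # evaluate next the candidate with the highest expected improvement
    ei = expected_improvement(candidates, gp, max(y_obs))
    x_next = candidates[int(np.argmax(ei))]
    x_obs.append(x_next)
    y_obs.append(objective(x_next))

best_x = x_obs[int(np.argmax(y_obs))]
print(f"best x: {best_x:.3f}, best score: {max(y_obs):.4f}")
```

The key difference from Randomized Search is inside the loop: each new point depends on all previous results, which is why the evaluations cannot be fully parallelized in the same way.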

In [9]:
%%time

# setup parameter space
parameters = {
    'criterion':["gini", "entropy", "log_loss"],
    'n_estimators':Integer(50,1000,prior='uniform'),
    'max_depth':Integer(1,20,prior='uniform'),
    'min_samples_split':Integer(2,5,prior='log-uniform'),
    'min_samples_leaf':Integer(1,5,prior='log-uniform'),
    'max_features':["sqrt", "log2"]
}

# create an instance of the bayesian search object
b = BayesSearchCV(RandomForestClassifier(random_state=42), parameters, cv=5, n_iter=10, random_state=42, n_jobs=-1)

# conduct Bayesian search over the parameter space
b.fit(X_train,y_train)

# show best parameter configuration found for classifier
cls_params3 = b.best_params_
cls_params3

CPU times: user 9.86 s, sys: 60.4 ms, total: 9.92 s
Wall time: 58.9 s


OrderedDict([('criterion', 'entropy'),
             ('max_depth', 18),
             ('max_features', 'sqrt'),
             ('min_samples_leaf', 2),
             ('min_samples_split', 2),
             ('n_estimators', 481)])

In [10]:
# compute performance on test set
print_results(b.best_estimator_, X_test, y_test)

accuracy score: 0.89
precision score: 0.98
recall score: 0.74
f1 score: 0.84
ROC AUC score: 0.9657453708546919


## Conclusion

We have tuned 6 key hyperparameters for Random Forest, using 3 popular techniques for parameter tuning. All results listed below are dependent on the dataset used in this notebook:

* Bayesian Optimization takes a bit longer to run than Randomized Search (58.9 s vs 52.1 s of wall time), but yields a slightly higher ROC AUC (0.9657 vs 0.9621) for the same number of iterations. The other metrics differ by at most 0.01.
* Both Randomized Search and Bayesian Optimization benefit from being able to handle distributions for their non-categorical hyperparameters.
* Grid Search is by far the slowest method (about 32.5 minutes of wall time for 1,152 configurations, against 10 configurations for each of the other two methods), and requires the candidate values to be listed explicitly rather than drawn from a distribution. Its results are only slightly better than the baseline: ROC AUC rises from 0.9547 to 0.9659, while accuracy and F1 stay at 0.88 and 0.83.
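
To put the wall times in context, the short sketch below relates each search's reported wall time to the number of configurations it evaluated. All numbers are copied from the cells in this notebook; the `runs` dictionary is an illustrative structure, and the timings include the final refit and depend on the hardware and on `n_jobs`, so treat the ratios as rough.

```python
# wall times (seconds) and configuration counts taken from this notebook
runs = {
    "grid search": {"wall_s": 32 * 60 + 36, "configs": 3 * 3 * 4 * 4 * 4 * 2},
    "randomized search": {"wall_s": 52.1, "configs": 10},
    "bayesian optimization": {"wall_s": 58.9, "configs": 10},
}

for name, run in runs.items():
    per_config = run["wall_s"] / run["configs"]
    print(f"{name}: {run['configs']} configs, "
          f"{run['wall_s']:.1f} s total, {per_config:.2f} s per config")

# how many times longer the grid search ran than the randomized search
slowdown = runs["grid search"]["wall_s"] / runs["randomized search"]["wall_s"]
print(f"grid search took about {slowdown:.0f}x longer than randomized search")
```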