## Hyperparameter Tuning: Score Estimator
This notebook was produced in Google Colab.

In [1]:
import pickle as pkl
import pandas as pd

from sklearn.model_selection import RandomizedSearchCV, GridSearchCV
from scipy.stats import uniform, truncnorm, randint

from sklearn.ensemble import RandomForestRegressor

### Load train and test data

Pull in the topic-document matrices of our train and test data from Google Drive. 

In [11]:
X_train = pd.read_csv('drive/My Drive/Data Science/X_train_CV_with_len.csv', index_col=0)
X_test = pd.read_csv('drive/My Drive/Data Science/X_test_CV_with_len.csv', index_col=0)
y_train = pd.read_csv('drive/My Drive/Data Science/y_train.csv', index_col=0)
y_test = pd.read_csv('drive/My Drive/Data Science/y_test.csv', index_col=0)

### Define the parameter distributions for the randomized search

In [15]:
param_grid = {
    # randomly sample an integer number of estimators from 4 to 199 (the upper bound 200 is exclusive)
    'n_estimators': randint(4,200),
    'bootstrap': [True, False],
    'max_depth': [5, 10, 20, 50, 75, 100, None],
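    # 1.0 uses all features; it replaces 'auto', which meant the same for regressors
    # and was removed in scikit-learn 1.3
    'max_features': [1.0, 'sqrt', 0.2],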
    # uniform(loc, scale) samples from [loc, loc + scale], i.e. from 0.01 to 0.209 here
    'min_samples_split': uniform(0.01, 0.199),
}
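
As a quick sanity check on the ranges above, the two `scipy.stats` distributions can be sampled directly. This is a minimal sketch rather than part of the original run; the names `n_draws` and `split_draws` are made up for illustration.

```python
from scipy.stats import randint, uniform

# randint(low, high) draws integers from low up to high - 1 (high is exclusive).
n_estimators_dist = randint(4, 200)
# uniform(loc, scale) covers the interval [loc, loc + scale].
min_split_dist = uniform(0.01, 0.199)

# Draw a sample from each distribution (hypothetical variable names).
n_draws = n_estimators_dist.rvs(size=1000, random_state=0)
split_draws = min_split_dist.rvs(size=1000, random_state=0)

# The observed extremes stay inside [4, 199] and [0.01, 0.209].
print(n_draws.min(), n_draws.max())
print(split_draws.min(), split_draws.max())
```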

### Run Randomized Search

We'll initialize a random forest regressor and evaluate randomly sampled combinations of the parameters specified above.

For each combination we run 5-fold cross-validation. 
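
To make the cost concrete, here is a rough sketch of what the search does for a single candidate: fit it on each of the 5 folds and average the validation scores. It runs on synthetic data from `make_regression`, not on our topic-document matrices, and the candidate's settings are arbitrary placeholders.

```python
from sklearn.datasets import make_regression
from sklearn.ensemble import RandomForestRegressor
from sklearn.model_selection import cross_val_score

# Synthetic stand-in data (an assumption; the notebook uses its own CSV files).
X_demo, y_demo = make_regression(n_samples=150, n_features=10, noise=5.0, random_state=0)

# One arbitrary candidate, scored with 5-fold cross-validation on r^2.
candidate = RandomForestRegressor(n_estimators=20, max_depth=5, random_state=0)
fold_scores = cross_val_score(candidate, X_demo, y_demo, cv=5, scoring='r2')
mean_score = fold_scores.mean()  # the quantity the search compares across candidates

# With n_iter candidates and 5 folds each, the search performs n_iter * 5 fits.
n_iter, n_folds = 100, 5
total_fits = n_iter * n_folds
print(len(fold_scores), total_fits)
```

With `n_iter=100` and `cv=5`, this comes to 500 fits, plus one final refit of the best candidate on the full training set.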

In [16]:
rf_model = RandomForestRegressor()

In [17]:
clf = RandomizedSearchCV(rf_model, param_grid, n_iter=100, cv=5, random_state=77, verbose=50, n_jobs=-1, scoring='r2')

model = clf.fit(X_train, y_train)

Fitting 5 folds for each of 100 candidates, totalling 500 fits
[Parallel(n_jobs=-1)]: Using backend LokyBackend with 2 concurrent workers.
[Parallel(n_jobs=-1)]: Done   1 tasks      | elapsed:   28.6s
[Parallel(n_jobs=-1)]: Done   2 tasks      | elapsed:   30.3s
[Parallel(n_jobs=-1)]: Done   3 tasks      | elapsed:   59.2s
[Parallel(n_jobs=-1)]: Done   4 tasks      | elapsed:  1.0min
[Parallel(n_jobs=-1)]: Done   5 tasks      | elapsed:  1.1min
[Parallel(n_jobs=-1)]: Done   6 tasks      | elapsed:  1.1min
[Parallel(n_jobs=-1)]: Done   7 tasks      | elapsed:  1.2min
[Parallel(n_jobs=-1)]: Done   8 tasks      | elapsed:  1.2min
[Parallel(n_jobs=-1)]: Done   9 tasks      | elapsed:  1.3min
[Parallel(n_jobs=-1)]: Done  10 tasks      | elapsed:  1.3min
[Parallel(n_jobs=-1)]: Done  11 tasks      | elapsed:  1.4min
[Parallel(n_jobs=-1)]: Done  12 tasks      | elapsed:  1.5min
[Parallel(n_jobs=-1)]: Done  13 tasks      | elapsed:  1.5min
[Parallel(n_jobs=-1)]: Done  14 tasks      | elapsed:  

### Inspect the best parameters and mean validation R^2

In [18]:
print(model.best_params_)
print(model.best_score_)

{'bootstrap': True, 'max_depth': 100, 'max_features': 0.2, 'min_samples_split': 0.010222780897322526, 'n_estimators': 79}
0.2949310635642264


In [19]:
rand_best_params = model.best_params_
rand_best_score = model.best_score_
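
Beyond `best_params_` and `best_score_`, a fitted search object keeps per-candidate results in `cv_results_`, which helps show how close the runners-up were. The sketch below is illustrative only: it uses a tiny synthetic search (`demo_search`, a hypothetical name) so that it runs on its own, and the same DataFrame conversion should apply to the fitted `model` above.

```python
import pandas as pd
from scipy.stats import randint
from sklearn.datasets import make_regression
from sklearn.ensemble import RandomForestRegressor
from sklearn.model_selection import RandomizedSearchCV

# Small synthetic problem so the example runs in a few seconds.
X_demo, y_demo = make_regression(n_samples=120, n_features=8, noise=10.0, random_state=0)

demo_search = RandomizedSearchCV(
    RandomForestRegressor(random_state=0),
    {'n_estimators': randint(5, 20), 'max_depth': [3, 5, None]},
    n_iter=4, cv=3, scoring='r2', random_state=0,
)
demo_search.fit(X_demo, y_demo)

# One row per sampled candidate, best-ranked first.
results = pd.DataFrame(demo_search.cv_results_)
cols = ['params', 'mean_test_score', 'std_test_score', 'rank_test_score']
top = results.sort_values('rank_test_score')[cols]
print(top.head())
```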

### Apply Grid Search to inspect every combination in the vicinity of our best random parameters

In [24]:
g_param_grid = {
    'bootstrap': [True], 
    'max_depth': [90, 100, 120],
    'max_features': [0.1, 0.2, 0.3, 0.5],
    'min_samples_split': [0.005, 0.01, 0.05],
    'n_estimators': [70, 79, 90, 100]
    }
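
Unlike the randomized search, the grid search tries every combination, so it is worth counting the candidates before launching it. A small sketch using scikit-learn's `ParameterGrid`, with the 5 folds used throughout this notebook:

```python
from sklearn.model_selection import ParameterGrid

g_param_grid = {
    'bootstrap': [True],
    'max_depth': [90, 100, 120],
    'max_features': [0.1, 0.2, 0.3, 0.5],
    'min_samples_split': [0.005, 0.01, 0.05],
    'n_estimators': [70, 79, 90, 100],
}

# 1 * 3 * 4 * 3 * 4 combinations, each fitted once per fold.
n_candidates = len(ParameterGrid(g_param_grid))
n_fits = n_candidates * 5
print(n_candidates, n_fits)  # 144 720
```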

In [27]:
g_clf = GridSearchCV(rf_model, param_grid=g_param_grid, cv=5, verbose=50, n_jobs=-1, scoring='r2')

g_model = g_clf.fit(X_train, y_train)

Fitting 5 folds for each of 144 candidates, totalling 720 fits
[Parallel(n_jobs=-1)]: Using backend LokyBackend with 2 concurrent workers.
[Parallel(n_jobs=-1)]: Done   1 tasks      | elapsed:    2.3s
[Parallel(n_jobs=-1)]: Done   2 tasks      | elapsed:    2.3s
[Parallel(n_jobs=-1)]: Done   3 tasks      | elapsed:    4.5s
[Parallel(n_jobs=-1)]: Done   4 tasks      | elapsed:    4.6s
[Parallel(n_jobs=-1)]: Done   5 tasks      | elapsed:    6.8s
[Parallel(n_jobs=-1)]: Done   6 tasks      | elapsed:    7.1s
[Parallel(n_jobs=-1)]: Done   7 tasks      | elapsed:    9.4s
[Parallel(n_jobs=-1)]: Done   8 tasks      | elapsed:    9.6s
[Parallel(n_jobs=-1)]: Done   9 tasks      | elapsed:   12.0s
[Parallel(n_jobs=-1)]: Done  10 tasks      | elapsed:   12.2s
[Parallel(n_jobs=-1)]: Done  11 tasks      | elapsed:   14.9s
[Parallel(n_jobs=-1)]: Done  12 tasks      | elapsed:   15.1s
[Parallel(n_jobs=-1)]: Done  13 tasks      | elapsed:   17.9s
[Parallel(n_jobs=-1)]: Done  14 tasks      | elapsed:  

### Our final parameter set and optimized R^2 -- a modest improvement (0.295 to 0.305)

In [28]:
g_best_params = g_model.best_params_
g_best_score = g_model.best_score_
print(g_best_params)
print(g_best_score)

{'bootstrap': True, 'max_depth': 120, 'max_features': 0.3, 'min_samples_split': 0.005, 'n_estimators': 100}
0.30535053655049244
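
To put the two searches side by side, the gain can be computed from the scores printed above. The values are copied from the outputs of cells 18 and 28. Both are mean cross-validation scores on the training folds, not test-set scores.

```python
# Best mean validation r^2 from each search (copied from the printed outputs).
rand_best_score = 0.2949310635642264
g_best_score = 0.30535053655049244

abs_gain = g_best_score - rand_best_score
rel_gain = abs_gain / rand_best_score
print(f"absolute gain: {abs_gain:.4f}, relative gain: {rel_gain:.1%}")
# absolute gain: 0.0104, relative gain: 3.5%
```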
