# Tuning Hyperparameters

Each estimator (linear regression, nearest neighbors, support vector machines, XGBoost, etc.) has its own set of hyperparameters to select or tune. [Further Reading](https://scikit-learn.org/stable/modules/grid_search.html)

A search consists of:
1. An estimator/model
2. A parameter space
3. A method for searching or sampling candidates
4. A cross-validation scheme (see [k-fold cross-validation](https://machinelearningmastery.com/k-fold-cross-validation/))
5. A score function
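
As a rough illustration of how these five pieces fit together, the sketch below maps each one onto an argument of scikit-learn's `GridSearchCV`. The toy data (`X_demo`, `y_demo`) and the small grid are hypothetical stand-ins, not the notebook's dataset or settings.

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor
from sklearn.model_selection import GridSearchCV, KFold

# Hypothetical toy data standing in for the notebook's dataset.
rng = np.random.default_rng(0)
X_demo = rng.uniform(0, 100, size=(120, 1))
y_demo = 400 + 3 * X_demo[:, 0] + rng.normal(0, 20, size=120)

search = GridSearchCV(
    estimator=RandomForestRegressor(n_estimators=20, random_state=0),  # 1. estimator
    param_grid={"max_depth": [2, 3]},                                  # 2. parameter space
    cv=KFold(n_splits=3, shuffle=True, random_state=0),                # 4. cross-validation scheme
    scoring="neg_mean_squared_error",                                  # 5. score function
)
# 3. The search method is the class itself: GridSearchCV tries every combination.
search.fit(X_demo, y_demo)
print(search.best_params_, search.best_score_)
```

Swapping `GridSearchCV` for `RandomizedSearchCV` changes only component 3, the way candidates are drawn from the parameter space.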

We search the parameter space for the combination that yields the best model, where "best" is quantified by a score such as **mean squared error** (for regression) or **accuracy** (for classification). [Further Reading](https://scikit-learn.org/stable/modules/model_evaluation.html)
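
For a concrete feel of the two scores named above, here is a minimal sketch on made-up labels and predictions (the numbers are illustrative only, not taken from the dataset).

```python
from sklearn.metrics import accuracy_score, mean_squared_error

# Regression: mean of squared differences, lower is better.
y_true_reg = [3.0, 5.0, 2.0]
y_pred_reg = [2.0, 5.0, 4.0]
mse = mean_squared_error(y_true_reg, y_pred_reg)  # (1 + 0 + 4) / 3

# Classification: fraction of correct predictions, higher is better.
y_true_cls = [1, 0, 1, 1]
y_pred_cls = [1, 0, 0, 1]
acc = accuracy_score(y_true_cls, y_pred_cls)  # 3 correct out of 4

print(mse, acc)
```

Because the search classes always maximize the score, error metrics are passed in negated form, for example `scoring="neg_mean_squared_error"`.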

In [24]:
import numpy as np
import pandas as pd
from matplotlib import pyplot as plt
import seaborn as sns
data = pd.read_csv("../data/scores_synth.csv")
data

Unnamed: 0,income,score,internet_connection
0,69.454075,635.305372,1
1,47.632800,743.301322,1
2,22.905094,673.037833,1
3,4.465032,442.894112,0
4,19.360381,627.178633,1
...,...,...,...
995,8.801915,464.993872,0
996,15.317348,641.288260,1
997,25.411924,641.858088,1
998,4.898013,447.408180,0


## Exhaustive Grid Search

In [25]:
# Candidate values for each RandomForestRegressor hyperparameter
param_grid = {
    'criterion': ['squared_error', 'absolute_error'],
    'max_depth': list(range(2, 5)),               # 2, 3, 4
    'min_samples_split': list(range(2, 10, 2)),   # 2, 4, 6, 8
}

In [45]:
for key, value in param_grid.items():
    print(f'{key:20}{value}')

criterion           ['squared_error', 'absolute_error']
max_depth           [2, 3, 4]
min_samples_split   [2, 4, 6, 8]
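
An exhaustive search evaluates every combination in the grid, so its cost grows multiplicatively with each added parameter. The sketch below uses scikit-learn's `ParameterGrid` to count the candidates for the grid above. The fold count of 3 is an assumption that matches the `cv=3` used in the next cell.

```python
from sklearn.model_selection import ParameterGrid

param_grid = {
    "criterion": ["squared_error", "absolute_error"],
    "max_depth": list(range(2, 5)),
    "min_samples_split": list(range(2, 10, 2)),
}

candidates = list(ParameterGrid(param_grid))
n_candidates = len(candidates)   # 2 * 3 * 4 = 24 combinations
n_fits = n_candidates * 3        # one fit per candidate per fold, assuming cv=3
print(n_candidates, n_fits)      # 24 72
```

With the default `refit=True`, one more fit follows on the full training data using the best combination.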


In [27]:
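from sklearn.ensemble import RandomForestRegressor
from sklearn.model_selection import GridSearchCV, train_test_split
from sklearn.metrics import mean_squared_error
from sklearn.metrics import mean_absolute_error

# Hold out a test set for the evaluation cells below.
# Assumption: the original split is not shown; an 80/20 split is used here.
X_train, X_test, y_train, y_test = train_test_split(
    data[['income']], data['score'], test_size=0.2, random_state=0)

# Exhaustive search over param_grid with 3-fold cross-validation
gsearch = GridSearchCV(RandomForestRegressor(), param_grid, cv=3)

In [28]:
# Fit on the training split only, so the test set stays unseen
gsearch.fit(X_train, y_train)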

GridSearchCV(cv=3, estimator=RandomForestRegressor(),
             param_grid={'criterion': ['squared_error', 'absolute_error'],
                         'max_depth': [2, 3, 4],
                         'min_samples_split': [2, 4, 6, 8]})

In [29]:
print(gsearch.best_params_) 


{'criterion': 'absolute_error', 'max_depth': 3, 'min_samples_split': 6}
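
`best_params_` reports only the winner. To see how close the other candidates came, the fitted search also exposes `cv_results_`, which loads directly into a DataFrame. The following is a minimal sketch on hypothetical toy data with a smaller grid. With no `scoring` argument, the score for a regressor is R², so higher is better.

```python
import numpy as np
import pandas as pd
from sklearn.ensemble import RandomForestRegressor
from sklearn.model_selection import GridSearchCV

# Hypothetical toy data standing in for the notebook's dataset.
rng = np.random.default_rng(0)
X_demo = pd.DataFrame({"income": rng.uniform(0, 100, size=150)})
y_demo = 400 + 3 * X_demo["income"] + rng.normal(0, 20, size=150)

demo_search = GridSearchCV(
    RandomForestRegressor(n_estimators=20, random_state=0),
    {"max_depth": [2, 3, 4], "min_samples_split": [2, 6]},
    cv=3,
)
demo_search.fit(X_demo, y_demo)

# One row per candidate, sorted so the best-ranked combination comes first.
results = pd.DataFrame(demo_search.cv_results_)
ranked = results.sort_values("rank_test_score")[
    ["params", "mean_test_score", "std_test_score", "rank_test_score"]
]
print(ranked.head())
```

A small gap in `mean_test_score` relative to `std_test_score` suggests the top candidates are practically indistinguishable.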


In [30]:
losses = {}
losses['Random Forest: GridSearch'] = mean_absolute_error(y_test, gsearch.predict(X_test))
print(losses)

{'Random Forest: GridSearch': 45.391040566709016}
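
The search above selects parameters using the estimator's default score (R² for regressors), while the held-out loss is reported as mean absolute error. One possible way to align the two is to pass `scoring="neg_mean_absolute_error"` to the search. The sketch below does so on hypothetical toy data, and the split sizes are assumptions.

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor
from sklearn.metrics import mean_absolute_error
from sklearn.model_selection import GridSearchCV, train_test_split

# Hypothetical toy data standing in for the notebook's dataset.
rng = np.random.default_rng(0)
X_demo = rng.uniform(0, 100, size=(200, 1))
y_demo = 400 + 3 * X_demo[:, 0] + rng.normal(0, 20, size=200)
X_tr, X_te, y_tr, y_te = train_test_split(
    X_demo, y_demo, test_size=0.2, random_state=0
)

mae_search = GridSearchCV(
    RandomForestRegressor(n_estimators=20, random_state=0),
    {"max_depth": [2, 3, 4]},
    cv=3,
    scoring="neg_mean_absolute_error",  # negated so that higher is better
)
mae_search.fit(X_tr, y_tr)

cv_mae = -mae_search.best_score_  # flip the sign back to a plain MAE
test_mae = mean_absolute_error(y_te, mae_search.predict(X_te))
print(cv_mae, test_mae)
```

Comparing `cv_mae` with `test_mae` gives a quick check on whether the cross-validated estimate carries over to unseen data.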


## Randomized Search

In [37]:
from sklearn.model_selection import RandomizedSearchCV

# n_iter=1 samples a single random candidate from param_grid; cv defaults to 5 folds
rsearch = RandomizedSearchCV(RandomForestRegressor(), param_grid, n_iter=1)
rsearch.fit(X_train, y_train)


RandomizedSearchCV(estimator=RandomForestRegressor(), n_iter=1,
                   param_distributions={'criterion': ['squared_error',
                                                      'absolute_error'],
                                        'max_depth': [2, 3, 4],
                                        'min_samples_split': [2, 4, 6, 8]})

In [38]:
print(rsearch.best_params_) 

{'min_samples_split': 4, 'max_depth': 4, 'criterion': 'squared_error'}


In [43]:
losses['Random Forest: Random Search'] = mean_absolute_error(y_test, rsearch.predict(X_test))
for key, value in losses.items():
    print(f'{key:30}{value}')

Random Forest: GridSearch     45.391040566709016
Random Forest: Random Search  48.30231651125688
