<center><img src=img/MScAI_brand.png width=70%></center>

# Scikit-Learn: Hyperparameter Tuning

When approaching an ML problem we often train multiple models. There are at least three possibilities:

* Different models *per se*, e.g. Logistic Regression versus SVM
* Tuning hyperparameter values (aka **model selection**)
* Different features (**feature selection**)

This notebook is mostly about hyperparameter tuning, plus a quick example of multiple models. We'll cover feature selection in the next notebook.

Let's compare a `LinearRegression` with a `RandomForestRegressor`, which is an ensemble of decision trees.

In [1]:
import itertools
import pandas as pd
from sklearn.linear_model import LinearRegression
from sklearn.ensemble import RandomForestRegressor
from sklearn.model_selection import train_test_split
from sklearn.datasets import load_diabetes

The **diabetes** dataset is a well-known "toy" dataset for testing out regression algorithms.

In [2]:
X, y = load_diabetes(return_X_y=True)
X.shape, y.shape

((442, 10), (442,))

In [3]:
print(X[:5])
print(y[:5])

[[ 0.03807591  0.05068012  0.06169621  0.02187235 -0.0442235  -0.03482076
  -0.04340085 -0.00259226  0.01990842 -0.01764613]
 [-0.00188202 -0.04464164 -0.05147406 -0.02632783 -0.00844872 -0.01916334
   0.07441156 -0.03949338 -0.06832974 -0.09220405]
 [ 0.08529891  0.05068012  0.04445121 -0.00567061 -0.04559945 -0.03419447
  -0.03235593 -0.00259226  0.00286377 -0.02593034]
 [-0.08906294 -0.04464164 -0.01159501 -0.03665645  0.01219057  0.02499059
  -0.03603757  0.03430886  0.02269202 -0.00936191]
 [ 0.00538306 -0.04464164 -0.03638469  0.02187235  0.00393485  0.01559614
   0.00814208 -0.00259226 -0.03199144 -0.04664087]]
[151.  75. 141. 206. 135.]


We'll make a train-test split. It's good practice to use a `random_state`, so that anyone else who runs our notebook will get exactly the same train-test split as us.

In [4]:
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)
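
As a quick check of the reproducibility claim, the sketch below calls `train_test_split` twice with the same `random_state` and confirms that the resulting arrays are identical. The names `split_a`, `split_b` and `same_split` are ours, introduced only for illustration.

```python
import numpy as np
from sklearn.datasets import load_diabetes
from sklearn.model_selection import train_test_split

X, y = load_diabetes(return_X_y=True)

# Two calls with the same seed should give identical splits.
split_a = train_test_split(X, y, random_state=0)
split_b = train_test_split(X, y, random_state=0)
same_split = all(np.array_equal(a, b) for a, b in zip(split_a, split_b))

# With default arguments, 25% of the rows are held out for testing.
X_train, X_test, y_train, y_test = split_a
print(same_split, X_train.shape, X_test.shape)
```

Without `random_state`, each run would shuffle the rows differently, so scores reported later in the notebook would change from run to run.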

Now, the point of this exercise: if we put several models into a list, then we can use a loop to fit and evaluate them all on a test set. This is good scientific practice: it ensures that all are trained and evaluated in exactly the same way.

We can even do some hyperparameter tuning in this way:

In [5]:
models = [
    LinearRegression(),
    RandomForestRegressor(max_depth=3),
    RandomForestRegressor(max_depth=10),
    RandomForestRegressor(max_depth=20)
]

In [6]:
for model in models:
    model.fit(X_train, y_train)
    print(f"{repr(model)}: {model.score(X_test, y_test):.2f}")

LinearRegression(): 0.36
RandomForestRegressor(max_depth=3): 0.26
RandomForestRegressor(max_depth=10): 0.24
RandomForestRegressor(max_depth=20): 0.27


Remember that by default the `score` for regression is the coefficient of determination $R^2$, where higher is better. It looks like the linear regression is the best (so far) on this dataset! 
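
To make the meaning of $R^2$ concrete, here is a minimal sketch that computes it by hand as $1 - SS_{res}/SS_{tot}$ and compares it with `model.score`. The variable names `ss_res`, `ss_tot` and `r2_manual` are ours.

```python
import numpy as np
from sklearn.datasets import load_diabetes
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import train_test_split

X, y = load_diabetes(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

model = LinearRegression().fit(X_train, y_train)
y_pred = model.predict(X_test)

# R^2 = 1 - (residual sum of squares) / (total sum of squares)
ss_res = np.sum((y_test - y_pred) ** 2)
ss_tot = np.sum((y_test - y_test.mean()) ** 2)
r2_manual = 1 - ss_res / ss_tot

# The two numbers should agree up to floating-point error.
print(r2_manual, model.score(X_test, y_test))
```

A consequence of this definition is that $R^2$ can be negative: a model whose predictions are worse than always predicting the mean of the test targets has `ss_res > ss_tot`.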

Notice that this is a nice example of **duck typing**: we have objects of different types, `LinearRegression` and `RandomForestRegressor`, but the loop works because they all have `fit` and `score`.
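
To illustrate the duck-typing point, the sketch below adds a hand-written baseline to the list. `MeanRegressor` is a hypothetical class of our own, not part of Scikit-Learn; the loop only requires that each object has `fit` and `score`.

```python
import numpy as np
from sklearn.datasets import load_diabetes
from sklearn.linear_model import LinearRegression
from sklearn.metrics import r2_score
from sklearn.model_selection import train_test_split


class MeanRegressor:
    """Hypothetical baseline (ours): always predicts the training mean."""

    def fit(self, X, y):
        self.mean_ = np.mean(y)
        return self

    def predict(self, X):
        return np.full(len(X), self.mean_)

    def score(self, X, y):
        return r2_score(y, self.predict(X))

    def __repr__(self):
        return "MeanRegressor()"


X, y = load_diabetes(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

# The same loop handles both objects, although they share no base class.
scores = {}
for model in [LinearRegression(), MeanRegressor()]:
    model.fit(X_train, y_train)
    scores[repr(model)] = model.score(X_test, y_test)
    print(f"{model!r}: {scores[repr(model)]:.2f}")
```

A baseline like this is also a useful sanity check: any model worth keeping should score above it.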

We could go further, and try out a factorial design on multiple hyperparameters, again using the same train-test split.

In [7]:
n_estimatorss = [1, 10, 100]
max_depths = [2, 4, 8, 16, None] # NB None means "no max"
for n_estimators, max_depth in (
    itertools.product(n_estimatorss, max_depths)):
    
    rf = RandomForestRegressor(
        n_estimators=n_estimators, 
        max_depth=max_depth)
    rf.fit(X_train, y_train)
    print(f"{n_estimators} {max_depth} {rf.score(X_test, y_test)}")

1 2 0.005036263471429714
1 4 -0.030175303514740515
1 8 -0.3839607726192866
1 16 -0.38368773188740146
1 None -0.17998930461000429
10 2 0.18797936594620124
10 4 0.2238222412656934
10 8 0.21950958155874456
10 16 0.16619076684681944
10 None 0.20491942359335658
100 2 0.26611441247806633
100 4 0.27645787478626593
100 8 0.2566173693599645
100 16 0.2387430173417251
100 None 0.24536849562685736


### Hyperparameter Tuning with Cross-Validation

Next we'll look at a better approach to hyperparameter tuning, using **cross-validation**.

Disadvantages of a single train-test split:

* Vulnerable to a single random decision (e.g. many "easy" examples in the test set)
* Some of the data doesn't contribute to training.

Cross-validation solves these problems by splitting the data into $k$ *folds*, and then training $k$ times, each time on $1-1/k$ of the data, and validating on the remaining $1/k$:

<center><img src=img/grid_search_cross_validation.png width=40%></center>

<font size=1>https://scikit-learn.org/stable/modules/cross_validation.html#cross-validation</font>
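
The procedure can be sketched by hand using `KFold`, which only generates the index arrays for each fold. This is a minimal illustration on the training split, assuming a plain `LinearRegression` to keep it fast; the names `fold_scores` and `val_sizes` are ours.

```python
import numpy as np
from sklearn.datasets import load_diabetes
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import KFold, train_test_split

X, y = load_diabetes(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

k = 5
kfold = KFold(n_splits=k)

fold_scores = []
val_sizes = []
for train_idx, val_idx in kfold.split(X_train):
    # Train on k-1 folds, validate on the one fold that was held out.
    model = LinearRegression()
    model.fit(X_train[train_idx], y_train[train_idx])
    fold_scores.append(model.score(X_train[val_idx], y_train[val_idx]))
    val_sizes.append(len(val_idx))

print(val_sizes)                 # every training row is validated on once
print(np.round(fold_scores, 3))  # one R^2 value per fold
```

Note that the test set is never touched here: cross-validation works entirely inside the training data.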

Scikit-Learn provides easy interfaces for cross-validation.

Notice that we **instantiate** the RF, but **we** don't fit it. The `cross_val_score` function will fit a fresh copy of it 5 times (once per fold) and return the 5 scores, each computed on the fold that was held out from that fit.

In [8]:
from sklearn.model_selection import cross_val_score
rf = RandomForestRegressor(n_estimators=50, 
                           max_depth=8)
cross_val_score(rf, X_train, y_train, cv=5) 

array([0.47257668, 0.47663777, 0.46703285, 0.3990208 , 0.54202861])

Here, we see a *big* difference in performance due to different folds (from about 0.40 to 0.54). This is a warning not to blindly trust the result of any single train-test split.

That's why it's better to use CV to help tune hyperparameters. We'll use a factorial design as before.

In [9]:
n_estimatorss = [1, 10, 100]
max_depths = [2, 4, 8, 16, None] # NB None means "no max"
for n_estimators, max_depth in (
    itertools.product(n_estimatorss, max_depths)):
    rf = RandomForestRegressor(n_estimators=n_estimators, max_depth=max_depth)
    mean_score = cross_val_score(rf, X_train, y_train, cv=5).mean()     
    print(f"{n_estimators} {max_depth} {mean_score}")

1 2 0.3235351070408104
1 4 0.23780398440987857
1 8 0.08777054778276028
1 16 -0.15436064785076192
1 None 0.02897105400395337
10 2 0.46863211679478545
10 4 0.4655115657947097
10 8 0.463712201839847
10 16 0.43926584565145665
10 None 0.4216133640818417
100 2 0.4677339723301569
100 4 0.4865973857891942
100 8 0.48311321733985124
100 16 0.4876417234486432
100 None 0.48204196653204223


### Grid Search

The factorial design is also known as a **grid search** in the context of machine learning and optimisation. We did the grid search manually, but let's never do that again: Scikit-Learn provides it for us. We provide a `dict` giving the parameter names and the values to be tried.

In [10]:
from sklearn.model_selection import GridSearchCV
param_grid = {'n_estimators': [1, 10, 100],
              'max_depth': [2, 4, 8, 16, None]}
grid = GridSearchCV(RandomForestRegressor(), 
                    param_grid, cv=5) 
grid.fit(X_train, y_train)

GridSearchCV(cv=5, estimator=RandomForestRegressor(),
             param_grid={'max_depth': [2, 4, 8, 16, None],
                         'n_estimators': [1, 10, 100]})

Notice that we create a `GridSearchCV` object and call `fit` on that. Again, we don't call `fit` on the RF.
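
To see what the grid search iterates over, we can enumerate the grid with `ParameterGrid`. This is a small sketch; the fit count assumes the default `refit=True`, which adds one final fit of the best combination on all the training data.

```python
from sklearn.model_selection import ParameterGrid

param_grid = {'n_estimators': [1, 10, 100],
              'max_depth': [2, 4, 8, 16, None]}

# Each element is a dict of keyword arguments for one candidate model.
combinations = list(ParameterGrid(param_grid))
print(len(combinations))  # 3 * 5 = 15 combinations
print(combinations[0])

# 5-fold CV fits each candidate 5 times, plus one final refit.
cv = 5
n_fits = len(combinations) * cv + 1
print(n_fits)
```

The number of fits grows multiplicatively with each extra hyperparameter, which is worth keeping in mind before adding another entry to `param_grid`.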

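After `fit`, we can find out the best parameter values and the `score` on test data with those parameters. By default, `GridSearchCV` refits a model with the best parameters on all of the training data, and `grid.score` uses that refitted model. Notice this is our unseen test data, not used during cross-validation.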

In [11]:
grid.best_params_

{'max_depth': 8, 'n_estimators': 100}

In [12]:
grid.score(X_test, y_test)

0.22934099677565734
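
Beyond `best_params_`, the fitted object keeps the full table of results. The sketch below loads `cv_results_` into a pandas `DataFrame` and looks at `best_score_` and `best_estimator_`. It assumes a smaller grid than the one above and a fixed `random_state`, purely to keep it quick and repeatable.

```python
import pandas as pd
from sklearn.datasets import load_diabetes
from sklearn.ensemble import RandomForestRegressor
from sklearn.model_selection import GridSearchCV, train_test_split

X, y = load_diabetes(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

# Assumption: a reduced grid, chosen only so the sketch runs fast.
param_grid = {"n_estimators": [10, 50], "max_depth": [2, 4, None]}
grid = GridSearchCV(RandomForestRegressor(random_state=0), param_grid, cv=5)
grid.fit(X_train, y_train)

# One row per parameter combination, sorted from best to worst.
results = pd.DataFrame(grid.cv_results_)
columns = ["param_n_estimators", "param_max_depth",
           "mean_test_score", "std_test_score", "rank_test_score"]
print(results[columns].sort_values("rank_test_score"))

# Mean cross-validated score of the best combination.
# This is computed on the training data only, not on X_test.
print(grid.best_score_)

# With the default refit=True, this forest was refitted on all of X_train.
print(grid.best_estimator_)
```

Comparing `grid.best_score_` with `grid.score(X_test, y_test)` is a useful habit: the first was used to choose the hyperparameters, so only the second is an estimate on data that played no part in the tuning.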

### Some other utilities

* If you want to report several `score` functions, such as $R^2$ and mean squared error, you can pass `scoring` to some of the CV and Grid Search functions.
* There is also **leave-one-out** cross-validation, which can sometimes be useful when you have little data.
* There is **stratified CV**, which keeps the class proportions roughly equal across folds in classification problems.
* And lots more: https://scikit-learn.org/stable/model_selection.html
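
As a minimal sketch of the first two utilities, the code below uses `cross_validate` with a list of scorer names, and then with `LeaveOneOut` as the splitter. It assumes a `LinearRegression`, since leave-one-out needs one fit per training row and a fast model keeps that cheap.

```python
from sklearn.datasets import load_diabetes
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import LeaveOneOut, cross_validate, train_test_split

X, y = load_diabetes(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

# Several metrics at once: cross_validate accepts a list of scorer names.
cv_results = cross_validate(
    LinearRegression(), X_train, y_train, cv=5,
    scoring=["r2", "neg_mean_squared_error"])
print(cv_results["test_r2"].mean())
# Scikit-Learn scorers follow "higher is better", so the MSE is negated.
print(-cv_results["test_neg_mean_squared_error"].mean())

# Leave-one-out: one fold per sample. R^2 is undefined on a single
# sample, so we score with (negated) mean squared error instead.
loo_results = cross_validate(
    LinearRegression(), X_train, y_train, cv=LeaveOneOut(),
    scoring="neg_mean_squared_error")
print(len(loo_results["test_score"]))  # one score per training row
```

`GridSearchCV` also accepts multiple scorers, but then `refit` must name the one used to pick the best parameters.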

