In [1]:
import utils
import warnings

warnings.filterwarnings('ignore')
utils.set_css_style('style.css')

# 4. Hyperparameter Tuning

Choosing the right hyperparameters for your machine learning algorithm is a crucial task, since it can make a big difference to the performance of a model.

Machine learning models have two different types of parameters:

* **Hyperparameters** are the parameters set by the user before training starts (e.g. learning rate, regularization parameter, batch size in mini-batch gradient descent).

* Model **parameters** are instead learned during model training (e.g. the weights in Linear Regression or Neural Networks).

The model parameters define how to use input data to get the desired output and are learned at training time. Hyperparameters, by contrast, determine how our model is structured and trained in the first place.
Hyperparameter tuning is an optimization problem: we have a set of hyperparameters and we look for the combination of their values that minimizes (e.g. loss) or maximizes (e.g. accuracy) a function measuring the model's performance.
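
To make the distinction concrete, here is a minimal sketch, assuming scikit-learn's `Ridge` and a small synthetic dataset (the data and the value `alpha=1.0` are illustrative assumptions, not part of this notebook). `alpha` is fixed before training, while `coef_` and `intercept_` only exist after `fit` has learned them.

```python
import numpy as np
from sklearn.linear_model import Ridge

# Synthetic data (illustrative assumption): y = 3*x0 - 2*x1 + small noise
rng = np.random.default_rng(0)
X = rng.normal(size=(100, 2))
y = 3.0 * X[:, 0] - 2.0 * X[:, 1] + rng.normal(scale=0.1, size=100)

# Hyperparameter: chosen by the user before training
model = Ridge(alpha=1.0)

# Model parameters: learned from the data during fit
model.fit(X, y)
print("hyperparameter alpha:", model.alpha)
print("learned coefficients:", model.coef_)
print("learned intercept:", model.intercept_)
```

Changing `alpha` and refitting would give different learned coefficients, which is exactly what hyperparameter tuning explores.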

<img src="figures/hyperparameter-tuning.png" alt="hyperparameter-tuning" style="width: 500px;"/>

## 4.1. Grid search

Grid search is a method in which we define a set of candidate values for each hyperparameter, then evaluate every combination of those values; together, the combinations form a “grid”.
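
As a rough illustration of what the "grid" contains, the sketch below enumerates every combination of two candidate lists using only the standard library. The hyperparameter names and values (`alpha`, `fit_intercept`) are illustrative assumptions chosen for this sketch.

```python
from itertools import product

# Candidate values per hyperparameter (illustrative assumption)
param_grid = {
    "alpha": [1, 0.1, 0.01],
    "fit_intercept": [True, False],
}

# The grid is the Cartesian product of all candidate lists
names = list(param_grid)
combinations = [dict(zip(names, values)) for values in product(*param_grid.values())]

for combo in combinations:
    print(combo)
print("number of candidates:", len(combinations))
```

The number of candidates is the product of the list lengths (here 3 x 2 = 6), which is why grid search becomes expensive as more hyperparameters are added.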

The recipe below evaluates different $\lambda$ values (called `alpha` in scikit-learn) for the regularized linear regression algorithm we have seen above (a Linear Regression with an $L_2$ regularization is called Ridge Regression) on the standard diabetes dataset. Because only one hyperparameter is varied, this is a one-dimensional grid search.

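Grid search can be implemented in Python using scikit-learn's `GridSearchCV` class. The `verbose` parameter controls how much information is printed as the search runs, and the `cv` parameter sets the cross-validation strategy (the number of folds when given an integer). The example below leaves `cv` at its default, which the output shows to be 5 folds.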

In [25]:
import numpy as np
from sklearn import datasets
from sklearn.model_selection import train_test_split
from sklearn.linear_model import Ridge
from sklearn.model_selection import GridSearchCV

In [26]:
# load the diabetes dataset
dataset = datasets.load_diabetes()

Dataset Description:

In [27]:
print(dataset.DESCR)

.. _diabetes_dataset:

Diabetes dataset
----------------

Ten baseline variables, age, sex, body mass index, average blood
pressure, and six blood serum measurements were obtained for each of n =
442 diabetes patients, as well as the response of interest, a
quantitative measure of disease progression one year after baseline.

**Data Set Characteristics:**

  :Number of Instances: 442

  :Number of Attributes: First 10 columns are numeric predictive values

  :Target: Column 11 is a quantitative measure of disease progression one year after baseline

  :Attribute Information:
      - Age
      - Sex
      - Body mass index
      - Average blood pressure
      - S1
      - S2
      - S3
      - S4
      - S5
      - S6

Note: Each of these 10 feature variables have been mean centered and scaled by the standard deviation times `n_samples` (i.e. the sum of squares of each column totals 1).

Source URL:
https://www4.stat.ncsu.edu/~boos/var.select/diabetes.html

For more information see:
Bra

In [28]:
# Split the data into 2 sets
X_train, X_test, y_train, y_test = train_test_split(dataset.data, dataset.target, test_size=0.2)
print("Training set: ", X_train.shape, y_train.shape)
print("Testing set: ", X_test.shape, y_test.shape)

Training set:  (353, 10) (353,)
Testing set:  (89, 10) (89,)


In [29]:
# prepare a range of alpha values to test
grid = {'alpha': [1,0.1,0.01,0.001,0.0001,0]}

# create and fit a ridge regression model, testing each alpha
model = Ridge()
gsearch = GridSearchCV(estimator=model, param_grid=grid, scoring="neg_mean_squared_error", verbose=1)
gsearch.fit(X_train, y_train)

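# summarize the results of the grid search
# (best_score_ is the mean negated MSE over the CV folds, so negate it before taking the square root)
print('RMSE: ', np.round(np.sqrt(-1*gsearch.best_score_), 2))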
print('Best regularization parameter: ', gsearch.best_estimator_.alpha)

Fitting 5 folds for each of 6 candidates, totalling 30 fits
RMSE:  55.08
Best regularization parameter:  0.001


[Parallel(n_jobs=1)]: Using backend SequentialBackend with 1 concurrent workers.
[Parallel(n_jobs=1)]: Done  30 out of  30 | elapsed:    0.0s finished
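
The search above reports only the best candidate. As a minimal sketch of how the score of every candidate could be inspected, the block below sets `cv=5` explicitly and reads the `cv_results_` attribute of `GridSearchCV`. It fits on the full dataset rather than on the train split and omits `alpha=0`, both of which are simplifying assumptions for this sketch.

```python
import numpy as np
from sklearn.datasets import load_diabetes
from sklearn.linear_model import Ridge
from sklearn.model_selection import GridSearchCV

X, y = load_diabetes(return_X_y=True)

# Same idea as the search above, with the number of folds made explicit
search = GridSearchCV(
    estimator=Ridge(),
    param_grid={"alpha": [1, 0.1, 0.01, 0.001]},
    scoring="neg_mean_squared_error",
    cv=5,
)
search.fit(X, y)

# cv_results_ holds one entry per candidate; the scores are negated MSE values
for params, mean_score in zip(search.cv_results_["params"],
                              search.cv_results_["mean_test_score"]):
    print(params, "RMSE ~", np.round(np.sqrt(-mean_score), 2))
```

Looking at all candidates, not just the best one, shows how sensitive the score is to the hyperparameter.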


Evaluating RMSE performance on the testing set:

In [30]:
# score() uses the search's scoring metric, so this is the negated MSE on the test set
neg_mse_test = gsearch.score(X_test, y_test)
print('RMSE: ', np.round(np.sqrt(-1*neg_mse_test), 2))

RMSE:  54.37


## 4.2. Random search

As its name suggests, random search evaluates random combinations of hyperparameters. Not every combination is tried; instead, a fixed number of candidates, given by `n_iter` in `RandomizedSearchCV`, is sampled from the search space.
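
To see what sampling a fixed number of candidates looks like, here is a small sketch using scikit-learn's `ParameterSampler`, which draws candidates from a search space in a similar way. The search space and the values `n_iter=4` and `random_state=0` are illustrative assumptions.

```python
from sklearn.model_selection import ParameterSampler

# Search space (illustrative assumption): 5 x 2 = 10 possible combinations
space = {
    "C": [1, 10, 100, 1000, 10000],
    "penalty": ["l1", "l2"],
}

# Draw only 4 of the 10 combinations at random
candidates = list(ParameterSampler(space, n_iter=4, random_state=0))

for candidate in candidates:
    print(candidate)
print("candidates tried:", len(candidates), "out of 10 possible")
```

When every entry of the search space is a list, the candidates are drawn without replacement, so no combination is evaluated twice.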

Random search is generally preferable to grid search when the search space is large, for example when there are more than 3 hyperparameters, because for the same number of trials it explores more distinct values of each hyperparameter. In the figure below, grid search tests only three distinct values of each hyperparameter, whereas random search tests nine distinct values of each. That means if one hyperparameter is more important than the others, random search will do better. Think of it this way: if hyperparameter 2 doesn’t really matter, then we would want 9 different hyperparameter 1 values to test instead of 3. The same holds true in higher dimensions (more hyperparameters).

<div class="item">
    <img src="figures/gs-vs-rs.png" alt="/gs-vs-rs" width="600px"/>
    <span class="caption">With grid search, nine trials only test three distinct places. With random search, all nine trials explore distinct values.</span>
</div>
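
The counting argument in the figure can be reproduced with a few lines of standard-library Python. This is only a sketch: the two hyperparameters are abstract values in [0, 1], and the grid values and seed are illustrative assumptions.

```python
import random
from itertools import product

random.seed(0)  # fixed seed so the sketch is reproducible

# Grid search: 3 candidate values per hyperparameter -> 9 trials
grid_values = [0.1, 0.5, 0.9]
grid_trials = list(product(grid_values, grid_values))

# Random search: 9 trials, each drawing both hyperparameters independently
random_trials = [(random.random(), random.random()) for _ in range(9)]

def distinct_values(trials, index):
    """Count the distinct values tried for the hyperparameter at position `index`."""
    return len({trial[index] for trial in trials})

print("grid:   distinct values of hyperparameter 1 =", distinct_values(grid_trials, 0))
print("random: distinct values of hyperparameter 1 =", distinct_values(random_trials, 0))
```

Both strategies spend nine trials, but the grid only ever sees three values of each hyperparameter, while the random draws see nine.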

For this example we will use logistic regression (a linear algorithm used for classification), which has many different hyperparameters. We will only consider these two:

* The penalty (the regularization type, L1 or L2)
* The C value (the inverse of the regularization strength: smaller values mean stronger regularization)

The data set we will be using is the classic and simple iris data set. First we need to import the things we need, as well as separate the target variable from the independent variables.

In [31]:
import numpy as np
from sklearn import datasets
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.model_selection import RandomizedSearchCV
iris = datasets.load_iris()
features = iris.data
target = iris.target

In [32]:
# Split the data into 2 sets
X_train, X_test, y_train, y_test = train_test_split(features, target, test_size=0.2, random_state=1)
print("Training set: ", X_train.shape, y_train.shape)
print("Testing set: ", X_test.shape, y_test.shape)

Training set:  (120, 4) (120,)
Testing set:  (30, 4) (30,)


In [33]:
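# candidate values for the penalty & C hyperparameters
# (C takes 10 log-spaced values between 10^0 and 10^4)
penalty = ['l1', 'l2']
C = np.logspace(0, 4, num=10)
grid = dict(C=C, penalty=penalty)

# create and fit a logistic regression model, testing random (C, penalty) combinations
# note: 'l1' is only supported by some solvers (e.g. 'liblinear', 'saga'); with the default
# solver of recent scikit-learn versions ('lbfgs') the 'l1' candidates fail to fit and are scored as NaN
# note: the space has only 2 x 10 = 20 combinations, so n_iter=50 cannot sample more than 20 candidates
logistic = LogisticRegression()
rsearch = RandomizedSearchCV(estimator=logistic, param_distributions=grid, n_iter=50, scoring='accuracy')

rsearch.fit(X_train, y_train)

# summarize the results of the random search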
print('Accuracy: ', np.round(rsearch.best_score_, 2))
print('Best regularization penalty type: ', rsearch.best_estimator_.penalty)
print('Best regularization parameter: ', rsearch.best_estimator_.C)

Accuracy:  0.98
Best regularization penalty type:  l2
Best regularization parameter:  2.7825594022071245
