# Hyperparameter tuning

 * Parameters that must be specified before fitting a model (like 𝛼 in Ridge and Lasso regression or 'k' in kNN) are called hyperparameters.
 * Unlike model coefficients, they cannot be learned by fitting the model.
 
#### Basic idea:
 1. Try a bunch of different values for the hyperparameters
 2. Fit each of them separately
 3. Check how well each performs
 4. Choose the best one
     
**NB!** When comparing different hyperparameter values, it is essential to use cross-validation. Scoring on a single train/test split risks choosing values that only suit that particular split.
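
The four steps above can be sketched by hand before reaching for a search helper. This is a minimal illustration, assuming synthetic data from `make_classification` and a kNN classifier. The names `X_demo`, `y_demo`, `k_values` and `mean_scores` are placeholders, not part of the original notebook.

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import cross_val_score
from sklearn.neighbors import KNeighborsClassifier

# Synthetic stand-in data (an assumption; any feature matrix and labels work)
X_demo, y_demo = make_classification(n_samples=300, n_features=8, random_state=0)

# 1. Candidate values for the hyperparameter k
k_values = [1, 3, 5, 7, 9, 11]

# 2-3. Fit each candidate with 5-fold cross-validation and record its mean accuracy
mean_scores = {}
for k in k_values:
    knn = KNeighborsClassifier(n_neighbors=k)
    mean_scores[k] = cross_val_score(knn, X_demo, y_demo, cv=5).mean()

# 4. Keep the candidate with the highest mean cross-validated score
best_k = max(mean_scores, key=mean_scores.get)
print("Best k: {} (mean CV accuracy {:.3f})".format(best_k, mean_scores[best_k]))
```

GridSearchCV automates exactly this loop and additionally refits the best candidate on all the data passed to it.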

# GridSearchCV

 * Like the 𝛼 parameter of Lasso and Ridge regularization, logistic regression also has a regularization parameter: C.
 * C controls the inverse of the regularization strength.
 * A large C can lead to an overfit model, while a small C can lead to an underfit model.
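
To make the role of C concrete, the sketch below fits the same model with a very small, a moderate and a very large C and compares the size of the learned coefficients. It assumes synthetic data and scikit-learn's default L2 penalty; `X_demo`, `y_demo` and `coef_norms` are illustrative names only.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression

# Synthetic stand-in data (an assumption, not the diabetes data)
X_demo, y_demo = make_classification(n_samples=300, n_features=8, random_state=0)

coef_norms = {}
for C in [0.001, 1.0, 1000.0]:
    # Small C = strong regularization, large C = weak regularization
    clf = LogisticRegression(C=C, max_iter=1000).fit(X_demo, y_demo)
    coef_norms[C] = np.linalg.norm(clf.coef_)
    print("C = {:>8}: coefficient norm = {:.3f}, training accuracy = {:.3f}".format(
        C, coef_norms[C], clf.score(X_demo, y_demo)))
```

A strongly regularized model (small C) is pulled toward near-zero coefficients, which is where underfitting comes from. A weakly regularized one (large C) is free to fit the training data closely.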
 
<img src="classification/img/grid_cv.png" alt="Drawing" style="width: 300px;"/>

In [78]:
from sklearn import datasets
from sklearn import linear_model
from sklearn import model_selection
from sklearn.tree import DecisionTreeClassifier
from sklearn.model_selection import RandomizedSearchCV
from sklearn.model_selection import train_test_split
from sklearn.model_selection import GridSearchCV
from sklearn.metrics import mean_squared_error
from sklearn.model_selection import cross_val_score
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
from scipy.stats import randint
import seaborn as sns

In [62]:
# use diabetes.csv dataset
diab = pd.read_csv('classification/data/diabetes.csv', header=0)
diab.head(2)

Unnamed: 0,pregnancies,glucose,diastolic,triceps,insulin,bmi,dpf,age,diabetes
0,6,148,72,35,0,33.6,0.627,50,1
1,1,85,66,29,0,26.6,0.351,31,0


In [34]:
# data
X = diab.drop('diabetes', axis=1)
y = diab.diabetes

X_train, X_test, y_train, y_test = model_selection.train_test_split(X, y, 
    random_state=0)

# setup hyperparameter grid
c_space = np.logspace(-5, 8, 15)
param_grid = {'C' : c_space}

# init logistic regression classifier
logist_reg = linear_model.LogisticRegression(solver='liblinear')

# init GridSearchCV object
logist_reg_cv = model_selection.GridSearchCV(
    estimator=logist_reg, param_grid=param_grid, cv=5, return_train_score=False)

# fit to the data
logist_reg_cv.fit(X_train, y_train)

# Print the tuned parameters and best score
print("Tuned Logistic Regression Parameters: {}".format(logist_reg_cv.best_params_))
print("\nBest score is {:.3f}".format(logist_reg_cv.best_score_))

Tuned Logistic Regression Parameters: {'C': 31.622776601683793}

Best score is 0.760


**NB!** GridSearchCV can be computationally expensive. A solution to this is to use RandomizedSearchCV.
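
A rough way to see the cost is to count model fits: every candidate in the grid is fitted once per fold. The sketch below uses `ParameterGrid` to count candidates for two grids like the ones in this notebook. The variable names are illustrative.

```python
import numpy as np
from sklearn.model_selection import ParameterGrid

n_folds = 5
grid_c_only = {'C': np.logspace(-5, 8, 15)}
grid_c_and_penalty = {'C': np.logspace(-5, 8, 15), 'penalty': ['l1', 'l2']}

# Number of cross-validation fits = number of candidates x number of folds
# (GridSearchCV adds one final refit of the best candidate when refit=True)
fits_c_only = len(ParameterGrid(grid_c_only)) * n_folds
fits_c_and_penalty = len(ParameterGrid(grid_c_and_penalty)) * n_folds

print("C only:        {} fits".format(fits_c_only))         # 15 candidates x 5 folds = 75
print("C and penalty: {} fits".format(fits_c_and_penalty))  # 30 candidates x 5 folds = 150
```

Each extra hyperparameter multiplies the number of candidates, so the cost grows quickly as the grid gains dimensions.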

## RandomizedSearchCV

 * Not all hyperparameter values are tried out.
 * A fixed number of hyperparameter settings is sampled from specified probability distributions.
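
What "sampling settings from distributions" looks like can be previewed with `ParameterSampler`, the scikit-learn utility for drawing such settings. This is a small sketch with an illustrative dictionary `param_dist_demo`. Lists are sampled uniformly, and `randint(1, 9)` draws integers from 1 to 8.

```python
from scipy.stats import randint
from sklearn.model_selection import ParameterSampler

# Illustrative search space: lists are sampled uniformly, randint(1, 9) yields 1..8
param_dist_demo = {"max_depth": [3, None],
                   "max_features": randint(1, 9),
                   "criterion": ["gini", "entropy"]}

# Draw a fixed number of settings; n_iter plays the same role in RandomizedSearchCV
sampled = list(ParameterSampler(param_dist_demo, n_iter=5, random_state=0))
for params in sampled:
    print(params)
```

RandomizedSearchCV evaluates 'n_iter' such settings (10 by default), regardless of how many combinations the search space contains.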
 
Find optimal hyperparameters with RandomizedSearchCV

In [32]:
# Setup the parameters and distributions to sample from: param_dist
param_dist = {"max_depth": [3, None],
              "max_features": randint(1, 9),
              "min_samples_leaf": randint(1, 9),
              "criterion": ["gini", "entropy"]}

# Instantiate a Decision Tree classifier: tree
tree = DecisionTreeClassifier()

# Instantiate the RandomizedSearchCV object: tree_cv
tree_cv = RandomizedSearchCV(tree, param_dist, cv=5, return_train_score=False)

# Fit it to the data
tree_cv.fit(X_train, y_train)

# Print the tuned parameters and score
print("Tuned Decision Tree Parameters: {}".format(tree_cv.best_params_))
print("Best score is {}".format(tree_cv.best_score_))

Tuned Decision Tree Parameters: {'criterion': 'gini', 'max_depth': 3, 'max_features': 5, 'min_samples_leaf': 3}
Best score is 0.75


## Summary

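 * When both search the same finite grid, RandomizedSearchCV can at best match GridSearchCV, because it evaluates only a subset of the candidates. When it samples from continuous distributions, it can also land on values that a coarse grid would miss.
 * It saves computation time, because the number of fits is fixed by 'n_iter' rather than by the size of the grid.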
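
The trade-off can be sketched by running both searches over the same finite grid of C values with the same folds. The example assumes synthetic data and the default logistic regression solver. `X_demo`, `y_demo`, `grid_search` and `random_search` are illustrative names.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import GridSearchCV, RandomizedSearchCV

# Synthetic stand-in data (an assumption)
X_demo, y_demo = make_classification(n_samples=300, n_features=8, random_state=0)

grid = {'C': np.logspace(-5, 8, 15)}
model = LogisticRegression(max_iter=1000)

# Exhaustive search: all 15 candidates
grid_search = GridSearchCV(model, grid, cv=5).fit(X_demo, y_demo)

# Randomized search: only 5 of the 15 candidates, drawn without replacement
random_search = RandomizedSearchCV(model, grid, n_iter=5, cv=5,
                                   random_state=0).fit(X_demo, y_demo)

print("Grid search:       {} candidates, best score {:.3f}".format(
    len(grid_search.cv_results_['params']), grid_search.best_score_))
print("Randomized search: {} candidates, best score {:.3f}".format(
    len(random_search.cv_results_['params']), random_search.best_score_))
```

With identical folds and a deterministic model, the randomized search scores a subset of what the grid search scores, so its best score cannot exceed the grid's, but it needs only a third of the fits here.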

# Practice GridSearchCV with LogisticRegression

 * In addition to C, logistic regression has a 'penalty' hyperparameter, which specifies whether to use 'l1' or 'l2' regularization.

In [42]:
# Create the hyperparameter grid
c_space = np.logspace(-5, 8, 15)
param_grid = {'C': c_space, 'penalty': ['l1', 'l2']}

# Instantiate the logistic regression classifier: logreg
logreg = linear_model.LogisticRegression(solver='liblinear')

# Create train and test sets; the test set here will function as the hold-out set.
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.4, random_state=42)

# Instantiate the GridSearchCV object: logreg_cv
logreg_cv = GridSearchCV(logreg, param_grid, cv=5)

# Fit it to the training data
logreg_cv.fit(X_train, y_train)

# Print the optimal parameters and best score
print("Tuned Logistic Regression Parameter: {}".format(logreg_cv.best_params_))
print("Tuned Logistic Regression Accuracy: {}".format(logreg_cv.best_score_))

Tuned Logistic Regression Parameter: {'C': 31.622776601683793, 'penalty': 'l2'}
Tuned Logistic Regression Accuracy: 0.7673337648793469


# Practice GridSearchCV with Regression

 * Lasso used the L1 penalty to regularize, while ridge used the L2 penalty.
 * There is another type of regularized regression known as the **elastic net**. 
 * In elastic net regularization, the penalty term is a linear combination of the L1 and L2 penalties:
 
 $$\text{Elastic Net} = aL_1 + bL_2$$
 <br>
 * In scikit-learn, this term is represented by the 'l1_ratio' parameter: 
 
 $$ \text{l1_ratio} = \frac{a}{a+b} $$
 <br>
 * An 'l1_ratio' of 1 corresponds to a pure L1 penalty, and anything lower is a combination of L1 and L2.
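
As a small sketch of this relationship: scikit-learn's `ElasticNet` documents its penalty weights in terms of `alpha = a + b` and `l1_ratio = a / (a + b)`, so the two weights can be recovered from the two parameters. The helper `penalty_weights` is hypothetical, written only for illustration, and the data is synthetic.

```python
import numpy as np
from sklearn.datasets import make_regression
from sklearn.linear_model import ElasticNet, Lasso

def penalty_weights(alpha, l1_ratio):
    """Hypothetical helper: recover the L1 weight a and L2 weight b
    from alpha = a + b and l1_ratio = a / (a + b)."""
    a = alpha * l1_ratio
    b = alpha * (1 - l1_ratio)
    return a, b

a, b = penalty_weights(alpha=1.0, l1_ratio=0.25)
print("a = {}, b = {}".format(a, b))  # a = 0.25, b = 0.75

# With l1_ratio = 1 the L2 weight vanishes and the model matches Lasso
X_demo, y_demo = make_regression(n_samples=200, n_features=6, noise=10.0,
                                 random_state=0)
enet = ElasticNet(alpha=0.5, l1_ratio=1.0).fit(X_demo, y_demo)
lasso = Lasso(alpha=0.5).fit(X_demo, y_demo)
same_coefficients = np.allclose(enet.coef_, lasso.coef_)
print("Same coefficients as Lasso:", same_coefficients)
```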
 
##### Let's tune the 'l1_ratio' of an elastic net model trained on the Gapminder data.

In [64]:
# import data as DF
gapminder = pd.read_csv('regression/data/gm_2008_region.csv', header=0)
gapminder.head(2)

Unnamed: 0,population,fertility,HIV,CO2,BMI_male,GDP,BMI_female,life,child_mortality,Region
0,34811059.0,2.73,0.1,3.328945,24.5962,12314.0,129.9049,75.3,29.5,Middle East & North Africa
1,19842251.0,6.43,2.0,1.474353,22.25083,7103.0,130.1247,58.3,192.0,Sub-Saharan Africa


In [106]:
# extract features and target
X = gapminder.drop(['life', 'Region'], axis=1).values
y = gapminder.life.values

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.4, random_state=42)

# init Elastic Net object
elastic_net = linear_model.ElasticNet()

# create hyperparameter space
l1_space = np.linspace(0, 1, 30)

param_grid = {'l1_ratio' : l1_space}

# Setup the GridSearchCV object: gm_cv
gm_cv = GridSearchCV(estimator=elastic_net, param_grid=param_grid, cv=5, 
                     return_train_score=False)

gm_cv.fit(X_train, y_train)



GridSearchCV(cv=5, error_score='raise-deprecating',
       estimator=ElasticNet(alpha=1.0, copy_X=True, fit_intercept=True, l1_ratio=0.5,
      max_iter=1000, normalize=False, positive=False, precompute=False,
      random_state=None, selection='cyclic', tol=0.0001, warm_start=False),
       fit_params=None, iid=False, n_jobs=None,
       param_grid={'l1_ratio': array([0.     , 0.03448, 0.06897, 0.10345, 0.13793, 0.17241, 0.2069 ,
       0.24138, 0.27586, 0.31034, 0.34483, 0.37931, 0.41379, 0.44828,
       0.48276, 0.51724, 0.55172, 0.58621, 0.62069, 0.65517, 0.68966,
       0.72414, 0.75862, 0.7931 , 0.82759, 0.86207, 0.89655, 0.93103,
       0.96552, 1.     ])},
       pre_dispatch='2*n_jobs', refit=True, return_train_score=False,
       scoring=None, verbose=0)

In [105]:
# Predict on the test set and compute metrics: R^2 and MSE
y_pred = gm_cv.predict(X_test)
mse = mean_squared_error(y_test, y_pred)  # mean squared error; np.sqrt(mse) gives the RMSE

print("Elastic Net l1_ratio = {:.3f}".format(gm_cv.best_params_['l1_ratio']))
print("Elastic Net R^2 = {:.3f}".format(gm_cv.score(X_test, y_test)))
print("Elastic Net MSE = {:.3f}".format(mse))

Elastic Net l1_ratio = 0.207
Elastic Net R^2 = 0.867
Elastic Net MSE = 10.058
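
Note that `mean_squared_error` returns the mean squared error. The root mean squared error is its square root and is expressed in the units of the target. A minimal sketch with made-up values (`y_true_demo` and `y_pred_demo` are hypothetical, not Gapminder results):

```python
import numpy as np
from sklearn.metrics import mean_squared_error

# Hypothetical targets and predictions, chosen only to make the arithmetic easy
y_true_demo = np.array([70.0, 65.0, 80.0, 75.0])
y_pred_demo = np.array([72.0, 64.0, 77.0, 75.0])

mse = mean_squared_error(y_true_demo, y_pred_demo)  # (4 + 1 + 9 + 0) / 4 = 3.5
rmse = np.sqrt(mse)                                 # square root of the MSE

print("MSE  = {:.3f}".format(mse))   # MSE  = 3.500
print("RMSE = {:.3f}".format(rmse))  # RMSE = 1.871
```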


Since the ratio is closer to 0 than to 1, the elastic net penalty leans mostly on the L2 (ridge) term rather than the L1 (lasso) term. With this example we end up with approximately:

$$\text{Elastic Net} \approx 0.2L_1 + 0.8L_2$$ 