## Get the data

In [1]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns

from sklearn import datasets
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LinearRegression
from sklearn.linear_model import LogisticRegression
from sklearn.neighbors import KNeighborsClassifier
from sklearn.tree import DecisionTreeClassifier

import warnings
warnings.simplefilter('ignore')

## Cross validation: k-fold CV

The purpose of testing is to estimate a models quality of predicting data out of sample. A single train-test split of the data bears the risk of not being representative for a model's ability to generalize. Hence, where possible, multiple splits are preferred. An unordered dataset is typically split into folds. <u>Out of k folds each one is used as test set in turn</u>. There are as many train-test splits and scores as there are folds.

This maximizes the amount of data that is used to train the model, as during the course of training, the model is not only trained, but also tested on all of the available data. However, <u>the number of folds also determines the computational cost</u>.

In [2]:
from sklearn.model_selection import cross_val_score

boston_X, boston_y = datasets.load_boston(return_X_y=True)

reg = LinearRegression()
cv_results = cross_val_score(reg, boston_X, boston_y, cv=5)  # gives array of R2s
print(cv_results)

[ 0.63919994  0.71386698  0.58702344  0.07923081 -0.25294154]


## Hyperparameter tuning

### Basics

Hyperparameters are parameters that cannot be directley learned by fitting a model. Hyperparameter tuning chooses these parameters based on the success of the related model.

In scikit-learn hyperparameters are passed as arguments to the constructor of the estimator class. Examples include `C`, `kernel` and `gamma` for Support Vector Classifier and `alpha` (coefficient penalty) for Lasso.

One can search the hyper-parameter space for the best cross validation score. Any parameter provided when constructing an estimator may be optimized in this manner. To find names and values of all parameters for a given estimator one can use `estimator.get_params()`.

In [4]:
knn = KNeighborsClassifier()
knn.get_params()

{'algorithm': 'auto',
 'leaf_size': 30,
 'metric': 'minkowski',
 'metric_params': None,
 'n_jobs': None,
 'n_neighbors': 5,
 'p': 2,
 'weights': 'uniform'}

### Grid search

A basic method for hyperparameter optimization is grid search. A <u>grid refers to combinations of plausible hyperparameter values</u>. The combination is then determined through a 'grid search'. The `sklearn` documentation shows the names of each model's hyperparameters.

**Example 1**: Find in-sample optimal number of neighbors for knn classifier with `GridSearchCV`.

In [3]:
X, y = datasets.load_iris(return_X_y=True) 
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=21, stratify=y)

from sklearn.model_selection import GridSearchCV

param_grid = {'n_neighbors': np.arange(1, 50)}  # grid is specified as dictionary of key-range pairs
knn = KNeighborsClassifier()  # initiate estimator

knn_cv = GridSearchCV(knn, param_grid, cv=5)  # initiate grid search object
knn_cv.fit(X_train, y_train)  # fit performs the actual grid search in place
print(knn_cv.best_params_)  # return the most successful hyperparameters
print(knn_cv.best_score_)  # return the score (here accuracy) of the most successful hyperparameters

{'n_neighbors': 3}
0.9714285714285715


**Example 2**: Find historically optimal regularization parameter for logistic regression

In [4]:
X, y = datasets.load_iris(return_X_y=True) 
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=21, stratify=y)

from sklearn.model_selection import GridSearchCV

param_grid = {'C': np.logspace(-5, 8, 15)}  # Setup the hyperparameter grid
logreg = LogisticRegression()  # instantiate logistic regression classifier

logreg_cv = GridSearchCV(logreg, param_grid, cv=5)  # instantiate the GridSearchCV object
logreg_cv.fit(X, y)  # fit it to the data
print("Tuned Logistic Regression Parameters: {}".format(logreg_cv.best_params_)) 
print("Best score is {}".format(logreg_cv.best_score_))

Tuned Logistic Regression Parameters: {'C': 31.622776601683793}
Best score is 0.9800000000000001


Hyperparameter tuning should split out a test or **hold-out set** before the tuning. The hold-out set serves as basis for testing the predictive quality of the overall method we used, i.e. hypterparameter tuning and estimation.

In [5]:
X, y = datasets.load_iris(return_X_y=True) 
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=21, stratify=y)

from sklearn.model_selection import GridSearchCV

param_grid = {'C': np.logspace(-5, 8, 15), 'penalty': ['l1', 'l2']}  # Create the hyperparameter grid
logreg = LogisticRegression()  # instantiate the logistic regression classifier

logreg_cv = GridSearchCV(logreg, param_grid, cv=5)  # instantiate GridSearchCV object
logreg_cv.fit(X_train, y_train)  # fit it to the training data to hyperparameter tuning

print("Tuned Logistic Regression Parameter: {}".format(logreg_cv.best_params_))
print("Tuned Logistic Regression Accuracy: {:.2f}".format(logreg_cv.best_score_))

y_pred = logreg_cv.predict(X_test)  # predict for the holdout test set
logreg_cv.score(X_test, y_test)  # checkout test accuracy
print("Test set accuracy: {:.2f}".format(logreg_cv.best_score_))

Tuned Logistic Regression Parameter: {'C': 3.727593720314938, 'penalty': 'l2'}
Tuned Logistic Regression Accuracy: 0.97
Test set accuracy: 0.97


### Randomized search

GridSearchCV can be computationally too expensive when searching over a large multi-variate hyperparameter spaces. Randomized search with `RandomizedSearchCV` is a "cheaper" solution in which not all hyperparameter values are tried out. Instead, a fixed number of hyperparameter settings is sampled from specified probability distributions. 

For example, decision trees have many parameters that can be tuned, such as `max_features`, `max_depth`, and `min_samples_leaf`. This makes it an ideal use case for randomized search.

In [15]:
from scipy.stats import randint
from sklearn.model_selection import RandomizedSearchCV

param_dist = {"max_depth": [3, None],  # set up the parameters and distributions to sample from
              "max_features": randint(1, 9),
              "min_samples_leaf": randint(1, 9),
              "criterion": ["gini", "entropy"]}

tree = DecisionTreeClassifier()  # instantiate a Decision Tree classifier
tree_cv = RandomizedSearchCV(tree, param_dist, cv=5)  # instantiate the RandomizedSearchCV object
tree_cv.fit(X, y)  # fit the RSCV object to the data

print("Tuned Decision Tree Parameters: {}".format(tree_cv.best_params_))
print("Best score is {}".format(tree_cv.best_score_))

Tuned Decision Tree Parameters: {'criterion': 'gini', 'max_depth': 3, 'max_features': 3, 'min_samples_leaf': 3}
Best score is 0.9466666666666667


## Tryouts

GridSearch typically uses full pipelines and grids of hyperparameters to perform cross validation.

In [24]:
from sklearn.neighbors import KNeighborsRegressor
from sklearn.preprocessing import StandardScaler
from sklearn.pipeline import Pipeline

steps = [('scale', StandardScaler()), ('model', KNeighborsRegressor())]
pipe = Pipeline(steps)

print(pipe.get_params()) # view all settings of pipeline
param_grid = {'model__n_neighbors': [i for i in range(1, 11)]} 

mod = GridSearchCV(estimator = pipe,
                  param_grid = param_grid,
                  cv = 3)

X, y = datasets.load_boston(return_X_y=True)
mod.fit(X, y)
df = pd.DataFrame(mod.cv_results_)  # pass output dictionary to pandas df
df

{'memory': None, 'steps': [('scale', StandardScaler()), ('model', KNeighborsRegressor())], 'verbose': False, 'scale': StandardScaler(), 'model': KNeighborsRegressor(), 'scale__copy': True, 'scale__with_mean': True, 'scale__with_std': True, 'model__algorithm': 'auto', 'model__leaf_size': 30, 'model__metric': 'minkowski', 'model__metric_params': None, 'model__n_jobs': None, 'model__n_neighbors': 5, 'model__p': 2, 'model__weights': 'uniform'}


Unnamed: 0,mean_fit_time,std_fit_time,mean_score_time,std_score_time,param_model__n_neighbors,params,split0_test_score,split1_test_score,split2_test_score,mean_test_score,std_test_score,rank_test_score
0,0.001337,0.0004768777,0.002003,5.510628e-06,1,{'model__n_neighbors': 1},0.226933,0.432998,0.127635,0.262522,0.127179,10
1,0.001664,0.0004700285,0.001334,0.0004703637,2,{'model__n_neighbors': 2},0.358216,0.409229,0.172294,0.313246,0.101821,9
2,0.001006,5.74298e-06,0.001657,0.0004743768,3,{'model__n_neighbors': 3},0.413515,0.476651,0.318534,0.4029,0.064986,1
3,0.001328,0.0004653012,0.001998,1.011524e-06,4,{'model__n_neighbors': 4},0.475349,0.402495,0.273014,0.383619,0.083675,7
4,0.001333,0.0004713142,0.001999,1.94668e-07,5,{'model__n_neighbors': 5},0.512318,0.347951,0.26259,0.374286,0.103638,8
5,0.000999,2.973602e-07,0.001999,1.030086e-06,6,{'model__n_neighbors': 6},0.533611,0.389504,0.248482,0.390532,0.116406,6
6,0.001023,2.896391e-05,0.001975,0.0008314123,7,{'model__n_neighbors': 7},0.544782,0.385199,0.243668,0.391216,0.123003,5
7,0.001004,6.858651e-06,0.002332,0.0004713147,8,{'model__n_neighbors': 8},0.589644,0.39465,0.209714,0.398003,0.155124,2
8,0.001329,0.0004651889,0.002002,6.244564e-06,9,{'model__n_neighbors': 9},0.590352,0.407556,0.185253,0.394387,0.165643,3
9,0.001004,7.362293e-06,0.002332,0.0004712583,10,{'model__n_neighbors': 10},0.61651,0.395077,0.164023,0.39187,0.184741,4


In [15]:
[i for i in range(1, 11)]

[1, 2, 3, 4, 5, 6, 7, 8, 9, 10]

https://www.youtube.com/watch?v=0B5eIE_1vpU  from min 23 on GridSearch for pipeline

https://www.youtube.com/watch?v=4PXAztQtoTg

https://scikit-learn.org/stable/tutorial/index.html#tutorial-menu

now:
https://scikit-learn.org/stable/tutorial/statistical_inference/supervised_learning.html