# SK3: Cross-Validation and Hyperparameter Tuning

## Learning Objective
- Implement various cross-validation strategies.
- Perform grid search to identify optimal hyperparameter values.

As in Part 1, we shall use the following datasets for regression, binary, and multiclass
classification problems.
1. `Breast Cancer Wisconsin` Data. The target feature is binary, i.e., whether a cancer diagnosis is
"malignant" or "benign".
2. `Boston Housing` Data. The target feature is continuous. The target is house prices in Boston
in the 1970s.
3. `Wine` Data. The target feature is multiclass. It consists of three types of wines in Italy.

We use KNN, DT, and NB models to illustrate how cross-validation is used to tune
hyperparameters of a machine learning algorithm via grid search by going through the Breast
Cancer Data and Boston Housing Data. We will leave Wine Data and other machine learning
models as exercises.

In [None]:
# !pip install --upgrade altair
# !pip install vega vega_datasets

In [2]:
import altair as alt
alt.renderers.enable('html')

RendererRegistry.enable('html')

In [3]:
import warnings
warnings.filterwarnings('ignore')

import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.datasets import load_breast_cancer
from sklearn import preprocessing

cancer_df = load_breast_cancer()
Data,target = cancer_df.data, cancer_df.target

Data = preprocessing.MinMaxScaler().fit_transform(Data)

# target is already encoded, but we need to reverse the labels so that
# malignant is the positive class
target = np.where(target==0,1,0)

D_train, D_test, t_train, t_test = train_test_split(Data, target,
                                                    test_size=0.3,
                                                    random_state=999)

## Nearest Neighbor Model

In [4]:
from sklearn.neighbors import KNeighborsClassifier

knn_classifier = KNeighborsClassifier(n_neighbors=1,p=2)
knn_classifier.fit(D_train, t_train)
knn_classifier.score(D_test,t_test)

0.9473684210526315

The 1-NN classifier yields an accuracy score of around 94.7%. So, how can we improve this
score? One way is to search for the set of "hyperparameter" values that produces the highest
accuracy score. For a nearest neighbor model, the hyperparameters are as follows:
- Number of neighbors.
- Metric: Manhattan (p=1), Euclidean (p=2) or Minkowski (any p larger than 2). Technically,
p=1 and p=2 are also Minkowski metrics, but in this notebook, we shall adopt the
convention that the Minkowski metric corresponds to $p ≥ 3$.
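
To make the role of $p$ concrete, here is a minimal sketch of the Minkowski distance computed with NumPy. The helper `minkowski_distance` and the two points are illustrative assumptions and not part of the notebook; scikit-learn computes this distance internally when `p` is passed to `KNeighborsClassifier`.

```python
import numpy as np

def minkowski_distance(a, b, p):
    """Minkowski distance of order p between two points (hypothetical helper)."""
    a, b = np.asarray(a, dtype=float), np.asarray(b, dtype=float)
    return float(np.sum(np.abs(a - b) ** p) ** (1.0 / p))

# Two illustrative points whose coordinates differ by 3 and 4
x = [0.0, 0.0]
y = [3.0, 4.0]

d_manhattan = minkowski_distance(x, y, p=1)  # |3| + |4| = 7
d_euclidean = minkowski_distance(x, y, p=2)  # sqrt(9 + 16) = 5
d_minkowski = minkowski_distance(x, y, p=5)  # lies between max(3, 4) = 4 and 5

print(d_manhattan, d_euclidean, d_minkowski)
```

As $p$ grows, the largest coordinate difference dominates the distance, which is why different values of $p$ can change which neighbors are "nearest".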

To search for the "best" set of hyperparameters, popular approaches are as follows:
- Random search: As its name suggests, it randomly selects the hyperparameter set to train models.
- Bayesian search: It is beyond the scope of this course. So we shall not cover it here.
- Grid search

Grid search is the most common approach. It exhaustively searches through all combinations of
the hyperparameter values in a user-specified grid during the training phase. For example,
consider a KNN model. We can specify a grid of number of neighbors (K = 1, 2, 3) and two metrics
(p = 1, 2). The grid search starts by training a model with K = 1 and p = 1 and calculates its
accuracy score. Then it moves on to train models with (K = 2, p = 1), (K = 3, p = 1),
(K = 1, p = 2), ..., and (K = 3, p = 2) and obtains their scores. Based on the accuracy scores,
the grid search ranks the models and determines the set of hyperparameter values that gives the
highest accuracy score.
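
The procedure described above can be sketched by hand as a loop over the grid. This is only an illustration: for brevity it scores each candidate on a single hold-out split rather than by cross-validation, and the names `X_tr`, `X_val`, `results`, and `best` are assumptions made for this sketch.

```python
from itertools import product

from sklearn.datasets import load_breast_cancer
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier
from sklearn.preprocessing import MinMaxScaler

# Same data preparation as earlier in the notebook
X, y = load_breast_cancer(return_X_y=True)
X = MinMaxScaler().fit_transform(X)
X_tr, X_val, y_tr, y_val = train_test_split(X, y, test_size=0.3, random_state=999)

k_values = [1, 2, 3]
p_values = [1, 2]

results = []
# Visits (K=1, p=1), (K=2, p=1), (K=3, p=1), (K=1, p=2), ..., (K=3, p=2)
for p, k in product(p_values, k_values):
    model = KNeighborsClassifier(n_neighbors=k, p=p).fit(X_tr, y_tr)
    results.append({'n_neighbors': k, 'p': p,
                    'accuracy': model.score(X_val, y_val)})

# Rank candidates by accuracy, best first
results.sort(key=lambda r: r['accuracy'], reverse=True)
best = results[0]
print(best)
```

With 3 values of K and 2 values of p, the loop trains 3 x 2 = 6 models, one per grid point.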

Before we proceed further, we shall cover other cross-validation (CV) methods since tuning
hyperparameters via grid search is usually cross-validated to avoid overfitting.

## Cross-Validation
Two popular options for cross-validation are 5-fold and 10-fold. In 5-fold cross-validation, for
instance, the entire dataset is partitioned into 5 equal-sized chunks. The first four chunks are
used for training and the 5th chunk is used for testing. Next, all the chunks other than the 4th
chunk are used for training and the 4th chunk is used for testing, and so on. In the last iteration,
all the chunks other than the 1st chunk are used for training and the 1st chunk is used for
testing. The final step is to take the average of these 5 test accuracies and report it as the
overall cross-validation accuracy. Please see the figure below for an illustration of 10-fold
cross-validation (source: karlrosaen.com). Please refer to Chapter 8 in the textbook for more
information.

<img src=http://karlrosaen.com/ml/learning-log/2016-06-20/k-fold-diagram.png width="600">
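
The 5-fold procedure can be written out by hand to show the mechanics. The sketch below is an assumption-laden illustration using plain NumPy index splitting (names such as `chunks` and `fold_scores` are made up here); in practice scikit-learn's cross-validators do this for us.

```python
import numpy as np
from sklearn.datasets import load_breast_cancer
from sklearn.neighbors import KNeighborsClassifier
from sklearn.preprocessing import MinMaxScaler

X, y = load_breast_cancer(return_X_y=True)
X = MinMaxScaler().fit_transform(X)

# Shuffle the row indices once, then cut them into 5 roughly equal chunks
rng = np.random.default_rng(999)
indices = rng.permutation(len(X))
chunks = np.array_split(indices, 5)

fold_scores = []
for i, test_idx in enumerate(chunks):
    # Train on every chunk except the i-th, test on the i-th
    train_idx = np.concatenate([c for j, c in enumerate(chunks) if j != i])
    model = KNeighborsClassifier(n_neighbors=1, p=2)
    model.fit(X[train_idx], y[train_idx])
    fold_scores.append(model.score(X[test_idx], y[test_idx]))

# Overall cross-validation accuracy is the mean of the 5 test accuracies
cv_accuracy = float(np.mean(fold_scores))
print(fold_scores, cv_accuracy)
```

Every row is used for testing exactly once and for training four times, which is what distinguishes cross-validation from a single hold-out split.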

In contrast to hold-out sampling, cross-validation is usually the preferred option for the
following two reasons:
- Sometimes there is just not enough data for hold-out sampling.
- Cross-validation reduces the risk of what is called a "lucky split", where the difficult instances
are put in the training partition and the easy instances are put in the test partition.

A downside of cross-validation is that it requires more computation time, since the model is
trained once per fold. Also, if it happens to be the case that there is a good amount of data
available already (say millions of rows), then the risk of a "lucky split" diminishes and hold-out
sampling can be preferred. An extension of cross-validation is repeated cross-validation (say 3
times), where the data is partitioned into 5 equal-sized chunks multiple times and the
cross-validation procedure is repeated, each time with a different partitioning of the data.

We can perform K-fold cross-validation by using the `KFold` class imported from the
`sklearn.model_selection` module. It splits the full dataset into K subsets or "folds"
(randomly if `shuffle=True`; by default the folds are consecutive blocks of rows). Then it trains
the model on K-1 folds and evaluates the model against the remaining fold. This process is
repeated exactly K times, where each time a different fold is used for testing.

Other cross-validation variants from scikit-learn are as follows:
- `model_selection.RepeatedKFold()`: Repeated K-Fold cross-validator
- `model_selection.RepeatedStratifiedKFold()`: Repeated Stratified K-Fold cross-validator
- `model_selection.StratifiedKFold()`: Stratified K-Fold cross-validator
- `model_selection.LeaveOneOut()`: Leave One Out cross-validator

To learn more about cross-validators, please refer to scikit-learn documentation.
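
As a small illustration of how a cross-validator such as `KFold` hands out row indices, the sketch below splits a toy array of 10 rows into 5 folds. The array `X_toy` is an assumption made for this example and is not one of the course datasets.

```python
import numpy as np
from sklearn.model_selection import KFold

# Toy data: 10 rows, 1 feature (illustrative only)
X_toy = np.arange(10).reshape(-1, 1)

# shuffle=True randomizes fold membership; random_state makes it repeatable
kf = KFold(n_splits=5, shuffle=True, random_state=999)
splits = list(kf.split(X_toy))

for fold, (train_idx, test_idx) in enumerate(splits, start=1):
    print(f"fold {fold}: train={train_idx}, test={test_idx}")
```

Each fold tests on 2 of the 10 rows and trains on the other 8, and the five test sets together cover every row exactly once.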

**Refresher questions**
1. What are the disadvantages of a simple train/test split?
2. Can you tell the difference between the cross-validators above?

In the following example, we illustrate how we can conduct a stratified 5-fold (`n_splits=5`)
cross-validation with 3 repetitions (`n_repeats=3`) using the `RepeatedStratifiedKFold`
class. Since the target has fewer malignant labels than benign ones, stratification
ensures that the proportion of the two labels in both the training and test sets is approximately
the same as the proportion in the full dataset in each cross-validation repetition.

In [5]:
from sklearn.model_selection import RepeatedStratifiedKFold

cv_method = RepeatedStratifiedKFold(n_splits=5,
                                   n_repeats=3,
                                   random_state=999)
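
To see what stratification does, the following sketch compares the malignant rate in each test fold with the rate in the full dataset. It is a self-contained check under the same settings as above; the names `cv`, `overall_rate`, and `test_rates` are assumptions made for this illustration.

```python
import numpy as np
from sklearn.datasets import load_breast_cancer
from sklearn.model_selection import RepeatedStratifiedKFold

# Recreate the target with malignant as the positive class (1)
y = load_breast_cancer().target
y = np.where(y == 0, 1, 0)

cv = RepeatedStratifiedKFold(n_splits=5, n_repeats=3, random_state=999)

overall_rate = y.mean()
# split() only needs the number of rows from X, so a placeholder array suffices
X_placeholder = np.zeros((len(y), 1))
test_rates = [y[test_idx].mean() for _, test_idx in cv.split(X_placeholder, y)]

print(f"overall malignant rate: {overall_rate:.3f}")
print(f"test-fold rates range:  {min(test_rates):.3f} to {max(test_rates):.3f}")
print(f"number of train/test splits: {cv.get_n_splits()}")
```

The rates cannot match exactly because the fold sizes are whole numbers, but they stay very close to the overall rate in all 5 x 3 = 15 splits.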

## KNN Hyperparameter Tuning and Visualisation

It's hyperparameter tuning time. First, we need to define a dictionary of KNN parameters for the grid search. Here, we will consider K values between 1 and 7 and $p$ values of 1 (Manhattan), 2 (Euclidean), and 5 (Minkowski).

In [6]:
params_KNN = {'n_neighbors': [1,2,3,4,5,6,7],
             'p':[1,2,5]}

Second, we pass `KNeighborsClassifier()` and `params_KNN` as the model and the
parameter dictionary into `GridSearchCV`. In addition, we include the repeated stratified CV method we defined previously (`cv=cv_method`). Also, we tell sklearn which metric to optimize, which is accuracy in our example (`scoring='accuracy'`). Since `refit` defaults to `True`, the best model found is then refit on the whole dataset.

In [9]:
from sklearn.model_selection import GridSearchCV

gs_KNN = GridSearchCV(estimator=KNeighborsClassifier(),
                     param_grid=params_KNN,
                     cv=cv_method,
                     verbose=1,
                     scoring='accuracy',
                     return_train_score=True)

The last step is to fit the grid search object, and with it the KNN models, using the full dataset.

In [10]:
gs_KNN.fit(Data,target)

Fitting 15 folds for each of 21 candidates, totalling 315 fits


GridSearchCV(cv=RepeatedStratifiedKFold(n_repeats=3, n_splits=5, random_state=999),
             estimator=KNeighborsClassifier(),
             param_grid={'n_neighbors': [1, 2, 3, 4, 5, 6, 7], 'p': [1, 2, 5]},
             return_train_score=True, scoring='accuracy', verbose=1)
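
The counts reported in the fitting message follow directly from the grid and the CV scheme. The sketch below reproduces the arithmetic using `ParameterGrid` and `get_n_splits()`; it redefines `params_KNN` and `cv_method` only so that it can run on its own.

```python
from sklearn.model_selection import ParameterGrid, RepeatedStratifiedKFold

params_KNN = {'n_neighbors': [1, 2, 3, 4, 5, 6, 7], 'p': [1, 2, 5]}
cv_method = RepeatedStratifiedKFold(n_splits=5, n_repeats=3, random_state=999)

n_candidates = len(ParameterGrid(params_KNN))  # 7 values of K x 3 values of p
n_folds = cv_method.get_n_splits()             # 5 splits x 3 repeats
n_fits = n_candidates * n_folds

print(n_candidates, n_folds, n_fits)  # 21 15 315
```

This count covers the cross-validation fits only; because `refit` defaults to `True`, `GridSearchCV` also performs one final fit of the best candidate on the whole dataset.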


To get the best parameter values, we call the `best_params_` attribute.

In [12]:
gs_KNN.best_params_

{'n_neighbors': 3, 'p': 1}