## Get the data

In [7]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns

from sklearn import datasets
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LinearRegression
from sklearn.neighbors import KNeighborsClassifier

import warnings
warnings.simplefilter('ignore')

## Cross validation: k-fold CV

The purpose of testing is to estimate how well a model predicts out-of-sample data. A single train-test split carries a greater risk of being unrepresentative of the model's ability to generalize, so multiple splits are preferred where possible. An unordered dataset is typically split into k folds. <u>Each of the k folds is used as the test set in turn</u>, while the remaining k-1 folds are used for training. There are as many train-test splits and scores as there are folds.

This maximizes the use of the available data: over the course of the procedure, every observation is used for training and, exactly once, for testing. However, <u>the number of folds also determines the computational cost</u>, because the model is refit once per fold.

In [12]:
from sklearn.model_selection import cross_val_score

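# Note: load_boston was deprecated in scikit-learn 1.0 and removed in 1.2,
# so this cell requires an older scikit-learn version.
boston_X, boston_y = datasets.load_boston(return_X_y=True)

reg = LinearRegression()
cv_results = cross_val_score(reg, boston_X, boston_y, cv=5)  # returns an array of R^2 scores, one per fold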
print(cv_results)

[ 0.63919994  0.71386698  0.58702344  0.07923081 -0.25294154]
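
The fold-by-fold procedure described above can also be written out by hand. The following is a minimal sketch, not the notebook's own method: it assumes the diabetes dataset bundled with `sklearn` as a stand-in, and the names `manual_scores` and `reference_scores` are illustrative. It loops over the splits produced by `KFold` and compares the result with `cross_val_score`.

```python
import numpy as np
from sklearn.datasets import load_diabetes
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import KFold, cross_val_score

# Assumption: the bundled diabetes data stands in for the dataset used above.
X, y = load_diabetes(return_X_y=True)

kf = KFold(n_splits=5)  # 5 contiguous folds, no shuffling

manual_scores = []
for train_idx, test_idx in kf.split(X):
    reg = LinearRegression()
    reg.fit(X[train_idx], y[train_idx])  # train on the other 4 folds
    # score() returns R^2 for regressors; evaluate on the held-out fold
    manual_scores.append(reg.score(X[test_idx], y[test_idx]))
manual_scores = np.array(manual_scores)

# The same splits passed to cross_val_score should give the same scores.
reference_scores = cross_val_score(LinearRegression(), X, y, cv=kf)

print(manual_scores)
print(np.allclose(manual_scores, reference_scores))
```

The model is refit once per fold, which is where the computational cost comes from. `KFold` also accepts `shuffle=True` together with a `random_state`, which may help when the rows of a dataset are ordered.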


## Hyperparameter tuning

### Grid search

Hyperparameters are parameters that cannot be learned by fitting a model; they must be set before fitting. Hyperparameter tuning refers to methods for choosing these values based on the performance of the resulting model.

A basic method is grid search. A <u>grid refers to the combinations of plausible hyperparameter values</u>. A grid search evaluates the model for every combination in the grid, typically with cross-validation, and selects the best-scoring one. The `sklearn` documentation lists the names of each model's hyperparameters.
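
To make the idea of a grid concrete, the search can be sketched as a plain loop. This is only an illustration under stated assumptions: the grid values and the names `grid`, `best_params` and `best_score` are hypothetical, and the loop is a simplified picture of what a grid search does, not the internals of any `sklearn` class.

```python
import numpy as np
from sklearn.datasets import load_iris
from sklearn.model_selection import ParameterGrid, cross_val_score
from sklearn.neighbors import KNeighborsClassifier

X, y = load_iris(return_X_y=True)

# Hypothetical grid for illustration: 10 x 2 = 20 combinations.
grid = ParameterGrid({
    'n_neighbors': np.arange(1, 11),
    'weights': ['uniform', 'distance'],
})

best_score, best_params = -np.inf, None
for params in grid:
    model = KNeighborsClassifier(**params)
    # Mean accuracy over 5 folds for this combination
    score = cross_val_score(model, X, y, cv=5).mean()
    if score > best_score:  # keep the first combination with the highest score
        best_score, best_params = score, params

print(len(grid), best_params, best_score)
```

Each extra hyperparameter multiplies the number of combinations, and each combination is fit once per fold, so the cost of a grid search grows quickly.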

Example 1: Tuning a k-nearest neighbors (k-NN) classifier with `GridSearchCV`.

In [10]:
X, y = datasets.load_iris(return_X_y=True) 
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=21, stratify=y)

from sklearn.model_selection import GridSearchCV
param_grid = {'n_neighbors': np.arange(1, 50)}  # grid is a dictionary mapping hyperparameter names to candidate values

knn = KNeighborsClassifier()
knn_cv = GridSearchCV(knn, param_grid, cv=5)  # creates the grid search object
knn_cv.fit(X_train, y_train)  # fit performs the actual grid search in place
print(knn_cv.best_params_)  # the most successful hyperparameters
print(knn_cv.best_score_)  # mean cross-validated score (here accuracy) of the most successful hyperparameters
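
The split above creates `X_test` and `y_test`, but the cell does not use them. A possible next step, sketched here as a self-contained example, is to score the tuned model on that held-out split. The variable `test_accuracy` is an illustrative name; the sketch relies on `GridSearchCV` refitting the best model on the full training set, which is its default behavior (`refit=True`).

```python
import numpy as np
from sklearn import datasets
from sklearn.model_selection import GridSearchCV, train_test_split
from sklearn.neighbors import KNeighborsClassifier

X, y = datasets.load_iris(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, random_state=21, stratify=y
)

# Search over k using only the training data
knn_cv = GridSearchCV(KNeighborsClassifier(), {'n_neighbors': np.arange(1, 50)}, cv=5)
knn_cv.fit(X_train, y_train)

# score() uses the refitted best estimator; the test split was never seen during tuning
test_accuracy = knn_cv.score(X_test, y_test)
print(knn_cv.best_params_, test_accuracy)
```

Because the test split plays no part in choosing `n_neighbors`, its accuracy is a less optimistic estimate than `best_score_`.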

In [12]:
y_train

array([1, 1, 2, 0, 2, 0, 1, 2, 1, 0, 2, 0, 1, 1, 2, 1, 2, 1, 1, 1, 2, 0,
       1, 1, 1, 0, 1, 0, 2, 2, 1, 1, 2, 2, 2, 0, 0, 1, 0, 1, 1, 0, 1, 2,
       2, 1, 0, 1, 2, 2, 0, 0, 0, 2, 2, 1, 2, 2, 1, 2, 0, 2, 1, 2, 0, 2,
       0, 0, 1, 2, 0, 0, 1, 0, 2, 0, 2, 0, 0, 1, 2, 0, 2, 2, 1, 2, 0, 2,
       1, 0, 0, 0, 0, 2, 1, 2, 1, 1, 1, 0, 1, 0, 0, 2, 0])

## Tryouts

https://scikit-learn.org/stable/tutorial/index.html#tutorial-menu

Currently at:
https://scikit-learn.org/stable/tutorial/statistical_inference/supervised_learning.html