# Cross Validation and Hyperparameter Search

## Cross validation

In its simplest form, cross validation means splitting a dataset into two parts: a training set, which the model is trained on, and a test set, which the model is evaluated on. A common choice is an 80-20 split: the model is trained on a random 80% of the dataset, and we then evaluate how well it did using the remaining 20%.

Since that 20% was never seen by the model during training, the model is not optimized for it, and we can expect model performance "in the wild" to be closely approximated by model performance on our test data!
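As a minimal sketch of the 80-20 split described above, the snippet below uses scikit-learn's `train_test_split`. The data is synthetic and the names `X_demo`, `y_demo`, and `holdout_model` are placeholders chosen for illustration; they are assumptions, not part of the Gapminder exercise that follows.

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import train_test_split

# Synthetic stand-in data (an assumption; not the Gapminder data used later)
rng = np.random.default_rng(0)
X_demo = rng.normal(size=(200, 3))
y_demo = X_demo @ np.array([1.5, -2.0, 0.5]) + rng.normal(scale=0.1, size=200)

# Hold out a random 20% of the rows for testing
X_train, X_test, y_train, y_test = train_test_split(
    X_demo, y_demo, test_size=0.2, random_state=42
)

# Fit on the 80% only
holdout_model = LinearRegression().fit(X_train, y_train)

# For regressors, score() returns R^2
train_r2 = holdout_model.score(X_train, y_train)
test_r2 = holdout_model.score(X_test, y_test)
print("Train R^2:", train_r2)
print("Test R^2:", test_r2)
```

The test score is the number to report, since it is computed on rows the model never saw while fitting.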

## 5-fold cross-validation

**Cross-validation** is a vital step in evaluating a model. It makes full use of the available data: over the course of the procedure, every observation is used for training in some folds and for testing in exactly one fold. It also avoids relying on a single arbitrary split of the data into training and test sets.

In this exercise, you will practice 5-fold cross validation on the Gapminder data. By default, scikit-learn's `cross_val_score()` function uses R² (the coefficient of determination) as the metric of choice for **regression**. Since you are performing 5-fold cross-validation, the function will return 5 scores. Your job is to compute these 5 scores and then take their average.

<img src="images/cross-validation.png" alt="" style="width: 600px;"/>

**Cross validation** is essential, but do not forget that the more folds you use, the more computationally expensive cross-validation becomes, because the model is refit once per fold.
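To make the fold mechanics concrete, here is a small sketch using scikit-learn's `KFold` splitter on ten toy samples. The array `X_toy` and the counter `test_counts` are illustrative assumptions; the point is only to show that each sample is placed in the test set of exactly one fold.

```python
import numpy as np
from sklearn.model_selection import KFold

# Ten toy samples stand in for a dataset (an assumption for illustration)
X_toy = np.arange(10).reshape(-1, 1)

kf = KFold(n_splits=5)
test_counts = np.zeros(len(X_toy), dtype=int)

for fold, (train_idx, test_idx) in enumerate(kf.split(X_toy)):
    # Each fold trains on 8 samples and tests on the remaining 2
    print(f"Fold {fold}: train={train_idx}, test={test_idx}")
    test_counts[test_idx] += 1

# Every sample lands in the test set of exactly one fold
print(test_counts)
```

With five folds the model is fit five times, which is why the cost grows with the number of folds.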

In [3]:
import pandas as pd
import numpy as np

path = 'data/dc18/'

# Read the CSV file into a DataFrame: df
df = pd.read_csv(path+'gapminder.csv')
X = np.array(df.drop('life', axis=1).values)
y = np.array(df.life.values)

In [4]:
# Import the necessary modules
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import cross_val_score

# Create a linear regression object: reg
reg = LinearRegression()

# Compute 5-fold cross-validation scores: cv_scores
cv_scores = cross_val_score(reg, X, y, cv=5)

# Print the 5-fold cross-validation scores
print(cv_scores)

print("Average 5-Fold CV Score: {} <- R2 mean value".format(np.mean(cv_scores)))

[0.81720569 0.82917058 0.90214134 0.80633989 0.94495637]
Average 5-Fold CV Score: 0.8599627722793232 <- R2 mean value


In [5]:
%timeit cross_val_score(reg, X, y, cv=3)

1.97 ms ± 89 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)


In [6]:
%timeit cross_val_score(reg, X, y, cv=10)

5.75 ms ± 199 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)


## Hyperparameter search

With cross validation now in our toolbelt, we can approach the problem again: what polynomial model performs best when applied to our problem?

We expect cross validation to be a good estimator of model performance on never-before-seen data. Hence we can use cross validation (that is, checking the mean squared error of the model on the held-out data) for, say, every polynomial regression degree between 1 and 10. The degree with the lowest cross-validated error is our best-performing candidate among those ten.
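The search over degrees can be sketched as a plain loop. This is an illustration under stated assumptions, not the original notebook's code: the data is a synthetic noisy cubic (`x_poly`, `y_poly`), and the helper names `mean_mse` and `best_degree` are made up for this example. It relies on `cross_val_score` with `scoring="neg_mean_squared_error"`, which returns negated errors, so the sign is flipped back.

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import KFold, cross_val_score
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import PolynomialFeatures

# Synthetic noisy cubic (an assumption; replace with the real problem's data)
rng = np.random.default_rng(0)
x_poly = rng.uniform(-1, 1, size=(200, 1))
y_poly = (1 + 2 * x_poly - 3 * x_poly**3).ravel() + rng.normal(scale=0.1, size=200)

folds = KFold(n_splits=5, shuffle=True, random_state=0)
mean_mse = {}

for degree in range(1, 11):
    model = Pipeline([
        ("poly", PolynomialFeatures(degree=degree)),
        ("reg", LinearRegression()),
    ])
    # scikit-learn reports negated MSE so that higher is always better
    neg_mse = cross_val_score(
        model, x_poly, y_poly, cv=folds, scoring="neg_mean_squared_error"
    )
    mean_mse[degree] = -neg_mse.mean()

# Pick the degree with the lowest mean cross-validated error
best_degree = min(mean_mse, key=mean_mse.get)
print("Mean CV MSE per degree:", mean_mse)
print("Best degree:", best_degree)
```

On this synthetic data, degrees below 3 cannot capture the cubic term, so their cross-validated error is much higher than that of the selected degree.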

This is known as a hyperparameter search. Hyperparameter searches are important because they are, effectively, how we go about finding the most useful and effective model in a family of possible models controlled by some magic number that is fixed before training (the so-called "hyperparameter").
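scikit-learn also packages this pattern as `GridSearchCV`, which cross-validates every value in a parameter grid and records the best one. The sketch below assumes synthetic cubic data (`x_grid`, `y_grid`) and a two-step pipeline whose step names `poly` and `reg` are arbitrary choices for this example; the polynomial degree is the hyperparameter being searched.

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import GridSearchCV
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import PolynomialFeatures

# Synthetic noisy cubic (an assumption, used only to make the sketch runnable)
rng = np.random.default_rng(1)
x_grid = rng.uniform(-1, 1, size=(200, 1))
y_grid = (1 + 2 * x_grid - 3 * x_grid**3).ravel() + rng.normal(scale=0.1, size=200)

pipeline = Pipeline([
    ("poly", PolynomialFeatures()),
    ("reg", LinearRegression()),
])

# "<step name>__<parameter>" addresses a parameter inside a pipeline step
param_grid = {"poly__degree": list(range(1, 11))}

search = GridSearchCV(
    pipeline, param_grid, cv=5, scoring="neg_mean_squared_error"
)
search.fit(x_grid, y_grid)

print("Best degree:", search.best_params_["poly__degree"])
print("Best mean CV MSE:", -search.best_score_)
```

By default `GridSearchCV` refits the best configuration on all of the supplied data, so `search.predict(...)` can be used directly afterwards.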

## Resources

- [Gaming Cross Validation and Hyperparameter Search](https://www.kaggle.com/residentmario/gaming-cross-validation-and-hyperparameter-search/notebook)