<h1>Table of Contents<span class="tocSkip"></span></h1>
<div class="toc"><ul class="toc-item"><li><span><a href="#Cross-Validation" data-toc-modified-id="Cross-Validation-1">Cross Validation</a></span></li><li><span><a href="#Learning-Outcomes" data-toc-modified-id="Learning-Outcomes-2">Learning Outcomes</a></span></li><li><span><a href="#What-is-the-goal-of-Machine-Learning?" data-toc-modified-id="What-is-the-goal-of-Machine-Learning?-3">What is the goal of Machine Learning?</a></span></li><li><span><a href="#Different-Model-Evaluation-Procedures-" data-toc-modified-id="Different-Model-Evaluation-Procedures--4">Different Model Evaluation Procedures </a></span></li><li><span><a href="#Training-and-testing-on-the-same-data" data-toc-modified-id="Training-and-testing-on-the-same-data-5">Training and testing on the same data</a></span></li><li><span><a href="#Check-for-understanding" data-toc-modified-id="Check-for-understanding-6">Check for understanding</a></span></li><li><span><a href="#Train/test-split" data-toc-modified-id="Train/test-split-7">Train/test split</a></span></li><li><span><a href="#What-percentage-of-the-data-should-you-hold-out-for-testing?" data-toc-modified-id="What-percentage-of-the-data-should-you-hold-out-for-testing?-8">What percentage of the data should you hold-out for testing?</a></span></li><li><span><a href="#Common-train-/-test-splits" data-toc-modified-id="Common-train-/-test-splits-9">Common train / test splits</a></span></li><li><span><a href="#3-way-Split:-Train/Test/Validation" data-toc-modified-id="3-way-Split:-Train/Test/Validation-10">3-way Split: Train/Test/Validation</a></span></li><li><span><a href="#Common-uses-of-validation-set" data-toc-modified-id="Common-uses-of-validation-set-11">Common uses of validation set</a></span></li><li><span><a href="#k-fold-CV" data-toc-modified-id="k-fold-CV-12">k-fold CV</a></span></li><li><span><a href="#What-should-k-be-for-k-fold-cross-validation-(CV)?" 
data-toc-modified-id="What-should-k-be-for-k-fold-cross-validation-(CV)?-13">What should k be for k-fold cross validation (CV)?</a></span></li><li><span><a href="#Leave-one-out-Cross-Validation-(LOOCV)" data-toc-modified-id="Leave-one-out-Cross-Validation-(LOOCV)-14">Leave-one-out Cross-Validation (LOOCV)</a></span></li><li><span><a href="#Leave-one-out-Cross-Validation-(LOOCV)" data-toc-modified-id="Leave-one-out-Cross-Validation-(LOOCV)-15">Leave-one-out Cross-Validation (LOOCV)</a></span></li><li><span><a href="#Protip---Stratified-Sampling" data-toc-modified-id="Protip---Stratified-Sampling-16">Protip - Stratified Sampling</a></span></li><li><span><a href="#Takeaways" data-toc-modified-id="Takeaways-17">Takeaways</a></span></li></ul></div>

<center><h2>Cross Validation</h2></center>
<center><img src="https://i.stack.imgur.com/c6ECF.png" width="70%"/></center>

<center><h2>Learning Outcomes</h2></center>

__By the end of this session, you should be able to__:

- Explain 3-way split between training, validation, and test datasets.
- Explain the purpose of each type of dataset.
- Describe k-fold cross validation (CV).

<center><h2>What is the goal of Machine Learning?</h2></center>

<center>Learn a function from data that can generalize to novel data.</center>

<center><h2>Different Model Evaluation Procedures </h2></center>

1. Training and testing on the same data
1. Train/test split
1. Cross-validation

<center><h2>Training and testing on the same data</h2></center>

Train on __all__ of your data, then treat performance on that same training data as the estimate of the model's performance.

Why is only having training data a bad idea?

Evaluating only on training data rewards models that memorize that data, so it encourages overfitting (low bias, high variance) and says little about performance on novel data.
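A minimal sketch of the problem, assuming synthetic data from `make_regression` and an unconstrained `DecisionTreeRegressor` (both chosen only for illustration, neither appears elsewhere in this session). The tree memorizes the training data, so the training score looks perfect while the held-out score is lower.

```python
from sklearn.datasets import make_regression
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeRegressor

# Synthetic, noisy regression data (an assumption for illustration)
X, y = make_regression(n_samples=200, n_features=10, noise=25.0, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

# An unconstrained tree can memorize every training observation
tree = DecisionTreeRegressor(random_state=0)
tree.fit(X_train, y_train)

train_r2 = tree.score(X_train, y_train)  # evaluated on data the model has seen
test_r2 = tree.score(X_test, y_test)     # evaluated on data the model has not seen

print(f"R^2 on training data: {train_r2:.4f}")
print(f"R^2 on held-out data: {test_r2:.4f}")
```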

<center><h2>Check for understanding</h2></center>

When would you want to train and test on the same data?

1. Too little data to split up.
2. The domain is static - new data will be the same as the training data.
3. Too little time (e.g., learning in a low-latency system).

<center><h2>Train/test split</h2></center>

<center><img src="images/test.png" width="70%"/></center>

Split the dataset into two sets:

1. Training set: Data points used to train the model.
1. Testing set: Data points used to check the performance once training is __completely finished__.

In other words, the model is trained and tested on different data.

<center><h2>What percentage of the data should you hold-out for testing?</h2></center>

Mostly an empirical choice based on domain complexity and size of the data. 

Hopefully, you have "Big Data".

For example, if you only have 100 examples, a 90 train / 10 test split does not make a lot of sense because the test set has only 10 examples, too few for a stable performance estimate.

If you have 1M examples, a 90 train / 10 test split does make a lot of sense because there are 100,000 examples in the test set. 
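One rough way to see why test set size matters is the binomial standard error of an accuracy estimate. The sketch below assumes a classifier whose true accuracy is 0.9 and independent test examples; both are assumptions made for illustration, and `accuracy_standard_error` is a hypothetical helper.

```python
import math


def accuracy_standard_error(accuracy: float, n_test: int) -> float:
    """Binomial standard error of an accuracy estimate from n_test examples."""
    return math.sqrt(accuracy * (1 - accuracy) / n_test)


assumed_accuracy = 0.9  # hypothetical value, for illustration only

for n_total in (100, 1_000_000):
    n_test = n_total // 10  # 90 train / 10 test split
    se = accuracy_standard_error(assumed_accuracy, n_test)
    print(f"{n_total:>9,} examples -> {n_test:>7,} in test set, standard error ~ {se:.4f}")
```

With 10 test examples the estimate is uncertain by roughly nine to ten percentage points; with 100,000 it is uncertain by about a tenth of a percentage point.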

<center><h2>Common train / test splits</h2></center>

- 70% train / 30% test
- 80% train / 20% test
- 90% train / 10% test

<center><h2>3-way Split: Train/Test/Validation</h2></center>

<center><img src="images/data_complete.png" width="75%"/></center>
<center><img src="images/split.png" width="75%"/></center>

Split your data into 3 separate sets:

1. Test set - Final dataset for one-time evaluation.
1. Training set - Dataset for repeated training.
1. Validation set - Paired with the training dataset to evaluate performance during training.

The validation dataset is also called the development dataset.
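A 3-way split can be sketched with two calls to `train_test_split`. The 60/20/20 proportions and the toy arrays are assumptions for illustration, not a recommendation from this session.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Toy data: 100 observations with 2 features each
X = np.arange(200).reshape(100, 2)
y = np.arange(100)

# Step 1: hold out 20% of all data as the test set
X_rest, X_test, y_rest, y_test = train_test_split(
    X, y, test_size=0.2, random_state=0
)

# Step 2: split the remaining 80% into train and validation.
# 25% of the remaining 80 observations is 20, giving 60/20/20 overall.
X_train, X_val, y_train, y_val = train_test_split(
    X_rest, y_rest, test_size=0.25, random_state=0
)

print(len(X_train), len(X_val), len(X_test))
```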

[Source](https://scikit-learn.org/stable/modules/cross_validation.html)

<center><img src="images/cross_validation.png" width="75%"/></center>

<center><h2>Common uses of validation set</h2></center>

<center><img src="images/validation_dataset.png" width="70%"/></center>

1) Compare different hyperparameters.


2) Compare different features.   

3) Compare different algorithms.

4) Estimate Variance (and calculate error bars).

Source: https://www.quora.com/What-is-the-definition-of-development-set-in-machine-learning
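As a sketch of the first use, the example below compares hyperparameter values on a validation set and touches the test set only once at the end. The synthetic data, the `Ridge` model, and the candidate `alpha` values are all assumptions chosen for illustration.

```python
from sklearn.datasets import make_regression
from sklearn.linear_model import Ridge
from sklearn.model_selection import train_test_split

# Synthetic data (an assumption for illustration)
X, y = make_regression(n_samples=300, n_features=20, noise=10.0, random_state=0)

# 60% train / 20% validation / 20% test
X_rest, X_test, y_rest, y_test = train_test_split(X, y, test_size=0.2, random_state=0)
X_train, X_val, y_train, y_val = train_test_split(
    X_rest, y_rest, test_size=0.25, random_state=0
)

# Compare hypothetical candidate hyperparameters on the validation set only
candidate_alphas = [0.01, 0.1, 1.0, 10.0, 100.0]
val_scores = {}
for alpha in candidate_alphas:
    model = Ridge(alpha=alpha).fit(X_train, y_train)
    val_scores[alpha] = model.score(X_val, y_val)

best_alpha = max(val_scores, key=val_scores.get)

# Refit on train + validation, then evaluate once on the test set
final_model = Ridge(alpha=best_alpha).fit(X_rest, y_rest)
test_r2 = final_model.score(X_test, y_test)

print(f"Best alpha on validation set: {best_alpha}")
print(f"Test R^2 of the final model: {test_r2:.4f}")
```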

In [40]:
reset -fs

In [41]:
import numpy as np
import pandas as pd
# Note: load_boston was removed in scikit-learn 1.2, so this cell needs an older release
from sklearn.datasets import load_boston
from sklearn.linear_model import LinearRegression

In [42]:
# Load and define data
boston = load_boston()
X = pd.DataFrame(boston.data, columns=boston.feature_names)
y = boston.target

In [43]:
# Performance on training dataset only
lm = LinearRegression()
lm.fit(X, y)
print(f"{lm.score(X, y):.4f}")

0.7406


In [44]:
from sklearn.model_selection import train_test_split

# Performance on held-out test dataset
# (no random_state is set, so the split and the score vary between runs)
X_train, X_test, y_train, y_test = train_test_split(X, y)
lm = LinearRegression()
lm.fit(X_train, y_train)
print(f"{lm.score(X_test, y_test):.4f}")

0.7708


In [45]:
# Select hyperparameters with CV

from sklearn.linear_model import LassoCV
lm = LassoCV(cv=5)
lm.fit(X_train, y_train)

# The amount of penalization chosen by cross validation
lm.alpha_

0.6855183916848253

In [37]:
# Estimate the variance of your model

from sklearn.model_selection import cross_val_score

cross_val_score(estimator=lm, X=X_train, y=y_train, cv=10)

array([0.5962131 , 0.67954356, 0.6882651 , 0.69120943, 0.70929128,
       0.56752193, 0.68652901, 0.6231209 , 0.60529778, 0.68425526])

<center><h2>k-fold CV</h2></center>
<br>
<center><img src="images/validation_dataset.png" width="100%"/></center>

The training set is split into k smaller sets (folds). Each fold is used once as the validation set while the model is trained on the remaining k-1 folds.

In [38]:
reset -fs

In [39]:
# Simulate splitting a dataset of observations into 5 folds
from sklearn.model_selection import KFold
kf = KFold(n_splits=5, shuffle=True).split(range(20))

print(f"{'Iteration'} {'Training set observations':^48} {'Validate set observations'}")
for iteration, data in enumerate(kf, start=1):
    print(f"{iteration:^9} {data[0]} {data[1]}")

Iteration            Training set observations             Validate set observations
    1     [ 0  1  2  3  4  5  6  7  8 10 12 13 14 15 16 17] [ 9 11 18 19]
    2     [ 0  1  3  4  5  6  7  9 10 11 12 13 14 15 18 19] [ 2  8 16 17]
    3     [ 1  2  3  4  6  7  8  9 10 11 14 15 16 17 18 19] [ 0  5 12 13]
    4     [ 0  2  3  5  7  8  9 10 11 12 13 14 16 17 18 19] [ 1  4  6 15]
    5     [ 0  1  2  4  5  6  8  9 11 12 13 15 16 17 18 19] [ 3  7 10 14]


Source: https://scikit-learn.org/stable/modules/cross_validation.html
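The fold indices from `KFold` can drive a manual cross-validation loop. This is a minimal sketch of the idea, assuming synthetic data and a plain `LinearRegression`; in practice `cross_val_score` wraps this kind of loop.

```python
import numpy as np
from sklearn.datasets import make_regression
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import KFold

# Synthetic data (an assumption for illustration)
X, y = make_regression(n_samples=100, n_features=5, noise=10.0, random_state=0)

kf = KFold(n_splits=5, shuffle=True, random_state=0)

fold_scores = []
for train_idx, val_idx in kf.split(X):
    # Fit on k-1 folds, score on the one fold that was left out
    model = LinearRegression().fit(X[train_idx], y[train_idx])
    fold_scores.append(model.score(X[val_idx], y[val_idx]))

print(f"R^2 per fold: {np.round(fold_scores, 4)}")
print(f"Mean R^2: {np.mean(fold_scores):.4f}, std: {np.std(fold_scores):.4f}")
```

Every observation lands in a validation fold exactly once, so all of the data contributes to both training and evaluation.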

<center><h2>What should k be for k-fold cross validation (CV)?</h2></center>


Again, an empirical choice, based in part on how many variations of an algorithm you want to explore: each variation costs k model fits.

`k=10` tends to be the most popular.

Source: https://github.com/justmarkham/scikit-learn-videos/blob/master/07_cross_validation.ipynb
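A small sketch of the trade-off, assuming synthetic data and `LinearRegression` for illustration: larger k means more model fits and smaller validation folds.

```python
from sklearn.datasets import make_regression
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import cross_val_score

# Synthetic data (an assumption for illustration)
X, y = make_regression(n_samples=200, n_features=5, noise=10.0, random_state=0)

results = {}
for k in (3, 5, 10):
    # One fit and one validation score per fold
    scores = cross_val_score(LinearRegression(), X, y, cv=k)
    results[k] = scores
    print(
        f"k={k:>2}: {len(scores):>2} fits, "
        f"about {len(X) // k} validation rows per fold, "
        f"mean R^2={scores.mean():.3f}, std={scores.std():.3f}"
    )
```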

<center><h2>Leave-one-out Cross-Validation (LOOCV)</h2></center>

<center><img src="images/loocv.png" width="70%"/></center>

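<center>A special case of k-fold CV is when k=n, where n is the number of observations: each observation is held out once as a validation set of size one.</center>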

<center><h2>Leave-one-out Cross-Validation (LOOCV)</h2></center>

- LOOCV is computationally intensive because the model has to be fit n times.

- Performing LOOCV multiple times always yields the same results because there is no randomness in the training/validation set splits.

- Useful if you have a tiny dataset where you cannot afford a large validation set.

Source: An Introduction to Statistical Learning by Gareth James, Daniela Witten, Trevor Hastie and Rob Tibshirani
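The points above can be sketched with scikit-learn's `LeaveOneOut` splitter. The synthetic data and the `loocv_mse` helper are assumptions for illustration. Mean squared error is used for scoring because R^2 is not defined on a single observation.

```python
import numpy as np
from sklearn.datasets import make_regression
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import LeaveOneOut, cross_val_score

# Tiny synthetic dataset (an assumption for illustration)
X, y = make_regression(n_samples=30, n_features=3, noise=5.0, random_state=0)


def loocv_mse(X, y):
    """Return one squared-error score per held-out observation (hypothetical helper)."""
    scores = cross_val_score(
        LinearRegression(), X, y,
        cv=LeaveOneOut(),
        scoring="neg_mean_squared_error",  # scikit-learn negates errors so higher is better
    )
    return -scores


first_run = loocv_mse(X, y)
second_run = loocv_mse(X, y)

print(f"Number of model fits: {len(first_run)}")  # one per observation
print(f"Mean squared error: {first_run.mean():.4f}")
print(f"Identical across runs: {np.allclose(first_run, second_run)}")
```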

<center><h2>Protip - Stratified Sampling</h2></center>

<center><img src="images/XJZve.png" width="70%"/></center>

For classification problems, stratified sampling is recommended for creating the folds.

Imbalanced classes will be proportionally represented in each CV fold.

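scikit-learn's `cross_val_score` function does this by default when the estimator is a classifier and `cv` is an integer (or left as `None`); for other estimators it uses plain k-fold.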

[Source](https://stackoverflow.com/questions/45969390/difference-between-stratifiedkfold-and-stratifiedshufflesplit-in-sklearn)
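A minimal sketch of stratified folds using `StratifiedKFold`, assuming a toy target with a 90/10 class imbalance (made up for illustration). Each validation fold keeps the same class proportions as the full dataset.

```python
import numpy as np
from sklearn.model_selection import StratifiedKFold

# Toy imbalanced target: 90 negatives, 10 positives (an assumption for illustration)
y = np.array([0] * 90 + [1] * 10)
X = np.zeros((len(y), 1))  # placeholder features; only y determines the stratification

skf = StratifiedKFold(n_splits=5, shuffle=True, random_state=0)

minority_counts = []
for fold, (train_idx, val_idx) in enumerate(skf.split(X, y), start=1):
    n_minority = int(y[val_idx].sum())
    minority_counts.append(n_minority)
    print(f"Fold {fold}: {len(val_idx)} validation rows, {n_minority} from the minority class")
```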

<center><h2>Takeaways</h2></center>

- Always do train/test splits.
- Do k-fold cross validation (CV) whenever possible. 
- `scikit-learn` makes it easy to do the right thing.