# Entry 18 - Cross Validation

## The Problem

## The Options

### K-fold cross-validation

The data is broken into *k* pieces (where *k* is a user-specified number, usually five or ten). The model is trained on all but one of the *k* pieces, then tested on the piece that was held out. This is repeated *k* times, so that each of the *k* pieces is used as the test set exactly once.

The image below is an example of 5-fold cross-validation:

<img src="../img/k_fold_cv.png" width=500>

*Note: See The Proposed Solution for a visualization of split and fold.*

- **Split**: a single row in the image above. It includes all of the data that the algorithm is trained and tested on in that iteration.
- **Fold**: a segment of the data. A fold holds the same observations from one split to the next; what changes is whether it's allocated for training or testing.
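To make the split and fold terminology concrete, here is a minimal sketch using scikit-learn's `KFold` on ten made-up observations. The toy array and variable names are my own assumptions, not part of any real dataset.

```python
import numpy as np
from sklearn.model_selection import KFold

# Ten toy observations; the values are placeholders for real rows.
X = np.arange(10).reshape(-1, 1)

# No shuffling, so each fold is a contiguous block of rows.
kf = KFold(n_splits=5)

# Count how many times each observation lands in a test fold.
test_counts = np.zeros(len(X), dtype=int)
for split_number, (train_idx, test_idx) in enumerate(kf.split(X), start=1):
    test_counts[test_idx] += 1
    print(f"split {split_number}: train={train_idx.tolist()} test={test_idx.tolist()}")
```

Each printed line is one split, and the `test` indices in that line are the fold held out for that split.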


[An Introduction to Statistical Learning](https://faculty.marshall.usc.edu/gareth-james/ISL/ISLR%20Seventh%20Printing.pdf) says this about the choice of *k*: 'there is a bias-variance trade-off associated with the choice of k in k-fold cross-validation. Typically, given these considerations, one performs k-fold cross-validation using k = 5 or k = 10, as these values have been shown empirically to yield test error rate estimates that suffer neither from excessively high bias nor from very high variance.'

Once all *k* models have been trained and scored against their respective test sets, the scores are averaged to produce an overall score.
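As a rough illustration of the averaging step, scikit-learn's `cross_val_score` returns one score per split, which can then be averaged. The iris dataset and logistic regression model here are stand-ins I picked for the example.

```python
from sklearn.datasets import load_iris
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score

X, y = load_iris(return_X_y=True)
model = LogisticRegression(max_iter=1000)

# cv=5 asks for five splits; for a classifier, scikit-learn
# uses stratified folds by default when cv is an integer.
scores = cross_val_score(model, X, y, cv=5)

# The overall score is the mean of the per-split scores.
overall_score = scores.mean()
print(scores, overall_score)
```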

When using this method, it is important to separate the training and test sets before any preprocessing. If preprocessing is done first, there will be *data leakage*. Data leakage happens when information the model wouldn't have at prediction time is included in the training data. For example, in <font color='red'>Entry 8</font> I covered centering and scaling. Including the values from the test set alters the mean that is used to center the values, which gives the model information it wouldn't have had if preprocessing were done on the training set alone.
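One way to avoid this kind of leakage is to fit the scaler inside each split, using only the training rows. The sketch below assumes randomly generated features and uses `StandardScaler` for the centering and scaling step.

```python
import numpy as np
from sklearn.model_selection import KFold
from sklearn.preprocessing import StandardScaler

# Made-up features, purely for illustration.
rng = np.random.default_rng(0)
X = rng.normal(loc=50.0, scale=10.0, size=(100, 3))

kf = KFold(n_splits=5, shuffle=True, random_state=0)
train_means = []
for train_idx, test_idx in kf.split(X):
    scaler = StandardScaler()
    # The mean and standard deviation are learned from the training rows only.
    X_train = scaler.fit_transform(X[train_idx])
    # The test rows are transformed with the training statistics.
    X_test = scaler.transform(X[test_idx])
    train_means.append(X_train.mean(axis=0))
```

The scaled training rows are centered at zero in every split, while the test rows generally are not, because their values never influenced the mean.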

When using k-fold cross-validation for classification, it's best to stratify the classes so that the proportion of each class is the same in each fold as it is in the overall dataset. I'll cover stratification in more detail in the next entry.
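As a quick preview, here is a small sketch with scikit-learn's `StratifiedKFold`. The 80/20 class balance is an assumption I made up so the proportions are easy to check.

```python
import numpy as np
from sklearn.model_selection import StratifiedKFold

# Imbalanced toy labels: 80 observations of class 0, 20 of class 1.
y = np.array([0] * 80 + [1] * 20)
X = np.zeros((len(y), 1))  # placeholder features; only y matters here

skf = StratifiedKFold(n_splits=5)
minority_share = []
for train_idx, test_idx in skf.split(X, y):
    # Fraction of class 1 in this split's test fold.
    minority_share.append(y[test_idx].mean())

print(minority_share)
```

Every test fold keeps the same 20% share of class 1 as the full dataset.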

### Leave-one-out cross-validation

Leave-one-out is very similar to k-fold. It works on the same principle, but holds out only a single observation as the test set (in effect, k-fold with *k* equal to the number of observations). As a result, it fits as many models as there are observations.
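A minimal sketch with scikit-learn's `LeaveOneOut` shows the one-model-per-observation relationship. The six-row toy array is an assumption for illustration.

```python
import numpy as np
from sklearn.model_selection import LeaveOneOut

# Six toy observations.
X = np.arange(6).reshape(-1, 1)

loo = LeaveOneOut()
# One split (and so one model) per observation.
n_models = loo.get_n_splits(X)
# Every test set holds exactly one observation.
test_sizes = [len(test_idx) for _, test_idx in loo.split(X)]
print(n_models, test_sizes)
```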

This method gets computationally expensive quickly. For datasets as small as the toy datasets I've been using as examples, it should be fine. However, at work the training data regularly includes hundreds of thousands of observations, if not over a million (depending on the time range we train on), and every one of those observations would mean another model to fit.

Leave-one-out can easily be adapted to leave-k-out, where k is a user-defined number of observations to include in the test set (not the number of folds, as in k-fold). However, for the large datasets I'm practicing for, I'm leaning toward k-fold cross-validation.
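In scikit-learn, the closest equivalent I know of is `LeavePOut`, which names the parameter `p` instead of k. It tests every possible combination of `p` observations, so the number of models grows much faster than with leave-one-out. A small sketch, assuming ten toy observations:

```python
from math import comb

import numpy as np
from sklearn.model_selection import LeavePOut

# Ten toy observations.
X = np.arange(10).reshape(-1, 1)

# Hold out every possible pair of observations as the test set.
lpo = LeavePOut(p=2)
n_models = lpo.get_n_splits(X)

# The count is "n choose p", the number of distinct pairs.
print(n_models, comb(len(X), 2))
```

Even at ten observations this is already several times more models than leave-one-out would need, which is another reason to prefer k-fold for large datasets.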

### Shuffle-split cross-validation



### Repeated cross-validation



## The Proposed Solution


Steps to complete:
- load data
- split data
- segment into k-folds
- preprocess
- train
- return mean score
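The steps above could be sketched roughly as follows. This is only one possible reading of the list: the breast cancer dataset, the logistic regression model, and the interpretation of "split data" as holding back a final test set are all my assumptions.

```python
from sklearn.datasets import load_breast_cancer
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import KFold, cross_val_score, train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# load data
X, y = load_breast_cancer(return_X_y=True)

# split data: hold back a final test set that cross-validation never sees
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=0
)

# segment into k-folds
cv = KFold(n_splits=5, shuffle=True, random_state=0)

# preprocess and train: the pipeline refits the scaler on the
# training folds of each split, so no test-fold values leak in
model = make_pipeline(StandardScaler(), LogisticRegression(max_iter=1000))
scores = cross_val_score(model, X_train, y_train, cv=cv)

# return mean score
mean_score = scores.mean()
print(mean_score)
```

Wrapping the scaler and model in a pipeline keeps the preprocessing inside each split, which is what prevents the leakage described earlier.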

## The Fail



## Up Next

### Resources

- [Introduction to Machine Learning with Python](https://www.amazon.com/Introduction-Machine-Learning-Python-Scientists/dp/1449369413/ref=sr_1_15?keywords=scikit+learn&qid=1583195970&s=books&sr=1-15)
- [Train/Test Split and Cross Validation in Python](https://towardsdatascience.com/train-test-split-and-cross-validation-in-python-80b61beca4b6)
- [A Gentle Introduction to k-fold Cross-Validation](https://machinelearningmastery.com/k-fold-cross-validation/)
- [3.1. Cross-validation: evaluating estimator performance](https://scikit-learn.org/stable/modules/cross_validation.html)
- [Why every statistician should know about cross-validation](https://robjhyndman.com/hyndsight/crossvalidation/)
- [Cross Validation Gone Wrong](https://betatim.github.io/posts/cross-validation-gone-wrong/)