# Entry 16 - Cross Validation

## The Problem

Once a model is trained, I need a way to see how well it performs. However, evaluating the model on the same data it was trained on gives an overly optimistic view of its performance. There is always the chance that the model just memorized the correct answer for each individual observation.

For example, say I create a model to predict the culprit in a game of Clue. If my training data has only one observation with Col. Mustard (say he did it with the revolver in the library), then the model will assume that Col. Mustard is the culprit every time the revolver and the library are seen together. It just memorized the combination.

The Titanic dataset is a good example too. If the names of the passengers are included and the model has been **overfit**, it could just memorize that 'Heikkinen, Miss. Laina' survived. The model will score 100% accuracy on the data it trained on, but when presented with a passenger it has never seen before, I wouldn't expect the results to be better than random guessing.
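To make the memorization problem concrete, here is a minimal sketch of a deliberately overfit "model" that is nothing more than a lookup table keyed on passenger name. Everything except the one name quoted above is made up for illustration: the `Passenger A` style names, the survival labels, and the `MemorizingModel` class are all hypothetical.

```python
class MemorizingModel:
    """A deliberately overfit 'model': a lookup table from name to label."""

    def fit(self, names, labels):
        self.table = dict(zip(names, labels))
        # For names never seen in training, fall back to the most common label.
        self.default = max(set(labels), key=labels.count)
        return self

    def predict(self, names):
        return [self.table.get(name, self.default) for name in names]


def accuracy(y_true, y_pred):
    return sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)


# Toy data: only the first name comes from the text; the rest is invented.
train_names = ["Heikkinen, Miss. Laina", "Passenger A", "Passenger B",
               "Passenger C", "Passenger D"]
train_labels = [1, 0, 0, 1, 0]

new_names = ["Passenger W", "Passenger X", "Passenger Y", "Passenger Z"]
new_labels = [1, 0, 1, 1]

model = MemorizingModel().fit(train_names, train_labels)
train_acc = accuracy(train_labels, model.predict(train_names))
new_acc = accuracy(new_labels, model.predict(new_names))

print(train_acc)  # 1.0, every training name is in the lookup table
print(new_acc)    # 0.25, unseen names all get the fallback label
```

The perfect training score says nothing about the unseen passengers, which is exactly why the score has to come from data the model hasn't seen.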

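- **Overfitting**: the model fits the training data too closely, including its noise, so it scores well on the training data but poorly on new data.
- **Underfitting**: the model is too simple to capture the underlying pattern, so it scores poorly even on the training data.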

The goal is to gauge how well the model performs on data it's never seen before.

## The Options

### Validation/hold-out set

The first obvious solution is to remove a portion of the data to test on once the model has been trained: a validation or hold-out dataset. This is a very common practice; the two portions of data are called the *training data* and the *test data*. Separating the train and test sets first leaves the test set as all-new data to the model.
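As a rough sketch of what a hold-out split does, the plain-Python function below shuffles the rows and slices off a fraction for testing. The name `hold_out_split`, the 20% default, and the seed are my own assumptions for illustration, not anything from a particular library.

```python
import random


def hold_out_split(rows, test_fraction=0.2, seed=0):
    """Shuffle a copy of rows, then split off the last test_fraction as the test set."""
    rng = random.Random(seed)
    shuffled = list(rows)
    rng.shuffle(shuffled)
    n_test = max(1, round(len(shuffled) * test_fraction))
    return shuffled[:-n_test], shuffled[-n_test:]


rows = list(range(10))  # stand-in for ten observations
train, test = hold_out_split(rows, test_fraction=0.2, seed=42)

print(len(train), len(test))  # 8 2
```

In practice I'd likely reach for scikit-learn's `train_test_split` from `sklearn.model_selection`, which covers the same idea with more options.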

The drawbacks of calling it quits at this point are that:
- the score is highly dependent on which data is placed into the training dataset and which is placed into the test dataset
- it limits the amount of data available for training (most models prefer more data to less)

### k-fold Cross-validation

The solution to this is called *k-fold cross-validation*. The data is broken into *k* pieces; for the sake of example, I'll say five. The model is trained on four of the pieces and tested on the fifth. This is done five (*k*) times, until each piece has been used as the test set.

<img src="../img/k_fold_cv.png" width=500>

Once all five models have been run and tested against their specific test set, the performance of all models is averaged to produce an overall score.
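The fold mechanics can be sketched in a few lines of plain Python. This is only an illustration under my own assumptions: `k_fold_indices` is a hypothetical helper, the targets in `y` are made-up numbers, the "model" just predicts the mean of its training targets, and the score is mean absolute error. Real data would normally be shuffled before being cut into folds.

```python
from statistics import mean


def k_fold_indices(n_samples, k=5):
    """Yield (train_idx, test_idx) pairs; each sample is in exactly one test fold."""
    indices = list(range(n_samples))
    # Spread any remainder over the first folds so sizes differ by at most one.
    fold_sizes = [n_samples // k + (1 if i < n_samples % k else 0) for i in range(k)]
    start = 0
    for size in fold_sizes:
        stop = start + size
        yield indices[:start] + indices[stop:], indices[start:stop]
        start = stop


# Made-up targets for illustration only.
y = [3.0, 5.0, 4.0, 6.0, 8.0, 7.0, 9.0, 11.0, 10.0, 12.0]

fold_scores = []
for train_idx, test_idx in k_fold_indices(len(y), k=5):
    prediction = mean(y[i] for i in train_idx)                 # "train" on four pieces
    mae = mean(abs(y[i] - prediction) for i in test_idx)       # test on the fifth
    fold_scores.append(mae)

overall_score = mean(fold_scores)  # average the five fold scores
print(len(fold_scores), overall_score)
```

scikit-learn offers `KFold` and `cross_val_score` in `sklearn.model_selection` for the same job.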

When using this method, it is important to separate the training and test sets before any preprocessing. If preprocessing is done first, there will be *data leakage*. Data leakage happens when information the model wouldn't have at the time of prediction has been included in the training data. For example, in <font color='red'>Entry 8</font> I covered centering and scaling. Including the values from the test set would alter the mean, which is used to center the values. This gives the model information it wouldn't have had if preprocessing had been done on only the training set.
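Here is a small sketch of how centering leaks information. The numbers are invented, and `fit_center` and `center` are hypothetical helper names; the point is only that the mean should be learned from the training rows and then reused, unchanged, on the test rows.

```python
from statistics import mean


def fit_center(values):
    """Learn the centering statistic (the mean) from the given values."""
    return mean(values)


def center(values, mu):
    """Apply a previously learned mean to any set of values."""
    return [v - mu for v in values]


# Made-up feature values: four training rows and one unusual test row.
train = [1.0, 2.0, 3.0, 4.0]
test = [100.0]

leaky_mu = fit_center(train + test)  # test value influences the statistic
clean_mu = fit_center(train)         # statistic learned from training rows only

print(leaky_mu)                 # 22.0
print(clean_mu)                 # 2.5
print(center(train, leaky_mu))  # [-21.0, -20.0, -19.0, -18.0]
print(center(train, clean_mu))  # [-1.5, -0.5, 0.5, 1.5]
print(center(test, clean_mu))   # [97.5]
```

With the leaky mean, the training features already "know" that a very large value exists in the test set. Under k-fold cross-validation the same rule applies inside every fold: fit the preprocessing on that fold's training pieces only.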

### Hybrid

DataRobot [uses a hybrid](https://www.datarobot.com/platform/automated-machine-learning/) of these methods. They separate out a test set, automatically run validation, then allow cross-validation on a model-by-model basis.

## The Proposed Solution

I'm most interested in a hybrid method where the test set is separated and left untouched, then k-fold cross-validation is run on the training set. This allows me to tinker with hyperparameters and examine the cross-validated scores while still having a hold-out test set to compare against once all the fine-tuning has been completed.

Steps to complete:
- load data
- split data
- segment into k-folds
- preprocess
- train
- return mean score
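The steps above can be sketched end to end in plain Python. This is a rough outline under stated assumptions, not a finished implementation: `load_data` generates synthetic data as a stand-in for a real loader, the model is a simple nearest-centroid classifier I picked for brevity, and every function name here is hypothetical.

```python
import random
from statistics import mean, pstdev


def load_data(n_per_class=30, seed=0):
    """Hypothetical loader: one synthetic feature, two classes."""
    rng = random.Random(seed)
    X = [rng.gauss(0.0, 1.0) for _ in range(n_per_class)]
    X += [rng.gauss(4.0, 1.0) for _ in range(n_per_class)]
    y = [0] * n_per_class + [1] * n_per_class
    return X, y


def fit_scaler(values):
    """Learn centering and scaling statistics from the given rows only."""
    return mean(values), (pstdev(values) or 1.0)


def scale(values, scaler):
    mu, sigma = scaler
    return [(v - mu) / sigma for v in values]


def fit_centroids(X, y):
    """Nearest-centroid classifier: store the mean feature value per class."""
    return {label: mean(x for x, t in zip(X, y) if t == label) for label in set(y)}


def predict(centroids, X):
    return [min(centroids, key=lambda label: abs(x - centroids[label])) for x in X]


def fit_and_score(X, y, train_idx, eval_idx):
    """Preprocess and train on train_idx, then return accuracy on eval_idx."""
    X_train, y_train = [X[i] for i in train_idx], [y[i] for i in train_idx]
    X_eval, y_eval = [X[i] for i in eval_idx], [y[i] for i in eval_idx]
    scaler = fit_scaler(X_train)  # preprocessing is fit on training rows only
    model = fit_centroids(scale(X_train, scaler), y_train)
    preds = predict(model, scale(X_eval, scaler))
    return sum(p == t for p, t in zip(preds, y_eval)) / len(y_eval)


def cross_validate(X, y, dev_idx, k=5):
    """Segment dev_idx into k folds and return (mean score, per-fold scores)."""
    folds = [dev_idx[i::k] for i in range(k)]
    scores = []
    for i, val_idx in enumerate(folds):
        train_idx = [j for f, fold in enumerate(folds) if f != i for j in fold]
        scores.append(fit_and_score(X, y, train_idx, val_idx))
    return mean(scores), scores


# Load data, then split off a test set that stays untouched during tuning.
X, y = load_data()
idx = list(range(len(X)))
random.Random(1).shuffle(idx)
n_test = len(idx) // 5
test_idx, dev_idx = idx[:n_test], idx[n_test:]

# Cross-validate on the remaining data, then check once against the test set.
cv_mean, cv_scores = cross_validate(X, y, dev_idx, k=5)
test_score = fit_and_score(X, y, dev_idx, test_idx)
print(cv_mean, test_score)
```

The cross-validated mean is what I would watch while tinkering with hyperparameters; the test score is only computed at the very end.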

## The Fail

## Up Next

### Resources