In Machine Learning, the goal is to achieve models that *generalize* well to unseen data. *Overfitting* is the central obstacle.

Evaluating a model always boils down to splitting the available data into three sets:
- **training** (train the model on this)
- **validation** (evaluate the model on this)
- **test** (final check for the model)

We want three datasets instead of two because the *tuning* process itself can "overfit". We can end up choosing *hyperparameter* values that perform well on the validation set specifically, but not in general.

Central to this phenomenon is the notion of *information leaks*. Every time you tune a hyperparameter of your model based on the model's performance on the validation set, some information about the validation data "leaks" into the model.

The test set should stay untouched until the final evaluation: by the time it is used, it must not have influenced the model in any way.

There are several ways to split data into training, validation, and test sets...

In [2]:
import random
import numpy as np

data = random.choices(range(100000), k=50000)

### Simple Hold-Out Validation

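Set apart some fraction of your data as a held-out set. Train on the remaining data and evaluate on the held-out set. Because tuning against the test set would leak information, the held-out set used during tuning should be a validation set, with the test set kept separate for the final check.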

In [3]:
num_validation_samples = 10000

np.random.shuffle(data)

validation_data = data[:num_validation_samples]
data = data[num_validation_samples:]

training_data = data[:]

# get_model() is a placeholder for whatever builds a fresh, untrained model
model = get_model()
model.train(training_data)
validation_score = model.evaluate(validation_data)

# At this point you can tune your model,
# retrain it, evaluate it, tune it again...

model = get_model()
model.train(np.concatenate(
    [training_data, validation_data]
))

# test_data is assumed to be a separate set that was never used above
test_score = model.evaluate(test_data)

This is the simplest evaluation protocol and it suffers from one flaw: if little data is available, then your validation and test sets may contain too few samples to be statistically representative of the data at hand.
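
As a minimal sketch of the split itself, the helper below shuffles the samples once and carves out all three sets. The function name `hold_out_split`, its default fractions, and the fixed seed are assumptions made for illustration; they do not come from any library.

```python
import numpy as np

def hold_out_split(samples, val_fraction=0.2, test_fraction=0.2, seed=0):
    """Shuffle `samples` and split them into (train, validation, test) arrays."""
    samples = np.asarray(samples)
    rng = np.random.default_rng(seed)
    # Index with a random permutation so the caller's data is left unmodified
    shuffled = samples[rng.permutation(len(samples))]

    num_test = int(len(shuffled) * test_fraction)
    num_val = int(len(shuffled) * val_fraction)

    test = shuffled[:num_test]
    val = shuffled[num_test:num_test + num_val]
    train = shuffled[num_test + num_val:]
    return train, val, test

train, val, test = hold_out_split(np.arange(1000))
print(len(train), len(val), len(test))  # 600 200 200
```

The three slices never overlap, so the test set cannot influence tuning as long as it is only evaluated once at the end.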

### K-Fold Validation

Split the data into *K* partitions of equal size. For each partition `i`, train a model on the remaining *K* - 1 partitions and evaluate it on partition `i`. The final score is the average of the *K* scores obtained. This method is helpful when the model's performance varies significantly depending on the train-test split.

In [None]:
k = 4
num_validation_samples = len(data) // k

np.random.shuffle(data)

validation_scores = []
for fold in range(k):
    validation_data = data[num_validation_samples * fold: num_validation_samples * (fold + 1)]
    # `data` is a Python list here, so `+` concatenates the two slices
    training_data = data[:num_validation_samples * fold] + data[num_validation_samples * (fold + 1):]

    model = get_model()
    model.train(training_data)
    validation_score = model.evaluate(validation_data)
    validation_scores.append(validation_score)

validation_score = np.average(validation_scores)

model = get_model()
model.train(data)
test_score = model.evaluate(test_data)
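
The cell above relies on `get_model()` and `test_data` being defined elsewhere. For a version that runs on its own, here is a rough sketch that swaps in a toy model. `MeanModel` and `k_fold_scores` are hypothetical names introduced only for this example; the toy model simply predicts the training mean and reports mean absolute error (lower is better).

```python
import numpy as np

class MeanModel:
    """Toy stand-in for a real model: always predicts the training mean."""

    def train(self, samples):
        self.mean_ = float(np.mean(samples))

    def evaluate(self, samples):
        # Mean absolute error of the constant prediction
        return float(np.mean(np.abs(np.asarray(samples) - self.mean_)))

def k_fold_scores(samples, k=4, seed=0):
    """Return one validation score per fold."""
    samples = np.asarray(samples, dtype=float)
    rng = np.random.default_rng(seed)
    shuffled = samples[rng.permutation(len(samples))]
    # array_split tolerates lengths that are not an exact multiple of k
    folds = np.array_split(shuffled, k)

    scores = []
    for i in range(k):
        validation_data = folds[i]
        training_data = np.concatenate(folds[:i] + folds[i + 1:])
        model = MeanModel()  # a fresh model for every fold
        model.train(training_data)
        scores.append(model.evaluate(validation_data))
    return scores

scores = k_fold_scores(np.arange(100), k=4)
print(scores, np.mean(scores))
```

Each sample lands in exactly one validation fold, so every data point is used for validation once and for training *K* - 1 times.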

### Iterated K-Fold Validation

This technique is for situations in which you have relatively little data available and you need to evaluate your model as precisely as possible. It can be particularly helpful in Kaggle competitions.

It consists of applying *K*-fold validation multiple times, shuffling the data every time before splitting it *K* ways. The final score is the average of the scores obtained at each run of *K*-fold validation. Note that with *P* iterations you train and evaluate *P* x *K* models, which can be expensive.
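
One possible way to sketch this, assuming a scoring callback rather than a real model: `iterated_k_fold` and `mean_baseline_score` below are hypothetical names for illustration, and the scorer is a deliberately trivial baseline (mean absolute error of predicting the training mean).

```python
import numpy as np

def mean_baseline_score(training_data, validation_data):
    """Hypothetical scorer: MAE of predicting the training mean."""
    return float(np.mean(np.abs(validation_data - np.mean(training_data))))

def iterated_k_fold(samples, score_fn, k=4, iterations=3, seed=0):
    """Run K-fold `iterations` times, reshuffling before every run."""
    samples = np.asarray(samples, dtype=float)
    rng = np.random.default_rng(seed)

    run_scores = []
    for _ in range(iterations):
        # A new shuffle per run produces different fold boundaries each time
        shuffled = samples[rng.permutation(len(samples))]
        folds = np.array_split(shuffled, k)

        fold_scores = []
        for i in range(k):
            training_data = np.concatenate(folds[:i] + folds[i + 1:])
            fold_scores.append(score_fn(training_data, folds[i]))
        run_scores.append(float(np.mean(fold_scores)))

    # Final score: average over the per-run K-fold averages
    return float(np.mean(run_scores)), run_scores

final_score, run_scores = iterated_k_fold(
    np.arange(40), mean_baseline_score, k=4, iterations=5
)
print(final_score, run_scores)
```

The scorer is called `iterations * k` times in total, which is where the extra cost of this protocol comes from.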

### Things to keep in mind

- **Data representativeness**: You want both the training set and the test set to be representative of the data at hand. Randomly shuffling the data before splitting usually helps ensure this.
- **The arrow of time**: If the data includes a time component, don't randomly shuffle it before splitting. Doing so causes a *temporal leak*: the model is trained on future data it should not have access to. Make sure everything in the test set is more recent than the training data.
- **Redundancy in your data**: If some data points in your data appear twice or more, then shuffling the data and splitting it into training and validation sets will result in redundancy between the two sets. The model is then evaluated on data it was trained on, which inflates the measured performance. Make sure the training and validation sets are disjoint.
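
The last two points can be illustrated with a small sketch. Both helpers, `chronological_split` and `drop_duplicates`, are hypothetical names written for this example, and they assume one-dimensional data with one timestamp per sample.

```python
import numpy as np

def chronological_split(timestamps, values, val_fraction=0.2):
    """Train on the oldest samples and validate on the most recent ones."""
    timestamps = np.asarray(timestamps)
    values = np.asarray(values)
    # Sort by time instead of shuffling, so no future sample reaches training
    order = np.argsort(timestamps, kind="stable")
    cutoff = len(order) - int(len(order) * val_fraction)
    return values[order[:cutoff]], values[order[cutoff:]]

def drop_duplicates(samples):
    """Keep the first occurrence of each value, preserving the original order."""
    samples = np.asarray(samples)
    _, first_idx = np.unique(samples, return_index=True)
    return samples[np.sort(first_idx)]

train, val = chronological_split([3, 1, 2, 5, 4], ["c", "a", "b", "e", "d"], val_fraction=0.4)
print(train.tolist(), val.tolist())  # ['a', 'b', 'c'] ['d', 'e']

unique_samples = drop_duplicates([4, 1, 4, 2, 1])
print(unique_samples.tolist())  # [4, 1, 2]
```

Removing duplicates before splitting guarantees that no identical sample ends up on both sides of the split. Near-duplicates would need a domain-specific check that this sketch does not attempt.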