# How and Why to Create a Good Validation Set
---
![](meme.jpeg)

Machine learning models frequently look impressive in development but perform worse once they are deployed to production. How can this happen?

One of the most likely culprits for this disconnect between results in development and results in production is a **poorly chosen validation set (or even worse, no validation set at all)**. Depending on the nature of your data, choosing a validation set can be the most important step. Although sklearn offers a [train_test_split function](https://scikit-learn.org/stable/modules/generated/sklearn.model_selection.train_test_split.html), it takes a random subset of the data, **which is a poor choice for many real-world problems.**

## What is a validation set?
---
When creating a machine learning model, the ultimate goal for it is to be accurate on new data, not just the data you are using to build it. Consider the below example of 3 different models for a set of data:

![](https://www.fast.ai/images/overfitting2.png)

The error for the pictured data points is lowest for the model on the far right (the blue curve passes through the red points almost perfectly), yet it’s not the best choice. Why is that? If you were to gather some new data points, they most likely would not be on that curve in the graph on the right, but would be closer to the curve in the middle graph.
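
A minimal sketch of the same idea, using synthetic data rather than the points in the figure: fit polynomials of increasing degree with `numpy.polyfit`, then compare the error on the points used for fitting with the error on fresh points drawn from the same process. The quadratic ground truth, the noise level, and the chosen degrees are assumptions made only for illustration.

```python
import numpy as np

rng = np.random.default_rng(0)

def make_points(n):
    """Draw n noisy samples from an assumed quadratic relationship."""
    x = rng.uniform(-1, 1, size=n)
    y = 1.0 + 2.0 * x - 3.0 * x**2 + rng.normal(scale=0.3, size=n)
    return x, y

x_train, y_train = make_points(15)   # points used to fit each model
x_new, y_new = make_points(200)      # "new" points the models never saw

def mse(coeffs, x, y):
    """Mean squared error of a fitted polynomial on the given points."""
    return float(np.mean((np.polyval(coeffs, x) - y) ** 2))

train_mse, new_mse = {}, {}
for degree in (1, 2, 10):
    coeffs = np.polyfit(x_train, y_train, deg=degree)
    train_mse[degree] = mse(coeffs, x_train, y_train)
    new_mse[degree] = mse(coeffs, x_new, y_new)
    print(f"degree {degree:>2}: train MSE = {train_mse[degree]:.3f}, "
          f"new-data MSE = {new_mse[degree]:.3f}")
```

The training error can only go down as the degree grows, because each higher-degree polynomial contains the lower-degree ones as special cases. The error on the new points is the number that reflects how the model would behave in practice.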

**The underlying idea is that:**
- the training set is used to train a given model
- the validation set is used to choose between models (for instance, does a random forest or a neural net work better for your problem? do you want a random forest with 40 trees or 50 trees?)
- the test set tells you how you’ve done. If you’ve tried out a lot of different models, you may get one that does well on your validation set just by chance, and having a test set helps make sure that is not the case.
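
As a rough sketch, the three sets can be produced by calling `train_test_split` twice. The toy arrays and the 60/20/20 proportions below are assumptions for illustration, and a random split like this is only appropriate when the samples are independent of each other.

```python
import numpy as np
from sklearn.model_selection import train_test_split

X = np.arange(100).reshape(50, 2)   # 50 toy samples, 2 features
y = np.array([0, 1] * 25)           # balanced toy labels

# First hold out the test set (20% of all data)...
X_rest, X_test, y_rest, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42, stratify=y
)
# ...then carve the validation set out of the remainder (25% of 80% = 20% overall).
X_train, X_val, y_train, y_val = train_test_split(
    X_rest, y_rest, test_size=0.25, random_state=42, stratify=y_rest
)

print(len(X_train), len(X_val), len(X_test))  # 30 10 10
```

Models are fitted on `X_train`, compared on `X_val`, and the chosen model is scored once on `X_test` at the very end.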

> A key property of the validation and test sets is that they must be representative of the new data you will see in the future. This may sound like an impossible order! By definition, you haven’t seen this data yet. But there are still a few things you know about it.

## Data Splitting Techniques
---
In this notebook, I will focus on 3 commonly used data splitting approaches as follows:
1. Random Split
2. Leave-One-Last-Basket Split
3. Time-based Split  

Several other data splitting approaches are used in practice, and they are worth researching on your own.

### Random Split

![](https://builtin.com/sites/www.builtin.com/files/styles/ckeditor_optimize/public/inline-images/4_train-test-split.jpg)

Random Split involves dividing the dataset into training and testing sets randomly. This is suitable for datasets where the samples are independent and identically distributed (i.i.d.).

In [None]:
from sklearn.model_selection import train_test_split

import pandas as pd
import warnings

In [None]:
warnings.filterwarnings('ignore')

In [None]:
data = pd.DataFrame({
    'feature1': range(10),
    'feature2': range(10, 20),
    'label': ['A', 'B', 'A', 'B', 'A', 'B', 'A', 'B', 'A', 'B']
})

In [None]:
data.head()

| | feature1 | feature2 | label |
|---|---|---|---|
| 0 | 0 | 10 | A |
| 1 | 1 | 11 | B |
| 2 | 2 | 12 | A |
| 3 | 3 | 13 | B |
| 4 | 4 | 14 | A |


In [None]:
data['label'].value_counts(normalize=True)

Out[4]: A    0.5
B    0.5
Name: label, dtype: float64

In [None]:
X = data[['feature1', 'feature2']]
y = data['label']

In [None]:
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42, stratify=y)

In [None]:
print(f"### Training set: {len(X_train)} ###")
print(f"Label distribution: {y_train.value_counts(normalize=True)}\n")
print(f"### Test set: {len(X_test)} ###")
print(f"Label distribution: {y_test.value_counts(normalize=True)}\n")

### Training set: 8 ###
Label distribution: B    0.5
A    0.5
Name: label, dtype: float64

### Test set: 2 ###
Label distribution: A    0.5
B    0.5
Name: label, dtype: float64



### Leave-One-Last-Basket Split

![](https://dl.acm.org/cms/attachment/1eb39211-e41f-4afb-a1fd-eebd4b834b3b/recsys20-144-fig1.jpg)

Leave-One-Last-Basket Split is a leave-one-out style split for sequential data. It is used mainly for recommendation systems that work with basket or session-based data. The goal is to predict the items a user will interact with in their last basket (or session) based on their previous baskets (or sessions).

In the Leave-One-Last-Basket Split, you partition the dataset for each user, leaving their last basket (or session) as the test set and using all previous baskets (or sessions) for training. This approach allows you to evaluate the model's ability to predict the next items a user will interact with, based on their historical behavior.

In [None]:
data = pd.DataFrame({
    'user_id': [1, 1, 1, 2, 2, 3, 3, 3, 4, 4],
    'basket_id': [1, 2, 3, 1, 2, 1, 2, 3, 1, 2],
    'item_id': range(1, 11),
    'feature1': range(10),
    'feature2': range(10, 20),
    'label': ['A', 'B', 'A', 'B', 'A', 'B', 'A', 'B', 'A', 'B']
})

In [None]:
data.head()

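| | user_id | basket_id | item_id | feature1 | feature2 | label |
|---|---|---|---|---|---|---|
| 0 | 1 | 1 | 1 | 0 | 10 | A |
| 1 | 1 | 2 | 2 | 1 | 11 | B |
| 2 | 1 | 3 | 3 | 2 | 12 | A |
| 3 | 2 | 1 | 4 | 3 | 13 | B |
| 4 | 2 | 2 | 5 | 4 | 14 | A |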


In [None]:
# Each user's last basket is the one with the highest basket_id
last_basket_id = data.groupby('user_id')['basket_id'].transform('max')

In [None]:
# DataFrame.append was removed in pandas 2.0, so split with boolean masks instead
train_data = data[data['basket_id'] < last_basket_id]    # all earlier baskets
test_data = data[data['basket_id'] == last_basket_id]    # the last basket per user

In [None]:
print(f"""### Training set ###
Unique user: {train_data['user_id'].nunique()}
Basket per user: {train_data.groupby(['user_id'])['basket_id'].count()}
""")
print(f"""\n### Test set ###
Unique user: {test_data['user_id'].nunique()}
Basket per user: {test_data.groupby(['user_id'])['basket_id'].count()}
""")


### Training set ###
Unique user: 4
Basket per user: user_id
1    2
2    1
3    2
4    1
Name: basket_id, dtype: int64


### Test set ###
Unique user: 4
Basket per user: user_id
1    1
2    1
3    1
4    1
Name: basket_id, dtype: int64



### Time-Based Split

![](https://i.stack.imgur.com/Q37Bn.png)

Time-based split is essential for time series data, where the order of the data matters. The data is split based on a specific time, with the training set containing data before the split point and the test set containing data after the split point.

In [None]:
data = pd.DataFrame({
    'timestamp': pd.date_range(start='2021-01-01', periods=10, freq='D'),
    'feature1': range(10),
    'feature2': range(10, 20),
    'label': ['A', 'B', 'A', 'B', 'A', 'B', 'A', 'B', 'A', 'B']
})

In [None]:
data

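| | timestamp | feature1 | feature2 | label |
|---|---|---|---|---|
| 0 | 2021-01-01 | 0 | 10 | A |
| 1 | 2021-01-02 | 1 | 11 | B |
| 2 | 2021-01-03 | 2 | 12 | A |
| 3 | 2021-01-04 | 3 | 13 | B |
| 4 | 2021-01-05 | 4 | 14 | A |
| 5 | 2021-01-06 | 5 | 15 | B |
| 6 | 2021-01-07 | 6 | 16 | A |
| 7 | 2021-01-08 | 7 | 17 | B |
| 8 | 2021-01-09 | 8 | 18 | A |
| 9 | 2021-01-10 | 9 | 19 | B |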


In [None]:
time_threshold = '2021-01-05'

In [None]:
train_data = data[data['timestamp'] <= time_threshold]
test_data = data[data['timestamp'] > time_threshold]

In [None]:
train_data

| | timestamp | feature1 | feature2 | label |
|---|---|---|---|---|
| 0 | 2021-01-01 | 0 | 10 | A |
| 1 | 2021-01-02 | 1 | 11 | B |
| 2 | 2021-01-03 | 2 | 12 | A |
| 3 | 2021-01-04 | 3 | 13 | B |
| 4 | 2021-01-05 | 4 | 14 | A |


In [None]:
test_data

| | timestamp | feature1 | feature2 | label |
|---|---|---|---|---|
| 5 | 2021-01-06 | 5 | 15 | B |
| 6 | 2021-01-07 | 6 | 16 | A |
| 7 | 2021-01-08 | 7 | 17 | B |
| 8 | 2021-01-09 | 8 | 18 | A |
| 9 | 2021-01-10 | 9 | 19 | B |


In [None]:
print(f"""### Training set ###
Date range: {train_data['timestamp'].min()} - {train_data['timestamp'].max()}
""")
print(f"""\n### Test set ###
Date range: {test_data['timestamp'].min()} - {test_data['timestamp'].max()}
""")


### Training set ###
Date range: 2021-01-01 00:00:00 - 2021-01-05 00:00:00


### Test set ###
Date range: 2021-01-06 00:00:00 - 2021-01-10 00:00:00



## When is a Random Subset Not Good Enough?
---
It is instructive to look at a few examples. Although many of these examples come from Kaggle competitions, they are representative of problems you would see in the workplace.

### Time Series
If your data is a time series, choosing a random subset of the data will be both too easy (you can look at the data both before and after the dates you are trying to predict) and not representative of most business use cases (where you are using historical data to build a model for use in the future). 

> If your data includes the date and you are building a model to use in the future, you will want to choose a continuous section with the latest dates as your validation set (for instance, the last two weeks or last month of the available data).

Suppose you want to split the time series data below into training and validation sets:

![](https://www.fast.ai/images/timeseries1.png)

A random subset is a poor choice (too easy to fill in the gaps, and not indicative of what you’ll need in production):

![](https://www.fast.ai/images/timeseries2.png)

Use the earlier data as your training set and the later data for the validation set:

![](https://www.fast.ai/images/timeseries3.png)

Kaggle hosted a competition to [predict the sales in a chain of Ecuadorian grocery stores](https://www.kaggle.com/c/favorita-grocery-sales-forecasting). Kaggle’s “training data” runs from Jan 1 2013 to Aug 15 2017 and the test data spans Aug 16 2017 to Aug 31 2017. A good approach would be to use Aug 1 to Aug 15 2017 as your validation set, and all the earlier data as your training set.
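
A minimal sketch of holding out the most recent days as the validation set. The `sales` frame, its `units_sold` column, and the 15-day window are hypothetical stand-ins, not the actual competition data.

```python
import pandas as pd

# Toy daily series standing in for the real data (values are placeholders).
sales = pd.DataFrame({
    'date': pd.date_range(start='2017-06-01', end='2017-08-15', freq='D'),
})
sales['units_sold'] = range(len(sales))

validation_days = 15  # assumed length of the validation window
cutoff = sales['date'].max() - pd.Timedelta(days=validation_days)

train = sales[sales['date'] <= cutoff]   # everything up to the cutoff
valid = sales[sales['date'] > cutoff]    # the latest continuous block

print(f"train: {train['date'].min().date()} to {train['date'].max().date()}")
print(f"valid: {valid['date'].min().date()} to {valid['date'].max().date()}")
```

Deriving the cutoff from the latest date in the data, rather than hard-coding it, keeps the validation window at the end of the series when new data arrives.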

### New people, New Boats, New…
You also need to think about what ways the data you will be making predictions for in production may be qualitatively different from the data you have to train your model with.

For example, in the Kaggle [distracted driver competition](https://www.kaggle.com/c/state-farm-distracted-driver-detection), the independent variables are pictures of drivers in a car, and the dependent variable is a category such as texting, eating, or safely looking ahead.
> If you were the insurance company building a model from this data, note that you would be most interested in how the model performs on drivers you haven’t seen before (since you would likely have training data only for a small group of people). This is true of the Kaggle competition as well: the test data consists of people that weren’t used in the training set.

![](https://www.fast.ai/images/driver_phone.png)
![](https://www.fast.ai/images/driver_phone2.png)

If you put one of the above images in your training set and the other in the validation set, your model will seem to perform better than it would on new people. Seen another way, if every person appears in both the training and validation sets, your model may be overfitting to the characteristics of these specific people rather than learning the states (texting, eating, etc.), and the validation score will not reveal it. The better approach is to keep some people only in the validation set, so that the split between training and validation mirrors the split between training and test data, where the test set consists of people who aren't in the training set.

A similar dynamic was at work in the [Kaggle fisheries competition](https://www.kaggle.com/c/the-nature-conservancy-fisheries-monitoring) to identify the species of fish caught by fishing boats in order to reduce illegal fishing of endangered populations. 
> The test set consisted of boats that didn’t appear in the training data. This means that you’d want your validation set to include boats that are not in the training set.

![](train_validation_test_set.png)
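
One way to get such a split is sklearn's `GroupShuffleSplit`, which assigns whole groups (people, boats, and so on) to either side. The sketch below uses made-up driver ids and placeholder features. It illustrates the idea and is not the approach used by any particular competition entry.

```python
import numpy as np
from sklearn.model_selection import GroupShuffleSplit

# Hypothetical toy data: 12 images taken of 4 different drivers.
driver_ids = np.array(['d1', 'd1', 'd1', 'd2', 'd2', 'd2',
                       'd3', 'd3', 'd3', 'd4', 'd4', 'd4'])
X = np.arange(len(driver_ids)).reshape(-1, 1)   # stand-in for image features
y = np.array([0, 1, 2] * 4)                     # stand-in for behaviour classes

# Hold out 25% of the *drivers* (not 25% of the images) for validation.
splitter = GroupShuffleSplit(n_splits=1, test_size=0.25, random_state=0)
train_idx, val_idx = next(splitter.split(X, y, groups=driver_ids))

train_drivers = set(driver_ids[train_idx])
val_drivers = set(driver_ids[val_idx])
print("train drivers:", sorted(train_drivers))
print("validation drivers:", sorted(val_drivers))
```

No driver appears on both sides, so the validation score measures performance on people the model has never seen.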

Sometimes it may not be clear how your test data will differ. For instance, for a problem using satellite imagery, you’d need to gather more information on whether the training set just contained certain geographic locations, or if it came from geographically scattered data.

## The Dangers of Cross-Validation
---
The reason that sklearn doesn’t have a `train_validation_test` split is that it is assumed you will often be using **cross-validation**, in which different subsets of the training set serve as the validation set. For example, for a 3-fold cross validation, the data is divided into 3 sets: A, B, and C. A model is first trained on A and B combined as the training set, and evaluated on the validation set C. Next, a model is trained on A and C combined as the training set, and evaluated on validation set B. And so on, with the model performance from the 3 folds being averaged in the end.
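
The fold rotation described above can be sketched with sklearn's `KFold`. The six toy samples are an assumption for illustration, and model training and scoring are left out so that only the index bookkeeping is shown.

```python
import numpy as np
from sklearn.model_selection import KFold

X = np.arange(6).reshape(-1, 1)  # six toy samples

kfold = KFold(n_splits=3, shuffle=True, random_state=0)
val_counts = np.zeros(len(X), dtype=int)  # how often each sample is used for validation

for fold, (train_idx, val_idx) in enumerate(kfold.split(X), start=1):
    val_counts[val_idx] += 1
    print(f"fold {fold}: train on {train_idx.tolist()}, validate on {val_idx.tolist()}")
```

Every sample lands in the validation set exactly once. In a real workflow a helper such as `cross_val_score` runs this loop and collects the per-fold scores.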

However, the problem with cross-validation is that it is rarely applicable to real-world problems, for all the reasons described in the sections above. Cross-validation only works in the same cases where you can randomly shuffle your data to choose a validation set.

## Kaggle’s Training Set = Your Training + Validation Sets
---
One great thing about Kaggle competitions is that they force you to think about validation sets more rigorously (in order to do well). Kaggle typically breaks the data into two sets you can download:

1. a **training set**, which includes the independent variables, as well as the dependent variable (what you are trying to predict). For the example of an Ecuadorian grocery store trying to predict sales, the independent variables include the store id, item id, and date; the dependent variable is the number sold. For the example of trying to determine whether a driver is engaging in dangerous behaviors behind the wheel, the independent variable could be a picture of the driver, and the dependent variable is a category (such as texting, eating, or safely looking forward).
2. a **test set**, which just has the independent variables. You will make predictions for the test set, which you can submit to Kaggle and get back a score of how well you did.

This is the basic idea needed to get started with machine learning, but to do well, there is a bit more complexity to understand. You will want to create your own training and validation sets (by splitting the Kaggle “training” data). You will just use your smaller training set (a subset of Kaggle’s training data) for building your model, and you can evaluate it on your validation set (also a subset of Kaggle’s training data) before you submit to Kaggle.

The most important reason for this is that Kaggle has split the test data into two sets: for the public and private leaderboards. The score you see on the public leaderboard is just for a subset of your predictions (and you don’t know which subset!). How your predictions fare on the private leaderboard won’t be revealed until the end of the competition. The reason this is important is that you could end up overfitting to the public leaderboard and you wouldn’t realize it until the very end when you did poorly on the private leaderboard. Using a good validation set can prevent this. You can check whether your validation set is any good by seeing if your model scores similarly on it and on the Kaggle test set.
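
A small simulation of how this kind of overfitting arises. All numbers here are randomly generated assumptions, not real leaderboard data: every "submission" is a coin-flip guess, and the one that looks best on a small public subset is then scored on the much larger private subset.

```python
import numpy as np

rng = np.random.default_rng(0)

n_public, n_private, n_submissions = 100, 5_000, 200
labels = rng.integers(0, 2, size=n_public + n_private)   # hidden binary test labels

# Each "submission" is a pure coin-flip guess, so its true accuracy is about 50%.
guesses = rng.integers(0, 2, size=(n_submissions, labels.size))
correct = guesses == labels

public_acc = correct[:, :n_public].mean(axis=1)    # score on the public subset
private_acc = correct[:, n_public:].mean(axis=1)   # score on the private subset

best = int(np.argmax(public_acc))   # the submission picked from the public leaderboard
print(f"best public accuracy: {public_acc[best]:.3f}")
print(f"its private accuracy: {private_acc[best]:.3f}")
```

The more submissions are compared against the same small subset, the more the winning score reflects luck rather than skill. A validation set of your own gives an independent check on that.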

Another reason it’s important to create your own validation set is that Kaggle limits the number of submissions you can make per day, and you will likely want to experiment more than that. Thirdly, it can be instructive to see exactly what you’re getting wrong on the validation set, and Kaggle doesn’t tell you the right answers for the test set or even which data points you’re getting wrong, just your overall score.

Understanding these distinctions is not just useful for Kaggle. In any predictive machine learning project, you want your model to be able to perform well on new data.

## Resources
---
- [How (and why) to create a good validation set](https://www.fast.ai/posts/2017-11-13-validation-sets.html)
- [Exploring Data Splitting Strategies for the Evaluation of Recommendation Models](https://arxiv.org/pdf/2007.13237.pdf)
- [Sklearn: Cross-validation: evaluating estimator performance](https://scikit-learn.org/stable/modules/cross_validation.html#cross-validation)