# Data Splits

We want our models to generalize well to unseen data. To that end we

* Split our data into **training** and **testing** sets.  
    * We never do anything with the testing set until the **very end of our work** as a final sanity check.
* During model selection we further split our training set using either
    * A single **validation** set or
    * A k-fold **cross-validation** split

## What we will accomplish

In this notebook we will:
- Discuss the rationale for splitting our data set
- Introduce train test splits, validation sets, and cross-validation

In [None]:
## We will now start importing a common set
## of items at the onset of most notebooks
import numpy as np
import matplotlib.pyplot as plt
from seaborn import set_style
set_style("whitegrid")

## Data splits guard against over-fitting

Over-fitting is when a model fits too closely to the data it was trained on and does not generalize to new data as well as it otherwise could.

We will give a more formal presentation in lecture 5 (the "Bias/Variance Tradeoff" notebook) but we need at least an informal understanding immediately.

<img src="lecture_3_assets/overfit.png"></img>

The 2nd model is over-fitting:  we can see that it would not generalize well to new data which follows the same distribution as our training data.

It was easy to see that we are over-fitting here because the relationship is relatively simple and the data is low-dimensional enough that we can visualize it.  With real data we might have hundreds of features, and simple visual checks would not be sufficient.
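
The same effect can be measured numerically rather than visually. The following is a minimal sketch on made-up data (the quadratic signal, the noise level, the sample sizes, and the polynomial degrees are all assumptions chosen for illustration, not taken from the figure above). It fits a low-degree and a high-degree polynomial to a small sample, then compares the error on that sample with the error on fresh data from the same distribution.

```python
import numpy as np

rng = np.random.default_rng(0)

def make_data(n):
    """Draw n points from a hypothetical noisy quadratic relationship."""
    x = rng.uniform(-1, 1, n)
    y = x**2 + rng.normal(0, 0.1, n)
    return x, y

def mse(coefs, x, y):
    """Mean squared error of the polynomial with coefficients coefs."""
    return np.mean((np.polyval(coefs, x) - y) ** 2)

# a small sample to fit on, and a large fresh sample from the same distribution
x_fit, y_fit = make_data(15)
x_new, y_new = make_data(1000)

results = {}
for degree in (2, 12):
    coefs = np.polyfit(x_fit, y_fit, degree)
    results[degree] = (mse(coefs, x_fit, y_fit), mse(coefs, x_new, y_new))
    print(f"degree {degree}: MSE on fitted data = {results[degree][0]:.4f}, "
          f"MSE on new data = {results[degree][1]:.4f}")
```

The higher-degree fit always does at least as well on the data it was fit to, so that number alone cannot reveal over-fitting. Only the error on data the model has not seen can.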

One of the best ways to guard against over-fitting is to use a **data split**.

## Train test splits

The first split we will touch on is the first split you would do in a new data science project: the **train test split**.

The purpose of the train test split is to create two data sets:
1. <b>The training set</b> - This subset is used to fit models and compare model candidates. This data set is usually split further.
2. <b>The testing set</b> - This subset is used as a final check on your selected model prior to putting your model into its desired final state.

The training set usually contains the majority of the original data. Common train test split percentage divisions are $80\% - 20\%$ or $75\% - 25\%$, but it may sometimes be appropriate to use different split sizes. Train test splits are done randomly, with the form of randomness dependent upon your project.

Here is an illustration of a train test split:

<img src="lecture_3_assets/train_test.png" width="40%"></img>
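
Conceptually, a random $80\% - 20\%$ split amounts to shuffling the row indices and cutting the shuffled list in two. Here is a minimal sketch using only `numpy`; the array names, the seed, and the data are hypothetical and exist only to show the mechanics.

```python
import numpy as np

rng = np.random.default_rng(440)  # arbitrary seed, for reproducibility only

# hypothetical data: 1000 observations of 10 features and one target
n_samples = 1000
features = rng.random((n_samples, 10))
targets = rng.standard_normal(n_samples)

# shuffle the row indices, then cut off the first 20% as the test set
shuffled = rng.permutation(n_samples)
n_test = round(0.2 * n_samples)
test_idx, train_idx = shuffled[:n_test], shuffled[n_test:]

features_train, features_test = features[train_idx], features[test_idx]
targets_train, targets_test = targets[train_idx], targets[test_idx]

print("Training rows:", features_train.shape[0])
print("Testing rows:", features_test.shape[0])
```

Every row lands in exactly one of the two sets, which is the property that makes the test set a fair stand-in for unseen data.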

<b>IMPORTANT:  The test set is not directly used to compare models</b>

Model comparison is typically done using further splits of the training set. 

It is embarrassing and costly to ship a product which doesn't perform well on novel data.  The test set serves as a **final sanity check** on your work before sending it out into the world.

### Performing train test splits in `sklearn`

The `sklearn` package has a useful `train_test_split` function that will perform the train test split. Here is a link to the documentation:

 <a href="https://scikit-learn.org/stable/modules/generated/sklearn.model_selection.train_test_split.html">https://scikit-learn.org/stable/modules/generated/sklearn.model_selection.train_test_split.html</a>

In [None]:
## First we will make a data set
X = np.random.random((1000,10))
y = np.random.randn(1000)

In [None]:
## Now we import train_test_split
from sklearn.model_selection import train_test_split

In [None]:
## Here we make the split
## train_test_split returns 4 outputs: X_train, X_test, y_train and y_test
##
## First you input the X and y for your data
##
## then set the shuffle argument to True; this randomly shuffles the
## data before it is split
##
## The random_state ensures that the random split is the same each time
## someone runs the code chunk; it can be any non-negative integer
##
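## You can specify the size of the test set with test_size,
## here I want 20% of the data
## (the random_state value used below is an arbitrary choice)
X_train, X_test, y_train, y_test = train_test_split(X, y,
                                                    shuffle=True,
                                                    random_state=440,
                                                    test_size=0.2)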

In [None]:
## check the data lengths to see that they match
## what we'd expect
print("The shape of X_train is",X_train.shape)
print("The shape of X_test is",X_test.shape)
print("The length of y_train is",len(y_train))
print("The length of y_test is",len(y_test))
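
To see what `random_state` buys us, the sketch below makes the same split twice with the same seed and checks that the results agree. The tiny arrays and the seed value `216` are made up for the demonstration.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# a tiny made-up data set so the split is easy to inspect
demo_X = np.arange(20).reshape(10, 2)
demo_y = np.arange(10)

# the same call twice, with the same (arbitrary) random_state
split_a = train_test_split(demo_X, demo_y, shuffle=True,
                           random_state=216, test_size=0.2)
split_b = train_test_split(demo_X, demo_y, shuffle=True,
                           random_state=216, test_size=0.2)

# compare X_train, X_test, y_train and y_test pairwise
same = all(np.array_equal(a, b) for a, b in zip(split_a, split_b))
print("Identical splits:", same)
```

Fixing the seed means a collaborator who reruns the notebook gets the same training and testing rows that you did.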

## Two split types for model comparison and selection

We will now cover two data splits you can make from the training set for model comparison purposes. Which you choose depends upon the project you are working on, but we will give some reasons to choose one over the other below.

### Validation sets

A <i>validation set</i> is a subset of the training data (the result of the train test split defined above) used solely for the purpose of comparing candidate models. This split is typically also performed randomly. Further, the validation set should be a small subset; common sizes range from $10\%$ to $25\%$ of the training set, depending on the training set size. An illustration of this concept is given below:

<img src="lecture_3_assets/validation_set.png" width="45%"></img>

The best model in this setting would be the one that has the best performance metric on the validation set.
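
As a rough sketch of that selection rule, the code below compares three candidate models by their mean squared error on a validation set and keeps the one with the lowest error. Everything specific here is an assumption made for illustration: the synthetic data, the choice of polynomial fits as the candidates, and mean squared error as the performance metric.

```python
import numpy as np

rng = np.random.default_rng(1)

# made-up one-dimensional data, generated in random order
x = rng.uniform(-1, 1, 200)
y = np.sin(3 * x) + rng.normal(0, 0.2, 200)

# hold out the last 15% of the rows as the validation set
n_valid = 30
x_fit, y_fit = x[:-n_valid], y[:-n_valid]
x_valid, y_valid = x[-n_valid:], y[-n_valid:]

# candidate models: polynomial fits of different degrees
val_mse = {}
for degree in (1, 3, 5):
    coefs = np.polyfit(x_fit, y_fit, degree)
    val_mse[degree] = np.mean((np.polyval(coefs, x_valid) - y_valid) ** 2)
    print(f"degree {degree}: validation MSE = {val_mse[degree]:.4f}")

# the selected model is the one with the smallest validation error
best_degree = min(val_mse, key=val_mse.get)
print("Selected degree:", best_degree)
```

Note that each candidate is fit only on the rows that are not in the validation set.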

#### In practice

In practice we can once again use `sklearn`'s `train_test_split` function to make the validation split. Note that it is good practice to not overwrite the original `X_train` or `y_train` sets when making the validation split.

In [None]:
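## Here we make a validation set with 15% of the
## training data in the validation set
## (new names are used so X_train and y_train are not overwritten;
##  the random_state value is an arbitrary choice)
X_train_train, X_val, y_train_train, y_val = train_test_split(X_train, y_train,
                                                              shuffle=True,
                                                              random_state=440,
                                                              test_size=0.15)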

In [None]:
print(len(X_train))
print("15% of",800,"is",.15*800)

In [None]:
print("Shape of X_train_train", X_train_train.shape)
print("Shape of X_val", X_val.shape)
print("Length of y_train_train", len(y_train_train))
print("Length of y_val", len(y_val))

### $k$-Fold cross-validation

The validation set approach gives us only one check on how well our model generalizes, and we might get unusually lucky or unlucky with that one check.  $k$-fold cross-validation gives us $k$ opportunities to see how well our model will generalize instead of just one.

<img src="lecture_3_assets/cv1.png" width="60%"></img>
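
The mechanics can be sketched by hand: shuffle the indices, cut them into $k$ roughly equal folds, and let each fold take one turn as the holdout set while the remaining folds are used for training. The sizes and the seed below are made up, and this is only an illustration of the idea, not how `sklearn` implements it.

```python
import numpy as np

rng = np.random.default_rng(2)  # arbitrary seed

n_samples, k = 10, 5

# shuffle the indices and cut them into k folds
folds = np.array_split(rng.permutation(n_samples), k)

for i, holdout_idx in enumerate(folds):
    # train on every fold except the i-th one
    train_idx = np.concatenate([f for j, f in enumerate(folds) if j != i])
    print(f"Split {i}: train on {np.sort(train_idx)}, "
          f"hold out {np.sort(holdout_idx)}")
```

Each observation is held out exactly once across the $k$ splits.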

Common values for $k$ are between $5$ and $10$.  "Leave-one-out" cross-validation is another strategy, which is equivalent to taking $k = n$ where $n$ is the number of samples in your training data: each split holds out a single observation and trains on the other $n-1$.

You can implement cross-validation using `sklearn`'s `KFold` object. Documentation for this class can be found here:

<a href="https://scikit-learn.org/stable/modules/generated/sklearn.model_selection.KFold.html">https://scikit-learn.org/stable/modules/generated/sklearn.model_selection.KFold.html</a>

In [None]:
## import KFold
from sklearn.model_selection import KFold

In [None]:
## make a KFold object
## n_splits controls the value of k (5 here is an illustrative choice)
## shuffle=True randomly shuffles the data prior to splitting
## random_state works the same way as for train_test_split
kfold = KFold(n_splits=5, shuffle=True, random_state=440)

In [None]:
## demonstrate .split
kfold.split(X_train, y_train)

Side note on generators:  Notice that `kfold.split` returns a generator object.  If you are not familiar with generators, you can think of one as being similar to a list, except that instead of storing all of its elements in memory it stores only the current element and a rule for getting the next one.

`KFold` is implemented this way to avoid memory issues when you use a large number of splits.  For instance, if a leave-one-out split were implemented as a list on a data set of size $10000$, the list would hold $10000 \times 9999$ training indices.  If you use a generator instead, then at each stage you only need to keep a list of size $10000$ in memory and remember which element to leave out next.
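
If generators are new to you, here is a small hand-written one that produces leave-one-out style splits lazily. The function name `leave_one_out_indices` is hypothetical; it only illustrates the idea and is not how `sklearn` implements its splitters.

```python
def leave_one_out_indices(n):
    """Yield (train indices, holdout indices) pairs one split at a time."""
    for held_out in range(n):
        train = [i for i in range(n) if i != held_out]
        yield train, [held_out]

# calling the function does no work yet, it just builds the generator
splits = leave_one_out_indices(4)
print(type(splits))

# each split is only computed when we ask for it
first = next(splits)
print(first)
```

A `for` loop over `splits` would continue from the second split, because the generator remembers where it left off.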

In [None]:
## use for loop to demonstrate .split
for train_index, test_index in kfold.split(X_train, y_train):
    print("Train index:", train_index)
    print("Test index:", test_index)
    print()
    print()

In [None]:
## When fitting a model we'd do something like the following
for train_index, test_index in kfold.split(X_train, y_train):
    ## get the kfold training data
    X_train_train = X_train[train_index,:]
    y_train_train = y_train[train_index]
    
    ## get the holdout data
    X_holdout = X_train[test_index,:]
    y_holdout = y_train[test_index]
    
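    ## Then you'd fit your model
    ## (model is a placeholder for whatever model you are using)
    # model.fit(X_train_train, y_train_train)
    
    ## Then you'd record the error on the holdout set here
    ## (error is a placeholder for your performance metric)
    # error(y_holdout, model.predict(X_holdout))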
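
For a version of the loop above that runs from start to finish, here is a sketch with concrete stand-ins for the placeholders. The synthetic data, the use of `LinearRegression` as the model, `mean_squared_error` as the metric, and the seed values are all assumptions made for illustration; substitute your own model and metric.

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_squared_error
from sklearn.model_selection import KFold

rng = np.random.default_rng(3)

# made-up regression data with a known linear signal plus noise
X_demo = rng.random((200, 3))
y_demo = X_demo @ np.array([1.0, -2.0, 0.5]) + rng.normal(0, 0.1, 200)

kfold_demo = KFold(n_splits=5, shuffle=True, random_state=3)

fold_mses = []
for train_index, holdout_index in kfold_demo.split(X_demo):
    # fit a fresh model on this split's training rows
    model = LinearRegression()
    model.fit(X_demo[train_index], y_demo[train_index])

    # record the error on this split's holdout rows
    preds = model.predict(X_demo[holdout_index])
    fold_mses.append(mean_squared_error(y_demo[holdout_index], preds))

print("Holdout MSE for each split:", np.round(fold_mses, 4))
print("Mean cross-validation MSE:", np.mean(fold_mses))
```

Averaging the $k$ holdout errors gives a single number for comparing candidate models, while the spread across splits indicates how much any one estimate might vary.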

### Validation set or cross-validation

Cross-validation, when feasible, is preferred to a single validation set. In general it is better to have a collection of estimates of the generalization error instead of a single point estimate.

However, it is not always feasible to perform cross-validation. Models that take prohibitively long to train limit the usefulness of cross-validation since $k$-fold cross-validation requires you to train the model $k$ distinct times.

--------------------------

This notebook was written for the Erdős Institute Cőde Data Science Boot Camp by Matthew Osborne, Ph.D., 2023.  Modified by Steven Gubkin 2024.

Any potential redistributors must seek and receive permission from Matthew Tyler Osborne, Ph.D. prior to redistribution. Redistribution of the material contained in this repository is conditional on acknowledgement of Matthew Tyler Osborne, Ph.D.'s original authorship and sponsorship of the Erdős Institute as subject to the license (see License.md)