# Splitting Data into Training and Testing

## Learning objectives
- Understand why and how to split a dataset into training and testing sets, or into training, validation, and testing sets.
- Understand some technical nuances of splitting datasets, such as reproducibility and how to deal with imbalanced datasets.
- Implement dataset splits in Python with `scikit-learn`.

## Splitting Data into Training and Testing datasets

When training a model, we want to separate concerns in our data. One portion of the dataset is the data the model learns from; this is known as the **training set**. Another portion is held back to evaluate the performance of the model (here we understand _performance_ as _some measure of how good the predictions of our model are_); this is known as the **testing set**. The two sets must not overlap: if the model is evaluated on samples it has already seen during training, its score will look better than it really is and overfitting will go unnoticed. In practice, the proportions of these splits range from about **70-30** to **90-10** percent (**train-test**).

In [4]:
# Install the libraries used in this notebook
%pip install scikit-learn pandas numpy


### Making random splits

In [12]:
from sklearn.model_selection import train_test_split
from sklearn.datasets import load_wine

TRAIN_SIZE = 0.8

wine_ds = load_wine()
wine_X = wine_ds.data
wine_y = wine_ds.target

X_train, X_test, y_train, y_test = train_test_split(wine_X,
                                                    wine_y,
                                                    train_size=TRAIN_SIZE)
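
As a quick sanity check, the sketch below repeats the split above and prints the size of each resulting set. It assumes the copy of the wine dataset bundled with scikit-learn; the rows that land in each set change from run to run because no seed is set, but the sizes do not.

```python
from sklearn.datasets import load_wine
from sklearn.model_selection import train_test_split

TRAIN_SIZE = 0.8

# return_X_y=True returns the feature matrix and the labels directly
wine_X, wine_y = load_wine(return_X_y=True)

X_train, X_test, y_train, y_test = train_test_split(
    wine_X, wine_y, train_size=TRAIN_SIZE
)

n_total = len(wine_X)
# Features and labels are split together, so their lengths match
print(f"train: {len(X_train)} samples ({len(X_train) / n_total:.0%})")
print(f"test:  {len(X_test)} samples ({len(X_test) / n_total:.0%})")
```

Because the sizes are rounded to whole samples, the observed percentages are close to, but not exactly, 80 and 20.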

### Using a seed for reproducibility

When splitting data, we often want subsequent runs of the same training pipeline to yield the same result, which requires the split itself to be identical across runs. The `train_test_split` function in scikit-learn takes an optional `random_state` parameter: passing the same integer seed produces the same split on every run.

In [7]:
from sklearn.model_selection import train_test_split
from sklearn.datasets import load_wine

TRAIN_SIZE = 0.8
RANDOM_SEED = 314159

wine_ds = load_wine()
wine_X = wine_ds.data
wine_y = wine_ds.target

X_train, X_test, y_train, y_test = train_test_split(wine_X,
                                                    wine_y,
                                                    train_size=TRAIN_SIZE,
                                                    random_state=RANDOM_SEED)
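
One way to convince ourselves that the seed works is to perform the split twice and compare the results. The following is a minimal sketch; `split_with_seed` is a hypothetical helper written for this illustration and is not part of scikit-learn.

```python
import numpy as np
from sklearn.datasets import load_wine
from sklearn.model_selection import train_test_split

TRAIN_SIZE = 0.8
RANDOM_SEED = 314159

wine_X, wine_y = load_wine(return_X_y=True)


def split_with_seed(seed):
    """Hypothetical helper: return the test features and labels of a seeded split."""
    _, X_test, _, y_test = train_test_split(
        wine_X, wine_y, train_size=TRAIN_SIZE, random_state=seed
    )
    return X_test, y_test


X_test_a, y_test_a = split_with_seed(RANDOM_SEED)
X_test_b, y_test_b = split_with_seed(RANDOM_SEED)

# With the same seed, both calls select exactly the same rows
same_split = np.array_equal(X_test_a, X_test_b) and np.array_equal(y_test_a, y_test_b)
print("Same seed gives the same test set:", same_split)
```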

### Splitting into training, testing, and validation

We can also evaluate the model on held-out data during the training phase, for example to tune hyperparameters or to decide when to stop training, which helps us detect and reduce overfitting. Using the testing set for this would bias the final evaluation, so we introduce a third split, usually known as the **validation set**. It is typically carved out of the data that would otherwise be reserved for testing, taking **40** to **60** percent of it. None of the three sets should overlap.

In [13]:
from sklearn.model_selection import train_test_split
from sklearn.datasets import load_wine

TRAIN_SIZE = 0.8
TEST_VAL_RATIO = 0.5
RANDOM_SEED = 314159

wine_ds = load_wine()
wine_X = wine_ds.data
wine_y = wine_ds.target

X_train, X_temp, y_train, y_temp = train_test_split(wine_X,
                                                    wine_y,
                                                    train_size=TRAIN_SIZE,
                                                    random_state=RANDOM_SEED)

X_test, X_val, y_test, y_val = train_test_split(X_temp,
                                                y_temp,
                                                train_size=TEST_VAL_RATIO,
                                                random_state=RANDOM_SEED)
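
To see what the two-step split produces, the sketch below applies the same two calls to an array of row positions instead of the data itself. This is an assumption made for illustration: it makes it easy to report the share of each set and to check that no row appears in more than one of them.

```python
import numpy as np
from sklearn.datasets import load_wine
from sklearn.model_selection import train_test_split

TRAIN_SIZE = 0.8
TEST_VAL_RATIO = 0.5
RANDOM_SEED = 314159

wine_X, wine_y = load_wine(return_X_y=True)
indices = np.arange(len(wine_X))  # row positions, split in place of the data

# First step: training rows versus everything else
idx_train, idx_temp = train_test_split(
    indices, train_size=TRAIN_SIZE, random_state=RANDOM_SEED
)
# Second step: divide the remainder into testing and validation rows
idx_test, idx_val = train_test_split(
    idx_temp, train_size=TEST_VAL_RATIO, random_state=RANDOM_SEED
)

for name, idx in [("train", idx_train), ("test", idx_test), ("validation", idx_val)]:
    print(f"{name:>10}: {len(idx)} rows ({len(idx) / len(indices):.0%})")

# No row position should appear in more than one set
no_overlap = (
    set(idx_train).isdisjoint(idx_test)
    and set(idx_train).isdisjoint(idx_val)
    and set(idx_test).isdisjoint(idx_val)
)
print("Sets are disjoint:", no_overlap)
```

With `TRAIN_SIZE = 0.8` and `TEST_VAL_RATIO = 0.5`, the overall proportions come out at roughly 80-10-10.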

## Handling imbalanced data or time-series

When handling highly imbalanced (skewed) data, some classes may end up under-represented, or missing altogether, in the training or testing/validation sets. For instance, in a fraud detection dataset only a small percentage of the samples are fraud events, and a random split might leave few or none of them in the testing set. Time-based events need similar care: in forecasting, for instance, the time at which each observation was recorded should be taken into account, so that every split covers the time periods it is meant to represent.

### Stratified Split

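A stratified split keeps the class distribution of a classification target approximately the same in every dataset split. In `train_test_split` this is done by passing the class labels through the `stratify` parameter.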
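
To make the fraud scenario concrete, the sketch below builds a synthetic dataset and compares a plain random split with a stratified one. Everything about the data is an assumption made for illustration: the 1% fraud rate, the three random features, and the 80-20 split.

```python
import numpy as np
from sklearn.model_selection import train_test_split

RANDOM_SEED = 314159
N_SAMPLES = 1000
N_FRAUD = 10  # 1% of the samples, an assumed rate for illustration

rng = np.random.default_rng(RANDOM_SEED)
X = rng.normal(size=(N_SAMPLES, 3))  # synthetic features with no real meaning
y = np.zeros(N_SAMPLES, dtype=int)
y[:N_FRAUD] = 1  # 1 marks a fraud event, 0 a regular one

# Plain random split: the number of fraud cases in the test set depends on the seed
_, _, _, y_test_plain = train_test_split(
    X, y, test_size=0.2, random_state=RANDOM_SEED
)

# Stratified split: the test set keeps the 1% fraud rate of the full dataset
_, _, _, y_test_strat = train_test_split(
    X, y, test_size=0.2, random_state=RANDOM_SEED, stratify=y
)

print("fraud cases in test set (plain):     ", y_test_plain.sum())
print("fraud cases in test set (stratified):", y_test_strat.sum())
```

The stratified test set of 200 samples contains 2 fraud cases, matching the 1% rate, whereas the plain split gives no such guarantee.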

In [14]:
from sklearn.model_selection import train_test_split
from sklearn.datasets import load_wine

TRAIN_SIZE = 0.8
TEST_VAL_RATIO = 0.5
RANDOM_SEED = 314159

wine_ds = load_wine()
wine_X = wine_ds.data
wine_y = wine_ds.target

X_train, X_temp, y_train, y_temp = train_test_split(wine_X,
                                                    wine_y,
                                                    train_size=TRAIN_SIZE,
                                                    random_state=RANDOM_SEED,
                                                    stratify=wine_y)
# We stratify on the target array because it holds the class labels of the classification problem.

X_test, X_val, y_test, y_val = train_test_split(X_temp,
                                                y_temp,
                                                train_size=TEST_VAL_RATIO,
                                                random_state=RANDOM_SEED,
                                                stratify=y_temp)
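
To check the effect of `stratify`, we can compare the class proportions of the full dataset with those of the resulting sets. This is a minimal sketch; `class_proportions` is a hypothetical helper written for this illustration.

```python
import numpy as np
from sklearn.datasets import load_wine
from sklearn.model_selection import train_test_split

TRAIN_SIZE = 0.8
RANDOM_SEED = 314159

wine_X, wine_y = load_wine(return_X_y=True)
n_classes = len(np.unique(wine_y))

_, _, y_train, y_test = train_test_split(
    wine_X, wine_y, train_size=TRAIN_SIZE, random_state=RANDOM_SEED, stratify=wine_y
)


def class_proportions(labels):
    """Hypothetical helper: fraction of samples in each class, ordered by label."""
    counts = np.bincount(labels, minlength=n_classes)
    return counts / counts.sum()


for name, labels in [("full", wine_y), ("train", y_train), ("test", y_test)]:
    print(f"{name:>5}: {np.round(class_proportions(labels), 3)}")
```

The three rows should be nearly identical; small differences remain because each class has to be rounded to a whole number of samples.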

### Time-series splits

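When working with time series, one way to split the data is to:
- Split the data into segments, each corresponding to a time period.
- Create a train-test split within each of these segments.
- Join the resulting training sets together, and likewise the testing sets.

The `scikit-learn` library also provides a helper for time-ordered data, `TimeSeriesSplit`. It works differently from the procedure above: in each fold, the training indices come before the testing indices, so the model is always evaluated on observations later than the ones it was trained on.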
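
The segment-based procedure from the list above can be sketched with `pandas` as follows. The data is synthetic, and the choice of calendar months as segments, the column names, and the 80-20 ratio are all assumptions made for illustration.

```python
import numpy as np
import pandas as pd
from sklearn.model_selection import train_test_split

TRAIN_SIZE = 0.8
RANDOM_SEED = 314159

# Synthetic daily observations covering 120 days (hypothetical data)
rng = np.random.default_rng(RANDOM_SEED)
timestamps = pd.date_range("2024-01-01", periods=120, freq="D")
df = pd.DataFrame({"timestamp": timestamps, "value": rng.normal(size=len(timestamps))})

train_parts, test_parts = [], []
# Step 1: segment the data by calendar month
for _, segment in df.groupby(df["timestamp"].dt.to_period("M")):
    # Step 2: create a train-test split inside each segment
    seg_train, seg_test = train_test_split(
        segment, train_size=TRAIN_SIZE, random_state=RANDOM_SEED
    )
    train_parts.append(seg_train)
    test_parts.append(seg_test)

# Step 3: join the per-segment results and restore chronological order
train_df = pd.concat(train_parts).sort_values("timestamp")
test_df = pd.concat(test_parts).sort_values("timestamp")

print(len(train_df), "training rows,", len(test_df), "testing rows")
```

Every month is represented in both sets, but training and testing rows are interleaved in time, so this approach may not be suitable when the model must never see observations later than the ones it is tested on.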

In [17]:
from sklearn.datasets import fetch_openml

# Retrieve the bike sharing dataset from the OpenML repository.
bike_sharing = fetch_openml("Bike_Sharing_Demand", version=2, as_frame=True)
bike_df = bike_sharing.frame
# Use the number of rentals as the target.
bike_y = bike_df["count"]
bike_X = bike_df.drop("count", axis="columns")


In [25]:
from sklearn.model_selection import TimeSeriesSplit


ts_cv = TimeSeriesSplit(n_splits=5)

all_splits = list(ts_cv.split(bike_X, bike_y))  # list of (train_indices, test_indices) pairs

# Recover the index arrays of the first fold:
train_0, test_0 = all_splits[0]

print(
    bike_X.iloc[train_0],
    bike_X.iloc[test_0])


      season  year  month  hour holiday  weekday workingday weather   temp  \
0     spring     0      1     0   False        6      False   clear   9.84   
1     spring     0      1     1   False        6      False   clear   9.02   
2     spring     0      1     2   False        6      False   clear   9.02   
3     spring     0      1     3   False        6      False   clear   9.84   
4     spring     0      1     4   False        6      False   clear   9.84   
...      ...   ...    ...   ...     ...      ...        ...     ...    ...   
2894  summer     0      5    12   False        4       True   clear  21.32   
2895  summer     0      5    13   False        4       True   clear  22.14   
2896  summer     0      5    14   False        4       True   clear  22.14   
2897  summer     0      5    15   False        4       True   clear  22.96   
2898  summer     0      5    16   False        4       True   clear  23.78   

      feel_temp  humidity  windspeed  
0        14.395      0.8
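
The fold structure is easier to read on a small example. The sketch below applies `TimeSeriesSplit` to a toy array of 12 time-ordered samples; the array and the choice of `n_splits=3` are assumptions made for illustration.

```python
import numpy as np
from sklearn.model_selection import TimeSeriesSplit

X_toy = np.arange(12).reshape(-1, 1)  # 12 samples, assumed to be in time order

ts_cv = TimeSeriesSplit(n_splits=3)
folds = list(ts_cv.split(X_toy))

for fold, (train_idx, test_idx) in enumerate(folds):
    print(f"fold {fold}: train={train_idx.tolist()} test={test_idx.tolist()}")

# In every fold the training window ends before the testing window starts
ordered = all(train_idx.max() < test_idx.min() for train_idx, test_idx in folds)
print("Training always precedes testing:", ordered)
```

The training window grows from one fold to the next while the testing window moves forward in time, so no fold tests on samples older than its training data.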

### Sources:
- https://medium.com/data-and-beyond/how-to-split-data-in-machine-learning-5-simple-strategies-and-python-examples-a500c3f2f750
- https://scikit-learn.org/stable/auto_examples/applications/plot_cyclical_feature_engineering.html 