# Train-validation-test split

A central concern in any statistical learning project is finding a function that describes the training data closely while still generalising well to unseen data. Measuring a model's performance on the same data it was trained on is not enough to achieve this: doing so rewards overfitting, where the fitted function describes the training data too closely and is not able to recognise the variations brought in by new data.

During training it is not always possible to test a model on outside data, but several techniques give a fair estimate of its performance. When multiple models are compared with each other, a popular technique is a three-way split of the available data into training, validation, and test sets. Each model is fitted on the training set, and its performance is measured on the predictions it makes for the validation set, which is the set used to assess any experimentation. Finally, the test set is used to evaluate the final performance of the chosen model on unseen data.
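The workflow described above can be sketched on toy data. This is a minimal illustration under stated assumptions, not the project's actual pipeline: the synthetic dataset from `make_classification`, the two candidate models, and the accuracy metric are all chosen only for the example.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

# Synthetic data stands in for a real dataset (assumption for illustration).
X, y = make_classification(n_samples=1000, n_features=20, random_state=0)

# Hold out the test set first, then carve the validation set out of the rest.
X_train_val, X_test, y_train_val, y_test = train_test_split(
    X, y, test_size=0.15, random_state=0
)
X_train, X_val, y_train, y_val = train_test_split(
    X_train_val, y_train_val, test_size=0.15, random_state=0
)

# Hypothetical candidates; any set of estimators could be compared this way.
candidates = {
    "logistic_regression": LogisticRegression(max_iter=1000),
    "decision_tree": DecisionTreeClassifier(random_state=0),
}

# Fit each candidate on the training set and score it on the validation set.
val_scores = {}
for name, model in candidates.items():
    model.fit(X_train, y_train)
    val_scores[name] = accuracy_score(y_val, model.predict(X_val))

# Select the best candidate using validation scores only.
best_name = max(val_scores, key=val_scores.get)

# Use the test set once, for the final estimate of the selected model.
test_score = accuracy_score(y_test, candidates[best_name].predict(X_test))
print(best_name, round(test_score, 3))
```

The test set plays no part in choosing the model, so its score remains an estimate of performance on data the selection process has never seen.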

In this notebook, the arXiv dataset is split into these three parts before any model is trained. The indices of each set are then saved and referenced in every subsequent step. Storing only the indices is efficient, because the same dataset is processed in different ways depending on the model implemented at each phase. It also guarantees that the sets stay consistent throughout the project, for a fair comparison among all classifiers.

In [1]:
import pandas as pd
import pickle
from sklearn.model_selection import train_test_split

In [2]:
SEED = 3742

In [3]:
FILE = "../data/data.parquet.gzip"
data = pd.read_parquet(FILE)

In [4]:
X = data["text"]
y = data["target"]

In [5]:
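# Hold out 15% of the data as the test set
X_train_val, X_test, y_train_val, y_test = train_test_split(X, y, test_size=0.15, random_state=SEED)

# Hold out 15% of the remaining data (12.75% of the total) as the validation set
X_train, X_val, y_train, y_val = train_test_split(X_train_val, y_train_val, test_size=0.15, random_state=SEED)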
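Because the second split is applied to what remains after the first, the validation set is not 15% of the whole dataset. The short calculation below works out the effective share of each set; the variable names are introduced here only for illustration.

```python
# Effective share of each set when two consecutive 15% hold-outs are applied.
TEST_SIZE = 0.15  # first split: fraction of all rows held out for testing
VAL_SIZE = 0.15   # second split: fraction of the remaining rows held out for validation

test_share = TEST_SIZE
val_share = (1 - TEST_SIZE) * VAL_SIZE
train_share = (1 - TEST_SIZE) * (1 - VAL_SIZE)

print(f"train={train_share:.4f} val={val_share:.4f} test={test_share:.4f}")
```

The resulting shares of roughly 72.25%, 12.75%, and 15% agree with the set totals printed further down in this notebook.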

In [6]:
# Class counts per split, one column per set
datasets_sizes = pd.concat(
    (y_train.value_counts(), y_val.value_counts(), y_test.value_counts()),
    axis=1,
).reset_index()
datasets_sizes.columns = ["target", "train", "val", "test"]

# Share of each class within each split (fractions of the column total)
counts = datasets_sizes[["train", "val", "test"]]
relatives = counts / counts.sum()
relatives.columns = ["train_perc", "val_perc", "test_perc"]
datasets_sizes = pd.concat((datasets_sizes, relatives), axis=1)
datasets_sizes

   target   train     val    test  train_perc  val_perc  test_perc
0       0  870462  153261  180999    0.550031  0.548778   0.550882
1       1  344714   60818   71066    0.217819  0.217769   0.216294
2       2  283762   50476   59196    0.179305  0.180738   0.180167
3       3   83630   14722   17301    0.052844  0.052715   0.052657


In [7]:
# totals
datasets_sizes.sum().iloc[1:4].astype(int)

train    1582568
val       279277
test      328562
dtype: int64

In [2]:
!mkdir -p ../data/wip

In [8]:
# Save the index of each split so later steps can rebuild exactly the same sets
for name, split in (("train", X_train), ("val", X_val), ("test", X_test)):
    with open(f"../data/wip/{name}_idx.pkl", "wb") as f:
        pickle.dump(split.index, f)
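A later step could rebuild a set from its saved index roughly as follows. This is a sketch under assumptions: a small toy DataFrame and a temporary directory stand in for the real data and for `../data/wip`, and the three positions used to build the index are arbitrary.

```python
import os
import pickle
import tempfile

import pandas as pd

# Toy stand-in for the notebook's DataFrame (hypothetical values).
data = pd.DataFrame(
    {"text": ["a", "b", "c", "d", "e"], "target": [0, 1, 0, 2, 3]},
    index=[10, 11, 12, 13, 14],
)

# Save an index with pickle, here into a temporary directory.
tmp_dir = tempfile.mkdtemp()
idx_path = os.path.join(tmp_dir, "train_idx.pkl")
with open(idx_path, "wb") as f:
    pickle.dump(data.index[[0, 2, 4]], f)

# In a later step: load the index and select the matching rows by label.
with open(idx_path, "rb") as f:
    train_idx = pickle.load(f)

X_train = data.loc[train_idx, "text"]
y_train = data.loc[train_idx, "target"]
print(X_train.tolist())
```

Selecting with `.loc` matches rows by index label rather than by position, so the same rows are recovered as long as the DataFrame keeps the index it had when the split was made.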