# Create train-test split to use across notebooks

Note: When I built this notebook, I planned to focus on day four, which is why the split below is stratified on day four. This later changed, and I now focus on day two.

## Imports

In [None]:
import pandas as pd
from sklearn.model_selection import train_test_split

## Read in the Data

In [None]:
genes = pd.read_hdf('./data/train_cite_inputs.h5')
proteins = pd.read_hdf('./data/train_cite_targets.h5')
meta = pd.read_csv('./data/metadata.csv')

In [None]:
genes.shape, proteins.shape, meta.shape

((70988, 22050), (70988, 140), (281528, 5))

In [None]:
meta.set_index('cell_id', inplace = True)

In [None]:
proteins = meta.merge(proteins, how = 'right', left_index = True, right_index = True)
del meta

In [None]:
proteins.isnull().sum().unique()

array([0])

The merge worked correctly: no column contains missing values, so every row in proteins was matched to its metadata and no data from proteins was lost.
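As a complementary check, pandas can validate the merge itself. The sketch below uses small made-up frames (`meta_toy`, `proteins_toy`, and their values are hypothetical, not the real data) to show how the `validate` and `indicator` arguments of `merge` can confirm that each protein row found exactly one metadata row.

```python
import pandas as pd

# Toy stand-ins for the real tables (hypothetical values).
meta_toy = pd.DataFrame(
    {"day": [2, 3, 4], "donor": [101, 101, 202]},
    index=pd.Index(["c1", "c2", "c3"], name="cell_id"),
)
proteins_toy = pd.DataFrame(
    {"CD4": [0.5, 1.2]},
    index=pd.Index(["c1", "c3"], name="cell_id"),
)

merged = meta_toy.merge(
    proteins_toy,
    how="right",
    left_index=True,
    right_index=True,
    validate="one_to_one",  # raises if a cell_id is duplicated on either side
    indicator=True,         # adds a "_merge" column recording each row's origin
)

# Every protein row should have found its metadata row.
all_matched = bool((merged["_merge"] == "both").all())
merged = merged.drop(columns="_merge")
print(all_matched, merged.shape)
```

On the real data this would replace the null count with an explicit match check, at the cost of one temporary extra column.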

## Apply Train Test split
I will use this split across all models.

In other notebooks, to reduce memory consumption, I will model only cells from day four. So it is important that the train-test split be stratified on whether a cell comes from day four.

In [None]:
proteins['to_stratify'] = [1 if day == 4 else 0 for day in proteins['day']]

In [None]:
proteins['day'].value_counts()[4] == proteins['to_stratify'].sum()

True

The dummification of day four worked correctly.

In [None]:
X = genes     # All genes, used as predictors.
Y = proteins  # The 140 protein targets, plus the metadata columns and 'to_stratify'.

X_train, X_test, Y_train, Y_test = train_test_split(
    X, Y, random_state = 2022, train_size = 0.8, stratify = Y['to_stratify']
)

In [None]:
(Y_train['to_stratify'].sum(), Y_test['to_stratify'].sum()) / proteins['to_stratify'].sum()

array([0.8, 0.2])

The stratification also worked as expected: 80% of the day-four cells are in train and 20% are in test.
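To illustrate what `stratify` guarantees, here is a minimal sketch on made-up data (the names `flag` and `features` and the 30/70 class mix are assumptions for the example). It checks that the flagged rows are divided between train and test in the same 80/20 proportion as the split itself.

```python
import numpy as np
import pandas as pd
from sklearn.model_selection import train_test_split

# Toy data: 100 cells, 30 of them flagged as the day of interest.
flag = pd.Series([1] * 30 + [0] * 70, name="to_stratify")
features = pd.DataFrame({"gene_a": np.arange(100)})

f_train, f_test, s_train, s_test = train_test_split(
    features, flag, train_size=0.8, random_state=2022, stratify=flag
)

# Share of the flagged cells that ended up on each side of the split.
train_share = s_train.sum() / flag.sum()
test_share = s_test.sum() / flag.sum()
print(train_share, test_share)
```

Without `stratify`, these shares would vary with the random seed instead of matching `train_size`.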

In [None]:
# To preserve memory I will delete the DataFrames that contain all the data.
del genes
del proteins
del X
del Y

### Check that the genes are reasonably balanced between train and test without special intervention.

In [None]:
mean_diff = []
std_diff = []

for gene in X_train.columns:
    gene_mean_diff = X_train[gene].mean() - X_test[gene].mean()
    mean_diff.append(gene_mean_diff)

    gene_std_diff = X_train[gene].std() - X_test[gene].std()
    std_diff.append(gene_std_diff)

stratify_check_df = pd.DataFrame({'difference_in_mean' : mean_diff, 
                                  'difference_in_standard_deviation': std_diff}, 
                                 index = X_train.columns)

In [None]:
stratify_check_df.abs().describe()

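|       | difference_in_mean | difference_in_standard_deviation |
|-------|--------------------|----------------------------------|
| count | 22050.0            | 22050.0                          |
| mean  | 0.007373           | 0.010697                         |
| std   | 0.008742           | 0.009584                         |
| min   | 0.0                | 0.0                              |
| 25%   | 0.001257           | 0.003329                         |
| 50%   | 0.003872           | 0.007971                         |
| 75%   | 0.010494           | 0.015691                         |
| max   | 0.080008           | 0.080533                         |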


This is the distribution, across all predictor columns, of the absolute difference between train and test in each column's mean and standard deviation.

The differences are all fairly small (at most about 0.08), which shows that even though the split was not stratified on the predictors, they ended up reasonably balanced between train and test.

I will save this table so that I can refer back to it during future modeling if the balance of a particular gene turns out to be an issue.

In [None]:
stratify_check_df.to_csv('./train_test_split/stratify_check.csv')
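The column-by-column loop above can also be written without an explicit loop, since `DataFrame.mean()` and `DataFrame.std()` already return one value per column. The sketch below uses small random frames (`train_toy` and `test_toy` are hypothetical stand-ins for `X_train` and `X_test`) and confirms that the vectorized form agrees with the loop.

```python
import numpy as np
import pandas as pd

# Toy stand-ins for X_train and X_test (random values, three "genes").
rng = np.random.default_rng(0)
cols = ["g1", "g2", "g3"]
train_toy = pd.DataFrame(rng.normal(size=(80, 3)), columns=cols)
test_toy = pd.DataFrame(rng.normal(size=(20, 3)), columns=cols)

# Column-wise statistics are Series indexed by column name,
# so subtracting them aligns gene to gene.
check_toy = pd.DataFrame(
    {
        "difference_in_mean": train_toy.mean() - test_toy.mean(),
        "difference_in_standard_deviation": train_toy.std() - test_toy.std(),
    }
)

# Same quantity computed with the explicit loop, for comparison.
loop_mean = [train_toy[c].mean() - test_toy[c].mean() for c in cols]
matches_loop = bool(np.allclose(check_toy["difference_in_mean"], loop_mean))
print(check_toy.abs().describe())
```

With 22,050 columns the vectorized form is likely to be noticeably faster, though I have not timed it on the real data.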

### Save this split for future use

In [None]:
X_train.to_hdf('./train_test_split/X_train_cite_seq.h5', index = False, key = 'df', mode = 'w')
X_test.to_hdf('./train_test_split/X_test_cite_seq.h5', index = False, key = 'df', mode = 'w')
Y_train.to_hdf('./train_test_split/Y_train_cite_seq.h5', index = False, key = 'df', mode = 'w')
Y_test.to_hdf('./train_test_split/Y_test_cite_seq.h5', index = False, key = 'df', mode = 'w')

In [None]:
X_train.shape, X_test.shape, Y_train.shape, Y_test.shape

((56790, 22050), (14198, 22050), (56790, 145), (14198, 145))
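Since the other notebooks model cells from a single day, here is a minimal sketch of how the split could be restricted to one day after loading. The helper `select_day` and the toy frames are hypothetical; the sketch assumes that the X and Y frames still share `cell_id` as their index and that Y carries the `day` metadata column.

```python
import pandas as pd

def select_day(X, Y, day):
    """Return the rows of X and Y whose metadata 'day' equals `day`."""
    Y_day = Y.loc[Y["day"] == day]
    X_day = X.loc[Y_day.index]  # align predictors to the selected cells
    return X_day, Y_day

# Toy stand-ins for a loaded X_train / Y_train pair (hypothetical values).
ids = pd.Index(["c1", "c2", "c3", "c4"], name="cell_id")
X_toy = pd.DataFrame({"gene_a": [0.0, 1.0, 2.0, 3.0]}, index=ids)
Y_toy = pd.DataFrame({"day": [2, 4, 2, 3], "CD4": [0.1, 0.2, 0.3, 0.4]}, index=ids)

X_day2, Y_day2 = select_day(X_toy, Y_toy, day=2)
print(X_day2.shape, Y_day2.shape)
```

Because the split was stratified only on the day-four flag, the train and test shares of any other day are approximately, not exactly, 80/20.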