# Checking statistical overlap between train and test sets

In this notebook we illustrate a simple method for checking whether the training set is statistically representative of the test set. We do this by building a simple linear classifier that tries to distinguish training examples from test examples.

The motivation behind this is simple: if the training set and test set come from clearly different distributions, a classifier should be able to tell them apart much better than chance. In the extreme case where the two sets do NOT overlap and are linearly separable, a simple linear classifier separates them perfectly. On the other hand, if the two data sets do overlap, then this linear classifier will do no better than a random classifier.
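To make the idea concrete before turning to real data, here is a minimal sketch on synthetic Gaussian data. It is not part of the original notebook: the helper name `domain_auc`, the sample sizes, and the size of the shift are all assumptions chosen for illustration, and it assumes a scikit-learn version that provides `sklearn.model_selection`.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import roc_auc_score
from sklearn.model_selection import train_test_split


def domain_auc(train_data, test_data, seed=0):
    """AUC of a linear classifier that tries to tell train rows (0) from test rows (1).

    Hypothetical helper for illustration only.
    """
    X = np.vstack((train_data, test_data))
    y = np.concatenate((np.zeros(len(train_data)), np.ones(len(test_data))))
    # Hold out 30% of the combined rows to evaluate the classifier.
    X_fit, X_eval, y_fit, y_eval = train_test_split(
        X, y, test_size=0.3, random_state=seed, stratify=y)
    clf = LogisticRegression(max_iter=1000).fit(X_fit, y_fit)
    # Use predicted probabilities so the AUC reflects ranking quality.
    return roc_auc_score(y_eval, clf.predict_proba(X_eval)[:, 1])


rng = np.random.RandomState(0)
same_a = rng.normal(size=(2000, 5))            # "train" sample
same_b = rng.normal(size=(2000, 5))            # "test" sample, same distribution
shifted = rng.normal(loc=1.0, size=(2000, 5))  # "test" sample with a shifted mean

auc_same = domain_auc(same_a, same_b)
auc_shifted = domain_auc(same_a, shifted)
print("same distribution AUC:    %.3f" % auc_same)
print("shifted distribution AUC: %.3f" % auc_shifted)
```

When both samples come from the same distribution the AUC should land near 0.5, and it should move well above 0.5 once the second sample is shifted.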

We illustrate this for two different datasets: MNIST and HAR.

In [1]:
import pandas as pd
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.utils import shuffle


We create a function that takes the training and test sets as input and returns a single data set X together with binary labels y (label 0 for rows from the training set and label 1 for rows from the test set).

In [2]:
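def prep_train_test_data(train_data, test_data):
    """Stack train and test rows into one array and label their origin (0 = train, 1 = test)."""
    X_0 = train_data
    X_1 = test_data
    X = np.vstack((X_0, X_1))
    # column vector of origin labels: zeros for training rows, ones for test rows
    y = np.vstack((np.zeros((X_0.shape[0],1)), np.ones((X_1.shape[0],1))))
    X_shuffled, y_shuffled = shuffle(X, y, random_state = 216) # use a random seed for reproducibility
    return X_shuffled, y_shuffled.squeeze() # squeeze the labels to a 1-D vector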
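As a quick sanity check, the sketch below runs the function on two tiny arrays. The arrays `toy_train` and `toy_test` are made up for illustration, and the function body is repeated only so the snippet runs on its own.

```python
import numpy as np
from sklearn.utils import shuffle


def prep_train_test_data(train_data, test_data):
    # Same logic as the notebook cell, repeated so this sketch is self-contained.
    X_0 = train_data
    X_1 = test_data
    X = np.vstack((X_0, X_1))
    y = np.vstack((np.zeros((X_0.shape[0], 1)), np.ones((X_1.shape[0], 1))))
    X_shuffled, y_shuffled = shuffle(X, y, random_state=216)
    return X_shuffled, y_shuffled.squeeze()


toy_train = np.arange(12, dtype=float).reshape(4, 3)  # 4 made-up "training" rows
toy_test = -np.ones((2, 3))                           # 2 made-up "test" rows, easy to spot

X_toy, y_toy = prep_train_test_data(toy_train, toy_test)
print(X_toy.shape, y_toy.shape)   # (6, 3) (6,)
print(int(y_toy.sum()))           # 2, one label per test row
# Rows labelled 1 should be exactly the rows that came from toy_test.
print(bool(np.all(X_toy[y_toy == 1] == -1)))  # True
```

The shuffle keeps each row paired with its label, which is what lets the later cells split `X` and `y` by position.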

## MNIST dataset

In [3]:
df_train = pd.read_csv("./data/digits/train.csv")
df_train.head()

Unnamed: 0,label,pixel0,pixel1,pixel2,pixel3,pixel4,pixel5,pixel6,pixel7,pixel8,...,pixel774,pixel775,pixel776,pixel777,pixel778,pixel779,pixel780,pixel781,pixel782,pixel783
0,1,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
1,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
2,1,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
3,4,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
4,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0


In [4]:
df_test = pd.read_csv("./data/digits/test.csv")
df_test.head()

Unnamed: 0,pixel0,pixel1,pixel2,pixel3,pixel4,pixel5,pixel6,pixel7,pixel8,pixel9,...,pixel774,pixel775,pixel776,pixel777,pixel778,pixel779,pixel780,pixel781,pixel782,pixel783
0,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
1,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
2,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
3,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
4,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0


In [5]:
X, y = prep_train_test_data(np.array(df_train.iloc[:,1:]), np.array(df_test))

In [6]:
LR = LogisticRegression()

In [7]:
N = int(.7*X.shape[0])
LR.fit(X[:N,:], y[:N])

LogisticRegression(C=1.0, class_weight=None, dual=False, fit_intercept=True,
          intercept_scaling=1, max_iter=100, multi_class='ovr', n_jobs=1,
          penalty='l2', random_state=None, solver='liblinear', tol=0.0001,
          verbose=0, warm_start=False)

In [11]:
LR.score(X[N:,:], y[N:])

0.59080952380952378

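Since the training and test sets are often of different sizes, the two classes are imbalanced, and raw accuracy is hard to interpret: a classifier that always predicts the larger set already scores above 50%. The AUC metric is easier to interpret in this setting. Scikit-learn provides a convenience function to compute it.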

In [12]:
from sklearn import metrics
preds = LR.predict(X[N:,:])
metrics.roc_auc_score(y[N:],preds)

0.49869746850284996
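One detail worth keeping in mind: the cell above passes hard 0/1 predictions to `roc_auc_score`. With hard labels the ROC curve has a single interior point, so the value returned equals the average of the true positive rate and the true negative rate. The sketch below, on made-up imbalanced data (all sizes and the shift are assumptions), compares that number with the AUC computed from predicted probabilities, which uses the full ranking of the held-out rows.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import roc_auc_score

rng = np.random.RandomState(0)
# Made-up data: 3000 "train" rows (label 0) and 1000 mildly shifted "test" rows (label 1).
X0 = rng.normal(size=(3000, 4))
X1 = rng.normal(loc=0.5, size=(1000, 4))
X_demo = np.vstack((X0, X1))
y_demo = np.concatenate((np.zeros(3000), np.ones(1000)))
perm = rng.permutation(len(y_demo))
X_demo, y_demo = X_demo[perm], y_demo[perm]

N_demo = int(0.7 * len(y_demo))
clf = LogisticRegression(max_iter=1000).fit(X_demo[:N_demo], y_demo[:N_demo])
y_eval = y_demo[N_demo:]

hard = clf.predict(X_demo[N_demo:])                 # 0/1 labels
scores = clf.predict_proba(X_demo[N_demo:])[:, 1]   # probability of being a "test" row

auc_hard = roc_auc_score(y_eval, hard)
auc_scores = roc_auc_score(y_eval, scores)

# Average of true positive rate and true negative rate of the hard predictions.
tpr = np.mean(hard[y_eval == 1] == 1)
tnr = np.mean(hard[y_eval == 0] == 0)

print("AUC from hard labels:   %.3f" % auc_hard)
print("(TPR + TNR) / 2:        %.3f" % ((tpr + tnr) / 2))
print("AUC from probabilities: %.3f" % auc_scores)
```

Either number sits near 0.5 when the two sets are indistinguishable, so the conclusion drawn in this notebook does not depend on the choice, but the probability-based AUC is the more common way to report this check.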

We see that the AUC is essentially that of a random classifier (0.5), suggesting that the training set for MNIST is statistically representative of the test set. Can you think of an instance when the training set might not be statistically representative of the test set?

- left-handed training set vs right-handed test set
- hardware: different hardware used for 

## HAR dataset 

In [13]:
df_train = pd.read_csv("./data/HAR/trian-har.csv")
df_train.head()

Unnamed: 0,tBodyAcc-mean()-X,tBodyAcc-mean()-Y,tBodyAcc-mean()-Z,tBodyAcc-std()-X,tBodyAcc-std()-Y,tBodyAcc-std()-Z,tBodyAcc-mad()-X,tBodyAcc-mad()-Y,tBodyAcc-mad()-Z,tBodyAcc-max()-X,...,fBodyBodyGyroJerkMag-kurtosis(),"angle(tBodyAccMean,gravity)","angle(tBodyAccJerkMean),gravityMean)","angle(tBodyGyroMean,gravityMean)","angle(tBodyGyroJerkMean,gravityMean)","angle(X,gravityMean)","angle(Y,gravityMean)","angle(Z,gravityMean)",subject,Activity
0,0.288585,-0.020294,-0.132905,-0.995279,-0.983111,-0.913526,-0.995112,-0.983185,-0.923527,-0.934724,...,-0.710304,-0.112754,0.0304,-0.464761,-0.018446,-0.841247,0.179941,-0.058627,1,STANDING
1,0.278419,-0.016411,-0.12352,-0.998245,-0.9753,-0.960322,-0.998807,-0.974914,-0.957686,-0.943068,...,-0.861499,0.053477,-0.007435,-0.732626,0.703511,-0.844788,0.180289,-0.054317,1,STANDING
2,0.279653,-0.019467,-0.113462,-0.99538,-0.967187,-0.978944,-0.99652,-0.963668,-0.977469,-0.938692,...,-0.760104,-0.118559,0.177899,0.100699,0.808529,-0.848933,0.180637,-0.049118,1,STANDING
3,0.279174,-0.026201,-0.123283,-0.996091,-0.983403,-0.990675,-0.997099,-0.98275,-0.989302,-0.938692,...,-0.482845,-0.036788,-0.012892,0.640011,-0.485366,-0.848649,0.181935,-0.047663,1,STANDING
4,0.276629,-0.01657,-0.115362,-0.998139,-0.980817,-0.990482,-0.998321,-0.979672,-0.990441,-0.942469,...,-0.699205,0.12332,0.122542,0.693578,-0.615971,-0.847865,0.185151,-0.043892,1,STANDING


In [14]:
df_test = pd.read_csv("./data/HAR/test-har.csv")
df_test.head()

Unnamed: 0,tBodyAcc-mean()-X,tBodyAcc-mean()-Y,tBodyAcc-mean()-Z,tBodyAcc-std()-X,tBodyAcc-std()-Y,tBodyAcc-std()-Z,tBodyAcc-mad()-X,tBodyAcc-mad()-Y,tBodyAcc-mad()-Z,tBodyAcc-max()-X,...,fBodyBodyGyroJerkMag-kurtosis(),"angle(tBodyAccMean,gravity)","angle(tBodyAccJerkMean),gravityMean)","angle(tBodyGyroMean,gravityMean)","angle(tBodyGyroJerkMean,gravityMean)","angle(X,gravityMean)","angle(Y,gravityMean)","angle(Z,gravityMean)",subject,Activity
0,0.257178,-0.023285,-0.014654,-0.938404,-0.920091,-0.667683,-0.952501,-0.925249,-0.674302,-0.894088,...,-0.705974,0.006462,0.16292,-0.825886,0.271151,-0.720009,0.276801,-0.057978,2,STANDING
1,0.286027,-0.013163,-0.119083,-0.975415,-0.967458,-0.944958,-0.986799,-0.968401,-0.945823,-0.894088,...,-0.594944,-0.083495,0.0175,-0.434375,0.920593,-0.698091,0.281343,-0.083898,2,STANDING
2,0.275485,-0.02605,-0.118152,-0.993819,-0.969926,-0.962748,-0.994403,-0.970735,-0.963483,-0.93926,...,-0.640736,-0.034956,0.202302,0.064103,0.145068,-0.702771,0.280083,-0.079346,2,STANDING
3,0.270298,-0.032614,-0.11752,-0.994743,-0.973268,-0.967091,-0.995274,-0.974471,-0.968897,-0.93861,...,-0.736124,-0.017067,0.154438,0.340134,0.296407,-0.698954,0.284114,-0.077108,2,STANDING
4,0.274833,-0.027848,-0.129527,-0.993852,-0.967445,-0.978295,-0.994111,-0.965953,-0.977346,-0.93861,...,-0.846595,-0.002223,-0.040046,0.736715,-0.118545,-0.692245,0.290722,-0.073857,2,STANDING


In [15]:
X, y = prep_train_test_data(np.array(df_train.iloc[:,:-2]), np.array(df_test.iloc[:,:-2]))
print(X.shape)

(10299, 561)


In [16]:
LR = LogisticRegression()

In [17]:
N = int(.7*X.shape[0])
LR.fit(X[:N,:], y[:N])

LogisticRegression(C=1.0, class_weight=None, dual=False, fit_intercept=True,
          intercept_scaling=1, max_iter=100, multi_class='ovr', n_jobs=1,
          penalty='l2', random_state=None, solver='liblinear', tol=0.0001,
          verbose=0, warm_start=False)

Let's evaluate with AUC, as we did with the MNIST dataset.

In [18]:
preds = LR.predict(X[N:,:])
metrics.roc_auc_score(y[N:],preds)

0.66380522412125131

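Unlike MNIST, the AUC for the HAR dataset is noticeably better than random (about 0.66 versus 0.5). This suggests that the training set we have for this problem is not as statistically representative as we would like. The previews above hint at one possible reason: the training rows shown come from subject 1 and the test rows from subject 2, so the two files may contain different people. What may be some solutions to this?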

- find more subjects that are representative of the population and control for different cohorts!
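Before collecting more data, it can help to see which features the train-vs-test classifier relies on. The following is a minimal sketch of that idea on made-up data, not something the original notebook does: the feature names are hypothetical, only one feature is shifted on purpose, and comparing coefficient magnitudes like this assumes the features are on comparable scales.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

rng = np.random.RandomState(0)
feature_names = ["f0", "f1", "f2", "f3", "f4"]  # hypothetical feature names
train_part = rng.normal(size=(3000, 5))
test_part = rng.normal(size=(3000, 5))
test_part[:, 3] += 1.0                           # only f3 differs between the two sets

X_demo = np.vstack((train_part, test_part))
y_demo = np.concatenate((np.zeros(len(train_part)), np.ones(len(test_part))))

clf = LogisticRegression(max_iter=1000).fit(X_demo, y_demo)

# Rank features by the absolute size of their coefficient in the train-vs-test classifier.
order = np.argsort(-np.abs(clf.coef_[0]))
ranked = [(feature_names[i], float(clf.coef_[0][i])) for i in order]
for name, weight in ranked:
    print("%s  %+.3f" % (name, weight))
```

Features at the top of such a ranking are candidates for closer inspection, for example to check whether they encode something specific to the subjects in one of the two sets.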