<a href="https://colab.research.google.com/github/nyp-sit/sdaai-iti103/blob/master/session-5/data_leakage_cv.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# Data Leakage in Cross Validation

The impact of data leakage in cross-validation depends on the kind of pre-processing involved. Estimating a scaling factor on the full data set, as described in the lecture, usually does not have a large impact. For other steps, such as feature extraction or feature selection, data leakage can lead to vast differences between the estimated and the 'true' predictive power of the model.
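As a rough illustration of the scaling case, the sketch below compares a scaler fit on the full data set before cross-validation with one refit inside each fold. The toy data set from `make_classification` and all variable names are assumptions made for this illustration, not part of the exercise. On data like this, the two estimates tend to be nearly identical.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# hypothetical toy data set, used only for this illustration
X_toy, y_toy = make_classification(n_samples=500, n_features=20, random_state=0)

# leaky: the scaler sees every sample, including future validation folds
X_scaled = StandardScaler().fit_transform(X_toy)
leaky_acc = cross_val_score(LogisticRegression(max_iter=1000), X_scaled, y_toy, cv=5).mean()

# leak-free: the scaler is refit on the training part of each fold
pipe = make_pipeline(StandardScaler(), LogisticRegression(max_iter=1000))
clean_acc = cross_val_score(pipe, X_toy, y_toy, cv=5).mean()

print("leaky: {:.3f}, leak-free: {:.3f}".format(leaky_acc, clean_acc))
```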

The purpose of this exercise is to illustrate the impact of data leakage on a model's accuracy. It is based on an excellent example from The Elements of Statistical Learning (by Trevor Hastie, et al.), from the section *The Wrong and Right Way to Do Cross-validation*.

## Generate Data

Let’s consider a synthetic classification task with 1,000 samples and
10,000 features that are sampled independently from a Gaussian distribution. We also
randomly sample the response from {0, 1} to get binary labels.

In [1]:
import numpy as np

rnd = np.random.RandomState(seed=0)

# generate 1000 samples with 10000 features from a standard Normal distribution
X = rnd.normal(size=(1000, 10000))

# generate 1000 binary labels with equal probability
# (drawn from the seeded generator so that the labels are reproducible)
y = rnd.choice([0, 1], size=(1000,), p=[0.5, 0.5])

In [2]:
X.shape, y.shape

((1000, 10000), (1000,))

Because X and y are sampled independently of each other, there is no real relationship between them, and the expected test error rate of any classifier should be around 50%.
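Independence holds only in expectation. With thousands of features, a few will correlate with the labels purely by chance, and those are the ones a univariate selector favours. The sketch below checks this with plain NumPy on a smaller data set than the one above (200 samples, 2,000 features) to keep it quick; the names `X_demo`, `y_demo` and `corr` are introduced here for illustration only.

```python
import numpy as np

rng = np.random.RandomState(0)
n_samples, n_features = 200, 2000  # assumed sizes, smaller than the exercise data
X_demo = rng.normal(size=(n_samples, n_features))
y_demo = rng.choice([0, 1], size=n_samples)

# Pearson correlation between every feature column and the labels
Xc = X_demo - X_demo.mean(axis=0)
yc = y_demo - y_demo.mean()
corr = (Xc * yc[:, None]).sum(axis=0) / (
    np.sqrt((Xc ** 2).sum(axis=0)) * np.sqrt((yc ** 2).sum())
)

mean_abs_corr = np.abs(corr).mean()
max_abs_corr = np.abs(corr).max()
print("mean |corr|: {:.3f}, max |corr|: {:.3f}".format(mean_abs_corr, max_abs_corr))
```

The typical correlation is close to zero, but the largest one is several times bigger, even though no feature carries real information about the labels.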

## Feature Selection

First, we select the most informative features (the top 5%) using SelectPercentile feature selection, and then we evaluate a LogisticRegression model using cross-validation.

In [3]:
from sklearn.feature_selection import SelectPercentile, f_regression
from sklearn.model_selection import train_test_split

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2)

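# keep the top 5% of features; note that the selector is fit on the
# whole training set, before any cross-validation split is made
select = SelectPercentile(score_func=f_regression, percentile=5).fit(X_train, y_train)
X_selected = select.transform(X_train)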

print("X_selected.shape: {}".format(X_selected.shape))

X_selected.shape: (800, 500)


## Cross Validation 

In [4]:
from sklearn.model_selection import cross_val_score
from sklearn.linear_model import LogisticRegression

mean_accuracy = np.mean(cross_val_score(LogisticRegression(), X_selected, y_train, cv=5))
print("Cross-validation accuracy: {:.2f}".format(mean_accuracy))

Cross-validation accuracy: 0.89


This looks like a decent validation accuracy. Let's try the model on our test set.

In [5]:
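lr = LogisticRegression()
# fit on the selected training features and score on the test set,
# reduced to the same features by the already-fitted selector
lr.fit(X_selected, y_train).score(select.transform(X_test), y_test)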

0.51

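The result is no better than random guessing (50%)! The cross-validation score was inflated because the features were selected using all of the training data, including the samples that later served as validation folds.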

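Now let's do a 'proper' cross-validation of our model, with the feature selection inside a pipeline so that it is refit on the training part of each fold: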

In [7]:
from sklearn.pipeline import Pipeline


pipeline = Pipeline([("select", SelectPercentile(score_func=f_regression, percentile=5)),
                     ("lr", LogisticRegression())])

mean_accuracy = np.mean(cross_val_score(pipeline, X_train, y_train, cv=5))
print("Cross-validation accuracy (pipeline): {:.2f}".format(mean_accuracy))


Cross-validation accuracy (pipeline): 0.49


This time round, the cross-validation accuracy gives a truer picture of the model's performance (about 50%).
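To make explicit what the pipeline changes, here is a minimal sketch of roughly what `cross_val_score` does with it: in every fold the selector is fit on the training part only and then applied to the held-out part. It uses a smaller synthetic data set than the one above so that it runs quickly. The sizes, `max_iter=1000` and all variable names are assumptions for this illustration.

```python
import numpy as np
from sklearn.feature_selection import SelectPercentile, f_regression
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import StratifiedKFold

rng = np.random.RandomState(0)
X_demo = rng.normal(size=(200, 2000))   # pure noise features
y_demo = rng.choice([0, 1], size=200)   # labels independent of X_demo

fold_scores = []
cv = StratifiedKFold(n_splits=5)
for train_idx, val_idx in cv.split(X_demo, y_demo):
    # fit the selector on the training fold only
    selector = SelectPercentile(score_func=f_regression, percentile=5)
    X_tr = selector.fit_transform(X_demo[train_idx], y_demo[train_idx])
    # the validation fold is only transformed, never used for selection
    X_val = selector.transform(X_demo[val_idx])

    model = LogisticRegression(max_iter=1000).fit(X_tr, y_demo[train_idx])
    fold_scores.append(model.score(X_val, y_demo[val_idx]))

manual_accuracy = np.mean(fold_scores)
print("Manual leak-free CV accuracy: {:.2f}".format(manual_accuracy))
```

Because the held-out samples never influence which features are kept, the averaged score should stay near chance level, as with the pipeline above.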