# Introduction

In [None]:
"""
What? Illustrating information leakage in cross-validation (CV)
"""

# Import libraries/modules

In [11]:
import numpy as np
from sklearn.feature_selection import SelectPercentile, f_regression
from sklearn.model_selection import cross_val_score
from sklearn.linear_model import Ridge
from sklearn.pipeline import Pipeline

# Create a dataset

In [None]:
"""
Let’s consider a synthetic regression task with 100 samples and 10,000 features that are sampled independently 
from a Gaussian distribution. We also sample the response from a Gaussian distribution:
"""

In [3]:
rnd = np.random.RandomState(seed=0)
X = rnd.normal(size=(100, 10000))
y = rnd.normal(size=(100,))

In [None]:
"""
Given the way we created the dataset, there is no relation between the data, X, and the target, y (they are 
independent), so it should not be possible to learn anything from this dataset. HOWEVER, if we are not careful 
about where the preprocessing is fit, cross-validation can report a good score anyway.
"""

# Leaking info in CV - WRONG APPROACH

In [None]:
"""
First, we select the most informative of the 10,000 features using SelectPercentile feature selection (keeping 
the top 5%), and then we evaluate a Ridge regressor using cross-validation:
"""

In [5]:
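# WRONG: the selector is fit on all 100 samples, including those that will
# later serve as test folds during cross-validation
select = SelectPercentile(score_func=f_regression, percentile=5).fit(X, y)
X_selected = select.transform(X)
print("X_selected.shape: {}".format(X_selected.shape))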

X_selected.shape: (100, 500)


In [7]:
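# Only Ridge is cross-validated; the features were already chosen using all of y.
# For regressors, cross_val_score reports R^2 by default (not classification accuracy).
print("Cross-validation accuracy (cv only on ridge): {:.2f}".format(
    np.mean(cross_val_score(Ridge(), X_selected, y, cv=5))))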

Cross-validation accuracy (cv only on ridge): 0.91


# What has gone wrong?

In [None]:
"""
The mean R^2 computed by cross-validation is 0.91, indicating a very good model. This clearly cannot be right, as 
our data is entirely random. What happened here is that our feature selection picked out some features among the 
10,000 random features that are (by chance) very well correlated with the target. Because we fit the feature 
selection outside of the cross-validation, on all 100 samples, it could find features that are correlated with 
the target on both the training and the test folds. The information we leaked from the test folds was very 
informative, leading to highly unrealistic results.
"""

# No leakage - RIGHT APPROACH

In [12]:
# Feature selection and Ridge are bundled, so both are refit on each training fold
pipe = Pipeline([
    ("select", SelectPercentile(score_func=f_regression, percentile=5)),
    ("ridge", Ridge()),
])
print("Cross-validation accuracy (pipeline): {:.2f}".format(
    np.mean(cross_val_score(pipe, X, y, cv=5))))

Cross-validation accuracy (pipeline): -0.25


# Why has this approach shown the correct result?

In [None]:
"""
This time, we get a negative R^2 score, indicating a very poor model. Using the pipeline, the feature selection 
is now inside the cross-validation loop. This means features can only be selected using the training folds of the 
data, not the test fold. The feature selection finds features that are correlated with the target on the training
set, but because the data is entirely random, these features are not correlated with the target on the test set.
"""