# Introduction to this Python notebook

In [1]:
"""
What? Data leakage and how to avoid it.

Naive Data Preparation
----------------------
[1] Prepare Dataset
[2] Split Data
[3] Evaluate Models

Correct Procedure
-----------------
[1] Split Data
[2] Fit Data Preparation on Training Dataset
[3] Apply Data Preparation to Train and Test Datasets
[4] Evaluate Models

Examples are as follows:
[1] Train-Test Evaluation With Naive Data Preparation         -> WRONG
[2] Train-Test Evaluation With Correct Data Preparation       -> RIGHT 
[3] Cross-Validation Evaluation With Naive Data Preparation   -> WRONG
[4] Cross-Validation Evaluation With Correct Data Preparation -> RIGHT

https://machinelearningmastery.com/data-preparation-without-data-leakage/
"""
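The difference between the two procedures can be seen on a tiny example before touching the real dataset. This is a minimal sketch; the arrays `train` and `test` are hypothetical values made up for illustration and are not part of the notebook's data.

```python
import numpy as np
from sklearn.preprocessing import MinMaxScaler

# hypothetical one-feature data: three training rows and one test row
train = np.array([[0.0], [5.0], [10.0]])
test = np.array([[12.0]])

# naive: fit on train and test together, so the test value defines the max
naive = MinMaxScaler().fit(np.vstack([train, test]))
train_naive = naive.transform(train)

# correct: fit on the training rows only, then apply to both sets
correct = MinMaxScaler().fit(train)
train_correct = correct.transform(train)
test_correct = correct.transform(test)

print('train (naive):  ', train_naive.ravel())
print('train (correct):', train_correct.ravel())
print('test  (correct):', test_correct.ravel())
```

With the naive fit, the training value 10 maps to about 0.833 because the test row set the maximum. With the correct fit it maps to 1.0, and the unseen test row lands at about 1.2, outside the range seen during training.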

# Import Python modules

In [4]:
from numpy import mean, std
from sklearn.pipeline import Pipeline
from sklearn.metrics import accuracy_score
from IPython.display import Markdown, display
from sklearn.preprocessing import MinMaxScaler
from sklearn.datasets import make_classification
from sklearn.model_selection import cross_val_score
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.model_selection import RepeatedStratifiedKFold

# Creation of a synthetic dataset

In [7]:
# define dataset
X, y = make_classification(n_samples=1000, n_features=20, n_informative=15, n_redundant=5, random_state=7)
# summarize the dataset
print(X.shape, y.shape)

(1000, 20) (1000,)


# [1] Train-Test Evaluation With Naive Data Preparation -> WRONG

In [None]:
"""
Naive approach: normalize the whole dataset BEFORE splitting it and evaluating
the model. The scaler sees the test rows when it computes each feature's min
and max, so information leaks from the test set into training and the resulting
estimate of model accuracy cannot be trusted.
"""

In [8]:
# [WRONG SERIAL STEP] normalize the whole dataset before splitting -> This is the MISTAKE
scaler = MinMaxScaler()
# keep the result in a new variable so the raw X stays available for later cells
X_scaled = scaler.fit_transform(X)
# split into train and test sets
X_train, X_test, y_train, y_test = train_test_split(X_scaled, y, test_size=0.33, random_state=1)

# Fit the model
model = LogisticRegression()
model.fit(X_train, y_train)
# evaluate the model
yhat = model.predict(X_test)
# evaluate predictions
accuracy = accuracy_score(y_test, yhat)
print('[WRONG] Accuracy: %.3f' % (accuracy*100))

[WRONG] Accuracy: 84.848
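To see what actually leaks in the naive approach, one can compare the statistics of a scaler fitted on all rows with one fitted on the training rows only. This is a sketch that rebuilds the same dataset and split as above; the names `X_raw`, `leaky_scaler` and `clean_scaler` are introduced here for illustration only.

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import MinMaxScaler

# same synthetic dataset and split as in the notebook
X_raw, y = make_classification(n_samples=1000, n_features=20, n_informative=15,
                               n_redundant=5, random_state=7)
X_train, X_test, y_train, y_test = train_test_split(X_raw, y, test_size=0.33, random_state=1)

# scaler fitted on ALL rows (naive) vs. on the training rows only (correct)
leaky_scaler = MinMaxScaler().fit(X_raw)
clean_scaler = MinMaxScaler().fit(X_train)

# features whose global min or max comes from a test row: these statistics leak
leaked_min = leaky_scaler.data_min_ < clean_scaler.data_min_
leaked_max = leaky_scaler.data_max_ > clean_scaler.data_max_
print('features with a leaked min:', int(leaked_min.sum()))
print('features with a leaked max:', int(leaked_max.sum()))

# with the training-only scaler, test values may fall outside [0, 1]
X_test_clean = clean_scaler.transform(X_test)
print('test range (train-only scaler): %.3f to %.3f' % (X_test_clean.min(), X_test_clean.max()))
```

Any feature counted here has its scaling range set by a row the model is later evaluated on, which is exactly the information the test set is supposed to withhold.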


# [2] Train-Test Evaluation With Correct Data Preparation -> RIGHT

In [None]:
"""
The correct approach to data preparation with a train-test split evaluation
is to FIRST split the data, THEN fit the data preparation on the training set
only, and FINALLY apply the fitted transform to both the train and test sets.

In this case, the estimated accuracy is about 85.455 percent, slightly higher
than the 84.848 percent obtained with data leakage in the previous section.
We would normally expect the leaky estimate to be the OPTIMISTIC (higher) one,
which is not what happens here. Either way, the two estimates differ, and only
this one is free of information from the test set.
"""

In [9]:
# [CORRECT SERIAL STEP] split into train and test sets
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.33, random_state=1)
# define the scaler
scaler = MinMaxScaler()
# fit on the training dataset
scaler.fit(X_train)
# scale the training dataset
X_train = scaler.transform(X_train)
# scale the test dataset
X_test = scaler.transform(X_test)

# Fit the model
model = LogisticRegression()
model.fit(X_train, y_train)
# evaluate the model
yhat = model.predict(X_test)
# evaluate predictions
accuracy = accuracy_score(y_test, yhat) 
print('[CORRECT] Accuracy: %.3f' % (accuracy*100))



[CORRECT] Accuracy: 85.455
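The fit-on-train, transform-both steps above can also be delegated to a `Pipeline`, which makes it harder to fit the scaler on the wrong rows by accident. The following is a sketch under the same dataset and split settings as the notebook; `X_raw` is a name introduced here for the unscaled data.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import MinMaxScaler

# same synthetic dataset and split as in the notebook
X_raw, y = make_classification(n_samples=1000, n_features=20, n_informative=15,
                               n_redundant=5, random_state=7)
X_train, X_test, y_train, y_test = train_test_split(X_raw, y, test_size=0.33, random_state=1)

# fit() fits the scaler on X_train only, then fits the model on the scaled X_train
pipeline = Pipeline(steps=[('scaler', MinMaxScaler()), ('model', LogisticRegression())])
pipeline.fit(X_train, y_train)

# predict() reuses the training-set scaler on X_test before calling the model
yhat = pipeline.predict(X_test)
accuracy = accuracy_score(y_test, yhat)
print('[PIPELINE] Accuracy: %.3f' % (accuracy * 100))
```

Because the pipeline performs the same operations in the same order, its predictions should match those of the manual version in this section.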


# [3] Cross-Validation Evaluation With Naive Data Preparation -> WRONG

In [None]:
"""
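Naive approach with cross-validation: the whole dataset is normalized first
and only then passed to k-fold cross-validation, so every held-out fold has
already influenced the scaler.

In this case, the model achieved an estimated accuracy of about 85.300
percent, which we cannot trust given the data leakage allowed by this data
preparation procedure.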
"""

In [10]:
# [WRONG SERIAL STEP] normalize the whole dataset before cross-validation
scaler = MinMaxScaler()
# keep the result in a new variable so the raw X stays available for later cells
X_scaled = scaler.fit_transform(X)

# define the model
model = LogisticRegression()
# define the evaluation procedure
cv = RepeatedStratifiedKFold(n_splits=10, n_repeats=3, random_state=1)
# evaluate the model using cross-validation
scores = cross_val_score(model, X_scaled, y, scoring='accuracy', cv=cv, n_jobs=-1)
# Report performance using mean and standard deviation on all the folds
print('[WRONG] Accuracy: %.3f (%.3f)' % (mean(scores)*100, std(scores)*100))

[WRONG] Accuracy: 85.300 (3.607)


# [4] Cross-Validation Evaluation With Correct Data Preparation -> RIGHT

In [None]:
"""
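Correct approach with cross-validation: a Pipeline refits the scaler on the
training folds of every split and only applies it to the held-out fold.

We generally expect the estimate WITHOUT leakage to be lower, because leakage
tends to produce an OPTIMISTIC prediction. That is not guaranteed: leakage may
or may not improve the score, and here the correct estimate (85.433 percent)
is slightly higher than the leaky one (85.300 percent). NEVERTHELESS, the
point is that leakage does affect the estimate and we should be aware of it.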
"""

In [11]:
# define the pipeline
steps = [('scaler', MinMaxScaler()), ('model', LogisticRegression())]
pipeline = Pipeline(steps=steps)
# define the evaluation procedure
cv = RepeatedStratifiedKFold(n_splits=10, n_repeats=3, random_state=1)
# evaluate the model using cross-validation
scores = cross_val_score(pipeline, X, y, scoring='accuracy', cv=cv, n_jobs=-1) 
# Report performance using mean and standard deviation on all the folds
print('[CORRECT] Accuracy: %.3f (%.3f)' % (mean(scores)*100, std(scores)*100))

[CORRECT] Accuracy: 85.433 (3.471)
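What the pipeline does inside `cross_val_score` can be written out as an explicit loop, which may help when checking that no fold leaks into the scaler. This is a sketch assuming the same dataset and cross-validation settings as above; `X_raw`, `train_idx` and `test_idx` are names introduced here.

```python
from numpy import mean, std
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score
from sklearn.model_selection import RepeatedStratifiedKFold
from sklearn.preprocessing import MinMaxScaler

# same synthetic dataset and evaluation procedure as in the notebook
X_raw, y = make_classification(n_samples=1000, n_features=20, n_informative=15,
                               n_redundant=5, random_state=7)
cv = RepeatedStratifiedKFold(n_splits=10, n_repeats=3, random_state=1)

scores = []
for train_idx, test_idx in cv.split(X_raw, y):
    # fit the scaler on the training folds only
    scaler = MinMaxScaler().fit(X_raw[train_idx])
    # fit the model on the scaled training folds
    model = LogisticRegression()
    model.fit(scaler.transform(X_raw[train_idx]), y[train_idx])
    # apply the SAME scaler to the held-out fold and score it
    yhat = model.predict(scaler.transform(X_raw[test_idx]))
    scores.append(accuracy_score(y[test_idx], yhat))

print('[MANUAL] Accuracy: %.3f (%.3f)' % (mean(scores) * 100, std(scores) * 100))
```

The loop produces one score per split (10 folds times 3 repeats), and the scaler never sees the rows it is later evaluated on.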
