# Data leakage

Data leakage occurs when the training data contains information about the target that will not be available when the model is used for prediction. As a result, the model performs well on the training set, and possibly even on the validation data, but performs poorly in production.

There are two main types of leakage: **target leakage** and **train-test contamination**.

### Target leakage

Target leakage occurs when some of the predictor variables were changed or collected *after* the target variable was determined. 

In that way, these predictor variables can be *dependent* on the target variable. When the model is put into production, those predictor variables will not yet have been created or updated based on the target variable. Because our model was trained on data from the wrong point in time, it will make poor predictions from those predictor variables.

The way to prevent this leakage is to not include predictor variables that have been created or updated after the target value is realized. 

![](images/leakage.png)
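One way to apply this rule, sketched here under the assumption that you know when each predictor's value is recorded, is to keep only the columns recorded before the target was realized. The names `feature_recorded_at`, `target_realized_at`, and `features_known_before`, as well as all dates, are hypothetical; in practice this metadata often has to be reconstructed from the data documentation.

```python
import pandas as pd

# Hypothetical metadata: the date on which each predictor's value is recorded.
feature_recorded_at = pd.Series({
    "age": pd.Timestamp("2024-01-01"),
    "income": pd.Timestamp("2024-01-01"),
    "expenditure": pd.Timestamp("2024-03-01"),
})

# Hypothetical date on which the target value was realized.
target_realized_at = pd.Timestamp("2024-02-01")


def features_known_before(recorded_at, cutoff):
    """Return the names of predictors recorded strictly before the cutoff."""
    return recorded_at[recorded_at < cutoff].index.tolist()


kept = features_known_before(feature_recorded_at, target_realized_at)
print(kept)
```

Anything recorded on or after the cutoff is dropped, which errs on the side of excluding a predictor when its timing is ambiguous.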

### Train-test contamination

Train-test contamination occurs when the validation data is included in any of the preprocessing or model fitting steps. For example, if you impute missing values before doing the train-test split, then the imputed values in the training data, and thus the training data itself, are based in part on the validation data. In that way, the training data is partly contaminated by the validation data.

The validation data needs to be excluded from *any* type of fitting or preprocessing. 
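A minimal sketch of the safe order of operations, using synthetic data and scikit-learn's `SimpleImputer`: split first, then fit the imputer on the training rows only and reuse its statistics on the validation rows. The arrays `X_demo` and `y_demo` are made up for illustration.

```python
import numpy as np
from sklearn.impute import SimpleImputer
from sklearn.model_selection import train_test_split

# Synthetic data with roughly 10% of the values missing (illustration only).
rng = np.random.default_rng(0)
X_demo = rng.normal(size=(100, 3))
X_demo[rng.random(X_demo.shape) < 0.1] = np.nan
y_demo = rng.integers(0, 2, size=100)

# Split BEFORE any preprocessing is fitted.
X_train, X_valid, y_train, y_valid = train_test_split(
    X_demo, y_demo, test_size=0.25, random_state=0)

imputer = SimpleImputer(strategy="mean")
X_train_imp = imputer.fit_transform(X_train)  # column means come from training rows only
X_valid_imp = imputer.transform(X_valid)      # validation rows reuse the training means
```

Putting the imputer and the model in one pipeline and passing that pipeline to `cross_val_score` applies the same discipline within each fold, because the whole pipeline is refit on each training fold.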

```python
import pandas as pd

data = pd.read_csv('../data/AER_credit_card_data.csv',
                   true_values=['yes'], false_values=['no'])

# Select target
y = data.card

# Select predictors
X = data.drop(['card'], axis=1)

print('Number of rows in the dataset:', X.shape[0])
X.head()
```

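```
Number of rows in the dataset: 1319
```

| | reports | age | income | share | expenditure | owner | selfemp | dependents | months | majorcards | active |
|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 0 | 37.66667 | 4.52 | 0.03327 | 124.9833 | True | False | 3 | 54 | 1 | 12 |
| 1 | 0 | 33.25 | 2.42 | 0.005217 | 9.854167 | False | False | 3 | 34 | 1 | 13 |
| 2 | 0 | 33.66667 | 4.5 | 0.004156 | 15.0 | True | False | 4 | 58 | 1 | 5 |
| 3 | 0 | 30.5 | 2.54 | 0.065214 | 137.8692 | False | False | 0 | 25 | 1 | 7 |
| 4 | 0 | 32.16667 | 9.7867 | 0.067051 | 546.5033 | True | False | 2 | 64 | 1 | 5 |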


Because it is a small dataset, we use cross-validation to ensure accurate measures of model quality. 

```python
from sklearn.pipeline import make_pipeline
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import cross_val_score

# A pipeline is not strictly needed because there is no preprocessing,
# but using one is best practice
my_pipeline = make_pipeline(RandomForestClassifier(n_estimators=100))
# Use accuracy because the target is boolean (a classification problem):
# https://scikit-learn.org/stable/modules/model_evaluation.html#accuracy-score
cv_scores = cross_val_score(my_pipeline, X, y, cv=5, scoring='accuracy')
print('Cross-validation\nAccuracy: {}, \nSt. dev: {}'.format(cv_scores.mean(), cv_scores.std()))
```

```
Cross-validation
Accuracy: 0.9810403772280519, 
St. dev: 0.006798571041005265
```


This is a very high accuracy, and so we might suspect target leakage. 

```python
data
```

| | card | reports | age | income | share | expenditure | owner | selfemp | dependents | months | majorcards | active |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | True | 0 | 37.66667 | 4.5200 | 0.033270 | 124.983300 | True | False | 3 | 54 | 1 | 12 |
| 1 | True | 0 | 33.25000 | 2.4200 | 0.005217 | 9.854167 | False | False | 3 | 34 | 1 | 13 |
| 2 | True | 0 | 33.66667 | 4.5000 | 0.004156 | 15.000000 | True | False | 4 | 58 | 1 | 5 |
| 3 | True | 0 | 30.50000 | 2.5400 | 0.065214 | 137.869200 | False | False | 0 | 25 | 1 | 7 |
| 4 | True | 0 | 32.16667 | 9.7867 | 0.067051 | 546.503300 | True | False | 2 | 64 | 1 | 5 |
| ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
| 1314 | True | 0 | 33.58333 | 4.5660 | 0.002146 | 7.333333 | True | False | 0 | 94 | 1 | 19 |
| 1315 | False | 5 | 23.91667 | 3.1920 | 0.000376 | 0.000000 | False | False | 3 | 12 | 1 | 5 |
| 1316 | True | 0 | 40.58333 | 4.6000 | 0.026513 | 101.298300 | True | False | 2 | 1 | 1 | 2 |
| 1317 | True | 0 | 32.83333 | 3.7000 | 0.008999 | 26.996670 | False | True | 0 | 60 | 1 | 7 |


```python
# y is a boolean Series: True for applicants who received a card
expenditures_cardholders = X.expenditure[y]
expenditures_noncardholders = X.expenditure[~y]
```

```python
print('Fraction of those who did not receive a card and had expenditures: %.2f'
      % (expenditures_noncardholders > 0).mean())
print('Fraction of those who received a card and had expenditures: %.2f'
      % (expenditures_cardholders > 0).mean())
```

```
Fraction of those who did not receive a card and had expenditures: 0.00
Fraction of those who received a card and had expenditures: 0.98
```


Here, `expenditure` appears to mean spending on the card being applied for, so only applicants who received a card could have any expenditures. Its value is determined *after* the target is realized, so this is a case of target leakage.

`share` (the ratio of monthly credit card expenditure to yearly income) is partially determined by `expenditure`, so it should be excluded too. 

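Two more predictors are less clear-cut but still seem concerning, because either count could include the card being applied for: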

* `majorcards`: Number of major credit cards held
* `active`: Number of active credit accounts
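The check applied to `expenditure` above can be extended to every numeric predictor at once, as a rough screen for columns that behave very differently across target classes. The sketch below uses a toy frame with made-up values as a stand-in for the credit card data, and the helper name `positive_rate_by_target` is hypothetical.

```python
import pandas as pd

# Toy stand-in for the credit card data (values are made up for illustration).
toy = pd.DataFrame({
    "card":        [True, True, True, False, False, False],
    "expenditure": [120.0, 15.0, 0.0, 0.0, 0.0, 0.0],
    "reports":     [0, 0, 1, 3, 0, 5],
})


def positive_rate_by_target(df, target):
    """Fraction of rows with a value > 0, per numeric predictor and target class."""
    predictors = df.drop(columns=[target]).select_dtypes("number")
    return (predictors > 0).groupby(df[target]).mean()


rates = positive_rate_by_target(toy, "card")
print(rates)
```

A predictor that is positive for nearly every row in one class and for none in the other, as `expenditure` is here, deserves a closer look at how and when it was recorded. A large gap alone does not prove leakage, since a legitimate predictor can also separate the classes well.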



```python
# Drop leaky predictors
potential_leaks = ['expenditure', 'share', 'majorcards', 'active']

X2 = X.drop(potential_leaks, axis=1)

cv_scores = cross_val_score(my_pipeline, X2, y, cv=5, scoring='accuracy')
print('Cross-validation\nAccuracy: {}, \nSt. dev: {}'.format(cv_scores.mean(), cv_scores.std()))
```

```
Cross-validation
Accuracy: 0.8316731994599846, 
St. dev: 0.011151614678995284
```
