# Data Leakage

Data leakage makes a model look accurate during validation, but once you start making decisions with it, its predictions turn out to be very inaccurate.

There are two main types of leakage: **leaky predictors** and **leaky validation strategies**.

## Leaky predictors

This occurs when your predictors include data that will not be available at the time you make predictions.

For example, imagine you want to predict who will get sick with pneumonia. The top few rows of your raw data might look like this:

|got_pneumonia| age | weight | male | took_antibiotic_medicine | ...|
| ---         | --  | ---    |  --- |      ----                | -- |
|False|65|100|False|False|...|
|False|72|130|True|False|...|
|True|58|100|False|True|...|
|...|...|...|...|...|...|

People take antibiotic medicines after getting pneumonia in order to recover, so the raw data shows a strong relationship between those columns. But `took_antibiotic_medicine` is frequently changed **after** the value for `got_pneumonia` is determined. This is **target leakage**.

The model would see that anyone who has a value of `False` for `took_antibiotic_medicine` didn't have pneumonia. Validation data comes from the same source, so the pattern will repeat itself in validation, and the model will have great validation (or cross-validation) scores. But the model will be very inaccurate when subsequently deployed in the real world, because patients who are about to get pneumonia will not yet have taken antibiotics at the time we need the prediction.
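The effect described above can be reproduced on made-up data. The following is a minimal sketch, not a real medical dataset: the sample size, the probabilities and the helper `mean_cv_accuracy` are all assumptions chosen for illustration. It compares cross-validation accuracy with and without the leaky column.

```python
import numpy as np
import pandas as pd
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import cross_val_score

rng = np.random.default_rng(0)
n_patients = 500

# Synthetic data, invented for illustration only.
age = rng.integers(20, 90, size=n_patients)
weight = rng.normal(110, 20, size=n_patients)
got_pneumonia = rng.random(n_patients) < (0.1 + 0.3 * (age > 65))

# Assumption: every pneumonia patient later takes antibiotics, and about 5%
# of the other patients take them for unrelated reasons.
took_antibiotic_medicine = got_pneumonia | (rng.random(n_patients) < 0.05)

X_synthetic = pd.DataFrame({
    "age": age,
    "weight": weight,
    "took_antibiotic_medicine": took_antibiotic_medicine.astype(int),
})


def mean_cv_accuracy(features):
    """Mean 5-fold cross-validation accuracy of a small random forest."""
    model = RandomForestClassifier(n_estimators=50, random_state=0)
    scores = cross_val_score(model, features, got_pneumonia, cv=5, scoring="accuracy")
    return scores.mean()


leaky_score = mean_cv_accuracy(X_synthetic)
clean_score = mean_cv_accuracy(X_synthetic.drop(columns=["took_antibiotic_medicine"]))

print("With the leaky column:    {:.2f}".format(leaky_score))
print("Without the leaky column: {:.2f}".format(clean_score))
```

The score with the leaky column looks much better, but it relies on information that only exists after the outcome is known, so it says nothing about how the model would do on new patients.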

### Preventing leaky predictors
There is no single solution that universally prevents leaky predictors. It requires knowledge about your data, case-specific inspection and common sense.

However, leaky predictors frequently have high statistical correlations to the target. So keep two tactics in mind:

* To screen for possible leaky predictors, look for columns that are statistically correlated to your target.
* If you build a model and find it extremely accurate, you likely have a leakage problem.
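The first tactic could be sketched as follows. The helper name `rank_by_target_correlation` and the toy rows are my own assumptions, not part of any library or real dataset. A high correlation is only a reason to take a closer look at a column; a legitimate predictor can be strongly correlated with the target too.

```python
import pandas as pd


def rank_by_target_correlation(df, target):
    """Absolute correlation of each numeric/boolean column with the target, highest first."""
    numeric = df.select_dtypes(include=["number", "bool"]).astype(float)
    correlations = numeric.drop(columns=[target]).corrwith(numeric[target]).abs()
    return correlations.sort_values(ascending=False)


# Tiny made-up sample in the shape of the pneumonia table above.
toy = pd.DataFrame({
    "got_pneumonia": [False, False, True, False, True, True, False, True],
    "age": [65, 72, 58, 40, 61, 35, 50, 77],
    "weight": [100, 130, 100, 90, 120, 110, 95, 105],
    "took_antibiotic_medicine": [False, False, True, False, True, True, False, True],
})

ranking = rank_by_target_correlation(toy, "got_pneumonia")
print(ranking)
```

In this toy sample `took_antibiotic_medicine` matches the target exactly, so it lands at the top of the ranking and is the first column to investigate.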

**To prevent this type of data leakage, any variable updated (or created) after the target value is realized should be excluded**, because that data won't be available to the model when we use it to make new predictions.

## Leaky validation strategy
A much different type of leak occurs when you aren't careful to distinguish training data from validation data. For example, this happens if you run preprocessing (like fitting an imputer for missing values) before calling `train_test_split`. Validation is meant to measure how the model does on data it hasn't seen before. You can corrupt this process in subtle ways if the validation data affects the preprocessing behavior. The end result? Your model will get very good validation scores, giving you great confidence in it, but perform poorly when you deploy it to make decisions.

### Preventing leaky validation strategies
If your validation is based on a simple train-test split, exclude the validation data from any type of fitting, including the fitting of preprocessing steps.
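One way to follow this rule is to put the preprocessing and the model into a single scikit-learn pipeline, so the imputer is only ever fitted on training rows. Here is a minimal sketch under stated assumptions: the data is random and invented for illustration, and `SimpleImputer` with a mean strategy is just one possible preprocessing step.

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.impute import SimpleImputer
from sklearn.model_selection import cross_val_score, train_test_split
from sklearn.pipeline import make_pipeline

# Made-up data with roughly 10% of the values missing.
rng = np.random.default_rng(0)
X_demo = rng.normal(size=(200, 3))
y_demo = (X_demo[:, 0] + X_demo[:, 1] > 0).astype(int)
X_demo[rng.random(X_demo.shape) < 0.1] = np.nan

# Split first, before anything is fitted.
X_train, X_valid, y_train, y_valid = train_test_split(X_demo, y_demo, random_state=0)

pipeline = make_pipeline(
    SimpleImputer(strategy="mean"),
    RandomForestClassifier(n_estimators=50, random_state=0),
)

# The imputer learns its column means from X_train only.
pipeline.fit(X_train, y_train)
valid_accuracy = pipeline.score(X_valid, y_valid)

# With cross-validation, the whole pipeline is refitted on each training fold.
pipeline_cv_scores = cross_val_score(pipeline, X_demo, y_demo, cv=5, scoring="accuracy")

print("Validation accuracy: {:.2f}".format(valid_accuracy))
print("Cross-val accuracy: {:.2f}".format(pipeline_cv_scores.mean()))
```

Passing the pipeline (rather than a bare model on already imputed data) to `cross_val_score` keeps the validation fold out of the preprocessing fit in every split.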

## Example
I'll run an example with a simple [credit card dataset](https://www.kaggle.com/like1008/aer-credit-card-datacsv).

In [1]:
import pandas as pd

df = pd.read_csv('data/AER_credit_card_data.csv', true_values=['yes'], false_values=['no'])
df.head()

| | card | reports | age | income | share | expenditure | owner | selfemp | dependents | months | majorcards | active |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
|0|True|0|37.66667|4.52|0.03327|124.9833|True|False|3|54|1|12|
|1|True|0|33.25|2.42|0.005217|9.854167|False|False|3|34|1|13|
|2|True|0|33.66667|4.5|0.004156|15.0|True|False|4|58|1|5|
|3|True|0|30.5|2.54|0.065214|137.8692|False|False|0|25|1|7|
|4|True|0|32.16667|9.7867|0.067051|546.5033|True|False|2|64|1|5|


In [2]:
df.shape

(1319, 12)

The dataset is fairly small, so we may want to use cross-validation.

Here is a summary of the data:

* card: Dummy variable, 1 if application for credit card accepted, 0 if not
* reports: Number of major derogatory reports
* age: Age in years plus twelfths of a year
* income: Yearly income (divided by 10,000)
* share: Ratio of monthly credit card expenditure to yearly income
* expenditure: Average monthly credit card expenditure
* owner: 1 if owns their home, 0 if rent
* selfemp: 1 if self-employed, 0 if not
* dependents: 1 + number of dependents
* months: Months living at current address
* majorcards: Number of major credit cards held
* active: Number of active credit accounts

Let's create a model and fit it on the data _as it is_:

In [3]:
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import cross_val_score

y = df.card
X = df.drop(['card'], axis=1)

leaky_model = RandomForestClassifier()
cv_scores = cross_val_score(leaky_model, X, y, scoring='accuracy')
print("Cross-val accuracy: {}".format(cv_scores.mean()))

Cross-val accuracy: 0.9802840477669634




The accuracy for this simple model is surprisingly high which, as discussed above, is a warning sign of leakage.

There are a couple of variables in this dataset that look suspicious. For example, **expenditure**: does it mean expenditure on this card, or on cards used before applying? _The first case would be a case of data leakage_.

Let's do some data comparison:

In [4]:
expenditures_cardholders = df.expenditure[df.card]
expenditures_noncardholders = df.expenditure[~df.card]

print('Fraction of those who received a card and had no expenditures: %.2f' \
      %(( expenditures_cardholders == 0).mean()))
print('Fraction of those who didn\'t receive a card with no expenditures: %.2f' \
      %((expenditures_noncardholders == 0).mean()))

Fraction of those who received a card and had no expenditures: 0.02
Fraction of those who didn't receive a card with no expenditures: 1.00


Everyone with `card == False` had no expenditures, while only `2%` of those with `card == True` had no expenditures. It's not surprising that our model appeared to have high accuracy. But this looks like a data leak, where expenditures probably means *expenditures on the card they applied for*.

Since `share` is partially determined by `expenditure`, it should be excluded too. The variables `active` and `majorcards` are a little less clear, but from the description, they sound concerning. In most situations, it's better to be safe than sorry if you can't track down the people who created the data to find out more.

We would run a model without leakage as follows:

In [5]:
potential_leaks = ['expenditure', 'share', 'active', 'majorcards']

non_leaky_model = RandomForestClassifier()
X2 = X.drop(potential_leaks, axis=1)
cv_scores = cross_val_score(non_leaky_model, X2, y, scoring='accuracy')
print("Cross-val accuracy: {}".format(cv_scores.mean()))

Cross-val accuracy: 0.8051580727548838




This accuracy is quite a bit lower, which on the one hand is disappointing. However, we can expect it to be right about `80%` of the time when used on new applications, whereas the leaky model would likely do much worse than that (in spite of its higher apparent score in cross-validation).

## Conclusion

Careful separation of training and validation data is a first step. Leaky predictors are a more frequent issue and are harder to track down. A combination of caution, common sense and data exploration can help you identify leaky predictors so you can remove them from your model.