## Intermediate Machine Learning - Kaggle

https://www.kaggle.com/learn/intermediate-machine-learning

### Data Leakage

https://www.kaggle.com/code/alexisbcook/data-leakage

#### Introduction
What is data leakage? 
Data leakage happens when the training data contains information about the target, but similar data will not be available when the model is used for prediction. This results in high performance on the training set (and possibly on the validation data too), but poor performance in production.

Two types of leakage: 
1. target leakage
2. train-test contamination

#### Target Leakage
**Target leakage** happens when predictors include data that will not be available at the time predictions are made. It helps to think about it in terms of the *timing or chronological order* in which data becomes available, not merely whether a feature helps make good predictions.

Example of trying to predict who will get sick with pneumonia: 

| got_pneumonia | age | weight | male | took_antibiotic_medicine | ... |
| --- | --- | --- | --- | --- | --- |
| False | 65 | 100 | False | False | ... |
| False | 72 | 130 | True | False | ... | 
| True | 58 | 100 | False | True | ... |

People take antibiotics *after* getting pneumonia in order to recover. The raw data shows a strong relationship between those two columns, but `took_antibiotic_medicine` is frequently changed *after* the value for `got_pneumonia` is determined. This is **target leakage**. A model trained on this data will be very inaccurate when deployed, because patients who will get pneumonia will not yet have received antibiotics at the time predictions about their future health need to be made.

To prevent this type of data leakage, any variable updated (or created) after the target value is realized should be excluded. 
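
As a minimal sketch of this rule, assume the example rows above are loaded into a pandas DataFrame. The names `patients`, `leaky_columns`, and `X_safe` are illustrative and not from the course, and deciding which columns are leaky requires knowledge of how the data was collected:

```python
import pandas as pd

# Toy copy of the three example rows shown above (illustrative only).
patients = pd.DataFrame({
    "got_pneumonia": [False, False, True],
    "age": [65, 72, 58],
    "weight": [100, 130, 100],
    "male": [False, True, False],
    "took_antibiotic_medicine": [False, False, True],
})

# Columns whose values are updated after the target is realized.
# This list is an assumption based on the discussion above.
leaky_columns = ["took_antibiotic_medicine"]

# Separate the target, then drop both the target and the leaky columns.
y_patients = patients["got_pneumonia"]
X_safe = patients.drop(columns=["got_pneumonia"] + leaky_columns)

print(list(X_safe.columns))
```

Only `age`, `weight`, and `male` remain as predictors, all of which are known before the target is determined.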

#### Train-Test Contamination
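Validation is meant to measure how the model does on data it has not seen before. This process can be corrupted in subtle ways if the validation data affects the preprocessing behavior, which is called **train-test contamination**. For example, fitting an imputer for missing values on the full dataset before splitting it lets the validation rows influence the preprocessing, so validation scores can look better than the model's performance on new data.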
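
As a rough illustration of train-test contamination, the sketch below contrasts fitting a preprocessing step on all rows before cross-validation with fitting it inside a pipeline. It uses scikit-learn's `SimpleImputer`, `make_pipeline`, and `cross_val_score`; the synthetic data and all variable names are made up for this example:

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.impute import SimpleImputer
from sklearn.model_selection import cross_val_score
from sklearn.pipeline import make_pipeline

# Synthetic data with roughly 10% missing values (illustrative only).
rng = np.random.default_rng(0)
X_demo = rng.normal(size=(200, 3))
y_demo = (X_demo[:, 0] + X_demo[:, 1] > 0).astype(int)
X_demo[rng.random(X_demo.shape) < 0.1] = np.nan

# Contaminated: the imputer is fit on every row, including rows that
# later serve as validation data in each fold.
X_imputed_all = SimpleImputer(strategy="mean").fit_transform(X_demo)
contaminated_scores = cross_val_score(
    RandomForestClassifier(n_estimators=50, random_state=0),
    X_imputed_all, y_demo, cv=5, scoring="accuracy")

# Clean: the imputer is a pipeline step, so cross_val_score refits it
# on the training portion of each fold only.
clean_pipeline = make_pipeline(
    SimpleImputer(strategy="mean"),
    RandomForestClassifier(n_estimators=50, random_state=0))
clean_scores = cross_val_score(
    clean_pipeline, X_demo, y_demo, cv=5, scoring="accuracy")

print("Contaminated CV accuracy:", contaminated_scores.mean())
print("Pipeline CV accuracy:", clean_scores.mean())
```

With simple mean imputation the two scores may be close. The point is structural: only the pipeline version keeps validation rows out of the preprocessing, so only its estimate is free of this kind of contamination.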

#### Example
This example shows one way to detect and remove target leakage.

We will use a dataset about credit card applications and skip the basic data set-up code. The end result is that information about each credit card application is stored in a DataFrame `X`. We'll use it to predict which applications were accepted, stored in a Series `y`.





```python
import pandas as pd

# Read the data, mapping 'yes'/'no' strings to booleans
data = pd.read_csv('./input/aer-credit-card-data/AER_credit_card_data.csv',
                   true_values=['yes'], false_values=['no'])

# Select target
y = data.card

# Select predictors
X = data.drop(['card'], axis=1)

print("Number of rows in the dataset:", X.shape[0])
X.head()
```

Number of rows in the dataset: 1319


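| | reports | age | income | share | expenditure | owner | selfemp | dependents | months | majorcards | active |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| 0 | 0 | 37.66667 | 4.52 | 0.03327 | 124.9833 | True | False | 3 | 54 | 1 | 12 |
| 1 | 0 | 33.25 | 2.42 | 0.005217 | 9.854167 | False | False | 3 | 34 | 1 | 13 |
| 2 | 0 | 33.66667 | 4.5 | 0.004156 | 15.0 | True | False | 4 | 58 | 1 | 5 |
| 3 | 0 | 30.5 | 2.54 | 0.065214 | 137.8692 | False | False | 0 | 25 | 1 | 7 |
| 4 | 0 | 32.16667 | 9.7867 | 0.067051 | 546.5033 | True | False | 2 | 64 | 1 | 5 |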


This is a small dataset, so **cross-validation** will be used to get a more reliable measure of model quality.

```python
from sklearn.pipeline import make_pipeline
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import cross_val_score

# There is no preprocessing here, so a pipeline is not strictly needed.
# Using one anyway is good practice.
my_pipeline = make_pipeline(RandomForestClassifier(n_estimators=100))

# 5-fold cross-validation, scored by accuracy
cv_scores = cross_val_score(my_pipeline, X, y,
                            cv=5,
                            scoring='accuracy')

print("Cross-validation accuracy: %f" % cv_scores.mean())
```

Cross-validation accuracy: 0.980292


Models that are accurate 98% of the time are very rare. It does happen, but it is uncommon enough that the data should be inspected more closely for target leakage. A summary of the data can be found here: https://www.kaggle.com/code/alexisbcook/data-leakage/data

