<a href="https://colab.research.google.com/github/natalia7244/Machine-Learning-Exercises/blob/main/Data_Leakage.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# Data Leakage

Data leakage is like cheating on a test.

A model learns from its training data. If that data accidentally includes information that will not be available when the model makes real predictions, such as answers from the future, the model looks very accurate during training and validation but fails badly in real use.

For example, imagine you are training a model to predict whether someone will buy something, and one of your input columns records how much they spent. That column gives away the answer: the model learns to rely on it and doesn’t learn the real patterns.

The model then gets great scores during training, but in real use, when that information is not available, it fails.

# Target leakage


- Happens when the model is trained on data that only became available after the outcome was known.
- Makes training and validation scores look good.
- The model fails in real life because it relied on information from the future.
- To prevent it, remove any feature that is created or updated after the target is determined.

Example:

Predicting pneumonia with a `took_antibiotic_medicine` feature causes leakage: patients take antibiotics after they get sick, so the feature's value is only set once the target is already known.
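
The pneumonia example can be simulated on made-up data. The sketch below is only an illustration: the data is randomly generated, `got_pneumonia`, `age`, the 5% noise rate, and the choice of `LogisticRegression` are all assumptions, and only `took_antibiotic_medicine` comes from the example above. It compares cross-validated accuracy with and without the leaky column.

```python
import numpy as np
import pandas as pd
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score

rng = np.random.default_rng(0)
n = 1000

# Synthetic, randomly generated data (not real patient records).
got_pneumonia = rng.random(n) < 0.3      # hypothetical target
age = rng.integers(20, 80, size=n)       # hypothetical feature, unrelated to the target

# Antibiotics are taken after getting sick, so this column mostly mirrors
# the target (5% of the values are flipped to act as noise).
took_antibiotic_medicine = got_pneumonia ^ (rng.random(n) < 0.05)

X_leaky = pd.DataFrame({
    "age": age,
    "took_antibiotic_medicine": took_antibiotic_medicine.astype(int),
})
X_clean = X_leaky.drop(columns=["took_antibiotic_medicine"])

model = LogisticRegression(max_iter=1000)
leaky_acc = cross_val_score(model, X_leaky, got_pneumonia, cv=5).mean()
clean_acc = cross_val_score(model, X_clean, got_pneumonia, cv=5).mean()

print("Accuracy with the leaky feature:    %.2f" % leaky_acc)
print("Accuracy without the leaky feature: %.2f" % clean_acc)
```

In this toy setup the leaky column is nearly a copy of the target, so validation accuracy looks excellent even though the feature would not exist at prediction time.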

# Train-Test Contamination

Happens when information from the validation data influences training, often through preprocessing.

Makes validation scores look better than they really are.

For example, fitting an imputer or engineering features on the full dataset before it is split lets the validation rows shape what the model is trained on.

Fix it by:

- Splitting the data first.
- Fitting preprocessing steps on the training data only, then applying them to the validation data.
- Using pipelines to avoid mistakes.

This matters even more with cross-validation, where the preprocessing has to be refit on the training portion of every fold.
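
The fixes above might look like the following minimal sketch. The data is randomly generated, and the names `X_demo`, `y_demo`, and `pipe`, as well as the choice of `SimpleImputer`, `StandardScaler`, and `LogisticRegression`, are assumptions made for illustration only.

```python
import numpy as np
from sklearn.impute import SimpleImputer
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score, train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic data with roughly 10% missing values (for illustration only).
rng = np.random.default_rng(0)
X_demo = rng.normal(size=(200, 3))
y_demo = (X_demo[:, 0] + rng.normal(scale=0.5, size=200) > 0).astype(int)
X_demo[rng.random(X_demo.shape) < 0.1] = np.nan

# 1. Split first.
X_train, X_valid, y_train, y_valid = train_test_split(
    X_demo, y_demo, test_size=0.25, random_state=0
)

# 2. Fit the imputer on the training rows only, then apply it to both sets.
imputer = SimpleImputer(strategy="mean")
X_train_imp = imputer.fit_transform(X_train)   # learns column means from training rows
X_valid_imp = imputer.transform(X_valid)       # reuses those means, no refit

# 3. With cross-validation, put the preprocessing inside a pipeline so that
#    it is refit on the training portion of each fold.
pipe = make_pipeline(SimpleImputer(strategy="mean"), StandardScaler(), LogisticRegression())
scores = cross_val_score(pipe, X_demo, y_demo, cv=5, scoring="accuracy")
print("Cross-validation accuracy: %f" % scores.mean())
```

Because the imputer and scaler live inside the pipeline, `cross_val_score` never lets validation rows influence the statistics used to transform the training rows.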

# Example - target leakage

In [8]:
import pandas as pd

data = pd.read_csv('/content/drive/MyDrive/Data_sets/AER_credit_card_data.csv')

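# Target: whether the card application was accepted ('yes' / 'no')
y = data.card
# Features: every column except the target
X = data.drop(['card'], axis=1)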

print("Number of rows in the dataset:", X.shape[0])
X.head()



Number of rows in the dataset: 1319


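| | reports | age | income | share | expenditure | owner | selfemp | dependents | months | majorcards | active |
|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 0 | 37.66667 | 4.52 | 0.03327 | 124.9833 | yes | no | 3 | 54 | 1 | 12 |
| 1 | 0 | 33.25 | 2.42 | 0.005217 | 9.854167 | no | no | 3 | 34 | 1 | 13 |
| 2 | 0 | 33.66667 | 4.5 | 0.004156 | 15.0 | yes | no | 4 | 58 | 1 | 5 |
| 3 | 0 | 30.5 | 2.54 | 0.065214 | 137.8692 | no | no | 0 | 25 | 1 | 7 |
| 4 | 0 | 32.16667 | 9.7867 | 0.067051 | 546.5033 | yes | no | 2 | 64 | 1 | 5 |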


In [10]:
from sklearn.pipeline import make_pipeline
from sklearn.compose import ColumnTransformer
from sklearn.preprocessing import OneHotEncoder
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import cross_val_score

# Select the categorical and numerical columns
# (numerical_cols is not used below; numeric columns reach the model via remainder='passthrough')
categorical_cols = [cname for cname in X.columns if X[cname].dtype == "object"]
numerical_cols = [cname for cname in X.columns if X[cname].dtype in ['int64', 'float64']]

# Build the preprocessing step
preprocessor = ColumnTransformer(
    transformers=[
        ('cat', OneHotEncoder(handle_unknown='ignore'), categorical_cols)
    ],
    remainder='passthrough'  # numeric columns pass through unchanged
)

# Pipeline with preprocessing and model
model = RandomForestClassifier(n_estimators=100, random_state=0)
my_pipeline = make_pipeline(preprocessor, model)

# Cross-validation
cv_scores = cross_val_score(my_pipeline, X, y, cv=5, scoring='accuracy')
print("Cross-validation accuracy: %f" % cv_scores.mean())


Cross-validation accuracy: 0.980292


In [14]:
# Convert y to boolean values
y_bool = y == 'yes'

# Split expenditures by whether the applicant received a card
expenditures_cardholders = X.expenditure[y_bool]
expenditures_noncardholders = X.expenditure[~y_bool]

print('Fraction of those who did not receive a card and had no expenditures: %.2f'
      % ((expenditures_noncardholders == 0).mean()))
print('Fraction of those who received a card and had no expenditures: %.2f'
      % ((expenditures_cardholders == 0).mean()))


Fraction of those who did not receive a card and had no expenditures: 1.00
Fraction of those who received a card and had no expenditures: 0.02

Everyone who did not receive a card had no expenditures, while only 2% of cardholders had none. This suggests that `expenditure` records spending on the card being applied for, so its value is only known after the outcome: a case of target leakage.


In [15]:
# Drop potentially leaky predictors from the dataset
potential_leaks = ['expenditure', 'share', 'active', 'majorcards']
X2 = X.drop(potential_leaks, axis=1)

# Evaluate the model with the leaky predictors removed
cv_scores = cross_val_score(my_pipeline, X2, y, cv=5, scoring='accuracy')

print("Cross-val accuracy: %f" % cv_scores.mean())

Cross-val accuracy: 0.827892

With the suspect columns removed, accuracy falls from about 98% to about 83%. The lower score is likely a more realistic estimate of how the model would perform on new applications.
