# Data Leakage Simulation
## Objective

This notebook demonstrates how __data leakage__ can:

- Artificially inflate model performance

- Produce misleading validation metrics

- Lead to catastrophic failure in production

We simulate __common leakage patterns__, quantify their effects, and establish __leakage-safe design principles__.

## Why Data Leakage Is Dangerous

Data leakage causes models to learn information unavailable at prediction time.

This leads to:

- Unrealistically high accuracy

- Overconfidence in model quality

- Rapid performance decay post-deployment

This notebook makes leakage __observable and measurable__.

# Imports and Configuration

In [1]:
import numpy as np
import pandas as pd

import matplotlib.pyplot as plt
import seaborn as sns

from sklearn.model_selection import train_test_split, KFold
from sklearn.preprocessing import OneHotEncoder, StandardScaler
from sklearn.compose import ColumnTransformer
from sklearn.pipeline import Pipeline

from sklearn.linear_model import LogisticRegression
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import (
    classification_report,
    roc_auc_score
)

RANDOM_STATE = 42
np.random.seed(RANDOM_STATE)


# Step 1 – Generate a Clean Baseline Dataset

We simulate a legitimate classification problem without leakage.

In [2]:
N_SAMPLES = 6000

age = np.random.randint(18, 70, size=N_SAMPLES)

income = np.random.normal(
    loc=60000,
    scale=15000,
    size=N_SAMPLES
).clip(20000, 150000)

tenure = np.random.exponential(scale=5, size=N_SAMPLES).clip(0, 30)

region = np.random.choice(
    ["North", "South", "East", "West"],
    size=N_SAMPLES
)

score = (
    -6.5
    + 0.04 * age
    + 0.0005 * income
    + 1.8 * np.log1p(tenure)
    + np.random.normal(0, 1.0, size=N_SAMPLES)
)

probability = 1 / (1 + np.exp(-score))

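# Fraction of rows labeled positive (the class balance, not a noise level).
target_imbalance = 0.6
threshold = np.quantile(probability, 1 - target_imbalance)
# Alternative kept for reference: sample labels instead of thresholding.
# target = np.random.binomial(1, probability)
# Deterministic labels: the top 60% of rows by probability become class 1.
target = (probability >= threshold).astype(int)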


df = pd.DataFrame({
    "age": age,
    "income": income,
    "tenure": tenure,
    "region": region,
    "target": target
})

df.head()


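|   | age | income       | tenure   | region | target |
| - | --- | ------------ | -------- | ------ | ------ |
| 0 | 56  | 64860.855423 | 2.922262 | South  | 1      |
| 1 | 69  | 42784.629153 | 0.608804 | North  | 0      |
| 2 | 46  | 61278.141745 | 9.067705 | West   | 1      |
| 3 | 32  | 20000.0      | 0.140104 | North  | 0      |
| 4 | 60  | 57196.557628 | 8.978559 | West   | 1      |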
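
The labels above come from thresholding `probability` at a quantile, so the positive rate is fixed by `target_imbalance` rather than sampled. The sketch below contrasts that with the commented-out `np.random.binomial` alternative on a small stand-in sample. The generator `rng`, the sample size, and the variable names are assumptions made for illustration, not part of the notebook.

```python
import numpy as np

# Assumption: a separate generator and a toy score, not the notebook's data.
rng = np.random.default_rng(0)
n = 1000
score = rng.normal(0.0, 3.0, size=n)
probability = 1 / (1 + np.exp(-score))

positive_fraction = 0.6
threshold = np.quantile(probability, 1 - positive_fraction)

# Thresholding: the top 60% of rows by probability become class 1.
target_thresholded = (probability >= threshold).astype(int)
# Sampling: each row is a Bernoulli draw, so the positive rate varies.
target_sampled = rng.binomial(1, probability)

print("thresholded positive rate:", target_thresholded.mean())
print("sampled positive rate:", target_sampled.mean())
```

Thresholded labels are a deterministic function of the score, which makes the classes far easier to separate than sampled labels would be. That is worth keeping in mind when reading the baseline ROC-AUC in the next step.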


# Step 2 – Baseline Performance (No Leakage)

In [27]:
X = df.drop(columns="target")
y = df["target"]

X_train, X_test, y_train, y_test = train_test_split(
    X, y,
    test_size=0.25,
    stratify=y,
    random_state=RANDOM_STATE
)

preprocessor = ColumnTransformer(
    transformers=[
        ("num", StandardScaler(), ["age", "income", "tenure"]),
        ("cat", OneHotEncoder(drop="first"), ["region"])
    ]
)

baseline_model = Pipeline(
    steps=[
        ("preprocessing", preprocessor),
        ("model", LogisticRegression(max_iter=1000))
    ]
)

baseline_model.fit(X_train, y_train)

y_proba = baseline_model.predict_proba(X_test)[:, 1]

print("Baseline ROC-AUC:",
      roc_auc_score(y_test, y_proba))


Baseline ROC-AUC: 0.9938962962962963


# Step 3 – Leakage Type 1: Target Leakage (Direct)
## Simulation

We add a feature computed from the target itself: a rolling mean of `target` over the current row and the four rows before it.

In [4]:
df_leak_target = df.copy()

df_leak_target["leak_target_mean"] = (
    df_leak_target["target"]
    .rolling(window=5, min_periods=1)
    .mean()
)

## Evaluation

In [5]:
X = df_leak_target.drop(columns="target")
y = df_leak_target["target"]

X_train, X_test, y_train, y_test = train_test_split(
    X, y,
    test_size=0.25,
    stratify=y,
    random_state=RANDOM_STATE
)

model = Pipeline(
    steps=[
        ("preprocessing", ColumnTransformer(
            transformers=[
                ("num", StandardScaler(),
                 ["age", "income", "tenure", "leak_target_mean"]),
                ("cat", OneHotEncoder(drop="first"), ["region"])
            ]
        )),
        ("model", LogisticRegression(max_iter=1000))
    ]
)

model.fit(X_train, y_train)
y_proba = model.predict_proba(X_test)[:, 1]

print("Target Leakage ROC-AUC:",
      roc_auc_score(y_test, y_proba))


Target Leakage ROC-AUC: 0.9966685185185186


## Observation

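ROC-AUC rises from 0.9939 (baseline) to 0.9967. The gain looks small only because the baseline is already close to its ceiling. The new feature is computed from the target, including the current row's own label, so __this model is invalid__ regardless of how large the jump is.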
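
One way to see why the rolling feature leaks is to compare it with a lagged variant that excludes the current row's label. The sketch below uses random stand-in labels; `leaky`, `lagged`, and the neutral fill value `0.5` are assumptions for illustration, not names from the notebook.

```python
import numpy as np
import pandas as pd

# Assumption: independent random labels standing in for the notebook's target.
rng = np.random.default_rng(0)
target = pd.Series(rng.integers(0, 2, size=2000))

# Leaky: the window ends at the current row, so it contains that row's label.
leaky = target.rolling(window=5, min_periods=1).mean()

# Lagged: shift first, so the window covers only earlier rows.
lagged = target.shift(1).rolling(window=5, min_periods=1).mean()
lagged = lagged.fillna(0.5)  # first row has no history; 0.5 is an arbitrary neutral value

leaky_corr = leaky.corr(target)
lagged_corr = lagged.corr(target)
print("correlation with target (leaky): ", leaky_corr)
print("correlation with target (lagged):", lagged_corr)
```

Even a lagged label feature is only legitimate if earlier labels are actually known at prediction time, which depends on the deployment setting.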

# Step 4 – Leakage Type 2: Preprocessing Leakage
## Incorrect Scaling (Before Split)

In [6]:
df_scaled = df.copy()

scaler = StandardScaler()
df_scaled[["age", "income", "tenure"]] = scaler.fit_transform(
    df_scaled[["age", "income", "tenure"]]
)

## Evaluation

In [12]:
model = Pipeline(
    steps=[
        ("preprocessing", ColumnTransformer(
            transformers=[
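                # The numeric columns were scaled above, so no scaler is listed here.
                # Caution: remainder defaults to "drop", so age, income, and tenure
                # never reach the model; only the encoded region does.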
                ("cat", OneHotEncoder(drop="first"), ["region"])
            ]
        )),
        ("model", LogisticRegression(max_iter=1000))
    ]
)

In [13]:
X = df_scaled.drop(columns="target")
y = df_scaled["target"]

X_train, X_test, y_train, y_test = train_test_split(
    X, y,
    test_size=0.25,
    stratify=y,
    random_state=RANDOM_STATE
)

model.fit(X_train, y_train)
y_proba = model.predict_proba(X_test)[:, 1]

print("Preprocessing Leakage ROC-AUC:",
      roc_auc_score(y_test, y_proba))


Preprocessing Leakage ROC-AUC: 0.5058722222222223
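
The near-chance score above is consistent with how `ColumnTransformer` treats columns that no transformer lists: its `remainder` parameter defaults to `"drop"`. The sketch below shows the difference on a toy frame; `frame`, `dropping`, and `passing` are hypothetical names used only here.

```python
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.preprocessing import OneHotEncoder

# Toy frame: one pre-scaled numeric column and one categorical column.
frame = pd.DataFrame({
    "age": [0.5, -1.2, 0.3, 0.4],
    "region": ["North", "South", "East", "North"],
})

# Default remainder="drop": the unlisted "age" column is discarded.
dropping = ColumnTransformer(
    transformers=[("cat", OneHotEncoder(drop="first"), ["region"])]
)

# remainder="passthrough": unlisted columns are appended unchanged.
passing = ColumnTransformer(
    transformers=[("cat", OneHotEncoder(drop="first"), ["region"])],
    remainder="passthrough",
)

n_dropped = dropping.fit_transform(frame).shape[1]
n_passed = passing.fit_transform(frame).shape[1]
print("columns with remainder='drop':       ", n_dropped)
print("columns with remainder='passthrough':", n_passed)
```

Passing `remainder="passthrough"` would be one way to let pre-scaled numeric columns reach the model in an experiment like the one above.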


## Key Insight

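Fitting the scaler before the split lets test-set means and variances shape the training features. The ROC-AUC of 0.506 recorded above does not measure that effect, however: the pipeline's `ColumnTransformer` lists only `region`, and unlisted columns are dropped by default, so the model never sees the scaled numeric features. For plain standardization the inflation tends to be small, but even subtle preprocessing leakage can inflate metrics, especially under cross-validation.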
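
A leakage-safe counterpart to scaling before the split is to fit the scaler on the training rows only and reuse it on the test rows. The sketch below uses a toy income-like column; `leaky_scaler` and `safe_scaler` are hypothetical names, and the data is generated only for illustration.

```python
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler

# Assumption: a single income-like column as stand-in data.
rng = np.random.default_rng(0)
X = rng.normal(loc=60000, scale=15000, size=(400, 1))

X_train, X_test = train_test_split(X, test_size=0.25, random_state=0)

# Leaky: mean and variance are estimated from all rows, test rows included.
leaky_scaler = StandardScaler().fit(X)

# Safe: statistics come from the training rows and are reused on the test rows.
safe_scaler = StandardScaler().fit(X_train)
X_train_safe = safe_scaler.transform(X_train)
X_test_safe = safe_scaler.transform(X_test)

print("mean from all rows:  ", leaky_scaler.mean_[0])
print("mean from train only:", safe_scaler.mean_[0])
```

The two means differ only slightly on data like this, which is why scaling leakage is easy to miss and why wrapping the scaler in a `Pipeline` is the more reliable habit.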

# Step 5 – Leakage Type 3: Temporal Leakage
## Simulation

We add a feature that would only be known after prediction time: the target value of the next row, obtained with `shift(-1)`.

In [14]:
df_time = df.copy()
df_time["future_event_count"] = (
    df_time["target"]
    .shift(-1)
    .fillna(0)
)


## Evaluation

In [15]:
X = df_time.drop(columns="target")
y = df_time["target"]

X_train, X_test, y_train, y_test = train_test_split(
    X, y,
    test_size=0.25,
    stratify=y,
    random_state=RANDOM_STATE
)

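# Note: `model` is the Step 4 pipeline, which encodes only `region` and drops
# every other column, including `future_event_count`.
model.fit(X_train, y_train)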
y_proba = model.predict_proba(X_test)[:, 1]

print("Temporal Leakage ROC-AUC:",
      roc_auc_score(y_test, y_proba))


Temporal Leakage ROC-AUC: 0.5058722222222223
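
The direction of the shift decides whether a feature looks forward or backward in time, and a chronological split keeps every training row earlier than every test row. The sketch below assumes that row order represents time, which is an assumption: the notebook's rows are generated independently. The series `events` and the other names are hypothetical.

```python
import pandas as pd
from sklearn.model_selection import TimeSeriesSplit

# Assumption: row order is chronological in this toy series.
events = pd.Series([0, 1, 1, 0, 1, 0, 0, 1], name="target")

# shift(-1): row i receives the value of row i + 1 (the future).
future_value = events.shift(-1)
# shift(1): row i receives the value of row i - 1 (the past).
past_value = events.shift(1)

# Chronological folds: training indices always precede test indices.
tscv = TimeSeriesSplit(n_splits=3)
splits = list(tscv.split(events))
for train_idx, test_idx in splits:
    print("train:", train_idx, "test:", test_idx)
```

A random `train_test_split` on time-ordered data mixes past and future rows, so a forward-looking feature and a shuffled split can each leak on their own.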


# Step 6 – Leakage Type 4: Cross-Validation Leakage
## Incorrect CV Design

In [25]:
kf = KFold(n_splits=5, shuffle=True, random_state=RANDOM_STATE)

X = df.drop(columns=["target", 'region'])
X = pd.concat([X, pd.get_dummies(df['region'], drop_first=True, prefix='reg')], axis=1)
y = df["target"]

scores = []

for train_idx, test_idx in kf.split(X):
    # Leak (intentional): the scaler is fit on every row, test fold included.
    scaler = StandardScaler()
    X_scaled = scaler.fit_transform(X)

    X_train, X_test = X_scaled[train_idx], X_scaled[test_idx]
    y_train, y_test = y.iloc[train_idx], y.iloc[test_idx]

    model_cv = LogisticRegression(max_iter=1000)
    model_cv.fit(X_train, y_train)

    y_proba = model_cv.predict_proba(X_test)[:, 1]
    scores.append(roc_auc_score(y_test, y_proba))

print("CV Leakage ROC-AUC:", np.mean(scores))


CV Leakage ROC-AUC: 0.9930806475243161
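
To gauge how much the leaky scaler changes the score, the same folds can be evaluated twice: once with statistics from all rows and once with statistics from the training fold only. The sketch below does this on a stand-in dataset from `make_classification`; the dataset and the names `leaky_scores` and `safe_scores` are assumptions for illustration.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import roc_auc_score
from sklearn.model_selection import KFold
from sklearn.preprocessing import StandardScaler

# Assumption: synthetic stand-in data, not the notebook's DataFrame.
X, y = make_classification(n_samples=600, n_features=6, random_state=0)
kf = KFold(n_splits=5, shuffle=True, random_state=0)

leaky_scores, safe_scores = [], []
for train_idx, test_idx in kf.split(X):
    # Leaky: scaling statistics come from all rows, held-out fold included.
    X_all = StandardScaler().fit_transform(X)
    leaky = LogisticRegression(max_iter=1000).fit(X_all[train_idx], y[train_idx])
    leaky_proba = leaky.predict_proba(X_all[test_idx])[:, 1]
    leaky_scores.append(roc_auc_score(y[test_idx], leaky_proba))

    # Safe: scaling statistics come from the training fold only.
    scaler = StandardScaler().fit(X[train_idx])
    safe = LogisticRegression(max_iter=1000).fit(scaler.transform(X[train_idx]), y[train_idx])
    safe_proba = safe.predict_proba(scaler.transform(X[test_idx]))[:, 1]
    safe_scores.append(roc_auc_score(y[test_idx], safe_proba))

print("leaky CV ROC-AUC:", np.mean(leaky_scores))
print("safe CV ROC-AUC: ", np.mean(safe_scores))
```

With standardization alone the two averages are typically very close. The design flaw matters more when the leaked step depends on the target, such as feature selection or target encoding.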


# Step 7 – Leakage-Safe Pipeline (Correct Approach)

In [30]:
kf = KFold(n_splits=5, shuffle=True, random_state=RANDOM_STATE)

In [28]:
X = df.drop(columns=["target"])
y = df["target"]

In [29]:
safe_pipeline = Pipeline(
    steps=[
        ("preprocessing", preprocessor),
        ("model", LogisticRegression(max_iter=1000))
    ]
)

scores = []

for train_idx, test_idx in kf.split(X):
    X_train, X_test = X.iloc[train_idx], X.iloc[test_idx]
    y_train, y_test = y.iloc[train_idx], y.iloc[test_idx]

    safe_pipeline.fit(X_train, y_train)
    y_proba = safe_pipeline.predict_proba(X_test)[:, 1]
    scores.append(roc_auc_score(y_test, y_proba))

print("Leakage-Free CV ROC-AUC:", np.mean(scores))


Leakage-Free CV ROC-AUC: 0.9930794395663833
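
The manual loop above can also be expressed with `cross_val_score`, which clones and refits the whole pipeline on each training fold. The sketch below is a compact variant on stand-in data from `make_classification`; the dataset and the name `pipeline` are assumptions, not the notebook's objects.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import KFold, cross_val_score
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Assumption: synthetic stand-in data, not the notebook's DataFrame.
X, y = make_classification(n_samples=500, n_features=5, random_state=0)

# The scaler lives inside the pipeline, so it is refit on each training fold.
pipeline = make_pipeline(StandardScaler(), LogisticRegression(max_iter=1000))
kf = KFold(n_splits=5, shuffle=True, random_state=0)

scores = cross_val_score(pipeline, X, y, cv=kf, scoring="roc_auc")
print("Leakage-free CV ROC-AUC:", scores.mean())
```

Keeping every fitted preprocessing step inside the estimator passed to `cross_val_score` removes the opportunity to fit it on held-out rows by accident.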


# Step 8 – Performance Comparison Summary


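| Scenario              | ROC-AUC (recorded) | Interpretation                                                  |
| --------------------- | ------------------ | --------------------------------------------------------------- |
| Baseline (Clean)      | 0.9939             | Leakage-free reference                                          |
| Target Leakage        | 0.9967             | Inflated; the feature contains the label                        |
| Preprocessing Leakage | 0.5059             | Not comparable; only `region` reached the model                 |
| Temporal Leakage      | 0.5059             | Not comparable; same `region`-only pipeline                     |
| CV Leakage            | 0.9931             | Misleading by design; gap to the proper pipeline is negligible  |
| Proper Pipeline       | 0.9931             | Trustworthy                                                     |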


# Step 9 – Business Consequences

- False confidence in models

- Poor decisions in production

- Regulatory and reputational risk

- Increased technical debt

## Summary

This notebook demonstrated:

- Multiple types of data leakage

- How leakage inflates performance metrics

- Why pipelines and validation design matter

- How to implement leakage-safe modeling

This notebook is mandatory reading before:

- Feature engineering

- Hyperparameter tuning

- Model evaluation

- Deployment