# Leakage-Safe Cross-Validation
Correct Model Evaluation Under Realistic Data Constraints

## Objective

This notebook demonstrates how to perform cross-validation without data leakage, covering:

- Why naïve CV is often wrong

- Proper use of pipelines inside CV

- Stratified, grouped, and time-aware splits

- Common leakage patterns and how to avoid them

It answers one question:

> How do we evaluate models so that validation performance reflects real-world behavior?

## Why Leakage-Safe CV Matters

Data leakage causes:

- Inflated validation scores

- Model selection errors

- Production failures

- False stakeholder confidence

> A model that looks too good in validation and then fails in production has often been evaluated with leakage.

## Imports and Dataset

In [48]:
import numpy as np
import pandas as pd

from sklearn.model_selection import (
    cross_val_score,
    StratifiedKFold,
    GroupKFold,
    TimeSeriesSplit
)

from sklearn.pipeline import Pipeline
from sklearn.compose import ColumnTransformer

from sklearn.preprocessing import (
    StandardScaler,
    OneHotEncoder,
    OrdinalEncoder
)

from sklearn.impute import SimpleImputer
from sklearn.linear_model import LogisticRegression


In [73]:

DATA_PATH = "D:/GitHub/Data-Science-Techniques/datasets/synthetic_customer_churn_classification_complete.csv"
df = pd.read_csv(DATA_PATH)

X = df.drop(columns=["churn", "customer_id"])
y = df["churn"]


In [75]:
df.satisfaction_level.unique()

array([nan, 'Very High', 'Medium', 'High', 'Very Low', 'Low'],
      dtype=object)

## Define Leakage-Safe Pipeline

In [80]:
numeric_features = X.select_dtypes(include=["int64", "float64"]).columns

categorical_features = ['customer_segment', 'region']

ordinal_features = ["satisfaction_level"]
ordinal_categories = [["Very Low", "Low", "Medium", "High", "Very High"]]

# ---

numeric_pipeline = Pipeline(steps=[
    ("imputer", SimpleImputer(strategy="median")),
    ("scaler", StandardScaler())
])

categorical_pipeline = Pipeline(steps=[
    ("imputer", SimpleImputer(strategy="most_frequent")),
    ("encoder", OneHotEncoder(handle_unknown="ignore", drop="first"))
])

ordinal_pipeline = Pipeline(steps=[
    ("imputer", SimpleImputer(strategy="most_frequent")),
    ("encoder", OrdinalEncoder(categories=ordinal_categories))
])

preprocessor = ColumnTransformer(
    transformers=[
        ("num", numeric_pipeline, numeric_features),
        ("cat", categorical_pipeline, categorical_features),
        ('ord', ordinal_pipeline, ordinal_features)
    ]
)

pipeline = Pipeline(steps=[
    ("preprocessing", preprocessor),
    ("model", LogisticRegression(
        max_iter=1000,
        class_weight="balanced"
    ))
])


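All transformations are encapsulated in the pipeline, so during cross-validation every imputer, scaler, and encoder is fit on the training fold only and then applied to the validation fold.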
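One way to confirm that preprocessing is refit per fold is to keep the fitted estimators from each fold and compare their learned statistics. The sketch below uses a small synthetic array rather than the churn data; `X_demo`, `y_demo`, `demo_pipeline`, and `demo_cv` are illustrative names, not part of the notebook.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import StratifiedKFold, cross_validate
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in data (assumption: not the churn dataset)
rng = np.random.default_rng(0)
X_demo = rng.normal(loc=5.0, scale=2.0, size=(200, 3))
y_demo = (X_demo[:, 0] + rng.normal(size=200) > 5.0).astype(int)

demo_pipeline = Pipeline(steps=[
    ("scaler", StandardScaler()),
    ("model", LogisticRegression(max_iter=1000)),
])

demo_cv = StratifiedKFold(n_splits=5, shuffle=True, random_state=0)

# return_estimator=True keeps the pipeline fitted on each training fold
result = cross_validate(
    demo_pipeline, X_demo, y_demo, cv=demo_cv, return_estimator=True
)

# Each fitted scaler holds the column means of its own training fold only
fold_means = np.array(
    [est.named_steps["scaler"].mean_ for est in result["estimator"]]
)
global_mean = X_demo.mean(axis=0)

print(fold_means)
print(global_mean)
```

If the scaler had been fit once on all rows, every fold would share `global_mean`. Distinct rows in `fold_means` indicate that the scaler was refit on each training fold.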

## Stratified Cross-Validation (Default)

In [83]:
cv = StratifiedKFold(
    n_splits=5,
    shuffle=True,
    random_state=2010
)

scores = cross_val_score(
    pipeline,
    X,
    y,
    cv=cv,
    scoring="roc_auc"
)

scores, scores.mean()


(array([1., 1., 1., 1., 1.]), np.float64(1.0))

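- ✔ Preserves class distribution in every fold
- ✔ No preprocessing leakage: imputers, scaler, and encoders are fit inside each training fold
- ✔ Reproducible via a fixed `random_state`

A perfect AUC on every fold is unusual on real data. Here it may simply reflect the synthetic dataset, but it can also mean that a feature encodes the target, so it is worth checking before trusting the score.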
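The class-distribution point can be checked directly by comparing the positive rate in each validation fold. This is a minimal sketch on a toy label vector with 10% positives; `y_toy`, `X_toy`, and `positive_rates` are illustrative names.

```python
import numpy as np
from sklearn.model_selection import KFold, StratifiedKFold

# Toy labels: 20 positives out of 200 rows (10% positive rate)
y_toy = np.array([1] * 20 + [0] * 180)
X_toy = np.zeros((200, 1))  # features are irrelevant for the split itself


def positive_rates(splitter):
    """Return the share of positives in each validation fold."""
    return [
        y_toy[val_idx].mean()
        for _, val_idx in splitter.split(X_toy, y_toy)
    ]


stratified_rates = positive_rates(
    StratifiedKFold(n_splits=5, shuffle=True, random_state=0)
)
plain_rates = positive_rates(KFold(n_splits=5, shuffle=True, random_state=0))

print("StratifiedKFold:", stratified_rates)
print("KFold:          ", plain_rates)
```

With stratification every fold keeps the overall positive rate, while plain `KFold` lets it drift from fold to fold, which matters most when the positive class is rare.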

## What NOT to Do (Illustration)

❌ Incorrect Pattern (do not run in production):

In [86]:
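# WRONG: leakage example
# The imputer and scaler are fit on ALL rows, including future validation folds.
X_num = X[numeric_features]
X_imputed = SimpleImputer(strategy="median").fit_transform(X_num)
X_scaled = StandardScaler().fit_transform(X_imputed)
cross_val_score(
    LogisticRegression(max_iter=1000),
    X_scaled,
    y,
    cv=5
)


Imputation and scaling occur before the CV split, so medians, means, and standard deviations computed from validation rows leak into training. (Only the numeric columns are used here; passing the raw `X` to `StandardScaler` fails with `ValueError: could not convert string to float: 'Very High'` because of the string columns.)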
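Leakage from scaling is often small, so the effect is easier to see with a supervised step. The sketch below, on pure-noise data, selects features using all labels before CV and compares that with the same selection done inside a pipeline. The data and the names `X_noise`, `y_noise`, `leaky_auc`, and `honest_auc` are assumptions made for illustration.

```python
import numpy as np
from sklearn.feature_selection import SelectKBest, f_classif
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import StratifiedKFold, cross_val_score
from sklearn.pipeline import Pipeline

# Pure noise: the features carry no information about the labels
rng = np.random.default_rng(0)
X_noise = rng.normal(size=(100, 2000))
y_noise = rng.integers(0, 2, size=100)

noise_cv = StratifiedKFold(n_splits=5, shuffle=True, random_state=0)

# LEAKY: feature selection sees every label, including validation rows
X_selected = SelectKBest(f_classif, k=20).fit_transform(X_noise, y_noise)
leaky_auc = cross_val_score(
    LogisticRegression(max_iter=1000),
    X_selected, y_noise, cv=noise_cv, scoring="roc_auc",
).mean()

# SAFE: selection is a pipeline step, refit on each training fold
safe_pipeline = Pipeline(steps=[
    ("select", SelectKBest(f_classif, k=20)),
    ("model", LogisticRegression(max_iter=1000)),
])
honest_auc = cross_val_score(
    safe_pipeline, X_noise, y_noise, cv=noise_cv, scoring="roc_auc",
).mean()

print(f"leaky AUC:  {leaky_auc:.3f}")
print(f"honest AUC: {honest_auc:.3f}")
```

Because the labels are random, an honest estimate should sit near 0.5. The leaky variant typically reports a much higher AUC even though there is nothing to learn.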

## Group-Aware Cross-Validation

Used when samples are not independent, for example when several rows belong to the same customer or session.

In [93]:
groups = df["customer_id"]

group_cv = GroupKFold(n_splits=5)

group_scores = cross_val_score(
    pipeline,
    X,
    y,
    cv=group_cv,
    groups=groups,
    scoring="roc_auc"
)

group_scores, group_scores.mean()


(array([1., 1., 1., 1., 1.]), np.float64(1.0))

Prevents the same customer from appearing in both the training and validation folds.
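The guarantee can be verified by checking that no group identifier is shared between the two sides of any split. This is a small sketch on toy group labels; `toy_groups`, `toy_features`, and `overlaps` are illustrative names.

```python
import numpy as np
from sklearn.model_selection import GroupKFold

# Toy setup: 10 groups (think customers) with 3 rows each
toy_groups = np.repeat(np.arange(10), 3)
toy_features = np.zeros((30, 1))

overlaps = []
for train_idx, val_idx in GroupKFold(n_splits=5).split(
    toy_features, groups=toy_groups
):
    # Groups present on both sides of the split would indicate leakage
    shared = set(toy_groups[train_idx]) & set(toy_groups[val_idx])
    overlaps.append(len(shared))

print(overlaps)
```

Grouping only changes the evaluation when groups repeat. If `df["customer_id"].is_unique` is `True`, each customer has a single row and `GroupKFold` splits no differently in kind from an unshuffled K-fold.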

## Time-Aware Cross-Validation

Used when rows are ordered in time and the model will be asked to predict later observations from earlier ones.

In [97]:
ts_cv = TimeSeriesSplit(n_splits=5)

time_scores = cross_val_score(
    pipeline,
    X,
    y,
    cv=ts_cv,
    scoring="roc_auc"
)

time_scores, time_scores.mean()


(array([1., 1., 1., 1., 1.]), np.float64(1.0))

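Training indices always precede validation indices. This matches time order only if the rows are sorted chronologically before splitting, because `TimeSeriesSplit` works on row position and does not read a timestamp column.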
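Since `TimeSeriesSplit` relies on row order, sorting by a timestamp first is the step that makes the split meaningful. The sketch below uses a toy frame with a hypothetical `event_time` column (the churn dataset's time column, if any, is not shown in this notebook).

```python
import numpy as np
import pandas as pd
from sklearn.model_selection import TimeSeriesSplit

# Toy frame stored in reverse chronological order on purpose
toy = pd.DataFrame({
    "event_time": pd.date_range("2024-01-01", periods=12, freq="D")[::-1],
    "value": np.arange(12),
})

# Sort chronologically and reset the index so row position equals time order
toy_sorted = toy.sort_values("event_time").reset_index(drop=True)

ordered = []
for train_idx, val_idx in TimeSeriesSplit(n_splits=3).split(toy_sorted):
    latest_train = toy_sorted.loc[train_idx, "event_time"].max()
    earliest_val = toy_sorted.loc[val_idx, "event_time"].min()
    # True when all training rows are strictly earlier than validation rows
    ordered.append(bool(latest_train < earliest_val))

print(ordered)
```

Running the same loop on the unsorted `toy` frame would place later dates in training and earlier dates in validation, which is exactly the temporal leakage this splitter is meant to prevent.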

## Choosing the Right CV Strategy


| Scenario           | CV Strategy     |
| ------------------ | --------------- |
| IID classification | StratifiedKFold |
| Repeated entities  | GroupKFold      |
| Temporal data      | TimeSeriesSplit |
| Imbalanced classes | StratifiedKFold |
| Panel data         | Group + Time    |
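
For the "Group + Time" row, scikit-learn does not appear to ship a single splitter that enforces both constraints, so one option is to generate the index pairs by hand. The helper below, `group_time_splits`, is a hypothetical sketch: it assumes numeric timestamps, cuts time at quantiles, and drops from each validation window any group already seen in training.

```python
import numpy as np


def group_time_splits(times, groups, n_splits=3):
    """Yield (train_idx, val_idx) pairs for panel data.

    Validation rows are later in time than all training rows and belong
    to groups that never appear in the training rows.
    Hypothetical helper for illustration; assumes numeric `times`.
    """
    times = np.asarray(times)
    groups = np.asarray(groups)

    # Interior quantiles of time serve as window boundaries
    cutoffs = np.quantile(times, np.linspace(0, 1, n_splits + 2)[1:-1])
    upper_bounds = list(cutoffs[1:]) + [np.inf]

    for start, end in zip(cutoffs, upper_bounds):
        train_idx = np.flatnonzero(times < start)
        seen_groups = np.unique(groups[train_idx])
        in_window = (times >= start) & (times < end)
        # Keep only validation rows whose group is unseen in training
        val_idx = np.flatnonzero(in_window & ~np.isin(groups, seen_groups))
        if len(train_idx) and len(val_idx):
            yield train_idx, val_idx


# Toy panel: 40 time steps, each group spans 4 consecutive steps
toy_times = np.arange(40)
toy_panel_groups = toy_times // 4
panel_splits = list(group_time_splits(toy_times, toy_panel_groups, n_splits=3))

for train_idx, val_idx in panel_splits:
    print(len(train_idx), len(val_idx))
```

`cross_val_score` accepts an iterable of `(train, test)` index arrays as `cv`, so a list such as `panel_splits` can be passed in place of a splitter object. Dropping rows from validation shrinks the folds, which is a trade-off of this particular scheme.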


## CV for Model Comparison

In [102]:
from sklearn.ensemble import RandomForestClassifier

rf_pipeline = Pipeline(steps=[
    ("preprocessing", preprocessor),
    ("model", RandomForestClassifier(
        n_estimators=200,
        max_depth=6,
        random_state=42,
        class_weight="balanced"
    ))
])

rf_scores = cross_val_score(
    rf_pipeline,
    X,
    y,
    cv=cv,
    scoring="roc_auc"
)

rf_scores.mean()


np.float64(1.0)

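The comparison is fair because both pipelines are scored on identical folds: the same `cv` object with a fixed `random_state` is passed to each call.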
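When every model reaches a perfect score on every split, one quick diagnostic is to score each numeric feature on its own. A feature whose single-column AUC is close to 1.0 may be a proxy for the target. The helper `single_feature_auc` and the toy columns `tenure` and `cancel_flag` are illustrative assumptions, not columns known to exist in the churn dataset.

```python
import numpy as np
import pandas as pd
from sklearn.metrics import roc_auc_score


def single_feature_auc(frame, target):
    """Return the AUC of each numeric column used alone as a score."""
    results = {}
    for col in frame.select_dtypes(include="number").columns:
        values = frame[col]
        mask = values.notna()  # AUC cannot be computed on missing values
        auc = roc_auc_score(target[mask], values[mask])
        # Direction-agnostic: a feature can separate classes either way
        results[col] = max(auc, 1 - auc)
    return pd.Series(results).sort_values(ascending=False)


# Toy data: one harmless feature and one that nearly copies the target
rng = np.random.default_rng(0)
toy_target = pd.Series(rng.integers(0, 2, size=500))
toy_frame = pd.DataFrame({
    "tenure": rng.normal(size=500),
    "cancel_flag": toy_target + rng.normal(scale=0.01, size=500),
})

suspects = single_feature_auc(toy_frame, toy_target)
print(suspects)
```

On the notebook's data the equivalent call would be `single_feature_auc(X, y)`, assuming `y` is a binary 0/1 column. A high value is a prompt to check how and when that feature is recorded, not proof of leakage.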

## Common Leakage Sources (Checklist)


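| Source                               | Risk | What leaks                                                    |
| ------------------------------------ | ---- | ------------------------------------------------------------- |
| Pre-scaling                          | ❌    | Validation-fold means and variances shape the scaler          |
| Feature engineering outside pipeline | ❌    | Any fitted step sees validation rows                          |
| Target encoding before split         | ❌    | Validation labels enter the encoded feature                   |
| Global statistics                    | ❌    | Aggregates computed over all rows include validation data     |
| Temporal leakage                     | ❌    | Information from the future informs predictions about the past |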


## Best Practices

- ✔ Always CV the full pipeline
- ✔ Match CV strategy to data structure
- ✔ Use same splits for model comparison
- ✔ Log CV configuration
- ✔ Treat CV as a modeling decision
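
For the "Log CV configuration" item, a minimal approach is to record the splitter settings alongside the scores. The sketch below serializes them to JSON; the `cv_config` dictionary and its keys are an illustrative layout, not a standard format.

```python
import json

from sklearn.model_selection import StratifiedKFold

cv = StratifiedKFold(n_splits=5, shuffle=True, random_state=2010)

# Collect the settings needed to reproduce the exact folds later
cv_config = {
    "splitter": type(cv).__name__,
    "n_splits": cv.n_splits,
    "shuffle": cv.shuffle,
    "random_state": cv.random_state,
    "scoring": "roc_auc",
}

cv_config_json = json.dumps(cv_config, indent=2)
print(cv_config_json)
```

Storing this next to the results makes it possible to tell later whether two reported scores came from the same folds.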

## Key Takeaways

- Cross-validation is not plug-and-play

- Pipelines prevent silent leakage

- CV strategy must reflect data reality

- Evaluation mistakes are costly to undo once a model has been selected and shipped

- Leakage-safe evaluation protects model credibility

## Related Notebooks

`09_Pipelines_and_Workflows/`

- [01_basic_pipeline.ipynb](01_basic_pipeline.ipynb)
- [02_column_transformer_pipeline.ipynb](02_column_transformer_pipeline.ipynb)
- [03_pipeline_with_feature_engineering.ipynb](03_pipeline_with_feature_engineering.ipynb)
- [02_leakage_safe_cross_validation.ipynb](02_leakage_safe_cross_validation.ipynb)
- [04_pipeline_with_model_tuning.ipynb](04_pipeline_with_model_tuning.ipynb)
- [05_pipeline_serialization_and_inference.ipynb](05_pipeline_serialization_and_inference.ipynb)
- [03_pipeline_monitoring_and_reusability.ipynb](03_pipeline_monitoring_and_reusability.ipynb)