# Cross-Validation Strategies

    Reliable Model Evaluation Under Real-World Constraints
    
##  Objective

This notebook provides a systematic and comparative treatment of cross-validation (CV) strategies, covering:

- Why a single train/test split is insufficient

- When standard K-Fold fails

- Stratification, grouping, and temporal ordering

- Leakage prevention

- Business-aligned validation design

It answers:

    How do we choose a validation strategy that reflects how the model will be used in production?

##  Why Cross-Validation Matters

Improper validation leads to:

- Inflated performance

- Model selection bias

- Silent leakage

- Deployment failures

Evaluation design is therefore part of the model.

##  Validation Is Data-Dependent

    No single CV strategy is “best”.

Validation depends on:

- Temporal structure

- Group dependencies

- Class imbalance

- Data volume

- Decision latency

##  Imports and Dataset

In [4]:
import numpy as np
import pandas as pd

from sklearn.model_selection import (
    train_test_split,
    KFold,
    StratifiedKFold,
    GroupKFold,
    TimeSeriesSplit,
    cross_val_score
)

from sklearn.linear_model import LogisticRegression
from sklearn.metrics import roc_auc_score


df = pd.read_csv("D:/GitHub/Data-Science-Techniques/datasets/Supervised-classification/synthetic_credit_default_classification.csv")

df.head()

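| Unnamed: 0 | customer_id | age | annual_income | credit_utilization | debt_to_income | loan_amount | loan_term_months | num_past_defaults | employment_years | credit_score | default |
| ---------- | ----------- | --- | ------------- | ------------------ | -------------- | ----------- | ---------------- | ----------------- | ---------------- | ------------ | ------- |
| 0 | 1 | 59 | 23283.682822 | 0.187813 | 0.245248 | 20232.165654 | 24 | 0 | 4.575844 | 689.627408 | 1 |
| 1 | 2 | 49 | 61262.608063 | 0.291774 | 0.396763 | 26484.067591 | 36 | 0 | 3.317515 | 697.770541 | 1 |
| 2 | 3 | 35 | 60221.74316 | 0.230557 | 0.122859 | 27142.522594 | 24 | 1 | 11.871955 | 713.721429 | 0 |
| 3 | 4 | 63 | 93603.112731 | 0.157906 | 0.635484 | 1000.0 | 12 | 0 | 2.256651 | 655.306417 | 1 |
| 4 | 5 | 28 | 71674.557271 | 0.167549 | 0.422446 | 15254.246561 | 48 | 0 | 6.97127 | 644.247643 | 0 |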


In [6]:
# Features: drop the target and the customer identifier
X = df.drop(columns=["default", "customer_id"])
y = df["default"]
# Group labels, used later by GroupKFold
groups = df["customer_id"]

In [8]:
# Train/test split

X_train, X_test, y_train, y_test = train_test_split(
    X, y,
    test_size=0.25,
    stratify=y,
    random_state=42
)


## Baseline Model
### Logistic Regression

In [9]:
model = LogisticRegression(max_iter=1000)


model.fit(X_train, y_train)
y_prob = model.predict_proba(X_test)[:, 1]

roc_auc_score(y_test, y_prob)

STOP: TOTAL NO. of ITERATIONS REACHED LIMIT.

Increase the number of iterations (max_iter) or scale the data as shown in:
    https://scikit-learn.org/stable/modules/preprocessing.html
Please also refer to the documentation for alternative solver options:
    https://scikit-learn.org/stable/modules/linear_model.html#logistic-regression
  n_iter_i = _check_optimize_result(


np.float64(0.9056513734526068)

- ✔ Simple
- ❌ High variance
- ❌ Sensitive to split
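
To make the "sensitive to split" point concrete, the sketch below repeats the same hold-out evaluation with 20 different random seeds and reports the spread of ROC AUC. It runs on synthetic data from `make_classification` as a stand-in for the credit dataset (an assumption), and the names `X_demo`, `y_demo`, and `holdout_aucs` are illustrative only.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import roc_auc_score
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in for the credit data (assumption: not the notebook's CSV).
X_demo, y_demo = make_classification(
    n_samples=500, n_features=10, weights=[0.8, 0.2], random_state=0
)

holdout_aucs = []
for seed in range(20):
    # Same data, same model; only the random split changes.
    X_tr, X_te, y_tr, y_te = train_test_split(
        X_demo, y_demo, test_size=0.25, stratify=y_demo, random_state=seed
    )
    clf = make_pipeline(StandardScaler(), LogisticRegression(max_iter=1000))
    clf.fit(X_tr, y_tr)
    holdout_aucs.append(roc_auc_score(y_te, clf.predict_proba(X_te)[:, 1]))

holdout_aucs = np.array(holdout_aucs)
print(
    f"min={holdout_aucs.min():.3f} "
    f"max={holdout_aucs.max():.3f} "
    f"std={holdout_aucs.std():.3f}"
)
```

The gap between the minimum and maximum is the uncertainty that a single split hides. Averaging over folds is one way to reduce it.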

## K-Fold Cross-Validation

In [13]:
kf = KFold(n_splits=5, shuffle=True, random_state=2010)

scores = cross_val_score(
    model, X, y,
    cv=kf,
    scoring="roc_auc",  # alternatives: "recall", "accuracy"
)

scores, scores.mean()


(array([0.91175189, 0.90693434, 0.89692883, 0.90754763, 0.90936074]),
 np.float64(0.9065046865768455))

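- ❌ Folds can end up with skewed class ratios when classes are imbalanced

- ❌ Can leak group information when one entity has rows in both train and test folds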

## Stratified K-Fold (Classification Default)

In [14]:
skf = StratifiedKFold(n_splits=5, shuffle=True, random_state=42)

scores = cross_val_score(
    model, X, y,
    cv=skf,
    scoring="roc_auc"
)

scores, scores.mean()


(array([0.91786545, 0.89770203, 0.90308529, 0.90825761, 0.90078158]),
 np.float64(0.9055383946557258))

- ✔ Preserves class ratio
- ✔ Standard for classification
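
A small sketch of what "preserves class ratio" means in practice: it compares the share of positives in each test fold under `KFold` and `StratifiedKFold`. The labels are a hypothetical 5% positive class built by hand (not the notebook's data), and `positive_rates` is an illustrative helper.

```python
import numpy as np
from sklearn.model_selection import KFold, StratifiedKFold

# Hypothetical imbalanced labels: 20 positives among 400 rows (5%).
y_imb = np.zeros(400, dtype=int)
y_imb[:20] = 1
X_imb = np.zeros((400, 1))  # features do not influence how rows are split


def positive_rates(splitter):
    """Share of positives in each test fold produced by the splitter."""
    return np.array(
        [y_imb[test_idx].mean() for _, test_idx in splitter.split(X_imb, y_imb)]
    )


kf_rates = positive_rates(KFold(n_splits=5, shuffle=True, random_state=0))
skf_rates = positive_rates(StratifiedKFold(n_splits=5, shuffle=True, random_state=0))

print("KFold:          ", kf_rates.round(3))
print("StratifiedKFold:", skf_rates.round(3))
```

With stratification every test fold holds exactly 4 of the 20 positives. Plain K-Fold only matches the 5% rate on average, so individual folds can drift, which matters most for rare classes.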

## Group K-Fold (Customer-Level Leakage)

Use when:

- Multiple rows per entity

- Strong intra-group correlation

In [15]:
gkf = GroupKFold(n_splits=5)

scores = cross_val_score(
    model, X, y,
    cv=gkf,
    groups=groups,
    scoring="roc_auc"
)

scores, scores.mean()


(array([0.91759902, 0.90331231, 0.91436468, 0.88958248, 0.90411466]),
 np.float64(0.9057946316144522))

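- ✔ Prevents customer leakage when a customer has several rows
- ✔ Reflects production deployment, where the model scores customers it has not seen

Note: in the data preview above, each `customer_id` appears only once. If that holds for the full dataset, every group has a single row, so grouping adds no extra constraint here and the scores are comparable to plain K-Fold.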
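
The following sketch shows the leakage that grouping is meant to prevent. It assumes a hypothetical dataset with 100 customers and 4 rows per customer, and counts how many customers land in both train and test under `KFold` versus `GroupKFold`. The names `customer_ids` and `shared_customers` are illustrative.

```python
import numpy as np
from sklearn.model_selection import GroupKFold, KFold

# Hypothetical setup: 100 customers with 4 rows each.
customer_ids = np.repeat(np.arange(100), 4)
X_grp = np.zeros((len(customer_ids), 1))  # placeholder features


def shared_customers(splitter, **split_kwargs):
    """Number of customers present in both train and test, per fold."""
    overlaps = []
    for train_idx, test_idx in splitter.split(X_grp, **split_kwargs):
        common = np.intersect1d(customer_ids[train_idx], customer_ids[test_idx])
        overlaps.append(len(common))
    return overlaps


kfold_overlap = shared_customers(KFold(n_splits=5, shuffle=True, random_state=0))
group_overlap = shared_customers(GroupKFold(n_splits=5), groups=customer_ids)

print("KFold shared customers per fold:     ", kfold_overlap)
print("GroupKFold shared customers per fold:", group_overlap)
```

`GroupKFold` keeps all rows of a customer on the same side of every split, so its overlap is zero in each fold. Shuffled K-Fold scatters a customer's rows across folds.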

## Time Series Split (Temporal Validation)

Use when:

- Data evolves over time

- The model must never train on data from after its test period

In [16]:
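# TimeSeriesSplit assumes rows are in chronological order. The columns shown
# above include no timestamp, so the row index stands in for time (assumption).
tscv = TimeSeriesSplit(n_splits=5)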

scores = cross_val_score(
    model,
    X.sort_index(),
    y.sort_index(),
    cv=tscv,
    scoring="roc_auc"
)

scores, scores.mean()


(array([0.89141054, 0.90644672, 0.90776043, 0.92552232, 0.89499696]),
 np.float64(0.9052273961287638))

- ✔ Respects causality: each fold trains on earlier rows and tests on later ones
- ✔ Mandatory for forecasting and for risk models that score future data

### Visualizing Time Splits

In [17]:
for fold, (train_idx, test_idx) in enumerate(tscv.split(X)):
    print(f"Fold {fold}: Train {len(train_idx)}, Test {len(test_idx)}")


Fold 0: Train 835, Test 833
Fold 1: Train 1668, Test 833
Fold 2: Train 2501, Test 833
Fold 3: Train 3334, Test 833
Fold 4: Train 4167, Test 833
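
Beyond fold sizes, it can help to confirm that every training window really ends before its test window begins. The sketch below does this on a hypothetical frame with an `application_date` column (the notebook's CSV is not known to have one), after sorting by that timestamp rather than by the row index.

```python
import numpy as np
import pandas as pd
from sklearn.model_selection import TimeSeriesSplit

# Hypothetical data: 120 daily records with an `application_date` column.
rng = np.random.default_rng(0)
df_time = pd.DataFrame({
    "application_date": pd.date_range("2023-01-01", periods=120, freq="D"),
    "feature": rng.normal(size=120),
})

# Simulate rows stored in arbitrary order, as often happens after joins.
df_time = df_time.sample(frac=1, random_state=0).reset_index(drop=True)

# Sort by the actual timestamp before splitting, then reset the index so
# positional indices from the splitter match the frame's labels.
df_time = df_time.sort_values("application_date").reset_index(drop=True)

tscv_demo = TimeSeriesSplit(n_splits=5)
fold_ok = []
for fold, (train_idx, test_idx) in enumerate(tscv_demo.split(df_time)):
    last_train = df_time.loc[train_idx, "application_date"].max()
    first_test = df_time.loc[test_idx, "application_date"].min()
    fold_ok.append(last_train < first_test)
    print(f"Fold {fold}: train ends {last_train.date()}, test starts {first_test.date()}")
```

If the sort step is skipped, the splitter still returns contiguous index blocks, but those blocks no longer correspond to time, and the check above would fail.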


## When Each Strategy Is Appropriate

| Scenario             | Strategy          |
| -------------------- | ----------------- |
| Small dataset        | K-Fold            |
| Classification       | Stratified K-Fold |
| Repeated entities    | Group K-Fold      |
| Time dependency      | TimeSeriesSplit   |
| Production mirroring | Custom split      |
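
The table above can be read as a decision rule. The helper below is a hypothetical sketch of one way to encode it, not a scikit-learn function. The priority order (time first, then groups, then class labels) is an assumption, chosen because temporal and group leakage usually do more damage than unstratified folds.

```python
from sklearn.model_selection import GroupKFold, KFold, StratifiedKFold, TimeSeriesSplit


def choose_splitter(is_temporal=False, has_groups=False,
                    is_classification=False, n_splits=5):
    """Hypothetical helper mapping data traits to a scikit-learn splitter."""
    if is_temporal:
        return TimeSeriesSplit(n_splits=n_splits)
    if has_groups:
        return GroupKFold(n_splits=n_splits)
    if is_classification:
        return StratifiedKFold(n_splits=n_splits, shuffle=True, random_state=42)
    return KFold(n_splits=n_splits, shuffle=True, random_state=42)


splitter = choose_splitter(is_classification=True)
print(type(splitter).__name__)
```

Production-mirroring splits are deliberately left out: they depend on how the model is retrained and scored, so they usually need custom train/test indices instead of a stock splitter.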


## Nested Cross-Validation (Conceptual)

Used for:

- Hyperparameter tuning

- Model selection

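Outer loop → estimates performance on data the tuning never saw

Inner loop → tunes hyperparameters within each outer training fold

This prevents the optimistic bias that arises when the same folds are used both to tune a model and to report its score.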
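
A minimal sketch of the two loops, assuming synthetic data and a small, arbitrary grid over the regularisation strength `C`: `GridSearchCV` plays the inner loop and `cross_val_score` the outer loop. The names `X_nest`, `y_nest`, `inner_cv`, and `outer_cv` are illustrative.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import GridSearchCV, StratifiedKFold, cross_val_score
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in data (assumption: not the notebook's CSV).
X_nest, y_nest = make_classification(n_samples=300, n_features=10, random_state=0)

pipe = Pipeline([
    ("scaler", StandardScaler()),
    ("model", LogisticRegression(max_iter=1000)),
])

inner_cv = StratifiedKFold(n_splits=3, shuffle=True, random_state=1)  # tuning
outer_cv = StratifiedKFold(n_splits=5, shuffle=True, random_state=2)  # evaluation

# Inner loop: pick C using only the rows of the current outer training fold.
search = GridSearchCV(
    pipe,
    param_grid={"model__C": [0.01, 0.1, 1.0, 10.0]},
    scoring="roc_auc",
    cv=inner_cv,
)

# Outer loop: score the tuned model on rows the search never touched.
nested_scores = cross_val_score(search, X_nest, y_nest, cv=outer_cv, scoring="roc_auc")
print(f"Nested CV ROC AUC: {nested_scores.mean():.3f} +/- {nested_scores.std():.3f}")
```

The reported number estimates the performance of the whole procedure (tuning included), not of one fixed hyperparameter setting.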

##  Leakage Scenarios (Avoided)

- ❌ Shuffling time series
- ❌ Mixing customer histories
- ❌ Fitting preprocessing outside CV
- ❌ Using target leakage features
- ❌ Tuning on test data
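
To illustrate the "fitting preprocessing outside CV" scenario, the sketch below uses pure-noise features and labels that are unrelated to them, so an honest estimate should sit near 0.5 ROC AUC. Feature selection (`SelectKBest`) stands in for preprocessing in general; the sizes (100 rows, 2000 features, 20 selected) are arbitrary choices for the demonstration.

```python
import numpy as np
from sklearn.feature_selection import SelectKBest, f_classif
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import StratifiedKFold, cross_val_score
from sklearn.pipeline import make_pipeline

rng = np.random.default_rng(0)
X_noise = rng.normal(size=(100, 2000))  # pure noise features
y_noise = np.repeat([0, 1], 50)         # labels unrelated to the features

cv = StratifiedKFold(n_splits=5, shuffle=True, random_state=0)

# Leaky: the selector sees every label, including those of the test folds.
X_selected = SelectKBest(f_classif, k=20).fit_transform(X_noise, y_noise)
leaky_auc = cross_val_score(
    LogisticRegression(max_iter=1000), X_selected, y_noise, cv=cv, scoring="roc_auc"
).mean()

# Honest: selection is refit inside each training fold via a pipeline.
honest_pipe = make_pipeline(
    SelectKBest(f_classif, k=20), LogisticRegression(max_iter=1000)
)
honest_auc = cross_val_score(
    honest_pipe, X_noise, y_noise, cv=cv, scoring="roc_auc"
).mean()

print(f"selection outside CV: {leaky_auc:.3f}")
print(f"selection inside CV:  {honest_auc:.3f}")
```

The first score looks strong even though there is nothing to learn, because the selected features were chosen with knowledge of the test labels.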

##  Cross-Validation with Pipelines

In [18]:
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

pipeline = Pipeline([
    ("scaler", StandardScaler()),
    ("model", LogisticRegression(max_iter=1000))
])

scores = cross_val_score(
    pipeline,
    X, y,
    cv=skf,
    scoring="roc_auc"
)

scores.mean()


np.float64(0.9156075666101782)

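- ✔ Leakage-safe: the scaler is refit on each training fold, never on test rows
- ✔ Reproducible
- ✔ Deployment-ready

With scaled features the mean ROC AUC rises from about 0.906 (unscaled Stratified K-Fold above) to about 0.916.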
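
One way to see that a pipeline refits its preprocessing per fold is to keep the fitted estimators and inspect the scaler statistics. The sketch below does this with `cross_validate(..., return_estimator=True)` on synthetic data (an assumption); `fold_means` is an illustrative name.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import StratifiedKFold, cross_validate
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in data (assumption: not the notebook's CSV).
X_pipe, y_pipe = make_classification(
    n_samples=300, n_features=5, n_informative=3, n_redundant=0, random_state=0
)

pipe_demo = Pipeline([
    ("scaler", StandardScaler()),
    ("model", LogisticRegression(max_iter=1000)),
])
cv_demo = StratifiedKFold(n_splits=5, shuffle=True, random_state=42)

result = cross_validate(
    pipe_demo, X_pipe, y_pipe,
    cv=cv_demo, scoring="roc_auc",
    return_estimator=True,  # keep the pipeline fitted on each training fold
)

# One row per fold: the feature means each fold's scaler learned.
fold_means = np.array(
    [est.named_steps["scaler"].mean_ for est in result["estimator"]]
)
print(fold_means.round(3))
```

Each row differs slightly because each scaler saw only its own training fold. A scaler fitted once on all rows would yield a single set of means that includes test data.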

## Key Takeaways

- Validation strategy is data-dependent

- Stratification is not optional under imbalance

- Grouping and time must be respected

- Pipelines prevent leakage

- Nested CV prevents selection bias


## Summary Table


| Strategy          | Handles Imbalance  | Prevents Leakage | Use            |
| ----------------- | ------------------ | ---------------- | -------------- |
| Hold-out          | Only if stratified | ❌                | Quick check    |
| K-Fold            | ❌                  | ❌                | Regression     |
| Stratified K-Fold | ✔                  | ❌                | Classification |
| Group K-Fold      | ❌                  | ✔ (group)        | Entity data    |
| TimeSeriesSplit   | ❌                  | ✔ (temporal)     | Temporal       |




# Complete: [Data Science Techniques](https://github.com/lei-soares/Data-Science-Techniques)

[Pantufo Dados](www.pantufodados.com)


[Pantufo Dados - YouTube Channel](https://www.youtube.com/@pantufodados)