

# Chapter 2 — Basics of Data and Preprocessing
## Lesson 12: Train/Validation/Test Hygiene (Temporal Splits, Group Splits, Entity Leakage)

This lesson is about making evaluation realistic. The most sophisticated model is not useful if the evaluation overstates real-world performance.

You will work through time-aware splitting, group/entity splitting, leakage forensics, and safe preprocessing patterns. The code is designed for tabular ML workflows common in industry.


### Learning objectives

By the end of this lesson you should be able to:

1. Explain why evaluation requires (approximate) independence between training and testing.
2. Decide when random splits are acceptable and when they are misleading.
3. Implement temporal splits, rolling/expanding validation, and time gaps.
4. Implement group/entity splits and group-aware cross-validation.
5. Detect entity overlap, near-duplicates, and post-outcome features.
6. Prevent preprocessing leakage by construction using pipelines.
7. Separate model selection from final testing (avoid validation leakage).


### Core idea: what your test set is estimating

Let the deployment distribution be $\mathcal{P}$ and the loss be $\ell(\cdot)$. The generalization risk is:

$$R(f) = \mathbb{E}_{(X,Y) \sim \mathcal{P}}[\ell(f(X), Y)].$$

A test set is useful only if it approximates an i.i.d. sample from $\mathcal{P}$ *at the time and granularity of deployment*.
When the split violates that approximation (because of time dependence, repeated entities, or leakage), the test estimate becomes optimistic.

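A compact way to see the optimism is through dependence. Let $\hat f$ be the model fitted on $\mathcal{D}_{\text{train}}$. If $\mathcal{D}_{\text{train}}$ and $\mathcal{D}_{\text{test}}$ are not independent, the test estimate is in general biased:

$$\mathbb{E}[\widehat{R}_{\text{test}}(\hat f)] \ne R(\hat f),$$

because the training procedure can exploit information shared between the two sets. In practice the bias usually runs in the optimistic direction: the estimated risk falls below the true risk.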
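The effect of dependence can be illustrated with a small simulation. The sketch below is a toy setup, not one of the lesson's datasets: labels are pure noise, every item appears twice, and a memorizing 1-nearest-neighbour rule is scored under a row-level split and under an item-level split. All names (`X_items`, `one_nn_accuracy`, and so on) are illustrative assumptions.

```python
import numpy as np

rng = np.random.default_rng(0)

# 400 distinct items with pure-noise labels: no model can truly beat ~50% accuracy.
n_items = 400
X_items = rng.normal(size=(n_items, 5))
y_items = rng.integers(0, 2, size=n_items)

# Each item appears twice (think: two rows for the same entity).
item_id = np.repeat(np.arange(n_items), 2)
X = X_items[item_id]
y = y_items[item_id]

def one_nn_accuracy(train_idx, test_idx):
    """Accuracy of a 1-nearest-neighbour classifier (a model that memorizes)."""
    diff = X[test_idx][:, None, :] - X[train_idx][None, :, :]
    dist = (diff ** 2).sum(axis=2)
    nearest = train_idx[dist.argmin(axis=1)]
    return float((y[nearest] == y[test_idx]).mean())

# Row-level random split: the two copies of an item can land on both sides.
perm = rng.permutation(len(X))
row_train, row_test = perm[:400], perm[400:]

# Item-level split: both copies of an item stay on the same side.
test_items = rng.permutation(n_items)[:200]
in_test = np.isin(item_id, test_items)
grp_train, grp_test = np.where(~in_test)[0], np.where(in_test)[0]

acc_row = one_nn_accuracy(row_train, row_test)
acc_group = one_nn_accuracy(grp_train, grp_test)
print(f"row-level split accuracy : {acc_row:.3f}")
print(f"item-level split accuracy: {acc_group:.3f}")
```

Because the labels carry no signal, any accuracy well above one half under the row-level split comes only from test rows whose twin sits in the training set.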


### Leakage taxonomy

Leakage is any path by which information unavailable at prediction time influences training or evaluation.

- **Temporal leakage:** learning from the future (directly or via statistics that include future rows).
- **Group/entity leakage:** the same entity appears in train and test; the model captures entity-specific signals.
- **Target leakage:** a feature is a direct or indirect proxy for the label (post-outcome).
- **Validation leakage:** hyperparameters or feature choices are tuned by repeatedly checking the test set.

Operational rule: define the prediction timestamp/event, then remove any feature that would be unknown at that moment.
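One way to make the operational rule mechanical is to keep a small catalogue that records when each feature becomes known, then filter by the prediction moment. The sketch below uses an invented loan-default example; the feature names and the two stages are assumptions for illustration and do not come from the datasets in this lesson.

```python
import pandas as pd

# Hypothetical feature catalogue for a loan-default model.
# Names and stages are invented for illustration only.
feature_catalogue = pd.DataFrame(
    {
        "feature": ["income", "loan_amount", "days_past_due", "collections_flag"],
        "available_at": ["application", "application", "after_outcome", "after_outcome"],
    }
)

# Stages ordered in time; the prediction is made at the "application" stage.
stage_order = {"application": 0, "after_outcome": 1}
prediction_stage = "application"

# A feature is allowed only if it is known no later than the prediction stage.
is_known = feature_catalogue["available_at"].map(stage_order) <= stage_order[prediction_stage]
allowed = feature_catalogue.loc[is_known, "feature"].tolist()
dropped = feature_catalogue.loc[~is_known, "feature"].tolist()

print("allowed:", allowed)
print("dropped:", dropped)
```

In a real project the catalogue would typically be filled in with the data owners, since availability timing is rarely visible from the column names alone.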


In [2]:
import numpy as np
import pandas as pd

from sklearn.model_selection import (
    train_test_split, TimeSeriesSplit, GroupShuffleSplit, GroupKFold, KFold
)
from sklearn.compose import ColumnTransformer
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import OneHotEncoder, StandardScaler
from sklearn.impute import SimpleImputer
from sklearn.linear_model import LogisticRegression, Ridge
from sklearn.metrics import accuracy_score, roc_auc_score, r2_score, mean_squared_error
from sklearn.model_selection import GridSearchCV

pd.set_option('display.max_columns', 60)
pd.set_option('display.width', 160)


## 1) A clean evaluation protocol

A robust protocol for most tabular ML projects:

1. **Specify the prediction contract**: what exactly is predicted, when, and with what inputs.
2. **Design the split** to match deployment (random, temporal, group, or a combination).
3. **Lock the test set**: do not use it for iterative decisions.
4. **Use validation/CV on training data** for model selection.
5. **Report uncertainty** (fold variability) when possible.

Three-way splitting can be described as:

$$\mathcal{D} = \mathcal{D}_{\text{train}} \cup \mathcal{D}_{\text{val}} \cup \mathcal{D}_{\text{test}},\quad
\mathcal{D}_{\text{train}} \cap \mathcal{D}_{\text{val}} = \varnothing,\quad
\mathcal{D}_{\text{train}} \cap \mathcal{D}_{\text{test}} = \varnothing,\quad
\mathcal{D}_{\text{val}} \cap \mathcal{D}_{\text{test}} = \varnothing.$$

The key is that “disjoint” must also hold along the dependency structure: every training row must precede the validation and test periods when the task is forward-looking, and entities must not overlap when the goal is generalization to new entities.
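These conditions can be checked programmatically before any model is fitted. The helper below is a minimal sketch; `check_split` and the toy column names `entity` and `t` are hypothetical, and it assumes each split is a pandas DataFrame that keeps the original index.

```python
import pandas as pd

def check_split(train, val, test, group_col=None, time_col=None):
    """Return a dict of hygiene checks for a three-way split of DataFrames."""
    report = {}
    # Row-level disjointness via the DataFrame index.
    report["row_overlap"] = (
        len(train.index.intersection(val.index))
        + len(train.index.intersection(test.index))
        + len(val.index.intersection(test.index))
    )
    if group_col is not None:
        # Count entities that appear in more than one split.
        tr_g, va_g, te_g = (set(d[group_col]) for d in (train, val, test))
        report["group_overlap"] = len(tr_g & va_g) + len(tr_g & te_g) + len(va_g & te_g)
    if time_col is not None:
        # Time order: train strictly before val, val strictly before test.
        report["time_ordered"] = bool(
            train[time_col].max() < val[time_col].min()
            and val[time_col].max() < test[time_col].min()
        )
    return report

toy = pd.DataFrame(
    {
        "entity": ["a", "a", "b", "b", "c", "c"],
        "t": pd.to_datetime(
            ["2024-01-01", "2024-01-02", "2024-01-03", "2024-01-04", "2024-01-05", "2024-01-06"]
        ),
    }
)
report = check_split(toy.iloc[:2], toy.iloc[2:4], toy.iloc[4:], group_col="entity", time_col="t")
print(report)
```

Only the checks that match the deployment question need to pass: a within-entity forecasting split, for example, is expected to share entities across splits.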


### When is a random split acceptable?

A random split is often acceptable when:

- Rows are approximately i.i.d. (no meaningful time ordering, no repeated entities).
- The deployment distribution is stable (no major drift).
- The model will be used on “similar” data to what was collected.

If any of the following is true, random splitting becomes risky:

- Multiple rows per entity.
- Time dependence or seasonality.
- Operational changes (policy shifts, product changes, sensor upgrades).
- Features that are aggregates over time windows.

In such cases, a time-aware or entity-aware split is typically required.
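Two of these risk signals (repeated entities and a long time span) are easy to quantify before choosing a split. The function below is a rough diagnostic sketch under the assumption that the data has one entity column and one datetime column; `random_split_risk` and the toy column names are hypothetical.

```python
import pandas as pd

def random_split_risk(df, entity_col, time_col):
    """Summarize two structural signals that make a random row split risky."""
    rows_per_entity = df.groupby(entity_col).size()
    repeated = rows_per_entity[rows_per_entity > 1]
    return {
        "n_entities": int(rows_per_entity.shape[0]),
        # Share of rows that belong to an entity with more than one row.
        "share_rows_in_repeated_entities": float(repeated.sum() / len(df)),
        # Calendar span covered by the data.
        "time_span_days": int((df[time_col].max() - df[time_col].min()).days),
    }

toy = pd.DataFrame(
    {
        "customer": ["c1", "c1", "c1", "c2", "c3", "c3"],
        "ts": pd.to_datetime(
            ["2024-01-01", "2024-02-01", "2024-03-01", "2024-01-15", "2024-02-10", "2024-03-31"]
        ),
    }
)
risk = random_split_risk(toy, "customer", "ts")
print(risk)
```

A high share of rows in repeated entities points toward a group-aware split; a long time span points toward checking for drift and considering a temporal split. Neither number is a decision rule by itself.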


## 2) Temporal splits

Use temporal splits when the task is forward-looking or the data-generating process changes over time.

A time-respecting holdout is:

$$\text{Train} = \{t \le t_0\},\quad \text{Test} = \{t > t_0\}.$$

If labels are observed with delay or features use time windows, also consider a **gap** to prevent subtle look-ahead.


### Example A: Random split vs time split on consumer complaints

Dataset: `ConsumerComplaints.csv` with `Date Received`.

Task: predict whether a complaint got a timely response (`Timely Response`).
We compare random stratified splitting with a forward-in-time split.


In [3]:
complaints_path = "../../../Datasets/Clustering/ConsumerComplaints.csv"
df = pd.read_csv(complaints_path, low_memory=False)
df.head()

Unnamed: 0,Date Received,Product Name,Sub Product,Issue,Sub Issue,Consumer Complaint Narrative,Company Public Response,Company,State Name,Zip Code,Tags,Consumer Consent Provided,Submitted via,Date Sent to Company,Company Response to Consumer,Timely Response,Consumer Disputed,Complaint ID
0,2013-07-29,Consumer Loan,Vehicle loan,Managing the loan or lease,,,,Wells Fargo & Company,VA,24540,,,Phone,2013-07-30,Closed with explanation,Yes,No,468882
1,2013-07-29,Bank account or service,Checking account,Using a debit or ATM card,,,,Wells Fargo & Company,CA,95992,Older American,,Web,2013-07-31,Closed with explanation,Yes,No,468889
2,2013-07-29,Bank account or service,Checking account,"Account opening, closing, or management",,,,Santander Bank US,NY,10065,,,Fax,2013-07-31,Closed,Yes,No,468879
3,2013-07-29,Bank account or service,Checking account,Deposits and withdrawals,,,,Wells Fargo & Company,GA,30084,,,Web,2013-07-30,Closed with explanation,Yes,No,468949
4,2013-07-29,Mortgage,Conventional fixed mortgage,"Loan servicing, payments, escrow account",,,,Franklin Credit Management,CT,6106,,,Web,2013-07-30,Closed with explanation,Yes,No,475823


In [4]:
# Basic cleanup for the demo
df = df.copy()
df['Date Received'] = pd.to_datetime(df['Date Received'], errors='coerce')
df = df.dropna(subset=['Date Received', 'Timely Response'])

y = (df['Timely Response'].astype(str).str.strip().str.lower() == 'yes').astype(int)
feature_cols = ['Product Name', 'Sub Product', 'Issue', 'Sub Issue', 'Company', 'State Name', 'Submitted via']
X = df[feature_cols]

print('Rows:', len(df))
print('Positive rate:', float(y.mean()))
print('Date range:', df['Date Received'].min().date(), 'to', df['Date Received'].max().date())

Rows: 65499
Positive rate: 0.9772210262752103
Date range: 2013-07-22 to 2015-09-02


In [5]:
categorical_features = feature_cols
preprocess = ColumnTransformer(
    transformers=[
        ('cat', Pipeline(steps=[
            ('imputer', SimpleImputer(strategy='most_frequent')),
            ('onehot', OneHotEncoder(handle_unknown='ignore'))
        ]), categorical_features)
    ],
    remainder='drop'
)

clf = Pipeline(steps=[
    ('preprocess', preprocess),
    ('model', LogisticRegression(max_iter=200))
])

In [6]:
# (1) Random stratified split
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42, stratify=y
)
clf.fit(X_train, y_train)

proba = clf.predict_proba(X_test)[:, 1]
pred = (proba >= 0.5).astype(int)

acc_random = accuracy_score(y_test, pred)
auc_random = roc_auc_score(y_test, proba)
print('Random split accuracy:', round(acc_random, 4))
print('Random split ROC-AUC  :', round(auc_random, 4))

Random split accuracy: 0.9781
Random split ROC-AUC  : 0.9154


In [7]:
# (2) Time-based split: train on earlier 80%, test on later 20%
df_sorted = df.sort_values('Date Received')
X_sorted = df_sorted[feature_cols]
y_sorted = (df_sorted['Timely Response'].astype(str).str.strip().str.lower() == 'yes').astype(int)

cut = int(0.8 * len(df_sorted))
X_train_t, X_test_t = X_sorted.iloc[:cut], X_sorted.iloc[cut:]
y_train_t, y_test_t = y_sorted.iloc[:cut], y_sorted.iloc[cut:]

clf.fit(X_train_t, y_train_t)
proba_t = clf.predict_proba(X_test_t)[:, 1]
pred_t = (proba_t >= 0.5).astype(int)

acc_time = accuracy_score(y_test_t, pred_t)
auc_time = roc_auc_score(y_test_t, proba_t)
print('Time split accuracy:', round(acc_time, 4))
print('Time split ROC-AUC  :', round(auc_time, 4))
print('Train end date:', df_sorted['Date Received'].iloc[cut-1].date())
print('Test start date:', df_sorted['Date Received'].iloc[cut].date())

Time split accuracy: 0.9705
Time split ROC-AUC  : 0.8955
Train end date: 2015-01-20
Test start date: 2015-01-20


### Interpreting the difference

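If the random split score is higher than the time split score (here ROC-AUC 0.9154 vs 0.8955), this is usually not “bad news.” It is evidence that the future is harder than a shuffled snapshot. Accuracy is not informative in this example: with a positive rate of about 0.977, the random-split accuracy (0.9781) is essentially what always predicting “Yes” would score, so ROC-AUC is the number to compare.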

Typical contributors:

- Distribution shift (new products, policy changes).
- Time correlation (nearby dates share context).
- Changing base rates.

If the production system must predict on future periods, a forward-in-time split is the relevant estimate.
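In the run above, the train end date and the test start date are both 2015-01-20, because the cut is made by row position and rows from that day fall on both sides. A possible alternative is to cut on a calendar date. The sketch below uses a toy frame; `split_by_cutoff_date` and the column names are hypothetical, and the resulting train fraction is only approximately `train_frac`.

```python
import pandas as pd

def split_by_cutoff_date(frame, time_col, train_frac=0.8):
    """Split on a calendar date so that no single date lands on both sides."""
    dates = frame[time_col].dt.normalize()
    # Date found at the train_frac position of the time-sorted rows.
    position = max(int(train_frac * len(frame)) - 1, 0)
    cutoff = dates.sort_values().iloc[position]
    train = frame[dates <= cutoff]
    test = frame[dates > cutoff]
    return train, test, cutoff

toy = pd.DataFrame(
    {
        "received": pd.to_datetime(
            ["2024-01-01", "2024-01-02", "2024-01-02", "2024-01-03", "2024-01-03",
             "2024-01-03", "2024-01-04", "2024-01-04", "2024-01-05", "2024-01-06"]
        ),
        "value": range(10),
    }
)
train, test, cutoff = split_by_cutoff_date(toy, "received", train_frac=0.8)
print("cutoff:", cutoff.date())
print("train rows:", len(train), "| test rows:", len(test))
```

All rows dated on the cutoff day go to the training side, so the realized train share can exceed `train_frac` when many rows share that date.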


### Rolling/expanding validation with `TimeSeriesSplit`

When a single holdout is noisy, use rolling validation.

In expanding-window evaluation:

$$\text{Train}_k = \{t \le t_k\},\quad \text{Test}_k = \{t_k < t \le t_{k+1}\}.$$

This matches the operational reality where you train on all history available up to some date.


In [8]:
tscv = TimeSeriesSplit(n_splits=5)
X_ts = X_sorted.reset_index(drop=True)
y_ts = y_sorted.reset_index(drop=True)

aucs = []
for fold, (tr, te) in enumerate(tscv.split(X_ts), start=1):
    clf.fit(X_ts.iloc[tr], y_ts.iloc[tr])
    proba = clf.predict_proba(X_ts.iloc[te])[:, 1]
    auc = roc_auc_score(y_ts.iloc[te], proba)
    aucs.append(auc)
    train_end = df_sorted.iloc[tr[-1]]['Date Received'].date()
    test_start = df_sorted.iloc[te[0]]['Date Received'].date()
    test_end = df_sorted.iloc[te[-1]]['Date Received'].date()
    print(f'Fold {fold}: AUC={auc:.4f} | train_end={train_end} | test={test_start}..{test_end}')

print('Mean AUC:', float(np.mean(aucs)))
print('Std  AUC:', float(np.std(aucs)))

Fold 1: AUC=0.8265 | train_end=2013-12-08 | test=2013-12-08..2014-03-25
Fold 2: AUC=0.8235 | train_end=2014-03-25 | test=2014-03-25..2014-07-08
Fold 3: AUC=0.8747 | train_end=2014-07-08 | test=2014-07-08..2014-10-20
Fold 4: AUC=0.8834 | train_end=2014-10-20 | test=2014-10-20..2015-02-09
Fold 5: AUC=0.8959 | train_end=2015-02-09 | test=2015-02-09..2015-09-02
Mean AUC: 0.8607962707647356
Std  AUC: 0.029990864243419388


### Temporal guardrails: gaps and label delay

If labels occur after some delay (e.g., default after 90 days) or features aggregate over windows, you can inadvertently let information from the label window leak into features.

A simple guardrail is a gap $g$:

$$\text{Train} = \{t \le t_0\},\quad \text{Gap} = (t_0, t_0 + g],\quad \text{Test} = \{t > t_0 + g\}.$$

Choose $g$ to cover the maximum look-ahead horizon embedded in feature definitions.
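For rolling validation, scikit-learn's `TimeSeriesSplit` accepts a `gap` argument that drops rows between each training window and its test window. The sketch below uses synthetic row indices and an arbitrary gap of 2. Note that `gap` is counted in rows, not in time units, so it corresponds to a fixed duration only if rows are evenly spaced in time; for irregular timestamps a date-based gap would be needed instead.

```python
import numpy as np
from sklearn.model_selection import TimeSeriesSplit

n_rows = 20                      # rows are assumed to be sorted by time already
X_demo = np.arange(n_rows).reshape(-1, 1)

gap_rows = 2                     # illustrative value, counted in rows
tscv_gap = TimeSeriesSplit(n_splits=3, gap=gap_rows)

for fold, (tr, te) in enumerate(tscv_gap.split(X_demo), start=1):
    # Rows strictly between the last training row and the first test row.
    skipped = te[0] - tr[-1] - 1
    print(f"fold {fold}: train=0..{tr[-1]} | skipped={skipped} | test={te[0]}..{te[-1]}")
```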


## 3) Group and entity splits

Group splits address dependence caused by repeated entities or shared context.

Two common deployment questions:

- **Generalize to new entities?** (e.g., new patients, new hosts, new devices) → group-disjoint train/test.
- **Forecast future for existing entities?** (e.g., next month for the same customers) → time split within entity, possibly with additional gaps.

The split must match which of these is operationally true.


### A short bias argument

Let $G$ be an entity ID and suppose there is an unobserved entity effect $\alpha_G$.
A simple model is:

$$Y = \beta^\top X + \alpha_G + \epsilon.$$

If train and test share the same $G$, the learner can partially infer $\alpha_G$ from training rows, making prediction easier on test rows with that same entity.
This yields a test estimate that is biased toward the “seen entity” regime.

If the production requirement is performance on unseen entities, enforce disjointness of $G$ across splits.
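The bias argument can be checked numerically. The simulation below is a sketch under the stated model $Y = \beta^\top X + \alpha_G + \epsilon$ with made-up parameter values ($\beta = 1.5$, entity-effect standard deviation 2, noise standard deviation 0.5). The learner is deliberately simple: a pooled linear fit plus a per-entity offset estimated from training residuals, which stands in for any model able to pick up entity-specific signal.

```python
import numpy as np

rng = np.random.default_rng(0)

n_groups, rows_per_group = 200, 10
g = np.repeat(np.arange(n_groups), rows_per_group)
alpha = rng.normal(scale=2.0, size=n_groups)          # unobserved entity effect
x = rng.normal(size=g.size)
y = 1.5 * x + alpha[g] + rng.normal(scale=0.5, size=g.size)

def fit_predict_mse(train, test):
    """Pooled linear fit plus a per-entity offset learned from training residuals."""
    slope, intercept = np.polyfit(x[train], y[train], deg=1)
    resid = y[train] - (intercept + slope * x[train])
    # Mean training residual per entity; entities unseen in training get 0.
    sums = np.bincount(g[train], weights=resid, minlength=n_groups)
    counts = np.bincount(g[train], minlength=n_groups)
    offset = np.divide(sums, counts, out=np.zeros(n_groups), where=counts > 0)
    pred = intercept + slope * x[test] + offset[g[test]]
    return float(np.mean((y[test] - pred) ** 2))

# Random row split: almost every test entity also has training rows.
rows = rng.permutation(g.size)
n_test = g.size // 5
mse_random = fit_predict_mse(rows[n_test:], rows[:n_test])

# Entity-disjoint split: 20% of entities are held out entirely.
held_out = rng.permutation(n_groups)[: n_groups // 5]
is_test = np.isin(g, held_out)
mse_group = fit_predict_mse(np.where(~is_test)[0], np.where(is_test)[0])

print(f"MSE, random row split     : {mse_random:.2f}")
print(f"MSE, entity-disjoint split: {mse_group:.2f}")
```

Under the entity-disjoint split the offset is unavailable for every test row, so the error absorbs the full variance of $\alpha_G$. The random split hides that cost.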


### Example B: Host-level leakage in listings data

Dataset: `listings.csv`. Entity: `host_id`.
Task: predict `room_type`.


In [9]:
listings_path = "../../../Datasets/Regression/listings.csv"
ldf = pd.read_csv(listings_path, low_memory=False)
ldf.head()

Unnamed: 0,id,name,host_id,host_name,neighbourhood_group,neighbourhood,latitude,longitude,room_type,price,minimum_nights,number_of_reviews,last_review,reviews_per_month,calculated_host_listings_count,availability_365,number_of_reviews_ltm,license
0,13913,Holiday London DB Room Let-on going,54730,Alina,,Islington,51.56861,-0.1127,Private room,57.0,1,51,2025-02-09,0.29,3,344,10,
1,15400,Bright Chelsea Apartment. Chelsea!,60302,Philippa,,Kensington and Chelsea,51.4878,-0.16813,Entire home/apt,,4,96,2024-04-28,0.52,1,11,2,
2,17402,Very Central Modern 3-Bed/2 Bath By Oxford St W1,67564,Liz,,Westminster,51.52195,-0.14094,Entire home/apt,510.0,3,56,2024-02-19,0.33,5,293,0,
3,24328,Battersea live/work artist house,41759,Joe,,Wandsworth,51.47072,-0.16266,Entire home/apt,213.0,90,94,2022-07-19,0.54,1,194,0,
4,31036,Bright compact 1 Bedroom Apartment Brick Lane,133271,Hendryks,,Tower Hamlets,51.52425,-0.06997,Entire home/apt,100.0,2,126,2025-02-20,0.7,8,353,3,


In [10]:
ldf = ldf.copy()
ldf = ldf.dropna(subset=['host_id', 'room_type'])

features = ['neighbourhood', 'latitude', 'longitude', 'price', 'minimum_nights', 'number_of_reviews', 'reviews_per_month', 'availability_365']
for col in ['price', 'reviews_per_month']:
    ldf[col] = pd.to_numeric(ldf[col], errors='coerce')

X2 = ldf[features]
y2 = ldf['room_type'].astype(str)
groups = ldf['host_id'].astype(str)

print('Rows:', len(ldf))
print('Unique hosts:', groups.nunique())
print('Class counts (top):')
print(y2.value_counts().head())

Rows: 94559
Unique hosts: 55395
Class counts (top):
room_type
Entire home/apt    60750
Private room       33487
Shared room          164
Hotel room           158
Name: count, dtype: int64


In [11]:
num_features = ['latitude', 'longitude', 'price', 'minimum_nights', 'number_of_reviews', 'reviews_per_month', 'availability_365']
cat_features = ['neighbourhood']

preprocess2 = ColumnTransformer(
    transformers=[
        ('num', Pipeline(steps=[('imputer', SimpleImputer(strategy='median')),
                               ('scaler', StandardScaler())]), num_features),
        ('cat', Pipeline(steps=[('imputer', SimpleImputer(strategy='most_frequent')),
                               ('onehot', OneHotEncoder(handle_unknown='ignore'))]), cat_features)
    ],
    remainder='drop'
)

clf2 = Pipeline(steps=[
    ('preprocess', preprocess2),
    ('model', LogisticRegression(max_iter=300))
])

In [12]:
# Random split
X_tr, X_te, y_tr, y_te, g_tr, g_te = train_test_split(
    X2, y2, groups, test_size=0.2, random_state=42, stratify=y2
)
clf2.fit(X_tr, y_tr)
pred = clf2.predict(X_te)
acc = accuracy_score(y_te, pred)

shared_hosts = len(set(g_tr) & set(g_te))
print('Random split accuracy:', round(acc, 4))
print('Hosts shared train/test:', shared_hosts)
print('Train hosts:', len(set(g_tr)), '| Test hosts:', len(set(g_te)))

Random split accuracy: 0.7474
Hosts shared train/test: 5126
Train hosts: 46337 | Test hosts: 14184


In [13]:
# Group split (host-disjoint)
gss = GroupShuffleSplit(n_splits=1, test_size=0.2, random_state=42)
tr_idx, te_idx = next(gss.split(X2, y2, groups=groups))

X_trg, X_teg = X2.iloc[tr_idx], X2.iloc[te_idx]
y_trg, y_teg = y2.iloc[tr_idx], y2.iloc[te_idx]
g_trg, g_teg = groups.iloc[tr_idx], groups.iloc[te_idx]

clf2.fit(X_trg, y_trg)
pred_g = clf2.predict(X_teg)
acc_g = accuracy_score(y_teg, pred_g)

shared_hosts_g = len(set(g_trg) & set(g_teg))
print('Group split accuracy:', round(acc_g, 4))
print('Hosts shared train/test:', shared_hosts_g)
print('Train hosts:', len(set(g_trg)), '| Test hosts:', len(set(g_teg)))

Group split accuracy: 0.7529
Hosts shared train/test: 0
Train hosts: 44316 | Test hosts: 11079


### Leakage forensics: overlap and near-duplicates

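In the run above, the host-disjoint split did not score lower than the random split (0.7529 vs 0.7474), which suggests that shared hosts were not inflating accuracy for this particular feature set. Overlap checks remain useful as a diagnostic, because a single split cannot rule out leakage. Even when entity IDs are disjoint, near-duplicates can leak information (for example, the same item replicated with small edits).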

A simple, practical check is to hash a subset of stable columns and measure how many hashes appear in both train and test.
This is not perfect, but it often catches obvious duplication.


In [14]:
def overlap_count(a, b):
    return len(set(a) & set(b))

# Create a rough signature for listings (you can adjust columns depending on your data)
sig_cols = ['neighbourhood', 'latitude', 'longitude', 'minimum_nights']
sig = X2[sig_cols].copy()
sig['latitude'] = sig['latitude'].round(5)
sig['longitude'] = sig['longitude'].round(5)
signature = pd.util.hash_pandas_object(sig, index=False)

# Compare duplication under random split vs group split.
# X_tr / X_te carry index labels, so select with .loc (label-based), not .iloc.
sig_tr = signature.loc[X_tr.index]
sig_te = signature.loc[X_te.index]
dup_random = overlap_count(sig_tr, sig_te)

sig_trg = signature.iloc[tr_idx]
sig_teg = signature.iloc[te_idx]
dup_group = overlap_count(sig_trg, sig_teg)

print('Approx duplicate signatures (random split):', dup_random)
print('Approx duplicate signatures (group split) :', dup_group)

Approx duplicate signatures (random split): 556
Approx duplicate signatures (group split) : 59


### Group-aware cross-validation with `GroupKFold`

When you need multiple folds while keeping entities disjoint, use `GroupKFold`.
This produces variance estimates and reduces dependence on one arbitrary split.


In [15]:
gkf = GroupKFold(n_splits=5)
accs = []
for fold, (tr, te) in enumerate(gkf.split(X2, y2, groups=groups), start=1):
    clf2.fit(X2.iloc[tr], y2.iloc[tr])
    pred = clf2.predict(X2.iloc[te])
    acc = accuracy_score(y2.iloc[te], pred)
    accs.append(acc)
    shared = len(set(groups.iloc[tr]) & set(groups.iloc[te]))
    print(f'Fold {fold}: acc={acc:.4f} | shared hosts={shared} | test hosts={groups.iloc[te].nunique()}')

print('Mean acc:', float(np.mean(accs)))
print('Std  acc:', float(np.std(accs)))

Fold 1: acc=0.7404 | shared hosts=0 | test hosts=11078
Fold 2: acc=0.7416 | shared hosts=0 | test hosts=11079
Fold 3: acc=0.7348 | shared hosts=0 | test hosts=11079
Fold 4: acc=0.7481 | shared hosts=0 | test hosts=11080
Fold 5: acc=0.7515 | shared hosts=0 | test hosts=11079
Mean acc: 0.7432714839285025
Std  acc: 0.00591220648589423


## 4) Combined structure: time + entity (panel data)

Many real datasets are **panel data**: repeated measurements for each entity over time (stores over weeks, states over years, machines over cycles).

In panel data you often must decide between two different evaluation targets:

- **Unseen-entity generalization:** predict for entities never seen before.
- **Within-entity forecasting:** predict future for entities already observed historically.

These require different splits, and the performance numbers answer different questions.


### Example C: Education panel data (`states_all.csv`)

Dataset: `states_all.csv` with columns `STATE` and `YEAR`.

Task: predict `AVG_MATH_8_SCORE` from funding/expenditure variables.

We will compare three splits:

1. Random row split (often optimistic because the same state appears in train and test).
2. Group split by `STATE` (unseen states).
3. Within-state temporal split (forecast future years for the same states).


In [16]:
states_path = "../../../Datasets/Regression/states_all.csv"
sdf = pd.read_csv(states_path, low_memory=False)
sdf.head()

Unnamed: 0,PRIMARY_KEY,STATE,YEAR,ENROLL,TOTAL_REVENUE,FEDERAL_REVENUE,STATE_REVENUE,LOCAL_REVENUE,TOTAL_EXPENDITURE,INSTRUCTION_EXPENDITURE,SUPPORT_SERVICES_EXPENDITURE,OTHER_EXPENDITURE,CAPITAL_OUTLAY_EXPENDITURE,GRADES_PK_G,GRADES_KG_G,GRADES_4_G,GRADES_8_G,GRADES_12_G,GRADES_1_8_G,GRADES_9_12_G,GRADES_ALL_G,AVG_MATH_4_SCORE,AVG_MATH_8_SCORE,AVG_READING_4_SCORE,AVG_READING_8_SCORE
0,1992_ALABAMA,ALABAMA,1992,,2678885.0,304177.0,1659028.0,715680.0,2653798.0,1481703.0,735036.0,,174053.0,8224.0,55460.0,57948.0,58025.0,41167.0,,,731634.0,208.0,252.0,207.0,
1,1992_ALASKA,ALASKA,1992,,1049591.0,106780.0,720711.0,222100.0,972488.0,498362.0,350902.0,,37451.0,2371.0,10152.0,9748.0,8789.0,6714.0,,,122487.0,,,,
2,1992_ARIZONA,ARIZONA,1992,,3258079.0,297888.0,1369815.0,1590376.0,3401580.0,1435908.0,1007732.0,,609114.0,2544.0,53497.0,55433.0,49081.0,37410.0,,,673477.0,215.0,265.0,209.0,
3,1992_ARKANSAS,ARKANSAS,1992,,1711959.0,178571.0,958785.0,574603.0,1743022.0,964323.0,483488.0,,145212.0,808.0,33511.0,34632.0,36011.0,27651.0,,,441490.0,210.0,256.0,211.0,
4,1992_CALIFORNIA,CALIFORNIA,1992,,26260025.0,2072470.0,16546514.0,7641041.0,27138832.0,14358922.0,8520926.0,,2044688.0,59067.0,431763.0,418418.0,363296.0,270675.0,,,5254844.0,208.0,261.0,202.0,


In [17]:
sdf = sdf.copy()
sdf['YEAR'] = pd.to_numeric(sdf['YEAR'], errors='coerce')
sdf = sdf.dropna(subset=['STATE', 'YEAR', 'AVG_MATH_8_SCORE'])

target = 'AVG_MATH_8_SCORE'
group_col = 'STATE'

num_cols = [
    'ENROLL', 'TOTAL_REVENUE', 'FEDERAL_REVENUE', 'STATE_REVENUE', 'LOCAL_REVENUE',
    'TOTAL_EXPENDITURE', 'INSTRUCTION_EXPENDITURE', 'SUPPORT_SERVICES_EXPENDITURE', 'CAPITAL_OUTLAY_EXPENDITURE'
]
X4 = sdf[num_cols].apply(pd.to_numeric, errors='coerce')
y4 = pd.to_numeric(sdf[target], errors='coerce')
g4 = sdf[group_col].astype(str)
t4 = sdf['YEAR'].astype(int)

mask = y4.notna()
X4, y4, g4, t4 = X4.loc[mask], y4.loc[mask], g4.loc[mask], t4.loc[mask]

print('Rows:', len(X4))
print('States:', g4.nunique())
print('Year range:', int(t4.min()), 'to', int(t4.max()))

Rows: 602
States: 53
Year range: 1990 to 2019


In [18]:
reg_pipe = Pipeline(steps=[
    ('imputer', SimpleImputer(strategy='median')),
    ('scaler', StandardScaler()),
    ('model', Ridge())
])

def eval_regression(y_true, y_pred):
    rmse = mean_squared_error(y_true, y_pred) ** 0.5
    r2 = r2_score(y_true, y_pred)
    return rmse, r2


In [19]:
# (1) Random row split
X_tr, X_te, y_tr, y_te, g_tr, g_te, t_tr, t_te = train_test_split(
    X4, y4, g4, t4, test_size=0.2, random_state=42
)
reg_pipe.fit(X_tr, y_tr)
pred = reg_pipe.predict(X_te)
rmse, r2 = eval_regression(y_te, pred)
print('Random row split | RMSE:', round(rmse, 3), '| R^2:', round(r2, 3))
print('Shared states:', len(set(g_tr) & set(g_te)))
print('Shared years  :', len(set(t_tr) & set(t_te)))

Random row split | RMSE: 9.193 | R^2: 0.092
Shared states: 48
Shared years  : 12


In [20]:
# (2) Group split by STATE (unseen states)
gss = GroupShuffleSplit(n_splits=1, test_size=0.2, random_state=42)
tr_idx, te_idx = next(gss.split(X4, y4, groups=g4))
X_trg, X_teg = X4.iloc[tr_idx], X4.iloc[te_idx]
y_trg, y_teg = y4.iloc[tr_idx], y4.iloc[te_idx]
g_trg, g_teg = g4.iloc[tr_idx], g4.iloc[te_idx]

reg_pipe.fit(X_trg, y_trg)
pred = reg_pipe.predict(X_teg)
rmse, r2 = eval_regression(y_teg, pred)
print('Group split (STATE) | RMSE:', round(rmse, 3), '| R^2:', round(r2, 3))
print('Shared states:', len(set(g_trg) & set(g_teg)))

Group split (STATE) | RMSE: 9.889 | R^2: 0.164
Shared states: 0


In [21]:
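# (3) Within-state temporal split: for each state, hold out its most recent years as test.
# Note: the cut below keeps strictly fewer than 20% of a state's years in test
# (e.g., 1 of 10), and a state with exactly 5 years contributes no test rows.
sdf_work = pd.DataFrame({'state': g4.values, 'year': t4.values})
sdf_work['idx'] = np.arange(len(sdf_work))

test_mask = np.zeros(len(sdf_work), dtype=bool)
for state, sub in sdf_work.groupby('state'):
    years = np.sort(sub['year'].unique())
    if len(years) < 5:
        # Too few observed years to split; keep the whole state in train.
        continue
    # Year sitting at the 80% position of this state's sorted years.
    cut_year = years[int(np.floor(0.8 * len(years)))]
    # test = years strictly greater than cut_year
    sub_idx = sub.loc[sub['year'] > cut_year, 'idx'].values
    test_mask[sub_idx] = True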

tr_idx = np.where(~test_mask)[0]
te_idx = np.where(test_mask)[0]

X_trt, X_tet = X4.iloc[tr_idx], X4.iloc[te_idx]
y_trt, y_tet = y4.iloc[tr_idx], y4.iloc[te_idx]

reg_pipe.fit(X_trt, y_trt)
pred = reg_pipe.predict(X_tet)
rmse, r2 = eval_regression(y_tet, pred)
print('Within-state time split | RMSE:', round(rmse, 3), '| R^2:', round(r2, 3))

# Verify time ordering holds within each state
violations = 0
for state, sub in sdf_work.groupby('state'):
    tr_years = t4.iloc[tr_idx][g4.iloc[tr_idx] == state]
    te_years = t4.iloc[te_idx][g4.iloc[te_idx] == state]
    if len(tr_years) and len(te_years) and tr_years.max() >= te_years.min():
        violations += 1
print('States with time-order violations:', violations)

Within-state time split | RMSE: 8.322 | R^2: -0.604
States with time-order violations: 0


### Reading panel split results

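These three numbers answer different questions:

- Random row split: “How well can we predict if we mix states and years?” (often optimistic).
- Group split by state: “How well do we generalize to entirely new states?” (typically harder).
- Within-state time split: “How well do we forecast future years for the same states?” (deployment-like for tracking existing entities).

In this run the expected ordering is not clean: RMSE is 9.193 (random), 9.889 (group), and 8.322 (time), while $R^2$ is 0.092, 0.164, and -0.604. $R^2$ is measured against the variance of each test set's own target, so it is not directly comparable across different test sets, and with 602 rows and a single split per scheme the differences are noisy.

Choosing the wrong split can lead to the wrong product decision.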


## 5) Preprocessing hygiene: fit on train only

Preprocessing steps (imputation, scaling, encoding) estimate parameters from data.

For example, standardization uses:

$$\tilde{x} = \frac{x - \mu}{\sigma},\quad \mu = \frac{1}{n}\sum_{i=1}^n x_i,\quad \sigma^2 = \frac{1}{n}\sum_{i=1}^n (x_i - \mu)^2.$$

If $\mu$ and $\sigma$ are computed using test rows, then the training procedure has indirectly used test information.

Pipelines ensure that parameter estimation is nested inside the split.
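The nesting can be verified directly by inspecting a fitted pipeline. The sketch below uses synthetic data (an assumption, not one of the lesson's datasets) and checks that the scaler inside the pipeline holds the training-row mean rather than the full-data mean.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score, train_test_split
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(0)
X_demo = rng.normal(loc=5.0, scale=3.0, size=(300, 4))
y_demo = (X_demo[:, 0] + rng.normal(size=300) > 5.0).astype(int)

X_tr_demo, X_te_demo, y_tr_demo, y_te_demo = train_test_split(
    X_demo, y_demo, test_size=0.25, random_state=0
)

demo_pipe = Pipeline([("scaler", StandardScaler()), ("lr", LogisticRegression(max_iter=500))])
demo_pipe.fit(X_tr_demo, y_tr_demo)

# The fitted scaler holds statistics of the training rows only.
fitted_mean = demo_pipe.named_steps["scaler"].mean_
print("matches train mean:", np.allclose(fitted_mean, X_tr_demo.mean(axis=0)))
print("matches full mean :", np.allclose(fitted_mean, X_demo.mean(axis=0)))

# Inside cross-validation the pipeline is cloned and refit per fold,
# so each fold's scaler sees only that fold's training portion.
scores = cross_val_score(demo_pipe, X_tr_demo, y_tr_demo, cv=5, scoring="roc_auc")
print("number of fold scores:", len(scores))
```

The same mechanism applies to imputers and encoders placed inside the pipeline.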


### Example D: Scaling leakage on the diabetes dataset

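Task: predict `classification` from numeric features.
We compare a leaky scaling pattern with the correct pattern. In the run shown here all three variants score the same ROC-AUC (0.832): for a simple scaler on a few hundred rows, including the test rows barely moves the estimates of $\mu$ and $\sigma$. The pattern still matters, because the same mistake with target-dependent transforms can inflate scores substantially.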


In [22]:
diabetes_path = "../../../Datasets/Classification/diabetes.csv"
ddf = pd.read_csv(diabetes_path)
ddf.head()

Unnamed: 0,Pregnancies,Glucose,BloodPressure,SkinThickness,Insulin,BMI,DiabetesPedigreeFunction,Age,classification
0,6,148,72,35,0,33.6,0.627,50,Diabetic
1,1,85,66,29,0,26.6,0.351,31,Non-Diabetic
2,8,183,64,0,0,23.3,0.672,32,Diabetic
3,1,89,66,23,94,28.1,0.167,21,Non-Diabetic
4,0,137,40,35,168,43.1,2.288,33,Diabetic


In [23]:
ddf = ddf.copy()
y5 = (ddf['classification'].astype(str).str.strip().str.lower() == 'diabetic').astype(int)
X5 = ddf.drop(columns=['classification'])

X_tr, X_te, y_tr, y_te = train_test_split(X5, y5, test_size=0.25, random_state=42, stratify=y5)
print('Train size:', len(X_tr), '| Test size:', len(X_te))

Train size: 576 | Test size: 192


In [24]:
# Leaky: fit scaler on all data
scaler_bad = StandardScaler().fit(X5)
X_tr_bad = scaler_bad.transform(X_tr)
X_te_bad = scaler_bad.transform(X_te)

m_bad = LogisticRegression(max_iter=600)
m_bad.fit(X_tr_bad, y_tr)
auc_bad = roc_auc_score(y_te, m_bad.predict_proba(X_te_bad)[:, 1])
print('ROC-AUC (leaky scaling):', round(auc_bad, 4))

ROC-AUC (leaky scaling): 0.832


In [25]:
# Correct: fit scaler on train only
scaler_ok = StandardScaler().fit(X_tr)
X_tr_ok = scaler_ok.transform(X_tr)
X_te_ok = scaler_ok.transform(X_te)

m_ok = LogisticRegression(max_iter=600)
m_ok.fit(X_tr_ok, y_tr)
auc_ok = roc_auc_score(y_te, m_ok.predict_proba(X_te_ok)[:, 1])
print('ROC-AUC (correct scaling):', round(auc_ok, 4))

ROC-AUC (correct scaling): 0.832


In [26]:
# Best practice: Pipeline
pipe = Pipeline([('scaler', StandardScaler()), ('lr', LogisticRegression(max_iter=600))])
pipe.fit(X_tr, y_tr)
auc_pipe = roc_auc_score(y_te, pipe.predict_proba(X_te)[:, 1])
print('ROC-AUC (pipeline):', round(auc_pipe, 4))

ROC-AUC (pipeline): 0.832


## 6) Target encoding: avoiding leakage via cross-fitting

Target encoding can be powerful for high-cardinality categoricals, but it is also a high-risk leakage vector.

Naive target encoding uses the entire dataset to compute category means:

$$\widehat{m}(c) = \frac{1}{|\{i : C_i=c\}|}\sum_{i:C_i=c} Y_i.$$

If test rows contribute to $\widehat{m}(c)$, the evaluation becomes optimistic.

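Cross-fitting encodes each training row using category statistics computed from folds that exclude that row, and encodes test rows using statistics from the training set only. Every encoded value the model sees is therefore out-of-sample with respect to its own label, which keeps the evaluation honest.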


In [27]:
def target_encode_crossfit(train_col: pd.Series, y: pd.Series, n_splits: int = 5, smoothing: float = 20.0, random_state: int = 42):
    """Cross-fitted target encoding for a single categorical column.
    Returns: encoded_train (aligned to train_col index), enc_map (means on full train), global_mean
    """
    kf = KFold(n_splits=n_splits, shuffle=True, random_state=random_state)
    global_mean = float(y.mean())
    enc = pd.Series(index=train_col.index, dtype=float)
    for tr_idx, te_idx in kf.split(train_col):
        tr_c = train_col.iloc[tr_idx]
        tr_y = y.iloc[tr_idx]
        stats = tr_y.groupby(tr_c).agg(['mean', 'count'])
        smooth = (stats['count'] * stats['mean'] + smoothing * global_mean) / (stats['count'] + smoothing)
        te_c = train_col.iloc[te_idx]
        enc.iloc[te_idx] = te_c.map(smooth).fillna(global_mean).astype(float)
    full_stats = y.groupby(train_col).agg(['mean', 'count'])
    full_smooth = (full_stats['count'] * full_stats['mean'] + smoothing * global_mean) / (full_stats['count'] + smoothing)
    return enc, full_smooth, global_mean

def target_encode_apply(col: pd.Series, enc_map: pd.Series, global_mean: float):
    return col.map(enc_map).fillna(global_mean).astype(float)


In [28]:
df_small = df_sorted.tail(60000).copy()
df_small = df_small.dropna(subset=['Company', 'Timely Response'])
y6 = (df_small['Timely Response'].astype(str).str.strip().str.lower() == 'yes').astype(int)
X6 = df_small[['Company', 'State Name', 'Submitted via', 'Product Name']].copy()

X_tr, X_te, y_tr, y_te = train_test_split(X6, y6, test_size=0.2, random_state=42, stratify=y6)
print('Train rows:', len(X_tr), '| Test rows:', len(X_te))
print('Unique companies in train:', X_tr['Company'].nunique())

Train rows: 48000 | Test rows: 12000
Unique companies in train: 1710


In [29]:
# Leaky encoding (do NOT use in real work)
global_mean_all = float(pd.concat([y_tr, y_te]).mean())
tmp = pd.DataFrame({'Company': pd.concat([X_tr['Company'], X_te['Company']]).values,
                    'y': pd.concat([y_tr, y_te]).values})
means_all = tmp.groupby('Company')['y'].mean()

X_tr_leak = X_tr.copy(); X_te_leak = X_te.copy()
X_tr_leak['Company_te'] = X_tr_leak['Company'].map(means_all).fillna(global_mean_all)
X_te_leak['Company_te'] = X_te_leak['Company'].map(means_all).fillna(global_mean_all)

pre = ColumnTransformer([
    ('num', 'passthrough', ['Company_te']),
    ('cat', OneHotEncoder(handle_unknown='ignore'), ['State Name', 'Submitted via', 'Product Name']),
])
m = Pipeline([('pre', pre), ('lr', LogisticRegression(max_iter=300))])
m.fit(X_tr_leak[['Company_te', 'State Name', 'Submitted via', 'Product Name']], y_tr)
auc_leaky = roc_auc_score(y_te, m.predict_proba(X_te_leak[['Company_te', 'State Name', 'Submitted via', 'Product Name']])[:, 1])
print('ROC-AUC (leaky target encoding):', round(auc_leaky, 4))

ROC-AUC (leaky target encoding): 0.9417


In [30]:
# Cross-fitted encoding (safe)
enc_tr, enc_map, gmean = target_encode_crossfit(X_tr['Company'], y_tr, n_splits=5, smoothing=50.0)
X_tr_safe = X_tr.copy(); X_te_safe = X_te.copy()
X_tr_safe['Company_te'] = enc_tr
X_te_safe['Company_te'] = target_encode_apply(X_te_safe['Company'], enc_map, gmean)

m2 = Pipeline([('pre', pre), ('lr', LogisticRegression(max_iter=300))])
m2.fit(X_tr_safe[['Company_te', 'State Name', 'Submitted via', 'Product Name']], y_tr)
auc_safe = roc_auc_score(y_te, m2.predict_proba(X_te_safe[['Company_te', 'State Name', 'Submitted via', 'Product Name']])[:, 1])
print('ROC-AUC (cross-fitted target encoding):', round(auc_safe, 4))

ROC-AUC (cross-fitted target encoding): 0.8669


## 7) Validation leakage and honest model selection

A common process failure is to iterate on the model while repeatedly checking the test set.
This makes the test set act like a validation set, and the final reported metric becomes optimistic.

A safer pattern:

1. Hold out a test set (time-aware or group-aware).
2. On the remaining data, run cross-validation (or a validation split) for hyperparameter search.
3. Fit the selected model on all non-test data.
4. Evaluate once on the test set.

Below is a small example using the diabetes dataset, showing a grid search that never touches the test set (features or labels) during selection.


In [31]:
from sklearn.linear_model import LogisticRegression

ddf = pd.read_csv(diabetes_path)
y7 = (ddf['classification'].astype(str).str.strip().str.lower() == 'diabetic').astype(int)
X7 = ddf.drop(columns=['classification'])

X_train, X_test, y_train, y_test = train_test_split(X7, y7, test_size=0.25, random_state=42, stratify=y7)

pipe = Pipeline([
    ('scaler', StandardScaler()),
    ('lr', LogisticRegression(max_iter=800))
])

param_grid = {
    'lr__C': [0.05, 0.1, 0.3, 1.0, 3.0, 10.0],
    'lr__penalty': ['l2'],
    'lr__solver': ['lbfgs'],
}

search = GridSearchCV(pipe, param_grid=param_grid, scoring='roc_auc', cv=5, n_jobs=None)
search.fit(X_train, y_train)
print('Best CV AUC:', round(search.best_score_, 4))
print('Best params:', search.best_params_)

# Single final evaluation on the held-out test set
best_model = search.best_estimator_
test_auc = roc_auc_score(y_test, best_model.predict_proba(X_test)[:, 1])
print('Test AUC:', round(test_auc, 4))

Best CV AUC: 0.8249
Best params: {'lr__C': 0.05, 'lr__penalty': 'l2', 'lr__solver': 'lbfgs'}
Test AUC: 0.8301
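
When rows are grouped by entity, the same pattern can be kept group-aware at both levels: an entity-disjoint test holdout, then entity-disjoint folds inside the search. The sketch below uses synthetic data with 60 made-up entities; `GridSearchCV.fit` forwards `groups` to the `GroupKFold` splitter.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import GridSearchCV, GroupKFold, GroupShuffleSplit
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(0)
n_rows = 600
groups_demo = rng.integers(0, 60, size=n_rows)           # 60 synthetic entities
X_demo = rng.normal(size=(n_rows, 5))
y_demo = (X_demo[:, 0] + 0.5 * rng.normal(size=n_rows) > 0).astype(int)

# 1) Entity-disjoint test holdout.
outer = GroupShuffleSplit(n_splits=1, test_size=0.25, random_state=0)
dev_idx, test_idx = next(outer.split(X_demo, y_demo, groups=groups_demo))

# 2) Entity-disjoint folds for the hyperparameter search, on development rows only.
pipe_demo = Pipeline([("scaler", StandardScaler()), ("lr", LogisticRegression(max_iter=500))])
candidate_C = [0.1, 1.0, 10.0]
search_demo = GridSearchCV(
    pipe_demo,
    param_grid={"lr__C": candidate_C},
    scoring="roc_auc",
    cv=GroupKFold(n_splits=4),
)
search_demo.fit(X_demo[dev_idx], y_demo[dev_idx], groups=groups_demo[dev_idx])

# 3) One evaluation on the untouched test entities (uses the refit best model).
test_score = search_demo.score(X_demo[test_idx], y_demo[test_idx])
print("best C:", search_demo.best_params_["lr__C"])
print("test ROC-AUC:", round(test_score, 3))
```

For forward-looking tasks the inner `cv` could be a `TimeSeriesSplit` instead, with the test holdout taken from the latest period.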


## 8) Practical checklist

Before trusting an evaluation, verify:

### Split design

1. Prediction timestamp/event is defined.
2. Split matches deployment:
   - forward-in-time prediction → temporal split / rolling CV
   - new entities → group split
   - panel forecasting → within-entity time split (and consider a time gap)

### Leakage defenses

3. No preprocessing leakage: use pipelines.
4. No target leakage: exclude post-outcome variables and proxies.
5. No validation leakage: test set is evaluated once.

### Forensics

6. Entity overlap counts are zero when they should be.
7. Near-duplicate overlap is low.
8. Performance stability across folds is acceptable.


## 9) Exercises

1. In the complaints dataset, create a time split at 70/30 and compare with 80/20 and 90/10.
2. In listings, change the task to `availability_365 > 200` and compare random vs host-disjoint splits.
3. In the panel data (`states_all.csv`), try predicting `AVG_READING_8_SCORE` instead of math.
4. Add a time gap to the panel split (e.g., skip one year between train and test) and observe the impact.
5. For the complaints dataset, identify at least three columns that are post-outcome and should not be used.


### Key takeaway

Split design is part of modeling. If you want reliable performance, you must evaluate under the same dependency structure and information constraints that exist in production.
