# Module 1.12: What's Known at Forecast Time (Leakage Guardrails)

> **Goal:** Spot and prevent leakage systematically.
- "Features for time t must be computable at or before t."
- "Leakage is the silent killer of forecasting projects."

---

## Prerequisites

**Inputs:**
- `./output/m5_weekly_clean.parquet` (from Module 1.9)

**What this module produces:**
- Leakage checklist framework
- Feature timeline audit
- Safe lag feature functions

**Data Flow:**
```
Module 1.11 (pattern segmentation)
    → Module 1.12 (leakage guardrails) ← YOU ARE HERE
        → Module 1.13 (data quality checks)
```

---

## 1. What is Leakage?

**Leakage** occurs when information that would not be available at forecast time (typically information from the future) "leaks" into your features or training data.

### The Core Rule

> **At forecast time t, you can only use information available at or before time t.**

### Why Leakage is Dangerous

| Backtest | Production |
|----------|------------|
| Model sees future info | Future info unavailable |
| Metrics look great | Metrics crash |
| "This model is amazing!" | "What went wrong?" |

### Types of Leakage

| Type | Example | Fix |
|------|---------|-----|
| **Direct** | Using `y[t+1]` to predict `y[t+1]` | Remove future target |
| **Indirect** | Using "actual promo lift" | Use planned promo only |
| **Via aggregates** | Monthly average includes periods after t | Aggregate only over past values in the window |
| **Via joins** | Join realized prices | Join planned prices |

## 2. Setup

In [1]:
import pandas as pd
import numpy as np
from datetime import datetime, timedelta
from pathlib import Path
import tsforge as tsf
import warnings
warnings.filterwarnings('ignore')

Path('./output').mkdir(exist_ok=True)

# Load data
df = pd.read_parquet('./output/m5_weekly_clean.parquet')

# Extract metadata
df['store_id'] = df['unique_id'].str.extract(r'_([A-Z]{2}_\d+)$')[0]
df['dept_id'] = df['unique_id'].str.extract(r'^([A-Z]+_\d+)')[0]
df['cat_id'] = df['unique_id'].str.extract(r'^([A-Z]+)')[0]

print(f"Loaded {len(df):,} rows, {df['unique_id'].nunique():,} series")
print(f"Date range: {df['ds'].min().date()} to {df['ds'].max().date()}")

Loaded 6,848,887 rows, 30,490 series
Date range: 2011-01-29 to 2016-06-25


## 3. The Feature Framework

### Classify Every Feature

For each feature, determine: **When is this information available?**

| Availability | Description | Examples | Safe? |
|--------------|-------------|----------|-------|
| **Static** | Never changes | item_id, category | ✅ Yes |
| **Known ahead** | Scheduled/planned | holidays, day-of-week | ✅ Yes |
| **Known at t** | Available at forecast time | lag features | ✅ If done right |
| **Known after t** | Only after the fact | actual sales, actual promo lift | ❌ No |

In [None]:
feature_audit = tsf.classify_features()
display(feature_audit)

Feature Timeline Audit

✅ Safe features: 13
⚠️ Caution features: 4



| | feature | availability | safe | notes |
|---|---------|--------------|------|-------|
| 0 | unique_id | static | True | Identifier - never changes |
| 1 | item_id | static | True | Item identifier |
| 2 | store_id | static | True | Store identifier |
| 3 | cat_id | static | True | Category |
| 4 | dept_id | static | True | Department |
| 5 | day_of_week | known_ahead | True | Calendar - known infinitely ahead |
| 6 | week_of_year | known_ahead | True | Calendar - known infinitely ahead |
| 7 | month | known_ahead | True | Calendar - known infinitely ahead |
| 8 | is_holiday | known_ahead | True | Public holidays are scheduled |
| 9 | snap_flag | known_ahead | True | SNAP schedule is known ahead |
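
An audit like this can also be kept as a plain lookup table. The sketch below is one assumed way to hand-roll it, not a description of how `tsf.classify_features()` works; `AVAILABILITY`, `SAFE_CLASSES`, `audit_features`, and the example column list are hypothetical names chosen for illustration.

```python
import pandas as pd

# Hypothetical registry: feature name -> availability class from the table above.
AVAILABILITY = {
    "unique_id": "static",
    "cat_id": "static",
    "week_of_year": "known_ahead",
    "is_holiday": "known_ahead",
    "y_lag1": "known_at_t",
    "promo_lift": "known_after_t",
}

# Classes that may be used as model inputs.
SAFE_CLASSES = {"static", "known_ahead", "known_at_t"}


def audit_features(columns):
    """Return one row per column with its availability class and a safe flag.

    Columns missing from the registry are marked 'unclassified' and unsafe,
    so a new feature cannot slip through without an explicit decision.
    """
    rows = []
    for col in columns:
        availability = AVAILABILITY.get(col, "unclassified")
        rows.append(
            {
                "feature": col,
                "availability": availability,
                "safe": availability in SAFE_CLASSES,
            }
        )
    return pd.DataFrame(rows)


audit = audit_features(["cat_id", "is_holiday", "y_lag1", "promo_lift", "mystery_col"])
print(audit)
```

Treating unknown columns as unsafe by default is a design choice: it forces every new feature through the classification step before it reaches a model.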


## 4. Common Leakage Traps

### 4.1 The Lag Feature Trap

**Wrong:** Rolling window includes current value
```python
# ❌ LEAKAGE - includes y[t]
df['rolling_mean'] = df.groupby('unique_id')['y'].transform(
    lambda x: x.rolling(4).mean()
)
```

**Right:** Shift THEN roll
```python
# ✅ SAFE - only uses y[t-1] and earlier
df['rolling_mean'] = df.groupby('unique_id')['y'].transform(
    lambda x: x.shift(1).rolling(4).mean()
)
```
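
One way to check a feature for this trap is a perturbation test: change `y` at a single time step and confirm that the feature at that same step does not move. The following is a minimal sketch on a toy series; `leaky_roll`, `safe_roll`, and `leaks_current_value` are hypothetical helper names, and the toy data is made up for illustration.

```python
import numpy as np
import pandas as pd


def leaky_roll(df):
    # Window ends at the current row, so y[t] is inside the mean.
    return df.groupby("unique_id")["y"].transform(lambda x: x.rolling(4).mean())


def safe_roll(df):
    # Shift first, so the window ends at y[t-1].
    return df.groupby("unique_id")["y"].transform(
        lambda x: x.shift(1).rolling(4).mean()
    )


def leaks_current_value(feature_fn, df, row):
    """Return True if changing y at `row` changes the feature at that same row."""
    before = feature_fn(df).iloc[row]
    bumped = df.copy()
    bumped.loc[bumped.index[row], "y"] += 1000.0
    after = feature_fn(bumped).iloc[row]
    return not np.isclose(before, after, equal_nan=True)


toy = pd.DataFrame(
    {"unique_id": ["A"] * 8, "y": [3.0, 9.0, 7.0, 8.0, 14.0, 15.0, 11.0, 10.0]}
)

print(leaks_current_value(leaky_roll, toy, row=6))  # True: feature depends on y[t]
print(leaks_current_value(safe_roll, toy, row=6))   # False: feature ignores y[t]
```

The same test works for any feature function that takes the frame and returns a column, which makes it a cheap unit test for a feature pipeline.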

In [3]:
# Create safe features
df_features = tsf.create_safe_lag_features(df)

# Verify no leakage - lag1 should equal previous row's y
sample = df_features[df_features['unique_id'] == df_features['unique_id'].iloc[0]].head(10)
print("\nVerification (y_lag1 should be previous y):")
display(sample[['ds', 'y', 'y_lag1', 'y_roll_mean_4']].head(6))

✅ Created safe lag features:
   Simple lags: y_lag1, y_lag2, y_lag4, y_lag52
   Rolling: y_roll_mean_4, y_roll_std_4, y_roll_mean_12
   Expanding: y_expanding_mean

   All features use shift(1) BEFORE aggregation

Verification (y_lag1 should be previous y):


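| | ds | y | y_lag1 | y_roll_mean_4 |
|---|----|---|--------|---------------|
| 0 | 2011-01-29 | 3.0 | NaN | NaN |
| 1 | 2011-02-05 | 9.0 | 3.0 | 3.0 |
| 2 | 2011-02-12 | 7.0 | 9.0 | 6.0 |
| 3 | 2011-02-19 | 8.0 | 7.0 | 6.333333 |
| 4 | 2011-02-26 | 14.0 | 8.0 | 6.75 |
| 5 | 2011-03-05 | 15.0 | 14.0 | 9.5 |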
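
The values above are consistent with shifting by one step and then taking a rolling mean with `min_periods=1`. That parameter is inferred from the output, not confirmed from the `tsforge` source. A plain-pandas sketch that reproduces the numbers for this one series (the id `series_0` is a placeholder):

```python
import pandas as pd

# The six y values shown in the verification output.
s = pd.DataFrame(
    {
        "unique_id": ["series_0"] * 6,  # placeholder id
        "ds": pd.date_range("2011-01-29", periods=6, freq="7D"),
        "y": [3.0, 9.0, 7.0, 8.0, 14.0, 15.0],
    }
)

# Lag 1: the previous week's value within each series.
s["y_lag1"] = s.groupby("unique_id")["y"].shift(1)

# Rolling mean over up to 4 PAST values: shift first, then roll.
# min_periods=1 is an assumption that matches the early rows of the output.
s["y_roll_mean_4"] = s.groupby("unique_id")["y"].transform(
    lambda x: x.shift(1).rolling(4, min_periods=1).mean()
)

print(s[["ds", "y", "y_lag1", "y_roll_mean_4"]])
```

Row 4 is a useful spot check: 6.75 is the mean of 3, 9, 7, and 8, so the current value 14 is not in the window.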


### 4.2 The Price Trap

**Problem:** Using actual transaction prices instead of planned prices.

```
Timeline:
├── Monday: Plan price = $9.99, forecast is made
├── Wednesday: Competitor drops price → we react → actual = $8.99
├── Later: Backtest Monday's forecast using $8.99 ← LEAKAGE!
│
└── At forecast time (Monday), we didn't KNOW about the $8.99
```

**Solution:** Use planned/base prices, or lag actual prices.
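
One way to apply this with realized prices is a point-in-time join: attach to each forecast only the latest price recorded at or before the forecast timestamp. The sketch below uses `pandas.merge_asof`; the column names `forecast_time`, `recorded_at`, and `price`, the item id, and the dates are illustrative assumptions, not part of the M5 files.

```python
import pandas as pd

# When each forecast is issued (hypothetical timestamps for one item).
forecasts = pd.DataFrame(
    {
        "unique_id": ["ITEM_1", "ITEM_1"],
        "forecast_time": pd.to_datetime(["2016-05-02", "2016-05-09"]),  # two Mondays
    }
)

# Price log: each price and the moment it became known.
price_log = pd.DataFrame(
    {
        "unique_id": ["ITEM_1", "ITEM_1"],
        "recorded_at": pd.to_datetime(["2016-04-25", "2016-05-04"]),
        "price": [9.99, 8.99],  # the 8.99 reaction is logged on a Wednesday
    }
)

# merge_asof requires both frames to be sorted by their time keys.
forecasts = forecasts.sort_values("forecast_time")
price_log = price_log.sort_values("recorded_at")

features = pd.merge_asof(
    forecasts,
    price_log,
    left_on="forecast_time",
    right_on="recorded_at",
    by="unique_id",
    direction="backward",  # only match rows with recorded_at <= forecast_time
)
print(features[["forecast_time", "recorded_at", "price"]])
```

The first Monday's forecast sees 9.99 because the 8.99 price was not yet recorded; only the following Monday's forecast can use it.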

### 4.3 The Promotion Trap

**Leaky promo features (don't use):**
- `promo_lift`: computed after the promotion
- `promo_success`: only known after sales come in
- `actual_promo_dates`: realized promo dates, which can differ from the planned calendar

**Safe promo features:**
- `planned_promo_flag`: from promotional calendar
- `promo_type`: planned type of promotion
- `historical_avg_lift`: average from past promos (not current)
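
A feature like `historical_avg_lift` stays safe only if the current promotion's own lift is excluded from the average. A minimal sketch, assuming a hypothetical `realized_lift` column that is NaN in non-promo weeks and that last week's lift is already measured by the time this week's forecast is made:

```python
import numpy as np
import pandas as pd

# Hypothetical weekly data: realized lift is NaN in weeks without a promo.
promos = pd.DataFrame(
    {
        "unique_id": ["A"] * 6,
        "ds": pd.date_range("2016-01-02", periods=6, freq="7D"),
        "realized_lift": [np.nan, 1.5, np.nan, 2.5, np.nan, 2.0],
    }
)

# shift(1) drops the current week's lift; expanding().mean() then averages
# every earlier promo and skips the NaN (non-promo) weeks.
promos["historical_avg_lift"] = promos.groupby("unique_id")["realized_lift"].transform(
    lambda s: s.shift(1).expanding().mean()
)

print(promos[["ds", "realized_lift", "historical_avg_lift"]])
```

In the fourth week the promo's realized lift is 2.5, but the feature is 1.5: it reflects only the earlier promo.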

## 5. The Cutoff Framework

A **cutoff date** is the moment you make your forecast. Everything at or before the cutoff is known; everything after it is unknown.

```
Past (known)          Cutoff         Future (unknown)
─────────────────────────┼────────────────────────────
  Training data          │         Forecast horizon
  Historical features    │         What we predict
  y[t-1], y[t-2], ...    │         y[t], y[t+1], ...
```
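
In code, a cutoff is a filter on `ds`. A minimal sketch in plain pandas follows; the helper name `split_at_cutoff` is hypothetical and this is not the `tsforge` implementation. Rows at the cutoff go to training, rows after it go to test.

```python
import pandas as pd


def split_at_cutoff(df, cutoff, time_col="ds"):
    """Split into train (<= cutoff) and test (> cutoff) and check for overlap."""
    train = df[df[time_col] <= cutoff]
    test = df[df[time_col] > cutoff]
    if train.empty or test.empty:
        raise ValueError("cutoff leaves one side of the split empty")
    # Every training timestamp must come strictly before every test timestamp.
    assert train[time_col].max() < test[time_col].min()
    return train, test


# Toy weekly series: 12 weeks, with the last 4 held out.
toy = pd.DataFrame(
    {
        "unique_id": ["A"] * 12,
        "ds": pd.date_range("2016-04-09", periods=12, freq="7D"),
        "y": range(12),
    }
)
cutoff = toy["ds"].max() - pd.Timedelta(weeks=4)
train, test = split_at_cutoff(toy, cutoff)

gap_days = (test["ds"].min() - train["ds"].max()).days
print(len(train), len(test), gap_days)
```

For weekly data the gap between the last training date and the first test date should be 7 days; a larger gap points to missing weeks, and a gap of zero or less points to overlap.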

In [None]:
# Example: validate a cutoff
cutoff = df['ds'].max() - pd.Timedelta(weeks=8)
validation = tsf.validate_cutoff(df, cutoff)

Cutoff Validation: 2016-04-30
  Train: 6,604,967 rows, ends 2016-04-30
  Test: 243,920 rows, starts 2016-05-07
  Gap: 7 days
  Valid: ✅


## 6. The Leakage Checklist

### Before Feature Engineering

- [ ] List all features and their data source
- [ ] Classify each: static / known-ahead / known-at-t / known-after-t
- [ ] Flag caution features: prices, promos, aggregates

### During Feature Engineering

- [ ] Lag features use `shift(1)` BEFORE any rolling/expanding
- [ ] Rolling windows exclude current value
- [ ] Joins use planned data, not actual outcomes
- [ ] Aggregates are backward-looking only

### During Evaluation

- [ ] Cutoff dates are respected in cross-validation
- [ ] No future data in training folds
- [ ] Test performance is realistic (not too good to be true)
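
For cross-validation, the cutoff rule has to hold in every fold. Below is a sketch of rolling-origin folds in plain pandas, where each fold trains on everything up to its cutoff and tests on the next `horizon` periods. `rolling_origin_folds` and its parameters are hypothetical names, not a `tsforge` API.

```python
import pandas as pd


def rolling_origin_folds(df, n_folds, horizon, time_col="ds"):
    """Yield (cutoff, train, test); each test window starts right after its cutoff."""
    dates = df[time_col].drop_duplicates().sort_values().reset_index(drop=True)
    n = len(dates)
    if n_folds * horizon >= n:
        raise ValueError("not enough history for the requested folds")
    for k in range(n_folds):
        end = n - (n_folds - 1 - k) * horizon  # exclusive end of the test window
        start = end - horizon                  # first test period
        cutoff = dates.iloc[start - 1]         # last period the model may see
        train = df[df[time_col] <= cutoff]
        test = df[df[time_col].isin(dates.iloc[start:end])]
        yield cutoff, train, test


toy = pd.DataFrame(
    {
        "unique_id": ["A"] * 12,
        "ds": pd.date_range("2016-04-09", periods=12, freq="7D"),
        "y": range(12),
    }
)

folds = list(rolling_origin_folds(toy, n_folds=3, horizon=2))
for cutoff, train, test in folds:
    # No training row may be dated after the fold's cutoff.
    assert train["ds"].max() <= cutoff < test["ds"].min()
    print(cutoff.date(), len(train), len(test))
```

Lag and rolling features should be recomputed, or at least verified, per fold so that no training row depends on values dated after that fold's cutoff.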

### Red Flags 🚩

- [ ] Test RMSE << Train RMSE (suspiciously good)
- [ ] Model relies heavily on price/promo features
- [ ] Any feature named with "actual_" or "realized_"
- [ ] Lag-0 features exist (current period values)

In [None]:
# Run checklist
issues = tsf.run_leakage_checklist(df_features)

LEAKAGE CHECKLIST

1. Checking column names...
   ✅ No suspicious column names

2. Checking date ordering...
   ✅ Dates properly ordered

3. Checking for potential lag issues...
   Found lag/rolling columns: ['y_lag1', 'y_lag2', 'y_lag4', 'y_lag52', 'y_roll_mean_4', 'y_roll_std_4', 'y_roll_mean_12']
   ℹ️ Manually verify these use shift(1) before aggregation

4. Checking target column...
   Target NaN: 0.00%

✅ NO OBVIOUS LEAKAGE DETECTED
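
The name and ordering checks reported above are simple to approximate by hand. The sketch below is a hypothetical stand-in (`quick_leakage_scan` and `SUSPICIOUS_TOKENS` are not `tsforge` names) that flags suspicious column names and unsorted dates per series. Like the automated checklist, a clean result only means nothing obvious was found.

```python
import pandas as pd

# Name fragments taken from the red-flag list in this module.
SUSPICIOUS_TOKENS = ("actual_", "realized_", "lag0")


def quick_leakage_scan(df, id_col="unique_id", time_col="ds"):
    """Return a list of readable issues; an empty list means nothing obvious."""
    issues = []
    for col in df.columns:
        if any(token in str(col).lower() for token in SUSPICIOUS_TOKENS):
            issues.append(f"suspicious column name: {col}")
    # Dates must increase within each series, otherwise shift(1) does not
    # mean "previous period".
    ordered = df.groupby(id_col)[time_col].apply(lambda s: s.is_monotonic_increasing)
    for series_id in ordered[~ordered].index:
        issues.append(f"dates not sorted for series: {series_id}")
    return issues


toy = pd.DataFrame(
    {
        "unique_id": ["A", "A", "B", "B"],
        "ds": pd.to_datetime(["2016-01-02", "2016-01-09", "2016-01-09", "2016-01-02"]),
        "y": [1.0, 2.0, 3.0, 4.0],
        "actual_promo_lift": [0.0, 0.1, 0.0, 0.2],
    }
)

issues = quick_leakage_scan(toy)
print(issues)
```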


## 7. Key Takeaways

### The Golden Rule

> **At forecast time t, you can only use information available at or before t.**

### Quick Reference

| Feature Type | Safe? | Example Fix |
|--------------|-------|-------------|
| Static attributes | ✅ Yes | n/a |
| Calendar features | ✅ Yes | n/a |
| Lag features | ✅ If shifted | Always `shift(1)` first |
| Rolling features | ⚠️ Careful | `shift(1).rolling()` |
| Prices | ⚠️ Careful | Use planned, not actual |
| Promos | ⚠️ Careful | Use planned calendar |
| Actual outcomes | ❌ Never | Remove from features |

### The Checklist Habit

Before any model training:
1. **Audit** every feature's timeline
2. **Validate** cutoffs are respected
3. **Check** for "too good to be true" metrics

---

## What's Next

**Module 1.13: Data Quality & Readiness Checks**
- Confirm data is usable for forecasting
- Check history length, gaps, coverage

In [6]:
# Save feature audit
feature_audit.to_csv('./output/feature_audit.csv', index=False)
print("✓ Saved ./output/feature_audit.csv")

# Summary
print("\n" + "=" * 60)
print("MODULE 1.12 COMPLETE")
print("=" * 60)
print("\nKey functions:")
print("  create_safe_lag_features() - Build leakage-free lag features")
print("  validate_cutoff() - Check train/test split")
print("  run_leakage_checklist() - Automated leakage detection")
print("\nOutputs:")
print("  ./output/feature_audit.csv")

✓ Saved ./output/feature_audit.csv

MODULE 1.12 COMPLETE

Key functions:
  create_safe_lag_features() - Build leakage-free lag features
  validate_cutoff() - Check train/test split
  run_leakage_checklist() - Automated leakage detection

Outputs:
  ./output/feature_audit.csv
