# Example 28: WorkflowSet Per-Group Cross-Validation

## Overview

This notebook demonstrates **WorkflowSet per-group cross-validation**, a NEW feature in py-tidymodels v1.0.0 (released 2025-11-12).

### What's New?

Traditional WorkflowSet evaluation uses a single CV strategy for all data. For **panel/grouped data** (multiple entities like countries, stores, products), this approach has limitations:

❌ **Old Approach**: Single global CV
- All groups share the same CV splits
- Doesn't respect group-specific temporal patterns
- May leak information across groups

✅ **New Approach**: Per-Group CV (this notebook!)
- **Each group gets its own CV splits**
- Respects group-specific seasonality and trends
- Prevents cross-group information leakage
- Identifies group-specific overfitting

---

## Key Features Demonstrated

### 1. `time_series_nested_cv()` - Per-Group CV Splits
```python
cv_by_country = time_series_nested_cv(
    data=train_data,
    group_col='country',
    date_column='date',
    initial='2 years',
    assess='6 months'
)
# Returns: {'USA': cv_usa, 'Germany': cv_germany, ...}
```

### 2. `WorkflowSet.fit_nested_resamples()` - Evaluate All Workflows Per-Group
```python
cv_results = wf_set.fit_nested_resamples(
    resamples=cv_by_country,
    group_col='country',
    metrics=metric_set(rmse, mae),
    verbose=True  # Show progress per workflow and group
)
```

### 3. `compare_train_cv()` - One-Line Overfitting Detection
```python
# Compare training stats vs CV stats
comparison = cv_results.compare_train_cv(train_stats)

# Find overfitting workflows
overfit = comparison[comparison['rmse_overfit_ratio'] > 1.2]
```

---

## Use Cases

**When to use per-group CV**:
- ✅ Panel/grouped data (multiple stores, countries, products)
- ✅ Each group has different temporal patterns
- ✅ Need group-specific overfitting detection
- ✅ Want to identify which groups are hard to forecast
- ✅ Comparing models across heterogeneous entities

**When to use global CV**:
- ✅ All groups share similar patterns
- ✅ Limited data per group
- ✅ Want single model for all groups

---

## Dataset

**JODI Refinery Production Data** (13,122 rows)
- Multiple countries (Algeria, Argentina, Australia, Brazil, Canada, China, etc.)
- Monthly refinery intake data (2002-2023)
- Unit: Thousand Barrels per Day (KBD)
- Real-world panel data with heterogeneous patterns

## Setup

In [None]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
from datetime import datetime

# Per-group CV imports (NEW!)
from py_rsample import time_series_nested_cv, initial_time_split
from py_workflowsets import WorkflowSet

# Core py-tidymodels
from py_parsnip import linear_reg, prophet_reg, arima_reg
from py_recipes import recipe, step_normalize, step_lag, all_numeric_predictors
from py_workflows import Workflow
from py_yardstick import metric_set, rmse, mae, r_squared

# Suppress warnings
import warnings
warnings.filterwarnings('ignore')

print("✓ Imports successful")

## Load Real-World Panel Data

JODI (Joint Organisations Data Initiative) refinery production data:
- **13,122 rows** (monthly observations)
- **Multiple countries** (panel structure)
- **22 years** of data (2002-2023)
- **Heterogeneous patterns** (different countries, different seasonality)

In [None]:
# Load data
raw_data = pd.read_csv('../_md/__data/jodi_refinery_production_data.csv')
raw_data['date'] = pd.to_datetime(raw_data['date'])

print(f"Total dataset: {len(raw_data):,} rows")
print(f"Countries: {raw_data['country'].nunique()}")
print(f"Date range: {raw_data['date'].min()} to {raw_data['date'].max()}")
print(f"\nColumns: {list(raw_data.columns)}")

# Show top countries by data availability
print("\nTop 10 countries by data points:")
print(raw_data['country'].value_counts().head(10))

### Select Representative Countries

For this demo, we'll focus on **5 countries** with complete data:
- Different regions (Asia, Americas, Middle East, Europe)
- Different refining capacities
- Heterogeneous temporal patterns

In [None]:
# Select countries with substantial data
selected_countries = ['China', 'United States', 'Saudi Arabia', 'Germany', 'Brazil']

data = raw_data[raw_data['country'].isin(selected_countries)].copy()
data = data.sort_values(['country', 'date']).reset_index(drop=True)

# Filter out zeros (non-production months)
data = data[data['value'] > 0]

print(f"\nFiltered data: {len(data):,} rows")
print(f"Countries: {data['country'].nunique()}")
print(f"\nRows per country:")
print(data.groupby('country').size())

print(f"\nData summary:")
print(data[['value']].describe())

### Visualize Heterogeneous Patterns

In [None]:
# Plot each country
fig, axes = plt.subplots(len(selected_countries), 1, figsize=(14, 12))

for i, country in enumerate(selected_countries):
    country_data = data[data['country'] == country]
    axes[i].plot(country_data['date'], country_data['value'], linewidth=1, alpha=0.7)
    axes[i].set_title(f'{country} - Refinery Intake (KBD)', fontsize=11, fontweight='bold')
    axes[i].set_ylabel('KBD')
    axes[i].grid(True, alpha=0.3)
    
axes[-1].set_xlabel('Date')
plt.tight_layout()
plt.show()

print("\n📊 Observations:")
print("- Different scales (China >> Germany)")
print("- Different trends (China growing, others stable/declining)")
print("- Different volatility levels")
print("- Heterogeneous patterns → per-group CV is essential!")

## Train/Test Split

Time-based split: 80% train, 20% test
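
As a rough illustration of what a chronological 80/20 split does for a single country, the sketch below sorts rows by date and cuts at 80% of the observations. It is plain Python rather than `initial_time_split()` itself; the helper name `chronological_split` and the toy rows are assumptions made for this example.

```python
from datetime import date

def chronological_split(rows, prop=0.8, date_key="date"):
    """Split one group's rows into (train, test) by time order, never at random."""
    ordered = sorted(rows, key=lambda r: r[date_key])
    cut = int(len(ordered) * prop)  # the earliest `prop` share of rows is used for training
    return ordered[:cut], ordered[cut:]

# Hypothetical series: 10 monthly observations for one country
rows = [{"date": date(2020, m, 1), "value": 100 + m} for m in range(1, 11)]
train, test = chronological_split(rows, prop=0.8)

print(len(train), len(test))
print(train[-1]["date"] < test[0]["date"])  # training always ends before testing starts
```

Splitting each country separately, as the next cell does, keeps this ordering guarantee within every group even when countries cover different date ranges.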

In [None]:
# Split per country (to maintain temporal order within each country)
train_list = []
test_list = []

for country in selected_countries:
    country_data = data[data['country'] == country].copy()
    split = initial_time_split(country_data, prop=0.8)
    train_list.append(split.training())
    test_list.append(split.testing())

train_data = pd.concat(train_list, ignore_index=True)
test_data = pd.concat(test_list, ignore_index=True)

print(f"Train: {len(train_data):,} rows ({train_data['date'].min()} to {train_data['date'].max()})")
print(f"Test:  {len(test_data):,} rows ({test_data['date'].min()} to {test_data['date'].max()})")
print(f"\nTrain distribution:")
print(train_data.groupby('country').size())
print(f"\nTest distribution:")
print(test_data.groupby('country').size())

---

# Step 1: Create Per-Group CV Splits with `time_series_nested_cv()`

**NEW FEATURE** (2025-11-12): Create separate CV splits for each group.

Each country gets its own independent CV folds based on **that country's data**.
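
To make "independent folds per country" concrete, here is a minimal sketch in plain Python. It is not the library's implementation: `rolling_folds`, `folds_by_group`, and the `step` argument (the number of observations the window advances between folds, standing in for `skip`) are hypothetical names, and the series lengths are made up.

```python
from collections import defaultdict

def rolling_folds(n_obs, initial, assess, step):
    """(train_range, test_range) pairs for one time-ordered series, fixed-size training window."""
    folds = []
    train_end = initial
    while train_end + assess <= n_obs:
        folds.append((range(train_end - initial, train_end),
                      range(train_end, train_end + assess)))
        train_end += step
    return folds

def folds_by_group(rows, group_key, initial, assess, step):
    """Build folds separately for each group, using only that group's row count."""
    counts = defaultdict(int)
    for row in rows:
        counts[row[group_key]] += 1
    return {g: rolling_folds(n, initial, assess, step) for g, n in counts.items()}

# Hypothetical panel: country A has 96 monthly rows, country B has 84
rows = [{"country": "A"}] * 96 + [{"country": "B"}] * 84
folds = folds_by_group(rows, "country", initial=60, assess=12, step=6)

for country, country_folds in folds.items():
    print(country, len(country_folds))
```

Because each group's folds depend only on its own length, a shorter series simply ends up with fewer folds instead of distorting the splits of the longer ones.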

In [None]:
# Create per-group CV splits
cv_by_country = time_series_nested_cv(
    data=train_data,
    group_col='country',
    date_column='date',
    initial='5 years',      # Use first 5 years for initial training
    assess='1 year',        # Forecast 1 year ahead
    skip='6 months',        # Move forward 6 months each fold
    cumulative=False        # Rolling window (not expanding)
)

print("✅ Per-Group CV Splits Created")
print(f"Groups with CV: {len(cv_by_country)}")
print(f"Countries: {list(cv_by_country.keys())}")

# Show CV info for each country
for country, cv in cv_by_country.items():
    print(f"\n{country}:")
    print(f"  Folds: {len(cv.splits)}")
    print(f"  Training size: {len(cv.splits[0].training())} months (first fold)")
    print(f"  Test size: {len(cv.splits[0].testing())} months (each fold)")

### Visualize CV Splits for One Country

In [None]:
# Visualize China's CV splits
china_cv = cv_by_country['China']
china_train = train_data[train_data['country'] == 'China'].copy()

fig, ax = plt.subplots(figsize=(14, 6))

# Plot full training data
ax.plot(china_train['date'], china_train['value'], 'o-', linewidth=1, markersize=3, 
        alpha=0.5, label='Full Training Data', color='gray')

# Highlight each CV fold
colors = ['blue', 'green', 'red', 'purple', 'orange']
for i, split in enumerate(china_cv.splits[:5]):  # Show first 5 folds
    train_fold = split.training()
    test_fold = split.testing()
    
    # Mark test period
    ax.axvspan(test_fold['date'].min(), test_fold['date'].max(), 
               alpha=0.2, color=colors[i], label=f'Fold {i+1} Test')

ax.set_xlabel('Date', fontsize=11)
ax.set_ylabel('Refinery Intake (KBD)', fontsize=11)
ax.set_title('China: Rolling Window Cross-Validation Splits', fontsize=12, fontweight='bold')
ax.legend(loc='upper left')
ax.grid(True, alpha=0.3)
plt.tight_layout()
plt.show()

print("\n📊 Each colored region shows a test fold")
print("The training window keeps a fixed length and slides forward in time (rolling window, cumulative=False)")

---

# Step 2: Create WorkflowSet with Multiple Models

We'll compare **3 preprocessing strategies × 2 models = 6 workflows**.
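
The 3 × 2 grid is simply a cross product of preprocessors and models. The sketch below shows the idea with `itertools.product`; it does not call `WorkflowSet.from_cross()`, and the ID pattern is an assumption modeled on the `prep_1_linear_reg_1` ID shown in the verbose output in Step 3.

```python
from itertools import product

formulas = [
    "value ~ date",
    "value ~ date + mean_production",
    "value ~ date + mean_production + pct_zero",
]
model_names = ["linear_reg", "prophet_reg"]  # names only; no models are fitted here

# Every formula is paired with every model (IDs are hypothetical, for illustration)
workflows = {
    f"prep_{i}_{name}_{j}": (formula, name)
    for (i, formula), (j, name) in product(
        enumerate(formulas, start=1), enumerate(model_names, start=1)
    )
}

for wf_id, (formula, name) in workflows.items():
    print(wf_id, "|", formula, "|", name)
```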

In [None]:
# Define preprocessing strategies
formulas = [
    'value ~ date',                           # Minimal (date only)
    'value ~ date + mean_production',         # With mean feature
    'value ~ date + mean_production + pct_zero'  # Full features
]

# Define models
models = [
    linear_reg(),
    prophet_reg()
]

# Create WorkflowSet (3 × 2 = 6 workflows)
wf_set = WorkflowSet.from_cross(
    preproc=formulas,
    models=models
)

print(f"✅ WorkflowSet created with {len(wf_set)} workflows")
print("\nWorkflows:")
for wf_id in wf_set.workflow_ids:
    print(f"  - {wf_id}")

---

# Step 3: Evaluate All Workflows with Per-Group CV

**NEW METHOD**: `WorkflowSet.fit_nested_resamples()`

This evaluates:
- **6 workflows** (3 formulas × 2 models)
- Across **5 countries** (China, United States, Saudi Arabia, Germany, Brazil)
- Using **per-group CV** (each country has ~2-4 folds)

**Total evaluations**: 6 workflows × 5 countries × ~3 folds = ~90 model fits!

**Verbose Mode** shows progress:
```
[1/6] Workflow: prep_1_linear_reg_1
  [1/5] Group: China (3 folds) ✓
  [2/5] Group: United States (3 folds) ✓
  ...
```

In [None]:
# Evaluate all workflows with per-group CV (verbose mode)
print("🔬 Starting Per-Group CV Evaluation...\n")
print("This will take a few minutes (6 workflows × 5 countries × ~3 folds = ~90 fits)\n")

cv_results = wf_set.fit_nested_resamples(
    resamples=cv_by_country,
    group_col='country',
    metrics=metric_set(rmse, mae, r_squared),
    verbose=True  # Show detailed progress
)

print("\n✅ Per-Group CV Complete!")

---

# Step 4: Analyze Results

## 4.1 Collect Metrics (Averaged Across Groups)

In [None]:
# Get average metrics across all groups
metrics_avg = cv_results.collect_metrics(by_group=False, summarize=True)

print("📊 Average CV Metrics (across all countries):")
print(metrics_avg[['wflow_id', 'metric', 'mean', 'std']].to_string(index=False))

## 4.2 Rank Workflows by RMSE

In [None]:
# Rank workflows by average CV RMSE
ranked = cv_results.rank_results('rmse', by_group=False, n=6)

print("\n🏆 Workflow Ranking (by CV RMSE):")
print(ranked[['rank', 'wflow_id', 'rmse', 'mae', 'r_squared']].to_string(index=False))

best_wf = ranked.iloc[0]
print(f"\n🥇 Best Workflow: {best_wf['wflow_id']}")
print(f"   CV RMSE: {best_wf['rmse']:.2f} (±{best_wf.get('rmse_std', 0):.2f})")

## 4.3 Per-Group Analysis

See which countries are easier/harder to forecast.
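
The per-group view boils down to two averaging steps: fold-level scores are averaged within each (workflow, country) cell, then each country's cells are averaged across workflows. A minimal sketch with made-up numbers follows; the workflow IDs and RMSE values are hypothetical, and this is not how `collect_metrics()` is implemented.

```python
from collections import defaultdict
from statistics import mean

# Hypothetical fold-level results: (wflow_id, group, rmse)
fold_rmse = [
    ("wf_a", "China", 120.0), ("wf_a", "China", 140.0),
    ("wf_a", "Germany", 30.0), ("wf_a", "Germany", 50.0),
    ("wf_b", "China", 100.0), ("wf_b", "China", 110.0),
    ("wf_b", "Germany", 45.0), ("wf_b", "Germany", 55.0),
]

# Step 1: mean RMSE per (workflow, country) cell, i.e. the pivot table entries
cells = defaultdict(list)
for wf, group, value in fold_rmse:
    cells[(wf, group)].append(value)
table = {key: mean(values) for key, values in cells.items()}

# Step 2: average each country's column over workflows, hardest country first
by_group = defaultdict(list)
for (wf, group), cell_mean in table.items():
    by_group[group].append(cell_mean)
hardest_first = sorted(
    ((group, mean(values)) for group, values in by_group.items()),
    key=lambda pair: pair[1],
    reverse=True,
)

print(table)
print(hardest_first)
```

RMSE is expressed in the units of the target (KBD here), so a country with much larger volumes can rank as "hardest" partly because of scale. Comparing workflows within a country is the safer reading of this table.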

In [None]:
# Get metrics by group
metrics_by_group = cv_results.collect_metrics(by_group=True, summarize=True)

# Focus on RMSE
rmse_by_group = metrics_by_group[metrics_by_group['metric'] == 'rmse'].copy()
rmse_pivot = rmse_by_group.pivot(index='wflow_id', columns='group', values='mean')

print("\n📊 CV RMSE by Country (lower is better):")
print(rmse_pivot.to_string())

# Which country is hardest to forecast?
avg_rmse_by_country = rmse_pivot.mean(axis=0).sort_values(ascending=False)
print("\n🎯 Average RMSE by Country (hardest to easiest):")
for country, avg_rmse in avg_rmse_by_country.items():
    print(f"  {country}: {avg_rmse:.2f}")

## 4.4 Visualize Workflow Comparison

In [None]:
# Plot average RMSE with error bars
fig = cv_results.autoplot('rmse', by_group=False, top_n=6)
fig.update_layout(
    title='CV RMSE by Workflow (averaged across countries)',
    xaxis_title='CV RMSE',
    yaxis_title='Workflow'
)
fig.show()

print("\n📊 Error bars show variability across CV folds and groups")

---

# Step 5: Detect Overfitting with `compare_train_cv()`

**NEW HELPER** (2025-11-12): One-line overfitting detection!

Compare training performance vs CV performance to identify overfitting.

## 5.1 Fit on Full Training Data

In [None]:
# Fit all workflows on full training data (per-group)
print("Fitting all workflows on full training data...\n")

train_results = wf_set.fit_nested(
    data=train_data,
    group_col='country'
)

# Extract training stats
outputs, coeffs, train_stats = train_results.extract_outputs()

print("\n✅ Training complete")
print(f"Training stats: {len(train_stats)} rows")
print(f"Columns: {list(train_stats.columns)}")

## 5.2 Compare Training vs CV (ONE LINE!)

The `compare_train_cv()` helper automatically:
1. Matches workflows between training and CV results
2. Calculates overfitting ratios (CV / Train)
3. Flags concerning overfit levels
4. Sorts by CV performance (most reliable metric)
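
The four steps above can be sketched in a few lines of plain Python. This is an illustration of the logic only, not the helper's source: the function name `compare_train_cv_sketch`, the `flagged` column, and all numbers are assumptions, while the output column names mirror the ones used in this notebook.

```python
def compare_train_cv_sketch(train_rows, cv_rows, threshold=1.2):
    """Join training and CV RMSE per (workflow, group) and compute CV / Train."""
    train_rmse = {(r["wflow_id"], r["group"]): r["rmse"] for r in train_rows}
    merged = []
    for r in cv_rows:
        key = (r["wflow_id"], r["group"])
        if key not in train_rmse:
            continue  # step 1: keep only workflows present in both tables
        ratio = r["rmse"] / train_rmse[key]  # step 2: overfitting ratio (CV / Train)
        merged.append({
            "wflow_id": key[0],
            "group": key[1],
            "rmse_train": train_rmse[key],
            "rmse_cv": r["rmse"],
            "rmse_overfit_ratio": ratio,
            "flagged": ratio > threshold,  # step 3: flag concerning levels
        })
    return sorted(merged, key=lambda row: row["rmse_cv"])  # step 4: sort by CV RMSE

# Hypothetical inputs
train_rows = [
    {"wflow_id": "wf_a", "group": "China", "rmse": 10.0},
    {"wflow_id": "wf_b", "group": "China", "rmse": 8.0},
]
cv_rows = [
    {"wflow_id": "wf_b", "group": "China", "rmse": 14.0},
    {"wflow_id": "wf_a", "group": "China", "rmse": 11.5},
]

result = compare_train_cv_sketch(train_rows, cv_rows)
for row in result:
    print(row)
```

In this toy case `wf_b` has the lower training error but the higher CV error, which is exactly the pattern the ratio is meant to expose.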

In [None]:
# ONE LINE to detect overfitting!
comparison = cv_results.compare_train_cv(train_stats)

print("\n📊 Training vs CV Comparison:")
print(comparison[['wflow_id', 'group', 'rmse_train', 'rmse_cv', 'rmse_overfit_ratio', 'mae_train', 'mae_cv']].to_string(index=False))

print("\n📈 Overfitting Ratio Interpretation:")
print("  1.0-1.1: 🟢 Excellent (minimal overfit)")
print("  1.1-1.2: 🟢 Good (acceptable overfit)")
print("  1.2-1.5: 🟡 Moderate (some overfit)")
print("  >1.5:    🔴 Severe (significant overfit)")

## 5.3 Identify Overfitting Workflows

In [None]:
# Find workflows with moderate or severe overfitting
overfit_workflows = comparison[comparison['rmse_overfit_ratio'] > 1.2].copy()

if len(overfit_workflows) > 0:
    print("\n⚠️ Workflows with Overfitting (ratio > 1.2):")
    print(overfit_workflows[['wflow_id', 'group', 'rmse_train', 'rmse_cv', 'rmse_overfit_ratio']].to_string(index=False))
    
    print("\n💡 Recommendations for overfitting workflows:")
    print("  - Add regularization (penalty parameter)")
    print("  - Simplify model (fewer features)")
    print("  - Increase training data")
    print("  - Use simpler model type")
else:
    print("\n✅ No significant overfitting detected (all ratios <= 1.2)")

## 5.4 Best Workflow Per Country (Based on CV)

In [None]:
# Find best workflow for each country (using CV RMSE - most reliable)
best_per_country = comparison.sort_values('rmse_cv').groupby('group').first().reset_index()

print("\n🏆 Best Workflow per Country (based on CV RMSE):")
print(best_per_country[['group', 'wflow_id', 'rmse_cv', 'rmse_overfit_ratio']].to_string(index=False))

# Check if different countries prefer different workflows
unique_best = best_per_country['wflow_id'].nunique()
if unique_best > 1:
    print(f"\n💡 Heterogeneity detected: {unique_best} different workflows are best for different countries")
    print("This confirms that per-group modeling is beneficial!")
else:
    print(f"\n📊 Homogeneity: All countries perform best with {best_per_country['wflow_id'].iloc[0]}")

---

# Step 6: Final Model Selection and Evaluation

Based on CV results, select best workflow and evaluate on test set.

In [None]:
# Extract best workflow overall (by CV RMSE)
best_wf_id = cv_results.extract_best_workflow('rmse', by_group=False)
best_workflow = wf_set[best_wf_id]

print(f"🏆 Best Workflow (overall): {best_wf_id}")
print(f"Based on CV RMSE across all countries")

# Fit on full training data and evaluate on test
print("\nFitting best workflow on full training data...")
final_fit = best_workflow.fit_nested(train_data, group_col='country')

# Evaluate on test
test_outputs, test_coeffs, test_stats = final_fit.extract_outputs()

# Get test stats. Rows with split == 'test' only exist once the fit has been
# scored on test_data; if test_perf is empty, evaluate on test_data first.
test_perf = test_stats[test_stats['split'] == 'test'].copy()

print("\n📊 Test Set Performance (best workflow):")
print(test_perf[['group', 'rmse', 'mae', 'r_squared']].to_string(index=False))

# Compare test with CV
cv_rmse_best = ranked.iloc[0]['rmse']
test_rmse_avg = test_perf['rmse'].mean()

print(f"\n📈 CV vs Test Comparison:")
print(f"  Average CV RMSE: {cv_rmse_best:.2f}")
print(f"  Average Test RMSE: {test_rmse_avg:.2f}")
print(f"  Difference: {abs(cv_rmse_best - test_rmse_avg):.2f} ({abs(cv_rmse_best - test_rmse_avg)/cv_rmse_best*100:.1f}%)")

if abs(cv_rmse_best - test_rmse_avg) / cv_rmse_best < 0.1:
    print("\n✅ CV is a good estimator of test performance (<10% difference)")
else:
    print("\n⚠️ Larger difference between CV and test - consider more CV folds")

---

# Key Takeaways

## What We Learned

### 1. Per-Group CV Captures Heterogeneity
- Different countries have different optimal workflows
- Global CV would miss group-specific patterns
- Per-group CV respects temporal structure within each entity

### 2. Overfitting Detection is Critical
- Training performance can be misleading
- CV provides more reliable performance estimates
- `compare_train_cv()` makes detection easy (one line!)

### 3. Workflow Comparison is Comprehensive
- Tested 6 workflows × 5 countries = 30 model configurations
- Each evaluated with multiple CV folds
- Automated ranking and selection

---

## When to Use This Approach

**✅ Use Per-Group CV When:**
- You have panel/grouped data (multiple stores, countries, products)
- Groups have heterogeneous patterns (different seasonality, trends)
- You need to identify group-specific overfitting
- You want to understand which groups are hard to forecast
- Each group has sufficient data for CV (>50 observations recommended)

**❌ Use Global CV When:**
- All groups share similar patterns (homogeneous)
- Limited data per group (<50 observations)
- Single model for all groups (no per-group customization)
- Computational resources are limited

---

## Best Practices

### CV Configuration
1. **Initial period**: At least 2-3 seasonal cycles
   - Monthly data: `initial='2 years'`
   - Daily data: `initial='6 months'`

2. **Assessment period**: Match forecast horizon
   - If forecasting 3 months ahead: `assess='3 months'`

3. **Skip**: Balance between folds and independence
   - More skip = more independence, fewer folds
   - Less skip = more folds, more correlation
   - Typical: `skip='3 months'` for monthly data

4. **Cumulative**: Depends on data volume
   - `True`: Expanding window (more data over time)
   - `False`: Rolling window (consistent data volume)
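
The difference between the two `cumulative` settings is easiest to see from the training-window bounds of successive folds. The sketch below assumes monthly data indexed from 0; `training_window` and `step` (how far the origin advances per fold) are hypothetical names for illustration, not library parameters.

```python
def training_window(fold, initial, step, cumulative):
    """Return (start, end) of the training window for a zero-based fold; end is exclusive."""
    end = initial + fold * step
    start = 0 if cumulative else end - initial
    return start, end

# Five years of initial data, origin advancing six months per fold
for fold in range(3):
    expanding = training_window(fold, initial=60, step=6, cumulative=True)
    rolling = training_window(fold, initial=60, step=6, cumulative=False)
    print(f"fold {fold}: expanding={expanding} rolling={rolling}")
```

With an expanding window the start stays at 0 and the training set grows by `step` each fold; with a rolling window both ends move, so every fold trains on exactly `initial` observations.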

### Overfitting Thresholds
- **< 1.1**: Excellent (minimal overfit)
- **1.1-1.2**: Good (acceptable)
- **1.2-1.5**: Moderate (investigate)
- **> 1.5**: Severe (needs remediation)
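
These bands can be turned into a small labeling helper. The sketch below is one possible reading of the thresholds: the function name `overfit_band` is hypothetical, and the boundary handling (1.2 counts as good, 1.5 as moderate) is an assumption chosen to agree with the `> 1.2` filter used earlier in this notebook.

```python
def overfit_band(ratio):
    """Map an overfitting ratio (CV RMSE / training RMSE) to a band label."""
    if ratio < 1.1:
        return "excellent"  # also covers ratios below 1.0, where CV beats training
    if ratio <= 1.2:
        return "good"
    if ratio <= 1.5:
        return "moderate"
    return "severe"

for ratio in (1.05, 1.15, 1.3, 1.75):
    print(ratio, overfit_band(ratio))
```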

### Workflow Selection
1. **Rank by CV metrics** (not training metrics)
2. **Check overfitting ratio** before selection
3. **Consider simplicity** if multiple workflows tie
4. **Validate on holdout test set** before production
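
One way to encode the first three rules is a filter followed by a tie-breaking sort key. This is a sketch under stated assumptions, not part of the library: `select_workflow` and the `n_terms` field (a stand-in for model complexity) are hypothetical, and the candidate values are made up.

```python
def select_workflow(candidates, max_ratio=1.2):
    """Pick the lowest CV RMSE among workflows that pass the overfitting check.

    Ties on CV RMSE go to the candidate with fewer terms. Returns None when
    no candidate passes the check.
    """
    eligible = [c for c in candidates if c["rmse_overfit_ratio"] < max_ratio]
    if not eligible:
        return None
    return min(eligible, key=lambda c: (c["rmse_cv"], c["n_terms"]))

# Hypothetical candidates for one country
candidates = [
    {"wflow_id": "wf_overfit", "rmse_cv": 11.0, "rmse_overfit_ratio": 1.60, "n_terms": 3},
    {"wflow_id": "wf_big", "rmse_cv": 12.0, "rmse_overfit_ratio": 1.15, "n_terms": 3},
    {"wflow_id": "wf_small", "rmse_cv": 12.0, "rmse_overfit_ratio": 1.10, "n_terms": 1},
]

best = select_workflow(candidates)
print(best["wflow_id"])
```

Here `wf_overfit` has the lowest CV RMSE but is excluded by the ratio check, and the remaining tie goes to the simpler `wf_small`. The fourth rule, validating on the holdout test set, still has to happen after this selection.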

---

## Common Pitfalls

### ❌ Selecting Based on Training Performance
- Training metrics can be optimistically biased
- Always use CV for model selection

### ❌ Ignoring Overfitting Ratios
- Low training error + high CV error = overfitting
- Use `compare_train_cv()` to detect this

### ❌ Too Few CV Folds
- Need at least 3-5 folds for stable estimates
- Adjust `initial`, `assess`, `skip` if needed
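
Before running an expensive evaluation, it can help to estimate how many folds a configuration yields. The sketch below assumes monthly observations counted as integers and a `step` that advances the origin by a fixed number of observations per fold; `n_folds` is a hypothetical helper, and the library's `skip` semantics may differ.

```python
def n_folds(n_obs, initial, assess, step):
    """Number of folds that fit when every fold needs `initial` + `assess` observations."""
    usable = n_obs - initial - assess
    return 0 if usable < 0 else usable // step + 1

# Hypothetical group with 84 months (7 years) of training data
print(n_folds(84, initial=60, assess=12, step=6))  # 5-year initial window
print(n_folds(84, initial=36, assess=12, step=6))  # shorter initial window
print(n_folds(84, initial=60, assess=12, step=3))  # smaller step
```

Shortening `initial` or reducing the step both raise the fold count, at the cost of less training data per fold or more overlap between adjacent folds.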

### ❌ Not Checking Group-Specific Performance
- Some groups may be much harder to forecast
- Use `by_group=True` analysis to identify

---

## Production Checklist

Before deploying to production:

- [ ] CV RMSE is acceptable for business needs
- [ ] Overfitting ratio < 1.2 for selected workflow
- [ ] Test set performance confirms CV estimates
- [ ] All groups have acceptable performance (no outliers)
- [ ] Model complexity justified by performance gain
- [ ] Monitoring plan for production performance
- [ ] Retraining schedule defined

---

# References

- **Per-Group CV Documentation**: `.claude_debugging/WORKFLOWSET_NESTED_RESAMPLES_IMPLEMENTATION.md`
- **Overfitting Detection**: `.claude_debugging/COMPARE_TRAIN_CV_HELPER.md`
- **WorkflowSet Guide**: Examples 11, 04_forecasting-workflowsets-tune-cv-grouped.ipynb
- **Grouped Modeling**: Example 13, 25_agent_advanced_features.ipynb
- **CLAUDE.md**: Complete architecture documentation