# Example 31: Per-Group Recipe Preprocessing

## Overview

This notebook demonstrates **per-group recipe preprocessing**, a NEW feature in py-tidymodels v1.0.0 (released 2025-11-10).

### What is Per-Group Preprocessing?

Traditional approach: **All groups share the same recipe**
```python
# ONE recipe fitted on all data
rec = recipe().step_pca(all_numeric_predictors(), num_comp=5)
wf = Workflow().add_recipe(rec).add_model(linear_reg())
fit = wf.fit_nested(data, group_col='country', per_group_prep=False)
# USA gets 5 components, Germany gets 5 components, France gets 5 components
```

NEW approach: **Each group gets its own recipe**
```python
# SEPARATE recipe per group, fitted on that group's data
rec = recipe().step_pca(all_numeric_predictors(), num_comp=5)
wf = Workflow().add_recipe(rec).add_model(linear_reg())
fit = wf.fit_nested(data, group_col='country', per_group_prep=True)
# USA gets 5 components, Germany gets 3 components, France gets 4 components
# Each group selects different features based on its data
```

---

## Why Per-Group Preprocessing?

### Problem with Global Preprocessing

Groups can have **heterogeneous feature spaces**:
- Different variance structures → PCA components differ
- Different correlations → Feature selection picks different vars
- Different scales → Normalization parameters differ
- Different patterns → Different feature importance

**One-size-fits-all preprocessing** may not be optimal.
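
To make the heterogeneity point concrete, here is a minimal sketch in plain pandas (not the py-tidymodels API) on synthetic data. The column names `country` and `temperature` mirror this notebook, but the values are made up for illustration. It contrasts a single global mean/std with per-group means/stds.

```python
import numpy as np
import pandas as pd

# Synthetic data (illustrative only): two groups on very different scales
rng = np.random.default_rng(0)
toy = pd.DataFrame({
    "country": ["A"] * 200 + ["B"] * 200,
    "temperature": np.concatenate([
        rng.normal(loc=5.0, scale=2.0, size=200),    # cold, low spread
        rng.normal(loc=20.0, scale=8.0, size=200),   # warm, high spread
    ]),
})

# Global normalization: one mean/std estimated from all rows
global_params = toy["temperature"].agg(["mean", "std"])

# Per-group normalization: one mean/std per country
group_params = toy.groupby("country")["temperature"].agg(["mean", "std"])

print(global_params)
print(group_params)

# z-scores under each scheme
toy["z_global"] = (toy["temperature"] - global_params["mean"]) / global_params["std"]
toy["z_group"] = toy.groupby("country")["temperature"].transform(
    lambda s: (s - s.mean()) / s.std()
)
```

Under the global scheme, group A's z-scores are shifted below zero because the pooled mean is pulled up by group B; under the per-group scheme each group is centered on its own mean.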

### Benefits of Per-Group Preprocessing

✅ **Adaptive feature spaces**: Each group gets appropriate features  
✅ **Better performance**: Group-specific preprocessing can improve accuracy when groups differ  
✅ **Handles heterogeneity**: Different groups → different transformations  
✅ **Interpretability**: See which features matter for each group  

---

## Use Cases

### When to Use Per-Group Preprocessing

✅ **PCA/dimensionality reduction**: Groups need different numbers of components  
✅ **Feature selection**: Groups have different important features  
✅ **Correlation filtering**: Groups have different correlation structures  
✅ **Variance filtering**: Groups have different variance distributions  
✅ **Heterogeneous data**: Groups are fundamentally different  

### When to Use Global Preprocessing

✅ **Homogeneous groups**: All groups similar  
✅ **Small groups**: Not enough data per group (<50 obs)  
✅ **Simple transformations**: Normalization, scaling (less critical)  
✅ **Consistency needs**: Want same features for all groups  

---

## Dataset

**European Gas Demand** (96K rows, 10 countries)
- Daily gas demand with weather features
- Heterogeneous patterns across countries
- Perfect for demonstrating per-group preprocessing

## Setup

In [None]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt

# Per-group preprocessing imports
from py_recipes import recipe
from py_recipes.selectors import all_numeric_predictors
from py_workflows import Workflow
from py_parsnip import linear_reg, rand_forest
from py_rsample import initial_time_split
from py_yardstick import rmse, mae, r_squared

# Suppress warnings
import warnings
warnings.filterwarnings('ignore')

print("✓ Imports successful")

## Load European Gas Demand Data

In [None]:
# Load data
raw_data = pd.read_csv('../_md/__data/european_gas_demand_weather_data.csv')
raw_data['date'] = pd.to_datetime(raw_data['date'])

print(f"Total dataset: {len(raw_data):,} rows")
print(f"Countries: {raw_data['country'].nunique()}")
print(f"\nCountries: {raw_data['country'].unique()}")

### Select Representative Countries

In [None]:
# Select 3 countries with different characteristics
selected_countries = ['Germany', 'Italy', 'Netherlands']

data = raw_data[raw_data['country'].isin(selected_countries)].copy()
data = data.sort_values(['country', 'date']).reset_index(drop=True)

print(f"Filtered data: {len(data):,} rows")
print(f"\nRows per country:")
print(data.groupby('country').size())

### Feature Engineering

In [None]:
# Add more features for demonstration
for country in selected_countries:
    country_mask = data['country'] == country
    country_data = data[country_mask].copy()
    
    # Lagged features
    data.loc[country_mask, 'demand_lag1'] = country_data['gas_demand'].shift(1)
    data.loc[country_mask, 'demand_lag7'] = country_data['gas_demand'].shift(7)
    
    # Rolling features
    data.loc[country_mask, 'temp_ma7'] = country_data['temperature'].rolling(7).mean()
    data.loc[country_mask, 'wind_ma7'] = country_data['wind_speed'].rolling(7).mean()
    
    # Interaction
    data.loc[country_mask, 'temp_wind'] = country_data['temperature'] * country_data['wind_speed']

# Drop NaN from rolling/lagging
data = data.dropna().reset_index(drop=True)

print(f"After feature engineering: {len(data):,} rows")
print(f"\nFeatures: {[c for c in data.columns if c not in ['date', 'country']]}")
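
The same lagged and rolling features can also be built with `groupby`, which avoids the explicit loop and boolean masks. The sketch below runs on a toy frame and assumes, as in the cell above, that rows are already sorted by `country` and `date`. The helper name `add_group_features` is hypothetical and not part of py-tidymodels.

```python
import pandas as pd

def add_group_features(df: pd.DataFrame) -> pd.DataFrame:
    """Add per-country lag and rolling features without leaking across groups."""
    out = df.copy()
    by_country = df.groupby("country", sort=False)

    # Lags are computed within each country, so the first rows of a
    # country never see values from the previous country
    out["demand_lag1"] = by_country["gas_demand"].shift(1)
    out["demand_lag7"] = by_country["gas_demand"].shift(7)

    # 7-row rolling means, again restricted to each country
    out["temp_ma7"] = by_country["temperature"].transform(lambda s: s.rolling(7).mean())
    out["wind_ma7"] = by_country["wind_speed"].transform(lambda s: s.rolling(7).mean())

    # Row-wise interaction needs no grouping
    out["temp_wind"] = out["temperature"] * out["wind_speed"]
    return out

# Toy data (made up): two countries with 10 rows each
toy = pd.DataFrame({
    "country": ["A"] * 10 + ["B"] * 10,
    "gas_demand": list(range(10)) + list(range(100, 110)),
    "temperature": [float(i) for i in range(20)],
    "wind_speed": [1.0] * 20,
})
features = add_group_features(toy)
print(features.head(12))
```

Because each operation is scoped to a group, the first row of country B has a missing `demand_lag1` rather than the last value of country A.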

### Check Feature Correlations Per Country

In [None]:
# Show correlation structure differs by country
numeric_cols = ['temperature', 'wind_speed', 'demand_lag1', 'demand_lag7', 'temp_ma7', 'wind_ma7', 'temp_wind']

for country in selected_countries:
    # Correlate each predictor with the outcome within this country only
    country_data = data.loc[data['country'] == country, numeric_cols + ['gas_demand']]
    corrs = country_data.corr()['gas_demand'].drop('gas_demand')
    print(f"\n{country} - Correlation with gas_demand:")
    print(corrs.sort_values(ascending=False).to_string())

print("\n💡 Different countries have different correlation structures!")
print("This suggests per-group preprocessing could be beneficial.")

## Train/Test Split

In [None]:
# Split per country
train_list, test_list = [], []

for country in selected_countries:
    country_data = data[data['country'] == country]
    split = initial_time_split(country_data, prop=0.8)
    train_list.append(split.training())
    test_list.append(split.testing())

train_data = pd.concat(train_list, ignore_index=True)
test_data = pd.concat(test_list, ignore_index=True)

print(f"Train: {len(train_data):,} rows")
print(f"Test:  {len(test_data):,} rows")
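
For readers unfamiliar with `initial_time_split`, the sketch below shows the idea in plain pandas: within each group, the earliest `prop` share of rows becomes training data and the remainder becomes test data. This illustrates the concept under the assumption that rows are sorted by date; it is not the py_rsample implementation, and `time_split_by_group` is a hypothetical helper.

```python
import pandas as pd

def time_split_by_group(df, group_col, prop=0.8):
    """Chronological split within each group; assumes df is sorted by date."""
    train_parts, test_parts = [], []
    for _, group_df in df.groupby(group_col, sort=False):
        cutoff = int(len(group_df) * prop)  # number of rows kept for training
        train_parts.append(group_df.iloc[:cutoff])
        test_parts.append(group_df.iloc[cutoff:])
    return (
        pd.concat(train_parts, ignore_index=True),
        pd.concat(test_parts, ignore_index=True),
    )

# Toy data (made up): groups of different lengths
toy = pd.DataFrame({
    "country": ["A"] * 10 + ["B"] * 5,
    "date": list(pd.date_range("2024-01-01", periods=10))
            + list(pd.date_range("2024-01-01", periods=5)),
    "gas_demand": range(15),
})
train, test = time_split_by_group(toy, "country", prop=0.8)
print(train.groupby("country").size())
print(test.groupby("country").size())
```

Splitting inside each group keeps every country represented in both sets and ensures that no test row precedes a training row from the same country.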

---

# Example 1: Per-Group PCA

**Scenario**: Each country may need a different number of PCA components.

## Global PCA (Baseline)

In [None]:
# Global PCA: All countries share same 3 components
rec_global = (
    recipe()
    .step_normalize(all_numeric_predictors())
    .step_pca(all_numeric_predictors(), num_comp=3)
)

wf_global = Workflow().add_recipe(rec_global).add_model(linear_reg())

print("🔬 Global PCA: All countries get 3 components\n")
fit_global = wf_global.fit_nested(train_data, group_col='country', per_group_prep=False)

# Evaluate on test data (REQUIRED for test stats!)
fit_global = fit_global.evaluate(test_data)
outputs_global, coeffs_global, stats_global = fit_global.extract_outputs()

# Convert from LONG to WIDE format for display
test_stats = stats_global[stats_global['split'] == 'test']
test_stats_global = test_stats.pivot_table(
    index='group',
    columns='metric',
    values='value'
).reset_index()
test_stats_global.columns.name = None  # Remove column name attribute

print("\n📊 Global PCA - Test Performance:")
print(test_stats_global[['group', 'rmse', 'mae', 'r_squared']].to_string(index=False))
print(f"\nAverage RMSE: {test_stats_global['rmse'].mean():.2f}")

## Per-Group PCA (NEW!)

In [None]:
# Per-Group PCA: Each country gets its own PCA (up to 3 components)
rec_per_group = (
    recipe()
    .step_normalize(all_numeric_predictors())
    .step_pca(all_numeric_predictors(), num_comp=3)
)

wf_per_group = Workflow().add_recipe(rec_per_group).add_model(linear_reg())

print("🔬 Per-Group PCA: Each country gets its own PCA\n")
fit_per_group = wf_per_group.fit_nested(
    train_data, 
    group_col='country', 
    per_group_prep=True,  # NEW parameter!
    min_group_size=30  # Minimum samples for group-specific prep
)

# Evaluate on test data (REQUIRED for test stats!)
fit_per_group = fit_per_group.evaluate(test_data)
outputs_per_group, coeffs_per_group, stats_per_group = fit_per_group.extract_outputs()

# Convert from LONG to WIDE format for display
test_stats = stats_per_group[stats_per_group['split'] == 'test']
test_stats_per_group = test_stats.pivot_table(
    index='group',
    columns='metric',
    values='value'
).reset_index()
test_stats_per_group.columns.name = None  # Remove column name attribute

print("\n📊 Per-Group PCA - Test Performance:")
print(test_stats_per_group[['group', 'rmse', 'mae', 'r_squared']].to_string(index=False))
print(f"\nAverage RMSE: {test_stats_per_group['rmse'].mean():.2f}")
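
The long-to-wide conversion used in the two cells above is repeated for every fit in this notebook, so it can be factored into a small helper. This is a sketch that assumes the stats frame has the `split`, `group`, `metric`, and `value` columns used above. The name `pivot_test_stats` is hypothetical, and the toy values are made up.

```python
import pandas as pd

def pivot_test_stats(stats: pd.DataFrame) -> pd.DataFrame:
    """Return one row per group and one column per metric for the test split."""
    test_stats = stats[stats["split"] == "test"]
    wide = test_stats.pivot_table(
        index="group",
        columns="metric",
        values="value",
    ).reset_index()
    wide.columns.name = None  # drop the leftover 'metric' axis label
    return wide

# Toy stats in long format (values are made up)
toy_stats = pd.DataFrame({
    "group": ["Germany", "Germany", "Italy", "Italy", "Germany"],
    "split": ["test", "test", "test", "test", "train"],
    "metric": ["rmse", "mae", "rmse", "mae", "rmse"],
    "value": [10.0, 8.0, 6.0, 5.0, 9.0],
})
wide = pivot_test_stats(toy_stats)
print(wide.to_string(index=False))
```

With such a helper, each later cell would reduce to a single call on the corresponding stats frame.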

### Compare Global vs Per-Group PCA

In [None]:
# Comparison
comparison_pca = pd.DataFrame([
    {'Approach': 'Global PCA', 'Germany': test_stats_global[test_stats_global['group']=='Germany']['rmse'].iloc[0],
     'Italy': test_stats_global[test_stats_global['group']=='Italy']['rmse'].iloc[0],
     'Netherlands': test_stats_global[test_stats_global['group']=='Netherlands']['rmse'].iloc[0],
     'Average': test_stats_global['rmse'].mean()},
    {'Approach': 'Per-Group PCA', 'Germany': test_stats_per_group[test_stats_per_group['group']=='Germany']['rmse'].iloc[0],
     'Italy': test_stats_per_group[test_stats_per_group['group']=='Italy']['rmse'].iloc[0],
     'Netherlands': test_stats_per_group[test_stats_per_group['group']=='Netherlands']['rmse'].iloc[0],
     'Average': test_stats_per_group['rmse'].mean()}
])

# Relative improvement of per-group over global, as a percentage of the global average RMSE
# (kept as a scalar so it is not repeated on both rows of the table)
improvement = (
    (comparison_pca.loc[0, 'Average'] - comparison_pca.loc[1, 'Average']) /
    comparison_pca.loc[0, 'Average'] * 100
)

print("\n📊 PCA Comparison (Test RMSE):")
print(comparison_pca.to_string(index=False))

if improvement > 0:
    print(f"\n✅ Per-group PCA improves RMSE by {improvement:.2f}%")
else:
    print("\n❌ Global PCA performs at least as well (per-group may need more data)")

### Use `get_feature_comparison()` to See Differences

In [None]:
# NEW utility: Compare features across groups
feature_comparison = fit_per_group.get_feature_comparison()

print("\n📊 Feature Comparison Across Groups:")
print(feature_comparison.to_string(index=False))

print("\n💡 This shows which PCA components each country uses")
print("Different groups may have different # components based on their variance structure")
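
To illustrate why variance structure can change how many components a group needs, the sketch below uses NumPy's SVD on synthetic data. The 90% cumulative-variance threshold and the helper `n_components_for_variance` are assumptions made for this illustration; they do not describe how `step_pca` chooses its components.

```python
import numpy as np

def n_components_for_variance(X: np.ndarray, threshold: float = 0.90) -> int:
    """Smallest number of principal components whose cumulative
    explained-variance ratio reaches `threshold`."""
    Xc = (X - X.mean(axis=0)) / X.std(axis=0)       # standardize columns
    singular_values = np.linalg.svd(Xc, compute_uv=False)
    explained = singular_values**2 / np.sum(singular_values**2)
    cumulative = np.cumsum(explained)
    # First index where the cumulative ratio is >= threshold, as a count
    return int(np.searchsorted(cumulative, threshold) + 1)

rng = np.random.default_rng(42)
n = 500

# Group A: four features that are all noisy copies of one latent factor
latent = rng.normal(size=(n, 1))
X_a = latent + 0.1 * rng.normal(size=(n, 4))

# Group B: four independent features
X_b = rng.normal(size=(n, 4))

k_a = n_components_for_variance(X_a)
k_b = n_components_for_variance(X_b)
print(f"Group A needs {k_a} component(s), group B needs {k_b}")
```

A group whose features move together is summarized by very few components, while a group with independent features needs nearly all of them. A single global PCA has to compromise between the two.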

---

# Example 2: Per-Group Feature Selection

**Scenario**: Different countries have different important features.

## Global Feature Selection

In [None]:
# Global: Remove highly correlated features (threshold 0.9)
rec_fs_global = (
    recipe()
    .step_normalize(all_numeric_predictors())
    .step_select_corr(all_numeric_predictors(), threshold=0.9)
)

wf_fs_global = Workflow().add_recipe(rec_fs_global).add_model(rand_forest().set_mode('regression'))

print("🔬 Global Feature Selection: Same features for all countries\n")
fit_fs_global = wf_fs_global.fit_nested(train_data, group_col='country', per_group_prep=False)

# Evaluate on test data (REQUIRED for test stats!)
fit_fs_global = fit_fs_global.evaluate(test_data)
outputs_fs_global, coeffs_fs_global, stats_fs_global = fit_fs_global.extract_outputs()

# Convert from LONG to WIDE format for display
test_stats = stats_fs_global[stats_fs_global['split'] == 'test']
test_stats_fs_global = test_stats.pivot_table(
    index='group',
    columns='metric',
    values='value'
).reset_index()
test_stats_fs_global.columns.name = None  # Remove column name attribute

print("📊 Global Feature Selection - Test Performance:")
print(test_stats_fs_global[['group', 'rmse', 'mae']].to_string(index=False))
print(f"\nAverage RMSE: {test_stats_fs_global['rmse'].mean():.2f}")

## Per-Group Feature Selection

In [None]:
# Per-Group: Each country selects its own features
rec_fs_per_group = (
    recipe()
    .step_normalize(all_numeric_predictors())
    .step_select_corr(all_numeric_predictors(), threshold=0.9)
)

wf_fs_per_group = Workflow().add_recipe(rec_fs_per_group).add_model(rand_forest().set_mode('regression'))

print("🔬 Per-Group Feature Selection: Each country selects different features\n")
fit_fs_per_group = wf_fs_per_group.fit_nested(
    train_data,
    group_col='country',
    per_group_prep=True
)

# Evaluate on test data (REQUIRED for test stats!)
fit_fs_per_group = fit_fs_per_group.evaluate(test_data)
outputs_fs_per_group, coeffs_fs_per_group, stats_fs_per_group = fit_fs_per_group.extract_outputs()

# Convert from LONG to WIDE format for display
test_stats = stats_fs_per_group[stats_fs_per_group['split'] == 'test']
test_stats_fs_per_group = test_stats.pivot_table(
    index='group',
    columns='metric',
    values='value'
).reset_index()
test_stats_fs_per_group.columns.name = None  # Remove column name attribute

print("📊 Per-Group Feature Selection - Test Performance:")
print(test_stats_fs_per_group[['group', 'rmse', 'mae']].to_string(index=False))
print(f"\nAverage RMSE: {test_stats_fs_per_group['rmse'].mean():.2f}")

### Which Features Did Each Country Select?

In [None]:
# See which features each country uses
feature_comparison_fs = fit_fs_per_group.get_feature_comparison()

print("\n📊 Feature Selection by Country:")
print(feature_comparison_fs.to_string(index=False))

print("\n💡 Different countries selected different features!")
print("This is because correlation structures differ by country.")
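
The sketch below shows how a correlation filter can keep different features for different groups. It uses a simple greedy rule in plain pandas on synthetic data. This rule is an assumption chosen for illustration and may differ from what `step_select_corr` does internally; `drop_correlated` is a hypothetical helper.

```python
import numpy as np
import pandas as pd

def drop_correlated(df: pd.DataFrame, threshold: float = 0.9) -> list:
    """Greedy filter: keep a column only if its absolute correlation with
    every column kept so far is at most `threshold`."""
    corr = df.corr().abs()
    kept = []
    for col in df.columns:
        if all(corr.loc[col, other] <= threshold for other in kept):
            kept.append(col)
    return kept

rng = np.random.default_rng(1)
n = 300
temp = rng.normal(size=n)
wind = rng.normal(size=n)

# Country A: temp_ma7 is nearly a copy of temperature
country_a = pd.DataFrame({
    "temperature": temp,
    "temp_ma7": temp + 0.05 * rng.normal(size=n),
    "wind_speed": wind,
})

# Country B: temp_ma7 carries mostly independent information
country_b = pd.DataFrame({
    "temperature": temp,
    "temp_ma7": 0.3 * temp + rng.normal(size=n),
    "wind_speed": wind,
})

selected = {
    "A": drop_correlated(country_a),
    "B": drop_correlated(country_b),
}
print(selected)
```

Applying the same threshold to each group separately drops `temp_ma7` only where it is redundant. A filter fitted once on pooled data would make one decision for both groups.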

### Compare Global vs Per-Group Feature Selection

In [None]:
# Comparison
comparison_fs = pd.DataFrame([
    {'Approach': 'Global FS', 'Germany': test_stats_fs_global[test_stats_fs_global['group']=='Germany']['rmse'].iloc[0],
     'Italy': test_stats_fs_global[test_stats_fs_global['group']=='Italy']['rmse'].iloc[0],
     'Netherlands': test_stats_fs_global[test_stats_fs_global['group']=='Netherlands']['rmse'].iloc[0],
     'Average': test_stats_fs_global['rmse'].mean()},
    {'Approach': 'Per-Group FS', 'Germany': test_stats_fs_per_group[test_stats_fs_per_group['group']=='Germany']['rmse'].iloc[0],
     'Italy': test_stats_fs_per_group[test_stats_fs_per_group['group']=='Italy']['rmse'].iloc[0],
     'Netherlands': test_stats_fs_per_group[test_stats_fs_per_group['group']=='Netherlands']['rmse'].iloc[0],
     'Average': test_stats_fs_per_group['rmse'].mean()}
])

print("\n📊 Feature Selection Comparison (Test RMSE):")
print(comparison_fs.to_string(index=False))

improvement_fs = (comparison_fs.loc[0, 'Average'] - comparison_fs.loc[1, 'Average']) / comparison_fs.loc[0, 'Average'] * 100
print(f"\nImprovement: {improvement_fs:+.2f}%")

---

# Final Comparison: All Approaches

In [None]:
# Compile all results
final_comparison = pd.DataFrame([
    {'Approach': 'Global PCA', 'Preprocessing': 'PCA (3 comp)', 'Per-Group': 'No',
     'Avg_RMSE': test_stats_global['rmse'].mean()},
    {'Approach': 'Per-Group PCA', 'Preprocessing': 'PCA (adaptive)', 'Per-Group': 'Yes',
     'Avg_RMSE': test_stats_per_group['rmse'].mean()},
    {'Approach': 'Global Feature Selection', 'Preprocessing': 'Corr filter (0.9)', 'Per-Group': 'No',
     'Avg_RMSE': test_stats_fs_global['rmse'].mean()},
    {'Approach': 'Per-Group Feature Selection', 'Preprocessing': 'Corr filter (adaptive)', 'Per-Group': 'Yes',
     'Avg_RMSE': test_stats_fs_per_group['rmse'].mean()}
]).sort_values('Avg_RMSE')

print("\n" + "="*70)
print("📊 FINAL COMPARISON: Global vs Per-Group Preprocessing")
print("="*70 + "\n")
print(final_comparison.to_string(index=False))

best = final_comparison.iloc[0]
print(f"\n🏆 BEST APPROACH: {best['Approach']}")
print(f"   Average Test RMSE: {best['Avg_RMSE']:.2f}")

## Visualize Comparison

In [None]:
# Bar chart
fig, ax = plt.subplots(figsize=(12, 6))

colors = ['green' if 'Per-Group' in a else 'steelblue' for a in final_comparison['Approach']]
bars = ax.barh(final_comparison['Approach'], final_comparison['Avg_RMSE'], color=colors, alpha=0.7)

ax.set_xlabel('Average Test RMSE (lower is better)', fontsize=11)
ax.set_title('Global vs Per-Group Preprocessing Comparison', fontsize=13, fontweight='bold')
ax.grid(True, alpha=0.3, axis='x')
ax.invert_yaxis()

# Add values
for bar in bars:
    width = bar.get_width()
    ax.text(width, bar.get_y() + bar.get_height()/2, 
            f'{width:.2f}', ha='left', va='center', fontsize=9)

plt.tight_layout()
plt.show()

---

# Key Takeaways

## When to Use Per-Group Preprocessing

### ✅ Use Per-Group When:

1. **PCA/Dimensionality Reduction**
   - Groups have different variance structures
   - Need different # components per group
   - Variance explained differs significantly

2. **Feature Selection**
   - Groups have different important features
   - Correlation structures differ
   - Feature importance varies by group

3. **Heterogeneous Data**
   - Fundamentally different groups (countries, stores, products)
   - Different data distributions
   - Different scale/variance/correlation patterns

4. **Sufficient Data Per Group**
   - At least 50-100 observations per group
   - Enough to fit stable preprocessing parameters

### ❌ Use Global When:

1. **Homogeneous Groups**
   - All groups similar
   - Shared patterns

2. **Small Groups**
   - <50 observations per group
   - Not enough data for stable group-specific parameters

3. **Simple Transformations**
   - Basic normalization/scaling
   - Log transforms
   - Less critical preprocessing steps

4. **Consistency Requirements**
   - Need same features for all groups
   - Business requirements for uniformity
   - Interpretability across groups important

---

## Best Practices

### Configuration

```python
# Enable per-group preprocessing
fit = wf.fit_nested(
    data=train_data,
    group_col='country',
    per_group_prep=True,  # Enable per-group
    min_group_size=50     # Minimum samples for group-specific prep
)
```

**Parameters**:
- `per_group_prep=True`: Enable per-group preprocessing
- `min_group_size`: Minimum samples needed (default: 30)
  - Groups with fewer samples use global recipe
  - Prevents overfitting on small groups

### Inspect Differences

```python
# See which features each group uses
comparison = fit.get_feature_comparison()
print(comparison)

# Shows:
# - Which features each group selected
# - Differences in feature spaces
# - PCA components per group
```

### Outcome Column Handling

**Automatic preservation**: Outcome column is automatically excluded from preprocessing steps to prevent data leakage.

```python
# This is handled automatically - you don't need to do anything!
rec = recipe().step_pca(all_numeric_predictors(), num_comp=3)
# Outcome column NOT included in PCA
```
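
As a rough picture of what "excluded automatically" means, the sketch below selects numeric predictor columns in plain pandas while leaving out the outcome. It is an analogy only, not the `all_numeric_predictors()` implementation. The helper `numeric_predictors` is hypothetical and the toy values are made up.

```python
import pandas as pd

def numeric_predictors(df: pd.DataFrame, outcome: str) -> list:
    """Names of numeric columns, excluding the outcome column."""
    numeric_cols = df.select_dtypes(include="number").columns
    return [col for col in numeric_cols if col != outcome]

toy = pd.DataFrame({
    "date": pd.date_range("2024-01-01", periods=3),
    "country": ["Germany", "Italy", "Germany"],
    "gas_demand": [100.0, 80.0, 110.0],   # outcome
    "temperature": [2.5, 9.0, 1.0],
    "wind_speed": [4.0, 3.0, 6.0],
})

predictors = numeric_predictors(toy, outcome="gas_demand")
print(predictors)
```

Keeping the outcome out of steps such as PCA matters because a component built partly from the target would leak it into the model's inputs.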

### Small Group Fallback

If a group has fewer than `min_group_size` samples:
- A warning is issued
- The group uses the global recipe instead
- This prevents unstable preprocessing
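
The fallback rule above can be sketched with simple normalization parameters standing in for a full recipe. This is a conceptual illustration in plain pandas, not the library's code. `fit_group_params` is a hypothetical helper, and the threshold of 30 mirrors the documented default.

```python
import warnings
import pandas as pd

def fit_group_params(df, group_col, value_col, min_group_size=30):
    """Return ({group: (mean, std)}, global_params).
    Groups below min_group_size reuse the global parameters."""
    global_params = (df[value_col].mean(), df[value_col].std())
    params = {}
    for name, group_df in df.groupby(group_col):
        if len(group_df) < min_group_size:
            warnings.warn(
                f"Group {name!r} has {len(group_df)} rows "
                f"(< {min_group_size}); using global parameters"
            )
            params[name] = global_params
        else:
            params[name] = (group_df[value_col].mean(), group_df[value_col].std())
    return params, global_params

# Toy data (made up): one large group and one small group
toy = pd.DataFrame({
    "group": ["big"] * 40 + ["small"] * 5,
    "x": [float(i) for i in range(40)] + [100.0, 101.0, 102.0, 103.0, 104.0],
})

with warnings.catch_warnings(record=True) as caught:
    warnings.simplefilter("always")
    params, global_params = fit_group_params(toy, "group", "x", min_group_size=30)

print(params)
print([str(w.message) for w in caught])
```

Note that the Setup cell calls `warnings.filterwarnings('ignore')`, so a fallback warning of this kind would be hidden in this notebook unless that filter is relaxed.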

---

## Common Pitfalls

### ❌ Using Per-Group with Too Little Data

```python
# Bad: Only 20 samples per group
fit = wf.fit_nested(small_data, group_col='group', per_group_prep=True)

# Good: Check group sizes first
print(data.groupby('group').size())
# Only use per-group if groups have >50 samples
```

### ❌ Not Checking Feature Differences

```python
# Always inspect what per-group did
comparison = fit.get_feature_comparison()
print(comparison)

# If all groups have same features → per-group unnecessary
```

### ❌ Using for Simple Transformations

```python
# Overkill: per-group for basic normalization
rec = recipe().step_normalize(all_numeric_predictors())
# Per-group is less critical here when means/stds are similar across groups

# Better use: per-group for feature selection/PCA
rec = recipe().step_pca(...).step_select_corr(...)
```

### ❌ Forgetting to Compare with Global

```python
# Always compare per-group vs global
fit_global = wf.fit_nested(data, group_col='g', per_group_prep=False)
fit_per_group = wf.fit_nested(data, group_col='g', per_group_prep=True)

# Compare performance - per-group not always better!
```

---

## Production Considerations

### New Group Handling

**Problem**: What if the prediction data contains a new, unseen group?

**Solution**: Falls back to global recipe
- Warning issued
- Uses global preprocessing parameters
- Ensures predictions still work
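
The prediction-time fallback can be pictured as a dictionary lookup with a default. The sketch below again uses normalization parameters in place of a prepared recipe. `normalize_with_fallback` and all values are hypothetical, and the library's actual mechanism may differ.

```python
import warnings
import pandas as pd

def normalize_with_fallback(df, group_col, value_col, group_params, global_params):
    """Normalize value_col per group; unseen groups use global_params."""
    parts = []
    for name, group_df in df.groupby(group_col, sort=False):
        if name not in group_params:
            warnings.warn(f"Unseen group {name!r}; falling back to global parameters")
        mean, std = group_params.get(name, global_params)
        part = group_df.copy()
        part[value_col + "_z"] = (part[value_col] - mean) / std
        parts.append(part)
    return pd.concat(parts).sort_index()

# Parameters "learned" at fit time (made-up numbers)
group_params = {"Germany": (10.0, 2.0)}
global_params = (20.0, 5.0)

# New data contains Spain, which was not seen during fitting
new_data = pd.DataFrame({
    "country": ["Germany", "Spain"],
    "temperature": [12.0, 30.0],
})

with warnings.catch_warnings(record=True) as caught:
    warnings.simplefilter("always")
    scored = normalize_with_fallback(
        new_data, "country", "temperature", group_params, global_params
    )

print(scored)
```

The unseen group still receives a transformed value, at the cost of using parameters that were not tailored to it. That is why the warning is worth logging in production.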

### Memory Usage

Per-group preprocessing stores:
- One PreparedRecipe per group
- More memory than global (1 recipe)
- Monitor memory with many groups (>100)

### Monitoring

Track per-group:
- Which features each group uses (via `get_feature_comparison()`)
- Performance per group
- Alert if groups diverge too much
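
One way to quantify "diverge too much" is the Jaccard similarity between the feature sets of each pair of groups. The sketch below assumes the feature lists have already been extracted into a plain dictionary, since the exact layout returned by `get_feature_comparison()` is not shown here. The helper names and the 0.5 threshold are illustrative choices.

```python
from itertools import combinations

def jaccard(a, b):
    """Jaccard similarity of two feature collections (1.0 means identical sets)."""
    a, b = set(a), set(b)
    if not a and not b:
        return 1.0
    return len(a & b) / len(a | b)

def divergent_pairs(features_by_group, min_similarity=0.5):
    """Return (group1, group2, similarity) for pairs below min_similarity."""
    flagged = []
    for g1, g2 in combinations(sorted(features_by_group), 2):
        sim = jaccard(features_by_group[g1], features_by_group[g2])
        if sim < min_similarity:
            flagged.append((g1, g2, sim))
    return flagged

# Hypothetical feature sets per group (made up for illustration)
features_by_group = {
    "Germany": ["temperature", "wind_speed", "demand_lag1"],
    "Italy": ["temperature", "wind_speed", "demand_lag7"],
    "Netherlands": ["temp_wind"],
}

flagged = divergent_pairs(features_by_group, min_similarity=0.5)
for g1, g2, sim in flagged:
    print(f"ALERT: {g1} vs {g2} share few features (Jaccard = {sim:.2f})")
```

The same comparison can be run between two training runs of one group to check whether its feature space stays stable after retraining.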

### Retraining

When retraining:
- Recipe prep happens per group
- Feature spaces may change
- Monitor for stability

---

# References

- **Per-Group Preprocessing Documentation**: `.claude_debugging/PER_GROUP_PREPROCESSING_IMPLEMENTATION.md`
- **Tests**: `tests/test_workflows/test_per_group_prep.py` (5 tests)
- **Code**: `py_workflows/workflow.py:121-179, 255-311, 392-543, 1023-1113`
- **Grouped Modeling**: Examples 13, 25, and 28
- **CLAUDE.md**: Complete architecture documentation