# Panel/Grouped Models Demo

This notebook demonstrates **panel/grouped modeling**: fitting models to datasets that contain multiple groups or entities.

## What is Panel/Grouped Modeling?

Panel data contains observations for multiple entities (groups) over time:
- **Multiple stores** with daily sales
- **Multiple customers** with transaction histories
- **Multiple regions** with economic indicators
- **Multiple products** with demand patterns

## Two Modeling Approaches:

### 1. Nested (Per-Group) Modeling - `fit_nested()`
- Fits **separate models for each group**
- Each group gets its own parameters
- Best when groups have **different patterns**
- Example: High-end vs budget stores have different sales trends

### 2. Global Modeling - `fit_global()`
- Fits **one model for all groups**
- Group ID becomes a **feature** in the model
- Best when groups share **similar patterns**
- Uses limited per-group data more efficiently, since every group contributes to the same parameters
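
The difference between the two approaches can be sketched with plain numpy and pandas. This is an illustration under simple assumptions (straight-line fits on a made-up two-group panel), not how `fit_nested()` or `fit_global()` are implemented; the `toy` frame and its values are hypothetical.

```python
import numpy as np
import pandas as pd

# Toy panel: two groups with different levels and slopes (illustrative values).
t = np.arange(10)
toy = pd.DataFrame({
    "group": ["x"] * 10 + ["y"] * 10,
    "time": np.tile(t, 2),
    "y": np.concatenate([5.0 + 2.0 * t, 1.0 + 0.5 * t]),
})

# Nested: one [slope, intercept] pair per group.
nested_params = {
    name: np.polyfit(g["time"].to_numpy(dtype=float), g["y"].to_numpy(), deg=1)
    for name, g in toy.groupby("group")
}

# Global: a single least-squares fit with a group dummy as an extra feature.
X = np.column_stack([
    np.ones(len(toy)),                      # intercept
    (toy["group"] == "y").astype(float),    # dummy: 1 for group "y", 0 for "x"
    toy["time"].to_numpy(dtype=float),      # one slope shared by both groups
])
global_params, *_ = np.linalg.lstsq(X, toy["y"].to_numpy(), rcond=None)

print("nested slopes:", {k: round(float(v[0]), 3) for k, v in nested_params.items()})
print("global shared slope:", round(float(global_params[2]), 3))
```

The nested fit recovers each group's own slope, while the global fit has to settle on one slope for both groups and can only shift the level through the dummy.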

## Key Features:

1. **Unified API**: Same workflow for both approaches
2. **Works with any model**: linear_reg, rand_forest, recursive_reg, etc.
3. **Three-DataFrame outputs**: `extract_outputs()` returns outputs, coefficients, and stats, each with the group column
4. **Easy comparison**: Compare performance across groups
5. **Handles evaluation**: Test on held-out data per group

In [1]:
import pandas as pd
import numpy as np
from datetime import datetime, timedelta

from py_workflows import workflow
from py_parsnip import linear_reg, rand_forest, recursive_reg

# Set random seed for reproducibility
np.random.seed(42)

## Generate Synthetic Panel Data

Create sales data for **3 stores** with different characteristics:
- **Store A (Premium)**: High sales, steep growth
- **Store B (Standard)**: Medium sales, steady growth  
- **Store C (Budget)**: Lower sales, slow growth

In [2]:
# Generate 120 days of data for 3 stores
n_days = 120
dates = pd.date_range("2023-01-01", periods=n_days, freq="D")

# Store A: Premium store - high sales, steep growth
store_a = pd.DataFrame({
    "date": dates,
    "store_id": "A",
    "sales": np.linspace(200, 300, n_days) + np.random.normal(0, 8, n_days)
})

# Store B: Standard store - medium sales, steady growth
store_b = pd.DataFrame({
    "date": dates,
    "store_id": "B",
    "sales": np.linspace(150, 200, n_days) + np.random.normal(0, 6, n_days)
})

# Store C: Budget store - lower sales, slow growth
store_c = pd.DataFrame({
    "date": dates,
    "store_id": "C",
    "sales": np.linspace(80, 110, n_days) + np.random.normal(0, 4, n_days)
})

# Combine all stores
data = pd.concat([store_a, store_b, store_c], ignore_index=True)

print("Panel Data Overview:")
print(data.head(15))
print(f"\nShape: {data.shape}")
print(f"\nStores: {data['store_id'].unique()}")
print(f"Days per store: {data.groupby('store_id').size().values}")

Panel Data Overview:
         date store_id       sales
0  2023-01-01        A  203.973713
1  2023-01-02        A  199.734222
2  2023-01-03        A  206.862181
3  2023-01-04        A  214.705247
4  2023-01-05        A  201.488118
5  2023-01-06        A  202.328585
6  2023-01-07        A  217.675719
7  2023-01-08        A  212.021831
8  2023-01-09        A  202.966894
9  2023-01-10        A  211.903506
10 2023-01-11        A  204.696020
11 2023-01-12        A  205.517859
12 2023-01-13        A  212.019732
13 2023-01-14        A  195.618128
14 2023-01-15        A  197.965363

Shape: (360, 3)

Stores: ['A' 'B' 'C']
Days per store: [120 120 120]


In [3]:
# View sales statistics by store
print("Sales Statistics by Store:")
print(data.groupby('store_id')['sales'].describe())

Sales Statistics by Store:
          count        mean        std         min         25%         50%  \
store_id                                                                     
A         120.0  249.366648  30.705419  195.618128  221.079571  249.603229   
B         120.0  175.446186  16.056273  142.849398  164.072695  175.074306   
C         120.0   95.039061   9.487932   72.581149   87.874811   96.672970   

                 75%         max  
store_id                          
A         275.582286  314.663920  
B         188.674802  210.762824  
C         102.655838  113.884618  


## Train/Test Split

Split chronologically - train on first 100 days, test on last 20 days **for each store**:

In [4]:
# Create a numeric time variable for easy splitting
data['time'] = data.groupby('store_id').cumcount()

# Split: first 100 days = train, last 20 days = test
train = data[data['time'] < 100].copy()
test = data[data['time'] >= 100].copy()

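# Derive the store count from the data instead of hard-coding 3
n_stores = data['store_id'].nunique()
print(f"Train: {len(train)} rows ({len(train) // n_stores} days per store)")
print(f"Test: {len(test)} rows ({len(test) // n_stores} days per store)")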

# Verify each store has the right split
print("\nRows per store:")
print(f"Train: {train.groupby('store_id').size().to_dict()}")
print(f"Test: {test.groupby('store_id').size().to_dict()}")

Train: 300 rows (100 days per store)
Test: 60 rows (20 days per store)

Rows per store:
Train: {'A': 100, 'B': 100, 'C': 100}
Test: {'A': 20, 'B': 20, 'C': 20}
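
The split above relies on every store having the same number of days. When groups have unequal lengths, holding out the last `n` rows of each group is one alternative. The helper below is a hypothetical sketch (`split_last_n` is not part of `py_workflows`), assuming a sortable time column.

```python
import pandas as pd

def split_last_n(df, group_col, time_col, n_test):
    """Hold out the last `n_test` rows of each group, ordered by `time_col`."""
    ordered = df.sort_values([group_col, time_col])
    # Rank rows from the end of each group: 0 is the most recent observation.
    rank_from_end = ordered.groupby(group_col).cumcount(ascending=False)
    is_test = rank_from_end < n_test
    return ordered[~is_test].copy(), ordered[is_test].copy()

# Unbalanced toy panel: store A has 5 days, store B only 3.
demo = pd.DataFrame({
    "store_id": ["A"] * 5 + ["B"] * 3,
    "time": [0, 1, 2, 3, 4, 0, 1, 2],
})
demo_train, demo_test = split_last_n(demo, "store_id", "time", n_test=2)
print(demo_test)
```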


---

# Approach 1: Nested (Per-Group) Modeling

Use `fit_nested()` to fit **separate models for each store**.

## Example 1: Nested Linear Models

In [5]:
# Create workflow with linear regression
wf_nested = (
    workflow()
    .add_formula("sales ~ time")
    .add_model(linear_reg())
)

# Fit nested models - one per store
nested_fit = wf_nested.fit_nested(train, group_col="store_id")

print(f"Fitted {len(nested_fit.group_fits)} models:")
for store_id in nested_fit.group_fits.keys():
    print(f"  - Store {store_id}")

Fitted 3 models:
  - Store A
  - Store B
  - Store C


### Predict for All Stores

The `predict()` method routes each row to the model fitted for its store:

In [6]:
# Predict on test data
predictions_nested = nested_fit.predict(test)

print("Predictions for all stores:")
print(predictions_nested.head(15))

# Verify we have predictions for all stores
print("\nPredictions per store:")
print(predictions_nested.groupby('store_id').size())

Predictions for all stores:
         .pred store_id
0   283.765718        A
1   284.617200        A
2   285.468682        A
3   286.320165        A
4   287.171647        A
5   288.023129        A
6   288.874611        A
7   289.726094        A
8   290.577576        A
9   291.429058        A
10  292.280540        A
11  293.132022        A
12  293.983505        A
13  294.834987        A
14  295.686469        A

Predictions per store:
store_id
A    20
B    20
C    20
dtype: int64
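
What per-group routing means can be sketched with a dict of fitted lines keyed by group. `fit_per_group` and `predict_per_group` are hypothetical helpers written for this illustration; they are not part of `py_workflows`, and the library's own routing may work differently.

```python
import numpy as np
import pandas as pd

def fit_per_group(df, group_col, x_col, y_col):
    """Fit one straight line per group; returns {group: (slope, intercept)}."""
    return {
        name: tuple(np.polyfit(g[x_col].to_numpy(dtype=float),
                               g[y_col].to_numpy(dtype=float), 1))
        for name, g in df.groupby(group_col)
    }

def predict_per_group(models, new_df, group_col, x_col):
    """Send each row to its own group's model, keeping the input row order."""
    pred = pd.Series(np.nan, index=new_df.index, name=".pred")
    for name, g in new_df.groupby(group_col):
        if name not in models:
            raise KeyError(f"no fitted model for group {name!r}")
        slope, intercept = models[name]
        pred.loc[g.index] = slope * g[x_col] + intercept
    return pred

train_demo = pd.DataFrame({
    "g": ["a", "a", "a", "b", "b", "b"],
    "t": [0, 1, 2, 0, 1, 2],
    "y": [1.0, 3.0, 5.0, 10.0, 9.0, 8.0],
})
models = fit_per_group(train_demo, "g", "t", "y")

# Rows may arrive in any group order; each one is still scored by its own model.
new_demo = pd.DataFrame({"g": ["b", "a"], "t": [3, 4]})
routed = predict_per_group(models, new_demo, "g", "t")
print(routed)
```

Raising on an unseen group is a design choice in this sketch; a group with no fitted model has nothing sensible to fall back on.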


### Evaluate on Test Data

Use `evaluate()` to compute metrics for each store:

In [7]:
# Evaluate on test set
nested_fit = nested_fit.evaluate(test)

print("Evaluation complete for all stores!")

Evaluation complete for all stores!


### Extract Outputs - Three-DataFrame Structure

All outputs include the **group column** for easy analysis:

In [8]:
# Extract outputs
outputs_nested, coefs_nested, stats_nested = nested_fit.extract_outputs()

print("1. OUTPUTS DataFrame (first 10 rows):")
print(outputs_nested.head(10))

1. OUTPUTS DataFrame (first 10 rows):
      actuals      fitted    forecast   residuals  split       model  \
0  203.973713  198.617497  203.973713  203.973713  train  linear_reg   
1  199.734222  199.468979  199.734222  199.734222  train  linear_reg   
2  206.862181  200.320462  206.862181  206.862181  train  linear_reg   
3  214.705247  201.171944  214.705247  214.705247  train  linear_reg   
4  201.488118  202.023426  201.488118  201.488118  train  linear_reg   
5  202.328585  202.874908  202.328585  202.328585  train  linear_reg   
6  217.675719  203.726390  217.675719  217.675719  train  linear_reg   
7  212.021831  204.577873  212.021831  212.021831  train  linear_reg   
8  202.966894  205.429355  202.966894  202.966894  train  linear_reg   
9  211.903506  206.280837  211.903506  211.903506  train  linear_reg   

  model_group_name   group store_id  
0                   global        A  
1                   global        A  
2                   global        A  
3                

In [9]:
print("\n2. COEFFICIENTS DataFrame:")
print(coefs_nested)

# Compare slopes across stores
print("\nGrowth Rate (time coefficient) by Store:")
time_coefs = coefs_nested[coefs_nested['variable'] == 'time'][['store_id', 'coefficient']]
print(time_coefs.sort_values('coefficient', ascending=False))


2. COEFFICIENTS DataFrame:
    variable  coefficient  std_error    t_stat   p_value   ci_0.025  \
0  Intercept     0.000000  48.552021  0.000000  1.000000 -96.349906   
1       time     0.851482   0.847303  1.004932  0.317404  -0.829963   
2  Intercept     0.000000  34.482586  0.000000  1.000000 -68.429570   
3       time     0.442764   0.601771  0.735768  0.463629  -0.751431   
4  Intercept     0.000000  18.665178  0.000000  1.000000 -37.040439   
5       time     0.263937   0.325734  0.810282  0.419741  -0.382472   

    ci_0.975  vif       model model_group_name   group store_id  
0  96.349906  NaN  linear_reg                   global        A  
1   2.532928  1.0  linear_reg                   global        A  
2  68.429570  NaN  linear_reg                   global        B  
3   1.636959  1.0  linear_reg                   global        B  
4  37.040439  NaN  linear_reg                   global        C  
5   0.910346  1.0  linear_reg                   global        C  

Growth Rate

In [10]:
print("\n3. STATS DataFrame:")
print(stats_nested)

# Get test metrics for each store
test_metrics = stats_nested[
    (stats_nested['split'] == 'test') & 
    (stats_nested['metric'].isin(['rmse', 'mae', 'r_squared']))
][['store_id', 'metric', 'value']]

print("\nTest Metrics by Store:")
test_pivot = test_metrics.pivot(index='store_id', columns='metric', values='value')
print(test_pivot)


3. STATS DataFrame:
             metric             value  split       model model_group_name  \
0              rmse          7.221766  train  linear_reg                    
1               mae          5.719704  train  linear_reg                    
2              mape          2.418699  train  linear_reg                    
3             smape          2.414162  train  linear_reg                    
4         r_squared          0.920531  train  linear_reg                    
..              ...               ...    ...         ...              ...   
67  breusch_pagan_p               NaN  train  linear_reg                    
68          formula      sales ~ time         linear_reg                    
69       model_type        linear_reg         linear_reg                    
70      model_class  LinearRegression         linear_reg                    
71      n_obs_train               100  train  linear_reg                    

     group store_id  
0   global        A  
1   global

---

# Approach 2: Global Modeling

Use `fit_global()` to fit **one model for all stores** with store_id as a feature.

## Example 2: Global Linear Model

In [11]:
# Create workflow with linear regression
wf_global = (
    workflow()
    .add_formula("sales ~ time")
    .add_model(linear_reg())
)

# Fit global model - store_id will be added as a feature
global_fit = wf_global.fit_global(train, group_col="store_id")

print("Fitted single global model with store_id as a feature")

Fitted single global model with store_id as a feature


In [12]:
# Predict on test data
predictions_global = global_fit.predict(test)

print("Global model predictions:")
print(predictions_global.head(15))

Global model predictions:
         .pred
0   266.995283
1   267.514677
2   268.034071
3   268.553466
4   269.072860
5   269.592254
6   270.111649
7   270.631043
8   271.150438
9   271.669832
10  272.189226
11  272.708621
12  273.228015
13  273.747409
14  274.266804


In [13]:
# Evaluate on test set
global_fit = global_fit.evaluate(test)

# Extract outputs
outputs_global, coefs_global, stats_global = global_fit.extract_outputs()

print("Global Model Coefficients:")
print(coefs_global[['variable', 'coefficient']])

Global Model Coefficients:
        variable  coefficient
0      Intercept     0.000000
1  store_id[T.B]   -69.378045
2  store_id[T.C]  -148.071354
3           time     0.519394
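
The `store_id[T.B]` and `store_id[T.C]` names look like treatment (dummy) coding, where Store A is the reference level and each coefficient is a level offset from A. Read that way, the global model gives every store the same slope (0.519, which is the mean of the three per-store slopes 0.851, 0.443, and 0.264 from the nested fit) and only shifts the level. The sketch below shows the equivalent encoding with `pd.get_dummies`; it assumes this is the coding in use and is not taken from the library's code.

```python
import pandas as pd

stores = pd.DataFrame({"store_id": ["A", "B", "C", "A"], "time": [0, 0, 0, 1]})

# drop_first=True drops the reference level (A), leaving one 0/1 column per other store.
dummies = pd.get_dummies(stores["store_id"], prefix="store_id", drop_first=True, dtype=float)
design = pd.concat([dummies, stores[["time"]]], axis=1)
print(design)
```

Rows for Store A have zeros in both dummy columns, so their prediction is the intercept plus the shared `time` term.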


In [14]:
# Get test metrics
global_test_metrics = stats_global[
    (stats_global['split'] == 'test') & 
    (stats_global['metric'].isin(['rmse', 'mae', 'r_squared']))
][['metric', 'value']]

print("Global Model Test Metrics:")
print(global_test_metrics)

Global Model Test Metrics:
       metric      value
7        rmse  17.221136
8         mae  15.270248
11  r_squared   0.948896
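
An R² near 0.95 alongside an RMSE of about 17 can look contradictory. A pooled R² is measured against the overall mean across all stores, so a model that only gets each store's level right already scores highly. The toy example below illustrates this with made-up numbers; `r_squared` is a helper defined here, not a library function.

```python
import numpy as np

def r_squared(actual, pred):
    """Coefficient of determination: 1 - SS_res / SS_tot."""
    ss_res = np.sum((actual - pred) ** 2)
    ss_tot = np.sum((actual - np.mean(actual)) ** 2)
    return 1.0 - ss_res / ss_tot

# Two groups at very different levels (illustrative numbers).
low = np.array([9.0, 10.0, 11.0])
high = np.array([99.0, 100.0, 101.0])

# A "model" that knows each group's level but ignores within-group variation.
pred_low = np.full(3, 10.0)
pred_high = np.full(3, 100.0)

pooled = r_squared(np.concatenate([low, high]), np.concatenate([pred_low, pred_high]))
within_low = r_squared(low, pred_low)
print(f"pooled R^2: {pooled:.4f}, within-group R^2: {within_low:.4f}")
```

For panel data, per-group error metrics are therefore usually more informative than a single pooled R².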


---

# Comparison: Nested vs Global

Compare the two approaches on test set performance:

In [15]:
print("=" * 80)
print("NESTED vs GLOBAL MODEL COMPARISON")
print("=" * 80)

# Nested model metrics (per store)
nested_test = stats_nested[
    (stats_nested['split'] == 'test') & 
    (stats_nested['metric'].isin(['rmse', 'mae']))
][['store_id', 'metric', 'value']]

print("\nNested (Per-Store) Models:")
print(nested_test.pivot(index='store_id', columns='metric', values='value'))
print(f"\nAverage RMSE: {nested_test[nested_test['metric'] == 'rmse']['value'].mean():.4f}")

# Global model metrics (overall)
global_test = stats_global[
    (stats_global['split'] == 'test') & 
    (stats_global['metric'].isin(['rmse', 'mae']))
][['metric', 'value']]

print("\nGlobal (Single) Model:")
print(global_test)

# Calculate global RMSE by store manually
print("\nGlobal Model RMSE by Store:")
test_with_preds = test.copy()
# Assumes predict() returns one row per test row, in the same order as `test`
test_with_preds['.pred'] = predictions_global['.pred'].values
test_with_preds['residual'] = test_with_preds['sales'] - test_with_preds['.pred']
test_with_preds['squared_error'] = test_with_preds['residual'] ** 2

global_rmse_by_store = test_with_preds.groupby('store_id')['squared_error'].mean() ** 0.5
print(global_rmse_by_store)

print("\n" + "=" * 80)
print("INTERPRETATION:")
print("=" * 80)
print("Nested models typically perform better when stores have DIFFERENT patterns.")
print("Global models are more efficient when stores have SIMILAR patterns.")

NESTED vs GLOBAL MODEL COMPARISON

Nested (Per-Store) Models:
metric         mae      rmse
store_id                    
A          5.56426  7.972791
B         5.899484  7.281123
C         2.604756  3.328922

Average RMSE: 6.1943

Global (Single) Model:
  metric      value
7   rmse  17.221136
8    mae  15.270248

Global Model RMSE by Store:
store_id
A    22.210394
B     9.734423
C    17.367842
Name: squared_error, dtype: float64

INTERPRETATION:
Nested models typically perform better when stores have DIFFERENT patterns.
Global models are more efficient when stores have SIMILAR patterns.
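
Note that the global model's overall RMSE (17.22) is a pooled figure, while "Average RMSE" for the nested models is a mean of per-store RMSEs, so the two are not directly comparable. With equal test sizes per store, the pooled RMSE is the square root of the mean per-store MSE. The arithmetic below reproduces it from the per-store values printed above.

```python
import math

# Per-store RMSE of the global model, copied from the output above.
rmse_by_store = {"A": 22.210394, "B": 9.734423, "C": 17.367842}

# Equal test sizes per store: pooled MSE is the plain mean of per-store MSEs.
pooled_rmse = math.sqrt(sum(r ** 2 for r in rmse_by_store.values()) / len(rmse_by_store))
mean_rmse = sum(rmse_by_store.values()) / len(rmse_by_store)

print(f"pooled RMSE: {pooled_rmse:.4f}")              # 17.2211
print(f"mean of per-store RMSEs: {mean_rmse:.4f}")    # 16.4376
```

The like-for-like comparison with the nested average of 6.1943 is therefore 16.4376, not 17.2211.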


---

# Example 3: Nested with Random Forest

The same workflow works with other model types. Here it is with a random forest:

In [16]:
# Create workflow with Random Forest (set mode to regression for continuous outcomes)
wf_rf = (
    workflow()
    .add_formula("sales ~ time")
    .add_model(rand_forest(trees=100, min_n=5).set_mode("regression"))
)

# Fit nested Random Forest models
nested_rf_fit = wf_rf.fit_nested(train, group_col="store_id")
nested_rf_fit = nested_rf_fit.evaluate(test)

print("Nested Random Forest models fitted and evaluated!")

Nested Random Forest models fitted and evaluated!


In [17]:
# Extract outputs
outputs_rf, coefs_rf, stats_rf = nested_rf_fit.extract_outputs()

# Feature importances for each store
print("Random Forest Feature Importances by Store:")
print(coefs_rf[['store_id', 'variable', 'coefficient']])

# Test metrics
rf_test_metrics = stats_rf[
    (stats_rf['split'] == 'test') & 
    (stats_rf['metric'].isin(['rmse', 'mae']))
][['store_id', 'metric', 'value']]

print("\nRandom Forest Test Metrics by Store:")
print(rf_test_metrics.pivot(index='store_id', columns='metric', values='value'))

Random Forest Feature Importances by Store:
  store_id variable  coefficient
0        A     time          1.0
1        B     time          1.0
2        C     time          1.0

Random Forest Test Metrics by Store:
metric          mae       rmse
store_id                      
A         12.313664  15.063605
B          8.069578   9.685909
C          3.521888   4.295017
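
One plausible reason the forest trails the linear fit here, most visibly on Store A with its steep trend, is that tree-based models predict averages of training targets and so cannot extrapolate past the training range. The sketch below uses a binned-mean predictor as a simple stand-in for a tree (it is not a random forest) on noise-free, made-up data.

```python
import numpy as np

# Noise-free upward trend: train on t = 0..99, test on t = 100..119.
t_train = np.arange(100)
y_train = 200.0 + 1.0 * t_train
t_test = np.arange(100, 120)
y_test = 200.0 + 1.0 * t_test

# Stand-in for a tree: cut time into 10 equal-width bins and predict each bin's mean.
n_bins = 10
edges = np.linspace(0, 100, n_bins + 1)

def bin_index(t):
    # np.digitize is 1-based; clipping sends any t >= 100 into the last bin.
    return np.clip(np.digitize(t, edges) - 1, 0, n_bins - 1)

bin_means = np.array([y_train[bin_index(t_train) == b].mean() for b in range(n_bins)])

tree_like_pred = bin_means[bin_index(t_test)]   # every test point lands in the last bin
linear_pred = np.polyval(np.polyfit(t_train, y_train, 1), t_test)

print("tree-like prediction range:", tree_like_pred.min(), tree_like_pred.max())
print("max linear error:", float(np.abs(y_test - linear_pred).max()))
```

The binned predictor stays flat at the last bin's mean while the true series keeps rising, so its error grows with the forecast horizon.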


---

# Example 4: Nested with Recursive Forecasting

Combine panel modeling with recursive forecasting for **per-store autoregressive models**.

In [18]:
# Create workflow with recursive forecasting
wf_recursive = (
    workflow()
    .add_formula("sales ~ .")
    .add_model(recursive_reg(base_model=linear_reg(), lags=7))
)

# Fit nested recursive models
# Recursive models need the date column, so pass the full frames (copied to leave the originals untouched)
train_recursive = train.copy()
test_recursive = test.copy()

nested_recursive_fit = wf_recursive.fit_nested(train_recursive, group_col="store_id")

print(f"Fitted {len(nested_recursive_fit.group_fits)} recursive models (7 lags each)")

Fitted 3 recursive models (7 lags each)
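
Recursive forecasting means predicting one step ahead, then feeding that prediction back in as the newest lag for the next step. The sketch below shows the idea with a least-squares autoregression in numpy. `fit_ar` and `forecast_recursive` are hypothetical helpers for illustration only; `recursive_reg` may build its lags and features differently.

```python
import numpy as np

def fit_ar(y, lags):
    """Least-squares AR(lags) with intercept; returns [c, phi_1, ..., phi_lags]."""
    y = np.asarray(y, dtype=float)
    # Each row holds the previous `lags` values, most recent first.
    rows = [y[i - lags:i][::-1] for i in range(lags, len(y))]
    X = np.column_stack([np.ones(len(rows)), np.array(rows)])
    coef, *_ = np.linalg.lstsq(X, y[lags:], rcond=None)
    return coef

def forecast_recursive(y, coef, steps):
    """Forecast `steps` ahead, appending each prediction to the history."""
    lags = len(coef) - 1
    history = list(np.asarray(y, dtype=float)[-lags:])
    out = []
    for _ in range(steps):
        recent = history[::-1][:lags]            # most recent value first
        nxt = float(coef[0] + np.dot(coef[1:], recent))
        out.append(nxt)
        history.append(nxt)                      # the prediction becomes a lag
    return np.array(out)

# Noise-free series following y_t = 1 + 0.5 * y_{t-1}.
series = [10.0, 6.0, 4.0, 3.0, 2.5, 2.25, 2.125, 2.0625]
ar_coef = fit_ar(series, lags=1)
ar_forecast = forecast_recursive(series, ar_coef, steps=2)
print(ar_coef, ar_forecast)
```

Because later steps depend on earlier predictions rather than observed values, errors can compound over long horizons.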


In [19]:
# Predict future values
predictions_recursive = nested_recursive_fit.predict(test_recursive)

print("Recursive forecasts for all stores:")
print(predictions_recursive.head(15))

Recursive forecasts for all stores:
         .pred store_id
0   282.449322        A
1   286.586154        A
2   286.967227        A
3   287.430585        A
4   287.786401        A
5   288.610368        A
6   290.181587        A
7   290.877576        A
8   291.715758        A
9   292.476027        A
10  293.332922        A
11  294.367467        A
12  295.190490        A
13  296.074555        A
14  296.925342        A


In [20]:
# Evaluate
nested_recursive_fit = nested_recursive_fit.evaluate(test_recursive)

outputs_rec, coefs_rec, stats_rec = nested_recursive_fit.extract_outputs()

# Show lag coefficients for each store
print("Lag Coefficients by Store:")
lag_coefs = coefs_rec[coefs_rec['variable'].str.contains('lag_', na=False)]
print(lag_coefs[['store_id', 'variable', 'coefficient']])

Lag Coefficients by Store:
   store_id variable  coefficient
0         A    lag_1    -0.041636
1         A    lag_2    -0.066337
2         A    lag_3    -0.091692
3         A    lag_4    -0.088681
4         A    lag_5     0.179099
5         A    lag_6    -0.046983
6         A    lag_7    -0.000828
8         B    lag_1     0.008633
9         B    lag_2     0.003536
10        B    lag_3     0.025337
11        B    lag_4    -0.145745
12        B    lag_5    -0.082888
13        B    lag_6     0.086616
14        B    lag_7    -0.096384
16        C    lag_1    -0.184567
17        C    lag_2    -0.061286
18        C    lag_3     0.071973
19        C    lag_4    -0.033647
20        C    lag_5     0.023651
21        C    lag_6     0.081030
22        C    lag_7     0.001164


In [21]:
# Test metrics for recursive models
rec_test_metrics = stats_rec[
    (stats_rec['split'] == 'test') & 
    (stats_rec['metric'].isin(['rmse', 'mae']))
][['store_id', 'metric', 'value']]

print("Recursive Model Test Metrics by Store:")
print(rec_test_metrics.pivot(index='store_id', columns='metric', values='value'))

Recursive Model Test Metrics by Store:
metric         mae      rmse
store_id                    
A         5.646555  7.804934
B          6.14292  7.643763
C         2.658626  3.335106


---

# Example 5: Multi-Model Comparison

Compare all approaches side-by-side:

In [22]:
# Collect all test RMSE values
comparison_data = []

# Nested Linear
for _, row in nested_test[nested_test['metric'] == 'rmse'].iterrows():
    comparison_data.append({
        'approach': 'Nested Linear',
        'store_id': row['store_id'],
        'rmse': row['value']
    })

# Global Linear (by store)
for store_id, rmse in global_rmse_by_store.items():
    comparison_data.append({
        'approach': 'Global Linear',
        'store_id': store_id,
        'rmse': rmse
    })

# Nested Random Forest
for _, row in rf_test_metrics[rf_test_metrics['metric'] == 'rmse'].iterrows():
    comparison_data.append({
        'approach': 'Nested RF',
        'store_id': row['store_id'],
        'rmse': row['value']
    })

# Nested Recursive
for _, row in rec_test_metrics[rec_test_metrics['metric'] == 'rmse'].iterrows():
    comparison_data.append({
        'approach': 'Nested Recursive',
        'store_id': row['store_id'],
        'rmse': row['value']
    })

comparison_df = pd.DataFrame(comparison_data)
comparison_pivot = comparison_df.pivot(index='store_id', columns='approach', values='rmse')

print("=" * 100)
print("COMPREHENSIVE MODEL COMPARISON - TEST SET RMSE")
print("=" * 100)
print(comparison_pivot)
print("\nLower RMSE is better")

# Find best model for each store
print("\n" + "=" * 100)
print("BEST MODEL FOR EACH STORE")
print("=" * 100)
for store_id in comparison_pivot.index:
    best_model = comparison_pivot.loc[store_id].idxmin()
    best_rmse = comparison_pivot.loc[store_id].min()
    print(f"Store {store_id}: {best_model:25s} (RMSE: {best_rmse:.4f})")

# Overall average
print("\n" + "=" * 100)
print("AVERAGE RMSE ACROSS ALL STORES")
print("=" * 100)
avg_rmse = comparison_pivot.mean()
print(avg_rmse.sort_values())
print(f"\nBest overall: {avg_rmse.idxmin()} (Average RMSE: {avg_rmse.min():.4f})")

COMPREHENSIVE MODEL COMPARISON - TEST SET RMSE
approach  Global Linear  Nested Linear  Nested RF  Nested Recursive
store_id                                                           
A             22.210394       7.972791  15.063605          7.804934
B              9.734423       7.281123   9.685909          7.643763
C             17.367842       3.328922   4.295017          3.335106

Lower RMSE is better

BEST MODEL FOR EACH STORE
Store A: Nested Recursive          (RMSE: 7.8049)
Store B: Nested Linear             (RMSE: 7.2811)
Store C: Nested Linear             (RMSE: 3.3289)

AVERAGE RMSE ACROSS ALL STORES
approach
Nested Linear        6.194279
Nested Recursive     6.261268
Nested RF            9.681510
Global Linear       16.437553
dtype: float64

Best overall: Nested Linear (Average RMSE: 6.1943)


---

# Summary

## Panel/Grouped Modeling Key Takeaways:

### 1. When to Use Nested vs Global:

**Use Nested (`fit_nested()`) when:**
- Groups have **different patterns** (e.g., premium vs budget stores)
- You have **enough data per group** (as a rough guide, 50+ observations)
- You want **group-specific parameters**
- Interpretability of per-group models is important

**Use Global (`fit_global()`) when:**
- Groups have **similar patterns** with different levels
- You have **limited data per group** (as a rough guide, fewer than 50 observations)
- You want to **share information** across groups
- Computational efficiency is important

### 2. Works with Any Model:
- **Linear models**: Fast, interpretable coefficients per group
- **Random Forest**: Captures non-linear patterns, feature importances
- **Recursive models**: Time series forecasting per group
- **Any sklearn-compatible model**: Full flexibility

### 3. Three-DataFrame Output:
All outputs include the **group column** for easy filtering and comparison:
- **Outputs**: Predictions and residuals per group
- **Coefficients**: Model parameters per group
- **Stats**: Performance metrics per group

### 4. Unified API:
- `fit_nested(data, group_col="store_id")`: Fit separate models
- `fit_global(data, group_col="store_id")`: Fit single model
- `predict(new_data)`: Automatic routing to correct model
- `evaluate(test_data)`: Test set evaluation per group
- `extract_outputs()`: Standardized output structure

### 5. Combining Approaches:
You can combine panel modeling with:
- **Recursive forecasting**: Per-group autoregressive models
- **Feature engineering**: Add group-specific features
- **Cross-validation**: Time series splits per group
- **Model selection**: Compare models per group
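
For the cross-validation item, one option is expanding-window splits on a shared integer time index such as the `time` column used in this notebook. Each fold then contains the same time window for every group. `expanding_time_splits` is a hypothetical helper sketched here, not a `py_workflows` function.

```python
import pandas as pd

def expanding_time_splits(df, time_col, initial, horizon):
    """Yield (train, test) folds: train on time < cutoff, test on the next `horizon` steps."""
    last = int(df[time_col].max())
    cutoff = initial
    while cutoff + horizon <= last + 1:
        train_fold = df[df[time_col] < cutoff]
        test_fold = df[(df[time_col] >= cutoff) & (df[time_col] < cutoff + horizon)]
        yield train_fold, test_fold
        cutoff += horizon   # grow the training window by one horizon per fold

# Toy panel: two stores sharing time steps 0..9.
panel = pd.DataFrame({
    "store_id": ["A"] * 10 + ["B"] * 10,
    "time": list(range(10)) * 2,
})
folds = list(expanding_time_splits(panel, "time", initial=4, horizon=2))
for i, (tr, te) in enumerate(folds):
    print(f"fold {i}: train rows={len(tr)}, test rows={len(te)}")
```

Each fold's training and test frames can be passed to `fit_nested()` and `evaluate()` in the same way as the single split used earlier.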

## Best Practices:

1. **Start with nested models** to understand group differences
2. **Compare with a global model** to see whether a simpler model is enough
3. **Look at coefficients** to understand group patterns
4. **Evaluate per-group metrics** to find problematic groups
5. **Use appropriate base models** for your data patterns

## Common Use Cases:

- **Retail**: Sales forecasting per store/product
- **Finance**: Risk modeling per customer segment
- **Healthcare**: Patient outcomes per hospital/clinic
- **Energy**: Demand forecasting per region
- **Supply Chain**: Inventory optimization per warehouse

## Next Steps:

1. **Try with your own data**: Use your panel dataset
2. **Experiment with different models**: Test various base models
3. **Add more features**: Include exogenous variables
4. **Use with WorkflowSet**: Compare multiple configurations
5. **Cross-validation**: Time series splits for robust evaluation