# Panel Regression: Evaluation, Diagnostics, and Best Practices

## Introduction

This notebook focuses on the practical aspects of panel regression:

1. **Train/test evaluation**: Assessing out-of-sample performance
2. **Residual diagnostics**: Checking model assumptions
3. **Practical recommendations**: Guidelines for real-world applications
4. **Common pitfalls**: What to avoid and how to handle edge cases

### Why Evaluation Matters

Panel models can overfit to group structure if:
- Groups have very few observations
- Too many random effects are estimated
- Data contains outliers or influential observations

Proper evaluation helps identify these issues before deployment.

### Residual Diagnostics for Panel Data

Standard residual checks:
- **Normality**: Q-Q plots, Shapiro-Wilk test
- **Heteroskedasticity**: Residuals vs fitted, Breusch-Pagan test
- **Autocorrelation**: Durbin-Watson, Ljung-Box test
- **Outliers**: Residuals by group, Cook's distance

---

## Learning Objectives

In this notebook, you will:
1. Generate multi-factory production data
2. Perform time-based train/test splits
3. Evaluate panel models on held-out data
4. Conduct comprehensive residual diagnostics
5. Identify outliers and influential groups
6. Learn practical recommendations for panel modeling
7. Understand when to use panel_reg() vs alternatives

## Setup and Imports

In [None]:
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
from scipy import stats
from scipy.stats import shapiro, normaltest
import statsmodels.stats.diagnostic as sm_diag

# py-tidymodels imports
from py_parsnip import panel_reg, linear_reg
from py_workflows import workflow

# Set random seed for reproducibility
np.random.seed(42)

# Plotting style
sns.set_style("whitegrid")
plt.rcParams['figure.figsize'] = (10, 6)
plt.rcParams['font.size'] = 11

## 1. Generate Multi-Factory Production Data

We'll create a dataset with:
- **8 factories**: Each with different baseline production levels
- **104 weeks** of weekly data per factory = 832 total observations
- **Time series structure**: Observations within factories are correlated
- **Variables**:
  - `output`: Production output (outcome)
  - `temperature`: Operating temperature (predictor)
  - `pressure`: Operating pressure (predictor)
  - `humidity`: Environmental humidity (predictor)
  - `week`: Week number (time variable)
  - `factory_id`: Factory identifier (group)

**Data Generating Process**:
- Random intercepts: Factories have different baseline outputs (80-120)
- Temperature effect: +2.0 (positive effect on output)
- Pressure effect: +1.5 (positive effect on output)
- Humidity effect: -0.5 (negative effect on output)
- Autocorrelation: AR(1) structure within each factory
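
The AR(1) term sets how much within-factory noise no model can explain, so it helps to know its long-run spread before judging RMSE. Here is a minimal sketch, assuming the same `rho = 0.6` and innovation standard deviation of 5 used in the generating cell. The helper names `ar1_stationary_sd` and `simulate_ar1` are illustrative and not part of py-tidymodels.

```python
import numpy as np

def ar1_stationary_sd(rho: float, innovation_sd: float) -> float:
    """Long-run SD of e_t = rho * e_{t-1} + eps_t, with eps_t ~ N(0, innovation_sd^2)."""
    if not -1 < rho < 1:
        raise ValueError("AR(1) is stationary only for |rho| < 1")
    return innovation_sd / np.sqrt(1.0 - rho**2)

def simulate_ar1(n, rho, innovation_sd, rng):
    """Simulate n AR(1) errors, mirroring the loop in the generating cell."""
    errors = np.zeros(n)
    errors[0] = rng.normal(0, innovation_sd)
    for t in range(1, n):
        errors[t] = rho * errors[t - 1] + rng.normal(0, innovation_sd)
    return errors

rng = np.random.default_rng(0)
sim = simulate_ar1(50_000, rho=0.6, innovation_sd=5.0, rng=rng)

theoretical_sd = ar1_stationary_sd(0.6, 5.0)          # 5 / sqrt(1 - 0.36)
lag1_corr = np.corrcoef(sim[:-1], sim[1:])[0, 1]      # should be close to rho

print(f"Theoretical SD: {theoretical_sd:.2f}")
print(f"Simulated SD:   {sim.std():.2f}")
print(f"Lag-1 corr:     {lag1_corr:.2f}")
```

With these values the stationary standard deviation is 5 / sqrt(1 - 0.36) = 6.25, so a residual RMSE near that level is roughly the noise floor of this simulation rather than a sign of a poor fit.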

In [None]:
# Parameters
n_factories = 8
n_weeks = 104
n_total = n_factories * n_weeks

# Factory IDs
factory_ids = [f'Factory_{i+1}' for i in range(n_factories)]

# Random intercepts (baseline production levels)
random_intercepts = np.random.uniform(80, 120, n_factories)

# Fixed effects
beta_temperature = 2.0
beta_pressure = 1.5
beta_humidity = -0.5

# Autocorrelation parameter
rho = 0.6  # AR(1) coefficient

# Generate data
data_list = []

for i, factory_id in enumerate(factory_ids):
    # Predictors
    week = np.arange(1, n_weeks + 1)
    temperature = np.random.uniform(60, 80, n_weeks)
    pressure = np.random.uniform(20, 30, n_weeks)
    humidity = np.random.uniform(30, 70, n_weeks)
    
    # Generate AR(1) errors
    errors = np.zeros(n_weeks)
    errors[0] = np.random.normal(0, 5)
    for t in range(1, n_weeks):
        errors[t] = rho * errors[t-1] + np.random.normal(0, 5)
    
    # Output = intercept + predictors*betas + AR(1) errors
    output = (
        random_intercepts[i] + 
        beta_temperature * temperature + 
        beta_pressure * pressure + 
        beta_humidity * humidity + 
        errors
    )
    
    factory_data = pd.DataFrame({
        'factory_id': factory_id,
        'week': week,
        'temperature': temperature,
        'pressure': pressure,
        'humidity': humidity,
        'output': output
    })
    
    data_list.append(factory_data)

# Combine all factories
production_data = pd.concat(data_list, ignore_index=True)

print(f"Dataset shape: {production_data.shape}")
print(f"\nFirst few rows:")
print(production_data.head(10))
print(f"\nSummary statistics:")
print(production_data.describe())

In [None]:
# Visualize factory production over time
fig, ax = plt.subplots(figsize=(14, 6))

for factory_id in factory_ids:
    factory_data = production_data[production_data['factory_id'] == factory_id]
    ax.plot(factory_data['week'], factory_data['output'], alpha=0.7, linewidth=1.5, label=factory_id)

ax.set_xlabel('Week', fontsize=12)
ax.set_ylabel('Production Output', fontsize=12)
ax.set_title('Factory Production Over Time', fontsize=14, weight='bold')
ax.legend(loc='upper left', fontsize=9, ncol=2)
ax.grid(True, alpha=0.3)
plt.tight_layout()
plt.show()

print("\n📊 Each factory has a different baseline production level (random intercepts).")
print("   Production shows autocorrelation over time within each factory.")

## 2. Train/Test Split (Time-Based)

For time series data, we use a **chronological split** to avoid data leakage:
- **Training**: First 80 weeks per factory
- **Test**: Last 24 weeks per factory

This simulates forecasting future production based on historical data.
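
A chronological split is easy to get subtly wrong when time indices are not aligned across groups. The sketch below wraps the split in a helper with a per-group leakage check. `chronological_split` and the toy frame are hypothetical, and only pandas is assumed.

```python
import pandas as pd

def chronological_split(df, time_col, group_col, cutoff):
    """Rows with time <= cutoff go to train; later rows go to test."""
    train = df[df[time_col] <= cutoff].copy()
    test = df[df[time_col] > cutoff].copy()

    # Leakage check: within every group present in both sets,
    # the last training period must precede the first test period.
    last_train = train.groupby(group_col)[time_col].max()
    first_test = test.groupby(group_col)[time_col].min()
    common = last_train.index.intersection(first_test.index)
    assert (last_train[common] < first_test[common]).all(), "train/test overlap in time"
    return train, test

# Toy panel: two factories, 104 weeks each
toy = pd.DataFrame({
    "factory_id": ["A"] * 104 + ["B"] * 104,
    "week": list(range(1, 105)) * 2,
})

train, test = chronological_split(toy, "week", "factory_id", cutoff=80)
print(train.groupby("factory_id").size().to_dict())
print(test.groupby("factory_id").size().to_dict())
```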

In [None]:
# Time-based split
split_week = 80

train_data = production_data[production_data['week'] <= split_week].copy()
test_data = production_data[production_data['week'] > split_week].copy()

print(f"Training data shape: {train_data.shape}")
print(f"Test data shape: {test_data.shape}")
print(f"\nTraining weeks: 1-{split_week}")
print(f"Test weeks: {split_week+1}-{n_weeks}")
print(f"\nFactories in train: {train_data['factory_id'].nunique()}")
print(f"Factories in test: {test_data['factory_id'].nunique()}")

## 3. Fit and Evaluate Panel Model

In [None]:
# Fit panel regression on training data
spec = panel_reg(random_effects="intercept")
wf = workflow().add_formula("output ~ temperature + pressure + humidity").add_model(spec)
fit = wf.fit_global(train_data, group_col='factory_id')

print("✅ Panel regression model fitted on training data!")

In [None]:
# Evaluate on test data
evaluated = fit.evaluate(test_data)

# Extract outputs
outputs, coefficients, stats = evaluated.extract_outputs()

# Compare train vs test metrics
train_stats = stats[stats['split'] == 'train']
test_stats = stats[stats['split'] == 'test']

train_rmse = train_stats[train_stats['metric'] == 'rmse']['value'].values[0]
test_rmse = test_stats[test_stats['metric'] == 'rmse']['value'].values[0]
train_r2 = train_stats[train_stats['metric'] == 'r_squared']['value'].values[0]
test_r2 = test_stats[test_stats['metric'] == 'r_squared']['value'].values[0]

print("\n" + "="*60)
print("TRAIN VS TEST PERFORMANCE")
print("="*60)
print(f"\nTraining RMSE: {train_rmse:.4f}")
print(f"Test RMSE: {test_rmse:.4f}")
print(f"\nTraining R²: {train_r2:.4f}")
print(f"Test R²: {test_r2:.4f}")

# Calculate degradation
rmse_degradation = ((test_rmse - train_rmse) / train_rmse) * 100
r2_degradation = ((train_r2 - test_r2) / train_r2) * 100

print(f"\n📊 Performance Degradation:")
print(f"   RMSE increased by {rmse_degradation:.1f}%")
print(f"   R² decreased by {r2_degradation:.1f}%")

if rmse_degradation < 10:
    print(f"\n✅ Good generalization: Model performs well on held-out data.")
elif rmse_degradation < 20:
    print(f"\n⚠️ Moderate degradation: Some overfitting may be present.")
else:
    print(f"\n❌ Poor generalization: Model is overfitting to training data.")
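
The degradation figures printed above reduce to a few lines of arithmetic. The sketch below computes RMSE, R², and percentage degradation by hand on a tiny made-up example, so the definitions are explicit. The helper names are illustrative, and whether `extract_outputs()` computes test R² in exactly this way is an assumption.

```python
import numpy as np

def rmse(actual, fitted):
    """Root mean squared error."""
    actual, fitted = np.asarray(actual, float), np.asarray(fitted, float)
    return float(np.sqrt(np.mean((actual - fitted) ** 2)))

def r_squared(actual, fitted):
    """1 - SS_res / SS_tot, with SS_tot taken around the mean of `actual`."""
    actual, fitted = np.asarray(actual, float), np.asarray(fitted, float)
    ss_res = np.sum((actual - fitted) ** 2)
    ss_tot = np.sum((actual - actual.mean()) ** 2)
    return float(1.0 - ss_res / ss_tot)

def pct_increase(train_value, test_value):
    """Relative change from train to test, in percent."""
    return (test_value - train_value) / train_value * 100.0

actual = [10.0, 12.0, 14.0, 16.0]
train_fit = [11.0, 11.0, 15.0, 15.0]   # every error is +/-1.0
test_fit = [11.2, 10.8, 15.2, 14.8]    # every error is +/-1.2

train_rmse_toy = rmse(actual, train_fit)
test_rmse_toy = rmse(actual, test_fit)
degradation = pct_increase(train_rmse_toy, test_rmse_toy)

print(f"Train RMSE {train_rmse_toy:.2f}, test RMSE {test_rmse_toy:.2f}")
print(f"RMSE degradation: {degradation:.1f}%")
print(f"Train R² {r_squared(actual, train_fit):.3f}, test R² {r_squared(actual, test_fit):.3f}")
```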

In [None]:
# Visualize train vs test performance by factory
# Copy the slices so columns can be added later without SettingWithCopyWarning
train_outputs = outputs[outputs['split'] == 'train'].copy()
test_outputs = outputs[outputs['split'] == 'test'].copy()

# Calculate per-factory RMSE from squared errors (avoids groupby.apply on the full frame)
def rmse_by_factory(df, name):
    squared_error = (df['actuals'] - df['fitted']) ** 2
    rmse = np.sqrt(squared_error.groupby(df['group']).mean())
    return rmse.rename_axis('factory_id').reset_index(name=name)

train_rmse_by_factory = rmse_by_factory(train_outputs, 'train_rmse')
test_rmse_by_factory = rmse_by_factory(test_outputs, 'test_rmse')

# Merge
rmse_comparison = train_rmse_by_factory.merge(test_rmse_by_factory, on='factory_id')
rmse_comparison['degradation_%'] = ((rmse_comparison['test_rmse'] - rmse_comparison['train_rmse']) / rmse_comparison['train_rmse']) * 100

print("\nPer-Factory Performance:")
print(rmse_comparison.to_string(index=False))

In [None]:
# Visualize train vs test RMSE
fig, ax = plt.subplots(figsize=(10, 6))

x = np.arange(len(rmse_comparison))
width = 0.35

ax.bar(x - width/2, rmse_comparison['train_rmse'], width, label='Train', color='steelblue')
ax.bar(x + width/2, rmse_comparison['test_rmse'], width, label='Test', color='coral')

ax.set_ylabel('RMSE', fontsize=12)
ax.set_xlabel('Factory', fontsize=12)
ax.set_title('Train vs Test RMSE by Factory', fontsize=14, weight='bold')
ax.set_xticks(x)
ax.set_xticklabels(rmse_comparison['factory_id'], rotation=45)
ax.legend(fontsize=11)
ax.grid(True, alpha=0.3, axis='y')

plt.tight_layout()
plt.show()

print("\n📈 Compare the bars factory by factory: similar train and test RMSE suggests good generalization.")

## 4. Comprehensive Residual Diagnostics

Let's check model assumptions using residual plots and statistical tests.

In [None]:
# Extract residuals and fitted values
train_residuals = train_outputs['residuals'].values
train_fitted = train_outputs['fitted'].values
train_actuals = train_outputs['actuals'].values

print("\n" + "="*60)
print("RESIDUAL DIAGNOSTICS")
print("="*60)

# Test 1: Normality (Shapiro-Wilk test)
shapiro_stat, shapiro_p = shapiro(train_residuals)
print(f"\n1. Normality Test (Shapiro-Wilk):")
print(f"   Statistic: {shapiro_stat:.4f}")
print(f"   p-value: {shapiro_p:.4f}")
if shapiro_p > 0.05:
    print(f"   ✅ No evidence against normality (p > 0.05)")
else:
    print(f"   ⚠️ Residuals deviate from normality (p <= 0.05)")

# Test 2: Durbin-Watson (autocorrelation)
dw_stats = stats[stats['metric'] == 'durbin_watson']
if not dw_stats.empty:
    dw_stat = dw_stats['value'].values[0]
    print(f"\n2. Autocorrelation Test (Durbin-Watson):")
    print(f"   Statistic: {dw_stat:.4f}")
    print(f"   Interpretation: 2 = no autocorrelation, 0 = positive, 4 = negative")
    if 1.5 < dw_stat < 2.5:
        print(f"   ✅ No significant autocorrelation")
    elif dw_stat <= 1.5:
        print(f"   ⚠️ Positive autocorrelation detected (common in time series)")
    else:
        print(f"   ⚠️ Negative autocorrelation detected")

# Test 3: Ljung-Box (autocorrelation in residuals)
ljung_box_stats = stats[stats['metric'] == 'ljung_box_p']
if not ljung_box_stats.empty:
    ljung_box_p = ljung_box_stats['value'].values[0]
    print(f"\n3. Ljung-Box Test (autocorrelation):")
    print(f"   p-value: {ljung_box_p:.4f}")
    if ljung_box_p > 0.05:
        print(f"   ✅ No significant autocorrelation in residuals (p > 0.05)")
    else:
        print(f"   ⚠️ Autocorrelation present in residuals (p < 0.05)")
        print(f"   Consider: (1) Adding lagged predictors, (2) AR error structure")

# Test 4: Breusch-Pagan (heteroskedasticity)
bp_stats = stats[stats['metric'] == 'breusch_pagan_p']
if not bp_stats.empty:
    bp_p = bp_stats['value'].values[0]
    print(f"\n4. Breusch-Pagan Test (heteroskedasticity):")
    print(f"   p-value: {bp_p:.4f}")
    if bp_p > 0.05:
        print(f"   ✅ Homoskedastic residuals (p > 0.05)")
    else:
        print(f"   ⚠️ Heteroskedasticity present (p < 0.05)")
        print(f"   Consider: (1) Log transformation, (2) Robust standard errors")

In [None]:
# Create comprehensive residual plots
fig, axes = plt.subplots(2, 3, figsize=(16, 10))

# Plot 1: Residuals vs Fitted
axes[0, 0].scatter(train_fitted, train_residuals, alpha=0.5, s=20)
axes[0, 0].axhline(y=0, color='red', linestyle='--', linewidth=2)
axes[0, 0].set_xlabel('Fitted Values')
axes[0, 0].set_ylabel('Residuals')
axes[0, 0].set_title('Residuals vs Fitted')
axes[0, 0].grid(True, alpha=0.3)

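# Plot 2: Q-Q Plot
# `stats` was rebound to the metrics DataFrame by extract_outputs(), so import probplot directly
from scipy.stats import probplot
probplot(train_residuals, dist="norm", plot=axes[0, 1])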
axes[0, 1].set_title('Normal Q-Q Plot')
axes[0, 1].grid(True, alpha=0.3)

# Plot 3: Histogram of residuals
axes[0, 2].hist(train_residuals, bins=40, edgecolor='black', alpha=0.7)
axes[0, 2].set_xlabel('Residuals')
axes[0, 2].set_ylabel('Frequency')
axes[0, 2].set_title('Distribution of Residuals')
axes[0, 2].axvline(x=0, color='red', linestyle='--', linewidth=2)
axes[0, 2].grid(True, alpha=0.3)

# Plot 4: Scale-Location (sqrt(|residuals|) vs fitted)
sqrt_abs_resid = np.sqrt(np.abs(train_residuals))
axes[1, 0].scatter(train_fitted, sqrt_abs_resid, alpha=0.5, s=20)
axes[1, 0].set_xlabel('Fitted Values')
axes[1, 0].set_ylabel('√|Residuals|')
axes[1, 0].set_title('Scale-Location Plot')
axes[1, 0].grid(True, alpha=0.3)

# Add lowess smoothing
from statsmodels.nonparametric.smoothers_lowess import lowess
lowess_result = lowess(sqrt_abs_resid, train_fitted, frac=0.3)
axes[1, 0].plot(lowess_result[:, 0], lowess_result[:, 1], color='red', linewidth=2)

# Plot 5: Residuals by factory (boxplot)
train_outputs.boxplot(column='residuals', by='group', ax=axes[1, 1])
axes[1, 1].set_xlabel('Factory ID')
axes[1, 1].set_ylabel('Residuals')
axes[1, 1].set_title('Residuals by Factory')
axes[1, 1].set_xticklabels(axes[1, 1].get_xticklabels(), rotation=45)
plt.suptitle('')  # Remove auto title

# Plot 6: ACF plot (autocorrelation function)
from statsmodels.graphics.tsaplots import plot_acf
plot_acf(train_residuals, lags=20, ax=axes[1, 2])
axes[1, 2].set_title('Autocorrelation Function')
axes[1, 2].grid(True, alpha=0.3)

plt.tight_layout()
plt.show()

print("\n📊 Residual Diagnostic Interpretation:")
print("   - Residuals vs Fitted: Should show no clear pattern (random scatter)")
print("   - Q-Q Plot: Points should follow diagonal line (normal distribution)")
print("   - Histogram: Should be roughly bell-shaped and centered at zero")
print("   - Scale-Location: Red line should be roughly horizontal (homoskedasticity)")
print("   - By Factory: Similar distributions across factories (no outlier factories)")
print("   - ACF: Bars should stay within blue confidence bands (no autocorrelation)")

## 5. Identify Outliers and Influential Groups

In [None]:
# Calculate standardized residuals
train_outputs['std_residuals'] = (train_outputs['residuals'] - train_outputs['residuals'].mean()) / train_outputs['residuals'].std()

# Identify outlier observations (|std_resid| > 2.5)
outliers = train_outputs[np.abs(train_outputs['std_residuals']) > 2.5]

print("\n" + "="*60)
print("OUTLIER DETECTION")
print("="*60)
print(f"\nTotal observations: {len(train_outputs)}")
print(f"Outliers (|std_resid| > 2.5): {len(outliers)}")
print(f"Outlier percentage: {(len(outliers) / len(train_outputs)) * 100:.2f}%")

if len(outliers) > 0:
    print(f"\nOutliers by factory:")
    outlier_counts = outliers.groupby('group').size().reset_index(name='n_outliers')
    outlier_counts = outlier_counts.sort_values('n_outliers', ascending=False)
    print(outlier_counts.to_string(index=False))
    
    if outlier_counts['n_outliers'].max() > 5:
        worst_factory = outlier_counts.iloc[0]['group']
        print(f"\n⚠️ Factory '{worst_factory}' has unusually many outliers.")
        print(f"   Consider investigating this factory for data quality issues.")
else:
    print(f"\n✅ No significant outliers detected.")
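
Before reading too much into the outlier count, it helps to know how many observations would exceed the threshold by chance alone. Here is a small sketch using only the standard library, assuming normally distributed residuals and the 8 × 80 = 640 training rows of this notebook. `two_sided_normal_tail` is an illustrative helper name.

```python
import math

def two_sided_normal_tail(threshold: float) -> float:
    """P(|Z| > threshold) for a standard normal Z, via the complementary error function."""
    return math.erfc(threshold / math.sqrt(2.0))

share = two_sided_normal_tail(2.5)
n_train = 8 * 80                      # factories x training weeks
expected_outliers = share * n_train

print(f"Expected share beyond |z| > 2.5: {share:.4f}")
print(f"Expected count in {n_train} observations: {expected_outliers:.1f}")
```

Roughly 1.2% of observations, or about 8 of 640, are expected beyond the threshold under normality, so a handful of flagged points is not in itself evidence of a data problem.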

In [None]:
# Visualize outliers
fig, axes = plt.subplots(1, 2, figsize=(14, 5))

# Plot 1: Standardized residuals over time
for factory_id in factory_ids:
    factory_train = train_outputs[train_outputs['group'] == factory_id]
    axes[0].plot(range(len(factory_train)), factory_train['std_residuals'].values, alpha=0.6, linewidth=1)

axes[0].axhline(y=2.5, color='red', linestyle='--', linewidth=2, label='Outlier Threshold')
axes[0].axhline(y=-2.5, color='red', linestyle='--', linewidth=2)
axes[0].axhline(y=0, color='black', linestyle='-', linewidth=1, alpha=0.5)
axes[0].set_xlabel('Observation Index (within factory)', fontsize=12)
axes[0].set_ylabel('Standardized Residuals', fontsize=12)
axes[0].set_title('Standardized Residuals Over Time', fontsize=13, weight='bold')
axes[0].legend(fontsize=10)
axes[0].grid(True, alpha=0.3)

# Plot 2: Distribution of standardized residuals by factory
train_outputs.boxplot(column='std_residuals', by='group', ax=axes[1])
axes[1].axhline(y=2.5, color='red', linestyle='--', linewidth=2)
axes[1].axhline(y=-2.5, color='red', linestyle='--', linewidth=2)
axes[1].set_xlabel('Factory ID', fontsize=12)
axes[1].set_ylabel('Standardized Residuals', fontsize=12)
axes[1].set_title('Standardized Residuals by Factory', fontsize=13, weight='bold')
axes[1].set_xticklabels(axes[1].get_xticklabels(), rotation=45)
plt.suptitle('')  # Remove auto title

plt.tight_layout()
plt.show()

## 6. Practical Recommendations

Based on our analysis and industry best practices, here are key recommendations for using panel regression.

### 6.1 Minimum Data Requirements

In [None]:
# Check group sizes
group_sizes = train_data.groupby('factory_id').size().reset_index(name='n_obs')
group_sizes = group_sizes.sort_values('n_obs')

print("\n" + "="*60)
print("MINIMUM DATA REQUIREMENTS")
print("="*60)
print("\nObservations per group:")
print(group_sizes.to_string(index=False))

min_obs = group_sizes['n_obs'].min()
max_obs = group_sizes['n_obs'].max()
mean_obs = group_sizes['n_obs'].mean()

print(f"\nMin observations per group: {min_obs}")
print(f"Max observations per group: {max_obs}")
print(f"Mean observations per group: {mean_obs:.1f}")

print("\n💡 GUIDELINES:")
print("   Minimum for random intercepts: 5-10 observations per group")
print("   Minimum for random slopes: 20+ observations per group")
print("   Number of groups: At least 5-10 groups for reliable variance estimation")

if min_obs >= 20:
    print("\n✅ Your data meets requirements for random slopes models.")
elif min_obs >= 5:
    print("\n✅ Your data meets requirements for random intercepts models.")
    print("   ⚠️ Insufficient data for random slopes (need 20+ per group).")
else:
    print("\n❌ Insufficient data per group. Consider:")
    print("   1. Removing small groups")
    print("   2. Using linear_reg() with fixed effects")
    print("   3. Collecting more data")

### 6.2 When to Use Random Slopes vs Intercepts Only

In [None]:
print("\n" + "="*60)
print("RANDOM SLOPES DECISION GUIDE")
print("="*60)

print("\n✅ USE RANDOM SLOPES WHEN:")
print("   1. Sufficient data per group (20+ observations)")
print("   2. Visual inspection shows varying slopes across groups")
print("   3. Theory suggests heterogeneity in effects")
print("   4. AIC/BIC improve by 10+ points")
print("   5. Variance of random slopes is substantial")

print("\n❌ USE RANDOM INTERCEPTS ONLY WHEN:")
print("   1. Limited data per group (5-20 observations)")
print("   2. Visual inspection shows parallel slopes")
print("   3. Slopes model fails to converge")
print("   4. AIC/BIC do not improve or worsen")
print("   5. Primary interest is in group-level differences")

print("\n💡 PRACTICAL TIP:")
print("   Always start with random intercepts.")
print("   Add random slopes only if there's clear evidence they improve the model.")
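
The "AIC improves by 10+ points" rule can be illustrated without a mixed-model fit. The sketch below swaps in a plain least-squares analogue: group dummies with one common slope versus group dummies with group-specific slopes, compared by Gaussian AIC. This is not what `panel_reg()` does internally. It shows only the comparison logic, on simulated data where slopes really do differ. All names are illustrative.

```python
import numpy as np

def gaussian_aic(y, X):
    """AIC of an OLS fit up to an additive constant: n*log(RSS/n) + 2*(k + 1)."""
    n, k = X.shape
    beta, *_ = np.linalg.lstsq(X, y, rcond=None)
    rss = float(np.sum((y - X @ beta) ** 2))
    return n * np.log(rss / n) + 2 * (k + 1)   # +1 counts the error variance

rng = np.random.default_rng(1)
n_groups, n_per = 6, 40
group = np.repeat(np.arange(n_groups), n_per)
x = rng.uniform(0, 10, n_groups * n_per)
slopes = np.linspace(1.0, 3.0, n_groups)       # slopes genuinely vary by group
y = 5.0 * group + slopes[group] * x + rng.normal(0, 1.0, x.size)

dummies = (group[:, None] == np.arange(n_groups)).astype(float)
X_common = np.column_stack([dummies, x])                       # group intercepts, one slope
X_varying = np.column_stack([dummies, dummies * x[:, None]])   # group intercepts and slopes

delta_aic = gaussian_aic(y, X_common) - gaussian_aic(y, X_varying)
print(f"AIC improvement from group-specific slopes: {delta_aic:.1f}")
```

Both models are fitted to the same rows, so the dropped constant cancels in the difference. A positive `delta_aic` well above 10 favors the varying-slopes specification.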

### 6.3 Handling Unbalanced Panels

In [None]:
print("\n" + "="*60)
print("UNBALANCED PANELS")
print("="*60)

print("\nUnbalanced panels occur when groups have different numbers of observations.")
print("\n✅ PANEL_REG HANDLES UNBALANCED DATA:")
print("   - MixedLM (statsmodels) naturally handles unbalanced panels")
print("   - No need for imputation or dropping groups")
print("   - Groups with more data get more weight in estimation")

print("\n⚠️ CONSIDERATIONS:")
print("   1. Groups with very few observations (<5) may have unreliable random effects")
print("   2. Extreme imbalance (e.g., 5 obs vs 500 obs) can affect convergence")
print("   3. Consider removing groups with <5 observations")

# Check balance
balance_ratio = max_obs / min_obs if min_obs > 0 else np.inf
print(f"\nYour data balance ratio: {balance_ratio:.2f}")
if balance_ratio < 3:
    print("✅ Well-balanced panel (max/min < 3)")
elif balance_ratio < 10:
    print("⚠️ Moderately unbalanced (max/min < 10)")
else:
    print("❌ Severely unbalanced (max/min ≥ 10) - consider filtering small groups")
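
Filtering out groups below a minimum size takes one `groupby` in pandas. Here is a minimal sketch on a made-up unbalanced panel. `drop_small_groups` is a hypothetical helper, and the threshold of 5 follows the guideline above.

```python
import pandas as pd

def drop_small_groups(df, group_col, min_obs=5):
    """Keep only rows whose group has at least `min_obs` observations."""
    sizes = df.groupby(group_col)[group_col].transform("size")
    return df[sizes >= min_obs].copy()

# Toy unbalanced panel: A has 12 rows, B has 3, C has 40
toy = pd.DataFrame({
    "factory_id": ["A"] * 12 + ["B"] * 3 + ["C"] * 40,
    "output": range(55),
})

kept = drop_small_groups(toy, "factory_id", min_obs=5)
sizes = kept.groupby("factory_id").size()
balance_ratio_toy = sizes.max() / sizes.min()

print(sizes.to_dict())
print(f"Balance ratio after filtering: {balance_ratio_toy:.2f}")
```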

### 6.4 Convergence Issues and Solutions

In [None]:
print("\n" + "="*60)
print("CONVERGENCE TROUBLESHOOTING")
print("="*60)

print("\n🔧 IF MODEL FAILS TO CONVERGE:")
print("\n1. Data Issues:")
print("   - Scale predictors using step_normalize()")
print("   - Remove highly correlated predictors (VIF > 10)")
print("   - Check for extreme outliers")

print("\n2. Model Specification:")
print("   - Simplify: Use random intercepts only (not slopes)")
print("   - Remove interaction terms")
print("   - Reduce number of random effects")

print("\n3. Data Structure:")
print("   - Remove groups with <5 observations")
print("   - Check for singleton groups")
print("   - Ensure sufficient between-group variation")

print("\n4. Estimation:")
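print("   - MixedLM fits with quasi-Newton optimizers (BFGS first, then L-BFGS and CG as fallbacks in recent statsmodels)")
print("   - Convergence warnings are common and often harmless, but verify rather than ignore them")
print("   - Check that estimates and standard errors are plausible despite the warning")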

print("\n💡 PREVENTION:")
print("   - Always normalize/scale predictors")
print("   - Start simple (intercepts only)")
print("   - Add complexity incrementally")
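
Scaling should use constants learned from the training rows only. Otherwise the test period leaks into preprocessing. The sketch below does this in plain pandas. It is a stand-in for illustration, not the implementation of `step_normalize()`, and `fit_scaler` / `apply_scaler` are hypothetical names.

```python
import pandas as pd

def fit_scaler(train, cols):
    """Learn centering and scaling constants from the training rows only."""
    return {col: (train[col].mean(), train[col].std(ddof=0)) for col in cols}

def apply_scaler(df, params):
    """Apply previously learned constants to any frame (train, test, or new data)."""
    out = df.copy()
    for col, (mean, std) in params.items():
        out[col] = (df[col] - mean) / std
    return out

train = pd.DataFrame({"temperature": [60.0, 70.0, 80.0], "pressure": [20.0, 25.0, 30.0]})
test = pd.DataFrame({"temperature": [90.0], "pressure": [25.0]})

params = fit_scaler(train, ["temperature", "pressure"])
train_scaled = apply_scaler(train, params)
test_scaled = apply_scaler(test, params)

print(train_scaled.round(3))
print(test_scaled.round(3))
```

The test row is scaled with the training mean and standard deviation, so its values need not have mean zero. That is expected.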

### 6.5 Missing Data Strategies

In [None]:
print("\n" + "="*60)
print("MISSING DATA STRATEGIES")
print("="*60)

print("\n1. MISSING PREDICTORS:")
print("   - Use recipe steps for imputation:")
print("     • step_impute_median() for numeric")
print("     • step_impute_mode() for categorical")
print("     • step_impute_knn() for complex patterns")

print("\n2. MISSING OUTCOMES:")
print("   - MixedLM uses listwise deletion (drops rows with missing outcome)")
print("   - This is appropriate for panel data")
print("   - Groups with all missing outcomes are automatically excluded")

print("\n3. MISSING GROUPS (NEW GROUPS IN TEST):")
print("   - Panel regression handles this automatically")
print("   - New groups get population average prediction (fixed effects only)")
print("   - No group-specific adjustment without training data")

print("\n💡 BEST PRACTICE:")
print("   Handle missing data BEFORE modeling:")
print("   - Use recipe imputation steps")
print("   - Document imputation strategy")
print("   - Check sensitivity to imputation method")
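
For panel data, one reasonable imputation variant is the per-group training median, with the overall training median as a fallback for groups not seen in training. The sketch below is plain pandas, not the recipe step `step_impute_median()`. The per-group choice and the helper names are assumptions made for illustration.

```python
import numpy as np
import pandas as pd

def fit_medians(train, group_col, col):
    """Per-group medians plus an overall fallback, learned from training rows only."""
    return train.groupby(group_col)[col].median(), train[col].median()

def impute_median(df, group_col, col, group_medians, overall_median):
    """Fill missing values with the group's training median, else the overall median."""
    out = df.copy()
    by_group = out[group_col].map(group_medians)   # NaN for groups unseen in training
    out[col] = out[col].fillna(by_group).fillna(overall_median)
    return out

train = pd.DataFrame({
    "factory_id": ["A", "A", "A", "A", "B", "B", "B"],
    "humidity": [40.0, 50.0, np.nan, 60.0, 30.0, np.nan, 34.0],
})
test = pd.DataFrame({
    "factory_id": ["A", "B", "C"],          # C never appears in training
    "humidity": [np.nan, np.nan, np.nan],
})

group_medians, overall_median = fit_medians(train, "factory_id", "humidity")
test_imputed = impute_median(test, "factory_id", "humidity", group_medians, overall_median)
print(test_imputed)
```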

## 7. Future: Cross-Validation for Panel Data (Conceptual)

Cross-validation for panel data is more complex than standard CV because we must respect:
1. **Time ordering**: Don't train on future to predict past
2. **Group structure**: Decide whether to evaluate on seen vs unseen groups

Here's a conceptual outline (implementation in future release):

In [None]:
print("\n" + "="*60)
print("CROSS-VALIDATION STRATEGIES (FUTURE FEATURE)")
print("="*60)

print("\n1. TIME SERIES CV (WITHIN GROUPS):")
print("   - Create per-group rolling/expanding windows")
print("   - Evaluate on future time periods within each group")
print("   - Tests: Can we forecast future for existing groups?")
print("\n   Example:")
print("   for group in groups:")
print("       cv_folds = time_series_cv(group_data, initial='60 weeks', assess='10 weeks')")
print("       metrics = evaluate_on_folds(cv_folds)")

print("\n2. GROUP-BASED CV (LEAVE-ONE-GROUP-OUT):")
print("   - Train on K-1 groups, test on 1 held-out group")
print("   - Repeat for each group")
print("   - Tests: Can we predict new groups?")
print("\n   Example:")
print("   for test_group in groups:")
print("       train = data[data.group != test_group]")
print("       test = data[data.group == test_group]")
print("       fit_and_evaluate(train, test)")

print("\n3. BLOCKED CV (COMBINATIONS):")
print("   - Combine time-based and group-based splits")
print("   - More complex but more realistic")
print("   - Tests both temporal and group generalization")

print("\n💡 CURRENT WORKAROUND:")
print("   Use manual train/test splits as shown in this notebook.")
print("   Evaluate on chronologically later data for time series.")
print("   Evaluate on held-out groups for new group prediction.")
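
Until a built-in resampling API exists, the first two strategies can be sketched by hand. The code below generates expanding-window folds over the time index and leave-one-group-out folds over the group column. `expanding_window_folds` is a hypothetical helper that takes `initial` and `assess` as counts of periods rather than strings like `'60 weeks'`. It is not the `time_series_cv` function mentioned in the outline.

```python
import numpy as np
import pandas as pd

def expanding_window_folds(times, initial, assess):
    """Yield (train_times, test_times); the training window grows by `assess` each fold."""
    unique_times = np.sort(np.unique(times))
    start = initial
    while start + assess <= len(unique_times):   # skip an incomplete final fold
        yield unique_times[:start], unique_times[start:start + assess]
        start += assess

# Toy panel: three factories, 104 weeks each
panel = pd.DataFrame({
    "factory_id": np.repeat(["A", "B", "C"], 104),
    "week": np.tile(np.arange(1, 105), 3),
})

# 1. Time series CV: every factory shares the same week cutoffs
folds = []
for train_weeks, test_weeks in expanding_window_folds(panel["week"], initial=60, assess=10):
    train = panel[panel["week"].isin(train_weeks)]
    test = panel[panel["week"].isin(test_weeks)]
    folds.append((len(train), len(test), int(test["week"].min()), int(test["week"].max())))

# 2. Leave-one-group-out: hold out each factory in turn
logo = []
for held_out in pd.unique(panel["factory_id"]):
    is_test = panel["factory_id"] == held_out
    logo.append((held_out, int((~is_test).sum()), int(is_test.sum())))

print(folds)
print(logo)
```

Each fold's `train` and `test` frames would then be passed to the fit and evaluate steps used earlier in this notebook.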

## Summary and Key Takeaways

### What We Learned

1. **Train/Test Evaluation**:
   - Use time-based splits for time series data
   - Monitor train vs test performance to detect overfitting
   - Expect some degradation (roughly 5-15% is typical; the evaluation code above flags an RMSE increase of 10% or more as a warning sign)

2. **Residual Diagnostics**:
   - Check normality (Q-Q plot, Shapiro-Wilk)
   - Check homoskedasticity (scale-location plot, Breusch-Pagan)
   - Check autocorrelation (ACF plot, Durbin-Watson, Ljung-Box)
   - Inspect per-group residuals for outlier factories

3. **Outlier Detection**:
   - Observations with |standardized residual| > 2.5 are potential outliers
   - Some outliers are expected (≈1.2% if residuals are normal)
   - Investigate groups with excessive outliers

4. **Data Requirements**:
   - Minimum 5-10 observations per group for random intercepts
   - Minimum 20+ observations per group for random slopes
   - At least 5-10 groups for reliable variance estimation
   - Unbalanced panels are OK (MixedLM handles them)

5. **Model Selection Guidelines**:
   - Start simple (random intercepts only)
   - Add random slopes only if justified
   - Use AIC/BIC for comparison (improvement > 10 points is meaningful)
   - Prioritize interpretability over complexity

### Common Pitfalls to Avoid

❌ **Don't**:
- Use panel_reg() with <5 observations per group
- Add random slopes without sufficient data (20+ per group)
- Ignore convergence warnings (check results still make sense)
- Use unscaled predictors (always normalize)
- Forget to check residual diagnostics

✅ **Do**:
- Normalize/scale predictors with step_normalize()
- Check ICC to justify panel model (ICC > 0.1)
- Inspect per-group fits for outliers
- Start simple and add complexity incrementally
- Validate on held-out data
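
The ICC check is recommended here but not computed anywhere in this notebook. One simple estimate is the one-way ANOVA form, which corresponds to the ICC of an intercept-only model. The sketch below assumes a balanced panel. A variance-component estimate from a fitted mixed model is the usual alternative, and `icc_oneway` is an illustrative name.

```python
import pandas as pd

def icc_oneway(df, group_col, value_col):
    """ICC(1) from one-way ANOVA mean squares; assumes equal group sizes."""
    grouped = df.groupby(group_col)[value_col]
    sizes = grouped.size()
    if sizes.nunique() != 1:
        raise ValueError("this sketch assumes a balanced panel")
    k = int(sizes.iloc[0])                 # observations per group
    n_groups = len(sizes)
    grand_mean = df[value_col].mean()
    ms_between = k * ((grouped.mean() - grand_mean) ** 2).sum() / (n_groups - 1)
    ms_within = grouped.var(ddof=1).mean() # pooled within-group variance (balanced case)
    return float((ms_between - ms_within) / (ms_between + (k - 1) * ms_within))

toy = pd.DataFrame({
    "factory_id": ["A"] * 3 + ["B"] * 3 + ["C"] * 3,
    "output": [1.0, 2.0, 3.0, 4.0, 5.0, 6.0, 7.0, 8.0, 9.0],
})

icc = icc_oneway(toy, "factory_id", "output")
print(f"ICC: {icc:.3f}")
```

In the toy data the between-group mean square is 27 and the within-group mean square is 1, giving (27 - 1) / (27 + 2 × 1) = 26/29 ≈ 0.897, well above the 0.1 guideline.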

### Decision Tree: panel_reg() vs Alternatives

```
Do you have grouped/panel data?
├─ No → Use linear_reg() or other standard models
└─ Yes
   ├─ ICC < 0.1 (low group effect)
   │  └─ Use linear_reg() with group as dummy variable
   └─ ICC > 0.1 (moderate to high group effect)
      ├─ < 5 observations per group
      │  └─ Use linear_reg() with fixed effects (dummies)
      └─ ≥ 5 observations per group
         ├─ 5-20 observations per group
         │  └─ Use panel_reg() with random intercepts only
         └─ > 20 observations per group
            ├─ Theory/visuals suggest parallel slopes
            │  └─ Use panel_reg() with random intercepts only
            └─ Theory/visuals suggest varying slopes
               └─ Use panel_reg() with random intercepts + slopes
```
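
The tree above can be written as a small function, which makes the thresholds explicit and easy to test. This is a sketch: `recommend_model` is a hypothetical helper, and it returns descriptive strings rather than model objects. The tree does not say what to do at exactly ICC = 0.1, so the code assumes that value belongs to the "moderate to high" branch.

```python
def recommend_model(is_panel, icc=None, min_obs_per_group=None, slopes_vary=False):
    """Walk the decision tree and return the suggested model as a string."""
    if not is_panel:
        return "linear_reg() or other standard models"
    if icc < 0.1:                                   # low group effect
        return "linear_reg() with group as dummy variable"
    if min_obs_per_group < 5:                       # too little data per group
        return "linear_reg() with fixed effects (dummies)"
    if min_obs_per_group <= 20:                     # 5-20 observations per group
        return "panel_reg() with random intercepts only"
    if slopes_vary:                                 # > 20 observations, varying slopes
        return "panel_reg() with random intercepts + slopes"
    return "panel_reg() with random intercepts only"

# This notebook's setting: 80 training weeks per factory, slopes shared by construction
print(recommend_model(True, icc=0.4, min_obs_per_group=80, slopes_vary=False))
print(recommend_model(True, icc=0.4, min_obs_per_group=80, slopes_vary=True))
print(recommend_model(True, icc=0.05, min_obs_per_group=80))
```

The `icc=0.4` value in the usage lines is a placeholder for illustration, not a result computed from the production data.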

### Next Steps for Real-World Applications

1. **Data Preparation**:
   - Clean and preprocess data
   - Handle missing values
   - Create recipe with normalization

2. **Model Development**:
   - Fit random intercepts model first
   - Check ICC and residual diagnostics
   - Consider random slopes if justified

3. **Model Validation**:
   - Evaluate on held-out data
   - Check per-group performance
   - Identify outliers and influential groups

4. **Deployment**:
   - Document model assumptions
   - Monitor performance over time
   - Retrain periodically with new data

### Resources for Further Learning

- **Books**:
  - Gelman & Hill (2006): Data Analysis Using Regression and Multilevel/Hierarchical Models
  - Snijders & Bosker (2011): Multilevel Analysis

- **Online**:
  - statsmodels MixedLM documentation
  - Panel data analysis tutorials
  - Mixed effects model interpretation guides