# Example 33: Recursive Multi-Step Forecasting

**Feature**: `recursive_reg()` for multi-step ahead forecasting with lagged features

## Overview

This notebook demonstrates **recursive forecasting** with `recursive_reg()`, which lets a one-step machine learning model produce multi-step-ahead forecasts by:

1. **Creating lagged features** from the target variable
2. **Training an ML model** on those lags plus any exogenous variables
3. **Predicting recursively**: each prediction is fed back in as a lag input for the next step

### How Recursive Forecasting Works

```
k lags; ŷ = prediction, X = exogenous inputs for the date being forecast
Step 1: ŷ_t+1 = f(y_t, y_t-1, ..., y_t-k+1, X_t+1)
Step 2: ŷ_t+2 = f(ŷ_t+1, y_t, ..., y_t-k+2, X_t+2)
Step 3: ŷ_t+3 = f(ŷ_t+2, ŷ_t+1, ..., y_t-k+3, X_t+3)
...
```

Each prediction becomes an input for the next step.
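
The recursion can be sketched in a few lines of plain Python. This is an illustration only, not the `recursive_reg()` implementation: `recursive_forecast`, `predict_one`, and `toy_model` are hypothetical names, and the toy one-step model stands in for any fitted regressor.

```python
def recursive_forecast(history, predict_one, lags, exog_future):
    """Forecast len(exog_future) steps, feeding each prediction back in."""
    window = list(history)  # grows by one value per forecast step
    forecasts = []
    for x_next in exog_future:
        # window[-lag] is the value `lag` steps before the one being predicted
        lag_values = [window[-lag] for lag in lags]
        y_hat = predict_one(lag_values, x_next)
        forecasts.append(y_hat)
        window.append(y_hat)  # the prediction becomes a lag for the next step
    return forecasts


def toy_model(lag_values, x):
    """Hypothetical one-step model: mean of the lags plus the exogenous input."""
    return sum(lag_values) / len(lag_values) + x


forecast = recursive_forecast(
    history=[10.0, 12.0, 14.0],
    predict_one=toy_model,
    lags=[1, 2],
    exog_future=[0.0, 1.0, 0.0],
)
print(forecast)  # [13.0, 14.5, 13.75]
```

Note that the second and third forecasts depend on earlier forecasts rather than on observed values, which is why errors can accumulate over the horizon.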

## Key Features

- **Flexible lag specification**: Integer (lags 1-N) or list (specific lags)
- **Exogenous variables**: Include external predictors (temperature, price, etc.)
- **Any ML model**: Random Forest, XGBoost, Linear Regression, etc.
- **Prediction intervals**: Uncertainty quantification via in-sample residuals is planned but not yet implemented for `recursive_reg()` (see Section 5)
- **Differencing support**: Handle non-stationary series
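
The two lag formats can be thought of as shorthand for the same thing: a list of positive offsets. A minimal sketch of that normalization, assuming the semantics described above (`normalize_lags` is a hypothetical helper, not part of py-tidymodels):

```python
def normalize_lags(lags):
    """Return a sorted list of positive lag offsets from an int or an iterable."""
    if isinstance(lags, int):
        if lags < 1:
            raise ValueError("lags must be >= 1")
        return list(range(1, lags + 1))  # integer N means lags 1..N
    cleaned = sorted(set(int(lag) for lag in lags))
    if not cleaned or cleaned[0] < 1:
        raise ValueError("each lag must be a positive integer")
    return cleaned


weekly = normalize_lags(7)
custom = normalize_lags([14, 1, 7])
print(weekly)  # [1, 2, 3, 4, 5, 6, 7]
print(custom)  # [1, 7, 14]
```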

## Dataset

**European Gas Demand with Weather Data** (Germany):
- Daily gas demand from 2016-2024
- Temperature as exogenous variable (strong inverse relationship)
- ~2,600 daily observations
- Highly seasonal pattern (heating demand)

In [None]:
# Setup
import pandas as pd
import numpy as np
from datetime import timedelta

# py-tidymodels imports
from py_parsnip import recursive_reg, linear_reg, rand_forest, boost_tree
from py_rsample import initial_time_split
from py_yardstick import rmse, mae, r_squared
from py_yardstick import metric_set
from py_workflows import Workflow
from py_workflowsets import WorkflowSet

import warnings
warnings.filterwarnings('ignore')

print("✓ Imports complete")

## 1. Load and Prepare Data

In [None]:
# Load European gas demand with weather
df = pd.read_csv('../_md/__data/european_gas_demand_weather_data.csv')
df['date'] = pd.to_datetime(df['date'])

# Filter to Germany only (largest market)
germany = df[df['country'] == 'Germany'].copy()
germany = germany[['date', 'gas_demand', 'temperature']].sort_values('date').reset_index(drop=True)

# Remove any missing values
germany = germany.dropna()

print(f"Germany gas demand data:")
print(f"  Records: {len(germany):,} days")
print(f"  Date range: {germany['date'].min()} to {germany['date'].max()}")
print(f"  Demand mean: {germany['gas_demand'].mean():.0f} GWh/day")
print(f"  Demand std: {germany['gas_demand'].std():.0f} GWh/day")
print(f"  Temperature mean: {germany['temperature'].mean():.1f}°C")
print(f"\nFirst few rows:")
print(germany.head())

In [None]:
# Train/test split (prop=0.90 keeps the first 90% of rows for training and holds out the last 10%)
split = initial_time_split(germany, date_column='date', prop=0.90)
train = split.training()
test = split.testing()

print(f"Train: {len(train)} days ({train['date'].min()} to {train['date'].max()})")
print(f"Test:  {len(test)} days ({test['date'].min()} to {test['date'].max()})")
print(f"\nHolding out {len(test)} days for evaluation")
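
With `prop=0.90`, a chronological split keeps the first 90% of rows for training and holds out the remaining 10%. A minimal sketch, assuming a simple index cut (the hypothetical `time_split` helper below is not `initial_time_split` itself, whose rounding may differ):

```python
def time_split(rows, prop):
    """Split an ordered sequence: first `prop` share for training, the rest for testing."""
    if not 0 < prop < 1:
        raise ValueError("prop must be between 0 and 1")
    cut = int(len(rows) * prop)
    return rows[:cut], rows[cut:]


days = list(range(2600))  # stand-in for roughly 2,600 daily observations
train_rows, test_rows = time_split(days, prop=0.90)
print(len(train_rows), len(test_rows))  # 2340 260
```

For a series of this length, the hold-out is therefore on the order of 260 days. No shuffling takes place, so every test row is later in time than every training row.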

## 2. Basic Recursive Forecasting

Start with a simple example: lags 1-7 with a Random Forest base model.

In [None]:
# Recursive forecasting with Random Forest
# Uses lags 1-7 (past week)
spec_basic = recursive_reg(
    base_model=rand_forest(trees=100).set_mode('regression'),
    lags=7,  # Use lags 1, 2, 3, 4, 5, 6, 7
    differentiation=None  # No differencing
)

# Fit on training data
# Note: 'gas_demand ~ date' uses only lagged demand as predictors (no exogenous variables)
fit_basic = spec_basic.fit(train, 'gas_demand ~ date')

print("Basic Recursive Model (Random Forest):")
print(f"  Base model: Random Forest (100 trees)")
print(f"  Lags: 1-7 days")
print(f"  Exogenous variables: None")
print(f"  Training completed ✓")

In [None]:
# Predict on test period
predictions_basic = fit_basic.predict(test)

# Evaluate
eval_basic = fit_basic.evaluate(test)
outputs, coeffs, stats = eval_basic.extract_outputs()

test_stats_df = stats[stats['split'] == 'test']
test_stats = test_stats_df.set_index('metric')['value']
print("Test Set Performance:")
print(f"  RMSE: {test_stats['rmse']:.2f} GWh/day")
print(f"  MAE: {test_stats['mae']:.2f} GWh/day")
print(f"  R²: {test_stats['r_squared']:.4f}")
print(f"  MAPE: {test_stats['mape']:.2f}%")
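
For reference, the four metrics printed above can be sketched in plain Python. These are the textbook definitions; the library's exact conventions may differ (R² in particular has more than one common definition), and `regression_metrics` is a hypothetical helper.

```python
import math


def regression_metrics(actual, predicted):
    """Compute RMSE, MAE, MAPE (in percent), and traditional R²."""
    n = len(actual)
    errors = [a - p for a, p in zip(actual, predicted)]
    mean_actual = sum(actual) / n
    ss_res = sum(e * e for e in errors)                    # residual sum of squares
    ss_tot = sum((a - mean_actual) ** 2 for a in actual)   # total sum of squares
    return {
        "rmse": math.sqrt(ss_res / n),
        "mae": sum(abs(e) for e in errors) / n,
        "mape": 100 * sum(abs(e / a) for e, a in zip(errors, actual)) / n,
        "r_squared": 1 - ss_res / ss_tot,
    }


metrics = regression_metrics([100.0, 200.0, 300.0], [110.0, 190.0, 300.0])
print(metrics)
```

MAPE divides by the actual values, so it is undefined when any actual value is zero.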

## 3. Custom Lag Selection

Instead of consecutive lags (1-7), use specific lags that capture important patterns.
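
To make the difference concrete, here is a sketch of how a lag feature table can be built for specific lags. The helper name `make_lag_rows` is hypothetical and the data are toy values; the point is that each row holds only the requested offsets, and the first `max(lags)` observations cannot be used as targets.

```python
def make_lag_rows(series, lags):
    """Return (features, targets), one row per time step that has all lags available."""
    max_lag = max(lags)
    features, targets = [], []
    for t in range(max_lag, len(series)):
        features.append([series[t - lag] for lag in lags])
        targets.append(series[t])
    return features, targets


demand = list(range(100, 120))  # 20 toy observations: 100, 101, ..., 119
X, y = make_lag_rows(demand, lags=[1, 7, 14])
print(len(X))        # 6 usable rows (20 observations minus the largest lag, 14)
print(X[0], y[0])    # [113, 107, 100] 114
```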

In [None]:
# Custom lags: yesterday, last week, two weeks ago
spec_custom_lags = recursive_reg(
    base_model=linear_reg(),
    lags=[1, 7, 14],  # Specific lags only
    differentiation=None
)

fit_custom_lags = spec_custom_lags.fit(train, 'gas_demand ~ date')

# Evaluate
eval_custom_lags = fit_custom_lags.evaluate(test)
_, _, stats_custom_lags = eval_custom_lags.extract_outputs()

test_stats_cl_df = stats_custom_lags[stats_custom_lags['split'] == 'test']
test_stats_cl = test_stats_cl_df.set_index('metric')['value']
print("Custom Lags Model (Linear Regression):")
print(f"  Lags used: 1, 7, 14 days")
print(f"  Test RMSE: {test_stats_cl['rmse']:.2f} GWh/day")
print(f"  Test MAE: {test_stats_cl['mae']:.2f} GWh/day")
print(f"  Test R²: {test_stats_cl['r_squared']:.4f}")

## 4. With Exogenous Variables

Include temperature as a predictor. Gas demand is strongly and inversely related to temperature because of heating demand.
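
A quick way to check a relationship like this before adding an exogenous variable is the Pearson correlation. The sketch below uses made-up toy numbers, not the Germany dataset, and a hypothetical `pearson` helper written in plain Python.

```python
import math


def pearson(x, y):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    spread_x = math.sqrt(sum((a - mean_x) ** 2 for a in x))
    spread_y = math.sqrt(sum((b - mean_y) ** 2 for b in y))
    return cov / (spread_x * spread_y)


# Toy values only: colder days paired with higher demand
toy_temperature = [-2.0, 3.0, 8.0, 14.0, 20.0]
toy_demand = [950.0, 820.0, 700.0, 540.0, 410.0]
r = pearson(toy_temperature, toy_demand)
print(round(r, 3))
```

A value close to -1 indicates a strong inverse linear relationship. On the real data, the same check would be run on `germany['temperature']` and `germany['gas_demand']`.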

In [None]:
# Recursive forecasting with temperature as exogenous variable
spec_exog = recursive_reg(
    base_model=rand_forest(trees=100).set_mode('regression'),
    lags=7,
    differentiation=None
)

# Formula includes temperature
fit_exog = spec_exog.fit(train, 'gas_demand ~ date + temperature')

print("Recursive Model with Exogenous Variable:")
print(f"  Base model: Random Forest (100 trees)")
print(f"  Lags: 1-7 days")
print(f"  Exogenous: temperature")
print(f"  Training completed ✓")

In [None]:
# Predict (test set has actual temperature values)
predictions_exog = fit_exog.predict(test)

# Evaluate
eval_exog = fit_exog.evaluate(test)
_, _, stats_exog = eval_exog.extract_outputs()

test_stats_ex_df = stats_exog[stats_exog['split'] == 'test']
test_stats_ex = test_stats_ex_df.set_index('metric')['value']
print("Test Set Performance (with temperature):")
print(f"  RMSE: {test_stats_ex['rmse']:.2f} GWh/day")
print(f"  MAE: {test_stats_ex['mae']:.2f} GWh/day")
print(f"  R²: {test_stats_ex['r_squared']:.4f}")
print(f"\nImprovement vs no exogenous variables:")
print(f"  RMSE: {((test_stats['rmse'] - test_stats_ex['rmse']) / test_stats['rmse'] * 100):.1f}% better")
print(f"  MAE: {((test_stats['mae'] - test_stats_ex['mae']) / test_stats['mae'] * 100):.1f}% better")

## 5. Prediction Intervals

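`recursive_reg()` does not yet support prediction intervals, so this section only records the limitation. The intended approach is to derive uncertainty estimates from in-sample residuals.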
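
As a rough illustration of the residual-based idea, independent of the library, the sketch below shifts each point forecast by empirical quantiles of the in-sample residuals. The helper names and residual values are hypothetical, and the constant-width band is a simplifying assumption.

```python
def empirical_quantile(values, q):
    """Linearly interpolated quantile of `values` for 0 <= q <= 1."""
    ordered = sorted(values)
    pos = q * (len(ordered) - 1)
    lower = int(pos)
    upper = min(lower + 1, len(ordered) - 1)
    frac = pos - lower
    return ordered[lower] * (1 - frac) + ordered[upper] * frac


def residual_interval(point_forecasts, residuals, level=0.95):
    """Return (lower, upper) bounds by adding residual quantiles to each forecast."""
    alpha = 1 - level
    low_shift = empirical_quantile(residuals, alpha / 2)
    high_shift = empirical_quantile(residuals, 1 - alpha / 2)
    return [(f + low_shift, f + high_shift) for f in point_forecasts]


toy_residuals = [-4.0, -2.0, -1.0, 0.0, 1.0, 2.0, 4.0]  # actual minus fitted, toy values
intervals = residual_interval([100.0, 105.0], toy_residuals, level=0.80)
print(intervals)
```

Because one-step residuals are reused at every horizon, a band like this tends to understate uncertainty for later steps of a recursive forecast, where errors compound.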

In [None]:
# Prediction intervals not yet supported for recursive_reg
# Skipping confidence interval demonstration
print("Note: Prediction intervals not yet implemented for recursive forecasting.")
print("Standard predictions available via: fit_exog.predict(test, type='numeric')")

## 6. Compare Different Base Models

Test Linear Regression, Random Forest, and XGBoost as base models.

In [None]:
# Create recursive models with different base models
base_models = [
    ('linear', linear_reg()),
    ('random_forest', rand_forest(trees=100).set_mode('regression')),
    ('xgboost', boost_tree(trees=100).set_engine('xgboost').set_mode('regression'))
]

recursive_models = [
    recursive_reg(base_model=model, lags=7, differentiation=None)
    for name, model in base_models
]

# Create workflows
workflows = []
for spec in recursive_models:
    wf = Workflow().add_formula('gas_demand ~ date + temperature').add_model(spec)
    workflows.append(wf)

wf_set = WorkflowSet.from_workflows(workflows)

print(f"Created {len(workflows)} recursive forecasting workflows")
print(f"Base models: Linear, Random Forest, XGBoost")
print(f"All using: 7 lags + temperature")

In [None]:
# Fit all models and compare
results = []
for wf_id, wf in wf_set.workflows.items():
    try:
        fit = wf.fit(train)
        eval_fit = fit.evaluate(test)
        _, _, stats = eval_fit.extract_outputs()
        
        test_stats_df = stats[stats['split'] == 'test']
        test_stats = test_stats_df.set_index('metric')['value']
        results.append({
            'model': wf_id,
            'rmse': test_stats['rmse'],
            'mae': test_stats['mae'],
            'r_squared': test_stats['r_squared'],
            'mape': test_stats['mape']
        })
    except Exception as e:
        print(f"Warning: {wf_id} failed - {str(e)[:80]}")

comparison = pd.DataFrame(results)
comparison = comparison.sort_values('rmse')

print("\nBase Model Comparison (Recursive Forecasting):")
print("="*80)
print(comparison.to_string(index=False))
print("="*80)
print(f"\nBest base model: {comparison.iloc[0]['model']}")
print(f"  RMSE: {comparison.iloc[0]['rmse']:.2f} GWh/day")
print(f"  R²: {comparison.iloc[0]['r_squared']:.4f}")

## 7. Compare Different Lag Configurations

Test how lag selection affects performance.

In [None]:
# Different lag configurations
lag_configs = [
    ('lags_3', 3),           # Short history (1-3 days)
    ('lags_7', 7),           # One week (1-7 days)
    ('lags_14', 14),         # Two weeks (1-14 days)
    ('lags_custom_1_7', [1, 7]),           # Yesterday + last week
    ('lags_custom_1_7_14', [1, 7, 14]),    # Yesterday + weekly pattern
    ('lags_custom_weekly', [1, 7, 14, 21, 28])  # Yesterday + weekly lags spanning four weeks
]

lag_results = []
for name, lags in lag_configs:
    try:
        spec = recursive_reg(
            base_model=rand_forest(trees=100).set_mode('regression'),
            lags=lags,
            differentiation=None
        )
        
        fit = spec.fit(train, 'gas_demand ~ date + temperature')
        eval_fit = fit.evaluate(test)
        _, _, stats = eval_fit.extract_outputs()
        
        test_stats_df = stats[stats['split'] == 'test']
        test_stats = test_stats_df.set_index('metric')['value']
        
        # Count number of lags
        n_lags = lags if isinstance(lags, int) else len(lags)
        
        lag_results.append({
            'config': name,
            'n_lags': n_lags,
            'rmse': test_stats['rmse'],
            'mae': test_stats['mae'],
            'r_squared': test_stats['r_squared']
        })
    except Exception as e:
        print(f"Warning: {name} failed - {str(e)[:80]}")

lag_comparison = pd.DataFrame(lag_results)
lag_comparison = lag_comparison.sort_values('rmse')

print("\nLag Configuration Comparison:")
print("="*80)
print(lag_comparison.to_string(index=False))
print("="*80)
print(f"\nBest lag config: {lag_comparison.iloc[0]['config']}")
print(f"  Number of lags: {lag_comparison.iloc[0]['n_lags']}")
print(f"  RMSE: {lag_comparison.iloc[0]['rmse']:.2f} GWh/day")

## 8. Differencing for Non-Stationary Series

If the series has a trend, differencing can bring it closer to stationarity.
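
First-order differencing models the day-to-day change instead of the level, and forecasts of the change are converted back to levels by cumulative summation from the last observed value. A minimal sketch with hypothetical helpers (not the library's internals):

```python
def difference(series):
    """First-order differences: series[t] - series[t-1]."""
    return [b - a for a, b in zip(series, series[1:])]


def undifference(last_level, diffs):
    """Rebuild levels by accumulating differences onto the last known level."""
    levels = []
    level = last_level
    for d in diffs:
        level += d
        levels.append(level)
    return levels


series = [100.0, 103.0, 107.0, 112.0]
diffs = difference(series)
restored = undifference(series[0], diffs)
print(diffs)     # [3.0, 4.0, 5.0]
print(restored)  # [103.0, 107.0, 112.0]
```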

In [None]:
# Compare no differencing vs first-order differencing
spec_no_diff = recursive_reg(
    base_model=linear_reg(),
    lags=7,
    differentiation=None
)

spec_diff = recursive_reg(
    base_model=linear_reg(),
    lags=7,
    differentiation=1  # First-order differencing
)

# Fit both
fit_no_diff = spec_no_diff.fit(train, 'gas_demand ~ date + temperature')
fit_diff = spec_diff.fit(train, 'gas_demand ~ date + temperature')

# Evaluate
eval_no_diff = fit_no_diff.evaluate(test)
eval_diff = fit_diff.evaluate(test)

_, _, stats_no_diff = eval_no_diff.extract_outputs()
_, _, stats_diff = eval_diff.extract_outputs()

# Convert to easy access format
test_no_diff = stats_no_diff[stats_no_diff['split']=='test'].set_index('metric')['value']
test_diff = stats_diff[stats_diff['split']=='test'].set_index('metric')['value']

print("Differencing Comparison:")
print(f"\nNo Differencing:")
print(f"  Test RMSE: {test_no_diff['rmse']:.2f} GWh/day")
print(f"  Test MAE: {test_no_diff['mae']:.2f} GWh/day")

print(f"\nWith Differencing (d=1):")
print(f"  Test RMSE: {test_diff['rmse']:.2f} GWh/day")
print(f"  Test MAE: {test_diff['mae']:.2f} GWh/day")

print(f"\nNote: Differencing mainly helps with trending data. This series is dominated by seasonality rather than trend.")

## 9. Key Takeaways

### When to Use Recursive Forecasting

✅ **Good for**:
- Multi-step ahead forecasts (30, 60, 90 days)
- ML models that don't natively handle time series (Random Forest, XGBoost, SVM)
- Combining lagged features with exogenous variables
- When you have rich feature engineering capabilities

❌ **Not ideal for**:
- Very long forecast horizons (1+ years)
- When prediction errors compound significantly
- When direct multi-output models perform better
- Simple autoregressive patterns (use ARIMA instead)
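
Error compounding can be seen in a toy setting. The sketch below assumes a true process that decays by a factor of 0.90 per step and a model that has learned 0.85 instead; both numbers are invented for illustration. Because each forecast is built on the previous one, the relative error grows with the horizon.

```python
def recursive_path(start, coef, horizon):
    """Iterate value -> coef * value for `horizon` steps."""
    path = []
    value = start
    for _ in range(horizon):
        value = coef * value
        path.append(value)
    return path


true_path = recursive_path(100.0, coef=0.90, horizon=10)   # assumed true dynamics
model_path = recursive_path(100.0, coef=0.85, horizon=10)  # slightly mis-estimated model
relative_error = [abs(t - m) / t for t, m in zip(true_path, model_path)]

for step, err in enumerate(relative_error, start=1):
    print(f"h={step:2d}  relative error={err:.1%}")
```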

### Lag Selection Best Practices

1. **Start with domain knowledge**:
   - Daily data with weekly patterns → use lag 7
   - Monthly data with yearly patterns → use lag 12
   - Include lag 1 (yesterday) almost always

2. **Test multiple configurations**:
   - Short history: lags=3 (recent only)
   - Medium: lags=7 (one week)
   - Long: lags=14 or 30 (more history)
   - Custom: lags=[1, 7, 14, 30] (key points)

3. **Balance complexity vs performance**:
   - More lags = more features = longer training
   - Returns often diminish beyond roughly two weeks of lags on daily data (verify on your own series)
   - Custom lags can match or beat consecutive lags while using fewer features
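
One simple, model-free way to shortlist candidate lags is to look at the autocorrelation of the series at each offset. The sketch below uses a synthetic series with a 7-day cycle; `autocorrelation` is a hypothetical helper implementing the standard sample estimator.

```python
import math


def autocorrelation(series, lag):
    """Sample autocorrelation of `series` at a positive integer lag."""
    n = len(series)
    mean = sum(series) / n
    denom = sum((v - mean) ** 2 for v in series)
    num = sum((series[t] - mean) * (series[t - lag] - mean) for t in range(lag, n))
    return num / denom


# Synthetic daily series with a pure 7-day cycle (10 full weeks)
toy_series = [math.sin(2 * math.pi * t / 7) for t in range(70)]
scores = {lag: autocorrelation(toy_series, lag) for lag in (1, 3, 7, 14)}
best_lag = max(scores, key=scores.get)

for lag, score in scores.items():
    print(f"lag {lag:2d}: {score:+.2f}")
print("strongest lag:", best_lag)
```

On this toy series the weekly lag stands out, which matches the domain-knowledge rule of thumb for daily data with weekly patterns. Real series are noisier, so treat the ranking as a starting point for the comparisons in Section 7 rather than a final answer.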

### Base Model Selection

**Linear Regression**:
- Fast training and prediction
- Good baseline
- Interpretable coefficients
- Struggles with non-linear patterns

**Random Forest**:
- Handles non-linear relationships
- Robust to outliers
- Usually works reasonably well with default hyperparameters
- Slower than linear

**XGBoost**:
- Often best performance
- Requires careful tuning
- Fast prediction after training
- Can overfit without regularization

### Exogenous Variables

**Critical requirements**:
1. Must have future values for forecasting
2. Should be leading indicators (not lagging)
3. Strong correlation with target variable

**Examples**:
- ✅ Temperature forecasts for energy demand
- ✅ Scheduled events for retail sales
- ✅ Economic indicators for financial forecasting (when released in advance or reliably forecast)
- ❌ Competitor prices (not available in advance)
- ❌ News sentiment (unpredictable future)

### Production Considerations

```python
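# Production pattern (sketch): all_data, last_date, and get_temperature_forecast
# are placeholders to be supplied by your own pipeline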
spec = recursive_reg(
    base_model=rand_forest(trees=200).set_mode('regression'),
    lags=[1, 7, 14],  # Custom lags for efficiency
    differentiation=None
)

# Train on ALL available data
fit = spec.fit(all_data, 'gas_demand ~ date + temperature')

# Forecast with future exogenous variables
forecast_dates = pd.date_range(last_date + timedelta(days=1), periods=30)
forecast_data = pd.DataFrame({
    'date': forecast_dates,
    'temperature': get_temperature_forecast(forecast_dates)  # From weather API
})

# Get point predictions (prediction intervals are not yet supported for recursive_reg)
predictions = fit.predict(forecast_data, type='numeric')
```

### Common Pitfalls

1. **Too many lags**: Overfitting and slow training
   - Solution: Use custom lag selection

2. **Missing exogenous variables**: Runtime errors during prediction
   - Solution: Ensure forecast data has all required columns

3. **Error compounding**: Long horizons accumulate errors
   - Solution: Retrain frequently and keep horizons short; add prediction intervals once they are supported

4. **Non-stationary series**: Poor performance
   - Solution: Use differentiation parameter or detrend first
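
Pitfall 2 is cheap to guard against with a check that runs before prediction. A minimal sketch, assuming the forecast inputs are available as a list of dictionaries (`check_forecast_inputs` is a hypothetical helper; with a DataFrame the same idea applies to its columns and missing values):

```python
def check_forecast_inputs(forecast_rows, required_columns):
    """Raise ValueError if any row lacks a required column or holds None."""
    problems = []
    for i, row in enumerate(forecast_rows):
        for col in required_columns:
            if row.get(col) is None:
                problems.append(f"row {i}: missing '{col}'")
    if problems:
        raise ValueError("; ".join(problems))


future = [
    {"date": "2024-01-01", "temperature": 2.5},
    {"date": "2024-01-02"},  # temperature forecast not available
]

try:
    check_forecast_inputs(future, ["date", "temperature"])
    message = "ok"
except ValueError as err:
    message = str(err)

print(message)  # row 1: missing 'temperature'
```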

## Summary

This notebook demonstrated:

✅ Basic recursive forecasting with `recursive_reg()`  
✅ Custom lag selection (specific lags vs consecutive)  
✅ Exogenous variables (temperature for gas demand)  
⚠️ Prediction intervals (not yet implemented for `recursive_reg()`)  
✅ Comparison of base models (Linear, Random Forest, XGBoost)  
✅ Lag configuration optimization  
✅ Differencing for non-stationary series  
✅ Production deployment patterns  

**Key Insight**: Recursive forecasting enables any ML model to make multi-step forecasts by using predictions as inputs for future steps. Performance depends heavily on:
1. Appropriate lag selection
2. Quality of exogenous variable forecasts
3. Base model choice

**Next Steps**:
- Example 34: Gradient boosting engines comparison
- Experiment with different base models and lag configurations
- Integrate with production forecasting pipelines