# Recursive Forecasting Demo

This notebook demonstrates **recursive (autoregressive) forecasting** for multi-step time series prediction using `recursive_reg()`.

## What is Recursive Forecasting?

Recursive forecasting uses **lagged values** of the target variable as features to predict future values:
- Trains a regression model on past observations (lags)
- Predicts one step ahead
- Uses that prediction as input for the next prediction
- Repeats recursively for multi-step forecasts
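
The four steps above can be sketched in plain NumPy. This is a minimal illustration, not the implementation behind `recursive_reg()`: the helper names `fit_ar` and `recursive_forecast` are hypothetical, and ordinary least squares stands in for the base model.

```python
import numpy as np

def fit_ar(y, n_lags):
    """Fit a linear autoregression y[t] ~ y[t-1], ..., y[t-n_lags] by least squares."""
    # Each row holds the n_lags values before time t, newest first (lag_1, lag_2, ...)
    rows = [y[t - n_lags:t][::-1] for t in range(n_lags, len(y))]
    X = np.column_stack([np.ones(len(rows)), np.array(rows)])
    coef, *_ = np.linalg.lstsq(X, y[n_lags:], rcond=None)
    return coef  # [intercept, lag_1, ..., lag_n]

def recursive_forecast(y, coef, steps):
    """Predict one step at a time, feeding each prediction back in as a lag."""
    n_lags = len(coef) - 1
    window = list(y[-n_lags:])          # most recent observations, oldest first
    preds = []
    for _ in range(steps):
        lags = window[::-1]             # lag_1 is the newest value
        y_hat = coef[0] + float(np.dot(coef[1:], lags))
        preds.append(y_hat)
        window = window[1:] + [y_hat]   # the prediction becomes the newest "observation"
    return np.array(preds)

# Toy series (illustration only): y[t] = 0.5 * y[t-1] + 10, which converges towards 20
y = [0.0]
for _ in range(29):
    y.append(0.5 * y[-1] + 10)
y = np.array(y)

coef = fit_ar(y, n_lags=1)
forecast = recursive_forecast(y, coef, steps=5)
```

Because every prediction is reused as an input, any error made at step 1 is carried into step 2 and beyond, which is why accuracy tends to degrade as the horizon grows.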

## Key Features:

1. **Wraps any sklearn-compatible model**: Use `linear_reg()`, `rand_forest()`, etc. as base models
2. **Flexible lag specification**: Integer (use lags 1-n) or list (specific lags)
3. **Differentiation support**: Make non-stationary series stationary
4. **Prediction intervals**: Get uncertainty estimates
5. **Three-DataFrame outputs**: Standardized outputs for all models
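
To make the lag specification in item 2 concrete, the sketch below builds a lag feature matrix from either an integer or a list. The helper `make_lag_matrix` is hypothetical and only illustrates the idea; the library constructs its features internally.

```python
import numpy as np

def make_lag_matrix(y, lags):
    """Return (X, target); column j of X holds y shifted back by the j-th lag."""
    # An integer n means lags 1..n; a list means exactly those lags
    lag_list = list(range(1, lags + 1)) if isinstance(lags, int) else sorted(lags)
    max_lag = max(lag_list)
    y = np.asarray(y, dtype=float)
    # The row for time t holds y[t - lag] for each requested lag
    X = np.column_stack([y[max_lag - lag:len(y) - lag] for lag in lag_list])
    target = y[max_lag:]
    return X, target

y = np.arange(10.0)                           # 0, 1, ..., 9
X_int, t_int = make_lag_matrix(y, 3)          # lags 1, 2, 3
X_list, t_list = make_lag_matrix(y, [1, 7])   # only lags 1 and 7
```

The first `max(lags)` observations cannot be used as targets because their lags are not available, so a larger maximum lag leaves fewer training rows.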

## When to Use Recursive Forecasting:

- **Short to medium-term forecasts**: Typically works best for horizons of up to roughly 30-90 daily steps, because errors compound as the horizon grows
- **Autocorrelated data**: When past values predict future values
- **Non-linear patterns**: Random Forest can capture complex relationships
- **Simple implementation**: No need for manual feature engineering

In [1]:
import pandas as pd
import numpy as np
from datetime import timedelta

from py_parsnip import recursive_reg, linear_reg, rand_forest

# Set random seed for reproducibility
np.random.seed(42)

## Generate Synthetic Time Series Data

We'll create daily data with:
- **Trend**: Increasing over time
- **Weekly seasonality**: 7-day pattern
- **Noise**: Random variation

In [2]:
# Generate 150 days of data
n_days = 150
dates = pd.date_range("2023-01-01", periods=n_days, freq="D")

# Create time series components
trend = np.linspace(100, 150, n_days)
seasonality = 20 * np.sin(2 * np.pi * np.arange(n_days) / 7)  # Weekly pattern
noise = np.random.normal(0, 5, n_days)

sales = trend + seasonality + noise

data = pd.DataFrame({"date": dates, "sales": sales})
data = data.set_index("date")

print("Data Overview:")
print(data.head(10))
print(f"\nShape: {data.shape}")
print(f"Date range: {data.index.min()} to {data.index.max()}")

Data Overview:
                 sales
date                  
2023-01-01  102.483571
2023-01-02  115.280879
2023-01-03  123.408142
2023-01-04  117.299535
2023-01-05   91.493840
2023-01-06   81.008609
2023-01-07   94.272857
2023-01-08  106.186167
2023-01-09  115.973821
2023-01-10  125.231493

Shape: (150, 1)
Date range: 2023-01-01 00:00:00 to 2023-05-30 00:00:00


## Train/Test Split

Split chronologically (not randomly) for time series:

In [3]:
# Train on first 120 days, test on last 30 days
train = data.iloc[:120]
test = data.iloc[120:]

print(f"Train: {len(train)} days ({train.index.min()} to {train.index.max()})")
print(f"Test: {len(test)} days ({test.index.min()} to {test.index.max()})")

Train: 120 days (2023-01-01 00:00:00 to 2023-04-30 00:00:00)
Test: 30 days (2023-05-01 00:00:00 to 2023-05-30 00:00:00)


---

# Model 1: Recursive with Linear Regression

Start with a simple linear model using 7 lags (past week).

In [4]:
# Create recursive specification with linear base model
spec_linear = recursive_reg(
    base_model=linear_reg(),
    lags=7  # Use past 7 days as features
)

print(spec_linear)
print(f"\nBase model type: {spec_linear.args['base_model'].model_type}")
print(f"Lags: {spec_linear.args['lags']}")

ModelSpec(model_type='recursive_reg', engine='skforecast', mode='regression', args={'base_model': ModelSpec(model_type='linear_reg', engine='sklearn', mode='regression', args={}), 'lags': 7, 'differentiation': None})

Base model type: linear_reg
Lags: 7


In [5]:
# Fit the model
fit_linear = spec_linear.fit(train, "sales ~ .")
print("Linear recursive model fitted!")

Linear recursive model fitted!


In [6]:
# Predict on test set
pred_linear = fit_linear.predict(test)

print("Predictions:")
print(pred_linear.head(10))

Predictions:
                 .pred
date                  
2023-05-01  164.557752
2023-05-02  163.216553
2023-05-03  148.829369
2023-05-04  131.781997
2023-05-05  119.604799
2023-05-06  131.433865
2023-05-07  148.611666
2023-05-08  165.747579
2023-05-09  166.317394
2023-05-10  150.899573


### Extract Outputs - Three-DataFrame Structure

All models return three DataFrames:
1. **Outputs**: Observation-level data (actuals, fitted, forecast, residuals)
2. **Coefficients**: Model parameters (lag coefficients for linear models)
3. **Stats**: Model performance metrics

In [7]:
# Evaluate on test set first
fit_linear = fit_linear.evaluate(test)

# Extract outputs
outputs_linear, coefs_linear, stats_linear = fit_linear.extract_outputs()

print("1. OUTPUTS DataFrame (first 10 rows):")
print(outputs_linear.head(10))

1. OUTPUTS DataFrame (first 10 rows):
        date     actuals  fitted    forecast  residuals  split          model  \
0 2023-01-01  102.483571     NaN  102.483571        NaN  train  recursive_reg   
1 2023-01-02  115.280879     NaN  115.280879        NaN  train  recursive_reg   
2 2023-01-03  123.408142     NaN  123.408142        NaN  train  recursive_reg   
3 2023-01-04  117.299535     NaN  117.299535        NaN  train  recursive_reg   
4 2023-01-05   91.493840     NaN   91.493840        NaN  train  recursive_reg   
5 2023-01-06   81.008609     NaN   81.008609        NaN  train  recursive_reg   
6 2023-01-07   94.272857     NaN   94.272857        NaN  train  recursive_reg   
7 2023-01-08  106.186167     NaN  106.186167        NaN  train  recursive_reg   
8 2023-01-09  115.973821     NaN  115.973821        NaN  train  recursive_reg   
9 2023-01-10  125.231493     NaN  125.231493        NaN  train  recursive_reg   

  model_group_name   group  
0                   global  
1           

In [8]:
print("\n2. COEFFICIENTS DataFrame:")
print(coefs_linear)

# Lag coefficients show importance of each lag
lag_coefs = coefs_linear[coefs_linear['variable'].str.contains('lag_', na=False)]
print("\nLag Coefficients (importance of each past day):")
print(lag_coefs[['variable', 'coefficient']].sort_values('coefficient', ascending=False))


2. COEFFICIENTS DataFrame:
    variable  coefficient  std_error  t_stat  p_value  ci_0.025  ci_0.975  \
0      lag_1     0.336396        NaN     NaN      NaN       NaN       NaN   
1      lag_2     0.068139        NaN     NaN      NaN       NaN       NaN   
2      lag_3    -0.106501        NaN     NaN      NaN       NaN       NaN   
3      lag_4    -0.182646        NaN     NaN      NaN       NaN       NaN   
4      lag_5     0.265701        NaN     NaN      NaN       NaN       NaN   
5      lag_6     0.149203        NaN     NaN      NaN       NaN       NaN   
6      lag_7     0.487109        NaN     NaN      NaN       NaN       NaN   
7  Intercept    -0.447153        NaN     NaN      NaN       NaN       NaN   

   vif          model model_group_name   group  
0  NaN  recursive_reg                   global  
1  NaN  recursive_reg                   global  
2  NaN  recursive_reg                   global  
3  NaN  recursive_reg                   global  
4  NaN  recursive_reg            

In [9]:
print("\n3. STATS DataFrame:")
print(stats_linear)

# Get test metrics
test_metrics = stats_linear[
    (stats_linear['split'] == 'test') & 
    (stats_linear['metric'].isin(['rmse', 'mae', 'r_squared']))
][['metric', 'value']]

print("\nTest Set Metrics:")
print(test_metrics)


3. STATS DataFrame:
             metric       value  split          model model_group_name   group
0              rmse         NaN  train  recursive_reg                   global
1               mae         NaN  train  recursive_reg                   global
2              mape         NaN  train  recursive_reg                   global
3         r_squared         NaN  train  recursive_reg                   global
4              rmse    7.682572   test  recursive_reg                   global
5               mae    6.247134   test  recursive_reg                   global
6              mape    4.391373   test  recursive_reg                   global
7         r_squared    0.755406   test  recursive_reg                   global
8           formula   sales ~ .         recursive_reg                   global
9       n_obs_train         120  train  recursive_reg                   global
10             lags           7         recursive_reg                   global
11  differentiation        None

---

# Model 2: Recursive with Random Forest

Random Forest can capture **non-linear relationships** between lags.

In [10]:
# Create recursive specification with Random Forest base model
spec_rf = recursive_reg(
    base_model=rand_forest(trees=200, min_n=5),
    lags=7
)

print(spec_rf)

ModelSpec(model_type='recursive_reg', engine='skforecast', mode='regression', args={'base_model': ModelSpec(model_type='rand_forest', engine='sklearn', mode='unknown', args={'trees': 200, 'min_n': 5}), 'lags': 7, 'differentiation': None})


In [11]:
# Fit and evaluate
fit_rf = spec_rf.fit(train, "sales ~ .")
fit_rf = fit_rf.evaluate(test)

print("Random Forest recursive model fitted and evaluated!")

Random Forest recursive model fitted and evaluated!


In [12]:
# Extract outputs
outputs_rf, coefs_rf, stats_rf = fit_rf.extract_outputs()

# Random Forest reports feature importances
print("Random Forest Feature Importances:")
print(coefs_rf[['variable', 'coefficient']].sort_values('coefficient', ascending=False))

print("\nInterpretation: Higher values indicate more important lags for prediction")

Random Forest Feature Importances:
  variable  coefficient
6    lag_7     0.929886
0    lag_1     0.018474
5    lag_6     0.014600
4    lag_5     0.012253
2    lag_3     0.008688
3    lag_4     0.008314
1    lag_2     0.007785

Interpretation: Higher values indicate more important lags for prediction


In [13]:
# Get test metrics
rf_test_metrics = stats_rf[
    (stats_rf['split'] == 'test') & 
    (stats_rf['metric'].isin(['rmse', 'mae', 'r_squared']))
][['metric', 'value']]

print("Random Forest Test Metrics:")
print(rf_test_metrics)

Random Forest Test Metrics:
      metric      value
4       rmse  11.148346
5        mae   9.059137
7  r_squared   0.484945


---

# Model 3: Specific Lag Selection

Instead of using all lags 1-7, we can select **specific lags** that match known patterns.

For example, with daily data that follows a weekly pattern:
- Lag 1: Yesterday
- Lag 7: Same day last week
- Lag 14: Same day 2 weeks ago
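
One way to check which lags match a known pattern is to look at the sample autocorrelation at candidate lags. The sketch below uses a hypothetical noise-free weekly sine wave and a hand-written `autocorr` helper; both are assumptions for illustration and not part of the notebook's data or the library.

```python
import numpy as np

def autocorr(y, lag):
    """Sample autocorrelation of y at the given positive lag."""
    y = np.asarray(y, dtype=float)
    centered = y - y.mean()
    return float(np.dot(centered[lag:], centered[:-lag]) / np.dot(centered, centered))

# Hypothetical noise-free weekly pattern over 20 weeks, for illustration only
t = np.arange(140)
weekly = np.sin(2 * np.pi * t / 7)

# Compare lags that line up with the weekly cycle (7, 14) against lags that do not (1, 3)
acf = {lag: autocorr(weekly, lag) for lag in (1, 3, 7, 14)}
```

Lags 7 and 14 line up with the cycle and show the strongest positive autocorrelation, while lag 3 falls near the opposite phase and is negative.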

In [14]:
# Create recursive specification with specific lags
spec_lags = recursive_reg(
    base_model=rand_forest(trees=200, min_n=5),
    lags=[1, 7, 14]  # Specific lag indices
)

print(spec_lags)
print(f"\nUsing lags: {spec_lags.args['lags']}")

ModelSpec(model_type='recursive_reg', engine='skforecast', mode='regression', args={'base_model': ModelSpec(model_type='rand_forest', engine='sklearn', mode='unknown', args={'trees': 200, 'min_n': 5}), 'lags': [1, 7, 14], 'differentiation': None})

Using lags: [1, 7, 14]


In [15]:
# Fit and evaluate
fit_lags = spec_lags.fit(train, "sales ~ .")
fit_lags = fit_lags.evaluate(test)

outputs_lags, coefs_lags, stats_lags = fit_lags.extract_outputs()

# Feature importances for specific lags
print("Feature Importances for Specific Lags:")
print(coefs_lags[['variable', 'coefficient']].sort_values('coefficient', ascending=False))

Feature Importances for Specific Lags:
  variable  coefficient
2   lag_14     0.522144
1    lag_7     0.460430
0    lag_1     0.017426


In [16]:
# Get test metrics
lags_test_metrics = stats_lags[
    (stats_lags['split'] == 'test') & 
    (stats_lags['metric'].isin(['rmse', 'mae', 'r_squared']))
][['metric', 'value']]

print("Specific Lags Test Metrics:")
print(lags_test_metrics)

Specific Lags Test Metrics:
      metric      value
4       rmse  11.173599
5        mae   9.247505
7  r_squared   0.482609


---

# Model 4: Differentiation for Non-Stationary Data

For data with strong trends, **differencing** can improve model performance by making the series stationary.

- `differentiation=1`: First difference (removes linear trend)
- `differentiation=2`: Second difference (removes quadratic trend)
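
The effect of differencing, and how forecasts are mapped back to the original scale, can be sketched with NumPy. The series and the forecasted differences below are made up for illustration; the library handles this transformation internally when `differentiation` is set.

```python
import numpy as np

# Hypothetical series with a linear trend: 100, 102, 104, ...
y = 100 + 2.0 * np.arange(10)

# First difference: d[t] = y[t] - y[t-1]; the linear trend becomes a constant
d = np.diff(y, n=1)

# Second difference of a quadratic series is also constant
q = np.arange(10.0) ** 2
d2 = np.diff(q, n=2)

# Suppose a model forecasts the next three differences (made-up values)
d_forecast = np.array([2.0, 2.0, 2.0])

# Undo the differencing: cumulative sum anchored at the last observed level
y_forecast = y[-1] + np.cumsum(d_forecast)
```

The model is trained on `d` rather than `y`, so it no longer has to extrapolate the trend itself; the trend is restored when the forecasted differences are summed back up.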

In [17]:
# Create recursive specification with differentiation
spec_diff = recursive_reg(
    base_model=linear_reg(),
    lags=7,
    differentiation=1  # Apply first differencing
)

print(spec_diff)
print(f"\nDifferentiation order: {spec_diff.args['differentiation']}")

ModelSpec(model_type='recursive_reg', engine='skforecast', mode='regression', args={'base_model': ModelSpec(model_type='linear_reg', engine='sklearn', mode='regression', args={}), 'lags': 7, 'differentiation': 1})

Differentiation order: 1


In [18]:
# Fit and evaluate
fit_diff = spec_diff.fit(train, "sales ~ .")
fit_diff = fit_diff.evaluate(test)

outputs_diff, coefs_diff, stats_diff = fit_diff.extract_outputs()

# Get test metrics
diff_test_metrics = stats_diff[
    (stats_diff['split'] == 'test') & 
    (stats_diff['metric'].isin(['rmse', 'mae', 'r_squared']))
][['metric', 'value']]

print("Differentiated Model Test Metrics:")
print(diff_test_metrics)

Differentiated Model Test Metrics:
      metric     value
4       rmse  6.529885
5        mae  5.556278
7  r_squared  0.823297


---

# Prediction Intervals

Get **uncertainty estimates** with prediction intervals:

In [19]:
# Predict with intervals using the fitted Random Forest model
pred_intervals = fit_rf.predict(test, type="pred_int")

print("Predictions with 90% Prediction Intervals:")
print(pred_intervals.head(15))

# Verify intervals are properly ordered
print("\nInterval Coverage Check:")
print(f"Lower <= Prediction: {(pred_intervals['.pred_lower'] <= pred_intervals['.pred']).all()}")
print(f"Prediction <= Upper: {(pred_intervals['.pred'] <= pred_intervals['.pred_upper']).all()}")

Predictions with 90% Prediction Intervals:
                 .pred  .pred_lower  .pred_upper
date                                            
2023-05-01  158.512917   155.036640   165.479643
2023-05-02  157.096032   154.380623   164.503835
2023-05-03  150.126702   147.842040   157.521057
2023-05-04  136.720868   131.999045   141.257660
2023-05-05  115.790545   110.261882   118.417427
2023-05-06  136.155311   131.372863   140.285499
2023-05-07  147.067404   144.441834   153.764168
2023-05-08  158.375612   155.095576   165.460412
2023-05-09  156.883055   153.372913   164.172953
2023-05-10  150.357190   148.257829   160.703553
2023-05-11  142.503815   134.308324   145.714845
2023-05-12  113.757639   109.758064   119.994012
2023-05-13  141.753752   132.562102   144.029083
2023-05-14  151.115195   147.825910   160.045315
2023-05-15  158.493686   155.092864   165.535867

Interval Coverage Check:
Lower <= Prediction: True
Prediction <= Upper: True


---

# Model Comparison

Compare all models on test set performance:

In [20]:
# Combine all test metrics
linear_metrics = test_metrics.copy()
linear_metrics['model'] = 'Linear (7 lags)'

rf_metrics = rf_test_metrics.copy()
rf_metrics['model'] = 'Random Forest (7 lags)'

lags_metrics = lags_test_metrics.copy()
lags_metrics['model'] = 'Random Forest (1,7,14 lags)'

diff_metrics = diff_test_metrics.copy()
diff_metrics['model'] = 'Linear (7 lags, diff=1)'

all_metrics = pd.concat([
    linear_metrics,
    rf_metrics,
    lags_metrics,
    diff_metrics
], ignore_index=True)

# Pivot for easy comparison
comparison = all_metrics.pivot(index='metric', columns='model', values='value')

print("=" * 100)
print("MODEL COMPARISON - TEST SET METRICS")
print("=" * 100)
print(comparison)
print("\nLower is better for: RMSE, MAE")
print("Higher is better for: R²")

MODEL COMPARISON - TEST SET METRICS
model     Linear (7 lags) Linear (7 lags, diff=1) Random Forest (1,7,14 lags)  \
metric                                                                          
mae              6.247134                5.556278                    9.247505   
r_squared        0.755406                0.823297                    0.482609   
rmse             7.682572                6.529885                   11.173599   

model     Random Forest (7 lags)  
metric                            
mae                     9.059137  
r_squared               0.484945  
rmse                   11.148346  

Lower is better for: RMSE, MAE
Higher is better for: R²


In [21]:
# Find best model for each metric
print("\n" + "=" * 100)
print("BEST MODEL FOR EACH METRIC")
print("=" * 100)

for metric in ['rmse', 'mae']:
    best_model = comparison.loc[metric].idxmin()
    best_value = comparison.loc[metric].min()
    print(f"{metric.upper():10s}: {best_model:35s} ({best_value:.4f})")

# R² is higher-is-better
best_model = comparison.loc['r_squared'].idxmax()
best_value = comparison.loc['r_squared'].max()
print(f"{'R²':10s}: {best_model:35s} ({best_value:.4f})")


BEST MODEL FOR EACH METRIC
RMSE      : Linear (7 lags, diff=1)             (6.5299)
MAE       : Linear (7 lags, diff=1)             (5.5563)
R²        : Linear (7 lags, diff=1)             (0.8233)


---

# Future Forecasting

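Forecast beyond the observed data. The models above were fitted on the first 120 days only, and the first values printed below match the test-period predictions from the Prediction Intervals section, which suggests the forecaster continues from the end of the training window and relabels the steps with the new dates. Refit on the full dataset before relying on forecasts past 2023-05-30. The cells below reuse `fit_rf` as trained: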

In [22]:
# Create future dates (30 days beyond test set)
last_date = data.index.max()
future_dates = pd.date_range(last_date + timedelta(days=1), periods=30, freq="D")
future_data = pd.DataFrame(index=future_dates)

print(f"Forecasting for: {future_data.index.min()} to {future_data.index.max()}")

Forecasting for: 2023-05-31 00:00:00 to 2023-06-29 00:00:00


In [23]:
# Generate forecasts with the Random Forest model
# (the differenced linear model scored best on the test set above)
future_forecast = fit_rf.predict(future_data)

print("Future Forecast (next 30 days):")
print(future_forecast)

print(f"\nForecast Statistics:")
print(f"Mean: {future_forecast['.pred'].mean():.2f}")
print(f"Std: {future_forecast['.pred'].std():.2f}")
print(f"Min: {future_forecast['.pred'].min():.2f}")
print(f"Max: {future_forecast['.pred'].max():.2f}")

Future Forecast (next 30 days):
                 .pred
2023-05-31  158.512917
2023-06-01  157.096032
2023-06-02  150.126702
2023-06-03  136.720868
2023-06-04  115.790545
2023-06-05  136.155311
2023-06-06  147.067404
2023-06-07  158.375612
2023-06-08  156.883055
2023-06-09  150.357190
2023-06-10  142.503815
2023-06-11  113.757639
2023-06-12  141.753752
2023-06-13  151.115195
2023-06-14  158.493686
2023-06-15  157.222961
2023-06-16  150.750451
2023-06-17  145.029739
2023-06-18  112.669461
2023-06-19  144.832870
2023-06-20  152.577169
2023-06-21  158.493686
2023-06-22  157.222961
2023-06-23  152.635717
2023-06-24  148.514315
2023-06-25  112.342070
2023-06-26  150.109137
2023-06-27  152.805923
2023-06-28  158.615999
2023-06-29  157.244901

Forecast Statistics:
Mean: 146.19
Std: 14.45
Min: 112.34
Max: 158.62


In [24]:
# Future forecast with intervals
future_intervals = fit_rf.predict(future_data, type="pred_int")

print("Future Forecast with 90% Prediction Intervals:")
print(future_intervals)

Future Forecast with 90% Prediction Intervals:
                 .pred  .pred_lower  .pred_upper
2023-05-31  158.512917   155.036640   165.479643
2023-06-01  157.096032   154.380623   164.503835
2023-06-02  150.126702   147.842040   157.521057
2023-06-03  136.720868   131.999045   141.257660
2023-06-04  115.790545   110.261882   118.417427
2023-06-05  136.155311   131.372863   140.285499
2023-06-06  147.067404   144.441834   153.764168
2023-06-07  158.375612   155.095576   165.460412
2023-06-08  156.883055   153.372913   164.172953
2023-06-09  150.357190   148.257829   160.703553
2023-06-10  142.503815   134.308324   145.714845
2023-06-11  113.757639   109.758064   119.994012
2023-06-12  141.753752   132.562102   144.029083
2023-06-13  151.115195   147.825910   160.045315
2023-06-14  158.493686   155.092864   165.535867
2023-06-15  157.222961   154.399088   164.326156
2023-06-16  150.750451   149.291289   160.766481
2023-06-17  145.029739   137.487554   153.003189
2023-06-18  112.669461

---

# Summary

## Recursive Forecasting Key Takeaways:

### 1. Base Model Selection
- **Linear Regression**: Fast, interpretable, works well with linear autocorrelation
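- **Random Forest**: Captures non-linear patterns and provides feature importances; in this demo it scored worse than the linear model (test RMSE 11.15 vs 7.68)
- **Any sklearn-compatible model**: Any regressor that follows the scikit-learn API can serve as the base model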

### 2. Lag Configuration
- **Integer lags** (`lags=7`): Uses all lags from 1 to n
- **Specific lags** (`lags=[1, 7, 14]`): Use only certain lags
- **Choose based on domain knowledge**: Weekly patterns → use lag 7, monthly → use lag 30

### 3. Differentiation
- Use `differentiation=1` for data with trends
- Makes non-stationary series stationary
- Can improve forecast accuracy: in this demo, test RMSE for the linear model fell from 7.68 to 6.53

### 4. Prediction Intervals
- Use `type="pred_int"` for uncertainty estimates
- Based on in-sample residuals
- Important for decision-making under uncertainty
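
One simple way to turn in-sample residuals into an interval is to add their empirical quantiles to the point forecast. The sketch below is an assumption-laden illustration with made-up residuals and forecasts; it is not necessarily how `type="pred_int"` computes its bounds.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical in-sample residuals (actual minus fitted) and point forecasts
residuals = rng.normal(loc=0.0, scale=5.0, size=500)
point_forecast = np.array([150.0, 148.0, 140.0])

# A 90% interval uses the 5th and 95th percentiles of the residuals
lower_q, upper_q = np.quantile(residuals, [0.05, 0.95])
pred_lower = point_forecast + lower_q
pred_upper = point_forecast + upper_q

# Share of in-sample residuals that fall inside the band
coverage = float(np.mean((residuals >= lower_q) & (residuals <= upper_q)))
```

This version applies the same band width at every step. In a recursive forecast, uncertainty usually grows with the horizon, so a fuller treatment would propagate resampled residuals through the recursion.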

### 5. Three-DataFrame Output
- **Outputs**: Train and test predictions with residuals
- **Coefficients**: Model parameters or feature importances
- **Stats**: Performance metrics (RMSE, MAE, R²)
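
For reference, the three headline metrics can be computed by hand as follows. The function name `regression_metrics` and the numbers are made up for illustration, and R² is assumed here to be the usual 1 - SS_res / SS_tot; the library's exact definitions may differ.

```python
import numpy as np

def regression_metrics(actual, predicted):
    """Compute RMSE, MAE and R-squared for two equal-length sequences."""
    actual = np.asarray(actual, dtype=float)
    predicted = np.asarray(predicted, dtype=float)
    errors = actual - predicted
    rmse = float(np.sqrt(np.mean(errors ** 2)))            # penalizes large errors more
    mae = float(np.mean(np.abs(errors)))                   # average absolute error
    ss_res = float(np.sum(errors ** 2))                    # unexplained variation
    ss_tot = float(np.sum((actual - actual.mean()) ** 2))  # total variation
    return {"rmse": rmse, "mae": mae, "r_squared": 1.0 - ss_res / ss_tot}

# Made-up numbers, for illustration only
metrics = regression_metrics([100, 110, 120, 130], [102, 108, 123, 129])
```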

## When to Use Recursive Forecasting:

✅ **Good for:**
- Short to medium-term forecasts (1-90 days)
- Data with strong autocorrelation
- When you want to use ML models for time series
- When you need non-linear relationships

❌ **Consider alternatives for:**
- Very long-term forecasts (error compounds)
- Data with complex seasonality (consider Prophet or seasonal ARIMA)
- When you have many exogenous variables
- When interpretability is paramount

## Next Steps:

1. **Try different base models**: XGBoost, LightGBM, etc.
2. **Experiment with lag configurations**: Find optimal lags for your data
3. **Use with WorkflowSet**: Compare multiple recursive configurations
4. **Add exogenous variables**: Include external predictors alongside lags
5. **Cross-validation**: Use time series splits for robust evaluation
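
As a starting point for item 5, the sketch below generates expanding-window splits in which every test block comes strictly after its training data. The helper `expanding_window_splits` is hypothetical; scikit-learn's `TimeSeriesSplit` offers a similar scheme.

```python
def expanding_window_splits(n_obs, n_splits, test_size):
    """Yield (train_idx, test_idx) ranges; each test block follows its training data."""
    first_test_start = n_obs - n_splits * test_size
    if first_test_start <= 0:
        raise ValueError("not enough observations for the requested splits")
    for k in range(n_splits):
        test_start = first_test_start + k * test_size
        # Training always starts at 0 and grows by test_size with each split
        yield range(0, test_start), range(test_start, test_start + test_size)

# 150 daily observations, three 30-day test blocks
splits = list(expanding_window_splits(n_obs=150, n_splits=3, test_size=30))
```

With these settings the last split trains on the first 120 days and tests on the final 30, which is the same split used throughout this notebook; the earlier splits add two more evaluation windows.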