# Time Series Forecasting Demo

This notebook demonstrates time series forecasting using multiple model types:
- **Prophet**: Native time series model with trend and seasonality
- **ARIMA**: Classical time series model
- **Random Forest**: ML model with lag features for time series
- **Linear Regression**: Simple baseline with time features

## Key Concepts:
1. **Native time series models** (Prophet, ARIMA): Handle dates directly
2. **ML models** (Random Forest, Linear Reg): Require feature engineering (lags, rolling stats)
3. **Comprehensive outputs**: All models return a standardized three-DataFrame structure (outputs, coefficients, and statistics)

In [None]:
import pandas as pd
import numpy as np
from datetime import datetime, timedelta

from py_parsnip import prophet_reg, arima_reg, rand_forest, linear_reg

# Set random seed
np.random.seed(42)

## Generate Synthetic Time Series Data

We'll create daily sales data with:
- Trend (increasing over time)
- Weekly seasonality
- Random noise

In [None]:
# Generate 2 years of daily data
n_days = 730
start_date = datetime(2022, 1, 1)

dates = [start_date + timedelta(days=i) for i in range(n_days)]

# Create time series components
trend = np.linspace(100, 150, n_days)
seasonality = 20 * np.sin(2 * np.pi * np.arange(n_days) / 7)  # Weekly pattern
noise = np.random.normal(0, 5, n_days)

sales = trend + seasonality + noise

data = pd.DataFrame({
    'date': dates,
    'sales': sales
})

print(data.head(10))
print(f"\nData shape: {data.shape}")
print(f"Date range: {data['date'].min()} to {data['date'].max()}")

## Train/Test Split for Time Series

**Important**: For time series, split chronologically, not randomly. A random split would let the model train on observations that come after the ones it is tested on.

In [None]:
# Split: Train on first 600 days, test on last 130 days
train = data.iloc[:600].copy()
test = data.iloc[600:].copy()

print(f"Train: {train.shape[0]} days ({train['date'].min()} to {train['date'].max()})")
print(f"Test: {test.shape[0]} days ({test['date'].min()} to {test['date'].max()})")

---

# Model 1: Prophet (Native Time Series)

Prophet handles dates natively and fits trend and seasonality components automatically.

In [None]:
# Create Prophet specification
# Prophet will automatically detect weekly seasonality in our data
spec_prophet = prophet_reg(
    n_changepoints=25,  # Number of potential trend changes
    changepoint_prior_scale=0.05,  # Flexibility of trend
    seasonality_prior_scale=10.0   # Flexibility of seasonality
)

print(spec_prophet)

In [None]:
# Fit Prophet
fit_prophet = spec_prophet.fit(train, "sales ~ date")
print("Prophet model fitted!")

In [None]:
# Predict on test data
pred_prophet = fit_prophet.predict(test)
print(pred_prophet.head(10))

In [None]:
# Evaluate and extract outputs
fit_prophet = fit_prophet.evaluate(test)
outputs_prophet, coefs_prophet, stats_prophet = fit_prophet.extract_outputs()

print("Prophet OUTPUTS:")
print(outputs_prophet[outputs_prophet['split'] == 'test'].head(10))

In [None]:
# Get test metrics
prophet_test_metrics = stats_prophet[
    (stats_prophet['split'] == 'test') & 
    (stats_prophet['metric'].isin(['rmse', 'mae', 'mape', 'r_squared']))
][['metric', 'value']]

print("\nProphet Test Metrics:")
print(prophet_test_metrics)
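For reference, the four metrics filtered above have standard definitions that can be reproduced with NumPy. The sketch below uses made-up numbers and a hypothetical helper name, `regression_metrics`. It assumes the textbook formulas; py_parsnip's own conventions (for example, whether MAPE is reported as a percentage) may differ.

```python
import numpy as np

def regression_metrics(y_true, y_pred):
    """Textbook RMSE, MAE, MAPE (in percent) and R^2 for two equal-length arrays."""
    y_true = np.asarray(y_true, dtype=float)
    y_pred = np.asarray(y_pred, dtype=float)
    err = y_true - y_pred
    ss_res = np.sum(err ** 2)                        # residual sum of squares
    ss_tot = np.sum((y_true - y_true.mean()) ** 2)   # total sum of squares
    return {
        'rmse': float(np.sqrt(np.mean(err ** 2))),
        'mae': float(np.mean(np.abs(err))),
        'mape': float(np.mean(np.abs(err / y_true)) * 100),
        'r_squared': float(1 - ss_res / ss_tot),
    }

# Illustrative values only, not taken from the notebook's data
example_metrics = regression_metrics([100, 110, 120], [98, 113, 119])
print(example_metrics)
```

RMSE penalizes large errors more heavily than MAE, and MAPE is undefined when an actual value is zero, which is not a concern for the sales series here.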

---

# Model 2: ARIMA (Classical Time Series)

ARIMA models the autocorrelation structure of the time series.
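To make "autocorrelation structure" concrete, the sketch below rebuilds a series with the same trend, weekly cycle, and noise level as the synthetic data and compares its correlation with itself at two lags. It uses its own random generator, so the values will not match the notebook's data exactly.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)
t = np.arange(730)
series = pd.Series(
    np.linspace(100, 150, 730)           # trend
    + 20 * np.sin(2 * np.pi * t / 7)     # weekly cycle
    + rng.normal(0, 5, 730)              # noise
)

# Correlation between the series and a copy of itself shifted by `lag` days
acf_lag3 = series.autocorr(lag=3)
acf_lag7 = series.autocorr(lag=7)
print(f"lag 3: {acf_lag3:.2f}, lag 7: {acf_lag7:.2f}")
```

The correlation at lag 7 should come out much higher than at lag 3, because values one full week apart sit at the same point in the cycle. That is the pattern the `seasonal_period=7` terms are meant to capture.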

In [None]:
# Create ARIMA specification
spec_arima = arima_reg(
    seasonal_period=7,  # Weekly seasonality
    non_seasonal_ar=1,
    non_seasonal_differences=1,
    non_seasonal_ma=1,
    seasonal_ar=1,
    seasonal_differences=0,
    seasonal_ma=1
)

print(spec_arima)

In [None]:
# Fit ARIMA
fit_arima = spec_arima.fit(train, "sales ~ date")
print("ARIMA model fitted!")

In [None]:
# Predict on test data
pred_arima = fit_arima.predict(test)
print(pred_arima.head(10))

In [None]:
# Evaluate and extract outputs
fit_arima = fit_arima.evaluate(test)
outputs_arima, coefs_arima, stats_arima = fit_arima.extract_outputs()

# Get test metrics
arima_test_metrics = stats_arima[
    (stats_arima['split'] == 'test') & 
    (stats_arima['metric'].isin(['rmse', 'mae', 'mape', 'r_squared']))
][['metric', 'value']]

print("ARIMA Test Metrics:")
print(arima_test_metrics)

---

# Model 3: Random Forest (ML with Time Features)

Random Forest doesn't handle dates natively, so we need to engineer features:
- Lag features (previous values)
- Rolling statistics (moving averages)
- Time-based features (day of week, month, etc.)
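A tiny example may help show what the lag and rolling features look like. This sketch uses a made-up four-row frame. Note the `shift(1)` before `rolling`: it ensures each row's rolling mean is computed only from earlier rows, so the current target value never leaks into its own features.

```python
import pandas as pd

toy = pd.DataFrame({'sales': [10.0, 20.0, 30.0, 40.0]})

# Lag feature: yesterday's value
toy['lag_1'] = toy['sales'].shift(1)

# Rolling mean over the two previous days (current day excluded by the shift)
toy['rolling_mean_2'] = toy['sales'].shift(1).rolling(window=2, min_periods=1).mean()

print(toy)
# lag_1:          NaN, 10.0, 20.0, 30.0
# rolling_mean_2: NaN, 10.0, 15.0, 25.0
```

The first row has no history, which is why rows with missing features are dropped before fitting.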

In [None]:
def create_time_series_features(df, target_col='sales', lags=(1, 7, 14)):
    """
    Create time series features for ML models.
    
    Args:
        df: DataFrame with 'date' and target column
        target_col: Name of target variable
        lags: Iterable of lag periods to create (a tuple avoids a mutable default)
    
    Returns:
        DataFrame with engineered features
    """
    df = df.copy()
    
    # Time-based features
    df['day_of_week'] = df['date'].dt.dayofweek
    df['day_of_month'] = df['date'].dt.day
    df['month'] = df['date'].dt.month
    df['day_of_year'] = df['date'].dt.dayofyear
    
    # Lag features
    for lag in lags:
        df[f'lag_{lag}'] = df[target_col].shift(lag)
    
    # Rolling statistics
    df['rolling_mean_7'] = df[target_col].shift(1).rolling(window=7, min_periods=1).mean()
    df['rolling_std_7'] = df[target_col].shift(1).rolling(window=7, min_periods=1).std()
    df['rolling_mean_14'] = df[target_col].shift(1).rolling(window=14, min_periods=1).mean()
    
    # Drop rows with NaN (from lag/rolling features)
    df = df.dropna()
    
    return df

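# Create features on the full series, then split chronologically.
# Building features on `test` alone would drop its first 14 rows (no lag
# history), so the ML models would be scored on fewer days than Prophet and ARIMA.
features = create_time_series_features(data)
train_rf = features[features['date'] <= train['date'].max()].copy()
test_rf = features[features['date'] >= test['date'].min()].copy()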

print("Random Forest features:")
print(train_rf.head(20))
print(f"\nTrain shape: {train_rf.shape}")
print(f"Test shape: {test_rf.shape}")

In [None]:
# Create Random Forest specification
spec_rf = rand_forest(
    trees=300,
    mtry=4,
    min_n=5
).set_mode("regression")

print(spec_rf)

In [None]:
# Fit Random Forest (using engineered features)
formula_rf = "sales ~ day_of_week + day_of_month + month + lag_1 + lag_7 + lag_14 + rolling_mean_7 + rolling_std_7 + rolling_mean_14"

fit_rf = spec_rf.fit(train_rf, formula_rf)
print("Random Forest model fitted!")

In [None]:
# Predict on test data
pred_rf = fit_rf.predict(test_rf)
print(pred_rf.head(10))

In [None]:
# Evaluate and extract outputs
fit_rf = fit_rf.evaluate(test_rf)
outputs_rf, coefs_rf, stats_rf = fit_rf.extract_outputs()

print("\nRandom Forest Feature Importances:")
print(coefs_rf.sort_values('coefficient', ascending=False))

In [None]:
# Get test metrics
rf_test_metrics = stats_rf[
    (stats_rf['split'] == 'test') & 
    (stats_rf['metric'].isin(['rmse', 'mae', 'mape', 'r_squared']))
][['metric', 'value']]

print("Random Forest Test Metrics:")
print(rf_test_metrics)

---

# Model 4: Linear Regression (Simple Baseline)

Linear regression fitted on the same engineered features as the Random Forest.

In [None]:
# Create Linear Regression specification
spec_lm = linear_reg()

# Fit using same formula as Random Forest
fit_lm = spec_lm.fit(train_rf, formula_rf)
print("Linear Regression model fitted!")

In [None]:
# Evaluate
fit_lm = fit_lm.evaluate(test_rf)
outputs_lm, coefs_lm, stats_lm = fit_lm.extract_outputs()

# Get test metrics
lm_test_metrics = stats_lm[
    (stats_lm['split'] == 'test') & 
    (stats_lm['metric'].isin(['rmse', 'mae', 'mape', 'r_squared']))
][['metric', 'value']]

print("Linear Regression Test Metrics:")
print(lm_test_metrics)

---

# Model Comparison

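Compare all models on test set performance. The comparison is not strictly like-for-like: the Random Forest and Linear Regression features include lagged actual sales, so those models effectively predict one step ahead, while Prophet and ARIMA are given only dates.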

In [None]:
# Combine all test metrics, tagging each frame with its model name.
# assign() returns a new DataFrame, which avoids the SettingWithCopyWarning
# that in-place assignment on a filtered slice would trigger.
all_metrics = pd.concat([
    prophet_test_metrics.assign(model='Prophet'),
    arima_test_metrics.assign(model='ARIMA'),
    rf_test_metrics.assign(model='Random Forest'),
    lm_test_metrics.assign(model='Linear Regression')
], ignore_index=True)

# Pivot for easy comparison
comparison = all_metrics.pivot(index='metric', columns='model', values='value')

print("\n" + "=" * 80)
print("MODEL COMPARISON - TEST SET METRICS")
print("=" * 80)
print(comparison)
print("\nLower is better for: RMSE, MAE, MAPE")
print("Higher is better for: R²")

In [None]:
# Find best model for each metric
print("\n" + "=" * 80)
print("BEST MODEL FOR EACH METRIC")
print("=" * 80)

for metric in ['rmse', 'mae', 'mape']:
    best_model = comparison.loc[metric].idxmin()
    best_value = comparison.loc[metric].min()
    print(f"{metric.upper():6s}: {best_model:20s} ({best_value:.2f})")

# R² is higher-is-better
best_model = comparison.loc['r_squared'].idxmax()
best_value = comparison.loc['r_squared'].max()
print(f"R²    : {best_model:20s} ({best_value:.4f})")

---

# Future Forecasting

Generate forecasts for the next 30 days beyond the test set.

In [None]:
# Create future dates
last_date = data['date'].max()
future_dates = [last_date + timedelta(days=i+1) for i in range(30)]
future_data = pd.DataFrame({'date': future_dates})

print(f"Forecasting for: {future_data['date'].min()} to {future_data['date'].max()}")

### Prophet Future Forecast

In [None]:
# Prophet can forecast directly
future_prophet = fit_prophet.predict(future_data)

print("Prophet Future Forecast:")
print(future_prophet)

### ARIMA Future Forecast

In [None]:
# ARIMA can also forecast directly
future_arima = fit_arima.predict(future_data)

print("ARIMA Future Forecast:")
print(future_arima)

### Random Forest Future Forecast

**Note**: For ML models, multi-step forecasting is iterative: each day's prediction is appended to the history so that it can serve as a lag feature for the following day.

In [None]:
# For ML models, we need to forecast iteratively
# Start with full historical data
full_history = data.copy()

# Forecast one day at a time
future_forecasts_rf = []

for future_date in future_dates:
    # Append a placeholder row for the future date. Every feature is built from
    # shifted (strictly earlier) sales, so the placeholder value never enters
    # the features; it only keeps the row from being dropped as NaN.
    temp_data = pd.concat([
        full_history,
        pd.DataFrame({'date': [future_date], 'sales': [0.0]})
    ], ignore_index=True)

    temp_features = create_time_series_features(temp_data)

    # The last row must be the future date; otherwise we would silently
    # predict from the features of an earlier day
    if temp_features.empty or temp_features['date'].iloc[-1] != future_date:
        print(f"Warning: Could not create features for {future_date}")
        continue

    # Get last row (our future date)
    future_row = temp_features.iloc[[-1]].copy()

    # Predict
    pred = fit_rf.predict(future_row)
    forecast_value = pred['.pred'].values[0]

    # Store forecast
    future_forecasts_rf.append({
        'date': future_date,
        '.pred': forecast_value
    })

    # Add the forecast to history so it feeds the next day's lag features
    full_history = pd.concat([
        full_history,
        pd.DataFrame({'date': [future_date], 'sales': [forecast_value]})
    ], ignore_index=True)

future_rf = pd.DataFrame(future_forecasts_rf)
print("Random Forest Future Forecast:")
print(future_rf)

### Compare Future Forecasts

In [None]:
# Combine all forecasts
forecast_comparison = pd.DataFrame({
    'date': future_dates,
    'Prophet': future_prophet['.pred'].values,
    'ARIMA': future_arima['.pred'].values,
    'Random Forest': future_rf['.pred'].values
})

print("\nFuture Forecast Comparison:")
print(forecast_comparison)

print("\nForecast Statistics:")
print(forecast_comparison[['Prophet', 'ARIMA', 'Random Forest']].describe())

---

# Summary

## Model Characteristics:

### 1. Prophet
- **Pros**: Handles dates natively, automatic trend/seasonality detection, robust to missing data
- **Cons**: Can be slower, less flexible for custom features
- **Best for**: Business time series with strong seasonality

### 2. ARIMA
- **Pros**: Classical approach, interpretable parameters, differencing handles simple trends
- **Cons**: Assumes the series is stationary after differencing; choosing the AR, differencing, and MA orders can be complex
- **Best for**: Series that are stationary after differencing, short-term forecasts
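First-order differencing (the `non_seasonal_differences=1` setting used earlier) is the usual way to remove a linear trend before the AR and MA terms are fitted. The following is a minimal NumPy sketch on a trend-plus-noise series with its own random seed. The helper `half_mean_gap` is a hypothetical name introduced only for this illustration.

```python
import numpy as np

rng = np.random.default_rng(1)
y = np.linspace(100, 150, 730) + rng.normal(0, 5, 730)  # trending series
dy = np.diff(y)                                         # y[t] - y[t-1]

def half_mean_gap(x):
    """Absolute difference between the means of the first and second halves."""
    mid = len(x) // 2
    return abs(x[:mid].mean() - x[mid:].mean())

raw_gap = half_mean_gap(y)
diff_gap = half_mean_gap(dy)
print(f"raw: {raw_gap:.2f}, differenced: {diff_gap:.2f}")
```

The raw series has clearly different means in its two halves, while the differenced series does not. That is a rough, informal sign that differencing has removed the trend.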

### 3. Random Forest
- **Pros**: Captures non-linear relationships, handles complex interactions, feature importances
- **Cons**: Requires feature engineering, can't extrapolate trends, slower for multi-step forecasting
- **Best for**: Time series with rich features, non-linear patterns
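The "can't extrapolate trends" point follows from how trees predict: each leaf returns an average of training targets, so no prediction can exceed the largest target seen during training. The small illustration below uses scikit-learn's `RandomForestRegressor` directly. That is an assumption made for brevity; the notebook itself fits forests through `rand_forest`.

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor

# Perfectly linear relationship, observed only for x in [0, 99]
X_train = np.arange(100, dtype=float).reshape(-1, 1)
y_train = 2.0 * X_train.ravel()

forest = RandomForestRegressor(n_estimators=50, random_state=0).fit(X_train, y_train)

# Ask for a prediction far outside the training range
pred_far = forest.predict(np.array([[200.0]]))[0]
print(f"prediction at x=200: {pred_far:.1f} (the linear trend would give 400)")
```

In this notebook the lag and rolling features partly offset the limitation, because they carry the recent level of the series into each prediction.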

### 4. Linear Regression
- **Pros**: Fast, interpretable coefficients, simple baseline
- **Cons**: Assumes linear relationships, limited flexibility
- **Best for**: Simple trends, baseline comparisons

## Key Takeaways:

1. **Native time series models** (Prophet, ARIMA) are easier to use but less flexible
2. **ML models** (Random Forest, Linear Reg) require feature engineering but can capture complex patterns
3. **All models** return standardized three-DataFrame outputs for consistent analysis
4. The **evaluate()** method makes it easy to compare train and test performance across all model types
5. **Multi-step forecasting** is straightforward for Prophet/ARIMA, iterative for ML models