# Example 29: Hybrid Models - All 4 Strategies

## Overview

This notebook demonstrates **`hybrid_model()`**, a new feature in py-tidymodels v1.0.0 that combines any two models using one of **4 strategies**.

### What is a Hybrid Model?

A hybrid model combines two models to leverage their complementary strengths:
- **Model 1**: Captures main patterns (trend, seasonality)
- **Model 2**: Handles what Model 1 misses (residuals, regime changes, etc.)

### The 4 Hybrid Strategies

#### 1. **Residual Strategy** (default)
Train Model 2 on Model 1's residuals.

```python
hybrid_model(
    model1=linear_reg(),
    model2=rand_forest().set_mode('regression'),
    strategy='residual'
)
```

**Final prediction** = model1_pred + model2_pred

**Use when**:
- Model 1 captures main pattern but misses complex residuals
- Want to boost linear model with ML
- Examples: ARIMA + XGBoost, Linear + Random Forest

---

#### 2. **Sequential Strategy**
Use Model 1 before split point, Model 2 after.

```python
hybrid_model(
    model1=arima_reg(),
    model2=prophet_reg(),
    strategy='sequential',
    split_point='2020-03-01'  # COVID regime change
)
```

**Final prediction** = model1 before split, model2 after

**Use when**:
- Clear regime change (COVID, policy shift, structural break)
- Different models work in different periods
- Split can be date string, int index, or float proportion

---

#### 3. **Weighted Strategy**
Ensemble: weighted average of predictions.

```python
hybrid_model(
    model1=linear_reg(),
    model2=svm_rbf().set_mode('regression'),
    strategy='weighted',
    weight1=0.7,
    weight2=0.3
)
```

**Final prediction** = 0.7 × model1 + 0.3 × model2

**Use when**:
- Want model diversity (ensembling)
- Both models have merit
- Reduce variance through averaging

---

#### 4. **Custom Data Strategy** (NEW!)
Train models on different/overlapping datasets.

```python
spec = hybrid_model(
    model1=linear_reg(),
    model2=prophet_reg(),
    strategy='custom_data',
    blending='weighted',
    weight1=0.5,
    weight2=0.5
)

# Train each model on its own dataset (early_data and recent_data are DataFrames prepared beforehand)
fit = spec.fit(
    data={'model1': early_data, 'model2': recent_data},
    formula='price ~ date'
)
```

**Final prediction** = blended predictions (weighted/avg/model1/model2)

**Use when**:
- Data distribution shifts over time
- Concept drift or regime changes
- Want adaptive learning
- Recent data more relevant than old data
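
The blending options listed above (weighted/avg/model1/model2) can be pictured as a small dispatch over two prediction lists. This is a minimal sketch in plain Python: `blend_predictions` is a hypothetical helper, and the meaning given to each option is an assumption, not the library's confirmed behaviour.

```python
def blend_predictions(pred1, pred2, blending="weighted", weight1=0.5, weight2=0.5):
    """Combine two prediction lists (hypothetical helper, not the library API)."""
    if blending == "weighted":
        # Weighted sum of the two models' predictions
        return [weight1 * a + weight2 * b for a, b in zip(pred1, pred2)]
    if blending == "avg":
        # Simple unweighted mean
        return [(a + b) / 2 for a, b in zip(pred1, pred2)]
    if blending == "model1":
        # Ignore Model 2 at prediction time
        return list(pred1)
    if blending == "model2":
        # Ignore Model 1 at prediction time
        return list(pred2)
    raise ValueError(f"unknown blending option: {blending!r}")


early_model_pred = [1.0, 3.0]   # toy predictions from the model trained on older data
recent_model_pred = [3.0, 5.0]  # toy predictions from the model trained on recent data

weighted = blend_predictions(early_model_pred, recent_model_pred, "weighted", 0.25, 0.75)
averaged = blend_predictions(early_model_pred, recent_model_pred, "avg")
print(weighted, averaged)
```

Leaning the weights toward the model trained on recent data is one way to express "recent data more relevant" without discarding the older history entirely.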

---

## Dataset

**Corn Futures** (commodity data)
- 22 years of daily prices (2002-2024)
- Open, High, Low, Close, Volume
- Real market data with regime changes, volatility clustering
- Perfect for demonstrating all 4 strategies

## Setup

In [None]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
from datetime import datetime, timedelta

# Hybrid model imports (NEW!)
from py_parsnip import (
    hybrid_model, linear_reg, rand_forest, prophet_reg, 
    arima_reg, svm_rbf, decision_tree
)

# Supporting imports
from py_rsample import initial_time_split
from py_yardstick import rmse, mae, r_squared

# Suppress warnings
import warnings
warnings.filterwarnings('ignore')

print("✓ Imports successful")

## Load Commodity Futures Data

In [None]:
# Load all commodities
raw_data = pd.read_csv('../_md/__data/all_commodities_futures_collection.csv')
raw_data['date'] = pd.to_datetime(raw_data['date'])

print(f"Total dataset: {len(raw_data):,} rows")
print(f"Commodities: {raw_data['commodity'].nunique()}")
print(f"\nAvailable commodities:")
print(raw_data['commodity'].value_counts().head(10))

### Focus on Corn Futures

In [None]:
# Select Corn futures
corn_data = raw_data[raw_data['commodity'] == 'Corn'].copy()
corn_data = corn_data.sort_values('date').reset_index(drop=True)

# Drop rows with non-positive close prices
corn_data = corn_data[corn_data['close'] > 0]

print(f"Corn data: {len(corn_data):,} rows")
print(f"Date range: {corn_data['date'].min()} to {corn_data['date'].max()}")
print(f"\nFirst few rows:")
print(corn_data.head())

print(f"\nPrice statistics:")
print(corn_data[['open', 'high', 'low', 'close']].describe())

### Feature Engineering

In [None]:
# Add technical indicators
corn_data['ma_7'] = corn_data['close'].rolling(7).mean()
corn_data['ma_30'] = corn_data['close'].rolling(30).mean()
corn_data['volatility'] = corn_data['close'].rolling(30).std()
corn_data['price_change'] = corn_data['close'].diff()

# Drop NaN from rolling calculations
corn_data = corn_data.dropna().reset_index(drop=True)

print(f"After feature engineering: {len(corn_data):,} rows")
print(f"\nNew columns: {list(corn_data.columns)}")

### Visualize Data

In [None]:
# Plot corn prices
fig, axes = plt.subplots(2, 1, figsize=(14, 8))

# Close price
axes[0].plot(corn_data['date'], corn_data['close'], linewidth=0.5, alpha=0.7)
axes[0].set_title('Corn Futures - Close Price', fontsize=12, fontweight='bold')
axes[0].set_ylabel('Price (cents/bushel)')
axes[0].grid(True, alpha=0.3)

# Add regime markers
axes[0].axvline(pd.Timestamp('2008-09-01'), color='red', linestyle='--', alpha=0.5, label='2008 Financial Crisis')
axes[0].axvline(pd.Timestamp('2020-03-01'), color='orange', linestyle='--', alpha=0.5, label='COVID-19')
axes[0].legend()

# Volatility
axes[1].plot(corn_data['date'], corn_data['volatility'], linewidth=0.5, alpha=0.7, color='red')
axes[1].set_title('30-Day Volatility', fontsize=12, fontweight='bold')
axes[1].set_ylabel('Volatility')
axes[1].set_xlabel('Date')
axes[1].grid(True, alpha=0.3)

plt.tight_layout()
plt.show()

print("\n📊 Observations:")
print("- Multiple regime changes (2008 crisis, 2012 drought, 2020 COVID)")
print("- Volatility clustering (high vol periods, low vol periods)")
print("- Non-linear patterns → perfect for hybrid models!")

## Train/Test Split

In [None]:
# Time-based split: 80% train, 20% test
split = initial_time_split(corn_data, prop=0.8)
train_data = split.training()
test_data = split.testing()

print(f"Train: {len(train_data):,} rows ({train_data['date'].min()} to {train_data['date'].max()})")
print(f"Test:  {len(test_data):,} rows ({test_data['date'].min()} to {test_data['date'].max()})")

---

# Strategy 1: Residual (Default)

**Concept**: Model 2 learns from Model 1's mistakes.

1. Train Model 1 (linear regression - captures trend)
2. Calculate residuals = actual - model1_pred
3. Train Model 2 on residuals (random forest - captures non-linearity)
4. Final = model1_pred + model2_pred
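
The four steps above can be sketched without any modelling library. In this toy example a least-squares line stands in for Model 1 and a per-bin mean of the residuals stands in for Model 2; both helpers (`fit_line`, `fit_bin_means`) and the synthetic data are assumptions made for illustration, not what `hybrid_model()` does internally.

```python
import math


def fit_line(xs, ys):
    """Ordinary least squares for y = a + b*x (stand-in for Model 1)."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    b = sum((x - mx) * (y - my) for x, y in zip(xs, ys)) / sum((x - mx) ** 2 for x in xs)
    a = my - b * mx
    return lambda x: a + b * x


def fit_bin_means(xs, rs, n_bins=10):
    """Piecewise-constant fit: mean residual per x-bin (stand-in for Model 2)."""
    lo, hi = min(xs), max(xs)
    width = (hi - lo) / n_bins

    def bin_of(x):
        return min(max(int((x - lo) / width), 0), n_bins - 1)

    sums, counts = [0.0] * n_bins, [0] * n_bins
    for x, r in zip(xs, rs):
        sums[bin_of(x)] += r
        counts[bin_of(x)] += 1
    means = [s / c if c else 0.0 for s, c in zip(sums, counts)]
    return lambda x: means[bin_of(x)]


def rmse(actual, pred):
    return math.sqrt(sum((a - p) ** 2 for a, p in zip(actual, pred)) / len(actual))


# Toy data: linear trend plus a non-linear wiggle
xs = [i / 100 for i in range(200)]
ys = [3.0 * x + math.sin(6.0 * x) for x in xs]

model1 = fit_line(xs, ys)                              # step 1: fit Model 1
residuals = [y - model1(x) for x, y in zip(xs, ys)]    # step 2: actual - model1_pred
model2 = fit_bin_means(xs, residuals)                  # step 3: fit Model 2 on residuals
hybrid_pred = [model1(x) + model2(x) for x in xs]      # step 4: sum the two predictions

linear_rmse = rmse(ys, [model1(x) for x in xs])
hybrid_rmse = rmse(ys, hybrid_pred)
print(f"linear RMSE: {linear_rmse:.4f}, hybrid RMSE: {hybrid_rmse:.4f}")
```

The comparison here is in-sample, so the hybrid is expected to look better; the holdout evaluation in the next cell is what shows whether Model 2 learned structure or noise.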

**Use case**: Boosting a simple model with a complex one.

In [None]:
# Strategy 1: Residual
print("🔬 Strategy 1: Residual")
print("Model 1: Linear Regression (captures linear trend)")
print("Model 2: Random Forest (captures non-linear residuals)\n")

spec_residual = hybrid_model(
    model1=linear_reg(),
    model2=rand_forest(trees=100, min_n=10).set_mode('regression'),
    strategy='residual'
)

# Fit
print("Training...")
fit_residual = spec_residual.fit(train_data, 'close ~ date + ma_7 + volatility')

# Evaluate
eval_residual = fit_residual.evaluate(test_data)
outputs_res, coeffs_res, stats_res = eval_residual.extract_outputs()

# Test performance
test_stats_res_df = stats_res[stats_res['split'] == 'test']
test_stats_res = test_stats_res_df.set_index('metric')['value']
print("\n📊 Strategy 1 (Residual) - Test Performance:")
print(f"RMSE: {test_stats_res['rmse']:.4f}")
print(f"MAE:  {test_stats_res['mae']:.4f}")
print(f"R²:   {test_stats_res['r_squared']:.4f}")

### Compare Components

In [None]:
# Also fit standalone models for comparison
fit_linear = linear_reg().fit(train_data, 'close ~ date + ma_7 + volatility')
fit_rf = rand_forest(trees=100, min_n=10).set_mode('regression').fit(train_data, 'close ~ date + ma_7 + volatility')

# Evaluate standalone models
eval_linear = fit_linear.evaluate(test_data)
eval_rf = fit_rf.evaluate(test_data)

_, _, stats_linear = eval_linear.extract_outputs()
_, _, stats_rf = eval_rf.extract_outputs()

test_linear_df = stats_linear[stats_linear['split'] == 'test']
test_linear = test_linear_df.set_index('metric')['value']
test_rf_df = stats_rf[stats_rf['split'] == 'test']
test_rf = test_rf_df.set_index('metric')['value']

print("\n📊 Component Comparison (Test RMSE):")
print(f"Linear only:      {test_linear['rmse']:.4f}")
print(f"Random Forest:    {test_rf['rmse']:.4f}")
print(f"Hybrid (Residual): {test_stats_res['rmse']:.4f}")
print(f"\n💡 Improvement over Linear: {(test_linear['rmse'] - test_stats_res['rmse'])/test_linear['rmse']*100:.2f}%")

---

# Strategy 2: Sequential

**Concept**: Different models for different time periods.

Use Model 1 before regime change, Model 2 after.

**Use case**: Clear structural break (COVID, policy change, etc.)
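
Conceptually, the sequential strategy routes each prediction to one model based on its date. The sketch below uses constant-valued stand-ins for the two fitted models; `sequential_predict` is a hypothetical helper, and assigning the split date itself to Model 2 is an assumption of this sketch.

```python
from datetime import date


def sequential_predict(dates, model1, model2, split_point):
    """Route each date to model1 (before the split) or model2 (on or after it)."""
    return [model1(d) if d < split_point else model2(d) for d in dates]


# Hypothetical stand-ins for two fitted models
def pre_break_model(d):
    return 380.0


def post_break_model(d):
    return 520.0


dates = [date(2020, 2, 28), date(2020, 3, 1), date(2020, 3, 2)]
preds = sequential_predict(dates, pre_break_model, post_break_model, date(2020, 3, 1))
print(preds)  # [380.0, 520.0, 520.0]
```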

In [None]:
# Strategy 2: Sequential
print("🔬 Strategy 2: Sequential")
print("Model 1: Linear Regression (stand-in for ARIMA, for pre-COVID period)")
print("Model 2: Prophet (for post-COVID period)")
print("Split point: 2020-03-01 (COVID-19 pandemic)\n")

spec_sequential = hybrid_model(
    model1=linear_reg(),  # Using linear for demo (ARIMA is slow)
    model2=prophet_reg(),
    strategy='sequential',
    split_point='2020-03-01'  # COVID regime change
)

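# Fit
# Check that split_point lies inside the training date range printed above: with an
# 80/20 split of 2002-2024 data, the training set may end before 2020-03-01.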
print("Training...")
fit_sequential = spec_sequential.fit(train_data, 'close ~ date')

# Evaluate
eval_sequential = fit_sequential.evaluate(test_data)
outputs_seq, coeffs_seq, stats_seq = eval_sequential.extract_outputs()

test_stats_seq_df = stats_seq[stats_seq['split'] == 'test']
test_stats_seq = test_stats_seq_df.set_index('metric')['value']
print("\n📊 Strategy 2 (Sequential) - Test Performance:")
print(f"RMSE: {test_stats_seq['rmse']:.4f}")
print(f"MAE:  {test_stats_seq['mae']:.4f}")
print(f"R²:   {test_stats_seq['r_squared']:.4f}")

print("\n💡 Sequential is ideal when:")
print("  - Clear regime change exists")
print("  - Different models work in different periods")
print("  - You want to capture structural breaks")

---

# Strategy 3: Weighted (Ensemble)

**Concept**: Weighted average of two model predictions.

Final = weight1 × model1 + weight2 × model2

**Use case**: Model diversity, variance reduction, ensemble learning.
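
The variance-reduction claim can be made concrete with a standard identity. Assume, for illustration only, that the two models' prediction errors $e_1$ and $e_2$ have standard deviations $\sigma_1$ and $\sigma_2$ and correlation $\rho$. The blended error then has variance

$$
\operatorname{Var}(w_1 e_1 + w_2 e_2) = w_1^2 \sigma_1^2 + w_2^2 \sigma_2^2 + 2 w_1 w_2 \rho\, \sigma_1 \sigma_2 .
$$

With $w_1 = 0.7$, $w_2 = 0.3$, $\sigma_1 = \sigma_2 = \sigma$ and $\rho = 0.5$:

$$
0.49\,\sigma^2 + 0.09\,\sigma^2 + 2(0.7)(0.3)(0.5)\,\sigma^2 = 0.79\,\sigma^2 .
$$

The gain disappears as $\rho \to 1$ (the sum becomes $1.0\,\sigma^2$), which is why the two models need to make different kinds of errors for the ensemble to help.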

In [None]:
# Strategy 3: Weighted
print("🔬 Strategy 3: Weighted (Ensemble)")
print("Model 1: Linear Regression (70%)")
print("Model 2: SVM RBF (30%)\n")

spec_weighted = hybrid_model(
    model1=linear_reg(),
    model2=svm_rbf().set_mode('regression'),
    strategy='weighted',
    weight1=0.7,
    weight2=0.3
)

# Fit
print("Training...")
fit_weighted = spec_weighted.fit(train_data, 'close ~ date + ma_7 + volatility')

# Evaluate
eval_weighted = fit_weighted.evaluate(test_data)
outputs_wgt, coeffs_wgt, stats_wgt = eval_weighted.extract_outputs()

test_stats_wgt_df = stats_wgt[stats_wgt['split'] == 'test']
test_stats_wgt = test_stats_wgt_df.set_index('metric')['value']
print("\n📊 Strategy 3 (Weighted) - Test Performance:")
print(f"RMSE: {test_stats_wgt['rmse']:.4f}")
print(f"MAE:  {test_stats_wgt['mae']:.4f}")
print(f"R²:   {test_stats_wgt['r_squared']:.4f}")

print("\n💡 Weighted ensembles:")
print("  - Reduce overfitting via model diversity")
print("  - Lower variance than single models")
print("  - Weights can be optimized via CV")

---

# Strategy 4: Custom Data (NEW!)

**Concept**: Train models on different datasets, blend predictions.

**Use case**:
- Data distribution shifts
- Concept drift
- Recent data more relevant than old data
- Adaptive learning

**Example**: Train Model 1 on all data, Model 2 on recent data only.

In [None]:
# Strategy 4: Custom Data
print("🔬 Strategy 4: Custom Data (Adaptive Learning)")
print("Model 1: Linear (trained on all training data)")
print("Model 2: Random Forest (trained on recent 30% only)")
print("Blending: 50-50 weighted average\n")

# Split training data: all data vs recent data
cutoff_idx = int(len(train_data) * 0.7)
all_data = train_data.copy()  # Model 1: all training data
recent_data = train_data.iloc[cutoff_idx:].copy()  # Model 2: recent 30%

print(f"Model 1 data: {len(all_data):,} rows (full history)")
print(f"Model 2 data: {len(recent_data):,} rows (recent only)")

spec_custom = hybrid_model(
    model1=linear_reg(),
    model2=rand_forest(trees=100).set_mode('regression'),
    strategy='custom_data',
    blending='weighted',
    weight1=0.5,
    weight2=0.5
)

# Fit with custom datasets
print("\nTraining...")
fit_custom = spec_custom.fit(
    data={'model1': all_data, 'model2': recent_data},
    formula='close ~ date + ma_7 + volatility'
)

# Evaluate
eval_custom = fit_custom.evaluate(test_data)
outputs_cust, coeffs_cust, stats_cust = eval_custom.extract_outputs()

test_stats_cust_df = stats_cust[stats_cust['split'] == 'test']
test_stats_cust = test_stats_cust_df.set_index('metric')['value']
print("\n📊 Strategy 4 (Custom Data) - Test Performance:")
print(f"RMSE: {test_stats_cust['rmse']:.4f}")
print(f"MAE:  {test_stats_cust['mae']:.4f}")
print(f"R²:   {test_stats_cust['r_squared']:.4f}")

print("\n💡 Custom data strategy is powerful for:")
print("  - Distribution shifts (markets change over time)")
print("  - Concept drift (patterns evolve)")
print("  - Adaptive learning (recent data more relevant)")
print("  - Regime-specific models (different data, different model)")

---

# Final Comparison: All 4 Strategies

In [None]:
# Compile results
comparison = pd.DataFrame([
    {
        'Strategy': '1. Residual',
        'Description': 'Linear + RF on residuals',
        'Test_RMSE': test_stats_res['rmse'],
        'Test_MAE': test_stats_res['mae'],
        'Test_R²': test_stats_res['r_squared']
    },
    {
        'Strategy': '2. Sequential',
        'Description': 'Linear before COVID, Prophet after',
        'Test_RMSE': test_stats_seq['rmse'],
        'Test_MAE': test_stats_seq['mae'],
        'Test_R²': test_stats_seq['r_squared']
    },
    {
        'Strategy': '3. Weighted',
        'Description': '70% Linear + 30% SVM',
        'Test_RMSE': test_stats_wgt['rmse'],
        'Test_MAE': test_stats_wgt['mae'],
        'Test_R²': test_stats_wgt['r_squared']
    },
    {
        'Strategy': '4. Custom Data',
        'Description': 'Linear (all) + RF (recent)',
        'Test_RMSE': test_stats_cust['rmse'],
        'Test_MAE': test_stats_cust['mae'],
        'Test_R²': test_stats_cust['r_squared']
    },
    {
        'Strategy': 'Baseline: Linear',
        'Description': 'Simple linear regression',
        'Test_RMSE': test_linear['rmse'],
        'Test_MAE': test_linear['mae'],
        'Test_R²': test_linear['r_squared']
    },
    {
        'Strategy': 'Baseline: RF',
        'Description': 'Random Forest standalone',
        'Test_RMSE': test_rf['rmse'],
        'Test_MAE': test_rf['mae'],
        'Test_R²': test_rf['r_squared']
    }
])

# Sort by RMSE
comparison = comparison.sort_values('Test_RMSE')

print("\n" + "="*80)
print("📊 FINAL COMPARISON: All Hybrid Strategies + Baselines")
print("="*80 + "\n")
print(comparison.to_string(index=False))

# Best strategy
best = comparison.iloc[0]
baseline_best = comparison[comparison['Strategy'].str.contains('Baseline')]['Test_RMSE'].min()

print(f"\n🏆 BEST STRATEGY: {best['Strategy']}")
print(f"   {best['Description']}")
print(f"   Test RMSE: {best['Test_RMSE']:.4f}")
print(f"   Test R²: {best['Test_R²']:.4f}")
print(f"\n📈 Improvement over best baseline: {(baseline_best - best['Test_RMSE'])/baseline_best*100:.2f}%")

## Visualize Comparison

In [None]:
# Bar chart comparison
fig, ax = plt.subplots(figsize=(12, 6))

colors = ['green' if 'Baseline' not in s else 'gray' for s in comparison['Strategy']]
bars = ax.barh(comparison['Strategy'], comparison['Test_RMSE'], color=colors, alpha=0.7)

ax.set_xlabel('Test RMSE (lower is better)', fontsize=11)
ax.set_title('Hybrid Model Strategy Comparison', fontsize=13, fontweight='bold')
ax.grid(True, alpha=0.3, axis='x')
ax.invert_yaxis()

# Add value labels
for i, bar in enumerate(bars):
    width = bar.get_width()
    ax.text(width, bar.get_y() + bar.get_height()/2, 
            f'{width:.4f}', ha='left', va='center', fontsize=9)

plt.tight_layout()
plt.show()

---

# Key Takeaways

## Strategy Selection Guide

| Strategy | When to Use | Example Combo | Strength |
|----------|-------------|---------------|----------|
| **Residual** | Boost simple model | Linear + RF/XGBoost | Captures non-linearity |
| **Sequential** | Regime changes | ARIMA → Prophet (COVID) | Handles structural breaks |
| **Weighted** | Ensemble/diversity | Linear + SVM | Variance reduction |
| **Custom Data** | Distribution shift | Linear (all) + RF (recent) | Adaptive learning |

---

## Detailed Strategy Recommendations

### Use Residual When:
- ✅ Model 1 captures main pattern but misses details
- ✅ Want to boost linear/simple model with ML
- ✅ Non-linear residual patterns exist
- ✅ Examples: ARIMA + XGBoost, Linear + Random Forest

### Use Sequential When:
- ✅ Clear regime change or structural break
- ✅ Different models work in different periods
- ✅ Policy changes, crises, market shifts
- ✅ Examples: Pre/post COVID, Pre/post regulation

### Use Weighted When:
- ✅ Want model diversity (ensemble benefits)
- ✅ Both models have similar performance
- ✅ Reduce variance through averaging
- ✅ Weights can be optimized via CV

### Use Custom Data When:
- ✅ Data distribution shifts over time
- ✅ Recent data more relevant (concept drift)
- ✅ Want adaptive learning
- ✅ Different training strategies for different models

---

## Best Practices

### Model Selection
1. **Model 1**: Usually simpler (linear, ARIMA)
2. **Model 2**: Usually more complex (RF, XGBoost, SVM)
3. **Complementary**: Choose models with different strengths

### Validation
1. **Always cross-validate** hybrid models
2. **Compare with standalone** models (baselines)
3. **Check for overfitting** (hybrid can overfit more easily)
4. **Test on holdout** set before production
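
For time series, cross-validation folds should keep training rows strictly before test rows. The generator below is one way to build such expanding-window folds by index; `expanding_window_splits` is a hypothetical helper written for this sketch, not a function from `py_rsample`.

```python
def expanding_window_splits(n_rows, n_folds, test_size):
    """Yield (train_indices, test_indices) with training always before testing."""
    first_test_start = n_rows - n_folds * test_size
    if first_test_start <= 0:
        raise ValueError("not enough rows for the requested folds")
    for k in range(n_folds):
        start = first_test_start + k * test_size
        # The training window grows by test_size rows with every fold
        yield range(0, start), range(start, start + test_size)


folds = list(expanding_window_splits(n_rows=100, n_folds=3, test_size=10))
for train_idx, test_idx in folds:
    print(f"train rows 0-{train_idx[-1]}, test rows {test_idx[0]}-{test_idx[-1]}")
```

Each fold's indices can be passed to `DataFrame.iloc` to fit the hybrid and its standalone baselines on the same rows, so the comparison stays fair.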

### Weight Optimization (Weighted Strategy)
```python
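# Grid search over weights (sketch: `workflow` and `cv_folds` must be defined beforehand)
# `tune` is assumed to be exported by py_tune alongside grid_regular and tune_grid
from py_tune import grid_regular, tune, tune_grid

spec = hybrid_model(
    model1=linear_reg(),
    model2=svm_rbf().set_mode('regression'),
    strategy='weighted',
    weight1=tune('weight1'),
    weight2=tune('weight2')
)

# Pairs in this grid need not sum to 1 (e.g. 0.7 and 0.7); restrict the grid if that matters
grid = grid_regular({'weight1': [0.3, 0.5, 0.7], 'weight2': [0.3, 0.5, 0.7]})
results = tune_grid(workflow, cv_folds, grid)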
```
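
When the two components' validation predictions are already available, the weight search can also be done by hand. The sketch below ties `weight2` to `1 - weight1` and picks the weight with the lowest validation RMSE; the numbers are toy values and `best_weight` is a hypothetical helper, independent of `py_tune`.

```python
import math


def rmse(actual, pred):
    return math.sqrt(sum((a - p) ** 2 for a, p in zip(actual, pred)) / len(actual))


def best_weight(actual, pred1, pred2, grid):
    """Return (rmse, weight1) for the best blend, with weight2 = 1 - weight1."""
    scored = []
    for w1 in grid:
        blended = [w1 * a + (1 - w1) * b for a, b in zip(pred1, pred2)]
        scored.append((rmse(actual, blended), w1))
    return min(scored)


actual = [100.0, 102.0, 104.0, 106.0]
pred1 = [98.0, 100.0, 102.0, 104.0]    # toy Model 1: under-predicts by 2
pred2 = [104.0, 106.0, 108.0, 110.0]   # toy Model 2: over-predicts by 4
grid = [i / 10 for i in range(11)]     # candidate weight1 values 0.0, 0.1, ..., 1.0

best_rmse, best_w1 = best_weight(actual, pred1, pred2, grid)
print(f"best weight1 = {best_w1}, weight2 = {1 - best_w1:.1f}, RMSE = {best_rmse:.3f}")
```

Because the two toy models err in opposite directions, the best blend sits between them; the weights should be chosen on validation folds, not on the final test set.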

### Split Point Selection (Sequential Strategy)
- **Date string**: `'2020-03-01'` (most common)
- **Integer index**: `1500` (row index)
- **Float proportion**: `0.7` (70% mark)
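
One way to think about the three forms is that each resolves to a row index where Model 2 takes over. The helper below is a hypothetical illustration of that idea; how `hybrid_model()` actually interprets each type is not confirmed here, and the sketch assumes the dates are sorted ascending.

```python
from datetime import date, timedelta


def resolve_split_index(dates, split_point):
    """Map a split point to the index of the first row handled by Model 2."""
    n = len(dates)
    if isinstance(split_point, str):       # date string, e.g. '2020-03-01'
        cutoff = date.fromisoformat(split_point)
        return sum(1 for d in dates if d < cutoff)
    if isinstance(split_point, float):     # proportion of rows, e.g. 0.7
        if not 0.0 < split_point < 1.0:
            raise ValueError("float split_point should be between 0 and 1")
        return int(n * split_point)
    if isinstance(split_point, int):       # row index, e.g. 1500
        return split_point
    raise TypeError("split_point should be a str, int, or float")


# Ten consecutive days starting 2020-02-25 (2020 is a leap year)
dates = [date(2020, 2, 25) + timedelta(days=i) for i in range(10)]

idx_from_date = resolve_split_index(dates, "2020-03-01")
idx_from_int = resolve_split_index(dates, 5)
idx_from_float = resolve_split_index(dates, 0.5)
print(idx_from_date, idx_from_int, idx_from_float)  # 5 5 5
```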

---

## Common Pitfalls

### ❌ Using Same Model Type Twice
```python
# Bad: two linear models (no benefit)
hybrid_model(model1=linear_reg(), model2=linear_reg())

# Good: complementary models
hybrid_model(model1=linear_reg(), model2=rand_forest().set_mode('regression'))
```

### ❌ Not Setting Mode for sklearn Models
```python
# Wrong
model2=rand_forest()  # mode='unknown'

# Correct
model2=rand_forest().set_mode('regression')
```

### ❌ Ignoring Baseline Comparison
- Always compare hybrid with standalone models
- Hybrid should beat both components
- If not, use simpler standalone model

### ❌ Overfitting with Residual Strategy
- Model 2 can overfit to noise in residuals
- Constrain Model 2 (regularization penalties, or a larger `min_n` for tree models)
- Validate with CV

---

## Production Considerations

### Training Time
- Training time is roughly the sum of both component models' training times (often quoted as ~2×)
- Custom Data can be slower (different data prep)
- Consider computational budget

### Prediction Latency
- Residual/Custom: 2× prediction time
- Sequential: 1× (only one model predicts)
- Weighted: 2× (both models predict)

### Monitoring
- Track both component models separately
- Monitor blend/combination mechanism
- Alert if components diverge significantly
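
A divergence alert can be as simple as comparing the two components' predictions on each scoring batch. The sketch below flags a batch when the mean absolute gap exceeds a threshold; `divergence_alert`, the threshold value, and the sample numbers are all illustrative assumptions.

```python
def divergence_alert(pred1, pred2, threshold):
    """Return (alert, gap), where gap is the mean absolute difference between components."""
    gap = sum(abs(a - b) for a, b in zip(pred1, pred2)) / len(pred1)
    return gap > threshold, gap


model1_batch = [400.0, 405.0, 410.0]   # toy Model 1 predictions
model2_batch = [402.0, 404.0, 450.0]   # toy Model 2 predictions, last one far off

alert, gap = divergence_alert(model1_batch, model2_batch, threshold=10.0)
print(f"alert={alert}, mean gap={gap:.2f}")
```

A sensible threshold depends on the price scale and on how far apart the components were during validation, so it is worth calibrating on historical batches before alerting on it.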

### Retraining
- Sequential: May need to retrain both if regime persists
- Custom Data: Retrain Model 2 more frequently (recent data)
- Residual: Retrain both together

---

# References

- **Hybrid Model Documentation**: `_md/ISSUE_7_HYBRID_MODEL_SUMMARY.md`
- **Time Series Hybrids**: Example 20 (arima_boost, prophet_boost)
- **Model Comparison**: Example 11 (WorkflowSet comparison)
- **CLAUDE.md**: Complete architecture documentation