# Example 37: Advanced sklearn Regression Models

**Feature**: decision_tree(), nearest_neighbor(), svm_rbf(), svm_linear(), mlp()

## Overview

This notebook demonstrates **5 advanced sklearn-based regression models** available in py-tidymodels:

1. **decision_tree()**: Single decision tree for regression
2. **nearest_neighbor()**: k-Nearest Neighbors (k-NN)
3. **svm_rbf()**: Support Vector Machine with RBF kernel
4. **svm_linear()**: Support Vector Machine with linear kernel
5. **mlp()**: Multi-Layer Perceptron (neural network)

## CRITICAL API Pattern

**All five models require a `.set_mode('regression')` method call**:
```python
# WRONG - Will cause TypeError
model = decision_tree(mode='regression', tree_depth=5)

# CORRECT - Mode set via method chaining
model = decision_tree(tree_depth=5).set_mode('regression')
```

This differs from `rand_forest()` and `boost_tree()`, which accept a `mode` parameter in the constructor.
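
To make the pattern concrete, here is a toy sketch of why one call raises `TypeError` and the other chains cleanly. `ToyModelSpec` and `toy_decision_tree` are hypothetical names invented for this illustration; they are not py_parsnip code, and the real implementation may differ.

```python
from dataclasses import dataclass, replace
from typing import Optional


@dataclass(frozen=True)
class ToyModelSpec:
    """Hypothetical stand-in for a model specification (not py_parsnip code)."""
    tree_depth: Optional[int] = None
    mode: str = "unknown"

    def set_mode(self, mode: str) -> "ToyModelSpec":
        # Return a new spec so the call can be chained after construction.
        if mode not in ("regression", "classification"):
            raise ValueError(f"unsupported mode: {mode!r}")
        return replace(self, mode=mode)


def toy_decision_tree(tree_depth: Optional[int] = None) -> ToyModelSpec:
    # The factory has no `mode` keyword, so passing one raises TypeError.
    return ToyModelSpec(tree_depth=tree_depth)


spec = toy_decision_tree(tree_depth=5).set_mode("regression")
print(spec)

try:
    toy_decision_tree(mode="regression", tree_depth=5)
except TypeError as err:
    print(f"TypeError: {err}")
```

The error comes from Python itself: a function that does not declare a `mode` keyword rejects it before any model code runs.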

## When to Use Each Model

**Decision Tree**:
- ✅ Highly interpretable (visualize splits)
- ✅ Handles non-linear relationships
- ✅ No feature scaling needed
- ❌ Prone to overfitting (use pruning)
- ❌ High variance (small data changes → different tree)

**k-Nearest Neighbors**:
- ✅ No training phase (lazy learning)
- ✅ Non-parametric (no assumed functional form)
- ✅ Good for local patterns
- ❌ Slow prediction with large datasets
- ❌ Requires feature scaling

**SVM (RBF kernel)**:
- ✅ Excellent for non-linear patterns
- ✅ Fairly robust to outliers in the target (ε-insensitive loss grows linearly)
- ✅ Good with high-dimensional data
- ❌ Slow training on large datasets (>10K rows)
- ❌ Requires careful hyperparameter tuning

**SVM (Linear kernel)**:
- ✅ Faster than RBF kernel
- ✅ Good for approximately linear relationships
- ✅ Scales better to large datasets
- ❌ Limited to linear relationships
- ❌ Still slower than linear regression

**MLP (Neural Network)**:
- ✅ Universal function approximator
- ✅ Handles very complex non-linear patterns
- ✅ Feature interactions automatically learned
- ❌ Requires large datasets (>1000 rows)
- ❌ Black box (hard to interpret)
- ❌ Sensitive to hyperparameters

## Dataset

**Refinery Margins** (Germany):
- Monthly refinery margins 2006-2024
- Crude oil prices as predictors
- Non-linear relationships between variables

In [None]:
# Setup
import pandas as pd
import numpy as np
import time

# py-tidymodels imports
from py_parsnip import (
    decision_tree, nearest_neighbor, svm_rbf, svm_linear, mlp,
    linear_reg, rand_forest
)
from py_rsample import initial_time_split
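from py_yardstick import rmse, mae, r_squared, metric_set
from py_workflows import Workflow
from py_workflowsets import WorkflowSet
from py_recipes import recipe
# all_numeric_predictors() is used in the recipes below but was never imported;
# this import path is an assumption, adjust it to your installed version
from py_recipes.selectors import all_numeric_predictors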

import warnings
warnings.filterwarnings('ignore')

print("✓ Imports complete")

## 1. Load and Prepare Data

In [None]:
# Load refinery margins data
df = pd.read_csv('../_md/__data/refinery_margins.csv')
df['date'] = pd.to_datetime(df['date'])

# Filter to Germany
germany = df[df['country'] == 'Germany'].copy()

# Select columns
germany = germany[[
    'date', 'brent', 'dubai', 'wti', 'brent_cracking_nw_europe'
]].rename(columns={'brent_cracking_nw_europe': 'margin'})

germany = germany.dropna().sort_values('date').reset_index(drop=True)

print("Germany refinery margin data:")
print(f"  Records: {len(germany):,} months")
print(f"  Date range: {germany['date'].min()} to {germany['date'].max()}")
print(f"  Margin mean: ${germany['margin'].mean():.2f}/bbl")
print("\nFirst few rows:")
print(germany.head())

In [None]:
# Train/test split
split = initial_time_split(germany, date_column='date', prop=0.85)
train = split.training()
test = split.testing()

print(f"Train: {len(train)} months ({train['date'].min()} to {train['date'].max()})")
print(f"Test:  {len(test)} months ({test['date'].min()} to {test['date'].max()})")
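
A chronological split keeps every test month after every training month, so the model is never evaluated on data older than what it was trained on. The sketch below shows the idea in plain Python. `time_split` is a hypothetical helper, and the `int(n * prop)` cut point is an assumption; `initial_time_split` may round differently.

```python
from datetime import date


def time_split(rows, prop, key):
    """Sort rows chronologically and cut once: the earliest share is training."""
    ordered = sorted(rows, key=key)
    cut = int(len(ordered) * prop)
    return ordered[:cut], ordered[cut:]


# 20 made-up monthly records: (first day of month, value)
rows = [(date(2020 + m // 12, m % 12 + 1, 1), float(m)) for m in range(20)]
train_rows, test_rows = time_split(rows, prop=0.85, key=lambda r: r[0])

print(len(train_rows), len(test_rows))
print(train_rows[-1][0], "<", test_rows[0][0])
```

A random split would instead mix future months into training, which tends to flatter the test metrics on time-ordered data.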

## 2. Decision Tree

Single decision tree with pruning parameters.

In [None]:
# Decision tree with pruning
spec_tree = decision_tree(
    tree_depth=5,       # Max depth
    min_n=10,           # Min samples in a node (parsnip semantics: needed to split further)
    cost_complexity=0.01  # Pruning parameter
).set_mode('regression')  # CRITICAL: set_mode() method

fit_tree = spec_tree.fit(train, 'margin ~ brent + dubai + wti')
eval_tree = fit_tree.evaluate(test)
_, _, stats_tree = eval_tree.extract_outputs()

test_stats_tree = stats_tree[stats_tree['split'] == 'test'].iloc[0]
print("Decision Tree:")
print(f"  Test RMSE: ${test_stats_tree['rmse']:.3f}/bbl")
print(f"  Test MAE: ${test_stats_tree['mae']:.3f}/bbl")
print(f"  Test R²: {test_stats_tree['r_squared']:.4f}")
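
For intuition, a regression tree grows by repeatedly choosing the threshold that most reduces the squared error of the two resulting groups. The sketch below finds one such split on a single feature. It is a simplified illustration with made-up numbers, not the engine's code; `best_split` and its `min_leaf` argument are hypothetical names.

```python
from statistics import fmean


def sse(values):
    """Sum of squared errors around the mean (0.0 for an empty list)."""
    if not values:
        return 0.0
    mean = fmean(values)
    return sum((v - mean) ** 2 for v in values)


def best_split(xs, ys, min_leaf=1):
    """Return (threshold, total_sse) for the best single split on one feature."""
    pairs = sorted(zip(xs, ys))
    best = (None, sse(ys))  # no split: one leaf predicting the overall mean
    for i in range(min_leaf, len(pairs) - min_leaf + 1):
        if pairs[i - 1][0] == pairs[i][0]:
            continue  # cannot split between identical feature values
        threshold = (pairs[i - 1][0] + pairs[i][0]) / 2
        left = [y for _, y in pairs[:i]]
        right = [y for _, y in pairs[i:]]
        total = sse(left) + sse(right)
        if total < best[1]:
            best = (threshold, total)
    return best


# Made-up numbers for illustration only (not the refinery data)
price = [40, 45, 50, 80, 85, 90]
margin = [2.0, 2.2, 2.1, 6.0, 6.3, 5.9]
threshold, total_sse = best_split(price, margin)
print(threshold, round(total_sse, 4))
```

In this picture, `tree_depth` caps how many times the search is repeated down one branch, while `min_n` and `cost_complexity` stop or undo splits that fit too few points or improve the error too little.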

## 3. k-Nearest Neighbors

Distance-based learning. **Requires feature scaling!**

In [None]:
# k-NN with normalization (REQUIRED for distance-based methods)
rec_knn = recipe().step_normalize(all_numeric_predictors())

spec_knn = nearest_neighbor(
    neighbors=5,        # k=5 neighbors
    weight_func='uniform'  # or 'distance' for weighted
).set_mode('regression')

wf_knn = Workflow().add_recipe(rec_knn).add_model(spec_knn)
fit_knn = wf_knn.fit(train)
eval_knn = fit_knn.evaluate(test)
_, _, stats_knn = eval_knn.extract_outputs()

test_stats_knn = stats_knn[stats_knn['split'] == 'test'].iloc[0]
print("k-Nearest Neighbors (k=5):")
print(f"  Test RMSE: ${test_stats_knn['rmse']:.3f}/bbl")
print(f"  Test MAE: ${test_stats_knn['mae']:.3f}/bbl")
print(f"  Test R²: {test_stats_knn['r_squared']:.4f}")
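
The reason scaling matters for k-NN is that Euclidean distance is dominated by whichever feature has the largest numeric range. The sketch below uses made-up points and assumes `step_normalize()` standardizes each column to mean 0 and standard deviation 1; `zscore_columns` and `nearest_index` are hypothetical helpers written for this illustration.

```python
import math
from statistics import fmean, pstdev


def zscore_columns(rows):
    """Standardize each column to mean 0 and standard deviation 1."""
    columns = list(zip(*rows))
    means = [fmean(col) for col in columns]
    stds = [pstdev(col) for col in columns]
    return [
        tuple((value - m) / s for value, m, s in zip(row, means, stds))
        for row in rows
    ]


def nearest_index(rows, query_index):
    """Index of the row closest (Euclidean) to rows[query_index]."""
    query = rows[query_index]
    candidates = [i for i in range(len(rows)) if i != query_index]
    return min(candidates, key=lambda i: math.dist(rows[i], query))


# Made-up points: (price in USD/bbl, a ratio between 0 and 1)
points = [(80.0, 0.10), (81.0, 0.90), (86.0, 0.12), (60.0, 0.50)]

raw_neighbor = nearest_index(points, 0)
scaled_neighbor = nearest_index(zscore_columns(points), 0)
print(raw_neighbor, scaled_neighbor)
```

On the raw values the price column decides the neighbor almost alone; after standardization the ratio column carries comparable weight and a different point becomes the nearest one.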

## 4. SVM with RBF Kernel

Non-linear SVM. **Requires feature scaling!**

In [None]:
# SVM RBF with normalization
rec_svm = recipe().step_normalize(all_numeric_predictors())

spec_svm_rbf = svm_rbf(
    cost=1.0,           # Regularization (C parameter)
    rbf_sigma=0.1       # Kernel width (gamma)
).set_mode('regression')

wf_svm_rbf = Workflow().add_recipe(rec_svm).add_model(spec_svm_rbf)

start_time = time.time()
fit_svm_rbf = wf_svm_rbf.fit(train)
svm_rbf_train_time = time.time() - start_time

eval_svm_rbf = fit_svm_rbf.evaluate(test)
_, _, stats_svm_rbf = eval_svm_rbf.extract_outputs()

test_stats_svm_rbf = stats_svm_rbf[stats_svm_rbf['split'] == 'test'].iloc[0]
print("SVM (RBF kernel):")
print(f"  Training time: {svm_rbf_train_time:.3f} seconds")
print(f"  Test RMSE: ${test_stats_svm_rbf['rmse']:.3f}/bbl")
print(f"  Test MAE: ${test_stats_svm_rbf['mae']:.3f}/bbl")
print(f"  Test R²: {test_stats_svm_rbf['r_squared']:.4f}")
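
To see what the two hyperparameters control, the sketch below writes out the standard RBF kernel and the ε-insensitive loss used in support vector regression. It assumes, as the comment in the cell above states, that `rbf_sigma` plays the role of gamma; the function names and the `epsilon=0.1` default are hypothetical choices for this illustration.

```python
import math


def rbf_kernel(x, z, gamma):
    """k(x, z) = exp(-gamma * ||x - z||^2)."""
    squared_distance = sum((a - b) ** 2 for a, b in zip(x, z))
    return math.exp(-gamma * squared_distance)


def epsilon_insensitive_loss(y_true, y_pred, epsilon=0.1):
    """SVR loss: residuals smaller than epsilon cost nothing."""
    return max(0.0, abs(y_true - y_pred) - epsilon)


x = (0.0, 0.0)
near = (0.5, 0.5)
far = (3.0, 3.0)

# Larger gamma makes similarity fall off faster with distance
for gamma in (0.01, 0.1, 1.0):
    similarity_near = rbf_kernel(x, near, gamma)
    similarity_far = rbf_kernel(x, far, gamma)
    print(gamma, round(similarity_near, 4), round(similarity_far, 4))
```

A larger gamma means a narrower kernel, so each training point influences only its close neighbors and the fit becomes more flexible. `cost` then sets how heavily residuals outside the ε band are penalized.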

## 5. SVM with Linear Kernel

Linear SVM (faster than RBF). **Requires feature scaling!**

In [None]:
# SVM Linear with normalization (reuses rec_svm from the RBF cell above)
spec_svm_linear = svm_linear(
    cost=1.0            # Regularization (C parameter)
).set_mode('regression')

wf_svm_linear = Workflow().add_recipe(rec_svm).add_model(spec_svm_linear)

start_time = time.time()
fit_svm_linear = wf_svm_linear.fit(train)
svm_linear_train_time = time.time() - start_time

eval_svm_linear = fit_svm_linear.evaluate(test)
_, _, stats_svm_linear = eval_svm_linear.extract_outputs()

test_stats_svm_linear = stats_svm_linear[stats_svm_linear['split'] == 'test'].iloc[0]
print("SVM (Linear kernel):")
print(f"  Training time: {svm_linear_train_time:.3f} seconds")
print(f"  Test RMSE: ${test_stats_svm_linear['rmse']:.3f}/bbl")
print(f"  Test MAE: ${test_stats_svm_linear['mae']:.3f}/bbl")
print(f"  Test R²: {test_stats_svm_linear['r_squared']:.4f}")
# Guard against a zero denominator when the fit finishes below timer resolution
print(f"\nSpeedup vs RBF: {svm_rbf_train_time / max(svm_linear_train_time, 1e-9):.2f}x")

## 6. Multi-Layer Perceptron (Neural Network)

Feed-forward neural network. **Requires feature scaling!**

In [None]:
# MLP with normalization
rec_mlp = recipe().step_normalize(all_numeric_predictors())

spec_mlp = mlp(
    hidden_units=50,    # Neurons in hidden layer
    epochs=200,         # Training iterations
    learn_rate=0.01,    # Learning rate
    activation='relu'   # Activation function
).set_mode('regression')

wf_mlp = Workflow().add_recipe(rec_mlp).add_model(spec_mlp)

start_time = time.time()
fit_mlp = wf_mlp.fit(train)
mlp_train_time = time.time() - start_time

eval_mlp = fit_mlp.evaluate(test)
_, _, stats_mlp = eval_mlp.extract_outputs()

test_stats_mlp = stats_mlp[stats_mlp['split'] == 'test'].iloc[0]
print("Multi-Layer Perceptron (Neural Network):")
print(f"  Training time: {mlp_train_time:.3f} seconds")
print(f"  Test RMSE: ${test_stats_mlp['rmse']:.3f}/bbl")
print(f"  Test MAE: ${test_stats_mlp['mae']:.3f}/bbl")
print(f"  Test R²: {test_stats_mlp['r_squared']:.4f}")
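
For intuition about what `hidden_units` and `activation` mean, the sketch below runs the forward pass of a network with one hidden layer in plain Python. The weights are hand-picked for illustration (a real fit learns them over the training epochs), and the function names are hypothetical.

```python
def relu(values):
    """Rectified linear unit applied element-wise."""
    return [max(0.0, v) for v in values]


def dense(inputs, weights, biases):
    """One fully connected layer: weights[j] holds the weights of output unit j."""
    return [
        sum(w * x for w, x in zip(unit_weights, inputs)) + b
        for unit_weights, b in zip(weights, biases)
    ]


def mlp_forward(inputs, hidden_w, hidden_b, out_w, out_b):
    """Hidden layer with ReLU, then a single linear output for regression."""
    hidden = relu(dense(inputs, hidden_w, hidden_b))
    return dense(hidden, [out_w], [out_b])[0]


# Hand-picked weights for illustration; a real fit learns them from data
hidden_w = [[1.0, -1.0], [-1.0, 1.0]]  # two hidden units
hidden_b = [0.0, 0.0]
out_w = [1.0, 1.0]
out_b = 0.5

print(mlp_forward([2.0, 0.5], hidden_w, hidden_b, out_w, out_b))
```

With these weights the network computes `abs(x1 - x2) + 0.5`, a non-linear function that no single linear model can represent. More hidden units let the network combine more such pieces.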

## 7. Comprehensive Comparison

Compare all 5 sklearn models side by side. Baselines are added in Section 9.

In [None]:
# Compile results
comparison = pd.DataFrame([
    {
        'model': 'Decision Tree',
        'type': 'sklearn',
        'rmse': test_stats_tree['rmse'],
        'mae': test_stats_tree['mae'],
        'r_squared': test_stats_tree['r_squared']
    },
    {
        'model': 'k-NN (k=5)',
        'type': 'sklearn',
        'rmse': test_stats_knn['rmse'],
        'mae': test_stats_knn['mae'],
        'r_squared': test_stats_knn['r_squared']
    },
    {
        'model': 'SVM (RBF)',
        'type': 'sklearn',
        'rmse': test_stats_svm_rbf['rmse'],
        'mae': test_stats_svm_rbf['mae'],
        'r_squared': test_stats_svm_rbf['r_squared']
    },
    {
        'model': 'SVM (Linear)',
        'type': 'sklearn',
        'rmse': test_stats_svm_linear['rmse'],
        'mae': test_stats_svm_linear['mae'],
        'r_squared': test_stats_svm_linear['r_squared']
    },
    {
        'model': 'MLP',
        'type': 'sklearn',
        'rmse': test_stats_mlp['rmse'],
        'mae': test_stats_mlp['mae'],
        'r_squared': test_stats_mlp['r_squared']
    }
])

comparison = comparison.sort_values('rmse')

print("\nAdvanced sklearn Models Comparison:")
print("="*80)
print(comparison.to_string(index=False))
print("="*80)
print(f"\nBest model: {comparison.iloc[0]['model']}")
print(f"  RMSE: ${comparison.iloc[0]['rmse']:.3f}/bbl")
print(f"  R²: {comparison.iloc[0]['r_squared']:.4f}")
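
The table above ranks models by RMSE and also reports MAE and R². As a reference for reading those columns, the sketch below computes the three metrics by hand on made-up numbers. The function names are hypothetical, and the R² shown is the common `1 - SS_res / SS_tot` form, which is an assumption: `py_yardstick` may define `r_squared` differently (for example as a squared correlation).

```python
import math
from statistics import fmean


def rmse_score(actual, predicted):
    """Root mean squared error: penalizes large misses more heavily."""
    return math.sqrt(fmean([(a - p) ** 2 for a, p in zip(actual, predicted)]))


def mae_score(actual, predicted):
    """Mean absolute error: average miss in the target's own units."""
    return fmean([abs(a - p) for a, p in zip(actual, predicted)])


def r_squared_score(actual, predicted):
    """1 - SS_res / SS_tot; can be negative on a test set."""
    mean_actual = fmean(actual)
    ss_res = sum((a - p) ** 2 for a, p in zip(actual, predicted))
    ss_tot = sum((a - mean_actual) ** 2 for a in actual)
    return 1.0 - ss_res / ss_tot


# Made-up values for illustration only
actual = [3.0, 5.0, 4.0, 6.0]
predicted = [2.5, 5.5, 4.0, 7.0]

print(round(rmse_score(actual, predicted), 4))
print(round(mae_score(actual, predicted), 4))
print(round(r_squared_score(actual, predicted), 4))
```

RMSE is never smaller than MAE; a wide gap between the two suggests a few large errors rather than many small ones.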

## 8. WorkflowSet Comparison

Use WorkflowSet to systematically compare all models.

In [None]:
# Create workflows for all sklearn models
rec_norm = recipe().step_normalize(all_numeric_predictors())

sklearn_models = [
    ('decision_tree', decision_tree(tree_depth=5, min_n=10).set_mode('regression')),
    ('knn', nearest_neighbor(neighbors=5).set_mode('regression')),
    ('svm_rbf', svm_rbf(cost=1.0, rbf_sigma=0.1).set_mode('regression')),
    ('svm_linear', svm_linear(cost=1.0).set_mode('regression')),
    ('mlp', mlp(hidden_units=50, epochs=200, learn_rate=0.01).set_mode('regression'))
]

workflows = []
for name, spec in sklearn_models:
    # Decision tree doesn't need normalization, others do
    if name == 'decision_tree':
        wf = Workflow().add_formula('margin ~ brent + dubai + wti').add_model(spec)
    else:
        wf = Workflow().add_recipe(rec_norm).add_model(spec)
    workflows.append(wf)

wf_set = WorkflowSet.from_workflows(workflows)

print(f"Created WorkflowSet with {len(workflows)} sklearn models")
print(f"Models: {list(wf_set.workflows.keys())}")

In [None]:
# Fit all workflows
wf_results = []
for wf_id, wf in wf_set.workflows.items():
    try:
        start_time = time.time()
        fit = wf.fit(train)
        train_time = time.time() - start_time
        
        eval_fit = fit.evaluate(test)
        _, _, stats = eval_fit.extract_outputs()
        
        test_stats = stats[stats['split'] == 'test'].iloc[0]
        wf_results.append({
            'workflow': wf_id,
            'train_time_sec': train_time,
            'rmse': test_stats['rmse'],
            'mae': test_stats['mae'],
            'r_squared': test_stats['r_squared']
        })
    except Exception as e:
        print(f"Warning: {wf_id} failed - {str(e)[:80]}")

wf_comparison = pd.DataFrame(wf_results)
wf_comparison = wf_comparison.sort_values('rmse')

print("\nWorkflowSet Results:")
print("="*90)
print(wf_comparison.to_string(index=False))
print("="*90)

## 9. Compare with Baseline Models

How do advanced sklearn models compare to simpler baselines?

In [None]:
# Add baseline models
baseline_models = [
    ('linear_reg', linear_reg()),
    ('random_forest', rand_forest(trees=100).set_mode('regression'))
]

all_results = wf_results.copy()

for name, model in baseline_models:
    start_time = time.time()
    fit = model.fit(train, 'margin ~ brent + dubai + wti')
    train_time = time.time() - start_time
    
    eval_fit = fit.evaluate(test)
    _, _, stats = eval_fit.extract_outputs()
    
    test_stats = stats[stats['split'] == 'test'].iloc[0]
    all_results.append({
        'workflow': name,
        'train_time_sec': train_time,
        'rmse': test_stats['rmse'],
        'mae': test_stats['mae'],
        'r_squared': test_stats['r_squared']
    })

final_comparison = pd.DataFrame(all_results)
final_comparison = final_comparison.sort_values('rmse')

print("\nAll Models Comparison (sklearn + Baselines):")
print("="*90)
print(final_comparison.to_string(index=False))
print("="*90)

## 10. Key Takeaways

### API Pattern (CRITICAL)

**Use the `.set_mode('regression')` method**:
```python
# These 5 models require the method call
decision_tree(...).set_mode('regression')
nearest_neighbor(...).set_mode('regression')
svm_rbf(...).set_mode('regression')
svm_linear(...).set_mode('regression')
mlp(...).set_mode('regression')

# Compare to rand_forest/boost_tree, which accept the mode parameter
rand_forest(mode='regression')  # This works
boost_tree(mode='regression')   # This works
```

### Feature Scaling Requirements

**MUST normalize for**:
- k-NN (distance-based)
- SVM (both RBF and linear)
- MLP (neural networks)

**NO normalization needed for**:
- Decision trees (split-based)

```python
# Pattern for models requiring scaling
rec = recipe().step_normalize(all_numeric_predictors())
wf = Workflow().add_recipe(rec).add_model(spec)
```

### Model Selection Guidance

**Choose Decision Tree when**:
- Interpretability is paramount
- Need to visualize decision rules
- Mixed data types (numeric + categorical)
- Baseline for tree ensembles

**Choose k-NN when**:
- Strong local patterns (nearby observations similar)
- Small to medium datasets (<10K rows)
- No training phase needed (lazy learning)
- Continuous updates (just add new points)

**Choose SVM (RBF) when**:
- Complex non-linear relationships
- High-dimensional feature space
- Robustness to outliers is needed
- Dataset size: 100-10K rows (sweet spot)

**Choose SVM (Linear) when**:
- Linear or near-linear relationships
- Need faster SVM alternative
- High-dimensional sparse data
- Larger datasets (10K+ rows)

**Choose MLP when**:
- Very complex non-linear patterns
- Large datasets (>1000 rows)
- Feature interactions important
- Have time for hyperparameter tuning

### Performance Patterns

Typical patterns for this kind of data (verify against the tables printed in Sections 8 and 9):
1. **Accuracy**: Random Forest often best, then SVM RBF
2. **Speed**: Decision tree fastest, SVM RBF slowest
3. **Interpretability**: Decision tree >> k-NN > SVM/MLP
4. **Stability**: SVM/MLP more stable than single decision tree

### Hyperparameter Tuning

**Decision Tree**:
```python
decision_tree(
    tree_depth=5,          # 3-10 typical
    min_n=10,              # 5-50 typical
    cost_complexity=0.01   # 0.001-0.1 (pruning)
)
```

**k-NN**:
```python
nearest_neighbor(
    neighbors=5,           # 3-20 typical (odd k only matters for classification ties)
    weight_func='uniform'  # or 'distance'
)
```

**SVM RBF**:
```python
svm_rbf(
    cost=1.0,              # 0.1-100 (regularization)
    rbf_sigma=0.1          # 0.001-1.0 (kernel width)
)
# Tune cost first, then rbf_sigma
```

**SVM Linear**:
```python
svm_linear(
    cost=1.0               # 0.1-100 (regularization)
)
```

**MLP**:
```python
mlp(
    hidden_units=50,       # 10-200 typical
    epochs=200,            # 100-1000 typical
    learn_rate=0.01,       # 0.001-0.1 typical
    activation='relu'      # relu, tanh, logistic
)
# Tune hidden_units first, then learn_rate
```
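
The ranges in the comments above span several orders of magnitude, so candidate values are usually spaced on a log scale rather than evenly. The sketch below builds such a grid for the SVM parameters in plain Python. `log_grid` is a hypothetical helper; the keyword names `cost` and `rbf_sigma` come from this notebook, and nothing here depends on a particular tuning API.

```python
import math
from itertools import product


def log_grid(low, high, num):
    """`num` values spaced evenly on a log10 scale from low to high inclusive."""
    lo, hi = math.log10(low), math.log10(high)
    step = (hi - lo) / (num - 1)
    return [10 ** (lo + i * step) for i in range(num)]


cost_values = log_grid(0.1, 100, 4)      # roughly 0.1, 1, 10, 100
sigma_values = log_grid(0.001, 1.0, 4)   # roughly 0.001, 0.01, 0.1, 1

# Every combination of the two parameters, one dict per candidate
grid = [
    {"cost": c, "rbf_sigma": s}
    for c, s in product(cost_values, sigma_values)
]
print(len(grid))
```

Each dict could then be unpacked into a spec, for example `svm_rbf(**params).set_mode('regression')`, and scored on a validation set.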

### Production Deployment

```python
# Production pattern
from py_parsnip import svm_rbf
from py_workflows import Workflow
from py_recipes import recipe

# Preprocessing + SVM
rec = recipe().step_normalize(all_numeric_predictors())
spec = svm_rbf(cost=10.0, rbf_sigma=0.1).set_mode('regression')
wf = Workflow().add_recipe(rec).add_model(spec)

# Fit on all training data
final_fit = wf.fit(all_training_data)

# Predict
predictions = final_fit.predict(new_data)
```

### Common Pitfalls

1. **Passing `mode=` to the constructor instead of calling `.set_mode('regression')`**: TypeError
   - Solution: Always call `.set_mode('regression')` after model creation

2. **Not normalizing for k-NN/SVM/MLP**: Poor performance
   - Solution: Use `step_normalize()` in recipe

3. **Deep decision trees**: Severe overfitting
   - Solution: Limit `tree_depth` to 3-10, use `cost_complexity`

4. **Too few neighbors in k-NN**: Noisy predictions
   - Solution: Use k=5 or higher, cross-validate to find optimal k

5. **SVM on large datasets**: Extremely slow
   - Solution: Sample data, or switch to `svm_linear`, `linear_reg`, or `rand_forest`

6. **MLP on small datasets**: Overfitting
   - Solution: Reduce `hidden_units`, increase regularization, or use simpler model
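
When cross-validating on time-ordered data such as monthly margins (pitfall 4 suggests this for choosing k), one common option is expanding-window folds, where each validation block comes strictly after its training rows. The sketch below only generates row indices. `expanding_window_folds` and its arguments are hypothetical, and this is one possible scheme, not necessarily what py-tidymodels provides.

```python
def expanding_window_folds(n_rows, n_folds, min_train):
    """Yield (train_indices, validation_indices) with training always earlier."""
    fold_size = (n_rows - min_train) // n_folds
    if fold_size < 1:
        raise ValueError("not enough rows for the requested folds")
    for fold in range(n_folds):
        train_end = min_train + fold * fold_size
        valid_end = train_end + fold_size
        yield range(0, train_end), range(train_end, valid_end)


folds = list(expanding_window_folds(n_rows=24, n_folds=3, min_train=12))
for train_idx, valid_idx in folds:
    print(f"train 0-{train_idx[-1]}, validate {valid_idx[0]}-{valid_idx[-1]}")
```

Averaging a metric such as RMSE over these folds for each candidate k gives a selection criterion that never uses future months to predict past ones.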

### Comparison to Other Models

**vs Random Forest**:
- Single decision tree: More interpretable, less accurate
- Random Forest: Ensemble of trees, much more accurate

**vs Gradient Boosting**:
- SVM/MLP: Good for small-medium datasets
- XGBoost/LightGBM: Better for large datasets, tabular data

**vs Linear Regression**:
- Linear models: Fastest, most interpretable
- sklearn models: More flexible, handle non-linearity

## Summary

This notebook demonstrated:

✅ **decision_tree()**: Interpretable single tree with pruning  
✅ **nearest_neighbor()**: k-NN for local patterns  
✅ **svm_rbf()**: Non-linear SVM with RBF kernel  
✅ **svm_linear()**: Linear SVM (faster alternative)  
✅ **mlp()**: Neural network for complex patterns  
✅ **Proper `.set_mode('regression')` usage** (CRITICAL!)  
✅ **Feature normalization** for distance/gradient-based models  
✅ **WorkflowSet integration** for systematic comparison  
✅ **Benchmarking** vs linear_reg and rand_forest  

**Key Insight**: These 5 sklearn models provide flexibility between interpretability (decision tree), local patterns (k-NN), non-linear capabilities (SVM RBF), speed (SVM linear), and complex pattern learning (MLP). Feature scaling is CRITICAL for k-NN, SVM, and MLP.

**API Reminder**: All 5 models require a `.set_mode('regression')` method call, NOT a `mode` parameter in the constructor.

**Next Steps**:
- Hyperparameter tuning with `tune_grid()`
- Feature engineering with `recipes`
- Ensemble methods combining multiple models