# Feature Selection via Conformal Interval Width

This notebook demonstrates using **WorkflowSet.compare_conformal()** to select optimal features based on uncertainty quantification.

## Key Concept

**Traditional Approach:** Select features based on predictive accuracy (RMSE, R²)

**Conformal Approach:** Select features based on:
1. **Interval Width** - Narrower intervals are more informative, provided coverage holds
2. **Coverage** - Empirical coverage stays at or near the target (e.g., 95%)
3. **Balance** - Good accuracy + tight intervals
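
To make the two criteria concrete, here is a minimal sketch of how average interval width and empirical coverage can be computed from interval bounds. The function name `interval_metrics` and the numbers are made up for illustration; this is not the library's implementation.

```python
def interval_metrics(y_true, lower, upper):
    """Return (average interval width, empirical coverage)."""
    widths = [hi - lo for lo, hi in zip(lower, upper)]
    covered = [lo <= y <= hi for y, lo, hi in zip(y_true, lower, upper)]
    return sum(widths) / len(widths), sum(covered) / len(covered)

# Made-up values, for illustration only
y_true = [10.0, 12.0, 9.0, 15.0]
lower = [8.0, 11.0, 9.5, 13.0]
upper = [12.0, 14.0, 11.5, 16.0]

avg_width, coverage = interval_metrics(y_true, lower, upper)
print(avg_width, coverage)  # 3.0 0.75
```

The third observation falls below its lower bound, so coverage is 3 out of 4. A narrow interval that misses this often would not be a useful one, which is why width and coverage are read together.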

## Key Method (NEW)

```python
# Compare conformal intervals across ALL workflows
comparison = wf_set.compare_conformal(
    data=train_data,
    alpha=0.05,
    method='split'
)
```

Returns a DataFrame sorted by **average interval width** (tightest first).
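
For context, `method='split'` presumably refers to split conformal prediction. The library's internals are not shown in this notebook, so the sketch below is the textbook version with absolute-residual scores, not a description of what `compare_conformal()` does internally. The helper name `split_conformal_halfwidth` and the residuals are hypothetical.

```python
import math

def split_conformal_halfwidth(cal_residuals, alpha=0.05):
    """Half-width q so that [pred - q, pred + q] targets 1 - alpha coverage."""
    scores = sorted(abs(r) for r in cal_residuals)
    n = len(scores)
    # Finite-sample rank: the ceil((n + 1) * (1 - alpha))-th smallest score
    rank = math.ceil((n + 1) * (1 - alpha))
    if rank > n:
        return math.inf  # too few calibration points for this alpha
    return scores[rank - 1]

# Made-up calibration residuals (actual minus predicted), for illustration only
residuals = [0.5, -1.2, 0.3, 2.0, -0.7, 1.1, -0.2, 0.9, -1.5, 0.4,
             0.6, -0.8, 1.3, -0.1, 0.2, -2.4, 1.7, -0.6, 0.8, -1.0]

q = split_conformal_halfwidth(residuals, alpha=0.1)
print(q)  # 2.0
```

With this basic score every row gets the same width, `2 * q`, so ranking by average width amounts to ranking by the size of held-out residuals at the chosen quantile.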

## What We'll Demonstrate

1. Load real-world refinery production data (JODI)
2. Create 6 different feature sets (simple → complex)
3. Test with 2 models (linear_reg, rand_forest)
4. Compare all 12 workflows via conformal intervals
5. Key insight: **More features ≠ tighter intervals**

---

In [None]:
# Imports
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns

from py_parsnip import linear_reg, rand_forest
from py_workflowsets import WorkflowSet

# Set random seed
np.random.seed(42)

# Plotting style
sns.set_style('whitegrid')
plt.rcParams['figure.figsize'] = (14, 6)

---

# 1. Load and Prepare Data

Using JODI (Joint Organisations Data Initiative) refinery production data.

In [None]:
# Load data
refinery = pd.read_csv('../_md/__data/jodi_refinery_production_data.csv')

# Convert date
refinery['date'] = pd.to_datetime(refinery['date'])

# Filter for Refinery Intake only
refinery = refinery[refinery['subcategory'] == 'Refinery Intake'].copy()

# Sort
refinery = refinery.sort_values(['country', 'date']).reset_index(drop=True)

print(f"Dataset shape: {refinery.shape}")
print(f"Date range: {refinery['date'].min()} to {refinery['date'].max()}")
print(f"Countries: {refinery['country'].nunique()}")
print(f"\nColumns: {list(refinery.columns)}")

In [None]:
# Select top 5 countries by average production
avg_production = refinery.groupby('country')['value'].mean().sort_values(ascending=False)

print("Top 10 Refining Countries:")
print(avg_production.head(10))

# Select top 5 for faster demonstration
top_countries = avg_production.head(5).index.tolist()
refinery_subset = refinery[refinery['country'].isin(top_countries)].copy()

print(f"\n✓ Selected countries: {top_countries}")
print(f"✓ Filtered dataset: {refinery_subset.shape}")

---

# 2. Feature Engineering

Create lagged features and rolling statistics.
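
The cell below shifts each series by one step before taking a rolling mean, so a row's moving average never includes that row's own target value. A plain-Python sketch of the same idea on a single made-up series (the helper names `lag` and `trailing_mean` are illustrative, and the notebook itself applies the logic per country with pandas):

```python
def lag(values, k):
    """Value from k steps earlier; None where no history exists."""
    return [values[i - k] if i >= k else None for i in range(len(values))]

def trailing_mean(values, window):
    """Mean of up to `window` values strictly before each position."""
    out = []
    for i in range(len(values)):
        past = values[max(0, i - window):i]  # excludes values[i] itself
        out.append(sum(past) / len(past) if past else None)
    return out

series = [100.0, 110.0, 120.0, 130.0, 140.0]  # made-up monthly values

print(lag(series, 1))            # [None, 100.0, 110.0, 120.0, 130.0]
print(trailing_mean(series, 3))  # [None, 100.0, 105.0, 110.0, 120.0]
```

The leading `None` entries correspond to the missing values that are dropped after feature engineering.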

In [None]:
def create_features(df):
    """Create lagged and rolling features."""
    df = df.copy()
    
    # Lagged production (1, 3, 6, 12 months)
    for lag in [1, 3, 6, 12]:
        df[f'prod_lag_{lag}'] = df.groupby('country')['value'].shift(lag)
    
    # Rolling means (3, 6 months)
    df['prod_ma_3'] = df.groupby('country')['value'].transform(
        lambda x: x.shift(1).rolling(3, min_periods=1).mean()
    )
    df['prod_ma_6'] = df.groupby('country')['value'].transform(
        lambda x: x.shift(1).rolling(6, min_periods=1).mean()
    )
    
    # Date features
    df['month'] = df['date'].dt.month
    df['quarter'] = df['date'].dt.quarter
    
    return df

# Apply feature engineering
refinery_features = create_features(refinery_subset)

# Drop missing values
refinery_clean = refinery_features.dropna().copy()

print(f"Dataset with features: {refinery_clean.shape}")
print(f"\nFeatures created:")
print(f"  Lags: {[c for c in refinery_clean.columns if '_lag_' in c]}")
print(f"  MAs:  {[c for c in refinery_clean.columns if '_ma_' in c]}")
print(f"\nSample:")
print(refinery_clean[['date', 'country', 'value', 'prod_lag_1', 'prod_ma_3']].head())

---

# 3. Train/Test Split

In [None]:
# Use last 12 months for testing
split_date = refinery_clean['date'].max() - pd.DateOffset(months=12)

train_data = refinery_clean[refinery_clean['date'] <= split_date].copy()
test_data = refinery_clean[refinery_clean['date'] > split_date].copy()

print(f"Train: {len(train_data)} samples (up to {train_data['date'].max().date()})")
print(f"Test:  {len(test_data)} samples (from {test_data['date'].min().date()})")
print(f"\nTrain countries: {train_data['country'].nunique()}")
print(f"Test countries:  {test_data['country'].nunique()}")

---

# 4. Define Feature Strategies

Create 6 preprocessing strategies from simple to complex.

In [None]:
# Define formulas with different feature combinations
strategies = [
    # Strategy 1: Minimal (just recent lag)
    "value ~ prod_lag_1",
    
    # Strategy 2: Short-term lags
    "value ~ prod_lag_1 + prod_lag_3",
    
    # Strategy 3: Short + medium term
    "value ~ prod_lag_1 + prod_lag_3 + prod_lag_6",
    
    # Strategy 4: All lags
    "value ~ prod_lag_1 + prod_lag_3 + prod_lag_6 + prod_lag_12",
    
    # Strategy 5: Lags + rolling means
    "value ~ prod_lag_1 + prod_lag_3 + prod_lag_6 + prod_ma_3 + prod_ma_6",
    
    # Strategy 6: Comprehensive (lags + MA + seasonality)
    "value ~ prod_lag_1 + prod_lag_3 + prod_lag_6 + prod_ma_3 + prod_ma_6 + month + quarter"
]

print(f"Number of preprocessing strategies: {len(strategies)}")
print("\nStrategies:")
for i, s in enumerate(strategies, 1):
    # Count features
    n_features = len(s.split(' ~ ')[1].split(' + '))
    print(f"  {i}. ({n_features} features) {s}")

---

# 5. Create WorkflowSet

Cross product: 6 strategies × 2 models = 12 workflows
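
As a rough picture of what a cross product means here, the sketch below pairs every formula with every model using `itertools.product`. The ID scheme `prep_{i}_{model}` is a hypothetical stand-in; the actual IDs are whatever `WorkflowSet.from_cross()` generates and are printed by the next cell.

```python
from itertools import product

formulas = ["value ~ prod_lag_1", "value ~ prod_lag_1 + prod_lag_3"]
models = ["linear_reg", "rand_forest"]  # model names as plain strings

# Hypothetical ID scheme, for illustration only
workflows = {
    f"prep_{i}_{model}": (formula, model)
    for (i, formula), model in product(enumerate(formulas, 1), models)
}

for wf_id in workflows:
    print(wf_id)
print(len(workflows))  # 2 formulas x 2 models = 4
```

With the 6 formulas and 2 models used in this notebook, the same pairing yields 12 workflows.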

In [None]:
# Create WorkflowSet
wf_set = WorkflowSet.from_cross(
    preproc=strategies,
    models=[
        linear_reg(),
        rand_forest(trees=50).set_mode('regression')  # Fewer trees for speed
    ]
)

print(f"✓ Created WorkflowSet with {len(wf_set.workflows)} workflows")
print(f"\nWorkflow IDs:")
for wf_id in wf_set.workflows.keys():
    print(f"  {wf_id}")

---

# 6. Compare Conformal Intervals (NEW METHOD)

## Key Feature: `WorkflowSet.compare_conformal()`

This fits ALL workflows and compares their conformal prediction intervals.
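
The general shape of such a comparison can be sketched with NumPy alone: fit each candidate feature set, calibrate a split-conformal half-width on held-out rows, then record width and coverage. Everything below is an assumption made for illustration (synthetic data, ordinary least squares, a three-way split); it is not the implementation behind `compare_conformal()`.

```python
import numpy as np

rng = np.random.default_rng(0)
n = 400
X = rng.normal(size=(n, 4))                        # four synthetic candidate features
y = 2.0 * X[:, 0] + rng.normal(scale=0.5, size=n)  # only feature 0 carries signal

# Three disjoint blocks: fit the model, calibrate the interval, evaluate it
fit_idx, cal_idx, eval_idx = np.arange(0, 200), np.arange(200, 300), np.arange(300, 400)
feature_sets = {"f0": [0], "f0+f1": [0, 1], "all_four": [0, 1, 2, 3]}
alpha = 0.05

rows = []
for name, cols in feature_sets.items():
    A = np.column_stack([np.ones(n), X[:, cols]])  # intercept + chosen features
    coef, *_ = np.linalg.lstsq(A[fit_idx], y[fit_idx], rcond=None)
    pred = A @ coef

    # Split-conformal half-width from calibration residuals
    scores = np.sort(np.abs(y[cal_idx] - pred[cal_idx]))
    rank = int(np.ceil((len(scores) + 1) * (1 - alpha)))
    q = float(scores[rank - 1])

    covered = np.abs(y[eval_idx] - pred[eval_idx]) <= q
    rows.append({
        "feature_set": name,
        "avg_interval_width": 2 * q,
        "coverage": float(covered.mean()),
    })

# Tightest first, mirroring the ordering described for compare_conformal()
rows.sort(key=lambda r: r["avg_interval_width"])
for r in rows:
    print(r)
```

Coverage is measured on rows that were used neither for fitting nor for calibration, so it is an honest check rather than a restatement of the calibration quantile.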

In [None]:
# Compare conformal intervals across all workflows
print("Comparing conformal intervals across all workflows...")
print("(This may take 1-2 minutes...)\n")

comparison = wf_set.compare_conformal(
    data=train_data,
    alpha=0.05,
    method='split'
)

print("\nConformal Interval Comparison Results:")
print("="*100)
print(comparison.to_string(index=False))
print("\n✓ Results sorted by average interval width (tightest first)")

---

# 7. Visualize Comparison

In [None]:
# Create visualization
fig, axes = plt.subplots(1, 2, figsize=(16, 7))

# Shorten workflow labels: strategy number comes from the ID, not from the row's rank
strategy_nums = comparison['wflow_id'].str.extract(r'prep_(\d+)_')[0]
comparison['short_label'] = 'S' + strategy_nums + '_' + comparison['model']

# Plot 1: Interval width
colors = plt.cm.viridis(np.linspace(0, 1, len(comparison)))
axes[0].barh(comparison['short_label'], comparison['avg_interval_width'], color=colors)
axes[0].set_xlabel('Average Interval Width (lower = better)')
axes[0].set_ylabel('Strategy_Model')
axes[0].set_title('Conformal Interval Width Comparison')
axes[0].grid(True, alpha=0.3, axis='x')

# Annotate best
best_idx = 0
axes[0].axhline(y=best_idx, color='red', linestyle='--', alpha=0.5, linewidth=2)

# Plot 2: Coverage
axes[1].barh(comparison['short_label'], comparison['coverage'], color=colors)
axes[1].axvline(x=0.95, color='red', linestyle='--', linewidth=2, label='Target 95%')
axes[1].set_xlabel('Coverage')
axes[1].set_ylabel('Strategy_Model')
axes[1].set_title('Empirical Coverage')
axes[1].set_xlim([0.85, 1.0])
axes[1].legend()
axes[1].grid(True, alpha=0.3, axis='x')

plt.tight_layout()
plt.show()

print("\nKey Insights:")
print(f"• Best workflow: {comparison.iloc[0]['wflow_id']}")
print(f"  - Tightest intervals: {comparison.iloc[0]['avg_interval_width']:.2f}")
print(f"  - Coverage: {comparison.iloc[0]['coverage']:.1%}")
print(f"\n• Worst workflow: {comparison.iloc[-1]['wflow_id']}")
print(f"  - Widest intervals: {comparison.iloc[-1]['avg_interval_width']:.2f}")
print(f"  - Coverage: {comparison.iloc[-1]['coverage']:.1%}")

---

# 8. Analyze Feature Complexity vs Interval Width

In [None]:
# Extract strategy number and model type
comparison['strategy_num'] = comparison['wflow_id'].str.extract(r'prep_(\d+)_')[0].astype(int)
comparison['model_type'] = comparison['model'].str.replace('_reg', '', regex=False)

# Plot: Strategy complexity vs interval width
fig, ax = plt.subplots(figsize=(12, 6))

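for model in comparison['model_type'].unique():
    # comparison is ordered by width, so sort by strategy to draw each line left to right
    subset = comparison[comparison['model_type'] == model].sort_values('strategy_num')
    ax.plot(subset['strategy_num'], subset['avg_interval_width'], 
           'o-', label=model, markersize=10, linewidth=2)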

ax.set_xlabel('Strategy Number (1=simple, 6=complex)', fontsize=12)
ax.set_ylabel('Average Interval Width', fontsize=12)
ax.set_title('Feature Complexity vs Uncertainty Quantification', fontsize=14)
ax.set_xticks(range(1, 7))
ax.legend(title='Model', fontsize=11)
ax.grid(True, alpha=0.3)

plt.tight_layout()
plt.show()

print("\nInsight: More features ≠ tighter intervals")
print("  • Simple strategies may provide better uncertainty quantification")
print("  • Overfitting inflates held-out residuals, which widens conformal intervals")
print("  • Optimal strategy balances predictive power and overfitting")

---

# 9. Use Best Workflow for Forecasting

In [None]:
# Get best workflow: the first row has the tightest intervals (confirm its coverage below)
best_wf_id = comparison.iloc[0]['wflow_id']
best_workflow = wf_set[best_wf_id]

print(f"Best Workflow: {best_wf_id}")
print(f"  Interval width: {comparison.iloc[0]['avg_interval_width']:.2f}")
print(f"  Coverage: {comparison.iloc[0]['coverage']:.1%}")
print(f"  Model: {comparison.iloc[0]['model']}")

# Fit on full training data
print("\nFitting best workflow on training data...")
best_fit = best_workflow.fit(train_data)

print("✓ Model fitted")

In [None]:
# Get conformal predictions for test data
best_predictions = best_fit.conformal_predict(
    test_data,
    alpha=0.05,
    method='split'
)

print(f"Generated {len(best_predictions)} predictions")
print(f"\nColumns: {list(best_predictions.columns)}")
print(f"\nSample predictions:")
print(best_predictions[['.pred', '.pred_lower', '.pred_upper']].head())

In [None]:
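# Visualize forecasts for first country
first_country = test_data['country'].iloc[0]
country_mask = (test_data['country'] == first_country).to_numpy()
country_test = test_data[country_mask].reset_index(drop=True)
# Select prediction rows by position; assumes predictions keep test_data's row order
country_pred = best_predictions.iloc[country_mask].reset_index(drop=True)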

n_show = min(50, len(country_test))

fig, ax = plt.subplots(figsize=(14, 6))

ax.plot(range(n_show), country_test['value'].values[:n_show],
       'o-', label='Actual', markersize=5, linewidth=1.5)
ax.plot(range(n_show), country_pred['.pred'].values[:n_show],
       'k-', label='Prediction', linewidth=2)
ax.fill_between(
    range(n_show),
    country_pred['.pred_lower'].values[:n_show],
    country_pred['.pred_upper'].values[:n_show],
    alpha=0.3,
    label='95% Conformal Interval'
)

ax.set_title(f"{first_country} - Refinery Production Forecast (Best Workflow)")
ax.set_xlabel('Month (Test Period)')
ax.set_ylabel('Production (kbd)')
ax.legend()
ax.grid(True, alpha=0.3)

plt.tight_layout()
plt.show()

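# Check calibration on the test period instead of asserting it
actual = country_test['value'].values
inside = (actual >= country_pred['.pred_lower'].values) & (actual <= country_pred['.pred_upper'].values)
print(f"\n✓ Test coverage for {first_country}: {inside.mean():.1%} (target 95%)")
print("✓ Workflow selected via conformal comparison (not just RMSE)")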

---

# Summary

## What We Demonstrated

1. ✅ **WorkflowSet.compare_conformal()** - NEW METHOD
   - Fits all workflows and compares conformal intervals
   - Returns DataFrame sorted by interval width
   - Identifies optimal preprocessing for uncertainty quantification

2. ✅ **Feature Selection via Uncertainty**
   - Traditional: Select by RMSE or R²
   - Conformal: Select by interval width + coverage
   - More features ≠ better uncertainty quantification

3. ✅ **Multiple Strategies Comparison**
   - 6 feature strategies (simple → complex)
   - 2 model types (linear, random forest)
   - 12 total workflows evaluated

4. ✅ **Real-World Application**
   - JODI refinery production data
   - 5 countries, monthly observations
   - Practical energy forecasting scenario

## Key Takeaways

**Use conformal intervals for feature selection when:**
- Uncertainty quantification matters (risk management, planning)
- You need calibrated prediction intervals
- Multiple feature sets perform similarly in RMSE
- You want to guard against overfitting (wider intervals on held-out data can signal it)

**Method Advantages:**
- ✅ Data-driven preprocessing selection
- ✅ Systematic comparison across all workflows
- ✅ Balances accuracy and uncertainty
- ✅ One method call evaluates everything

**Practical Workflow:**
```python
# 1. Define multiple strategies
strategies = [formula1, formula2, ...]

# 2. Create WorkflowSet
wf_set = WorkflowSet.from_cross(preproc=strategies, models=[model1, model2])

# 3. Compare via conformal intervals
comparison = wf_set.compare_conformal(data=train_data, alpha=0.05, method='split')

# 4. Select best
best_wf_id = comparison.iloc[0]['wflow_id']
best_wf = wf_set[best_wf_id]

# 5. Fit and use
final_fit = best_wf.fit(train_data)
```
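
Step 4 above takes the first row, which is the tightest workflow regardless of its coverage. One possible refinement is to filter on coverage before picking the narrowest. The sketch below uses made-up results with the column names from this notebook; the `tolerance` value is a hypothetical choice, not a library parameter.

```python
import pandas as pd

# Made-up comparison results, for illustration only
comparison = pd.DataFrame({
    "wflow_id": ["wf_a", "wf_b", "wf_c"],
    "avg_interval_width": [10.0, 12.0, 15.0],
    "coverage": [0.91, 0.95, 0.97],
})

target = 0.95
tolerance = 0.01  # hypothetical slack below the target

# Keep workflows that reach (roughly) the target, then take the narrowest
eligible = comparison[comparison["coverage"] >= target - tolerance]
best = eligible.sort_values("avg_interval_width").iloc[0]

print(best["wflow_id"])  # wf_b
```

Here `wf_a` has the narrowest intervals but under-covers, so `wf_b` is selected. If no workflow passes the filter, `eligible` is empty and `.iloc[0]` raises an `IndexError`, which is a reasonable signal to revisit the candidates.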

---

**Next Steps:**
- See `24e_per_group_conformal.ipynb` for per-group calibration
- See `24h_cv_conformal_integration.ipynb` for CV + conformal dual ranking
- See `examples/22_conformal_prediction_demo.ipynb` for comprehensive overview