# Conformal Prediction Intervals Demo

This notebook demonstrates **conformal prediction intervals** in py-tidymodels - a distribution-free method for uncertainty quantification with finite-sample coverage guarantees.

## What is Conformal Prediction?

Conformal prediction provides prediction intervals that are:
- **Distribution-free**: No parametric assumptions about the data distribution (observations only need to be exchangeable)
- **Finite-sample**: Guarantees hold for any sample size, not just asymptotically
- **Model-agnostic**: Works with any prediction model
- **Calibrated**: Achieves at least the target marginal coverage (e.g., 95% for α=0.05)
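
To make these properties concrete, here is a minimal from-scratch sketch of split conformal prediction using only NumPy. It is not the py-tidymodels implementation: the function name `split_conformal_interval`, the ordinary least-squares model, and the synthetic data are all assumptions made for illustration. The key step is taking the ceil((n + 1)(1 - α))-th smallest absolute residual on a held-out calibration set as the interval half-width.

```python
import numpy as np

def split_conformal_interval(X_train, y_train, X_cal, y_cal, X_new, alpha=0.05):
    """Split conformal intervals around an ordinary least-squares fit (illustrative)."""
    # Fit OLS with an intercept on the training part only.
    A_train = np.column_stack([np.ones(len(X_train)), X_train])
    beta, *_ = np.linalg.lstsq(A_train, y_train, rcond=None)

    def predict(X):
        return np.column_stack([np.ones(len(X)), X]) @ beta

    # Conformity scores: absolute residuals on the held-out calibration part.
    scores = np.sort(np.abs(y_cal - predict(X_cal)))
    n_cal = len(scores)

    # Finite-sample corrected rank: the ceil((n + 1)(1 - alpha))-th smallest score.
    rank = int(np.ceil((n_cal + 1) * (1 - alpha)))
    q_hat = scores[rank - 1] if rank <= n_cal else np.inf

    pred = predict(X_new)
    return pred - q_hat, pred, pred + q_hat

# Synthetic data with the same coefficients as the demo below (assumed setup).
rng = np.random.default_rng(0)
X = rng.normal(size=(1500, 3))
y = 10 + X @ np.array([2.0, 3.0, 1.0]) + rng.normal(scale=0.5, size=1500)

# 500 rows to fit, 500 to calibrate, 500 to evaluate.
lower, pred, upper = split_conformal_interval(
    X[:500], y[:500], X[500:1000], y[500:1000], X[1000:], alpha=0.05
)
coverage = np.mean((y[1000:] >= lower) & (y[1000:] <= upper))
print(f"Held-out coverage: {coverage:.1%}")
```

Nothing in the calibration step depends on the model being linear or the noise being Gaussian, which is what "model-agnostic" and "distribution-free" mean in practice.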

## Contents

This demo covers:
1. **Basic Conformal Prediction** - Single models with different methods
2. **Time Series** - Temporal calibration and EnbPI
3. **Grouped Models** - Per-group calibration for nested/panel data
4. **extract_outputs() Integration** - Conformal intervals in standard output

---

In [None]:
# Imports
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
from py_parsnip import linear_reg
from py_workflows import workflow

# Set random seed for reproducibility
np.random.seed(42)

# Set plotting style
sns.set_style('whitegrid')
plt.rcParams['figure.figsize'] = (12, 6)

---

# 1. Basic Conformal Prediction

Start with a simple regression example to understand the basics.

## 1.1 Generate Sample Data

In [None]:
# Generate sample regression data
n = 500
data = pd.DataFrame({
    'x1': np.random.randn(n),
    'x2': np.random.randn(n),
    'x3': np.random.randn(n)
})
data['y'] = 10 + 2*data['x1'] + 3*data['x2'] + 1*data['x3'] + np.random.randn(n) * 0.5

print(f"Data shape: {data.shape}")
data.head()

## 1.2 Fit Model and Get Conformal Predictions

In [None]:
# Fit linear regression model
spec = linear_reg()
fit = spec.fit(data, 'y ~ x1 + x2 + x3')

# Get conformal prediction intervals (95% confidence)
conformal_preds = fit.conformal_predict(data, alpha=0.05, method='split')

print(f"\nConformal predictions shape: {conformal_preds.shape}")
print(f"\nColumns: {list(conformal_preds.columns)}")
conformal_preds.head()

## 1.3 Verify Coverage

Check that conformal intervals achieve the target 95% coverage.
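
Empirical coverage measured on a finite set of points will rarely equal the target exactly. As a rough guide to how much fluctuation to expect, the sketch below treats each point as an independent hit-or-miss trial and computes the binomial standard error. This is an approximation (it ignores the randomness of the calibration step), and the helper name `coverage_standard_error` is hypothetical.

```python
import math

def coverage_standard_error(target, n_eval):
    """Binomial standard error of an empirical coverage rate (approximation)."""
    return math.sqrt(target * (1 - target) / n_eval)

target, n_eval = 0.95, 500  # 500 matches the sample size used in this section
se = coverage_standard_error(target, n_eval)
print(f"Expected fluctuation: about ±{2 * se:.1%} around {target:.0%}")
```

A measured coverage a point or two away from 95% is therefore consistent with a correctly calibrated method at this sample size.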

In [None]:
# Calculate empirical coverage
in_interval = (
    (data['y'] >= conformal_preds['.pred_lower']) &
    (data['y'] <= conformal_preds['.pred_upper'])
)

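target = 0.95
tolerance = 0.02  # Slack for sampling noise in a finite evaluation set
coverage = in_interval.mean()
print(f"Empirical Coverage: {coverage:.1%}")
print(f"Target Coverage: {target:.0%}")
if coverage >= target - tolerance:
    print("\n✓ Coverage within tolerance of target")
else:
    print("\n✗ Coverage below target")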

## 1.4 Visualize Conformal Intervals

In [None]:
# Plot first 100 observations
n_show = 100
idx = range(n_show)

plt.figure(figsize=(14, 6))
plt.scatter(idx, data['y'].iloc[:n_show], label='Actual', s=30, zorder=3)
plt.plot(idx, conformal_preds['.pred'].iloc[:n_show], 'k-', label='Prediction', linewidth=2)
plt.fill_between(
    idx,
    conformal_preds['.pred_lower'].iloc[:n_show],
    conformal_preds['.pred_upper'].iloc[:n_show],
    alpha=0.3,
    label='95% Conformal Interval'
)
plt.xlabel('Observation')
plt.ylabel('Value')
plt.title('Conformal Prediction Intervals (Split Method)')
plt.legend()
plt.grid(True, alpha=0.3)
plt.tight_layout()
plt.show()

# Calculate interval width statistics
interval_width = conformal_preds['.pred_upper'] - conformal_preds['.pred_lower']
print(f"\nInterval Width Statistics:")
print(f"  Mean: {interval_width.mean():.3f}")
print(f"  Median: {interval_width.median():.3f}")
print(f"  Min: {interval_width.min():.3f}")
print(f"  Max: {interval_width.max():.3f}")

## 1.5 Multiple Confidence Levels

Get intervals for multiple confidence levels simultaneously.
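
Intervals at several levels can share one set of calibration scores, because each level only changes which order statistic is used. The sketch below illustrates this with simulated absolute residuals. The helper `conformal_quantile` is an assumption for illustration, not a py-tidymodels function.

```python
import numpy as np

def conformal_quantile(scores, alpha):
    """Return the ceil((n + 1)(1 - alpha))-th smallest calibration score."""
    scores = np.sort(np.asarray(scores))
    n = len(scores)
    rank = int(np.ceil((n + 1) * (1 - alpha)))
    return scores[rank - 1] if rank <= n else np.inf

rng = np.random.default_rng(1)
cal_scores = np.abs(rng.normal(scale=0.5, size=200))  # stand-in for |y - y_hat|

# One half-width per miscoverage level, all from the same scores.
half_widths = {alpha: conformal_quantile(cal_scores, alpha) for alpha in (0.05, 0.1, 0.2)}
for alpha, q in half_widths.items():
    print(f"{1 - alpha:.0%} interval half-width: {q:.3f}")
```

Because a smaller α always selects an equal or higher order statistic, the 80% interval sits inside the 90% interval, which sits inside the 95% interval.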

In [None]:
# Get 95%, 90%, and 80% prediction intervals
multi_conf = fit.conformal_predict(data, alpha=[0.05, 0.1, 0.2], method='split')

print(f"Columns: {list(multi_conf.columns)}")
multi_conf.head()

In [None]:
# Visualize nested intervals
n_show = 50
idx = range(n_show)

plt.figure(figsize=(14, 6))
plt.scatter(idx, data['y'].iloc[:n_show], label='Actual', s=40, zorder=4, color='red')
plt.plot(idx, multi_conf['.pred'].iloc[:n_show], 'k-', label='Prediction', linewidth=2, zorder=3)

# 95% interval
plt.fill_between(
    idx,
    multi_conf['.pred_lower_95'].iloc[:n_show],
    multi_conf['.pred_upper_95'].iloc[:n_show],
    alpha=0.2,
    label='95% CI',
    color='blue'
)

# 90% interval
plt.fill_between(
    idx,
    multi_conf['.pred_lower_90'].iloc[:n_show],
    multi_conf['.pred_upper_90'].iloc[:n_show],
    alpha=0.3,
    label='90% CI',
    color='green'
)

# 80% interval
plt.fill_between(
    idx,
    multi_conf['.pred_lower_80'].iloc[:n_show],
    multi_conf['.pred_upper_80'].iloc[:n_show],
    alpha=0.4,
    label='80% CI',
    color='orange'
)

plt.xlabel('Observation')
plt.ylabel('Value')
plt.title('Nested Conformal Intervals (Multiple Confidence Levels)')
plt.legend(loc='upper left')
plt.grid(True, alpha=0.3)
plt.tight_layout()
plt.show()

## 1.6 Compare Different Methods

Compare the three main conformal methods: split, CV+, and Jackknife+.

In [None]:
# Compare methods
methods = ['split', 'cv+', 'jackknife+']
results = {}

for method in methods:
    preds = fit.conformal_predict(data, alpha=0.05, method=method)
    
    # Calculate coverage and interval width
    in_interval = (
        (data['y'] >= preds['.pred_lower']) &
        (data['y'] <= preds['.pred_upper'])
    )
    coverage = in_interval.mean()
    
    interval_width = (preds['.pred_upper'] - preds['.pred_lower']).mean()
    
    results[method] = {
        'coverage': coverage,
        'avg_width': interval_width
    }
    
    print(f"{method.upper():12} - Coverage: {coverage:.1%}, Avg Width: {interval_width:.3f}")

# Create comparison DataFrame
comparison_df = pd.DataFrame(results).T
print("\n" + "="*50)
print(comparison_df)

**Key Takeaways:**
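- **Split**: Fastest (a single model fit), good for large datasets, but holds out part of the data for calibration
- **CV+**: Balanced approach (K model fits, one per fold), better data efficiency
- **Jackknife+**: Most data-efficient (n leave-one-out fits), best for small datasets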
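
To show where the n model fits come from, here is a small NumPy sketch of Jackknife+ around an ordinary least-squares model. It is an illustration under assumed names (`ols_fit_predict`, `jackknife_plus`) and synthetic data, not the library's implementation. Each training point is left out once, the model is refit, and the resulting leave-one-out residual is attached to that refit's predictions for the new points.

```python
import numpy as np

def ols_fit_predict(X_fit, y_fit, X_eval):
    """Fit OLS with an intercept and predict at X_eval."""
    A = np.column_stack([np.ones(len(X_fit)), X_fit])
    beta, *_ = np.linalg.lstsq(A, y_fit, rcond=None)
    return np.column_stack([np.ones(len(X_eval)), X_eval]) @ beta

def jackknife_plus(X, y, X_new, alpha=0.05):
    """Jackknife+ intervals: one leave-one-out refit per training point."""
    n = len(y)
    lo_terms = np.empty((n, len(X_new)))
    hi_terms = np.empty((n, len(X_new)))
    for i in range(n):
        keep = np.arange(n) != i
        # Refit without point i, then predict point i and all new points.
        preds = ols_fit_predict(X[keep], y[keep], np.vstack([X[i:i + 1], X_new]))
        resid_i = abs(y[i] - preds[0])
        lo_terms[i] = preds[1:] - resid_i
        hi_terms[i] = preds[1:] + resid_i
    lo_terms.sort(axis=0)
    hi_terms.sort(axis=0)
    lo_rank = int(np.floor(alpha * (n + 1)))       # rank among the lower terms
    hi_rank = int(np.ceil((1 - alpha) * (n + 1)))  # rank among the upper terms
    lower = lo_terms[lo_rank - 1] if lo_rank >= 1 else np.full(len(X_new), -np.inf)
    upper = hi_terms[hi_rank - 1] if hi_rank <= n else np.full(len(X_new), np.inf)
    return lower, upper

rng = np.random.default_rng(4)
X = rng.normal(size=(500, 3))
y = 10 + X @ np.array([2.0, 3.0, 1.0]) + rng.normal(scale=0.5, size=500)

n_train = 100  # small training set, where Jackknife+ is most attractive
lower, upper = jackknife_plus(X[:n_train], y[:n_train], X[n_train:], alpha=0.05)
coverage = np.mean((y[n_train:] >= lower) & (y[n_train:] <= upper))
print(f"Model fits: {n_train}, held-out coverage: {coverage:.1%}")
```

Every training point contributes to both fitting and calibration, which is why no data is held out, at the price of one refit per point.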

---

# 2. Time Series Conformal Prediction

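Time series violate the exchangeability assumption behind standard conformal methods, so calibration has to respect temporal order rather than split the data at random.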

## 2.1 Generate Time Series Data

In [None]:
# Generate time series data with trend and seasonality
n_days = 365
dates = pd.date_range('2020-01-01', periods=n_days, freq='D')

# Trend + weekly seasonality + noise
trend = np.linspace(100, 150, n_days)
seasonality = 10 * np.sin(2 * np.pi * np.arange(n_days) / 7)
noise = np.random.randn(n_days) * 5

ts_data = pd.DataFrame({
    'date': dates,
    'value': trend + seasonality + noise,
    'day_of_week': dates.dayofweek,
    'month': dates.month
})

# Create lagged features
ts_data['lag_1'] = ts_data['value'].shift(1)
ts_data['lag_7'] = ts_data['value'].shift(7)
ts_data = ts_data.dropna()

print(f"Time series data shape: {ts_data.shape}")
ts_data.head(10)

## 2.2 Conformal Prediction with EnbPI

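EnbPI (Ensemble Batch Prediction Intervals) is designed for time series: it builds a bootstrap ensemble and calibrates on out-of-bag residuals, so it does not rely on exchangeable observations.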
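
The sketch below illustrates temporal calibration in a simplified form. It is deliberately not EnbPI: there is no bootstrap ensemble, only a single least-squares fit on the earliest block and a sliding window of the most recent absolute residuals that is updated as each new value is observed. The variable names, window size, and synthetic series are assumptions for illustration.

```python
import numpy as np
from collections import deque

rng = np.random.default_rng(2)
n = 400
t = np.arange(n)
y = 100 + 0.1 * t + 10 * np.sin(2 * np.pi * t / 7) + rng.normal(scale=5, size=n)

# Design matrix: intercept, linear trend, and weekly sine/cosine terms.
A = np.column_stack([np.ones(n), t, np.sin(2 * np.pi * t / 7), np.cos(2 * np.pi * t / 7)])

n_train, window, alpha = 200, 100, 0.05

# Fit on the earliest block only, so no future values leak into the model.
beta, *_ = np.linalg.lstsq(A[:n_train], y[:n_train], rcond=None)
pred = A @ beta

# Seed the calibration window with residuals from the block right after training.
cal_slice = slice(n_train, n_train + window)
recent = deque(np.abs(y[cal_slice] - pred[cal_slice]), maxlen=window)

rank = int(np.ceil((window + 1) * (1 - alpha)))  # 96 for window=100, alpha=0.05
hits = []
for i in range(n_train + window, n):
    q_hat = np.sort(np.array(recent))[rank - 1]
    hits.append(abs(y[i] - pred[i]) <= q_hat)
    # Once the true value at time i is known, slide the window forward.
    recent.append(abs(y[i] - pred[i]))

coverage = float(np.mean(hits))
print(f"Rolling one-step coverage: {coverage:.1%}")
```

Calibrating on recent residuals lets the interval width adapt if the error level drifts over time, which a single random calibration split cannot do.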

In [None]:
# Fit model
ts_spec = linear_reg()
ts_fit = ts_spec.fit(ts_data, 'value ~ lag_1 + lag_7 + day_of_week')

# Get conformal predictions; method='auto' delegates the choice to the library,
# and this demo expects it to pick EnbPI for time series data
ts_conformal = ts_fit.conformal_predict(
    ts_data,
    alpha=0.05,
    method='auto'  # Selected method is reported in the '.conf_method' column
)

print(f"Method selected: {ts_conformal['.conf_method'].iloc[0]}")
ts_conformal.head()

In [None]:
# Visualize time series with conformal intervals
n_show = 100

plt.figure(figsize=(14, 6))
plt.plot(ts_data['date'].iloc[:n_show], ts_data['value'].iloc[:n_show], 
         'o-', label='Actual', markersize=4)
plt.plot(ts_data['date'].iloc[:n_show], ts_conformal['.pred'].iloc[:n_show],
         'k-', label='Prediction', linewidth=2)
plt.fill_between(
    ts_data['date'].iloc[:n_show],
    ts_conformal['.pred_lower'].iloc[:n_show],
    ts_conformal['.pred_upper'].iloc[:n_show],
    alpha=0.3,
    label='95% Conformal Interval'
)
plt.xlabel('Date')
plt.ylabel('Value')
plt.title('Time Series with Conformal Prediction Intervals')
plt.legend()
plt.xticks(rotation=45)
plt.grid(True, alpha=0.3)
plt.tight_layout()
plt.show()

# Calculate coverage
in_interval = (
    (ts_data['value'].values >= ts_conformal['.pred_lower'].values) &
    (ts_data['value'].values <= ts_conformal['.pred_upper'].values)
)
print(f"\nTime Series Coverage: {in_interval.mean():.1%}")

---

# 3. Grouped/Panel Models with Conformal Prediction

For grouped data (e.g., multiple stores, regions), each group gets its own conformal calibration.
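
The effect of per-group calibration can be sketched with pandas alone. The example below uses simulated absolute residuals with different noise levels per store and compares a quantile computed within each group against one pooled quantile. The helper `conformal_quantile` and the column name `abs_resid` are assumptions for illustration, not part of py-tidymodels.

```python
import numpy as np
import pandas as pd

def conformal_quantile(scores, alpha):
    """Return the ceil((n + 1)(1 - alpha))-th smallest calibration score."""
    scores = np.sort(np.asarray(scores))
    n = len(scores)
    rank = int(np.ceil((n + 1) * (1 - alpha)))
    return scores[rank - 1] if rank <= n else np.inf

rng = np.random.default_rng(3)
noise_by_store = {"Store_A": 0.5, "Store_B": 3.0, "Store_C": 1.5}

# Stand-in calibration residuals: one row per held-out observation.
cal = pd.concat(
    [
        pd.DataFrame({"store": store, "abs_resid": np.abs(rng.normal(scale=sd, size=150))})
        for store, sd in noise_by_store.items()
    ],
    ignore_index=True,
)

# Per-group calibration: one half-width per store.
per_group_q = cal.groupby("store")["abs_resid"].apply(
    lambda s: conformal_quantile(s, alpha=0.05)
)
# Pooled calibration: a single half-width shared by all stores.
pooled_q = conformal_quantile(cal["abs_resid"], alpha=0.05)

print(per_group_q.round(3))
print(f"Pooled half-width: {pooled_q:.3f}")
```

A pooled half-width would be too wide for the quiet store and too narrow for the noisy one, so overall coverage could look fine while individual groups are over- or under-covered.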

## 3.1 Generate Grouped Data

In [None]:
# Generate grouped data with different patterns per group
n_per_group = 300

# Group A: Strong linear relationship, low noise
group_a = pd.DataFrame({
    'store': ['Store_A'] * n_per_group,
    'x1': np.random.randn(n_per_group),
    'x2': np.random.randn(n_per_group),
})
group_a['sales'] = 100 + 2*group_a['x1'] + 3*group_a['x2'] + np.random.randn(n_per_group) * 0.5

# Group B: Weaker relationship, high noise
group_b = pd.DataFrame({
    'store': ['Store_B'] * n_per_group,
    'x1': np.random.randn(n_per_group),
    'x2': np.random.randn(n_per_group),
})
group_b['sales'] = 80 + 1*group_b['x1'] + 0.5*group_b['x2'] + np.random.randn(n_per_group) * 3.0

# Group C: Different pattern
group_c = pd.DataFrame({
    'store': ['Store_C'] * n_per_group,
    'x1': np.random.randn(n_per_group),
    'x2': np.random.randn(n_per_group),
})
group_c['sales'] = 120 - 1.5*group_c['x1'] + 2*group_c['x2'] + np.random.randn(n_per_group) * 1.5

# Combine
grouped_data = pd.concat([group_a, group_b, group_c], ignore_index=True)

print(f"Grouped data shape: {grouped_data.shape}")
print(f"\nStores: {grouped_data['store'].unique()}")
grouped_data.groupby('store')['sales'].describe()

## 3.2 Fit Nested Models with Conformal Prediction

In [None]:
# Fit nested models (separate model per store)
grouped_spec = linear_reg()
nested_fit = grouped_spec.fit_nested(grouped_data, 'sales ~ x1 + x2', group_col='store')

# Get per-store conformal predictions
grouped_conformal = nested_fit.conformal_predict(
    grouped_data,
    alpha=0.05,
    method='split',
    per_group_calibration=True  # Each store gets own calibration (default)
)

print(f"Conformal predictions shape: {grouped_conformal.shape}")
print(f"\nStores in results: {grouped_conformal['store'].unique()}")
grouped_conformal.head(10)

## 3.3 Compare Interval Widths Across Groups

Groups with more noise should have wider conformal intervals.

In [None]:
# Calculate interval width per store
grouped_conformal['interval_width'] = (
    grouped_conformal['.pred_upper'] - grouped_conformal['.pred_lower']
)

# Compare by store
width_comparison = grouped_conformal.groupby('store')['interval_width'].agg(['mean', 'std', 'min', 'max'])
print("Interval Width Comparison by Store:")
print(width_comparison)

# Visualize
fig, axes = plt.subplots(1, 2, figsize=(14, 5))

# Box plot
grouped_conformal.boxplot(column='interval_width', by='store', ax=axes[0])
axes[0].set_title('Interval Width Distribution by Store')
axes[0].set_xlabel('Store')
axes[0].set_ylabel('Interval Width')
plt.sca(axes[0])
plt.xticks(rotation=0)

# Bar plot of means
width_comparison['mean'].plot(kind='bar', ax=axes[1], color=['#1f77b4', '#ff7f0e', '#2ca02c'])
axes[1].set_title('Average Interval Width by Store')
axes[1].set_xlabel('Store')
axes[1].set_ylabel('Average Interval Width')
axes[1].set_xticklabels(axes[1].get_xticklabels(), rotation=0)

plt.tight_layout()
plt.show()

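# Report the widest store from the data instead of asserting it
widest_store = width_comparison['mean'].idxmax()
print(f"\nWidest average intervals: {widest_store} (Store_B has the highest noise in data generation)")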

## 3.4 Verify Per-Group Coverage

In [None]:
# Calculate coverage per store
coverage_by_store = {}

for store in grouped_data['store'].unique():
    store_data = grouped_data[grouped_data['store'] == store]
    store_conf = grouped_conformal[grouped_conformal['store'] == store]
    
    in_interval = (
        (store_data['sales'].values >= store_conf['.pred_lower'].values) &
        (store_data['sales'].values <= store_conf['.pred_upper'].values)
    )
    
    coverage = in_interval.mean()
    coverage_by_store[store] = coverage
    print(f"{store}: {coverage:.1%} coverage")

# Visualize coverage
plt.figure(figsize=(10, 5))
stores = list(coverage_by_store.keys())
coverages = list(coverage_by_store.values())
colors = ['#1f77b4', '#ff7f0e', '#2ca02c']
plt.bar(stores, coverages, color=colors)
plt.axhline(y=0.95, color='red', linestyle='--', label='Target 95%', linewidth=2)
plt.xlabel('Store')
plt.ylabel('Coverage')
plt.title('Conformal Prediction Coverage by Store')
plt.ylim([0.8, 1.0])
plt.legend()
plt.grid(True, alpha=0.3, axis='y')
plt.tight_layout()
plt.show()

# Report the measured minimum instead of asserting success
print(f"\nLowest per-store coverage: {min(coverages):.1%} (target: 95%)")

---

# 4. extract_outputs() Integration

Conformal intervals can be added directly to the standard three-DataFrame output.

## 4.1 Standard Output (No Conformal)

In [None]:
# Fit simple model
simple_spec = linear_reg()
simple_fit = simple_spec.fit(data, 'y ~ x1 + x2 + x3')

# Standard extract_outputs (no conformal)
outputs, coefficients, stats = simple_fit.extract_outputs()

print("Standard outputs columns:")
print(list(outputs.columns))
print(f"\nOutputs shape: {outputs.shape}")
outputs.head()

## 4.2 Outputs with Conformal Intervals

In [None]:
# extract_outputs WITH conformal intervals
outputs_conf, coefficients_conf, stats_conf = simple_fit.extract_outputs(
    conformal_alpha=0.05,
    conformal_method='split'
)

print("Outputs columns with conformal:")
print(list(outputs_conf.columns))
print(f"\nOutputs shape: {outputs_conf.shape}")
outputs_conf.head()

## 4.3 Multiple Confidence Levels in extract_outputs()

In [None]:
# Get multiple confidence levels
outputs_multi, _, _ = simple_fit.extract_outputs(
    conformal_alpha=[0.05, 0.1, 0.2]
)

print("Columns with multiple alphas:")
print(list(outputs_multi.columns))
outputs_multi.head()

## 4.4 Grouped Models with extract_outputs()

In [None]:
# Fit nested models
nested_spec = linear_reg()
nested_fit = nested_spec.fit_nested(grouped_data, 'sales ~ x1 + x2', group_col='store')

# Extract outputs WITHOUT conformal
outputs_no_conf, _, _ = nested_fit.extract_outputs()
print("Outputs WITHOUT conformal:")
print(list(outputs_no_conf.columns))
print(f"Shape: {outputs_no_conf.shape}\n")

# Extract outputs WITH conformal
outputs_with_conf, coeffs, stats = nested_fit.extract_outputs(conformal_alpha=0.05)
print("Outputs WITH conformal:")
print(list(outputs_with_conf.columns))
print(f"Shape: {outputs_with_conf.shape}")
outputs_with_conf.head(10)

## 4.5 Visualize Grouped Results from extract_outputs()

In [None]:
# Plot conformal intervals per store
fig, axes = plt.subplots(1, 3, figsize=(18, 5))

for idx, store in enumerate(['Store_A', 'Store_B', 'Store_C']):
    store_outputs = outputs_with_conf[outputs_with_conf['store'] == store]
    n_show = 50
    
    axes[idx].scatter(range(n_show), store_outputs['actuals'].iloc[:n_show], 
                     label='Actual', s=30, zorder=3)
    axes[idx].plot(range(n_show), store_outputs['fitted'].iloc[:n_show],
                  'k-', label='Fitted', linewidth=2)
    axes[idx].fill_between(
        range(n_show),
        store_outputs['.pred_lower'].iloc[:n_show],
        store_outputs['.pred_upper'].iloc[:n_show],
        alpha=0.3,
        label='95% CI'
    )
    axes[idx].set_title(f'{store}')
    axes[idx].set_xlabel('Observation')
    axes[idx].set_ylabel('Sales')
    axes[idx].legend()
    axes[idx].grid(True, alpha=0.3)

plt.tight_layout()
plt.show()

print("✓ Each store has its own conformal intervals from extract_outputs()")

---

# Summary

## What We Covered

1. **Basic Conformal Prediction**
   - Split, CV+, and Jackknife+ methods
   - Multiple confidence levels
   - Coverage verification

2. **Time Series**
   - Auto method selection (EnbPI for time series)
   - Temporal calibration
   - Preserving temporal structure

3. **Grouped/Panel Models**
   - Per-group conformal calibration
   - Group-specific interval widths
   - Per-group coverage validation

4. **extract_outputs() Integration**
   - Seamless integration with standard output
   - Backward compatibility
   - Works with nested models

## Key Advantages

✅ **Distribution-free**: No parametric assumptions about the data (exchangeability is still required)  
✅ **Finite-sample guarantees**: Valid for any sample size, not just asymptotically  
✅ **Model-agnostic**: Works with any model  
✅ **Easy to use**: One parameter (`conformal_alpha`)  
✅ **Group-aware**: Per-group calibration for heterogeneous data  

## API Reference

```python
# Direct conformal prediction
conformal_preds = fit.conformal_predict(
    new_data,
    alpha=0.05,              # Miscoverage level (α=0.05 → 95% coverage)
    method='auto',           # 'auto', 'split', 'cv+', 'jackknife+', 'enbpi'
    calibration_data=None    # Optional separate calibration set
)

# Via extract_outputs()
outputs, coefs, stats = fit.extract_outputs(
    conformal_alpha=0.05,    # None (default) = no conformal
    conformal_method='auto'  # Method selection
)

# For nested models
nested_conformal = nested_fit.conformal_predict(
    test_data,
    alpha=0.05,
    per_group_calibration=True  # Each group gets own calibration
)
```

---