# SISSOEnsemble Demonstration: Interpretable Symbolic Regression with Calibrated UQ

This notebook demonstrates **SISSOEnsemble** - a shallow ensemble of SISSO equations with calibrated uncertainty quantification.

## What is SISSOEnsemble?

SISSOEnsemble combines:
- **SISSO** (Sure Independence Screening and Sparsifying Operator) for discovering interpretable symbolic equations
- **DPOSE-style shallow ensembles** for calibrated uncertainty quantification

Key features:
- ✅ **Interpretable equations** - each ensemble member is a readable symbolic expression
- ✅ **Calibrated uncertainties** - trained with CRPS or NLL loss
- ✅ **Ensemble diversity** - different equation complexities (n_term values)
- ✅ **Post-hoc calibration** - optional validation-based calibration
- ✅ **Uncertainty propagation** - via `predict_ensemble()`

**References:**
- TorchSISSO: https://arxiv.org/abs/2410.01752
- DPOSE: Kellner & Ceriotti (2024), *Machine Learning: Science and Technology*

## Setup

In [None]:
import numpy as np
import matplotlib.pyplot as plt
from sklearn.model_selection import train_test_split

# Set style
plt.style.use('default')
plt.rcParams['figure.dpi'] = 100

print('Imports successful!')

## 1. Basic Example: Simple Additive Function

Let's start with a simple example where the true function is `y = x0 + x1`.

In [None]:
from pycse.sklearn import SISSOEnsemble

# Generate simple additive data
np.random.seed(42)
X = np.random.rand(100, 2)
y = X[:, 0] + X[:, 1] + 0.05 * np.random.randn(100)

# Split data
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

print(f'Training samples: {len(X_train)}')
print(f'Test samples: {len(X_test)}')
print(f'True function: y = x0 + x1 + noise')

In [None]:
# Fit SISSOEnsemble
model = SISSOEnsemble(
    n_models=5,           # Number of equations to discover
    n_expansion=1,        # Feature expansion depth
    n_terms_range=(1, 3), # Range of terms per equation
    loss_type='crps',     # CRPS loss for calibrated uncertainties
    feature_names=['x0', 'x1']
)

model.fit(X_train, y_train)

print(f'\nDiscovered {model.n_ensemble_} unique equations:')
for i, eq in enumerate(model.equations_):
    print(f'  {i+1}. {eq}')

In [None]:
# Evaluate on test set
y_pred, y_std = model.predict(X_test, return_std=True)

mae = np.mean(np.abs(y_test - y_pred))
rmse = np.sqrt(np.mean((y_test - y_pred)**2))
r2 = model.score(X_test, y_test)

print('Test Set Performance:')
print(f'  MAE:  {mae:.6f}')
print(f'  RMSE: {rmse:.6f}')
print(f'  R²:   {r2:.6f}')
print(f'  Uncertainty range: [{y_std.min():.4f}, {y_std.max():.4f}]')
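
The summary above reports only the range of `y_std`. A complementary check is empirical coverage: the fraction of test targets that fall inside the predicted ±2σ band, which should be close to 95% if the errors are roughly Gaussian and the uncertainties are well calibrated. The sketch below is a minimal illustration on synthetic stand-in arrays; `interval_coverage` and the `demo_*` names are hypothetical helpers and are not part of pycse.

```python
import numpy as np


def interval_coverage(y_true, y_pred, y_std, k=2.0):
    """Fraction of targets that fall inside y_pred ± k * y_std."""
    y_true, y_pred, y_std = map(np.asarray, (y_true, y_pred, y_std))
    return float(np.mean(np.abs(y_true - y_pred) <= k * y_std))


# Synthetic stand-in for (y_test, y_pred, y_std): errors are drawn with the stated std.
rng = np.random.default_rng(0)
demo_pred = rng.uniform(0.0, 2.0, size=5000)
demo_std = np.full(5000, 0.05)
demo_true = demo_pred + demo_std * rng.standard_normal(5000)

cov_matched = interval_coverage(demo_true, demo_pred, demo_std)
cov_overconfident = interval_coverage(demo_true, demo_pred, 0.5 * demo_std)

print(f'Coverage at ±2σ, σ matched to errors: {cov_matched:.3f}')
print(f'Coverage at ±2σ, σ halved:            {cov_overconfident:.3f}')
```

With the notebook's own arrays, the call would be `interval_coverage(y_test, y_pred, y_std)`. Keep in mind that a 20-point test set gives only a coarse estimate.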

## 2. Interpolation and Extrapolation with Gaps

A key test for uncertainty quantification is how the model behaves:
- **In gaps**: regions without training data (interpolation)
- **Beyond data range**: extrapolation

Good UQ should show increased uncertainty in these regions.
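
To see why an ensemble of equations with different complexity can signal this, consider the rough sketch below. It uses polynomial fits of degree 1, 2 and 3 from `np.polyfit` as a stand-in for ensemble members. That choice is an assumption made purely for illustration: these are not SISSO equations, and the spread is not calibrated. The members agree best where data constrains them and drift apart outside the sampled range.

```python
import numpy as np

rng = np.random.default_rng(42)

# Gapped 1D data: samples on [0, 0.35] and [0.65, 1], none in between.
x = np.concatenate([np.linspace(0, 0.35, 40), np.linspace(0.65, 1, 40)])
y = 2 * x**2 + x + 0.05 * rng.standard_normal(x.size)

# Stand-in ensemble: polynomial fits of increasing degree (NOT SISSO equations).
x_grid = np.linspace(-0.3, 1.3, 200)
members = np.column_stack(
    [np.polyval(np.polyfit(x, y, deg), x_grid) for deg in (1, 2, 3)]
)  # shape: (n_points, n_members)

# Per-point disagreement between members, used here as a raw uncertainty proxy.
spread = members.std(axis=1)

in_data = ((x_grid >= 0) & (x_grid <= 0.35)) | ((x_grid >= 0.65) & (x_grid <= 1))
in_gap = (x_grid > 0.35) & (x_grid < 0.65)
outside = (x_grid < 0) | (x_grid > 1)

print(f'Mean spread where data exists: {spread[in_data].mean():.3f}')
print(f'Mean spread in the gap:        {spread[in_gap].mean():.3f}')
print(f'Mean spread in extrapolation:  {spread[outside].mean():.3f}')
```

Whether the spread also grows inside the gap depends on how differently the members interpolate, which is why the gap is worth inspecting separately from the extrapolation regions.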

In [None]:
# Generate 1D data with a gap in the middle
np.random.seed(42)

# Left region: [0, 0.35]
x_left = np.linspace(0, 0.35, 40)[:, None]
# Right region: [0.65, 1]
x_right = np.linspace(0.65, 1, 40)[:, None]

# Combine (gap in middle: [0.35, 0.65])
X_train = np.vstack([x_left, x_right])

# True function: quadratic
y_train = 2 * X_train.ravel()**2 + X_train.ravel() + 0.05 * np.random.randn(len(X_train))

print(f'Training samples: {len(X_train)}')
print(f'Data range: [0, 0.35] ∪ [0.65, 1]')
print(f'Gap: [0.35, 0.65]')
print(f'True function: y = 2x² + x + noise')

In [None]:
# Fit SISSOEnsemble
model_gap = SISSOEnsemble(
    n_models=5,
    n_expansion=2,        # Allow x², x*x, etc.
    n_terms_range=(1, 3),
    loss_type='crps',
    feature_names=['x']
)

model_gap.fit(X_train, y_train)

print(f'Discovered {model_gap.n_ensemble_} equations:')
for i, eq in enumerate(model_gap.equations_):
    print(f'  {i+1}. {eq}')

In [None]:
# Predict over extended range (including gap and extrapolation)
x_plot = np.linspace(-0.3, 1.3, 200)[:, None]
y_pred, y_std = model_gap.predict(x_plot, return_std=True)

# True function for comparison
y_true = 2 * x_plot.ravel()**2 + x_plot.ravel()

# Plot
fig, ax = plt.subplots(figsize=(12, 7))

# Extrapolation regions
ax.axvspan(-0.3, 0, alpha=0.15, color='orange', label='Extrapolation')
ax.axvspan(1, 1.3, alpha=0.15, color='orange')

# Gap region
ax.axvspan(0.35, 0.65, alpha=0.15, color='purple', label='Gap (interpolation)')

# Uncertainty band
ax.fill_between(x_plot.ravel(), y_pred - 2*y_std, y_pred + 2*y_std,
                alpha=0.3, color='red', label='±2σ (95% CI)')

# Predictions
ax.plot(x_plot.ravel(), y_pred, 'r-', linewidth=2.5, label='SISSOEnsemble')
ax.plot(x_plot.ravel(), y_true, 'k--', linewidth=1.5, label='True function', alpha=0.7)

# Training data
ax.scatter(X_train, y_train, c='blue', s=40, alpha=0.6, label='Training data', zorder=5)

# Data boundaries
ax.axvline(0, color='black', linestyle='--', alpha=0.3)
ax.axvline(1, color='black', linestyle='--', alpha=0.3)

ax.set_xlabel('x', fontsize=12)
ax.set_ylabel('y', fontsize=12)
ax.set_title('SISSOEnsemble: Interpolation in Gap & Extrapolation', fontsize=13, fontweight='bold')
ax.legend(fontsize=10, loc='upper left')
ax.grid(True, alpha=0.3)

plt.tight_layout()
plt.show()

print('\nKey observations:')
print('  • Orange regions: Extrapolation beyond training data')
print('  • Purple region: Gap where no training data exists')
print('  â€¢ Uncertainty band (red) should widen in gap and extrapolation regions')

## 3. Comparing Loss Functions: CRPS vs NLL vs MSE

In [None]:
# Compare loss functions
loss_types = ['crps', 'nll', 'mse']

fig, axes = plt.subplots(1, 3, figsize=(15, 5))

for idx, loss_type in enumerate(loss_types):
    ax = axes[idx]
    
    # Fit model
    model_loss = SISSOEnsemble(
        n_models=5,
        n_expansion=2,
        n_terms_range=(1, 3),
        loss_type=loss_type,
        feature_names=['x']
    )
    model_loss.fit(X_train, y_train)
    
    # Predict
    y_pred, y_std = model_loss.predict(x_plot, return_std=True)
    
    # Plot regions
    ax.axvspan(-0.3, 0, alpha=0.1, color='orange')
    ax.axvspan(1, 1.3, alpha=0.1, color='orange')
    ax.axvspan(0.35, 0.65, alpha=0.1, color='purple')
    
    # Uncertainty and prediction
    ax.fill_between(x_plot.ravel(), y_pred - 2*y_std, y_pred + 2*y_std,
                    alpha=0.3, color='red', label='±2σ')
    ax.plot(x_plot.ravel(), y_pred, 'r-', linewidth=2, label='Prediction')
    ax.plot(x_plot.ravel(), y_true, 'k--', linewidth=1, label='True', alpha=0.5)
    ax.scatter(X_train, y_train, c='blue', s=20, alpha=0.5, label='Data')
    
    ax.set_xlabel('x')
    ax.set_ylabel('y')
    ax.set_title(f'{loss_type.upper()} Loss\nσ ∈ [{y_std.min():.3f}, {y_std.max():.3f}]')
    ax.legend(fontsize=8)
    ax.grid(True, alpha=0.3)

plt.tight_layout()
plt.show()

print('\nLoss function comparison:')
print('  CRPS: Robust, single-stage training (recommended)')
print('  NLL:  Can capture heteroscedasticity well')
print('  MSE:  No uncertainty training (baseline)')
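
For reference, the three losses can be written down for a single Gaussian prediction with mean `mu` and standard deviation `sigma`. The sketch below uses the standard closed-form expressions. It is an illustration of how the losses differ, assuming a Gaussian predictive distribution, and is not necessarily how pycse implements them. The function names are hypothetical.

```python
import math

import numpy as np

_erf = np.vectorize(math.erf)  # elementwise error function without needing SciPy


def gaussian_crps(y, mu, sigma):
    """Closed-form CRPS of N(mu, sigma^2) against observation y (lower is better)."""
    z = (np.asarray(y, dtype=float) - mu) / sigma
    pdf = np.exp(-0.5 * z**2) / math.sqrt(2 * math.pi)
    cdf = 0.5 * (1 + _erf(z / math.sqrt(2)))
    return sigma * (z * (2 * cdf - 1) + 2 * pdf - 1 / math.sqrt(math.pi))


def gaussian_nll(y, mu, sigma):
    """Negative log-likelihood of y under N(mu, sigma^2)."""
    z = (np.asarray(y, dtype=float) - mu) / sigma
    return 0.5 * np.log(2 * math.pi * sigma**2) + 0.5 * z**2


def squared_error(y, mu):
    """Squared error; note that it does not depend on sigma at all."""
    return (np.asarray(y, dtype=float) - mu) ** 2


# Hold the residual fixed at 1.0 and scan the predicted sigma.
sigmas = np.linspace(0.1, 3.0, 291)
crps_curve = gaussian_crps(1.0, 0.0, sigmas)
nll_curve = gaussian_nll(1.0, 0.0, sigmas)

best_sigma_crps = sigmas[np.argmin(crps_curve)]
best_sigma_nll = sigmas[np.argmin(nll_curve)]

print(f'sigma minimizing CRPS: {best_sigma_crps:.2f}')
print(f'sigma minimizing NLL:  {best_sigma_nll:.2f}')
print(f'NLL at sigma=0.1:  {nll_curve[0]:.1f}  (large penalty for overconfidence)')
print(f'CRPS at sigma=0.1: {crps_curve[0]:.3f} (stays bounded)')
```

Both CRPS and NLL reward a `sigma` that is comparable to the actual error, but NLL grows much faster when `sigma` is too small. Squared error ignores `sigma` entirely, so it gives the uncertainty no training signal.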

## 4. Post-hoc Calibration with Validation Data

In [None]:
# Generate more data for validation
np.random.seed(123)
X_full = np.random.rand(150, 2)
y_full = X_full[:, 0] + 2*X_full[:, 1] + 0.1 * np.random.randn(150)

# Split into train/val/test
X_train, X_temp, y_train, y_temp = train_test_split(X_full, y_full, test_size=0.4, random_state=42)
X_val, X_test, y_val, y_test = train_test_split(X_temp, y_temp, test_size=0.5, random_state=42)

print(f'Training: {len(X_train)}, Validation: {len(X_val)}, Test: {len(X_test)}')

In [None]:
# Fit without calibration
model_uncalib = SISSOEnsemble(
    n_models=5, n_expansion=1, n_terms_range=(1, 3),
    feature_names=['x0', 'x1']
)
model_uncalib.fit(X_train, y_train)

# Fit with calibration
model_calib = SISSOEnsemble(
    n_models=5, n_expansion=1, n_terms_range=(1, 3),
    feature_names=['x0', 'x1']
)
model_calib.fit(X_train, y_train, val_X=X_val, val_y=y_val)

print(f'Uncalibrated model calibration factor: {model_uncalib.calibration_factor_:.4f}')
print(f'Calibrated model calibration factor: {model_calib.calibration_factor_:.4f}')

In [None]:
# Compare on test set
y_pred_uncalib, y_std_uncalib = model_uncalib.predict(X_test, return_std=True)
y_pred_calib, y_std_calib = model_calib.predict(X_test, return_std=True)

# Z-scores (should have std ~1 if well-calibrated)
z_uncalib = (y_test - y_pred_uncalib) / y_std_uncalib
z_calib = (y_test - y_pred_calib) / y_std_calib

print('Calibration Quality (Z-score std should be ~1.0):')
print(f'  Uncalibrated: Z-score std = {np.std(z_uncalib):.4f}')
print(f'  Calibrated:   Z-score std = {np.std(z_calib):.4f}')
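
One common way to obtain such a calibration factor is to rescale every predicted standard deviation by a single number chosen so that the validation z-scores have unit root-mean-square. The sketch below shows that idea on synthetic data. It is an assumption about how post-hoc calibration can work in general, not a description of what `calibration_factor_` computes inside pycse, and all variable names are illustrative.

```python
import numpy as np

rng = np.random.default_rng(0)
n = 2000

# Synthetic predictions whose true error std is 0.5 but whose reported std is 0.1.
y_pred = rng.normal(size=n)
y_true = y_pred + 0.5 * rng.standard_normal(n)
sigma_raw = np.full(n, 0.1)  # overconfident by a factor of about 5

val, test = slice(0, 1000), slice(1000, None)

# Calibration factor: RMS of the validation z-scores.
z_val = (y_true[val] - y_pred[val]) / sigma_raw[val]
alpha = np.sqrt(np.mean(z_val**2))

# Apply the same factor to held-out predictions.
sigma_cal = alpha * sigma_raw
z_test_raw = (y_true[test] - y_pred[test]) / sigma_raw[test]
z_test_cal = (y_true[test] - y_pred[test]) / sigma_cal[test]

rms_raw = np.sqrt(np.mean(z_test_raw**2))
rms_cal = np.sqrt(np.mean(z_test_cal**2))

print(f'Calibration factor alpha: {alpha:.3f}')
print(f'Test z-score RMS before:  {rms_raw:.3f}')
print(f'Test z-score RMS after:   {rms_cal:.3f}')
```

A single global factor fixes the overall scale of the uncertainties but cannot correct errors in their shape, for example a band that is too narrow only in the extrapolation region.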

## 5. Ensemble Predictions for Uncertainty Propagation

In [None]:
# Get full ensemble predictions
ensemble_preds = model_calib.predict_ensemble(X_test)

print(f'Ensemble predictions shape: {ensemble_preds.shape}')
print(f'  {len(X_test)} samples × {model_calib.n_ensemble_} ensemble members')

# Each column is a different equation's prediction
print(f'\nPrediction range per ensemble member:')
for i in range(ensemble_preds.shape[1]):
    print(f'  Member {i+1}: [{ensemble_preds[:, i].min():.3f}, {ensemble_preds[:, i].max():.3f}]')

In [None]:
# Uncertainty propagation example: z = exp(y)
z_ensemble = np.exp(ensemble_preds)
z_mean = z_ensemble.mean(axis=1)
z_std = z_ensemble.std(axis=1)

# Compare with naive propagation
y_mean, y_std = model_calib.predict(X_test, return_std=True)
z_mean_naive = np.exp(y_mean)
z_std_naive = z_mean_naive * y_std  # Linear approximation: dz/dy * dy

print('Uncertainty propagation through z = exp(y):')
print(f'  Ensemble method:  z_std ∈ [{z_std.min():.4f}, {z_std.max():.4f}]')
print(f'  Naive (linear):   z_std ∈ [{z_std_naive.min():.4f}, {z_std_naive.max():.4f}]')
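
The two approaches agree when `y_std` is small and diverge as it grows, because the linear approximation ignores the curvature of `exp`. The sketch below checks this on a synthetic Gaussian ensemble, where the exact answer is known from the lognormal distribution. The very large member count is an assumption made only to suppress sampling noise. A real shallow ensemble has a handful of members, so its spread estimate is much noisier.

```python
import numpy as np

rng = np.random.default_rng(0)
n_members = 200_000  # unrealistically large, only to reduce sampling noise
y_mean = 0.0
y_stds = np.array([0.05, 1.0])  # one narrow and one wide predictive distribution

# Rows are samples, columns are ensemble members.
ensemble = y_mean + y_stds[:, None] * rng.standard_normal((y_stds.size, n_members))

# Propagate through z = exp(y) three ways.
z_std_ensemble = np.exp(ensemble).std(axis=1)  # transform members, then take std
z_std_linear = np.exp(y_mean) * y_stds         # first-order: |dz/dy| * y_std
z_std_exact = np.exp(y_mean + y_stds**2 / 2) * np.sqrt(np.expm1(y_stds**2))  # lognormal

for s, e, l, x in zip(y_stds, z_std_ensemble, z_std_linear, z_std_exact):
    print(f'y_std={s:.2f}: ensemble={e:.4f}  linear={l:.4f}  exact={x:.4f}')
```

For the narrow case the three numbers are nearly identical. For the wide case the linear estimate is far too small, while the ensemble estimate tracks the exact value.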

## 6. Viewing the Discovered Equations

A key advantage of SISSOEnsemble over neural network ensembles is **interpretability**.

In [None]:
# Display discovered equations
print('Discovered Symbolic Equations:')
print('=' * 60)
for i, eq in enumerate(model_calib.equations_):
    print(f'Equation {i+1}: {eq}')
print('=' * 60)

print(f'\nTrue function: y = x0 + 2*x1')
print('\nNote: Each equation provides a different symbolic approximation.')
print('The ensemble combines them for robust predictions with uncertainty.')

## 7. Key Takeaways

### What SISSOEnsemble Provides

1. **Interpretable Equations**
   - Each ensemble member is a readable symbolic expression
   - Useful for understanding the underlying relationship

2. **Calibrated Uncertainties**
   - CRPS/NLL training for well-calibrated predictions
   - Optional post-hoc calibration with validation data

3. **Gap and Extrapolation Handling**
   - Uncertainty should widen where training data is sparse or absent; verify this on your own data as in Section 2
   - Ensemble diversity provides robustness

4. **Uncertainty Propagation**
   - Use `predict_ensemble()` for derived quantities
   - Apply any transformation to ensemble members

### Recommended Usage

```python
from pycse.sklearn import SISSOEnsemble

# Basic usage
model = SISSOEnsemble(
    n_models=5,           # Number of equations
    n_expansion=2,        # Feature complexity
    n_terms_range=(1, 3), # Terms per equation
    loss_type='crps',     # Robust loss function
    feature_names=['x0', 'x1', 'x2']
)

# Fit with optional calibration
model.fit(X_train, y_train, val_X=X_val, val_y=y_val)

# Predict with uncertainty
y_pred, y_std = model.predict(X_test, return_std=True)

# View equations
for eq in model.equations_:
    print(eq)

# Uncertainty propagation
ensemble = model.predict_ensemble(X_test)
z_ensemble = f(ensemble)  # f is a placeholder for your own elementwise function, e.g. np.exp
z_mean, z_std = z_ensemble.mean(axis=1), z_ensemble.std(axis=1)
```

### When to Use SISSOEnsemble

- When you need **interpretable models** (symbolic equations)
- When you want **calibrated uncertainty quantification**
- When data is limited and **extrapolation awareness** is important
- When you need to **propagate uncertainties** through derived quantities

---

**Enjoy interpretable symbolic regression with calibrated UQ!** 🎉