# Neural Network UQ Methods: NNGMM and NNBR

This notebook demonstrates two neural network-based uncertainty quantification methods:

1. **NNGMM**: Neural Network + Gaussian Mixture Model
2. **NNBR**: Neural Network + Bayesian Linear Regression

Both methods use a neural network to extract features from the input, then apply different statistical methods for uncertainty estimation in the learned feature space.

In [None]:
import numpy as np
import matplotlib.pyplot as plt
import pandas as pd
import sys
sys.path.insert(0, '../..')

from sklearn.neural_network import MLPRegressor
from sklearn.linear_model import BayesianRidge
from pycse.sklearn.nngmm import NeuralNetworkGMM
from pycse.sklearn.nnbr import NeuralNetworkBLR
from sklearn.model_selection import train_test_split

from src.datasets.linear import LineDataset
from src.datasets.nonlinear import ExponentialDecayDataset
from src.visualization import setup_plot_style
from src.utils.seeds import set_global_seed

# Set up plotting
setup_plot_style()
%matplotlib inline

## 1. Neural Network + Gaussian Mixture Model (NNGMM)

### How NNGMM Works

NNGMM is a two-stage approach:

1. **Feature Extraction**: A neural network learns a nonlinear transformation of the input features
2. **Uncertainty Estimation**: A Gaussian Mixture Model (GMM) is fit in the learned feature space to estimate prediction uncertainty

**Key Idea**: The NN captures complex nonlinear relationships, while the GMM provides a probabilistic model in the transformed space.

**Architecture**:
```
Input → NN Hidden Layers → Features → GMM → (mean, std)
```
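
As a rough illustration of the feature-extraction stage, the sketch below recomputes the last hidden layer of a fitted scikit-learn `MLPRegressor` by hand. The helper name `hidden_features` and the toy data are assumptions made for this example; pycse's own feature extraction may be implemented differently.

```python
import numpy as np
from sklearn.neural_network import MLPRegressor


def hidden_features(mlp, X):
    """Last-hidden-layer activations of a fitted tanh MLPRegressor (hypothetical helper)."""
    a = np.asarray(X, dtype=float)
    # Propagate through every layer except the final linear output layer
    for W, b in zip(mlp.coefs_[:-1], mlp.intercepts_[:-1]):
        a = np.tanh(a @ W + b)
    return a


# Toy data, for illustration only
rng = np.random.default_rng(0)
X_demo = rng.uniform(0, 1, size=(50, 1))
y_demo = 0.8 * X_demo.ravel() + 0.1 + rng.normal(0, 0.05, size=50)

mlp_demo = MLPRegressor(hidden_layer_sizes=(20,), activation='tanh',
                        solver='lbfgs', max_iter=1000, random_state=0)
mlp_demo.fit(X_demo, y_demo)

features_demo = hidden_features(mlp_demo, X_demo)  # shape (50, 20)

# The network output is a linear map of these features
y_from_features = features_demo @ mlp_demo.coefs_[-1] + mlp_demo.intercepts_[-1]
```

The second stage (GMM or Bayesian regression) then works on `features_demo` rather than on the raw input.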

**Advantages**:
- Can capture multimodal uncertainty distributions
- Flexible nonlinear modeling
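
To see how a multimodal predictive distribution reduces to a single `(mean, std)` pair, the sketch below applies the standard moment formulas for a one-dimensional Gaussian mixture. The weights, means, and standard deviations are made-up illustrative values, not outputs of `NeuralNetworkGMM`.

```python
import numpy as np


def mixture_mean_std(weights, means, stds):
    """Overall mean and standard deviation of a 1-D Gaussian mixture."""
    weights = np.asarray(weights, dtype=float)
    means = np.asarray(means, dtype=float)
    stds = np.asarray(stds, dtype=float)
    mean = np.sum(weights * means)
    # E[y^2] of each component is sigma_k^2 + mu_k^2
    second_moment = np.sum(weights * (stds**2 + means**2))
    return mean, np.sqrt(second_moment - mean**2)


# Two narrow, well-separated modes (illustrative values)
mix_mean, mix_std = mixture_mean_std([0.5, 0.5], [-1.0, 1.0], [0.1, 0.1])
```

Here the mixture standard deviation (about 1.005) is far wider than either component's 0.1, so a `mean ± 1.96 × std` interval treats the bimodal distribution as one broad Gaussian.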

**Challenges**:
- GMM fitting can be unstable (may diverge or produce poor fits)
- Sensitive to number of GMM components
- No built-in calibration mechanism
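
One general way to handle the sensitivity to the number of components is to compare candidate values with an information criterion. The sketch below does this with scikit-learn's `GaussianMixture` and its `bic` method on toy data. Whether `NeuralNetworkGMM` uses or exposes anything like this is not stated here, so treat it as a standalone illustration.

```python
import numpy as np
from sklearn.mixture import GaussianMixture

rng = np.random.default_rng(0)
# Toy 1-D "feature" data from two well-separated clusters (illustrative only)
Z = np.concatenate([rng.normal(0.0, 1.0, 200),
                    rng.normal(10.0, 1.0, 200)]).reshape(-1, 1)

candidates = [1, 2, 3, 4, 5]
bic_scores = {}
for k in candidates:
    gmm = GaussianMixture(n_components=k, random_state=0).fit(Z)
    bic_scores[k] = gmm.bic(Z)  # lower BIC is better

best_k = min(bic_scores, key=bic_scores.get)
print(f"Component count with the lowest BIC: {best_k}")
```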

In [None]:
# Set random seed for reproducibility
set_global_seed(42)

# Create a simple dataset with homoskedastic noise
dataset = LineDataset(
    slope=0.8,
    intercept=0.1,
    n_samples=100,
    noise_model='homoskedastic',
    noise_level=0.05,
    seed=42
)

data = dataset.generate()

# Reshape for sklearn
X_train = data.X_train.reshape(-1, 1)
y_train = data.y_train
X_test = data.X_test.reshape(-1, 1)
y_test = data.y_test

print(f"Training samples: {len(X_train)}")
print(f"Test samples: {len(X_test)}")

In [None]:
# Create backend neural network
mlp_gmm = MLPRegressor(
    hidden_layer_sizes=(20,),
    activation='tanh',
    solver='lbfgs',
    max_iter=1000,
    random_state=42
)

# Create NNGMM with 3 GMM components
nngmm = NeuralNetworkGMM(
    nn=mlp_gmm,
    n_components=3,
    n_samples=500
)

# Fit the model
print("Fitting NNGMM model...")
nngmm.fit(X_train, y_train)
print("Done!")

In [None]:
# Make predictions with uncertainty
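y_pred_gmm, y_std_gmm = nngmm.predict(X_test, return_std=True)
# Flatten both outputs so the interval arithmetic below stays elementwise
y_pred_gmm = np.ravel(y_pred_gmm)
y_std_gmm = np.ravel(y_std_gmm)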

# Compute 95% prediction intervals
z_score = 1.96
y_lower_gmm = y_pred_gmm - z_score * y_std_gmm
y_upper_gmm = y_pred_gmm + z_score * y_std_gmm

# Calculate metrics
rmse_gmm = np.sqrt(np.mean((y_test - y_pred_gmm)**2))
coverage_gmm = np.mean((y_test >= y_lower_gmm) & (y_test <= y_upper_gmm))
mean_width_gmm = np.mean(y_upper_gmm - y_lower_gmm)

print("NNGMM Results:")
print(f"  RMSE: {rmse_gmm:.4f}")
print(f"  Coverage: {coverage_gmm:.3f} (target: 0.95)")
print(f"  Mean Width: {mean_width_gmm:.4f}")
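
The same three metrics (RMSE, coverage, mean interval width) are recomputed for every model in this notebook. A small helper along the following lines could bundle them. The name `interval_metrics` and the toy numbers are assumptions for illustration, not part of pycse or this repository.

```python
import numpy as np


def interval_metrics(y_true, y_pred, y_std, z=1.96):
    """RMSE, empirical coverage, and mean width of symmetric Gaussian intervals."""
    y_true = np.ravel(y_true)
    y_pred = np.ravel(y_pred)
    y_std = np.ravel(y_std)
    lower = y_pred - z * y_std
    upper = y_pred + z * y_std
    return {
        'rmse': float(np.sqrt(np.mean((y_true - y_pred) ** 2))),
        'coverage': float(np.mean((y_true >= lower) & (y_true <= upper))),
        'mean_width': float(np.mean(upper - lower)),
    }


# Four toy points; the last prediction misses its target by a wide margin
demo = interval_metrics([0.0, 1.0, 2.0, 3.0],
                        [0.1, 1.0, 2.0, 5.0],
                        [0.1, 0.1, 0.1, 0.1])
```

In this toy case three of the four intervals contain the observed value, so `demo['coverage']` is 0.75.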

In [None]:
# Visualize NNGMM predictions
fig, ax = plt.subplots(figsize=(12, 6))

# Sort for plotting
sort_idx = np.argsort(X_test.flatten())
X_sorted = X_test[sort_idx].flatten()
y_pred_sorted = y_pred_gmm[sort_idx]
y_lower_sorted = y_lower_gmm[sort_idx]
y_upper_sorted = y_upper_gmm[sort_idx]

# Plot training data
ax.scatter(X_train, y_train, alpha=0.6, s=30, label='Training data', color='blue')

# Plot test data
ax.scatter(X_test, y_test, alpha=0.3, s=20, label='Test data', color='gray')

# Plot predictions
ax.plot(X_sorted, y_pred_sorted, 'r-', linewidth=2, label='NNGMM prediction')

# Plot uncertainty band
ax.fill_between(X_sorted, y_lower_sorted, y_upper_sorted, 
                 alpha=0.3, color='red', label='95% prediction interval')

# Plot true function
X_plot = np.linspace(0, 1, 200)
y_true = dataset._generate_clean(X_plot)
ax.plot(X_plot, y_true, 'k--', linewidth=2, label='True function', zorder=10)

ax.set_xlabel('X')
ax.set_ylabel('y')
ax.set_title(f'NNGMM: Coverage = {coverage_gmm:.3f}, Width = {mean_width_gmm:.4f}')
ax.legend()
ax.grid(True, alpha=0.3)
plt.show()

## 2. Neural Network + Bayesian Linear Regression (NNBR)

### How NNBR Works

NNBR is also a two-stage approach:

1. **Feature Extraction**: A neural network learns nonlinear features from the input
2. **Uncertainty Estimation**: Bayesian Linear Regression is applied in the feature space to estimate uncertainty

**Key Idea**: The NN captures complex patterns, while Bayesian regression provides principled uncertainty estimates.

**Architecture**:
```
Input → NN Hidden Layers → Features → Bayesian Linear Regression → (mean, std)
```
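
The Bayesian linear regression stage can be sketched in closed form once the features are fixed. The example below assumes a zero-mean Gaussian prior with precision `alpha` and a known noise precision `beta`; both are held fixed here for simplicity, whereas scikit-learn's `BayesianRidge` estimates them from the data. The function names and the hand-made `tanh` feature are hypothetical stand-ins, and `NeuralNetworkBLR` may differ in its details.

```python
import numpy as np


def blr_fit(Phi, y, alpha=1.0, beta=25.0):
    """Posterior mean m and covariance S of the weights in y = Phi @ w + noise."""
    n_features = Phi.shape[1]
    S = np.linalg.inv(alpha * np.eye(n_features) + beta * Phi.T @ Phi)
    m = beta * S @ Phi.T @ y
    return m, S


def blr_predict(Phi_new, m, S, beta=25.0):
    """Predictive mean and std: noise variance plus weight uncertainty."""
    mean = Phi_new @ m
    var = 1.0 / beta + np.einsum('ij,jk,ik->i', Phi_new, S, Phi_new)
    return mean, np.sqrt(var)


def toy_features(x):
    # Stand-in for NN features: a bias column and one tanh unit (assumption)
    return np.column_stack([np.ones_like(x), np.tanh(2 * x - 1)])


rng = np.random.default_rng(0)
x = rng.uniform(0, 1, 40)
y = 0.8 * x + 0.1 + rng.normal(0, 0.2, 40)  # noise std 0.2, so beta = 25

m, S = blr_fit(toy_features(x), y)
x_new = np.array([0.5, 3.0])  # inside and outside the training range
mean_new, std_new = blr_predict(toy_features(x_new), m, S)
```

The predictive standard deviation never drops below the noise level `1 / sqrt(beta)`, and it grows for inputs whose features lie far from the training data.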

**Advantages**:
- More stable than GMM fitting
- **Post-hoc calibration** using validation data improves coverage
- Computationally efficient

**Calibration Procedure**:
NNBR uses a validation split to calibrate uncertainties:
1. Split training data into fit (80%) and validation (20%)
2. Train NN and Bayesian regressor on fit data
3. Evaluate on validation data to learn a calibration factor
4. Apply calibration to scale uncertainties for better coverage
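
The calibration steps above can be sketched with a single scalar factor. The version below uses the root-mean-square of the standardized validation residuals, which is one common choice; the formula that `NeuralNetworkBLR` applies internally is not shown in this notebook and may differ. All data here are synthetic.

```python
import numpy as np


def calibration_factor(y_val, y_pred_val, y_std_val):
    """RMS of standardized residuals; values above 1 indicate overconfidence."""
    z = (np.ravel(y_val) - np.ravel(y_pred_val)) / np.ravel(y_std_val)
    return float(np.sqrt(np.mean(z ** 2)))


rng = np.random.default_rng(0)
true_sigma = 0.2
y_val = rng.normal(0.0, true_sigma, 2000)  # synthetic validation targets
y_pred_val = np.zeros(2000)                # predictions that match the true mean
raw_std = np.full(2000, 0.1)               # overconfident: half the true noise level

factor = calibration_factor(y_val, y_pred_val, raw_std)
calibrated_std = factor * raw_std

# Coverage of nominal 95% intervals before and after rescaling
coverage_raw = np.mean(np.abs(y_val - y_pred_val) <= 1.96 * raw_std)
coverage_cal = np.mean(np.abs(y_val - y_pred_val) <= 1.96 * calibrated_std)
```

For brevity the coverage is checked on the same synthetic validation set that produced the factor; in practice it should be checked on separate held-out data.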

In [None]:
# Create backend neural network for NNBR
mlp_br = MLPRegressor(
    hidden_layer_sizes=(20,),
    activation='tanh',
    solver='lbfgs',
    max_iter=1000,
    random_state=42
)

# Create Bayesian Ridge regressor
br = BayesianRidge()

# Create NNBR
nnbr = NeuralNetworkBLR(nn=mlp_br, br=br)

# Split training data for calibration (80% fit, 20% validation)
X_train_fit, X_train_val, y_train_fit, y_train_val = train_test_split(
    X_train, y_train, test_size=0.2, random_state=42
)

print(f"Fit samples: {len(X_train_fit)}")
print(f"Validation samples: {len(X_train_val)}")

# Fit with calibration
print("\nFitting NNBR model with calibration...")
nnbr.fit(X_train_fit, y_train_fit, val_X=X_train_val, val_y=y_train_val)
print("Done!")

In [None]:
# Make predictions with uncertainty
y_pred_br, y_std_br = nnbr.predict(X_test, return_std=True)
# Flatten both outputs so the interval arithmetic below stays elementwise
y_pred_br = np.ravel(y_pred_br)
y_std_br = np.ravel(y_std_br)

# Compute 95% prediction intervals
z_score = 1.96
y_lower_br = y_pred_br - z_score * y_std_br
y_upper_br = y_pred_br + z_score * y_std_br

# Calculate metrics
rmse_br = np.sqrt(np.mean((y_test - y_pred_br)**2))
coverage_br = np.mean((y_test >= y_lower_br) & (y_test <= y_upper_br))
mean_width_br = np.mean(y_upper_br - y_lower_br)

print("NNBR Results:")
print(f"  RMSE: {rmse_br:.4f}")
print(f"  Coverage: {coverage_br:.3f} (target: 0.95)")
print(f"  Mean Width: {mean_width_br:.4f}")

In [None]:
# Visualize NNBR predictions
fig, ax = plt.subplots(figsize=(12, 6))

# Sort for plotting
sort_idx = np.argsort(X_test.flatten())
X_sorted = X_test[sort_idx].flatten()
y_pred_sorted = y_pred_br[sort_idx]
y_lower_sorted = y_lower_br[sort_idx]
y_upper_sorted = y_upper_br[sort_idx]

# Plot training data
ax.scatter(X_train, y_train, alpha=0.6, s=30, label='Training data', color='blue')

# Plot test data
ax.scatter(X_test, y_test, alpha=0.3, s=20, label='Test data', color='gray')

# Plot predictions
ax.plot(X_sorted, y_pred_sorted, 'g-', linewidth=2, label='NNBR prediction')

# Plot uncertainty band
ax.fill_between(X_sorted, y_lower_sorted, y_upper_sorted, 
                 alpha=0.3, color='green', label='95% prediction interval (calibrated)')

# Plot true function
X_plot = np.linspace(0, 1, 200)
y_true = dataset._generate_clean(X_plot)
ax.plot(X_plot, y_true, 'k--', linewidth=2, label='True function', zorder=10)

ax.set_xlabel('X')
ax.set_ylabel('y')
ax.set_title(f'NNBR (Calibrated): Coverage = {coverage_br:.3f}, Width = {mean_width_br:.4f}')
ax.legend()
ax.grid(True, alpha=0.3)
plt.show()

## 3. Side-by-Side Comparison

The plots and table in this section compare both methods on the same linear dataset and test split.

In [None]:
# Side-by-side comparison plot
fig, axes = plt.subplots(1, 2, figsize=(18, 6))

# NNGMM plot
ax = axes[0]
sort_idx = np.argsort(X_test.flatten())
X_sorted = X_test[sort_idx].flatten()

ax.scatter(X_train, y_train, alpha=0.6, s=30, label='Training data', color='blue')
ax.scatter(X_test, y_test, alpha=0.3, s=20, label='Test data', color='gray')
ax.plot(X_sorted, y_pred_gmm[sort_idx], 'r-', linewidth=2, label='Prediction')
ax.fill_between(X_sorted, y_lower_gmm[sort_idx], y_upper_gmm[sort_idx], 
                 alpha=0.3, color='red', label='95% interval')
ax.plot(X_plot, y_true, 'k--', linewidth=2, label='True function', zorder=10)
ax.set_xlabel('X')
ax.set_ylabel('y')
ax.set_title(f'NNGMM\nCoverage={coverage_gmm:.3f}, Width={mean_width_gmm:.4f}, RMSE={rmse_gmm:.4f}')
ax.legend()
ax.grid(True, alpha=0.3)

# NNBR plot
ax = axes[1]
ax.scatter(X_train, y_train, alpha=0.6, s=30, label='Training data', color='blue')
ax.scatter(X_test, y_test, alpha=0.3, s=20, label='Test data', color='gray')
ax.plot(X_sorted, y_pred_br[sort_idx], 'g-', linewidth=2, label='Prediction')
ax.fill_between(X_sorted, y_lower_br[sort_idx], y_upper_br[sort_idx], 
                 alpha=0.3, color='green', label='95% interval (calibrated)')
ax.plot(X_plot, y_true, 'k--', linewidth=2, label='True function', zorder=10)
ax.set_xlabel('X')
ax.set_ylabel('y')
ax.set_title(f'NNBR (Calibrated)\nCoverage={coverage_br:.3f}, Width={mean_width_br:.4f}, RMSE={rmse_br:.4f}')
ax.legend()
ax.grid(True, alpha=0.3)

plt.tight_layout()
plt.show()

In [None]:
# Comparison table
comparison_df = pd.DataFrame({
    'Method': ['NNGMM', 'NNBR'],
    'Coverage': [f'{coverage_gmm:.3f}', f'{coverage_br:.3f}'],
    'RMSE': [f'{rmse_gmm:.4f}', f'{rmse_br:.4f}'],
    'Mean Width': [f'{mean_width_gmm:.4f}', f'{mean_width_br:.4f}']
})

print("\nMethod Comparison:")
print(comparison_df.to_string(index=False))
print("\nTarget coverage: 0.95")

## 4. Testing on a Nonlinear Dataset

Next, both methods are fit to an exponential decay dataset, a more challenging nonlinear target than the line used so far.

In [None]:
# Create exponential decay dataset
set_global_seed(42)

exp_dataset = ExponentialDecayDataset(
    a=2.0,
    b=3.0,
    c=0.5,
    n_samples=100,
    noise_model='homoskedastic',
    noise_level=0.05,
    seed=42
)

exp_data = exp_dataset.generate()

X_train_exp = exp_data.X_train.reshape(-1, 1)
y_train_exp = exp_data.y_train
X_test_exp = exp_data.X_test.reshape(-1, 1)
y_test_exp = exp_data.y_test

print("Exponential Decay Dataset:")
print(f"  Training samples: {len(X_train_exp)}")
print(f"  Test samples: {len(X_test_exp)}")

In [None]:
# Train NNGMM on exponential dataset
mlp_gmm_exp = MLPRegressor(
    hidden_layer_sizes=(20,),
    activation='tanh',
    solver='lbfgs',
    max_iter=1000,
    random_state=42
)

nngmm_exp = NeuralNetworkGMM(
    nn=mlp_gmm_exp,
    n_components=3,
    n_samples=500
)

print("Fitting NNGMM on exponential dataset...")
nngmm_exp.fit(X_train_exp, y_train_exp)

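y_pred_gmm_exp, y_std_gmm_exp = nngmm_exp.predict(X_test_exp, return_std=True)
# Flatten both outputs so the interval arithmetic below stays elementwise
y_pred_gmm_exp = np.ravel(y_pred_gmm_exp)
y_std_gmm_exp = np.ravel(y_std_gmm_exp)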

y_lower_gmm_exp = y_pred_gmm_exp - 1.96 * y_std_gmm_exp
y_upper_gmm_exp = y_pred_gmm_exp + 1.96 * y_std_gmm_exp

rmse_gmm_exp = np.sqrt(np.mean((y_test_exp - y_pred_gmm_exp)**2))
coverage_gmm_exp = np.mean((y_test_exp >= y_lower_gmm_exp) & (y_test_exp <= y_upper_gmm_exp))
width_gmm_exp = np.mean(y_upper_gmm_exp - y_lower_gmm_exp)

print(f"NNGMM: Coverage={coverage_gmm_exp:.3f}, RMSE={rmse_gmm_exp:.4f}, Width={width_gmm_exp:.4f}")

In [None]:
# Train NNBR on exponential dataset
mlp_br_exp = MLPRegressor(
    hidden_layer_sizes=(20,),
    activation='tanh',
    solver='lbfgs',
    max_iter=1000,
    random_state=42
)

br_exp = BayesianRidge()
nnbr_exp = NeuralNetworkBLR(nn=mlp_br_exp, br=br_exp)

# Split for calibration
X_train_fit_exp, X_train_val_exp, y_train_fit_exp, y_train_val_exp = train_test_split(
    X_train_exp, y_train_exp, test_size=0.2, random_state=42
)

print("Fitting NNBR on exponential dataset with calibration...")
nnbr_exp.fit(X_train_fit_exp, y_train_fit_exp, val_X=X_train_val_exp, val_y=y_train_val_exp)

y_pred_br_exp, y_std_br_exp = nnbr_exp.predict(X_test_exp, return_std=True)
# Flatten both outputs so the interval arithmetic below stays elementwise
y_pred_br_exp = np.ravel(y_pred_br_exp)
y_std_br_exp = np.ravel(y_std_br_exp)

y_lower_br_exp = y_pred_br_exp - 1.96 * y_std_br_exp
y_upper_br_exp = y_pred_br_exp + 1.96 * y_std_br_exp

rmse_br_exp = np.sqrt(np.mean((y_test_exp - y_pred_br_exp)**2))
coverage_br_exp = np.mean((y_test_exp >= y_lower_br_exp) & (y_test_exp <= y_upper_br_exp))
width_br_exp = np.mean(y_upper_br_exp - y_lower_br_exp)

print(f"NNBR: Coverage={coverage_br_exp:.3f}, RMSE={rmse_br_exp:.4f}, Width={width_br_exp:.4f}")

In [None]:
# Visualize comparison on exponential dataset
fig, axes = plt.subplots(1, 2, figsize=(18, 6))

# NNGMM plot
ax = axes[0]
sort_idx = np.argsort(X_test_exp.flatten())
X_sorted = X_test_exp[sort_idx].flatten()

ax.scatter(X_train_exp, y_train_exp, alpha=0.6, s=30, label='Training data', color='blue')
ax.scatter(X_test_exp, y_test_exp, alpha=0.3, s=20, label='Test data', color='gray')
ax.plot(X_sorted, y_pred_gmm_exp[sort_idx], 'r-', linewidth=2, label='NNGMM prediction')
ax.fill_between(X_sorted, y_lower_gmm_exp[sort_idx], y_upper_gmm_exp[sort_idx], 
                 alpha=0.3, color='red', label='95% interval')

# Plot true function
X_plot = np.linspace(0, 1, 200)
y_true_exp = exp_dataset._generate_clean(X_plot)
ax.plot(X_plot, y_true_exp, 'k--', linewidth=2, label='True function', zorder=10)

ax.set_xlabel('X')
ax.set_ylabel('y')
ax.set_title(f'NNGMM - Exponential Decay\nCov={coverage_gmm_exp:.3f}, Width={width_gmm_exp:.4f}, RMSE={rmse_gmm_exp:.4f}')
ax.legend()
ax.grid(True, alpha=0.3)

# NNBR plot
ax = axes[1]
ax.scatter(X_train_exp, y_train_exp, alpha=0.6, s=30, label='Training data', color='blue')
ax.scatter(X_test_exp, y_test_exp, alpha=0.3, s=20, label='Test data', color='gray')
ax.plot(X_sorted, y_pred_br_exp[sort_idx], 'g-', linewidth=2, label='NNBR prediction')
ax.fill_between(X_sorted, y_lower_br_exp[sort_idx], y_upper_br_exp[sort_idx], 
                 alpha=0.3, color='green', label='95% interval (calibrated)')
ax.plot(X_plot, y_true_exp, 'k--', linewidth=2, label='True function', zorder=10)

ax.set_xlabel('X')
ax.set_ylabel('y')
ax.set_title(f'NNBR - Exponential Decay\nCov={coverage_br_exp:.3f}, Width={width_br_exp:.4f}, RMSE={rmse_br_exp:.4f}')
ax.legend()
ax.grid(True, alpha=0.3)

plt.tight_layout()
plt.show()

## 5. Benchmark Results: Large-Scale Comparison

We've trained both methods on 56 different configurations (7 datasets × 2 noise models × 4 noise levels). Here are the key findings from the benchmark.
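
The count of 56 follows from a full factorial grid. The sketch below enumerates such a grid with `itertools.product`. The dataset names and noise-level labels are placeholders, and `'heteroskedastic'` is an assumed name for the second noise model, since the actual benchmark settings are not listed in this notebook.

```python
from itertools import product

datasets = [f"dataset_{i}" for i in range(1, 8)]             # 7 placeholder names
noise_models = ["homoskedastic", "heteroskedastic"]          # second name is assumed
noise_levels = ["level_1", "level_2", "level_3", "level_4"]  # placeholder labels

# One configuration per (dataset, noise model, noise level) combination
configs = [
    {"dataset": d, "noise_model": m, "noise_level": s}
    for d, m, s in product(datasets, noise_models, noise_levels)
]
n_configs = len(configs)  # 7 * 2 * 4
print(f"Number of configurations: {n_configs}")
```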

In [None]:
# Load benchmark results
nnbr_results = pd.read_csv('../../results/nnbr_fits/nnbr_results_summary.csv')
nngmm_results = pd.read_csv('../../results/nngmm_fits/nngmm_results_summary.csv')

# Convert coverage, RMSE, and mean width to numeric
nnbr_results['Coverage_num'] = nnbr_results['Coverage'].astype(float)
nnbr_results['RMSE_num'] = nnbr_results['RMSE'].astype(float)
nnbr_results['Width_num'] = nnbr_results['Mean Width'].astype(float)

nngmm_results['Coverage_num'] = nngmm_results['Coverage'].astype(float)
nngmm_results['RMSE_num'] = nngmm_results['RMSE'].astype(float)
nngmm_results['Width_num'] = nngmm_results['Mean Width'].astype(float)

print("NNBR Benchmark Summary (56 experiments):")
print(f"  Average Coverage: {nnbr_results['Coverage_num'].mean():.3f}")
print(f"  Average RMSE: {nnbr_results['RMSE_num'].mean():.4f}")
print(f"  Average Width: {nnbr_results['Width_num'].mean():.4f}")
print(f"  Coverage std: {nnbr_results['Coverage_num'].std():.3f}")

print("\nNNGMM Benchmark Summary (56 experiments):")
print(f"  Average Coverage: {nngmm_results['Coverage_num'].mean():.3f}")
print(f"  Average RMSE: {nngmm_results['RMSE_num'].mean():.4f}")
print(f"  Average Width: {nngmm_results['Width_num'].mean():.4f}")
print(f"  Coverage std: {nngmm_results['Coverage_num'].std():.3f}")

In [None]:
# Visualize benchmark comparison
fig, axes = plt.subplots(1, 3, figsize=(18, 5))

# Coverage comparison
ax = axes[0]
ax.hist(nnbr_results['Coverage_num'], bins=20, alpha=0.6, label='NNBR', color='green')
ax.hist(nngmm_results['Coverage_num'], bins=20, alpha=0.6, label='NNGMM', color='red')
ax.axvline(0.95, color='black', linestyle='--', linewidth=2, label='Target (0.95)')
ax.set_xlabel('Coverage')
ax.set_ylabel('Frequency')
ax.set_title('Coverage Distribution\n(56 experiments each)')
ax.legend()
ax.grid(True, alpha=0.3)

# RMSE comparison
ax = axes[1]
ax.hist(nnbr_results['RMSE_num'], bins=20, alpha=0.6, label='NNBR', color='green')
ax.hist(nngmm_results['RMSE_num'], bins=20, alpha=0.6, label='NNGMM', color='red')
ax.set_xlabel('RMSE')
ax.set_ylabel('Frequency')
ax.set_title('RMSE Distribution')
ax.legend()
ax.grid(True, alpha=0.3)

# Width comparison
ax = axes[2]
ax.hist(nnbr_results['Width_num'], bins=20, alpha=0.6, label='NNBR', color='green')
ax.hist(nngmm_results['Width_num'], bins=20, alpha=0.6, label='NNGMM', color='red')
ax.set_xlabel('Mean Interval Width')
ax.set_ylabel('Frequency')
ax.set_title('Interval Width Distribution')
ax.legend()
ax.grid(True, alpha=0.3)

plt.tight_layout()
plt.show()

In [None]:
# Count experiments with good coverage (0.90 - 0.98)
good_coverage_nnbr = ((nnbr_results['Coverage_num'] >= 0.90) & 
                      (nnbr_results['Coverage_num'] <= 0.98)).sum()
good_coverage_nngmm = ((nngmm_results['Coverage_num'] >= 0.90) & 
                       (nngmm_results['Coverage_num'] <= 0.98)).sum()

print("Experiments with good coverage (0.90-0.98):")
print(f"  NNBR: {good_coverage_nnbr}/56 ({100*good_coverage_nnbr/56:.1f}%)")
print(f"  NNGMM: {good_coverage_nngmm}/56 ({100*good_coverage_nngmm/56:.1f}%)")

# Count experiments with negative R² (the model fits worse than predicting the mean)
nnbr_results['R2_num'] = nnbr_results['R²'].astype(float)
nngmm_results['R2_num'] = nngmm_results['R²'].astype(float)

negative_r2_nnbr = (nnbr_results['R2_num'] < 0).sum()
negative_r2_nngmm = (nngmm_results['R2_num'] < 0).sum()

print(f"\nExperiments with negative R² (poor fits):")
print(f"  NNBR: {negative_r2_nnbr}/56 ({100*negative_r2_nnbr/56:.1f}%)")
print(f"  NNGMM: {negative_r2_nngmm}/56 ({100*negative_r2_nngmm/56:.1f}%)")
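
As a reminder of why a negative R² signals a poor fit, the sketch below computes R² from its definition on four toy points. A model that always predicts the mean scores 0, and anything worse than that goes negative. The helper name `r2_manual` is chosen for this example only.

```python
import numpy as np


def r2_manual(y_true, y_pred):
    """Coefficient of determination: 1 - SS_res / SS_tot."""
    y_true = np.ravel(y_true).astype(float)
    y_pred = np.ravel(y_pred).astype(float)
    ss_res = np.sum((y_true - y_pred) ** 2)
    ss_tot = np.sum((y_true - y_true.mean()) ** 2)
    return 1.0 - ss_res / ss_tot


y_obs = np.array([1.0, 2.0, 3.0, 4.0])

# Always predicting the mean gives R² = 0
r2_mean_baseline = r2_manual(y_obs, np.full(4, y_obs.mean()))

# Predicting the trend in reverse is worse than the mean, so R² < 0
r2_bad_model = r2_manual(y_obs, np.array([4.0, 3.0, 2.0, 1.0]))
```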

## 6. Key Findings and Practical Recommendations

### NNBR Strengths:

1. **Better Calibration**: Post-hoc calibration on validation data rescales the predicted uncertainties toward the nominal level
   - Average coverage is closer to the 95% target than NNGMM's (see the benchmark summary in Section 5)
   - Coverage is more consistent across datasets and noise settings

2. **More Stable**: Bayesian linear regression is more robust than GMM fitting
   - Fewer catastrophic failures (negative R², i.e. worse than predicting the mean)
   - More reliable across diverse problem types

3. **Computational Efficiency**: Faster and simpler than GMM approach

### NNGMM Challenges:

1. **Unstable GMM Fitting**:
   - The GMM fit can diverge or produce poor fits, especially on complex datasets
   - More benchmark runs end with negative R² than for NNBR (see the counts in Section 5)
   - Coverage is often far from the 95% target

2. **Lack of Calibration**: No built-in mechanism to adjust uncertainties based on validation performance

3. **Sensitivity to Hyperparameters**: Number of GMM components affects performance

### Practical Recommendations:

1. **Default Choice**: Use **NNBR** for most applications
   - More reliable and stable
   - Better calibrated uncertainties
   - Easier to tune

2. **When to Consider NNGMM**:
   - If you suspect truly multimodal prediction distributions
   - For simple, well-behaved problems where GMM is stable
   - When you can afford to validate GMM quality

3. **Best Practices for NNBR**:
   - Always hold out a validation split for calibration (this notebook uses 20% of the training data)
   - Use 1.96 × std for 95% intervals (Gaussian assumption)
   - Verify calibration quality on held-out data

4. **Neural Network Architecture**:
   - Both methods work with the same kind of NN backend
   - Start with one hidden layer of 20 units and tanh activation, as in the examples above
   - Use the L-BFGS solver (`solver='lbfgs'`) for small to medium datasets

### Calibration Quality:

The key advantage of NNBR is its calibration procedure:
- Learns a scaling factor for uncertainties from validation data
- Corrects for under/over-confident predictions
- Results in coverage closer to nominal level (95%)

NNGMM lacks this mechanism, leading to:
- Highly variable coverage (often too low)
- No automatic adjustment for miscalibration
- Requires manual tuning of GMM components
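
One way to verify calibration on held-out data is to compare nominal and empirical coverage at several confidence levels rather than at 95% only. The sketch below does this under the same Gaussian assumption as the `1.96 × std` rule, using the standard library's `NormalDist` for the quantiles. The function name and the synthetic held-out data are assumptions for illustration.

```python
import numpy as np
from statistics import NormalDist


def empirical_coverage(y_true, y_pred, y_std, levels=(0.5, 0.8, 0.9, 0.95)):
    """Fraction of points inside the central Gaussian interval at each nominal level."""
    abs_z = np.abs(np.ravel(y_true) - np.ravel(y_pred)) / np.ravel(y_std)
    result = {}
    for level in levels:
        z = NormalDist().inv_cdf(0.5 + level / 2.0)  # two-sided Gaussian quantile
        result[level] = float(np.mean(abs_z <= z))
    return result


# Synthetic, perfectly calibrated case: the predicted std equals the true noise std
rng = np.random.default_rng(1)
y_hold = rng.normal(0.0, 0.3, 5000)
coverage_by_level = empirical_coverage(y_hold, np.zeros(5000), np.full(5000, 0.3))

z_95 = NormalDist().inv_cdf(0.975)  # approximately 1.96

for level, cov in coverage_by_level.items():
    print(f"nominal {level:.2f} -> empirical {cov:.3f}")
```

For a well-calibrated model the empirical values track the nominal ones. Empirical coverage consistently below nominal indicates overconfident uncertainties.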

## Summary

This notebook demonstrated:

1. **NNGMM**: Neural network features + GMM for uncertainty
   - Flexible but unstable
   - Prone to poor fits on some datasets
   - No calibration mechanism

2. **NNBR**: Neural network features + Bayesian regression
   - More stable and reliable
   - Post-hoc calibration improves coverage
   - Better performance across diverse datasets

3. **Benchmark Comparison**: Across 56 experiments
   - NNBR shows superior calibration and stability
   - NNGMM struggles on many datasets

4. **Recommendation**: Use NNBR as the default neural network UQ method
   - More predictable behavior
   - Better calibrated uncertainties
   - Easier to deploy in practice

### References

- **pycse library**: https://github.com/jkitchin/pycse
- **NeuralNetworkGMM**: `pycse.sklearn.nngmm.NeuralNetworkGMM`
- **NeuralNetworkBLR**: `pycse.sklearn.nnbr.NeuralNetworkBLR`
- **Benchmark results**: `results/nnbr_fits/` and `results/nngmm_fits/`