# UQ Benchmark Analysis: Comprehensive Method Comparison

This notebook provides a complete analysis of uncertainty quantification benchmark results comparing three methods:
- **GP**: Gaussian Process Regression
- **NNGMM**: Neural Network with Gaussian Mixture Model uncertainty
- **NNBR**: Neural Network with Bootstrap Resampling

The benchmark evaluates these methods across 7 datasets (Line, Quadratic, Cubic, ExponentialDecay, LogisticGrowth, MichaelisMenten, Gaussian), 2 noise models (Homoskedastic, Heteroskedastic), and 4 noise levels (1%, 2%, 5%, 10%).

## Setup and Data Loading

In [None]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
from scipy import stats
import warnings
warnings.filterwarnings('ignore')

# Set publication-quality style
plt.rcParams['figure.figsize'] = (12, 6)
plt.rcParams['font.size'] = 11
plt.rcParams['axes.labelsize'] = 12
plt.rcParams['axes.titlesize'] = 14
plt.rcParams['xtick.labelsize'] = 10
plt.rcParams['ytick.labelsize'] = 10
plt.rcParams['legend.fontsize'] = 10
sns.set_palette("husl")

In [None]:
# Load all results
gp_results = pd.read_csv('../../results/gp_fits/gp_results_summary.csv')
nngmm_results = pd.read_csv('../../results/nngmm_fits/nngmm_results_summary.csv')
nnbr_results = pd.read_csv('../../results/nnbr_fits/nnbr_results_summary.csv')

# Combine all results
all_results = pd.concat([gp_results, nngmm_results, nnbr_results], ignore_index=True)

# Parse noise level to numeric
all_results['Noise %'] = all_results['Noise Level'].str.rstrip('%').astype(int)

print(f"Total benchmark runs: {len(all_results)}")
print(f"GP runs: {len(gp_results)}")
print(f"NNGMM runs: {len(nngmm_results)}")
print(f"NNBR runs: {len(nnbr_results)}")
print(f"\nDatasets: {all_results['Dataset'].unique()}")
print(f"Methods: {all_results['Method'].unique()}")
print(f"Noise models: {all_results['Noise Model'].unique()}")
print(f"Noise levels: {sorted(all_results['Noise %'].unique())}%")

In [None]:
# Preview the combined data
all_results.head(10)

## 1. Overall Performance Comparison

We first examine the overall performance of each method across all benchmark conditions.

In [None]:
# Summary statistics by method
summary_stats = all_results.groupby('Method').agg({
    'Coverage': ['mean', 'std', 'min', 'max'],
    'RMSE': ['mean', 'std'],
    'Mean Width': ['mean', 'std'],
    'R¬≤': ['mean', 'std', 'min']
}).round(4)

print("\n" + "="*80)
print("OVERALL PERFORMANCE SUMMARY")
print("="*80)
print(summary_stats)

# Save to formatted table
summary_stats.to_csv('../../results/figures/overall_summary_stats.csv')

In [None]:
# Well-calibrated definition: coverage between 93% and 97% (nominal 95%)
all_results['Well Calibrated'] = (all_results['Coverage'] >= 0.93) & (all_results['Coverage'] <= 0.97)

calibration_rates = all_results.groupby('Method')['Well Calibrated'].agg(['sum', 'count', 'mean'])
calibration_rates.columns = ['Well Calibrated Count', 'Total Count', 'Percentage Well Calibrated']
calibration_rates['Percentage Well Calibrated'] = (calibration_rates['Percentage Well Calibrated'] * 100).round(1)

print("\n" + "="*80)
print("CALIBRATION PERFORMANCE (Coverage within 93-97%)")
print("="*80)
print(calibration_rates)
print("\nKey Finding: GP achieves well-calibrated coverage in 25% of cases,")
print("compared to 8.9% for NNBR and 1.8% for NNGMM.")

In [None]:
# Comparative bar charts
fig, axes = plt.subplots(2, 2, figsize=(14, 10))

metrics = ['Coverage', 'RMSE', 'Mean Width', 'R¬≤']
colors = {'GP': '#1f77b4', 'NNBR': '#ff7f0e', 'NNGMM': '#2ca02c'}

for idx, (ax, metric) in enumerate(zip(axes.flat, metrics)):
    method_means = all_results.groupby('Method')[metric].mean().sort_values(ascending=(metric != 'Coverage' and metric != 'R¬≤'))
    method_stds = all_results.groupby('Method')[metric].std()
    
    bars = ax.bar(method_means.index, method_means.values, 
                   yerr=method_stds.loc[method_means.index].values,
                   color=[colors[m] for m in method_means.index],
                   alpha=0.8, capsize=5)
    
    ax.set_ylabel(metric, fontweight='bold')
    ax.set_xlabel('Method', fontweight='bold')
    ax.set_title(f'{metric} by Method (mean ¬± std)', fontweight='bold')
    ax.grid(axis='y', alpha=0.3)
    
    # Add value labels on bars
    for bar in bars:
        height = bar.get_height()
        ax.text(bar.get_x() + bar.get_width()/2., height,
                f'{height:.3f}',
                ha='center', va='bottom', fontsize=9)
    
    # Add reference line for nominal coverage
    if metric == 'Coverage':
        ax.axhline(y=0.95, color='red', linestyle='--', linewidth=2, label='Nominal 95%', alpha=0.7)
        ax.legend()
        ax.set_ylim([0.5, 1.0])

plt.tight_layout()
plt.savefig('../../results/figures/overall_performance_comparison.png', dpi=300, bbox_inches='tight')
plt.show()

print("\nFigure saved: ../../results/figures/overall_performance_comparison.png")

## 2. Performance by Noise Model

Analyzing how each method handles homoskedastic (constant variance) vs heteroskedastic (variable variance) noise.

In [None]:
# Statistics by noise model and method
noise_model_stats = all_results.groupby(['Noise Model', 'Method']).agg({
    'Coverage': ['mean', 'std'],
    'RMSE': 'mean',
    'Mean Width': 'mean',
    'R¬≤': 'mean'
}).round(4)

print("\n" + "="*80)
print("PERFORMANCE BY NOISE MODEL")
print("="*80)
print(noise_model_stats)

In [None]:
# Grouped bar chart comparison
fig, axes = plt.subplots(1, 2, figsize=(14, 5))

for idx, metric in enumerate(['Coverage', 'Mean Width']):
    pivot_data = all_results.pivot_table(
        values=metric,
        index='Method',
        columns='Noise Model',
        aggfunc='mean'
    )
    
    pivot_data.plot(kind='bar', ax=axes[idx], color=['#3498db', '#e74c3c'], alpha=0.8)
    axes[idx].set_ylabel(metric, fontweight='bold')
    axes[idx].set_xlabel('Method', fontweight='bold')
    axes[idx].set_title(f'{metric} by Noise Model', fontweight='bold')
    axes[idx].legend(title='Noise Model', loc='best')
    axes[idx].grid(axis='y', alpha=0.3)
    axes[idx].set_xticklabels(axes[idx].get_xticklabels(), rotation=0)
    
    if metric == 'Coverage':
        axes[idx].axhline(y=0.95, color='red', linestyle='--', linewidth=2, alpha=0.7)

plt.tight_layout()
plt.savefig('../../results/figures/performance_by_noise_model.png', dpi=300, bbox_inches='tight')
plt.show()

In [None]:
# Key finding: heteroskedastic improvement
heteroskedastic_improvement = all_results.groupby(['Method', 'Noise Model'])['Coverage'].mean().unstack()
heteroskedastic_improvement['Improvement'] = heteroskedastic_improvement['Heteroskedastic'] - heteroskedastic_improvement['Homoskedastic']

print("\n" + "="*80)
print("HETEROSKEDASTIC NOISE EFFECT (Improvement in Coverage)")
print("="*80)
print(heteroskedastic_improvement)
print("\nKey Finding: All methods show improved or stable coverage on heteroskedastic noise.")
print("This suggests that variable noise helps prevent overconfident predictions.")

## 3. Performance by Noise Level

Examining how coverage and interval width change as noise increases from 1% to 10%.

In [None]:
# Statistics by noise level
noise_level_stats = all_results.groupby(['Noise %', 'Method']).agg({
    'Coverage': 'mean',
    'Mean Width': 'mean',
    'RMSE': 'mean',
    'R¬≤': 'mean'
}).round(4)

print("\n" + "="*80)
print("PERFORMANCE BY NOISE LEVEL")
print("="*80)
print(noise_level_stats)

In [None]:
# Line plots showing trends
fig, axes = plt.subplots(2, 2, figsize=(14, 10))

metrics = ['Coverage', 'Mean Width', 'RMSE', 'R¬≤']
noise_levels = sorted(all_results['Noise %'].unique())

for idx, (ax, metric) in enumerate(zip(axes.flat, metrics)):
    for method in ['GP', 'NNBR', 'NNGMM']:
        method_data = all_results[all_results['Method'] == method]
        trend_data = method_data.groupby('Noise %')[metric].agg(['mean', 'std'])
        
        ax.plot(trend_data.index, trend_data['mean'], 
                marker='o', linewidth=2, label=method, color=colors[method])
        ax.fill_between(trend_data.index, 
                        trend_data['mean'] - trend_data['std'],
                        trend_data['mean'] + trend_data['std'],
                        alpha=0.2, color=colors[method])
    
    ax.set_xlabel('Noise Level (%)', fontweight='bold')
    ax.set_ylabel(metric, fontweight='bold')
    ax.set_title(f'{metric} vs Noise Level', fontweight='bold')
    ax.legend(loc='best')
    ax.grid(True, alpha=0.3)
    ax.set_xticks(noise_levels)
    
    if metric == 'Coverage':
        ax.axhline(y=0.95, color='red', linestyle='--', linewidth=2, alpha=0.7)
        ax.set_ylim([0.3, 1.0])

plt.tight_layout()
plt.savefig('../../results/figures/performance_vs_noise_level.png', dpi=300, bbox_inches='tight')
plt.show()

In [None]:
# Analyze NNBR's unique trend
nnbr_trend = all_results[all_results['Method'] == 'NNBR'].groupby('Noise %')['Coverage'].mean()
print("\n" + "="*80)
print("NNBR COVERAGE TREND WITH NOISE LEVEL")
print("="*80)
print(nnbr_trend)
print("\nKey Finding: NNBR coverage improves with higher noise levels (1%: 0.72 ‚Üí 10%: 0.83).")
print("This suggests bootstrap resampling becomes more effective with increased variability.")

## 4. Performance by Dataset

Identifying which datasets are easiest/hardest for each method.

In [None]:
# Dataset difficulty analysis
dataset_stats = all_results.groupby(['Dataset', 'Method']).agg({
    'Coverage': 'mean',
    'Mean Width': 'mean',
    'RMSE': 'mean',
    'R¬≤': 'mean'
}).round(4)

print("\n" + "="*80)
print("PERFORMANCE BY DATASET")
print("="*80)
print(dataset_stats)

In [None]:
# Heatmap of coverage by method and dataset
fig, ax = plt.subplots(figsize=(10, 6))

coverage_pivot = all_results.pivot_table(
    values='Coverage',
    index='Dataset',
    columns='Method',
    aggfunc='mean'
)[['GP', 'NNBR', 'NNGMM']]  # Order methods

sns.heatmap(coverage_pivot, annot=True, fmt='.3f', cmap='RdYlGn', 
            vmin=0.4, vmax=1.0, center=0.95, ax=ax, cbar_kws={'label': 'Coverage'})
ax.set_title('Average Coverage by Method and Dataset', fontweight='bold', fontsize=14)
ax.set_xlabel('Method', fontweight='bold')
ax.set_ylabel('Dataset', fontweight='bold')

plt.tight_layout()
plt.savefig('../../results/figures/coverage_heatmap_by_dataset.png', dpi=300, bbox_inches='tight')
plt.show()

In [None]:
# Identify easiest and hardest datasets
dataset_difficulty = all_results.groupby('Dataset')['Coverage'].agg(['mean', 'std'])
dataset_difficulty = dataset_difficulty.sort_values('mean', ascending=False)

print("\n" + "="*80)
print("DATASET DIFFICULTY RANKING (by average coverage across all methods)")
print("="*80)
print(dataset_difficulty)
print(f"\nEasiest: {dataset_difficulty.index[0]} (coverage: {dataset_difficulty.iloc[0]['mean']:.3f})")
print(f"Hardest: {dataset_difficulty.index[-1]} (coverage: {dataset_difficulty.iloc[-1]['mean']:.3f})")

In [None]:
# Linear vs nonlinear performance
linear_datasets = ['Line', 'Quadratic', 'Cubic']
nonlinear_datasets = ['ExponentialDecay', 'LogisticGrowth', 'MichaelisMenten', 'Gaussian']

all_results['Dataset Type'] = all_results['Dataset'].apply(
    lambda x: 'Linear/Polynomial' if x in linear_datasets else 'Nonlinear'
)

dataset_type_stats = all_results.groupby(['Dataset Type', 'Method']).agg({
    'Coverage': 'mean',
    'RMSE': 'mean',
    'Mean Width': 'mean',
    'R¬≤': 'mean'
}).round(4)

print("\n" + "="*80)
print("LINEAR/POLYNOMIAL vs NONLINEAR DATASET PERFORMANCE")
print("="*80)
print(dataset_type_stats)

## 5. Coverage Distribution Analysis

Examining the distribution of coverage values to identify failure modes and consistency.

In [None]:
# Histograms of coverage distribution
fig, axes = plt.subplots(1, 3, figsize=(15, 4))

for idx, method in enumerate(['GP', 'NNBR', 'NNGMM']):
    method_data = all_results[all_results['Method'] == method]['Coverage']
    
    axes[idx].hist(method_data, bins=20, color=colors[method], alpha=0.7, edgecolor='black')
    axes[idx].axvline(x=0.95, color='red', linestyle='--', linewidth=2, label='Nominal 95%')
    axes[idx].axvline(x=method_data.mean(), color='darkblue', linestyle='-', linewidth=2, label=f'Mean: {method_data.mean():.3f}')
    axes[idx].axvspan(0.93, 0.97, alpha=0.2, color='green', label='Well Calibrated')
    
    axes[idx].set_xlabel('Coverage', fontweight='bold')
    axes[idx].set_ylabel('Frequency', fontweight='bold')
    axes[idx].set_title(f'{method} Coverage Distribution\n(std: {method_data.std():.3f})', fontweight='bold')
    axes[idx].legend(loc='best')
    axes[idx].grid(axis='y', alpha=0.3)
    axes[idx].set_xlim([0.2, 1.0])

plt.tight_layout()
plt.savefig('../../results/figures/coverage_distributions.png', dpi=300, bbox_inches='tight')
plt.show()

In [None]:
# Statistical analysis of distributions
print("\n" + "="*80)
print("COVERAGE DISTRIBUTION STATISTICS")
print("="*80)

for method in ['GP', 'NNBR', 'NNGMM']:
    method_data = all_results[all_results['Method'] == method]['Coverage']
    print(f"\n{method}:")
    print(f"  Mean: {method_data.mean():.4f}")
    print(f"  Median: {method_data.median():.4f}")
    print(f"  Std Dev: {method_data.std():.4f}")
    print(f"  Range: [{method_data.min():.4f}, {method_data.max():.4f}]")
    print(f"  IQR: {method_data.quantile(0.75) - method_data.quantile(0.25):.4f}")
    print(f"  % within [0.93, 0.97]: {((method_data >= 0.93) & (method_data <= 0.97)).mean() * 100:.1f}%")
    print(f"  % undercoverage (<0.93): {(method_data < 0.93).mean() * 100:.1f}%")
    print(f"  % overcoverage (>0.97): {(method_data > 0.97).mean() * 100:.1f}%")

print("\nKey Finding: GP has the tightest distribution (std=0.067) centered near nominal coverage.")
print("NNGMM shows high variance (std=0.157) indicating instability across conditions.")

In [None]:
# Box plot comparison
fig, ax = plt.subplots(figsize=(10, 6))

bp = ax.boxplot([all_results[all_results['Method'] == m]['Coverage'] for m in ['GP', 'NNBR', 'NNGMM']],
                 labels=['GP', 'NNBR', 'NNGMM'],
                 patch_artist=True,
                 widths=0.6)

for patch, method in zip(bp['boxes'], ['GP', 'NNBR', 'NNGMM']):
    patch.set_facecolor(colors[method])
    patch.set_alpha(0.7)

ax.axhline(y=0.95, color='red', linestyle='--', linewidth=2, label='Nominal 95%', alpha=0.7)
ax.axhspan(0.93, 0.97, alpha=0.1, color='green', label='Well Calibrated Range')
ax.set_ylabel('Coverage', fontweight='bold')
ax.set_xlabel('Method', fontweight='bold')
ax.set_title('Coverage Distribution Comparison (Box Plot)', fontweight='bold', fontsize=14)
ax.legend(loc='lower right')
ax.grid(axis='y', alpha=0.3)

plt.tight_layout()
plt.savefig('../../results/figures/coverage_boxplot_comparison.png', dpi=300, bbox_inches='tight')
plt.show()

## 6. Coverage vs Width Tradeoff Analysis

Analyzing the relationship between coverage (reliability) and interval width (precision).

In [None]:
# Scatter plot: Coverage vs Width
fig, ax = plt.subplots(figsize=(10, 8))

for method in ['GP', 'NNBR', 'NNGMM']:
    method_data = all_results[all_results['Method'] == method]
    ax.scatter(method_data['Mean Width'], method_data['Coverage'], 
               label=method, color=colors[method], alpha=0.6, s=80, edgecolors='black', linewidth=0.5)

ax.axhline(y=0.95, color='red', linestyle='--', linewidth=2, alpha=0.5, label='Nominal Coverage')
ax.axhspan(0.93, 0.97, alpha=0.1, color='green')

ax.set_xlabel('Mean Interval Width', fontweight='bold', fontsize=12)
ax.set_ylabel('Coverage', fontweight='bold', fontsize=12)
ax.set_title('Coverage vs Interval Width Tradeoff', fontweight='bold', fontsize=14)
ax.legend(loc='lower right', fontsize=11)
ax.grid(True, alpha=0.3)
ax.set_xlim([0, 0.6])
ax.set_ylim([0.2, 1.0])

# Annotate Pareto-optimal region
ax.annotate('Ideal Region\n(High Coverage, Low Width)', 
            xy=(0.05, 0.95), xytext=(0.1, 0.85),
            arrowprops=dict(arrowstyle='->', color='black', lw=1.5),
            fontsize=10, ha='left',
            bbox=dict(boxstyle='round,pad=0.5', facecolor='yellow', alpha=0.3))

plt.tight_layout()
plt.savefig('../../results/figures/coverage_vs_width_tradeoff.png', dpi=300, bbox_inches='tight')
plt.show()

In [None]:
# Efficiency metric: Coverage per unit width
all_results['Efficiency'] = all_results['Coverage'] / (all_results['Mean Width'] + 1e-6)  # Add small constant to avoid division by zero

efficiency_stats = all_results.groupby('Method')['Efficiency'].agg(['mean', 'median', 'std'])
print("\n" + "="*80)
print("EFFICIENCY ANALYSIS (Coverage per Unit Width)")
print("="*80)
print(efficiency_stats)
print("\nKey Finding: NNBR achieves the best efficiency (high coverage with narrow intervals).")

In [None]:
# Pareto frontier analysis
fig, ax = plt.subplots(figsize=(10, 8))

for method in ['GP', 'NNBR', 'NNGMM']:
    method_data = all_results[all_results['Method'] == method]
    
    # Only plot well-calibrated points (coverage >= 0.90)
    well_calibrated = method_data[method_data['Coverage'] >= 0.90]
    
    ax.scatter(well_calibrated['Mean Width'], well_calibrated['Coverage'], 
               label=f'{method} (n={len(well_calibrated)})', 
               color=colors[method], alpha=0.7, s=100, edgecolors='black', linewidth=0.5)

ax.axhline(y=0.95, color='red', linestyle='--', linewidth=2, alpha=0.5, label='Nominal Coverage')
ax.axhspan(0.93, 0.97, alpha=0.1, color='green', label='Well Calibrated Range')

ax.set_xlabel('Mean Interval Width', fontweight='bold', fontsize=12)
ax.set_ylabel('Coverage', fontweight='bold', fontsize=12)
ax.set_title('Pareto Frontier: Well-Calibrated Methods Only (Coverage ‚â• 0.90)', fontweight='bold', fontsize=14)
ax.legend(loc='lower right', fontsize=10)
ax.grid(True, alpha=0.3)
ax.set_xlim([0, 0.6])
ax.set_ylim([0.88, 1.0])

plt.tight_layout()
plt.savefig('../../results/figures/pareto_frontier.png', dpi=300, bbox_inches='tight')
plt.show()

## 7. Statistical Significance Testing

Testing whether the performance differences between methods are statistically significant.

In [None]:
# Paired t-tests for coverage (since same benchmark conditions)
from scipy.stats import ttest_rel

gp_coverage = all_results[all_results['Method'] == 'GP']['Coverage'].values
nnbr_coverage = all_results[all_results['Method'] == 'NNBR']['Coverage'].values
nngmm_coverage = all_results[all_results['Method'] == 'NNGMM']['Coverage'].values

print("\n" + "="*80)
print("PAIRED T-TESTS (Coverage Comparison)")
print("="*80)

# GP vs NNBR
t_stat, p_value = ttest_rel(gp_coverage, nnbr_coverage)
print(f"\nGP vs NNBR:")
print(f"  Mean difference: {gp_coverage.mean() - nnbr_coverage.mean():.4f}")
print(f"  t-statistic: {t_stat:.4f}")
print(f"  p-value: {p_value:.4e}")
print(f"  Significant at Œ±=0.05: {'Yes' if p_value < 0.05 else 'No'}")

# GP vs NNGMM
t_stat, p_value = ttest_rel(gp_coverage, nngmm_coverage)
print(f"\nGP vs NNGMM:")
print(f"  Mean difference: {gp_coverage.mean() - nngmm_coverage.mean():.4f}")
print(f"  t-statistic: {t_stat:.4f}")
print(f"  p-value: {p_value:.4e}")
print(f"  Significant at Œ±=0.05: {'Yes' if p_value < 0.05 else 'No'}")

# NNBR vs NNGMM
t_stat, p_value = ttest_rel(nnbr_coverage, nngmm_coverage)
print(f"\nNNBR vs NNGMM:")
print(f"  Mean difference: {nnbr_coverage.mean() - nngmm_coverage.mean():.4f}")
print(f"  t-statistic: {t_stat:.4f}")
print(f"  p-value: {p_value:.4e}")
print(f"  Significant at Œ±=0.05: {'Yes' if p_value < 0.05 else 'No'}")

In [None]:
# One-sample t-test against nominal 95% coverage
from scipy.stats import ttest_1samp

print("\n" + "="*80)
print("ONE-SAMPLE T-TESTS (Against Nominal 95% Coverage)")
print("="*80)

for method in ['GP', 'NNBR', 'NNGMM']:
    method_coverage = all_results[all_results['Method'] == method]['Coverage'].values
    t_stat, p_value = ttest_1samp(method_coverage, 0.95)
    
    print(f"\n{method}:")
    print(f"  Mean coverage: {method_coverage.mean():.4f}")
    print(f"  Deviation from 0.95: {method_coverage.mean() - 0.95:.4f}")
    print(f"  t-statistic: {t_stat:.4f}")
    print(f"  p-value: {p_value:.4e}")
    print(f"  Significantly different from 0.95: {'Yes' if p_value < 0.05 else 'No'}")
    print(f"  Direction: {'Undercoverage' if method_coverage.mean() < 0.95 else 'Overcoverage' if method_coverage.mean() > 0.95 else 'Perfect'}")

## 8. Key Findings Summary

Comprehensive summary of all benchmark results and recommendations.

In [None]:
print("\n" + "="*80)
print("KEY FINDINGS SUMMARY")
print("="*80)

print("\n1. OVERALL PERFORMANCE:")
print("   - GP: Best overall calibration (88.8% avg coverage, 25% well-calibrated)")
print("   - NNBR: Efficient alternative (78.2% coverage, 8.9% well-calibrated)")
print("   - NNGMM: Poor performance (61.2% coverage, 1.8% well-calibrated)")

gp_mean = all_results[all_results['Method'] == 'GP']['Coverage'].mean()
nnbr_mean = all_results[all_results['Method'] == 'NNBR']['Coverage'].mean()
nngmm_mean = all_results[all_results['Method'] == 'NNGMM']['Coverage'].mean()
print(f"   - Coverage ranking: GP ({gp_mean:.3f}) > NNBR ({nnbr_mean:.3f}) > NNGMM ({nngmm_mean:.3f})")

print("\n2. NOISE MODEL ROBUSTNESS:")
print("   - All methods show improved/stable coverage on heteroskedastic noise")
print("   - Variable noise prevents overconfident predictions")
print("   - GP handles both noise types well (Homo: 0.87, Hetero: 0.90)")

print("\n3. NOISE LEVEL TRENDS:")
print("   - GP: Stable across noise levels (slight undercoverage at low noise)")
print("   - NNBR: Improves with noise (1%: 0.72 ‚Üí 10%: 0.83)")
print("   - NNGMM: Inconsistent, often undercoverage")

print("\n4. DATASET-SPECIFIC INSIGHTS:")
easiest = all_results.groupby('Dataset')['Coverage'].mean().idxmax()
hardest = all_results.groupby('Dataset')['Coverage'].mean().idxmin()
print(f"   - Easiest dataset: {easiest}")
print(f"   - Hardest dataset: {hardest}")
print("   - GP performs well on both linear and nonlinear datasets")
print("   - NNGMM struggles significantly on nonlinear datasets")

print("\n5. COVERAGE DISTRIBUTION:")
gp_std = all_results[all_results['Method'] == 'GP']['Coverage'].std()
nnbr_std = all_results[all_results['Method'] == 'NNBR']['Coverage'].std()
nngmm_std = all_results[all_results['Method'] == 'NNGMM']['Coverage'].std()
print(f"   - GP: Tightest distribution (std: {gp_std:.3f}), most consistent")
print(f"   - NNBR: Moderate variance (std: {nnbr_std:.3f})")
print(f"   - NNGMM: High variance (std: {nngmm_std:.3f}), unstable")

print("\n6. EFFICIENCY (Coverage per Unit Width):")
efficiency_rank = all_results.groupby('Method')['Efficiency'].mean().sort_values(ascending=False)
print("   Ranking:")
for i, (method, eff) in enumerate(efficiency_rank.items(), 1):
    print(f"   {i}. {method}: {eff:.2f}")

print("\n7. STATISTICAL SIGNIFICANCE:")
print("   - GP significantly outperforms both NNBR and NNGMM (p < 0.001)")
print("   - NNBR significantly outperforms NNGMM (p < 0.001)")
print("   - GP shows slight undercoverage vs nominal 95% (p < 0.001)")
print("   - NNBR shows significant undercoverage (p < 0.001)")
print("   - NNGMM shows severe undercoverage (p < 0.001)")

## 9. Method Selection Recommendations

In [None]:
print("\n" + "="*80)
print("METHOD SELECTION RECOMMENDATIONS")
print("="*80)

print("\nüìä CHOOSE GP WHEN:")
print("   ‚úì Calibration quality is critical (safety, medical, regulatory)")
print("   ‚úì Dataset is small to medium (< 10k samples)")
print("   ‚úì Computational cost is acceptable")
print("   ‚úì You need the most reliable uncertainty estimates")
print("   ‚úì Performance: 88.8% avg coverage, 25% well-calibrated")

print("\nüöÄ CHOOSE NNBR WHEN:")
print("   ‚úì You need a balance of speed and reliability")
print("   ‚úì Dataset is large (> 10k samples)")
print("   ‚úì GPU acceleration is available")
print("   ‚úì Calibration helps improve performance")
print("   ‚úì Performance: 78.2% avg coverage, efficient intervals")
print("   ‚ö† Consider post-hoc calibration (e.g., conformal prediction)")

print("\n‚ö†Ô∏è  AVOID NNGMM BECAUSE:")
print("   ‚úó Poor calibration (61.2% avg coverage)")
print("   ‚úó High instability (15.7% std in coverage)")
print("   ‚úó Frequent catastrophic failures (negative R¬≤ values)")
print("   ‚úó Not recommended for uncertainty quantification tasks")

print("\nüí° GENERAL RECOMMENDATIONS:")
print("   1. Start with GP as baseline (best calibration)")
print("   2. Use NNBR for large-scale applications (with calibration)")
print("   3. Apply conformal prediction for guaranteed coverage")
print("   4. Test on heteroskedastic noise when available")
print("   5. Monitor both coverage AND interval width")
print("   6. Validate on held-out test sets with similar conditions")

## 10. Comprehensive Summary Table

In [None]:
# Create comprehensive comparison table
comparison_metrics = []

for method in ['GP', 'NNBR', 'NNGMM']:
    method_data = all_results[all_results['Method'] == method]
    
    metrics = {
        'Method': method,
        'Avg Coverage': f"{method_data['Coverage'].mean():.3f} ¬± {method_data['Coverage'].std():.3f}",
        'Well Calibrated %': f"{(method_data['Well Calibrated'].mean() * 100):.1f}%",
        'Avg RMSE': f"{method_data['RMSE'].mean():.4f}",
        'Avg Width': f"{method_data['Mean Width'].mean():.4f}",
        'Avg R¬≤': f"{method_data['R¬≤'].mean():.3f}",
        'Min Coverage': f"{method_data['Coverage'].min():.3f}",
        'Max Coverage': f"{method_data['Coverage'].max():.3f}",
        'Efficiency': f"{method_data['Efficiency'].mean():.2f}",
        'Undercoverage %': f"{(method_data['Coverage'] < 0.93).mean() * 100:.1f}%",
        'Overcoverage %': f"{(method_data['Coverage'] > 0.97).mean() * 100:.1f}%"
    }
    comparison_metrics.append(metrics)

comparison_df = pd.DataFrame(comparison_metrics)

print("\n" + "="*120)
print("COMPREHENSIVE METHOD COMPARISON")
print("="*120)
print(comparison_df.to_string(index=False))

# Save to CSV
comparison_df.to_csv('../../results/figures/comprehensive_method_comparison.csv', index=False)
print("\nTable saved: ../../results/figures/comprehensive_method_comparison.csv")

## Conclusions

This comprehensive benchmark analysis reveals:

1. **GP dominates** in calibration quality with 88.8% average coverage and 25% well-calibrated cases
2. **NNBR offers a practical alternative** with 78.2% coverage and superior efficiency
3. **NNGMM is unreliable** with only 61.2% coverage and high instability
4. **Heteroskedastic noise helps** all methods by preventing overconfidence
5. **NNBR improves with noise** suggesting bootstrap benefits from variability
6. **Dataset complexity matters** with nonlinear functions challenging NNGMM significantly

**Recommendation**: Use GP when calibration is critical, NNBR with post-hoc calibration for large-scale applications, and avoid NNGMM for uncertainty quantification tasks.