# Paper Figures: Fighting Sampling Bias

This notebook recreates all figures and tables from the paper:

**Figures:**
- Figure 2: Loss due to sampling bias (5 panels: a-e)
- Figure 3: BASL sensitivity analysis  
- Figure 4: Feature bias analysis

**Tables:**
- Table 1: Bayesian evaluation comparison
- Table 2: BASL training effectiveness

In [None]:
import json
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
from pathlib import Path
from typing import Dict, List, Optional, Tuple

# Style settings for paper-quality figures
plt.rcParams.update({
    'font.size': 10,
    'axes.labelsize': 11,
    'axes.titlesize': 11,
    'legend.fontsize': 9,
    'xtick.labelsize': 9,
    'ytick.labelsize': 9,
    'figure.dpi': 100,
    'axes.grid': True,
    'grid.alpha': 0.3,
})

EXPERIMENTS_DIR = Path('../experiments')

## Helper Functions

In [None]:
def load_latest_experiment(prefix: str) -> Tuple[Path, dict]:
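    """Load the most recent experiment matching a prefix.

    "Most recent" means last in lexicographic order of directory name, which
    assumes the names end in a sortable timestamp.
    """
    exp_dirs = sorted(
        (p for p in EXPERIMENTS_DIR.glob(f"{prefix}*") if p.is_dir()),
        reverse=True,
    )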
    if not exp_dirs:
        raise FileNotFoundError(f"No experiments found with prefix: {prefix}")
    
    exp_dir = exp_dirs[0]
    print(f"Loading experiment: {exp_dir.name}")
    return exp_dir, load_experiment(exp_dir)


def load_experiment(exp_dir: Path) -> dict:
    """Load all data from an experiment directory."""
    data = {}
    
    # Load metrics history if available (sorted so the choice is deterministic)
    metrics_files = sorted(exp_dir.glob("metrics_history_*.json"))
    if metrics_files:
        with open(metrics_files[0]) as f:
            data['metrics_history'] = json.load(f)
    
    # Load aggregated results if available
    agg_file = exp_dir / "aggregated.json"
    if agg_file.exists():
        with open(agg_file) as f:
            data['aggregated'] = json.load(f)
    
    # Load Figure 2 data if available (Experiment 1: panels a-c);
    # sorted so the choice is deterministic
    figure2_files = sorted(exp_dir.glob("figure2_data_*.json"))
    if figure2_files:
        with open(figure2_files[0]) as f:
            data['figure2_data'] = json.load(f)
    
    # Load BASL-LR data if available (Experiment 2: for panels b-c bridge)
    basl_lr_file = exp_dir / "basl_lr_data.json"
    if basl_lr_file.exists():
        with open(basl_lr_file) as f:
            data['basl_lr_data'] = json.load(f)
    
    # Load feature bias analysis if available (Experiment 2 only)
    bias_file = exp_dir / "feature_bias_analysis.json"
    if bias_file.exists():
        with open(bias_file) as f:
            data['feature_bias'] = json.load(f)
    
    return data


def extract_metric_series(metrics_history: list, metric: str = 'abr') -> dict:
    """Extract a metric series from metrics history for all model types."""
    iterations = [m['iteration'] for m in metrics_history]
    
    series = {'iteration': iterations}
    
    model_types = ['oracle', 'model_holdout', 'accepts', 'bayesian']
    for model_type in model_types:
        if model_type in metrics_history[0]:
            series[model_type] = [m[model_type][metric] for m in metrics_history]
    
    return series

## Load Experiment Data

In [None]:
# Load Experiment 1: Bayesian Evaluation (Figure 2 panels a-d)
try:
    exp1_dir, exp1_data = load_latest_experiment('exp1_bayesian_eval')
    print(f"Exp 1 keys: {list(exp1_data.keys())}")
except FileNotFoundError as e:
    print(f"Warning: {e}")
    exp1_data = None

# Load Experiment 2: BASL Training (Figure 2 panel e, Figures 3-4, Table 2)
try:
    exp2_dir, exp2_data = load_latest_experiment('exp2_basl_training')
    print(f"Exp 2 keys: {list(exp2_data.keys())}")
except FileNotFoundError as e:
    print(f"Warning: {e}")
    exp2_data = None

---

## Figure 2: Loss Due to Sampling Bias

Five panels showing how sampling bias propagates:
- **(a) Bias in Data**: Feature distributions (Population vs Accepts vs Rejects)
- **(b) Bias in Model**: LR Coefficients (Accepts-only vs Oracle vs BASL) - 2 features for interpretability
- **(c) Bias in Predictions**: LR Score distributions (Accepts-only vs Oracle vs BASL)
- **(d) Impact on Evaluation**: ABR over iterations (Bayesian vs Accepts-only) - Experiment I core result
- **(e) Impact on Training**: ABR over iterations (BASL vs Accepts-only) - Experiment II dynamics

**Note:** Panels (b) and (c) are "bridge panels" using Logistic Regression for interpretability.
BASL-LR data comes from Experiment 2 and is combined with Experiment 1 data.
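
To make the mechanism behind panel (a) concrete, the sketch below simulates an acceptance rule on a single synthetic feature. It is a toy illustration, not the paper's data generator: the Gaussian feature, the 30% acceptance rate, and the names `simulate_acceptance` and `accept_rate` are assumptions made for this example. It follows the convention used in the plots that lower X1 means lower risk, so the applicants with the lowest X1 are accepted.

```python
import random
import statistics


def simulate_acceptance(n=10_000, accept_rate=0.3, seed=0):
    """Split a synthetic population into accepts and rejects by one feature.

    Hypothetical setup: X1 ~ N(0, 1), and the lender accepts the
    `accept_rate` fraction of applicants with the lowest X1.
    """
    rng = random.Random(seed)
    population = [rng.gauss(0.0, 1.0) for _ in range(n)]
    ranked = sorted(population)
    n_accept = int(n * accept_rate)
    accepts, rejects = ranked[:n_accept], ranked[n_accept:]
    return population, accepts, rejects


population, accepts, rejects = simulate_acceptance()
for name, values in [("Population", population),
                     ("Accepts", accepts),
                     ("Rejects", rejects)]:
    print(f"{name:<10} n={len(values):>5}  mean X1={statistics.fmean(values):+.3f}")
```

The accepts' mean falls below the population mean and the rejects' mean above it. This is the covariate shift that panel (a) visualizes: a model trained only on accepts never sees the high-X1 region.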

In [None]:
def plot_figure_2(exp1_data: Optional[dict], exp2_data: Optional[dict] = None):
    """Plot Figure 2: Complete 5-panel visualization.
    
    Panels (a), (d): From Experiment 1
    Panels (b), (c): Bridge panels - Exp1 (Accepts, Oracle LR) + Exp2 (BASL-LR)
    Panel (e): From Experiment 2
    """
    
    fig = plt.figure(figsize=(16, 10))
    gs = fig.add_gridspec(2, 3, hspace=0.35, wspace=0.35)
    
    # Panel (a): Bias in Data - x_v (most separating feature, typically X1) distributions
    ax_a = fig.add_subplot(gs[0, 0])
    if exp1_data and 'figure2_data' in exp1_data:
        panel_a = exp1_data['figure2_data']['panel_a']
        x_v_feature = panel_a.get('x_v_feature', 'X1')  # Paper-faithful: X1
        
        # Plot density histograms for x_v (most separating feature)
        bins = 50
        ax_a.hist(panel_a['population_x_v'], bins=bins, alpha=0.4, 
                  label='Population', density=True, color='gray')
        ax_a.hist(panel_a['accepts_x_v'], bins=bins, alpha=0.5, 
                  label='Accepts', density=True, color='blue')
        ax_a.hist(panel_a['rejects_x_v'], bins=bins, alpha=0.5, 
                  label='Rejects', density=True, color='red')
        
        ax_a.set_xlabel(f'Feature {x_v_feature} (lower = lower risk)')
        ax_a.set_ylabel('Density')
        ax_a.set_title('(a) Bias in Data')
        ax_a.legend()
    
    # Panel (b): Bias in Model - LR coefficients (paper: Intercept, X1, X2, N1, N2)
    # Three bars per feature: Accepts, Oracle, BASL
    ax_b = fig.add_subplot(gs[0, 1])
    if exp1_data and 'figure2_data' in exp1_data:
        panel_b = exp1_data['figure2_data']['panel_b']

        # Paper-faithful x-axis: Intercept, X1, X2, N1, N2
        feature_names = panel_b['feature_names']

        # Build coefficient arrays: [intercept, coef1, coef2, ...]
        accepts_coefs = [panel_b['accepts_intercept']] + panel_b['accepts_coefficients']
        oracle_coefs = [panel_b['oracle_intercept']] + panel_b['oracle_coefficients']

        # Check if BASL-LR data available from Exp 2 with matching features
        basl_coefs = None
        if exp2_data and 'basl_lr_data' in exp2_data:
            basl_lr = exp2_data['basl_lr_data'][0]  # First seed
            basl_coefs_raw = [basl_lr['basl_intercept']] + basl_lr['basl_coefficients']
            # Only use BASL coefs if they match Exp1 feature count
            if len(basl_coefs_raw) == len(feature_names):
                basl_coefs = basl_coefs_raw
            else:
                print(f"Warning: BASL has {len(basl_coefs_raw)} coefs but Exp1 has {len(feature_names)} features. Skipping BASL in panel (b).")

        x_pos = np.arange(len(feature_names))
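        # Center each bar group on its tick: three bars with BASL, two without
        if basl_coefs:
            width, offsets = 0.25, (-0.25, 0.0, 0.25)
        else:
            width, offsets = 0.35, (-0.175, 0.175)

        ax_b.bar(x_pos + offsets[0], accepts_coefs, width,
                 label='Accepts-only', alpha=0.7, color='red')
        ax_b.bar(x_pos + offsets[1], oracle_coefs, width,
                 label='Oracle', alpha=0.7, color='green')
        if basl_coefs:
            ax_b.bar(x_pos + offsets[2], basl_coefs, width,
                     label='BASL', alpha=0.7, color='blue')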

        ax_b.set_xlabel('Coefficient')
        ax_b.set_ylabel('Value')
        ax_b.set_title('(b) Bias in Model (LR surrogate)')
        ax_b.set_xticks(x_pos)
        ax_b.set_xticklabels(feature_names, rotation=45, ha='right')
        ax_b.legend()
        ax_b.axhline(y=0, color='k', linestyle='-', linewidth=0.5)
    
    # Panel (c): Bias in Predictions - LR score distributions (3 models)
    ax_c = fig.add_subplot(gs[0, 2])
    if exp1_data and 'figure2_data' in exp1_data:
        panel_c = exp1_data['figure2_data']['panel_c']
        
        bins = 50
        ax_c.hist(panel_c['accepts_scores'], bins=bins, alpha=0.5,
                  label='Accepts-only', density=True, color='red')
        ax_c.hist(panel_c['oracle_scores'], bins=bins, alpha=0.5,
                  label='Oracle', density=True, color='green')
        
        # Add BASL scores if available from Exp 2
        if exp2_data and 'basl_lr_data' in exp2_data:
            basl_lr = exp2_data['basl_lr_data'][0]
            ax_c.hist(basl_lr['basl_scores'], bins=bins, alpha=0.5,
                      label='BASL', density=True, color='blue')
        
        ax_c.set_xlabel('Predicted Score (P(bad))')
        ax_c.set_ylabel('Density')
        ax_c.set_title('(c) Bias in Predictions (LR)')
        ax_c.legend()
    
    # Panel (d): Impact on Evaluation - Experiment I core result
    ax_d = fig.add_subplot(gs[1, 0:2])
    if exp1_data and 'metrics_history' in exp1_data:
        series = extract_metric_series(exp1_data['metrics_history'], 'abr')
        
        ax_d.plot(series['iteration'], series['oracle'], 'k-', 
                 linewidth=2, label='Oracle', marker='o', markersize=3, markevery=10)
        ax_d.plot(series['iteration'], series['accepts'], 'r--', 
                 linewidth=2, label='Accepts-only', marker='s', markersize=3, markevery=10)
        ax_d.plot(series['iteration'], series['bayesian'], 'b-', 
                 linewidth=2, label='Bayesian', marker='^', markersize=3, markevery=10)
        
        ax_d.set_xlabel('Acceptance Loop Iteration')
        ax_d.set_ylabel('ABR (Average Bad Rate)')
        ax_d.set_title('(d) Impact on Evaluation: Bayesian vs Accepts-only (Exp I)')
        ax_d.legend(loc='best')
        ax_d.set_ylim(0, 0.8)
        ax_d.grid(True, alpha=0.3)
    
    # Panel (e): Impact on Training - Experiment II dynamics
    ax_e = fig.add_subplot(gs[1, 2])
    if exp2_data and 'metrics_history' in exp2_data:
        series = extract_metric_series(exp2_data['metrics_history'], 'abr')
        
        # Oracle line (constant)
        ax_e.plot(series['iteration'], series['oracle'], 'k-', 
                 linewidth=2, label='Oracle', marker='o', markersize=3, markevery=10)
        
        # Accepts-only baseline (constant reference from aggregated data)
        if 'aggregated' in exp2_data:
            agg = exp2_data['aggregated']
            baseline_key = 'xgb_accepts_abr_mean' if 'xgb_accepts_abr_mean' in agg else 'accepts_abr_mean'
            baseline_abr = agg[baseline_key]
            ax_e.axhline(y=baseline_abr, color='r', linestyle='--', 
                        linewidth=2, label='Accepts-only (baseline)')
        
        # BASL model evolution
        if 'model_holdout' in series:
            ax_e.plot(series['iteration'], series['model_holdout'], 'b-', 
                     linewidth=2, label='BASL', marker='^', markersize=3, markevery=10)
        
        ax_e.set_xlabel('Iteration')
        ax_e.set_ylabel('ABR')
        ax_e.set_title('(e) Impact on Training:\nBASL vs Accepts-only (Exp II)')
        ax_e.legend(loc='best', fontsize=8)
        ax_e.set_ylim(0, 0.8)
        ax_e.grid(True, alpha=0.3)
    
    plt.suptitle('Figure 2: Loss Due to Sampling Bias', fontsize=14, fontweight='bold', y=0.995)
    plt.show()
    
    return fig

# Plot Figure 2
fig2 = plot_figure_2(exp1_data, exp2_data)

---

## Table 1: Bayesian Evaluation Reliability

RMSE between estimated and oracle (true) metrics.
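
Written out, and assuming each metric is compared iteration by iteration, the quantity reported per method and metric is the following. The symbols are notation introduced for this note only: `T` is the number of iterations kept, `m_t^oracle` is the oracle value of the metric at iteration `t`, and `\hat{m}_t` is the value estimated by the method (accepts-only or Bayesian).

```latex
\mathrm{RMSE} = \sqrt{\frac{1}{T} \sum_{t=1}^{T} \left( m_t^{\mathrm{oracle}} - \hat{m}_t \right)^{2}}
```

An RMSE of zero would mean the method reproduces the oracle value at every iteration.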

In [None]:
def compute_rmse(true_values: list, estimated_values: list) -> float:
    """Compute Root Mean Square Error."""
    true_arr = np.array(true_values)
    est_arr = np.array(estimated_values)
    return np.sqrt(np.mean((true_arr - est_arr) ** 2))


def create_table_1(exp1_data: Optional[dict]) -> Optional[pd.DataFrame]:
    """Create Table 1: Evaluation method comparison."""
    
    if not exp1_data or 'metrics_history' not in exp1_data:
        print("Experiment 1 data not available")
        return None
    
    metrics_history = exp1_data['metrics_history']
    
    # Skip first iteration (initialization)
    metrics_history = [m for m in metrics_history if m['iteration'] > 0]
    
    # Extract metrics
    metrics = ['auc', 'brier', 'abr']
    
    results = {}
    
    for method in ['accepts', 'bayesian']:
        results[method] = {}
        for metric in metrics:
            oracle_vals = [m['oracle'][metric] for m in metrics_history]
            method_vals = [m[method][metric] for m in metrics_history]
            rmse = compute_rmse(oracle_vals, method_vals)
            results[method][f'{metric}_rmse'] = rmse
    
    # Create DataFrame
    df = pd.DataFrame(results).T
    df.index = ['Accepts-only', 'Bayesian']
    # Name the index after relabeling it; assigning a new index drops the name
    df.index.name = 'Method'
    
    # Rename columns for clarity
    df.columns = ['AUC RMSE', 'Brier RMSE', 'ABR RMSE']
    
    # Format values
    df = df.round(4)
    
    return df


# Create and display Table 1
table_1 = create_table_1(exp1_data)
if table_1 is not None:
    print("\n" + "="*70)
    print("Table 1: Bayesian Evaluation - RMSE vs Oracle")
    print("="*70)
    print(table_1.to_string())
    print("="*70)
    print("\nInterpretation: Lower RMSE = better estimation of true (oracle) performance.")
    print("Bayesian evaluation should show lower RMSE than accepts-only.")

---

## Table 2: BASL Training Effectiveness

Comparison of final model performance (AUC and ABR).
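
For readers who want to see what the two metrics measure, here is a minimal sketch on toy data. It rests on assumptions: ABR is taken to mean the bad rate among accepted applicants at a single acceptance rate, with the lowest-scored applicants accepted, and the experiment code may aggregate differently (for example over a range of acceptance rates). The function names `auc` and `bad_rate_among_accepts` are hypothetical, not part of the experiment code.

```python
def auc(y_true, scores):
    """AUC as the probability that a random bad case outscores a random good case.

    y_true: 1 = bad, 0 = good; scores: predicted P(bad). Ties count as 0.5.
    """
    bads = [s for s, y in zip(scores, y_true) if y == 1]
    goods = [s for s, y in zip(scores, y_true) if y == 0]
    wins = sum((b > g) + 0.5 * (b == g) for b in bads for g in goods)
    return wins / (len(bads) * len(goods))


def bad_rate_among_accepts(y_true, scores, accept_rate):
    """Share of bad cases among the `accept_rate` fraction with the lowest scores."""
    n_accept = max(1, int(len(scores) * accept_rate))
    ranked = sorted(zip(scores, y_true))  # lowest predicted risk first
    accepted_labels = [y for _, y in ranked[:n_accept]]
    return sum(accepted_labels) / n_accept


# Toy labels and scores, made up for illustration
y_true = [0, 0, 1, 0, 1, 1, 0, 1]
scores = [0.1, 0.2, 0.3, 0.4, 0.6, 0.7, 0.8, 0.9]

print(f"AUC: {auc(y_true, scores):.4f}")
print(f"Bad rate at 50% acceptance: {bad_rate_among_accepts(y_true, scores, 0.5):.4f}")
```

On this toy data the AUC is 0.75 (12 of 16 bad/good pairs are ranked correctly) and the bad rate at 50% acceptance is 0.25 (one bad case among the four accepted).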

In [None]:
def create_table_2(exp2_data: Optional[dict]) -> Optional[pd.DataFrame]:
    """Create Table 2: Training method comparison."""
    
    if not exp2_data or 'aggregated' not in exp2_data:
        print("Experiment 2 aggregated data not available")
        return None
    
    agg = exp2_data['aggregated']
    
    # Handle both key naming conventions (xgb_accepts_* or accepts_*)
    accepts_auc_key = 'xgb_accepts_auc_mean' if 'xgb_accepts_auc_mean' in agg else 'accepts_auc_mean'
    accepts_auc_std_key = 'xgb_accepts_auc_std' if 'xgb_accepts_auc_std' in agg else 'accepts_auc_std'
    accepts_abr_key = 'xgb_accepts_abr_mean' if 'xgb_accepts_abr_mean' in agg else 'accepts_abr_mean'
    accepts_abr_std_key = 'xgb_accepts_abr_std' if 'xgb_accepts_abr_std' in agg else 'accepts_abr_std'
    
    # Extract final metrics
    data = {
        'Method': ['Oracle', 'Accepts-only', 'BASL'],
        'AUC': [
            f"{agg['oracle_auc_mean']:.4f} +/- {agg['oracle_auc_std']:.4f}",
            f"{agg[accepts_auc_key]:.4f} +/- {agg[accepts_auc_std_key]:.4f}",
            f"{agg['basl_auc_mean']:.4f} +/- {agg['basl_auc_std']:.4f}",
        ],
        'ABR': [
            f"{agg['oracle_abr_mean']:.4f} +/- {agg['oracle_abr_std']:.4f}",
            f"{agg[accepts_abr_key]:.4f} +/- {agg[accepts_abr_std_key]:.4f}",
            f"{agg['basl_abr_mean']:.4f} +/- {agg['basl_abr_std']:.4f}",
        ],
    }
    
    df = pd.DataFrame(data)
    df = df.set_index('Method')
    
    return df


# Create and display Table 2
table_2 = create_table_2(exp2_data)
if table_2 is not None:
    print("\n" + "="*70)
    print("Table 2: BASL Training Effectiveness")
    print("="*70)
    print(table_2.to_string())
    print("="*70)
    print("\nInterpretation: BASL should outperform accepts-only and approach oracle.")
    print("Higher AUC = better discrimination. Lower ABR = fewer bad cases among accepted applicants.")

---

## Figure 4: Feature Bias Analysis

Shows bias in predicted bad rates across the feature distribution.
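
The per-bin records read by the plotting code carry the fields `bin`, `X1_min`, `X1_max`, `true_bad_rate`, `predicted_bad_rate`, and `bias`. The sketch below shows one way such records could be produced. The field names come from this notebook; the equal-width binning, the function name `bias_by_bin`, and the toy data are assumptions, and the experiment code may bin differently.

```python
def bias_by_bin(x, y_true, y_pred, n_bins=4):
    """Per-bin true bad rate, predicted bad rate, and bias (predicted - true).

    Bins are equal-width over the observed range of x; empty bins are skipped.
    """
    lo, hi = min(x), max(x)
    width = (hi - lo) / n_bins
    rows = []
    for b in range(n_bins):
        left, right = lo + b * width, lo + (b + 1) * width
        if b == n_bins - 1:
            # The last bin is closed on the right so the maximum is included
            idx = [i for i, v in enumerate(x) if v >= left]
        else:
            idx = [i for i, v in enumerate(x) if left <= v < right]
        if not idx:
            continue
        true_rate = sum(y_true[i] for i in idx) / len(idx)
        pred_rate = sum(y_pred[i] for i in idx) / len(idx)
        rows.append({
            "bin": b, "X1_min": left, "X1_max": right,
            "true_bad_rate": true_rate,
            "predicted_bad_rate": pred_rate,
            "bias": pred_rate - true_rate,
        })
    return rows


# Toy data: risk rises with x, while the predictions stay too flat
x = [0.0, 0.5, 1.0, 1.5, 2.0, 2.5, 3.0, 4.0]
y_true = [0, 0, 0, 1, 0, 1, 1, 1]
y_pred = [0.1, 0.1, 0.2, 0.3, 0.3, 0.4, 0.5, 0.5]

rows = bias_by_bin(x, y_true, y_pred)
for row in rows:
    print(f"bin {row['bin']}: true={row['true_bad_rate']:.2f} "
          f"pred={row['predicted_bad_rate']:.2f} bias={row['bias']:+.2f}")
```

In this toy case the bias is slightly positive in the lowest bin and increasingly negative toward high `x`, meaning risk is underestimated where the true bad rate is highest.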

In [None]:
def plot_figure_4(exp2_data: dict):
    """Plot Figure 4: Feature bias analysis."""
    
    if not exp2_data or 'feature_bias' not in exp2_data:
        print("Feature bias data not available")
        return None
    
    feature_bias = exp2_data['feature_bias']
    
    fig, axes = plt.subplots(1, 2, figsize=(14, 5))
    
    for seed_data in feature_bias:
        model_data = seed_data['model']
        oracle_data = seed_data['oracle']
        
        # Panel 1: True vs Predicted bad rate
        ax1 = axes[0]
        bins = [d['bin'] for d in model_data]
        
        # Handle both old (x0_min/x0_max) and new (X1_min/X1_max) key names
        if 'X1_min' in model_data[0]:
            x1_centers = [(d['X1_min'] + d['X1_max']) / 2 for d in model_data]
            x_label = 'Feature X1 (lower = lower risk)'
        else:
            x1_centers = [(d['x0_min'] + d['x0_max']) / 2 for d in model_data]
            x_label = 'Feature x0 (lower = lower risk)'
        
        true_rates = [d['true_bad_rate'] for d in model_data]
        model_pred = [d['predicted_bad_rate'] for d in model_data]
        oracle_pred = [d['predicted_bad_rate'] for d in oracle_data]
        
        ax1.plot(x1_centers, true_rates, 'k-', linewidth=2.5, label='True bad rate', marker='o', markersize=6)
        ax1.plot(x1_centers, model_pred, 'r--', linewidth=2, label='Model (BASL)', marker='s', markersize=5)
        ax1.plot(x1_centers, oracle_pred, 'g:', linewidth=2, label='Oracle', marker='^', markersize=5)
        
        ax1.set_xlabel(x_label)
        ax1.set_ylabel('Bad Rate')
        ax1.set_title('True vs Predicted Bad Rate Across X1')
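        # Deduplicate legend entries, since every seed re-adds the same labels
        handles, labels = ax1.get_legend_handles_labels()
        unique = dict(zip(labels, handles))
        ax1.legend(list(unique.values()), list(unique.keys()))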
        ax1.set_ylim(0, 1.1)
        ax1.grid(True, alpha=0.3)
        
        # Panel 2: Bias across X1
        ax2 = axes[1]
        model_bias = [d['bias'] for d in model_data]
        oracle_bias = [d['bias'] for d in oracle_data]
        
        width = 0.3
        x_pos = np.array(x1_centers)
        
        ax2.bar(x_pos - width/2, model_bias, width=width, 
                label='Model bias', alpha=0.7, color='red')
        ax2.bar(x_pos + width/2, oracle_bias, width=width, 
                label='Oracle bias', alpha=0.7, color='green')
        ax2.axhline(y=0, color='k', linestyle='-', linewidth=1)
        
        ax2.set_xlabel(x_label)
        ax2.set_ylabel('Bias (Predicted - True)')
        ax2.set_title('Prediction Bias Across Feature Distribution')
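        # Deduplicate legend entries, since every seed re-adds the same labels
        handles, labels = ax2.get_legend_handles_labels()
        unique = dict(zip(labels, handles))
        ax2.legend(list(unique.values()), list(unique.keys()))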
        ax2.grid(True, alpha=0.3)
    
    plt.suptitle('Figure 4: Feature Bias Analysis', fontsize=14, fontweight='bold', y=1.02)
    plt.tight_layout()
    plt.show()
    
    return fig

# Plot Figure 4
fig4 = plot_figure_4(exp2_data)

---

## Additional Analysis: AUC Over Iterations

In [None]:
def plot_auc_comparison(exp1_data: Optional[dict], exp2_data: Optional[dict] = None):
    """Plot AUC comparison over iterations."""
    
    fig, axes = plt.subplots(1, 2, figsize=(14, 5))
    
    # Experiment 1: Evaluation
    ax1 = axes[0]
    if exp1_data and 'metrics_history' in exp1_data:
        series = extract_metric_series(exp1_data['metrics_history'], 'auc')
        
        ax1.plot(series['iteration'], series['oracle'], 'k-', 
                 linewidth=2, label='Oracle', marker='o', markersize=3, markevery=10)
        ax1.plot(series['iteration'], series['accepts'], 'r--', 
                 linewidth=2, label='Accepts-only', marker='s', markersize=3, markevery=10)
        ax1.plot(series['iteration'], series['bayesian'], 'b-', 
                 linewidth=2, label='Bayesian', marker='^', markersize=3, markevery=10)
        
        ax1.set_xlabel('Iteration')
        ax1.set_ylabel('AUC')
        ax1.set_title('Experiment 1: AUC Evaluation')
        ax1.legend(loc='best')
        ax1.set_ylim(0.4, 1.0)
        ax1.grid(True, alpha=0.3)
    
    # Experiment 2: Training
    ax2 = axes[1]
    if exp2_data and 'metrics_history' in exp2_data:
        series = extract_metric_series(exp2_data['metrics_history'], 'auc')
        
        # Oracle line
        ax2.plot(series['iteration'], series['oracle'], 'k-', 
                 linewidth=2, label='Oracle', marker='o', markersize=3, markevery=10)
        
        # Accepts-only baseline (constant reference from aggregated data)
        if 'aggregated' in exp2_data:
            agg = exp2_data['aggregated']
            baseline_key = 'xgb_accepts_auc_mean' if 'xgb_accepts_auc_mean' in agg else 'accepts_auc_mean'
            baseline_auc = agg[baseline_key]
            ax2.axhline(y=baseline_auc, color='r', linestyle='--', 
                       linewidth=2, label='Accepts-only (baseline)')
        
        # BASL model evolution
        if 'model_holdout' in series:
            ax2.plot(series['iteration'], series['model_holdout'], 'b-', 
                     linewidth=2, label='BASL', marker='^', markersize=3, markevery=10)
        
        ax2.set_xlabel('Iteration')
        ax2.set_ylabel('AUC')
        ax2.set_title('Experiment 2: AUC Training')
        ax2.legend(loc='best')
        ax2.set_ylim(0.4, 1.0)
        ax2.grid(True, alpha=0.3)
    
    plt.suptitle('AUC Over Iterations', fontsize=14, fontweight='bold')
    plt.tight_layout()
    plt.show()
    
    return fig

# Plot AUC comparison
fig_auc = plot_auc_comparison(exp1_data, exp2_data)

---

## Summary Statistics

In [None]:
def print_summary_stats(exp1_data: dict, exp2_data: dict):
    """Print summary statistics from experiments."""
    
    print("\n" + "="*70)
    print("EXPERIMENT SUMMARY STATISTICS")
    print("="*70)
    
    # Experiment 1
    if exp1_data and 'metrics_history' in exp1_data:
        final = exp1_data['metrics_history'][-1]
        print("\nExperiment 1: Bayesian Evaluation (Final Iteration)")
        print("-"*70)
        print(f"  Iteration: {final['iteration']}")
        print(f"  Oracle ABR:     {final['oracle']['abr']:.4f}")
        print(f"  Accepts ABR:    {final['accepts']['abr']:.4f}")
        print(f"  Bayesian ABR:   {final['bayesian']['abr']:.4f}")
        
        oracle_abr = final['oracle']['abr']
        accepts_bias = abs(final['accepts']['abr'] - oracle_abr)
        bayesian_bias = abs(final['bayesian']['abr'] - oracle_abr)
        print(f"\n  Accepts bias (vs Oracle):   {accepts_bias:.4f}")
        print(f"  Bayesian bias (vs Oracle):  {bayesian_bias:.4f}")
        if accepts_bias > 0:
            improvement = (accepts_bias - bayesian_bias) / accepts_bias * 100
            print(f"  Bias reduction:             {improvement:.1f}%")
    
    # Experiment 2
    if exp2_data and 'aggregated' in exp2_data:
        agg = exp2_data['aggregated']
        
        # Get baseline accepts-only metrics (constant reference)
        baseline_auc_key = 'xgb_accepts_auc_mean' if 'xgb_accepts_auc_mean' in agg else 'accepts_auc_mean'
        baseline_abr_key = 'xgb_accepts_abr_mean' if 'xgb_accepts_abr_mean' in agg else 'accepts_abr_mean'
        
        print("\nExperiment 2: BASL Training (Final Model Performance)")
        print("-"*70)
        print(f"  Oracle AUC:         {agg['oracle_auc_mean']:.4f}")
        print(f"  Oracle ABR:         {agg['oracle_abr_mean']:.4f}")
        print(f"  Accepts-only AUC:   {agg[baseline_auc_key]:.4f}")
        print(f"  Accepts-only ABR:   {agg[baseline_abr_key]:.4f}")
        print(f"  BASL AUC:           {agg['basl_auc_mean']:.4f}")
        print(f"  BASL ABR:           {agg['basl_abr_mean']:.4f}")
        
        # Calculate improvements
        oracle_abr = agg['oracle_abr_mean']
        accepts_gap = abs(agg[baseline_abr_key] - oracle_abr)
        basl_gap = abs(agg['basl_abr_mean'] - oracle_abr)
        
        oracle_auc = agg['oracle_auc_mean']
        accepts_auc_gap = abs(agg[baseline_auc_key] - oracle_auc)
        basl_auc_gap = abs(agg['basl_auc_mean'] - oracle_auc)
        
        # Gaps are absolute values, so they are printed without a sign
        print("\n  ABR Gap to Oracle (absolute, and as % of oracle ABR):")
        print(f"    Accepts-only:     {accepts_gap:.4f} ({accepts_gap/oracle_abr*100:.1f}%)")
        print(f"    BASL:             {basl_gap:.4f} ({basl_gap/oracle_abr*100:.1f}%)")
        if accepts_gap > 0:
            abr_recovery = (accepts_gap - basl_gap) / accepts_gap * 100
            print(f"  ABR Recovery:       {abr_recovery:.1f}%")
        
        print(f"\n  AUC Gap to Oracle:")
        print(f"    Accepts-only:     {accepts_auc_gap:.4f}")
        print(f"    BASL:             {basl_auc_gap:.4f}")
        if accepts_auc_gap > 0:
            auc_recovery = (accepts_auc_gap - basl_auc_gap) / accepts_auc_gap * 100
            print(f"  AUC Recovery:       {auc_recovery:.1f}%")
    
    print("\n" + "="*70)


print_summary_stats(exp1_data, exp2_data)

---

## Interpretation Guide

### Figure 2 Interpretation:
- **Panel (a)**: Accepts and rejects are distributed differently from the population (covariate shift)
- **Panel (b)**: Coefficients differ between accepts-only and oracle models (parameter bias)
- **Panel (c)**: Score distributions differ (prediction bias)
- **Panel (d)**: Bayesian evaluation tracks oracle better than accepts-only (evaluation bias reduction)
- **Panel (e)**: BASL training reduces gap to oracle compared to accepts-only (training bias reduction)

### Table 1 Interpretation:
- Lower RMSE = better estimation of true performance
- Bayesian should show lower RMSE than accepts-only across all metrics

### Table 2 Interpretation:
- BASL should achieve AUC/ABR closer to oracle than accepts-only
- Higher AUC = better discrimination between good/bad
- Lower ABR (closer to oracle) = fewer bad cases among accepted applicants
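
"Closer to oracle" can be quantified as the share of the baseline's gap that a method closes, which is the recovery percentage printed in the summary statistics. The sketch below restates that arithmetic as a standalone function. The name `gap_recovery` is hypothetical, and the numbers are made up for illustration; they are not results from the experiments.

```python
def gap_recovery(oracle, baseline, candidate):
    """Percent of the baseline's absolute gap to the oracle that the candidate closes."""
    baseline_gap = abs(baseline - oracle)
    candidate_gap = abs(candidate - oracle)
    if baseline_gap == 0:
        return None  # baseline already matches the oracle; recovery is undefined
    return (baseline_gap - candidate_gap) / baseline_gap * 100


# Made-up AUC values for illustration only
recovery = gap_recovery(oracle=0.80, baseline=0.70, candidate=0.775)
print(f"Gap recovered: {recovery:.1f}%")
```

Here the baseline sits 0.10 below the oracle and the candidate 0.025 below, so roughly 75% of the gap is recovered. A negative value would mean the candidate ends up further from the oracle than the baseline.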

### Figure 4 Interpretation:
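- Bias is predicted minus true bad rate, so negative values mean risk is underestimated
- The figure plots one trained model, labeled `Model (BASL)` in the code, against the oracle; an accepts-only model is not drawn
- A model trained on accepts only would be expected to show large negative bias in reject regions, and BASL should reduce that bias
- Oracle model shows minimal bias across all regions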