# 02: Estimator Comparison - Mechanistic HbA1c Estimation Methods

This notebook compares three mechanistic estimators for HbA1c prediction:

1. **ADAG** - A1c-Derived Average Glucose equation (inverted)
2. **Glycation Kinetics** - First-order kinetics model adjusted for hemoglobin and RBC lifespan
3. **Multi-Linear Regression** - Linear model using FPG, age, lipids, and hemoglobin

---

## Background

Each estimator has different strengths and limitations:

- **ADAG**: Simplest approach; uses only FPG, as a stand-in for average glucose. Based on Nathan et al. (2008).
- **Kinetic**: Incorporates hemoglobin level and RBC lifespan for physiologic adjustment.
- **Regression**: Multi-marker approach combining FPG with age, lipids, and hemoglobin.

We will compare these methods against HPLC-measured HbA1c from NHANES data.
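
For reference, the ADAG relationship from Nathan et al. (2008) is eAG (mg/dL) = 28.7 × HbA1c − 46.7, and solving for HbA1c gives the inverted form. The sketch below is an illustration only: `adag_inverse_sketch` is a hypothetical name, it assumes FPG is substituted for average glucose, and the actual `calc_hba1c_adag` in `hba1cE.models` may differ (for example in input validation).

```python
import numpy as np

def adag_inverse_sketch(glucose_mgdl):
    """Invert eAG = 28.7 * HbA1c - 46.7 to get HbA1c (%) from glucose (mg/dL).

    Hypothetical helper for illustration; not the hba1cE implementation.
    """
    glucose = np.asarray(glucose_mgdl, dtype=float)
    return (glucose + 46.7) / 28.7

# Example glucose values in mg/dL (made up for illustration)
print(adag_inverse_sketch([97.0, 126.0, 154.0]).round(2))
```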

In [None]:
# Standard library imports
import sys
from pathlib import Path

# Add parent directory to path for imports
sys.path.insert(0, str(Path.cwd().parent))

# Third-party imports
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
from scipy import stats

# Local imports
from hba1cE.models import (
    calc_hba1c_adag,
    calc_hba1c_kinetic,
    calc_hba1c_regression,
    fit_regression_coefficients,
)

# Configure matplotlib
plt.style.use('seaborn-v0_8-whitegrid')
plt.rcParams['figure.figsize'] = (12, 8)
plt.rcParams['font.size'] = 11

print("Imports successful!")

---

## Step 1: Load Cleaned NHANES Data

Load the cleaned dataset from Notebook 01.

In [None]:
# Load cleaned data
DATA_DIR = Path.cwd().parent / "data"
PROCESSED_DIR = DATA_DIR / "processed"

df = pd.read_csv(PROCESSED_DIR / "nhanes_glycemic_cleaned.csv")

print(f"Loaded {len(df):,} records")
print(f"\nColumns: {list(df.columns)}")
print(f"\nSummary statistics:")
df[['hba1c_percent', 'fpg_mgdl', 'hgb_gdl', 'tg_mgdl', 'hdl_mgdl', 'age_years']].describe().round(2)

---

## Step 2: Apply All Estimators

Calculate HbA1c estimates using each mechanistic method.

In [None]:
# 1. ADAG Estimator (FPG only)
df['hba1c_adag'] = calc_hba1c_adag(df['fpg_mgdl'].values)

# 2. Glycation Kinetics Estimator (FPG + Hemoglobin)
df['hba1c_kinetic'] = calc_hba1c_kinetic(
    fpg_mgdl=df['fpg_mgdl'].values,
    hgb_gdl=df['hgb_gdl'].values
)

# 3. Fit Multi-Linear Regression on training data
# (Fit on a random ~70% subset; the estimator is then applied to all rows,
#  so metrics computed on the full dataset are partly in-sample for regression)
np.random.seed(42)
train_mask = np.random.random(len(df)) < 0.7
train_df = df[train_mask]

fitted_coeffs = fit_regression_coefficients(train_df)
print("Fitted Regression Coefficients:")
for name, value in fitted_coeffs.items():
    print(f"  {name}: {value:.6f}")

# Apply regression estimator
df['hba1c_regression'] = calc_hba1c_regression(
    fpg_mgdl=df['fpg_mgdl'].values,
    age_years=df['age_years'].values,
    tg_mgdl=df['tg_mgdl'].values,
    hdl_mgdl=df['hdl_mgdl'].values,
    hgb_gdl=df['hgb_gdl'].values,
    coefficients=fitted_coeffs
)

print(f"\nEstimates computed for {len(df):,} samples")

---

## Step 3: Calculate Performance Metrics

Evaluate each estimator against HPLC-measured HbA1c.
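
Because the regression coefficients in Step 2 are fitted on a random subset of these same rows, metrics computed on the full dataset are partly in-sample for that estimator. A minimal sketch of scoring only the held-out rows follows. It uses synthetic data and an ordinary least-squares fit via `np.linalg.lstsq` as a stand-in, not the `fit_regression_coefficients` implementation, and all names and numbers in it are illustrative.

```python
import numpy as np

rng = np.random.default_rng(42)
n = 1000

# Synthetic stand-ins for FPG (mg/dL) and measured HbA1c (%); not NHANES data.
fpg = rng.normal(105.0, 25.0, n)
hba1c = 2.5 + 0.03 * fpg + rng.normal(0.0, 0.4, n)

# Roughly 70/30 split expressed as boolean masks.
train_mask = rng.random(n) < 0.7
test_mask = ~train_mask

# Ordinary least squares on the training rows only (intercept + slope).
X = np.column_stack([np.ones(n), fpg])
coef, *_ = np.linalg.lstsq(X[train_mask], hba1c[train_mask], rcond=None)

def rmse(y_true, y_pred):
    """Root mean squared error between two arrays."""
    return float(np.sqrt(np.mean((y_pred - y_true) ** 2)))

pred = X @ coef
rmse_train = rmse(hba1c[train_mask], pred[train_mask])
rmse_test = rmse(hba1c[test_mask], pred[test_mask])
print(f"train RMSE={rmse_train:.3f}, held-out RMSE={rmse_test:.3f}")
```

In this notebook, the equivalent would be to pass only the `~train_mask` rows to `calculate_metrics` when scoring the regression estimator.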

In [None]:
def calculate_metrics(y_true, y_pred, name):
    """Calculate performance metrics for an estimator."""
    errors = y_pred - y_true
    abs_errors = np.abs(errors)
    
    rmse = np.sqrt(np.mean(errors**2))
    mae = np.mean(abs_errors)
    bias = np.mean(errors)
    r_pearson, _ = stats.pearsonr(y_true, y_pred)
    pct_within_0_5 = 100 * np.mean(abs_errors <= 0.5)
    pct_within_1_0 = 100 * np.mean(abs_errors <= 1.0)
    
    return {
        'Method': name,
        'RMSE': rmse,
        'MAE': mae,
        'Bias': bias,
        'Pearson r': r_pearson,
        '% within ±0.5%': pct_within_0_5,
        '% within ±1.0%': pct_within_1_0,
    }

# Calculate metrics for each estimator
y_true = df['hba1c_percent'].values

metrics_list = [
    calculate_metrics(y_true, df['hba1c_adag'].values, 'ADAG'),
    calculate_metrics(y_true, df['hba1c_kinetic'].values, 'Kinetic'),
    calculate_metrics(y_true, df['hba1c_regression'].values, 'Regression'),
]

metrics_df = pd.DataFrame(metrics_list)
print("\n" + "="*70)
print("ESTIMATOR PERFORMANCE COMPARISON")
print("="*70)
print(metrics_df.to_string(index=False))

---

## Step 4: Scatter Plots - Estimated vs Measured HbA1c

Visualize each estimator's predictions against HPLC-measured HbA1c.

In [None]:
fig, axes = plt.subplots(1, 3, figsize=(16, 5))

estimators = [
    ('hba1c_adag', 'ADAG', 'steelblue'),
    ('hba1c_kinetic', 'Kinetic', 'forestgreen'),
    ('hba1c_regression', 'Regression', 'darkorange'),
]

for ax, (col, name, color) in zip(axes, estimators):
    y_pred = df[col].values
    
    # Scatter plot
    ax.scatter(y_true, y_pred, alpha=0.2, s=8, c=color)
    
    # Perfect agreement line
    lims = [3, 15]
    ax.plot(lims, lims, 'k--', linewidth=2, label='Perfect agreement')
    
    # ±0.5% lines
    ax.plot(lims, [lims[0]-0.5, lims[1]-0.5], 'r:', alpha=0.5)
    ax.plot(lims, [lims[0]+0.5, lims[1]+0.5], 'r:', alpha=0.5, label='±0.5% bounds')
    
    # Calculate metrics for subtitle
    rmse = np.sqrt(np.mean((y_pred - y_true)**2))
    r, _ = stats.pearsonr(y_true, y_pred)
    
    ax.set_xlabel('Measured HbA1c (%)')
    ax.set_ylabel('Estimated HbA1c (%)')
    ax.set_title(f'{name} Estimator\nRMSE={rmse:.2f}%, r={r:.3f}')
    ax.set_xlim(lims)
    ax.set_ylim(lims)
    ax.legend(loc='upper left', fontsize=8)
    ax.set_aspect('equal')

plt.tight_layout()
plt.savefig(DATA_DIR / 'estimator_comparison_scatter.png', dpi=150, bbox_inches='tight')
plt.show()

print(f"\nFigure saved to: {DATA_DIR / 'estimator_comparison_scatter.png'}")

---

## Step 5: Bland-Altman Plots

Assess agreement and bias patterns across the HbA1c range.
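
The limits of agreement are the mean difference plus or minus 1.96 standard deviations of the differences, which would cover about 95% of differences if they are roughly normally distributed. A small sketch of that calculation on made-up numbers (the helper name `limits_of_agreement` is hypothetical, and it uses the sample standard deviation, `ddof=1`):

```python
import numpy as np

def limits_of_agreement(measured, estimated):
    """Return (bias, lower, upper) Bland-Altman limits for paired values."""
    measured = np.asarray(measured, dtype=float)
    estimated = np.asarray(estimated, dtype=float)
    diff = estimated - measured
    bias = diff.mean()
    sd = diff.std(ddof=1)  # sample standard deviation of the differences
    return bias, bias - 1.96 * sd, bias + 1.96 * sd

# Made-up paired values, for illustration only.
measured = [5.2, 5.8, 6.4, 7.1, 8.0]
estimated = [5.4, 5.7, 6.1, 6.8, 7.2]
bias, lower, upper = limits_of_agreement(measured, estimated)
print(f"bias={bias:.2f}, LoA=({lower:.2f}, {upper:.2f})")
```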

In [None]:
fig, axes = plt.subplots(1, 3, figsize=(16, 5))

for ax, (col, name, color) in zip(axes, estimators):
    y_pred = df[col].values
    
    # Bland-Altman: mean vs difference
    mean_values = (y_true + y_pred) / 2
    diff_values = y_pred - y_true
    
    mean_diff = np.mean(diff_values)
    std_diff = np.std(diff_values, ddof=1)  # sample SD of the differences
    loa_upper = mean_diff + 1.96 * std_diff
    loa_lower = mean_diff - 1.96 * std_diff
    
    # Scatter plot
    ax.scatter(mean_values, diff_values, alpha=0.2, s=8, c=color)
    
    # Mean bias line
    ax.axhline(y=mean_diff, color='black', linestyle='-', linewidth=2,
               label=f'Mean bias: {mean_diff:.2f}%')
    
    # Limits of agreement
    ax.axhline(y=loa_upper, color='red', linestyle='--', linewidth=1.5,
               label=f'+1.96 SD: {loa_upper:.2f}%')
    ax.axhline(y=loa_lower, color='red', linestyle='--', linewidth=1.5,
               label=f'-1.96 SD: {loa_lower:.2f}%')
    
    ax.axhline(y=0, color='gray', linestyle=':', alpha=0.5)
    
    ax.set_xlabel('Mean of Measured and Estimated HbA1c (%)')
    ax.set_ylabel('Estimated - Measured HbA1c (%)')
    ax.set_title(f'{name} Bland-Altman Plot')
    ax.legend(loc='upper left', fontsize=8)
    ax.set_xlim(3, 15)
    ax.set_ylim(-5, 5)

plt.tight_layout()
plt.savefig(DATA_DIR / 'estimator_comparison_bland_altman.png', dpi=150, bbox_inches='tight')
plt.show()

print(f"\nFigure saved to: {DATA_DIR / 'estimator_comparison_bland_altman.png'}")

---

## Step 6: Error Distribution Analysis

Compare the distribution of signed prediction errors (estimated minus measured) across estimators.

In [None]:
fig, ax = plt.subplots(figsize=(12, 6))

for col, name, color in estimators:
    errors = df[col].values - y_true
    ax.hist(errors, bins=50, alpha=0.5, label=name, color=color, density=True)

ax.axvline(x=0, color='black', linestyle='--', linewidth=2)
ax.axvline(x=-0.5, color='red', linestyle=':', alpha=0.7)
ax.axvline(x=0.5, color='red', linestyle=':', alpha=0.7, label='±0.5% target')

ax.set_xlabel('Prediction Error (Estimated - Measured HbA1c, %)')
ax.set_ylabel('Density')
ax.set_title('Error Distribution by Estimator')
ax.legend()
ax.set_xlim(-4, 4)

plt.tight_layout()
plt.savefig(DATA_DIR / 'estimator_comparison_error_dist.png', dpi=150, bbox_inches='tight')
plt.show()

---

## Step 7: Performance by HbA1c Strata

Evaluate estimator performance across clinical HbA1c ranges.
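
The same three strata can also be assigned in a vectorized way. A sketch using `pd.cut` with left-closed bins, so that 5.7 and 6.5 fall into the higher stratum, matching the `<` comparisons used in this notebook (the toy values are illustrative):

```python
import numpy as np
import pandas as pd

labels = ['Normal (<5.7%)', 'Prediabetes (5.7-6.4%)', 'Diabetes (≥6.5%)']
bins = [-np.inf, 5.7, 6.5, np.inf]

# Toy HbA1c values (%), including both boundary values.
hba1c = pd.Series([5.0, 5.69, 5.7, 6.4, 6.5, 9.1])

# right=False makes each bin [low, high), so boundary values move up a stratum.
stratum = pd.cut(hba1c, bins=bins, labels=labels, right=False)
print(stratum.value_counts(sort=False))
```

Unlike an if/else function applied row by row, `pd.cut` leaves missing values as NaN rather than letting them fall through to the last stratum.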

In [None]:
# Define strata
def assign_stratum(hba1c):
    if hba1c < 5.7:
        return 'Normal (<5.7%)'
    elif hba1c < 6.5:
        return 'Prediabetes (5.7-6.4%)'
    else:
        return 'Diabetes (≥6.5%)'

df['stratum'] = df['hba1c_percent'].apply(assign_stratum)

# Calculate RMSE by stratum for each estimator
strata_results = []
for stratum in ['Normal (<5.7%)', 'Prediabetes (5.7-6.4%)', 'Diabetes (≥6.5%)']:
    mask = df['stratum'] == stratum
    n = mask.sum()
    y_true_stratum = df.loc[mask, 'hba1c_percent'].values
    
    row = {'Stratum': stratum, 'N': n}
    for col, name, _ in estimators:
        y_pred_stratum = df.loc[mask, col].values
        rmse = np.sqrt(np.mean((y_pred_stratum - y_true_stratum)**2))
        row[f'{name} RMSE'] = rmse
    strata_results.append(row)

strata_df = pd.DataFrame(strata_results)
print("\n" + "="*70)
print("PERFORMANCE BY HbA1c STRATUM (RMSE in %)")
print("="*70)
print(strata_df.to_string(index=False))

In [None]:
# Visualize RMSE by stratum
fig, ax = plt.subplots(figsize=(10, 6))

strata_names = strata_df['Stratum'].values
x = np.arange(len(strata_names))
width = 0.25

# Bar colors come from the `estimators` tuples defined in Step 4
for i, (_, name, color) in enumerate(estimators):
    rmse_values = strata_df[f'{name} RMSE'].values
    ax.bar(x + i*width, rmse_values, width, label=name, color=color)

ax.axhline(y=0.5, color='red', linestyle='--', linewidth=2, label='Target RMSE (0.5%)')

ax.set_xlabel('HbA1c Stratum')
ax.set_ylabel('RMSE (%)')
ax.set_title('Estimator RMSE by Clinical HbA1c Stratum')
ax.set_xticks(x + width)
ax.set_xticklabels(strata_names)
ax.legend()

plt.tight_layout()
plt.savefig(DATA_DIR / 'estimator_comparison_by_stratum.png', dpi=150, bbox_inches='tight')
plt.show()

---

## Interpretation and Limitations

### ADAG Estimator
- **Pros**: Simple, requires only FPG measurement
- **Cons**: Underestimates at high HbA1c, doesn't account for hemoglobin or age effects
- **Best for**: Quick screening, resource-limited settings

### Glycation Kinetics Model
- **Pros**: Physiologically motivated, adjusts for anemia
- **Cons**: Simplified model of complex glycation dynamics, may underestimate in diabetes range
- **Best for**: Patients with known hemoglobin abnormalities

### Multi-Linear Regression
- **Pros**: Uses multiple biomarkers, potentially more accurate
- **Cons**: Requires more inputs, coefficients are dataset-specific
- **Best for**: Comprehensive estimation when full panel available

### Common Limitations
- All methods struggle at **high HbA1c values** (diabetes range) where the FPG-HbA1c relationship is weaker
- None account for **hemoglobin variants** (HbS, HbC, etc.)
- All are **estimates only** and cannot replace direct HbA1c measurement for diagnosis

---

## Summary

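This notebook compared three HbA1c estimators (ADAG, glycation kinetics, and multi-linear regression) against HPLC-measured HbA1c. It:

1. Computed overall performance metrics (RMSE, MAE, bias, Pearson r, and % within ±0.5% and ±1.0%)
2. Generated scatter plots showing estimated vs measured HbA1c
3. Created Bland-Altman plots to assess bias and limits of agreement
4. Analyzed error distributions
5. Evaluated performance across clinical HbA1c strata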

### Key Findings
- All estimators correlate with measured HbA1c (see the Pearson r column in the Step 3 table)
- Performance degrades in the diabetes range (HbA1c ≥6.5%), as reflected in the per-stratum RMSE from Step 7
- Multi-marker approaches may offer improved performance

### Next Steps
Continue to **Notebook 03: Model Training** to develop ML-based estimators that may outperform these mechanistic approaches.