# External Validation: Kaggle Diabetes Prediction Dataset

This notebook validates our HbA1c estimation models on an **independent external dataset** — the
[Kaggle Diabetes Prediction Dataset](https://www.kaggle.com/datasets/iammustafatz/diabetes-prediction-dataset).

## Dataset Overview

| Property | Value |
|---|---|
| Records | ~100 000 |
| Source | Kaggle (Mohammed Mustafa) |
| HbA1c column | `HbA1c_level` (discrete, 18 unique values) |
| Glucose column | `blood_glucose_level` (may **not** be fasting) |
| Additional features | age, gender, BMI |
| Missing features | TG, HDL, haemoglobin, MCV |

### Limitations

* The dataset is **synthetic / aggregated** — HbA1c and glucose values are discretised.
* Glucose may **not** be fasting, so ADAG-based estimates should be interpreted cautiously.
* No lipid panel (TG, HDL), haemoglobin, or MCV — only the ADAG mechanistic estimator can
  be applied directly.  ML models (Ridge, RF, LightGBM) cannot be applied because they
  require the full multi-marker feature set.
* HbA1c measurement method is **unknown** (not confirmed HPLC).
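
The discretisation noted above can be checked by counting distinct values per column and distinct (HbA1c, glucose) pairs. The sketch below does this on a small made-up frame; the column names come from the table above, while the `toy` data and its values are purely illustrative and are not taken from the Kaggle file.

```python
import pandas as pd

# Hypothetical toy rows standing in for the Kaggle columns (values are made up)
toy = pd.DataFrame({
    "HbA1c_level": [5.7, 5.7, 6.5, 6.5, 6.5, 9.0],
    "blood_glucose_level": [100, 140, 140, 140, 200, 300],
})

# Number of distinct values in each column
n_unique = toy.nunique()

# Number of distinct (HbA1c, glucose) combinations; this bounds how many
# separate points a scatter plot of the two columns can show
n_pairs = len(toy.drop_duplicates())

print(n_unique)
print(f"Distinct pairs: {n_pairs}")
```

When both columns take only a handful of values, many records collapse onto the same point, which is why the scatter plots later in this notebook use a low `alpha`.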

In [None]:
import sys
from pathlib import Path

# Ensure package is importable from notebooks/
sys.path.insert(0, str(Path.cwd().parent))

import numpy as np
import pandas as pd
import matplotlib.pyplot as plt

plt.style.use('seaborn-v0_8-whitegrid')
plt.rcParams['figure.dpi'] = 120
plt.rcParams['figure.figsize'] = (10, 6)

## 1. Load External Dataset

In [None]:
from hba1cE.data import load_external_kaggle_diabetes

EXT_CSV = Path.cwd().parent / 'data' / 'external' / 'diabetes_prediction_dataset.csv'
ext_df = load_external_kaggle_diabetes(str(EXT_CSV))

print(f'Records after cleaning: {len(ext_df):,}')
print(f'Columns: {ext_df.columns.tolist()}')
ext_df.describe().round(2)

## 2. Generate Predictions from Mechanistic Estimators

Only the **ADAG** estimator can be applied directly because the kinetic and regression
models require haemoglobin, triglycerides, HDL, and other features **not available** in
this dataset.
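
For orientation, the published ADAG regression relates estimated average glucose to HbA1c as eAG (mg/dL) = 28.7 × HbA1c − 46.7. A minimal sketch of the inverted form follows. It assumes that `calc_hba1c_adag` implements this inversion and treats the supplied glucose as a stand-in for average glucose; the name `adag_hba1c_sketch` is hypothetical and is not part of `hba1cE`.

```python
import numpy as np

def adag_hba1c_sketch(glucose_mgdl):
    """Estimate HbA1c (%) by inverting eAG = 28.7 * HbA1c - 46.7.

    Assumption: the glucose passed in is used as a proxy for average glucose.
    """
    glucose = np.asarray(glucose_mgdl, dtype=float)
    return (glucose + 46.7) / 28.7

# Three illustrative glucose values in mg/dL
example = adag_hba1c_sketch([80, 154, 300])
print(np.round(example, 2))
```

Under this form, inputs of 80 and 300 mg/dL map to roughly 4.41% and 12.08% HbA1c, so the estimate is a simple linear rescaling of the glucose column.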

In [None]:
from hba1cE.models import calc_hba1c_adag

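# Column names as provided by load_external_kaggle_diabetes.
# Note: the glucose column is named fpg_mgdl, but fasting status is not confirmed here.
y_true = ext_df['hba1c_percent'].to_numpy()
fpg = ext_df['fpg_mgdl'].to_numpy()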

# ADAG estimation
y_pred_adag = calc_hba1c_adag(fpg)

print(f'True HbA1c  — mean: {y_true.mean():.2f}, std: {y_true.std():.2f}')
print(f'ADAG est.   — mean: {y_pred_adag.mean():.2f}, std: {y_pred_adag.std():.2f}')

## 3. Evaluate ADAG on External Dataset

In [None]:
from hba1cE.evaluate import evaluate_model, bland_altman_stats, lins_ccc

metrics = evaluate_model(y_true, y_pred_adag, model_name='ADAG (external)')

print('=== ADAG Performance on External Dataset ===')
for k, v in metrics.items():
    if k == 'ba_stats':
        print('  Bland-Altman:')
        for bk, bv in v.items():
            print(f'    {bk}: {bv:.4f}')
    elif isinstance(v, float):
        print(f'  {k}: {v:.4f}')
    else:
        print(f'  {k}: {v}')

## 4. Visualisations

### 4a. Scatter Plot: Estimated vs Measured HbA1c

In [None]:
fig, ax = plt.subplots(figsize=(8, 8))
ax.scatter(y_true, y_pred_adag, alpha=0.15, s=10, color='steelblue')
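# Derive shared axis limits from the data so that no points are clipped
lims = [np.floor(min(y_true.min(), y_pred_adag.min())),
        np.ceil(max(y_true.max(), y_pred_adag.max()))]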
ax.plot(lims, lims, 'k--', lw=1.5, label='Identity')
ax.set_xlabel('Measured HbA1c (%)', fontsize=12)
ax.set_ylabel('ADAG-Estimated HbA1c (%)', fontsize=12)
ax.set_title('External Validation: ADAG Estimated vs Measured HbA1c', fontsize=14)
ax.set_xlim(lims)
ax.set_ylim(lims)
ax.legend(fontsize=11)
ax.set_aspect('equal')
plt.tight_layout()
plt.show()

### 4b. Bland-Altman Plot
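
A Bland-Altman analysis summarises agreement as the mean difference (bias) and the 95% limits of agreement, bias ± 1.96 × SD of the differences. The sketch below computes these on made-up values. `bland_altman_sketch` is a hypothetical helper; that `bland_altman_stats` uses the same sign convention (estimated minus measured) and the sample standard deviation is an assumption.

```python
import numpy as np

def bland_altman_sketch(measured, estimated):
    """Return bias and 95% limits of agreement for paired measurements."""
    measured = np.asarray(measured, dtype=float)
    estimated = np.asarray(estimated, dtype=float)
    diffs = estimated - measured      # sign convention: estimated minus measured
    bias = diffs.mean()
    sd = diffs.std(ddof=1)            # sample standard deviation of the differences
    return {
        "mean_bias": bias,
        "loa_lower": bias - 1.96 * sd,
        "loa_upper": bias + 1.96 * sd,
    }

# Made-up paired values, for illustration only
demo = bland_altman_sketch([5.0, 6.0, 7.0, 8.0], [5.5, 6.5, 6.5, 8.5])
print(demo)
```

For these four pairs the differences are 0.5, 0.5, −0.5, and 0.5, giving a bias of 0.25 and limits of agreement of about −0.73 and 1.23.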

In [None]:
ba = bland_altman_stats(y_true, y_pred_adag)
means = (y_true + y_pred_adag) / 2
diffs = y_pred_adag - y_true

fig, ax = plt.subplots(figsize=(10, 6))
ax.scatter(means, diffs, alpha=0.15, s=10, color='steelblue')
ax.axhline(ba['mean_bias'], color='red', lw=1.5, label=f'Mean bias = {ba["mean_bias"]:.3f}')
ax.axhline(ba['loa_upper'], color='grey', ls='--', lw=1,
           label=f'+1.96 SD = {ba["loa_upper"]:.3f}')
ax.axhline(ba['loa_lower'], color='grey', ls='--', lw=1,
           label=f'-1.96 SD = {ba["loa_lower"]:.3f}')
ax.set_xlabel('Mean of Measured & Estimated HbA1c (%)', fontsize=12)
ax.set_ylabel('Estimated − Measured HbA1c (%)', fontsize=12)
ax.set_title('Bland-Altman Plot: ADAG on External Dataset', fontsize=14)
ax.legend(fontsize=10)
plt.tight_layout()
plt.show()

### 4c. Error Distribution Histogram

In [None]:
errors = y_pred_adag - y_true
fig, ax = plt.subplots(figsize=(9, 5))
ax.hist(errors, bins=50, color='steelblue', edgecolor='white', alpha=0.8)
ax.axvline(0, color='black', ls='--', lw=1.5)
ax.axvline(errors.mean(), color='red', lw=1.5, label=f'Mean error = {errors.mean():.3f}')
ax.set_xlabel('Prediction Error (estimated − measured) (%)', fontsize=12)
ax.set_ylabel('Count', fontsize=12)
ax.set_title('Error Distribution: ADAG on External Dataset', fontsize=14)
ax.legend(fontsize=11)
plt.tight_layout()
plt.show()

## 5. HbA1c-Stratified Evaluation

In [None]:
from hba1cE.evaluate import evaluate_by_hba1c_strata

strata_results = evaluate_by_hba1c_strata(y_true, y_pred_adag, y_true)

strata_rows = []
for stratum, res in strata_results.items():
    if res is not None:
        strata_rows.append({
            'Stratum': stratum,
            'N': int(res['n']) if 'n' in res else '—',
            'RMSE': f"{res['rmse']:.3f}",
            'MAE': f"{res['mae']:.3f}",
            'Bias': f"{res['bias']:.3f}",
            'Lin CCC': f"{res['lin_ccc']:.3f}",
            '% ±0.5': f"{res['pct_within_0_5']:.1f}",
        })
    else:
        strata_rows.append({'Stratum': stratum, 'N': '—', 'RMSE': '—',
                            'MAE': '—', 'Bias': '—', 'Lin CCC': '—', '% ±0.5': '—'})

strata_df = pd.DataFrame(strata_rows)
print('\n=== ADAG Performance by HbA1c Stratum (External) ===')
print(strata_df.to_string(index=False))

## 6. Bootstrap Confidence Intervals
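
One common way to obtain such intervals is the percentile bootstrap: resample (measured, estimated) pairs with replacement, recompute the metric on each resample, and take the 2.5th and 97.5th percentiles. The sketch below illustrates this on simulated data. `percentile_bootstrap_ci` is a hypothetical helper, and whether `bootstrap_ci` in `hba1cE` uses this exact scheme is an assumption; only its return order of (lower, upper, mean) is mirrored from the cell below.

```python
import numpy as np

def percentile_bootstrap_ci(y_true, y_pred, metric, n_bootstrap=2000, alpha=0.05, seed=0):
    """Percentile bootstrap CI for a paired metric; returns (lower, upper, mean)."""
    y_true = np.asarray(y_true, dtype=float)
    y_pred = np.asarray(y_pred, dtype=float)
    rng = np.random.default_rng(seed)
    n = len(y_true)
    stats = np.empty(n_bootstrap)
    for i in range(n_bootstrap):
        # Resample pairs jointly so each measured value keeps its estimate
        idx = rng.integers(0, n, size=n)
        stats[i] = metric(y_true[idx], y_pred[idx])
    lower, upper = np.percentile(stats, [100 * alpha / 2, 100 * (1 - alpha / 2)])
    return float(lower), float(upper), float(stats.mean())

def mae_sketch(yt, yp):
    return float(np.mean(np.abs(yt - yp)))

# Simulated data, for illustration only
sim_rng = np.random.default_rng(1)
sim_true = sim_rng.normal(5.5, 1.0, size=200)
sim_pred = sim_true + sim_rng.normal(0.2, 0.5, size=200)

ci_lower, ci_upper, ci_mean = percentile_bootstrap_ci(sim_true, sim_pred, mae_sketch, n_bootstrap=500)
print(f"MAE: {ci_mean:.3f} [{ci_lower:.3f}, {ci_upper:.3f}]")
```

Resampling indices rather than the two arrays separately preserves the pairing, which matters for every metric used here.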

In [None]:
from hba1cE.evaluate import bootstrap_ci

def rmse_func(yt, yp):
    return float(np.sqrt(np.mean((yt - yp) ** 2)))

def mae_func(yt, yp):
    return float(np.mean(np.abs(yt - yp)))

def bias_func(yt, yp):
    return float(np.mean(yp - yt))

def pct_within_05(yt, yp):
    return float(100 * np.mean(np.abs(yt - yp) <= 0.5))

ci_metrics = {
    'RMSE':         rmse_func,
    'MAE':          mae_func,
    'Bias':         bias_func,
    'Lin CCC':      lins_ccc,
    '% within ±0.5': pct_within_05,
}

N_BOOTSTRAP = 2000  # number of bootstrap resamples

print(f'=== Bootstrap 95% CIs ({N_BOOTSTRAP:,} resamples): ADAG on External Dataset ===\n')
print(f'{"Metric":<20} {"Mean":>8} {"95% CI":>22}')
print('-' * 52)

for name, func in ci_metrics.items():
    # bootstrap_ci returns (lower bound, upper bound, mean)
    lo, hi, mean = bootstrap_ci(y_true, y_pred_adag, func, n_bootstrap=N_BOOTSTRAP)
    print(f'{name:<20} {mean:>8.3f}   [{lo:.3f}, {hi:.3f}]')

## 7. Clinical Classification Agreement

Compare clinical classification (normal / prediabetes / diabetes) between measured
and estimated HbA1c at the diagnostic thresholds of **5.7 %** and **6.5 %**.
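
Because many records sit exactly on a threshold in a discretised dataset, the handling of boundary values matters. The sketch below shows one way to assign the three classes so that 5.7 falls in prediabetes and 6.5 in diabetes; `classify_sketch` is a hypothetical helper and the input values are made up.

```python
import numpy as np

LABELS = np.array(["Normal", "Prediabetes", "Diabetes"])

def classify_sketch(values):
    """Map HbA1c (%) to a class using lower-inclusive thresholds at 5.7 and 6.5."""
    values = np.asarray(values, dtype=float)
    # With right=False, digitize returns i where bins[i-1] <= x < bins[i],
    # so a value equal to a threshold is placed in the higher class
    return LABELS[np.digitize(values, [5.7, 6.5])]

demo_labels = classify_sketch([5.6, 5.7, 6.4, 6.5])
print(demo_labels.tolist())
```

This gives the same boundary behaviour as comparing with `< 5.7` and `< 6.5` in order.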

In [None]:
def classify_hba1c(values):
    return np.select(
        [values < 5.7, values < 6.5],
        ['Normal', 'Prediabetes'],
        default='Diabetes',
    )

true_class = classify_hba1c(y_true)
pred_class = classify_hba1c(y_pred_adag)

agreement = (true_class == pred_class).mean() * 100
print(f'Overall classification agreement: {agreement:.1f}%\n')

# Confusion-style summary
ct = pd.crosstab(
    pd.Categorical(true_class, categories=['Normal', 'Prediabetes', 'Diabetes']),
    pd.Categorical(pred_class, categories=['Normal', 'Prediabetes', 'Diabetes']),
    rownames=['Measured'],
    colnames=['Estimated'],
)
print(ct)

## 8. Summary & Discussion

### Key Findings

* The ADAG estimator was applied to an external dataset of ~100 000 records.
* Performance metrics (RMSE, MAE, bias, Lin's CCC, and the percentage of estimates within
  ±0.5 HbA1c percentage points) are reported with bootstrap 95% confidence intervals.
* ML models (Ridge, RF, LightGBM) could **not** be applied because the external
  dataset lacks triglycerides, HDL, haemoglobin, and MCV features.

### Dataset Limitations

1. **Synthetic / aggregated data**: HbA1c and glucose values are heavily discretised
   (only 18 unique values each), which limits the granularity of the comparison.
2. **Glucose may not be fasting**: the estimator is given the glucose column as if it were
   fasting plasma glucose; non-fasting values are typically higher than fasting ones, so
   they would be expected to bias the HbA1c estimates upward.
3. **Unknown HbA1c measurement method**: the ground truth may not be HPLC.
4. **Missing biomarkers**: the absence of a lipid panel, haemoglobin, and MCV prevents ML
   model evaluation and limits this external validation to the ADAG estimator.
5. **Population mismatch**: the Kaggle dataset's demographics may differ from the
   NHANES US population used for model training.

### Conclusion

This external validation tests how well the simplest mechanistic estimator (ADAG)
generalises to an independent dataset. Given the limitations listed above, the results
should be interpreted cautiously. For a more rigorous external validation, a clinical
research dataset with HPLC-measured HbA1c and a full biomarker panel (e.g. ARIC via
BioLINCC) is recommended.