# 01: Data Sourcing - NHANES Glycemic Data Pipeline

This notebook demonstrates the complete data sourcing pipeline for HbA1c estimation research:

1. **Downloading** NHANES glycemic data files (GHB, GLU, TRIGLY, HDL, CBC, DEMO)
2. **Parsing** SAS Transport (XPT) files into pandas DataFrames
3. **Cleaning** and harmonizing data across NHANES cycles
4. **Quality reporting** for data validation
5. **Visualization** of key relationships (HbA1c vs. fasting plasma glucose, FPG)

---

## Background

NHANES (National Health and Nutrition Examination Survey) provides HPLC-measured HbA1c values
alongside fasting plasma glucose, lipid panels, and complete blood counts for the same
participants, which makes it well suited to training and validating HbA1c estimation models.

**Data Sources:**
- GHB: Glycohemoglobin (HbA1c) - HPLC method
- GLU: Fasting Plasma Glucose  
- TRIGLY: Triglycerides
- HDL: HDL Cholesterol
- CBC: Complete Blood Count (hemoglobin, MCV)
- DEMO: Demographics (age, sex)

In [None]:
# Standard library imports
import sys
from pathlib import Path

# Add parent directory to path for imports
sys.path.insert(0, str(Path.cwd().parent))

# Third-party imports
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt

# Local imports
from hba1cE.data import (
    download_nhanes_glycemic,
    read_xpt,
    clean_glycemic_data,
    generate_quality_report,
    NHANES_FILE_MAPPINGS,
)

# Configure matplotlib
plt.style.use('seaborn-v0_8-whitegrid')
plt.rcParams['figure.figsize'] = (10, 6)
plt.rcParams['font.size'] = 11

print("Imports successful!")
print(f"Available NHANES cycles: {list(NHANES_FILE_MAPPINGS.keys())}")

---

## Step 1: Download NHANES Data

Download XPT files from the CDC NHANES website. The pipeline covers cycles 2011-2018; the cell
below fetches only 2017-2018 to keep the demo short. Files are saved to `data/raw/`, and any file
already on disk is skipped.
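
The skip-if-present behavior described above can be sketched as follows. This is a minimal illustration, not the implementation of `download_nhanes_glycemic()`: the helper name `fetch_if_missing` and its `fetch` parameter are hypothetical, and the caller is assumed to supply the file URL.

```python
from pathlib import Path
from typing import Callable
from urllib.request import urlretrieve


def fetch_if_missing(url: str, dest: Path, fetch: Callable = urlretrieve) -> bool:
    """Download `url` to `dest` unless a non-empty file is already there.

    Returns True if a download happened, False if it was skipped.
    `fetch` is injectable (hypothetical parameter) so the logic can be
    exercised without network access.
    """
    dest = Path(dest)
    if dest.exists() and dest.stat().st_size > 0:
        return False  # already downloaded; nothing to do

    dest.parent.mkdir(parents=True, exist_ok=True)

    # Write to a temporary name first so an interrupted download
    # never leaves a truncated file that would later be skipped.
    tmp = dest.with_name(dest.name + ".part")
    fetch(url, str(tmp))
    tmp.replace(dest)
    return True
```

Downloading to a `.part` file and renaming it afterwards is a design choice made for this sketch: it keeps a partial download from being mistaken for a complete one on the next run.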

In [None]:
# Set data directory (relative to notebook)
DATA_DIR = Path.cwd().parent / "data"

# Download the selected cycle(s); files already on disk are skipped
downloaded_files = download_nhanes_glycemic(
    output_dir=str(DATA_DIR),
    cycles=["2017-2018"]  # Start with one cycle for demo; expand as needed
)

print("\nDownloaded files:")
for cycle, files in downloaded_files.items():
    print(f"\n{cycle}:")
    for data_type, filepath in files.items():
        status = "✓" if filepath else "✗"
        print(f"  {status} {data_type}: {filepath}")

---

## Step 2: Parse XPT Files

Read each XPT file into a pandas DataFrame using the `read_xpt()` function.
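
For orientation, a reader like this could be built on `pandas.read_sas`, which supports the SAS transport format directly. The sketch below is an assumption about how such a wrapper might look, not the project's actual `read_xpt()`; the name `read_xpt_sketch` is hypothetical.

```python
from pathlib import Path

import pandas as pd


def read_xpt_sketch(path) -> pd.DataFrame:
    """Read a SAS transport (XPT) file into a DataFrame (illustrative only)."""
    path = Path(path)
    if not path.is_file():
        raise FileNotFoundError(f"XPT file not found: {path}")

    # format="xport" tells pandas to use its SAS transport reader.
    df = pd.read_sas(path, format="xport")

    # Numeric XPT columns are read as float64, so the participant ID
    # arrives as e.g. 93705.0; cast it to an integer for clean merges.
    if "SEQN" in df.columns:
        df["SEQN"] = df["SEQN"].astype("int64")
    return df
```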

In [None]:
# Define paths to data files (for 2017-2018 cycle)
RAW_DIR = DATA_DIR / "raw"

# Read all required datasets
ghb_df = read_xpt(str(RAW_DIR / "GHB_J.XPT"))     # HbA1c
glu_df = read_xpt(str(RAW_DIR / "GLU_J.XPT"))     # Fasting glucose
trigly_df = read_xpt(str(RAW_DIR / "TRIGLY_J.XPT"))  # Triglycerides
hdl_df = read_xpt(str(RAW_DIR / "HDL_J.XPT"))     # HDL
cbc_df = read_xpt(str(RAW_DIR / "CBC_J.XPT"))     # CBC (Hgb, MCV)
demo_df = read_xpt(str(RAW_DIR / "DEMO_J.XPT"))   # Demographics

print("Dataset shapes:")
print(f"  GHB (HbA1c):     {ghb_df.shape}")
print(f"  GLU (Glucose):   {glu_df.shape}")
print(f"  TRIGLY:          {trigly_df.shape}")
print(f"  HDL:             {hdl_df.shape}")
print(f"  CBC:             {cbc_df.shape}")
print(f"  DEMO:            {demo_df.shape}")

In [None]:
# Preview the GHB (HbA1c) dataset
print("GHB Dataset Preview:")
print(ghb_df.head(10))
print("\nKey column: LBXGH = HbA1c (%)")
print(f"HbA1c range: {ghb_df['LBXGH'].min():.1f}% - {ghb_df['LBXGH'].max():.1f}%")

---

## Step 3: Clean and Merge Data

The `clean_glycemic_data()` function:
- Merges datasets on SEQN (participant ID)
- Renames columns to standardized names
- Removes physiologic outliers
- Returns only complete cases
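
The four steps above can be sketched on two toy tables. This is only an illustration of the technique, not the body of `clean_glycemic_data()`: `merge_and_clean`, `COLUMN_NAMES`, and `PLAUSIBLE_RANGES` are hypothetical names, and the outlier bounds are placeholder values chosen for the example.

```python
from functools import reduce

import pandas as pd

# NHANES variable name -> standardized name (subset, for illustration)
COLUMN_NAMES = {"LBXGH": "hba1c_percent", "LBXGLU": "fpg_mgdl"}

# Assumed plausibility bounds; the project's real limits may differ.
PLAUSIBLE_RANGES = {"hba1c_percent": (3.0, 20.0), "fpg_mgdl": (40.0, 600.0)}


def merge_and_clean(frames):
    # 1. Inner-join every table on the participant ID.
    merged = reduce(lambda left, right: left.merge(right, on="SEQN", how="inner"), frames)
    # 2. Rename to standardized column names.
    merged = merged.rename(columns=COLUMN_NAMES)
    # 3. Keep only complete cases for the columns of interest.
    merged = merged[["SEQN", *COLUMN_NAMES.values()]].dropna()
    # 4. Drop rows outside the plausible range (bounds are inclusive).
    for column, (low, high) in PLAUSIBLE_RANGES.items():
        merged = merged[merged[column].between(low, high)]
    return merged.reset_index(drop=True)


# Toy data: participant 3 has a missing HbA1c, participant 4 an implausible one.
ghb = pd.DataFrame({"SEQN": [1, 2, 3, 4], "LBXGH": [5.4, 6.8, None, 45.0]})
glu = pd.DataFrame({"SEQN": [1, 2, 3, 4], "LBXGLU": [95.0, 140.0, 110.0, 100.0]})

cleaned = merge_and_clean([ghb, glu])
print(cleaned)  # only participants 1 and 2 remain
```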

In [None]:
# Clean and merge all datasets
cleaned_df = clean_glycemic_data(
    ghb_df=ghb_df,
    glu_df=glu_df,
    trigly_df=trigly_df,
    hdl_df=hdl_df,
    cbc_df=cbc_df,
    demo_df=demo_df,
)

print(f"Cleaned dataset shape: {cleaned_df.shape}")
print(f"\nColumn names: {list(cleaned_df.columns)}")
print("\nFirst 5 rows:")
cleaned_df.head()

In [None]:
# Summary statistics
print("Summary Statistics:")
cleaned_df.describe().round(2)

---

## Step 4: Generate Quality Report

Generate a comprehensive data quality report with record counts, descriptive statistics,
and clinical distribution breakdowns.
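
As a rough idea of what the clinical breakdown involves, the counts can be derived from the HbA1c column with boolean masks. The sketch below reuses the dictionary keys that the next cell reads from the report, but the function `hba1c_distribution` itself is hypothetical and is not taken from `generate_quality_report()`.

```python
import pandas as pd


def hba1c_distribution(hba1c: pd.Series) -> dict:
    """Count records per HbA1c category using the 5.7% and 6.5% cut-offs."""
    return {
        "normal_lt_5.7": int((hba1c < 5.7).sum()),
        "prediabetes_5.7_to_6.4": int(((hba1c >= 5.7) & (hba1c < 6.5)).sum()),
        "diabetes_gte_6.5": int((hba1c >= 6.5).sum()),
    }


values = pd.Series([5.0, 5.6, 5.7, 6.4, 6.5, 9.1])
counts = hba1c_distribution(values)
print(counts)
```

The three masks are mutually exclusive and cover every non-missing value, so the counts sum to the number of non-missing records.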

In [None]:
# Generate and save quality report
REPORT_PATH = DATA_DIR / "quality_report.txt"

report = generate_quality_report(cleaned_df, str(REPORT_PATH))

print(f"Report saved to: {REPORT_PATH}")
print(f"\n{'='*50}")
print("QUALITY REPORT SUMMARY")
print(f"{'='*50}")
print(f"\nTotal Records: {report['record_count']:,}")

print("\n--- HbA1c Distribution ---")
hba1c_dist = report['hba1c_distribution']
total = report['record_count']
print(f"  Normal (<5.7%):      {hba1c_dist['normal_lt_5.7']:,} ({100*hba1c_dist['normal_lt_5.7']/total:.1f}%)")
print(f"  Prediabetes (5.7-6.4%): {hba1c_dist['prediabetes_5.7_to_6.4']:,} ({100*hba1c_dist['prediabetes_5.7_to_6.4']/total:.1f}%)")
print(f"  Diabetes (≥6.5%):    {hba1c_dist['diabetes_gte_6.5']:,} ({100*hba1c_dist['diabetes_gte_6.5']/total:.1f}%)")

---

## Step 5: Data Visualization

Visualize key relationships and distributions in the data.

In [None]:
# Create figure with subplots
fig, axes = plt.subplots(2, 2, figsize=(14, 12))

# 1. HbA1c vs FPG Scatter Plot
ax1 = axes[0, 0]
ax1.scatter(
    cleaned_df['fpg_mgdl'], 
    cleaned_df['hba1c_percent'],
    alpha=0.3, 
    s=10,
    c='steelblue'
)
ax1.axhline(y=5.7, color='orange', linestyle='--', alpha=0.7, label='Prediabetes (5.7%)')
ax1.axhline(y=6.5, color='red', linestyle='--', alpha=0.7, label='Diabetes (6.5%)')
ax1.axvline(x=100, color='orange', linestyle=':', alpha=0.5)
ax1.axvline(x=126, color='red', linestyle=':', alpha=0.5)
ax1.set_xlabel('Fasting Plasma Glucose (mg/dL)')
ax1.set_ylabel('HbA1c (%)')
ax1.set_title('HbA1c vs Fasting Plasma Glucose')
ax1.legend(loc='upper left')
ax1.set_xlim(40, 400)
ax1.set_ylim(3, 15)

# 2. HbA1c Distribution
ax2 = axes[0, 1]
ax2.hist(
    cleaned_df['hba1c_percent'], 
    bins=50, 
    color='steelblue', 
    edgecolor='white',
    alpha=0.8
)
ax2.axvline(x=5.7, color='orange', linestyle='--', linewidth=2, label='Prediabetes threshold')
ax2.axvline(x=6.5, color='red', linestyle='--', linewidth=2, label='Diabetes threshold')
ax2.set_xlabel('HbA1c (%)')
ax2.set_ylabel('Count')
ax2.set_title('HbA1c Distribution')
ax2.legend()

# 3. FPG Distribution
ax3 = axes[1, 0]
ax3.hist(
    cleaned_df['fpg_mgdl'], 
    bins=50, 
    color='forestgreen', 
    edgecolor='white',
    alpha=0.8
)
ax3.axvline(x=100, color='orange', linestyle='--', linewidth=2, label='Prediabetes threshold')
ax3.axvline(x=126, color='red', linestyle='--', linewidth=2, label='Diabetes threshold')
ax3.set_xlabel('Fasting Plasma Glucose (mg/dL)')
ax3.set_ylabel('Count')
ax3.set_title('Fasting Plasma Glucose Distribution')
ax3.legend()

# 4. Correlation Matrix
ax4 = axes[1, 1]
corr_vars = ['hba1c_percent', 'fpg_mgdl', 'tg_mgdl', 'hdl_mgdl', 'hgb_gdl', 'age_years']
corr_matrix = cleaned_df[corr_vars].corr()
im = ax4.imshow(corr_matrix, cmap='RdBu_r', vmin=-1, vmax=1)
ax4.set_xticks(range(len(corr_vars)))
ax4.set_yticks(range(len(corr_vars)))
ax4.set_xticklabels(['HbA1c', 'FPG', 'TG', 'HDL', 'Hgb', 'Age'], rotation=45, ha='right')
ax4.set_yticklabels(['HbA1c', 'FPG', 'TG', 'HDL', 'Hgb', 'Age'])
ax4.set_title('Correlation Matrix')

# Add correlation values
for i in range(len(corr_vars)):
    for j in range(len(corr_vars)):
        text = f"{corr_matrix.iloc[i, j]:.2f}"
        ax4.text(j, i, text, ha='center', va='center', fontsize=9,
                color='white' if abs(corr_matrix.iloc[i, j]) > 0.5 else 'black')

plt.colorbar(im, ax=ax4, shrink=0.8)

plt.tight_layout()
plt.savefig(DATA_DIR / 'data_visualization.png', dpi=150, bbox_inches='tight')
plt.show()

print(f"\nVisualization saved to: {DATA_DIR / 'data_visualization.png'}")

---

## Summary

This notebook demonstrated the complete NHANES data sourcing pipeline:

1. **Downloaded** NHANES XPT files for glycemic, lipid, CBC, and demographic data
2. **Parsed** SAS transport files into pandas DataFrames
3. **Cleaned** and merged datasets, removing outliers and incomplete cases
4. **Generated** a quality report with clinical distribution breakdowns
5. **Visualized** HbA1c vs FPG relationships and distributions

### Key Observations

- HbA1c and fasting plasma glucose are strongly correlated, as expected for two measures of glycemia
- The dataset spans the full clinical spectrum, from normal through prediabetes to diabetes
- Triglycerides show a moderate positive correlation with HbA1c
- Age shows a weak positive correlation with HbA1c

### Next Steps

Continue to **Notebook 02: Estimator Comparison** to compare mechanistic HbA1c estimation methods.

In [None]:
# Save cleaned data for subsequent notebooks
PROCESSED_DIR = DATA_DIR / "processed"
PROCESSED_DIR.mkdir(parents=True, exist_ok=True)  # also create data/ if it is missing

cleaned_df.to_csv(PROCESSED_DIR / "nhanes_glycemic_cleaned.csv", index=False)
print(f"Cleaned data saved to: {PROCESSED_DIR / 'nhanes_glycemic_cleaned.csv'}")
print(f"Shape: {cleaned_df.shape}")