# 01 — Data Sourcing & Harmonization

This notebook demonstrates the complete NHANES kidney-function data pipeline implemented in `eGFR/data.py`:

1. **Downloading** — Fetch SAS transport (`.XPT`) files from the CDC NHANES website
2. **Parsing** — Read XPT files into pandas DataFrames
3. **Cleaning** — Merge, rename, correct, and filter the data
4. **Quality Report** — Generate a summary of the cleaned dataset
5. **Visualization** — Explore creatinine distributions and demographic characteristics

---

## Data Sources

| Dataset | Description | Key Variables |
|---------|-------------|---------------|
| **BIOPRO** | Biochemistry profile | `LBXSCR` (serum creatinine) |
| **DEMO** | Demographics | `RIDAGEYR` (age), `RIAGENDR` (sex) |
| **BMX** | Body measures | `BMXWT` (weight), `BMXHT` (height) |
| **SSPRT** | Cystatin C (1999–2002 only) | `SSPRT` (cystatin C) |

All data comes from the [CDC NHANES](https://wwwn.cdc.gov/nchs/nhanes/) public-use data files.

In [None]:
import sys, os

# Ensure the project root is on the path so we can import eGFR
project_root = os.path.abspath(os.path.join(os.getcwd(), ".."))
if project_root not in sys.path:
    sys.path.insert(0, project_root)

import warnings
import numpy as np
import pandas as pd
import matplotlib
matplotlib.use("Agg")  # non-interactive backend for headless runs; figures are written to disk via savefig
import matplotlib.pyplot as plt

# eGFR package imports
from eGFR.data import (
    download_nhanes_kidney,
    read_xpt,
    clean_kidney_data,
    generate_quality_report,
    DEFAULT_CYCLES,
    CYSTATIN_CYCLES,
)
from eGFR.utils import egfr_to_ckd_stage

print(f"Project root: {project_root}")
print(f"Default NHANES cycles: {DEFAULT_CYCLES}")
print(f"Cystatin C cycles:     {CYSTATIN_CYCLES}")

---

## Step 1 — Download NHANES Data

The `download_nhanes_kidney()` function fetches XPT files from the CDC NHANES server:
BIOPRO, DEMO, and BMX for cycles 2005–2018, plus cystatin C (SSPRT) for
1999–2002.

Files are saved to `data/raw/` and existing files are skipped automatically.

> **Note:** This cell requires internet access. If the files cannot be downloaded,
> this notebook falls back to synthetic demonstration data (see Step 2).
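
The skip-existing behaviour can be sketched as follows. This is a minimal illustration, not the code in `eGFR/data.py`: `CYCLE_SUFFIX`, `expected_files`, and `files_to_fetch` are hypothetical names, and the sketch only works out which file names are still missing locally; it does not contact the CDC server. The suffix letters follow the NHANES file naming convention (for example, `BIOPRO_J.XPT` for 2017–2018).

```python
from pathlib import Path

# NHANES appends one letter per two-year cycle to each file name.
CYCLE_SUFFIX = {
    "2005-2006": "D", "2007-2008": "E", "2009-2010": "F", "2011-2012": "G",
    "2013-2014": "H", "2015-2016": "I", "2017-2018": "J",
}
COMPONENTS = ("BIOPRO", "DEMO", "BMX")


def expected_files(cycle):
    """Return the XPT file names expected for one cycle (hypothetical helper)."""
    suffix = CYCLE_SUFFIX[cycle]
    return [f"{name}_{suffix}.XPT" for name in COMPONENTS]


def files_to_fetch(output_dir, cycle):
    """Return only the file names that are not already present in output_dir."""
    out = Path(output_dir)
    return [name for name in expected_files(cycle) if not (out / name).exists()]
```

Calling `files_to_fetch(RAW_DIR, "2017-2018")` before and after the download cell is a quick way to see which files were actually written.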

In [None]:
# Download a single cycle to demonstrate (change to DEFAULT_CYCLES for all data)
RAW_DIR = os.path.join(project_root, "data", "raw")

# Download just 2017-2018 as a demo (3 files: BIOPRO, DEMO, BMX)
demo_cycles = ["2017-2018"]

print(f"Output directory: {RAW_DIR}")
print(f"Downloading cycles: {demo_cycles}")
print()

summary = download_nhanes_kidney(output_dir=RAW_DIR, cycles=demo_cycles)

print(f"\nDownloaded: {len(summary['downloaded'])} files")
print(f"Failed:     {len(summary['failed'])} files")

---

## Step 2 — Parse XPT Files

The `read_xpt()` function reads SAS transport files into pandas DataFrames using
`pd.read_sas(format='xport')`.

If real NHANES XPT files are available, they are loaded directly. Otherwise,
we generate **synthetic demonstration data** that mirrors the NHANES schema so the
rest of the pipeline can be demonstrated end-to-end.
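
As a rough sketch of what parsing involves (the real `read_xpt()` may differ), `pd.read_sas` returns numeric columns as floats and, unless an `encoding` is passed, character columns as `bytes`. The hypothetical helpers below, `tidy_xport_frame` and `read_xpt_sketch`, decode byte strings and cast `SEQN` to an integer so it behaves well as a join key. They assume `SEQN` is never missing.

```python
import pandas as pd


def tidy_xport_frame(df):
    """Decode bytes columns and make SEQN an integer (hypothetical helper)."""
    out = df.copy()
    for col in out.columns:
        if out[col].dtype == object:
            sample = out[col].dropna()
            # Character columns come back as bytes when no encoding is given.
            if not sample.empty and isinstance(sample.iloc[0], bytes):
                out[col] = out[col].str.decode("utf-8")
    if "SEQN" in out.columns:
        # Assumption: SEQN has no missing values, so the cast cannot fail.
        out["SEQN"] = out["SEQN"].astype("int64")
    return out


def read_xpt_sketch(path):
    """Read one SAS transport file and tidy it (not the package's read_xpt)."""
    return tidy_xport_frame(pd.read_sas(path, format="xport"))
```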

In [None]:
def _make_synthetic_data(n=2000, seed=42):
    """Generate synthetic NHANES-like DataFrames for demonstration.

    The synthetic data mirrors the schema produced by real NHANES XPT files
    (BIOPRO, DEMO, BMX) with clinically plausible distributions.
    """
    rng = np.random.default_rng(seed)
    seqn = np.arange(1, n + 1, dtype=float)
    sex = rng.choice([1.0, 2.0], size=n)  # 1=male, 2=female
    age = rng.uniform(18, 85, size=n)

    # Creatinine: lognormal, males higher than females
    cr = np.where(
        sex == 1,
        rng.lognormal(np.log(1.0), 0.2, size=n),
        rng.lognormal(np.log(0.8), 0.2, size=n),
    )

    # Weight and height
    weight = np.where(
        sex == 1,
        rng.normal(85, 15, size=n),
        rng.normal(72, 15, size=n),
    )
    height = np.where(
        sex == 1,
        rng.normal(175, 8, size=n),
        rng.normal(162, 7, size=n),
    )

    biopro = pd.DataFrame({"SEQN": seqn, "LBXSCR": cr})
    demo = pd.DataFrame({"SEQN": seqn, "RIDAGEYR": age, "RIAGENDR": sex})
    bmx = pd.DataFrame({"SEQN": seqn, "BMXWT": weight, "BMXHT": height})
    return biopro, demo, bmx


# Try loading real XPT files; fall back to synthetic data
USE_REAL_DATA = False
biopro_path = os.path.join(RAW_DIR, "BIOPRO_J.XPT")
demo_path = os.path.join(RAW_DIR, "DEMO_J.XPT")
bmx_path = os.path.join(RAW_DIR, "BMX_J.XPT")

try:
    biopro_df = read_xpt(biopro_path)
    demo_df = read_xpt(demo_path)
    bmx_df = read_xpt(bmx_path)
    USE_REAL_DATA = True
    print("✓ Loaded real NHANES XPT files")
except Exception as e:
    print(f"⚠ Could not load XPT files ({type(e).__name__}: {e})")
    print("  → Using synthetic demonstration data instead.\n")
    biopro_df, demo_df, bmx_df = _make_synthetic_data(n=2000)

print(f"BIOPRO: {biopro_df.shape[0]:,} rows, {biopro_df.shape[1]} columns")
print(f"DEMO:   {demo_df.shape[0]:,} rows, {demo_df.shape[1]} columns")
print(f"BMX:    {bmx_df.shape[0]:,} rows, {bmx_df.shape[1]} columns")

data_label = "NHANES 2017–2018" if USE_REAL_DATA else "Synthetic Demo Data"
print(f"\nData source: {data_label}")

print("\n--- BIOPRO key columns ---")
print(biopro_df[["SEQN", "LBXSCR"]].head())

print("\n--- DEMO key columns ---")
print(demo_df[["SEQN", "RIDAGEYR", "RIAGENDR"]].head())

print("\n--- BMX key columns ---")
print(bmx_df[["SEQN", "BMXWT", "BMXHT"]].head())

---

## Step 3 — Clean & Merge Data

The `clean_kidney_data()` function:

1. **Merges** BIOPRO, DEMO, and BMX on `SEQN` (inner join)
2. **Renames** columns to a standard schema (`cr_mgdl`, `age_years`, `sex`, etc.)
3. **Applies IDMS correction** for pre-2007 data (×0.95) when requested
4. **Filters rows**: drops creatinine values outside 0.2–15 mg/dL and participants younger than 18
5. **Drops rows** with missing values (NaN) in core columns

> **IDMS Note:** The 2017–2018 cycle uses IDMS-standardized creatinine, so
> `apply_idms_correction=False` (default) is appropriate.
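
The steps listed above can be sketched in a few lines of pandas. This is an illustration under stated assumptions rather than the body of `clean_kidney_data()`: the function name `clean_sketch` and the output column names `weight_kg` and `height_cm` are hypothetical, while `cr_mgdl`, `age_years`, `sex`, the 0.95 factor, and the thresholds come from the list above.

```python
import pandas as pd

# NHANES variable -> harmonized name. The last two names are assumptions.
RENAME = {
    "LBXSCR": "cr_mgdl",
    "RIDAGEYR": "age_years",
    "RIAGENDR": "sex",
    "BMXWT": "weight_kg",
    "BMXHT": "height_cm",
}


def clean_sketch(biopro, demo, bmx, apply_idms_correction=False):
    """Merge, rename, optionally correct, and filter (illustrative only)."""
    # Inner joins keep only participants present in all three files.
    merged = biopro.merge(demo, on="SEQN", how="inner").merge(bmx, on="SEQN", how="inner")
    out = merged[["SEQN", *RENAME]].rename(columns=RENAME)
    if apply_idms_correction:
        out["cr_mgdl"] = out["cr_mgdl"] * 0.95
    out = out.dropna(subset=["cr_mgdl", "age_years", "sex"])
    keep = out["cr_mgdl"].between(0.2, 15.0) & (out["age_years"] >= 18)
    return out.loc[keep].reset_index(drop=True)
```

Comparing `len(clean_sketch(...))` with the row counts of the inputs shows how many participants each step removes.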

In [None]:
# Clean the data (2017-2018 is post-IDMS, no correction needed)
with warnings.catch_warnings(record=True) as w:
    warnings.simplefilter("always")
    cleaned = clean_kidney_data(
        biopro_df=biopro_df,
        demo_df=demo_df,
        bmx_df=bmx_df,
        cystatin_df=None,
        apply_idms_correction=False,
    )
    if w:
        for warning in w:
            print(f"⚠ {warning.message}")

print(f"\nCleaned dataset: {cleaned.shape[0]:,} rows, {cleaned.shape[1]} columns")
print(f"Columns: {list(cleaned.columns)}")
print("\nSample rows:")
cleaned.head(10)

---

## Step 4 — Generate Quality Report

The `generate_quality_report()` function produces a summary including:
- Record count
- Descriptive statistics (mean/SD) for key variables
- CKD stage distribution (via CKD-EPI 2021)
- Sex distribution
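
To make the report contents concrete, here is a minimal sketch of how the record count, descriptive statistics, and sex distribution could be assembled. The function `quality_summary_sketch` and the exact line format are assumptions for illustration; the real report layout is defined by `generate_quality_report()`. The CKD stage section is omitted here.

```python
import pandas as pd

SEX_LABELS = {1: "Male", 2: "Female"}  # NHANES RIAGENDR coding


def quality_summary_sketch(df):
    """Build a short plain-text summary of a cleaned DataFrame."""
    lines = [f"Records: {len(df):,}"]
    for col in ("cr_mgdl", "age_years"):
        # Series.std() uses the sample standard deviation (ddof=1).
        lines.append(f"{col}: mean={df[col].mean():.2f}, SD={df[col].std():.2f}")
    for code, n in df["sex"].value_counts().sort_index().items():
        label = SEX_LABELS.get(int(code), str(code))
        lines.append(f"{label}: {n} ({100.0 * n / len(df):.1f}%)")
    return "\n".join(lines)
```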

In [None]:
report_path = os.path.join(project_root, "data", "quality_report.txt")
report_text = generate_quality_report(cleaned, report_path)

print(report_text)
print(f"\n(Report saved to: {report_path})")

---

## Step 5 — Visualizations

### 5.1 Serum Creatinine Distribution

Serum creatinine is the primary biomarker for eGFR estimation. Expected patterns:
- Right-skewed distribution (most values 0.5–1.5 mg/dL)
- Higher values in males than females (greater muscle mass)
- A long tail representing kidney impairment
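
These expectations can also be checked numerically before plotting. The sketch below uses a hypothetical helper, `creatinine_pattern_check`, and a small synthetic sample drawn the same way as the fallback data in Step 2; with the notebook's own data, pass `cleaned` instead.

```python
import numpy as np
import pandas as pd


def creatinine_pattern_check(df):
    """Return skewness and per-sex medians of cr_mgdl (illustrative helper)."""
    by_sex = df.groupby("sex")["cr_mgdl"].median()
    return {
        "skewness": float(df["cr_mgdl"].skew()),  # > 0 means right-skewed
        "male_median": float(by_sex.get(1, np.nan)),
        "female_median": float(by_sex.get(2, np.nan)),
    }


# Synthetic sample: lognormal creatinine, males centred higher than females.
rng = np.random.default_rng(0)
sex = rng.choice([1, 2], size=1000)
cr = np.where(
    sex == 1,
    rng.lognormal(np.log(1.0), 0.2, 1000),
    rng.lognormal(np.log(0.8), 0.2, 1000),
)
result = creatinine_pattern_check(pd.DataFrame({"sex": sex, "cr_mgdl": cr}))
print(result)
```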

In [None]:
fig, axes = plt.subplots(1, 2, figsize=(14, 5))

# --- Overall distribution ---
ax = axes[0]
ax.hist(cleaned["cr_mgdl"], bins=60, color="#2196F3", edgecolor="white",
        alpha=0.85, density=True)
ax.set_xlabel("Serum Creatinine (mg/dL)", fontsize=12)
ax.set_ylabel("Density", fontsize=12)
ax.set_title("Creatinine Distribution (Overall)", fontsize=13, fontweight="bold")
ax.axvline(cleaned["cr_mgdl"].median(), color="red", linestyle="--", linewidth=1.5,
           label=f"Median: {cleaned['cr_mgdl'].median():.2f}")
ax.legend(fontsize=10)
ax.set_xlim(0, 5)

# --- By sex ---
ax = axes[1]
male = cleaned.loc[cleaned["sex"] == 1, "cr_mgdl"]
female = cleaned.loc[cleaned["sex"] == 2, "cr_mgdl"]
ax.hist(male, bins=60, color="#1E88E5", edgecolor="white", alpha=0.65,
        density=True, label=f"Male (n={len(male)})")
ax.hist(female, bins=60, color="#E91E63", edgecolor="white", alpha=0.65,
        density=True, label=f"Female (n={len(female)})")
ax.set_xlabel("Serum Creatinine (mg/dL)", fontsize=12)
ax.set_ylabel("Density", fontsize=12)
ax.set_title("Creatinine Distribution by Sex", fontsize=13, fontweight="bold")
ax.legend(fontsize=10)
ax.set_xlim(0, 5)

plt.tight_layout()
plt.savefig(os.path.join(project_root, "data", "creatinine_distribution.png"),
            dpi=150, bbox_inches="tight")
plt.show()
print("Figure saved to data/creatinine_distribution.png")

### 5.2 Age & Sex Demographics

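NHANES samples participants of all ages; the cleaning step keeps only adults aged 18 and over.
We expect a broad age distribution reflecting the US civilian non-institutionalized population.
With real data, `RIDAGEYR` is top-coded at 80 in recent cycles, which can show up as a spike in the oldest bin.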

In [None]:
fig, axes = plt.subplots(1, 2, figsize=(14, 5))

# --- Age histogram ---
ax = axes[0]
ax.hist(cleaned["age_years"], bins=40, color="#4CAF50", edgecolor="white", alpha=0.85)
ax.set_xlabel("Age (years)", fontsize=12)
ax.set_ylabel("Count", fontsize=12)
ax.set_title("Age Distribution", fontsize=13, fontweight="bold")
ax.axvline(cleaned["age_years"].median(), color="red", linestyle="--", linewidth=1.5,
           label=f"Median: {cleaned['age_years'].median():.0f} years")
ax.legend(fontsize=10)

# --- Sex breakdown (pie chart) ---
ax = axes[1]
sex_counts = cleaned["sex"].value_counts().sort_index()
labels = ["Male", "Female"]
colors = ["#1E88E5", "#E91E63"]
counts = [sex_counts.get(1, 0), sex_counts.get(2, 0)]
ax.pie(counts, labels=labels, colors=colors, autopct="%1.1f%%",
       startangle=90, textprops={"fontsize": 12})
ax.set_title("Sex Distribution", fontsize=13, fontweight="bold")

plt.tight_layout()
plt.savefig(os.path.join(project_root, "data", "age_sex_demographics.png"),
            dpi=150, bbox_inches="tight")
plt.show()
print("Figure saved to data/age_sex_demographics.png")

### 5.3 Creatinine vs. Age (Scatter)

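Creatinine tends to rise with age as kidney function declines, although loss of muscle mass in older adults partly offsets the increase.
Males typically have higher creatinine due to greater muscle mass.
The synthetic fallback data draws creatinine independently of age, so the age trend appears only with real NHANES data.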

In [None]:
fig, ax = plt.subplots(figsize=(10, 6))

male_mask = cleaned["sex"] == 1
female_mask = cleaned["sex"] == 2

ax.scatter(cleaned.loc[male_mask, "age_years"],
           cleaned.loc[male_mask, "cr_mgdl"],
           alpha=0.25, s=10, color="#1E88E5", label="Male")
ax.scatter(cleaned.loc[female_mask, "age_years"],
           cleaned.loc[female_mask, "cr_mgdl"],
           alpha=0.25, s=10, color="#E91E63", label="Female")

ax.set_xlabel("Age (years)", fontsize=12)
ax.set_ylabel("Serum Creatinine (mg/dL)", fontsize=12)
ax.set_title("Serum Creatinine vs. Age", fontsize=14, fontweight="bold")
ax.legend(fontsize=11, markerscale=5)
ax.set_ylim(0, 5)
ax.grid(True, alpha=0.3)

plt.tight_layout()
plt.savefig(os.path.join(project_root, "data", "creatinine_vs_age.png"),
            dpi=150, bbox_inches="tight")
plt.show()
print("Figure saved to data/creatinine_vs_age.png")

### 5.4 CKD Stage Distribution

We compute eGFR using CKD-EPI 2021 and classify each subject by CKD stage.
In a general population sample, most subjects fall in G1–G2 (normal/mild).
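
For reference, the published CKD-EPI 2021 creatinine equation and the standard GFR categories can be written as a short vectorized sketch. The names `ckd_epi_2021_sketch` and `stage_sketch` are hypothetical; the package's own `_calc_ckd_epi_2021` and `egfr_to_ckd_stage` are what the cell below uses and may differ in detail. Sex is assumed to follow the NHANES coding (1 = male, 2 = female).

```python
import numpy as np


def ckd_epi_2021_sketch(cr_mgdl, age_years, sex):
    """CKD-EPI 2021 creatinine eGFR in mL/min/1.73 m² (illustrative)."""
    cr = np.asarray(cr_mgdl, dtype=float)
    age = np.asarray(age_years, dtype=float)
    female = np.asarray(sex) == 2
    kappa = np.where(female, 0.7, 0.9)
    alpha = np.where(female, -0.241, -0.302)
    ratio = cr / kappa
    egfr = (
        142.0
        * np.minimum(ratio, 1.0) ** alpha
        * np.maximum(ratio, 1.0) ** -1.200
        * 0.9938 ** age
    )
    return np.where(female, egfr * 1.012, egfr)


def stage_sketch(egfr):
    """Map a single eGFR value to a GFR category G1 to G5."""
    for lower_bound, label in [(90, "G1"), (60, "G2"), (45, "G3a"), (30, "G3b"), (15, "G4")]:
        if egfr >= lower_bound:
            return label
    return "G5"
```

On the cleaned frame this could be applied as `ckd_epi_2021_sketch(cleaned["cr_mgdl"], cleaned["age_years"], cleaned["sex"])`, which avoids a row-wise `apply`.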

In [None]:
from eGFR.data import _calc_ckd_epi_2021

# Compute eGFR for each subject
cleaned["egfr"] = cleaned.apply(
    lambda row: _calc_ckd_epi_2021(row["cr_mgdl"], row["age_years"], row["sex"]),
    axis=1,
)
cleaned["ckd_stage"] = cleaned["egfr"].apply(egfr_to_ckd_stage)

# Bar chart of CKD stages
stage_order = ["G1", "G2", "G3a", "G3b", "G4", "G5"]
stage_counts = cleaned["ckd_stage"].value_counts().reindex(stage_order, fill_value=0)

fig, ax = plt.subplots(figsize=(8, 5))
colors = ["#4CAF50", "#8BC34A", "#FFC107", "#FF9800", "#F44336", "#B71C1C"]
bars = ax.bar(stage_order, stage_counts.values, color=colors, edgecolor="white")

# Add count labels on bars
for bar, count in zip(bars, stage_counts.values):
    if count > 0:
        ax.text(bar.get_x() + bar.get_width() / 2, bar.get_height() + 5,
                str(count), ha="center", va="bottom", fontsize=10, fontweight="bold")

ax.set_xlabel("CKD Stage", fontsize=12)
ax.set_ylabel("Count", fontsize=12)
ax.set_title(f"CKD Stage Distribution — CKD-EPI 2021 ({data_label})",
             fontsize=14, fontweight="bold")
ax.grid(axis="y", alpha=0.3)

plt.tight_layout()
plt.savefig(os.path.join(project_root, "data", "ckd_stage_distribution.png"),
            dpi=150, bbox_inches="tight")
plt.show()

# Print percentages
print("\nCKD Stage Distribution:")
for stage in stage_order:
    n = stage_counts[stage]
    pct = 100.0 * n / len(cleaned)
    print(f"  {stage}: {n:>5d} ({pct:5.1f}%)")

---

## Summary

This notebook demonstrated the complete data sourcing pipeline:

| Step | Function | Output |
|------|----------|--------|
| Download | `download_nhanes_kidney()` | XPT files in `data/raw/` |
| Parse | `read_xpt()` | Raw DataFrames |
| Clean | `clean_kidney_data()` | Harmonized DataFrame |
| Report | `generate_quality_report()` | Text quality report |

### Key Observations

- Creatinine is right-skewed, with males showing higher values than females
- After filtering to adults, the NHANES sample covers a broad age range (unweighted counts are not nationally representative without the survey weights)
- Most subjects fall in CKD stages G1–G2 (≥60 mL/min/1.73 m²), as expected for a general population sample
- The pipeline handles merging, renaming, IDMS correction, and outlier removal automatically

### Next Steps

- Download all cycles (2005–2018) for the full training dataset
- Implement CKD-EPI 2021, MDRD, and Cockcroft-Gault equations in `eGFR/models.py`
- Compare equation estimates across the population (Notebook 02)