# Data Sourcing Notebook

This notebook documents the complete NHANES lipid data pipeline for the LDL-C estimation project.

## Overview

We will:
1. **Download** NHANES lipid panel data (TC, HDL, TG, direct LDL)
2. **Parse** the XPT files into DataFrames
3. **Clean** the data (merge, handle missing values, remove outliers)
4. **Generate** a quality report
5. **Visualize** the TG and LDL distributions

In [None]:
import sys
sys.path.insert(0, '..')

import pandas as pd
import matplotlib.pyplot as plt
import numpy as np

from ldlC.data import (
    download_nhanes_lipids,
    read_xpt,
    clean_lipid_data,
    generate_quality_report,
    CYCLE_SUFFIXES,
)

# Set style for plots
plt.style.use('seaborn-v0_8-whitegrid')
plt.rcParams['figure.figsize'] = [10, 6]
plt.rcParams['font.size'] = 12

---

## Step 1: Download NHANES Lipid Data

NHANES (National Health and Nutrition Examination Survey) provides comprehensive lipid panel data including direct LDL measurements via beta-quantification.

Our download function fetches:
- **TRIGLY**: Triglycerides
- **HDL**: HDL cholesterol
- **TCHOL**: Total cholesterol
- **BIOPRO**: Biochemistry profile (contains direct LDL)

In [None]:
# Available NHANES cycles (2005-2018)
print("Available NHANES cycles:")
for cycle, suffix in CYCLE_SUFFIXES.items():
    print(f"  {cycle} -> Suffix: {suffix}")
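
As a rough illustration of how a cycle suffix turns into file names, the sketch below rebuilds the four file names used in this notebook. The `ASSUMED_SUFFIXES` mapping and the `xpt_filenames` helper are hypothetical stand-ins written for this example; the real `CYCLE_SUFFIXES` object may be structured differently.

```python
# Hypothetical stand-in for CYCLE_SUFFIXES (an assumption, not the package's
# actual object). NHANES assigns one suffix letter per two-year cycle.
ASSUMED_SUFFIXES = {
    "2005-2006": "D",
    "2007-2008": "E",
    "2009-2010": "F",
    "2011-2012": "G",
    "2013-2014": "H",
    "2015-2016": "I",
    "2017-2018": "J",
}

# The four components listed in this section.
COMPONENTS = ("TRIGLY", "HDL", "TCHOL", "BIOPRO")


def xpt_filenames(cycle, suffixes=ASSUMED_SUFFIXES):
    """Return the XPT file names expected for one NHANES cycle."""
    if cycle not in suffixes:
        raise KeyError(f"Unknown NHANES cycle: {cycle!r}")
    suffix = suffixes[cycle]
    return [f"{component}_{suffix}.XPT" for component in COMPONENTS]


print(xpt_filenames("2015-2016"))
```

For the 2015-2016 cycle this yields the `_I` file names that Step 2 parses.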

In [None]:
# Download data for a single cycle as demonstration
# Uncomment to actually download (requires internet connection)

# downloaded = download_nhanes_lipids(
#     output_dir="../data",
#     cycles=["2015-2016"],  # Start with one cycle
#     include_direct_ldl=True
# )
# print(f"\nDownloaded files: {downloaded}")

---

## Step 2: Parse XPT Files

NHANES distributes data in SAS transport format (.XPT). The `read_xpt()` function parses these into pandas DataFrames.

In [None]:
# Example: parse XPT files (adjust paths if you downloaded data)
# This demonstrates the API - actual data paths may differ

# tc_df = read_xpt("../data/raw/TCHOL_I.XPT")
# hdl_df = read_xpt("../data/raw/HDL_I.XPT")
# tg_df = read_xpt("../data/raw/TRIGLY_I.XPT")
# ldl_df = read_xpt("../data/raw/BIOPRO_I.XPT")

# print(f"TC DataFrame shape: {tc_df.shape}")
# print(f"HDL DataFrame shape: {hdl_df.shape}")
# print(f"TG DataFrame shape: {tg_df.shape}")
# print(f"LDL Direct DataFrame shape: {ldl_df.shape}")
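
For readers without the project package at hand, XPT parsing can be approximated with pandas alone. The `load_xpt` helper below is a hypothetical sketch, not the implementation of `read_xpt()`; it assumes every NHANES file carries a fully populated `SEQN` column.

```python
from pathlib import Path

import pandas as pd


def load_xpt(path):
    """Read a SAS transport (XPORT) file into a DataFrame."""
    path = Path(path)
    if not path.is_file():
        raise FileNotFoundError(f"XPT file not found: {path}")
    # format="xport" selects the SAS transport reader explicitly instead of
    # inferring it from the file extension.
    df = pd.read_sas(path, format="xport")
    # XPORT stores numeric values as floats; cast the identifier to an
    # integer so merge keys compare cleanly (assumes SEQN has no missing values).
    if "SEQN" in df.columns:
        df["SEQN"] = df["SEQN"].astype("int64")
    return df
```

A call such as `load_xpt("../data/raw/TCHOL_I.XPT")` would then return one row per respondent, provided the file has been downloaded to that path.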

### Key NHANES Column Names

| Measurement | Column Name | File |
|-------------|-------------|------|
| Total Cholesterol | LBXTC | TCHOL |
| HDL Cholesterol | LBDHDD | HDL |
| Triglycerides | LBXTR | TRIGLY |
| Direct LDL | LBDLDL | BIOPRO |
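
Because column names differ between files and occasionally between cycles, it can help to check a parsed DataFrame before merging. The `missing_columns` helper below is a hypothetical sketch on a tiny synthetic frame, not part of the project package.

```python
import pandas as pd


def missing_columns(df, required):
    """Return the required column names that are absent from df."""
    return [name for name in required if name not in df.columns]


# Tiny synthetic frame standing in for a parsed TCHOL file.
demo_tc = pd.DataFrame({"SEQN": [1, 2], "LBXTC": [180.0, 210.0]})

print(missing_columns(demo_tc, ["SEQN", "LBXTC"]))   # nothing is missing
print(missing_columns(demo_tc, ["SEQN", "LBDHDD"]))  # the HDL column is not in this frame
```

Running the same check on each real file before calling `clean_lipid_data()` would surface a wrong column name early, rather than as a `KeyError` in the middle of the merge.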

---

## Step 3: Clean and Merge Data

The `clean_lipid_data()` function:
1. Merges the datasets on `SEQN` (the NHANES respondent sequence number)
2. Renames columns to standardized names
3. Removes physiologically implausible values:
   - TC < 50 mg/dL
   - TG > 2000 mg/dL
   - HDL < 10 mg/dL
4. Calculates non-HDL cholesterol (TC minus HDL)
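
The steps above can be sketched on synthetic data as follows. This is an illustration under stated assumptions, not the body of `clean_lipid_data()`: the column names `tc`, `hdl`, `tg`, and `non_hdl` are hypothetical standardized names, the inner join and the `dropna` call are assumed choices for handling missing values, and the renaming step and the direct LDL frame are left out for brevity.

```python
import pandas as pd


def clean_sketch(tc, hdl, tg):
    """Merge three lipid frames on SEQN, drop implausible rows, add non-HDL-C."""
    # Inner joins keep only respondents present in all three frames (assumption).
    merged = tc.merge(hdl, on="SEQN", how="inner").merge(tg, on="SEQN", how="inner")
    # Drop rows with any missing lipid value (assumption).
    merged = merged.dropna(subset=["tc", "hdl", "tg"])
    # Keep rows that pass all three plausibility filters listed above.
    plausible = (merged["tc"] >= 50) & (merged["tg"] <= 2000) & (merged["hdl"] >= 10)
    cleaned = merged.loc[plausible].copy()
    cleaned["non_hdl"] = cleaned["tc"] - cleaned["hdl"]
    return cleaned.reset_index(drop=True)


# Synthetic inputs: SEQN 2 has an implausible TC, SEQN 3 an implausible HDL,
# and SEQN 4 and 5 each appear in only some of the frames.
tc = pd.DataFrame({"SEQN": [1, 2, 3, 4], "tc": [190.0, 40.0, 220.0, 180.0]})
hdl = pd.DataFrame({"SEQN": [1, 2, 3, 4], "hdl": [50.0, 45.0, 5.0, 60.0]})
tg = pd.DataFrame({"SEQN": [1, 2, 3, 5], "tg": [120.0, 100.0, 150.0, 90.0]})

print(clean_sketch(tc, hdl, tg))
```

Only SEQN 1 survives, with a non-HDL cholesterol of 140 mg/dL (190 minus 50).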

In [None]:
# Example: clean and merge data
# Uncomment when you have downloaded the actual data

# cleaned_df = clean_lipid_data(tc_df, hdl_df, tg_df, ldl_df)
# print(f"Cleaned DataFrame shape: {cleaned_df.shape}")
# print(f"\nColumns: {cleaned_df.columns.tolist()}")
# cleaned_df.head()

---

## Step 4: Generate Quality Report

The quality report provides:
- Total record count
- Mean/SD for TC, HDL, TG, and direct LDL
- Missing value counts
- TG distribution across clinically relevant thresholds
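
A report with these fields could be assembled roughly as shown below. The `quality_summary` function, the `missing` key, and the column names `tc`, `hdl`, and `tg` are assumptions made for this sketch; the structure returned by `generate_quality_report()` may differ.

```python
import numpy as np
import pandas as pd


def quality_summary(df, columns=("tc", "hdl", "tg")):
    """Summarize record count, mean/SD, missing counts, and TG bands."""
    cols = list(columns)
    tg = df["tg"].dropna()
    return {
        "record_count": len(df),
        # Mean and sample standard deviation per column; NaN values are skipped.
        "stats": df[cols].agg(["mean", "std"]).round(1).to_dict(),
        "missing": df[cols].isna().sum().to_dict(),
        "tg_distribution": {
            "<150": int((tg < 150).sum()),
            "150-400": int(((tg >= 150) & (tg < 400)).sum()),
            "400-800": int(((tg >= 400) & (tg <= 800)).sum()),
            ">800": int((tg > 800).sum()),
        },
    }


# Synthetic frame with one missing TC value and one TG value per band.
demo = pd.DataFrame({
    "tc": [180.0, 200.0, 220.0, np.nan],
    "hdl": [50.0, 40.0, 60.0, 55.0],
    "tg": [100.0, 200.0, 500.0, 900.0],
})
summary = quality_summary(demo)
print(summary["record_count"], summary["missing"], summary["tg_distribution"])
```

The TG band edges used here match the ones in the bar chart in Step 5.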

In [None]:
# Example: generate quality report
# Uncomment when you have cleaned data

# report = generate_quality_report(cleaned_df, "../data/quality_report.txt")
# 
# print(f"Record count: {report['record_count']}")
# print(f"\nStats: {report['stats']}")
# print(f"\nTG Distribution: {report['tg_distribution']}")

---

## Step 5: Visualize Distributions

### 5.1 Triglyceride Distribution

The TG distribution is critical because LDL estimation accuracy varies by TG level:
- **< 150 mg/dL**: Normal (Friedewald works well)
- **150-400 mg/dL**: Borderline to high (consider Martin-Hopkins)
- **400-800 mg/dL**: High (use Sampson or extended Martin-Hopkins)
- **> 800 mg/dL**: Very high (direct LDL measurement recommended)
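
To make the dependence on TG concrete, the sketch below assigns a TG value to one of the bands above and evaluates the Friedewald estimate, LDL-C = TC - HDL-C - TG/5 (all in mg/dL). The function names are hypothetical, and placing exactly 400 and 800 mg/dL in the 400-800 band is an assumption chosen to match the bar chart later in this notebook.

```python
def tg_band(tg_mgdl):
    """Map a triglyceride value (mg/dL) to one of the four bands."""
    if tg_mgdl < 150:
        return "<150"
    if tg_mgdl < 400:
        return "150-400"
    if tg_mgdl <= 800:
        return "400-800"
    return ">800"


def friedewald_ldl(tc, hdl, tg):
    """Friedewald estimate in mg/dL: LDL-C = TC - HDL-C - TG/5."""
    return tc - hdl - tg / 5.0


# Hold TC and HDL fixed and vary TG only.
for tg in (100, 300, 600, 900):
    print(tg, tg_band(tg), friedewald_ldl(200, 50, tg))
```

With TC and HDL held fixed, the estimate falls by 1 mg/dL for every 5 mg/dL of TG and turns negative in the last row, which is one way to see why the formula is not relied on at high TG.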

In [None]:
# Create synthetic demo data for visualization
# Replace with actual cleaned_df when data is downloaded

np.random.seed(42)
demo_tg = np.concatenate([
    np.random.lognormal(4.5, 0.4, 3500),  # Normal TG
    np.random.lognormal(5.2, 0.3, 1000),  # Elevated TG
    np.random.lognormal(5.8, 0.3, 400),   # High TG
    np.random.lognormal(6.5, 0.2, 100),   # Very high TG
])
demo_tg = np.clip(demo_tg, 30, 1500)

In [None]:
# TG Histogram with clinical thresholds
fig, ax = plt.subplots(figsize=(12, 6))

ax.hist(demo_tg, bins=100, edgecolor='white', alpha=0.7, color='steelblue')

# Add clinical threshold lines
thresholds = [(150, 'Normal/Borderline'), (400, 'High'), (800, 'Very High')]
colors = ['green', 'orange', 'red']
for (thresh, label), c in zip(thresholds, colors):
    ax.axvline(x=thresh, color=c, linestyle='--', linewidth=2, label=f'{label} ({thresh} mg/dL)')

ax.set_xlabel('Triglycerides (mg/dL)', fontsize=14)
ax.set_ylabel('Count', fontsize=14)
ax.set_title('Triglyceride Distribution with Clinical Thresholds', fontsize=16)
ax.legend(loc='upper right', fontsize=11)
ax.set_xlim(0, 1000)  # zoom in; synthetic values above 1000 mg/dL fall outside the plotted range

plt.tight_layout()
plt.show()

In [None]:
# TG Distribution Bar Chart (by category)
tg_cats = {
    '<150': (demo_tg < 150).sum(),
    '150-400': ((demo_tg >= 150) & (demo_tg < 400)).sum(),
    '400-800': ((demo_tg >= 400) & (demo_tg <= 800)).sum(),
    '>800': (demo_tg > 800).sum(),
}

fig, ax = plt.subplots(figsize=(8, 5))
colors = ['#2ecc71', '#f39c12', '#e74c3c', '#9b59b6']
bars = ax.bar(tg_cats.keys(), tg_cats.values(), color=colors, edgecolor='white')

# Add percentage labels
total = sum(tg_cats.values())
for bar, count in zip(bars, tg_cats.values()):
    pct = count / total * 100
    ax.text(bar.get_x() + bar.get_width()/2, bar.get_height() + 50, 
            f'{pct:.1f}%', ha='center', fontsize=12, fontweight='bold')

ax.set_xlabel('TG Range (mg/dL)', fontsize=14)
ax.set_ylabel('Count', fontsize=14)
ax.set_title('Triglyceride Distribution by Clinical Category', fontsize=16)

plt.tight_layout()
plt.show()

### 5.2 LDL Cholesterol Distribution

In [None]:
# Create synthetic LDL data for demonstration
np.random.seed(123)
demo_ldl = np.random.normal(120, 35, 5000)
demo_ldl = np.clip(demo_ldl, 20, 300)

In [None]:
# LDL Histogram with clinical thresholds
fig, ax = plt.subplots(figsize=(12, 6))

ax.hist(demo_ldl, bins=60, edgecolor='white', alpha=0.7, color='coral')

# Add clinical threshold lines (ATP III guidelines)
ldl_thresholds = [
    (70, 'Very High Risk Target'),
    (100, 'Optimal'),
    (130, 'Borderline High'),
    (160, 'High'),
    (190, 'Very High'),
]
colors = ['#27ae60', '#2ecc71', '#f39c12', '#e67e22', '#e74c3c']

for (thresh, label), c in zip(ldl_thresholds, colors):
    ax.axvline(x=thresh, color=c, linestyle='--', linewidth=2, label=f'{label} ({thresh})')

ax.set_xlabel('LDL Cholesterol (mg/dL)', fontsize=14)
ax.set_ylabel('Count', fontsize=14)
ax.set_title('LDL-C Distribution with Clinical Thresholds', fontsize=16)
ax.legend(loc='upper right', fontsize=10)

plt.tight_layout()
plt.show()

---

## Summary

This notebook demonstrated the complete data pipeline:

1. **Download**: `download_nhanes_lipids()` fetches XPT files from CDC
2. **Parse**: `read_xpt()` converts SAS transport files to DataFrames
3. **Clean**: `clean_lipid_data()` merges, filters outliers, standardizes columns
4. **Report**: `generate_quality_report()` produces summary statistics
5. **Visualize**: Distributions show clinical relevance of TG and LDL ranges

### Next Steps

- Proceed to `02_equation_comparison.ipynb` to compare LDL estimation methods
- Use cleaned data for model training in Phase 3