# 01 - NHANES Data Sourcing Pipeline

This notebook walks through the complete pipeline for acquiring and processing the NHANES testosterone, SHBG, and albumin data.

## Overview

The National Health and Nutrition Examination Survey (NHANES) is a program of studies designed to assess the health and nutritional status of adults and children in the United States. We use NHANES data to train our free testosterone estimation models.

**Data Files Used:**
- **TST** - Testosterone (LBXTST in ng/dL)
- **SHBG** - Sex Hormone Binding Globulin (LBXSHBG in nmol/L)
- **BIOPRO** - Biochemistry Profile containing Albumin (LBXSAL in g/dL)

**Survey Cycles:**
- 2011-2012
- 2013-2014
- 2015-2016
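
The file names used in the code cells of this notebook (for example `TST_I.XPT` inside a `2015_2016` folder) suggest a simple naming pattern: a component prefix, a per-cycle letter suffix, and a folder named after the cycle with an underscore. A minimal sketch of that pattern follows; `xpt_filename` and `cycle_folder` are hypothetical helpers for illustration, not part of `freeT`.

```python
# Letter suffixes taken from the file names used in this notebook.
CYCLE_SUFFIX = {"2011-2012": "G", "2013-2014": "H", "2015-2016": "I"}


def xpt_filename(component: str, cycle: str) -> str:
    """Build a file name such as 'TST_I.XPT' from a component and a cycle."""
    return f"{component}_{CYCLE_SUFFIX[cycle]}.XPT"


def cycle_folder(cycle: str) -> str:
    """Turn a cycle label such as '2015-2016' into a folder name '2015_2016'."""
    return cycle.replace("-", "_")


print(xpt_filename("TST", "2015-2016"))  # TST_I.XPT
print(cycle_folder("2015-2016"))         # 2015_2016
```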

## Pipeline Steps

1. **Download** - Fetch XPT files from the CDC NHANES website
2. **Parse** - Read XPT files into pandas DataFrames
3. **Clean** - Merge, convert units, and remove outliers
4. **Report** - Generate a quality report for verification

## Setup

First, we import the necessary modules from our `freeT` package.

In [None]:
import sys
from pathlib import Path

# Add parent directory to path for imports (if running from notebooks/)
project_root = Path.cwd().parent
if str(project_root) not in sys.path:
    sys.path.insert(0, str(project_root))

# Import our data pipeline functions
from freeT.data import download_nhanes, read_xpt, clean_nhanes_data, generate_quality_report
from freeT.utils import ng_dl_to_nmol_l, nmol_l_to_ng_dl

print("Imports successful!")
print(f"Project root: {project_root}")

## Step 1: Download NHANES Data

The `download_nhanes()` function downloads XPT files from the CDC website. It:
- Creates the output directory structure automatically
- Skips files that already exist
- Handles download errors gracefully

**Note:** This step requires an internet connection and may take a few minutes depending on connection speed.
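
The three behaviours listed above (create directories, skip existing files, record failures instead of aborting) can be sketched roughly as follows. This is not the actual `download_nhanes()` implementation: `download_all_sketch`, its `urls` mapping, and the injected `fetch` callable are assumptions made for illustration, and no real CDC URLs are implied. Only the result keys `downloaded`, `skipped`, and `failed` come from this notebook.

```python
from pathlib import Path
from typing import Callable, Dict, List


def download_all_sketch(
    urls: Dict[str, str],
    output_dir: Path,
    fetch: Callable[[str], bytes],
) -> Dict[str, List[str]]:
    """Fetch each file unless it already exists; collect failures."""
    result: Dict[str, List[str]] = {"downloaded": [], "skipped": [], "failed": []}
    output_dir.mkdir(parents=True, exist_ok=True)  # create directories as needed

    for filename, url in urls.items():
        target = output_dir / filename
        if target.exists():
            result["skipped"].append(filename)  # do not download twice
            continue
        try:
            # fetch() runs before the file is opened, so a failed
            # download leaves no partial file behind.
            target.write_bytes(fetch(url))
            result["downloaded"].append(filename)
        except OSError:
            # urllib's URLError is a subclass of OSError, so network
            # errors end up here when fetch is built on urllib.
            result["failed"].append(filename)
    return result
```

With the standard library, `fetch` could be something like `lambda url: urllib.request.urlopen(url, timeout=60).read()`. Injecting it keeps the skip and failure logic testable without a network connection.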

In [None]:
# Define output directory for raw data
data_dir = project_root / "data" / "raw"

# Download NHANES data for all cycles (2011-2016)
print("Starting NHANES data download...\n")
download_result = download_nhanes(
    output_dir=str(data_dir),
    cycles=["2011-2012", "2013-2014", "2015-2016"],
    verbose=True
)

print("\nDownload complete!")
print(f"Files downloaded: {len(download_result['downloaded'])}")
print(f"Files skipped (already exist): {len(download_result['skipped'])}")
print(f"Files failed: {len(download_result['failed'])}")

## Step 2: Parse XPT Files

The `read_xpt()` function reads SAS transport format files into pandas DataFrames.
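
For readers without the `freeT` package, pandas can read SAS transport files directly through `pandas.read_sas` with `format="xport"`. The wrapper below is a minimal sketch under that assumption; `read_xpt_sketch` is a hypothetical name, and the real `read_xpt()` may do more (or something different).

```python
from pathlib import Path

import pandas as pd


def read_xpt_sketch(path) -> pd.DataFrame:
    """Read a SAS transport (XPT) file into a DataFrame."""
    path = Path(path)
    if not path.exists():
        # Fail early with a clear message instead of a parser error.
        raise FileNotFoundError(f"XPT file not found: {path}")
    return pd.read_sas(path, format="xport")
```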

Let's load data from one cycle (2015-2016) as an example.

In [None]:
# Define paths to data files for 2015-2016 cycle
cycle_dir = data_dir / "2015_2016"

tst_path = cycle_dir / "TST_I.XPT"
shbg_path = cycle_dir / "SHBG_I.XPT"
alb_path = cycle_dir / "BIOPRO_I.XPT"

# Read the XPT files
print("Reading XPT files...")
tst_df = read_xpt(tst_path)
shbg_df = read_xpt(shbg_path)
alb_df = read_xpt(alb_path)

print(f"\nTST data: {len(tst_df)} records, {len(tst_df.columns)} columns")
print(f"SHBG data: {len(shbg_df)} records, {len(shbg_df.columns)} columns")
print(f"ALB data: {len(alb_df)} records, {len(alb_df.columns)} columns")

In [None]:
# Preview the testosterone data
print("Testosterone data (TST) - Key columns:")
print(tst_df[['SEQN', 'LBXTST']].head(10))
print("\nLBXTST = Total testosterone in ng/dL")

In [None]:
# Preview the SHBG data
print("SHBG data - Key columns:")
print(shbg_df[['SEQN', 'LBXSHBG']].head(10))
print("\nLBXSHBG = SHBG in nmol/L")

In [None]:
# Preview the albumin data (from biochemistry profile)
print("Biochemistry Profile - Albumin column:")
print(alb_df[['SEQN', 'LBXSAL']].head(10))
print("\nLBXSAL = Serum albumin in g/dL")

## Step 3: Clean and Merge Data

The `clean_nhanes_data()` function:
1. **Merges** datasets on SEQN (participant ID)
2. **Converts units** to standardized SI units:
   - Testosterone: ng/dL → nmol/L (multiply by 0.0347)
   - Albumin: g/dL → g/L (multiply by 10)
   - SHBG: Already in nmol/L (no conversion)
3. **Removes outliers**, dropping any record that meets one of these conditions:
   - TT < 0.5 nmol/L (unreliably low)
   - SHBG > 250 nmol/L (abnormally high)
   - Albumin < 30 g/L (severe hypoalbuminemia)
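
The three steps above can be sketched with plain pandas. This is an illustration, not the `clean_nhanes_data()` source: the function name `clean_sketch`, the inner join, and the dropping of rows with missing values are assumptions. The conversion factors, thresholds, and output column names are the ones stated in this notebook.

```python
import pandas as pd

NG_DL_TO_NMOL_L = 0.0347  # testosterone factor quoted above
G_DL_TO_G_L = 10.0        # albumin factor quoted above


def clean_sketch(tst: pd.DataFrame, shbg: pd.DataFrame, alb: pd.DataFrame) -> pd.DataFrame:
    """Merge on SEQN, convert units, and drop rows outside the stated limits."""
    merged = (
        tst[["SEQN", "LBXTST"]]
        .merge(shbg[["SEQN", "LBXSHBG"]], on="SEQN", how="inner")  # assumption: inner join
        .merge(alb[["SEQN", "LBXSAL"]], on="SEQN", how="inner")
        .dropna()  # assumption: incomplete records are discarded
    )
    out = pd.DataFrame({
        "seqn": merged["SEQN"].astype(int),
        "tt_nmoll": merged["LBXTST"] * NG_DL_TO_NMOL_L,  # ng/dL -> nmol/L
        "shbg_nmoll": merged["LBXSHBG"],                 # already nmol/L
        "alb_gl": merged["LBXSAL"] * G_DL_TO_G_L,        # g/dL -> g/L
    })
    # Keep only rows that do NOT meet any of the removal conditions.
    keep = (
        (out["tt_nmoll"] >= 0.5)
        & (out["shbg_nmoll"] <= 250)
        & (out["alb_gl"] >= 30)
    )
    return out[keep].reset_index(drop=True)
```

As a quick check of the arithmetic, a total testosterone of 500 ng/dL becomes 500 × 0.0347 = 17.35 nmol/L, and an albumin of 4.2 g/dL becomes 42 g/L.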

In [None]:
# Clean and merge the data for 2015-2016 cycle
print("Cleaning 2015-2016 cycle data...\n")
clean_df = clean_nhanes_data(tst_df, shbg_df, alb_df, verbose=True)

In [None]:
# Preview the cleaned data
print("\nCleaned data preview:")
print(clean_df.head(10))

print("\nColumn descriptions:")
print("- seqn: Participant ID")
print("- tt_nmoll: Total testosterone (nmol/L)")
print("- shbg_nmoll: SHBG (nmol/L)")
print("- alb_gl: Albumin (g/L)")

In [None]:
# Basic statistics of the cleaned data
print("Descriptive Statistics:")
print(clean_df[['tt_nmoll', 'shbg_nmoll', 'alb_gl']].describe())

### Combining Multiple Cycles

For training, we typically want to combine data from all available cycles to maximize sample size.

In [None]:
import pandas as pd

# Define file mappings for each cycle
cycles = {
    "2011-2012": {"TST": "TST_G.XPT", "SHBG": "SHBG_G.XPT", "ALB": "BIOPRO_G.XPT"},
    "2013-2014": {"TST": "TST_H.XPT", "SHBG": "SHBG_H.XPT", "ALB": "BIOPRO_H.XPT"},
    "2015-2016": {"TST": "TST_I.XPT", "SHBG": "SHBG_I.XPT", "ALB": "BIOPRO_I.XPT"},
}

all_clean_dfs = []

for cycle, files in cycles.items():
    cycle_folder = cycle.replace("-", "_")
    cycle_path = data_dir / cycle_folder
    
    try:
        print(f"\n{'='*50}")
        print(f"Processing cycle: {cycle}")
        print(f"{'='*50}")
        
        # Read files
        tst = read_xpt(cycle_path / files["TST"])
        shbg = read_xpt(cycle_path / files["SHBG"])
        alb = read_xpt(cycle_path / files["ALB"])
        
        # Clean and merge
        clean = clean_nhanes_data(tst, shbg, alb, verbose=True)
        clean['cycle'] = cycle  # Add cycle identifier
        
        all_clean_dfs.append(clean)
        
    except FileNotFoundError as e:
        print(f"  [SKIP] Cycle {cycle}: {e}")

# Combine all cycles (combined_df stays None if nothing was loaded,
# so a stale result from an earlier run is never reused)
combined_df = None
if all_clean_dfs:
    combined_df = pd.concat(all_clean_dfs, ignore_index=True)
    print(f"\n{'='*50}")
    print(f"COMBINED DATASET: {len(combined_df)} total records")
    print(f"{'='*50}")
else:
    print("No data available - please run download step first.")

In [None]:
# Summary by cycle
if 'combined_df' in dir() and combined_df is not None:
    print("Records per cycle:")
    print(combined_df['cycle'].value_counts().sort_index())
    
    print("\nOverall statistics:")
    print(combined_df[['tt_nmoll', 'shbg_nmoll', 'alb_gl']].describe())

## Step 4: Generate Quality Report

The `generate_quality_report()` function creates a comprehensive quality report including:
- Total record count
- Mean and standard deviation for each variable
- Min/max values
- Missing value counts
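
The listed contents map naturally onto a few pandas aggregations. The sketch below shows one way such a summary could be computed; `quality_summary_sketch` is a hypothetical name, and the `min`, `max`, and `missing` keys and the use of the sample standard deviation are assumptions. Only `record_count`, `statistics`, `mean`, and `sd` appear in this notebook's use of the report.

```python
import pandas as pd


def quality_summary_sketch(df: pd.DataFrame, columns=("tt_nmoll", "shbg_nmoll", "alb_gl")) -> dict:
    """Summarise record count and per-variable statistics."""
    stats = {}
    for col in columns:
        if col not in df.columns:
            continue  # tolerate partial inputs
        series = df[col]
        stats[col] = {
            "mean": float(series.mean()),
            "sd": float(series.std()),  # pandas default: sample SD (ddof=1)
            "min": float(series.min()),
            "max": float(series.max()),
            "missing": int(series.isna().sum()),
        }
    return {"record_count": len(df), "statistics": stats}
```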

In [None]:
# Generate quality report
reports_dir = project_root / "reports"
reports_dir.mkdir(parents=True, exist_ok=True)  # harmless if it already exists
report_path = reports_dir / "data_quality_report.txt"

if 'combined_df' in dir() and combined_df is not None:
    # Use combined data for report (excluding the 'cycle' column for analysis)
    analysis_df = combined_df.drop(columns=['cycle'])
    report = generate_quality_report(analysis_df, str(report_path))
    
    print(f"Quality report saved to: {report_path}")
    print("\n" + "="*60)
    print("REPORT SUMMARY")
    print("="*60)
    print(f"Total records: {report['record_count']}")
    print("\nStatistics:")
    for col, stats in report['statistics'].items():
        print(f"  {col}: mean={stats['mean']:.2f}, SD={stats['sd']:.2f}")
else:
    print("No data available for report - please run previous steps first.")

In [None]:
# View the full report file
if report_path.exists():
    print("Full report contents:")
    print("\n" + report_path.read_text())

## Save Processed Data

Finally, we save the cleaned and combined dataset for use in subsequent notebooks.

In [None]:
# Save processed data
processed_dir = project_root / "data" / "processed"
processed_dir.mkdir(parents=True, exist_ok=True)

if 'combined_df' in dir() and combined_df is not None:
    output_path = processed_dir / "nhanes_combined.csv"
    combined_df.to_csv(output_path, index=False)
    print(f"Processed data saved to: {output_path}")
    print(f"Total records: {len(combined_df)}")
else:
    print("No data to save - please run previous steps first.")

## Summary

This notebook demonstrated the complete NHANES data pipeline:

1. **Download** - `download_nhanes()` fetches XPT files from the CDC
2. **Parse** - `read_xpt()` reads SAS transport format files
3. **Clean** - `clean_nhanes_data()` merges, converts units, and removes outliers
4. **Report** - `generate_quality_report()` creates a verification report

### Next Steps

- **Notebook 02** - Solver Comparison: Compare Vermeulen, Södergård, and Zakharov methods
- **Notebook 03** - Model Training: Train ML models on this data
- **Notebook 04** - Evaluation: Validate model performance

### Key Variables

| Variable | Unit | Description |
|----------|------|-------------|
| tt_nmoll | nmol/L | Total testosterone |
| shbg_nmoll | nmol/L | Sex Hormone Binding Globulin |
| alb_gl | g/L | Serum albumin |