# PopHealth Observatory
## Population health & nutrition analytics using NHANES survey microdata

This notebook demonstrates how to use the NHANESExplorer (now part of the PopHealth Observatory package) to download, process, and analyze data from the National Health and Nutrition Examination Survey (NHANES). The observatory provides tools to analyze health metrics across demographic groups and survey cycles.

## 1. Import Required Libraries

In [1]:
import pandas as pd
import numpy as np
import requests
import io
import matplotlib.pyplot as plt
import seaborn as sns
import plotly.express as px
import plotly.graph_objects as go
from typing import Dict, List, Optional, Tuple
import ipywidgets as widgets
from ipywidgets import interact, fixed
import warnings
warnings.filterwarnings('ignore')

# Set plot styling
plt.style.use('seaborn-v0_8-whitegrid')
sns.set_palette("colorblind")

## 2. Observatory Class Import

We'll import the NHANESExplorer class from the PopHealth Observatory package (`pophealth_observatory`).

In [5]:
# Import the NHANESExplorer class from the PopHealth Observatory package
from pophealth_observatory import NHANESExplorer

# Initialize the explorer
explorer = NHANESExplorer()

# Display the available survey cycles and components
print("Available NHANES survey cycles:")
print(explorer.available_cycles)
print("\nAvailable data components:")
for name, code in explorer.components.items():
    print(f"  - {name}: {code}")

Available NHANES survey cycles:
['2017-2018', '2015-2016', '2013-2014', '2011-2012', '2009-2010']

Available data components:
  - demographics: DEMO
  - body_measures: BMX
  - blood_pressure: BPX
  - cholesterol: TCHOL
  - diabetes: GLU
  - dietary: DR1TOT
  - physical_activity: PAQ
  - smoking: SMQ
  - alcohol: ALQ


## 3. Data Acquisition

Let's download some key NHANES data components for 2017-2018, the most recent survey cycle the explorer lists.

In [None]:
# Choose a valid cycle (ensure it exists in explorer.cycle_suffix_map)
cycle_debug = '2017-2018'  # or any other entry from explorer.available_cycles
component_code = explorer.components['demographics']

# 1. Derive letter suffix & candidate URLs (mirrors updated logic)
letter = explorer.cycle_suffix_map.get(cycle_debug, 'UNKNOWN') if hasattr(explorer, 'cycle_suffix_map') else 'NA'
print('Cycle:', cycle_debug, '| Letter suffix:', letter)

base_url = getattr(explorer, 'base_url', 'N/A')
alt_base = getattr(explorer, 'alt_base_url', 'N/A')
primary = f"{base_url}/{cycle_debug}/{component_code}_{letter}.XPT"
alt1 = f"{alt_base}/{cycle_debug.split('-')[0]}/DataFiles/{component_code}_{letter}.xpt"
alt2 = f"{alt_base}/{cycle_debug.split('-')[0]}/DataFiles/{component_code}_{letter}.XPT"
candidate_urls = [primary, alt1, alt2]
print('\nCandidate URLs (in order):')
for u in candidate_urls:
    print('  ', u)

# 2. Try fetching each URL, report HTTP status (requests and io are imported in section 1)
raw_bytes = None
for idx, u in enumerate(candidate_urls, 1):
    try:
        resp = requests.get(u, timeout=20)
        print(f"Attempt {idx}: {u} -> status {resp.status_code}")
        if resp.status_code == 200:
            raw_bytes = resp.content
            print('  -> Success, size:', len(raw_bytes), 'bytes')
            break
    except Exception as e:
        print(f"  -> Error: {e}")

if raw_bytes is None:
    print('\nNo successful download; stopping here.')
else:
    # 3. Read SAS transport file
    try:
        demo_raw = pd.read_sas(io.BytesIO(raw_bytes), format='xport')
        print('\nRaw demographics shape:', demo_raw.shape)
        print('Columns sample:', list(demo_raw.columns[:20]))
    except Exception as e:
        print('Failed to parse XPT:', e)
        demo_raw = pd.DataFrame()

    # 4. Apply variable selection + renaming (same mapping as in method)
    demo_vars = {
        'SEQN': 'participant_id',
        'RIAGENDR': 'gender',
        'RIDAGEYR': 'age_years',
        'RIDRETH3': 'race_ethnicity',
        'DMDEDUC2': 'education',
        'INDFMPIR': 'poverty_ratio',
        'WTMEC2YR': 'exam_weight',
    }
    available = [c for c in demo_vars if c in demo_raw.columns]
    print('\nVariables present from mapping:', available)
    demo_clean = demo_raw[available].copy() if available else pd.DataFrame()
    if not demo_clean.empty:
        demo_clean = demo_clean.rename(columns={k:v for k,v in demo_vars.items() if k in available})

        # 5. Add decoded labels
        if 'gender' in demo_clean.columns:
            demo_clean['gender_label'] = demo_clean['gender'].map({1:'Male',2:'Female'})
        if 'race_ethnicity' in demo_clean.columns:
            race_labels = {1:'Mexican American',2:'Other Hispanic',3:'Non-Hispanic White',4:'Non-Hispanic Black',6:'Non-Hispanic Asian',7:'Other/Multi-racial'}
            demo_clean['race_ethnicity_label'] = demo_clean['race_ethnicity'].map(race_labels)

        print('\nCleaned demographics shape:', demo_clean.shape)
        display(demo_clean.head())
    else:
        print('No mapped variables found in downloaded file.')

### Debug: Deconstruct `get_demographics_data()`
The cell above manually reproduces each internal step of `get_demographics_data` so you can inspect URL construction, HTTP responses, raw columns, and recoding. If you are getting empty data, the cycle string most likely does not match a known mapping: it has to be one of the entries in `explorer.available_cycles` (for example `2017-2018`; `2021-2023` is not listed).
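
As a sketch of the cycle check this debugging step relies on: NHANES data files for these cycles carry a letter suffix in the file name (for example `DEMO_J` for 2017-2018). The `CYCLE_SUFFIX` table and the `xpt_filename` helper below are hypothetical names used for illustration; they are not taken from the package, whose `cycle_suffix_map` may hold more entries or behave differently.

```python
# Hypothetical helper: validate a cycle string and derive the XPT file name.
# The suffix letters follow the CDC file-naming convention for these cycles.
CYCLE_SUFFIX = {
    "2009-2010": "F",
    "2011-2012": "G",
    "2013-2014": "H",
    "2015-2016": "I",
    "2017-2018": "J",
}


def xpt_filename(component_code: str, cycle: str) -> str:
    """Return e.g. 'DEMO_J.XPT', or raise ValueError if the cycle is unknown."""
    try:
        suffix = CYCLE_SUFFIX[cycle]
    except KeyError:
        known = ", ".join(sorted(CYCLE_SUFFIX))
        raise ValueError(f"Unknown cycle {cycle!r}; expected one of: {known}") from None
    return f"{component_code}_{suffix}.XPT"


print(xpt_filename("DEMO", "2017-2018"))  # DEMO_J.XPT
```

Failing loudly on an unknown cycle makes a mistyped string visible immediately, instead of surfacing later as an empty DataFrame.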

In [16]:
# Set the survey cycle.
# NOTE: '2021-2023' is not in explorer.available_cycles, so
# explorer.get_demographics_data(cycle) returns an empty DataFrame.
# Use one of the listed cycles (for example cycle = '2017-2018') to get data.
cycle = '2021-2023'

# Download demographics data
print("Available cycles:", explorer.available_cycles)
demo_df = explorer.get_demographics_data(cycle)

# Display the first few rows
print(f"Demographics data shape: {demo_df.shape}")
demo_df.head()

Available cycles: ['2017-2018', '2015-2016', '2013-2014', '2011-2012', '2009-2010']
Demographics data shape: (0, 0)


In [15]:
# Download body measurements data
body_df = explorer.get_body_measures(cycle)

# Display the first few rows
print(f"Body measurements data shape: {body_df.shape}")
body_df.head()

Body measurements data shape: (0, 0)


In [8]:
# Download blood pressure data
bp_df = explorer.get_blood_pressure(cycle)

# Display the first few rows
print(f"Blood pressure data shape: {bp_df.shape}")
bp_df.head()

Blood pressure data shape: (0, 0)
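
Later cells use `avg_systolic` and `avg_diastolic` columns. A minimal sketch of how such averages could be derived, assuming the raw file holds repeated readings in columns named `BPXSY1`..`BPXSY3` and `BPXDI1`..`BPXDI3` (the naming used in the 2017-2018 `BPX_J` file). How `get_blood_pressure` actually computes them is not shown in this notebook, and the toy values are made up.

```python
import numpy as np
import pandas as pd

# Toy readings; NaN marks a reading that was not taken.
raw_bp = pd.DataFrame({
    "SEQN": [1, 2, 3],
    "BPXSY1": [120.0, 130.0, np.nan],
    "BPXSY2": [122.0, np.nan, np.nan],
    "BPXSY3": [124.0, 134.0, np.nan],
    "BPXDI1": [80.0, 84.0, np.nan],
    "BPXDI2": [78.0, np.nan, np.nan],
    "BPXDI3": [82.0, 86.0, np.nan],
})

sys_cols = [c for c in raw_bp.columns if c.startswith("BPXSY")]
dia_cols = [c for c in raw_bp.columns if c.startswith("BPXDI")]

# Row-wise mean over the readings that exist; rows with no readings stay NaN.
raw_bp["avg_systolic"] = raw_bp[sys_cols].mean(axis=1)
raw_bp["avg_diastolic"] = raw_bp[dia_cols].mean(axis=1)
print(raw_bp[["SEQN", "avg_systolic", "avg_diastolic"]])
```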


## 4. Data Processing

Now let's create a merged dataset that combines demographics, body measurements, and blood pressure data.
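
A minimal sketch of the kind of join this step implies, assuming each component table carries a `participant_id` column (the renamed `SEQN`) and that demographics is the base table. The toy frames and the left-join choice are assumptions for illustration; `create_merged_dataset` may be implemented differently.

```python
from functools import reduce

import pandas as pd

demo = pd.DataFrame({"participant_id": [1, 2, 3], "age_years": [34, 51, 8]})
body = pd.DataFrame({"participant_id": [1, 2], "bmi": [27.1, 31.4]})
bp = pd.DataFrame({"participant_id": [2, 3], "avg_systolic": [128.0, 101.0]})

# Left-join onto demographics so every surveyed participant is kept,
# with NaN where a component has no record for that participant.
merged = reduce(
    lambda left, right: left.merge(right, on="participant_id", how="left"),
    [demo, body, bp],
)
print(merged)
```

A left join keeps participants who skipped an exam component, which matters when later cells drop missing values per metric rather than per row.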

In [9]:
# Create a merged dataset with demographics, body measurements, and blood pressure
merged_df = explorer.create_merged_dataset(cycle)

# Display column names and data types
print(f"Merged dataset shape: {merged_df.shape}")
merged_df.dtypes

Creating merged dataset for 2021-2023...
Merged dataset created with 0 participants and 0 variables
Merged dataset shape: (0, 0)


Series([], dtype: object)

In [10]:
# Display the first few rows of the merged dataset
merged_df.head()

## 5. Data Analysis

Let's perform some basic analyses on the merged dataset.

In [None]:
# Generate a summary report
summary_report = explorer.generate_summary_report(merged_df)
print(summary_report)

In [None]:
# Analyze BMI by race/ethnicity
if 'bmi' in merged_df.columns and 'race_ethnicity_label' in merged_df.columns:
    bmi_by_race = explorer.analyze_by_demographics(merged_df, 'bmi', 'race_ethnicity_label')
    print("BMI Statistics by Race/Ethnicity:")
    display(bmi_by_race)

In [None]:
# Analyze blood pressure by gender
if 'avg_systolic' in merged_df.columns and 'gender_label' in merged_df.columns:
    bp_by_gender = explorer.analyze_by_demographics(merged_df, 'avg_systolic', 'gender_label')
    print("Systolic Blood Pressure Statistics by Gender:")
    display(bp_by_gender)
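
NHANES uses a complex sample design, so group means are usually weighted by the examination weight when they are meant as population estimates. Whether `analyze_by_demographics` does this is not shown here. The sketch below assumes the merged data keeps the `exam_weight` column (`WTMEC2YR`); the `weighted_mean_by` helper is hypothetical, and it does not estimate design-based standard errors.

```python
import numpy as np
import pandas as pd


def weighted_mean_by(df, metric, group, weight="exam_weight"):
    """Weighted mean of `metric` per `group`, ignoring rows with missing values."""
    rows = df[[group, metric, weight]].dropna()
    out = {}
    for key, g in rows.groupby(group):
        out[key] = np.average(g[metric], weights=g[weight])
    return pd.Series(out, name=f"weighted_mean_{metric}")


# Toy data with made-up values and weights.
toy = pd.DataFrame({
    "gender_label": ["Male", "Male", "Female", "Female"],
    "bmi": [30.0, 20.0, 28.0, 24.0],
    "exam_weight": [1.0, 3.0, 2.0, 2.0],
})
print(weighted_mean_by(toy, "bmi", "gender_label"))
```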

## 6. Visualizations

Let's create some visualizations to explore the data.

In [None]:
# Visualize BMI by race/ethnicity
if 'bmi' in merged_df.columns and 'race_ethnicity_label' in merged_df.columns:
    explorer.create_demographic_visualization(merged_df, 'bmi', 'race_ethnicity_label')

In [None]:
# Visualize blood pressure by gender
if 'avg_systolic' in merged_df.columns and 'gender_label' in merged_df.columns:
    explorer.create_demographic_visualization(merged_df, 'avg_systolic', 'gender_label')

In [None]:
# Visualize BMI distribution (guarded like the cells above, so an empty dataset is skipped)
if 'bmi' in merged_df.columns:
    plt.figure(figsize=(10, 6))
    sns.histplot(merged_df['bmi'].dropna(), bins=30, kde=True)
    plt.axvline(x=18.5, color='r', linestyle='--', label='Underweight/Normal')
    plt.axvline(x=25, color='y', linestyle='--', label='Normal/Overweight')
    plt.axvline(x=30, color='g', linestyle='--', label='Overweight/Obese')
    plt.title(f'BMI Distribution in NHANES {cycle}')
    plt.xlabel('BMI')
    plt.ylabel('Count')
    plt.legend()
    plt.show()
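
The dashed lines mark the BMI cut-offs 18.5, 25, and 30. A sketch of how a `bmi_category` column could be derived from them with `pandas.cut`; the label strings are assumptions and may not match the ones the package produces.

```python
import numpy as np
import pandas as pd

bmi = pd.Series([17.0, 18.5, 24.9, 25.0, 30.0, np.nan])

# right=False makes each bin closed on the left, e.g. [18.5, 25) is "Normal".
bmi_category = pd.cut(
    bmi,
    bins=[0, 18.5, 25, 30, np.inf],
    labels=["Underweight", "Normal", "Overweight", "Obese"],
    right=False,
)
print(bmi_category.tolist())
```

Missing BMI values stay missing in the category column, so they drop out of cross-tabulations instead of being counted in a bin.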

In [None]:
# Visualize BMI categories by gender
if 'bmi_category' in merged_df.columns and 'gender_label' in merged_df.columns:
    # Create a cross-tabulation
    bmi_gender_crosstab = pd.crosstab(
        merged_df['gender_label'], 
        merged_df['bmi_category'], 
        normalize='index'
    ) * 100
    
    # Plot (let pandas create the figure so no empty extra figure is left behind)
    ax = bmi_gender_crosstab.plot(kind='bar', stacked=True, colormap='viridis', figsize=(12, 6))
    ax.set_title('BMI Categories by Gender')
    ax.set_xlabel('Gender')
    ax.set_ylabel('Percentage')
    ax.legend(title='BMI Category')
    ax.tick_params(axis='x', rotation=0)
    # Label each stacked segment at its vertical midpoint
    for x_pos, (_, row) in enumerate(bmi_gender_crosstab.iterrows()):
        bottom = 0.0
        for v in row:
            if v > 0:
                ax.text(x_pos, bottom + v / 2, f"{v:.1f}%", ha='center', va='center',
                        color='white', fontweight='bold')
            bottom += v
    plt.show()

## 7. Interactive Dashboard

Let's create a simple interactive dashboard to explore the data.

In [None]:
# Define a function to create an interactive visualization
def interactive_analysis(metric, demographic, df=merged_df):
    if metric not in df.columns or demographic not in df.columns:
        print(f"Column {metric} or {demographic} not found in dataset")
        return
    
    # Remove missing values
    plot_df = df[[demographic, metric]].dropna()
    
    # Create figure with subplots
    fig, axes = plt.subplots(1, 2, figsize=(15, 6))
    
    # Box plot
    sns.boxplot(data=plot_df, x=demographic, y=metric, ax=axes[0])
    axes[0].set_title(f'{metric} by {demographic}')
    axes[0].tick_params(axis='x', rotation=45)
    
    # Bar plot of means
    means = plot_df.groupby(demographic)[metric].mean().sort_values(ascending=False)
    means.plot(kind='bar', ax=axes[1], color='skyblue')
    axes[1].set_title(f'Mean {metric} by {demographic}')
    axes[1].tick_params(axis='x', rotation=45)
    axes[1].set_ylabel(f'Mean {metric}')
    
    # Add value labels to the bar plot
    for i, v in enumerate(means):
        axes[1].text(i, v + 0.1, f"{v:.1f}", ha='center')
    
    plt.tight_layout()
    plt.show()
    
    # Display summary statistics
    stats = plot_df.groupby(demographic)[metric].agg([
        'count', 'mean', 'median', 'std', 'min', 'max'
    ]).round(2)
    stats.columns = ['Count', 'Mean', 'Median', 'Std Dev', 'Min', 'Max']
    return stats

# Create dropdown menus for metrics and demographics
numeric_metrics = ['age_years', 'bmi', 'weight_kg', 'height_cm', 'waist_cm', 'avg_systolic', 'avg_diastolic']
available_metrics = [m for m in numeric_metrics if m in merged_df.columns]

demographics = ['gender_label', 'race_ethnicity_label']
available_demographics = [d for d in demographics if d in merged_df.columns]

# Create interactive widget
interact(
    interactive_analysis,
    metric=widgets.Dropdown(options=available_metrics, description='Metric:'),
    demographic=widgets.Dropdown(options=available_demographics, description='Demographic:'),
    df=fixed(merged_df)
);

## 8. Example Analyses

Let's perform some more specific analyses on the data.

In [None]:
# Analyze relationship between BMI and blood pressure
if all(col in merged_df.columns for col in ['bmi', 'avg_systolic', 'avg_diastolic']):
    # Create scatter plot with regression line
    plt.figure(figsize=(10, 6))
    sns.regplot(data=merged_df, x='bmi', y='avg_systolic', scatter_kws={'alpha':0.3}, line_kws={'color':'red'})
    plt.title('Relationship Between BMI and Systolic Blood Pressure')
    plt.xlabel('BMI')
    plt.ylabel('Systolic Blood Pressure (mmHg)')
    
    # Calculate and display correlation coefficient
    correlation = merged_df[['bmi', 'avg_systolic']].corr().iloc[0, 1]
    plt.text(40, merged_df['avg_systolic'].min() + 5, f"Correlation: {correlation:.3f}", fontsize=12)
    plt.show()
    
    # Calculate summary statistics by BMI category
    if 'bmi_category' in merged_df.columns:
        bp_by_bmi_category = explorer.analyze_by_demographics(merged_df, 'avg_systolic', 'bmi_category')
        print("\nBlood Pressure Statistics by BMI Category:")
        display(bp_by_bmi_category)

In [None]:
# Analyze age distribution by gender and BMI category
if all(col in merged_df.columns for col in ['age_years', 'gender_label', 'bmi_category']):
    plt.figure(figsize=(14, 8))
    sns.violinplot(data=merged_df, x='bmi_category', y='age_years', hue='gender_label', split=True)
    plt.title('Age Distribution by BMI Category and Gender')
    plt.xlabel('BMI Category')
    plt.ylabel('Age (years)')
    plt.legend(title='Gender')
    plt.show()

## 9. Multi-cycle Analysis

Let's compare some metrics across multiple NHANES cycles.

In [None]:
# Function to get mean BMI across multiple cycles
def get_mean_bmi_by_cycle(cycles):
    results = []
    for cycle in cycles:
        print(f"Processing cycle {cycle}...")
        # Get body measurements data for this cycle
        body_df = explorer.get_body_measures(cycle)
        if not body_df.empty and 'bmi' in body_df.columns:
            mean_bmi = body_df['bmi'].mean()
            results.append({'cycle': cycle, 'mean_bmi': mean_bmi})
    return pd.DataFrame(results)

# Get BMI trends across the 3 most recent cycles.
# available_cycles is listed newest first, so sort to plot oldest to newest.
cycles_to_analyze = sorted(explorer.available_cycles[:3])
bmi_trends = get_mean_bmi_by_cycle(cycles_to_analyze)

# Plot BMI trends
if not bmi_trends.empty:
    plt.figure(figsize=(10, 6))
    plt.plot(bmi_trends['cycle'], bmi_trends['mean_bmi'], marker='o', linestyle='-', linewidth=2)
    plt.title('Mean BMI Across NHANES Cycles')
    plt.xlabel('Survey Cycle')
    plt.ylabel('Mean BMI')
    plt.grid(True, alpha=0.3)
    plt.xticks(rotation=45)
    
    # Add value labels
    for i, row in bmi_trends.iterrows():
        plt.text(i, row['mean_bmi'] + 0.05, f"{row['mean_bmi']:.2f}", ha='center')
    
    # Set y-axis to start from a reasonable value for better visualization
    plt.ylim(bottom=bmi_trends['mean_bmi'].min() - 0.5, top=bmi_trends['mean_bmi'].max() + 0.5)
    
    plt.tight_layout()
    plt.show()
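
An alternative to collecting one number per cycle is to stack the per-cycle tables with a `cycle` column and aggregate once, which yields several statistics in a single pass. The sketch uses toy frames in place of `explorer.get_body_measures(cycle)` results; the values are made up for illustration.

```python
import pandas as pd

# Toy stand-ins for per-cycle body-measure tables (values are illustrative only).
per_cycle = {
    "2013-2014": pd.DataFrame({"bmi": [24.0, 28.0]}),
    "2015-2016": pd.DataFrame({"bmi": [25.0, 29.0]}),
    "2017-2018": pd.DataFrame({"bmi": [26.0, 30.0, None]}),
}

# Stack the tables and label each row with its cycle.
stacked = pd.concat(
    [df.assign(cycle=cycle) for cycle, df in per_cycle.items()],
    ignore_index=True,
)

# One groupby gives count and mean per cycle; missing BMI values are skipped.
trend = stacked.groupby("cycle")["bmi"].agg(["count", "mean"]).sort_index()
print(trend)
```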

## 10. Geographical Analysis

Note: NHANES doesn't provide detailed geographic data below the national level in public datasets to protect participant confidentiality. However, we can demonstrate how you might analyze such data if it were available.

In [None]:
# Simulated geographic analysis using plotly
# (Note: This uses simulated data since NHANES doesn't provide detailed geographic information)

# Try to load a state lookup table and map simulated obesity rates by state
try:
    # State lookup table with names, abbreviations, and FIPS codes
    state_fips = pd.read_csv('https://raw.githubusercontent.com/kjhealy/fips-codes/master/state_fips_master.csv')
    
    # Generate random obesity rates (simulated data)
    np.random.seed(42)  # For reproducibility
    state_fips['obesity_rate'] = np.random.normal(loc=30, scale=5, size=len(state_fips))
    state_fips['obesity_rate'] = state_fips['obesity_rate'].clip(lower=20, upper=40).round(1)
    
    # Create choropleth map keyed on two-letter state abbreviations, which plotly
    # resolves itself with locationmode='USA-states' (a county-level GeoJSON would
    # not match state codes).
    # Assumption: the lookup table stores the abbreviations in a 'state_abbr' column.
    fig = px.choropleth(
        state_fips,
        locations='state_abbr',
        locationmode='USA-states',
        color='obesity_rate',
        color_continuous_scale='YlOrRd',
        range_color=(20, 40),
        scope="usa",
        labels={'obesity_rate':'Obesity Rate (%)'},
        title="Simulated Obesity Rates by State (For Demonstration Only)"
    )
    fig.update_layout(margin={"r":0,"t":30,"l":0,"b":0}, height=600)
    fig.show()
    
    print("NOTE: The map above uses simulated data for demonstration purposes only. ")
    print("NHANES does not provide public state-level estimates due to confidentiality constraints.")
    print("For actual state-level estimates, consider using BRFSS data from the CDC.")
    
except Exception as e:
    print(f"Could not create geographic visualization: {str(e)}")
    print("Note: This requires an internet connection to fetch the state lookup table.")

## 11. Conclusion

In this notebook, we've demonstrated how to use the NHANESExplorer class to download, process, and analyze NHANES data. We've explored various health metrics across demographic groups and created visualizations to better understand the data.

The NHANESExplorer provides a convenient way to work with NHANES data and can be extended to support additional analyses and visualizations. For real geographic analyses, consider using complementary datasets like BRFSS (Behavioral Risk Factor Surveillance System) which provides state-level estimates.

## Next Steps

1. Explore additional NHANES components like dietary intake or physical activity
2. Develop more sophisticated statistical analyses
3. Create custom visualizations for specific research questions
4. Implement machine learning models to predict health outcomes based on NHANES variables