# 01_load_adam_xpt_files
## Purpose: Load data & make some simple plots

This notebook shows how to:
1) Load ADaM `.xpt` files
2) Briefly describe key datasets
3) Make a couple of simple, sanity-check plots

## Provenance 
Sample data: PHUSE CDISC pilot (`cdiscpilot01`)  
  - ADaM folder: https://github.com/phuse-org/phuse-scripts/tree/master/data/adam/cdiscpilot01  
  - Or the updated 2018 version: https://github.com/phuse-org/phuse-scripts/blob/master/data/adam/cdiscpilot_update1.zip
  - I'm using the updated folder.

Download Options
  - Download single folder (macOS): `brew install svn` then  
  `svn export https://github.com/phuse-org/phuse-scripts/trunk/data/adam/cdiscpilot01`  
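  - Download single folder (Windows via TortoiseSVN): “SVN Checkout…” the same URL above into your target directory.
  - GitHub ended Subversion support in January 2024, so the two `svn` routes above may no longer work; cloning the repository with `git` or downloading it as a ZIP from the GitHub page are alternatives.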
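
After downloading, a quick check that the expected files are actually present can save a confusing error later. The sketch below uses only the standard library; the helper name `missing_files` and the `EXPECTED` list are illustrative assumptions, not part of the PHUSE package.

```python
from pathlib import Path

# Hypothetical list of files this notebook expects; adjust to your download.
EXPECTED = ["adsl.xpt", "adae.xpt", "advs.xpt", "adtte.xpt",
            "adlbc.xpt", "adlbh.xpt", "define.xml"]

def missing_files(data_dir, expected=EXPECTED):
    """Return the expected file names that are not present in data_dir."""
    data_dir = Path(data_dir)
    return [name for name in expected if not (data_dir / name).is_file()]

# "../data/raw" mirrors the DATA_DIR used later in this notebook.
gaps = missing_files("../data/raw")
print("All expected files found." if not gaps else f"Missing: {gaps}")
```

An empty list means every expected file was found; anything else names the files to fetch again.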


## Datasets in this demo (high-level)

| File          | What it is (typical content) |
|---|---|
| `adsl.xpt`    | Subject-level analysis dataset (one row per subject; demographics, treatment assignments, flags). |
| `adae.xpt`    | Adverse events analysis dataset (analysis-ready AE terms, timing, severity/seriousness flags, relationships). |
| `advs.xpt`    | Vital signs (analysis-ready VS measures like BP, pulse, temp; PARAM/PARAMCD, visit/time variables). |
| `adtte.xpt`   | Time-to-event analysis dataset (start/stop, event/censor flags, analysis time). |
| `adlbc.xpt`   | Clinical chemistry labs (ALT, AST, ALP, BILI, etc.; analysis variables, baseline/shift flags). |
| `adlbh.xpt`   | Hematology labs (HGB, HCT, PLT, etc.). |
| `adlbhy.xpt`  | Hy’s Law lab derivations (bilirubin/ALT/AST combinations). |
| `adlbcpv.xpt` | Clinical chemistry labs, **previous visit** version ("PV" = previous visit). |
| `adlbhpv.xpt` | Hematology labs, **previous visit** version. |
| `adadas.xpt`  | ADAS-Cog questionnaire analysis dataset. |
| `adcibc.xpt`  | CIBIC (Clinician’s Interview-Based Impression of Change) analysis dataset. |
| `adnpix.xpt`  | NPI-X (Neuropsychiatric Inventory) analysis dataset. |
| `define.xml`  | Define-XML metadata describing structures, variables, derivations, and value-level metadata. |
| `define2-0-0.xsl` | Stylesheet to render Define-XML in a browser. |

Descriptions reflect the Define-XML for this package.
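
The "one row per subject" property listed for ADSL is easy to verify once a dataset is loaded. A minimal sketch on toy frames follows; the values are invented for illustration, only the `USUBJID` column name comes from the data, and `is_one_row_per_subject` is a hypothetical helper.

```python
import pandas as pd

def is_one_row_per_subject(df, key="USUBJID"):
    """True when every value of `key` appears exactly once and none is missing."""
    return bool(df[key].notna().all() and df[key].is_unique)

# Toy stand-ins: ADSL-like (one row per subject) and ADAE-like (repeated subjects).
adsl_like = pd.DataFrame({"USUBJID": ["S1", "S2", "S3"]})
adae_like = pd.DataFrame({"USUBJID": ["S1", "S1", "S3"]})

print(is_one_row_per_subject(adsl_like))  # True
print(is_one_row_per_subject(adae_like))  # False
```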


# Imports

In [None]:
# Minimal imports; pyreadstat is fast and preserves labels
import os
from pathlib import Path
import numpy as np
import pandas as pd
pd.set_option('display.max_columns', 50)
pd.set_option('display.max_rows', 25)

import matplotlib.pyplot as plt
import matplotlib.gridspec as gridspec
import dataframe_image as dfi

import pyreadstat

import xml.etree.ElementTree as ET


In [None]:
DATA_DIR = Path("../data/raw/").resolve()

def read_xpt(path: Path) -> pd.DataFrame:
    """
    Read a SAS XPT (transport) file into a pandas DataFrame with pyreadstat.
    Returns a DataFrame; value/variable labels are preserved in metadata if needed.
    """
    df, meta = pyreadstat.read_xport(str(path))
    df.attrs["meta"] = meta
    return df


# Load & Peek at Data

In [None]:
# Choose a few commonly used domains for quick checks
files_to_load = {
    "ADSL": DATA_DIR / "adsl.xpt",
    "ADAE": DATA_DIR / "adae.xpt",
    "ADVS": DATA_DIR / "advs.xpt",
    "ADTTE": DATA_DIR / "adtte.xpt",
    "ADLBH": DATA_DIR / "adlbh.xpt",
    "ADLBC": DATA_DIR / "adlbc.xpt"   
}

# What each file stands for
domain_descriptions = {
    "ADSL": "Subject-Level Analysis Dataset",
    "ADAE": "Adverse Events Analysis Dataset",
    "ADVS": "Vital Signs Analysis Dataset",
    "ADTTE": "Time-to-Event Analysis Dataset",
    "ADLBH": "Hematology Labs" ,
    "ADLBC": "Clinical Chemistry Labs"
}

loaded = {}
for name, path in files_to_load.items():
    if path.exists():
        desc = domain_descriptions.get(name, "")
        header = f"{name} — {desc}" if desc else name
        print("=" * len(header))
        print(header)
        print("=" * len(header))
        print(f"File: {path.name}\n")
        loaded[name] = read_xpt(path)
        # peek(loaded[name], 3)

        print(f"Shape: { loaded[name].shape }")
        display(loaded[name].head(3))

        print()  # spacer
        
    else:
        print(f"WARNING: {path.name} not found in {DATA_DIR}")

"""
Save a few to display in markdown.
"""

save_index = -1
for preview_df in [ "ADLBH" , "ADLBC" ]:

    preview = loaded[preview_df][['STUDYID','USUBJID','TRTP','AVISIT','VISIT','PARAM','AVAL','BASE','CHG']].head(5)

    # Small, clean table for GitHub; hide the index to keep it tight
    styled = (
        preview.style
        .set_caption(f"{preview_df} Sample Rows")
        .hide(axis="index")
        # Optional: make text a bit smaller and wrap if needed
        .set_properties(**{"font-size": "9pt", "white-space": "nowrap"})
    )

    # Use the Matplotlib backend to avoid Playwright/async issues in notebooks
    save_index += 1
    fig_str = f"{save_index:02d}"
    dfi.export(
        styled,
        f"../figures/{fig_str}_df_preview_{preview_df.lower()}.jpg",
        table_conversion="matplotlib",  # render with Matplotlib, no browser needed
        dpi=150,                        # crisp enough but still small
    )
    

### Preview of ADLBH Hematology Labs
<img src="../figures/00_df_preview_adlbh.jpg?v=1" alt="ADLBH sample rows" width="1000">

### Preview of ADLBC Clinical Chemistry Labs
<img src="../figures/01_df_preview_adlbc.jpg?v=1" alt="ADLBC sample rows" width="1000">


# How to quickly look up column meanings

In [None]:

# Parse the Define-XML once; the lookups below reuse the parsed tree
root = ET.parse(DATA_DIR / "define.xml").getroot()

def var_label(dataset, var):
    """Return the Define-XML label for `var` in `dataset`, or None if not found."""
    # "{*}" matches any XML namespace (requires Python 3.8+)
    for ig in root.findall(".//{*}ItemGroupDef[@Name='%s']" % dataset):
        for itemref in ig.findall(".//{*}ItemRef"):
            oid = itemref.get("ItemOID")
            item = root.find(".//{*}ItemDef[@OID='%s']" % oid)
            if item is not None and item.get("Name") == var:
                txt = item.find(".//{*}Description/{*}TranslatedText")
                return txt.text.strip() if txt is not None else None
    return None

"""
Example:
In the ADSL (Subject Level) Data Set...
ADSL column TRTSDT: Date of First Exposure to Treatment
ADSL column DCDECOD: Standardized Disposition Term
"""

print("Example")
print("ADSL column TRTSDT: ", var_label("ADSL", "TRTSDT"))
print("ADSL column DCDECOD:", var_label("ADSL", "DCDECOD"))

print("\nVital Sign Columns")
print("ADVS column AVAL:   ", var_label("ADVS", "AVAL"))
print("ADVS column TRTSDT: ", var_label("ADVS", "TRTSDT"))
print("ADVS column TRTEDT: ", var_label("ADVS", "TRTEDT"))
print("ADVS column ADT:    ", var_label("ADVS", "ADT"))


So, for example:
- The ADVS column `AVAL` holds the Analysis Value
- The ADVS column `TRTSDT` holds the Date of First Exposure to Treatment
- The ADVS column `TRTEDT` holds the Date of Last Exposure to Treatment
- The ADVS column `ADT` holds the Analysis Date
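
To see how the lookup walks the metadata, the sketch below runs the same idea on a tiny hand-written fragment and returns every label for a dataset in one pass. The XML is a toy stand-in (a real `define.xml` is far larger), and `all_labels` is a hypothetical helper; like the notebook's lookup, it assumes Python 3.8+ for the `{*}` namespace wildcard.

```python
import xml.etree.ElementTree as ET

# Toy Define-XML fragment, made up for illustration.
TOY_DEFINE = """<ODM xmlns="http://www.cdisc.org/ns/odm/v1.3">
  <Study><MetaDataVersion>
    <ItemGroupDef Name="ADVS">
      <ItemRef ItemOID="IT.ADVS.AVAL"/>
      <ItemRef ItemOID="IT.ADVS.ADT"/>
    </ItemGroupDef>
    <ItemDef OID="IT.ADVS.AVAL" Name="AVAL">
      <Description><TranslatedText>Analysis Value</TranslatedText></Description>
    </ItemDef>
    <ItemDef OID="IT.ADVS.ADT" Name="ADT">
      <Description><TranslatedText>Analysis Date</TranslatedText></Description>
    </ItemDef>
  </MetaDataVersion></Study>
</ODM>"""

def all_labels(root, dataset):
    """Map every variable name in `dataset` to its label (None if no description)."""
    # Index ItemDefs by OID once so each ItemRef becomes a dictionary lookup.
    items = {el.get("OID"): el for el in root.iter() if el.tag.endswith("}ItemDef")}
    labels = {}
    for group in root.iter():
        if group.tag.endswith("}ItemGroupDef") and group.get("Name") == dataset:
            for ref in group:
                item = items.get(ref.get("ItemOID"))
                if item is None:
                    continue  # child is not an ItemRef, or points nowhere
                text = item.find("{*}Description/{*}TranslatedText")
                has_text = text is not None and text.text
                labels[item.get("Name")] = text.text.strip() if has_text else None
    return labels

labels = all_labels(ET.fromstring(TOY_DEFINE), "ADVS")
print(labels)
```

Building the OID index up front avoids rescanning the whole tree for every variable, which matters more on a full Define-XML file than on this fragment.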

# Simple Summaries

In [None]:
"""
Patients per Treatment Arm per Site
"""

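# Note: ADAE only contains subjects with at least one adverse event record,
# so these counts can undercount enrollment; ADSL has one row per subject.
site_counts = (
    loaded["ADAE"]
    .groupby(["SITEID", "TRTA"])["USUBJID"]
    .nunique()
    .reset_index(name="n_patients")
)
site_counts["SITEID"] = site_counts["SITEID"].astype(int)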

site_counts.pivot(index="SITEID", columns="TRTA", values="n_patients").plot(
    kind="bar", figsize=(10,5) )

plt.ylabel("Number of Patients")
plt.title("Patients per Treatment Arm by Site")
plt.tight_layout()

"""
Save for Display
"""
# Save compressed image
save_index += 1; fig_str = "{:02}".format(save_index)
out_path = f"../figures/{fig_str}_patients_per_site.jpg"
plt.savefig(out_path, dpi=100, bbox_inches="tight", pil_kwargs={"quality": 70, "optimize": True})
plt.close()


Quick Interpretation:
- Within each site, patients are split roughly evenly across the treatment arms.
- A few sites contribute far more patients than the others.


<img src="../figures/02_patients_per_site.jpg?v=1" alt="Patients per Treatment Arm by Site" width="1000">
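
ADAE holds one row per adverse event, so subjects without any recorded event do not appear in it. For full per-site counts, a subject-level table is the safer source. A minimal sketch on a toy ADSL-like frame follows; the values are invented, and the use of `TRT01A` (the standard ADaM name for actual treatment in period 1) assumes the pilot's ADSL carries that column.

```python
import pandas as pd

# Toy ADSL-like frame: one row per subject (values invented for illustration).
adsl_like = pd.DataFrame({
    "USUBJID": ["S1", "S2", "S3", "S4", "S5"],
    "SITEID":  ["701", "701", "701", "702", "702"],
    "TRT01A":  ["Placebo", "Drug", "Drug", "Placebo", "Drug"],
})

# Rows are sites, columns are arms, cells are subject counts.
counts = pd.crosstab(adsl_like["SITEID"], adsl_like["TRT01A"])
print(counts)
```

Because each subject appears exactly once, `pd.crosstab` needs no de-duplication step; the resulting table can be passed straight to `.plot(kind="bar")`.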


# ADaM Vital Signs
- The ADVS dataset lets you look at vital signs within and across patients.
- The patterns here are common in other vital-sign datasets. For example:
    - Each patient occupies a much narrower range of values than the population as a whole.
    - So a value that is rare in the population can be the norm for an individual patient, and vice versa.
    - Repeated measurements are not a silver bullet: vitals vary widely even within a single visit, so expect even more variability between site visits.
- Given these similarities to a couple dozen other vital-sign datasets I've seen, I'd say it's not a bad example dataset.

In [None]:
"""
Show Available Vital Sign Parameters with Value Counts
"""

df_vs = loaded["ADVS"] # a more useful name

preview = (
    df_vs["PARAM"]
    .value_counts()                # add dropna=False if you want to count NaNs
    .rename("n")
    .reset_index()
    .rename(columns={"index": "PARAM"})
    .sort_values(["n", "PARAM"], ascending=[False, True])
)

styled = (
    preview.style
    .set_caption("Available Vital Sign Parameters")
    .hide(axis="index")
    .format({"n": "{:,}"})         # prettier counts
)

save_index += 1
fig_str = f"{save_index:02d}"
out_path = f"../figures/{fig_str}_vitalsign_params_available_advs.jpg"  # saved as JPEG; use .png for crisper text

dfi.export(
    styled,
    out_path,
    table_conversion="matplotlib",
    dpi=150
)

<img src="../figures/03_vitalsign_params_available_advs.jpg?v=1" alt="Available Vital Signs" width="300">


In [None]:
"""
Get Vital Sign Values & Statistics: Within & Across Patients
"""
df_vs = loaded["ADVS"] # a more useful name

pulse_quants = (
    df_vs
    .query("PARAM == 'Pulse Rate (beats/min)'")
    .groupby("USUBJID")["AVAL"]
    .quantile([0.25, 0.5, 0.75])   # stacked
    .unstack()                     # pivot quantiles into columns
    .reset_index()
    .rename(columns={0.25: "p25", 0.5: "p50", 0.75: "p75"})
    .sort_values("p50")
    .reset_index(drop=True)
)

# Build a dictionary: USUBJID -> index in pulse_quants
index_map = dict(zip(pulse_quants.USUBJID, pulse_quants.index))

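# Boolean mask selecting the pulse-rate rows
bool_pulse_vals = df_vs.PARAM == 'Pulse Rate (beats/min)'

# x = each subject's rank by median pulse, y = the observed values
x_vals = df_vs[bool_pulse_vals]["USUBJID"].map(index_map)
y_vals = df_vs[bool_pulse_vals]["AVAL"]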

In [None]:
# Toggle: True = histogram bars extend to the right; False = bars extend to the left
bars_right = True

fig = plt.figure(figsize=(18,5))
gs = gridspec.GridSpec(1, 2, width_ratios=[4,1], wspace=0.05)

# --- Main plot (left) ---
ax_main = plt.subplot(gs[0])
ax_main.plot(x_vals, y_vals, ".c", alpha=.5, label="Heart Rate Obs")
ax_main.vlines(x=pulse_quants.index, ymin=pulse_quants.p25, ymax=pulse_quants.p75,
               colors="b", label="Patient-Specific Q25–Q75")
ax_main.plot(pulse_quants.index, pulse_quants.p50, "-r", label="Patient-Specific Median")
ax_main.grid(True)
ax_main.set_xlabel("Subject Index (Sorted by Median)", fontsize=16)
ax_main.set_ylabel("Pulse Rate (Beats/Min)", fontsize=16)
ax_main.set_title("Heart Rate Distributions: Within vs Across Subjects", fontsize=16)
ax_main.legend()

# --- Histogram (right) ---
ax_hist = plt.subplot(gs[1], sharey=ax_main)
ax_hist.hist(y_vals, bins=np.arange(40,121,2.5), orientation="horizontal", alpha=0.7)

if bars_right:
    # Bars extend right → put ticks/label on the right
    ax_hist.yaxis.tick_right()
    ax_hist.yaxis.set_label_position("right")
else:
    # Bars extend left → flip axis and put ticks/label on the left
    ax_hist.invert_xaxis()
    ax_hist.yaxis.tick_left()
    ax_hist.yaxis.set_label_position("left")

ax_hist.set_xlabel("Count")
ax_hist.set_ylabel("Pulse Rate (Beats/Min)", fontsize=14, labelpad=10)
ax_hist.tick_params(axis="y", labelsize=12)
ax_hist.set_title("Overall Distribution", fontsize=14)
ax_hist.grid(True)

# --- Save ---
save_index += 1; fig_str = "{:02}".format(save_index)
out_path = f"../figures/{fig_str}_heart_rate_quantiles_plus_hist.jpg"
plt.savefig(out_path, dpi=100, bbox_inches="tight", pil_kwargs={"quality": 70, "optimize": True})
plt.close()


Quick Interpretation:
- Each patient occupies a much narrower range of values than the population as a whole.
- So a value that is rare in the population can be the norm for an individual patient, and vice versa.

<img src="../figures/04_heart_rate_quantiles_plus_hist.jpg?v=1" alt="Heart Rate Quantiles and Hist" width="1000">
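
The within-versus-across contrast can also be put into numbers by comparing each subject's interquartile range with the IQR of the pooled readings. A minimal sketch on toy data follows; the values are invented, and `iqr` is a hypothetical helper.

```python
import pandas as pd

# Toy pulse readings: two subjects with tight personal ranges that sit far apart.
toy = pd.DataFrame({
    "USUBJID": ["S1"] * 4 + ["S2"] * 4,
    "AVAL":    [58, 60, 62, 64, 88, 90, 92, 94],
})

def iqr(series):
    """Interquartile range: 75th percentile minus 25th percentile."""
    return series.quantile(0.75) - series.quantile(0.25)

within = toy.groupby("USUBJID")["AVAL"].apply(iqr)  # one IQR per subject
across = iqr(toy["AVAL"])                           # IQR of the pooled readings

print(within.median(), across)
```

In this toy case the typical within-subject IQR is 3 while the pooled IQR is 29, the same shape of result the plot shows visually.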


In [None]:
"""
One subject's vital sign variation over time
"""

subj_id = "01-701-1015"
bool_subj = df_vs.USUBJID == subj_id
spi = 0  # subplot index

plt.figure(figsize=(20,10))
plt.suptitle( f"Subject Vital Signs: {subj_id}" , fontsize=24 )
for vs_i in ["Pulse Rate (beats/min)" , "Temperature (C)" ,
            "Systolic Blood Pressure (mmHg)" , "Diastolic Blood Pressure (mmHg)"]:
    
    bool_vs = df_vs.PARAM == vs_i
    spi += 1

    plt.subplot(2,2,spi)
    plt.scatter( df_vs[ bool_subj & bool_vs ].ADT , df_vs[ bool_subj & bool_vs ].AVAL , marker="." , s = 50 )
    plt.ylabel( vs_i , fontsize=20 )
    plt.xlabel( "Time" , fontsize=20 )
    plt.xticks(fontsize=16)
    plt.yticks(fontsize=16)

plt.tight_layout()

# --- Save ---
save_index += 1; fig_str = "{:02}".format(save_index)
out_path = f"../figures/{fig_str}_subject_vitals_timeseries.jpg"
plt.savefig(out_path, dpi=100, bbox_inches="tight", pil_kwargs={"quality": 70, "optimize": True})
plt.close()

Quick Interpretation:
- Repeated measurements are not a silver bullet: vitals vary widely even when taken within the same visit.
- Expect even more variability between site visits.

<img src="../figures/05_subject_vitals_timeseries.jpg?v=1" alt="Subject Vital Signs Time Series" width="1000">
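
One way to quantify the same-visit variation is to compare the spread of readings inside each visit with the spread of the visit averages. The sketch below uses toy rows with invented values and assumes `AVISIT` identifies the visit, as in the dataset previews earlier.

```python
import pandas as pd

# Toy ADVS-like rows: repeated pulse readings within each visit (values invented).
toy = pd.DataFrame({
    "USUBJID": ["S1"] * 6,
    "AVISIT":  ["Baseline"] * 3 + ["Week 2"] * 3,
    "AVAL":    [62, 70, 66, 64, 65, 63],
})

# Spread of readings taken at the same visit (max minus min per subject and visit).
per_visit = toy.groupby(["USUBJID", "AVISIT"])["AVAL"].agg(["min", "max"])
per_visit["range"] = per_visit["max"] - per_visit["min"]

# Spread of the visit means, i.e. how far the subject moves between visits.
visit_means = toy.groupby(["USUBJID", "AVISIT"])["AVAL"].mean()
between = visit_means.groupby(level="USUBJID").agg(lambda s: s.max() - s.min())

print(per_visit)
print(between)
```

Here the baseline readings span 8 beats/min while the visit means differ by only 2, so a single reading per visit could easily mislead.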



# ADaM Lab Values
- We'll give some lab values the same treatment we gave the vital signs.
- Options:
    - `adlbc.xpt`: Clinical chemistry labs (ALT, AST, ALP, BILI, etc.; analysis variables, baseline/shift flags).
    - `adlbh.xpt`: Hematology labs (HGB, HCT, PLT, etc.).

In [None]:
"""
Show Available Lab Parameters with Value Counts
"""

df_labs = loaded["ADLBC"]

preview = (
    df_labs["PARAM"]
    .value_counts()                # add dropna=False if you want to count NaNs
    .rename("n")
    .reset_index()
    .rename(columns={"index": "PARAM"})
    .sort_values(["n", "PARAM"], ascending=[False, True])
)

styled = (
    preview.style
    .set_caption("Available Lab Parameters")
    .hide(axis="index")
    .format({"n": "{:,}"})         # prettier counts
)

save_index += 1
fig_str = f"{save_index:02d}"
out_path = f"../figures/{fig_str}_lab_params_available_adlbc.jpg"  # saved as JPEG; use .png for crisper text

dfi.export(
    styled,
    out_path,
    table_conversion="matplotlib",
    dpi=150
)

<img src="../figures/06_lab_params_available_adlbc.jpg?v=1" alt="Available Lab Params" width="300">


In [None]:
"""
Bilirubin Lab Values & Statistics: Within & Across Patients
"""

df_labs = loaded["ADLBC"]


bili_quants = (
    df_labs
    .query("PARAM == 'Bilirubin (umol/L)'")
    .groupby("USUBJID")["AVAL"]
    .quantile([0.25, 0.5, 0.75])   # stacked
    .unstack()                     # pivot quantiles into columns
    .reset_index()
    .rename(columns={0.25: "p25", 0.5: "p50", 0.75: "p75"})
    .sort_values("p50")
    .reset_index(drop=True)
)

# Build a dictionary: USUBJID -> index in bili_quants
index_map = dict(zip(bili_quants.USUBJID, bili_quants.index))

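# Boolean mask selecting the bilirubin rows
bool_bili_vals = df_labs.PARAM == 'Bilirubin (umol/L)'

# x = each subject's rank by median bilirubin, y = the observed values
x_vals = df_labs[bool_bili_vals]["USUBJID"].map(index_map)
y_vals = df_labs[bool_bili_vals]["AVAL"]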

In [None]:
# Toggle: True = histogram bars extend to the right; False = bars extend to the left
bars_right = True

fig = plt.figure(figsize=(18,5))
gs = gridspec.GridSpec(1, 2, width_ratios=[4,1], wspace=0.05)

# --- Main plot (left) ---
ax_main = plt.subplot(gs[0])
ax_main.plot(x_vals, y_vals, ".c", alpha=.5, label="Bilirubin Obs")
ax_main.vlines(x=bili_quants.index, ymin=bili_quants.p25, ymax=bili_quants.p75,
               colors="b", label="Patient-Specific Q25–Q75")
ax_main.plot(bili_quants.index, bili_quants.p50, "-r", label="Patient-Specific Median")
ax_main.grid(True)
ax_main.set_xlabel("Subject Index (Sorted by Median)", fontsize=16)
ax_main.set_ylabel("Bilirubin (umol/L)", fontsize=16)
ax_main.set_title("Bilirubin Distributions: Within vs Across Subjects", fontsize=16)
ax_main.legend()

ax_main.set_ylim([-1, 40])  # zoom in; observations above 40 umol/L fall outside the view


# --- Histogram (right) ---
ax_hist = plt.subplot(gs[1], sharey=ax_main)
ax_hist.hist(y_vals, bins=np.arange(0,42,2), orientation="horizontal", alpha=0.7)

if bars_right:
    # Bars extend right → put ticks/label on the right
    ax_hist.yaxis.tick_right()
    ax_hist.yaxis.set_label_position("right")
else:
    # Bars extend left → flip axis and put ticks/label on the left
    ax_hist.invert_xaxis()
    ax_hist.yaxis.tick_left()
    ax_hist.yaxis.set_label_position("left")

ax_hist.set_xlabel("Count")
ax_hist.set_ylabel("Bilirubin (umol/L)", fontsize=14, labelpad=10)
ax_hist.tick_params(axis="y", labelsize=12)
ax_hist.set_title("Overall Distribution", fontsize=14)
ax_hist.grid(True)

# --- Save ---
save_index += 1; fig_str = "{:02}".format(save_index)
out_path = f"../figures/{fig_str}_bilirubin_quantiles_plus_hist.jpg"
plt.savefig(out_path, dpi=100, bbox_inches="tight", pil_kwargs={"quality": 70, "optimize": True})
plt.close()


<img src="../figures/07_bilirubin_quantiles_plus_hist.jpg?v=1" alt="Bilirubin Quantiles and Hist" width="1000">
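
The pulse and bilirubin cells repeat the same quantile pipeline with only the parameter name changed. One possible refactor is a small helper, sketched below on toy data; `subject_quantiles` is a hypothetical name, and the values are invented for illustration.

```python
import pandas as pd

def subject_quantiles(df, param):
    """Per-subject 25th/50th/75th percentiles of AVAL for one PARAM, sorted by median."""
    return (
        df.loc[df["PARAM"] == param]
        .groupby("USUBJID")["AVAL"]
        .quantile([0.25, 0.5, 0.75])  # stacked: one row per (subject, quantile)
        .unstack()                    # pivot quantiles into columns
        .rename(columns={0.25: "p25", 0.5: "p50", 0.75: "p75"})
        .sort_values("p50")
        .reset_index()                # USUBJID becomes a column; index is the rank
    )

# Toy lab rows for two subjects.
toy = pd.DataFrame({
    "USUBJID": ["S1", "S1", "S1", "S2", "S2", "S2"],
    "PARAM":   ["Bilirubin (umol/L)"] * 6,
    "AVAL":    [10.0, 12.0, 14.0, 4.0, 6.0, 8.0],
})

quants = subject_quantiles(toy, "Bilirubin (umol/L)")
print(quants)
```

The returned index doubles as the x-position used in the plots, so `dict(zip(quants.USUBJID, quants.index))` reproduces the `index_map` step for any parameter.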
