In [1]:
# ============================================================
# Cell 1 — Environment Verification & AWS Session Initialization
# ============================================================
#
# This cell performs three things:
#   1. Imports all required libraries for AWS communication,
#      filesystem handling, and tabular processing.
#   2. Sets your target S3 bucket (already populated).
#   3. Creates an AWS S3 client using your active environment
#      credentials (pulled from ~/.aws/credentials).
#
# This cell does NOT access the bucket yet. It only ensures that
# the environment is configured correctly and that boto3 can
# initialize a client without error.
# ============================================================

import boto3
import pandas as pd
from botocore.exceptions import ClientError

# Name of the new clean bucket you created
BUCKET = "quantum-clinical-optimization-us-west-2"

# Initialize S3 client
# If credentials are properly configured in ~/.aws,
# boto3 will automatically detect them.
try:
    s3 = boto3.client("s3")
    print("AWS S3 client initialized successfully.")
except Exception as e:
    print("❌ Failed to initialize AWS client:", e)
    raise  # later cells depend on `s3`, so stop here instead of continuing

AWS S3 client initialized successfully.


### Cell 1 — Environment Verification & AWS Session Initialization

This cell prepares the execution environment for the rest of the notebook. It imports the core libraries used throughout: `boto3` for AWS interactions, `pandas` for tabular data handling, and `ClientError` for clean S3 error reporting. It then defines the target S3 bucket, `quantum-clinical-optimization-us-west-2`, which is assumed to be pre-populated with clinical-trials data. Finally, it constructs an S3 client using the active AWS configuration (for example, credentials stored under `~/.aws/credentials`). Constructing the client does not send any request to S3, so the success message shows only that `boto3` is installed and could build a client; it does not by itself prove that the credentials are valid or that the bucket is reachable. Those are first exercised in Cell 2, before any large-scale processing is attempted.
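
Because client construction alone does not contact AWS, an explicit identity call can make the credential check visible at this stage. The sketch below is one possible way to do that, not part of the original notebook: the helper name `describe_identity` is hypothetical, and it assumes it is handed an STS client such as `boto3.client("sts")`.

```python
def describe_identity(sts_client):
    """Return a one-line summary of the caller identity.

    `sts_client` is assumed to behave like boto3.client("sts"): its
    get_caller_identity() call returns a dict that includes "Account"
    and "Arn". Any exception (for example, missing or expired
    credentials) is left to propagate so the failure is visible.
    """
    identity = sts_client.get_caller_identity()
    return f"account={identity['Account']} arn={identity['Arn']}"


# Usage inside the notebook (needs boto3 and working credentials):
#   import boto3
#   print(describe_identity(boto3.client("sts")))
```

Passing the client in as an argument keeps the helper easy to exercise with a stand-in object when no AWS access is available.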

In [2]:
# ============================================================
# Cell 2 — Basic Bucket Connectivity Test
# ============================================================
#
# Purpose of this cell:
#   • Confirm the notebook can list objects from your S3 bucket.
#   • Detect incorrect bucket names or missing permissions early.
#
# Strategy:
#   • Use list_objects_v2() with MaxKeys=10 to quickly sample
#     the root of the bucket.
#   • Print the top-level keys as a sanity check.
#
# Expected Output:
#   • A short list including:
#       clinical-trials-data/raw/
#       clinical-trials-data/raw/NCT0000xxxx/...
#       etc.
# ============================================================

try:
    response = s3.list_objects_v2(
        Bucket=BUCKET,
        MaxKeys=10  # quick preview
    )

    if "Contents" not in response:
        print("Bucket is reachable but contains no objects.")
    else:
        print(f"Connected to S3 bucket: {BUCKET}")
        print("Sample keys found:")
        for obj in response["Contents"]:
            print("  -", obj["Key"])

except ClientError as e:
    print("❌ S3 access error:", e)
except Exception as e:
    print("❌ Unexpected error:", e)

Connected to S3 bucket: quantum-clinical-optimization-us-west-2
Sample keys found:
  - braket-output/
  - clinical-trials-data/archive/
  - clinical-trials-data/manifests/
  - clinical-trials-data/processed/
  - clinical-trials-data/raw/
  - clinical-trials-data/raw/Contents.txt
  - clinical-trials-data/raw/NCT0000xxxx/NCT00000102.xml
  - clinical-trials-data/raw/NCT0000xxxx/NCT00000104.xml
  - clinical-trials-data/raw/NCT0000xxxx/NCT00000105.xml
  - clinical-trials-data/raw/NCT0000xxxx/NCT00000106.xml


### Cell 2 — Basic Bucket Connectivity Test

This cell verifies that the notebook can successfully communicate with the configured S3 bucket and that your IAM permissions are sufficient to list objects. Using the previously created `s3` client, it calls `list_objects_v2` with `MaxKeys=10` to retrieve a small sample of keys from the bucket root. If the response contains a `Contents` section, the code prints a short list of object keys as a quick visual confirmation that the bucket name is correct and populated (for example, paths starting with `clinical-trials-data/raw/`). If `Contents` is missing, the notebook reports that the bucket is reachable but currently empty. Any `ClientError` (such as access denied or a typo in the bucket name) is caught and reported explicitly, while a generic `Exception` catch provides a safety net for other unexpected issues. This keeps connectivity problems front-loaded and easy to diagnose before launching large-scale ingestion.
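
A flat listing with `MaxKeys=10` shows whichever keys sort first. If the goal is to see the bucket's top-level layout instead, `list_objects_v2` also accepts a `Delimiter`, which groups keys by their next path segment. The following is a minimal sketch under that assumption; the helper name `list_top_level_prefixes` is hypothetical, and it reads only the first page of results.

```python
def list_top_level_prefixes(s3_client, bucket, prefix=""):
    """Return the immediate "folder" prefixes found under `prefix`.

    With Delimiter="/", S3 reports grouped key prefixes under
    "CommonPrefixes" rather than listing every object. Only the first
    response page is read here, which is enough for a quick look.
    """
    response = s3_client.list_objects_v2(
        Bucket=bucket, Prefix=prefix, Delimiter="/"
    )
    return [entry["Prefix"] for entry in response.get("CommonPrefixes", [])]


# Usage inside the notebook, with the `s3` client and BUCKET from Cell 1:
#   print(list_top_level_prefixes(s3, BUCKET, "clinical-trials-data/"))
```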

In [3]:
# ============================================================
# Cell 3 — Build S3 Metadata Index of All Clinical Trial XMLs
# ============================================================
#
# Purpose:
#   • Build a catalog of every XML file inside:
#         clinical-trials-data/raw/
#   • Extract metadata:
#         - Full S3 key
#         - NCT ID parsed from filename
#         - Top-level directory (NCTxxxx group)
#         - File size (Bytes)
#         - Last-modified timestamp
#
# Why this matters:
#   • This index allows fast searching, sampling, batching,
#     QC analysis, and parallel processing.
#   • This avoids downloading hundreds of thousands of XML files at once.
#
# Output:
#   - df_index → Pandas DataFrame with full metadata
#   - A preview and dataset counts
#
# ============================================================

import re
import pandas as pd

RAW_PREFIX = "clinical-trials-data/raw/"
xml_pattern = re.compile(r"NCT\d+\.xml$")

all_records = []

# The paginator follows continuation tokens automatically; this is required
# here because a single list_objects_v2 call returns at most 1,000 keys.
paginator = s3.get_paginator("list_objects_v2")

print("Scanning S3 keys... this may take 3–10 minutes.")

try:
    for page in paginator.paginate(Bucket=BUCKET, Prefix=RAW_PREFIX):
        for obj in page.get("Contents", []):
            key = obj["Key"]

            # Only index XML files (skip directories, metadata files, etc.)
            if xml_pattern.search(key):
                filename = key.split("/")[-1]
                nct_id = filename.replace(".xml", "")
                folder_group = key.split("/")[-2]  # e.g., NCT0000xxxx

                record = {
                    "s3_key": key,
                    "nct_id": nct_id,
                    "folder": folder_group,
                    "size_bytes": obj["Size"],
                    "last_modified": obj["LastModified"],
                }
                all_records.append(record)

except Exception as e:
    print("❌ Error during S3 scan:", e)
    raise

# Convert into DataFrame
df_index = pd.DataFrame(all_records)

print("\nIndex build complete.")
print(f"Total XML files indexed: {len(df_index):,}")
df_index.head(10)


Scanning S3 keys... this may take 3–10 minutes.

Index build complete.
Total XML files indexed: 557,292


Unnamed: 0,s3_key,nct_id,folder,size_bytes,last_modified
0,clinical-trials-data/raw/NCT0000xxxx/NCT000001...,NCT00000102,NCT0000xxxx,4217,2025-11-14 10:58:29+00:00
1,clinical-trials-data/raw/NCT0000xxxx/NCT000001...,NCT00000104,NCT0000xxxx,4357,2025-11-14 10:58:29+00:00
2,clinical-trials-data/raw/NCT0000xxxx/NCT000001...,NCT00000105,NCT0000xxxx,11623,2025-11-14 10:58:29+00:00
3,clinical-trials-data/raw/NCT0000xxxx/NCT000001...,NCT00000106,NCT0000xxxx,5572,2025-11-14 10:58:29+00:00
4,clinical-trials-data/raw/NCT0000xxxx/NCT000001...,NCT00000107,NCT0000xxxx,3312,2025-11-14 10:58:29+00:00
5,clinical-trials-data/raw/NCT0000xxxx/NCT000001...,NCT00000108,NCT0000xxxx,4293,2025-11-14 10:58:29+00:00
6,clinical-trials-data/raw/NCT0000xxxx/NCT000001...,NCT00000110,NCT0000xxxx,4687,2025-11-14 10:58:29+00:00
7,clinical-trials-data/raw/NCT0000xxxx/NCT000001...,NCT00000111,NCT0000xxxx,4252,2025-11-14 10:58:29+00:00
8,clinical-trials-data/raw/NCT0000xxxx/NCT000001...,NCT00000112,NCT0000xxxx,3612,2025-11-14 10:58:29+00:00
9,clinical-trials-data/raw/NCT0000xxxx/NCT000001...,NCT00000113,NCT0000xxxx,10914,2025-11-14 10:58:29+00:00


### Cell 3 — Build S3 Metadata Index of All Clinical Trial XMLs

This cell constructs a complete metadata index of all clinical trial XML files stored under the S3 prefix `clinical-trials-data/raw/`. Using a `list_objects_v2` paginator, it walks through every page of objects under that prefix, which is essential at this scale because a single listing call returns at most 1,000 keys and the corpus holds 557,292 XML files. For each object, the code keeps only trial XML files by matching the key against a regular expression (`NCT\d+\.xml$`), then parses the trial identifier (`nct_id`) from the filename and the folder group (e.g., `NCT0000xxxx`) from the key path. It also captures the file size in bytes and the last-modified timestamp. These fields are accumulated into a list of dictionaries, which is then converted into the `df_index` DataFrame. The resulting index serves as a lightweight catalog of the entire clinical-trials corpus, enabling downstream sampling, QC, and parallel processing without downloading the full XML dataset up front. The cell finishes by reporting how many XML files were indexed and displaying the first ten rows as a sanity check.
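
The key-parsing step can be isolated into a small pure function, which makes it easy to check against a few known keys without touching S3. This is a sketch rather than the notebook's own code: `parse_trial_key` and `KEY_PATTERN` are hypothetical names, and the pattern is slightly stricter than the one in the cell because it requires the filename itself to start with `NCT`.

```python
import re

# Matches ".../<folder>/<NCT id>.xml" and captures both parts by name.
KEY_PATTERN = re.compile(r"(?:^|/)(?P<folder>[^/]+)/(?P<nct_id>NCT\d+)\.xml$")


def parse_trial_key(key):
    """Return index fields for a trial XML key, or None for any other key."""
    match = KEY_PATTERN.search(key)
    if match is None:
        return None
    return {
        "s3_key": key,
        "nct_id": match["nct_id"],
        "folder": match["folder"],
    }


# Keys taken from the Cell 2 listing above.
print(parse_trial_key("clinical-trials-data/raw/NCT0000xxxx/NCT00000102.xml"))
print(parse_trial_key("clinical-trials-data/raw/Contents.txt"))
```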


In [4]:
# ============================================================
# Cell 4 — Define Robust XML Parsing Utilities
# ============================================================
#
# Purpose:
#   Provide safe, fault-tolerant parsing of ClinicalTrials.gov XML files.
#
# Why robust parsing?
#   • CT.gov XML varies by year, sponsor, system version.
#   • Many files contain missing tags, malformed sections, or odd nesting.
#   • We must avoid notebook crashes by isolating and handling errors.
#
# Output:
#   - parse_xml_from_s3(key): returns a dict of extracted fields
#
# ============================================================

import xml.etree.ElementTree as ET
from botocore.exceptions import ClientError
from io import BytesIO

def safe_findtext(elem, path):
    """Return elem.find(path).text or None if missing."""
    node = elem.find(path)
    return node.text.strip() if node is not None and node.text else None

def parse_xml_from_s3(s3_key):
    """
    Download a single XML file from S3 and extract key structured fields.
    Returns a dictionary. Returns None if parsing fails.
    """

    try:
        # Download XML file into memory
        obj = s3.get_object(Bucket=BUCKET, Key=s3_key)
        xml_bytes = obj["Body"].read()

        # Parse XML safely
        tree = ET.parse(BytesIO(xml_bytes))
        root = tree.getroot()

    except ClientError as e:
        print(f"❌ S3 error for key: {s3_key} — {e}")
        return None
    except ET.ParseError:
        print(f"❌ XML parse error: {s3_key}")
        return None

    # --- Extract Core Fields ---
    record = {}

    # NCT ID
    record["nct_id"] = safe_findtext(root, "./id_info/nct_id")

    # Titles
    record["brief_title"] = safe_findtext(root, "./brief_title")
    record["official_title"] = safe_findtext(root, "./official_title")

    # Study Status
    record["overall_status"] = safe_findtext(root, "./overall_status")

    # Phase
    record["phase"] = safe_findtext(root, "./phase")

    # Conditions (can be multiple)
    record["conditions"] = [
        c.text.strip() for c in root.findall("./condition") if c.text
    ]

    # Interventions (name only for now)
    record["interventions"] = [
        safe_findtext(i, "./intervention_name")
        for i in root.findall("./intervention")
    ]

    # Enrollment (target sample size)
    record["enrollment"] = safe_findtext(root, "./enrollment")

    # Location countries
    record["location_countries"] = [
        c.text.strip()
        for c in root.findall("./location_countries/country")
        if c.text
    ]

    # Sponsor (lead sponsor only)
    record["lead_sponsor"] = safe_findtext(
        root, "./sponsors/lead_sponsor/agency"
    )

    return record


print("XML parsing utilities loaded.")


XML parsing utilities loaded.


### Cell 4 — Robust XML Parsing Utilities

This cell defines the core utilities for safely converting raw ClinicalTrials.gov XML files into structured Python dictionaries. Because CT.gov XML schemas have evolved over time and can be inconsistent or partially filled out, the parser is designed to be fault-tolerant rather than brittle.

The helper function `safe_findtext` encapsulates a common pattern: attempt to locate a nested XML element via a path, and return its stripped text value if present, or `None` if the element or its text is missing. This avoids repetitive `None` checks throughout the parser.

The main function, `parse_xml_from_s3(s3_key)`, downloads a single XML object from S3 into memory using the established `s3` client and bucket name. It then parses the content into an `ElementTree`, catching and handling two main classes of failure:

- `ClientError` from S3 (e.g., missing object, permissions, or transient S3 issues), and  
- `ET.ParseError` for malformed XML that cannot be parsed.

In either case, the function logs a clear error message and returns `None` so that the ingestion loop can skip problematic files without crashing the notebook.

On success, the function extracts a curated set of core fields that are useful for downstream analytics and optimization: the NCT identifier, brief and official titles, overall status, phase, a list of conditions, a list of intervention names, enrollment target, a list of location countries, and the lead sponsor agency. These values are assembled into a `record` dictionary, which becomes the building block for the trial-level metadata table constructed later in the pipeline.
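
To see how missing elements are handled without an S3 round trip, the same extraction pattern can be run on an inline document. The XML below is a hypothetical, heavily trimmed record written for illustration only; `find_text` mirrors the behaviour described for `safe_findtext`, and the intervention list here additionally drops entries that have no name.

```python
import xml.etree.ElementTree as ET

# Hypothetical sample: no <phase>, and one <intervention> without a name.
SAMPLE_XML = b"""<clinical_study>
  <id_info><nct_id>NCT00000000</nct_id></id_info>
  <brief_title>  Example Study  </brief_title>
  <condition>Example Condition</condition>
  <intervention><intervention_name>Example Drug</intervention_name></intervention>
  <intervention><intervention_type>Other</intervention_type></intervention>
</clinical_study>"""


def find_text(elem, path):
    """Return the stripped text at `path`, or None if the node or text is missing."""
    node = elem.find(path)
    return node.text.strip() if node is not None and node.text else None


root = ET.fromstring(SAMPLE_XML)
intervention_names = (
    find_text(node, "./intervention_name") for node in root.findall("./intervention")
)
record = {
    "nct_id": find_text(root, "./id_info/nct_id"),
    "brief_title": find_text(root, "./brief_title"),
    "phase": find_text(root, "./phase"),  # element absent, so this is None
    "conditions": [c.text.strip() for c in root.findall("./condition") if c.text],
    "interventions": [name for name in intervention_names if name],
}
print(record)
```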


In [5]:
# ============================================================
# Cell 5 — Test XML Parsing on a Single Sample File
# ============================================================

# Select the first file from the index built in Cell 3
sample_key = df_index.iloc[0]["s3_key"]
print("Testing XML parse on file:", sample_key)

# Parse the XML record
parsed = parse_xml_from_s3(sample_key)

# Pretty-print the result
import json
print("\nParsed Record:")
print(json.dumps(parsed, indent=2))

# Basic field sanity checks
print("\nSanity Checks:")
print("NCT ID:", parsed.get("nct_id"))
print("Title:", parsed.get("brief_title"))
print("Conditions:", parsed.get("conditions")[:5] if parsed.get("conditions") else None)
print("Interventions:", parsed.get("interventions")[:5] if parsed.get("interventions") else None)
print("Countries:", parsed.get("location_countries")[:5] if parsed.get("location_countries") else None)


Testing XML parse on file: clinical-trials-data/raw/NCT0000xxxx/NCT00000102.xml

Parsed Record:
{
  "nct_id": "NCT00000102",
  "brief_title": "Congenital Adrenal Hyperplasia: Calcium Channels as Therapeutic Targets",
  "official_title": null,
  "overall_status": "Completed",
  "phase": "Phase 1/Phase 2",
  "conditions": [
    "Congenital Adrenal Hyperplasia"
  ],
  "interventions": [
    "Nifedipine"
  ],
  "enrollment": null,
  "location_countries": [
    "United States"
  ],
  "lead_sponsor": "National Center for Research Resources (NCRR)"
}

Sanity Checks:
NCT ID: NCT00000102
Title: Congenital Adrenal Hyperplasia: Calcium Channels as Therapeutic Targets
Conditions: ['Congenital Adrenal Hyperplasia']
Interventions: ['Nifedipine']
Countries: ['United States']


### Cell 5 — Single-File XML Parsing Smoke Test

This cell performs a targeted smoke test of the XML parsing pipeline before we scale up to the full corpus. Using the `df_index` DataFrame built in Cell 3, it selects the first S3 key and uses `parse_xml_from_s3` (defined in Cell 4) to download and parse that single clinical-trial XML.

The parsed result is pretty-printed as JSON to make the structure and extracted fields easy to inspect visually. This lets you confirm that the parser is correctly populating core attributes such as `nct_id`, `brief_title`, `conditions`, `interventions`, and `location_countries`.

Finally, the cell runs a few basic sanity checks, printing a subset of key fields (and the first few list elements where applicable). If this cell shows sensible values—for example, a well-formed NCT ID, a human-readable title, non-empty condition and intervention lists, and reasonable country information—it provides strong evidence that the XML parsing logic and S3 access are functioning as intended before moving on to large-scale ingestion.
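
The visual checks can be complemented by a small programmatic one that also copes with a failed parse (where `parse_xml_from_s3` returns `None`). The sketch below is illustrative: `check_record` is a hypothetical helper, and the pattern assumes NCT identifiers consist of `NCT` followed by eight digits, as in the examples shown in this notebook.

```python
import re

# Assumption: identifiers look like "NCT" plus eight digits.
NCT_ID_PATTERN = re.compile(r"NCT\d{8}")


def check_record(record):
    """Return a list of problems found; an empty list means the record looks sane."""
    if record is None:
        return ["parser returned None"]
    problems = []
    nct_id = record.get("nct_id") or ""
    if not NCT_ID_PATTERN.fullmatch(nct_id):
        problems.append(f"unexpected nct_id: {nct_id!r}")
    if not record.get("brief_title"):
        problems.append("missing brief_title")
    for field in ("conditions", "interventions", "location_countries"):
        if not isinstance(record.get(field), list):
            problems.append(f"{field} is not a list")
    return problems


# Usage inside the notebook:
#   print(check_record(parsed) or "record looks sane")
```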


In [6]:
# ============================================================
# Cell 6 — Batch-Validate Parsing on Random Sample of XML Files
# ============================================================

import random
from collections import Counter

# Number of random XML files to test
SAMPLE_SIZE = 200

sample_keys = random.sample(df_index["s3_key"].tolist(), SAMPLE_SIZE)

parsed_records = []
errors = 0
field_counter = Counter()

print(f"Validating parsing across {SAMPLE_SIZE} random trials...")

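for key in sample_keys:
    try:
        record = parse_xml_from_s3(key)
    except Exception as e:
        print(f"⚠️ Error parsing {key}: {e}")
        errors += 1
        continue

    # parse_xml_from_s3 signals S3 or XML failures by returning None,
    # so count those as errors rather than as successful parses
    if record is None:
        errors += 1
        continue

    parsed_records.append(record)

    # Track which fields appear
    for field, val in record.items():
        if val not in (None, [], ""):
            field_counter[field] += 1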

print("\n=== Validation Summary ===")
print(f"Total sampled:  {SAMPLE_SIZE}")
print(f"Successfully parsed: {len(parsed_records)}")
print(f"Errors: {errors}")

print("\nField Coverage Across Sample (Top 20 fields):")
for field, count in field_counter.most_common(20):
    print(f"{field:<25} {count}/{SAMPLE_SIZE} ({count/SAMPLE_SIZE:.0%})")


Validating parsing across 200 random trials...

=== Validation Summary ===
Total sampled:  200
Successfully parsed: 200
Errors: 0

Field Coverage Across Sample (Top 20 fields):
nct_id                    200/200 (100%)
brief_title               200/200 (100%)
overall_status            200/200 (100%)
lead_sponsor              200/200 (100%)
conditions                199/200 (100%)
enrollment                195/200 (98%)
official_title            194/200 (97%)
interventions             182/200 (91%)
location_countries        178/200 (89%)
phase                     149/200 (74%)


### Cell 6 — Batch Validation of XML Parsing on a Random Sample

This cell stress-tests the XML parser across a random subset of trials to ensure robustness before scaling to the full dataset. Rather than trusting a single-file smoke test, it draws a random sample of `SAMPLE_SIZE` S3 keys (default 200) from `df_index` and attempts to parse each one using `parse_xml_from_s3`.

For every successfully parsed record, the cell tracks which fields are actually populated. A `Counter` called `field_counter` is incremented whenever a field’s value is non-empty (`None`, empty lists, and empty strings are treated as “missing”). This gives a quick sense of field coverage across the sample—how often key attributes like `overall_status`, `phase`, `conditions`, or `enrollment` appear in practice.

The summary at the end reports:
- How many trials were sampled,
- How many were parsed successfully versus how many produced errors, and
- A ranked list of fields, showing for each how many of the sampled trials provided a non-empty value (and the corresponding percentage).

If the error count is low and core fields show good coverage, it validates that the S3 access pattern and XML parsing logic are stable enough to apply to all trials in the subsequent large-scale ingestion step.
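
The coverage tally described above can be expressed as a standalone function that works on any list of parsed records and treats `None` entries as failed parses. This is a sketch with a hypothetical name, `field_coverage`, and toy input data; it is not the cell's own code.

```python
from collections import Counter


def field_coverage(records):
    """Return (number of parsed records, Counter of non-empty fields).

    Entries that are None (failed parses) are skipped. A field counts
    as populated unless its value is None, an empty list, or "".
    """
    parsed = [record for record in records if record is not None]
    counts = Counter()
    for record in parsed:
        for field, value in record.items():
            if value not in (None, [], ""):
                counts[field] += 1
    return len(parsed), counts


# Toy input: two parsed records and one failed parse.
toy = [
    {"nct_id": "NCT00000000", "phase": None, "conditions": []},
    None,
    {"nct_id": "NCT00000001", "phase": "Phase 1", "conditions": ["Example"]},
]
n_parsed, coverage = field_coverage(toy)
for field, count in coverage.most_common():
    print(f"{field:<12} {count}/{n_parsed}")
```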


In [7]:
# === Utility logging function + time sync check (used across multiple cells) ===
import datetime, time

print("datetime.utcnow():", datetime.datetime.utcnow())
print("time.time() (epoch seconds):", time.time())

def log(msg: str):
    print(msg, flush=True)

datetime.utcnow(): 2025-11-19 12:49:59.994258
time.time() (epoch seconds): 1763556599.9943516


In [8]:
# === Cell 7a: Smoke test on a small subset ===
sample_n = 200  # adjust as needed

sample_records = []
for _, row in df_index.head(sample_n).iterrows():
    parsed = parse_xml_from_s3(row["s3_key"])
    if not parsed:
        continue
    parsed["size_bytes"] = row["size_bytes"]
    parsed["last_modified"] = row["last_modified"]
    parsed["folder"] = row["folder"]
    sample_records.append(parsed)

sample_df = pd.DataFrame(sample_records)
log(f"Smoke test complete: parsed {len(sample_df)} of {sample_n} trials.")
sample_df.head()


Smoke test complete: parsed 200 of 200 trials.


Unnamed: 0,nct_id,brief_title,official_title,overall_status,phase,conditions,interventions,enrollment,location_countries,lead_sponsor,size_bytes,last_modified,folder
0,NCT00000102,Congenital Adrenal Hyperplasia: Calcium Channe...,,Completed,Phase 1/Phase 2,[Congenital Adrenal Hyperplasia],[Nifedipine],,[United States],National Center for Research Resources (NCRR),4217,2025-11-14 10:58:29+00:00,NCT0000xxxx
1,NCT00000104,Does Lead Burden Alter Neuropsychological Deve...,,Completed,,[Lead Poisoning],[ERP measures of attention and memory],,[United States],National Center for Research Resources (NCRR),4357,2025-11-14 10:58:29+00:00,NCT0000xxxx
2,NCT00000105,Vaccination With Tetanus and KLH to Assess Imm...,Vaccination With Tetanus Toxoid and Keyhole Li...,Terminated,,[Cancer],"[Intracel KLH Vaccine, Biosyn KLH, Montanide I...",112.0,[United States],"Masonic Cancer Center, University of Minnesota",11623,2025-11-14 10:58:29+00:00,NCT0000xxxx
3,NCT00000106,41.8 Degree Centigrade Whole Body Hyperthermia...,,Unknown status,,[Rheumatic Diseases],[Whole body hyperthermia unit],,[United States],National Center for Research Resources (NCRR),5572,2025-11-14 10:58:29+00:00,NCT0000xxxx
4,NCT00000107,Body Water Content in Cyanotic Congenital Hear...,,Completed,,"[Heart Defects, Congenital]",[],,[United States],National Center for Research Resources (NCRR),3312,2025-11-14 10:58:29+00:00,NCT0000xxxx


### Cell 7a — Smoke Test on a Small Subset of Trials

This cell performs a small-scale dry run of the full ingestion pipeline before we commit to processing all ~557k trials. It takes the first `sample_n` rows from `df_index`, downloads and parses each corresponding XML via `parse_xml_from_s3`, and attaches the S3 index metadata (`size_bytes`, `last_modified`, `folder`) to each parsed record. The results are assembled into a `sample_df` DataFrame and summarized with a log message showing how many of the sampled trials were successfully parsed. A quick `.head()` preview lets us visually confirm that the combined XML-derived fields and S3 metadata look correct. If this smoke test parses cleanly and the columns match expectations, it gives us confidence to proceed to the high-throughput extraction step.


In [9]:
# === Cell 7: High-performance extraction of trial-level metadata ===
# Long-running, concurrent pipeline that parses all ~557k trials
# and produces a structured metadata table.

from concurrent.futures import ThreadPoolExecutor
from pathlib import Path
import os
from tqdm import tqdm

# --- Configurable knobs -------------------------------------------------------

# Max threads to use for S3 + XML parsing
# You can tune this if needed; start here.
MAX_WORKERS = min(32, (os.cpu_count() or 4) * 5)

# Where to persist outputs
OUTPUT_DIR = Path("data/interim")
LOG_DIR = Path("logs")

OUTPUT_DIR.mkdir(parents=True, exist_ok=True)
LOG_DIR.mkdir(parents=True, exist_ok=True)

# --- Worker function ----------------------------------------------------------

def _process_index_row(row_dict):
    """
    Worker for a single trial:
    - Fetch & parse XML from S3
    - Attach index metadata
    - Return either a parsed record or an error description
    """
    s3_key = row_dict["s3_key"]

    try:
        parsed = parse_xml_from_s3(s3_key)
    except Exception as e:
        return {
            "ok": False,
            "s3_key": s3_key,
            "error_type": type(e).__name__,
            "error_message": str(e),
        }

    if not parsed:
        # parse_xml_from_s3 returned None / empty
        return {
            "ok": False,
            "s3_key": s3_key,
            "error_type": "EmptyParse",
            "error_message": "parse_xml_from_s3 returned None/empty",
        }

    # Attach S3 index metadata
    parsed["size_bytes"]    = row_dict["size_bytes"]
    parsed["last_modified"] = row_dict["last_modified"]
    parsed["folder"]        = row_dict["folder"]
    parsed["s3_key"]        = s3_key

    return {"ok": True, "record": parsed}

# --- Main concurrent extraction -----------------------------------------------

# Convert index DataFrame to a list of row dicts for fast iteration
rows = df_index.to_dict("records")

records = []
error_records = []

log(
    f"Starting HIGH-PERFORMANCE metadata extraction for "
    f"{len(rows):,} trials using up to {MAX_WORKERS} threads..."
)

with ThreadPoolExecutor(max_workers=MAX_WORKERS) as executor:
    # executor.map gives us a generator of results; wrap with tqdm for progress
    for res in tqdm(executor.map(_process_index_row, rows), total=len(rows)):
        if res["ok"]:
            records.append(res["record"])
        else:
            error_records.append({
                "s3_key":        res["s3_key"],
                "error_type":    res["error_type"],
                "error_message": res["error_message"],
            })

log(
    f"Extraction loop complete. Parsed {len(records):,} records. "
    f"Encountered {len(error_records):,} errors."
)

# --- Build final DataFrame + persist artifacts --------------------------------

all_records = pd.DataFrame(records)

metadata_path = OUTPUT_DIR / "clinical_trials_metadata.parquet"
all_records.to_parquet(metadata_path, index=False)
log(f"Wrote metadata table to {metadata_path} with shape {all_records.shape}")

if error_records:
    df_errors = pd.DataFrame(error_records)
    error_path = LOG_DIR / "clinical_trials_ingestion_errors.csv"
    df_errors.to_csv(error_path, index=False)
    log(
        f"Wrote error log to {error_path} "
        f"with {len(df_errors):,} failed trials."
    )
else:
    log("No errors encountered during metadata extraction.")

log("Metadata extraction finished.")
all_records.head()

Starting HIGH-PERFORMANCE metadata extraction for 557,292 trials using up to 10 threads...


100%|███████████████████████████████████████████████████████████████████████| 557292/557292 [2:06:28<00:00, 73.44it/s]

Extraction loop complete. Parsed 557,292 records. Encountered 0 errors.





ImportError: Unable to find a usable engine; tried using: 'pyarrow', 'fastparquet'.
A suitable version of pyarrow or fastparquet is required for parquet support.
Trying to import the above resulted in these errors:
 - Missing optional dependency 'pyarrow'. pyarrow is required for parquet support. Use pip or conda to install pyarrow.
 - Missing optional dependency 'fastparquet'. fastparquet is required for parquet support. Use pip or conda to install fastparquet.

### Cell 7 — High-Performance Extraction of Trial-Level Metadata

This cell scales the ingestion process from a small smoke test to the full corpus of clinical-trial XMLs. Starting from `df_index`, it converts each index row into a plain Python dictionary and uses a `ThreadPoolExecutor` to spread the work across up to `MAX_WORKERS` threads (10 in this run). Each worker calls `_process_index_row`, which in turn invokes `parse_xml_from_s3` to download and parse a single XML file, then enriches the parsed record with S3-derived metadata (`size_bytes`, `last_modified`, `folder`, and `s3_key`). Successfully parsed records are appended to the `records` list, while any failures are captured in `error_records` with the S3 key, error type, and error message. A `tqdm` progress bar provides real-time feedback on progress across all ~557k trials. Once the executor completes, the cell logs how many trials were parsed successfully versus how many encountered errors, assembles the collected records into the `all_records` DataFrame, and attempts to write it to Parquet. In this run the extraction itself succeeded (557,292 records, 0 errors, a little over two hours), but the `to_parquet` call failed with an `ImportError` because neither `pyarrow` nor `fastparquet` was installed. `all_records` was already in memory at that point, and Cells 7b through 7d deal with persisting it. This cell is the core ingestion engine of the notebook, transforming a large XML corpus in S3 into a structured, analysis-ready table that can be persisted and reused by downstream notebooks.
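
`executor.map` submits every task up front, and in the cell above nothing is persisted until the whole loop has finished, so a late failure puts hours of work at risk. One way to reduce that exposure is to process the index in batches and hand each batch to a callback that can write partial results. The sketch below illustrates the idea with hypothetical names (`run_in_batches`, `on_batch`) and a trivial stand-in worker; it is not the notebook's implementation.

```python
from concurrent.futures import ThreadPoolExecutor


def run_in_batches(worker, items, batch_size, max_workers, on_batch):
    """Apply `worker` to `items` concurrently, one batch at a time.

    After each batch, `on_batch(batch_number, results)` is called so the
    caller can persist partial results (for example, one file per batch).
    Returns the total number of items processed.
    """
    processed = 0
    with ThreadPoolExecutor(max_workers=max_workers) as executor:
        for number, start in enumerate(range(0, len(items), batch_size)):
            batch = items[start:start + batch_size]
            # executor.map yields results in the same order as the inputs
            results = list(executor.map(worker, batch))
            on_batch(number, results)
            processed += len(results)
    return processed


# Stand-in worker and callback; in the notebook these would be
# _process_index_row and a function that writes each batch to disk.
collected = []
total = run_in_batches(
    worker=lambda x: x * 2,
    items=list(range(5)),
    batch_size=2,
    max_workers=3,
    on_batch=lambda number, results: collected.append((number, results)),
)
print(total, collected)
```

The trade-off is a short idle period at the end of each batch while the slowest task finishes, which is usually small relative to the batch itself.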


In [10]:
# === Cell 7b: Persist backup CSV of trial metadata (no extra dependencies) ===
#
# Purpose:
#   - Save a durable copy of the extracted trial metadata using only core pandas.
#   - This does NOT require pyarrow or fastparquet.
#
# Assumes:
#   - `all_records` is already defined (from Cell 7) and is a pandas DataFrame
#     with ~557k rows of parsed trial metadata.

from pathlib import Path

# Ensure the interim data directory exists
OUTPUT_DIR = Path("data/interim")
OUTPUT_DIR.mkdir(parents=True, exist_ok=True)

# Choose a compressed CSV filename
csv_path = OUTPUT_DIR / "clinical_trials_metadata.csv.gz"

# Write the DataFrame as a compressed CSV
#   - compression="gzip" keeps file size manageable
#   - index=False avoids writing the DataFrame index as a column
all_records.to_csv(csv_path, index=False, compression="gzip")

log(f"[Cell 7b] Backup CSV written to {csv_path} with shape {all_records.shape}")
all_records.head()


[Cell 7b] Backup CSV written to data/interim/clinical_trials_metadata.csv.gz with shape (557292, 14)


Unnamed: 0,nct_id,brief_title,official_title,overall_status,phase,conditions,interventions,enrollment,location_countries,lead_sponsor,size_bytes,last_modified,folder,s3_key
0,NCT00000102,Congenital Adrenal Hyperplasia: Calcium Channe...,,Completed,Phase 1/Phase 2,[Congenital Adrenal Hyperplasia],[Nifedipine],,[United States],National Center for Research Resources (NCRR),4217,2025-11-14 10:58:29+00:00,NCT0000xxxx,clinical-trials-data/raw/NCT0000xxxx/NCT000001...
1,NCT00000104,Does Lead Burden Alter Neuropsychological Deve...,,Completed,,[Lead Poisoning],[ERP measures of attention and memory],,[United States],National Center for Research Resources (NCRR),4357,2025-11-14 10:58:29+00:00,NCT0000xxxx,clinical-trials-data/raw/NCT0000xxxx/NCT000001...
2,NCT00000105,Vaccination With Tetanus and KLH to Assess Imm...,Vaccination With Tetanus Toxoid and Keyhole Li...,Terminated,,[Cancer],"[Intracel KLH Vaccine, Biosyn KLH, Montanide I...",112.0,[United States],"Masonic Cancer Center, University of Minnesota",11623,2025-11-14 10:58:29+00:00,NCT0000xxxx,clinical-trials-data/raw/NCT0000xxxx/NCT000001...
3,NCT00000106,41.8 Degree Centigrade Whole Body Hyperthermia...,,Unknown status,,[Rheumatic Diseases],[Whole body hyperthermia unit],,[United States],National Center for Research Resources (NCRR),5572,2025-11-14 10:58:29+00:00,NCT0000xxxx,clinical-trials-data/raw/NCT0000xxxx/NCT000001...
4,NCT00000107,Body Water Content in Cyanotic Congenital Hear...,,Completed,,"[Heart Defects, Congenital]",[],,[United States],National Center for Research Resources (NCRR),3312,2025-11-14 10:58:29+00:00,NCT0000xxxx,clinical-trials-data/raw/NCT0000xxxx/NCT000001...


### Cell 7b — Persist Backup CSV of Trial Metadata (Engine-Agnostic)

This cell creates a durable, engine-agnostic backup of the full trial-metadata table produced in Cell 7. It first ensures that the `data/interim` directory exists, then writes the `all_records` DataFrame to a compressed CSV file named `clinical_trials_metadata.csv.gz`. Using `compression="gzip"` keeps the file size manageable while remaining highly interoperable across tools, and `index=False` prevents the pandas index from being stored as an extra column. A log message records both the output path and the final shape of the dataset, and a `.head()` preview confirms that the written structure matches expectations. This CSV artifact serves as a safety net and a universal interchange format, independent of any optional Parquet engines or library version issues.
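
One caveat when reading this backup later: CSV has no list type, so list-valued columns such as `conditions`, `interventions`, and `location_countries` are stored as the string form of the Python list and come back as plain strings. A minimal sketch for restoring them follows; `restore_list` is a hypothetical helper, and it assumes each cell holds either a Python list literal or an empty value.

```python
import ast


def restore_list(cell):
    """Convert a cell like "['a', 'b']" back into the list ['a', 'b'].

    Real lists pass through unchanged; empty strings and non-string
    values (such as NaN from an empty CSV cell) become an empty list.
    """
    if isinstance(cell, list):
        return cell
    if not isinstance(cell, str) or not cell.strip():
        return []
    value = ast.literal_eval(cell)  # safe evaluation of Python literals only
    return value if isinstance(value, list) else [value]


# Usage after reloading the CSV with pandas:
#   df = pd.read_csv(csv_path)
#   for column in ("conditions", "interventions", "location_countries"):
#       df[column] = df[column].map(restore_list)
print(restore_list("['Heart Defects, Congenital']"))
```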


In [11]:
# === Cell 7c: Install pyarrow parquet engine in current environment ===
#
# Purpose:
#   - Install the `pyarrow` library so that pandas can write Parquet files.
#   - This runs inside the currently active kernel, so `all_records` stays
#     in memory and the extraction does not need to be re-run. pip may
#     still advise restarting the kernel before the new package is usable.
#
# Notes:
#   - `%pip` is a Jupyter magic command; it ensures installation happens
#     in the same environment backing this notebook.

%pip install pyarrow

Collecting pyarrow
  Downloading pyarrow-22.0.0-cp311-cp311-manylinux_2_28_aarch64.whl.metadata (3.2 kB)
Downloading pyarrow-22.0.0-cp311-cp311-manylinux_2_28_aarch64.whl (45.0 MB)
Installing collected packages: pyarrow
Successfully installed pyarrow-22.0.0
Note: you may need to restart the kernel to use updated packages.


In [12]:
# === Cell 7d: Persist clinical trials metadata as Parquet ===
#
# Purpose:
#   - Save the same `all_records` DataFrame to a Parquet file for efficient
#     downstream analytics (smaller, faster reads than CSV).
#
# Assumes:
#   - `all_records` is still in memory from Cell 7.
#   - `pyarrow` is installed successfully from Cell 7c.
#   - `data/interim` was created in Cell 7b (it is re-created below if missing).

from pathlib import Path

OUTPUT_DIR = Path("data/interim")
OUTPUT_DIR.mkdir(parents=True, exist_ok=True)

metadata_path = OUTPUT_DIR / "clinical_trials_metadata.parquet"

# Write the DataFrame to Parquet using pyarrow (default engine once installed)
all_records.to_parquet(metadata_path, index=False)

log(f"[Cell 7d] Wrote Parquet metadata table to {metadata_path} with shape {all_records.shape}")
all_records.head()


ArrowKeyError: No type extension with name arrow.py_extension_type found

In [14]:
# === Cell 7d: Persist clinical trials metadata as Parquet using pyarrow directly ===
#
# Purpose:
#   - Avoid pandas' parquet engine (which is throwing ArrowKeyError with pyarrow).
#   - Use pyarrow directly to write the Parquet file from the in-memory DataFrame.
#
# Assumes:
#   - `all_records` is defined and populated from the successful extraction.
#   - `pyarrow` is already installed (via your earlier %pip install pyarrow).
#   - CSV backup already exists from Cell 7b.

from pathlib import Path
import pyarrow as pa
import pyarrow.parquet as pq

# Ensure output directory exists
OUTPUT_DIR = Path("data/interim")
OUTPUT_DIR.mkdir(parents=True, exist_ok=True)

metadata_path = OUTPUT_DIR / "clinical_trials_metadata.parquet"

# Convert pandas DataFrame -> pyarrow Table
# preserve_index=False so we do not store the DataFrame index as a column
table = pa.Table.from_pandas(all_records, preserve_index=False)

# Write the pyarrow Table directly to Parquet
pq.write_table(table, metadata_path)

log(f"[Cell 7d] Wrote Parquet metadata table to {metadata_path} with shape {all_records.shape}")
all_records.head()


[Cell 7d] Wrote Parquet metadata table to data/interim/clinical_trials_metadata.parquet with shape (557292, 14)


Unnamed: 0,nct_id,brief_title,official_title,overall_status,phase,conditions,interventions,enrollment,location_countries,lead_sponsor,size_bytes,last_modified,folder,s3_key
0,NCT00000102,Congenital Adrenal Hyperplasia: Calcium Channe...,,Completed,Phase 1/Phase 2,[Congenital Adrenal Hyperplasia],[Nifedipine],,[United States],National Center for Research Resources (NCRR),4217,2025-11-14 10:58:29+00:00,NCT0000xxxx,clinical-trials-data/raw/NCT0000xxxx/NCT000001...
1,NCT00000104,Does Lead Burden Alter Neuropsychological Deve...,,Completed,,[Lead Poisoning],[ERP measures of attention and memory],,[United States],National Center for Research Resources (NCRR),4357,2025-11-14 10:58:29+00:00,NCT0000xxxx,clinical-trials-data/raw/NCT0000xxxx/NCT000001...
2,NCT00000105,Vaccination With Tetanus and KLH to Assess Imm...,Vaccination With Tetanus Toxoid and Keyhole Li...,Terminated,,[Cancer],"[Intracel KLH Vaccine, Biosyn KLH, Montanide I...",112.0,[United States],"Masonic Cancer Center, University of Minnesota",11623,2025-11-14 10:58:29+00:00,NCT0000xxxx,clinical-trials-data/raw/NCT0000xxxx/NCT000001...
3,NCT00000106,41.8 Degree Centigrade Whole Body Hyperthermia...,,Unknown status,,[Rheumatic Diseases],[Whole body hyperthermia unit],,[United States],National Center for Research Resources (NCRR),5572,2025-11-14 10:58:29+00:00,NCT0000xxxx,clinical-trials-data/raw/NCT0000xxxx/NCT000001...
4,NCT00000107,Body Water Content in Cyanotic Congenital Hear...,,Completed,,"[Heart Defects, Congenital]",[],,[United States],National Center for Research Resources (NCRR),3312,2025-11-14 10:58:29+00:00,NCT0000xxxx,clinical-trials-data/raw/NCT0000xxxx/NCT000001...


### Cell 7d — Persist Clinical-Trials Metadata as Parquet Using PyArrow Directly

This cell writes the full clinical-trials metadata table to disk in Parquet format using `pyarrow` directly, sidestepping pandas’ `to_parquet` path, which raised `ArrowKeyError` in the previous attempt and can be fragile for certain pandas/pyarrow version combinations. It first ensures that the `data/interim` directory exists, then converts the `all_records` DataFrame into a `pyarrow.Table` with `preserve_index=False` so the pandas index is not stored as an extra column. The resulting table is written to `clinical_trials_metadata.parquet` via `pq.write_table`. A log message confirms the output path and the shape of the dataset, and a `.head()` call provides a quick visual sanity check. Together with the CSV backup from Cell 7b, this Parquet artifact becomes the primary, columnar source of truth for downstream analysis and scenario-building notebooks.
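
Because this kind of failure depends on which library builds are installed together, it can help to record the versions alongside the artifact. The following is a minimal sketch using only the standard library; `installed_versions` is a hypothetical helper name, not something defined elsewhere in this notebook.

```python
from importlib.metadata import PackageNotFoundError, version


def installed_versions(packages=("pandas", "pyarrow")):
    """Return {package: version string, or None if the package is not installed}."""
    found = {}
    for name in packages:
        try:
            found[name] = version(name)
        except PackageNotFoundError:
            found[name] = None
    return found


versions = installed_versions()
print(versions)
```

Logging this dictionary next to the Parquet write makes it easier to reproduce or diagnose the environment later.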


In [15]:
# === Cell 7e: Verify Parquet round-trip ===
#
# Purpose:
#   - Sanity-check that the Parquet file we wrote can be read back without issues.
#   - This is just a small read and head() for confidence.

from pathlib import Path
import pyarrow.parquet as pq

metadata_path = Path("data/interim/clinical_trials_metadata.parquet")

# Read Parquet via pyarrow
table = pq.read_table(metadata_path)
df_verify = table.to_pandas()

log(f"[Cell 7e] Reloaded Parquet has shape {df_verify.shape}")
df_verify.head()


[Cell 7e] Reloaded Parquet has shape (557292, 14)


Unnamed: 0,nct_id,brief_title,official_title,overall_status,phase,conditions,interventions,enrollment,location_countries,lead_sponsor,size_bytes,last_modified,folder,s3_key
0,NCT00000102,Congenital Adrenal Hyperplasia: Calcium Channe...,,Completed,Phase 1/Phase 2,[Congenital Adrenal Hyperplasia],[Nifedipine],,[United States],National Center for Research Resources (NCRR),4217,2025-11-14 10:58:29+00:00,NCT0000xxxx,clinical-trials-data/raw/NCT0000xxxx/NCT000001...
1,NCT00000104,Does Lead Burden Alter Neuropsychological Deve...,,Completed,,[Lead Poisoning],[ERP measures of attention and memory],,[United States],National Center for Research Resources (NCRR),4357,2025-11-14 10:58:29+00:00,NCT0000xxxx,clinical-trials-data/raw/NCT0000xxxx/NCT000001...
2,NCT00000105,Vaccination With Tetanus and KLH to Assess Imm...,Vaccination With Tetanus Toxoid and Keyhole Li...,Terminated,,[Cancer],"[Intracel KLH Vaccine, Biosyn KLH, Montanide I...",112.0,[United States],"Masonic Cancer Center, University of Minnesota",11623,2025-11-14 10:58:29+00:00,NCT0000xxxx,clinical-trials-data/raw/NCT0000xxxx/NCT000001...
3,NCT00000106,41.8 Degree Centigrade Whole Body Hyperthermia...,,Unknown status,,[Rheumatic Diseases],[Whole body hyperthermia unit],,[United States],National Center for Research Resources (NCRR),5572,2025-11-14 10:58:29+00:00,NCT0000xxxx,clinical-trials-data/raw/NCT0000xxxx/NCT000001...
4,NCT00000107,Body Water Content in Cyanotic Congenital Hear...,,Completed,,"[Heart Defects, Congenital]",[],,[United States],National Center for Research Resources (NCRR),3312,2025-11-14 10:58:29+00:00,NCT0000xxxx,clinical-trials-data/raw/NCT0000xxxx/NCT000001...


In [16]:
# === Cell 8a: Canonical in-memory metadata DataFrame ===
#
# Purpose:
#   - Establish a single, clearly named DataFrame (`trials_meta`) that
#     downstream notebooks will expect.
#   - Option 1: If `df_verify` exists from Cell 7e, base it on that.
#   - Option 2: If you're running this notebook fresh, load from Parquet.

from pathlib import Path
import pyarrow.parquet as pq

metadata_path = Path("data/interim/clinical_trials_metadata.parquet")

if "df_verify" in globals():
    # Reuse the DataFrame we just round-tripped from Parquet
    trials_meta = df_verify.copy()
    log(f"[Cell 8a] trials_meta created from df_verify with shape {trials_meta.shape}")
else:
    # Fresh load from Parquet if 7e wasn't run in this session
    table = pq.read_table(metadata_path)
    trials_meta = table.to_pandas()
    log(f"[Cell 8a] trials_meta loaded from {metadata_path} with shape {trials_meta.shape}")

trials_meta.head()


[Cell 8a] trials_meta created from df_verify with shape (557292, 14)


Unnamed: 0,nct_id,brief_title,official_title,overall_status,phase,conditions,interventions,enrollment,location_countries,lead_sponsor,size_bytes,last_modified,folder,s3_key
0,NCT00000102,Congenital Adrenal Hyperplasia: Calcium Channe...,,Completed,Phase 1/Phase 2,[Congenital Adrenal Hyperplasia],[Nifedipine],,[United States],National Center for Research Resources (NCRR),4217,2025-11-14 10:58:29+00:00,NCT0000xxxx,clinical-trials-data/raw/NCT0000xxxx/NCT000001...
1,NCT00000104,Does Lead Burden Alter Neuropsychological Deve...,,Completed,,[Lead Poisoning],[ERP measures of attention and memory],,[United States],National Center for Research Resources (NCRR),4357,2025-11-14 10:58:29+00:00,NCT0000xxxx,clinical-trials-data/raw/NCT0000xxxx/NCT000001...
2,NCT00000105,Vaccination With Tetanus and KLH to Assess Imm...,Vaccination With Tetanus Toxoid and Keyhole Li...,Terminated,,[Cancer],"[Intracel KLH Vaccine, Biosyn KLH, Montanide I...",112.0,[United States],"Masonic Cancer Center, University of Minnesota",11623,2025-11-14 10:58:29+00:00,NCT0000xxxx,clinical-trials-data/raw/NCT0000xxxx/NCT000001...
3,NCT00000106,41.8 Degree Centigrade Whole Body Hyperthermia...,,Unknown status,,[Rheumatic Diseases],[Whole body hyperthermia unit],,[United States],National Center for Research Resources (NCRR),5572,2025-11-14 10:58:29+00:00,NCT0000xxxx,clinical-trials-data/raw/NCT0000xxxx/NCT000001...
4,NCT00000107,Body Water Content in Cyanotic Congenital Hear...,,Completed,,"[Heart Defects, Congenital]",[],,[United States],National Center for Research Resources (NCRR),3312,2025-11-14 10:58:29+00:00,NCT0000xxxx,clinical-trials-data/raw/NCT0000xxxx/NCT000001...


### Cell 8a — Define `trials_meta` as the Canonical Metadata Table

This cell establishes `trials_meta` as the canonical in-memory DataFrame for all subsequent analysis in the notebook. If the Parquet round-trip check in Cell 7e has already been run, the cell simply copies `df_verify` into `trials_meta`, reusing the DataFrame that was just read back from `clinical_trials_metadata.parquet`. If not, it falls back to reading the Parquet file directly from `data/interim/clinical_trials_metadata.parquet` and converting it to a pandas DataFrame. In both cases, the cell logs the final shape and displays a small preview of `trials_meta`. Standardizing on this single, clearly named DataFrame makes later cells simpler and ensures that downstream notebooks can expect a consistent schema and naming convention when they load the clinical-trials metadata from disk.
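
For downstream notebooks, the same load logic could be wrapped in a small function. The sketch below is one possible shape, not the notebook's own code: `load_trials_meta` is a hypothetical name, and the fallback to the gzipped CSV backup is an added assumption (Cell 8a itself only reads Parquet).

```python
from pathlib import Path

import pandas as pd


def load_trials_meta(interim_dir="data/interim"):
    """Hypothetical helper: load the metadata table, preferring Parquet."""
    interim_dir = Path(interim_dir)
    parquet_path = interim_dir / "clinical_trials_metadata.parquet"
    csv_path = interim_dir / "clinical_trials_metadata.csv.gz"

    if parquet_path.exists():
        # Imported lazily so the CSV fallback still works without pyarrow.
        import pyarrow.parquet as pq

        return pq.read_table(parquet_path).to_pandas()

    if csv_path.exists():
        # Assumed fallback: the engine-agnostic backup written in Cell 7b.
        return pd.read_csv(csv_path, compression="gzip")

    raise FileNotFoundError(f"No metadata artifact found under {interim_dir}")
```

A downstream notebook would then call `trials_meta = load_trials_meta()`. Note that the CSV fallback does not preserve dtypes or list-valued columns the way Parquet does, so it is best treated as a last resort.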


In [17]:
# === Cell 8b: Structural + missingness sanity checks ===
#
# Purpose:
#   - Confirm basic properties of the metadata table:
#       * Row/column counts
#       * Column dtypes
#       * Simple missingness overview
#   - This helps catch any obvious ingestion anomalies before we build
#     optimization scenarios on top of this data.

log("[Cell 8b] Basic info for trials_meta:")

# Shape and columns
log(f"Shape: {trials_meta.shape}")
log(f"Columns: {list(trials_meta.columns)}")

# Quick dtypes summary
dtypes_summary = trials_meta.dtypes.to_frame(name="dtype")
display(dtypes_summary)

# Simple missingness summary (fraction of nulls per column)
missing_frac = trials_meta.isna().mean().sort_values(ascending=False)
missing_summary = missing_frac.to_frame(name="missing_fraction")
display(missing_summary)

log("[Cell 8b] Completed structural + missingness checks.")


[Cell 8b] Basic info for trials_meta:
Shape: (557292, 14)
Columns: ['nct_id', 'brief_title', 'official_title', 'overall_status', 'phase', 'conditions', 'interventions', 'enrollment', 'location_countries', 'lead_sponsor', 'size_bytes', 'last_modified', 'folder', 's3_key']


Unnamed: 0,dtype
nct_id,object
brief_title,object
official_title,object
overall_status,object
phase,object
conditions,object
interventions,object
enrollment,object
location_countries,object
lead_sponsor,object


Unnamed: 0,missing_fraction
phase,0.236404
official_title,0.017921
enrollment,0.012652
nct_id,0.0
brief_title,0.0
overall_status,0.0
conditions,0.0
interventions,0.0
location_countries,0.0
lead_sponsor,0.0


[Cell 8b] Completed structural + missingness checks.


In [18]:
# === Cell 8c: Domain sanity checks on key fields ===
#
# Purpose:
#   - Confirm 1 row per trial (`nct_id` uniqueness).
#   - Inspect distributions of high-level fields used later in scenarios
#     (e.g., phase, overall_status, lead_sponsor).
#
# Assumes:
#   - Column names include at least: 'nct_id', 'overall_status', 'phase', 'lead_sponsor'.

# 1) Check uniqueness of nct_id
if "nct_id" in trials_meta.columns:
    total_nct = trials_meta["nct_id"].shape[0]
    unique_nct = trials_meta["nct_id"].nunique(dropna=True)
    log(f"[Cell 8c] nct_id: total rows={total_nct:,}, unique={unique_nct:,}")

    if total_nct != unique_nct:
        log("[Cell 8c] WARNING: nct_id is not unique; multiple rows per trial detected.")
else:
    log("[Cell 8c] WARNING: 'nct_id' column not found in trials_meta.")

# 2) Distribution of overall_status
if "overall_status" in trials_meta.columns:
    log("[Cell 8c] Top overall_status values:")
    display(trials_meta["overall_status"].value_counts(dropna=False).head(20))
else:
    log("[Cell 8c] WARNING: 'overall_status' column not found in trials_meta.")

# 3) Distribution of phase
if "phase" in trials_meta.columns:
    log("[Cell 8c] Top phase values:")
    display(trials_meta["phase"].value_counts(dropna=False).head(20))
else:
    log("[Cell 8c] WARNING: 'phase' column not found in trials_meta.")

# 4) Top lead sponsors (helpful for later scenario design)
if "lead_sponsor" in trials_meta.columns:
    log("[Cell 8c] Top lead_sponsor values:")
    display(trials_meta["lead_sponsor"].value_counts(dropna=False).head(20))
else:
    log("[Cell 8c] WARNING: 'lead_sponsor' column not found in trials_meta.")

log("[Cell 8c] Completed domain sanity checks.")


[Cell 8c] nct_id: total rows=557,292, unique=557,292
[Cell 8c] Top overall_status values:


overall_status
Completed                    305099
Unknown status                83373
Recruiting                    65912
Terminated                    32324
Not yet recruiting            25016
Active, not recruiting        21338
Withdrawn                     15762
Enrolling by invitation        4824
Suspended                      1679
Withheld                        947
No longer available             503
Available                       254
Approved for marketing          233
Temporarily not available        28
Name: count, dtype: int64

[Cell 8c] Top phase values:


phase
N/A                213089
None               131746
Phase 2             62166
Phase 1             46196
Phase 3             40582
Phase 4             34286
Phase 1/Phase 2     16062
Phase 2/Phase 3      7260
Early Phase 1        5905
Name: count, dtype: int64

[Cell 8c] Top lead_sponsor values:


lead_sponsor
Assiut University                                                4348
Cairo University                                                 4125
GlaxoSmithKline                                                  3569
National Cancer Institute (NCI)                                  3513
AstraZeneca                                                      3347
Assistance Publique - Hôpitaux de Paris                          3329
Pfizer                                                           3211
Mayo Clinic                                                      3110
M.D. Anderson Cancer Center                                      2925
Novartis Pharmaceuticals                                         2562
Massachusetts General Hospital                                   2464
National Taiwan University Hospital                              2454
National Institute of Allergy and Infectious Diseases (NIAID)    2419
Boehringer Ingelheim                                             2247
Merck S

[Cell 8c] Completed domain sanity checks.


In [19]:
# === Cell 8d: Persist summary tables for quick reference ===
#
# Purpose:
#   - Save lightweight summary tables derived from `trials_meta` so that
#     other notebooks (or slide decks) can reuse them without recomputing.
#   - Examples:
#       * overall_status counts
#       * phase counts
#       * top lead sponsors
#
# Output:
#   - data/summary/clinical_trials_overall_status_counts.csv
#   - data/summary/clinical_trials_phase_counts.csv
#   - data/summary/clinical_trials_lead_sponsor_top50.csv

from pathlib import Path

SUMMARY_DIR = Path("data/summary")
SUMMARY_DIR.mkdir(parents=True, exist_ok=True)

# 1) overall_status counts
status_counts = (
    trials_meta["overall_status"]
    .value_counts(dropna=False)
    .rename_axis("overall_status")
    .to_frame(name="count")
)
status_path = SUMMARY_DIR / "clinical_trials_overall_status_counts.csv"
status_counts.to_csv(status_path)
log(f"[Cell 8d] Wrote overall_status summary to {status_path}")

# 2) phase counts
phase_counts = (
    trials_meta["phase"]
    .value_counts(dropna=False)
    .rename_axis("phase")
    .to_frame(name="count")
)
phase_path = SUMMARY_DIR / "clinical_trials_phase_counts.csv"
phase_counts.to_csv(phase_path)
log(f"[Cell 8d] Wrote phase summary to {phase_path}")

# 3) top lead sponsors (limit to top 50 for readability)
sponsor_counts = (
    trials_meta["lead_sponsor"]
    .value_counts(dropna=False)
    .head(50)
    .rename_axis("lead_sponsor")
    .to_frame(name="count")
)
sponsor_path = SUMMARY_DIR / "clinical_trials_lead_sponsor_top50.csv"
sponsor_counts.to_csv(sponsor_path)
log(f"[Cell 8d] Wrote top-50 lead_sponsor summary to {sponsor_path}")

status_counts.head(), phase_counts.head(), sponsor_counts.head()


[Cell 8d] Wrote overall_status summary to data/summary/clinical_trials_overall_status_counts.csv
[Cell 8d] Wrote phase summary to data/summary/clinical_trials_phase_counts.csv
[Cell 8d] Wrote top-50 lead_sponsor summary to data/summary/clinical_trials_lead_sponsor_top50.csv


(                     count
 overall_status            
 Completed           305099
 Unknown status       83373
 Recruiting           65912
 Terminated           32324
 Not yet recruiting   25016,
           count
 phase          
 N/A      213089
 None     131746
 Phase 2   62166
 Phase 1   46196
 Phase 3   40582,
                                  count
 lead_sponsor                          
 Assiut University                 4348
 Cairo University                  4125
 GlaxoSmithKline                   3569
 National Cancer Institute (NCI)   3513
 AstraZeneca                       3347)

### Cell 8d — Persist High-Level Summary Tables for Reuse

This cell derives and saves lightweight summary tables that capture the overall shape of the clinical-trial portfolio without needing to reload the full 557k-row dataset. Using `trials_meta`, it computes frequency tables for `overall_status`, `phase`, and the top 50 `lead_sponsor` values, then writes each result to a CSV file under `data/summary/`. These summaries make it easy to reuse portfolio-level statistics in other notebooks, reports, or slide decks without recomputing them from scratch. The log messages record the output file paths, and the displayed heads provide a quick visual check that the aggregated counts look reasonable.
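
One detail matters when these summaries are read back. The `phase` counts contain both the literal string `N/A` and genuinely missing values, and pandas treats `N/A` as a missing-value marker by default when parsing CSV. The sketch below uses inline toy counts (not the real figures) to show the difference. The column layout is an assumption that mirrors what `phase_counts.to_csv(...)` would produce.

```python
import io

import pandas as pd

# Toy CSV mimicking the phase summary: a literal "N/A", a missing value
# (written as an empty field), and a regular phase label.
csv_text = "phase,count\nN/A,3\n,2\nPhase 2,1\n"

# Default parsing: both "N/A" and the empty field become NaN.
default = pd.read_csv(io.StringIO(csv_text))

# Stricter parsing: only the empty field is treated as missing.
strict = pd.read_csv(
    io.StringIO(csv_text),
    keep_default_na=False,
    na_values=[""],
)

print(default["phase"].isna().sum(), strict["phase"].isna().sum())
```

The same consideration applies to the full CSV backup from Cell 7b. The Parquet artifact is not affected, because it stores nulls explicitly.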


In [20]:
# === Cell 8e: Create a small sample subset for rapid experimentation ===
#
# Purpose:
#   - Create a small, stable subset of trials (e.g., 10,000 rows) that we can
#     use in downstream notebooks for quick iteration without always loading
#     the full ~557k table.
#
# Notes:
#   - We use .sample(..., random_state=42) for reproducibility.
#   - We write both CSV (gzipped) and Parquet versions.

from pathlib import Path

SAMPLE_DIR = Path("data/interim")
SAMPLE_DIR.mkdir(parents=True, exist_ok=True)

sample_n = 10_000  # adjust as desired

if len(trials_meta) <= sample_n:
    trials_sample = trials_meta.copy()
    log(f"[Cell 8e] Dataset has {len(trials_meta):,} rows; using full table as 'sample'.")
else:
    trials_sample = trials_meta.sample(n=sample_n, random_state=42)
    log(f"[Cell 8e] Drew a reproducible sample of {sample_n:,} rows from {len(trials_meta):,} total.")

# Write CSV.gz
sample_csv_path = SAMPLE_DIR / "clinical_trials_metadata_sample.csv.gz"
trials_sample.to_csv(sample_csv_path, index=False, compression="gzip")
log(f"[Cell 8e] Wrote sample CSV to {sample_csv_path}")

# Write Parquet via pyarrow directly (bypassing pandas.to_parquet)
import pyarrow as pa
import pyarrow.parquet as pq

sample_parquet_path = SAMPLE_DIR / "clinical_trials_metadata_sample.parquet"
sample_table = pa.Table.from_pandas(trials_sample, preserve_index=False)
pq.write_table(sample_table, sample_parquet_path)
log(f"[Cell 8e] Wrote sample Parquet to {sample_parquet_path}")

trials_sample.head()


[Cell 8e] Drew a reproducible sample of 10,000 rows from 557,292 total.
[Cell 8e] Wrote sample CSV to data/interim/clinical_trials_metadata_sample.csv.gz
[Cell 8e] Wrote sample Parquet to data/interim/clinical_trials_metadata_sample.parquet


Unnamed: 0,nct_id,brief_title,official_title,overall_status,phase,conditions,interventions,enrollment,location_countries,lead_sponsor,size_bytes,last_modified,folder,s3_key
218080,NCT02811120,PRIME Follow up - Quadri Meningo Vacinees,An Observational Follow up Study of a Phase II...,Completed,,[Meningococcal Disease],[venepuncture only],57,[],Public Health England,10465,2025-11-17 14:27:07+00:00,NCT0281xxxx,clinical-trials-data/raw/NCT0281xxxx/NCT028111...
274981,NCT03552926,Constitution of a Clinico-radiological Databas...,Constitution of a Clinico-radiological Databas...,Recruiting,,[Lacunar Strokes],[],500,[France],Assistance Publique - Hôpitaux de Paris,9624,2025-11-17 13:24:50+00:00,NCT0355xxxx,clinical-trials-data/raw/NCT0355xxxx/NCT035529...
284107,NCT03671863,Children Born With Club Feet,Children Born With Club Feet: Ultrasound Diagn...,Completed,,[Clubfoot],"[Invasive analysis (caryotype, CGH array), Pre...",219,[France],"University Hospital, Montpellier",8300,2025-11-18 08:37:55+00:00,NCT0367xxxx,clinical-trials-data/raw/NCT0367xxxx/NCT036718...
508776,NCT06597084,Anti-epileptogenic Effects of Eslicarbazepine ...,Prevention of Epilepsy in Stroke Patients at H...,Completed,Phase 2,[Post Stroke Epilepsy],"[ESL 800 mg, Placebo]",129,"[Austria, France, Germany, Israel, Italy, Port...",Bial - Portela C S.A.,233193,2025-11-14 15:58:13+00:00,NCT0659xxxx,clinical-trials-data/raw/NCT0659xxxx/NCT065970...
285400,NCT03688685,A Clinical Study to Evaluate CAD-1883 in Essen...,A Phase 2a Open-Label Study to Evaluate the Sa...,Completed,Phase 2,[Essential Tremor],[CAD-1883],25,[United States],Cadent Therapeutics,29053,2025-11-16 21:43:28+00:00,NCT0368xxxx,clinical-trials-data/raw/NCT0368xxxx/NCT036886...


### Cell 8e — Create a Reproducible Sample Subset for Fast Iteration

This cell builds a smaller, reproducible subset of `trials_meta` for rapid experimentation in downstream notebooks. If the full dataset has more than `sample_n` rows (default 10,000), it draws a random sample of that size using a fixed `random_state`, so the same trials are selected on every run as long as the input table is unchanged. If the dataset has `sample_n` rows or fewer, it simply uses the entire table as the “sample.” The resulting `trials_sample` DataFrame is then written to disk in two formats: a gzipped CSV (`clinical_trials_metadata_sample.csv.gz`) for maximal interoperability, and a Parquet file (`clinical_trials_metadata_sample.parquet`) written via pyarrow for efficient, columnar access. Log messages record both output paths and the sample size. This sampled subset provides a lightweight playground for feature engineering, scenario design, and QUBO construction without incurring the overhead of repeatedly scanning all 557k trials.
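
The reproducibility guarantee can be checked on a toy frame. The sketch below also shows an optional variation that is not part of Cell 8e: sorting by `nct_id` before sampling, so that the draw does not depend on the order in which rows were loaded. `stable_sample` is a hypothetical helper name.

```python
import pandas as pd

# Toy stand-in for `trials_meta` with synthetic IDs.
toy = pd.DataFrame({"nct_id": [f"NCT{i:08d}" for i in range(1000)]})

# Same frame + same random_state -> same rows.
first = toy.sample(n=10, random_state=42)
second = toy.sample(n=10, random_state=42)
same_rows = first["nct_id"].tolist() == second["nct_id"].tolist()

# A differently ordered copy of the same data.
shuffled = toy.sample(frac=1.0, random_state=0)


def stable_sample(df, n, seed=42):
    """Sort by nct_id first so the sample ignores the incoming row order."""
    return df.sort_values("nct_id").sample(n=n, random_state=seed)


ids_a = stable_sample(toy, 10)["nct_id"].tolist()
ids_b = stable_sample(shuffled, 10)["nct_id"].tolist()
print(same_rows, ids_a == ids_b)
```

Sorting adds a small cost but makes the sample robust to changes in extraction order, for example if the parallel extraction returns rows in a different sequence on a later run.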

### Final Summary and Next Steps

This notebook ingests the full ClinicalTrials.gov corpus from S3 and converts it into a clean, analysis-ready metadata table suitable for downstream optimization and quantum experiments.

**What this notebook accomplished:**

- Verified the AWS environment and S3 connectivity to the `quantum-clinical-optimization-us-west-2` bucket.
- Built a comprehensive S3 index (`df_index`) of all ClinicalTrials.gov XML files, including NCT IDs, folder groups, file sizes, and last-modified timestamps.
- Implemented robust, fault-tolerant XML parsing utilities and validated them on both a single trial and a random sample.
- Performed high-throughput, parallel extraction of trial-level metadata for all ~557k trials into a unified `all_records` / `trials_meta` DataFrame.
- Persisted the full metadata table to disk in multiple formats:
  - Engine-agnostic backup: `data/interim/clinical_trials_metadata.csv.gz`
  - Columnar primary artifact: `data/interim/clinical_trials_metadata.parquet`
- Verified Parquet round-trip integrity and established `trials_meta` as the canonical in-memory table.
- Ran structural and domain-level sanity checks (uniqueness of `nct_id`, distributions of `overall_status` and `phase`, and leading sponsors).
- Exported lightweight summary tables and a reproducible 10k-row sample to support fast iteration in later notebooks.

Together, these steps turn a large, heterogeneous XML corpus in S3 into a stable, well-understood foundation for analytics and optimization.

---

### Next Steps (Notebook 02 and Beyond)

The next notebook will build on `clinical_trials_metadata.parquet` to move from **raw metadata** to **concrete optimization scenarios**:

1. **Load and Filter Metadata**
   - Read `clinical_trials_metadata.parquet` (or the 10k-row sample).
   - Focus on one or more use cases, such as a specific:
     - Therapeutic area (e.g., oncology, cardiology),
     - Development phase (e.g., Phase 2 / 3),
     - Sponsor or portfolio slice.

2. **Feature Engineering for Scenario Design**
   - Derive features needed for site/trial selection scenarios:
     - Enrollment size and status,
     - Number and diversity of locations,
     - Sponsor and phase characteristics.
   - Optionally join additional S3-backed datasets (e.g., site performance or cost data).

3. **Define Optimization Scenarios**
   - Formalize a small set of decision problems, such as:
     - Selecting a subset of trials for a constrained portfolio.
     - Selecting sites or regions for a given protocol under budget and capacity limits.
   - Specify objectives (e.g., maximize expected enrollment speed or diversity) and hard/soft constraints.

4. **Construct QUBO Instances**
   - Encode each scenario as a binary decision vector (e.g., one variable per site or trial).
   - Translate objectives and constraints into a QUBO/Ising formulation.
   - Serialize each QUBO instance (e.g., JSON or NumPy-based format) and save it to S3 or `data/qubo_scenarios/`.

5. **Prepare for Quantum and Classical Solvers (Notebook 03)**
   - Define a small set of benchmark scenarios to send to:
     - Classical optimizers (exact or heuristic) as baselines.
     - Amazon Braket QAOA workflows on SV1 or other backends.
   - Record metrics for comparison (objective value, constraint violations, runtime, sampling profiles).
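
Steps 3 to 5 above can be sketched on a toy problem. The example below is an illustration under stated assumptions, not the formulation Notebook 02 will use: it selects exactly `k` trials to maximize a made-up score, encodes the cardinality constraint as a quadratic penalty, and solves the resulting QUBO by brute force as a classical baseline. All names (`build_selection_qubo`, `qubo_energy`, `brute_force`) and the scores are hypothetical.

```python
import itertools
import json


def build_selection_qubo(scores, k, penalty):
    """QUBO for: maximize sum(s_i * x_i) subject to sum(x_i) == k.

    Minimizes  -sum(s_i x_i) + penalty * (sum(x_i) - k)**2.
    Using x_i**2 == x_i, the penalty expands to a diagonal term
    penalty * (1 - 2k) per variable, a coupling 2 * penalty per pair,
    and a constant offset penalty * k**2.
    """
    n = len(scores)
    Q = {}
    for i in range(n):
        Q[(i, i)] = -scores[i] + penalty * (1 - 2 * k)
        for j in range(i + 1, n):
            Q[(i, j)] = 2 * penalty
    offset = penalty * k * k
    return Q, offset


def qubo_energy(Q, offset, x):
    """Evaluate the QUBO objective for a 0/1 assignment x."""
    return offset + sum(w * x[i] * x[j] for (i, j), w in Q.items())


def brute_force(Q, offset, n):
    """Exact classical baseline: enumerate all 2**n assignments."""
    return min(
        itertools.product((0, 1), repeat=n),
        key=lambda x: qubo_energy(Q, offset, x),
    )


scores = [5, 3, 8, 6]  # made-up per-trial scores
Q, offset = build_selection_qubo(scores, k=2, penalty=20)
best = brute_force(Q, offset, len(scores))

# JSON cannot use tuple keys, so terms are stored as [i, j, weight] triples.
payload = json.dumps(
    {"offset": offset, "terms": [[i, j, w] for (i, j), w in sorted(Q.items())]}
)
print(best, qubo_energy(Q, offset, best))
```

Brute force only scales to a few dozen variables, which is why it serves as a baseline for small benchmark scenarios rather than as a production solver. The penalty weight must exceed the largest gain from violating the constraint; here 20 is comfortably above any single score.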

At this point, Notebook 01 can be considered **complete**: it owns ingestion, indexing, and core metadata validation. Notebook 02 will focus on **scenario building and QUBO construction**, and Notebook 03 will handle **solver execution and comparative analysis**, including Braket-based quantum and quantum-inspired runs.
