This project was built with one self-imposed rule: I would not write or edit a single line of code myself.
The goal was to answer a simple question:
"Can someone with no coding experience build production-quality clinical software using only AI assistants?"
Every line of code in this repository was generated by AI coding assistants (Claude, GPT, Gemini). The human role was limited to:
- Describing requirements and desired outcomes
- Reviewing and approving AI-generated code
- Providing feedback and requesting corrections
- Testing functionality and reporting issues
As a result, this software:
- ❌ Is not robust: edge cases may not be handled
- ❌ Is likely full of bugs: limited systematic testing
- ❌ Has not been thoroughly tested: no formal QA process
- ❌ Is NOT fit for any production environment
- ❌ Should NOT be used for actual clinical trials or regulatory submissions
This project exists solely as:
- ✅ A demonstrator of AI-assisted development capabilities
- ✅ A reference implementation of USDM v4.0 extraction concepts
- ✅ An experiment in human-AI collaboration for software development
Use at your own risk. Contributions and improvements welcome.
Extract clinical protocol content into USDM v4.0 format
Protocol2USDM is an automated pipeline that extracts, validates, and structures clinical trial protocol content into data conformant to the CDISC USDM v4.0 model.
```bash
# Full extraction with SAP, sites, parallel execution
python main_v3.py input/trial/NCT04573309_Wilsons/NCT04573309_Wilsons_Protocol.pdf \
  --complete \
  --sap input/trial/NCT04573309_Wilsons/NCT04573309_Wilsons_SAP.pdf \
  --sites input/trial/NCT04573309_Wilsons/NCT04573309_Wilsons_sites.csv \
  --parallel \
  --model gemini-3-flash
```

This extracts the full protocol with the execution model, enriches entities with NCI terminology codes, and includes SAP analysis populations (with STATO mapping and ARS linkage) and the site list.
💡 Default Model: `gemini-3-flash-preview` (Gemini Flash 3) via Vertex AI. Other models (`claude-opus-4-5`, `claude-sonnet-4`, `chatgpt-5.2`, `gemini-2.5-pro`) are supported. The pipeline defaults to `--complete` mode when no specific phases are requested.
Gemini models must be accessed via Google Cloud Vertex AI (not AI Studio) to properly disable safety controls. The consumer API (AI Studio) may still block or restrict medical/clinical content even with safety settings disabled. Vertex AI allows the BLOCK_NONE safety setting, which is essential for clinical protocol extraction.
Required .env configuration for Gemini via Vertex AI:
```
GOOGLE_CLOUD_PROJECT=your-project-id
GOOGLE_CLOUD_LOCATION=us-central1   # or your preferred region
```

- `main_v3.py` - New refactored entry point with a clean phase registry pattern
- `pipeline/` module - Modular phase definitions with dependency-aware execution
- Parallel execution - Run independent phases concurrently with `--parallel`
- Default `--complete` mode - Full extraction when no specific phases are requested
- Pipeline optimized and tested with `gemini-3-flash-preview` as the default model
- Intelligent fallback to `gemini-2.5-pro` for SoA text extraction when needed
- Response validation with automatic retry logic (up to 2 retries)
- Stricter prompt guardrails for JSON format compliance
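The validate-and-retry behaviour described above can be sketched as a small loop. This is a sketch only: `call_model`, the prompt text, and the required-key check are illustrative stand-ins, not the pipeline's actual code.

```python
import json

MAX_RETRIES = 2  # matches the "up to 2 retries" described above

def extract_with_retry(call_model, prompt, required_keys):
    """Call an LLM and retry when the response is not valid JSON
    or is missing required top-level keys.

    `call_model` is a placeholder for a provider function that takes
    a prompt string and returns raw response text.
    """
    last_error = None
    for _attempt in range(1 + MAX_RETRIES):
        raw = call_model(prompt)
        try:
            data = json.loads(raw)
        except json.JSONDecodeError as exc:
            last_error = f"invalid JSON: {exc}"
        else:
            missing = [k for k in required_keys if k not in data]
            if not missing:
                return data
            last_error = f"missing keys: {missing}"
        # tighten the prompt before the next attempt (a guardrail sketch)
        prompt += f"\n\nPrevious response was rejected ({last_error}). Return ONLY valid JSON."
    raise ValueError(f"extraction failed after {MAX_RETRIES} retries: {last_error}")
```

The same pattern generalizes to any phase: parse, check the schema-level invariants you care about, and re-prompt with the rejection reason.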
- Time Anchors: Extract temporal reference points (VISIT/EVENT/CONCEPTUAL classification)
- Visit Windows: Timing tolerances → `Timing.windowLower`/`windowUpper` (ISO 8601)
- Subject State Machine: Subject flow → `TransitionRule` on `Encounter`
- Dosing Regimens: Drug administration → `Administration` entities
- Repetitions: Cycle-based patterns → `ScheduledActivityInstance` expansion
- Traversal Constraints: Subject journey → `Epoch`/`Encounter.previousId`/`nextId` chains
- Footnote Conditions: Conditional rules → `Condition` + `ScheduledDecisionInstance`
- Titration Schedules: Dose escalation → `StudyElement` with `TransitionRule`
v7.2 Promotion: Execution model data is now promoted to native USDM entities instead of extensions. Core USDM output is self-sufficient without parsing 11_execution_model.json.
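As an illustration of what "promotion to native USDM entities" means in practice, a visit window can be carried directly on a `Timing` entity rather than in an extension. The helper below is not part of the pipeline; it only shows the field names the USDM v4.0 `Timing` class defines (`value`, `windowLower`, `windowUpper` as ISO 8601 durations):

```python
import uuid

def make_timing(name, value_iso, window_lower_iso, window_upper_iso):
    """Build an illustrative USDM v4.0 Timing entity with a visit window.

    ISO 8601 durations throughout, e.g. value "P14D" (Day 14) with a
    "P2D" tolerance on either side. The helper name and signature are
    ours, not the pipeline's.
    """
    return {
        "id": str(uuid.uuid4()),
        "name": name,
        "value": value_iso,
        "windowLower": window_lower_iso,
        "windowUpper": window_upper_iso,
        "instanceType": "Timing",
    }
```

A `Timing` shaped this way lives under `scheduleTimelines[].timings[]`, which is why consumers no longer need to parse `11_execution_model.json` for windows.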
- Context-aware extraction where each phase builds on previous results
- Extractors receive existing SoA entities (epochs, encounters, activities) as context
- Eliminates arbitrary labels that require downstream resolution
- Consistent ID references across USDM output
- LLM-based semantic mapping of abstract concepts to protocol entities
- Epoch, encounter, and activity reconciliation with ID preservation
- Replaces fuzzy string matching with intelligent entity resolution
- All entities placed at correct locations per the CDISC `dataStructure.yml`
- Proper entity hierarchy (studyVersion, studyDesign, scheduleTimeline, activity)
- NCI code mappings for dose forms, timing types, and identifier types
- Gemini Flash 3 Optimized: Pipeline tuned for best results with `gemini-3-flash` via Vertex AI
- Vision-Validated Extraction: Text extraction validated against actual PDF images
- USDM v4.0 Aligned: Outputs follow official CDISC schema with proper entity hierarchy
- Execution Model: Full subject state machine, time anchors, visit windows, and dosing regimens
- NCI Terminology Enrichment: Automatic enrichment with official NCI codes via EVS API
- Pipeline Context: Each extractor receives accumulated context from prior phases
- Entity Reconciliation: LLM-based semantic mapping and ID preservation
- Rich Provenance: Every cell tagged with source (text/vision/both) for confidence tracking
- CDISC CORE Validation: Built-in conformance checking with local engine
- Modern Web UI: Complete React/Next.js protocol viewer (see below)
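To give a concrete sense of what NCI terminology enrichment produces, the sketch below builds an EVS REST lookup URL and maps a concept payload onto a USDM `Code` entity. The URL shape follows the public EVSRESTAPI convention and the fields follow the USDM `Code` class, but the helper names and the `codeSystemVersion` default are illustrative assumptions, not the pipeline's code:

```python
EVS_BASE = "https://api-evsrest.nci.nih.gov/api/v1"  # public NCI EVS REST API

def concept_url(ncit_code):
    """Build the EVS REST endpoint for an NCI Thesaurus concept lookup
    (e.g. C49487). Verify the URL shape against the EVS docs before use."""
    return f"{EVS_BASE}/concept/ncit/{ncit_code}"

def to_usdm_code(concept, version="placeholder-version"):
    """Map an EVS concept payload ({'code': ..., 'name': ...}) onto the
    USDM Code entity shape (code / codeSystem / decode / instanceType)."""
    return {
        "id": f"code_{concept['code']}",
        "code": concept["code"],
        "codeSystem": "NCI Thesaurus",
        "codeSystemVersion": version,
        "decode": concept["name"],
        "instanceType": "Code",
    }
```

In the real pipeline the lookup result is cached (`core/evs_client.py`), so repeated enrichment runs do not re-query EVS.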
| Module | Entities | CLI Flag |
|---|---|---|
| SoA | Activity, PlannedTimepoint, Epoch, Encounter, CommentAnnotation | (default) |
| Metadata | StudyTitle, StudyIdentifier, Organization, Indication | --metadata |
| Eligibility | EligibilityCriterion, StudyDesignPopulation | --eligibility |
| Objectives | Objective, Endpoint, Estimand | --objectives |
| Study Design | StudyArm, StudyCell, StudyCohort | --studydesign |
| Interventions | StudyIntervention, AdministrableProduct, Substance | --interventions |
| Narrative | NarrativeContent, Abbreviation | --narrative |
| Advanced | StudyAmendment, GeographicScope, Country | --advanced |
| Procedures | Procedure, MedicalDevice, Ingredient, Strength | --procedures |
| Scheduling | Timing, Condition, TransitionRule, ScheduleTimelineExit | --scheduling |
| Doc Structure | NarrativeContentItem, StudyDefinitionDocument | --docstructure |
| Amendments | StudyAmendmentReason, ImpactedEntity | --amendmentdetails |
| Execution Model | TimeAnchor, VisitWindow, StateMachine, Repetition | --execution |
| Source | Entities | CLI Flag |
|---|---|---|
| SAP | AnalysisPopulation, Characteristic | --sap <path> |
| Site List | StudySite, StudyRole, AssignedPerson | --sites <path> |
Extract everything with a single command:
```bash
# Default behavior - no flags needed! Runs --complete automatically
python main_v3.py protocol.pdf

# Explicit --complete: Full extraction + all post-processing
python main_v3.py protocol.pdf --complete

# Parallel execution for faster processing
python main_v3.py protocol.pdf --parallel --max-workers 4
```

Or select specific phases:

```bash
python main_v3.py protocol.pdf --metadata --eligibility --objectives
python main_v3.py protocol.pdf --expansion-only --metadata   # Skip SoA
python main_v3.py protocol.pdf --procedures --scheduling --execution
```

With additional source documents:

```bash
python main_v3.py protocol.pdf --sap sap.pdf --sites sites.xlsx
```

Output: Individual JSONs + combined `protocol_usdm.json`
```bash
# 1. Clone repository
git clone https://github.com/Panikos/Protocol2USDMv3.git
cd Protocol2USDMv3

# 2. Install dependencies
pip install -r requirements.txt

# 3. Set up API keys (.env file)
GOOGLE_CLOUD_PROJECT=your-gcp-project   # Required for Gemini via Vertex AI
GOOGLE_CLOUD_LOCATION=us-central1
OPENAI_API_KEY=...                      # Optional: for GPT models
CLAUDE_API_KEY=...                      # Optional: for Claude models
CDISC_API_KEY=...                       # Optional: for CORE conformance

# 4. Run the pipeline (defaults to --complete with gemini-3-flash-preview)
python main_v3.py input/trial/NCT04573309_Wilsons/NCT04573309_Wilsons_Protocol.pdf

# 5. View results in web UI
cd web-ui && npm run dev
```

- Python 3.9+
- API keys: OpenAI, Google AI, Claude AI, CDISC API
```bash
# Create virtual environment (recommended)
python -m venv venv
venv\Scripts\activate        # Windows
source venv/bin/activate     # macOS/Linux

# Install dependencies
pip install -r requirements.txt

# Create .env file with API keys
echo "OPENAI_API_KEY=sk-your-key" > .env
echo "GOOGLE_API_KEY=AIza-your-key" >> .env
echo "CDISC_API_KEY=your-cdisc-key" >> .env
```

For conformance validation, download the CORE engine:

```bash
python tools/core/download_core.py
```

Note: Get your CDISC API key from https://library.cdisc.org/ (requires CDISC membership)
```bash
# main_v3.py is the recommended entry point (phase registry architecture)
python main_v3.py <protocol.pdf> [options]

# Legacy main_v2.py has been removed - use main_v3.py
```

```bash
# Default: gemini-3-flash-preview (no --model flag needed)
python main_v3.py protocol.pdf

# Gemini 2.5 Pro (good fallback)
python main_v3.py protocol.pdf --model gemini-2.5-pro

# Claude Opus 4.5 (high accuracy, higher cost)
python main_v3.py protocol.pdf --model claude-opus-4-5

# ChatGPT 5.2
python main_v3.py protocol.pdf --model chatgpt-5.2
```

```bash
# Default behavior - runs --complete automatically when no phases specified
python main_v3.py protocol.pdf

# With SAP document for analysis populations
python main_v3.py protocol.pdf --sap sap.pdf
```

`--complete` enables:
| Option | Description |
|---|---|
| `--full-protocol` | All 12 expansion phases (metadata, eligibility, objectives, etc.) |
| `--soa` | Full SoA extraction pipeline |
| `--enrich` | NCI terminology code enrichment |
| `--validate-schema` | USDM schema validation |
| `--conformance` | CDISC CORE conformance rules |
```bash
# Run SoA + enrichment + schema validation + CORE conformance
python main_v3.py protocol.pdf --soa

# Or run post-processing steps individually
python main_v3.py protocol.pdf --enrich            # Step 7: NCI terminology
python main_v3.py protocol.pdf --validate-schema   # Step 8: Schema validation
python main_v3.py protocol.pdf --conformance       # Step 9: CORE conformance
```

```
--output-dir, -o          Output directory (default: output/<protocol_name>)
--pages, -p               Specific SoA page numbers (comma-separated)
--no-validate             Skip vision validation
--remove-hallucinations   Remove cells not confirmed by vision (default: keep all)
--confidence-threshold    Confidence threshold for hallucination removal (default: 0.7)
--verbose, -v             Enable verbose output
--update-evs-cache        Update EVS terminology cache before enrichment
--update-cache            Update CDISC CORE rules cache (requires CDISC_API_KEY)
```

| Step | Description | Output File |
|---|---|---|
| 1 | Find SoA pages & analyze header structure (vision) | 4_header_structure.json |
| 2 | Extract SoA data from text | 5_raw_text_soa.json |
| 3 | Validate extraction against images | 6_validation_result.json |
| 4 | Build SoA output | 9_final_soa.json + 9_final_soa_provenance.json |
| Phase | Entities | Output File | CLI Flag |
|---|---|---|---|
| Metadata | StudyTitle, Organization, Indication | 2_study_metadata.json | `--metadata` |
| Eligibility | EligibilityCriterion, Population | 3_eligibility_criteria.json | `--eligibility` |
| Objectives | Objective, Endpoint, Estimand | 4_objectives_endpoints.json | `--objectives` |
| Study Design | StudyArm, StudyCell, StudyCohort | 5_study_design.json | `--studydesign` |
| Interventions | StudyIntervention, Product, Substance | 6_interventions.json | `--interventions` |
| Narrative | Abbreviation, NarrativeContent | 7_narrative_structure.json | `--narrative` |
| Advanced | StudyAmendment, GeographicScope, Country | 8_advanced_entities.json | `--advanced` |
| Procedures | Procedure, MedicalDevice, Ingredient | 9_procedures_devices.json | `--procedures` |
| Scheduling | Timing, Condition, TransitionRule | 10_scheduling_logic.json | `--scheduling` |
| Doc Structure | NarrativeContentItem, StudyDefinitionDocument | 13_document_structure.json | `--docstructure` |
| Amendments | StudyAmendmentReason, ImpactedEntity | 14_amendment_details.json | `--amendmentdetails` |
| Execution | TimeAnchor, Repetition, StateMachine | 11_execution_model.json | `--execution` |
| Source | Entities | Output File |
|---|---|---|
| SAP Document | AnalysisPopulation, Characteristic | 11_sap_populations.json |
| Site List | StudySite, StudyRole, AssignedPerson | 12_site_list.json |
| Step | Description | Output File |
|---|---|---|
| Combine | Merge all extractions | protocol_usdm.json ⭐ |
| Terminology | NCI EVS code enrichment | terminology_enrichment.json |
| Schema Fix | Auto-fix schema issues (UUIDs, Codes) | schema_validation.json |
| USDM Validation | Validate against official USDM package | usdm_validation.json |
| Conformance | CDISC CORE rules validation | conformance_report.json |
| ID Mapping | Simple ID → UUID mapping | id_mapping.json |
| Provenance | UUID-based provenance for viewer | protocol_usdm_provenance.json |
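The "Simple ID → UUID mapping" step can be pictured like this. It is a sketch only: the dict-of-strings mapping format and the helper names are assumptions for illustration, not the actual `id_mapping.json` schema.

```python
import uuid

def build_id_mapping(entities):
    """Assign a UUID to each simple extraction ID (e.g. 'enc_1')."""
    return {entity["id"]: str(uuid.uuid4()) for entity in entities}

def remap(entity, mapping):
    """Return a copy of an entity with its id and any *Id/*Ids
    reference fields rewritten through the mapping, so cross-entity
    references stay consistent after the swap."""
    out = {}
    for key, value in entity.items():
        if key == "id" or key.endswith("Id"):
            out[key] = mapping.get(value, value)
        elif key.endswith("Ids") and isinstance(value, list):
            out[key] = [mapping.get(v, v) for v in value]
        else:
            out[key] = value
    return out
```

Remapping references together with IDs is what keeps `previousId`/`nextId` chains and activity links intact in the final output.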
Primary output: output/<protocol>/protocol_usdm.json
The output follows the official USDM v4.0 schema from dataStructure.yml with proper entity placement:
```
Study → StudyVersion → StudyDesign
            │               │
            │               ├── eligibilityCriteria[]
            │               ├── indications[]
            │               ├── analysisPopulations[]
            │               ├── activities[].definedProcedures[]
            │               └── scheduleTimelines[].timings[], .exits[]
            │
            ├── eligibilityCriterionItems[]
            ├── organizations[]
            ├── narrativeContentItems[]
            ├── abbreviations[]
            ├── conditions[]
            ├── amendments[]
            ├── administrableProducts[]
            ├── medicalDevices[]
            └── studyInterventions[]
```
For detailed output structure and entity relationships, see docs/ARCHITECTURE.md.
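Consumers of `protocol_usdm.json` can walk the Study → StudyVersion → StudyDesign hierarchy directly. A minimal sketch, assuming the standard `study.versions[].studyDesigns[]` serialization keys:

```python
def list_activities(doc):
    """Collect activity names from a combined USDM document (a plain
    dict as loaded from protocol_usdm.json). Key names assume the
    standard USDM v4.0 serialization."""
    names = []
    for version in doc.get("study", {}).get("versions", []):
        for design in version.get("studyDesigns", []):
            for activity in design.get("activities", []):
                names.append(activity.get("name"))
    return names
```

The same traversal pattern applies to any of the collections in the diagram above (e.g. `eligibilityCriteria`, `scheduleTimelines`).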
Provenance metadata is stored separately in 9_final_soa_provenance.json and visualized in the web UI:
| Source | Color | Meaning |
|---|---|---|
| `both` | 🟩 Green | Confirmed (text + vision agree) |
| `text` | 🟦 Blue | Text-only (NOT confirmed by vision) |
| `vision` | 🟧 Orange | Vision-only (possible hallucination, needs review) |
| (none) | 🟥 Red | Orphaned (no provenance data) |
View provenance in the web UI by running cd web-ui && npm run dev.
Note: By default, all text-extracted cells are kept in the output. Use --remove-hallucinations to exclude cells not confirmed by vision.
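The keep-by-default policy plus `--remove-hallucinations`/`--confidence-threshold` can be modelled as a simple filter. This is a sketch: the `source`/`confidence` fields mirror the provenance metadata described above, but the function itself is illustrative, not the pipeline's code.

```python
def filter_cells(cells, remove_hallucinations=False, threshold=0.7):
    """Apply the hallucination-removal policy to a list of cell dicts.

    Default: keep everything. With remove_hallucinations=True, keep
    vision-confirmed cells ('both') unconditionally and other cells
    only when their confidence meets the threshold (0.7 by default,
    matching --confidence-threshold).
    """
    if not remove_hallucinations:
        return list(cells)
    kept = []
    for cell in cells:
        if cell.get("source") == "both":
            kept.append(cell)   # text + vision agree: always keep
        elif cell.get("confidence", 0.0) >= threshold:
            kept.append(cell)   # unconfirmed but high confidence
    return kept
```

Running with the flag off and on against the same extraction makes it easy to diff exactly which cells vision failed to confirm.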
Footnotes extracted from SoA tables are stored in StudyDesign.notes as USDM v4.0 CommentAnnotation objects:
"notes": [
{"id": "soa_fn_1", "text": "a. Within 32 days of administration", "instanceType": "CommentAnnotation"},
{"id": "soa_fn_2", "text": "b. Participants admitted 10 hours prior", "instanceType": "CommentAnnotation"}
]The web interface has been completely revamped from the legacy Streamlit app to a modern, user-friendly stack built with React 19, Next.js 16, TypeScript, TailwindCSS, and AG Grid.
```bash
cd web-ui
npm install
npm run dev
```

Then open http://localhost:3000 in your browser.
| Component | Technology |
|---|---|
| Framework | Next.js 16 with App Router |
| Language | TypeScript |
| Styling | TailwindCSS with dark mode support |
| Data Tables | AG Grid for high-performance SoA display |
| Visualization | Cytoscape.js for interactive timeline graphs |
| State Management | Zustand |
| Icons | Lucide React |
Protocol Overview:
- Study metadata (title, phase, indication, sponsor)
- Study identifiers (NCT, EudraCT, IND numbers)
- Amendment history
Schedule of Activities (SoA):
- Interactive AG Grid table with epoch/encounter groupings
- Color-coded provenance indicators (green=confirmed, blue=text-only, orange=vision-only)
- Footnote references with hover tooltips
- Activity filtering and search
Study Design:
- Study arms, epochs, and study cells
- Activity groups with child activity linking
- Transition rules between epochs
Eligibility & Objectives:
- Inclusion/exclusion criteria display
- Primary and secondary objectives
- Endpoints and estimands
Interventions & Procedures:
- Drug products with administration details
- Substances and ingredients
- Medical devices and procedures
Timeline Visualization:
- Interactive Cytoscape graph of epochs and encounters
- Node details panel with encounter information
- Execution model overlay with time anchors
Advanced Views:
- Execution model details (state machine, visit windows, dosing)
- Provenance explorer with source tracking
- Quality metrics dashboard
- Validation results
- Raw USDM JSON viewer
🚧 In Development: We intend to streamline the UI further and enable digital protocol (USDM JSON) editing directly via the UI. Some of these editing features are present but not fully functional yet. The goal is to allow users to:
- Edit USDM entities directly in the browser
- Save draft overlays without modifying source data
- Publish finalized edits back to the USDM JSON
- Track edit history and provenance
SoA extraction tested on Alexion Wilson's Disease protocol (Jan 2026):
| Model | Activities | Timepoints | Ticks | Expansion Phases | Recommendation |
|---|---|---|---|---|---|
| gemini-3-flash ⭐ | 36 ✅ | 24 ✅ | 216 | 12/12 ✅ | Optimized for this release |
| gemini-2.5-pro | 36 ✅ | 24 ✅ | 207 | 12/12 ✅ | Good fallback |
| claude-opus-4-5 | 36 ✅ | 24 ✅ | 212 | 12/12 ✅ | Good, higher cost |
| chatgpt-5.2 | 36 ✅ | 24 ✅ | 210 | 12/12 ✅ | Good alternative |
Notes:
- gemini-3-flash: This release is optimized for Gemini Flash 3. Best balance of speed, accuracy, and cost.
- gemini-2.5-pro: Used as automatic fallback for SoA text extraction when Gemini 3 has JSON compliance issues.
- claude-opus-4-5: High accuracy but significantly higher cost per extraction.
- chatgpt-5.2: Latest OpenAI model with good accuracy.
```
Protocol2USDMv3/
├── main_v3.py                  # Entry point (phase registry architecture)
├── llm_providers.py            # LLM provider abstraction layer
├── pipeline/                   # ⭐ NEW: Phase registry architecture
│   ├── __init__.py             # Package exports
│   ├── base_phase.py           # BasePhase class with extract/combine/save
│   ├── phase_registry.py       # Phase registration and discovery
│   ├── orchestrator.py         # Pipeline orchestration with parallel support
│   └── phases/                 # Individual phase implementations
│       ├── eligibility.py      # Eligibility criteria phase
│       ├── metadata.py         # Study metadata phase
│       ├── objectives.py       # Objectives & endpoints phase
│       ├── studydesign.py      # Study design phase
│       ├── interventions.py    # Interventions phase
│       ├── narrative.py        # Narrative structure phase
│       ├── advanced.py         # Advanced entities phase
│       ├── procedures.py       # Procedures & devices phase
│       ├── scheduling.py       # Scheduling logic phase
│       ├── docstructure.py     # Document structure phase
│       ├── amendmentdetails.py # Amendment details phase
│       └── execution.py        # Execution model phase
├── core/                       # Core modules
│   ├── usdm_schema_loader.py   # Official CDISC schema parser + USDMEntity base
│   ├── usdm_types_generated.py # 86+ USDM types (hand-written, schema-aligned)
│   ├── usdm_types.py           # Unified type interface
│   ├── llm_client.py           # LLM client utilities
│   ├── constants.py            # Centralized constants (DEFAULT_MODEL, etc.)
│   ├── evs_client.py           # NCI EVS API client with caching
│   ├── provenance.py           # ProvenanceTracker for source tracking
│   └── reconciliation/         # Entity reconciliation framework
├── extraction/                 # Extraction modules
│   ├── header_analyzer.py      # Vision-based structure
│   ├── text_extractor.py       # Text-based extraction
│   ├── pipeline.py             # SoA extraction pipeline
│   ├── pipeline_context.py     # Context passing between extractors
│   ├── execution/              # Execution model extractors (27 modules)
│   └── */                      # Domain extractors (13 modules)
├── enrichment/                 # Terminology enrichment
│   └── terminology.py          # NCI EVS enrichment
├── validation/                 # Validation package
│   ├── usdm_validator.py       # Official USDM validation
│   └── cdisc_conformance.py    # CDISC CORE conformance
├── scripts/                    # Utility scripts
│   ├── extractors/             # Standalone CLI extractors
│   └── debug/                  # Debug utilities
├── testing/                    # Benchmarking & integration tests
├── tests/                      # Unit tests
├── docs/                       # Architecture documentation
├── web-ui/                     # React/Next.js protocol viewer
├── tools/                      # External tools (CDISC CORE engine)
├── archive/                    # Archived legacy files
└── output/                     # Pipeline outputs
```
For detailed architecture, see docs/ARCHITECTURE.md.
```bash
# Run unit tests
pytest tests/

# Run integration tests
python testing/test_pipeline_steps.py

# Run golden standard comparison
python testing/compare_golden_vs_extracted.py

# Benchmark models
python testing/benchmark_models.py
```

```
# RECOMMENDED: Google Cloud Vertex AI (for Gemini models)
GOOGLE_CLOUD_PROJECT=your-project-id    # Your GCP project
GOOGLE_CLOUD_LOCATION=us-central1       # Region (us-central1, europe-west1, etc.)

# Alternative: Google AI Studio (may have safety restrictions)
GOOGLE_API_KEY=...                      # For Gemini via AI Studio (not recommended for clinical)

# Other providers
OPENAI_API_KEY=...                      # For GPT models
CLAUDE_API_KEY=...                      # For Claude models (Anthropic)

# Required for CDISC conformance validation
CDISC_API_KEY=...                       # For CORE rules cache (get from library.cdisc.org)
```
⚠️ Important: For clinical protocol extraction, use Vertex AI for Gemini models. AI Studio may block medical content even with safety settings disabled.
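A small sketch of how a model name can be mapped to the credentials it needs (illustrative only; the real `llm_providers.py` abstraction may differ):

```python
import os

def pick_provider(model_name):
    """Return the environment variables a model family needs.
    Gemini goes through Vertex AI, hence project + location rather
    than an API key. Raises on unknown families."""
    if model_name.startswith("gemini"):
        return ["GOOGLE_CLOUD_PROJECT", "GOOGLE_CLOUD_LOCATION"]
    if model_name.startswith("claude"):
        return ["CLAUDE_API_KEY"]
    if model_name.startswith(("chatgpt", "gpt")):
        return ["OPENAI_API_KEY"]
    raise ValueError(f"unknown model family: {model_name}")

def missing_credentials(model_name, env=os.environ):
    """List required variables that are unset or empty, useful as a
    pre-flight check before launching a long extraction run."""
    return [var for var in pick_provider(model_name) if not env.get(var)]
```

Running such a check at startup turns a mid-pipeline authentication failure into an immediate, readable error.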
Google (Optimized for this release):
- `gemini-3-flash` ⭐ Recommended - Pipeline optimized for this model
- `gemini-2.5-pro` - Good fallback, used automatically for SoA extraction

Anthropic:
- `claude-opus-4-5` - High accuracy, higher cost
- `claude-sonnet-4` - Good balance of speed and accuracy

OpenAI:
- `chatgpt-5.2` - Latest OpenAI model, good accuracy

Note: Other models are also supported via the unified provider interface. See `llm_providers.py` for the full list.
| Issue | Solution |
|---|---|
| API key error | Check .env file, restart terminal |
| Gemini blocks content | Use Vertex AI instead of AI Studio (see Configuration) |
| Missing visits | Verify correct SoA pages found (check 4_header_structure.json) |
| Parse errors | Try gemini-3-flash model, check verbose logs |
| Schema errors | Post-processing auto-fixes most issues |
| Safety filter errors | Ensure using Vertex AI with BLOCK_NONE settings |
The following items are planned for upcoming releases:
- Web UI Protocol Editing: Enable direct USDM JSON editing via browser with draft/publish workflow
- Biomedical Concepts: Add extraction via a separate comprehensive canonical model for standardized concept mapping
- Multi-Protocol Comparison: Compare USDM outputs across protocol versions
- Gemini Flash 3 Optimization: Pipeline optimized for `gemini-3-flash` with Vertex AI (completed v7.0)
- Execution Model Extraction: Time anchors, visit windows, state machine, dosing regimens (completed v7.0)
- Execution Model Promotion: Native USDM entities instead of extensions (completed v7.2)
- Pipeline Context Architecture: Context-aware extraction with accumulated results (completed v7.0)
- Entity Reconciliation Framework: LLM-based semantic mapping and ID preservation (completed v7.0)
- Modern Web UI: Complete React/Next.js revamp from Streamlit (completed v7.0)
- USDM 4.0 Alignment: All entities at correct locations per `dataStructure.yml` (completed v7.0)
- NCI Code Mappings: Dose forms, timing types, identifier types with NCI codes (completed v7.0)
- Repository Cleanup: Organized scripts, archived legacy files (completed v7.0)
Contact author for permission to use.
This project is, in many ways, a workflow wrapper around the incredible work done by CDISC and its volunteers.
A special thank you to the Data4Knowledge (D4K) team and the CDISC DDF/USDM community:
- DDF Reference Architecture - The USDM standard that powers this entire pipeline
- CDISC CORE Engine - Conformance validation engine and rules
- usdm Python Package - Official USDM validation library
Most importantly, heartfelt thanks to Dave Iberson-Hurst, Kirsten Walther Langendorf, and Johannes Ulander, who have been extraordinarily kind and supportive despite my repeated questions and pestering. Their openness in sharing their work and time enables projects like this to exist.
- NCI EVS for terminology services