This project was built with one self-imposed rule: I would not write or edit a single line of code myself.
The goal was to answer a simple question:
"Can someone with no coding experience build production-quality clinical software using only AI assistants?"
Every line of code in this repository was generated by AI coding assistants (Claude, GPT, Gemini). The human role was limited to:
- Describing requirements and desired outcomes
- Reviewing and approving AI-generated code
- Providing feedback and requesting corrections
- Testing functionality and reporting issues
As a result, this software:
- ❌ Is not robust: edge cases may not be handled
- ❌ Is likely full of bugs: limited systematic testing
- ❌ Has not been thoroughly tested: no formal QA process
- ❌ Is NOT fit for any production environment
- ❌ Should NOT be used for actual clinical trials or regulatory submissions
This project exists solely as:
- ✅ A demonstrator of AI-assisted development capabilities
- ✅ A reference implementation of USDM v4.0 extraction concepts
- ✅ An experiment in human-AI collaboration for software development
Use at your own risk. Contributions and improvements welcome.
Extract clinical protocol content into USDM v4.0 format
Protocol2USDM is an automated pipeline that extracts, validates, and structures clinical trial protocol content into data conformant to the CDISC USDM v4.0 model.
```bash
# Full extraction with SAP, sites, parallel execution
python main_v3.py input/trial/NCT04573309_Wilsons/NCT04573309_Wilsons_Protocol.pdf \
  --complete \
  --sap input/trial/NCT04573309_Wilsons/NCT04573309_Wilsons_SAP.pdf \
  --sites input/trial/NCT04573309_Wilsons/NCT04573309_Wilsons_sites.csv \
  --parallel \
  --model gemini-3-flash
```

This extracts the full protocol with the execution model, enriches entities with NCI terminology codes, and includes SAP analysis populations (with STATO mapping and ARS linkage) and the site list.
💡 Default Model: `gemini-3-flash-preview` (Gemini Flash 3) via Vertex AI. Other models (`claude-opus-4-5`, `claude-sonnet-4`, `chatgpt-5.2`, `gemini-2.5-pro`) are supported. The pipeline defaults to `--complete` mode when no specific phases are requested.
Gemini models must be accessed via Google Cloud Vertex AI (not AI Studio) to properly disable safety controls. The consumer API (AI Studio) may still block or restrict medical/clinical content even with safety settings disabled. Vertex AI allows the BLOCK_NONE safety setting, which is essential for clinical protocol extraction.
Required .env configuration for Gemini via Vertex AI:
```
GOOGLE_CLOUD_PROJECT=your-project-id
GOOGLE_CLOUD_LOCATION=us-central1   # or your preferred region
```

- `main_v3.py` - New refactored entry point with a clean phase registry pattern
- `pipeline/` module - Modular phase definitions with dependency-aware execution
- Parallel execution - Run independent phases concurrently with `--parallel`
- Default `--complete` mode - Full extraction when no specific phases are requested
- Pipeline optimized and tested with `gemini-3-flash-preview` as the default model
- Intelligent fallback to `gemini-2.5-pro` for SoA text extraction when needed
- Response validation with automatic retry logic (up to 2 retries)
- Stricter prompt guardrails for JSON format compliance
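The validate-and-retry behaviour described above can be sketched as a small loop. This is a sketch only: `call_model`, the prompt text, and the required-key check are illustrative stand-ins, not the pipeline's actual code.

```python
import json

MAX_RETRIES = 2  # matches the "up to 2 retries" described above

def extract_with_retry(call_model, prompt, required_keys):
    """Call an LLM and retry when the response is not valid JSON
    or is missing required top-level keys.

    `call_model` is a placeholder for a provider function that takes
    a prompt string and returns raw response text.
    """
    last_error = None
    for _attempt in range(1 + MAX_RETRIES):
        raw = call_model(prompt)
        try:
            data = json.loads(raw)
        except json.JSONDecodeError as exc:
            last_error = f"invalid JSON: {exc}"
        else:
            missing = [k for k in required_keys if k not in data]
            if not missing:
                return data
            last_error = f"missing keys: {missing}"
        # tighten the prompt before the next attempt (a guardrail sketch)
        prompt += f"\n\nPrevious response was rejected ({last_error}). Return ONLY valid JSON."
    raise ValueError(f"extraction failed after {MAX_RETRIES} retries: {last_error}")
```

The same pattern generalizes to any phase: parse, check the schema-level invariants you care about, and re-prompt with the rejection reason.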
- Time Anchors: Extract temporal reference points (VISIT/EVENT/CONCEPTUAL classification)
- Visit Windows: Timing tolerances → `Timing.windowLower`/`windowUpper` (ISO 8601)
- Subject State Machine: Subject flow → `TransitionRule` on `Encounter`
- Dosing Regimens: Drug administration → `Administration` entities
- Repetitions: Cycle-based patterns → `ScheduledActivityInstance` expansion
- Traversal Constraints: Subject journey → `Epoch`/`Encounter.previousId`/`nextId` chains
- Footnote Conditions: Conditional rules → `Condition` + `ScheduledDecisionInstance`
- Titration Schedules: Dose escalation → `StudyElement` with `TransitionRule`
v7.2 Promotion: Execution model data is now promoted to native USDM entities instead of extensions. Core USDM output is self-sufficient without parsing 11_execution_model.json.
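As an illustration of what "promotion to native USDM entities" means in practice, a visit window can be carried directly on a `Timing` entity rather than in an extension. The helper below is not part of the pipeline; it only shows the field names the USDM v4.0 `Timing` class defines (`value`, `windowLower`, `windowUpper` as ISO 8601 durations):

```python
import uuid

def make_timing(name, value_iso, window_lower_iso, window_upper_iso):
    """Build an illustrative USDM v4.0 Timing entity with a visit window.

    ISO 8601 durations throughout, e.g. value "P14D" (Day 14) with a
    "P2D" tolerance on either side. The helper name and signature are
    ours, not the pipeline's.
    """
    return {
        "id": str(uuid.uuid4()),
        "name": name,
        "value": value_iso,
        "windowLower": window_lower_iso,
        "windowUpper": window_upper_iso,
        "instanceType": "Timing",
    }
```

A `Timing` shaped this way lives under `scheduleTimelines[].timings[]`, which is why consumers no longer need to parse `11_execution_model.json` for windows.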
- Context-aware extraction where each phase builds on previous results
- Extractors receive existing SoA entities (epochs, encounters, activities) as context
- Eliminates arbitrary labels that require downstream resolution
- Consistent ID references across USDM output
- LLM-based semantic mapping of abstract concepts to protocol entities
- Epoch, encounter, and activity reconciliation with ID preservation
- Replaces fuzzy string matching with intelligent entity resolution
- All entities placed at correct locations per the CDISC `dataStructure.yml`
- Proper entity hierarchy (studyVersion, studyDesign, scheduleTimeline, activity)
- NCI code mappings for dose forms, timing types, and identifier types
- Gemini Flash 3 Optimized: Pipeline tuned for best results with `gemini-3-flash` via Vertex AI
- Vision-Validated Extraction: Text extraction validated against actual PDF images
- USDM v4.0 Aligned: Outputs follow official CDISC schema with proper entity hierarchy
- Execution Model: Full subject state machine, time anchors, visit windows, and dosing regimens
- NCI Terminology Enrichment: Automatic enrichment with official NCI codes via EVS API
- Pipeline Context: Each extractor receives accumulated context from prior phases
- Entity Reconciliation: LLM-based semantic mapping and ID preservation
- Rich Provenance: Every cell tagged with source (text/vision/both) for confidence tracking
- CDISC CORE Validation: Built-in conformance checking with local engine
- Modern Web UI: Complete React/Next.js protocol viewer (see below)
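To give a concrete sense of what NCI terminology enrichment produces, the sketch below builds an EVS REST lookup URL and maps a concept payload onto a USDM `Code` entity. The URL shape follows the public EVSRESTAPI convention and the fields follow the USDM `Code` class, but the helper names and the `codeSystemVersion` default are illustrative assumptions, not the pipeline's code:

```python
EVS_BASE = "https://api-evsrest.nci.nih.gov/api/v1"  # public NCI EVS REST API

def concept_url(ncit_code):
    """Build the EVS REST endpoint for an NCI Thesaurus concept lookup
    (e.g. C49487). Verify the URL shape against the EVS docs before use."""
    return f"{EVS_BASE}/concept/ncit/{ncit_code}"

def to_usdm_code(concept, version="placeholder-version"):
    """Map an EVS concept payload ({'code': ..., 'name': ...}) onto the
    USDM Code entity shape (code / codeSystem / decode / instanceType)."""
    return {
        "id": f"code_{concept['code']}",
        "code": concept["code"],
        "codeSystem": "NCI Thesaurus",
        "codeSystemVersion": version,
        "decode": concept["name"],
        "instanceType": "Code",
    }
```

In the real pipeline the lookup result is cached (`core/evs_client.py`), so repeated enrichment runs do not re-query EVS.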
| Module | Entities | CLI Flag |
|---|---|---|
| SoA | Activity, PlannedTimepoint, Epoch, Encounter, CommentAnnotation | (default) |
| Metadata | StudyTitle, StudyIdentifier, Organization, Indication | --metadata |
| Eligibility | EligibilityCriterion, StudyDesignPopulation | --eligibility |
| Objectives | Objective, Endpoint, Estimand | --objectives |
| Study Design | StudyArm, StudyCell, StudyCohort | --studydesign |
| Interventions | StudyIntervention, AdministrableProduct, Substance | --interventions |
| Narrative | NarrativeContent, Abbreviation | --narrative |
| Advanced | StudyAmendment, GeographicScope, Country | --advanced |
| Procedures | Procedure, MedicalDevice, Ingredient, Strength | --procedures |
| Scheduling | Timing, Condition, TransitionRule, ScheduleTimelineExit | --scheduling |
| Doc Structure | NarrativeContentItem, StudyDefinitionDocument | --docstructure |
| Amendments | StudyAmendmentReason, ImpactedEntity | --amendmentdetails |
| Execution Model | TimeAnchor, VisitWindow, StateMachine, Repetition | --execution |
| Source | Entities | CLI Flag |
|---|---|---|
| SAP | AnalysisPopulation, Characteristic | --sap <path> |
| Site List | StudySite, StudyRole, AssignedPerson | --sites <path> |
Extract everything with a single command:
```bash
# Default behavior - no flags needed! Runs --complete automatically
python main_v3.py protocol.pdf

# Explicit --complete: Full extraction + all post-processing
python main_v3.py protocol.pdf --complete

# Parallel execution for faster processing
python main_v3.py protocol.pdf --parallel --max-workers 4
```

Or select specific phases:

```bash
python main_v3.py protocol.pdf --metadata --eligibility --objectives
python main_v3.py protocol.pdf --expansion-only --metadata   # Skip SoA
python main_v3.py protocol.pdf --procedures --scheduling --execution
```

With additional source documents:

```bash
python main_v3.py protocol.pdf --sap sap.pdf --sites sites.xlsx
```

Output: Individual JSONs + combined `protocol_usdm.json`
```bash
# 1. Clone repository
git clone https://github.com/Panikos/Protocol2USDMv3.git
cd Protocol2USDMv3

# 2. Install dependencies
pip install -r requirements.txt

# 3. Set up API keys (.env file)
GOOGLE_CLOUD_PROJECT=your-gcp-project   # Required for Gemini via Vertex AI
GOOGLE_CLOUD_LOCATION=us-central1
OPENAI_API_KEY=...                      # Optional: for GPT models
CLAUDE_API_KEY=...                      # Optional: for Claude models
CDISC_API_KEY=...                       # Optional: for CORE conformance

# 4. Run the pipeline (defaults to --complete with gemini-3-flash-preview)
python main_v3.py input/trial/NCT04573309_Wilsons/NCT04573309_Wilsons_Protocol.pdf

# 5. View results in web UI
cd web-ui && npm run dev
```

- Python 3.9+
- API keys: OpenAI, Google AI, Claude AI, CDISC API
```bash
# Create virtual environment (recommended)
python -m venv venv
venv\Scripts\activate        # Windows
source venv/bin/activate     # macOS/Linux

# Install dependencies
pip install -r requirements.txt

# Create .env file with API keys
echo "OPENAI_API_KEY=sk-your-key" > .env
echo "GOOGLE_API_KEY=AIza-your-key" >> .env
echo "CDISC_API_KEY=your-cdisc-key" >> .env
```

For conformance validation, download the CORE engine:

```bash
python tools/core/download_core.py
```

Note: Get your CDISC API key from https://library.cdisc.org/ (requires CDISC membership)
```bash
# main_v3.py is the recommended entry point (phase registry architecture)
python main_v3.py <protocol.pdf> [options]

# Legacy main_v2.py has been removed - use main_v3.py
```

```bash
# Default: gemini-3-flash-preview (no --model flag needed)
python main_v3.py protocol.pdf

# Gemini 2.5 Pro (good fallback)
python main_v3.py protocol.pdf --model gemini-2.5-pro

# Claude Opus 4.5 (high accuracy, higher cost)
python main_v3.py protocol.pdf --model claude-opus-4-5

# ChatGPT 5.2
python main_v3.py protocol.pdf --model chatgpt-5.2
```

```bash
# Default behavior - runs --complete automatically when no phases specified
python main_v3.py protocol.pdf

# With SAP document for analysis populations
python main_v3.py protocol.pdf --sap sap.pdf
```

`--complete` enables:
| Option | Description |
|---|---|
| `--full-protocol` | All 12 expansion phases (metadata, eligibility, objectives, etc.) |
| `--soa` | Full SoA extraction pipeline |
| `--enrich` | NCI terminology code enrichment |
| `--validate-schema` | USDM schema validation |
| `--conformance` | CDISC CORE conformance rules |
```bash
# Run SoA + enrichment + schema validation + CORE conformance
python main_v3.py protocol.pdf --soa

# Or run post-processing steps individually
python main_v3.py protocol.pdf --enrich            # Step 7: NCI terminology
python main_v3.py protocol.pdf --validate-schema   # Step 8: Schema validation
python main_v3.py protocol.pdf --conformance       # Step 9: CORE conformance
```

```
--output-dir, -o          Output directory (default: output/<protocol_name>)
--pages, -p               Specific SoA page numbers (comma-separated)
--no-validate             Skip vision validation
--remove-hallucinations   Remove cells not confirmed by vision (default: keep all)
--confidence-threshold    Confidence threshold for hallucination removal (default: 0.7)
--verbose, -v             Enable verbose output
--update-evs-cache        Update EVS terminology cache before enrichment
--update-cache            Update CDISC CORE rules cache (requires CDISC_API_KEY)
```

| Step | Description | Output File |
|---|---|---|
| 1 | Find SoA pages & analyze header structure (vision) | 4_header_structure.json |
| 2 | Extract SoA data from text | 5_raw_text_soa.json |
| 3 | Validate extraction against images | 6_validation_result.json |
| 4 | Build SoA output | 9_final_soa.json + 9_final_soa_provenance.json |
| Phase | Entities | Output File | CLI Flag |
|---|---|---|---|
| Metadata | StudyTitle, Organization, Indication | 2_study_metadata.json | `--metadata` |
| Eligibility | EligibilityCriterion, Population | 3_eligibility_criteria.json | `--eligibility` |
| Objectives | Objective, Endpoint, Estimand | 4_objectives_endpoints.json | `--objectives` |
| Study Design | StudyArm, StudyCell, StudyCohort | 5_study_design.json | `--studydesign` |
| Interventions | StudyIntervention, Product, Substance | 6_interventions.json | `--interventions` |
| Narrative | Abbreviation, NarrativeContent | 7_narrative_structure.json | `--narrative` |
| Advanced | StudyAmendment, GeographicScope, Country | 8_advanced_entities.json | `--advanced` |
| Procedures | Procedure, MedicalDevice, Ingredient | 9_procedures_devices.json | `--procedures` |
| Scheduling | Timing, Condition, TransitionRule | 10_scheduling_logic.json | `--scheduling` |
| Doc Structure | NarrativeContentItem, StudyDefinitionDocument | 13_document_structure.json | `--docstructure` |
| Amendments | StudyAmendmentReason, ImpactedEntity | 14_amendment_details.json | `--amendmentdetails` |
| Execution | TimeAnchor, Repetition, StateMachine | 11_execution_model.json | `--execution` |
| Source | Entities | Output File |
|---|---|---|
| SAP Document | AnalysisPopulation, Characteristic | 11_sap_populations.json |
| Site List | StudySite, StudyRole, AssignedPerson | 12_site_list.json |
| Step | Description | Output File |
|---|---|---|
| Combine | Merge all extractions | protocol_usdm.json ⭐ |
| Terminology | NCI EVS code enrichment | terminology_enrichment.json |
| Schema Fix | Auto-fix schema issues (UUIDs, Codes) | schema_validation.json |
| USDM Validation | Validate against official USDM package | usdm_validation.json |
| Conformance | CDISC CORE rules validation | conformance_report.json |
| ID Mapping | Simple ID → UUID mapping | id_mapping.json |
| Provenance | UUID-based provenance for viewer | protocol_usdm_provenance.json |
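The "Simple ID → UUID mapping" step can be pictured like this. It is a sketch only: the dict-of-strings mapping format and the helper names are assumptions for illustration, not the actual `id_mapping.json` schema.

```python
import uuid

def build_id_mapping(entities):
    """Assign a UUID to each simple extraction ID (e.g. 'enc_1')."""
    return {entity["id"]: str(uuid.uuid4()) for entity in entities}

def remap(entity, mapping):
    """Return a copy of an entity with its id and any *Id/*Ids
    reference fields rewritten through the mapping, so cross-entity
    references stay consistent after the swap."""
    out = {}
    for key, value in entity.items():
        if key == "id" or key.endswith("Id"):
            out[key] = mapping.get(value, value)
        elif key.endswith("Ids") and isinstance(value, list):
            out[key] = [mapping.get(v, v) for v in value]
        else:
            out[key] = value
    return out
```

Remapping references together with IDs is what keeps `previousId`/`nextId` chains and activity links intact in the final output.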
Primary output: output/<protocol>/protocol_usdm.json
The output follows the official USDM v4.0 schema from dataStructure.yml with proper entity placement:
```
Study → StudyVersion → StudyDesign
            │               │
            │               ├── eligibilityCriteria[]
            │               ├── indications[]
            │               ├── analysisPopulations[]
            │               ├── activities[].definedProcedures[]
            │               └── scheduleTimelines[].timings[], .exits[]
            │
            ├── eligibilityCriterionItems[]
            ├── organizations[]
            ├── narrativeContentItems[]
            ├── abbreviations[]
            ├── conditions[]
            ├── amendments[]
            ├── administrableProducts[]
            ├── medicalDevices[]
            └── studyInterventions[]
```
For detailed output structure and entity relationships, see docs/ARCHITECTURE.md.
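Consumers of `protocol_usdm.json` can walk the Study → StudyVersion → StudyDesign hierarchy directly. A minimal sketch, assuming the standard `study.versions[].studyDesigns[]` serialization keys:

```python
def list_activities(doc):
    """Collect activity names from a combined USDM document (a plain
    dict as loaded from protocol_usdm.json). Key names assume the
    standard USDM v4.0 serialization."""
    names = []
    for version in doc.get("study", {}).get("versions", []):
        for design in version.get("studyDesigns", []):
            for activity in design.get("activities", []):
                names.append(activity.get("name"))
    return names
```

The same traversal pattern applies to any of the collections in the diagram above (e.g. `eligibilityCriteria`, `scheduleTimelines`).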
Provenance metadata is stored separately in 9_final_soa_provenance.json and visualized in the web UI:
| Source | Color | Meaning |
|---|---|---|
| `both` | 🟩 Green | Confirmed (text + vision agree) |
| `text` | 🟦 Blue | Text-only (NOT confirmed by vision) |
| `vision` | 🟧 Orange | Vision-only (possible hallucination, needs review) |
| (none) | 🟥 Red | Orphaned (no provenance data) |
View provenance in the web UI by running cd web-ui && npm run dev.
Note: By default, all text-extracted cells are kept in the output. Use --remove-hallucinations to exclude cells not confirmed by vision.
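The keep-by-default policy plus `--remove-hallucinations`/`--confidence-threshold` can be modelled as a simple filter. This is a sketch: the `source`/`confidence` fields mirror the provenance metadata described above, but the function itself is illustrative, not the pipeline's code.

```python
def filter_cells(cells, remove_hallucinations=False, threshold=0.7):
    """Apply the hallucination-removal policy to a list of cell dicts.

    Default: keep everything. With remove_hallucinations=True, keep
    vision-confirmed cells ('both') unconditionally and other cells
    only when their confidence meets the threshold (0.7 by default,
    matching --confidence-threshold).
    """
    if not remove_hallucinations:
        return list(cells)
    kept = []
    for cell in cells:
        if cell.get("source") == "both":
            kept.append(cell)   # text + vision agree: always keep
        elif cell.get("confidence", 0.0) >= threshold:
            kept.append(cell)   # unconfirmed but high confidence
    return kept
```

Running with the flag off and on against the same extraction makes it easy to diff exactly which cells vision failed to confirm.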
Footnotes extracted from SoA tables are stored in StudyDesign.notes as USDM v4.0 CommentAnnotation objects:
"notes": [
{"id": "soa_fn_1", "text": "a. Within 32 days of administration", "instanceType": "CommentAnnotation"},
{"id": "soa_fn_2", "text": "b. Participants admitted 10 hours prior", "instanceType": "CommentAnnotation"}
]The web interface has been completely revamped from the legacy Streamlit app to a modern, user-friendly stack built with React 19, Next.js 16, TypeScript, TailwindCSS, and AG Grid.
```bash
cd web-ui
npm install
npm run dev
```

Then open http://localhost:3000 in your browser.
| Component | Technology |
|---|---|
| Framework | Next.js 16 with App Router |
| Language | TypeScript |
| Styling | TailwindCSS with dark mode support |
| Data Tables | AG Grid for high-performance SoA display |
| Visualization | Cytoscape.js for interactive timeline graphs |
| State Management | Zustand |
| Icons | Lucide React |
Protocol Overview:
- Study metadata (title, phase, indication, sponsor)
- Study identifiers (NCT, EudraCT, IND numbers)
- Amendment history
Schedule of Activities (SoA):
- Interactive AG Grid table with epoch/encounter groupings
- Color-coded provenance indicators (green=confirmed, blue=text-only, orange=vision-only)
- Footnote references with hover tooltips
- Activity filtering and search
Study Design:
- Study arms, epochs, and study cells
- Activity groups with child activity linking
- Transition rules between epochs
Eligibility & Objectives:
- Inclusion/exclusion criteria display
- Primary and secondary objectives
- Endpoints and estimands
Interventions & Procedures:
- Drug products with administration details
- Substances and ingredients
- Medical devices and procedures
Timeline Visualization:
- Interactive Cytoscape graph of epochs and encounters
- Node details panel with encounter information
- Execution model overlay with time anchors
Advanced Views:
- Execution model details (state machine, visit windows, dosing)
- Provenance explorer with source tracking
- Quality metrics dashboard
- Validation results
- Raw USDM JSON viewer
🚧 In Development: We intend to streamline the UI further and enable digital protocol (USDM JSON) editing directly via the UI. Some of these editing features are present but not fully functional yet. The goal is to allow users to:
- Edit USDM entities directly in the browser
- Save draft overlays without modifying source data
- Publish finalized edits back to the USDM JSON
- Track edit history and provenance
SoA extraction tested on Alexion Wilson's Disease protocol (Jan 2026):
| Model | Activities | Timepoints | Ticks | Expansion Phases | Recommendation |
|---|---|---|---|---|---|
| gemini-3-flash ⭐ | 36 ✅ | 24 ✅ | 216 | 12/12 ✅ | Optimized for this release |
| gemini-2.5-pro | 36 ✅ | 24 ✅ | 207 | 12/12 ✅ | Good fallback |
| claude-opus-4-5 | 36 ✅ | 24 ✅ | 212 | 12/12 ✅ | Good, higher cost |
| chatgpt-5.2 | 36 ✅ | 24 ✅ | 210 | 12/12 ✅ | Good alternative |
Notes:
- gemini-3-flash: This release is optimized for Gemini Flash 3. Best balance of speed, accuracy, and cost.
- gemini-2.5-pro: Used as automatic fallback for SoA text extraction when Gemini 3 has JSON compliance issues.
- claude-opus-4-5: High accuracy but significantly higher cost per extraction.
- chatgpt-5.2: Latest OpenAI model with good accuracy.
```
Protocol2USDMv3/
├── main_v3.py                  # Entry point (phase registry architecture)
├── llm_providers.py            # LLM provider abstraction layer
├── pipeline/                   # ⭐ NEW: Phase registry architecture
│   ├── __init__.py             # Package exports
│   ├── base_phase.py           # BasePhase class with extract/combine/save
│   ├── phase_registry.py       # Phase registration and discovery
│   ├── orchestrator.py         # Pipeline orchestration with parallel support
│   └── phases/                 # Individual phase implementations
│       ├── eligibility.py      # Eligibility criteria phase
│       ├── metadata.py         # Study metadata phase
│       ├── objectives.py       # Objectives & endpoints phase
│       ├── studydesign.py      # Study design phase
│       ├── interventions.py    # Interventions phase
│       ├── narrative.py        # Narrative structure phase
│       ├── advanced.py         # Advanced entities phase
│       ├── procedures.py       # Procedures & devices phase
│       ├── scheduling.py       # Scheduling logic phase
│       ├── docstructure.py     # Document structure phase
│       ├── amendmentdetails.py # Amendment details phase
│       └── execution.py        # Execution model phase
├── core/                       # Core modules
│   ├── usdm_schema_loader.py   # Official CDISC schema parser + USDMEntity base
│   ├── usdm_types_generated.py # 86+ USDM types (hand-written, schema-aligned)
│   ├── usdm_types.py           # Unified type interface
│   ├── llm_client.py           # LLM client utilities
│   ├── constants.py            # Centralized constants (DEFAULT_MODEL, etc.)
│   ├── evs_client.py           # NCI EVS API client with caching
│   ├── provenance.py           # ProvenanceTracker for source tracking
│   └── reconciliation/         # Entity reconciliation framework
├── extraction/                 # Extraction modules
│   ├── header_analyzer.py      # Vision-based structure
│   ├── text_extractor.py       # Text-based extraction
│   ├── pipeline.py             # SoA extraction pipeline
│   ├── pipeline_context.py     # Context passing between extractors
│   ├── execution/              # Execution model extractors (27 modules)
│   └── */                      # Domain extractors (13 modules)
├── enrichment/                 # Terminology enrichment
│   └── terminology.py          # NCI EVS enrichment
├── validation/                 # Validation package
│   ├── usdm_validator.py       # Official USDM validation
│   └── cdisc_conformance.py    # CDISC CORE conformance
├── scripts/                    # Utility scripts
│   ├── extractors/             # Standalone CLI extractors
│   └── debug/                  # Debug utilities
├── testing/                    # Benchmarking & integration tests
├── tests/                      # Unit tests
├── docs/                       # Architecture documentation
├── web-ui/                     # React/Next.js protocol viewer
├── tools/                      # External tools (CDISC CORE engine)
├── archive/                    # Archived legacy files
└── output/                     # Pipeline outputs
```
For detailed architecture, see docs/ARCHITECTURE.md.
```bash
# Run unit tests
pytest tests/

# Run integration tests
python testing/test_pipeline_steps.py

# Run golden standard comparison
python testing/compare_golden_vs_extracted.py

# Benchmark models
python testing/benchmark_models.py
```

```
# RECOMMENDED: Google Cloud Vertex AI (for Gemini models)
GOOGLE_CLOUD_PROJECT=your-project-id    # Your GCP project
GOOGLE_CLOUD_LOCATION=us-central1       # Region (us-central1, europe-west1, etc.)

# Alternative: Google AI Studio (may have safety restrictions)
GOOGLE_API_KEY=...                      # For Gemini via AI Studio (not recommended for clinical)

# Other providers
OPENAI_API_KEY=...                      # For GPT models
CLAUDE_API_KEY=...                      # For Claude models (Anthropic)

# Required for CDISC conformance validation
CDISC_API_KEY=...                       # For CORE rules cache (get from library.cdisc.org)
```
⚠️ Important: For clinical protocol extraction, use Vertex AI for Gemini models. AI Studio may block medical content even with safety settings disabled.
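A small sketch of how a model name can be mapped to the credentials it needs (illustrative only; the real `llm_providers.py` abstraction may differ):

```python
import os

def pick_provider(model_name):
    """Return the environment variables a model family needs.
    Gemini goes through Vertex AI, hence project + location rather
    than an API key. Raises on unknown families."""
    if model_name.startswith("gemini"):
        return ["GOOGLE_CLOUD_PROJECT", "GOOGLE_CLOUD_LOCATION"]
    if model_name.startswith("claude"):
        return ["CLAUDE_API_KEY"]
    if model_name.startswith(("chatgpt", "gpt")):
        return ["OPENAI_API_KEY"]
    raise ValueError(f"unknown model family: {model_name}")

def missing_credentials(model_name, env=os.environ):
    """List required variables that are unset or empty, useful as a
    pre-flight check before launching a long extraction run."""
    return [var for var in pick_provider(model_name) if not env.get(var)]
```

Running such a check at startup turns a mid-pipeline authentication failure into an immediate, readable error.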
Google (Optimized for this release):
- `gemini-3-flash` ⭐ Recommended - Pipeline optimized for this model
- `gemini-2.5-pro` - Good fallback, used automatically for SoA extraction

Anthropic:
- `claude-opus-4-5` - High accuracy, higher cost
- `claude-sonnet-4` - Good balance of speed and accuracy

OpenAI:
- `chatgpt-5.2` - Latest OpenAI model, good accuracy

Note: Other models are also supported via the unified provider interface. See `llm_providers.py` for the full list.
| Issue | Solution |
|---|---|
| API key error | Check .env file, restart terminal |
| Gemini blocks content | Use Vertex AI instead of AI Studio (see Configuration) |
| Missing visits | Verify correct SoA pages found (check 4_header_structure.json) |
| Parse errors | Try gemini-3-flash model, check verbose logs |
| Schema errors | Post-processing auto-fixes most issues |
| Safety filter errors | Ensure using Vertex AI with BLOCK_NONE settings |
The following items are planned for upcoming releases:
- Web UI Protocol Editing: Enable direct USDM JSON editing via browser with draft/publish workflow
- Biomedical Concepts: Add extraction via a separate comprehensive canonical model for standardized concept mapping
- Multi-Protocol Comparison: Compare USDM outputs across protocol versions
- Gemini Flash 3 Optimization: Pipeline optimized for `gemini-3-flash` with Vertex AI (completed v7.0)
- Execution Model Extraction: Time anchors, visit windows, state machine, dosing regimens (completed v7.0)
- Execution Model Promotion: Native USDM entities instead of extensions (completed v7.2)
- Pipeline Context Architecture: Context-aware extraction with accumulated results (completed v7.0)
- Entity Reconciliation Framework: LLM-based semantic mapping and ID preservation (completed v7.0)
- Modern Web UI: Complete React/Next.js revamp from Streamlit (completed v7.0)
- USDM 4.0 Alignment: All entities at correct locations per `dataStructure.yml` (completed v7.0)
- NCI Code Mappings: Dose forms, timing types, identifier types with NCI codes (completed v7.0)
- Repository Cleanup: Organized scripts, archived legacy files (completed v7.0)
Contact author for permission to use.
This project is, in many ways, a workflow wrapper around the incredible work done by CDISC and its volunteers.
A special thank you to the Data4Knowledge (D4K) team and the CDISC DDF/USDM community:
- DDF Reference Architecture - The USDM standard that powers this entire pipeline
- CDISC CORE Engine - Conformance validation engine and rules
- usdm Python Package - Official USDM validation library
Most importantly, heartfelt thanks to Dave Iberson-Hurst, Kirsten Walther Langendorf, and Johannes Ulander, who have been extraordinarily kind and supportive despite my repeated questions and pestering. Their openness in sharing their work and time enables projects like this to exist.
- NCI EVS for terminology services