Panikos/Protocol2USDM

Protocol2USDM

⚠️ Disclaimer: An Experiment in AI-Assisted Development

This project was built with one self-imposed rule: I would not write or edit a single line of code myself.

The goal was to answer a simple question:

"Can someone with no coding experience build production-quality clinical software using only AI assistants?"

Every line of code in this repository was generated by AI coding assistants (Claude, GPT, Gemini). The human role was limited to:

  • Describing requirements and desired outcomes
  • Reviewing and approving AI-generated code
  • Providing feedback and requesting corrections
  • Testing functionality and reporting issues

As a result, this software:

  • ❌ Is not robust: edge cases may not be handled
  • ❌ Is likely full of bugs: only limited systematic testing
  • ❌ Has not been thoroughly tested: no formal QA process
  • ❌ Is NOT fit for any production environment
  • ❌ Should NOT be used for actual clinical trials or regulatory submissions

This project exists solely as:

  • ✅ A demonstrator of AI-assisted development capabilities
  • ✅ A reference implementation of USDM v4.0 extraction concepts
  • ✅ An experiment in human-AI collaboration for software development

Use at your own risk. Contributions and improvements welcome.


Extract clinical protocol content into USDM v4.0 format

Protocol2USDM is an automated pipeline that extracts, validates, and structures clinical trial protocol content into data conformant to the CDISC USDM v4.0 model.


🚀 Try It Now

# Full extraction with SAP, sites, parallel execution
python main_v3.py input/trial/NCT04573309_Wilsons/NCT04573309_Wilsons_Protocol.pdf \
  --complete \
  --sap input/trial/NCT04573309_Wilsons/NCT04573309_Wilsons_SAP.pdf \
  --sites input/trial/NCT04573309_Wilsons/NCT04573309_Wilsons_sites.csv \
  --parallel \
  --model gemini-3-flash

This extracts the full protocol with the execution model, enriches entities with NCI terminology codes, and includes the SAP analysis populations (with STATO mapping and ARS linkage) and the site list.

💡 Default Model: gemini-3-flash-preview (Gemini Flash 3) via Vertex AI. Other models (claude-opus-4-5, claude-sonnet-4, chatgpt-5.2, gemini-2.5-pro) are supported. The pipeline defaults to --complete mode when no specific phases are requested.

⚠️ Important: Vertex AI Requirement for Gemini

Gemini models must be accessed via Google Cloud Vertex AI (not AI Studio) to properly disable safety controls. The consumer API (AI Studio) may still block or restrict medical/clinical content even with safety settings disabled. Vertex AI allows BLOCK_NONE safety settings, which are essential for clinical protocol extraction.

Required .env configuration for Gemini via Vertex AI:

GOOGLE_CLOUD_PROJECT=your-project-id
GOOGLE_CLOUD_LOCATION=us-central1  # or your preferred region
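For reference, the safety settings travel in the request body. The sketch below shows the shape of a generateContent payload with every harm category set to BLOCK_NONE; the category names follow the public Vertex AI REST API, but the payload builder and prompt are illustrative, not the project's own request code.

```python
# Sketch: a Vertex AI generateContent request body with safety filters
# disabled. Category names are from the Vertex AI REST API; the helper
# itself is a hypothetical illustration.
HARM_CATEGORIES = [
    "HARM_CATEGORY_HARASSMENT",
    "HARM_CATEGORY_HATE_SPEECH",
    "HARM_CATEGORY_SEXUALLY_EXPLICIT",
    "HARM_CATEGORY_DANGEROUS_CONTENT",
]

def build_payload(prompt: str) -> dict:
    """Build a request body that asks for no safety blocking."""
    return {
        "contents": [{"role": "user", "parts": [{"text": prompt}]}],
        "safetySettings": [
            {"category": c, "threshold": "BLOCK_NONE"} for c in HARM_CATEGORIES
        ],
    }

payload = build_payload("Extract the schedule of activities from this protocol.")
```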

What's New in v7.1

πŸ—οΈ Phase Registry Architecture (NEW)

  • main_v3.py - New refactored entry point with clean phase registry pattern
  • pipeline/ module - Modular phase definitions with dependency-aware execution
  • Parallel execution - Run independent phases concurrently with --parallel
  • Default --complete mode - Full extraction when no specific phases requested
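Dependency-aware parallel execution can be sketched roughly as follows; the phase names and dependency edges are illustrative, not the project's actual registry.

```python
# Sketch: run phases in waves, where a phase starts only after its
# dependencies finish. Phase names and edges are hypothetical; assumes
# an acyclic dependency graph.
from concurrent.futures import ThreadPoolExecutor

PHASES = {                      # phase -> phases it depends on
    "soa": set(),
    "metadata": set(),
    "scheduling": {"soa"},      # needs SoA epochs/encounters as context
    "execution": {"soa", "scheduling"},
}

def run_parallel(run_phase, max_workers=4):
    done, order = set(), []
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        while len(done) < len(PHASES):
            # every phase whose dependencies are all complete is runnable now
            ready = [p for p, deps in PHASES.items()
                     if p not in done and deps <= done]
            for p, _ in zip(ready, pool.map(run_phase, ready)):
                done.add(p)
                order.append(p)
    return order

order = run_parallel(lambda p: p)   # replace the lambda with a real phase runner
```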

🎯 Gemini Flash 3 Optimization

  • Pipeline optimized and tested with gemini-3-flash-preview as default model
  • Intelligent fallback to gemini-2.5-pro for SoA text extraction when needed
  • Response validation with automatic retry logic (up to 2 retries)
  • Stricter prompt guardrails for JSON format compliance

🧠 Execution Model Extraction & Promotion (v7.2)

  • Time Anchors: Extract temporal reference points (VISIT/EVENT/CONCEPTUAL classification)
  • Visit Windows: Timing tolerances → Timing.windowLower/windowUpper (ISO 8601)
  • Subject State Machine: Subject flow → TransitionRule on Encounter
  • Dosing Regimens: Drug administration → Administration entities
  • Repetitions: Cycle-based patterns → ScheduledActivityInstance expansion
  • Traversal Constraints: Subject journey → Epoch/Encounter.previousId/nextId chains
  • Footnote Conditions: Conditional rules → Condition + ScheduledDecisionInstance
  • Titration Schedules: Dose escalation → StudyElement with TransitionRule

v7.2 Promotion: Execution model data is now promoted to native USDM entities instead of extensions. Core USDM output is self-sufficient without parsing 11_execution_model.json.
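As an illustration of the visit-window promotion, here is a hedged sketch that maps protocol tolerance text onto the Timing.windowLower/windowUpper attributes as ISO 8601 durations. The tolerance formats and parsing logic are assumptions, not the project's extractor.

```python
import re

def promote_window(tolerance: str) -> dict:
    """Sketch: map a tolerance like '±3 days' or '+2/-1 days' onto
    USDM Timing window attributes as ISO 8601 durations."""
    m = re.match(r"±(\d+)\s*day", tolerance)
    if m:
        d = m.group(1)
        return {"windowLower": f"P{d}D", "windowUpper": f"P{d}D"}
    m = re.match(r"\+(\d+)/-(\d+)\s*day", tolerance)
    if m:
        # asymmetric window: minus side is the lower bound
        return {"windowLower": f"P{m.group(2)}D", "windowUpper": f"P{m.group(1)}D"}
    raise ValueError(f"unrecognized tolerance: {tolerance}")

promote_window("±3 days")   # -> {'windowLower': 'P3D', 'windowUpper': 'P3D'}
```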

🔄 Pipeline Context Architecture

  • Context-aware extraction where each phase builds on previous results
  • Extractors receive existing SoA entities (epochs, encounters, activities) as context
  • Eliminates arbitrary labels that require downstream resolution
  • Consistent ID references across USDM output
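The context idea can be sketched as a small accumulator that later extractors query for existing IDs. The class name echoes extraction/pipeline_context.py, but the fields and lookup method here are illustrative.

```python
# Sketch: accumulate SoA entities so later phases reference real IDs
# instead of inventing labels that need downstream resolution.
# Field names are hypothetical, not the project's actual context object.
from dataclasses import dataclass, field
from typing import Optional

@dataclass
class PipelineContext:
    epochs: dict = field(default_factory=dict)       # name -> id
    encounters: dict = field(default_factory=dict)
    activities: dict = field(default_factory=dict)

    def resolve_encounter(self, label: str) -> Optional[str]:
        """Later phases look up an existing ID rather than emitting a label."""
        return self.encounters.get(label.strip().lower())

ctx = PipelineContext()
ctx.encounters["screening"] = "enc_001"   # produced by the SoA phase
ctx.resolve_encounter("Screening")        # -> 'enc_001'
```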

🔗 Entity Reconciliation Framework

  • LLM-based semantic mapping of abstract concepts to protocol entities
  • Epoch, encounter, and activity reconciliation with ID preservation
  • Replaces fuzzy string matching with intelligent entity resolution

πŸ›οΈ USDM 4.0 Alignment

  • All entities placed at correct locations per CDISC dataStructure.yml
  • Proper entity hierarchy (studyVersion, studyDesign, scheduleTimeline, activity)
  • NCI code mappings for dose forms, timing types, and identifier types

Features

  • Gemini Flash 3 Optimized: Pipeline tuned for best results with gemini-3-flash via Vertex AI
  • Vision-Validated Extraction: Text extraction validated against actual PDF images
  • USDM v4.0 Aligned: Outputs follow official CDISC schema with proper entity hierarchy
  • Execution Model: Full subject state machine, time anchors, visit windows, and dosing regimens
  • NCI Terminology Enrichment: Automatic enrichment with official NCI codes via EVS API
  • Pipeline Context: Each extractor receives accumulated context from prior phases
  • Entity Reconciliation: LLM-based semantic mapping and ID preservation
  • Rich Provenance: Every cell tagged with source (text/vision/both) for confidence tracking
  • CDISC CORE Validation: Built-in conformance checking with local engine
  • Modern Web UI: Complete React/Next.js protocol viewer (see below)

Extraction Capabilities

| Module | Entities | CLI Flag |
|---|---|---|
| SoA | Activity, PlannedTimepoint, Epoch, Encounter, CommentAnnotation | (default) |
| Metadata | StudyTitle, StudyIdentifier, Organization, Indication | --metadata |
| Eligibility | EligibilityCriterion, StudyDesignPopulation | --eligibility |
| Objectives | Objective, Endpoint, Estimand | --objectives |
| Study Design | StudyArm, StudyCell, StudyCohort | --studydesign |
| Interventions | StudyIntervention, AdministrableProduct, Substance | --interventions |
| Narrative | NarrativeContent, Abbreviation | --narrative |
| Advanced | StudyAmendment, GeographicScope, Country | --advanced |
| Procedures | Procedure, MedicalDevice, Ingredient, Strength | --procedures |
| Scheduling | Timing, Condition, TransitionRule, ScheduleTimelineExit | --scheduling |
| Doc Structure | NarrativeContentItem, StudyDefinitionDocument | --docstructure |
| Amendments | StudyAmendmentReason, ImpactedEntity | --amendmentdetails |
| Execution Model | TimeAnchor, VisitWindow, StateMachine, Repetition | --execution |

Conditional Sources (Additional Documents)

| Source | Entities | CLI Flag |
|---|---|---|
| SAP | AnalysisPopulation, Characteristic | --sap <path> |
| Site List | StudySite, StudyRole, AssignedPerson | --sites <path> |

Full Protocol Extraction

Extract everything with a single command:

# Default behavior - no flags needed! Runs --complete automatically
python main_v3.py protocol.pdf

# Explicit --complete: Full extraction + all post-processing
python main_v3.py protocol.pdf --complete

# Parallel execution for faster processing
python main_v3.py protocol.pdf --parallel --max-workers 4

Or select specific phases:

python main_v3.py protocol.pdf --metadata --eligibility --objectives
python main_v3.py protocol.pdf --expansion-only --metadata  # Skip SoA
python main_v3.py protocol.pdf --procedures --scheduling --execution

With additional source documents:

python main_v3.py protocol.pdf --sap sap.pdf --sites sites.xlsx

Output: Individual JSONs + combined protocol_usdm.json


Quick Start

# 1. Clone repository
git clone https://github.com/Panikos/Protocol2USDMv3.git
cd Protocol2USDMv3

# 2. Install dependencies
pip install -r requirements.txt

# 3. Set up API keys (.env file)
GOOGLE_CLOUD_PROJECT=your-gcp-project  # Required for Gemini via Vertex AI
GOOGLE_CLOUD_LOCATION=us-central1
OPENAI_API_KEY=...      # Optional: for GPT models
CLAUDE_API_KEY=...      # Optional: for Claude models
CDISC_API_KEY=...       # Optional: for CORE conformance

# 4. Run the pipeline (defaults to --complete with gemini-3-flash-preview)
python main_v3.py input/trial/NCT04573309_Wilsons/NCT04573309_Wilsons_Protocol.pdf

# 5. View results in web UI
cd web-ui && npm run dev

Installation

Requirements

  • Python 3.9+
  • API keys: Google Cloud (Vertex AI) for the default Gemini models; OpenAI, Anthropic (Claude), and CDISC keys are optional

Setup

# Create virtual environment (recommended)
python -m venv venv
venv\Scripts\activate  # Windows
source venv/bin/activate  # macOS/Linux

# Install dependencies
pip install -r requirements.txt

# Create .env file with API keys
echo "OPENAI_API_KEY=sk-your-key" > .env
echo "GOOGLE_API_KEY=AIza-your-key" >> .env
echo "CDISC_API_KEY=your-cdisc-key" >> .env

CDISC CORE Engine (Optional)

For conformance validation, download the CORE engine:

python tools/core/download_core.py

Note: Get your CDISC API key from https://library.cdisc.org/ (requires CDISC membership)


Usage

Basic Usage

# main_v3.py is the recommended entry point (phase registry architecture)
python main_v3.py <protocol.pdf> [options]

# Legacy main_v2.py has been removed; use main_v3.py

Model Selection

# Default: gemini-3-flash-preview (no --model flag needed)
python main_v3.py protocol.pdf

# Gemini 2.5 Pro (good fallback)
python main_v3.py protocol.pdf --model gemini-2.5-pro

# Claude Opus 4.5 (high accuracy, higher cost)
python main_v3.py protocol.pdf --model claude-opus-4-5

# ChatGPT 5.2
python main_v3.py protocol.pdf --model chatgpt-5.2

Complete Extraction (Default)

# Default behavior - runs --complete automatically when no phases specified
python main_v3.py protocol.pdf

# With SAP document for analysis populations
python main_v3.py protocol.pdf --sap sap.pdf

--complete enables:

| Option | Description |
|---|---|
| --full-protocol | All 12 expansion phases (metadata, eligibility, objectives, etc.) |
| --soa | Full SoA extraction pipeline |
| --enrich | NCI terminology code enrichment |
| --validate-schema | USDM schema validation |
| --conformance | CDISC CORE conformance rules |

Full Pipeline with Post-Processing

# Run SoA + enrichment + schema validation + CORE conformance
python main_v3.py protocol.pdf --soa

# Or run post-processing steps individually
python main_v3.py protocol.pdf --enrich              # Step 7: NCI terminology
python main_v3.py protocol.pdf --validate-schema     # Step 8: Schema validation
python main_v3.py protocol.pdf --conformance         # Step 9: CORE conformance

Additional Options

--output-dir, -o           Output directory (default: output/<protocol_name>)
--pages, -p                Specific SoA page numbers (comma-separated)
--no-validate              Skip vision validation
--remove-hallucinations    Remove cells not confirmed by vision (default: keep all)
--confidence-threshold     Confidence threshold for hallucination removal (default: 0.7)
--verbose, -v              Enable verbose output
--update-evs-cache         Update EVS terminology cache before enrichment
--update-cache             Update CDISC CORE rules cache (requires CDISC_API_KEY)

Pipeline Steps

SoA Extraction (Steps 1-4)

| Step | Description | Output File |
|---|---|---|
| 1 | Find SoA pages & analyze header structure (vision) | 4_header_structure.json |
| 2 | Extract SoA data from text | 5_raw_text_soa.json |
| 3 | Validate extraction against images | 6_validation_result.json |
| 4 | Build SoA output | 9_final_soa.json + 9_final_soa_provenance.json |

Expansion Phases (with --full-protocol)

| Phase | Entities | Output File | CLI Flag |
|---|---|---|---|
| Metadata | StudyTitle, Organization, Indication | 2_study_metadata.json | --metadata |
| Eligibility | EligibilityCriterion, Population | 3_eligibility_criteria.json | --eligibility |
| Objectives | Objective, Endpoint, Estimand | 4_objectives_endpoints.json | --objectives |
| Study Design | StudyArm, StudyCell, StudyCohort | 5_study_design.json | --studydesign |
| Interventions | StudyIntervention, Product, Substance | 6_interventions.json | --interventions |
| Narrative | Abbreviation, NarrativeContent | 7_narrative_structure.json | --narrative |
| Advanced | StudyAmendment, GeographicScope, Country | 8_advanced_entities.json | --advanced |
| Procedures | Procedure, MedicalDevice, Ingredient | 9_procedures_devices.json | --procedures |
| Scheduling | Timing, Condition, TransitionRule | 10_scheduling_logic.json | --scheduling |
| Doc Structure | NarrativeContentItem, StudyDefinitionDocument | 13_document_structure.json | --docstructure |
| Amendments | StudyAmendmentReason, ImpactedEntity | 14_amendment_details.json | --amendmentdetails |
| Execution | TimeAnchor, Repetition, StateMachine | 11_execution_model.json | --execution |

Conditional Sources (with --sap or --sites)

| Source | Entities | Output File |
|---|---|---|
| SAP Document | AnalysisPopulation, Characteristic | 11_sap_populations.json |
| Site List | StudySite, StudyRole, AssignedPerson | 12_site_list.json |

Post-Processing

| Step | Description | Output File |
|---|---|---|
| Combine | Merge all extractions | protocol_usdm.json ⭐ |
| Terminology | NCI EVS code enrichment | terminology_enrichment.json |
| Schema Fix | Auto-fix schema issues (UUIDs, Codes) | schema_validation.json |
| USDM Validation | Validate against official USDM package | usdm_validation.json |
| Conformance | CDISC CORE rules validation | conformance_report.json |
| ID Mapping | Simple ID → UUID mapping | id_mapping.json |
| Provenance | UUID-based provenance for viewer | protocol_usdm_provenance.json |

Primary output: output/<protocol>/protocol_usdm.json
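Reading the combined output back is straightforward. A hedged sketch, assuming the study/versions/studyDesigns key casing of the USDM JSON (verify against your own protocol_usdm.json):

```python
import json

def list_activities(path: str) -> list:
    """Sketch: walk Study -> versions -> studyDesigns and return the
    activity names. Key casing is assumed, not taken from the project."""
    with open(path, encoding="utf-8") as f:
        usdm = json.load(f)
    design = usdm["study"]["versions"][0]["studyDesigns"][0]
    return [a.get("name", a["id"]) for a in design.get("activities", [])]
```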


Output Structure

The output follows the official USDM v4.0 schema from dataStructure.yml with proper entity placement:

Study → StudyVersion → StudyDesign
         │              │
         │              ├── eligibilityCriteria[]
         │              ├── indications[]
         │              ├── analysisPopulations[]
         │              ├── activities[].definedProcedures[]
         │              └── scheduleTimelines[].timings[], .exits[]
         │
         ├── eligibilityCriterionItems[]
         ├── organizations[]
         ├── narrativeContentItems[]
         ├── abbreviations[]
         ├── conditions[]
         ├── amendments[]
         ├── administrableProducts[]
         ├── medicalDevices[]
         └── studyInterventions[]

For detailed output structure and entity relationships, see docs/ARCHITECTURE.md.

Provenance Tracking

Provenance metadata is stored separately in 9_final_soa_provenance.json and visualized in the web UI:

| Source | Color | Meaning |
|---|---|---|
| both | 🟩 Green | Confirmed (text + vision agree) |
| text | 🟦 Blue | Text-only (NOT confirmed by vision) |
| vision | 🟧 Orange | Vision-only (possible hallucination, needs review) |
| (none) | 🔴 Red | Orphaned (no provenance data) |

View provenance in the web UI by running cd web-ui && npm run dev.

Note: By default, all text-extracted cells are kept in the output. Use --remove-hallucinations to exclude cells not confirmed by vision.
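Conceptually, hallucination removal keeps cells confirmed by both sources and drops single-source cells below the confidence threshold. A sketch with an assumed record shape (not the provenance file's actual schema):

```python
# Sketch of the --remove-hallucinations idea: "both"-sourced cells are
# always kept; single-source cells survive only above the threshold.
def filter_cells(cells, threshold=0.7):
    kept = []
    for cell in cells:
        if cell["source"] == "both" or cell.get("confidence", 0.0) >= threshold:
            kept.append(cell)
    return kept

cells = [
    {"id": "c1", "source": "both"},
    {"id": "c2", "source": "text", "confidence": 0.9},
    {"id": "c3", "source": "vision", "confidence": 0.4},
]
filter_cells(cells)   # keeps c1 and c2, drops the low-confidence c3
```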

SoA Footnotes

Footnotes extracted from SoA tables are stored in StudyDesign.notes as USDM v4.0 CommentAnnotation objects:

"notes": [
  {"id": "soa_fn_1", "text": "a. Within 32 days of administration", "instanceType": "CommentAnnotation"},
  {"id": "soa_fn_2", "text": "b. Participants admitted 10 hours prior", "instanceType": "CommentAnnotation"}
]
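Building such annotations from raw footnote lines is mechanical; a sketch whose id scheme mimics the example above:

```python
# Sketch: turn raw SoA footnote lines into USDM CommentAnnotation
# objects. Illustrative only; not the project's extractor.
def footnotes_to_annotations(lines):
    return [
        {"id": f"soa_fn_{i}", "text": line.strip(),
         "instanceType": "CommentAnnotation"}
        for i, line in enumerate(lines, start=1)
    ]

footnotes_to_annotations(["a. Within 32 days of administration"])
```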

Web UI (Complete Revamp)

The web interface has been completely revamped from the legacy Streamlit app to a modern, user-friendly stack built with React 19, Next.js 16, TypeScript, TailwindCSS, and AG Grid.

Launch the Web UI

cd web-ui
npm install
npm run dev

Then open http://localhost:3000 in your browser.

Technology Stack

| Component | Technology |
|---|---|
| Framework | Next.js 16 with App Router |
| Language | TypeScript |
| Styling | TailwindCSS with dark mode support |
| Data Tables | AG Grid for high-performance SoA display |
| Visualization | Cytoscape.js for interactive timeline graphs |
| State Management | Zustand |
| Icons | Lucide React |

What's Visible in the Web UI

Protocol Overview:

  • Study metadata (title, phase, indication, sponsor)
  • Study identifiers (NCT, EudraCT, IND numbers)
  • Amendment history

Schedule of Activities (SoA):

  • Interactive AG Grid table with epoch/encounter groupings
  • Color-coded provenance indicators (green=confirmed, blue=text-only, orange=vision-only)
  • Footnote references with hover tooltips
  • Activity filtering and search

Study Design:

  • Study arms, epochs, and study cells
  • Activity groups with child activity linking
  • Transition rules between epochs

Eligibility & Objectives:

  • Inclusion/exclusion criteria display
  • Primary and secondary objectives
  • Endpoints and estimands

Interventions & Procedures:

  • Drug products with administration details
  • Substances and ingredients
  • Medical devices and procedures

Timeline Visualization:

  • Interactive Cytoscape graph of epochs and encounters
  • Node details panel with encounter information
  • Execution model overlay with time anchors

Advanced Views:

  • Execution model details (state machine, visit windows, dosing)
  • Provenance explorer with source tracking
  • Quality metrics dashboard
  • Validation results
  • Raw USDM JSON viewer

Future Roadmap

🚧 In Development: We intend to streamline the UI further and enable digital protocol (USDM JSON) editing directly via the UI. Some of these editing features are present but not fully functional yet. The goal is to allow users to:

  • Edit USDM entities directly in the browser
  • Save draft overlays without modifying source data
  • Publish finalized edits back to the USDM JSON
  • Track edit history and provenance

Model Benchmark

SoA extraction tested on Alexion Wilson's Disease protocol (Jan 2026):

| Model | Activities | Timepoints | Ticks | Expansion Phases | Recommendation |
|---|---|---|---|---|---|
| gemini-3-flash ⭐ | 36 ✓ | 24 ✓ | 216 | 12/12 ✓ | Optimized for this release |
| gemini-2.5-pro | 36 ✓ | 24 ✓ | 207 | 12/12 ✓ | Good fallback |
| claude-opus-4-5 | 36 ✓ | 24 ✓ | 212 | 12/12 ✓ | Good, higher cost |
| chatgpt-5.2 | 36 ✓ | 24 ✓ | 210 | 12/12 ✓ | Good alternative |

Notes:

  • gemini-3-flash: This release is optimized for Gemini Flash 3. Best balance of speed, accuracy, and cost.
  • gemini-2.5-pro: Used as automatic fallback for SoA text extraction when Gemini 3 has JSON compliance issues.
  • claude-opus-4-5: High accuracy but significantly higher cost per extraction.
  • chatgpt-5.2: Latest OpenAI model with good accuracy.

Project Structure

Protocol2USDMv3/
├── main_v3.py                # Entry point (phase registry architecture)
├── llm_providers.py          # LLM provider abstraction layer
├── pipeline/                 # ⭐ NEW: Phase registry architecture
│   ├── __init__.py           # Package exports
│   ├── base_phase.py         # BasePhase class with extract/combine/save
│   ├── phase_registry.py     # Phase registration and discovery
│   ├── orchestrator.py       # Pipeline orchestration with parallel support
│   └── phases/               # Individual phase implementations
│       ├── eligibility.py    # Eligibility criteria phase
│       ├── metadata.py       # Study metadata phase
│       ├── objectives.py     # Objectives & endpoints phase
│       ├── studydesign.py    # Study design phase
│       ├── interventions.py  # Interventions phase
│       ├── narrative.py      # Narrative structure phase
│       ├── advanced.py       # Advanced entities phase
│       ├── procedures.py     # Procedures & devices phase
│       ├── scheduling.py     # Scheduling logic phase
│       ├── docstructure.py   # Document structure phase
│       ├── amendmentdetails.py # Amendment details phase
│       └── execution.py      # Execution model phase
├── core/                     # Core modules
│   ├── usdm_schema_loader.py # Official CDISC schema parser + USDMEntity base
│   ├── usdm_types_generated.py # 86+ USDM types (hand-written, schema-aligned)
│   ├── usdm_types.py         # Unified type interface
│   ├── llm_client.py         # LLM client utilities
│   ├── constants.py          # Centralized constants (DEFAULT_MODEL, etc.)
│   ├── evs_client.py         # NCI EVS API client with caching
│   ├── provenance.py         # ProvenanceTracker for source tracking
│   └── reconciliation/       # Entity reconciliation framework
├── extraction/               # Extraction modules
│   ├── header_analyzer.py    # Vision-based structure
│   ├── text_extractor.py     # Text-based extraction
│   ├── pipeline.py           # SoA extraction pipeline
│   ├── pipeline_context.py   # Context passing between extractors
│   ├── execution/            # Execution model extractors (27 modules)
│   └── */                    # Domain extractors (13 modules)
├── enrichment/               # Terminology enrichment
│   └── terminology.py        # NCI EVS enrichment
├── validation/               # Validation package
│   ├── usdm_validator.py     # Official USDM validation
│   └── cdisc_conformance.py  # CDISC CORE conformance
├── scripts/                  # Utility scripts
│   ├── extractors/           # Standalone CLI extractors
│   └── debug/                # Debug utilities
├── testing/                  # Benchmarking & integration tests
├── tests/                    # Unit tests
├── docs/                     # Architecture documentation
├── web-ui/                   # React/Next.js protocol viewer
├── tools/                    # External tools (CDISC CORE engine)
├── archive/                  # Archived legacy files
└── output/                   # Pipeline outputs

For detailed architecture, see docs/ARCHITECTURE.md.


Testing

# Run unit tests
pytest tests/

# Run integration tests
python testing/test_pipeline_steps.py

# Run golden standard comparison
python testing/compare_golden_vs_extracted.py

# Benchmark models
python testing/benchmark_models.py

Configuration

Environment Variables

# RECOMMENDED: Google Cloud Vertex AI (for Gemini models)
GOOGLE_CLOUD_PROJECT=your-project-id     # Your GCP project
GOOGLE_CLOUD_LOCATION=us-central1        # Region (us-central1, europe-west1, etc.)

# Alternative: Google AI Studio (may have safety restrictions)
GOOGLE_API_KEY=...          # For Gemini via AI Studio (not recommended for clinical)

# Other providers
OPENAI_API_KEY=...          # For GPT models
CLAUDE_API_KEY=...          # For Claude models (Anthropic)

# Required for CDISC conformance validation
CDISC_API_KEY=...           # For CORE rules cache (get from library.cdisc.org)

⚠️ Important: For clinical protocol extraction, use Vertex AI for Gemini models. AI Studio may block medical content even with safety settings disabled.

Supported Models

Google (Optimized for this release):

  • gemini-3-flash ⭐ Recommended - Pipeline optimized for this model
  • gemini-2.5-pro - Good fallback, used automatically for SoA extraction

Anthropic:

  • claude-opus-4-5 - High accuracy, higher cost
  • claude-sonnet-4 - Good balance of speed and accuracy

OpenAI:

  • chatgpt-5.2 - Latest OpenAI model, good accuracy

Note: Other models are also supported via the unified provider interface. See llm_providers.py for the full list.


Troubleshooting

| Issue | Solution |
|---|---|
| API key error | Check .env file, restart terminal |
| Gemini blocks content | Use Vertex AI instead of AI Studio (see Configuration) |
| Missing visits | Verify correct SoA pages found (check 4_header_structure.json) |
| Parse errors | Try gemini-3-flash model, check verbose logs |
| Schema errors | Post-processing auto-fixes most issues |
| Safety filter errors | Ensure using Vertex AI with BLOCK_NONE settings |

Roadmap / TODO

The following items are planned for upcoming releases:

  • Web UI Protocol Editing: Enable direct USDM JSON editing via browser with draft/publish workflow
  • Biomedical Concepts: Add extraction via a separate comprehensive canonical model for standardized concept mapping
  • Multi-Protocol Comparison: Compare USDM outputs across protocol versions
  • Gemini Flash 3 Optimization: Pipeline optimized for gemini-3-flash with Vertex AI (completed v7.0)
  • Execution Model Extraction: Time anchors, visit windows, state machine, dosing regimens (completed v7.0)
  • Execution Model Promotion: Native USDM entities instead of extensions (completed v7.2)
  • Pipeline Context Architecture: Context-aware extraction with accumulated results (completed v7.0)
  • Entity Reconciliation Framework: LLM-based semantic mapping and ID preservation (completed v7.0)
  • Modern Web UI: Complete React/Next.js revamp from Streamlit (completed v7.0)
  • USDM 4.0 Alignment: All entities at correct locations per dataStructure.yml (completed v7.0)
  • NCI Code Mappings: Dose forms, timing types, identifier types with NCI codes (completed v7.0)
  • Repository Cleanup: Organized scripts, archived legacy files (completed v7.0)

License

Contact author for permission to use.


Acknowledgments

This project is, in many ways, a workflow wrapper around the incredible work done by CDISC and its volunteers.

CDISC & Data4Knowledge

A special thank you to the Data4Knowledge (D4K) team and the CDISC DDF/USDM community:

Most importantly, heartfelt thanks to Dave Iberson-Hurst, Kirsten Walther Langendorf, and Johannes Ulander, who have been extraordinarily kind and supportive despite my repeated questions and pestering. Their openness in sharing their work and time enables projects like this to exist.
