# LangExtract Integration Demo

This notebook demonstrates the integration of Google's LangExtract library into the biomedical text agent for structured information extraction from medical literature.

## Features Demonstrated:
- Schema-aligned extraction classes
- Multi-pass extraction for improved recall
- Source grounding and visualization
- Integration with HPO/HGNC ontologies
- OpenRouter API support for free models
- End-to-end pipeline from text to structured data

## Setup and Installation

In [52]:
# Ensure latest code edits are loaded without restarting the kernel
import importlib
import sys

if '../src' not in sys.path:
    sys.path.append('../src')

import langextract_integration
import langextract_integration.extractor as _lx_extractor
import langextract_integration.normalizer as _lx_normalizer
import langextract_integration.schema_classes as _lx_schema

importlib.reload(langextract_integration)
importlib.reload(_lx_extractor)
importlib.reload(_lx_normalizer)
importlib.reload(_lx_schema)

print("🔄 LangExtract integration reloaded")


🔄 LangExtract integration reloaded


In [53]:
# Install required packages
# !pip install langextract openai pandas matplotlib seaborn plotly jupyter-widgets

# Import required libraries
import os
import sys
import json
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
import plotly.express as px
import plotly.graph_objects as go
from plotly.subplots import make_subplots
from IPython.display import HTML, display, Markdown
import warnings
warnings.filterwarnings('ignore')

# Set up paths
sys.path.append('../src')

print("✅ Setup completed!")

✅ Setup completed!


## Configuration

In [54]:
# Configuration
# Prefer OpenAI-compatible id to force OpenRouter over Ollama inside LangExtract
MODEL_ID = os.getenv("LANGEXTRACT_MODEL_ID", "gpt-4o-mini")

# Optional local (Ollama) configuration
USE_LOCAL_MODEL = os.getenv("USE_LOCAL_MODEL", "false").lower() in ("1", "true", "yes")
LOCAL_MODEL_ID = os.getenv("LOCAL_MODEL_ID", "llama3")
LOCAL_MODEL_URL = os.getenv("LOCAL_MODEL_URL", "http://localhost:11434")

# Load API key from environment for safety (cloud route)
OPENROUTER_API_KEY = os.getenv("OPENROUTER_API_KEY")
if not USE_LOCAL_MODEL:
    if not OPENROUTER_API_KEY:
        print("⚠️ OPENROUTER_API_KEY is not set. Set it in your environment to run extraction.")
    else:
        print("🔑 API Key detected in environment")
        # Ensure LangExtract and OpenAI SDKs route via OpenRouter
        os.environ["OPENAI_BASE_URL"] = "https://openrouter.ai/api/v1"
        os.environ["OPENAI_API_KEY"] = OPENROUTER_API_KEY
        os.environ["OPENROUTER_API_KEY"] = OPENROUTER_API_KEY

# Echo effective route
if USE_LOCAL_MODEL:
    print(f"🖥️ Using local model via Ollama: {LOCAL_MODEL_ID} @ {LOCAL_MODEL_URL}")
    EFFECTIVE_MODEL_ID = LOCAL_MODEL_ID
else:
    print(f"☁️ Using OpenRouter (OpenAI-compatible): {MODEL_ID}")
    EFFECTIVE_MODEL_ID = MODEL_ID

print(f"🤖 Effective model: {EFFECTIVE_MODEL_ID}")

🔑 API Key detected in environment
☁️ Using OpenRouter (OpenAI-compatible): gpt-4o-mini
🤖 Effective model: gpt-4o-mini


## Import LangExtract Integration

In [55]:
# Import our LangExtract integration
from langextract_integration import (
    LangExtractEngine,
    BiomedicExtractionClasses,
    BiomedicNormalizer , 
    extract_from_text
)

extraction_classes = BiomedicExtractionClasses()
normalizer = BiomedicNormalizer()

print("✅ LangExtract integration loaded")

print(f"📋 Available extraction classes: {list(extraction_classes.classes.keys())}")

[32m2025-08-26 14:31:04.347[0m | [1mINFO    [0m | [36montologies.gene_manager[0m:[36m_load_or_create_gene_data[0m:[36m42[0m - [1mLoaded 10 genes from data/ontologies/genes/hgnc_genes.json[0m


✅ LangExtract integration loaded
📋 Available extraction classes: ['Mutation', 'PhenotypeMention', 'TreatmentEvent', 'PatientRecord']


## Examine Extraction Schema

In [56]:
# Display the PatientRecord schema
patient_schema = extraction_classes.patient_record

print("🏥 PatientRecord Extraction Schema:")
print("=" * 50)
print(f"Class: {patient_schema.extraction_class}")
print(f"Description: {patient_schema.description}")
print("\nAttributes:")

for attr in patient_schema.attributes:
    print(f"  • {attr['name']} ({attr['type']}): {attr['description']}")
    if 'enum' in attr:
        print(f"    Allowed values: {attr['enum']}")
    print()

🏥 PatientRecord Extraction Schema:
Class: PatientRecord
Description: All fields needed to build one structured row for a single patient.

Attributes:
  • patient_label (string): Patient identifier from text (e.g., 'Patient 2', 'Case 1', 'P1')

  • sex (string): Map male→'m', female→'f' if stated. Use exact text indicators: girl/woman/female→'f', boy/man/male→'m'
    Allowed values: ['m', 'f']

  • age_of_onset_years (number): Age at symptom onset in years. Convert months to decimal years (e.g., 5 months → 0.42). Use earliest age if multiple onsets described.

  • age_at_diagnosis_years (number): Age at diagnosis in years if different from onset

  • last_seen_age_years (number): Age at last follow-up in years

  • alive_flag (integer): 0=alive, 1=deceased. Only set to 1 if death is explicitly mentioned.
    Allowed values: [0, 1]

  • consanguinity (boolean): Whether consanguineous parents are mentioned

  • family_history (string): Brief family history if mentioned

  • mutations (arr

## Sample Biomedical Text

In [57]:
# Sample biomedical case report text
sample_text = """
Patient 1 was a 3-year-old male with Leigh syndrome due to MT-ATP6 c.8993T>G (p.Leu156Arg). 
He presented with developmental delay and lactic acidosis at 6 months of age. Treatment included 
riboflavin 100 mg/day and coenzyme Q10 with clinical improvement. He is alive at last follow-up.

Patient 2 is a female who developed generalized weakness on the second day of fever when she was 
2 years and 5 months old, approximately half a month after measles vaccination. She later had 
recurrent episodes at 4 years 2 months. Molecular testing identified two variants in SLC19A3: 
c.26T>C (p.Leu9Pro) and c.980-7_980-4del. She received high-dose thiamine and biotin with 
clinical improvement. She is alive at last follow-up.

Patient 3 was a 15-month-old boy who presented with hypotonia, seizures, and failure to thrive. 
Genetic testing revealed a homozygous SURF1 mutation c.750G>A. Despite treatment with coenzyme Q10 
and thiamine, he died at 18 months of age due to respiratory failure.
"""

print("📄 Sample Text:")
print("=" * 50)
print(sample_text)
print(f"\n📊 Text length: {len(sample_text)} characters")

📄 Sample Text:

Patient 1 was a 3-year-old male with Leigh syndrome due to MT-ATP6 c.8993T>G (p.Leu156Arg). 
He presented with developmental delay and lactic acidosis at 6 months of age. Treatment included 
riboflavin 100 mg/day and coenzyme Q10 with clinical improvement. He is alive at last follow-up.

Patient 2 is a female who developed generalized weakness on the second day of fever when she was 
2 years and 5 months old, approximately half a month after measles vaccination. She later had 
recurrent episodes at 4 years 2 months. Molecular testing identified two variants in SLC19A3: 
c.26T>C (p.Leu9Pro) and c.980-7_980-4del. She received high-dose thiamine and biotin with 
clinical improvement. She is alive at last follow-up.

Patient 3 was a 15-month-old boy who presented with hypotonia, seizures, and failure to thrive. 
Genetic testing revealed a homozygous SURF1 mutation c.750G>A. Despite treatment with coenzyme Q10 
and thiamine, he died at 18 months of age due to respiratory fai

## Run LangExtract Extraction

In [58]:
# Initialize LangExtract engine
engine = LangExtractEngine(
    model_id=MODEL_ID,
    openrouter_api_key=OPENROUTER_API_KEY
)

print("🚀 Starting extraction...")

# Run extraction
try:
    results = engine.extract_from_text(
        text=sample_text,
        extraction_passes=2,  # Multiple passes for better recall
        max_workers=4,        # Parallel processing
        segment_patients=True, # Segment by patients
        include_visualization=True  # Generate HTML visualization
    )
    
    print("✅ Extraction completed successfully!")
    total_extractions = len(results.get('extractions') or results.get('original_extractions') or [])
    print(f"📊 Found {total_extractions} total extractions")
    print(f"🏥 Normalized {len(results.get('normalized_data', []))} patient records")
    
except Exception as e:
    print(f"❌ Extraction failed: {e}")
    results = None

[32m2025-08-26 14:31:13.820[0m | [1mINFO    [0m | [36montologies.gene_manager[0m:[36m_load_or_create_gene_data[0m:[36m42[0m - [1mLoaded 10 genes from data/ontologies/genes/hgnc_genes.json[0m
2025-08-26 14:31:13,877 - langextract.debug - DEBUG - [langextract.inference] CALL: BaseLanguageModel.__init__(self=<OpenAILanguageModel>, constraint=Constraint(co...NONE: 'none'>), kwargs={})
2025-08-26 14:31:13,878 - langextract.debug - DEBUG - [langextract.inference] RETURN: BaseLanguageModel.__init__ -> None (0.0 ms)
2025-08-26 14:31:13,878 - langextract.debug - DEBUG - [langextract.inference] CALL: BaseLanguageModel.apply_schema(self=<OpenAILanguageModel>, schema_instance=None)
2025-08-26 14:31:13,878 - langextract.debug - DEBUG - [langextract.inference] RETURN: BaseLanguageModel.apply_schema -> None (0.0 ms)
DEBUG:absl:Initialized Annotator with prompt:

You are a biomedical information extraction agent. Extract only facts that are explicitly stated in the provided context.

CORE 

🚀 Starting extraction...


[94m[1mLangExtract[0m: model=[92mgpt-4o-mini[0m [00:00]2025-08-26 14:31:13,881 - langextract.debug - DEBUG - [langextract.tokenizer] CALL: tokenize(text='Patient 1 was a 3-year-old male with Leigh syndrome due to MT-ATP6 c.8993T>G (p.Leu156Arg). \nHe presented with developmental delay and lactic acidosis at 6 months of age. Treatment included \nriboflavin 100 mg/day and coenzyme Q10 with clinical improvement. He is alive at last follow-up.')
2025-08-26 14:31:13,881 - langextract.debug - DEBUG - [langextract.tokenizer] RETURN: tokenize -> TokenizedText...wline=False)]) (0.2 ms)
INFO:absl:Processing batch 0 with length 1
DEBUG:absl:Token util returns string: Patient 1 was a 3-year-old male with Leigh syndrome due to MT-ATP6 c.8993T>G (p.Leu156Arg). 
He presented with developmental delay and lactic acidosis at 6 months of age. Treatment included 
riboflavin 100 mg/day and coenzyme Q10 with clinical improvement. He is alive at last follow-up. for tokenized_text: TokenizedText(text='Pa

[92m✓[0m Extraction processing complete



INFO:absl:Finalizing annotation for document ID doc_2c486ee5.
INFO:absl:Document annotation completed.
INFO:absl:Starting extraction pass 2 of 2
INFO:absl:Starting document annotation.
INFO:absl:Processing batch 0 with length 1
DEBUG:absl:Token util returns string: Patient 1 was a 3-year-old male with Leigh syndrome due to MT-ATP6 c.8993T>G (p.Leu156Arg). 
He presented with developmental delay and lactic acidosis at 6 months of age. Treatment included 
riboflavin 100 mg/day and coenzyme Q10 with clinical improvement. He is alive at last follow-up. for tokenized_text: TokenizedText(text='Patient 1 was a 3-year-old male with Leigh syndrome due to MT-ATP6 c.8993T>G (p.Leu156Arg). \nHe presented with developmental delay and lactic acidosis at 6 months of age. Treatment included \nriboflavin 100 mg/day and coenzyme Q10 with clinical improvement. He is alive at last follow-up.', tokens=[Token(index=0, token_type=<TokenType.WORD: 0>, char_interval=CharInterval(start_pos=0, end_pos=7), first_

[92m✓[0m Extracted [1m1[0m entities ([1m1[0m unique types)
  [96m•[0m Time: [1m23.20s[0m
  [96m•[0m Speed: [1m12[0m chars/sec
  [96m•[0m Chunks: [1m1[0m


2025-08-26 14:31:37,106 - langextract.debug - DEBUG - [langextract.inference] CALL: BaseLanguageModel.__init__(self=<OpenAILanguageModel>, constraint=Constraint(co...NONE: 'none'>), kwargs={})
2025-08-26 14:31:37,107 - langextract.debug - DEBUG - [langextract.inference] RETURN: BaseLanguageModel.__init__ -> None (0.0 ms)
2025-08-26 14:31:37,107 - langextract.debug - DEBUG - [langextract.inference] CALL: BaseLanguageModel.apply_schema(self=<OpenAILanguageModel>, schema_instance=None)
2025-08-26 14:31:37,107 - langextract.debug - DEBUG - [langextract.inference] RETURN: BaseLanguageModel.apply_schema -> None (0.0 ms)
DEBUG:absl:Initialized Annotator with prompt:

You are a biomedical information extraction agent. Extract only facts that are explicitly stated in the provided context.

CORE RULES:
1. If a value is missing in the context, return null or an empty list (do not infer or hallucinate).
2. Return strictly valid JSON that conforms to the attribute schemas and enumerations.
3. Use e

[92m✓[0m Extraction processing complete



INFO:absl:Finalizing annotation for document ID doc_452b4a79.
INFO:absl:Document annotation completed.
INFO:absl:Starting extraction pass 2 of 2
INFO:absl:Starting document annotation.
INFO:absl:Processing batch 0 with length 1
DEBUG:absl:Token util returns string: Patient 2 is a female who developed generalized weakness on the second day of fever when she was 
2 years and 5 months old, approximately half a month after measles vaccination. She later had 
recurrent episodes at 4 years 2 months. Molecular testing identified two variants in SLC19A3: 
c.26T>C (p.Leu9Pro) and c.980-7_980-4del. She received high-dose thiamine and biotin with 
clinical improvement. She is alive at last follow-up. for tokenized_text: TokenizedText(text='Patient 2 is a female who developed generalized weakness on the second day of fever when she was \n2 years and 5 months old, approximately half a month after measles vaccination. She later had \nrecurrent episodes at 4 years 2 months. Molecular testing identif

[92m✓[0m Extracted [1m1[0m entities ([1m1[0m unique types)
  [96m•[0m Time: [1m22.24s[0m
  [96m•[0m Speed: [1m19[0m chars/sec
  [96m•[0m Chunks: [1m1[0m


2025-08-26 14:31:59,380 - langextract.debug - DEBUG - [langextract.inference] CALL: BaseLanguageModel.__init__(self=<OpenAILanguageModel>, constraint=Constraint(co...NONE: 'none'>), kwargs={})
2025-08-26 14:31:59,380 - langextract.debug - DEBUG - [langextract.inference] RETURN: BaseLanguageModel.__init__ -> None (0.0 ms)
2025-08-26 14:31:59,381 - langextract.debug - DEBUG - [langextract.inference] CALL: BaseLanguageModel.apply_schema(self=<OpenAILanguageModel>, schema_instance=None)
2025-08-26 14:31:59,381 - langextract.debug - DEBUG - [langextract.inference] RETURN: BaseLanguageModel.apply_schema -> None (0.0 ms)
DEBUG:absl:Initialized Annotator with prompt:

You are a biomedical information extraction agent. Extract only facts that are explicitly stated in the provided context.

CORE RULES:
1. If a value is missing in the context, return null or an empty list (do not infer or hallucinate).
2. Return strictly valid JSON that conforms to the attribute schemas and enumerations.
3. Use e

[92m✓[0m Extraction processing complete



INFO:absl:Finalizing annotation for document ID doc_9203e667.
INFO:absl:Document annotation completed.
INFO:absl:Starting extraction pass 2 of 2
INFO:absl:Starting document annotation.
INFO:absl:Processing batch 0 with length 1
DEBUG:absl:Token util returns string: Patient 3 was a 15-month-old boy who presented with hypotonia, seizures, and failure to thrive. 
Genetic testing revealed a homozygous SURF1 mutation c.750G>A. Despite treatment with coenzyme Q10 
and thiamine, he died at 18 months of age due to respiratory failure. for tokenized_text: TokenizedText(text='Patient 3 was a 15-month-old boy who presented with hypotonia, seizures, and failure to thrive. \nGenetic testing revealed a homozygous SURF1 mutation c.750G>A. Despite treatment with coenzyme Q10 \nand thiamine, he died at 18 months of age due to respiratory failure.', tokens=[Token(index=0, token_type=<TokenType.WORD: 0>, char_interval=CharInterval(start_pos=0, end_pos=7), first_token_after_newline=False), Token(index=1,

[92m✓[0m Extracted [1m1[0m entities ([1m1[0m unique types)
  [96m•[0m Time: [1m23.27s[0m
  [96m•[0m Speed: [1m11[0m chars/sec
  [96m•[0m Chunks: [1m1[0m


ERROR:langextract_integration.extractor:Error generating visualization: Object of type Extraction is not JSON serializable


✅ Extraction completed successfully!
📊 Found 3 total extractions
🏥 Normalized 0 patient records


## Examine Raw Extractions

In [59]:
if results:
    print("🔍 Raw Extraction Results:")
    print("=" * 50)
    
    # Display first few extractions (prefer original_extractions if available)
    extractions = results.get('extractions') or results.get('original_extractions') or []
    
    for i, extraction in enumerate(extractions[:3]):
        print(f"\nExtraction {i+1}:")
        print(json.dumps(extraction, indent=2, default=str))
        print("-" * 30)
    
    if len(extractions) > 3:
        print(f"\n... and {len(extractions) - 3} more extractions")
else:
    print("❌ No results to display")

🔍 Raw Extraction Results:

Extraction 1:
"Extraction(extraction_class='PatientRecord', extraction_text='Patient 1 was a 3-year-old male with Leigh syndrome due to MT-ATP6 c.8993T>G (p.Leu156Arg). He presented with developmental delay and lactic acidosis at 6 months of age. Treatment included riboflavin 100 mg/day and coenzyme Q10 with clinical improvement. He is alive at last follow-up.', char_interval=CharInterval(start_pos=0, end_pos=287), alignment_status=<AlignmentStatus.MATCH_EXACT: 'match_exact'>, extraction_index=1, group_index=0, description=None, attributes={'patient_label': 'Patient 1', 'sex': 'm', 'age_of_onset_years': 0.5, 'age_at_diagnosis_years': 3, 'last_seen_age_years': 3, 'alive_flag': 1, 'consanguinity': None, 'family_history': None, 'mutations': [{'Mutation': {'gene': 'MT-ATP6', 'cdna': 'c.8993T>G', 'protein': 'p.Leu156Arg', 'zygosity': 'unknown', 'inheritance': None}}], 'phenotypes': [{'PhenotypeMention': {'surface_form': 'developmental delay', 'negated': False, 'on

## Examine Normalized Data

In [60]:
if results and 'normalized_data' in results:
    # Convert to DataFrame for better visualization
    df = pd.DataFrame(results['normalized_data'])
    
    print("📋 Normalized Patient Records:")
    print("=" * 50)
    print(f"Shape: {df.shape}")
    print(f"Columns: {list(df.columns)}")
    
    # Display the DataFrame
    display(df)
    
    # Show data types
    print("\n📊 Data Types:")
    print(df.dtypes)
    
else:
    print("❌ No normalized data available")
    df = None

📋 Normalized Patient Records:
Shape: (0, 0)
Columns: []



📊 Data Types:
Series([], dtype: object)


## Visualization of Extracted Data

In [61]:
if df is not None and len(df) > 0:
    # Create visualizations
    fig = make_subplots(
        rows=2, cols=2,
        subplot_titles=[
            'Patient Sex Distribution',
            'Age of Onset Distribution', 
            'Survival Status',
            'Gene Distribution'
        ],
        specs=[[{"type": "pie"}, {"type": "histogram"}],
               [{"type": "pie"}, {"type": "bar"}]]
    )
    
    # Sex distribution
    sex_counts = df['sex'].value_counts()
    fig.add_trace(
        go.Pie(labels=sex_counts.index, values=sex_counts.values, name="Sex"),
        row=1, col=1
    )
    
    # Age of onset histogram
    ages = df['age_of_onset'].dropna()
    if len(ages) > 0:
        fig.add_trace(
            go.Histogram(x=ages, name="Age of Onset", nbinsx=10),
            row=1, col=2
        )
    
    # Survival status
    survival_labels = ['Alive', 'Deceased']
    survival_counts = df['_0_alive_1_dead'].value_counts()
    fig.add_trace(
        go.Pie(labels=[survival_labels[i] for i in survival_counts.index], 
               values=survival_counts.values, name="Survival"),
        row=2, col=1
    )
    
    # Gene distribution
    gene_counts = df['gene'].value_counts().head(5)
    if len(gene_counts) > 0:
        fig.add_trace(
            go.Bar(x=gene_counts.index, y=gene_counts.values, name="Genes"),
            row=2, col=2
        )
    
    fig.update_layout(
        height=800,
        title_text="Biomedical Extraction Results - Overview",
        showlegend=False
    )
    
    fig.show()
    
else:
    print("❌ No data available for visualization")

❌ No data available for visualization


## Detailed Analysis

In [40]:
if df is not None and len(df) > 0:
    print("🔬 Detailed Analysis:")
    print("=" * 50)
    
    # Basic statistics
    print("📊 Basic Statistics:")
    print(f"  • Total patients: {len(df)}")
    print(f"  • Male patients: {(df['sex'] == 'm').sum()}")
    print(f"  • Female patients: {(df['sex'] == 'f').sum()}")
    print(f"  • Patients with age of onset: {df['age_of_onset'].notna().sum()}")
    print(f"  • Patients with genetic data: {df['gene'].notna().sum()}")
    print(f"  • Patients with phenotype data: {df['phenotypes_text'].notna().sum()}")
    print(f"  • Patients with treatment data: {df['treatments'].notna().sum()}")
    
    # Age statistics
    if df['age_of_onset'].notna().any():
        ages = df['age_of_onset'].dropna()
        print(f"\n📈 Age of Onset Statistics:")
        print(f"  • Mean: {ages.mean():.2f} years")
        print(f"  • Median: {ages.median():.2f} years")
        print(f"  • Range: {ages.min():.2f} - {ages.max():.2f} years")
    
    # Gene analysis
    genes = df['gene'].dropna().unique()
    print(f"\n🧬 Genetic Analysis:")
    print(f"  • Unique genes identified: {len(genes)}")
    print(f"  • Genes: {', '.join(genes)}")
    
    # Phenotype analysis
    phenotypes = df['phenotypes_text'].dropna()
    if len(phenotypes) > 0:
        all_phenotypes = []
        for pheno_text in phenotypes:
            all_phenotypes.extend([p.strip() for p in str(pheno_text).split(';')])
        
        unique_phenotypes = list(set(all_phenotypes))
        print(f"\n🏥 Phenotype Analysis:")
        print(f"  • Total phenotype mentions: {len(all_phenotypes)}")
        print(f"  • Unique phenotypes: {len(unique_phenotypes)}")
        print(f"  • Common phenotypes: {', '.join(unique_phenotypes[:5])}")
    
    # Quality assessment
    if 'normalization_quality' in df.columns:
        quality_scores = df['normalization_quality'].dropna()
        print(f"\n⭐ Quality Assessment:")
        print(f"  • Average quality score: {quality_scores.mean():.2f}")
        print(f"  • Quality range: {quality_scores.min():.2f} - {quality_scores.max():.2f}")
        
        # Quality distribution
        fig = px.histogram(
            x=quality_scores, 
            title="Quality Score Distribution",
            labels={'x': 'Quality Score', 'y': 'Count'},
            nbins=10
        )
        fig.show()

else:
    print("❌ No data available for analysis")

❌ No data available for analysis


## HPO and Gene Normalization

In [41]:
if results:
    print("🔗 Ontology Normalization Results:")
    print("=" * 50)
    
    # Check normalization metadata
    norm_meta = results.get('normalization_metadata', {})
    
    print(f"📊 Normalization Statistics:")
    print(f"  • Total patients processed: {norm_meta.get('total_patients', 0)}")
    print(f"  • HPO mappings found: {norm_meta.get('hpo_mappings', 0)}")
    print(f"  • Gene mappings found: {norm_meta.get('gene_mappings', 0)}")
    
    quality_metrics = norm_meta.get('quality_metrics', {})
    if quality_metrics:
        print(f"\n⭐ Quality Metrics:")
        print(f"  • Average quality: {quality_metrics.get('average_quality', 0):.2f}")
        print(f"  • Completeness: {quality_metrics.get('completeness', 0):.2f}")
        print(f"  • Complete records: {quality_metrics.get('complete_records', 0)}/{quality_metrics.get('total_records', 0)}")
    
    # Show HPO terms if available
    if df is not None:
        hpo_columns = [col for col in df.columns if col.startswith('hpo_')]
        if hpo_columns:
            print(f"\n🏥 HPO Terms Identified:")
            for col in hpo_columns:
                count = df[col].sum() if df[col].dtype in ['int64', 'float64'] else 0
                print(f"  • {col.replace('hpo_', '').replace('_', ' ').title()}: {count} patients")

else:
    print("❌ No normalization results available")

🔗 Ontology Normalization Results:
📊 Normalization Statistics:
  • Total patients processed: 0
  • HPO mappings found: 0
  • Gene mappings found: 0

⭐ Quality Metrics:
  • Average quality: 0.00
  • Completeness: 0.00
  • Complete records: 0/0


## Interactive Visualization

In [42]:
# Display the LangExtract interactive visualization if available
if results and 'visualization_html' in results:
    print("🎨 Interactive LangExtract Visualization:")
    print("=" * 50)
    
    # Display the HTML visualization
    html_content = results['visualization_html']
    
    # Save to file for viewing
    with open('langextract_visualization.html', 'w') as f:
        f.write(html_content)
    
    print("✅ Visualization saved to 'langextract_visualization.html'")
    print("📖 Open the file in a web browser to see the interactive visualization")
    
    # Display a preview (first 500 characters)
    print("\n📋 HTML Preview:")
    print(html_content[:500] + "...")
    
    # Try to display inline (may not work in all environments)
    try:
        display(HTML(html_content))
    except:
        print("⚠️ Inline display not supported in this environment")

else:
    print("❌ No visualization available")

🎨 Interactive LangExtract Visualization:
✅ Visualization saved to 'langextract_visualization.html'
📖 Open the file in a web browser to see the interactive visualization

📋 HTML Preview:
<html><body><h1>Visualization Error</h1><p>Object of type Extraction is not JSON serializable</p></body></html>...


## Save Results

In [43]:
if results:
    # Save results to files
    saved_files = engine.save_results(
        results=results,
        output_dir="./output",
        filename_prefix="langextract_demo"
    )
    
    print("💾 Results Saved:")
    print("=" * 50)
    
    for file_type, file_path in saved_files.items():
        print(f"  • {file_type.upper()}: {file_path}")
    
    print("\n✅ All results saved successfully!")

else:
    print("❌ No results to save")

💾 Results Saved:
  • JSON: output/langextract_demo_20250826_142546.json
  • JSONL: output/langextract_demo_20250826_142546.jsonl
  • HTML: output/langextract_demo_20250826_142546.html
  • CSV: output/langextract_demo_20250826_142546.csv

✅ All results saved successfully!


## Performance Analysis

In [44]:
if results:
    print("⚡ Performance Analysis:")
    print("=" * 50)
    
    # Extract metadata
    extraction_meta = results.get('extraction_metadata', {})
    norm_meta = results.get('normalization_metadata', {})
    
    # Text statistics
    text_length = len(sample_text)
    total_extractions = len(results.get('extractions', []))
    total_patients = len(results.get('normalized_data', []))
    
    print(f"📊 Processing Statistics:")
    print(f"  • Input text length: {text_length:,} characters")
    print(f"  • Total extractions: {total_extractions}")
    print(f"  • Patients identified: {total_patients}")
    print(f"  • Extractions per patient: {total_extractions/total_patients:.1f}" if total_patients > 0 else "")
    print(f"  • Characters per extraction: {text_length/total_extractions:.0f}" if total_extractions > 0 else "")
    
    # Model information
    print(f"\n🤖 Model Information:")
    print(f"  • Model used: {MODEL_ID}")
    print(f"  • Extraction passes: 2")
    print(f"  • Parallel workers: 4")
    
    # Quality metrics
    quality_metrics = norm_meta.get('quality_metrics', {})
    if quality_metrics:
        print(f"\n⭐ Quality Metrics:")
        print(f"  • Average quality score: {quality_metrics.get('average_quality', 0):.2%}")
        print(f"  • Data completeness: {quality_metrics.get('completeness', 0):.2%}")
        print(f"  • Records with complete data: {quality_metrics.get('complete_records', 0)}/{quality_metrics.get('total_records', 0)}")
    
    # Create performance visualization
    metrics_data = {
        'Metric': ['Extractions', 'Patients', 'Quality Score', 'Completeness'],
        'Value': [
            total_extractions,
            total_patients,
            quality_metrics.get('average_quality', 0) * 100,
            quality_metrics.get('completeness', 0) * 100
        ],
        'Type': ['Count', 'Count', 'Percentage', 'Percentage']
    }
    
    fig = px.bar(
        metrics_data,
        x='Metric',
        y='Value',
        color='Type',
        title='LangExtract Performance Metrics',
        labels={'Value': 'Score/Count'}
    )
    
    fig.show()

else:
    print("❌ No performance data available")

⚡ Performance Analysis:
📊 Processing Statistics:
  • Input text length: 991 characters
  • Total extractions: 0
  • Patients identified: 0



🤖 Model Information:
  • Model used: gpt-4o-mini
  • Extraction passes: 2
  • Parallel workers: 4

⭐ Quality Metrics:
  • Average quality score: 0.00%
  • Data completeness: 0.00%
  • Records with complete data: 0/0


ValueError: Mime type rendering requires nbformat>=4.2.0 but it is not installed

## Test with Different Models

In [None]:
# Test with different free models from OpenRouter
test_models = [
    "google/gemma-2-27b-it:free",
    "microsoft/phi-3-mini-128k-instruct:free",
    "huggingfaceh4/zephyr-7b-beta:free"
]

# Shorter test text for model comparison
short_test_text = """
Patient 1 was a 3-year-old male with Leigh syndrome due to MT-ATP6 c.8993T>G (p.Leu156Arg). 
He presented with developmental delay and lactic acidosis. Treatment included riboflavin 100 mg/day 
with clinical improvement. He is alive at last follow-up.
"""

model_results = {}

print("🔬 Model Comparison Test:")
print("=" * 50)

for model_id in test_models:
    print(f"\n🤖 Testing model: {model_id}")
    
    try:
        # Create engine for this model
        test_engine = LangExtractEngine(
            model_id=model_id,
            openrouter_api_key=OPENROUTER_API_KEY
        )
        
        # Run extraction
        test_result = test_engine.extract_from_text(
            text=short_test_text,
            extraction_passes=1,  # Single pass for speed
            segment_patients=False,
            include_visualization=False
        )
        
        # Store results
        model_results[model_id] = {
            'extractions': len(test_result.get('extractions', [])),
            'patients': len(test_result.get('normalized_data', [])),
            'quality': test_result.get('normalization_metadata', {}).get('quality_metrics', {}).get('average_quality', 0)
        }
        
        print(f"  ✅ Success: {model_results[model_id]['extractions']} extractions, {model_results[model_id]['patients']} patients")
        
    except Exception as e:
        print(f"  ❌ Failed: {str(e)[:100]}...")
        model_results[model_id] = {'extractions': 0, 'patients': 0, 'quality': 0}

# Create comparison visualization
if model_results:
    comparison_df = pd.DataFrame(model_results).T
    comparison_df.index.name = 'Model'
    comparison_df = comparison_df.reset_index()
    
    print("\n📊 Model Comparison Results:")
    display(comparison_df)
    
    # Visualization
    fig = px.bar(
        comparison_df,
        x='Model',
        y=['extractions', 'patients'],
        title='Model Performance Comparison',
        barmode='group'
    )
    
    fig.update_xaxes(tickangle=45)
    fig.show()

## Ground Truth Evaluation (Optional)

In [None]:
# If you have ground truth data, you can evaluate against it
ground_truth_file = "../data/manually_processed.csv"

if os.path.exists(ground_truth_file) and results:
    print("📊 Ground Truth Evaluation:")
    print("=" * 50)
    
    try:
        evaluation = engine.evaluate_against_ground_truth(
            extraction_results=results,
            ground_truth_file=ground_truth_file
        )
        
        print(f"✅ Evaluation completed")
        print(f"📊 Ground truth records: {evaluation.get('ground_truth_records', 0)}")
        print(f"📊 Extracted records: {evaluation.get('extracted_records', 0)}")
        
        # Overall metrics
        overall_metrics = evaluation.get('overall_metrics', {})
        if overall_metrics:
            print(f"\n⭐ Overall Performance:")
            print(f"  • Overall accuracy: {overall_metrics.get('overall_accuracy', 0):.2%}")
            print(f"  • Overall F1 score: {overall_metrics.get('overall_f1', 0):.2%}")
            print(f"  • Fields evaluated: {overall_metrics.get('fields_evaluated', 0)}")
        
        # Field-by-field comparison
        field_comparisons = evaluation.get('field_comparisons', {})
        if field_comparisons:
            print(f"\n📋 Field-by-Field Performance:")
            
            field_df = pd.DataFrame(field_comparisons).T
            field_df = field_df.round(3)
            display(field_df)
            
            # Visualization
            fig = px.bar(
                field_df.reset_index(),
                x='index',
                y=['precision', 'recall', 'f1', 'accuracy'],
                title='Field-by-Field Performance Metrics',
                barmode='group'
            )
            
            fig.update_xaxes(title='Field', tickangle=45)
            fig.update_yaxes(title='Score')
            fig.show()
        
    except Exception as e:
        print(f"❌ Evaluation failed: {e}")

else:
    print("⚠️ Ground truth file not found or no results available")
    print(f"Looking for: {ground_truth_file}")

## Summary and Conclusions

In [None]:
print("📋 LangExtract Integration Summary:")
print("=" * 50)

if results:
    total_extractions = len(results.get('extractions', []))
    total_patients = len(results.get('normalized_data', []))
    
    print(f"✅ Successfully processed biomedical text")
    print(f"📊 Extracted {total_extractions} structured elements")
    print(f"🏥 Identified {total_patients} patient records")
    print(f"🔗 Integrated with HPO/HGNC ontologies")
    print(f"📈 Generated interactive visualizations")
    print(f"💾 Saved results in multiple formats")
    
    print(f"\n🎯 Key Benefits Demonstrated:")
    print(f"  • Schema-faithful extraction with precise source grounding")
    print(f"  • Multi-pass processing for improved recall")
    print(f"  • Automatic patient segmentation")
    print(f"  • Ontology normalization (HPO, HGNC)")
    print(f"  • Quality assessment and validation")
    print(f"  • Interactive visualization for review")
    print(f"  • Support for free OpenRouter models")
    
    print(f"\n🚀 Next Steps:")
    print(f"  • Scale to larger document collections")
    print(f"  • Integrate with existing RAG system")
    print(f"  • Add feedback loop for continuous improvement")
    print(f"  • Deploy in production pipeline")
    print(f"  • Expand to clinical trials and patents")

else:
    print(f"❌ Extraction failed - check API key and model availability")
    print(f"🔧 Troubleshooting:")
    print(f"  • Verify OpenRouter API key is valid")
    print(f"  • Check model availability and rate limits")
    print(f"  • Ensure LangExtract is properly installed")
    print(f"  • Review error messages above")

print(f"\n🎉 Demo completed!")