[![Open In Colab](https://colab.research.google.com/assets/colab-badge.svg)](https://colab.research.google.com/github/Hawksight-AI/semantica/blob/main/cookbook/use_cases/healthcare/01_Clinical_Reports_Processing.ipynb)

# Clinical Reports Processing - EHR Integration & Triplet Stores

## Overview

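This notebook demonstrates **clinical reports processing** with Semantica, focusing on **EHR integration**, **triplet stores**, and **patient knowledge graphs**. The pipeline ingests clinical documents, builds a patient knowledge graph from them, and stores the resulting relationships in a triplet store. The cells below use an FDA RSS feed with a sample report as fallback, and show HL7/FHIR database ingestion only as commented-out code.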

### Key Features

- **EHR Integration**: Ingests data from EHR systems and HL7/FHIR APIs
- **Triplet Store Storage**: Stores patient data in RDF triplet stores
- **Patient Knowledge Graphs**: Builds comprehensive patient KGs
- **Medical Entity Extraction**: Extracts medical entities from clinical reports
- **Structured Data Storage**: Emphasizes storage and structured data management

### Pipeline Architecture

1. **Phase 0**: Setup & Configuration
2. **Phase 1**: EHR/HL7/FHIR Data Ingestion
3. **Phase 2**: Clinical Report Parsing
4. **Phase 3**: Medical Entity Extraction
5. **Phase 4**: Patient Knowledge Graph Construction
6. **Phase 5**: Triplet Store Population
7. **Phase 6**: Visualization & Export

---

## Installation


In [None]:
%pip install -qU semantica networkx matplotlib plotly pandas groq rdflib


---

## Phase 0: Setup & Configuration


In [None]:
import os
from semantica.core import Semantica, ConfigManager
from semantica.triplet_store import TripletStore

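# Keep an existing GROQ_API_KEY if one is set; otherwise replace the "your-key" placeholder before running
os.environ["GROQ_API_KEY"] = os.getenv("GROQ_API_KEY", "your-key")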

config_dict = {
    "project_name": "Clinical_Reports_Processing",
    "extraction": {"provider": "groq", "model": "llama-3.1-8b-instant"},
    "knowledge_graph": {"backend": "networkx"}
}

config = ConfigManager().load_from_dict(config_dict)
core = Semantica(config=config)
triplet_store = TripletStore(backend="jena")  # or "blazegraph"
print("Configured for clinical reports processing with triplet store focus")


---

## Phase 1: Real Data Ingestion (FDA RSS Feed & HL7/FHIR Structure)

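Ingest documents from the FDA press-release RSS feed. HL7/FHIR database ingestion is shown only as commented-out code, and if the feed returns nothing the cell falls back to a small sample clinical report.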
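
To make concrete what RSS ingestion yields, the sketch below parses a small inline RSS 2.0 document with Python's standard library. It is not how `FeedIngestor` works internally; the sample feed, the `parse_rss_items` helper, and the output field names are assumptions made for illustration.

```python
import xml.etree.ElementTree as ET

# Hypothetical two-item feed, shaped like a standard RSS 2.0 document.
SAMPLE_RSS = """<rss version="2.0">
  <channel>
    <title>Example press releases</title>
    <item>
      <title>Example release A</title>
      <link>https://example.org/a</link>
      <description>First example body.</description>
    </item>
    <item>
      <title>Example release B</title>
      <link>https://example.org/b</link>
      <description>Second example body.</description>
    </item>
  </channel>
</rss>"""


def parse_rss_items(xml_text):
    """Return one dict per <item> with its title, link, and description text."""
    root = ET.fromstring(xml_text)
    items = []
    for item in root.iter("item"):
        items.append({
            "title": (item.findtext("title") or "").strip(),
            "link": (item.findtext("link") or "").strip(),
            "content": (item.findtext("description") or "").strip(),
        })
    return items


feed_items = parse_rss_items(SAMPLE_RSS)
print(len(feed_items), "items parsed")
```

Each resulting dict carries the text that a downstream normalizer or extractor would consume; a real feed would be fetched over HTTP first.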


In [None]:
from semantica.ingest import FeedIngestor, DBIngestor, FileIngestor
import os

os.makedirs("data", exist_ok=True)

documents = []

# Option 1: Ingest from FDA RSS feed (real data source)
fda_rss_url = "https://www.fda.gov/about-fda/contact-fda/stay-informed/rss-feeds/fda-press-releases"

try:
    feed_ingestor = FeedIngestor()
    feed_documents = feed_ingestor.ingest(fda_rss_url, method="rss")
    print(f"Ingested {len(feed_documents)} documents from FDA RSS feed")
    documents.extend(feed_documents)
except Exception as e:
    print(f"FDA RSS feed ingestion failed: {e}")

# Option 2: HL7/FHIR database ingestion (not executed here; shown for reference only)
# In production:
# db_ingestor = DBIngestor()
# db_ingestor.connect("postgresql://user:pass@localhost/emr_db")
# fhir_documents = db_ingestor.ingest("SELECT * FROM patient_records", method="postgresql")

# Fallback: Sample clinical report data
if not documents:
    clinical_data = """
    Patient ID: P001, Name: John Doe, DOB: 1980-01-15
    Diagnosis: Type 2 Diabetes, Date: 2024-01-10
    Treatment: Metformin 500mg twice daily, Started: 2024-01-10
    Procedure: Blood glucose test, Date: 2024-01-15, Result: 180 mg/dL
    Patient ID: P002, Name: Jane Smith, DOB: 1975-05-20
    Diagnosis: Hypertension, Date: 2024-01-12
    """
    with open("data/clinical_report.txt", "w", encoding="utf-8") as f:
        f.write(clinical_data)
    documents = FileIngestor().ingest("data/clinical_report.txt")
    print(f"Ingested {len(documents)} documents from sample data")
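
The fallback report is line-oriented, so it can also be split into per-patient records without an LLM. The following is a minimal sketch that assumes every line follows the `Key: value, Key: value` pattern of the sample above; `split_fields` and `parse_report` are hypothetical helpers, not part of Semantica.

```python
SAMPLE_REPORT = """
Patient ID: P001, Name: John Doe, DOB: 1980-01-15
Diagnosis: Type 2 Diabetes, Date: 2024-01-10
Treatment: Metformin 500mg twice daily, Started: 2024-01-10
Procedure: Blood glucose test, Date: 2024-01-15, Result: 180 mg/dL
Patient ID: P002, Name: Jane Smith, DOB: 1975-05-20
Diagnosis: Hypertension, Date: 2024-01-12
"""


def split_fields(line):
    """Turn 'Key: value, Key: value' into a dict (assumes no ', ' inside values)."""
    fields = {}
    for part in line.split(", "):
        key, _, value = part.partition(": ")
        fields[key.strip()] = value.strip()
    return fields


def parse_report(text):
    """Group lines into records: a 'Patient ID' line starts a new record,
    and every following line is attached to that patient as an event."""
    records = []
    for raw_line in text.splitlines():
        line = raw_line.strip()
        if not line:
            continue
        fields = split_fields(line)
        if "Patient ID" in fields:
            records.append({"patient": fields, "events": []})
        elif records:
            records[-1]["events"].append(fields)
    return records


records = parse_report(SAMPLE_REPORT)
for record in records:
    print(record["patient"]["Patient ID"], len(record["events"]), "events")
```

A value that itself contains `", "` would break this simple split, which is one reason the notebook relies on LLM-based extraction for free-text reports.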


---

## Phase 2: Text Normalization & Deduplication

Normalize the ingested text, build the patient knowledge graph, and deduplicate the extracted patient entities.
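
As a rough illustration of what threshold-based duplicate detection means, the sketch below normalizes patient names and flags pairs whose string similarity is at least 0.9. It uses `difflib.SequenceMatcher` as a stand-in and is not the algorithm `DuplicateDetector` implements; the `name` field, the helper names, and the sample records are assumptions.

```python
from difflib import SequenceMatcher


def normalize_name(name):
    """Lowercase and collapse runs of whitespace."""
    return " ".join(name.lower().split())


def similarity(a, b):
    """Similarity ratio in [0, 1] between two normalized names."""
    return SequenceMatcher(None, normalize_name(a), normalize_name(b)).ratio()


def find_duplicate_pairs(records, threshold=0.9):
    """Return index pairs (i, j), i < j, whose names meet the threshold."""
    pairs = []
    for i in range(len(records)):
        for j in range(i + 1, len(records)):
            if similarity(records[i]["name"], records[j]["name"]) >= threshold:
                pairs.append((i, j))
    return pairs


# Hypothetical patient entities; the first two differ only in case and spacing.
sample_patients = [
    {"name": "John Doe"},
    {"name": "john  doe"},
    {"name": "Jane Smith"},
]

duplicate_pairs = find_duplicate_pairs(sample_patients, threshold=0.9)
print(duplicate_pairs)
```

The pairwise loop is quadratic in the number of records, which is acceptable for a demo but would need blocking or indexing on a real patient registry.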


In [None]:
from semantica.normalize import TextNormalizer
from semantica.deduplication import DuplicateDetector

# Normalize medical terms
normalizer = TextNormalizer()
normalized_documents = []
for doc in documents:
    normalized_text = normalizer.normalize(
        doc.content if hasattr(doc, 'content') else str(doc),
        clean_html=True,
        normalize_entities=True,
        remove_extra_whitespace=True
    )
    normalized_documents.append(normalized_text)

print(f"Normalized {len(normalized_documents)} documents")

# Build patient knowledge graph
result = core.build_knowledge_base(
    sources=normalized_documents,
    custom_entity_types=["Patient", "Diagnosis", "Treatment", "Procedure", "EHR"],
    graph=True
)

kg = result["knowledge_graph"]
entities = result["entities"]

# Deduplicate patient records
# Case-insensitive match on the entity type; tolerates a missing or None type
patients = [e for e in entities if "patient" in (e.get("type") or "").lower()]

detector = DuplicateDetector()
duplicates = detector.detect_duplicates(patients, threshold=0.9)
deduplicated_patients = detector.resolve_duplicates(patients, duplicates)

print(f"Built patient KG with {len(kg.get('entities', []))} entities")
print(f"Deduplicated: {len(patients)} -> {len(deduplicated_patients)} unique patients")
print("Focus: EHR integration, triplet stores, patient KGs, structured data storage")


---

## Phase 5: Triplet Store Population


In [None]:
# Store patient data in triplet store
triplets = []
for rel in kg.get("relationships", []):
    triplets.append({
        "subject": rel.get("source"),
        "predicate": rel.get("predicate"),
        "object": rel.get("target")
    })

triplet_store.store_triplets(triplets)
print(f"Stored {len(triplets)} triplets in triplet store")
print("\n=== Pipeline Summary ===")
print(f"✓ Ingested {len(documents)} documents (FDA RSS feed or sample fallback)")
print(f"✓ Normalized {len(normalized_documents)} documents")
print(f"✓ Deduplicated {len(patients)} patients to {len(deduplicated_patients)} unique")
print(f"✓ Built patient KG with {len(kg.get('entities', []))} entities")
print(f"✓ Stored {len(triplets)} triplets in triplet store")
print("✓ This cookbook emphasizes triplet store storage and structured data management")
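
For readers unfamiliar with how subject/predicate/object dicts map onto RDF, here is a minimal sketch that serializes them as N-Triples lines using only the standard library. The `http://example.org/clinical/` namespace, the helper names, and the sample relationships are hypothetical, and every term is emitted as an IRI for simplicity; this is not how `TripletStore` serializes data.

```python
from urllib.parse import quote

BASE = "http://example.org/clinical/"  # hypothetical namespace for this sketch


def to_iri(name):
    """Build an IRI in angle brackets from a free-text label."""
    local = quote(name.strip().replace(" ", "_"), safe="")
    return f"<{BASE}{local}>"


def to_ntriples(triplets):
    """Serialize complete triplets, one N-Triples statement per line."""
    lines = []
    for t in triplets:
        s, p, o = t.get("subject"), t.get("predicate"), t.get("object")
        if not (s and p and o):
            continue  # skip relationships with a missing part
        lines.append(f"{to_iri(s)} {to_iri(p)} {to_iri(o)} .")
    return "\n".join(lines)


# Hypothetical relationships in the same dict shape as the cell above.
sample_triplets = [
    {"subject": "P001", "predicate": "has_diagnosis", "object": "Type 2 Diabetes"},
    {"subject": "P002", "predicate": "has_diagnosis", "object": None},
]

ntriples_text = to_ntriples(sample_triplets)
print(ntriples_text)
```

Filtering out incomplete relationships before storage avoids writing statements with an empty subject or object; a fuller version would emit literals (dates, lab values) as quoted strings rather than IRIs.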


---

## Phase 6: Visualization


In [None]:
from semantica.visualization import KGVisualizer

visualizer = KGVisualizer()
visualizer.visualize(kg, output_path="patient_kg.html")

print("Clinical reports processing complete")
print("Emphasizes: EHR integration, triplet stores, patient KGs, structured data")
