# Clinical Text De-identification & Insight Extraction Pipeline (Prototype)

This notebook demonstrates a prototype workflow for de-identifying clinical text and extracting medical entities along two paths: Azure Cognitive Services and open-source NLP models. Both extraction paths and the database write are mocked. The notebook also includes workflow documentation and resume bullet points for portfolio submission.

## 1. Import Required Libraries

Import the Python libraries used by the prototype. The Azure SDK, spaCy, and Hugging Face imports are left commented out because the extraction steps are mocked in this notebook.

In [None]:
# Import Required Libraries
import pandas as pd
import numpy as np
import re
import random
from typing import List, Dict

# Mock imports for Azure and open-source NLP (for prototype)
# from azure.ai.textanalytics import TextAnalyticsClient
# import spacy
# from transformers import pipeline

## 2. Define Workflow Prototype Functions

Create Python functions that represent each step of the workflow. Use mock data and placeholder logic to simulate the workflow steps.

In [None]:
# Step 1: Preprocess clinical text (de-identification)
def clean_clinical_text(text: str) -> str:
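    """Mask PHI (dates, MRNs, phone numbers) with placeholders using regex."""
    # Dates such as 12/01/2023 or 1/5/24
    text = re.sub(r"\b\d{1,2}/\d{1,2}/\d{2,4}\b", "[DATE]", text)
    # Medical record numbers written as MRN:<digits>
    text = re.sub(r"MRN:\d+", "[ID]", text)
    # Phone numbers in the form 555-123-4567
    text = re.sub(r"\b\d{3}-\d{3}-\d{4}\b", "[PHONE]", text)
    # Collapse runs of whitespace and trim the ends
    text = re.sub(r"\s+", " ", text).strip()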
    return text

# Step 2: Mock Azure Cognitive Services entity extraction
def azure_extract_entities(texts: List[str]) -> List[Dict]:
    """Simulate Azure Health entity extraction."""
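    # Mock output: one hard-coded medication entity per note that mentions it
    entities = []
    for text in texts:
        offset = text.find('aspirin')
        if offset == -1:
            continue  # str.find returns -1 when the term is absent; skip the note
        entities.append({
            'text': 'aspirin',
            'category': 'MedicationName',
            'confidence_score': round(random.uniform(0.8, 1.0), 2),
            'offset': offset
        })
    return entities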

# Step 3: Mock open-source NER entity extraction
def open_source_extract_entities(texts: List[str]) -> List[Dict]:
    """Simulate open-source NER extraction."""
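    # Mock output: one hard-coded diagnosis entity per note that mentions it
    entities = []
    for text in texts:
        # Search for the full entity text so the offset points at its first character
        offset = text.find('Type 2 Diabetes')
        if offset == -1:
            continue  # str.find returns -1 when the term is absent; skip the note
        entities.append({
            'text': 'Type 2 Diabetes',
            'category': 'Diagnosis',
            'confidence_score': round(random.uniform(0.7, 0.95), 2),
            'offset': offset
        })
    return entities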

# Step 4: Simulate saving to Azure SQL DB
def save_to_sql_db(entities: List[Dict], source: str):
    print(f"Saving {len(entities)} entities from {source} to Azure SQL DB (simulated)")

## 3. Simulate Workflow Execution

Run the workflow prototype using the defined functions. Show sample input and output for each step to demonstrate the workflow logic.

In [None]:
# Sample clinical notes
data = [
    "Patient MRN:12345 was prescribed aspirin on 12/01/2023. Diagnosed with Type 2 Diabetes. Call 555-123-4567.",
    "MRN:67890, admitted 01/15/2024, hypertension noted."
]

# Step 1: Preprocess
cleaned = [clean_clinical_text(t) for t in data]
print("Cleaned Text:", cleaned)

# Step 2: Azure extraction (mock)
az_entities = azure_extract_entities(cleaned)
print("Azure Entities:", az_entities)

# Step 3: Open-source extraction (mock)
os_entities = open_source_extract_entities(cleaned)
print("Open-Source Entities:", os_entities)

# Step 4: Save to SQL DB (simulated)
save_to_sql_db(az_entities, source="Azure Cognitive Services")
save_to_sql_db(os_entities, source="Open-Source NLP")

## 4. Workflow Documentation

### Step 1: Preprocessing
- **Purpose:** Remove/mask PHI (dates, IDs, phone numbers) from clinical text.
- **Input:** Raw clinical notes (string)
- **Output:** Cleaned, de-identified text
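The masking step described above can also be sketched as a pattern table, which makes it easy to add PHI types and to count what was masked. This is a minimal illustration, not the notebook's own function: `PHI_PATTERNS` and `mask_phi` are hypothetical names, and the three patterns cover only the PHI types listed in this step.

```python
import re
from typing import Dict, List, Tuple

# Hypothetical pattern table: (placeholder, compiled regex) pairs.
PHI_PATTERNS: List[Tuple[str, re.Pattern]] = [
    ("[DATE]", re.compile(r"\b\d{1,2}/\d{1,2}/\d{2,4}\b")),
    ("[ID]", re.compile(r"MRN:\s*\d+")),
    ("[PHONE]", re.compile(r"\b\d{3}-\d{3}-\d{4}\b")),
]


def mask_phi(text: str) -> Tuple[str, Dict[str, int]]:
    """Replace each PHI pattern with its placeholder and count the replacements."""
    counts: Dict[str, int] = {}
    for placeholder, pattern in PHI_PATTERNS:
        # subn returns the new string and the number of substitutions made
        text, n = pattern.subn(placeholder, text)
        counts[placeholder] = n
    # Collapse whitespace so downstream offsets refer to a normalized string
    text = re.sub(r"\s+", " ", text).strip()
    return text, counts


note = "Patient MRN:12345 was prescribed aspirin on 12/01/2023. Call 555-123-4567."
masked, counts = mask_phi(note)
print(masked)
print(counts)
```

The per-placeholder counts give a quick audit signal: a note with zero replacements is worth a manual look before it moves on to extraction.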

### Step 2: Entity Extraction (Azure)
- **Purpose:** Extract medical entities using Text Analytics for health in Azure Cognitive Services (simulated here)
- **Input:** Cleaned text
- **Output:** List of entity dicts with text, category, confidence score, and offset
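If the mock were swapped for a real service call, the SDK would return entity objects rather than dicts. The sketch below shows one way to convert such objects into the dict schema used in this notebook. It assumes each entity exposes `text`, `category`, `confidence_score`, and `offset` attributes, as the healthcare entity objects in `azure-ai-textanalytics` do; `to_entity_dicts` and the `min_confidence` filter are hypothetical additions, and `SimpleNamespace` objects stand in for real results so no credentials are needed.

```python
from types import SimpleNamespace
from typing import Any, Dict, Iterable, List


def to_entity_dicts(entities: Iterable[Any], min_confidence: float = 0.0) -> List[Dict]:
    """Convert attribute-style entity objects into the notebook's dict schema."""
    rows: List[Dict] = []
    for ent in entities:
        # Drop low-confidence entities before they reach storage
        if ent.confidence_score < min_confidence:
            continue
        rows.append({
            "text": ent.text,
            "category": ent.category,
            "confidence_score": round(ent.confidence_score, 2),
            "offset": ent.offset,
        })
    return rows


# Stand-in objects (assumption): shaped like SDK entities, not produced by Azure
fake_entities = [
    SimpleNamespace(text="aspirin", category="MedicationName", confidence_score=0.97, offset=28),
    SimpleNamespace(text="daily", category="Frequency", confidence_score=0.41, offset=40),
]
azure_rows = to_entity_dicts(fake_entities, min_confidence=0.5)
print(azure_rows)
```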

### Step 3: Entity Extraction (Open-Source)
- **Purpose:** Extract medical entities using open-source models (spaCy/Hugging Face, simulated here)
- **Input:** Cleaned text
- **Output:** List of entity dicts with text, category, confidence score, and offset
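For the open-source branch, a Hugging Face token-classification pipeline run with an aggregation strategy returns dicts keyed by `entity_group`, `score`, `word`, `start`, and `end`. A small adapter can map those onto the same schema as the Azure branch so the two outputs are comparable. This is a sketch under assumptions: `hf_to_entity_dicts`, the `DISEASE` label, and the sample prediction are illustrative and do not come from a real model run.

```python
from typing import Dict, List


def hf_to_entity_dicts(predictions: List[Dict], label_map: Dict[str, str]) -> List[Dict]:
    """Map aggregated token-classification output onto the notebook's entity schema."""
    rows: List[Dict] = []
    for pred in predictions:
        label = pred["entity_group"]
        rows.append({
            "text": pred["word"],
            # Fall back to the model's own label when no mapping is defined
            "category": label_map.get(label, label),
            "confidence_score": round(float(pred["score"]), 2),
            "offset": pred["start"],
        })
    return rows


# Hand-written sample (assumption): mimics the shape of aggregated pipeline output
sample_predictions = [
    {"entity_group": "DISEASE", "score": 0.912, "word": "Type 2 Diabetes", "start": 45, "end": 60},
]
os_rows = hf_to_entity_dicts(sample_predictions, {"DISEASE": "Diagnosis"})
print(os_rows)
```

Keeping the label mapping as an argument lets each model's tag set be translated without touching the adapter.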

### Step 4: Save Results
- **Purpose:** Store structured results in Azure SQL Database (simulated)
- **Input:** List of entities
- **Output:** Printed confirmation message (nothing is written to a database in this prototype)
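To exercise the save step without an Azure SQL instance, an in-memory SQLite database from the standard library can stand in for it. This is a sketch only: the `entities` table layout and the `save_entities` helper are assumptions, and a real Azure SQL target would need its own driver, connection string, and schema.

```python
import sqlite3
from typing import Dict, List


def save_entities(conn: sqlite3.Connection, entities: List[Dict], source: str) -> int:
    """Insert entity dicts into a local table and return the number of rows written."""
    conn.execute(
        """
        CREATE TABLE IF NOT EXISTS entities (
            source TEXT NOT NULL,
            text TEXT NOT NULL,
            category TEXT NOT NULL,
            confidence_score REAL,
            entity_offset INTEGER  -- named entity_offset because OFFSET is an SQL keyword
        )
        """
    )
    rows = [
        (source, e["text"], e["category"], e["confidence_score"], e["offset"])
        for e in entities
    ]
    # The connection context manager commits on success and rolls back on error
    with conn:
        conn.executemany("INSERT INTO entities VALUES (?, ?, ?, ?, ?)", rows)
    return len(rows)


conn = sqlite3.connect(":memory:")
sample = [{"text": "aspirin", "category": "MedicationName", "confidence_score": 0.93, "offset": 28}]
saved = save_entities(conn, sample, source="Azure Cognitive Services")
stored = conn.execute("SELECT source, text, category FROM entities").fetchall()
print(saved, stored)
```

Parameterized `?` placeholders keep entity text out of the SQL string, which matters once the extracted text comes from real notes.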

---

#### High-Level Workflow Diagram (Pseudocode)

```
          Raw Clinical Notes
                  |
                  v
  [Preprocess: clean_clinical_text]
                  |
                  v
             Cleaned Text
                  |
        +---------+-------------------+
        |                             |
        v                             v
[Azure Entity Extraction]   [Open-Source Entity Extraction]
        |                             |
        v                             v
 Entities (Azure)           Entities (Open-Source)
        |                             |
        v                             v
 [Save to SQL DB]             [Save to SQL DB]
        |                             |
        +--------------+--------------+
                       |
                       v
     Structured Results in Azure SQL DB
```
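The diagram can be expressed as a small orchestration function that cleans the notes once and then sends the same cleaned text through each extraction branch. This is a minimal sketch: `run_pipeline`, `keyword_extractor`, and the branch names are hypothetical, the branches run sequentially rather than concurrently, and simple stand-ins replace the notebook's functions so the block runs on its own.

```python
from typing import Callable, Dict, List, Tuple

Extractor = Callable[[List[str]], List[Dict]]


def run_pipeline(
    notes: List[str],
    preprocess: Callable[[str], str],
    extractors: Dict[str, Extractor],
    sink: Callable[[List[Dict], str], None],
) -> Dict[str, List[Dict]]:
    """Clean every note once, then run each extractor branch and hand its output to the sink."""
    cleaned = [preprocess(note) for note in notes]
    results: Dict[str, List[Dict]] = {}
    for source, extract in extractors.items():
        entities = extract(cleaned)
        sink(entities, source)
        results[source] = entities
    return results


def keyword_extractor(keyword: str, category: str) -> Extractor:
    """Build a stand-in extractor that reports one entity per note containing the keyword."""
    def extract(texts: List[str]) -> List[Dict]:
        return [
            {"text": keyword, "category": category, "offset": t.find(keyword)}
            for t in texts
            if keyword in t
        ]
    return extract


saved_batches: List[Tuple[str, int]] = []
results = run_pipeline(
    ["  Prescribed aspirin. Hypertension noted.  "],
    preprocess=str.strip,  # stand-in for the de-identification step
    extractors={
        "branch_a": keyword_extractor("aspirin", "MedicationName"),
        "branch_b": keyword_extractor("Hypertension", "Diagnosis"),
    },
    sink=lambda entities, source: saved_batches.append((source, len(entities))),
)
print(results)
print(saved_batches)
```

Passing the steps in as callables keeps the two branches interchangeable, so a mocked extractor can be replaced by a real one without changing the orchestration.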

## 5. Resume Bullet Points for Portfolio Project

- Designed and prototyped a clinical NLP pipeline for de-identification and medical entity extraction using both Azure Cognitive Services and open-source models.
- Implemented secure, modular workflow steps for PHI masking, entity extraction, and structured data storage.
- Demonstrated parallel data flows to compare cloud-native and open-source NLP approaches for healthcare text.
- Automated regex-based PHI masking of dates, record numbers, and phone numbers, showcasing healthcare data engineering skills.
- Documented the workflow and results for clear communication and portfolio presentation.