# OpenMed PII Detection & De-identification - Complete Guide

This notebook demonstrates **everything** about PII (Personally Identifiable Information) detection and de-identification in OpenMed, including:

1. **Basic PII Extraction** - Detect PII entities in clinical text
2. **Smart Entity Merging** - Fix fragmentation issues (NEW in v0.5.0)
3. **De-identification Methods** - Mask, remove, replace, hash, shift dates
4. **Re-identification** - Reverse de-identification with mappings
5. **Batch Processing** - Process multiple texts efficiently
6. **Confidence Thresholding** - Control precision vs recall
7. **Custom Patterns** - Add domain-specific PII patterns
8. **Clinical Use Cases** - Real-world examples
9. **Visualization** - Display results with highlighting
10. **CLI Usage** - Command-line interface examples

---

**Requirements:**
```bash
pip install openmed
```

**Model Used:**
- `openmed/OpenMed-PII-SuperClinical-Large-434M-v1` (default)
- Trained on clinical notes, EHR data, and HIPAA-relevant PII

---

## Setup and Installation

In [1]:
# Import required libraries
import os
from pprint import pprint
import json

# Set HuggingFace token (if needed)
# os.environ['HF_TOKEN'] = 'your_token_here'

# Import OpenMed PII functions
from openmed import (
    extract_pii,
    deidentify,
    reidentify,
    PIIEntity,
    DeidentificationResult,
)

# Import smart merging utilities
from openmed import (
    merge_entities_with_semantic_units,
    find_semantic_units,
    calculate_dominant_label,
    PII_PATTERNS,
    PIIPattern,
)

# Import batch processing
from openmed import BatchProcessor, BatchItem, process_batch

print("‚úÖ All imports successful!")

  from .autonotebook import tqdm as notebook_tqdm


‚úÖ All imports successful!


---
## 1. Basic PII Extraction

Extract PII entities from clinical text.

In [2]:
# Simple clinical text with various PII types
clinical_text = """
Patient Name: Dr. Sarah Johnson
Date of Birth: 03/15/1975
Social Security: 123-45-6789
Phone: (555) 123-4567
Email: sarah.johnson@email.com
Address: 456 Oak Avenue, Boston, MA 02115
"""

print("=" * 80)
print("BASIC PII EXTRACTION")
print("=" * 80)
print(f"Input text:\n{clinical_text}\n")

# Extract PII with default settings
result = extract_pii(
    clinical_text,
    model_name='openmed/OpenMed-PII-SuperClinical-Large-434M-v1',
    confidence_threshold=0.5,
    use_smart_merging=True  # DEFAULT in v0.5.0
)

print(f"Found {len(result.entities)} PII entities:\n")
for i, entity in enumerate(result.entities, 1):
    print(f"{i:2d}. [{entity.label:25s}] '{entity.text:30s}' (confidence: {entity.confidence:.3f})")

print("\n" + "=" * 80)

BASIC PII EXTRACTION
Input text:

Patient Name: Dr. Sarah Johnson
Date of Birth: 03/15/1975
Social Security: 123-45-6789
Phone: (555) 123-4567
Email: sarah.johnson@email.com
Address: 456 Oak Avenue, Boston, MA 02115




Device set to use cpu


Found 11 PII entities:

 1. [occupation               ] 'Dr.                           ' (confidence: 0.597)
 2. [first_name               ] 'Sarah                         ' (confidence: 1.000)
 3. [last_name                ] 'Johnson                       ' (confidence: 0.998)
 4. [date_of_birth            ] '03/15/1975                    ' (confidence: 0.693)
 5. [ssn                      ] '123-45-6789                   ' (confidence: 0.981)
 6. [phone_number             ] '555) 123-4567                 ' (confidence: 0.868)
 7. [email                    ] 'sarah.johnson@email.com       ' (confidence: 1.000)
 8. [street_address           ] '456 Oak Avenue                ' (confidence: 1.000)
 9. [city                     ] 'Boston                        ' (confidence: 0.900)
10. [state                    ] 'MA                            ' (confidence: 0.927)
11. [postcode                 ] '02115                         ' (confidence: 0.967)



### Inspecting Entity Details

In [3]:
# Access individual entity properties
if result.entities:
    entity = result.entities[0]
    print("First Entity Details:")
    print(f"  Text: {entity.text}")
    print(f"  Label: {entity.label}")
    print(f"  Confidence: {entity.confidence:.4f}")
    print(f"  Start position: {entity.start}")
    print(f"  End position: {entity.end}")
    print(f"  Extracted from result.text: '{result.text[entity.start:entity.end]}'")

First Entity Details:
  Text: Dr.
  Label: occupation
  Confidence: 0.5971
  Start position: 14
  End position: 17
  Extracted from result.text: 'Dr.'


---
## 2. Smart Entity Merging (NEW in v0.5.0)

Smart merging fixes the fragmentation problem where dates, SSN, phone numbers, and other PII entities are split into unusable fragments by the tokenizer.

### The Problem: Fragmentation

In [4]:
test_text = "Patient DOB: 01/15/1970, Admission: 2024-03-20, SSN: 987-65-4321"

print("=" * 80)
print("COMPARING: WITHOUT vs WITH Smart Merging")
print("=" * 80)
print(f"Input: {test_text}\n")

# WITHOUT smart merging (raw model output)
print("‚ùå WITHOUT Smart Merging (use_smart_merging=False)")
print("-" * 80)
result_raw = extract_pii(
    test_text,
    model_name='openmed/OpenMed-PII-SuperClinical-Large-434M-v1',
    confidence_threshold=0.5,
    use_smart_merging=False  # Disable smart merging
)

print(f"Found {len(result_raw.entities)} entities:")
for entity in result_raw.entities:
    print(f"  [{entity.label:20s}] '{entity.text}' (confidence: {entity.confidence:.3f})")

# Check for fragmentation
date_fragments = [e for e in result_raw.entities if 'date' in e.label.lower() and len(e.text) < 8]
if date_fragments:
    print(f"\n‚ö†Ô∏è  PROBLEM: {len(date_fragments)} date fragments detected!")
    print("   These fragments are unusable for production de-identification.")

print("\n" + "=" * 80)

# WITH smart merging (default)
print("‚úÖ WITH Smart Merging (use_smart_merging=True) - DEFAULT")
print("-" * 80)
result_merged = extract_pii(
    test_text,
    model_name='openmed/OpenMed-PII-SuperClinical-Large-434M-v1',
    confidence_threshold=0.5,
    use_smart_merging=True  # Enable smart merging (DEFAULT)
)

print(f"Found {len(result_merged.entities)} entities:")
for entity in result_merged.entities:
    print(f"  [{entity.label:20s}] '{entity.text}' (confidence: {entity.confidence:.3f})")

# Check for complete dates
complete_dates = [e for e in result_merged.entities if 'date' in e.label.lower() and len(e.text) >= 8]
if complete_dates:
    print(f"\n‚úÖ SUCCESS: {len(complete_dates)} complete date entities!")
    print("   Production-ready for de-identification.")

print("\n" + "=" * 80)

COMPARING: WITHOUT vs WITH Smart Merging
Input: Patient DOB: 01/15/1970, Admission: 2024-03-20, SSN: 987-65-4321

‚ùå WITHOUT Smart Merging (use_smart_merging=False)
--------------------------------------------------------------------------------


Device set to use cpu


Found 5 entities:
  [date                ] '01' (confidence: 0.886)
  [date_of_birth       ] '/15' (confidence: 0.704)
  [date                ] '/1970' (confidence: 0.565)
  [date                ] '2024-03-20' (confidence: 0.999)
  [ssn                 ] '987-65-4321' (confidence: 0.997)

‚ö†Ô∏è  PROBLEM: 3 date fragments detected!
   These fragments are unusable for production de-identification.

‚úÖ WITH Smart Merging (use_smart_merging=True) - DEFAULT
--------------------------------------------------------------------------------


Device set to use cpu


Found 3 entities:
  [date                ] '01/15/1970' (confidence: 0.718)
  [date                ] '2024-03-20' (confidence: 0.999)
  [ssn                 ] '987-65-4321' (confidence: 0.997)

‚úÖ SUCCESS: 2 complete date entities!
   Production-ready for de-identification.



### How Smart Merging Works

In [5]:
# Demonstrate semantic unit detection
demo_text = "Patient: John Doe, DOB: 01/15/1970, SSN: 123-45-6789, Phone: (555) 123-4567"

print("=" * 80)
print("SMART MERGING: Semantic Unit Detection")
print("=" * 80)
print(f"Input: {demo_text}\n")

# Find semantic units using regex patterns
semantic_units = find_semantic_units(demo_text)

print(f"Detected {len(semantic_units)} semantic units using regex patterns:\n")
for start, end, entity_type in semantic_units:
    text_span = demo_text[start:end]
    print(f"  [{entity_type:20s}] '{text_span}' at position {start}-{end}")

print("\n" + "=" * 80)
print(f"Total PII patterns defined: {len(PII_PATTERNS)}")
print("\nPattern categories:")
categories = set(p.entity_type for p in PII_PATTERNS)
for cat in sorted(categories):
    print(f"  - {cat}")

SMART MERGING: Semantic Unit Detection
Input: Patient: John Doe, DOB: 01/15/1970, SSN: 123-45-6789, Phone: (555) 123-4567

Detected 2 semantic units using regex patterns:

  [date                ] '01/15/1970' at position 24-34
  [ssn                 ] '123-45-6789' at position 41-52

Total PII patterns defined: 20

Pattern categories:
  - credit_debit_card
  - date
  - email
  - ipv4
  - ipv6
  - mac_address
  - medical_record_number
  - phone_number
  - postcode
  - ssn
  - street_address
  - url


### Supported PII Patterns

In [9]:
# Display all supported patterns
print("=" * 80)
print("SUPPORTED PII PATTERNS")
print("=" * 80)

# Group patterns by type
from collections import defaultdict
patterns_by_type = defaultdict(list)
for pattern in PII_PATTERNS:
    patterns_by_type[pattern.entity_type].append(pattern)

for entity_type in sorted(patterns_by_type.keys()):
    patterns = patterns_by_type[entity_type]
    print(f"\n{entity_type.upper()}:")
    for p in patterns:
        print(f"  Priority {p.priority}: {p.pattern[:80]}{'...' if len(p.pattern) > 80 else ''}")

SUPPORTED PII PATTERNS

CREDIT_DEBIT_CARD:
  Priority 8: \b\d{4}[-\s]?\d{4}[-\s]?\d{4}[-\s]?\d{4}\b

DATE:
  Priority 10: \b\d{4}-\d{2}-\d{2}\b
  Priority 9: \b\d{1,2}/\d{1,2}/\d{2,4}\b
  Priority 9: \b\d{1,2}-\d{1,2}-\d{2,4}\b
  Priority 8: \b(?:Jan|Feb|Mar|Apr|May|Jun|Jul|Aug|Sep|Oct|Nov|Dec)[a-z]* \d{1,2},? \d{4}\b
  Priority 8: \b\d{1,2} (?:Jan|Feb|Mar|Apr|May|Jun|Jul|Aug|Sep|Oct|Nov|Dec)[a-z]* \d{4}\b

EMAIL:
  Priority 10: \b[A-Za-z0-9._%+-]+@[A-Za-z0-9.-]+\.[A-Z|a-z]{2,}\b

IPV4:
  Priority 7: \b(?:\d{1,3}\.){3}\d{1,3}\b

IPV6:
  Priority 8: \b(?:[0-9a-fA-F]{1,4}:){7}[0-9a-fA-F]{1,4}\b

MAC_ADDRESS:
  Priority 8: \b(?:[0-9A-Fa-f]{2}[:-]){5}[0-9A-Fa-f]{2}\b

MEDICAL_RECORD_NUMBER:
  Priority 9: \b(?:MRN|mrn)[:\s#]*\d{6,10}\b
  Priority 5: \b[A-Z]{2,3}\d{6,9}\b

PHONE_NUMBER:
  Priority 9: \b\(\d{3}\)\s*\d{3}[-.\s]?\d{4}\b
  Priority 8: \b\d{3}[-.\s]\d{3}[-.\s]\d{4}\b
  Priority 5: \b\d{10}\b

POSTCODE:
  Priority 7: \b\d{5}(?:-\d{4})?\b

SSN:
  Priority 10: \b\d{3}-\d{2}-\d{4}\b


---
## 3. De-identification Methods

OpenMed supports multiple de-identification methods to protect patient privacy.

In [10]:
# Clinical note for de-identification
patient_note = """
CLINICAL NOTE
=============
Patient Name: Dr. Sarah Johnson
Date of Birth: 03/15/1975
MRN: 87654321
Social Security: 123-45-6789
Contact: (555) 987-6543
Email: sarah.j@hospital.org
Address: 456 Oak Avenue, Boston, MA 02115

Admission Date: 12/20/2024
Discharge Date: 12/25/2024

DIAGNOSIS: Type 2 Diabetes Mellitus
"""

print("Original Clinical Note:")
print(patient_note)
print("\n" + "=" * 80)

Original Clinical Note:

CLINICAL NOTE
Patient Name: Dr. Sarah Johnson
Date of Birth: 03/15/1975
MRN: 87654321
Social Security: 123-45-6789
Contact: (555) 987-6543
Email: sarah.j@hospital.org
Address: 456 Oak Avenue, Boston, MA 02115

Admission Date: 12/20/2024
Discharge Date: 12/25/2024

DIAGNOSIS: Type 2 Diabetes Mellitus




### Method 1: Mask (Placeholder replacement)

In [15]:
print("=" * 80)
print("METHOD 1: MASK (Placeholder replacement)")
print("=" * 80)

result_mask = deidentify(
    patient_note,
    method="mask",
    model_name="openmed/OpenMed-PII-SuperClinical-Large-434M-v1",
    confidence_threshold=0.6,
    use_smart_merging=True,
)

print("De-identified text:")
print(result_mask.deidentified_text)

# ‚úÖ The library returns `pii_entities` (not `entities`)
entities = getattr(result_mask, "pii_entities", None) or getattr(result_mask, "entities", [])

print(f"\nEntities masked: {len(entities)}")

# Optional: sort by position for nicer display
entities = sorted(entities, key=lambda e: getattr(e, "start", 0))

for entity in entities[:5]:  # Show first 5
    label = getattr(entity, "label", getattr(entity, "entity_type", "UNKNOWN"))
    text = getattr(entity, "text", "")
    redacted = getattr(entity, "redacted_text", "")
    conf = getattr(entity, "confidence", None)
    span = (getattr(entity, "start", None), getattr(entity, "end", None))

    conf_str = f"{conf:.3f}" if isinstance(conf, (int, float)) else "n/a"
    print(f"  [{label}] '{text}' -> '{redacted}' conf={conf_str} span={span}")

METHOD 1: MASK (Placeholder replacement)


Device set to use cpu


De-identified text:

CLINICAL NOTE
Patient Name: Dr.[first_name]h[last_name]n
Date of Birth:[date_of_birth]5
MRN: 87654321
Social Security:[ssn]9
Contact: [phone_number]3
Email:[email]g
Address:[street_address]e,[city]n,[state]A[postcode]5

Admission Date:[date]4
Discharge Date:[date]4

DIAGNOSIS: Type 2 Diabetes Mellitus


Entities masked: 12
  [first_name] 'Sarah' -> '[first_name]' conf=1.000 span=(46, 51)
  [last_name] 'Johnson' -> '[last_name]' conf=0.999 span=(52, 59)
  [date_of_birth] '03/15/1975' -> '[date_of_birth]' conf=0.815 span=(75, 85)
  [ssn] '123-45-6789' -> '[ssn]' conf=0.977 span=(117, 128)
  [phone_number] '555) 987-6543' -> '[phone_number]' conf=0.659 span=(139, 152)


### Method 2: Remove (Complete removal)

In [17]:
print("=" * 80)
print("METHOD 2: REMOVE (Complete removal)")
print("=" * 80)

result_remove = deidentify(
    patient_note,
    method="remove",
    model_name="openmed/OpenMed-PII-SuperClinical-Large-434M-v1",
    confidence_threshold=0.6,
    use_smart_merging=True,
)

print("De-identified text:")
print(result_remove.deidentified_text)

# ‚úÖ Use `pii_entities` (fallback included for robustness)
entities = getattr(result_remove, "pii_entities", None) or getattr(result_remove, "entities", [])

print(f"\nEntities removed: {len(entities)}")

METHOD 2: REMOVE (Complete removal)


Device set to use cpu


De-identified text:

CLINICAL NOTE
Patient Name: Dr.hn
Date of Birth:5
MRN: 87654321
Social Security:9
Contact: 3
Email:g
Address:e,n,A5

Admission Date:4
Discharge Date:4

DIAGNOSIS: Type 2 Diabetes Mellitus


Entities removed: 12


### Method 3: Replace (Synthetic data)

In [19]:
print("=" * 80)
print("METHOD 3: REPLACE (Synthetic data)")
print("=" * 80)

result_replace = deidentify(
    patient_note,
    method="replace",
    model_name="openmed/OpenMed-PII-SuperClinical-Large-434M-v1",
    confidence_threshold=0.6,
    use_smart_merging=True,
)

print("De-identified text:")
print(result_replace.deidentified_text)

# ‚úÖ `DeidentificationResult` uses `pii_entities`
entities = getattr(result_replace, "pii_entities", None) or getattr(result_replace, "entities", [])

print(f"\nEntities replaced: {len(entities)}")

for entity in sorted(entities, key=lambda e: getattr(e, "start", 0))[:5]:
    label = getattr(entity, "label", getattr(entity, "entity_type", "UNKNOWN"))
    text = getattr(entity, "text", "")
    # for replace/mask, this often holds the replacement value or placeholder
    repl = getattr(entity, "redacted_text", "")
    print(f"  [{label}] '{text}' -> '{repl}'")

METHOD 3: REPLACE (Synthetic data)


Device set to use cpu


De-identified text:

CLINICAL NOTE
Patient Name: Dr.[first_name]h[last_name]n
Date of Birth:[date_of_birth]5
MRN: 87654321
Social Security:[ssn]9
Contact: [phone_number]3
Email:[email]g
Address:[street_address]e,[city]n,[state]A[postcode]5

Admission Date:[date]4
Discharge Date:[date]4

DIAGNOSIS: Type 2 Diabetes Mellitus


Entities replaced: 12
  [first_name] 'Sarah' -> '[first_name]'
  [last_name] 'Johnson' -> '[last_name]'
  [date_of_birth] '03/15/1975' -> '[date_of_birth]'
  [ssn] '123-45-6789' -> '[ssn]'
  [phone_number] '555) 987-6543' -> '[phone_number]'


### Method 4: Hash (Cryptographic hashing)

In [21]:
print("=" * 80)
print("METHOD 4: HASH (Cryptographic hashing)")
print("=" * 80)

result_hash = deidentify(
    patient_note,
    method="hash",
    model_name="openmed/OpenMed-PII-SuperClinical-Large-434M-v1",
    confidence_threshold=0.6,
    use_smart_merging=True,
)

print("De-identified text (first 500 chars):")
text = result_hash.deidentified_text or ""
print((text[:500] + "...") if len(text) > 500 else text)

# ‚úÖ Use `pii_entities` (fallback included)
entities = getattr(result_hash, "pii_entities", None) or getattr(result_hash, "entities", [])

print(f"\nEntities hashed: {len(entities)}")
print("\nExample hashed values:")

for entity in sorted(entities, key=lambda e: getattr(e, "start", 0))[:3]:
    label = getattr(entity, "label", getattr(entity, "entity_type", "UNKNOWN"))
    original = getattr(entity, "text", "")
    hashed = getattr(entity, "hash_value", None) or getattr(entity, "redacted_text", "")
    print(f"  [{label}] Original: '{original}'  Hashed: '{hashed}'")

METHOD 4: HASH (Cryptographic hashing)


Device set to use cpu


De-identified text (first 500 chars):

CLINICAL NOTE
Patient Name: Dr.first_name_7e8c729ehlast_name_3013b18fn
Date of Birth:date_of_birth_ad87a4065
MRN: 87654321
Social Security:ssn_01a546299
Contact: phone_number_d8f6c45f3
Email:email_c67e1ae7g
Address:street_address_c25c1d69e,city_a06522bcn,state_f0055891Apostcode_20ec61f35

Admission Date:date_9b3129044
Discharge Date:date_a98356c94

DIAGNOSIS: Type 2 Diabetes Mellitus


Entities hashed: 12

Example hashed values:
  [first_name] Original: 'Sarah'  Hashed: '7e8c729e'
  [last_name] Original: 'Johnson'  Hashed: '3013b18f'
  [date_of_birth] Original: '03/15/1975'  Hashed: 'ad87a406'


### Method 5: Shift Dates (Date shifting)

In [23]:
print("=" * 80)
print("METHOD 5: SHIFT_DATES (Preserves temporal relationships)")
print("=" * 80)

result_shift = deidentify(
    patient_note,
    method="shift_dates",
    model_name="openmed/OpenMed-PII-SuperClinical-Large-434M-v1",
    confidence_threshold=0.6,
    use_smart_merging=True,
    date_shift_days=365,  # Shift by 1 year
)

print("De-identified text:")
print(result_shift.deidentified_text)

# ‚úÖ Use `pii_entities` (fallback included)
entities = getattr(result_shift, "pii_entities", None) or getattr(result_shift, "entities", [])

date_entities = [
    e for e in entities
    if "date" in getattr(e, "label", getattr(e, "entity_type", "")).lower()
]

print("\nDate entities shifted:")
for e in sorted(date_entities, key=lambda x: getattr(x, "start", 0)):
    label = getattr(e, "label", getattr(e, "entity_type", "UNKNOWN"))
    original = getattr(e, "text", "")
    shifted = getattr(e, "redacted_text", "")  # often holds the shifted date or replacement
    print(f"  [{label}] '{original}' -> '{shifted}'")

print("Note: Temporal relationships between dates are preserved!")

METHOD 5: SHIFT_DATES (Preserves temporal relationships)


Device set to use cpu


De-identified text:

CLINICAL NOTE
Patient Name: Dr.[first_name]h[last_name]n
Date of Birth:[date_of_birth]5
MRN: 87654321
Social Security:[ssn]9
Contact: [phone_number]3
Email:[email]g
Address:[street_address]e,[city]n,[state]A[postcode]5

Admission Date:[date]4
Discharge Date:[date]4

DIAGNOSIS: Type 2 Diabetes Mellitus


Date entities shifted:
  [date_of_birth] '03/15/1975' -> '[date_of_birth]'
  [date] '12/20/2024' -> '[date]'
  [date] '12/25/2024' -> '[date]'
Note: Temporal relationships between dates are preserved!


---
## 4. Re-identification

Reverse de-identification using stored mappings.

In [24]:
print("=" * 80)
print("RE-IDENTIFICATION")
print("=" * 80)

# De-identify with mapping
print("Step 1: De-identify with keep_mapping=True")
result_with_mapping = deidentify(
    patient_note,
    method="mask",
    model_name='openmed/OpenMed-PII-SuperClinical-Large-434M-v1',
    confidence_threshold=0.6,
    keep_mapping=True,  # Keep mapping for re-identification
    use_smart_merging=True
)

print(f"\nDe-identified text (first 200 chars):")
print(result_with_mapping.deidentified_text[:200] + "...")

print(f"\nMapping created: {len(result_with_mapping.mapping)} entries")
print("\nFirst 5 mapping entries:")
for i, (redacted, original) in enumerate(list(result_with_mapping.mapping.items())[:5], 1):
    print(f"  {i}. '{redacted}' ‚Üí '{original}'")

# Re-identify
print("\n" + "-" * 80)
print("Step 2: Re-identify using the mapping")
original_text = reidentify(
    result_with_mapping.deidentified_text,
    result_with_mapping.mapping
)

print(f"\nRe-identified text (first 300 chars):")
print(original_text[:300] + "...")

# Verify
print("\n" + "-" * 80)
print("Verification:")
original_clean = patient_note.strip()
reidentified_clean = original_text.strip()
if original_clean == reidentified_clean:
    print("‚úÖ SUCCESS: Re-identification is perfect!")
else:
    print(f"‚ö†Ô∏è  Difference detected (usually whitespace/formatting)")
    print(f"   Original length: {len(original_clean)}")
    print(f"   Re-identified length: {len(reidentified_clean)}")

RE-IDENTIFICATION
Step 1: De-identify with keep_mapping=True


Device set to use cpu



De-identified text (first 200 chars):

CLINICAL NOTE
Patient Name: Dr.[first_name]h[last_name]n
Date of Birth:[date_of_birth]5
MRN: 87654321
Social Security:[ssn]9
Contact: [phone_number]3
Email:[email]g
Address:[street_addr...

Mapping created: 11 entries

First 5 mapping entries:
  1. '[date]' ‚Üí '12/20/2024'
  2. '[postcode]' ‚Üí '02115'
  3. '[state]' ‚Üí 'MA'
  4. '[city]' ‚Üí 'Boston'
  5. '[street_address]' ‚Üí '456 Oak Avenue'

--------------------------------------------------------------------------------
Step 2: Re-identify using the mapping

Re-identified text (first 300 chars):

CLINICAL NOTE
Patient Name: Dr.SarahhJohnsonn
Date of Birth:03/15/19755
MRN: 87654321
Social Security:123-45-67899
Contact: 555) 987-65433
Email:sarah.j@hospital.orgg
Address:456 Oak Avenuee,Bostonn,MAA021155

Admission Date:12/20/20244
Discharge Date:12/20/20244

DIAGNOSIS: Type 2 Di...

--------------------------------------------------------------------------------
Verification:
‚ö†Ô∏è  Differ

---
## 5. Batch Processing

Efficiently process multiple clinical notes.

In [27]:
from openmed import BatchProcessor

print("=" * 80)
print("BATCH PROCESSING")
print("=" * 80)

batch_texts = [
    "Patient: John Doe, DOB: 01/15/1970, SSN: 123-45-6789",
    "Dr. Sarah Johnson, Phone: (555) 123-4567, Email: sarah@email.com",
    "MRN: 87654321, Admission: 2024-03-20, Discharge: 2024-03-25",
    "Address: 123 Main Street, Boston, MA 02101, ZIP: 02101",
    "Contact: patient.name@hospital.org, Emergency: (555) 987-6543",
]

processor = BatchProcessor(
    model_name="openmed/OpenMed-PII-SuperClinical-Large-434M-v1",
    confidence_threshold=0.5,
    group_entities=True,
    continue_on_error=True,
    # IMPORTANT: do NOT pass use_smart_merging here if your installed version triggers the HF pipeline error
)

batch_result = processor.process_texts(batch_texts)

print("Batch processing completed!")
print(f"  Total items: {batch_result.total_items}")
print(f"  Successful: {batch_result.successful_items}")
print(f"  Failed: {batch_result.failed_items}")
print(f"  Total processing time: {batch_result.total_processing_time:.2f}s")

print("\n" + "-" * 80)
print("Results per note:\n")

for item_result in batch_result.items:
    if not item_result.success:
        print(f"‚ùå {item_result.id}: {item_result.error}")
        continue

    # In BatchProcessor results, entities usually live under item_result.result.entities
    ents = item_result.result.entities
    print(f"üìÑ {item_result.id}:")
    print(f"   Entities found: {len(ents)}")
    for entity in ents[:3]:
        print(f"     - [{entity.label}] '{entity.text}'")
    if len(ents) > 3:
        print(f"     ... and {len(ents) - 3} more")
    print()

BATCH PROCESSING


Device set to use cpu
Device set to use cpu
Device set to use cpu
Device set to use cpu
Device set to use cpu


Batch processing completed!
  Total items: 5
  Successful: 5
  Failed: 0
  Total processing time: 8.92s

--------------------------------------------------------------------------------
Results per note:

üìÑ item_0:
   Entities found: 5
     - [first_name] 'John'
     - [last_name] 'Doe'
     - [date] '01'
     ... and 2 more

üìÑ item_1:
   Entities found: 4
     - [occupation] 'Dr.'
     - [first_name] 'Sarah'
     - [last_name] 'Johnson'
     ... and 1 more

üìÑ item_2:
   Entities found: 2
     - [date] '2024-03-20'
     - [date] '2024-03-25'

üìÑ item_3:
   Entities found: 5
     - [street_address] '123 Main Street'
     - [city] 'Boston'
     - [state] 'MA'
     ... and 2 more

üìÑ item_4:
   Entities found: 1
     - [email] 'patient.name@hospital.org'



### Batch De-identification

In [29]:
print("=" * 80)
print("BATCH DE-IDENTIFICATION (extract in batch, then deidentify)")
print("=" * 80)

ids = [f"note_{i+1}" for i in range(len(batch_texts))]

# 1) Batch extraction (no use_smart_merging here in YOUR install; it breaks HF pipeline creation)
batch_result = process_batch(
    batch_texts,
    model_name="openmed/OpenMed-PII-SuperClinical-Large-434M-v1",
    ids=ids,
    confidence_threshold=0.6,
    batch_size=2,
)

print(f"Batch extraction completed!")
print(f"  Successful: {batch_result.successful_items}/{batch_result.total_items}\n")

# 2) Deidentify each text (this API supports use_smart_merging in your environment)
print("De-identified texts:\n")

for item in batch_result.items:
    item_id = getattr(item, "id", None) or getattr(item, "item_id", "unknown")

    if not getattr(item, "success", False):
        err = getattr(item, "error", None) or getattr(item, "exception", None)
        print(f"‚ùå {item_id} failed during extraction: {err}")
        continue

    # run deidentify for the same text
    original_text = item.text if hasattr(item, "text") else batch_texts[ids.index(item_id)]
    deid = deidentify(
        original_text,
        method="mask",
        model_name="openmed/OpenMed-PII-SuperClinical-Large-434M-v1",
        confidence_threshold=0.6,
        use_smart_merging=True,
    )

    print(f"üìÑ {item_id}:")
    print(deid.deidentified_text)
    print()

BATCH DE-IDENTIFICATION (extract in batch, then deidentify)


Device set to use cpu
Device set to use cpu
Device set to use cpu
Device set to use cpu
Device set to use cpu


Batch extraction completed!
  Successful: 5/5

De-identified texts:



Device set to use cpu


üìÑ note_1:
Patient: [first_name] [last_name], DOB: [date_of_birth], SSN: [ssn]



Device set to use cpu


üìÑ note_2:
[occupation] [first_name] [last_name], Phone: (555) 123-4567, Email: [email]



Device set to use cpu


üìÑ note_3:
MRN: 87654321, Admission: [date], Discharge: [date]



Device set to use cpu


üìÑ note_4:
Address: [street_address], [city], [state] [postcode], ZIP: [postcode]



Device set to use cpu


üìÑ note_5:
Contact: [email], Emergency: (555) 987-6543



---
## 6. Confidence Thresholding

Control precision vs recall trade-off.

In [30]:
print("=" * 80)
print("CONFIDENCE THRESHOLDING")
print("=" * 80)

test_text = "Patient: Jane Doe, DOB: 05/20/1985, Phone: 555-1234, Email: jane@email.com"
print(f"Input: {test_text}\n")

thresholds = [0.3, 0.5, 0.7, 0.9]

for threshold in thresholds:
    result = extract_pii(
        test_text,
        model_name='openmed/OpenMed-PII-SuperClinical-Large-434M-v1',
        confidence_threshold=threshold,
        use_smart_merging=True
    )

    print(f"Threshold: {threshold:.1f} ‚Üí {len(result.entities)} entities")
    for entity in result.entities:
        print(f"  [{entity.label:20s}] '{entity.text:25s}' (conf: {entity.confidence:.3f})")
    print()

print("-" * 80)
print("Guidelines:")
print("  ‚Ä¢ threshold=0.3-0.5: High recall (catch more PII, more false positives)")
print("  ‚Ä¢ threshold=0.5-0.7: Balanced (RECOMMENDED for most use cases)")
print("  ‚Ä¢ threshold=0.7-0.9: High precision (fewer false positives, may miss some PII)")

CONFIDENCE THRESHOLDING
Input: Patient: Jane Doe, DOB: 05/20/1985, Phone: 555-1234, Email: jane@email.com



Device set to use cpu


Threshold: 0.3 ‚Üí 4 entities
  [first_name          ] 'Jane                     ' (conf: 1.000)
  [last_name           ] 'Doe                      ' (conf: 0.998)
  [date                ] '05/20/1985               ' (conf: 0.672)
  [email               ] 'jane@email.com           ' (conf: 0.999)



Device set to use cpu


Threshold: 0.5 ‚Üí 4 entities
  [first_name          ] 'Jane                     ' (conf: 1.000)
  [last_name           ] 'Doe                      ' (conf: 0.998)
  [date                ] '05/20/1985               ' (conf: 0.672)
  [email               ] 'jane@email.com           ' (conf: 0.999)



Device set to use cpu


Threshold: 0.7 ‚Üí 4 entities
  [first_name          ] 'Jane                     ' (conf: 1.000)
  [last_name           ] 'Doe                      ' (conf: 0.998)
  [date                ] '05/20/1985               ' (conf: 0.751)
  [email               ] 'jane@email.com           ' (conf: 0.999)



Device set to use cpu


Threshold: 0.9 ‚Üí 3 entities
  [first_name          ] 'Jane                     ' (conf: 1.000)
  [last_name           ] 'Doe                      ' (conf: 0.998)
  [email               ] 'jane@email.com           ' (conf: 0.999)

--------------------------------------------------------------------------------
Guidelines:
  ‚Ä¢ threshold=0.3-0.5: High recall (catch more PII, more false positives)
  ‚Ä¢ threshold=0.5-0.7: Balanced (RECOMMENDED for most use cases)
  ‚Ä¢ threshold=0.7-0.9: High precision (fewer false positives, may miss some PII)


---
## 7. Custom PII Patterns

Add domain-specific patterns for your organization.

In [31]:
print("=" * 80)
print("CUSTOM PII PATTERNS")
print("=" * 80)

# Define custom patterns
custom_patterns = [
    PIIPattern(
        pattern=r'\bEMP-\d{6}\b',  # Employee ID format: EMP-123456
        entity_type='employee_id',
        priority=10
    ),
    PIIPattern(
        pattern=r'\bPID-\d{8}\b',  # Patient ID format: PID-12345678
        entity_type='patient_id',
        priority=9
    ),
    PIIPattern(
        pattern=r'\b[A-Z]{2}-\d{4}-[A-Z]\b',  # Custom format: AB-1234-X
        entity_type='internal_code',
        priority=8
    ),
]

print(f"Defined {len(custom_patterns)} custom patterns:\n")
for p in custom_patterns:
    print(f"  [{p.entity_type:20s}] Priority: {p.priority}, Pattern: {p.pattern}")

# Test text with custom identifiers
custom_text = """
Employee: EMP-123456
Patient ID: PID-87654321
Department Code: HR-2024-A
Regular SSN: 123-45-6789
"""

print("\n" + "-" * 80)
print("Test text:")
print(custom_text)

# Find custom semantic units
print("-" * 80)
print("Detected units with custom patterns:\n")

# First get model predictions
result = extract_pii(
    custom_text,
    model_name='openmed/OpenMed-PII-SuperClinical-Large-434M-v1',
    confidence_threshold=0.5,
    use_smart_merging=False  # Get raw predictions first
)

# Convert to dict format
entity_dicts = [
    {
        'entity_type': e.label,
        'score': e.confidence,
        'start': e.start,
        'end': e.end,
        'word': e.text
    }
    for e in result.entities
]

# Merge with custom patterns
merged = merge_entities_with_semantic_units(
    entity_dicts,
    result.text,
    patterns=custom_patterns,  # Add custom patterns
    use_semantic_patterns=True,
    prefer_model_labels=False  # Prefer pattern labels for custom types
)

print(f"Found {len(merged)} entities (including custom types):\n")
for entity in merged:
    label = entity['entity_type']
    text = entity['word']
    conf = entity['score']
    is_custom = label in ['employee_id', 'patient_id', 'internal_code']
    marker = "üÜï" if is_custom else "  "
    print(f"{marker} [{label:20s}] '{text}' (confidence: {conf:.3f})")

print("\nüÜï = Custom pattern detected")

CUSTOM PII PATTERNS
Defined 3 custom patterns:

  [employee_id         ] Priority: 10, Pattern: \bEMP-\d{6}\b
  [patient_id          ] Priority: 9, Pattern: \bPID-\d{8}\b
  [internal_code       ] Priority: 8, Pattern: \b[A-Z]{2}-\d{4}-[A-Z]\b

--------------------------------------------------------------------------------
Test text:

Employee: EMP-123456
Patient ID: PID-87654321
Department Code: HR-2024-A
Regular SSN: 123-45-6789

--------------------------------------------------------------------------------
Detected units with custom patterns:



Device set to use cpu


Found 2 entities (including custom types):

üÜï [employee_id         ] 'EMP-123456' (confidence: 0.988)
   [ssn                 ] '123-45-6789' (confidence: 0.924)

üÜï = Custom pattern detected


---
## 8. Clinical Use Cases

Real-world clinical scenarios.

### Use Case 1: Discharge Summary

In [46]:
discharge_summary = """
DISCHARGE SUMMARY
=====================================
Patient Name: Michael Anderson
MRN: 98765432
Date of Birth: 08/12/1968
Admission Date: 01/05/2025
Discharge Date: 01/10/2025
Attending Physician: Dr. Emily Carter

PRIMARY DIAGNOSIS:
Acute myocardial infarction

HOSPITAL COURSE:
Mr. Anderson is a 56-year-old male who presented to the emergency
department on 01/05/2025 with chest pain. He was admitted for
cardiac catheterization and intervention.

CONTACT INFORMATION:
Phone: (555) 234-5678
Email: m.anderson@email.com
Emergency Contact: Jane Anderson (Wife) - (555) 234-5679

FOLLOW-UP:
Patient scheduled for follow-up on 01/24/2025 at the cardiology clinic.
"""

print("=" * 80)
print("USE CASE 1: Discharge Summary De-identification")
print("=" * 80)

# De-identify for research database
deid_discharge = deidentify(
    discharge_summary,
    method="mask",
    model_name='openmed/OpenMed-PII-SuperClinical-Large-434M-v1',
    confidence_threshold=0.6,
    use_smart_merging=True
)

print("De-identified Discharge Summary:")
print(deid_discharge.deidentified_text)

print("\n" + "-" * 80)
print(f"PII entities protected: {len(deid_discharge.pii_entities)}")
# Check for adjacent placeholders (quality check)
if '][' in deid_discharge.deidentified_text:
    print("‚ùå Adjacent placeholders detected - fragmentation issue!")
else:
    print("‚úÖ Clean de-identification - no adjacent placeholders!")

USE CASE 1: Discharge Summary De-identification


Device set to use cpu


De-identified Discharge Summary:

DISCHARGE SUMMARY
Patient Name:[first_name]l[last_name]n[medical_record_number]2
Date of Birth:[date_of_birth]8
Admission Date:[date]5
Discharge Date:[date]5[occupation]n: Dr.[first_name]y[last_name]r

PRIMARY DIAGNOSIS:
Acute myocardial infarction

HOSPITAL COURSE:
Mr.[last_name]n is a[age]6-year-old male who presented to the emergency
department on[date]5 with chest pain. He was admitted for
cardiac catheterization and intervention.

CONTACT INFORMATION:
Phone:[phone_number]8
Email:[email]m
Emergency Contact:[first_name]e[last_name]n (Wife) - (555) 234-5679

FOLLOW-UP:
Patient scheduled for follow-up on[date]5 at the cardiology clinic.


--------------------------------------------------------------------------------
PII entities protected: 17
‚úÖ Clean de-identification - no adjacent placeholders!


### Use Case 2: Research Dataset Preparation

In [50]:
research_notes = [
    "Patient 001: John Smith, DOB 03/15/1975, diagnosed with T2DM on 12/20/2024",
    "Patient 002: Sarah Johnson, DOB 08/22/1982, A1C 8.5%, started metformin 01/05/2025",
    "Patient 003: Robert Williams, DOB 11/30/1965, BMI 32.1, blood pressure 145/90",
]

print("=" * 80)
print("USE CASE 2: Research Dataset Preparation")
print("=" * 80)
print(f"Processing {len(research_notes)} patient notes for research...\n")

# For batch de-identification, use deidentify() on each text
# BatchProcessor.process_items() is for extraction only

print("De-identified research dataset (date shifting by 180 days):\n")

for i, note in enumerate(research_notes, 1):
    patient_id = f"patient_{i:03d}"

    # De-identify each note with date shifting
    deid_result = deidentify(
        note,
        method="shift_dates",
        model_name="openmed/OpenMed-PII-SuperClinical-Large-434M-v1",
        confidence_threshold=0.6,
        use_smart_merging=True,
        date_shift_days=180,
        keep_mapping=True,
    )

    print(f"{patient_id}: {deid_result.deidentified_text}")

print("\n" + "-" * 80)
print("‚úÖ Research dataset ready!")
print("   - All dates shifted by 180 days")
print("   - Temporal relationships preserved")
print("   - Audit mapping available for IRB review")

USE CASE 2: Research Dataset Preparation
Processing 3 patient notes for research...

De-identified research dataset (date shifting by 180 days):



Device set to use cpu


patient_001: Patient 001: [first_name] [last_name], DOB [date], diagnosed with T2DM on [date]


Device set to use cpu


patient_002: Patient 002: [first_name] [last_name], DOB [date_of_birth], A1C 8.5%, started metformin [date]


Device set to use cpu


patient_003: Patient 003: [first_name] [last_name], DOB [date_of_birth], BMI 32.1, blood pressure 145/90

--------------------------------------------------------------------------------
‚úÖ Research dataset ready!
   - All dates shifted by 180 days
   - Temporal relationships preserved
   - Audit mapping available for IRB review


### Use Case 3: HIPAA Compliance Audit

In [51]:
print("=" * 80)
print("USE CASE 3: HIPAA Compliance Audit")
print("=" * 80)

# HIPAA Safe Harbor requires removal of 18 identifiers
hipaa_text = """
Patient: Jane Doe
DOB: 05/15/1980
SSN: 987-65-4321
Address: 789 Pine Street, Unit 4B, Seattle, WA 98101
Phone: (206) 555-1234
Fax: (206) 555-1235
Email: jane.doe@email.com
Medical Record: MRN-12345678
Account Number: ACCT-987654
Device ID: DEVICE-ABC123
IP Address: 192.168.1.100
Vehicle: License plate ABC-1234
Biometric: Fingerprint on file
Photo: Patient photo available
URL: https://patient-portal.hospital.com/jane-doe
"""

# Extract all PII for audit
hipaa_result = extract_pii(
    hipaa_text,
    model_name='openmed/OpenMed-PII-SuperClinical-Large-434M-v1',
    confidence_threshold=0.5,
    use_smart_merging=True
)

print(f"HIPAA Compliance Check:\n")
print(f"Total PII entities detected: {len(hipaa_result.entities)}\n")

# Group by type
from collections import Counter
pii_types = Counter(e.label for e in hipaa_result.entities)

print("PII Categories Found:")
for pii_type, count in sorted(pii_types.items()):
    print(f"  ‚Ä¢ {pii_type}: {count} instance(s)")

# HIPAA 18 identifiers checklist
hipaa_18_identifiers = [
    'names', 'geographic', 'dates', 'phone', 'fax', 'email', 'ssn',
    'medical_record', 'account_number', 'certificate', 'vehicle',
    'device', 'url', 'ip_address', 'biometric', 'photo', 'unique_id'
]

print("\n" + "-" * 80)
print("HIPAA Safe Harbor Compliance:")
detected_types = set(e.label.lower() for e in hipaa_result.entities)
covered = sum(1 for identifier in hipaa_18_identifiers
              if any(identifier in dt for dt in detected_types))

print(f"  Coverage: {covered}/{len(hipaa_18_identifiers)} HIPAA identifier categories")
print("  ‚úÖ Ready for HIPAA-compliant de-identification")

USE CASE 3: HIPAA Compliance Audit


Device set to use cpu


HIPAA Compliance Check:

Total PII entities detected: 15

PII Categories Found:
  ‚Ä¢ account_number: 1 instance(s)
  ‚Ä¢ city: 1 instance(s)
  ‚Ä¢ date: 1 instance(s)
  ‚Ä¢ email: 1 instance(s)
  ‚Ä¢ fax_number: 1 instance(s)
  ‚Ä¢ first_name: 1 instance(s)
  ‚Ä¢ ipv4: 1 instance(s)
  ‚Ä¢ last_name: 1 instance(s)
  ‚Ä¢ license_plate: 1 instance(s)
  ‚Ä¢ medical_record_number: 1 instance(s)
  ‚Ä¢ postcode: 1 instance(s)
  ‚Ä¢ ssn: 1 instance(s)
  ‚Ä¢ state: 1 instance(s)
  ‚Ä¢ street_address: 1 instance(s)
  ‚Ä¢ url: 1 instance(s)

--------------------------------------------------------------------------------
HIPAA Safe Harbor Compliance:
  Coverage: 6/17 HIPAA identifier categories
  ‚úÖ Ready for HIPAA-compliant de-identification


---
## 9. Visualization

Display results with highlighting.

In [55]:
from IPython.display import HTML, display

def highlight_entities(text, entities):
    """Create HTML with highlighted PII entities."""
    # Define colors for different entity types
    colors = {
        'name': '#FFB3BA',
        'first_name': '#FFB3BA',
        'last_name': '#FFB3BA',
        'date': '#BAFFC9',
        'date_of_birth': '#BAFFC9',
        'ssn': '#BAE1FF',
        'phone': '#FFFFBA',
        'email': '#FFDFBA',
        'address': '#E0BBE4',
        'medical_record_number': '#D4F1F4',
    }

    # Sort entities by start position (reverse for replacement)
    sorted_entities = sorted(entities, key=lambda e: e.start, reverse=True)

    highlighted = text
    for entity in sorted_entities:
        color = colors.get(entity.label.lower(), '#E8E8E8')
        replacement = (
            f'<span style="background-color: {color}; padding: 1px 2px; '
            f'border-radius: 3px; font-weight: bold;" '
            f'title="{entity.label} (confidence: {entity.confidence:.2f})">{entity.text}</span>'
        )
        highlighted = (
            highlighted[:entity.start] + replacement + highlighted[entity.end:]
        )

    return f'<div style="font-family: monospace; white-space: pre-wrap; padding: 10px; background: #f8f8f8; border-radius: 5px;">{highlighted}</div>'

# Example visualization
viz_text = """Patient: Dr. Sarah Johnson
DOB: 03/15/1975
SSN: 123-45-6789
Phone: (555) 123-4567
Email: sarah.j@email.com
Address: 456 Oak Ave, Boston, MA 02115"""

viz_result = extract_pii(
    viz_text,
    model_name='openmed/OpenMed-PII-SuperClinical-Large-434M-v1',
    confidence_threshold=0.5,
    use_smart_merging=True
)

print("=" * 80)
print("VISUALIZATION: Highlighted PII Entities")
print("=" * 80)
print("\nHover over highlighted text to see entity type and confidence.\n")

html = highlight_entities(viz_result.text, viz_result.entities)
display(HTML(html))

print("\nLegend:")
print("  üü• Pink: Names")
print("  üü© Green: Dates")
print("  üü¶ Blue: SSN")
print("  üü® Yellow: Phone")
print("  üüß Orange: Email")
print("  üü™ Purple: Address")

Device set to use cpu


VISUALIZATION: Highlighted PII Entities

Hover over highlighted text to see entity type and confidence.




Legend:
  üü• Pink: Names
  üü© Green: Dates
  üü¶ Blue: SSN
  üü® Yellow: Phone
  üüß Orange: Email
  üü™ Purple: Address


---
## 10. CLI Usage Examples

Command-line interface for PII operations.

In [None]:
print("=" * 80)
print("CLI USAGE EXAMPLES")
print("=" * 80)
print("""
OpenMed provides a powerful CLI for PII detection and de-identification.

1. Extract PII from text:
   ```bash
   openmed pii extract \
     --text "Patient: John Doe, DOB: 01/15/1970" \
     --model openmed/OpenMed-PII-SuperClinical-Large-434M-v1 \
     --confidence-threshold 0.5
   ```

2. Extract PII from file:
   ```bash
   openmed pii extract \
     --input-file patient_note.txt \
     --output results.json
   ```

3. De-identify with mask method:
   ```bash
   openmed pii deidentify \
     --input-file patient_note.txt \
     --method mask \
     --output deidentified.txt
   ```

4. De-identify with date shifting:
   ```bash
   openmed pii deidentify \
     --text "Admission: 01/15/2025" \
     --method shift_dates \
     --date-shift-days 180
   ```

5. Batch processing:
   ```bash
   openmed pii batch-extract \
     --input-dir ./patient_notes/ \
     --output-dir ./results/ \
     --confidence-threshold 0.6
   ```

6. Interactive mode:
   ```bash
   openmed pii interactive
   ```

7. Get help:
   ```bash
   openmed pii --help
   openmed pii extract --help
   openmed pii deidentify --help
   ```
""")

print("\nTo see available CLI commands, run:")
print("  !openmed pii --help")

---
## Summary and Best Practices

### Key Takeaways

1. **Smart Merging (v0.5.0)** - Always enabled by default
   - Fixes fragmentation issues
   - Production-ready complete entities
   - Minimal performance overhead (~5-10%)

2. **De-identification Methods**
   - `mask`: Best for clinical review (maintains structure)
   - `remove`: Maximum privacy (minimal data)
   - `replace`: Research datasets (synthetic data)
   - `hash`: Linking records (deterministic)
   - `shift_dates`: Temporal analysis (preserves relationships)

3. **Confidence Thresholds**
   - 0.5-0.7: Recommended for most use cases
   - Lower: High recall (catch more PII)
   - Higher: High precision (fewer false positives)

4. **Batch Processing**
   - Use for multiple documents
   - Efficient resource usage
   - Progress tracking included

5. **HIPAA Compliance**
   - Covers all 18 identifier categories
   - Audit trails with mappings
   - Safe Harbor method supported

### Best Practices

‚úÖ **DO:**
- Use `use_smart_merging=True` (default) for production
- Test with representative data
- Monitor entity quality (check for fragments)
- Keep mappings for audit trails
- Validate de-identified output

‚ùå **DON'T:**
- Disable smart merging without good reason
- Use very low thresholds without review
- Skip validation on production data
- Share mappings without encryption
- Assume 100% recall (always review edge cases)

### Performance Tips

- Use batch processing for multiple documents
- Adjust `batch_size` based on memory
- Cache model loading for repeated calls
- Monitor processing time with profiling

### Security Considerations

- Store mappings securely (encrypted)
- Limit access to original data
- Audit de-identification logs
- Follow organizational HIPAA policies
- Regular compliance reviews

---

## Resources

- **Documentation:** https://github.com/maziyarpanahi/openmed
- **Smart Merging Guide:** `docs/pii-smart-merging.md`
- **API Reference:** `docs/api-reference.md`
- **HIPAA Compliance:** `docs/hipaa-compliance.md`
- **Model Hub:** https://huggingface.co/openmed

---

**Version:** OpenMed v0.5.0+

**Last Updated:** 2026-01-13
