<a href="https://colab.research.google.com/github/madhupawar/2026_Consumer_AI_Systems_Healthcare/blob/main/Simple_Deidentification_Workbench_NER.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

Step 1: Setup & Installation
First, we install spaCy, an industry-standard Natural Language Processing (NLP) library. We also download a small English language model (en_core_web_sm) that has been pre-trained to recognize entities such as people, places, and dates.

In [None]:
# In Google Colab, run this cell first to install spaCy and download the language model.
!pip install -U spacy
!python -m spacy download en_core_web_sm

Collecting en-core-web-sm==3.8.0
  Downloading https://github.com/explosion/spacy-models/releases/download/en_core_web_sm-3.8.0/en_core_web_sm-3.8.0-py3-none-any.whl (12.8 MB)
✔ Download and installation successful
You can now load the package via spacy.load('en_core_web_sm')
⚠ Restart to reload dependencies
If you are in a Jupyter or Colab notebook, you may need to restart Python in
order to load all the package's dependencies. You can do this by selecting the
'Restart kernel' or 'Restart runtime' option.


Step 2: The 'Hybrid' Logic
This is the core of our pipeline. We use spaCy's named entity recognition to handle ambiguous, context-dependent items (names, locations, dates) and regular expressions to handle identifiers with a fixed format (MRNs, ZIP codes). Each layer covers gaps that the other tends to miss.

In [None]:
import spacy
import re

# Load the pre-trained spaCy model
nlp = spacy.load("en_core_web_sm")

def deidentify_note(text):
    doc = nlp(text)
    redactions = []

    # --- LAYER 1: AI (Named Entity Recognition) ---
    # We ask the model for specific categories relevant to HIPAA
    target_labels = ["PERSON", "GPE", "DATE", "LOC"]

    for ent in doc.ents:
        if ent.label_ in target_labels:
            # Store the start/end character positions
            redactions.append((ent.start_char, ent.end_char, ent.label_))

    # --- LAYER 2: RULES (Regex for IDs) ---
    # NER models often miss numeric identifiers, so we use strict patterns for them.
    # Note: these patterns match ANY number of the given length, not only real IDs.

    # Pattern for Medical Record Number (6 digits)
    mrn_pattern = r"\b\d{6}\b"
    for match in re.finditer(mrn_pattern, text):
        redactions.append((match.start(), match.end(), "MRN"))

    # Pattern for Zip Codes (5 digits)
    zip_pattern = r"\b\d{5}\b"
    for match in re.finditer(zip_pattern, text):
        redactions.append((match.start(), match.end(), "ZIP"))

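    # --- EXECUTION: Apply Redactions ---
    # Sort by start position (longest span first; rule-based matches win exact ties),
    # then drop any span that overlaps one already kept so no text is redacted twice.
    redactions.sort(key=lambda x: (x[0], -x[1], x[2] not in ("MRN", "ZIP")))
    kept = []
    for start, end, label in redactions:
        if kept and start < kept[-1][1]:
            continue
        kept.append((start, end, label))

    # Apply from the end of the text backwards so earlier offsets stay valid.
    sanitized_text = text
    for start, end, label in reversed(kept):
        # Replace the PHI with a safe tag
        replacement = f"[{label}]"
        sanitized_text = sanitized_text[:start] + replacement + sanitized_text[end:]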

    return sanitized_text
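
The function applies its replacements starting from the end of the text. The short sketch below illustrates why that order matters. It is a standalone illustration, not part of the pipeline: the helper name `apply_spans` and the sample string are made up for this example.

```python
def apply_spans(text, spans):
    """Replace each (start, end, label) span with [label], in the order given.

    Offsets always refer to the ORIGINAL text, as in deidentify_note.
    """
    for start, end, label in spans:
        text = text[:start] + f"[{label}]" + text[end:]
    return text

sample = "MRN 884291 zip 90210"
spans = [(4, 10, "MRN"), (15, 20, "ZIP")]

# Left to right: the first replacement shortens the string, so the
# second span's offsets no longer line up and a digit leaks through.
left_to_right = apply_spans(sample, spans)

# Right to left: edits happen after every offset still to be used,
# so the remaining offsets stay valid.
right_to_left = apply_spans(sample, sorted(spans, reverse=True))

print(left_to_right)   # MRN [MRN] zip 9[ZIP]
print(right_to_left)   # MRN [MRN] zip [ZIP]
```

The leaked "9" in the first result is exactly the kind of silent failure that reverse ordering prevents.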

Step 3: Run the Pipeline
Now, let's test it on a clinical note. Notice how the model separates "Los Angeles" (a location, which is PHI) from "anxiety" (a condition, which is not).

In [None]:
# The raw clinical note containing PHI
clinical_note = """
Patient: Sarah Connor (DOB: 05/12/1984). MRN: 884291.
Presenting Complaint: Patient arrived at localized ER in Los Angeles
complaining of severe anxiety and delusions regarding 'machines'.
History: Patient suffered a fracture in 2019.
Lives at 1984 Skynet Blvd, Zip 90210.
Assessment: Acute stress reaction.
Plan: Follow up with Dr. Silberman on Monday.
"""

# Run our function
result = deidentify_note(clinical_note)

print("--- ORIGINAL ---")
print(clinical_note)
print("\n--- DE-IDENTIFIED ---")
print(result)

--- ORIGINAL ---

Patient: Sarah Connor (DOB: 05/12/1984). MRN: 884291.
Presenting Complaint: Patient arrived at localized ER in Los Angeles
complaining of severe anxiety and delusions regarding 'machines'.
History: Patient suffered a fracture in 2019.
Lives at 1984 Skynet Blvd, Zip 90210.
Assessment: Acute stress reaction.
Plan: Follow up with Dr. Silberman on Monday.


--- DE-IDENTIFIED ---

Patient: [PERSON] (DOB: 05/12/1984). MRN: [MRN].
[PERSON] arrived at localized [GPE] in [GPE]
complaining of severe anxiety and delusions regarding 'machines'.
History: Patient suffered a fracture in [DATE].
Lives at [DATE] Skynet Blvd, Zip [ZIP].
Assessment: Acute stress reaction.
Plan: Follow up with Dr. [PERSON] on [DATE].



**Expected Output & Analysis**

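The block below is a reference redaction of the same note, with XXX marking each piece of PHI. The output from the code above uses [LABEL] tags instead and will not match it exactly, so compare the two line by line.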

Patient: XXX (DOB: XXX/XXX/1984). MRN: XXX.
Presenting Complaint: Patient arrived at localized ER in XXX
complaining of severe anxiety and delusions regarding 'machines'.
History: Patient suffered a fracture in 2019.
Lives at XXX, Zip XXX.
Assessment: Acute stress reaction.
Plan: Follow up with Dr. XXX on XXX.
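
In the actual output above, the date of birth (05/12/1984) is left untouched, while this reference masks the month and day and keeps the year. One possible way to close that gap is an extra regex rule for slash-separated dates. The sketch below is an assumption for illustration, not part of the notebook's pipeline: the names `DATE_PATTERN` and `mask_slash_dates` are hypothetical, and the pattern only covers the MM/DD/YYYY style seen in this note.

```python
import re

# Hypothetical extra rule: slash-separated dates such as 05/12/1984.
# The year is captured so it can be kept, as in the reference output.
DATE_PATTERN = re.compile(r"\b\d{1,2}/\d{1,2}/(\d{4})\b")

def mask_slash_dates(text):
    """Mask the month and day of slash-separated dates, keeping the year."""
    return DATE_PATTERN.sub(r"XXX/XXX/\1", text)

print(mask_slash_dates("Patient: [PERSON] (DOB: 05/12/1984). MRN: [MRN]."))
# prints: Patient: [PERSON] (DOB: XXX/XXX/1984). MRN: [MRN].
```

Other date formats (for example "May 12, 1984") would need their own patterns or would have to be caught by the NER layer.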



(1) Discuss the model's precision and recall.

(2) Why did the model redact "1984" in the address as [DATE] instead of treating it as part of the address?
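
As a starting point for question (1), the sketch below computes precision and recall by exact string match. It rests on two assumptions that you should feel free to challenge: the `predicted` set is read off the [LABEL] tags in the actual output, and the `gold` set is hand-labeled from the XXX reference output. The function name `precision_recall` is hypothetical.

```python
def precision_recall(predicted, gold):
    """Exact-match precision and recall over sets of redacted strings."""
    predicted, gold = set(predicted), set(gold)
    true_positives = len(predicted & gold)
    precision = true_positives / len(predicted) if predicted else 0.0
    recall = true_positives / len(gold) if gold else 0.0
    return precision, recall

# Strings the pipeline replaced with a tag in the actual output.
predicted = {
    "Sarah Connor", "884291", "Presenting Complaint: Patient", "ER",
    "Los Angeles", "2019", "1984", "90210", "Silberman", "Monday",
}

# Assumed gold labels, read off the XXX reference output by hand.
gold = {
    "Sarah Connor", "05/12", "884291", "Los Angeles",
    "1984 Skynet Blvd", "90210", "Silberman", "Monday",
}

precision, recall = precision_recall(predicted, gold)
print(f"precision={precision:.2f} recall={recall:.2f}")
# prints: precision=0.60 recall=0.75
```

Exact matching is strict: the partial hit "1984" counts as both a false positive and a miss against "1984 Skynet Blvd". A token-level or overlap-based comparison would give different numbers, which is worth addressing in your discussion.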