This notebook applies privacy-preserving de-identification to preprocessed medical text: it masks or generalizes sensitive personal information, such as protected health information (PHI), while preserving the medical meaning needed for downstream LLM tasks.

In [1]:
import pandas as pd
import re
from pathlib import Path


In [2]:
DATA_DIR = Path("data")

train_df = pd.read_csv("train_preprocessed.csv")
test_df = pd.read_csv("test_preprocessed.csv")

print("Train shape:", train_df.shape)
print("Test shape:", test_df.shape)

train_df.head()


Train shape: (9240, 2)
Test shape: (2310, 2)


Unnamed: 0,raw,processed
0,Simultaneous bilateral hernia repair. A case a...,simultaneous bilateral hernia repair. a case a...
1,Exposure histories in acute nonlymphocytic leu...,exposure histories in acute nonlymphocytic leu...
2,Misdiagnosis of the Zollinger-Ellison syndrome...,misdiagnosis of the zollingerellison syndrome ...
3,The effect of sleep on the dyskinetic movement...,the effect of sleep on the dyskinetic movement...
4,Use of ektacytometry to determine red cell sus...,use of ektacytometry to determine red cell sus...


In [19]:
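# Regexes for direct identifiers; each match is replaced by "[LABEL]".
PATTERNS = {
    "EMAIL": r"\b[\w\.-]+@[\w\.-]+\.\w+\b",
    # (?<!\w) instead of \b so that a leading "+" is masked with the number.
    # Note: this also matches bare digit runs of 8 to 14 digits (e.g. long numeric codes).
    "PHONE": r"(?<!\w)(?:\+?\d{1,3}[\s\-]?)?\d{2,4}[\s\-]?\d{3}[\s\-]?\d{3,4}\b",
    # Allow any mix of colon/whitespace between "ID" and the digits ("ID: 123").
    "ID": r"\bID[:\s]*\d+\b",
}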
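
Before substituting, it can help to see exactly which spans each pattern would hit. The sketch below is not part of the original notebook: `find_phi_spans` and `SAMPLE_PATTERNS` are hypothetical names, and the single e-mail pattern is only a stand-in for the `PATTERNS` dictionary above.

```python
import re

# Hypothetical stand-in for the notebook's PATTERNS dict (assumption: same
# label -> regex layout); only one simple pattern is included here.
SAMPLE_PATTERNS = {
    "EMAIL": r"\b[\w\.-]+@[\w\.-]+\.\w+\b",
}


def find_phi_spans(text, patterns):
    """Return (label, matched_text, start, end) for every pattern hit."""
    hits = []
    for label, pattern in patterns.items():
        for m in re.finditer(pattern, text, flags=re.IGNORECASE):
            hits.append((label, m.group(), m.start(), m.end()))
    # Sort by start offset so the report reads left to right
    return sorted(hits, key=lambda h: h[2])


sample = "Contact jane.doe@example.org for reprints."
for label, matched, start, end in find_phi_spans(sample, SAMPLE_PATTERNS):
    print(f"{label}: {matched!r} at [{start}, {end})")
```

Running a report like this on a handful of abstracts before calling `re.sub` is one way to spot over-matching, for example a numeric code being treated as a phone number.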


In [20]:
def bucket_age(age):
    """Map an exact age to a coarse bucket token; ages of 90 and above share one bucket."""
    age = int(age)

    if age < 18:
        return "[AGE_CHILD]"
    elif age < 35:
        return "[AGE_YOUNG_ADULT]"
    elif age < 60:
        return "[AGE_MIDDLE_ADULT]"
    elif age < 90:
        return "[AGE_SENIOR]"
    else:
        return "[AGE_90_PLUS]"
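
The same bucketing can also be written as a lookup table, which keeps the thresholds and labels in one place if they need to change. This is an alternative sketch rather than the notebook's implementation; `AGE_BOUNDS`, `AGE_LABELS`, and `bucket_age_table` are hypothetical names, and the thresholds are copied from the `if`/`elif` chain above.

```python
from bisect import bisect_right

# Exclusive upper bounds and labels, mirroring the if/elif chain above.
AGE_BOUNDS = [18, 35, 60, 90]
AGE_LABELS = [
    "[AGE_CHILD]",
    "[AGE_YOUNG_ADULT]",
    "[AGE_MIDDLE_ADULT]",
    "[AGE_SENIOR]",
    "[AGE_90_PLUS]",
]


def bucket_age_table(age):
    """Table-driven variant (hypothetical) of the age bucketing logic."""
    # bisect_right counts how many bounds are <= age; that count is the label index
    return AGE_LABELS[bisect_right(AGE_BOUNDS, int(age))]


# Probe both sides of every threshold
for probe in (17, 18, 34, 35, 59, 60, 89, 90):
    print(probe, bucket_age_table(probe))
```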


In [21]:
# Used with re.VERBOSE in deidentify_text, so whitespace and "#" comments inside are ignored.
AGE_PATTERN = r"""
\b(
    \d{1,3}[\s-]*years?[\s-]*old |   # "17-year-old", "17-year old", "17 years old"
    aged?\s*\d{1,3}                  # "age 45", "aged 45"
)\b
"""


In [22]:
GENDER_MAP = {
    r"\bgirl\b": "[GENDER_FEMALE]",
    r"\bboy\b": "[GENDER_MALE]",
    r"\bwoman\b": "[GENDER_FEMALE]",
    r"\bman\b": "[GENDER_MALE]"
}


In [23]:
def extract_age(text):
    match = re.search(r"\d{1,3}", text)
    return match.group() if match else None
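
`extract_age` re-scans the matched phrase to recover the number. A possible alternative, sketched here under assumptions rather than taken from the notebook, is to capture the number with a named group so the replacement callback can read it directly. `AGE_WITH_GROUP` and `generalize_ages` are hypothetical names, the pattern covers only the "N-year-old" and "aged N" forms, and the two-bucket labels are simplified for illustration.

```python
import re

# Python does not allow the same group name twice in one pattern,
# so each alternative gets its own named group.
AGE_WITH_GROUP = re.compile(
    r"""
    \b(?:
        (?P<prefix_age>\d{1,3}) [\s-]* years? [\s-]* old   # "17-year-old"
      | aged? \s* (?P<suffix_age>\d{1,3})                  # "aged 45"
    )\b
    """,
    flags=re.IGNORECASE | re.VERBOSE,
)


def generalize_ages(text):
    """Replace age phrases with a simplified two-bucket token (illustrative labels)."""
    def replace(match):
        # Exactly one of the two named groups is set for any match
        age = int(match.group("prefix_age") or match.group("suffix_age"))
        return "[AGE_MINOR]" if age < 18 else "[AGE_ADULT]"

    return AGE_WITH_GROUP.sub(replace, text)


print(generalize_ages("A 17-year-old girl and a man aged 45."))
```

The trade-off is a slightly busier pattern in exchange for dropping the second regex search per match.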


In [24]:
def deidentify_text(
    text,
    mask_age=True,
    mask_gender=True
):
    if not isinstance(text, str):
        return text

    # 1. Mask non-age PHI
    for label, pattern in PATTERNS.items():
        text = re.sub(pattern, f"[{label}]", text, flags=re.IGNORECASE)

    # 2. Generalize individual ages (e.g. "17-year-old") into coarse age buckets
    if mask_age:
        def replace_age(match):
            age = extract_age(match.group())
            return bucket_age(age) if age else match.group()

        text = re.sub(
            AGE_PATTERN,
            replace_age,
            text,
            flags=re.IGNORECASE | re.VERBOSE
        )

    # 3. Optional gender masking (for case reports)
    if mask_gender:
        for pattern, token in GENDER_MAP.items():
            text = re.sub(pattern, token, text, flags=re.IGNORECASE)

    return text


In [25]:
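# De-identify the raw text, which keeps the original casing and hyphens
# (the processed column is lowercased and drops hyphens, e.g. "zollingerellison").
train_df["text_deidentified"] = train_df["raw"].apply(deidentify_text)
test_df["text_deidentified"] = test_df["raw"].apply(deidentify_text)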

train_df[["raw", "text_deidentified"]].head()

Unnamed: 0,raw,text_deidentified
0,Simultaneous bilateral hernia repair. A case a...,Simultaneous bilateral hernia repair. A case a...
1,Exposure histories in acute nonlymphocytic leu...,Exposure histories in acute nonlymphocytic leu...
2,Misdiagnosis of the Zollinger-Ellison syndrome...,Misdiagnosis of the Zollinger-Ellison syndrome...
3,The effect of sleep on the dyskinetic movement...,The effect of sleep on the dyskinetic movement...
4,Use of ektacytometry to determine red cell sus...,Use of ektacytometry to determine red cell sus...


In [26]:
example = train_df.iloc[363]

print("ORIGINAL:\n", example["raw"])
print("\nDE-IDENTIFIED:\n", example["text_deidentified"])

ORIGINAL:
 Autonomic dysfunction and Guillain-Barre syndrome. The use of esmolol in its management. A 17-year-old girl with Guillain-Barre syndrome and autonomic dysfunction was treated successfully with esmolol. Esmolol may be an appropriate drug for the rapid assessment and control of tachyarrhythmias in critically ill patients. 

DE-IDENTIFIED:
 Autonomic dysfunction and Guillain-Barre syndrome. The use of esmolol in its management. A [AGE_CHILD] [GENDER_FEMALE] with Guillain-Barre syndrome and autonomic dysfunction was treated successfully with esmolol. Esmolol may be an appropriate drug for the rapid assessment and control of tachyarrhythmias in critically ill patients. 
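
A single example shows that the rules fire, but not how often. One way to get an overview is to count the placeholder tokens across all de-identified texts. The sketch below is an assumption-based helper, not part of the original notebook: `count_placeholders` and `PLACEHOLDER` are hypothetical names, and it assumes every placeholder has the `[UPPER_CASE_LABEL]` shape seen above. Bracketed upper-case text that was already present in an abstract would be counted as well.

```python
import re
from collections import Counter

# Assumption: placeholders look like [AGE_CHILD], [GENDER_FEMALE], [EMAIL], ...
PLACEHOLDER = re.compile(r"\[[A-Z][A-Z0-9_]*\]")


def count_placeholders(texts):
    """Count placeholder tokens over an iterable of de-identified strings."""
    counts = Counter()
    for text in texts:
        if isinstance(text, str):  # skip NaN or other non-string cells
            counts.update(PLACEHOLDER.findall(text))
    return counts


sample_texts = [
    "A [AGE_CHILD] [GENDER_FEMALE] with Guillain-Barre syndrome and "
    "autonomic dysfunction was treated successfully with esmolol.",
    "No identifiers in this abstract.",
]
print(count_placeholders(sample_texts))
```

With the notebook's data, the helper could presumably be applied to the `text_deidentified` column directly, since iterating over a pandas Series yields its values.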
