# 🏥 Medical Data Preprocessing and Curation: Hands-on Practice

## Table of Contents
1. [PHI Detection and Removal](#practice-1-phi-detection-and-removal)
2. [Clinical Text Normalization](#practice-2-clinical-text-normalization)
3. [Medical Abbreviation Expansion](#practice-3-medical-abbreviation-expansion)
4. [Negation Detection](#practice-4-negation-detection)
5. [Medical Coding with UMLS](#practice-5-medical-coding-with-umls)
6. [Data Quality Assessment](#practice-6-data-quality-assessment)

---
**Learning Objectives:**
- Implement PHI detection and anonymization techniques
- Normalize clinical text for analysis
- Apply medical coding systems (ICD, SNOMED CT)
- Assess and improve data quality

## Installing and Importing Essential Libraries

In [None]:
# Install required packages (uncomment if needed)
# !pip install pandas numpy matplotlib seaborn
# !pip install presidio-analyzer presidio-anonymizer
# !pip install spacy
# !python -m spacy download en_core_web_sm

# Import essential libraries
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
import re
from datetime import datetime
import warnings
warnings.filterwarnings('ignore')

# Visualization settings
plt.rcParams['figure.figsize'] = (10, 6)
plt.rcParams['font.size'] = 11
sns.set_style('whitegrid')

print("✅ All libraries loaded successfully!")

---
## Practice 1: PHI Detection and Removal

### 🎯 Learning Objectives
- Understand the 18 HIPAA identifiers
- Implement rule-based PHI detection
- Apply anonymization techniques

### 📖 Key Concepts
**Protected Health Information (PHI):** Health information that can be linked to an individual patient
- Names, addresses, dates, phone numbers, email addresses, SSNs, medical record numbers, etc.
- **Safe Harbor Method:** Remove all 18 identifier types listed in the HIPAA Privacy Rule

In [None]:
# 1.1 Sample clinical note with PHI
sample_note = """
Patient Name: John Doe
DOB: 03/15/1975
MRN: 123456
Phone: (555) 123-4567
Email: john.doe@email.com

Chief Complaint: Chest pain
HPI: 48-year-old male presents with chest pain since 2023-11-10.
Address: 123 Main Street, Boston, MA 02101

Assessment: Possible angina
Plan: ECG, troponin levels
"""

print("Original Clinical Note:")
print("=" * 60)
print(sample_note)

In [None]:
# 1.2 Rule-based PHI detection using regex
def detect_phi_patterns(text):
    """Detect PHI using regular expressions"""
    
    patterns = {
        'Date (MM/DD/YYYY)': r'\b\d{2}/\d{2}/\d{4}\b',
        'Date (YYYY-MM-DD)': r'\b\d{4}-\d{2}-\d{2}\b',
        'Phone': r'\(\d{3}\)\s*\d{3}-\d{4}',
        'Email': r'\b[A-Za-z0-9._%+-]+@[A-Za-z0-9.-]+\.[A-Za-z]{2,}\b',
        'SSN': r'\b\d{3}-\d{2}-\d{4}\b',
        'MRN': r'MRN:\s*(\d+)',
        'Address (ZIP)': r'\b\d{5}(?:-\d{4})?\b'
    }
    
    detected = {}
    
    for phi_type, pattern in patterns.items():
        matches = re.findall(pattern, text)
        if matches:
            detected[phi_type] = matches
    
    return detected

# Detect PHI
phi_found = detect_phi_patterns(sample_note)

print("\nDetected PHI:")
print("=" * 60)
for phi_type, values in phi_found.items():
    print(f"  {phi_type}: {values}")

In [None]:
# 1.3 PHI anonymization
def anonymize_phi(text):
    """Remove or mask PHI from clinical text"""
    
    # Define replacement patterns
    anonymization_rules = [
        (r'Patient Name:\s*[A-Z][a-z]+\s+[A-Z][a-z]+', 'Patient Name: [REDACTED]'),
        (r'\b\d{2}/\d{2}/\d{4}\b', '[DATE]'),
        (r'\b\d{4}-\d{2}-\d{2}\b', '[DATE]'),
        (r'\(\d{3}\)\s*\d{3}-\d{4}', '[PHONE]'),
        (r'\b[A-Za-z0-9._%+-]+@[A-Za-z0-9.-]+\.[A-Za-z]{2,}\b', '[EMAIL]'),
        (r'MRN:\s*\d+', 'MRN: [REDACTED]'),
        (r'\d+\s+[A-Z][a-z]+\s+Street,\s*[A-Z][a-z]+,\s*[A-Z]{2}\s+\d{5}', '[ADDRESS]')
    ]
    
    anonymized = text
    for pattern, replacement in anonymization_rules:
        anonymized = re.sub(pattern, replacement, anonymized)
    
    return anonymized

# Apply anonymization
anonymized_note = anonymize_phi(sample_note)

print("\nAnonymized Clinical Note:")
print("=" * 60)
print(anonymized_note)
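
Masking is not the only option under Safe Harbor. The rule permits keeping the year of a date and, for sufficiently populated areas, the first three digits of a ZIP code, which preserves some analytic value. The sketch below illustrates that kind of generalization. The function name `generalize_quasi_identifiers` and the `**` placeholder are illustrative choices, and the population check for three-digit ZIP prefixes (more than 20,000 residents) is assumed to happen elsewhere.

```python
import re

def generalize_quasi_identifiers(text):
    """Keep only the year of each date and the first three digits of each ZIP code."""
    # MM/DD/YYYY -> YYYY
    text = re.sub(r'\b\d{2}/\d{2}/(\d{4})\b', r'\1', text)
    # YYYY-MM-DD -> YYYY
    text = re.sub(r'\b(\d{4})-\d{2}-\d{2}\b', r'\1', text)
    # Two-letter state code followed by a ZIP -> keep the state and a 3-digit prefix
    text = re.sub(r'\b([A-Z]{2})\s+(\d{3})\d{2}(?:-\d{4})?\b', r'\1 \2**', text)
    return text

demo = "DOB: 03/15/1975. Seen on 2023-11-10. Address: Boston, MA 02101"
print(generalize_quasi_identifiers(demo))
# DOB: 1975. Seen on 2023. Address: Boston, MA 021**
```

Dates tied to patients older than 89 need extra care, because Safe Harbor also requires aggregating such ages into a single 90-or-older category.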

---
## Practice 2: Clinical Text Normalization

### 🎯 Learning Objectives
- Standardize clinical text formats
- Convert units and dates
- Handle special characters and punctuation

### 📖 Key Concepts
**Text Normalization:** Transforming text into a standard format
- Lowercase conversion
- Date/time standardization
- Unit conversion
- Special character handling

In [None]:
# 2.1 Clinical text normalization
def normalize_clinical_text(text):
    """Normalize clinical text for analysis"""
    
    normalized = text
    
    # 1. Lowercase conversion (except abbreviations)
    # For simplicity, we'll keep abbreviations as-is
    
    # 2. Date standardization (MM/DD/YYYY -> YYYY-MM-DD)
    def standardize_date(match):
        date_str = match.group(0)
        try:
            date_obj = datetime.strptime(date_str, '%m/%d/%Y')
            return date_obj.strftime('%Y-%m-%d')
        except ValueError:  # not a valid calendar date, e.g. 13/45/2023
            return date_str
    
    normalized = re.sub(r'\b\d{2}/\d{2}/\d{4}\b', standardize_date, normalized)
    
    # 3. Unit conversion (Fahrenheit to Celsius)
    def fahrenheit_to_celsius(match):
        temp_f = float(match.group(1))
        temp_c = (temp_f - 32) * 5/9
        return f"{temp_c:.1f}°C"
    
    normalized = re.sub(r'(\d+\.?\d*)\s*°?F\b', fahrenheit_to_celsius, normalized)
    
    # 4. Remove extra whitespace
    normalized = re.sub(r'\s+', ' ', normalized).strip()
    
    return normalized

# Test normalization
test_text = """
Patient presented on 03/15/2023 with fever of 101.5°F.
Temperature was  98.6F  at  discharge on 03/18/2023.
"""

print("Original Text:")
print(test_text)
print("\nNormalized Text:")
print(normalize_clinical_text(test_text))
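
The normalization function above skips the lowercase step to keep abbreviations intact. One possible way to do both is sketched below: lowercase every word unless it is written entirely in capitals. The helper name `lowercase_except_abbreviations` and the "two or more capital letters means abbreviation" heuristic are assumptions made for illustration; a note typed entirely in capitals would pass through unchanged.

```python
import re

def lowercase_except_abbreviations(text):
    """Lowercase each word unless it is written entirely in capitals (e.g. BP, ECG)."""
    def lower_word(match):
        word = match.group(0)
        # Heuristic: all-caps tokens of two or more letters are treated as abbreviations
        if len(word) >= 2 and word.isupper():
            return word
        return word.lower()

    return re.sub(r'[A-Za-z]+', lower_word, text)

print(lowercase_except_abbreviations("Patient Reports Elevated BP; ECG Ordered."))
# patient reports elevated BP; ECG ordered.
```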

---
## Practice 3: Medical Abbreviation Expansion

### 🎯 Learning Objectives
- Create and use medical abbreviation dictionaries
- Handle context-dependent abbreviations
- Implement abbreviation expansion

### 📖 Key Concepts
**Medical Abbreviations:** Shortened forms commonly used in clinical documentation
- **BP** → blood pressure
- **HR** → heart rate
- **MI** → myocardial infarction
- **COPD** → chronic obstructive pulmonary disease

In [None]:
# 3.1 Medical abbreviation dictionary
medical_abbreviations = {
    'BP': 'blood pressure',
    'HR': 'heart rate',
    'RR': 'respiratory rate',
    'Temp': 'temperature',
    'MI': 'myocardial infarction',
    'CHF': 'congestive heart failure',
    'COPD': 'chronic obstructive pulmonary disease',
    'HTN': 'hypertension',
    'DM': 'diabetes mellitus',
    'Dx': 'diagnosis',
    'Tx': 'treatment',
    'Hx': 'history',
    'Sx': 'symptoms',
    'Pt': 'patient',
    'CBC': 'complete blood count',
    'ECG': 'electrocardiogram',
    'EKG': 'electrocardiogram'
}

def expand_abbreviations(text, abbrev_dict):
    """Expand medical abbreviations in clinical text"""
    
    expanded = text
    
    # Sort by length (longest first) to handle overlapping abbreviations
    sorted_abbrevs = sorted(abbrev_dict.items(), key=lambda x: len(x[0]), reverse=True)
    
    for abbrev, expansion in sorted_abbrevs:
        # Use word boundaries to avoid partial matches
        pattern = r'\b' + re.escape(abbrev) + r'\b'
        expanded = re.sub(pattern, expansion, expanded)
    
    return expanded

# Test abbreviation expansion
test_note = """
Pt presents with elevated BP and HR.
Hx of MI and CHF.
Current Sx: chest pain, shortness of breath.
Dx: Possible MI
Tx: ECG ordered, monitor vitals
"""

print("Original Note:")
print(test_note)
print("\nExpanded Note:")
print(expand_abbreviations(test_note, medical_abbreviations))
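
A flat dictionary cannot handle abbreviations whose meaning depends on context. A minimal sketch of one approach follows: give each candidate expansion a set of cue words and pick the expansion whose cues overlap most with the sentence. The sense inventory, the cue words, and the function name `expand_with_context` are all hypothetical; real systems typically rely on curated sense inventories or trained models.

```python
import re

# Hypothetical sense inventory: each candidate expansion lists its context cue words
ambiguous_abbreviations = {
    'RA': {
        'rheumatoid arthritis': {'joint', 'joints', 'stiffness', 'methotrexate'},
        'right atrium': {'echo', 'echocardiogram', 'atrial', 'ventricle'},
    },
}

def expand_with_context(sentence, abbrev, senses=ambiguous_abbreviations):
    """Expand abbrev using the sense whose cue words overlap most with the sentence."""
    tokens = set(re.findall(r'[a-z]+', sentence.lower()))
    candidates = senses[abbrev]
    best = max(candidates, key=lambda expansion: len(candidates[expansion] & tokens))
    if not candidates[best] & tokens:
        return sentence  # no cue word present: leave the abbreviation untouched
    return re.sub(r'\b' + re.escape(abbrev) + r'\b', best, sentence)

print(expand_with_context("RA with morning stiffness in both hands.", 'RA'))
print(expand_with_context("Echo shows a dilated RA.", 'RA'))
print(expand_with_context("Hx of RA.", 'RA'))
```

Leaving the abbreviation unexpanded when no cue matches is a deliberate choice here: a wrong expansion in a clinical note is usually worse than none.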

---
## Practice 4: Negation Detection

### 🎯 Learning Objectives
- Implement the NegEx algorithm
- Identify negation triggers
- Determine negation scope

### 📖 Key Concepts
**NegEx Algorithm:** Rule-based algorithm to detect negated medical concepts
- **Negation triggers:** no, not, denies, without, absent, negative
- **Scope:** Words following the trigger within a certain window

In [None]:
# 4.1 Simple NegEx implementation
def detect_negation(text, concept):
    """Detect if a medical concept is negated"""
    
    # Negation triggers
    negation_triggers = [
        'no', 'not', 'denies', 'denied', 'without', 'absent',
        'negative', 'negative for', 'no evidence of', 'rule out',
        'free of', 'never'
    ]
    
    # Convert to lowercase for matching
    text_lower = text.lower()
    concept_lower = concept.lower()
    
    # Find concept position
    concept_pos = text_lower.find(concept_lower)
    
    if concept_pos == -1:
        return None  # Concept not found
    
    # Look for negation triggers before the concept (window of 5 words)
    preceding_text = text_lower[:concept_pos]
    words_before = preceding_text.split()[-5:]  # Last 5 words
    
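    window_text = ' '.join(words_before)
    for trigger in negation_triggers:
        # Match whole words only, so "no" does not fire inside "known"
        if re.search(r'\b' + re.escape(trigger) + r'\b', window_text):
            return True  # Negated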
    
    return False  # Not negated

# Test negation detection
test_cases = [
    ("Patient has diabetes", "diabetes"),
    ("Patient denies chest pain", "chest pain"),
    ("No evidence of pneumonia", "pneumonia"),
    ("History of myocardial infarction", "myocardial infarction"),
    ("Patient is negative for COVID-19", "COVID-19")
]

print("Negation Detection Results:")
print("=" * 70)
for text, concept in test_cases:
    is_negated = detect_negation(text, concept)
    status = "❌ NEGATED" if is_negated else "✅ AFFIRMED" if is_negated is False else "❓ NOT FOUND"
    print(f"{status:15} | Concept: {concept:25} | Text: {text}")
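
A fixed word window can let a trigger reach too far: in "Patient denies fever but has cough", the word "denies" sits within five words of "cough". NegEx-style systems limit scope with termination terms such as "but". The sketch below adds that idea to the window approach; the trigger and termination lists are small assumed subsets, and `is_negated_with_scope` is an illustrative name rather than part of any NegEx library.

```python
import re

# Assumed subsets for illustration; published NegEx lists are much longer
NEGATION_TRIGGERS = ['no', 'not', 'denies', 'denied', 'without',
                     'negative for', 'no evidence of']
TERMINATION_TERMS = ['but', 'however', 'although', 'except']

TRIGGER_RE = re.compile(r'\b(?:' + '|'.join(map(re.escape, NEGATION_TRIGGERS)) + r')\b')
TERMINATION_RE = re.compile(r'\b(?:' + '|'.join(map(re.escape, TERMINATION_TERMS)) + r')\b')

def is_negated_with_scope(text, concept, window=5):
    """Return True/False for negated/affirmed, or None if the concept is absent."""
    text_lower = text.lower()
    concept_pos = text_lower.find(concept.lower())
    if concept_pos == -1:
        return None
    preceding = text_lower[:concept_pos]
    # A termination term ends the scope of any trigger that came before it
    terminations = list(TERMINATION_RE.finditer(preceding))
    if terminations:
        preceding = preceding[terminations[-1].end():]
    window_text = ' '.join(preceding.split()[-window:])
    return TRIGGER_RE.search(window_text) is not None

sentence = "Patient denies fever but has cough"
for concept in ['fever', 'cough', 'rash']:
    print(concept, is_negated_with_scope(sentence, concept))
# fever True
# cough False
# rash None
```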

---
## Practice 5: Medical Coding with UMLS

### 🎯 Learning Objectives
- Understand medical coding systems (ICD, SNOMED CT)
- Map clinical terms to standard codes
- Implement entity linking

### 📖 Key Concepts
**Medical Coding Systems:**
- **ICD-10:** International Classification of Diseases (the US clinical modification, ICD-10-CM, has roughly 70,000 codes)
- **SNOMED CT:** Systematized Nomenclature of Medicine (350,000+ concepts)
- **RxNorm:** Drug normalization
- **LOINC:** Laboratory test codes (96,000+ codes)

In [None]:
# 5.1 Simple medical coding dictionary
medical_codes = {
    # ICD-10 and SNOMED CT codes for common conditions
    'diabetes mellitus': {'ICD-10': 'E11.9', 'SNOMED': '73211009'},  # E11.9 = type 2 DM without complications
    'hypertension': {'ICD-10': 'I10', 'SNOMED': '38341003'},
    'myocardial infarction': {'ICD-10': 'I21.9', 'SNOMED': '22298006'},
    'pneumonia': {'ICD-10': 'J18.9', 'SNOMED': '233604007'},
    'congestive heart failure': {'ICD-10': 'I50.9', 'SNOMED': '42343007'},
    'chronic obstructive pulmonary disease': {'ICD-10': 'J44.9', 'SNOMED': '13645005'},
    'chest pain': {'ICD-10': 'R07.9', 'SNOMED': '29857009'},
    'fever': {'ICD-10': 'R50.9', 'SNOMED': '386661006'},
}

# LOINC codes for common lab tests
loinc_codes = {
    'glucose': '2339-0',  # Glucose [Mass/volume] in Blood
    'hemoglobin': '718-7',  # Hemoglobin [Mass/volume] in Blood
    'creatinine': '2160-0',  # Creatinine [Mass/volume] in Serum or Plasma
    'white blood cell count': '6690-2',  # Leukocytes [#/volume] in Blood
}

def map_to_codes(clinical_term):
    """Map clinical terms to standard codes"""
    
    term_lower = clinical_term.lower()
    
    # Check disease codes
    if term_lower in medical_codes:
        return medical_codes[term_lower]
    
    # Check lab test codes
    if term_lower in loinc_codes:
        return {'LOINC': loinc_codes[term_lower]}
    
    return None

# Test medical coding
test_terms = [
    'diabetes mellitus',
    'hypertension',
    'chest pain',
    'glucose',
    'hemoglobin'
]

print("Medical Coding Results:")
print("=" * 70)
for term in test_terms:
    codes = map_to_codes(term)
    if codes:
        print(f"\nTerm: {term}")
        for system, code in codes.items():
            print(f"  {system}: {code}")
    else:
        print(f"\nTerm: {term} - No codes found")
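
The lookup above expects an exact term. Entity linking usually starts from free text, so a first step is to find dictionary terms inside a note and record where they occur. The following is a minimal sketch of that step using a small ICD-10 subset of the codes from this practice; `icd10_lookup` and `link_terms` are illustrative names, and plain string matching will miss synonyms, misspellings, and negated mentions.

```python
import re

# Small subset of the codes used in this practice (ICD-10 only, for brevity)
icd10_lookup = {
    'chest pain': 'R07.9',
    'hypertension': 'I10',
    'fever': 'R50.9',
}

def link_terms(text, lookup=icd10_lookup):
    """Return (term, code, start, end) for every dictionary term found in text."""
    found = []
    # Longest terms first, mirroring the abbreviation expansion strategy
    for term in sorted(lookup, key=len, reverse=True):
        pattern = r'\b' + re.escape(term) + r'\b'
        for match in re.finditer(pattern, text, flags=re.IGNORECASE):
            found.append((term, lookup[term], match.start(), match.end()))
    # Report mentions in the order they appear in the note
    return sorted(found, key=lambda item: item[2])

note = "History of hypertension. Presents with chest pain and fever."
for term, code, start, end in link_terms(note):
    print(f"{code:6} {term:15} chars {start}-{end}")
```

Keeping character offsets makes it possible to combine this step with negation detection later, since each mention can be checked in its own context.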

---
## Practice 6: Data Quality Assessment

### 🎯 Learning Objectives
- Calculate data quality metrics
- Identify missing values and outliers
- Visualize data quality issues

### 📖 Key Concepts
**Data Quality Dimensions:**
- **Completeness:** Percentage of non-missing values
- **Accuracy:** Correctness of data values
- **Consistency:** Data follows expected patterns
- **Timeliness:** Data is up-to-date

In [None]:
# 6.1 Create sample medical dataset
np.random.seed(42)

# Generate synthetic patient data
n_patients = 100

data = {
    'patient_id': range(1, n_patients + 1),
    'age': np.random.randint(18, 90, n_patients),
    'gender': np.random.choice(['M', 'F', None], n_patients, p=[0.48, 0.48, 0.04]),
    'systolic_bp': np.random.randint(90, 180, n_patients),
    'diastolic_bp': np.random.randint(60, 110, n_patients),
    'heart_rate': np.random.randint(60, 120, n_patients),
    'temperature': np.random.uniform(36.0, 39.0, n_patients),
    'glucose': np.random.randint(70, 200, n_patients),
    'diagnosis': np.random.choice(['HTN', 'DM', 'CHF', 'COPD', None], n_patients, p=[0.25, 0.25, 0.20, 0.20, 0.10])
}

# Introduce some missing values
for col in ['systolic_bp', 'glucose', 'temperature']:
    # Integer arrays cannot hold NaN, so cast to float first
    data[col] = data[col].astype(float)
    missing_idx = np.random.choice(n_patients, size=int(n_patients * 0.05), replace=False)
    data[col][missing_idx] = np.nan

# Create DataFrame
df = pd.DataFrame(data)

print("Sample Medical Dataset:")
print(df.head(10))
print(f"\nDataset shape: {df.shape}")

In [None]:
# 6.2 Data quality assessment
def assess_data_quality(df):
    """Calculate data quality metrics"""
    
    quality_metrics = {}
    
    for col in df.columns:
        total = len(df)
        missing = df[col].isna().sum()
        present = total - missing
        completeness = (present / total) * 100
        
        quality_metrics[col] = {
            'Total': total,
            'Missing': missing,
            'Present': present,
            'Completeness (%)': completeness
        }
    
    return pd.DataFrame(quality_metrics).T

# Calculate quality metrics
quality_report = assess_data_quality(df)

print("\nData Quality Assessment:")
print("=" * 70)
print(quality_report)

# Overall quality score
overall_completeness = quality_report['Completeness (%)'].mean()
print(f"\nOverall Completeness: {overall_completeness:.2f}%")
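
Completeness only says whether a value is present, not whether it is plausible. A common first check for outliers is the interquartile range (IQR) rule, sketched below with the standard library. The function name `iqr_outliers`, the multiplier `k=1.5`, and the sample readings are illustrative assumptions, not clinical thresholds.

```python
import statistics

def iqr_outliers(values, k=1.5):
    """Return the values lying more than k * IQR below Q1 or above Q3."""
    q1, _, q3 = statistics.quantiles(values, n=4)
    iqr = q3 - q1
    lower, upper = q1 - k * iqr, q3 + k * iqr
    return [v for v in values if v < lower or v > upper]

# Made-up glucose readings with one implausible entry
glucose_readings = [92, 101, 88, 110, 97, 105, 400, 95]
print(iqr_outliers(glucose_readings))
# [400]
```

With the DataFrame from this practice, the same function could be applied to a column after dropping missing values, for example `iqr_outliers(df['glucose'].dropna().tolist())`.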

In [None]:
# 6.3 Visualize data quality
fig, axes = plt.subplots(2, 2, figsize=(14, 10))

# 1. Completeness by column
axes[0, 0].barh(quality_report.index, quality_report['Completeness (%)'], color='#1E64C8')
axes[0, 0].set_xlabel('Completeness (%)')
axes[0, 0].set_title('Data Completeness by Column')
axes[0, 0].axvline(x=95, color='red', linestyle='--', label='95% threshold')
axes[0, 0].legend()

# 2. Missing values heatmap
missing_matrix = df.isnull().astype(int)
sns.heatmap(missing_matrix.T, cmap='RdYlGn_r', cbar=False, ax=axes[0, 1])
axes[0, 1].set_title('Missing Values Heatmap')
axes[0, 1].set_xlabel('Patient Index')

# 3. Distribution of vital signs
vital_signs = df[['systolic_bp', 'heart_rate', 'temperature']].dropna()
axes[1, 0].boxplot([vital_signs['systolic_bp'], vital_signs['heart_rate'], vital_signs['temperature']])
axes[1, 0].set_xticklabels(['Systolic BP', 'Heart Rate', 'Temperature'])
axes[1, 0].set_title('Distribution of Vital Signs')
axes[1, 0].set_ylabel('Value')

# 4. Diagnosis distribution
diagnosis_counts = df['diagnosis'].value_counts()
axes[1, 1].pie(diagnosis_counts.values, labels=diagnosis_counts.index, autopct='%1.1f%%', startangle=90)
axes[1, 1].set_title('Diagnosis Distribution')

plt.tight_layout()
plt.show()

print("\n✅ Data quality visualization complete!")

---
## 🎯 Practice Summary

### What We Learned:

1. **PHI Detection & Removal**
   - Implemented rule-based pattern matching for several of the 18 HIPAA identifiers (dates, phone numbers, email addresses, MRNs, ZIP codes)
   - Applied anonymization techniques using regex

2. **Clinical Text Normalization**
   - Standardized dates, units, and formats
   - Handled special characters and whitespace

3. **Medical Abbreviation Expansion**
   - Created abbreviation dictionaries
   - Implemented dictionary-based expansion with word-boundary matching

4. **Negation Detection**
   - Applied NegEx algorithm principles
   - Identified negation triggers and scope

5. **Medical Coding**
   - Mapped terms to ICD-10, SNOMED CT, and LOINC codes
   - Implemented dictionary-based term-to-code lookup

6. **Data Quality Assessment**
   - Calculated per-column completeness metrics
   - Visualized quality issues

### Key Takeaways:
- 🔒 **PHI protection** is critical for HIPAA compliance
- 📝 **Text normalization** improves analysis quality
- 🏥 **Medical coding** enables interoperability
- 📊 **Data quality** directly impacts model performance

### Next Steps:
- Implement ML-based NER for PHI detection
- Use UMLS API for comprehensive medical coding
- Build complete preprocessing pipelines with Apache Airflow
- Apply these techniques to MIMIC-III dataset

---
## 📚 Additional Resources

### Libraries & Tools:
- **Presidio**: Microsoft's PII detection and anonymization library, adaptable to PHI
- **MedCAT**: Medical Concept Annotation Tool
- **spaCy**: Industrial-strength NLP
- **pydicom**: DICOM file handling

### Datasets:
- **MIMIC-III/IV**: Critical care database
- **i2b2**: Clinical NLP challenges

### Documentation:
- UMLS: https://www.nlm.nih.gov/research/umls/
- SNOMED CT: https://www.snomed.org/
- LOINC: https://loinc.org/