# 01 - Census Region Data Cleaning
## Trigeminal Neuralgia Treatment Patterns Analysis

**Data Source:** Epic Cosmos  
**Study Period:** November 28, 2022 - November 27, 2025 (3 years)  
**ICD-10 Code:** G50.0 (Trigeminal Neuralgia)  
**Target Journal:** Journal of Neurosurgery (JNS)

---

### Purpose
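This notebook cleans **census region-level** data exported from Epic Cosmos. The nine "census regions" used here are the groupings Epic Cosmos reports; they correspond to the nine U.S. Census Bureau divisions.
Epic Cosmos masks small counts as "10 or fewer". Aggregating by census region produces larger cells, so fewer values are masked and need imputation.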

### Data Files
1. `Meds and Census Jan 4 2026.xlsx` - Medication counts by census region
2. `Procedures and Census.xlsx` - Procedure counts by census region
3. `TN meds then procedures by census Jan 4 2026.xlsx` - Cross-tabulation

---


## 1. Setup and Imports


In [1]:
# Standard library imports
import sys
from pathlib import Path
import warnings
warnings.filterwarnings('ignore')

# Data manipulation
import pandas as pd
import numpy as np

# Display settings
pd.set_option('display.max_columns', None)
pd.set_option('display.max_rows', 100)
pd.set_option('display.width', None)
pd.set_option('display.max_colwidth', 80)

# Add project root to path for imports
project_root = Path.cwd().parent.parent
sys.path.insert(0, str(project_root))

# Import project configuration
from src.config import (
    RAW_DATA_DIR, PROCESSED_DATA_DIR,
    SMALL_CELL_VALUE, SMALL_CELL_IMPUTATION,
    TN_CONFIG, ensure_directories
)
from src.utils.data_cleaning import impute_small_cells

ensure_directories()

print(f"Project Root: {project_root}")
print(f"Raw Data Dir: {RAW_DATA_DIR}")
print(f"\nSmall cell value: '{SMALL_CELL_VALUE}' → Imputed as: {SMALL_CELL_IMPUTATION}")


Project Root: /Users/dhirajpangal/Library/Mobile Documents/com~apple~CloudDocs/Desktop/RESEARCH/STANFORD/Cosmos/trigeminalneuralgia-cosmos
Raw Data Dir: /Users/dhirajpangal/Library/Mobile Documents/com~apple~CloudDocs/Desktop/RESEARCH/STANFORD/Cosmos/trigeminalneuralgia-cosmos/TN_Data

Small cell value: '10 or fewer' → Imputed as: 5
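
The helper `impute_small_cells` is imported from `src.utils.data_cleaning` and its body is not shown in this notebook. As a rough illustration of the substitution reported above ('10 or fewer' → 5), here is a minimal sketch. The function name `impute_small_cells_sketch` and its signature are hypothetical, and the project's real helper may behave differently.

```python
import pandas as pd

# Values taken from the printed configuration above
SMALL_CELL_VALUE = "10 or fewer"
SMALL_CELL_IMPUTATION = 5


def impute_small_cells_sketch(df, columns, masked=SMALL_CELL_VALUE, fill=SMALL_CELL_IMPUTATION):
    """Hypothetical stand-in: replace the masked marker with a fixed value."""
    out = df.copy()  # leave the caller's frame untouched
    for col in columns:
        out[col] = [fill if str(v).strip() == masked else v for v in out[col]]
    return out


# Toy frame with one masked cell
demo = pd.DataFrame({"region": ["A", "B"], "srs": ["10 or fewer", 61]})
cleaned = impute_small_cells_sketch(demo, columns=["srs"])
print(cleaned["srs"].tolist())  # [5, 61]
```

Imputing a fixed midpoint keeps the cell in later arithmetic, at the cost of up to 5 patients of error per masked cell.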


In [2]:
# Census region mapping from Epic's long descriptions to standard names
CENSUS_REGION_MAP = {
    'Ohio, Michigan, Illinois, Wisconsin, or Indiana': 'East North Central',
    'Minnesota, Iowa, Missouri, Kansas, Nebraska, North Dakota, or South Dakota': 'West North Central',
    'Pennsylvania, New York, or New Jersey': 'Middle Atlantic',
    'Massachusetts, Connecticut, Maine, New Hampshire, Rhode Island, or Vermont': 'New England',
    'Florida, North Carolina, Virginia, South Carolina, Georgia, Maryland, West Virginia, Delaware, or District of Columbia': 'South Atlantic',
    'Kentucky, Mississippi, Tennessee, or Alabama': 'East South Central',
    'Texas, Louisiana, Arkansas, or Oklahoma': 'West South Central',
    'California, Oregon, Washington, Hawaii, or Alaska': 'Pacific',
    'Colorado, Arizona, Utah, Idaho, Nevada, Montana, New Mexico, or Wyoming': 'Mountain'
}

# Standard census region order (for consistent display)
CENSUS_REGION_ORDER = [
    'New England', 'Middle Atlantic', 'East North Central', 'West North Central',
    'South Atlantic', 'East South Central', 'West South Central', 'Mountain', 'Pacific'
]

print(f"Census regions to process: {len(CENSUS_REGION_MAP)}")


Census regions to process: 9
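
The mapping and the display order are maintained as two separate literals, so a typo in either would only surface later as a missing sort key. The sketch below shows one way to compare them. `check_region_names` is a hypothetical helper, and the toy inputs include a deliberately misspelled name.

```python
def check_region_names(region_map, region_order):
    """Compare the names produced by the mapping with the names in the display order."""
    mapped = set(region_map.values())
    ordered = set(region_order)
    return {
        "missing_from_order": sorted(mapped - ordered),
        "missing_from_map": sorted(ordered - mapped),
        "has_duplicates": len(region_order) != len(ordered),
    }


# Toy inputs: two entries from the notebook's mapping, one misspelled order entry
toy_map = {
    "Pennsylvania, New York, or New Jersey": "Middle Atlantic",
    "Texas, Louisiana, Arkansas, or Oklahoma": "West South Central",
}
toy_order = ["Middle Atlantic", "West South Centrl"]

report = check_region_names(toy_map, toy_order)
print(report)
```

In the notebook the same call would be `check_region_names(CENSUS_REGION_MAP, CENSUS_REGION_ORDER)`, and all three entries should come back empty or `False`.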


In [3]:
# Load medications by census region
meds_file = RAW_DATA_DIR / 'Meds and Census Jan 4 2026.xlsx'
df_meds_raw = pd.read_excel(meds_file, header=None)

print("Raw medications file structure (rows 11-20):")
print(df_meds_raw.iloc[11:21].to_string())


Raw medications file structure (rows 11-20):
                                                                                                                         0                               1         2           3            4           5                  6                                                                                     7
11                                                                                                         All Medications  Carbmazapine or Oxcarbmazapine  baclofen  gabapentin  lamotrigine  pregabalin  None of the above  Total: Total includes all data, including data from columns not currently displayed.
12                                                                                                           Census Region                             NaN       NaN         NaN          NaN         NaN                NaN                                                                                   NaN
13                                
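
The next cells hard-code the first data row (13) based on the preview above, where row 12 carries the label `Census Region`. If the export layout shifts, that offset could be located instead of assumed. This is a minimal sketch: `find_data_start` is a hypothetical helper, and the toy frame only imitates the layout shown in the preview.

```python
import pandas as pd


def find_data_start(raw, label="Census Region"):
    """Return the position of the first row after the one whose first cell equals `label`."""
    first_col = raw.iloc[:, 0].astype(str).str.strip()
    positions = (first_col == label).to_numpy().nonzero()[0]
    if len(positions) == 0:
        raise ValueError(f"No row labelled {label!r} in the first column")
    return int(positions[0]) + 1


# Toy frame imitating the header block of the export
raw = pd.DataFrame([
    ["Report title", None],
    ["All Medications", "baclofen"],
    ["Census Region", None],
    ["Pennsylvania, New York, or New Jersey", 6210],
])

start = find_data_start(raw)
print(start)  # 3
print(raw.iloc[start:, 0].tolist())
```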

In [4]:
# Build column names
med_columns = ['census_region_raw', 'carbamazepine_oxcarbazepine', 'baclofen', 
               'gabapentin', 'lamotrigine', 'pregabalin', 'none_of_above', 'total']

# Extract data (row 13 onwards)
df_meds = df_meds_raw.iloc[13:].copy()
df_meds = df_meds.iloc[:, :len(med_columns)]
df_meds.columns = med_columns
df_meds = df_meds.reset_index(drop=True)

# Remove the Total row, the "None of the above" row, and rows outside the nine census regions
exclude_pattern = 'Total|None of the above|Puerto Rico|Virgin Islands|Ontario|Armed Forces'
region_labels = df_meds['census_region_raw']
# .copy() avoids assigning to a slice of the original frame in the steps below
df_meds = df_meds[region_labels.notna() & ~region_labels.astype(str).str.contains(exclude_pattern)].copy()

# Map to standard census region names
df_meds['census_region'] = df_meds['census_region_raw'].map(CENSUS_REGION_MAP)

# Verify mapping worked
unmapped = df_meds[df_meds['census_region'].isna()]['census_region_raw'].tolist()
if unmapped:
    print(f"⚠ Unmapped regions: {unmapped}")
else:
    print("✓ All regions mapped successfully")

# Impute small cells
numeric_cols = [c for c in med_columns if c not in ['census_region_raw']]
df_meds = impute_small_cells(df_meds, columns=numeric_cols)

# Convert to numeric
for col in numeric_cols:
    df_meds[col] = pd.to_numeric(df_meds[col], errors='coerce')

# Reorder columns
df_meds = df_meds[['census_region', 'carbamazepine_oxcarbazepine', 'baclofen', 
                   'gabapentin', 'lamotrigine', 'pregabalin', 'none_of_above', 'total']]

# Sort by region order
df_meds['_sort'] = df_meds['census_region'].map({r: i for i, r in enumerate(CENSUS_REGION_ORDER)})
df_meds = df_meds.sort_values('_sort').drop('_sort', axis=1).reset_index(drop=True)

print(f"\nCleaned medications data: {df_meds.shape}")
df_meds


✓ All regions mapped successfully

Cleaned medications data: (9, 8)


|   | census_region | carbamazepine_oxcarbazepine | baclofen | gabapentin | lamotrigine | pregabalin | none_of_above | total |
|---|---|---|---|---|---|---|---|---|
| 0 | New England | 7391 | 2691 | 9297 | 1225 | 2465 | 5663 | 19802 |
| 1 | Middle Atlantic | 16121 | 6210 | 19565 | 2709 | 5497 | 9607 | 39021 |
| 2 | East North Central | 25508 | 10219 | 29935 | 4083 | 9391 | 13900 | 60044 |
| 3 | West North Central | 9035 | 3080 | 10163 | 1436 | 3164 | 4408 | 20236 |
| 4 | South Atlantic | 30193 | 12154 | 37744 | 4695 | 12627 | 15805 | 71862 |
| 5 | East South Central | 7314 | 2754 | 7943 | 1009 | 2841 | 3090 | 15447 |
| 6 | West South Central | 14969 | 5453 | 17969 | 1944 | 6050 | 6042 | 32130 |
| 7 | Mountain | 7172 | 2950 | 8518 | 1385 | 3044 | 4380 | 17959 |
| 8 | Pacific | 10038 | 4052 | 13196 | 1578 | 4095 | 6946 | 26470 |
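
The cell above sorts regions with a temporary `_sort` column. An ordered categorical does the same job without the helper column, and the order then carries through any later `sort_values` or `groupby`. This is a sketch of that alternative, not the notebook's method. The three rows use region totals from the table above.

```python
import pandas as pd

CENSUS_REGION_ORDER = [
    "New England", "Middle Atlantic", "East North Central", "West North Central",
    "South Atlantic", "East South Central", "West South Central", "Mountain", "Pacific",
]

demo = pd.DataFrame({
    "census_region": ["Pacific", "New England", "Mountain"],
    "total": [26470, 19802, 17959],
})

# Labels missing from `categories` would become NaN, so check the mapping first
demo["census_region"] = pd.Categorical(
    demo["census_region"], categories=CENSUS_REGION_ORDER, ordered=True
)
demo = demo.sort_values("census_region").reset_index(drop=True)
print(demo["census_region"].astype(str).tolist())  # ['New England', 'Mountain', 'Pacific']
```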


In [5]:
# Load procedures by census region
procs_file = RAW_DATA_DIR / 'Procedures and Census.xlsx'
df_procs_raw = pd.read_excel(procs_file, header=None)

# Build column names for procedures
proc_columns = ['census_region_raw', 'mvd', 'srs', 'rhizotomy', 'botox', 'none_of_above', 'total']

# Extract data (row 13 onwards)
df_procs = df_procs_raw.iloc[13:].copy()
df_procs = df_procs.iloc[:, :len(proc_columns)]
df_procs.columns = proc_columns
df_procs = df_procs.reset_index(drop=True)

# Remove the Total row, the "None of the above" row, and rows outside the nine census regions
exclude_pattern = 'Total|None of the above|Puerto Rico|Virgin Islands|Ontario|Armed Forces'
region_labels = df_procs['census_region_raw']
# .copy() avoids assigning to a slice of the original frame in the steps below
df_procs = df_procs[region_labels.notna() & ~region_labels.astype(str).str.contains(exclude_pattern)].copy()

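# Map to standard census region names and fail loudly on any unmapped label
df_procs['census_region'] = df_procs['census_region_raw'].map(CENSUS_REGION_MAP)
assert df_procs['census_region'].notna().all(), "Unmapped census regions in procedures file"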

# Impute small cells
proc_numeric_cols = [c for c in proc_columns if c not in ['census_region_raw']]
df_procs = impute_small_cells(df_procs, columns=proc_numeric_cols)

# Convert to numeric
for col in proc_numeric_cols:
    df_procs[col] = pd.to_numeric(df_procs[col], errors='coerce')

# Reorder columns
df_procs = df_procs[['census_region', 'mvd', 'srs', 'rhizotomy', 'botox', 'none_of_above', 'total']]

# Sort by region order
df_procs['_sort'] = df_procs['census_region'].map({r: i for i, r in enumerate(CENSUS_REGION_ORDER)})
df_procs = df_procs.sort_values('_sort').drop('_sort', axis=1).reset_index(drop=True)

print(f"Cleaned procedures data: {df_procs.shape}")
df_procs


Cleaned procedures data: (9, 7)


|   | census_region | mvd | srs | rhizotomy | botox | none_of_above | total |
|---|---|---|---|---|---|---|---|
| 0 | New England | 174 | 291 | 120 | 136 | 17874 | 18548 |
| 1 | Middle Atlantic | 453 | 206 | 470 | 345 | 36309 | 37678 |
| 2 | East North Central | 604 | 102 | 637 | 704 | 55361 | 57279 |
| 3 | West North Central | 238 | 104 | 172 | 213 | 18730 | 19402 |
| 4 | South Atlantic | 1122 | 571 | 641 | 578 | 66684 | 69452 |
| 5 | East South Central | 154 | 96 | 72 | 73 | 14739 | 15114 |
| 6 | West South Central | 525 | 170 | 442 | 235 | 29739 | 31031 |
| 7 | Mountain | 234 | 151 | 141 | 275 | 16479 | 17247 |
| 8 | Pacific | 568 | 61 | 242 | 226 | 23716 | 24753 |
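
The medication and procedure cells repeat the same steps: slice, filter, map, impute, convert. One way to keep them in sync is a shared function, sketched below. `clean_region_table` and `EXCLUDE_PATTERN` are hypothetical names, the masked marker and fill value are passed in rather than read from `src.config`, and the counts in the toy frame are invented for illustration.

```python
import pandas as pd

# Hypothetical constant combining the three row filters used in the cells above
EXCLUDE_PATTERN = "Total|None of the above|Puerto Rico|Virgin Islands|Ontario|Armed Forces"


def clean_region_table(raw, columns, region_map, start_row, masked="10 or fewer", fill=5):
    """Slice, filter, map, impute, and convert one region-level export."""
    df = raw.iloc[start_row:, :len(columns)].copy()
    df.columns = columns
    labels = df["census_region_raw"]
    keep = labels.notna() & ~labels.astype(str).str.contains(EXCLUDE_PATTERN)
    df = df[keep].copy()
    df.insert(0, "census_region", df["census_region_raw"].map(region_map))
    for col in columns[1:]:
        # Swap the masked marker for the imputed value, then force a numeric dtype
        values = [fill if str(v).strip() == masked else v for v in df[col]]
        df[col] = pd.to_numeric(values)
    return df.drop(columns="census_region_raw").reset_index(drop=True)


# Toy export: invented counts, real region labels
raw = pd.DataFrame([
    ["Census Region", None, None],
    ["Pennsylvania, New York, or New Jersey", 40, 900],
    ["California, Oregon, Washington, Hawaii, or Alaska", "10 or fewer", 700],
    ["Puerto Rico", "10 or fewer", 20],
    ["Total", 55, 1620],
])
region_map = {
    "Pennsylvania, New York, or New Jersey": "Middle Atlantic",
    "California, Oregon, Washington, Hawaii, or Alaska": "Pacific",
}

tidy = clean_region_table(raw, ["census_region_raw", "mvd", "total"], region_map, start_row=1)
print(tidy)
```

Unlike the cells above, which use `errors='coerce'`, `pd.to_numeric` here raises on anything it cannot parse. An unexpected string in the export therefore stops the run instead of silently becoming NaN.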


In [6]:
# Load cross-tabulation file
cross_file = RAW_DATA_DIR / 'TN meds then procedures by census Jan 4 2026.xlsx'
df_cross_raw = pd.read_excel(cross_file, header=None)

# The cross-tab has a complex structure:
# Each medication has 6 sub-columns: MVD, SRS, Rhizotomy, Botox, None, Total
medications = ['carbamazepine_oxcarbazepine', 'baclofen', 'gabapentin', 
               'lamotrigine', 'pregabalin', 'none_of_above']
procedures = ['mvd', 'srs', 'rhizotomy', 'botox', 'none_of_above', 'total']

# Extract data starting from row 14
df_cross = df_cross_raw.iloc[14:].copy()

# Process into long format
cross_data = []

for idx, row in df_cross.iterrows():
    region_raw = row[0]
    
    # Skip if region is Total, None of above, or territories
    if pd.isna(region_raw):
        continue
    if 'Total' in str(region_raw) or 'None of the above' in str(region_raw):
        continue
    if 'Puerto Rico' in str(region_raw) or 'Ontario' in str(region_raw):
        continue
    
    region = CENSUS_REGION_MAP.get(region_raw, None)
    if region is None:
        continue
    
    # Extract data for each medication (6 columns each, starting at col 1)
    for med_idx, med_name in enumerate(medications):
        start_col = 1 + med_idx * 6
        
        entry = {
            'census_region': region,
            'medication': med_name
        }
        
        for proc_idx, proc_name in enumerate(procedures):
            val = row[start_col + proc_idx]
            # Replace the masked small-cell marker using the configured constant
            if str(val).strip() == SMALL_CELL_VALUE:
                val = SMALL_CELL_IMPUTATION
            entry[proc_name] = val
        
        cross_data.append(entry)

df_cross_clean = pd.DataFrame(cross_data)

# Convert numeric columns
for col in procedures:
    df_cross_clean[col] = pd.to_numeric(df_cross_clean[col], errors='coerce')

print(f"Cleaned cross-tab: {df_cross_clean.shape}")
print(f"Regions: {df_cross_clean['census_region'].nunique()}")
print(f"Medications: {df_cross_clean['medication'].nunique()}")
df_cross_clean.head(12)


Cleaned cross-tab: (54, 8)
Regions: 9
Medications: 6


|   | census_region | medication | mvd | srs | rhizotomy | botox | none_of_above | total |
|---|---|---|---|---|---|---|---|---|
| 0 | East North Central | carbamazepine_oxcarbazepine | 525 | 78 | 412 | 360 | 23757 | 25047 |
| 1 | East North Central | baclofen | 286 | 40 | 283 | 337 | 9066 | 9929 |
| 2 | East North Central | gabapentin | 410 | 78 | 437 | 433 | 27664 | 28926 |
| 3 | East North Central | lamotrigine | 101 | 28 | 109 | 121 | 3642 | 3964 |
| 4 | East North Central | pregabalin | 148 | 33 | 194 | 201 | 8557 | 9085 |
| 5 | East North Central | none_of_above | 11 | 5 | 42 | 71 | 12480 | 12603 |
| 6 | West North Central | carbamazepine_oxcarbazepine | 200 | 77 | 134 | 106 | 8413 | 8890 |
| 7 | West North Central | baclofen | 96 | 41 | 80 | 70 | 2744 | 3007 |
| 8 | West North Central | gabapentin | 151 | 72 | 118 | 137 | 9421 | 9858 |
| 9 | West North Central | lamotrigine | 41 | 22 | 35 | 41 | 1268 | 1395 |
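
The cross-tab loop above locates each value with `start_col = 1 + med_idx * 6` plus the procedure offset. The short worked example below makes that arithmetic explicit. `cross_tab_column` is a hypothetical helper, and it assumes the layout described in the cell's comments: column 0 holds the region, followed by six procedure sub-columns per medication.

```python
medications = ["carbamazepine_oxcarbazepine", "baclofen", "gabapentin",
               "lamotrigine", "pregabalin", "none_of_above"]
procedures = ["mvd", "srs", "rhizotomy", "botox", "none_of_above", "total"]


def cross_tab_column(med_name, proc_name):
    """Column position of one medication/procedure pair in the raw cross-tab."""
    block_width = len(procedures)  # six sub-columns per medication
    return 1 + medications.index(med_name) * block_width + procedures.index(proc_name)


print(cross_tab_column("carbamazepine_oxcarbazepine", "mvd"))  # 1
print(cross_tab_column("baclofen", "mvd"))                     # 7
print(cross_tab_column("none_of_above", "total"))              # 36
```

Under that layout the last position used is 36, so the loop reads 37 columns in total. Comparing this with `df_cross_raw.shape[1]` is a cheap check that the layout matches.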


In [7]:
print("=" * 70)
print("DATA VALIDATION - Census Region Level")
print("=" * 70)

# Total patients
meds_total = df_meds['total'].sum()
procs_total = df_procs['total'].sum()

print(f"\nTotal patients:")
print(f"  From medications: {meds_total:,}")
print(f"  From procedures:  {procs_total:,}")

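# Check for NaN
nan_meds = df_meds.isna().sum().sum()
nan_procs = df_procs.isna().sum().sum()
nan_cross = df_cross_clean.isna().sum().sum()
print(f"\nNaN values:")
print(f"  Medications: {nan_meds}")
print(f"  Procedures:  {nan_procs}")
print(f"  Cross-tab:   {nan_cross}")

# Only report success when the checks actually pass
if nan_meds == 0 and nan_procs == 0 and nan_cross == 0:
    print(f"\n✓ Census region data validated")
else:
    print(f"\n⚠ Census region data contains NaN values; review before saving")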

# Summary by region
print("\n" + "=" * 70)
print("PATIENTS BY CENSUS REGION")
print("=" * 70)

region_summary = df_meds[['census_region', 'total']].copy()
region_summary['pct'] = (region_summary['total'] / region_summary['total'].sum() * 100).round(1)
region_summary = region_summary.sort_values('total', ascending=False)

print(region_summary.to_string(index=False))
print(f"\nTotal: {region_summary['total'].sum():,} patients")


DATA VALIDATION - Census Region Level

Total patients:
  From medications: 302,971
  From procedures:  290,504

NaN values:
  Medications: 0
  Procedures:  0
  Cross-tab:   0

✓ Census region data validated

PATIENTS BY CENSUS REGION
     census_region  total  pct
    South Atlantic  71862 23.7
East North Central  60044 19.8
   Middle Atlantic  39021 12.9
West South Central  32130 10.6
           Pacific  26470  8.7
West North Central  20236  6.7
       New England  19802  6.5
          Mountain  17959  5.9
East South Central  15447  5.1

Total: 302,971 patients
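
The validation output reports 302,971 patients from the medications file and 290,504 from the procedures file, and the notebook does not state why the two differ. A per-region comparison, sketched below with the `total` columns copied from the two cleaned tables above, shows where the gap sits. The column names `meds_total`, `procs_total`, `difference` and `pct_of_meds` are introduced here for illustration.

```python
import pandas as pd

totals = pd.DataFrame({
    "census_region": [
        "New England", "Middle Atlantic", "East North Central", "West North Central",
        "South Atlantic", "East South Central", "West South Central", "Mountain", "Pacific",
    ],
    # `total` column of the cleaned medications table
    "meds_total": [19802, 39021, 60044, 20236, 71862, 15447, 32130, 17959, 26470],
    # `total` column of the cleaned procedures table
    "procs_total": [18548, 37678, 57279, 19402, 69452, 15114, 31031, 17247, 24753],
})

totals["difference"] = totals["meds_total"] - totals["procs_total"]
totals["pct_of_meds"] = (totals["difference"] / totals["meds_total"] * 100).round(1)

print(totals.to_string(index=False))
print(f"Overall difference: {totals['difference'].sum():,}")  # 12,467
```

Every region's medications total exceeds its procedures total. The gap therefore looks systematic rather than a single bad row, which is worth confirming against the export definitions before the counts are reported.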


In [8]:
# Save cleaned datasets
print("Saving cleaned census region datasets...")

# Medications by census region
meds_output = PROCESSED_DATA_DIR / 'census_medications_clean.csv'
df_meds.to_csv(meds_output, index=False)
print(f"  ✓ Saved: {meds_output.name}")

# Procedures by census region
procs_output = PROCESSED_DATA_DIR / 'census_procedures_clean.csv'
df_procs.to_csv(procs_output, index=False)
print(f"  ✓ Saved: {procs_output.name}")

# Cross-tabulation
cross_output = PROCESSED_DATA_DIR / 'census_meds_procedures_clean.csv'
df_cross_clean.to_csv(cross_output, index=False)
print(f"  ✓ Saved: {cross_output.name}")

print(f"\nAll files saved to: {PROCESSED_DATA_DIR}")


Saving cleaned census region datasets...
  ✓ Saved: census_medications_clean.csv
  ✓ Saved: census_procedures_clean.csv
  ✓ Saved: census_meds_procedures_clean.csv

All files saved to: /Users/dhirajpangal/Library/Mobile Documents/com~apple~CloudDocs/Desktop/RESEARCH/STANFORD/Cosmos/trigeminalneuralgia-cosmos/analysis/outputs/data


In [9]:
print("=" * 70)
print("DATA CLEANING COMPLETE - Census Region Level")
print("=" * 70)
print(f"""
Study: Trigeminal Neuralgia Treatment Patterns
Data Source: Epic Cosmos
Study Period: {TN_CONFIG.study_start} to {TN_CONFIG.study_end}
ICD-10: {TN_CONFIG.icd10_code}

Datasets Created:
  1. census_medications_clean.csv ({df_meds.shape[0]} regions)
  2. census_procedures_clean.csv ({df_procs.shape[0]} regions)
  3. census_meds_procedures_clean.csv ({df_cross_clean.shape[0]} rows)

Total Patients: ~{meds_total:,}

Next Steps:
  → Run 02_statistical_analysis_census.ipynb
""")


DATA CLEANING COMPLETE - Census Region Level

Study: Trigeminal Neuralgia Treatment Patterns
Data Source: Epic Cosmos
Study Period: 2022-11-28 to 2025-11-27
ICD-10: G50.0

Datasets Created:
  1. census_medications_clean.csv (9 regions)
  2. census_procedures_clean.csv (9 regions)
  3. census_meds_procedures_clean.csv (54 rows)

Total Patients: ~302,971

Next Steps:
  → Run 02_statistical_analysis_census.ipynb

