<a href="https://colab.research.google.com/github/jcl347/MiniJuypters/blob/main/demo_all_components.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# Medical Code Intelligence — Full Pipeline Demo

This notebook demonstrates every component of the Medical Code Intelligence pipeline:

| # | Component | Module | GPU? |
|---|-----------|--------|------|
| 1 | Configuration | `configs.ner_config` | No |
| 2 | Shorthand Expansion | `src.clinical.shorthand` | No |
| 3 | Negation Detection (rule-based) | `src.clinical.negation` | No |
| 4 | ICD-10-CM Code Lookup | `src.clinical.icd_codes` | No |
| 5 | MS-DRG Cost Estimation | `src.clinical.drg_costs` | No |
| 6 | Entity Post-Processing | `src.inference.entity_utils` | No |
| 7 | Evaluation Metrics | `src.evaluation.metrics` | No |
| 8 | Curated ICD Dataset Generation | `src.data.icd_dataset` | No |
| 9 | MedMentions & MACCROBAT (optional sources) | `src.data.icd_dataset` | No* |
| 10 | End-to-End Pipeline | `src.clinical.pipeline` | No** |
| 11 | Adversarial Training (overview) | `src.training.adversarial` | Yes |
| 12 | Assertion Classifier (transformer) | `src.clinical.assertion` | Optional |

\* MedMentions and MACCROBAT download from HuggingFace on first use; cells show the loader API and structure with graceful fallback if unavailable.

\** The pipeline demo uses `process_with_entities()` (pre-extracted entities), which requires no trained NER model.

**All cells run on CPU** without downloading external models (except Section 9, which attempts optional HuggingFace downloads).

In [9]:
!git clone https://github.com/jcl347/Medical_Code_Intelligence

Cloning into 'Medical_Code_Intelligence'...
remote: Enumerating objects: 232, done.[K
remote: Counting objects: 100% (232/232), done.[K
remote: Compressing objects: 100% (164/164), done.[K
remote: Total 232 (delta 117), reused 180 (delta 65), pack-reused 0 (from 0)[K
Receiving objects: 100% (232/232), 202.09 KiB | 1.58 MiB/s, done.
Resolving deltas: 100% (117/117), done.


In [10]:
import os, sys, pathlib

# Find the repo root by looking for known markers (configs/, src/)
# Works whether the notebook runs from notebooks/, the repo root, a cloud
# environment (Colab, Kaggle), or elsewhere.
def _find_repo_root():
    markers = ("src", "configs")

    def _has_markers(p):
        return all((p / m).is_dir() for m in markers)

    cwd = pathlib.Path.cwd()

    # 1. cwd IS the repo root (running from repo root)
    if _has_markers(cwd):
        return str(cwd)

    # 2. cwd is notebooks/ inside the repo
    if _has_markers(cwd.parent):
        return str(cwd.parent)

    # 3. Repo is a subdirectory of cwd (common in Colab: /content/RepoName/)
    for child in sorted(cwd.iterdir()):
        if child.is_dir() and _has_markers(child):
            return str(child)

    # 4. Walk up from cwd (handles deeply nested launch dirs)
    p = cwd
    for _ in range(5):
        p = p.parent
        if _has_markers(p):
            return str(p)

    # 5. Last resort: try relative to this file if available
    try:
        nb_dir = pathlib.Path(__file__).resolve().parent
        if _has_markers(nb_dir.parent):
            return str(nb_dir.parent)
    except NameError:
        pass

    raise RuntimeError(
        f"Cannot find repo root (looked for src/ + configs/ dirs).\n"
        f"  cwd = {cwd}\n"
        f"Hint: clone the repo and run from inside it, or set REPO_ROOT manually:\n"
        f"  REPO_ROOT = '/path/to/Medical_Code_Intelligence'"
    )

REPO_ROOT = _find_repo_root()
if REPO_ROOT not in sys.path:
    sys.path.insert(0, REPO_ROOT)

print(f"Repo root: {REPO_ROOT}")
print(f"  configs/ exists: {os.path.isdir(os.path.join(REPO_ROOT, 'configs'))}")
print(f"  src/ exists:     {os.path.isdir(os.path.join(REPO_ROOT, 'src'))}")

Repo root: /content/Medical_Code_Intelligence
  configs/ exists: True
  src/ exists:     True


---
## 1. Configuration — `NERConfig`, `MODEL_CONFIGS`, `DATASET_CONFIGS`

All training hyperparameters, model definitions, and dataset metadata live in a single dataclass.

In [11]:
from configs.ner_config import NERConfig, MODEL_CONFIGS, DATASET_CONFIGS

# --- Inspect available models ---
print("=== Supported Pre-trained Models ===")
for key, cfg in MODEL_CONFIGS.items():
    print(f"  {key:20s}  {cfg['model_name']}")

print()

# --- Inspect available datasets ---
print("=== Supported Datasets ===")
for key, cfg in DATASET_CONFIGS.items():
    print(f"  {key:20s}  {cfg['description'][:60]}")

=== Supported Pre-trained Models ===
  pubmedbert            microsoft/BiomedNLP-BiomedBERT-base-uncased-abstract-fulltext
  biobert               dmis-lab/biobert-v1.1
  bio_clinicalbert      emilyalsentzer/Bio_ClinicalBERT
  scibert               allenai/scibert_scivocab_uncased
  gatortron-base        UFNLP/gatortron-base

=== Supported Datasets ===
  ncbi_disease          NCBI Disease Corpus - disease name recognition
  bc5cdr                BC5CDR - chemical and disease NER from PubMed articles
  bc2gm                 BC2GM - gene/protein mention recognition
  jnlpba                JNLPBA - biomedical entity recognition (proteins, DNA, RNA, 
  linnaeus              LINNAEUS - species name recognition
  biomedical_ner_all    Combined biomedical NER (10+ entity types across multiple da
  icd_ner               ICD-focused composite NER dataset. Combines seven sources wi
  biomed_ner            Biomedical NER with 24 entity types including DISORDER, MEDI
  icd10_terminology     72,750

In [12]:
# --- Default hyperparameters ---
config = NERConfig()
print("=== Default NERConfig ===")
for k, v in vars(config).items():
    print(f"  {k:35s} = {v}")

=== Default NERConfig ===
  model_key                           = pubmedbert
  model_name_or_path                  = microsoft/BiomedNLP-BiomedBERT-base-uncased-abstract-fulltext
  dataset_key                         = ncbi_disease
  max_seq_length                      = 512
  num_train_epochs                    = 20
  per_device_train_batch_size         = 16
  per_device_eval_batch_size          = 32
  learning_rate                       = 5e-05
  weight_decay                        = 0.01
  warmup_ratio                        = 0.1
  max_grad_norm                       = 1.0
  lr_scheduler_type                   = linear
  fp16                                = True
  gradient_accumulation_steps         = 1
  early_stopping_patience             = 5
  early_stopping_metric               = eval_f1
  use_crf                             = False
  use_adversarial_training            = False
  adv_method                          = fgm
  adv_epsilon                         = None
  pgd_alpha

In [13]:
# --- Override for a specific experiment ---
custom = NERConfig(
    model_key="bio_clinicalbert",
    dataset_key="icd_ner",
    learning_rate=3e-5,
    num_train_epochs=15,
    use_adversarial_training=True,
    adv_method="fgm",
    resolve_drg=True,
)
print(f"Model:       {custom.model_key}")
print(f"Dataset:     {custom.dataset_key}")
print(f"LR:          {custom.learning_rate}")
print(f"Adversarial: {custom.adv_method} (epsilon={custom.adv_epsilon})")
print(f"DRG enabled: {custom.resolve_drg}")

Model:       bio_clinicalbert
Dataset:     icd_ner
LR:          3e-05
Adversarial: fgm (epsilon=None)
DRG enabled: True


---
## 2. Shorthand Expansion — `ShorthandExpander`

Expands physician abbreviations ("cp" → "chest pain") with character offset tracking for NER alignment.

Uses the built-in fallback (~280 abbreviations) so no network download is required.

In [14]:
from src.clinical.shorthand import ShorthandExpander

expander = ShorthandExpander(source="builtin")
print(f"Loaded {expander.num_abbreviations} abbreviations")
print(f"Ambiguous: {expander.num_ambiguous}")

Loaded 246 abbreviations
Ambiguous: 0


In [15]:
# --- Simple expansion ---
samples = [
    "pt c/o sob and cp",
    "hx of dm2, htn, and cad",
    "dx: afib r/o mi",
    "nkda, aox3, wnl",
]

print("=== Shorthand Expansion ===")
for text in samples:
    expanded = expander.expand(text)
    print(f"  {text:35s} → {expanded}")

=== Shorthand Expansion ===
  pt c/o sob and cp                   → prothrombin time complaining of shortness of breath and chest pain
  hx of dm2, htn, and cad             → history of type 2 diabetes mellitus, hypertension, and coronary artery disease
  dx: afib r/o mi                     → diagnosis: atrial fibrillation rule out myocardial infarction
  nkda, aox3, wnl                     → no known drug allergies, aox3, wnl


In [16]:
# --- Expansion with offset tracking ---
text = "pt denies cp or sob"
expanded, offsets = expander.expand_with_offsets(text)
print(f"Original:  {text!r}")
print(f"Expanded:  {expanded!r}")
print(f"\nOffset map ({len(offsets)} expansions):")
for om in offsets:
    print(f"  '{om['abbreviation']}' @ [{om['original_start']}:{om['original_end']}] "
          f"→ '{om['expansion']}' @ [{om['expanded_start']}:{om['expanded_end']}]")

Original:  'pt denies cp or sob'
Expanded:  'prothrombin time denies chest pain or shortness of breath'

Offset map (3 expansions):
  'pt' @ [0:2] → 'prothrombin time' @ [0:16]
  'cp' @ [10:12] → 'chest pain' @ [24:34]
  'sob' @ [16:19] → 'shortness of breath' @ [38:57]


In [17]:
# --- Identify abbreviations without expanding ---
abbrevs = expander.identify_abbreviations("pt c/o sob and cp on exertion")
print("=== Identified Abbreviations ===")
for a in abbrevs:
    print(f"  '{a['abbreviation']}' @ [{a['start']}:{a['end']}] → '{a['expansion']}'")

=== Identified Abbreviations ===
  'pt' @ [0:2] → 'prothrombin time'
  'c/o' @ [3:6] → 'complaining of'
  'sob' @ [7:10] → 'shortness of breath'
  'cp' @ [15:17] → 'chest pain'


---
## 3. Negation Detection — `NegationDetector`

Rule-based ConText/NegEx algorithm with 100+ trigger patterns. Detects six assertion statuses:
**AFFIRMED**, **NEGATED**, **POSSIBLE**, **HYPOTHETICAL**, **HISTORICAL**, **FAMILY**.

In [18]:
from src.clinical.negation import NegationDetector, NegationStatus

detector = NegationDetector(scope_window=6)

# --- Show all assertion statuses ---
print("Assertion statuses:", [s.value for s in NegationStatus])

Assertion statuses: ['affirmed', 'negated', 'possible', 'hypothetical', 'historical', 'family']


In [19]:
# --- Detect negation scopes in raw text ---
text = "Patient denies chest pain but has persistent cough. No fever. History of diabetes."
scopes = detector.detect(text)
print(f"Text: {text!r}\n")
print(f"Detected {len(scopes)} negation/context scopes:")
for s in scopes:
    print(f"  [{s.status.value:12s}] trigger='{s.trigger_text}' "
          f"scope=[{s.scope_start}:{s.scope_end}] → '{text[s.scope_start:s.scope_end]}' "
          f"({s.direction})")

Text: 'Patient denies chest pain but has persistent cough. No fever. History of diabetes.'

Detected 3 negation/context scopes:
  [negated     ] trigger='No' scope=[52:60] → 'No fever' (forward)
  [negated     ] trigger='denies' scope=[8:26] → 'denies chest pain ' (forward)
  [historical  ] trigger='History of' scope=[62:81] → 'History of diabetes' (forward)


In [20]:
# --- Annotate pre-extracted entities ---
text = "Patient denies fever but reports persistent cough. No evidence of pneumonia. Family history of diabetes."
entities = [
    {"text": "fever",     "label": "DIAGNOSIS", "start": 15, "end": 20},
    {"text": "cough",     "label": "DIAGNOSIS", "start": 43, "end": 48},
    {"text": "pneumonia", "label": "DIAGNOSIS", "start": 67, "end": 76},
    {"text": "diabetes",  "label": "DIAGNOSIS", "start": 96, "end": 104},
]

annotated = detector.annotate_entities(text, entities)
print(f"Text: {text!r}\n")
print("Entity Annotations:")
for ent in annotated:
    trigger = ent.get('negation_trigger', '-')
    print(f"  {ent['text']:15s} → {ent['negation']:12s} (trigger: {trigger})")

Text: 'Patient denies fever but reports persistent cough. No evidence of pneumonia. Family history of diabetes.'

Entity Annotations:
  fever           → negated      (trigger: denies)
  cough           → affirmed     (trigger: -)
  pneumonia       → affirmed     (trigger: -)
  diabetes        → affirmed     (trigger: -)


In [21]:
# --- Quick negation check ---
text = "No evidence of pulmonary embolism."
print(f"'{text}' — is 'pulmonary embolism' negated? "
      f"{detector.is_negated(text, 15, 33)}")

text2 = "Diagnosed with pulmonary embolism."
print(f"'{text2}' — is 'pulmonary embolism' negated? "
      f"{detector.is_negated(text2, 16, 34)}")

'No evidence of pulmonary embolism.' — is 'pulmonary embolism' negated? True
'Diagnosed with pulmonary embolism.' — is 'pulmonary embolism' negated? False


In [22]:
# --- Test all six assertion statuses ---
test_cases = [
    ("Patient has pneumonia.", "pneumonia", 12, 21, "affirmed"),
    ("Patient denies chest pain.", "chest pain", 15, 25, "negated"),
    ("Possible diagnosis of lupus.", "lupus", 23, 28, "possible"),
    ("If symptoms worsen, consider asthma.", "asthma", 30, 36, "hypothetical"),
    ("History of myocardial infarction.", "myocardial infarction", 11, 32, "historical"),
    ("Family history of breast cancer.", "breast cancer", 18, 31, "family"),
]

print("=== All Six Assertion Statuses ===")
for text, entity, start, end, expected in test_cases:
    ents = [{"text": entity, "label": "DIAGNOSIS", "start": start, "end": end}]
    result = detector.annotate_entities(text, ents)
    status = result[0]["negation"]
    match = "✓" if status == expected else "✗"
    print(f"  {match} {status:12s} (expected {expected:12s}) — {text}")

=== All Six Assertion Statuses ===
  ✓ affirmed     (expected affirmed    ) — Patient has pneumonia.
  ✓ negated      (expected negated     ) — Patient denies chest pain.
  ✗ affirmed     (expected possible    ) — Possible diagnosis of lupus.
  ✗ affirmed     (expected hypothetical) — If symptoms worsen, consider asthma.
  ✓ historical   (expected historical  ) — History of myocardial infarction.
  ✓ family       (expected family      ) — Family history of breast cancer.


---
## 4. ICD-10-CM Code Lookup — `ICDCodeLookup`

TF-IDF character n-gram matching against 51K ICD-10-CM codes (falls back to 45 built-in codes offline).

In [23]:
from src.clinical.icd_codes import ICDCodeLookup

lookup = ICDCodeLookup()
print(f"Loaded {len(lookup._codes)} ICD-10-CM codes")

`trust_remote_code` is not supported anymore.
Please check that the Hugging Face dataset 'atta00/icd10-codes' isn't based on a loading script and remove `trust_remote_code`.
If the dataset is based on a loading script, please ask the dataset author to remove it and convert it to a standard format like Parquet.
ERROR:datasets.load:`trust_remote_code` is not supported anymore.
Please check that the Hugging Face dataset 'atta00/icd10-codes' isn't based on a loading script and remove `trust_remote_code`.
If the dataset is based on a loading script, please ask the dataset author to remove it and convert it to a standard format like Parquet.


README.md:   0%|          | 0.00/613 [00:00<?, ?B/s]

data/train-00000-of-00001.parquet:   0%|          | 0.00/577k [00:00<?, ?B/s]

data/test-00000-of-00001.parquet:   0%|          | 0.00/622k [00:00<?, ?B/s]

Generating train split:   0%|          | 0/25719 [00:00<?, ? examples/s]

Generating test split:   0%|          | 0/25719 [00:00<?, ? examples/s]

Loaded 25719 ICD-10-CM codes


In [24]:
# --- Match entity text to ICD codes ---
queries = [
    "chest pain",
    "type 2 diabetes mellitus",
    "hypertension",
    "congestive heart failure",
    "pneumonia",
    "atrial fibrillation",
    "chronic kidney disease",
]

print("=== ICD-10-CM Entity Linking ===")
for query in queries:
    matches = lookup.match_entity(query, top_k=3)
    top = matches[0] if matches else None
    if top:
        print(f"  {query:30s} → {top.code}: {top.description} (score={top.score:.3f})")
    else:
        print(f"  {query:30s} → no match")

=== ICD-10-CM Entity Linking ===
  chest pain                     → R07.8: Other chest pain (score=0.949)
  type 2 diabetes mellitus       → E11.621: Type 2 diabetes mellitus with foot ulcer (score=0.848)
  hypertension                   → H40.05: Ocular hypertension (score=0.821)
  congestive heart failure       → I50.89: Other heart failure (score=0.746)
  pneumonia                      → J12.89: Other viral pneumonia (score=0.809)
  atrial fibrillation            → I48.91: Unspecified atrial fibrillation (score=0.954)
  chronic kidney disease         → N18.9: Chronic kidney disease, unspecified (score=0.874)


In [25]:
# --- Direct code lookup ---
codes_to_look_up = ["E11.9", "I10", "J18.9", "R07.9", "I50.9"]

print("=== Direct Code Lookup ===")
for code_str in codes_to_look_up:
    code_obj = lookup.lookup_code(code_str)
    if code_obj:
        print(f"  {code_obj.code}: {code_obj.description}")
    else:
        print(f"  {code_str}: not found")

=== Direct Code Lookup ===
  E11.9: Type 2 diabetes mellitus without complications
  I10: not found
  J18.9: Pneumonia, unspecified organism
  R07.9: Chest pain, unspecified
  I50.9: Heart failure, unspecified


In [26]:
# --- Batch entity matching ---
batch_entities = [
    {"text": "hypertension", "label": "DIAGNOSIS"},
    {"text": "pneumonia", "label": "DIAGNOSIS"},
    {"text": "chest pain", "label": "DIAGNOSIS"},
    {"text": "diabetes", "label": "DIAGNOSIS"},
]

results = lookup.match_entities_batch(batch_entities, top_k=3)
print("=== Batch Entity → ICD Mapping ===")
for r in results:
    codes = [c["code"] for c in r.get("icd_codes", [])]
    print(f"  {r['text']:20s} → {codes}")

=== Batch Entity → ICD Mapping ===
  hypertension         → ['H40.05', 'P29.2', 'K76.6']
  pneumonia            → ['J12.89', 'J12.8', 'J15.7']
  chest pain           → ['R07.8', 'R07.89', 'R07.9']
  diabetes             → ['R73.03', 'E13.61', 'E13.620']


---
## 5. MS-DRG Cost Estimation — `DRGCostEstimator`

Maps ICD-10-CM codes to MS-DRGs and estimates financial impact. Uses built-in fallback weights for 32 common DRGs.

> **Note:** Full ICD→DRG grouper logic requires `drgpy` (`pip install drgpy`). Without it, only direct DRG code lookups work.

In [27]:
from src.clinical.drg_costs import DRGCostEstimator, DRGResult, CostImpactAnalysis

estimator = DRGCostEstimator()
print(f"Base rate: ${estimator.base_rate:,.2f} (FY 2026)")
print(f"DRG weight table: {len(estimator._weights)} DRGs loaded")
print(f"Grouper available: {estimator._grouper is not None}")



Base rate: $6,752.61 (FY 2026)
DRG weight table: 32 DRGs loaded
Grouper available: False


In [28]:
# --- Direct cost estimate by DRG code ---
drg_codes = ["291", "292", "293", "065", "066", "067", "189", "190", "191"]

print("=== DRG Cost Estimates ===")
for code in drg_codes:
    result = estimator._build_result(code)
    if result:
        print(f"  DRG {result.drg_code}: {result.drg_title:50s} "
              f"wt={result.relative_weight:.4f}  ${result.estimated_payment:>10,.2f}  [{result.severity_level}]")

=== DRG Cost Estimates ===
  DRG 291: Heart Failure & Shock W MCC                        wt=1.3968  $  9,432.05  [mcc]
  DRG 292: Heart Failure & Shock W CC                         wt=0.9259  $  6,252.24  [cc]
  DRG 293: Heart Failure & Shock W/O CC/MCC                   wt=0.6521  $  4,403.38  [base]
  DRG 065: Intracranial Hemorrhage Or Cerebral Infarction W CC wt=1.0619  $  7,170.60  [cc]
  DRG 066: Intracranial Hemorrhage Or Cerebral Infarction W/O CC/MCC wt=0.7175  $  4,845.00  [base]


In [29]:
# --- DRG grouping from ICD codes (requires drgpy) ---
icd_sets = [
    ["J18.9"],                         # Pneumonia alone
    ["J18.9", "E11.9"],               # Pneumonia + diabetes
    ["J18.9", "E11.9", "N17.9"],     # Pneumonia + diabetes + AKI
    ["I50.9"],                         # Heart failure
]

print("=== ICD → DRG Grouping ===")
for codes in icd_sets:
    result = estimator.get_drg(codes)
    if result:
        print(f"  {str(codes):45s} → DRG {result.drg_code}: {result.drg_title} "
              f"(${result.estimated_payment:,.2f})")
    else:
        print(f"  {str(codes):45s} → (grouper unavailable — install drgpy)")

=== ICD → DRG Grouping ===
  ['J18.9']                                     → (grouper unavailable — install drgpy)
  ['J18.9', 'E11.9']                            → (grouper unavailable — install drgpy)
  ['J18.9', 'E11.9', 'N17.9']                   → (grouper unavailable — install drgpy)
  ['I50.9']                                     → (grouper unavailable — install drgpy)


In [30]:
# --- Cost impact analysis (CC/MCC comparison) ---
# Even without drgpy, we can demonstrate the analysis structure
# by using a known DRG family from the fallback weights

# Heart Failure family: DRG 291 (MCC) / 292 (CC) / 293 (base)
print("=== Heart Failure DRG Family (291/292/293) ===")
for code in ["291", "292", "293"]:
    r = estimator._build_result(code)
    if r:
        print(f"  DRG {r.drg_code} [{r.severity_level:4s}]: wt={r.relative_weight:.4f}  "
              f"${r.estimated_payment:>10,.2f}  {r.drg_title}")

# Calculate revenue at risk
base = estimator._build_result("293")
mcc = estimator._build_result("291")
if base and mcc:
    gap = mcc.estimated_payment - base.estimated_payment
    print(f"\n  Revenue at risk (base→MCC): ${gap:,.2f}")

=== Heart Failure DRG Family (291/292/293) ===
  DRG 291 [mcc ]: wt=1.3968  $  9,432.05  Heart Failure & Shock W MCC
  DRG 292 [cc  ]: wt=0.9259  $  6,252.24  Heart Failure & Shock W CC
  DRG 293 [base]: wt=0.6521  $  4,403.38  Heart Failure & Shock W/O CC/MCC

  Revenue at risk (base→MCC): $5,028.67


In [31]:
# --- Full cost impact analysis via API ---
analysis = estimator.analyze_cost_impact(["J18.9", "E11.9"])
if analysis:
    print("=== Cost Impact Analysis ===")
    d = analysis.to_dict()
    print(f"  Current DRG: {d['current']['drg_code']} — {d['current']['drg_title']}")
    print(f"  Estimated payment: ${d['current']['estimated_payment']:,.2f}")
    print(f"  Revenue at risk: ${d['revenue_at_risk']:,.2f}")
    print(f"  Undercoding risk: {d['undercoding_risk']}")
    if 'mcc_variant' in d:
        print(f"  MCC variant: DRG {d['mcc_variant']['drg_code']} — "
              f"${d['mcc_variant']['estimated_payment']:,.2f}")
else:
    print("Cost impact analysis requires drgpy. Install with: pip install drgpy")

Cost impact analysis requires drgpy. Install with: pip install drgpy


---
## 6. Entity Post-Processing — `post_process_entities()`

Filters garbage entities (stopwords, punctuation) and merges adjacent fragments from subword tokenization.

In [32]:
from src.inference.entity_utils import NEREntity, post_process_entities

text = "Patient has congestive heart failure and type 2 diabetes mellitus."

# Simulate raw NER output with garbage and fragments
raw_entities = [
    NEREntity(text="congestive",    label="DIAGNOSIS", start_char=12, end_char=22, score=0.95),
    NEREntity(text="heart failure", label="DIAGNOSIS", start_char=23, end_char=36, score=0.93),
    NEREntity(text="and",           label="DIAGNOSIS", start_char=37, end_char=40, score=0.30),
    NEREntity(text="type",          label="DIAGNOSIS", start_char=41, end_char=45, score=0.25),
    NEREntity(text="2 diabetes mellitus", label="DIAGNOSIS", start_char=46, end_char=65, score=0.91),
]

print(f"Before post-processing ({len(raw_entities)} entities):")
for e in raw_entities:
    print(f"  '{e.text}' [{e.label}] score={e.score:.2f}")

cleaned = post_process_entities(raw_entities, text)
print(f"\nAfter post-processing ({len(cleaned)} entities):")
for e in cleaned:
    print(f"  '{e.text}' [{e.label}] score={e.score:.2f}")

Before post-processing (5 entities):
  'congestive' [DIAGNOSIS] score=0.95
  'heart failure' [DIAGNOSIS] score=0.93
  'and' [DIAGNOSIS] score=0.30
  'type' [DIAGNOSIS] score=0.25
  '2 diabetes mellitus' [DIAGNOSIS] score=0.91

After post-processing (2 entities):
  'congestive heart failure' [DIAGNOSIS] score=0.95
  '2 diabetes mellitus' [DIAGNOSIS] score=0.91


---
## 7. Evaluation Metrics — `compute_ner_metrics()`

Entity-level precision, recall, and F1 using seqeval (or built-in fallback).

In [33]:
from src.evaluation.metrics import compute_ner_metrics, _extract_entities_from_bio
import numpy as np

# --- Simulated model predictions ---
# Label mapping: 0=O, 1=B-DIAGNOSIS, 2=I-DIAGNOSIS
label_list = ["O", "B-DIAGNOSIS", "I-DIAGNOSIS"]

# Gold:  "The patient has [congestive heart failure] and [diabetes]."
# Pred:  "The patient has [congestive heart] failure and [diabetes]."
#  (boundary error on first entity, correct on second)

gold_labels = [0, 0, 0, 1, 2, 2, 0, 1, 0]  # O O O B I I O B O
pred_labels = [0, 0, 0, 1, 2, 0, 0, 1, 0]  # O O O B I O O B O (missed I on "failure")

# compute_ner_metrics expects a namedtuple-like object with predictions and label_ids
class FakePreds:
    def __init__(self, preds, labels):
        self.predictions = np.array([preds])
        self.label_ids = np.array([labels])

metrics = compute_ner_metrics(
    FakePreds(pred_labels, gold_labels),
    label_list=label_list,
)

print("=== Entity-Level Metrics ===")
for k, v in metrics.items():
    print(f"  {k}: {v:.4f}")

print("\n(Note: boundary error on 'congestive heart failure' causes lower recall)")



TypeError: compute_ner_metrics() missing 1 required positional argument: 'labels'

In [None]:
# --- BIO entity extraction utility ---
labels = ["O", "B-DIAGNOSIS", "I-DIAGNOSIS", "I-DIAGNOSIS", "O", "B-DIAGNOSIS", "O"]
entities = _extract_entities_from_bio(labels)
print("Extracted entities from BIO sequence:")
for etype, start, end in sorted(entities):
    print(f"  {etype} @ tokens [{start}:{end}]")

---
## 8. Curated ICD Dataset — Template-Generated Examples

The `icd_ner` dataset includes ~100 template-generated sentences targeting common NER failure patterns.

In [None]:
from src.data.icd_dataset import _generate_template_examples, ICD_NER_LABELS

print(f"Label scheme: {ICD_NER_LABELS}\n")

examples = _generate_template_examples()
print(f"Generated {len(examples)} template examples\n")

# Show a few examples
print("=== Sample Template Examples ===")
for ex in examples[:8]:
    tokens = ex["tokens"]
    labels = ex["ner_labels"]
    # Reconstruct text with labels
    labeled = []
    for tok, lab in zip(tokens, labels):
        if lab.startswith("B-"):
            labeled.append(f"[{tok}")
        elif lab.startswith("I-"):
            labeled.append(tok)
        else:
            if labeled and labeled[-1] and not labeled[-1].endswith("]"):
                # Close the previous entity bracket
                labeled[-1] = labeled[-1] + "]"
            labeled.append(tok)
    # Close any trailing entity
    text = " ".join(labeled)
    if text.count("[") > text.count("]"):
        text += "]"
    print(f"  {text}")

---
## 9. MedMentions & MACCROBAT — Optional Dataset Sources

The `icd_ner` composite dataset includes two optional HuggingFace sources that download on first use:

1. **MedMentions** (`bigbio/medmentions`) — up to 5K examples from 4,392 PubMed abstracts with 350K+ UMLS entity mentions, filtered for disease/disorder semantic types (T047, T048, T019, T046, T191)
2. **MACCROBAT** (`singh-aditya/MACCROBAT_biomedical_ner`) — up to 3K examples from 200 clinical case reports with DISEASE_DISORDER entities, providing clinical-note-style text that PubMed abstracts lack

Both are loaded by `load_icd_ner_dataset()` as Sources 6 and 7. If the download fails, they are skipped gracefully.

In [None]:
# --- Source 6: MedMentions ---
# Loads disease/disorder entities from 4,392 PubMed abstracts (bigbio/medmentions).
# Filters for UMLS semantic types: T047 (Disease), T048 (Mental Disorder),
# T019 (Congenital Abnormality), T046 (Pathologic Function), T191 (Neoplastic Process).

from src.data.icd_dataset import _load_medmentions_diseases, _MEDMENTIONS_DISEASE_TYPES

print("=== MedMentions Disease Loader ===")
print(f"Target UMLS semantic types: {sorted(_MEDMENTIONS_DISEASE_TYPES)}")
print()

try:
    mm_dataset = _load_medmentions_diseases(max_examples=50)  # small sample for demo
    for split, ds in mm_dataset.items():
        n_entities = sum(1 for ex in ds for lab in ex["ner_labels"] if lab.startswith("B-"))
        print(f"  {split:12s}: {len(ds):4d} examples, {n_entities} DIAGNOSIS entities")

    # Show a few examples
    print("\nSample MedMentions examples:")
    for ex in list(mm_dataset["train"])[:3]:
        tokens = ex["tokens"]
        labels = ex["ner_labels"]
        # Show only the diagnosis spans
        spans = []
        current = []
        for tok, lab in zip(tokens, labels):
            if lab.startswith("B-"):
                if current:
                    spans.append(" ".join(current))
                current = [tok]
            elif lab.startswith("I-") and current:
                current.append(tok)
            else:
                if current:
                    spans.append(" ".join(current))
                    current = []
        if current:
            spans.append(" ".join(current))
        text_preview = " ".join(tokens[:15])
        if len(tokens) > 15:
            text_preview += " ..."
        print(f"  Text: {text_preview}")
        print(f"  Entities: {spans}")
        print()
except Exception as e:
    print(f"  MedMentions not available (expected in offline mode): {type(e).__name__}: {e}")
    print("  This source is optional — load_icd_ner_dataset() skips it gracefully.")

In [None]:
# --- Source 7: MACCROBAT ---
# Loads DISEASE_DISORDER entities from 200 clinical case reports
# (singh-aditya/MACCROBAT_biomedical_ner). Provides clinical-note-style text
# that PubMed abstracts lack, closing the domain gap.

from src.data.icd_dataset import _load_maccrobat_diseases, _MACCROBAT_DISEASE_LABELS

print("=== MACCROBAT Disease Loader ===")
print(f"Target entity labels: {sorted(_MACCROBAT_DISEASE_LABELS)}")
print()

try:
    mac_dataset = _load_maccrobat_diseases(max_examples=50)  # small sample for demo
    for split, ds in mac_dataset.items():
        n_entities = sum(1 for ex in ds for lab in ex["ner_labels"] if lab.startswith("B-"))
        print(f"  {split:12s}: {len(ds):4d} examples, {n_entities} DIAGNOSIS entities")

    # Show a few examples
    print("\nSample MACCROBAT examples:")
    for ex in list(mac_dataset["train"])[:3]:
        tokens = ex["tokens"]
        labels = ex["ner_labels"]
        # Show only the diagnosis spans
        spans = []
        current = []
        for tok, lab in zip(tokens, labels):
            if lab.startswith("B-"):
                if current:
                    spans.append(" ".join(current))
                current = [tok]
            elif lab.startswith("I-") and current:
                current.append(tok)
            else:
                if current:
                    spans.append(" ".join(current))
                    current = []
        if current:
            spans.append(" ".join(current))
        text_preview = " ".join(tokens[:15])
        if len(tokens) > 15:
            text_preview += " ..."
        print(f"  Text: {text_preview}")
        print(f"  Entities: {spans}")
        print()
except Exception as e:
    print(f"  MACCROBAT not available (expected in offline mode): {type(e).__name__}: {e}")
    print("  This source is optional — load_icd_ner_dataset() skips it gracefully.")

---
## 10. End-to-End Pipeline — `MedicalCodingPipeline`

Chains shorthand expansion → negation detection → ICD resolution → DRG cost estimation.

Using `process_with_entities()` to supply pre-extracted entities (no NER model required).

In [34]:
from src.clinical.pipeline import MedicalCodingPipeline, MedicalEntity

# Initialize pipeline without a trained NER model
pipeline = MedicalCodingPipeline(
    model_path=None,         # No NER model — we'll supply entities manually
    expand_shorthand=True,
    detect_negation=True,
    negation_strategy="rules",
    resolve_icd_codes=True,
    icd_top_k=3,
    resolve_drg=False,       # Set True if drgpy is installed
)

print("Pipeline initialized:")
print(f"  Shorthand expander: {pipeline.shorthand_expander is not None}")
print(f"  Negation detector:  {pipeline.negation_detector is not None}")
print(f"  ICD lookup:         {pipeline.icd_lookup is not None}")
print(f"  DRG estimator:      {pipeline.drg_estimator is not None}")

`trust_remote_code` is not supported anymore.
Please check that the Hugging Face dataset 'atta00/icd10-codes' isn't based on a loading script and remove `trust_remote_code`.
If the dataset is based on a loading script, please ask the dataset author to remove it and convert it to a standard format like Parquet.
ERROR:datasets.load:`trust_remote_code` is not supported anymore.
Please check that the Hugging Face dataset 'atta00/icd10-codes' isn't based on a loading script and remove `trust_remote_code`.
If the dataset is based on a loading script, please ask the dataset author to remove it and convert it to a standard format like Parquet.


Pipeline initialized:
  Shorthand expander: True
  Negation detector:  True
  ICD lookup:         True
  DRG estimator:      False


In [35]:
# --- Process pre-extracted entities ---
clinical_text = "Patient denies chest pain. Diagnosed with congestive heart failure and hypertension."

pre_extracted = [
    {"text": "chest pain",              "label": "DIAGNOSIS", "start": 15, "end": 25, "score": 0.95},
    {"text": "congestive heart failure", "label": "DIAGNOSIS", "start": 43, "end": 66, "score": 0.97},
    {"text": "hypertension",            "label": "DIAGNOSIS", "start": 71, "end": 83, "score": 0.96},
]

results = pipeline.process_with_entities(clinical_text, pre_extracted)

print(f"Input:  {clinical_text}\n")
print("=== Pipeline Results ===")
for ent in results:
    print(f"  Entity: {ent.text}")
    print(f"    Label:    {ent.label}")
    print(f"    Negation: {ent.negation} (trigger: {ent.negation_trigger or 'none'})")
    print(f"    Score:    {ent.score:.3f}")
    if ent.icd_codes:
        print(f"    ICD codes:")
        for icd in ent.icd_codes[:2]:
            print(f"      {icd['code']}: {icd['description']} (score={icd['score']:.3f})")
    print()

Input:  Patient denies chest pain. Diagnosed with congestive heart failure and hypertension.

=== Pipeline Results ===
  Entity: chest pain
    Label:    DIAGNOSIS
    Negation: negated (trigger: denies)
    Score:    0.950

  Entity: congestive heart failure
    Label:    DIAGNOSIS
    Negation: affirmed (trigger: none)
    Score:    0.970

  Entity: hypertension
    Label:    DIAGNOSIS
    Negation: affirmed (trigger: none)
    Score:    0.960



In [36]:
# --- Human-readable formatted output ---
formatted = pipeline.format_output(clinical_text, results)
print(formatted)

Patient denies chest pain. Diagnosed with congestive heart failure and hypertension.

  [chest pain](DIAGNOSIS, NEGATED, trigger="denies", score=0.950)
  [congestive heart failure](DIAGNOSIS, AFFIRMED, score=0.970)
  [hypertension](DIAGNOSIS, AFFIRMED, score=0.960)


In [37]:
# --- MedicalEntity properties and serialization ---
for ent in results:
    print(f"  {ent.text:30s} is_affirmed={ent.is_affirmed}  is_negated={ent.is_negated}")

print("\n=== JSON serialization ===")
import json
print(json.dumps(results[0].to_dict(), indent=2))

  chest pain                     is_affirmed=False  is_negated=True
  congestive heart failure       is_affirmed=True  is_negated=False
  hypertension                   is_affirmed=True  is_negated=False

=== JSON serialization ===
{
  "text": "chest pain",
  "label": "DIAGNOSIS",
  "start": 15,
  "end": 25,
  "score": 0.95,
  "negation": "negated",
  "negation_trigger": "denies"
}


In [38]:
# --- Multiple clinical scenarios ---
scenarios = [
    (
        "No evidence of pneumonia on chest X-ray. Patient has COPD exacerbation.",
        [
            {"text": "pneumonia",         "label": "DIAGNOSIS", "start": 15, "end": 24, "score": 0.92},
            {"text": "COPD exacerbation", "label": "DIAGNOSIS", "start": 43, "end": 60, "score": 0.94},
        ]
    ),
    (
        "History of stroke. Currently presents with acute kidney injury.",
        [
            {"text": "stroke",             "label": "DIAGNOSIS", "start": 11, "end": 17, "score": 0.90},
            {"text": "acute kidney injury", "label": "DIAGNOSIS", "start": 43, "end": 61, "score": 0.96},
        ]
    ),
    (
        "Mother had breast cancer. Patient denies any malignancy.",
        [
            {"text": "breast cancer", "label": "DIAGNOSIS", "start": 11, "end": 24, "score": 0.93},
            {"text": "malignancy",   "label": "DIAGNOSIS", "start": 45, "end": 55, "score": 0.88},
        ]
    ),
]

print("=== Multiple Clinical Scenarios ===")
for text, ents in scenarios:
    results = pipeline.process_with_entities(text, ents)
    print(f"\n{text}")
    for r in results:
        icd = r.icd_codes[0]['code'] if r.icd_codes else 'N/A'
        print(f"  [{r.text}] {r.negation.upper():12s} ICD={icd}")

=== Multiple Clinical Scenarios ===

No evidence of pneumonia on chest X-ray. Patient has COPD exacerbation.
  [pneumonia] NEGATED      ICD=N/A
  [COPD exacerbation] AFFIRMED     ICD=N/A

History of stroke. Currently presents with acute kidney injury.
  [stroke] HISTORICAL   ICD=N/A
  [acute kidney injury] AFFIRMED     ICD=N/A

Mother had breast cancer. Patient denies any malignancy.
  [breast cancer] FAMILY       ICD=N/A
  [malignancy] NEGATED      ICD=N/A


---
## 11. Adversarial Training — `FGM`, `PGD`, `AdversarialTrainer`

FGM and PGD perturb word embeddings during training to improve robustness (+0.5-1.5% F1).

This section demonstrates the API structure. Actual training requires a GPU and dataset.

In [39]:
import torch
from src.training.adversarial import FGM, PGD, AdversarialTrainer

# --- Demonstrate FGM on a toy model ---
class ToyModel(torch.nn.Module):
    def __init__(self):
        super().__init__()
        self.word_embeddings = torch.nn.Embedding(100, 16)
        self.classifier = torch.nn.Linear(16, 3)

    def forward(self, input_ids):
        emb = self.word_embeddings(input_ids)
        return self.classifier(emb.mean(dim=1))

model = ToyModel()

# Simulate a forward + backward pass
input_ids = torch.randint(0, 100, (2, 5))
output = model(input_ids)
loss = output.sum()
loss.backward()

# --- FGM attack ---
fgm = FGM(model, epsilon=1.0)
original_emb = model.word_embeddings.weight.data.clone()

fgm.attack()
perturbed_emb = model.word_embeddings.weight.data.clone()
perturbation_norm = torch.norm(perturbed_emb - original_emb).item()
print(f"FGM perturbation L2 norm: {perturbation_norm:.4f}")

fgm.restore()
restored_emb = model.word_embeddings.weight.data.clone()
print(f"Embeddings restored: {torch.allclose(original_emb, restored_emb)}")

FGM perturbation L2 norm: 1.0000
Embeddings restored: True


In [40]:
# --- PGD multi-step attack ---
model.zero_grad()
output = model(input_ids)
loss = output.sum()
loss.backward()

pgd = PGD(model, epsilon=0.3, alpha=0.1, num_steps=3)
original_emb = model.word_embeddings.weight.data.clone()

pgd.save()
for step in range(pgd.num_steps):
    pgd.attack_step()
    step_emb = model.word_embeddings.weight.data.clone()
    step_norm = torch.norm(step_emb - original_emb).item()
    print(f"  PGD step {step+1}: perturbation L2 norm = {step_norm:.4f}")

pgd.restore()
print(f"Embeddings restored: {torch.allclose(original_emb, model.word_embeddings.weight.data)}")

  PGD step 1: perturbation L2 norm = 0.1000
  PGD step 2: perturbation L2 norm = 0.2000
  PGD step 3: perturbation L2 norm = 0.3000
Embeddings restored: True


In [41]:
# --- AdversarialTrainer overview ---
print("AdversarialTrainer extends HuggingFace Trainer with:")
print("  - Automatic FGM or PGD perturbation during training_step()")
print("  - Clean loss + adversarial loss combined")
print("  - No architecture changes required")
print()
print("Usage:")
print("  trainer = AdversarialTrainer(")
print("      model=model, args=training_args,")
print("      train_dataset=train_ds, eval_dataset=eval_ds,")
print("      adv_method='fgm', adv_epsilon=1.0,")
print("  )")
print("  trainer.train()")
print()
print("CLI:")
print("  python scripts/train.py --model pubmedbert --dataset icd_ner --adversarial")
print("  python scripts/train.py --model pubmedbert --dataset icd_ner --adversarial --adv-method pgd")

AdversarialTrainer extends HuggingFace Trainer with:
  - Automatic FGM or PGD perturbation during training_step()
  - Clean loss + adversarial loss combined
  - No architecture changes required

Usage:
  trainer = AdversarialTrainer(
      model=model, args=training_args,
      train_dataset=train_ds, eval_dataset=eval_ds,
      adv_method='fgm', adv_epsilon=1.0,
  )
  trainer.train()

CLI:
  python scripts/train.py --model pubmedbert --dataset icd_ner --adversarial
  python scripts/train.py --model pubmedbert --dataset icd_ner --adversarial --adv-method pgd


---
## 12. Assertion Classifier (Transformer-based) — Overview

The `AssertionClassifier` uses `bvanaken/clinical-assertion-negation-bert` for learned assertion detection.
It downloads the model on first use (~440MB), so we show the API without executing.

In [42]:
# To actually run: uncomment the lines below (requires model download)
#
# from src.clinical.assertion import AssertionClassifier
#
# classifier = AssertionClassifier(device="cpu")
#
# result = classifier.predict(
#     text="Patient denies any chest pain or shortness of breath.",
#     entity_text="chest pain",
#     entity_start=19,
#     entity_end=29,
# )
# print(result)  # {'label': 'ABSENT', 'negation': 'negated', 'score': 0.97}
#
# # Batch annotation:
# entities = [
#     {"text": "chest pain", "label": "DIAGNOSIS", "start": 19, "end": 29},
# ]
# annotated = classifier.annotate_entities(
#     "Patient denies any chest pain.", entities
# )

print("AssertionClassifier API:")
print("  .predict(text, entity_text, entity_start, entity_end) → Dict")
print("    Returns: {'label': 'PRESENT'|'ABSENT'|'POSSIBLE', 'negation': ..., 'score': float}")
print()
print("  .annotate_entities(text, entities) → List[Dict]")
print("    Adds: 'negation', 'assertion_label', 'assertion_score' to each entity")
print()
print("Pipeline integration:")
print("  pipeline = MedicalCodingPipeline(negation_strategy='transformer')")

AssertionClassifier API:
  .predict(text, entity_text, entity_start, entity_end) → Dict
    Returns: {'label': 'PRESENT'|'ABSENT'|'POSSIBLE', 'negation': ..., 'score': float}

  .annotate_entities(text, entities) → List[Dict]
    Adds: 'negation', 'assertion_label', 'assertion_score' to each entity

Pipeline integration:
  pipeline = MedicalCodingPipeline(negation_strategy='transformer')


---
## 13. CLI Scripts Reference

Quick reference for the training, prediction, evaluation, and benchmark scripts.

In [43]:
cli_reference = """
=== Training ===
  python scripts/train.py --model pubmedbert --dataset icd_ner
  python scripts/train.py --model pubmedbert --dataset icd_ner --adversarial
  python scripts/train.py --model pubmedbert --dataset icd_ner --adversarial --adv-method pgd
  python scripts/train.py --model bio_clinicalbert --dataset icd_ner --use-crf --lr 3e-5

=== Prediction ===
  python scripts/predict.py --model-path outputs/pubmedbert_icd_ner/best_model --text "Pt denies cp."
  python scripts/predict.py --model-path outputs/pubmedbert_icd_ner/best_model --icd-codes
  python scripts/predict.py --model-path outputs/pubmedbert_icd_ner/best_model --input-file data.txt --output-file out.json

=== Evaluation ===
  python scripts/evaluate.py --model-path outputs/pubmedbert_icd_ner/best_model --dataset icd_ner
  python scripts/evaluate.py --model-path outputs/pubmedbert_icd_ner/best_model --dataset icd_ner --error-analysis

=== Benchmark ===
  python scripts/benchmark.py --models pubmedbert biobert bio_clinicalbert --datasets icd_ner ncbi_disease bc5cdr

=== Tests ===
  python -m pytest tests/ -v
  python -m pytest tests/ --cov=src --cov-report=term-missing
"""
print(cli_reference)


=== Training ===
  python scripts/train.py --model pubmedbert --dataset icd_ner
  python scripts/train.py --model pubmedbert --dataset icd_ner --adversarial
  python scripts/train.py --model pubmedbert --dataset icd_ner --adversarial --adv-method pgd
  python scripts/train.py --model bio_clinicalbert --dataset icd_ner --use-crf --lr 3e-5

=== Prediction ===
  python scripts/predict.py --model-path outputs/pubmedbert_icd_ner/best_model --text "Pt denies cp."
  python scripts/predict.py --model-path outputs/pubmedbert_icd_ner/best_model --icd-codes
  python scripts/predict.py --model-path outputs/pubmedbert_icd_ner/best_model --input-file data.txt --output-file out.json

=== Evaluation ===
  python scripts/evaluate.py --model-path outputs/pubmedbert_icd_ner/best_model --dataset icd_ner
  python scripts/evaluate.py --model-path outputs/pubmedbert_icd_ner/best_model --dataset icd_ner --error-analysis

=== Benchmark ===
  python scripts/benchmark.py --models pubmedbert biobert bio_clinical

---
## 14. Running the Test Suite

All tests use mocked models and fallback data — no GPU or network access required.

In [44]:
# Uncomment to run the full test suite from this notebook:
# !cd {REPO_ROOT} && python -m pytest tests/ -v --tb=short 2>&1 | tail -30

print("To run tests from the command line:")
print(f"  cd {REPO_ROOT}")
print("  python -m pytest tests/ -v")
print()
print("Key test files:")
test_files = [
    ("test_negation.py",            "Rule-based negation detection (200+ assertions)"),
    ("test_assertion.py",           "Transformer assertion classifier"),
    ("test_icd_ner_dataset.py",     "Composite dataset loading + garbage label cleaning"),
    ("test_pipeline.py",            "End-to-end pipeline integration"),
    ("test_icd_pipeline.py",        "ICD code resolution pipeline"),
    ("test_shorthand.py",           "Abbreviation expansion"),
    ("test_disambiguation.py",      "Abbreviation disambiguation"),
    ("test_preprocessing.py",       "Tokenization and label alignment"),
    ("test_entity_postprocessing.py", "Entity filtering and merging"),
]
for filename, desc in test_files:
    print(f"  {filename:35s} — {desc}")

To run tests from the command line:
  cd /content/Medical_Code_Intelligence
  python -m pytest tests/ -v

Key test files:
  test_negation.py                    — Rule-based negation detection (200+ assertions)
  test_assertion.py                   — Transformer assertion classifier
  test_icd_ner_dataset.py             — Composite dataset loading + garbage label cleaning
  test_pipeline.py                    — End-to-end pipeline integration
  test_icd_pipeline.py                — ICD code resolution pipeline
  test_shorthand.py                   — Abbreviation expansion
  test_disambiguation.py              — Abbreviation disambiguation
  test_preprocessing.py               — Tokenization and label alignment
  test_entity_postprocessing.py       — Entity filtering and merging


---
## Summary

This notebook demonstrated the full Medical Code Intelligence pipeline:

```
Clinical Text
  → ShorthandExpander (abbreviation expansion with offset tracking)
  → NER Model (transformer token classification, BIO scheme)
  → post_process_entities() (stopword filter, fragment merging)
  → NegationDetector or AssertionClassifier (6 assertion statuses)
  → ICDCodeLookup (TF-IDF matching against 51K codes)
  → DRGCostEstimator (ICD-10 → MS-DRG → cost estimate)
  → MedicalEntity list (text, label, negation, ICD codes, DRG info)
```

Every component has offline fallback data so this notebook runs on CPU without network access.