# 1) Physician Notetaker — End-to-End Pipeline Explained

This notebook explains the full medical NLP pipeline I built for this assignment:
- Medical entity extraction (Symptoms / Diagnosis / Treatment / Prognosis)
- Structured medical JSON summary
- Keyword extraction
- Sentiment + intent analysis
- SOAP note generation

Goal: convert a raw doctor-patient transcript into structured clinical outputs.

In [None]:
# Sample transcript provided with the assignment
from pathlib import Path

transcript = Path("../data/sample_transcript.txt").read_text(encoding="utf-8")
print(transcript[:800])

Physician: Good morning, Ms. Jones. How are you feeling today?
Patient: Good morning, doctor. I’m doing better, but I still have some discomfort now and then.
Physician: I understand you were in a car accident last September. Can you walk me through what happened?
Patient: Yes, it was on September 1st, around 12:30 in the afternoon. I was driving from Cheadle Hulme to Manchester when I had to stop in traffic. Out of nowhere, another car hit me from behind, which pushed my car into the one in front.
Physician: That sounds like a strong impact. Were you wearing your seatbelt?
Patient: Yes, I always do.
Physician: What did you feel immediately after the accident?
Patient: At first, I was just shocked. But then I realized I had hit my head on the steering wheel, and I could feel pain in my nec


## 2) Experiment: Split Transcript into Speaker Turns

Before doing any NLP modeling, we need a clean structure:
- each line should become a Turn(speaker, text)
- we should group Patient vs Physician text

This makes later extraction much easier.

#### 2.1) Prototype: Speaker Turn Parsing

The prototype below does this in two steps:

- `split_turns` parses each `Speaker: text` line into a `Turn(speaker, text)`, skipping blank lines and bracketed notes such as `[Physical Examination Conducted]`.
- `group_by_speaker` joins each speaker's turns into a single string, so Patient and Physician text can be processed separately.

In [None]:
from dataclasses import dataclass
from typing import List, Dict
import re


@dataclass
class Turn:
    speaker: str
    text: str


def split_turns(transcript: str) -> List[Turn]:
    turns = []
    lines = transcript.splitlines()

    for line in lines:
        line = line.strip()
        if not line:
            continue

        # Ignore bracket notes like [Physical Examination Conducted]
        if line.startswith("[") and line.endswith("]"):
            continue

        m = re.match(r"^(Physician|Doctor|Patient)\s*:\s*(.+)$", line, flags=re.IGNORECASE)
        if not m:
            # Lines without a recognized speaker label are dropped
            continue

        speaker = m.group(1).strip().title()
        text = m.group(2).strip()

        turns.append(Turn(speaker=speaker, text=text))

    return turns


def group_by_speaker(turns: List[Turn]) -> Dict[str, str]:
    grouped: Dict[str, List[str]] = {}
    for t in turns:
        grouped.setdefault(t.speaker, []).append(t.text)

    return {speaker: " ".join(texts) for speaker, texts in grouped.items()}

In [5]:
turns = split_turns(transcript)

print("Total turns:", len(turns))
for t in turns[:5]:
    print(t.speaker, "=>", t.text[:80])

Total turns: 26
Physician => Good morning, Ms. Jones. How are you feeling today?
Patient => Good morning, doctor. I’m doing better, but I still have some discomfort now and
Physician => I understand you were in a car accident last September. Can you walk me through 
Patient => Yes, it was on September 1st, around 12:30 in the afternoon. I was driving from 
Physician => That sounds like a strong impact. Were you wearing your seatbelt?


In [6]:
grouped = group_by_speaker(turns)

print("\n--- Patient Preview ---\n")
print(grouped.get("Patient", "")[:500])

print("\n--- Physician Preview ---\n")
print(grouped.get("Physician", "")[:500])


--- Patient Preview ---

Good morning, doctor. I’m doing better, but I still have some discomfort now and then. Yes, it was on September 1st, around 12:30 in the afternoon. I was driving from Cheadle Hulme to Manchester when I had to stop in traffic. Out of nowhere, another car hit me from behind, which pushed my car into the one in front. Yes, I always do. At first, I was just shocked. But then I realized I had hit my head on the steering wheel, and I could feel pain in my neck and back almost right away. Yes, I went t

--- Physician Preview ---

Good morning, Ms. Jones. How are you feeling today? I understand you were in a car accident last September. Can you walk me through what happened? That sounds like a strong impact. Were you wearing your seatbelt? What did you feel immediately after the accident? Did you seek medical attention at that time? How did things progress after that? That makes sense. Are you still experiencing pain now? That’s good to hear. Have you noticed any othe
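
One limitation of `split_turns` is that any line without a speaker label is dropped, so an utterance that wraps onto a second line loses its tail. The sketch below shows one possible variant that appends such lines to the previous turn. The function name `split_turns_with_continuations` and the `sample` text are made up for illustration, and the pattern uses `(.*)` instead of `(.+)` so that a bare `Patient:` label followed by text on the next line is also accepted.

```python
import re
from dataclasses import dataclass
from typing import List

SPEAKER_RE = re.compile(r"^(Physician|Doctor|Patient)\s*:\s*(.*)$", re.IGNORECASE)


@dataclass
class Turn:
    speaker: str
    text: str


def split_turns_with_continuations(transcript: str) -> List[Turn]:
    """Like split_turns, but appends unlabeled lines to the previous turn."""
    turns: List[Turn] = []
    for raw in transcript.splitlines():
        line = raw.strip()
        # Skip blank lines and bracket notes such as [Physical Examination Conducted]
        if not line or (line.startswith("[") and line.endswith("]")):
            continue
        m = SPEAKER_RE.match(line)
        if m:
            turns.append(Turn(m.group(1).title(), m.group(2).strip()))
        elif turns:
            # No speaker label: treat the line as a continuation of the last turn
            turns[-1].text = f"{turns[-1].text} {line}".strip()
    return turns


# Invented three-turn sample, not taken from the assignment transcript
sample = """Physician: How are you feeling today?
Patient: Better, but my back still aches
now and then.
[Physical Examination Conducted]
Physician: Everything looks good."""

for t in split_turns_with_continuations(sample):
    print(t.speaker, "=>", t.text)
```

Whether continuation handling is needed depends on how the transcripts are formatted; the sample transcript used in this notebook appears to keep each turn on one line.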

#### 2.2) Prototype: Extract Dates, Times, Durations, and Counts

Medical transcripts often contain important numeric facts such as:
- Accident date and time
- Duration of symptoms (e.g., 4 weeks)
- Treatment counts (e.g., 10 physiotherapy sessions)
- Time off work (e.g., 1 week)

NER models are often inconsistent with numeric facts, so we implement deterministic extraction rules.

In [7]:
import re
from typing import Dict, Any


def extract_dates_and_times(text: str) -> Dict[str, Any]:
    t = text.lower()

    # Date like "September 1st" / "September 1" (only September is matched, which is enough for the sample transcript)
    date_match = re.search(r"\b(september)\s+(\d{1,2})(st|nd|rd|th)?\b", t)
    accident_date = None
    if date_match:
        accident_date = f"{date_match.group(1).title()} {date_match.group(2)}"

    # Time like "12:30"
    time_match = re.search(r"\b(\d{1,2}:\d{2})\b", t)
    accident_time = time_match.group(1) if time_match else None

    # Month reference like "last September"
    month_ref = None
    if "last september" in t:
        month_ref = "last September"

    return {
        "Accident_Date": accident_date,
        "Accident_Time": accident_time,
        "Accident_Month_Reference": month_ref
    }


def extract_counts_and_durations(text: str) -> Dict[str, Any]:
    t = text.lower()

    # physio sessions (e.g., "ten sessions", "10 sessions")
    physio_sessions = None
    if "ten sessions" in t:
        physio_sessions = 10
    else:
        m = re.search(r"\b(\d+)\s+sessions?\b", t)
        if m:
            physio_sessions = int(m.group(1))

    # acute pain duration (e.g., "first four weeks")
    acute_weeks = None
    if "four weeks" in t:
        acute_weeks = 4
    else:
        m = re.search(r"\b(\d+)\s+weeks?\b", t)
        if m:
            acute_weeks = int(m.group(1))

    # time off work (only the phrase "week off work" is recognized, i.e. one week)
    time_off_days = None
    if "week off work" in t:
        time_off_days = 7

    return {
        "Physio_Sessions": physio_sessions,
        "Acute_Pain_Duration_Weeks": acute_weeks,
        "Time_Off_Work_Days": time_off_days
    }

In [8]:
combined_text = grouped.get("Patient", "") + " " + grouped.get("Physician", "")

dates = extract_dates_and_times(combined_text)
counts = extract_counts_and_durations(combined_text)

print("DATES:")
print(dates)

print("\nCOUNTS:")
print(counts)

DATES:
{'Accident_Date': 'September 1', 'Accident_Time': '12:30', 'Accident_Month_Reference': 'last September'}

COUNTS:
{'Physio_Sessions': 10, 'Acute_Pain_Duration_Weeks': 4, 'Time_Off_Work_Days': 7}


### Result

The rules successfully extract key numeric facts:
- accident date/time
- symptom duration (4 weeks)
- physiotherapy sessions (10)
- time off work (1 week)

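These fields are high-value, and deterministic rules are a more dependable way to capture them than NER. The patterns here are tailored to the sample's wording (for example "ten sessions" and "four weeks"), so other transcripts may need additional rules.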
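
The rules above hard-code `september`, `ten sessions`, and `four weeks`. As a rough sketch of how they could be generalized, the helpers below accept any month name and either digits or a small set of number words. The names `NUMBER_WORDS`, `find_count`, and `find_date` are hypothetical and not part of the notebook's code, and the number-word table assumes counts stay at twelve or below.

```python
import re
from typing import Optional

# Small number-word table; assumes counts in a transcript stay at twelve or below
NUMBER_WORDS = {
    "one": 1, "two": 2, "three": 3, "four": 4, "five": 5, "six": 6,
    "seven": 7, "eight": 8, "nine": 9, "ten": 10, "eleven": 11, "twelve": 12,
}
MONTHS = ("january|february|march|april|may|june|july|august|"
          "september|october|november|december")
# Matches either digits or one of the number words
NUM = r"(\d+|" + "|".join(NUMBER_WORDS) + r")"


def to_int(token: str) -> int:
    return int(token) if token.isdigit() else NUMBER_WORDS[token]


def find_count(text: str, unit: str) -> Optional[int]:
    """Return the first '<number> <unit>' count; `unit` is a regex such as 'sessions?'."""
    m = re.search(rf"\b{NUM}\s+{unit}\b", text.lower())
    return to_int(m.group(1)) if m else None


def find_date(text: str) -> Optional[str]:
    """Return the first '<Month> <day>' mention, formatted like 'September 1'."""
    m = re.search(rf"\b({MONTHS})\s+(\d{{1,2}})(?:st|nd|rd|th)?\b", text.lower())
    return f"{m.group(1).title()} {m.group(2)}" if m else None


print(find_count("I had ten sessions of physiotherapy", "sessions?"))
print(find_count("the first four weeks were rough", "weeks?"))
print(find_date("it was on September 1st, around 12:30"))
```

Like the original rules, these helpers return only the first match, so a transcript that mentions several durations would still need logic to decide which one refers to which fact.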

## 3) Prototype: Medical NER using a Transformer Model

Next, we test a pretrained biomedical NER model to extract clinical entities from the transcript.

The model outputs entities with labels such as:
- Sign_symptom
- Medication
- Therapeutic_procedure
- Diagnostic_procedure
- Detailed_description

These raw outputs are then mapped into:
- Symptoms
- Diagnosis candidates
- Treatments

In [9]:
from transformers import pipeline

ner_pipe = pipeline(
    "token-classification",
    model="d4data/biomedical-ner-all",
    aggregation_strategy="simple"
)

In [10]:
ner_results = ner_pipe(transcript[:3000])  # cap for speed
len(ner_results)

22

In [11]:
for r in ner_results[:20]:
    print(r)


{'entity_group': 'Sign_symptom', 'score': 0.99994004, 'word': 'discomfort', 'start': 134, 'end': 144}
{'entity_group': 'Activity', 'score': 0.5682387, 'word': 'car accident', 'start': 197, 'end': 209}
{'entity_group': 'Time', 'score': 0.9518853, 'word': '12 : 30 in', 'start': 311, 'end': 319}
{'entity_group': 'Detailed_description', 'score': 0.25372186, 'word': '##ad', 'start': 357, 'end': 359}
{'entity_group': 'Nonbiological_location', 'score': 0.718798, 'word': 'hulme', 'start': 362, 'end': 367}
{'entity_group': 'Sign_symptom', 'score': 0.99995553, 'word': 'pain', 'start': 786, 'end': 790}
{'entity_group': 'Biological_structure', 'score': 0.99979633, 'word': 'neck', 'start': 797, 'end': 801}
{'entity_group': 'Biological_structure', 'score': 0.94998693, 'word': 'back', 'start': 806, 'end': 810}
{'entity_group': 'Duration', 'score': 0.96820205, 'word': 'weeks', 'start': 1150, 'end': 1155}
{'entity_group': 'Biological_structure', 'score': 0.99973065, 'word': 'neck', 'start': 1171, 'end'

In [12]:
print({x["entity_group"] for x in ner_results})

{'Biological_structure', 'Activity', 'Sign_symptom', 'Time', 'Medication', 'Therapeutic_procedure', 'Detailed_description', 'Nonbiological_location', 'Duration', 'Lab_value'}


### Observation

The model produces many entity labels.
However, raw outputs include:
- generic words ("pain")
- fragments (subword pieces such as "##ad")
- negated symptoms ("no anxiety" still extracted)

So we apply:
- filtering (confidence threshold, stopwords)
- light negation handling
- schema mapping (Symptoms / Diagnosis / Treatments)


In [13]:
def map_ner_to_schema(ner_results, min_score=0.75):
    symptoms = []
    diagnosis_candidates = []
    treatments = []

    GENERIC_BAD = {
        "issues", "damage", "recovery", "full range", "range", "movement",
        "mobility", "tenderness", "not constant", "constant"
    }

    for ent in ner_results:
        label = ent.get("entity_group")
        text = ent.get("word", "").strip()
        score = float(ent.get("score", 0.0))

        # remove subword junk
        if text.startswith("##"):
            continue

        # clean subword markers
        text_clean = text.replace("##", "").strip()

        # low confidence
        if score < min_score:
            continue

        # too short
        if len(text_clean) < 3:
            continue

        # generic stopwords
        if text_clean.lower() in GENERIC_BAD:
            continue

        # map labels
        if label == "Sign_symptom":
            symptoms.append(text_clean)

        elif label in ["Medication", "Therapeutic_procedure"]:
            treatments.append(text_clean)

        elif label in ["Detailed_description", "History"]:
            diagnosis_candidates.append(text_clean)

    # dedup and sort for stable output
    symptoms = sorted(set(symptoms))
    treatments = sorted(set(treatments))
    diagnosis_candidates = sorted(set(diagnosis_candidates))

    return {
        "Symptoms": symptoms,
        "Treatments": treatments,
        "Diagnosis_Candidates": diagnosis_candidates
    }


In [15]:
import json

mapped = map_ner_to_schema(ner_results)
print(json.dumps(mapped, indent=2))

{
  "Symptoms": [
    "anxiety",
    "discomfort",
    "emotional issues",
    "nervous",
    "pain",
    "stiff"
  ],
  "Treatments": [
    "physiotherapy"
  ],
  "Diagnosis_Candidates": [
    "ten sessions"
  ]
}
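
The mapped output keeps the bare word "pain" but loses the body parts, even though the raw NER results tag "neck" and "back" as `Biological_structure` right after it. One possible way to recover phrases like "neck pain" is to pair each symptom with body parts that start shortly after it, using the character offsets the pipeline already returns. This is only a sketch: `attach_body_parts` and its `max_gap` threshold of 20 characters are assumptions, and the three entities are copied from the output above with the `score` field omitted.

```python
from typing import Dict, List

# Three entities copied from the NER output above ("pain in my neck and back")
sample_entities = [
    {"entity_group": "Sign_symptom", "word": "pain", "start": 786, "end": 790},
    {"entity_group": "Biological_structure", "word": "neck", "start": 797, "end": 801},
    {"entity_group": "Biological_structure", "word": "back", "start": 806, "end": 810},
]


def attach_body_parts(entities: List[Dict], max_gap: int = 20) -> List[str]:
    """Pair each symptom with body parts that start within max_gap chars after it."""
    symptoms = [e for e in entities if e["entity_group"] == "Sign_symptom"]
    parts = [e for e in entities if e["entity_group"] == "Biological_structure"]
    phrases = set()
    for s in symptoms:
        nearby = [p for p in parts if 0 <= p["start"] - s["end"] <= max_gap]
        if nearby:
            phrases.update(f"{p['word']} {s['word']}" for p in nearby)
        else:
            phrases.add(s["word"])  # keep the bare symptom if no body part is close
    return sorted(phrases)


print(attach_body_parts(sample_entities))  # ['back pain', 'neck pain']
```

The check only looks forward from the symptom and ignores sentence boundaries, so it would miss phrasings such as "my neck hurts" and could join words across sentences.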


### Negation Handling

NER models often extract symptoms even when they are negated:

Example:
- "No anxiety while driving" → model still extracts "anxiety"

A lightweight fix is to remove extracted symptoms if they appear near negation words like:
- no
- not
- haven't


In [16]:
NEGATION_WORDS = {"no", "not", "never", "haven't", "hasn't", "didn't"}


def remove_negated_entities(text: str, entities: list[str], window: int = 3):
    tokens = text.lower().split()
    cleaned = []

    for ent in entities:
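        ent_tokens = ent.lower().split()
        if not ent_tokens:
            continue  # skip empty entity strings instead of raising IndexError
        # Only the first token of the entity is matched against the transcript
        ent_first = ent_tokens[0]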

        keep = True
        for i, tok in enumerate(tokens):
            if tok == ent_first:
                start = max(0, i - window)
                context = tokens[start:i]
                if any(w in context for w in NEGATION_WORDS):
                    keep = False
                    break

        if keep:
            cleaned.append(ent)

    return sorted(list(set(cleaned)))

In [17]:
mapped["Symptoms"] = remove_negated_entities(transcript, mapped["Symptoms"])
print(mapped["Symptoms"])

['anxiety', 'discomfort', 'emotional issues', 'nervous', 'pain', 'stiff']
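
The printed list is identical to the input, so the window check removed nothing on this transcript. Two tokenization details probably contribute: `split()` leaves punctuation attached (so "anxiety," never equals "anxiety"), and the transcript uses curly apostrophes (as in "I’m"), so a token like "haven’t" never equals the straight-quoted "haven't" in `NEGATION_WORDS`. The sketch below addresses both and matches multi-word entities as a whole. The names `tokenize` and `remove_negated_entities_v2` and the `demo` sentence are invented for illustration.

```python
import re
from typing import List

NEGATION_WORDS = {"no", "not", "never", "haven't", "hasn't", "didn't"}


def tokenize(text: str) -> List[str]:
    # Normalize curly apostrophes, then keep runs of letters, digits and apostrophes
    return re.findall(r"[a-z0-9']+", text.lower().replace("’", "'"))


def remove_negated_entities_v2(text: str, entities: List[str], window: int = 3) -> List[str]:
    tokens = tokenize(text)
    kept = []
    for ent in entities:
        ent_tokens = tokenize(ent)
        if not ent_tokens:
            continue
        n = len(ent_tokens)
        # Negated if any occurrence has a negation word in the preceding window
        negated = any(
            tokens[i:i + n] == ent_tokens
            and NEGATION_WORDS & set(tokens[max(0, i - window):i])
            for i in range(len(tokens) - n + 1)
        )
        if not negated:
            kept.append(ent)
    return sorted(set(kept))


# Invented sentence, not taken from the assignment transcript
demo = "I haven’t had any anxiety, but my neck is still stiff."
print(remove_negated_entities_v2(demo, ["anxiety", "stiff"]))  # ['stiff']
```

As in the original function, an entity is dropped if any one of its mentions is negated, and denials phrased without a nearby negation word (for example a reply like "nothing like that") are still missed by a window-based check.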


## 4) Final Step: Build the Structured Medical JSON Summary

Now we combine everything:

Inputs:
- grouped speaker text (Patient vs Physician)
- NER extracted entities (Symptoms / Treatments / Diagnosis candidates)
- rule-based extraction (dates, durations, counts)

Output:
- a structured clinical JSON report

In [18]:
def build_structured_summary(grouped, mapped, dates, counts):
    patient_text = grouped.get("Patient", "")
    doctor_text = grouped.get("Physician", "") + " " + grouped.get("Doctor", "")
    combined = (patient_text + " " + doctor_text).lower()

    # -----------------------
    # Diagnosis selection
    # -----------------------
    diagnosis = None
    for d in mapped.get("Diagnosis_Candidates", []):
        if "whiplash" in d.lower():
            diagnosis = "Whiplash injury"
            break

    if diagnosis is None and "whiplash" in combined:
        diagnosis = "Whiplash injury"

    # -----------------------
    # Symptoms patch
    # -----------------------
    symptoms = list(mapped.get("Symptoms", []))  # copy so `mapped` is not mutated by append()
    if "neck" in combined and "pain" in combined:
        symptoms.append("Neck pain")
    if "back" in combined and "pain" in combined:
        symptoms.append("Back pain")

    # remove anxiety if transcript denies it
    if "nothing like that" in combined:
        symptoms = [s for s in symptoms if s.lower() not in ["anxiety", "nervous"]]

    symptoms = sorted(list(set(symptoms)))

    # -----------------------
    # Treatments patch
    # -----------------------
    treatments = list(mapped.get("Treatments", []))  # copy so `mapped` is not mutated by append()

    # add physio sessions from rule extraction
    if counts.get("Physio_Sessions"):
        treatments.append(f"{counts['Physio_Sessions']} physiotherapy sessions")

    # add painkillers if mentioned
    if "painkillers" in combined:
        treatments.append("Painkillers")

    # remove junk
    treatments = [t for t in treatments if t.lower() not in ["pain"]]
    treatments = sorted(list(set(treatments)))

    # -----------------------
    # Current status
    # -----------------------
    if "occasional" in patient_text.lower() and ("backache" in patient_text.lower() or "back pain" in patient_text.lower()):
        current_status = "Occasional backache"
    else:
        current_status = "Improving, intermittent discomfort"

    # -----------------------
    # Prognosis
    # -----------------------
    prognosis = None
    if "full recovery" in doctor_text.lower() and "six months" in doctor_text.lower():
        prognosis = "Full recovery expected within six months of the accident"

    # -----------------------
    # Physical exam
    # -----------------------
    physical_exam = None
    if "full range of movement" in doctor_text.lower() or "full range of motion" in doctor_text.lower():
        physical_exam = "Full range of movement in neck and back; no tenderness; no signs of lasting damage."

    # -----------------------
    # HPI narrative
    # -----------------------
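    # Avoid printing "None weeks" when no duration was extracted
    weeks = counts.get("Acute_Pain_Duration_Weeks")
    duration = f"approximately {weeks} weeks" if weeks is not None else "an unspecified period"
    hpi = (
        "Patient involved in a motor vehicle accident. "
        "Reported head impact and acute neck/back pain. "
        f"Severe symptoms lasted {duration}, "
        "followed by improvement with physiotherapy."
    )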

    return {
        "Patient_Name": "Ms. Jones",
        "Accident_Details": {
            "Accident_Date": dates.get("Accident_Date"),
            "Accident_Time": dates.get("Accident_Time"),
            "Accident_Month_Reference": dates.get("Accident_Month_Reference"),
            "Mechanism": "Rear-end collision"
        },
        "Symptoms": symptoms,
        "Diagnosis": diagnosis,
        "Treatment": treatments,
        "Current_Status": current_status,
        "Prognosis": prognosis,
        "Functional_Impact": {
            "Time_Off_Work_Days": counts.get("Time_Off_Work_Days"),
            "Daily_Life_Impact": "Minimal; returned to usual routine after one week"
        },
        "HPI": hpi,
        "Physical_Exam": physical_exam
    }


In [19]:
structured_summary = build_structured_summary(grouped, mapped, dates, counts)
print(json.dumps(structured_summary, indent=2))

{
  "Patient_Name": "Ms. Jones",
  "Accident_Details": {
    "Accident_Date": "September 1",
    "Accident_Time": "12:30",
    "Accident_Month_Reference": "last September",
    "Mechanism": "Rear-end collision"
  },
  "Symptoms": [
    "Back pain",
    "Neck pain",
    "discomfort",
    "emotional issues",
    "pain",
    "stiff"
  ],
  "Diagnosis": "Whiplash injury",
  "Treatment": [
    "10 physiotherapy sessions",
    "Painkillers",
    "physiotherapy"
  ],
  "Current_Status": "Occasional backache",
  "Prognosis": "Full recovery expected within six months of the accident",
  "Functional_Impact": {
    "Time_Off_Work_Days": 7,
    "Daily_Life_Impact": "Minimal; returned to usual routine after one week"
  },
  "HPI": "Patient involved in a motor vehicle accident. Reported head impact and acute neck/back pain. Severe symptoms lasted approximately 4 weeks, followed by improvement with physiotherapy.",
  "Physical_Exam": "Full range of movement in neck and back; no tenderness; no signs o

### Result

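We now have a clean structured medical report in JSON. Note that `Patient_Name`, `Mechanism`, and `Daily_Life_Impact` are hard-coded for the sample transcript rather than extracted from it.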

At this point, the logic is stable and can be modularized into the final project structure.

## 5) Modularization into the Final Project

After validating each component in this notebook, the code was moved into modular files:

- `src/preprocess.py` (speaker splitting)
- `src/ner.py` (medical NER + mapping)
- `src/pipeline.py` (final schema builder)
- `src/keywords.py`
- `src/sentiment_intent.py`
- `src/soap.py`
- `src/summarizer.py`