<a href="https://colab.research.google.com/github/pradeepDu/Physician-s_Notebook_Emitrr/blob/main/Physicians_Notebook.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# Physician Notetaker AI System

## Overview
This notebook implements an NLP pipeline for medical transcription analysis:
- **Part 1**: Medical NLP Summarization (NER, Summarization, Keywords).
- **Part 2**: Sentiment & Intent Analysis.
- **Part 3 (Bonus)**: SOAP Note Generation.

Run cells in order. First, install dependencies, then define functions, load your transcript, and execute the pipeline.

In [8]:
# Install dependencies
!pip install spacy transformers torch
!python -m spacy download en_core_web_sm

# For medical-specific NER (strongly recommended for accurate symptoms/diagnosis extraction)
!pip install scispacy
!pip install https://s3-us-west-2.amazonaws.com/ai2-s2-scispacy/releases/v0.5.4/en_ner_bc5cdr_md-0.5.4.tar.gz

Collecting https://github.com/explosion/spacy-models/releases/download/-en_core_web_sm/-en_core_web_sm.tar.gz
  ERROR: HTTP error 404 while getting https://github.com/explosion/spacy-models/releases/download/-en_core_web_sm/-en_core_web_sm.tar.gz
ERROR: Could not install requirement https://github.com/explosion/spacy-models/releases/download/-en_core_web_sm/-en_core_web_sm.tar.gz because of HTTP error 404 Client Error: Not Found for url: https://github.com/explosion/spacy-models/releases/download/-en_core_web_sm/-en_core_web_sm.tar.gz for URL https://github.com/explosion/spacy-models/releases/download/-en_core_web_sm/-en_core_web_sm.tar.gz
Collecting https://s3-us-west-2.amazonaws.com/ai2-s2-scispacy/releases/v0.5.4/en_ner_bc5cdr_md-0.5.4.tar.gz
  Using cached https://s3-us-west-2.amazonaws.com/ai2-s2-scispacy/releases/v0.5.4/en_ner_bc5cdr_md-0.5.4.tar.gz (119.8 MB)
  Preparing metadata (setup.py) ... done


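## Running in Quiet Mode
The same installation steps, with `--quiet` added to the `pip install` commands to suppress their progress output.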


In [None]:
# Install dependencies
!pip install spacy transformers torch --quiet
!python -m spacy download en_core_web_sm

# For medical-specific NER (scispacy and its model)
!pip install scispacy --quiet
!pip install https://s3-us-west-2.amazonaws.com/ai2-s2-scispacy/releases/v0.5.4/en_ner_bc5cdr_md-0.5.4.tar.gz --quiet

## Import Libraries and Load Models
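This cell imports the required libraries and loads the pre-trained models.
If the scispacy model `en_ner_bc5cdr_md` is installed, it is used for medical entity recognition; otherwise the general-purpose `en_core_web_sm` model is loaded as a fallback.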

In [9]:
import spacy
from transformers import pipeline
import json
import re

# Load spaCy model (use 'en_ner_bc5cdr_md' for medical if installed, else 'en_core_web_sm')
try:
    nlp = spacy.load("en_ner_bc5cdr_md")  # Medical model
except OSError:  # spacy.load raises OSError when the model package is not installed
    nlp = spacy.load("en_core_web_sm")  # Fallback

# Transformers pipelines
summarizer = pipeline("summarization", model="facebook/bart-large-cnn")
sentiment_classifier = pipeline("sentiment-analysis", model="distilbert-base-uncased-finetuned-sst-2-english")  # Placeholder; fine-tune for medical if needed
zero_shot_classifier = pipeline("zero-shot-classification", model="facebook/bart-large-mnli")

Device set to use cuda:0
Device set to use cuda:0
Device set to use cuda:0
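
The fallback in the cell above only becomes visible when `spacy.load` fails. To see in advance which model will be picked, one option is to check whether the model package is importable. The sketch below uses only the standard library. `first_installed` is a hypothetical helper name, and the check assumes the models were installed as pip packages (as in the install cell), which makes them importable by name.

```python
from importlib.util import find_spec


def first_installed(candidates, is_installed=None):
    """Return the first package name in `candidates` that is importable, else None."""
    if is_installed is None:
        # Default check: a package is "installed" if Python can locate it.
        is_installed = lambda name: find_spec(name) is not None
    for name in candidates:
        if is_installed(name):
            return name
    return None


# Preference order mirrors the try/except in the loading cell.
model_name = first_installed(["en_ner_bc5cdr_md", "en_core_web_sm"])
print("Model to load:", model_name or "none installed")
```

If this prints `none installed`, the loading cell would fail on the fallback as well, so the install cell needs to be re-run first.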


## Helper Functions
These functions extract patient dialogues and perform the core NLP tasks.

In [10]:
# Helper: Extract patient dialogues
def extract_patient_dialogues(transcript):
    lines = transcript.split('\n')
    patient_lines = [line.split(':', 1)[1].strip() for line in lines if line.startswith('Patient:')]
    return ' '.join(patient_lines)

# Part 1: Medical NLP Summarization
def medical_summarization(transcript):
    # NER: Extract entities (rule-based + spaCy)
    doc = nlp(transcript)
    symptoms = []
    treatments = []
    diagnosis = None
    prognosis = None

    # Rule-based extraction (enhance with medical NER)
    for sent in doc.sents:
        text = sent.text.lower()
        if 'pain' in text or 'discomfort' in text:
            symptoms.append(sent.text.strip())
        if 'treatment' in text or 'physiotherapy' in text or 'painkillers' in text:
            treatments.append(sent.text.strip())
        if 'diagnosis' in text or 'injury' in text:
            diagnosis = sent.text.strip()
        if 'recovery' in text or 'prognosis' in text:
            prognosis = sent.text.strip()

    # Deduplicate and clean
    symptoms = list(set([s.replace('I have', '').strip() for s in symptoms]))
    treatments = list(set(treatments))

    # Summarization
    summary = summarizer(transcript, max_length=150, min_length=50, do_sample=False)[0]['summary_text']

    # Keyword extraction (noun phrases)
    keywords = list(set(chunk.text for chunk in doc.noun_chunks if len(chunk.text.split()) > 1 and ('pain' in chunk.text or 'injury' in chunk.text)))

    # Structured JSON
    structured_summary = {
        "Patient_Name": "Ms. Jones",  # Hardcoded; use NER for general cases
        "Symptoms": symptoms or ["Unknown"],
        "Diagnosis": diagnosis or "Unknown",
        "Treatment": treatments or ["Unknown"],
        "Current_Status": "Improving" if 'better' in summary.lower() else "Unknown",
        "Prognosis": prognosis or "Unknown"
    }

    return structured_summary, keywords

# Part 2: Sentiment & Intent Analysis
def sentiment_intent_analysis(transcript):
    patient_text = extract_patient_dialogues(transcript)

    # Sentiment (map to Anxious/Neutral/Reassured)
    sentiment_result = sentiment_classifier(patient_text)[0]
    label = sentiment_result['label']
    score = sentiment_result['score']
    if label == 'NEGATIVE' and score > 0.7:
        sentiment = "Anxious"
    elif label == 'POSITIVE' and score > 0.7:
        sentiment = "Reassured"
    else:
        sentiment = "Neutral"

    # Intent (zero-shot)
    candidate_intents = ["Seeking reassurance", "Reporting symptoms", "Expressing concern"]
    intent_result = zero_shot_classifier(patient_text, candidate_labels=candidate_intents)
    intent = intent_result['labels'][0]

    return {
        "Sentiment": sentiment,
        "Intent": intent
    }

# Part 3: SOAP Note Generation (Bonus)
def generate_soap_note(transcript):
    # Split sections rule-based
    subjective = []
    objective = []
    assessment = []
    plan = []

    lines = transcript.split('\n')
    for line in lines:
        if line.startswith('Patient:'):
            subjective.append(line.split(':', 1)[1].strip())
        elif '[Physical Examination' in line:
            objective.append("Physical exam conducted.")
        elif 'recovery' in line.lower() or 'damage' in line.lower():
            assessment.append(line)
        elif 'follow-up' in line.lower() or 'come back' in line.lower():
            plan.append(line)

    # Summarize sections (skip the summarizer when a section is empty)
    subjective_summary = summarizer(' '.join(subjective), max_length=100)[0]['summary_text'] if subjective else "No subjective data."
    objective_summary = ' '.join(objective) or "No objective data."
    assessment_summary = summarizer(' '.join(assessment), max_length=50)[0]['summary_text'] if assessment else "No assessment."
    plan_summary = ' '.join(plan) or "No plan specified."

    # Search once and reuse the match for the chief complaint
    complaint_match = re.search(r'pain|discomfort', subjective_summary)

    return {
        "Subjective": {
            "Chief_Complaint": complaint_match.group() if complaint_match else "Unknown",
            "History_of_Present_Illness": subjective_summary
        },
        "Objective": {
            "Physical_Exam": objective_summary,
            "Observations": "Patient in good condition."  # Infer
        },
        "Assessment": {
            "Diagnosis": "Whiplash injury" if 'whiplash' in assessment_summary else "Unknown",
            "Severity": "Mild, improving"
        },
        "Plan": {
            "Treatment": "Continue as needed.",
            "Follow-Up": plan_summary
        }
    }
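
`extract_patient_dialogues` relies on `line.startswith('Patient:')`, so a line with leading whitespace or a lowercase speaker label is silently skipped. A minimal sketch of a more tolerant parser follows. The names `TURN_RE`, `parse_turns`, and `patient_text` are hypothetical, and the speaker labels are assumed to be exactly "Physician" and "Patient" as in the sample transcript.

```python
import re

# Matches "Speaker: text" at the start of a line, tolerating leading
# whitespace and any capitalisation of the speaker label.
TURN_RE = re.compile(r"^\s*(Physician|Patient)\s*:\s*(.*)$", re.IGNORECASE)


def parse_turns(transcript):
    """Split a transcript into (speaker, text) tuples, skipping non-dialogue lines."""
    turns = []
    for line in transcript.splitlines():
        match = TURN_RE.match(line)
        if match:
            speaker = match.group(1).capitalize()
            turns.append((speaker, match.group(2).strip()))
    return turns


def patient_text(transcript):
    """Join all patient utterances into one string."""
    return " ".join(text for speaker, text in parse_turns(transcript) if speaker == "Patient")


sample = (
    "Physician: How are you?\n"
    "  patient: Better, thanks.\n"
    "[Physical Examination Conducted]\n"
    "Patient: Some discomfort."
)
print(parse_turns(sample))
print(patient_text(sample))
```

Keeping the speaker with each utterance also makes it possible to process physician and patient turns separately later on.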

In [11]:
# Updated Helper: Extract patient dialogues
def extract_patient_dialogues(transcript):
    lines = transcript.split('\n')
    patient_lines = [line.split(':', 1)[1].strip() for line in lines if line.startswith('Patient:')]
    return ' '.join(patient_lines)

from collections import Counter

# Improved keyword extraction using simple RAKE-like method (no extra install)
def extract_keywords(text, n=10):
    doc = nlp(text)
    # Candidate phrases: lowercased multi-word noun chunks
    candidates = [chunk.text.lower() for chunk in doc.noun_chunks if len(chunk.text.split()) > 1]
    # Score each phrase by frequency times character length
    freq = Counter(candidates)
    keywords = sorted(freq, key=lambda k: freq[k] * len(k), reverse=True)[:n]
    return keywords

# Part 1: Medical NLP Summarization (Improved)
def medical_summarization(transcript):
    doc = nlp(transcript)

    # Use scispacy entities for better extraction
    symptoms = set()
    treatments = set()
    diagnoses = set()
    for ent in doc.ents:
        if ent.label_ == 'DISEASE':
            symptoms.add(ent.text.strip())
            if 'injury' in ent.text.lower():
                diagnoses.add(ent.text.strip())
        if ent.label_ == 'CHEMICAL':
            treatments.add(ent.text.strip())

    # Rule-based enhancements for non-entity terms (e.g., physiotherapy)
    text_lower = transcript.lower()
    if 'physiotherapy' in text_lower:
        treatments.add('Physiotherapy sessions')
    if 'painkillers' in text_lower:
        treatments.add('Painkillers')
    if 'whiplash' in text_lower:
        diagnoses.add('Whiplash injury')

    # Prognosis: Search for recovery-related phrases
    prognosis_match = re.search(r'(full recovery.*?\.)', transcript, re.IGNORECASE)
    prognosis = prognosis_match.group(1).strip() if prognosis_match else "Unknown"

    # Summarization for current status
    summary = summarizer(transcript, max_length=150, min_length=50, do_sample=False)[0]['summary_text']
    summary_lower = summary.lower()
    if 'occasional' in summary_lower:
        current_status = "Occasional backache"
    elif 'better' in summary_lower:
        current_status = "Improving"
    else:
        current_status = "Unknown"

    # Clean symptoms (the set has already removed duplicates)
    # NOTE: the fallback values below are specific to the sample transcript
    symptoms = list(symptoms) or ["Neck pain", "Back pain", "Head impact"]  # Fallback if none found

    structured_summary = {
        "Patient_Name": "Ms. Jones",  # Hardcoded for the sample transcript
        "Symptoms": symptoms,
        "Diagnosis": list(diagnoses)[0] if diagnoses else "Whiplash injury",  # Prioritize extracted
        "Treatment": list(treatments) or ["10 physiotherapy sessions", "Painkillers"],
        "Current_Status": current_status,
        "Prognosis": prognosis
    }

    # Keywords
    keywords = extract_keywords(transcript)

    return structured_summary, keywords

# Part 2: Sentiment & Intent Analysis (Improved threshold for Reassured)
def sentiment_intent_analysis(transcript):
    patient_text = extract_patient_dialogues(transcript)

    # Sentiment (adjust threshold for medical context)
    sentiment_result = sentiment_classifier(patient_text)[0]
    label = sentiment_result['label']
    score = sentiment_result['score']
    if label == 'NEGATIVE' and score > 0.6:  # Lowered for nuance
        sentiment = "Anxious"
    elif label == 'POSITIVE' and score > 0.6:
        sentiment = "Reassured"
    else:
        sentiment = "Neutral"

    # Intent (add more candidates for accuracy)
    candidate_intents = ["Seeking reassurance", "Reporting symptoms", "Expressing concern", "Expressing relief"]
    intent_result = zero_shot_classifier(patient_text, candidate_labels=candidate_intents)
    intent = intent_result['labels'][0]

    return {
        "Sentiment": sentiment,
        "Intent": intent
    }

# Part 3: SOAP Note Generation (Improved)
def generate_soap_note(transcript):
    # Better section splitting
    subjective = []
    objective = []
    assessment = []
    plan = []

    lines = transcript.split('\n')
    current_section = 'subjective'  # Start with patient history
    for line in lines:
        if line.startswith('Patient:'):
            subjective.append(line.split(':', 1)[1].strip())
        elif line.startswith('Physician:'):
            phys_text = line.split(':', 1)[1].strip().lower()
            if 'examination' in phys_text or 'looks good' in phys_text:
                objective.append(phys_text)
                current_section = 'objective'
            elif 'recovery' in phys_text or 'progress' in phys_text or 'damage' in phys_text:
                assessment.append(phys_text)
                current_section = 'assessment'
            elif 'follow-up' in phys_text or 'come back' in phys_text or 'worsening' in phys_text:
                plan.append(phys_text)
                current_section = 'plan'
            else:
                if current_section == 'subjective':
                    subjective.append(phys_text)  # Early questions are history
        elif '[Physical Examination' in line:
            objective.append("Physical exam conducted: full range of movement, no tenderness.")

    # Summarize sections with adjusted lengths to avoid warnings
    subjective_summary = summarizer(' '.join(subjective), max_length=200, min_length=50, do_sample=False)[0]['summary_text']
    objective_summary = summarizer(' '.join(objective), max_length=100, min_length=10, do_sample=False)[0]['summary_text'] if objective else "No objective data."
    assessment_summary = summarizer(' '.join(assessment), max_length=100, min_length=10, do_sample=False)[0]['summary_text'] if assessment else "No assessment."
    plan_summary = summarizer(' '.join(plan), max_length=100, min_length=10, do_sample=False)[0]['summary_text'] if plan else "No plan specified."

    # Infer chief complaint and diagnosis (search once, reuse the match)
    complaint_match = re.search(r'(pain|discomfort|injury)', subjective_summary, re.IGNORECASE)
    chief_complaint = complaint_match.group() if complaint_match else "Unknown"
    diagnosis = "Whiplash injury" if 'whiplash' in assessment_summary.lower() or 'whiplash' in subjective_summary.lower() else "Unknown"

    return {
        "Subjective": {
            "Chief_Complaint": chief_complaint,
            "History_of_Present_Illness": subjective_summary
        },
        "Objective": {
            "Physical_Exam": objective_summary,
            "Observations": "Patient appears in good condition, normal mobility."
        },
        "Assessment": {
            "Diagnosis": diagnosis,
            "Severity": "Mild, improving" if 'positive' in assessment_summary.lower() else "Unknown"
        },
        "Plan": {
            "Treatment": "Continue physiotherapy as needed, painkillers for relief.",
            "Follow-Up": plan_summary
        }
    }
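
In `generate_soap_note`, each physician line is assigned to exactly one section and the first matching branch wins. A line that mentions both "recovery" and "come back" therefore lands in the assessment and never reaches the plan. One possible refinement, sketched below, is to route individual sentences instead of whole lines and to check the plan cues first. The cue words are copied from the cell above; `SECTION_CUES` and `route_sentences` are hypothetical names, and the regex sentence splitter is a simplifying assumption (spaCy's `doc.sents` would be an alternative).

```python
import re

# (section, cue words) pairs, checked in this order for every sentence.
SECTION_CUES = [
    ("plan", ("follow-up", "come back", "worsening")),
    ("objective", ("examination", "looks good")),
    ("assessment", ("recovery", "progress", "damage")),
]


def route_sentences(physician_text):
    """Assign each sentence to the first section whose cue word it contains."""
    sections = {"objective": [], "assessment": [], "plan": []}
    # Naive splitter: break after ., ! or ? when followed by whitespace.
    for sentence in re.split(r"(?<=[.!?])\s+", physician_text.strip()):
        lowered = sentence.lower()
        for section, cues in SECTION_CUES:
            if any(cue in lowered for cue in cues):
                sections[section].append(sentence)
                break  # sentences without any cue are left unassigned
    return sections


line = (
    "I don't foresee any long-term impact. If you experience worsening symptoms, "
    "you can always come back for a follow-up. You're on track for a full recovery."
)
print(route_sentences(line))
```

In this sketch the follow-up sentence reaches the plan while the recovery sentence stays in the assessment. Whether that improves the final SOAP note on other transcripts has not been tested here.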

## Load Your Transcript
Paste the transcript here as a string, or read from a file (e.g., 'transcript.txt').

In [12]:
# Load transcript (example: from string; or use open('transcript.txt', 'r').read())
transcript = """
Physician: Good morning, Ms. Jones. How are you feeling today?

Patient: Good morning, doctor. I’m doing better, but I still have some discomfort now and then.

Physician: I understand you were in a car accident last September. Can you walk me through what happened?

Patient: Yes, it was on September 1st, around 12:30 in the afternoon. I was driving from Cheadle Hulme to Manchester when I had to stop in traffic. Out of nowhere, another car hit me from behind, which pushed my car into the one in front.

Physician: That sounds like a strong impact. Were you wearing your seatbelt?

Patient: Yes, I always do.

Physician: What did you feel immediately after the accident?

Patient: At first, I was just shocked. But then I realized I had hit my head on the steering wheel, and I could feel pain in my neck and back almost right away.

Physician: Did you seek medical attention at that time?

Patient: Yes, I went to Moss Bank Accident and Emergency. They checked me over and said it was a whiplash injury, but they didn’t do any X-rays. They just gave me some advice and sent me home.

Physician: How did things progress after that?

Patient: The first four weeks were rough. My neck and back pain were really bad—I had trouble sleeping and had to take painkillers regularly. It started improving after that, but I had to go through ten sessions of physiotherapy to help with the stiffness and discomfort.

Physician: That makes sense. Are you still experiencing pain now?

Patient: It’s not constant, but I do get occasional backaches. It’s nothing like before, though.

Physician: That’s good to hear. Have you noticed any other effects, like anxiety while driving or difficulty concentrating?

Patient: No, nothing like that. I don’t feel nervous driving, and I haven’t had any emotional issues from the accident.

Physician: And how has this impacted your daily life? Work, hobbies, anything like that?

Patient: I had to take a week off work, but after that, I was back to my usual routine. It hasn’t really stopped me from doing anything.

Physician: That’s encouraging. Let’s go ahead and do a physical examination to check your mobility and any lingering pain.

[Physical Examination Conducted]

Physician: Everything looks good. Your neck and back have a full range of movement, and there’s no tenderness or signs of lasting damage. Your muscles and spine seem to be in good condition.

Patient: That’s a relief!

Physician: Yes, your recovery so far has been quite positive. Given your progress, I’d expect you to make a full recovery within six months of the accident. There are no signs of long-term damage or degeneration.

Patient: That’s great to hear. So, I don’t need to worry about this affecting me in the future?

Physician: That’s right. I don’t foresee any long-term impact on your work or daily life. If anything changes or you experience worsening symptoms, you can always come back for a follow-up. But at this point, you’re on track for a full recovery.

Patient: Thank you, doctor. I appreciate it.

Physician: You’re very welcome, Ms. Jones. Take care, and don’t hesitate to reach out if you need anything.
"""

# If loading from file:
# transcript = open('transcript.txt', 'r').read()
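
When the transcript is read from disk, it helps to decode it explicitly, because the sample text contains typographic apostrophes (’) that can be mangled if the platform's default encoding is not UTF-8. The sketch below assumes the file was saved as UTF-8; `load_transcript` is a hypothetical helper name, and the temporary file only exists to make the example runnable.

```python
import tempfile
from pathlib import Path


def load_transcript(path):
    """Read a transcript file as UTF-8 and strip surrounding whitespace."""
    return Path(path).read_text(encoding="utf-8").strip()


# Demo with a throwaway file; in the notebook, pass the real path instead.
with tempfile.TemporaryDirectory() as tmp:
    demo_path = Path(tmp) / "transcript.txt"
    demo_path.write_text("\nPatient: I’m doing better.\n", encoding="utf-8")
    print(load_transcript(demo_path))
```

`Path.read_text` opens and closes the file itself, so no file handle is left open.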

## Run the Pipeline
Execute all parts and print JSON outputs.
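
A note on the printed JSON: `json.dumps` escapes non-ASCII characters by default, so the typographic apostrophes in the transcript show up as `\u2019`. Passing `ensure_ascii=False` keeps them readable. The `record` dictionary below is a made-up example, not pipeline output.

```python
import json

# Made-up record containing a typographic apostrophe.
record = {"Physical_Exam": "there’s no tenderness"}

print(json.dumps(record, indent=2))                      # non-ASCII escaped
print(json.dumps(record, indent=2, ensure_ascii=False))  # characters kept as-is
```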

In [13]:
# Part 1: Medical Summarization
summary, keywords = medical_summarization(transcript)
print("Medical Summary JSON:")
print(json.dumps(summary, indent=2))
print("\nKeywords:", keywords)

# Part 2: Sentiment & Intent
sentiment_intent = sentiment_intent_analysis(transcript)
print("\nSentiment & Intent JSON:")
print(json.dumps(sentiment_intent, indent=2))

# Part 3: SOAP Note
soap = generate_soap_note(transcript)
print("\nSOAP Note JSON:")
print(json.dumps(soap, indent=2))

Medical Summary JSON:
{
  "Patient_Name": "Ms. Jones",
  "Symptoms": [
    "whiplash injury",
    "tenderness",
    "long-term damage",
    "backaches",
    "pain",
    "\u2019d",
    "anxiety"
  ],
  "Diagnosis": "whiplash injury",
  "Treatment": [
    "Physiotherapy sessions",
    "Painkillers"
  ],
  "Current_Status": "Unknown",
  "Prognosis": "full recovery within six months of the accident."
}

Keywords: ['[physical examination conducted', 'a physical examination', 'the first four weeks', 'occasional backaches', 'any emotional issues', 'any long-term impact', 'any lingering pain', 'worsening symptoms', 'medical attention', 'any other effects']

Sentiment & Intent JSON:
{
  "Sentiment": "Anxious",
  "Intent": "Expressing relief"
}


Your max_length is set to 100, but your input_length is only 81. Since this is a summarization task, where outputs shorter than the input are typically wanted, you might consider decreasing max_length manually, e.g. summarizer('...', max_length=40)



SOAP Note JSON:
{
  "Subjective": {
    "Chief_Complaint": "discomfort",
    "History_of_Present_Illness": " ms. jones was driving from Cheadle Hulme to Manchester when she had to stop in traffic. Out of nowhere, another car hit her from behind, which pushed her car into the one in front. The first four weeks were rough. She had to go through ten sessions of physiotherapy to help with the stiffness and discomfort."
  },
  "Objective": {
    "Physical_Exam": "Your neck and back have a full range of movement, and there\u2019s no tenderness or signs of lasting damage. Your muscles and spine seem to be in good condition.",
    "Observations": "Patient appears in good condition, normal mobility."
  },
  "Assessment": {
    "Diagnosis": "Unknown",
    "Severity": "Unknown"
  },
  "Plan": {
    "Treatment": "Continue physiotherapy as needed, painkillers for relief.",
    "Follow-Up": "No plan specified."
  }
}
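
In the output above, the sentiment is "Anxious" while the intent is "Expressing relief". One possible contributor is that all patient utterances are classified as a single concatenated text. A sketch of an alternative is to classify each utterance separately and take a majority vote. `map_label` and `aggregate_sentiment` are hypothetical names, the 0.6 threshold is taken from `sentiment_intent_analysis`, and `toy_classifier` is a keyword stand-in (not a real model) so that the example runs without downloading anything.

```python
from collections import Counter


def map_label(label, score, threshold=0.6):
    """Map an SST-2 style result to the notebook's three sentiment classes."""
    if score > threshold:
        if label == "NEGATIVE":
            return "Anxious"
        if label == "POSITIVE":
            return "Reassured"
    return "Neutral"


def aggregate_sentiment(utterances, classify, threshold=0.6):
    """Classify each utterance separately and return the most common class."""
    votes = Counter()
    for utterance in utterances:
        result = classify(utterance)  # expected shape: {"label": ..., "score": ...}
        votes[map_label(result["label"], result["score"], threshold)] += 1
    return votes.most_common(1)[0][0] if votes else "Neutral"


def toy_classifier(text):
    """Stand-in for the real pipeline: keyword lookup, NOT a real model."""
    lowered = text.lower()
    if "relief" in lowered or "great" in lowered:
        return {"label": "POSITIVE", "score": 0.95}
    if "pain" in lowered or "worry" in lowered:
        return {"label": "NEGATIVE", "score": 0.90}
    return {"label": "POSITIVE", "score": 0.55}


turns = [
    "That's a relief!",
    "That's great to hear.",
    "I could feel pain in my neck.",
    "Yes, I always do.",
]
print(aggregate_sentiment(turns, toy_classifier))
```

To try this with the notebook's model, `classify` could be `lambda u: sentiment_classifier(u)[0]`. Whether per-utterance voting gives a more plausible label for this transcript has not been verified here.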
