# Emittr NLP Pipeline

End-to-end notebook for NER, Summarization, Sentiment & Intent


📘 Emittr NLP Pipeline – Setup

🔧 **Environment**

Python: 3.8+

Recommended: Google Colab or GPU-enabled machine

📦 Install Dependencies
pip install transformers torch datasets accelerate spacy scispacy keybert scikit-learn pandas
python -m spacy download en_core_web_sm

📂 **Required Files**

Upload the following before running training cells:

Combined_Data_with_Intents_5k.csv (for sentiment & intent training)

soap_final_filled_roberta.json (for SOAP note generation)

🚀 **How to Run**

Run cells top-to-bottom in the notebook.

Start with NER / Summarization / QA pipeline (Section 1 & 2).

Train Sentiment + Intent model (Section 3).

Train SOAP note generation model (Section 4).

💾 **Model Outputs**

Fine-tuned models are saved and zipped automatically.

Downloadable artifacts include:

Model weights

Tokenizer

label_map.json (critical for inference)
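
Because `label_map.json` is what ties predicted class indices back to label names, a minimal sketch of writing and reading such a file may help. The file layout (`sentiment` and `intent` keys mapping index strings to names), the helper names, and the example labels are all assumptions for illustration, not the notebook's actual format.

```python
import json
import os
import tempfile

def save_label_map(path, sentiment_labels, intent_labels):
    """Write index -> label mappings for both heads (hypothetical layout)."""
    label_map = {
        "sentiment": {str(i): name for i, name in enumerate(sentiment_labels)},
        "intent": {str(i): name for i, name in enumerate(intent_labels)},
    }
    with open(path, "w", encoding="utf-8") as f:
        json.dump(label_map, f, indent=2)

def load_label_map(path):
    """Read the mapping back, converting JSON string keys to ints."""
    with open(path, encoding="utf-8") as f:
        raw = json.load(f)
    return {head: {int(i): name for i, name in mapping.items()}
            for head, mapping in raw.items()}

# Example labels are placeholders, not the dataset's real label set
path = os.path.join(tempfile.mkdtemp(), "label_map.json")
save_label_map(path, ["Anxious", "Neutral"], ["Reporting symptoms", "Seeking reassurance"])
label_map = load_label_map(path)
print(label_map["sentiment"][1])
```

If the labels come from scikit-learn's `LabelEncoder`, passing `encoder.classes_` in order keeps the indices aligned with what the model was trained on.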

In [None]:
# Install dependencies (run once)
!pip install -q transformers torch spacy scispacy keybert

In [None]:
import re
import json
import torch
import spacy
from transformers import pipeline
from keybert import KeyBERT


## **A. Sample Transcript**
Warm-up for medical NER, summarization, and keyword extraction

In [None]:

transcript = """
Doctor: How are you feeling today?
Patient: I had a car accident. My neck and back hurt a lot for four weeks.
Doctor: Did you receive treatment?
Patient: Yes, I had ten physiotherapy sessions, and now I only have occasional back pain.
"""


## **A1. Medical NER**

In [None]:
def preprocess(text):
    text = text.lower()
    text = re.sub(r"\s+", " ", text)
    return text

clean_text = preprocess(transcript)
clean_text

' doctor: how are you feeling today? patient: i had a car accident. my neck and back hurt a lot for four weeks. doctor: did you receive treatment? patient: yes, i had ten physiotherapy sessions, and now i only have occasional back pain. '

In [None]:

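# Note: Bio_ClinicalBERT has no fine-tuned token-classification head, so the
# classifier layer is randomly initialized (see the warning in the output) and
# the LABEL_0 / LABEL_1 tags below are not meaningful entity types.
ner_pipeline = pipeline(
    "ner",
    model="emilyalsentzer/Bio_ClinicalBERT",
    aggregation_strategy="simple"
)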

ner_pipeline(clean_text)


Some weights of BertForTokenClassification were not initialized from the model checkpoint at emilyalsentzer/Bio_ClinicalBERT and are newly initialized: ['classifier.bias', 'classifier.weight']
You should probably TRAIN this model on a down-stream task to be able to use it for predictions and inference.
Device set to use cpu
Asking to truncate to max_length but no maximum length is provided and the model has no predefined maximum length. Default to no truncation.


[{'entity_group': 'LABEL_1',
  'score': np.float32(0.575819),
  'word': 'doctor : how are you feeling today? patient',
  'start': 1,
  'end': 43},
 {'entity_group': 'LABEL_0',
  'score': np.float32(0.52925885),
  'word': ':',
  'start': 43,
  'end': 44},
 {'entity_group': 'LABEL_1',
  'score': np.float32(0.51768476),
  'word': 'i',
  'start': 45,
  'end': 46},
 {'entity_group': 'LABEL_0',
  'score': np.float32(0.5442349),
  'word': 'had a car',
  'start': 47,
  'end': 56},
 {'entity_group': 'LABEL_1',
  'score': np.float32(0.5604892),
  'word': 'accident. my neck',
  'start': 57,
  'end': 74},
 {'entity_group': 'LABEL_0',
  'score': np.float32(0.6435705),
  'word': 'and',
  'start': 75,
  'end': 78},
 {'entity_group': 'LABEL_1',
  'score': np.float32(0.6128787),
  'word': 'back hurt',
  'start': 79,
  'end': 88},
 {'entity_group': 'LABEL_0',
  'score': np.float32(0.5710225),
  'word': 'a lot for',
  'start': 89,
  'end': 98},
 {'entity_group': 'LABEL_1',
  'score': np.float32(0.5055871
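
The `LABEL_0` / `LABEL_1` entity groups, together with the "newly initialized" warning, suggest the classification head here is untrained. One cheap sanity check before trusting NER output is to look at the label names, which in transformers are available as `model.config.id2label`. The sketch below works on plain dictionaries so it runs without downloading a model; the helper name and the second mapping are illustrative assumptions.

```python
import re

def has_placeholder_labels(id2label):
    """Return True if every label name looks like the default 'LABEL_<n>'."""
    if not id2label:
        return True
    return all(re.fullmatch(r"LABEL_\d+", name) is not None
               for name in id2label.values())

# Mappings shaped like a model config's id2label; the second is illustrative
untrained_head = {0: "LABEL_0", 1: "LABEL_1"}
trained_head = {0: "O", 1: "B-Sign_symptom", 2: "I-Sign_symptom"}

print(has_placeholder_labels(untrained_head))
print(has_placeholder_labels(trained_head))
```

A `True` result is a hint to switch to a checkpoint that was fine-tuned for token classification, or to fine-tune one.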

## A2. Testing Medical Summarization

In [None]:
summarizer = pipeline(
    "summarization",
    model="google/pegasus-pubmed"
)

summarizer(transcript, max_length=50, min_length=40)


Some weights of PegasusForConditionalGeneration were not initialized from the model checkpoint at google/pegasus-pubmed and are newly initialized: ['model.decoder.embed_positions.weight', 'model.encoder.embed_positions.weight']
You should probably TRAIN this model on a down-stream task to be able to use it for predictions and inference.
Device set to use cpu


[{'summary_text': 'this is a case study of a young adult who suffered a car accident with neck and back pain . <n> the pain was treated with physiotherapy , and the patient made a full recovery . <n> this case study illustrates the importance'}]
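
The `<n>` markers in the summary above appear to be the model's newline token showing up in the decoded text, and the tokenizer also leaves spaces before punctuation. A small post-processing step can tidy this; the sketch below is one possible cleanup (the function name is hypothetical), not something the notebook does.

```python
import re

def clean_summary(text):
    """Drop '<n>' markers and tidy spacing around punctuation."""
    text = text.replace("<n>", " ")
    text = re.sub(r"\s+([.,;:!?])", r"\1", text)  # no space before punctuation
    return re.sub(r"\s+", " ", text).strip()      # collapse repeated whitespace

raw = "a case . <n> the pain was treated , and done ."
print(clean_summary(raw))
```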

## A3. Keyword Extraction

In [None]:
kw_model = KeyBERT()
kw_model.extract_keywords(transcript, top_n=5)


[('physiotherapy', 0.4394),
 ('neck', 0.3322),
 ('pain', 0.3229),
 ('patient', 0.3171),
 ('hurt', 0.3056)]
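
The keyword `patient` ranks in the top five, which may partly reflect the `Patient:` speaker labels rather than clinical content. A possible remedy, sketched below under that assumption, is to strip speaker prefixes before keyword extraction. The helper name and the speaker list are hypothetical.

```python
import re

def strip_speaker_labels(transcript, speakers=("Doctor", "Patient", "Physician")):
    """Remove 'Speaker:' prefixes at the start of each line and drop empty lines."""
    alternation = "|".join(re.escape(s) for s in speakers)
    pattern = re.compile(r"^\s*(?:" + alternation + r")\s*:\s*", re.IGNORECASE)
    lines = [pattern.sub("", line) for line in transcript.splitlines()]
    return "\n".join(line for line in lines if line.strip())

sample = "\nDoctor: How are you?\nPatient: My neck hurts.\n"
print(strip_speaker_labels(sample))
```

The cleaned text can then be passed to `kw_model.extract_keywords(...)` in place of the raw transcript.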

# **1. Complete Report Pipeline**

In [23]:
# Install dependencies
!pip install -q transformers torch spacy
!python -m spacy download en_core_web_sm

import json
from transformers import pipeline
import re
import spacy

# --- 1. SETUP TRANSCRIPT ---
transcript = """
Physician: Good morning, Ms. Jones. How are you feeling today?
Patient: Good morning, doctor. I'm doing better, but I still have some discomfort now and then.
Physician: I understand you were in a car accident last September. Can you walk me through what happened?
Patient: Yes, it was on September 1st. ... another car hit me from behind...
Physician: What did you feel immediately after the accident?
Patient: At first, I was just shocked. But then I realized I had hit my head on the steering wheel, and I could feel pain in my neck and back almost right away.
Physician: Did you seek medical attention?
Patient: Yes, I went to Moss Bank Accident and Emergency. They checked me over and said it was a whiplash injury...
Physician: How did things progress after that?
Patient: The first four weeks were rough. My neck and back pain were really bad - I had trouble sleeping and had to take painkillers regularly. It started improving after that, but I had to go through ten sessions of physiotherapy to help with the stiffness and discomfort.
Physician: That makes sense. Are you still experiencing pain now?
Patient: It's not constant, but I do get occasional backaches. It's nothing like before, though.
Physician: Yes, your recovery so far has been quite positive. Given your progress, I'd expect you to make a full recovery within six months of the accident.
"""

print("⏳ Loading Models...")

# A. NER: Detects Body Parts & Symptoms (Transformers-based)
ner_pipeline = pipeline(
    "ner",
    model="d4data/biomedical-ner-all",
    aggregation_strategy="simple"
)

# B. QA: Extracts answers (Transformers-based)
qa_pipeline = pipeline(
    "question-answering",
    model="deepset/roberta-base-squad2"
)

# C. SUMMARIZATION: For capturing complete context (Transformers-based)
summarizer = pipeline(
    "summarization",
    model="facebook/bart-large-cnn"
)

# D. spaCy for enhanced NER and linguistic analysis
nlp = spacy.load("en_core_web_sm")

print("✅ Models loaded successfully!")

# --- 2. EXTRACTION FUNCTIONS ---

def extract_body_parts_from_ner(ner_results, text):
    """
    NER-based body part extraction with regex fallback.
    """
    body_parts = []
    buffer = ""
    last_end = None

    ner_results = sorted(ner_results, key=lambda x: x['start'])

    for ent in ner_results:
        word = ent['word'].replace("##", "")
        label = ent['entity_group']
        start = ent['start']

        if label == "Biological_structure":
            if buffer and last_end is not None and (start - last_end) <= 1:
                # Adjacent pieces: keep a space when the source text had one
                buffer += (" " if start > last_end else "") + word
            else:
                if buffer:
                    body_parts.append(buffer.strip().lower())
                buffer = word
            last_end = ent['end']
        else:
            if buffer:
                body_parts.append(buffer.strip().lower())
                buffer = ""
                last_end = None

    if buffer:
        body_parts.append(buffer.strip().lower())

    # Fallback: regex for common body parts NER might miss
    common_body_parts = [
        'head', 'neck', 'back', 'shoulder', 'arm', 'elbow', 'wrist', 'hand', 'finger',
        'chest', 'abdomen', 'hip', 'leg', 'knee', 'ankle', 'foot', 'toe',
        'spine', 'jaw', 'face', 'eye', 'ear', 'nose', 'mouth', 'tooth', 'teeth'
    ]

    text_lower = text.lower()
    for body_part in common_body_parts:
        if re.search(rf'\b{body_part}\b', text_lower):
            if body_part not in body_parts:
                body_parts.append(body_part)

    return list(set(body_parts))


def extract_symptoms_with_context(text, body_parts):
    """
    Symptom extraction using NER body parts + keyword patterns.
    """
    symptoms = []
    text_lower = text.lower()

    impact_keywords = ['hit', 'struck', 'banged', 'bump', 'bumped', 'impact']
    fracture_keywords = ['break', 'broken', 'fracture', 'fractured']
    injury_keywords = ['injury', 'injured', 'damage', 'damaged', 'strain', 'sprain']
    pain_keywords = ['pain', 'ache', 'aching', 'hurt', 'hurting', 'sore', 'soreness']
    other_keywords = ['stiff', 'stiffness', 'discomfort', 'bruise', 'bruised', 'swelling', 'swollen']

    symptom_groups = [
        (impact_keywords, 'impact'),
        (fracture_keywords, 'fracture'),
        (injury_keywords, 'injury'),
        (pain_keywords, 'pain'),
        (other_keywords, None)
    ]

    for body_part in body_parts:
        found = False
        for keyword_list, normalized_name in symptom_groups:
            if found:
                break
            for keyword in keyword_list:
                pattern_verb = re.compile(rf'\b(?:hit|struck|banged|broke|fractured|injured|hurt|damaged)\s+(?:my|the|your)?\s*{body_part}\b', re.IGNORECASE)
                pattern1 = re.compile(rf'\b{body_part}\s+(?:and\s+\w+\s+)?{keyword}\b', re.IGNORECASE)
                pattern2 = re.compile(rf'\b{keyword}\s+(?:in|to|at|on)\s+(?:my|the|your)?\s*{body_part}\b', re.IGNORECASE)

                if pattern_verb.search(text) and keyword in impact_keywords + fracture_keywords + injury_keywords:
                    symptom_name = normalized_name if normalized_name else keyword
                    symptom = f"{body_part.capitalize()} {symptom_name}"
                    if symptom not in symptoms:
                        symptoms.append(symptom)
                    found = True
                    break
                elif pattern1.search(text) or pattern2.search(text):
                    symptom_name = normalized_name if normalized_name else keyword
                    symptom = f"{body_part.capitalize()} {symptom_name}"
                    if symptom not in symptoms:
                        symptoms.append(symptom)
                    found = True
                    break

    return symptoms


def extract_patient_name(context):
    """
    Patient name extraction using regex + QA fallback.
    """
    # Priority 1: Explicit titles (Ms., Mr., Mrs.)
    explicit_name_pattern = r'(?:Ms\.|Mr\.|Mrs\.)\s+[A-Z][a-z]+(?:\s+[A-Z][a-z]+)*'
    match = re.search(explicit_name_pattern, context)
    if match:
        return match.group(0).strip()

    # Priority 2: QA-based extraction
    questions = [
        "What is the patient's name?",
        "Who is the patient?",
    ]

    best_name = "Unknown"
    best_score = 0.0

    for q in questions:
        res = qa_pipeline(question=q, context=context)
        name = res["answer"].strip()
        score = res["score"]

        if score > best_score and 1 < len(name.split()) <= 3:
            if not any(phrase in name.lower() for phrase in ["i'm doing", "constant", "unknown", "good morning", "yes", "doctor", "patient"]):
                best_score = score
                best_name = name

    return best_name


def extract_treatment_with_summarization(context):
    """
    Treatment extraction using regex patterns plus QA, with deduplication.
    Extracts complete treatment phrases and removes fragments/duplicates.
    Despite the function name, no summarizer call is made here.
    """
    treatments = []

    # Step 1: Primary extraction - look for complete quantified phrases FIRST
    # Pattern: "X sessions of Y" or "take/use Z"

    # Pattern 1: Quantified treatment (e.g., "ten sessions of physiotherapy")
    quantified_pattern = r'(\d+|one|two|three|four|five|six|seven|eight|nine|ten|eleven|twelve|fifteen|twenty)\s+sessions?\s+of\s+(physiotherapy|therapy|treatment)'
    matches = re.finditer(quantified_pattern, context, re.IGNORECASE)
    for match in matches:
        treatments.append(match.group(0).capitalize())

    # Pattern 2: Medication mentions (e.g., "take painkillers")
    medication_pattern = r'\b(?:take|taking|took|use|using|used)\s+(painkillers?|medication|medicine|pills?|drugs?)\b'
    matches = re.finditer(medication_pattern, context, re.IGNORECASE)
    for match in matches:
        # Extract just the medication name
        med_match = re.search(r'(painkillers?|medication|medicine|pills?|drugs?)', match.group(0), re.IGNORECASE)
        if med_match:
            treatments.append(med_match.group(0).capitalize())

    # Step 2: QA-based extraction for anything we missed
    qa_questions = [
        "What therapy or treatment did the patient receive?",
        "What medication did the patient take?"
    ]

    for question in qa_questions:
        result = qa_pipeline(question=question, context=context)
        if result['score'] > 0.2:
            answer = result['answer'].strip()
            # Only add if it's a reasonable length and not already captured
            if 1 <= len(answer.split()) <= 6:
                treatments.append(answer.capitalize())

    # Step 3: Deduplicate, preferring complete phrases over fragments.
    # Sort longest-first so fragments are compared against fuller phrases.
    treatments_sorted = sorted(set(treatments), key=lambda x: len(x.split()), reverse=True)

    # Single words that are too generic to report on their own
    generic_words = {'therapy', 'sessions', 'session', 'physiotherapy', 'ten', 'treatment'}

    final_treatments = []
    for treatment in treatments_sorted:
        treatment_lower = treatment.lower()
        words = treatment_lower.split()

        # Skip fragments of a treatment that has already been kept
        is_fragment = any(
            treatment_lower in added.lower() or all(word in added.lower() for word in words)
            for added in final_treatments
        )
        if is_fragment:
            continue

        # Skip lone generic words such as "therapy" or "ten"
        if len(words) == 1 and treatment_lower in generic_words:
            continue

        final_treatments.append(treatment)

    return final_treatments if final_treatments else ["Unknown"]


def extract_current_status_with_summarization(context):
    """
    Current status extraction focusing on concise symptom descriptions.
    Priority: modifier + symptom pattern (e.g., "occasional backaches")
    """
    # Step 1: Look for the BEST pattern - modifier + symptom
    # This should match "occasional backaches", "constant pain", "frequent headaches"
    modifier_symptom_pattern = r'\b(occasional|constant|frequent|mild|severe|persistent|intermittent|rare|no)\s+([\w]+(?:ache|pain|discomfort|symptom|issue)s?)\b'

    doc = nlp(context)
    best_status = ""
    best_score = 0

    # Look for current status indicators
    current_indicators = ['still', 'now', 'currently', 'at present', 'these days', 'experiencing']

    for sent in doc.sents:
        sent_text = sent.text
        sent_lower = sent_text.lower()

        # Check if this sentence is about current status
        is_current = any(indicator in sent_lower for indicator in current_indicators)

        # Also check if it's a response to "Are you still experiencing pain now?"
        is_status_response = 'experiencing' in sent_lower or ('still' in sent_lower and any(word in sent_lower for word in ['pain', 'ache', 'symptom', 'discomfort']))

        if is_current or is_status_response:
            # Look for modifier + symptom pattern
            match = re.search(modifier_symptom_pattern, sent_text, re.IGNORECASE)
            if match:
                status = match.group(0).strip()
                # This is exactly what we want!
                return status.capitalize()

    # Step 2: If no modifier+symptom found, look for direct statements about current symptoms
    # Pattern: "I get X" or "I have X"
    direct_pattern = r'\b(?:get|have|experience)\s+(occasional|constant|frequent|mild|severe)?\s*(\w+(?:ache|pain|discomfort)s?)\b'

    for sent in doc.sents:
        sent_lower = sent.text.lower()
        if any(indicator in sent_lower for indicator in ['still', 'now', 'get', 'have']):
            match = re.search(direct_pattern, sent.text, re.IGNORECASE)
            if match:
                # Extract the symptom description
                full_match = match.group(0)
                # Clean it up - remove "get", "have", etc.
                symptom = re.sub(r'\b(?:get|have|experience)\s+', '', full_match, flags=re.IGNORECASE).strip()
                if symptom:
                    return symptom.capitalize()

    # Step 3: Use QA as fallback, but extract only the key symptom phrase
    qa_result = qa_pipeline(
        question="What pain does the patient currently have?",
        context=context
    )

    if qa_result['score'] > 0.3:
        answer = qa_result['answer'].strip()

        # If QA gives us a long answer, try to extract just the symptom
        if len(answer.split()) > 5:
            # Look for modifier + symptom within the answer
            match = re.search(modifier_symptom_pattern, answer, re.IGNORECASE)
            if match:
                return match.group(0).strip().capitalize()

            # Otherwise, try to extract just the symptom word
            symptom_match = re.search(r'\b(\w+(?:ache|pain|discomfort)s?)\b', answer, re.IGNORECASE)
            if symptom_match:
                return symptom_match.group(0).capitalize()
        else:
            return answer.capitalize()

    # Step 4: Use summarization ONLY as last resort
    try:
        status_sentences = [sent.text for sent in doc.sents
                          if any(word in sent.text.lower() for word in current_indicators + ['pain', 'ache', 'symptom'])]

        if status_sentences:
            # Focus on sentences about current state
            current_sentences = [s for s in status_sentences if any(ind in s.lower() for ind in ['still', 'now', 'currently', 'get'])]

            if current_sentences:
                status_text = ' '.join(current_sentences[:2])  # Max 2 sentences

                if len(status_text.split()) > 15:
                    summary = summarizer(status_text, max_length=20, min_length=5, do_sample=False)
                    summary_text = summary[0]['summary_text']

                    # Extract symptom from summary
                    match = re.search(modifier_symptom_pattern, summary_text, re.IGNORECASE)
                    if match:
                        return match.group(0).strip().capitalize()

                    return summary_text.capitalize()
    except Exception:
        # Summarization is a best-effort last resort; fall through to "Unknown"
        pass

    return "Unknown"


def extract_prognosis_with_summarization(context):
    """
    Prognosis extraction using QA + pattern matching + summarization.
    """
    # Step 1: QA extraction
    qa_questions = [
        "What is the doctor's prediction for recovery?",
        "What is the expected outcome for the patient?",
        "When will the patient fully recover?"
    ]

    prognosis_candidates = []

    for question in qa_questions:
        result = qa_pipeline(question=question, context=context)
        if result['score'] > 0.1:
            prognosis_candidates.append((result['answer'].strip(), result['score']))

    # Step 2: Pattern-based extraction for recovery predictions
    doc = nlp(context)
    recovery_pattern = r'((?:full|complete|total)\s+recovery\s+(?:expected\s+)?(?:within|in|by)\s+[\w\s]+(?:months?|weeks?|years?))'

    for sent in doc.sents:
        if any(word in sent.text.lower() for word in ['expect', 'recovery', 'prognosis', 'predict']):
            match = re.search(recovery_pattern, sent.text, re.IGNORECASE)
            if match:
                prognosis_candidates.append((match.group(0), 0.9))
            elif 5 <= len(sent.text.split()) <= 20:
                prognosis_candidates.append((sent.text, 0.7))

    # Step 3: Use summarization for long prognosis descriptions
    try:
        prognosis_sentences = [sent.text for sent in doc.sents if any(word in sent.text.lower() for word in ['expect', 'recovery', 'prognosis'])]
        if prognosis_sentences and len(' '.join(prognosis_sentences).split()) > 25:
            summary = summarizer(' '.join(prognosis_sentences), max_length=40, min_length=10, do_sample=False)
            prognosis_candidates.append((summary[0]['summary_text'], 0.85))
    except Exception:
        # Summarization is optional here; keep the QA and regex candidates
        pass

    if not prognosis_candidates:
        return "Unknown"

    # Select best prognosis
    prognosis_candidates.sort(key=lambda x: (x[1], len(x[0].split())), reverse=True)
    best_prognosis = prognosis_candidates[0][0]

    # Clean up
    best_prognosis = re.sub(r'^(Given|Based on|Considering).*?,\s*', '', best_prognosis, flags=re.IGNORECASE)
    best_prognosis = best_prognosis.strip().capitalize()

    return best_prognosis


def extract_diagnosis(context):
    """
    Diagnosis extraction using QA + pattern matching + NER.
    """
    # Step 1: QA extraction
    qa_questions = [
        "What diagnosis was given?",
        "What medical condition was identified?",
        "What did the doctors say was wrong?"
    ]

    best_diagnosis = ""
    best_score = 0.0

    for question in qa_questions:
        result = qa_pipeline(question=question, context=context)
        if result['score'] > best_score:
            best_score = result['score']
            best_diagnosis = result['answer'].strip()

    # Step 2: Pattern-based extraction
    diagnosis_patterns = [
        r'said it was (?:a\s+)?([a-zA-Z\s]+injury|[a-zA-Z\s]+syndrome|[a-zA-Z\s]+disorder)',
        r'diagnosed (?:with|as)\s+(?:a\s+)?([a-zA-Z\s]+injury|[a-zA-Z\s]+syndrome|[a-zA-Z\s]+disorder)',
        r'diagnosis (?:of|was)\s+(?:a\s+)?([a-zA-Z\s]+injury|[a-zA-Z\s]+syndrome|[a-zA-Z\s]+disorder)'
    ]

    for pattern in diagnosis_patterns:
        match = re.search(pattern, context, re.IGNORECASE)
        if match:
            diagnosis = match.group(1).strip()
            if len(diagnosis.split()) >= len(best_diagnosis.split()):
                best_diagnosis = diagnosis

    # Capitalize properly
    best_diagnosis = best_diagnosis.capitalize()

    return best_diagnosis if best_diagnosis else "Unknown"


# --- 3. EXECUTION ---
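print("\n🚀 Running Medical Transcript Extraction Pipeline...")
print("="*60)

# Step 1: NER for body parts and symptoms
print("\n📍 Step 1: Named Entity Recognition (NER)")
ner_raw = ner_pipeline(transcript)
body_parts_detected = extract_body_parts_from_ner(ner_raw, transcript)
print(f"   Body parts detected: {body_parts_detected}")

symptoms = extract_symptoms_with_context(transcript, body_parts_detected)
print(f"   Symptoms extracted: {symptoms}")

# Step 2: Remaining fields. These calls define the variables used in the
# final JSON; their order follows the printed output of this cell.
print("\n📍 Step 2: Information Extraction (QA + Summarization + Keywords)")
patient_name = extract_patient_name(transcript)
print(f"   ✓ Patient Name: {patient_name}")
treatment = extract_treatment_with_summarization(transcript)
print(f"   ✓ Treatment: {treatment}")
current_status = extract_current_status_with_summarization(transcript)
print(f"   ✓ Current Status: {current_status}")
prognosis = extract_prognosis_with_summarization(transcript)
print(f"   ✓ Prognosis: {prognosis}")
diagnosis = extract_diagnosis(transcript)
print(f"   ✓ Diagnosis: {diagnosis}")

# --- 4. FINAL OUTPUT ---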
final_json = {
    "Patient_Name": patient_name,
    "Symptoms": sorted(list(set(symptoms))),
    "Diagnosis": diagnosis,
    "Treatment": treatment,
    "Current_Status": current_status,
    "Prognosis": prognosis
}

print("\n" + "="*60)
print("📋 FINAL MEDICAL REPORT (JSON)")
print("="*60)
print(json.dumps(final_json, indent=2))
print("="*60)



⏳ Loading Models...


Device set to use cuda:0




✅ Models loaded successfully!

🚀 Running Medical Transcript Extraction Pipeline...

📍 Step 1: Named Entity Recognition (NER)
   Body parts detected: ['neck', 'head', 'back']
   Symptoms extracted: ['Neck pain', 'Head impact', 'Back pain']

📍 Step 2: Information Extraction (QA + Summarization + Keywords)
   ✓ Patient Name: Ms. Jones
   ✓ Treatment: ['Ten sessions of physiotherapy', 'Painkillers']


Your max_length is set to 40, but your input_length is only 34. Since this is a summarization task, where outputs shorter than the input are typically wanted, you might consider decreasing max_length manually, e.g. summarizer('...', max_length=17)


   ✓ Current Status: Occasional backaches
   ✓ Prognosis: Full recovery within six months
   ✓ Diagnosis: Whiplash injury

📋 FINAL MEDICAL REPORT (JSON)
{
  "Patient_Name": "Ms. Jones",
  "Symptoms": [
    "Back pain",
    "Head impact",
    "Neck pain"
  ],
  "Diagnosis": "Whiplash injury",
  "Treatment": [
    "Ten sessions of physiotherapy",
    "Painkillers"
  ],
  "Current_Status": "Occasional backaches",
  "Prognosis": "Full recovery within six months"
}

✅ Requirements Check:
   1. NER (Named Entity Recognition): ✓ Using transformers + spaCy
   2. Text Summarization: ✓ Using BART for complex extractions
   3. Keyword Extraction: ✓ Using NER entities + regex patterns
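
Since each extractor falls back to `"Unknown"` when it finds nothing, a quick structural check on the final report can flag incomplete results before they are passed on. The sketch below assumes the six field names and types seen in the printed JSON; the `validate_report` helper is hypothetical and not part of the notebook.

```python
REQUIRED_FIELDS = {
    "Patient_Name": str, "Symptoms": list, "Diagnosis": str,
    "Treatment": list, "Current_Status": str, "Prognosis": str,
}

def validate_report(report):
    """Return a list of problems; an empty list means the report looks complete."""
    problems = []
    for field, expected_type in REQUIRED_FIELDS.items():
        if field not in report:
            problems.append(f"missing field: {field}")
        elif not isinstance(report[field], expected_type):
            problems.append(f"{field} should be {expected_type.__name__}")
        elif report[field] in ("Unknown", ["Unknown"], "", []):
            problems.append(f"{field} was not extracted")
    return problems

report = {
    "Patient_Name": "Ms. Jones",
    "Symptoms": ["Back pain", "Head impact", "Neck pain"],
    "Diagnosis": "Whiplash injury",
    "Treatment": ["Ten sessions of physiotherapy", "Painkillers"],
    "Current_Status": "Occasional backaches",
    "Prognosis": "Full recovery within six months",
}
print(validate_report(report))
```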


**Summary**

✅ 1. NER (Named Entity Recognition)

- Uses d4data/biomedical-ner-all (transformers)
- Uses spaCy for sentence splitting and additional linguistic analysis
- Extracts: symptoms, treatment entities, diagnosis patterns

✅ 2. Text Summarization

- Uses facebook/bart-large-cnn (transformers)
- Called in extract_current_status_with_summarization() and extract_prognosis_with_summarization(); extract_treatment_with_summarization() relies on regex and QA only
- Summarizes long status and prognosis descriptions

✅ 3. Keyword Extraction

- Uses NER entities as keywords (medical phrases like "whiplash injury", "physiotherapy sessions")
- Regex patterns for treatment/diagnosis keywords
- Context-window based extraction

**📍 Questions:**

- How would you handle **ambiguous or missing medical data** in the transcript?


The `EVIDENCE_LOG` dictionary tracks how each field was extracted (e.g., NER, Regex, QA, Summarization). If a field cannot be extracted from the transcript, the extraction function returns "Unknown" for the final JSON and the evidence log records "Not found", so missing data stays traceable instead of being silently filled in.

```
EVIDENCE_LOG = {}

# Example: Extract patient name
patient_name = extract_patient_name(transcript)
if patient_name == "Unknown":
    EVIDENCE_LOG["Patient_Name"] = "Not found"
else:
    EVIDENCE_LOG["Patient_Name"] = "Regex + QA"

# Example: Extract treatment (this function uses regex and QA, not the summarizer)
treatment = extract_treatment_with_summarization(transcript)
EVIDENCE_LOG["Treatment"] = "Regex + QA" if treatment != ["Unknown"] else "Not found"

print(EVIDENCE_LOG)
```
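
The same idea can be wrapped in a small helper so that every field is logged the same way. This is a sketch only: `run_with_evidence` and the stand-in extractors are hypothetical, and the stand-ins exist just so the example runs without loading any models.

```python
def run_with_evidence(field, extractor, method, text, evidence_log):
    """Call extractor(text) and record how (or whether) the field was found."""
    value = extractor(text)
    missing = value in ("Unknown", ["Unknown"], "", [], None)
    evidence_log[field] = "Not found" if missing else method
    return value

# Stand-in extractors so the sketch runs without any models
def fake_name_extractor(text):
    return "Ms. Jones" if "Ms. Jones" in text else "Unknown"

def fake_diagnosis_extractor(text):
    return "Unknown"

evidence_log = {}
text = "Physician: Good morning, Ms. Jones."
name = run_with_evidence("Patient_Name", fake_name_extractor, "Regex + QA", text, evidence_log)
diagnosis = run_with_evidence("Diagnosis", fake_diagnosis_extractor, "QA + Regex", text, evidence_log)
print(evidence_log)
```

In the notebook, the real functions such as `extract_patient_name` would take the place of the stand-ins.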


- What **pre-trained NLP models** would you use for medical summarization?

`model="facebook/bart-large-cnn"`

The report pipeline uses facebook/bart-large-cnn (transformers) for text summarization. It is called in `extract_current_status_with_summarization()` and `extract_prognosis_with_summarization()` to condense long status and prognosis descriptions. The warm-up in Section A2 also tries `google/pegasus-pubmed`.
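
The run above logs a warning that `max_length` is set to 40 while the input is only 34 tokens long. One way to avoid this is to derive the length bounds from the input size before calling the summarizer. The helper below is a sketch: its name, the 0.5 ratio (which mirrors the value suggested in the warning), and the floor and cap defaults are assumptions.

```python
def summary_length_bounds(n_input_tokens, ratio=0.5, floor=5, cap=40):
    """Pick (min_length, max_length) so the summary is shorter than the input."""
    max_len = max(floor, min(cap, int(n_input_tokens * ratio)))
    min_len = min(floor, max_len)
    return min_len, max_len

print(summary_length_bounds(34))
print(summary_length_bounds(200))
```

The two values can then be passed as `min_length` and `max_length` to the summarization pipeline call.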












# **2. Sentiment Analysis**

Use this [dataset](https://github.com/kshitijdalvi4/sentiment_intent/blob/main/Combined_Data_with_Intents_5k.csv) for training.
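
Before training, it can help to confirm that the CSV has the columns the loading cell expects. That cell accepts either `text`, `sentiment`, `intent` or the original headers `statement`, `medical_sentiment`, `patient_intent`. The check below is a sketch using only the standard library; the `missing_columns` helper is hypothetical.

```python
import csv
import io

# Accepted header names per logical column, following the renaming in the loading cell
COLUMN_ALIASES = {
    "text": {"text", "statement"},
    "sentiment": {"sentiment", "medical_sentiment"},
    "intent": {"intent", "patient_intent"},
}

def missing_columns(csv_file):
    """Return the logical columns that have no matching header in the CSV."""
    header = next(csv.reader(csv_file), [])
    present = {name.strip() for name in header}
    return sorted(col for col, aliases in COLUMN_ALIASES.items()
                  if not (aliases & present))

sample = io.StringIO("statement,medical_sentiment\nI feel dizzy,Anxious\n")
print(missing_columns(sample))
```

For a real file, open it with `open(path, newline="", encoding="utf-8")` and pass the file object to `missing_columns`.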

In [None]:
# --- 1. INSTALLS & IMPORTS ---
import subprocess
import sys
import os

# Install dependencies (if not already installed)
subprocess.check_call([sys.executable, "-m", "pip", "install", "-q", "transformers", "datasets", "torch", "accelerate", "scikit-learn", "pandas"])

import torch
import torch.nn as nn
import pandas as pd
import numpy as np
from datasets import Dataset, load_dataset
from transformers import (
    AutoTokenizer,
    AutoModel,
    Trainer,
    TrainingArguments
)
from sklearn.preprocessing import LabelEncoder

# --- 2. CONFIGURATION & FILE SETUP ---
MODEL_NAME = "emilyalsentzer/Bio_ClinicalBERT"

# Check which file exists (prioritizing the 5k one if you made it, otherwise the full one)
if os.path.exists("Combined_Data_with_Intents_5k.csv"):
    CSV_FILE = "Combined_Data_with_Intents_5k.csv"
elif os.path.exists("Combined_Data_with_Intents.csv"):
    CSV_FILE = "Combined_Data_with_Intents.csv"
elif os.path.exists("Combined_Data_with_Medical_Sentiments.csv"):
    # Fallback if you haven't run the intent script yet
    CSV_FILE = "Combined_Data_with_Medical_Sentiments.csv"
else:
    raise FileNotFoundError("Could not find a dataset CSV. Upload 'Combined_Data_with_Intents_5k.csv' (or 'Combined_Data_with_Intents.csv') before running this cell.")

print(f"📂 Using dataset: {CSV_FILE}")

# --- 3. LOAD DATA & DEFINE LABELS DYNAMICALLY ---
# We read the CSV first to find out exactly which labels you have
df = pd.read_csv(CSV_FILE)

# Map your specific column names to standard ones
# Adjust 'statement', 'medical_sentiment', 'patient_intent' if your headers are different
if 'statement' in df.columns:
    df = df.rename(columns={'statement': 'text'})
if 'medical_sentiment' in df.columns:
    df = df.rename(columns={'medical_sentiment': 'sentiment'})
if 'patient_intent' in df.columns:
    df = df.rename(columns={'patient_intent': 'intent'})

# Handle missing values
df = df.dropna(subset=['text', 'sentiment', 'intent'])

# Get unique labels from YOUR data
SENTIMENT_LABELS = sorted(list(df['sentiment'].unique()))
INTENT_LABELS = sorted(list(df['intent'].unique()))

print(f"✅ Found {len(SENTIMENT_LABELS)} Sentiments: {SENTIMENT_LABELS}")
print(f"✅ Found {len(INTENT_LABELS)} Intents: {INTENT_LABELS}")

# Create Encoders
sentiment_encoder = LabelEncoder().fit(SENTIMENT_LABELS)
intent_encoder = LabelEncoder().fit(INTENT_LABELS)

tokenizer = AutoTokenizer.from_pretrained(MODEL_NAME)

# --- 4. CUSTOM MODEL ARCHITECTURE ---
class ClinicalSentimentIntentModel(nn.Module):
    def __init__(self, model_name, num_sentiments, num_intents):
        super().__init__()
        self.encoder = AutoModel.from_pretrained(model_name)
        hidden_size = self.encoder.config.hidden_size

        self.sentiment_head = nn.Linear(hidden_size, num_sentiments)
        self.intent_head = nn.Linear(hidden_size, num_intents)

    def forward(self, input_ids, attention_mask, labels=None, **kwargs):
        outputs = self.encoder(input_ids=input_ids, attention_mask=attention_mask)
        pooled = outputs.last_hidden_state[:, 0]  # CLS token

        sentiment_logits = self.sentiment_head(pooled)
        intent_logits = self.intent_head(pooled)

        loss = None
        if labels is not None:
            sentiment_labels, intent_labels = labels
            loss_fn = nn.CrossEntropyLoss()
            # Calculate loss for both heads
            loss = loss_fn(sentiment_logits, sentiment_labels) + loss_fn(intent_logits, intent_labels)

        return {
            "loss": loss,
            "logits": (sentiment_logits, intent_logits)
        }

# --- 5. DATA PREPARATION FUNCTIONS ---

def process_fn(example):
    """Tokenizes text and encodes labels."""
    tokens = tokenizer(
        str(example["text"]), # Ensure string
        truncation=True,
        padding="max_length",
        max_length=64
    )

    return {
        "input_ids": tokens["input_ids"],
        "attention_mask": tokens["attention_mask"],
        "sentiment_label": sentiment_encoder.transform([example["sentiment"]])[0],
        "intent_label": intent_encoder.transform([example["intent"]])[0]
    }

def prepare_csv_dataset(dataframe):
    print("⏳ Processing CSV dataset...")
    # Convert Pandas DataFrame to Hugging Face Dataset
    dataset = Dataset.from_pandas(dataframe)
    # Apply tokenization
    return dataset.map(process_fn)

def prepare_stage1_data():
    """Optional: Warm-up on general emotions (GoEmotions)"""
    print("⏳ Loading GoEmotions (General Warm-up)...")
    dataset = load_dataset("go_emotions", split="train[:200]")

    def process_go_emotions(example):
        # Placeholder mapping: every GoEmotions row is assigned the first
        # sentiment label and the first intent label (index 0 of each sorted list).
        # This is just for warm-up, so a rough mapping is acceptable.
        # Tokenize once and reuse the result for both fields.
        tokens = tokenizer(example["text"], truncation=True, padding="max_length", max_length=64)
        return {
            "input_ids": tokens["input_ids"],
            "attention_mask": tokens["attention_mask"],
            "sentiment_label": sentiment_encoder.transform([SENTIMENT_LABELS[0]])[0], # Dummy map
            "intent_label": intent_encoder.transform([INTENT_LABELS[0]])[0] # Dummy map
        }
    return dataset.map(process_go_emotions, remove_columns=dataset.column_names)

# --- 6. TRAINING SETUP ---

def collate_fn(batch):
    return {
        "input_ids": torch.tensor([x["input_ids"] for x in batch]),
        "attention_mask": torch.tensor([x["attention_mask"] for x in batch]),
        "labels": (
            torch.tensor([x["sentiment_label"] for x in batch], dtype=torch.long),
            torch.tensor([x["intent_label"] for x in batch], dtype=torch.long)
        )
    }

device = "cuda" if torch.cuda.is_available() else "cpu"
print(f"🚀 Initializing Model on {device}...")

model = ClinicalSentimentIntentModel(MODEL_NAME, len(SENTIMENT_LABELS), len(INTENT_LABELS))
model.to(device)

training_args = TrainingArguments(
    output_dir="./results_custom",
    num_train_epochs=3,              # Increased epochs for better learning
    per_device_train_batch_size=16,  # T4 can handle 16-32
    logging_steps=50,
    save_strategy="epoch",
    report_to="none",
    remove_unused_columns=False      # Critical for custom collator
)

trainer = Trainer(
    model=model,
    args=training_args,
    data_collator=collate_fn
)

# --- 7. EXECUTION PHASE ---

# Step 1: Prepare Your Data
train_dataset = prepare_csv_dataset(df)

# Split into Train/Test (Optional but recommended)
dataset_split = train_dataset.train_test_split(test_size=0.1)
trainer.train_dataset = dataset_split["train"]
trainer.eval_dataset = dataset_split["test"]

# Step 2: Train
print("\n--- 🏁 Starting Training on Your Custom Data ---")
trainer.train()

# --- 8. INFERENCE FUNCTION ---

def analyze_patient(text):
    model.eval()
    inputs = tokenizer(text, return_tensors="pt", truncation=True, padding=True)
    inputs = {k: v.to(device) for k, v in inputs.items()}

    with torch.no_grad():
        outputs = model(**inputs)
        sent_logits, intent_logits = outputs['logits']

    sent_pred = torch.argmax(sent_logits, dim=1).item()
    intent_pred = torch.argmax(intent_logits, dim=1).item()

    return {
        "Input": text,
        "Predicted Sentiment": sentiment_encoder.inverse_transform([sent_pred])[0],
        "Predicted Intent": intent_encoder.inverse_transform([intent_pred])[0]
    }

# Test it
print("\n--- 🧪 Testing Model ---")
test_samples = [
    "I am really scared about the surgery results.",
    "The pain has gone down significantly, thank you.",
    "Can you tell me if this medication has side effects?"
]

for t in test_samples:
    print(analyze_patient(t))

📂 Using dataset: Combined_Data_with_Intents_5k.csv
✅ Found 5 Sentiments: ['Anxiety', 'Bipolar', 'Depression', 'Neutral', 'Reassured']
✅ Found 6 Intents: ['Asking Medical Questions', 'Describing History', 'Expressing Concern', 'Expressing Gratitude', 'Reporting Symptoms', 'Seeking Assurance']

🚀 Initializing Model on cpu...

⏳ Processing CSV dataset...

--- 🏁 Starting Training on Your Custom Data ---

Step,Training Loss
50,2.4607
100,2.1248
150,2.1682
200,2.0482
250,1.9153
300,1.7805




**Use the Model Weights or a Pre-downloaded Model**

In [None]:
import json
import shutil
from google.colab import files

# 1. Define where to save
SAVE_PATH = "./sentiment_intent_model"

# 2. Save Model & Tokenizer
print(f"💾 Saving model to {SAVE_PATH}...")
trainer.save_model(SAVE_PATH)
tokenizer.save_pretrained(SAVE_PATH)

# 3. Save Label Mappings (CRITICAL STEP)
# We need to know what "0" means later!
label_map = {
    "sentiment_map": {str(i): label for i, label in enumerate(sentiment_encoder.classes_)},
    "intent_map": {str(i): label for i, label in enumerate(intent_encoder.classes_)}
}

with open(f"{SAVE_PATH}/label_map.json", "w") as f:
    json.dump(label_map, f)

# 4. Zip and Download
print("📦 Zipping folder...")
shutil.make_archive("sentiment_intent_model_pack", 'zip', SAVE_PATH)

print("⬇️ Downloading...")
files.download("sentiment_intent_model_pack.zip")
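
The label map matters because the model only outputs class indices. Below is a minimal sketch of the round trip, using the sentiment names from the training log above and a hypothetical helper `decode` that is not part of the notebook. JSON object keys are always strings, which is why the lookup casts the index with `str()`.

```python
import json

# Label names copied from the training log; the list is sorted, which is
# the same order LabelEncoder stores in classes_.
sentiments = ["Anxiety", "Bipolar", "Depression", "Neutral", "Reassured"]

# Same construction as the save cell: index -> label, with string keys.
label_map = {"sentiment_map": {str(i): name for i, name in enumerate(sentiments)}}

# Simulate saving and reloading; JSON object keys come back as strings.
restored = json.loads(json.dumps(label_map))

def decode(index, mapping):
    """Hypothetical helper: turn an argmax index into its label name."""
    return mapping[str(index)]

print(decode(3, restored["sentiment_map"]))  # Neutral
```

Looking up `mapping[3]` with an integer key would raise `KeyError` after reloading, so the cast is not optional.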

In [None]:
import torch
import torch.nn as nn
import json
import os
import glob
from transformers import AutoTokenizer, AutoModel

# --- 1. CONFIGURATION ---
ZIP_NAME = "sentiment_intent_model_pack.zip"
EXTRACT_PATH = "./sentiment_intent_model"
BASE_MODEL = "emilyalsentzer/Bio_ClinicalBERT"

# --- 2. FILE SYSTEM FIX ---
print("🔍 Checking file system...")

# A. Unzip if needed
if os.path.exists(ZIP_NAME):
    print(f"📦 Found {ZIP_NAME}. Unzipping...")
    !unzip -q -o {ZIP_NAME} -d {EXTRACT_PATH}
elif not os.path.exists(EXTRACT_PATH):
    print(f"❌ Error: Could not find {ZIP_NAME} or {EXTRACT_PATH} folder!")
    print("👉 Please upload 'sentiment_intent_model_pack.zip' to the files tab on the left.")
    # Stop execution if files are missing
    raise FileNotFoundError("Zip file missing.")

# B. Find the actual weights file (could be inside a subfolder)
# We search recursively for .bin or .safetensors
weight_files = glob.glob(f"{EXTRACT_PATH}/**/pytorch_model.bin", recursive=True)
if not weight_files:
    # Try finding safetensors if .bin is missing
    weight_files = glob.glob(f"{EXTRACT_PATH}/**/model.safetensors", recursive=True)

if not weight_files:
    print(f"📂 Contents of {EXTRACT_PATH}:")
    !ls -R {EXTRACT_PATH}
    raise FileNotFoundError("Could not find 'pytorch_model.bin' or 'model.safetensors'!")

ACTUAL_WEIGHTS_PATH = weight_files[0]
ACTUAL_MODEL_DIR = os.path.dirname(ACTUAL_WEIGHTS_PATH)
print(f"✅ Found weights at: {ACTUAL_WEIGHTS_PATH}")
print(f"📂 Model Directory: {ACTUAL_MODEL_DIR}")

# --- 3. LOAD MAPPINGS ---
# Look for label_map.json in the same folder as the weights
try:
    with open(f"{ACTUAL_MODEL_DIR}/label_map.json", "r") as f:
        maps = json.load(f)
except FileNotFoundError:
    # Fallback: check the root extract path
    with open(f"{EXTRACT_PATH}/label_map.json", "r") as f:
        maps = json.load(f)

SENTIMENT_MAP = maps["sentiment_map"]
INTENT_MAP = maps["intent_map"]

# --- 4. DEFINE CLASS ---
class ClinicalSentimentIntentModel(nn.Module):
    def __init__(self, model_name, num_sentiments, num_intents):
        super().__init__()
        self.encoder = AutoModel.from_pretrained(model_name)
        hidden_size = self.encoder.config.hidden_size
        self.sentiment_head = nn.Linear(hidden_size, num_sentiments)
        self.intent_head = nn.Linear(hidden_size, num_intents)

    def forward(self, input_ids, attention_mask, **kwargs):
        outputs = self.encoder(input_ids=input_ids, attention_mask=attention_mask)
        pooled = outputs.last_hidden_state[:, 0]
        return self.sentiment_head(pooled), self.intent_head(pooled)

# --- 5. LOAD MODEL CORRECTLY ---
print(f"⏳ Loading architecture ({BASE_MODEL})...")
device = "cuda" if torch.cuda.is_available() else "cpu"

model = ClinicalSentimentIntentModel(BASE_MODEL, len(SENTIMENT_MAP), len(INTENT_MAP))

print("⏳ Loading weights from disk...")
# Load weights using the detected path
if ACTUAL_WEIGHTS_PATH.endswith(".safetensors"):
    from safetensors.torch import load_file
    state_dict = load_file(ACTUAL_WEIGHTS_PATH)
else:
    state_dict = torch.load(ACTUAL_WEIGHTS_PATH, map_location=device)

model.load_state_dict(state_dict)
model.to(device)
model.eval()

# Load Tokenizer
try:
    tokenizer = AutoTokenizer.from_pretrained(ACTUAL_MODEL_DIR)
except Exception:
    # Fall back to the base tokenizer if none was saved alongside the weights
    tokenizer = AutoTokenizer.from_pretrained(BASE_MODEL)

# --- 6. TEST ---
def analyze(text):
    inputs = tokenizer(text, return_tensors="pt", truncation=True, padding=True).to(device)
    with torch.no_grad():
        s_logits, i_logits = model(inputs['input_ids'], inputs['attention_mask'])

    return {
        "Sentiment": SENTIMENT_MAP[str(torch.argmax(s_logits).item())],
        "Intent": INTENT_MAP[str(torch.argmax(i_logits).item())]
    }

print("\n--- ✅ SUCCESS! MODEL RELOADED ---")
print(json.dumps(analyze("I am worried about the surgery."), indent=2))

**📍 Questions:**

- **How would you fine-tune BERT for medical sentiment detection?**

We start with Bio_ClinicalBERT, a BERT model pretrained on clinical text, so it already handles medical terminology.

We add two task-specific linear heads on top of the [CLS] token representation: one for sentiment classification and one for patient intent detection.

Our dataset provides supervised labels (sentiment and intent) for each patient statement.

During training, we tokenize the text, pass it through BERT, and sum the cross-entropy losses of the two heads.

Backpropagation updates all BERT weights, adapting the model to medical-specific sentiment patterns.

Result: a fine-tuned model that predicts patient sentiment (Anxiety, Bipolar, Depression, Neutral, Reassured) and patient intent in the medical context.
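
As a sanity check on the joint objective, the sketch below computes the summed cross-entropy for a single example in plain Python, without PyTorch. The helper `cross_entropy` and the logit values are made up for illustration. With 5 sentiment classes and 6 intent classes, a model that predicts uniformly would score ln 5 + ln 6 ≈ 3.40, which is a useful reference point when reading the training loss.

```python
import math

def cross_entropy(logits, target):
    """Cross-entropy of one example: -log(softmax(logits)[target])."""
    peak = max(logits)  # subtract the max for numerical stability
    log_sum_exp = peak + math.log(sum(math.exp(x - peak) for x in logits))
    return log_sum_exp - logits[target]

# Made-up logits for one statement: 5 sentiment classes, 6 intent classes.
sentiment_logits = [0.2, -1.0, 0.1, 2.5, 0.3]
intent_logits = [0.0, 0.4, 1.8, -0.5, 0.9, 0.1]

# The notebook's model adds the two head losses with equal weight.
joint = cross_entropy(sentiment_logits, 3) + cross_entropy(intent_logits, 2)

# Uniform predictions (all-zero logits) give ln(5) + ln(6).
baseline = cross_entropy([0.0] * 5, 0) + cross_entropy([0.0] * 6, 0)
print(round(baseline, 4))  # 3.4012
print(joint < baseline)
```

Weighting the two terms differently is a possible variation, but the notebook uses the plain sum.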


- **What datasets would you use for training a healthcare-specific sentiment model?**

Original dataset for sentiment analysis: https://www.kaggle.com/datasets/suchintikasarkar/sentiment-analysis-for-mental-health/data

Sentiment-intent dataset, with the intent labels generated using RoBERTa: https://github.com/kshitijdalvi4/sentiment_intent/blob/main/Combined_Data_with_Intents_5k.csv


# **3. SOAP**

Use this [dataset](https://github.com/kshitijdalvi4/sentiment_intent/blob/main/soap_final_filled_roberta.json) for dialogue-to-SOAP training.

In [14]:
# --- 1. INSTALLS ---
!pip install -q transformers datasets accelerate rouge_score nltk

import json
import re
import torch
import shutil
import os
from google.colab import files
from datasets import Dataset
from transformers import (
    AutoTokenizer,
    AutoModelForSeq2SeqLM,
    DataCollatorForSeq2Seq,
    Seq2SeqTrainingArguments,
    Seq2SeqTrainer
)

# --- 2. CONFIGURATION ---
INPUT_FILE = "soap_final_filled_roberta.json"
MODEL_NAME = "GanjinZero/biobart-v2-base"
SAVE_DIR = "./final_soap_model"
ZIP_NAME = "soap_model_pack"

# --- 3. PRE-PROCESSING FUNCTIONS ---
def clean_transcript(text):
    text = re.sub(r'\[.*?\]', '', text)
    text = text.replace("Dr.", "Doctor:").replace("Pt.", "Patient:")
    return " ".join(text.split())

def flatten_soap_json(soap_dict):
    text = ""
    sub = soap_dict.get('Subjective', {})
    text += f"<SUBJECTIVE> [CC] {sub.get('Chief_Complaint', 'N/A')} [HPI] {sub.get('History_of_Present_Illness', 'N/A')} "
    obj = soap_dict.get('Objective', {})
    text += f"<OBJECTIVE> [PE] {obj.get('Physical_Exam', 'N/A')} [OBS] {obj.get('Observations', 'N/A')} "
    ass = soap_dict.get('Assessment', {})
    text += f"<ASSESSMENT> [DX] {ass.get('Diagnosis', 'N/A')} [SEV] {ass.get('Severity', 'N/A')} "
    plan = soap_dict.get('Plan', {})
    text += f"<PLAN> [TX] {plan.get('Treatment', 'N/A')} [FU] {plan.get('Follow-Up', 'N/A')}"
    return text.strip()

# --- 4. LOAD FULL DATASET ---
if not os.path.exists(INPUT_FILE):
    raise FileNotFoundError(f"❌ Could not find {INPUT_FILE}. Please make sure it is uploaded!")

print(f"⏳ Loading entire dataset from {INPUT_FILE}...")
with open(INPUT_FILE, 'r') as f:
    raw_data = json.load(f)

inputs = [clean_transcript(x['dialogue']) for x in raw_data]
targets = [flatten_soap_json(x['soap_structured']) for x in raw_data]

# Create one single dataset (No Split)
full_dataset = Dataset.from_dict({"input_text": inputs, "target_text": targets})
print(f"✅ Training on all {len(full_dataset)} examples.")

# --- 5. TOKENIZATION ---
tokenizer = AutoTokenizer.from_pretrained(MODEL_NAME)

def preprocess_function(examples):
    model_inputs = tokenizer(examples["input_text"], max_length=1024, truncation=True)
    # text_target tokenizes the labels; it replaces the deprecated
    # as_target_tokenizer() context manager
    labels = tokenizer(text_target=examples["target_text"], max_length=1024, truncation=True)
    model_inputs["labels"] = labels["input_ids"]
    return model_inputs

tokenized_dataset = full_dataset.map(preprocess_function, batched=True)

# --- 6. TRAINING ---
print("⏳ Initializing Model...")
model = AutoModelForSeq2SeqLM.from_pretrained(MODEL_NAME)
data_collator = DataCollatorForSeq2Seq(tokenizer, model=model)

args = Seq2SeqTrainingArguments(
    output_dir="./biobart_soap_checkpoints",
    eval_strategy="no", # No evaluation, just train
    learning_rate=3e-5,
    per_device_train_batch_size=2,
    weight_decay=0.01,
    save_total_limit=1,
    num_train_epochs=5,
    fp16=torch.cuda.is_available(),
    report_to="none"
)

trainer = Seq2SeqTrainer(
    model,
    args,
    train_dataset=tokenized_dataset, # Use full dataset
    data_collator=data_collator,
    tokenizer=tokenizer,
)

print("🚀 Starting Training...")
trainer.train()

# --- 7. SAVE & ZIP ---
print(f"💾 Saving model to {SAVE_DIR}...")
trainer.save_model(SAVE_DIR)
tokenizer.save_pretrained(SAVE_DIR)

print("📦 Zipping model for download...")
shutil.make_archive(ZIP_NAME, 'zip', SAVE_DIR)

print("⬇️ Downloading zip file...")
#files.download(f"{ZIP_NAME}.zip")

⏳ Loading entire dataset from soap_final_filled_roberta.json...
✅ Training on all 250 examples.

⏳ Initializing Model...

🚀 Starting Training...

Step,Training Loss
500,1.1466

💾 Saving model to ./final_soap_model...
📦 Zipping model for download...
⬇️ Downloading zip file...


**Load the Model or Use a Pre-downloaded Model**

In [15]:
import torch
import re
import os
import json
from transformers import AutoTokenizer, AutoModelForSeq2SeqLM

# --- 1. SETUP ---
ZIP_FILE = "soap_model_pack.zip"
MODEL_DIR = "./my_soap_model"

# Unzip if needed
if not os.path.exists(MODEL_DIR):
    if os.path.exists(ZIP_FILE):
        print("📦 Unzipping model...")
        import zipfile
        with zipfile.ZipFile(ZIP_FILE, 'r') as zip_ref:
            zip_ref.extractall(MODEL_DIR)
    else:
        # Fallback to loading base model if zip is missing (for testing)
        print("⚠️ Zip file not found. Loading base model (untrained) for demo.")
        MODEL_DIR = "GanjinZero/biobart-v2-base"

# --- 2. LOAD MODEL ---
print(f"⏳ Loading model from {MODEL_DIR}...")
device = "cuda" if torch.cuda.is_available() else "cpu"
tokenizer = AutoTokenizer.from_pretrained(MODEL_DIR)
model = AutoModelForSeq2SeqLM.from_pretrained(MODEL_DIR).to(device)

# --- 3. HELPER FUNCTIONS ---
def clean_transcript(text):
    text = re.sub(r'\[.*?\]', '', text)
    text = text.replace("Dr.", "Doctor:").replace("Pt.", "Patient:")
    text = text.replace("Physician:", "Doctor:")
    return " ".join(text.split())

def parse_generated_soap(text):
    def extract(pattern, source):
        match = re.search(pattern, source)
        return match.group(1).strip() if match else "Not detected"

    return {
        "Subjective": {
            "Chief_Complaint": extract(r"\[CC\] (.*?) \[HPI\]", text),
            "History_of_Present_Illness": extract(r"\[HPI\] (.*?) <OBJECTIVE>", text)
        },
        "Objective": {
            "Physical_Exam": extract(r"\[PE\] (.*?) \[OBS\]", text),
            "Observations": extract(r"\[OBS\] (.*?) <ASSESSMENT>", text)
        },
        "Assessment": {
            "Diagnosis": extract(r"\[DX\] (.*?) \[SEV\]", text),
            "Severity": extract(r"\[SEV\] (.*?) <PLAN>", text)
        },
        "Plan": {
            "Treatment": extract(r"\[TX\] (.*?) \[FU\]", text),
            "Follow-Up": extract(r"\[FU\] (.*)", text)
        }
    }

def generate_soap(transcript):
    clean_text = clean_transcript(transcript)
    inputs = tokenizer(clean_text, return_tensors="pt", max_length=1024, truncation=True).to(device)

    with torch.no_grad():
        summary_ids = model.generate(
            inputs["input_ids"],
            max_length=1024,
            num_beams=4,
            length_penalty=2.0
        )

    output_text = tokenizer.decode(summary_ids[0], skip_special_tokens=True)
    return parse_generated_soap(output_text)

# --- 4. TEST ---
print("\n--- ✅ MODEL READY ---")
sample_transcript = """
Doctor: Hello, we've received your results from the ultrasound we performed in April 2017. It shows a single thyroid nodule in your left lobe, measuring 1 cm in its largest diameter. We also conducted a complete biochemical screening, including TSH, autoantibodies, and calcitonin.

Patient: Hmm, what did the screening results show, doctor?

Doctor: Your calcitonin level was slightly elevated at 40 ng/mL, which is above the normal range of 1–4.8 ng/mL. To investigate further, we performed a stimulation test using intravenous calcium.

Patient: And what did the stimulation test show?

Doctor: After stimulation, your calcitonin levels peaked at 1420 ng/mL, which indicated the need for surgical treatment. As a result, you underwent a total thyroidectomy along with central neck dissection on the side of the tumor.

Patient: Yes, I remember that. How was my recovery after the surgery?

Doctor: Your postoperative course was uneventful. You experienced mild hypocalcemia on the first day after surgery, but it completely resolved within 48 hours, and you were discharged.

Patient: That’s good to hear. What did the tests on the removed tissue show?

Doctor: Immunohistochemistry of the thyroid nodule confirmed medullary thyroid cancer measuring 1 cm. The tumor was composed of cells with eosinophilic cytoplasm and showed a predominantly expansive growth pattern. The tumor cells were positive for calcitonin, chromogranin A, synaptophysin, and TTF-1, and negative for amyloid.

Patient: What about the surrounding tissue?

Doctor: There were focal areas of C-cell hyperplasia distributed throughout the gland. However, none of the lymph nodes in the central compartment showed evidence of metastasis.

Patient: That’s a relief. Were any other tests done on the tissue?

Doctor: Yes. Formalin-fixed paraffin-embedded tissue sections were treated with antigen retrieval using citrate buffer at high pH. The samples were then immunolabeled with a rabbit monoclonal anti-calcitonin antibody and incubated with appropriate fluorescent secondary antibodies.

Patient: So, what does all this mean for my condition?

Doctor: These findings confirm the diagnosis of medullary thyroid cancer. Fortunately, there is no evidence of lymph node metastasis, which is a positive prognostic sign. Ongoing follow-up and regular monitoring will be important to ensure proper long-term management.
"""
print(json.dumps(generate_soap(sample_transcript), indent=2))

⏳ Loading model from ./my_soap_model...

--- ✅ MODEL READY ---
{
  "Subjective": {
    "Chief_Complaint": "The patient reports a single thyroid nodule in the left lobe, measuring 1 cm in diameter. The patient underwent a complete biochemical screening including TSH, autoantibodies, and calcitonin levels. The stimulation test revealed a slightly elevated calcitonin level at 40 ng/mL (normal range: 1\u20134.8ng/mL). The patient experienced mild hypocalcemia on the first day post-surgery, but resolved within 48 hours.",
    "History_of_Present_Illness": "Not detected"
  },
  "Objective": {
    "Physical_Exam": "Not detected",
    "Observations": "Not detected"
  },
  "Assessment": {
    "Diagnosis": "Not detected",
    "Severity": "Not explicitly stated"
  },
  "Plan": {
    "Treatment": "Surgical intervention was performed, and the patient was discharged with no signs of recurrence or metastasis at 48 hours postoperatively. Postoperatively, the calcitonin levels decreased to below th

**Questions:**

- **How would you train an NLP model to map medical transcripts into SOAP format?**

We fine-tune a sequence-to-sequence (seq2seq) transformer.

Base model: GanjinZero/biobart-v2-base (a BioBART model pre-trained on biomedical text).

Task: text-to-text generation that converts medical dialogue transcripts into structured SOAP notes.
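
To make the text-to-text framing concrete, here is a sketch of how one target string is built. It mirrors the template used by the notebook's `flatten_soap_json`; the compact `flatten` helper, the `TEMPLATE` table, and the contents of `note` are invented for illustration and are not taken from the dataset.

```python
# (JSON section, section tag, [(field marker, JSON key), ...]) in template order.
TEMPLATE = [
    ("Subjective", "<SUBJECTIVE>",
     [("[CC]", "Chief_Complaint"), ("[HPI]", "History_of_Present_Illness")]),
    ("Objective", "<OBJECTIVE>",
     [("[PE]", "Physical_Exam"), ("[OBS]", "Observations")]),
    ("Assessment", "<ASSESSMENT>",
     [("[DX]", "Diagnosis"), ("[SEV]", "Severity")]),
    ("Plan", "<PLAN>",
     [("[TX]", "Treatment"), ("[FU]", "Follow-Up")]),
]

def flatten(soap):
    """Turn a nested SOAP dict into one marker-delimited target string."""
    parts = []
    for section, tag, fields in TEMPLATE:
        body = soap.get(section, {})
        parts.append(tag)
        for marker, key in fields:
            # Missing fields become 'N/A', as in the notebook's template.
            parts.append(f"{marker} {body.get(key, 'N/A')}")
    return " ".join(parts)

# Made-up note with two sections left out entirely.
note = {
    "Subjective": {
        "Chief_Complaint": "Neck pain",
        "History_of_Present_Illness": "Car accident four weeks ago",
    },
    "Plan": {"Treatment": "Physiotherapy"},
}
print(flatten(note))
```

The dialogue is the model input and this flattened string is the label, so every training pair teaches the same fixed marker order.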



- **What rule-based or deep-learning techniques would improve the accuracy of SOAP note generation?**

**Deep Learning Techniques:**

**Fine-tuned Transformer (BioBART)**

- Seq2seq architecture trained on the custom dialogue-SOAP dataset
- Uses beam search decoding (num_beams=4) for generation
- Trained for 5 epochs on soap_final_filled_roberta.json
- Learning rate: 3e-5 with weight decay regularization

**Tokenization & Truncation**

- Max length of 1024 tokens for both input and output
- Uses the tokenizer that ships with the BioBART checkpoint

**Structured Output Format**

- The model generates text with section and field markers: <SUBJECTIVE>, [CC], [HPI], <OBJECTIVE>, etc.
- Training on these markers teaches the model to produce structured outputs

**Rule-Based Techniques:**

**Input Preprocessing (clean_transcript)**

- Removes bracketed content with the regex r'\[.*?\]'
- Normalizes speaker labels: "Dr." → "Doctor:", "Pt." → "Patient:"
- Normalizes whitespace

**Output Post-Processing (parse_generated_soap)**

- Uses regex patterns to extract each SOAP section
- Example: r"\[CC\] (.*?) \[HPI\]" extracts the Chief Complaint
- Structures the flat text into nested JSON format

**Training Data Formatting (flatten_soap_json)**

- Converts JSON SOAP notes into template-based text with markers
- Ensures a consistent training format
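
One rule-based refinement worth considering: the patterns in `parse_generated_soap` each require a specific following marker, so a single missing tag turns the preceding field into "Not detected". The sketch below is a hypothetical alternative, not the notebook's parser; `parse_tolerant`, `FIELDS`, `MARKER`, and the sample string are assumptions made for illustration. It lets each field run until the next marker of any kind.

```python
import re

# Field markers from the notebook's template, mapped to (section, field).
FIELDS = {
    "[CC]": ("Subjective", "Chief_Complaint"),
    "[HPI]": ("Subjective", "History_of_Present_Illness"),
    "[PE]": ("Objective", "Physical_Exam"),
    "[OBS]": ("Objective", "Observations"),
    "[DX]": ("Assessment", "Diagnosis"),
    "[SEV]": ("Assessment", "Severity"),
    "[TX]": ("Plan", "Treatment"),
    "[FU]": ("Plan", "Follow-Up"),
}
# Matches a field marker or a section tag such as <PLAN>.
MARKER = re.compile(
    r"\[(?:CC|HPI|PE|OBS|DX|SEV|TX|FU)\]"
    r"|<(?:SUBJECTIVE|OBJECTIVE|ASSESSMENT|PLAN)>"
)

def parse_tolerant(text):
    """Hypothetical parser: each field runs until the next marker of any kind."""
    result = {}
    for section, field in FIELDS.values():
        result.setdefault(section, {})[field] = "Not detected"
    matches = list(MARKER.finditer(text))
    for i, m in enumerate(matches):
        if m.group() not in FIELDS:
            continue  # section tags only delimit; they carry no value
        end = matches[i + 1].start() if i + 1 < len(matches) else len(text)
        value = text[m.end():end].strip()
        if value:
            section, field = FIELDS[m.group()]
            result[section][field] = value
    return result

# Made-up model output with the <OBJECTIVE> tag and the [OBS] field missing.
generated = (
    "<SUBJECTIVE> [CC] Neck pain [HPI] Car accident "
    "[PE] Full range of motion <ASSESSMENT> [DX] Whiplash [SEV] Mild "
    "<PLAN> [TX] Physiotherapy [FU] Review in 4 weeks"
)
parsed = parse_tolerant(generated)
print(parsed["Subjective"]["History_of_Present_Illness"])  # Car accident
print(parsed["Objective"]["Observations"])  # Not detected
```

On this input, a pattern that expects `<OBJECTIVE>` after `[HPI]` would find no match, whereas the boundary-based approach still recovers the field.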