# 🏥 Medical NER Model Training

**Train a Custom spaCy NER Model for Medical Entity Extraction**

This notebook trains a Named Entity Recognition model to identify:
- **SYMPTOM**: headache, chest pain, fatigue, dizziness
- **TREATMENT**: ibuprofen, surgery, physical therapy
- **DIAGNOSIS**: migraine, hypertension, diabetes
- **BODY_PART**: head, chest, leg, arm
- **DURATION**: 3 days, 2 weeks, chronic

**Author:** Himanshu Sharma

---

## 1️⃣ Setup & Installation

In [None]:
# Install spaCy
!pip install spacy==3.7.4 --quiet
# Optional: the pretrained pipeline is not used below (training starts from spacy.blank("en"))
!python -m spacy download en_core_web_sm

print("✅ spaCy installed!")

In [None]:
import spacy
from spacy.tokens import DocBin
from spacy.training import Example
from spacy.util import minibatch, compounding
import random
import json
from pathlib import Path

print(f"📦 spaCy version: {spacy.__version__}")

## 2️⃣ Training Data (100+ Labeled Examples)

In [None]:
# NER Training Data: (text, {"entities": [(start, end, label), ...]})
TRAIN_DATA = [
    # SYMPTOM examples
    ("I have a severe headache", {"entities": [(16, 24, "SYMPTOM")]}),
    ("The chest pain started yesterday", {"entities": [(4, 14, "SYMPTOM")]}),
    ("I'm experiencing dizziness and nausea", {"entities": [(17, 26, "SYMPTOM"), (31, 37, "SYMPTOM")]}),
    ("My fatigue has been overwhelming", {"entities": [(3, 10, "SYMPTOM")]}),
    ("I have shortness of breath", {"entities": [(7, 26, "SYMPTOM")]}),
    ("The numbness in my hands is concerning", {"entities": [(4, 12, "SYMPTOM")]}),
    ("I've been having muscle pain", {"entities": [(17, 28, "SYMPTOM")]}),
    ("There's swelling in my ankle", {"entities": [(8, 16, "SYMPTOM")]}),
    ("I'm feeling very anxious", {"entities": [(17, 24, "SYMPTOM")]}),
    ("The fever started last night", {"entities": [(4, 9, "SYMPTOM")]}),
    ("I have back pain that radiates down my leg", {"entities": [(7, 16, "SYMPTOM")]}),
    ("My vision is getting blurry", {"entities": [(21, 27, "SYMPTOM")]}),
    ("I've been having insomnia", {"entities": [(17, 25, "SYMPTOM")]}),
    ("The joint pain is debilitating", {"entities": [(4, 14, "SYMPTOM")]}),
    ("I have a persistent cough", {"entities": [(20, 25, "SYMPTOM")]}),
    ("My stomach hurts after eating", {"entities": [(3, 10, "BODY_PART")]}),
    ("I feel nauseous in the morning", {"entities": [(7, 15, "SYMPTOM")]}),
    ("The migraine is getting worse", {"entities": [(4, 12, "SYMPTOM")]}),
    ("I have tremors in my hands", {"entities": [(7, 14, "SYMPTOM")]}),
    ("There's a rash on my arm", {"entities": [(10, 14, "SYMPTOM")]}),
    
    # TREATMENT examples
    ("I'm taking ibuprofen for the pain", {"entities": [(11, 20, "TREATMENT")]}),
    ("The doctor prescribed antibiotics", {"entities": [(22, 33, "TREATMENT")]}),
    ("I need physical therapy for my back", {"entities": [(7, 23, "TREATMENT")]}),
    ("We might need to consider surgery", {"entities": [(26, 33, "TREATMENT")]}),
    ("I started taking aspirin daily", {"entities": [(17, 24, "TREATMENT")]}),
    ("The medication is helping", {"entities": [(4, 14, "TREATMENT")]}),
    ("I've been prescribed lisinopril", {"entities": [(21, 31, "TREATMENT")]}),
    ("You should try acetaminophen", {"entities": [(15, 28, "TREATMENT")]}),
    ("I'm using a heating pad", {"entities": [(12, 23, "TREATMENT")]}),
    ("The steroid injection helped", {"entities": [(4, 21, "TREATMENT")]}),
    ("I take metformin for my diabetes", {"entities": [(7, 16, "TREATMENT")]}),
    ("Physical therapy sessions twice weekly", {"entities": [(0, 16, "TREATMENT")]}),
    ("I'm on blood pressure medication", {"entities": [(7, 32, "TREATMENT")]}),
    ("The doctor recommended massage therapy", {"entities": [(23, 38, "TREATMENT")]}),
    ("I use an inhaler for asthma", {"entities": [(9, 16, "TREATMENT")]}),
    ("We'll start with conservative treatment", {"entities": [(17, 39, "TREATMENT")]}),
    ("I take atorvastatin for cholesterol", {"entities": [(7, 19, "TREATMENT")]}),
    ("Chemotherapy starts next week", {"entities": [(0, 12, "TREATMENT")]}),
    ("I'm doing breathing exercises", {"entities": [(10, 29, "TREATMENT")]}),
    ("The cortisone shot was effective", {"entities": [(4, 18, "TREATMENT")]}),
    
    # DIAGNOSIS examples
    ("I was diagnosed with diabetes", {"entities": [(21, 29, "DIAGNOSIS")]}),
    ("The doctor said it's hypertension", {"entities": [(21, 33, "DIAGNOSIS")]}),
    ("I have chronic migraine", {"entities": [(15, 23, "DIAGNOSIS")]}),
    ("It looks like arthritis", {"entities": [(14, 23, "DIAGNOSIS")]}),
    ("I was told I have anemia", {"entities": [(18, 24, "DIAGNOSIS")]}),
    ("The tests showed asthma", {"entities": [(17, 23, "DIAGNOSIS")]}),
    ("I've been diagnosed with anxiety disorder", {"entities": [(25, 41, "DIAGNOSIS")]}),
    ("It appears to be bronchitis", {"entities": [(17, 27, "DIAGNOSIS")]}),
    ("The MRI revealed a herniated disc", {"entities": [(19, 33, "DIAGNOSIS")]}),
    ("I have Type 2 diabetes", {"entities": [(7, 22, "DIAGNOSIS")]}),
    ("The diagnosis is osteoporosis", {"entities": [(17, 29, "DIAGNOSIS")]}),
    ("I was told it's carpal tunnel", {"entities": [(16, 29, "DIAGNOSIS")]}),
    ("Tests confirm hypothyroidism", {"entities": [(14, 28, "DIAGNOSIS")]}),
    ("I have seasonal allergies", {"entities": [(7, 25, "DIAGNOSIS")]}),
    ("The x-ray showed a fracture", {"entities": [(19, 27, "DIAGNOSIS")]}),
    ("I've been diagnosed with depression", {"entities": [(25, 35, "DIAGNOSIS")]}),
    ("It's a case of gastritis", {"entities": [(15, 24, "DIAGNOSIS")]}),
    ("The doctor suspects fibromyalgia", {"entities": [(20, 32, "DIAGNOSIS")]}),
    ("I have high cholesterol", {"entities": [(7, 23, "DIAGNOSIS")]}),
    ("The scan showed kidney stones", {"entities": [(16, 29, "DIAGNOSIS")]}),
    
    # BODY_PART examples
    ("My head hurts really bad", {"entities": [(3, 7, "BODY_PART")]}),
    ("There's pain in my chest", {"entities": [(19, 24, "BODY_PART")]}),
    ("I can't move my leg", {"entities": [(16, 19, "BODY_PART")]}),
    ("My arm feels numb", {"entities": [(3, 6, "BODY_PART")]}),
    ("The back pain is severe", {"entities": [(4, 8, "BODY_PART")]}),
    ("I injured my knee playing sports", {"entities": [(13, 17, "BODY_PART")]}),
    ("My shoulder has been hurting", {"entities": [(3, 11, "BODY_PART")]}),
    ("There's swelling in my ankle", {"entities": [(23, 28, "BODY_PART")]}),
    ("I have pain in my wrist", {"entities": [(18, 23, "BODY_PART")]}),
    ("My neck is very stiff", {"entities": [(3, 7, "BODY_PART")]}),
    ("The hip replacement was successful", {"entities": [(4, 7, "BODY_PART")]}),
    ("I hurt my elbow", {"entities": [(10, 15, "BODY_PART")]}),
    ("My stomach has been upset", {"entities": [(3, 10, "BODY_PART")]}),
    ("There's pressure in my eyes", {"entities": [(23, 27, "BODY_PART")]}),
    ("My throat feels scratchy", {"entities": [(3, 9, "BODY_PART")]}),
    ("I have ringing in my ears", {"entities": [(21, 25, "BODY_PART")]}),
    ("My foot is swollen", {"entities": [(3, 7, "BODY_PART")]}),
    ("The spine x-ray was clear", {"entities": [(4, 9, "BODY_PART")]}),
    ("I have pain in my abdomen", {"entities": [(18, 25, "BODY_PART")]}),
    ("My fingers are tingling", {"entities": [(3, 10, "BODY_PART")]}),
    
    # DURATION examples
    ("This has been going on for 3 days", {"entities": [(27, 33, "DURATION")]}),
    ("I've had this for 2 weeks", {"entities": [(18, 25, "DURATION")]}),
    ("The pain started yesterday", {"entities": [(17, 26, "DURATION")]}),
    ("It's been chronic for years", {"entities": [(10, 17, "DURATION")]}),
    ("This happened 5 hours ago", {"entities": [(14, 25, "DURATION")]}),
    ("I've been dealing with this for a month", {"entities": [(34, 39, "DURATION")]}),
    ("The symptoms appeared last week", {"entities": [(22, 31, "DURATION")]}),
    ("It's been 6 months now", {"entities": [(10, 18, "DURATION")]}),
    ("This started 2 years ago", {"entities": [(13, 24, "DURATION")]}),
    ("I noticed it 3 weeks ago", {"entities": [(13, 24, "DURATION")]}),
    ("It's an ongoing issue", {"entities": [(8, 15, "DURATION")]}),
    ("This has been constant for 4 days", {"entities": [(27, 33, "DURATION")]}),
    ("The problem started overnight", {"entities": [(20, 29, "DURATION")]}),
    ("I've had chronic pain for decades", {"entities": [(9, 16, "DURATION"), (26, 33, "DURATION")]}),
    ("It comes and goes for several months", {"entities": [(22, 36, "DURATION")]}),
    
    # Mixed examples (multiple entity types)
    ("I've had headaches for 2 weeks", {"entities": [(9, 18, "SYMPTOM"), (23, 30, "DURATION")]}),
    ("Taking ibuprofen for my back pain", {"entities": [(7, 16, "TREATMENT"), (24, 33, "SYMPTOM")]}),
    ("My doctor diagnosed diabetes and prescribed metformin", {"entities": [(20, 28, "DIAGNOSIS"), (44, 53, "TREATMENT")]}),
    ("Chest pain started 3 days ago", {"entities": [(0, 10, "SYMPTOM"), (19, 29, "DURATION")]}),
    ("I have arthritis in my knee", {"entities": [(7, 16, "DIAGNOSIS"), (23, 27, "BODY_PART")]}),
    ("The headache and dizziness are concerning", {"entities": [(4, 12, "SYMPTOM"), (17, 26, "SYMPTOM")]}),
    ("Physical therapy helped my shoulder pain", {"entities": [(0, 16, "TREATMENT"), (27, 40, "SYMPTOM")]}),
    ("I was diagnosed with hypertension last year", {"entities": [(21, 33, "DIAGNOSIS"), (34, 43, "DURATION")]}),
    ("My leg pain has been chronic", {"entities": [(3, 11, "SYMPTOM"), (21, 28, "DURATION")]}),
    ("Taking aspirin for chest tightness", {"entities": [(7, 14, "TREATMENT"), (19, 34, "SYMPTOM")]}),
]

print(f"📊 Training Data: {len(TRAIN_DATA)} examples")

# Count entities by type
from collections import Counter
entity_counts = Counter()
for _, annotations in TRAIN_DATA:
    for _, _, label in annotations['entities']:
        entity_counts[label] += 1

print("\n   Entity distribution:")
for label, count in entity_counts.most_common():
    print(f"   - {label}: {count}")
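
Hand-typed character offsets are easy to get wrong by one, and spaCy generally warns about and ignores spans that do not line up with token boundaries. The check below is a minimal sketch, not part of the original notebook: `find_bad_spans` is a hypothetical helper that uses plain string tests (range, surrounding whitespace, neighbouring alphanumeric characters) as a rough stand-in for spaCy's tokenizer, so it can miss cases the tokenizer would flag.

```python
def find_bad_spans(data):
    """Return (text, start, end, label, reason) for every suspicious span.

    Heuristic only: it checks string-level word boundaries, not spaCy's tokenizer.
    """
    problems = []
    for text, annotations in data:
        for start, end, label in annotations["entities"]:
            span = text[start:end]
            if not (0 <= start < end <= len(text)):
                reason = "out of range"
            elif span != span.strip():
                reason = "leading/trailing whitespace"
            elif start > 0 and text[start - 1].isalnum():
                reason = "starts mid-word"
            elif end < len(text) and text[end].isalnum():
                reason = "ends mid-word"
            else:
                continue
            problems.append((text, start, end, label, reason))
    return problems


# Small illustrative sample: one aligned span and three deliberately broken ones.
sample = [
    ("I have a severe headache", {"entities": [(16, 24, "SYMPTOM")]}),  # aligned
    ("I have a severe headache", {"entities": [(17, 24, "SYMPTOM")]}),  # starts mid-word
    ("I have a severe headache", {"entities": [(15, 24, "SYMPTOM")]}),  # leading space
    ("I have a severe headache", {"entities": [(16, 25, "SYMPTOM")]}),  # past the end
]
for text, start, end, label, reason in find_bad_spans(sample):
    print(f"{label} ({start}, {end}) in {text!r}: {reason}")
```

An empty result from `find_bad_spans(TRAIN_DATA)` is a reasonable precondition before building `Example` objects.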

## 3️⃣ Train spaCy NER Model

In [None]:
# Load blank English model
nlp = spacy.blank("en")

# Add NER pipeline
ner = nlp.add_pipe("ner")

# Add entity labels
for label in ["SYMPTOM", "TREATMENT", "DIAGNOSIS", "BODY_PART", "DURATION"]:
    ner.add_label(label)

print("✅ Model setup complete")
print(f"   Labels: {ner.labels}")

In [None]:
# Prepare training examples
examples = []
for text, annotations in TRAIN_DATA:
    doc = nlp.make_doc(text)
    example = Example.from_dict(doc, annotations)
    examples.append(example)

# Training settings
EPOCHS = 30
DROP = 0.3

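# Initialize the model weights and get an optimizer
# (nlp.begin_training() is deprecated in spaCy v3 in favor of nlp.initialize())
optimizer = nlp.initialize(lambda: examples)

print(f"🚀 Training for {EPOCHS} epochs...\n")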

# Training loop
losses_history = []
for epoch in range(EPOCHS):
    random.shuffle(examples)
    losses = {}
    
    # Batch training
    batches = minibatch(examples, size=compounding(4.0, 32.0, 1.001))
    for batch in batches:
        nlp.update(batch, drop=DROP, losses=losses, sgd=optimizer)
    
    losses_history.append(losses['ner'])
    
    if (epoch + 1) % 5 == 0:
        print(f"   Epoch {epoch+1:2d}/{EPOCHS}: Loss = {losses['ner']:.4f}")

print("\n✅ Training complete!")
print(f"   Final loss: {losses_history[-1]:.4f}")
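
The training loop above calls `compounding(4.0, 32.0, 1.001)` inside the epoch loop, so the schedule restarts at 4.0 on every epoch. The sketch below re-implements the schedule in plain Python to show what that means for batch sizes. `compounding_sketch` and `batch_sizes` are illustrative names, and the behaviour they model (multiply by the compound rate each step, cap at the stop value, truncate to an integer per batch) is an assumption about how spaCy and Thinc handle the schedule, not a copy of the library code.

```python
def compounding_sketch(start, stop, compound):
    """Yield start, start*compound, start*compound**2, ... capped at stop."""
    value = float(start)
    while True:
        yield min(value, stop)
        value *= compound


def batch_sizes(n_examples, start=4.0, stop=32.0, compound=1.001):
    """Integer batch sizes for one pass over n_examples with a fresh schedule."""
    schedule = compounding_sketch(start, stop, compound)
    sizes, remaining = [], n_examples
    while remaining > 0:
        # Truncate the float schedule value; the last batch takes what is left.
        size = min(int(next(schedule)), remaining)
        sizes.append(size)
        remaining -= size
    return sizes


sizes = batch_sizes(105)  # 105 is len(TRAIN_DATA) in this notebook
print(len(sizes), "batches; sizes seen:", sorted(set(sizes)))
```

Under these assumptions the schedule never gets past 4 within an epoch, so the run behaves like fixed batches of 4. If a growing batch size is wanted, one option is to create the schedule once before the epoch loop and reuse it, which should let it keep compounding across epochs.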

## 4️⃣ Evaluate Model

In [None]:
# Test examples
test_texts = [
    "I've been having severe headaches for about 2 weeks now",
    "My doctor prescribed ibuprofen for the back pain",
    "I was diagnosed with Type 2 diabetes last year",
    "The chest pain and shortness of breath are concerning",
    "I'm taking metformin and doing physical therapy",
    "My knee has been swollen for 3 days",
]

print("🔮 Model Predictions:\n")
for text in test_texts:
    doc = nlp(text)
    print(f'📝 "{text}"')
    if doc.ents:
        for ent in doc.ents:
            print(f"   → {ent.text}: {ent.label_}")
    else:
        print("   (no entities found)")
    print()
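
The cell above prints predictions for manual inspection but does not quantify them. As a minimal sketch, exact-match precision, recall and F1 can be computed over `(start, end, label)` tuples. The gold annotations and predictions below are made up for illustration (the notebook has no labeled test set), and `entity_prf` is a hypothetical helper. spaCy's own `nlp.evaluate` reports comparable entity scores when given `Example` objects.

```python
def entity_prf(gold, predicted):
    """Exact-match precision, recall and F1 over parallel lists of span lists."""
    tp = fp = fn = 0
    for gold_spans, pred_spans in zip(gold, predicted):
        gold_set, pred_set = set(gold_spans), set(pred_spans)
        tp += len(gold_set & pred_set)   # spans with matching offsets and label
        fp += len(pred_set - gold_set)   # predicted but not in gold
        fn += len(gold_set - pred_set)   # in gold but not predicted
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
    return precision, recall, f1


# Hypothetical gold labels and model output for two held-out sentences.
gold = [
    # "My knee has been swollen for 3 days"
    [(3, 7, "BODY_PART"), (29, 35, "DURATION")],
    # "My doctor prescribed ibuprofen for the back pain"
    [(21, 30, "TREATMENT"), (39, 48, "SYMPTOM")],
]
predicted = [
    [(3, 7, "BODY_PART"), (17, 24, "SYMPTOM")],
    [(21, 30, "TREATMENT"), (39, 48, "SYMPTOM")],
]

p, r, f = entity_prf(gold, predicted)
print(f"precision={p:.2f} recall={r:.2f} f1={f:.2f}")
```

With a trained pipeline, the predicted spans for a sentence could be collected as `[(e.start_char, e.end_char, e.label_) for e in nlp(text).ents]`.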

In [None]:
# Visualize entities
from spacy import displacy

test_text = "I've been having severe headaches and dizziness for 2 weeks. My doctor prescribed ibuprofen and recommended physical therapy for the back pain."
doc = nlp(test_text)

# Custom colors for medical entities
colors = {
    "SYMPTOM": "#ff6b6b",
    "TREATMENT": "#4ecdc4",
    "DIAGNOSIS": "#ffe66d",
    "BODY_PART": "#95e1d3",
    "DURATION": "#a8d8ea"
}

# Display (will show in Colab)
displacy.render(doc, style="ent", jupyter=True, options={"colors": colors})

In [None]:
# Plot training loss
import matplotlib.pyplot as plt

plt.figure(figsize=(10, 5))
plt.plot(range(1, EPOCHS+1), losses_history, 'b-', linewidth=2)
plt.xlabel('Epoch')
plt.ylabel('Loss')
plt.title('NER Model Training Loss')
plt.grid(True, alpha=0.3)
plt.savefig('ner_training_loss.png', dpi=150)
plt.show()

## 5️⃣ Save & Download Model

In [None]:
# Save model
MODEL_PATH = Path("./medical_ner_model")
MODEL_PATH.mkdir(exist_ok=True)
nlp.to_disk(MODEL_PATH)

# Save metadata
metadata = {
    "model_name": "medical_ner",
    "labels": list(ner.labels),
    "training_examples": len(TRAIN_DATA),
    "epochs": EPOCHS,
    "final_loss": float(losses_history[-1]),
    "spacy_version": spacy.__version__
}

with open(MODEL_PATH / "metadata.json", "w") as f:
    json.dump(metadata, f, indent=2)

print(f"✅ Model saved to: {MODEL_PATH}")
print("\n📊 Model Info:")
for k, v in metadata.items():
    print(f"   {k}: {v}")
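
Because `metadata.json` records the spaCy version used for training, it can be read back before loading the pipeline elsewhere. The sketch below is one possible check, not something the notebook does: `load_metadata` and `same_minor_version` are hypothetical helpers, comparing only the major and minor version components is an assumed policy, and the demo values are made up.

```python
import json
import tempfile
from pathlib import Path


def same_minor_version(saved, current):
    """True when two version strings share major and minor components."""
    return saved.split(".")[:2] == current.split(".")[:2]


def load_metadata(model_dir):
    """Read the metadata.json written alongside the saved pipeline."""
    with open(Path(model_dir) / "metadata.json", encoding="utf-8") as f:
        return json.load(f)


# Self-contained demo using a throwaway directory and made-up values.
with tempfile.TemporaryDirectory() as tmp:
    demo = {"model_name": "medical_ner", "spacy_version": "3.7.4"}
    (Path(tmp) / "metadata.json").write_text(json.dumps(demo), encoding="utf-8")
    meta = load_metadata(tmp)
    print(same_minor_version(meta["spacy_version"], "3.7.2"))
    print(same_minor_version(meta["spacy_version"], "3.8.0"))
```

In an app, the second argument would be `spacy.__version__` and the directory would be the extracted model folder.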

In [None]:
# Zip and download
!zip -r medical_ner_model.zip medical_ner_model/

from google.colab import files
files.download('medical_ner_model.zip')
print("📥 NER Model downloaded!")

## 6️⃣ How to Use in Your App

In [None]:
# Usage example
print("""
📖 How to use in your Streamlit app:

1. Extract medical_ner_model.zip to your project:
   models/medical_ner_model/

2. Load in Python:

   import spacy
   nlp = spacy.load("models/medical_ner_model")
   
   # Extract entities
   doc = nlp("I have had headaches for 3 days")
   for ent in doc.ents:
       print(f"{ent.text}: {ent.label_}")
   
   # Expected output (actual predictions may vary between training runs):
   # headaches: SYMPTOM
   # 3 days: DURATION
""")
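
In an app it is often more convenient to group the extracted entities by label than to print them one by one. The helper below is a sketch under that assumption: `group_entities` is a hypothetical name, and it relies only on the `.ents`, `.text` and `.label_` attributes that spaCy `Doc` and `Span` objects expose. A stand-in object replaces the real `Doc` here so the example runs without a trained model.

```python
from collections import defaultdict
from types import SimpleNamespace


def group_entities(doc):
    """Map each label to the list of entity texts found in a spaCy-like doc."""
    grouped = defaultdict(list)
    for ent in doc.ents:
        grouped[ent.label_].append(ent.text)
    return dict(grouped)


# Stand-in for a processed Doc; with a real model this would be nlp(text).
fake_doc = SimpleNamespace(ents=[
    SimpleNamespace(text="headaches", label_="SYMPTOM"),
    SimpleNamespace(text="dizziness", label_="SYMPTOM"),
    SimpleNamespace(text="2 weeks", label_="DURATION"),
])
print(group_entities(fake_doc))
```

With the saved pipeline loaded, the call would be `group_entities(nlp(user_text))`, and the resulting dictionary can be passed straight to a UI component.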