# **NER for Disease Prediction Using Dictionary and Pretrained Model**

This notebook demonstrates two different approaches to identifying diseases in text:

1. Using a dictionary with spaCy's `PhraseMatcher`: this approach matches disease names from a predefined list.
2. Using a pretrained Named Entity Recognition (NER) model: here we use the `en_ner_bc5cdr_md` model from scispaCy, which is trained to recognize disease and chemical entities in biomedical text.

In [None]:
# ! pip install scispacy
# ! pip install https://s3-us-west-2.amazonaws.com/ai2-s2-scispacy/releases/v0.5.4/en_core_sci_sm-0.5.4.tar.gz
# ! pip install https://s3-us-west-2.amazonaws.com/ai2-s2-scispacy/releases/v0.5.4/en_ner_bc5cdr_md-0.5.4.tar.gz
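
Before running the rest of the notebook, it can help to confirm that the packages installed above are importable. The following is a minimal sketch using only the standard library; the helper name `missing_packages` is hypothetical, and the package names are assumed from the import statements used later in the notebook.

```python
from importlib.util import find_spec


def missing_packages(names):
    """Return the top-level package names that cannot be located for import."""
    return [name for name in names if find_spec(name) is None]


# Package names assumed from the imports used later in this notebook.
required = ["scispacy", "en_core_sci_sm", "en_ner_bc5cdr_md"]
missing = missing_packages(required)
print("Missing packages:", missing if missing else "none")
```

If anything is reported as missing, re-running the corresponding `pip install` line to completion should resolve it.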



## Approach 1: Dictionary-Based NER with sci_sm Model
This approach uses the `en_core_sci_sm` model in spaCy to tokenize biomedical text, but instead of relying on machine learning for disease recognition, we use a custom dictionary of diseases. The `PhraseMatcher` detects diseases by matching token sequences exactly, which by default is case-sensitive.
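
Because the default matching is case-sensitive, a term such as "Deep Vein Thrombosis" would not match a lowercase dictionary entry. The sketch below shows one way to relax this with the `attr="LOWER"` option of `PhraseMatcher`. It uses a blank English pipeline and a made-up sample sentence purely to stay self-contained; both are assumptions for illustration and not part of the notebook's own pipeline.

```python
import spacy
from spacy.matcher import PhraseMatcher

# A blank English pipeline is enough here: PhraseMatcher only needs a tokenizer.
# (Assumption for illustration; the notebook itself uses en_core_sci_sm.)
nlp = spacy.blank("en")

# attr="LOWER" compares lowercased token text instead of the exact token text.
matcher = PhraseMatcher(nlp.vocab, attr="LOWER")
terms = ["deep vein thrombosis", "VTE"]
matcher.add("DISEASE", [nlp.make_doc(term) for term in terms])

# Illustrative sentence with different casing than the dictionary entries.
sample = "Prophylaxis of Deep Vein Thrombosis and treatment of symptomatic vte."
doc = nlp.make_doc(sample)
found = [doc[start:end].text for _, start, end in matcher(doc)]
print(found)
```

The trade-off is that lowercase matching can also pick up ordinary words that happen to coincide with short abbreviations, so it may be worth keeping abbreviations case-sensitive in a separate matcher.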

In [4]:
# Import the necessary libraries

import en_core_sci_sm
from spacy.matcher import PhraseMatcher

In [2]:
# Sample drug description used for disease detection

text = """ FRAGMIN (dalteparin sodium) is indicated for prophylaxis of ischemic complications in 
unstable angina and non-Q-wave myocardial infarction, prophylaxis of deep vein thrombosis (DVT), extended treatment of
symptomatic venous thromboembolism (VTE) in adults with cancer, and treatment of symptomatic VTE in pediatric patients aged 1 month and older."""

In [None]:
# Load the spaCy model
nlp_sci_sm = en_core_sci_sm.load()

In [6]:
# Define the disease dictionary (from database or predefined)
disease_list = [
    "unstable angina", 
    "myocardial infarction", 
    "deep vein thrombosis", 
    "venous thromboembolism", 
    "VTE"
]
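
When the dictionary comes from a database export or a text file rather than a hand-written list, entries often carry stray whitespace, blank rows, or case-variant duplicates. The following is a small sketch of a cleanup step; the helper name `clean_terms` and the sample rows are hypothetical.

```python
def clean_terms(lines):
    """Normalize whitespace, drop empty rows, and remove case-insensitive duplicates.

    The first spelling encountered for each term is the one that is kept.
    """
    seen = set()
    terms = []
    for line in lines:
        term = " ".join(line.split())  # collapse internal runs of whitespace
        key = term.lower()
        if term and key not in seen:
            seen.add(key)
            terms.append(term)
    return terms


# Hypothetical raw rows, e.g. as read line by line from an exported file.
raw_rows = ["unstable angina\n", "  VTE ", "", "vte", "deep  vein thrombosis"]
print(clean_terms(raw_rows))
```

Since the function accepts any iterable of strings, an open file object can be passed to it directly.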

In [7]:
# Create PhraseMatcher and add patterns for each disease
matcher = PhraseMatcher(nlp_sci_sm.vocab)
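# make_doc only tokenizes, which is all PhraseMatcher needs; the loop variable
# is named `name` so it cannot be confused with the sample `text` above
patterns = [nlp_sci_sm.make_doc(name) for name in disease_list]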
matcher.add("DISEASE", patterns)

In [8]:
# Process the text
doc = nlp_sci_sm(text)

In [9]:
# Find matches in the text
matches = matcher(doc)

In [10]:
# Extract matched disease names
matched_diseases = [doc[start:end].text for match_id, start, end in matches]
print("Diseases mentioned (Dictionary-based):", matched_diseases)

Diseases mentioned (Dictionary-based): ['unstable angina', 'myocardial infarction', 'deep vein thrombosis', 'venous thromboembolism', 'VTE', 'VTE']
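
The matcher reports every occurrence, which is why "VTE" appears twice in the output above. If only the distinct diseases are needed, or a count per disease, the list can be post-processed as in this short sketch. The list is copied from the printed output so the example runs on its own.

```python
from collections import Counter

# Matches copied from the printed output above.
matched_diseases = [
    "unstable angina",
    "myocardial infarction",
    "deep vein thrombosis",
    "venous thromboembolism",
    "VTE",
    "VTE",
]

# dict.fromkeys keeps the first occurrence of each item and preserves order.
unique_diseases = list(dict.fromkeys(matched_diseases))

# Counter records how many times each disease was mentioned.
mention_counts = Counter(matched_diseases)

print("Unique diseases:", unique_diseases)
print("Mentions of VTE:", mention_counts["VTE"])
```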


## Approach 2: Pretrained NER Model (bc5cdr)
In this approach, we use the `en_ner_bc5cdr_md` NER model, which is trained to identify disease and chemical entities in biomedical text and labels them `DISEASE` and `CHEMICAL`. The model does not require a predefined dictionary; it recognizes entities based on what it learned during training.

In [11]:
# Importing the required pretrained model
import en_ner_bc5cdr_md

In [12]:
# Load the pretrained model for NER
nlp_bc5cdr = en_ner_bc5cdr_md.load()

In [None]:
# Process the text using the NER model
doc = nlp_bc5cdr(text)

In [13]:
# Extract entities recognized as diseases
diseases = [ent.text for ent in doc.ents if ent.label_ == "DISEASE"]
print("Diseases mentioned (NER model):", diseases)

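Diseases mentioned (NER model): []

**Note:** This empty list most likely reflects execution order rather than the model's behavior. The cell that runs `nlp_bc5cdr(text)` is marked `In [None]`, meaning it was not executed, so `doc` still held the `en_core_sci_sm` output from Approach 1. That model labels every entity as `ENTITY`, so the `"DISEASE"` filter matches nothing. Run the processing cell first, then re-run the extraction cell.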
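
An empty result like the one above is easier to diagnose when all entities are inspected together with their labels, instead of filtering on a single label straight away. The sketch below groups `(text, label)` pairs by label. The helper name `group_by_label` is hypothetical, and the example pairs are made up for illustration; they are not actual model output.

```python
from collections import defaultdict


def group_by_label(pairs):
    """Group entity texts by their label, preserving the order of appearance."""
    grouped = defaultdict(list)
    for text, label in pairs:
        grouped[label].append(text)
    return dict(grouped)


# With a processed spaCy doc, the pairs could be built as:
#   pairs = [(ent.text, ent.label_) for ent in doc.ents]
# Hypothetical pairs for illustration only (not real model output):
example_pairs = [
    ("dalteparin sodium", "CHEMICAL"),
    ("unstable angina", "DISEASE"),
    ("cancer", "DISEASE"),
]
print(group_by_label(example_pairs))
```

If the grouped result contains no `DISEASE` key at all, the document was probably processed by a different pipeline than intended.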


## Conclusion

The **dictionary-based approach** is fast and predictable, and it works well when you have a small, well-defined list of diseases. However, it only finds terms that are in the list: in the example above, the abbreviation "DVT" appears in the text but is not reported because it is not in `disease_list`.

**Pretrained NER models** such as `en_ner_bc5cdr_md` can identify diseases without a predefined list, including terms the dictionary author did not anticipate. However, they require more computational resources and can miss or mislabel entities.
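
To see where the two approaches agree and differ on a given text, their outputs can be compared as sets. The following is a minimal sketch; the helper name `compare_mentions` is hypothetical, the dictionary hits are copied from the Approach 1 output, and the NER hits are made-up placeholders rather than real model output.

```python
def compare_mentions(dictionary_hits, ner_hits):
    """Compare two lists of mentions case-insensitively."""
    dict_set = {mention.lower() for mention in dictionary_hits}
    ner_set = {mention.lower() for mention in ner_hits}
    return {
        "both": sorted(dict_set & ner_set),
        "dictionary_only": sorted(dict_set - ner_set),
        "ner_only": sorted(ner_set - dict_set),
    }


# Copied from the dictionary-based output earlier in the notebook.
dictionary_hits = [
    "unstable angina",
    "myocardial infarction",
    "deep vein thrombosis",
    "venous thromboembolism",
    "VTE",
    "VTE",
]

# Hypothetical NER hits for illustration only (not real model output).
ner_hits = ["unstable angina", "DVT", "cancer"]

comparison = compare_mentions(dictionary_hits, ner_hits)
for group, mentions in comparison.items():
    print(f"{group}: {mentions}")
```

Terms that show up only on the NER side are natural candidates for extending the dictionary, while terms found only by the dictionary point to entities the model may be missing.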