# Named Entity Recognition (NER)

### What is it?

-   Natural Language Processing (NLP) task
-   Extract and categorize 'entities' from text
-   Can be words or sequences of words
-   Looking for 'who', 'what', 'where', 'when'
-   Common types of entities identified:
    -   Person
    -   Organization
    -   Time/Date
    -   Event
    -   Location
    -   Money
    -   Cardinal Number
    -   Phone Number
    -   Email Address
-   Simple examples:
    -   **Lisa** *(PERSON)* went to the **mall** *(LOCATION)* **yesterday** *(DATE)*. 
    -   The annual **party** *(EVENT)* for **Praxis Engineering** *(ORGANIZATION)* is **Saturday** *(DATE)*.
-   Example Named Entity Visualizer: https://demos.explosion.ai/displacy-ent

### Use Cases - extract important details from large volumes of text
-   Search Engines
-   Health Records
-   Customer Service
-   HR/Resumes
-   Translation

### Types
-   Dictionary
    -   Uses dictionary (list) of entities
    -   Usually combined with other methods in applications with a very specific lexicon
    -   Rarely used alone because it requires a large, constantly updated dictionary
-   Rules and Patterns
    -   Similar to the dictionary approach; often used to complement other methods
    -   Can use patterns such as regular expressions
    -   Can use rules based on text context, for example, the word after a title (Mr., Dr.) is always a name
-   Statistical Methods
    -   Conditional Random Field (CRF; available in sklearn-crfsuite) - graphical model in which the probability of a node's label depends on the neighboring nodes and their labels
    -   Maximum-entropy Markov Model (MEMM) - a form of Markov model in which the probability of the current label depends on the previous label and the current token
-   Neural Network Methods
    -   Most common method currently
    -   More flexible because they can 'understand' the semantic context of the words
-   Methods can be used in combination
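
The dictionary and rule-based approaches above can be sketched in a few lines of plain Python. This is only an illustration, not how any particular library implements them: the gazetteer entries, the label names, and the `find_entities` helper are all assumptions made up for the example.

```python
import re

# Hypothetical mini-dictionary (gazetteer): surface form -> entity label.
GAZETTEER = {
    "praxis engineering": "ORGANIZATION",
    "geneva": "LOCATION",
}

# Pattern rule: US-style phone numbers such as (555) 123-4567.
PHONE_RE = re.compile(r"\(\d{3}\)\s?\d{3}-\d{4}")

# Context rule: a capitalized word right after a title is treated as a person.
TITLE_RE = re.compile(r"\b(?:Mr|Mrs|Ms|Dr)\.\s+([A-Z][a-z]+)")


def find_entities(text):
    """Return sorted (start, end, label) spans found by the dictionary and rules."""
    spans = []
    # Dictionary lookup: case-insensitive, whole-phrase matches only.
    for phrase, label in GAZETTEER.items():
        pattern = r"\b" + re.escape(phrase) + r"\b"
        for m in re.finditer(pattern, text, flags=re.IGNORECASE):
            spans.append((m.start(), m.end(), label))
    # Regular-expression pattern.
    for m in PHONE_RE.finditer(text):
        spans.append((m.start(), m.end(), "PHONENUM"))
    # Context rule: keep only the name, not the title itself.
    for m in TITLE_RE.finditer(text):
        spans.append((m.start(1), m.end(1), "PERSON"))
    return sorted(spans)


sample = "Dr. Smith of Praxis Engineering can be reached at (555) 123-4567."
for start, end, label in find_entities(sample):
    print(sample[start:end], label)
```

The sketch also shows why these methods are rarely used alone: every new organization or phone format needs another dictionary entry or pattern.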

# Evaluation Measures
-   F-Score: function of precision and recall
    -   2 * Precision * Recall / (Precision + Recall)
    -   Precision: # true positives / (# true positives + # false positives)
        -   Of the entities the model predicted, what fraction were labeled correctly
    -   Recall: # true positives / (# true positives + # false negatives)
        -   Of the true entities, what fraction the model found and labeled
-   Accuracy: (# true positives + # true negatives) / (# true positives + # false positives + # true negatives + # false negatives)
-   One criticism of these measures is that they do not show why errors are happening
-   Measures are calculated over all tokens, then aggregated by entity type and over the whole document
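
The formulas above can be checked with a small token-level calculation. This is a minimal sketch under stated assumptions: the `token_level_scores` helper and the `"O"` label for non-entity tokens are illustrative choices, not part of spaCy. A token given the wrong entity type is counted here as both a false positive and a false negative, so accuracy is computed as correct tokens over all tokens.

```python
def token_level_scores(gold, predicted, outside="O"):
    """Return (precision, recall, f1, accuracy) from per-token labels."""
    if len(gold) != len(predicted):
        raise ValueError("gold and predicted must have the same length")
    tp = fp = fn = tn = 0
    for g, p in zip(gold, predicted):
        if g == outside and p == outside:
            tn += 1          # correctly left unlabeled
        elif g == p:
            tp += 1          # entity token with the correct label
        else:
            if p != outside:
                fp += 1      # predicted an entity label that is wrong here
            if g != outside:
                fn += 1      # missed (or mislabeled) a true entity token
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
    accuracy = (tp + tn) / len(gold) if gold else 0.0
    return precision, recall, f1, accuracy


# Tokens: Lisa went to the mall yesterday .
gold      = ["PERSON", "O", "O", "O", "LOCATION", "DATE", "O"]
predicted = ["PERSON", "O", "O", "O", "O",        "DATE", "O"]

precision, recall, f1, accuracy = token_level_scores(gold, predicted)
print(f"precision={precision:.2f} recall={recall:.2f} f1={f1:.2f} accuracy={accuracy:.2f}")
```

In this example the model misses "mall", so precision is 1.0, recall is 2/3, and the F-Score is 0.8. Accuracy (6/7) looks high mainly because most tokens are not entities, which is one reason the F-Score is usually preferred for NER.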

# spaCy package in Python
-   Open-source library that mostly uses deep-learning neural network models
-   Models can be fine-tuned by users
-   Offers different model sizes
-   Includes other NLP features such as tokenization, part-of-speech tagging, and text categorization
-   Generally recommended over NLTK for production use because of its speed and scale



In [1]:
# basic spaCy setup 
import spacy

spacy.cli.download("en_core_web_sm")
NER = spacy.load("en_core_web_sm")

doc='''The World Health Organization (WHO)[1] is a specialized agency of the United Nations 
responsible for international public health.[2] 
The WHO Constitution states its main objective as 'the attainment by all peoples of the highest 
possible level of health'.[3] Headquartered in Geneva, Switzerland, it has six regional offices 
and 150 field offices worldwide. The WHO was established on 7 April 1948.[4][5] 
The first meeting of the World Health Assembly (WHA), the agency's governing body, took place on 
24 July of that year. The WHO incorporated the assets, personnel, and duties of the League of Nations' 
Health Organization and the Office International d'Hygiène Publique, including the 
International Classification of Diseases (ICD).[6] Its work began in earnest in 1951 
after a significant infusion of financial and technical resources.[7]'''

text = NER(doc)

Collecting en-core-web-sm==3.7.0
  Downloading https://github.com/explosion/spacy-models/releases/download/en_core_web_sm-3.7.0/en_core_web_sm-3.7.0-py3-none-any.whl (12.8 MB)
     ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 12.8/12.8 MB 13.1 MB/s eta 0:00:00
✔ Download and installation successful
You can now load the package via spacy.load('en_core_web_sm')


In [2]:
# spaCy NER output

print([(ent.text.strip(), ent.label_) for ent in text.ents])

[('The World Health Organization', 'ORG'), ('the United Nations', 'ORG'), ('The WHO Constitution', 'LAW'), ("health'.[3]", 'ORG'), ('Geneva', 'GPE'), ('Switzerland', 'GPE'), ('six', 'CARDINAL'), ('150', 'CARDINAL'), ('WHO', 'ORG'), ('7 April 1948.[4][5', 'DATE'), ('first', 'ORDINAL'), ('the World Health Assembly', 'ORG'), ('WHA', 'ORG'), ('24 July of that year', 'DATE'), ('WHO', 'ORG'), ("the League of Nations' \nHealth Organization", 'ORG'), ("the Office International d'Hygiène Publique", 'ORG'), ('the \nInternational Classification of Diseases', 'ORG'), ('1951', 'DATE')]


In [3]:
# display meaning of entity category
spacy.explain("ORG")

'Companies, agencies, institutions, etc.'

In [4]:
# display entities visually
from spacy import displacy
displacy.render(NER(doc),style="ent",jupyter=True)

In [5]:
# see current pipeline of individual tasks
print(NER.pipe_names)

['tok2vec', 'tagger', 'parser', 'attribute_ruler', 'lemmatizer', 'ner']


In [6]:
# can enable components that ship with the model but are disabled by default
NER.enable_pipe("senter") # this element breaks text into sentences

-   NER models may perform poorly on specialized jargon, as in the examples below
-   We can add exact words, patterns, or rules to the pipeline using spaCy's EntityRuler
    -   See https://spacy.io/usage/rule-based-matching for documentation on formatting
-   We can also fine-tune the deep learning model with annotated training data

In [7]:
# performance issues: the default model finds no entities in medical text
doc_med = "Antiretroviral therapy (ART) is recommended for all HIV-infected individuals."
text_med = NER(doc_med)
print([(ent.text.strip(), ent.label_) for ent in text_med.ents])

[]


In [8]:
# performance issues: the phone number is split into CARDINALs and the email is missed
doc_phone = "My phone number is (555) 123-4567 and email is lisa@xyz.com"
text_phone = NER(doc_phone)
print([(ent.text.strip(), ent.label_) for ent in text_phone.ents])

[('555', 'CARDINAL'), ('123-4567', 'CARDINAL')]


In [9]:
# EntityRuler - add rules
patterns = [
    # exact phrase match
    {"label": "MEDICINE", "pattern": "Antiretroviral therapy"},
    # token pattern using spaCy's SHAPE attribute (d = digit)
    {"label": "PHONENUM", "pattern": [
        {"ORTH": "("},
        {"SHAPE": "ddd"},
        {"ORTH": ")"},
        {"IS_SPACE": True, "OP": "?"},
        {"SHAPE": "ddd"},
        {"ORTH": "-"},
        {"SHAPE": "dddd"},
    ]},
    # regular expression on a single token; the raw string keeps "\." a literal dot
    {"label": "EMAILADDR", "pattern": [
        {"TEXT": {"REGEX": r"[a-zA-Z0-9._~&=*-]+@[a-zA-Z0-9.-]+\.[a-zA-Z]{2,4}"}}
    ]},
]
# run the ruler before the statistical "ner" component so its matches are kept
ruler = NER.add_pipe("entity_ruler", before="ner")
ruler.add_patterns(patterns)


text_med = NER(doc_med)
text_phone = NER(doc_phone)
print([(ent.text.strip(), ent.label_) for ent in text_med.ents])
print([(ent.text.strip(), ent.label_) for ent in text_phone.ents])


[('Antiretroviral therapy', 'MEDICINE')]
[('(555) 123-4567', 'PHONENUM'), ('lisa@xyz.com', 'EMAILADDR')]


-   Annotated examples can be used to fine-tune the model
-   This example uses the medical entity dictionary from Kaggle: https://www.kaggle.com/datasets/finalepoch/medical-ner/
-   The cells below show the format needed for training; for a full tutorial, see https://newscatcherapi.com/blog/train-custom-named-entity-recognition-ner-model-with-spacy-v3

In [10]:
# fine-tuning training data format
import json
with open('Corona2.json', 'r') as f:
    data = json.load(f)
    
print(data['examples'][0])



{'id': '18c2f619-f102-452f-ab81-d26f7e283ffe', 'content': "While bismuth compounds (Pepto-Bismol) decreased the number of bowel movements in those with travelers' diarrhea, they do not decrease the length of illness.[91] Anti-motility agents like loperamide are also effective at reducing the number of stools but not the duration of disease.[8] These agents should be used only if bloody diarrhea is not present.[92]\n\nDiosmectite, a natural aluminomagnesium silicate clay, is effective in alleviating symptoms of acute diarrhea in children,[93] and also has some effects in chronic functional diarrhea, radiation-induced diarrhea, and chemotherapy-induced diarrhea.[45] Another absorbent agent used for the treatment of mild diarrhea is kaopectate.\n\nRacecadotril an antisecretory medication may be used to treat diarrhea in children and adults.[86] It has better tolerability than loperamide, as it causes less constipation and flatulence.[94]", 'metadata': {}, 'annotations': [{'id': '0825a1bf-

In [11]:
# convert each example to its text plus (start, end, label) character offsets
training_data = {'classes': ['MEDICINE', 'MEDICALCONDITION', 'PATHOGEN'], 'annotations': []}
for example in data['examples']:
    temp_dict = {}
    temp_dict['text'] = example['content']
    temp_dict['entities'] = []
    for annotation in example['annotations']:
        start = annotation['start']
        end = annotation['end']
        label = annotation['tag_name'].upper()
        temp_dict['entities'].append((start, end, label))
    training_data['annotations'].append(temp_dict)

print(training_data['annotations'][0])

{'text': "While bismuth compounds (Pepto-Bismol) decreased the number of bowel movements in those with travelers' diarrhea, they do not decrease the length of illness.[91] Anti-motility agents like loperamide are also effective at reducing the number of stools but not the duration of disease.[8] These agents should be used only if bloody diarrhea is not present.[92]\n\nDiosmectite, a natural aluminomagnesium silicate clay, is effective in alleviating symptoms of acute diarrhea in children,[93] and also has some effects in chronic functional diarrhea, radiation-induced diarrhea, and chemotherapy-induced diarrhea.[45] Another absorbent agent used for the treatment of mild diarrhea is kaopectate.\n\nRacecadotril an antisecretory medication may be used to treat diarrhea in children and adults.[86] It has better tolerability than loperamide, as it causes less constipation and flatulence.[94]", 'entities': [(360, 371, 'MEDICINE'), (383, 408, 'MEDICINE'), (104, 112, 'MEDICALCONDITION'), (679,
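
Before `(start, end, label)` tuples like these are turned into spaCy training documents, it can help to check them for overlaps, since a spaCy `Doc` cannot hold overlapping entities. The following is a minimal sketch in plain Python, not part of the tutorial linked above: the `drop_overlapping` helper and the example offsets are hypothetical, and keeping the longest span is just one possible policy.

```python
def drop_overlapping(entities):
    """Keep the longest spans and discard any span that overlaps a kept one."""
    kept = []
    # Visit longer spans first; ties are broken by the earlier start offset.
    for start, end, label in sorted(entities, key=lambda e: (-(e[1] - e[0]), e[0])):
        # Two spans are disjoint when one ends before the other starts.
        if all(end <= k_start or start >= k_end for k_start, k_end, _ in kept):
            kept.append((start, end, label))
    return sorted(kept)


# Hypothetical offsets for illustration only (not taken from Corona2.json).
example_entities = [
    (0, 10, "MEDICINE"),
    (5, 12, "PATHOGEN"),            # overlaps the first span and is shorter
    (20, 25, "MEDICALCONDITION"),
]
print(drop_overlapping(example_entities))
```

In the notebook, the same function could be applied to each `temp_dict['entities']` list before training.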