#### This notebook covers the *Named Entity Recognition (NER)* part of the project.

The pre-trained model is loaded with only the `ner` (EntityRecognizer) pipeline component enabled, which speeds up loading and inference. The other components, such as those for POS tagging, lemmatization, and parsing, are disabled.

In [1]:
import spacy
from spacy import displacy

spacy.prefer_gpu()
model_name = "en_core_web_sm"
nlp = spacy.load(model_name, enable = ['ner'])
print("Spacy NLP model named '{}' successfully loaded".format(model_name))

Spacy NLP model named 'en_core_web_sm' successfully loaded


By default, the model's *Named Entity Recognition* component can detect the tags listed below.

In **our** case, we are only interested in the GPE, LAW, LOC, ORG, PERSON, and PRODUCT tags.

In [2]:
nlp.pipe_labels['ner']

['CARDINAL',
 'DATE',
 'EVENT',
 'FAC',
 'GPE',
 'LANGUAGE',
 'LAW',
 'LOC',
 'MONEY',
 'NORP',
 'ORDINAL',
 'ORG',
 'PERCENT',
 'PERSON',
 'PRODUCT',
 'QUANTITY',
 'TIME',
 'WORK_OF_ART']

The pretrained model comes from the NLP library *spaCy*. It takes sentences as input and performs several tagging tasks, including NER, which is the one we are interested in.

The *displaCy* visualizer renders the recognized entities inline, highlighted within the text.

In [3]:
example_text_1 = "klevio is a singer from Albania who usually goes to Greece and works in UBS. He lives in Lake Geneva and owns a Mercedes car."
doc1 = nlp(example_text_1)
print('Example I: ')
displacy.render(doc1, style = 'ent')

doc2 = nlp('The government in Senegal just passed a law on the 2nd of February regarding universal healthcare, named Universal Care Act, passed in parliament also in French')
print('Example II: ')
displacy.render(doc2, style = 'ent')

doc3 = nlp('World Health Organization in Geneva')
print('Example III: ')
displacy.render(doc3, style = 'ent')

Example I: 


Example II: 


Example III: 


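The entities found in a text are stored in the document's `ents` attribute. Each entity exposes its text, its string label (`label_`), and the integer ID of that label (`label`).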

In [4]:
for ent in doc1.ents:
    print("Entity: {}, Label: {}, Label ID: {} ".format(ent.text, ent.label_, ent.label))

Entity: klevio, Label: PERSON, Label ID: 380 
Entity: Albania, Label: GPE, Label ID: 384 
Entity: Greece, Label: GPE, Label ID: 384 
Entity: UBS, Label: ORG, Label ID: 383 
Entity: Lake Geneva, Label: LOC, Label ID: 385 
Entity: Mercedes, Label: PRODUCT, Label ID: 386 


Our custom dataset, which contains sentences about global digital health organizations, products, people, countries, and laws, is loaded from Prodigy in a spaCy-friendly format.

This dataset is used for our supervised entity recognition task.

In [5]:
import random
from prodigy.components.db import connect
from spacy.scorer import Scorer

def evaluate_model(nlp, examples_list):
    # In spaCy v3, nlp.evaluate returns the scores directly as a dict
    return nlp.evaluate(examples_list)


#### Load and shuffle dataset
seed = 596
random.seed(seed)
db = connect()
ner_dataset = db.get_dataset('ner_500_health')
random.shuffle(ner_dataset)
print('Custom Health NER Dataset loaded and shuffled')

Custom Health NER Dataset loaded and shuffled
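
Seeding before shuffling is what makes the sample order repeatable across runs. Here is a minimal sketch of that behavior, using a stand-in list instead of the Prodigy dataset; the `records` list and the `shuffled_copy` helper are hypothetical and not part of the notebook.

```python
import random

def shuffled_copy(items, seed):
    """Return a shuffled copy of items using a private, seeded generator."""
    rng = random.Random(seed)  # independent of the global random state
    result = list(items)
    rng.shuffle(result)
    return result

# Stand-in data; the real notebook shuffles the records loaded from Prodigy.
records = [{"text": "sentence {}".format(i)} for i in range(10)]

first = shuffled_copy(records, seed=596)
second = shuffled_copy(records, seed=596)
print(first == second)  # same seed, same order
```

A private `random.Random` instance keeps the shuffle reproducible even if other code draws from the global generator in between, which `random.seed` alone does not guarantee.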


Below is one sample from the loaded dataset. Annotated sentences are stored in *JSONL* format: the *'text'* field holds the input text and the *'spans'* field holds the annotated entity spans.

In [6]:
test_sample = ner_dataset[0]
test_sample

{'text': 'The Web site operators, based outside the United States, forwarded his request and associated questionnaires to Colorado psychiatrist Dr Christian Hageseth.',
 '_input_hash': -1967958515,
 '_task_hash': 1622678098,
 'tokens': [{'text': 'The', 'start': 0, 'end': 3, 'id': 0, 'ws': True},
  {'text': 'Web', 'start': 4, 'end': 7, 'id': 1, 'ws': True},
  {'text': 'site', 'start': 8, 'end': 12, 'id': 2, 'ws': True},
  {'text': 'operators', 'start': 13, 'end': 22, 'id': 3, 'ws': False},
  {'text': ',', 'start': 22, 'end': 23, 'id': 4, 'ws': True},
  {'text': 'based', 'start': 24, 'end': 29, 'id': 5, 'ws': True},
  {'text': 'outside', 'start': 30, 'end': 37, 'id': 6, 'ws': True},
  {'text': 'the', 'start': 38, 'end': 41, 'id': 7, 'ws': True},
  {'text': 'United', 'start': 42, 'end': 48, 'id': 8, 'ws': True},
  {'text': 'States', 'start': 49, 'end': 55, 'id': 9, 'ws': False},
  {'text': ',', 'start': 55, 'end': 56, 'id': 10, 'ws': True},
  {'text': 'forwarded', 'start': 57, 'end': 66, 
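
To make the role of the 'text' and 'spans' fields concrete, the sketch below recovers the annotated substrings from a record. The record is made up for illustration and only mimics that layout; it is not taken from the dataset. It assumes that span offsets are character positions with an exclusive end, as in the token entries shown above. `spans_to_entities` is a hypothetical helper.

```python
def spans_to_entities(record):
    """Return (substring, label) pairs for every annotated span in a record."""
    text = record["text"]
    return [(text[s["start"]:s["end"]], s["label"]) for s in record.get("spans", [])]

# Hypothetical record mimicking the 'text' / 'spans' layout.
record = {
    "text": "WHO opened an office in Geneva.",
    "spans": [
        {"start": 0, "end": 3, "label": "ORG"},
        {"start": 24, "end": 30, "label": "GPE"},
    ],
}

entities = spans_to_entities(record)
print(entities)
```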

##### NER Evaluation

In [7]:
for sample in ner_dataset[:5]:
    print(sample)

{'text': 'The Web site operators, based outside the United States, forwarded his request and associated questionnaires to Colorado psychiatrist Dr Christian Hageseth.', '_input_hash': -1967958515, '_task_hash': 1622678098, 'tokens': [{'text': 'The', 'start': 0, 'end': 3, 'id': 0, 'ws': True}, {'text': 'Web', 'start': 4, 'end': 7, 'id': 1, 'ws': True}, {'text': 'site', 'start': 8, 'end': 12, 'id': 2, 'ws': True}, {'text': 'operators', 'start': 13, 'end': 22, 'id': 3, 'ws': False}, {'text': ',', 'start': 22, 'end': 23, 'id': 4, 'ws': True}, {'text': 'based', 'start': 24, 'end': 29, 'id': 5, 'ws': True}, {'text': 'outside', 'start': 30, 'end': 37, 'id': 6, 'ws': True}, {'text': 'the', 'start': 38, 'end': 41, 'id': 7, 'ws': True}, {'text': 'United', 'start': 42, 'end': 48, 'id': 8, 'ws': True}, {'text': 'States', 'start': 49, 'end': 55, 'id': 9, 'ws': False}, {'text': ',', 'start': 55, 'end': 56, 'id': 10, 'ws': True}, {'text': 'forwarded', 'start': 57, 'end': 66, 'id': 11, 'ws': True}, {'

In [8]:
from spacy.training import Example, offsets_to_biluo_tags, biluo_to_iob

##### Convert an iterable of entity offsets to token-level IOB2 tags.
##### The offsets are first converted to the BILUO scheme and then to the IOB2 scheme.
##### Example: BILUO ['O', 'B-GPE', 'I-GPE', 'L-GPE', 'O'] => IOB2 ['O', 'B-GPE', 'I-GPE', 'I-GPE', 'O']
##### Used to compute a more lenient classification score for entities on a token level.
def single_token_tags(doc, entities):
        bilou_tags = offsets_to_biluo_tags(doc, entities)
        iob_tags = biluo_to_iob(bilou_tags)
        return iob_tags


def get_entities_from_jsonl(jsonl_sample):
        spans = []
        single_entities = []
        for span in jsonl_sample['spans']:
                start, end, label = span['start'], span['end'], span['label']
                single_entities.append({"start": start, "end": end, "label": label})
                spans.append((start, end, label))
        return spans, single_entities

def make_prediction(nlp_model, text):
        # Run the pipeline on the text and return the processed Doc
        return nlp_model(text)
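
The second conversion step in `single_token_tags` can be sketched without spaCy. The `biluo_to_iob2` helper below is a hypothetical stand-in written for illustration, not spaCy's implementation: it maps the `L-` prefix (last token of an entity) to `I-` and the `U-` prefix (single-token entity) to `B-`, and leaves every other tag unchanged.

```python
def biluo_to_iob2(tags):
    """Convert BILUO tags to IOB2: 'L-X' becomes 'I-X', 'U-X' becomes 'B-X'."""
    prefix_map = {"L": "I", "U": "B"}
    converted = []
    for tag in tags:
        prefix, sep, label = tag.partition("-")
        if sep and prefix in prefix_map:
            converted.append(prefix_map[prefix] + "-" + label)
        else:
            converted.append(tag)  # 'O', 'B-X' and 'I-X' pass through unchanged
    return converted

biluo = ["O", "B-GPE", "I-GPE", "L-GPE", "O", "U-ORG"]
iob2 = biluo_to_iob2(biluo)
print(iob2)
```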

Testing Loop

In [9]:
all_examples = []
all_tags = {"true" : [], "predicted": []}

for sample in ner_dataset[:5]:
        visualization = True
        true_entity_spans, true_entities = get_entities_from_jsonl(sample)
        sentence = sample['text']
        prediction = nlp(sentence)
        predicted_ent_spans = [(ent.start_char, ent.end_char, ent.label_) for ent in prediction.ents]
        
        ##### Converting each true and predicted entity span to the IOB2 tags schema
        predicted_ent_tags = single_token_tags(prediction, predicted_ent_spans)
        true_ent_tags = single_token_tags(prediction, true_entity_spans)

        ##### Uncomment for debug
        # print('For Sentence: {}\n'.format(sample['text']))
        # print('True Entity Span: {}'.format(true_entity_spans))
        # print('Model Prediction: {}\n'.format(predicted_ent_spans))
        # print('Predicted Tags: {}'.format(predicted_ent_tags))
        # print('True Tags: {}\n'.format(true_ent_tags))

        all_examples.append([Example.from_dict(prediction, {'entities': true_entity_spans})])
        all_tags['true'].append(true_ent_tags)
        all_tags['predicted'].append(predicted_ent_tags)        

        if visualization:
        ##### Visualization Debug
                print('Visualization: ')
                print('Predicted Entity Spans: ')
                displacy.render(prediction, style = 'ent', jupyter = True)
                true_example = {"text": prediction.text, "ents": true_entities, "title": None}
                print('True Entity Spans: ')
                displacy.render(true_example, style = 'ent', jupyter = True, manual = True)

Visualization: 
Predicted Entity Spans: 
True Entity Spans: 

[displaCy renderings not preserved; the same three lines are printed for each of the five samples]


In [10]:
scorer = Scorer(nlp)

##### Exact span match score, computed separately for each sample
for example in all_examples:
    print('Score: {}'.format(scorer.score_spans(example, attr="ents")))
# print('Score: {}'.format(scorer.score(examples)))

Score: {'ents_p': 0.3333333333333333, 'ents_r': 0.3333333333333333, 'ents_f': 0.3333333333333333, 'ents_per_type': {'GPE': {'p': 0.5, 'r': 0.5, 'f': 0.5}, 'PERSON': {'p': 0.0, 'r': 0.0, 'f': 0.0}}}
Score: {'ents_p': 0.0, 'ents_r': 0.0, 'ents_f': 0.0, 'ents_per_type': {'ORG': {'p': 0.0, 'r': 0.0, 'f': 0.0}}}
Score: {'ents_p': 0.0, 'ents_r': 0.0, 'ents_f': 0.0, 'ents_per_type': {'GPE': {'p': 0.0, 'r': 0.0, 'f': 0.0}, 'NORP': {'p': 0.0, 'r': 0.0, 'f': 0.0}, 'ORG': {'p': 0.0, 'r': 0.0, 'f': 0.0}, 'DATE': {'p': 0.0, 'r': 0.0, 'f': 0.0}}}
Score: {'ents_p': 0.0, 'ents_r': 0.0, 'ents_f': 0.0, 'ents_per_type': {'NORP': {'p': 0.0, 'r': 0.0, 'f': 0.0}, 'ORG': {'p': 0.0, 'r': 0.0, 'f': 0.0}, 'PRODUCT': {'p': 0.0, 'r': 0.0, 'f': 0.0}}}
Score: {'ents_p': 0.5, 'ents_r': 1.0, 'ents_f': 0.6666666666666666, 'ents_per_type': {'GPE': {'p': 1.0, 'r': 1.0, 'f': 1.0}, 'WORK_OF_ART': {'p': 0.0, 'r': 0.0, 'f': 0.0}, 'ORDINAL': {'p': 0.0, 'r': 0.0, 'f': 0.0}, 'ORG': {'p': 1.0, 'r': 1.0, 'f': 1.0}}}
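
Exact-match scoring counts a predicted entity as correct only if its start, end, and label all agree with a gold span. The sketch below illustrates the idea on made-up spans; `exact_span_prf` is a hypothetical helper, not part of spaCy's `Scorer`, and the spans are not taken from the dataset.

```python
def exact_span_prf(true_spans, predicted_spans):
    """Precision, recall and F1 over exactly matching (start, end, label) spans."""
    true_set, pred_set = set(true_spans), set(predicted_spans)
    tp = len(true_set & pred_set)
    precision = tp / len(pred_set) if pred_set else 0.0
    recall = tp / len(true_set) if true_set else 0.0
    total = precision + recall
    f1 = 2 * precision * recall / total if total else 0.0
    return precision, recall, f1

# Made-up spans: one exact match, one boundary mismatch, one label mismatch.
true_spans = [(0, 3, "ORG"), (24, 30, "GPE"), (40, 52, "PERSON")]
predicted_spans = [(0, 3, "ORG"), (24, 31, "GPE"), (40, 52, "ORG")]

p, r, f = exact_span_prf(true_spans, predicted_spans)
print(round(p, 3), round(r, 3), round(f, 3))
```

Only the first prediction earns credit here: a boundary that is off by one character or a wrong label counts as a miss on both the precision and the recall side.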


In [11]:
#### Map labels outside our set to 'O' (outside), so that errors still count
def set_tags_to_fixed_labels(tags_lists):
    our_labels = ['GPE', 'LAW', 'LOC', 'ORG', 'PERSON', 'PRODUCT']
    # Build new lists instead of editing in place, so all_tags stays unchanged
    return [[tag if tag[2:] in our_labels else 'O' for tag in tags]
            for tags in tags_lists]

In [12]:
from seqeval.metrics import classification_report as seqclassify

# print('Predicted Tags: {}'.format(predicted_ent_tags))
# print('True Tags: {}\n'.format(true_ent_tags))

print('Performance with all tags: ')
report_1 = seqclassify(all_tags['true'], all_tags['predicted'], zero_division = 1)
report_1.splitlines()

Performance with all tags: 


['              precision    recall  f1-score   support',
 '',
 '        DATE       0.00      1.00      0.00         0',
 '         GPE       0.67      0.50      0.57         4',
 '        NORP       0.00      1.00      0.00         0',
 '     ORDINAL       0.00      1.00      0.00         0',
 '         ORG       0.14      0.20      0.17         5',
 '      PERSON       0.00      0.00      0.00         1',
 '     PRODUCT       1.00      0.00      0.00         1',
 ' WORK_OF_ART       0.00      1.00      0.00         0',
 '',
 '   micro avg       0.19      0.27      0.22        11',
 '   macro avg       0.23      0.59      0.09        11',
 'weighted avg       0.40      0.27      0.28        11']
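
The rows with support 0 (DATE, NORP, ORDINAL, WORK_OF_ART) show a recall of 1.00 because `zero_division = 1` substitutes 1 wherever a ratio has a zero denominator. The sketch below recomputes the macro-averaged recall from the per-label values printed above, to show how much those rows contribute. The comparison against supported labels only is an illustration added here, not something the report computes.

```python
# Per-label recall values copied from the report above.
recall = {
    "DATE": 1.00, "GPE": 0.50, "NORP": 1.00, "ORDINAL": 1.00,
    "ORG": 0.20, "PERSON": 0.00, "PRODUCT": 0.00, "WORK_OF_ART": 1.00,
}
# Labels that actually occur in the gold annotations (support > 0).
supported = ["GPE", "ORG", "PERSON", "PRODUCT"]

macro_all = sum(recall.values()) / len(recall)
macro_supported = sum(recall[label] for label in supported) / len(supported)

print("{:.2f}".format(macro_all))        # macro recall over all eight labels
print("{:.3f}".format(macro_supported))  # macro recall over supported labels only
```

The first value reproduces the 0.59 in the macro avg row. Restricted to the four labels that have gold annotations, the average is about 0.175, so most of the reported macro recall comes from the substituted 1.00 values.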

In [13]:
print('\n Performance with only our needed tags: ')
fixed_predicted_labels = set_tags_to_fixed_labels(all_tags['predicted'])
report_2 = seqclassify(all_tags['true'], fixed_predicted_labels, zero_division = 1)
report_2.splitlines()


 Performance with only our needed tags: 


['              precision    recall  f1-score   support',
 '',
 '         GPE       0.67      0.50      0.57         4',
 '         ORG       0.14      0.20      0.17         5',
 '      PERSON       0.00      0.00      0.00         1',
 '     PRODUCT       1.00      0.00      0.00         1',
 '',
 '   micro avg       0.27      0.27      0.27        11',
 '   macro avg       0.45      0.17      0.18        11',
 'weighted avg       0.40      0.27      0.28        11']
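
The micro and macro rows differ because they aggregate differently: micro pools the counts over all labels before dividing, while macro averages the per-label scores. The counts below are not printed by the notebook. They are inferred here as one set of values consistent with the report (for example, GPE precision 0.67 with recall 0.50 on support 4 fits 2 correct out of 3 predicted), so treat them as an assumption. The `ratio` helper is hypothetical and only mimics the effect of `zero_division = 1`.

```python
# label: (true positives, predicted entities, gold entities).
# Assumed counts, inferred from the precision/recall/support columns above.
counts = {
    "GPE": (2, 3, 4),
    "ORG": (1, 7, 5),
    "PERSON": (0, 1, 1),
    "PRODUCT": (0, 0, 1),
}

def ratio(numerator, denominator, zero_division=1.0):
    """Divide, falling back to zero_division when the denominator is 0."""
    return numerator / denominator if denominator else zero_division

# Macro: average the per-label precision values.
per_label_precision = [ratio(tp, pred) for tp, pred, _ in counts.values()]
macro_precision = sum(per_label_precision) / len(per_label_precision)

# Micro: pool the counts across labels, then divide once.
total_tp = sum(tp for tp, _, _ in counts.values())
total_pred = sum(pred for _, pred, _ in counts.values())
total_gold = sum(gold for _, _, gold in counts.values())
micro_precision = ratio(total_tp, total_pred)
micro_recall = ratio(total_tp, total_gold)

print("macro precision: {:.2f}".format(macro_precision))
print("micro precision: {:.2f}".format(micro_precision))
print("micro recall:    {:.2f}".format(micro_recall))
```

Under these assumed counts the sketch gives 0.45, 0.27, and 0.27, in line with the report. PRODUCT has no predictions at all, so its precision of 1.00 comes entirely from the zero-division fallback and lifts the macro precision well above the micro value.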