# Assessing spaCy NER performance

This notebook evaluates how well spaCy finds person, organisation and location entities. We use the pretrained transformer model `en_core_web_trf`.

To do so, we use the CoNLL003 dataset (English version), available on Kaggle at the following [link](https://www.kaggle.com/alaakhaled/conll003-englishversion). We formatted the data in the notebook `1_reformat_data.ipynb` to make it compatible with the spaCy model.

## Imports

In [1]:
import copy
import spacy
import pickle

import numpy as np
from tqdm.notebook import tqdm

## Model loading

We load spaCy's transformer-based NER model, `en_core_web_trf`.

In [2]:
import en_core_web_trf
nlp = en_core_web_trf.load()

## Data Loading

We load the reformatted data, a list of dictionaries, from the pickle file created by the notebook `1_reformat_data.ipynb`.

In [3]:
filename = '../../data/ner_testing.pkl'
# The context manager closes the file even if unpickling fails
with open(filename, 'rb') as infile:
    ner_testing_data = pickle.load(infile)

### Quick view of the first 5 testing sentences

In [4]:
ner_testing_data[:5]

[{'string': 'EU rejects German call to boycott British lamb .',
  'ents': [{'ent': 'ORG', 'start': 0, 'end': 2},
   {'ent': 'MISC', 'start': 11, 'end': 17},
   {'ent': 'MISC', 'start': 34, 'end': 41}]},
 {'string': 'Peter Blackburn',
  'ents': [{'ent': 'PER', 'start': 0, 'end': 15}]},
 {'string': 'BRUSSELS 1996-08-22',
  'ents': [{'ent': 'LOC', 'start': 0, 'end': 8}]},
 {'string': 'The European Commission said on Thursday it disagreed with German advice to consumers to shun British lamb until scientists determine whether mad cow disease can be transmitted to sheep .',
  'ents': [{'ent': 'ORG', 'start': 4, 'end': 23},
   {'ent': 'MISC', 'start': 59, 'end': 65},
   {'ent': 'MISC', 'start': 94, 'end': 101}]},
 {'string': "Germany's representative to the European Union's veterinary committee Werner Zwingmann said on Wednesday consumers should buy sheepmeat from countries other than Britain until the scientific advice was clearer .",
  'ents': [{'ent': 'LOC', 'start': 0, 'end': 6},
   {'ent
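Each record stores entity spans as character offsets into `string`, with `end` exclusive, as the output above suggests. A minimal sketch to check this reading, using the first test sentence (the helper name `entity_texts` is illustrative and not part of the notebook):

```python
# Record copied from the first test sentence displayed above.
record = {
    "string": "EU rejects German call to boycott British lamb .",
    "ents": [
        {"ent": "ORG", "start": 0, "end": 2},
        {"ent": "MISC", "start": 11, "end": 17},
        {"ent": "MISC", "start": 34, "end": 41},
    ],
}


def entity_texts(record):
    """Return (label, surface text) pairs by slicing the sentence with each span."""
    return [
        (ent["ent"], record["string"][ent["start"]:ent["end"]])
        for ent in record["ents"]
    ]


surface_forms = entity_texts(record)
print(surface_forms)
```

Slicing with `start` and `end` should recover "EU", "German" and "British", which is the same convention as spaCy's `start_char` and `end_char`.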

## Data preprocessing 

We need to rename the ground-truth entities so that they match the labels produced by the spaCy model, while keeping names that make sense.

In [5]:
def change_test_data_entities_names(ent_and_string):
    # Normalise the ground-truth labels in place; MISC is mapped to spaCy's NORP
    for ent in ent_and_string['ents']:
        if ent['ent'][-3:] == "ORG":
            ent['ent'] = "ORG"
        elif ent['ent'][-3:] == "PER":
            ent['ent'] = "PER"
        elif ent['ent'][-3:] == "LOC":
            ent['ent'] = "LOC"
        elif ent['ent'][-4:] == "MISC":
            ent['ent'] = "NORP"
    return ent_and_string

ner_testing_data = [change_test_data_entities_names(ent_and_string) for ent_and_string in ner_testing_data]
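The same renaming can also be written as a lookup table, which keeps the mapping in one place. This is only a sketch: the names `CONLL_TO_EVAL` and `normalise_conll_label` are hypothetical, and the handling of a `B-`/`I-` prefix is an assumption suggested by the suffix comparisons above (the labels shown in this notebook carry no prefix).

```python
# Hypothetical lookup table: CoNLL entity types -> labels used for the comparison.
CONLL_TO_EVAL = {"ORG": "ORG", "PER": "PER", "LOC": "LOC", "MISC": "NORP"}


def normalise_conll_label(label):
    """Strip an optional BIO-style prefix (assumed form: 'B-ORG') and map the type."""
    entity_type = label.split("-")[-1]
    # Labels with no entry in the table are returned unchanged.
    return CONLL_TO_EVAL.get(entity_type, label)


print(normalise_conll_label("MISC"))
```

Unlike the suffix test in the cell above, this version only matches whole entity types, so the two are not strictly equivalent on unusual labels.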

### Quick look at the result of the pre-processing 

In [6]:
ner_testing_data[:5]

[{'string': 'EU rejects German call to boycott British lamb .',
  'ents': [{'ent': 'ORG', 'start': 0, 'end': 2},
   {'ent': 'NORP', 'start': 11, 'end': 17},
   {'ent': 'NORP', 'start': 34, 'end': 41}]},
 {'string': 'Peter Blackburn',
  'ents': [{'ent': 'PER', 'start': 0, 'end': 15}]},
 {'string': 'BRUSSELS 1996-08-22',
  'ents': [{'ent': 'LOC', 'start': 0, 'end': 8}]},
 {'string': 'The European Commission said on Thursday it disagreed with German advice to consumers to shun British lamb until scientists determine whether mad cow disease can be transmitted to sheep .',
  'ents': [{'ent': 'ORG', 'start': 4, 'end': 23},
   {'ent': 'NORP', 'start': 59, 'end': 65},
   {'ent': 'NORP', 'start': 94, 'end': 101}]},
 {'string': "Germany's representative to the European Union's veterinary committee Werner Zwingmann said on Wednesday consumers should buy sheepmeat from countries other than Britain until the scientific advice was clearer .",
  'ents': [{'ent': 'LOC', 'start': 0, 'end': 6},
   {'ent

## Application of the model on the data
We only use the first 4000 sentences of the test set.

In [7]:
# Create the list which will contain all the results
spacy_ner_predictions = []

# Loop over the sentences of the testing set
for sentence in tqdm(ner_testing_data[:4000]):
    # Build the dictionary which will contain the NER predictions for the current sentence
    spacy_ents_dict = {"string": sentence['string'], "ents": []}
    # Apply the spaCy NER model and loop over the entities
    for ent in nlp(sentence['string']).ents:
        # Append a dictionary describing each entity found
        spacy_ents_dict["ents"].append({
            "start": ent.start_char,
            "end": ent.end_char,
            "ent": ent.label_
        })
    # Append the predictions on the sentence
    spacy_ner_predictions.append(spacy_ents_dict)

## Quick look at the first predictions

In [8]:
len(spacy_ner_predictions)

4000

In [9]:
spacy_ner_predictions[:5]

[{'string': 'EU rejects German call to boycott British lamb .',
  'ents': [{'start': 0, 'end': 2, 'ent': 'ORG'},
   {'start': 11, 'end': 17, 'ent': 'NORP'},
   {'start': 34, 'end': 41, 'ent': 'NORP'}]},
 {'string': 'Peter Blackburn',
  'ents': [{'start': 0, 'end': 15, 'ent': 'PERSON'}]},
 {'string': 'BRUSSELS 1996-08-22',
  'ents': [{'start': 9, 'end': 19, 'ent': 'DATE'}]},
 {'string': 'The European Commission said on Thursday it disagreed with German advice to consumers to shun British lamb until scientists determine whether mad cow disease can be transmitted to sheep .',
  'ents': [{'start': 0, 'end': 23, 'ent': 'ORG'},
   {'start': 32, 'end': 40, 'ent': 'DATE'},
   {'start': 59, 'end': 65, 'ent': 'NORP'},
   {'start': 94, 'end': 101, 'ent': 'NORP'}]},
 {'string': "Germany's representative to the European Union's veterinary committee Werner Zwingmann said on Wednesday consumers should buy sheepmeat from countries other than Britain until the scientific advice was clearer .",
  'ents'

## Post-processing
We rename some of the predicted entities to make them compatible with the labels used for the ground truth, and we filter out all the entities for which we have no ground truth.

In [10]:
def change_spacy_entities_names(ent_and_string):
    new_ent_and_string = {
        "string": ent_and_string["string"],
        "ents": []
    }
    # Keep only the labels that have a ground-truth counterpart,
    # renaming PERSON to PER and GPE to LOC
    for ent in ent_and_string['ents']:
        if ent['ent'] == "ORG":
            new_ent_and_string['ents'].append(ent)
        elif ent['ent'] == "PERSON":
            ent['ent'] = "PER"
            new_ent_and_string['ents'].append(ent)
        elif ent['ent'] == "GPE":
            ent['ent'] = "LOC"
            new_ent_and_string['ents'].append(ent)
        elif ent['ent'] == "NORP":
            new_ent_and_string['ents'].append(ent)
    return new_ent_and_string

In [11]:
filtered_spacy_ner_predictions = [change_spacy_entities_names(ent_and_string) for ent_and_string in spacy_ner_predictions]
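Note that `change_spacy_entities_names` renames the entity dictionaries in place, so the entries of `spacy_ner_predictions` are modified as well. If the raw predictions need to stay untouched, one possible alternative is sketched below; `SPACY_TO_EVAL` and `filter_and_rename` are hypothetical names, and the two records reuse spans from the predictions displayed earlier.

```python
# Hypothetical lookup table: spaCy labels -> labels used for the ground truth.
SPACY_TO_EVAL = {"ORG": "ORG", "PERSON": "PER", "GPE": "LOC", "NORP": "NORP"}


def filter_and_rename(record):
    """Return a new record with only the mapped labels, leaving the input untouched."""
    kept = [
        {**ent, "ent": SPACY_TO_EVAL[ent["ent"]]}  # copy each dict before renaming
        for ent in record["ents"]
        if ent["ent"] in SPACY_TO_EVAL
    ]
    return {"string": record["string"], "ents": kept}


person = {"string": "Peter Blackburn",
          "ents": [{"start": 0, "end": 15, "ent": "PERSON"}]}
dateline = {"string": "BRUSSELS 1996-08-22",
            "ents": [{"start": 9, "end": 19, "ent": "DATE"}]}

print(filter_and_rename(person))
print(filter_and_rename(dateline))
```

The `DATE` entity is dropped because it has no ground-truth counterpart, while `person` still carries its original `PERSON` label after the call.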

In [12]:
filtered_spacy_ner_predictions[:5]

[{'string': 'EU rejects German call to boycott British lamb .',
  'ents': [{'start': 0, 'end': 2, 'ent': 'ORG'},
   {'start': 11, 'end': 17, 'ent': 'NORP'},
   {'start': 34, 'end': 41, 'ent': 'NORP'}]},
 {'string': 'Peter Blackburn',
  'ents': [{'start': 0, 'end': 15, 'ent': 'PER'}]},
 {'string': 'BRUSSELS 1996-08-22', 'ents': []},
 {'string': 'The European Commission said on Thursday it disagreed with German advice to consumers to shun British lamb until scientists determine whether mad cow disease can be transmitted to sheep .',
  'ents': [{'start': 0, 'end': 23, 'ent': 'ORG'},
   {'start': 59, 'end': 65, 'ent': 'NORP'},
   {'start': 94, 'end': 101, 'ent': 'NORP'}]},
 {'string': "Germany's representative to the European Union's veterinary committee Werner Zwingmann said on Wednesday consumers should buy sheepmeat from countries other than Britain until the scientific advice was clearer .",
  'ents': [{'start': 0, 'end': 7, 'ent': 'LOC'},
   {'start': 28, 'end': 48, 'ent': 'ORG'},
   

## Scoring function

In this section we build the functions that measure how close the spaCy model's predictions are to the ground truth.

In [13]:
def get_size_intersection(ent_1, ent_2):
    """
    Return the size, in characters, of the intersection of the two entities.
    """
    return max(0, min(ent_1["end"], ent_2["end"]) - max(ent_1["start"], ent_2["start"]))

def find_one_best_match(ent_1, list_of_entities_2):
    """
    Find the best match of the entity `ent_1` in the list of entities 
    `list_of_entities_2`.
    """
    max_overlap = 0
    index_max_overlap = -1
    for i, ent_2 in enumerate(list_of_entities_2):
        # Make sure that the entities have the same type 
        if ent_2['ent'] == ent_1['ent']:
            intersection_ent_1_ent_2 = get_size_intersection(ent_1, ent_2)
            # If the new candidate has a better overlap with ent_1,
            # we keep it as the current best match
            if intersection_ent_1_ent_2 > max_overlap:
                index_max_overlap = i
                max_overlap = intersection_ent_1_ent_2
    # Return the index of the entity with the best overlap in `list_of_entities_2`
    # as well as the associated overlap
    return index_max_overlap, max_overlap

def get_size_all_entities(list_of_entities):
    """
    Get the total size of all the entities in the list of entities
    `list_of_entities`.
    """
    size_all_entities = 0
    for ent in list_of_entities:
        size_all_entities += ent["end"] - ent["start"]
    return size_all_entities

def jacquard_metrics(list_of_entities_1, list_of_entities_2):
    """
    Compute the Jaccard metric between the two lists of entities given as parameters.
    Also return the size of the intersection, the size of the union, and the number of
    matched entities (those with a non-empty intersection) divided by the number of
    entities in list 1 and by the number of entities in list 2.
    """
    list_of_entities_1 = copy.deepcopy(list_of_entities_1)
    list_of_entities_2 = copy.deepcopy(list_of_entities_2)
    size_of_intersection = 0
    nb_intersec = 0
    len1, len2 = len(list_of_entities_1), len(list_of_entities_2)
    size_of_union = get_size_all_entities(list_of_entities_1) + get_size_all_entities(list_of_entities_2)
    for ent_1 in list_of_entities_1:
        index_max_overlap, max_overlap = find_one_best_match(ent_1, list_of_entities_2)
        if index_max_overlap >= 0:
            size_of_intersection += max_overlap
            list_of_entities_2.pop(index_max_overlap)
            nb_intersec += 1
    size_of_union -= size_of_intersection  # |A ∪ B| = |A| + |B| - |A ∩ B|
    
    return [size_of_intersection / size_of_union if (size_of_union > 0) else 1,
            size_of_intersection, size_of_union,
            nb_intersec / len1 if len1 > 0 else 1, nb_intersec / len2 if len2 > 0 else 1]
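As a sanity check, the sketch below restates the greedy matching of `jacquard_metrics` in condensed form and applies it to the fourth test sentence, using the ground-truth spans and the filtered spaCy spans shown earlier. The helper names `overlap` and `score_sentence` are illustrative and not part of the notebook.

```python
import copy


def overlap(a, b):
    """Length, in characters, of the overlap between two spans."""
    return max(0, min(a["end"], b["end"]) - max(a["start"], b["start"]))


def score_sentence(truth, pred):
    """Greedy one-to-one matching of same-label spans, as in the functions above."""
    remaining = copy.deepcopy(pred)
    inter, matched = 0, 0
    for t in truth:
        best_i, best = -1, 0
        for i, p in enumerate(remaining):
            if p["ent"] == t["ent"] and overlap(t, p) > best:
                best_i, best = i, overlap(t, p)
        if best_i >= 0:
            inter += best
            remaining.pop(best_i)  # each prediction is matched at most once
            matched += 1
    total = sum(e["end"] - e["start"] for e in truth + pred)
    return inter, total - inter, matched


# Spans of "The European Commission said on Thursday ..." as displayed earlier.
truth = [{"ent": "ORG", "start": 4, "end": 23},
         {"ent": "NORP", "start": 59, "end": 65},
         {"ent": "NORP", "start": 94, "end": 101}]
pred = [{"ent": "ORG", "start": 0, "end": 23},
        {"ent": "NORP", "start": 59, "end": 65},
        {"ent": "NORP", "start": 94, "end": 101}]

inter, union, matched = score_sentence(truth, pred)
jaccard = inter / union
print(inter, union, matched, round(jaccard, 4))
```

The only disagreement is the leading "The " in the organisation span, so the intersection is 32 characters, the union is 36, and the Jaccard score is 32/36 ≈ 0.89, with all three entities matched.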

## Compute the scoring metrics for each sentence in the test set

In [15]:
list_of_jacquard = []

for gt_ent, spacy_ent in zip(ner_testing_data, filtered_spacy_ner_predictions):
    list_of_jacquard.append(jacquard_metrics(gt_ent['ents'], spacy_ent['ents']))

### Average Jaccard metric

In [17]:
print(round(100 * np.mean(np.array(list_of_jacquard)[:, 0]), 2), "%")

69.37 %


### Size of the intersection of all entities over all sentences divided by the size of the union

In [18]:
print(round(100 * np.sum(np.array(list_of_jacquard)[:, 1]) / np.sum(np.array(list_of_jacquard)[:, 2]), 2), "%")

65.07 %
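The two figures above differ because the first averages per-sentence ratios, while the second pools the intersection and union sizes over all sentences before dividing. A toy illustration with made-up sizes (not taken from the dataset), including a sentence without entities, which scores 1 by the convention used in `jacquard_metrics`:

```python
# Made-up (intersection, union) sizes for three sentences; the last has no entities.
per_sentence = [(2, 2), (10, 40), (0, 0)]

# Mean of per-sentence ratios: every sentence weighs the same,
# and an empty union counts as a perfect score of 1.
macro = sum(i / u if u > 0 else 1 for i, u in per_sentence) / len(per_sentence)

# Ratio of pooled sums: sentences with long entities dominate,
# and entity-free sentences contribute nothing.
micro = sum(i for i, _ in per_sentence) / sum(u for _, u in per_sentence)

print(round(macro, 4), round(micro, 4))
```

In this toy case the per-sentence mean is 0.75 while the pooled ratio is 12/42 ≈ 0.29, which shows how short or entity-free sentences can lift the averaged score.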


### Recall of the spaCy NER model

In [19]:
print(round(100 *np.mean(np.array(list_of_jacquard)[:, 3]), 2), "%")

71.41 %


### Precision of the spaCy NER model

In [20]:
print(round(100 *np.mean(np.array(list_of_jacquard)[:, 4]), 2), "%")

91.76 %
