<a href="https://colab.research.google.com/github/jmrf/NER-evaluation/blob/master/NER_evaluation.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# NER evaluation and comparison

We will compare three libraries and their pre-trained models for Named Entity Recognition (NER):

* [Spacy](https://github.com/explosion/spaCy)
* [Deeppavlov](https://github.com/deepmipt/DeepPavlov)
* [Polyglot](https://github.com/aboSamoor/polyglot)

### Grant access to Drive

If we don't grant access to Drive, **we will lose the generated data** every time we restart the Colab runtime.

In [2]:
from google.colab import drive
drive.mount('/content/drive')

Go to this URL in a browser: https://accounts.google.com/o/oauth2/auth?client_id=947318989803-6bn6qk8qdgf4n4g3pfee6491hc0brc4i.apps.googleusercontent.com&redirect_uri=urn%3Aietf%3Awg%3Aoauth%3A2.0%3Aoob&scope=email%20https%3A%2F%2Fwww.googleapis.com%2Fauth%2Fdocs.test%20https%3A%2F%2Fwww.googleapis.com%2Fauth%2Fdrive%20https%3A%2F%2Fwww.googleapis.com%2Fauth%2Fdrive.photos.readonly%20https%3A%2F%2Fwww.googleapis.com%2Fauth%2Fpeopleapi.readonly&response_type=code

Enter your authorization code:
··········
Mounted at /content/drive


### Install all needed libraries and download the pre-trained models

There's quite a bit to download and to compile... so go and grab a coffee in the meantime ;)

**IMPORTANT**: After the installation process is complete you'll need to **_restart the Colab runtime_** so the changes and installed models take effect!

In [0]:
# install spacy
!pip install spacy
!python -m spacy download en_core_web_lg

# install deepPavlov
!pip install deeppavlov
!python -m deeppavlov install squad_bert
!python -m deeppavlov install ner_ontonotes_bert

# install Polyglot
!pip install polyglot
!pip install pyicu
!pip install pycld2
!pip install morfessor
!polyglot download embeddings2.en ner2.en

### Define some helper classes to perform the NER extraction

We also define a mapping between the different tagging conventions

In [0]:
import spacy

from polyglot.text import Text
from deeppavlov import configs, build_model


polyglot2spacy = {
    "I-LOC":"GPE",
    "I-ORG":"ORG",
    "I-PER":"PERSON"
}

pavlov2spacy = {
    "LOCATION":"LOC",
    "ORGANIZATION":"ORG",
    "PERSON":"PERSON"
}


class PolyglotNER():

    @classmethod
    def process(cls, text):
        doc = Text(text)
        return dict([(t._collection[0], polyglot2spacy.get(t.tag, t.tag))
                     for t in doc.entities])


class PavlovNER():

    def __init__(self):
        self.ner_model = build_model(
            configs.ner.ner_ontonotes_bert_mult, 
            download=True
        )

    def process(self, text):
        sents, bios = self.ner_model([text])
        # return a dict mapping each tagged token to its entity type
        # (the B-/I- prefix of the BIO tag is stripped)
        return dict([(w, tag.split("-")[-1])
                    for w, tag in zip(sents[0], bios[0]) 
                    if tag != 'O'])


class SpacyNER():

    def __init__(self):
        # load model
        self.nlp = spacy.load("en_core_web_lg")

    def process(self, text):
        doc = self.nlp(text)
        # return a dict mapping each entity token to its type
        # (this iterates over tokens, so multi-word entities yield one entry per word)
        return dict([(e.text, e.ent_type_)
                     for e in doc if e.ent_type != 0])
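
The helpers above return one dictionary entry per tagged token, so a multi-word entity ends up split across several keys. As a minimal sketch of how token-level BIO tags could be grouped back into whole entities, the hypothetical helper `merge_bio` below (not part of any of the three libraries) takes parallel lists of tokens and tags. It assumes tags of the form `B-TYPE`, `I-TYPE` and `O`.

```python
def merge_bio(tokens, tags):
    """Group BIO-tagged tokens into (entity_text, entity_type) pairs."""
    entities = []
    current_tokens, current_type = [], None

    def close_span():
        # store the entity collected so far, if any
        if current_tokens:
            entities.append((" ".join(current_tokens), current_type))

    for token, tag in zip(tokens, tags):
        prefix, _, ent_type = tag.partition("-")
        if not ent_type:
            # "O" tag (or anything without a type): we are outside an entity
            close_span()
            current_tokens, current_type = [], None
        elif prefix == "B" or ent_type != current_type:
            # a new entity starts here
            close_span()
            current_tokens, current_type = [token], ent_type
        else:
            # "I-" tag continuing the current entity
            current_tokens.append(token)
    close_span()
    return entities


tokens = ["This", "is", "Marcos", "from", "the", "Golden", "Bridge", "of", "Barcelona"]
tags = ["O", "O", "B-PERSON", "O", "B-FAC", "I-FAC", "I-FAC", "O", "B-GPE"]
print(merge_bio(tokens, tags))
```

Returning a list of pairs rather than a dict also keeps repeated mentions of the same entity, which a dict keyed by text would collapse.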

### Load the different models

We do this only once. DeepPavlov should be called SlowPavlov... :/

In [0]:
spacy_ner = SpacyNER()
pavlov_ner = PavlovNER()

2019-09-26 13:56:27.749 INFO in 'deeppavlov.core.data.utils'['utils'] at line 64: Downloading from http://files.deeppavlov.ai/deeppavlov_data/ner_ontonotes_bert_mult_v1.tar.gz to /root/.deeppavlov/ner_ontonotes_bert_mult_v1.tar.gz
100%|██████████| 1.32G/1.32G [06:02<00:00, 3.64MB/s]
2019-09-26 14:02:29.779 INFO in 'deeppavlov.core.data.utils'['utils'] at line 216: Extracting /root/.deeppavlov/ner_ontonotes_bert_mult_v1.tar.gz archive into /root/.deeppavlov/models
2019-09-26 14:02:47.49 INFO in 'deeppavlov.core.data.utils'['utils'] at line 64: Downloading from http://files.deeppavlov.ai/deeppavlov_data/bert/multi_cased_L-12_H-768_A-12.zip to /root/.deeppavlov/downloads/multi_cased_L-12_H-768_A-12.zip
 80%|███████▉  | 527M/663M [01:24<00:25, 5.36MB/s]

### Run some prints to understand each library's output

Once the models are loaded into memory, it is just a matter of running the inputs through them.

In [5]:
# sample sentence:
text = "This is Marcos from the Golden Bridge of Barcelona saying hello to all of Spain!"

# Spacy
print("Spacy: {}\n".format(spacy_ner.process(text)))
# Polyglot
print("Polyglot: {}\n".format(PolyglotNER.process(text)))
# Pavlov
print("Pavlov: {}\n".format(pavlov_ner.process(text)))


Spacy: {'Marcos': 'PERSON', 'the': 'FAC', 'Golden': 'FAC', 'Bridge': 'FAC', 'Barcelona': 'GPE', 'Spain': 'GPE'}

Polyglot: {'Marcos': 'PERSON', 'Barcelona': 'GPE', 'Spain': 'GPE'}

Pavlov: {'Marcos': 'PERSON', 'the': 'FAC', 'Golden': 'FAC', 'Bridge': 'FAC', 'Barcelona': 'GPE', 'Spain': 'GPE'}
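
To see at a glance where the three extractors agree, the printed dictionaries can be compared as sets of `(token, type)` pairs. The sketch below copies the outputs shown above into a plain dict, so it runs without loading any model. The variable names are only illustrative.

```python
# Outputs copied from the sample run above
outputs = {
    "Spacy": {"Marcos": "PERSON", "the": "FAC", "Golden": "FAC",
              "Bridge": "FAC", "Barcelona": "GPE", "Spain": "GPE"},
    "Polyglot": {"Marcos": "PERSON", "Barcelona": "GPE", "Spain": "GPE"},
    "Pavlov": {"Marcos": "PERSON", "the": "FAC", "Golden": "FAC",
               "Bridge": "FAC", "Barcelona": "GPE", "Spain": "GPE"},
}

# Turn each result into a set of (token, type) pairs
pair_sets = {name: set(found.items()) for name, found in outputs.items()}

# Pairs that every extractor produced
agreed = set.intersection(*pair_sets.values())

# Pairs an extractor produced that are not shared by all of them
extras = {name: sorted(pairs - agreed) for name, pairs in pair_sets.items()}

print("Agreed:", sorted(agreed))
for name, extra in extras.items():
    print(name, "extra:", extra)
```

For this sentence all three agree on `Marcos`, `Barcelona` and `Spain`; only Spacy and Pavlov additionally tag the three `FAC` tokens.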



### Set difference and intersection example


In [30]:
GT = {"Africa","Australia"}
ner = {"Africa","ASDs"}

print(GT-ner,len(list(GT-ner)))
print(ner-GT,len(list(ner-GT)))
print(GT.intersection(ner),len(list(GT.intersection(ner))))


{'Australia'} 1
{'ASDs'} 1
{'Africa'} 1
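
These three sets correspond to false negatives (`GT - ner`), false positives (`ner - GT`) and true positives (the intersection). As a rough sketch of what can be derived from them, the hypothetical helper `precision_recall_f1` below computes the usual precision, recall and F1 scores. It is an illustration, not the metric used later in this notebook.

```python
def precision_recall_f1(ground_truth, predicted):
    """Compute precision, recall and F1 from two sets of entity strings."""
    tp = len(ground_truth & predicted)   # found and expected
    fp = len(predicted - ground_truth)   # found but not expected
    fn = len(ground_truth - predicted)   # expected but not found

    # guard every ratio against an empty denominator
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    if precision + recall:
        f1 = 2 * precision * recall / (precision + recall)
    else:
        f1 = 0.0
    return precision, recall, f1


GT = {"Africa", "Australia"}
ner = {"Africa", "ASDs"}
print(precision_recall_f1(GT, ner))  # (0.5, 0.5, 0.5)
```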


### Load the corpus of tagged sentences with entities


In [43]:
from collections import defaultdict
import ast
import numpy as np
import json


"""
This function returns its output in the following JSON/dict format:
{'Is Africa in your catalogue': 
    {'GT': 
          {'Africa': 'LOC'}
    }
}
"""
def load_data(path):
    #Read file
    with open(path) as f:
        lines = f.readlines()

    #Interpret file and convert to lists
    texts = []
    tokens = []
    ent_type = [] 
    for line in lines:
        line = ast.literal_eval(line)
        texts.append(line[0])
        tokens.append(line[1])
        ent_type.append(line[2])

    #Translate from lists to the desired format
    data = {}
    for token,ent,txt in zip(tokens,ent_type,texts):    
        data[txt]={"GT":{}}
        for i,en in enumerate(ent):
            if en!="O":
                data[txt]["GT"][token[i]]=en
    return data


# def get_entities(data: dict) -> Dict[Text[Dict[Text,Text]]]:
def get_entities(corpus):
    entity_results = {}
    for text in corpus:
        # entity_results[text]={}
        entity_results[text]=corpus[text]
        entity_results[text]["Spacy"]=spacy_ner.process(text)
        entity_results[text]["Polyglot"]=PolyglotNER.process(text)
        entity_results[text]["Pavlov"]=pavlov_ner.process(text)
    return entity_results

def evaluate_entities(entities):
    ner_extractors = ["Spacy", "Polyglot", "Pavlov"]
    results = {}
    for extractor in ner_extractors:
        # TN stays 0: true negatives are not counted when comparing entity sets
        results[extractor] = {"TP": 0, "TN": 0, "FP": 0, "FN": 0, "Acc": 0}
        for sentence in entities:
            # only the entity strings are compared; entity types are not checked
            gt = set(entities[sentence]["GT"].keys())
            extractor_result = set(entities[sentence][extractor].keys())
            results[extractor]["TP"] += len(gt & extractor_result)
            results[extractor]["FP"] += len(extractor_result - gt)
            results[extractor]["FN"] += len(gt - extractor_result)

        tp = results[extractor]["TP"]
        tn = results[extractor]["TN"]
        fp = results[extractor]["FP"]
        fn = results[extractor]["FN"]
        total = tp + tn + fp + fn
        # avoid a ZeroDivisionError when nothing was expected or found
        results[extractor]["Acc"] = tp / total if total else 0.0

    return results


# load the entity corpus
corpus = load_data('/content/drive/My Drive/expanded.txt')

# get the entities from all available extractors
entities = get_entities(corpus)
print(entities)

# evaluate each extractor
results = evaluate_entities(entities)
print(results)


with open("/content/drive/My Drive/entities.json", "w") as file:
    json.dump(entities, file, indent=4)

with open("/content/drive/My Drive/results.json", "w") as file:
    json.dump(results, file, indent=4)






ValueError: ignored
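
The cell above ended with a `ValueError`, and the collapsed output does not show where it was raised. One thing worth ruling out is a malformed line in the corpus file, since `ast.literal_eval` raises `ValueError` or `SyntaxError` for input it cannot parse. The sketch below is a hypothetical variant of `load_data` that works on a list of lines and records the numbers of the lines it has to skip. It assumes each line is a Python literal of exactly three items (text, token list, tag list), which is inferred from how `load_data` indexes each line.

```python
import ast


def parse_corpus_lines(lines):
    """Parse corpus lines into {text: {"GT": {token: tag}}} and report bad lines."""
    data, skipped = {}, []
    for line_no, raw in enumerate(lines, start=1):
        try:
            # assumed line format: (text, [tokens...], [tags...])
            text, tokens, tags = ast.literal_eval(raw.strip())
        except (ValueError, SyntaxError, TypeError):
            # blank lines, non-literals, or literals that do not unpack into 3 items
            skipped.append(line_no)
            continue
        # keep only the tokens that carry an entity tag
        data[text] = {"GT": {tok: tag for tok, tag in zip(tokens, tags) if tag != "O"}}
    return data, skipped


sample_lines = [
    "('Is Africa in your catalogue', "
    "['Is', 'Africa', 'in', 'your', 'catalogue'], "
    "['O', 'LOC', 'O', 'O', 'O'])\n",
    "\n",
    "not a literal\n",
]
data, skipped = parse_corpus_lines(sample_lines)
print(data)
print("skipped lines:", skipped)
```

Passing `f.readlines()` from the corpus file to this function would list the offending line numbers, if any, instead of stopping at the first one.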