# Updating spaCy's Named Entity Recognition System

## A toy example (40%)

spaCy is an open-source library for advanced natural language processing, written in Python and Cython. In this homework, we use spaCy for named entity recognition (NER). Given the sentence 'Theresa May is a British politician serving as Prime Minister of the United Kingdom and Leader of the Conservative Party since 2016', apply spaCy to label which words are geo-political entities (GPE), organizations (ORG), dates (DATE), and so on.

In [1]:
from IPython.display import HTML, display
import tabulate
import spacy

import en_core_web_sm
nlp = en_core_web_sm.load()

text = "Theresa May is a British politician serving as Prime Minister of the United Kingdom and Leader of the Conservative Party since 2016. "

doc = nlp(text)
entities = [(t.text, t.ent_iob_, t.ent_type_) for t in doc]
display(HTML(tabulate.tabulate(entities, tablefmt='html')))

0,1,2
Theresa,B,PERSON
May,I,PERSON
is,O,
a,O,
British,B,NORP
politician,O,
serving,O,
as,O,
Prime,O,
Minister,O,


The NER labels from the pretrained model will probably not be very good. You can use some training texts to improve them: try the following training texts and redo the problem.

In [2]:
training_texts = [
    (["Theresa", "May", "is", "determined", "to", "leave", "the", "EU", "in", "March", "."],
     ["B-PERSON", "L-PERSON", "O", "O", "O", "O", "O", "U-ORG", "O", "U-DATE", "O"]
    ),
    (["Theresa", "May", "says", "she", "will", "seek", "a", "pragmatic", "Brexit", "deal", "."],
     ["B-PERSON", "L-PERSON", "O", "O", "O", "O", "O", "O", "O", "O", "O"]
    ),
    (["Theresa", "May", "vows", "to", "battle", "in", "Brussels", "."],
     ["B-PERSON", "L-PERSON", "O", "O", "O", "O", "U-GPE", "O"]
    )
]
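
The tags in `training_texts` follow the BILUO scheme: `B-` begins a multi-token entity, `I-` continues it, `L-` ends it, `U-` marks a single-token entity, and `O` is outside any entity. As a minimal sketch (the helper `biluo_to_spans` is a hypothetical name, not part of spaCy, and it assumes a well-formed tag sequence), the tags can be decoded into entity spans like this:

```python
def biluo_to_spans(tokens, tags):
    """Decode BILUO tags into (entity_text, label) pairs.

    Hypothetical helper for illustration; assumes the tags are well formed.
    """
    spans = []
    current = []  # tokens of the multi-token entity being built
    for token, tag in zip(tokens, tags):
        if tag == "O":
            continue
        prefix, label = tag.split("-", 1)
        if prefix == "U":      # single-token entity
            spans.append((token, label))
        elif prefix == "B":    # first token of a multi-token entity
            current = [token]
        elif prefix == "I":    # inside a multi-token entity
            current.append(token)
        elif prefix == "L":    # last token: close the entity
            current.append(token)
            spans.append((" ".join(current), label))
            current = []
    return spans


tokens = ["Theresa", "May", "is", "determined", "to", "leave", "the", "EU", "in", "March", "."]
tags = ["B-PERSON", "L-PERSON", "O", "O", "O", "O", "O", "U-ORG", "O", "U-DATE", "O"]
print(biluo_to_spans(tokens, tags))
# [('Theresa May', 'PERSON'), ('EU', 'ORG'), ('March', 'DATE')]
```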


In [3]:
from spacy.tokens import Doc
from spacy.gold import GoldParse

training_data = []
for tokens, annotation in training_texts:
    doc = Doc(nlp.vocab, words=tokens)
    gold = GoldParse(doc, entities=annotation)
    training_data.append((doc, gold))

In [4]:
import random
from tqdm.notebook import tqdm

# Train for 10 epochs, reshuffling the training data before each pass
for _ in tqdm(range(10)):
    random.shuffle(training_data)
    for doc, gold in training_data:
        nlp.update([doc], [gold], drop=0.3)

In [5]:
text = "Theresa May is a British politician serving as Prime Minister of the United Kingdom and Leader of the Conservative Party since 2016. "

doc = nlp(text) # Run the updated pipeline on the text
entities = [(t.text, t.ent_iob_, t.ent_type_) for t in doc]
display(HTML(tabulate.tabulate(entities, tablefmt='html')))

0,1,2
Theresa,B,PERSON
May,I,PERSON
is,O,
a,O,
British,O,
politician,O,
serving,O,
as,O,
Prime,O,
Minister,O,


Install wget if you don't already have it. If running wget from the notebook is problematic, run the `wget` commands from the cell below in a terminal instead (without the leading `!`).
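
If wget is not available at all, the files can also be fetched with Python's standard library. The following is only a sketch: `fetch` is a hypothetical helper, and the usage lines in the comments assume the same URLs and target folder as the wget cell.

```python
import urllib.request
from pathlib import Path


def fetch(url, dest_dir):
    """Download `url` into `dest_dir` and return the local path.

    Hypothetical helper; uses only the standard library.
    """
    dest_dir = Path(dest_dir).expanduser()
    dest_dir.mkdir(parents=True, exist_ok=True)
    # Name the local file after the last component of the URL.
    target = dest_dir / url.rsplit("/", 1)[-1]
    urllib.request.urlretrieve(url, str(target))
    return target


# Usage (needs network access):
# base = "https://raw.githubusercontent.com/teropa/nlp/master/resources/corpora/conll2002/"
# for name in ["ned.train", "ned.testa", "ned.testb"]:
#     fetch(base + name, "~/Downloads/data/ner")
```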

In [6]:
!wget https://raw.githubusercontent.com/teropa/nlp/master/resources/corpora/conll2002/ned.train -P ~/Downloads/data/ner/
!wget https://raw.githubusercontent.com/teropa/nlp/master/resources/corpora/conll2002/ned.testa -P ~/Downloads/data/ner/
!wget https://raw.githubusercontent.com/teropa/nlp/master/resources/corpora/conll2002/ned.testb -P ~/Downloads/data/ner/

--2020-04-27 20:06:11--  https://raw.githubusercontent.com/teropa/nlp/master/resources/corpora/conll2002/ned.train
Resolving raw.githubusercontent.com... 151.101.40.133
Connecting to raw.githubusercontent.com|151.101.40.133|:443... connected.
HTTP request sent, awaiting response... 200 OK
Length: 2377174 (2.3M) [text/plain]
Saving to: '/Users/mrunalikhandat/Downloads/data/ner/ned.train.2'


2020-04-27 20:06:12 (2.78 MB/s) - '/Users/mrunalikhandat/Downloads/data/ner/ned.train.2' saved [2377174/2377174]

--2020-04-27 20:06:12--  https://raw.githubusercontent.com/teropa/nlp/master/resources/corpora/conll2002/ned.testa
Resolving raw.githubusercontent.com... 151.101.40.133
Connecting to raw.githubusercontent.com|151.101.40.133|:443... connected.
HTTP request sent, awaiting response... 200 OK
Length: 450785 (440K) [text/plain]
Saving to: '/Users/mrunalikhandat/Downloads/data/ner/ned.testa.2'


2020-04-27 20:06:13 (3.04 MB/s) - '/Users/mrunalikhandat/Downloads/data/ner/ned.testa.2' saved [450

## Training an NER model on Dutch CONLL data (60%)

In practice, however, you'll likely have more training data than just three examples with the same entity. Things become really interesting when you have access to a labelled data set of hundreds or more examples of several entity types: CVs that have been labelled with job titles and skills, medical documents that have been labelled with symptoms and diseases, etc.

As an example, let's train a Named Entity Recognition model on the Dutch data that was collected for the [CoNLL-2002 Shared Task](https://www.clips.uantwerpen.be/conll2002/ner/). This data can be downloaded from GitHub. Can you evaluate the spaCy model's performance, e.g., compute precision, recall, and F1-score for each label (LOC, MISC, O, ORG, PER, etc.)?






In [7]:
from operator import itemgetter

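# Relative to the working directory; adjust these paths if the files
# were saved elsewhere (the wget cell above uses ~/Downloads/data/ner/)
train_file = "data/ner/ned.train"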
dev_file = "data/ner/ned.testa"
test_file = "data/ner/ned.testb"

def read_conll_file(f):
    """
    Read a CoNLL-formatted file: one token per line,
    sentences separated by a blank line.
    Output- list of sentences, each a list of
            whitespace-split token rows
    """
    data = []
    with open(f) as i:
        sentences = i.read().strip().split("\n\n")
        
    for sentence in sentences:
        data.append([token.split() for token in sentence.split("\n")])

    return data
        
train_data = read_conll_file(train_file)
dev_data = read_conll_file(dev_file)
test_data = read_conll_file(test_file)
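
`read_conll_file` expects one token per line with whitespace-separated columns and a blank line between sentences; the later cells read the token from column 0 and the entity tag from column 2. Here is a minimal sketch of the same parsing on an in-memory string. The sample sentences and their part-of-speech tags are invented for illustration, not taken from the corpus, and `parse_conll` is a hypothetical name.

```python
def parse_conll(text):
    """Split CoNLL-style text into sentences of [token, pos, tag] rows."""
    sentences = text.strip().split("\n\n")
    return [[line.split() for line in sentence.split("\n")] for sentence in sentences]


# Invented two-sentence sample in the three-column layout assumed above.
sample = """Jan N B-PER
woont V O
in Prep O
Amsterdam N B-LOC
. Punc O

Hij Pron O
werkt V O
. Punc O
"""

data = parse_conll(sample)
print(len(data))                    # 2
print([row[0] for row in data[0]])  # ['Jan', 'woont', 'in', 'Amsterdam', '.']
print([row[2] for row in data[0]])  # ['B-PER', 'O', 'O', 'B-LOC', 'O']
```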

In [8]:
from sklearn.metrics import classification_report, precision_recall_fscore_support

def evaluate(model, data, verbose=0): 
    """
    Evaluate the token-level performance of the model's 'ner' pipe.
    Prints a per-label classification report if verbose is set.
    Output- tuple (precision, recall, F1 score, support),
            micro-averaged over all labels, including "O"
    
    """
    ner = model.get_pipe("ner")
    
    correct, predicted = [], []
    for sentence in data:
        tokens = [t[0] for t in sentence]
        ent_labels = [t[2].split("-")[-1] for t in sentence]
        
        doc = Doc(model.vocab, words=tokens)
        ner(doc)
        
        pred_labels = [t.ent_type_ or "O" for t in doc]
        correct += ent_labels
        predicted += pred_labels
        
    if verbose:
        print(classification_report(correct, predicted))
    
    return precision_recall_fscore_support(correct, predicted, average="micro")
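
`evaluate` scores token by token: the `B-`/`I-` prefix is stripped from each gold tag, every token without a predicted entity counts as `O`, and the two flat label lists are compared. The per-label numbers that `classification_report` prints can be reproduced by hand. This is a rough sketch with toy label lists invented for illustration; `per_label_scores` is a hypothetical helper, not part of scikit-learn.

```python
from collections import Counter


def per_label_scores(gold, pred):
    """Token-level precision, recall and F1 for every label seen in gold or pred."""
    tp, gold_n, pred_n = Counter(), Counter(), Counter()
    for g, p in zip(gold, pred):
        gold_n[g] += 1
        pred_n[p] += 1
        if g == p:
            tp[g] += 1
    scores = {}
    for label in sorted(set(gold_n) | set(pred_n)):
        precision = tp[label] / pred_n[label] if pred_n[label] else 0.0
        recall = tp[label] / gold_n[label] if gold_n[label] else 0.0
        total = precision + recall
        f1 = 2 * precision * recall / total if total else 0.0
        scores[label] = (precision, recall, f1)
    return scores


gold = ["PER", "PER", "O", "O", "LOC", "O"]
pred = ["PER", "O",   "O", "O", "LOC", "LOC"]
for label, (p, r, f) in per_label_scores(gold, pred).items():
    print(f"{label:>4}  P={p:.2f}  R={r:.2f}  F1={f:.2f}")
```

Labels that are predicted but never occur in the gold data end up with support 0, as the `CARDINAL` or `DATE` rows do in the pretrained model's report in this notebook.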


In [9]:
# Evaluate the pretrained Dutch model (nl_core_news_sm) on our test data
import nl_core_news_sm
nlp = nl_core_news_sm.load()
evaluate(nlp, test_data, verbose=1)



              precision    recall  f1-score   support

    CARDINAL       0.00      0.00      0.00         0
        DATE       0.00      0.00      0.00         0
       EVENT       0.00      0.00      0.00         0
         FAC       0.00      0.00      0.00         0
         GPE       0.00      0.00      0.00         0
    LANGUAGE       0.00      0.00      0.00         0
         LAW       0.00      0.00      0.00         0
         LOC       0.22      0.02      0.04       823
        MISC       0.00      0.00      0.00      1597
       MONEY       0.00      0.00      0.00         0
        NORP       0.00      0.00      0.00         0
           O       0.98      0.91      0.94     63236
     ORDINAL       0.00      0.00      0.00         0
         ORG       0.48      0.31      0.37      1433
         PER       0.00      0.00      0.00      1905
     PERCENT       0.00      0.00      0.00         0
      PERSON       0.00      0.00      0.00         0
     PRODUCT       0.00    

(0.8447256283155057, 0.8447256283155057, 0.8447256283155057, None)
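
The three identical numbers in the tuple above are not a coincidence. When every token receives exactly one label and all labels (including `O`) are averaged with `average="micro"`, each wrong token counts once as a false positive and once as a false negative, so micro precision, recall and F1 all reduce to token accuracy. A small sketch with invented labels (`micro_prf` is a hypothetical helper):

```python
def micro_prf(gold, pred):
    """Micro-averaged precision, recall and F1 over all labels, 'O' included."""
    tp = fp = fn = 0
    for label in set(gold) | set(pred):
        for g, p in zip(gold, pred):
            if p == label and g == label:
                tp += 1
            elif p == label:
                fp += 1   # predicted this label, gold says otherwise
            elif g == label:
                fn += 1   # gold has this label, prediction missed it
    if tp == 0:
        return 0.0, 0.0, 0.0
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    f1 = 2 * precision * recall / (precision + recall)
    return precision, recall, f1


gold = ["PER", "PER", "O", "O", "LOC", "O"]
pred = ["PER", "O",   "O", "O", "LOC", "LOC"]
accuracy = sum(g == p for g, p in zip(gold, pred)) / len(gold)
print(micro_prf(gold, pred), accuracy)
```

Because `O` dominates the corpus (63,236 test tokens in the report above), this single number mostly reflects how well non-entity tokens are recognized; the per-label rows are more informative for the entity classes.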

spaCy's NER supports several annotation schemes for labelling entity tokens; more information can be found at https://spacy.io/api/annotation. The CoNLL data uses IOB tags, so we convert them to BILUO here. BILUO additionally marks the last token of an entity (L) and single-token entities (U), and these explicit boundaries generally make the tagging task easier to learn than plain IOB.
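
To make the difference between the two schemes concrete, here is a rough pure-Python sketch of such a conversion. It is not spaCy's `iob_to_biluo` implementation: the function name `iob2_to_biluo` is hypothetical, and the sketch assumes IOB2-style input in which every entity starts with a `B-` tag.

```python
def iob2_to_biluo(tags):
    """Convert IOB2 tags (every entity starts with B-) to BILUO tags."""
    biluo = []
    for i, tag in enumerate(tags):
        if tag == "O":
            biluo.append(tag)
            continue
        prefix, label = tag.split("-", 1)
        next_tag = tags[i + 1] if i + 1 < len(tags) else "O"
        # The entity continues only if the next tag is I- with the same label.
        continues = next_tag == "I-" + label
        if prefix == "B":
            biluo.append(("B-" if continues else "U-") + label)
        else:  # prefix == "I"
            biluo.append(("I-" if continues else "L-") + label)
    return biluo


iob = ["B-PER", "I-PER", "I-PER", "O", "B-LOC", "O", "B-ORG", "I-ORG"]
print(iob2_to_biluo(iob))
# ['B-PER', 'I-PER', 'L-PER', 'O', 'U-LOC', 'O', 'B-ORG', 'L-ORG']
```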

In [10]:
from spacy.gold import iob_to_biluo
# Converting training_data from IOB annotation scheme to BILUO scheme
training_data = []
for sentence in train_data:
    tokens = [t[0] for t in sentence]
    ent_labels = iob_to_biluo([t[2] for t in sentence])
    doc = Doc(nlp.vocab, words=tokens)
    gold = GoldParse(doc, entities=ent_labels)
    training_data.append((doc, gold))

The following cell trains a custom NER model on our data. The steps are:
- Check whether a pretrained model was passed in.
- If no model is given, create a blank Dutch model with a new 'ner' pipe. If a model is given, check its pipes and add 'ner' if it is missing.
- When training an existing model, disable the other pipes so that only 'ner' is updated.
- The 'ner' pipe must be present whether we train a new model or continue training an existing one.
- A flag, reset_weights, records whether 'ner' had to be added, in which case training starts from a blank model.
- Add the labels "PER", "LOC", "ORG" and "MISC" to the 'ner' pipe.
- Train the 'ner' pipe on training_data.
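
The last step is carried out in shuffled minibatches: the training cell reshuffles the data every epoch and passes `size=32` to `spacy.util.minibatch`. The sketch below is a pure-Python stand-in for fixed-size batching, not spaCy's implementation; `simple_minibatch` and the placeholder data are assumptions made for illustration.

```python
import random
from itertools import islice


def simple_minibatch(items, size):
    """Yield consecutive lists of at most `size` items."""
    iterator = iter(items)
    while True:
        batch = list(islice(iterator, size))
        if not batch:
            return
        yield batch


examples = list(range(70))   # placeholder for (doc, gold) pairs
random.shuffle(examples)     # reshuffle at the start of every epoch
batch_sizes = [len(batch) for batch in simple_minibatch(examples, size=32)]
print(batch_sizes)  # [32, 32, 6]
```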

In [11]:
from spacy.util import minibatch
from pathlib import Path

def train(train_docs, dev_data, output_dir, model=None, max_epochs=100): 
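    """
    Train an NER model on train_docs, evaluating on dev_data after
    every epoch. The best model so far is saved to output_dir, and
    training stops once the development F-score has not improved
    for `patience` epochs. Pass model=None to start from a blank model.
    """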
    reset_weights = False
    print(nlp.pipe_names)
    if "ner" not in nlp.pipe_names:
        ner = nlp.create_pipe("ner")
        nlp.add_pipe(ner, last=True)
        reset_weights = True
    else:
        ner = nlp.get_pipe("ner")
    print(ner,reset_weights)
    
    if not model or reset_weights: 
        print("inside reset weights")
        model = spacy.blank("nl")
        ner = model.create_pipe("ner")
        model.add_pipe(ner, last=True)
    for label in ["PER", "LOC", "ORG", "MISC"]: 
        ner.add_label(label)
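    # Note: begin_training() initializes the weights. For continued training,
    # the spaCy v2 training examples call model.resume_training() instead.
    model.begin_training()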
        
    other_pipes = [pipe for pipe in model.pipe_names if pipe != 'ner']
    print(other_pipes)
    fscore_history = []
    patience=3
        
    with model.disable_pipes(*other_pipes):
        print("inside disable pipes")
    
        for i in range(max_epochs):

            losses = {}
            random.shuffle(train_docs)
            batches = minibatch(train_docs, size=32)
            for batch in tqdm(batches):
                docs, golds = zip(*batch)

                model.update(
                    docs,
                    golds,
                    drop=0.4,
                    losses=losses)
            print("Training Loss:", losses)
            
            _, _, dev_f, _ = evaluate(model, dev_data)
            print("Development F-score:", dev_f)
            
            if len(fscore_history) > 0 and dev_f > max(fscore_history): 
                if output_dir is not None:
                    output_dir = Path(output_dir)
                    # Also creates missing parent folders such as "models/"
                    output_dir.mkdir(parents=True, exist_ok=True)
                    model.to_disk(output_dir)
                    print("Saved model to", output_dir)
            
            fscore_history.append(dev_f)
            
            if max(fscore_history) > max(fscore_history[-patience:]):
                print("No improvement on development set. Stop training.")
                break
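
The checkpointing and early-stopping rules at the end of `train` can be traced without spaCy. The sketch below replays them on the development F-scores logged for the from-scratch run in this notebook (rounded to four decimals); `replay` is a hypothetical helper, not part of the notebook.

```python
def replay(dev_scores, patience=3):
    """Return (epochs at which a checkpoint is saved, epoch at which training stops)."""
    history = []
    saved = []
    for epoch, dev_f in enumerate(dev_scores, start=1):
        # Save only when the new score beats every earlier one,
        # so the very first epoch is never saved.
        if history and dev_f > max(history):
            saved.append(epoch)
        history.append(dev_f)
        # Stop once the best score is older than the last `patience` epochs.
        if max(history) > max(history[-patience:]):
            return saved, epoch
    return saved, None


scratch_scores = [0.9462, 0.9593, 0.9642, 0.9648, 0.9683, 0.9686, 0.9664,
                  0.9689, 0.9677, 0.9687, 0.9709, 0.9666, 0.9691, 0.9702]
print(replay(scratch_scores))
# ([2, 3, 4, 5, 6, 8, 11], 14)
```

The first epoch is never written to disk under this rule, which matches the log: the first "Saved model" line appears after the second epoch.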

In [12]:
# Training a model from scratch

output_dir_scratch = "models/spacy_ner_scratch"
train(training_data, dev_data, model=None, output_dir=output_dir_scratch)

['tagger', 'parser', 'ner']
<spacy.pipeline.pipes.EntityRecognizer object at 0x1a2dd604b0> False
inside reset weights
[]
inside disable pipes


Training Loss: {'ner': 16851.210388486797}
Development F-score: 0.9462408304864808

Training Loss: {'ner': 10228.721772782765}
Development F-score: 0.9592701464473928
Saved model to models/spacy_ner_scratch

Training Loss: {'ner': 8000.932297884683}
Development F-score: 0.964248828156034
Saved model to models/spacy_ner_scratch

Training Loss: {'ner': 6777.075392361529}
Development F-score: 0.9648049574958291
Saved model to models/spacy_ner_scratch

Training Loss: {'ner': 5890.74409409282}
Development F-score: 0.9682741452821695
Saved model to models/spacy_ner_scratch

Training Loss: {'ner': 5191.929908104242}
Development F-score: 0.9686448981753661
Saved model to models/spacy_ner_scratch

Training Loss: {'ner': 4795.659536664179}
Development F-score: 0.9664468631657

Training Loss: {'ner': 4379.598490996272}
Development F-score: 0.9688832393209925
Saved model to models/spacy_ner_scratch

Training Loss: {'ner': 4015.421437057075}
Development F-score: 0.9677444982918885

Training Loss: {'ner': 3791.0608238737577}
Development F-score: 0.9686713805248802

Training Loss: {'ner': 3425.342025940746}
Development F-score: 0.9708958978840603
Saved model to models/spacy_ner_scratch

Training Loss: {'ner': 3285.1435861859245}
Development F-score: 0.9666322396122984

Training Loss: {'ner': 3077.7436384366774}
Development F-score: 0.969121580466619

Training Loss: {'ner': 2809.3543861899543}
Development F-score: 0.970207356796695
No improvement on development set. Stop training.


In [13]:
# Continue training on top of the pretrained nlp model
# This model is expected to give better results because spaCy's pretrained nlp is the starting point
output_dir_cntd = "models/spacy_ner_cntd"
train(training_data, dev_data, model=nlp, output_dir=output_dir_cntd)

['tagger', 'parser', 'ner']
<spacy.pipeline.pipes.EntityRecognizer object at 0x1a2dd604b0> False
['tagger', 'parser']
inside disable pipes


Training Loss: {'ner': 114947.73417854309}
Development F-score: 0.9525965943698524

Training Loss: {'ner': 110180.59282875061}
Development F-score: 0.9591377346998226
Saved model to models/spacy_ner_cntd

Training Loss: {'ner': 108978.9408569336}
Development F-score: 0.9605412992240672
Saved model to models/spacy_ner_cntd

Training Loss: {'ner': 108514.84890174866}
Development F-score: 0.9628717459813034
Saved model to models/spacy_ner_cntd

Training Loss: {'ner': 108059.70310401917}
Development F-score: 0.9624745107385927

Training Loss: {'ner': 107620.2323884964}
Development F-score: 0.9600381345833002

Training Loss: {'ner': 107349.48663520813}
Development F-score: 0.9650962633404836
Saved model to models/spacy_ner_cntd

Training Loss: {'ner': 106713.5507364273}
Development F-score: 0.963189534175472

Training Loss: {'ner': 106931.87976264954}
Development F-score: 0.9655199809327083
Saved model to models/spacy_ner_cntd

Training Loss: {'ner': 106780.37586402893}
Development F-score: 0.9654934985831943

Training Loss: {'ner': 106683.66618347168}
Development F-score: 0.9661290749715314
Saved model to models/spacy_ner_cntd

Training Loss: {'ner': 106312.89533996582}
Development F-score: 0.9637986282142952

Training Loss: {'ner': 105938.4668750763}
Development F-score: 0.9661555573210455
Saved model to models/spacy_ner_cntd

Training Loss: {'ner': 106621.87007331848}
Development F-score: 0.9650697809909695

Training Loss: {'ner': 106051.79179954529}
Development F-score: 0.9655729456317365

Training Loss: {'ner': 106155.75259017944}
Development F-score: 0.9644342046026323
No improvement on development set. Stop training.


In [19]:
# Performance Evaluation
nlp_base = nl_core_news_sm.load()
nlp_scratch = spacy.load(output_dir_scratch)
nlp_cntd = spacy.load(output_dir_cntd)

print("*"*60)
print("\t\t\tBase Model")
print("*"*60)
evaluate(nlp_base, test_data, verbose=1)
print("*"*60)
print("\t\t\tNew Model")
print("*"*60)
evaluate(nlp_scratch, test_data, verbose=1)
print("*"*60)
print("\t\t\tContinued Model")
print("*"*60)
evaluate(nlp_cntd, test_data, verbose=1)

************************************************************
			Base Model
************************************************************
              precision    recall  f1-score   support

    CARDINAL       0.00      0.00      0.00         0
        DATE       0.00      0.00      0.00         0
       EVENT       0.00      0.00      0.00         0
         FAC       0.00      0.00      0.00         0
         GPE       0.00      0.00      0.00         0
    LANGUAGE       0.00      0.00      0.00         0
         LAW       0.00      0.00      0.00         0
         LOC       0.22      0.02      0.04       823
        MISC       0.00      0.00      0.00      1597
       MONEY       0.00      0.00      0.00         0
        NORP       0.00      0.00      0.00         0
           O       0.98      0.91      0.94     63236
     ORDINAL       0.00      0.00      0.00         0
         ORG       0.48      0.31      0.37      1433
         PER       0.00      0.00      0.00      1905

(0.9726063135924863, 0.9726063135924863, 0.9726063135924863, None)

### References

- https://towardsdatascience.com/custom-named-entity-recognition-using-spacy-7140ebbb3718
- https://github.com/nlptown/nlp-notebooks/blob/master/Updating%20spaCy's%20Named%20Entity%20Recognition%20System.ipynb
- https://spacy.io/usage/training
- https://spacy.io/usage/linguistic-features#named-entities
- https://spacy.io/api/goldparse
- https://spacy.io/api/annotation