# Updating spaCy's Named Entity Recognition System

Pretrained models are simple to use, but they're unlikely to obtain state-of-the-art performance if your data differs even slightly from the type of data they were trained on. If state-of-the-art performance is what you're looking for, at some point you're going to want to train your own model. Luckily, spaCy allows this, too. In fact, spaCy offers us two options: we can train a model from scratch, or continue training its pretrained model with our own data.

## A toy example

SpaCy's pretrained named entity recognition model is pretty good, but of course, now and then it makes mistakes. Take a look at the sentence `Theresa May is a British politician serving as Prime Minister of the United Kingdom and Leader of the Conservative Party since 2016`. SpaCy successfully recognizes `British` as a nationality (NORP), `the United Kingdom` as a geo-political entity (GPE), `the Conservative Party` as an organization (ORG), and `2016` as a date. However, it does not recognize Theresa May as a person. Instead, it labels `Theresa` as an organization, and `May` as a date.

In [1]:
from IPython.display import HTML, display
import tabulate
import spacy

nlp = spacy.load("en")
text = "Theresa May is a British politician serving as Prime Minister of the United Kingdom and Leader of the Conservative Party since 2016. "

doc = nlp(text)
entities = [(t.text, t.ent_iob_, t.ent_type_) for t in doc]
display(HTML(tabulate.tabulate(entities, tablefmt='html')))

0,1,2
Theresa,B,ORG
May,B,DATE
is,O,
a,O,
British,B,NORP
politician,O,
serving,O,
as,O,
Prime,O,
Minister,O,


Let's fix this by giving the model some more training data. Obviously, we're not going to give it the exact sentence above, as that would make the task just a bit too easy. Instead, we're going to use similar sentences with our target entity. We split each of these sentences into its tokens, and provide each token with its correct label. In contrast to spaCy's output labelling scheme, these training labels follow the BILUO scheme. This means we don't just mark tokens at the Beginning (B) and Inside (I) of entities, but also tokens that are Last in an entity (L) and tokens that make up an entity all by themselves (Unit, U). Tokens Outside any entity get the label O.

In [2]:
training_texts = [
    (["Theresa", "May", "is", "determined", "to", "leave", "the", "EU", "in", "March", "."],
     ["B-PERSON", "L-PERSON", "O", "O", "O", "O", "O", "U-ORG", "O", "U-DATE", "O"]
    ),
    (["Theresa", "May", "says", "she", "will", "seek", "a", "pragmatic", "Brexit", "deal", "."],
     ["B-PERSON", "L-PERSON", "O", "O", "O", "O", "O", "O", "O", "O", "O"]
    ),
    (["Theresa", "May", "vows", "to", "battle", "in", "Brussels", "."],
     ["B-PERSON", "L-PERSON", "O", "O", "O", "O", "U-GPE", "O"]
    )
]
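
To make the BILUO labels more concrete, here is a minimal sketch in plain Python that groups a labelled sentence back into its entities. The helper name `biluo_to_spans` is made up for this illustration and is not part of spaCy; it assumes the labels are well-formed (every `B-` is eventually closed by an `L-` of the same type).

```python
def biluo_to_spans(tokens, tags):
    """Group BILUO-tagged tokens into (entity text, label) pairs."""
    spans = []
    start = None  # index of the token that opened the current multi-token entity
    for i, tag in enumerate(tags):
        if tag == "O":
            continue
        prefix, label = tag.split("-", 1)
        if prefix == "U":
            # A single-token entity opens and closes on the same token.
            spans.append((tokens[i], label))
        elif prefix == "B":
            start = i
        elif prefix == "L" and start is not None:
            spans.append((" ".join(tokens[start:i + 1]), label))
            start = None
        # "I-" tokens need no action: they sit between a B and an L.
    return spans


tokens = ["Theresa", "May", "is", "determined", "to", "leave", "the", "EU", "in", "March", "."]
tags = ["B-PERSON", "L-PERSON", "O", "O", "O", "O", "O", "U-ORG", "O", "U-DATE", "O"]
print(biluo_to_spans(tokens, tags))
# [('Theresa May', 'PERSON'), ('EU', 'ORG'), ('March', 'DATE')]
```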


For each of these training sentences, we make a spaCy document, reusing the vocabulary of the spaCy model we're using. Because we've already taken care of the tokenization, we also pass the tokens explicitly. Next, we combine this document with the correct labels in a so-called GoldParse object.

In [3]:
from spacy.tokens import Doc
from spacy.gold import GoldParse

training_data = []
for tokens, annotation in training_texts:
    doc = Doc(nlp.vocab, words=tokens)
    gold = GoldParse(doc, entities=annotation)
    training_data.append((doc, gold))

Now we're going to do the actual training. This means we're going to let our model see the labelled training data several times. For each of these so-called epochs, we shuffle the training data to avoid any form of bias, and update the model with each of the training documents and its gold parse. We do this 10 times.

In [4]:
import random
from tqdm import tqdm_notebook as tqdm

for _ in tqdm(range(10)):
    random.shuffle(training_data)
    for doc, gold in training_data:
        nlp.update([doc], [gold], drop=0.3)

Let's now test the model on the same sentence as before. The output shows it still recognizes all the correct entities it found before, but now it has also identified `Theresa May` as a person. Hurray!

In [5]:
text = "Theresa May is a British politician serving as Prime Minister of the United Kingdom and Leader of the Conservative Party since 2016. "

doc = nlp(text)
entities = [(t.text, t.ent_iob_, t.ent_type_) for t in doc]
display(HTML(tabulate.tabulate(entities, tablefmt='html')))

0,1,2
Theresa,B,PERSON
May,I,PERSON
is,O,
a,O,
British,B,NORP
politician,O,
serving,O,
as,O,
Prime,O,
Minister,O,


## Training an NER model on Dutch CoNLL data

In practice, however, you'll likely have more training data than just three examples with the same entity. Things become really interesting when you have access to a labelled data set of hundreds or more examples of several entity types: CVs that have been labelled with job titles and skills, medical documents that have been labelled with symptoms and diseases, etc.

As an example, let's train a Named Entity Recognition model on the Dutch data that was collected for the [CoNLL-2002 Shared Task](https://www.clips.uantwerpen.be/conll2002/ner/). This data can be downloaded from GitHub.

In [6]:
!wget https://raw.githubusercontent.com/teropa/nlp/master/resources/corpora/conll2002/ned.train -P data/ner/
!wget https://raw.githubusercontent.com/teropa/nlp/master/resources/corpora/conll2002/ned.testa -P data/ner/
!wget https://raw.githubusercontent.com/teropa/nlp/master/resources/corpora/conll2002/ned.testb -P data/ner/

--2019-02-04 20:11:19--  https://raw.githubusercontent.com/teropa/nlp/master/resources/corpora/conll2002/ned.train
Resolving raw.githubusercontent.com... 151.101.36.133
Connecting to raw.githubusercontent.com|151.101.36.133|:443... connected.
HTTP request sent, awaiting response... 200 OK
Length: 2377174 (2.3M) [text/plain]
Saving to: 'data/ner/ned.train.6'


2019-02-04 20:11:21 (3.88 MB/s) - 'data/ner/ned.train.6' saved [2377174/2377174]

--2019-02-04 20:11:21--  https://raw.githubusercontent.com/teropa/nlp/master/resources/corpora/conll2002/ned.testa
Resolving raw.githubusercontent.com... 151.101.36.133
Connecting to raw.githubusercontent.com|151.101.36.133|:443... connected.
HTTP request sent, awaiting response... 200 OK
Length: 450785 (440K) [text/plain]
Saving to: 'data/ner/ned.testa.6'


2019-02-04 20:11:21 (3.41 MB/s) - 'data/ner/ned.testa.6' saved [450785/450785]

--2019-02-04 20:11:21--  https://raw.githubusercontent.com/teropa/nlp/master/resources/corpora/conll2002/ned.testb


The Dutch CoNLL data comes in the CoNLL format (surprise, surprise). This means every line in the text files contains a token, and sentences are separated by empty lines. Every line consists of several whitespace-separated fields. For our purposes, we're just interested in the token itself (the first field), and its named entity label (the last field).

In [8]:
train_file = "data/ner/ned.train"
dev_file = "data/ner/ned.testa"
test_file = "data/ner/ned.testb"

def read_conll_file(path):
    """Read a CoNLL file into a list of sentences, each a list of field lists."""
    with open(path) as f:
        # Sentences are separated by empty lines.
        sentences = f.read().strip().split("\n\n")

    # Each line holds one token; split it into its whitespace-separated fields.
    return [[line.split() for line in sentence.split("\n")]
            for sentence in sentences]
        
train_data = read_conll_file(train_file)
dev_data = read_conll_file(dev_file)
test_data = read_conll_file(test_file)
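
To see what this parsing step produces without downloading anything, the sketch below applies the same splitting logic to a small in-memory string. The two sentences and their part-of-speech tags are invented for illustration and are not taken from the CoNLL files; `parse_conll` is a hypothetical helper name.

```python
# A made-up sample in the three-column layout: token, POS tag, entity label.
SAMPLE = """De Art O
heer N O
Jansen N B-PER
woont V O
in Prep O
Gent N B-LOC
. Punc O

Hij Pron O
werkt V O
. Punc O
"""


def parse_conll(text):
    """Split CoNLL-style text into sentences, tokens, and fields."""
    sentences = text.strip().split("\n\n")
    return [[line.split() for line in sentence.split("\n")]
            for sentence in sentences]


data = parse_conll(SAMPLE)
tokens = [fields[0] for fields in data[0]]    # first field: the token
labels = [fields[-1] for fields in data[0]]   # last field: the entity label
print(len(data))
print(list(zip(tokens, labels)))
```

Each sentence ends up as a list of `[token, pos, label]` lists, which is the structure the later cells index with `t[0]` and `t[2]`.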

The Dutch CoNLL data contains the same entity types as spaCy's named entity pipe, but it wasn't part of that model's training data. As a result, spaCy's pretrained model performs so-so on the test data: it achieves an F-score of 63% for locations, 68% for organizations, 79% for persons and 54% for miscellaneous entities. Don't be fooled by the high micro-averaged F-score of 97%: it's mainly due to the high accuracy on O tokens, which far outnumber the entities in our data. The total performance is not bad, but it's not very good, either.

In [9]:
from sklearn.metrics import classification_report, precision_recall_fscore_support

def evaluate(model, data, verbose=0): 

    ner = model.get_pipe("ner")
    
    correct, predicted = [], []
    for sentence in data:
        tokens = [t[0] for t in sentence]
        ent_labels = [t[2].split("-")[-1] for t in sentence]
        
        doc = Doc(model.vocab, words=tokens)
        ner(doc)
        
        pred_labels = [t.ent_type_ or "O" for t in doc]
        correct += ent_labels
        predicted += pred_labels
        
    if verbose:
        print(classification_report(correct, predicted))
    
    return precision_recall_fscore_support(correct, predicted, average="micro")

In [10]:
nlp = spacy.load("nl")
evaluate(nlp, test_data, verbose=1)

              precision    recall  f1-score   support

         LOC       0.50      0.85      0.63       823
        MISC       0.66      0.45      0.54      1597
           O       0.99      0.99      0.99     63236
         ORG       0.71      0.64      0.68      1433
         PER       0.77      0.81      0.79      1905

   micro avg       0.97      0.97      0.97     68994
   macro avg       0.73      0.75      0.72     68994
weighted avg       0.97      0.97      0.97     68994



(0.9663159115285387, 0.9663159115285387, 0.9663159115285387, None)
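
The three identical numbers in the returned tuple are no coincidence: when every token receives exactly one label and all labels (including O) are counted, micro-averaged precision, recall and F-score all reduce to plain token accuracy. The following sketch, in plain Python with invented toy labels, shows how a high accuracy can hide a weak entity score. The function `token_scores` is a hypothetical helper, not part of scikit-learn.

```python
from collections import Counter


def token_scores(gold, pred):
    """Return overall token accuracy and a per-label F1 dictionary."""
    tp, fp, fn = Counter(), Counter(), Counter()
    for g, p in zip(gold, pred):
        if g == p:
            tp[g] += 1
        else:
            fp[p] += 1  # label p was predicted where it does not belong
            fn[g] += 1  # an occurrence of gold label g was missed
    accuracy = sum(tp.values()) / len(gold)
    f1 = {}
    for label in sorted(set(gold) | set(pred)):
        denom = 2 * tp[label] + fp[label] + fn[label]
        f1[label] = 2 * tp[label] / denom if denom else 0.0
    return accuracy, f1


# Toy data: 96 O tokens and 4 PER tokens, of which only one is found.
gold = ["O"] * 96 + ["PER"] * 4
pred = ["O"] * 96 + ["PER", "O", "O", "O"]

accuracy, f1 = token_scores(gold, pred)
entity_f1 = {label: score for label, score in f1.items() if label != "O"}
print(accuracy)    # 0.97
print(entity_f1)   # {'PER': 0.4}
```

Looking at the per-label scores, or at an average that leaves out O, gives a more honest picture of entity recognition quality than the overall figure.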

Let's now see what happens if we train a spaCy model specifically on the CoNLL data. To this end, we'll convert the CoNLL data to spaCy documents and GoldParses like we did above. This means we have to convert its BIO labels to BILUO labels.

In [11]:
from spacy.gold import iob_to_biluo

training_data = []
for sentence in train_data:
    tokens = [t[0] for t in sentence]
    ent_labels = iob_to_biluo([t[2] for t in sentence])
    doc = Doc(nlp.vocab, words=tokens)
    gold = GoldParse(doc, entities=ent_labels)
    training_data.append((doc, gold))
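
For readers curious about what the BIO-to-BILUO conversion involves, here is a minimal plain-Python sketch of the idea. It is not spaCy's implementation of `iob_to_biluo`; the name `bio_to_biluo` is made up, and the sketch assumes well-formed BIO input in which every entity starts with a `B-` tag.

```python
def bio_to_biluo(tags):
    """Convert a list of BIO tags to BILUO tags (sketch for well-formed input)."""
    biluo = []
    for i, tag in enumerate(tags):
        if tag == "O":
            biluo.append(tag)
            continue
        prefix, label = tag.split("-", 1)
        next_tag = tags[i + 1] if i + 1 < len(tags) else "O"
        # The entity continues only if the next token is an I- tag of the same type.
        continues = next_tag == "I-" + label
        if prefix == "B":
            biluo.append(("B-" if continues else "U-") + label)
        else:  # prefix == "I"
            biluo.append(("I-" if continues else "L-") + label)
    return biluo


print(bio_to_biluo(["B-PER", "I-PER", "O", "B-LOC", "O"]))
# ['B-PER', 'L-PER', 'O', 'U-LOC', 'O']
```

The only information BILUO adds is where an entity ends, which can be read off from the tag of the following token.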

We'll now compare two different situations. First we'll train a new spaCy model from scratch. We do this by initializing a blank Dutch spaCy model with `spacy.blank("nl")`. We'll add a named entity recognition pipe to it, and add the four entity labels in our training data. Then, we'll initialize the model's weights for training with `begin_training()`.

Second, we don't train a model from scratch, but we take the pretrained spaCy entity model and continue training it on our new training data. This means we can make use of everything the pretrained model has already learnt from its original training set. Because this model has seen much more data, we hope it will eventually give better results.

Apart from the initialization stage, the training of these two models looks exactly the same. We disable all other pipes, and train the models for a maximum of 100 epochs. Whenever we achieve a new highest F-score on the development data, we save the model. To avoid overfitting, we break the training cycle whenever we haven't been able to improve on the best development F-score for three epochs in a row.

In [12]:
from spacy.util import minibatch
from pathlib import Path

def train(train_docs, dev_data, output_dir, model=None, max_epochs=100): 
    
    if not model: 
        model = spacy.blank("nl")
        ner = model.create_pipe("ner")
        model.add_pipe(ner, last=True)
        for label in ["PER", "LOC", "ORG", "MISC"]: 
            ner.add_label(label)
        model.begin_training()
        
    other_pipes = [pipe for pipe in model.pipe_names if pipe != 'ner']
    fscore_history = []
    patience=3
        
    with model.disable_pipes(*other_pipes):
    
        for i in range(max_epochs):
            print("Epoch", i)
            losses = {}
            random.shuffle(train_docs)
            batches = minibatch(train_docs, size=32)
            for batch in tqdm(batches):
                docs, golds = zip(*batch)
                model.update(
                    docs,
                    golds,
                    drop=0.4,
                    losses=losses)
            print("Training Loss:", losses)
            
            _, _, dev_f, _ = evaluate(model, dev_data)
            print("Development F-score:", dev_f)
            
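            # Save only when this epoch beats every earlier one. Because the history
            # is still empty in the first epoch, that epoch's model is never saved.
            if len(fscore_history) > 0 and dev_f > max(fscore_history):
                if output_dir is not None:
                    output_dir = Path(output_dir)
                    # Also create missing parent directories such as "models/".
                    output_dir.mkdir(parents=True, exist_ok=True)
                    model.to_disk(output_dir)
                    print("Saved model to", output_dir)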
            
            fscore_history.append(dev_f)
            
            if max(fscore_history) > max(fscore_history[-patience:]):
                print("No improvement on development set. Stop training.")
                break
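
The stopping rule inside `train` can be checked in isolation. The sketch below replays it on a list of development F-scores; the values are the ones logged for the from-scratch run in this notebook, rounded to four decimals. The function name `stopping_epoch` is made up for this illustration.

```python
def stopping_epoch(dev_scores, patience=3):
    """Return the epoch index at which training would stop, or None if it never does."""
    history = []
    for epoch, score in enumerate(dev_scores):
        history.append(score)
        # Stop once the best score so far is no longer among the last `patience` scores.
        if max(history) > max(history[-patience:]):
            return epoch
    return None


# Development F-scores per epoch from the from-scratch run, rounded.
scores = [0.9555, 0.9653, 0.9675, 0.9693, 0.9607, 0.9679, 0.9678]
print(stopping_epoch(scores))  # 6
```

The best score is reached in epoch 3, and training halts after three further epochs fail to beat it.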

First we train the completely new model.

In [13]:
output_dir_scratch = "models/spacy_ner_scratch"
train(training_data, dev_data, model=None, output_dir=output_dir_scratch)

Epoch 0
Training Loss: {'ner': 17.634281397026918}
Development F-score: 0.955536135165912
Epoch 1
Training Loss: {'ner': 8.124413278556299}
Development F-score: 0.9652551574375678
Saved model to models/spacy_ner_scratch
Epoch 2
Training Loss: {'ner': 5.07255810720055}
Development F-score: 0.9674531924472339
Saved model to models/spacy_ner_scratch
Epoch 3
Training Loss: {'ner': 3.594017944427045}
Development F-score: 0.9692539922141894
Saved model to models/spacy_ner_scratch
Epoch 4
Training Loss: {'ner': 2.818861507985981}
Development F-score: 0.9607266756706655
Epoch 5
Training Loss: {'ner': 2.3161421975320895}
Development F-score: 0.9678769100394587
Epoch 6
Training Loss: {'ner': 1.934440403581468}
Development F-score: 0.9677974629909165
No improvement on development set. Stop training.


Then we continue training the existing model.

In [14]:
output_dir_cntd = "models/spacy_ner_cntd"
train(training_data, dev_data, model=nlp, output_dir=output_dir_cntd)

Epoch 0
Training Loss: {'ner': 6.690205245732558}
Development F-score: 0.9751860385053361
Epoch 1
Training Loss: {'ner': 3.8389393237998024}
Development F-score: 0.9784698498450782
Saved model to models/spacy_ner_cntd
Epoch 2
Training Loss: {'ner': 2.788690348488808}
Development F-score: 0.9790259791848733
Saved model to models/spacy_ner_cntd
Epoch 3
Training Loss: {'ner': 2.0977905896412414}
Development F-score: 0.9789200497868171
Epoch 4
Training Loss: {'ner': 1.7780554252080347}
Development F-score: 0.9781520616509096
Epoch 5
Training Loss: {'ner': 1.4171092849177291}
Development F-score: 0.9782579910489658
No improvement on development set. Stop training.


The F-scores we recorded on the development data already suggested that the continued model is indeed better than the completely new model: its best development F-score lies around one percentage point higher (0.979 versus 0.969). This is confirmed by the results on our testing data.

First, our new model already scores better than spaCy's pretrained model on the CoNLL test data. This is particularly the case for the LOC and MISC entities, where its F-score lies 14 and 13 percentage points higher, respectively. This shows how important it is to train on in-domain data: although spaCy's pretrained model has seen more data than our CoNLL model, the higher similarity of the CoNLL training data to our testing data makes our new model perform much better.

Second, the continued model goes one step further. It improves the F-score on the locations by another 8 percentage points, on the miscellaneous entities by 9, on the organizations by 9, and on the persons by 7. As some parts of the training process are random (such as the order in which we feed the data to the model), your mileage may vary, but the bigger patterns should be pretty similar. They demonstrate how the continued training is able to combine the strengths of the two approaches above: it still relies in part on the knowledge that was encoded in the pretrained model, but it has finetuned this model on our in-domain data.

In [17]:
nlp_base = spacy.load("nl")
nlp_scratch = spacy.load(output_dir_scratch)
nlp_cntd = spacy.load(output_dir_cntd)

print("\n********** Base Model **********")
evaluate(nlp_base, test_data, verbose=1)
print("\n********** New Model **********")
evaluate(nlp_scratch, test_data, verbose=1)
print("\n********** Continued Model **********")
evaluate(nlp_cntd, test_data, verbose=1)



********** Base Model **********
              precision    recall  f1-score   support

         LOC       0.50      0.85      0.63       823
        MISC       0.66      0.45      0.54      1597
           O       0.99      0.99      0.99     63236
         ORG       0.71      0.64      0.68      1433
         PER       0.77      0.81      0.79      1905

   micro avg       0.97      0.97      0.97     68994
   macro avg       0.73      0.75      0.72     68994
weighted avg       0.97      0.97      0.97     68994


********** New Model **********
              precision    recall  f1-score   support

         LOC       0.76      0.78      0.77       823
        MISC       0.78      0.59      0.67      1597
           O       1.00      1.00      1.00     63236
         ORG       0.71      0.67      0.69      1433
         PER       0.76      0.89      0.82      1905

   micro avg       0.98      0.98      0.98     68994
   macro avg       0.80      0.79      0.79     68994
weighted a

(0.9825926892193524, 0.9825926892193524, 0.9825926892193524, None)
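
The gains quoted for the new model can be recomputed from the per-label F-scores in the base and new model reports above. The snippet below is just that arithmetic, with the F-scores copied from those two reports.

```python
# Per-label test F-scores, copied from the classification reports.
base = {"LOC": 0.63, "MISC": 0.54, "ORG": 0.68, "PER": 0.79}
new = {"LOC": 0.77, "MISC": 0.67, "ORG": 0.69, "PER": 0.82}

# Difference in percentage points, rounded to whole points.
gains = {label: round((new[label] - base[label]) * 100) for label in base}
print(gains)
# {'LOC': 14, 'MISC': 13, 'ORG': 1, 'PER': 3}
```

The from-scratch model gains most on locations and miscellaneous entities, while organizations and persons improve only slightly.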

Unfortunately, the continued training of a pretrained model is not without its challenges. Most importantly, you need to make sure that your model doesn't overfit on the new training data and lose its ability to label the type of data it was originally trained on. A related challenge is that of [catastrophic forgetting](https://explosion.ai/blog/pseudo-rehearsal-catastrophic-forgetting), which typically occurs when weights are shared between several NLP tasks. Still, when you get it right, finetuning an existing model is a powerful way of training a high-quality model with a limited amount of data.