In [1]:
#|hide
#|default_exp spacy_ner

In [2]:
#| hide
%matplotlib inline
from nbdev.showdoc import *

In [13]:
#| export
from IPython.display import display_html
import tabulate
import spacy
import random
from spacy.util import minibatch, compounding
from pathlib import Path
from spacy.training import Example

# Updating spaCy's NER system
(follows: https://github.com/nlptown/nlp-notebooks/blob/master/Updating%20spaCy's%20Named%20Entity%20Recognition%20System.ipynb)

Pre-trained models are simple to use (we just have to plug them in), but results will be disappointing when the data we work with differs, even slightly, from the data the model was trained on.

So, we want to be able to train our own model. SpaCy has us covered:

- we can train our model from scratch
- we can continue training a pre-trained model with our own data.

This second option aligns better with our views: context is king. We start with a contextualized dataset (persons, affiliations and topics, together with the content of publications) and apply unsupervised ML first. We then use that output, TOGETHER with the context of the questions we want to answer, as data for supervised ML to improve the models.

Let's look at a toy example:

In [4]:
nlp = spacy.load('en_core_web_sm')
text = "Alexander Boris de Pfeffel Johnson (born 19 June 1964) is a British politician who has served as Prime Minister of the United Kingdom and Leader of the Conservative Party since 2019; he lead the Vote Leave campaign for Brexit . "

doc = nlp(text)
entities = [(t.text, t.ent_iob_, t.ent_type_) for t in doc]
# display_html() renders the table itself and returns None, so it is not wrapped in display()
display_html(tabulate.tabulate(entities, tablefmt='html'))

0,1,2
Alexander,B,PERSON
Boris,I,PERSON
de,I,PERSON
Pfeffel,I,PERSON
Johnson,I,PERSON
(,O,
born,O,
19,B,DATE
June,I,DATE
1964,I,DATE
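
The second column of the table holds IOB tags: `B` marks the first token of an entity, `I` a token inside one, and `O` a token outside any entity. As a minimal sketch of how those tags map back to whole entities, the hypothetical helper `group_entities` below merges the rows shown above. It joins tokens with a single space, which is a simplification; in practice `doc.ents` provides the entity spans directly.

```python
# Rows copied from the table above: (token text, IOB tag, entity type)
rows = [
    ("Alexander", "B", "PERSON"),
    ("Boris", "I", "PERSON"),
    ("de", "I", "PERSON"),
    ("Pfeffel", "I", "PERSON"),
    ("Johnson", "I", "PERSON"),
    ("(", "O", ""),
    ("born", "O", ""),
    ("19", "B", "DATE"),
    ("June", "I", "DATE"),
    ("1964", "I", "DATE"),
]


def group_entities(rows):
    """Merge B/I-tagged tokens into (entity text, label) pairs (illustrative helper)."""
    entities, current, label = [], [], None
    for text, iob, ent_type in rows:
        if iob == "B":
            # A new entity starts; close the previous one if there is any
            if current:
                entities.append((" ".join(current), label))
            current, label = [text], ent_type
        elif iob == "I" and current:
            current.append(text)
        else:
            # An "O" token ends the running entity
            if current:
                entities.append((" ".join(current), label))
            current, label = [], None
    if current:
        entities.append((" ".join(current), label))
    return entities


print(group_entities(rows))
# [('Alexander Boris de Pfeffel Johnson', 'PERSON'), ('19 June 1964', 'DATE')]
```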

Although the spaCy NER is actually quite good, we want to train the model some more with extra training data, so that a word like "Brexit", for example, is properly recognized (Brexit is currently labelled as PERSON). We do not use the actual sentence itself for this, as that would be too easy. Instead we use similar sentences, here just a couple taken from Wikipedia.

Below are the 18 NER labels that spaCy uses:

In [5]:
#| export
ner_lst = nlp.pipe_labels['ner']
print(ner_lst)
for i in ner_lst:
  print(f"{i}: {spacy.explain(i)}\n")

['CARDINAL', 'DATE', 'EVENT', 'FAC', 'GPE', 'LANGUAGE', 'LAW', 'LOC', 'MONEY', 'NORP', 'ORDINAL', 'ORG', 'PERCENT', 'PERSON', 'PRODUCT', 'QUANTITY', 'TIME', 'WORK_OF_ART']
CARDINAL: Numerals that do not fall under another type

DATE: Absolute or relative dates or periods

EVENT: Named hurricanes, battles, wars, sports events, etc.

FAC: Buildings, airports, highways, bridges, etc.

GPE: Countries, cities, states

LANGUAGE: Any named language

LAW: Named documents made into laws.

LOC: Non-GPE locations, mountain ranges, bodies of water

MONEY: Monetary values, including unit

NORP: Nationalities or religious or political groups

ORDINAL: "first", "second", etc.

ORG: Companies, agencies, institutions, etc.

PERCENT: Percentage, including "%"

PERSON: People, including fictional

PRODUCT: Objects, vehicles, foods, etc. (not services)

QUANTITY: Measurements, as of weight or distance

TIME: Times smaller than a day

WORK_OF_ART: Titles of books, songs, etc.



So, we have this existing pre-trained spaCy model that we want to update with some new examples (ideally these should be around 200-300 examples).

These examples should be presented to spaCy as a list of tuples. Each tuple contains the text and a dictionary whose `"entities"` key holds a list of tuples, one per named entity, with the start and end character indices of the entity in the text (end exclusive) and the label of that entity:

In [6]:
#| export
train_data = [
  ("Boris Johnson announced his pending resignation on 7 July 2022.", {"entities": [(0,13,"PERSON"), (36,47,"EVENT"), (51,62,"DATE")]}),
  ("He will remain as prime minister until a new party leader is elected.", {"entities": [(18,32,"NORP"), (45,57,"NORP"), (61,68,"EVENT")]}),
  ("He served as Secretary of State for Foreign and Commonwealth Affairs from 2016 to 2018.", {"entities": [(13,68,"NORP"), (74,86,"DATE")]}),
  ("Boris Johnson served as Mayor of London from 2008 to 2016.", {"entities": [(0,13,"PERSON"), (24,39,"NORP"), (45,57,"DATE")]}),
  ("He became a prominent figure in the successful Vote Leave campaign for Brexit in the 2016 European Union (EU) membership referendum.", {"entities": [(47,66,"EVENT"), (71,77,"EVENT"), (85,89,"DATE"), (90,104,"ORG"), (106,108,"ORG"), (121,131,"EVENT")]}),
]
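
Counting character offsets by hand is error-prone. As a minimal sketch, the hypothetical helper `annotate` below derives the offsets from the phrases themselves. It assumes every phrase occurs in the text and uses the first occurrence only, so repeated phrases would still need manual offsets.

```python
def annotate(text, labelled_phrases):
    """Build one training tuple from (phrase, label) pairs (illustrative helper)."""
    entities = []
    for phrase, label in labelled_phrases:
        start = text.find(phrase)  # first occurrence only
        if start == -1:
            raise ValueError(f"{phrase!r} not found in {text!r}")
        entities.append((start, start + len(phrase), label))
    return (text, {"entities": entities})


example = annotate(
    "Boris Johnson served as Mayor of London from 2008 to 2016.",
    [("Boris Johnson", "PERSON"), ("Mayor of London", "NORP"), ("2008 to 2016", "DATE")],
)
print(example[1])
# {'entities': [(0, 13, 'PERSON'), (24, 39, 'NORP'), (45, 57, 'DATE')]}
```

The result matches the hand-written offsets for the same sentence in `train_data`.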

Before we set up the NER pipeline with the content of our training data, we make sure that we got the indices right:

In [7]:
#| hide
test = "Boris Johnson announced his pending resignation on 7 July 2022."
test[36:47]

'resignation'
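
Slicing one span by hand works, but every span can be checked in one pass. The sketch below uses a hypothetical helper, `check_annotations`, that flags spans that fall outside the text, start or end on whitespace, or overlap another span (spaCy's `ner` expects non-overlapping entities). The second sample sentence contains a deliberately wrong end index to show what a report looks like.

```python
def check_annotations(data):
    """Return a list of human-readable problems found in the annotations."""
    problems = []
    for text, annotations in data:
        previous_end = 0
        for start, end, label in sorted(annotations["entities"]):
            snippet = text[start:end]
            if not (0 <= start < end <= len(text)):
                problems.append(f"out of range: {(start, end, label)} in {text!r}")
            elif snippet != snippet.strip():
                problems.append(f"leading/trailing whitespace: {snippet!r}")
            if start < previous_end:
                problems.append(f"overlap at {(start, end, label)} in {text!r}")
            previous_end = max(previous_end, end)
    return problems


sample = [
    ("Boris Johnson announced his pending resignation on 7 July 2022.",
     {"entities": [(0, 13, "PERSON"), (36, 47, "EVENT"), (51, 62, "DATE")]}),
    # Deliberately wrong: end index 40 includes the space after "London"
    ("Boris Johnson served as Mayor of London from 2008 to 2016.",
     {"entities": [(0, 13, "PERSON"), (24, 40, "NORP"), (45, 57, "DATE")]}),
]

for problem in check_annotations(sample):
    print(problem)
# leading/trailing whitespace: 'Mayor of London '
```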

To set things up, let's check whether we have an NER component in our pipeline:

In [8]:
#| hide
nlp.pipe_names

['tok2vec', 'tagger', 'parser', 'attribute_ruler', 'lemmatizer', 'ner']

That looks OK. Now we assign the NER component to a variable:

In [9]:
#| export
ner = nlp.get_pipe('ner')

The next step is to add the labels used in our training data to the NER:

In [10]:
#| export
for _, annotations in train_data:
  for ent in annotations.get("entities"):
    ner.add_label(ent[2])

Now we can start training. We only want to train the NER component, so we first list the other pipeline components, which will be disabled during training:

In [11]:
#| export
pipe_exceptions = ["ner", "trf_wordpiecer", "trf_tok2vec"]
unaffected_pipes = [pipe for pipe in nlp.pipe_names if pipe not in pipe_exceptions]

In order to properly train the NER model, we need to:

- let the ner model loop over the examples for a sufficient number of iterations (10)
- shuffle the examples in order NOT to base the training on the sequence (`random.shuffle()`)
- pass the training data in batches (`minibatch`)
- then use `nlp.update()` to update the model with each batch

In [14]:
#| export
# select_pipes() is the spaCy v3 replacement for the deprecated disable_pipes()
with nlp.select_pipes(disable=unaffected_pipes):
  for iteration in range(10):
    random.shuffle(train_data)
    losses = {}
    batches = minibatch(train_data, size=compounding(4.0, 32.0, 1.001))
    for batch in batches:
      # spaCy v3: nlp.update() takes Example objects rather than
      # separate lists of texts and annotations
      examples = []
      for text, annotations in batch:
        doc = nlp.make_doc(text)
        examples.append(Example.from_dict(doc, annotations))

      nlp.update(examples, drop=0.5, losses=losses)

print(losses)

{'ner': 16.5259361276355}
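
To see what `minibatch` and `compounding(4.0, 32.0, 1.001)` do with only five examples, here is a plain-Python sketch. The two functions are simplified stand-ins written for illustration and are assumed to mirror the behaviour of the spaCy utilities; they are not the library code.

```python
import itertools


def compounding_sketch(start, stop, compound):
    """Yield start, start*compound, start*compound**2, ... capped at stop."""
    value = float(start)
    while True:
        yield min(value, stop)
        value *= compound


def minibatch_sketch(items, sizes):
    """Cut items into consecutive batches whose sizes are drawn from `sizes`."""
    iterator = iter(items)
    while True:
        batch = list(itertools.islice(iterator, int(next(sizes))))
        if not batch:
            break
        yield batch


five_examples = ["ex1", "ex2", "ex3", "ex4", "ex5"]
sizes = compounding_sketch(4.0, 32.0, 1.001)
print([len(batch) for batch in minibatch_sketch(five_examples, sizes)])
# [4, 1]
```

Because the schedule grows by only 0.1% per batch and is created anew in every iteration, each pass over our five sentences is split into one batch of four and one batch of one. The compounding schedule only starts to matter with considerably more training data.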


Let's check how our updated NER model now performs, using our earlier sentence:

In [15]:
#| export
text = "Alexander Boris de Pfeffel Johnson (born 19 June 1964) is a British politician who has served as Prime Minister of the United Kingdom and Leader of the Conservative Party since 2019; he lead the Vote Leave campaign for Brexit . "

doc = nlp(text)
entities = [(t.text, t.ent_iob_, t.ent_type_) for t in doc]
# display_html() renders the table itself and returns None, so it is not wrapped in display()
display_html(tabulate.tabulate(entities, tablefmt='html'))

0,1,2
Alexander,B,PERSON
Boris,I,PERSON
de,I,PERSON
Pfeffel,I,PERSON
Johnson,I,PERSON
(,O,
born,O,
19,B,DATE
June,I,DATE
1964,I,DATE

Better, although we did train on examples about this very subject. That is precisely why we, the RePubXL team, propose supervised ML on smaller, contextual datasets. The power of these NER updates is that the model can still generalize beyond the examples, thanks to the word-embedding vector space.

Now, we want to keep our updated model for future use:

In [16]:
#| export
# save the model to a directory
output_dir = Path('content/')
nlp.to_disk(output_dir)
print(f"Saved model to: {output_dir}")

Saved model to: content


In [17]:
#| export
# Load the saved model to predict
print(f"Loading from: {output_dir}")
nlp_updated = spacy.load(output_dir)
doc = nlp_updated("Johnson is a controversial figure in British politics.")
print("Entities", [(ent.text, ent.label_) for ent in doc.ents])

Loading from: content
Entities [('Johnson', 'PERSON'), ('British', 'NORP')]


In the cells above we started with a pre-trained model. One can also start with an empty model, using `spacy.blank()` and passing in `"en"` for the English language. Because the model is empty, we have to add the `ner` (or, as in the small example below, a rule-based `entity_ruler`) to the pipeline ourselves using `add_pipe()`. We do not have to disable other pipeline components, because we are adding a new one, **not** changing an existing one, and the pipeline contains nothing else.

A statistical `ner` trained from scratch does need a larger number of training examples.

Just a small example below:

In [18]:
#| export
#Start from a blank English pipeline
nlp = spacy.blank("en")


#Sample text
text = "Treblinka is a small village in Poland. Wikipedia notes that Treblinka is not large."

#Create the EntityRuler
ruler = nlp.add_pipe("entity_ruler")

#List of Entities and Patterns
patterns = [
            {"label": "GPE", "pattern": "Treblinka"}
            ]

ruler.add_patterns(patterns)

doc = nlp(text)

#extract entities
for ent in doc.ents:
    print(ent.text, ent.start_char, ent.end_char, ent.label_)

Treblinka 0 9 GPE
Treblinka 61 70 GPE
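
The cell above uses a rule-based `entity_ruler`. For a statistical `ner` in a blank pipeline, a minimal sketch could look like the following. It assumes spaCy v3 and reuses the two Treblinka sentences as a toy dataset, which is far too small to produce a useful model; the names `toy_data`, `blank_nlp` and `blank_ner` are made up for this illustration.

```python
import random

import spacy
from spacy.training import Example

# Toy dataset for illustration only
toy_data = [
    ("Treblinka is a small village in Poland.", {"entities": [(0, 9, "GPE")]}),
    ("Wikipedia notes that Treblinka is not large.", {"entities": [(21, 30, "GPE")]}),
]

blank_nlp = spacy.blank("en")
blank_ner = blank_nlp.add_pipe("ner")

examples = [
    Example.from_dict(blank_nlp.make_doc(text), annotations)
    for text, annotations in toy_data
]

# initialize() sets up the weights, reads the labels from the examples
# and returns an optimizer
optimizer = blank_nlp.initialize(lambda: examples)

for _ in range(10):
    random.shuffle(examples)
    losses = {}
    blank_nlp.update(examples, sgd=optimizer, drop=0.2, losses=losses)

print(blank_nlp.pipe_names, blank_ner.labels, losses)
```

No components are disabled here, because `ner` is the only one in the pipeline.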


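We can also use this rule-based pipeline to generate training data automatically: we split the text into sentences with the pre-trained model, run the `entity_ruler` over each sentence, and collect the matches in the training format: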

In [19]:
#| export
#Import the requisite library
import spacy

#Build upon the spaCy Small Model
nlp = spacy.load("en_core_web_sm")

#Sample text
text = "Treblinka is a small village in Poland. Wikipedia notes that Treblinka is not large."

corpus = []

doc = nlp(text)
for sent in doc.sents:
    corpus.append(sent.text)

#Start from a blank English pipeline
nlp = spacy.blank("en")

#Create the EntityRuler
ruler = nlp.add_pipe("entity_ruler")

#List of Entities and Patterns
patterns = [
            {"label": "GPE", "pattern": "Treblinka"}
            ]

ruler.add_patterns(patterns)


TRAIN_DATA = []

#iterate over the corpus again
for sentence in corpus:
    doc = nlp(sentence)
    
    #the second element of each training item is a dictionary with an "entities" list; start with an empty list
    entities = []
    
    #extract entities
    for ent in doc.ents:

        #appending to entities in the correct format
        entities.append([ent.start_char, ent.end_char, ent.label_])
        
    TRAIN_DATA.append([sentence, {"entities": entities}])

print(TRAIN_DATA)

[['Treblinka is a small village in Poland.', {'entities': [[0, 9, 'GPE']]}], ['Wikipedia notes that Treblinka is not large.', {'entities': [[21, 30, 'GPE']]}]]
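
Before training on generated annotations like these, it can help to confirm that every span lines up with token boundaries. The sketch below assumes spaCy v3 and its `offsets_to_biluo_tags` utility, which marks tokens of a misaligned span with `-`. The variable names are made up for this illustration, and a blank pipeline is enough because only the tokenizer is needed.

```python
import spacy
from spacy.training import offsets_to_biluo_tags

# The generated data printed above
generated = [
    ["Treblinka is a small village in Poland.", {"entities": [[0, 9, "GPE"]]}],
    ["Wikipedia notes that Treblinka is not large.", {"entities": [[21, 30, "GPE"]]}],
]

tokenizer_only = spacy.blank("en")

alignment = []
for text, annotations in generated:
    doc = tokenizer_only.make_doc(text)
    spans = [tuple(span) for span in annotations["entities"]]
    tags = offsets_to_biluo_tags(doc, spans)
    alignment.append(tags)
    status = "misaligned" if "-" in tags else "ok"
    print(status, tags)
```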


## Training a completely new entity type in spaCy

All code above was directed at training the `ner` to categorize correctly, either adjusting a pre-trained model or starting from a new blank model and adjusting that as one goes.

But what if you want to work with a category that is NOT yet defined?

In [20]:
#| export
# Get the `ner` component of the pipeline
nlp = spacy.load('en_core_web_sm')
ner = nlp.get_pipe('ner')

In [24]:
#| export
# Name of the new label (it is added to the ner further below)
LABEL = "FOOD"

# Training examples in the required format
TRAIN_DATA =[ ("Pizza is a common fast food.", {"entities": [(0, 5, "FOOD")]}),
              ("Pasta is an italian recipe", {"entities": [(0, 5, "FOOD")]}),
              ("China's noodles are very famous", {"entities": [(8,15, "FOOD")]}),
              ("Shrimps are famous in China too", {"entities": [(0,7, "FOOD")]}),
              ("Lasagna is another classic of Italy", {"entities": [(0,7, "FOOD")]}),
              ("Sushi is extremely famous and expensive Japanese dish", {"entities": [(0,5, "FOOD")]}),
              ("Unagi is a famous seafood of Japan", {"entities": [(0,5, "FOOD")]}),
              ("Tempura , Soba are other famous dishes of Japan", {"entities": [(0,7, "FOOD")]}),
              ("Udon is a healthy type of noodles", {"entities": [(0,4, "FOOD")]}),
              ("Chocolate soufflé is extremely famous french cuisine", {"entities": [(0,17, "FOOD")]}),
              ("Flamiche is french pastry", {"entities": [(0,8, "FOOD")]}),
              ("Burgers are the most commonly consumed fastfood", {"entities": [(0,7, "FOOD")]}),
              ("Frenchfries are considered too oily", {"entities": [(0,11, "FOOD")]})
           ]

We have to train the model:

- First, add the new label with `ner.add_label()`
- Resume training with `nlp.resume_training()`, which returns the optimizer
- Select the pipes to be trained
- Single out the pipes NOT to be trained, so that they can be disabled

In [25]:
#| export
# Add the new label to ner
ner.add_label(LABEL)

# Resume training
optimizer = nlp.resume_training()
move_names = list(ner.move_names)

# List of pipes you want to train
pipe_exceptions = ["ner", "trf_wordpiecer", "trf_tok2vec"]

# List of pipes which should remain unaffected in training
other_pipes = [pipe for pipe in nlp.pipe_names if pipe not in pipe_exceptions]

In [26]:
#| export
# Begin training by disabling other pipeline components

# select_pipes() is the spaCy v3 replacement for the deprecated disable_pipes()
with nlp.select_pipes(disable=other_pipes):
  sizes = compounding(1.0, 4.0, 1.001)
  for iteration in range(30):
    random.shuffle(TRAIN_DATA)
    losses = {}
    batches = minibatch(TRAIN_DATA, size=sizes)
    for batch in batches:
      # spaCy v3: nlp.update() takes Example objects rather than
      # separate lists of texts and annotations
      examples = []
      for text, annotations in batch:
        doc = nlp.make_doc(text)
        examples.append(Example.from_dict(doc, annotations))

      # pass the optimizer returned by nlp.resume_training() explicitly
      nlp.update(examples, sgd=optimizer, drop=0.5, losses=losses)

print(losses)

{'ner': 0.0024407938347574204}


With the training complete, let's test our `ner`:

In [27]:
#| export
test_text = "I ate Sushi yesterday. Maggi is a common fast food "
doc = nlp(test_text)
print("Entities in '%s'" % test_text)
for ent in doc.ents:
  print(ent)

Entities in 'I ate Sushi yesterday. Maggi is a common fast food '
Sushi
Maggi


In [28]:
#| hide
import nbdev; nbdev.nbdev_export()