# Updating the Model on Henslow Data
---
---
## First, A Word

### This Python is Intermediate
This notebook presents a **codewalk showing how to update a spaCy model with Henslow data**. There isn't a lot of explanation of the Python itself; if you're fairly new to the language, there might be some features of Python with which you're not familiar. 

I encourage you to learn whatever new Python features you can as you go along, but if the Python does become too difficult to understand at any point, it's absolutely fine. Just run the examples to see the results, and come back to the Python another time. Of course, you can just copy and paste the code to try with your own data, and see where that takes you!

![Woman coding at a computer with her back to the viewer](assets/woman-back-computer.png "Woman coding at a computer with her back to the viewer")

### This Notebook is for Demonstration Purposes

In a real-world project, you would need **a few hundred to a few thousand examples to update an existing model**, and **a few thousand to a million (!) examples to train for a new entity type**. This would take a long time to arrange and a long time to process. For the purposes of this workshop, the code here uses just a few examples to demonstrate the methods in a simplified way.

### Issues of Copyright

The Henslow letters are included here courtesy of [The Henslow Correspondence Project](https://epsilon.ac.uk/search?sort=date;f1-collection=John%20Henslow) and licensed under [Creative Commons Attribution-NonCommercial 4.0 International (CC BY-NC 4.0)](https://creativecommons.org/licenses/by-nc/4.0/). Since this license allows remixing and transformation of the material for non-commercial purposes, provided attribution is given, we should be able to use it freely for text-mining. For the purposes of abiding by the terms of the license: we are creating derivatives.

### spaCy Annotation Format

If you remember from the [previous notebook](3-principles-of-machine-learning-for-named-entities.ipynb#Training-Data-Labelling-and-Format), when you train a spaCy model you need to give the `nlp.update()` method training data like this:

`[("Yours very sincerely | John Evans", {"entities": [(23, 33, "PERSON")]}),]`

However, it is hard for a human to look at this format and work out if the entity span is in the correct place. Therefore, **for the purposes of teaching**, I have included the **entity text** when creating the training data in this notebook, and only removed it at the very end when we actually update the model.

Thus, most of this training data looks like this:

`[("Yours very sincerely | John Evans", {"entities": [('John Evans', 23, 33, "PERSON")]}),]`

But be aware, this is **not** correct. Remember not to include the entity text when doing your real-world project!
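
Converting from the teaching format back to the real one is mechanical. Here is a minimal sketch of one way to do it in plain Python. The helper name `strip_entity_text` is my own invention (it is not part of spaCy), and it also checks that each entity text really sits at its character offsets, since that is the whole reason for carrying the text around.

```python
def strip_entity_text(teaching_data):
    """Convert (entity_text, start, end, label) entities into (start, end, label)."""
    cleaned = []
    for text, annotation in teaching_data:
        entities = []
        for entity_text, start, end, label in annotation["entities"]:
            # Fail loudly if the offsets do not point at the expected text
            if text[start:end] != entity_text:
                raise ValueError(f"{entity_text!r} is not at {start}-{end} in {text!r}")
            entities.append((start, end, label))
        cleaned.append((text, {"entities": entities}))
    return cleaned


teaching_example = [
    ("Yours very sincerely | John Evans", {"entities": [("John Evans", 23, 33, "PERSON")]}),
]
spacy_example = strip_entity_text(teaching_example)
print(spacy_example)
```

If an offset is wrong, the function raises an error rather than silently passing a misaligned span on to training.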

---
---
## Clean up the Sentence Boundaries
As with all text mining projects, a large proportion of time is spent on cleaning up data and correcting outputs.

So, before we can even start to pick training examples, we need to fix some problems with the sentence boundaries that spaCy has created.

We start by inspecting how the transcriptions have been split into **sentences** by the default syntactic dependency parser (pipeline component):

In [None]:
from bs4 import BeautifulSoup
import en_core_web_sm

with open("data/henslow/letters_152.xml", encoding="utf-8") as file:
    letter = BeautifulSoup(file, "lxml-xml")
transcription = letter.find(type='transcription').text

# We can't disable the `'parser'` component because it contains the sentencizer
nlp = en_core_web_sm.load(disable=['tagger'])

document = nlp(transcription)

for i, sentence in enumerate(document.sents):
    print(f'Sentence {i + 1}:{sentence}\n')

Reviewing these sentences, we can identify a few problems:

```
Sentence 1:
I am ashamed that I have never before acknowledged the receipt of the very valuable packet of specimens

Sentence 2:you were so kind as to send me.
```

Many of the sentences have been split into fragments, apparently without good reason.

We should also look at letters by other correspondents, because punctuation style can be quite different.

---
> **EXERCISE**: Change the code above to examine `letters_90.xml` (or another Henslow letter) instead. What issues do you notice? Are they the same or different?

### Statistical versus Rule-Based Sentence Segmentation
spaCy provides two alternative components to find the **sentence boundaries** of texts:

* A **statistical sentencizer** is included with the syntactic dependency parser (['parser'](https://spacy.io/api/dependencyparser)) — we used this above.
* An independent **rule-based sentencizer** is available (['sentencizer'](https://spacy.io/api/sentencizer)) that can be customised.

By default, the rule-based sentencizer segments sentences based on the characters `.`, `!`, and `?`. 

> If you wish to experiment with customising the characters used to mark the ends of sentences, you have to create an instance of the `Sentencizer` like this and pass in the characters in a list:

```
from spacy.pipeline import Sentencizer
sentencizer = Sentencizer(punct_chars=[".", "?", "!", "。", "|"])
```
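
To get a feel for what "segmenting on characters" means, here is a rough plain-Python sketch of the idea. It is not spaCy's implementation (spaCy works on tokens, not raw characters), the function name `split_on_punctuation` is my own, and the sample text is invented in the style of the letters.

```python
def split_on_punctuation(text, punct_chars=(".", "!", "?")):
    """Naively end a sentence after every character listed in punct_chars."""
    sentences = []
    current = []
    for char in text:
        current.append(char)
        if char in punct_chars:
            sentence = "".join(current).strip()
            if sentence:
                sentences.append(sentence)
            current = []
    # Keep any trailing text that has no closing punctuation
    tail = "".join(current).strip()
    if tail:
        sentences.append(tail)
    return sentences


sample = "I thank you for y. r letter. Yours truly | John Evans"

default_split = split_on_punctuation(sample)
custom_split = split_on_punctuation(sample, punct_chars=(".", "!", "?", "|"))

for sentence in default_split:
    print("default:", sentence)
for sentence in custom_split:
    print("custom: ", sentence)
```

Adding `|` to the markers separates the sign-off from the signature. The abbreviation "y. r" is still split in two, which is a limitation of any purely character-based rule.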

Let us compare the performance of the `'parser'` component (above) with the `'sentencizer'` component (below):

In [None]:
# Disable both the POS tagger and syntactic dependency parser
nlp = en_core_web_sm.load(disable=['tagger', 'parser'])

# Create a new rule-based sentencizer component with custom sentence markers
# (for the default markers ., ! and ? only, use: sentencizer = nlp.create_pipe("sentencizer"))
from spacy.pipeline import Sentencizer
sentencizer = Sentencizer(punct_chars=[".", "?", "!", "。", "|", "—"])

# Add the sentencizer to the pipeline
nlp.add_pipe(sentencizer)

with open("data/henslow/letters_152.xml", encoding="utf-8") as file:
    letter = BeautifulSoup(file, "lxml-xml")
transcription = letter.find(type='transcription').text

document = nlp(transcription)

for i, sentence in enumerate(document.sents):
    print(f'Sentence {i + 1}:{sentence}\n')

I hope you agree this has resolved some of the problems flagged above? There is still an issue in some letters with splitting on abbreviations that are terminated with a full stop (e.g. 'y.<sup>r</sup>' and 'w.<sup>ch</sup>'), but let's move on now.

---
---
## Process Multiple Texts
So far we have processed only a single letter at a time. In fact, we want to **process the whole corpus** at once, and search the whole corpus for training examples.

The most efficient way to do this is with the **[`nlp.pipe()`](https://spacy.io/api/language#pipe)** method. This streams the texts one-by-one, or in small batches (`batch_size`), so you are not loading the whole corpus into computer memory; and it gives you the option to process using more than one of your computer's CPUs at a time (`n_process`). For example:

`nlp.pipe(texts, batch_size=25, n_process=2)`
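
If the idea of streaming in batches is unfamiliar, this plain-Python sketch shows the general principle. It is only an illustration of lazy batching, not how spaCy implements `nlp.pipe()` internally, and the function name `in_batches` is my own.

```python
from itertools import islice


def in_batches(iterable, batch_size):
    """Yield lists of up to batch_size items, reading lazily from the iterable."""
    iterator = iter(iterable)
    while True:
        batch = list(islice(iterator, batch_size))
        if not batch:
            return
        yield batch


# A generator stands in for a large corpus that we don't want in memory all at once
texts = (f"letter {number}" for number in range(1, 8))

batch_sizes = []
for batch in in_batches(texts, batch_size=3):
    batch_sizes.append(len(batch))
    print(batch)
```

Only one batch is held in memory at a time, which is why a larger `batch_size` costs more memory.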

Let's try named entity recognition over the whole Henslow corpus now:

In [None]:
# Import a class that helps to manage filepaths
from pathlib import Path

# Create a filepath to the Henslow letters directory
dir_path = Path('data/henslow').absolute()

# Get filepaths to all the XML files in the directory
xml_files = (file for file in dir_path.iterdir() if file.is_file() and file.name.lower().endswith('.xml'))

# Open each XML file in turn and create a list of the transcriptions
transcriptions = []
for file in xml_files:
    with file.open('r', encoding='utf-8') as xml:
        letter = BeautifulSoup(xml, "lxml-xml")
        text = letter.find(type='transcription')
        if text:
            # Strip out whitespace at start and end, newlines and non-breaking spaces for better readability
            # Also replace ampersands with 'and' for better NER
            strip_text = text.text.strip().replace('\n', ' ').replace(u'\xa0', u' ').replace('& ', 'and ')
            transcriptions.append(strip_text)

# Disable unnecessary components
nlp = en_core_web_sm.load(disable=['tagger', 'parser'])
# Add the rule-based sentencizer
nlp.add_pipe(sentencizer)

In [None]:
%%time
# Using the Jupyter magic command %%time to time cell execution

# From all documents create a list of sentences and named entity labels
# Format: ('My sentence has an entity.', {'entities': [('entity', 19, 25, 'LABEL')]})
ner_data = []
for doc in nlp.pipe(transcriptions, batch_size=25, n_process=2):
    for sent in doc.sents:
        
        entities = [(ent.text, 
                     (ent.start_char-sent.start_char), 
                     (ent.end_char-sent.start_char), 
                     ent.label_) 
                     for ent in sent.ents]

        ner_data.append((sent.text, {"entities": entities}))
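
The subtraction of `sent.start_char` in the cell above is easy to skim past. spaCy reports entity offsets relative to the whole document, but each training example here is a single sentence, so the offsets have to be shifted. This small plain-Python sketch shows the arithmetic on an invented two-sentence text.

```python
doc_text = "I saw Mr Evans. He lives in Cambridge."
sentence_text = "He lives in Cambridge."
entity_text = "Cambridge"

# Offsets relative to the whole document (what ent.start_char and ent.end_char give you)
sentence_start = doc_text.index(sentence_text)
entity_start = doc_text.index(entity_text)
entity_end = entity_start + len(entity_text)

# Offsets relative to the sentence (what the training example needs)
relative_start = entity_start - sentence_start
relative_end = entity_end - sentence_start

print("document offsets:", (entity_start, entity_end))
print("sentence offsets:", (relative_start, relative_end))
print("check:", sentence_text[relative_start:relative_end])
```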

> **EXERCISE**: I set the batch size (`batch_size=25`) and number of processors (`n_process=2`) to try and speed up the processing. Try changing these parameters to see if you can speed it up or slow it down. What is the optimum combination? Bear in mind: larger batches take up more memory, and you can't have more processors than your computer has available cores. Also, there is an overhead associated with setting up multi-processing (i.e. using multiple cores/CPUs), so it's only worth doing if you have a lot of things to process.

Now we should have a list of 11,130 sentences and their named entities:

In [None]:
len(ner_data)

Let's have a quick look at a few of these sentences and their annotations. The output we've created is a **list of tuples**, which is the format we need to input to spaCy's training method. What do you think about the accuracy of the output?

In [None]:
ner_data[450:455]

---
---
## Create Training Examples
We have **two options** for creating training examples:

1. Find examples within the documents already **labelled by the model's first pass** and correct them.
2. Take examples from a **manually-labelled dataset** created by humans.

We will try examples of both approaches. But towards the end of this notebook, we will only use the manually-labelled dataset to try updating the model.

### Sentences, Paragraphs or Documents?

The example texts can be a sentence, paragraph or longer document. In a real-world project, you should choose **whatever is most similar to what the model will see at runtime**. For example, if you intend to process each letter transcription as a block, you might get better results with examples that are whole-letter transcriptions too.

For most of the examples below, I will just use sentences for clarity. For updating the model with the manually-labelled dataset, I will use whole transcriptions.

---
---
## Correct Existing Predictions

To correct the predictions we received from the model's first pass through our corpus of letters, we first need to browse the data and find some problem predictions.

We find that a token has often been recognised as a named entity, but it is the **wrong entity type**. Examples ('PRODUCT' and 'ORG'):

---
> I am only anxious to shew you every opportunity of benefiting your 
<mark class="entity" style="background: #bfeeb7; padding: 0.45em 0.6em; margin: 0 0.25em; line-height: 1; border-radius: 0.35em;">
    Herbarium
    <span style="font-size: 0.8em; font-weight: bold; line-height: 1; border-radius: 0.35em; text-transform: uppercase; vertical-align: middle; margin-left: 0.5rem">PRODUCT</span>
</mark>
.

---
---
> In return I shall be glad to receive as many specimens as you please of the rarer 
<mark class="entity" style="background: #7aecec; padding: 0.45em 0.6em; margin: 0 0.25em; line-height: 1; border-radius: 0.35em;">
    Cambridgeshire
    <span style="font-size: 0.8em; font-weight: bold; line-height: 1; border-radius: 0.35em; text-transform: uppercase; vertical-align: middle; margin-left: 0.5rem">ORG</span>
</mark>
 plants

---

Or that it has **wrongly predicted a span as a named entity**:

---

> but I am not aware of any case in wch such a 
<mark class="entity" style="background: #ff9561; padding: 0.45em 0.6em; margin: 0 0.25em; line-height: 1; border-radius: 0.35em;">
    Crystal
    <span style="font-size: 0.8em; font-weight: bold; line-height: 1; border-radius: 0.35em; text-transform: uppercase; vertical-align: middle; margin-left: 0.5rem">LOC</span>
</mark>
 or case of chert has been found
    
---

Or that it has **wrongly included**, or **excluded**, **tokens** in the span that is labelled. Examples:

---

> you may be assured of the readiness of yours | most truly 
<mark class="entity" style="background: #aa9cfc; padding: 0.45em 0.6em; margin: 0 0.25em; line-height: 1; border-radius: 0.35em;">
    | Edward Wilson
    <span style="font-size: 0.8em; font-weight: bold; line-height: 1; border-radius: 0.35em; text-transform: uppercase; vertical-align: middle; margin-left: 0.5rem">PERSON</span>
</mark>
    
---
---

> <mark class="entity" style="background: #f0d0ff; padding: 0.45em 0.6em; margin: 0 0.25em; line-height: 1; border-radius: 0.35em;">
    Enc.
    <span style="font-size: 0.8em; font-weight: bold; line-height: 1; border-radius: 0.35em; text-transform: uppercase; vertical-align: middle; margin-left: 0.5rem">WORK_OF_ART</span>
</mark>
 Britannica
    
---

### Nineteenth-century Letter Style and Gold Standard Training
In fact, spaCy's default English language model has not done well at all with this style of letter writing! The model has been confused by frequent capitalisation of nouns, the use of full stops in abbreviations, and the frequent use of ampersands to replace the word 'and'.

If that wasn't enough, the editors have used a traditional print convention of marking the line break in the signing off with a vertical bar `|`. This character is not even present in the original letter text itself!

In a real-world scenario, we would want to train the model on a large training set of high-quality fully labelled named entities (**gold standard** data). But for now, we will just settle for making some small corrections.

### Find Training Examples for Single Tokens with `Matcher`

Once we have reviewed the first pass at NER and started to list some predictions that need changing, we need to search for some examples.

One approach we can take is to use spaCy's **`Matcher`** to find suitable tokens within the documents without having to iterate over each of the tokens one by one. We can then correct the named entity type.

> For a more detailed look see [Rule-based matching](https://spacy.io/usage/rule-based-matching) and to play with the matcher interactively see the [Rule-based Matcher Explorer](https://explosion.ai/demos/matcher).

Let's use the example of 'Cambridgeshire' to provide correct training data for this named entity as `GPE`.

In [None]:
from spacy.matcher import Matcher

nlp = en_core_web_sm.load(disable=['tagger', 'parser'])
nlp.add_pipe(sentencizer)

# Create a new Matcher with English vocabulary
matcher = Matcher(nlp.vocab, validate=True)

# Specify the pattern we are looking for
pattern = [[{"TEXT": "Cambridgeshire"}]]

# Add the patterns to the Matcher
matcher.add("Cambridgeshire", pattern)

loc_train_data=[]

# Process the transcriptions into documents
for doc in nlp.pipe(transcriptions, batch_size=25, n_process=2):
    
    # For every sentence create a new document
    for sent in doc.sents:
        sent_doc = nlp.make_doc(sent.text)
        
        # Run the matcher over the sentence document
        matches = matcher(sent_doc)
        
        # If there is a match, get the named entities, correct and append them to the training data list
        if matches:
            
            spans = [(sent_doc[start:end]) for match_id, start, end in matches]
            entities = [(span.text, span.start_char, span.end_char, "GPE") for span in spans]
            example = (sent_doc.text, {"entities": entities})
            
            loc_train_data.append(example)

We now have training data in a format suitable for updating the model (even if the sentencization still needs some work!):

In [None]:
loc_train_data

### Find Training Examples of Multiple Tokens with `PhraseMatcher`
If we have a large list of terminology that we would like to match, and those terms may span multiple tokens, then the **[`PhraseMatcher`](https://spacy.io/api/phrasematcher)** is the most efficient way to find examples.

> For more details on using the `PhraseMatcher` see [Rule-based matching: Efficient phrase matching](https://spacy.io/usage/rule-based-matching#phrasematcher).

Let's try this on a range of known signatures, in an attempt to correct the sign-offs with the added vertical bar:
* "J S Henslow"
* "H. T. Stainton"
* "John Evans"
* "R.T. Lowe"
* "W. H. Miller"
* "J. S. Bowerbank"

(Of course, we already have this particular information in the XML metadata, but with sufficient effort a similar list could be created for mentions of persons in the main body of the transcriptions, or learned societies, or whatever terminology is specific to the context.)

In [None]:
from spacy.matcher import PhraseMatcher

nlp = en_core_web_sm.load(disable=['tagger', 'parser'])
nlp.add_pipe(sentencizer)

# Create a new PhraseMatcher with English vocabulary
matcher = PhraseMatcher(nlp.vocab, validate=True)

# Specify the terms we are looking for
terms = ["J S Henslow", "H. T. Stainton", "John Evans", "R.T. Lowe", "W. H. Miller", "J. S. Bowerbank"]

# PhraseMatcher takes Doc objects rather than text patterns
# We use `nlp.make_doc()` to create the Docs quickly
patterns = [nlp.make_doc(text) for text in terms]

# Add the pattern Docs to the Matcher
matcher.add("AuthorList", None, *patterns)

signoff_train_data=[]

for doc in nlp.pipe(transcriptions, batch_size=25, n_process=2):
    
    for sent in doc.sents:
        sent_doc = nlp.make_doc(sent.text)
        
        matches = matcher(sent_doc)
        
        # If there is a match, get the named entities, label and append them to the training data list
        if matches:
            
            spans = [(sent_doc[start:end]) for match_id, start, end in matches]
            entities = [(span.text, span.start_char, span.end_char, "PERSON") for span in spans]
            example = (sent_doc.text, {"entities": entities})
            
            signoff_train_data.append(example)

In [None]:
signoff_train_data

---
---
## Add a New Entity Type
Now we are going to add a completely new entity type **"TAXONOMY"**. We shall define this as a type of entity for any Linnaean taxonomic name (domain, kingdom, phylum, division, class, order, family, genus or species). Binomials (genus plus species together) should be labelled as one span. 

> "TAXONOMY" seems like a new entity that would be of relevance and interest in the set of Henslow letters, and could be linked to an authority or knowledgebase later. In the much larger set of Darwin's letters edited by the Darwin Correspondence Project (DCP) a lot of information is included in the footnotes where species occur, reconciling historical taxonomic names with new ones. But this sort of metadata isn't included in the Henslow corpus, and neither corpus has, as yet, marked up the species in TEI where they occur in the transcriptions.

You could add any new entity type that is relevant to your project, but it should be as **general** as possible, otherwise it will be difficult for the model to learn and predict. So you would perhaps not choose "FLOWER_SPECIES" as a new named entity; it would be too difficult to learn the difference between that and other species.

### Getting Training Examples from Human Annotation
In the annotation exercise using Doccano (see the [previous notebook](3-principles-of-machine-learning-for-named-entities.ipynb#Annotation-Using-Doccano)) you created a training set for "TAXONOMY" collaboratively. Since the result of your efforts is not available as I write this notebook, I have done a small amount of annotation manually myself in order to demonstrate what we can do with it. I have a subset of 40 HCP letters labelled up with "TAXONOMY" entities: [`data/henslow_data_doccano_taxonomy_ner.jsonl`](data/henslow_data_doccano_taxonomy_ner.jsonl).

After exporting a training set from Doccano, it is necessary to transform the output from Doccano JSONL format into spaCy's training format.

Doccano's output format looks like this:

```
{"id": 3742, "text": "Could you procure me ripe seeds of Melampyrum arvense", "meta": {}, "annotation_approver": null, "labels": [[35, 53, "TAXONOMY"]]}
```

But we need to transform it into spaCy's training format:

```
[('Could you procure me ripe seeds of Melampyrum arvense', {'entities': [(35, 53, 'TAXONOMY')]})]
```

> REMEMBER: **for the purposes of teaching**, I have included the **entity text** when creating the training data in this notebook, and only removed it at the very end when we actually update the model.

Let's use a package called **srsly** (made by the creators of spaCy) to read the JSONL into a list of dictionaries:

In [None]:
import srsly

filepath = Path('data', 'henslow_data_doccano_taxonomy_ner.jsonl').absolute()
annotations = list(srsly.read_jsonl(filepath))
annotations[1]

Now we can transform this list of dictionaries into the list of tuples that we need:

In [None]:
taxonomy_train_data = []
for annotation in annotations:
    text = annotation.get('text')
    entities = [(text[start:end], start, end, type_) for start, end, type_ in annotation.get('labels')]
    taxonomy_train_data.append((text, {'entities': entities}))
    
taxonomy_train_data[1]

Notice that this training set is a list of documents (transcriptions) not sentences, as that is how the data was loaded into Doccano. When working on a real-world project you will need to decide how best to segment your data.
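
Hand-made annotations can contain slips, such as an offset that runs past the end of the text or a span that includes a stray space. As a hedged suggestion, a quick check like the one below can catch these before training. The function name `find_label_problems` and the three sample records are my own inventions; the records only imitate the Doccano format shown above.

```python
def find_label_problems(records):
    """Return (record id, start, end, problem) for suspicious Doccano labels."""
    problems = []
    for record in records:
        text = record["text"]
        previous_end = 0
        for start, end, label in sorted(record["labels"]):
            span = text[start:end]
            if not 0 <= start < end <= len(text):
                problems.append((record.get("id"), start, end, "out of range"))
            elif span != span.strip():
                problems.append((record.get("id"), start, end, "leading or trailing whitespace"))
            elif start < previous_end:
                problems.append((record.get("id"), start, end, "overlaps previous entity"))
            previous_end = max(previous_end, end)
    return problems


# Invented records: the first is fine, the other two each contain one slip
records = [
    {"id": 1, "text": "Primula vulgaris and Primula veris",
     "labels": [[0, 16, "TAXONOMY"], [21, 34, "TAXONOMY"]]},
    {"id": 2, "text": "Seeds of Melampyrum arvense", "labels": [[9, 28, "TAXONOMY"]]},
    {"id": 3, "text": "the wild Cowslip here", "labels": [[8, 16, "TAXONOMY"]]},
]

label_problems = find_label_problems(records)
for problem in label_problems:
    print(problem)
```

In the notebook, the same function could be pointed at the `annotations` list loaded with srsly.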

### Rule-Based Entity Recognition with `EntityRuler`
Before we move onto training the statistical model, I want to cover an important new feature of spaCy, the **[`EntityRuler`](https://spacy.io/api/entityruler)**. This component allows you to **add named entities based on patterns**, and **combine** this rule-based approach with statistical named entity recognition.

We are going to use it to add some examples of "TAXONOMY" named entities in a rule-based way. 

> For more details about what you can do with `EntityRuler` see [Rule-based entity recognition: EntityRuler](https://spacy.io/usage/rule-based-matching#entityruler).

To use `EntityRuler` we:
* Create the **pattern** we are looking for ("TAXONOMY" entities that match things like "Pulmonaria").
* Create a new **'ruler'** component and add the pattern to it.
* Ensure that it **overwrites** any named entities predicted by the 'ner' component.

> BUG ALERT: We should be able to add the new 'ruler' component to the pipeline _before_ the statistical 'ner' component, so that the 'ner' component respects the existing entity spans and adjusts its predictions around them. However, there is a bug in spaCy that prevents us loading the 'ner' component manually. See: [Adding EntityRuler before ner and saving model to disk crashes loading the model](https://github.com/explosion/spaCy/issues/4042)

To bootstrap our list of patterns, we can use the "TAXONOMY" entities we have already labelled.

In [None]:
# Loop through the "TAXONOMY" training data and create a list of named taxons
entity_taxon = []
for item in taxonomy_train_data:
    entities = item[1].get('entities')
    if entities:
        for ent in entities:
            taxon = ent[0]
            entity_taxon.append(taxon)
            
# Create a set from the list to ensure uniqueness
unique_entity_taxons = set(entity_taxon)

In [None]:
unique_entity_taxons

In [None]:
# Create the patterns to match from this set
patterns = []
for taxon in unique_entity_taxons:
    texts = [{"TEXT": word} for word in taxon.split()]
    patterns.append({"label": "TAXONOMY", "pattern": texts})

In [None]:
patterns[:10]
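
A side note on the patterns: splitting each name on whitespace assumes that spaCy's tokenizer would split it in the same places, which may not hold if a name contains punctuation. As far as I know, `EntityRuler` also accepts a plain string as a "phrase pattern" for exact matches, which sidesteps that assumption. The sketch below only builds the two kinds of pattern dictionaries side by side, using a small made-up set of taxa, and sorts the set so the order is the same on every run.

```python
# A small stand-in for the unique_entity_taxons set built above
taxa = {"Primula vulgaris", "Pulmonaria", "Melampyrum arvense"}

# Token patterns: one dictionary per whitespace-separated word (as in the cell above)
token_patterns = [
    {"label": "TAXONOMY", "pattern": [{"TEXT": word} for word in taxon.split()]}
    for taxon in sorted(taxa)
]

# Phrase patterns: the whole name as a single string
phrase_patterns = [
    {"label": "TAXONOMY", "pattern": taxon}
    for taxon in sorted(taxa)
]

print(token_patterns[1])
print(phrase_patterns[1])
```

Either list could be passed to `ruler.add_patterns()`; which works better on the Henslow letters is something I have not tested here.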

In [None]:
from spacy.pipeline import EntityRuler

nlp = en_core_web_sm.load(disable=['tagger', 'parser'])
nlp.add_pipe(sentencizer)

# Add the new label to the 'ner' component
ner = nlp.get_pipe('ner')
ner.add_label("TAXONOMY")

# Create the new EntityRuler
ruler = EntityRuler(nlp, overwrite_ents=True, validate=True)

# Add patterns to the new EntityRuler
ruler.add_patterns(patterns)

# Add new component to the pipeline
nlp.add_pipe(ruler)

# Check the components in the pipeline
nlp.pipeline

Let's now check to see if the new "TAXONOMY" entity has been created as we expect it:

In [None]:
%%time

from spacy.matcher import Matcher

matcher = Matcher(nlp.vocab, validate=True)
pattern = [{"ENT_TYPE": "TAXONOMY"}]
matcher.add("Taxonomy", [pattern])

taxonomy_data=[]
for doc in nlp.pipe(transcriptions, batch_size=25, n_process=2):
    
    for sent in doc.sents:
        sent_doc = nlp(sent.text)
        matches = matcher(sent_doc)
        
        # If there is a match, get all the named entities and append them to the training data list
        if matches:
            entities = [(ent.text, 
             (ent.start_char-sent.start_char), 
             (ent.end_char-sent.start_char), 
             ent.label_) 
             for ent in sent.ents]
            
            taxonomy_data.append((sent.text, {"entities": entities}))

In [None]:
taxonomy_data[179:182]

It is no surprise in this and other examples that while the exact matches for the patterns (e.g. 'Primula vulgaris') have been successfully labelled "TAXONOMY", other taxonomic names we know are still wrongly identified, e.g.:
* `('Rubia', 35, 40, 'PERSON')`
* `('Valeriana', 96, 105, 'GPE')`

For the model to be able to generalise about the new entity "TAXONOMY" we need to train it.

---
---
## Overview of Training Data Examples

Over the course of this notebook we have looked at multiple ways to create training data:

* Using spaCy's "matchers" to **match specific patterns** in the predicted named entities _or_
* Creating **manually-annotated** named entity labels.

We have run through the following examples of creating training data:

* To correct the named entity type of "Cambridgeshire" to `'GPE'` (using `Matcher`).
* To correct sign-offs by providing a list of known person signatures (using `PhraseMatcher`).
* To add a new named entity type "TAXONOMY" by providing manually-annotated data (exported from Doccano).

We have also directly added a new named entity type "TAXONOMY" using a rule-based approach (using `EntityRuler`), by providing a list of taxonomic terms.

Now we will try using the **manually-annotated data for a new named entity "TAXONOMY"** to update the machine learning model in a **simplified example**.

---
---
## Prevent Catastrophic Forgetting

Previously, we saved the manually-annotated "TAXONOMY" entities with the name `taxonomy_train_data`. Note that this training set includes whole transcriptions, not sentences.

In [None]:
taxonomy_train_data[16]

In general, before any training occurs, we must ensure our training set includes examples that have **no named entity labels** so that the model does not learn to generalise incorrectly. Fortunately, in this example, we already have documents with no "TAXONOMY" entities in the manually-labelled data:

In [None]:
taxonomy_train_data[39]

It's also crucial that the examples include entities that are already **correctly labelled by the model**. Otherwise the model may forget labels it previously predicted correctly, or forget that certain spans should not be labelled at all. This is known as **catastrophic forgetting**.

> There is more you can do to prevent a model from forgetting its initialized knowledge, called **rehearsing**, but this is out of our scope.

To simplify things, we will not do this in this example. But you should do this in your real-world project!
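
For reference, one common way to do it is to run the original model over each training text, keep the entities it predicts, and merge them with your new manual labels. The sketch below shows only the merging step in plain Python, under the assumption that a manual label wins whenever a prediction overlaps it. The function name `merge_annotations`, the sample text and the "predicted" entities are all invented for illustration; in the notebook the predictions would come from something like `[(ent.start_char, ent.end_char, ent.label_) for ent in nlp(text).ents]`.

```python
def merge_annotations(gold_entities, predicted_entities):
    """Keep all gold entities, plus any predicted entity that does not overlap one."""
    merged = list(gold_entities)
    for start, end, label in predicted_entities:
        overlaps_gold = any(
            start < gold_end and gold_start < end
            for gold_start, gold_end, _ in gold_entities
        )
        if not overlaps_gold:
            merged.append((start, end, label))
    return sorted(merged)


text = "Mr. Williams of Pitmaston grows Primula vulgaris"

# Manual label for the new entity type
gold = [(32, 48, "TAXONOMY")]

# Hypothetical predictions from the original model, including one mistake
predicted = [(4, 12, "PERSON"), (16, 25, "GPE"), (32, 39, "PERSON")]

merged_entities = merge_annotations(gold, predicted)
for start, end, label in merged_entities:
    print(text[start:end], label)
```

The mistaken `PERSON` prediction on "Primula" is dropped because it overlaps the manual label, while the other two predictions are kept as reminders of what the model already knows. The predictions you keep still need reviewing, since any wrong ones would be reinforced.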

---
---
## Training and Validation Sets

In general, it's also important that we take our training set and reserve a certain portion for validating the results -- before setting the new model loose on the rest of our data.

In our simplified case, we only have 33 examples of documents with "TAXONOMY" entities out of a total set of 40, so we are not going to do this! But, again, this is a vital step in a real-world project.

In [None]:
examples_with_entities = [example for example in taxonomy_train_data if example[1].get('entities') != []]
len(examples_with_entities)
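
In a project with enough data, reserving a validation set could look something like this minimal sketch. The function name `split_examples`, the 20% fraction and the fixed seed are my own choices, not a recommendation from spaCy, and the 40 placeholder examples stand in for real annotated texts.

```python
import random


def split_examples(examples, validation_fraction=0.2, seed=0):
    """Shuffle a copy of the examples and split off a validation portion."""
    shuffled = list(examples)
    # A seeded random generator makes the split repeatable between runs
    random.Random(seed).shuffle(shuffled)
    n_validation = max(1, int(len(shuffled) * validation_fraction))
    return shuffled[n_validation:], shuffled[:n_validation]


# Placeholder examples in spaCy's training format
examples = [(f"text {number}", {"entities": []}) for number in range(40)]

train_set, validation_set = split_examples(examples)
print(len(train_set), len(validation_set))
```

The validation examples must never be passed to `nlp.update()`, otherwise they no longer tell you anything about how the model handles unseen text.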

---
---
## Updating the Model


### Training Data

> Before going any further, we need to make sure our training data does **not** have the **entity texts** in it. If you remember from the introduction to this notebook, I included the entity texts so we could better see the results of our code. Now we need to remove them.

In [None]:
TRAIN_DATA = []
for annotation in annotations:
    text = annotation.get('text')
    entities = [(start, end, type_) for start, end, type_ in annotation.get('labels')]
    TRAIN_DATA.append((text, {'entities': entities}))
    
TRAIN_DATA[1]

### The Training Loop

Finally, we can try to update the model!

The steps to training the model are:

* **Load** the model to start with
* **Shuffle** and **loop** over the examples
* **Update** the model by calling **[`nlp.update()`](https://spacy.io/api/language#update)**
* Save the model
* Test the model

The **training loop** goes over the examples several times to update the statistical model, and by **shuffling** the examples we prevent the model generalising based on the order of the examples.

> There's lots and lots more about updating the model in the spaCy documentation: see [Training the named entity recognizer](https://spacy.io/usage/training#ner).

First, let's set up the default English model we have used before, with unwanted pipes disabled, and the new entity type added to the 'ner' component:

In [None]:
import en_core_web_sm
nlp = en_core_web_sm.load()

# Disable all the other pipeline components
other_pipes = [pipe for pipe in nlp.pipe_names if pipe != 'ner']
nlp.disable_pipes(*other_pipes)

# Add the new label to the 'ner' component
ner = nlp.get_pipe('ner')
ner.add_label("TAXONOMY")

We need to create an **optimizer**, which is a function that holds intermediate results when updating the model.

(NB: Some other tutorials will use `nlp.begin_training()`, but if you do this to create the optimizer automatically instead, it forgets all the entity types and you have to add them back. You can also use `nlp.resume_training()` instead of creating an optimizer manually.)

In [None]:
# Create an optimizer to hold intermediate results
optimizer = nlp.entity.create_optimizer()

Then we present the training examples in a loop and in **random order**, so that the model does not learn anything from the order of the examples. We also **batch** the examples up in small sizes (between 2 and 4 documents at a time), so that each update to the model is based on more than one example.
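
To see what batch sizes this actually produces, here is a simplified plain-Python imitation of the idea behind spaCy's `compounding` and `minibatch` helpers. These are my own re-implementations for illustration, named `compounding_sizes` and `make_minibatches` to keep them distinct from the real functions, and they only cover the case where the size grows.

```python
from itertools import islice


def compounding_sizes(start, stop, compound):
    """Yield start, start * compound, start * compound ** 2, ... capped at stop."""
    value = float(start)
    while True:
        yield min(value, stop)
        value *= compound


def make_minibatches(items, sizes):
    """Cut items into consecutive batches, taking each batch size from sizes."""
    iterator = iter(items)
    while True:
        batch = list(islice(iterator, int(next(sizes))))
        if not batch:
            return
        yield batch


# 40 placeholder items standing in for the 40 annotated letters
examples = list(range(40))
batch_lengths = [len(batch) for batch in make_minibatches(examples, compounding_sizes(2.0, 4, 1.001))]
print(batch_lengths)
```

With a growth factor of 1.001 and only 40 documents, the size never gets far enough above 2 to round up, so in this sketch every batch holds two documents. The upper limit of 4 only starts to matter with far more examples or a larger growth factor.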

> NOTE: The code below may take several minutes to complete.

In [None]:
%%time

import random
from spacy.util import minibatch, compounding

# Loop over the training data 10 times
for i in range(10):
    
    # Shuffle the training data into a random order
    random.shuffle(TRAIN_DATA)
    
    # Set up the batch sizes
    max_batch_size = 4
    batch_size = compounding(2.0, max_batch_size, 1.001)
    
    # Create the batches of training data
    batches = minibatch(TRAIN_DATA, size=batch_size)
    
    losses = {}
    
    # Update the model with each batch
    for batch in batches:
        # Use a new name here so we don't overwrite the `annotations` list loaded from Doccano
        texts, annots = zip(*batch)
        
        # Update the model, passing in the optimizer
        nlp.update(texts, annots, sgd=optimizer, drop=0.3, losses=losses)
        
    print("Loss", losses)

**Success!** If the code cell returned a time, something like below, then you have successfully updated the model!

```
CPU times: user 27.4 s, sys: 13.4 ms, total: 27.4 s
Wall time: 27.4 s
```

### Loss and Model Performance
What is this "loss" that has printed out? During training, the goal is to minimize the error of the model prediction. This error is called the **loss**. 

Ideally, the value should **decrease** each time you loop over the training examples. If it increases or stays the same, then you need to make some changes. In a real-world scenario, you would need to fiddle with the various options (loop, batch size, examples, etc.) in order to get the ideal performance.

If you find that the loss starts going back up again, you need to stop after a certain number of iterations to preserve the best model performance.
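
As a rough illustration of "stopping at the right time", the sketch below picks the iteration with the lowest loss and gives up once the loss has failed to improve a set number of times in a row. The function name `stop_iteration`, the `patience` parameter and the loss values are all hypothetical, invented for this example; in a real project you would normally make this decision using scores on a separate evaluation set rather than the training loss alone.

```python
def stop_iteration(losses, patience=2):
    """Return the index of the lowest loss seen before `patience`
    consecutive iterations without improvement."""
    best_index = 0
    waited = 0
    for index, loss in enumerate(losses):
        if loss < losses[best_index]:
            # New best loss: remember it and reset the waiting counter
            best_index = index
            waited = 0
        elif index > 0:
            # No improvement this iteration
            waited += 1
            if waited >= patience:
                break
    return best_index


# Hypothetical loss values, one per training loop (made up for illustration)
example_losses = [41.2, 30.5, 22.8, 19.1, 19.6, 20.4, 18.0]

best = stop_iteration(example_losses, patience=2)
print(best, example_losses[best])  # prints 3 19.1
```

Note that with a `patience` of 2 the sketch stops before it ever sees the final value of 18.0: a small `patience` saves time but can stop too early, which is one of the trade-offs you would need to experiment with.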

### Save and Reload the Updated Model
After updating the model, we want to save it so that we can re-use it another time.

In [None]:
# Rename the model and serialize it to disk
nlp.meta["name"] = 'core_web_sm_taxonomy'
nlp.to_disk('output/core_web_sm_taxonomy')

Now if you check the `output/core_web_sm_taxonomy` folder in the Jupyter notebook listing you should see something like this:

![Jupyter notebook listing showing the updated model saved to a folder](assets/taxonomy_saved_model.png "Jupyter notebook listing showing the updated model saved to a folder")

To use the updated model again, we simply load it like this:

In [None]:
import spacy
nlp2 = spacy.load('output/core_web_sm_taxonomy')

# Review the metadata to see the new name and entity type
nlp2.meta
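
If you only want to check what was saved without loading the whole model, one option is to read the metadata file directly. This is a minimal sketch that assumes the saved model folder contains a `meta.json` file (which is what spaCy v2 writes alongside the pipeline components); the helper name `read_model_meta` is made up for illustration.

```python
import json
from pathlib import Path


def read_model_meta(model_dir):
    """Return the contents of `meta.json` in a saved model folder,
    or None if the file does not exist."""
    meta_path = Path(model_dir) / "meta.json"
    if not meta_path.is_file():
        return None
    with meta_path.open(encoding="utf-8") as meta_file:
        return json.load(meta_file)


meta = read_model_meta("output/core_web_sm_taxonomy")
if meta is None:
    print("No saved model found yet")
else:
    # .get() is used because the exact keys present are an assumption
    print(meta.get("name"), meta.get("pipeline"))
```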

### Test the Updated Model
So, let's test the updated model with a transcription from the training set that we know it should recognise:

In [None]:
%%time

doc2 = nlp2(u"I have to thank you for copy of Dict.– I have not proceeded with it – still whenever you feel that you have a certain claim on me be pleased to say so & it shall be met duly. I have always believed that the Primula vulgaris, elatior, & veris, were but varieties of one species, & I think you had the same opinion. If it be not inconvenient will you oblige me with a note by return, stating any evidence that has fallen under y. r notice. I shall in next No. of Botanic Garden & Scientist publish two varieties raised by Mr. Williams of Pitmaston from seed of the Cowslip – the one a pretty Polyanthus. Mr W. has many varieties, intermediate between the wild cowslip & garden Polyanthus – perhaps I can find you one – all are from cowslip seed. I am anxious to compare different varieties of Wheat with each other, but do not find it easy to obtain them. If you can oblige me with an Ear of a sort of any Suffolk varieties I shall have pleasure in “Paying in Kind”. Can you direct me to a detail of the proximate principles of many varieties? (besides Mackenzie’s) I find here & there an analysis, but nothing worth notice. It appears to me that if we deal with the Gluten & Starch it is sufficient for ordinary purposes. Does this coincide with your observations? Have you any nice method of mounting your specimens of Wheat? That a happy year be allotted your self & family circle, is the wish of, my dear Prof. r  Yours faithfully | Benjamin Maund")
for ent in doc2.ents:
    # Use label_ (with underscore) for the readable label; label is an integer ID
    print(ent.text, ent.label_)

What is the output?
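
One way to judge the output more systematically is to compare the predicted `(text, label)` pairs against pairs you have annotated by hand. The sketch below does this with plain Python sets. Both the `expected` and `predicted` sets are hypothetical values invented for illustration; they are not the real output of the updated model.

```python
# Hand-annotated entities we would like the model to find (hypothetical)
expected = {
    ("Primula vulgaris", "TAXONOMY"),
    ("Polyanthus", "TAXONOMY"),
    ("Wheat", "TAXONOMY"),
    ("Suffolk", "GPE"),
}

# Entities a model might return (hypothetical, NOT real model output)
predicted = {
    ("Primula vulgaris", "TAXONOMY"),
    ("Wheat", "TAXONOMY"),
    ("Suffolk", "TAXONOMY"),
}

correct = expected & predicted   # found with the right label
missed = expected - predicted    # annotated but not predicted
spurious = predicted - expected  # predicted but not annotated

# Precision: share of predictions that are right
precision = len(correct) / len(predicted)
# Recall: share of annotated entities that were found
recall = len(correct) / len(expected)

print("Missed:", sorted(missed))
print("Spurious:", sorted(spurious))
print(f"Precision {precision:.2f}, recall {recall:.2f}")
```

With real predictions you could build the `predicted` set from `doc2.ents` using `{(ent.text, ent.label_) for ent in doc2.ents}`.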

> **EXERCISE**: Try some other letters or sentences. Has the updated model predicted the new "TAXONOMY" named entities correctly? Are there any problems? How can you explain what has happened? How could we improve the results? Hint: Re-read [this section](4-updating-the-model-on-henslow-data.ipynb#This-Notebook-is-for-Demonstration-Purposes) and [this section in the previous notebook](3-principles-of-machine-learning-for-named-entities.ipynb#Catastrophic-Forgetting).

### Model Training using the Command-Line
Finally, the recommended way to train a spaCy model for real projects is with the command-line interface (CLI) `spacy train`, which you can read about in the documentation: [Training via the command-line interface](https://spacy.io/usage/training#spacy-train-cli).

In brief, the CLI has lots of ways of helping you optimise model performance. For example, it evaluates the model on your evaluation set after each training loop, so training can be stopped early, before model performance starts to worsen.

You will need the training data in a **[spaCy train JSON format](https://spacy.io/api/annotation#json-input)** rather than the list of tuples needed for scripts.

Doccano provides a utility for transforming its export format into spaCy train JSON format:

In [None]:
from pathlib import Path

from doccano_transformer.datasets import NERDataset
from doccano_transformer.utils import read_jsonl

filepath = Path('data', 'doccano_ner_test.jsonl').absolute()

dataset = read_jsonl(filepath=filepath, dataset=NERDataset, encoding='utf-8')
spacy_train_data = [item for item in dataset.to_spacy(tokenizer=str.split)]
spacy_train_data[0]
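
For comparison, the list-of-tuples format used by the training script in this notebook can be built from an annotation export with plain Python. The sketch below assumes each line of the export is a JSON object with a `text` field and a `labels` list of `[start, end, label]` triples; your Doccano version may export a different layout, so check a line of your own file first. The helper name `doccano_line_to_spacy_tuple` and the sample line are made up for illustration.

```python
import json


def doccano_line_to_spacy_tuple(line):
    """Convert one JSONL line into a (text, {"entities": [...]}) tuple.

    Assumes the line has a "text" field and a "labels" list of
    [start, end, label] triples (an assumption about the export format).
    """
    record = json.loads(line)
    entities = [(start, end, label) for start, end, label in record["labels"]]
    return (record["text"], {"entities": entities})


# A hypothetical export line, written by hand for this example
sample_line = (
    '{"text": "Primula vulgaris and veris are varieties of one species.", '
    '"labels": [[0, 16, "TAXONOMY"]]}'
)

text, annotations = doccano_line_to_spacy_tuple(sample_line)

# Check that the character offsets really select the entity text
print(text[0:16], annotations)
```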

---
---
## Summary

In this notebook:

- All machine learning projects require a lot of time to be spent on **cleaning** input data and **correcting** outputs.
- It is possible to combine **rule-based approaches** with machine learning approaches in the same project. A variety of rule-based pipeline components are available with spaCy.
- To clean up **sentence boundaries** we used the rule-based `Sentencizer`. 
- When **processing large datasets** the most efficient method is to use **`nlp.pipe()`**. This streams texts one-by-one, or in small batches, to avoid loading everything into computer memory. It also gives the option to process in parallel on multiple computer cores.
- Creating good **training examples** is one of the most difficult aspects of machine learning.
- We tried to create training examples for:
  * Correcting wrong named entities, with `Matcher`;
  * Large lists of terminology, with `PhraseMatcher`;
  * Adding a new named entity, with manual annotation.
- It is possible to add new named entities based on patterns with the rule-based matcher `EntityRuler` and combine this with training the model.
- We re-trained the model by **shuffling** and **looping** over the examples, **updating** the model by calling **`nlp.update()`**, and then **saving** the updated model to disk. During training, we monitored the error of the model prediction, known as the **loss**.
- Upon testing the updated model we found that it had suffered **catastrophic forgetting**.

In the [next notebook](5-linking-named-entities.ipynb) we will look at how to **link** entities to authorities, so that entities can be reconciled to established people, places and so on; and to knowledge bases, so that entities can be linked to known facts.