# Principles of Machine Learning for Named Entities
---
---

## Machine Learning Model Predictions

A machine learning (ML) algorithm "learns" from past experience by storing that experience in a **statistical model**. This happens by training the algorithm on a large amount of relevant data. Then, when it encounters new data, it applies the existing statistical model, makes the best **predictions** it can and calculates a **degree of certainty** in those predictions.

When a language model creates a named entity label for a particular token (or span of tokens) it is actually making predictions (or "guesses", if you like) and recording the one with the highest probability.

<a href="https://museums.cam.ac.uk/research/cambridge-university-herbarium"><img src="https://museums.cam.ac.uk/sites/default/files/inline-images/Herbarium%202.jpg" alt="Cambridge University Herbarium" title="Cambridge University Herbarium"></a>
<p style="text-align: center; font-style: italic;">Cambridge University Herbarium</p>

For example, let's consider the word "Herbarium" in its context within this sentence in a Henslow letter (see the [last notebook](2-named-entity-recognition-of-henslow-data.ipynb#NER-in-Practice:-A-Letter-from-William-Christy,-Jr.,--to-John-Henslow)):

> _"I am only anxious to shew you every opportunity of benefiting your Herbarium."_

A model might calculate a probability for each possible label, which could look something like this:

Named Entities:
* "PRODUCT": 44%
* "ORG": 41%
* "WORK_OF_ART": 9%
* "PERSON": 5%
* (All others... 1% in total)

So, while we as humans can tell "PRODUCT" is not accurate in the context of this letter, it has the highest probability (44%) of being correct as far as the model is concerned, based on what it has learnt in the past.

---
> I am only anxious to shew you every opportunity of benefiting your 
<mark class="entity" style="background: #bfeeb7; padding: 0.45em 0.6em; margin: 0 0.25em; line-height: 1; border-radius: 0.35em;">
    Herbarium
    <span style="font-size: 0.8em; font-weight: bold; line-height: 1; border-radius: 0.35em; text-transform: uppercase; vertical-align: middle; margin-left: 0.5rem">PRODUCT</span>
</mark>
.

---

Of course, I have made up these figures for the purpose of my explanation. This is not actually how spaCy's algorithm works in detail, and it's not really possible to access raw scores like this from spaCy in a comparable way, but it helps illustrate my point that a model's output is probabilistic.
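
To make the selection step concrete, here is a minimal sketch in plain Python using the invented figures above. It is not spaCy's API: the `label_scores` dictionary and the `margin` calculation are assumptions made purely for illustration.

```python
# Made-up scores from the example above (not real spaCy output)
label_scores = {
    "PRODUCT": 0.44,
    "ORG": 0.41,
    "WORK_OF_ART": 0.09,
    "PERSON": 0.05,
}

# The predicted label is the candidate with the highest score
predicted_label = max(label_scores, key=label_scores.get)
confidence = label_scores[predicted_label]

# How far ahead of the runner-up is the top label?
ranked = sorted(label_scores.values(), reverse=True)
margin = ranked[0] - ranked[1]

print(f"Predicted: {predicted_label} ({confidence:.0%}), margin: {margin:.0%}")
```

A narrow margin like this one is a hint that the prediction deserves a second look from a human.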

---
---

## Lifecycle of Machine Learning

Clearly, one pass over some training data may not be sufficient in many cases. It's necessary to train the model, check the model, re-train the model, check again, and so on, iterating on the model to achieve the most acceptable result with the time and resources available. This is often referred to as a lifecycle.

* **Train** the model: feed the algorithm a large set of correctly labelled data.
* **Validate** the model: test its accuracy by asking for predictions on a subset of labelled data that has been reserved for this purpose.
* **Re-train** the model: if necessary, update the model to make better predictions.
* **Apply** the model: run the main body of your novel data through the algorithm and get the predictions.
 
As you work through a ML project, you may need to repeat and finesse these steps.
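
The validation step depends on holding back part of your labelled data before training starts. The sketch below shows one simple way to do that in plain Python. The function name `split_labelled_data`, the 20% fraction and the placeholder examples are all assumptions for illustration, not part of spaCy.

```python
import random


def split_labelled_data(examples, validation_fraction=0.2, seed=0):
    """Shuffle labelled examples and reserve a fraction for validation."""
    shuffled = list(examples)  # copy so the caller's list is left untouched
    random.Random(seed).shuffle(shuffled)
    n_validation = max(1, int(len(shuffled) * validation_fraction))
    # First slice is for training, second is held back for validation
    return shuffled[n_validation:], shuffled[:n_validation]


# Ten placeholder examples standing in for labelled letters
examples = [(f"letter {i}", {"entities": []}) for i in range(10)]
train_data, validation_data = split_labelled_data(examples)

print(len(train_data), "for training,", len(validation_data), "held back for validation")
```

Fixing the `seed` keeps the split the same between runs, so that results from successive rounds of re-training can be compared fairly.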

### Catastrophic Forgetting

Choosing which examples to train or update a model with is a skilled task. For example, if your training data contains no examples of the named entities that the model has already seen and can predict correctly, you may find that the model stops predicting those entities altogether. This is called **catastrophic forgetting** and is something you have to work hard to avoid!

![Caïn venant de tuer son frère Abel, by Henri Vidal in Tuileries Garden in Paris, France. Alex E. Proimos: CC BY 2.0](https://upload.wikimedia.org/wikipedia/commons/thumb/3/3b/Paris_Tuileries_Garden_Facepalm_statue.jpg/320px-Paris_Tuileries_Garden_Facepalm_statue.jpg "Caïn venant de tuer son frère Abel, by Henri Vidal in Tuileries Garden in Paris, France. Alex E. Proimos: CC BY 2.0")
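
One common way to guard against forgetting is to mix your new examples with "revision" examples that the model already predicts correctly, an approach often described as rehearsal. The following is only a sketch under that assumption: `mix_training_data` is a hypothetical helper, and the "ORG" label on "Herbarium" is a hypothetical human correction.

```python
import random


def mix_training_data(new_examples, revision_examples, seed=0):
    """Combine new examples with ones the model already handles, then shuffle."""
    combined = list(new_examples) + list(revision_examples)
    random.Random(seed).shuffle(combined)
    return combined


# A new example carrying a (hypothetical) corrected label
new_examples = [
    ("benefiting your Herbarium", {"entities": [(16, 25, "ORG")]}),
]

# A revision example the model is assumed to predict correctly already
revision_examples = [
    ("Yours very sincerely | John Evans", {"entities": [(23, 33, "PERSON")]}),
]

training_data = mix_training_data(new_examples, revision_examples)
print(len(training_data), "examples in the mixed training set")
```

Shuffling matters here because it stops the model seeing all the new examples in one run followed by all the old ones.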

---
---

## Working with spaCy's Default Language Models

As we have seen, spaCy provides some [pre-trained language models](https://spacy.io/usage/models) you can download and use for a small range of modern European languages. These have been trained on one or more large, high-quality datasets. 

These training sets may have limited relevance for projects that you hope to work with, such as:

* language dialects
* ancient or historical forms of language
* context-specific language styles. 

For named entities, in particular, you may wish to recognise _new_ named entities or a _different set_ of named entities than provided by default.

The alternatives are:

* **Train a new model from scratch**: this may be appropriate in some cases, but it requires a lot of effort to create labels and it is computationally expensive.
* **Improve an existing model**: if you can just re-train a model to improve or modify it, this will be significantly easier and less time consuming.

For an overview of the technical details, you can read more about [Training spaCy’s Statistical Models](https://spacy.io/usage/training#basics).

### Support for Other Languages

Support is currently being developed for many other languages, but they are at various stages of development. Many have some elements of a tokenizer and lemmatizer available, but no pipeline components for part-of-speech tagging or named entity recognition.

Below is code for how to load and run a blank model with a Russian tokenizer and feed in some example sentences. You can substitute Russian for any of the [supported languages listed](https://spacy.io/usage/models#languages) (with mixed results).

In [None]:
import spacy
from spacy.lang.ru import Russian
from spacy.lang.ru.examples import sentences

# Create an empty Russian language model
nlp = Russian()

# Process list of example texts in Russian
docs = nlp.pipe(sentences)

# Print each sentence and its alphabetic tokens and lemmas
for doc in docs:
    print(f'\nSentence: {doc.text}')
    for token in doc:
        if token.is_alpha:
            print(f'Token: {token.text}, Lemma: {token.lemma_}')

---
---

## Training Data Labelling and Format
Before we start training a model, we need to consider how we can acquire the training data we need.

As we have seen, training data is a portion of your data that has been accurately **labelled** with the location of each named entity in the text and its entity type. Labelling is one type of **annotation** for machine learning.

However, spaCy only accepts annotations in one of two formats:

* A list of texts and named entity labels — when using the [`nlp.update()` method](https://spacy.io/api/language#update)
* JSON format — when using spaCy's [`spacy train` command](https://spacy.io/usage/training#spacy-train-cli)

For example, the `nlp.update()` method needs to receive training data like this:

`[("Yours very sincerely | John Evans", {"entities": [(23, 33, "PERSON")]}),]`

It hardly needs to be said, this is not very user friendly for a human! 😫 
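
The numbers in the tuple are character offsets: the entity starts at character 23 and ends just before character 33. As a small sketch, Python can find the offsets for you rather than you counting by hand. Note that `str.find` only locates the first occurrence and returns -1 when nothing matches, so this is a convenience for simple cases, not a general solution.

```python
text = "Yours very sincerely | John Evans"
entity_text = "John Evans"

# Find the entity's start offset instead of counting characters by hand
start = text.find(entity_text)
end = start + len(entity_text)

# Build the (text, annotations) tuple in the shape shown above
example = (text, {"entities": [(start, end, "PERSON")]})

# Sanity check: slicing with the offsets should give back the entity text
assert text[start:end] == entity_text

print(example)
# → ('Yours very sincerely | John Evans', {'entities': [(23, 33, 'PERSON')]})
```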

We might need to manually add or correct hundreds of labels. **What can we do to make it less painful?**

If we used displaCy to visualise this label it might look like this:

---
> Yours very sincerely | 
<mark class="entity" style="background: #aa9cfc; padding: 0.45em 0.6em; margin: 0 0.25em; line-height: 1; border-radius: 0.35em;">
    John Evans
    <span style="font-size: 0.8em; font-weight: bold; line-height: 1; border-radius: 0.35em; text-transform: uppercase; vertical-align: middle; margin-left: 0.5rem">PERSON</span>
    </mark>

---

This sort of visual interface would be ideal for annotating data as a human. In the next section we will look at software for doing just that.

---
---

## Annotation as a Human

### Labelling Tasks

As a human, there are several different labelling tasks you may want to perform:
* Labelling data from scratch.
* Verifying or rejecting labels predicted by spaCy.
* Labelling data for a new type of entity that spaCy doesn't know about yet.

Which of these tasks you need to do depends on your project, but we will have the opportunity to try all three.

### Annotation Using Doccano

There are various software options for annotation as a human. Arguably the best for integration with spaCy is [Prodigy](https://prodi.gy/), which is a high-quality annotation tool made by the same company that created spaCy. It is a paid product, which means we can't use it for CDH Data School this year, but if you are serious about building any machine learning pipeline into your institution's workflow then you may wish to consider this professional annotation tool.

For the moment we need a free and open-source alternative, of which there are many of varying quality, but [Doccano](https://doccano.herokuapp.com/) is perhaps the most polished, and it is collaborative, which means we can all edit the same documents.

<img src="https://raw.githubusercontent.com/doccano/doccano/master/frontend/assets/icon.png" alt="Doccano logo" title="Doccano logo" width="200">

The Doccano interface for annotating named entities looks something like this:

![Doccano annotation interface with example text and named entities](assets/doccano-named-entities.png "Doccano annotation interface with example text and named entities")

---
> **EXERCISE**: Follow the instructions given to you by the trainer to open Doccano in your browser, log in and try the various tasks.

> Note: If you are following this notebook outside the context of the CDH Data School 2020, then you can try the [official Doccano demo](https://doccano.herokuapp.com/demo/named-entity-recognition/).

---

---
---
## Converting Between spaCy and Doccano Formats

In order to load the named entities spaCy produces into Doccano, it is necessary to transform the spaCy output into **Doccano's JSONL format**, where each document sits on its own line:

```
{"text": "We start with a letter Charles Lyell sent to Darwin in 1865 where he discusses the latest revision of his book Elements of Geology and relates to Darwin his discussions with various aquaintances about Darwin's own book On the Origin of Species. This letter has been transcribed and annotated in TEI-XML by editors from the DCP team in Cambridge.", "labels": [[23, 36, "PERSON"], [45, 51, "PERSON"], [55, 59, "DATE"], [111, 130, "WORK_OF_ART"], [146, 152, "PERSON"], [201, 207, "PERSON"], [219, 243, "WORK_OF_ART"], [323, 326, "ORG"], [335, 344, "GPE"]]}
```
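
To show the shape of the conversion, here is a minimal sketch that turns one text and its entity spans into a single line in the format above, using only Python's standard `json` module. The function `to_doccano_line` is hypothetical and written for illustration. It is not the code used by the utility class described next.

```python
import json


def to_doccano_line(text, entities):
    """Serialise one document as a single Doccano-style JSONL line."""
    # Tuples become lists so the output matches the [start, end, label] shape
    record = {"text": text, "labels": [list(span) for span in entities]}
    # ensure_ascii=False keeps accented characters readable in the file
    return json.dumps(record, ensure_ascii=False)


line = to_doccano_line("Yours very sincerely | John Evans", [(23, 33, "PERSON")])
print(line)
# → {"text": "Yours very sincerely | John Evans", "labels": [[23, 33, "PERSON"]]}
```

Because `json.dumps` escapes any newline characters inside the text, each document is guaranteed to stay on one line.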

I have written a utility class called **`DoccanoNamedEnts`** to help us bootstrap a first pass of named entity recognition into Doccano for annotation. You can browse the code in [doccano/doccano_named_ents.py](doccano/doccano_named_ents.py).

NB: This script is written specifically for our usage with letters marked up in Cambridge-style TEI. To use it for your own projects you may need to make some modifications to extract relevant data from your XML files. It's also designed for educational rather than production use.

Here is how to use it:

In [None]:
# Import the DoccanoNamedEnts class
from doccano.doccano_named_ents import DoccanoNamedEnts

# Create an instance with the whole set of Henslow data
labels = DoccanoNamedEnts('data/henslow')

# Print the NER labels in Doccano format if you want to copy and paste it
labels.print()

In [None]:
# Write the labels in Doccano format to JSONL file
labels.to_file('output/doccano_ner_henslow.jsonl')

This JSONL file can now be uploaded to Doccano ready for annotation. 

> Doccano has a 1 MB upload limit on file size, so you may need to batch your output into multiple files. Make sure that all your XML documents have valid, non-empty transcriptions, as Doccano will reject any file that contains empty fields, e.g. `{"text": "", "labels": []}`. Your file must not contain any blank lines, so check for these before you try to upload.
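
The batching and empty-field checks described above could be sketched as follows. This is an illustration, not part of `DoccanoNamedEnts`: the function `batch_jsonl` is hypothetical, and the default of 1,000,000 bytes is a cautious assumption about how the 1MB limit is counted.

```python
import json


def batch_jsonl(records, max_bytes=1_000_000):
    """Group records into JSONL strings that each stay under max_bytes."""
    batches, current, current_size = [], [], 0
    for record in records:
        if not record.get("text", "").strip():
            continue  # skip documents with empty transcriptions
        line = json.dumps(record, ensure_ascii=False)
        size = len(line.encode("utf-8")) + 1  # +1 for the newline separator
        if current and current_size + size > max_bytes:
            batches.append("\n".join(current))  # close the full batch
            current, current_size = [], 0
        current.append(line)
        current_size += size
    if current:
        batches.append("\n".join(current))
    return batches


records = [
    {"text": "Yours very sincerely | John Evans", "labels": [[23, 33, "PERSON"]]},
    {"text": "", "labels": []},  # would be rejected, so it is dropped
]
batches = batch_jsonl(records)
print(len(batches), "batch(es) ready to write to file")
```

Joining lines with `"\n"` rather than ending each one with a newline means no batch contains a blank line. A single record larger than `max_bytes` still ends up in a batch of its own, so very long documents would need separate handling.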

NB: On the Doccano setup for this workshop, you do not have access to the import feature. To try this out you will need to install your own version.

---
---
## Summary

In this notebook:

- A machine learning algorithm learns from the data it has been trained on and stores its experience as a **statistical model**. The trained model can make **predictions** for text it has never seen before. For example, it can predict which named entities are the most probable for a given token (or span) in a given context.
- The model is not always correct in its predictions and may have to be **trained** again to improve its performance.
- Training takes place by feeding the model **training data**, which is data that has been accurately **labelled** by humans.
- We can use **text annotation software** to make it easier for humans to label training data by hand.
- The model's predictions must be **evaluated** (validated) against a subset of labelled data reserved for the purpose.

In the [next notebook](4-updating-the-model-on-henslow-data.ipynb) we will do a **codewalk** through an example of re-training the model with Henslow data. Using Python, we will clean up the tokenization, correct named entity predictions and add a new named entity type.