# Chapter 4. Training a neural network model

* Learn how to update spaCy's statistical models to customize them for your use case
    * for example, to predict a new entity type in online comments
* Write your own training loop from scratch and understand the basics of how training works, along with tips and tricks that can make your custom NLP projects more successful

## 01. Training and updating models

### Why update the model?
* Better results on your __specific domain__
* Learn __classification schemes__ specifically for your problem
* Essential for __text classification__
* Very useful for __named entity recognition__
* Less critical for __part-of-speech tagging__ and __dependency parsing__

### How training works (1)
1. __Initialize__ the model weights randomly with `nlp.begin_training`
2. __Predict__ a few examples with the current weights by calling `nlp.update`
3. __Compare__ prediction with true labels
4. __Calculate__ how to change weights to improve predictions
5. __Update__ weights slightly
6. Go back to __2__.
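
The six steps above can be sketched with a toy model in plain Python. This is only an illustration under stated assumptions: the "model" is a single weight `w`, the data and the learning rate are made up, and none of it reflects how spaCy's networks are built internally.

```python
import random

random.seed(0)

# Made-up training data: inputs paired with true labels (here y = 3 * x)
examples = [(1.0, 3.0), (2.0, 6.0), (3.0, 9.0), (4.0, 12.0)]

# 1. Initialize the weight randomly
w = random.uniform(-1.0, 1.0)
learning_rate = 0.01

for step in range(200):
    # 2. Predict a few examples with the current weight
    batch = random.sample(examples, 2)
    gradient = 0.0
    for x, y_true in batch:
        y_pred = w * x
        # 3. Compare the prediction with the true label
        error = y_pred - y_true
        # 4. Calculate how to change the weight (gradient of the squared error)
        gradient += 2 * error * x
    # 5. Update the weight slightly, then 6. go back to step 2
    w -= learning_rate * gradient / len(batch)

print(round(w, 3))
```

After enough iterations the weight ends up close to 3, the value that generated the labels.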

### How training works (2)
![](https://course.spacy.io/training.png)
* __Training data__: Examples and their annotations.
* __Text__: The input text the model should predict a label for.
* __Label__: The label the model should predict.
* __Gradient__: How to change the weights.

### Example: Training the __entity recognizer__
* The entity recognizer tags words and phrases in context
* Each token can only be part of one entity
* Examples need to come with context
`("iPhone X is coming", {'entities': [(0, 8, 'GADGET')]})`
* Texts with no entities are also important
`("I need a new phone! Any tips?", {'entities': []})`
* __Goal__: teach the model to generalize
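
The entity tuples are character offsets into the text, with the end offset exclusive, so they behave like Python slices. A minimal sketch for checking what an annotation actually points at (the helper name `entity_texts` is made up for this example):

```python
# Training examples in the (text, annotations) format shown above
examples = [
    ("iPhone X is coming", {"entities": [(0, 8, "GADGET")]}),
    ("I need a new phone! Any tips?", {"entities": []}),
]

def entity_texts(example):
    """Return the substring and label that each (start, end, label) tuple points at."""
    text, annotations = example
    return [(text[start:end], label) for start, end, label in annotations["entities"]]

for example in examples:
    print(entity_texts(example))
```

Slicing like this is a cheap way to catch off-by-one offsets before training.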

### The training data
* Examples of what we want the model to predict in context
* Update an __existing model__: a few hundred to a few thousand examples
* Train a __new category__: a few thousand to a million examples
    * spaCy's English models: 2 million words
* Usually created manually by human annotators
* Can be semi-automated – for example, using spaCy's `Matcher`!

## 02. Purpose of training
While spaCy comes with a range of pre-trained models to predict linguistic annotations, you almost always want to fine-tune them with more examples. You can do this by training them with more labelled data.

### What does training not help with?
* Improve model accuracy on your data.
* Learn new classification schemes.
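* Discover patterns in unlabelled data. (Correct answer: spaCy's components are supervised models, so they can only learn from labelled examples.)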

## 03. Creating training data(1)
spaCy’s __rule-based Matcher__ is a great way to quickly __create training data__ for named entity models. A list of sentences is available as the variable __TEXTS__. You can print it in the IPython shell to inspect it. We want to find all mentions of different iPhone models, so we can create training data to teach a model to recognize them as __'GADGET'__.

* Write a pattern for two tokens whose lowercase forms match `'iphone'` and `'x'`.
* Write a pattern for two tokens: one token whose lowercase form matches `'iphone'` and an optional digit using the `'?'` operator.

In [None]:
import json
from spacy.matcher import Matcher
from spacy.lang.en import English

with open("exercises/iphone.json") as f:
    TEXTS = json.loads(f.read())

nlp = English()
matcher = Matcher(nlp.vocab)

# Two tokens whose lowercase forms match 'iphone' and 'x'
pattern1 = [{"LOWER": "iphone"}, {"LOWER": "x"}]

# Token whose lowercase form matches 'iphone' and an optional digit
pattern2 = [{"LOWER": "iphone"}, {"IS_DIGIT": True, "OP": "?"}]

# Add patterns to the matcher
matcher.add("GADGET", None, pattern1, pattern2)

## 04. Creating training data(2)
Let’s use the match patterns we’ve created in the previous exercise to bootstrap a set of training examples. A list of sentences is available as the variable `TEXTS`.

* Create a doc object for each text using `nlp.pipe`.
* Match on the `doc` and create a list of matched spans.
* Get `(start character, end character, label)` __tuples__ of matched spans.
* Format each example as a tuple of the text and a dict, mapping `'entities'` to the entity tuples.
* Append the example to `TRAINING_DATA` and inspect the printed data.

In [None]:
import json
from spacy.matcher import Matcher
from spacy.lang.en import English

with open("exercises/iphone.json") as f:
    TEXTS = json.loads(f.read())

nlp = English()
matcher = Matcher(nlp.vocab)
pattern1 = [{"LOWER": "iphone"}, {"LOWER": "x"}]
pattern2 = [{"LOWER": "iphone"}, {"IS_DIGIT": True, "OP": "?"}]
matcher.add("GADGET", None, pattern1, pattern2)

TRAINING_DATA = []

# Create a Doc object for each text in TEXTS
for doc in nlp.pipe(TEXTS):
    # Match on the doc and create a list of matched spans
    spans = [doc[start:end] for match_id, start, end in matcher(doc)]
    # Get (start character, end character, label) tuples of matches
    entities = [(span.start_char, span.end_char, "GADGET") for span in spans]
    # Format the matches as a (doc.text, entities) tuple
    training_example = (doc.text, {"entities": entities})
    # Append the example to the training data
    TRAINING_DATA.append(training_example)

print(*TRAINING_DATA, sep="\n")

('How to preorder the iPhone X', {'entities': [(20, 28, 'GADGET'), (20, 26, 'GADGET')]})

('iPhone X is coming', {'entities': [(0, 8, 'GADGET'), (0, 6, 'GADGET')]})

('Should I pay $1,000 for the iPhone X?', {'entities': [(28, 36, 'GADGET'), (28, 34, 'GADGET')]})

('The iPhone 8 reviews are here', {'entities': [(4, 12, 'GADGET')]})

('Your iPhone goes up to 11 today', {'entities': [(5, 11, 'GADGET')]})

('I need a new phone! Any tips?', {'entities': []})
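
In the printed data, the "iPhone X" sentences carry two overlapping tuples, because both patterns match at the same position. Since each token can only be part of one entity, overlaps like these would need to be resolved before training. The following is a plain-Python sketch that keeps the longest span; the function name `filter_overlapping` is made up for this example. Recent spaCy versions also provide `spacy.util.filter_spans`, which applies a similar longest-first rule to `Span` objects.

```python
def filter_overlapping(entities):
    """Keep the longest entity among overlapping (start, end, label) tuples."""
    # Longest spans first; ties are broken by the earlier start offset
    ordered = sorted(entities, key=lambda e: (-(e[1] - e[0]), e[0]))
    kept = []
    for start, end, label in ordered:
        # Half-open ranges do not overlap if one ends before the other starts
        if all(end <= k_start or start >= k_end for k_start, k_end, _ in kept):
            kept.append((start, end, label))
    return sorted(kept)

example = ("How to preorder the iPhone X", {"entities": [(20, 28, "GADGET"), (20, 26, "GADGET")]})
text, annotations = example
cleaned = (text, {"entities": filter_overlapping(annotations["entities"])})
print(cleaned)
```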

## 05. The training loop

### The steps of a training loop
1. __Loop__ for a number of times.
2. __Shuffle__ the training data.
3. __Divide__ the data into batches.
4. __Update__ the model for each batch.
5. __Save__ the updated model.
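
The control flow of these five steps can be sketched without spaCy. In this sketch, `minibatch` is a plain-Python stand-in for `spacy.util.minibatch`, and `fake_update` is a hypothetical placeholder for `nlp.update` that only counts the examples it receives.

```python
import random

def minibatch(items, size=2):
    """Yield consecutive batches of at most `size` items (plain-Python stand-in)."""
    for i in range(0, len(items), size):
        yield items[i:i + size]

def fake_update(texts, annotations):
    """Hypothetical placeholder for nlp.update; returns how many examples it saw."""
    return len(texts)

TRAINING_DATA = [
    ("How to preorder the iPhone X", {"entities": [(20, 28, "GADGET")]}),
    ("iPhone X is coming", {"entities": [(0, 8, "GADGET")]}),
    ("The iPhone 8 reviews are here", {"entities": [(4, 12, "GADGET")]}),
    ("Your iPhone goes up to 11 today", {"entities": [(5, 11, "GADGET")]}),
    ("I need a new phone! Any tips?", {"entities": []}),
]

random.seed(0)
seen = 0
for epoch in range(10):                          # 1. loop for a number of times
    random.shuffle(TRAINING_DATA)                # 2. shuffle the training data
    for batch in minibatch(TRAINING_DATA):       # 3. divide the data into batches
        texts = [text for text, _ in batch]
        annotations = [ann for _, ann in batch]
        seen += fake_update(texts, annotations)  # 4. update the model per batch
# 5. saving would happen here (nlp.to_disk in spaCy)
print(seen)
```

Shuffling before batching means each iteration groups the examples differently, so the model does not see the same batches in the same order every time.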

### Recap: How training works
![](https://course.spacy.io/training.png)
* __Training data__: Examples and their annotations.
* __Text__: The input text the model should predict a label for.
* __Label__: The label the model should predict.
* __Gradient__: How to change the weights.

### Example loop

In [4]:
import random
import spacy
from spacy.lang.en import English
nlp = English()

TRAINING_DATA = [
    ("How to preorder the iPhone X", {'entities': [(20, 28, 'GADGET')]})
    # And many more examples...
]
# Loop for 10 iterations
for i in range(10):
    # Shuffle the training data
    random.shuffle(TRAINING_DATA)
    # Create batches and iterate over them
    for batch in spacy.util.minibatch(TRAINING_DATA):
        # Split the batch in texts and annotations
        texts = [text for text, annotation in batch]
        annotations = [annotation for text, annotation in batch]
        # Update the model
        nlp.update(texts, annotations)

# Save the model; path_to_model is a placeholder, so point it at your own output directory
path_to_model = "./gadget_model"
nlp.to_disk(path_to_model)

### Updating an existing model
* Improve the predictions on new data
* Especially useful to improve existing categories, like __`PERSON`__ or __`ORGANIZATION`__
* Also possible to add new categories
* Be careful and make sure the model doesn't "forget" the old ones

### Setting up a new pipeline from scratch

In [None]:
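import random
import spacy

# `examples` is assumed to be a list of (text, annotations) tuples, like TRAINING_DATA above
# Start with blank English model
nlp = spacy.blank('en')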
# Create blank entity recognizer and add it to the pipeline
ner = nlp.create_pipe('ner')
nlp.add_pipe(ner)
# Add a new label
ner.add_label('GADGET')

# Start the training
nlp.begin_training()
# Train for 10 iterations
for itn in range(10):
    random.shuffle(examples)
    # Divide examples into batches
    for batch in spacy.util.minibatch(examples, size=2):
        texts = [text for text, annotation in batch]
        annotations = [annotation for text, annotation in batch]
        # Update the model
        nlp.update(texts, annotations)

## 06. Setting up the pipeline

In this exercise, you’ll prepare a spaCy pipeline to train the entity recognizer to recognize `'GADGET'` entities in a text – for example, “iPhone X”.

* Create a blank `'en'` model, for example using the `spacy.blank` method.
* Create a new entity recognizer using `nlp.create_pipe` and add it to the pipeline.
* Add the new label `'GADGET'` to the entity recognizer using the `add_label` method on the pipeline component.

In [None]:
import spacy

# Create a blank 'en' model
nlp = spacy.blank("en")

# Create a new entity recognizer and add it to the pipeline
ner = nlp.create_pipe("ner")
nlp.add_pipe(ner)

# Add the label 'GADGET' to the entity recognizer
ner.add_label("GADGET")

## 07. Building a training loop
Let’s write a simple training loop from scratch!

The pipeline you’ve created in the previous exercise is available as the `nlp` object. It already contains the entity recognizer with the added label `'GADGET'`.

The small set of labelled examples that you’ve created previously is available as `TRAINING_DATA`. To see the examples, you can print them in your script.

* Call `nlp.begin_training`, create a training loop for 10 iterations and shuffle the training data.
* Create batches of training data using `spacy.util.minibatch` and iterate over the batches.
* Convert the `(text, annotations)` tuples in the batch to lists of `texts` and `annotations`.
* For each batch, use `nlp.update` to update the model with the texts and annotations.

In [None]:
import spacy
import random
import json

with open("exercises/gadgets.json") as f:
    TRAINING_DATA = json.loads(f.read())

nlp = spacy.blank("en")
ner = nlp.create_pipe("ner")
nlp.add_pipe(ner)
ner.add_label("GADGET")

# Start the training
nlp.begin_training()

# Loop for 10 iterations
for itn in range(10):
    # Shuffle the training data
    random.shuffle(TRAINING_DATA)
    losses = {}

    # Batch the examples and iterate over them
    for batch in spacy.util.minibatch(TRAINING_DATA, size=2):
        texts = [text for text, entities in batch]
        annotations = [entities for text, entities in batch]

        # Update the model
        nlp.update(texts, annotations, losses=losses)
        print(losses)

{'ner': 11.999999642372131}

{'ner': 21.978200435638428}

{'ner': 32.035117387771606}

{'ner': 9.415393471717834}

{'ner': 14.99043345451355}

{'ner': 19.80482855439186}

{'ner': 2.9741987846791744}

{'ner': 5.8216000609099865}

{'ner': 7.243585231248289}

{'ner': 2.2887884667434264}

{'ner': 10.279852092295187}

{'ner': 12.472829270933289}

{'ner': 1.8872964698821306}

{'ner': 4.788792780018412}

{'ner': 7.156845846562646}

{'ner': 1.9806696806917898}

{'ner': 2.8503573195825993}

{'ner': 4.763962420756343}

{'ner': 1.3438352504745126}

{'ner': 3.938637473517929}

{'ner': 4.092377472894006}

{'ner': 0.09106311114499022}

{'ner': 2.694920648989994}

{'ner': 2.7182347015319284}

{'ner': 0.0014720811257120658}

{'ner': 1.2311951145064413}

{'ner': 1.2317418446089548}

{'ner': 4.281543517836717e-06}

{'ner': 0.0003685692108844618}

{'ner': 2.4162862128263285}
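
The printed values are cumulative within each iteration: `losses` is reset at the start of every iteration and `nlp.update` adds each batch's loss to it, so the output comes in groups of three (one line per batch). A small sketch for reading per-iteration totals from the first nine printed values; the variable names are made up for this example.

```python
# First nine values printed above (three batches per iteration)
printed = [
    11.999999642372131, 21.978200435638428, 32.035117387771606,
    9.415393471717834, 14.99043345451355, 19.80482855439186,
    2.9741987846791744, 5.8216000609099865, 7.243585231248289,
]
batches_per_iteration = 3

# The last value of each group is that iteration's total loss
iteration_totals = printed[batches_per_iteration - 1::batches_per_iteration]
print(iteration_totals)

# Shrinking totals suggest the model is fitting the training data
is_decreasing = all(a > b for a, b in zip(iteration_totals, iteration_totals[1:]))
print(is_decreasing)
```

Over the full run the totals fall from about 32 to about 2.4, though not monotonically: the fourth iteration, for example, ends higher than the third.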

## 09. Best practices for training spaCy models
### Problem 1: Models can "forget" things
* Existing model can overfit on new data
    * e.g.: if you only update it with `WEBSITE`, it can "unlearn" what a `PERSON` is
* Also known as the "catastrophic forgetting" problem

### Solution 1: Mix in previously correct predictions
* For example, if you're training `WEBSITE`, also include examples of `PERSON`
* Run existing spaCy model over data and extract all other relevant entities

__BAD__:

`TRAINING_DATA = [
    ('Reddit is a website', {'entities': [(0, 6, 'WEBSITE')]})
]`

__GOOD__:

`TRAINING_DATA = [
    ('Reddit is a website', {'entities': [(0, 6, 'WEBSITE')]}),
    ('Obama is a person', {'entities': [(0, 5, 'PERSON')]})
]`
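
One way to mix in previous predictions is to merge the hand-labelled entities with whatever the existing model already finds, skipping predictions that overlap the new labels. The sketch below assumes the predictions are already available as `(start, end, label)` tuples; here they are written by hand and are hypothetical, not real model output, and `mix_in_predictions` is a made-up helper name.

```python
def mix_in_predictions(text, new_entities, predicted_entities):
    """Combine hand-labelled entities with model predictions that don't overlap them."""
    combined = list(new_entities)
    for start, end, label in predicted_entities:
        overlaps = any(start < n_end and n_start < end for n_start, n_end, _ in combined)
        if not overlaps:
            combined.append((start, end, label))
    return (text, {"entities": sorted(combined)})

# New annotation for the label being added
text = "Reddit founder Alexis Ohanian gave away two Metallica tickets to fans"
new_entities = [(0, 6, "WEBSITE")]
# Hypothetical output of an existing model, written by hand for this sketch
predicted_entities = [(0, 6, "ORG"), (15, 29, "PERSON")]

print(mix_in_predictions(text, new_entities, predicted_entities))
```

The hand-labelled `WEBSITE` span wins over the conflicting prediction, while the non-conflicting `PERSON` span is carried over so the model keeps seeing it.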

### Problem 2: Models can't learn everything
* spaCy's models make predictions based on __local context__
* Model can struggle to learn if decision is difficult to make based on context
* Label scheme needs to be consistent and not too specific
    * For example: `CLOTHING` is better than `ADULT_CLOTHING` and `CHILDRENS_CLOTHING`
    
### Solution 2: Plan your label scheme carefully
* Pick categories that are reflected in local context
* More generic is better than too specific
* Use rules to go from generic labels to specific categories

__BAD__:

`LABELS = ['ADULT_SHOES', 'CHILDRENS_SHOES', 'BANDS_I_LIKE']`

__GOOD__:

`LABELS = ['CLOTHING', 'BAND']`
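
Going from a generic label to a specific category with rules could look like the sketch below. The keyword set and the function `refine_clothing` are hypothetical; a real project would substitute its own vocabulary and logic.

```python
# Hypothetical keyword rules; a real project would use its own vocabulary
CHILDREN_HINTS = {"kids", "children", "toddler", "baby"}

def refine_clothing(entity_text, sentence):
    """Map a generic CLOTHING entity to a more specific category using simple rules."""
    words = {word.strip(".,!?").lower() for word in sentence.split()}
    if words & CHILDREN_HINTS:
        return (entity_text, "CHILDRENS_CLOTHING")
    return (entity_text, "ADULT_CLOTHING")

print(refine_clothing("sneakers", "These sneakers are perfect for kids!"))
print(refine_clothing("jacket", "I bought a new jacket yesterday."))
```

The statistical model only has to learn the generic `CLOTHING` label from local context, and the finer distinction is handled afterwards by explicit rules.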


## 10. Good data vs. bad data
Here’s an excerpt from a training set that labels the entity type TOURIST_DESTINATION in traveler reviews.

`TRAINING_DATA = [
    (
        "i went to amsterdem last year and the canals were beautiful",
        {"entities": [(10, 19, "TOURIST_DESTINATION")]},
    ),
    (
        "You should visit Paris once in your life, but the Eiffel Tower is kinda boring",
        {"entities": [(17, 22, "TOURIST_DESTINATION")]},
    ),
    ("There's also a Paris in Arkansas, lol", {"entities": []}),
    (
        "Berlin is perfect for summer holiday: lots of parks, great nightlife, cheap beer!",
        {"entities": [(0, 6, "TOURIST_DESTINATION")]},
    ),
]`

### Part 2
* Rewrite the `TRAINING_DATA` to only use the label `GPE` (cities, states, countries) instead of `TOURIST_DESTINATION`.
* Don’t forget to add tuples for the `GPE` entities that weren’t labeled in the old data.

### Answer

`TRAINING_DATA = [
    (
        "i went to amsterdem last year and the canals were beautiful",
        {"entities": [(10, 19, "GPE")]},
    ),
    (
        "You should visit Paris once in your life, but the Eiffel Tower is kinda boring",
        {"entities": [(17, 22, "GPE")]},
    ),
    (
        "There's also a Paris in Arkansas, lol",
        {"entities": [(15, 20, "GPE"), (24, 32, "GPE")]},
    ),
    (
        "Berlin is perfect for summer holiday: lots of parks, great nightlife, cheap beer!",
        {"entities": [(0, 6, "GPE")]},
    ),
]`
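
Renaming the label can be automated, but finding the entities that the old scheme left out cannot. A minimal sketch, using two examples from the data above; `relabel` is a made-up helper name.

```python
def relabel(training_data, old_label, new_label):
    """Return a copy of the data with every `old_label` entity renamed to `new_label`."""
    relabelled = []
    for text, annotations in training_data:
        entities = [
            (start, end, new_label if label == old_label else label)
            for start, end, label in annotations["entities"]
        ]
        relabelled.append((text, {"entities": entities}))
    return relabelled

OLD_DATA = [
    ("You should visit Paris once in your life, but the Eiffel Tower is kinda boring",
     {"entities": [(17, 22, "TOURIST_DESTINATION")]}),
    ("There's also a Paris in Arkansas, lol", {"entities": []}),
]
new_data = relabel(OLD_DATA, "TOURIST_DESTINATION", "GPE")
# Examples without entities still need a manual look: renaming cannot add missing spans
needs_review = [text for text, ann in new_data if not ann["entities"]]
print(new_data[0][1])
print(needs_review)
```

The second sentence is flagged for review because "Paris" and "Arkansas" were unlabelled in the old data and have to be annotated by hand.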


## 11. Training multiple labels
Here’s a small sample of a dataset created to train a new entity type `WEBSITE`. The original dataset contains a few thousand sentences. In this exercise, you’ll be doing the labeling by hand. In real life, you probably want to automate this and use an annotation tool – for example, `Brat`, a popular open-source solution, or `Prodigy`, an annotation tool from the makers of spaCy that integrates with it.

### Part 1
* Complete the entity offsets for the `WEBSITE` entities in the data. Feel free to use `len()` if you don’t want to count the characters.

`TRAINING_DATA = [
    (
        "Reddit partners with Patreon to help creators build communities",
        {"entities": [(____, ____, "WEBSITE"), (____, ____, "WEBSITE")]},
    ),
    ("PewDiePie smashes YouTube record", {"entities": [(____, ____, "WEBSITE")]}),
    (
        "Reddit founder Alexis Ohanian gave away two Metallica tickets to fans",
        {"entities": [(____, ____, "WEBSITE")]},
    ),
    # And so on...
]`
    
### Answer
`TRAINING_DATA = [
    (
        "Reddit partners with Patreon to help creators build communities",
        {"entities": [(0, 6, "WEBSITE"), (21, 28, "WEBSITE")]},
    ),
    ("PewDiePie smashes YouTube record", {"entities": [(18, 25, "WEBSITE")]}),
    (
        "Reddit founder Alexis Ohanian gave away two Metallica tickets to fans",
        {"entities": [(0, 6, "WEBSITE")]},
    ),
    # And so on...
]`
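
Instead of counting characters, the offsets in this answer can be derived with `str.index` and `len()`. A minimal sketch; `find_entity` is a made-up helper and only handles the first occurrence of a phrase.

```python
def find_entity(text, phrase, label):
    """Return a (start, end, label) tuple for the first occurrence of `phrase`."""
    start = text.index(phrase)  # raises ValueError if the phrase is absent
    return (start, start + len(phrase), label)

text = "Reddit partners with Patreon to help creators build communities"
entities = [
    find_entity(text, "Reddit", "WEBSITE"),
    find_entity(text, "Patreon", "WEBSITE"),
]
print(entities)
```

If a phrase occurs more than once in a text, the start position has to be chosen explicitly rather than taken from the first match.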

### Part 3
Update the training data to include annotations for the `PERSON` entities “PewDiePie” and “Alexis Ohanian”.

`TRAINING_DATA = [
    (
        "Reddit partners with Patreon to help creators build communities",
        {"entities": [(0, 6, "WEBSITE"), (21, 28, "WEBSITE")]},
    ),
    ("PewDiePie smashes YouTube record", {"entities": [____, (18, 25, "WEBSITE")]}),
    (
        "Reddit founder Alexis Ohanian gave away two Metallica tickets to fans",
        {"entities": [(0, 6, "WEBSITE"), ____]},
    ),
    # And so on...
]`
    
### Answer

`TRAINING_DATA = [
    (
        "Reddit partners with Patreon to help creators build communities",
        {"entities": [(0, 6, "WEBSITE"), (21, 28, "WEBSITE")]},
    ),
    (
        "PewDiePie smashes YouTube record",
        {"entities": [(0, 9, "PERSON"), (18, 25, "WEBSITE")]},
    ),
    (
        "Reddit founder Alexis Ohanian gave away two Metallica tickets to fans",
        {"entities": [(0, 6, "WEBSITE"), (15, 29, "PERSON")]},
    ),
    # And so on...
]`
