# Training/Updating models in spaCy

> **Training**: requires a few thousand to a million examples

> **Updating** an existing model: a few hundred to a few thousand examples

> Creating training data can be *semi-automated* using `Matcher`

1. Initialize the model weights randomly with `nlp.begin_training`
2. Predict a few examples with the current weights by calling `nlp.update`
3. Compare prediction with true labels
4. Calculate how to change weights to improve predictions
5. Update weights slightly
6. Go back to 2.
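
To make the loop concrete, here is a minimal sketch of steps 1 to 6 on a toy problem. It fits a two-parameter linear model in plain Python and is not how spaCy's models are implemented; the toy data, `learning_rate`, and the squared-error loss are assumptions chosen purely for illustration.

```python
import random

# Toy data for the target function y = 2x + 1 (an assumption for illustration)
examples = [(x, 2.0 * x + 1.0) for x in range(-5, 6)]

random.seed(0)

# 1. Initialize the weights randomly
w = random.uniform(-1, 1)
b = random.uniform(-1, 1)
learning_rate = 0.01


def mean_squared_error(w, b):
    """Average squared difference between predictions and true values."""
    return sum((w * x + b - y) ** 2 for x, y in examples) / len(examples)


initial_loss = mean_squared_error(w, b)

for iteration in range(200):
    random.shuffle(examples)
    for x, y_true in examples:
        y_pred = w * x + b           # 2. Predict with the current weights
        error = y_pred - y_true      # 3. Compare prediction with the true value
        grad_w = 2 * error * x       # 4. Calculate how to change the weights
        grad_b = 2 * error
        w -= learning_rate * grad_w  # 5. Update the weights slightly
        b -= learning_rate * grad_b
    # 6. Go back to step 2 for the next pass over the data

final_loss = mean_squared_error(w, b)
print(f"loss before: {initial_loss:.4f}, loss after: {final_loss:.8f}")
print(f"learned w={w:.3f}, b={b:.3f}")
```

In spaCy the prediction, comparison, and weight update all happen inside `nlp.update`; the sketch only spells those steps out separately.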

## Training an entity recognizer

- The entity recognizer tags words and phrases in context
- Each token can only be part of one entity
- Examples need to come with context
- Texts with no entities are also important (they help the model generalize)

In [None]:
# Example with a 'GADGET' entity: (start character, end character, label)
("iPhone X is coming", {'entities': [(0, 8, 'GADGET')]})

# Example without entities
("I need a new phone! Any tips?", {'entities': []})
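
The numbers in each entity tuple are character offsets into the text, with the end offset exclusive, so slicing the text with them should return the entity string. A quick sketch of that check in plain Python (`entity_texts` is a hypothetical helper, not part of spaCy):

```python
def entity_texts(example):
    """Return (entity string, label) pairs by slicing the text with the offsets."""
    text, annotations = example
    return [(text[start:end], label) for start, end, label in annotations["entities"]]


with_entity = ("iPhone X is coming", {"entities": [(0, 8, "GADGET")]})
without_entity = ("I need a new phone! Any tips?", {"entities": []})

print(entity_texts(with_entity))     # [('iPhone X', 'GADGET')]
print(entity_texts(without_entity))  # []
```

Running a check like this over hand-written examples is a cheap way to catch off-by-one offsets before training.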

## Create Training Data

Here is an example using spaCy’s rule-based `Matcher` to quickly create training data for named entity models.

Using `Matcher` to identify iPhone models:
- Write a pattern for two tokens whose lowercase forms match 'iphone' and 'x'.
- Write a pattern for a token whose lowercase form matches 'iphone', followed by an optional digit token using the '?' operator.

In [4]:
import requests

# Download the example texts (a JSON list of strings)
req = requests.get("https://raw.githubusercontent.com/ines/spacy-course/master/exercises/iphone.json")
req.raise_for_status()
TEXTS = req.json()
print(TEXTS)

['How to preorder the iPhone X', 'iPhone X is coming', 'Should I pay $1,000 for the iPhone X?', 'The iPhone 8 reviews are here', 'Your iPhone goes up to 11 today', 'I need a new phone! Any tips?']


In [2]:
from spacy.matcher import Matcher
from spacy.lang.en import English

nlp = English()
matcher = Matcher(nlp.vocab)

# Two tokens whose lowercase forms match 'iphone' and 'x'
pattern1 = [{"LOWER": "iphone"}, {"LOWER": "x"}]

# Token whose lowercase form matches 'iphone' and an optional digit
pattern2 = [{"LOWER": "iphone"}, {"IS_DIGIT": True, "OP": "?"}]

# Add patterns to the matcher (spaCy v2 signature; in spaCy v3 this would be
# matcher.add("GADGET", [pattern1, pattern2]))
matcher.add("GADGET", None, pattern1, pattern2)

In [3]:
for doc in nlp.pipe(TEXTS):
    matches = matcher(doc)
    print("Matches:", [doc[start:end].text for match_id, start, end in matches])

Matches: ['iPhone X', 'iPhone']
Matches: ['iPhone X', 'iPhone']
Matches: ['iPhone X', 'iPhone']
Matches: ['iPhone 8']
Matches: ['iPhone']
Matches: []
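
For the first three texts the matcher returns both 'iPhone X' and 'iPhone', which overlap. Because each token can only be part of one entity, overlapping matches have to be resolved before they are used as training data. One option is to tighten the pattern; another is to keep only the longest span. The sketch below shows the second option in plain Python on `(start, end)` token offsets. `keep_longest` is a hypothetical helper; recent spaCy versions also provide `spacy.util.filter_spans` for a similar purpose.

```python
def keep_longest(spans):
    """Return non-overlapping (start, end) token spans, preferring longer ones."""
    # Consider the longest spans first; ties go to the span that starts earlier
    ordered = sorted(spans, key=lambda span: (-(span[1] - span[0]), span[0]))
    kept = []
    taken = set()  # token indices already covered by a kept span
    for start, end in ordered:
        tokens = set(range(start, end))
        if tokens & taken:
            continue  # overlaps a span we already kept
        kept.append((start, end))
        taken |= tokens
    return sorted(kept)


# Token offsets of 'iPhone X' (0, 2) and 'iPhone' (0, 1) in "iPhone X is coming"
print(keep_longest([(0, 2), (0, 1)]))  # [(0, 2)]
```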


### Pipeline to bootstrap a set of training examples

- Create a doc object for each text using `nlp.pipe`.
- Match on the `doc` and create a list of matched spans.
- Get `(start character, end character, label)` tuples of matched spans.
- Format each example as a tuple of the text and a dict, mapping `'entities'` to the entity tuples.
- Append the example to `TRAINING_DATA` and inspect the printed data.

In [5]:
from spacy.matcher import Matcher
from spacy.lang.en import English

nlp = English()
matcher = Matcher(nlp.vocab)

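# Two tokens whose lowercase forms match 'iphone' and 'x'
pattern1 = [{"LOWER": "iphone"}, {"LOWER": "x"}]
# 'iphone' followed by one or more digit tokens. Note the '+' operator instead
# of '?': a bare 'iPhone' no longer matches, so it cannot overlap with 'iPhone X'
pattern2 = [{"LOWER": "iphone"}, {"IS_DIGIT": True, "OP": "+"}]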

matcher.add("GADGET", None, pattern1, pattern2)

TRAINING_DATA = []

# Create a Doc object for each text in TEXTS
for doc in nlp.pipe(TEXTS):
    # Match on the doc and create a list of matched spans
    spans = [doc[start:end] for match_id, start, end in matcher(doc)]
    # Get (start character, end character, label) tuples of matches
    entities = [(span.start_char, span.end_char, "GADGET") for span in spans]
    # Format the example as a (text, {"entities": [...]}) tuple
    training_example = (doc.text, {"entities": entities})
    # Append the example to the training data
    TRAINING_DATA.append(training_example)

print(*TRAINING_DATA, sep="\n")

('How to preorder the iPhone X', {'entities': [(20, 28, 'GADGET')]})
('iPhone X is coming', {'entities': [(0, 8, 'GADGET')]})
('Should I pay $1,000 for the iPhone X?', {'entities': [(28, 36, 'GADGET')]})
('The iPhone 8 reviews are here', {'entities': [(4, 12, 'GADGET')]})
('Your iPhone goes up to 11 today', {'entities': []})
('I need a new phone! Any tips?', {'entities': []})
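
Once the examples have been reviewed, they are typically shuffled and split into small batches on every pass of the training loop, matching the "predict a few examples" step described at the top. A minimal sketch of that in plain Python, assuming a batch size of 2 and a hypothetical `make_batches` helper (spaCy has its own batching utility, `spacy.util.minibatch`):

```python
import random

# The bootstrapped examples printed above
TRAINING_DATA = [
    ("How to preorder the iPhone X", {"entities": [(20, 28, "GADGET")]}),
    ("iPhone X is coming", {"entities": [(0, 8, "GADGET")]}),
    ("Should I pay $1,000 for the iPhone X?", {"entities": [(28, 36, "GADGET")]}),
    ("The iPhone 8 reviews are here", {"entities": [(4, 12, "GADGET")]}),
    ("Your iPhone goes up to 11 today", {"entities": []}),
    ("I need a new phone! Any tips?", {"entities": []}),
]


def make_batches(items, size):
    """Yield consecutive slices of `items` with at most `size` elements each."""
    for i in range(0, len(items), size):
        yield items[i:i + size]


random.seed(0)
for iteration in range(2):
    # Shuffle so the model does not see the examples in the same order each pass
    random.shuffle(TRAINING_DATA)
    for batch in make_batches(TRAINING_DATA, 2):
        texts = [text for text, _ in batch]
        annotations = [annotation for _, annotation in batch]
        # This is where the nlp.update call from step 2 would go
        print(iteration, len(texts), len(annotations))
```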
