### Training a neural network model

Why update the model?
1. Better results on your specific domain
2. Learn classification schemes specifically for your problem
3. Essential for text classification
4. Very useful for named entity recognition
5. Less critical for part-of-speech tagging and dependency parsing

*Statistical models make predictions based on the examples they were trained on.
You can usually make the model more accurate by showing it examples from your domain.*

#### How training works

1. Initialize the model weights randomly with nlp.begin_training
2. Predict a few examples with the current weights by calling nlp.update
3. Compare prediction with true labels
4. Calculate how to change weights to improve predictions
5. Update weights slightly
6. Go back to 2.
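
The six steps above can be sketched with a toy model. This is only an illustration of the predict, compare, update cycle, not spaCy's internals: the function `train_toy_model`, the linear model `y = w * x + b`, the learning rate and the data are all assumptions made up for this example.

```python
import random


def train_toy_model(examples, n_iter=200, learning_rate=0.05, seed=0):
    """Fit y = w * x + b by repeating the predict/compare/update cycle."""
    rng = random.Random(seed)
    # 1. Initialize the weights randomly
    w, b = rng.uniform(-1, 1), rng.uniform(-1, 1)
    for _ in range(n_iter):  # 6. Go back to step 2
        for x, y_true in examples:
            # 2. Predict with the current weights
            y_pred = w * x + b
            # 3. Compare the prediction with the true label
            error = y_pred - y_true
            # 4. Gradient of the squared error with respect to w and b
            grad_w, grad_b = 2 * error * x, 2 * error
            # 5. Update the weights slightly
            w -= learning_rate * grad_w
            b -= learning_rate * grad_b
    return w, b


# Made-up examples that follow y = 2x + 1
examples = [(0.0, 1.0), (1.0, 3.0), (2.0, 5.0), (3.0, 7.0)]
w, b = train_toy_model(examples)
print(round(w, 2), round(b, 2))  # close to 2.0 and 1.0
```

A small learning rate is what "update weights slightly" means in practice: each step only nudges the weights in the direction that reduces the error on the current example.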

The training data are the examples we want to update the model with.
The text should be a sentence, paragraph or longer document. For the best results, it should be similar to what the model will see at runtime. \
The label is what we want the model to predict. This can be a **text category, or an entity span and its type.**

#### Training the Entity Recognizer

The entity recognizer takes a document and predicts phrases and their labels. This means that the training data needs to include texts, the entities they contain, and the entity labels. \
The easiest way to do this is to show the model a text and a list of character offsets. For example, "iPhone X" is a gadget, starts at character 0 and ends at character 8. \
It's also very important to include texts that contain no entities, so the model learns which words aren't entities.
For such a text, the list of span annotations will be empty.
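
As a minimal sketch of this format, the snippet below builds one example with an entity and one without. The helper `make_example` is hypothetical and written in plain Python for illustration; it only annotates the first occurrence of each phrase.

```python
def make_example(text, phrases, label):
    """Build a (text, {"entities": [...]}) pair from phrases found in the text."""
    entities = []
    for phrase in phrases:
        start = text.find(phrase)  # character offset of the first occurrence
        if start == -1:
            continue  # phrase not in the text: nothing to annotate
        # The end offset is exclusive, like a Python slice
        entities.append((start, start + len(phrase), label))
    return (text, {"entities": entities})


with_entity = make_example("iPhone X is coming", ["iPhone X"], "GADGET")
without_entity = make_example("I need a new phone! Any tips?", ["iPhone X"], "GADGET")

print(with_entity)     # ('iPhone X is coming', {'entities': [(0, 8, 'GADGET')]})
print(without_entity)  # ('I need a new phone! Any tips?', {'entities': []})
```

Because the end offset is exclusive, `text[0:8]` returns exactly "iPhone X".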

The training data tells the model what we want it to predict. This could be texts and named entities we want to recognize, or tokens and their correct part-of-speech tags.

#### Creating Training Data

spaCy’s rule-based Matcher is a great way to quickly create training data for named entity models. A list of sentences is available as the variable TEXTS. You can print it in the IPython shell to inspect it. We want to find all mentions of different iPhone models, so we can create training data to teach a model to recognize them as 'GADGET'.

In [1]:
import json
from spacy.matcher import Matcher
from spacy.lang.en import English

with open("iphone.json", encoding="utf8") as f:
    TEXTS = json.load(f)

nlp = English()
matcher = Matcher(nlp.vocab)

# Two tokens whose lowercase forms match 'iphone' and 'x'
pattern1 = [{"LOWER": "iphone"}, {"LOWER": "x"}]

# Token whose lowercase form matches 'iphone' and an optional digit
pattern2 = [{"LOWER": "iphone"}, {"IS_DIGIT": True, "OP": "?"}]

# Add patterns to the matcher
matcher.add("GADGET", None, pattern1, pattern2)

In [6]:
TRAINING_DATA = []

for doc in nlp.pipe(TEXTS):
    print('document: '+ doc.text)
    
    # Match on the doc and create a list of matched spans
    spans = [doc[start:end] for match_id, start, end in matcher(doc)]

    # Get the (start character, end character, label) tuple for each match
    entities = [(span.start_char, span.end_char, "GADGET") for span in spans]
    
    print(entities)
    
    # Format the matches as a (text, annotations) tuple
    training_example = (doc.text, {"entities": entities})
    
    # Append the training example
    TRAINING_DATA.append(training_example)
    print()

document: How to preorder the iPhone X
[(20, 28, 'GADGET'), (20, 26, 'GADGET')]

document: iPhone X is coming
[(0, 8, 'GADGET'), (0, 6, 'GADGET')]

document: Should I pay $1,000 for the iPhone X?
[(28, 36, 'GADGET'), (28, 34, 'GADGET')]

document: The iPhone 8 reviews are here
[(4, 12, 'GADGET')]

document: Your iPhone goes up to 11 today
[(5, 11, 'GADGET')]

document: I need a new phone! Any tips?
[]



In [9]:
print(*TRAINING_DATA, sep="\n")

('How to preorder the iPhone X', {'entities': [(20, 28, 'GADGET'), (20, 26, 'GADGET')]})
('iPhone X is coming', {'entities': [(0, 8, 'GADGET'), (0, 6, 'GADGET')]})
('Should I pay $1,000 for the iPhone X?', {'entities': [(28, 36, 'GADGET'), (28, 34, 'GADGET')]})
('The iPhone 8 reviews are here', {'entities': [(4, 12, 'GADGET')]})
('Your iPhone goes up to 11 today', {'entities': [(5, 11, 'GADGET')]})
('I need a new phone! Any tips?', {'entities': []})
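
Note that the first three examples contain two overlapping spans for "iPhone X": both patterns match, because the optional digit in the second pattern lets it match "iPhone" on its own. Entity annotations are normally expected not to overlap, so it may be worth keeping only the longest span in each overlapping group. The sketch below does this on plain `(start, end, label)` tuples; `drop_overlapping` is a hypothetical helper, not part of spaCy (recent spaCy versions offer `spacy.util.filter_spans` for the same purpose on `Span` objects).

```python
def drop_overlapping(entities):
    """Keep the longest span wherever (start, end, label) tuples overlap."""
    # Consider the longest spans first; ties go to the earlier start
    by_length = sorted(entities, key=lambda e: (-(e[1] - e[0]), e[0]))
    kept = []
    for start, end, label in by_length:
        # Half-open spans are disjoint when one ends before the other starts
        if all(end <= k_start or start >= k_end for k_start, k_end, _ in kept):
            kept.append((start, end, label))
    return sorted(kept)


entities = [(20, 28, "GADGET"), (20, 26, "GADGET")]
print(drop_overlapping(entities))  # [(20, 28, 'GADGET')]
```

Applied to every example before it is appended to `TRAINING_DATA`, this would leave a single `(20, 28, 'GADGET')` annotation for "How to preorder the iPhone X".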


#### Setting up the Pipeline

In this exercise, you’ll prepare a spaCy pipeline to train the entity recognizer to recognize 'GADGET' entities in a text – for example, “iPhone X”.
1. Create a blank 'en' model, for example using the spacy.blank function.
2. Create a new entity recognizer using nlp.create_pipe and add it to the pipeline with nlp.add_pipe.
3. Add the new label 'GADGET' to the entity recognizer using the add_label method on the pipeline component.

In [14]:
import spacy
import random
import json

# Create a blank 'en' model
nlp = spacy.blank("en")

# Create a new entity recognizer and add it to the pipeline
ner = nlp.create_pipe("ner")
nlp.add_pipe(ner)

# Add the label 'GADGET' to the entity recognizer
ner.add_label("GADGET")

nlp.vocab.vectors.name = 'example'

# Start the training
nlp.begin_training()

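for itn in range(10):

    # Shuffle the training examples so each iteration sees a different order
    random.shuffle(TRAINING_DATA)

    # Reset the loss accumulator at the start of every iteration
    losses = {}

    # Split the examples into batches of two
    for batch in spacy.util.minibatch(TRAINING_DATA, size=2):
        texts = [text for text, entities in batch]
        annotations = [entities for text, entities in batch]

        # Update the model; the batch loss is added to losses["ner"]
        nlp.update(texts, annotations, losses=losses)
        # Print the running loss total for this iteration
        print("{0:.10f}".format(losses["ner"]))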

15.2000001669
22.8406054974
32.1119040251
7.3094803691
14.8997436762
19.4442685843
2.8595558442
4.2886928841
7.3126856391
2.4475634160
4.2746486550
4.8367750854
0.7174652632
1.5059775252
2.1358702074
0.0545349050
0.9280381885
0.9855622620
0.0009599526
0.0033613277
1.0005644007
0.7024979204
0.7025046145
0.7025407334
7.2759053467
7.2759540292
7.2759554727
0.0000247821
0.0000264150
2.5797555731
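
The thirty values above come in groups of three: six examples at a batch size of 2 give three batches per iteration, and because `losses` is reset once per iteration and updated after every batch, each group appears to be a running total rather than three independent batch losses. The sketch below reproduces that bookkeeping in plain Python. The helpers `minibatch` and `run_epoch` and the stand-in `batch_loss` function are assumptions for illustration, not spaCy's implementation.

```python
import random


def minibatch(items, size):
    """Yield consecutive chunks of at most `size` items."""
    for i in range(0, len(items), size):
        yield items[i:i + size]


def run_epoch(examples, batch_loss, size=2, seed=0):
    """Return the running loss total after each batch of one iteration."""
    examples = list(examples)
    random.Random(seed).shuffle(examples)
    losses = {"ner": 0.0}  # reset once per iteration
    running = []
    for batch in minibatch(examples, size):
        # Stand-in for nlp.update(..., losses=losses), which adds to the total
        losses["ner"] += batch_loss(batch)
        running.append(losses["ner"])
    return running


# Hypothetical per-batch loss: simply the number of examples in the batch
running = run_epoch(range(6), batch_loss=len)
print(running)  # [2.0, 4.0, 6.0]
```

To compare iterations, look at the last value of each group, which is the total loss for that iteration.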
