# Why update the model?

* Better results on your specific domain
* Learn classification schemes specifically for your problem
* Essential for text classification
* Very useful for named entity recognition
* Less critical for part-of-speech tagging and dependency parsing

# How training works

1. **Initialize** the model weights randomly with `nlp.begin_training`
1. **Predict** a few examples with the current weights by calling `nlp.update`
1. **Compare** prediction with true labels
1. **Calculate** how to change weights to improve predictions
1. **Update** weights slightly
1. Go back to *2*

* *Training data:* Examples and their annotations.
* *Text:* The input text the model should predict a label for.
* *Label:* The label the model should predict.
* *Gradient:* How to change the weights.
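
The numbered steps above can be sketched with a toy model in plain Python. This is not how spaCy implements training: the one-weight model, the squared-error loss, and the names `examples`, `predict` and `learning_rate` are assumptions chosen only to make each step of the loop visible.

```python
import random

# Toy training data: (input, true label) pairs that follow label = 2 * input
examples = [(1.0, 2.0), (2.0, 4.0), (3.0, 6.0), (4.0, 8.0)]

random.seed(0)
# 1. Initialize the single model weight randomly
weight = random.uniform(-1.0, 1.0)
learning_rate = 0.01

def predict(x, w):
    """The whole 'model': multiply the input by the weight."""
    return x * w

for step in range(200):
    # 2. Predict a few examples with the current weight
    batch = random.sample(examples, 2)
    gradient = 0.0
    for x, label in batch:
        # 3. Compare the prediction with the true label
        error = predict(x, weight) - label
        # 4. Calculate how to change the weight (gradient of the squared error)
        gradient += 2 * error * x
    # 5. Update the weight slightly, moving against the gradient
    weight -= learning_rate * gradient / len(batch)
    # 6. Go back to step 2

print(round(weight, 2))
```

After enough updates the weight ends up close to 2.0, the value that reproduces every label in the toy data.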

# Example: Training the entity recognizer

* The entity recognizer tags words and phrases in context
* Each token can only be part of one entity
* Examples need to come with context
```("iPhone X is coming", {'entities': [(0, 8, 'GADGET')]})```
* Texts with no entities are also important
```("I need a new phone! Any tips?", {'entities': []})```
* **Goal:** teach the model to generalize
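
Because each token can only belong to one entity, it can help to sanity-check annotations before training. The sketch below is plain Python, not a spaCy API: `check_entities` is a hypothetical helper that slices the text with each character offset and rejects overlapping spans. It works on characters, so it is only an approximation of the token-level rule.

```python
def check_entities(text, annotations):
    """Return the annotated substrings, or raise if two entities overlap."""
    entities = sorted(annotations["entities"])
    previous_end = 0
    found = []
    for start, end, label in entities:
        # Entities must not share characters (and therefore tokens)
        if start < previous_end:
            raise ValueError(f"Overlapping entity at {start}-{end} in {text!r}")
        found.append((text[start:end], label))
        previous_end = end
    return found

examples = [
    ("iPhone X is coming", {"entities": [(0, 8, "GADGET")]}),
    ("I need a new phone! Any tips?", {"entities": []}),
]
for text, annotations in examples:
    print(check_entities(text, annotations))
```

The first example yields `[('iPhone X', 'GADGET')]` and the second an empty list, which is exactly what a text without entities should produce.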

# The training data

* Examples of what we want the model to predict in context
* Update an **existing model:** a few hundred to a few thousand examples
* Train a **new category:** a few thousand to a million examples
    * spaCy's English models: 2 million words
* Usually created manually by human annotators
* Can be semi-automated, for example using spaCy's `Matcher`!

# Creating training data

In [1]:
import spacy
spacy.require_gpu()
nlp = spacy.load('en_core_web_lg')
from spacy.matcher import Matcher

In [2]:
matcher = Matcher(nlp.vocab)
# Two tokens whose lowercase forms match 'iphone' and 'x'
pattern1 = [{'LOWER': 'iphone'}, {'LOWER': 'x'}]
# Token whose lowercase form matches 'iphone' and an optional digit
pattern2 = [{'LOWER': 'iphone'}, {'IS_DIGIT': True, 'OP': '?'}]
# Add patterns to the matcher
matcher.add('GADGET_RULES', [pattern1, pattern2], greedy='FIRST')

Let's use the match patterns we've created in the previous exercise to bootstrap a set of training examples. 

In [3]:
TEXTS = [
    "How to preorder the iPhone X",
    "iPhone X is coming",
    "Should I pay $1,000 for the iPhone X?",
    "The iPhone 8 reviews are here",
    "Your iPhone goes up to 11 today",
    "I need a new phone! Any tips?",
]

In [4]:
# Create a Doc object for each text in TEXTS
for doc in nlp.pipe(TEXTS):
    # Find the matches in the doc
    matches = matcher(doc)
    # Get a list of (start, end, label) tuples of matches in the text
    entities = [(start, end, 'GADGET') for match_id, start, end in matches]
    print(doc.text, entities) 

How to preorder the iPhone X [(4, 6, 'GADGET')]
iPhone X is coming [(0, 2, 'GADGET')]
Should I pay $1,000 for the iPhone X? [(7, 9, 'GADGET')]
The iPhone 8 reviews are here [(1, 3, 'GADGET')]
Your iPhone goes up to 11 today [(1, 2, 'GADGET')]
I need a new phone! Any tips? []


In [5]:
TRAINING_DATA = []

# Create a Doc object for each text in TEXTS
for doc in nlp.pipe(TEXTS):
    # Match on the doc and create a list of matched spans
    spans = [doc[start:end] for match_id, start, end in matcher(doc)]
    # Get (start character, end character, label) tuples of matches
    entities = [(span.start_char, span.end_char, 'GADGET') for span in spans]
    
    # Format the matches as a (doc.text, entities) tuple
    training_example = (doc.text, {'entities': entities})
    # Append the example to the training data
    TRAINING_DATA.append(training_example)

print('TRAINING DATA:')
print(*TRAINING_DATA, sep='\n')    

TRAINING DATA:
('How to preorder the iPhone X', {'entities': [(20, 28, 'GADGET')]})
('iPhone X is coming', {'entities': [(0, 8, 'GADGET')]})
('Should I pay $1,000 for the iPhone X?', {'entities': [(28, 36, 'GADGET')]})
('The iPhone 8 reviews are here', {'entities': [(4, 12, 'GADGET')]})
('Your iPhone goes up to 11 today', {'entities': [(5, 11, 'GADGET')]})
('I need a new phone! Any tips?', {'entities': []})


Before you train a model with the data, you always want to double-check that your matcher didn't identify any false positives. But that process is still much faster than doing everything manually.
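
One way to make that double-check quicker is to list the distinct strings that were labeled, so unexpected matches stand out. This is a minimal sketch in plain Python: `annotated_strings` is a hypothetical helper, and the data is a three-example excerpt of the output above.

```python
from collections import Counter

TRAINING_DATA = [
    ("How to preorder the iPhone X", {"entities": [(20, 28, "GADGET")]}),
    ("Your iPhone goes up to 11 today", {"entities": [(5, 11, "GADGET")]}),
    ("I need a new phone! Any tips?", {"entities": []}),
]

def annotated_strings(training_data):
    """Count each (entity text, label) pair found in the training data."""
    counts = Counter()
    for text, annotations in training_data:
        for start, end, label in annotations["entities"]:
            counts[(text[start:end], label)] += 1
    return counts

# Print one line per distinct annotated string, most frequent first
for (entity_text, label), count in annotated_strings(TRAINING_DATA).most_common():
    print(f"{count:>3}  {label:<8} {entity_text!r}")
```

Reviewing a short list of unique strings is usually faster than rereading every sentence, and a string that should not be a `GADGET` is easy to spot.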

# The training loop

## The steps of a training loop
1. **Loop** for a number of times.
    1. **Shuffle** the training data.
        * This is a very common strategy when doing stochastic gradient descent
    1. **Divide** the data into batches (**minibatching**)
        * This makes it easier to get a more accurate estimate of the gradient
    1. **Update** the model for each batch.
1. **Save** the updated model.
    * to a directory and use it in spaCy
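
The shape of this loop can be sketched without spaCy. In the sketch below, `minibatch` is a hypothetical stand-in for a batching helper such as `spacy.util.minibatch`, and `update` is a placeholder for the real model update; both are assumptions for illustration.

```python
import random

def minibatch(items, size):
    """Yield consecutive batches of at most `size` items."""
    for start in range(0, len(items), size):
        yield items[start:start + size]

def update(batch):
    """Placeholder for the real model update; it only reports the batch size."""
    return len(batch)

data = list(range(7))  # stand-in for seven training examples
random.seed(0)
for epoch in range(3):
    # Shuffle so every pass sees the examples in a different order
    random.shuffle(data)
    # Divide the data into batches and update once per batch
    sizes = [update(batch) for batch in minibatch(data, size=3)]
    print(epoch, sizes)
```

With seven examples and a batch size of three, every pass produces batches of sizes 3, 3 and 1; the last batch is simply smaller.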

## Recap: How training works

* **Training data:** Examples and their annotations.
* **Text:** The input text the model should predict a label for.
* **Label:** The label the model should predict.
* **Gradient:** How to change the weights.

## Example loop

In [6]:
import random
from spacy.training.example import Example

# Loop for 10 iterations
for i in range(10):
    # Shuffle the training data
    random.shuffle(TRAINING_DATA)
    losses = {}
    # Create batches and iterate over them
    for batch in spacy.util.minibatch(TRAINING_DATA, size=2):
        for text, annotations in batch:
            # create Example
            doc = nlp.make_doc(text)
            example = Example.from_dict(doc, annotations)
            # Update the model
            nlp.update([example], losses=losses)
            print(losses)
# Save the model
# nlp.to_disk('example_model_dir')

{'tok2vec': 0.0, 'tagger': 0.0, 'parser': 0.0, 'ner': 1.893784943974425}
{'tok2vec': 0.0, 'tagger': 0.0, 'parser': 0.0, 'ner': 5.880575732397553}
{'tok2vec': 0.0, 'tagger': 0.0, 'parser': 0.0, 'ner': 9.622654999275502}
{'tok2vec': 0.0, 'tagger': 0.0, 'parser': 0.0, 'ner': 9.62265507602581}
{'tok2vec': 0.0, 'tagger': 0.0, 'parser': 0.0, 'ner': 13.34633607948837}
{'tok2vec': 0.0, 'tagger': 0.0, 'parser': 0.0, 'ner': 14.806273358892062}
{'tok2vec': 0.0, 'tagger': 0.0, 'parser': 0.0, 'ner': 1.5741935175658563}
{'tok2vec': 0.0, 'tagger': 0.0, 'parser': 0.0, 'ner': 1.57419405182385}
{'tok2vec': 0.0, 'tagger': 0.0, 'parser': 0.0, 'ner': 2.812813435502919}
{'tok2vec': 0.0, 'tagger': 0.0, 'parser': 0.0, 'ner': 3.8874090640282146}
{'tok2vec': 0.0, 'tagger': 0.0, 'parser': 0.0, 'ner': 5.250660509876805}
{'tok2vec': 0.0, 'tagger': 0.0, 'parser': 0.0, 'ner': 6.324660833347172}
{'tok2vec': 0.0, 'tagger': 0.0, 'parser': 0.0, 'ner': 1.2402845573857508}
{'tok2vec': 0.0, 'tagger': 0.0, 'parser': 0.0, 'n

# Updating an existing model
* Improve the predictions on new data
* Especially useful to improve existing categories, like `PERSON`
* Also possible to add new categories
* Be careful and make sure the model doesn't "forget" the old ones

## Setting up a new pipeline from scratch

In [7]:
from spacy import Language
# Start with blank English model
nlp = spacy.blank('en')
print(nlp.meta)

{'lang': 'en', 'name': 'pipeline', 'version': '0.0.0', 'spacy_version': '>=3.4.3,<3.5.0', 'description': '', 'author': '', 'email': '', 'url': '', 'license': '', 'spacy_git_version': '63673a792', 'vectors': {'width': 0, 'vectors': 0, 'keys': 0, 'name': None, 'mode': 'default'}, 'labels': {}, 'pipeline': [], 'components': [], 'disabled': []}


In [8]:
# Create blank entity recognizer
ner = nlp.create_pipe('ner')
ner

<spacy.pipeline.ner.EntityRecognizer at 0x7fb0291b6e40>

In [9]:
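# Add the component to the pipeline and keep a reference to it.
# nlp.add_pipe('ner') creates and returns a new component instance, so labels
# must be added to this one, not to the standalone one from create_pipe above.
ner = nlp.add_pipe('ner')
nlp.pipeline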

[('ner', <spacy.pipeline.ner.EntityRecognizer at 0x7fb0291b6f20>)]

In [10]:
# Add a new label to the ner component
ner.add_label('GADGET')
ner.labels

('GADGET',)

In [11]:
# Start the training to initialize the model with random weights
nlp.begin_training()
# Train for 10 iterations
for itn in range(10):
    random.shuffle(TRAINING_DATA)
    losses={}
    # Divide examples into batches
    for batch in spacy.util.minibatch(TRAINING_DATA, size=2):
        # Update the model
        nlp.update([Example.from_dict(nlp.make_doc(text), annotations) for text, annotations in batch],
                   losses=losses)      
        print(losses)

{'ner': 13.333333969116211}
{'ner': 22.191051602363586}
{'ner': 33.834924817085266}
{'ner': 7.21665346622467}
{'ner': 17.289491593837738}
{'ner': 26.563148707151413}
{'ner': 7.375452399253845}
{'ner': 11.601762011647224}
{'ner': 14.12882811576128}
{'ner': 1.320203622803092}
{'ner': 3.078902713721618}
{'ner': 11.463523261016235}
{'ner': 9.077382935211062}
{'ner': 10.918071497697383}
{'ner': 14.191691261250526}
{'ner': 2.2280071023851633}
{'ner': 3.266769803361967}
{'ner': 5.278516272548586}
{'ner': 0.741884347567975}
{'ner': 2.0866386377383606}
{'ner': 2.326729200380214}
{'ner': 0.05624298445036402}
{'ner': 0.09393305278581465}
{'ner': 1.5297884328711007}
{'ner': 0.0011251797302520572}
{'ner': 1.525631058919771}
{'ner': 1.526033094552261}
{'ner': 0.0002986922327750108}
{'ner': 1.097896041893799}
{'ner': 1.0979152700196462}


> The numbers printed represent the loss on each iteration, the amount of work left for the optimizer.  
The lower the number, the better.  
In real life, you normally want to use a lot more data than this, ideally at least a few hundred or a few thousand examples.
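
In the output above, the values rise within each group of three and then drop, which suggests that `losses` is a running total that restarts whenever the dictionary is reset at the top of an iteration. Under that assumption, the sketch below recovers per-batch losses and per-iteration totals from the first six printed values (rounded to three decimals). The helper `per_batch_losses` is hypothetical.

```python
# First six values printed above (rounded), three batches per iteration
printed = [13.333, 22.191, 33.835, 7.217, 17.289, 26.563]
batches_per_iteration = 3

def per_batch_losses(values, group_size):
    """Undo the running sum inside each group of `group_size` printed values."""
    result = []
    for index, value in enumerate(values):
        # The first value in each group is already a single-batch loss
        previous = 0.0 if index % group_size == 0 else values[index - 1]
        result.append(round(value - previous, 3))
    return result

# The last value printed in each iteration is that iteration's total
iteration_totals = printed[batches_per_iteration - 1::batches_per_iteration]

print(per_batch_losses(printed, batches_per_iteration))
print(iteration_totals)
```

Comparing iteration totals, rather than individual printed lines, gives a fairer picture of whether the loss is going down.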

## Exploring the model

In [12]:
TEST_DATA = [
    "Apple is slowing down the iPhone 8 and iPhone X - how to stop it",
    "I finally understand what the iPhone X 'notch' is for",
    "Everything you need to know about the Samsung Galaxy S9",
    "Looking to compare iPad models? Here’s how the 2018 lineup stacks up",
    "The iPhone 8 and iPhone 8 Plus are smartphones designed, developed, and marketed by Apple",
    "what is the cheapest ipad, especially ipad pro???",
    "Samsung Galaxy is a series of mobile computing devices designed, manufactured and marketed by Samsung Electronics",
]

In [13]:
# Process each text in TEST_DATA
for doc in nlp.pipe(TEST_DATA):
    # Print the document text and entities
    print(doc.text)
    print([(ent.text, ent.label_) for ent in doc.ents], '\n\n')

Apple is slowing down the iPhone 8 and iPhone X - how to stop it
[('iPhone 8', 'GADGET'), ('iPhone X', 'GADGET')] 


I finally understand what the iPhone X 'notch' is for
[('iPhone X', 'GADGET')] 


Everything you need to know about the Samsung Galaxy S9
[] 


Looking to compare iPad models? Here’s how the 2018 lineup stacks up
[] 


The iPhone 8 and iPhone 8 Plus are smartphones designed, developed, and marketed by Apple
[('iPhone 8', 'GADGET'), ('iPhone 8', 'GADGET')] 


what is the cheapest ipad, especially ipad pro???
[] 


Samsung Galaxy is a series of mobile computing devices designed, manufactured and marketed by Samsung Electronics
[] 




# Training best practices

## Problem 1: Models can "forget" things

Statistical models can learn lots of things – but it doesn't mean that they won't unlearn them. If you're updating an existing model with new data, especially new labels, it can overfit and adjust *too much* to the new examples. For instance, if you're only updating it with examples of "website", it may "forget" other labels it previously predicted correctly – like "person". This is also known as the catastrophic forgetting problem. 

* Existing model can overfit on new data
* e.g.: if you only update it with `WEBSITE`, it can "unlearn" what a `PERSON` is
* Also known as "catastrophic forgetting" problem

## Solution 1: Mix in previously correct predictions

To prevent this, make sure to always mix in examples of what the model previously got correct. If you're training a new category "website", also include examples of "person". spaCy can help you with this. You can create those additional examples by running the existing model over data and extracting the entity spans you care about. You can then mix those examples in with your existing data and update the model with annotations of all labels. 

* **For example**, if you're training `WEBSITE`, also include examples of `PERSON`  
* Run existing spaCy model over data and extract all other relevant entities

**BAD:**

    TRAINING_DATA = [  
        ('Reddit is a website', {'entities': [(0, 6, 'WEBSITE')]})]

**GOOD:**

    TRAINING_DATA = [  
        ('Reddit is a website', {'entities': [(0, 6, 'WEBSITE')]}),  
        ('Obama is a person', {'entities': [(0, 5, 'PERSON')]})]
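
Mixing in earlier predictions can be sketched as merging two sets of spans. In the plain-Python sketch below, `mix_in_predictions` is a hypothetical helper and `fake_existing_model` is a stand-in for running an existing pipeline over the text. With spaCy, such a callable could be built from `doc.ents` using `ent.start_char`, `ent.end_char` and `ent.label_`.

```python
def mix_in_predictions(new_data, predict_entities):
    """Add non-overlapping entities from an existing model to new annotations."""
    mixed = []
    for text, annotations in new_data:
        entities = list(annotations["entities"])
        for start, end, label in predict_entities(text):
            # Keep a predicted span only if it does not overlap an existing span
            if all(end <= s or start >= e for s, e, _ in entities):
                entities.append((start, end, label))
        mixed.append((text, {"entities": sorted(entities)}))
    return mixed

def fake_existing_model(text):
    """Hypothetical stand-in for running a pretrained pipeline over `text`."""
    known = {"Alexis Ohanian": "PERSON", "Reddit": "ORG"}
    found = []
    for phrase, label in known.items():
        start = text.find(phrase)
        if start != -1:
            found.append((start, start + len(phrase), label))
    return found

new_data = [
    ("Reddit founder Alexis Ohanian gave away two Metallica tickets to fans",
     {"entities": [(0, 6, "WEBSITE")]}),
]
print(mix_in_predictions(new_data, fake_existing_model))
```

The hand-written `WEBSITE` span wins over the conflicting prediction for "Reddit", while the `PERSON` span for "Alexis Ohanian" is added. Predictions mixed in this way should still be reviewed, since the existing model can be wrong.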

## Problem 2: Models can't learn everything

Another common problem is that your model just won't learn what you want it to. spaCy's models make predictions based on the **local context** – for example, for named entities, the surrounding words are most important. If the decision is difficult to make based on the context, the model can struggle to learn it.

The label scheme also needs to be consistent and not too specific.

For example, it may be very difficult to teach a model to predict whether something is adult clothing or children's clothing based on the context. However, just predicting the label "clothing" may work better. 

* spaCy's models make predictions based on **local context**
* Model can struggle to learn if decision is difficult to make based on context
* Label scheme needs to be consistent and not too specific
    * For example: CLOTHING is better than ADULT_CLOTHING and CHILDRENS_CLOTHING

## Solution 2: Plan your label scheme carefully

Before you start training and updating models, it's worth taking a step back and planning your label scheme.

Try to pick **categories** that are reflected in the **local context** and make them more **generic** if possible.

You can always add a **rule-based system** later to go from **generic** to **specific**.

**Generic categories** like "clothing" or "band" are both easier to label and easier to learn.

* Pick categories that are reflected in **local context**
* More **generic is better than too specific**
* Use rules to go from **generic labels** to **specific categories**

**BAD:**

    LABELS = ['ADULT_SHOES', 'CHILDRENS_SHOES', 'BANDS_I_LIKE']
    
**GOOD:**

    LABELS = ['CLOTHING', 'BAND']
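
Going from a generic label to a specific category with rules might look like the sketch below. Everything here is an assumption for illustration: the keyword set `CHILD_HINTS`, the helper `refine_clothing`, and the choice to keep the generic label when no rule fires.

```python
# Hypothetical keyword list; a real project would curate this carefully
CHILD_HINTS = {"kids", "children", "toddler", "baby"}

def refine_clothing(text, entity, window=30):
    """Map a generic CLOTHING entity to a more specific category via keywords."""
    start, end, label = entity
    if label != "CLOTHING":
        return label
    # Look at a few characters of context on both sides of the entity
    context = text[max(0, start - window):end + window].lower()
    words = set(context.replace(",", " ").split())
    # Fall back to the generic label when no keyword is present
    return "CHILDRENS_CLOTHING" if words & CHILD_HINTS else label

text = "Looking for a warm jacket for kids this winter"
entity = (19, 25, "CLOTHING")
print(text[entity[0]:entity[1]], "->", refine_clothing(text, entity))
```

The statistical model only has to learn the easier `CLOTHING` decision, and the rule can be changed later without retraining.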

## Good data vs. bad data

Here's an excerpt from a training set that labels the entity type TOURIST_DESTINATION in traveler reviews.

TOURIST_DESTINATION

    ('i went to amsterdem last year and the canals were beautiful', {'entities': [(10, 19, 'TOURIST_DESTINATION')]})
    ('You should visit Paris once in your life, but the Eiffel Tower is kinda boring', {'entities': [(17, 22, 'TOURIST_DESTINATION')]})
    ("There's also a Paris in Arkansas, lol", {'entities': []})
    ('Berlin is perfect for summer holiday: lots of parks, great nightlife, cheap beer!', {'entities': [(0, 6, 'TOURIST_DESTINATION')]})

Why is this data and label scheme problematic?

    Whether a place is a tourist destination is a subjective judgement and not a definitive category. It will be very difficult for the entity recognizer to learn.
    
Rewrite the TRAINING_DATA to only use the label GPE (cities, states, countries) instead of TOURIST_DESTINATION.  
Don't forget to add tuples for the GPE entities that weren't labeled in the old data.


In [14]:
TRAINING_DATA = [
    ("i went to amsterdem last year and the canals were beautiful", {'entities': [(10, 19, 'GPE')]}),
    ("You should visit Paris once in your life, but the Eiffel Tower is kinda boring", {'entities': [(17, 22, 'GPE')]}),
    ("There's also a Paris in Arkansas, lol", {'entities': [(15, 20, 'GPE'), (24,32, 'GPE')]}),
    ("Berlin is perfect for summer holiday: lots of parks, great nightlife, cheap beer!", {'entities': [(0, 6, 'GPE')]})
]
     
print(*TRAINING_DATA, sep='\n')

('i went to amsterdem last year and the canals were beautiful', {'entities': [(10, 19, 'GPE')]})
('You should visit Paris once in your life, but the Eiffel Tower is kinda boring', {'entities': [(17, 22, 'GPE')]})
("There's also a Paris in Arkansas, lol", {'entities': [(15, 20, 'GPE'), (24, 32, 'GPE')]})
('Berlin is perfect for summer holiday: lots of parks, great nightlife, cheap beer!', {'entities': [(0, 6, 'GPE')]})


> Once the model achieves good results on detecting GPE entities in the traveler reviews, you could add a rule-based component to determine whether the entity is a tourist destination in this context.  
> For example, you could resolve the entity types back to a knowledge base or look them up in a travel wiki.
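
A lookup step of that kind could be as small as the sketch below. The knowledge base `TOURIST_DESTINATIONS`, the spelling table `ALIASES`, and the helper `is_tourist_destination` are all hypothetical; a real system would load them from an external source.

```python
# Hypothetical knowledge base; in practice this could come from a travel wiki
TOURIST_DESTINATIONS = {"amsterdam", "paris", "berlin"}
# Hypothetical spelling normalization for noisy review text
ALIASES = {"amsterdem": "amsterdam"}

def is_tourist_destination(text, entity):
    """Decide after entity recognition whether a GPE is in the knowledge base."""
    start, end, label = entity
    if label != "GPE":
        return False
    name = text[start:end].lower()
    return ALIASES.get(name, name) in TOURIST_DESTINATIONS

reviews = [
    ("i went to amsterdem last year and the canals were beautiful", (10, 19, "GPE")),
    ("There's also a Paris in Arkansas, lol", (24, 32, "GPE")),
]
for text, entity in reviews:
    print(text[entity[0]:entity[1]], is_tourist_destination(text, entity))
```

A plain name lookup like this would still accept the "Paris" in the Arkansas sentence, so deciding by context needs more than a set membership test.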

## Training multiple labels

Here's a small sample of a dataset created to train a new entity type WEBSITE. The original dataset contains a few thousand sentences. In this exercise, you'll be doing the labeling by hand. In real life, you probably want to automate this and use an annotation tool – for example, Brat (https://brat.nlplab.org/), a popular open-source solution, or Prodigy (https://prodi.gy/), our own annotation tool that integrates with spaCy.

After this exercise you will be nearly done with the course! If you enjoyed it, feel free to send Ines a thank you via Twitter (@_inesmontani); she'll appreciate it!

Complete the entity offsets for the WEBSITE entities in the data. Feel free to use len() if you don't want to count the characters.

In [None]:
TRAINING_DATA = [
    ("Reddit partners with Patreon to help creators build communities", 
     {'entities': [(0, 6, 'WEBSITE'), (21, 28, 'WEBSITE')]}),
  
    ("PewDiePie smashes YouTube record", 
     {'entities': [(18, 25, 'WEBSITE')]}),
  
    ("Reddit founder Alexis Ohanian gave away two Metallica tickets to fans", 
     {'entities': [(0, 6, 'WEBSITE')]}),
    # And so on...
]
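
If you would rather not count characters by hand, the offsets can be computed with `str.find` and `len()`. A minimal sketch follows; `find_offsets` is a hypothetical helper that only finds the first occurrence of a phrase.

```python
def find_offsets(text, phrase, label):
    """Return (start, end, label) for the first occurrence of `phrase`."""
    start = text.find(phrase)
    if start == -1:
        raise ValueError(f"{phrase!r} not found in {text!r}")
    # The end offset is exclusive, matching Python slicing
    return (start, start + len(phrase), label)

text = "Reddit partners with Patreon to help creators build communities"
entities = [
    find_offsets(text, "Reddit", "WEBSITE"),
    find_offsets(text, "Patreon", "WEBSITE"),
]
print(entities)
```

This prints `[(0, 6, 'WEBSITE'), (21, 28, 'WEBSITE')]`, the same offsets as in the cell above. A phrase that occurs more than once in a text would need extra handling.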

Update the training data to include annotations for the PERSON entities "PewDiePie" and "Alexis Ohanian".

In [None]:
TRAINING_DATA = [
    ("Reddit partners with Patreon to help creators build communities", 
     {'entities': [(0, 6, 'WEBSITE'), (21, 28, 'WEBSITE')]}),
  
    ("PewDiePie smashes YouTube record", 
     {'entities': [(0, 9, 'PERSON'), (18, 25, 'WEBSITE')]}),
  
    ("Reddit founder Alexis Ohanian gave away two Metallica tickets to fans", 
     {'entities': [(0, 6, 'WEBSITE'), (15, 29, 'PERSON')]}),
    # And so on...
]

# Your new spaCy skills

In the first chapter, you learned how to extract linguistic features like part-of-speech tags, syntactic dependencies and named entities, and how to work with pre-trained statistical models. You also learned to write powerful match patterns to extract words and phrases using spaCy's matcher and phrase matcher.

Chapter 2 was all about information extraction, and you learned how to work with the data structures, the Doc, Token and Span, as well as the vocab and lexical entries. You also used spaCy to predict semantic similarities using word vectors.

In chapter 3, you got some more insights into spaCy's pipeline, and learned to write your own custom pipeline components that modify the Doc. You also created your own custom extension attributes for Docs, Tokens and Spans, and learned about processing streams and making your pipeline faster.

Finally, in chapter 4, you learned about training and updating spaCy's statistical models, specifically the entity recognizer. You learned some useful tricks for how to create training data, and how to design your label scheme to get the best results. 

* Chapter 1
    * Extract linguistic features: part-of-speech tags, dependencies, named entities
    * Work with pre-trained statistical models
    * Find words and phrases using Matcher and PhraseMatcher match rules
* Chapter 2
    * Best practices for working with the data structures Doc, Token, Span, Vocab, Lexeme
    * Find semantic similarities using word vectors
* Chapter 3
    * Write custom pipeline components with extension attributes
    * Scale up your spaCy pipelines and make them fast
* Chapter 4
    * Create training data for spaCy's statistical models
    * Train and update spaCy's neural network models with new data

## More things to do with spaCy

Of course, there's a lot more that spaCy can do that we didn't get to cover in this course. While we focused mostly on training the entity recognizer, you can also train and update the other statistical pipeline components like the part-of-speech tagger and dependency parser. Another useful pipeline component is the text classifier, which can learn to predict labels that apply to the whole text. It's not part of the pre-trained models, but you can add it to an existing model and train it on your own data. 

* Training (https://spacy.io/usage/training) and updating other pipeline components
    * Part-of-speech tagger
    * Dependency parser
    * Text classifier
    
In this course, we basically accepted the default tokenization as it is. But you don't have to! spaCy lets you customize the rules used to determine where and how to split the text. You can also add and improve the support for other languages. While spaCy already supports tokenization for many different languages, there's still a lot of room for improvement. Supporting tokenization for a new language is the first step towards being able to train a statistical model. 

* Customizing the tokenizer (https://spacy.io/usage/linguistic-features#tokenization)
    * Adding rules and exceptions to split text differently
* Adding or improving support for other languages (https://spacy.io/usage/adding-languages)
    * 45+ languages currently
    * Lots of room for improvement and more languages
    * Allows training models for other languages