# <center> Training and updating models

Training and updating spaCy's neural network models, with a focus on named entity recognition (NER).
    

#### Why update the model?
    
- Better results on your specific domain
- Learn classification schemes specifically for your problem
- Essential for text classification
- Very useful for named entity recognition
- Less critical for part-of-speech tagging and dependency parsing
    
#### How training works
1. Initialize the model weights randomly with nlp.begin_training
2. Predict a few examples with the current weights by calling nlp.update
3. Compare prediction with true labels
4. Calculate how to change weights to improve predictions
5. Update weights slightly
6. Go back to 2.
    
    
<img src="https://d33wubrfki0l68.cloudfront.net/a634ac2555f216f30e47a08312745a85e552f4f1/b1d15/training-73950e71e6b59678754a87d6cf1481f9.svg" width="800" height="800">   
    
- Training data: Examples and their annotations.
- Text: The input text the model should predict a label for.
- Label: The label the model should predict.
- Gradient: How to change the weights.
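
The cycle described above can be tried out without spaCy. The following is a toy sketch, assuming a model with a single weight and a squared-error loss; it is not how spaCy implements `nlp.update`, and every name in it is made up for illustration.

```python
import random

def train_toy_model(examples, n_iter=200, learning_rate=0.05, seed=0):
    """Fit y = w * x with plain gradient descent (toy example)."""
    rng = random.Random(seed)
    # 1. Initialize the weight randomly
    w = rng.uniform(-1.0, 1.0)
    for _ in range(n_iter):
        for x, y_true in examples:
            # 2. Predict with the current weight
            y_pred = w * x
            # 3. Compare the prediction with the true label
            error = y_pred - y_true
            # 4. Gradient of the loss 0.5 * error**2 with respect to w
            gradient = error * x
            # 5. Update the weight slightly
            w -= learning_rate * gradient
        # 6. Go back to step 2 for the next pass over the examples
    return w

# Every example follows y = 2 * x, so the learned weight should end up close to 2
examples = [(1.0, 2.0), (2.0, 4.0), (3.0, 6.0)]
weight = train_toy_model(examples)
print(round(weight, 3))
```

The learning rate controls how "slightly" the weight moves in step 5: too large and the updates overshoot, too small and many more iterations are needed.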
    
    
#### Example: Training the entity recognizer
- The entity recognizer tags words and phrases in context
- Each token can only be part of one entity
- Examples need to come with context
    
` ("iPhone X is coming" , {'entities': [(0, 8, 'GADGET')]}) `
- Texts with no entities are also important
    
` ("I need a new phone! Any tips?", {'entities': []})`
- Goal: teach the model to generalize
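
The entity tuples in these examples are character offsets into the text, with an exclusive end, just like Python slicing. A quick way to sanity-check hand-written annotations is to slice the text with them. This is a minimal sketch; `check_entity_offsets` is a hypothetical helper, not part of spaCy.

```python
def check_entity_offsets(text, annotations):
    """Return the (substring, label) pair covered by each annotation."""
    covered = []
    for start, end, label in annotations["entities"]:
        # Offsets are character positions; the end offset is exclusive
        if not 0 <= start < end <= len(text):
            raise ValueError(f"Offsets ({start}, {end}) fall outside the text")
        covered.append((text[start:end], label))
    return covered

example = ("iPhone X is coming", {"entities": [(0, 8, "GADGET")]})
print(check_entity_offsets(*example))  # [('iPhone X', 'GADGET')]
```

If the printed substring is cut off or includes stray whitespace, the offsets in the annotation are probably off by one.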
       
#### The training data
- Examples of what we want the model to predict in context
- Update an existing model: a few hundred to a few thousand examples
- Train a new category: a few thousand to a million examples
    - spaCy's English models: 2 million words
- Usually created manually by human annotators
- Can be semi-automated, for example using spaCy's Matcher!

#### Creating training data example:
spaCy's rule-based Matcher is a great way to quickly create training data for named entity models

In [19]:
TEXT=['How to preorder the iPhone X',
 'iPhone X is coming',
 'Should I pay $1,000 for the iPhone X?',
 'The iPhone 8 reviews are here',
 'Your iPhone goes up to 11 today',
 'I need a new phone! Any tips?']

In [24]:
from spacy.matcher import Matcher
from spacy.lang.en import English

nlp=English()

# Initialize the Matcher and add the patterns
matcher = Matcher(nlp.vocab)

# Two tokens whose lowercase forms match 'iphone' and 'x'
pattern1 = [{'LOWER': 'iphone'}, {'LOWER': 'x'}]

# Token whose lowercase form matches 'iphone' and an optional digit
pattern2 = [{'LOWER': 'iphone'}, {'IS_DIGIT': True, 'OP': '?'}]

# Add patterns to the matcher
matcher.add('GADGET', None, pattern1, pattern2)

In [30]:
# Create a Doc object for each text in TEXT
for doc in nlp.pipe(TEXT):
    # Find the matches in the doc
    matches = matcher(doc)
    
    # Get a list of (start, end, label) tuples of matches in the text
    entities = [(start, end, 'GADGET') for index, start, end in matches]
    print(doc.text, entities) 

How to preorder the iPhone X [(4, 6, 'GADGET'), (4, 5, 'GADGET')]
iPhone X is coming [(0, 2, 'GADGET'), (0, 1, 'GADGET')]
Should I pay $1,000 for the iPhone X? [(7, 9, 'GADGET'), (7, 8, 'GADGET')]
The iPhone 8 reviews are here [(1, 2, 'GADGET'), (1, 3, 'GADGET')]
Your iPhone goes up to 11 today [(1, 2, 'GADGET')]
I need a new phone! Any tips? []


In [34]:
for x in [(1, 2, 'GADGET'), (1, 3, 'GADGET')]:
    print (x)

(1, 2, 'GADGET')
(1, 3, 'GADGET')


In [35]:
import spacy

TRAINING_DATA = []

# Create a Doc object for each text in TEXT
for doc in nlp.pipe(TEXT):
    # Match on the doc and create a list of matched spans
    spans = [doc[start:end] for match_id, start, end in matcher(doc)]
    
    # Filter out overlapping spans, preferring the longest match
    filtered = spacy.util.filter_spans(spans)
    
    # Get (start character, end character, label) tuples of matches
    entities = [(span.start_char, span.end_char, 'GADGET') for span in filtered]
    
    # Format the matches as a (doc.text, entities) tuple
    training_example = (doc.text, {'entities': entities})
    # Append the example to the training data
    TRAINING_DATA.append(training_example)
    
print(*TRAINING_DATA, sep='\n')    

('How to preorder the iPhone X', {'entities': [(20, 28, 'GADGET')]})
('iPhone X is coming', {'entities': [(0, 8, 'GADGET')]})
('Should I pay $1,000 for the iPhone X?', {'entities': [(28, 36, 'GADGET')]})
('The iPhone 8 reviews are here', {'entities': [(4, 12, 'GADGET')]})
('Your iPhone goes up to 11 today', {'entities': [(5, 11, 'GADGET')]})
('I need a new phone! Any tips?', {'entities': []})
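
The Matcher output above contains overlapping matches such as `(1, 2)` and `(1, 3)`, but each token may only belong to one entity, which is why the spans are filtered before being converted to character offsets. The idea behind the filtering can be sketched in plain Python on token index pairs. This is an illustrative approximation under the assumption that longer spans win and ties go to the earlier span; `filter_overlapping` is a made-up name, not the library function.

```python
def filter_overlapping(spans):
    """Keep the longest spans and drop any span that overlaps a kept one.

    `spans` is a list of (start, end) token index pairs, end exclusive.
    """
    # Longest spans first; ties go to the span that starts earlier
    ordered = sorted(spans, key=lambda s: (s[1] - s[0], -s[0]), reverse=True)
    kept = []
    seen_tokens = set()
    for start, end in ordered:
        tokens = set(range(start, end))
        if tokens.isdisjoint(seen_tokens):
            kept.append((start, end))
            seen_tokens.update(tokens)
    return sorted(kept)

# Matches for 'The iPhone 8 reviews are here' from the Matcher output above
print(filter_overlapping([(1, 2), (1, 3)]))  # [(1, 3)]
```

The surviving span covers tokens 1 and 2 ("iPhone 8"), which corresponds to the character offsets `(4, 12)` in the training data.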


# <center> The training loop

spaCy gives you full control over the training loop.
    

#### The steps of a training loop
- A series of steps that's performed to train or update a model.
    
1. Loop for a number of iterations, so the model can learn effectively.
2. Shuffle the training data, which prevents the model from getting stuck in a suboptimal solution.
3. Divide the data into batches. This is also called minibatching, and it makes it easier to get a more accurate estimate of the gradient.
4. Update the model for each batch.
5. Save the updated model.
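
The five steps above can be sketched as a library-free skeleton. It assumes a caller-supplied `update` function that stands in for the real model update; `minibatch` and `run_training_loop` here are illustrative helpers written for this example, not spaCy's own utilities.

```python
import random

def minibatch(items, size):
    """Yield consecutive batches of at most `size` items."""
    for i in range(0, len(items), size):
        yield items[i:i + size]

def run_training_loop(data, update, n_iter=10, batch_size=2, seed=0):
    """Call `update(batch)` for every batch in every iteration."""
    rng = random.Random(seed)
    data = list(data)  # copy so the caller's data is not reordered
    n_updates = 0
    for _ in range(n_iter):                        # 1. loop several times
        rng.shuffle(data)                          # 2. shuffle the data
        for batch in minibatch(data, batch_size):  # 3. divide into batches
            update(batch)                          # 4. update per batch
            n_updates += 1
    # 5. Saving the updated model would happen here
    return n_updates

seen = []
count = run_training_loop(range(6), seen.append, n_iter=10, batch_size=2)
print(count)  # 30
```

With 6 examples and a batch size of 2 there are 3 updates per iteration, so 10 iterations give 30 updates in total.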
        
<img src="https://d33wubrfki0l68.cloudfront.net/a634ac2555f216f30e47a08312745a85e552f4f1/b1d15/training-73950e71e6b59678754a87d6cf1481f9.svg" width="800" height="800">   

#### Updating an existing model
    
- Improve the predictions on new data
- Especially useful to improve existing categories, like PERSON
- Also possible to add new categories
- Be careful that the model doesn't "forget" the old categories: include examples of both the new and the old categories
    
#### Setting up a new pipeline from scratch example
Prepare a spaCy pipeline to train the entity recognizer to recognize 'GADGET' entities in a text.

In [37]:
import spacy
# Create a blank 'en' model
nlp = spacy.blank('en')

# Create a new entity recognizer and add it to the pipeline
ner = nlp.create_pipe('ner')
nlp.add_pipe(ner)

# Add the label 'GADGET' to the entity recognizer
ner.add_label('GADGET')
nlp.pipeline

[('ner', <spacy.pipeline.pipes.EntityRecognizer at 0x2a3e4e6d588>)]

#### Building a training loop

In [38]:
import random
# Start the training
nlp.begin_training()

# Loop for 10 iterations
for itn in range(10):
    # Shuffle the training data
    random.shuffle(TRAINING_DATA)
    losses = {}
    
    # Batch the examples and iterate over them
    for batch in spacy.util.minibatch(TRAINING_DATA, size=2):
        texts = [text for text, entities in batch]
        annotations = [entities for text, entities in batch]
        
        # Update the model
        nlp.update(texts, annotations, losses=losses)
        print(losses)

{'ner': 15.833333730697632}
{'ner': 23.832850575447083}
{'ner': 33.59307312965393}
{'ner': 5.8727701008319855}
{'ner': 14.006607383489609}
{'ner': 20.45204469561577}
{'ner': 2.446519672870636}
{'ner': 6.871846782974899}
{'ner': 10.04149281885475}
{'ner': 1.2218325913418084}
{'ner': 3.6921539160830434}
{'ner': 6.993714232550701}
{'ner': 3.369400496594608}
{'ner': 5.634716486092657}
{'ner': 9.70386698609218}
{'ner': 3.141834852285683}
{'ner': 4.4143831285400665}
{'ner': 7.627638988451508}
{'ner': 0.8304325730150595}
{'ner': 2.970588672587837}
{'ner': 4.8794374020963005}
{'ner': 0.8755696392472601}
{'ner': 0.9817464473417203}
{'ner': 3.6814402281124785}
{'ner': 1.5681217284000013}
{'ner': 2.165615670730613}
{'ner': 2.681548525042995}
{'ner': 0.08017386267886195}
{'ner': 0.08304749466278283}
{'ner': 2.4762319764478598}


#### Exploring the model
View how the model performs on unseen data

In [39]:
TEST_DATA=['Apple is slowing down the iPhone 8 and iPhone X - how to stop it',
 "I finally understand what the iPhone X 'notch' is for",
 'Everything you need to know about the Samsung Galaxy S9',
 'Looking to compare iPad models? Here’s how the 2018 lineup stacks up',
 'The iPhone 8 and iPhone 8 Plus are smartphones designed, developed, and marketed by Apple',
 'what is the cheapest ipad, especially ipad pro???',
 'Samsung Galaxy is a series of mobile computing devices designed, manufactured and marketed by Samsung Electronics']

In [40]:
# Process each text in TEST_DATA
for doc in nlp.pipe(TEST_DATA):
    # Print the document text and entities
    print(doc.text)
    print(doc.ents, '\n\n')

Apple is slowing down the iPhone 8 and iPhone X - how to stop it
(iPhone 8, iPhone X) 


I finally understand what the iPhone X 'notch' is for
(iPhone X,) 


Everything you need to know about the Samsung Galaxy S9
() 


Looking to compare iPad models? Here’s how the 2018 lineup stacks up
() 


The iPhone 8 and iPhone 8 Plus are smartphones designed, developed, and marketed by Apple
(iPhone 8, iPhone 8) 


what is the cheapest ipad, especially ipad pro???
() 


Samsung Galaxy is a series of mobile computing devices designed, manufactured and marketed by Samsung Electronics
() 




# <center> Best practices for training spaCy models


#### Problem 1: Models can "forget" things
- An existing model can overfit on new data
- e.g.: if you only update it with WEBSITE, it can "unlearn" what a PERSON is
- Also known as the "catastrophic forgetting" problem
    
#### Solution 1: Mix in previously correct predictions
- For example, if you're training WEBSITE, also include examples of PERSON
- Run existing spaCy model over data and extract all other relevant entities

BAD:
    
```
    TRAINING_DATA = [('Reddit is a website', {'entities': [(0, 6, 'WEBSITE')]})] 
    
```

GOOD:
    
```
    TRAINING_DATA = [('Reddit is a website', {'entities': [(0, 6, 'WEBSITE')]}),
    ('Obama is a person', {'entities': [(0, 5, 'PERSON')]})]
    
```
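
One way to build such mixed examples is to merge the new annotations with whatever the existing model already predicts for the same text. The sketch below assumes a hypothetical callable `predict_entities` that returns `(start, end, label)` tuples; with a loaded spaCy pipeline, these could plausibly be collected from `doc.ents` using `ent.start_char`, `ent.end_char` and `ent.label_`. Here a stand-in function plays that role.

```python
def merge_annotations(text, new_entities, predict_entities):
    """Combine new annotations with an existing model's predictions."""
    merged = list(new_entities)
    for start, end, label in predict_entities(text):
        # Skip predictions that overlap a new annotation: the new label wins
        overlaps = any(start < n_end and n_start < end
                       for n_start, n_end, _ in new_entities)
        if not overlaps:
            merged.append((start, end, label))
    return (text, {"entities": sorted(merged)})

def fake_model(text):
    """Stand-in for an existing model that already knows PERSON."""
    return [(17, 22, "PERSON")] if "Obama" in text else []

merged_example = merge_annotations(
    "Reddit is a site Obama uses", [(0, 6, "WEBSITE")], fake_model
)
print(merged_example)
```

The merged example carries both the new WEBSITE annotation and the PERSON entity the old model already got right, so updating on it reinforces both.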
    
#### Problem 2: Models can't learn everything
    
- spaCy's models make predictions based on local context
- Model can struggle to learn if decision is difficult to make based on context
- Label scheme needs to be consistent and not too specific
- For example: CLOTHING is better than ADULT_CLOTHING and CHILDRENS_CLOTHING  
    
    
#### Solution 2: Plan your label scheme carefully
- Pick categories that are reflected in local context
- More generic is better than too specific
- Use rules to go from generic labels to specific categories
    
BAD:
    
```   
    LABELS = ['ADULT_SHOES','CHILDRENS_SHOES','BANDS_I_LIKE']
```
    
GOOD:

```
    LABELS = ['CLOTHING','BAND']
```
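
Going from a generic label to a specific category with rules could look like the following. This is only a sketch: the keyword list and the `refine_label` helper are assumptions made for illustration, and a real project would need rules suited to its own data.

```python
# Hypothetical hint words suggesting children's clothing
CHILD_HINTS = {"kids", "children", "toddler", "baby"}

def refine_label(label, context):
    """Map a generic predicted label to a more specific one using rules."""
    if label != "CLOTHING":
        return label
    # Lowercase the words and strip surrounding punctuation before matching
    words = {w.strip(".,!?").lower() for w in context.split()}
    if words & CHILD_HINTS:
        return "CHILDRENS_CLOTHING"
    return "ADULT_CLOTHING"

print(refine_label("CLOTHING", "Looking for kids sneakers in size 30"))
# CHILDRENS_CLOTHING
```

The model only has to learn the generic CLOTHING category from local context, while the finer distinction is handled by rules that are easy to inspect and change.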

In [46]:
TRAINING_DATA = [
  ('i went to amsterdem last year and the canals were beautiful', {'entities': [(10, 19, 'TOURIST_DESTINATION')]}),
('You should visit Paris once in your life, but the Eiffel Tower is kinda boring', {'entities': [(17, 22, 'TOURIST_DESTINATION')]}),
("There's also a Paris in Arkansas, lol", {'entities': []}),
('Berlin is perfect for summer holiday: lots of parks, great nightlife, cheap beer!', {'entities': [(0, 6, 'TOURIST_DESTINATION')]}),
]
print('BAD LABELING')
print(*TRAINING_DATA, sep='\n')

BAD LABELING
('i went to amsterdem last year and the canals were beautiful', {'entities': [(10, 19, 'TOURIST_DESTINATION')]})
('You should visit Paris once in your life, but the Eiffel Tower is kinda boring', {'entities': [(17, 22, 'TOURIST_DESTINATION')]})
("There's also a Paris in Arkansas, lol", {'entities': []})
('Berlin is perfect for summer holiday: lots of parks, great nightlife, cheap beer!', {'entities': [(0, 6, 'TOURIST_DESTINATION')]})


In [48]:
TRAINING_DATA = [
    ("i went to amsterdem last year and the canals were beautiful", {'entities': [(10, 19, 'GPE')]}),
    ("You should visit Paris once in your life, but the Eiffel Tower is kinda boring", {'entities': [(17, 22, 'GPE')]}),
    ("There's also a Paris in Arkansas, lol", {'entities': [(15, 20, 'GPE'), (24, 32, 'GPE')]}),
    ("Berlin is perfect for summer holiday: lots of parks, great nightlife, cheap beer!", {'entities': [(0, 6, 'GPE')]})
]
print('GOOD LABELING')
print(*TRAINING_DATA, sep='\n')

GOOD LABELING
('i went to amsterdem last year and the canals were beautiful', {'entities': [(10, 19, 'GPE')]})
('You should visit Paris once in your life, but the Eiffel Tower is kinda boring', {'entities': [(17, 22, 'GPE')]})
("There's also a Paris in Arkansas, lol", {'entities': [(15, 20, 'GPE'), (24, 32, 'GPE')]})
('Berlin is perfect for summer holiday: lots of parks, great nightlife, cheap beer!', {'entities': [(0, 6, 'GPE')]})


#### Training multiple labels

In [None]:
##EXAMPLE OF TRAINING DATA
TRAINING_DATA = [
    ("Reddit partners with Patreon to help creators build communities", 
     {'entities': [(0, 6, 'WEBSITE'), (21, 28, 'WEBSITE')]}),
  
    ("PewDiePie smashes YouTube record", 
     {'entities': [(0,9,'PERSON'), (18, 25, 'WEBSITE')]}),
  
    ("Reddit founder Alexis Ohanian gave away two Metallica tickets to fans", 
     {'entities': [(0, 6, 'WEBSITE'), (15,29,'PERSON')]}),
    # And so on...
]
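
When training several labels at once, it can help to check how often each label occurs, so that no category is left with only a handful of examples. A small sketch using the three examples above; `label_counts` is a hypothetical helper name.

```python
from collections import Counter

EXAMPLES = [
    ("Reddit partners with Patreon to help creators build communities",
     {"entities": [(0, 6, "WEBSITE"), (21, 28, "WEBSITE")]}),
    ("PewDiePie smashes YouTube record",
     {"entities": [(0, 9, "PERSON"), (18, 25, "WEBSITE")]}),
    ("Reddit founder Alexis Ohanian gave away two Metallica tickets to fans",
     {"entities": [(0, 6, "WEBSITE"), (15, 29, "PERSON")]}),
]

def label_counts(training_data):
    """Count how many annotated entities each label has."""
    counts = Counter()
    for _text, annotations in training_data:
        counts.update(label for _start, _end, label in annotations["entities"])
    return counts

print(label_counts(EXAMPLES))  # Counter({'WEBSITE': 4, 'PERSON': 2})
```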

# <center> Wrapping up


#### What we learned with spaCy
- Extract linguistic features: part-of-speech tags, dependencies, named entities
- Work with pre-trained statistical models
- Find words and phrases using Matcher and PhraseMatcher match rules
- Best practices for working with the data structures Doc, Token, Span, Vocab, Lexeme
- Find semantic similarities using word vectors
- Write custom pipeline components with extension attributes
- Scale up your spaCy pipelines and make them fast
- Create training data for spaCy's statistical models
- Train and update spaCy's neural network models with new data

#### More things to do with spaCy
- Training and updating other pipeline components (https://spacy.io/usage/training)
    - Part-of-speech tagger
    - Dependency parser
    - Text classifier
    
- Customizing the tokenizer (https://spacy.io/usage/linguistic-features#tokenization)
    - Adding rules and exceptions to split text differently
    
- Adding or improving support for other languages (https://spacy.io/usage/adding-languages)
    - 45+ languages currently
    - Lots of room for improvement and more languages
    - Allows training models for other languages
    
    
spaCy documentation: https://spacy.io/