### Readme:

This notebook documents what I learnt from https://course.spacy.io/en/. It contains notes, sample code and sample problems from the course.

Special thanks to the content creators and the presenter, Ines.

This notebook is intended for self-study, not for redistributing the course content.
If you want to learn more about spaCy, please visit https://spacy.io/ or https://course.spacy.io/en/

Thank you!

### Chapter 4 : Training and Updating Models

* How does training work?
    * Initialize the model weights randomly with nlp.begin_training
    * Predict a few examples with the current weights by calling nlp.update
    * Compare the predictions with the true labels
    * Calculate how to change the weights to improve the predictions
    * Update the weights slightly
    * Go back to step 2 (predict) and repeat
    
* Notes
    * Gradient: how to change the weights
    * Updating an existing model takes a few hundred to a few thousand examples
    * Training a new category takes a few thousand to a million examples
    * Training is used to teach the model new labels, entity types or other classification schemes
    * spaCy’s components are supervised models for text annotations, meaning they can only learn to reproduce examples, not guess new labels from raw text. Training does not help with discovering patterns in unlabelled data
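
The training steps above can be illustrated with a toy example. This is a minimal sketch in plain Python, not how spaCy implements training: the one-weight model, the squared-error loss, and the names `EXAMPLES`, `train` and `learning_rate` are all made up for illustration.

```python
import random

# Toy data for a one-weight model (assumption): the true relationship is y = 3 * x
EXAMPLES = [(1.0, 3.0), (2.0, 6.0), (3.0, 9.0), (4.0, 12.0)]


def train(examples, n_iter=200, learning_rate=0.01, seed=0):
    rng = random.Random(seed)
    # Step 1: initialize the weight randomly
    weight = rng.uniform(-1.0, 1.0)
    for _ in range(n_iter):
        for x, y_true in examples:
            # Step 2: predict with the current weight
            y_pred = weight * x
            # Step 3: compare the prediction with the true label
            error = y_pred - y_true
            # Step 4: gradient of the squared error (error ** 2) w.r.t. the weight
            gradient = 2 * error * x
            # Step 5: update the weight slightly, against the gradient
            weight -= learning_rate * gradient
        # Step 6: go back to step 2 for the next pass over the data
    return weight


learned_weight = train(EXAMPLES)
print(round(learned_weight, 3))
```

With these settings the learned weight ends up very close to 3, the value used to generate the toy data. A larger learning rate would make the updates less "slight" and can make training unstable.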

In [None]:
import json
from spacy.matcher import Matcher
from spacy.lang.en import English

with open("exercises/en/iphone.json", encoding="utf8") as f:
    TEXTS = json.loads(f.read())

nlp = English()
matcher = Matcher(nlp.vocab)

# Two tokens whose lowercase forms match "iphone" and "x"
pattern1 = [{"LOWER": "iphone"}, {"LOWER": "x"}]

# Two tokens: one whose lowercase form matches "iphone", followed by a digit token
pattern2 = [{"LOWER": "iphone"}, {"IS_DIGIT": True}]

# Add patterns to the matcher and check the result
# (spaCy v2 signature; spaCy v3 expects matcher.add("GADGET", [pattern1, pattern2]))
matcher.add("GADGET", None, pattern1, pattern2)
for doc in nlp.pipe(TEXTS):
    print([doc[start:end] for match_id, start, end in matcher(doc)])

In [None]:
import json
from spacy.matcher import Matcher
from spacy.lang.en import English

with open("exercises/en/iphone.json", encoding="utf8") as f:
    TEXTS = json.loads(f.read())

nlp = English()
matcher = Matcher(nlp.vocab)
pattern1 = [{"LOWER": "iphone"}, {"LOWER": "x"}]
pattern2 = [{"LOWER": "iphone"}, {"IS_DIGIT": True}]
# spaCy v2 signature; spaCy v3 expects matcher.add("GADGET", [pattern1, pattern2])
matcher.add("GADGET", None, pattern1, pattern2)

TRAINING_DATA = []

# Create a Doc object for each text in TEXTS
for doc in nlp.pipe(TEXTS):
    # Match on the doc and create a list of matched spans
    spans = [doc[start:end] for match_id, start, end in matcher(doc)]
    # Get (start character, end character, label) tuples of matches
    entities = [(span.start_char, span.end_char, "GADGET") for span in spans]
    # Format the matches as a (doc.text, entities) tuple
    training_example = (doc.text, {"entities": entities})
    # Append the example to the training data
    TRAINING_DATA.append(training_example)

print(*TRAINING_DATA, sep="\n")

# ('How to preorder the iPhone X', {'entities': [(20, 28, 'GADGET')]})
# ('iPhone X is coming', {'entities': [(0, 8, 'GADGET')]})
# ('Should I pay $1,000 for the iPhone X?', {'entities': [(28, 36, 'GADGET')]})
# ('The iPhone 8 reviews are here', {'entities': [(4, 12, 'GADGET')]})
# ("iPhone 11 vs iPhone 8: What's the difference?", {'entities': [(0, 9, 'GADGET'), (13, 21, 'GADGET')]})
# ('I need a new phone! Any tips?', {'entities': []})
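
A quick way to sanity-check training data in this format is to slice each text with its character offsets and look at what comes out. The sketch below is plain Python and does not need spaCy; `SAMPLE_DATA` copies two of the examples printed above, and `entity_texts` is a hypothetical helper name, not part of the course.

```python
# Two examples copied from the printed TRAINING_DATA above
SAMPLE_DATA = [
    ("How to preorder the iPhone X", {"entities": [(20, 28, "GADGET")]}),
    ("The iPhone 8 reviews are here", {"entities": [(4, 12, "GADGET")]}),
]


def entity_texts(example):
    """Return the (substring, label) pair covered by each entity annotation."""
    text, annotations = example
    return [(text[start:end], label) for start, end, label in annotations["entities"]]


for example in SAMPLE_DATA:
    print(entity_texts(example))
```

If an offset is off by one, the sliced text shows it immediately (for example a leading space or a missing last character).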

* How does the training loop work?
    * loop for a number of iterations
    * shuffle the training data
    * divide the data into batches
    * update the model for each batch
    * save the updated model
* Best practices when training models
    * Models can forget things
        * mix in previously correct predictions
            * website vs. person: when training a website label, also include examples of persons
            * run the existing spaCy model over the data and extract all other relevant entities
    * Models can't learn everything
        * predictions are based on local context (the surrounding words)
            * label schemes need to be consistent and not too specific (clothing is better than adult clothing / children's clothing)
            * use rules to go from generic labels to specific categories
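
The training loop steps listed above (iterate, shuffle, batch, update) can be sketched without spaCy. This is only an illustration under assumptions: `run_training_loop`, `minibatch` and `fake_update` are made-up names, and `fake_update` merely records batch sizes. In real spaCy v2 code their roles would be played by `spacy.util.minibatch` and `nlp.update`, with `nlp.to_disk` saving the model after the loop.

```python
import random


def minibatch(items, size):
    """Yield consecutive batches of at most `size` items."""
    for i in range(0, len(items), size):
        yield items[i:i + size]


def run_training_loop(training_data, update_fn, n_iter=10, batch_size=2, seed=0):
    rng = random.Random(seed)
    data = list(training_data)  # copy so the caller's list is not reordered
    # Loop for a number of iterations
    for _ in range(n_iter):
        # Shuffle the training data
        rng.shuffle(data)
        # Divide the data into batches
        for batch in minibatch(data, batch_size):
            texts = [text for text, _ in batch]
            annotations = [annotation for _, annotation in batch]
            # Update the model for each batch (placeholder callback here)
            update_fn(texts, annotations)


# Stand-in for the model update: it only records the batch sizes it receives
seen_batch_sizes = []


def fake_update(texts, annotations):
    seen_batch_sizes.append(len(texts))


DATA = [
    ("iPhone X is coming", {"entities": [(0, 8, "GADGET")]}),
    ("The iPhone 8 reviews are here", {"entities": [(4, 12, "GADGET")]}),
    ("I need a new phone! Any tips?", {"entities": []}),
]

run_training_loop(DATA, fake_update, n_iter=2, batch_size=2)
print(seen_batch_sizes)  # [2, 1, 2, 1]
```

Shuffling on every iteration keeps the model from seeing the examples in the same order each time, and batching means each update is based on several examples rather than one.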

In [None]:
# example 1
TRAINING_DATA = [
    (
        "i went to amsterdem last year and the canals were beautiful",
        {"entities": [(10, 19, "TOURIST_DESTINATION")]},
    ),
    (
        "You should visit Paris once in your life, but the Eiffel Tower is kinda boring",
        {"entities": [(17, 22, "TOURIST_DESTINATION")]},
    ),
    ("There's also a Paris in Arkansas, lol", {"entities": []}),
    (
        "Berlin is perfect for summer holiday: lots of parks, great nightlife, cheap beer!",
        {"entities": [(0, 6, "TOURIST_DESTINATION")]},
    ),
]

# "TOURIST_DESTINATION" is a subjective label. It is better to use a generic label
# such as GPE (or location) and let a rule-based system decide whether an entity
# is a tourist destination.

# So all "TOURIST_DESTINATION" labels should be replaced with "GPE".
# In the third text, both Paris and Arkansas then get the "GPE" label:
#     "There's also a Paris in Arkansas, lol",
#     {"entities": [(15, 20, "GPE"), (24, 32, "GPE")]},
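
The relabelling described in the comments above can be done programmatically. The following is a small sketch in plain Python; `relabel` and `find_entity` are hypothetical helper names introduced only for this illustration, and `EXAMPLES` copies two texts from the training data above.

```python
def find_entity(text, phrase, label):
    """Return a (start, end, label) tuple for the first occurrence of phrase."""
    start = text.index(phrase)  # raises ValueError if the phrase is missing
    return (start, start + len(phrase), label)


def relabel(training_data, old_label, new_label):
    """Return a copy of the training data with old_label replaced by new_label."""
    result = []
    for text, annotations in training_data:
        entities = [
            (start, end, new_label if label == old_label else label)
            for start, end, label in annotations["entities"]
        ]
        result.append((text, {"entities": entities}))
    return result


EXAMPLES = [
    (
        "You should visit Paris once in your life, but the Eiffel Tower is kinda boring",
        {"entities": [(17, 22, "TOURIST_DESTINATION")]},
    ),
    ("There's also a Paris in Arkansas, lol", {"entities": []}),
]

relabelled = relabel(EXAMPLES, "TOURIST_DESTINATION", "GPE")

# With the generic label, both place names in the second text are annotated
text = relabelled[1][0]
relabelled[1] = (
    text,
    {"entities": [find_entity(text, "Paris", "GPE"), find_entity(text, "Arkansas", "GPE")]},
)

print(*relabelled, sep="\n")
```

Deciding which GPE entities are tourist destinations can then be left to a rule-based step applied after the model's predictions, as the notes suggest.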