# Chapter 4 (Training a Neural Network Model) 
These are my notes for the fourth chapter of the advanced NLP [course](https://course.spacy.io/en/) provided by spaCy. 

In [3]:
import spacy

This chapter contains:
- How to update spaCy's statistical models
- How to train your own model from scratch
- Other tips & tricks regarding model building

### 4.1: Training and Updating Models

Why should we train and update our own model?
- Better results on your specific domain, since the model will know your problem better
- Classification schemes specific to your use-case/problem
- Essential for text classification
- Useful for named entity recognition
- Less critical for part-of-speech tagging and dependency parsing. 

spaCy supports both training new models and updating existing ones.

If we are not starting with a trained pipeline, we first initialize the weights randomly. Then, spaCy calls `nlp.update`, which predicts a batch of examples with the current weights. It checks the predictions against the correct answers, and decides how to change the weights to achieve better results. We make a small correction to the current weights and move on to the next batch. This continues in a cycle until the model stops improving. 
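
The cycle described above can be sketched with spaCy v3's Python API. This is a minimal illustration, not how `spacy train` is implemented internally: the two toy texts, the `GADGET` label, and the fixed ten iterations are assumptions chosen for brevity. A real run would use far more data and stop based on evaluation scores.

```python
import random

import spacy
from spacy.training import Example

# Toy data (assumed for illustration): character offsets mark the entity span.
TRAIN_DATA = [
    ("iPhone X is coming!", {"entities": [(0, 8, "GADGET")]}),
    ("I need a new phone! Any tips?", {"entities": []}),
]

nlp = spacy.blank("en")
nlp.add_pipe("ner")

# Pair each unannotated Doc with its correct annotations.
examples = [
    Example.from_dict(nlp.make_doc(text), annotations)
    for text, annotations in TRAIN_DATA
]

# Initialize the weights randomly; the labels are inferred from the examples.
optimizer = nlp.initialize(lambda: examples)

for iteration in range(10):
    random.shuffle(examples)
    losses = {}
    # Predict, compare with the annotations, and make a small weight correction.
    nlp.update(examples, sgd=optimizer, losses=losses)
    print(iteration, losses)
```

The `losses` dictionary is keyed by component name, so watching `losses["ner"]` across iterations gives a rough sense of whether the model is still improving.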

Let's take a look at the entity recognizer. It takes a document and predicts phrases and their labels in context. This means that the training data must include texts, the entities they contain, and the entity labels. Entities can't overlap, so each token can only be part of one entity. The easiest way to train an entity recognizer is to show the model a text and entity spans. It's also important for the model to learn words that aren't entities. The goal is to teach the model to recognize new entities in similar contexts, even if they weren't in the training data. 

In [7]:
from spacy.tokens import Span

nlp = spacy.blank("en")

doc1 = nlp("iPhone X is coming!")
doc1.ents = [Span(doc1, 0, 2, label="GADGET")]

# Also show examples that don't have the entities
doc2 = nlp("I need a new phone! Any tips?")
doc2.ents = []

To update an existing model, a few hundred to a few thousand examples are usually enough to start with. Training a new category may require up to a million examples. For example, spaCy's trained English pipelines were trained on 2 million words labelled with part-of-speech tags, dependencies and named entities. Training data is usually created by humans who assign labels to texts; this work can be semi-automated with tools such as spaCy's `Matcher`.
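
As a rough sketch of what semi-automated annotation with the `Matcher` could look like: the two patterns and the example texts below are assumptions made up for illustration, and the resulting entities would still need a human review before being used as training data.

```python
import spacy
from spacy.matcher import Matcher
from spacy.tokens import Span

nlp = spacy.blank("en")
matcher = Matcher(nlp.vocab)

# Hypothetical patterns: "iphone" followed by "x" or by a number.
pattern1 = [{"LOWER": "iphone"}, {"LOWER": "x"}]
pattern2 = [{"LOWER": "iphone"}, {"IS_DIGIT": True}]
matcher.add("GADGET", [pattern1, pattern2])

texts = [
    "iPhone X is coming!",
    "I just bought an iPhone 11.",
    "I need a new phone! Any tips?",
]

docs = []
for doc in nlp.pipe(texts):
    # Turn every match into an entity span labelled with the match ID.
    doc.ents = [
        Span(doc, start, end, label=match_id)
        for match_id, start, end in matcher(doc)
    ]
    docs.append(doc)

for doc in docs:
    print([(ent.text, ent.label_) for ent in doc.ents])
```

Texts without a match end up with no entities, which is useful too, since the model also needs to see examples that contain none.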

In [9]:
import random

docs = [doc1, doc2]
# Shuffle, then split the data into training and development halves
random.shuffle(docs)
train_docs = docs[:len(docs) // 2]
dev_docs = docs[len(docs) // 2:]

You typically want to store the data as files on disk so that spaCy's training process can load them. `DocBin` is a container for efficiently storing and serializing `Doc` objects. You can instantiate it with a list of `Doc` objects and call its `to_disk` method. These files typically use the `.spacy` extension. Compared to other serialization tools like `pickle`, `DocBin` is faster and produces smaller files, since it stores the shared vocab only once.

In [12]:
from spacy.tokens import DocBin

train_docbin = DocBin(docs=train_docs)
train_docbin.to_disk("./train.spacy")

test_docbin = DocBin(docs=dev_docs)
test_docbin.to_disk("./test.spacy")
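
To check what was written, a `.spacy` file can be read back with `DocBin.from_disk` and `get_docs`. The sketch below is self-contained and writes to a temporary directory (an assumption made so that it does not touch the files above).

```python
import tempfile
from pathlib import Path

import spacy
from spacy.tokens import DocBin, Span

nlp = spacy.blank("en")
doc = nlp("iPhone X is coming!")
doc.ents = [Span(doc, 0, 2, label="GADGET")]

with tempfile.TemporaryDirectory() as tmp_dir:
    path = Path(tmp_dir) / "train.spacy"
    DocBin(docs=[doc]).to_disk(path)

    # Deserializing needs a vocab to rebuild the Doc objects.
    loaded_docs = list(DocBin().from_disk(path).get_docs(nlp.vocab))

restored = [(ent.text, ent.label_) for ent in loaded_docs[0].ents]
print(restored)
```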

If your data is already stored in a common format such as CoNLL or IOB, spaCy's `convert` command converts those files into spaCy's binary format. It also converts JSON files.

$ python -m spacy convert ./train.gold.conll ./corpus

### 4.5: Configuring and running the training

#### Configuring 
spaCy uses a config file (`config.cfg`) to determine settings such as how to initialize the `nlp` object, which components to add, how their internal implementations should be configured, the settings for the training process, how to load the data, and the hyperparameters. This means you only need to pass the config file when training the model. It also helps with versioning and reproducibility. An excerpt looks as follows:


```
[nlp]
lang = "en"
pipeline = ["tok2vec", "ner"]
batch_size = 1000

[nlp.tokenizer]
@tokenizers = "spacy.Tokenizer.v1"

[components]

[components.ner]
factory = "ner"

[components.ner.model]
@architectures = "spacy.TransitionBasedParser.v2"
hidden_width = 64
# And so on...
```

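The excerpt above is split into sections, and nested sections are denoted with a dot, e.g. `[components.ner]`. Values can reference registered Python functions using the `@` notation, as in `@architectures = "spacy.TransitionBasedParser.v2"`. You can use this to customize different parts of the `nlp` object and the training process.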
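
To see how the dotted section names map onto nested data, the excerpt can be parsed with the `Config` class from Thinc, the library spaCy uses for its config system. This is only a sketch: it parses the text without resolving the `@` references, and the variable names are made up for illustration.

```python
from thinc.api import Config

CONFIG_TEXT = """
[nlp]
lang = "en"
pipeline = ["tok2vec", "ner"]
batch_size = 1000

[nlp.tokenizer]
@tokenizers = "spacy.Tokenizer.v1"

[components]

[components.ner]
factory = "ner"

[components.ner.model]
@architectures = "spacy.TransitionBasedParser.v2"
hidden_width = 64
"""

config = Config().from_str(CONFIG_TEXT)

# Each dot in a section name becomes one level of nesting.
ner_model = config["components"]["ner"]["model"]
print(config["nlp"]["pipeline"])
print(ner_model["@architectures"], ner_model["hidden_width"])
```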

Config files don't have to be written by hand; spaCy can generate them with its built-in `init config` command. The first argument is the name of the output file, conventionally `config.cfg`. The `--lang` argument defines the language class, and `--pipeline` lets you specify one or more comma-separated pipeline components to include. There's also an interactive [quick-start widget](https://spacy.io/usage/training#quickstart) on the spaCy website. An example command:

`python -m spacy init config ./config.cfg --lang en --pipeline ner`

#### Training

To train a pipeline, all you need is the config file plus the training and testing data. The first argument of `spacy train` is the path to the config file. The `--output` argument specifies the directory in which to save the trained pipeline. You can also override individual config settings on the command line, for example `--paths.train ./train.spacy`.

After training is done, you can load the pipeline with `spacy.load`. `model-last` is the pipeline as it was at the end of training, while `model-best` is the one that scored best during training.
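
The same steps can also be driven from Python. The sketch below assumes a recent spaCy v3 release in which `spacy.cli.train.train` is importable, and that the config leaves `paths.train` and `paths.dev` to be filled in. The helper name `train_and_load` and the paths in the usage comment are hypothetical.

```python
from pathlib import Path

import spacy
from spacy.cli.train import train


def train_and_load(config_path, output_dir, train_path, dev_path):
    """Train a pipeline from a config file and return the best checkpoint."""
    train(
        config_path,
        output_dir,
        # Same effect as --paths.train / --paths.dev on the command line.
        overrides={"paths.train": train_path, "paths.dev": dev_path},
    )
    # spacy train writes model-best and model-last into the output directory.
    return spacy.load(Path(output_dir) / "model-best")


# Hypothetical usage with the files created earlier:
# nlp = train_and_load("./config.cfg", "./output", "./train.spacy", "./test.spacy")
```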

#### Deploying
To make the pipeline easy to deploy, you can package it. The `spacy package` command takes the path to the pipeline and generates a Python package. You can also provide an optional name and version.

### 4.10: Best Practices
#### Problem 1: Models can 'forget' things
When updated, an existing model can overfit on the new data and forget what it learned previously. For instance, if you only update it with examples of `WEBSITE`, it may "forget" other labels it previously predicted correctly, such as `PERSON`. This is known as the "catastrophic forgetting" problem.

One solution is to include previously correct predictions in the training data. spaCy can help here: run the existing model over your data and extract the entity spans you care about. You can then mix those examples in with your existing data and update the model with annotations for all labels.
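
A minimal sketch of that mixing step follows. To keep it runnable without downloading a model, a blank pipeline with an `entity_ruler` stands in for the existing trained pipeline (normally loaded with `spacy.load`). The helper `pseudo_annotate`, the texts, and the patterns are all assumptions for illustration.

```python
import spacy
from spacy.training import Example


def pseudo_annotate(nlp, texts, keep_labels):
    """Turn a pipeline's own predictions for some labels into training examples."""
    examples = []
    for doc in nlp.pipe(texts):
        entities = [
            (ent.start_char, ent.end_char, ent.label_)
            for ent in doc.ents
            if ent.label_ in keep_labels
        ]
        examples.append(
            Example.from_dict(nlp.make_doc(doc.text), {"entities": entities})
        )
    return examples


# Stand-in for an existing pipeline that already predicts PERSON.
existing_nlp = spacy.blank("en")
ruler = existing_nlp.add_pipe("entity_ruler")
ruler.add_patterns([{"label": "PERSON", "pattern": "Ada Lovelace"}])

old_examples = pseudo_annotate(
    existing_nlp, ["Ada Lovelace wrote the first program."], {"PERSON"}
)

# New, manually annotated data for the label being added.
new_data = [("Reddit is a website.", {"entities": [(0, 6, "WEBSITE")]})]
new_examples = [
    Example.from_dict(existing_nlp.make_doc(text), annotations)
    for text, annotations in new_data
]

# Updating on the mix reminds the model of labels it already handled well.
mixed_examples = old_examples + new_examples
```

Only predictions you trust should be recycled this way, since any mistakes the existing model makes would be reinforced.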

#### Problem 2: Models can't learn everything
spaCy's models make predictions based on local context - for example, for named entities, surrounding words are important. If the decision is difficult to make based on the context, the model will have a hard time. The label scheme also needs to be consistent and not too specific; `CLOTHING` might be a better label than `ADULT_CLOTHING` and `CHILDREN_CLOTHING`. 

As such, you should plan your label scheme carefully by:
- Picking categories reflected in local context
- Making them more generic
- Using rules to go from generic labels to specific categories
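
The last point could look roughly like the sketch below: the model predicts only the generic `CLOTHING` label, and a simple rule refines it afterwards. The cue words, the three-token window, and the hand-set entities (standing in for model predictions) are assumptions for illustration.

```python
import spacy
from spacy.tokens import Span

# Hypothetical cue words that suggest children's clothing.
CHILD_CUES = {"kids", "children", "toddler", "baby"}


def specific_label(span):
    """Refine a generic CLOTHING span using a few tokens of left context."""
    window = span.doc[max(span.start - 3, 0):span.end]
    if any(token.lower_ in CHILD_CUES for token in window):
        return "CHILDREN_CLOTHING"
    return "ADULT_CLOTHING"


nlp = spacy.blank("en")
doc = nlp("She bought a kids raincoat and a leather jacket.")

# Hand-set entities stand in for what a trained model would predict.
doc.ents = [Span(doc, 4, 5, label="CLOTHING"), Span(doc, 8, 9, label="CLOTHING")]

refined = [(ent.text, specific_label(ent)) for ent in doc.ents]
print(refined)
```

Keeping the statistical model generic and pushing the fine distinctions into rules means the rules can be changed later without retraining.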