# Chapter 4: Training a neural network model

# 1. Training and updating models

Training the entity recognizer
- The entity recognizer tags words and phrases in context
- Each token can only be part of one entity
- Examples need to come with context

In [1]:
import spacy 
from spacy.tokens import Span

nlp = spacy.load("en_core_web_sm")

doc = nlp("iPhone X is coming")
doc.ents = [Span(doc, 0, 2, label="GADGET")]
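
Because each token can be part of only one entity, assigning overlapping spans to `doc.ents` is rejected with an error. The following is a minimal sketch, assuming a blank English pipeline and a deliberately overlapping pair of spans made up for illustration, of resolving the overlap with `spacy.util.filter_spans`, which keeps the longest span:

```python
import spacy
from spacy.tokens import Span
from spacy.util import filter_spans

nlp = spacy.blank("en")
doc = nlp("iPhone X is coming")

# Two candidate spans that share the token "X" (illustrative only)
candidates = [
    Span(doc, 0, 2, label="GADGET"),  # "iPhone X"
    Span(doc, 1, 2, label="GADGET"),  # "X"
]

try:
    doc.ents = candidates
except ValueError as err:
    print("Overlapping spans were rejected:", type(err).__name__)

# filter_spans prefers the longest span when candidates overlap
doc.ents = filter_spans(candidates)
print([(ent.text, ent.label_) for ent in doc.ents])
```

Filtering like this is one possible policy; depending on the data, reviewing the conflicting annotations by hand may be more appropriate.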

Texts with no entities are also important

In [2]:
doc = nlp("I need a new phone! Any tips?")
doc.ents = []

Goal: teach the model to generalize

# 2. Training and evaluation data

Generating a training corpus

In [3]:
import spacy
from spacy.tokens import Span

nlp = spacy.blank("en")

# Create a Doc with entity spans
doc1 = nlp("iPhone X is coming")
doc1.ents = [Span(doc1, 0, 2, label="GADGET")]
# Create another doc without entity spans
doc2 = nlp("I need a new phone! Any tips?")

docs = [doc1, doc2]  # and so on...

Splitting data

In [4]:
import random 
random.shuffle(docs)
train_docs = docs[:len(docs) // 2]
dev_docs = docs[len(docs) // 2:]
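
The cell above splits the data 50/50 and shuffles in place. As a sketch of a slightly more flexible variant, the helper below takes a configurable evaluation fraction and a seed so the split can be reproduced. The name `split_docs` and its `dev_fraction` and `seed` parameters are hypothetical, and plain strings stand in for `Doc` objects:

```python
import random


def split_docs(docs, dev_fraction=0.2, seed=0):
    """Shuffle a copy of docs and split it into (train, dev) lists."""
    shuffled = list(docs)  # copy, so the caller's list keeps its order
    random.Random(seed).shuffle(shuffled)  # seeded for reproducibility
    n_dev = int(len(shuffled) * dev_fraction)
    return shuffled[n_dev:], shuffled[:n_dev]


# Placeholder items standing in for Doc objects
items = [f"doc{i}" for i in range(10)]
train_items, dev_items = split_docs(items, dev_fraction=0.2, seed=42)
print(len(train_items), len(dev_items))
```

Using a dedicated `random.Random(seed)` instance avoids changing the global random state, which may matter if other code relies on it.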

- DocBin: container to efficiently store and save Doc objects
- can be saved to a binary file
- binary files are used for training

In [None]:
from spacy.tokens import DocBin

# Create and save a collection of training docs
train_docbin = DocBin(docs=train_docs)
train_docbin.to_disk("./train.spacy")
# Create and save a collection of evaluation docs
dev_docbin = DocBin(docs=dev_docs)
dev_docbin.to_disk("./dev.spacy")
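
To check that a saved file contains what you expect, a `DocBin` can be read back with `from_disk` and turned into `Doc` objects again with `get_docs`. This is a small round-trip sketch, assuming a blank English pipeline and a temporary directory instead of the `./train.spacy` path used above:

```python
import tempfile
from pathlib import Path

import spacy
from spacy.tokens import DocBin, Span

nlp = spacy.blank("en")
doc = nlp("iPhone X is coming")
doc.ents = [Span(doc, 0, 2, label="GADGET")]

with tempfile.TemporaryDirectory() as tmp_dir:
    path = Path(tmp_dir) / "train.spacy"
    DocBin(docs=[doc]).to_disk(path)
    # Load the file back and rebuild the Doc objects against a vocab
    loaded = DocBin().from_disk(path)
    restored = list(loaded.get_docs(nlp.vocab))

print([(ent.text, ent.label_) for ent in restored[0].ents])
```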

# 3. Creating training data

spaCy’s rule-based Matcher is a great way to quickly create training data for named entity models. A list of sentences is available as the variable TEXTS. You can print it to inspect it. We want to find all mentions of different iPhone models, so we can create training data to teach a model to recognize them as "GADGET".

- Write a pattern for two tokens whose lowercase forms match "iphone" and "x".
- Write a pattern for two tokens: one whose lowercase form matches "iphone", followed by a digit.

In [9]:
import json
import spacy
from spacy.matcher import Matcher
from spacy.tokens import Span

with open("iphone.json", encoding="utf8") as f:
    TEXTS = json.load(f)

nlp = spacy.blank("en")
matcher = Matcher(nlp.vocab)

# Two tokens whose lowercase forms match "iphone" and "x"
pattern1 = [{"LOWER": "iphone"}, {"LOWER": "x"}]

# Token whose lowercase form matches "iphone" and a digit
pattern2 = [{"LOWER": "iphone"}, {"IS_DIGIT": True}]

# Add patterns to the matcher and create docs with matched entities
matcher.add("GADGET", [pattern1, pattern2])
docs = []
for doc in nlp.pipe(TEXTS):
    matches = matcher(doc)
    spans = [Span(doc, start, end, label=match_id) for match_id, start, end in matches]
    print(spans)
    doc.ents = spans
    docs.append(doc)

[iPhone X]
[iPhone X]
[iPhone X]
[iPhone 8]
[iPhone 11, iPhone 8]
[]
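
The `match_id` returned by the matcher is an integer hash of the pattern name, which is why it can be passed directly as the span label. The sketch below looks the hash up in `nlp.vocab.strings` to recover the string `"GADGET"`. The two sample sentences are made-up stand-ins for the contents of `TEXTS`:

```python
import spacy
from spacy.matcher import Matcher

nlp = spacy.blank("en")
matcher = Matcher(nlp.vocab)
matcher.add(
    "GADGET",
    [
        [{"LOWER": "iphone"}, {"LOWER": "x"}],
        [{"LOWER": "iphone"}, {"IS_DIGIT": True}],
    ],
)

# Hypothetical stand-ins for the sentences in TEXTS
sample_texts = ["How to preorder the iPhone X", "iPhone 8 reviews are here"]

labelled = []
for doc in nlp.pipe(sample_texts):
    for match_id, start, end in matcher(doc):
        # match_id is an integer hash; the StringStore maps it back to "GADGET"
        label = nlp.vocab.strings[match_id]
        labelled.append((doc[start:end].text, label))

print(labelled)
```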


After creating the data for our corpus, we need to save it out to a .spacy file. The code from the previous example is already available.

- Instantiate the DocBin with the list of docs.
- Save the DocBin to a file called train.spacy.

In [10]:
from spacy.tokens import DocBin
doc_bin = DocBin(docs=docs)
doc_bin.to_disk("train.spacy")

# 5. Configuring and running the training

The training config
- single source of truth for all settings
- typically called config.cfg
- defines how to initialize the nlp object
- includes all settings about the pipeline components and their model implementations
- configures the training process and hyperparameters
- makes your training more reproducible

In [None]:
[nlp]
lang = "en"
pipeline = ["tok2vec", "ner"]
batch_size = 1000

[nlp.tokenizer]
@tokenizers = "spacy.Tokenizer.v1"

[components]

[components.ner]
factory = "ner"

[components.ner.model]
@architectures = "spacy.TransitionBasedParser.v2"
hidden_width = 64
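
A config is plain text, so it can be inspected programmatically. The sketch below parses part of the excerpt above with `Config` from `thinc`, the library spaCy builds its config system on. It only parses the text into nested dictionaries and does not build a pipeline; the variable name `CONFIG_EXCERPT` is an assumption made for this example:

```python
from thinc.api import Config

# Part of the config excerpt shown above, as a string
CONFIG_EXCERPT = """
[nlp]
lang = "en"
pipeline = ["tok2vec", "ner"]
batch_size = 1000

[components]

[components.ner]
factory = "ner"

[components.ner.model]
@architectures = "spacy.TransitionBasedParser.v2"
hidden_width = 64
"""

config = Config().from_str(CONFIG_EXCERPT)

# Dotted section names become nested dictionaries; values are parsed as JSON
print(config["nlp"]["pipeline"])
print(config["components"]["ner"]["model"]["hidden_width"])
```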

# 7. Generating a config

- spaCy can auto-generate a default config file for you
- interactive quickstart widget in the docs
- init config command on the CLI

```$ python -m spacy init config ./config.cfg --lang en --pipeline ner```
- init config: the command to run
- config.cfg: output path for the generated config
- --lang: language class of the pipeline, e.g. en for English
- --pipeline: comma-separated names of components to include

# 8. Training a pipeline

- all you need is the `config.cfg` and the training and development data
- config settings can be overridden on the command line

```$ python -m spacy train ./config.cfg --output ./output --paths.train train.spacy --paths.dev dev.spacy```

- train: the command to run
- config.cfg: the path to the config file
- --output: the path to the output directory to save the trained pipeline
- --paths.train: override with path to the training data
- --paths.dev: override with path to the evaluation data
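
Overrides such as `--paths.train` use dot notation to address a value inside a config section. The sketch below mimics that mechanism with the `overrides` argument of thinc's `Config.from_str`. The `[paths]` and `[corpora.train]` sections are modelled on what a generated config typically contains and are an assumption here, not copied from the original notes:

```python
from thinc.api import Config

# Assumed fragment: a [paths] block referenced from a corpus reader
CONFIG_FRAGMENT = """
[paths]
train = null
dev = null

[corpora]

[corpora.train]
@readers = "spacy.Corpus.v1"
path = ${paths.train}
"""

# Dot-notation keys play the role of --paths.train and --paths.dev
overrides = {"paths.train": "train.spacy", "paths.dev": "dev.spacy"}
config = Config().from_str(CONFIG_FRAGMENT, overrides=overrides)

# The override is applied first, then ${paths.train} is interpolated
print(config["paths"]["train"])
print(config["corpora"]["train"]["path"])
```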

Loading a trained pipeline

- output after training is a regular loadable spaCy pipeline
    - model-last: last trained pipeline
    - model-best: best trained pipeline
- load it with spacy.load

In [None]:
import spacy

nlp = spacy.load("/path/to/output/model-best")
doc = nlp("iPhone 11 vs iPhone 8: What's the difference?")
print(doc.ents)

Packaging your pipeline

- spacy package: create an installable Python package containing your pipeline
- easy to version and deploy

```$ python -m spacy package /path/to/output/model-best ./packages --name my_pipeline --version 1.0.0```

```$ cd ./packages/en_my_pipeline-1.0.0```

```$ pip install dist/en_my_pipeline-1.0.0.tar.gz```

After installation, load and use the pipeline by its package name with `spacy.load("en_my_pipeline")`.

# 12. Training multiple labels

In [None]:
import spacy
from spacy.tokens import Span

nlp = spacy.blank("en")

doc1 = nlp("Reddit partners with Patreon to help creators build communities")
doc1.ents = [
    Span(doc1, 0, 1, label="WEBSITE"),
    Span(doc1, 3, 4, label="WEBSITE"),
]

doc2 = nlp("PewDiePie smashes YouTube record")
doc2.ents = [Span(doc2, 2, 3, label="WEBSITE")]

doc3 = nlp("Reddit founder Alexis Ohanian gave away two Metallica tickets to fans")
doc3.ents = [Span(doc3, 0, 1, label="WEBSITE")]

# And so on...
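
The examples above annotate a single label. As a sketch of what adding a second label could look like, the code below also marks the people mentioned in two of the sentences and counts how often each label occurs. The `"PERSON"` label and its spans are assumptions added for illustration and are not part of the original annotations:

```python
from collections import Counter

import spacy
from spacy.tokens import Span

nlp = spacy.blank("en")

doc2 = nlp("PewDiePie smashes YouTube record")
doc2.ents = [
    Span(doc2, 0, 1, label="PERSON"),  # assumed extra label
    Span(doc2, 2, 3, label="WEBSITE"),
]

doc3 = nlp("Reddit founder Alexis Ohanian gave away two Metallica tickets to fans")
doc3.ents = [
    Span(doc3, 0, 1, label="WEBSITE"),
    Span(doc3, 2, 4, label="PERSON"),  # assumed extra label
]

# Count annotations per label to spot labels with few examples
label_counts = Counter(ent.label_ for doc in (doc2, doc3) for ent in doc.ents)
print(label_counts)
```

A quick count like this can help reveal labels that are rare or annotated inconsistently before training starts.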