In [9]:
import spacy
from spacy.tokens import Span, Doc, Token, DocBin
import random

In [4]:
nlp = spacy.load('en_core_web_md')

## Training the entity recognizer

Previously we saw how to use rules to recognize new entities. However, a rule-based approach can require very complicated rules and can still be brittle.

A statistical model generalizes from labeled examples to cases the rules never anticipated. Some tasks, such as sentiment analysis, are also hard to do well with rules alone.

A rule-based approach is still helpful for preliminary processing and for bulk-labeling data.
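As a rough sketch of rule-based bulk-labeling, the cell below uses spaCy's `PhraseMatcher` to set `doc.ents` from a term list. The term list, the `nlp_rules` name, and the `bulk_label` helper are made up for illustration. Real data would need a much longer list and a manual check of the results.

```python
import spacy
from spacy.matcher import PhraseMatcher
from spacy.tokens import Span
from spacy.util import filter_spans

# A blank pipeline only tokenizes, which is all the matcher needs here.
nlp_rules = spacy.blank("en")

# Hypothetical term list, for illustration only.
GADGET_TERMS = ["iPhone X", "Android phone", "ChromeBook"]

# attr="LOWER" makes the match case-insensitive.
matcher = PhraseMatcher(nlp_rules.vocab, attr="LOWER")
matcher.add("GADGET", [nlp_rules.make_doc(term) for term in GADGET_TERMS])


def bulk_label(texts):
    """Return docs whose .ents are set from the rule matches."""
    labeled = []
    for doc in nlp_rules.pipe(texts):
        spans = [Span(doc, start, end, label=match_id)
                 for match_id, start, end in matcher(doc)]
        # doc.ents cannot hold overlapping spans, so keep the longest ones.
        doc.ents = filter_spans(spans)
        labeled.append(doc)
    return labeled


labeled_docs = bulk_label(["iPhone X is coming", "I need a new phone! Any tips?"])
for labeled_doc in labeled_docs:
    print([(ent.text, ent.label_) for ent in labeled_doc.ents])
```

Docs labeled this way can go straight into a `DocBin`, after spot-checking that the rules did not mislabel anything.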

In [11]:
doc1 = nlp("iPhone X is coming")
doc1.ents = [Span(doc1, 0, 2, label='GADGET')]
# the model needs to see examples where entities are labeled

In [13]:
doc2 = nlp('I need a new phone! Any tips?')
doc2.ents = []
# but the model also needs examples that contain no entities at all

## DocBins to store data

In [14]:
doc3 = nlp("The banana phone rings, but the Android phone beeps")
doc3.ents = [Span(doc3, 7, 9, label='GADGET')]
doc4 = nlp("This ChromeBook is very cool")
doc4.ents = [Span(doc4, 1, 2, label='GADGET')]

In [19]:
docs = [doc1, doc2, doc3, doc4]
random.shuffle(docs)
split = len(docs) // 2
train_docs = docs[:split]
test_docs = docs[split:]  # start at split so that no doc is dropped
# create efficient containers for the docs used to train and evaluate the model
train_docbin = DocBin(docs=train_docs)
train_docbin.to_disk('chap4_train.spacy')
test_docbin = DocBin(docs=test_docs)
test_docbin.to_disk('chap4_test.spacy')
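The split above is easy to get wrong by one index. Here is a minimal sketch of a reusable helper that shuffles with a fixed seed and cuts both slices at the same index. The name `split_train_dev` and its parameters are hypothetical, and plain strings stand in for `Doc` objects.

```python
import random


def split_train_dev(items, train_fraction=0.5, seed=0):
    """Shuffle a copy of items and split it into two disjoint lists."""
    if not 0.0 < train_fraction < 1.0:
        raise ValueError("train_fraction must be between 0 and 1")
    shuffled = list(items)
    # A seeded generator makes the split reproducible across runs.
    random.Random(seed).shuffle(shuffled)
    cut = int(len(shuffled) * train_fraction)
    # Both slices share the same cut index, so no item is dropped or duplicated.
    return shuffled[:cut], shuffled[cut:]


train_items, dev_items = split_train_dev(["doc1", "doc2", "doc3", "doc4"])
print(len(train_items), len(dev_items))
```

The same function works on a list of `Doc` objects, since it only relies on list operations.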

In [21]:
# spacy convert turns other corpus formats (.conll, .conllu, .iob,
# and spaCy's old .json training format) into the binary .spacy format
# !python -m spacy convert <fname> <to_dir>
# converts the file <fname> and writes the result to the directory <to_dir>
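For context on the IOB format mentioned above: each token is tagged as the Beginning of an entity, Inside one, or Outside all of them. The sketch below is not spaCy's converter. It is a hypothetical helper, `bio_to_spans`, showing how such tags map to the `(start, end, label)` token spans used with `Span` earlier.

```python
def bio_to_spans(tags):
    """Convert a list of IOB tags into (start, end, label) token spans.

    end is exclusive, matching the Span(doc, start, end) convention.
    """
    spans = []
    start, label = None, None
    for i, tag in enumerate(tags):
        # Close the open span when the current tag does not continue it.
        if start is not None and tag != f"I-{label}":
            spans.append((start, i, label))
            start, label = None, None
        if tag.startswith("B-"):
            start, label = i, tag[2:]
        # Lenient choice: an I- tag with no open span starts a new span.
        elif tag.startswith("I-") and start is None:
            start, label = i, tag[2:]
    if start is not None:
        spans.append((start, len(tags), label))
    return spans


# Tags for "iPhone X is coming"
print(bio_to_spans(["B-GADGET", "I-GADGET", "O", "O"]))
```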

## Making a config file

In [27]:
!python -m spacy init config -F ./chap4_config.cfg --lang en --pipeline ner,tok2vec
# the --pipeline args are comma-separated
# the -F arg means to overwrite the config file if it already exists

[i] Generated config template specific for your use case
- Language: en
- Pipeline: ner
- Optimize for: efficiency
- Hardware: CPU
- Transformer: None
[+] Auto-filled config with all values
[+] Saved config
chap4_config.cfg
You can now add your data and train your pipeline:
python -m spacy train chap4_config.cfg --paths.train ./train.spacy --paths.dev ./dev.spacy




## Training the model

In [28]:
!python -m spacy train ./chap4_config.cfg --output ./chap4_out --paths.train chap4_train.spacy --paths.dev chap4_test.spacy

[i] Saving to output directory: chap4_out
[i] Using CPU
[+] Initialized pipeline
[i] Pipeline: ['tok2vec', 'ner']
[i] Initial learn rate: 0.001
E     #     LOSS TOK2VEC  LOSS NER  ENTS_F  ENTS_P  ENTS_R  SCORE
----  ----  ------------  --------  ------  ------  ------  ------
   0     0          0.00      8.83    0.00    0.00    0.00    0.00
 200   200          0.76    126.49    0.00    0.00    0.00    0.00
 400   400          0.00      0.00    0.00    0.00    0.00    0.00
 600   600          0.00      0.00    0.00    0.00    0.00    0.00
 800   800          0.00      0.00    0.00    0.00    0.00    0.00
1000  1000          0.00      0.00    0.00    0.00    0.00    0.00
1200  1200          0.00      0.00    0.00    0.00    0.00    0.00
1400  1400          0.00      0.00    0.00    0.00    0.00    0.00
1600  1600          0.00      0.00    0.00    0.00    0.00    0.00
[+] Saved pipeline to output directory
chap4_out\model-last


[2022-05-01 10:47:12,618] [INFO] Set up nlp object from config
[2022-05-01 10:47:12,627] [INFO] Pipeline: ['tok2vec', 'ner']
[2022-05-01 10:47:12,627] [INFO] Created vocabulary
[2022-05-01 10:47:12,627] [INFO] Finished initializing nlp object
[2022-05-01 10:47:12,707] [INFO] Initialized pipeline components: ['tok2vec', 'ner']
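The ENTS_P, ENTS_R and ENTS_F columns in the training table are precision, recall and F-score over entity spans on the dev data. The sketch below is not spaCy's scorer. It is a hypothetical helper, `prf`, that counts a predicted span as correct only when its start, end and label all match a gold span exactly.

```python
def prf(gold_spans, predicted_spans):
    """Precision, recall and F1 over exact-match (start, end, label) spans."""
    gold, pred = set(gold_spans), set(predicted_spans)
    true_pos = len(gold & pred)
    precision = true_pos / len(pred) if pred else 0.0
    recall = true_pos / len(gold) if gold else 0.0
    if precision + recall == 0:
        return precision, recall, 0.0
    return precision, recall, 2 * precision * recall / (precision + recall)


# Made-up spans: the model finds one of the two gold entities.
gold = [(0, 2, "GADGET"), (3, 5, "GADGET")]
predicted = [(0, 2, "GADGET")]
print(prf(gold, predicted))
```

Under this definition, a row of 0.00 scores means that no predicted span matched a gold span exactly.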


In [29]:
nlp = spacy.load('chap4_out/model-best')

In [30]:
doc = nlp("Android 11 vs iPhone 8 vs kumquat phone: what's the diff?")

In [33]:
[(x.text, x.label_) for x in doc.ents]
# the trained model now finds some GADGET entities on its own
# (here it finds 'Android 11' but misses 'iPhone 8')

[('Android 11', 'GADGET')]

In [None]:
# !python -m spacy package ./chap4_out/model-best ./packages --name gadget_labeler --version 1.0.0
# the package is written to ./packages/en_gadget_labeler-1.0.0
# !python -m pip install ./packages/en_gadget_labeler-1.0.0/dist/en_gadget_labeler-1.0.0.tar.gz

## Problems and solutions in model training

**Problem:** The model forgets how to apply one label while learning another (e.g., it learns to label GADGET and forgets how to label FRUIT).

**Solution:** Mix FRUIT examples in with the GADGET examples, especially FRUIT examples that the model previously labeled correctly. This way the model learns both labels at the same time.
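A minimal sketch of that mixing step, using `(text, label)` tuples as stand-ins for annotated docs. The helper name `mix_examples` and the `old_per_new` ratio are hypothetical, and the right ratio is something to find by experiment.

```python
import random


def mix_examples(new_examples, old_examples, old_per_new=1.0, seed=0):
    """Return a shuffled training list with old-label examples mixed in.

    old_per_new is the number of old examples to add per new example;
    1.0 gives a roughly even mix when enough old examples exist.
    """
    rng = random.Random(seed)
    n_old = min(len(old_examples), round(len(new_examples) * old_per_new))
    mixed = list(new_examples) + rng.sample(list(old_examples), n_old)
    # Shuffle so the model does not see all examples of one label in a row.
    rng.shuffle(mixed)
    return mixed


gadget_examples = [("iPhone X is coming", "GADGET"), ("This ChromeBook is very cool", "GADGET")]
fruit_examples = [("I ate a kumquat", "FRUIT"), ("Bananas are cheap", "FRUIT"), ("Try the mango", "FRUIT")]
print(mix_examples(gadget_examples, fruit_examples))
```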


**Problem:** spaCy can't distinguish SQUASH from LEAFY_GREEN.

**Solution:** Try a less granular category, like VEGETABLE.

Models make predictions from the local context of each example. If a label has too few examples, the model overfits to the tokens that happened to surround the few examples it saw.

**Problem:** But I need to be able to distinguish SQUASH from LEAFY_GREEN!

**Solution:** Use rules to break general labels down into subcategories. For example, keep one lookup table of squashes and another of leafy greens.
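A small sketch of the lookup-table idea, applied to `(text, label)` pairs such as those produced from `doc.ents`. The table contents and the `refine_label` helper are made up for illustration.

```python
# Hypothetical lookup tables, for illustration only.
SQUASHES = {"zucchini", "butternut squash", "pumpkin"}
LEAFY_GREENS = {"spinach", "kale", "romaine lettuce"}


def refine_label(text, label):
    """Map a general VEGETABLE label to a subcategory via lookup tables."""
    if label != "VEGETABLE":
        return label
    key = text.lower()
    if key in SQUASHES:
        return "SQUASH"
    if key in LEAFY_GREENS:
        return "LEAFY_GREEN"
    # Vegetables missing from both tables keep the general label.
    return label


predicted = [("Kale", "VEGETABLE"), ("pumpkin", "VEGETABLE"), ("leek", "VEGETABLE")]
print([(text, refine_label(text, label)) for text, label in predicted])
```

The statistical model only has to learn the general VEGETABLE label, and the fine-grained distinction stays in tables that are easy to inspect and extend.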