<a href="https://colab.research.google.com/github/howard-haowen/NLP-demos/blob/main/NSYSU/W08-extracting-named-entities-with-spacy.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

This notebook is written by [Haowen Jiang](https://howard-haowen.rohan.tw/), and meant for the 2022 [NLP Workshop at NSYSU](https://howard-haowen.rohan.tw/NLP-demos/nsysu_workshop).

In [None]:
from datetime import date

today = date.today()
print("Last updated:", today)

# Extracting named entities with spaCy

Named entity recognition (NER) is widely used in industries such as e-commerce, news media, and FinTech. When it comes to extracting named entities, there are usually three scenarios.

- Scenario 1: A model is readily usable.

When a model that recognizes the labels you're interested in is already available, you can simply use it and call it a day. We've learned how to do that with spaCy.

- Scenario 2: There's no model or annotated data.

When there's neither a model nor annotated data, the traditional solution is to use regular expressions (regex) to extract some sample annotations first. But spaCy offers better rule-based alternatives, including:

- [`Matcher`](https://spacy.io/api/matcher), which is based on token attributes
- [`DependencyMatcher`](https://spacy.io/api/dependencymatcher), which is based on a dependency tree

You can experiment with spaCy's rule-based matcher in [this interactive demo](https://explosion.ai/demos/matcher).

Alternatively, you can outsource the annotation or use annotation tools like [Doccano](https://github.com/doccano/doccano) or [Label Studio](https://labelstud.io/) to easily create annotations.
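
To make the regex route in Scenario 2 a bit more concrete, here is a minimal sketch that uses only Python's standard library. The pattern, the `DATE` label, and the `find_dates` helper are all assumptions made up for illustration; the sample sentence is adapted from the dataset used later in this notebook. The idea is simply to collect character-offset spans that can serve as rough first annotations.

```python
import re
from typing import List, Tuple

# Hypothetical pattern and label, chosen only for illustration:
# matches ISO-style dates such as 1996-08-22.
DATE_PATTERN = re.compile(r"\b\d{4}-\d{2}-\d{2}\b")


def find_dates(text: str) -> List[Tuple[int, int, str]]:
    """Return (start_char, end_char, label) for every ISO-style date in text."""
    return [(m.start(), m.end(), "DATE") for m in DATE_PATTERN.finditer(text)]


text = "BRUSSELS 1996-08-22 The European Commission said on Thursday it disagreed with German advice."
annotations = find_dates(text)
for start, end, label in annotations:
    print(text[start:end], label)
```

Regex works for rigid surface patterns like this one, but it knows nothing about tokens, parts of speech, or syntax, which is where `Matcher` and `DependencyMatcher` come in.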

- Scenario 3: There's annotated data but no trained model.

This is the scenario we're dealing with in this tutorial. We'll learn how to train an NER model using spaCy's command-line interface, and compare its performance against that of a pretrained spaCy model.



## Download spaCy

In [None]:
!pip install -U -q pip setuptools wheel
!pip install -U -q spacy

## Download the [CoNLL-2003 dataset](https://paperswithcode.com/dataset/conll-2003)

In [1]:
!mkdir data

In [2]:
!wget -O data/train.txt https://raw.githubusercontent.com/practical-nlp/practical-nlp-code/master/Ch5/Data/conll2003/en/train.txt
!wget -O data/valid.txt https://raw.githubusercontent.com/practical-nlp/practical-nlp-code/master/Ch5/Data/conll2003/en/valid.txt
!wget -O data/test.txt https://raw.githubusercontent.com/practical-nlp/practical-nlp-code/master/Ch5/Data/conll2003/en/test.txt

--2022-06-10 02:09:46--  https://raw.githubusercontent.com/practical-nlp/practical-nlp-code/master/Ch5/Data/conll2003/en/train.txt
Resolving raw.githubusercontent.com (raw.githubusercontent.com)... 185.199.108.133, 185.199.109.133, 185.199.110.133, ...
Connecting to raw.githubusercontent.com (raw.githubusercontent.com)|185.199.108.133|:443... connected.
HTTP request sent, awaiting response... 200 OK
Length: 3283420 (3.1M) [text/plain]
Saving to: ‘data/train.txt’


2022-06-10 02:09:46 (44.8 MB/s) - ‘data/train.txt’ saved [3283420/3283420]

--2022-06-10 02:09:46--  https://raw.githubusercontent.com/practical-nlp/practical-nlp-code/master/Ch5/Data/conll2003/en/valid.txt
Resolving raw.githubusercontent.com (raw.githubusercontent.com)... 185.199.111.133, 185.199.110.133, 185.199.108.133, ...
Connecting to raw.githubusercontent.com (raw.githubusercontent.com)|185.199.111.133|:443... connected.
HTTP request sent, awaiting response... 200 OK
Length: 827443 (808K) [text/plain]
Saving to: ‘data/

This is what the original dataset looks like.

In [12]:
!head -n 30 data/train.txt

-DOCSTART- -X- -X- O

EU NNP B-NP B-ORG
rejects VBZ B-VP O
German JJ B-NP B-MISC
call NN I-NP O
to TO B-VP O
boycott VB I-VP O
British JJ B-NP B-MISC
lamb NN I-NP O
. . O O

Peter NNP B-NP B-PER
Blackburn NNP I-NP I-PER

BRUSSELS NNP B-NP B-LOC
1996-08-22 CD I-NP O

The DT B-NP O
European NNP I-NP B-ORG
Commission NNP I-NP I-ORG
said VBD B-VP O
on IN B-PP O
Thursday NNP B-NP O
it PRP B-NP O
disagreed VBD B-VP O
with IN B-PP O
German JJ B-NP B-MISC
advice NN I-NP O
to TO B-PP O
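
In the sample above, each line holds one token followed by its part-of-speech tag, syntactic chunk tag, and NER tag, and blank lines separate sentences. To show how the `B-`/`I-`/`O` tags encode entity spans, here is a small sketch in plain Python. The helper names `read_sentences` and `bio_to_spans` are made up for this illustration and are not part of spaCy, which handles all of this for us in the conversion step.

```python
from typing import Iterable, List, Tuple

Token = Tuple[str, str]      # (word, NER tag)
Span = Tuple[int, int, str]  # (start token, end token exclusive, label)

SAMPLE = """-DOCSTART- -X- -X- O

EU NNP B-NP B-ORG
rejects VBZ B-VP O
German JJ B-NP B-MISC
call NN I-NP O
to TO B-VP O
boycott VB I-VP O
British JJ B-NP B-MISC
lamb NN I-NP O
. . O O

Peter NNP B-NP B-PER
Blackburn NNP I-NP I-PER
"""


def read_sentences(lines: Iterable[str]) -> List[List[Token]]:
    """Group token-per-line rows into sentences, keeping the word and the NER tag."""
    sentences, current = [], []
    for line in lines:
        line = line.strip()
        # Blank lines and document markers end the current sentence.
        if not line or line.startswith("-DOCSTART-"):
            if current:
                sentences.append(current)
                current = []
            continue
        columns = line.split()
        current.append((columns[0], columns[-1]))  # first column: word, last: NER tag
    if current:
        sentences.append(current)
    return sentences


def bio_to_spans(tags: List[str]) -> List[Span]:
    """Turn a sequence of BIO tags into (start, end, label) token spans."""
    spans, start, label = [], None, None
    for i, tag in enumerate(tags + ["O"]):  # the extra "O" closes a trailing span
        continues = tag.startswith("I-") and tag[2:] == label
        if start is not None and not continues:
            spans.append((start, i, label))
            start, label = None, None
        if tag.startswith("B-") or (tag.startswith("I-") and start is None):
            start, label = i, tag[2:]
    return spans


for sentence in read_sentences(SAMPLE.splitlines()):
    words = [word for word, _ in sentence]
    tags = [tag for _, tag in sentence]
    print([(" ".join(words[s:e]), label) for s, e, label in bio_to_spans(tags)])
```

For the two sentences in `SAMPLE`, this yields `EU` (ORG), `German` (MISC), and `British` (MISC) in the first one, and `Peter Blackburn` (PER) in the second.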


## Convert data to spaCy format

In [9]:
!mkdir spacyNER_data

If your dataset is already in the CoNLL format, you don't need to write your own data conversion functions as we did last week. spaCy provides a convenient command, `spacy convert`, for converting CoNLL data to spaCy's binary format, i.e. `DocBin`.

In [10]:
!python -m spacy convert "./data/train.txt" spacyNER_data --converter ner
!python -m spacy convert "./data/valid.txt" spacyNER_data --converter ner
!python -m spacy convert "./data/test.txt" spacyNER_data --converter ner

ℹ Auto-detected token-per-line NER format
ℹ Grouping every 1 sentences into a document.
⚠ To generate better training data, you may want to group sentences
into documents with `-n 10`.
✔ Generated output file (14987 documents):
spacyNER_data/train.spacy
ℹ Auto-detected token-per-line NER format
ℹ Grouping every 1 sentences into a document.
⚠ To generate better training data, you may want to group sentences
into documents with `-n 10`.
✔ Generated output file (3466 documents): spacyNER_data/valid.spacy
ℹ Auto-detected token-per-line NER format
ℹ Grouping every 1 sentences into a document.
⚠ To generate better training data, you may want to group sentences
into documents with `-n 10`.
✔ Generated output file (3684 documents): spacyNER_data/test.spacy


## Configure the training process

The configuration and training steps are the same as those we used for text classification models, apart from a few settings, most notably the pipeline components (`tok2vec,ner`).

In [13]:
!mkdir configs

In [14]:
LANG = 'en'
OPTIMIZE = 'accuracy'
CONFIG_PREFIX = 'cpu'
!python -m spacy init config configs/{CONFIG_PREFIX}_config.cfg \
--lang {LANG} \
--pipeline tok2vec,ner \
--optimize {OPTIMIZE} \
--force

⚠ To generate a more effective transformer-based config (GPU-only),
install the spacy-transformers package and re-run this command. The config
generated now does not use transformers.
ℹ Generated config template specific for your use case
- Language: en
- Pipeline: ner
- Optimize for: accuracy
- Hardware: CPU
- Transformer: None
✔ Auto-filled config with all values
✔ Saved config
configs/cpu_config.cfg
You can now add your data and train your pipeline:
python -m spacy train cpu_config.cfg --paths.train ./train.spacy --paths.dev ./dev.spacy


In [16]:
!head -n 15 configs/cpu_config.cfg

[paths]
train = null
dev = null
vectors = "en_core_web_lg"
init_tok2vec = null

[system]
gpu_allocator = null
seed = 0

[nlp]
lang = "en"
pipeline = ["tok2vec","ner"]
batch_size = 1000
disabled = []


## Download a pretrained model

In [17]:
!python -m spacy download en_core_web_lg

Looking in indexes: https://pypi.org/simple, https://us-python.pkg.dev/colab-wheels/public/simple/
Collecting en-core-web-lg==3.3.0
  Downloading https://github.com/explosion/spacy-models/releases/download/en_core_web_lg-3.3.0/en_core_web_lg-3.3.0-py3-none-any.whl (400.7 MB)
     ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 400.7/400.7 MB 3.3 MB/s eta 0:00:00
Installing collected packages: en-core-web-lg
Successfully installed en-core-web-lg-3.3.0
✔ Download and installation successful
You can now load the package via spacy.load('en_core_web_lg')


## Start training

In [18]:
CONFIG_PREFIX = 'cpu'
CONFIG_FILE = CONFIG_PREFIX + '_config.cfg'
TRAIN_FILE = './spacyNER_data/train'
VALID_FILE = './spacyNER_data/valid'
MODEL_DIR = f'./{CONFIG_PREFIX}_model'
!python -m spacy train configs/{CONFIG_FILE} \
--output {MODEL_DIR} \
--paths.train {TRAIN_FILE}.spacy \
--paths.dev {VALID_FILE}.spacy \
--verbose

[2022-06-10 02:44:03,882] [DEBUG] Config overrides from CLI: ['paths.train', 'paths.dev']
✔ Created output directory: cpu_model
ℹ Saving to output directory: cpu_model
ℹ Using CPU
ℹ To switch to GPU 0, use the option: --gpu-id 0
[2022-06-10 02:44:04,519] [INFO] Set up nlp object from config
[2022-06-10 02:44:04,528] [DEBUG] Loading corpus from path: spacyNER_data/valid.spacy
[2022-06-10 02:44:04,529] [DEBUG] Loading corpus from path: spacyNER_data/train.spacy
[2022-06-10 02:44:04,529] [INFO] Pipeline: ['tok2vec', 'ner']
[2022-06-10 02:44:04,533] [INFO] Created vocabulary
[2022-06-10 02:44:06,170] [INFO] Added vectors: en_core_web_lg
[2022-06-10 02:44:07,452] [INFO] Finished initializing nlp object

Load the table in your config with:

[initialize.lookups]
@misc = "spacy.LookupsDataLoader.v1"
lang = ${nlp.lang}
tables = ["lexeme_norm"]

[2022-06-10 02:44:19,860] [INFO] Initialized pipeline components: ['tok2vec', 'ner']
✔

## Evaluate models

### Evaluate `model-best`

This is spaCy's official explanation of the `evaluate` command:

> Evaluate a trained pipeline. Expects a loadable spaCy pipeline (package name or path) and evaluation data in the binary .spacy format. The --gold-preproc option sets up the evaluation examples with gold-standard sentences and tokens for the predictions. Gold preprocessing helps the annotations align to the tokenization, and may result in sequences of more consistent length. However, it may reduce runtime accuracy due to train/test skew. To render a sample of dependency parses in a HTML file using the displaCy visualizations, set as output directory as the --displacy-path argument.

Read [this thread](https://github.com/explosion/spaCy/issues/2607) to understand why the `--gold-preproc` option is needed.

In [26]:
CONFIG_PREFIX = 'cpu'
NER_MODEL_PATH = f'./{CONFIG_PREFIX}_model/model-best'
TEST_DATA_PATH = './spacyNER_data/test'
EVAL_PATH = NER_MODEL_PATH + '/evaluation'
METRICS_FILE = EVAL_PATH + '/test_metrics.json'

In [24]:
!mkdir {EVAL_PATH}

In [29]:
!python -m spacy evaluate {NER_MODEL_PATH} {TEST_DATA_PATH}.spacy \
--output {METRICS_FILE} \
--displacy-path {EVAL_PATH}

ℹ Using CPU
ℹ To switch to GPU 0, use the option: --gpu-id 0

TOK     -    
NER P   87.14
NER R   86.72
NER F   86.93
SPEED   2903 

           P       R       F
LOC    88.99   91.55   90.25
PER    90.21   91.71   90.95
MISC   79.64   75.78   77.66
ORG    85.12   81.64   83.34

✔ Generated 25 parses as HTML
cpu_model/model-best/evaluation
✔ Saved results to
cpu_model/model-best/evaluation/test_metrics.json
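
If you're wondering how the P, R, and F columns relate to each other, F is the harmonic mean of precision and recall. The short sketch below recomputes F from the percentages printed above. The `f_score` helper is written just for this illustration and is not spaCy's own scoring code. As far as I know, spaCy counts a predicted entity as correct only when both its boundaries and its label match the gold annotation exactly.

```python
def f_score(precision: float, recall: float) -> float:
    """Harmonic mean of precision and recall; 0.0 when both are zero."""
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)


# (precision, recall, reported F) copied from the evaluation output, in percent.
reported = {
    "overall": (87.14, 86.72, 86.93),
    "LOC": (88.99, 91.55, 90.25),
    "PER": (90.21, 91.71, 90.95),
    "MISC": (79.64, 75.78, 77.66),
    "ORG": (85.12, 81.64, 83.34),
}

for name, (p, r, f) in reported.items():
    print(f"{name:8s} computed F = {f_score(p, r):.2f}   reported F = {f:.2f}")
```

Because the printed precision and recall are already rounded, a recomputed F can occasionally differ from the reported one in the last digit.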


In [31]:
import IPython.display  # import the submodule explicitly so IPython.display.HTML is available
entity_file = EVAL_PATH + '/entities.html'
IPython.display.HTML(filename=entity_file)

### Evaluate `en_core_web_lg`

In [36]:
NER_MODEL_PATH = 'en_core_web_lg'
TEST_DATA_PATH = './spacyNER_data/test'
EVAL_PATH = './spacy_evaluation'
METRICS_FILE = EVAL_PATH + '/test_metrics.json'

In [37]:
!mkdir {EVAL_PATH}

In [38]:
!python -m spacy evaluate {NER_MODEL_PATH} {TEST_DATA_PATH}.spacy \
--output {METRICS_FILE} \
--displacy-path {EVAL_PATH}

ℹ Using CPU
ℹ To switch to GPU 0, use the option: --gpu-id 0

TOK      -    
TAG      87.19
POS      -    
MORPH    -    
LEMMA    -    
UAS      -    
LAS      -    
NER P    7.11 
NER R    10.30
NER F    8.42 
SENT P   96.53
SENT R   98.29
SENT F   97.40
SPEED    7193 


                  P       R       F
ORG           49.46   33.11   39.67
GPE            0.00    0.00    0.00
LOC           51.61    1.92    3.70
PER            0.00    0.00    0.00
PERSON         0.00    0.00    0.00
EVENT          0.00    0.00    0.00
DATE           0.00    0.00    0.00
MISC           0.00    0.00    0.00
ORDINAL        0.00    0.00    0.00
CARDINAL       0.00    0.00    0.00
TIME           0.00    0.00    0.00
NORP           0.00    0.00    0.00
LAW            0.00    0.00    0.00
PERCENT        0.00    0.00    0.00
PRODUCT        0.00    0.00    0.00
LANGUAGE       0.00    0.00    0.00
MONEY          0.00    0.00    0.00
QUANTITY       0.00    0.00    0.00
FAC   

In [39]:
entity_file = EVAL_PATH + '/entities.html'
IPython.display.HTML(filename=entity_file)
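
A word of caution before reading too much into the very low NER scores of `en_core_web_lg`: they most likely reflect a mismatch in label schemes rather than model quality. The pretrained pipeline predicts OntoNotes-style labels such as `PERSON`, `GPE`, and `NORP`, whereas the CoNLL-2003 test set only uses `PER`, `LOC`, `ORG`, and `MISC`. A predicted `PERSON` can therefore never match a gold `PER`, which is why both rows show 0.00 in the table. The sketch below shows, on toy data, how remapping labels before scoring changes the picture. The mapping itself and the helper names are my own assumptions for illustration, not an official correspondence, and the predictions are made up.

```python
from typing import Dict, Set, Tuple

Span = Tuple[int, int, str]  # (start token, end token exclusive, label)

# Assumed, illustrative mapping from OntoNotes-style labels to CoNLL-2003 labels.
ONTONOTES_TO_CONLL: Dict[str, str] = {
    "PERSON": "PER",
    "GPE": "LOC",
    "LOC": "LOC",
    "FAC": "LOC",
    "ORG": "ORG",
    "NORP": "MISC",
    "EVENT": "MISC",
    "LANGUAGE": "MISC",
}


def remap(spans: Set[Span]) -> Set[Span]:
    """Rename mapped labels and drop spans whose label has no CoNLL counterpart."""
    return {
        (start, end, ONTONOTES_TO_CONLL[label])
        for start, end, label in spans
        if label in ONTONOTES_TO_CONLL
    }


def prf(gold: Set[Span], pred: Set[Span]) -> Tuple[float, float, float]:
    """Exact-match precision, recall, and F over sets of labeled spans."""
    true_positives = len(gold & pred)
    p = true_positives / len(pred) if pred else 0.0
    r = true_positives / len(gold) if gold else 0.0
    f = 2 * p * r / (p + r) if p + r else 0.0
    return p, r, f


# Gold spans for "EU rejects German call to boycott British lamb ."
gold = {(0, 1, "ORG"), (2, 3, "MISC"), (6, 7, "MISC")}
# Hypothetical predictions in the OntoNotes label scheme.
pred = {(0, 1, "ORG"), (2, 3, "NORP"), (6, 7, "NORP")}

print("before remapping:", prf(gold, pred))
print("after remapping: ", prf(gold, remap(pred)))
```

For a fairer comparison on the real test set, the same kind of remapping would have to be applied to the pretrained model's predictions before computing the scores.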

## Adieu

Having trained an NER model, we're now reaching the end of this 8-week workshop! 🙌 It's time to bid adieu!

![](https://media.makeameme.org/created/adieu-adieu-adieu-5c47b2.jpg)