# Training and Evaluating an NER model with spaCy on the CoNLL dataset

In this notebook, we will use spaCy's command-line interface to train and evaluate an NER model. We will also compare it with spaCy's pretrained NER model.

Note: we will create multiple folders during this experiment: spacyNER_data, model, result, and pretrained_result.

## Step 1: Converting the data to spaCy's training format

In [2]:
# Read the CoNLL data from the conll2003 folder, and store the converted data in the folder spacyNER_data.
!mkdir spacyNER_data
# The line above creates the folder if it doesn't exist. If it does, the output shows a message
# that it already exists and cannot be created again.
!python3 -m spacy convert "Data/conll2003/en/train.txt" spacyNER_data -c ner
!python3 -m spacy convert "Data/conll2003/en/test.txt" spacyNER_data -c ner
!python3 -m spacy convert "Data/conll2003/en/valid.txt" spacyNER_data -c ner

mkdir: cannot create directory ‘spacyNER_data’: File exists
ℹ Auto-detected token-per-line NER format
ℹ Grouping every 1 sentences into a document.
⚠ To generate better training data, you may want to group sentences
into documents with `-n 10`.
✔ Generated output file (14987 documents): spacyNER_data/train.spacy
ℹ Auto-detected token-per-line NER format
ℹ Grouping every 1 sentences into a document.
⚠ To generate better training data, you may want to group sentences
into documents with `-n 10`.
✔ Generated output file (3684 documents): spacyNER_data/test.spacy
ℹ Auto-detected token-per-line NER format
ℹ Grouping every 1 sentences into a document.
⚠ To generate better training data, you may want to group sentences
into documents with `-n 10`.
✔ Generated output file (3466 documents): spacyNER_data/valid.spacy
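
The warning above refers to the `-n` option of `spacy convert`, which packs several sentences into each document. As a rough illustration of the idea only (this is not spaCy's own implementation, and the helper name `group_sentences` is made up for this sketch), grouping amounts to slicing the list of sentences into fixed-size chunks:

```python
from typing import List, Sequence


def group_sentences(sentences: Sequence[str], n: int) -> List[List[str]]:
    """Split `sentences` into consecutive groups of at most `n` items.

    Hypothetical helper for illustration; the last group may be shorter.
    """
    if n < 1:
        raise ValueError("n must be at least 1")
    return [list(sentences[i:i + n]) for i in range(0, len(sentences), n)]


# Toy input: 25 placeholder sentences grouped ten at a time.
sentences = [f"sentence {i}" for i in range(25)]
docs = group_sentences(sentences, 10)
print([len(doc) for doc in docs])  # [10, 10, 5]
```

With `n=1`, as in the run above, each group holds a single sentence.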


#### For example, the data before and after running spaCy's convert program looks as follows.

In [3]:
!echo "BEFORE : (Data/conll2003/en/train.txt)"
!head -n 11 "Data/conll2003/en/train.txt" | tail -n 9
!echo ""
!echo "AFTER : (spacyNER_data/train.json)"
!head -n 64 "spacyNER_data/train.json" | tail -n 49

BEFORE : (Data/conll2003/en/train.txt)
EU NNP B-NP B-ORG
rejects VBZ B-VP O
German JJ B-NP B-MISC
call NN I-NP O
to TO B-VP O
boycott VB I-VP O
British JJ B-NP B-MISC
lamb NN I-NP O
. . O O

AFTER : (spacyNER_data/train.json)
head: cannot open 'spacyNER_data/train.json' for reading: No such file or directory

The second `head` command fails because the conversion above wrote binary `.spacy` files (the format used by spaCy v3) rather than the `.json` files that spaCy v2 produces, so there is no `train.json` to display.
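
To make the BEFORE layout concrete: each line holds a token followed by its part-of-speech tag, its syntactic chunk tag, and its NER tag, and blank lines separate sentences. A minimal sketch of reading that layout in plain Python follows. The function name `read_conll` is an assumption made for this example, and this is not what `spacy convert` runs internally.

```python
from typing import List, Tuple

# The nine lines shown in the BEFORE listing.
SAMPLE = """\
EU NNP B-NP B-ORG
rejects VBZ B-VP O
German JJ B-NP B-MISC
call NN I-NP O
to TO B-VP O
boycott VB I-VP O
British JJ B-NP B-MISC
lamb NN I-NP O
. . O O
"""


def read_conll(text: str) -> List[List[Tuple[str, str]]]:
    """Return one list of (token, ner_tag) pairs per sentence."""
    sentences, current = [], []
    for line in text.splitlines():
        line = line.strip()
        if not line or line.startswith("-DOCSTART-"):
            # A blank line or a document marker closes the current sentence.
            if current:
                sentences.append(current)
                current = []
            continue
        columns = line.split()
        # The first column is the token, the last column is the NER tag.
        current.append((columns[0], columns[-1]))
    if current:
        sentences.append(current)
    return sentences


sentence = read_conll(SAMPLE)[0]
entities = [(token, tag) for token, tag in sentence if tag != "O"]
print(entities)
# [('EU', 'B-ORG'), ('German', 'B-MISC'), ('British', 'B-MISC')]
```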


## Training the NER model with spaCy (CLI)

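All the command-line options can be seen at: https://spacy.io/api/cli#train
The command below uses the spaCy v2 syntax of the train program. We train for English (en), and the results are stored in a folder called "model" (created while training). Our training file is "spacyNER_data/train.json" and the validation file is "spacyNER_data/valid.json".

-G switches on gold preprocessing (`--gold-preproc`) in spaCy v2; the GPU is selected with the lowercase -g option instead.
-p stands for pipeline, and it should be followed by a comma-separated set of components. In this case, a tagger and an NER component are trained together.

Under spaCy v3, `train` takes a config file instead of these positional arguments, which is why the run recorded below stops with a CONFIG_PATH error.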

In [4]:
!python3 -m spacy train en model spacyNER_data/train.json spacyNER_data/valid.json -G -p tagger,ner

Usage: python -m spacy train [OPTIONS] CONFIG_PATH
Try 'python -m spacy train --help' for help.

Error: Invalid value for 'CONFIG_PATH': Path 'en' does not exist.
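
The error above is what spaCy v3 prints when it receives v2-style arguments: v3 expects the path to a config file as the first argument. As a hedged sketch of the v3 workflow, the snippet below only assembles the two commands (create a config with `spacy init config`, then train with path overrides); it does not run them. The helper name `v3_commands`, the file name `config.cfg`, and the use of the `.spacy` files as inputs are assumptions made for illustration.

```python
import shlex
from typing import List, Tuple


def v3_commands(
    data_dir: str = "spacyNER_data",
    config_path: str = "config.cfg",  # assumed file name
    output_dir: str = "model",
    pipeline: str = "tagger,ner",
) -> Tuple[List[str], List[str]]:
    """Build the spaCy v3 `init config` and `train` argument lists."""
    base = ["python3", "-m", "spacy"]
    # Step 1: generate a starter config for the chosen pipeline components.
    init_cmd = base + [
        "init", "config", config_path,
        "--lang", "en",
        "--pipeline", pipeline,
    ]
    # Step 2: train from that config, overriding the corpus paths.
    train_cmd = base + [
        "train", config_path,
        "--output", output_dir,
        "--paths.train", f"{data_dir}/train.spacy",
        "--paths.dev", f"{data_dir}/valid.spacy",
    ]
    return init_cmd, train_cmd


for cmd in v3_commands():
    print(shlex.join(cmd))
```

Each list can be passed to `subprocess.run(cmd, check=True)` in an environment where spaCy v3 is installed, or the printed lines can be pasted into a notebook cell after `!`.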


When training completes, the log reports scores after every iteration. Notice how the performance improves with each iteration!

## Evaluating the model with the test data set (`spacyNER_data/test.json`)

### On the trained model (`model/model-best`)

In [4]:
#create a folder to store the output and visualizations. 
!mkdir result
!python3 -m spacy evaluate model/model-best spacyNER_data/test.json -dp result
# !python -m spacy evaluate model/model-final data/test.txt.json -dp result

Time      3.53 s
Words     46666
Words/s   13234
TOK       100.00
POS       94.79
UAS       0.00
LAS       0.00
NER P     78.09
NER R     78.75
NER F     78.42

✔ Generated 25 parses as HTML
result


A visualization of the entity-tagged test data can be seen in the result/entities.html file.
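
The `NER F` value in the results is the harmonic mean of `NER P` (precision) and `NER R` (recall). A small sketch of that arithmetic, using the figures reported above for the trained model (the function name `f_score` is chosen for this example):

```python
def f_score(precision: float, recall: float) -> float:
    """Harmonic mean of precision and recall (the F1 score)."""
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)


# Precision and recall of the trained model, taken from the results above.
print(round(f_score(78.09, 78.75), 2))  # 78.42
```

Because the harmonic mean is pulled toward the smaller of the two values, a model cannot reach a high F score with high precision alone or high recall alone.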

### On spaCy's pretrained NER model (`en`)

In [5]:
!mkdir pretrained_result
!python3 -m spacy evaluate en spacyNER_data/test.json -dp pretrained_result

Time      6.52 s
Words     46666
Words/s   7160
TOK       100.00
POS       86.84
UAS       0.00
LAS       0.00
NER P     7.97
NER R     10.68
NER F     9.12

✔ Generated 25 parses as HTML
pretrained_result


A visualization of the entity-tagged test data can be seen in the pretrained_result/entities.html file.