# Training and Evaluating an NER model with spaCy on the CoNLL dataset

In this notebook, we use the spaCy command-line interface to train and evaluate an NER model. We also compare it with spaCy's pretrained NER model.

Note: this experiment creates several folders: `spacyNER_data` (converted data), `model` (trained pipeline), and `result` and `pretrained_result` (evaluation visualizations).

## Step 1: Converting the data to JSON so it can be used by spaCy

In [1]:
# Read the CoNLL data from the conll2003 folder and store the converted data in a folder named spacyNER_data.
!mkdir spacyNER_data
# The line above creates the folder if it doesn't exist. If it does, the output shows a message
# that it already exists and cannot be created again.
!python3 -m spacy convert "Data/conll2003/en/train.txt" spacyNER_data -c ner
!python3 -m spacy convert "Data/conll2003/en/test.txt" spacyNER_data -c ner
!python3 -m spacy convert "Data/conll2003/en/valid.txt" spacyNER_data -c ner

mkdir: cannot create directory ‘spacyNER_data’: File exists
✔ Generated output file (1 documents)
spacyNER_data/train.json
✔ Generated output file (1 documents)
spacyNER_data/test.json
✔ Generated output file (1 documents)
spacyNER_data/valid.json
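
The `mkdir` message above appears whenever the cell is re-run. As a minimal sketch of an alternative (not part of the original notebook), the folders could be created from Python so that re-running is silent. The helper name `ensure_dirs` and its `root` parameter are hypothetical, chosen only for illustration.

```python
from pathlib import Path


def ensure_dirs(names, root="."):
    """Create each folder under `root` if it is missing; do nothing if it exists."""
    created = []
    for name in names:
        path = Path(root) / name
        # exist_ok=True suppresses the "File exists" error that plain mkdir reports
        path.mkdir(parents=True, exist_ok=True)
        created.append(path)
    return created


if __name__ == "__main__":
    # Folder names taken from this notebook; "model" is created by the training step itself.
    for folder in ensure_dirs(["spacyNER_data", "result", "pretrained_result"]):
        print(folder)
```

Calling it a second time leaves the existing folders and their contents untouched.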


#### For example, the data before and after running spacy's convert program looks as follows.

In [2]:
!echo "BEFORE : (Data/conll2003/en/train.txt)"
!head "Data/conll2003/en/train.txt" -n 11 | tail -n 9
!echo "\nAFTER : (spacyNER_data/train.json)"
!head "spacyNER_data/train.json" -n 64 | tail -n 49

BEFORE : (Data/conll2003/en/train.txt)
EU NNP B-NP B-ORG
rejects VBZ B-VP O
German JJ B-NP B-MISC
call NN I-NP O
to TO B-VP O
boycott VB I-VP O
British JJ B-NP B-MISC
lamb NN I-NP O
. . O O

AFTER : (spacyNER_data/train.json)
          {
            "tokens":[
              {
                "orth":"EU",
                "tag":"NNP",
                "ner":"U-ORG"
              },
              {
                "orth":"rejects",
                "tag":"VBZ",
                "ner":"O"
              },
              {
                "orth":"German",
                "tag":"JJ",
                "ner":"U-MISC"
              },
              {
                "orth":"call",
                "tag":"NN",
                "ner":"O"
              },
              {
                "orth":"to",
                "tag":"TO",
                "ner":"O"
              },
              {
                "orth":"boycott",
                "tag":"VB",
                "ner":"O"
              },
            
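
Comparing the two listings, the single-token entities tagged `B-ORG` and `B-MISC` in the CoNLL file appear as `U-ORG` and `U-MISC` in the JSON: the converter rewrites the IOB tags into the BILUO scheme (Begin, Inside, Last, Unit, Outside). The sketch below illustrates that mapping for one sentence. It is an illustration only, not spaCy's own implementation; the function name `iob_to_biluo` is used here as a local helper, and it assumes every entity starts with a `B-` tag, as in the file shown above.

```python
def iob_to_biluo(tags):
    """Convert one sentence's IOB tags (B-, I-, O) to BILUO (B-, I-, L-, U-, O)."""
    biluo = []
    for i, tag in enumerate(tags):
        if tag == "O":
            biluo.append(tag)
            continue
        prefix, label = tag.split("-", 1)
        next_tag = tags[i + 1] if i + 1 < len(tags) else "O"
        # The entity continues only if the next tag is I- with the same label.
        continues = next_tag == "I-" + label
        if prefix == "B":
            biluo.append(("B-" if continues else "U-") + label)
        else:  # prefix == "I"
            biluo.append(("I-" if continues else "L-") + label)
    return biluo


# NER column of the example sentence "EU rejects German call to boycott British lamb ."
sentence_tags = ["B-ORG", "O", "B-MISC", "O", "O", "O", "B-MISC", "O", "O"]
print(iob_to_biluo(sentence_tags))
```

For the example sentence, every entity is a single token, so each `B-` tag becomes a `U-` tag, which matches the `"ner"` values in the JSON excerpt.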

## Training the NER model with spaCy (CLI)

All the command-line options are listed at: https://spacy.io/api/cli#train
We train with spaCy's `train` command for English (`en`), and the results are stored in a folder called `model` (created during training). The training file is `spacyNER_data/train.json` and the validation file is `spacyNER_data/valid.json`.

`-G` enables gold preprocessing (`--gold-preproc`), so training uses the tokenization and sentence boundaries from the annotated data. It does not select a GPU; that is the lowercase `-g` (`--use-gpu`) option, which is consistent with the GPU WPS column staying at 0 in the log below.
`-p` sets the pipeline and is followed by a comma-separated list of components. Here, a tagger and an NER component are trained together.

In [3]:
!python3 -m spacy train en model spacyNER_data/train.json spacyNER_data/valid.json -G -p tagger,ner

Training pipeline: ['tagger', 'ner']
Starting with blank model 'en'
Counting training words (limit=0)

Itn    Dep Loss    NER Loss      UAS    NER P    NER R    NER F    Tag %  Token %  CPU WPS  GPU WPS
---  ----------  ----------  -------  -------  -------  -------  -------  -------  -------  -------
  0       0.000   20994.512    0.000   78.404   77.230   77.813   94.075  100.000    15468        0
  1       0.000   10338.546    0.000   84.808   84.366   84.586   94.812  100.000    15833        0
  2       0.000    7414.531    0.000   86.235   85.931   86.083   95.015  100.000    15839        0
  3       0.000    5461.594    0.000   87.020   86.873   86.946   95.106  100.000    15737        0
  4       0.000    4101.375    0.000   87.669   87.344   87.506   95.182  100.000    15887        0
  5       0.000    3413.915    0.000   87.622   87.327   87.475   95.258  100.000    15919        0
  6       0.000    3008.749    0.000   88.024   87.580   87.802   95.322  100.000    18794       

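Notice how the NER F-score improves over the iterations, from 77.8 at iteration 0 to 87.8 at iteration 6 (with a marginal dip at iteration 5), while the NER loss falls at every step.
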
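
To check the trend in the training log rather than eyeballing it, the NER columns can be copied into a small list and inspected. This is a sketch under the assumption that the values are transcribed by hand from the log above; the variable names are illustrative.

```python
# (iteration, NER P, NER R, NER F) transcribed from the training log above
rows = [
    (0, 78.404, 77.230, 77.813),
    (1, 84.808, 84.366, 84.586),
    (2, 86.235, 85.931, 86.083),
    (3, 87.020, 86.873, 86.946),
    (4, 87.669, 87.344, 87.506),
    (5, 87.622, 87.327, 87.475),
    (6, 88.024, 87.580, 87.802),
]

# Iteration with the highest validation F-score
best = max(rows, key=lambda row: row[3])

# Iterations whose F-score is lower than the previous iteration's
dips = [cur[0] for prev, cur in zip(rows, rows[1:]) if cur[3] < prev[3]]

print("best iteration:", best[0], "NER F:", best[3])
print("iterations with a drop in NER F:", dips)
```

On these numbers the best F-score is at iteration 6, and the only drop is at iteration 5.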
## Evaluating the model with test data set (`spacyNER_data/test.json`)

### On Trained model (`model/model-best`)

In [4]:
# Create a folder to store the output visualizations.
!mkdir result
!python3 -m spacy evaluate model/model-best spacyNER_data/test.json -dp result
# !python -m spacy evaluate model/model-final data/test.txt.json -dp result


Time      3.53 s
Words     46666 
Words/s   13234 
TOK       100.00
POS       94.79 
UAS       0.00  
LAS       0.00  
NER P     78.09 
NER R     78.75 
NER F     78.42 

✔ Generated 25 parses as HTML
result


A visualization of the entity-tagged test data can be seen in `result/entities.html`.
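
The NER P, NER R and NER F figures in the evaluation output are precision, recall and their harmonic mean (the F1 score). A minimal sketch of that relationship, using the rounded values reported for the trained model; `f_score` is a local helper, not a spaCy function.

```python
def f_score(precision, recall):
    """Harmonic mean of precision and recall (F1); returns 0.0 if both are 0."""
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)


# NER P and NER R of the trained model on the test set, as reported above
print(round(f_score(78.09, 78.75), 2))  # 78.42
```

Because the reported precision and recall are already rounded to two decimals, recomputing F from them can differ from the reported value in the last digit.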

### On spaCy's pretrained NER model (`en`)

In [5]:
!mkdir pretrained_result
!python3 -m spacy evaluate en spacyNER_data/test.json -dp pretrained_result


Time      6.52 s
Words     46666 
Words/s   7160  
TOK       100.00
POS       86.84 
UAS       0.00  
LAS       0.00  
NER P     7.97  
NER R     10.68 
NER F     9.12  

✔ Generated 25 parses as HTML
pretrained_result


A visualization of the entity-tagged test data can be seen in `pretrained_result/entities.html`.

The pretrained model's NER scores (F 9.12) are far below those of the model trained on CoNLL (F 78.42). A likely reason is a label mismatch rather than poor entity detection: spaCy's pretrained English models use an OntoNotes-style label set (for example `PERSON` and `GPE`), while the test file uses the CoNLL labels (`PER`, `LOC`, `ORG`, `MISC`), so many predictions cannot match the gold labels.