# Training and Evaluating an NER model with spaCy on the CoNLL dataset

In this notebook, we use the spaCy command-line interface to train and evaluate an NER model. We also compare it with spaCy's pretrained NER model.

Note: this experiment creates several folders along the way: `data/spacyNER_data` for the converted data and `output` for the trained model and the visualizations.

## Step 1: Converting the data to JSON structures that spaCy can use

In [1]:
import os

In [2]:
# upload train.txt, test.txt, valid.txt from Data/conll2003/en
# try:
#     from google.colab import files
#     uploaded = files.upload()
# except ModuleNotFoundError:
#     print('Not using colab')

In [3]:
# Read the CoNLL data from the conll2003 folder and store the converted data in the spacyNER_data folder.

# !mkdir spacyNER_data
# os.mkdir('data/spacyNER_data')

# The lines above create the folder if it doesn't exist. If it does, the output shows a message
# that it already exists and cannot be created again.
# try:
#     import google.colab 
#     !python -m spacy convert "train.txt" spacyNER_data -c ner
#     !python -m spacy convert "test.txt" spacyNER_data -c ner
#     !python -m spacy convert "valid.txt" spacyNER_data -c ner
# except ModuleNotFoundError:
#     !python -m spacy convert "Data/conll2003/en/train.txt" spacyNER_data -c ner
#     !python -m spacy convert "Data/conll2003/en/test.txt" spacyNER_data -c ner
#     !python -m spacy convert "Data/conll2003/en/valid.txt" spacyNER_data -c ner


#### For example, the data before and after running spaCy's convert program look as follows.

In [8]:
# try:
#     import google.colab
#     !echo "BEFORE : (train.txt)"
#     !head "train.txt" -n 11 | tail -n 9
# except ModuleNotFoundError:
#     print("BEFORE : (Data/conll2003/en/train.txt)")
#     file = open("Data/conll2003/en/train.txt")
#     content = file.readlines()
#     print(*content[1:11])

print("BEFORE : (Data/conll2003/train.txt)")
# Read the file with a context manager so the handle is closed afterwards.
with open("Data/conll2003/train.txt") as file:
    content = file.readlines()
print(*content[1:11])


BEFORE : (Data/conll2003/train.txt)

 EU NNP B-NP B-ORG
 rejects VBZ B-VP O
 German JJ B-NP B-MISC
 call NN I-NP O
 to TO B-VP O
 boycott VB I-VP O
 British JJ B-NP B-MISC
 lamb NN I-NP O
 . . O O
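
The four columns above are the token, its POS tag, its chunk tag, and its NER tag in IOB notation. The converted file shown in the next cell stores the NER tags in BILUO notation instead, which is why `B-ORG` on `EU` turns into `U-ORG` (a single-token entity). The following is a minimal sketch of that tag conversion, not spaCy's own converter; the function names `read_conll` and `iob_to_biluo` are illustrative, and the handling of `-DOCSTART-` lines assumes the usual CoNLL-2003 file layout.

```python
# Sample lines copied from the BEFORE output: token, POS tag, chunk tag, NER tag.
SAMPLE = """\
EU NNP B-NP B-ORG
rejects VBZ B-VP O
German JJ B-NP B-MISC
call NN I-NP O
to TO B-VP O
boycott VB I-VP O
British JJ B-NP B-MISC
lamb NN I-NP O
. . O O
"""


def read_conll(text):
    """Split CoNLL-style text into sentences of (token, pos, ner) tuples."""
    sentences, current = [], []
    for line in text.splitlines():
        line = line.strip()
        # Assumption: a blank line or a -DOCSTART- marker ends the current sentence.
        if not line or line.startswith("-DOCSTART-"):
            if current:
                sentences.append(current)
                current = []
            continue
        columns = line.split()
        current.append((columns[0], columns[1], columns[-1]))
    if current:
        sentences.append(current)
    return sentences


def iob_to_biluo(tags):
    """Convert IOB2 tags to BILUO: lone entities become U-, entity ends become L-."""
    biluo = []
    for i, tag in enumerate(tags):
        if tag == "O":
            biluo.append(tag)
            continue
        prefix, label = tag.split("-", 1)
        next_tag = tags[i + 1] if i + 1 < len(tags) else "O"
        continues = next_tag == "I-" + label
        if prefix == "B":
            biluo.append(("B-" if continues else "U-") + label)
        else:  # prefix == "I"
            biluo.append(("I-" if continues else "L-") + label)
    return biluo


sentence = read_conll(SAMPLE)[0]
tokens = [token for token, _, _ in sentence]
ner_biluo = iob_to_biluo([ner for _, _, ner in sentence])
for token, tag in zip(tokens, ner_biluo):
    print(token, tag)
```

Multi-token entities keep their `B-` start and `I-` middle tags and gain an `L-` tag on the last token.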



In [9]:
# try:
#     import google.colab
#     !echo "AFTER : (spacyNER_data/train.json)"
#     !head "spacyNER_data/train.json" -n 77 | tail -n 58
# except ModuleNotFoundError:
#     print("AFTER : (spacyNER_data/train.json)")
#     f = open('spacyNER_data/train.json')
#     content = f.readlines()
#     print(*content[19:77])

print("AFTER : (data/spacyNER_data/train.json)")
# Read the converted file with a context manager so the handle is closed afterwards.
with open('data/spacyNER_data/train.json') as f:
    content = f.readlines()
print(*content[19:77])

AFTER : (data/spacyNER_data/train.json)
            ]
           }
         ],
         "cats":[
 
         ],
         "entities":[
 
         ],
         "links":[
 
         ]
       },
       {
         "raw":null,
         "sentences":[
           {
             "tokens":[
               {
                 "id":0,
                 "orth":"EU",
                 "space":" ",
                 "tag":"NNP",
                 "ner":"U-ORG"
               },
               {
                 "id":1,
                 "orth":"rejects",
                 "space":" ",
                 "tag":"VBZ",
                 "ner":"O"
               },
               {
                 "id":2,
                 "orth":"German",
                 "space":" ",
                 "tag":"JJ",
                 "ner":"U-MISC"
               },
               {
                 "id":3,
                 "orth":"call",
                 "space":" ",
                 "tag":"NN",
                 "ner":"O"
             

## Training the NER model with spaCy (CLI)

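All the command-line options are listed at: https://spacy.io/api/cli#train
We train with spaCy's `train` command, driven by the config file `data/conll2003/config.cfg`. The results are stored in the folder `output/model` (created during training). The training file is `data/spacyNER_data/train.spacy` and the validation file is `data/spacyNER_data/valid.spacy`.

`--output` sets the folder in which the trained pipeline is saved.
`--paths.train` and `--paths.dev` override the corresponding `paths` settings in the config file.
The language and the pipeline components are set in the config file rather than on the command line. As the training log shows, a tagger and an NER component are trained simultaneously, together with a `tok2vec` component. The `-G` and `-p` flags of older spaCy versions are not used here; in this CLI a GPU is selected with `--gpu-id` (`-g`).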
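
The train command in the next cell passes dotted options such as `--paths.train`, which replace values in the config file. The sketch below illustrates how such dotted overrides can map onto a nested configuration. It is not spaCy's implementation: `parse_overrides`, `apply_overrides`, and the `config` dictionary are hypothetical stand-ins, since the contents of `config.cfg` are not shown in this notebook.

```python
def parse_overrides(args):
    """Turn ['--paths.train', 'x', ...] into {'paths': {'train': 'x'}}."""
    overrides = {}
    it = iter(args)
    for flag in it:
        # Only dotted long options are treated as config overrides here.
        if not flag.startswith("--") or "." not in flag:
            continue
        value = next(it)
        section, key = flag[2:].split(".", 1)
        overrides.setdefault(section, {})[key] = value
    return overrides


def apply_overrides(config, overrides):
    """Return a copy of config with the override values merged in."""
    merged = {section: dict(values) for section, values in config.items()}
    for section, values in overrides.items():
        merged.setdefault(section, {}).update(values)
    return merged


# Hypothetical stand-in for the relevant part of a config file.
config = {"paths": {"train": None, "dev": None}}

args = [
    "--output", "./output/model",
    "--paths.train", "data/spacyNER_data/train.spacy",
    "--paths.dev", "data/spacyNER_data/valid.spacy",
]

merged = apply_overrides(config, parse_overrides(args))
print(merged["paths"])
```

Options without a dot, such as `--output`, are ordinary command-line arguments and are skipped by this sketch.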

In [44]:
# !python -m spacy train data/conll2003/config.cfg --output model.spacy
# !python -m spacy train data/conll2003/config.cfg --output-path output/model --train_corpus data/spacyNER_data/train.json --dev_corpus data/spacyNER_data/valid.json -G -p tagger,ner
# !python -m spacy train data/conll2003/config.cfg --output-path output/model
!python -m spacy train data/conll2003/config.cfg --output ./output/model --paths.train data/spacyNER_data/train.spacy --paths.dev data/spacyNER_data/valid.spacy

ℹ Saving to output directory: output\model
ℹ Using CPU

✔ Initialized pipeline

ℹ Pipeline: ['tok2vec', 'tagger', 'ner']
ℹ Initial learn rate: 0.001
E    #       LOSS TOK2VEC  LOSS TAGGER  LOSS NER  TAG_ACC  ENTS_F  ENTS_P  ENTS_R  SCORE 
---  ------  ------------  -----------  --------  -------  ------  ------  ------  ------
  0       0          0.00        78.39     50.06    23.31    0.97    1.03    0.93    0.12
  0     200        370.71      6973.76   3296.08    87.73   49.34   49.50   49.18    0.69
  0     400       1510.72      3615.81   3009.21    91.64   62.71   62.61   62.81    0.77
  0     600        910.24      3537.96   2306.39    92.85   69.41   71.00   67.89    0.81
  0     800        901.83      3582.69   2268.20    93.61   78.32   79.45   77.23    0.86
  0    1000       1193.78      4128.27   2544.55    94.02   78.49   78.99   78.00    0.86
  1    1200       1312.34      4455.21   2543.85    94.28 

[2023-07-28 11:54:28,458] [INFO] Set up nlp object from config
[2023-07-28 11:54:28,470] [INFO] Pipeline: ['tok2vec', 'tagger', 'ner']
[2023-07-28 11:54:28,471] [INFO] Created vocabulary
[2023-07-28 11:54:28,473] [INFO] Finished initializing nlp object
[2023-07-28 11:54:54,064] [INFO] Initialized pipeline components: ['tok2vec', 'tagger', 'ner']
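
The SCORE column in the training table combines the individual metrics into one number. The complete rows are consistent with SCORE being the mean of TAG_ACC and ENTS_F divided by 100. That equal weighting is an assumption inferred from the numbers, because the config file that defines the weights is not shown here. A small sketch that parses the complete rows and checks this:

```python
# Complete rows copied from the training log (the truncated last row is left out).
LOG = """\
  0       0          0.00        78.39     50.06    23.31    0.97    1.03    0.93    0.12
  0     200        370.71      6973.76   3296.08    87.73   49.34   49.50   49.18    0.69
  0     400       1510.72      3615.81   3009.21    91.64   62.71   62.61   62.81    0.77
  0     600        910.24      3537.96   2306.39    92.85   69.41   71.00   67.89    0.81
  0     800        901.83      3582.69   2268.20    93.61   78.32   79.45   77.23    0.86
  0    1000       1193.78      4128.27   2544.55    94.02   78.49   78.99   78.00    0.86
"""

COLUMNS = [
    "epoch", "step", "loss_tok2vec", "loss_tagger", "loss_ner",
    "tag_acc", "ents_f", "ents_p", "ents_r", "score",
]

rows = [
    dict(zip(COLUMNS, map(float, line.split())))
    for line in LOG.splitlines()
    if line.strip()
]

for row in rows:
    # Assumed weighting: TAG_ACC and ENTS_F count equally.
    combined = (row["tag_acc"] + row["ents_f"]) / 2 / 100
    print(
        f'step {int(row["step"]):>4}  ENTS_F={row["ents_f"]:6.2f}  '
        f'SCORE={row["score"]:.2f}  mean/100={combined:.2f}'
    )
```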


Notice how the scores improve as training progresses: ENTS_F climbs from 0.97 at step 0 to 78.49 at step 1000.

## Evaluating the model on the test data set (`data/spacyNER_data/test.spacy`)

### On the trained model (`output/model/model-best`)

In [47]:
# Create a folder to store the output and visualizations.
# !mkdir result
# os.mkdir('result')
!python -m spacy evaluate output/model/model-best data/spacyNER_data/test.spacy -dp output
# !python -m spacy evaluate model/model-final data/test.txt.json -dp result

ℹ Using CPU

TOK     -    
TAG     95.08
NER P   81.59
NER R   81.13
NER F   81.36
SPEED   29291

           P       R       F
LOC    87.50   84.35   85.90
PER    85.85   84.42   85.13
ORG    74.80   77.54   76.15
MISC   75.00   74.36   74.68

✔ Generated 25 parses as HTML
output
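
In these tables, P is precision, R is recall, and F is their harmonic mean, F = 2PR / (P + R). The sketch below recomputes F from the reported P and R values as a sanity check. The helper name `f_score` is illustrative, and small differences in the last digit are expected because the inputs are already rounded.

```python
def f_score(precision, recall):
    """Harmonic mean of precision and recall; 0.0 when both are zero."""
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)


# (P, R, F) values copied from the evaluation output of the trained model.
REPORTED = {
    "overall": (81.59, 81.13, 81.36),
    "LOC": (87.50, 84.35, 85.90),
    "PER": (85.85, 84.42, 85.13),
    "ORG": (74.80, 77.54, 76.15),
    "MISC": (75.00, 74.36, 74.68),
}

for name, (p, r, f) in REPORTED.items():
    print(f"{name:8s} reported F={f:.2f}  recomputed F={f_score(p, r):.2f}")
```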




A visualization of the entity-tagged test data is written to `output/entities.html`.

### On spaCy's pretrained NER model (`en_core_web_sm`)

In [48]:
# !python -m spacy download en_core_web_sm

In [51]:
# !mkdir pretrained_result
# os.mkdir('pretrained_result')
!python -m spacy evaluate en_core_web_sm data/spacyNER_data/test.spacy -dp output

ℹ Using CPU

TOK      -    
TAG      86.59
POS      -    
MORPH    -    
LEMMA    -    
UAS      -    
LAS      -    
NER P    6.36 
NER R    9.60 
NER F    7.65 
SENT P   93.75
SENT R   96.96
SENT F   95.33
SPEED    10154

                  P       R       F
ORG           42.87   30.40   35.58
LOC           48.05    2.22    4.24
PER            0.00    0.00    0.00
GPE            0.00    0.00    0.00
EVENT          0.00    0.00    0.00
DATE           0.00    0.00    0.00
MISC           0.00    0.00    0.00
ORDINAL        0.00    0.00    0.00
TIME           0.00    0.00    0.00
NORP           0.00    0.00    0.00
PERSON         0.00    0.00    0.00
CARDINAL       0.00    0.00    0.00
LAW            0.00    0.00    0.00
MONEY          0.00    0.00    0.00
QUANTITY       0.00    0.00    0.00
PERCENT        0.00    0.00    0.00
WORK_OF_ART    0.00    0.00    0.00
PRODUCT        0.00    0.00    0.00
FAC            0.00    0.00    0.00
LANGUAGE       0.00    0.00    0



A visualization of the entity-tagged test data is again written to `output/entities.html`, because this command also passes `-dp output`.
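
The per-type table for the pretrained model lists both `PER` and `PERSON`, and both `LOC` and `GPE`. This suggests that much of the low score comes from a label mismatch: the test data uses the four CoNLL labels, while `en_core_web_sm` predicts a larger label set, so a correctly found span with a different label name counts as an error. The sketch below shows the effect on toy data. The mapping `ONTONOTES_TO_CONLL` is one plausible choice and an assumption, the predictions are made up for illustration, and `remap_entities` and `score` are hypothetical helpers rather than part of spaCy.

```python
# Assumed mapping from the pretrained model's labels to the CoNLL labels.
# Labels without a CoNLL counterpart (DATE, MONEY, ...) are left out and dropped.
ONTONOTES_TO_CONLL = {
    "PERSON": "PER",
    "ORG": "ORG",
    "GPE": "LOC",
    "LOC": "LOC",
    "FAC": "LOC",
    "NORP": "MISC",
    "EVENT": "MISC",
    "LANGUAGE": "MISC",
    "PRODUCT": "MISC",
    "WORK_OF_ART": "MISC",
    "LAW": "MISC",
}


def remap_entities(entities):
    """Rename (start, end, label) spans to CoNLL labels, dropping unmapped ones."""
    return [
        (start, end, ONTONOTES_TO_CONLL[label])
        for start, end, label in entities
        if label in ONTONOTES_TO_CONLL
    ]


def score(gold, predicted):
    """Exact-match precision, recall and F over (start, end, label) spans."""
    gold_set, pred_set = set(gold), set(predicted)
    true_positives = len(gold_set & pred_set)
    precision = true_positives / len(pred_set) if pred_set else 0.0
    recall = true_positives / len(gold_set) if gold_set else 0.0
    total = precision + recall
    f = 2 * precision * recall / total if total else 0.0
    return precision, recall, f


# Gold spans for "EU rejects German call to boycott British lamb ." (token offsets).
gold = [(0, 1, "ORG"), (2, 3, "MISC"), (6, 7, "MISC")]
# Made-up predictions in the pretrained model's label set, for illustration only.
predicted = [(0, 1, "ORG"), (2, 3, "NORP"), (6, 7, "NORP")]

print("raw labels:     ", score(gold, predicted))
print("remapped labels:", score(gold, remap_entities(predicted)))
```

Remapping only removes the naming mismatch. Differences in annotation guidelines between the two label sets would still lower the pretrained model's score on this test set.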