# Training and Evaluating an NER model with spaCy on the CoNLL dataset

In this notebook, we will use the spaCy command-line interface to train and evaluate an NER model. We will also compare it with the pretrained NER model that ships with spaCy.

Note: we will create several folders during this experiment: `spacyNER_data`, `model` and `result`.

In [2]:
from google.colab import drive
drive.mount('/content/drive')

Mounted at /content/drive


In [3]:
%cd /content/drive/MyDrive/LAP/Subjects/AP2/labs

/content/drive/MyDrive/LAP/Subjects/AP2/labs


## spaCy v3.0

We will install the latest version of spaCy (v3), as it provides a better training workflow and a config system for training custom models. In addition, it features transformer-based pipelines that bring spaCy's accuracy right up to the current state of the art.

In this tutorial we will learn how to use the command-line interface (CLI) together with a config file to train an NER model on the CoNLL-2003 dataset.



In [4]:
!pip install -U spacy

Collecting spacy
  Downloading spacy-3.2.4-cp37-cp37m-manylinux_2_17_x86_64.manylinux2014_x86_64.whl (6.0 MB)
Collecting spacy-loggers<2.0.0,>=1.0.0
  Downloading spacy_loggers-1.0.2-py3-none-any.whl (7.2 kB)
Collecting langcodes<4.0.0,>=3.2.0
  Downloading langcodes-3.3.0-py3-none-any.whl (181 kB)
Collecting pydantic!=1.8,!=1.8.1,<1.9.0,>=1.7.4
  Downloading pydantic-1.8.2-cp37-cp37m-manylinux2014_x86_64.whl (10.1 MB)
Collecting pathy>=0.3.5
  Downloading pathy-0.6.1-py3-none-any.whl (42 kB)
Collecting catalogue<2.1.0,>=2.0.6
  Downloading catalogue-2.0.7-py3-none-any.whl (17 kB)
Collecting typing-extensions<4.0.0.0,>=3.7.4
  Downloading typing_extensions-3.10.0.2-py3-none-any.whl (26 kB)
Collecting spacy-legacy<3.1.0,>=3.0.8
  Downloa

Check you are using the correct spaCy version.

In [5]:
import spacy
print(spacy.__version__)

3.2.4


## Step 1: Converting the data to binary files that spaCy can use

Convert the data from CoNLL format to spaCy's binary format. The `convert` command provides more than one converter and handles several tagging schemes (IOB, BILUO, etc.). See the converter details at https://spacy.io/api/cli#convert
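To make the tagging schemes mentioned above concrete, here is a minimal sketch of how IOB2 tags map onto BILUO tags. It is a plain-Python illustration, not spaCy's own code; `iob2_to_biluo` is a hypothetical helper name, the input is assumed to be well-formed IOB2, and the example tags are made up for illustration. spaCy also exposes its own helper, `spacy.training.iob_to_biluo`, which is what you would normally use.

```python
def iob2_to_biluo(tags):
    """Convert one sentence's IOB2 tags (B-, I-, O) to BILUO (B-, I-, L-, U-, O)."""
    biluo = []
    for i, tag in enumerate(tags):
        if tag == "O":
            biluo.append(tag)
            continue
        prefix, label = tag.split("-", 1)
        next_tag = tags[i + 1] if i + 1 < len(tags) else "O"
        # The entity continues only if the next tag is I- with the same label
        continues = next_tag == f"I-{label}"
        if prefix == "B":
            # A single-token entity becomes U-(nit), otherwise it stays B-(egin)
            biluo.append(f"B-{label}" if continues else f"U-{label}")
        else:
            # An inside token that ends the entity becomes L-(ast)
            biluo.append(f"I-{label}" if continues else f"L-{label}")
    return biluo


# Illustrative tags for: Japan began the defence of their Asian Cup title .
iob_tags = ["B-LOC", "O", "O", "O", "O", "O", "B-MISC", "I-MISC", "O", "O"]
biluo_tags = iob2_to_biluo(iob_tags)
print(biluo_tags)
```

BILUO makes entity boundaries explicit (single-token entities get `U-`, final tokens get `L-`), which is why the gold tags later in this notebook carry `L` and `U` prefixes.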

In [6]:
data_dir="../data"
# Read the CoNLL data from the conll2003 folder and store the converted data in the spacyNER_data folder
!mkdir spacyNER_data
# The line above creates the folder if it doesn't exist. If it does, mkdir prints a message
# saying that it already exists and cannot be created again
!python3 -m spacy convert "../data/conll2003/en/train.txt" spacyNER_data -c ner
!python3 -m spacy convert "../data/conll2003/en/test.txt" spacyNER_data -c ner
!python3 -m spacy convert "../data/conll2003/en/valid.txt" spacyNER_data -c ner

mkdir: cannot create directory ‘spacyNER_data’: File exists
ℹ Auto-detected token-per-line NER format
ℹ Grouping every 1 sentences into a document.
⚠ To generate better training data, you may want to group sentences
into documents with `-n 10`.
✔ Generated output file (14987 documents):
spacyNER_data/train.spacy
ℹ Auto-detected token-per-line NER format
ℹ Grouping every 1 sentences into a document.
⚠ To generate better training data, you may want to group sentences
into documents with `-n 10`.
✔ Generated output file (3684 documents): spacyNER_data/test.spacy
ℹ Auto-detected token-per-line NER format
ℹ Grouping every 1 sentences into a document.
⚠ To generate better training data, you may want to group sentences
into documents with `-n 10`.
✔ Generated output file (3466 documents): spacyNER_data/valid.spacy

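The converter output mentions a "token-per-line NER format" and grouping sentences into documents. The sketch below illustrates both ideas in plain Python. It assumes the usual CoNLL-2003 layout (one token per line, the NER tag in the last column, blank lines between sentences, `-DOCSTART-` lines between articles). The sample text and the helper names `read_conll` and `group_sentences` are made up for illustration and are not part of spaCy.

```python
# Hypothetical sample in the assumed CoNLL-2003 layout: token, POS, chunk, NER tag
SAMPLE = """\
-DOCSTART- -X- -X- O

Japan NNP B-NP B-LOC
began VBD B-VP O

Nadim NNP B-NP B-PER
Ladki NNP I-NP I-PER
"""


def read_conll(text):
    """Return a list of sentences, each a list of (token, ner_tag) pairs."""
    sentences, current = [], []
    for line in text.splitlines():
        line = line.strip()
        # Blank lines and -DOCSTART- markers both end the current sentence
        if not line or line.startswith("-DOCSTART-"):
            if current:
                sentences.append(current)
                current = []
            continue
        columns = line.split()
        current.append((columns[0], columns[-1]))
    if current:
        sentences.append(current)
    return sentences


def group_sentences(sentences, n):
    """Group every n sentences into one document, as the -n option is described to do."""
    return [sentences[i:i + n] for i in range(0, len(sentences), n)]


sentences = read_conll(SAMPLE)
documents = group_sentences(sentences, 10)
print(len(sentences), "sentences in", len(documents), "document(s)")
```

With the default of one sentence per document, the model never sees context beyond a single sentence, which is why the converter suggests `-n 10`.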

## Step 2: Create config file for spaCy

In this step we are going to create the config file that will be used by spaCy.
To get started with the recommended settings for your use case, check out the [quickstart widget](https://spacy.io/usage/training#quickstart) or run the [init config](https://spacy.io/api/cli#init-config) command. 


### Exercise 1: 
1. Create a basic config file with the quickstart widget.
2. Upload it to the `data` folder in your Drive.
3. Modify your basic config if needed. For example, training for 3500 steps is enough.
4. Create the final config file (`config.cfg`) with the `init fill-config` command (see the code below).
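As a rough illustration of step 3, the sketch below edits the number of training steps with Python's standard `configparser`. The config excerpt is hypothetical (a real base config from the quickstart widget has many more sections, and its default values may differ), and editing the file by hand in a text editor works just as well.

```python
import configparser
import io

# Hypothetical excerpt of a base config, for illustration only
BASE_CONFIG = """\
[paths]
train = null
dev = null

[training]
max_steps = 20000
eval_frequency = 200
"""

# interpolation=None leaves values such as ${paths.train} untouched;
# optionxform = str keeps key names case-sensitive (configparser lowercases them by default)
config = configparser.ConfigParser(interpolation=None)
config.optionxform = str
config.read_string(BASE_CONFIG)

# Train for 3500 steps, as suggested in the exercise
config["training"]["max_steps"] = "3500"

# Serialize the edited config back to text
buffer = io.StringIO()
config.write(buffer)
edited_config = buffer.getvalue()
print(edited_config)
```

The same setting can also be overridden at training time by appending `--training.max_steps 3500` to the `spacy train` command, in the same way the training command passes `--paths.train`.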



In [9]:
base_config_path=data_dir+"/base_config.cfg"
!cp "../data/base_config.cfg" "base_config.cfg"

In [10]:
!python -m spacy init fill-config base_config.cfg config.cfg

✔ Auto-filled config with all values
✔ Saved config
config.cfg
You can now add your data and train your pipeline:
python -m spacy train config.cfg --paths.train ./train.spacy --paths.dev ./dev.spacy


## Step 3: Training the NER model with spaCy (CLI)

All the command-line options are listed at https://spacy.io/api/cli#train.
We train an English (`en`) pipeline with spaCy's `train` command, and the results are stored in a folder called `model` (created during training). Our training file is `spacyNER_data/train.spacy` and the validation file is `spacyNER_data/valid.spacy`.



In [11]:
!python -m spacy train config.cfg --output ./model --paths.train spacyNER_data/train.spacy --paths.dev spacyNER_data/valid.spacy

✔ Created output directory: model
ℹ Saving to output directory: model
ℹ Using CPU
ℹ To switch to GPU 0, use the option: --gpu-id 0
[2022-04-27 17:08:49,274] [INFO] Set up nlp object from config
[2022-04-27 17:08:49,287] [INFO] Pipeline: ['tok2vec', 'ner']
[2022-04-27 17:08:49,292] [INFO] Created vocabulary
[2022-04-27 17:08:49,293] [INFO] Finished initializing nlp object
[2022-04-27 17:09:13,313] [INFO] Initialized pipeline components: ['tok2vec', 'ner']
✔ Initialized pipeline
ℹ Pipeline: ['tok2vec', 'ner']
ℹ Initial learn rate: 0.001
E    #       LOSS TOK2VEC  LOSS NER  ENTS_F  ENTS_P  ENTS_R  SCORE 
---  ------  ------------  --------  ------  ------  ------  ------
  0       0          0.00     50.06    0.07    0.09    0.05    0.00
  0     200        169.17   3375.96   52.52   56.31   49.21    0.53
  0     400        321.33   2385.54   59.44   59.92   58.97    0.59
  0     600       

Notice how the performance improves as training proceeds: in the rows shown, ENTS_F on the validation set rises from 0.07 at step 0 to 59.44 at step 400.
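The `SCORE` column in the training log appears to be a weighted combination of the evaluation metrics, configured in the `[training.score_weights]` section of the config. A minimal sketch, assuming all the weight sits on `ents_f` (an assumption that is consistent with the logged rows, not a value read from this notebook's `config.cfg`):

```python
# Assumed weights: all weight on the entity F-score
# (check [training.score_weights] in your own config.cfg)
score_weights = {"ents_f": 1.0, "ents_p": 0.0, "ents_r": 0.0}

# (step, metrics in percent) copied from the training log
logged_rows = [
    (0, {"ents_f": 0.07, "ents_p": 0.09, "ents_r": 0.05}),
    (200, {"ents_f": 52.52, "ents_p": 56.31, "ents_r": 49.21}),
    (400, {"ents_f": 59.44, "ents_p": 59.92, "ents_r": 58.97}),
]


def weighted_score(metrics, weights):
    """Weighted sum of the metrics, rescaled from percentages to the 0-1 range."""
    return sum(weights[name] * metrics[name] / 100 for name in weights)


scores = [round(weighted_score(metrics, score_weights), 2) for _, metrics in logged_rows]
for (step, _), score in zip(logged_rows, scores):
    print(step, score)
```

Under that assumption the sketch reproduces the logged `SCORE` values for steps 0, 200 and 400, so the model saved as `model-best` would be the checkpoint with the highest validation F-score.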


## Step 4: Evaluating the model on the test set

In [12]:
# Create a folder to store the displaCy visualizations
!mkdir result
# -o writes the metrics to a JSON file named "results"; -dp writes the rendered parses to the "result" folder
!python3 -m spacy evaluate model/model-best spacyNER_data/test.spacy -o results -dp result

ℹ Using CPU
ℹ To switch to GPU 0, use the option: --gpu-id 0

TOK     -    
NER P   80.75
NER R   79.60
NER F   80.17
SPEED   22389

           P       R       F
PER    78.71   86.21   82.29
ORG    80.69   69.96   74.94
LOC    86.26   85.43   85.84
MISC   73.05   73.36   73.21

✔ Generated 25 parses as HTML
result
✔ Saved results to results

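The F column in the evaluation output is the harmonic mean of precision and recall, F = 2PR / (P + R). A small sketch that recomputes it from the reported P and R values; small differences in the last digit are expected because the inputs are already rounded to two decimals.

```python
def f_score(precision, recall):
    """Harmonic mean of precision and recall (F1)."""
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)


# (P, R, F) copied from the evaluation output
reported = {
    "overall": (80.75, 79.60, 80.17),
    "PER": (78.71, 86.21, 82.29),
    "ORG": (80.69, 69.96, 74.94),
    "LOC": (86.26, 85.43, 85.84),
    "MISC": (73.05, 73.36, 73.21),
}

for name, (p, r, f) in reported.items():
    print(f"{name:8s} reported F = {f:.2f}   recomputed F = {f_score(p, r):.2f}")
```

Reading the table this way shows where the model is weakest: ORG has good precision but the lowest recall, so many organisations are missed rather than mislabelled.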

In [13]:
# Save the model for later (the model folder is in the current working directory, not in /content)
!cp -r model "../data/"


### Exercise 2:
- Explore different training configuration options. You can take advantage of the GPUs provided by Google Colab to train a transformer-based model. See https://spacy.io/usage/training#quickstart

## Load your own model to use in Python code

In [14]:
import spacy
nlp = spacy.load("model/model-best")

In [15]:
nlp.pipe_names

['tok2vec', 'ner']

In [16]:
text = "Japan began the defence of their Asian Cup title."
doc = nlp(text)

print(doc.text)
for entity in doc.ents:
    print(entity.text, entity.label_)

Japan began the defence of their Asian Cup title.
Japan LOC
Asian Cup MISC


## Compare with the pretrained model

As you know, spaCy provides various pretrained pipelines. One of the most widely used is `en_core_web_sm`, which also includes an NER component in its pipeline.


### Exercise 3

 1. Download and load `en_core_web_sm` into your code
 2. Process some texts with the model
 3. What are the differences between your custom model and the pretrained one?

In [17]:
!python -m spacy download en_core_web_sm

Collecting en-core-web-sm==3.2.0
  Downloading https://github.com/explosion/spacy-models/releases/download/en_core_web_sm-3.2.0/en_core_web_sm-3.2.0-py3-none-any.whl (13.9 MB)
Installing collected packages: en-core-web-sm
  Attempting uninstall: en-core-web-sm
    Found existing installation: en-core-web-sm 2.2.5
    Uninstalling en-core-web-sm-2.2.5:
      Successfully uninstalled en-core-web-sm-2.2.5
Successfully installed en-core-web-sm-3.2.0
✔ Download and installation successful
You can now load the package via spacy.load('en_core_web_sm')


In [18]:
nlp_pre = spacy.load("en_core_web_sm")
doc = nlp_pre(text)

for entity in doc.ents:
    print(entity.text, entity.label_)

Japan GPE
Asian NORP
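The two models use different label sets (the custom model predicts CoNLL labels such as `LOC` and `MISC`, while `en_core_web_sm` predicts labels such as `GPE` and `NORP`), so their outputs cannot be compared string for string. A rough sketch of one way to compare them, using the predictions printed above. The label mapping `TO_CONLL` is a hypothetical, lossy simplification chosen for illustration, not an official correspondence.

```python
# Predictions copied from the two cells above, as (text, label) pairs
custom_ents = {("Japan", "LOC"), ("Asian Cup", "MISC")}
pretrained_ents = {("Japan", "GPE"), ("Asian", "NORP")}

# Hypothetical, lossy mapping from the pretrained model's labels to CoNLL labels;
# labels without a CoNLL counterpart (dates, numbers, ...) are dropped
TO_CONLL = {
    "PERSON": "PER",
    "ORG": "ORG",
    "GPE": "LOC",
    "LOC": "LOC",
    "NORP": "MISC",
    "EVENT": "MISC",
}

mapped_ents = {
    (text, TO_CONLL[label]) for text, label in pretrained_ents if label in TO_CONLL
}

agreed = custom_ents & mapped_ents
only_custom = custom_ents - mapped_ents
only_pretrained = mapped_ents - custom_ents

print("Both models:    ", sorted(agreed))
print("Custom only:    ", sorted(only_custom))
print("Pretrained only:", sorted(only_pretrained))
```

Even after mapping the labels, the models disagree on span boundaries here: the custom model tags "Asian Cup" as one entity, while the pretrained model tags only "Asian".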


## Visualization


More information: https://spacy.io/usage/visualizers

In [20]:
from IPython.display import display, HTML
from spacy.training import Corpus
from spacy import displacy

# Read the gold-standard test set as a list of Example objects
reader = Corpus('spacyNER_data/test.spacy')
test_data = reader(nlp)
test_golds = list(test_data)
# Raw text of each example
test_texts = [example.text for example in test_golds]

In [21]:
sent_i = 40
test_gold = test_golds[sent_i]
test_text = test_texts[sent_i]

Show gold annotations

In [22]:
# Gold NER tags (BILUO scheme) paired with their token index, skipping non-entity tokens
gold = test_gold.to_dict()
entities = [(i, tag) for i, tag in enumerate(gold['doc_annotation']['entities']) if tag not in ("O", "-")]
tokens = gold['token_annotation']['ORTH']
ents = []
for i, tag in entities:
    prefix, label = tag.split('-', 1)
    # Character offsets of token i in the space-joined text
    end = len(' '.join(tokens[0:(i+1)]))
    start = end - len(tokens[i])
    if prefix == 'I' or prefix == 'L':
        ents[-1]['end'] = end
    if prefix == 'B' or prefix == 'U':
        ents.append({'start': start, 'end': end, 'label': label})

ex = [{'text': ' '.join(tokens), 'ents': ents, 'title': None}]
# jupyter=False makes render() return the HTML string instead of displaying it itself
html = displacy.render(ex, style="ent", manual=True, jupyter=False)
display(HTML(html))
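Manual displaCy input expects character offsets, while the gold tags are given per token. The sketch below isolates that conversion as a standalone function so it can be checked without loading any data. It assumes BILUO tags and tokens joined by single spaces; `biluo_to_char_spans` is a hypothetical helper name, not a spaCy function, and the example tags are made up for illustration.

```python
def biluo_to_char_spans(tokens, tags):
    """Turn per-token BILUO tags into character-offset spans over ' '.join(tokens)."""
    spans = []
    offset = 0
    for token, tag in zip(tokens, tags):
        start, end = offset, offset + len(token)
        offset = end + 1  # skip the single joining space
        if tag in ("O", "-", ""):
            continue
        prefix, label = tag.split("-", 1)
        if prefix in ("B", "U"):
            # A new entity starts at this token
            spans.append({"start": start, "end": end, "label": label})
        elif prefix in ("I", "L") and spans:
            # Extend the most recent entity to cover this token
            spans[-1]["end"] = end
    return spans


# Illustrative tokens and tags (not taken from the test set)
tokens = ["Japan", "began", "the", "defence", "of", "their", "Asian", "Cup", "title", "."]
tags = ["U-LOC", "O", "O", "O", "O", "O", "B-MISC", "L-MISC", "O", "O"]

text = " ".join(tokens)
spans = biluo_to_char_spans(tokens, tags)
for span in spans:
    print(repr(text[span["start"]:span["end"]]), span["label"])
```

Slicing the joined text with each span is a quick way to confirm that the offsets cover exactly the entity text, with no leading or trailing space.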

Show your model's predictions

In [23]:
doc = nlp(test_text)
# jupyter=False returns the markup as a string, which we then display explicitly
html = displacy.render(doc, style="ent", jupyter=False)
display(HTML(html))

<IPython.core.display.HTML object>

Show pretrained model's predictions

In [24]:
doc = nlp_pre(test_text)
# jupyter=False returns the markup as a string, which we then display explicitly
html = displacy.render(doc, style="ent", jupyter=False)
display(HTML(html))

<IPython.core.display.HTML object>