# Training and Evaluating an NER model with spaCy on the CoNLL dataset

In this notebook, we will use the spaCy command-line interface to train and evaluate an NER model. We will also compare it with the pretrained NER model that ships with spaCy.

Note: this notebook creates several folders as it runs, such as `spacyNER_data`, `model` and `result`.

In [None]:
from google.colab import drive
drive.mount('/content/drive')

## spaCy v3.0

We will install the latest version of spaCy (v3), as it provides a better training workflow and a config system for training custom models. In addition, it features transformer-based pipelines that bring spaCy’s accuracy right up to the current state-of-the-art.

In this tutorial we will learn how to use the command-line interface (CLI) together with a config file to train an NER model on the CoNLL-2003 dataset.



In [None]:
!pip install -U spacy

Check you are using the correct spaCy version.

In [None]:
import spacy
print(spacy.__version__)

## Step 1: Converting data to binary files so it can be used by spaCy

Convert the data from CoNLL format to spaCy's binary `.spacy` format. The `convert` command provides more than one converter; here we use `ner` (`-c ner`), which reads NER data tagged with IOB or BILUO tags. See the converter details at https://spacy.io/api/cli#convert

In [None]:
data_dir="drive/MyDrive/Colab Notebooks/nlp-app-II/data"
#Read the CONLL data from conll2003 folder, and store the formatted data into a folder spacyNER_data
!mkdir spacyNER_data
#The line above creates the folder if it doesn't exist. If it does, the output shows a message that it
#already exists and cannot be created again
!python3 -m spacy convert "drive/MyDrive/Colab Notebooks/nlp-app-II/data/conll2003/en/train.txt" spacyNER_data -c ner
!python3 -m spacy convert "drive/MyDrive/Colab Notebooks/nlp-app-II/data/conll2003/en/test.txt" spacyNER_data -c ner
!python3 -m spacy convert "drive/MyDrive/Colab Notebooks/nlp-app-II/data/conll2003/en/valid.txt" spacyNER_data -c ner
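To make the input format concrete, here is a minimal sketch of how a CoNLL-2003 style file is laid out: one token per line, columns separated by whitespace, the NER tag in the last column, and a blank line between sentences. The `SAMPLE` text and the `read_conll` helper are illustrative assumptions, not part of spaCy, and the tagging scheme in your copy of the data may differ.

```python
from typing import List, Tuple

# Illustrative sample in CoNLL-2003 layout: token, POS tag, chunk tag, NER tag.
SAMPLE = """\
-DOCSTART- -X- -X- O

EU NNP B-NP B-ORG
rejects VBZ B-VP O
German JJ B-NP B-MISC
call NN I-NP O

Peter NNP B-NP B-PER
Blackburn NNP I-NP I-PER
"""


def read_conll(text: str) -> List[List[Tuple[str, str]]]:
    """Split CoNLL text into sentences of (token, NER tag) pairs."""
    sentences: List[List[Tuple[str, str]]] = []
    current: List[Tuple[str, str]] = []
    for line in text.splitlines():
        line = line.strip()
        if not line or line.startswith("-DOCSTART-"):
            # A blank line or a document marker ends the current sentence.
            if current:
                sentences.append(current)
                current = []
            continue
        columns = line.split()
        # First column is the token, last column is the NER tag.
        current.append((columns[0], columns[-1]))
    if current:
        sentences.append(current)
    return sentences


sentences = read_conll(SAMPLE)
for sentence in sentences:
    print(sentence)
```

Running a helper like this on a few lines of `train.txt` is a quick way to check that the file has the layout the `ner` converter expects before converting it.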

## Step 2: Create config file for spaCy

In this step we are going to create the config file that will be used by spaCy.
To get started with the recommended settings for your use case, check out the [quickstart widget](https://spacy.io/usage/training#quickstart) or run the [init config](https://spacy.io/api/cli#init-config) command. 


### Exercise 1: 
1. Create a basic config file with the quickstart widget. 
2. Upload it to the `data` folder in your drive. 
3. Modify your basic config if needed. For example, training for 3500 steps is enough. 
4. Create the final config file (`config.cfg`) with the `init fill-config` command (see the code below).



In [None]:
base_config_path=data_dir+"/base_config.cfg"
#Copy the base config from the drive into the working directory
!cp "{base_config_path}" "base_config.cfg"

In [None]:
!python -m spacy init fill-config base_config.cfg config.cfg
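Before training, it can help to confirm that the change from step 3 of the exercise made it into the config. The sketch below reads the `[training]` section with Python's standard `configparser`. This is an assumption made for illustration: spaCy parses its configs with its own config system, which remains the authoritative reader. `CONFIG_TEXT` is a hypothetical fragment, and `training_settings` is a helper invented for this example.

```python
import configparser

# Hypothetical fragment of a spaCy config; the values are for illustration only.
CONFIG_TEXT = """
[paths]
train = null
dev = null

[training]
max_steps = 3500
eval_frequency = 200

[training.optimizer]
@optimizers = "Adam.v1"
"""


def training_settings(config_text: str) -> dict:
    """Return the integer-valued settings of the [training] section."""
    # interpolation=None keeps configparser from tripping over spaCy's
    # ${section.key} variables, which appear in filled configs.
    parser = configparser.ConfigParser(interpolation=None)
    parser.read_string(config_text)
    settings = {}
    for key, value in parser["training"].items():
        if value.isdigit():
            settings[key] = int(value)
    return settings


settings = training_settings(CONFIG_TEXT)
print(settings)
```

To inspect the real file, you could pass the contents of `config.cfg` to the same helper, for example `training_settings(open("config.cfg").read())`.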

## Step 3: Training the NER model with spaCy (CLI)

All the command-line options are listed at https://spacy.io/api/cli#train.
We train with spaCy's `train` command, using the settings in `config.cfg` for an English (`en`) pipeline. The results are stored in a folder 
called `model` (created during training). Our training file is `spacyNER_data/train.spacy` and the validation file is `spacyNER_data/valid.spacy`. 



In [None]:
!python -m spacy train config.cfg --output ./model --paths.train spacyNER_data/train.spacy --paths.dev spacyNER_data/valid.spacy

Notice in the training log how the scores on the validation set (for example, the `ENTS_F` column) tend to improve as training progresses!


## Step 4: Evaluating the model on the test set

In [None]:
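#Create a folder to store the visualizations.
!mkdir result
#-o writes the metrics as JSON to the file "results"; -dp writes displaCy HTML renderings into the "result" folder
!python3 -m spacy evaluate model/model-best spacyNER_data/test.spacy -o results -dp result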
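The NER scores reported by `evaluate` are entity-level precision, recall and F-score (`ents_p`, `ents_r`, `ents_f`): a predicted entity counts as correct only if its boundaries and its label both match the gold annotation. The sketch below illustrates that idea on hand-written spans. The `prf` helper and the gold and predicted spans are hypothetical, and this is not spaCy's own scorer.

```python
from typing import Set, Tuple

Span = Tuple[int, int, str]  # (start character, end character, label)


def prf(gold: Set[Span], pred: Set[Span]) -> Tuple[float, float, float]:
    """Entity-level precision, recall and F1 with exact span and label match."""
    true_positives = len(gold & pred)
    precision = true_positives / len(pred) if pred else 0.0
    recall = true_positives / len(gold) if gold else 0.0
    if precision + recall == 0:
        return precision, recall, 0.0
    f1 = 2 * precision * recall / (precision + recall)
    return precision, recall, f1


# Hypothetical annotations for "Japan began the defence of their Asian Cup title."
gold = {(0, 5, "LOC"), (33, 42, "MISC")}  # "Japan", "Asian Cup"
pred = {(0, 5, "LOC"), (33, 38, "MISC")}  # "Japan", "Asian" (wrong boundary)

precision, recall, f1 = prf(gold, pred)
print(precision, recall, f1)
```

Note that the partially overlapping prediction "Asian" earns no credit, which is why boundary errors hurt both precision and recall.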

In [None]:
# save model for later
!cp -r /content/model "drive/MyDrive/Colab Notebooks/nlp-app-II/data/"

### Exercise 2:
- Explore different training configuration options. You can take advantage of the GPUs provided by Google Colab and train a transformer-based model. See https://spacy.io/usage/training#quickstart

## Load your own model to use in Python code

In [None]:
import spacy
nlp = spacy.load("model/model-best")

In [None]:
nlp.pipe_names

In [None]:
text = "Japan began the defence of their Asian Cup title."
doc = nlp(text)

print(doc.text)
for entity in doc.ents:
    print(entity.text, entity.label_)

## Compare with the pretrained model

As you know, spaCy provides various pretrained pipelines. One of the most widely used is `en_core_web_sm`, which also includes an NER component in its pipeline.


### Exercise 3

 1. Download and load `en_core_web_sm` into your code
 2. Process some texts with the model
 3. What differences do you see between your custom model and the pretrained one?

In [None]:
!python -m spacy download en_core_web_sm

In [None]:
nlp_pre = spacy.load("en_core_web_sm")
doc = nlp_pre(text)

for entity in doc.ents:
    print(entity.text, entity.label_)
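One thing to keep in mind when comparing the outputs is that the two models use different label sets: the CoNLL-2003 data has `PER`, `ORG`, `LOC` and `MISC`, while `en_core_web_sm` predicts a larger set of labels such as `PERSON`, `GPE`, `NORP` or `DATE`. The sketch below maps the pretrained labels onto the CoNLL ones so that the predictions can be compared side by side. The mapping is a rough assumption made for illustration, not an official correspondence, and the `pretrained` list is made-up example data rather than real model output.

```python
from typing import List, Tuple

# Rough, assumed mapping from en_core_web_sm labels to CoNLL-2003 labels.
# Labels without a counterpart (DATE, CARDINAL, MONEY, ...) are left out.
PRETRAINED_TO_CONLL = {
    "PERSON": "PER",
    "ORG": "ORG",
    "GPE": "LOC",
    "LOC": "LOC",
    "FAC": "LOC",
    "NORP": "MISC",
    "EVENT": "MISC",
    "LANGUAGE": "MISC",
    "PRODUCT": "MISC",
    "WORK_OF_ART": "MISC",
    "LAW": "MISC",
}


def to_conll_labels(entities: List[Tuple[str, str]]) -> List[Tuple[str, str]]:
    """Translate (text, label) pairs and drop labels that have no CoNLL match."""
    mapped = []
    for text, label in entities:
        target = PRETRAINED_TO_CONLL.get(label)
        if target is not None:
            mapped.append((text, target))
    return mapped


# Made-up (text, label) pairs in the style of the pretrained model's output.
pretrained = [("Japan", "GPE"), ("Asian Cup", "EVENT"), ("2019", "DATE")]
mapped = to_conll_labels(pretrained)
print(mapped)
```

With real predictions, you could build the input list as `[(ent.text, ent.label_) for ent in doc.ents]`.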

## Visualization


More information: https://spacy.io/usage/visualizers

In [None]:
from IPython.display import display, HTML
from spacy.training import Corpus
from spacy import displacy

reader = Corpus('spacyNER_data/test.spacy')
test_data = reader(nlp)
test_golds = [example for example in test_data]
test_texts = [example.text for example in test_golds]

In [None]:
sent_i = 40
test_gold = test_golds[sent_i]
test_text =test_texts[sent_i]

Show gold annotations

In [None]:
annotations = test_gold.to_dict()
tokens = annotations['token_annotation']['ORTH']
# Keep (token index, BILUO tag) pairs for tokens that belong to an entity
entities = [(i, tag) for i, tag in enumerate(annotations['doc_annotation']['entities']) if tag not in ("O", "-")]
ents = []
for i, tag in entities:
    prefix, label = tag.split('-', 1)
    # Character offsets in the space-joined text; skip the joining space before token i
    start = len(' '.join(tokens[:i])) + (1 if i > 0 else 0)
    end = len(' '.join(tokens[:i + 1]))
    if prefix in ('I', 'L') and ents:
        ents[-1]['end'] = end  # extend the entity opened by the preceding B tag
    elif prefix in ('B', 'U'):
        ents.append({'start': start, 'end': end, 'label': label})

ex = [{'text': ' '.join(tokens), 'ents': ents, 'title': None}]
# jupyter=False makes render() return the HTML string instead of displaying it directly
html = displacy.render(ex, style="ent", manual=True, jupyter=False)
display(HTML(html))
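The conversion used above, from token-level BILUO tags to the character offsets that displaCy's manual mode expects, can also be written as a small standalone function. This is a minimal sketch that assumes the text is the tokens joined by single spaces and that the tag sequence is well formed. The `biluo_to_char_spans` name and the example tags are hypothetical.

```python
from typing import Dict, List, Union

SpanDict = Dict[str, Union[int, str]]


def biluo_to_char_spans(tokens: List[str], tags: List[str]) -> List[SpanDict]:
    """Convert BILUO tags into character spans over ' '.join(tokens)."""
    spans: List[SpanDict] = []
    offset = 0  # character position where the current token starts
    for token, tag in zip(tokens, tags):
        start, end = offset, offset + len(token)
        offset = end + 1  # +1 for the space that joins consecutive tokens
        if tag in ("O", "-"):
            continue
        prefix, label = tag.split("-", 1)
        if prefix in ("B", "U"):
            spans.append({"start": start, "end": end, "label": label})
        elif prefix in ("I", "L") and spans:
            spans[-1]["end"] = end  # extend the entity that is currently open
    return spans


# Hypothetical gold tags for the example sentence used earlier.
tokens = ["Japan", "began", "the", "defence", "of", "their", "Asian", "Cup", "title", "."]
tags = ["U-LOC", "O", "O", "O", "O", "O", "B-MISC", "L-MISC", "O", "O"]

text = " ".join(tokens)
spans = biluo_to_char_spans(tokens, tags)
for span in spans:
    print(text[span["start"]:span["end"]], span["label"])
```

The resulting list has the same shape as the `ents` value passed to `displacy.render` with `manual=True`.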

Show your model's predictions

In [None]:
doc = nlp(test_text)
# jupyter=False makes render() return the HTML string instead of displaying it directly
html = displacy.render(doc, style="ent", jupyter=False)
display(HTML(html))

Show pretrained model's predictions

In [None]:
doc = nlp_pre(test_text)
# jupyter=False makes render() return the HTML string instead of displaying it directly
html = displacy.render(doc, style="ent", jupyter=False)
display(HTML(html))