# Training the NER pipeline component

The goal of this project is to update the Named Entity Recognition (NER) pipeline component of the language model we used in a previous project, [POS Tagging, Syntactic Dependency Parsing and NER](https://github.com/j-n-t/natural_language_processing/blob/master/POS%20Tagging%20and%20NER.ipynb), and improve its performance.

In order to do that, we'll use **spaCy** and the [**INCEpTION annotation tool**](https://inception-project.github.io/).

#### 1. Perform initial imports

In [1]:
import spacy

nlp = spacy.load('en_core_web_sm')

from spacy.util import minibatch, compounding
from spacy.pipeline import EntityRecognizer

from spacy import displacy

from mediawiki import MediaWiki

import os
import json
import random

#### 2. Load data

In our previous project, we saw that the NER pipeline component did not assign the proper entity label to the token **Azores**, the name of an archipelago of nine islands in the North Atlantic Ocean.

To correct this, we'll load the **Wikipedia page about the Azores** and use that text to update our NER pipeline component.

In [2]:
# load wikipedia page about Azores

wikipedia = MediaWiki()
azores = wikipedia.page('Azores - Wikipedia')
# Azores - Wikipedia corresponds to the title tag of the source code for this page

#### 3. Explore data

In [3]:
# url of the wikipedia page about Azores

azores.url

'https://en.wikipedia.org/wiki/Azores'

In [4]:
# first 2 sentences (out of a maximum of 10) of the wikipedia page about Azores

wikipedia.summary("Azores - Wikipedia", sentences=2)

'The Azores ( ə-ZORZ, also US:  AY-zorz; Portuguese: Açores [ɐˈsoɾɨʃ]), officially the Autonomous Region of the Azores (Região Autónoma dos Açores), is one of the two autonomous regions of Portugal (along with Madeira). It is an archipelago composed of nine volcanic islands in the Macaronesia region of the North Atlantic Ocean, about 1,360 km (850 mi) west of continental Portugal, about 1,500 km (930 mi) west of Lisbon, in continental Portugal, about 1,500 km (930 mi) northwest of Morocco, and about 2,980 km (1,850 mi) southeast of Newfoundland, Canada.'

In [5]:
# doc with all the content of the wikipedia page about Azores

doc = nlp(azores.content)

In [6]:
# first sentence of the doc container

list(doc.sents)[0]

The Azores ( ə-ZORZ, also US:  AY-zorz; Portuguese: Açores [ɐˈsoɾɨʃ]), officially the Autonomous Region of the Azores (Região Autónoma dos Açores), is one of the two autonomous regions of Portugal (along with Madeira).

In [7]:
# number of sentences in the doc container

len(list(doc.sents))

513

spaCy splits the Wikipedia page about the Azores into 513 sentences. To accomplish our goal, we'll use the **first 100 sentences** to update the NER pipeline component and then test whether it behaves as expected.

#### 4. Save data to a text file

In [8]:
# select first 100 sentences

azores_100sent = list(doc.sents)[0:100]

In [9]:
# last 5 sentences

azores_100sent[-5:]

[Beginning in 1868, Portugal issued its stamps overprinted with "AÇORES" for use in the islands.,
 Between 1892 and 1906, it also issued separate stamps for the three administrative districts of the time.,
 During the 18th and 19th centuries, Graciosa was host to many prominent figures, including Chateaubriand, the French writer who passed through upon his escape to America during the French revolution;,
 Almeida Garrett, the Portuguese poet who visited an uncle and wrote some poetry while there; and Prince Albert of Monaco, the 19th century oceanographer who led several expeditions in the waters of the Azores.,
 He arrived on his yacht Hirondelle, and visited the furna da caldeira, the noted hot springs grotto.]

In [10]:
# save azores_100sent to a txt file

# output directory name
output_dir = './ner_training/text'

if not os.path.exists(output_dir):
    os.makedirs(output_dir)

with open(output_dir+'/azores_100sent.txt', 'w', encoding='utf8') as f:
    for sentence in azores_100sent:
        f.write(sentence.text+'\n')

#### 5. Annotate the text

We will now use the INCEpTION annotation tool together with spaCy to perform the annotation of the text we have just saved.

To do this, we need to **set up an external recommender for INCEpTION**. Detailed explanations can be found [here](https://inception-project.github.io/example-projects/external-recommender/) and [here](https://github.com/inception-project/external-recommender-spacy).

After launching **INCEpTION** and **spaCy's external recommender**, we need to do the following with INCEpTION:

* create new project - set name and description
* in the 'Documents' tab, import our txt file
* in the 'Recommenders' tab, create a new recommender - set name, layer (Named entity), feature (value), tool (Remote classifier) and remote URL (http://localhost:5000/ner)
* in the 'Tagsets' tab, select 'Named Entity tags', and in 'Tagset Details' change the language to 'en' and select the option 'Annotators may add new tags'

We can now return to the main menu by clicking on the top left corner (INCEpTION). We click on 'Annotation' and open the document.

It's now time to start annotating! We select 'Named entity' from the 'Layer' dropdown menu on the right side of the screen. Once we double-click on a word and select the corresponding value, the external recommendations will appear above the tokens highlighted in blue.

We can now accept or reject the suggested annotations and make new annotations.

<img src="./ner_training/images/azores_inception.jpg" title="Annotating with INCEpTION" />

After finishing the annotation process, we can **export the document and save it in the CoNLL 2002 format**.
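
Before converting the export, it can help to inspect it. The following is a minimal sketch of a reader for a CoNLL 2002-style file, assuming each line holds a token followed by its entity tag, separated by whitespace, and that blank lines separate sentences. The function name `read_conll2002` and the sample lines are hypothetical and are not taken from the actual export.

```python
def read_conll2002(lines):
    """Group 'token tag' lines into sentences; a blank line ends a sentence."""
    sentences, current = [], []
    for raw in lines:
        line = raw.strip()
        if not line:
            # blank line: close the current sentence, if any
            if current:
                sentences.append(current)
                current = []
            continue
        # the entity tag is assumed to be the last whitespace-separated column
        token, tag = line.rsplit(None, 1)
        current.append((token, tag))
    if current:
        sentences.append(current)
    return sentences


# hypothetical sample, not copied from the exported document
sample = """The O
Azores B-LOC
is O

Portugal B-GPE
""".splitlines()

conll_sentences = read_conll2002(sample)
print(conll_sentences)
```

With the real export, the same function could be applied to the lines of the opened file instead of `sample`.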

#### 6. Convert the file

The next step is to convert this file to spaCy's json format as documented [here](https://spacy.io/api/cli#convert) and [here](https://github.com/explosion/spaCy/tree/master/examples/training/ner_example_data).

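I created an `out` folder to store the converted files and ran the following command on the command line: `python -m spacy convert azores_100sent.conll out -c ner -s -n 10 -b en`. In the spaCy v2 CLI used here, `-c ner` selects the NER converter, `-s` turns on sentence segmentation, `-n 10` groups 10 sentences per document and `-b en` names the model used for that segmentation.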
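
The converted file stores entity tags in the BILUO scheme (`B`egin, `I`n, `L`ast, `U`nit, `O`ut), whereas CoNLL-style files normally use IOB tags. spaCy's converter takes care of this mapping itself; the sketch below only illustrates the idea, assuming IOB2 input where every entity starts with a `B-` tag. The function name `iob2_to_biluo` and the sample tags are hypothetical.

```python
def iob2_to_biluo(tags):
    """Map a list of IOB2 tags to BILUO tags (illustrative sketch only)."""
    biluo = []
    for i, tag in enumerate(tags):
        if tag == 'O':
            biluo.append(tag)
            continue
        prefix, label = tag.split('-', 1)
        next_tag = tags[i + 1] if i + 1 < len(tags) else 'O'
        # the entity continues only if the next tag is I- with the same label
        continues = next_tag == 'I-' + label
        if prefix == 'B':
            biluo.append(('B-' if continues else 'U-') + label)
        else:  # prefix == 'I'
            biluo.append(('I-' if continues else 'L-') + label)
    return biluo


# hypothetical tag sequence for illustration
iob_tags = ['O', 'B-LOC', 'O', 'B-LOC', 'I-LOC', 'I-LOC', 'O']
biluo_tags = iob2_to_biluo(iob_tags)
print(biluo_tags)
# ['O', 'U-LOC', 'O', 'B-LOC', 'I-LOC', 'L-LOC', 'O']
```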

With this .json file, we could now **train our NER pipeline component** from the command line following [these guidelines](https://spacy.io/api/cli#train).

Alternatively, we could get the entity offsets from this file and prepare a list of training examples. I've written a function to do this conversion.

In [62]:
def train_examples_converter(docs):
    """Convert spaCy's json training format to (text, {'entities': [...]}) tuples."""

    train_examples = []

    for doc in docs:
        for paragraph in doc['paragraphs']:
            for sentence in paragraph['sentences']:

                words = []
                entities = []
                start_multi = None

                for token in sentence['tokens']:
                    words.append(token['orth'])
                    tag = token['ner']
                    if tag == 'O':
                        continue

                    # character offsets of the current token in the space-joined text
                    end = len(' '.join(words))
                    start = end - len(words[-1])

                    if tag.startswith('U'):      # single-token entity
                        entities.append((start, end, tag[2:]))
                    elif tag.startswith('B'):    # first token of a multi-token entity
                        start_multi = start
                    elif tag.startswith('L'):    # last token of a multi-token entity
                        entities.append((start_multi, end, tag[2:]))

                train_examples.append((' '.join(words), {'entities': entities}))

    return train_examples
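
A quick way to sanity-check character offsets like these is to slice the text with them and look at what comes back. The sketch below assumes the `(text, {'entities': [(start, end, label), ...]})` layout produced by the function above; the helper name `entity_spans` is hypothetical, and the example is a shortened prefix of the first sentence rather than a full training example.

```python
def entity_spans(example):
    """Return the (substring, label) pairs that the entity offsets point at."""
    text, annotations = example
    return [(text[start:end], label)
            for start, end, label in annotations['entities']]


# shortened prefix of the first sentence, joined with single spaces
example = ('The Azores ( ə-ZORZ , also US',
           {'entities': [(4, 10, 'LOC'), (27, 29, 'GPE')]})

spans = entity_spans(example)
print(spans)
# [('Azores', 'LOC'), ('US', 'GPE')]
```

If a slice shows a partial word or stray whitespace, the offsets and the text are out of sync.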

After copying the .json file to our project's folder, we can now load the file and pass it to the function.

In [38]:
with open('azores_5sent.json', encoding='utf8') as f:
    train_examples_json = json.load(f)

In [39]:
train_examples_json[0]['paragraphs'][0]['sentences']

[{'tokens': [{'orth': 'The', 'tag': '-', 'ner': 'O'},
   {'orth': 'Azores', 'tag': '-', 'ner': 'U-LOC'},
   {'orth': '(', 'tag': '-', 'ner': 'O'},
   {'orth': 'ə-ZORZ', 'tag': '-', 'ner': 'O'},
   {'orth': ',', 'tag': '-', 'ner': 'O'},
   {'orth': 'also', 'tag': '-', 'ner': 'O'},
   {'orth': 'US', 'tag': '-', 'ner': 'U-GPE'},
   {'orth': ':', 'tag': '-', 'ner': 'O'},
   {'orth': 'AY-zorz', 'tag': '-', 'ner': 'O'},
   {'orth': ';', 'tag': '-', 'ner': 'O'},
   {'orth': 'Portuguese', 'tag': '-', 'ner': 'U-NORP'},
   {'orth': ':', 'tag': '-', 'ner': 'O'},
   {'orth': 'Açores', 'tag': '-', 'ner': 'O'},
   {'orth': '[', 'tag': '-', 'ner': 'O'},
   {'orth': 'ɐˈsoɾɨʃ', 'tag': '-', 'ner': 'O'},
   {'orth': ']', 'tag': '-', 'ner': 'O'},
   {'orth': ')', 'tag': '-', 'ner': 'O'},
   {'orth': ',', 'tag': '-', 'ner': 'O'},
   {'orth': 'officially', 'tag': '-', 'ner': 'O'},
   {'orth': 'the', 'tag': '-', 'ner': 'O'},
   {'orth': 'Autonomous', 'tag': '-', 'ner': 'B-LOC'},
   {'orth': 'Region', 'tag': 

In [63]:
train_examples = train_examples_converter(train_examples_json)

In [64]:
train_examples

[('The Azores ( ə-ZORZ , also US : AY-zorz ; Portuguese : Açores [ ɐˈsoɾɨʃ ] ) , officially the Autonomous Region of the Azores ( Região Autónoma dos Açores ) , is one of the two autonomous regions of Portugal ( along with Madeira ) .',
  {'entities': [(4, 10, 'LOC'),
    (27, 29, 'GPE'),
    (42, 52, 'NORP'),
    (93, 124, 'LOC'),
    (172, 175, 'CARDINAL'),
    (198, 206, 'GPE'),
    (220, 227, 'LOC')]}),
 ('It is an archipelago composed of nine volcanic islands in the Macaronesia region of the North Atlantic Ocean , about 1,360 km ( 850 mi ) west of continental Portugal , about 1,500 km ( 930 mi ) west of Lisbon , in continental Portugal , about 1,500 km ( 930 mi ) northwest of Morocco , and about 2,980 km ( 1,850 mi ) southeast of Newfoundland , Canada .',
  {'entities': [(33, 37, 'CARDINAL'),
    (62, 73, 'LOC'),
    (84, 108, 'LOC'),
    (111, 125, 'QUANTITY'),
    (128, 134, 'QUANTITY'),
    (157, 165, 'GPE'),
    (168, 182, 'QUANTITY'),
    (185, 188, 'CARDINAL'),
    (202, 208

In [49]:
nlp.pipe_names

['tagger', 'parser', 'ner']

In [53]:
nlp.disable_pipes(['tagger', 'parser'])

[('tagger', <spacy.pipeline.pipes.Tagger at 0x2342032e0f0>),
 ('parser', <spacy.pipeline.pipes.DependencyParser at 0x23420439e28>)]

In [54]:
nlp.pipe_names

['ner']

In [45]:
# create output directory for the ner component

os.makedirs('ner_pipe', exist_ok=True)

In [65]:
# create an optimizer for the existing NER component
optimizer = nlp.entity.create_optimizer()
max_batch_size = 3

for i in range(25):  # 25 passes over the training examples
    random.shuffle(train_examples)
    # batch sizes start at 2.0 and grow by a factor of 1.001 per batch, capped at max_batch_size
    batch_size = compounding(2.0, max_batch_size, 1.001)
    batches = minibatch(train_examples, size=batch_size)
    for batch in batches:
        sentences, annotations = zip(*batch)
        nlp.update(sentences, annotations, sgd=optimizer)

# save only the updated NER component to disk
ner = nlp.get_pipe('ner')
ner.to_disk('./ner_pipe')
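
To see what batch sizes a schedule like `compounding(2.0, 3, 1.001)` produces in practice, here is a rough stdlib-only sketch. `compounding_sketch` and `minibatch_sketch` are hypothetical stand-ins that approximate the behaviour of the spaCy helpers (they are not the library code), and the 100 training examples are an assumption.

```python
from itertools import islice


def compounding_sketch(start, stop, compound):
    """Yield start, start*compound, ... capped at stop (assumes start <= stop)."""
    value = float(start)
    while True:
        yield min(value, stop)
        value *= compound


def minibatch_sketch(items, sizes):
    """Cut items into batches whose lengths are the truncated values from sizes."""
    items = iter(items)
    while True:
        batch = list(islice(items, int(next(sizes))))
        if not batch:
            return
        yield batch


examples = list(range(100))  # stand-in for 100 training examples (assumed)
sizes = compounding_sketch(2.0, 3, 1.001)
batch_lengths = [len(batch) for batch in minibatch_sketch(examples, sizes)]
print(len(batch_lengths), set(batch_lengths))
# 50 {2}
```

Under these assumptions the size grows too slowly to reach 3 within a single pass, and the schedule restarts at 2.0 on every pass, so each batch holds two examples. A larger growth factor or a schedule created once outside the loop would be needed for the batch size to actually increase.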

In [67]:
# reload the base model without its NER component and add the updated one
nlp = spacy.load('en_core_web_sm', disable=['ner'])
ner = EntityRecognizer(nlp.vocab)
ner.from_disk('./ner_pipe')
nlp.add_pipe(ner, 'azores_ner')

In [69]:
nlp.pipe_names

['tagger', 'parser', 'azores_ner']

In [70]:
doc = nlp("if that between America and Europe is ample, will that between the Continent and the Azores, or Madeira, or the Canaries, or Ireland, be sufficient?")

In [71]:
for ent in doc.ents:
    print(ent.text, ent.label_)

America GPE
Europe LOC
Continent LOC
Azores LOC
Madeira LOC
Canaries LOC
Ireland GPE
