# New entity recogniser: annotation with the BILUO scheme, using spaCy

 * Fine-tuning a pretrained machine learning (ML) model for a desired task is well established in vision-related problems. Over the last couple of years, work such as [Peters et al.](https://www.aclweb.org/anthology/N18-1202/) and [Akbik et al.](https://alanakbik.github.io/papers/coling2018.pdf) has spurred the use of pretrained natural language processing (NLP) models in the same way for NLP tasks.
 
 * This notebook takes a pretrained [spaCy](https://spacy.io/models/en) model and trains it further to recognise user-specific entities in text.
 
 * Read [here](https://ruder.io/state-of-transfer-learning-in-nlp/) for the latest state of transfer learning in NLP.
 
 * The pretrained [model](https://spacy.io/models/en) used here is a convolutional neural network (CNN) architecture trained on [OntoNotes](https://catalog.ldc.upenn.edu/LDC2013T19).
 
 * The customised entity recogniser is trained with the [BILUO](https://spacy.io/api/annotation#biluo) scheme. Note that the BILUO scheme trains and performs better than the IOB scheme; see the FAQ in the [README](README.md).
 
 * This notebook extends, with explanations, the [example](https://github.com/explosion/spaCy/blob/master/examples/training/train_new_entity_type.py) already provided by spaCy.

In [13]:
## Load an NLP model

In [14]:
import spacy
import numpy as np
nlp = spacy.load('en')  # or a specific model such as 'en_core_web_md'; see https://spacy.io/models/en


## Data Annotations

 * [Using BILUO scheme](#biluo)
 * [Using offset indices](#offset)
 * [Custom Doc](#customdoc)

### Using BILUO Scheme
<a id='biluo'> </a>

 * Annotate your data using the [BILUO](https://spacy.io/api/annotation#biluo) scheme, where each token comes from a [doc](https://spacy.io/api) created with the model:
 ```
 text = 'Write your text here.'
 doc = nlp(text)
 
 # for token in doc:
 # write the token in a file for annotating it later 
 ```
 
 * An example is provided [here](ner-token-per-line.biluo) 
 
 * This is a less cumbersome way of annotating than the offset indices shown later.
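To make the scheme concrete, here is a minimal, library-free sketch of how BILUO tags follow from entity spans given as token indices. The helper `spans_to_biluo` and the sample sentence are illustrative assumptions, not part of spaCy or of the annotated file.

```python
def spans_to_biluo(n_tokens, spans):
    """Turn (start, end, label) token spans (end exclusive) into BILUO tags."""
    tags = ["O"] * n_tokens                # O: token outside any entity
    for start, end, label in spans:
        if end - start == 1:
            tags[start] = f"U-{label}"     # U: single-token (unit) entity
        else:
            tags[start] = f"B-{label}"     # B: first token of the entity
            for i in range(start + 1, end - 1):
                tags[i] = f"I-{label}"     # I: inner token
            tags[end - 1] = f"L-{label}"   # L: last token of the entity
    return tags


# Hypothetical sample sentence, already split into tokens
tokens = ["Sebastian", "Thrun", "joined", "Google", "earlier", "this", "week"]
spans = [(0, 2, "PERSON"), (3, 4, "ORG"), (4, 7, "DATED")]
tags = spans_to_biluo(len(tokens), spans)

for token, tag in zip(tokens, tags):
    print(f"{token},{tag}")
```

Each printed line has the token-per-line shape that can then be corrected or completed by hand.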
 


In [15]:
# For reproducing the same results across multiple runs
s = 999
np.random.seed(s)
spacy.util.fix_random_seed(s)

# If training on a GPU, also seed cupy
# cupy.random.seed(s)

In [16]:
# Load the data from file

import pandas as pd
dpath = 'ner-token-per-line.biluo'
df = pd.read_csv(dpath, sep=',')
words  = df.word.values
ents = df.label.values
text = ' '.join(words)
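
The loading cell assumes a comma-separated file with a `word` column and a `label` column, one token per line. A minimal sketch of that layout using only the standard library follows; the sample rows are made up for illustration and are not the contents of `ner-token-per-line.biluo`.

```python
import csv
import io

# Hypothetical sample in the assumed layout: a header row, then one token per line.
sample = """word,label
Thrun,U-PERSON
spoke,O
earlier,B-DATED
this,I-DATED
week,L-DATED
"""

rows = list(csv.DictReader(io.StringIO(sample)))
words = [row["word"] for row in rows]    # the tokens
ents = [row["label"] for row in rows]    # one BILUO tag per token
text = " ".join(words)

print(text)
print(ents)
```

In such a file, a token that is itself a comma has to be quoted (`","`) so that it is not read as a column separator.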


### Add all the new annotations

In [17]:
# List every extra entity label the model should recognise beyond the existing named entities: https://spacy.io/api/annotation#named-entities
add_ents = ['DATED']

# Get the "ner" pipe, creating it if it does not exist.
# Pipelines in the pretrained model: tagger, parser, ner. A blank model made with `spacy.blank('en')` has none, so the pipe has to be created.
if "ner" not in nlp.pipe_names:
    ner = nlp.create_pipe("ner") # "architecture": "ensemble" simple_cnn ensemble, bow # https://spacy.io/api/annotation
    nlp.add_pipe(ner)
else:
    ner = nlp.get_pipe("ner")

prev_ents = ner.move_names
print('[Existing Entities] = ', ner.move_names)

for ent in add_ents:
    ner.add_label(ent)
    
new_ents = ner.move_names
# print('\n[All Entities] = ', ner.move_names)

print('\n\n[New Entities] = ', list(set(new_ents) - set(prev_ents)))



[Existing Entities] =  ['B-ORG', 'B-DATE', 'B-PERSON', 'B-GPE', 'B-MONEY', 'B-CARDINAL', 'B-NORP', 'B-PERCENT', 'B-WORK_OF_ART', 'B-LOC', 'B-TIME', 'B-QUANTITY', 'B-FAC', 'B-EVENT', 'B-ORDINAL', 'B-PRODUCT', 'B-LAW', 'B-LANGUAGE', 'I-ORG', 'I-DATE', 'I-PERSON', 'I-GPE', 'I-MONEY', 'I-CARDINAL', 'I-NORP', 'I-PERCENT', 'I-WORK_OF_ART', 'I-LOC', 'I-TIME', 'I-QUANTITY', 'I-FAC', 'I-EVENT', 'I-ORDINAL', 'I-PRODUCT', 'I-LAW', 'I-LANGUAGE', 'L-ORG', 'L-DATE', 'L-PERSON', 'L-GPE', 'L-MONEY', 'L-CARDINAL', 'L-NORP', 'L-PERCENT', 'L-WORK_OF_ART', 'L-LOC', 'L-TIME', 'L-QUANTITY', 'L-FAC', 'L-EVENT', 'L-ORDINAL', 'L-PRODUCT', 'L-LAW', 'L-LANGUAGE', 'U-ORG', 'U-DATE', 'U-PERSON', 'U-GPE', 'U-MONEY', 'U-CARDINAL', 'U-NORP', 'U-PERCENT', 'U-WORK_OF_ART', 'U-LOC', 'U-TIME', 'U-QUANTITY', 'U-FAC', 'U-EVENT', 'U-ORDINAL', 'U-PRODUCT', 'U-LAW', 'U-LANGUAGE', 'O']


[New Entities] =  ['I-DATED', 'B-DATED', 'U-DATED', 'L-DATED']
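
The output above shows that each label is expanded into four names, one per BILUO prefix, alongside a single `O`. The sketch below reproduces only that naming pattern; `move_names_for` is a hypothetical helper that mirrors the printed names and says nothing about spaCy's internal implementation.

```python
def move_names_for(labels):
    """Build B-/I-/L-/U- names for each label, followed by a single 'O'."""
    names = [f"{prefix}-{label}" for prefix in "BILU" for label in labels]
    names.append("O")
    return names


print(move_names_for(["DATED"]))
```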


In [18]:
#### Create Dataset
from spacy.gold import GoldParse
doc = nlp.make_doc(text)
g = GoldParse(doc, entities=ents)

# Add more examples as available or needed
X = [doc, doc]
Y = [g, g]

### Training
<a id='training'> </a>

In [19]:
other_pipes = [pipe for pipe in nlp.pipe_names if pipe != "ner"]
print(f'[OtherPipes] = {other_pipes} will be disabled')

[OtherPipes] = ['tagger', 'parser'] will be disabled


In [20]:
model = None  # None since we are training a fresh model, not resuming a saved one
n_iter = 20
with nlp.disable_pipes(*other_pipes):  # only train ner
    # optimizer = nlp.begin_training()
    if model is None:
        optimizer = nlp.begin_training()
    else:
        optimizer = nlp.resume_training()
    for i in range(n_iter):
        losses = {}
        nlp.update(X, Y,  sgd=optimizer, drop=0.0, losses=losses)
        # nlp.entity.update(d, g)
        print("Losses", losses)


Losses {'ner': 15.094052302083471}
Losses {'ner': 2.813956273703063}
Losses {'ner': 2.6137797097536524}
Losses {'ner': 2.6699867776599628}
Losses {'ner': 3.6325323196373986}
Losses {'ner': 3.418292572110345}
Losses {'ner': 2.7101756360972948}
Losses {'ner': 2.0605769809223262}
Losses {'ner': 1.8786397921994649}
Losses {'ner': 1.7351509290678873}
Losses {'ner': 1.4021572820718298}
Losses {'ner': 4.4508419834298945}
Losses {'ner': 11.466295966513655}
Losses {'ner': 10.674786345421065}
Losses {'ner': 9.8725883235784604}
Losses {'ner': 8.5278043256903935}
Losses {'ner': 6.6744801108174743}
Losses {'ner': 5.7457604113131504}
Losses {'ner': 5.4725865582614581}
Losses {'ner': 5.2311404237445913}
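
The loop above passes all examples in a single update per iteration. With a larger dataset it is common to shuffle the examples on each pass and update on small batches. The sketch below shows only that batching logic in plain Python; `iter_minibatches` and the placeholder data are assumptions for illustration, and the call to `nlp.update` is left as a comment.

```python
import random


def iter_minibatches(xs, ys, batch_size, rng):
    """Yield shuffled (docs, golds) batches, keeping each x paired with its y."""
    pairs = list(zip(xs, ys))
    rng.shuffle(pairs)
    for start in range(0, len(pairs), batch_size):
        batch = pairs[start:start + batch_size]
        docs, golds = zip(*batch)
        yield list(docs), list(golds)


rng = random.Random(999)                   # same seed value as used earlier
X_demo = [f"doc{i}" for i in range(5)]     # placeholders for docs or texts
Y_demo = [f"gold{i}" for i in range(5)]    # placeholders for annotations

batches = list(iter_minibatches(X_demo, Y_demo, batch_size=2, rng=rng))
for docs, golds in batches:
    # In the notebook this would be:
    # nlp.update(docs, golds, sgd=optimizer, drop=0.0, losses=losses)
    print(docs, golds)
```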


In [21]:
test_text = ("When Sebastian Thrun started working on self-driving cars at "
        "Google in 2007, few people outside of the company took him "
        "seriously. “I can tell you very senior CEOs of major American "
        "car companies would shake my hand and turn away because I wasn’t "
        "worth talking to,” said Thrun, in an interview with Recode earlier "
        "this week.")
doc = nlp(test_text)
# print("[Entities] in '%s'" % test_text, '\n\n')
for ent in doc.ents:
    print(ent.text, ent.label_)

Sebastian Thrun PERSON
Google ORG
2007 DATE
American NORP
Thrun PERSON
Recode ORG
earlier this week. DATED
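
Multi-token entities such as `Sebastian Thrun` correspond to a run of per-token BILUO tags that is read back as one span. A minimal sketch of that decoding, assuming well-formed tag sequences; `biluo_to_spans` is a hypothetical helper, not spaCy's own decoder, and the sample tokens are made up.

```python
def biluo_to_spans(tags):
    """Decode well-formed BILUO tags into (start, end, label) token spans (end exclusive)."""
    spans = []
    start = None
    for i, tag in enumerate(tags):
        if tag == "O":
            continue
        prefix, label = tag.split("-", 1)
        if prefix == "U":                  # single-token entity
            spans.append((i, i + 1, label))
        elif prefix == "B":                # remember where the entity begins
            start = i
        elif prefix == "L":                # close the entity opened by B
            spans.append((start, i + 1, label))
            start = None
        # "I" tags need no action: the span stays open
    return spans


tokens = ["Sebastian", "Thrun", "joined", "Google", "earlier", "this", "week"]
tags = ["B-PERSON", "L-PERSON", "O", "U-ORG", "B-DATED", "I-DATED", "L-DATED"]

decoded = biluo_to_spans(tags)
for start, end, label in decoded:
    print(" ".join(tokens[start:end]), label)
```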


### Using Offset
<a id='offset'> </a>

 * The training process is the same as before; only the data creation differs.
 * Here the annotations are given as character offset indices, while the underlying scheme is still BILUO.
 * This is clumsier to use, but it works.
 * I make no claim about which approach performs better, or whether they perform alike; that would require experiments.
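
Counting character offsets by hand is error-prone. One possible alternative is to locate each phrase with `str.find` and build the `(start, end, label)` triples from the result. `find_offset` below is a hypothetical helper; it returns only the first occurrence at or after `begin`, so repeated phrases need an explicit `begin`.

```python
def find_offset(text, phrase, label, begin=0):
    """Return (start, end, label) for the first occurrence of phrase at or after begin."""
    start = text.find(phrase, begin)
    if start == -1:
        raise ValueError(f"{phrase!r} not found in text")
    return (start, start + len(phrase), label)


text = ("When Sebastian Thrun started working on self-driving cars at "
        "Google in 2007")
entities = [
    find_offset(text, "Sebastian Thrun", "PERSON"),
    find_offset(text, "Google", "ORG"),
    find_offset(text, "2007", "DATE"),
]
g = {"entities": entities}
print(g)
```

Slicing the text with each pair, for example `text[5:20]`, is a quick way to check an offset before training.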

In [22]:
# For one instance
text = ("When Sebastian Thrun started working on self-driving cars at "
        "Google in 2007, few people outside of the company took him "
        "seriously. “I can tell you very senior CEOs of major American "
        "car companies would shake my hand and turn away because I wasn’t "
        "worth talking to,” said Thrun, in an interview with Recode earlier "
        "this week.")
doc = text
g = {'entities': [(5, 20, 'PERSON'), (61, 67, 'ORG'), (71, 75, 'DATE'), (173, 181, 'NORP'), 
    (271, 276, 'PERSON'), (299, 305, 'ORG'), (306, 323, 'DATED')]}

X = [doc]
Y = [g]

In [27]:
with nlp.disable_pipes(*other_pipes):  # only train ner
    # optimizer = nlp.begin_training()
    if model is None:
        optimizer = nlp.begin_training()
    else:
        optimizer = nlp.resume_training()
    for i in range(n_iter):
        losses = {}
        nlp.update(X, Y,  sgd=optimizer, drop=0.0, losses=losses)
        # nlp.entity.update(d, g)
        print("Losses", losses)
doc = nlp(text)
for ent in doc.ents:
    print(ent.text, ent.label_)

Losses {'ner': 3.7237621309480504e-07}
Losses {'ner': 4.2568601023480757e-08}
Losses {'ner': 9.9105726647369758e-09}
Losses {'ner': 1.106951591523704e-09}
Losses {'ner': 1.2892151331991783e-10}
Losses {'ner': 2.3693011855495481e-11}
Losses {'ner': 6.3620302761229311e-12}
Losses {'ner': 2.4024890882282167e-12}
Losses {'ner': 1.1433302154762119e-12}
Losses {'ner': 6.3102569354906027e-13}
Losses {'ner': 3.8399217233313528e-13}
Losses {'ner': 2.4715571862481995e-13}
Losses {'ner': 1.6659271366043418e-13}
Losses {'ner': 1.1629706412241111e-13}
Losses {'ner': 8.3657432465516869e-14}
Losses {'ner': 6.177951173299222e-14}
Losses {'ner': 4.6537979908801527e-14}
Losses {'ner': 3.5487294092645974e-14}
Losses {'ner': 2.7643605529383043e-14}
Losses {'ner': 2.1971045296529824e-14}
Sebastian Thrun PERSON
Google ORG
2007 DATE
American NORP
Thrun PERSON
Recode ORG
earlier this week DATED


In [39]:
doc = nlp(text)
for ent in doc.ents:
    print(ent.text, ent.label_)

Sebastian Thrun PERSON
Google ORG
2007 DATE
American NORP
Thrun PERSON
Recode ORG
earlier this week DATED


## [Visual Display](https://spacy.io/usage/visualizers#ent) 


In [40]:
from spacy import displacy

In [44]:
displacy.render(doc, style="ent")  # or displacy.serve(doc, style="ent") when not running in a Jupyter notebook

### Custom Doc
<a id='customdoc'> </a>

 * The tokens created by the NLP model are typically split not only on spaces but also on punctuation and other special characters.

 * We might want to label each word (split only on spaces, or on something else) rather than each token generated by the NLP model (`doc = nlp(text)`). In that case the doc creation needs a small modification.

 * Note that although a custom doc is possible, a token in the doc cannot be a space.

 * In my experiments this did not help; in fact, it performed worse than the other two annotation processes.
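
The `words` and `spaces` arguments of a custom doc can be pictured as two parallel lists: one entry per word, and one flag saying whether a space follows that word. A library-free sketch of the idea, with `rebuild_text` as a hypothetical helper and a made-up sentence:

```python
def rebuild_text(words, spaces):
    """Join words, adding a single space after each word whose flag is True."""
    return "".join(word + (" " if space else "") for word, space in zip(words, spaces))


text = "Thrun wasn't taken seriously."
words = text.split()             # whitespace-only split; never yields empty or space tokens
spaces = [True] * len(words)
spaces[-1] = False               # no space after the last word

print(words)
print(rebuild_text(words, spaces) == text)
```

Unlike the model's tokenizer, this split keeps `seriously.` and `wasn't` as single words, which is the difference the custom doc is meant to preserve.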

In [24]:
from spacy.tokens import Doc # https://spacy.io/api/doc

import pandas as pd
dpath = 'ner-token-per-line.biluo' # It not necessarily 
df = pd.read_csv(dpath, sep=',')
words  = df.word.values
ents = df.label.values
text = ' '.join(words)

spaces = [True]*len(words)
spaces[-1] = False  # no trailing space after the last word
doc = Doc(nlp.vocab, words=words, spaces=spaces) # Custom Doc
g = GoldParse(doc, entities=ents)


### Note
 * Reading this paper by [Akbik et al.](https://alanakbik.github.io/papers/coling2018.pdf) should help in understanding the algorithm behind sequence labelling, i.e. multi-word entities.