<h2><b>Notebook to Investigate the Effect of Training an NER Model on the Detection of Generic, Pre-trained Entities</b></h2>

<h3><b>A. Setting up & training the NER model</b></h3>

1. Importing Relevant Modules

In [38]:
import spacy
import re
import pandas as pd
from spacy.util import minibatch, compounding
from pathlib import Path
from spacy import displacy
from spacy.tokens import Doc
from spacy.training import Example
import random
nlp=spacy.load('en_core_web_sm')
ner=nlp.get_pipe("ner")

2. Importing & formatting training data
- The training data is stored as an annotated XML file describing aircraft upset incidents
- Each sentence is converted to a (text, {"entities": [(start, end, label)]}) tuple, the format spaCy expects, and collected in a list

In [52]:
marked_up_data = open("marked_up_training_data.xml").read().split('\n')
train_data = []
for sentence in marked_up_data:
    if sentence == '':
        continue
    clean = re.sub('<[^<]+?>', '', sentence)
    data = (clean, {"entities":[]})
    offset = 0
    active = False
    while '<'  in sentence:
        start = sentence.index('<')
        end = sentence.index('>')
        if not active:
            active = True
            sentence = sentence[end+1:]
            offset += start
        else:
            real_start = offset
            real_end = offset + start
            ent = clean[real_start:real_end]
            kind = sentence[start+2:end].upper()
            sentence = sentence[end+1:]
            active = False
            offset += start
            #print(real_start,real_end,ent, '-', kind)
            data[1]["entities"].append((real_start,real_end,kind))    
    train_data.append(data)
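
To make the target format concrete, here is a minimal, self-contained sketch of the same conversion for a single sentence. It assumes the markup is flat inline tagging of the form `<label>text</label>` with no nesting or attributes, which is what the loop above relies on. The function name `to_spacy_format` and the tag names `aircraft` and `event` are hypothetical; the real tag set in the XML file may differ.

```python
import re

# Matches one <label>text</label> pair (assumed: flat, non-nested tags)
TAG_PAIR = re.compile(r"<(\w+)>(.*?)</\1>")

def to_spacy_format(sentence):
    """Return (clean_text, {"entities": [(start, end, LABEL), ...]})."""
    text_parts = []
    entities = []
    cursor = 0      # position in the tagged input
    clean_len = 0   # length of the tag-free text built so far
    for match in TAG_PAIR.finditer(sentence):
        before = sentence[cursor:match.start()]
        text_parts.append(before)
        clean_len += len(before)
        start = clean_len
        end = start + len(match.group(2))
        entities.append((start, end, match.group(1).upper()))
        text_parts.append(match.group(2))
        clean_len = end
        cursor = match.end()
    text_parts.append(sentence[cursor:])
    return "".join(text_parts), {"entities": entities}

# Hypothetical sentence and tag names, for illustration only
example = to_spacy_format(
    "The <aircraft>Boeing 737</aircraft> entered a <event>stall</event>."
)
print(example)
```

The character offsets index into the tag-free text, so slicing the clean text with each (start, end) pair should return exactly the annotated span.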

3. Training the model on the marked-up data

In [54]:
# Add the training-data labels to the NER component
for _, annotations in train_data:
    for ent in annotations.get("entities"):
        ner.add_label(ent[2])
# Disable every pipeline component except the ones listed here
pipe_exceptions = ["ner", "trf_wordpiecer", "trf_tok2vec"]
unaffected_pipes = [pipe for pipe in nlp.pipe_names if pipe not in pipe_exceptions]
# TRAINING THE MODEL
with nlp.select_pipes(disable=unaffected_pipes):
    # Training for 30 iterations
    for iteration in range(30):
        # Shuffle the examples before every iteration
        random.shuffle(train_data)
        losses = {}
        # Batch up the examples using spaCy's minibatch
        batches = minibatch(train_data, size=compounding(4.0, 32.0, 1.001))
        for batch in batches:
            # Build one Example per (text, annotations) pair and update on the whole batch
            examples = [Example.from_dict(nlp.make_doc(text), annotations)
                        for text, annotations in batch]
            nlp.update(examples, losses=losses)
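
The batch-size schedule `compounding(4.0, 32.0, 1.001)` used above is easy to misread, so here is a pure-Python sketch of how such a schedule behaves. It is an illustrative re-implementation, not spaCy's own code. It assumes each step multiplies the previous value by the compound factor and caps it at the stop value, and that the batcher truncates the fractional size to an integer. The names `compounding_sketch` and `batches_until` are hypothetical.

```python
from itertools import islice

def compounding_sketch(start, stop, compound):
    """Yield start, start*compound, start*compound**2, ... capped at stop.

    Assumes start <= stop (a growing schedule).
    """
    value = float(start)
    while True:
        yield min(value, stop)
        value *= compound

def batches_until(size, start=4.0, stop=32.0, compound=1.001):
    """Count the batches drawn before the truncated batch size reaches `size`."""
    if size > stop:
        return None  # the schedule is capped at `stop`, so this is never reached
    for i, value in enumerate(compounding_sketch(start, stop, compound)):
        if int(value) >= size:
            return i

# First few raw schedule values
first_sizes = list(islice(compounding_sketch(4.0, 32.0, 1.001), 3))
print(first_sizes)

# Batches needed before the effective size becomes 5, and before it reaches the cap
print(batches_until(5), batches_until(32))
```

With a factor of 1.001 the effective batch size stays at 4 for a couple of hundred batches and needs around two thousand to reach 32. Because the schedule is recreated at the start of every iteration in the loop above, a small training set would likely be processed in batches of 4 throughout.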



<h3><b>B. Evaluating NER and its impact on generic entity recognition</b></h3>

<h4>Domain-specific testing data (other aircraft incident descriptions/narratives)</h4>

1. Entities identified with the untrained NER pipeline

In [56]:
untrained_nlp = spacy.load('en_core_web_sm')
untrained_ner = untrained_nlp.get_pipe("ner")
testing_data = open("testing_data.txt").read().split('\n')
for chunk in testing_data:
    if chunk != '':
        chunk = re.sub('<[^<]+?>', '', chunk)
        sent = untrained_nlp(chunk)
        displacy.render(sent, style = "ent")

2. Entities identified with the trained NER model

In [55]:
testing_data = open("testing_data.txt").read().split('\n')
for chunk in testing_data:
    if chunk != '':
        chunk = re.sub('<[^<]+?>', '', chunk)
        sent = nlp(chunk)
        displacy.render(sent, style = "ent")

<h4>Generic testing data with many generic entities</h4>

1. Entities identified with the untrained NER pipeline

In [64]:
untrained_nlp = spacy.load('en_core_web_sm')
untrained_ner = untrained_nlp.get_pipe("ner")
generic_text = open("generic_text_data.txt").read()
generic_pre_training = untrained_nlp(generic_text)
print('Named entities identified:', len(generic_pre_training.ents))
displacy.render(generic_pre_training, style = "ent")

Named entities identified: 80


2. Entities identified with the trained NER pipeline

In [1]:
generic_text = open("generic_text_data.txt").read()
sent = nlp(generic_text)
print('Named entities identified:', len(sent.ents))
displacy.render(sent, style = "ent")

NameError: name 'nlp' is not defined

<h3><b>Summary of Results</b></h3>

- Training the pre-trained NER pipeline to recognize domain-specific entities appears to eliminate its recognition of the generic entities it was originally trained on
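
One way to make this comparison more precise than a total entity count is to tally entities per label before and after training. The sketch below works on plain lists of label strings, which in this notebook could be obtained with `[ent.label_ for ent in generic_pre_training.ents]` and `[ent.label_ for ent in sent.ents]`. The helper name `compare_label_counts` is hypothetical, and the label lists in the example are made-up placeholders, not results from this notebook.

```python
from collections import Counter

def compare_label_counts(labels_before, labels_after):
    """Map each label to (count before training, count after training)."""
    before = Counter(labels_before)
    after = Counter(labels_after)
    all_labels = sorted(set(before) | set(after))
    # Counter returns 0 for labels that never occurred
    return {label: (before[label], after[label]) for label in all_labels}

# Placeholder labels for illustration only (not measured output)
labels_before = ["ORG", "GPE", "GPE", "DATE"]
labels_after = ["AIRCRAFT", "DATE"]

comparison = compare_label_counts(labels_before, labels_after)
for label, (n_before, n_after) in comparison.items():
    print(f"{label:10s} before={n_before} after={n_after}")
```

A label whose count drops to zero after training would be a generic entity type the trained pipeline no longer predicts on this text.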