# <center>Named Entity Recognition Pipeline Demonstrated with Three NLP Libraries: NLTK, Stanza, and spaCy</center>

***

### What is Named Entity Recognition (NER)?

**Named Entity Recognition (NER)** is a key technique in natural language processing aimed at extracting specific information from unstructured text. It is particularly valuable in scenarios where manually reading and annotating large volumes of documents is impractical.

A **named entity** refers to any real-world object or concept that has a proper name or label. These entities can vary widely depending on the domain but often include categories like:

- People
- Organizations
- Locations
- Countries
- Languages
- Works of art
- Dates and times
- Numerical data, such as quantities and measurements

---

In practice, entities can be tailored to the specific domain or dataset you are working with. For instance, if you are analyzing archaeological reports, you might define custom entities like "culture," "material," or "method" to extract more relevant insights.

NER involves both identifying and classifying these entities within a text. Entities can be as simple as single tokens (e.g., "Berlin" as a location) or span multiple tokens (e.g., "The Royal Society" as an organization or "Charles Robert Darwin" as a person). This ability to handle varying structures makes NER a flexible and powerful tool for text analysis.
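
To make the single-token versus multi-token distinction concrete, here is a minimal sketch of how per-token BIO tags (`B-` opens an entity, `I-` continues it, `O` is outside any entity) can be grouped back into entity spans. The helper `bio_to_spans` and the example tags are illustrative assumptions, not part of the project code.

```python
def bio_to_spans(tokens, tags):
    """Group BIO-tagged tokens into (entity_text, entity_type) spans."""
    spans, current, current_type = [], [], None
    for token, tag in zip(tokens, tags):
        prefix, _, etype = tag.partition("-")
        if prefix == "B" or (prefix == "I" and etype != current_type):
            # A new entity starts here; close any entity still open.
            if current:
                spans.append((" ".join(current), current_type))
            current, current_type = [token], etype
        elif prefix == "I":
            # Continuation of the entity that is currently open.
            current.append(token)
        else:
            # "O": outside any entity, so close the open one if present.
            if current:
                spans.append((" ".join(current), current_type))
            current, current_type = [], None
    if current:
        spans.append((" ".join(current), current_type))
    return spans


tokens = ["Charles", "Robert", "Darwin", "visited", "Berlin", "."]
tags = ["B-PER", "I-PER", "I-PER", "O", "B-LOC", "O"]
print(bio_to_spans(tokens, tags))
# [('Charles Robert Darwin', 'PER'), ('Berlin', 'LOC')]
```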

***

In this project, named entity taggers are implemented using models from three widely used NLP libraries: [**spaCy**](https://spacy.io/), [**nltk**](https://www.nltk.org/), and [**stanza**](https://stanfordnlp.github.io/stanza/). Each library is well documented, making it relatively straightforward to get named entity tagging up and running.

To facilitate testing and debugging, a small sample of 10 sentences is provided, along with their corresponding gold labels. These are contained in two files: **sample.txt.gz** (input) and **sample.conllu.gz** (gold labels). These sample files highlight cases where incorrect tokenization could lead to misalignment. It is highly recommended to use these files during the initial development phase before running the full dataset, because processing the complete dataset is slow: in the runs recorded in this notebook it took roughly 12 minutes for NLTK, 24 minutes for spaCy's small model, and about 20 hours for Stanza.

For convenience, code snippets to process the sample files line by line are included, along with information on the file formats. The headers and setup code required for downloading the necessary library models are also provided. These assume that you’ve set up an environment based on the included `requirements.txt` file.
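
As a rough illustration of the label file layout, the sketch below reads a gzip-compressed file with three tab-separated columns (token index, token, label) and a blank line between sentences. That layout is an assumption based on what the utility functions in this notebook read and write; `read_conllu_sentences` and the tiny demo file are hypothetical and created only for this example.

```python
import gzip
import os
import tempfile


def read_conllu_sentences(path):
    """Return a list of sentences, each a list of (token, label) pairs."""
    sentences, current = [], []
    with gzip.open(path, "rt", encoding="utf-8") as fh:
        for line in fh:
            fields = line.rstrip("\n").split("\t")
            if len(fields) == 3:
                _, token, label = fields
                current.append((token, label))
            elif current:
                # A blank line closes the current sentence.
                sentences.append(current)
                current = []
    if current:
        sentences.append(current)
    return sentences


# Hypothetical two-sentence file, written only for this demonstration.
demo_path = os.path.join(tempfile.mkdtemp(), "demo.conllu.gz")
with gzip.open(demo_path, "wt", encoding="utf-8") as fh:
    fh.write("0\tRaytheon\tB-ORG\n1\tgrows\tO\n\n0\tBerlin\tB-LOC\n\n")

sentences = read_conllu_sentences(demo_path)
print(sentences)
# [[('Raytheon', 'B-ORG'), ('grows', 'O')], [('Berlin', 'B-LOC')]]
```

Grouping tokens by sentence like this makes it easier to spot the point at which a tagger's output drifts out of alignment with the gold labels.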

Additionally, a set of utility functions is provided to handle tasks like logging, writing output files, and converting label formats.

This streamlined framework is designed to help you focus on implementing and comparing the models without worrying about peripheral setup issues.

***

### IMPORT HEADERS

***

In [1]:
import sys, os, gzip
from tqdm import tqdm
import time
import logging
from sklearn.metrics import classification_report
from sklearn.metrics import confusion_matrix
from sklearn.metrics import roc_auc_score
from sklearn.preprocessing import LabelBinarizer
import matplotlib.pyplot as plt
from sklearn.metrics import RocCurveDisplay

import stanza

import nltk
from nltk import pos_tag
from nltk.tag import StanfordNERTagger
from nltk.tokenize import word_tokenize
from nltk.tree import Tree
from nltk.chunk import conlltags2tree, tree2conlltags
from nltk.tokenize import WhitespaceTokenizer

import spacy
import spacy_transformers
from spacy.training.iob_utils import biluo_to_iob, doc_to_biluo_tags
from spacy.tokens import Doc




***

### GLOBAL LOGFILE INITIALISATION

***

In [2]:
# Move the logfile to a backup each time the kernel is rerun
backup_logfile = 'NER.log.bak'
logfile = 'NER.log'
if os.path.exists(backup_logfile):
  os.remove(backup_logfile)
if os.path.exists(logfile):
  os.rename(logfile, backup_logfile)

#Initialise logger
logger = logging.getLogger(__name__)
logging.basicConfig(filename='NER.log', 
                    level=logging.DEBUG)
# Example logging output types you can use.
# logger.debug('This debug message should go to the log file')
# logger.info('Info should be this')
# logger.warning('And this warning, too')
# logger.error('And an error')
# END GLOBAL LOGFILE INITIALISATION

In [3]:
# Make sure you have all of the models downloaded. You should only need to run this once.
def get_all_models():
  if not os.path.exists('models.done'):
    print('Downloading necessary model files.')
    logger.info(f'\nDownloading necessary model files.\n')
    nltk.download('punkt')
    nltk.download('averaged_perceptron_tagger')
    nltk.download('maxent_ne_chunker')
    nltk.download('words')
    spacy.cli.download('en_core_web_sm')
    spacy.cli.download('en_core_web_md')
    spacy.cli.download('en_core_web_lg')
    spacy.cli.download('en_core_web_trf')
    open('models.done', 'w').close()
  else:
    print ('Models have been downloaded. To re-enable, remove the models.done file in your working directory.')
    logger.info(f'\nModels have been downloaded. To re-enable, remove the models.done file in your working directory.\n')

# If something fails, then remove the file 'models.done' in your working directory and the downloads will run again.
get_all_models()


Models have been downloaded. To re-enable, remove the models.done file in your working directory.


***

### UTILITY FUNCTIONS

***

In [4]:
# Decorator function to compute function running time.
import functools

def nlptimer(func):
  """Function decorator to measure execution time of functions."""
  @functools.wraps(func)  # Preserve the wrapped function's name and docstring
  def wrapper(*args, **kwargs):
    start_time = time.perf_counter()  # Monotonic clock suited to timing
    result = func(*args, **kwargs)  # Execute the function
    execution_time = time.perf_counter() - start_time  # Elapsed seconds
    print(f'\n{func.__name__} executed in: {execution_time} seconds\n')
    logger.info(f'\n{func.__name__} executed in: {execution_time} seconds\n')
    return result
  return wrapper

# Convert Ontonotes 5 formatted labels to conll03 format.
@nlptimer
def convert_ontonotes_to_conll(input_file, output_file):
  # Mapping from OntoNotes 5 labels to CoNLL-2003 BIO labels
  ontonotes_to_conll = {
    "B-PERSON": "B-PER",
    "B-GPE": "B-LOC",
    "B-LOC": "B-LOC",
    "B-ORG": "B-ORG",
    "B-NORP": "B-MISC",
    "B-FAC": "B-MISC",
    "B-PRODUCT": "B-MISC",
    "B-EVENT": "B-MISC",
    "B-WORK_OF_ART": "B-MISC",
    "B-LAW": "B-MISC",
    "B-LANGUAGE": "B-MISC",
    "I-PERSON": "I-PER",
    "I-GPE": "I-LOC",
    "I-LOC": "I-LOC",
    "I-ORG": "I-ORG",
    "I-NORP": "I-MISC",
    "I-FAC": "I-MISC",
    "I-PRODUCT": "I-MISC",
    "I-EVENT": "I-MISC",
    "I-WORK_OF_ART": "I-MISC",
    "I-LAW": "I-MISC",
    "I-LANGUAGE": "I-MISC",
    # The following labels do not exist in CoNLL-2003 and are thus ignored
    "B-DATE": "O",
    "B-TIME": "O",
    "B-PERCENT": "O",
    "B-MONEY": "O",
    "B-QUANTITY": "O",
    "B-ORDINAL": "O",
    "B-CARDINAL": "O",
    "I-DATE": "O",
    "I-TIME": "O",
    "I-PERCENT": "O",
    "I-MONEY": "O",
    "I-QUANTITY": "O",
    "I-ORDINAL": "O",
    "I-CARDINAL": "O"
  }

  with gzip.open(input_file, 'rt') as infile, gzip.open(output_file, 'wt') as outfile:
    for line in infile:
      if line.strip() == '':
        outfile.write('\n')
        continue
      pos, token, ontonotes_label = line.strip().split('\t')
      conll_label = ontonotes_to_conll.get(ontonotes_label, None)
      if conll_label:
        outfile.write(f"{pos}\t{token}\t{conll_label}\n")
      else:
        # If the label does not map to a CoNLL-2003 label, write the token with 'O'
        outfile.write(f"{pos}\t{token}\tO\n")
  return

# Convert NLTK formatted labels to CoNLL-2003.
@nlptimer
def convert_nltk_to_conll(input_file, output_file):
  # Mapping from NLTK chunker labels to CoNLL-2003 BIO labels
  nltk_to_conll = {
    "B-FACILITY": "B-MISC",
    "I-FACILITY": "I-MISC",
    "B-GPE": "B-LOC",
    "I-GPE": "I-LOC",
    "B-GSP": "B-LOC",
    "I-GSP": "I-LOC",
    "B-LOCATION": "B-LOC",
    "I-LOCATION": "I-LOC",
    "B-ORGANIZATION": "B-ORG",
    "I-ORGANIZATION": "I-ORG",
    "B-PERSON": "B-PER",
    "I-PERSON": "I-PER"
  }
  with gzip.open(input_file, 'rt') as infile, gzip.open(output_file, 'wt') as outfile:
    for line in infile:
      if line.strip() == '':
        outfile.write('\n')
        continue
      pos, token, nltk_label = line.strip().split('\t')
      conll_label = nltk_to_conll.get(nltk_label, None)
      if conll_label:
        outfile.write(f"{pos}\t{token}\t{conll_label}\n")
      else:
        # If the label does not map to a CoNLL-2003 label, write the token with 'O'
        outfile.write(f"{pos}\t{token}\tO\n")
  return

# Score a conllu03 formatted file.
@nlptimer
def score_conllu_file(input_file, gold_label_file):
  # Since 'O' is the dominant label, and not a named entity, we will ignore
  # tokens with this label since it would unbalance the classes being
  # evaluated.
  label_names = ['PER','MISC','LOC','ORG']
  gold_labels = []
  predicted_labels = []
  with gzip.open(input_file, 'rt') as tst, gzip.open(gold_label_file, 'rt') as gold:
    for tst_ln, gold_ln in zip(tst,gold):
      tstfld = tst_ln.strip().split('\t')
      goldfld = gold_ln.strip().split('\t')
      # Verify that current token matches
      # If test file and gold file are not aligned, exit with error.
      if len(goldfld) == 3 and len(tstfld) == 3:
        # Second item in both files should be the term that is tagged.
        if tstfld[1] != goldfld[1]:
          print('ERROR: Token {} does not match token {}'.format(tstfld[1],goldfld[1]))
          logger.error('Token {} does not match token {}'.format(tstfld[1],goldfld[1]))
          sys.exit(-1)
        else:
          # Strip the B-/I- prefix so that only the entity type is compared.
          g = goldfld[2].split('-', 1)[1] if '-' in goldfld[2] else goldfld[2]
          t = tstfld[2].split('-', 1)[1] if '-' in tstfld[2] else tstfld[2]
          # Since 'O' is the majority of labels, we will filter them out now.
          if t != 'O' and g != 'O':
            gold_labels.append(g)
            predicted_labels.append(t)

  # Now ensure that only valid labels are in the predicted labels.
  # If this fails, you forgot to convert labels from ontonotes to conll03.
  sanity_labels = ['PER','MISC','LOC','ORG', 'O']
  for label in predicted_labels:
    if label not in sanity_labels:
      print('ERROR: Label {} not in valid label set {}.'.format(label,sanity_labels))
      print('       Please ensure that you have converted to conll03 format.')
      logger.error('Label {} not in valid label set {}.'.format(label,sanity_labels))
      logger.error('Please ensure that you have converted to conll03 format.')
      sys.exit(-1)

  # Print a summary evaluation report.
  cr = classification_report(gold_labels, predicted_labels, labels=label_names)
  print('\n{}\n'.format(cr))
  logger.info('\n{}\n'.format(cr))
  # Print the full confusion matrix
  cm = confusion_matrix(gold_labels, predicted_labels, labels=label_names)
  print('\n{}\n'.format(cm))
  logger.info('\n{}\n'.format(cm))

  # One-hot encode the multiclass labels so that a one-versus-rest AUC can be
  # computed. roc_auc_score expects (y_true, y_score) in that order.
  label_binarizer = LabelBinarizer().fit(gold_labels)
  y_onehot_true = label_binarizer.transform(gold_labels)
  y_onehot_pred = label_binarizer.transform(predicted_labels)
  micro_roc_auc_ovr = roc_auc_score(y_onehot_true, y_onehot_pred,
                                    multi_class="ovr", average="micro")
  print(f"\nMicro-averaged One-vs-Rest ROC AUC score:\n{micro_roc_auc_ovr:.2f}\n")
  logger.info(f"\nMicro-averaged One-vs-Rest ROC AUC score:\n{micro_roc_auc_ovr:.2f}\n")
  return

# Utility function to write a gzip conllu file.
# Example:
# output_text = ne_spacy_tagger('input.txt.gz', 'en_core_web_lg')
# write_conllu_gzip('spacy-weblg.conllu.gz', output_text)
@nlptimer
def write_conllu_gzip(output_file, output_text):
  with gzip.open(output_file, 'wt') as fh:
      fh.write(output_text)
  return

# Utility function to get the number of lines for the tqdm progress bar.
# The complete input file should contain 249212 lines.
@nlptimer
def get_file_len(input_file):
  with gzip.open(input_file, 'rt') as fh:
    num_lines = sum(1 for line in fh)
  return num_lines

***

### END UTILITY FUNCTIONS

***
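
One consequence of the scoring logic is worth seeing on a toy example: `score_conllu_file` keeps a token only when both the gold label and the predicted label are entity labels. The sketch below mirrors that filter in plain Python; `strip_prefix`, `keep_scored_pairs`, and the label sequences are hypothetical and only illustrate the behaviour.

```python
def strip_prefix(label):
    """Drop the B-/I- prefix, leaving the entity type (or 'O')."""
    return label.split("-", 1)[1] if "-" in label else label


def keep_scored_pairs(gold, pred):
    """Keep (gold, predicted) pairs where neither side is 'O'."""
    pairs = []
    for g, p in zip(gold, pred):
        g, p = strip_prefix(g), strip_prefix(p)
        if g != "O" and p != "O":
            pairs.append((g, p))
    return pairs


gold = ["B-PER", "I-PER", "O", "B-LOC", "B-ORG"]
pred = ["B-PER", "O", "B-MISC", "B-ORG", "B-ORG"]
pairs = keep_scored_pairs(gold, pred)
print(pairs)
# [('PER', 'PER'), ('LOC', 'ORG'), ('ORG', 'ORG')]
```

The missed entity token (gold `I-PER`, predicted `O`) and the spurious one (gold `O`, predicted `B-MISC`) both drop out of the evaluation. This is why the support counts in the reports differ from model to model, and it means the scores describe how well a model types the tokens it tagged as entities, not how many entities it found.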

#### Below is a template function that demonstrates how to walk the input file, which has one tokenized sentence per line, and output a TSV file with one parsed token per line. Each output sentence is separated by a single blank line. This code does not show how to configure the parsing pipeline for each of the three libraries; it only demonstrates the parser's output format.

In [5]:
def example_parser(input_file):
  output_file = []
  num_lines = get_file_len(input_file)
  with gzip.open(input_file, 'rt') as fh:
    for ln in tqdm(fh, total=num_lines):
      tokens = ln.strip().split()
      for i,t in enumerate(tokens):
        output_file.append('{}\t{}\t{}'.format(i, t, 'A_LABEL'))
      output_file.append('')
  return '\n'.join(output_file)

# Run on the sample file.
output_text = example_parser('sample.txt.gz')
print (output_text)


get_file_len executed in: 0.0003497600555419922 seconds



100%|████████████████████████████████████████████████████████████████████████████████| 10/10 [00:00<00:00, 45003.26it/s]

0	The	A_LABEL
1	resulting	A_LABEL
2	implementation	A_LABEL
3	of	A_LABEL
4	the	A_LABEL
5	IA-64	A_LABEL
6	64-bit	A_LABEL
7	architecture	A_LABEL
8	was	A_LABEL
9	the	A_LABEL
10	Itanium	A_LABEL
11	,	A_LABEL
12	finally	A_LABEL
13	introduced	A_LABEL
14	in	A_LABEL
15	June	A_LABEL
16	2001	A_LABEL
17	.	A_LABEL

0	Raytheon	A_LABEL
1	is	A_LABEL
2	also	A_LABEL
3	working	A_LABEL
4	with	A_LABEL
5	the	A_LABEL
6	Missile	A_LABEL
7	Defense	A_LABEL
8	Agency	A_LABEL
9	to	A_LABEL
10	develop	A_LABEL
11	the	A_LABEL
12	Network	A_LABEL
13	Centric	A_LABEL
14	Airborne	A_LABEL
15	Defense	A_LABEL
16	Element	A_LABEL
17	(	A_LABEL
18	NCADE	A_LABEL
19	)	A_LABEL
20	,	A_LABEL
21	an	A_LABEL
22	anti-ballistic	A_LABEL
23	missile	A_LABEL
24	derived	A_LABEL
25	from	A_LABEL
26	the	A_LABEL
27	AIM-120	A_LABEL
28	.	A_LABEL

0	Much	A_LABEL
1	of	A_LABEL
2	his	A_LABEL
3	work	A_LABEL
4	at	A_LABEL
5	the	A_LABEL
6	patent	A_LABEL
7	office	A_LABEL
8	related	A_LABEL
9	to	A_LABEL
10	questions	A_LABEL
11	about	A_LABEL
12	transmission	A_LABE




***

## Stanza NER Parser

***

Stanza is a Python natural language analysis package that gathers accurate and efficient tools for many human languages in one place. Its tools can be used in a pipeline to convert a string of raw text into lists of sentences and words, generate the base forms of those words along with their parts of speech and morphological features, produce a syntactic dependency parse, and recognize named entities. The toolkit is designed to be parallel among more than 60 languages, using the Universal Dependencies formalism.

Its main features are:

- A native Python implementation that requires minimal effort to set up.
- A full neural network pipeline for robust text analytics, including tokenization, multi-word token (MWT) expansion, lemmatization, part-of-speech (POS) and morphological feature tagging, dependency parsing, and named entity recognition.
- Pretrained neural models supporting 66 (human) languages.
- A stable, officially maintained Python interface to CoreNLP.

In [6]:
@nlptimer
def ne_stanza_tagger(input_file):
    output_file = []
    num_lines = get_file_len(input_file)
    nlp = stanza.Pipeline(lang='en', processors={'ner': 'conll03_charlm'}, tokenize_pretokenized=True, use_gpu=True)
    with gzip.open(input_file, 'rt') as fh:
        for ln in tqdm(fh, total=num_lines):
            doc = nlp (ln)
            for sent in doc.sentences:
                for i, t in enumerate(sent.tokens):
                    output_file.append('{}\t{}\t{}'.format(i, t.text, t.ner))
            output_file.append('')
    return '\n'.join(output_file)

***

In [7]:
# Run the stanza parser and generate a compressed conllu output of the results.
# Toggle between the commented out sample / full dataset.

print('\nRun Stanza to tag named entities.\n')
logger.info('\nRun Stanza to tag named entities.\n')
#output_text = ne_stanza_tagger('sample.txt.gz')
output_text = ne_stanza_tagger('input.txt.gz')

print('\nWriting Stanza output\n')
logger.info('\nWriting Stanza output\n')
write_conllu_gzip('stanza.conllu.gz', output_text)

print('\nEvaluate the Stanza model tags\n')
logger.info('\nEvaluate the Stanza model tags\n')
# score_conllu_file('stanza.conllu.gz', 'sample.conllu.gz')
score_conllu_file('stanza.conllu.gz', 'input.conllu.gz')


Run Stanza to tag named entities.



2024-08-21 01:58:33 INFO: Checking for updates to resources.json in case models have been updated.  Note: this behavior can be turned off with download_method=None or download_method=DownloadMethod.REUSE_RESOURCES



get_file_len executed in: 0.15065765380859375 seconds



Downloading https://raw.githubusercontent.com/stanfordnlp/stanza-resources/main/resources_1.8.0.json:   0%|   …

2024-08-21 01:58:33 INFO: Downloaded file to /home/nqmtien/stanza_resources/resources.json
2024-08-21 01:58:34 INFO: Loading these models for language: en (English):
| Processor    | Package             |
--------------------------------------
| tokenize     | combined            |
| mwt          | combined            |
| pos          | combined_charlm     |
| lemma        | combined_nocharlm   |
| constituency | ptb3-revised_charlm |
| depparse     | combined_charlm     |
| sentiment    | sstplus_charlm      |
| ner          | conll03_charlm      |

2024-08-21 01:58:34 INFO: Using device: cuda
2024-08-21 01:58:34 INFO: Loading: tokenize
2024-08-21 01:58:34 INFO: Loading: mwt
2024-08-21 01:58:34 INFO: Loading: pos
2024-08-21 01:58:3


ne_stanza_tagger executed in: 72365.4467689991 seconds


Writing Stanza output


write_conllu_gzip executed in: 8.345110654830933 seconds


Evaluate the Stanza model tags


              precision    recall  f1-score   support

         PER       0.91      0.93      0.92    212736
        MISC       0.81      0.72      0.76    178313
         LOC       0.86      0.86      0.86    197732
         ORG       0.70      0.77      0.73    146036

    accuracy                           0.83    734817
   macro avg       0.82      0.82      0.82    734817
weighted avg       0.83      0.83      0.83    734817



[[197729   5484   3837   5686]
 [ 11186 128684  10139  28304]
 [  4173   8711 169837  15011]
 [  4544  16477  12557 112458]]


Micro-averaged One-vs-Rest ROC AUC score:
0.89


score_conllu_file executed in: 10.344003677368164 seconds



***
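
The per-class figures in the Stanza report can be reproduced from the confusion matrix printed above, which helps when reading the other models' results. The sketch assumes the scikit-learn convention that rows are gold labels and columns are predicted labels, in the order PER, MISC, LOC, ORG; the variable names are illustrative.

```python
labels = ["PER", "MISC", "LOC", "ORG"]
# Confusion matrix copied from the Stanza output (rows: gold, columns: predicted).
cm = [
    [197729, 5484, 3837, 5686],
    [11186, 128684, 10139, 28304],
    [4173, 8711, 169837, 15011],
    [4544, 16477, 12557, 112458],
]

precision, recall = {}, {}
for i, name in enumerate(labels):
    gold_total = sum(cm[i])                  # row sum = support for this class
    pred_total = sum(row[i] for row in cm)   # column sum = times predicted
    recall[name] = cm[i][i] / gold_total
    precision[name] = cm[i][i] / pred_total

total = sum(sum(row) for row in cm)
accuracy = sum(cm[i][i] for i in range(len(labels))) / total

for name in labels:
    print(f"{name:>4}  precision={precision[name]:.2f}  recall={recall[name]:.2f}")
print(f"accuracy={accuracy:.2f}")
```

For example, the ORG column shows that many MISC tokens (28304) are predicted as ORG, which is what pulls ORG precision down to 0.70.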

## NLTK NER Parser

***

## Quick Overview of NLTK

NLTK stands for the Natural Language Toolkit and is written by two eminent computational linguists, Steven Bird (Senior Research Associate of the LDC and professor at the University of Melbourne) and Ewan Klein (Professor of Linguistics at Edinburgh University). NLTK provides a combination of natural language corpora, lexical resources, and example grammars with language processing algorithms, methodologies and demonstrations for a very Pythonic "batteries included" view of Natural Language Processing.

As such, NLTK is perfect for research-driven (hypothesis-driven) workflows for agile data science. Its suite of libraries includes:

- tokenization, stemming, and tagging
- chunking and parsing
- language modeling
- classification and clustering
- logical semantics

NLTK is a useful pedagogical resource for learning NLP with Python and serves as a starting place for producing production grade code that requires natural language analysis. It is also important to understand what NLTK is _not_:

- Production ready out of the box
- Lightweight
- Generally applicable
- Magic

NLTK provides a variety of tools that can be used to explore the linguistic domain, but it is not a lightweight dependency that can be easily included in other workflows, especially those that require unit and integration testing or other build processes. This stems from the fact that NLTK includes not only a lot of added code but also a rich and complete library of corpora that power the built-in algorithms.

### The Good parts of NLTK

- Preprocessing
    - segmentation
    - tokenization
    - PoS tagging
- Word level processing
    - WordNet
    - Lemmatization
    - Stemming
    - NGrams
- Utilities
    - Tree
    - FreqDist
    - ConditionalFreqDist
    - Streaming CorpusReaders
- Classification
    - Maximum Entropy
    - Naive Bayes
    - Decision Tree
- Chunking
- Named Entity Recognition
- Parsers Galore!

### The Bad parts of NLTK

- Syntactic Parsing

    - No included grammar (not a black box)
    - No Feature/Dependency Parsing
    - No included feature grammar

- The sem package
    
    - Toy only (lambda-calculus & first order logic)

- Lots of extra stuff (heavyweight dependency)

    - papers, chat programs, alignments, etc.

## Named Entities

NLTK ships a MaxEnt-backed named entity chunker (`ne_chunk`) trained on the ACE corpus; the part-of-speech tagger it relies on is trained on the Penn Treebank. The chunker code is very readable, so you can retrain it or extend it with a gazetteer if you'd like.

Note that NLTK uses a different tagset than the gold labels (for example `PERSON`, `ORGANIZATION`, `GPE`, and `FACILITY`), so its labels are automatically converted for you in the runtime code using a remapping function.

In [8]:
@nlptimer
def ne_nltk_tagger(input_file):
    output_file = []
    num_lines = get_file_len(input_file)
    tokenizer = WhitespaceTokenizer()
    with gzip.open(input_file, 'rt') as fh:
        for ln in tqdm(fh, total=num_lines):
            tokens = tokenizer.tokenize(ln)
            pos_tags = pos_tag(tokens)
            chunks = nltk.chunk.ne_chunk(pos_tags)
            bio_tags = tree2conlltags(chunks)
            i = 0
            for w, t, n in bio_tags:
                output_file.append('{}\t{}\t{}'.format(i, w, n))
                i += 1
            output_file.append('')
    return('\n'.join(output_file))

***

In [9]:
# Run the NLTK parser and generate a compressed conllu output of the results.
# Toggle between the sample and full collections as you develop your code.

print('\nRun NLTK to tag named entities.\n')
logger.info('\nRun NLTK to tag named entities.\n')
#output_text = ne_nltk_tagger('sample.txt.gz')
output_text = ne_nltk_tagger('input.txt.gz')

print('\nWriting NLTK output\n')
logger.info('\nWriting NLTK output\n')
write_conllu_gzip('nltk.gz', output_text)

# Convert NLTK labels to CoNLL-2003
print('\nConvert NLTK output format to conll03.\n')
logger.info('\nConvert NLTK output format to conll03.\n')
convert_nltk_to_conll('nltk.gz', 'nltk.conllu.gz')

print('\nEvaluate the NLTK model tags\n')
logger.info('\nEvaluate the NLTK model tags\n')
# score_conllu_file('nltk.conllu.gz', 'sample.conllu.gz')
score_conllu_file('nltk.conllu.gz', 'input.conllu.gz')


Run NLTK to tag named entities.


get_file_len executed in: 0.26224255561828613 seconds



100%|██████████████████████████████████████████████████████████████████████████| 249212/249212 [11:40<00:00, 355.56it/s]



ne_nltk_tagger executed in: 701.5920491218567 seconds


Writing NLTK output


write_conllu_gzip executed in: 9.506311655044556 seconds


Convert NLTK output format to conll03.


convert_nltk_to_conll executed in: 13.825253963470459 seconds


Evaluate the NLTK model tags


              precision    recall  f1-score   support

         PER       0.64      0.84      0.73    198101
        MISC       0.15      0.01      0.01    109178
         LOC       0.58      0.55      0.56    179040
         ORG       0.42      0.59      0.49    120924

    accuracy                           0.56    607243
   macro avg       0.45      0.50      0.45    607243
weighted avg       0.49      0.56      0.50    607243



[[166898    123  19253  11827]
 [ 29876    700  33601  45001]
 [ 35608   2518  98782  42132]
 [ 28265   1377  19736  71546]]


Micro-averaged One-vs-Rest ROC AUC score:
0.70


score_conllu_file executed in: 11.723029851913452 seconds



***
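
The micro-averaged one-vs-rest AUC reported above is computed from hard 0/1 predictions rather than from probabilities. Under that assumption the ROC curve has a single operating point, so the AUC reduces to (1 + TPR - FPR) / 2, and with exactly one label per token it becomes a simple function of accuracy. The worked check below applies this to the NLTK confusion matrix; `micro_ovr_auc_from_hard_labels` is a hypothetical helper written for this illustration.

```python
def micro_ovr_auc_from_hard_labels(n_correct, n_total, n_classes):
    """AUC of one-hot hard predictions, micro-averaged over all classes.

    After flattening the one-hot matrices there are n_total positives and
    n_total * (n_classes - 1) negatives. Each wrong prediction contributes
    exactly one false negative and one false positive.
    """
    n_wrong = n_total - n_correct
    tpr = n_correct / n_total
    fpr = n_wrong / (n_total * (n_classes - 1))
    return (1 + tpr - fpr) / 2


# Confusion matrix copied from the NLTK output (rows: gold, columns: predicted).
cm = [
    [166898, 123, 19253, 11827],
    [29876, 700, 33601, 45001],
    [35608, 2518, 98782, 42132],
    [28265, 1377, 19736, 71546],
]
n_total = sum(sum(row) for row in cm)
n_correct = sum(cm[i][i] for i in range(len(cm)))

auc = micro_ovr_auc_from_hard_labels(n_correct, n_total, n_classes=len(cm))
print(f"accuracy={n_correct / n_total:.2f}  micro OvR AUC={auc:.2f}")
```

With four classes this simplifies to (2 × accuracy + 1) / 3, so the AUC figure carries no information beyond the accuracy already shown in the classification report. A more informative AUC would need per-class probability scores from the model.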

## SpaCy NER Parser

***

## NER with Machine Learning using spaCy
[spaCy](https://spacy.io) is a free and open-source package that can be used to perform automated named entity recognition.

spaCy's `sm`, `md`, and `lg` pipelines use a form of machine learning called **convolutional neural networks** (CNNs) for named entity recognition, while `en_core_web_trf` is built on a transformer. The nature of these neural networks is out of scope for this workshop, but in general terms such a network produces a **statistical model** that is used to **predict** the most likely named entities in a text. This type of machine learning is described as being **supervised**, which means it must be trained on data that has been correctly labelled with named entities before it can do automated labelling on data it has never seen before.

As a statistical technique, the predictions might be wrong; in fact, they often are, and it's usually necessary to re-train and adjust the model until its predictions are better. We will cover training a model in the following notebooks.

Fortunately, we don't need to start completely from scratch because spaCy provides a range of [pre-trained language models](https://spacy.io/usage/models#languages) for modern languages such as English, Spanish, French and German. For other languages you can train your own and use them with spaCy or re-train one of the existing models. In this notebook, we will be using four different English models for comparison.

The English models that spaCy offers have been trained on the [OntoNotes corpus](https://catalog.ldc.upenn.edu/LDC2013T19). These models can predict the following entities 'out of the box':

| Type | Description |
|---|---|
| `PERSON` | People, including fictional. |
| `NORP` | Nationalities or religious or political groups. |
| `FAC` | Buildings, airports, highways, bridges, etc. |
| `ORG` | Companies, agencies, institutions, etc. |
| `GPE` | Countries, cities, states. |
| `LOC` | Non-GPE locations, mountain ranges, bodies of water. |
| `PRODUCT` | Objects, vehicles, foods, etc. (Not services.) |
| `EVENT` | Named hurricanes, battles, wars, sports events, etc. |
| `WORK_OF_ART` | Titles of books, songs, etc. |
| `LAW` | Named documents made into laws. |
| `LANGUAGE` | Any named language. |
| `DATE` | Absolute or relative dates or periods. |
| `TIME` | Times smaller than a day. |
| `PERCENT` | Percentage, including "%". |
| `MONEY` | Monetary values, including unit. |
| `QUANTITY` | Measurements, as of weight or distance. |
| `ORDINAL` | "first", "second", etc. |
| `CARDINAL` | Numerals that do not fall under another type. |

Note that these do not align with the CoNLL-2003 labeled data that we are using, which includes only four different labels (PER, LOC, ORG, and MISC). For convenience, I have included a function that will automatically remap the spaCy named entity labels to the gold labels we have available.
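
The remapping can be sketched compactly by translating only the entity type and keeping the `B-`/`I-` prefix. This is a simplified illustration that follows the same type mapping as the notebook's `convert_ontonotes_to_conll`, not a replacement for it; `remap_label` and `ONTONOTES_TYPE_TO_CONLL` are hypothetical names introduced here.

```python
# OntoNotes entity types that have a CoNLL-2003 counterpart.
# Types missing from this table (DATE, MONEY, CARDINAL, ...) become 'O'.
ONTONOTES_TYPE_TO_CONLL = {
    "PERSON": "PER",
    "GPE": "LOC",
    "LOC": "LOC",
    "ORG": "ORG",
    "NORP": "MISC",
    "FAC": "MISC",
    "PRODUCT": "MISC",
    "EVENT": "MISC",
    "WORK_OF_ART": "MISC",
    "LAW": "MISC",
    "LANGUAGE": "MISC",
}


def remap_label(label):
    """Map one OntoNotes BIO label to its CoNLL-2003 BIO label."""
    prefix, sep, etype = label.partition("-")
    if not sep:
        return "O"  # Already outside any entity.
    target = ONTONOTES_TYPE_TO_CONLL.get(etype)
    return f"{prefix}-{target}" if target else "O"


for label in ["B-GPE", "I-WORK_OF_ART", "B-DATE", "O"]:
    print(label, "->", remap_label(label))
```

Keying the table on the entity type avoids writing every `B-` and `I-` variant by hand, which is where duplicated or missing entries tend to creep in.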

spaCy is arguably the most popular library for information extraction tasks, and it includes several different models. The most difficult part is getting whitespace tokenization to work correctly, because it is non-trivial to disable all of the special tokenization rules built into spaCy's tokenizer. For that reason, I have included a `SpWhitespaceTokenizer` class that can be used in the spaCy pipeline.

In [10]:
# Replace spaCy's tokenizer with a whitespace-only tokenizer.

class SpWhitespaceTokenizer:
    def __init__(self, vocab):
        self.vocab = vocab
    def __call__(self, text):
        words = text.split(" ")
        spaces = [True] * len(words)
        # Avoid zero-length tokens
        for i, word in enumerate(words):
            if word == "":
                words[i] = " "
                spaces[i] = False
        # Remove the final trailing space
        if words[-1] == " ":
            words = words[0:-1]
            spaces = spaces[0:-1]
        else:
            spaces[-1] = False
        return Doc(self.vocab, words=words, spaces=spaces)

***
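
To see what the whitespace tokenizer hands to spaCy, the sketch below reproduces its word and space bookkeeping in plain Python, without constructing a `Doc`. The function `split_words_and_spaces` is a hypothetical stand-in that mirrors the logic of `SpWhitespaceTokenizer.__call__`; the input string is made up for the demonstration.

```python
def split_words_and_spaces(text):
    """Mirror the words/spaces lists built by the whitespace tokenizer."""
    words = text.split(" ")
    spaces = [True] * len(words)
    # Consecutive spaces produce empty strings; turn each into a " " token.
    for i, word in enumerate(words):
        if word == "":
            words[i] = " "
            spaces[i] = False
    # Drop a trailing whitespace token, otherwise mark the last word as
    # not being followed by a space.
    if words[-1] == " ":
        words, spaces = words[:-1], spaces[:-1]
    else:
        spaces[-1] = False
    return words, spaces


words, spaces = split_words_and_spaces("Berlin is  big\n")
print(words)   # ['Berlin', 'is', ' ', 'big\n']
print(spaces)  # [True, True, False, False]
```

Two details matter for alignment with the gold file. The newline stays attached to the last token, which is why the tagger function strips a trailing `\n` from the token text. A doubled space yields an extra whitespace token, so any such line in the input would shift the token indexes relative to a plain `str.split()`.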

In [11]:
@nlptimer
def ne_spacy_tagger(input_file, model_name):
    nlp_en = spacy.load(model_name)
    output_lines = []
    nlp_en.tokenizer = SpWhitespaceTokenizer(nlp_en.vocab)
    num_lines = get_file_len(input_file)
    with gzip.open(input_file, 'rt') as fh:
        for ln in tqdm(fh, total=num_lines):
            doc = nlp_en(ln)
            for i, token in enumerate(doc):
                ent_label = token.ent_iob_ + '-' + token.ent_type_ if token.ent_iob_ != 'O' else 'O'
                if (token.text[-1] == '\n'):
                    output_lines.append('{}\t{}\t{}'.format(i, token.text[:-1], ent_label))
                else:
                    output_lines.append('{}\t{}\t{}'.format(i, token.text, ent_label))
            output_lines.append('')
    return ('\n'.join(output_lines))

***

### Run four different spaCy models: en_core_web_sm, en_core_web_md, en_core_web_lg, and en_core_web_trf. Note the performance differences for each model choice.

***

In [12]:
# Run Spacy Web Small and generate a compressed conllu output of the results.

print('\nRun Spacy en_core_web_sm model to tag named entities.\n')
logger.info('\nRun Spacy en_core_web_sm model to tag named entities.\n')
# output_text = ne_spacy_tagger('sample.txt.gz', 'en_core_web_sm')
output_text = ne_spacy_tagger('input.txt.gz','en_core_web_sm')

print('\nWriting Spacy Web Small output\n')
logger.info('\nWriting Spacy Web Small output\n')
write_conllu_gzip('spacy-websm.gz', output_text)

# Convert Spacy OntoNotes labels to Conll03.
print('\nConvert Spacy output format to conll03.\n')
logger.info('\nConvert Spacy output format to conll03.\n')
convert_ontonotes_to_conll('spacy-websm.gz', 'spacy-websm.conllu.gz')

print('Evaluate Spacy en_core_web_sm model tags\n')
logger.info('Evaluate Spacy en_core_web_sm model tags\n')
# score_conllu_file('spacy-websm.conllu.gz', 'sample.conllu.gz')
score_conllu_file('spacy-websm.conllu.gz', 'input.conllu.gz')


Run Spacy en_core_web_sm model to tag named entities.


get_file_len executed in: 0.2169325351715088 seconds



100%|██████████████████████████████████████████████████████████████████████████| 249212/249212 [23:32<00:00, 176.41it/s]



ne_spacy_tagger executed in: 1414.0299181938171 seconds


Writing Spacy Web Small output


write_conllu_gzip executed in: 7.8404929637908936 seconds


Convert Spacy output format to conll03.


convert_ontonotes_to_conll executed in: 13.31121826171875 seconds

Evaluate Spacy en_core_web_sm model tags


              precision    recall  f1-score   support

         PER       0.79      0.79      0.79    197666
        MISC       0.72      0.55      0.62    146701
         LOC       0.80      0.67      0.73    184585
         ORG       0.53      0.79      0.64    132417

    accuracy                           0.70    661369
   macro avg       0.71      0.70      0.69    661369
weighted avg       0.73      0.70      0.71    661369



[[155465   6695   9257  26249]
 [ 15985  81125  11824  37767]
 [ 15307  16703 123792  28783]
 [  9536   8500   9534 104847]]


Micro-averaged One-vs-Rest ROC AUC score:
0.80


score_conllu_file executed in: 13.042055130004883 seconds



***

In [13]:
# Run spaCy Web Medium and generate a compressed conllu output of the results.

print('\nRun Spacy en_core_web_md model to tag named entities.\n')
logger.info('\nRun Spacy en_core_web_md model to tag named entities.\n')
# output_text = ne_spacy_tagger('sample.txt.gz', 'en_core_web_md')
output_text = ne_spacy_tagger('input.txt.gz','en_core_web_md')

print('\nWriting Spacy Web Medium output\n')
logger.info('\nWriting Spacy Web Medium output\n')
write_conllu_gzip('spacy-webmd.gz', output_text)

# Convert Spacy OntoNotes labels to Conll03
print('\nConvert Spacy output format to conll03.\n')
logger.info('\nConvert Spacy output format to conll03.\n')
convert_ontonotes_to_conll('spacy-webmd.gz', 'spacy-webmd.conllu.gz')

print('\nEvaluate Spacy en_core_web_md model tags\n')
logger.info('\nEvaluate Spacy en_core_web_md model tags\n')
# score_conllu_file('spacy-webmd.conllu.gz', 'sample.conllu.gz')
score_conllu_file('spacy-webmd.conllu.gz', 'input.conllu.gz')


Run Spacy en_core_web_md model to tag named entities.


get_file_len executed in: 0.201155424118042 seconds



100%|██████████████████████████████████████████████████████████████████████████| 249212/249212 [20:35<00:00, 201.74it/s]



ne_spacy_tagger executed in: 1237.2120561599731 seconds


Writing Spacy Web Medium output


write_conllu_gzip executed in: 7.496793270111084 seconds


Convert Spacy output format to conll03.


convert_ontonotes_to_conll executed in: 11.162131786346436 seconds


Evaluate Spacy en_core_web_md model tags


              precision    recall  f1-score   support

         PER       0.86      0.86      0.86    204155
        MISC       0.73      0.62      0.67    147488
         LOC       0.87      0.72      0.79    187873
         ORG       0.58      0.82      0.68    135281

    accuracy                           0.76    674797
   macro avg       0.76      0.76      0.75    674797
weighted avg       0.78      0.76      0.76    674797



[[174684   5455   3737  20279]
 [ 13218  91848   7499  34923]
 [  9404  19234 135137  24098]
 [  6260   8577   8888 111556]]


Micro-averaged One-vs-Rest ROC AUC score:
0.84


score_conllu_file executed in: 9.501359701156616 seconds



***

In [14]:
# Run spaCy Web Large and generate a compressed conllu output of the results

print('\nRun Spacy en_core_web_lg model to tag named entities.\n')
logger.info('\nRun Spacy en_core_web_lg model to tag named entities.\n')
# output_text = ne_spacy_tagger('sample.txt.gz', 'en_core_web_lg')
output_text = ne_spacy_tagger('input.txt.gz','en_core_web_lg')

print('\nWriting Spacy Web Large output\n')
logger.info('\nWriting Spacy Web Large output\n')
write_conllu_gzip('spacy-weblg.gz', output_text)

# Convert Spacy OntoNotes labels to Conll03
print('\nConvert Spacy output format to conll03.\n')
logger.info('\nConvert Spacy output format to conll03.\n')
convert_ontonotes_to_conll('spacy-weblg.gz', 'spacy-weblg.conllu.gz')

print('\nEvaluate Spacy en_core_web_lg model tags\n')
logger.info('\nEvaluate Spacy en_core_web_lg model tags\n')
# score_conllu_file('spacy-weblg.conllu.gz', 'sample.conllu.gz')
score_conllu_file('spacy-weblg.conllu.gz', 'input.conllu.gz')


Run Spacy en_core_web_lg model to tag named entities.


get_file_len executed in: 0.195420503616333 seconds



100%|██████████████████████████████████████████████████████████████████████████| 249212/249212 [15:46<00:00, 263.29it/s]



ne_spacy_tagger executed in: 950.7010734081268 seconds


Writing Spacy Web Large output


write_conllu_gzip executed in: 7.406781435012817 seconds


Convert Spacy output format to conll03.


convert_ontonotes_to_conll executed in: 11.194310426712036 seconds


Evaluate Spacy en_core_web_lg model tags


              precision    recall  f1-score   support

         PER       0.87      0.87      0.87    204780
        MISC       0.74      0.64      0.69    146459
         LOC       0.88      0.72      0.79    188144
         ORG       0.60      0.83      0.70    135880

    accuracy                           0.77    675263
   macro avg       0.77      0.77      0.76    675263
weighted avg       0.79      0.77      0.77    675263



[[178224   5114   3514  17928]
 [ 12008  94001   7092  33358]
 [  8690  19990 135896  23568]
 [  5480   8460   8655 113285]]


Micro-averaged One-vs-Rest ROC AUC score:
0.85


score_conllu_file executed in: 9.719312906265259 seconds



***

In [15]:
# Run Spacy Web Transformer and generate a compressed conllu output of the results.

print('\nRun Spacy en_core_web_trf model to tag named entities.\n')
logger.info('\nRun Spacy en_core_web_trf model to tag named entities.\n')
# output_text = ne_spacy_tagger('sample.txt.gz', 'en_core_web_trf')
output_text = ne_spacy_tagger('input.txt.gz','en_core_web_trf')

print('\nWriting Spacy Web Transformer output\n')
logger.info('\nWriting Spacy Web Transformer output\n')
write_conllu_gzip('spacy-webtrf.gz', output_text)

# Convert Spacy OntoNotes labels to Conll03
print('\nConvert Spacy output format to conll03.\n')
logger.info('\nConvert Spacy output format to conll03.\n')
convert_ontonotes_to_conll('spacy-webtrf.gz', 'spacy-webtrf.conllu.gz')

print('\nEvaluate Spacy en_core_web_trf model tags\n')
logger.info('\nEvaluate Spacy en_core_web_trf model tags\n')
# score_conllu_file('spacy-webtrf.conllu.gz', 'sample.conllu.gz')
score_conllu_file('spacy-webtrf.conllu.gz', 'input.conllu.gz')


Run Spacy en_core_web_trf model to tag named entities.





get_file_len executed in: 0.14785313606262207 seconds



100%|█████████████████████████████████████████████████████████████████████████| 249212/249212 [2:11:23<00:00, 31.61it/s]



ne_spacy_tagger executed in: 7888.370453834534 seconds


Writing Spacy Web Transformer output


write_conllu_gzip executed in: 7.727949619293213 seconds


Convert Spacy output format to conll03.


convert_ontonotes_to_conll executed in: 11.209738969802856 seconds


Evaluate Spacy en_core_web_trf model tags


              precision    recall  f1-score   support

         PER       0.95      0.96      0.96    210067
        MISC       0.77      0.86      0.81    171526
         LOC       0.94      0.79      0.86    193847
         ORG       0.83      0.89      0.86    144687

    accuracy                           0.87    720127
   macro avg       0.87      0.87      0.87    720127
weighted avg       0.88      0.87      0.88    720127



[[202641   3403   1257   2766]
 [  5608 146757   4148  15013]
 [  1557  30845 152447   8998]
 [  2658   9552   4295 128182]]


Micro-averaged One-vs-Rest ROC AUC score:
0.92


score_conllu_file executed in: 10.417960166931152 seconds



***
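
The speed differences between the four runs can be put side by side with a little arithmetic. The sketch below uses the line count from the progress bars and the `ne_spacy_tagger` timings printed above. Each model was timed only once, so small differences may be noise; the variable names are illustrative.

```python
# Number of input lines, taken from the tqdm progress bars above.
NUM_LINES = 249212

# Wall-clock seconds reported for ne_spacy_tagger in each run above.
RUNTIME_SECONDS = {
    "en_core_web_sm": 1414.0299181938171,
    "en_core_web_md": 1237.2120561599731,
    "en_core_web_lg": 950.7010734081268,
    "en_core_web_trf": 7888.370453834534,
}

# Average throughput, including model loading time.
lines_per_second = {m: NUM_LINES / s for m, s in RUNTIME_SECONDS.items()}

fastest = max(lines_per_second, key=lines_per_second.get)
slowest = min(lines_per_second, key=lines_per_second.get)
slowdown = RUNTIME_SECONDS[slowest] / RUNTIME_SECONDS[fastest]

for model, rate in lines_per_second.items():
    print(f"{model:16s} {rate:7.1f} lines/s")
print(f"{slowest} took {slowdown:.1f}x as long as {fastest}")
```

In these runs the transformer model took roughly eight times as long as the large model, which is the cost side of its higher scores.

***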

## Summary Results

***

The first table summarises model performance based on F1 and AUC. Table 2 gives the number of correct answers (True Positives) for each of the four tags. Table 3 gives the number of True Negatives for each tag (the sum of all confusion-matrix values outside that tag's row and column), and Table 4 gives the number of False Negatives for each tag (the sum of the values in that tag's row, ignoring the True Positive value).
<br><br>In Table 1, the best model is marked by boldfacing the highest score for each of the six metrics.
<br><br>In Tables 2, 3, and 4, the model with the lowest count for each tag is marked using boldface, and the model with the highest count is marked using italics. Higher is better for True Positives and True Negatives; lower is better for False Negatives.

***
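
The counts in Tables 2 to 4 can be derived from a model's confusion matrix. The sketch below does this for the `en_core_web_sm` matrix printed earlier. It assumes rows are gold tags and columns are predicted tags, which is consistent with the row sums matching the `support` column of the report. `per_tag_counts` is a hypothetical helper, not a function from this project.

```python
TAGS = ["PER", "MISC", "LOC", "ORG"]

# Confusion matrix printed for en_core_web_sm.
# Assumption: rows are gold tags, columns are predicted tags.
CONFUSION_SM = [
    [155465, 6695, 9257, 26249],
    [15985, 81125, 11824, 37767],
    [15307, 16703, 123792, 28783],
    [9536, 8500, 9534, 104847],
]


def per_tag_counts(matrix, tags):
    """Return one-vs-rest TP, FN, FP and TN counts for every tag."""
    total = sum(sum(row) for row in matrix)
    counts = {}
    for k, tag in enumerate(tags):
        tp = matrix[k][k]
        fn = sum(matrix[k]) - tp                  # rest of the tag's row
        fp = sum(row[k] for row in matrix) - tp   # rest of the tag's column
        tn = total - tp - fn - fp                 # everything else
        counts[tag] = {"TP": tp, "FN": fn, "FP": fp, "TN": tn}
    return counts


counts = per_tag_counts(CONFUSION_SM, TAGS)
for tag in TAGS:
    print(tag, counts[tag])
```

For `PER` this gives 155465 True Positives, 422875 True Negatives, and 42201 False Negatives, matching the spaCy-sm column in the tables below.

***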

### <center> Table 1 - Evaluation Metrics </center><br>

| Model | F1(PER) | F1(MISC) | F1(LOC) | F1(ORG) | F1(Macro AVG) | OvA AUC |
| --- | --- | --- | --- | --- | --- | --- |
| Stanza | 0.92 | 0.76 | **0.86** | 0.73 | 0.82 | 0.89 |
| NLTK | 0.73 | 0.01 | 0.56 | 0.49 | 0.45 | 0.70 |
| spaCy-sm | 0.79 | 0.62 | 0.73 | 0.64 | 0.69 | 0.80 |
| spaCy-md | 0.86 | 0.67 | 0.79 | 0.68 | 0.75 | 0.84 |
| spaCy-lg | 0.87 | 0.69 | 0.79 | 0.70 | 0.76 | 0.85 |
| spaCy-tr | **0.96** | **0.81** | **0.86** | **0.86** | **0.87** | **0.92** |

***
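
The F1 values in Table 1 follow from precision and recall, which in turn come from the confusion matrix. As a cross-check, the sketch below recomputes the per-tag and macro-averaged F1 for the `en_core_web_trf` matrix printed earlier. It assumes rows are gold tags and columns are predicted tags; `f1_scores` is a hypothetical helper introduced for illustration.

```python
TAGS = ["PER", "MISC", "LOC", "ORG"]

# Confusion matrix printed for en_core_web_trf.
# Assumption: rows are gold tags, columns are predicted tags.
CONFUSION_TRF = [
    [202641, 3403, 1257, 2766],
    [5608, 146757, 4148, 15013],
    [1557, 30845, 152447, 8998],
    [2658, 9552, 4295, 128182],
]


def f1_scores(matrix, tags):
    """Per-tag F1; assumes every tag has gold and predicted instances."""
    scores = {}
    for k, tag in enumerate(tags):
        tp = matrix[k][k]
        gold = sum(matrix[k])                      # row sum
        predicted = sum(row[k] for row in matrix)  # column sum
        precision = tp / predicted
        recall = tp / gold
        scores[tag] = 2 * precision * recall / (precision + recall)
    return scores


scores = f1_scores(CONFUSION_TRF, TAGS)
macro_f1 = sum(scores.values()) / len(scores)

for tag in TAGS:
    print(f"F1({tag}) = {scores[tag]:.2f}")
print(f"F1(Macro AVG) = {macro_f1:.2f}")
```

Rounded to two decimals, the results agree with the spaCy-tr row of Table 1 (0.96, 0.81, 0.86, 0.86, and a macro average of 0.87).

***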

### <center> Table 2 - True Positive Counts </center><br>

| TAG | Stanza | NLTK | spaCy-sm | spaCy-md | spaCy-lg | spaCy-tr |
|---|---|---|---|---|---|---|
| PER | 197729 | 166898 | **155465** | 174684 | 178224 | *202641* |
| MISC | 128684 | **700** | 81125 | 91848 | 94001 | *146757* |
| LOC | *169837* | **98782** | 123792 | 135137 | 135896 | 152447 |
| ORG | 112458 | **71546** | 104847 | 111556 | 113285 | *128182* |

***

### <center> Table 3 - True Negative Counts </center><br>

| TAG | Stanza | NLTK | spaCy-sm | spaCy-md | spaCy-lg | spaCy-tr |
|---|---|---|---|---|---|---|
| PER | *502178* | **315393** | 422875 | 441760 | 444305 | 500237 |
| MISC | *525832* | 494047 | **482770** | 494043 | 495240 | 504801 |
| LOC | 510552 | **355613** | 446169 | 466800 | 467858 | *516580* |
| ORG | 539780 | **387359** | 436153 | 460216 | 464529 | *548663* |

***

### <center> Table 4 - False Negative Counts </center><br>

| TAG | Stanza | NLTK | spaCy-sm | spaCy-md | spaCy-lg | spaCy-tr |
|---|---|---|---|---|---|---|
| PER | 15007 | 31203 | *42201* | 29471 | 26556 | **7426** |
| MISC | 49629 | *108478* | 65576 | 55640 | 52458 | **24769** |
| LOC | **27895** | *80258* | 60793 | 52736 | 52248 | 41400 |
| ORG | 33578 | *49378* | 27570 | 23725 | 22595 | **16505** |