## Named Entity Recognition Task SystemA (MultiNERD Dataset)

### XLNET Base Cased

[MultiNERD Dataset] 🤗: https://huggingface.co/datasets/Babelscape/multinerd 

[xlnet Base Cased Model] 🤗: https://huggingface.co/xlnet-base-cased

#### Import Necessary Libraries

* [AutoTokenizer](https://huggingface.co/transformers/model_doc/auto.html#transformers.AutoTokenizer): A tokenizer class designed to accommodate the tokenization conventions of various pre-trained models.

* [AutoModelForTokenClassification](https://huggingface.co/transformers/model_doc/auto.html#automodelfortokenclassification): An extension of the [AutoModel](https://huggingface.co/transformers/model_doc/auto.html#transformers.AutoModel) class, capable of loading diverse pre-trained models. It supports fine-tuning for the classification of each token within a sequence.

* [TrainingArguments](https://huggingface.co/transformers/main_classes/trainer.html#transformers.TrainingArguments): A straightforward class tailored for storing hyperparameters and other settings essential for model training.

* [Trainer](https://huggingface.co/transformers/main_classes/trainer.html): A versatile class that facilitates various forms of training for transformer models.

* [DataCollatorForTokenClassification](https://github.com/huggingface/transformers/blob/master/src/transformers/data/data_collator.py): A class designed for padding token classification examples to the same length during training.

* [load_dataset](https://huggingface.co/docs/datasets/package_reference/loading_methods.html#datasets.load_dataset): A function crafted for effortlessly loading datasets, including those from the [datasets](https://huggingface.co/datasets) collection.


In [1]:
import os, sys
os.environ['TOKENIZERS_PARALLELISM']='false'

import pandas as pd
import numpy as np
import matplotlib.pyplot as plt

import sklearn
from sklearn.metrics import confusion_matrix, ConfusionMatrixDisplay

import datasets
from datasets import load_dataset, DatasetDict

import torch
from torch import nn
from torch.nn.functional import cross_entropy

import transformers
from transformers import AutoTokenizer, AutoModelForTokenClassification
from transformers import DataCollatorForTokenClassification
from transformers import Trainer, TrainingArguments

import evaluate

import warnings
warnings.filterwarnings("ignore")


  from .autonotebook import tqdm as notebook_tqdm


#### Extracting English Subset of Dataset as input of SystemA

* First load the dataset with [`load_dataset`](https://huggingface.co/docs/datasets/package_reference/loading_methods.html#datasets.load_dataset), Then Filter out the non-English examples of the dataset as the training dataset of the SystemA.

* System A supposed to be a fine-tuned XLNET model on the English subset of the training set.

* The loaded dataset is a dictionary-like container for [`Dataset`](https://huggingface.co/docs/datasets/exploring.html) objects for training, development (validation), and test data. We'll here be using the `tokens` and `ner_tags` fields and ignoring `lang` (language).

In [2]:

original_dataset = load_dataset('Babelscape/multinerd')
data = original_dataset.filter(lambda example: example['lang'] == 'en')
data = data.remove_columns(["lang"])
label_list = ['O',
 'B-EVE',
 'I-EVE',
 'B-LOC',
 'B-MEDIA',
 'I-MEDIA',
 'I-PER',
 'B-PER',
 'B-DIS',
 'I-LOC',
 'B-PLANT',
 'B-FOOD',
 'B-VEHI',
 'I-VEHI',
 'I-ORG',
 'I-FOOD',
 'B-ORG',
 'B-TIME',
 'B-ANIM',
 'B-CEL',
 'I-TIME',
 'B-MYTH',
 'I-MYTH',
 'I-DIS',
 'I-ANIM',
 'B-INST',
 'I-PLANT',
 'B-BIO',
 'I-INST',
 'I-CEL',
 'I-BIO']
num_labels = len(label_list)

Resolving data files: 100%|██████████| 20/20 [00:04<00:00,  4.68it/s]
Resolving data files: 100%|██████████| 20/20 [00:00<00:00, 144382.24it/s]
Resolving data files: 100%|██████████| 20/20 [00:00<00:00, 182361.04it/s]


#### Split Dataset into DatasetDict

In [3]:
ds = DatasetDict({
    'train': data['train'], 
    'eval': data['validation'], 
    'test': data['test']})

print('Training data shape:', ds['train'].shape)
print('Validation data shape:', ds['eval'].shape)
print('Testing data shape:', ds['test'].shape)

Training data shape: (262560, 2)
Validation data shape: (32908, 2)
Testing data shape: (32820, 2)


#### Example

In [4]:
example = ds['train'][12]
example

{'tokens': ['The',
  'wild',
  'bulb',
  'vernal',
  'squill',
  'is',
  'known',
  'locally',
  'as',
  '"',
  'grice',
  "'s",
  'onions',
  '"',
  'because',
  'it',
  'was',
  'a',
  'favourite',
  'food',
  'of',
  'the',
  'swine',
  '.'],
 'ner_tags': [0,
  0,
  0,
  25,
  26,
  0,
  0,
  0,
  0,
  0,
  0,
  0,
  0,
  0,
  0,
  0,
  0,
  0,
  0,
  0,
  0,
  0,
  0,
  0]}

#### Display Feature Information About Each Feature

In [5]:
for k, v in ds["train"].features.items():
    print(f"{k}: \n{v}\n")

tokens: 
Sequence(feature=Value(dtype='string', id=None), length=-1, id=None)

ner_tags: 
Sequence(feature=Value(dtype='int64', id=None), length=-1, id=None)



#### Define Tag Values & Conversions Between String & Integer Values

In [6]:
label2id = {
    "O": 0,
    "B-PER": 1,
    "I-PER": 2,
    "B-ORG": 3,
    "I-ORG": 4,
    "B-LOC": 5,
    "I-LOC": 6,
    "B-ANIM": 7,
    "I-ANIM": 8,
    "B-BIO": 9,
    "I-BIO": 10,
    "B-CEL": 11,
    "I-CEL": 12,
    "B-DIS": 13,
    "I-DIS": 14,
    "B-EVE": 15,
    "I-EVE": 16,
    "B-FOOD": 17,
    "I-FOOD": 18,
    "B-INST": 19,
    "I-INST": 20,
    "B-MEDIA": 21,
    "I-MEDIA": 22,
    "B-MYTH": 23,
    "I-MYTH": 24,
    "B-PLANT": 25,
    "I-PLANT": 26,
    "B-TIME": 27,
    "I-TIME": 28,
    "B-VEHI": 29,
    "I-VEHI": 30,
  }

id2label = {tag: idx for idx, tag in label2id.items()}

pos_tag_values = list(label2id.keys())
NUM_OF_LABELS = len(pos_tag_values)

print(f"List of tag values: \n{pos_tag_values}")
print(f"Number of NER Tags: \n{NUM_OF_LABELS}")
print(f"id2label: \n{id2label}")
print(f"label2id: \n{label2id}")

List of tag values: 
['O', 'B-PER', 'I-PER', 'B-ORG', 'I-ORG', 'B-LOC', 'I-LOC', 'B-ANIM', 'I-ANIM', 'B-BIO', 'I-BIO', 'B-CEL', 'I-CEL', 'B-DIS', 'I-DIS', 'B-EVE', 'I-EVE', 'B-FOOD', 'I-FOOD', 'B-INST', 'I-INST', 'B-MEDIA', 'I-MEDIA', 'B-MYTH', 'I-MYTH', 'B-PLANT', 'I-PLANT', 'B-TIME', 'I-TIME', 'B-VEHI', 'I-VEHI']
Number of NER Tags: 
31
id2label: 
{0: 'O', 1: 'B-PER', 2: 'I-PER', 3: 'B-ORG', 4: 'I-ORG', 5: 'B-LOC', 6: 'I-LOC', 7: 'B-ANIM', 8: 'I-ANIM', 9: 'B-BIO', 10: 'I-BIO', 11: 'B-CEL', 12: 'I-CEL', 13: 'B-DIS', 14: 'I-DIS', 15: 'B-EVE', 16: 'I-EVE', 17: 'B-FOOD', 18: 'I-FOOD', 19: 'B-INST', 20: 'I-INST', 21: 'B-MEDIA', 22: 'I-MEDIA', 23: 'B-MYTH', 24: 'I-MYTH', 25: 'B-PLANT', 26: 'I-PLANT', 27: 'B-TIME', 28: 'I-TIME', 29: 'B-VEHI', 30: 'I-VEHI'}
label2id: 
{'O': 0, 'B-PER': 1, 'I-PER': 2, 'B-ORG': 3, 'I-ORG': 4, 'B-LOC': 5, 'I-LOC': 6, 'B-ANIM': 7, 'I-ANIM': 8, 'B-BIO': 9, 'I-BIO': 10, 'B-CEL': 11, 'I-CEL': 12, 'B-DIS': 13, 'I-DIS': 14, 'B-EVE': 15, 'I-EVE': 16, 'B-FOOD': 17, 'I-

In the text classification tasks discussed earlier, each data example included a string representing an entire sentence. In contrast, the current data is already segmented into words, denoted by the key `tokens` in the `Dataset` object. Interestingly, the tags in the data correspond to these individual words. Let's examine a sample sentence for better clarity.

In [7]:
train_words = ds['train']['tokens']
train_tags = ds['train']['ner_tags']
for word, tag in zip(train_words[0], train_tags[0]):
    print(f'{word:8s} ➔ {label_list[tag]}')

The      ➔ O
type     ➔ O
locality ➔ O
is       ➔ O
Kīlauea  ➔ I-MEDIA
.        ➔ O


Aligning labels (tags) with words, especially when they don't align perfectly with the tokenization of the model, introduces significant complexity. This becomes more apparent once we've loaded a tokenizer.

#### Fundamental Constants and Parameters Configuration

Let's establish critical global variables encompassing the designation of the pre-trained model and the essential hyperparameters instrumental for its meticulous fine-tuning:

* `MODEL_CKPT`: The nomenclature of a pre-trained model accessible within the [model repository](https://huggingface.co/models).
* `REPORTS_TO`: Incorporating visualization tools vital for comprehensive machine learning experimentation, opt for [TensorBoard](https://www.tensorflow.org/tensorboard) in this context.
* `LR`, `WEIGHT_DECAY`, `BATCH_SIZE`, `STEPS`, and `NUM_OF_EPOCHS`: Configurable hyperparameters governing the nuanced fine-tuning process of the model. (Experimenting with diverse values is encouraged!)
* `OUTPUT_DIR` and `LOG_DIR`: The designated directories for storing the model and its associated parameters.
* `DEVICE` : The computational device on which the model training is executed. In this instance, utilize GPU for optimal training efficiency.

In [8]:
MODEL_CKPT = "xlnet-base-cased"
MODEL_NAME = f"{MODEL_CKPT}-finetuned-MultiNERD-SystemA"

NUM_OF_EPOCHS = 2
BATCH_SIZE = 12

STRATEGY = "epoch"
REPORTS_TO = "tensorboard"

WEIGHT_DECAY = 0.01
LR = 2e-5

# DEVICE = torch.device("cpu")
STEPS = 1250

OUTPUT_DIR = f'/srv/users/rudxia/Developer_NLP/notebooks/results/Outdir/{MODEL_CKPT}-finetuned-MultiNERD-SystemA'
LOG_DIR= f'/srv/users/rudxia/Developer_NLP/notebooks/results/Log/{MODEL_CKPT}-finetuned-MultiNERD-SystemA'


if torch.cuda.is_available():
    DEVICE = torch.device("cuda")
    print("CUDA is available. Using GPU.")

CUDA is available. Using GPU.


#### Function to Tokenize & Align Inputs

Ensuring the alignment of labels (tags) with words that might not align perfectly with the model involves loading a suitable tokenizer using [`AutoTokenizer.from_pretrained`](https://huggingface.co/transformers/model_doc/auto.html#transformers.AutoTokenizer.from_pretrained). Specifically, we opt for a ["fast" tokenizer](https://huggingface.co/transformers/main_classes/tokenizer.html) since it offers a mapping from tokens to input words, a crucial requirement for Named Entity Recognition (NER).

In [9]:
tokenizer = AutoTokenizer.from_pretrained(MODEL_CKPT)

def tokenize_and_align_labels(samples):
    tokenized_inputs = tokenizer(samples["tokens"], 
                                      truncation=True, 
                                      is_split_into_words=True)
    
    labels = []
    
    for idx, label in enumerate(samples[f"ner_tags"]):
        word_ids = tokenized_inputs.word_ids(batch_index=idx)
        prev_word_idx = None
        label_ids = []
        for word_idx in word_ids: # set special tokens to -100
            if word_idx is None or word_idx == prev_word_idx:
                label_ids.append(-100)
            else:
                label_ids.append(label[word_idx])
            prev_word_idx = word_idx
        labels.append(label_ids)
    
    tokenized_inputs["labels"] = labels
    return tokenized_inputs

#### Apply Above Function to Tokenize Dataset

In [10]:
encoded_ds = ds.map(tokenize_and_align_labels, 
                    batched=True, 
                    remove_columns=
                        [
                            'ner_tags', 
                            'tokens'
                        ]
                    )

Map:   0%|          | 0/32908 [00:00<?, ? examples/s]Asking to truncate to max_length but no maximum length is provided and the model has no predefined maximum length. Default to no truncation.
Map: 100%|██████████| 32908/32908 [00:05<00:00, 6129.16 examples/s]


#### Load Custom Model

In [12]:
model = (AutoModelForTokenClassification.from_pretrained(
    MODEL_CKPT,
    num_labels=NUM_OF_LABELS,
    id2label=id2label,
    label2id=label2id
    ).to(DEVICE))

Some weights of XLNetForTokenClassification were not initialized from the model checkpoint at xlnet-base-cased and are newly initialized: ['classifier.bias', 'classifier.weight']
You should probably TRAIN this model on a down-stream task to be able to use it for predictions and inference.


#### Create Compute Metrics Function

In [13]:
label_list = pos_tag_values

seqeval = evaluate.load("seqeval")

labels = [label_list[i] for i in example[f'ner_tags']]

def compute_metrics(eval_preds):
    predictions, labels = eval_preds
    predictions = np.argmax(predictions, 
                            axis=2)
    
    true_predictions = [
        [label_list[p] for (p, l) in zip(prediction, label) if l != -100]
        for prediction, label in zip(predictions, labels)
    ]
    
    true_labels = [
        [label_list[l] for (p, l) in zip(prediction, label) if l != -100]
        for prediction, label in zip(predictions, labels)
    ]
    
    results = seqeval.compute(predictions=true_predictions, 
                              references=true_labels)
    
    return results

#### Define TrainingArguments

In [14]:
args = TrainingArguments(
    OUTPUT_DIR,    # output directory for checkpoints and predictions
    MODEL_NAME,
    log_level="error",
    logging_first_step=True,
    logging_dir = LOG_DIR,
    learning_rate=LR,
    num_train_epochs=NUM_OF_EPOCHS,
    per_device_train_batch_size=BATCH_SIZE,
    per_device_eval_batch_size=BATCH_SIZE,
    evaluation_strategy=STRATEGY,
    report_to=REPORTS_TO,
    disable_tqdm=False,
    logging_steps=STEPS,
    weight_decay=WEIGHT_DECAY,
    save_strategy=STRATEGY,
    hub_private_repo=False,
    push_to_hub=False
)

#### Define Trainer and Subclass Trainer to Handle Class Imbalance

To fine-tune the pre-trained model, we'll instantiate a [`Trainer`](https://huggingface.co/transformers/main_classes/trainer.html) object. This involves providing the pre-trained model, configuration settings, training and validation datasets, and the previously defined evaluation metric.

Most of this should be familiar from the sentence classification notebook, with one notable exception: we include a [`DataCollatorForTokenClassification`](https://github.com/huggingface/transformers/blob/master/src/transformers/data/data_collator.py) with the trainer. This class efficiently handles the padding of token classification examples in batches, ensuring they are of the same length, as required by PyTorch. The tokenizer's `[PAD]` symbol is employed for padding the output, along with the specified `label_pad_token_id` for padding the labels.

In [15]:
class CustomTrainer(Trainer):
    def compute_loss(self, 
                     model, 
                     inputs, 
                     return_outputs=False):
        
        labels = inputs.get("labels")
        # forward pass
        outputs = model(**inputs)
        logits = outputs.get("logits")
        # compute custom loss 
        loss_fct = nn.CrossEntropyLoss(weight=torch.tensor(
            [1.0, 2.0, 3.0, 4.0, 5.0, 6.0, 7.0, 8.0, 
             9.0, 10.0, 11.0, 12.0, 13.0, 14.0, 15.0,
             16.0, 17.0, 18.0, 19.0, 20.0, 21.0, 22.0,
             23.0, 24.0, 25.0, 26.0, 27.0, 28.0, 29.0, 30.0, 31.0], 
            device=DEVICE)
        )
        loss = loss_fct(logits.view(-1, 
                                    self.model.config.num_labels), 
                        labels.view(-1)
                        )
        return (loss, outputs) if return_outputs else loss

In [16]:
data_collator = DataCollatorForTokenClassification(tokenizer=tokenizer)
trainer = CustomTrainer(model, 
                  args=args,
                  data_collator=data_collator,
                  compute_metrics=compute_metrics,
                  tokenizer=tokenizer,
                  train_dataset=encoded_ds["train"],
                  eval_dataset=encoded_ds["eval"],
                  )

Detected kernel version 5.4.0, which is below the recommended minimum of 5.5.0; this can cause the process to hang. It is recommended to upgrade the kernel to the minimum version or higher.


#### Train Model

In [17]:
train_results = trainer.train()

Epoch,Training Loss,Validation Loss,Anim,Bio,Cel,Dis,Eve,Food,Inst,Loc,Media,Myth,Org,Per,Plant,Time,Vehi,Overall Precision,Overall Recall,Overall F1,Overall Accuracy
1,0.1185,0.104612,"{'precision': 0.5805658056580566, 'recall': 0.8827930174563591, 'f1': 0.7004699480583725, 'number': 3208}","{'precision': 0.5, 'recall': 0.375, 'f1': 0.42857142857142855, 'number': 16}","{'precision': 0.5535714285714286, 'recall': 0.7560975609756098, 'f1': 0.6391752577319588, 'number': 82}","{'precision': 0.5325301204819277, 'recall': 0.8735177865612648, 'f1': 0.6616766467065868, 'number': 1518}","{'precision': 0.9344262295081968, 'recall': 0.9715909090909091, 'f1': 0.9526462395543176, 'number': 704}","{'precision': 0.384297520661157, 'recall': 0.8215547703180212, 'f1': 0.5236486486486486, 'number': 1132}","{'precision': 0.4444444444444444, 'recall': 0.6666666666666666, 'f1': 0.5333333333333333, 'number': 24}","{'precision': 0.9922686840136338, 'recall': 0.9926813040585496, 'f1': 0.9924749511495448, 'number': 24048}","{'precision': 0.9200819672131147, 'recall': 0.980349344978166, 'f1': 0.9492600422832981, 'number': 916}","{'precision': 0.3835616438356164, 'recall': 0.875, 'f1': 0.5333333333333333, 'number': 64}","{'precision': 0.9717548076923077, 'recall': 0.9773345421577516, 'f1': 0.9745366882627693, 'number': 6618}","{'precision': 0.9925558312655087, 'recall': 0.9876543209876543, 'f1': 0.9900990099009901, 'number': 10530}","{'precision': 0.41868317388857623, 'recall': 0.8322147651006712, 'f1': 0.5570947210782479, 'number': 1788}","{'precision': 0.6090425531914894, 'recall': 0.7923875432525952, 'f1': 0.6887218045112782, 'number': 578}","{'precision': 0.6511627906976745, 'recall': 0.875, 'f1': 0.7466666666666667, 'number': 64}",0.863486,0.966114,0.911922,0.982975
2,0.068,0.115915,"{'precision': 0.6152164407520769, 'recall': 0.8771820448877805, 'f1': 0.7232074016962221, 'number': 3208}","{'precision': 0.47058823529411764, 'recall': 1.0, 'f1': 0.6399999999999999, 'number': 16}","{'precision': 0.603448275862069, 'recall': 0.8536585365853658, 'f1': 0.7070707070707071, 'number': 82}","{'precision': 0.5658783783783784, 'recall': 0.8827404479578392, 'f1': 0.6896551724137931, 'number': 1518}","{'precision': 0.9301075268817204, 'recall': 0.9829545454545454, 'f1': 0.9558011049723756, 'number': 704}","{'precision': 0.4458413926499033, 'recall': 0.8144876325088339, 'f1': 0.5762499999999999, 'number': 1132}","{'precision': 0.4117647058823529, 'recall': 0.5833333333333334, 'f1': 0.4827586206896552, 'number': 24}","{'precision': 0.9946795244825006, 'recall': 0.9950931470392548, 'f1': 0.9948862927701326, 'number': 24048}","{'precision': 0.9430379746835443, 'recall': 0.9759825327510917, 'f1': 0.9592274678111588, 'number': 916}","{'precision': 0.6666666666666666, 'recall': 0.875, 'f1': 0.7567567567567567, 'number': 64}","{'precision': 0.9779722389861195, 'recall': 0.9794499848896948, 'f1': 0.9787105541295485, 'number': 6618}","{'precision': 0.9929857819905213, 'recall': 0.9948717948717949, 'f1': 0.9939278937381405, 'number': 10530}","{'precision': 0.46262507357268984, 'recall': 0.8791946308724832, 'f1': 0.6062475896644813, 'number': 1788}","{'precision': 0.6657534246575343, 'recall': 0.8408304498269896, 'f1': 0.7431192660550459, 'number': 578}","{'precision': 0.7631578947368421, 'recall': 0.90625, 'f1': 0.8285714285714286, 'number': 64}",0.883212,0.971378,0.9252,0.985433


#### Save & Log Model

In [18]:
output_model_dir = "/srv/users/rudxia/Developer_NLP/notebooks/results/saved_model/XLNET_MultiNERD_SystemA"
trainer.save_model(output_model_dir)

In [19]:
trainer.save_model()
trainer.log_metrics("train", train_results.metrics)
trainer.save_metrics("train", train_results.metrics)
trainer.save_state()

***** train metrics *****
  epoch                    =        2.0
  total_flos               = 22355475GF
  train_loss               =     0.1292
  train_runtime            = 1:06:53.31
  train_samples_per_second =    130.845
  train_steps_per_second   =      2.726


The fine-tuned model is now available as `trainer.model`. Here we define functions to predict tags for a user-defined string.

The main complexity here arises from the need to map back from token labels to word labels, inverting the mapping performed in `encode_dataset`. The process is basically the same as in `compute_metrics` above.

In [20]:
model = trainer.model
model.eval()    # switch to evaluation mode
model.to('cpu')    # switch to CPU


def word_start_tokens(tokenized):
    """Return list of bool identifying which tokens start words."""
    prev_word_idx = None
    is_word_start = []
    for word_idx in tokenized.word_ids():
        if word_idx is None or word_idx == prev_word_idx:
            is_word_start.append(False)
        else:
            is_word_start.append(True)
        prev_word_idx = word_idx
    return is_word_start


def predict_ner(words):
    tokenized = tokenizer(words, is_split_into_words=True, return_tensors='pt')
    pred = model(**tokenized)
    pred_idx = pred.logits.detach().numpy().argmax(axis=2)
    token_labels = [label_list[i] for s in pred_idx for i in s]
    word_labels = []
    for label, is_word_start in zip(token_labels, word_start_tokens(tokenized)):
        if is_word_start:
            word_labels.append(label)
    return word_labels

Let's try that out on a couple of example sentences:

In [21]:
example_sentences = [
    'Paris, the capital of France, is known for its iconic Eiffel Tower.',
    'Elon Musk is the CEO of SpaceX and Tesla.',
    'The Great Barrier Reef is the world\'s largest coral reef system, located in Australia.',
    'Mount Everest, the highest peak in the world, is part of the Himalayan mountain range.',
    'The Mona Lisa, painted by Leonardo da Vinci, is displayed in the Louvre Museum in Paris.',
    'Barack Obama served as the 44th President of the United States from 2009 to 2017.',
    'The Amazon rainforest, often called the "lungs of the Earth," is vital for global oxygen production.',
    'Albert Einstein, a renowned physicist, developed the theory of relativity.',
    'The Nile River is the longest river in Africa, flowing through multiple countries.',
]


for e in example_sentences:
    words = e.split()    # Note: assumes white-space tokenization is OK
    ner_tags = predict_ner(words)
    for word, tag in zip(words, ner_tags):
        print(f'{word:10s} ➔ {tag}')
    print()

Paris,     ➔ B-LOC
the        ➔ O
capital    ➔ O
of         ➔ O
France,    ➔ B-LOC
is         ➔ O
known      ➔ O
for        ➔ O
its        ➔ O
iconic     ➔ O
Eiffel     ➔ B-LOC
Tower.     ➔ I-LOC

Elon       ➔ B-PER
Musk       ➔ I-PER
is         ➔ O
the        ➔ O
CEO        ➔ O
of         ➔ O
SpaceX     ➔ B-ORG
and        ➔ O
Tesla.     ➔ B-ORG

The        ➔ O
Great      ➔ B-LOC
Barrier    ➔ I-LOC
Reef       ➔ I-LOC
is         ➔ O
the        ➔ O
world's    ➔ O
largest    ➔ O
coral      ➔ O
reef       ➔ O
system,    ➔ O
located    ➔ O
in         ➔ O
Australia. ➔ B-LOC

Mount      ➔ B-LOC
Everest,   ➔ I-LOC
the        ➔ O
highest    ➔ O
peak       ➔ O
in         ➔ O
the        ➔ O
world,     ➔ O
is         ➔ O
part       ➔ O
of         ➔ O
the        ➔ O
Himalayan  ➔ B-LOC
mountain   ➔ O
range.     ➔ O

The        ➔ O
Mona       ➔ B-LOC
Lisa,      ➔ I-LOC
painted    ➔ O
by         ➔ O
Leonardo   ➔ B-PER
da         ➔ I-PER
Vinci,     ➔ I-PER
is         ➔ O
displayed  ➔ O
in         ➔ O
t

# Visualization

To get a better intuitive understanding of tagging results, let's implement a visualization using the[`displacy`](https://explosion.ai/demos/displacy-ent) library.

The code here mostly maps the IOB tags to character offets and formats the data for displacy. Unless you're interested in modifying this or otherwise working with this library, there's no need to go through this in detail.

In [22]:
from spacy import displacy
# Mapping of MultiNERD types for displacy

type_map = {
    'PER': 'Person',
    'LOC': 'Location',
    'ORG': 'Organization',
    'ANIM': 'Animal',
    'BIO': 'Biological entity',
    'CEL': 'Celestial Body',
    'DIS': 'Disease',
    'EVE': 'Event',
    'FOOD': 'Food',
    'INST': 'Instrument',
    'MEDIA': 'Media',
    'PLANT': 'Plant',
    'MYTH': 'Mythological entity',
    'TIME': 'Time',
    'VEHI': 'Vehicle',
}

def render_with_displacy(words, tags):
    tagged, offset, start, label = [], 0, None, None
    for word, tag in zip(words, tags):
        if tag[0] in 'OB' and start is not None:    # current ends
            tagged.append({
                'start': start,
                'end': offset,
                'label': type_map.get(label, label)
            })
            start, label = None, None
        if tag[0] == 'B':
            start, label = offset, tag[2:]
        elif tag[0] == 'I':
            if start is None:    # I without B, but nevermind
                start, label = offset, tag[2:]
        else:
            assert tag == 'O', 'unexpected tag {}'.format(tag)
        offset += len(word) + 1    # +1 for space
    if start:    # span open at sentence end
        tagged.append({
                'start': start,
                'end': offset,
                'label': type_map.get(label, label)
        })
    doc = {
        'text': ' '.join(words),
        'ents': tagged
    }
    displacy.render(doc, style='ent', jupyter=True, manual=True)


for e in example_sentences:
    words = e.split()    # Note: assumes white-space tokenization is OK
    ner_tags = predict_ner(words)
    render_with_displacy(words, ner_tags)

### Error Analysis

#### Define Method to Apply to Validation Dataset (& Then Apply it)

In [23]:
def forward_pass_with_label(batch):
    # Convert dict of lists to list of dicts suitable for data collator
    features = [dict(zip(batch, t)) for t in zip(*batch.values())]
    # Pad inputs and labels and put all tensors on device
    batch = data_collator(features)
    input_ids = batch["input_ids"].to('cpu')
    attention_mask = batch["attention_mask"].to('cpu')
    labels = batch["labels"].to('cpu')
    with torch.no_grad():
        # Pass data through model  
        output = trainer.model(input_ids, 
                               attention_mask
                               )
        # Logit.size: [batch_size, sequence_length, classes]
        predicted_label = torch.argmax(output.logits, 
                                       axis=-1
                                       ).cpu().numpy()
        
    # Calculate loss per token after flattening batch dimension with view
    loss = cross_entropy(output.logits.view(-1, 31), 
                         labels.view(-1), 
                         reduction="none")
    # Unflatten batch dimension and convert to numpy array
    loss = loss.view(len(input_ids), -1).cpu().numpy()

    return {"loss":loss, "predicted_label": predicted_label}

#### Apply Above Function to Entire Test Dataset

In [24]:
eval_set = encoded_ds['test']

eval_set = eval_set.map(forward_pass_with_label,
                        batched=True,
                        batch_size=32)

eval_df = eval_set.to_pandas()

Map: 100%|██████████| 32820/32820 [24:15<00:00, 22.55 examples/s]  


#### Clean Up Padding Tokens

Defined a placeholder label ID for special tokens (e.g. `[IGN]`) and tokens that represent continuation wordpieces. For example, if the word `Partition` is tokenized into the parts `Part` and `##ition`, the subword token `##ition` would get this ID.

(Here the "magic" value -100 is significant: this matches the default pythorch `ignore_index`, a value that is ignored in loss functions.)

In [25]:
id2label[-100] = "IGN"
eval_df["input_tokens"] = eval_df["input_ids"].apply(
    lambda x: tokenizer.convert_ids_to_tokens(x))
eval_df["predicted_label"] = eval_df["predicted_label"].apply(
    lambda x: [id2label[i] for i in x])
eval_df["labels"] = eval_df["labels"].apply(
    lambda x: [id2label[i] for i in x])
eval_df['loss'] = eval_df.apply(
    lambda x: x['loss'][:len(x['input_ids'])], axis=1)
eval_df['predicted_label'] = eval_df.apply(
    lambda x: x['predicted_label'][:len(x['input_ids'])], axis=1)
eval_df.head(1)

Unnamed: 0,input_ids,token_type_ids,attention_mask,labels,loss,predicted_label,input_tokens
0,"[6464, 18, 239, 20, 339, 621, 1273, 21, 18, 19...","[0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, ...","[1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, ...","[O, O, O, O, B-EVE, I-EVE, I-EVE, O, O, O, O, ...","[0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, ...","[B-ORG, B-ORG, B-ORG, B-ORG, B-ORG, B-ORG, B-O...","[▁Between, ▁the, ▁end, ▁of, ▁World, ▁War, ▁II,..."


#### Unwrap Each Token Within Sample Separately

In [26]:
eval_df_tokens = eval_df.apply(pd.Series.explode)
eval_df_tokens = eval_df_tokens.query("labels != 'IGN'")
eval_df_tokens["loss"] = eval_df_tokens["loss"].astype(float).round(2)
eval_df_tokens.head(7)

Unnamed: 0,input_ids,token_type_ids,attention_mask,labels,loss,predicted_label,input_tokens
0,6464,0,1,O,0.0,B-ORG,▁Between
0,18,0,1,O,0.0,B-ORG,▁the
0,239,0,1,O,0.0,B-ORG,▁end
0,20,0,1,O,0.0,B-ORG,▁of
0,339,0,1,B-EVE,0.0,B-ORG,▁World
0,621,0,1,I-EVE,0.0,B-ORG,▁War
0,1273,0,1,I-EVE,0.0,B-ORG,▁II


#### See Which Tokens Have Accumulated Most Loss in Evaluation Dataset

In [27]:
(
    eval_df_tokens.groupby("input_tokens")[["loss"]]
    .agg(["count", "mean", "sum"])
    .droplevel(level=0, axis=1)  # Get rid of multi-level columns
    .sort_values(by="sum", ascending=False)
    .reset_index()
    .round(3)
    .head(10)
    .T
)

Unnamed: 0,0,1,2,3,4,5,6,7,8,9
input_tokens,▁,▁and,▁the,▁a,▁of,▁to,▁Qu,▁in,▁with,▁or
count,121976,20850,35218,13900,22056,12762,38,16374,6058,3416
mean,0.033,0.013,0.007,0.013,0.008,0.011,3.528,0.007,0.02,0.035
sum,4011.37,275.17,257.03,186.62,170.33,142.56,134.08,121.77,119.1,118.66


#### See Which Label IDs Have Most Loss in Evaluation Dataset

In [28]:
(
    eval_df_tokens.groupby("labels")[["loss"]] 
    .agg(["count", "mean", "sum"])
    .droplevel(level=0, axis=1)
    .sort_values(by="mean", ascending=False)
    .reset_index()
    .round(3)
    .fillna(0)
    .T
)

Unnamed: 0,0,1,2,3,4,5,6,7,8,9,...,21,22,23,24,25,26,27,28,29,30
labels,I-PLANT,I-ANIM,B-BIO,I-TIME,B-VEHI,I-VEHI,I-DIS,B-ANIM,I-FOOD,B-PLANT,...,I-ORG,I-PER,B-LOC,B-PER,I-MEDIA,I-CEL,I-BIO,I-INST,I-MYTH,B-EVE
count,830,1154,28,508,160,174,2502,2268,1656,2360,...,7824,15984,15700,15014,2818,38,8,44,18,598
mean,0.61,0.416,0.292,0.149,0.137,0.117,0.113,0.097,0.093,0.079,...,0.001,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
sum,506.44,479.84,8.18,75.45,21.98,20.44,283.74,220.75,153.39,185.36,...,11.41,5.82,5.38,1.24,0.02,0.0,0.0,0.0,0.0,0.0


#### Create Function to Display Confusion Matrix

In [29]:
import matplotlib.pyplot as plt
from sklearn.metrics import confusion_matrix, ConfusionMatrixDisplay

def plot_confusion_matrix(y_preds, y_true, labels):
    
    cm = confusion_matrix(y_true, y_preds, normalize="true")

    fig, ax = plt.subplots(figsize=(15, 15))

    # Plot confusion matrix
    disp = ConfusionMatrixDisplay(confusion_matrix=cm, display_labels=labels)
    disp.plot(cmap="Blues", values_format=".2f", ax=ax, colorbar=False)

    # Set tick label font size
    plt.tick_params(axis='both', which='both', labelsize=10)

    # Value name for rotating horizontally
    plt.xticks(rotation=90)

    # Set title
    plt.title("Normalized Confusion Matrix of XLNET-NER-SystemA")

    plt.show()

#### Display Confusion Matrix

In [1]:
eval_token_list = list(set(eval_df_tokens['labels']))
eval_df_tokens = eval_df_tokens[eval_df_tokens['labels'] != 'B-PER']
eval_df_tokens = eval_df_tokens[eval_df_tokens['predicted_label'] != 'B-PER']
plot_confusion_matrix(eval_df_tokens["labels"], eval_df_tokens["predicted_label"],
                      eval_token_list)

NameError: name 'eval_df_tokens' is not defined

#### Define & Call Function to Display Example Token Sequences Along With Labels & Losses

In [31]:
def get_samples(df):
    for _, row in df.iterrows():
        labels, preds, tokens, losses = [], [], [], []
        for i, mask in enumerate(row["attention_mask"]):
            if i not in {0, len(row["attention_mask"])}:
                labels.append(row["labels"][i])
                preds.append(row["predicted_label"][i])
                tokens.append(row["input_tokens"][i])
                losses.append(f"{row['loss'][i]:.2f}")
        eval_df_tmp = pd.DataFrame({"tokens": tokens, "labels": labels, 
                               "preds": preds, "losses": losses}).T
        yield eval_df_tmp

eval_df["total_loss"] = eval_df["loss"].apply(sum)
eval_df_tmp = eval_df.sort_values(by="total_loss", ascending=False).head(3)

pd.set_option('display.max_columns', None)

for sample in get_samples(eval_df_tmp):
    display(sample)

Unnamed: 0,0,1,2,3,4,5,6,7,8,9,10,11,12,13,14,15,16,17,18,19,20,21,22,23,24,25,26,27,28,29,30,31,32,33,34,35,36,37,38,39,40,41,42,43,44,45,46,47,48,49,50,51,52,53,54,55,56,57,58,59,60,61,62,63,64,65,66,67,68,69,70,71,72,73,74,75,76,77,78,79,80,81,82,83,84
tokens,▁species,▁include,▁Coast,▁Live,▁Oak,▁,(,▁,"""",▁Qu,er,cus,▁a,gri,folia,▁,"""",▁,),▁,",",▁Engel,mann,▁Oak,▁,(,▁,"""",▁Qu,er,cus,▁,eng,elman,ni,i,▁,"""",▁,),▁,",",▁Canyon,▁Live,▁Oak,▁,(,▁,"""",▁Qu,er,cus,▁,ch,ry,s,ole,pi,s,▁,"""",▁,),▁,",",▁and,▁Baja,▁Oak,▁,(,▁,"""",▁Qu,er,cus,▁peninsula,ris,▁,"""",▁,),▁,.,<sep>,<cls>
labels,O,O,B-PLANT,I-PLANT,I-PLANT,I-PLANT,IGN,I-PLANT,IGN,I-PLANT,IGN,IGN,I-PLANT,IGN,IGN,I-PLANT,IGN,I-PLANT,IGN,O,IGN,B-ANIM,IGN,I-ANIM,I-ANIM,IGN,I-ANIM,IGN,I-ANIM,IGN,IGN,I-ANIM,IGN,IGN,IGN,IGN,I-ANIM,IGN,I-ANIM,IGN,O,IGN,B-PLANT,I-PLANT,I-PLANT,I-PLANT,IGN,I-PLANT,IGN,I-PLANT,IGN,IGN,I-PLANT,IGN,IGN,IGN,IGN,IGN,IGN,I-PLANT,IGN,I-PLANT,IGN,O,IGN,O,B-ANIM,I-ANIM,I-ANIM,IGN,I-ANIM,IGN,I-ANIM,IGN,IGN,I-ANIM,IGN,I-ANIM,IGN,I-ANIM,IGN,O,IGN,IGN,IGN
preds,O,O,O,O,O,O,O,B-PLANT,I-PLANT,I-PLANT,O,O,O,O,B-PLANT,I-PLANT,I-PLANT,I-PLANT,I-PLANT,I-PLANT,O,O,O,O,O,O,B-PLANT,I-PLANT,I-PLANT,O,O,O,O,B-PLANT,I-PLANT,I-PLANT,I-PLANT,I-PLANT,I-PLANT,I-PLANT,I-PLANT,O,O,O,O,O,O,B-PLANT,I-PLANT,I-PLANT,O,O,O,O,B-PLANT,I-PLANT,I-PLANT,I-PLANT,I-PLANT,I-PLANT,I-PLANT,I-PLANT,I-PLANT,O,O,O,O,O,O,O,O,B-PLANT,I-PLANT,O,O,O,O,B-PLANT,I-PLANT,I-PLANT,I-PLANT,I-PLANT,O,O,O
losses,0.00,0.00,0.00,0.00,0.04,0.00,0.00,0.27,0.39,0.20,8.40,0.00,8.17,0.00,7.24,0.00,0.00,0.06,0.00,0.00,7.99,0.00,8.56,0.00,0.00,0.00,4.74,0.00,4.72,9.20,0.00,8.91,0.00,9.26,0.00,0.00,3.73,0.00,0.00,0.00,0.00,9.03,0.00,9.64,0.00,0.00,0.00,0.15,0.27,0.17,8.33,0.00,8.14,0.00,7.41,0.00,0.00,0.10,0.00,0.00,0.00,0.00,0.00,0.00,7.54,0.00,8.14,0.00,0.00,0.00,0.00,4.41,4.78,9.20,0.00,8.67,0.00,9.59,0.00,0.00,4.23,0.00,8.83,0.00,9.50


Unnamed: 0,0,1,2,3,4,5,6,7,8,9,10,11,12,13,14,15,16,17,18,19,20,21,22,23,24,25,26,27,28,29,30,31,32,33,34,35,36,37,38,39,40,41,42,43,44,45,46,47,48,49,50,51,52,53,54,55,56,57,58,59,60,61,62,63,64,65,66,67,68,69,70,71,72,73,74,75,76,77,78,79,80,81,82,83,84
tokens,▁species,▁include,▁Coast,▁Live,▁Oak,▁,(,▁,"""",▁Qu,er,cus,▁a,gri,folia,▁,"""",▁,),▁,",",▁Engel,mann,▁Oak,▁,(,▁,"""",▁Qu,er,cus,▁,eng,elman,ni,i,▁,"""",▁,),▁,",",▁Canyon,▁Live,▁Oak,▁,(,▁,"""",▁Qu,er,cus,▁,ch,ry,s,ole,pi,s,▁,"""",▁,),▁,",",▁and,▁Baja,▁Oak,▁,(,▁,"""",▁Qu,er,cus,▁peninsula,ris,▁,"""",▁,),▁,.,<sep>,<cls>
labels,O,O,B-PLANT,I-PLANT,I-PLANT,I-PLANT,IGN,I-PLANT,IGN,I-PLANT,IGN,IGN,I-PLANT,IGN,IGN,I-PLANT,IGN,I-PLANT,IGN,O,IGN,B-ANIM,IGN,I-ANIM,I-ANIM,IGN,I-ANIM,IGN,I-ANIM,IGN,IGN,I-ANIM,IGN,IGN,IGN,IGN,I-ANIM,IGN,I-ANIM,IGN,O,IGN,B-PLANT,I-PLANT,I-PLANT,I-PLANT,IGN,I-PLANT,IGN,I-PLANT,IGN,IGN,I-PLANT,IGN,IGN,IGN,IGN,IGN,IGN,I-PLANT,IGN,I-PLANT,IGN,O,IGN,O,B-ANIM,I-ANIM,I-ANIM,IGN,I-ANIM,IGN,I-ANIM,IGN,IGN,I-ANIM,IGN,I-ANIM,IGN,I-ANIM,IGN,O,IGN,IGN,IGN
preds,O,O,O,O,O,O,O,B-PLANT,I-PLANT,I-PLANT,O,O,O,O,B-PLANT,I-PLANT,I-PLANT,I-PLANT,I-PLANT,I-PLANT,O,O,O,O,O,O,B-PLANT,I-PLANT,I-PLANT,O,O,O,O,B-PLANT,I-PLANT,I-PLANT,I-PLANT,I-PLANT,I-PLANT,I-PLANT,I-PLANT,O,O,O,O,O,O,B-PLANT,I-PLANT,I-PLANT,O,O,O,O,B-PLANT,I-PLANT,I-PLANT,I-PLANT,I-PLANT,I-PLANT,I-PLANT,I-PLANT,I-PLANT,O,O,O,O,O,O,O,O,B-PLANT,I-PLANT,O,O,O,O,B-PLANT,I-PLANT,I-PLANT,I-PLANT,I-PLANT,O,O,O
losses,0.00,0.00,0.00,0.00,0.04,0.00,0.00,0.27,0.39,0.20,8.40,0.00,8.17,0.00,7.24,0.00,0.00,0.06,0.00,0.00,7.99,0.00,8.56,0.00,0.00,0.00,4.74,0.00,4.72,9.20,0.00,8.91,0.00,9.26,0.00,0.00,3.73,0.00,0.00,0.00,0.00,9.03,0.00,9.64,0.00,0.00,0.00,0.15,0.27,0.17,8.33,0.00,8.14,0.00,7.41,0.00,0.00,0.10,0.00,0.00,0.00,0.00,0.00,0.00,7.54,0.00,8.14,0.00,0.00,0.00,0.00,4.41,4.78,9.20,0.00,8.67,0.00,9.59,0.00,0.00,4.23,0.00,8.83,0.00,9.50


Unnamed: 0,0,1,2,3,4,5,6,7,8,9,10,11,12,13,14,15,16,17,18,19,20,21,22,23,24,25,26,27,28,29,30,31,32,33,34,35,36,37,38,39,40,41,42,43,44,45,46,47,48,49,50,51,52,53,54,55,56,57,58,59,60,61,62,63,64,65,66,67,68,69,70,71,72,73,74,75,76,77,78,79,80,81,82,83,84,85,86,87,88,89,90
tokens,▁of,▁the,▁chap,ar,ral,▁woodland,▁species,▁include,▁,:,▁canyon,▁live,▁oak,▁,(,▁,"""",▁Qu,er,cus,▁,ch,ry,s,ole,pi,s,▁,"""",▁,),▁,",",▁valley,▁oak,▁,(,▁,"""",▁Qu,er,cus,▁,lob,ata,▁,"""",▁,),▁,",",▁blue,▁oak,▁,(,▁,"""",▁Qu,er,cus,▁do,ug,las,ii,▁,"""",▁,),▁,",",▁and,▁gray,▁pine,▁,(,▁,"""",▁Pin,us,▁,sa,bin,iana,▁,"""",▁,),▁,.,<sep>,<cls>
labels,O,O,O,IGN,IGN,O,O,O,O,IGN,B-PLANT,I-PLANT,I-PLANT,I-PLANT,IGN,I-PLANT,IGN,I-PLANT,IGN,IGN,I-PLANT,IGN,IGN,IGN,IGN,IGN,IGN,I-PLANT,IGN,I-PLANT,IGN,O,IGN,B-PLANT,I-PLANT,I-PLANT,IGN,I-PLANT,IGN,I-PLANT,IGN,IGN,I-PLANT,IGN,IGN,I-PLANT,IGN,I-PLANT,IGN,O,IGN,B-ANIM,I-ANIM,I-ANIM,IGN,I-ANIM,IGN,I-ANIM,IGN,IGN,I-ANIM,IGN,IGN,IGN,I-ANIM,IGN,I-ANIM,IGN,O,IGN,O,O,O,O,IGN,O,IGN,O,IGN,O,IGN,IGN,IGN,O,IGN,O,IGN,O,IGN,IGN,IGN
preds,O,O,O,O,O,O,O,O,O,O,O,O,O,O,O,O,O,B-PLANT,I-PLANT,I-PLANT,I-PLANT,I-PLANT,I-PLANT,I-PLANT,I-PLANT,I-PLANT,O,O,O,O,O,O,O,O,O,O,O,O,O,B-PLANT,I-PLANT,I-PLANT,I-PLANT,I-PLANT,I-PLANT,O,O,O,O,O,O,O,O,O,O,O,O,B-PLANT,I-PLANT,I-PLANT,I-PLANT,I-PLANT,I-PLANT,I-PLANT,O,O,O,O,O,O,O,O,O,O,O,O,O,B-PLANT,I-PLANT,I-PLANT,I-PLANT,I-PLANT,I-PLANT,O,O,O,O,O,O,O,O
losses,0.00,0.00,0.00,0.00,0.00,0.00,0.00,0.00,0.00,0.00,3.04,4.02,3.80,11.31,0.00,11.28,0.00,9.23,0.00,0.00,0.04,0.00,0.00,0.00,0.00,0.00,0.00,9.79,0.00,11.01,0.00,0.00,0.00,3.52,4.32,10.81,0.00,11.45,0.00,9.31,0.00,0.00,0.05,0.00,0.00,9.94,0.00,11.00,0.00,0.00,0.00,4.01,5.80,10.35,0.00,10.90,0.00,10.83,0.00,0.00,3.68,0.00,0.00,0.00,9.84,0.00,10.85,0.00,0.00,0.00,0.00,0.15,0.12,0.00,0.00,0.00,0.00,4.81,0.00,4.33,0.00,0.00,0.00,0.00,0.00,0.00,0.00,0.00,0.00,0.00,0.00
