# <font color="#003660">Applied Machine Learning for Text Analysis (M.184.5331)</font>


# <font color="#003660">Session 2: Transformer Architecture - Encoder-only Models</font>

# <font color="#003660">Notebook 4: Named Entity Recognition with Transformers</font>

<center><br><img width=256 src="https://raw.githubusercontent.com/olivermueller/aml4ta-2021/main/resources/dag.png"/><br></center>

<p>
<center>
<div>
    <font color="#085986"><b>By the end of this lesson, you ...</b><br><br>
        ... understand the differences between sequence and token classification, <br>
        ... know how to fine-tune a NER model on labelled data, and <br>
        ... can upload a model to the Hugging Face Hub and download it again.
    </font>
</div>
</center>
</p>

The content of this notebook is heavily inspired by the following excellent sources:


*   Tunstall et al. (2021): Natural Language Processing with Transformers. O'Reilly. https://www.oreilly.com/library/view/natural-language-processing/9781098103231/
*   Hugging Face (2021): Transformer Models - Hugging Face Course. https://huggingface.co/course/



# Token vs. Sequence Classification

Token classification assigns a label to individual tokens in a sentence. One of the most common token classification tasks is Named Entity Recognition (NER). NER attempts to find a label for each entity in a sentence, such as a person, location, or organization. Learn more: https://huggingface.co/learn/llm-course/chapter7/2

The figures below show the main differences between using encoders like BERT (and its siblings and cousins) for sequence classification and for token classification. When doing **sequence classification** (e.g., sentiment analysis), we feed a sequence into the model and use only the contextual embedding of the [CLS] token as input to the classification head. In contrast, when doing **token classification**, we feed the contextual embeddings of all tokens through the classification head.

<center><img width=500 src="https://raw.githubusercontent.com/olivermueller/aml4ta-2021/main/resources/seq_class.png"/><br></center>

<hr>

<center><img width=500 src="https://raw.githubusercontent.com/olivermueller/aml4ta-2021/main/resources/token_class.png"/><br></center>
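
The difference between the two setups is easiest to see in the tensor shapes. The following is a minimal sketch with random numbers standing in for the encoder output; the sizes and the single linear layer `W`, `b` are toy assumptions, not DistilBERT's actual dimensions or head.

```python
import numpy as np

rng = np.random.default_rng(0)

batch_size, seq_len, hidden_size = 2, 6, 8   # toy sizes (assumption)
num_classes = 3                              # hypothetical number of labels

# Stand-in for the encoder output: one contextual embedding per token.
hidden_states = rng.normal(size=(batch_size, seq_len, hidden_size))

# Stand-in for the classification head: a single linear layer.
W = rng.normal(size=(hidden_size, num_classes))
b = np.zeros(num_classes)

# Sequence classification: only the embedding at position 0 ([CLS]) is used,
# so we get one row of logits per sequence.
sequence_logits = hidden_states[:, 0, :] @ W + b

# Token classification: the head is applied at every position,
# so we get one row of logits per token.
token_logits = hidden_states @ W + b

print(sequence_logits.shape)  # (batch_size, num_classes)
print(token_logits.shape)     # (batch_size, seq_len, num_classes)
```

In other words, token classification produces a separate prediction for every position in the sequence, which is why the labels later have to be aligned with the tokens one by one.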

# Import Packages

As always, we first need to load a number of required Python packages:
- `pandas` provides high-performance, easy-to-use data structures and data analysis tools.
- `numpy` is a library adding support for large, multi-dimensional arrays and matrices, along with a large collection of high-level mathematical functions to operate on these arrays.
- `torch` is an open source machine learning library used for applications such as computer vision and natural language processing, primarily developed by Facebook's AI Research lab. 
- `transformers` provides general-purpose architectures (BERT, GPT-2, RoBERTa, XLM, DistilBert, XLNet…) for Natural Language Understanding (NLU) and Natural Language Generation (NLG) with more than 32 pretrained models in 100+ languages.
- `datasets` and `evaluate` are libraries from Hugging Face to feed Transformers with data and evaluate their predictive accuracy.

In [1]:
#!pip install evaluate seqeval
#!pip install datasets
#!pip install datasets==2.19.1 # downward compatibility for datasets with scripts

In [2]:
import pandas as pd
import numpy as np
import torch
import torch.nn.functional as F
from huggingface_hub import notebook_login, login
from transformers import AutoTokenizer
from transformers import DataCollatorForTokenClassification
from transformers import AutoModelForTokenClassification, TrainingArguments, Trainer
from transformers import pipeline
from datasets import load_dataset
import evaluate

# Load data

In this notebook we will use the WNUT 17 dataset, which is a NER dataset focusing on emerging and rare entities. You can find more information about the dataset here: https://github.com/leondz/emerging_entities_17 and https://huggingface.co/datasets/manu/wnut_17

In [3]:
wnut = load_dataset("manu/wnut_17")

In [4]:
wnut["train"][0]

{'id': '0',
 'tokens': ['@paulwalk',
  'It',
  "'s",
  'the',
  'view',
  'from',
  'where',
  'I',
  "'m",
  'living',
  'for',
  'two',
  'weeks',
  '.',
  'Empire',
  'State',
  'Building',
  '=',
  'ESB',
  '.',
  'Pretty',
  'bad',
  'storm',
  'here',
  'last',
  'evening',
  '.'],
 'ner_tags': [0,
  0,
  0,
  0,
  0,
  0,
  0,
  0,
  0,
  0,
  0,
  0,
  0,
  0,
  7,
  8,
  8,
  0,
  7,
  0,
  0,
  0,
  0,
  0,
  0,
  0,
  0]}

Each number in `ner_tags` represents a type of named entity (e.g., location, person). We can convert the numbers to textual labels to learn more about those entities.

In [5]:
label_list = wnut["train"].features["ner_tags"].feature.names
label_list

['O',
 'B-corporation',
 'I-corporation',
 'B-creative-work',
 'I-creative-work',
 'B-group',
 'I-group',
 'B-location',
 'I-location',
 'B-person',
 'I-person',
 'B-product',
 'I-product']

The prefixes of the tags indicate whether a given token signifies the beginning or middle/end of a named entity:

* **B-**: indicates the beginning of an entity.
* **I-**: indicates a token is contained inside the same entity (for example, the State token is a part of an entity like Empire State Building).
* **O**: indicates that the token doesn’t correspond to any entity.
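
To make the tagging scheme concrete, here is a small sketch that groups tagged tokens back into entity spans. The helper `bio_to_spans` is a hypothetical illustration written for this notebook, not part of any library, and it simply drops `I-` tags that do not continue an open entity.

```python
def bio_to_spans(tokens, tags):
    """Group BIO-tagged tokens into (entity_type, text) spans."""
    spans = []
    current_type, current_tokens = None, []
    for token, tag in zip(tokens, tags):
        if tag.startswith("B-"):
            # A new entity starts: close the one that is still open, if any.
            if current_tokens:
                spans.append((current_type, " ".join(current_tokens)))
            current_type, current_tokens = tag[2:], [token]
        elif tag.startswith("I-") and current_type == tag[2:]:
            # The open entity continues.
            current_tokens.append(token)
        else:
            # "O", or an I- tag that does not continue the open entity.
            if current_tokens:
                spans.append((current_type, " ".join(current_tokens)))
            current_type, current_tokens = None, []
    if current_tokens:
        spans.append((current_type, " ".join(current_tokens)))
    return spans


tokens = ["Empire", "State", "Building", "=", "ESB", "."]
tags = ["B-location", "I-location", "I-location", "O", "B-location", "O"]
print(bio_to_spans(tokens, tags))
# [('location', 'Empire State Building'), ('location', 'ESB')]
```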

# Preprocess data

As always, the first thing to do when processing raw texts with Transformers is to tokenize the sequences. First, we need to load a tokenizer that is compatible with the architecture we want to use (here: DistilBERT).

In [6]:
tokenizer = AutoTokenizer.from_pretrained("distilbert-base-uncased")

Let's re-join one example from the training set and feed it through the tokenizer. This mimics how we would use the tokenizer on new data. 

In [7]:
tokenized_input = tokenizer(" ".join(wnut["train"][0]["tokens"]))
tokens = tokenizer.convert_ids_to_tokens(tokenized_input["input_ids"])
tokens

['[CLS]',
 '@',
 'paul',
 '##walk',
 'it',
 "'",
 's',
 'the',
 'view',
 'from',
 'where',
 'i',
 "'",
 'm',
 'living',
 'for',
 'two',
 'weeks',
 '.',
 'empire',
 'state',
 'building',
 '=',
 'es',
 '##b',
 '.',
 'pretty',
 'bad',
 'storm',
 'here',
 'last',
 'evening',
 '.',
 '[SEP]']

As expected, the tokenizer performed sub-word tokenization, splitting, for example, `ESB` into `es` and `##b`. In addition, the tokenizer added the usual `[CLS]` and `[SEP]` tokens.

As a result, the tokenized sequence and the labels of our training data (which has NOT been tokenized with sub-word tokenization) are not aligned anymore. A quick look at the length of the two sequences confirms this.

In [8]:
len(tokens)

34

In [9]:
len(wnut["train"][0]["tokens"])

27

Hence, we need to realign the tokens and labels using the following process:

1. Map all tokens to their corresponding word with the `word_ids` method.
2. Assign the label `-100` to the special tokens `[CLS]` and `[SEP]`, so that they can be ignored by the loss function.
3. Only label the first token of a given word. Assign `-100` to other subtokens from the same word.
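
Why `-100`? PyTorch's cross-entropy loss skips targets equal to its `ignore_index`, which defaults to `-100`. The following pure-Python sketch imitates that behaviour on toy numbers; the function `masked_cross_entropy` is a hypothetical stand-in for illustration, not the code the `Trainer` actually runs.

```python
import math


def masked_cross_entropy(logits, labels, ignore_index=-100):
    """Mean negative log-likelihood over positions whose label is not ignored."""
    losses = []
    for row, label in zip(logits, labels):
        if label == ignore_index:
            continue  # special tokens and continuation sub-tokens are skipped
        log_norm = math.log(sum(math.exp(x) for x in row))
        losses.append(log_norm - row[label])
    return sum(losses) / len(losses)


logits = [
    [0.1, 0.2, 0.3],   # [CLS]                    -> label -100, ignored
    [2.0, 0.0, 0.0],   # first sub-token of a word -> label 0, counted
    [0.0, 0.0, 5.0],   # continuation sub-token    -> label -100, ignored
]
labels = [-100, 0, -100]

loss = masked_cross_entropy(logits, labels)
print(loss)
```

Only the second position contributes to the loss, so whatever the model predicts for `[CLS]` or for continuation sub-tokens has no effect on training.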

Below is a function to tokenize the text and, afterwards, realign the tokens and labels.

In [10]:
def tokenize_and_align_labels(examples):
    tokenized_inputs = tokenizer(examples["tokens"], truncation=True, is_split_into_words=True)

    labels = []
    for i, label in enumerate(examples["ner_tags"]):
        word_ids = tokenized_inputs.word_ids(batch_index=i)  # Map tokens to their respective word.
        previous_word_idx = None
        label_ids = []
        for word_idx in word_ids:  # Set the special tokens to -100.
            if word_idx is None:
                label_ids.append(-100)
            elif word_idx != previous_word_idx:  # Only label the first token of a given word.
                label_ids.append(label[word_idx])
            else:
                label_ids.append(-100)
            previous_word_idx = word_idx
        labels.append(label_ids)

    tokenized_inputs["labels"] = labels
    return tokenized_inputs
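
The inner loop of this function can be tried in isolation, without a tokenizer. In the sketch below, `word_ids` is written out by hand for the fragment `Empire State Building = ESB` (an assumption based on the tokenization shown earlier, where `ESB` is split into `es` and `##b`), and `align_labels` is a hypothetical helper that repeats the same logic.

```python
def align_labels(word_ids, word_labels):
    """Return one label per token; -100 marks positions the loss should ignore."""
    label_ids = []
    previous_word_idx = None
    for word_idx in word_ids:
        if word_idx is None:
            label_ids.append(-100)                   # special token
        elif word_idx != previous_word_idx:
            label_ids.append(word_labels[word_idx])  # first sub-token of a word
        else:
            label_ids.append(-100)                   # continuation sub-token
        previous_word_idx = word_idx
    return label_ids


# tokens:    [CLS] empire state building  =  es  ##b  [SEP]
word_ids = [None,   0,     1,      2,     3,  4,   4,  None]
word_labels = [7, 8, 8, 0, 7]  # B-location, I-location, I-location, O, B-location

print(align_labels(word_ids, word_labels))
# [-100, 7, 8, 8, 0, 7, -100, -100]
```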

Apply the function to the whole dataset (i.e., train, validation, and test).

In [11]:
tokenized_wnut = wnut.map(tokenize_and_align_labels, batched=True)


Check whether the sequences are aligned now.

In [12]:
tokenized_wnut["train"][0]

{'id': '0',
 'tokens': ['@paulwalk',
  'It',
  "'s",
  'the',
  'view',
  'from',
  'where',
  'I',
  "'m",
  'living',
  'for',
  'two',
  'weeks',
  '.',
  'Empire',
  'State',
  'Building',
  '=',
  'ESB',
  '.',
  'Pretty',
  'bad',
  'storm',
  'here',
  'last',
  'evening',
  '.'],
 'ner_tags': [0,
  0,
  0,
  0,
  0,
  0,
  0,
  0,
  0,
  0,
  0,
  0,
  0,
  0,
  7,
  8,
  8,
  0,
  7,
  0,
  0,
  0,
  0,
  0,
  0,
  0,
  0],
 'input_ids': [101,
  1030,
  2703,
  17122,
  2009,
  1005,
  1055,
  1996,
  3193,
  2013,
  2073,
  1045,
  1005,
  1049,
  2542,
  2005,
  2048,
  3134,
  1012,
  3400,
  2110,
  2311,
  1027,
  9686,
  2497,
  1012,
  3492,
  2919,
  4040,
  2182,
  2197,
  3944,
  1012,
  102],
 'attention_mask': [1,
  1,
  1,
  1,
  1,
  1,
  1,
  1,
  1,
  1,
  1,
  1,
  1,
  1,
  1,
  1,
  1,
  1,
  1,
  1,
  1,
  1,
  1,
  1,
  1,
  1,
  1,
  1,
  1,
  1,
  1,
  1,
  1,
  1],
 'labels': [-100,
  0,
  -100,
  -100,
  0,
  0,
  -100,
  0,
  0,
  0,
  0,
  0,
  0,
  

In [13]:
len(tokens) == len(tokenized_wnut["train"][0]["input_ids"])

True

To preprocess our texts on-the-fly while training our model, we can use a data collator (more information: https://huggingface.co/docs/transformers/main_classes/data_collator).

In [14]:
data_collator = DataCollatorForTokenClassification(tokenizer=tokenizer)
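
Conceptually, the collator pads every example in a batch to the length of the longest one, and it pads the labels with `-100` so that padding positions are ignored by the loss. The sketch below is a simplified pure-Python imitation of that idea; `pad_batch` is a hypothetical helper, the real collator returns tensors, and the padding id `0` is an assumption that matches the `[PAD]` token of BERT-style vocabularies.

```python
def pad_batch(features, pad_token_id=0, label_pad_id=-100):
    """Pad input_ids, attention_mask and labels to the longest example in the batch."""
    max_len = max(len(f["input_ids"]) for f in features)
    batch = {"input_ids": [], "attention_mask": [], "labels": []}
    for f in features:
        n_tokens = len(f["input_ids"])
        n_pad = max_len - n_tokens
        batch["input_ids"].append(f["input_ids"] + [pad_token_id] * n_pad)
        batch["attention_mask"].append([1] * n_tokens + [0] * n_pad)
        batch["labels"].append(f["labels"] + [label_pad_id] * n_pad)
    return batch


features = [
    {"input_ids": [101, 1996, 102], "labels": [-100, 0, -100]},
    {"input_ids": [101, 2624, 3799, 1012, 102], "labels": [-100, 7, 8, 0, -100]},
]
batch = pad_batch(features)
print(batch["input_ids"][0])       # [101, 1996, 102, 0, 0]
print(batch["attention_mask"][0])  # [1, 1, 1, 0, 0]
print(batch["labels"][0])          # [-100, 0, -100, -100, -100]
```

Padding per batch rather than to a global maximum length keeps short batches short, which saves computation.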

# Fine-tune model

Now we are (almost) ready to fine-tune a pre-trained DistilBERT on our emerging and rare named entities dataset. We will also see how to publish our model (checkpoint + tokenizer) on the Hugging Face Hub and download it again.

As always, when training a model we need a `compute_metrics` function to calculate and display selected evaluation metrics during training. Here we use precision, recall, F1, and accuracy (all four are returned together by the `seqeval` evaluation object).

In [15]:
seqeval = evaluate.load("seqeval")

def compute_metrics(p):
    predictions, labels = p
    predictions = np.argmax(predictions, axis=2)

    true_predictions = [
        [label_list[p] for (p, l) in zip(prediction, label) if l != -100]
        for prediction, label in zip(predictions, labels)
    ]
    true_labels = [
        [label_list[l] for (p, l) in zip(prediction, label) if l != -100]
        for prediction, label in zip(predictions, labels)
    ]

    results = seqeval.compute(predictions=true_predictions, references=true_labels)
    return {
        "precision": results["overall_precision"],
        "recall": results["overall_recall"],
        "f1": results["overall_f1"],
        "accuracy": results["overall_accuracy"],
    }
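
The filtering step inside `compute_metrics` can be checked on a toy batch. The sketch below uses made-up logits and a shortened, hypothetical label list (`toy_label_list`) to show that positions labelled `-100` are dropped before the predictions and references are compared.

```python
import numpy as np

toy_label_list = ["O", "B-person", "I-person"]  # hypothetical, shorter than WNUT 17's

# One sequence with five token positions and three classes (made-up numbers).
logits = np.array([[
    [0.1, 0.2, 0.3],   # [CLS]          -> ignored
    [0.0, 2.0, 0.1],   # first sub-token -> argmax 1
    [0.5, 0.1, 0.9],   # continuation    -> ignored
    [1.5, 0.2, 0.1],   # next word       -> argmax 0
    [0.3, 0.3, 0.4],   # [SEP]          -> ignored
]])
labels = np.array([[-100, 1, -100, 0, -100]])

pred_ids = np.argmax(logits, axis=2)

kept_predictions = [
    [toy_label_list[p] for p, l in zip(pred_row, label_row) if l != -100]
    for pred_row, label_row in zip(pred_ids, labels)
]
kept_labels = [
    [toy_label_list[l] for p, l in zip(pred_row, label_row) if l != -100]
    for pred_row, label_row in zip(pred_ids, labels)
]

print(kept_predictions)  # [['B-person', 'O']]
print(kept_labels)       # [['B-person', 'O']]
```

Note that `seqeval` computes precision, recall, and F1 at the level of whole entities, while its accuracy is computed per token. Because most tokens are `O`, accuracy can be high even when many entities are missed.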

In order to be able to interpret the model's predictions, we also need two dictionaries to translate between the textual labels and their indices.

In [16]:
id2label = {
    0: "O",
    1: "B-corporation",
    2: "I-corporation",
    3: "B-creative-work",
    4: "I-creative-work",
    5: "B-group",
    6: "I-group",
    7: "B-location",
    8: "I-location",
    9: "B-person",
    10: "I-person",
    11: "B-product",
    12: "I-product",
}

label2id = {
    "O": 0,
    "B-corporation": 1,
    "I-corporation": 2,
    "B-creative-work": 3,
    "I-creative-work": 4,
    "B-group": 5,
    "I-group": 6,
    "B-location": 7,
    "I-location": 8,
    "B-person": 9,
    "I-person": 10,
    "B-product": 11,
    "I-product": 12,
}
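
Writing both dictionaries by hand is explicit but easy to get out of sync. As an alternative sketch, the same mappings can be derived from the list of label names; here `label_list` is repeated literally so that the snippet runs on its own, whereas in the notebook it comes from the dataset features.

```python
label_list = [
    "O",
    "B-corporation", "I-corporation",
    "B-creative-work", "I-creative-work",
    "B-group", "I-group",
    "B-location", "I-location",
    "B-person", "I-person",
    "B-product", "I-product",
]

# Index -> label name, and the inverse mapping.
id2label = dict(enumerate(label_list))
label2id = {label: idx for idx, label in id2label.items()}

print(id2label[7])            # B-location
print(label2id["I-product"])  # 12
```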

🚀 🚀 🚀 Now we are really ready to fine-tune... 🚀 🚀 🚀

Load the pre-trained model, configure the classification head (how many labels?), and pass the dictionaries to translate between label IDs and textual labels.

In [17]:
model = AutoModelForTokenClassification.from_pretrained(
    "distilbert-base-uncased", num_labels=len(label2id), id2label=id2label, label2id=label2id
)

Some weights of DistilBertForTokenClassification were not initialized from the model checkpoint at distilbert-base-uncased and are newly initialized: ['classifier.bias', 'classifier.weight']
You should probably TRAIN this model on a down-stream task to be able to use it for predictions and inference.


Configure the Trainer. Here, we will only train for 2 epochs. In reality, you probably want to train longer!

In [18]:
training_args = TrainingArguments(
    output_dir="my_awesome_wnut_model_2025",
    learning_rate=2e-5,
    per_device_train_batch_size=16,
    per_device_eval_batch_size=16,
    num_train_epochs=2,
    weight_decay=0.01,
    eval_strategy="epoch",
    save_strategy="epoch",
    load_best_model_at_end=True,
    report_to="none",
    no_cuda=True, # Apple M compatibility
    push_to_hub=False, # You need to be logged in for this
)

trainer = Trainer(
    model=model,
    args=training_args,
    train_dataset=tokenized_wnut["train"],
    eval_dataset=tokenized_wnut["test"],
    tokenizer=tokenizer,
    data_collator=data_collator,
    compute_metrics=compute_metrics,
)



Go! 🏁

In [19]:
trainer.train()

| Epoch | Training Loss | Validation Loss | Precision | Recall | F1 | Accuracy |
|---|---|---|---|---|---|---|
| 1 | No log | 0.2751 | 0.577963 | 0.257646 | 0.35641 | 0.938566 |
| 2 | No log | 0.270022 | 0.57541 | 0.325301 | 0.415631 | 0.941943 |




TrainOutput(global_step=426, training_loss=0.20419825522552634, metrics={'train_runtime': 226.1982, 'train_samples_per_second': 30.009, 'train_steps_per_second': 1.883, 'total_flos': 91781128898820.0, 'train_loss': 0.20419825522552634, 'epoch': 2.0})
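
As a quick sanity check, the reported `global_step` can be reproduced from the other numbers in `TrainOutput`. The sketch below infers the size of the training set from the runtime and throughput, and it assumes a single device, no gradient accumulation, and that the last, smaller batch of each epoch is kept.

```python
import math

# Values copied from the TrainOutput above and from TrainingArguments.
train_runtime = 226.1982        # seconds
samples_per_second = 30.009
num_epochs = 2
batch_size = 16

# Samples processed in total, divided by the number of epochs.
n_train = round(train_runtime * samples_per_second / num_epochs)

steps_per_epoch = math.ceil(n_train / batch_size)
total_steps = steps_per_epoch * num_epochs

print(n_train, steps_per_epoch, total_steps)  # 3394 213 426
```

The `No log` entries in the Training Loss column would be consistent with this: `TrainingArguments` logs the training loss every 500 steps by default (`logging_steps=500`), and this run ends after 426 steps.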

# Inference

## Using a Pipeline

In the following cells, we instantiate a pipeline, which integrates a trained model (aka checkpoint) and a compatible tokenizer, and feed it with a sample sentence.

In [20]:
text = "The Golden State Warriors are an American professional basketball team based in San Francisco."

In [21]:
ner_classifier = pipeline("ner", model=model, tokenizer=tokenizer)

Device set to use mps:0


In [22]:
ner_classifier(text)

[{'entity': 'B-group',
  'score': 0.21721636,
  'index': 1,
  'word': 'the',
  'start': 0,
  'end': 3},
 {'entity': 'B-location',
  'score': 0.45459718,
  'index': 2,
  'word': 'golden',
  'start': 4,
  'end': 10},
 {'entity': 'I-location',
  'score': 0.21033008,
  'index': 3,
  'word': 'state',
  'start': 11,
  'end': 16},
 {'entity': 'B-group',
  'score': 0.24908938,
  'index': 4,
  'word': 'warriors',
  'start': 17,
  'end': 25},
 {'entity': 'B-location',
  'score': 0.6600871,
  'index': 13,
  'word': 'san',
  'start': 80,
  'end': 83},
 {'entity': 'B-location',
  'score': 0.51444423,
  'index': 14,
  'word': 'francisco',
  'start': 84,
  'end': 93}]

## By Hand

Instead of using a pipeline, we can also compute predictions step-by-step. Reproducing each of the steps by hand ✍️ will increase your understanding of the underlying logic.

Tokenize the example sentence.

In [23]:
inputs = tokenizer(text, return_tensors="pt")
inputs

{'input_ids': tensor([[ 101, 1996, 3585, 2110, 6424, 2024, 2019, 2137, 2658, 3455, 2136, 2241,
         1999, 2624, 3799, 1012,  102]]), 'attention_mask': tensor([[1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1]])}

Do a forward pass through the network.

In [24]:
with torch.no_grad():
    # Move inputs to the same device as the model
    inputs = {name: tensor.to(model.device) for name, tensor in inputs.items()}
    logits = model(**inputs).logits
logits

tensor([[[ 1.0330e+00, -2.5403e-01,  1.0945e-01, -8.0330e-01, -9.2720e-02,
           4.5146e-01,  1.7608e-02, -2.9176e-01,  2.0122e-01, -4.1523e-01,
          -4.4312e-01, -2.3184e-01, -3.9708e-01],
         [ 1.1797e+00, -5.6219e-01, -7.0985e-01, -6.2261e-01, -1.0430e+00,
           1.2617e+00,  8.2001e-01,  7.0649e-01,  2.4967e-01, -6.2227e-01,
          -8.8441e-01, -6.7178e-01, -7.3283e-01],
         [ 2.4666e-01, -5.6814e-01, -7.7910e-01, -4.8853e-01, -1.0387e+00,
           1.0034e+00, -4.9541e-02,  2.3482e+00,  1.2157e+00, -1.7934e-01,
          -8.6357e-01, -6.9077e-01, -7.4703e-01],
         [ 9.0998e-01, -7.3500e-01, -5.5140e-01, -9.8822e-01, -9.0076e-01,
           1.0711e+00,  4.8070e-01,  7.8460e-01,  1.2280e+00, -6.7979e-01,
          -7.6542e-01, -9.5512e-01, -8.7403e-01],
         [ 9.7002e-01, -7.4336e-01, -7.7352e-01, -9.5438e-01, -1.0250e+00,
           1.4820e+00,  1.2102e+00,  3.4165e-01,  6.2862e-01, -4.8289e-01,
          -3.5293e-01, -6.8561e-01, -7.1134e-01],


For each token, get the index of the label with the highest score.

In [25]:
predictions = torch.argmax(logits, dim=2)
predictions

tensor([[0, 5, 7, 8, 5, 0, 0, 0, 0, 0, 0, 0, 0, 7, 7, 0, 0]], device='mps:0')

Translate the indices to textual labels.

In [26]:
predicted_token_class = [model.config.id2label[t.item()] for t in predictions[0]]
predicted_token_class

['O',
 'B-group',
 'B-location',
 'I-location',
 'B-group',
 'O',
 'O',
 'O',
 'O',
 'O',
 'O',
 'O',
 'O',
 'B-location',
 'B-location',
 'O',
 'O']
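
The `score` values reported by the pipeline can be recovered from the logits as well: they are the softmax probabilities of the predicted labels. As a small check, the sketch below applies a hand-written softmax to the logits of the token `golden` (the third row of the tensor printed above, copied by hand with the printed precision).

```python
import math

# Logits for the token "golden", copied from the notebook output above.
golden_logits = [
    0.24666, -0.56814, -0.77910, -0.48853, -1.0387,
    1.0034, -0.049541, 2.3482, 1.2157, -0.17934,
    -0.86357, -0.69077, -0.74703,
]


def softmax(xs):
    """Numerically stable softmax over a list of floats."""
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]


probs = softmax(golden_logits)
best = max(range(len(probs)), key=probs.__getitem__)

print(best)                   # 7, i.e. B-location
print(round(probs[best], 3))  # 0.455
```

This agrees, up to rounding, with the score of about 0.4546 that the pipeline reported for `golden`.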