# <font color="#003660">Applied Machine Learning for Text Analysis (M.184.5331)</font>


# <font color="#003660">Week 7: Named Entity Recognition (NER) with Transformers</font>

# <font color="#003660">Notebook 1: Fine-tuning a NER model</font>

<center><br><img width=256 src="https://raw.githubusercontent.com/olivermueller/aml4ta-2021/main/resources/dag.png"/><br></center>

<p>
<center>
<div>
    <font color="#085986"><b>By the end of this lesson, you ...</b><br><br>
        ... understand the differences between sequence and token classification, <br>
        ... know how to fine-tune a NER model on labelled data.
    </font>
</div>
</center>
</p>

This notebook is heavily inspired by these excellent sources:


*   Tunstall et al. (2021): Natural Language Processing with Transformers. O'Reilly. https://www.oreilly.com/library/view/natural-language-processing/9781098103231/
*   Hugging Face (2021): Transformer Models - Hugging Face Course. https://huggingface.co/course/



# Token vs. Sequence Classification

Token classification assigns a label to individual tokens in a sentence. One of the most common token classification tasks is Named Entity Recognition (NER). NER attempts to find a label for each entity in a sentence, such as a person, location, or organization.

Watch the Hugging Face YouTube video below to learn more about token classification / NER.

In [None]:
from IPython.display import YouTubeVideo
YouTubeVideo('wVHdVlPScxA')

The figures below show the main differences between using encoders like BERT and its siblings and cousins for sequence classification and for token classification. When doing **sequence classification** (e.g., sentiment analysis), we feed a sequence into the model and only work with the contextual embedding of the [CLS] token when training the classification head of the model. In contrast, when doing **token classification**, we feed the contextual embeddings of all tokens through the classification head of the model, so that every token receives its own prediction.

<center><img width=500 src="https://raw.githubusercontent.com/olivermueller/aml4ta-2021/main/resources/seq_class.png"/><br></center>

<hr>

<center><img width=500 src="https://raw.githubusercontent.com/olivermueller/aml4ta-2021/main/resources/token_class.png"/><br></center>
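To make the difference concrete, here is a minimal shape-level sketch. It is not how the Hugging Face model classes are implemented: the random `hidden_states` array merely stands in for the encoder output, the single linear layer (`W`, `b`) stands in for the classification head, and all sizes are toy values chosen for illustration.

```python
import numpy as np

rng = np.random.default_rng(0)

seq_len, hidden_size, num_labels = 6, 8, 3  # toy sizes (assumptions)

# Stand-in for the encoder output: one contextual embedding per token,
# with the [CLS] token at position 0.
hidden_states = rng.normal(size=(seq_len, hidden_size))

# Stand-in for the classification head: a single linear layer.
W = rng.normal(size=(hidden_size, num_labels))
b = np.zeros(num_labels)

# Sequence classification: only the [CLS] embedding goes through the head,
# giving one vector of logits for the whole sequence.
sequence_logits = hidden_states[0] @ W + b

# Token classification: every token embedding goes through the head,
# giving one vector of logits per token.
token_logits = hidden_states @ W + b

print(sequence_logits.shape)  # (3,)
print(token_logits.shape)     # (6, 3)
```

The same weights are reused for both cases only to keep the sketch short; in practice each task trains its own head.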

# Import Packages

As always, we first need to load a number of required Python packages:
- `pandas` provides high-performance, easy-to-use data structures and data analysis tools.
- `numpy` is a library adding support for large, multi-dimensional arrays and matrices, along with a large collection of high-level mathematical functions to operate on these arrays.
- `torch` is an open source machine learning library used for applications such as computer vision and natural language processing, primarily developed by Facebook's AI Research lab. 
- `transformers` provides general-purpose architectures (BERT, GPT-2, RoBERTa, XLM, DistilBert, XLNet…) for Natural Language Understanding (NLU) and Natural Language Generation (NLG) with 32+ pretrained models in 100+ languages.
- `datasets` and `evaluate` are libraries from Hugging Face to feed Transformers with data and evaluate their predictive accuracy.

In [None]:
!pip install transformers datasets evaluate seqeval

In [None]:
import pandas as pd
import numpy as np
import torch
import torch.nn.functional as F
from huggingface_hub import notebook_login
from transformers import AutoTokenizer
from transformers import DataCollatorForTokenClassification
from transformers import AutoModelForTokenClassification, TrainingArguments, Trainer
from transformers import pipeline
from datasets import load_dataset  # load_metric is unused here and no longer available in recent datasets releases
import evaluate

# Load data

In this notebook we will use the WNUT 17 dataset, which is a NER dataset focusing on emerging and rare entities. You can find more information about the dataset here: https://huggingface.co/datasets/wnut_17

In [None]:
wnut = load_dataset("wnut_17")

In [None]:
wnut

In [None]:
wnut["train"][0]

Each number in `ner_tags` represents a type of named entity (e.g., location, person). We can convert the numbers to textual labels to learn more about those entities.

In [None]:
label_list = wnut["train"].features["ner_tags"].feature.names
label_list

The prefixes of the tags indicate whether a given token signifies the beginning or middle/end of a named entity:

* **B-**: indicates the beginning of an entity.
* **I-**: indicates that a token is inside an entity that has already begun (for example, the `State` token is part of the entity `Empire State Building`).
* **O**: indicates that the token doesn’t correspond to any entity.
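The tagging scheme becomes clearer when we turn a tagged sentence back into entities. The helper below is a small sketch written for this notebook (`bio_to_entities` is a hypothetical name, not part of any library), and the example sentence and its tags are made up for illustration.

```python
def bio_to_entities(tokens, tags):
    """Group B-/I-/O tagged tokens into (entity_type, text) pairs."""
    entities = []
    current_type, current_tokens = None, []
    for token, tag in zip(tokens, tags):
        if tag.startswith("B-"):
            # A B- tag always opens a new entity; close the open one first.
            if current_tokens:
                entities.append((current_type, " ".join(current_tokens)))
            current_type, current_tokens = tag[2:], [token]
        elif tag.startswith("I-") and current_type == tag[2:]:
            # An I- tag of the same type continues the open entity.
            current_tokens.append(token)
        else:
            # "O", or an I- tag that does not continue the open entity
            # (this sketch simply drops such stray I- tags).
            if current_tokens:
                entities.append((current_type, " ".join(current_tokens)))
            current_type, current_tokens = None, []
    if current_tokens:
        entities.append((current_type, " ".join(current_tokens)))
    return entities


example_tokens = ["Empire", "State", "Building", "is", "in", "New", "York"]
example_tags = ["B-location", "I-location", "I-location", "O", "O", "B-location", "I-location"]
print(bio_to_entities(example_tokens, example_tags))
# [('location', 'Empire State Building'), ('location', 'New York')]
```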

# Preprocess data

As always, the first thing to do when processing raw texts with Transformers is to tokenize the sequences. First, we need to load a tokenizer that is compatible with the architecture we want to use (here: DistilBERT).

In [None]:
tokenizer = AutoTokenizer.from_pretrained("distilbert-base-uncased")

Let's re-join one example from the training set and feed it through the tokenizer. This mimics how we would use the tokenizer on new data. 

In [None]:
tokenized_input = tokenizer(" ".join(wnut["train"][0]["tokens"]))
tokens = tokenizer.convert_ids_to_tokens(tokenized_input["input_ids"])
tokens

As expected, the tokenizer performed sub-word tokenization, splitting, for example, `ESB` into `es` and `##b`. In addition, the tokenizer added the usual `[CLS]` and `[SEP]` tokens.
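To build some intuition for where pieces like `es` and `##b` come from, here is a simplified sketch of greedy longest-match-first splitting, the idea behind WordPiece tokenization. It is not the library's implementation: `wordpiece_split` is a hypothetical helper and `toy_vocab` is a tiny made-up vocabulary. The input is already lowercased, as the uncased checkpoint would do.

```python
def wordpiece_split(word, vocab, unk_token="[UNK]"):
    """Split one word into the longest vocabulary pieces, left to right."""
    pieces = []
    start = 0
    while start < len(word):
        end = len(word)
        piece = None
        # Try the longest remaining substring first, then shrink it.
        while end > start:
            candidate = word[start:end]
            if start > 0:
                candidate = "##" + candidate  # marks a continuation piece
            if candidate in vocab:
                piece = candidate
                break
            end -= 1
        if piece is None:
            # No piece matches: the whole word becomes the unknown token.
            return [unk_token]
        pieces.append(piece)
        start = end
    return pieces


toy_vocab = {"es", "##b", "empire", "state", "building"}  # made-up vocabulary
print(wordpiece_split("esb", toy_vocab))     # ['es', '##b']
print(wordpiece_split("empire", toy_vocab))  # ['empire']
```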

Because of the sub-word tokenization and the added special tokens, the tokenized sequence and the labels of our training data (which has NOT been tokenized with sub-word tokenization) are no longer aligned.

In [None]:
len(tokens)

In [None]:
len(wnut["train"][0]["tokens"])

Hence, we need to realign the tokens and labels in three steps:

1. Mapping all tokens to their corresponding word with the `word_ids` method.
2. Assigning the label `-100` to the special tokens `[CLS]` and `[SEP]`, so that they will be ignored by the loss function.
3. Only labeling the first token of a given word. Assign `-100` to other subtokens from the same word.


Below is a function to realign the tokens and labels, and truncate sequences that are longer than our model's maximum sequence length.

In [None]:
def tokenize_and_align_labels(examples):
    tokenized_inputs = tokenizer(examples["tokens"], truncation=True, is_split_into_words=True)

    labels = []
    for i, label in enumerate(examples["ner_tags"]):
        word_ids = tokenized_inputs.word_ids(batch_index=i)  # Map tokens to their respective word.
        previous_word_idx = None
        label_ids = []
        for word_idx in word_ids:  # Set the special tokens to -100.
            if word_idx is None:
                label_ids.append(-100)
            elif word_idx != previous_word_idx:  # Only label the first token of a given word.
                label_ids.append(label[word_idx])
            else:
                label_ids.append(-100)
            previous_word_idx = word_idx
        labels.append(label_ids)

    tokenized_inputs["labels"] = labels
    return tokenized_inputs
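To see the alignment rule in isolation, here is a toy sketch that applies the same logic to hand-written word IDs, without calling the tokenizer. The helper name `align_labels` and the example (two words, of which the first is split into two sub-word tokens) are assumptions for illustration only.

```python
def align_labels(word_ids, word_labels, ignore_index=-100):
    """Spread word-level labels over sub-word tokens."""
    aligned = []
    previous_word_idx = None
    for word_idx in word_ids:
        if word_idx is None:
            # Special tokens such as [CLS] and [SEP] have no word.
            aligned.append(ignore_index)
        elif word_idx != previous_word_idx:
            # First sub-word token of a word keeps the word's label.
            aligned.append(word_labels[word_idx])
        else:
            # Remaining sub-word tokens of the same word are ignored.
            aligned.append(ignore_index)
        previous_word_idx = word_idx
    return aligned


# Hypothetical tokenization: [CLS] es ##b tonight [SEP]
toy_word_ids = [None, 0, 0, 1, None]
toy_word_labels = [7, 0]  # 7 = B-location, 0 = O
print(align_labels(toy_word_ids, toy_word_labels))
# [-100, 7, -100, 0, -100]
```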

Apply the function to the whole dataset (i.e., train, validation, and test).

In [None]:
tokenized_wnut = wnut.map(tokenize_and_align_labels, batched=True)

Check whether the sequences are aligned now.

In [None]:
tokenized_wnut["train"][0]
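Why `-100`? It is the default `ignore_index` of PyTorch's cross-entropy loss, so positions carrying this label do not contribute to the loss. The pure-Python sketch below mimics that behaviour for a single sequence. It is a simplified illustration with made-up logits, not the PyTorch implementation, and it assumes that at least one position has a real label.

```python
import math


def masked_cross_entropy(logits, labels, ignore_index=-100):
    """Mean cross-entropy over all positions whose label is not ignore_index."""
    losses = []
    for token_logits, label in zip(logits, labels):
        if label == ignore_index:
            continue  # special tokens and non-first sub-words are skipped
        # Numerically stable log-sum-exp of the logits.
        max_logit = max(token_logits)
        log_sum_exp = max_logit + math.log(
            sum(math.exp(x - max_logit) for x in token_logits)
        )
        losses.append(log_sum_exp - token_logits[label])
    return sum(losses) / len(losses)


toy_logits = [[9.0, 0.0, 0.0], [0.0, 2.0, 0.0], [5.0, 5.0, 5.0]]  # made-up values
toy_labels = [-100, 1, -100]
loss = masked_cross_entropy(toy_logits, toy_labels)
print(round(loss, 4))
```

Only the middle token enters the loss here. Changing the logits of the two ignored positions leaves the result unchanged.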

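All sequences in a batch must have the same length. Instead of padding every sequence to the maximum length up front, we can use a data collator that pads the inputs and labels of each batch on-the-fly to the longest sequence in that batch (more information: https://huggingface.co/docs/transformers/main_classes/data_collator).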

In [None]:
data_collator = DataCollatorForTokenClassification(tokenizer=tokenizer)
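Roughly speaking, the collator pads each batch to the length of its longest sequence and pads the labels with `-100`, so that the padding is ignored by the loss. The function below is a simplified pure-Python sketch of that idea, not the library's code: `pad_batch` is a hypothetical helper, the token IDs are arbitrary, and a padding token ID of `0` is assumed. The real collator additionally returns tensors.

```python
def pad_batch(features, pad_token_id=0, label_pad_id=-100):
    """Pad input_ids, attention_mask and labels to the longest sequence in the batch."""
    max_len = max(len(f["input_ids"]) for f in features)
    batch = {"input_ids": [], "attention_mask": [], "labels": []}
    for f in features:
        n_tokens = len(f["input_ids"])
        n_pad = max_len - n_tokens
        batch["input_ids"].append(f["input_ids"] + [pad_token_id] * n_pad)
        batch["attention_mask"].append([1] * n_tokens + [0] * n_pad)
        batch["labels"].append(f["labels"] + [label_pad_id] * n_pad)
    return batch


toy_features = [  # arbitrary token IDs, for illustration only
    {"input_ids": [101, 11, 12, 102], "labels": [-100, 9, 10, -100]},
    {"input_ids": [101, 13, 102], "labels": [-100, 0, -100]},
]
toy_batch = pad_batch(toy_features)
print(toy_batch["input_ids"])  # [[101, 11, 12, 102], [101, 13, 102, 0]]
print(toy_batch["labels"])     # [[-100, 9, 10, -100], [-100, 0, -100, -100]]
```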

# Fine-tune model

Now we are (almost) ready to fine-tune a pre-trained DistilBERT on our emerging and rare named entities dataset. We will also see how to publish our model (checkpoint + tokenizer) on the Hugging Face Hub and download it again.

Let's first log into the Hugging Face Hub.

In [None]:
notebook_login()

As always, when training a model we need a `compute_metrics` function to calculate and display selected evaluation metrics during training. Here we use precision, recall, F1, and accuracy (they are all provided by the `seqeval` evaluation object).

In [None]:
seqeval = evaluate.load("seqeval")

def compute_metrics(p):
    predictions, labels = p
    predictions = np.argmax(predictions, axis=2)

    true_predictions = [
        [label_list[pred] for (pred, lab) in zip(prediction, label) if lab != -100]
        for prediction, label in zip(predictions, labels)
    ]
    true_labels = [
        [label_list[lab] for (pred, lab) in zip(prediction, label) if lab != -100]
        for prediction, label in zip(predictions, labels)
    ]

    results = seqeval.compute(predictions=true_predictions, references=true_labels)
    return {
        "precision": results["overall_precision"],
        "recall": results["overall_recall"],
        "f1": results["overall_f1"],
        "accuracy": results["overall_accuracy"],
    }
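The two list comprehensions are the heart of this function: they take the arg-max over the label dimension and then drop every position labelled `-100`. The toy sketch below replays just that step with made-up logits and a shortened, hypothetical label list (`toy_label_list`), so that the effect can be checked by eye.

```python
import numpy as np

toy_label_list = ["O", "B-person", "I-person"]  # hypothetical, shorter than WNUT's label list

# Logits with shape (batch=1, tokens=5, labels=3); all values are made up.
toy_logits = np.array([[
    [2.0, 0.1, 0.1],  # [CLS]
    [0.2, 3.0, 0.1],  # first sub-word of a name
    [0.1, 0.3, 2.5],  # second sub-word of the same name
    [1.5, 0.2, 0.1],  # an ordinary word
    [2.0, 0.0, 0.0],  # [SEP]
]])
toy_labels = np.array([[-100, 1, -100, 0, -100]])

toy_predictions = np.argmax(toy_logits, axis=2)

# Keep only positions that carry a real label.
true_predictions = [
    [toy_label_list[pred] for pred, lab in zip(pred_row, label_row) if lab != -100]
    for pred_row, label_row in zip(toy_predictions, toy_labels)
]
true_labels = [
    [toy_label_list[lab] for pred, lab in zip(pred_row, label_row) if lab != -100]
    for pred_row, label_row in zip(toy_predictions, toy_labels)
]

print(true_predictions)  # [['B-person', 'O']]
print(true_labels)       # [['B-person', 'O']]
```

The prediction for the second sub-word (`I-person`) never reaches the metric, because its label is `-100`.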

To interpret the model's predictions, we also need two dictionaries that translate between the textual labels and their indices.

In [None]:
id2label = {
    0: "O",
    1: "B-corporation",
    2: "I-corporation",
    3: "B-creative-work",
    4: "I-creative-work",
    5: "B-group",
    6: "I-group",
    7: "B-location",
    8: "I-location",
    9: "B-person",
    10: "I-person",
    11: "B-product",
    12: "I-product",
}

label2id = {
    "O": 0,
    "B-corporation": 1,
    "I-corporation": 2,
    "B-creative-work": 3,
    "I-creative-work": 4,
    "B-group": 5,
    "I-group": 6,
    "B-location": 7,
    "I-location": 8,
    "B-person": 9,
    "I-person": 10,
    "B-product": 11,
    "I-product": 12,
}
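Typing the mappings by hand is error-prone. As an alternative sketch, both dictionaries can be derived from the list of label names. The list is written out here only to keep the example self-contained; in the notebook, the `label_list` obtained from the dataset features could be used instead, assuming it has the order shown in the dictionaries above.

```python
label_names = [
    "O",
    "B-corporation", "I-corporation",
    "B-creative-work", "I-creative-work",
    "B-group", "I-group",
    "B-location", "I-location",
    "B-person", "I-person",
    "B-product", "I-product",
]

# Index -> label, and the inverse mapping label -> index.
derived_id2label = dict(enumerate(label_names))
derived_label2id = {label: idx for idx, label in derived_id2label.items()}

print(derived_id2label[7])            # B-location
print(derived_label2id["I-product"])  # 12
```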

🚀 🚀 🚀 Now we are really ready to fine-tune... 🚀 🚀 🚀

Load the pre-trained model, configure the classification head (how many labels?), and pass the dictionaries that translate between label IDs and textual labels.

In [None]:
model = AutoModelForTokenClassification.from_pretrained(
    "distilbert-base-uncased", num_labels=len(label2id), id2label=id2label, label2id=label2id
)

Configure the Trainer. Here, we will only train for 2 epochs. In reality, you should train much longer! And with more data!

In [None]:
training_args = TrainingArguments(
    output_dir="my_awesome_wnut_model",
    learning_rate=2e-5,
    per_device_train_batch_size=16,
    per_device_eval_batch_size=16,
    num_train_epochs=2,
    weight_decay=0.01,
    evaluation_strategy="epoch",  # renamed to `eval_strategy` in recent transformers releases
    save_strategy="epoch",
    load_best_model_at_end=True,
    push_to_hub=True,
)

trainer = Trainer(
    model=model,
    args=training_args,
    train_dataset=tokenized_wnut["train"],
    eval_dataset=tokenized_wnut["validation"],  # pick the best checkpoint on validation data and keep the test split held out
    tokenizer=tokenizer,
    data_collator=data_collator,
    compute_metrics=compute_metrics,
)

Go! 🏁

In [None]:
trainer.train()

Save and publish the results on the Hugging Face ☁️

In [None]:
trainer.push_to_hub()

# Inference

## Using a Pipeline

In the following cells, we instantiate a pipeline, which bundles a checkpoint (i.e., a trained model) with a compatible tokenizer, and feed it a sample sentence.

In [None]:
text = "The Golden State Warriors are an American professional basketball team based in San Francisco."

In [None]:
ner_classifier = pipeline("ner", model="olivermueller/my_awesome_wnut_model")

In [None]:
ner_classifier(text)

## By Hand

Instead of using a pipeline, we can also compute predictions step-by-step. Reproducing each of the steps by hand ✍️ will increase your understanding of the underlying logic.

Load only the tokenizer.

In [None]:
tokenizer = AutoTokenizer.from_pretrained("olivermueller/my_awesome_wnut_model")

Tokenize the example sentence.

In [None]:
inputs = tokenizer(text, return_tensors="pt")
inputs

Load the checkpoint.

In [None]:
model = AutoModelForTokenClassification.from_pretrained("olivermueller/my_awesome_wnut_model")

Do a forward pass through the network.

In [None]:
with torch.no_grad():
    logits = model(**inputs).logits
logits

For each token, get the index of the label with the highest score.

In [None]:
predictions = torch.argmax(logits, dim=2)
predictions
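If we also want a confidence score for each prediction, the logits can be turned into probabilities with a softmax. The pure-Python sketch below does this for the logits of a single token. The values are made up, and `softmax` is a small helper written for this illustration rather than the PyTorch function.

```python
import math


def softmax(logits):
    """Convert a list of logits into probabilities that sum to 1."""
    max_logit = max(logits)  # subtract the maximum for numerical stability
    exps = [math.exp(x - max_logit) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


toy_token_logits = [0.5, 2.5, -1.0]  # made-up logits for one token and three labels
probs = softmax(toy_token_logits)

# Softmax preserves the ordering, so the arg-max is the same as for the logits.
best_label = max(range(len(probs)), key=probs.__getitem__)
print(best_label, round(probs[best_label], 3))
```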

Translate the indices to textual labels.

In [None]:
predicted_token_class = [model.config.id2label[t.item()] for t in predictions[0]]
predicted_token_class
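The list above contains one label per token, including `[CLS]`, `[SEP]`, and sub-word pieces. A possible last step is to merge the pieces back into words and keep the label of the first piece, mirroring how we labelled the training data. The sketch below shows one way to do this. `merge_subwords` is a hypothetical helper, and the tokens and labels are made up for illustration; they are not actual model output.

```python
def merge_subwords(tokens, labels, special_tokens=("[CLS]", "[SEP]", "[PAD]")):
    """Merge '##' continuation pieces into words, keeping the first piece's label."""
    words = []
    for token, label in zip(tokens, labels):
        if token in special_tokens:
            continue  # special tokens carry no entity information
        if token.startswith("##") and words:
            word, first_label = words[-1]
            words[-1] = (word + token[2:], first_label)
        else:
            words.append((token, label))
    return words


# Made-up tokenization and labels, for illustration only.
toy_tokens = ["[CLS]", "san", "francisco", "war", "##riors", "[SEP]"]
toy_token_labels = ["O", "B-location", "I-location", "B-group", "O", "O"]
print(merge_subwords(toy_tokens, toy_token_labels))
# [('san', 'B-location'), ('francisco', 'I-location'), ('warriors', 'B-group')]
```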