<a href="https://colab.research.google.com/github/nyp-sit/it3103/blob/main/week15/token_classification_v2.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# Token Classification (Named Entity Recognition)

In this practical, we will learn how to use the HuggingFace Transformers library to perform token classification.

As in the previous practical on the BERT Transformer, we will use the DistilBERT transformer, this time to classify every word (token) in a sentence.


## Install Transformers

Run the following cell to install the HuggingFace Transformers library.

In [None]:
!pip install transformers datasets

## Get the data

In this lab, we will use the CoNLL-2003 dataset. 

In [None]:
!wget https://nyp-aicourse.s3.ap-southeast-1.amazonaws.com/datasets/conll2003.zip
!unzip conll2003.zip

## Process the data 

The data file is in CoNLL format: 

```
sentence1-word1  PPTag-1-1  ChunkTag-1-1  NERTag-1-1
sentence1-word2  PPTag-1-2  ChunkTag-1-2  NERTag-1-2
sentence1-word3  PPTag-1-3  ChunkTag-1-3  NERTag-1-3
<empty line>
sentence2-word1  PPTag-2-1  ChunkTag-2-1  NERTag-2-1
sentence2-word2  PPTag-2-2  ChunkTag-2-2  NERTag-2-2
...
sentence2-wordn  PPTag-2-n  ChunkTag-2-n  NERTag-2-n
<empty line>
...
```

For example, the sentence "U.N. official Ekeus heads for Baghdad." is represented as follows in CoNLL format:

```
U.N.      NNP  I-NP  I-ORG
official  NN   I-NP  O
Ekeus     NNP  I-NP  I-PER
heads     VBZ  I-VP  O
for       IN   I-PP  O
Baghdad   NNP  I-NP  I-LOC
.         .    O     O
```

We define a function that reads the data file line by line and combines the lines that belong to one sentence into a list of words and a list of tags.

As we are only interested in the Named Entity Recognition (NER) tags, we will only extract the tags from column index 3.
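
As a minimal sketch of what has to happen for each line, the snippet below splits the example sentence shown above on whitespace and keeps only columns 0 and 3. The `sample` string is just the example copied inline for illustration, not a file from the dataset.

```python
# The example sentence from above, one token per line (inline copy for illustration)
sample = """U.N. NNP I-NP I-ORG
official NN I-NP O
Ekeus NNP I-NP I-PER
heads VBZ I-VP O
for IN I-PP O
Baghdad NNP I-NP I-LOC
. . O O"""

words, ner_tags = [], []
for line in sample.splitlines():
    columns = line.split()        # four columns per line; the NER tag is the last
    words.append(columns[0])      # column 0 holds the word
    ner_tags.append(columns[3])   # column 3 holds the NER tag

print(words)      # ['U.N.', 'official', 'Ekeus', 'heads', 'for', 'Baghdad', '.']
print(ner_tags)   # ['I-ORG', 'O', 'I-PER', 'O', 'O', 'I-LOC', 'O']
```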

In [None]:
# This function returns a 2D list of words and a 2D list of labels
# corresponding to each word.

def load_conll(filepath, delimiter=' ', word_column_index=0, label_column_index=3):
    all_texts = []
    all_tags = []

    texts = []
    tags = []

    # Opens the file.
    #
    with open(filepath, "r") as f:

        # Loops through each line 
        for line in f:

            # Split each line by its delimiter (default is a space)
            tokens = line.split(delimiter)

            # If the line is empty, treat it as the end of the
            # previous sentence, and construct a new sentence
            #
            if len(tokens) == 1:
                # Append the sentence
                # 
                all_texts.append(texts)
                all_tags.append(tags)

                # Create a new sentence
                #
                texts = []
                tags = []
            else:
                # Not yet end of the sentence, continue to add
                # words into the current sentence
                #
                thistext = tokens[word_column_index].replace('\n', '')
                thistag = tokens[label_column_index].replace('\n', '')

                texts.append(thistext)
                tags.append(thistag)

    # Insert the last sentence if it contains at least 1 word.
    #
    if len(texts) > 0:
        all_texts.append(texts)
        all_tags.append(tags)

    # Return the result to the caller
    #
    return all_texts, all_tags


We will now process our files with the function and examine the outputs.

In [None]:
train_texts, train_tags = load_conll("token_train.txt")
val_texts, val_tags = load_conll("token_test.txt")

In [None]:
print(train_texts[:3])
print(train_tags[:3])

## Tokenization

We now have our texts and labels. Before we can feed them into our model for training, we need to tokenize the texts and encode the labels in numeric form.

We first define the token labels we need and define the mapping to a numeric index.


In [None]:
# Define a list of unique token labels that we will recognize
#
label_names = ['O', 'B-PER', 'I-PER', 'B-ORG', 'I-ORG', 'B-LOC', 'I-LOC', 'B-MISC', 'I-MISC']

# Define a dictionary to map each text label to a numeric label
label2id = {label:i for i, label in enumerate(label_names)}
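
For illustration, here is a small sketch of how the mapping can be used in both directions. The `id2label` dictionary and the sample tags are assumptions added for this example only; the rest of the practical simply indexes `label_names` to go from an id back to a label.

```python
label_names = ['O', 'B-PER', 'I-PER', 'B-ORG', 'I-ORG', 'B-LOC', 'I-LOC', 'B-MISC', 'I-MISC']
label2id = {label: i for i, label in enumerate(label_names)}

# Hypothetical inverse mapping, from numeric id back to text label
id2label = {i: label for label, i in label2id.items()}

sample_tags = ['I-ORG', 'O', 'I-PER']            # made-up tag sequence
encoded = [label2id[tag] for tag in sample_tags]
decoded = [id2label[i] for i in encoded]

print(encoded)   # [4, 0, 2]
print(decoded)   # ['I-ORG', 'O', 'I-PER']
```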

We will also import the tokenizer.

In [None]:
from transformers import AutoTokenizer

model_checkpoint = 'distilbert-base-uncased'
# Initialize the DistilBERT tokenizer. A fast tokenizer is needed for word_ids.
tokenizer = AutoTokenizer.from_pretrained(model_checkpoint, use_fast=True)


Before we tokenize our texts, let's look at a problem that can arise during tokenization. Transformer models like BERT or DistilBERT use WordPiece tokenization, meaning that a single word can sometimes be split into multiple tokens (this is done to solve the out-of-vocabulary problem for rare words). For example, DistilBERT's tokenizer would split `["Nadim", "Ladki"]` into the tokens `["[CLS]", "nad", "##im", "lad", "##ki", "[SEP]"]`. This is a problem for us because we have exactly one tag per word in the original dataset. If the tokenizer splits a word into multiple sub-tokens, we end up with a mismatch between our tokens and our labels, as illustrated below:

Before tokenization with WordPiece, there is a one-to-one match between tokens and tags:

```
tokens = ["Nadim", "Ladki"]
labels = ['B-PER', 'I-PER']
```

After tokenization with WordPiece, there is no longer a one-to-one match between them:
```
tokens = ["[CLS]", "nad", "##im", "lad", "##ki", "[SEP]"]
labels = ['B-PER', 'I-PER']
```

One way to handle this is to only train on the tag labels for the first subtoken of a split token. We can do this in Transformers by setting the labels we wish to ignore to -100. We will also ignore special tokens like `[CLS]` and `[SEP]`. In the example above, if the label for 'Nadim' is 1 (index for B-PER) and 'Ladki' is 2 (index for I-PER), we would set the labels as follows: 

```
tokens = ["[CLS]", "nad", "##im", "lad", "##ki", "[SEP]"]
labels = [-100, 1, -100, 2, -100, -100]
```
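
To see why a label of `-100` removes a position from training, consider this simplified sketch of a masked cross-entropy loss. It is not the Transformers implementation; `masked_cross_entropy` and the probability values are made up for illustration, and the only behavior carried over is that positions labeled `-100` are skipped.

```python
import math

def masked_cross_entropy(probs, labels, ignore_index=-100):
    """Average negative log-likelihood over positions whose label is not ignore_index."""
    losses = [
        -math.log(token_probs[label])
        for token_probs, label in zip(probs, labels)
        if label != ignore_index
    ]
    return sum(losses) / len(losses)

# Made-up predicted probabilities over 3 classes for 4 token positions
probs = [
    [0.9, 0.05, 0.05],   # [CLS], ignored
    [0.1, 0.8, 0.1],     # first sub-token of a word, label 1
    [0.3, 0.3, 0.4],     # second sub-token of that word, ignored
    [0.2, 0.2, 0.6],     # first sub-token of the next word, label 2
]
labels = [-100, 1, -100, 2]

loss = masked_cross_entropy(probs, labels)
print(loss)
```

Only the second and fourth positions contribute to the result, so whatever the model predicts for `[CLS]` or for the later sub-tokens has no effect on the loss.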

But how do we know which tokens are sub-tokens and which are special tokens to ignore? Fortunately, the HuggingFace tokenizer provides a way to find out: `word_ids`. `word_ids` tells us which word each token comes from, as well as which tokens are special tokens (e.g. `[CLS]`).

For example, the `word_ids` for the following tokens will be: 

```
tokens = ["[CLS]", "nad", "##im", "lad", "##ki", "[SEP]"]
word_ids = [None, 0, 0, 1, 1, None]
```

`None` means the token is a special token. You can see that `"nad"` and `"##im"` both have a word id of `0`, which means both tokens come from the word at index 0, i.e. `"Nadim"`. Similarly, `"lad"` and `"##ki"` have a word id of `1`, which means both come from the second word, i.e. the word at index 1.
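
As a quick sketch of how `word_ids` can be read, the snippet below groups the sub-tokens of the example by the word they come from. The two lists are copied by hand from the example above rather than produced by a tokenizer, and `pieces_per_word` is a name chosen for this illustration.

```python
tokens = ["[CLS]", "nad", "##im", "lad", "##ki", "[SEP]"]
word_ids = [None, 0, 0, 1, 1, None]

# Collect the sub-tokens of each word, skipping special tokens (word id None)
pieces_per_word = {}
for token, word_id in zip(tokens, word_ids):
    if word_id is None:
        continue
    pieces_per_word.setdefault(word_id, []).append(token)

print(pieces_per_word)   # {0: ['nad', '##im'], 1: ['lad', '##ki']}
```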


So we can use the following logic to decide how to label each of the processed tokens (i.e. the tokens produced by the tokenizer, which consist of special tokens and sub-tokens):
- if a token has a `word_id` of `None`, set its label to `-100`.
- if the `word_id` of the token appears for the first time, i.e. it differs from the previous `word_id`, set the label of the token to the corresponding original label.
- if the `word_id` is the same as the previous `word_id`, set the label of the token to `-100`.
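
The three rules can be sketched in plain Python, without a tokenizer, by working directly on a list of word ids. `align_labels` is a hypothetical helper written for this illustration; the inputs are the hand-copied values from the "Nadim Ladki" example.

```python
def align_labels(word_ids, word_labels, ignore_index=-100):
    """Give each word's label to its first sub-token; mark everything else as ignored."""
    aligned = []
    previous_word_id = None
    for word_id in word_ids:
        if word_id is None:                  # special token
            aligned.append(ignore_index)
        elif word_id != previous_word_id:    # first sub-token of a word
            aligned.append(word_labels[word_id])
        else:                                # later sub-token of the same word
            aligned.append(ignore_index)
        previous_word_id = word_id
    return aligned

word_ids = [None, 0, 0, 1, 1, None]
word_labels = [1, 2]    # B-PER, I-PER
print(align_labels(word_ids, word_labels))   # [-100, 1, -100, 2, -100, -100]
```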

The following function `tokenize_and_align_labels()` takes in the original tags and encodes them according to the logic described above.

Note that for this version of HuggingFace, we need to supply the labels as part of the dictionary. That is why we create an additional entry `tokenized_inputs['labels']` to hold the labels.

In [None]:
# set the max sequence length and require padding
max_length=128
padding=True

def tokenize_and_align_labels(texts, all_tags):
    
    tokenized_inputs = tokenizer(
        texts,
        max_length=max_length,
        padding=padding,
        truncation=True,
        is_split_into_words=True,
    )

    labels = []

    for i, tags in enumerate(all_tags):
        word_ids = tokenized_inputs[i].word_ids
        tokens = tokenized_inputs[i].ids
        previous_word_idx = None
        label_ids = []
       
        for word_idx in word_ids:
            # Special tokens have a word id that is None. We set the label to -100 so they are automatically
            # ignored in the loss function.
            if word_idx is None:
                label_ids.append(-100)
            # We set the label for the first token of each word.
            elif word_idx != previous_word_idx:
                label_ids.append(int(label2id[tags[word_idx]]))
            # For the other tokens in a word, we set the label to -100 so that
            # only the first sub-token of each word is trained on.
            else:
                label_ids.append(-100)
                
            previous_word_idx = word_idx

        labels.append(label_ids)
        
    # Attach the aligned labels once, after all sentences have been processed
    tokenized_inputs['labels'] = labels
        
    return tokenized_inputs, labels

In [None]:
train_encodings, train_labels = tokenize_and_align_labels(train_texts, train_tags)

In [None]:
val_encodings, val_labels = tokenize_and_align_labels(val_texts, val_tags)

In [None]:
import tensorflow as tf

batch_size = 16

train_dataset = tf.data.Dataset.from_tensor_slices((
    dict(train_encodings),
    train_labels
)).batch(batch_size)

val_dataset = tf.data.Dataset.from_tensor_slices((
    dict(val_encodings),
    val_labels
)).batch(batch_size)

Run the cell below to print the first few samples of the train dataset and check that they look right.

In [None]:
# The dataset is batched, so unbatch it to look at individual samples
for i, sample in enumerate(train_dataset.unbatch().take(3)):
    print(train_texts[i])
    print(sample)
    print("---")

## Train our Token Classification Model

We will now load the pretrained model and configure the required token labels for the model.


In [None]:
from transformers import TFAutoModelForTokenClassification, AutoConfig

config = AutoConfig.from_pretrained(model_checkpoint, num_labels=len(label_names))

model = TFAutoModelForTokenClassification.from_pretrained(
    model_checkpoint, 
    config=config
)

Let’s double-check that our model has the right number of labels:

In [None]:
model.config.num_labels

In [None]:
from transformers import create_optimizer

num_epochs = 1
num_train_steps = len(train_dataset) * num_epochs

optimizer, schedule = create_optimizer(
    init_lr=2e-5,
    num_warmup_steps=0,
    num_train_steps=num_train_steps,
    weight_decay_rate=0.01,
)

def dummy_loss(y_true, y_pred):
    return tf.reduce_mean(y_pred)

losses = {"loss": dummy_loss}
model.compile(loss=losses,optimizer=optimizer)

HuggingFace models can compute the loss internally: if you compile without a loss and supply your labels in the input dictionary (as we do in our datasets), the model trains using that internal loss, which is appropriate for the task and model type you have chosen. This worked with earlier versions of TensorFlow, but TensorFlow 2.7 seems to require a loss function to be supplied. So in the cell above we create a dummy loss function as a workaround; it simply averages the `loss` output that the model has already computed.

In [None]:
model.fit(
    train_dataset,
    validation_data=val_dataset,
    epochs=num_epochs
)

## Evaluation

The traditional framework used to evaluate token classification prediction is seqeval. To use this metric, we first need to install the seqeval library:

In [None]:
!pip install seqeval

In [None]:
from datasets import load_metric

metric = load_metric('seqeval')

In the following code, we use our model to predict on `val_dataset` in batches. Each batch of the tf dataset has two parts: the first contains the `input_ids`, `attention_mask`, and `labels`, while the second is the target label. We only use the first part for prediction, i.e. `batch[0]`.

While looping through the predicted labels for each token, we ignore the positions that are labeled `-100`.

In [None]:
import numpy as np

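all_predictions = []
all_labels = []

for batch in val_dataset:
    logits = model.predict(batch[0])["logits"]
    labels = batch[0]["labels"].numpy()
    predictions = np.argmax(logits, axis=-1)
    for prediction, label in zip(predictions, labels):
        # Keep one list of tag names per sentence so that entities
        # are never merged across sentence boundaries
        sentence_predictions = []
        sentence_labels = []
        for predicted_idx, label_idx in zip(prediction, label):
            if label_idx == -100:
                continue
            sentence_predictions.append(label_names[predicted_idx])
            sentence_labels.append(label_names[label_idx])
        all_predictions.append(sentence_predictions)
        all_labels.append(sentence_labels)

metric.compute(predictions=all_predictions, references=all_labels)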
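
seqeval scores whole entities rather than individual tokens. The sketch below illustrates that idea in plain Python. It is a simplified stand-in, not seqeval's implementation; `extract_entities` and the two tag sequences are made up for this example.

```python
def extract_entities(tags):
    """Return the set of (type, start, end) spans in a tag sequence (end is exclusive)."""
    entities = set()
    start, current_type = None, None
    for i, tag in enumerate(tags + ["O"]):   # trailing "O" closes an entity at the end
        prefix, _, tag_type = tag.partition("-")
        # A B- tag, or an I- tag of a different type, opens a new entity
        starts_new = prefix == "B" or (prefix == "I" and tag_type != current_type)
        if current_type is not None and (prefix == "O" or starts_new):
            entities.add((current_type, start, i))
            start, current_type = None, None
        if starts_new:
            start, current_type = i, tag_type
    return entities

reference = ["I-ORG", "O", "I-PER", "O", "O", "I-LOC", "O"]
predicted = ["I-ORG", "O", "I-PER", "O", "O", "O", "O"]   # misses the location

true_entities = extract_entities(reference)
pred_entities = extract_entities(predicted)
correct = len(true_entities & pred_entities)

precision = correct / len(pred_entities)   # 2 of 2 predicted entities are right
recall = correct / len(true_entities)      # 2 of 3 true entities are found
f1 = 2 * precision * recall / (precision + recall)
print(precision, recall, f1)
```

An entity counts as correct only if both its type and its boundaries match, which is why a single wrong token inside a multi-word entity costs the whole entity.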

In [None]:
model.save_pretrained('my_tokenmodel')

## Test on your sentence


In [None]:
def infer_tokens(text, model, tokenizer):
    # here we assume the text has not yet been split into individual words
    text = text.split()
    
    encodings = tokenizer(
        [text],
        padding=True,
        truncation=True,
        is_split_into_words=True,
        return_tensors='tf')
    
    logits = model(encodings)[0]  # the first element of the model output holds the logits
    preds = np.argmax(logits, axis=-1)[0]  # we passed a single sentence, so take index 0

    # as the prediction is on individual tokens, including subtokens, 
    # we need to group subtokens belonging to the same word together
    # again, we use the word_ids to help us here
    previous_word_idx = None
    word_ids = encodings[0].word_ids
    labels = []
    for i, word_idx in enumerate(word_ids):
        # we check if the word_id different from previous one, then it is a new word
        # we also need to check if the word_id is not None so that we won't include it
        if word_idx != previous_word_idx and word_idx is not None:
            labels.append(label_names[preds[i]])
        # update the previous_word_idx to current word_id
        previous_word_idx = word_idx

    return text, labels

In [None]:
tokenizer = AutoTokenizer.from_pretrained('distilbert-base-uncased')
model = TFAutoModelForTokenClassification.from_pretrained('my_tokenmodel')

In [None]:
sample_text = 'Ashish Vaswani has developed the transformer architecture during his time at Google.'
infer_tokens(sample_text, model, tokenizer)
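
`infer_tokens()` returns two parallel lists, one of words and one of labels. As a sketch of one way to post-process that output, the snippet below merges consecutive words that share an entity type into a single span. `group_entities` is a hypothetical helper, and the `labels` list is hand-written for illustration, not actual model output.

```python
def group_entities(words, labels):
    """Merge consecutive words that share an entity type into (text, type) pairs."""
    entities = []
    current_words, current_type = [], None
    for word, label in zip(words, labels):
        # Strip the B-/I- prefix; "O" has no entity type
        entity_type = label.split("-", 1)[1] if "-" in label else None
        starts_new = label.startswith("B-") or entity_type != current_type
        if current_words and starts_new:
            entities.append((" ".join(current_words), current_type))
            current_words = []
        if entity_type is not None:
            current_words.append(word)
        current_type = entity_type
    if current_words:
        entities.append((" ".join(current_words), current_type))
    return entities

words = ["Ashish", "Vaswani", "has", "developed", "the", "transformer",
         "architecture", "during", "his", "time", "at", "Google."]
# Hand-written labels for illustration only
labels = ["B-PER", "I-PER", "O", "O", "O", "O", "O", "O", "O", "O", "O", "B-ORG"]

print(group_entities(words, labels))
# [('Ashish Vaswani', 'PER'), ('Google.', 'ORG')]
```

Because the text is split on whitespace only, punctuation stays attached to the neighbouring word, as in `'Google.'` here.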