<a href="https://colab.research.google.com/github/nyp-sit/it3103/blob/main/week14/token_classification.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# Token Classification (Named Entity Recognition)

In this practical we will learn how to use the HuggingFace Transformers library to perform token classification.

Just as in Practical 3a, we will use the DistilBERT transformer architecture, which also allows us to classify every word in a sentence.

#### **NOTE: Be sure to set your runtime to a GPU instance!**

## Install Transformers

Run the following cell to install the HuggingFace Transformers library.

In [None]:
!pip install transformers

## Get the data

In [None]:
!wget https://nyp-aicourse.s3.ap-southeast-1.amazonaws.com/datasets/token_train.txt
!wget https://nyp-aicourse.s3.ap-southeast-1.amazonaws.com/datasets/token_test.txt

## Process the data 

The data file is in CoNLL format: 

```
sentence1-word1  PPTag-1-1  ChunkTag-1-1  NERTag-1-1
sentence1-word2  PPTag-1-2  ChunkTag-1-2  NERTag-1-2
sentence1-word3  PPTag-1-3  ChunkTag-1-3  NERTag-1-3
<empty line>
sentence2-word1  PPTag-2-1  ChunkTag-2-1  NERTag-2-1
sentence2-word2  PPTag-2-2  ChunkTag-2-2  NERTag-2-2
...
sentence2-wordn  PPTag-2-n  ChunkTag-2-n  NERTag-2-n
<empty line>
...
```

For example, the sentence "U.N. official Ekeus heads for Baghdad." is represented as follows in CoNLL format:

```
U.N.      NNP  I-NP  I-ORG
official  NN   I-NP  O
Ekeus     NNP  I-NP  I-PER
heads     VBZ  I-VP  O
for       IN   I-PP  O
Baghdad   NNP  I-NP  I-LOC
.         .    O     O
```
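
To make the column layout concrete, here is a minimal sketch that splits the example sentence above into its columns and keeps only the word and the NER tag. The variable names (`sample`, `words`, `ner_tags`) are illustrative, and the sketch splits on any run of whitespace because the example above is padded for alignment.

```python
# The example sentence from above, one token per line.
sample = """\
U.N.      NNP  I-NP  I-ORG
official  NN   I-NP  O
Ekeus     NNP  I-NP  I-PER
heads     VBZ  I-VP  O
for       IN   I-PP  O
Baghdad   NNP  I-NP  I-LOC
.         .    O     O
"""

words, ner_tags = [], []
for line in sample.splitlines():
    columns = line.split()        # split on any run of whitespace
    if not columns:               # skip blank lines
        continue
    words.append(columns[0])      # column 0: the word
    ner_tags.append(columns[3])   # column 3: the NER tag

print(words)
print(ner_tags)
```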



We define a function that reads the data file line by line and combines the lines that belong to one sentence into a list of words and a list of tags.

As we are only interested in the Named Entity Recognition (NER) tags, we will only extract tags from column index 3 (the fourth column).

In [None]:
# This function returns a 2D list of words and a 2D list of labels
# corresponding to each word.

def load_conll(filepath, delimiter=' ', word_column_index=0, label_column_index=3):
    all_texts = []
    all_tags = []

    texts = []
    tags = []

    # Opens the file.
    #
    with open(filepath, "r") as f:

        # Loops through each line 
        for line in f:

            # Split each line by its delimiter (default is a space)
            tokens = line.split(delimiter)

            # If the line is empty, treat it as the end of the
            # previous sentence, and construct a new sentence
            #
            if len(tokens) == 1:
                # Append the sentence
                # 
                all_texts.append(texts)
                all_tags.append(tags)

                # Create a new sentence
                #
                texts = []
                tags = []
            else:
                # Not yet end of the sentence, continue to add
                # words into the current sentence
                #
                thistext = tokens[word_column_index].replace('\n', '')
                thistag = tokens[label_column_index].replace('\n', '')

                texts.append(thistext)
                tags.append(thistag)

    # Insert the last sentence if it contains at least 1 word.
    #
    if len(texts) > 0:
        all_texts.append(texts)
        all_tags.append(tags)

    # Return the result to the caller
    #
    return all_texts, all_tags


We will now process our files with the function and examine the outputs.

In [None]:
train_texts, train_tags = load_conll("token_train.txt")
val_texts, val_tags = load_conll("token_test.txt")

In [None]:
print(train_texts[:3])
print(train_tags[:3])

## Tokenization

Now we have our texts and labels. Before we can feed them into our model for training, we need to tokenize the texts and encode the labels in numeric form.

We first define the token labels we need and define the mapping to a numeric index.


In [None]:
# Define a list of unique token labels that we will recognize
#
token_labels = ['O', 'B-PER', 'I-PER', 'B-ORG', 'I-ORG', 'B-LOC', 'I-LOC', 'B-MISC', 'I-MISC']

# Create a reverse-mapping dictionary of the label -> index.
#
token_labels_id_by_label = {tag: id for id, tag in enumerate(token_labels)}
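
As a quick illustration of how this mapping is used, the sketch below converts a sequence of tag strings to indices and back. The inverse dictionary `token_labels_by_id` and the sample tags are assumptions made for this example; they are not used elsewhere in the practical.

```python
token_labels = ['O', 'B-PER', 'I-PER', 'B-ORG', 'I-ORG',
                'B-LOC', 'I-LOC', 'B-MISC', 'I-MISC']

# label -> index, as defined above
token_labels_id_by_label = {tag: idx for idx, tag in enumerate(token_labels)}

# index -> label (hypothetical helper, handy for decoding predictions)
token_labels_by_id = {idx: tag for tag, idx in token_labels_id_by_label.items()}

sample_tags = ['I-ORG', 'O', 'I-PER', 'O', 'O', 'I-LOC', 'O']
sample_ids = [token_labels_id_by_label[tag] for tag in sample_tags]
print(sample_ids)    # [4, 0, 2, 0, 0, 6, 0]

# Decoding the indices recovers the original tags.
decoded = [token_labels_by_id[i] for i in sample_ids]
print(decoded == sample_tags)    # True
```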



We will now tokenize our text. Let's look at a problem that can arise during tokenization. Transformer models like BERT use WordPiece tokenization, meaning that a single word may be split into multiple tokens (this is done to solve the out-of-vocabulary problem for rare words). For example, DistilBERT's tokenizer would split `["Nadim", "Ladki"]` into the tokens `["[CLS]", "nad", "##im", "lad", "##ki", "[SEP]"]`. This is a problem for us because we have exactly one tag per word. If the tokenizer splits a word into multiple sub-tokens, we end up with a mismatch between our tokens and our labels.

Before tokenization with WordPiece, there is a one-to-one match between tokens and tags:

```
tokens = ["Nadim", "Ladki"]
labels = ['B-PER', 'I-PER']
```

After tokenization with WordPiece, there is no longer a one-to-one match between them:
```
tokens = ["[CLS]", "nad", "##im", "lad", "##ki", "[SEP]"]
labels = ['B-PER', 'I-PER']
```

One way to handle this is to train only on the tag label of the first sub-token of each split word. We can do this in Transformers by setting the labels we wish to ignore to -100. We will also ignore special tokens like `[CLS]` and `[SEP]`. In the example above, if the label for 'Nadim' is 1 (the index for B-PER) and the label for 'Ladki' is 2 (the index for I-PER), we would set the labels as follows:

```
tokens = ["[CLS]", "nad", "##im", "lad", "##ki", "[SEP]"]
labels = [-100, 1, -100, 2, -100, -100]
```

But how do we know which tokens to ignore? This is where we use the `offset_mapping` from the tokenizer. For each sub-token returned by the tokenizer, the offset mapping gives us a tuple indicating the sub-token's start position and end position relative to the original word it was split from.

For example, in the original word 'nadim', the sub-token `##im` starts at position 3 (i.e. `i`, counting from 0) and ends just before position 5 (i.e. right after `m`), so the offset mapping for `##im` is `(3, 5)`. Also, you can see that special tokens like `[CLS]` have an offset mapping of `(0, 0)`.

```
tokens = ["[CLS]", "nad", "##im", "lad", "##ki", "[SEP]"]
offset_mappings = [(0, 0), (0, 3), (3, 5), (0, 3), (3, 5), (0, 0)]
```

That means that if the first position in the tuple is anything other than 0, we will set its corresponding label to -100. While we’re at it, we can also set labels to -100 if the second position of the offset mapping is 0, since this means it must be a special token like `[SEP]` or `[CLS]`.
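
The rule can be sketched in plain Python using the worked example above. The function name `align_labels` is hypothetical and only illustrates the logic; it assumes the offsets are relative to each word, as in the example.

```python
IGNORE_INDEX = -100  # label value we use to mark tokens to ignore, as described above

def align_labels(word_labels, offsets):
    """Give each word's label to its first sub-token; ignore everything else."""
    labels_iter = iter(word_labels)
    aligned = []
    for start, end in offsets:
        if start == 0 and end != 0:
            # First sub-token of a word: take the next word-level label.
            aligned.append(next(labels_iter))
        else:
            # Continuation sub-token (start != 0) or special token (end == 0).
            aligned.append(IGNORE_INDEX)
    return aligned

# tokens = ["[CLS]", "nad", "##im", "lad", "##ki", "[SEP]"]
offsets = [(0, 0), (0, 3), (3, 5), (0, 3), (3, 5), (0, 0)]
word_labels = [1, 2]  # B-PER, I-PER

print(align_labels(word_labels, offsets))  # [-100, 1, -100, 2, -100, -100]
```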

The following function `encode_tags()` takes in the original tags and encodes them according to the logic described above.

In [None]:
import numpy as np

def encode_tags(tags, encodings):
    labels = [[token_labels_id_by_label[tag] for tag in doc] for doc in tags]
    encoded_labels = []
    for doc_labels, doc_offset in zip(labels, encodings.offset_mapping):
        # create an array filled with -100 (the "ignore" label)
        doc_enc_labels = np.full(len(doc_offset), -100, dtype=int)
        arr_offset = np.array(doc_offset)

        # set labels whose first offset position is 0 and the second is not 0
        doc_enc_labels[(arr_offset[:,0] == 0) & (arr_offset[:,1] != 0)] = doc_labels
        encoded_labels.append(doc_enc_labels.tolist())

    return encoded_labels



In [None]:
from transformers import AutoTokenizer

# Initialize the DistilBERT tokenizer.
tokenizer = AutoTokenizer.from_pretrained('distilbert-base-uncased')

# define a reverse lookup table for mapping id to corresponding word
index2word = { value: key for key, value in tokenizer.get_vocab().items() }

In [None]:
train_encodings = tokenizer(train_texts, 
                            is_split_into_words=True, 
                            return_offsets_mapping=True, 
                            padding=True, 
                            truncation=True)
val_encodings = tokenizer(val_texts, 
                          is_split_into_words=True, 
                          return_offsets_mapping=True, 
                          padding=True, truncation=True)

Let's examine the encoding of one sample. Since we set `return_offsets_mapping` to `True`, we will see the offset_mapping in the output.

In [None]:
print(train_encodings.keys())

In [None]:
for i in range(5):
  print([index2word[id] for id in train_encodings.input_ids[i] if id != 0])
  print(train_encodings.offset_mapping[i])

Now we will go ahead and encode our tag labels.

In [None]:
train_labels = encode_tags(train_tags, train_encodings)
val_labels = encode_tags(val_tags, val_encodings)

Now we are ready to create our datasets for training and evaluating our models. Before that we need to remove offset_mapping from the encodings as it is not needed by our model. 

In [None]:
import tensorflow as tf

train_encodings.pop("offset_mapping") # we don't want to pass this to the model
val_encodings.pop("offset_mapping")

train_dataset = tf.data.Dataset.from_tensor_slices((
    dict(train_encodings),
    train_labels
))
val_dataset = tf.data.Dataset.from_tensor_slices((
    dict(val_encodings),
    val_labels
))

Run the following cell to print the first few samples of the train dataset and check that they look all right.

In [None]:
iterator = iter(train_dataset)

for i in range(3):
    print (train_texts[i])
    print(iterator.get_next())
    print ("---")
    


## Train our Token Classification Model

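We will now set up the training configuration. Note that `TFTrainer` and `TFTrainingArguments` were deprecated in later releases of Transformers in favour of native Keras training, so this cell may require an older version of the library.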


In [None]:
from transformers import (
    TFAutoModelForTokenClassification, 
    TFTrainer, 
    TFTrainingArguments
)

from transformers.utils import logging as hf_logging

# Set the logging level to info and use the default log handler and log formatting
hf_logging.set_verbosity_info()
hf_logging.enable_default_handler()
hf_logging.enable_explicit_format()

training_args = TFTrainingArguments(
    output_dir='./results',          # output directory
    num_train_epochs=2,              # total number of training epochs
    per_device_train_batch_size=16,  # batch size per device during training
    per_device_eval_batch_size=64,   # batch size for evaluation
    warmup_steps=500,                # number of warmup steps for learning rate scheduler
    weight_decay=0.01,               # strength of weight decay
    logging_dir='./logs',            # directory for storing logs
    logging_steps=10,
)

with training_args.strategy.scope():
    token_model = TFAutoModelForTokenClassification.from_pretrained('distilbert-base-uncased', 
                                                              num_labels=len(token_labels))

trainer = TFTrainer(
    model=token_model,                   # the instantiated Token Classification Transformers model to be trained
    args=training_args,                  # training arguments, defined above
    train_dataset=train_dataset,         # training dataset
    eval_dataset=val_dataset             # evaluation dataset
)
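
The `warmup_steps` argument is easier to reason about with numbers. The sketch below assumes a linear warmup from zero followed by a linear decay to zero, which is a common schedule for fine-tuning Transformers; the schedule the trainer actually applies may differ. The peak learning rate of `5e-5` and the total of 2,000 steps are illustrative values, not taken from this practical.

```python
def learning_rate_at(step, peak_lr=5e-5, warmup_steps=500, total_steps=2000):
    """Linear warmup to peak_lr, then linear decay to zero (assumed schedule)."""
    if step < warmup_steps:
        # Warmup phase: scale the learning rate up from 0 to peak_lr.
        return peak_lr * step / warmup_steps
    # Decay phase: scale the learning rate down from peak_lr to 0.
    remaining = max(total_steps - step, 0)
    return peak_lr * remaining / (total_steps - warmup_steps)

for step in (0, 250, 500, 1250, 2000):
    print(step, learning_rate_at(step))
```

Under this assumption, the learning rate only reaches its peak after 500 updates, so a very short training run spends a large share of its steps in the warmup phase.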


In [None]:
trainer.train()

In [None]:
token_model.save_pretrained('./my_tokenmodel/')

In [None]:
#!zip -r my_model.zip ./my_tokenmodel

## Evaluate the Model

Run the following cells below to evaluate your model performance.

Obviously, you can only do this AFTER your training is completed. 

In [None]:
import numpy as np

from transformers import (
    AutoTokenizer,
    TFAutoModelForTokenClassification
)
                          
tokenizer = AutoTokenizer.from_pretrained('distilbert-base-uncased')


In [None]:
model = TFAutoModelForTokenClassification.from_pretrained('my_tokenmodel')

In [None]:
def infer_tokens(text):
    encodings = tokenizer([text], is_split_into_words=True, padding=True, truncation=True, return_offsets_mapping=True, return_tensors="tf")

    # Mark the first sub-token of each word: its offset starts at 0
    # and has a non-zero end. All other positions stay 0.
    label_mapping = [0] * len(encodings.offset_mapping[0])
    for i, offset in enumerate(encodings.offset_mapping[0]):
        if offset[0] == 0 and offset[1] != 0:
            label_mapping[i] = 1

    # The model does not accept offset_mapping as an input.
    encodings.pop("offset_mapping")

    # Use the token classification model loaded above to predict
    # the labels for each word.
    #
    output = model(encodings)[0]

    result = []

    for i in range(output.shape[1]):
        if label_mapping[i] == 1:
            result.append(np.argmax(output[0][i]).item())

    return result
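
The selection step at the end of `infer_tokens()` can be illustrated without a model. The sketch below uses made-up scores for three labels and a hand-written `label_mapping`; both are assumptions for illustration only.

```python
# One row of scores per sub-token; one column per label (toy values).
logits = [
    [0.1, 0.2, 0.0],   # [CLS]  (special token, ignored)
    [0.3, 2.1, 0.4],   # nad    (first sub-token of "Nadim")
    [1.5, 0.2, 0.1],   # ##im   (continuation, ignored)
    [0.2, 0.5, 1.9],   # lad    (first sub-token of "Ladki")
    [0.9, 0.1, 0.3],   # ##ki   (continuation, ignored)
    [0.4, 0.3, 0.2],   # [SEP]  (special token, ignored)
]
label_mapping = [0, 1, 0, 1, 0, 0]  # 1 marks the first sub-token of a word

def argmax(scores):
    """Index of the largest score."""
    return max(range(len(scores)), key=lambda i: scores[i])

# Keep one prediction per original word.
result = [argmax(row) for row, keep in zip(logits, label_mapping) if keep == 1]
print(result)  # [1, 2]
```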


In [None]:
from tqdm import tqdm

# This function takes in a list of sentences (texts) and passes them into the
# infer_tokens method to tokenize and predict each word's label.
# 
# It then removes the ignored (-100) entries from the encoded labels and
# returns both the actual and the predicted label indices to the caller.
#
def get_actual_pred_y(texts, labels):
    all_actual_y = []
    all_pred_y = []

    for i in tqdm(range(len(texts))):
        x = texts[i]

        actual_y = [label for label in labels[i] if label != -100]
        pred_y = infer_tokens(x)

        if (len(actual_y) == len(pred_y)):
            all_actual_y += actual_y
            all_pred_y += pred_y
        else:
            print ("Error: %d, %d, %d, %s " % (i, len(actual_y), len(pred_y), x ))

    return all_actual_y, all_pred_y

# Get the actual and predicted labels for all words in all sentences
# for both the training and the test set.
# 
#actual_y_train, pred_y_train = get_actual_pred_y(train_texts, train_labels)
actual_y_test, pred_y_test = get_actual_pred_y(val_texts, val_labels)



In [None]:
from sklearn.metrics import classification_report 

print(classification_report(actual_y_test, pred_y_test, target_names=token_labels))
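
For reference, the per-label figures in the report can be reproduced by hand. The sketch below computes precision, recall and F1 for a single label from two flat lists of label indices; the function name `label_scores` and the toy data are assumptions for illustration.

```python
def label_scores(actual, predicted, label):
    """Precision, recall and F1 for one label, from flat lists of label ids."""
    pairs = list(zip(actual, predicted))
    tp = sum(1 for a, p in pairs if a == label and p == label)  # true positives
    fp = sum(1 for a, p in pairs if a != label and p == label)  # false positives
    fn = sum(1 for a, p in pairs if a == label and p != label)  # false negatives
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = (2 * precision * recall / (precision + recall)
          if precision + recall else 0.0)
    return precision, recall, f1

actual    = [0, 2, 2, 0, 6, 0, 2]
predicted = [0, 2, 0, 0, 6, 2, 2]

print(label_scores(actual, predicted, label=2))
```

Note that these scores are computed per word, so an entity that spans several words is counted once for each of its words.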


OK, let's test it on your own text.

In [None]:
text = input()

text = text.split(" ")

print(text)
predicted = [token_labels[label] for label in infer_tokens(text)]
print(predicted)
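
The predicted tags line up one-to-one with the words, so they can be paired and grouped into entity spans. The sketch below is one simple way to do this; the helper `group_entities` and the sample prediction are hypothetical, and it treats a `B-` tag or a change of entity type as the start of a new entity.

```python
def group_entities(words, tags):
    """Merge consecutive words that share an entity type into (text, type) spans."""
    entities = []
    current_words, current_type = [], None
    for word, tag in zip(words, tags):
        prefix, _, entity_type = tag.partition('-')
        if tag == 'O':
            entity_type = None
        # Close the open span if the entity ends or a new one begins.
        if current_type is not None and (entity_type != current_type or prefix == 'B'):
            entities.append((' '.join(current_words), current_type))
            current_words, current_type = [], None
        if entity_type is not None:
            current_words.append(word)
            current_type = entity_type
    # Flush a span that runs to the end of the sentence.
    if current_type is not None:
        entities.append((' '.join(current_words), current_type))
    return entities

words = ['U.N.', 'official', 'Ekeus', 'heads', 'for', 'Baghdad', '.']
tags = ['I-ORG', 'O', 'I-PER', 'O', 'O', 'I-LOC', 'O']
print(group_entities(words, tags))
# [('U.N.', 'ORG'), ('Ekeus', 'PER'), ('Baghdad', 'LOC')]
```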