# Token Classification (Named Entity Recognition)

In this practical we will learn how to use the HuggingFace Transformers library to perform token classification.

As in Practical 3a, we will use the DistilBERT transformer architecture, which can also be used to classify every word in a sentence.

#### **NOTE: Be sure to set your runtime to a GPU instance!**

## Install Transformers

Run the following cell to install the HuggingFace Transformers library.

## Get the data

In [None]:
!wget https://nyp-aicourse.s3.ap-southeast-1.amazonaws.com/datasets/token_train.txt
!wget https://nyp-aicourse.s3.ap-southeast-1.amazonaws.com/datasets/token_test.txt

In [157]:
# This function returns a 2D list of words and a 2D list of labels
# corresponding to each word.

def load_conll(filepath, delimiter=' ', word_column_index=0, label_column_index=3):
    all_texts = []
    all_tags = []

    texts = []
    tags = []

    # Opens the file.
    #
    with open(filepath, "r") as f:

        # Loops through each line 
        for line in f:

            # Split each line by its delimiter (default is a space)
            tokens = line.split(delimiter)

            # If the line is empty, treat it as the end of the
            # previous sentence, and construct a new sentence
            #
            if len(tokens) == 1:
                # Append the sentence
                # 
                all_texts.append(texts)
                all_tags.append(tags)

                # Create a new sentence
                #
                texts = []
                tags = []
            else:
                # Not yet end of the sentence, continue to add
                # words into the current sentence
                #
                thistext = tokens[word_column_index].replace('\n', '')
                thistag = tokens[label_column_index].replace('\n', '')

                texts.append(thistext)
                tags.append(thistag)

    # Insert the last sentence if it contains at least 1 word.
    #
    if len(texts) > 0:
        all_texts.append(texts)
        all_tags.append(tags)

    # Return the result to the caller
    #
    return all_texts, all_tags


We will now process our files with the function and examine the outputs.

In [189]:
train_texts, train_tags = load_conll("token_train.txt")
val_texts, val_tags = load_conll("token_test.txt")

In [190]:
print(train_texts[:3])
print(train_tags[:3])

[['EU', 'rejects', 'German', 'call', 'to', 'boycott', 'British', 'lamb', '.'], ['Peter', 'Blackburn'], ['BRUSSELS', '1996-08-22']]
[['B-ORG', 'O', 'B-MISC', 'O', 'O', 'O', 'B-MISC', 'O', 'O'], ['B-PER', 'I-PER'], ['B-LOC', 'O']]
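For reference, the parser above assumes a CoNLL-style layout: one token per line with space-separated columns, the entity tag in the fourth column (index 3), and a blank line between sentences. The following is a minimal sketch on a made-up in-memory sample; the `sample` text and the `parse_conll_lines` helper are illustrative and not part of the practical. Unlike `load_conll`, this sketch skips empty sentences.

```python
# Illustrative sample in a CoNLL-style layout: token, POS tag, chunk tag, NER tag.
sample = """EU NNP B-NP B-ORG
rejects VBZ B-VP O
German JJ B-NP B-MISC

Peter NNP B-NP B-PER
Blackburn NNP I-NP I-PER
"""

def parse_conll_lines(lines, delimiter=' ', word_col=0, label_col=3):
    """Group (word, tag) pairs into sentences, splitting on blank lines."""
    sentences, words, tags = [], [], []
    for line in lines:
        columns = line.strip().split(delimiter)
        if len(columns) <= label_col:
            # Blank (or too short) line: close the current sentence, if any.
            if words:
                sentences.append((words, tags))
                words, tags = [], []
            continue
        words.append(columns[word_col])
        tags.append(columns[label_col])
    # Keep the last sentence if the input does not end with a blank line.
    if words:
        sentences.append((words, tags))
    return sentences

parsed = parse_conll_lines(sample.splitlines())
for words, tags in parsed:
    print(list(zip(words, tags)))
```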


## Tokenization

Now we have our texts and labels. Before we can feed the texts and labels into our model for training, we need to tokenize our texts and also encode our labels into numeric forms.

We first define the token labels we need and define the mapping to a numeric index.


In [191]:
# Define a list of unique token labels that we will recognize
#
label_names = ['O', 'B-PER', 'I-PER', 'B-ORG', 'I-ORG', 'B-LOC', 'I-LOC', 'B-MISC', 'I-MISC']

id2label = {str(i): label for i, label in enumerate(label_names)}
label2id = {v: k for k, v in id2label.items()}

In [192]:
id2label

{'0': 'O',
 '1': 'B-PER',
 '2': 'I-PER',
 '3': 'B-ORG',
 '4': 'I-ORG',
 '5': 'B-LOC',
 '6': 'I-LOC',
 '7': 'B-MISC',
 '8': 'I-MISC'}
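As a quick sanity check of these lookup tables, a tag sequence can be encoded to integers and decoded back. Note that the keys of `id2label` are strings here, so the sketch converts between `int` and `str` explicitly; `encode_tags` and `decode_ids` are illustrative helper names.

```python
label_names = ['O', 'B-PER', 'I-PER', 'B-ORG', 'I-ORG', 'B-LOC', 'I-LOC', 'B-MISC', 'I-MISC']
id2label = {str(i): label for i, label in enumerate(label_names)}
label2id = {v: k for k, v in id2label.items()}

def encode_tags(tags):
    """Map tag strings to integer class ids."""
    return [int(label2id[tag]) for tag in tags]

def decode_ids(ids):
    """Map integer class ids back to tag strings."""
    return [id2label[str(i)] for i in ids]

tags = ['B-ORG', 'O', 'B-MISC', 'O', 'O', 'O', 'B-MISC', 'O', 'O']
ids = encode_tags(tags)
print(ids)                      # [3, 0, 7, 0, 0, 0, 7, 0, 0]
print(decode_ids(ids) == tags)  # True
```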

In [194]:
from transformers import AutoTokenizer

model_checkpoint = 'distilbert-base-uncased'
# Initialize the DistilBERT tokenizer.
tokenizer = AutoTokenizer.from_pretrained(model_checkpoint, use_fast=True)

# define a reverse lookup table for mapping id to corresponding word
index2word = { value: key for key, value in tokenizer.get_vocab().items() }

In [195]:
max_length=128
padding=True

In [212]:
def tokenize_and_align_labels(texts, all_tags):
    
    tokenized_inputs = tokenizer(
        texts,
        max_length=max_length,
        padding=padding,
        truncation=True,
        is_split_into_words=True,
    )

    labels = []
    labels2 = [] 
    for i, tags in enumerate(all_tags):
        word_ids = tokenized_inputs[i].word_ids
        tokens = tokenized_inputs[i].ids
        previous_word_idx = None
        label_ids = []
        label_ids2 = []
        for j,word_idx in enumerate(word_ids):
            # Special tokens have a word id that is None. We set the label to -100 so they are automatically
            # ignored in the loss function.
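            if word_idx is None:
                label_ids.append(-100)
                # For labels2: [CLS] (id 101) and [SEP] (id 102) are ignored,
                # while [PAD] (id 0) is given label 0 ('O').
                if tokens[j] == 101:
                    label_ids2.append(-100)
                elif tokens[j] == 102:
                    label_ids2.append(-100)
                elif tokens[j] == 0:
                    label_ids2.append(0)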
            # We set the label for the first token of each word.
            elif word_idx != previous_word_idx:
                try:
                    label_ids.append(int(label2id[tags[word_idx]]))
                    label_ids2.append(int(label2id[tags[word_idx]]))
                except Exception as e:
                    print(f'error {e}')
                    print(f'labels = {labels}')
                    print('\n*****labels_len\n',len(labels))
                    print('\n**********word_idx\n',word_idx)
                    #print(word_idx)
                    break
            # For the remaining sub-tokens of a word, we set the label to -100 so that
            # only the first sub-token of each word carries a label.
            else:
                label_ids.append(-100)
                label_ids2.append(-100)
                
            previous_word_idx = word_idx

        labels.append(label_ids)
        labels2.append(label_ids2)
    # Attach the aligned labels once all sentences have been processed.
    tokenized_inputs['labels'] = labels
        
    return tokenized_inputs, labels2

In [213]:
train_encodings, train_labels = tokenize_and_align_labels(train_texts, train_tags)

In [214]:
val_encodings, val_labels = tokenize_and_align_labels(val_texts, val_tags)

In [215]:
print(train_encodings.keys())

dict_keys(['input_ids', 'attention_mask', 'labels'])


In [216]:
print(train_labels[0])

[-100, 3, 0, 7, 0, 0, 0, 7, 0, 0, -100, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0]
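The alignment rule used in `tokenize_and_align_labels` can be illustrated without a tokenizer. The sketch below works on a hand-written `word_ids` list (hypothetical values, not taken from the dataset) and labels only the first sub-token of each word; `align_labels` is an illustrative helper name.

```python
IGNORE_INDEX = -100  # label value used to mark positions that should be skipped

def align_labels(word_ids, tag_ids):
    """Label only the first sub-token of each word; ignore everything else."""
    aligned = []
    previous = None
    for word_id in word_ids:
        if word_id is None:          # special token or padding
            aligned.append(IGNORE_INDEX)
        elif word_id != previous:    # first sub-token of a new word
            aligned.append(tag_ids[word_id])
        else:                        # continuation sub-token of the same word
            aligned.append(IGNORE_INDEX)
        previous = word_id
    return aligned

# Hypothetical word_ids for 3 words where the second word splits into two sub-tokens.
word_ids = [None, 0, 1, 1, 2, None, None]
tag_ids = [1, 2, 0]  # B-PER, I-PER, O
print(align_labels(word_ids, tag_ids))  # [-100, 1, 2, -100, 0, -100, -100]
```

In the notebook's own function, the second return value (`train_labels`) marks padding positions with 0 instead, which is why the list printed above ends in a long run of zeros, while `train_encodings['labels']` keeps -100 at those positions.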


In [217]:
len(train_encodings['labels'])

14986

In [218]:
len(train_labels)

14986

In [219]:
import tensorflow as tf

batch_size = 8

train_dataset = tf.data.Dataset.from_tensor_slices((
    dict(train_encodings),
    train_labels
)).batch(batch_size)

val_dataset = tf.data.Dataset.from_tensor_slices((
    dict(val_encodings),
    val_labels
)).batch(batch_size)

Run the following cell to inspect the first batch of the training dataset and check that it looks right.

In [220]:
for x in train_dataset:
    #print(x[0].keys())
    print(x[0])
    print('*'*20)
    print(x[1])
    break

{'input_ids': <tf.Tensor: shape=(8, 128), dtype=int32, numpy=
array([[  101,  7327, 19164, ...,     0,     0,     0],
       [  101,  2848, 13934, ...,     0,     0,     0],
       [  101,  9371,  2727, ...,     0,     0,     0],
       ...,
       [  101,  1000,  2057, ...,     0,     0,     0],
       [  101,  2002,  2056, ...,     0,     0,     0],
       [  101,  2002,  2056, ...,     0,     0,     0]], dtype=int32)>, 'attention_mask': <tf.Tensor: shape=(8, 128), dtype=int32, numpy=
array([[1, 1, 1, ..., 0, 0, 0],
       [1, 1, 1, ..., 0, 0, 0],
       [1, 1, 1, ..., 0, 0, 0],
       ...,
       [1, 1, 1, ..., 0, 0, 0],
       [1, 1, 1, ..., 0, 0, 0],
       [1, 1, 1, ..., 0, 0, 0]], dtype=int32)>, 'labels': <tf.Tensor: shape=(8, 128), dtype=int32, numpy=
array([[-100,    3,    0, ..., -100, -100, -100],
       [-100,    1,    2, ..., -100, -100, -100],
       [-100,    5,    0, ..., -100, -100, -100],
       ...,
       [-100,    0,    0, ..., -100, -100, -100],
       [-100,    0

## Train our Token Classification Model

We will now load the pretrained model and configure the required token labels for the model.


In [224]:
len(label_names)

9

In [225]:
from transformers import TFAutoModelForTokenClassification, AutoConfig

config = AutoConfig.from_pretrained(model_checkpoint, num_labels=len(label_names))

model = TFAutoModelForTokenClassification.from_pretrained(
    model_checkpoint, 
    config=config
)

2022-01-05 00:30:17.650386: W tensorflow/python/util/util.cc:368] Sets are not currently considered sequences, but this may change in the future, so consider avoiding using them.
Some layers from the model checkpoint at distilbert-base-uncased were not used when initializing TFDistilBertForTokenClassification: ['vocab_projector', 'vocab_layer_norm', 'vocab_transform', 'activation_13']
- This IS expected if you are initializing TFDistilBertForTokenClassification from the checkpoint of a model trained on another task or with another architecture (e.g. initializing a BertForSequenceClassification model from a BertForPreTraining model).
- This IS NOT expected if you are initializing TFDistilBertForTokenClassification from the checkpoint of a model that you expect to be exactly identical (initializing a BertForSequenceClassification model from a BertForSequenceClassification model).
Some layers of TFDistilBertForTokenClassification were not initialized from the model checkpoint at distilber

In [226]:
len(tokenizer)

30522

In [18]:
model.resize_token_embeddings(len(tokenizer))

<tf.Variable 'tf_bert_for_token_classification/bert/embeddings/word_embeddings/weight:0' shape=(30522, 768) dtype=float32, numpy=
array([[-0.01018257, -0.06154883, -0.02649689, ..., -0.01985357,
        -0.03720997, -0.00975152],
       [-0.01170495, -0.06002603, -0.03233192, ..., -0.01681456,
        -0.04009988, -0.0106634 ],
       [-0.01975381, -0.06273633, -0.03262176, ..., -0.01650258,
        -0.04198876, -0.00323178],
       ...,
       [-0.02176224, -0.0556396 , -0.01346345, ..., -0.00432698,
        -0.0151355 , -0.02489496],
       [-0.04617237, -0.05647721, -0.00192082, ...,  0.01568751,
        -0.01387033, -0.00945213],
       [ 0.00145601, -0.08208051, -0.01597912, ..., -0.00811687,
        -0.04746607,  0.07527421]], dtype=float32)>

Let’s double-check that our model has the right number of labels:

In [227]:
model.config.num_labels

9

In [229]:
from transformers import create_optimizer

num_epochs = 3
num_train_steps = len(train_dataset) * num_epochs

optimizer, schedule = create_optimizer(
    init_lr=2e-5,
    num_warmup_steps=0,
    num_train_steps=num_train_steps,
    weight_decay_rate=0.01,
)

loss = tf.keras.losses.SparseCategoricalCrossentropy(from_logits=True)

def dummy_loss(y_true, y_pred):
    return tf.reduce_mean(y_pred)

losses = {"loss": dummy_loss}
model.compile(loss=losses,optimizer=optimizer)

#model.compile(optimizer=optimizer)
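To make the step arithmetic concrete: with 14986 training sentences and a batch size of 8, each epoch has ceil(14986 / 8) batches. The sketch below also illustrates a learning rate that decays linearly to zero with no warmup, which is an assumption about the schedule returned by `create_optimizer` with these arguments; `linear_decay_lr` is an illustrative helper, not a library function.

```python
import math

num_sentences = 14986  # len(train_labels), as printed earlier
batch_size = 8
num_epochs = 3

steps_per_epoch = math.ceil(num_sentences / batch_size)
num_train_steps = steps_per_epoch * num_epochs

def linear_decay_lr(step, init_lr=2e-5, total_steps=num_train_steps):
    """Learning rate after `step` updates, decaying linearly to zero."""
    remaining = max(0.0, 1.0 - step / total_steps)
    return init_lr * remaining

print(steps_per_epoch, num_train_steps)  # 1874 5622
for step in (0, num_train_steps // 2, num_train_steps):
    print(step, linear_decay_lr(step))
```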

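Note that we do not pass a task-specific loss to compile(). The models can compute the loss internally: when the labels are supplied in the input dictionary (as we do in our datasets), the model returns a loss that is appropriate for the task and model type you have chosen. The `dummy_loss` function above simply averages that returned loss so that Keras can minimise it; the `loss` object defined in the same cell is not used. In newer versions of Transformers, compiling without a loss argument (the commented-out line) should have the same effect.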
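The internal loss for token classification is a cross-entropy over the label logits that skips positions labelled -100. The sketch below illustrates that idea in plain Python for a single sequence; it is a simplified illustration under that assumption, not the library's implementation, and `masked_cross_entropy` is a hypothetical helper.

```python
import math

def masked_cross_entropy(logits, labels, ignore_index=-100):
    """Mean cross-entropy over positions whose label is not `ignore_index`."""
    total, count = 0.0, 0
    for token_logits, label in zip(logits, labels):
        if label == ignore_index:
            continue
        # Numerically stable log-sum-exp of the logits for this token.
        peak = max(token_logits)
        log_norm = peak + math.log(sum(math.exp(x - peak) for x in token_logits))
        # Negative log-probability of the true class.
        total += log_norm - token_logits[label]
        count += 1
    return total / count if count else 0.0

# Three tokens, three classes; the first token is a special token and is skipped.
logits = [[5.0, 0.0, 0.0], [0.0, 2.0, 0.0], [1.0, 1.0, 1.0]]
labels = [-100, 1, 2]
loss_value = masked_cross_entropy(logits, labels)
print(loss_value)
```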

In [231]:
model.fit(
    train_dataset,
    validation_data=val_dataset,
    epochs=num_epochs
)

Epoch 1/3

KeyboardInterrupt: 

In [239]:
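# Note: newer versions of the `datasets` library no longer provide load_metric;
# the separate `evaluate` library offers evaluate.load('seqeval') instead.
from datasets import load_metric

metric = load_metric('seqeval')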

In [258]:
import numpy as np

all_predictions = []
all_labels = []

count = 0
for batch in val_dataset:
    logits = model.predict(batch[0])["logits"]
    labels = batch[0]["labels"]
    predictions = np.argmax(logits, axis=-1)
    for prediction, label in zip(predictions, labels):
        for predicted_idx, label_idx in zip(prediction, label):
            if label_idx == -100:
                continue
            all_predictions.append(label_names[predicted_idx])
            all_labels.append(label_names[label_idx])
    # Only the first four batches are evaluated, to keep this check quick.
    count = count + 1
    if count > 3:
        break
        
metric.compute(predictions=[all_predictions], references=[all_labels])

{'LOC': {'precision': 0.9259259259259259,
  'recall': 0.8333333333333334,
  'f1': 0.8771929824561403,
  'number': 30},
 'MISC': {'precision': 0.9285714285714286,
  'recall': 0.9285714285714286,
  'f1': 0.9285714285714286,
  'number': 14},
 'ORG': {'precision': 0.14285714285714285,
  'recall': 0.5,
  'f1': 0.22222222222222224,
  'number': 2},
 'PER': {'precision': 0.9545454545454546,
  'recall': 0.9545454545454546,
  'f1': 0.9545454545454546,
  'number': 22},
 'overall_precision': 0.8571428571428571,
 'overall_recall': 0.8823529411764706,
 'overall_f1': 0.8695652173913043,
 'overall_accuracy': 0.9763113367174281}
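These scores are computed per entity rather than per token: a prediction counts as correct only if both the span and the entity type match. The following is a simplified sketch of that idea. It assumes well-formed B-/I- tags and is not seqeval's actual implementation; `extract_entities` and `entity_f1` are illustrative names.

```python
def extract_entities(tags):
    """Return a set of (start, end, type) spans from a BIO tag sequence."""
    entities, start, kind = set(), None, None
    for i, tag in enumerate(tags + ['O']):  # the sentinel closes a trailing entity
        inside = tag.startswith('I-') and kind == tag[2:]
        if start is not None and not inside:
            entities.add((start, i, kind))
            start, kind = None, None
        if tag.startswith('B-'):
            start, kind = i, tag[2:]
    return entities

def entity_f1(true_tags, pred_tags):
    """Entity-level precision, recall and F1 for one tag sequence."""
    gold, pred = extract_entities(true_tags), extract_entities(pred_tags)
    correct = len(gold & pred)
    precision = correct / len(pred) if pred else 0.0
    recall = correct / len(gold) if gold else 0.0
    f1 = 2 * precision * recall / (precision + recall) if correct else 0.0
    return precision, recall, f1

gold = ['B-PER', 'I-PER', 'O', 'B-LOC', 'O']
pred = ['B-PER', 'I-PER', 'O', 'B-ORG', 'O']
print(entity_f1(gold, pred))  # (0.5, 0.5, 0.5)
```

Here the person span is matched exactly, while the second span has the right boundaries but the wrong type, so it counts as both a false positive and a false negative.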

In [256]:
for batch in val_dataset:
    print(batch[0]['labels'])
    break

tf.Tensor(
[[-100    0    0 ... -100 -100 -100]
 [-100    1 -100 ... -100 -100 -100]
 [-100    5 -100 ... -100 -100 -100]
 ...
 [-100    5    0 ... -100 -100 -100]
 [-100    1    2 ... -100 -100 -100]
 [-100    0    0 ... -100 -100 -100]], shape=(8, 128), dtype=int32)


In [235]:
model.save_pretrained('./my_tokenmodel/')

In [None]:
print(len(label_names))

In [316]:
# A custom sentence must be pre-split into words, for example:
# sample_text = 'Lee Kuan Yew lived in Oxley Road'.split()
sample_text = val_texts[10]
encodings = tokenizer(
    [sample_text],
    max_length=max_length,
    padding=padding,
    truncation=True,
    is_split_into_words=True,
    return_tensors='tf',
)


In [317]:
encodings[0].word_ids

[None,
 0,
 0,
 0,
 1,
 1,
 2,
 3,
 4,
 5,
 6,
 7,
 7,
 8,
 9,
 10,
 11,
 12,
 13,
 14,
 14,
 15,
 15,
 15,
 16,
 17,
 18,
 19,
 20,
 21,
 22,
 23,
 24,
 24,
 25,
 26,
 27,
 28,
 29,
 30,
 31,
 32,
 33,
 34,
 35,
 36,
 37,
 None]

In [318]:
# The first element of the model output holds the logits.
outputs = model(encodings)
predictions = np.argmax(outputs[0], axis=-1)

In [319]:
print(predictions)
pred_label = []
for i in predictions[0]:
    pred_label.append(label_names[i])
print(pred_label)

[[0 1 1 1 2 2 0 0 0 0 0 0 0 0 0 0 0 0 0 1 1 2 2 2 0 0 0 7 0 0 0 1 2 2 0 0
  0 0 0 0 0 0 0 0 0 0 0 0]]
['O', 'B-PER', 'B-PER', 'B-PER', 'I-PER', 'I-PER', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'B-PER', 'B-PER', 'I-PER', 'I-PER', 'I-PER', 'O', 'O', 'O', 'B-MISC', 'O', 'O', 'O', 'B-PER', 'I-PER', 'I-PER', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O']


In [320]:
preds = predictions[0]
previous_word_idx = None
word_ids = encodings[0].word_ids
labels = []
for i, word_idx in enumerate(word_ids):
    # Skip special tokens (word id None) and keep only the prediction for the
    # first sub-token of each word, so that we end up with one label per word.
    if word_idx is not None and word_idx != previous_word_idx:
        labels.append(label_names[preds[i]])
    previous_word_idx = word_idx

In [322]:
print(sample_text)
print(labels)

['Takuya', 'Takagi', 'scored', 'the', 'winner', 'in', 'the', '88th', 'minute', ',', 'rising', 'to', 'head', 'a', 'Hiroshige', 'Yanagimoto', 'cross', 'towards', 'the', 'Syrian', 'goal', 'which', 'goalkeeper', 'Salem', 'Bitar', 'appeared', 'to', 'have', 'covered', 'but', 'then', 'allowed', 'to', 'slip', 'into', 'the', 'net', '.']
['B-PER', 'I-PER', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'B-PER', 'I-PER', 'O', 'O', 'O', 'B-MISC', 'O', 'O', 'O', 'B-PER', 'I-PER', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O']
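Word-level labels like these are often easier to read once consecutive B-/I- tags are merged into entity phrases. The sketch below shows one way to do that on a slice of the sentence above; `group_entities` is an illustrative helper and not part of the Transformers library.

```python
def group_entities(words, labels):
    """Merge consecutive B-/I- labelled words into (phrase, type) pairs."""
    entities = []
    current_words, current_type = [], None
    for word, label in zip(words, labels):
        if label.startswith('I-') and current_type == label[2:]:
            current_words.append(word)  # continue the open entity
            continue
        if current_words:               # close the open entity
            entities.append((' '.join(current_words), current_type))
            current_words, current_type = [], None
        if label != 'O':                # B-X (or a stray I-X) opens a new entity
            current_words, current_type = [word], label[2:]
    if current_words:
        entities.append((' '.join(current_words), current_type))
    return entities

words = ['Hiroshige', 'Yanagimoto', 'cross', 'towards', 'the', 'Syrian', 'goal']
labels = ['B-PER', 'I-PER', 'O', 'O', 'O', 'B-MISC', 'O']
print(group_entities(words, labels))
# [('Hiroshige Yanagimoto', 'PER'), ('Syrian', 'MISC')]
```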


## Evaluate the Model

Run the following cells to evaluate your model's performance.

You can only do this after training has completed.

In [None]:
import numpy as np

from transformers import (
    AutoTokenizer,
    TFAutoModelForTokenClassification
)
                          
tokenizer = AutoTokenizer.from_pretrained('distilbert-base-uncased')


In [None]:
model = TFAutoModelForTokenClassification.from_pretrained('my_tokenmodel')

In [None]:
def infer_tokens(text):
    encodings = tokenizer([text], is_split_into_words=True, padding=True, truncation=True, return_offsets_mapping=True, return_tensors="tf")

    # Mark the first sub-token of each word: its offset starts at 0 and has a
    # non-zero end, whereas special tokens have the offset (0, 0).
    label_mapping = [0] * len(encodings.offset_mapping[0])
    for i, offset in enumerate(encodings.offset_mapping[0]):
        if offset[0] == 0 and offset[1] != 0:
            label_mapping[i] = 1

    encodings.pop("offset_mapping")
    #encodings = encodings.to("cuda")

    # Use the token classification model to predict the labels
    # for each word.
    #
    output = model(encodings)[0]

    result = []

    for i in range(output.shape[1]):
        if label_mapping[i] == 1:
            result.append(np.argmax(output[0][i]).item())

    return result
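The selection rule inside `infer_tokens` relies on the offsets being reported relative to each word when `is_split_into_words=True`, so the first sub-token of a word starts at 0. A tokenizer-free sketch of that rule, using hypothetical offsets and an illustrative `first_subtoken_mask` helper:

```python
def first_subtoken_mask(offsets):
    """Return 1 for the first sub-token of each word, 0 otherwise."""
    return [1 if start == 0 and end != 0 else 0 for start, end in offsets]

# Hypothetical offsets: [CLS], a word split into two pieces, a one-piece word, [SEP].
offsets = [(0, 0), (0, 3), (3, 6), (0, 6), (0, 0)]
mask = first_subtoken_mask(offsets)
print(mask)  # [0, 1, 0, 1, 0]

# Keep only the predictions at the positions selected by the mask.
token_predictions = [0, 1, 1, 0, 0]
word_predictions = [p for p, keep in zip(token_predictions, mask) if keep]
print(word_predictions)  # [1, 0]
```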


In [None]:
from tqdm import tqdm

# This function takes in a list of sentences (texts) and passes them into the
# infer_tokens method to tokenize and predict each word's label.
# 
# It will then convert the list of labels into their numeric index, and
# return both actual label and predicted label to the caller.
#
def get_actual_pred_y(texts, labels):
    all_actual_y = []
    all_pred_y = []

    for i in tqdm(range(len(texts))):
        x = texts[i]

        actual_y = [label for label in labels[i] if label != -100]
        pred_y = infer_tokens(x)

        if (len(actual_y) == len(pred_y)):
            all_actual_y += actual_y
            all_pred_y += pred_y
        else:
            print ("Error: %d, %d, %d, %s " % (i, len(actual_y), len(pred_y), x ))

    return all_actual_y, all_pred_y

# Get the actual and predicted labels for all words in all sentences
# of the test set. We use val_encodings['labels'] here because it marks
# padding with -100, whereas val_labels marks padding with 0.
#
#actual_y_train, pred_y_train = get_actual_pred_y(train_texts, train_encodings['labels'])
actual_y_test, pred_y_test = get_actual_pred_y(val_texts, val_encodings['labels'])



In [None]:
from sklearn.metrics import classification_report 

print(classification_report(actual_y_test, pred_y_test, target_names=label_names))


OK, let's test it on your own text.

In [None]:
text = input()

text = text.split(" ")

print(text)
predicted = [label_names[label] for label in infer_tokens(text)]
print(predicted)

In [314]:
val_texts[10]


['Takuya',
 'Takagi',
 'scored',
 'the',
 'winner',
 'in',
 'the',
 '88th',
 'minute',
 ',',
 'rising',
 'to',
 'head',
 'a',
 'Hiroshige',
 'Yanagimoto',
 'cross',
 'towards',
 'the',
 'Syrian',
 'goal',
 'which',
 'goalkeeper',
 'Salem',
 'Bitar',
 'appeared',
 'to',
 'have',
 'covered',
 'but',
 'then',
 'allowed',
 'to',
 'slip',
 'into',
 'the',
 'net',
 '.']