# Fine Tuning BERT For Named Entity Recognition On United Nations Documents

Humans understand the world by putting labels on things and examining how these labels relate to each other. In natural language processing and information retrieval, this idea is reflected in a technique called Named Entity Recognition (NER). The objective is to detect the entity type of segments of text in a document. These entities could be organizations, locations, persons or others.

In this blog post, I will walk through an example of training a named entity recognition model for a specific domain. Instead of creating an NER model from scratch, I will use transfer learning: take a pre-trained language model, BERT, which was trained on a large amount of general text, and fine-tune that neural network on a very specific domain.

Alongside the tutorial on learning an NER model, I will run this project on Layer in order to make use of their metadata store for storing and tracking the datasets and model artifacts as well as their free GPU compute instances. 

Firstly, let's define the problem. We are working with a set of documents from the United Nations (UN). Diplomatic jargon is the norm at the UN, and these documents contain many specific entities that we don't encounter in everyday language, such as the Office for the Coordination of Humanitarian Affairs of the Secretariat and the Office of the United Nations High Commissioner for Refugees. We would like to automatically detect these entities along with their corresponding types. With the entities flagged, we can power many interesting use cases such as information retrieval, question answering and document similarity.

The dataset is generously made available to the public by Leslie Huang. It consists of transcribed speeches given at the UN General Assembly from 1993-2016, which were scraped from the UN website, parsed (e.g. from PDF), and cleaned. More than 50,000 tokens were manually annotated for NER tags.
https://github.com/leslie-huang/UN-named-entity-recognition

## Installing/Importing Libraries

Let's start by creating a project at Layer so that the project is reproducible and its datasets and artifacts are logged along with their parameters for future reference. Layer helps you build, train and track all your machine learning project metadata, including ML models and datasets, with semantic versioning. It also allows you to use their cloud infrastructure free of charge, including access to GPUs. We will work with a pretrained transformer-based language model, so the added processing power is very welcome.

We will start by installing the necessary libraries.

In [1]:
!pip install layer --upgrade -qqq
!pip install -U ipython

!pip install transformers
!pip install datasets
!pip install seqeval

Here we import the libraries, log in to Layer and initialize our ML project called "united_nations_ner-finetuning".

In [1]:
import os
import itertools
import pandas as pd
import random
from collections import Counter
from math import ceil
from datasets import Dataset

from transformers import BertTokenizerFast, BertConfig, BertForTokenClassification
import torch
from sklearn.metrics import accuracy_score
from torch.utils.data import Dataset, DataLoader

import layer
from layer.decorators import model, pip_requirements, fabric, dataset, resources

layer.login()
layer.init("united_nations_ner-finetuning")

Your Layer project is here: https://app.layer.ai/kaankarakeben/united_nations_ner-finetuning

After setting up the ML metadata store, we will now clone the GitHub repository that hosts the dataset files.

In [4]:
!git clone https://github.com/leslie-huang/UN-named-entity-recognition

Cloning into 'UN-named-entity-recognition'...
remote: Enumerating objects: 21580, done.
remote: Total 21580 (delta 0), reused 0 (delta 0), pack-reused 21580
Receiving objects: 100% (21580/21580), 14.70 MiB | 6.54 MiB/s, done.
Resolving deltas: 100% (21095/21095), done.


## Dataset

At this step, we will load the tagged documents from both the training and test sets and store them in a DataFrame.
In the code below, we use decorators from Layer to define a dataset artifact that will be logged in our cloud project at Layer. By calling "layer.run()" we run the function "create_raw_dataset" on Layer's cloud infrastructure.

We also log some text metadata with the raw dataset. This makes our ML project more readable and reproducible. Just as code is read more often than it is written, so are ML projects.

Next, we will load the dataset into local memory by calling the layer.get_dataset() function.

In [8]:
@dataset("un_ner_raw_dataset")
@resources(path="./UN-named-entity-recognition")
def create_raw_dataset():
    directories = ['./UN-named-entity-recognition/tagged-training/', './UN-named-entity-recognition/tagged-test/']
    data_files = []
    for dir in directories:
        for filename in os.listdir(dir):
            file_path = os.path.join(dir, filename)

            with open(file_path, 'r', encoding="utf8") as f:
                lines = f.readlines()
                split_list = [list(y) for x, y in itertools.groupby(lines, lambda z: z == '\n') if not x]
                tokens = [[x.split('\t')[0] for x in y] for y in split_list]
                entities = [[x.split('\t')[1][:-1] for x in y] for y in split_list]
                data_files.append(pd.DataFrame({'tokens': tokens, 'ner_tags': entities}))

    dataset = pd.concat(data_files).reset_index().drop('index', axis=1)

    dataset_description = """The corpus consists of a sample of transcribed speeches given at the UN General Assembly from 1993-2016, which were scraped from the UN website, parsed (e.g. from PDF), and cleaned. More than 50,000 tokens in the test data were manually tagged for Named Entity Recognition (O - Not a Named Entity; I-PER - Person; I-ORG - Organization; I-LOC - Location; I-MISC - Other Named Entity)."""
    layer.log({"# Examples": len(dataset)})
    layer.log({"Dataset Description": dataset_description})
    layer.log({"Source": "https://github.com/leslie-huang/UN-named-entity-recognition"})

    return dataset

layer.run([create_raw_dataset])

Output()
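The parsing step inside `create_raw_dataset` can be tried on its own. The sketch below assumes the file layout implied by the code above (one `token<TAB>tag` pair per line, with blank lines between sentences). The sample text and the `parse_tagged_lines` helper are made up for illustration and are not part of the repository.

```python
import itertools

def parse_tagged_lines(lines):
    """Split token<TAB>tag lines into sentences at blank lines."""
    groups = [
        list(group)
        for is_blank, group in itertools.groupby(lines, lambda line: line == "\n")
        if not is_blank
    ]
    tokens = [[line.split("\t")[0] for line in group] for group in groups]
    tags = [[line.split("\t")[1].rstrip("\n") for line in group] for group in groups]
    return tokens, tags

# Hypothetical sample in the assumed format, not taken from the corpus.
sample = "Kuwait\tI-LOC\ncongratulates\tO\n\nCorruption\tO\nremains\tO\n"
tokens, tags = parse_tagged_lines(sample.splitlines(keepends=True))
print(tokens)
print(tags)
```

Using `rstrip("\n")` instead of slicing off the last character is a small design choice: it also behaves sensibly if the final line of a file has no trailing newline.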

Next we will examine the dataset. The annotation follows a specific Named Entity Recognition annotation scheme called IOB tagging, which stands for Inside-Outside-Beginning. The document is tagged at the word level, and entities sometimes span groups of words. To mark entities that cover several words, we use the Beginning (B) and Inside (I) tags.
Example: Tim Cook works at Apple.
[Tim, Cook, works, at, Apple] -> [B-PER, I-PER, O, O, B-ORG]
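To make the scheme concrete, here is a minimal sketch of how word-level tags can be grouped back into entity spans. The `extract_entities` helper is hypothetical (it is not part of the dataset or any library). It starts a new entity on a B- tag or whenever the entity type changes, so it also handles data that only uses I- tags.

```python
def extract_entities(tokens, tags):
    """Group consecutive tagged tokens into (entity_text, entity_type) pairs."""
    entities = []
    current_tokens, current_type = [], None
    for token, tag in zip(tokens, tags):
        if tag == "O":
            prefix, entity_type = "O", None
        else:
            prefix, _, entity_type = tag.partition("-")
        # A new entity starts on a B- tag or when the entity type changes.
        starts_new = prefix == "B" or entity_type != current_type
        if current_tokens and (entity_type is None or starts_new):
            entities.append((" ".join(current_tokens), current_type))
            current_tokens, current_type = [], None
        if entity_type is not None:
            current_tokens.append(token)
            current_type = entity_type
    if current_tokens:
        entities.append((" ".join(current_tokens), current_type))
    return entities

tokens = ["Tim", "Cook", "works", "at", "Apple"]
tags = ["B-PER", "I-PER", "O", "O", "B-ORG"]
print(extract_entities(tokens, tags))
```

One limitation of tagging with I- tags only: two adjacent entities of the same type cannot be told apart, and this helper would merge them into one span.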

Our dataset consists of two columns where each item is a list. In the "tokens" column, we have the words of the text as a list. In the "ner_tags" column, we have the corresponding tags.

In [10]:
raw_dataset = layer.get_dataset("kaankarakeben/united_nations_ner-finetuning/datasets/un_ner_raw_dataset").to_pandas()
raw_dataset.head()

Output()

Unnamed: 0,tokens,ner_tags
0,"[Kuwait, congratulates, Mr., Srgjan, Kerim, up...","[I-LOC, O, O, I-PER, I-PER, O, O, O, O, O, O, ..."
1,"[Despite, the, fact, that, two, years, have, e...","[O, O, O, O, O, O, O, O, O, O, O, I-MISC, I-MI..."
2,"[Recent, times, have, seen, a, number, of, out...","[O, O, O, O, O, O, O, O, O, O, O, O, O, O, O, ..."
3,"[These, were, all, necessary, achievements, ,,...","[O, O, O, O, O, O, O, O, O, O, O, O, O, O, O, ..."
4,"[Moreover, ,, the, revival, of, racial, bias, ...","[O, O, O, O, O, O, O, O, O, O, O, O, O, O, O, ..."


We will now create a Counter object from the NER tags. As expected, the most common tag is "O", denoting "Outside", for words that are not part of a named entity. Second is the "I-ORG" tag for organisation entities, followed by the location tag "I-LOC".
An interesting finding is that while we have Inside (I) tags, we don't have their Beginning (B) counterparts. We also have a few typos that appear only a handful of times.

In [11]:
raw_tags_counter = Counter([tag for tags in raw_dataset["ner_tags"] for tag in tags])
raw_tags_counter.most_common()

[('O', 135914),
 ('I-ORG', 3562),
 ('I-LOC', 3329),
 ('I-MISC', 2649),
 ('I-PER', 444),
 ('0', 7),
 ('I-', 2),
 ('I-PRG', 1),
 ('I-I-MISC', 1),
 ('I-OR', 1),
 ('VMISC', 1)]

It pays off to clean the tags further and replace the typos with "O" to get a cleaner dataset.

In [12]:
tags_to_remove = ["I-PRG", "I-I-MISC", "I-OR", "VMISC", "I-", "0"]

def clean_tags(tags):
    clean_list = []
    for tag in list(tags):
        if tag != "O":
            if tag not in tags_to_remove:
                clean_list.append(tag)
            else:
                clean_list.append("O")    
        else:
            clean_list.append("O")
    return clean_list
raw_dataset["ner_tags"] = raw_dataset["ner_tags"].apply(lambda x: clean_tags(x))

tag_counter = Counter([tag for tags in raw_dataset["ner_tags"] for tag in tags])
tag_counter.most_common()

[('O', 135927),
 ('I-ORG', 3562),
 ('I-LOC', 3329),
 ('I-MISC', 2649),
 ('I-PER', 444)]
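An alternative to listing every typo is to whitelist the tags we expect and map everything else to "O". The following is a sketch of that variant, not the code used above. `VALID_TAGS` and `clean_tags_whitelist` are names introduced here for illustration, and the noisy input is a made-up example.

```python
from collections import Counter

# Tags described in the dataset documentation; anything else is treated as a typo.
VALID_TAGS = {"O", "I-PER", "I-ORG", "I-LOC", "I-MISC"}

def clean_tags_whitelist(tags, valid_tags=VALID_TAGS):
    """Keep known tags and replace every unknown tag with "O"."""
    return [tag if tag in valid_tags else "O" for tag in tags]

# Hypothetical noisy input for demonstration.
noisy = ["I-LOC", "0", "I-OR", "O", "VMISC", "I-PER"]
cleaned = clean_tags_whitelist(noisy)
print(cleaned)
print(Counter(cleaned).most_common())
```

The whitelist has the advantage that typos which only show up in future data are handled without editing the list, at the cost of silently hiding a genuinely new tag.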

Now that we have a better idea of the dataset, let's log the clean dataset along with the tag counts as metadata at Layer. This lets us record the distinct steps of our project together with an overview of the dataset.

In [14]:
@dataset("un_ner_clean_dataset")
@resources(path="./UN-named-entity-recognition")
def clean_clean_dataset():
    layer.log({"Raw Tags Counter": raw_tags_counter})
    layer.log({"Clean Tags Counter": tag_counter})
    return raw_dataset

layer.run([clean_clean_dataset])

Output()

In [18]:
clean_dataset = layer.get_dataset("kaankarakeben/united_nations_ner-finetuning/datasets/un_ner_clean_dataset").to_pandas()

Output()

## Fine-tuning Pretrained BERT with PyTorch

As stated earlier, we will use transfer learning to create our NER model. The pretrained model we'll use is BERT, which is a large neural network trained on masked language modelling and next sentence prediction tasks. If you are interested, have a look at the original paper [https://arxiv.org/abs/1810.04805] and this brilliant blog post [http://jalammar.github.io/illustrated-bert/] by Jay Alammar. The fine-tuning will be a supervised learning effort with our annotated dataset.

We will work with HuggingFace's very useful "transformers" library to get the pretrained model as well as the tokenizer object that is required to turn our dataset into the input format for BERT. Below is the code to load the tokenizer and store it in our Layer project. This is an important step for the reproducibility of our work. Layer allows us to log our ML project artifacts and versions them automatically.

In [17]:
@pip_requirements(packages=["transformers"])
@fabric("f-medium")
@model(name="bert-base-uncased-tokenizer")
def download_tokenizer():
    from transformers import BertTokenizerFast
    tokenizer = BertTokenizerFast.from_pretrained('bert-base-uncased')
    return tokenizer

layer.run([download_tokenizer])

Output()

In [None]:
tokenizer = layer.get_model("kaankarakeben/united_nations_ner-finetuning/models/bert-base-uncased-tokenizer").get_train()

We need the BERT tokenizer in order to map the tokens (words) and NER tags into numerical representations in the format the pretrained model expects. The following class carries out this job for us.

In [20]:
# Also, we will create numerical indexes for tags
tag_to_ids = {tag: ix for ix, tag in enumerate(tag_counter.keys())}
id_to_tag = {ix: tag for tag, ix in tag_to_ids.items()}

class dataset(Dataset):
  def __init__(self, dataframe, tokenizer, max_len):
        self.len = len(dataframe)
        self.data = dataframe
        self.tokenizer = tokenizer
        self.max_len = max_len

  def __getitem__(self, index):

        label_all_tokens = True
        # use the tokenizer and maximum length passed to the constructor
        tokenized_inputs = self.tokenizer([list(self.data.tokens[index])], truncation=True, is_split_into_words=True, max_length=self.max_len, padding='max_length')

        labels = []
        for i, label in enumerate([list(self.data.ner_tags[index])]):
            word_ids = tokenized_inputs.word_ids(batch_index=i)
            previous_word_idx = None
            label_ids = []
            for word_idx in word_ids:
                if word_idx is None:
                    label_ids.append(-100)
                elif label[word_idx] == '0':
                    # defensive: a mistyped '0' (zero) tag counts as "O"
                    label_ids.append(tag_to_ids["O"])
                elif word_idx != previous_word_idx:
                    label_ids.append(tag_to_ids[label[word_idx]])
                else:
                    label_ids.append(tag_to_ids[label[word_idx]] if label_all_tokens else -100)
                previous_word_idx = word_idx
            labels.append(label_ids)
            
        tokenized_inputs["labels"] = labels

        single_tokenized_input = {}
        for k, v in tokenized_inputs.items():
          single_tokenized_input[k] = torch.as_tensor(v[0])
        
        return single_tokenized_input

  def __len__(self):
        return self.len
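The label alignment inside `__getitem__` is easier to follow in isolation. The sketch below reimplements just that step as a standalone function, with a hand-written `word_ids` list standing in for what a fast tokenizer would return. The word pieces in the comment are illustrative and were not produced by running the tokenizer. Positions labelled -100 are skipped by PyTorch's cross-entropy loss, whose default `ignore_index` is -100.

```python
def align_labels(word_ids, word_labels, tag_to_ids, label_all_tokens=True):
    """Map word-level tags onto word pieces; special tokens get -100."""
    label_ids = []
    previous_word_idx = None
    for word_idx in word_ids:
        if word_idx is None:
            # [CLS], [SEP] and padding do not belong to any word
            label_ids.append(-100)
        elif word_idx != previous_word_idx or label_all_tokens:
            # first piece of a word, or every piece when label_all_tokens is set
            label_ids.append(tag_to_ids[word_labels[word_idx]])
        else:
            # later pieces of a word are ignored by the loss
            label_ids.append(-100)
        previous_word_idx = word_idx
    return label_ids

tag_to_ids = {"O": 0, "I-LOC": 1}
word_labels = ["I-LOC", "O"]
# Hypothetical pieces: [CLS], "ku", "##wait", "congratulates", [SEP], [PAD]
word_ids = [None, 0, 0, 1, None, None]
print(align_labels(word_ids, word_labels, tag_to_ids, label_all_tokens=True))
print(align_labels(word_ids, word_labels, tag_to_ids, label_all_tokens=False))
```

With `label_all_tokens=True` every piece of a word contributes to the loss, which gives long, rare words more weight. With `False`, only the first piece of each word is trained on.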

One last thing before we start modelling is splitting our dataset into train and test sets. We will hold out 20% of the dataset for evaluation.

In [21]:
MAX_LEN = 128
TRAIN_BATCH_SIZE = 4
VALID_BATCH_SIZE = 2
EPOCHS = 1
LEARNING_RATE = 1e-05
MAX_GRAD_NORM = 10

train_size = 0.8
train_dataset = clean_dataset.sample(frac=train_size,random_state=200)
test_dataset = clean_dataset.drop(train_dataset.index).reset_index(drop=True)
train_dataset = train_dataset.reset_index(drop=True)

print("FULL Dataset: {}".format(clean_dataset.shape))
print("TRAIN Dataset: {}".format(train_dataset.shape))
print("TEST Dataset: {}".format(test_dataset.shape))

training_set = dataset(train_dataset, tokenizer, MAX_LEN)
testing_set = dataset(test_dataset, tokenizer, MAX_LEN)

FULL Dataset: (5731, 2)
TRAIN Dataset: (4585, 2)
TEST Dataset: (1146, 2)


At this point, we are ready to fine-tune our model by training the pretrained network on our annotated NER dataset. For demonstration purposes, we will stop at one epoch. Once again we will turn to Layer to do the heavy lifting. By calling "layer.run([train])" we carry out the computation on Layer's infrastructure, taking advantage of the available free GPU.

In [22]:
train_dataset.head()

Unnamed: 0,tokens,ner_tags
0,"[It, is, offensive, that, ,, in, the, twenty-f...","[O, O, O, O, O, O, O, O, O, O, O, O, O, O, O, ..."
1,"[The, victims, are, ,, as, always, ,, the, inn...","[O, O, O, O, O, O, O, O, O, O]"
2,"[Anyone, who, thinks, that, Israel, will, achi...","[O, O, O, O, I-LOC, O, O, O, O, O, O, O, O, O,..."
3,"[Corruption, remains, endemic, .]","[O, O, O, O]"
4,"[It, enlightens, us, .]","[O, O, O, O]"


In [24]:
@pip_requirements(packages=["transformers", "scikit-learn", "torch"])
@fabric("f-gpu-small")
@model("un_ner_finuted_bert")
def train():
    import torch
    from sklearn.metrics import accuracy_score
    from transformers import BertTokenizerFast, BertConfig, BertForTokenClassification
    from torch.utils.data import Dataset, DataLoader

    train_params = {'batch_size': TRAIN_BATCH_SIZE,
                'shuffle': True,
                'num_workers': 0
                }

    training_loader = DataLoader(training_set, **train_params)

    device = "cpu"
    model = BertForTokenClassification.from_pretrained('bert-base-uncased', num_labels=len(tag_to_ids))
    model.to(device)

    optimizer = torch.optim.Adam(params=model.parameters(), lr=LEARNING_RATE)
    
    for epoch in range(EPOCHS):
        print(f"Training epoch: {epoch + 1}")
        tr_loss, tr_accuracy = 0, 0
        nb_tr_examples, nb_tr_steps = 0, 0
        tr_preds, tr_labels = [], []
        # put model in training mode
        model.train()
        
        for idx, batch in enumerate(training_loader):
            
            ids = batch['input_ids'].to(device, dtype = torch.long)
            mask = batch['attention_mask'].to(device, dtype = torch.long)
            labels = batch['labels'].to(device, dtype = torch.long)

            outputs = model(input_ids=ids, attention_mask=mask, labels=labels)
            loss = outputs[0]
            tr_logits = outputs[1]
            tr_loss += loss.item()

            nb_tr_steps += 1
            nb_tr_examples += labels.size(0)
            
            if idx % 100==0:
                loss_step = tr_loss/nb_tr_steps
                print(f"Training loss per 100 training steps: {loss_step}")
            
            # compute training accuracy
            flattened_targets = labels.view(-1) # shape (batch_size * seq_len,)
            active_logits = tr_logits.view(-1, model.num_labels) # shape (batch_size * seq_len, num_labels)
            flattened_predictions = torch.argmax(active_logits, axis=1) # shape (batch_size * seq_len,)
            
            # only compute accuracy at active labels
            active_accuracy = labels.view(-1) != -100 # shape (batch_size * seq_len,)
            
            labels = torch.masked_select(flattened_targets, active_accuracy)
            predictions = torch.masked_select(flattened_predictions, active_accuracy)
            
            tr_labels.extend(labels)
            tr_preds.extend(predictions)

            tmp_tr_accuracy = accuracy_score(labels.cpu().numpy(), predictions.cpu().numpy())
            tr_accuracy += tmp_tr_accuracy
        
            # backward pass
            optimizer.zero_grad()
            loss.backward()

            # gradient clipping (after backward so there are gradients to clip, before the update)
            torch.nn.utils.clip_grad_norm_(
                parameters=model.parameters(), max_norm=MAX_GRAD_NORM
            )

            optimizer.step()

        epoch_loss = tr_loss / nb_tr_steps
        tr_accuracy = tr_accuracy / nb_tr_steps
        print(f"Training loss epoch: {epoch_loss}")
        print(f"Training accuracy epoch: {tr_accuracy}")

    return model

model = train()

Output()
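Gradient clipping only affects an update when it runs after `loss.backward()` and before `optimizer.step()`, because that is the only moment the fresh gradients exist. As a rough illustration of the idea behind `torch.nn.utils.clip_grad_norm_` with the default L2 norm, here is a simplified pure-Python sketch. The `clip_by_global_norm` helper and the toy gradient values are assumptions for demonstration, not PyTorch code.

```python
import math

def clip_by_global_norm(grads, max_norm, eps=1e-6):
    """Scale all gradients down together if their combined L2 norm exceeds max_norm."""
    total_norm = math.sqrt(sum(g * g for g in grads))
    # The scale is capped at 1.0, so small gradients are left untouched.
    scale = min(1.0, max_norm / (total_norm + eps))
    return [g * scale for g in grads], total_norm

# Toy gradients with norm 5.0, clipped to a maximum norm of 1.0.
clipped, norm = clip_by_global_norm([3.0, 4.0], max_norm=1.0)
print(norm, clipped)

# Gradients already below the threshold pass through unchanged.
unchanged, _ = clip_by_global_norm([0.3, 0.4], max_norm=10.0)
print(unchanged)
```

Because all gradients share one scale factor, the direction of the update is preserved and only its length is limited.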

Once the model is trained, we will fetch it from Layer and, once it is in memory, use it for evaluation.

In [None]:
model = layer.get_model("kaankarakeben/united_nations_ner-finetuning/models/un_finetune_trainer").get_train()

On the test set, our trained model reaches an accuracy of about 98.6% and a macro-averaged F1 score of 0.89, as the evaluation output below shows. Impressive results with a relatively small amount of annotated data!

Lastly, we'll have a look at how the model performs in the wild with an example.

In [28]:
from sklearn.metrics import classification_report

def validate(model, testing_loader):
    # put model in evaluation mode
    model.eval()
    
    eval_loss, eval_accuracy = 0, 0
    nb_eval_examples, nb_eval_steps = 0, 0
    eval_preds, eval_labels = [], []
    device = "cpu"
    with torch.no_grad():
        for idx, batch in enumerate(testing_loader):
            
            ids = batch['input_ids'].to(device, dtype = torch.long)
            mask = batch['attention_mask'].to(device, dtype = torch.long)
            labels = batch['labels'].to(device, dtype = torch.long)
            
            outputs = model(input_ids=ids, attention_mask=mask, labels=labels)
            loss = outputs[0]
            eval_logits = outputs[1]

            
            eval_loss += loss.item()

            nb_eval_steps += 1
            nb_eval_examples += labels.size(0)
        
            if idx % 100==0:
                loss_step = eval_loss/nb_eval_steps
                print(f"Validation loss per 100 evaluation steps: {loss_step}")
              
            # compute evaluation accuracy
            flattened_targets = labels.view(-1) # shape (batch_size * seq_len,)
            active_logits = eval_logits.view(-1, model.num_labels) # shape (batch_size * seq_len, num_labels)
            flattened_predictions = torch.argmax(active_logits, axis=1) # shape (batch_size * seq_len,)
            
            # only compute accuracy at active labels
            active_accuracy = labels.view(-1) != -100 # shape (batch_size * seq_len,)
        
            labels = torch.masked_select(flattened_targets, active_accuracy)
            predictions = torch.masked_select(flattened_predictions, active_accuracy)
            
            eval_labels.extend(labels)
            eval_preds.extend(predictions)
            
            tmp_eval_accuracy = accuracy_score(labels.cpu().numpy(), predictions.cpu().numpy())
            eval_accuracy += tmp_eval_accuracy

    labels = [id_to_tag[id.item()] for id in eval_labels]
    predictions = [id_to_tag[id.item()] for id in eval_preds]
    
    eval_loss = eval_loss / nb_eval_steps
    eval_accuracy = eval_accuracy / nb_eval_steps
    print(f"Validation Loss: {eval_loss}")
    print(f"Validation Accuracy: {eval_accuracy}")

    return labels, predictions


test_params = {'batch_size': VALID_BATCH_SIZE,
                'shuffle': True,
                'num_workers': 0
                }

testing_loader = DataLoader(testing_set, **test_params)

labels, predictions = validate(model, testing_loader)

print(classification_report(labels, predictions))

Validation loss per 100 evaluation steps: 0.07575470209121704
Validation loss per 100 evaluation steps: 0.04592693951136506
Validation loss per 100 evaluation steps: 0.045777699266779305
Validation loss per 100 evaluation steps: 0.04580374692628188
Validation loss per 100 evaluation steps: 0.04674512721619051
Validation loss per 100 evaluation steps: 0.04575173984985323
Validation Loss: 0.04462353571913982
Validation Accuracy: 0.9855823812930233
              precision    recall  f1-score   support

       I-LOC       0.91      0.97      0.94       780
      I-MISC       0.80      0.66      0.72       603
       I-ORG       0.80      0.89      0.84       748
       I-PER       0.96      0.97      0.96       178
           O       0.99      0.99      0.99     29144

    accuracy                           0.98     31453
   macro avg       0.89      0.90      0.89     31453
weighted avg       0.98      0.98      0.98     31453
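The gap between the macro and weighted averages in the report comes from the class imbalance: "O" tokens make up the vast majority of the support. The arithmetic can be checked with the rounded per-class values copied from the report above. Since the inputs are rounded to two decimals, the results are approximate.

```python
# (f1-score, support) per class, copied from the classification report above
report = {
    "I-LOC": (0.94, 780),
    "I-MISC": (0.72, 603),
    "I-ORG": (0.84, 748),
    "I-PER": (0.96, 178),
    "O": (0.99, 29144),
}

# Macro average: every class counts equally, regardless of its size.
macro_f1 = sum(f1 for f1, _ in report.values()) / len(report)

# Weighted average: each class counts in proportion to its support.
total_support = sum(support for _, support in report.values())
weighted_f1 = sum(f1 * support for f1, support in report.values()) / total_support

print(round(macro_f1, 2), round(weighted_f1, 2), total_support)
```

The macro average is the more informative number here, since it is not dominated by the easy "O" class and exposes the weaker I-MISC performance.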



In [59]:
sentence = """Expressing deep concern about the impact of the food security crisis on the
assistance provided by United Nations humanitarian agencies, in particular the World
Food Programme."""

inputs = tokenizer(sentence.split(),
                    is_split_into_words=True,
                    return_offsets_mapping=True, 
                    padding='max_length', 
                    truncation=True, 
                    max_length=MAX_LEN,
                    return_tensors="pt")

            
ids = inputs["input_ids"]
mask = inputs["attention_mask"]
# forward pass
outputs = model(ids, attention_mask=mask)
logits = outputs[0]

active_logits = logits.view(-1, model.num_labels) # shape (batch_size * seq_len, num_labels)
flattened_predictions = torch.argmax(active_logits, axis=1) # shape (batch_size*seq_len,) - predictions at the token level

tokens = tokenizer.convert_ids_to_tokens(ids.squeeze().tolist())
token_predictions = [id_to_tag[i] for i in flattened_predictions.cpu().numpy()]
wp_preds = list(zip(tokens, token_predictions)) # list of tuples. Each tuple = (wordpiece, prediction)

prediction = []
for token_pred, mapping in zip(wp_preds, inputs["offset_mapping"].squeeze().tolist()):
  #only predictions on first word pieces are important
  if mapping[0] == 0 and mapping[1] != 0:
    prediction.append(token_pred[1])
  else:
    continue

print(sentence.split())
print(prediction)

['Expressing', 'deep', 'concern', 'about', 'the', 'impact', 'of', 'the', 'food', 'security', 'crisis', 'on', 'the', 'assistance', 'provided', 'by', 'United', 'Nations', 'humanitarian', 'agencies,', 'in', 'particular', 'the', 'World', 'Food', 'Programme.']
['O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'I-ORG', 'I-ORG', 'O', 'O', 'O', 'O', 'O', 'I-ORG', 'I-ORG', 'I-ORG']
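The filter on `offset_mapping` in the loop above keeps one prediction per word. The sketch below isolates that step with hand-written values, assuming (as the loop does) that offsets restart at 0 for every word when the input is pre-split, and that special tokens carry the offset (0, 0). The word pieces, offsets and predictions are hypothetical and were not produced by the model.

```python
def first_piece_predictions(token_predictions, offsets):
    """Keep the prediction of the first word piece of every word."""
    return [
        pred
        for pred, (start, end) in zip(token_predictions, offsets)
        # start == 0: first piece of a word; end != 0: not a special token
        if start == 0 and end != 0
    ]

# Hypothetical pieces: [CLS], "world", "program", "##me", [SEP]
offsets = [(0, 0), (0, 5), (0, 7), (7, 9), (0, 0)]
token_predictions = ["O", "I-ORG", "I-ORG", "O", "O"]
print(first_piece_predictions(token_predictions, offsets))
```

Note that the prediction for the trailing piece "##me" is dropped even though it disagrees with the first piece, which is exactly what "only predictions on first word pieces are important" means.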


Extracting named entities from text has many uses that transform the way we interact with these documents. Pretrained models such as BERT and libraries such as Hugging Face Transformers make it easy to fine-tune general-purpose models. However, for a data scientist, life doesn't end with a trained model in a notebook. The Layer features we have shown allow us to follow MLOps best practices in building, tracking and logging all of our artifacts. When all these technologies combine, long-lasting value is unlocked.

Blog posts and tutorials I found useful in preparation for this work:

https://medium.com/@andrewmarmon/fine-tuned-named-entity-recognition-with-hugging-face-bert-d51d4cb3d7b5

https://colab.research.google.com/github/NielsRogge/Transformers-Tutorials/blob/master/BERT/Custom_Named_Entity_Recognition_with_BERT_only_first_wordpiece.ipynb#scrollTo=zPDla1mmZiax

https://www.depends-on-the-definition.com/named-entity-recognition-with-bert/

https://jalammar.github.io/illustrated-bert/

https://huggingface.co/docs/transformers/tasks/token_classification