Humans understand the world by putting labels on things and examining how these labels relate to each other. A reflection of this in the world of natural language processing and information retrieval is a technique called Named Entity Recognition (NER). The objective is to detect the entity type of segments of text in a document. These entities could be organizations, locations, persons or others.

In this blog post, I will go through an example of training a named entity recognition model on a specific domain. Instead of creating an NER model from scratch, I will use transfer learning: take a pre-trained language model, BERT, which was trained on a large number of general examples, and fine-tune that neural network on a very specific domain.

Alongside the tutorial on learning an NER model, I will run this project on Layer in order to make use of their metadata store for storing and tracking the datasets and model artifacts as well as their free GPU compute instances. 

Firstly, let's define the problem. We are working with a set of documents from the United Nations (UN). Diplomatic jargon is the norm at the UN and these documents contain many specific entities that we don't encounter in everyday language, such as the Office for the Coordination of Humanitarian Affairs of the Secretariat and the Office of the United Nations High Commissioner for Refugees. We would like to automatically detect these entities with their corresponding types. With the entities flagged, we can power many interesting use cases such as information retrieval, question answering and document similarity.

The dataset is generously made available to the public by Leslie Huang. It consists of transcribed speeches given at the UN General Assembly from 1993-2016, which were scraped from the UN website, parsed (e.g. from PDF), and cleaned. More than 50,000 tokens were manually annotated for NER tags.
https://github.com/leslie-huang/UN-named-entity-recognition

Let's start by creating a project at Layer so that the project is reproducible and its datasets and artifacts are logged along with their parameters for future reference. Layer helps you build, train and track all your machine learning project metadata, including ML models and datasets, with semantic versioning. It also allows you to use their cloud infrastructure free of charge, including access to GPUs. We will work with a pretrained transformer-based language model, so the added processing power is very welcome.

We will start by installing the necessary libraries.

In [1]:
!pip install layer --upgrade -qqq
!pip install -U ipython

!pip install transformers
!pip install datasets
!pip install seqeval

Here we log in to Layer and initialize our ML project called "ner-finetuning".  

In [2]:
import layer
from layer.decorators import model, pip_requirements, fabric
layer.login()
layer.init("ner-finetuning")

Your Layer project is here: https://app.layer.ai/kaankarakeben/ner-finetuning

After setting up the ML metadata store, we will now clone the GitHub repository that hosts the dataset files.

In [None]:
!git clone https://github.com/leslie-huang/UN-named-entity-recognition

...

In [1]:
import os
import itertools
import pandas as pd
import random
from collections import Counter
from math import ceil
from datasets import Dataset

from transformers import AutoModelForTokenClassification, TrainingArguments, Trainer
from transformers import DataCollatorForTokenClassification
import torch
from layer.decorators import dataset, resources

At this step, we will load the tagged documents from both training and test sets and store them in a DataFrame.
As you may have noticed, we are using decorators from Layer to define a dataset artifact that will be logged to our cloud project at Layer. By calling "layer.run()" we will run the function "create_raw_dataset" on the cloud infrastructure.

You may have also noticed we are logging some text metadata with the raw dataset. This enriches our ML project at the readability and reproducibility level. Just as code is read more often than it is written, so are ML projects.

Next, we will get the dataset into local memory by calling it from Layer with the layer.get_dataset() function.

In [None]:
@dataset("un_ner_raw_dataset")
@resources(path="./UN-named-entity-recognition")
def create_raw_dataset():
    directories = ['./UN-named-entity-recognition/tagged-training/', './UN-named-entity-recognition/tagged-test/']
    data_files = []
    for dir in directories:
        for filename in os.listdir(dir):
            file_path = os.path.join(dir, filename)

            with open(file_path, 'r', encoding="utf8") as f:
                lines = f.readlines()
                split_list = [list(y) for x, y in itertools.groupby(lines, lambda z: z == '\n') if not x]
                tokens = [[x.split('\t')[0] for x in y] for y in split_list]
                entities = [[x.split('\t')[1][:-1] for x in y] for y in split_list]
                data_files.append(pd.DataFrame({'tokens': tokens, 'ner_tags': entities}))

    dataset = pd.concat(data_files).reset_index().drop('index', axis=1)

    dataset_description = """The corpus consists of a sample of transcribed speeches given at the UN General Assembly from 1993-2016, which were scraped from the UN website, parsed (e.g. from PDF), and cleaned. More than 50,000 tokens in the test data were manually tagged for Named Entity Recognition (O - Not a Named Entity; I-PER - Person; I-ORG - Organization; I-LOC - Location; I-MISC - Other Named Entity)."""
    layer.log({"# Examples": len(dataset)})
    layer.log({"Dataset Description": dataset_description})
    layer.log({"Source": "https://github.com/leslie-huang/UN-named-entity-recognition"})

    return dataset

layer.run([create_raw_dataset])

dataset = layer.get_dataset("kaankarakeben/ner-finetuning/datasets/un_ner_raw_dataset").to_pandas()
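
The parsing inside create_raw_dataset relies on itertools.groupby to cut each file into sentences at blank lines. Here is a minimal sketch of that step on a few in-memory lines. The sample lines are made up for illustration and assume the "token<TAB>tag" layout the function expects.

```python
import itertools

# Made-up sample in the "token<TAB>tag" layout; a blank line ends a sentence.
lines = [
    "In\tO\n",
    "Colombia\tI-LOC\n",
    "\n",
    "United\tI-ORG\n",
    "Nations\tI-ORG\n",
]

# groupby yields runs of consecutive lines that share the same key.
# The key is True for blank separator lines, which are discarded.
sentences = [
    list(group)
    for is_blank, group in itertools.groupby(lines, lambda line: line == "\n")
    if not is_blank
]

tokens = [[line.split("\t")[0] for line in sent] for sent in sentences]
ner_tags = [[line.split("\t")[1].rstrip("\n") for line in sent] for sent in sentences]

print(tokens)    # -> [['In', 'Colombia'], ['United', 'Nations']]
print(ner_tags)  # -> [['O', 'I-LOC'], ['I-ORG', 'I-ORG']]
```

Each inner list becomes one row of the DataFrame, which is why every cell in the "tokens" and "ner_tags" columns holds a list.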

Next we will examine the dataset. The annotation follows a specific Named Entity Recognition annotation scheme called IOB tagging, which stands for Inside-Outside-Beginning. The document is tagged at the word level and entities sometimes come in word groups. To mark entities that span several words we use the Beginning (B) and Inside (I) tags.
Example: Tim Cook works at Apple.
[Tim, Cook, works, at, Apple] -> [B-PER, I-PER, O, O, B-ORG]
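
To make the tagging scheme concrete, here is a small sketch that groups tagged tokens back into entity spans. The helper name tags_to_spans is made up for this illustration. It treats a B- tag, or an I- tag whose type differs from the previous token, as the start of a new entity, so it also copes with data that only contains I- tags.

```python
def tags_to_spans(tokens, tags):
    """Group (token, tag) pairs into (entity_text, entity_type) spans."""
    spans = []
    current_tokens, current_type = [], None
    for token, tag in zip(tokens, tags):
        prefix, _, entity_type = tag.partition("-")
        starts_new = prefix == "B" or (prefix == "I" and entity_type != current_type)
        if tag == "O" or starts_new:
            # Close the entity we were building, if any.
            if current_tokens:
                spans.append((" ".join(current_tokens), current_type))
            current_tokens, current_type = [], None
        if tag != "O":
            current_tokens.append(token)
            current_type = entity_type
    if current_tokens:
        spans.append((" ".join(current_tokens), current_type))
    return spans


tokens = ["Tim", "Cook", "works", "at", "Apple"]
tags = ["B-PER", "I-PER", "O", "O", "B-ORG"]
print(tags_to_spans(tokens, tags))  # -> [('Tim Cook', 'PER'), ('Apple', 'ORG')]
```

One limitation worth keeping in mind: without B- tags, two adjacent entities of the same type cannot be told apart and are merged into a single span.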

Our dataset consists of two columns where each item is a list. In the "tokens" column, we have the words of the document as a list. In the "ner_tags" column, we have the corresponding tags.

In [48]:
dataset.head()

Unnamed: 0,tokens,ner_tags
0,"[I, salute, Hisgreet, you, ,, Your, Excellency...","[O, O, O, O, O, O, O, O, I-PER, I-PER, O, O, O..."
1,"[A, short, distance, from, here, ,, on, the, f...","[O, O, O, O, O, O, O, O, O, O, O, O, O, O, O, ..."
2,"[In, Colombia, ,, violence, claims, the, same,...","[O, I-LOC, O, O, O, O, O, O, O, O, O, O, O]"
3,"[Forty, three, million, Colombians, ,, peace-l...","[O, O, O, I-MISC, O, O, O, O, O, O, O, O, O, O..."
4,"[Every, year, Colombia, buries, 34, ,000, of, ...","[O, O, I-LOC, O, O, O, O, O, O, O, O, O, O, O,..."


We will now create a Counter object from the NER tags. As expected, the most common tag is "O", denoting "Outside", for words that are not part of a named entity. Second is the "I-ORG" tag denoting organisation entities, and next in line is location.
An interesting find is that while we have Inside (I) tags, we don't have their Beginning (B) tags. We also have some typos that occur only a handful of times.

In [7]:
raw_tags_counter = Counter([tag for tags in dataset["ner_tags"] for tag in tags])
raw_tags_counter.most_common()

[('O', 135927),
 ('I-ORG', 3562),
 ('I-LOC', 3329),
 ('I-MISC', 2649),
 ('I-PER', 444)]
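
One way to spot mistyped tags is to compare the counter against the set of tags the annotation scheme defines. The sketch below uses a made-up sample rather than the real counts, and the name EXPECTED_TAGS is an assumption introduced for this illustration.

```python
from collections import Counter

# Tags defined by the annotation scheme used in this dataset.
EXPECTED_TAGS = {"O", "I-PER", "I-ORG", "I-LOC", "I-MISC"}

# Made-up sample; the real counter is built from dataset["ner_tags"].
sample_tags = [["O", "I-ORG", "I-OR"], ["0", "I-LOC", "O"]]
counter = Counter(tag for tags in sample_tags for tag in tags)

# Anything outside the expected set is a candidate typo.
unexpected = {tag: n for tag, n in counter.items() if tag not in EXPECTED_TAGS}
print(unexpected)  # -> {'I-OR': 1, '0': 1}
```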

It would pay off to clean the tags further and map the typo tags to "O" to get a cleaner dataset.

In [None]:
tags_to_remove = ["I-PRG", "I-I-MISC", "I-OR", "VMISC", "I-", "0"]

def clean_tags(tags):
    # Map every known typo tag to "O" and leave valid tags untouched
    return ["O" if tag in tags_to_remove else tag for tag in tags]

dataset["ner_tags"] = dataset["ner_tags"].apply(clean_tags)

tag_counter = Counter([tag for tags in dataset["ner_tags"] for tag in tags])
tag_counter.most_common()

Now that we have a better idea of the dataset, let's log the clean dataset along with the tag metadata at Layer. This lets us record the distinct steps of our project and keep an overview of the dataset.

In [None]:
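# The name "dataset" now refers to the DataFrame loaded earlier, so the
# Layer decorator is referenced through its module instead
@layer.decorators.dataset("un_ner_clean_dataset")
@resources(path="./UN-named-entity-recognition")
def create_clean_dataset():
    layer.log({"Raw Tags Counter": raw_tags_counter})
    layer.log({"Clean Tags Counter": tag_counter})
    return dataset

layer.run([create_clean_dataset])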

As stated earlier, we will use transfer learning to create our NER model. The pretrained model we'll use is BERT, a large neural network trained on masked language modelling and next sentence prediction tasks (the code below loads its lighter distilled variant, "distilbert-base-cased"). If you are interested, have a look at the original paper [https://arxiv.org/abs/1810.04805] and this brilliant blog post [http://jalammar.github.io/illustrated-bert/] by Jay Alammar. The fine-tuning will be a supervised learning effort with our annotated dataset.

We will work with Hugging Face's very useful "transformers" library to get the pretrained model as well as the tokenizer object that is required to turn our dataset into the input format for BERT. Below is the code to load the tokenizer and store it in our Layer project. This is an important step in the reproducibility of our work. Layer allows us to log our ML project artifacts and versions them automatically.

In [None]:
@pip_requirements(packages=["transformers"])
@fabric("f-medium")
@model(name="distilbert-base-cased-tokenizer")
def download_tokenizer():
    from transformers import AutoTokenizer
    tokenizer = AutoTokenizer.from_pretrained("distilbert-base-cased")
    return tokenizer

layer.run([download_tokenizer])

tokenizer = layer.get_model("kaankarakeben/ner-finetuning/models/distilbert-base-cased-tokenizer").get_train()

We need the BERT tokenizer in order to map the tokens (words) and NER tags into numerical representations in the format the pretrained model expects. The following function carries out this job for us.

In [None]:
# Also, we will create numerical indexes for tags
tag_to_ids = {tag: ix for ix, tag in enumerate(tag_counter.keys())}
id_to_tag = {ix: tag for tag, ix in tag_to_ids.items()}

def tokenize_and_align_labels(examples):
    # https://huggingface.co/docs/transformers/tasks/token_classification
    label_all_tokens = True
    tokenized_inputs = tokenizer(list(examples["tokens"]), truncation=True, is_split_into_words=True)

    labels = []
    for i, label in enumerate(examples["ner_tags"]):
        word_ids = tokenized_inputs.word_ids(batch_index=i)
        previous_word_idx = None
        label_ids = []
        for word_idx in word_ids:
            if word_idx is None:
                label_ids.append(-100)
            elif label[word_idx] == '0':
                # A stray "0" (zero) is a typo for "O"; map it to the "O" class explicitly
                label_ids.append(tag_to_ids["O"])
            elif word_idx != previous_word_idx:
                label_ids.append(tag_to_ids[label[word_idx]])
            else:
                label_ids.append(tag_to_ids[label[word_idx]] if label_all_tokens else -100)
            previous_word_idx = word_idx
        labels.append(label_ids)
        
    tokenized_inputs["labels"] = labels
    return tokenized_inputs
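
To see what the alignment does without downloading a tokenizer, the sketch below replays the same loop on a hand-written word_ids list. The sub-word split and the ids are made up for illustration, and a real tokenizer may split these words differently.

```python
# Word-level input and its tags.
words = ["Colombia", "buries", "34"]
word_tags = ["I-LOC", "O", "O"]
tag_to_ids = {"O": 0, "I-LOC": 1}

# Hypothetical sub-word split: [CLS] Col ##omb ##ia buries 34 [SEP]
# word_ids maps every sub-word to its source word; None marks special tokens.
word_ids = [None, 0, 0, 0, 1, 2, None]


def align(word_ids, word_tags, label_all_tokens=True):
    label_ids = []
    previous_word_idx = None
    for word_idx in word_ids:
        if word_idx is None:
            label_ids.append(-100)  # special token, ignored by the loss
        elif word_idx != previous_word_idx or label_all_tokens:
            label_ids.append(tag_to_ids[word_tags[word_idx]])
        else:
            label_ids.append(-100)  # only the first piece keeps its label
        previous_word_idx = word_idx
    return label_ids


print(align(word_ids, word_tags, label_all_tokens=True))
# -> [-100, 1, 1, 1, 0, 0, -100]
print(align(word_ids, word_tags, label_all_tokens=False))
# -> [-100, 1, -100, -100, 0, 0, -100]
```

The value -100 is the default ignore index of PyTorch's cross-entropy loss, so those positions do not contribute to training.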

One last thing before we start modelling is splitting our dataset into train and test sets. We will hold out 20% of the dataset for evaluation purposes.

In [None]:
dataset_ix = list(dataset.index)
# random.sample needs a sequence; passing a set raises an error on Python 3.11+
test_ix = random.sample(dataset_ix, ceil(len(dataset) * 0.2))
test_ix_set = set(test_ix)
# Use lists for .loc as well, since recent pandas versions reject set indexers
train_ix = [ix for ix in dataset_ix if ix not in test_ix_set]

# creating Hugging Face Dataset objects
train_dataset = Dataset.from_pandas(dataset.loc[train_ix])
test_dataset = Dataset.from_pandas(dataset.loc[test_ix])
tokenized_train_dataset = train_dataset.map(tokenize_and_align_labels, batched=True)
tokenized_test_dataset = test_dataset.map(tokenize_and_align_labels, batched=True)
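
Because random.sample is unseeded here, every run produces a different split. If a repeatable split is wanted, one option is a dedicated seeded generator. The helper name split_indices and the seed value are arbitrary choices made for this sketch.

```python
import random
from math import ceil


def split_indices(indices, test_fraction=0.2, seed=42):
    """Return (train, test) index lists; the same seed gives the same split."""
    indices = list(indices)
    rng = random.Random(seed)  # local generator, leaves the global state alone
    test = rng.sample(indices, ceil(len(indices) * test_fraction))
    test_set = set(test)
    train = [ix for ix in indices if ix not in test_set]
    return train, test


train_ix, test_ix = split_indices(range(10))
print(len(train_ix), len(test_ix))  # -> 8 2
```

Logging the seed alongside the dataset would make it possible to rebuild exactly the same held-out set later.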

At this point, we are ready to fine-tune our model by training the pretrained network with our annotated NER dataset. For demonstration purposes, we will keep training short and stop after the five epochs set in num_train_epochs. Once again we will turn to Layer to do the heavy lifting. By calling "layer.run([train])" we will effectively carry out the computation on Layer's infrastructure, taking advantage of the available free GPU.

In [None]:
@pip_requirements(packages=["transformers"])
@fabric("f-gpu-small")
@model("un_finetune_trainer")
def train():
    model = AutoModelForTokenClassification.from_pretrained("distilbert-base-cased", num_labels=len(tag_counter))

    training_args = TrainingArguments(
        output_dir="./results",
        # this argument is named eval_strategy in newer transformers releases
        evaluation_strategy="epoch",
        learning_rate=2e-5,
        per_device_train_batch_size=1024,
        per_device_eval_batch_size=16,
        num_train_epochs=5,
        weight_decay=0.01,
    )

    data_collator = DataCollatorForTokenClassification(tokenizer)

    trainer = Trainer(
        model,
        training_args,
        train_dataset=tokenized_train_dataset,
        eval_dataset=tokenized_test_dataset,
        data_collator=data_collator,
        tokenizer=tokenizer,
    )

    trainer.train()
    return trainer

layer.run([train])

Once the model is trained, we will fetch it from Layer and, once it is in memory, call the trainer object for evaluation.

In [None]:
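# train() returns the Trainer and is registered under the name "un_finetune_trainer"
trainer = layer.get_model("kaankarakeben/ner-finetuning/models/un_finetune_trainer").get_train()
model = trainer.model

trainer.evaluate()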

Looking at the test set, we are able to achieve an accuracy of 99% and an F1 score of 88% with our trained model. Impressive results with a relatively small amount of annotated data!
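
The gap between accuracy and F1 is not surprising. Most tokens are "O", so token-level accuracy stays high even when entities are missed, while F1 for NER is commonly computed over whole entities. The post does not show the metric code, so the following is only a plain-Python sketch of entity-level precision, recall and F1, in the spirit of what a library such as seqeval reports. The helper names and the toy sequences are made up, and seqeval's numbers may differ in edge cases.

```python
def extract_entities(tags):
    """Return a set of (start, end, type) spans from one tag sequence."""
    entities, start, current = set(), None, None
    for i, tag in enumerate(list(tags) + ["O"]):  # sentinel closes the last span
        entity_type = tag.split("-", 1)[1] if "-" in tag else None
        begins = tag.startswith("B-") or entity_type != current
        if current is not None and begins:
            entities.add((start, i, current))
            current = None
        if entity_type is not None and current is None:
            start, current = i, entity_type
    return entities


def entity_f1(true_seqs, pred_seqs):
    """Precision, recall and F1 where an entity counts only on an exact span match."""
    tp = n_true = n_pred = 0
    for true_tags, pred_tags in zip(true_seqs, pred_seqs):
        true_ents = extract_entities(true_tags)
        pred_ents = extract_entities(pred_tags)
        tp += len(true_ents & pred_ents)
        n_true += len(true_ents)
        n_pred += len(pred_ents)
    precision = tp / n_pred if n_pred else 0.0
    recall = tp / n_true if n_true else 0.0
    f1 = 2 * precision * recall / (precision + recall) if tp else 0.0
    return precision, recall, f1


def token_accuracy(true_seqs, pred_seqs):
    pairs = [(t, p) for ts, ps in zip(true_seqs, pred_seqs) for t, p in zip(ts, ps)]
    return sum(t == p for t, p in pairs) / len(pairs)


# Toy example: one token of a two-token organisation is missed.
y_true = [["O", "O", "I-ORG", "I-ORG", "O", "O", "O", "I-LOC", "O", "O"]]
y_pred = [["O", "O", "I-ORG", "O", "O", "O", "O", "I-LOC", "O", "O"]]
print(token_accuracy(y_true, y_pred))  # -> 0.9
print(entity_f1(y_true, y_pred))       # -> (0.5, 0.5, 0.5)
```

A single wrong token costs 10 points of accuracy in this toy example but half of the F1 score, because the truncated organisation no longer matches the annotated span.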

Lastly we'll have a look at how the model performs in the wild with an example. 

In [48]:
paragraph = '''Expressing deep concern about the impact of the food security crisis on the
assistance provided by United Nations humanitarian agencies, in particular the World
Food Programme.'''

tokens = tokenizer(paragraph)
input_ids = torch.tensor(tokens['input_ids']).unsqueeze(0)
attention_mask = torch.tensor(tokens['attention_mask']).unsqueeze(0)

model.eval()  # disable dropout for inference
with torch.no_grad():
    logits = model(input_ids=input_ids, attention_mask=attention_mask).logits
predictions = torch.argmax(logits.squeeze(0), dim=1)
# id_to_tag is the index-to-tag mapping defined earlier
predictions = [id_to_tag[int(i)] for i in predictions]

words = tokenizer.convert_ids_to_tokens(tokens['input_ids'])
pd.DataFrame({'ner': predictions, 'words': words})
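
The DataFrame built here has one row per sub-word piece, including the special tokens. If word-level output is preferred, one option is to merge the pieces back together and keep the tag predicted for the first piece of each word. The sketch assumes WordPiece-style tokens where continuation pieces start with "##". The tokens and tags below are made up rather than real model output, and merge_wordpieces is a hypothetical helper name.

```python
SPECIAL_TOKENS = {"[CLS]", "[SEP]", "[PAD]"}


def merge_wordpieces(pieces, tags):
    """Merge '##' continuation pieces; each word keeps its first piece's tag."""
    words, word_tags = [], []
    for piece, tag in zip(pieces, tags):
        if piece in SPECIAL_TOKENS:
            continue
        if piece.startswith("##") and words:
            words[-1] += piece[2:]
        else:
            words.append(piece)
            word_tags.append(tag)
    return list(zip(words, word_tags))


pieces = ["[CLS]", "World", "Food", "Program", "##me", ".", "[SEP]"]
tags = ["O", "I-ORG", "I-ORG", "I-ORG", "I-ORG", "O", "O"]
print(merge_wordpieces(pieces, tags))
# -> [('World', 'I-ORG'), ('Food', 'I-ORG'), ('Programme', 'I-ORG'), ('.', 'O')]
```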

Extracting named entities from text has many uses that transform the way we interact with documents. Pretrained models such as BERT and libraries such as Hugging Face's transformers make it easy to fine-tune general-purpose models. However, for a data scientist, life doesn't end with a trained model in a notebook. The Layer features shown here allow us to follow MLOps best practices in building, tracking and logging all of our artifacts. When all these technologies combine, long-lasting value is unlocked.

Blog posts and tutorials I found useful in preparing this work:

https://medium.com/@andrewmarmon/fine-tuned-named-entity-recognition-with-hugging-face-bert-d51d4cb3d7b5

https://colab.research.google.com/github/NielsRogge/Transformers-Tutorials/blob/master/BERT/Custom_Named_Entity_Recognition_with_BERT_only_first_wordpiece.ipynb#scrollTo=zPDla1mmZiax

https://www.depends-on-the-definition.com/named-entity-recognition-with-bert/

https://jalammar.github.io/illustrated-bert/

https://huggingface.co/docs/transformers/tasks/token_classification