*Copyright (c) Microsoft Corporation. All rights reserved.*  
*Licensed under the MIT License.*

# Named Entity Recognition Using Transformer Model

## Before You Start

The running times shown in this notebook were measured on a Standard_NC6 Azure Deep Learning Virtual Machine with 1 NVIDIA Tesla K80 GPU. 
> **Tip**: To run through the notebook quickly, set the **`QUICK_RUN`** flag in the cell below to **`True`**. The notebook then runs on a small subset of the data with fewer epochs. 

The table below lists reference running times for different machine configurations.  

|QUICK_RUN|Machine Configurations|Running time|
|:---------|:----------------------|:------------|
|True|4 **CPU**s, 14GB memory| ~2 minutes|
|False|4 **CPU**s, 14GB memory| ~1.5 hours|
|True|1 NVIDIA Tesla K80 GPU, 12GB GPU memory| ~1 minute|
|False|1 NVIDIA Tesla K80 GPU, 12GB GPU memory| ~7 minutes|

If you run into a CUDA out-of-memory error or the Jupyter kernel keeps dying, try reducing `BATCH_SIZE` and `MAX_SEQ_LENGTH`. Note that this can reduce model performance. 

In [None]:
## Set QUICK_RUN = True to run the notebook on a small subset of data and a smaller number of epochs.
QUICK_RUN = False

## Summary

This notebook demonstrates how to fine-tune a [pretrained Transformer model](https://github.com/huggingface/transformers) for the named entity recognition (NER) task. Utility functions and classes in the NLP Best Practices repo are used for data preprocessing, model training, model scoring, and model evaluation. 

This notebook uses a pretrained transformer with the [BERT (Bidirectional Encoder Representations from Transformers)](https://arxiv.org/pdf/1810.04805.pdf) architecture. [BERT](https://arxiv.org/pdf/1810.04805.pdf) is a powerful pretrained language model that can be used for multiple NLP tasks, including text classification, question answering, and named entity recognition. It can achieve state-of-the-art performance with only a few epochs of fine-tuning on task-specific datasets.

The figure below illustrates how BERT can be fine-tuned for NER tasks. The input is a list of tokens representing a sentence. In the training data, each token has an entity label. After fine-tuning, the model predicts an entity label for each token in a given test sentence. 

<img src="https://nlpbp.blob.core.windows.net/images/bert_architecture.png">
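To make the input and output format concrete, the sketch below represents one sentence as parallel token and label lists and groups the labeled tokens back into entities. The sentence, the `extract_entities` helper, and the tag names (`I-PER`, `I-ORG`, `I-LOC`, `O`) are assumptions made for illustration; they are not taken from the wikigold data or from the repo utilities.

```python
# One sentence as parallel lists: tokens[i] is labeled by labels[i].
# The sentence and tags are illustrative, not taken from wikigold.
tokens = ["Satya", "Nadella", "works", "at", "Microsoft", "in", "Redmond", "."]
labels = ["I-PER", "I-PER", "O", "O", "I-ORG", "O", "I-LOC", "O"]

assert len(tokens) == len(labels)


def extract_entities(tokens, labels):
    """Group consecutive tokens that share the same non-"O" entity type.

    Simplification: adjacent entities of the same type are merged,
    because only the type after the "-" is compared.
    """
    entities = []
    current_tokens, current_type = [], None
    for token, label in zip(tokens, labels):
        entity_type = None if label == "O" else label.split("-", 1)[1]
        # Close the running entity when the type changes.
        if entity_type != current_type and current_tokens:
            entities.append((" ".join(current_tokens), current_type))
            current_tokens = []
        if entity_type is not None:
            current_tokens.append(token)
        current_type = entity_type
    if current_tokens:
        entities.append((" ".join(current_tokens), current_type))
    return entities


entities = extract_entities(tokens, labels)
print(entities)
```

During fine-tuning the model only sees the per-token labels; grouping tokens into entity spans like this is a reading aid for the predictions.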

## Preparation

In [None]:
import sys
import os
import scrapbook as sb
import torch

from tempfile import TemporaryDirectory
from utils_nlp.dataset import wikigold
from utils_nlp.common.timer import Timer
from seqeval.metrics import classification_report
from utils_nlp.models.transformers.named_entity_recognition import TokenClassifier

## Configuration

In [None]:
# fraction of the dataset used for testing
TEST_DATA_FRACTION = 0.3

# sub-sampling ratio for training
TRAIN_SAMPLE_RATIO = 1

# sub-sampling ratio for testing
TEST_SAMPLE_RATIO = 1

NUM_TRAIN_EPOCHS = 5

# update variables for quick run option
if QUICK_RUN:
    TRAIN_SAMPLE_RATIO = 0.1
    TEST_SAMPLE_RATIO = 0.1
    NUM_TRAIN_EPOCHS = 1

# the data path used to save the downloaded data file
DATA_PATH = TemporaryDirectory().name

# the cache data path used during fine tuning
CACHE_DIR = TemporaryDirectory().name

# set random seeds
RANDOM_SEED = 100
torch.manual_seed(RANDOM_SEED)

# model configurations
MODEL_NAME = "bert-base-cased"
DO_LOWER_CASE = False
MAX_SEQ_LENGTH = 200
TRAILING_PIECE_TAG = "X"
DEVICE = "cuda" if torch.cuda.is_available() else "cpu"

if torch.cuda.is_available():
    BATCH_SIZE = 16
else:
    BATCH_SIZE = 8


## Get Training & Testing Dataset

The dataset used in this notebook is the [wikigold dataset](https://www.aclweb.org/anthology/W09-3302). It consists of 145 manually labelled Wikipedia articles, with 1841 sentences and 40k tokens in total. The dataset can be downloaded directly from [here](https://github.com/juand-r/entity-recognition-datasets/tree/master/data/wikigold). 

The helper function `load_dataset` downloads the raw wikigold data, splits it into training and testing datasets (sub-sampling each if its sampling ratio is smaller than 1.0), and then processes it for the transformer model. Everything is done in one function call, and you can use the resulting PyTorch training and testing data to fine-tune the model and evaluate its performance.
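The split and sub-sampling step can be pictured with the minimal sketch below. It is not the implementation inside `wikigold.load_dataset`; the `split_and_subsample` helper and the placeholder sentences are hypothetical, and only the idea (seeded shuffle, split by test fraction, keep a fraction of each split) follows the description above.

```python
import random


def split_and_subsample(sentences, test_fraction, train_ratio, test_ratio, seed):
    """Shuffle, split into train/test, then keep a fraction of each split."""
    rng = random.Random(seed)  # seeded so the split is reproducible
    shuffled = list(sentences)
    rng.shuffle(shuffled)

    n_test = int(len(shuffled) * test_fraction)
    test, train = shuffled[:n_test], shuffled[n_test:]

    # Keep only a fraction of each split when the ratio is below 1.0.
    train = train[: int(len(train) * train_ratio)]
    test = test[: int(len(test) * test_ratio)]
    return train, test


# Placeholder data standing in for the parsed sentences.
sentences = [f"sentence-{i}" for i in range(100)]
train, test = split_and_subsample(
    sentences, test_fraction=0.3, train_ratio=0.1, test_ratio=0.1, seed=100
)
print(len(train), len(test))
```

With `QUICK_RUN = True`, both sampling ratios in this notebook are 0.1, which is why the quick run touches only a small part of the data.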

In [None]:
train_dataloader, test_dataloader, label_map, test_dataset = wikigold.load_dataset(
    local_path=DATA_PATH,
    test_fraction=TEST_DATA_FRACTION,
    random_seed=RANDOM_SEED,
    train_sample_ratio=TRAIN_SAMPLE_RATIO,
    test_sample_ratio=TEST_SAMPLE_RATIO,
    model_name=MODEL_NAME,
    to_lower=DO_LOWER_CASE,
    cache_dir=CACHE_DIR,
    max_len=MAX_SEQ_LENGTH,
    trailing_piece_tag=TRAILING_PIECE_TAG,
    batch_size=BATCH_SIZE,
    num_gpus=None
)
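The `trailing_piece_tag` and `max_len` arguments relate to how word-level labels are carried over to sub-word pieces. The sketch below shows one common scheme under stated assumptions: `toy_wordpiece` is a hypothetical stand-in for a real WordPiece tokenizer (it simply cuts words into fixed-size chunks), and `align_labels` is an illustrative helper, not the function used by `load_dataset`. Special tokens such as `[CLS]` and `[SEP]` are left out for brevity.

```python
def toy_wordpiece(word, piece_len=4):
    """Hypothetical stand-in for a WordPiece tokenizer: fixed-size chunks."""
    pieces = [word[i:i + piece_len] for i in range(0, len(word), piece_len)]
    return [pieces[0]] + ["##" + p for p in pieces[1:]]


def align_labels(words, labels, trailing_piece_tag="X", max_len=200):
    """Expand word-level labels to piece level and truncate to max_len."""
    pieces, piece_labels = [], []
    for word, label in zip(words, labels):
        word_pieces = toy_wordpiece(word)
        pieces.extend(word_pieces)
        # The first piece keeps the word's label; the rest get the trailing tag.
        piece_labels.extend([label] + [trailing_piece_tag] * (len(word_pieces) - 1))
    return pieces[:max_len], piece_labels[:max_len]


pieces, piece_labels = align_labels(["Redmond", "is", "rainy"], ["I-LOC", "O", "O"])
print(list(zip(pieces, piece_labels)))
```

Marking trailing pieces with a dedicated tag makes it possible to drop them later, so that evaluation is done once per original word rather than once per piece.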

## Train Model

There are two steps to train an NER model using a pretrained transformer model: 1) instantiate a `TokenClassifier`, which is a wrapper around the transformer using the BERT architecture, and 2) fit the model using the preprocessed training dataset. The `fit` method of the `TokenClassifier` class is used to fine-tune the model.

In [None]:
# Instantiate a TokenClassifier class for NER using pretrained transformer model
model = TokenClassifier(
    model_name=MODEL_NAME,
    num_labels=len(label_map),
    cache_dir=CACHE_DIR
)

# Fine tune the model using the training dataset
with Timer() as t:
    model.fit(
        train_dataloader=train_dataloader,
        num_epochs=NUM_TRAIN_EPOCHS,
        num_gpus=None,
        local_rank=-1,
        weight_decay=0.0,
        learning_rate=5e-5,
        adam_epsilon=1e-8,
        warmup_steps=0,
        verbose=True,
        seed=RANDOM_SEED
    )

print("Training time : {:.3f} hrs".format(t.interval / 3600))


## Evaluate on Testing Dataset

The `predict` method of `TokenClassifier` returns a NumPy ndarray of raw predictions with shape \[`number_of_examples`, `sequence_length`, `number_of_labels`\]. The values are not normalized, so post-processing is needed to get the probability of each class label. The `get_predicted_token_labels` method processes the raw predictions and outputs the predicted label for each token.

In [None]:
with Timer() as t:
    preds = model.predict(
        eval_dataloader=test_dataloader,
        num_gpus=None,
        verbose=True
    )

print("Prediction time : {:.3f} hrs".format(t.interval / 3600))
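As an illustration of the post-processing mentioned above, the sketch below applies a softmax over the last axis of an array shaped like the raw predictions and takes the argmax to get one label id per token. The `raw` array holds made-up values standing in for `preds`, and the `softmax` helper is written here for illustration; this is not the code inside `get_predicted_token_labels`.

```python
import numpy as np


def softmax(logits):
    """Numerically stable softmax over the last axis."""
    shifted = logits - logits.max(axis=-1, keepdims=True)
    exp = np.exp(shifted)
    return exp / exp.sum(axis=-1, keepdims=True)


# Stand-in for `preds`: 1 example, 3 tokens, 2 labels (made-up values).
raw = np.array([[[2.0, 0.0], [0.0, 0.0], [-1.0, 3.0]]])

probs = softmax(raw)              # same shape as raw; last axis sums to 1
label_ids = raw.argmax(axis=-1)   # shape [number_of_examples, sequence_length]

print(probs.shape, label_ids.shape)
```

Because softmax preserves ordering, taking the argmax of the raw values gives the same label ids as taking it on the probabilities; the softmax is only needed when the probabilities themselves are of interest.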

Get the true token labels of the testing dataset. 

In [None]:
true_labels = model.get_true_test_labels(label_map=label_map, dataset=test_dataset)

Get the predicted label for each token by calling the `get_predicted_token_labels` method, then generate the classification report.

In [None]:
predicted_labels = model.get_predicted_token_labels(
    predictions=preds,
    label_map=label_map,
    dataset=test_dataset
)

report = classification_report(
    true_labels,
    predicted_labels,
    digits=2
)

print(report)

## For Testing

In [None]:
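# Take the second-to-last line of the report and split it on whitespace;
# indices 2-4 are then read as precision, recall, and F1.
report_splits = report.split('\n')[-2].split()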

sb.glue("precision", float(report_splits[2]))
sb.glue("recall", float(report_splits[3]))
sb.glue("f1", float(report_splits[4]))