# Named Entity Recognition Using BERT
## Summary
This notebook demonstrates how to fine tune [pretrained BERT model](https://github.com/huggingface/pytorch-pretrained-BERT) for token level named entity recognition (NER) task. A few utility functions and classes in the NLP Best Practices repo are used to facilitate data preprocessing, model training, and model evaluation. 

[BERT (Bidirectional Transformers forLanguage Understanding)](https://arxiv.org/pdf/1810.04805.pdf) is a powerful pre-trained lanaguage model that can be used for multiple NLP tasks, including text classification, question answering, named entity recognition. It's able to achieve state of the art performance with only a few epochs of fine tuning.  
The figure below illustrates how BERT can be fine tuned for NER tasks. The input data is a list of tokens representing a sentence. In the training data, each token has an entity label. After fine tuning, the model predicts an entity label for each token of a given sentence in the testing data. 

<img src="https://nlpbp.blob.core.windows.net/images/bert_architecture.png">

## Required packages
* pytorch
* pytorch-pretrained-bert
* pandas
* seqeval

In [1]:
import sys
import os
import yaml
import pprint
import random
from seqeval.metrics import f1_score, classification_report

import torch
from torch.optim import Adam

from pytorch_pretrained_bert.tokenization import BertTokenizer

nlp_path = os.path.abspath('../../')
if nlp_path not in sys.path:
    sys.path.insert(0, nlp_path)

from utils_nlp.bert.token_classification import BertTokenClassifier, postprocess_token_labels
from utils_nlp.bert.common_ner import create_data_loader, Language, Tokenizer
from utils_nlp.dataset.wikigold import download, read_data, get_train_test_data, get_unique_labels

## Configurations

In [2]:
# path configurations
data_dir = "./data"
data_file = "./data/wikigold.conll.txt"
cache_dir="."

# set random seeds
random_seed = 10
random.seed(random_seed)
torch.manual_seed(random_seed)

# model configurations
language = Language.ENGLISHLARGE
do_lower_case = True
max_seq_length = 150

# training configurations
device="gpu"
batch_size = 32
num_train_epochs = 5

# optimizer configurations
learning_rate = 3e-5

## Preprocess Data

### Get training and testing dataset. 
The dataset used in this notebook is the [wikigold dataset](https://www.aclweb.org/anthology/W09-3302). The wikigold dataset consists of 145 mannually labelled Wikipedia articles, including 1841 sentences and 40k tokens in total. The dataset can be directly downloaded from [here](https://github.com/juand-r/entity-recognition-datasets/tree/master/data/wikigold). The `download` function downloads the data file to a user-specified directory.  

The helper function `get_train_test_data` splits the dataset into training and testing sets according to `test_percentage`. Because this is a relatively small dataset, we set `test_percentage` to 0.5 in order to have enough data for model evaluation. Running this notebook multiple times with different random seeds produces similar results.   

The helper function `get_unique_labels` returns the unique entity labels in the dataset. There are 5 unique labels in the   original dataset: 'O' (non-entity), 'I-LOC' (location), 'I-MISC' (miscellaneous), 'I-PER' (person), and 'I-ORG' (organization). An 'X' label is added for the trailing word pieces generated by BERT, because BERT uses WordPiece tokenizer.  

The maximum number of words in a sentence is 144, so we set `max_seq_length` to 150 above, because the number of tokens will grow after WordPiece tokenization. Ideally, it would be better to set `max_seq_length` to even larger, but 150 allows model training to be done on a single NVIDIA Tesla K80 GPU.

In [3]:
download(data_dir)
wikigold_text = read_data(data_file)
train_text, train_labels, test_text, test_labels = get_train_test_data(wikigold_text, test_percentage=0.5)
label_list = get_unique_labels()
print('\nUnique entity labels: \n{}\n'.format(label_list))
print('Sample sentence: \n{}\n'.format(train_text[0]))
print('Sample sentence labels: \n{}\n'.format(train_labels[0]))

Maximum sequence length in training data is: 89
Maximum sequence length in testing data is: 144

Unique entity labels: 
['O', 'I-LOC', 'I-MISC', 'I-PER', 'I-ORG', 'X']

Sample sentence: 
It also launched Power98FM 8 months later .

Sample sentence labels: 
['O', 'O', 'O', 'I-ORG', 'O', 'O', 'O', 'O']



### Tokenization and Preprocessing
The `preprocess_ner_tokens` method of the `Tokenizer` class converts raw string data to numerical features, involving the following steps:
1. WordPiece tokenization.
2. Convert tokens and labels to numerical values, i.e. token ids and label ids.
3. Sequence padding or truncation according to the `max_seq_length` configuration.

**Create a dictionary that maps labels to numerical values**

In [4]:
label_map = {label: i for i, label in enumerate(label_list)}

**Create a tokenizer**

In [5]:
tokenizer = Tokenizer(language=language, 
                      to_lower=do_lower_case, 
                      cache_dir=cache_dir)

**Create numerical features**  
Note there is an argument called `trailing_piece_tag`. BERT uses a WordPiece tokenizer which breaks down some words into multiple tokens, e.g. "playing" is tokenized into "play" and "##ing". Since the input data only come with one token label for "playing", within `prerocess_ner_tokens`, the original token label is assigned to the first token "play" and the second token "##ing" is labeled as "X". By default, `trailing_piece_tag` is set to "X". If "X" already exists in your data, you can set `trailing_piece_tag` to another value that doesn't exist in your data. 

In [6]:
train_token_ids, train_input_mask, train_trailing_token_mask, train_label_ids = \
    tokenizer.preprocess_ner_tokens(text=train_text,
                                    label_map=label_map,
                                    max_seq_length=max_seq_length,
                                    labels=train_labels,
                                    trailing_piece_tag="X")
test_token_ids, test_input_mask, test_trailing_token_mask, test_label_ids = \
    tokenizer.preprocess_ner_tokens(text=test_text,
                                    label_map=label_map,
                                    max_seq_length=max_seq_length,
                                    labels=test_labels,
                                    trailing_piece_tag="X")

`Tokenizer.preprocess_ner_tokens` outputs three or four list of numerical features lists, each sublist contains features of an input sentence: 
1. token ids: list of numerical values each corresponds to a token.
2. attention mask: list of 1s and 0s, 1 for input tokens and 0 for padded tokens, so that padded tokens are not attended to. 
3. trailing word piece mask: boolean list, True for the first word piece of each original word, False for the trailing word pieces, e.g. ##ing. This mask is useful for removing predictions on trailing word pieces, so that each original word in the input text has a unique predicted label. 
4. label ids: list of numerical values each corresponds to an entity label, if `labels` is provided.

In [7]:
print("Sample token ids:\n{}\n".format(train_token_ids[0]))
print("Sample attention mask:\n{}\n".format(train_input_mask[0]))
print("Sample trailing token mask:\n{}\n".format(train_trailing_token_mask[0]))
print("Sample label ids:\n{}\n".format(train_label_ids[0]))

Sample token ids:
[2009, 2036, 3390, 2373, 2683, 2620, 16715, 1022, 2706, 2101, 1012, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0]

Sample attention mask:
[1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0,

## Create Token Classifier

In [8]:
token_classifier = BertTokenClassifier(language=Language.ENGLISH,
                                       num_labels=len(label_list),
                                       cache_dir=cache_dir)

## Train Model

In [9]:
token_classifier.fit(token_ids=train_token_ids, 
                     input_mask=train_input_mask, 
                     labels=train_label_ids,
                     num_epochs=num_train_epochs, 
                     batch_size=batch_size, 
                     learning_rate=learning_rate)

t_total value of -1 results in schedule not being applied
Epoch:   0%|          | 0/5 [00:00<?, ?it/s]
Iteration:   0%|          | 0/29 [00:00<?, ?it/s][A




Iteration:  59%|█████▊    | 17/29 [00:31<00:22,  1.87s/it][A
Iteration:  59%|█████▊    | 17/29 [00:49<00:22,  1.87s/it][A
Epoch:  20%|██        | 1/5 [00:54<03:36, 54.24s/it]7s/it][A
Iteration:   0%|          | 0/29 [00:00<?, ?it/s][A

Train loss: 0.7232020918665261



Iteration:  55%|█████▌    | 16/29 [00:30<00:24,  1.89s/it][A
Iteration:  55%|█████▌    | 16/29 [00:45<00:24,  1.89s/it][A
Epoch:  40%|████      | 2/5 [01:48<02:43, 54.34s/it]8s/it][A
Iteration:   0%|          | 0/29 [00:00<?, ?it/s][A

Train loss: 0.19114394537333784



Iteration:  55%|█████▌    | 16/29 [00:30<00:24,  1.90s/it][A
Iteration:  55%|█████▌    | 16/29 [00:41<00:24,  1.90s/it][A
Epoch:  60%|██████    | 3/5 [02:43<01:49, 54.50s/it]9s/it][A
Iteration:   0%|          | 0/29 [00:00<?, ?it/s][A

Train loss: 0.08570333865695987



Iteration:  55%|█████▌    | 16/29 [00:30<00:24,  1.90s/it][A
Iteration:  55%|█████▌    | 16/29 [00:46<00:24,  1.90s/it][A
Epoch:  80%|████████  | 4/5 [03:38<00:54, 54.62s/it]9s/it][A
Iteration:   0%|          | 0/29 [00:00<?, ?it/s][A

Train loss: 0.04018309917943231



Iteration:  55%|█████▌    | 16/29 [00:30<00:24,  1.91s/it][A
Iteration:  55%|█████▌    | 16/29 [00:41<00:24,  1.91s/it][A
Epoch: 100%|██████████| 5/5 [04:33<00:00, 54.74s/it]0s/it][A

Train loss: 0.024413088625618095





## Predict on Test Data

In [10]:
pred_label_ids = token_classifier.predict(token_ids=test_token_ids, 
                                          input_mask=test_input_mask, 
                                          labels=test_label_ids, 
                                          batch_size=batch_size)

Iteration: 100%|██████████| 29/29 [00:19<00:00,  1.51it/s]

Evaluation loss: 0.14181747210436854





## Evaluate Model
The `predict` method of the token classifier outputs label ids for all tokens, including the padded tokens. `postprocess_token_labels` is a helper function that removes the predictions on padded tokens. If a `label_map` is provided, it maps the numerical label ids back to original token labels which are usually string type. 

In [11]:
pred_tags_no_padding = postprocess_token_labels(pred_label_ids, 
                                                test_input_mask, 
                                                label_map)
true_tags_no_padding =  postprocess_token_labels(test_label_ids, 
                                                 test_input_mask, 
                                                 label_map)
print(classification_report(true_tags_no_padding, pred_tags_no_padding, digits=2))

           precision    recall  f1-score   support

     MISC       0.62      0.74      0.68       406
      PER       0.95      0.91      0.93       583
      ORG       0.75      0.75      0.75       549
        X       0.95      0.98      0.97      1728
      LOC       0.85      0.83      0.84       550

micro avg       0.87      0.89      0.88      3816
macro avg       0.87      0.89      0.88      3816



`postprocess_token_labels` also provides an option to remove the predictions on trailing word pieces, e.g. ##ing, so that the final predicted labels correspond to the original words in the input text. The `trailing_token_mask` is obtained from `tokenizer.preprocess_ner_tokens`

In [12]:
pred_tags_no_padding_no_trailing = postprocess_token_labels(pred_label_ids, 
                                                            test_input_mask, 
                                                            label_map, 
                                                            remove_trailing_word_pieces=True, 
                                                            trailing_token_mask=test_trailing_token_mask)
true_tags_no_padding_no_trailing = postprocess_token_labels(test_label_ids, 
                                                            test_input_mask, 
                                                            label_map, 
                                                            remove_trailing_word_pieces=True, 
                                                            trailing_token_mask=test_trailing_token_mask)
print(classification_report(true_tags_no_padding_no_trailing, pred_tags_no_padding_no_trailing, digits=2))

           precision    recall  f1-score   support

     MISC       0.59      0.71      0.65       358
      PER       0.94      0.90      0.92       501
      ORG       0.71      0.72      0.71       468
      LOC       0.83      0.82      0.82       505

micro avg       0.75      0.80      0.77      1832
macro avg       0.78      0.80      0.79      1832



We can see that the metrics are worse after exluding trailing word pieces, because they are easy to predict. 