# Named Entity Recognition Using BERT
## Summary
This notebook demonstrates how to fine-tune a [pretrained BERT model](https://github.com/huggingface/pytorch-pretrained-BERT) for a token-level named entity recognition (NER) task. A few utility functions and classes in the NLP Best Practices repo are used to facilitate data preprocessing, model training, and model evaluation.

[BERT (Bidirectional Transformers for Language Understanding)](https://arxiv.org/pdf/1810.04805.pdf) is a powerful pre-trained language model that can be used for multiple NLP tasks, including text classification, question answering, and named entity recognition. It can achieve state-of-the-art performance with only a few epochs of fine-tuning.  
The figure below illustrates how BERT can be fine-tuned for NER tasks. The input data is a list of tokens representing a sentence. In the training data, each token has an entity label. After fine-tuning, the model predicts an entity label for each token of a given sentence in the testing data.

![](bert_architecture.png)

### Required packages
* pytorch
* pytorch-pretrained-bert
* pandas
* seqeval

In [1]:
import sys
import os
import yaml
import pprint
import random
from seqeval.metrics import f1_score

import torch
from torch.optim import Adam

from pytorch_pretrained_bert.tokenization import BertTokenizer

bert_utils_path = os.path.abspath('../../utils_nlp/bert')
if bert_utils_path not in sys.path:
    sys.path.insert(0, bert_utils_path)

from bert_data_utils import KaggleNERProcessor
from token_classification import BertTokenClassifier, postprocess_token_labels

from common_ner import create_data_loader, Language, Tokenizer

In [2]:
ner_data_dir = "./data/NER/ner_dataset.csv"
cache_dir="."
random_seed = 42
random.seed(random_seed)
torch.manual_seed(random_seed)

<torch._C.Generator at 0x7f9d9c1489f0>

## Configurations

In [3]:
# model configurations
language = Language.ENGLISH
do_lower_case = True
max_seq_length = 75

# training configurations
device="gpu"
batch_size = 32
num_train_epochs = 2

# optimizer configurations
learning_rate = 3e-5
clip_gradient = True
max_gradient_norm = 1.0

## Preprocess Data

### Create training and validation examples
`KaggleNERProcessor` is a dataset-specific class that splits the whole dataset into training and validation datasets according to `dev_percentage`. The `get_train_examples` and `get_dev_examples` methods return the training and validation datasets respectively. The `get_labels` method returns a list of all unique labels.
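
To make the role of `dev_percentage` concrete, here is a minimal sketch of a random train/validation split. The helper `split_train_dev` and the `toy_*` variables are hypothetical names used only for illustration; the actual splitting logic inside `KaggleNERProcessor` may differ.

```python
import random


def split_train_dev(sentences, labels, dev_percentage=0.1, seed=42):
    """Hypothetical helper: shuffle sentence indices and hold out a dev fraction."""
    indices = list(range(len(sentences)))
    random.Random(seed).shuffle(indices)
    dev_size = int(len(indices) * dev_percentage)
    dev_idx, train_idx = indices[:dev_size], indices[dev_size:]
    train = ([sentences[i] for i in train_idx], [labels[i] for i in train_idx])
    dev = ([sentences[i] for i in dev_idx], [labels[i] for i in dev_idx])
    return train, dev


# Toy data: 20 two-word sentences, every token labeled "O"
toy_sentences = ["sentence {}".format(i) for i in range(20)]
toy_labels = [["O", "O"] for _ in range(20)]

(toy_train_text, toy_train_labels), (toy_dev_text, toy_dev_labels) = split_train_dev(
    toy_sentences, toy_labels, dev_percentage=0.1
)
print(len(toy_train_text), len(toy_dev_text))
```

With 20 sentences and `dev_percentage=0.1`, two sentences are held out for validation and the remaining 18 are used for training.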

In [4]:
kaggle_ner_processor = KaggleNERProcessor(data_dir=ner_data_dir, dev_percentage = 0.1)

In [5]:
train_text, train_labels = kaggle_ner_processor.get_train_examples()
dev_text, dev_labels = kaggle_ner_processor.get_dev_examples()
label_list = kaggle_ner_processor.get_labels()
print(label_list)

['B-nat', 'I-per', 'B-art', 'B-gpe', 'I-art', 'I-eve', 'B-eve', 'I-gpe', 'I-tim', 'B-per', 'I-nat', 'B-geo', 'I-org', 'B-org', 'O', 'I-geo', 'B-tim', 'X']


`KaggleNERProcessor` generates training and evaluation examples of type `BertInputData`. `BertInputData` is a `namedtuple` with the following three fields:
* text_a: text string of the first sentence.
* text_b: text string of the second sentence. This is only required for two-sentence tasks.
* label: required for training and validation data.

In [6]:
print('Sample sentence: \n{}\n'.format(train_text[0]))
print('Sample sentence labels: \n{}\n'.format(train_labels[0]))

Sample sentence: 
Thousands of demonstrators have marched through London to protest the war in Iraq and demand the withdrawal of British troops from that country .

Sample sentence labels: 
['O', 'O', 'O', 'O', 'O', 'O', 'B-geo', 'O', 'O', 'O', 'O', 'O', 'B-geo', 'O', 'O', 'O', 'O', 'O', 'B-gpe', 'O', 'O', 'O', 'O', 'O']



### Convert raw input to numerical features
The `preprocess_ner_tokens` method of the tokenizer converts raw string data to numerical features in the following steps:
1. Tokenization.
2. Convert tokens and labels to numerical values, i.e. token ids and label ids.
3. Sequence padding or truncation according to the `max_seq_length` configuration.
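
The three steps above can be sketched in plain Python as follows. This is only an illustration under stated assumptions: `toy_preprocess` and `toy_vocab` are hypothetical, whitespace splitting stands in for WordPiece tokenization, and padding both token ids and label ids with 0 is an assumption rather than the confirmed behavior of `preprocess_ner_tokens`.

```python
def toy_preprocess(sentence, labels, vocab, label_map, max_seq_length):
    """Hypothetical sketch of tokenization, id conversion, and padding/truncation."""
    # Step 1: tokenization (whitespace split stands in for WordPiece)
    tokens = sentence.lower().split()

    # Step 2: convert tokens and labels to numerical ids
    token_ids = [vocab.get(tok, vocab["[UNK]"]) for tok in tokens]
    label_ids = [label_map[lab] for lab in labels]

    # Step 3: truncate to max_seq_length, then pad with zeros (assumed pad value)
    token_ids = token_ids[:max_seq_length]
    label_ids = label_ids[:max_seq_length]
    pad = max_seq_length - len(token_ids)
    input_mask = [1] * len(token_ids) + [0] * pad
    token_ids = token_ids + [0] * pad
    label_ids = label_ids + [0] * pad
    return token_ids, input_mask, label_ids


toy_vocab = {"[PAD]": 0, "[UNK]": 1, "london": 2, "is": 3, "big": 4}
toy_label_map = {"O": 0, "B-geo": 1}

toy_ids, toy_mask, toy_label_ids = toy_preprocess(
    "London is big", ["B-geo", "O", "O"], toy_vocab, toy_label_map, max_seq_length=5
)
print(toy_ids)        # [2, 3, 4, 0, 0]
print(toy_mask)       # [1, 1, 1, 0, 0]
print(toy_label_ids)  # [1, 0, 0, 0, 0]
```

Because the padded label ids share the value 0 with a real label in this sketch, the attention mask is what distinguishes real tokens from padding.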

**Create a dictionary that maps labels to numerical values**

In [7]:
label_map = {label: i for i, label in enumerate(label_list)}

**Create a tokenizer**

In [8]:
tokenizer = Tokenizer(language=language, 
                      to_lower=do_lower_case, 
                      cache_dir=cache_dir)

**Create numerical features**  
Note the `trailing_piece_tag` argument. BERT uses a WordPiece tokenizer, which breaks some words into multiple tokens, e.g. "playing" is tokenized into "play" and "##ing". Since the input data come with only one label for "playing", within `create_token_feature_dataset` the original label is assigned to the first token "play" and the second token "##ing" is labeled "X". By default, `trailing_piece_tag` is set to "X". If "X" already exists as a label in your data, set `trailing_piece_tag` to a value that doesn't appear in your data.
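
The label alignment described above can be sketched as follows. Both `toy_wordpiece` and `align_labels` are hypothetical helpers written for this illustration; `toy_wordpiece` only splits off a trailing "ing" and is not the real WordPiece algorithm.

```python
def toy_wordpiece(word):
    """Hypothetical stand-in for WordPiece: split off a trailing 'ing' suffix."""
    if word.endswith("ing") and len(word) > 3:
        return [word[:-3], "##ing"]
    return [word]


def align_labels(words, labels, trailing_piece_tag="X"):
    """Give the first piece the original label and every trailing piece the tag."""
    pieces, piece_labels = [], []
    for word, label in zip(words, labels):
        sub_tokens = toy_wordpiece(word)
        pieces.extend(sub_tokens)
        piece_labels.extend([label] + [trailing_piece_tag] * (len(sub_tokens) - 1))
    return pieces, piece_labels


toy_pieces, toy_piece_labels = align_labels(
    ["john", "is", "playing"], ["B-per", "O", "O"]
)
print(toy_pieces)        # ['john', 'is', 'play', '##ing']
print(toy_piece_labels)  # ['B-per', 'O', 'O', 'X']
```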

In [9]:
train_token_ids, train_input_mask, train_trailing_token_mask, train_label_ids = \
    tokenizer.preprocess_ner_tokens(text=train_text,
                                    label_map=label_map,
                                    max_seq_length=max_seq_length,
                                    labels=train_labels,
                                    trailing_piece_tag="X")
dev_token_ids, dev_input_mask, dev_trailing_token_mask, dev_label_ids = \
    tokenizer.preprocess_ner_tokens(text=dev_text,
                                    label_map=label_map,
                                    max_seq_length=max_seq_length,
                                    labels=dev_labels,
                                    trailing_piece_tag="X")

`Tokenizer.preprocess_ner_tokens` outputs four lists of numerical features: 
1. token ids: list of numerical values, each corresponding to a token.
2. attention mask: list of 1s and 0s, 1 for input tokens and 0 for padded tokens, so that padded tokens are not attended to. 
3. trailing word piece mask: boolean list, True for the first word piece of each original word and False for trailing word pieces, e.g. ##ing. This mask is useful for removing the predictions on trailing word pieces, so that each original word in the input text has a unique predicted label. 
4. label ids: list of numerical values, each corresponding to an entity label. 
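
As a rough illustration of how the two masks relate to a tokenized sentence, the sketch below builds both from a list of word pieces. `build_masks` is a hypothetical helper, and marking padded positions as True in the trailing mask is an assumption made for this example.

```python
def build_masks(pieces, max_seq_length):
    """Hypothetical sketch: build the attention mask and trailing word piece mask."""
    pieces = pieces[:max_seq_length]
    pad = max_seq_length - len(pieces)
    # 1 for real tokens, 0 for padded positions
    input_mask = [1] * len(pieces) + [0] * pad
    # False only for trailing word pieces such as "##ing";
    # padded positions are set to True here (an assumption)
    trailing_token_mask = [not p.startswith("##") for p in pieces] + [True] * pad
    return input_mask, trailing_token_mask


toy_input_mask, toy_trailing_mask = build_masks(
    ["john", "is", "play", "##ing"], max_seq_length=6
)
print(toy_input_mask)     # [1, 1, 1, 1, 0, 0]
print(toy_trailing_mask)  # [True, True, True, False, True, True]
```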

In [10]:
print("Sample token ids:\n{}\n".format(train_token_ids[0]))
print("Sample attention mask:\n{}\n".format(train_input_mask[0]))
print("Sample trailing token mask:\n{}\n".format(train_trailing_token_mask[0]))
print("Sample label ids:\n{}\n".format(train_label_ids[0]))

Sample token ids:
[5190, 1997, 28337, 2031, 9847, 2083, 2414, 2000, 6186, 1996, 2162, 1999, 5712, 1998, 5157, 1996, 10534, 1997, 2329, 3629, 2013, 2008, 2406, 1012, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0]

Sample attention mask:
[1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0]

Sample trailing token mask:
[True, True, True, True, True, True, True, True, True, True, True, True, True, True, True, True, True, True, True, True, True, True, True, True, True, 

## Create Token Classifier

In [11]:
token_classifier = BertTokenClassifier(language=language,
                                       num_labels=len(label_list),
                                       cache_dir=cache_dir)

## Train Model

In [12]:
token_classifier.fit(token_ids=train_token_ids, 
                     input_mask=train_input_mask, 
                     labels=train_label_ids,
                     use_gpu=True,
                     num_epochs=num_train_epochs, 
                     batch_size=batch_size, 
                     learning_rate=learning_rate,
                     clip_gradient=clip_gradient)

t_total value of -1 results in schedule not being applied

Train loss: 0.1369325082205109




Train loss: 0.07744894394467414





## Predict on Test Data

In [25]:
pred_label_ids = token_classifier.predict(token_ids=dev_token_ids, 
                                          input_mask=dev_input_mask, 
                                          labels=dev_label_ids, 
                                          batch_size=batch_size)

Iteration: 100%|██████████| 150/150 [00:50<00:00,  3.00it/s]

Evaluation loss: 0.0871064007282257





## Evaluate Model
The `predict` method of the token classifier outputs label ids for all tokens, including the padded tokens. `postprocess_token_labels` is a helper function that removes the predictions on padded tokens. If a `label_map` is provided, it maps the numerical label ids back to the original token labels, which are usually strings.
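
The padding removal and id-to-label mapping can be pictured with the following minimal sketch. `strip_padding` is a hypothetical reimplementation for illustration only, not the actual code of `postprocess_token_labels`.

```python
def strip_padding(label_ids, input_mask, label_map=None):
    """Hypothetical sketch: drop padded positions and optionally map ids to labels."""
    # Invert the label -> id dictionary so ids can be mapped back to labels
    id_to_label = {i: lab for lab, i in label_map.items()} if label_map else None
    result = []
    for ids, mask in zip(label_ids, input_mask):
        kept = [i for i, m in zip(ids, mask) if m]
        if id_to_label is not None:
            kept = [id_to_label[i] for i in kept]
        result.append(kept)
    return result


toy_label_map = {"O": 0, "B-geo": 1, "X": 2}
toy_pred_ids = [[1, 0, 2, 0, 0]]
toy_mask = [[1, 1, 1, 0, 0]]

toy_tags = strip_padding(toy_pred_ids, toy_mask, toy_label_map)
print(toy_tags)  # [['B-geo', 'O', 'X']]
```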

In [26]:
pred_tags_no_padding = postprocess_token_labels(pred_label_ids, 
                                                dev_input_mask, 
                                                label_map)
true_tags_no_padding =  postprocess_token_labels(dev_label_ids, 
                                                 dev_input_mask, 
                                                 label_map)

In [27]:
print("F1 Score: {}".format(f1_score(true_tags_no_padding, pred_tags_no_padding)))

F1 Score: 0.89765070581303


`postprocess_token_labels` also provides an option to remove the predictions on trailing word pieces, e.g. ##ing, so that the final predicted labels correspond to the original words in the input text. The `trailing_token_mask` is returned by `tokenizer.preprocess_ner_tokens`.
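
A minimal sketch of this filtering, assuming the masks have the shapes shown earlier: a label is kept only where the attention mask is 1 and the trailing mask is True. `strip_padding_and_trailing` is a hypothetical helper, not the library function.

```python
def strip_padding_and_trailing(labels, input_mask, trailing_token_mask):
    """Hypothetical sketch: keep labels of real tokens that start an original word."""
    return [
        [lab for lab, m, t in zip(row, mask_row, trail_row) if m and t]
        for row, mask_row, trail_row in zip(labels, input_mask, trailing_token_mask)
    ]


toy_labels = [["B-per", "O", "O", "X", "O", "O"]]
toy_input_mask = [[1, 1, 1, 1, 0, 0]]
toy_trailing_mask = [[True, True, True, False, True, True]]

toy_word_labels = strip_padding_and_trailing(
    toy_labels, toy_input_mask, toy_trailing_mask
)
print(toy_word_labels)  # [['B-per', 'O', 'O']]
```

After filtering, the number of labels per sentence matches the number of original words, here three.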

In [28]:
pred_tags_no_padding_no_trailing = postprocess_token_labels(pred_label_ids, 
                                                            dev_input_mask, 
                                                            label_map, 
                                                            remove_trailing_word_pieces=True, 
                                                            trailing_token_mask=dev_trailing_token_mask)
true_tags_no_padding_no_trailing = postprocess_token_labels(dev_label_ids, 
                                                            dev_input_mask, 
                                                            label_map, 
                                                            remove_trailing_word_pieces=True, 
                                                            trailing_token_mask=dev_trailing_token_mask)

In [29]:
print("F1 Score: {}".format(f1_score(true_tags_no_padding_no_trailing, pred_tags_no_padding_no_trailing)))

F1 Score: 0.8250952274254986


We can see that the F1 score is lower after excluding trailing word pieces, because they are easy to predict.
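
For readers unfamiliar with how an F1 score is computed over entities rather than individual tokens, here is a simplified pure-Python sketch for BIO tags. It is not seqeval's implementation: `extract_entities` and `entity_f1` are hypothetical helpers, and tags without a B- or I- prefix (including "O" and "X") are treated as outside any entity, which may differ from how seqeval handles them.

```python
def extract_entities(tags):
    """Return a set of (type, start, end) spans from one BIO-tagged sentence."""
    entities, start, ent_type = set(), None, None
    for i, tag in enumerate(list(tags) + ["O"]):  # sentinel closes a trailing entity
        prefix, _, label = tag.partition("-")
        continues = prefix == "I" and start is not None and label == ent_type
        if not continues:
            if start is not None:
                entities.add((ent_type, start, i - 1))
            # B- (or a stray I-) opens a new entity; anything else is outside
            start, ent_type = (i, label) if prefix in ("B", "I") else (None, None)
    return entities


def entity_f1(true_tags, pred_tags):
    """Micro-averaged F1 over exactly matching entity spans."""
    n_true = n_pred = n_correct = 0
    for t, p in zip(true_tags, pred_tags):
        true_ents, pred_ents = extract_entities(t), extract_entities(p)
        n_true += len(true_ents)
        n_pred += len(pred_ents)
        n_correct += len(true_ents & pred_ents)
    if n_correct == 0:
        return 0.0
    precision = n_correct / n_pred
    recall = n_correct / n_true
    return 2 * precision * recall / (precision + recall)


toy_true = [["B-geo", "I-geo", "O", "B-per"]]
toy_pred = [["B-geo", "I-geo", "O", "O"]]
toy_f1 = entity_f1(toy_true, toy_pred)
print(round(toy_f1, 4))  # 0.6667
```

In this toy case one of the two true entities is predicted correctly and nothing spurious is predicted, so precision is 1.0, recall is 0.5, and F1 is 2/3.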