# Named Entity Recognition Using BERT
## Summary
This notebook demonstrates how to fine-tune a [pretrained BERT model](https://github.com/huggingface/pytorch-pretrained-BERT) for the token-level named entity recognition (NER) task. A few utility functions and classes in the NLP Best Practices repo are used to facilitate data preprocessing, model training, and model evaluation.

[BERT (Bidirectional Encoder Representations from Transformers)](https://arxiv.org/pdf/1810.04805.pdf) is a powerful pre-trained language model that can be used for multiple NLP tasks, including text classification, question answering, and named entity recognition. It can achieve state-of-the-art performance with only a few epochs of fine-tuning.  
The figure below illustrates how BERT can be fine-tuned for NER tasks. The input data is a list of tokens representing a sentence. In the training data, each token has an entity label. After fine-tuning, the model predicts an entity label for each token of a given sentence in the testing data.

![](bert_architecture.png)

### Required packages
* pytorch
* pytorch-pretrained-bert
* pandas
* seqeval

In [1]:
import sys
import os
import yaml
import pprint
import random
from seqeval.metrics import f1_score

import torch
from torch.optim import Adam

from pytorch_pretrained_bert.tokenization import BertTokenizer

bert_utils_path = os.path.abspath('../../utils_nlp/bert')
if bert_utils_path not in sys.path:
    sys.path.insert(0, bert_utils_path)

from bert_data_utils import KaggleNERProcessor
from bert_utils import BertTokenClassifier, postprocess_token_labels

from common_ner import Tokenizer, create_data_loader, Language

In [2]:
ner_data_dir = "./data/NER/ner_dataset.csv"
cache_dir="."
random_seed = 42
random.seed(random_seed)
torch.manual_seed(random_seed)

<torch._C.Generator at 0x7f27da7229f0>

## Configurations

In [3]:
# model configurations
language = Language.ENGLISH
do_lower_case = True
max_seq_length = 75

# training configurations
device="gpu"
batch_size = 32
num_train_epochs = 2

# optimizer configurations
learning_rate = 3e-5
clip_gradient = True
max_gradient_norm = 1.0
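
The `clip_gradient` and `max_gradient_norm` settings refer to gradient clipping. How `BertTokenClassifier.fit` applies them internally is not shown in this notebook; a common approach is to rescale all gradients whenever their global L2 norm exceeds the threshold (in PyTorch this is what `torch.nn.utils.clip_grad_norm_` does). The following pure-Python sketch illustrates the idea on a flat list of gradient values. The helper name `clip_by_global_norm` is hypothetical.

```python
import math


def clip_by_global_norm(grads, max_norm):
    """Rescale gradient values so their global L2 norm is at most max_norm.

    Hypothetical helper for illustration only; not part of the repo utilities.
    Returns the (possibly rescaled) gradients and the norm before clipping.
    """
    total_norm = math.sqrt(sum(g * g for g in grads))
    if total_norm == 0.0 or total_norm <= max_norm:
        # Already within the limit: leave the gradients unchanged.
        return list(grads), total_norm
    scale = max_norm / total_norm
    return [g * scale for g in grads], total_norm


clipped, norm_before = clip_by_global_norm([3.0, 4.0], max_norm=1.0)
norm_after = math.sqrt(sum(g * g for g in clipped))
print(norm_before, norm_after)
```

Clipping keeps the direction of the update and only limits its size, which helps avoid unstable steps when fine-tuning with a small number of epochs.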

## Preprocess Data

### Create training and validation examples
`KaggleNERProcessor` is a dataset-specific class that splits the whole dataset into training and validation datasets according to `dev_percentage`. The `get_train_examples` and `get_dev_examples` methods return the training and validation datasets respectively. The `get_labels` method returns a list of all unique labels.
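
The internals of `KaggleNERProcessor` are not shown here. As a rough sketch of what a `dev_percentage` split can look like, the hypothetical helper below shuffles sentence indices with a fixed seed and holds out the requested fraction for validation. Whether the real class shuffles, and how it rounds the split size, are assumptions.

```python
import random


def split_train_dev(sentences, labels, dev_percentage, seed=42):
    """Split parallel lists of sentences and label sequences into train/dev.

    Hypothetical illustration; the real KaggleNERProcessor may differ.
    """
    if len(sentences) != len(labels):
        raise ValueError("sentences and labels must have the same length")
    indices = list(range(len(sentences)))
    random.Random(seed).shuffle(indices)  # local RNG, global state untouched
    dev_size = int(len(indices) * dev_percentage)
    dev_idx, train_idx = indices[:dev_size], indices[dev_size:]
    train = ([sentences[i] for i in train_idx], [labels[i] for i in train_idx])
    dev = ([sentences[i] for i in dev_idx], [labels[i] for i in dev_idx])
    return train, dev


toy_sentences = ["sentence {}".format(i) for i in range(10)]
toy_labels = [["O", "O"] for _ in range(10)]
(toy_train_text, toy_train_labels), (toy_dev_text, toy_dev_labels) = split_train_dev(
    toy_sentences, toy_labels, dev_percentage=0.1
)
print(len(toy_train_text), len(toy_dev_text))
```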

In [4]:
kaggle_ner_processor = KaggleNERProcessor(data_dir=ner_data_dir, dev_percentage=0.1)

In [5]:
train_text, train_labels = kaggle_ner_processor.get_train_examples()
dev_text, dev_labels = kaggle_ner_processor.get_dev_examples()
label_list = kaggle_ner_processor.get_labels()
print(label_list)

['B-art', 'I-art', 'I-org', 'B-per', 'B-geo', 'B-tim', 'I-per', 'B-gpe', 'I-tim', 'B-nat', 'I-nat', 'I-geo', 'B-eve', 'B-org', 'O', 'I-eve', 'I-gpe', 'X']


`KaggleNERProcessor` generates training and evaluation examples in `BertInputData` type. `BertInputData` is a `namedtuple` with the following three fields:
* text_a: text string of the first sentence.
* text_b: text string of the second sentence. This is only required for two-sentence tasks.
* label: required for training and validation data.

In [6]:
print('Sample sentence: \n{}\n'.format(train_text[0]))
print('Sample sentence labels: \n{}\n'.format(train_labels[0]))

Sample sentence: 
Thousands of demonstrators have marched through London to protest the war in Iraq and demand the withdrawal of British troops from that country .

Sample sentence labels: 
['O', 'O', 'O', 'O', 'O', 'O', 'B-geo', 'O', 'O', 'O', 'O', 'O', 'B-geo', 'O', 'O', 'O', 'O', 'O', 'B-gpe', 'O', 'O', 'O', 'O', 'O']



### Convert raw input to numerical features
The `preprocess_ner_tokens` method of the tokenizer converts raw string data to numerical features. It involves the following steps:
1. Tokenization.
2. Convert tokens and labels to numerical values, i.e. token ids and label ids.
3. Sequence padding or truncation according to the `max_seq_length` configuration.
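
The three steps above can be sketched in plain Python. This is only an illustration with a toy vocabulary, not the actual `preprocess_ner_tokens` implementation: the function name `to_features`, the vocabulary, and the choice of padding labels with "O" are assumptions (the last one mirrors the sample output in this notebook, where padded positions carry the id of "O").

```python
def to_features(tokens, labels, vocab, label_map, max_seq_length, pad_label="O"):
    """Convert one pre-tokenized sentence into fixed-length numerical features.

    Hypothetical sketch of steps 2 and 3; tokenization is assumed done already.
    """
    # Truncate sequences longer than max_seq_length.
    tokens = tokens[:max_seq_length]
    labels = labels[:max_seq_length]

    # Map tokens and labels to ids; unknown tokens fall back to [UNK].
    token_ids = [vocab.get(tok.lower(), vocab["[UNK]"]) for tok in tokens]
    label_ids = [label_map[lab] for lab in labels]
    input_mask = [1] * len(token_ids)

    # Pad all three lists up to max_seq_length.
    pad = max_seq_length - len(token_ids)
    token_ids += [vocab["[PAD]"]] * pad
    input_mask += [0] * pad
    label_ids += [label_map[pad_label]] * pad
    return token_ids, input_mask, label_ids


toy_vocab = {"[PAD]": 0, "[UNK]": 1, "london": 2, "is": 3, "big": 4}
toy_label_map = {"B-geo": 0, "O": 1}
ids, mask, lab_ids = to_features(
    ["London", "is", "big"], ["B-geo", "O", "O"], toy_vocab, toy_label_map, max_seq_length=5
)
print(ids, mask, lab_ids)
```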

**Create a dictionary that maps labels to numerical values**

In [7]:
label_map = {label: i for i, label in enumerate(label_list)}

**Create a tokenizer**

In [8]:
tokenizer = Tokenizer(language=language, 
                      to_lower=do_lower_case, 
                      cache_dir=cache_dir)

**Create numerical features**  
Note there is an argument called `trailing_piece_tag`. BERT uses a WordPiece tokenizer, which breaks some words into multiple tokens, e.g. "playing" is tokenized into "play" and "##ing". Since the input data only comes with one token label for "playing", within `preprocess_ner_tokens` the original token label is assigned to the first token "play", and the second token "##ing" is labeled as "X". By default, `trailing_piece_tag` is set to "X". If "X" already exists as a label in your data, you can set `trailing_piece_tag` to another value that doesn't exist in your data.
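
The label alignment described above can be sketched as follows. The splitter here is a toy lookup table standing in for a real WordPiece tokenizer, and the helper names `align_labels` and `toy_split` are hypothetical.

```python
def align_labels(words, labels, wordpiece_split, trailing_piece_tag="X"):
    """Expand word-level labels to word-piece level.

    The first piece of each word keeps the word's label; every following
    piece gets trailing_piece_tag. Hypothetical helper for illustration.
    """
    pieces, piece_labels = [], []
    for word, label in zip(words, labels):
        sub_tokens = wordpiece_split(word)
        pieces.extend(sub_tokens)
        piece_labels.extend([label] + [trailing_piece_tag] * (len(sub_tokens) - 1))
    return pieces, piece_labels


# Toy stand-in for WordPiece: only "playing" is split into two pieces.
TOY_SPLITS = {"playing": ["play", "##ing"]}


def toy_split(word):
    return TOY_SPLITS.get(word, [word])


pieces, piece_labels = align_labels(["John", "is", "playing"], ["B-per", "O", "O"], toy_split)
print(pieces)
print(piece_labels)
```

Keeping the pieces and labels the same length is what allows the model to be trained with one label per input token.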

In [9]:
train_token_ids, train_input_mask, train_label_ids = tokenizer.preprocess_ner_tokens(text=train_text,
                                                                                      label_map=label_map,
                                                                                      max_seq_length=max_seq_length,
                                                                                      labels=train_labels,
                                                                                      trailing_piece_tag="X")
dev_token_ids, dev_input_mask, dev_label_ids = tokenizer.preprocess_ner_tokens(text=dev_text,
                                                                               label_map=label_map,
                                                                               max_seq_length=max_seq_length,
                                                                               labels=dev_labels,
                                                                               trailing_piece_tag="X")

`Tokenizer.preprocess_ner_tokens` outputs three lists of numerical features:
1. token ids: numerical values, each corresponding to a token.
2. attention mask: 1 for input tokens and 0 for padded tokens, so that padded tokens are not attended to.
3. label ids: numerical values, each corresponding to an entity label.

In [10]:
print("Sample token id:\n{}\n".format(train_token_ids[0]))
print("Sample attention mask:\n{}\n".format(train_input_mask[0]))
print("Sample label ids:\n{}\n".format(train_label_ids[0]))

Sample token id:
[5190, 1997, 28337, 2031, 9847, 2083, 2414, 2000, 6186, 1996, 2162, 1999, 5712, 1998, 5157, 1996, 10534, 1997, 2329, 3629, 2013, 2008, 2406, 1012, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0]

Sample attention mask:
[1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0]

Sample label ids:
[14, 14, 14, 14, 14, 14, 4, 14, 14, 14, 14, 14, 4, 14, 14, 14, 14, 14, 7, 14, 14, 14, 14, 14, 14, 14, 14, 14, 14, 14, 14, 14, 14, 14, 14, 14, 14, 14, 14, 14, 14, 

## Create Token Classifier

In [11]:
token_classifier = BertTokenClassifier(language=Language.ENGLISH,
                                       num_labels=len(label_list),
                                       cache_dir=cache_dir)

## Train Model

In [12]:
token_classifier.fit(token_ids=train_token_ids, 
                     input_mask=train_input_mask, 
                     labels=train_label_ids,
                     device=device,
                     num_epochs=num_train_epochs, 
                     batch_size=batch_size, 
                     learning_rate=learning_rate,
                     clip_gradient=clip_gradient)

t_total value of -1 results in schedule not being applied

Train loss: 0.13670625791969876




Train loss: 0.07746745187449534





## Predict on Validation Data

In [102]:
pred_label_ids = token_classifier.predict(token_ids=dev_token_ids, 
                                          input_mask=dev_input_mask, 
                                          label_map=label_map,
                                          labels=dev_label_ids, 
                                          batch_size=batch_size,
                                          device=device)

## Evaluate Model
The `predict` method of the token classifier outputs label ids for all tokens, including the padded tokens. `postprocess_token_labels` is a helper function that removes the predictions on padded tokens. If a `label_map` is provided, it also maps the numerical label ids back to the original token labels, which are usually strings.

In [148]:
pred_tags_no_padding = postprocess_token_labels(pred_label_ids, dev_input_mask, label_map)
true_tags_no_padding = postprocess_token_labels(dev_label_ids, dev_input_mask, label_map)
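
For intuition, the post-processing described above amounts to filtering each sequence by its attention mask and inverting the label map. The sketch below uses a hypothetical function name and is not the actual `postprocess_token_labels` implementation, which may handle additional cases.

```python
def strip_padding_and_decode(label_ids, input_mask, label_map=None):
    """Drop labels at padded positions and optionally map ids back to strings.

    Hypothetical helper for illustration only.
    """
    id_to_label = None
    if label_map is not None:
        id_to_label = {idx: label for label, idx in label_map.items()}

    decoded = []
    for ids, mask in zip(label_ids, input_mask):
        # Keep only the positions where the attention mask is 1.
        kept = [i for i, m in zip(ids, mask) if m == 1]
        if id_to_label is not None:
            kept = [id_to_label[i] for i in kept]
        decoded.append(kept)
    return decoded


toy_label_map = {"B-geo": 0, "O": 1}
toy_tags = strip_padding_and_decode(
    label_ids=[[1, 0, 1, 1, 1]],
    input_mask=[[1.0, 1.0, 1.0, 0.0, 0.0]],
    label_map=toy_label_map,
)
print(toy_tags)
```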

In [149]:
print("F1 Score: {}".format(f1_score(true_tags_no_padding, pred_tags_no_padding)))

F1 Score: 0.8966012568936771
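
The `f1_score` from `seqeval` is computed at the entity level rather than per token: a predicted entity counts as correct only if its type and span both match. The simplified sketch below illustrates that idea for BIO tags. It is not the `seqeval` implementation; in particular, it treats any tag without a `B-` or `I-` prefix (such as "O" or "X") as outside an entity, which is an assumption.

```python
def extract_entities(tags):
    """Return a set of (type, start, end) spans from one BIO-tagged sentence."""
    entities = set()
    start, etype = None, None
    for i, tag in enumerate(list(tags) + ["O"]):  # sentinel closes the last span
        prefix, _, label = tag.partition("-")
        continues = start is not None and prefix == "I" and label == etype
        if not continues:
            if start is not None:
                entities.add((etype, start, i - 1))
                start, etype = None, None
            if prefix in ("B", "I"):
                start, etype = i, label
    return entities


def entity_f1(true_sents, pred_sents):
    """Micro-averaged entity-level F1 over a list of tagged sentences."""
    true_set, pred_set = set(), set()
    for idx, (true_tags, pred_tags) in enumerate(zip(true_sents, pred_sents)):
        true_set |= {(idx,) + e for e in extract_entities(true_tags)}
        pred_set |= {(idx,) + e for e in extract_entities(pred_tags)}
    correct = len(true_set & pred_set)
    precision = correct / len(pred_set) if pred_set else 0.0
    recall = correct / len(true_set) if true_set else 0.0
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)


toy_true = [["B-geo", "O", "B-per", "I-per"]]
toy_pred = [["B-geo", "O", "B-per", "O"]]
toy_f1 = entity_f1(toy_true, toy_pred)
print(toy_f1)
```

In this toy example the location is matched exactly, but the person span is cut short, so only one of the two entities counts as correct.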
