# Named Entity Recognition Using BERT
## Summary
This notebook demonstrates how to fine-tune a [pretrained BERT model](https://github.com/huggingface/pytorch-pretrained-BERT) for a token-level named entity recognition (NER) task. A few utility functions and classes in the NLP Best Practices repo are used to facilitate data preprocessing, model training, and model evaluation.

[BERT (Bidirectional Transformers for Language Understanding)](https://arxiv.org/pdf/1810.04805.pdf) is a powerful pre-trained language model that can be used for multiple NLP tasks, including text classification, question answering, and named entity recognition. It can achieve state-of-the-art performance with only a few epochs of fine-tuning.  
The figure below illustrates how BERT can be fine-tuned for NER tasks. The input data is a list of tokens representing a sentence. In the training data, each token has an entity label. After fine-tuning, the model predicts an entity label for each token of a given sentence in the testing data.

![](bert_architecture.png)
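
As a rough illustration of the prediction step described above, the sketch below turns one score vector per token into one label per token by picking the highest-scoring label. The three-label set, the tokens, and the scores are made up for this example; in the notebook the scores would come from the fine-tuned model.

```python
# Hypothetical label set and per-token scores, for illustration only.
labels = ["O", "B-geo", "B-gpe"]
tokens = ["protest", "in", "London"]
scores = [
    [2.1, 0.3, -0.5],  # scores for "protest"
    [1.7, 0.2, 0.1],   # scores for "in"
    [0.4, 3.2, 0.9],   # scores for "London"
]


def predict_labels(score_rows, label_names):
    """Pick the highest-scoring label for each token."""
    predictions = []
    for row in score_rows:
        best_index = max(range(len(row)), key=row.__getitem__)
        predictions.append(label_names[best_index])
    return predictions


predicted = predict_labels(scores, labels)
for token, label in zip(tokens, predicted):
    print(token, label)
```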

### Required packages
* pytorch
* pytorch-pretrained-bert
* pandas
* seqeval

In [1]:
import sys
import os
import yaml
import pprint
import random
from seqeval.metrics import f1_score

import torch
from torch.optim import Adam

from pytorch_pretrained_bert.tokenization import BertTokenizer

bert_utils_path = os.path.abspath('../../utils_nlp/bert')
if bert_utils_path not in sys.path:
    sys.path.insert(0, bert_utils_path)

from configs import BERTFineTuneConfig
from bert_data_utils import KaggleNERProcessor
from bert_utils import (BertTokenClassifier, 
                        create_token_feature_dataset, 
                        get_device)


In [2]:
config_file = "config.yaml"
ner_data_dir = "./data/NER/ner_dataset.csv"
random_seed = 42
random.seed(random_seed)
torch.manual_seed(random_seed)

<torch._C.Generator at 0x7fbc340879f0>

## Configurations

In [3]:
with open(config_file, 'r') as ymlfile:
    config_dict = yaml.safe_load(ymlfile)

pprint.pprint(config_dict)

{'ModelConfig': {'bert_model': 'bert-base-uncased',
                 'do_lower_case': True,
                 'max_seq_length': 75},
 'OptimizerConfig': {'clip_gradient': True,
                     'learning_rate': 3e-05,
                     'max_gradient_norm': 1.0,
                     'no_decay_params': ['bias', 'gamma', 'beta'],
                     'optimizer_name': 'Adam',
                     'params_weight_decay': 0.01},
 'TrainConfig': {'batch_size': 32, 'num_train_epochs': 2}}
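
One common way to apply `no_decay_params` and `params_weight_decay` is to split the model's named parameters into two groups and switch weight decay off for any parameter whose name contains one of the listed substrings. The sketch below shows that grouping on plain parameter names. The names and the helper `group_parameter_names` are hypothetical, and the repo's utilities may handle these settings differently.

```python
no_decay_params = ["bias", "gamma", "beta"]
params_weight_decay = 0.01

# Hypothetical parameter names, used here in place of real model parameters.
parameter_names = [
    "encoder.layer.0.attention.weight",
    "encoder.layer.0.attention.bias",
    "encoder.layer.0.LayerNorm.gamma",
    "encoder.layer.0.LayerNorm.beta",
    "classifier.weight",
]


def group_parameter_names(names, no_decay, weight_decay):
    """Split names into a decayed group and a non-decayed group."""
    decayed = [n for n in names if not any(nd in n for nd in no_decay)]
    not_decayed = [n for n in names if any(nd in n for nd in no_decay)]
    return [
        {"params": decayed, "weight_decay": weight_decay},
        {"params": not_decayed, "weight_decay": 0.0},
    ]


groups = group_parameter_names(parameter_names, no_decay_params, params_weight_decay)
for group in groups:
    print(group["weight_decay"], group["params"])
```

PyTorch optimizers such as `torch.optim.Adam` accept a list of dictionaries in this shape, holding the parameter tensors rather than their names.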


In [4]:
config = BERTFineTuneConfig(config_dict)
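
Later cells read settings such as `config.bert_model` directly, which suggests that the config object exposes the nested dictionary as flat attributes. A minimal sketch of that idea follows. `FlatConfig` is a hypothetical stand-in written for this illustration, not the actual `BERTFineTuneConfig` implementation.

```python
class FlatConfig:
    """Hypothetical stand-in that exposes nested settings as flat attributes."""

    def __init__(self, config_dict):
        # Each top-level section is a dict of settings; copy them onto self.
        for section in config_dict.values():
            for key, value in section.items():
                setattr(self, key, value)


example_dict = {
    "ModelConfig": {"bert_model": "bert-base-uncased", "max_seq_length": 75},
    "TrainConfig": {"batch_size": 32, "num_train_epochs": 2},
}
flat_config = FlatConfig(example_dict)
print(flat_config.bert_model, flat_config.batch_size)
```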

## Preprocess Data

### Create training and validation examples
`KaggleNERProcessor` is a dataset-specific class that splits the whole dataset into training and validation datasets according to `dev_percentage`. The `get_train_examples` and `get_dev_examples` methods return the training and validation datasets respectively. The `get_labels` method returns a list of all unique labels.

In [5]:
kaggle_ner_processor = KaggleNERProcessor(data_dir=ner_data_dir, dev_percentage = 0.1)

In [6]:
train_examples = kaggle_ner_processor.get_train_examples()
dev_examples = kaggle_ner_processor.get_dev_examples()
label_list = kaggle_ner_processor.get_labels()
print(label_list)

['I-gpe', 'B-art', 'B-eve', 'I-geo', 'B-tim', 'I-art', 'I-eve', 'B-nat', 'B-org', 'I-org', 'O', 'B-geo', 'B-gpe', 'I-tim', 'I-per', 'I-nat', 'B-per', 'X']
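
To make the role of `dev_percentage` concrete, the sketch below shuffles a list of sentences and holds out a fraction of them for validation. `split_train_dev` is a hypothetical helper written for this illustration; how `KaggleNERProcessor` actually selects the validation sentences is not shown in this notebook.

```python
import random


def split_train_dev(examples, dev_percentage, seed=42):
    """Shuffle a copy of the examples and hold out a fraction for validation."""
    shuffled = list(examples)
    random.Random(seed).shuffle(shuffled)
    dev_size = int(len(shuffled) * dev_percentage)
    # The first dev_size items become the validation set, the rest training.
    return shuffled[dev_size:], shuffled[:dev_size]


sentences = ["sentence {}".format(i) for i in range(20)]
train_part, dev_part = split_train_dev(sentences, dev_percentage=0.1)
print(len(train_part), len(dev_part))
```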


`KaggleNERProcessor` generates training and evaluation examples as `BertInputData` objects. `BertInputData` is a `namedtuple` with the following three fields:
* text_a: text string of the first sentence.
* text_b: text string of the second sentence. This is only required for two-sentence tasks.
* label: required for training and validation data.
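
A minimal sketch of such a record, using the field names listed above. The definition here is an assumption made for illustration; the actual `BertInputData` is defined in the repo's utilities and may differ in detail.

```python
from collections import namedtuple

# Assumed definition with the three fields described above.
BertInputData = namedtuple("BertInputData", ["text_a", "text_b", "label"])

example = BertInputData(
    text_a="protest the war in Iraq",
    text_b=None,  # not needed for single-sentence tasks such as NER
    label=["O", "O", "O", "O", "B-geo"],
)
print(example.text_a)
print(example.label)
```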

In [7]:
print('Sample sentence: \n{}\n'.format(train_examples[0].text_a))
print('Sample sentence labels: \n{}\n'.format(train_examples[0].label))

Sample sentence: 
Thousands of demonstrators have marched through London to protest the war in Iraq and demand the withdrawal of British troops from that country .

Sample sentence labels: 
['O', 'O', 'O', 'O', 'O', 'O', 'B-geo', 'O', 'O', 'O', 'O', 'O', 'B-geo', 'O', 'O', 'O', 'O', 'O', 'B-gpe', 'O', 'O', 'O', 'O', 'O']



### Convert raw input to feature dataset
The function `create_token_feature_dataset` converts raw string data to a PyTorch `TensorDataset` containing numerical features. It involves the following steps:
1. Tokenization.
2. Convert tokens and labels to numerical values, i.e. token ids and label ids.
3. Sequence padding or truncation according to the `max_seq_length` configuration.
4. Convert numpy arrays to a PyTorch `TensorDataset`.
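
Steps 2 and 3 can be sketched in plain Python as follows. The toy vocabulary, the toy label map, and the padding id of 0 are assumptions made for this illustration; the actual function uses the BERT vocabulary and the notebook's `label_map`.

```python
def pad_or_truncate(ids, max_seq_length, pad_id=0):
    """Cut a sequence to max_seq_length or right-pad it with pad_id."""
    ids = ids[:max_seq_length]
    return ids + [pad_id] * (max_seq_length - len(ids))


# Hypothetical toy vocabulary and label map.
toy_vocab = {"[PAD]": 0, "protest": 1, "in": 2, "london": 3}
toy_label_map = {"O": 0, "B-geo": 1}

toy_tokens = ["protest", "in", "london"]
toy_labels = ["O", "O", "B-geo"]

# Step 2: map tokens and labels to ids. Step 3: pad the token ids.
token_ids = pad_or_truncate([toy_vocab[t] for t in toy_tokens], max_seq_length=5)
label_ids = [toy_label_map[label] for label in toy_labels]
truncated = pad_or_truncate(list(range(10)), max_seq_length=5)

print(token_ids)
print(label_ids)
print(truncated)
```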

**Create a dictionary that maps labels to numerical values**

In [8]:
label_map = {label: i for i, label in enumerate(label_list)}

**Create a tokenizer**

In [9]:
tokenizer = BertTokenizer.from_pretrained(config.bert_model,
                                          do_lower_case=config.do_lower_case)

**Create feature TensorDataset**  
Note the `trailing_piece_tag` argument. BERT uses a WordPiece tokenizer, which breaks some words into multiple tokens, e.g. "playing" is tokenized into "play" and "##ing". Since the input data has only one label for "playing", `create_token_feature_dataset` assigns the original label to the first token "play" and labels the second token "##ing" as "X". By default, `trailing_piece_tag` is set to "X". If "X" already exists as a label in your data, set `trailing_piece_tag` to a value that does not appear in your data.

In [10]:
train_dataset = create_token_feature_dataset(data=train_examples,
                                             tokenizer=tokenizer,
                                             label_map=label_map,
                                             true_label_available=True,
                                             max_seq_length=config.max_seq_length, 
                                             trailing_piece_tag="X")
dev_dataset = create_token_feature_dataset(data=dev_examples,
                                           tokenizer=tokenizer,
                                           label_map=label_map, 
                                           true_label_available=True,
                                           max_seq_length=config.max_seq_length, 
                                           trailing_piece_tag="X")
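
The labeling rule for trailing word pieces can be sketched without the real tokenizer. In the example below, `toy_word_pieces` is a hypothetical lookup standing in for WordPiece tokenization, and `align_labels` is an illustrative helper, not the repo's implementation.

```python
# Hypothetical word-piece splits standing in for the real tokenizer.
toy_word_pieces = {"playing": ["play", "##ing"]}


def align_labels(words, word_labels, trailing_piece_tag="X"):
    """Give the first piece of each word its label and tag the remaining pieces."""
    pieces, piece_labels = [], []
    for word, label in zip(words, word_labels):
        word_pieces = toy_word_pieces.get(word, [word])
        pieces.extend(word_pieces)
        piece_labels.append(label)
        piece_labels.extend([trailing_piece_tag] * (len(word_pieces) - 1))
    return pieces, piece_labels


pieces, piece_labels = align_labels(["kids", "playing", "outside"], ["O", "O", "O"])
print(pieces)
print(piece_labels)
```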

`create_token_feature_dataset` outputs a `TensorDataset` with four tensors:  
1. token ids: numerical values, each corresponding to a token.
2. attention mask: 1 for input tokens and 0 for padded tokens, so that padded tokens are not attended to.
3. segment ids: 0 for the first sentence and 1 for the second sentence. These are only used in two-sentence tasks, not in NER.
4. label ids: numerical values, each corresponding to an entity label.
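
As an illustration of the second and third tensors, the sketch below derives an attention mask and segment ids from a padded list of token ids. It assumes the padding id is 0, which matches the sample output in this notebook. `build_mask_and_segments` is a hypothetical helper.

```python
def build_mask_and_segments(padded_token_ids, pad_id=0):
    """Return (attention_mask, segment_ids) for a single-sentence input."""
    attention_mask = [0 if token_id == pad_id else 1 for token_id in padded_token_ids]
    segment_ids = [0] * len(padded_token_ids)  # single sentence: all zeros
    return attention_mask, segment_ids


attention_mask, segment_ids = build_mask_and_segments([5190, 1997, 28337, 0, 0])
print(attention_mask)
print(segment_ids)
```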

In [11]:
print("Sample token id:\n{}\n".format(train_dataset[0][0]))
print("Sample attention mask:\n{}\n".format(train_dataset[0][1]))
print("Sample label ids:\n{}\n".format(train_dataset[0][3]))

Sample token id:
tensor([ 5190,  1997, 28337,  2031,  9847,  2083,  2414,  2000,  6186,  1996,
         2162,  1999,  5712,  1998,  5157,  1996, 10534,  1997,  2329,  3629,
         2013,  2008,  2406,  1012,     0,     0,     0,     0,     0,     0,
            0,     0,     0,     0,     0,     0,     0,     0,     0,     0,
            0,     0,     0,     0,     0,     0,     0,     0,     0,     0,
            0,     0,     0,     0,     0,     0,     0,     0,     0,     0,
            0,     0,     0,     0,     0,     0,     0,     0,     0,     0,
            0,     0,     0,     0,     0])

Sample attention mask:
tensor([1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1,
        0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
        0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
        0, 0, 0])

Sample label ids:
tensor([10, 10, 10, 10, 10, 10, 11, 10, 10, 10, 10, 10, 11, 10, 10, 10, 10, 10,
        12, 

## Create Token Classifier
`get_device` is a helper function that detects whether a GPU is available and how many GPUs are available.

In [12]:
device, n_gpu = get_device()
token_classifier = BertTokenClassifier(config=config, 
                                       label_map=label_map, 
                                       device=device, 
                                       n_gpu=n_gpu)

BERT fine tune configurations:
batch_size=32
num_train_epochs=2
bert_model=bert-base-uncased
max_seq_length=75
do_lower_case=True
optimizer_name=Adam
learning_rate=3e-05
no_decay_params=['bias', 'gamma', 'beta']
params_weight_decay=0.01
clip_gradient=True
max_gradient_norm=1.0
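
The `clip_gradient` and `max_gradient_norm` settings refer to gradient clipping. A common form, sketched below on plain Python lists, rescales all gradients when their combined L2 norm exceeds the limit. This is an assumption about what the settings control in general terms; the classifier's own training loop is not shown here, and `clip_by_global_norm` is a hypothetical helper.

```python
import math


def clip_by_global_norm(gradients, max_gradient_norm):
    """Scale all gradients down if their combined L2 norm exceeds the limit."""
    total_norm = math.sqrt(sum(g * g for grad in gradients for g in grad))
    if total_norm <= max_gradient_norm:
        return gradients, total_norm
    scale = max_gradient_norm / total_norm
    return [[g * scale for g in grad] for grad in gradients], total_norm


# Two hypothetical gradient vectors whose combined norm is 5.0.
clipped, norm_before = clip_by_global_norm([[3.0], [4.0]], max_gradient_norm=1.0)
print(norm_before)
print(clipped)
```

In PyTorch, `torch.nn.utils.clip_grad_norm_` applies this kind of clipping to a model's parameters in place.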


## Train Model

In [13]:
token_classifier.fit(train_dataset)


Train loss: 0.2427288124778255




Train loss: 0.23164681046427932





## Evaluate Model

In [14]:
pred_tags, true_tags = token_classifier.predict(dev_dataset)
# seqeval's f1_score takes the true labels first, then the predictions.
print("F1 Score: {}".format(f1_score(true_tags, pred_tags)))

Iteration: 100%|██████████| 150/150 [01:31<00:00,  1.63it/s]


Validation loss: 0.21716053078571956
F1 Score: 0.794088923156001
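
`seqeval` scores whole entities rather than individual tokens. The simplified sketch below shows the idea for BIO tags: extract (type, start, end) spans from each sequence, then compute F1 from the spans that match exactly. It is a rough illustration, not `seqeval`'s implementation, and it ignores any `I-` tag that does not continue an entity of the same type.

```python
def extract_entities(tags):
    """Return a set of (entity_type, start, end) spans from BIO tags."""
    entities, start, current_type = set(), None, None
    for i, tag in enumerate(list(tags) + ["O"]):  # sentinel closes the last span
        continues = tag.startswith("I-") and tag[2:] == current_type
        if current_type is not None and not continues:
            entities.add((current_type, start, i - 1))
            current_type = None
        if tag.startswith("B-"):
            start, current_type = i, tag[2:]
    return entities


def entity_f1(true_tags, pred_tags):
    """Compute F1 over exactly matching entity spans."""
    true_entities = extract_entities(true_tags)
    pred_entities = extract_entities(pred_tags)
    correct = len(true_entities & pred_entities)
    if correct == 0:
        return 0.0
    precision = correct / len(pred_entities)
    recall = correct / len(true_entities)
    return 2 * precision * recall / (precision + recall)


true_example = ["B-per", "I-per", "O", "B-geo"]
pred_example = ["B-per", "I-per", "O", "O"]
example_f1 = entity_f1(true_example, pred_example)
print(example_f1)
```

In this example the prediction finds one of the two true entities and nothing else, so precision is 1, recall is 0.5, and F1 is 2/3.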


Check model performance on the first token of each word by excluding positions whose true label is "X".

In [15]:
pred_tags_no_X = []
true_tags_no_X = []
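for p, t in zip(pred_tags, true_tags):
    # Drop positions whose true label is "X" from both sequences.
    # Boolean-mask indexing like this assumes p and t are NumPy arrays.
    p = p[t != "X"]
    t = t[t != "X"]
    pred_tags_no_X.append(p)
    true_tags_no_X.append(t)
# seqeval's f1_score takes the true labels first, then the predictions.
print("F1 Score without 'X' label: {}".format(f1_score(true_tags_no_X, pred_tags_no_X)))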

F1 Score without 'X' label: 0.8329393223010244
