*Copyright (c) Microsoft Corporation. All rights reserved.*  
*Licensed under the MIT License.*
# Named Entity Recognition Using BERT on Chinese
## Summary
This notebook demonstrates how to fine-tune a [pretrained BERT model](https://github.com/huggingface/pytorch-pretrained-BERT) for the named entity recognition (NER) task on Chinese text. Utility functions and classes in the NLP Best Practices repo are used to facilitate data preprocessing, model training, model scoring, and model evaluation.

[BERT (Bidirectional Encoder Representations from Transformers)](https://arxiv.org/pdf/1810.04805.pdf) is a powerful pre-trained language model that can be used for multiple NLP tasks, including text classification, question answering, and named entity recognition. It can achieve state-of-the-art performance with only a few epochs of fine-tuning on task-specific datasets.  
The figure below illustrates how BERT can be fine tuned for NER tasks. The input data is a list of tokens representing a sentence. In the training data, each token has an entity label. After fine tuning, the model predicts an entity label for each token in a given testing sentence. 

<img src="https://nlpbp.blob.core.windows.net/images/bert_architecture.png">

Named entity recognition on non-English text is not very different from that on English text. The only difference is the pretrained model, which is selected by the `LANGUAGE` variable below. For non-English languages, including Chinese, the *bert-base-multilingual-cased* model can be used by setting `LANGUAGE = Language.MULTILINGUAL`. For Chinese, the *bert-base-chinese* model can also be used by setting `LANGUAGE = Language.CHINESE`. On Chinese text, *bert-base-chinese* usually performs better than *bert-base-multilingual-cased* because it is pretrained on Chinese data only. On this particular dataset, however, the Chinese-only model and the multilingual model perform very similarly.

## Required packages
* pytorch
* pytorch-pretrained-bert
* pandas
* seqeval

In [1]:
import sys
import os
import random
from seqeval.metrics import classification_report

import torch

nlp_path = os.path.abspath('../../')
if nlp_path not in sys.path:
    sys.path.insert(0, nlp_path)

from utils_nlp.bert.token_classification import BERTTokenClassifier, postprocess_token_labels, create_label_map
from utils_nlp.bert.common import Language, Tokenizer
from utils_nlp.dataset.msra_ner import load_pandas_df, get_unique_labels

## Configurations

In [2]:
# path configurations
CACHE_DIR = "./temp"

# set random seeds
RANDOM_SEED = 100
torch.manual_seed(RANDOM_SEED)

# model configurations
LANGUAGE = Language.CHINESE
DO_LOWER_CASE = True
MAX_SEQ_LENGTH = 200

# training configurations
BATCH_SIZE = 16
NUM_TRAIN_EPOCHS = 1

# optimizer configuration
LEARNING_RATE = 3e-5

TEXT_COL = "sentence"
LABEL_COL = "labels"

## Preprocess Data

### Get training and testing data
The dataset used in this notebook is the MSRA NER dataset. The dataset consists of 45000 training sentences and 3442 testing sentences. 

The helper function `load_pandas_df` downloads the data files if they don't exist in `local_cache_path`. It returns the training or testing data frame, depending on `file_split`.

The helper function `get_unique_labels` returns the unique entity labels in the dataset. There are 7 unique labels in the dataset:
* 'O': non-entity 
* 'B-LOC': beginning of location entity
* 'I-LOC': within location entity
* 'B-PER': beginning of person entity
* 'I-PER': within person entity
* 'B-ORG': beginning of organization entity
* 'I-ORG': within organization entity
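
To make the labeling scheme above concrete, the following is a minimal sketch of how per-character B-/I-/O tags can be grouped back into entities. The function name `bio_to_spans` is hypothetical and is not part of `utils_nlp`. The first example sentence is taken from the training data shown later in this notebook.

```python
def bio_to_spans(chars, tags):
    """Group per-character BIO tags into (entity_type, text) pairs."""
    spans = []
    current_type, current_chars = None, []
    for ch, tag in zip(chars, tags):
        if tag.startswith("B-"):
            # A B- tag always closes any open entity and opens a new one.
            if current_type is not None:
                spans.append((current_type, "".join(current_chars)))
            current_type, current_chars = tag[2:], [ch]
        elif tag.startswith("I-") and current_type == tag[2:]:
            # An I- tag of the same type extends the open entity.
            current_chars.append(ch)
        else:
            # "O", or an I- tag that does not continue the open entity
            # (this sketch simply drops such stray I- tags).
            if current_type is not None:
                spans.append((current_type, "".join(current_chars)))
            current_type, current_chars = None, []
    if current_type is not None:
        spans.append((current_type, "".join(current_chars)))
    return spans


chars = list("美国宇航局说")
tags = ["B-ORG", "I-ORG", "I-ORG", "I-ORG", "I-ORG", "O"]
spans = bio_to_spans(chars, tags)
print(spans)
```

Two adjacent entities of the same type stay separate because the second one starts with its own `B-` tag.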

The longest sentence in the dataset has 2427 tokens and is in the testing data; the longest training sentence has 746 tokens. We set `MAX_SEQ_LENGTH` to 200 above to reduce the GPU memory needed to run this notebook. Less than 1% of the testing sentences are longer than 200 tokens, so truncation should have a negligible impact on the model performance evaluation.
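
One way to check how many sentences a given length limit would truncate is sketched below. It uses toy data in place of `test_df["sentence"]`, and the helper name `fraction_longer_than` is hypothetical.

```python
def fraction_longer_than(sentences, max_len):
    """Return the share of sentences that have more than max_len tokens."""
    if not sentences:
        return 0.0
    too_long = sum(1 for s in sentences if len(s) > max_len)
    return too_long / len(sentences)


# Toy data: each sentence is a list of characters, as in the MSRA data frames.
toy_sentences = [["字"] * n for n in (10, 50, 199, 200, 201, 2427)]
share = fraction_longer_than(toy_sentences, max_len=200)
print("{:.2%} of the toy sentences exceed the limit".format(share))
```

On the real data, the same function could be applied to `list(test_df["sentence"])`, assuming each row holds a list of characters as shown in the data frame preview.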

In [3]:
train_df = load_pandas_df(local_cache_path=CACHE_DIR, file_split="train")
test_df = load_pandas_df(local_cache_path=CACHE_DIR, file_split="test")
label_list = get_unique_labels()
print("Number of sentences in training data: {}".format(train_df.shape[0]))
print("Number of sentences in testing data: {}".format(test_df.shape[0]))
print("Unique labels: {}".format(label_list))

Maximum sequence length in the train data is: 746
Maximum sequence length in the test data is: 2427
Number of sentences in training data: 45000
Number of sentences in testing data: 3442
Unique labels: ['O', 'B-LOC', 'B-ORG', 'B-PER', 'I-LOC', 'I-ORG', 'I-PER']


In [4]:
train_df.head()

Unnamed: 0,sentence,labels
0,"[尽, 管, 欧, 佩, 克, 组, 织, 成, 员, 国, 于, ３, 月, ２, ２, ...","[O, O, B-ORG, I-ORG, I-ORG, I-ORG, I-ORG, O, O..."
1,"[美, 国, 宇, 航, 局, 说, ，, 宇, 航, 员, 的, 安, 全, 不, 成, ...","[B-ORG, I-ORG, I-ORG, I-ORG, I-ORG, O, O, O, O..."
2,"[这, 一, 问, 题, 将, 由, ６, 月, ２, ４, 日, 召, 开, 的, 欧, ...","[O, O, O, O, O, O, O, O, O, O, O, O, O, O, B-O..."
3,"[他, 们, 响, 亮, 地, 提, 出, 了, “, 观, 光, 农, 业, ”, 的, ...","[O, O, O, O, O, O, O, O, O, O, O, O, O, O, O, ..."
4,"[拉, 萨, 布, 达, 拉, 宫, 广, 场, 频, 繁, 举, 行, 着, 各, 种, ...","[B-LOC, I-LOC, B-LOC, I-LOC, I-LOC, I-LOC, I-L..."


**Note that the input texts are lists of Chinese characters instead of raw sentences. This format ensures matching between input words and token labels when the words are further tokenized by `Tokenizer.tokenize_ner`.**
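
If your own data starts out as raw sentences with entity offsets, a conversion to this character-list format could look like the sketch below. The `(start, end, type)` annotation layout is an assumption made for illustration; it is not the format of the MSRA source files.

```python
raw_sentence = "美国宇航局说"
# Hypothetical annotation: (start index, end index exclusive, entity type).
raw_entities = [(0, 5, "ORG")]

# One token per character, with "O" as the default label.
chars = list(raw_sentence)
labels = ["O"] * len(chars)
for start, end, ent_type in raw_entities:
    labels[start] = "B-" + ent_type
    for i in range(start + 1, end):
        labels[i] = "I-" + ent_type

print(list(zip(chars, labels)))
```

Keeping `chars` and `labels` the same length is what allows each label to stay aligned with its character in later steps.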

### Tokenization and Preprocessing
The `tokenize_ner` method of the `Tokenizer` class converts raw string data to numerical features, involving the following steps:
1. WordPiece tokenization.
2. Convert tokens and labels to numerical values, i.e. token ids and label ids.
3. Sequence padding or truncation according to the `max_seq_length` configuration.
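
Steps 2 and 3 above can be illustrated with a small, self-contained sketch. It is not the `tokenize_ner` implementation: the toy vocabulary, the helper names, and the choice to pad label ids with the id of `O` are all assumptions. WordPiece tokenization (step 1) is skipped because each Chinese character is already treated as one token here.

```python
def build_label_map(label_list):
    """Map each label string to an integer id, in list order."""
    return {label: i for i, label in enumerate(label_list)}


def encode(tokens, labels, vocab, label_map, max_len, pad_id=0, unk_id=1):
    """Convert tokens and labels to ids, then truncate or pad to max_len."""
    tokens, labels = tokens[:max_len], labels[:max_len]
    token_ids = [vocab.get(t, unk_id) for t in tokens]
    label_ids = [label_map[label] for label in labels]
    input_mask = [1] * len(token_ids)

    n_pad = max_len - len(token_ids)
    token_ids += [pad_id] * n_pad
    input_mask += [0] * n_pad
    # Assumption: padded positions get the id of "O"; they are masked out anyway.
    label_ids += [label_map["O"]] * n_pad
    return token_ids, input_mask, label_ids


toy_vocab = {"[PAD]": 0, "[UNK]": 1, "美": 2, "国": 3}
toy_label_map = build_label_map(["O", "B-LOC", "I-LOC"])
token_ids, input_mask, label_ids = encode(
    ["美", "国", "说"], ["B-LOC", "I-LOC", "O"], toy_vocab, toy_label_map, max_len=5
)
print(token_ids, input_mask, label_ids)
```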

**Create a dictionary that maps labels to numerical values**

In [5]:
label_map = create_label_map(label_list)

**Tokenize input text**

In [6]:
tokenizer = Tokenizer(language=LANGUAGE, 
                      to_lower=DO_LOWER_CASE, 
                      cache_dir=CACHE_DIR)

**Create numerical features**  

In [7]:
train_token_ids, train_input_mask, train_trailing_token_mask, train_label_ids = \
    tokenizer.tokenize_ner(text=train_df[TEXT_COL],
                           label_map=label_map,
                           max_len=MAX_SEQ_LENGTH,
                           labels=train_df[LABEL_COL])
test_token_ids, test_input_mask, test_trailing_token_mask, test_label_ids = \
    tokenizer.tokenize_ner(text=test_df[TEXT_COL],
                           label_map=label_map,
                           max_len=MAX_SEQ_LENGTH,
                           labels=test_df[LABEL_COL])

`Tokenizer.tokenize_ner` returns three lists of numerical features, or four if `labels` is provided. Each sublist contains the features of one input sentence:
1. token ids: list of numerical values, each corresponding to a token.
2. attention mask: list of 1s and 0s, 1 for input tokens and 0 for padded tokens, so that padded tokens are not attended to.
3. trailing word piece mask: boolean list, `True` for the first word piece of each original word and `False` for the trailing word pieces, e.g. ##ize. This mask is useful for removing predictions on trailing word pieces, so that each original word in the input text has a unique predicted label.
4. label ids: list of numerical values, each corresponding to an entity label. Returned only if `labels` is provided.
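
The sketch below shows how a trailing word piece mask can be used to keep one prediction per original word. The word-piece split and the predicted labels are made up for illustration, and `keep_first_word_pieces` is a hypothetical helper, not a `utils_nlp` function.

```python
def keep_first_word_pieces(pred_labels, trailing_mask):
    """Keep only the predictions whose mask entry is True (first word pieces)."""
    return [label for label, is_first in zip(pred_labels, trailing_mask) if is_first]


# Illustrative split: one word broken into two pieces, the rest left whole.
word_pieces = ["Johan", "##son", "visited", "Paris"]
trailing_mask = [True, False, True, True]
pred_labels = ["B-PER", "I-PER", "O", "B-LOC"]

word_level_labels = keep_first_word_pieces(pred_labels, trailing_mask)
print(word_level_labels)
```

For the character-level Chinese data in this notebook, each character is expected to map to a single token, so the mask would mostly be `True` on non-padded positions.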

In [8]:
print("Sample token ids:\n{}\n".format(train_token_ids[0]))
print("Sample attention mask:\n{}\n".format(train_input_mask[0]))
print("Sample trailing token mask:\n{}\n".format(train_trailing_token_mask[0]))
print("Sample label ids:\n{}\n".format(train_label_ids[0]))

Sample token ids:
[2226, 5052, 3616, 877, 1046, 5299, 5302, 2768, 1447, 1744, 754, 8031, 3299, 8030, 8030, 3189, 6809, 2768, 1121, 2208, 1333, 3779, 3189, 772, 7030, 1291, 6379, 8024, 852, 686, 4518, 3779, 817, 1126, 725, 3766, 3300, 2501, 2768, 6772, 2487, 4638, 1353, 2486, 1232, 1928, 8024, 793, 1905, 754, 856, 6837, 4307, 2578, 511, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.

## Create Token Classifier
The value of the `language` argument determines which BERT model is used:
* Language.ENGLISH: "bert-base-uncased"
* Language.ENGLISHCASED: "bert-base-cased"
* Language.ENGLISHLARGE: "bert-large-uncased"
* Language.ENGLISHLARGECASED: "bert-large-cased"
* Language.CHINESE: "bert-base-chinese"
* Language.MULTILINGUAL: "bert-base-multilingual-cased"

Here we use the base model pre-trained only on Chinese data.

In [9]:
token_classifier = BERTTokenClassifier(language=LANGUAGE,
                                       num_labels=len(label_map),
                                       cache_dir=CACHE_DIR)

## Train Model

In [10]:
token_classifier.fit(token_ids=train_token_ids, 
                     input_mask=train_input_mask, 
                     labels=train_label_ids,
                     num_epochs=NUM_TRAIN_EPOCHS, 
                     batch_size=BATCH_SIZE, 
                     learning_rate=LEARNING_RATE)

t_total value of -1 results in schedule not being applied
Epoch:   0%|          | 0/1 [00:00<?, ?it/s]
Iteration:   0%|          | 0/2813 [00:00<?, ?it/s]
Iteration:   1%|          | 25/2813 [00:30<55:48,  1.20s/it]
Iteration:   1%|          | 25/2813 [00:49<55:48,  1.20s/it]
Iteration:   2%|▏         | 50/2813 [01:00<55:25,  1.20s/it]
Iteration:   2%|▏         | 50/2813 [01:20<55:25,  1.20s/it]
Iteration:   3%|▎         | 75/2813 [01:30<55:08,  1.21s/it]
Iteration:   3%|▎         | 75/2813 [01:50<55:08,  1.21s/it]
Iteration:   4%|▎         | 100/2813 [02:01<54:52,  1.21s/it]
Iteration:   4%|▎         | 100/2813 [02:20<54:52,  1.21s/it]
Iteration:   4%|▍         | 125/2813 [02:32<54:35,  1.22s/it]
Iteration:   4%|▍         | 125/2813 [02:50<54:35,  1.22s/it]
Iteration:   5%|▌         | 150/2813 [03:03<54:17,  1.22s/it]
Iteration:   5%|▌         | 150/2813 [03:20<54:17,  1.22s/it]
Iteration:   6%|▌         | 175/2813 [03:33<53:56,  1.23s/it]
Iteration:   6%|▌         | 175/2813 [03:50<53:56,  1.23s/it]
Iteration:   7%|▋         | 200/2813 [04:04<53:31,  1.23s/it]
Iteration:   7%|▋         | 20

Train loss: 0.059443180853511884
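
The iteration counts shown in the progress bars are consistent with the number of mini-batches per epoch, i.e. the dataset size divided by `BATCH_SIZE` and rounded up. This is a small arithmetic check, not a description of how `fit` batches the data internally.

```python
import math


def batches_per_epoch(num_examples, batch_size):
    """Number of mini-batches needed to cover the data once; the last may be partial."""
    return math.ceil(num_examples / batch_size)


train_batches = batches_per_epoch(45000, 16)  # training sentences, BATCH_SIZE
test_batches = batches_per_epoch(3442, 16)    # testing sentences, BATCH_SIZE
print(train_batches, test_batches)
```

The two values match the totals of 2813 training iterations and 216 prediction iterations reported in this notebook's output.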





## Predict on Test Data
The `predict` method of the token classifier optionally returns the softmax probability of the predicted class for each token. The probabilities form an N x M array, where N is the number of testing sentences and M is the number of tokens in each testing sequence.
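
As a rough illustration of what "softmax probability of the predicted class" means, the sketch below turns per-token logits into a predicted class id and its probability. The logits are made up, and the helper names are hypothetical; the actual computation inside `predict` may differ in detail.

```python
import math


def softmax(logits):
    """Numerically stable softmax over one token's class logits."""
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def predict_tokens(sentence_logits):
    """Return (class ids, probabilities of those classes) for one sentence."""
    classes, probs = [], []
    for token_logits in sentence_logits:
        p = softmax(token_logits)
        best = max(range(len(p)), key=p.__getitem__)
        classes.append(best)
        probs.append(p[best])
    return classes, probs


# Two tokens, three classes each (toy values).
toy_logits = [[4.0, 0.5, 0.1], [0.2, 3.0, 0.3]]
classes, probs = predict_tokens(toy_logits)
print(classes, probs)
```

Applying this to every sentence yields one probability per token, which gives the N x M shape described above.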

In [11]:
predictions = token_classifier.predict(token_ids=test_token_ids, 
                                       input_mask=test_input_mask, 
                                       labels=test_label_ids, 
                                       batch_size=BATCH_SIZE,
                                       probabilities=True)
probabilities = predictions.probabilities
pred_label_ids = predictions.classes
print("Sample predicted probabilities:\n{}\n".format(probabilities[5]))
print("Sample predicted labels:\n{}\n".format(pred_label_ids[5]))

Iteration:   0%|          | 0/216 [00:00<?, ?it/s]



Iteration: 100%|██████████| 216/216 [01:28<00:00,  2.45it/s]

Evaluation loss: 0.029963031752467267
Sample predicted probabilities:
[0.9984303  0.9996406  0.9998387  0.99995875 0.9999659  0.99996686
 0.9999653  0.99997485 0.99997294 0.99997234 0.99995565 0.9999659
 0.9999764  0.99997306 0.9999707  0.9999659  0.9999639  0.99997115
 0.99997234 0.9999709  0.9999713  0.9999739  0.9999732  0.99997246
 0.99997246 0.99997544 0.9999728  0.9999758  0.9999713  0.9999726
 0.9999745  0.9999734  0.9999758  0.9999728  0.9999739  0.99996376
 0.99995685 0.9999708  0.9999721  0.9999734  0.9999745  0.9999738
 0.99997056 0.9999764  0.9997565  0.9997651  0.9994979  0.99969363
 0.99973804 0.9998067  0.9997167  0.9998072  0.99989605 0.9999374
 0.99995315 0.99995065 0.99994457 0.9996952  0.9994863  0.9981755
 0.997241   0.99963045 0.99957377 0.9997564  0.9994242  0.99979025
 0.9997782  0.99987435 0.9997774  0.99983525 0.9998965  0.99966764
 0.99848396 0.99930775 0.9984328  0.9998795  0.9998504  0.99991655
 0.9996692  0.9998728  0.99973184 0.9998615  0.9998672  0.999916




## Evaluate Model
The `predict` method of the token classifier outputs label ids for all tokens, including the padded tokens. `postprocess_token_labels` is a helper function that removes the predictions on padded tokens. If a `label_map` is provided, it also maps the numerical label ids back to the original token labels, which are usually strings.
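
The post-processing described above can be sketched as follows. This is an illustrative stand-in, not the source of `postprocess_token_labels`; the function name `strip_padding_and_decode` and the toy inputs are assumptions.

```python
def strip_padding_and_decode(label_ids, input_mask, label_map=None):
    """Drop labels at padded positions and optionally map ids back to strings."""
    id_to_label = {v: k for k, v in label_map.items()} if label_map else None
    result = []
    for ids, mask in zip(label_ids, input_mask):
        # Keep only positions where the attention mask is 1 (real tokens).
        kept = [i for i, m in zip(ids, mask) if m == 1]
        if id_to_label is not None:
            kept = [id_to_label[i] for i in kept]
        result.append(kept)
    return result


toy_label_map = {"O": 0, "B-PER": 1, "I-PER": 2}
toy_label_ids = [[1, 2, 0, 0, 0]]
toy_input_mask = [[1, 1, 1, 0, 0]]
decoded = strip_padding_and_decode(toy_label_ids, toy_input_mask, toy_label_map)
print(decoded)
```

The result is a list of per-sentence label lists of varying length, which is the nested-list shape that `classification_report` from `seqeval` accepts.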

In [12]:
pred_tags_no_padding = postprocess_token_labels(pred_label_ids, 
                                                test_input_mask, 
                                                label_map)
true_tags_no_padding = postprocess_token_labels(test_label_ids, 
                                                test_input_mask, 
                                                label_map)
print(classification_report(true_tags_no_padding, pred_tags_no_padding, digits=2))

           precision    recall  f1-score   support

      ORG       0.80      0.92      0.85      1296
      PER       0.94      0.95      0.95      1382
      LOC       0.94      0.93      0.94      2803

micro avg       0.90      0.93      0.92      5481
macro avg       0.91      0.93      0.92      5481
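
When reading the two average rows, it can help to recompute averages from the per-class rows. The sketch below does this with the rounded values printed above, so it is only suggestive. How a given `seqeval` version defines each row should be checked in its documentation.

```python
# Per-class values copied from the report above.
report = {
    "ORG": {"precision": 0.80, "recall": 0.92, "f1": 0.85, "support": 1296},
    "PER": {"precision": 0.94, "recall": 0.95, "f1": 0.95, "support": 1382},
    "LOC": {"precision": 0.94, "recall": 0.93, "f1": 0.94, "support": 2803},
}


def unweighted_mean(metric):
    """Plain average over classes; every class counts equally."""
    return sum(row[metric] for row in report.values()) / len(report)


def support_weighted_mean(metric):
    """Average over classes, weighted by the number of true entities per class."""
    total = sum(row["support"] for row in report.values())
    return sum(row[metric] * row["support"] for row in report.values()) / total


for metric in ("precision", "recall", "f1"):
    print(metric,
          round(unweighted_mean(metric), 2),
          round(support_weighted_mean(metric), 2))
```

With these rounded inputs, the support-weighted means round to 0.91, 0.93 and 0.92, the same values as the row labeled `macro avg`, while the unweighted means round to 0.89, 0.93 and 0.91.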

