<h1>Named Entity Recognition using BERT Fine-Tuning</h1>

Pre-trained language representations tend to improve performance on downstream NLP tasks such as question answering, named entity recognition, and language inference. BERT pre-trains deep bidirectional representations and, once fine-tuned, achieves state-of-the-art results on a range of these tasks. Unlike a traditional left-to-right Transformer language model, BERT is trained with “masked language modeling”: some input tokens are hidden and the model predicts them from the rest of the sentence, so it can use the context on both sides of each token.
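
To make the masking idea concrete, here is a toy sketch in plain Python. It is not BERT's actual pre-training code: the function name `mask_tokens`, the masking probability, and the example sentence are all assumptions chosen for illustration, and the real procedure is more involved than a simple replacement.

```python
import random

def mask_tokens(tokens, mask_prob=0.15, mask_token="[MASK]", seed=0):
    """Toy masking: hide roughly mask_prob of the tokens (hypothetical helper)."""
    rng = random.Random(seed)
    masked, targets = [], {}
    for pos, tok in enumerate(tokens):
        if rng.random() < mask_prob:
            masked.append(mask_token)
            targets[pos] = tok  # a model would be trained to recover this token
        else:
            masked.append(tok)
    return masked, targets

tokens = "the quick brown fox jumps over the lazy dog".split()
masked, targets = mask_tokens(tokens, mask_prob=0.3)
print(masked)   # same length as tokens, some entries replaced by [MASK]
print(targets)  # position -> original token for every masked position
```

Because the unmasked tokens on both sides stay visible, the prediction for a hidden position can draw on left and right context at the same time.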

For this example, we use the transformers library to load the pre-trained BERT model and its configuration files.

In [2]:
import tempfile
import os
import numpy as np
from typing import Callable, Iterable, List, Union

import tensorflow as tf
from tensorflow.keras.layers import Dense, Input
from tensorflow.keras.models import Model
from transformers import BertTokenizer, TFBertModel

import fastestimator as fe
from fastestimator.dataset.data import german_ner
from fastestimator.op.numpyop.numpyop import NumpyOp
from fastestimator.op.numpyop.univariate import PadSequence, Tokenize, WordtoId
from fastestimator.op.tensorop import TensorOp, Reshape
from fastestimator.op.tensorop.loss import CrossEntropy
from fastestimator.op.tensorop.model import ModelOp, UpdateOp
from fastestimator.trace.metric import Accuracy
from fastestimator.trace.io import BestModelSaver
from fastestimator.backend import feed_forward

In [3]:
max_len = 20
batch_size = 64
epochs = 10

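A custom NumpyOp that constructs attention masks for the input sequences: positions holding a real token id (greater than 0) get 1.0, and padded positions (id 0) get 0.0.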

In [4]:
class AttentionMask(NumpyOp):
    def forward(self, data, state):
        masks = [float(i > 0) for i in data]
        return np.array(masks)
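
As a quick check of the rule used in `forward`, the sketch below applies the same comparison to a standalone list. The token ids are made up for illustration, and the helper name `attention_mask` is not part of FastEstimator.

```python
import numpy as np

def attention_mask(token_ids):
    # Same rule as AttentionMask.forward: 1.0 for real tokens, 0.0 for padding (id 0)
    return np.array([float(i > 0) for i in token_ids])

padded_ids = [101, 7592, 2088, 102, 0, 0]  # made-up ids, padded with zeros
mask = attention_mask(padded_ids)
print(mask)  # [1. 1. 1. 1. 0. 0.]
```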

The char2idx function builds a lookup table that maps each label to an integer id.

In [5]:
def char2idx(data):
    tag2idx = {t: i for i, t in enumerate(data)}
    return tag2idx
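
A small usage sketch, assuming a hypothetical three-tag vocabulary (the real label vocabulary comes from the dataset). It also shows how the table can be inverted to map ids back to tags.

```python
def char2idx(data):
    # Map each label to its position in the vocabulary
    return {t: i for i, t in enumerate(data)}

labels = ["O", "B-PER", "I-PER"]  # hypothetical label vocabulary
tag2idx = char2idx(labels)
idx2tag = {i: t for t, i in tag2idx.items()}  # inverse lookup

print(tag2idx)     # {'O': 0, 'B-PER': 1, 'I-PER': 2}
print(idx2tag[1])  # B-PER
```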

<h2>Building components</h2>

<h3>Step 1: Prepare training & evaluation data and define pipeline</h3>

The GermEval NER dataset contains sequences and entity tags from German Wikipedia and news corpora. We load the train and eval sequence datasets along with the data and label vocabularies. For simplicity, other nouns are omitted in this example.

In [6]:
train_data, eval_data, data_vocab, label_vocab = german_ner.load_data()

Define a pipeline that tokenizes and pads the input sequences and constructs the attention masks. Attention masks prevent the model from attending to padded tokens. We use the BERT tokenizer to tokenize the input sequences and a maximum length of 20 (max_len) for this example.

In [7]:
tokenizer = BertTokenizer.from_pretrained('bert-base-uncased', do_lower_case=True)
tag2idx = char2idx(label_vocab)
pipeline = fe.Pipeline(
    train_data=train_data,
    eval_data=eval_data,
    batch_size=batch_size,
    ops=[
        Tokenize(inputs="x", outputs="x", tokenize_fn=tokenizer.tokenize),
        WordtoId(inputs="x", outputs="x", mapping=tokenizer.convert_tokens_to_ids),
        WordtoId(inputs="y", outputs="y", mapping=tag2idx),
        PadSequence(max_len=max_len, inputs="x", outputs="x"),
        PadSequence(max_len=max_len, value=len(tag2idx), inputs="y", outputs="y"),
        AttentionMask(inputs="x", outputs="x_masks")
    ])
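
The padding step can be pictured with the NumPy sketch below. It is a stand-in rather than FastEstimator's PadSequence implementation: it assumes that padding is appended at the end and that longer sequences are truncated to max_len. The ids and the three-tag label set are hypothetical.

```python
import numpy as np

def pad_sequence(ids, max_len, value=0):
    # Truncate to max_len, then append `value` until the length is max_len
    ids = list(ids)[:max_len]
    return np.array(ids + [value] * (max_len - len(ids)))

num_tags = 3  # hypothetical; the pipeline above uses len(tag2idx) as the padding label
x = pad_sequence([101, 7592, 102], max_len=6)           # token ids padded with 0
y = pad_sequence([1, 2, 2], max_len=6, value=num_tags)  # labels padded with an extra class id
print(x)  # [ 101 7592  102    0    0    0]
print(y)  # [1 2 2 3 3 3]
```

Padding the labels with a value outside the tag range keeps padded positions distinguishable from every real tag.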

<h3>Step 2: Create model and FastEstimator network</h3>

The network uses the pretrained BERT weights as initialization for the downstream task. The whole network is then trained during fine-tuning.

In [8]:
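def ner_model():
    # Token ids and attention masks, one row of max_len entries per sample
    token_inputs = Input((max_len, ), dtype=tf.int32, name='input_words')
    mask_inputs = Input((max_len, ), dtype=tf.int32, name='input_masks')
    bert_model = TFBertModel.from_pretrained("bert-base-uncased")
    # Element 0 is the per-token sequence output; indexing works for tuple and dict-style returns
    seq_output = bert_model(token_inputs, attention_mask=mask_inputs)[0]
    # Per-token softmax over the 24 output classes
    output = Dense(24, activation='softmax')(seq_output)
    model = Model([token_inputs, mask_inputs], output)
    return model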
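
The shapes involved in the classification head can be sketched with NumPy alone. Random numbers stand in for the BERT sequence output here, and the weights are untrained; the hidden size of 768 is that of bert-base, while the batch size of 2 is an arbitrary choice for illustration.

```python
import numpy as np

rng = np.random.default_rng(0)
batch, max_len, hidden, num_classes = 2, 20, 768, 24

seq_output = rng.normal(size=(batch, max_len, hidden))  # stand-in for the BERT output
W = rng.normal(scale=0.02, size=(hidden, num_classes))  # untrained Dense kernel
b = np.zeros(num_classes)                               # Dense bias

# The Dense layer is applied to every token position independently
logits = seq_output @ W + b
logits -= logits.max(axis=-1, keepdims=True)  # subtract the max for numerical stability
probs = np.exp(logits) / np.exp(logits).sum(axis=-1, keepdims=True)

print(probs.shape)  # (2, 20, 24)
```

Each of the 20 positions gets its own probability distribution over the 24 classes, which is why the loss later treats every position as a separate classification example.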

The model definition is then instantiated by calling fe.build, which also associates the model with a specific optimizer.

In [9]:
model = fe.build(model_fn=ner_model, optimizer_fn=lambda: tf.optimizers.Adam(1e-5))

fe.Network takes a series of operators; here we feed our model to ModelOp along with its inputs and outputs. The Reshape ops then flatten the ground truth into a one-dimensional vector and the predictions into a two-dimensional (positions, classes) matrix before they are passed to the loss calculation.

In [10]:
network = fe.Network(ops=[
        ModelOp(model=model, inputs=["x", "x_masks"], outputs="y_pred"),
        Reshape(inputs="y", outputs="y", shape=(-1, )),
        Reshape(inputs="y_pred", outputs="y_pred", shape=(-1, 24)),
        CrossEntropy(inputs=("y_pred", "y"), outputs="loss"),
        UpdateOp(model=model, loss_name="loss")
    ])
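
The effect of the two Reshape ops can be sketched in NumPy. This assumes that the loss behaves like a sparse categorical cross-entropy averaged over all positions, which is an assumption about CrossEntropy rather than something stated here; the helper `sparse_cross_entropy` and the small shapes are illustrative only.

```python
import numpy as np

def sparse_cross_entropy(y_pred, y_true):
    # Mean negative log-probability assigned to the true class of each row
    picked = y_pred[np.arange(len(y_true)), y_true]
    return float(-np.log(picked).mean())

batch, max_len, num_classes = 2, 4, 3
y_pred = np.full((batch, max_len, num_classes), 1.0 / num_classes)  # uniform predictions
y = np.array([[0, 1, 2, 2],
              [1, 1, 0, 2]])                                        # integer labels

flat_y = y.reshape(-1)                      # shape (8,)
flat_pred = y_pred.reshape(-1, num_classes) # shape (8, 3)
loss = sparse_cross_entropy(flat_pred, flat_y)

print(flat_y.shape, flat_pred.shape)
print(loss)  # equals log(3) for uniform predictions over 3 classes
```

After flattening, every token position is one row, so a per-token tagging problem becomes an ordinary multi-class classification over batch * max_len examples.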

<h3>Step 3: Prepare Estimator and configure the training loop</h3>

Estimator takes four main arguments: network, pipeline, epochs, and traces. During training we want to track the accuracy metric and save the model with the minimum loss, so we define both as traces.

In [11]:
save_dir = tempfile.mkdtemp()
traces = [Accuracy(true_key="y", pred_key="y_pred"), BestModelSaver(model=model, save_dir=save_dir)]

In [12]:
estimator = fe.Estimator(network=network,
                         pipeline=pipeline,
                         epochs=epochs,
                         traces=traces)
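
One thing to keep in mind when reading the accuracy: the metric receives the flattened `y` and `y_pred`, which include the padded positions, so the reported value presumably counts them too. The sketch below illustrates the difference under that assumption; the `accuracy` helper, the `ignore_label` argument, and the toy arrays are hypothetical and not part of FastEstimator.

```python
import numpy as np

def accuracy(y_true, y_pred, ignore_label=None):
    # Fraction of matching positions, optionally skipping one label value
    y_true, y_pred = np.asarray(y_true), np.asarray(y_pred)
    if ignore_label is None:
        keep = np.ones(y_true.shape, dtype=bool)
    else:
        keep = y_true != ignore_label
    return float((y_true[keep] == y_pred[keep]).mean())

PAD = 3  # hypothetical padding label
y_true = np.array([1, 2, 0, PAD, PAD, PAD])
y_pred = np.array([1, 0, 0, PAD, PAD, PAD])

acc_all = accuracy(y_true, y_pred)                     # 5 of 6 positions match
acc_real = accuracy(y_true, y_pred, ignore_label=PAD)  # 2 of 3 real tokens match
print(acc_all, acc_real)
```

With short sentences and many padded positions, the accuracy over all positions can look noticeably better than the accuracy over real tokens.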

<h2>Training</h2>

In [13]:
estimator.fit()

    ______           __  ______     __  _                 __            
   / ____/___ ______/ /_/ ____/____/ /_(_)___ ___  ____ _/ /_____  _____
  / /_  / __ `/ ___/ __/ __/ / ___/ __/ / __ `__ \/ __ `/ __/ __ \/ ___/
 / __/ / /_/ (__  ) /_/ /___(__  ) /_/ / / / / / / /_/ / /_/ /_/ / /    
/_/    \__,_/____/\__/_____/____/\__/_/_/ /_/ /_/\__,_/\__/\____/_/     
                                                                        

FastEstimator-Start: step: 1; model_lr: 1e-05; 
FastEstimator-Train: step: 1; loss: 3.8005962; 
FastEstimator-Train: step: 100; loss: 0.40420213; steps/sec: 2.05; 
FastEstimator-Train: step: 125; epoch: 1; epoch_time: 72.91 sec; 
FastEstimator-ModelSaver: saved model to /tmp/tmpk1i5vjc2/model_best_loss.h5
FastEstimator-Eval: step: 125; epoch: 1; loss: 0.30054897; min_loss: 0.30054897; since_best: 0; accuracy: 0.9269; 
FastEstimator-Train: step: 200; loss: 0.22695072; steps/sec: 2.0; 
FastEstimator-Train: step: 250; epoch: 2; epoch_time: 62.35 sec; 
FastEs

<h2>Inferencing</h2>

Load the saved model weights using <i>fe.build</i>

In [14]:
model_name = 'model_best_loss.h5'
model_path = os.path.join(save_dir, model_name)
trained_model = fe.build(model_fn=ner_model, weights_path=model_path, optimizer_fn=lambda: tf.optimizers.Adam(1e-5))

Loaded model weights from /tmp/tmpk1i5vjc2/model_best_loss.h5


In [15]:
selected_idx = np.random.randint(1000)
print("Ground truth is: ", eval_data[selected_idx]['y'])

Ground truth is:  ['B-PER', 'I-PER', 'I-PER', 'I-PER']


Create a data dictionary for inference. The transform() function of Pipeline and Network applies all of their operations to the given data.

In [16]:
infer_data = {"x": eval_data[selected_idx]['x'], "y": eval_data[selected_idx]['y']}
data = pipeline.transform(infer_data, mode="infer")
data = network.transform(data, mode="infer")

Get the predictions using <i>feed_forward</i>

In [17]:
predictions = feed_forward(trained_model, [data["x"], data["x_masks"]], training=False)
predictions = np.array(predictions).reshape(max_len, 24)  # (positions, classes)
predictions = np.argmax(predictions, axis=-1)  # most likely class id per position

In [18]:
def get_key(val):
    # Reverse lookup in tag2idx; returns None for ids without a tag (e.g. the padding label)
    for key, value in tag2idx.items():
        if val == value:
            return key

In [19]:
print("Predictions: ", [get_key(pred) for pred in predictions])

Predictions:  ['B-PER', 'I-PER', 'I-PER', 'I-PER', None, None, None, None, None, None, None, None, None, None, None, None, None, None, None, None]
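
The trailing None entries correspond to padded positions, whose predicted id has no entry in tag2idx. One way to drop them is to keep only the positions where the attention mask is non-zero. The sketch below uses a hypothetical three-tag vocabulary and hand-written arrays in place of the real model output and of data["x_masks"], which would first need its batch dimension removed.

```python
import numpy as np

tag2idx = {"O": 0, "B-PER": 1, "I-PER": 2}    # hypothetical label vocabulary
idx2tag = {i: t for t, i in tag2idx.items()}  # inverse lookup
pad_label = len(tag2idx)                      # id used for padded label positions

predictions = np.array([1, 2, 2, 2, pad_label, pad_label])  # stand-in for argmax output
x_masks = np.array([1.0, 1.0, 1.0, 1.0, 0.0, 0.0])          # stand-in attention mask

kept = predictions[x_masks > 0]               # drop padded positions
tags = [idx2tag.get(int(p)) for p in kept]
print(tags)  # ['B-PER', 'I-PER', 'I-PER', 'I-PER']
```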
