# Named Entity Recognition using Neural Networks

## History

### 2020/9/3

I fixed some issues. Sorry for the inconvenience.

- Change the name of the pre-trained BERT model (`bert-base-japanese-whole-word-masking` -> `cl-tohoku/bert-base-japanese-whole-word-masking`)
- Update the `evaluate` function for the Transformers upgrade (v2.3.0 -> v3.1.0)
- Pin the version of Transformers to v3.1.0
- Reduce `batch_size` from 32 to 16 to avoid out-of-memory (OOM) errors

## Setup

In [None]:
%tensorflow_version 2.x

In [None]:
!pip install seqeval transformers==3.1.0

Collecting seqeval
  Downloading https://files.pythonhosted.org/packages/9d/2d/233c79d5b4e5ab1dbf111242299153f3caddddbb691219f363ad55ce783d/seqeval-1.2.2.tar.gz (43kB)
Collecting transformers==3.1.0
  Downloading https://files.pythonhosted.org/packages/ae/05/c8c55b600308dc04e95100dc8ad8a244dd800fe75dfafcf1d6348c6f6209/transformers-3.1.0-py3-none-any.whl (884kB)
Collecting sentencepiece!=0.1.92
  Downloading https://files.pythonhosted.org/packages/e5/2d/6d4ca4bef9a67070fa1cac508606328329152b1df10bdf31fb6e4e727894/sentencepiece-0.1.94-cp36-cp36m-manylinux2014_x86_64.whl (1.1MB)
Collecting sacremoses
  Downloading https://files.pythonhosted.org/packages/7d/34/09d19aff26edcc8eb2a01bed8e98f13a1537005d31e95233fd48216eed10/sacremoses-0.0.43.tar.gz (883kB)

In [None]:
!mkdir data
!mkdir models
!wget https://raw.githubusercontent.com/Hironsan/IOB2Corpus/master/ja.wikipedia.conll -P data/

--2020-11-11 02:08:30--  https://raw.githubusercontent.com/Hironsan/IOB2Corpus/master/ja.wikipedia.conll
Resolving raw.githubusercontent.com (raw.githubusercontent.com)... 151.101.0.133, 151.101.64.133, 151.101.128.133, ...
Connecting to raw.githubusercontent.com (raw.githubusercontent.com)|151.101.0.133|:443... connected.
HTTP request sent, awaiting response... 200 OK
Length: 1297592 (1.2M) [text/plain]
Saving to: ‘data/ja.wikipedia.conll’


2020-11-11 02:08:30 (21.9 MB/s) - ‘data/ja.wikipedia.conll’ saved [1297592/1297592]



### Hyper-parameters

In [None]:
batch_size = 32
epochs = 100
num_words = 15000

### Imports

In [None]:
import re

import numpy as np
import tensorflow as tf
from seqeval.metrics import classification_report
from sklearn.model_selection import train_test_split
from tensorflow.keras.callbacks import EarlyStopping, ModelCheckpoint
from tensorflow.keras.layers import Dense, Input, Embedding, LSTM, Bidirectional
from tensorflow.keras.models import Model, load_model
from tensorflow.keras.preprocessing.sequence import pad_sequences

## The dataset

### Load the [ja.wikipedia.conll](https://github.com/Hironsan/IOB2Corpus) corpus

In [None]:
def load_dataset(filename, encoding='utf-8'):
    """Loads sentences and labels from a file.

    The file format is tab-separated values: one word and its tag per line.
    A blank line is required at the end of a sentence.
    For example:
    ```
    EU	B-ORG
    rejects	O
    German	B-MISC
    call	O
    to	O
    boycott	O
    British	B-MISC
    lamb	O
    .	O

    Peter	B-PER
    Blackburn	I-PER
    ...
    ```

    Args:
        filename (str): path to the file.
        encoding (str): file encoding format.
    Returns:
        tuple(list, list): sentences (lists of words) and labels (lists of tags).
    Example:
        >>> filename = 'conll2003/en/ner/train.txt'
        >>> sents, labels = load_dataset(filename)
    """
    sents, labels = [], []
    words, tags = [], []
    with open(filename, encoding=encoding) as f:
        for line in f:
            line = line.rstrip()
            if line:
                word, tag = line.split('\t')
                words.append(word)
                tags.append(tag)
            else:
                sents.append(words)
                labels.append(tags)
                words, tags = [], []
        if words:
            sents.append(words)
            labels.append(tags)

    return sents, labels

In [None]:
x, y = load_dataset('./data/ja.wikipedia.conll')
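
To make the expected file layout concrete, here is a minimal sketch that parses the same tab-separated IOB2 layout from an in-memory string instead of a file. The helper name `parse_iob2` and the sample sentences are illustrative assumptions, not part of the notebook; unlike `load_dataset`, this sketch also skips repeated blank lines.

```python
def parse_iob2(text):
    """Split tab-separated IOB2 text into (sentences, labels); hypothetical helper."""
    sents, labels = [], []
    words, tags = [], []
    for line in text.splitlines():
        line = line.rstrip()
        if line:
            word, tag = line.split('\t')
            words.append(word)
            tags.append(tag)
        elif words:
            # A blank line closes the current sentence.
            sents.append(words)
            labels.append(tags)
            words, tags = [], []
    if words:
        # The last sentence may lack a trailing blank line.
        sents.append(words)
        labels.append(tags)
    return sents, labels


# Made-up sample: two sentences separated by a blank line.
sample = "EU\tB-ORG\nrejects\tO\n\nPeter\tB-PER\nBlackburn\tI-PER\n"
sents, labels = parse_iob2(sample)
print(sents)   # [['EU', 'rejects'], ['Peter', 'Blackburn']]
print(labels)  # [['B-ORG', 'O'], ['B-PER', 'I-PER']]
```

Each sentence ends up as one list of words with a parallel list of tags, which is the shape the preprocessing code expects.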

### Preprocess the dataset

In [None]:
class Vocab:

    def __init__(self, num_words=None, lower=True, oov_token=None):
        self.tokenizer = tf.keras.preprocessing.text.Tokenizer(
            num_words=num_words,
            oov_token=oov_token,
            filters='',
            lower=lower,
            split='\t'
        )

    def fit(self, sequences):
        texts = self._texts(sequences)
        self.tokenizer.fit_on_texts(texts)
        return self

    def encode(self, sequences):
        texts = self._texts(sequences)
        return self.tokenizer.texts_to_sequences(texts)

    def decode(self, sequences):
        texts = self.tokenizer.sequences_to_texts(sequences)
        return [text.split(' ') for text in texts]

    def _texts(self, sequences):
        return ['\t'.join(words) for words in sequences]

    def get_index(self, word):
        return self.tokenizer.word_index.get(word)

    @property
    def size(self):
        """Return vocabulary size."""
        return len(self.tokenizer.word_index) + 1

    def save(self, file_path):
        with open(file_path, 'w') as f:
            config = self.tokenizer.to_json()
            f.write(config)

    @classmethod
    def load(cls, file_path):
        with open(file_path) as f:
            tokenizer = tf.keras.preprocessing.text.tokenizer_from_json(f.read())
            vocab = cls()
            vocab.tokenizer = tokenizer
        return vocab


def normalize_number(text, reduce=True):
    if reduce:
        normalized_text = re.sub(r'\d+', '0', text)
    else:
        normalized_text = re.sub(r'\d', '0', text)
    return normalized_text


def preprocess_dataset(sequences):
    sequences = [[normalize_number(w) for w in words] for words in sequences]
    return sequences


def create_dataset(sequences, vocab):
    sequences = vocab.encode(sequences)
    sequences = pad_sequences(sequences, padding='post')
    return sequences

In [None]:
x = preprocess_dataset(x)
x_train, x_test, y_train, y_test = train_test_split(x, y, test_size=0.2, random_state=42)
source_vocab = Vocab(num_words=num_words, oov_token='<UNK>').fit(x_train)
target_vocab = Vocab(lower=False).fit(y_train)
x_train = create_dataset(x_train, source_vocab)
y_train = create_dataset(y_train, target_vocab)
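
The preprocessing above can be hard to picture through the Keras `Tokenizer` wrapper, so the following is a simplified pure-Python stand-in for what it does: digits are normalized, words are ranked by frequency, index 0 is left for padding, index 1 goes to the out-of-vocabulary token, and sequences are padded at the end. The helper names `build_index`, `encode`, and `pad_post` and the toy sentences are assumptions for illustration only.

```python
import re
from collections import Counter


def normalize_number(text):
    # Collapse every run of digits into a single '0'.
    return re.sub(r'\d+', '0', text)


def build_index(sentences, num_words=None, oov_token='<UNK>'):
    """Rank words by frequency; 0 is left for padding and 1 goes to the OOV token."""
    counts = Counter(w.lower() for words in sentences for w in words)
    index = {oov_token: 1}
    for rank, (word, _) in enumerate(counts.most_common(), start=2):
        if num_words is not None and rank >= num_words:
            break  # Rarer words fall back to the OOV index.
        index[word] = rank
    return index


def encode(sentences, index, oov_token='<UNK>'):
    unk = index[oov_token]
    return [[index.get(w.lower(), unk) for w in words] for words in sentences]


def pad_post(sequences, value=0):
    # Pad every sequence at the end up to the longest length.
    max_len = max(len(s) for s in sequences)
    return [s + [value] * (max_len - len(s)) for s in sequences]


# Made-up training sentences.
train = [['In', '1998', 'Tokyo'], ['Tokyo', 'grew']]
train = [[normalize_number(w) for w in words] for words in train]
index = build_index(train)
# 'Osaka' was never seen during fitting, so it maps to the OOV index 1.
padded = pad_post(encode(train + [['Osaka']], index))
print(padded)  # [[3, 4, 2], [2, 5, 0], [1, 0, 0]]
```

Fitting the vocabulary on the training split only, as the cell above does, means that unseen test words are handled the same way as `'Osaka'` here.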

## The models

### Build the models

In [None]:
class UnidirectionalModel:

    def __init__(self, input_dim, output_dim, emb_dim=100, hid_dim=100, embeddings=None):
        self.input = Input(shape=(None,), name='input')
        if embeddings is None:
            self.embedding = Embedding(input_dim=input_dim,
                                       output_dim=emb_dim,
                                       mask_zero=True,
                                       name='embedding')
        else:
            self.embedding = Embedding(input_dim=embeddings.shape[0],
                                       output_dim=embeddings.shape[1],
                                       mask_zero=True,
                                       weights=[embeddings],
                                       name='embedding')
        self.lstm = LSTM(hid_dim,
                         return_sequences=True,
                         name='lstm')
        self.fc = Dense(output_dim, activation='softmax')

    def build(self):
        x = self.input
        embedding = self.embedding(x)
        lstm = self.lstm(embedding)
        y = self.fc(lstm)
        return Model(inputs=x, outputs=y)


class BidirectionalModel:

    def __init__(self, input_dim, output_dim, emb_dim=100, hid_dim=100, embeddings=None):
        self.input = Input(shape=(None,), name='input')
        if embeddings is None:
            self.embedding = Embedding(input_dim=input_dim,
                                       output_dim=emb_dim,
                                       mask_zero=True,
                                       name='embedding')
        else:
            self.embedding = Embedding(input_dim=embeddings.shape[0],
                                       output_dim=embeddings.shape[1],
                                       mask_zero=True,
                                       weights=[embeddings],
                                       name='embedding')
        lstm = LSTM(hid_dim,
                    return_sequences=True,
                    name='lstm')
        self.bilstm = Bidirectional(lstm, name='bilstm')
        self.fc = Dense(output_dim, activation='softmax')

    def build(self):
        x = self.input
        embedding = self.embedding(x)
        lstm = self.bilstm(embedding)
        y = self.fc(lstm)
        return Model(inputs=x, outputs=y)

In [None]:
models = [
    UnidirectionalModel(num_words, target_vocab.size).build(),
    BidirectionalModel(num_words, target_vocab.size).build(),
]
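
As a rough size comparison of the two architectures, the sketch below counts trainable parameters by hand. It assumes the standard Keras LSTM parameterization (four gates, each with input weights, recurrent weights, and a bias) and the default behavior of `Bidirectional`, which concatenates both directions. The value `num_labels = 24` is hypothetical; the notebook uses `target_vocab.size`.

```python
def lstm_params(input_dim, units):
    # Four gates, each with input weights, recurrent weights, and a bias.
    return 4 * (units * (input_dim + units) + units)


def dense_params(input_dim, units):
    # Weight matrix plus bias vector.
    return input_dim * units + units


def count_params(vocab_size, num_labels, emb_dim=100, hid_dim=100, bidirectional=False):
    directions = 2 if bidirectional else 1
    embedding = vocab_size * emb_dim
    recurrent = directions * lstm_params(emb_dim, hid_dim)
    # With concatenation, the classifier sees 2 * hid_dim features per token.
    classifier = dense_params(directions * hid_dim, num_labels)
    return embedding + recurrent + classifier


num_labels = 24  # hypothetical; the notebook uses target_vocab.size
uni = count_params(15000, num_labels)
bi = count_params(15000, num_labels, bidirectional=True)
print(uni, bi)  # 1582824 1665624
```

Under these assumptions the embedding table dominates both models, so the bidirectional variant adds only about 5% more parameters.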

### Train the models

In [None]:
model_path = 'models/model_{}'
for i, model in enumerate(models):
    model.compile(optimizer='adam', loss='sparse_categorical_crossentropy')

    # Preparing callbacks.
    callbacks = [
        EarlyStopping(patience=3),
        ModelCheckpoint(model_path.format(i), save_best_only=True)
    ]

    # Train the model.
    model.fit(x=x_train,
              y=y_train,
              batch_size=batch_size,
              epochs=epochs,
              validation_split=0.1,
              callbacks=callbacks,
              shuffle=True)

Epoch 1/100
Instructions for updating:
This property should not be used in TensorFlow 2.0, as updates are applied automatically.
INFO:tensorflow:Assets written to: models/model_0/assets
Epoch 2/100
Epoch 3/100
Epoch 4/100
Epoch 5/100
Epoch 6/100
Epoch 7/100
Epoch 8/100
Epoch 9/100
Epoch 10/100
Epoch 11/100
Epoch 12/100
Epoch 13/100
Epoch 14/100
Epoch 15/100
Epoch 16/100
Epoch 17/100
Epoch 18/100
Epoch 19/100
Epoch 20/100
Epoch 21/100
Epoch 22/100
Epoch 23/100
Epoch 24/100
Epoch 25/100
Epoch 26/100
Epoch 27/100
Epoch 28/100
Epoch 1/100
Epoch 2/100
Epoch 3/100
Epoch 4/100
Epoch 5/100
Epoch 6/100
Epoch 7/100
Epoch 8/100
Epoch 9/100
Epoch 10/100
Epoch 11/100
Epoch 12/100
Epoch 13/100
Epoch 14/100
Epoch 15/100
Epoch 16/100
Epoch 17/100
Epoch 18/100
Epoch 19/100
Epoch 20/100
Epoch 21/100
Epoch 22/100
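
The logs above end at epoch 28 and epoch 22 rather than 100, which is consistent with `EarlyStopping(patience=3)` halting training once the validation loss stops improving. The sketch below mimics that rule in plain Python; the helper name and the loss values are made up for illustration.

```python
def early_stopping_epoch(val_losses, patience=3):
    """Return (stopped_epoch, best_epoch), both 1-based; hypothetical helper."""
    best = float('inf')
    best_epoch = 0
    wait = 0
    for epoch, loss in enumerate(val_losses, start=1):
        if loss < best:
            # Improvement: remember it and reset the counter.
            best, best_epoch, wait = loss, epoch, 0
        else:
            wait += 1
            if wait >= patience:
                return epoch, best_epoch
    return len(val_losses), best_epoch


val_losses = [0.90, 0.70, 0.60, 0.65, 0.61, 0.62, 0.50]  # made-up values
stopped, best = early_stopping_epoch(val_losses)
print(stopped, best)  # 6 3
```

In this toy run, training stops at epoch 6 and never sees the better loss at epoch 7. Because `ModelCheckpoint` is used with `save_best_only=True`, the saved model corresponds to the best epoch rather than the last one.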


### Evaluate the models

In [None]:
class InferenceAPI:
    """A model API that predicts a tag sequence for each input sentence.

    Attributes:
        model: trained Keras model.
        source_vocab: vocabulary that maps words to IDs.
        target_vocab: vocabulary that maps tags to IDs.
    """

    def __init__(self, model, source_vocab, target_vocab):
        self.model = model
        self.source_vocab = source_vocab
        self.target_vocab = target_vocab

    def predict_from_sequences(self, sequences):
        lengths = [len(words) for words in sequences]
        sequences = self.source_vocab.encode(sequences)
        sequences = pad_sequences(sequences, padding='post')
        y_pred = self.model.predict(sequences)
        y_pred = np.argmax(y_pred, axis=-1)
        y_pred = self.target_vocab.decode(y_pred)
        # Drop the predictions for padded positions.
        y_pred = [tags[:length] for tags, length in zip(y_pred, lengths)]
        return y_pred
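
The decoding step can be sketched without Keras or NumPy: take the most probable tag ID at each position, then keep only as many tags as the sentence has words. The helper `decode_predictions`, the tag mapping, and the probabilities below are hypothetical; this sketch trims before looking up tag names, whereas the class above decodes first.

```python
def decode_predictions(probabilities, lengths, index_to_tag):
    """Pick the most probable tag per position and drop padded positions."""
    decoded = []
    for sentence_probs, length in zip(probabilities, lengths):
        # Plain-Python argmax over the tag distribution of each position.
        tag_ids = [max(range(len(p)), key=p.__getitem__) for p in sentence_probs]
        decoded.append([index_to_tag[i] for i in tag_ids[:length]])
    return decoded


index_to_tag = {1: 'O', 2: 'B-PERSON', 3: 'I-PERSON'}  # hypothetical mapping; 0 is padding
probabilities = [
    [[0.0, 0.1, 0.8, 0.1], [0.0, 0.2, 0.1, 0.7], [0.0, 0.9, 0.05, 0.05]],
    [[0.0, 0.7, 0.2, 0.1], [0.9, 0.1, 0.0, 0.0], [0.9, 0.1, 0.0, 0.0]],
]
lengths = [3, 1]  # the second sentence has one real word and two padded slots
decoded = decode_predictions(probabilities, lengths, index_to_tag)
print(decoded)  # [['B-PERSON', 'I-PERSON', 'O'], ['O']]
```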

In [None]:
model_names = ['Unidirectional Model', 'Bidirectional Model']
for i, model_name in enumerate(model_names):
    model = load_model(model_path.format(i))
    api = InferenceAPI(model, source_vocab, target_vocab)
    y_pred = api.predict_from_sequences(x_test)
    print(model_name)
    print(classification_report(y_test, y_pred, digits=4))
    print()

Unidirectional Model


              precision    recall  f1-score   support

    ARTIFACT     0.1274    0.1299    0.1286       154
        DATE     0.4012    0.8444    0.5440       315
       EVENT     0.0000    0.0000    0.0000        64
    LOCATION     0.5705    0.4924    0.5286       526
       MONEY     0.0000    0.0000    0.0000        12
      NUMBER     0.0776    0.1147    0.0926       218
ORGANIZATION     0.1804    0.1855    0.1829       248
       OTHER     0.0000    0.0000    0.0000        75
     PERCENT     0.0000    0.0000    0.0000        52
      PERSON     0.1353    0.0804    0.1008       224
        TIME     0.0000    0.0000    0.0000         5

   micro avg     0.3151    0.3349    0.3247      1893
   macro avg     0.1357    0.1679    0.1434      1893
weighted avg     0.2842    0.3349    0.2944      1893


Bidirectional Model
              precision    recall  f1-score   support

    ARTIFACT     0.1310    0.0714    0.0924       154
        DATE     0.8390    0.8603    0.8495       315
   

# BERT for Named Entity Recognition

# utils.py

In [None]:
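import numpy as np
from seqeval.metrics import classification_report


def evaluate(model, target_vocab, features, labels):
    # With Transformers v3.1.0, the first element of the prediction holds the per-token scores.
    label_ids = model.predict(features)[0]
    label_ids = np.argmax(label_ids, axis=-1)
    y_pred = [[] for _ in range(label_ids.shape[0])]
    y_true = [[] for _ in range(label_ids.shape[0])]
    for i in range(label_ids.shape[0]):
        for j in range(label_ids.shape[1]):
            # Label 0 marks padding, special tokens, and non-initial subwords; skip them.
            if labels[i][j] == 0:
                continue
            y_pred[i].append(label_ids[i][j])
            y_true[i].append(labels[i][j])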
    y_pred = target_vocab.decode(y_pred)
    y_true = target_vocab.decode(y_true)
    print(classification_report(y_true, y_pred, digits=4))
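
`classification_report` from seqeval scores predictions at the entity level: a predicted entity counts as correct only if both its type and its exact span match the gold annotation. The following is a simplified sketch of that idea, not seqeval's implementation. The helper names are hypothetical, and an `I-` tag that does not continue an entity is simply ignored here.

```python
def extract_entities(tags):
    """Return a set of (type, start, end) spans from one IOB2 tag sequence."""
    entities, start, etype = set(), None, None
    for i, tag in enumerate(tags + ['O']):  # the sentinel closes a trailing entity
        inside = tag.startswith('I-') and etype == tag[2:]
        if not inside:
            if etype is not None:
                entities.add((etype, start, i))
            if tag.startswith('B-'):
                start, etype = i, tag[2:]
            else:
                start, etype = None, None
    return entities


def entity_f1(y_true, y_pred):
    true_spans, pred_spans = set(), set()
    for sent_id, (t, p) in enumerate(zip(y_true, y_pred)):
        # Prefix the sentence ID so spans from different sentences never collide.
        true_spans |= {(sent_id,) + e for e in extract_entities(t)}
        pred_spans |= {(sent_id,) + e for e in extract_entities(p)}
    correct = len(true_spans & pred_spans)
    precision = correct / len(pred_spans) if pred_spans else 0.0
    recall = correct / len(true_spans) if true_spans else 0.0
    f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
    return precision, recall, f1


y_true = [['B-PERSON', 'I-PERSON', 'O', 'B-LOCATION']]
y_pred = [['B-PERSON', 'I-PERSON', 'O', 'B-DATE']]
scores = entity_f1(y_true, y_pred)
print(scores)  # (0.5, 0.5, 0.5)
```

Token-level accuracy would be 75% in this example, while the entity-level scores are 50%, which is why entity-level numbers tend to look harsher.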

# preprocessing.py

In [None]:
import tensorflow as tf
from tensorflow.keras.preprocessing.sequence import pad_sequences


def convert_examples_to_features(x, y,
                                 vocab,
                                 max_seq_length,
                                 tokenizer):
    pad_token = 0
    features = {
        'input_ids': [],
        'attention_mask': [],
        'token_type_ids': [],
        'label_ids': []
    }
    for words, labels in zip(x, y):
        tokens = [tokenizer.cls_token]
        label_ids = [pad_token]
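        for word, label in zip(words, labels):
            word_tokens = tokenizer.tokenize(word)
            if not word_tokens:
                # Keep tokens and labels aligned if the tokenizer returns nothing for a word.
                word_tokens = [tokenizer.unk_token]
            tokens.extend(word_tokens)
            label_id = vocab.get_index(label)
            # Only the first subword carries the label; the rest are masked with pad_token.
            label_ids.extend([label_id] + [pad_token] * (len(word_tokens) - 1))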
        tokens += [tokenizer.sep_token]

        input_ids = tokenizer.convert_tokens_to_ids(tokens)
        attention_mask = [1] * len(input_ids)
        token_type_ids = [pad_token] * max_seq_length

        features['input_ids'].append(input_ids)
        features['attention_mask'].append(attention_mask)
        features['token_type_ids'].append(token_type_ids)
        features['label_ids'].append(label_ids)

    for name in features:
        # Truncate at the end so that long sequences keep their leading [CLS] token.
        features[name] = pad_sequences(features[name], padding='post',
                                       truncating='post', maxlen=max_seq_length)

    x = [features['input_ids'], features['attention_mask'], features['token_type_ids']]
    y = features['label_ids']
    return x, y
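
The key step in this function is aligning word-level labels with subword tokens: the first subword of each word keeps the label ID, and the remaining subwords get the padding ID 0 so that the loss and the evaluation ignore them. The sketch below shows this with a toy tokenizer. `align_labels`, `toy_tokenize`, and the label IDs are hypothetical stand-ins, not the Transformers API, and the sketch adds an explicit 0 for `[SEP]` where the notebook relies on padding.

```python
def align_labels(words, label_ids, tokenize, pad_id=0):
    """Expand word-level label IDs to subword level; hypothetical helper."""
    tokens, aligned = ['[CLS]'], [pad_id]
    for word, label_id in zip(words, label_ids):
        # Fall back to an unknown token so tokens and labels stay aligned.
        pieces = tokenize(word) or ['[UNK]']
        tokens.extend(pieces)
        # Only the first subword carries the label.
        aligned.extend([label_id] + [pad_id] * (len(pieces) - 1))
    tokens.append('[SEP]')
    aligned.append(pad_id)
    return tokens, aligned


def toy_tokenize(word, size=2):
    # Stand-in for WordPiece: fixed-size chunks with '##' continuation marks.
    chunks = [word[i:i + size] for i in range(0, len(word), size)]
    return chunks[:1] + ['##' + c for c in chunks[1:]]


# Hypothetical label IDs: 3 for a location tag, 1 for 'O'.
tokens, aligned = align_labels(['東京都', 'に', '行く'], [3, 1, 1], toy_tokenize)
print(tokens)   # ['[CLS]', '東京', '##都', 'に', '行く', '[SEP]']
print(aligned)  # [0, 3, 0, 1, 1, 0]
```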

# model.py

In [None]:
import tensorflow as tf
from transformers import TFBertForTokenClassification, BertConfig


def build_model(pretrained_model_name_or_path, num_labels):
    config = BertConfig.from_pretrained(
        pretrained_model_name_or_path,
        num_labels=num_labels
    )
    model = TFBertForTokenClassification.from_pretrained(
        pretrained_model_name_or_path,
        config=config
    )
    # The classifier outputs logits by default; apply softmax so that the loss receives probabilities.
    model.layers[-1].activation = tf.keras.activations.softmax
    return model


def loss_func(num_labels):
    loss_fct = tf.keras.losses.SparseCategoricalCrossentropy(reduction=tf.keras.losses.Reduction.NONE)

    def loss(y_true, y_pred):
        # Positions labeled 0 (padding, special tokens, non-initial subwords) are excluded from the loss.
        input_mask = tf.not_equal(y_true, 0)
        logits = tf.reshape(y_pred, (-1, num_labels))
        active_loss = tf.reshape(input_mask, (-1,))
        active_logits = tf.boolean_mask(logits, active_loss)
        train_labels = tf.reshape(y_true, (-1,))
        active_labels = tf.boolean_mask(train_labels, active_loss)
        cross_entropy = loss_fct(active_labels, active_logits)
        return cross_entropy
    return loss
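
To illustrate what the masked loss computes, here is a pure-Python sketch of the same idea: for every position whose label is not 0, take the negative log of the probability assigned to the true label, and skip everything else. The function name and the numbers are made up, and the sketch works on nested lists rather than tensors.

```python
import math


def masked_cross_entropy(y_true, y_pred, pad_id=0):
    """Per-token negative log-likelihood, skipping positions labeled pad_id."""
    losses = []
    for labels, probs in zip(y_true, y_pred):
        for label, dist in zip(labels, probs):
            if label == pad_id:
                continue  # Masked position: contributes nothing to the loss.
            losses.append(-math.log(dist[label]))
    return losses


y_true = [[0, 2, 0, 1]]  # only positions 1 and 3 carry real labels
y_pred = [[
    [0.25, 0.25, 0.5],
    [0.1, 0.1, 0.8],
    [0.3, 0.3, 0.4],
    [0.2, 0.5, 0.3],
]]
losses = masked_cross_entropy(y_true, y_pred)
print([round(value, 4) for value in losses])  # [0.2231, 0.6931]
```

Only two of the four positions contribute, so masked subwords and padding cannot dilute or distort the training signal.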

# Training

In [None]:
from sklearn.model_selection import train_test_split
from tensorflow.keras.callbacks import EarlyStopping
from transformers import BertJapaneseTokenizer


def main():
    # Set hyper-parameters.
    batch_size = 16
    epochs = 100
    model_path = 'models/'
    pretrained_model_name_or_path = 'cl-tohoku/bert-base-japanese-whole-word-masking'
    maxlen = 250

    # Data loading.
    x, y = load_dataset('./data/ja.wikipedia.conll')
    # The corpus is already split into words, so skip the tokenizer's word segmentation.
    tokenizer = BertJapaneseTokenizer.from_pretrained(pretrained_model_name_or_path, do_word_tokenize=False)

    # Pre-processing.
    x = preprocess_dataset(x)
    x_train, x_test, y_train, y_test = train_test_split(x, y, test_size=0.2, random_state=42)
    target_vocab = Vocab(lower=False).fit(y_train)
    features_train, labels_train = convert_examples_to_features(
        x_train,
        y_train,
        target_vocab,
        max_seq_length=maxlen,
        tokenizer=tokenizer
    )
    features_test, labels_test = convert_examples_to_features(
        x_test,
        y_test,
        target_vocab,
        max_seq_length=maxlen,
        tokenizer=tokenizer
    )

    # Build model.
    model = build_model(pretrained_model_name_or_path, target_vocab.size)
    model.compile(optimizer='sgd', loss=loss_func(target_vocab.size))

    # Preparing callbacks.
    callbacks = [
        EarlyStopping(patience=3),
    ]

    # Train the model.
    model.fit(x=features_train,
              y=labels_train,
              batch_size=batch_size,
              epochs=epochs,
              validation_split=0.1,
              callbacks=callbacks,
              shuffle=True)
    model.save_pretrained(model_path)

    # Evaluation.
    evaluate(model, target_vocab, features_test, labels_test)


if __name__ == '__main__':
    main()

Some weights of the model checkpoint at cl-tohoku/bert-base-japanese-whole-word-masking were not used when initializing TFBertForTokenClassification: ['nsp___cls', 'mlm___cls']
- This IS expected if you are initializing TFBertForTokenClassification from the checkpoint of a model trained on another task or with another architecture (e.g. initializing a BertForSequenceClassification model from a BertForPretraining model).
- This IS NOT expected if you are initializing TFBertForTokenClassification from the checkpoint of a model that you expect to be exactly identical (initializing a BertForSequenceClassification model from a BertForSequenceClassification model).
Some weights of TFBertForTokenClassification were not initialized from the model checkpoint at cl-tohoku/bert-base-japanese-whole-word-masking and are newly initialized: ['dropout_37', 'classifier']
You should probably TRAIN this model on a down-stream task to be able to use it for predictions and inference.


Epoch 1/100
Epoch 2/100
Epoch 3/100
Epoch 4/100
Epoch 5/100
Epoch 6/100
Epoch 7/100
Epoch 8/100
Epoch 9/100
Epoch 10/100
Epoch 11/100
Epoch 12/100
Epoch 13/100
Epoch 14/100
Epoch 15/100
Epoch 16/100
Epoch 17/100
Epoch 18/100
Epoch 19/100
Epoch 20/100
Epoch 21/100
Epoch 22/100
Epoch 23/100
Epoch 24/100
Epoch 25/100
Epoch 26/100
Epoch 27/100
Epoch 28/100
Epoch 29/100
Epoch 30/100
Epoch 31/100
Epoch 32/100
Epoch 33/100
Epoch 34/100
              precision    recall  f1-score   support

       OTHER     0.4000    0.2933    0.3385        75
    LOCATION     0.7688    0.7776    0.7732       526
ORGANIZATION     0.5277    0.6532    0.5838       248
      PERSON     0.7951    0.8661    0.8291       224
        DATE     0.7798    0.8095    0.7944       315
       EVENT     0.3919    0.4531    0.4203        64
    ARTIFACT     0.3716    0.4416    0.4036       154
     PERCENT     0.0000    0.0000    0.0000        52
        TIME     0.0000    0.0000    0.0000         5
      NUMBER     0.4200   