## Fine Tunning the model "cl-tohoku/bert-base-japanese-whole-word-masking"

> Originated from https://github.com/ken11/bert-japanese-ner-finetuning/blob/master/bert-japanese-ner-finetuning-tohoku.ipynb

This notebook is a modified version from above link.
The major differences are:
- do the IOB tagging manually here, instead of calling external lib
- convert tokenizer into TokenizerFast, in order to use newer methods

## Install dependencies

> `Remarks`: it is recommened to use virtual environment to install package instead of installing globally, if executing locally.
You can refer to [the post here](https://code.visualstudio.com/docs/datascience/jupyter-notebooks#_setting-up-your-environment) to see how to setup environment for using jupyter

In [3]:
# TO check which pip3 you are using
!which pip3

/Users/leung.tsz.kit/Desktop/work/code/test/bert_fine_tuning_example_ner/.venv/bin/pip3


Note that the transformers version is important here, because the newer version is not compatible with the TokenizerFast we use later.

In [4]:
!pip3 install --upgrade pip
!pip3 install torch
!pip3 install 'transformers[ja]==4.15.0' numpy sklearn seqeval pandas
!pip3 install wget



## Detect any GPU

Below parts detect if GPU is available, and use CPU if not.

In [5]:
# If you want to avoid using GPU, change below value to False
use_gpu = False

In [6]:
import torch

# If there's a GPU available...
if use_gpu and torch.cuda.is_available():    

    # Tell PyTorch to use the GPU.    
    device = torch.device("cuda")

    print('There are %d GPU(s) available.' % torch.cuda.device_count())

    print('We will use the GPU:', torch.cuda.get_device_name(0))

# If not...
else:
    print('Using the CPU.')
    device = torch.device("cpu")

Using the CPU.


## Download training data
We are using this public data : [ストックマーク株式会社が公開しているner-wikipedia-dataset](https://github.com/stockmarkteam/ner-wikipedia-dataset)

In [7]:
model_output_dir = "./dest"
model_name = "cl-tohoku/bert-base-japanese-whole-word-masking"

In [8]:
import wget
import os
import pprint as pp


# The URL for the dataset zip file.
url = "https://github.com/stockmarkteam/ner-wikipedia-dataset/raw/main/ner.json"


# Download the file (if we haven't already)
if not os.path.exists('./ner.json'):
    wget.download(url, './ner.json')


Let's take a look the data.

In [9]:
!head -15 ner.json

[
    {
        "curid": "3572156",
        "text": "SPRiNGSと最も仲の良いライバルグループ。",
        "entities": [
            {
                "name": "SPRiNGS",
                "span": [
                    0,
                    7
                ],
                "type": "その他の組織名"
            }
        ]
    },


Let's store the data in a pandas dataframe

In [10]:
import pandas as pd

json_df = pd.read_json("./ner.json")
# We only need these two column
json_df = json_df[["text","entities"]]
# lets take a look some sample rows
json_df.sample(10)

Unnamed: 0,text,entities
445,流域一体は水田が多く、取水堰として樋遣川堰が設置され農業用水として利用されている一方で、同時...,"[{'name': '樋遣川堰', 'span': [17, 21], 'type': '施..."
2251,但し手続きの関係から佐藤が第5代自由民主党総裁に就任したのは12月1日であった。,"[{'name': '佐藤', 'span': [10, 12], 'type': '人名'..."
1425,このアルバムの中には、ジェイク・ギレンホールについて書いた曲があるという。,"[{'name': 'ジェイク・ギレンホール', 'span': [11, 22], 'ty..."
1448,アメリカ合衆国において指揮活動に入り、まずはジョージ・セルに弟子入りし、その後レオポルド・ス...,"[{'name': 'アメリカ合衆国', 'span': [0, 7], 'type': '..."
2891,黒澤も久世も口に出すと後に引かない性格で、ついに柴山撮影所長が乗り出して、久世の殺陣が実現す...,"[{'name': '黒澤', 'span': [0, 2], 'type': '人名'},..."
4326,およびそれを利用した物品。,[]
4157,それにより、同年のJ1優勝チームである鹿島アントラーズは出場できなかった。,"[{'name': 'J1', 'span': [9, 11], 'type': 'その他の..."
437,またペスタロッチは主に初等教育分野に貢献したのに対し、彼の影響を受けたフリードリヒ・フレーベ...,"[{'name': 'ペスタロッチ', 'span': [2, 8], 'type': '人..."
683,Lions Styleはプロ野球・パシフィック・リーグの埼玉西武ライオンズの情報誌である。,"[{'name': 'Lions Style', 'span': [0, 11], 'typ..."
2246,その後500万以下の条件戦に5回出走するが勝利できなかった。,[]



## Fine-tuning the model



### Load the tokenizer from pre-trained model.

> (From HuggingFace) There will be a warning about some of the pretrained weights not being used and some weights being randomly initialized. That’s because we are throwing away the pretraining head of the BERT model to replace it with a classification head which is randomly initialized. We will fine-tune this model on our task, transferring the knowledge of the pretrained model to it (which is why doing this is called transfer learning).

In [11]:
from transformers import AutoTokenizer, AutoModelForTokenClassification

tokenizer = AutoTokenizer.from_pretrained(model_name)
# We will fine-tune the model for NER task, so we use AutoModelForTokenClassification here
model = AutoModelForTokenClassification.from_pretrained(model_name)


Some weights of the model checkpoint at cl-tohoku/bert-base-japanese-whole-word-masking were not used when initializing BertForTokenClassification: ['cls.predictions.bias', 'cls.predictions.transform.LayerNorm.weight', 'cls.predictions.transform.dense.bias', 'cls.seq_relationship.weight', 'cls.seq_relationship.bias', 'cls.predictions.transform.LayerNorm.bias', 'cls.predictions.transform.dense.weight', 'cls.predictions.decoder.weight']
- This IS expected if you are initializing BertForTokenClassification from the checkpoint of a model trained on another task or with another architecture (e.g. initializing a BertForSequenceClassification model from a BertForPreTraining model).
- This IS NOT expected if you are initializing BertForTokenClassification from the checkpoint of a model that you expect to be exactly identical (initializing a BertForSequenceClassification model from a BertForSequenceClassification model).
Some weights of BertForTokenClassification were not initialized from the m

## Build a new Tokenizer

Normally we don't need a new tokenizer, but we want to try some methods only available under TokenizerFast later on.
So we need to "convert" the tokenizer into a TokenizerFast.
Read more about tokenizer base class [here](https://huggingface.co/docs/transformers/main_classes/tokenizer)

Because the BertJapaneseTokenizer that we loaded cannot be converted into TokenizerFast directly.
So we will export it into json and create a new one from it.


In [12]:
# Export the vocab text.
# Note: the model is not new enough to export a json file...
tokenizer.save_vocabulary("./")

('./vocab.txt',)

In [13]:
from tokenizers import BertWordPieceTokenizer

# Build a tokenizer from vocab txt
# Ref: https://huggingface.co/docs/tokenizers/python/latest/quicktour.html#importing-a-pretrained-tokenizer-from-legacy-vocabulary-files
tokenizer = BertWordPieceTokenizer("./vocab.txt")

Convert BertWordPieceTokenizer into TokenizerFast

In [14]:
from transformers import PreTrainedTokenizerFast
# https://huggingface.co/docs/transformers/v4.15.0/en/fast_tokenizers#loading-directly-from-the-tokenizer-object
fast_tokenizer = PreTrainedTokenizerFast(tokenizer_object=tokenizer)
# The pad token needs to be set explicitly
fast_tokenizer.add_special_tokens({'pad_token': '[PAD]'})

0

## Test the tokenization & methods

This code block below is not necessary, just to show you around some values and test whether we can use methods from pretrainedTokenizerFast.

In [15]:
from pprint import pprint as pp
# try
def test(i):
    sen = json_df["text"][i]
    print("Length of the sentence: ", len(sen))
    # Tokenize a sentence
    test_text = fast_tokenizer(sen, return_offsets_mapping=True, return_length=True)

    # Let's take a look some keys returned
    pp(test_text, depth=2, compact=True)
    print(" ------- ------- ------- ")

    # The offset_mapping is what we will use later, to align the NER tag to tokenized text
    print("offset_mapping: ")
    pp(test_text["offset_mapping"][:10], compact=True)
    print(" ------- ------- ------- ")

    print("Original sentence : ", sen)
    print("Original sentence size: ", len(sen))
    print(" ------- ------- ------- ")

    # How the sentence is tokenized
    print("token list : ", test_text.tokens(batch_index=0))
    print("token size : ", len(test_text.tokens(batch_index=0)))
    # The numeric representation of tokens above
    print("input ids : ", test_text.input_ids)
    print(" ------- ------- ------- ")

    # The corresponding word level index. Would be useful if your input sentence was list[Word]
    print("Word index : ", test_text.word_ids(batch_index=0))

    # The token type list, i.e. which sentence it belongs to, but we will not input pair of sentences here. 
    print("sentence id/sequence id/token type id : ", test_text.sequence_ids(batch_index=0))
    print(" ------- ------- ------- ")

    # Below shows how to do some converting
    char_i = 22
    print("The character at index ", char_i, " of the sentence is ", sen[char_i])
    print("Its token index is at ", test_text.char_to_token(0, char_i), " ,showing ", test_text.tokens(batch_index=0)[test_text.char_to_token(0, char_i)])


# Test with a sentence from json_df
test(1)

Length of the sentence:  39
{'attention_mask': [1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1,
                    1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1],
 'input_ids': [2, 23144, 660, 632, 910, 616, 136, 259, 9, 6, 122, 751, 409, 827,
               201, 1158, 280, 7, 108, 259, 11, 865, 609, 828, 760, 24733, 698,
               809, 16, 16071, 8, 3],
 'length': [32],
 'offset_mapping': [(...), (...), (...), (...), (...), (...), (...), (...),
                    (...), (...), (...), (...), (...), (...), (...), (...),
                    (...), (...), (...), (...), (...), (...), (...), (...),
                    (...), (...), (...), (...), (...), (...), (...), (...)],
 'token_type_ids': [0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
                    0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0]}
 ------- ------- ------- 
offset_mapping: 
[(0, 0), (0, 3), (3, 5), (5, 8), (8, 9), (9, 10), (10, 11), (11, 12), (12, 13),
 (13, 14)]
 ------- ------- ------- 
Original sentence

## Training data pre-processing

The raw data & annotation is like below:

- Original sentence:  
    ```
    レッドフォックス株式会社は、東京都千代田区に本社を置くITサービス企業である。
    ```

- Original sentence size:  
    ```
    39
    ```

- Original label (json_df["entities"][0]): 
    ```python
    [
        {'name': 'レッドフォックス株式会社', 'span': [0, 12], 'type': '法人名'},
        {'name': '東京都千代田区', 'span': [14, 21], 'type': '地名'}
    ]
    ```

We would like to convert it into below form (IOB format) that is easier to process later.

- Desired label structure (for 1 sentence):  
    (list with the same size as the original sentence )
    ```python
    ['B-法人名', 'I-法人名', 'I-法人名', 'I-法人名', 'I-法人名', 'I-法人名', 'I-法人名', 'I-法人名', 'I-法人名', 'I-法人名', 'I-法人名', 'I-法人名', 'I-法人名', 'O', 'B-地名', 'I-地名', 'I-地名', 'I-地名', 'I-地名', 'I-地名', 'I-地名', 'I-地名', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O']
    ```




In [16]:
# Show the sample of record
json_df["entities"][1]

[{'name': 'レッドフォックス株式会社', 'span': [0, 12], 'type': '法人名'},
 {'name': '東京都千代田区', 'span': [14, 21], 'type': '地名'}]

Below is an important step, that we modify the y label, and convert into IOB tagging.

In [17]:
from pprint import pprint as pp
## 1. Convert the raw annotation into desired label structure for model training
##    i.e. Restructure json_df["entities"] into the same shape of sentence
## 2. Build the NER tag mapping


# We will use IOB tagging scheme
# It will temporarily store the unique value of all labels
ner_tag_set = set()
# Add O <-- others
ner_tag_set.add("O")
# We only need list structure from now on. Pandas is not good at looping row.
json_list = json_df.to_dict('records')

for index, row in enumerate(json_list):
    # For each sentence at [index]
    # e.g. row: { text: xxxx, entities: [...]}

    # Form the label list as same size as the length of sentence, filled with O
    label_list = ["O"] * len(row["text"])
    for entity in row["entities"]:
        # For each annotation data for this sentence
        # e.g {'name': 'SPRiNGS', 'span': [0, 7], 'type': 'その他の組織名'}

        ## Add 2 variants to dictionary set for future use
        ner_tag_set.add("B-" + entity["type"])
        ner_tag_set.add("I-" + entity["type"])

        # the label location index at sentence
        start = entity["span"][0]
        end = entity["span"][1]

        # set the label in label_list
        label_list[start] = "B-" + entity["type"]
        label_list[start+1:end+1] = ["I-" + entity["type"]] * (end - start) # the remained 1 is assigned in next line

    # put it in the json_list
    row["label"] = label_list
    # we don't need this anymore
    del row["entities"]


print("NER tag set : ", ner_tag_set)
print("a row of structured list : ")
pp(json_list[1], compact=True)
print("The length of this label : ", len(json_list[1]["label"]))
print("The length of this sentence : ", len(json_list[1]["text"]))

NER tag set :  {'B-その他の組織名', 'B-製品名', 'I-地名', 'B-政治的組織名', 'I-政治的組織名', 'B-地名', 'I-人名', 'I-その他の組織名', 'I-施設名', 'B-人名', 'B-イベント名', 'O', 'I-製品名', 'I-イベント名', 'B-法人名', 'I-法人名', 'B-施設名'}
a row of structured list : 
{'label': ['B-法人名', 'I-法人名', 'I-法人名', 'I-法人名', 'I-法人名', 'I-法人名', 'I-法人名',
           'I-法人名', 'I-法人名', 'I-法人名', 'I-法人名', 'I-法人名', 'I-法人名', 'O', 'B-地名',
           'I-地名', 'I-地名', 'I-地名', 'I-地名', 'I-地名', 'I-地名', 'I-地名', 'O', 'O',
           'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O',
           'O'],
 'text': 'レッドフォックス株式会社は、東京都千代田区に本社を置くITサービス企業である。'}
The length of this label :  39
The length of this sentence :  39


In [18]:

## prepare the id to list, list to id mapper
## They will be used in configuration of trainer
## For the official doc, ref to https://huggingface.co/docs/transformers/main_classes/configuration#transformers.PretrainedConfig
ner_tag_map = list(ner_tag_set)
ner_tag_map.sort()
print("ner_tag_map with size: ", len(ner_tag_map))

label2id = { label: i for i, label in enumerate(ner_tag_map)}
print(label2id)

id2label = {v: k for k, v in label2id.items()}
print(id2label)


ner_tag_map with size:  17
{'B-その他の組織名': 0, 'B-イベント名': 1, 'B-人名': 2, 'B-地名': 3, 'B-政治的組織名': 4, 'B-施設名': 5, 'B-法人名': 6, 'B-製品名': 7, 'I-その他の組織名': 8, 'I-イベント名': 9, 'I-人名': 10, 'I-地名': 11, 'I-政治的組織名': 12, 'I-施設名': 13, 'I-法人名': 14, 'I-製品名': 15, 'O': 16}
{0: 'B-その他の組織名', 1: 'B-イベント名', 2: 'B-人名', 3: 'B-地名', 4: 'B-政治的組織名', 5: 'B-施設名', 6: 'B-法人名', 7: 'B-製品名', 8: 'I-その他の組織名', 9: 'I-イベント名', 10: 'I-人名', 11: 'I-地名', 12: 'I-政治的組織名', 13: 'I-施設名', 14: 'I-法人名', 15: 'I-製品名', 16: 'O'}


Split data into 3 sets: `train`, `valid`, `test`

In [19]:
from sklearn.model_selection import train_test_split

train_data, val_data = train_test_split(json_list, test_size=0.2, random_state=123)
train_data, test_data = train_test_split(train_data, test_size=0.1, random_state=123)

In [20]:
## Lets check the train data
pp(train_data[:1], compact=True)

[{'label': ['O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O',
            'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O',
            'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O',
            'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'B-人名', 'I-人名', 'I-人名',
            'I-人名', 'I-人名', 'I-人名', 'I-人名', 'I-人名', 'O', 'O', 'O'],
  'text': '公式サイトは、2004年11月1日にリニューアルされており、デザインを担当したのはデザイナーのタケウエトモコである。'}]


## Load the pre-trained model
Load the pre-trained model and store the `label2id`, `id2label` we created from above.

In [21]:
from transformers import BertForTokenClassification, BertConfig

config = BertConfig.from_pretrained(model_name, label2id=label2id, id2label=id2label)
model = BertForTokenClassification.from_pretrained(model_name, config=config)

Some weights of the model checkpoint at cl-tohoku/bert-base-japanese-whole-word-masking were not used when initializing BertForTokenClassification: ['cls.predictions.bias', 'cls.predictions.transform.LayerNorm.weight', 'cls.predictions.transform.dense.bias', 'cls.seq_relationship.weight', 'cls.seq_relationship.bias', 'cls.predictions.transform.LayerNorm.bias', 'cls.predictions.transform.dense.weight', 'cls.predictions.decoder.weight']
- This IS expected if you are initializing BertForTokenClassification from the checkpoint of a model trained on another task or with another architecture (e.g. initializing a BertForSequenceClassification model from a BertForPreTraining model).
- This IS NOT expected if you are initializing BertForTokenClassification from the checkpoint of a model that you expect to be exactly identical (initializing a BertForSequenceClassification model from a BertForSequenceClassification model).
Some weights of BertForTokenClassification were not initialized from the m

In [22]:
# We could take a look some details of the model
print(model)

BertForTokenClassification(
  (bert): BertModel(
    (embeddings): BertEmbeddings(
      (word_embeddings): Embedding(32000, 768, padding_idx=0)
      (position_embeddings): Embedding(512, 768)
      (token_type_embeddings): Embedding(2, 768)
      (LayerNorm): LayerNorm((768,), eps=1e-12, elementwise_affine=True)
      (dropout): Dropout(p=0.1, inplace=False)
    )
    (encoder): BertEncoder(
      (layer): ModuleList(
        (0): BertLayer(
          (attention): BertAttention(
            (self): BertSelfAttention(
              (query): Linear(in_features=768, out_features=768, bias=True)
              (key): Linear(in_features=768, out_features=768, bias=True)
              (value): Linear(in_features=768, out_features=768, bias=True)
              (dropout): Dropout(p=0.1, inplace=False)
            )
            (output): BertSelfOutput(
              (dense): Linear(in_features=768, out_features=768, bias=True)
              (LayerNorm): LayerNorm((768,), eps=1e-12, elementwis

## Prepare for the training
Setup the TrainingArguments and Trainer with data_collator.

data_collator is for the trainer to load the data. [More here](https://huggingface.co/docs/transformers/main_classes/data_collator)

In [23]:
import torch

# This data_collator will be used by the trainer
# and as a preprocess method when we do evaulation later.
def data_collator(features: list) -> dict:
    """
    The purpose of this method, is to tokenize the train data,
    so we get all the required input(e.g. numeric tokens, attendence mask, ...etc)
    Also we re-align the labels of sentences into labels of tokens,
    according to the tokenized list. 
    """
    # Lets just use 64 as max length for simplicity
    max_len = 64
    tokenized_list = []
    ## tokenized all text, while aligning the label list to match the result text
    for index, row in enumerate(features):
        # For each row at [index], e.g. { "label": [xxx], "text": "xxxx" }

        # The return_offsets_mapping is so important here
        # so that we can use it to re-align the labels
        # The return_length let us know the length of the tokenized sentence (including CLS, SEP)
        fast_result = fast_tokenizer(row["text"], return_tensors=None, padding='max_length', 
                                    truncation=True, max_length=max_len, return_offsets_mapping=True,
                                    return_length=True)

        ## Build the aligned label list with prefilled values
        ## The length here is same as tokens length in one sentence
        ## NOTE: it includes special token [CLS][SEP][PAD]
        y_each_sentence = [ -100 ] * len(fast_result.offset_mapping)

        ## 'offset_mapping': [(0, 0), (0, 3), (3, 5), (5, 8), (8, 9), ...]
        for index, offset_tuple in enumerate(fast_result.offset_mapping):
            ## It is [PAD] already, break loop to save time
            if index >= fast_result.length[0] :
                break
            ## If it is special token [CLS], [SEP] or [PAD], leave it as -100
            if offset_tuple == (0, 0) : 
                continue
            # Using our structured label list, get the annotation(its id value)
            ## reminder: ['B-法人名', 'I-法人名', 'I-法人名', 'I-法人名', 'I-法人名', 'I-法人名', ... ]
            target_label = row["label"][offset_tuple[0]] # e.g. 'B-法人名'
            target_label_id = label2id[target_label] # e.g. 2
            y_each_sentence[index] = target_label_id
        
        fast_result["labels"] = y_each_sentence
        # Remove unnecessary field, otherwise model training will throw error
        del fast_result["offset_mapping"]
        del fast_result["length"]
        tokenized_list.append(fast_result)

    # transpose
    df_all = pd.DataFrame(tokenized_list)
    dict_filter = df_all.to_dict(orient="list")
    # Convert all input to tensor, and put to specific CPU/GPU
    batch = {k: torch.tensor(v, dtype=torch.int64).to(device) for k, v in dict_filter.items()}
    return batch

    # print(features)

In [24]:
## lets test it before using in training
a = data_collator(json_list[:1])
pp(a, compact=True, depth=1)

{'attention_mask': tensor([[1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 0, 0, 0, 0, 0, 0, 0,
         0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
         0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0]]),
 'input_ids': tensor([[    2, 11619, 24501, 28589, 28452,   113,    28,  1883,     5,  1273,
            21,   415, 12246, 25457, 28503,     8,     3,     0,     0,     0,
             0,     0,     0,     0,     0,     0,     0,     0,     0,     0,
             0,     0,     0,     0,     0,     0,     0,     0,     0,     0,
             0,     0,     0,     0,     0,     0,     0,     0,     0,     0,
             0,     0,     0,     0,     0,     0,     0,     0,     0,     0,
             0,     0,     0,     0]]),
 'labels': tensor([[-100,    0,    8,    8,    8,   16,   16,   16,   16,   16,   16,   16,
           16,   16,   16,   16, -100, -100, -100, -100, -100, -100, -100, -100,
         -100, -100, -100, -100, -100, -100, -100, -100, -

## Training !

We will define some hyperparameters with TrainingArguments class, which will be used in training

In [25]:
ckpt_dir = "./ckpt"
batch_size = 16
epochs = 3
learning_rate = 3e-5
save_freq = 100

In [26]:
from transformers import TrainingArguments

args = TrainingArguments(output_dir=ckpt_dir,
                         do_train=True,
                         do_eval=True,
                         do_predict=True,
                         per_device_train_batch_size=batch_size,
                         per_device_eval_batch_size=batch_size,
                         learning_rate=learning_rate,
                         num_train_epochs=epochs,
                         evaluation_strategy="steps",
                         eval_steps=save_freq,
                         save_strategy="steps",
                         save_steps=save_freq,
                         load_best_model_at_end=True,
                         dataloader_pin_memory=False, #Important if you use gpu
                        )

Below code is to move the model to GPU (or cpu defined above)

In [27]:
model.to(device)

BertForTokenClassification(
  (bert): BertModel(
    (embeddings): BertEmbeddings(
      (word_embeddings): Embedding(32000, 768, padding_idx=0)
      (position_embeddings): Embedding(512, 768)
      (token_type_embeddings): Embedding(2, 768)
      (LayerNorm): LayerNorm((768,), eps=1e-12, elementwise_affine=True)
      (dropout): Dropout(p=0.1, inplace=False)
    )
    (encoder): BertEncoder(
      (layer): ModuleList(
        (0): BertLayer(
          (attention): BertAttention(
            (self): BertSelfAttention(
              (query): Linear(in_features=768, out_features=768, bias=True)
              (key): Linear(in_features=768, out_features=768, bias=True)
              (value): Linear(in_features=768, out_features=768, bias=True)
              (dropout): Dropout(p=0.1, inplace=False)
            )
            (output): BertSelfOutput(
              (dense): Linear(in_features=768, out_features=768, bias=True)
              (LayerNorm): LayerNorm((768,), eps=1e-12, elementwis

Then we finally can define the Trainer, with:
- training data
- evaluation data
- data collator method
- Training arguments
- Loaded pre-trained model
- Early stop strategy

In [28]:
from transformers import Trainer, EarlyStoppingCallback

trainer = Trainer(model=model,
                  args=args,
                  data_collator=data_collator,
                  train_dataset=train_data,
                  eval_dataset=val_data,
                  callbacks=[EarlyStoppingCallback(early_stopping_patience=2)]
                 )

Let's train the model !

In [None]:
trainer.train()

We got a trained model now. Let's try to use it.


In [138]:
_, _, metrics = trainer.predict(test_data, metric_key_prefix="test")
print(metrics)

***** Running Prediction *****
  Num examples = 428
  Batch size = 16


{'test_loss': 0.25463199615478516, 'test_runtime': 1.5567, 'test_samples_per_second': 274.945, 'test_steps_per_second': 17.345}


Remember to save the model into files

In [163]:
trainer.save_model(model_output_dir)

Saving model checkpoint to ./dest
Configuration saved in ./dest/config.json
Model weights saved in ./dest/pytorch_model.bin


## Evaluate the model
We will use [seqeval](https://github.com/chakki-works/seqeval) here to test the performance of the model.

### 推論用の関数を定義
学習したモデルを使って推論をするための関数を定義します

In [140]:
# Recall how does the test data look like
print(test_data[0].keys())
print(test_data[0])

dict_keys(['text', 'entities', 'label'])
{'text': '若き科学者ハンク・マッコイはブランドコーポレーションで働き、ミューテーションの誘発物質を分離させる事ができた。', 'entities': [{'name': 'ハンク・マッコイ', 'span': [5, 13], 'type': '人名'}, {'name': 'ブランドコーポレーション', 'span': [14, 26], 'type': '法人名'}], 'label': ['O', 'O', 'O', 'O', 'O', 'B-人名', 'I-人名', 'I-人名', 'I-人名', 'I-人名', 'I-人名', 'I-人名', 'I-人名', 'I-人名', 'B-法人名', 'I-法人名', 'I-法人名', 'I-法人名', 'I-法人名', 'I-法人名', 'I-法人名', 'I-法人名', 'I-法人名', 'I-法人名', 'I-法人名', 'I-法人名', 'I-法人名', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O']}


We reuse the data collator from above, to process the test data for evalution.

In [141]:
## the function convert the train dataset into tokenized input for model, also align the label for comparing later
def preprocess(each_dict: dict) -> dict:
    # lets reuse the data_collator from above to do tokenization & label structuring
    batch_dict = data_collator([each_dict])
    return  batch_dict

In [161]:
## Lets try first
tmp_test_after_preprocess = preprocess(test_data[0])
pp.pprint(tmp_test_after_preprocess)
print("tokenized sentence size (including CLS,SEP) : ", np.sum(tmp_test_after_preprocess["attention_mask"][0].cpu().tolist()))

{'attention_mask': tensor([[1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1,
         1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0,
         0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0]], device='cuda:0'),
 'input_ids': tensor([[    2,  1099,   322,   536,   112,   104,  2777, 28488,    35, 27492,
         20418, 28450,  4536, 28476,  1551, 28774,  2260, 28456,  2131,   322,
             6,  8826,  6977, 28444,  2325,    85,   410,  2015,    11,   155,
          1022,  9293,   146,    29, 28456, 28512, 28447,     8,     3,     0,
             0,     0,     0,     0,     0,     0,     0,     0,     0,     0,
             0,     0,     0,     0,     0,     0,     0,     0,     0,     0,
             0,     0,     0,     0]], device='cuda:0'),
 'labels': tensor([[-100,   16,   16,   16,   16,   16,    2,   10,   10,   10,   10,   10,
            6,   14,   14,   14,   14,   14,   16,   16,   16,   16,   16,   16,
           16,   16,   1

Then we load the stored model, and put to gpu (or cpu)
Also we define the inference method.

In [176]:
import numpy as np
from transformers import BertForTokenClassification

inference_model = BertForTokenClassification.from_pretrained(model_output_dir)
inference_model.to(device)
def inference(each_line_dict: dict) -> list:

    input_dict = preprocess(each_line_dict)
    # dict_keys(['attention_mask', 'input_ids', 'token_type_ids', 'labels'])
    del input_dict['labels']
    # Get tokenized sentence size
    tokenized_length = np.sum(input_dict["attention_mask"][0].cpu().tolist())

    # Inference !
    pred = inference_model(**input_dict).logits[0]
    # print(np.shape(pred)) # torch.Size([64, 17])
    pred = np.argmax(pred.cpu().detach().numpy(), axis=-1)
    labels = []
    for i, label in enumerate(pred):
        # The last meaningful element is also a special token [SEP], so we ignore it
        if i >= (tokenized_length-1): break
        labels.append(inference_model.config.id2label[label])
        # labels[i] = label
    # Remove 1st element, which is [CLS]
    labels.pop(0)
    return pred[1:(tokenized_length-1)], labels
    # return pred

loading configuration file ./dest/config.json
Model config BertConfig {
  "_name_or_path": "cl-tohoku/bert-base-japanese-whole-word-masking",
  "architectures": [
    "BertForTokenClassification"
  ],
  "attention_probs_dropout_prob": 0.1,
  "classifier_dropout": null,
  "hidden_act": "gelu",
  "hidden_dropout_prob": 0.1,
  "hidden_size": 768,
  "id2label": {
    "0": "B-\u305d\u306e\u4ed6\u306e\u7d44\u7e54\u540d",
    "1": "B-\u30a4\u30d9\u30f3\u30c8\u540d",
    "2": "B-\u4eba\u540d",
    "3": "B-\u5730\u540d",
    "4": "B-\u653f\u6cbb\u7684\u7d44\u7e54\u540d",
    "5": "B-\u65bd\u8a2d\u540d",
    "6": "B-\u6cd5\u4eba\u540d",
    "7": "B-\u88fd\u54c1\u540d",
    "8": "I-\u305d\u306e\u4ed6\u306e\u7d44\u7e54\u540d",
    "9": "I-\u30a4\u30d9\u30f3\u30c8\u540d",
    "10": "I-\u4eba\u540d",
    "11": "I-\u5730\u540d",
    "12": "I-\u653f\u6cbb\u7684\u7d44\u7e54\u540d",
    "13": "I-\u65bd\u8a2d\u540d",
    "14": "I-\u6cd5\u4eba\u540d",
    "15": "I-\u88fd\u54c1\u540d",
    "16": "O"
  },
 

In [177]:
# Lets test it
temp_test_pred, temp_test_labels = inference(test_data[2])
print("Prediction value: ")
pp.pprint(temp_test_pred, compact=True)
print(len(temp_test_pred))
print("--------")
print("Converted to label: ")
pp.pprint(temp_test_labels, compact=True)
print(len(temp_test_labels))

Prediction value: 
array([16, 16,  6, 14, 14, 16, 16, 16, 16, 16, 16, 16, 16, 16, 16, 16, 16,
       16, 16, 16,  6, 14, 14, 16, 16, 16, 16, 16, 16, 16, 16, 16, 16, 16,
       16, 16])
36
--------
Converted to label: 
['O', 'O', 'B-法人名', 'I-法人名', 'I-法人名', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O',
 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'B-法人名', 'I-法人名', 'I-法人名', 'O', 'O', 'O',
 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O']
36


For evaluation, we need to preprocess the label as well.

In [178]:
import pprint as pp

y_true = []
for unit in test_data :
    batch_dict = preprocess(unit)
    tokenized_length = np.sum(batch_dict["attention_mask"][0].cpu().tolist())
    ylist = batch_dict["labels"].tolist() 

    ylist = [ inference_model.config.id2label[eachVal if eachVal >= 0 else 16] for eachVal in ylist[0] ]
    y_true.append(ylist[1:(tokenized_length-1)])

pp.pprint(y_true[2], compact=True)
print(len(y_true[2]))
print(np.shape(y_true))

['O', 'O', 'B-法人名', 'I-法人名', 'I-法人名', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O',
 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'B-法人名', 'I-法人名', 'I-法人名', 'O', 'O', 'O',
 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O']
36
(428,)


Now we do inference on the whole test dataset

In [179]:
y_pred = []
for unit in test_data:
    temp_test_pred, temp_test_label = inference(unit)
    y_pred.append(temp_test_label)
pp.pprint(y_pred[:2], compact=True, width=120)

[['O', 'O', 'O', 'O', 'O', 'B-人名', 'I-人名', 'I-人名', 'I-人名', 'I-人名', 'I-人名', 'B-法人名', 'I-法人名', 'I-法人名', 'I-法人名', 'I-法人名',
  'I-法人名', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O'],
 ['B-人名', 'I-人名', 'I-人名', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'B-地名', 'I-地名', 'I-地名', 'O', 'O', 'O',
  'O', 'B-人名', 'I-人名', 'I-人名', 'I-人名', 'I-人名', 'I-人名', 'O', 'O', 'O', 'O', 'O', 'O', 'B-地名', 'I-地名', 'I-地名', 'O', 'O',
  'O', 'O']]


In [181]:
print(np.shape(y_pred))

(428,)


### seqeval - classification_report
We can evaluate by classification_report in seqeval lib.

In [184]:
from seqeval.metrics import classification_report
from seqeval.scheme import BILOU

print(classification_report(y_true, y_pred))
# You can also specify the tagging scheme if you are using something else
#print(classification_report(y_true, y_pred, mode='strict', scheme=BILOU))

              precision    recall  f1-score   support

     その他の組織名       0.52      0.74      0.61        68
       イベント名       0.73      0.79      0.76        73
          人名       0.87      0.89      0.88       248
          地名       0.74      0.78      0.76       172
      政治的組織名       0.66      0.76      0.71        85
         施設名       0.60      0.74      0.66        72
         法人名       0.65      0.79      0.71       214
         製品名       0.59      0.65      0.62       102

   micro avg       0.70      0.79      0.74      1034
   macro avg       0.67      0.77      0.71      1034
weighted avg       0.71      0.79      0.74      1034



## Inference
If only for inference without the need to evaluate, we can use the code below.
But keep in mind that it is not optimized for production use case yet.

For example,
- consider to store the loaded model/tokenizer into memory/singleton class for quicker response in the future usage
- deal with the case if input text is too long, e.g. split into chunks
- use onnx instead of the plain model to speed up

In [191]:
from transformers import AutoTokenizer, AutoModelForTokenClassification
def inference(text: str):
    model = AutoModelForTokenClassification.from_pretrained(model_output_dir)
    tokenizer = AutoTokenizer.from_pretrained(model_name)
    
    
    inputs = tokenizer(text, return_tensors="pt", padding='max_length', truncation=True, max_length=64)
    tokenized_length = np.sum(inputs["attention_mask"][0].cpu().tolist())
    pred = model(**inputs).logits[0]
    pred = np.argmax(pred.detach().numpy(), axis=-1)
    labels = []
    for i, label in enumerate(pred):
        if i > (tokenized_length-1): break
        labels.append(inference_model.config.id2label[label])
    labels.pop(0)
    print(tokenizer.convert_ids_to_tokens(inputs["input_ids"][0].cpu().tolist()))
    print(labels)

In [192]:
print(inference("田中さんの会社の社長は鈴木さんです"))

loading configuration file ./dest/config.json
Model config BertConfig {
  "_name_or_path": "./dest",
  "architectures": [
    "BertForTokenClassification"
  ],
  "attention_probs_dropout_prob": 0.1,
  "classifier_dropout": null,
  "hidden_act": "gelu",
  "hidden_dropout_prob": 0.1,
  "hidden_size": 768,
  "id2label": {
    "0": "B-\u305d\u306e\u4ed6\u306e\u7d44\u7e54\u540d",
    "1": "B-\u30a4\u30d9\u30f3\u30c8\u540d",
    "2": "B-\u4eba\u540d",
    "3": "B-\u5730\u540d",
    "4": "B-\u653f\u6cbb\u7684\u7d44\u7e54\u540d",
    "5": "B-\u65bd\u8a2d\u540d",
    "6": "B-\u6cd5\u4eba\u540d",
    "7": "B-\u88fd\u54c1\u540d",
    "8": "I-\u305d\u306e\u4ed6\u306e\u7d44\u7e54\u540d",
    "9": "I-\u30a4\u30d9\u30f3\u30c8\u540d",
    "10": "I-\u4eba\u540d",
    "11": "I-\u5730\u540d",
    "12": "I-\u653f\u6cbb\u7684\u7d44\u7e54\u540d",
    "13": "I-\u65bd\u8a2d\u540d",
    "14": "I-\u6cd5\u4eba\u540d",
    "15": "I-\u88fd\u54c1\u540d",
    "16": "O"
  },
  "initializer_range": 0.02,
  "intermedia

['[CLS]', '田中', 'さん', 'の', '会社', 'の', '社長', 'は', '鈴木', 'さん', 'です', '[SEP]', '[PAD]', '[PAD]', '[PAD]', '[PAD]', '[PAD]', '[PAD]', '[PAD]', '[PAD]', '[PAD]', '[PAD]', '[PAD]', '[PAD]', '[PAD]', '[PAD]', '[PAD]', '[PAD]', '[PAD]', '[PAD]', '[PAD]', '[PAD]', '[PAD]', '[PAD]', '[PAD]', '[PAD]', '[PAD]', '[PAD]', '[PAD]', '[PAD]', '[PAD]', '[PAD]', '[PAD]', '[PAD]', '[PAD]', '[PAD]', '[PAD]', '[PAD]', '[PAD]', '[PAD]', '[PAD]', '[PAD]', '[PAD]', '[PAD]', '[PAD]', '[PAD]', '[PAD]', '[PAD]', '[PAD]', '[PAD]', '[PAD]', '[PAD]', '[PAD]', '[PAD]']
['B-人名', 'I-人名', 'O', 'O', 'O', 'O', 'O', 'B-人名', 'I-人名', 'O', 'O']
None
