<a href="https://colab.research.google.com/github/pinedance/gym-AI-NLP/blob/main/fine_tune_BERT_%ED%95%9C%EB%AC%B8_%EC%9E%90%EB%8F%99_%ED%91%9C%EC%A0%90_01.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# 고문헌 자동 표점 테스트

source
* [raynardj/classical-chinese-punctuation-guwen-biaodian](https://huggingface.co/raynardj/classical-chinese-punctuation-guwen-biaodian)

REF
* [BertForTokenClassification](https://huggingface.co/docs/transformers/main/en/model_doc/bert#transformers.BertForTokenClassification)
* [Custom_Named_Entity_Recognition_with_BERT.ipynb](https://github.com/NielsRogge/Transformers-Tutorials/blob/master/BERT/Custom_Named_Entity_Recognition_with_BERT.ipynb)

## Packages

In [1]:
%%bash
# pip install transformers
pip install seqeval[gpu]



In [2]:
import pandas as pd
import numpy as np
from sklearn.metrics import accuracy_score
import torch
from torch.utils.data import Dataset, DataLoader
from transformers import AutoTokenizer, AutoConfig, AutoModelForTokenClassification

In [3]:
from torch import cuda
device = 'cuda' if cuda.is_available() else 'cpu'
print(device)

cuda


## Defining the model

In [4]:
model_name = "raynardj/classical-chinese-punctuation-guwen-biaodian"
tokenizer = AutoTokenizer.from_pretrained( model_name )
model = AutoModelForTokenClassification.from_pretrained(model_name)
config = AutoConfig.from_pretrained(model_name)

The secret `HF_TOKEN` does not exist in your Colab secrets.
To authenticate with the Hugging Face Hub, create a token in your settings tab (https://huggingface.co/settings/tokens), set it as secret in your Google Colab and restart your session.
You will be able to reuse this secret in all of your notebooks.
Please note that authentication is recommended but still optional to access public models or datasets.
  return self.fget.__get__(instance, owner)()


## Preprocessing the data

In [5]:
label_dict = ["O", "，", "。"]
label2id = config.label2id
id2label = config.id2label
label2id

{'"': 14,
 "'": 10,
 'O': 0,
 '。': 1,
 '【': 19,
 '】': 20,
 '！': 6,
 '（': 15,
 '）': 16,
 '，': 2,
 '：': 3,
 '；': 4,
 '？': 5}

In [6]:
def clean_text( text ):
    _text = text + ""
    _text = _text.replace( ",", "，" )
    _text = _text.replace( ".", "。" )
    _text = _text.replace(" ", "")
    return _text.strip()


In [7]:
def split_text_to_src_tgt( text ):
    _text_list = list()
    for t in text:
        if t in ["，", "。"]:
            _text_list.append(t)
        else:
            _text_list.append( t )
            _text_list.append( "O")

    raw_text_new = "".join(_text_list)
    raw_text_new = raw_text_new.replace("O，", "，")
    raw_text_new = raw_text_new.replace("O。", "。")

    src, tgt = list(), list()
    for i, c in enumerate( raw_text_new):
        if i % 2 == 0:
            src.append( c )
        else:
            tgt.append( c )
    return "".join( src ), "".join( tgt )


In [8]:
new_inputs = [
    "風寒暑, 暴傷人便覺, 濕氣熏襲, 人多不覺. 其自外而入者, 長夏鬱熱, 山澤蒸氣, 冒雨行濕, 汗透沾衣. 多腰脚腫痛. 其自內得者, 生冷酒麪滯, 脾生濕, 鬱熱, 多肚腹腫脹. 西北人, 多內濕, 東南人, 多外濕.",
    "人居戴履, 受濕最多. 行住坐臥, 實熏染於冥冥之中, 滯而爲喘嗽, 漬而爲嘔吐, 滲而爲泄瀉, 溢而爲浮腫. 濕瘀熱則發黃, 濕遍體則重着, 濕入關節則一身盡痛, 濕聚痰涎則昏不知人.",
    "濕家治法, 大槪宜發微汗, 及利小便, 使上下分消其濕, 是其治也.",
    "濕上甚而熱, 治以苦溫, 佐以甘辛, 以汗爲故而止, 平胃散方見內傷主之. 濕在上, 宜微汗而解, 不欲汗多, 故不用麻黃ㆍ乾葛輩. 宜微汗, 用防己黃芪湯.",
    "濕在中下, 宜利小便, 此淡滲治濕也. 五苓散主之.",
    "凡濕病, 忌不得以火攻幷轉利之. 若濕家下之, 則額上汗出, 微喘, 小便不利者, 死, 下利不止者亦, 死.",
    "治濕不得猛發汗, 及灼艾灸之.",
    "濕病誤下, 則爲喘噦, 誤汗, 則發痓而死.",
    "濕家不可汗, 汗之則發痓, 發痓者斃. 又不可下, 下之則額汗胸滿, 微喘而噦, 小便淋閉, 難以有瘳也.",
    "太陽病, 發汗太多, 因致痓. 濕家大發汗, 亦作痓. 盖汗太多則亡陽, 不能養筋, 故筋脉緊急而成痓. 其證身熱足冷, 頸項强急, 惡寒, 時頭熱, 面赤目赤, 獨頭面搖, 卒口噤, 背反張者, 是也. 亦名破傷風."
]


In [9]:
text_cleaned = [ clean_text(text) for text in new_inputs ]
text_cleaned

['風寒暑，暴傷人便覺，濕氣熏襲，人多不覺。其自外而入者，長夏鬱熱，山澤蒸氣，冒雨行濕，汗透沾衣。多腰脚腫痛。其自內得者，生冷酒麪滯，脾生濕，鬱熱，多肚腹腫脹。西北人，多內濕，東南人，多外濕。',
 '人居戴履，受濕最多。行住坐臥，實熏染於冥冥之中，滯而爲喘嗽，漬而爲嘔吐，滲而爲泄瀉，溢而爲浮腫。濕瘀熱則發黃，濕遍體則重着，濕入關節則一身盡痛，濕聚痰涎則昏不知人。',
 '濕家治法，大槪宜發微汗，及利小便，使上下分消其濕，是其治也。',
 '濕上甚而熱，治以苦溫，佐以甘辛，以汗爲故而止，平胃散方見內傷主之。濕在上，宜微汗而解，不欲汗多，故不用麻黃ㆍ乾葛輩。宜微汗，用防己黃芪湯。',
 '濕在中下，宜利小便，此淡滲治濕也。五苓散主之。',
 '凡濕病，忌不得以火攻幷轉利之。若濕家下之，則額上汗出，微喘，小便不利者，死，下利不止者亦，死。',
 '治濕不得猛發汗，及灼艾灸之。',
 '濕病誤下，則爲喘噦，誤汗，則發痓而死。',
 '濕家不可汗，汗之則發痓，發痓者斃。又不可下，下之則額汗胸滿，微喘而噦，小便淋閉，難以有瘳也。',
 '太陽病，發汗太多，因致痓。濕家大發汗，亦作痓。盖汗太多則亡陽，不能養筋，故筋脉緊急而成痓。其證身熱足冷，頸項强急，惡寒，時頭熱，面赤目赤，獨頭面搖，卒口噤，背反張者，是也。亦名破傷風。']

In [10]:
src_text, tgt_text = list( zip( *[ split_text_to_src_tgt( text) for text in text_cleaned ] ) )

In [11]:
src_text

('風寒暑暴傷人便覺濕氣熏襲人多不覺其自外而入者長夏鬱熱山澤蒸氣冒雨行濕汗透沾衣多腰脚腫痛其自內得者生冷酒麪滯脾生濕鬱熱多肚腹腫脹西北人多內濕東南人多外濕',
 '人居戴履受濕最多行住坐臥實熏染於冥冥之中滯而爲喘嗽漬而爲嘔吐滲而爲泄瀉溢而爲浮腫濕瘀熱則發黃濕遍體則重着濕入關節則一身盡痛濕聚痰涎則昏不知人',
 '濕家治法大槪宜發微汗及利小便使上下分消其濕是其治也',
 '濕上甚而熱治以苦溫佐以甘辛以汗爲故而止平胃散方見內傷主之濕在上宜微汗而解不欲汗多故不用麻黃ㆍ乾葛輩宜微汗用防己黃芪湯',
 '濕在中下宜利小便此淡滲治濕也五苓散主之',
 '凡濕病忌不得以火攻幷轉利之若濕家下之則額上汗出微喘小便不利者死下利不止者亦死',
 '治濕不得猛發汗及灼艾灸之',
 '濕病誤下則爲喘噦誤汗則發痓而死',
 '濕家不可汗汗之則發痓發痓者斃又不可下下之則額汗胸滿微喘而噦小便淋閉難以有瘳也',
 '太陽病發汗太多因致痓濕家大發汗亦作痓盖汗太多則亡陽不能養筋故筋脉緊急而成痓其證身熱足冷頸項强急惡寒時頭熱面赤目赤獨頭面搖卒口噤背反張者是也亦名破傷風')

In [12]:
tgt_text

('OO，OOOO，OOO，OOO。OOOOO，OOO，OOO，OOO，OOO。OOOO。OOOO，OOOO，OO，O，OOOO。OO，OO，OO，OO。',
 'OOO，OOO。OOO，OOOOOOO，OOOO，OOOO，OOOO，OOOO。OOOOO，OOOOO，OOOOOOOO，OOOOOOOO。',
 'OOO，OOOOO，OOO，OOOOOO，OOO。',
 'OOOO，OOO，OOO，OOOOO，OOOOOOOO。OO，OOOO，OOO，OOOOOOOO。OO，OOOOO。',
 'OOO，OOO，OOOOO。OOOO。',
 'OO，OOOOOOOOO。OOOO，OOOO，O，OOOO，，OOOOO，。',
 'OOOOOO，OOOO。',
 'OOO，OOO，O，OOOO。',
 'OOOO，OOOO，OOO。OOO，OOOOOO，OOO，OOO，OOOO。',
 'OO，OOO，OO。OOOO，OO。OOOOOO，OOO，OOOOOOO。OOOOO，OOO，O，OO，OOO，OOO，OO，OOO，O。OOOO。')

In [13]:
print( [ len(e) for e in src_text], [len(e) for e in tgt_text ])

[75, 70, 25, 58, 19, 38, 12, 15, 38, 74] [75, 70, 25, 58, 19, 38, 12, 15, 38, 74]


In [14]:
data = pd.DataFrame( {"sentence": src_text, "word_labels": tgt_text} ).drop_duplicates().reset_index(drop=True)
data.head()

Unnamed: 0,sentence,word_labels
0,風寒暑暴傷人便覺濕氣熏襲人多不覺其自外而入者長夏鬱熱山澤蒸氣冒雨行濕汗透沾衣多腰脚腫痛其自內...,OO，OOOO，OOO，OOO。OOOOO，OOO，OOO，OOO，OOO。OOOO。OOO...
1,人居戴履受濕最多行住坐臥實熏染於冥冥之中滯而爲喘嗽漬而爲嘔吐滲而爲泄瀉溢而爲浮腫濕瘀熱則發黃...,OOO，OOO。OOO，OOOOOOO，OOOO，OOOO，OOOO，OOOO。OOOOO，...
2,濕家治法大槪宜發微汗及利小便使上下分消其濕是其治也,OOO，OOOOO，OOO，OOOOOO，OOO。
3,濕上甚而熱治以苦溫佐以甘辛以汗爲故而止平胃散方見內傷主之濕在上宜微汗而解不欲汗多故不用麻黃ㆍ...,OOOO，OOO，OOO，OOOOO，OOOOOOOO。OO，OOOO，OOO，OOOOOO...
4,濕在中下宜利小便此淡滲治濕也五苓散主之,OOO，OOO，OOOOO。OOOO。


## Preparing the dataset and dataloader

In [15]:
MAX_LEN = 512
TRAIN_BATCH_SIZE = 4
VALID_BATCH_SIZE = 2
EPOCHS = 10
LEARNING_RATE = 1e-05
MAX_GRAD_NORM = 10

In [16]:
def tokenize_and_preserve_labels(sentence, text_labels, tokenizer):

    # tokenized_sentence = []
    # labels = []

    # sentence = sentence.strip()

    # for word, label in zip(sentence.split(), text_labels.split(",")):

    #     # Tokenize the word and count # of subwords the word is broken into
    #     tokenized_word = tokenizer.tokenize(word)
    #     n_subwords = len(tokenized_word)

    #     # Add the tokenized word to the final tokenized word list
    #     tokenized_sentence.extend(tokenized_word)

    #     # Add the same label to the new list of labels `n_subwords` times
    #     labels.extend([label] * n_subwords)

    tokenized_sentence = tokenizer.tokenize( sentence.strip() )
    labels = list( text_labels.strip() )
    if len( tokenized_sentence ) != len(labels ):
        print("!!! length of sentence and labels miss match !!!", len( tokenized_sentence ), len(labels ) )
    return tokenized_sentence, labels


In [17]:
class dataset(Dataset):
    def __init__(self, dataframe, tokenizer, max_len):
        self.len = len(dataframe)
        self.data = dataframe
        self.tokenizer = tokenizer
        self.max_len = max_len

    def __getitem__(self, index):
        # step 1: tokenize (and adapt corresponding labels)
        sentence = self.data.sentence[index]
        word_labels = self.data.word_labels[index]
        tokenized_sentence, labels = tokenize_and_preserve_labels(sentence, word_labels, self.tokenizer)

        # step 2: add special tokens (and corresponding labels)
        tokenized_sentence = ["[CLS]"] + tokenized_sentence + ["[SEP]"] # add special tokens
        labels.insert(0, "O") # add outside label for [CLS] token
        labels.insert(-1, "O") # add outside label for [SEP] token

        # step 3: truncating/padding
        maxlen = self.max_len

        if (len(tokenized_sentence) > maxlen):
          # truncate
          tokenized_sentence = tokenized_sentence[:maxlen]
          labels = labels[:maxlen]
        else:
          # pad
          tokenized_sentence = tokenized_sentence + ['[PAD]'for _ in range(maxlen - len(tokenized_sentence))]
          labels = labels + ["O" for _ in range(maxlen - len(labels))]

        # step 4: obtain the attention mask
        attn_mask = [1 if tok != '[PAD]' else 0 for tok in tokenized_sentence]

        # step 5: convert tokens to input ids
        ids = self.tokenizer.convert_tokens_to_ids(tokenized_sentence)

        label_ids = [label2id[label] for label in labels]
        # the following line is deprecated
        #label_ids = [label if label != 0 else -100 for label in label_ids]

        return {
              'ids': torch.tensor(ids, dtype=torch.long),
              'mask': torch.tensor(attn_mask, dtype=torch.long),
              #'token_type_ids': torch.tensor(token_ids, dtype=torch.long),
              'targets': torch.tensor(label_ids, dtype=torch.long)
        }

    def __len__(self):
        return self.len

In [18]:
train_size = 0.8
train_dataset = data.sample(frac=train_size,random_state=200)
test_dataset = data.drop(train_dataset.index).reset_index(drop=True)
train_dataset = train_dataset.reset_index(drop=True)

print("FULL Dataset: {}".format(data.shape))
print("TRAIN Dataset: {}".format(train_dataset.shape))
print("TEST Dataset: {}".format(test_dataset.shape))

training_set = dataset(train_dataset, tokenizer, MAX_LEN)
testing_set = dataset(test_dataset, tokenizer, MAX_LEN)
training_set = dataset(data, tokenizer, MAX_LEN )

FULL Dataset: (10, 2)
TRAIN Dataset: (8, 2)
TEST Dataset: (2, 2)


In [19]:
training_set[0]

{'ids': tensor([ 101, 7591, 2170, 3264, 3274, 1003,  782,  912, 6221, 4086, 3706, 4221,
         6204,  782, 1914,  679, 6221, 1071, 5632, 1912, 5445, 1057, 5442, 7269,
         1909, 7786, 4229, 2255, 4075, 5892, 3706, 1088, 7433, 6121, 4086, 3731,
         6851, 3783, 6132, 1914, 5587, 5558, 5584, 4578, 1071, 5632, 1058, 2533,
         5442, 4495, 1107, 6983,  100, 4015, 5569, 4495, 4086, 7786, 4229, 1914,
         5496, 5592, 5584, 5568, 6205, 1266,  782, 1914, 1058, 4086, 3346, 1298,
          782, 1914, 1912, 4086,  102,    0,    0,    0,    0,    0,    0,    0,
            0,    0,    0,    0,    0,    0,    0,    0,    0,    0,    0,    0,
            0,    0,    0,    0,    0,    0,    0,    0,    0,    0,    0,    0,
            0,    0,    0,    0,    0,    0,    0,    0,    0,    0,    0,    0,
            0,    0,    0,    0,    0,    0,    0,    0,    0,    0,    0,    0,
            0,    0,    0,    0,    0,    0,    0,    0,    0,    0,    0,    0,
            0,    0, 

In [20]:
training_set[0]["ids"]

tensor([ 101, 7591, 2170, 3264, 3274, 1003,  782,  912, 6221, 4086, 3706, 4221,
        6204,  782, 1914,  679, 6221, 1071, 5632, 1912, 5445, 1057, 5442, 7269,
        1909, 7786, 4229, 2255, 4075, 5892, 3706, 1088, 7433, 6121, 4086, 3731,
        6851, 3783, 6132, 1914, 5587, 5558, 5584, 4578, 1071, 5632, 1058, 2533,
        5442, 4495, 1107, 6983,  100, 4015, 5569, 4495, 4086, 7786, 4229, 1914,
        5496, 5592, 5584, 5568, 6205, 1266,  782, 1914, 1058, 4086, 3346, 1298,
         782, 1914, 1912, 4086,  102,    0,    0,    0,    0,    0,    0,    0,
           0,    0,    0,    0,    0,    0,    0,    0,    0,    0,    0,    0,
           0,    0,    0,    0,    0,    0,    0,    0,    0,    0,    0,    0,
           0,    0,    0,    0,    0,    0,    0,    0,    0,    0,    0,    0,
           0,    0,    0,    0,    0,    0,    0,    0,    0,    0,    0,    0,
           0,    0,    0,    0,    0,    0,    0,    0,    0,    0,    0,    0,
           0,    0,    0,    0,    0,   

In [21]:
train_params = {'batch_size': TRAIN_BATCH_SIZE,
                'shuffle': True,
                'num_workers': 0
                }

test_params = {'batch_size': VALID_BATCH_SIZE,
                'shuffle': True,
                'num_workers': 0
                }

training_loader = DataLoader(training_set, **train_params)
testing_loader = DataLoader(testing_set, **test_params)

## Training the model

In [22]:
model.to(device)

BertForTokenClassification(
  (bert): BertModel(
    (embeddings): BertEmbeddings(
      (word_embeddings): Embedding(21128, 768, padding_idx=0)
      (position_embeddings): Embedding(512, 768)
      (token_type_embeddings): Embedding(2, 768)
      (LayerNorm): LayerNorm((768,), eps=1e-12, elementwise_affine=True)
      (dropout): Dropout(p=0.1, inplace=False)
    )
    (encoder): BertEncoder(
      (layer): ModuleList(
        (0-11): 12 x BertLayer(
          (attention): BertAttention(
            (self): BertSelfAttention(
              (query): Linear(in_features=768, out_features=768, bias=True)
              (key): Linear(in_features=768, out_features=768, bias=True)
              (value): Linear(in_features=768, out_features=768, bias=True)
              (dropout): Dropout(p=0.1, inplace=False)
            )
            (output): BertSelfOutput(
              (dense): Linear(in_features=768, out_features=768, bias=True)
              (LayerNorm): LayerNorm((768,), eps=1e-12, el

In [23]:
ids = training_set[0]["ids"].unsqueeze(0)
mask = training_set[0]["mask"].unsqueeze(0)
targets = training_set[0]["targets"].unsqueeze(0)


In [24]:
ids = ids.to(device)
mask = mask.to(device)
targets = targets.to(device)


In [25]:
outputs = model(input_ids=ids, attention_mask=mask, labels=targets)
initial_loss = outputs[0]
initial_loss

tensor(0.6591, device='cuda:0', grad_fn=<NllLossBackward0>)

In [26]:
tr_logits = outputs[1]
tr_logits.shape

torch.Size([1, 512, 21])

In [27]:
optimizer = torch.optim.Adam(params=model.parameters(), lr=LEARNING_RATE)

In [28]:
# Defining the training function on the 80% of the dataset for tuning the bert model
def train(epoch):

    tr_loss, tr_accuracy = 0, 0
    nb_tr_examples, nb_tr_steps = 0, 0
    tr_preds, tr_labels = [], []
    # put model in training mode
    model.train()

    for idx, batch in enumerate(training_loader):

        ids = batch['ids'].to(device, dtype = torch.long)
        mask = batch['mask'].to(device, dtype = torch.long)
        targets = batch['targets'].to(device, dtype = torch.long)

        outputs = model(input_ids=ids, attention_mask=mask, labels=targets)
        loss, tr_logits = outputs.loss, outputs.logits
        tr_loss += loss.item()

        nb_tr_steps += 1
        nb_tr_examples += targets.size(0)

        if idx % 100==0:
            loss_step = tr_loss/nb_tr_steps
            print(f"Training loss per 100 training steps: {loss_step}")

        # compute training accuracy
        flattened_targets = targets.view(-1) # shape (batch_size * seq_len,)
        active_logits = tr_logits.view(-1, model.num_labels) # shape (batch_size * seq_len, num_labels)
        flattened_predictions = torch.argmax(active_logits, axis=1) # shape (batch_size * seq_len,)
        # now, use mask to determine where we should compare predictions with targets (includes [CLS] and [SEP] token predictions)
        active_accuracy = mask.view(-1) == 1 # active accuracy is also of shape (batch_size * seq_len,)
        targets = torch.masked_select(flattened_targets, active_accuracy)
        predictions = torch.masked_select(flattened_predictions, active_accuracy)

        tr_preds.extend(predictions)
        tr_labels.extend(targets)

        tmp_tr_accuracy = accuracy_score(targets.cpu().numpy(), predictions.cpu().numpy())
        tr_accuracy += tmp_tr_accuracy

        # gradient clipping
        torch.nn.utils.clip_grad_norm_(
            parameters=model.parameters(), max_norm=MAX_GRAD_NORM
        )

        # backward pass
        optimizer.zero_grad()
        loss.backward()
        optimizer.step()

    epoch_loss = tr_loss / nb_tr_steps
    tr_accuracy = tr_accuracy / nb_tr_steps
    print(f"Training loss epoch: {epoch_loss}")
    print(f"Training accuracy epoch: {tr_accuracy}")

In [29]:
for epoch in range(EPOCHS):
    print(f"Training epoch: {epoch + 1}")
    train(epoch)

Training epoch: 1
Training loss per 100 training steps: 0.6868448853492737
Training loss epoch: 0.41577161848545074
Training accuracy epoch: 0.8426655777017631
Training epoch: 2
Training loss per 100 training steps: 0.11325179785490036
Training loss epoch: 0.10659048209587733
Training accuracy epoch: 0.8210687983191152
Training epoch: 3
Training loss per 100 training steps: 0.03695332631468773
Training loss epoch: 0.06341010704636574
Training accuracy epoch: 0.8227616250515148
Training epoch: 4
Training loss per 100 training steps: 0.03784318268299103
Training loss epoch: 0.050183908392985664
Training accuracy epoch: 0.8291351108344571
Training epoch: 5
Training loss per 100 training steps: 0.05637424811720848
Training loss epoch: 0.04743218546112379
Training accuracy epoch: 0.8281718144484933
Training epoch: 6
Training loss per 100 training steps: 0.05674177408218384
Training loss epoch: 0.05269674708445867
Training accuracy epoch: 0.8245806325530275
Training epoch: 7
Training loss pe

## Evaluating the model

In [30]:
def valid(model, testing_loader):
    # put model in evaluation mode
    model.eval()

    eval_loss, eval_accuracy = 0, 0
    nb_eval_examples, nb_eval_steps = 0, 0
    eval_preds, eval_labels = [], []

    with torch.no_grad():
        for idx, batch in enumerate(testing_loader):

            ids = batch['ids'].to(device, dtype = torch.long)
            mask = batch['mask'].to(device, dtype = torch.long)
            targets = batch['targets'].to(device, dtype = torch.long)

            outputs = model(input_ids=ids, attention_mask=mask, labels=targets)
            loss, eval_logits = outputs.loss, outputs.logits

            eval_loss += loss.item()

            nb_eval_steps += 1
            nb_eval_examples += targets.size(0)

            if idx % 100==0:
                loss_step = eval_loss/nb_eval_steps
                print(f"Validation loss per 100 evaluation steps: {loss_step}")

            # compute evaluation accuracy
            flattened_targets = targets.view(-1) # shape (batch_size * seq_len,)
            active_logits = eval_logits.view(-1, model.num_labels) # shape (batch_size * seq_len, num_labels)
            flattened_predictions = torch.argmax(active_logits, axis=1) # shape (batch_size * seq_len,)
            # now, use mask to determine where we should compare predictions with targets (includes [CLS] and [SEP] token predictions)
            active_accuracy = mask.view(-1) == 1 # active accuracy is also of shape (batch_size * seq_len,)
            targets = torch.masked_select(flattened_targets, active_accuracy)
            predictions = torch.masked_select(flattened_predictions, active_accuracy)

            eval_labels.extend(targets)
            eval_preds.extend(predictions)

            tmp_eval_accuracy = accuracy_score(targets.cpu().numpy(), predictions.cpu().numpy())
            eval_accuracy += tmp_eval_accuracy

    #print(eval_labels)
    #print(eval_preds)

    labels = [id2label[id.item()] for id in eval_labels]
    predictions = [id2label[id.item()] for id in eval_preds]

    #print(labels)
    #print(predictions)

    eval_loss = eval_loss / nb_eval_steps
    eval_accuracy = eval_accuracy / nb_eval_steps
    print(f"Validation Loss: {eval_loss}")
    print(f"Validation Accuracy: {eval_accuracy}")

    return labels, predictions

In [31]:
labels, predictions = valid(model, testing_loader)

Validation loss per 100 evaluation steps: 0.04806925356388092
Validation Loss: 0.04806925356388092
Validation Accuracy: 0.8758169934640523


In [None]:
# from seqeval.metrics import classification_report

# print(classification_report([labels], [predictions]))

## Inference

In [33]:
sentence = "濕上甚而熱治以苦溫佐以甘辛以汗爲故而止平胃散方見內傷主之濕在上宜微汗而解不欲汗多故不用麻黃乾葛輩宜微汗用防己黃芪湯"

inputs = tokenizer(sentence, padding='max_length', truncation=True, max_length=MAX_LEN, return_tensors="pt")

# move to gpu
ids = inputs["input_ids"].to(device)
mask = inputs["attention_mask"].to(device)
# forward pass
outputs = model(ids, mask)
logits = outputs[0]

active_logits = logits.view(-1, model.num_labels) # shape (batch_size * seq_len, num_labels)
flattened_predictions = torch.argmax(active_logits, axis=1) # shape (batch_size*seq_len,) - predictions at the token level

tokens = tokenizer.convert_ids_to_tokens(ids.squeeze().tolist())
token_predictions = [id2label[i] for i in flattened_predictions.cpu().numpy()]
wp_preds = list(zip(tokens, token_predictions)) # list of tuples. Each tuple = (wordpiece, prediction)

word_level_predictions = []
for pair in wp_preds:
  if (pair[0].startswith(" ##")) or (pair[0] in ['[CLS]', '[SEP]', '[PAD]']):
    # skip prediction
    continue
  else:
    word_level_predictions.append(pair[1])

# we join tokens, if they are not special ones
str_rep = " ".join([t[0] for t in wp_preds if t[0] not in ['[CLS]', '[SEP]', '[PAD]']]).replace(" ##", "")
print(str_rep)
print(word_level_predictions)

濕 上 甚 而 熱 治 以 苦 溫 佐 以 甘 辛 以 汗 爲 故 而 止 平 胃 散 方 見 內 傷 主 之 濕 在 上 宜 微 汗 而 解 不 欲 汗 多 故 不 用 麻 黃 乾 葛 輩 宜 微 汗 用 防 己 黃 芪 湯
['O', 'O', 'O', 'O', '，', 'O', 'O', 'O', '，', 'O', 'O', 'O', '，', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', '。', 'O', 'O', '，', 'O', 'O', 'O', 'O', '，', 'O', 'O', 'O', '，', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', '，', 'O', 'O', 'O', 'O', 'O', 'O']


In [34]:
def build_output( input_text, predicted_class ):
    if len(input_text) != len(predicted_class):
        print( "입력 텍스트와 결과 텍트트의 길이가 서로 다릅니다. ")
        return None
    output_list = list()
    for a, b in zip( input_text, predicted_class ):
        output_list.append( a )
        if b != "O":
            output_list.append( b )
    rst = "".join( output_list )
    rst = rst.replace("", "")
    return rst

In [35]:
build_output( sentence, word_level_predictions )

'濕上甚而熱，治以苦溫，佐以甘辛，以汗爲故而止平胃散方見內傷主之。濕在上，宜微汗而解，不欲汗多，故不用麻黃乾葛輩宜微汗，用防己黃芪湯'

In [36]:
from transformers import pipeline

pipe = pipeline(task="token-classification", model=model.to("cpu"), tokenizer=tokenizer, aggregation_strategy="simple")
sentence = "濕上甚而熱治以苦溫佐以甘辛以汗爲故而止平胃散方見內傷主之濕在上宜微汗而解不欲汗多故不用麻黃乾葛輩宜微汗用防己黃芪湯"
pipe(sentence)

[{'entity_group': '，', 'score': 0.8703088, 'word': '熱', 'start': 4, 'end': 5},
 {'entity_group': '，', 'score': 0.7995402, 'word': '溫', 'start': 8, 'end': 9},
 {'entity_group': '，',
  'score': 0.7567696,
  'word': '辛',
  'start': 12,
  'end': 13},
 {'entity_group': '。',
  'score': 0.8064684,
  'word': '之',
  'start': 27,
  'end': 28},
 {'entity_group': '，',
  'score': 0.8801451,
  'word': '上',
  'start': 30,
  'end': 31},
 {'entity_group': '，',
  'score': 0.84420615,
  'word': '解',
  'start': 35,
  'end': 36},
 {'entity_group': '，',
  'score': 0.92889386,
  'word': '多',
  'start': 39,
  'end': 40},
 {'entity_group': '，',
  'score': 0.8986384,
  'word': '汗',
  'start': 50,
  'end': 51}]

In [None]:
from huggingface_hub import notebook_login

notebook_login()

'郡邑，置夫子庙于学，以嵗时释奠。盖自唐。贞观以来，未之或改。我宋有天下因其制而损益之。姑苏当浙右要区，规模尤大，更建炎戎马，荡然无遗。虽修学宫于荆榛瓦砾之余，独殿宇未遑议也。每春秋展礼于斋庐，已则置不问，殆为阙典。今寳文阁直学士括苍梁公来牧之。明年，实绍兴十有一禩也。二月，上丁修祀既毕，乃愓然自咎，揖诸生而告之曰"天子不以汝嘉为不肖，俾再守兹土，顾治民事神，皆守之职。惟是夫子之祀，教化所基，尤宜严且谨。而拜跪荐祭之地，卑陋乃尔！其何以掲防妥灵？汝嘉不敢避其责。曩常去此弥年，若有所负，尚安得以罢輭自恕，复累后人乎！他日或克就绪，愿与诸君落之。于是谋之，僚吏搜故府，得遗材千枚，取赢资以给其费。鸠工庀役，各举其任。嵗月讫，工民不与知像设礼器，百用具修。至于堂室。廊序。门牖。垣墙，皆一新之。'

In [None]:
model_name = "bert-finetuned-ner"

# upload files to the hub
tokenizer.push_to_hub(
    repo_path_or_name=model_name,
    organization="nielsr",
    commit_message="Add tokenizer",
    use_temp_dir=True,
)
model.push_to_hub(
    repo_path_or_name=model_name,
    organization="nielsr",
    commit_message="Add model",
    use_temp_dir=True,
)

In [None]:
from transformers import AutoTokenizer, AutoModelForTokenClassification

model_name = "nielsr/bert-finetuned-ner"
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForTokenClassification.from_pretrained(model_name)