## **Absa 50.03: Token Classification**
> Task (Pretrained model choice): BertForTokenClassification ('bert-base-chinese')

> Dataset: (4/12 updated) 2243 labelled texts  

> Splitting: 1795 for training, 448 for validation (without shuffling).

> Trainer source code: https://github.com/huggingface/transformers/tree/master/examples/token-classification

> Model Performance:

[Detailed tag-separated performance](https://docs.google.com/document/d/16jQvs6bCiJw2848jnXnWamhZfcY_unCOJa3FuPm8E4Y/edit)

    eval_overall_accuracy     =     0.7969
    eval_overall_f1           =     0.7969



> Notes:
  *  Workaround for CSV files that cannot be loaded (use JSON Lines files instead): https://github.com/huggingface/transformers/issues/8698 

## Import Packages

In [18]:
import pickle
import torch
import random
import numpy as np
import pandas as pd 
import csv
import json
import datetime
from pathlib import Path
from torch.utils.data import Dataset
from torch.nn.utils.rnn import pad_sequence
from torch.utils.data import DataLoader

## Import seq-pair files 

In [19]:
from google.colab import drive
drive.mount('/content/gdrive')

Drive already mounted at /content/gdrive; to attempt to forcibly remount, call drive.mount("/content/gdrive", force_remount=True).


In [20]:
filename = '/content/gdrive/MyDrive/指向情緒案/data/annot_data/annotated_data_bkup/20210412/seq_pairs_20210412.pkl'
with open(filename, 'rb') as f:
    data = pickle.load(f)

In [21]:
# format: a list of tuples:(text, tags)
print(f'Dataset size: {len(data)}')
for pair in data[710:715]:
  text = pair[0]
  tag = pair[1]
  for i in range(len(text)):
    print(f'{text[i]}|{tag[i]}', end = ' , ')
  print('\n')
print(data[0][0])
print(data[0])

Dataset size: 2242
中|B-E , 華|I-E , 5|B-A , 8|I-A , 8|I-A , 其|B-O , 實|B-O , 也|B-V , 還|I-V , 好|I-V , 不|I-V , 算|I-V , 多|I-V , 差|I-V , 啦|I-V , 

不|B-O , 划|B-O , 算|B-O , ，|B-O , 當|B-O , 初|B-O , 幫|B-O , 我|B-O , 弟|B-O , 算|B-O , 了|B-O , 一|B-O , 下|B-O , 現|B-O , 在|B-O , p|B-O , r|B-O , o|B-O , 耳|B-O , 機|B-O , 一|B-O , 直|B-O , 降|B-O , 價|B-O , ，|B-O , 還|B-O , 不|B-O , 如|B-O , 直|B-O , 辦|B-O , 4|B-A , 8|I-A , 8|I-A , 比|B-V , 較|I-V , 好|I-V , 除|B-O , 非|B-O , 你|B-O , 真|B-O , 的|B-O , 想|B-O , 用|B-O , 耳|B-O , 機|B-O , ！|B-O , 

市|B-A , 話|I-A , 送|I-A , 1|I-A , 6|I-A , 0|I-A , 分|I-A , 真|B-V , 的|I-V , 很|I-V , 划|I-V , 算|I-V , 

沒|B-O , 辦|B-O , 法|B-O , 生|B-O , 意|B-O , 做|B-O , 很|B-O , 大|B-O , 一|B-O , 定|B-O , 要|B-O , 打|B-O , 市|B-A , 話|I-A , 跟|I-A , 網|I-A , 外|I-A , X|B-O , D|B-O , 之|B-O , 前|B-O , 還|B-O , 有|B-O , 送|B-O , 簡|B-O , 訊|B-O , 還|B-O , 能|B-O , 簡|B-O , 訊|B-O , 檢|B-O , 舉|B-O , 違|B-O , 規|B-O , 停|B-O , 車|B-O , 超|B-V , 好|I-V , 用|I-V , 的|I-V , 

補|B-O , 充|B-O , ：|B-O , 台|B-E , 哥|I-E , 大|I-E , 在|B-O , 蘭|B-A , 潭|I-A
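
The printed pairs use a BIO-style scheme: each character carries a `B-` or `I-` prefix plus a type, and characters outside any span are tagged `B-O`. As a minimal sketch, assuming type `O` means "outside" and that a span is a `B-X` tag followed by zero or more `I-X` tags, the helper below groups characters into labelled spans. `extract_spans` is a hypothetical name that does not appear in the notebook; the sample is the first pair printed above.

```python
def extract_spans(text, tags):
    """Group characters into (type, substring) spans from BIO-style tags.

    Assumption: type 'O' means "outside" and is skipped.
    """
    if len(text) != len(tags):
        raise ValueError(f'length mismatch: {len(text)} chars vs {len(tags)} tags')
    spans = []
    current_type, current_chars = None, []
    for ch, tag in zip(text, tags):
        prefix, sep, tag_type = tag.partition('-')
        if not sep:
            # A bare tag such as 'O' has no prefix; treat it as a span start.
            prefix, tag_type = 'B', tag
        continues = prefix == 'I' and tag_type == current_type
        if not continues:
            # Close the span in progress, if any, before starting a new one.
            if current_type is not None:
                spans.append((current_type, ''.join(current_chars)))
            current_type, current_chars = None, []
            if tag_type != 'O':
                # A stray I-X with no matching B-X also starts a new span here.
                current_type = tag_type
        if current_type is not None:
            current_chars.append(ch)
    if current_type is not None:
        spans.append((current_type, ''.join(current_chars)))
    return spans


text = '中華588其實也還好不算多差啦'
tags = ['B-E', 'I-E', 'B-A', 'I-A', 'I-A', 'B-O', 'B-O', 'B-V'] + ['I-V'] * 7
for span_type, chars in extract_spans(text, tags):
    print(span_type, chars)
```

The length check at the top is also a cheap way to catch pairs whose text and tag sequences have drifted out of alignment before they are written to disk.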

## Helper functions
Transform the seq-pairs into the file formats accepted by the transformers trainer script (.csv, .json).

In [22]:
def WriteCSV(data_fraction, output_path):
  results = [['text', 'tags']]
  texts = [pair[0] for pair in data_fraction]
  tags = [pair[1] for pair in data_fraction]
  for te, ta in zip(texts, tags):
    results.append([te, ta])
  with open(output_path, 'w', newline = '') as csvfile:
    writer = csv.writer(csvfile)
    writer.writerows(results)
  print(f'Finished writing into {output_path}.')

In [23]:
PATH  = '/content/gdrive/MyDrive/指向情緒案/data/annot_data/annotated_data_bkup/20210412/bert_ner'
# Find the latest dir and build its bert_ner dir
def FindDir(dir_path = '/content/gdrive/MyDrive/指向情緒案/data/annot_data/annotated_data_bkup'): 
  dir_path = Path(dir_path)
  # Keep only subdirectories, sorted by name so the latest date comes last
  chdirs = sorted(x for x in dir_path.iterdir() if x.is_dir())
  target_name = chdirs[-1].name
  LAST_DIR_PATH = dir_path / target_name / 'bert_ner'
  # Create the directory (and any missing parents) if it does not exist yet
  LAST_DIR_PATH.mkdir(parents=True, exist_ok=True)
  # Return a str so that PATH + '/file' concatenation keeps working
  return str(LAST_DIR_PATH)
# PATH = FindDir()

In [24]:
### CSV FILES ###
ratio = round(len(data)*0.8)
train_csv_path = PATH + '/cht_ner_train.csv'
val_csv_path = PATH + '/cht_ner_val.csv'
# WriteCSV(data[:ratio], train_csv_path)
# WriteCSV(data[ratio:], val_csv_path)
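
The split takes the first 80% of `data` for training and the remainder for validation, in file order. If the pickle happens to be ordered in some systematic way, the two parts may not be comparable. The following is a minimal sketch of a seeded shuffle before the cut; `split_pairs` and its parameters are hypothetical names, the seed is borrowed from the training command, and this is not what the notebook actually ran.

```python
import random


def split_pairs(pairs, train_fraction=0.8, seed=None):
    """Return (train, val); shuffle a copy first when a seed is given."""
    pairs = list(pairs)  # copy, so the caller's list is left untouched
    if seed is not None:
        random.Random(seed).shuffle(pairs)
    cut = round(len(pairs) * train_fraction)
    return pairs[:cut], pairs[cut:]


# Toy stand-in for the (text, tags) pairs.
toy = [(f'text{i}', ['B-O']) for i in range(10)]
train, val = split_pairs(toy)                  # same order as plain slicing
train_s, val_s = split_pairs(toy, seed=42069)  # shuffled, but reproducible
print(len(train), len(val))
```

Passing `seed=None` reproduces the unshuffled behaviour, so the two variants can be compared on the same data.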

In [25]:
!pip install jsonlines



In [26]:
import jsonlines
def WriteJSONL(data_fraction, output_path):
  '''
  Required format (one JSON dict, as shown below):
  {"text": ["B7 台哥大其他地方都很快，只有在地下室，超級爛！！！訊號會直接不見，我同學用中華和遠傳都還收得到"],
   "tags": ["B-O", "B-O", "B-O", "B-O", "B-O", "B-O", "B-V", "I-V", "I-V", "I-V", "I-V", "I-V", "I-V", "I-V", "I-V", "I-V", "I-V", "I-V", "I-V", "I-V", "I-V", "I-V", "I-V", "I-V", "I-V", "I-V", "I-V", "I-V", "I-V", "I-V", "I-V", "I-V", "I-V", "I-V", "B-O", "B-O", "B-O", "B-O", "B-O", "B-O", "B-O", "B-O", "B-O", "B-O", "B-O", "B-O", "B-O", "B-O", "B-O"]}
  A single JSON file contains multiple JSON dicts in the format above.
   '''
  texts = [pair[0] for pair in data_fraction]
  tags = [pair[1] for pair in data_fraction]
  objs = [{'text':[x], 'tags':y} for x,y in zip(texts, tags)]
  with jsonlines.open(output_path, mode='w') as writer:
    writer.write_all(objs)
  # reading: json_object = json.loads(json_dump)
  print(f'Finished writing into {output_path}.')

In [27]:
### JSON FILES ###
ratio = round(len(data)*0.8)
train_json_path = PATH + '/cht_ner_train.json'
val_json_path = PATH + '/cht_ner_val.json'
WriteJSONL(data[:ratio], train_json_path)
WriteJSONL(data[ratio:], val_json_path)

Finished writing into /content/gdrive/MyDrive/指向情緒案/data/annot_data/annotated_data_bkup/20210412/bert_ner/cht_ner_train.json.
Finished writing into /content/gdrive/MyDrive/指向情緒案/data/annot_data/annotated_data_bkup/20210412/bert_ner/cht_ner_val.json.
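
To check that the files were written as intended, each line can be parsed back as an independent JSON object. The following is a minimal sketch using only the standard library. `read_jsonl` is a hypothetical helper, and the record is an illustrative example in the shape `WriteJSONL` produces, assuming the text is a plain string wrapped in a one-element list.

```python
import json
import os
import tempfile


def read_jsonl(path):
    """Parse a JSON Lines file: one JSON object per non-empty line."""
    with open(path, encoding='utf-8') as f:
        return [json.loads(line) for line in f if line.strip()]


# Illustrative record only; it is not read from the notebook's files.
record = {'text': ['市話送160分'], 'tags': ['B-A'] + ['I-A'] * 6}

with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, 'sample.json')
    with open(path, 'w', encoding='utf-8') as f:
        f.write(json.dumps(record, ensure_ascii=False) + '\n')
    loaded = read_jsonl(path)

# One tag per character is what the printed pairs suggest.
print(len(loaded[0]['text'][0]), len(loaded[0]['tags']))
```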


## Install packages for trainer 

In [28]:
!git clone https://github.com/huggingface/transformers

fatal: destination path 'transformers' already exists and is not an empty directory.


In [29]:
!pip install datasets



In [30]:
# A source install
!pip install git+https://github.com/huggingface/transformers

Collecting git+https://github.com/huggingface/transformers
  Cloning https://github.com/huggingface/transformers to /tmp/pip-req-build-p36mo3tk
  Running command git clone -q https://github.com/huggingface/transformers /tmp/pip-req-build-p36mo3tk
  Installing build dependencies ... done
  Getting requirements to build wheel ... done
    Preparing wheel metadata ... done
Building wheels for collected packages: transformers
  Building wheel for transformers (PEP 517) ... done
  Created wheel for transformers: filename=transformers-4.6.0.dev0-cp37-none-any.whl size=2107872 sha256=f7bcb817ec8bb6a163daf511ccac6a38ec50ca51d08310a2424454703560607f
  Stored in directory: /tmp/pip-ephem-wheel-cache-ciwqoi9b/wheels/70/d3/52/b3fa4f8b8ef04167ac62e5bb2accb62ae764db2a378247490e
Successfully built transformers


In [31]:
# NO TRAINER VERSION

# !python /content/transformers/examples/token-classification/run_ner_no_trainer.py \
#   --model_name_or_path bert-base-chinese \
#   --task_name ner \
#   --train_file /content/gdrive/MyDrive/指向情緒案/data/annot_data/annotated_data_bkup/20210412/bert_ner/cht_ner_train.json\
#   --validation_file /content/gdrive/MyDrive/指向情緒案/data/annot_data/annotated_data_bkup/20210412/bert_ner/cht_ner_val.json\
#   --max_length 128 \
#   --per_device_train_batch_size 32 \
#   --learning_rate 2e-5 \
#   --num_train_epochs 10 \
#   --output_dir /content/gdrive/MyDrive/指向情緒案/data/annot_data/annotated_data_bkup/20210412/bert_ner/\
#   --seed 42069\

In [32]:
# NO TRAINER VERSION's instructions
# !pip install accelerate
# !accelerate config
# !accelerate test 

In [33]:
# TRAINER VERSION's eval metric
!pip install seqeval 
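
`seqeval` scores predictions at the entity level rather than per token. As a simplified illustration of exact-match span scoring (not seqeval's implementation, whose handling of edge cases may differ), precision, recall, and F1 can be computed from sets of `(type, start, end)` spans. The function name and the toy spans below are hypothetical.

```python
def span_prf(gold_spans, pred_spans):
    """Exact-match precision, recall, and F1 over sets of spans."""
    gold, pred = set(gold_spans), set(pred_spans)
    true_pos = len(gold & pred)  # spans matching in type and both boundaries
    precision = true_pos / len(pred) if pred else 0.0
    recall = true_pos / len(gold) if gold else 0.0
    f1 = 2 * precision * recall / (precision + recall) if true_pos else 0.0
    return precision, recall, f1


# Toy spans: the predicted 'A' span ends one character early, so it does not count.
gold = {('E', 0, 2), ('A', 2, 5), ('V', 7, 15)}
pred = {('E', 0, 2), ('A', 2, 4), ('V', 7, 15)}
p, r, f1 = span_prf(gold, pred)
print(f'precision={p:.4f} recall={r:.4f} f1={f1:.4f}')
```

Under this kind of scoring a span with one wrong boundary earns no credit, which is why entity-level F1 can sit well below token-level accuracy.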



## Training

In [45]:
# TRAINER VERSION
!python /content/transformers/examples/token-classification/run_ner.py \
  --model_name_or_path bert-base-chinese \
  --task_name ner \
  --train_file /content/gdrive/MyDrive/指向情緒案/data/annot_data/annotated_data_bkup/20210412/bert_ner/cht_ner_train.json\
  --validation_file /content/gdrive/MyDrive/指向情緒案/data/annot_data/annotated_data_bkup/20210412/bert_ner/cht_ner_val.json\
  --per_device_train_batch_size 16 \
  --learning_rate 2e-5 \
  --num_train_epochs 10 \
  --output_dir /content/gdrive/MyDrive/指向情緒案/data/annot_data/annotated_data_bkup/20210412/bert_ner/out\
  --seed 42069 \
  --pad_to_max_length True\
  --return_entity_level_metrics True\
  --do_train \
  --do_eval 
  # --max_length 128 \
  # 42069

2021-04-15 07:59:22.782694: I tensorflow/stream_executor/platform/default/dso_loader.cc:49] Successfully opened dynamic library libcudart.so.11.0
04/15/2021 07:59:24 - INFO - __main__ -   Training/evaluation parameters TrainingArguments(output_dir=/content/gdrive/MyDrive/指向情緒案/data/annot_data/annotated_data_bkup/20210412/bert_ner/out, overwrite_output_dir=False, do_train=True, do_eval=True, do_predict=False, evaluation_strategy=IntervalStrategy.NO, prediction_loss_only=False, per_device_train_batch_size=16, per_device_eval_batch_size=8, gradient_accumulation_steps=1, eval_accumulation_steps=None, learning_rate=2e-05, weight_decay=0.0, adam_beta1=0.9, adam_beta2=0.999, adam_epsilon=1e-08, max_grad_norm=1.0, num_train_epochs=10.0, max_steps=-1, lr_scheduler_type=SchedulerType.LINEAR, warmup_ratio=0.0, warmup_steps=0, logging_dir=runs/Apr15_07-59-24_d682295f317c, logging_strategy=IntervalStrategy.STEPS, logging_first_step=False, logging_steps=500, save_strategy=IntervalStrategy.STEPS, sav