<a href="https://colab.research.google.com/github/murakami-tatsumi/colab/blob/master/BERT%E3%81%AE%E3%83%95%E3%82%A1%E3%82%A4%E3%83%B3%E3%83%81%E3%83%A5%E3%83%BC%E3%83%8B%E3%83%B3%E3%82%B0%E4%BA%8B%E4%BE%8B%EF%BC%88%E6%97%A5%E6%9C%AC%E8%AA%9Esentense_piece%EF%BC%89livedoor%E8%A8%98%E4%BA%8B%E3%81%AE%E5%88%86%E9%A1%9E%E5%95%8F%E9%A1%8C.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

## Finetuning of the pretrained Japanese BERT model

This notebook reproduces the following GitHub notebook by yoheikikuta:<br>
https://github.com/yoheikikuta/bert-japanese/blob/master/notebook/finetune-to-livedoor-corpus.ipynb

Finetune the pretrained model to solve multi-class classification problems.  
This notebook requires the following objects:
- trained sentencepiece model (model and vocab files)
- pretrained Japanese BERT model

The dataset is the livedoor news corpus (livedoor ニュースコーパス), available from https://www.rondhuit.com/download.html.  
The data is split into test:dev:train = 2:2:6.

Results:

- Full training data
  - BERT with SentencePiece
    ```
                    precision    recall  f1-score   support

    dokujo-tsushin       0.98      0.94      0.96       178
      it-life-hack       0.96      0.97      0.96       172
     kaden-channel       0.99      0.98      0.99       176
    livedoor-homme       0.98      0.88      0.93        95
       movie-enter       0.96      0.99      0.98       158
            peachy       0.94      0.98      0.96       174
              smax       0.98      0.99      0.99       167
      sports-watch       0.98      1.00      0.99       190
        topic-news       0.99      0.98      0.98       163

         micro avg       0.97      0.97      0.97      1473
         macro avg       0.97      0.97      0.97      1473
      weighted avg       0.97      0.97      0.97      1473
    ```
  - sklearn GradientBoostingClassifier with MeCab
    ```
                    precision    recall  f1-score   support

    dokujo-tsushin       0.89      0.86      0.88       178
      it-life-hack       0.91      0.90      0.91       172
     kaden-channel       0.90      0.94      0.92       176
    livedoor-homme       0.79      0.74      0.76        95
       movie-enter       0.93      0.96      0.95       158
            peachy       0.87      0.92      0.89       174
              smax       0.99      1.00      1.00       167
      sports-watch       0.93      0.98      0.96       190
        topic-news       0.96      0.86      0.91       163

         micro avg       0.92      0.92      0.92      1473
         macro avg       0.91      0.91      0.91      1473
      weighted avg       0.92      0.92      0.91      1473
    ```

- Small training data (1/5 of full training data)
  - BERT with SentencePiece
    ```
                    precision    recall  f1-score   support

    dokujo-tsushin       0.97      0.87      0.92       178
      it-life-hack       0.86      0.86      0.86       172
     kaden-channel       0.95      0.94      0.95       176
    livedoor-homme       0.82      0.82      0.82        95
       movie-enter       0.97      0.99      0.98       158
            peachy       0.89      0.95      0.92       174
              smax       0.94      0.96      0.95       167
      sports-watch       0.97      0.97      0.97       190
        topic-news       0.94      0.94      0.94       163

         micro avg       0.93      0.93      0.93      1473
         macro avg       0.92      0.92      0.92      1473
      weighted avg       0.93      0.93      0.93      1473
    ```
  - sklearn GradientBoostingClassifier with MeCab
    ```
                    precision    recall  f1-score   support

    dokujo-tsushin       0.82      0.71      0.76       178
      it-life-hack       0.86      0.88      0.87       172
     kaden-channel       0.91      0.87      0.89       176
    livedoor-homme       0.67      0.63      0.65        95
       movie-enter       0.87      0.95      0.91       158
            peachy       0.70      0.78      0.73       174
              smax       1.00      1.00      1.00       167
      sports-watch       0.87      0.95      0.91       190
        topic-news       0.92      0.82      0.87       163

         micro avg       0.85      0.85      0.85      1473
         macro avg       0.85      0.84      0.84      1473
      weighted avg       0.86      0.85      0.85      1473
    ```

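## Additional setup relative to the original

The steps below build the environment in your own Colab.

In addition to these steps, the following setup is assumed:

- Runtime: originally TPU, but it took too long on TPU, so it was changed to GPU
- yoheikikuta's model has been added to My Drive beforehand

That is, add yoheikikuta's Google Drive folder
> https://drive.google.com/drive/folders/1Zsm9DD40lrUVu6iAnIuTH2ODIkh-WM-O<br>

to "My Drive" in your Google Drive beforehand.<br>
(The files can also be accessed directly through My Drive, but copying them is fast, so we copy them to be safe.)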


In [0]:
# Google Drive must be mounted again in every session
from google.colab import drive
drive.mount('/content/drive')

In [0]:
!pip install sentencepiece

In [0]:
# Clone bert-japanese into the code folder (here, under /content)
!git clone https://github.com/yoheikikuta/bert-japanese.git

In [0]:
%cd /content/bert-japanese
!ls -l
!ls -l src

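The folder structure under bert-japanese is
> bert-japanese
> - bert → empty folder
> - model → empty folder
> - notebook → not needed for this run
> - src

and the empty folders are filled as follows:
- bert → the original BERT repository, cloned with git
- model → copied from yoheikikuta's Google Drive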


In [0]:
# Clone the original BERT repository into the bert folder
!git clone https://github.com/google-research/bert.git
!ls -l bert
import sys
sys.path.append('/content/bert-japanese/bert')
sys.path

In [0]:
%%time
# Copy the model files from yoheikikuta's Google Drive folder.
# This takes a while (about 10 minutes).
!cp /content/drive/My\ Drive/bert-wiki-ja/*.* ./model/
!ls -l model

The original config.ini assumes that bert-japanese is extracted directly under
> /work/

in the Colab environment, so it must be replaced with a config.ini in which
> /work/ → /content/bert-japanese/

has been substituted.
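
As a rough alternative to editing the file by hand, the substitution described above could be scripted. The sketch below is only an illustration: the helper name `rewrite_config_paths` and the sample section and key are hypothetical, and it assumes every occurrence of `/work/` in config.ini is a path prefix that should be rewritten.

```python
def rewrite_config_paths(text, old_prefix="/work/", new_prefix="/content/bert-japanese/"):
    """Return the config text with every old_prefix replaced by new_prefix."""
    return text.replace(old_prefix, new_prefix)

# Hypothetical config fragment, used only to show the effect.
sample = "[SAMPLE-SECTION]\nSAMPLEDIR = /work/data/sample\n"
rewritten = rewrite_config_paths(sample)
print(rewritten)
```

Applied to the real file, the same function would be called on the contents of config.ini and the result written back before `configparser` reads it.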

In [0]:
# Upload the modified config.ini
!rm config.ini
from google.colab import files
uploaded = files.upload()

In [0]:
# Change the current directory to src
%cd /content/bert-japanese/src

## The actual run starts here

In [0]:
import configparser
import glob
import os
import pandas as pd
import subprocess
import sys
import tarfile 
from urllib.request import urlretrieve

CURDIR = os.getcwd()
CONFIGPATH = os.path.join(CURDIR, os.pardir, 'config.ini')
config = configparser.ConfigParser()
config.read(CONFIGPATH)

## Data preparation (process the livedoor articles, the subject of this task, into input for fine-tuning)

You need to execute the following cells only once.

In [0]:
FILEURL = config['FINETUNING-DATA']['FILEURL']
FILEPATH = config['FINETUNING-DATA']['FILEPATH']
EXTRACTDIR = config['FINETUNING-DATA']['TEXTDIR']

Download and unzip data.

In [0]:
%%time

urlretrieve(FILEURL, FILEPATH)

mode = "r:gz"
tar = tarfile.open(FILEPATH, mode) 
tar.extractall(EXTRACTDIR) 
tar.close()

Data preprocessing.

In [0]:
def extract_txt(filename):
    with open(filename, encoding='utf-8') as text_file:
        # Skip the first two lines (0: URL, 1: timestamp) and keep the rest.
        text = text_file.readlines()[2:]
        text = [sentence.strip() for sentence in text]
        text = list(filter(lambda line: line != '', text))
        return ''.join(text)

In [0]:
categories = [ 
    name for name 
    in os.listdir( os.path.join(EXTRACTDIR, "text") ) 
    if os.path.isdir( os.path.join(EXTRACTDIR, "text", name) ) ]

categories = sorted(categories)

In [0]:
categories

In [0]:
table = str.maketrans({
    '\n': '',
    '\t': '　',
    '\r': '',
})
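
To make the effect of this translation table concrete, here is a small sketch on a made-up string. It assumes the replacement for tabs is the full-width space U+3000, written as an escape for visibility; the input string is hypothetical. Line breaks are dropped and tabs are replaced, so each article fits on a single row of a tab-separated file.

```python
# Same mapping as the notebook's `table`, with the tab replacement spelled as an escape.
table = str.maketrans({
    '\n': '',
    '\t': '\u3000',  # full-width (ideographic) space, assumed
    '\r': '',
})

raw = "first line\r\nsecond\tline\n"  # hypothetical input
cleaned = raw.translate(table)
print(repr(cleaned))
```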

In [0]:
%%time

all_text = []
all_label = []

for cat in categories:
    files = glob.glob(os.path.join(EXTRACTDIR, "text", cat, "{}*.txt".format(cat)))
    files = sorted(files)
    body = [ extract_txt(elem).translate(table) for elem in files ]
    label = [cat] * len(body)
    
    all_text.extend(body)
    all_label.extend(label)

In [0]:
df = pd.DataFrame({'text' : all_text, 'label' : all_label})

In [0]:
df.head()

In [0]:
df = df.sample(frac=1, random_state=23).reset_index(drop=True)

In [0]:
df.head()

Save data as tsv files.  
test:dev:train = 2:2:6. To check the usability of finetuning, we also prepare sampled training data (1/5 of full training data).

In [0]:
df[:len(df) // 5].to_csv( os.path.join(EXTRACTDIR, "test.tsv"), sep='\t', index=False)
df[len(df) // 5:len(df)*2 // 5].to_csv( os.path.join(EXTRACTDIR, "dev.tsv"), sep='\t', index=False)
df[len(df)*2 // 5:].to_csv( os.path.join(EXTRACTDIR, "train.tsv"), sep='\t', index=False)

### 1/5 of full training data.
# df[:len(df) // 5].to_csv( os.path.join(EXTRACTDIR, "test.tsv"), sep='\t', index=False)
# df[len(df) // 5:len(df)*2 // 5].to_csv( os.path.join(EXTRACTDIR, "dev.tsv"), sep='\t', index=False)
# df[len(df)*2 // 5:].sample(frac=0.2, random_state=23).to_csv( os.path.join(EXTRACTDIR, "train.tsv"), sep='\t', index=False)
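
The slice boundaries used above can be checked with plain integer arithmetic. The following is a minimal sketch with a hypothetical row count; `split_bounds` is an illustrative helper, not part of the original code.

```python
def split_bounds(n):
    """Return (test_end, dev_end) row indices for a 2:2:6 test:dev:train split."""
    test_end = n // 5
    dev_end = n * 2 // 5
    return test_end, dev_end

n = 100  # hypothetical number of rows
test_end, dev_end = split_bounds(n)
sizes = (test_end, dev_end - test_end, n - dev_end)
print(sizes)  # (20, 20, 60)
```

Because the frame is shuffled with a fixed `random_state` before slicing, the three parts are disjoint and reproducible.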

In [0]:
PRETRAINED_MODEL_PATH = '../model/model.ckpt-1400000'
FINETUNE_OUTPUT_DIR = '../model/livedoor_output'

## Finetune pre-trained model

It will take many hours to execute the following cells in a CPU environment.  
You can also use Colab to take advantage of a TPU. In that case, you need to upload the created data to your GCS bucket.

[![Open In Colab](https://colab.research.google.com/assets/colab-badge.svg)](https://colab.research.google.com/drive/1zZH2GWe0U-7GjJ2w2duodFfEUptvHjcx)

In [0]:
%%time
# It will take many hours in a CPU environment.

!python3 ../src/run_classifier.py \
  --task_name=livedoor \
  --do_train=true \
  --do_eval=true \
  --data_dir=../data/livedoor \
  --model_file="../model/wiki-ja.model" \
  --vocab_file="../model/wiki-ja.vocab" \
  --init_checkpoint={PRETRAINED_MODEL_PATH} \
  --max_seq_length=512 \
  --train_batch_size=4 \
  --learning_rate=2e-5 \
  --num_train_epochs=10 \
  --output_dir={FINETUNE_OUTPUT_DIR}

In [0]:
!ls -l ../model/livedoor_output
!ls -l ../model/livedoor_output/eval

In [0]:
# Save the fine-tuning results to My Drive
!cp -r ../model/livedoor_output /content/drive/My\ Drive/Colab_Data/

## Predict using the finetuned model

Let's predict test data using the finetuned model.  

In [0]:
# Restore the fine-tuning results from My Drive
!cp -r /content/drive/My\ Drive/Colab_Data/livedoor_output ../model/

In [0]:
!ls -l ../model/livedoor_output

In [0]:
import sys
sys.path.append("../src")

import tokenization_sentencepiece as tokenization
from run_classifier import LivedoorProcessor
from run_classifier import model_fn_builder
from run_classifier import file_based_input_fn_builder
from run_classifier import file_based_convert_examples_to_features
from utils import str_to_value

In [0]:
sys.path.append("../bert")

import modeling
import optimization
import tensorflow as tf

In [0]:
import configparser
import json
import glob
import os
import pandas as pd
import tempfile

bert_config_file = tempfile.NamedTemporaryFile(mode='w+t', encoding='utf-8', suffix='.json')
bert_config_file.write(json.dumps({k:str_to_value(v) for k,v in config['BERT-CONFIG'].items()}))
bert_config_file.seek(0)
bert_config = modeling.BertConfig.from_json_file(bert_config_file.name)

In [0]:
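# A plain string sort does not account for the number of digits, so it would not
# reliably return the latest checkpoint. Sort by the numeric step instead.
# Still check the folder contents and, if needed, set FINETUNED_MODEL_PATH by hand.
output_ckpts = glob.glob("{}/model.ckpt*data*".format(FINETUNE_OUTPUT_DIR))
latest_ckpt = max(output_ckpts, key=lambda p: int(p.split('model.ckpt-')[-1].split('.')[0]))
FINETUNED_MODEL_PATH = latest_ckpt.split('.data-00000-of-00001')[0]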
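
To see why lexicographic ordering is unreliable for checkpoint names, here is a small sketch comparing a string sort with a sort on the numeric step. The file names are hypothetical and `checkpoint_step` is an illustrative helper; it assumes names of the form `model.ckpt-<step>.data-...`.

```python
def checkpoint_step(path):
    """Extract the integer step from a name like 'model.ckpt-11000.data-00000-of-00001'."""
    return int(path.split('model.ckpt-')[-1].split('.')[0])

# Hypothetical checkpoint file names.
ckpts = [
    "out/model.ckpt-9000.data-00000-of-00001",
    "out/model.ckpt-11000.data-00000-of-00001",
    "out/model.ckpt-10000.data-00000-of-00001",
]

by_string = sorted(ckpts)[-1]              # lexicographic: '9' sorts after '1'
by_step = max(ckpts, key=checkpoint_step)  # numeric: largest step wins
print(by_string)
print(by_step)
```

With these names the string sort picks step 9000, while the numeric key picks step 11000.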

In [0]:
class FLAGS(object):
    '''Parameters.'''
    def __init__(self):
        self.model_file = "../model/wiki-ja.model"
        self.vocab_file = "../model/wiki-ja.vocab"
        self.do_lower_case = True
        self.use_tpu = False
        self.output_dir = "/dummy"
        self.data_dir = "../data/livedoor"
        self.max_seq_length = 512
        self.init_checkpoint = FINETUNED_MODEL_PATH
        self.predict_batch_size = 4
        
        # The following parameters are not used in predictions.
        # Just use to create RunConfig.
        self.master = None
        self.save_checkpoints_steps = 1
        self.iterations_per_loop = 1
        self.num_tpu_cores = 1
        self.learning_rate = 0
        self.num_warmup_steps = 0
        self.num_train_steps = 0
        self.train_batch_size = 0
        self.eval_batch_size = 0

In [0]:
FLAGS = FLAGS()

In [0]:
processor = LivedoorProcessor()
label_list = processor.get_labels()

In [0]:
tokenizer = tokenization.FullTokenizer(
    model_file=FLAGS.model_file, vocab_file=FLAGS.vocab_file,
    do_lower_case=FLAGS.do_lower_case)

tpu_cluster_resolver = None

is_per_host = tf.contrib.tpu.InputPipelineConfig.PER_HOST_V2

run_config = tf.contrib.tpu.RunConfig(
    cluster=tpu_cluster_resolver,
    master=FLAGS.master,
    model_dir=FLAGS.output_dir,
    save_checkpoints_steps=FLAGS.save_checkpoints_steps,
    tpu_config=tf.contrib.tpu.TPUConfig(
        iterations_per_loop=FLAGS.iterations_per_loop,
        num_shards=FLAGS.num_tpu_cores,
        per_host_input_for_training=is_per_host))

In [0]:
model_fn = model_fn_builder(
    bert_config=bert_config,
    num_labels=len(label_list),
    init_checkpoint=FLAGS.init_checkpoint,
    learning_rate=FLAGS.learning_rate,
    num_train_steps=FLAGS.num_train_steps,
    num_warmup_steps=FLAGS.num_warmup_steps,
    use_tpu=FLAGS.use_tpu,
    use_one_hot_embeddings=FLAGS.use_tpu)


estimator = tf.contrib.tpu.TPUEstimator(
    use_tpu=FLAGS.use_tpu,
    model_fn=model_fn,
    config=run_config,
    train_batch_size=FLAGS.train_batch_size,
    eval_batch_size=FLAGS.eval_batch_size,
    predict_batch_size=FLAGS.predict_batch_size)

In [0]:
predict_examples = processor.get_test_examples(FLAGS.data_dir)
predict_file = tempfile.NamedTemporaryFile(mode='w+t', encoding='utf-8', suffix='.tf_record')

file_based_convert_examples_to_features(predict_examples, label_list,
                                        FLAGS.max_seq_length, tokenizer,
                                        predict_file.name)

predict_drop_remainder = True if FLAGS.use_tpu else False

predict_input_fn = file_based_input_fn_builder(
    input_file=predict_file.name,
    seq_length=FLAGS.max_seq_length,
    is_training=False,
    drop_remainder=predict_drop_remainder)

In [0]:
result = estimator.predict(input_fn=predict_input_fn)

In [0]:
%%time
# It will take a few hours in a CPU environment.

result = list(result)

In [0]:
result[:2]

Read test data set and add prediction results.

In [0]:
import pandas as pd

In [0]:
test_df = pd.read_csv("../data/livedoor/test.tsv", sep='\t')

In [0]:
test_df['predict'] = [ label_list[elem['probabilities'].argmax()] for elem in result ]
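
The mapping from probabilities to labels above is an argmax followed by a list lookup. A minimal sketch with made-up values follows; `toy_labels` and `toy_result` are hypothetical stand-ins, and plain lists are used in place of the arrays that the estimator yields.

```python
toy_labels = ["dokujo-tsushin", "it-life-hack", "kaden-channel"]  # shortened, hypothetical
toy_result = [
    {"probabilities": [0.1, 0.7, 0.2]},
    {"probabilities": [0.8, 0.1, 0.1]},
]

def argmax(values):
    """Index of the largest value (first one on ties)."""
    return max(range(len(values)), key=lambda i: values[i])

predicted = [toy_labels[argmax(elem["probabilities"])] for elem in toy_result]
print(predicted)  # ['it-life-hack', 'dokujo-tsushin']
```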

In [0]:
test_df.head()

In [0]:
sum( test_df['label'] == test_df['predict'] ) / len(test_df)

A little more detailed check using `sklearn.metrics`.

In [0]:
!pip install scikit-learn

In [0]:
from sklearn.metrics import classification_report
from sklearn.metrics import confusion_matrix

In [0]:
print(classification_report(test_df['label'], test_df['predict']))

In [0]:
print(confusion_matrix(test_df['label'], test_df['predict']))
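
For readers unfamiliar with the report columns, the per-class precision and recall can be sketched in plain Python. This is only an illustration on made-up labels, not a replacement for `sklearn.metrics`; `precision_recall` is a hypothetical helper.

```python
def precision_recall(y_true, y_pred, label):
    """Precision and recall for one class, computed from paired label lists."""
    tp = sum(t == label and p == label for t, p in zip(y_true, y_pred))
    predicted = sum(p == label for p in y_pred)  # rows predicted as this class
    actual = sum(t == label for t in y_true)     # rows that truly are this class
    precision = tp / predicted if predicted else 0.0
    recall = tp / actual if actual else 0.0
    return precision, recall

# Hypothetical labels and predictions.
y_true = ["smax", "smax", "peachy", "peachy", "peachy"]
y_pred = ["smax", "peachy", "peachy", "peachy", "smax"]
print(precision_recall(y_true, y_pred, "peachy"))
print(precision_recall(y_true, y_pred, "smax"))
```

The `support` column in the report corresponds to `actual` here, the number of true rows for each class.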

### Simple baseline model.

In [0]:
import pandas as pd
from sklearn.metrics import classification_report
from sklearn.metrics import confusion_matrix

In [0]:
train_df = pd.read_csv("../data/livedoor/train.tsv", sep='\t')
dev_df = pd.read_csv("../data/livedoor/dev.tsv", sep='\t')
test_df = pd.read_csv("../data/livedoor/test.tsv", sep='\t')

In [0]:
!apt-get install -q -y mecab libmecab-dev mecab-ipadic mecab-ipadic-utf8

In [0]:
!pip install mecab-python3==0.7

In [0]:
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.ensemble import GradientBoostingClassifier
import MeCab

In [0]:
m = MeCab.Tagger("-Owakati")

In [0]:
train_dev_df = pd.concat([train_df, dev_df])

In [0]:
train_dev_xs = train_dev_df['text'].apply(lambda x: m.parse(x))
train_dev_ys = train_dev_df['label']

test_xs = test_df['text'].apply(lambda x: m.parse(x))
test_ys = test_df['label']

In [0]:
vectorizer = TfidfVectorizer(max_features=750)
train_dev_xs_ = vectorizer.fit_transform(train_dev_xs)
test_xs_ = vectorizer.transform(test_xs)

The following setup is not exactly identical to that of BERT, because the classifier internally uses `train_test_split` with shuffling to carve out its validation set.  
In addition, the parameters are not well tuned; however, we think this is enough to show the power of BERT.

In [0]:
%%time

model = GradientBoostingClassifier(n_estimators=200,
                                   validation_fraction=len(dev_df)/len(train_df),
                                   n_iter_no_change=5,
                                   tol=0.01,
                                   random_state=23)

### 1/5 of full training data.
# model = GradientBoostingClassifier(n_estimators=200,
#                                    validation_fraction=len(dev_df)/len(train_df),
#                                    n_iter_no_change=5,
#                                    tol=0.01,
#                                    random_state=23)

model.fit(train_dev_xs_, train_dev_ys)

In [0]:
print(classification_report(test_ys, model.predict(test_xs_)))

In [0]:
print(confusion_matrix(test_ys, model.predict(test_xs_)))