# Finetuning of the pretrained Japanese BERT model

Finetune the pretrained model to solve multi-class classification problems.  
This notebook requires the following objects:
- trained SentencePiece model (model and vocab files)
- pretrained Japanese BERT model

The data is split into test, dev, and train sets at a ratio of 2:2:6.
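
As a rough illustration of a 2:2:6 split, the sketch below uses `train_test_split` from scikit-learn twice. The helper name `split_train_dev_test`, the toy DataFrame, and the seed are assumptions made for this example; the procedure that actually produced the data sets is not shown in this notebook.

```python
import pandas as pd
from sklearn.model_selection import train_test_split


def split_train_dev_test(df, seed=0):
    """Split df into train/dev/test at roughly 6:2:2 (hypothetical helper)."""
    # First hold out 20% of all rows as the test set.
    rest_df, test_df = train_test_split(df, test_size=0.2, random_state=seed)
    # Then take 25% of the remaining 80% (20% of the total) as the dev set.
    train_df, dev_df = train_test_split(rest_df, test_size=0.25, random_state=seed)
    return train_df, dev_df, test_df


# Toy data standing in for the real labelled texts.
toy_df = pd.DataFrame({
    "label": ["label1", "label2"] * 50,
    "text": [f"text {i}" for i in range(100)],
})
train_df, dev_df, test_df = split_train_dev_test(toy_df)
print(len(train_df), len(dev_df), len(test_df))
```

Passing `stratify=df["label"]` to `train_test_split` would keep the label proportions similar across the three sets, which may matter when classes are imbalanced.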

In [None]:
! pip install -r ../requirements.txt

## Data preparation


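Place the following files in `DATA_DIR` beforehand:

- test.tsv
- train.tsv
- dev.tsv

The files are comma-separated, with no quotechar.  
(They should really be named .csv rather than .tsv, but renaming them was too much trouble, so they are left as is.)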
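
Before starting a long finetuning run, it can help to confirm that the three files are actually present. The following is a minimal sketch; the helper `missing_data_files` is a hypothetical name, and `'../data'` is only an example path.

```python
from pathlib import Path

# File names expected in the data directory.
REQUIRED_FILES = ("train.tsv", "dev.tsv", "test.tsv")


def missing_data_files(data_dir):
    """Return the required file names that are absent from data_dir."""
    return [name for name in REQUIRED_FILES
            if not (Path(data_dir) / name).is_file()]


# An empty list means all three files were found.
print(missing_data_files("../data"))
```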

## Finetune pre-trained model

Executing the following cells will take many hours in a CPU environment.  

In [None]:
PRETRAINED_MODEL_PATH = '../model/model.ckpt-1400000'
FINETUNE_OUTPUT_DIR = '../model/finetune_output'

DATA_DIR='../data'

Pass the labels here through an environment variable, as a comma-separated list.

In [None]:
%env BERT_JAPANESE_LABELS=label1,label2,label3
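
How `run_classifier.py` reads this variable is not shown in this notebook. As a hedged sketch, a script could turn a comma-separated value into a label list roughly as follows; the helper `labels_from_env` is hypothetical and is not taken from the repository.

```python
import os


def labels_from_env(var_name="BERT_JAPANESE_LABELS"):
    """Parse a comma-separated environment variable into a list of labels."""
    raw = os.environ.get(var_name, "")
    # Drop surrounding whitespace and ignore empty entries such as "a,,b".
    labels = [label.strip() for label in raw.split(",") if label.strip()]
    if not labels:
        raise ValueError(f"{var_name} is not set or contains no labels")
    return labels


# Only for this example: provide a value if the variable is not already set.
os.environ.setdefault("BERT_JAPANESE_LABELS", "label1,label2,label3")
print(labels_from_env())
```

The order of the labels matters, because predicted column indices are later mapped back to label names by position.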

In [None]:
%%time
# This will take many hours in a CPU environment.

!python3 ../src/run_classifier.py \
  --do_train=true \
  --do_eval=true \
  --data_dir={DATA_DIR} \
  --model_file=../model/wiki-ja.model \
  --vocab_file=../model/wiki-ja.vocab \
  --init_checkpoint={PRETRAINED_MODEL_PATH} \
  --max_seq_length=512 \
  --train_batch_size=4 \
  --learning_rate=2e-5 \
  --num_train_epochs=10 \
  --output_dir={FINETUNE_OUTPUT_DIR}

## Predict using the finetuned model

Predict labels for the test data using the finetuned model.  

In [None]:
%%time
# This will take many hours in a CPU environment.

!python3 ../src/run_classifier.py \
  --do_predict=true \
  --data_dir={DATA_DIR} \
  --model_file=../model/wiki-ja.model \
  --vocab_file=../model/wiki-ja.vocab \
  --max_seq_length=512 \
  --output_dir={FINETUNE_OUTPUT_DIR}

In [None]:
import sys
sys.path.append("../src")

from run_classifier import LivedoorProcessor

In [None]:
processor = LivedoorProcessor()
label_list = processor.get_labels()

In [None]:
import pandas as pd  # imported here because this cell runs before the later import cell
result = pd.read_csv(FINETUNE_OUTPUT_DIR+"/test_results.tsv", sep='\t', header=None)

In [None]:
result.head()

Read the test data set and add the prediction results.

In [None]:
import pandas as pd

In [None]:
test_df = pd.read_csv("../data/livedoor/test.tsv", sep='\t')

In [None]:
test_df['predict'] = [ label_list[idx] for idx in result.idxmax(axis=1) ]
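
The mapping from score columns to label names can be seen on a toy table. This sketch assumes that each row of `test_results.tsv` holds one score per label and that the columns follow the order of `label_list`; the numbers below are invented for illustration.

```python
import pandas as pd

label_list = ["label1", "label2", "label3"]

# Two fake examples with one score per label (no header, so columns are 0, 1, 2).
toy_result = pd.DataFrame([[0.7, 0.2, 0.1],
                           [0.1, 0.1, 0.8]])

# idxmax(axis=1) returns, for each row, the column label with the largest value.
predicted_idx = toy_result.idxmax(axis=1)
predicted_labels = [label_list[idx] for idx in predicted_idx]
print(predicted_labels)  # ['label1', 'label3']
```

Because the file is read with `header=None`, the column labels are the integers 0, 1, 2, so they can be used directly as indices into `label_list`.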

In [None]:
test_df.head()

In [None]:
sum( test_df['label'] == test_df['predict'] ) / len(test_df)

A little more detailed check using `sklearn.metrics`.

In [None]:
!pip install scikit-learn

In [None]:
from sklearn.metrics import classification_report
from sklearn.metrics import confusion_matrix

In [None]:
print(classification_report(test_df['label'], test_df['predict']))

In [None]:
print(confusion_matrix(test_df['label'], test_df['predict']))
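
To read the matrix, recall that `confusion_matrix` puts true labels on the rows and predicted labels on the columns. The toy example below, with invented labels and predictions, also passes `labels=` explicitly so that the row and column order is known rather than inferred from sorted label values.

```python
from sklearn.metrics import confusion_matrix

labels = ["label1", "label2", "label3"]
y_true = ["label1", "label1", "label2", "label3"]
y_pred = ["label1", "label2", "label2", "label3"]

# Rows are true labels, columns are predicted labels, both in `labels` order.
cm = confusion_matrix(y_true, y_pred, labels=labels)
print(cm)
# The entry cm[0][1] counts label1 examples that were predicted as label2.
```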

### Simple baseline model

In [None]:
import pandas as pd
from sklearn.metrics import classification_report
from sklearn.metrics import confusion_matrix

In [None]:
train_df = pd.read_csv("../data/livedoor/train.tsv", sep='\t')
dev_df = pd.read_csv("../data/livedoor/dev.tsv", sep='\t')
test_df = pd.read_csv("../data/livedoor/test.tsv", sep='\t')

In [None]:
!apt-get install -q -y mecab libmecab-dev mecab-ipadic mecab-ipadic-utf8

In [None]:
!pip install mecab-python3==0.7

In [None]:
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.ensemble import GradientBoostingClassifier
import MeCab

In [None]:
m = MeCab.Tagger("-Owakati")

In [None]:
train_dev_df = pd.concat([train_df, dev_df])

In [None]:
train_dev_xs = train_dev_df['text'].apply(lambda x: m.parse(x))
train_dev_ys = train_dev_df['label']

test_xs = test_df['text'].apply(lambda x: m.parse(x))
test_ys = test_df['label']

In [None]:
vectorizer = TfidfVectorizer(max_features=750)
train_dev_xs_ = vectorizer.fit_transform(train_dev_xs)
test_xs_ = vectorizer.transform(test_xs)
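
One detail worth knowing when feeding space-separated (wakati) text to `TfidfVectorizer`: its default `token_pattern` only keeps tokens of two or more word characters, so single-character Japanese tokens are dropped. The sketch below illustrates this on two invented, pre-tokenized sentences; the alternative pattern `r"(?u)\S+"` is one possible choice, not what this notebook uses.

```python
from sklearn.feature_extraction.text import TfidfVectorizer

# Pre-tokenized sentences, similar in shape to MeCab "-Owakati" output.
wakati_docs = ["私 は 猫 が 好き です", "犬 と 散歩 する"]

# Default settings: tokens with a single character are ignored.
default_vec = TfidfVectorizer()
default_vec.fit(wakati_docs)
print(sorted(default_vec.vocabulary_))

# Treat every whitespace-separated chunk as a token instead.
keep_all_vec = TfidfVectorizer(token_pattern=r"(?u)\S+")
keep_all_vec.fit(wakati_docs)
print(sorted(keep_all_vec.vocabulary_))
```

Whether keeping single-character tokens helps depends on the data; with `max_features=750` the vocabulary is capped by term frequency in either case.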

The following setup is not exactly identical to the BERT setup, because `GradientBoostingClassifier` internally uses `train_test_split` with shuffling to set aside its validation data.  
The parameters are also not well tuned, but we think this is enough to gauge the power of BERT.

In [None]:
%%time

# validation_fraction is a fraction of the data passed to fit (train + dev),
# so divide by len(train_dev_df) to hold out a dev-sized validation set.
model = GradientBoostingClassifier(n_estimators=200,
                                   validation_fraction=len(dev_df)/len(train_dev_df),
                                   n_iter_no_change=5,
                                   tol=0.01,
                                   random_state=23)

### 1/5 of full training data.
# model = GradientBoostingClassifier(n_estimators=200,
#                                    validation_fraction=len(dev_df)/len(train_dev_df),
#                                    n_iter_no_change=5,
#                                    tol=0.01,
#                                    random_state=23)

model.fit(train_dev_xs_, train_dev_ys)
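
The early-stopping behaviour configured above can be observed on synthetic data. In this sketch the data set from `make_classification` and the value `validation_fraction=0.25` are assumptions chosen for illustration. After fitting, the attribute `n_estimators_` reports how many boosting stages were actually kept.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import GradientBoostingClassifier

# Small synthetic 3-class problem standing in for the TF-IDF features.
X, y = make_classification(n_samples=300, n_features=20, n_informative=5,
                           n_classes=3, random_state=0)

clf = GradientBoostingClassifier(n_estimators=200,
                                 validation_fraction=0.25,  # held out internally
                                 n_iter_no_change=5,        # patience in stages
                                 tol=0.01,
                                 random_state=23)
clf.fit(X, y)

# With early stopping this is at most n_estimators, and often much smaller.
print(clf.n_estimators_)
```

Training stops once the validation score has not improved by at least `tol` for `n_iter_no_change` consecutive stages, so `n_estimators=200` acts as an upper bound rather than a fixed count.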

In [None]:
print(classification_report(test_ys, model.predict(test_xs_)))

In [None]:
print(confusion_matrix(test_ys, model.predict(test_xs_)))