# Finetuning of the pretrained Japanese BERT model

Finetune the pretrained model to solve multi-class classification problems.  
This notebook requires the following objects:
- a trained SentencePiece model (model and vocab files)
- a pretrained Japanese BERT model

The data is split into test, dev, and train sets at a ratio of 2:2:6.
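The notebook does not show how the split is produced. A minimal sketch of a 2:2:6 split, assuming the examples are held in memory as `(text, label)` pairs; the function name `split_dataset` and the fixed seed are illustrative, not part of the original code:

```python
import random

def split_dataset(rows, seed=0):
    """Shuffle rows and split them into test/dev/train at a 2:2:6 ratio."""
    rows = list(rows)
    random.Random(seed).shuffle(rows)  # fixed seed keeps the split reproducible
    n_test = len(rows) * 2 // 10
    n_dev = len(rows) * 2 // 10
    test = rows[:n_test]
    dev = rows[n_test:n_test + n_dev]
    train = rows[n_test + n_dev:]  # the remainder, roughly 60%
    return test, dev, train

# Hypothetical toy data: 50 examples over three labels.
rows = [("text %d" % i, "label%d" % (i % 3 + 1)) for i in range(50)]
test, dev, train = split_dataset(rows)
print(len(test), len(dev), len(train))  # 10 10 30
```

A plain shuffle does not preserve label proportions; for small or imbalanced datasets a stratified split may be preferable.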

In [None]:
! pip install -r ../requirements.txt

## Data preparation


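Place the following files in `DATA_DIR` beforehand:

- test.tsv
- train.tsv
- dev.tsv

The files are comma-separated, with no quote character.  
(They should really be named `.csv` rather than `.tsv`, but the names have been left as they are.)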
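As a rough illustration of that format, the sketch below writes unquoted comma-separated rows with the standard `csv` module. The header names are assumptions: `label` is the column read later in this notebook, while `text` is a placeholder for whatever the text column is actually called.

```python
import csv
import io

def write_examples(fileobj, rows):
    """Write (text, label) pairs as unquoted comma-separated lines."""
    writer = csv.writer(fileobj, quoting=csv.QUOTE_NONE, lineterminator="\n")
    writer.writerow(["text", "label"])  # "text" is an assumed column name
    for text, label in rows:
        for field in (text, label):
            # Without quoting, these characters would corrupt the row.
            if any(ch in field for ch in ',"\n\r'):
                raise ValueError(f"field cannot be written unquoted: {field!r}")
        writer.writerow([text, label])

# An in-memory buffer stands in for DATA_DIR/train.tsv here.
buffer = io.StringIO()
write_examples(buffer, [("これはテストです", "label1"), ("別の文です", "label2")])
print(buffer.getvalue())
```

Because no quote character is used, any comma inside the text has to be removed or replaced before writing.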

## Finetune pre-trained model

Running the following cells takes many hours in a CPU environment.

In [None]:
PRETRAINED_MODEL_PATH = '../model/model.ckpt-1400000'
FINETUNE_OUTPUT_DIR = '../model/finetune_output'

DATA_DIR='../data'

Pass the labels here through an environment variable, as a comma-separated list.

In [None]:
%env BERT_JAPANESE_LABELS=label1,label2,label3
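How `run_classifier.py` reads this variable is not shown in this notebook. A plausible sketch, assuming the value is simply split on commas; `parse_labels` is a hypothetical helper, not a function from the repository:

```python
import os

def parse_labels(raw):
    """Split a comma-separated label string, ignoring blanks and stray spaces."""
    labels = [part.strip() for part in raw.split(",") if part.strip()]
    if not labels:
        raise ValueError("no labels given")
    return labels

# Falls back to the example value when the variable is not set.
raw_value = os.environ.get("BERT_JAPANESE_LABELS", "label1,label2,label3")
labels = parse_labels(raw_value)
print(labels)
```

The order of the labels matters: each label's position in the list is the class index used to interpret the prediction output.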

In [None]:
%%time
# This takes many hours in a CPU environment.

!python3 ../src/run_classifier.py \
  --do_train=true \
  --do_eval=true \
  --data_dir={DATA_DIR} \
  --model_file=../model/wiki-ja.model \
  --vocab_file=../model/wiki-ja.vocab \
  --init_checkpoint={PRETRAINED_MODEL_PATH} \
  --max_seq_length=512 \
  --train_batch_size=4 \
  --learning_rate=2e-5 \
  --num_train_epochs=10 \
  --output_dir={FINETUNE_OUTPUT_DIR}

## Predict using the finetuned model

Predict labels for the test data using the finetuned model.

In [None]:
%%time
# This takes many hours in a CPU environment.

!python3 ../src/run_classifier.py \
  --do_predict=true \
  --data_dir={DATA_DIR} \
  --model_file=../model/wiki-ja.model \
  --vocab_file=../model/wiki-ja.vocab \
  --max_seq_length=512 \
  --output_dir={FINETUNE_OUTPUT_DIR}

In [None]:
import sys
sys.path.append("../src")

from run_classifier import GeneralProcessor

In [None]:
processor = GeneralProcessor()
label_list = processor.get_labels()

In [None]:
import pandas as pd

In [None]:
result = pd.read_csv(FINETUNE_OUTPUT_DIR+"/test_results.tsv", sep='\t', header=None)

In [None]:
result.head()

Read the test dataset and add the prediction results.

In [None]:
# test.tsv is comma-separated despite its extension, so the default separator applies.
test_df = pd.read_csv(DATA_DIR + "/test.tsv")

In [None]:
test_df['predict'] = [ label_list[idx] for idx in result.idxmax(axis=1) ]
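The cell above relies on `idxmax(axis=1)` returning, for each row, the column with the highest score. The same mapping in plain Python, as a sketch that assumes `test_results.tsv` holds one row per test example and one score column per label, in the order of `label_list`; the scores below are made up for illustration:

```python
def predict_labels(probabilities, label_list):
    """Map each row of class scores to the label with the highest score."""
    predictions = []
    for row in probabilities:
        if len(row) != len(label_list):
            raise ValueError("each row needs one score per label")
        best = max(range(len(row)), key=row.__getitem__)  # index of the largest score
        predictions.append(label_list[best])
    return predictions

# Hypothetical scores for three examples and three labels.
label_list = ["label1", "label2", "label3"]
probabilities = [
    [0.1, 0.7, 0.2],
    [0.8, 0.1, 0.1],
    [0.2, 0.3, 0.5],
]
print(predict_labels(probabilities, label_list))  # ['label2', 'label1', 'label3']
```

If the label order passed at prediction time differs from the one used during training, the mapping silently yields the wrong labels.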

In [None]:
test_df.head()

In [None]:
# Accuracy: the fraction of test rows whose predicted label matches the true label.
(test_df['label'] == test_df['predict']).mean()

A slightly more detailed check using `sklearn.metrics`.

In [None]:
!pip install scikit-learn

In [None]:
from sklearn.metrics import classification_report
from sklearn.metrics import confusion_matrix

In [None]:
print(classification_report(test_df['label'], test_df['predict']))

In [None]:
print(confusion_matrix(test_df['label'], test_df['predict']))
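To help read the printed matrix: in scikit-learn's convention, rows are the true labels and columns are the predicted labels, both in sorted label order. A small plain-Python sketch of the same counting, using made-up labels rather than real results:

```python
from collections import Counter

def confusion_counts(y_true, y_pred):
    """Count (true, predicted) pairs; rows are true labels, columns are predictions."""
    labels = sorted(set(y_true) | set(y_pred))
    counts = Counter(zip(y_true, y_pred))
    matrix = [[counts[(t, p)] for p in labels] for t in labels]
    return labels, matrix

# Hypothetical labels for four examples.
y_true = ["label1", "label2", "label2", "label3"]
y_pred = ["label1", "label2", "label3", "label3"]
labels, matrix = confusion_counts(y_true, y_pred)
for label, row in zip(labels, matrix):
    print(label, row)

# Accuracy is the diagonal (correct predictions) divided by the total count.
accuracy = sum(matrix[i][i] for i in range(len(labels))) / len(y_true)
print(accuracy)  # 0.75
```

Off-diagonal entries show which pairs of classes the model confuses, which the single accuracy figure hides.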