# Finetuning of the pretrained Japanese BERT model

Finetune the pretrained model to solve a multi-class classification problem.  
This notebook requires the following objects:
- trained SentencePiece model (model and vocab files)
- pretrained Japanese BERT model

The dataset is the livedoor news corpus (livedoor ニュースコーパス), available at https://www.rondhuit.com/download.html.  
We split it into test, dev, and train sets at a ratio of 2:2:6.

Results:

- Full training data
  - BERT with SentencePiece
    ```
                    precision    recall  f1-score   support

    dokujo-tsushin       0.98      0.94      0.96       178
      it-life-hack       0.96      0.97      0.96       172
     kaden-channel       0.99      0.98      0.99       176
    livedoor-homme       0.98      0.88      0.93        95
       movie-enter       0.96      0.99      0.98       158
            peachy       0.94      0.98      0.96       174
              smax       0.98      0.99      0.99       167
      sports-watch       0.98      1.00      0.99       190
        topic-news       0.99      0.98      0.98       163

         micro avg       0.97      0.97      0.97      1473
         macro avg       0.97      0.97      0.97      1473
      weighted avg       0.97      0.97      0.97      1473
    ```
  - sklearn GradientBoostingClassifier with MeCab
    ```
                    precision    recall  f1-score   support

    dokujo-tsushin       0.89      0.86      0.88       178
      it-life-hack       0.91      0.90      0.91       172
     kaden-channel       0.90      0.94      0.92       176
    livedoor-homme       0.79      0.74      0.76        95
       movie-enter       0.93      0.96      0.95       158
            peachy       0.87      0.92      0.89       174
              smax       0.99      1.00      1.00       167
      sports-watch       0.93      0.98      0.96       190
        topic-news       0.96      0.86      0.91       163

         micro avg       0.92      0.92      0.92      1473
         macro avg       0.91      0.91      0.91      1473
      weighted avg       0.92      0.92      0.91      1473
    ```

- Small training data (1/5 of full training data)
  - BERT with SentencePiece
    ```
                    precision    recall  f1-score   support

    dokujo-tsushin       0.97      0.87      0.92       178
      it-life-hack       0.86      0.86      0.86       172
     kaden-channel       0.95      0.94      0.95       176
    livedoor-homme       0.82      0.82      0.82        95
       movie-enter       0.97      0.99      0.98       158
            peachy       0.89      0.95      0.92       174
              smax       0.94      0.96      0.95       167
      sports-watch       0.97      0.97      0.97       190
        topic-news       0.94      0.94      0.94       163

         micro avg       0.93      0.93      0.93      1473
         macro avg       0.92      0.92      0.92      1473
      weighted avg       0.93      0.93      0.93      1473
    ```
  - sklearn GradientBoostingClassifier with MeCab
    ```
                    precision    recall  f1-score   support

    dokujo-tsushin       0.82      0.71      0.76       178
      it-life-hack       0.86      0.88      0.87       172
     kaden-channel       0.91      0.87      0.89       176
    livedoor-homme       0.67      0.63      0.65        95
       movie-enter       0.87      0.95      0.91       158
            peachy       0.70      0.78      0.73       174
              smax       1.00      1.00      1.00       167
      sports-watch       0.87      0.95      0.91       190
        topic-news       0.92      0.82      0.87       163

         micro avg       0.85      0.85      0.85      1473
         macro avg       0.85      0.84      0.84      1473
      weighted avg       0.86      0.85      0.85      1473
    ```

In [2]:
import configparser
import glob
import os
import pandas as pd
import subprocess
import sys
import tarfile 
from urllib.request import urlretrieve

CURDIR = os.getcwd()
CONFIGPATH = os.path.join(CURDIR, os.pardir, 'config.ini')
config = configparser.ConfigParser()
config.read(CONFIGPATH)

['/home/ubuntu/work/bert-japanese/notebook/../config.ini']

## Data preparing

You need to execute the following cells only once.

In [3]:
FILEURL = config['FINETUNING-DATA']['FILEURL']
FILEPATH = config['FINETUNING-DATA']['FILEPATH']
EXTRACTDIR = config['FINETUNING-DATA']['TEXTDIR']

Download and extract the data (livedoor corpus).

In [4]:
%%time

urlretrieve(FILEURL, FILEPATH)

# Open the gzip-compressed tarball and extract it; the context manager closes it.
with tarfile.open(FILEPATH, "r:gz") as tar:
    tar.extractall(EXTRACTDIR)

CPU times: user 1.16 s, sys: 388 ms, total: 1.55 s
Wall time: 5.09 s


Data preprocessing.

In [5]:
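def extract_txt(filename):
    # The corpus files are assumed to be UTF-8 encoded.
    with open(filename, encoding='utf-8') as text_file:
        # Skip the first two lines (0: URL, 1: timestamp); keep title and body.
        text = text_file.readlines()[2:]
        text = [sentence.strip() for sentence in text]
        # Drop empty lines and join the rest without a separator.
        text = [line for line in text if line != '']
        return ''.join(text)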
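
As a quick illustration of what `extract_txt` produces, here is a minimal sketch that runs the same logic on a made-up file. The sample URL, timestamp, and text are hypothetical and are not taken from the corpus. Note that the title and body end up concatenated without a separator, which is also visible in the `df.head()` output further down.

```python
import os
import tempfile


def extract_txt(filename):
    # Same logic as the notebook cell: skip URL and timestamp, join the rest.
    with open(filename, encoding="utf-8") as text_file:
        lines = text_file.readlines()[2:]
    lines = [line.strip() for line in lines]
    return "".join(line for line in lines if line != "")


# Hypothetical article file: URL, timestamp, title, blank line, body.
sample = (
    "http://example.com/article/1/\n"
    "2012-01-01T00:00:00+0900\n"
    "Sample title\n"
    "\n"
    "First paragraph.\n"
    "Second paragraph.\n"
)

with tempfile.TemporaryDirectory() as tmpdir:
    path = os.path.join(tmpdir, "sample.txt")
    with open(path, "w", encoding="utf-8") as f:
        f.write(sample)
    extracted = extract_txt(path)

print(extracted)
```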

In [6]:
categories = [ 
    name for name 
    in os.listdir( os.path.join(EXTRACTDIR, "text") ) 
    if os.path.isdir( os.path.join(EXTRACTDIR, "text", name) ) ]

categories = sorted(categories)

In [7]:
categories

['dokujo-tsushin',
 'it-life-hack',
 'kaden-channel',
 'livedoor-homme',
 'movie-enter',
 'peachy',
 'smax',
 'sports-watch',
 'topic-news']

In [8]:
table = str.maketrans({
    '\n': '',
    '\t': '　',
    '\r': '',
})
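
The translation table matters because the data is later written as TSV: a stray tab or newline inside an article would break the column layout. The sketch below shows the effect on a hypothetical string; it assumes the replacement character for tabs is the full-width space (U+3000), which is what the cell above appears to use.

```python
# Same mapping as the notebook cell, with the tab replacement written
# explicitly as U+3000 (assumed to be the character used above).
table = str.maketrans({
    "\n": "",
    "\t": "\u3000",
    "\r": "",
})

raw = "col1\tcol2\r\nnext"  # hypothetical text containing tab and CRLF
cleaned = raw.translate(table)

print(repr(cleaned))
```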

In [9]:
%%time

all_text = []
all_label = []

for cat in categories:
    files = glob.glob(os.path.join(EXTRACTDIR, "text", cat, "{}*.txt".format(cat)))
    files = sorted(files)
    body = [ extract_txt(elem).translate(table) for elem in files ]
    label = [cat] * len(body)
    
    all_text.extend(body)
    all_label.extend(label)

CPU times: user 1.16 s, sys: 100 ms, total: 1.26 s
Wall time: 1.26 s


In [10]:
df = pd.DataFrame({'text' : all_text, 'label' : all_label})

In [11]:
df.head()

Unnamed: 0,label,text
0,dokujo-tsushin,友人代表のスピーチ、独女はどうこなしている？もうすぐジューン・ブライドと呼ばれる６月。独女の...
1,dokujo-tsushin,ネットで断ち切れない元カレとの縁携帯電話が普及する以前、恋人への連絡ツールは一般電話が普通だ...
2,dokujo-tsushin,相次ぐ芸能人の“すっぴん”披露　その時、独女の心境は？「男性はやっぱり、女性の“すっぴん”が...
3,dokujo-tsushin,ムダな抵抗！？ 加齢の現実ヒップの加齢による変化は「たわむ→下がる→内に流れる」、バストは「...
4,dokujo-tsushin,税金を払うのは私たちなんですけど！6月から支給される子ども手当だが、当初は子ども一人当たり月...


In [12]:
df = df.sample(frac=1, random_state=23).reset_index(drop=True)

In [13]:
df.head()

Unnamed: 0,label,text
0,sports-watch,新記録でロンドンに乗り込む“バタフライの女王”加藤ゆか3日に行われた競泳の日本選手権で、女子...
1,kaden-channel,家電チャンネルの記事も配信！向かうところ敵なしのスマホアプリ「ITニュース by lived...
2,peachy,彼にあげたい韓国メンズコスメ、韓流俳優のような美肌男へ！年末の大イベント、クリスマスまであと...
3,livedoor-homme,快適なスマホライフのための必須アプリ「マトリックス レボリューションズ」(c)Warner ...
4,dokujo-tsushin,独女と上司の気になる関係人事異動の多い春は、職場の人間関係の悩みも増える時期。『an・an』...


Save the data as TSV files.  
The test:dev:train ratio is 2:2:6. To check how well finetuning works with less data, we also prepare a sampled training set (1/5 of the full training data).

In [14]:
df[:len(df) // 5].to_csv( os.path.join(EXTRACTDIR, "test.tsv"), sep='\t', index=False)
df[len(df) // 5:len(df)*2 // 5].to_csv( os.path.join(EXTRACTDIR, "dev.tsv"), sep='\t', index=False)
df[len(df)*2 // 5:].to_csv( os.path.join(EXTRACTDIR, "train.tsv"), sep='\t', index=False)

### 1/5 of full training data.
# df[:len(df) // 5].to_csv( os.path.join(EXTRACTDIR, "test.tsv"), sep='\t', index=False)
# df[len(df) // 5:len(df)*2 // 5].to_csv( os.path.join(EXTRACTDIR, "dev.tsv"), sep='\t', index=False)
# df[len(df)*2 // 5:].sample(frac=0.2, random_state=23).to_csv( os.path.join(EXTRACTDIR, "train.tsv"), sep='\t', index=False)
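
The slice boundaries above can be checked with a few lines of arithmetic. This is a sketch, assuming the corpus contains 7367 articles in total (the sum of the 1473 test, 1473 dev, and 4421 train examples reported in the logs below) and assuming `sample(frac=0.2)` rounds to the nearest row count.

```python
def split_sizes(n_docs):
    """Return (test, dev, train) sizes for the integer-division slices."""
    test_end = n_docs // 5
    dev_end = n_docs * 2 // 5
    return test_end, dev_end - test_end, n_docs - dev_end


n_docs = 7367  # assumption: 1473 + 1473 + 4421, inferred from the logs
sizes = split_sizes(n_docs)

# Approximate size of the 1/5 training sample (assumed rounding behaviour).
small_train = round(sizes[2] * 0.2)

print(sizes, small_train)
```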

## Finetune pre-trained model

Executing the following cells takes many hours in a CPU environment.  
You can also use Colab to take advantage of a TPU. In that case, you need to upload the created data to your GCS bucket.

[![Open In Colab](https://colab.research.google.com/assets/colab-badge.svg)](https://colab.research.google.com/drive/1zZH2GWe0U-7GjJ2w2duodFfEUptvHjcx)

In [15]:
PRETRAINED_MODEL_PATH = '../model/model.ckpt-1400000'
FINETUNE_OUTPUT_DIR = '../model/livedoor_output'

In [20]:
%%time
# This takes many hours in a CPU environment.

!python3 ../src/run_classifier.py \
  --task_name=livedoor \
  --do_train=true \
  --do_eval=true \
  --data_dir=../../data/livedoor \
  --model_file=../model/wiki-ja.model \
  --vocab_file=../model/wiki-ja.vocab \
  --init_checkpoint={PRETRAINED_MODEL_PATH} \
  --max_seq_length=512 \
  --train_batch_size=4 \
  --learning_rate=2e-5 \
  --num_train_epochs=10 \
  --output_dir={FINETUNE_OUTPUT_DIR}

Loaded a trained SentencePiece model.
INFO:tensorflow:Using config: {'_model_dir': '../model/livedoor_output', '_tf_random_seed': None, '_save_summary_steps': 100, '_save_checkpoints_steps': 1000, '_save_checkpoints_secs': None, '_session_config': allow_soft_placement: true
graph_options {
  rewrite_options {
    meta_optimizer_iterations: ONE
  }
}
, '_keep_checkpoint_max': 5, '_keep_checkpoint_every_n_hours': 10000, '_log_step_count_steps': None, '_train_distribute': None, '_device_fn': None, '_protocol': None, '_eval_distribute': None, '_experimental_distribute': None, '_service': None, '_cluster_spec': <tensorflow.python.training.server_lib.ClusterSpec object at 0x7f54c6d0c208>, '_task_type': 'worker', '_task_id': 0, '_global_id_in_cluster': 0, '_master': '', '_evaluation_master': '', '_is_chief': True, '_num_ps_replicas': 0, '_num_worker_replicas': 1, '_tpu_config': TPUConfig(iterations_per_loop=1000, num_shards=8, num_cores_per_replica=None, per_host_input_for_training=3, tpu_job

INFO:tensorflow:***** Running training *****
INFO:tensorflow:  Num examples = 4421
INFO:tensorflow:  Batch size = 4
INFO:tensorflow:  Num steps = 11052
Instructions for updating:
Use `tf.data.experimental.map_and_batch(...)`.
INFO:tensorflow:Calling model_fn.
INFO:tensorflow:Running train on CPU
INFO:tensorflow:*** Features ***
INFO:tensorflow:  name = input_ids, shape = (4, 512)
INFO:tensorflow:  name = input_mask, shape = (4, 512)
INFO:tensorflow:  name = is_real_example, shape = (4,)
INFO:tensorflow:  name = label_ids, shape = (4,)
INFO:tensorflow:  name = segment_ids, shape = (4, 512)
INFO:tensorflow:**** Trainable Variables ****
INFO:tensorflow:  name = bert/embeddings/word_embeddings:0, shape = (32000, 768), *INIT_FROM_CKPT*
INFO:tensorflow:  name = bert/embeddings/token_type_embeddings:0, shape = (2, 768), *INIT_FROM_CKPT*
INFO:tensorflow:  name = bert/embeddings/position_embeddings:0, shape = (512, 768), *INIT_FROM_CKPT*
INFO:tensorflow:  name = bert/embeddings/LayerNorm/beta:0

INFO:tensorflow:Done calling model_fn.
INFO:tensorflow:Create CheckpointSaverHook.
INFO:tensorflow:Graph was finalized.
INFO:tensorflow:Running local_init_op.
INFO:tensorflow:Done running local_init_op.
INFO:tensorflow:Saving checkpoints for 0 into ../model/livedoor_output/model.ckpt.
INFO:tensorflow:global_step/sec: 4.37389
INFO:tensorflow:examples/sec: 17.4956
INFO:tensorflow:global_step/sec: 5.28612
INFO:tensorflow:examples/sec: 21.1445
INFO:tensorflow:global_step/sec: 5.28354
INFO:tensorflow:examples/sec: 21.1341
INFO:tensorflow:global_step/sec: 5.28177
INFO:tensorflow:examples/sec: 21.1271
INFO:tensorflow:global_step/sec: 5.28457
INFO:tensorflow:examples/sec: 21.1383
INFO:tensorflow:global_step/sec: 5.28013
INFO:tensorflow:examples/sec: 21.1205
INFO:tensorflow:global_step/sec: 5.27622
INFO:tensorflow:examples/sec: 21.1049
INFO:tensorflow:global_step/sec: 5.27639
INFO:tensorflow:examples/sec: 21.1056
INFO:tensorflow:global_step/sec: 5.27085
INFO:tensorflow:examples/sec: 21.0834
INF

INFO:tensorflow:global_step/sec: 5.26763
INFO:tensorflow:examples/sec: 21.0705
INFO:tensorflow:global_step/sec: 5.27066
INFO:tensorflow:examples/sec: 21.0826
INFO:tensorflow:global_step/sec: 5.27102
INFO:tensorflow:examples/sec: 21.0841
INFO:tensorflow:global_step/sec: 5.26973
INFO:tensorflow:examples/sec: 21.0789
INFO:tensorflow:global_step/sec: 5.26876
INFO:tensorflow:examples/sec: 21.075
INFO:tensorflow:global_step/sec: 5.2707
INFO:tensorflow:examples/sec: 21.0828
INFO:tensorflow:global_step/sec: 5.27032
INFO:tensorflow:examples/sec: 21.0813
INFO:tensorflow:global_step/sec: 5.27113
INFO:tensorflow:examples/sec: 21.0845
INFO:tensorflow:Saving checkpoints for 10000 into ../model/livedoor_output/model.ckpt.
INFO:tensorflow:global_step/sec: 4.6704
INFO:tensorflow:examples/sec: 18.6816
INFO:tensorflow:global_step/sec: 5.26988
INFO:tensorflow:examples/sec: 21.0795
INFO:tensorflow:global_step/sec: 5.26776
INFO:tensorflow:examples/sec: 21.071
INFO:tensorflow:global_step/sec: 5.27115
INFO:te

INFO:tensorflow:guid: dev-4
INFO:tensorflow:tokens: [CLS] ▁ スマート で 美しい “ サイ フ 美人 ” は ロンドン っ 娘 ! ▁その 理由は カード にあり ! あい かわ らず 厳しい 残 暑 が 続いて います が 、 もう まもなく 8 月 も 終わり 。 さ て 、 みな さん 夏休み は何 を され ました か ? 海外 旅行 、 夏の セール と 時間 も お金 も い ろん な 使い 道 があったこと だと は 思い ます 。 とくに お金 と なれば 、 多くの 人が 夏の ボーナス を 目 論 んで 夏休み を 楽 し んだ か と思います 。 思 わず 使い すぎ てしまった と なれば 、 一応 精算 して おか なければ 。 女 た るもの 、 お 財 布 の管理 くらい しっかり で きて ないと ...... え 、 あれ ? ▁ なん か 想像 以上に 現金 がない 。 あれ ー 、 何 に 使った っ け かな ぁ ...... 。 いく ら 気 をつけて いて も 、 現金 での 支出 管理 は なかなか 難しい もの 。 財 布 に入っている と つい つい 使 ってしまう なら 財 布 に お金 を 持 た なければ いい ! ▁ と思い たい が 、 フィナンシャル プラン ナー の 山口 京 子 氏は こう 話す 。 「 実は “ 最小限 の 現金 だけ 入れ て お けば 、 お金 が 貯 まる ” というのは 大きな 誤解 です 。 慣れ て しま えば 、 計画 性が 薄れ てくる 可能性 が あります 。 atm で 下ろす 金額 と 回数 を決め 、 手 持ち の 現金 の 適正 額 を設定し 、『 つか う お金 』、『 貯 める お金 』、『 一 定額 支払う お金 』 を き っち り 分け て 、 用途 に あう 決済 方法 を選ぶ ことが 家 計 管理 を スマート 化する 近 道 となり ます 」 とのこと 。 では 、 そんな 支出 管理 の プロ である 山口 氏 お ス ス メ の 方法は 、 デ ビット カード 。 利用 した 時点で 即時 に 自分の 銀行 口座 から 引き 落とされ るため 、 現金 と同じ 感覚 で 支払い

INFO:tensorflow:***** Running evaluation *****
INFO:tensorflow:  Num examples = 1473 (1473 actual, 0 padding)
INFO:tensorflow:  Batch size = 8
INFO:tensorflow:Calling model_fn.
INFO:tensorflow:Running eval on CPU
INFO:tensorflow:*** Features ***
INFO:tensorflow:  name = input_ids, shape = (?, 512)
INFO:tensorflow:  name = input_mask, shape = (?, 512)
INFO:tensorflow:  name = is_real_example, shape = (?,)
INFO:tensorflow:  name = label_ids, shape = (?,)
INFO:tensorflow:  name = segment_ids, shape = (?, 512)
INFO:tensorflow:**** Trainable Variables ****
INFO:tensorflow:  name = bert/embeddings/word_embeddings:0, shape = (32000, 768), *INIT_FROM_CKPT*
INFO:tensorflow:  name = bert/embeddings/token_type_embeddings:0, shape = (2, 768), *INIT_FROM_CKPT*
INFO:tensorflow:  name = bert/embeddings/position_embeddings:0, shape = (512, 768), *INIT_FROM_CKPT*
INFO:tensorflow:  name = bert/embeddings/LayerNorm/beta:0, shape = (768,), *INIT_FROM_CKPT*
INFO:tensorflow:  name = bert/embeddings/LayerNor

INFO:tensorflow:Done calling model_fn.
INFO:tensorflow:Starting evaluation at 2019-03-05-06:10:28
INFO:tensorflow:Graph was finalized.
INFO:tensorflow:Restoring parameters from ../model/livedoor_output/model.ckpt-11052
INFO:tensorflow:Running local_init_op.
INFO:tensorflow:Done running local_init_op.
INFO:tensorflow:Finished evaluation at 2019-03-05-06:10:48
INFO:tensorflow:Saving dict for global step 11052: eval_accuracy = 0.9789545, eval_loss = 0.1859285, global_step = 11052, loss = 0.18504924
INFO:tensorflow:Saving 'checkpoint_path' summary for global step 11052: ../model/livedoor_output/model.ckpt-11052
INFO:tensorflow:evaluation_loop marked as finished
INFO:tensorflow:***** Eval results *****
INFO:tensorflow:  eval_accuracy = 0.9789545
INFO:tensorflow:  eval_loss = 0.1859285
INFO:tensorflow:  global_step = 11052
INFO:tensorflow:  loss = 0.18504924
CPU times: user 25.3 s, sys: 6.66 s, total: 31.9 s
Wall time: 37min 4s
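
The `Num steps = 11052` line in the log follows from the training set size, batch size, and epoch count. The sketch below assumes `../src/run_classifier.py` computes the step count the same way as the reference BERT `run_classifier.py`; that is an assumption about this repository's script, not something shown in the notebook.

```python
num_train_examples = 4421  # "Num examples" from the training log
train_batch_size = 4       # --train_batch_size
num_train_epochs = 10      # --num_train_epochs

# Assumed formula (as in the reference BERT run_classifier.py):
# truncate examples / batch_size * epochs to an integer.
num_train_steps = int(num_train_examples / train_batch_size * num_train_epochs)

print(num_train_steps)
```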


## Predict using the finetuned model

Let's predict the labels of the test data using the finetuned model.  

In [16]:
import sys
sys.path.append("../src")

import tokenization_sentencepiece as tokenization
from run_classifier import LivedoorProcessor
from run_classifier import model_fn_builder
from run_classifier import file_based_input_fn_builder
from run_classifier import file_based_convert_examples_to_features
from utils import str_to_value

In [17]:
sys.path.append("../bert")

import modeling
import optimization
import tensorflow as tf

In [18]:
import configparser
import json
import glob
import os
import pandas as pd
import tempfile

bert_config_file = tempfile.NamedTemporaryFile(mode='w+t', encoding='utf-8', suffix='.json')
bert_config_file.write(json.dumps({k:str_to_value(v) for k,v in config['BERT-CONFIG'].items()}))
bert_config_file.seek(0)
bert_config = modeling.BertConfig.from_json_file(bert_config_file.name)

In [19]:
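output_ckpts = glob.glob("{}/model.ckpt*data*".format(FINETUNE_OUTPUT_DIR))
# Pick the checkpoint with the largest global step. A plain string sort
# would rank 'model.ckpt-9000' after 'model.ckpt-11052'.
latest_ckpt = max(output_ckpts, key=lambda p: int(p.split('model.ckpt-')[1].split('.')[0]))
FINETUNED_MODEL_PATH = latest_ckpt.split('.data-00000-of-00001')[0]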

In [20]:
class FLAGS(object):
    '''Parameters.'''
    def __init__(self):
        self.model_file = "../model/wiki-ja.model"
        self.vocab_file = "../model/wiki-ja.vocab"
        self.do_lower_case = True
        self.use_tpu = False
        self.output_dir = "/dummy"
        self.data_dir = "../../data/livedoor"
        self.max_seq_length = 512
        self.init_checkpoint = FINETUNED_MODEL_PATH
        self.predict_batch_size = 4
        
        # The following parameters are not used in predictions.
        # They are only needed to build RunConfig and the estimator.
        self.master = None
        self.save_checkpoints_steps = 1
        self.iterations_per_loop = 1
        self.num_tpu_cores = 1
        self.learning_rate = 0
        self.num_warmup_steps = 0
        self.num_train_steps = 0
        self.train_batch_size = 0
        self.eval_batch_size = 0

In [21]:
FLAGS = FLAGS()

In [22]:
processor = LivedoorProcessor()
label_list = processor.get_labels()

In [23]:
tokenizer = tokenization.FullTokenizer(
    model_file=FLAGS.model_file, vocab_file=FLAGS.vocab_file,
    do_lower_case=FLAGS.do_lower_case)

tpu_cluster_resolver = None

is_per_host = tf.contrib.tpu.InputPipelineConfig.PER_HOST_V2

run_config = tf.contrib.tpu.RunConfig(
    cluster=tpu_cluster_resolver,
    master=FLAGS.master,
    model_dir=FLAGS.output_dir,
    save_checkpoints_steps=FLAGS.save_checkpoints_steps,
    tpu_config=tf.contrib.tpu.TPUConfig(
        iterations_per_loop=FLAGS.iterations_per_loop,
        num_shards=FLAGS.num_tpu_cores,
        per_host_input_for_training=is_per_host))

Loaded a trained SentencePiece model.


In [24]:
model_fn = model_fn_builder(
    bert_config=bert_config,
    num_labels=len(label_list),
    init_checkpoint=FLAGS.init_checkpoint,
    learning_rate=FLAGS.learning_rate,
    num_train_steps=FLAGS.num_train_steps,
    num_warmup_steps=FLAGS.num_warmup_steps,
    use_tpu=FLAGS.use_tpu,
    use_one_hot_embeddings=FLAGS.use_tpu)


estimator = tf.contrib.tpu.TPUEstimator(
    use_tpu=FLAGS.use_tpu,
    model_fn=model_fn,
    config=run_config,
    train_batch_size=FLAGS.train_batch_size,
    eval_batch_size=FLAGS.eval_batch_size,
    predict_batch_size=FLAGS.predict_batch_size)

INFO:tensorflow:Using config: {'_model_dir': '/dummy', '_tf_random_seed': None, '_save_summary_steps': 100, '_save_checkpoints_steps': 1, '_save_checkpoints_secs': None, '_session_config': allow_soft_placement: true
graph_options {
  rewrite_options {
    meta_optimizer_iterations: ONE
  }
}
, '_keep_checkpoint_max': 5, '_keep_checkpoint_every_n_hours': 10000, '_log_step_count_steps': None, '_train_distribute': None, '_device_fn': None, '_protocol': None, '_eval_distribute': None, '_experimental_distribute': None, '_service': None, '_cluster_spec': <tensorflow.python.training.server_lib.ClusterSpec object at 0x7fb74e6d9cf8>, '_task_type': 'worker', '_task_id': 0, '_global_id_in_cluster': 0, '_master': '', '_evaluation_master': '', '_is_chief': True, '_num_ps_replicas': 0, '_num_worker_replicas': 1, '_tpu_config': TPUConfig(iterations_per_loop=1, num_shards=1, num_cores_per_replica=None, per_host_input_for_training=3, tpu_job_name=None, initial_infeed_sleep_secs=None, input_partition_di

In [25]:
predict_examples = processor.get_test_examples(FLAGS.data_dir)
predict_file = tempfile.NamedTemporaryFile(mode='w+t', encoding='utf-8', suffix='.tf_record')

file_based_convert_examples_to_features(predict_examples, label_list,
                                        FLAGS.max_seq_length, tokenizer,
                                        predict_file.name)

predict_drop_remainder = True if FLAGS.use_tpu else False

predict_input_fn = file_based_input_fn_builder(
    input_file=predict_file.name,
    seq_length=FLAGS.max_seq_length,
    is_training=False,
    drop_remainder=predict_drop_remainder)

INFO:tensorflow:Writing example 0 of 1473
INFO:tensorflow:*** Example ***
INFO:tensorflow:guid: test-1
INFO:tensorflow:tokens: [CLS] ▁ 新記録 で ロンドン に乗り 込む “ バタ フライ の 女王 ” 加藤 ゆ か 3 日に行われた 競泳 の 日本選手権 で 、 女子 100 メートル バタ フライ の 加藤 ゆ か ( 25 歳 ) が 2 大会連続 の 五輪 出場 を決めた 。 57 秒 77 と 、 自身が 持つ 日本 新 記録を更新 して の 五輪 切符 ゲット だ 。 前回 大会の 北京 五輪 選考 では ガ チ ガ チ に 緊張 していたという 加藤 は 、 幾 多 の経験 を得て 、 強い 精神 力を 培 った 。 記録 更新 での 五輪 出場権 獲得 に 、 爽 やかな 笑顔 で 喜び を 爆発 させた 。 百 花 繚 乱 の 日本女子 競泳 界 の中でも 、 加藤 は 美女 スイ マー の 筆頭 に数えられる 。 目 鼻 立ち の 整 った 顔 に 、 白く 透 き 通 るような 肌 、 そして 鍛え あげられ た アスリート の 肉体 に 、 ネット 上で の人気 も 上 々 だ 。 前回 の 北京 五輪 では 予選 敗退 で 悔 し 涙 を 呑 んだ 加藤 。 4 年間の 濃 密 な 時間 を経て 、 速く 、 より 美しく なった 「 バタ フライ の 女王 」 が 、 ロンドンの プール を 沸 かす 。 ・ 加藤 ゆ か ▁ フォト [SEP]
INFO:tensorflow:input_ids: 4 9 19861 19 1418 4641 2161 1314 16012 5699 10 4685 809 3995 2234 95 31 4533 30140 10 22440 19 7 712 431 1346 16012 5699 10 3995 2234 95 15 228 559 14 12 25 20703 10 7060 840 6210 8 2487 1053 3168 20 7 3310 7506 99 123 20736 55 10 7060 24989 26563 314 8 9481

INFO:tensorflow:segment_ids: 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0

INFO:tensorflow:input_mask: 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 
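
The `input_ids`, `input_mask`, and `segment_ids` rows in the log above are all padded to `max_seq_length`. The following is a minimal sketch of how such features are typically built for a single-sentence BERT input. It is not the code in `run_classifier.py`; the `[CLS]` id of 4 is read off the log, while the `[SEP]` id of 5 and the padding id of 0 are assumptions for illustration.

```python
def build_features(token_ids, max_seq_length, cls_id, sep_id, pad_id=0):
    """Truncate, add [CLS]/[SEP], then pad ids, mask and segments."""
    # Reserve two positions for [CLS] and [SEP].
    token_ids = token_ids[: max_seq_length - 2]
    input_ids = [cls_id] + token_ids + [sep_id]
    input_mask = [1] * len(input_ids)   # 1 marks real tokens
    segment_ids = [0] * len(input_ids)  # single sentence: all segment 0

    padding = max_seq_length - len(input_ids)
    input_ids += [pad_id] * padding
    input_mask += [0] * padding         # 0 marks padding positions
    segment_ids += [0] * padding
    return input_ids, input_mask, segment_ids


# First few token ids from the log; sep_id=5 is a hypothetical value.
ids, mask, segments = build_features(
    [9, 19861, 19, 1418], max_seq_length=16, cls_id=4, sep_id=5)

print(ids)
print(mask)
```

The mask is what lets the model ignore the padded tail, which is why it switches from 1 to 0 partway through the logged example.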

In [26]:
result = estimator.predict(input_fn=predict_input_fn)

In [27]:
%%time
# This takes a few hours in a CPU environment.

result = list(result)

INFO:tensorflow:Could not find trained model in model_dir: /dummy, running initialization to predict.
Instructions for updating:
Use `tf.data.experimental.map_and_batch(...)`.
INFO:tensorflow:Calling model_fn.
INFO:tensorflow:Running infer on CPU
INFO:tensorflow:*** Features ***
INFO:tensorflow:  name = input_ids, shape = (?, 512)
INFO:tensorflow:  name = input_mask, shape = (?, 512)
INFO:tensorflow:  name = is_real_example, shape = (?,)
INFO:tensorflow:  name = label_ids, shape = (?,)
INFO:tensorflow:  name = segment_ids, shape = (?, 512)
INFO:tensorflow:**** Trainable Variables ****
INFO:tensorflow:  name = bert/embeddings/word_embeddings:0, shape = (32000, 768), *INIT_FROM_CKPT*
INFO:tensorflow:  name = bert/embeddings/token_type_embeddings:0, shape = (2, 768), *INIT_FROM_CKPT*
INFO:tensorflow:  name = bert/embeddings/position_embeddings:0, shape = (512, 768), *INIT_FROM_CKPT*
INFO:tensorflow:  name = bert/embeddings/LayerNorm/beta:0, shape = (768,), *INIT_FROM_CKPT*
INFO:tensorflow

INFO:tensorflow:  name = bert/encoder/layer_4/attention/self/query/kernel:0, shape = (768, 768), *INIT_FROM_CKPT*
INFO:tensorflow:  name = bert/encoder/layer_4/attention/self/query/bias:0, shape = (768,), *INIT_FROM_CKPT*
INFO:tensorflow:  name = bert/encoder/layer_4/attention/self/key/kernel:0, shape = (768, 768), *INIT_FROM_CKPT*
INFO:tensorflow:  name = bert/encoder/layer_4/attention/self/key/bias:0, shape = (768,), *INIT_FROM_CKPT*
INFO:tensorflow:  name = bert/encoder/layer_4/attention/self/value/kernel:0, shape = (768, 768), *INIT_FROM_CKPT*
INFO:tensorflow:  name = bert/encoder/layer_4/attention/self/value/bias:0, shape = (768,), *INIT_FROM_CKPT*
INFO:tensorflow:  name = bert/encoder/layer_4/attention/output/dense/kernel:0, shape = (768, 768), *INIT_FROM_CKPT*
INFO:tensorflow:  name = bert/encoder/layer_4/attention/output/dense/bias:0, shape = (768,), *INIT_FROM_CKPT*
INFO:tensorflow:  name = bert/encoder/layer_4/attention/output/LayerNorm/beta:0, shape = (768,), *INIT_FROM_CKPT

INFO:tensorflow:  name = bert/encoder/layer_8/intermediate/dense/bias:0, shape = (3072,), *INIT_FROM_CKPT*
INFO:tensorflow:  name = bert/encoder/layer_8/output/dense/kernel:0, shape = (3072, 768), *INIT_FROM_CKPT*
INFO:tensorflow:  name = bert/encoder/layer_8/output/dense/bias:0, shape = (768,), *INIT_FROM_CKPT*
INFO:tensorflow:  name = bert/encoder/layer_8/output/LayerNorm/beta:0, shape = (768,), *INIT_FROM_CKPT*
INFO:tensorflow:  name = bert/encoder/layer_8/output/LayerNorm/gamma:0, shape = (768,), *INIT_FROM_CKPT*
INFO:tensorflow:  name = bert/encoder/layer_9/attention/self/query/kernel:0, shape = (768, 768), *INIT_FROM_CKPT*
INFO:tensorflow:  name = bert/encoder/layer_9/attention/self/query/bias:0, shape = (768,), *INIT_FROM_CKPT*
INFO:tensorflow:  name = bert/encoder/layer_9/attention/self/key/kernel:0, shape = (768, 768), *INIT_FROM_CKPT*
INFO:tensorflow:  name = bert/encoder/layer_9/attention/self/key/bias:0, shape = (768,), *INIT_FROM_CKPT*
INFO:tensorflow:  name = bert/encoder

In [28]:
result[:2]

[{'probabilities': array([3.2171338e-06, 3.1370519e-06, 3.0396418e-06, 3.5496435e-06,
         3.7992804e-06, 3.6834181e-06, 2.7165158e-06, 9.9996829e-01,
         8.4550366e-06], dtype=float32)},
 {'probabilities': array([3.9066827e-06, 1.2234125e-05, 9.9995792e-01, 3.8348676e-06,
         6.2248005e-06, 2.7139786e-06, 4.8813145e-06, 4.0486084e-06,
         4.3308410e-06], dtype=float32)}]

Read the test data set and add the prediction results.

In [29]:
import pandas as pd

In [30]:
test_df = pd.read_csv("../../data/livedoor/test.tsv", sep='\t')

In [31]:
test_df['predict'] = [ label_list[elem['probabilities'].argmax()] for elem in result ]
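
To make the mapping from probabilities to labels concrete, here is a small sketch using the two probability vectors printed above. It assumes `label_list` holds the nine categories in sorted order, which is consistent with the predictions shown for the first two test rows; the `argmax` helper is a plain-Python stand-in for NumPy's `argmax`.

```python
# Assumed label order: the sorted category names from the data preparation.
label_list = [
    "dokujo-tsushin", "it-life-hack", "kaden-channel", "livedoor-homme",
    "movie-enter", "peachy", "smax", "sports-watch", "topic-news",
]

# The two probability vectors shown in the result[:2] output.
probabilities = [
    [3.2171338e-06, 3.1370519e-06, 3.0396418e-06, 3.5496435e-06,
     3.7992804e-06, 3.6834181e-06, 2.7165158e-06, 9.9996829e-01,
     8.4550366e-06],
    [3.9066827e-06, 1.2234125e-05, 9.9995792e-01, 3.8348676e-06,
     6.2248005e-06, 2.7139786e-06, 4.8813145e-06, 4.0486084e-06,
     4.3308410e-06],
]


def argmax(values):
    """Index of the largest value."""
    return max(range(len(values)), key=values.__getitem__)


predicted = [label_list[argmax(p)] for p in probabilities]
print(predicted)
```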

In [32]:
test_df.head()

Unnamed: 0,label,text,predict
0,sports-watch,新記録でロンドンに乗り込む“バタフライの女王”加藤ゆか3日に行われた競泳の日本選手権で、女子...,sports-watch
1,kaden-channel,家電チャンネルの記事も配信！向かうところ敵なしのスマホアプリ「ITニュース by lived...,kaden-channel
2,peachy,彼にあげたい韓国メンズコスメ、韓流俳優のような美肌男へ！年末の大イベント、クリスマスまであと...,peachy
3,livedoor-homme,快適なスマホライフのための必須アプリ「マトリックス レボリューションズ」(c)Warner ...,kaden-channel
4,dokujo-tsushin,独女と上司の気になる関係人事異動の多い春は、職場の人間関係の悩みも増える時期。『an・an』...,dokujo-tsushin


In [33]:
sum( test_df['label'] == test_df['predict'] ) / len(test_df)

0.9646978954514596

A little more detailed check using `sklearn.metrics`.

In [34]:
!pip install -U scikit-learn

Requirement already up-to-date: scikit-learn in /home/ubuntu/anaconda3/envs/tensorflow_p36/lib/python3.6/site-packages (0.20.3)
Requirement not upgraded as not directly required: scipy>=0.13.3 in /home/ubuntu/anaconda3/envs/tensorflow_p36/lib/python3.6/site-packages (from scikit-learn) (1.1.0)
Requirement not upgraded as not directly required: numpy>=1.8.2 in /home/ubuntu/anaconda3/envs/tensorflow_p36/lib/python3.6/site-packages (from scikit-learn) (1.14.5)
You are using pip version 10.0.1, however version 19.0.3 is available.
You should consider upgrading via the 'pip install --upgrade pip' command.


In [35]:
from sklearn.metrics import classification_report
from sklearn.metrics import confusion_matrix

In [36]:
print(classification_report(test_df['label'], test_df['predict']))

                precision    recall  f1-score   support

dokujo-tsushin       0.99      0.91      0.95       178
  it-life-hack       0.94      0.97      0.95       172
 kaden-channel       0.96      0.98      0.97       176
livedoor-homme       0.97      0.88      0.92        95
   movie-enter       0.97      0.97      0.97       158
        peachy       0.93      0.98      0.96       174
          smax       0.97      0.99      0.98       167
  sports-watch       0.98      1.00      0.99       190
    topic-news       0.98      0.96      0.97       163

     micro avg       0.96      0.96      0.96      1473
     macro avg       0.97      0.96      0.96      1473
  weighted avg       0.97      0.96      0.96      1473



In [37]:
print(confusion_matrix(test_df['label'], test_df['predict']))

[[162   1   2   3   2   8   0   0   0]
 [  0 167   1   0   0   1   2   0   1]
 [  0   4 172   0   0   0   0   0   0]
 [  0   3   3  84   3   2   0   0   0]
 [  0   1   0   0 154   1   0   0   2]
 [  2   0   0   0   0 170   2   0   0]
 [  0   2   0   0   0   0 165   0   0]
 [  0   0   0   0   0   0   0 190   0]
 [  0   0   1   0   0   0   1   4 157]]
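
The accuracy and the per-class scores reported above can be recovered directly from this confusion matrix (rows are true labels, columns are predictions). The sketch below uses plain Python and copies the matrix from the output; the label order is assumed to be the sorted category names.

```python
labels = [
    "dokujo-tsushin", "it-life-hack", "kaden-channel", "livedoor-homme",
    "movie-enter", "peachy", "smax", "sports-watch", "topic-news",
]

# Confusion matrix copied from the output above.
cm = [
    [162, 1, 2, 3, 2, 8, 0, 0, 0],
    [0, 167, 1, 0, 0, 1, 2, 0, 1],
    [0, 4, 172, 0, 0, 0, 0, 0, 0],
    [0, 3, 3, 84, 3, 2, 0, 0, 0],
    [0, 1, 0, 0, 154, 1, 0, 0, 2],
    [2, 0, 0, 0, 0, 170, 2, 0, 0],
    [0, 2, 0, 0, 0, 0, 165, 0, 0],
    [0, 0, 0, 0, 0, 0, 0, 190, 0],
    [0, 0, 1, 0, 0, 0, 1, 4, 157],
]

# Accuracy: correctly classified (diagonal) over all examples.
correct = sum(cm[i][i] for i in range(len(cm)))
total = sum(sum(row) for row in cm)
accuracy = correct / total

# Recall: diagonal over row sum. Precision: diagonal over column sum.
recall = {lab: cm[i][i] / sum(cm[i]) for i, lab in enumerate(labels)}
precision = {
    lab: cm[i][i] / sum(row[i] for row in cm) for i, lab in enumerate(labels)
}

print(accuracy)
print(round(recall["livedoor-homme"], 2), round(precision["livedoor-homme"], 2))
```

Reading the `livedoor-homme` row shows where its lower recall comes from: its errors are spread across `it-life-hack`, `kaden-channel`, `movie-enter`, and `peachy`.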


### Simple baseline model.

In [38]:
import pandas as pd
from sklearn.metrics import classification_report
from sklearn.metrics import confusion_matrix

In [39]:
train_df = pd.read_csv("../../data/livedoor/train.tsv", sep='\t')
dev_df = pd.read_csv("../../data/livedoor/dev.tsv", sep='\t')
test_df = pd.read_csv("../../data/livedoor/test.tsv", sep='\t')

In [40]:
!sudo apt-get install -q -y mecab libmecab-dev mecab-ipadic mecab-ipadic-utf8

Reading package lists...
Building dependency tree...
Reading state information...
libmecab-dev is already the newest version (0.996-1.2ubuntu1).
mecab is already the newest version (0.996-1.2ubuntu1).
mecab-ipadic is already the newest version (2.7.0-20070801+main-1).
mecab-ipadic-utf8 is already the newest version (2.7.0-20070801+main-1).
The following packages were automatically installed and are no longer required:
  libaio1 librados2 librbd1 librdmacm1
Use 'sudo apt autoremove' to remove them.
0 upgraded, 0 newly installed, 0 to remove and 14 not upgraded.


In [41]:
!pip install mecab-python3==0.7

You are using pip version 10.0.1, however version 19.0.3 is available.
You should consider upgrading via the 'pip install --upgrade pip' command.


In [42]:
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.ensemble import GradientBoostingClassifier
import MeCab

In [43]:
m = MeCab.Tagger("-Owakati")

In [44]:
train_dev_df = pd.concat([train_df, dev_df])

In [45]:
train_dev_xs = train_dev_df['text'].apply(lambda x: m.parse(x))
train_dev_ys = train_dev_df['label']

test_xs = test_df['text'].apply(lambda x: m.parse(x))
test_ys = test_df['label']

In [46]:
vectorizer = TfidfVectorizer(max_features=750)
train_dev_xs_ = vectorizer.fit_transform(train_dev_xs)
test_xs_ = vectorizer.transform(test_xs)
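
One detail worth knowing about this baseline: `TfidfVectorizer` is used with its default `token_pattern`, `r"(?u)\b\w\w+\b"`, which only keeps tokens of two or more word characters. For space-separated Japanese text this silently drops single-character tokens such as particles. The sketch below applies that pattern with `re` to a hand-segmented sentence; the sentence is a hypothetical example and not actual MeCab output.

```python
import re

# Default token_pattern of scikit-learn's TfidfVectorizer.
TOKEN_PATTERN = r"(?u)\b\w\w+\b"

# Hypothetical space-separated (wakati-style) sentence.
wakati = "私 は 東京 に 行き ます"

# The vectorizer also lowercases by default; that is a no-op for this text.
tokens = re.findall(TOKEN_PATTERN, wakati.lower())

print(tokens)
```

For a rough baseline this is acceptable, but it means the feature set differs from the full MeCab token stream.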

The following setup is not exactly identical to that of BERT, because `GradientBoostingClassifier` internally uses `train_test_split` with shuffling to hold out its validation data.  
In addition, the parameters are not well tuned; however, we think this is enough to check the power of BERT.
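
To see why the setups differ, consider what a `validation_fraction` of `len(dev_df)/len(train_df)` means for the full training data. This is a sketch using the split sizes from the logs (4421 train, 1473 dev); the held-out count is only approximate, because the exact rounding is up to scikit-learn.

```python
n_train, n_dev = 4421, 1473  # split sizes reported in the logs
n_fit = n_train + n_dev      # rows passed to model.fit (train + dev)

# Fraction of the fitted rows that the classifier holds out for early stopping.
validation_fraction = n_dev / n_train

# Share of the fitted rows that the actual dev set represents.
dev_share = n_dev / n_fit

approx_held_out = validation_fraction * n_fit

print(round(validation_fraction, 3), round(dev_share, 3), round(approx_held_out))
```

So the classifier holds out roughly a third of the concatenated rows, chosen at random, rather than exactly the dev set that BERT was evaluated on during training.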

In [48]:
%%time

### 1/5 of full training data.
### (Here train_df is smaller than dev_df, so this ratio stays below 1.)
# model = GradientBoostingClassifier(n_estimators=200,
#                                   validation_fraction=len(train_df)/len(dev_df),
#                                   n_iter_no_change=5,
#                                   tol=0.01,
#                                   random_state=23)

### Full training data.
### validation_fraction must lie strictly between 0 and 1.
model = GradientBoostingClassifier(n_estimators=200,
                                    validation_fraction=len(dev_df)/len(train_df),
                                    n_iter_no_change=5,
                                    tol=0.01,
                                    random_state=23)

model.fit(train_dev_xs_, train_dev_ys)

CPU times: user 2min 56s, sys: 0 ns, total: 2min 56s
Wall time: 2min 56s


In [49]:
print(classification_report(test_ys, model.predict(test_xs_)))

                precision    recall  f1-score   support

dokujo-tsushin       0.89      0.86      0.88       178
  it-life-hack       0.91      0.90      0.91       172
 kaden-channel       0.90      0.94      0.92       176
livedoor-homme       0.79      0.74      0.76        95
   movie-enter       0.93      0.96      0.95       158
        peachy       0.87      0.92      0.89       174
          smax       0.99      1.00      1.00       167
  sports-watch       0.93      0.98      0.96       190
    topic-news       0.96      0.86      0.91       163

     micro avg       0.92      0.92      0.92      1473
     macro avg       0.91      0.91      0.91      1473
  weighted avg       0.92      0.92      0.91      1473



In [50]:
print(confusion_matrix(test_ys, model.predict(test_xs_)))

[[153   4   1   4   2  13   0   1   0]
 [  3 155   6   3   0   1   0   2   2]
 [  0   5 165   0   2   0   1   1   2]
 [  3   4   6  70   4   6   0   1   1]
 [  1   0   1   3 152   1   0   0   0]
 [  7   1   1   2   3 160   0   0   0]
 [  0   0   0   0   0   0 167   0   0]
 [  1   0   0   2   0   0   0 186   1]
 [  3   1   3   5   0   3   0   8 140]]
