# Multi-class classification of Livedoor News

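## Data preparation workflow
1. Download the original files
1. Preprocessing

## 1. Download the original files
* Download ldcc-20140209.tar.gz from the URL below  
https://www.rondhuit.com/download.html
* Extract it under ./original_data/
* Training data provided by: Livedoor News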
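
The extraction step in the list above can be sketched as follows. This is a minimal sketch, not the notebook's own code: `extract_corpus` is a hypothetical helper, and it assumes the archive has already been downloaded into the current directory.

```python
import tarfile
from pathlib import Path


def extract_corpus(archive_path, dest_dir="./original_data"):
    """Extract a .tar.gz archive under dest_dir and return the extracted file paths."""
    dest = Path(dest_dir)
    dest.mkdir(parents=True, exist_ok=True)
    with tarfile.open(archive_path, "r:gz") as tf:
        tf.extractall(dest)
    # List every regular file that now exists under the destination.
    return sorted(p for p in dest.rglob("*") if p.is_file())


archive = Path("ldcc-20140209.tar.gz")
if archive.exists():  # only runs when the archive is actually present
    files = extract_corpus(archive)
    print(len(files), "files extracted")
```

The preprocessing cell later in this notebook reads the archive directly with `tarfile`, so extracting to disk is only needed if you want to inspect the raw articles.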

Mount Google Drive


In [1]:
from google.colab import drive 
drive.mount('/content/drive')

Mounted at /content/drive


Create the livedoor_news directory

In [None]:
"""
!mkdir -p /content/drive/'My Drive'/BERT-LDC/livedoor_news
"""

"\n!mkdir -p /content/drive/'My Drive'/BERT-LDC/livedoor_news\n"

In [3]:
import os
os.chdir('/content/drive/My Drive/BERT-LDC')

Move to the livedoor_news directory

In [2]:
cd /content/drive/'My Drive'/BERT-LDC/livedoor_news

/content/drive/My Drive/BERT-LDC/livedoor_news


Download the livedoor news corpus

In [None]:
"""
import urllib.request

livedoor_news_url = "https://www.rondhuit.com/download/ldcc-20140209.tar.gz"
urllib.request.urlretrieve(livedoor_news_url, "ldcc-20140209.tar.gz")
"""

<br><br>

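## 2. Preprocessing
* Given the nature of the data, the Livedoor News corpus is formatted as one record per article
* When using other data, split it appropriately according to the characteristics of that data

Convert the dataset into the format used for BERT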

In [None]:
import tarfile
import csv
import re


# The nine genre directories of the corpus; the list index is used as the class label.
target_genre = [
    "dokujo-tsushin",
    "it-life-hack",
    "kaden-channel",
    "livedoor-homme",
    "movie-enter",
    "peachy",
    "smax",
    "sports-watch",
    "topic-news",
]

# One list of archive member names per genre.
fname_list = [[] for _ in target_genre]

tsv_fname = "all.tsv"

# A 【...】 tag at the very end or the very start of a title.
brackets_tail = re.compile('【[^】]*】$')
brackets_head = re.compile('^【[^】]*】')

def remove_brackets(inp):
    """Strip a leading and a trailing 【...】 tag from a title."""
    return brackets_head.sub('', brackets_tail.sub('', inp))

def read_title(f):
    """Return the article title: lines 1 and 2 hold the URL and date, line 3 the title."""
    next(f)
    next(f)
    title = next(f)
    title = remove_brackets(title.decode('utf-8'))

    # Drop the trailing newline.
    return title[:-1]

with tarfile.open("ldcc-20140209.tar.gz") as tf:
    # Collect the article files of each genre, skipping the bundled documentation files.
    for ti in tf:
        if any(doc in ti.name for doc in ("LICENSE.txt", "CHANGES.txt", "README.txt")):
            continue
        for i, genre in enumerate(target_genre):
            if genre in ti.name and ti.name.endswith(".txt"):
                fname_list[i].append(ti.name)

    # Write one row per article: genre name, label index, empty column, title.
    with open(tsv_fname, "w", encoding="utf-8", newline="") as wf:
        writer = csv.writer(wf, delimiter='\t')
        for i, fcategory in enumerate(fname_list):
            for name in fcategory:
                f = tf.extractfile(name)
                title = read_title(f)
                row = [target_genre[i], i, '', title]
                writer.writerow(row)
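
The bracket-stripping step can be checked in isolation. The sketch below reuses the two regular expressions from the cell above; the sample titles are made up for illustration and do not come from the corpus.

```python
import re

# Same patterns as in the preprocessing cell.
BRACKETS_HEAD = re.compile(r'^【[^】]*】')
BRACKETS_TAIL = re.compile(r'【[^】]*】$')


def remove_brackets(title):
    """Strip a leading and a trailing 【...】 tag; tags in the middle are kept."""
    return BRACKETS_HEAD.sub('', BRACKETS_TAIL.sub('', title))


# Hypothetical titles: tag at the head, at the tail, and in the middle.
samples = ["【速報】新製品を発表", "新製品を発表【レビュー】", "新製品【特集】を発表"]
cleaned = [remove_brackets(s) for s in samples]
for before, after in zip(samples, cleaned):
    print(before, "->", after)
```

Only tags anchored at the start or end of the title are removed, so a tag in the middle of a title survives unchanged.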

Once you have confirmed that all.tsv has been created in the livedoor_news directory, run the following code to split it into training, development, and test TSV files.

In [None]:
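import random

# Shuffle all rows with a fixed seed so that the result is reproducible.
random.seed(100)
with open("all.tsv", 'r', encoding="utf-8") as f, open("rand-all.tsv", "w", encoding="utf-8") as wf:
    lines = f.readlines()
    random.shuffle(lines)
    wf.writelines(lines)

random.seed(101)

train_fname, dev_fname, test_fname = ["train.tsv", "dev.tsv", "test.tsv"]

with open("rand-all.tsv", encoding="utf-8") as f, \
        open(train_fname, "w", encoding="utf-8") as tf, \
        open(dev_fname, "w", encoding="utf-8") as df, \
        open(test_fname, "w", encoding="utf-8") as ef:
    # Only test.tsv gets a header row; train.tsv and dev.tsv have none.
    ef.write("class\tsentence\n")
    for line in f:
        # Draw 0-9 uniformly: 8 goes to dev, 9 to test, 0-7 to train (roughly 80/10/10).
        v = random.randint(0, 9)
        if v == 8:
            df.write(line)
        elif v == 9:
            ef.write(line)
        else:
            tf.write(line)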
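
Because each row is assigned by an independent random draw, the split sizes are only approximately 80/10/10. The sketch below simulates the same rule on 10,000 hypothetical rows to show the expected proportions; `assign_split` is an illustrative helper, not part of the notebook.

```python
import random
from collections import Counter


def assign_split(rng):
    """Apply the same rule as the split cell: 8 -> dev, 9 -> test, 0-7 -> train."""
    v = rng.randint(0, 9)
    if v == 8:
        return "dev"
    if v == 9:
        return "test"
    return "train"


rng = random.Random(101)  # a local generator, so the global seed is not disturbed
counts = Counter(assign_split(rng) for _ in range(10000))
for name in ("train", "dev", "test"):
    print(name, counts[name], f"{counts[name] / 10000:.1%}")
```

If exact split sizes matter, slicing the shuffled list at fixed indices would be an alternative to per-row sampling.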

<br><br>

# Fine-tuning workflow
1. Install the required modules
1. Clone the BERT repository
1. Modify the programs
1. Fine-tune the pretrained model
1. Predict on the test data
1. Output the prediction results

Install Juman++

In [3]:
!wget https://github.com/ku-nlp/jumanpp/releases/download/v2.0.0-rc2/jumanpp-2.0.0-rc2.tar.xz && \
tar xJvf jumanpp-2.0.0-rc2.tar.xz && \
rm jumanpp-2.0.0-rc2.tar.xz && \
cd jumanpp-2.0.0-rc2/ && \
mkdir bld && \
cd bld && \
cmake .. \
  -DCMAKE_BUILD_TYPE=Release \
  -DCMAKE_INSTALL_PREFIX=/usr/local && \
make && \
sudo make install


In [4]:
!jumanpp -v

Juman++ Version: 2.0.0-rc2 / Dictionary: 20180202-2cca748 / LM: K:20180217-6c28641 L:20180221-fd8a4b63 F:20171214-9d125cb


In [5]:
%tensorflow_version 1.x

TensorFlow 1.x selected.


In [6]:
!pip install pyknp

Collecting pyknp
  Downloading https://files.pythonhosted.org/packages/1d/0e/93221dc85bd214b87b37bdd56af384b252e882fdb91e39c842a2614a8822/pyknp-0.4.5.zip (43kB)
Building wheels for collected packages: pyknp
  Building wheel for pyknp (setup.py) ... done
  Created wheel for pyknp: filename=pyknp-0.4.5-cp36-none-any.whl size=40420 sha256=06ca703e37e12ffd9b98543b36dd7a031d734e0bf6637e9d07e668822d1c1e27
  Stored in directory: /root/.cache/pip/wheels/7d/0c/46/495789d5ca85293c2478f5bd81e1204f77f949645cb35bf382
Successfully built pyknp
Installing collected packages: pyknp
Successfully installed pyknp-0.4.5


Download the pretrained Japanese BERT model

In [None]:
"""
kyoto_u_bert_url = "http://nlp.ist.i.kyoto-u.ac.jp/nl-resource/JapaneseBertPretrainedModel/Japanese_L-12_H-768_A-12_E-30_BPE.zip"
urllib.request.urlretrieve(kyoto_u_bert_url, "Japanese_L-12_H-768_A-12_E-30_BPE.zip")
"""

In [None]:
!unzip Japanese_L-12_H-768_A-12_E-30_BPE.zip

Archive:  Japanese_L-12_H-768_A-12_E-30_BPE.zip
  inflating: Japanese_L-12_H-768_A-12_E-30_BPE/README.txt  
  inflating: Japanese_L-12_H-768_A-12_E-30_BPE/bert_config.json  
  inflating: Japanese_L-12_H-768_A-12_E-30_BPE/bert_model.ckpt.data-00000-of-00001  
  inflating: Japanese_L-12_H-768_A-12_E-30_BPE/bert_model.ckpt.index  
  inflating: Japanese_L-12_H-768_A-12_E-30_BPE/bert_model.ckpt.meta  
  inflating: Japanese_L-12_H-768_A-12_E-30_BPE/pytorch_model.bin  
  inflating: Japanese_L-12_H-768_A-12_E-30_BPE/vocab.txt  


## 2. Clone the BERT repository

In [None]:
!git clone https://github.com/google-research/bert.git

Cloning into 'bert'...
remote: Enumerating objects: 340, done.
remote: Total 340 (delta 0), reused 0 (delta 0), pack-reused 340
Receiving objects: 100% (340/340), 317.85 KiB | 5.89 MiB/s, done.
Resolving deltas: 100% (185/185), done.


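## 3. Modifying the programs
Copy run_classifier.py locally, rename the copy to run_classifier_livedoor.py, and modify its contents.
tokenization.py is edited in place under its original name, because renaming it would also require changing every place that imports it (which is somewhat tedious).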
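
If you prefer to make the copy inside Colab rather than on a local machine, something like the following would work. This is a sketch under the assumption that the repository was cloned into `./bert`; `copy_classifier_script` is a hypothetical helper name.

```python
import shutil
from pathlib import Path


def copy_classifier_script(bert_dir="bert"):
    """Copy run_classifier.py to run_classifier_livedoor.py without overwriting edits."""
    src = Path(bert_dir) / "run_classifier.py"
    dst = Path(bert_dir) / "run_classifier_livedoor.py"
    if not dst.exists():  # keep an already-edited copy untouched
        shutil.copyfile(src, dst)
    return dst


if Path("bert/run_classifier.py").exists():  # only runs after the clone step
    print(copy_classifier_script())
```

The existence check means re-running the cell does not discard changes already made to the copy.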

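## Changes to run_classifier_livedoor.py

##### 1) Add the LivedoorProcessor class
Below the existing ColaProcessor class, create a new class LivedoorProcessor. The class also has to be registered in the `processors` dictionary in `main()` under the key `"livedoor"` so that `--task_name=livedoor` resolves to it.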
<pre>
class LivedoorProcessor(DataProcessor):
  """Processor for the Livedoor News data set."""

  def get_train_examples(self, data_dir):
    """See base class."""
    return self._create_examples(
        self._read_tsv(os.path.join(data_dir, "train.tsv")), "train")

  def get_dev_examples(self, data_dir):
    """See base class."""
    return self._create_examples(
        self._read_tsv(os.path.join(data_dir, "dev.tsv")), "dev")

  def get_test_examples(self, data_dir):
    """See base class."""
    return self._create_examples(
        self._read_tsv(os.path.join(data_dir, "test.tsv")), "test")

  def get_labels(self):
    """See base class."""
    return ["0", "1", "2", "3", "4", "5", "6", "7", "8"]

  def _create_examples(self, lines, set_type):
    """Creates examples for the training, dev and test sets."""
    examples = []
    for (i, line) in enumerate(lines):
      # Only test.tsv is written with a header row; train.tsv and dev.tsv have none.
      if set_type == "test" and i == 0:
        continue
      guid = "%s-%s" % (set_type, i)
      # Rows are [genre, label, '', title]: the title is column 3, the label column 1.
      text_a = tokenization.convert_to_unicode(line[3])
      label = tokenization.convert_to_unicode(line[1])
      examples.append(
          InputExample(guid=guid, text_a=text_a, text_b=None, label=label))
    return examples
</pre>
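
How a TSV row maps to an example can be tried without TensorFlow. The sketch below is a standalone approximation, not the BERT code itself: `Example` and `rows_to_examples` are hypothetical stand-ins for `InputExample` and `_create_examples`, the sample rows are invented, and the row layout `[genre, label, '', title]` is the one written by the preprocessing cell.

```python
import csv
import io
from collections import namedtuple

# Stand-in for BERT's InputExample, reduced to the fields used here.
Example = namedtuple("Example", ["guid", "text_a", "label"])


def rows_to_examples(tsv_text, set_type, has_header):
    """Turn TSV rows of the form [genre, label, '', title] into examples."""
    reader = csv.reader(io.StringIO(tsv_text), delimiter="\t", quoting=csv.QUOTE_NONE)
    examples = []
    for i, row in enumerate(reader):
        if has_header and i == 0:
            continue  # skip the header row
        examples.append(Example(guid=f"{set_type}-{i}", text_a=row[3], label=row[1]))
    return examples


# Hypothetical test file: a header row followed by two articles.
sample = "class\tsentence\nsmax\t6\t\tサンプルのタイトル\npeachy\t5\t\t別のタイトル\n"
examples = rows_to_examples(sample, "test", has_header=True)
for ex in examples:
    print(ex)
```

The label stays a string here because `get_labels()` returns the class ids as strings.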

## Changes to tokenization.py
Because Juman++ is used as the morphological analyzer, add code to tokenization.py that performs morphological analysis with Juman++.

<pre>
class FullTokenizer(object):
  """Runs end-to-end tokenization."""

  def __init__(self, vocab_file, do_lower_case=True):
    self.vocab = load_vocab(vocab_file)
    self.inv_vocab = {v: k for k, v in self.vocab.items()}
    # Changed to use Juman++
    # self.basic_tokenizer = BasicTokenizer(do_lower_case=do_lower_case)
    self.jumanpp_tokenizer = JumanPPTokenizer()
    self.wordpiece_tokenizer = WordpieceTokenizer(vocab=self.vocab)

  def tokenize(self, text):
    split_tokens = []
    # Changed to use Juman++
    # for token in self.basic_tokenizer.tokenize(text):
    for token in self.jumanpp_tokenizer.tokenize(text):
      for sub_token in self.wordpiece_tokenizer.tokenize(token):
        split_tokens.append(sub_token)

    return split_tokens
</pre>

Add the JumanPPTokenizer class
<pre>
class JumanPPTokenizer(BasicTokenizer):
  def __init__(self):
    """
        Builds a tokenizer dedicated to Japanese.
        Uses JUMAN++.
    """
    from pyknp import Juman

    self.do_lower_case = False
    self._jumanpp = Juman()

  def tokenize(self, text):
    """Tokenizes a piece of text."""
    text = convert_to_unicode(text.replace(' ', ''))
    text = self._clean_text(text)

    juman_result = self._jumanpp.analysis(text)
    split_tokens = []
    for mrph in juman_result.mrph_list():
      split_tokens.extend(self._run_split_on_punc(mrph.midasi))

    output_tokens = whitespace_tokenize(" ".join(split_tokens))
    print(split_tokens)
    return output_tokens
</pre>
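
After Juman++ splits the text into words, `FullTokenizer` passes each word to the WordPiece tokenizer. The sketch below illustrates the greedy longest-match-first idea behind that second stage. It is a simplified re-implementation for illustration only, with a tiny hypothetical vocabulary rather than the real vocab.txt.

```python
def wordpiece(token, vocab, unk="[UNK]"):
    """Greedy longest-match-first split of one word into sub-word pieces."""
    pieces = []
    start = 0
    while start < len(token):
        end = len(token)
        piece = None
        # Try the longest remaining substring first, then shrink it.
        while start < end:
            candidate = token[start:end]
            if start > 0:
                candidate = "##" + candidate  # continuation pieces carry a ## prefix
            if candidate in vocab:
                piece = candidate
                break
            end -= 1
        if piece is None:
            return [unk]  # no piece matched: the whole word becomes unknown
        pieces.append(piece)
        start = end
    return pieces


# Hypothetical vocabulary and already-segmented words.
vocab = {"スマート", "##フォン", "写真", "[UNK]"}
words = ["スマートフォン", "写真", "動画"]
tokens = [p for w in words for p in wordpiece(w, vocab)]
print(tokens)
```

Words that cannot be fully covered by vocabulary entries collapse to a single unknown token, which is why the word segmentation has to match the one used when the vocabulary was built.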

## 4. Fine-tuning the pretrained model on the training data

In [7]:
!python bert/run_classifier_livedoor.py \
--task_name=livedoor \
--do_train=true \
--do_eval=true \
--data_dir=./ \
--vocab_file=./Japanese_L-12_H-768_A-12_E-30_BPE/vocab.txt \
--bert_config_file=./Japanese_L-12_H-768_A-12_E-30_BPE/bert_config.json \
--init_checkpoint=./Japanese_L-12_H-768_A-12_E-30_BPE/bert_model.ckpt \
--max_seq_length=128 \
--train_batch_size=32 \
--learning_rate=2e-5 \
--num_train_epochs=3.0 \
--output_dir=./tmp/livedoor_news_output_fine \
--do_lower_case=False

Streaming output truncated to the last 5000 lines.
['ノムさん', 'が', '、', '松井', '秀喜', 'の', '日本', '球界', 'を', '願う', '理由']
['「', '女性', '蔑視', '発言', '」', '謝罪', 'した', '市議', 'を', '擁護', 'する', '声']
['非', '接触', '充電', 'に', 'も', '対応', 'した', '“', '全部', '入り', '”', 'スマートフォン', '！', '「', 'ARROWSXF', '-', '10', 'D', '」', 'を', '写真', 'と', '動画', 'で', 'チェック']
['電話', '？', '\\', ' ', 'メール', '？', '\\', ' ', 'SNS', '？', '\\', ' ', 'ストレス', 'なく', '連絡', 'を', '取り', '合う', 'イマドキ', 'の', '恋人', 'たち']
['中田', '英寿', 'も', '認めた', '天才', 'MF', 'が', '現役', '引退']
['キム', '・', 'ヨナ', 'の', '失速', '、', '韓国', '記者', '「', '引退', 'の', '可能', '性', 'は', '50', '％', '」']
['NTT', 'ドコモ', '、', 'Android', '向け', 'アプリ', '「', 'しゃべって', 'コンシェル', '」', 'を', 'ver', '2', '.', '0', '.', '0', 'に', 'アップデート', '！', '“', '答え', 'そのもの', '”', 'を', '返す', 'ように', '成長']
['「', 'TOKYOSWEETSCOLLECTION', '」', 'で', '注目', 'さ', 'れた', 'スイーツ', 'アプリ']
['「', 'おばさん', '」', 'と', '呼ば', 'れたら', '？', 'その', 'とき', 'あなた', 'は', '？']
['ゆっくり', 'な', 'の', 'が', 'たまに', 'キズ', '？', '\\', ' ', '「', 'USB', 'あった
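
The number of optimizer steps implied by the flags above can be estimated ahead of time. This is a rough sketch assuming the formulas used in the upstream run_classifier.py (steps = examples / batch size × epochs, with 10% of them used for warmup by default); the 5,900 training examples are a hypothetical figure, not the actual size of train.tsv.

```python
def training_schedule(num_examples, batch_size, num_epochs, warmup_proportion=0.1):
    """Return (total training steps, warmup steps) for a fine-tuning run."""
    num_train_steps = int(num_examples / batch_size * num_epochs)
    num_warmup_steps = int(num_train_steps * warmup_proportion)
    return num_train_steps, num_warmup_steps


# Hypothetical training-set size with the batch size and epochs used above.
steps, warmup = training_schedule(num_examples=5900, batch_size=32, num_epochs=3.0)
print("train steps:", steps, "warmup steps:", warmup)
```

To use the real figure, count the rows of train.tsv and pass that number as `num_examples`.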

## 5. Predicting on the test data

In [8]:
!python bert/run_classifier_livedoor.py \
  --task_name=livedoor \
  --do_predict=true \
  --data_dir=./ \
  --vocab_file=./Japanese_L-12_H-768_A-12_E-30_BPE/vocab.txt \
  --bert_config_file=./Japanese_L-12_H-768_A-12_E-30_BPE/bert_config.json \
  --init_checkpoint=./tmp/livedoor_news_output_fine \
  --max_seq_length=128 \
  --output_dir=tmp/livedoor_news_output_predic/




W1006 05:19:07.967417 140709923747712 module_wrapper.py:139] From bert/run_classifier_livedoor.py:829: The name tf.logging.set_verbosity is deprecated. Please use tf.compat.v1.logging.set_verbosity instead.


W1006 05:19:07.967735 140709923747712 module_wrapper.py:139] From bert/run_classifier_livedoor.py:829: The name tf.logging.INFO is deprecated. Please use tf.compat.v1.logging.INFO instead.


W1006 05:19:07.968204 140709923747712 module_wrapper.py:139] From /content/drive/My Drive/BERT-LDC/livedoor_news/bert/modeling.py:93: The name tf.gfile.GFile is deprecated. Please use tf.io.gfile.GFile instead.


W1006 05:19:07.971976 140709923747712 module_wrapper.py:139] From bert/run_classifier_livedoor.py:854: The name tf.gfile.MakeDirs is deprecated. Please use tf.io.gfile.makedirs instead.

The TensorFlow contrib module will not be included in TensorFlow 2.0.
For more information, please see:
  * https://github.com/tensorflow/community/blob/master/rfcs/20180907-contrib-sunset.md
  * h

## 6. Outputting the prediction results

In [9]:
cd /content/drive/'My Drive'/BERT-LDC/livedoor_news

/content/drive/My Drive/BERT-LDC/livedoor_news


In [10]:
import csv
import numpy as np


with open("./test.tsv") as f, open("tmp/livedoor_news_output_predic/test_results.tsv") as rf:
  test = csv.reader(f, delimiter = '\t')
  test_result = csv.reader(rf, delimiter = '\t')

  # Extract the gold labels (skip the header row)
  next(test)
  test_list = [int(row[1]) for row in test ]

  # Extract the predictions: each row holds one probability per class
  result_list = []
  for result in test_result:
    # Convert to float first; the raw strings would be compared lexicographically
    max_index = np.argmax([float(p) for p in result])
    result_list.append(max_index)

  # Write the predicted category numbers
  with open('tmp/livedoor_news_output_predic/test_results.csv', 'w') as of:
    writer = csv.writer(of)
    for row in result_list:
      writer.writerow([row])

  test_count = len(test_list)
  result_correct_answer_list = [result for test, result in zip(test_list, result_list) if test == result]
  result_correct_answer_count = len(result_correct_answer_list)
  print("Accuracy: ", result_correct_answer_count / test_count)

Accuracy:  0.8220338983050848
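
A single accuracy figure hides which genres get confused with each other. The sketch below shows one way to turn probability rows into predicted labels and a confusion matrix using plain Python. The probability rows and gold labels are invented for illustration, and `predict_labels` and `confusion_matrix` are hypothetical helpers.

```python
def predict_labels(prob_rows):
    """Pick the most probable class per row; values arrive as strings from the TSV."""
    preds = []
    for row in prob_rows:
        probs = [float(p) for p in row]  # compare numbers, not strings
        preds.append(probs.index(max(probs)))
    return preds


def confusion_matrix(y_true, y_pred, num_classes):
    """cm[t][p] counts examples with gold label t that were predicted as p."""
    cm = [[0] * num_classes for _ in range(num_classes)]
    for t, p in zip(y_true, y_pred):
        cm[t][p] += 1
    return cm


# Hypothetical 3-class output; note the scientific notation in the second row.
rows = [["0.9", "0.05", "0.05"], ["1.2e-05", "0.99", "0.00999"], ["0.2", "0.3", "0.5"]]
y_true = [0, 1, 1]
y_pred = predict_labels(rows)
cm = confusion_matrix(y_true, y_pred, num_classes=3)
accuracy = sum(cm[i][i] for i in range(3)) / len(y_true)
print(y_pred, cm, accuracy)
```

The second row shows why the float conversion matters: compared as strings, `"1.2e-05"` sorts above `"0.99"`, which would select the wrong class.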
