<a href="https://colab.research.google.com/github/inuikous/rep/blob/main/BERT_tutorial.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# <u>Classifying News Article Titles into 9 Categories</u>
Dataset: [livedoor news corpus](http://www.rondhuit.com/download.html#ldcc)

In [1]:
!apt install aptitude swig
!aptitude install mecab libmecab-dev mecab-ipadic-utf8 git make curl xz-utils file -y
# As reported in the thread below, the tokenizer crashes unless mecab-python3 is pinned to 0.996.5
# https://stackoverflow.com/questions/62860717/huggingface-for-japanese-tokenizer
!pip install mecab-python3==0.996.5
!pip install unidic-lite # without this, MeCab fails with an error at runtime
!pip install transformers[torch]
!pip install datasets
!pip install fugashi
!pip install ipadic


## Download and extract the dataset

In [2]:
!wget https://www.rondhuit.com/download/ldcc-20140209.tar.gz
!tar xf ldcc-20140209.tar.gz

--2023-11-17 10:59:25--  https://www.rondhuit.com/download/ldcc-20140209.tar.gz
Resolving www.rondhuit.com (www.rondhuit.com)... 59.106.19.174
Connecting to www.rondhuit.com (www.rondhuit.com)|59.106.19.174|:443... connected.
HTTP request sent, awaiting response... 200 OK
Length: 8855190 (8.4M) [application/x-gzip]
Saving to: ‘ldcc-20140209.tar.gz’


2023-11-17 10:59:34 (1.19 MB/s) - ‘ldcc-20140209.tar.gz’ saved [8855190/8855190]



## Load and format the data

In [3]:
import os
from glob import glob
import pandas as pd
import linecache

# Collect the category names (one subdirectory of "text" per category)
categories = [name for name in os.listdir("text") if os.path.isdir(os.path.join("text", name))]
print(categories)

# Read each file's title (its third line) together with its category.
# Rows are gathered in a list and converted once, instead of calling pd.concat per file.
rows = []
for cat in categories:
    for text_name in glob(os.path.join("text", cat, "*.txt")):
        title = linecache.getline(text_name, 3).rstrip('\n')
        rows.append([title, cat])
datasets = pd.DataFrame(rows, columns=["title", "category"])

datasets.head()

['movie-enter', 'livedoor-homme', 'topic-news', 'kaden-channel', 'smax', 'sports-watch', 'dokujo-tsushin', 'it-life-hack', 'peachy']


| | title | category |
|---|---|---|
| 0 | 【プレゼント】1番悪いヤツは誰なのか、北野武監督最新作『アウトレイジ ビヨンド』試写会にご招待 | movie-enter |
| 1 | 『ミッション：8ミニッツ』繰り返される“8分間”の悪夢に隠された真実とは | movie-enter |
| 2 | 『バイオハザード』を超えた、『三銃士』がドイツで初登場1位を獲得 | movie-enter |
| 3 | 最愛の友との別れは破滅への序曲となる！ 『ベルセルク』パート2の予告映像が解禁 | movie-enter |
| 4 | 【DVDエンター！】三浦春馬を取り巻く3人の美女、暖かい公園のように身近で大切な存在 | movie-enter |
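
The glob pattern above matches every `.txt` file in a category directory. If the extracted archive also keeps non-article files there (the sketch assumes a per-category `LICENSE.txt`, which is an assumption about the archive layout, not something shown above), those files would be read as articles too. A minimal sketch of a loader that skips such files, exercised on a throwaway directory with made-up file names and contents:

```python
import os
import tempfile
from glob import glob

def load_titles(root, skip_names=("LICENSE.txt",)):
    """Return (title, category) pairs; the title is taken from line 3 of each file."""
    rows = []
    for cat in sorted(os.listdir(root)):
        cat_dir = os.path.join(root, cat)
        if not os.path.isdir(cat_dir):
            continue
        for path in sorted(glob(os.path.join(cat_dir, "*.txt"))):
            if os.path.basename(path) in skip_names:
                continue  # not an article
            with open(path, encoding="utf-8") as f:
                lines = f.read().splitlines()
            title = lines[2] if len(lines) >= 3 else ""
            rows.append((title, cat))
    return rows

# Hypothetical miniature corpus, only to exercise the function
with tempfile.TemporaryDirectory() as root:
    os.makedirs(os.path.join(root, "demo-cat"))
    with open(os.path.join(root, "demo-cat", "demo-001.txt"), "w", encoding="utf-8") as f:
        f.write("http://example.com/1\n2020-01-01T00:00:00+0900\nDemo title\nBody text\n")
    with open(os.path.join(root, "demo-cat", "LICENSE.txt"), "w", encoding="utf-8") as f:
        f.write("license text\n")
    demo_rows = load_titles(root)

print(demo_rows)
```

In the notebook, `load_titles("text")` could feed `pd.DataFrame(rows, columns=["title", "category"])`; note that dropping files changes the row counts reported further down.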


In [4]:
"""
Convert the category column (string) into category_id (integer)
"""

# Get the list of categories from the dataset.
# Note: the order of a set is not stable across sessions, so the IDs assigned below can differ per run.
categories = list(set(datasets['category']))
print(categories)

# Build the ID <-> category dictionaries
id2cat = dict(enumerate(categories))
cat2id = {cat: i for i, cat in enumerate(categories)}
print(id2cat)
print(cat2id)

# Add a category ID column to the DataFrame
datasets['category_id'] = datasets['category'].map(cat2id)

# Shuffle the rows, just in case
datasets = datasets.sample(frac=1).reset_index(drop=True)

# Keep only the title and category ID columns
datasets = datasets[['title', 'category_id']]

datasets.columns = ['text', 'label']
datasets.head()

['it-life-hack', 'kaden-channel', 'smax', 'livedoor-homme', 'peachy', 'movie-enter', 'dokujo-tsushin', 'topic-news', 'sports-watch']
{0: 'it-life-hack', 1: 'kaden-channel', 2: 'smax', 3: 'livedoor-homme', 4: 'peachy', 5: 'movie-enter', 6: 'dokujo-tsushin', 7: 'topic-news', 8: 'sports-watch'}
{'it-life-hack': 0, 'kaden-channel': 1, 'smax': 2, 'livedoor-homme': 3, 'peachy': 4, 'movie-enter': 5, 'dokujo-tsushin': 6, 'topic-news': 7, 'sports-watch': 8}


| | text | label |
|---|---|---|
| 0 | ペナルティ・ワッキー、W杯で遠藤が決めたフリーキックの裏話明かす | 8 |
| 1 | 大人カワイく省エネ実践！パナソニック「エネループ」がディズニーとコラボ【売れ筋チェック】 | 1 |
| 2 | JOYが4本腕の怪物姿でエスコート、『ジョン・カーター』ジャパンプレミア開催 | 5 |
| 3 | ソフトバンク、HONEY BEE 101Kに緊急速報メール対応のためのソフトウェア更新を提供開始 | 2 |
| 4 | 「突撃！隣の晩ごはん」のヨネスケが、華やか企業の「OL昼ごはん」に再び突撃！ | 6 |
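
Because `list(set(...))` does not guarantee an order, the category-to-ID mapping can change between sessions, which matters when a saved model is reloaded later. A minimal sketch of a reproducible alternative, assuming alphabetical order is acceptable. This is a swapped-in variant, not what the cell above does, and `sample_categories` is illustrative data:

```python
# Category names as they might appear in the 'category' column (illustrative sample)
sample_categories = ["smax", "peachy", "movie-enter", "peachy", "smax", "topic-news"]

# Sorting the unique names fixes the order, so the same name always gets the same ID
unique_sorted = sorted(set(sample_categories))
cat2id_stable = {cat: i for i, cat in enumerate(unique_sorted)}
id2cat_stable = {i: cat for cat, i in cat2id_stable.items()}

labels = [cat2id_stable[c] for c in sample_categories]
print(cat2id_stable)
print(labels)
```

Switching the notebook to this mapping would renumber the labels, so the IDs in the outputs shown here would no longer line up.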


In [5]:
"""
Convert the DataFrame into a data structure that Hugging Face's transformers library can work with,
then split it into training data and test data
"""
from datasets import Dataset

dataset_packed = Dataset.from_pandas(datasets)
dataset_split = dataset_packed.train_test_split(test_size=0.2, seed=0)
print(dataset_split)

DatasetDict({
    train: Dataset({
        features: ['text', 'label'],
        num_rows: 5900
    })
    test: Dataset({
        features: ['text', 'label'],
        num_rows: 1476
    })
})
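
The row counts above are consistent with scikit-learn-style rounding, where the test size is rounded up and the training size is rounded down. That rounding rule is an assumption about the library's internals; the sketch below only checks the arithmetic against the printed numbers:

```python
import math

n_rows = 5900 + 1476      # total rows reported in the DatasetDict above
test_size = 0.2

n_test = math.ceil(test_size * n_rows)           # assumed: test split is rounded up
n_train = math.floor((1 - test_size) * n_rows)   # assumed: train split is rounded down

print(n_rows, n_train, n_test)
```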


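## Tokenization

Tokenization proceeds in the following steps:
1. Morphological analysis (this model uses MeCab for word segmentation, followed by WordPiece subword splitting)
2. Add `[CLS]` at the start of the sentence and `[SEP]` at the end
3. Replace each token with its vocabulary ID, turning the string data into numeric data

Note: when using a pretrained model, always use the same tokenizer that was used during pretraining.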
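
Steps 2 and 3, plus the WordPiece half of step 1, can be sketched in plain Python. The vocabulary, its IDs, and the pre-segmented word list (standing in for MeCab's output) are all made up for illustration; the real tokenizer's vocabulary and IDs differ.

```python
def wordpiece(word, vocab, unk="[UNK]"):
    """Greedy longest-match-first subword split of a single word."""
    pieces, start = [], 0
    while start < len(word):
        end = len(word)
        piece = None
        while start < end:
            candidate = word[start:end]
            if start > 0:
                candidate = "##" + candidate  # continuation pieces carry a ## prefix
            if candidate in vocab:
                piece = candidate
                break
            end -= 1
        if piece is None:
            return [unk]  # no piece matched: the whole word becomes unknown
        pieces.append(piece)
        start = end
    return pieces

# Toy vocabulary (hypothetical): the ID of a token is its position in this list
toy_vocab = ["[PAD]", "[UNK]", "[CLS]", "[SEP]", "ワ", "##ッキー", "裏", "##話", "の"]
vocab = {tok: i for i, tok in enumerate(toy_vocab)}

# Step 1 stand-in: pretend the morphological analyzer produced these words
words = ["ワッキー", "の", "裏話"]

# Step 1 (subwords) and step 2 (special tokens)
tokens = ["[CLS]"] + [p for w in words for p in wordpiece(w, vocab)] + ["[SEP]"]
# Step 3: tokens to IDs
ids = [vocab[t] for t in tokens]

print(tokens)
print(ids)
```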

In [6]:
from transformers import AutoTokenizer

# Load the same tokenizer that was used for pretraining
tokenizer = AutoTokenizer.from_pretrained('cl-tohoku/bert-base-japanese-v2')

# Try tokenizing a single title
text = list(datasets['text'])[0]
wakati_ids = tokenizer.encode(text, return_tensors='pt')
print("Original text:", text)
print("Tokens       :", tokenizer.convert_ids_to_tokens(wakati_ids[0].tolist()))
print("Token IDs    :", wakati_ids)

Original text: ペナルティ・ワッキー、W杯で遠藤が決めたフリーキックの裏話明かす
Tokens       : ['[CLS]', 'ペナルティ', '・', 'ワ', '##ッキー', '、', 'W', '杯', 'で', '遠藤', 'が', '決め', 'た', 'フリー', 'キック', 'の', '裏', '##話', '明', '##かす', '[SEP]']
Token IDs    : tensor([[    2, 26382,  1025,  1017, 13151,   828,    69,  2854,   889, 18359,
           862, 12540,   881, 11981, 15137,   896,  4808,  6334,  2736, 15960,
             3]])


In [7]:
"""
Tokenize the entire dataset
"""

def preprocess_function(examples):
    MAX_LENGTH = 512
    return tokenizer(examples["text"], max_length=MAX_LENGTH, truncation=True)

tokenized_dataset = dataset_split.map(preprocess_function, batched=True)


## Training

In [8]:
"""
Prepare the DataCollator and the pretrained BERT model
"""

from transformers import DataCollatorWithPadding
from transformers import AutoModelForSequenceClassification, TrainingArguments, Trainer

# Pads each batch to the length of its longest sequence
data_collator = DataCollatorWithPadding(tokenizer=tokenizer)
# One output per category (9 in this dataset)
model = AutoModelForSequenceClassification.from_pretrained("cl-tohoku/bert-base-japanese-v2", num_labels=len(categories))


Some weights of BertForSequenceClassification were not initialized from the model checkpoint at cl-tohoku/bert-base-japanese-v2 and are newly initialized: ['classifier.weight', 'classifier.bias']
You should probably TRAIN this model on a down-stream task to be able to use it for predictions and inference.
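
`DataCollatorWithPadding` pads each batch to the length of its longest sequence rather than to one fixed length. A rough pure-Python sketch of that idea follows; `pad_batch` is a hypothetical helper, and the pad ID of 0 and the sample IDs are assumptions for illustration, not values read from the tokenizer:

```python
def pad_batch(batch, pad_id=0):
    """Pad every sequence in the batch to the length of the longest one."""
    max_len = max(len(ids) for ids in batch)
    input_ids = [ids + [pad_id] * (max_len - len(ids)) for ids in batch]
    # 1 marks a real token, 0 marks padding that attention should ignore
    attention_mask = [[1] * len(ids) + [0] * (max_len - len(ids)) for ids in batch]
    return {"input_ids": input_ids, "attention_mask": attention_mask}

# Three made-up token ID sequences of different lengths
batch = [[2, 11, 12, 3], [2, 21, 3], [2, 31, 32, 33, 34, 3]]
padded = pad_batch(batch)
for row, mask in zip(padded["input_ids"], padded["attention_mask"]):
    print(row, mask)
```

Padding per batch keeps short titles from being stretched to the 512-token maximum, which is why the tokenization step only truncates and leaves padding to the collator.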


In [9]:
"""
Prepare the evaluation function
"""
from sklearn.metrics import accuracy_score, f1_score

def compute_metrics(pred):
    labels = pred.label_ids
    # The predicted class is the index of the largest logit
    preds = pred.predictions.argmax(-1)
    f1 = f1_score(labels, preds, average='weighted')
    acc = accuracy_score(labels, preds)
    return {'accuracy': acc, 'f1': f1}
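
To make the two metrics concrete, here is a small pure-Python sketch of the same computation: argmax over logits, accuracy, and an F1 score averaged with each class weighted by its number of true samples. The logits and labels are made up, and `weighted_f1` is a hypothetical stand-in for `f1_score(..., average='weighted')`, not scikit-learn's implementation:

```python
from collections import Counter

def weighted_f1(labels, preds):
    """Per-class F1, averaged with each class weighted by its number of true samples."""
    support = Counter(labels)
    total = 0.0
    for c, n_c in support.items():
        tp = sum(1 for y, p in zip(labels, preds) if y == c and p == c)
        fp = sum(1 for y, p in zip(labels, preds) if y != c and p == c)
        fn = sum(1 for y, p in zip(labels, preds) if y == c and p != c)
        denom = 2 * tp + fp + fn
        f1_c = 2 * tp / denom if denom else 0.0
        total += n_c * f1_c
    return total / len(labels)

# Made-up logits for 4 samples and 3 classes
logits = [[2.0, 0.1, 0.3], [0.2, 1.5, 0.1], [0.9, 0.2, 0.4], [0.1, 0.3, 2.2]]
labels = [0, 1, 2, 2]

preds = [row.index(max(row)) for row in logits]   # argmax over the class axis
accuracy = sum(y == p for y, p in zip(labels, preds)) / len(labels)
print(preds, accuracy, weighted_f1(labels, preds))
```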

In [10]:
"""
Training (fine-tuning)
"""

training_args = TrainingArguments(
    output_dir="./results",
    evaluation_strategy='epoch',  # note: newer transformers releases name this eval_strategy
    logging_strategy='epoch',
    save_strategy='epoch',
    save_total_limit=1,
    learning_rate=2e-5,
    per_device_train_batch_size=16,
    per_device_eval_batch_size=16,
    num_train_epochs=5,
    weight_decay=0.01,
    no_cuda=False, # False to use the GPU, True to run on the CPU
)

trainer = Trainer(
    model=model,
    args=training_args,
    compute_metrics=compute_metrics,
    train_dataset=tokenized_dataset['train'],
    eval_dataset=tokenized_dataset['test'],
    tokenizer=tokenizer,
    data_collator=data_collator,
)

trainer.train()

| Epoch | Training Loss | Validation Loss | Accuracy | F1 |
|---|---|---|---|---|
| 1 | 0.8025 | 0.407094 | 0.878049 | 0.878507 |
| 2 | 0.2896 | 0.376672 | 0.888889 | 0.88871 |
| 3 | 0.1387 | 0.382568 | 0.900407 | 0.90025 |
| 4 | 0.0719 | 0.434114 | 0.888889 | 0.889321 |
| 5 | 0.0419 | 0.433558 | 0.892954 | 0.892961 |


TrainOutput(global_step=1845, training_loss=0.26891642728149084, metrics={'train_runtime': 354.8993, 'train_samples_per_second': 83.122, 'train_steps_per_second': 5.199, 'total_flos': 617634462909024.0, 'train_loss': 0.26891642728149084, 'epoch': 5.0})
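
The step count and throughput in `TrainOutput` can be cross-checked from the settings above. This is a quick sanity check, assuming training ran on a single device with no gradient accumulation:

```python
import math

n_train = 5900          # training rows
batch_size = 16         # per_device_train_batch_size (assuming a single device)
epochs = 5
runtime_s = 354.8993    # train_runtime from the output above

steps_per_epoch = math.ceil(n_train / batch_size)   # the last partial batch counts as a step
total_steps = steps_per_epoch * epochs
samples_per_s = n_train * epochs / runtime_s
steps_per_s = total_steps / runtime_s

print(steps_per_epoch, total_steps, round(samples_per_s, 3), round(steps_per_s, 3))
```

The total matches `global_step=1845`, and the two rates agree with the reported `train_samples_per_second` and `train_steps_per_second`.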

In [11]:
"""
Save the model
"""
trainer.save_state()
trainer.save_model()

## Evaluation

In [12]:
"""
Output the predictions for the evaluation data
"""
pred_result = trainer.predict(tokenized_dataset['test'], ignore_keys=['loss', 'last_hidden_state', 'hidden_states', 'attentions'])
pred_label = pred_result.predictions.argmax(axis=1).tolist()
print(pred_label)

[8, 7, 0, 6, 8, 2, 8, 4, 0, 7, 7, 0, 4, 0, 6, 6, 2, 8, 8, 6, 1, 6, 8, 1, 4, 4, 0, 1, 5, 1, 2, 7, 2, 6, 5, 6, 8, 4, 3, 6, 0, 0, 7, 4, 6, 3, 1, 3, 0, 8, 0, 6, 0, 4, 7, 1, 1, 5, 0, 4, 8, 7, 4, 4, 7, 0, 0, 8, 8, 2, 8, 8, 7, 2, 5, 6, 6, 5, 4, 4, 4, 2, 4, 8, 0, 8, 5, 6, 6, 4, 7, 7, 6, 5, 2, 8, 7, 8, 5, 0, 0, 1, 4, 2, 8, 5, 3, 0, 6, 4, 1, 4, 2, 8, 2, 0, 8, 8, 6, 5, 7, 1, 1, 8, 4, 5, 6, 0, 1, 8, 8, 5, 1, 0, 3, 5, 1, 6, 0, 1, 0, 1, 6, 8, 7, 5, 1, 1, 5, 2, 8, 5, 1, 5, 4, 5, 5, 4, 4, 2, 2, 3, 8, 1, 4, 7, 5, 5, 1, 5, 1, 1, 4, 1, 7, 1, 5, 7, 3, 0, 8, 1, 2, 6, 0, 7, 0, 8, 2, 7, 6, 6, 8, 0, 5, 7, 5, 4, 5, 0, 7, 1, 8, 2, 8, 8, 2, 8, 0, 2, 5, 0, 6, 2, 7, 5, 5, 4, 5, 4, 2, 8, 1, 1, 6, 4, 3, 4, 0, 5, 4, 1, 5, 0, 5, 2, 8, 8, 8, 2, 7, 7, 2, 2, 5, 6, 0, 5, 4, 4, 1, 4, 5, 4, 1, 0, 6, 4, 6, 8, 5, 8, 4, 7, 6, 7, 1, 5, 1, 5, 2, 8, 8, 7, 4, 0, 1, 7, 5, 3, 8, 2, 8, 1, 6, 7, 4, 5, 6, 8, 2, 4, 8, 8, 5, 3, 6, 7, 0, 4, 2, 2, 6, 5, 5, 7, 2, 3, 8, 0, 8, 3, 3, 0, 5, 0, 7, 2, 0, 5, 1, 0, 4, 7, 0, 5, 1, 4, 4, 6, 5, 0, 2, 
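
The predicted values are label IDs. To read them as category names, they can be mapped back through `id2cat`. The sketch below hard-codes the mapping printed earlier in this notebook session and the first five predictions from the list above; in the notebook itself the existing `id2cat` and `pred_label` variables would be used instead.

```python
# Mapping copied from the id2cat output printed earlier in this notebook
id2cat = {0: 'it-life-hack', 1: 'kaden-channel', 2: 'smax', 3: 'livedoor-homme',
          4: 'peachy', 5: 'movie-enter', 6: 'dokujo-tsushin', 7: 'topic-news',
          8: 'sports-watch'}

first_preds = [8, 7, 0, 6, 8]   # first five predicted IDs from the list above
first_names = [id2cat[i] for i in first_preds]
print(first_names)
```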

In [13]:
"""
Compute precision, recall, F1-score, and accuracy
"""
from sklearn.metrics import classification_report
print(classification_report(tokenized_dataset['test']['label'], pred_label, target_names=categories))

                precision    recall  f1-score   support

  it-life-hack       0.93      0.93      0.93       182
 kaden-channel       0.96      0.92      0.94       176
          smax       0.96      0.96      0.96       160
livedoor-homme       0.77      0.71      0.74        86
        peachy       0.77      0.80      0.78       166
   movie-enter       0.89      0.91      0.90       184
dokujo-tsushin       0.88      0.84      0.86       167
    topic-news       0.85      0.94      0.89       163
  sports-watch       0.95      0.92      0.93       192

      accuracy                           0.89      1476
     macro avg       0.89      0.88      0.88      1476
  weighted avg       0.89      0.89      0.89      1476
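
The two summary rows differ only in how the per-class scores are averaged: `macro avg` is a plain mean over classes, while `weighted avg` weights each class by its support. A short check using the rounded per-class F1 values copied from the report above (so the results are only accurate to the report's two decimals):

```python
# (f1-score, support) per class, copied from the report above
per_class = {
    "it-life-hack": (0.93, 182),
    "kaden-channel": (0.94, 176),
    "smax": (0.96, 160),
    "livedoor-homme": (0.74, 86),
    "peachy": (0.78, 166),
    "movie-enter": (0.90, 184),
    "dokujo-tsushin": (0.86, 167),
    "topic-news": (0.89, 163),
    "sports-watch": (0.93, 192),
}

n_total = sum(s for _, s in per_class.values())
macro_f1 = sum(f for f, _ in per_class.values()) / len(per_class)
weighted_avg_f1 = sum(f * s for f, s in per_class.values()) / n_total

print(n_total, round(macro_f1, 2), round(weighted_avg_f1, 2))
```

The smallest class, `livedoor-homme`, has the lowest F1, which pulls the macro average slightly below the weighted one.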

