<a href="https://colab.research.google.com/github/mlengineer19989/text_classification/blob/main/colab_notebooks/japanese_text_classification.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# How to use this notebook
- Place the data you want to train on directly under the current directory, named train.csv and valid.csv.
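
The cells below read only the `sentence` and `label` columns from each CSV, and the `ClassLabel` definition lists `positive` first and `negative` second. A minimal sketch of a compatible file layout follows, assuming integer-encoded labels (0 for positive, 1 for negative). The sample rows and the `check_csv` helper are hypothetical and not part of the notebook.

```python
import csv
import io

# Hypothetical sample rows; real files would hold your own sentences.
SAMPLE_CSV = """sentence,label
この映画は最高でした。,0
二度と見たくない。,1
"""

# Same order as the ClassLabel names used later in the notebook.
LABEL_NAMES = ["positive", "negative"]


def check_csv(text: str) -> list[tuple[str, str]]:
    """Validate the columns and label ids, returning (sentence, label name) pairs."""
    reader = csv.DictReader(io.StringIO(text))
    missing = {"sentence", "label"} - set(reader.fieldnames or [])
    if missing:
        raise ValueError(f"missing columns: {sorted(missing)}")
    rows = []
    for row in reader:
        label_id = int(row["label"])  # assumption: labels are stored as integers
        if not 0 <= label_id < len(LABEL_NAMES):
            raise ValueError(f"unexpected label id: {label_id}")
        rows.append((row["sentence"], LABEL_NAMES[label_id]))
    return rows


rows = check_csv(SAMPLE_CSV)
print(rows)
```

Running a check like this on train.csv and valid.csv before training can catch a missing column or an out-of-range label early.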



In [2]:
!pip install transformers[ja,torch] datasets matplotlib japanize-matplotlib

Collecting datasets
  Downloading datasets-2.19.1-py3-none-any.whl (542 kB)
Collecting japanize-matplotlib
  Downloading japanize-matplotlib-1.1.3.tar.gz (4.1 MB)
  Preparing metadata (setup.py) ... done
Collecting accelerate>=0.21.0 (from transformers[ja,torch])
  Downloading accelerate-0.30.1-py3-none-any.whl (302 kB)
Collecting fugashi>=1.0 (from transformers[ja,torch])
  Downloading fugashi-1.3.2-cp310-cp310-manylinux_2_17_x86_64.manylinux2014_x86_64.whl (600 kB)

In [3]:
import numpy as np
import pandas as pd
from datasets import ClassLabel, Dataset, Features, Value
from transformers import (
    AutoModelForSequenceClassification,
    AutoTokenizer,
    BatchEncoding,
    DataCollatorWithPadding,
    Trainer,
    TrainingArguments,
)

In [4]:
my_features = Features(
    {"sentence": Value("string"), "label": ClassLabel(names=["positive", "negative"])}
)

# Load the data
df_train = pd.read_csv("train.csv")
df_valid = pd.read_csv("valid.csv")
train_dataset = Dataset.from_pandas(
    df_train[["sentence", "label"]], features=my_features
)
valid_dataset = Dataset.from_pandas(
    df_valid[["sentence", "label"]], features=my_features
)

In [5]:
##### Tokenization #####
model_name = "cl-tohoku/bert-base-japanese-v3"
tokenizer = AutoTokenizer.from_pretrained(model_name)


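def preprocess_text_classification(example: dict) -> BatchEncoding:
    """Tokenize the text of a classification example and convert it to IDs."""
    # truncation=True makes the cut-off at max_length explicit
    encoded_example = tokenizer(example["sentence"], max_length=512, truncation=True)
    # Store the label under "labels", the argument name the model expects
    encoded_example["labels"] = example["label"]
    return encoded_example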


encoded_train_dataset = train_dataset.map(
    preprocess_text_classification,
    remove_columns=train_dataset.column_names,
)
encoded_valid_dataset = valid_dataset.map(
    preprocess_text_classification,
    remove_columns=valid_dataset.column_names,
)

Map:   0%|          | 0/20149 [00:00<?, ? examples/s]

Map:   0%|          | 0/1608 [00:00<?, ? examples/s]

In [6]:
##### Mini-batch construction #####
data_collator = DataCollatorWithPadding(tokenizer=tokenizer)
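
`DataCollatorWithPadding` pads each mini-batch to the length of its longest example rather than to one fixed global length. The sketch below illustrates that idea in plain Python. It is a simplified stand-in, not the library's implementation: the `pad_batch` helper and the toy token ids are hypothetical, and the pad id of 0 is an assumption (the real collator takes it from the tokenizer).

```python
def pad_batch(batch: list[list[int]], pad_id: int = 0) -> dict[str, list[list[int]]]:
    """Right-pad every sequence to the longest one in this batch and build attention masks."""
    max_len = max(len(ids) for ids in batch)
    input_ids, attention_mask = [], []
    for ids in batch:
        n_pad = max_len - len(ids)
        input_ids.append(ids + [pad_id] * n_pad)
        # 1 marks a real token, 0 marks padding the model should ignore
        attention_mask.append([1] * len(ids) + [0] * n_pad)
    return {"input_ids": input_ids, "attention_mask": attention_mask}


# Toy batch with sequences of length 4 and 3 (ids are made up)
padded = pad_batch([[2, 15, 27, 3], [2, 15, 3]])
print(padded)
```

Because padding is decided per batch, a batch of short sentences stays short, which saves computation compared with padding everything to 512 tokens.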

In [7]:
##### Model preparation #####
model = AutoModelForSequenceClassification.from_pretrained(
    model_name,
)

Some weights of BertForSequenceClassification were not initialized from the model checkpoint at cl-tohoku/bert-base-japanese-v3 and are newly initialized: ['classifier.bias', 'classifier.weight']
You should probably TRAIN this model on a down-stream task to be able to use it for predictions and inference.


In [9]:
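##### Run training #####
training_args = TrainingArguments(
    output_dir="outputs",  # folder where results are saved
    per_device_train_batch_size=32,  # batch size for training
    per_device_eval_batch_size=32,  # batch size for evaluation
    learning_rate=2e-5,  # learning rate
    lr_scheduler_type="linear",  # type of learning-rate scheduler
    warmup_ratio=0.1,  # fraction of training steps used for learning-rate warmup
    num_train_epochs=3,  # number of epochs
    save_strategy="epoch",  # when to save checkpoints
    logging_strategy="epoch",  # when to log
    evaluation_strategy="epoch",  # when to evaluate on the validation set (newer transformers releases call this eval_strategy)
    load_best_model_at_end=True,  # after training, load the model that scored best on the validation set
    metric_for_best_model="accuracy",  # metric used to pick the best model
    fp16=True,  # enable automatic mixed precision
)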


def compute_accuracy(eval_pred: tuple[np.ndarray, np.ndarray]) -> dict[str, float]:
    """Compute accuracy from the predicted and gold labels."""
    predictions, labels = eval_pred
    # predictions holds a score for each label;
    # take the index with the highest score as the predicted label
    predictions = np.argmax(predictions, axis=1)
    return {"accuracy": (predictions == labels).mean()}


trainer = Trainer(
    model=model,
    train_dataset=encoded_train_dataset,
    eval_dataset=encoded_valid_dataset,
    data_collator=data_collator,
    args=training_args,
    compute_metrics=compute_accuracy,
)
trainer.train()



Epoch,Training Loss,Validation Loss,Accuracy
1,0.3093,0.178192,0.933458
2,0.1357,0.186292,0.935945
3,0.0678,0.247015,0.935323
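
The Accuracy column comes from `compute_accuracy`, which takes the highest-scoring class for each example and compares it with the gold label. Below is a dependency-free sketch of the same computation on made-up scores. The `accuracy_from_scores` helper and the toy values are illustrative assumptions, not data from this run.

```python
def accuracy_from_scores(scores: list[list[float]], labels: list[int]) -> float:
    """Pure-Python counterpart of argmax-then-compare accuracy."""
    # Index of the largest score in each row, i.e. the predicted label
    predictions = [max(range(len(row)), key=row.__getitem__) for row in scores]
    correct = sum(int(p == y) for p, y in zip(predictions, labels))
    return correct / len(labels)


# Four toy examples with two class scores each (made-up numbers)
toy_scores = [[2.1, -1.3], [0.2, 0.9], [-0.5, 1.7], [1.0, 0.4]]
toy_labels = [0, 1, 0, 0]
toy_accuracy = accuracy_from_scores(toy_scores, toy_labels)
print(toy_accuracy)
```

Here the third example is misclassified, so three of the four predictions match.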


TrainOutput(global_step=1890, training_loss=0.17092418771572215, metrics={'train_runtime': 316.993, 'train_samples_per_second': 190.689, 'train_steps_per_second': 5.962, 'total_flos': 2736880683965940.0, 'train_loss': 0.17092418771572215, 'epoch': 3.0})
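
The reported `global_step=1890` is consistent with the settings above: 20149 training examples at a batch size of 32 give 630 steps per epoch, and 3 epochs give 1890 steps. The sketch below reproduces that arithmetic together with a linear warmup and decay schedule for `warmup_ratio=0.1`. It assumes a single device and no gradient accumulation, and the `linear_lr` helper is a hypothetical illustration rather than the scheduler code that `Trainer` runs.

```python
import math

NUM_EXAMPLES = 20149  # training-set size shown in the Map output
BATCH_SIZE = 32
EPOCHS = 3
BASE_LR = 2e-5
WARMUP_RATIO = 0.1

steps_per_epoch = math.ceil(NUM_EXAMPLES / BATCH_SIZE)
total_steps = steps_per_epoch * EPOCHS
warmup_steps = math.ceil(total_steps * WARMUP_RATIO)


def linear_lr(step: int) -> float:
    """Learning rate after `step` updates: linear warmup, then linear decay to zero."""
    if step < warmup_steps:
        return BASE_LR * (step / warmup_steps)
    remaining = max(0, total_steps - step)
    return BASE_LR * (remaining / (total_steps - warmup_steps))


print(steps_per_epoch, total_steps, warmup_steps)
print(linear_lr(0), linear_lr(warmup_steps), linear_lr(total_steps))
```

Under these assumptions the learning rate peaks at the end of warmup and reaches zero on the final step.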

In [11]:
# Mount Google Drive
from google.colab import drive

drive.mount("drive")

Mounted at drive


In [12]:
# prompt: Create a folder named with today's year, month, day, and hour directly under "drive/MyDrive/models"

import datetime
now = datetime.datetime.now()
today = now.strftime("%Y%m%d%H")
print(today)


2024052602


In [13]:
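# prompt: Create a folder named with today's year, month, day, and hour directly under "drive/MyDrive/models", and copy the outputs folder into it.

import os

model_dir = os.path.join("drive/MyDrive/models", today)
# exist_ok=True keeps a rerun within the same hour from raising FileExistsError
os.makedirs(model_dir, exist_ok=True)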


In [14]:
# prompt: Create a folder named with today's year, month, day, and hour directly under "drive/MyDrive/models", and copy the outputs folder into it.

!cp -r outputs $model_dir
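
The last three cells build an hour-stamped folder name, create that folder under `drive/MyDrive/models`, and copy `outputs` into it. A plain-Python sketch of the same steps, which avoids the shell call, might look like the following. The `backup_outputs` helper and its parameters are hypothetical names introduced for illustration.

```python
import datetime
import os
import shutil
from typing import Optional


def backup_outputs(
    src: str, models_root: str, now: Optional[datetime.datetime] = None
) -> str:
    """Copy `src` into <models_root>/<YYYYmmddHH>/<name of src> and return the new path."""
    now = now or datetime.datetime.now()
    stamp = now.strftime("%Y%m%d%H")  # e.g. 2024052602
    model_dir = os.path.join(models_root, stamp)
    os.makedirs(model_dir, exist_ok=True)
    dest = os.path.join(model_dir, os.path.basename(os.path.normpath(src)))
    # dirs_exist_ok lets a rerun within the same hour overwrite the earlier copy
    shutil.copytree(src, dest, dirs_exist_ok=True)
    return dest
```

In the notebook this would be called as `backup_outputs("outputs", "drive/MyDrive/models")` once Drive is mounted.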
