<a href="https://colab.research.google.com/github/mydreamisto/notebooks/blob/main/course/en/chapter3/section3.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# Fine-tuning a model with the Trainer API or Keras

Install the Transformers, Datasets, and Evaluate libraries to run this notebook.

In [None]:
!pip install datasets

In [None]:
!pip install wandb

In [None]:
!pip install evaluate

In [None]:
!pip install datasets evaluate transformers[sentencepiece]

In [3]:
from datasets import load_dataset
from transformers import AutoTokenizer, DataCollatorWithPadding
raw_datasets = load_dataset("glue", "mrpc")
checkpoint = "bert-base-uncased"
tokenizer = AutoTokenizer.from_pretrained(checkpoint)


def tokenize_function(example):
    return tokenizer(example["sentence1"], example["sentence2"], truncation=True)


tokenized_datasets = raw_datasets.map(tokenize_function, batched=True)
data_collator = DataCollatorWithPadding(tokenizer=tokenizer)

Output (download and mapping progress bars omitted): the dataset has 3,668 training, 408 validation, and 1,725 test examples.

# Training

In [None]:
# TrainingArguments holds all the hyperparameters the Trainer uses for training and evaluation.
from transformers import TrainingArguments
# The only required argument is the directory where the trained model and its intermediate checkpoints are saved.
training_args = TrainingArguments("test-trainer")

In [None]:
from transformers import AutoModelForSequenceClassification
# num_labels=2 means this sequence classification task is a binary classification task.
model = AutoModelForSequenceClassification.from_pretrained(checkpoint, num_labels=2)

Some weights of BertForSequenceClassification were not initialized from the model checkpoint at bert-base-uncased and are newly initialized: ['classifier.bias', 'classifier.weight']
You should probably TRAIN this model on a down-stream task to be able to use it for predictions and inference.


In [None]:
# The Trainer class handles the fine-tuning loop for a pretrained Transformers model on your dataset.
from transformers import Trainer
trainer = Trainer(
    model,
    training_args,
    train_dataset=tokenized_datasets["train"],
    eval_dataset=tokenized_datasets["validation"],
    data_collator=data_collator,
)

In [None]:
import wandb

# wandb (Weights & Biases) is a tool for experiment tracking and visualization.
# wandb.init() starts a new run and sets up its initial information, such as the run name and hyperparameters.
# Calling wandb.log() before a run has been initialized fails, because wandb does not know which run to record the data to.
# During trainer.train(), the loss and other metrics are logged to wandb.
# Because wandb.init() has already been called, that data is sent to the wandb server for storage and visualization.
wandb.init()

# With the default settings, the training loss is reported every 500 steps.
trainer.train()


| Step | Training Loss |
|------|---------------|
| 500  | 0.2067        |
| 1000 | 0.1236        |


TrainOutput(global_step=1377, training_loss=0.1310880476786074, metrics={'train_runtime': 208.0578, 'train_samples_per_second': 52.889, 'train_steps_per_second': 6.618, 'total_flos': 405114969714960.0, 'train_loss': 0.1310880476786074, 'epoch': 3.0})
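
The global_step=1377 in the output above can be reproduced with a little arithmetic. The sketch below assumes the Trainer defaults were in effect, since the notebook does not set them explicitly: a batch size of 8, 3 epochs, a single device, and no gradient accumulation. The variable names are illustrative only.

```python
import math

# Assumed defaults (not set explicitly in this notebook): batch size 8,
# 3 epochs, one device, no gradient accumulation.
train_examples = 3668  # size of the MRPC train split
batch_size = 8
num_epochs = 3

# The last batch of each epoch is partial, so round up.
steps_per_epoch = math.ceil(train_examples / batch_size)
total_steps = steps_per_epoch * num_epochs
print(steps_per_epoch, total_steps)

# train_steps_per_second is the step count divided by the runtime in seconds.
train_runtime = 208.0578
steps_per_second = total_steps / train_runtime
print(round(steps_per_second, 3))
```

Under these assumptions the step count matches the reported global_step, and the throughput agrees with train_steps_per_second up to rounding.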

In [None]:
# The output of predict() is a named tuple with three fields:
# "predictions", "label_ids", and "metrics".
# The "metrics" field contains the loss on the dataset passed in,
# as well as some time metrics (how long prediction took, in total and on average).

# Run the trainer's predict() method on the validation set and store the result.
# predictions.predictions holds the raw model outputs, and predictions.label_ids holds the true labels.
predictions = trainer.predict(tokenized_datasets["validation"])
print(predictions.predictions.shape, predictions.label_ids.shape)

# predictions.predictions is a two-dimensional array with shape 408 x 2:
# 408 is the number of examples in the dataset we passed to predict(),
# and 2 comes from num_labels (this is a binary classification task).
# These values are the logits for each example.



(408, 2) (408,)


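predictions.predictions is a two-dimensional array.

For each sample, the model outputs a set of logits. In a binary classification task there are two of them, and in a multi-class task there are more. Each element is the model's "score" for the sample belonging to the corresponding class, but these scores are not probabilities.

Logits are the model's raw output: in a classification task, they are the values before any softmax or sigmoid activation is applied. For binary classification, they can be read as unnormalized per-class scores. For example, logits of [2.3, -1.5] mean the model leans toward the first class, because its score is higher.

In this example, predictions.predictions has shape 408 x 2, where 408 is the number of examples in the dataset passed to predict() and 2 comes from num_labels (a binary classification task). It stores the logits for each sample, not the class labels.

The shape (408,) means predictions.label_ids is a one-dimensional array of 408 integers, each the true class label of one sample. The label of the i-th sample is predictions.label_ids[i].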
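
To make the difference between logits and probabilities concrete, here is a minimal sketch that applies a softmax to the example logits [2.3, -1.5] from the text. The softmax helper is written here for illustration and is not part of the notebook or of Transformers.

```python
import math

def softmax(logits):
    """Convert a list of raw scores into probabilities that sum to 1."""
    m = max(logits)  # subtract the max for numerical stability
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]

# Example logits from the text: the first class has the higher score.
probs = softmax([2.3, -1.5])
print(probs)
```

The ordering of the classes is unchanged by the softmax, which is why taking the largest logit is enough to pick the predicted class.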

In [None]:
# To turn the logits into predictions that we can compare to our labels,
# take the index of the maximum value along the last axis.
import numpy as np

# preds is a one-dimensional array with one predicted class index per sample.
# It can be compared directly with predictions.label_ids (the true labels)
# to compute metrics such as accuracy, precision, recall, and F1.
preds = np.argmax(predictions.predictions, axis=-1)
# axis=-1 means the last axis. For a two-dimensional array it is equivalent to axis=1:
# NumPy reduces along the column axis, so it returns the index of the largest logit in each row (each sample).
# For higher-dimensional arrays, axis=-1 still operates on the last dimension,
# so you do not have to work out the axis number by hand.
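
The effect of the axis argument is easiest to see on a tiny array. The logits below are made-up values for three samples, not output from the model.

```python
import numpy as np

# Hypothetical logits for three samples in a two-class task.
logits = np.array([[2.3, -1.5],
                   [-0.4, 0.9],
                   [0.1, 0.0]])

# For a 2-D array, axis=-1 and axis=1 are the same axis:
# both give one class index per row (per sample).
preds_last = np.argmax(logits, axis=-1)
preds_one = np.argmax(logits, axis=1)
print(preds_last, preds_one)

# axis=0 would instead compare across samples, giving one index per column.
per_column = np.argmax(logits, axis=0)
print(per_column)
```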

In [None]:
import evaluate

# evaluate.load("glue", "mrpc") loads the metric for MRPC, one of the tasks in the GLUE benchmark.
metric = evaluate.load("glue", "mrpc")
# Compute the metric on the validation set from the predicted classes (preds) and the true labels (predictions.label_ids).
metric.compute(predictions=preds, references=predictions.label_ids)

{'accuracy': 0.8602941176470589, 'f1': 0.9015544041450777}
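
For intuition about what these two numbers measure, the following sketch computes accuracy and F1 by hand on a toy example. It assumes label 1 is the positive class; the helper accuracy_and_f1 and the toy data are illustrative and are not how the evaluate library is implemented.

```python
def accuracy_and_f1(preds, labels):
    """Accuracy and F1 for binary labels, treating class 1 as positive."""
    pairs = list(zip(preds, labels))
    correct = sum(p == y for p, y in pairs)
    tp = sum(p == 1 and y == 1 for p, y in pairs)  # true positives
    fp = sum(p == 1 and y == 0 for p, y in pairs)  # false positives
    fn = sum(p == 0 and y == 1 for p, y in pairs)  # false negatives
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    denom = precision + recall
    f1 = 2 * precision * recall / denom if denom else 0.0
    return {"accuracy": correct / len(pairs), "f1": f1}

# Made-up predictions and labels for six samples.
toy_preds = [1, 1, 0, 1, 0, 1]
toy_labels = [1, 0, 0, 1, 1, 1]
result = accuracy_and_f1(toy_preds, toy_labels)
print(result)
```

F1 is the harmonic mean of precision and recall, so it can differ noticeably from accuracy when the classes are imbalanced.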

In [None]:
# Trainer calls this function each time it runs evaluation during training.
# It converts the eval_preds passed in by Trainer into predicted classes
# and computes the metrics loaded with evaluate.
def compute_metrics(eval_preds):
    metric = evaluate.load("glue", "mrpc")
    # Unpack eval_preds into logits (the raw model outputs) and labels (the true labels).
    logits, labels = eval_preds
    predictions = np.argmax(logits, axis=-1)
    return metric.compute(predictions=predictions, references=labels)

In [None]:
# Evaluate at the end of every epoch. Recent Transformers releases rename this argument to eval_strategy.
training_args = TrainingArguments("test-trainer", evaluation_strategy="epoch")
model = AutoModelForSequenceClassification.from_pretrained(checkpoint, num_labels=2)

trainer = Trainer(
    model,
    training_args,
    train_dataset=tokenized_datasets["train"],
    eval_dataset=tokenized_datasets["validation"],
    data_collator=data_collator,
    tokenizer=tokenizer,
    # Passing compute_metrics only tells Trainer which function to use for computing metrics.
    # You never call compute_metrics(eval_preds) yourself: during evaluation, Trainer runs
    # the model on eval_dataset, collects the outputs and true labels as eval_preds,
    # and passes them to this function.
    compute_metrics=compute_metrics,
)


In [None]:
trainer.train()

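| Epoch | Training Loss | Validation Loss | Accuracy | F1       |
|-------|---------------|-----------------|----------|----------|
| 1     | No log        | 0.434373        | 0.803922 | 0.859649 |
| 2     | 0.572700      | 0.412914        | 0.848039 | 0.891608 |
| 3     | 0.375600      | 0.67606         | 0.835784 | 0.886633 |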


TrainOutput(global_step=1377, training_loss=0.41641840384157136, metrics={'train_runtime': 226.2499, 'train_samples_per_second': 48.636, 'train_steps_per_second': 6.086, 'total_flos': 405114969714960.0, 'train_loss': 0.41641840384157136, 'epoch': 3.0})
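
The "No log" entry for epoch 1 appears to follow from the logging interval: the training loss is only recorded every 500 steps, and the first epoch ends before step 500. The sketch below works this out, assuming the default batch size of 8 and the default logging interval of 500 steps; the names are illustrative.

```python
import math

# Assumptions: default batch size of 8 and default logging interval of 500 steps.
steps_per_epoch = math.ceil(3668 / 8)
logging_steps = 500

rows = []
for epoch in (1, 2, 3):
    end_step = epoch * steps_per_epoch
    # Most recent step at or before the end of this epoch where a loss was logged.
    last_logged = (end_step // logging_steps) * logging_steps
    rows.append((epoch, end_step, last_logged if last_logged else "No log"))

for row in rows:
    print(row)
```

Lowering logging_steps in TrainingArguments would make a training loss available by the end of the first epoch.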

# **Summary:**

Install the required libraries.

In [None]:
!pip install datasets evaluate transformers wandb

In [7]:
import wandb
import evaluate
import numpy as np
from datasets import load_dataset
from transformers import AutoTokenizer, AutoModelForSequenceClassification, DataCollatorWithPadding, TrainingArguments, Trainer
raw_datasets = load_dataset("glue", "mrpc")
checkpoint = "bert-base-uncased"
tokenizer = AutoTokenizer.from_pretrained(checkpoint)
model = AutoModelForSequenceClassification.from_pretrained(checkpoint, num_labels=2)

# Data processing
def tokenize_function(example):
    return tokenizer(example["sentence1"], example["sentence2"], truncation=True)

tokenized_datasets = raw_datasets.map(tokenize_function, batched=True)
data_collator = DataCollatorWithPadding(tokenizer=tokenizer)

# Metric function
def compute_metrics(eval_preds):
    metric = evaluate.load("glue", "mrpc")
    logits, labels = eval_preds
    predictions = np.argmax(logits, axis=-1)
    return metric.compute(predictions=predictions, references=labels)

# Training configuration
training_args = TrainingArguments("test-trainer", evaluation_strategy="epoch")
trainer = Trainer(
    model,
    training_args,
    train_dataset=tokenized_datasets["train"],
    eval_dataset=tokenized_datasets["validation"],
    data_collator=data_collator,
    tokenizer=tokenizer,
    compute_metrics=compute_metrics,
)

wandb.init()
trainer.train()

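| Epoch | Training Loss | Validation Loss | Accuracy | F1       |
|-------|---------------|-----------------|----------|----------|
| 1     | No log        | 0.426644        | 0.823529 | 0.884244 |
| 2     | 0.499500      | 0.63079         | 0.848039 | 0.898361 |
| 3     | 0.255700      | 0.636733        | 0.867647 | 0.908475 |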


TrainOutput(global_step=1377, training_loss=0.3024315262671909, metrics={'train_runtime': 215.2931, 'train_samples_per_second': 51.112, 'train_steps_per_second': 6.396, 'total_flos': 405114969714960.0, 'train_loss': 0.3024315262671909, 'epoch': 3.0})