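We will use the `Trainer` API from Hugging Face Transformers to simplify the training loop.
Unlike before, we use `AutoModelForSequenceClassification` instead of `AutoModel`. The difference is that `AutoModelForSequenceClassification` comes with a classification head on top of the pretrained model outputs. We only need to specify how many labels the model has to predict (six in our case), which determines the number of outputs of the classification head.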

In [None]:
from transformers import AutoModelForSequenceClassification

num_labels = 6
model = (AutoModelForSequenceClassification
        .from_pretrained(model_ckpt, num_labels=num_labels)
        .to(device))


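You will see a warning that some parts of the model are randomly initialized. This is normal, since the classification head has not yet been trained.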
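
To make the idea of a randomly initialized head concrete, here is a minimal pure-Python sketch of a single linear layer that maps a hidden-state vector to one logit per label. The helper `linear_head`, the toy `hidden_size` of 8, and the initialization scale are assumptions for illustration only. The actual DistilBERT classification head has additional layers and is implemented in PyTorch.

```python
import random

def linear_head(hidden_state, weights, bias):
    """Map one hidden-state vector to one logit per label: logits = W h + b."""
    return [sum(w * h for w, h in zip(row, hidden_state)) + b
            for row, b in zip(weights, bias)]

random.seed(0)
hidden_size, num_labels = 8, 6  # hidden_size=8 is a toy value, not the real model size

# Randomly initialized parameters, as in a freshly added (untrained) head
weights = [[random.gauss(0.0, 0.02) for _ in range(hidden_size)]
           for _ in range(num_labels)]
bias = [0.0] * num_labels

# A made-up hidden state standing in for the encoder output
hidden_state = [random.uniform(-1.0, 1.0) for _ in range(hidden_size)]
logits = linear_head(hidden_state, weights, bias)
print(len(logits))  # one logit per label
```

Until these weights are trained, the logits carry no information about the emotion labels, which is what the warning is pointing out.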
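## Defining the performance metrics
To monitor metrics during training, we need to define a `compute_metrics()` function for the `Trainer`. This function receives an `EvalPrediction` object (a named tuple with `predictions` and `label_ids` attributes) and returns a dictionary that maps each metric's name to its value. We will compute the $F_1$-score and the accuracy.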

In [None]:
from sklearn.metrics import accuracy_score, f1_score

def compute_metrics(pred):
    labels = pred.label_ids
    preds = pred.predictions.argmax(-1)
    f1 = f1_score(labels, preds, average="weighted")
    acc = accuracy_score(labels, preds)
    return {"accuracy": acc, "f1": f1}
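
For readers who want to see what the two metrics measure, the following pure-Python sketch computes accuracy and a support-weighted $F_1$-score by hand. The function names and the toy labels are made up for illustration. It is meant to mirror `average="weighted"` conceptually, not to replace the Scikit-Learn implementation.

```python
from collections import Counter

def accuracy(labels, preds):
    """Fraction of predictions that match the true label."""
    return sum(l == p for l, p in zip(labels, preds)) / len(labels)

def weighted_f1(labels, preds):
    """Per-class F1, averaged with weights equal to each class's support."""
    support = Counter(labels)
    total = 0.0
    for cls, n in support.items():
        tp = sum(1 for l, p in zip(labels, preds) if l == cls and p == cls)
        fp = sum(1 for l, p in zip(labels, preds) if l != cls and p == cls)
        fn = n - tp
        denom = 2 * tp + fp + fn
        f1 = 2 * tp / denom if denom else 0.0
        total += f1 * n
    return total / len(labels)

# Toy data, invented for this example
toy_labels = [0, 0, 1, 1, 2, 2]
toy_preds = [0, 1, 1, 1, 2, 0]
print(accuracy(toy_labels, toy_preds), weighted_f1(toy_labels, toy_preds))
```

Weighting by support means that frequent classes contribute more to the final score, which matters when the classes are imbalanced.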

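# Create a Trainer object and pass in the model, datasets, optimizer, evaluation function, and logger.

With the datasets and metrics ready, there are just two final things to take care of before we can define the `Trainer`:

1. Log in to our account on the Hugging Face Hub. This will allow us to push our fine-tuned model to the Hub and share it with the community.
2. Define all the hyperparameters for the training run.

Both steps are covered in the next section.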

#### Training the model
If you're running this code in a Jupyter notebook, you can log in to the Hugging Face Hub with the following helper function:

In [None]:
from huggingface_hub import notebook_login

notebook_login()

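This will display a widget in which you can enter your username and password, or an access token with write privileges. You can find details on how to create access tokens in the [Hub documentation](https://huggingface.co/docs/hub/security#user-access-tokens). If you're working in the terminal, you can log in by running the following command: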

```bash
$ huggingface-cli login
```

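To define the training parameters, we use the `TrainingArguments` class. This class stores a lot of information and gives you fine-grained control over training and evaluation. The most important argument to specify is `output_dir`, which is the directory where all the artifacts from training (such as model weights and log files) are stored. Here is a complete example of `TrainingArguments`: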

In [None]:
from transformers import Trainer, TrainingArguments

batch_size = 64
logging_steps = len(emotions_encoded["train"]) // batch_size
model_name = f"{model_ckpt}-finetuned-emotion"
training_args = TrainingArguments(output_dir=model_name,
                                  num_train_epochs=2,
                                  learning_rate=2e-5,
                                  per_device_train_batch_size=batch_size,
                                  per_device_eval_batch_size=batch_size,
                                  weight_decay=0.01,
                                  evaluation_strategy="epoch",
                                  disable_tqdm=False,
                                  logging_steps=logging_steps,
                                  push_to_hub=True, 
                                  log_level="error")
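
The `logging_steps` value above amounts to logging once per epoch. As a worked example, the sketch below does the arithmetic for a training split of 16,000 examples. That size is an assumption for illustration, as is training on a single device without gradient accumulation.

```python
# Assumed size of the training split (hypothetical, for illustration)
num_train_examples = 16_000
batch_size = 64
num_train_epochs = 2

# Number of optimizer steps needed to see every training example once
steps_per_epoch = num_train_examples // batch_size

# Logging every `steps_per_epoch` steps means one log entry per epoch
logging_steps = steps_per_epoch
total_steps = steps_per_epoch * num_train_epochs

print(steps_per_epoch, total_steps)
```

With more devices or gradient accumulation, the number of optimizer steps per epoch shrinks accordingly, so the same `logging_steps` would log less often.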

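Here we set the batch size, learning rate, and number of epochs, and ask the `Trainer` to run an evaluation at the end of every epoch. Note that loading the best model at the end of the training run would additionally require `load_best_model_at_end=True`, which is not set here. With this final ingredient, we can instantiate and fine-tune our model with the `Trainer`: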

In [None]:
from transformers import Trainer

trainer = Trainer(model=model, args=training_args, 
                  compute_metrics=compute_metrics,
                  train_dataset=emotions_encoded["train"],
                  eval_dataset=emotions_encoded["validation"],
                  tokenizer=tokenizer)
trainer.train();

Looking at the logs, we can see that our model has an $F_1$-score on the validation set of around 92% - this is a significant improvement over the feature-based approach!

We can take a more detailed look at the training metrics by calculating the confusion matrix. To visualize the confusion matrix, we first need to get the predictions on the validation set. The `predict()` method of the `Trainer` class returns several useful objects we can use for evaluation:

In [None]:
# hide_output
preds_output = trainer.predict(emotions_encoded["validation"])

The output of the `predict()` method is a `PredictionOutput` object that contains arrays of `predictions` and `label_ids`, along with the metrics we passed to the trainer. For example, the metrics on the validation set can be accessed as follows:

In [None]:
preds_output.metrics

It also contains the raw predictions for each class. We can decode the predictions greedily using `np.argmax()`. This yields the predicted labels and has the same format as the labels returned by the Scikit-Learn models in the feature-based approach:

In [None]:
y_preds = np.argmax(preds_output.predictions, axis=1)

With the predictions, we can plot the confusion matrix again:

In [None]:
plot_confusion_matrix(y_preds, y_valid, labels)

This is much closer to the ideal diagonal confusion matrix. The `love` category is still often confused with `joy`, which seems natural. `surprise` is also frequently mistaken for `joy`, or confused with `fear`. Overall the performance of the model seems quite good, but before we call it a day, let's dive a little deeper into the types of errors our model is likely to make.
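
For reference, a confusion matrix can be computed from the predicted and true labels in a few lines. The sketch below is pure Python and normalizes each row by the number of true examples in that class. The function name, the normalization choice, and the toy data are assumptions for illustration and may differ from what `plot_confusion_matrix` does internally.

```python
def normalized_confusion_matrix(y_true, y_pred, num_classes):
    """Rows are true classes, columns are predicted classes; each row sums to 1."""
    counts = [[0] * num_classes for _ in range(num_classes)]
    for t, p in zip(y_true, y_pred):
        counts[t][p] += 1
    matrix = []
    for row in counts:
        total = sum(row)
        matrix.append([c / total if total else 0.0 for c in row])
    return matrix

# Toy labels, invented for this example
toy_true = [0, 0, 1, 1, 2, 2]
toy_pred = [0, 1, 1, 1, 2, 0]
cm = normalized_confusion_matrix(toy_true, toy_pred, num_classes=3)
for row in cm:
    print(row)
```

A perfect classifier would produce the identity matrix, so any mass off the diagonal shows which true class is being mistaken for which predicted class.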

### Sidebar: Fine-Tuning with Keras
If you are using TensorFlow, it's also possible to fine-tune your models using the Keras API. The main difference from the PyTorch API is that there is no `Trainer` class, since Keras models already provide a built-in `fit()` method. To see how this works, let's first load DistilBERT as a TensorFlow model:

In [None]:
#hide_output
from transformers import TFAutoModelForSequenceClassification

tf_model = (TFAutoModelForSequenceClassification
            .from_pretrained(model_ckpt, num_labels=num_labels))

Next, we'll convert our datasets into the `tf.data.Dataset` format. Since we have already padded our tokenized inputs, we can do this easily by applying the `to_tf_dataset()` method to `emotions_encoded`:

In [None]:
# The column names to convert to TensorFlow tensors
tokenizer_columns = tokenizer.model_input_names

tf_train_dataset = emotions_encoded["train"].to_tf_dataset(
    columns=tokenizer_columns, label_cols=["label"], shuffle=True,
    batch_size=batch_size)
tf_eval_dataset = emotions_encoded["validation"].to_tf_dataset(
    columns=tokenizer_columns, label_cols=["label"], shuffle=False,
    batch_size=batch_size)

Here we've also shuffled the training set, and defined the batch size for it and the validation set. The last thing to do is compile and train the model:

In [None]:
#hide_output
import tensorflow as tf

tf_model.compile(
    optimizer=tf.keras.optimizers.Adam(learning_rate=5e-5),
    loss=tf.keras.losses.SparseCategoricalCrossentropy(from_logits=True),
    metrics=tf.metrics.SparseCategoricalAccuracy())

tf_model.fit(tf_train_dataset, validation_data=tf_eval_dataset, epochs=2)

### End sidebar
#### Error analysis
Before moving on, we should investigate our model's predictions a little bit further. A simple yet powerful technique is to sort the validation samples by the model loss. When we pass the label during the forward pass, the loss is automatically calculated and returned. Here's a function that returns the loss along with the predicted label:


In [None]:
import torch
from torch.nn.functional import cross_entropy

def forward_pass_with_label(batch):
    # Place all input tensors on the same device as the model
    inputs = {k:v.to(device) for k,v in batch.items() 
              if k in tokenizer.model_input_names}

    with torch.no_grad():
        output = model(**inputs)
        pred_label = torch.argmax(output.logits, axis=-1)
        loss = cross_entropy(output.logits, batch["label"].to(device), 
                             reduction="none")

    # Place outputs on CPU for compatibility with other dataset columns   
    return {"loss": loss.cpu().numpy(), 
            "predicted_label": pred_label.cpu().numpy()}

Using the `map()` method once more, we can apply this function to get the losses for all the samples:

In [None]:
#hide_output
# Convert our dataset back to PyTorch tensors
emotions_encoded.set_format("torch", 
                            columns=["input_ids", "attention_mask", "label"])
# Compute loss values
emotions_encoded["validation"] = emotions_encoded["validation"].map(
    forward_pass_with_label, batched=True, batch_size=16)
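
To see what a per-sample loss means, here is a pure-Python sketch of the cross-entropy computed for each example separately, which is the effect of `reduction="none"`. The function name and the toy logits are made up for illustration. In practice the PyTorch function above does this on tensors.

```python
import math

def per_sample_cross_entropy(logits, labels):
    """Return -log softmax(logits)[label] for each example, without averaging."""
    losses = []
    for row, label in zip(logits, labels):
        m = max(row)  # subtract the max for numerical stability
        log_sum_exp = m + math.log(sum(math.exp(x - m) for x in row))
        losses.append(log_sum_exp - row[label])
    return losses

# Toy logits for two examples whose true label is class 0 (invented values)
toy_logits = [[4.0, 0.0, 0.0],   # confident and correct
              [0.0, 0.0, 4.0]]   # confident and wrong
toy_losses = per_sample_cross_entropy(toy_logits, [0, 0])
print(toy_losses)
```

A confident correct prediction gets a loss close to zero, while a confident wrong prediction gets a large loss. This is why sorting by loss surfaces the examples the model struggles with most.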

Finally, we create a `DataFrame` with the texts, losses, and predicted/true labels:

In [None]:
emotions_encoded.set_format("pandas")
cols = ["text", "label", "predicted_label", "loss"]
df_test = emotions_encoded["validation"][:][cols]
df_test["label"] = df_test["label"].apply(label_int2str)
df_test["predicted_label"] = (df_test["predicted_label"]
                              .apply(label_int2str))

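We can now easily sort the validation samples (collected in `df_test`) by their losses in either ascending or descending order. The goal of this exercise is to detect one of the following:

- **Wrong labels**: Every process that adds labels to data can be flawed. Annotators can make mistakes or disagree, while labels that are inferred from other features can be wrong. If it were easy to annotate data automatically, we would not need a model to do it. Thus, it is normal that there are some wrongly labeled examples. With this approach, we can quickly find and correct them.

- **Quirks of the dataset**: Datasets in the real world are always a bit messy. When working with text, special characters or strings in the inputs can have a big impact on the model's predictions. Inspecting the model's weakest predictions can help identify such features, and cleaning the data or injecting similar examples can make the model more robust.

Let's first have a look at the data samples with the **highest losses**: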

In [None]:
#hide_output
df_test.sort_values("loss", ascending=False).head(10)

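We can clearly see that the model predicted some of the labels incorrectly. On the other hand, it seems that there are quite a few examples with no clear class, which might be either mislabeled or require a new class altogether. In particular, `joy` seems to be mislabeled several times. With this information we can refine the dataset, which often can lead to as big a performance gain (or more) as having more data or larger models!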
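
One simple way to check whether a single class dominates the hardest examples is to count the true labels among the top-k highest-loss samples. The sketch below does this in pure Python on a handful of made-up records. The helper name `top_loss_label_counts` and all values are hypothetical, not results from the model.

```python
from collections import Counter

# Hypothetical records mimicking the columns of the DataFrame (values invented)
records = [
    {"label": "joy", "predicted_label": "sadness", "loss": 5.7},
    {"label": "joy", "predicted_label": "anger", "loss": 5.2},
    {"label": "love", "predicted_label": "joy", "loss": 4.9},
    {"label": "sadness", "predicted_label": "sadness", "loss": 0.02},
    {"label": "fear", "predicted_label": "fear", "loss": 0.05},
    {"label": "joy", "predicted_label": "joy", "loss": 0.01},
]

def top_loss_label_counts(records, k):
    """Count how often each true label appears among the k highest-loss records."""
    worst = sorted(records, key=lambda r: r["loss"], reverse=True)[:k]
    return Counter(r["label"] for r in worst)

counts = top_loss_label_counts(records, k=3)
print(counts.most_common())
```

If one label is heavily over-represented in this count, its examples are good candidates for a manual review of the annotations.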

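When looking at the samples with the lowest losses, we observe that the model seems to be most confident when predicting the `sadness` class. Deep learning models are exceptionally good at finding and exploiting shortcuts to get to a prediction. For this reason, it is also worth spending time on the examples that the model is most confident about, so that we can verify that the model does not improperly exploit certain features of the text. So, let's also look at the predictions with the smallest loss: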

In [None]:
#hide_output
df_test.sort_values("loss", ascending=True).head(10)

In [None]:
#hide_output
trainer.push_to_hub(commit_message="Training completed!")

In [None]:
#hide_output
from transformers import pipeline

# Change `transformersbook` to your Hub username
model_id = "transformersbook/distilbert-base-uncased-finetuned-emotion"
classifier = pipeline("text-classification", model=model_id)

In [None]:
custom_tweet = "I saw a movie today and it was really good."
preds = classifier(custom_tweet, return_all_scores=True)

In [None]:
preds_df = pd.DataFrame(preds[0])
plt.bar(labels, 100 * preds_df["score"], color='C0')
plt.title(f'"{custom_tweet}"')
plt.ylabel("Class probability (%)")
plt.show()