# **Part 1: Identifying Corporate Climate Commitments vs. Cheap Talk**

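In the first part of this seminar, we will run inference with a pre-trained BERT-style classifier (a fine-tuned DistilRoBERTa model). The classifier comes from the research paper [How cheap talk in climate disclosures relates to climate initiatives, corporate emissions, and reputation risk](https://www.sciencedirect.com/science/article/pii/S0378426624001080?via%3Dihub#appSC), which measures and analyses corporate "cheap talk" on climate action.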

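The study proposes a "cheap talk index": the share of the climate commitments in a company's disclosures that are vague rather than specific. Measuring it requires one classifier to identify climate commitments and a second one to decide whether each commitment is specific. We will focus on the first model, which identifies climate commitments.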
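
The ratio behind the index can be sketched in a few lines. This is only an illustration, assuming the index is read as the share of commitment paragraphs that lack specifics (one minus the share that are specific). The function name and the two boolean lists are hypothetical stand-ins for the outputs of the two classifiers, not the authors' code.

```python
def cheap_talk_index(is_commitment, is_specific):
    """Share of commitment paragraphs that are NOT specific.

    Both arguments are hypothetical per-paragraph boolean lists, standing in
    for the outputs of the commitment and specificity classifiers.
    """
    # Keep the specificity flag only for paragraphs flagged as commitments
    specific_flags = [s for c, s in zip(is_commitment, is_specific) if c]
    if not specific_flags:
        return None  # the index is undefined when there are no commitments
    specific_share = sum(specific_flags) / len(specific_flags)
    return 1 - specific_share


# Toy report with five paragraphs: four commitments, one of them specific
commitment = [True, True, False, True, True]
specific = [False, True, False, False, False]
index = cheap_talk_index(commitment, specific)
print(index)  # 0.75
```

In this toy report, three of the four commitments are vague, so the index is 0.75.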


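## Setup

First, we need to set up the environment and install the required Python packages. This will take a few minutes.

Because Google Colab runs in a temporary environment, every package other than a few common preinstalled libraries (such as pandas, numpy and torch) has to be reinstalled in each session.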

In [None]:
!pip install transformers
!pip install datasets
!pip install tqdm
!pip install scikit-learn



In [None]:
# This only makes text wrap automatically in Google Colab's output display.

from IPython.display import HTML, display

def set_css():
  display(HTML('''
  <style>
    pre {
        white-space: pre-wrap;
    }
  </style>
  '''))
get_ipython().events.register('pre_run_cell', set_css)

Now we can import the necessary libraries.

In [None]:
from transformers import AutoModelForSequenceClassification, AutoTokenizer, pipeline
from transformers.pipelines.pt_utils import KeyDataset
import datasets
from tqdm.auto import tqdm
import torch

Next, we will fetch the model and the dataset directly from HuggingFace. You can read the model card [(1)](https://huggingface.co/climatebert/distilroberta-base-climate-commitment) and the dataset card [(2)](https://huggingface.co/datasets/climatebert/climate_commitments_actions). We assign their repository names to `dataset_name` and `model_name` for later use.

In [None]:
dataset_name = "climatebert/climate_commitments_actions"
model_name = "climatebert/distilroberta-base-climate-commitment"

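You can load any dataset from HuggingFace with the `datasets` library. Just run the code below.

**⚠️Note:** In Python, to call a function from an imported library you write the library name, then a `.`, then the function name (as in the code below). You can give a library a shorter alias when importing it, for example `import datasets as ds`. If we had done that, the code below would become `ds.load_dataset()`.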

In [None]:
dataset = datasets.load_dataset(dataset_name)


Let's print the dataset to look at its structure...

In [None]:
print(dataset)

DatasetDict({
    train: Dataset({
        features: ['text', 'label'],
        num_rows: 1000
    })
    test: Dataset({
        features: ['text', 'label'],
        num_rows: 320
    })
})


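This is a dataset dictionary. It contains two datasets, one for training and one for testing. The `train` split has 1,000 observations and the `test` split has 320. Because we do not intend to train a model on these data in this seminar, we will only need the test split when we run inference. To get a first look at the data, the code below prints the first 3 rows of the `train` split.


**⚠️Note:** Indexing works differently in Python than in R. The first element of an array is accessed with `0`, not `1`, so `dataset['train'][0]` returns the first element. A range such as `0:3` excludes its upper bound, so it selects the elements at indices 0, 1 and 2.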

In [None]:
dataset['train'][0:3]
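
The indexing rules in the note above can be tried on a plain Python list first. This is a minimal illustration with made-up values; the list `rows` is hypothetical and stands in for the rows of a dataset.

```python
rows = ["first", "second", "third", "fourth"]

first_row = rows[0]      # index 0 is the first element
first_three = rows[0:3]  # indices 0, 1 and 2; index 3 is excluded

print(first_row)
print(first_three)
```

A HuggingFace `Dataset` follows the same rules, but a slice such as `dataset['train'][0:3]` comes back as a dictionary of columns (`'text'` and `'label'`), each holding a list of three values.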

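Next, let's load the **model** and the **tokenizer**. A very convenient feature of HuggingFace is that each model's tokenizer is available from the same location as the model. This means we do not have to worry about matching the text input to what the model expects.

To get the model we use `AutoModelForSequenceClassification.from_pretrained()`, because this model classifies text sequences. Other models may be built for tasks such as token classification, question answering or text summarisation.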

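Note that we pass `model_max_length=512` to `AutoTokenizer.from_pretrained()`. This is the maximum number of tokens a RoBERTa model accepts.

When you run the code below, it downloads the model and the tokenizer from HuggingFace. Later calls load them from the local cache without downloading them again.

In [None]:
model = AutoModelForSequenceClassification.from_pretrained(model_name)
tokenizer = AutoTokenizer.from_pretrained(model_name, model_max_length=512)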


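## Exploring the model

1. **Use the code below to inspect the structure of the model. How many Transformer layers does it have? How long is the hidden representation of each token?**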

In [None]:
print("Hidden Size:", model.config.hidden_size)
print("Number of Attention Heads:", model.config.num_attention_heads)
print("Number of Layers:", model.config.num_hidden_layers)

Hidden Size: 768
Number of Attention Heads: 12
Number of Layers: 6


In [None]:
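# Put the model in evaluation mode (this disables dropout). In a notebook the
# returned module is also displayed, which shows the full architecture.
model.eval()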


RobertaForSequenceClassification(
  (roberta): RobertaModel(
    (embeddings): RobertaEmbeddings(
      (word_embeddings): Embedding(50500, 768, padding_idx=1)
      (position_embeddings): Embedding(514, 768, padding_idx=1)
      (token_type_embeddings): Embedding(1, 768)
      (LayerNorm): LayerNorm((768,), eps=1e-05, elementwise_affine=True)
      (dropout): Dropout(p=0.1, inplace=False)
    )
    (encoder): RobertaEncoder(
      (layer): ModuleList(
        (0-5): 6 x RobertaLayer(
          (attention): RobertaAttention(
            (self): RobertaSdpaSelfAttention(
              (query): Linear(in_features=768, out_features=768, bias=True)
              (key): Linear(in_features=768, out_features=768, bias=True)
              (value): Linear(in_features=768, out_features=768, bias=True)
              (dropout): Dropout(p=0.1, inplace=False)
            )
            (output): RobertaSelfOutput(
              (dense): Linear(in_features=768, out_features=768, bias=True)
           

2. **Test the tokenizer by putting your own sentence inside the quotation marks. How does it handle long or rare words?**


In [None]:
tokenizer.tokenize("Supercalifragilisticexpialidocious")

['Super', 'cal', 'if', 'rag', 'il', 'ist', 'ice', 'xp', 'ial', 'id', 'ocious']

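## Inference

3. **Perform inference!**

Now let's use the model for inference. We start with a single sentence. Write your own example sentence inside the quotation marks.

Try to get the model to predict 'yes' (the statement commits to climate action), then try another sentence that makes it predict 'no' (the statement does not commit to climate action).

**⚠️Note:** The first number in the printed output is the probability of 'no'; the second is the probability of 'yes'.
We can check how IDs map to labels by running `model.config.id2label`.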

In [None]:
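synthetic_example = "I will keep using fossil fuels until the planet has burned"

# Tokenize the sentence and return PyTorch tensors
inputs = tokenizer(synthetic_example, truncation=True, return_tensors="pt")

# Forward pass without gradient tracking; the attention mask is passed along with the input IDs
with torch.no_grad():
    logits = model(**inputs).logits[0]

# Convert the two logits into probabilities
prediction = torch.softmax(logits, -1).tolist()

print(model.config.id2label)
print(prediction)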


{0: 'no', 1: 'yes'}
[0.12263745069503784, 0.8773625493049622]
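
The step from logits to the two printed probabilities is a softmax. The sketch below reimplements it in plain Python to show the arithmetic. The logits are made-up values, not real model output, and the helper `softmax` is written here only for illustration.

```python
import math


def softmax(logits):
    """Turn raw scores into probabilities that sum to 1."""
    # Subtracting the maximum keeps exp() numerically stable
    shifted = [x - max(logits) for x in logits]
    exps = [math.exp(x) for x in shifted]
    total = sum(exps)
    return [e / total for e in exps]


id2label = {0: "no", 1: "yes"}  # same mapping as model.config.id2label
example_logits = [-1.0, 1.0]    # made-up values, not real model output

probs = softmax(example_logits)
predicted = id2label[probs.index(max(probs))]
print(probs, predicted)
```

The position of the largest probability, looked up in `id2label`, gives the predicted class.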


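To run inference at a larger scale, `pipeline` is very helpful. It bundles tokenization and inference, and when applied to a dataset it yields one prediction per text.

The `device` argument specifies the hardware used for inference. Setting `device=0` uses the first GPU. If you are running on a local machine (rather than Google Colab) that has no GPU, you may need to set `device='cpu'` to use the CPU instead.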

In [None]:
pipe = pipeline("text-classification", model=model, tokenizer=tokenizer, device=0)

# Create a list to store the results
predicted_labels = [] # Predicted category
predicted_score_label = [] # Probability of predicted category

for out in tqdm(pipe(KeyDataset(dataset['test'], 'text'), padding=True, truncation=True)):
    predicted_labels.append(out['label'])
    predicted_score_label.append(out['score'])

Device set to use cpu



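4. **Face validity check**

Now let's do a face validity check. We sort the samples from most to least likely to be classified as positive (that is, as a climate commitment statement).

Check whether the highest-scoring sample really does read like a climate commitment, and whether the lowest-scoring sample does not.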

In [None]:
import pandas as pd
import numpy as np

# Convert the test dataset to a pandas DataFrame
test_data = pd.DataFrame(dataset['test'])

# Add the inference results
test_data['predicted_score_label'] = predicted_score_label
test_data['predicted_labels'] = predicted_labels

# Create a new variable holding the probability that the sentence is 'yes'
test_data['probability_yes'] = np.where(np.array(predicted_labels) == "yes", np.array(predicted_score_label), 1 - np.array(predicted_score_label))

# Sort the DataFrame by that probability, highest first
df_sorted = test_data.sort_values(by='probability_yes', ascending=False)

# Print the highest- and lowest-scoring sentences.
# .iloc selects by position in the sorted frame; df_sorted['text'][0] would
# instead select by index label, i.e. the first row of the unsorted test set.
print("### Top sentence ###")
print(df_sorted['text'].iloc[0])

print("\n\n ### Lowest sentence ###")
print(df_sorted['text'].iloc[-1])

### Top sentence ###
Sustainable strategy ‘red lines’ For our sustainable strategy range, we incorporate a series of proprietary ‘red lines’ in order to ensure the poorest- performing companies from an ESG perspective are not eligible for investment.


 ### Lowest sentence ###
Climate change is producing changes in weather and other environmental conditions, including temperature and precipitation levels, and thus may affect consumer demand for electricity. In addition, the potential physical effects of climate change, such as increased frequency and severity of storms, floods and other climatic events, could disrupt NRG's operations and supply chain, and cause them to incur significant costs in preparing for or responding to these effects. These or other meteorological changes could lead to increased operating costs, capital expenses or power purchase costs. NRG's commercial and residential customers may also experience the potential physical impacts of climate change and may incur si
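
The ranking behind this check can be sketched without pandas. The pipeline reports only the predicted label and that label's score, so the score of a 'no' prediction has to be flipped before sorting. The `outputs` list is hypothetical, and `probability_yes` is a helper named here for illustration.

```python
# Hypothetical pipeline outputs: (predicted label, score of that label)
outputs = [("yes", 0.91), ("no", 0.80), ("yes", 0.55), ("no", 0.98)]


def probability_yes(label, score):
    """Score of the 'yes' class in a two-class problem."""
    return score if label == "yes" else 1 - score


p_yes = [probability_yes(label, score) for label, score in outputs]

# Positions of the texts, ordered from most to least likely 'yes'
ranking = sorted(range(len(p_yes)), key=lambda i: p_yes[i], reverse=True)

top_position, bottom_position = ranking[0], ranking[-1]
print(ranking)  # [0, 2, 1, 3]
```

The top and bottom sentences are the first and last entries of this ranking. In general they are not the rows that carry the original index labels 0 and 319.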

5. **Evaluate with performance metrics**

In [None]:
from sklearn.metrics import accuracy_score, precision_score, recall_score, f1_score, confusion_matrix

# Extract the relevant columns
y_true = test_data["label"]  # True labels
y_pred = np.where(test_data["predicted_labels"]=='yes', 1, 0)  # Predicted labels
y_prob = test_data["probability_yes"]  # Probability of label being 1

# Compute the evaluation metrics
accuracy = accuracy_score(y_true, y_pred)
precision = precision_score(y_true, y_pred)
recall = recall_score(y_true, y_pred)
f1 = f1_score(y_true, y_pred)
conf_matrix = confusion_matrix(y_true, y_pred)

# Print results
print(f"Accuracy: {accuracy:.4f}")
print(f"Precision: {precision:.4f}")
print(f"Recall: {recall:.4f}")
print(f"F1 Score: {f1:.4f}")

print("\nConfusion Matrix:")
print(conf_matrix)

Accuracy: 0.8125
Precision: 0.6397
Recall: 0.8878
F1 Score: 0.7436

Confusion Matrix:
[[173  49]
 [ 11  87]]
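
All four metrics can be recomputed by hand from the confusion matrix above, which makes their definitions concrete. The sketch below assumes scikit-learn's layout, where rows are true labels and columns are predicted labels, both in the order `[0, 1]`.

```python
# Confusion matrix copied from the output above
# (rows: true label 0/1, columns: predicted label 0/1)
conf_matrix = [[173, 49],
               [11, 87]]
(tn, fp), (fn, tp) = conf_matrix

accuracy = (tp + tn) / (tp + tn + fp + fn)   # share of all predictions that are right
precision = tp / (tp + fp)                   # share of predicted 'yes' that are truly 'yes'
recall = tp / (tp + fn)                      # share of true 'yes' that were found
f1 = 2 * precision * recall / (precision + recall)

print(f"{accuracy:.4f} {precision:.4f} {recall:.4f} {f1:.4f}")
# 0.8125 0.6397 0.8878 0.7436
```

Recall is high but precision is lower: the model finds most true commitments (87 of 98), but 49 of its 136 'yes' predictions are false positives.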


# **Part 2: Natural Language Inference for Zero-shot Classification**

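The model in Part 1 was trained in advance by the authors of the study. Training their classifier ourselves, starting from RoBERTa, would take far longer than this seminar allows.

In this part we introduce BERT-NLI models, which offer a shortcut to classification with Transformers. The trick is that the model has already been trained on a general task: deciding whether a premise entails a hypothesis. What the model learned on this general task can be transferred to new tasks, even when training data is very limited.

Here we will try the model **without any additional fine-tuning**. In other words, we will perform zero-shot classification.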

## Setup

Load the model and the tokenizer in the same way as in Part 1.

In [None]:
model_name = "MoritzLaurer/DeBERTa-v3-base-mnli-fever-docnli-ling-2c"

model = AutoModelForSequenceClassification.from_pretrained(model_name)
tokenizer = AutoTokenizer.from_pretrained(model_name, use_fast=True, model_max_length=512)




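1. **Try synthetic examples**

Before loading any data, experiment with a few sentences of your own to see how the model performs without any task-specific fine-tuning. The code below prints, for each hypothesis, the probability that it is **entailed** by the premise, followed by the probability that it is not.

**⚠️Note:** *Natural language inference (NLI) can be framed as a three-class problem (entailment, contradiction, neutral) or as a two-class problem (entailment vs. not-entailment). This model uses the two-class framing.*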

In [None]:
# This model can be used for zero-shot natural language inference

premise = "Eggs make me feel ill"

hypothesis1 = "I don't like ommelettes"
hypothesis2 = "I love ommelettes"

input1 = tokenizer(premise, hypothesis1, truncation=True, return_tensors="pt")
output1 = model(input1["input_ids"].to("cpu"))
prob_entail1 = torch.softmax(output1["logits"][0], -1).tolist()

input2 = tokenizer(premise, hypothesis2, truncation=True, return_tensors="pt")
output2 = model(input2["input_ids"].to("cpu"))
prob_entail2 = torch.softmax(output2["logits"][0], -1).tolist()

print(model.config.id2label)
print("Hypothesis 1:", prob_entail1)
print("Hypothesis 2:", prob_entail2)

{0: 'entailment', 1: 'not_entailment'}
Hypothesis 1: [0.5714797377586365, 0.4285202920436859]
Hypothesis 2: [0.0018297064816579223, 0.9981702566146851]


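2. **Load a dataset of tweets about abortion**

Now we will apply the model to some real data. Run the code cell below to download a dataset of tweets about abortion from HuggingFace.

Use the cells that follow to get to know the structure of the dataset.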

In [None]:
dataset_name = "SetFit/tweet_eval_stance_abortion"

dataset = datasets.load_dataset(dataset_name)


In [None]:
print(dataset)

dataset['train'][2]

DatasetDict({
    train: Dataset({
        features: ['text', 'label', 'label_text'],
        num_rows: 587
    })
    test: Dataset({
        features: ['text', 'label', 'label_text'],
        num_rows: 346
    })
})


{'text': 'Life is #precious & so are babies, mothers, & fathers. Please support the sanctity of Human Life. Think #SemST',
 'label': 1,
 'label_text': 'against'}

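For zero-shot classification we will use only a balanced sample of tweets from the test set. The code below randomly samples 30 tweets from each class (`favor`, `against`, `none`) and combines them into a single pandas DataFrame (`df_inference`).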

In [None]:
import pandas as pd

# Extract the test set
df_test = pd.DataFrame(dataset['test'])

# Set a random seed so the results are reproducible
random_seed = 1234

# Randomly sample 30 'favor', 30 'none' and 30 'against' tweets from the test set
df_test_pro = df_test[df_test['label_text'] == 'favor'].sample(n=30, random_state=random_seed)
df_test_neutral = df_test[df_test['label_text'] == 'none'].sample(n=30, random_state=random_seed)
df_test_anti = df_test[df_test['label_text'] == 'against'].sample(n=30, random_state=random_seed)

# Combine them into one dataset of 90 tweets and shuffle the rows
df_inference = pd.concat([df_test_pro, df_test_neutral, df_test_anti], ignore_index=True).sample(n=90, random_state=random_seed)


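## Inference

3. **Define our hypotheses**

We need to define one hypothesis for each class into which we want to sort the tweets. Below is one way to do it. Here I have written one hypothesis for each stance (`favor`, `against`, `none`).

I have also wrapped each tweet in 'The tweet: "' and '" - end of the tweet.', so that it is clearer that the hypothesis refers to the tweet as a whole.

Note, however, that there is no single best way to set this up. An alternative is to try several wordings of the hypotheses, compare how the model performs with each, and treat the wording as a kind of hyperparameter to tune.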

In [None]:
# Define the hypotheses
hypothesis_label_dic_inference = {
    "favor": "The tweet is positive towards abortion rights",
    "against": "The tweet is negative towards abortion rights",
    "none": "The tweet is not about abortion rights"
}
hypothesis_lst = list(hypothesis_label_dic_inference.values())

# Wrap the original tweet text in the framing phrases
df_inference["text_prepared"] = 'The tweet: "' + df_inference.text.fillna("") + '" - end of the tweet.'

# Convert the prepared texts to a list
text_lst = df_inference["text_prepared"].tolist()

4. **Perform inference!**

Now we can run inference as in Part 1. This will take a few minutes.

In [None]:
# Documentation: https://huggingface.co/docs/transformers/main_classes/pipelines#transformers.ZeroShotClassificationPipeline
classifier = pipeline(
    "zero-shot-classification",  # use the zero-shot classification pipeline
    model=model,
    tokenizer=tokenizer,
    framework="pt",  # use PyTorch
    device=0  # first GPU; -1 means CPU
)

# Run inference
pipe_output = classifier(
    text_lst,  # any list of texts can go here
    candidate_labels=hypothesis_lst,  # the candidate hypotheses
    hypothesis_template="{}",  # template used to format each hypothesis
    multi_label=False,  # whether several hypotheses may be true at once; False means exactly one
    batch_size=8  # batch size
)

# Extract the predictions from the pipeline output
hypothesis_pred_true_probability = []  # highest score for each text
hypothesis_pred_true = []  # predicted hypothesis for each text
for dic in pipe_output:
    hypothesis_pred_true_probability.append(dic["scores"][0])
    hypothesis_pred_true.append(dic["labels"][0])

# Map the long hypotheses back to the original short label names
hypothesis_label_dic_inference_inverted = {value: key for key, value in hypothesis_label_dic_inference.items()}
label_pred = [hypothesis_label_dic_inference_inverted[hypo] for hypo in hypothesis_pred_true]

# Add the inference results to the original data
df_inference["label_text_pred"] = label_pred  # predicted label
df_inference["label_text_pred_prob_label"] = hypothesis_pred_true_probability  # score of the predicted label

Device set to use cpu
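
It helps to see how one score per hypothesis becomes a single prediction. To the best of our understanding, with `multi_label=False` the pipeline takes the entailment logit of each candidate hypothesis and applies a softmax across the candidates, so the scores sum to 1 and the first entry of `labels` is the winner. The sketch below mimics that step in plain Python. The logits are invented for illustration and are not output from the model.

```python
import math

# Hypothetical entailment logits for one tweet, one per candidate hypothesis
entailment_logits = {
    "favor": -0.5,
    "against": 1.2,
    "none": 0.3,
}

# Softmax across the candidate labels, so the scores sum to 1
exps = {label: math.exp(logit) for label, logit in entailment_logits.items()}
total = sum(exps.values())
scores = {label: value / total for label, value in exps.items()}

# Sort from highest to lowest score, mirroring the order of "labels" and "scores"
ranked = sorted(scores.items(), key=lambda item: item[1], reverse=True)
best_label, best_score = ranked[0]
print(best_label, round(best_score, 3))
```

Because the scores are normalised across the candidates, a high score only means that a hypothesis fits better than the alternatives offered, not that it fits well in absolute terms.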


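5. **Evaluate the model!**

Now evaluate the model's performance with code similar to that in Part 1.

How would you rate the model's performance?
What accuracy would you expect if the classification were completely random?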

In [None]:
from sklearn.metrics import accuracy_score, precision_score, recall_score, f1_score, confusion_matrix

# Extract the true and predicted label columns
y_true = df_inference["label_text"]  # true labels
y_pred = df_inference["label_text_pred"]  # labels predicted by the model

# Fix the order of the classes
class_order = ["against", "none", "favor"]

# Compute the evaluation metrics
accuracy = accuracy_score(y_true, y_pred)  # accuracy
precision = precision_score(y_true, y_pred, average=None, labels=class_order)  # precision per class
recall = recall_score(y_true, y_pred, average=None, labels=class_order)  # recall per class
f1 = f1_score(y_true, y_pred, average=None, labels=class_order)  # F1 score per class
conf_matrix = confusion_matrix(y_true, y_pred, labels=class_order)  # confusion matrix

# Print results
print("Class order:", class_order)
print(f"Accuracy: {accuracy:.4f}")
print("Precision:", precision)
print("Recall:", recall)
print("F1 Score:", f1)

print("\nConfusion Matrix:")
print(conf_matrix)

Class order: ['against', 'none', 'favor']
Accuracy: 0.4444
Precision: [0.41666667 0.47368421 0.33333333]
Recall: [0.33333333 0.9        0.1       ]
F1 Score: [0.37037037 0.62068966 0.15384615]

Confusion Matrix:
[[10 16  4]
 [ 1 27  2]
 [13 14  3]]
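
The per-class numbers above can be read directly off the confusion matrix, which also gives a baseline for comparison. The sketch below recomputes them in plain Python, assuming rows are true classes and columns are predicted classes in the order of `class_order`.

```python
class_order = ["against", "none", "favor"]
# Rows are true classes, columns are predicted classes (copied from the output above)
conf_matrix = [[10, 16, 4],
               [1, 27, 2],
               [13, 14, 3]]

n = len(class_order)
total = sum(sum(row) for row in conf_matrix)
accuracy = sum(conf_matrix[i][i] for i in range(n)) / total

per_class = {}
for i, name in enumerate(class_order):
    predicted_as_i = sum(conf_matrix[r][i] for r in range(n))  # column sum
    truly_i = sum(conf_matrix[i])                              # row sum
    per_class[name] = {
        "precision": conf_matrix[i][i] / predicted_as_i,
        "recall": conf_matrix[i][i] / truly_i,
    }

# With three equally sized classes, uniform random guessing is right 1/3 of the time
chance_accuracy = 1 / n
print(round(accuracy, 4), round(chance_accuracy, 4))
```

The middle column shows where the zero-shot model goes wrong: 57 of the 90 tweets are predicted as `none`, which explains the high recall for `none` and the low recall for `favor`.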


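# **Homework**

Your task is to run inference with any Transformer model on any text data, except the data used in this demonstration.

You can pick a pre-trained classifier and run inference on its accompanying dataset, or you can use BERT-NLI for zero-shot classification on text data from your own project. Whichever model and dataset you choose, you only need to run inference on a small sample (about 10 texts).

No training or fine-tuning is required (although you are welcome to do some if you wish).

✅ Some practical tips:

* You can search for models and datasets on HuggingFace. Each model or dataset has a "card" that describes its background, purpose and usage (some in more detail than others).

* To load a model or dataset you only need its name, for example `MoritzLaurer/DeBERTa-v3-base-mnli-fever-docnli-ling-2c`. The name is shown on the model's or dataset's HuggingFace page. After that you can largely reuse the code from this session to run inference.

📥 If you want to use data from your own project:

You need to import the data into Python (usually from a .csv file). If you are working in Google Colab, you can do so with the following code: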

In [None]:
from google.colab import files
uploaded = files.upload()

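Then select the file you want to upload. Once the upload has finished, run the following code (replacing `file_name` with the name of your file):

In [None]:
import pandas as pd
import io
df = pd.read_csv(io.StringIO(uploaded['file_name.csv'].decode('utf-8')))

After this, `df` is your data frame. You can run tokenization, inference or other analyses on it.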

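# **Optional material: Fine-tuning a Natural Language Inference Model for Classification**

In Part 2 we used a BERT-NLI model for inference. That model was trained on the general task of natural language inference. Although it had never been trained on our specific task (identifying stance on abortion), it still beat the random baseline, but its overall performance was poor. Now let's see whether we can improve on this by **fine-tuning** it on a small number of labelled examples.

## 🛠️ Setup

1. **Sample the training data**

For this we need some labelled examples from the 'train' dataset. To keep the task manageable, the code first draws:
* 30 `favor` examples
* 30 `against` examples
* 30 `none` examples

It then subsamples 30 tweets in total from this pool of 90, so the classes in the final training sample are not perfectly balanced. This is very little training data, even for BERT-NLI, but it lets us finish training quickly, which suits the purpose of this seminar.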

In [None]:
import pandas as pd

df_train = pd.DataFrame(dataset['train'])

df_train_pro = df_train[df_train['label_text'] == 'favor'].sample(n=30, random_state=random_seed)
df_train_neutral = df_train[df_train['label_text'] == 'none'].sample(n=30, random_state=random_seed)
df_train_anti = df_train[df_train['label_text'] == 'against'].sample(n=30, random_state=random_seed)

df_train_s = pd.concat([df_train_pro, df_train_neutral, df_train_anti], ignore_index=True).sample(n=30, random_state=random_seed)

df_train_s["text_prepared"] = 'The tweet: "' + df_train_s.text.fillna("") + '" - end of the tweet.'

df_train_s

Unnamed: 0,text,label,label_text,text_prepared
36,"@user ""If you don't draw the line where I've a...",0,none,"The tweet: ""@user ""If you don't draw the line ..."
59,Anti-vaxxers are such an idiotic bunch... Seri...,0,none,"The tweet: ""Anti-vaxxers are such an idiotic b..."
65,"'Amen, I say to you, whatever you did for one ...",1,against,"The tweet: ""'Amen, I say to you, whatever you ..."
45,Spent the WHOLE day on my highlight film from ...,0,none,"The tweet: ""Spent the WHOLE day on my highligh..."
60,RT @user Children are the greatest blessing wh...,1,against,"The tweet: ""RT @user Children are the greatest..."
84,It's time to end the #deathpenalty in the Unit...,1,against,"The tweet: ""It's time to end the #deathpenalty..."
72,"@user Precisely! In God's eyes, ALL life is pr...",1,against,"The tweet: ""@user Precisely! In God's eyes, AL..."
68,@user Any pregnancy can turn deadly at any tim...,1,against,"The tweet: ""@user Any pregnancy can turn deadl..."
9,People who have never had an abortion can be p...,2,favor,"The tweet: ""People who have never had an abort..."
74,Since #roevwade 1/3 of my #generation has been...,1,against,"The tweet: ""Since #roevwade 1/3 of my #generat..."


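2. **Reformat training data**

Next, we need to bring the dataset into the format of an NLI classification task. This involves:
* creating a new column that holds the hypothesis of the correct class, labelled 'true';
* adding false examples.

Remember that the core of the NLI task is for the model to judge whether a hypothesis is true or false given a context. If we only supplied correctly matched context-hypothesis pairs, the algorithm could not learn the 'false' class.


The code below (adapted from [this demo](https://colab.research.google.com/github/MoritzLaurer/less-annotating-with-bert-nli/blob/master/BERT_NLI_demo.ipynb)) carries out both steps. To generate false examples, it pairs each hypothesis with a random sample of texts from the other classes and labels these pairs 'false' (numeric label 1 in NLI). After this step the training data is at most twice its original size.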

In [None]:

hypothesis_label_dic = {
    "favor": "The tweet is positive towards abortion rights",
    "against": "The tweet is negative towards abortion rights",
    "none": "The tweet is not about abortion rights"
}

## Function for reformatting the training set
def format_nli_trainset(df_train=None, hypo_label_dic=None, random_seed=42):
  print(f"Length of df_train before formatting step: {len(df_train)}.")
  length_original_data_train = len(df_train)

  df_train_lst = []
  for label_text, hypothesis in hypo_label_dic.items():
    ## entailment: texts of this class paired with the hypothesis of this class
    df_train_step = df_train[df_train.label_text == label_text].copy(deep=True)
    df_train_step["hypothesis"] = [hypothesis] * len(df_train_step)
    df_train_step["label"] = [0] * len(df_train_step)

    ## not_entailment: texts of the other classes paired with the same hypothesis
    df_train_step_not_entail = df_train[df_train.label_text != label_text].copy(deep=True)
    df_train_step_not_entail = df_train_step_not_entail.sample(n=min(len(df_train_step), len(df_train_step_not_entail)), random_state=random_seed)
    df_train_step_not_entail["hypothesis"] = [hypothesis] * len(df_train_step_not_entail)
    df_train_step_not_entail["label"] = [1] * len(df_train_step_not_entail)

    # Combine the entail and not-entail examples for this class
    df_train_lst.append(pd.concat([df_train_step, df_train_step_not_entail]))

  # Combine the data of all classes
  df_train = pd.concat(df_train_lst)

  # Shuffle the rows
  df_train = df_train.sample(frac=1, random_state=random_seed)
  df_train["label"] = df_train.label.apply(int)

  # Add a more readable label column for inspection
  df_train["label_nli_explicit"] = ["True" if label == 0 else "Not-True" for label in df_train["label"]]  # added only to improve readability

  print(f"After adding not_entailment training examples, the training data was augmented to {len(df_train)} texts.")
  print(f"Max augmentation could be: len(df_train) * 2 = {length_original_data_train*2}. It can also be lower, if there are more entail examples than not-entail for a majority class.")

  return df_train.copy(deep=True)

# Call the function to format the training set
df_train_formatted = format_nli_trainset(df_train=df_train_s, hypo_label_dic=hypothesis_label_dic, random_seed=1234)

Length of df_train before formatting step: 30.
After adding not_entailment training examples, the training data was augmented to 60 texts.
Max augmentation could be: len(df_train) * 2 = 60. It can also be lower, if there are more entail examples than not-entail for a majority class.


In [None]:
df_train_formatted

Unnamed: 0,text,label,label_text,text_prepared,hypothesis,label_nli_explicit
27,They know what's best for their health They kn...,1,favor,"The tweet: ""They know what's best for their he...",The tweet is negative towards abortion rights,Not-True
39,So ready for my abortion debate #SemST,0,none,"The tweet: ""So ready for my abortion debate #...",The tweet is not about abortion rights,True
77,"Also, abortion is wrong biblically and morally...",1,against,"The tweet: ""Also, abortion is wrong biblically...",The tweet is positive towards abortion rights,Not-True
64,If your agonist abortion get a vasectomy #SemST,0,against,"The tweet: ""If your agonist abortion get a vas...",The tweet is negative towards abortion rights,True
18,@user @user Thank goodness politicians saw sen...,0,favor,"The tweet: ""@user @user Thank goodness politic...",The tweet is positive towards abortion rights,True
9,People who have never had an abortion can be p...,1,favor,"The tweet: ""People who have never had an abort...",The tweet is not about abortion rights,Not-True
59,Anti-vaxxers are such an idiotic bunch... Seri...,0,none,"The tweet: ""Anti-vaxxers are such an idiotic b...",The tweet is not about abortion rights,True
84,It's time to end the #deathpenalty in the Unit...,1,against,"The tweet: ""It's time to end the #deathpenalty...",The tweet is positive towards abortion rights,Not-True
68,@user Any pregnancy can turn deadly at any tim...,0,against,"The tweet: ""@user Any pregnancy can turn deadl...",The tweet is negative towards abortion rights,True
54,Bigger problems 4 #Christians than same sex #m...,1,none,"The tweet: ""Bigger problems 4 #Christians than...",The tweet is negative towards abortion rights,Not-True


3. **Preprocess the data**



In [None]:
# Convert the pandas DataFrame to a Hugging Face Dataset object for preprocessing
dataset = datasets.DatasetDict({
    "train": datasets.Dataset.from_pandas(df_train_formatted)
})

# Tokenization function for data in NLI format
def tokenize_nli_format(examples):
  return tokenizer(
      examples["text_prepared"],  # premise: the prepared tweet text
      examples["hypothesis"],     # hypothesis sentence
      truncation=True,            # truncate texts that are too long
      max_length=512              # at most 512 tokens
      )

# Tokenize the training set in batches
dataset["train"] = dataset["train"].map(tokenize_nli_format, batched=True)


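## Training

4. **Set the training arguments**

This step sets a few key training parameters:

* Learning rate: as discussed in the lecture, the learning rate determines how much the model parameters change at each update after backpropagation.

* Warm-up ratio: determines how the learning rate ramps up at the start of training. A warm-up ratio of 0.25 means that the learning rate increases linearly from 0 to its target value over the first 25% of the training steps.

When fine-tuning a BERT classifier directly from a base model, this value is usually set low (e.g. 0.06). Here, however, we want to reuse as much of the BERT-NLI model's existing knowledge as possible and avoid "catastrophic forgetting", so we set a higher warm-up ratio.

* Epochs: the number of complete passes over the training data. This is typically between 3 and 20, and performance should be monitored on a validation set during training to guard against overfitting.

In this demonstration, however, we set it to 1 so that the code runs quickly. This inevitably means the model will not fully converge, but we should still see some improvement in performance.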
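
The effect of the warm-up ratio is easiest to see by printing the learning rate at each step. The sketch below assumes linear warm-up followed by linear decay, which is the `Trainer` default as far as we know. It also assumes the values used in this section: 60 training pairs, a batch size of 16, a learning rate of 2e-5 and a warm-up ratio of 0.25. The function `linear_schedule_lr` is written here for illustration and is not a library function.

```python
import math


def linear_schedule_lr(step, total_steps, base_lr, warmup_ratio):
    """Learning rate at a given optimizer step: linear warm-up, then linear decay."""
    warmup_steps = math.ceil(total_steps * warmup_ratio)
    if step < warmup_steps:
        # Warm-up phase: grow linearly from 0 towards base_lr
        return base_lr * step / max(1, warmup_steps)
    # Decay phase: shrink linearly from base_lr to 0 at the final step
    return base_lr * max(0.0, (total_steps - step) / max(1, total_steps - warmup_steps))


# 60 training pairs with a batch size of 16 give ceil(60 / 16) = 4 steps in one epoch
total_steps = math.ceil(60 / 16)
schedule = [linear_schedule_lr(s, total_steps, 2e-5, 0.25) for s in range(total_steps + 1)]
print(schedule)
```

With only four steps, a ratio of 0.25 amounts to a single warm-up step, so under this schedule the first update is made with a learning rate of 0. On longer runs the ramp is spread over many more steps.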

In [None]:
from transformers import TrainingArguments, Trainer

# Training arguments
train_args = TrainingArguments(
    output_dir="/tmp/trainer",
    logging_dir="/tmp/logs",
    learning_rate=2e-5,
    per_device_train_batch_size=16,
    per_device_eval_batch_size=80,
    num_train_epochs=1,
    warmup_ratio=0.25,
    weight_decay=0.1,
    load_best_model_at_end=True,
    seed=1234,
    eval_strategy="no",
    save_strategy = "no",
    report_to="none"
)

5. **Train the model**

In [None]:
trainer = Trainer(
    model=model,
    tokenizer=tokenizer,
    args=train_args,
    train_dataset=dataset['train']
)

# Train
trainer.train()


TrainOutput(global_step=4, training_loss=1.3632081747055054, metrics={'train_runtime': 61.1363, 'train_samples_per_second': 0.981, 'train_steps_per_second': 0.065, 'total_flos': 1924034097024.0, 'train_loss': 1.3632081747055054, 'epoch': 1.0})

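## Inference

6. **Run inference on the same test sample**

Now let's see how the fine-tuned model compares with the zero-shot classifier.

First, we need to define a new inference pipeline. We can rerun the earlier code and pass the `model` object again. The code looks the same, but `trainer.train()` updated the weights of that `model` object in place, so it now holds the fine-tuned model. (`load_best_model_at_end=True` has no effect in this run, because evaluation and checkpoint saving are both disabled.)

To avoid overwriting the zero-shot predictions, we store the output of this run as `pipe_output2`. The code cell after that adds the new predictions to `df_inference` as new columns.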

In [None]:
# Documentation: https://huggingface.co/docs/transformers/main_classes/pipelines#transformers.ZeroShotClassificationPipeline
classifier_finetuned = pipeline(
    "zero-shot-classification",  # use the zero-shot classification pipeline
    model=model,  # the same model object, whose weights were updated in place by trainer.train()
    tokenizer=tokenizer,
    framework="pt",  # use PyTorch
    device=0,  # first GPU; -1 means CPU
)

# Run inference with the fine-tuned model
pipe_output2 = classifier_finetuned(
    text_lst,  # any list of texts can go here
    candidate_labels=hypothesis_lst,  # the candidate hypotheses
    hypothesis_template="{}",  # template used to format each hypothesis
    multi_label=False,  # whether several labels may be true at once; False means exactly one per text
    batch_size=8  # batch size
)

Device set to use cpu


In [None]:
# Extract the predictions from pipe_output2
hypothesis_pred_true_probability2 = []  # score of the predicted label for each text
hypothesis_pred_true2 = []  # predicted label for each text (long hypothesis form)
for dic in pipe_output2:
    hypothesis_pred_true_probability2.append(dic["scores"][0])
    hypothesis_pred_true2.append(dic["labels"][0])

# Map the long hypotheses back to the short label names
hypothesis_label_dic_inference_inverted = {
    value: key for key, value in hypothesis_label_dic_inference.items()
}
label_pred2 = [hypothesis_label_dic_inference_inverted[hypo] for hypo in hypothesis_pred_true2]

# Add the inference results to the original data
df_inference["label_text_pred2"] = label_pred2  # label predicted by the fine-tuned model
df_inference["label_text_pred_prob_label2"] = hypothesis_pred_true_probability2  # score of that label

# Display the resulting DataFrame
df_inference

Unnamed: 0,text,label,label_text,text_prepared,label_text_pred,label_text_pred_prob_label,label_text_pred2,label_text_pred_prob_label2
36,"@user @user @user @user @user Yes, an occasion...",0,none,"The tweet: ""@user @user @user @user @user Yes,...",none,0.979184,against,0.503527
59,@user I know. God won't be mocked tho. So I bl...,0,none,"The tweet: ""@user I know. God won't be mocked ...",none,0.988884,none,0.922036
65,If being a mother's womb isn't safe I guess ne...,1,against,"The tweet: ""If being a mother's womb isn't saf...",against,0.933024,against,0.936690
45,"How many location-efficient, #affordablehousin...",0,none,"The tweet: ""How many location-efficient, #affo...",none,0.990550,none,0.884252
60,There is nothing sadder than pro-life young wo...,1,against,"The tweet: ""There is nothing sadder than pro-l...",against,0.963542,against,0.979698
...,...,...,...,...,...,...,...,...
76,"@user I don't know about you, but I didn't vot...",1,against,"The tweet: ""@user I don't know about you, but ...",none,0.988117,none,0.544721
53,@user sad this is what has to be done for wome...,0,none,"The tweet: ""@user sad this is what has to be d...",none,0.701689,none,0.445458
38,They say God isn't listening? Just look at tha...,0,none,"The tweet: ""They say God isn't listening? Just...",none,0.914295,none,0.463050
83,@user 1999 Meet the Press admitted to being Ve...,1,against,"The tweet: ""@user 1999 Meet the Press admitted...",favor,0.579279,favor,0.915231


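7. **Evaluate!**

Finally, let's evaluate the predictions of the fine-tuned model. How do they compare with those of the zero-shot model? Is there a clear improvement in accuracy, recall or any other metric?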

In [None]:
from sklearn.metrics import accuracy_score, precision_score, recall_score, f1_score, confusion_matrix

# Extract the true labels and the labels predicted by the fine-tuned model
y_true = df_inference["label_text"]  # true labels
y_pred = df_inference["label_text_pred2"]  # labels predicted by the fine-tuned model

# Fix the order of the classes
class_order = ["against", "none", "favor"]

# Compute the evaluation metrics
accuracy = accuracy_score(y_true, y_pred)  # accuracy
precision = precision_score(y_true, y_pred, average=None, labels=class_order)  # precision per class
recall = recall_score(y_true, y_pred, average=None, labels=class_order)  # recall per class
f1 = f1_score(y_true, y_pred, average=None, labels=class_order)  # F1 score per class
conf_matrix = confusion_matrix(y_true, y_pred, labels=class_order)  # confusion matrix

# Print results
print("Class order:", class_order)
print(f"Accuracy: {accuracy:.4f}")
print("Precision:", precision)
print("Recall:", recall)
print("F1 Score:", f1)

print("\nConfusion Matrix:")
print(conf_matrix)

Class order: ['against', 'none', 'favor']
Accuracy: 0.4667
Precision: [0.39534884 0.60606061 0.35714286]
Recall: [0.56666667 0.66666667 0.16666667]
F1 Score: [0.46575342 0.63492063 0.22727273]

Confusion Matrix:
[[17  7  6]
 [ 7 20  3]
 [19  6  5]]
