### A simple text classification task

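The Hugging Face website is often unreachable from here, so the model is either downloaded through a mirror site in China or downloaded once to local disk and loaded from there.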

In [1]:
import os

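# # Set the mirror endpoint (code-level override; must run before transformers is imported)
# os.environ["HF_ENDPOINT"] = "https://hf-mirror.com"

# Paths for data, model, and outputs (raw strings, so backslashes are not read as escape sequences)
data_path = r".\data\GLUE\MRPC"
model_path = r".\models\bert-base-uncased"
save_path = r".\output"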
checkpoint = "bert-base-classification"
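The backslash-separated paths above are Windows-specific. A minimal sketch of a portable alternative with `pathlib`, assuming the same directory layout relative to the working directory (the names `project_root` and `train_file` are introduced here for illustration only):

```python
from pathlib import Path

# Assumption: the notebook is started from the project root.
project_root = Path(".")

data_path = project_root / "data" / "GLUE" / "MRPC"
model_path = project_root / "models" / "bert-base-uncased"
save_path = project_root / "output"
checkpoint = "bert-base-classification"

# The "/" operator joins path components with the right separator on every OS.
train_file = data_path / "train-00000-of-00001.parquet"
print(train_file.as_posix())
```

Where a plain string is expected, `str(model_path)` converts the path to the native form.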

#### Preprocessing with the tokenizer

Use the AutoTokenizer class to load and instantiate the tokenizer. I downloaded the tokenizer files and the model config to local disk and load them from there.

In [2]:
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained(model_path)


In [3]:
raw_inputs = [
    "I've been waiting for a HuggingFace course my whole life.", 
    "I hate this so much!",
]
inputs = tokenizer(raw_inputs, padding=True, truncation=True, return_tensors="pt")
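# return_tensors="pt" returns PyTorch tensors; the token ids must be tensors before they can be fed to the model.
# padding=True pads every sequence to the longest one in the batch; truncation=True truncates sequences that exceed the model's maximum length.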

print(inputs)

{'input_ids': tensor([[  101,  1045,  1005,  2310,  2042,  3403,  2005,  1037, 17662, 12172,
          2607,  2026,  2878,  2166,  1012,   102],
        [  101,  1045,  5223,  2023,  2061,  2172,   999,   102,     0,     0,
             0,     0,     0,     0,     0,     0]]), 'token_type_ids': tensor([[0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0],
        [0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0]]), 'attention_mask': tensor([[1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1],
        [1, 1, 1, 1, 1, 1, 1, 1, 0, 0, 0, 0, 0, 0, 0, 0]])}
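The padded `input_ids` and the `attention_mask` in this output can be reproduced by hand. The sketch below is a simplified stand-in for what `padding=True` does, not the tokenizer's actual implementation; `pad_batch` is a hypothetical helper, and the pad id 0 is taken from the zeros in the output above.

```python
def pad_batch(sequences, pad_id=0):
    """Pad token-id lists to the longest length and build the attention mask."""
    max_len = max(len(seq) for seq in sequences)
    input_ids, attention_mask = [], []
    for seq in sequences:
        n_pad = max_len - len(seq)
        input_ids.append(seq + [pad_id] * n_pad)
        # 1 marks a real token, 0 marks padding the model should ignore.
        attention_mask.append([1] * len(seq) + [0] * n_pad)
    return {"input_ids": input_ids, "attention_mask": attention_mask}

# Token ids copied from the output above, with the padding removed.
first = [101, 1045, 1005, 2310, 2042, 3403, 2005, 1037, 17662, 12172,
         2607, 2026, 2878, 2166, 1012, 102]
second = [101, 1045, 5223, 2023, 2061, 2172, 999, 102]

batch = pad_batch([first, second])
print(batch["attention_mask"][1])
```

The second sentence has 8 tokens and the first has 16, so the second row receives 8 pad tokens and 8 zeros in its mask.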


#### Choosing the model

Use the AutoModel class to load and instantiate the model.

In [4]:
from transformers import AutoModel

model = AutoModel.from_pretrained(model_path)

In [5]:
# Print the model to inspect its architecture
print(model)

BertModel(
  (embeddings): BertEmbeddings(
    (word_embeddings): Embedding(30522, 768, padding_idx=0)
    (position_embeddings): Embedding(512, 768)
    (token_type_embeddings): Embedding(2, 768)
    (LayerNorm): LayerNorm((768,), eps=1e-12, elementwise_affine=True)
    (dropout): Dropout(p=0.1, inplace=False)
  )
  (encoder): BertEncoder(
    (layer): ModuleList(
      (0): BertLayer(
        (attention): BertAttention(
          (self): BertSelfAttention(
            (query): Linear(in_features=768, out_features=768, bias=True)
            (key): Linear(in_features=768, out_features=768, bias=True)
            (value): Linear(in_features=768, out_features=768, bias=True)
            (dropout): Dropout(p=0.1, inplace=False)
          )
          (output): BertSelfOutput(
            (dense): Linear(in_features=768, out_features=768, bias=True)
            (LayerNorm): LayerNorm((768,), eps=1e-12, elementwise_affine=True)
            (dropout): Dropout(p=0.1, inplace=False)
          

A text classification task needs the matching model head, so instead of AutoModel we use the AutoModelForSequenceClassification class.

In [6]:
from transformers import AutoModelForSequenceClassification

model = AutoModelForSequenceClassification.from_pretrained(model_path)
outputs = model(**inputs)

Some weights of BertForSequenceClassification were not initialized from the model checkpoint at .\models\bert-base-uncased and are newly initialized: ['classifier.bias', 'classifier.weight']
You should probably TRAIN this model on a down-stream task to be able to use it for predictions and inference.
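The warning names `classifier.weight` and `classifier.bias`: these belong to the linear layer that maps the encoder's pooled output to one logit per label. The sketch below illustrates that mapping with a hidden size of 4 instead of 768 and random stand-in values; `linear_head` and all numbers are illustrative assumptions, not values from the checkpoint.

```python
import random

def linear_head(pooled, weight, bias):
    """logits[j] = sum_i weight[j][i] * pooled[i] + bias[j]"""
    return [sum(w * x for w, x in zip(row, pooled)) + b
            for row, b in zip(weight, bias)]

random.seed(0)
hidden_size, num_labels = 4, 2  # illustrative; bert-base-uncased has 768 hidden units

# Stand-in for the pooled sentence representation produced by the pretrained encoder.
pooled = [random.uniform(-1, 1) for _ in range(hidden_size)]

# A newly initialized head: small random weights and zero bias, nothing learned yet.
weight = [[random.gauss(0, 0.02) for _ in range(hidden_size)]
          for _ in range(num_labels)]
bias = [0.0] * num_labels

logits = linear_head(pooled, weight, bias)
print(len(logits))
```

Until the head is trained, its logits carry no information about the task, which is why the warning recommends fine-tuning before using the model for predictions.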


In [7]:
# Two sentences and two labels, so the logits have shape 2x2
print(outputs.logits.shape)

print(outputs.logits)

torch.Size([2, 2])
tensor([[ 0.0773, -0.2972],
        [ 0.0211, -0.3956]], grad_fn=<AddmmBackward0>)


In PyTorch, do not apply nn.Softmax() before calling nn.CrossEntropyLoss(): nn.CrossEntropyLoss() already includes the softmax computation, because it fuses LogSoftmax and NLLLoss:\
\
nn.CrossEntropyLoss() = nn.LogSoftmax() + nn.NLLLoss()

In [8]:
import torch
import torch.nn as nn
import torch.nn.functional as F

predictions = F.softmax(outputs.logits, dim=-1)
print(predictions)  # the outputs are class probabilities

tensor([[0.5925, 0.4075],
        [0.6027, 0.3973]], grad_fn=<SoftmaxBackward0>)
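The relationship between softmax, log-softmax, and cross-entropy can be checked in plain Python. This is a minimal sketch using the logits printed earlier (rounded to four decimals, so results agree only up to rounding); the helper names are introduced here and are not PyTorch functions.

```python
import math

def log_softmax(logits):
    """Numerically stable log-softmax over a list of logits."""
    m = max(logits)
    log_sum = m + math.log(sum(math.exp(x - m) for x in logits))
    return [x - log_sum for x in logits]

def nll_loss(log_probs, target):
    """Negative log-likelihood of the target class."""
    return -log_probs[target]

def cross_entropy(logits, target):
    """Cross-entropy on raw logits = log-softmax followed by NLL."""
    return nll_loss(log_softmax(logits), target)

batch_logits = [[0.0773, -0.2972], [0.0211, -0.3956]]
probs = [[math.exp(lp) for lp in log_softmax(row)] for row in batch_logits]
for row in probs:
    print([round(p, 4) for p in row])

# The loss takes the logits directly; no softmax is applied beforehand.
loss = cross_entropy(batch_logits[0], target=1)
print(loss)
```

Exponentiating the log-softmax recovers the probabilities shown above, and the loss is the negative log of the probability assigned to the target class.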


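Reading the probabilities against the class labels gives the model's predictions below. The classification head is newly initialized, so these values are not meaningful until the model is fine-tuned:\
First sentence: LABEL_0 - 0.5925, LABEL_1 - 0.4075\
Second sentence: LABEL_0 - 0.6027, LABEL_1 - 0.3973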

In [9]:
# Check how many class labels there are (only two here: 0 and 1)
model.config.id2label

{0: 'LABEL_0', 1: 'LABEL_1'}

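### Downloading and building your own dataset

The GLUE benchmark measures the performance of NLP models on 9 different tasks (classification, similarity, and inference tasks).\
The MRPC dataset is one of them.\
\
Because every GLUE dataset is labeled, fine-tuning BERT on it counts as supervised fine-tuning (SFT).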

In [10]:
from datasets import load_dataset

# Load the dataset from local files
raw_datasets = load_dataset("parquet", 
                            data_files={
                                "train": os.path.join(data_path, "train-00000-of-00001.parquet"), 
                                "validation": os.path.join(data_path, "validation-00000-of-00001.parquet"),
                                "test": os.path.join(data_path, "test-00000-of-00001.parquet")
                            })

Generating train split: 3668 examples [00:00, 319575.98 examples/s]
Generating validation split: 408 examples [00:00, 102025.64 examples/s]
Generating test split: 1725 examples [00:00, 431101.38 examples/s]


In [11]:
raw_datasets

DatasetDict({
    train: Dataset({
        features: ['sentence1', 'sentence2', 'label', 'idx'],
        num_rows: 3668
    })
    validation: Dataset({
        features: ['sentence1', 'sentence2', 'label', 'idx'],
        num_rows: 408
    })
    test: Dataset({
        features: ['sentence1', 'sentence2', 'label', 'idx'],
        num_rows: 1725
    })
})

In [12]:
# Print the first training example to show the data format
raw_train_dataset = raw_datasets["train"]
raw_train_dataset[0]

{'sentence1': 'Amrozi accused his brother , whom he called " the witness " , of deliberately distorting his evidence .',
 'sentence2': 'Referring to him as only " the witness " , Amrozi accused his brother of deliberately distorting his evidence .',
 'label': 1,
 'idx': 0}

In [13]:
# The features attribute shows the type of each column
# MRPC is a similarity task: label 0 means the two sentences differ in meaning, label 1 means they are equivalent
raw_train_dataset.features

{'sentence1': Value(dtype='string', id=None),
 'sentence2': Value(dtype='string', id=None),
 'label': ClassLabel(names=['not_equivalent', 'equivalent'], id=None),
 'idx': Value(dtype='int32', id=None)}

The dataset can also be converted to a DataFrame for visualization and analysis.

In [16]:
import pandas as pd

validation = pd.DataFrame(raw_datasets['validation'])

In [17]:
validation

| | sentence1 | sentence2 | label | idx |
|---|---|---|---|---|
| 0 | He said the foodservice pie business doesn 't ... | " The foodservice pie business does not fit ou... | 1 | 9 |
| 1 | Magnarelli said Racicot hated the Iraqi regime... | His wife said he was " 100 percent behind Geor... | 0 | 18 |
| 2 | The dollar was at 116.92 yen against the yen ,... | The dollar was at 116.78 yen JPY = , virtually... | 0 | 25 |
| 3 | The AFL-CIO is waiting until October to decide... | The AFL-CIO announced Wednesday that it will d... | 1 | 32 |
| 4 | No dates have been set for the civil or the cr... | No dates have been set for the criminal or civ... | 0 | 33 |
| ... | ... | ... | ... | ... |
| 403 | Their contract will expire at 12 : 01 a.m. Wed... | " It has outraged the membership , " said Rian... | 0 | 4023 |
| 404 | But plaque volume increased by 2.7 percent in ... | The volume of plaque in Pravachol patients ' a... | 1 | 4028 |
| 405 | Today in the US , the book - kept under wraps ... | Tomorrow the book , kept under wraps by G. P. ... | 1 | 4040 |
| 406 | The S & P / TSX composite rose 87.74 points on... | On the week , the Dow Jones industrial average... | 0 | 4049 |
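One simple analysis is the label distribution. The sketch below tallies only the nine rows visible in the truncated display above, so the counts say nothing about the full validation split; it uses the standard library to stay self-contained.

```python
from collections import Counter

# Labels of the nine rows visible in the truncated display (not the full split).
visible_labels = [1, 0, 0, 1, 0, 0, 1, 1, 0]

counts = Counter(visible_labels)
total = sum(counts.values())
for label in sorted(counts):
    # label, absolute count, share of the visible rows
    print(label, counts[label], round(counts[label] / total, 3))
```

On the full DataFrame, `validation["label"].value_counts()` produces the same kind of tally for all 408 rows.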


### Preprocessing the dataset

In [14]:
import os
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained(model_path)

In [15]:
# An example to inspect the tokenizer output for a sentence pair
inputs = tokenizer("This is the first sentence.", "This is the second one.")
inputs

{'input_ids': [101, 2023, 2003, 1996, 2034, 6251, 1012, 102, 2023, 2003, 1996, 2117, 2028, 1012, 102], 'token_type_ids': [0, 0, 0, 0, 0, 0, 0, 0, 1, 1, 1, 1, 1, 1, 1], 'attention_mask': [1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1]}
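The structure of this output follows the pattern `[CLS] sentence1 [SEP] sentence2 [SEP]`, with `token_type_ids` marking which segment each token belongs to. A minimal sketch that rebuilds it from the ids shown above; `encode_pair` is a hypothetical helper and not part of the tokenizer API.

```python
CLS_ID, SEP_ID = 101, 102  # ids of [CLS] and [SEP], as seen in the output above

def encode_pair(ids_a, ids_b):
    """Assemble [CLS] A [SEP] B [SEP] and mark the segment of every token."""
    first = [CLS_ID] + ids_a + [SEP_ID]
    second = ids_b + [SEP_ID]
    return {
        "input_ids": first + second,
        "token_type_ids": [0] * len(first) + [1] * len(second),
        "attention_mask": [1] * (len(first) + len(second)),
    }

# Token ids of the two sentences, copied from the output above.
sentence_a = [2023, 2003, 1996, 2034, 6251, 1012]  # "This is the first sentence."
sentence_b = [2023, 2003, 1996, 2117, 2028, 1012]  # "This is the second one."

encoded = encode_pair(sentence_a, sentence_b)
print(encoded["token_type_ids"])
```

The first segment, including `[CLS]` and the first `[SEP]`, gets type 0; the second sentence and the final `[SEP]` get type 1.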

In [16]:
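# Tokenize the whole dataset
# # The approach below works, but tokenized_dataset is then no longer a Dataset: the tokenizer returns a dictionary
# # It also keeps everything in memory, so a large dataset can cause an out-of-memory error
# tokenized_dataset = tokenizer(
#     raw_datasets["train"]["sentence1"],
#     raw_datasets["train"]["sentence2"],
#     padding=True,
#     truncation=True,
# )

# To keep the data in Dataset format, use the more flexible Dataset.map method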
def tokenize_function(example):
    return tokenizer(example["sentence1"], example["sentence2"], truncation=True)

tokenized_datasets = raw_datasets.map(tokenize_function, batched=True)
tokenized_datasets

Map: 100%|██████████| 3668/3668 [00:00<00:00, 20522.00 examples/s]
Map: 100%|██████████| 408/408 [00:00<00:00, 11168.31 examples/s]
Map: 100%|██████████| 1725/1725 [00:00<00:00, 18445.50 examples/s]


DatasetDict({
    train: Dataset({
        features: ['sentence1', 'sentence2', 'label', 'idx', 'input_ids', 'token_type_ids', 'attention_mask'],
        num_rows: 3668
    })
    validation: Dataset({
        features: ['sentence1', 'sentence2', 'label', 'idx', 'input_ids', 'token_type_ids', 'attention_mask'],
        num_rows: 408
    })
    test: Dataset({
        features: ['sentence1', 'sentence2', 'label', 'idx', 'input_ids', 'token_type_ids', 'attention_mask'],
        num_rows: 1725
    })
})
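With `batched=True`, the mapped function receives a dictionary of columns holding many rows at once, and the columns it returns are added next to the existing ones. The following is a simplified in-memory stand-in for that behaviour, not the `datasets` implementation; `map_batched` and `toy_tokenize` are hypothetical names, and the whitespace tokenizer only replaces the real one for illustration.

```python
def map_batched(rows, fn, batch_size=1000):
    """Simplified stand-in for Dataset.map(fn, batched=True) on a list of row dicts."""
    out = []
    for start in range(0, len(rows), batch_size):
        chunk = rows[start:start + batch_size]
        # fn sees one dict per batch: column name -> list of values.
        batch = {key: [row[key] for row in chunk] for key in chunk[0]}
        new_columns = fn(batch)
        # Returned columns are merged into the existing rows.
        for i, row in enumerate(chunk):
            out.append({**row, **{k: v[i] for k, v in new_columns.items()}})
    return out

def toy_tokenize(batch):
    """Hypothetical whitespace tokenizer standing in for tokenize_function."""
    return {"n_tokens": [len((a + " " + b).split())
                         for a, b in zip(batch["sentence1"], batch["sentence2"])]}

rows = [
    {"sentence1": "a b c", "sentence2": "d e", "label": 1},
    {"sentence1": "f", "sentence2": "g h", "label": 0},
    {"sentence1": "i j", "sentence2": "k", "label": 1},
]
result = map_batched(rows, toy_tokenize, batch_size=2)
print(result[0])
```

Processing in batches lets the tokenizer handle many sentence pairs per call, which is the main reason `batched=True` speeds up preprocessing.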

In [17]:
from transformers import DataCollatorWithPadding
# Dynamic padding: pad the sequences of each batch to the length of the longest one in that batch
data_collator = DataCollatorWithPadding(tokenizer=tokenizer)
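To see why per-batch padding helps, the sketch below counts pad tokens for padding every example to the global maximum versus padding each batch to its own maximum. The sequence lengths are made up for illustration, and `count_padding` is a hypothetical helper.

```python
def count_padding(lengths, batch_size, dynamic):
    """Count pad tokens when padding per batch (dynamic) or to the global maximum."""
    global_max = max(lengths)
    total_pad = 0
    for start in range(0, len(lengths), batch_size):
        batch = lengths[start:start + batch_size]
        target = max(batch) if dynamic else global_max
        total_pad += sum(target - n for n in batch)
    return total_pad

# Hypothetical sequence lengths (in tokens) for eight examples.
lengths = [12, 15, 14, 60, 9, 11, 10, 13]

static_pad = count_padding(lengths, batch_size=4, dynamic=False)
dynamic_pad = count_padding(lengths, batch_size=4, dynamic=True)
print(static_pad, dynamic_pad)
```

A single long example only inflates the batch it lands in, so the second batch here needs far less padding than it would under a global maximum.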

### Training with the Trainer API

When no evaluation function is set, the Trainer reports only the training loss; evaluation metrics such as accuracy and F1 are not included.

In [None]:
from transformers import TrainingArguments

# The only argument TrainingArguments requires is the directory where the model is saved
training_args = TrainingArguments(os.path.join(save_path, checkpoint))

In [55]:
from transformers import AutoModelForSequenceClassification

# Instantiating the pretrained model emits a warning, because BERT was not pretrained for sentence classification
model = AutoModelForSequenceClassification.from_pretrained(model_path, num_labels=2)  # binary classification

Some weights of BertForSequenceClassification were not initialized from the model checkpoint at D:\ProgrammingProjects\huggingface\models\bert\bert-base-uncased and are newly initialized: ['classifier.bias', 'classifier.weight']
You should probably TRAIN this model on a down-stream task to be able to use it for predictions and inference.


In [56]:
from transformers import Trainer

# Define a trainer
# Pass in all the objects built so far to fine-tune the model
trainer = Trainer(
    model,
    training_args,
    train_dataset=tokenized_datasets["train"],
    eval_dataset=tokenized_datasets["validation"],
    data_collator=data_collator,
    tokenizer=tokenizer,
)

In [57]:
# Fine-tuning on a CPU is very slow; running the notebook on Colab with a GPU speeds it up
trainer.train()

 36%|███▋      | 500/1377 [21:42<39:56,  2.73s/it]  

{'loss': 0.5368, 'grad_norm': 7.87819242477417, 'learning_rate': 3.184458968772695e-05, 'epoch': 1.09}


 73%|███████▎  | 1000/1377 [1:36:34<15:44,  2.50s/it]    

{'loss': 0.2799, 'grad_norm': 5.622527599334717, 'learning_rate': 1.3689179375453886e-05, 'epoch': 2.18}


100%|██████████| 1377/1377 [1:52:57<00:00,  4.92s/it]

{'train_runtime': 6777.5719, 'train_samples_per_second': 1.624, 'train_steps_per_second': 0.203, 'train_loss': 0.3399341530616334, 'epoch': 3.0}





TrainOutput(global_step=1377, training_loss=0.3399341530616334, metrics={'train_runtime': 6777.5719, 'train_samples_per_second': 1.624, 'train_steps_per_second': 0.203, 'total_flos': 405114969714960.0, 'train_loss': 0.3399341530616334, 'epoch': 3.0})
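The step count of 1377 can be reproduced with simple arithmetic. The sketch assumes a single device and the `TrainingArguments` defaults of batch size 8 and 3 epochs, since this first run passes only the output directory; `total_optimizer_steps` is a hypothetical helper.

```python
import math

def total_optimizer_steps(num_examples, batch_size, epochs):
    """Batches per epoch (last partial batch included) times the number of epochs."""
    return math.ceil(num_examples / batch_size) * epochs

# 3668 training rows, assumed defaults: batch size 8, 3 epochs, no gradient accumulation.
train_steps = total_optimizer_steps(3668, batch_size=8, epochs=3)
print(train_steps)  # 1377

# Throughput: examples seen over all epochs divided by the reported runtime in seconds.
samples_per_second = 3668 * 3 / 6777.5719
print(round(samples_per_second, 3))
```

Both numbers agree with the `global_step` and `train_samples_per_second` fields in the output above.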

In [59]:
# Inspect the structure of the preprocessed validation set
tokenized_datasets["validation"]

Dataset({
    features: ['sentence1', 'sentence2', 'label', 'idx', 'input_ids', 'token_type_ids', 'attention_mask'],
    num_rows: 408
})

In [60]:
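# Get the model's predictions
# predict returns a named tuple (predictions: [num_examples, num_labels], label_ids, metrics); without compute_metrics, metrics holds only the loss on the given dataset plus timing statistics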
predictions = trainer.predict(tokenized_datasets["validation"])

100%|██████████| 51/51 [00:32<00:00,  1.55it/s]


In [62]:
print(predictions.predictions.shape, predictions.label_ids.shape)

(408, 2) (408,)


In [63]:
# predictions has shape [num_examples, num_labels]; take the argmax over the last axis to get class ids that can be compared with the labels
import numpy as np
preds = np.argmax(predictions.predictions, axis=-1)
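Comparing the predicted class ids with the labels gives the accuracy. A minimal sketch in plain Python with made-up logits and labels; `argmax` and `accuracy` are hypothetical helpers, and the toy values are not results from this run.

```python
def argmax(row):
    """Index of the largest value in a list."""
    return max(range(len(row)), key=row.__getitem__)

def accuracy(logits, labels):
    """Fraction of examples whose highest-scoring class equals the label."""
    preds = [argmax(row) for row in logits]
    correct = sum(int(p == y) for p, y in zip(preds, labels))
    return correct / len(labels)

# Hypothetical logits and labels for four examples.
toy_logits = [[-1.2, 2.3], [0.8, -0.5], [-0.1, 0.4], [1.5, 0.2]]
toy_labels = [1, 0, 0, 0]

toy_accuracy = accuracy(toy_logits, toy_labels)
print(toy_accuracy)
```

With the NumPy arrays from the cell above, the same comparison is `(preds == predictions.label_ids).mean()`.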

### Setting up the evaluation function

In [18]:
from evaluate import load
import numpy as np

def compute_metrics(eval_preds):
    metric = load("glue", "mrpc")  # GLUE metric for the MRPC subset (accuracy and F1)
    
    logits, labels = eval_preds
    predictions = np.argmax(logits, axis=-1)
    
    return metric.compute(predictions=predictions, references=labels)
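The GLUE metric for MRPC reports accuracy and F1. For reference, this is a minimal sketch of how binary F1 is defined, with label 1 as the positive class; it is not the `evaluate` implementation, and `binary_f1` and the toy data are assumptions for illustration.

```python
def binary_f1(preds, labels, positive=1):
    """F1 = harmonic mean of precision and recall for the positive class."""
    pairs = list(zip(preds, labels))
    tp = sum(1 for p, y in pairs if p == positive and y == positive)
    fp = sum(1 for p, y in pairs if p == positive and y != positive)
    fn = sum(1 for p, y in pairs if p != positive and y == positive)
    if tp == 0:
        return 0.0  # no true positives: precision and recall are both zero
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    return 2 * precision * recall / (precision + recall)

# Hypothetical predictions and references.
toy_preds = [1, 0, 1, 1, 0, 1]
toy_refs = [1, 0, 0, 1, 1, 1]

toy_f1 = binary_f1(toy_preds, toy_refs)
print(toy_f1)
```

F1 ignores true negatives, so on a dataset with uneven class sizes it can differ noticeably from accuracy.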

With the evaluation metrics defined, train the model again.

In [19]:
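import torch
from transformers import TrainingArguments
# training_args = TrainingArguments(model_path, 
#                                 evaluation_strategy="epoch")
# Note: the only argument TrainingArguments requires is the directory where the model is saved
training_args = TrainingArguments(
    output_dir=save_path,
    run_name=checkpoint,
    num_train_epochs=5,  # train for 5 epochs (with little data, use fewer epochs to avoid overfitting; is there early stopping?)
    per_device_train_batch_size=16,  # 16 samples per training batch on each device (make sure memory is sufficient)
    per_device_eval_batch_size=32,  # 32 samples per evaluation batch on each device (no backward pass, so it can be larger)
    gradient_accumulation_steps=4,  # accumulate gradients over 4 batches per optimizer step (effective batch size 16 * 4 = 64 per device)
    
    learning_rate=2e-5,
    warmup_ratio=0.1,     # warmup: the lr increases linearly over the first 10% of steps
    weight_decay=0.01,    # regularization (against overfitting)
    lr_scheduler_type="cosine",  # cosine learning-rate decay (when is it needed?)

    evaluation_strategy="steps",  # newer transformers releases rename this argument to eval_strategy
    eval_steps=50,  # evaluate every 50 steps

    save_strategy="steps",  # save by steps, matching the evaluation strategy
    save_steps=100,  # save a checkpoint every 100 steps (must be a multiple of eval_steps when load_best_model_at_end=True)
    save_total_limit=5,  # cap the number of saved checkpoints to keep disk usage bounded

    load_best_model_at_end=True,  # load the best model when training ends
    metric_for_best_model="f1",   # pick the best model by F1 (the core metric for MRPC)
    greater_is_better=True,  # the metric is F1, so larger is better

    #bf16=torch.cuda.is_bf16_supported(),  # use bf16 when the GPU supports it
    #fp16=not torch.cuda.is_bf16_supported(),  # otherwise fall back to fp16
    dataloader_num_workers=4 if torch.cuda.is_available() else 2,  # worker subprocesses for data loading (choose according to the number of CPU cores)
    dataloader_pin_memory=True,  # pin host memory (faster transfer to the GPU); recommended for GPU training

    logging_dir=f"{save_path}/{checkpoint}/logs",  # keep logs in their own directory
    logging_steps=50,  # log every 50 steps
    
    report_to="tensorboard",  # send logs to TensorBoard
    save_safetensors=True,  # save weights in the safetensors format

    # do_train=True,  # training switch
    # max_steps=15000,  # total number of training steps
    )
model = AutoModelForSequenceClassification.from_pretrained(model_path, num_labels=2)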

Some weights of BertForSequenceClassification were not initialized from the model checkpoint at .\models\bert-base-uncased and are newly initialized: ['classifier.bias', 'classifier.weight']
You should probably TRAIN this model on a down-stream task to be able to use it for predictions and inference.
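The combination of `warmup_ratio=0.1` and `lr_scheduler_type="cosine"` produces a learning rate that rises linearly and then decays along half a cosine wave. The sketch below reproduces that shape in plain Python; `lr_at_step` is a hypothetical helper, and the total of 300 steps (hence 30 warmup steps) is an illustrative value, not the step count of this run.

```python
import math

def lr_at_step(step, total_steps, warmup_steps, base_lr=2e-5):
    """Linear warmup from 0 to base_lr, then half-cosine decay back to 0."""
    if step < warmup_steps:
        return base_lr * step / max(1, warmup_steps)
    progress = (step - warmup_steps) / max(1, total_steps - warmup_steps)
    return base_lr * 0.5 * (1.0 + math.cos(math.pi * progress))

total_steps = 300   # illustrative value only
warmup_steps = 30   # 10% of the illustrative total

schedule = {step: lr_at_step(step, total_steps, warmup_steps)
            for step in (0, 15, 30, 165, 300)}
for step, lr in schedule.items():
    print(step, lr)
```

The rate peaks at `learning_rate` right after warmup, passes half of it midway through the decay, and reaches zero at the final step.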


In [20]:
from transformers import Trainer
trainer = Trainer(
    model,
    training_args,
    train_dataset=tokenized_datasets["train"],
    eval_dataset=tokenized_datasets["validation"],
    data_collator=data_collator,
    tokenizer=tokenizer,
    compute_metrics=compute_metrics
)
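On the early-stopping question noted in the training arguments: `transformers` provides an `EarlyStoppingCallback` that can be passed to `Trainer` through its `callbacks` argument, and it relies on `load_best_model_at_end=True` and `metric_for_best_model` being set, as they are here. The patience logic behind it can be sketched in plain Python; `should_stop` and the F1 values are illustrative assumptions, not the library's code or results from this run.

```python
def should_stop(metric_history, patience=3, threshold=0.0):
    """Return True once the metric fails to improve `patience` evaluations in a row."""
    best = None
    bad_evals = 0
    for value in metric_history:
        if best is None or value > best + threshold:
            best = value       # new best score: reset the counter
            bad_evals = 0
        else:
            bad_evals += 1     # no improvement at this evaluation
            if bad_evals >= patience:
                return True
    return False

# Hypothetical F1 scores from successive evaluations.
stalled = should_stop([0.81, 0.85, 0.88, 0.87, 0.88, 0.86])
improving = should_stop([0.81, 0.85, 0.88, 0.87, 0.89, 0.90])
print(stalled, improving)
```

With evaluation every 50 steps, a patience of 3 would end training after 150 steps without a better F1.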

In [None]:
trainer.train()