# Getting Started with Fine-Tuning in Hugging Face Transformers

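This example walks through the main steps of fine-tuning a model with Transformers:
- Downloading the dataset
- Preprocessing the data
- Configuring training hyperparameters
- Setting up evaluation metrics
- An introduction to the Trainer
- Running the training
- Saving the model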

## The YelpReviewFull Dataset

**Hugging Face dataset: [YelpReviewFull](https://huggingface.co/datasets/yelp_review_full)**

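### Dataset Summary

The Yelp reviews dataset consists of reviews from Yelp. It is extracted from the Yelp Dataset Challenge 2015 data.

### Supported Tasks and Leaderboards
Text classification, sentiment classification: the dataset is mainly used for text classification, that is, predicting the sentiment of a given text.

### Languages
The reviews are mainly written in English.

### Dataset Structure

#### Data Instances
A typical data point consists of a text and its corresponding label.

An example from the YelpReviewFull test set: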

```python
{
    'label': 0,
    'text': 'I got \'new\' tires from them and within two weeks got a flat. I took my car to a local mechanic to see if i could get the hole patched, but they said the reason I had a flat was because the previous patch had blown - WAIT, WHAT? I just got the tire and never needed to have it patched? This was supposed to be a new tire. \\nI took the tire over to Flynn\'s and they told me that someone punctured my tire, then tried to patch it. So there are resentful tire slashers? I find that very unlikely. After arguing with the guy and telling him that his logic was far fetched he said he\'d give me a new tire \\"this time\\". \\nI will never go back to Flynn\'s b/c of the way this guy treated me and the simple fact that they gave me a used tire!'
}
```

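#### Data Fields

- 'text': The review text. In the raw data it is escaped with double quotes ("), and any internal double quote is escaped as two double quotes (""). Newlines are escaped as a backslash followed by an "n" character, i.e. "\n".
- 'label': The score of the review. Ratings range from 1 to 5 stars and are stored as a class index from 0 to 4 (0 is 1 star, 4 is 5 stars).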
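The relationship between the stored label and the star rating can be sketched as follows. This is a minimal illustration, assuming the label is a zero-based class index (as the `'label': 0` in the example above suggests); the helper names `label_to_stars` and `stars_to_label` are hypothetical and not part of any library.

```python
NUM_CLASSES = 5  # the dataset has five rating classes


def label_to_stars(label: int) -> int:
    """Convert a zero-based class index (0..4) to a star rating (1..5)."""
    if not 0 <= label < NUM_CLASSES:
        raise ValueError(f"label must be in 0..{NUM_CLASSES - 1}, got {label}")
    return label + 1


def stars_to_label(stars: int) -> int:
    """Convert a star rating (1..5) back to a zero-based class index (0..4)."""
    if not 1 <= stars <= NUM_CLASSES:
        raise ValueError(f"stars must be in 1..{NUM_CLASSES}, got {stars}")
    return stars - 1


print(label_to_stars(0))  # 1
print(stars_to_label(5))  # 4
```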

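#### Data Splits

The Yelp reviews full star dataset is constructed by randomly taking 130,000 training samples and 10,000 testing samples for each review star from 1 to 5. In total there are 650,000 training samples and 50,000 testing samples.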
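The split sizes follow directly from the per-class counts, and the balanced classes also give a useful reference point: a classifier that guesses uniformly at random would be right about one time in five. A short arithmetic check (the variable names are illustrative only):

```python
NUM_CLASSES = 5
TRAIN_PER_CLASS = 130_000
TEST_PER_CLASS = 10_000

train_total = NUM_CLASSES * TRAIN_PER_CLASS
test_total = NUM_CLASSES * TEST_PER_CLASS

# With perfectly balanced classes, uniform random guessing gives 1/5 accuracy.
chance_accuracy = 1 / NUM_CLASSES

print(train_total, test_total, chance_accuracy)  # 650000 50000 0.2
```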

## Downloading the Dataset

In [43]:
from datasets import load_dataset

# Load the dataset online
# dataset = load_dataset("yelp_review_full")

# Load the dataset offline (assumes it was already downloaded to local parquet files)
from datasets import Dataset, DatasetDict
dataset = DatasetDict({
    "train": Dataset.from_parquet("data/yelp_review_full/train-00000-of-00001.parquet"),
    "test": Dataset.from_parquet("data/yelp_review_full/test-00000-of-00001.parquet")
})

In [44]:
dataset

DatasetDict({
    train: Dataset({
        features: ['label', 'text'],
        num_rows: 650000
    })
    test: Dataset({
        features: ['label', 'text'],
        num_rows: 50000
    })
})

In [45]:
dataset["train"][2]

{'label': 3,
 'text': "Been going to Dr. Goldberg for over 10 years. I think I was one of his 1st patients when he started at MHMG. He's been great over the years and is really all about the big picture. It is because of him, not my now former gyn Dr. Markoff, that I found out I have fibroids. He explores all options with you and is very patient and understanding. He doesn't judge and asks all the right questions. Very thorough and wants to be kept in the loop on every aspect of your medical health and your life."}

In [46]:
import random
import pandas as pd
import datasets
from IPython.display import display, HTML

In [47]:
def show_random_elements(dataset, num_examples=10):
    """Display `num_examples` distinct random rows of `dataset` as an HTML table."""
    assert num_examples <= len(dataset), "Can't pick more elements than there are in the dataset."
    # random.sample draws distinct indices, so no manual de-duplication loop is needed.
    picks = random.sample(range(len(dataset)), num_examples)

    df = pd.DataFrame(dataset[picks])
    for column, typ in dataset.features.items():
        # Show ClassLabel columns by name (e.g. "1 star") instead of by integer index.
        if isinstance(typ, datasets.ClassLabel):
            df[column] = df[column].transform(lambda i: typ.names[i])
    display(HTML(df.to_html()))

In [48]:
show_random_elements(dataset["train"])

Unnamed: 0,label,text
0,5 stars,"Officially my favorite steakhouse hands down! As if Vegas doesn't have enough to write about, this little gem is in one of the remaining standouts of what some call \""Old Vegas\"". The meal got off to a great start at the bar as we waited for our table. The service was fast, the drinks were well made and naturally they were top shelf. I was particularly pleased with the professional look and feel of the place. Its very old school, mafioso, dark, white linen, with the sounds of clinking wine glasses and lively conversation in the background. \n\nThe food: Excellent! Your steak is served as intended, well aged beef cooked to order on a hot plate...nothing else! Fantastic. The sides are of sharable portions so agree on one for every 2 or 3 people unless you plan on toting a box of leftovers around Vegas. The mac and cheese is perfect as is the smooth and creamy lobster bisque. The other members in my party had several cuts of meat and one had a chicken dish...my sincerest apologies for not having info on those. Fact is, I wanted every square inch of space in my belly for the rib eye on my plate. Like sooo many other reviews about this place, it was melt in your mouth good. We also ordered a chocolate gnash desert that could only be made better by eating it off of your favorite super models body, otherwise it was absolute perfection. \n\nService was superb. Our waiter was attentive, informative and practiced. He was professional and friendly. Over all a fantastic experience. I will definitely be returning. Highly recommend this location to anyone. You will not be disappointed."
1,1 star,Horrible college! If you can go somewhere else they are so unorganized and try to find ways to make you attend their school.
2,4 stars,"We love to get take out at Los Taquitos on a busy week night when we are too tired to cook. They are super fast and dinner for 4 is usually under $20! \n\nI love the street tacos and my husband the ceviche. Son loves the carne asada burrito, daughter the tamales. Seriously something for everyone. There IS some variability in the spicy factor, but we are total regulars. Gotta try it!"
3,1 star,Stay far away from here!! I had braces put on by them and a few years later my teeth are still messed up! They screwed up my mouth so bad that every dentist I've gone to refuses to touch me until I have jaw surgery due to Western Dental's negligence. Then they sent me to collections for no reason which it has since been taken care of but that just shows how negligent they are!
4,5 stars,I love this place. It's the best hipster dive in Las Vegas! You can get deep fried Oreos.. For real. The girls are always super friendly and the drinks are fast and strong. What else could you want?!?
5,1 star,"CLOSED.....These guys lasted about two months...they're gone now, replaced by Zza's Pizza...will review it once I've tried it. But I gotta say, it's a funky location, and location's everything."
6,1 star,"We just left this restaurant. I can truthfully say the food was amazing. We had two servers who were fabulous. The manager however needs to reconsider her job choice. We made a reservation for a group of 15. We were hoping to go at 7 but they asked us to come at 6:30. Fine, we did. When we for there they were not prepared for our group. We pulled the tables and chairs ourselves. Then after all 13 of us had ordered, the manager came over and said they only had 2 woks going in the kitchen and could we combine orders. We then took our order down to 10 orders. We of course ended up adding more dishes. The manager spent the rest of our stay glaring in our direction. Then the bill came...wow. They put extra food on our bill. When we told them about it, she was condescending and made it sound as if we were lying and then as if she would punish the kitchen for sending us food we didn't order or eat. We paid, making sure we counted out every cent of the money to her so there would be no confusion. We walked out of there and straight into a bar at the container park to wash the taste of this horrible customer service off our palates. Some of our party had been here before and spoke of how great it was...sadly none of us plan on returning."
7,4 stars,"After reading reviews, I went to Taco Haus with medium expectations; however, I was quite happy with the results. The service was terrific and the food was very good. \n\nI enjoyed the raw jicama salad as well as the rice (I requested it with butter), guacamole, veggie taco (I requested it with butter instead of veggie oil), and the pork belly taco was pretty good. My special diet requests which were taken care of with ease and that made me extra pleased. \n\nTaco Haus sources local ingredients with what seems to be mostly organic. That alone will bring me back again. Their meat is hormone free, but if it were also grass fed, they'd get that much more of my business."
8,4 stars,"Okay, so I am a sucker for Panda Express. You may think it is not Chinese food, but I gotta tell you that their Orange Chicken is to die for. The store at Thunderbird and 40th Street is fast (usually) and offers hot and fresh food. Expect to stand in line (out the door) if you arrive between 5-6 p.m. on weekdays. Overall, a quick place to get chinese food served fresh and hot. My favs: Orange Chicken, Chicken with Mushroom, and Mandarin Chicken. Tip: if you are on a diet, buy the kids meal. It comes with one side and one entree, plus a cookie and a drink. Give the cookie to your kid, order Ice Tea or Diet Coke, and enjoy!"
9,1 star,"\""Eek! Methinks not\"" is RIGHT! (one star because we couldn't go lower)\nI'm writing this as my family and I are cracking up in horror, disbelief, and embarrassment. We saw the show an hour ago and can't stop making fun. We were so NOT entertained, the only way to make it worth our money is to laugh our asses off at how incredibly stupid and awkward this show was. We had a party of 12 people and here's what each said:\n9 year old boy #1: I thought it would be more magical. Is the fiance from Hooters?\n9 year old boy #2: My teenage cousin is better at magic in our living room. Laaaa-- aaaaaame!\n10 year old girl: The show was more about HIM than him doing magic. I wanted to put a paper bag over my head. Not that the person himself stinks, he just stinks at magic...\n11 year old girl: Are we at the right show? \n12 year old girl: I'm ashamed of/for the mother (aka DJ M-O-M also dressed like Hooters girl).\n12 year old boy: The mom looks like a hooker and the fiance IS a hooker (and we didn't even know he knew the word hooker)\n14 year old boy: It was good huhuhuhuhuhuh\n16 year old boy: Mom, if you EVER wore that...\nmom #1: Why in God's good name would we want to see him lip sync an Elvis song while waltzing down the aisle shaking hands like an actual performer? Or play the drums with his one-tune dad? He was AVOIDING magic. \nmom#2: The finger snapping is awkward and creeping me out. The stage manager looks mortified and MUST be a cousin working to earn his keep. I was uncomfortable.\ndad#1: Tommy Wind is magic's answer to the Wolf of Wall Street. Goooooooooong\ndad #2: no comment (in fear of bad karma...)\n\nWe simply could not make eye contact at the end. Meet and greet--me thinks NOT. We're too ashamed. \n\nOn a serious note, the positive reviews either have to be from friends and family, or we saw an extremely OFF night...For some constructive criticism--Tommy Wind's passion for show business is nice. 
Hopefully, he will perfect some tricks (we could see some of the tricks we shouldn't have seen) and focus more on magic than the music portion. We paid less for the Nathan Burton show last year and it was very good, so we expected some good illusions and slight of hand. Also, It's nice to work with family, but maybe he can work on integrating the family stories more smoothly."


## Preprocessing the Data

Once the dataset is available locally, use a tokenizer to process the text. Inputs of varying length can be handled with padding and truncation strategies.
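To make padding and truncation concrete, here is a toy sketch in plain Python. It is not the tokenizer's actual implementation: the function `pad_or_truncate` is hypothetical, the token ids are made up apart from BERT's `[CLS]` (101), `[SEP]` (102), and `[PAD]` (0), and unlike the real tokenizer it simply cuts off the tail without keeping the final `[SEP]`.

```python
def pad_or_truncate(token_ids, max_length, pad_id=0):
    """Return (input_ids, attention_mask), each exactly `max_length` long."""
    kept = token_ids[:max_length]            # truncation: drop tokens past max_length
    n_pad = max_length - len(kept)           # padding needed to reach max_length
    input_ids = kept + [pad_id] * n_pad
    attention_mask = [1] * len(kept) + [0] * n_pad  # 1 = real token, 0 = padding
    return input_ids, attention_mask


short_ids = [101, 7592, 102]
long_ids = [101, 11, 12, 13, 14, 15, 16, 102]

print(pad_or_truncate(short_ids, 5))  # ([101, 7592, 102, 0, 0], [1, 1, 1, 0, 0])
print(pad_or_truncate(long_ids, 5))   # ([101, 11, 12, 13, 14], [1, 1, 1, 1, 1])
```

The attention mask is what lets the model ignore the padded positions, which is why the tokenizer returns it alongside `input_ids`.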

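The Datasets `map` method applies a preprocessing function to the entire dataset in one pass.

The following pads every example to the maximum length and processes the whole dataset: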

In [49]:
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("bert-base-cased")


def tokenize_function(examples):
    return tokenizer(examples["text"], padding="max_length", truncation=True)


tokenized_datasets = dataset.map(tokenize_function, batched=True)

Map:   0%|          | 0/650000 [00:00<?, ? examples/s]

Map:   0%|          | 0/50000 [00:00<?, ? examples/s]

In [50]:
show_random_elements(tokenized_datasets["train"], num_examples=1)

Unnamed: 0,label,text,input_ids,token_type_ids,attention_mask
0,2 star,"Just an update, just adding another star for the apology they gave us.....","[101, 2066, 1126, 11984, 117, 1198, 5321, 1330, 2851, 1111, 1103, 13382, 1152, 1522, 1366, 119, 119, 119, 119, 119, 102, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, ...]","[0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, ...]","[1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, ...]"


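### Sampling the Data

Use 10,000 samples to demonstrate small-scale training on BERT (with the PyTorch Trainer).

The `shuffle()` method randomly reorders the rows of the dataset. For more control over the shuffling algorithm, pass the `generator` argument to supply a different `numpy.random.Generator`.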
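The effect of a seeded shuffle followed by selecting the first N rows can be illustrated with the standard library alone. This is an analogy under stated assumptions, not how Datasets implements `shuffle()` or `select()`; the helper `shuffle_and_select` and the toy rows are hypothetical.

```python
import random


def shuffle_and_select(rows, seed, n):
    """Shuffle row indices with a fixed seed, then keep the first `n` rows."""
    rng = random.Random(seed)          # a private generator, so the seed fully fixes the order
    indices = list(range(len(rows)))
    rng.shuffle(indices)
    return [rows[i] for i in indices[:n]]


rows = [f"review_{i}" for i in range(10)]
first = shuffle_and_select(rows, seed=42, n=3)
second = shuffle_and_select(rows, seed=42, n=3)

print(first == second)  # True
```

Fixing the seed is what makes the sampled subset reproducible from one run to the next.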

In [None]:
# Shuffle the training and test sets separately, then take the first 10,000 examples of each as the small-scale training and evaluation sets.
small_train_dataset = tokenized_datasets["train"].shuffle(seed=42).select(range(10000))
small_eval_dataset = tokenized_datasets["test"].shuffle(seed=42).select(range(10000))

# small_train_dataset = tokenized_datasets["train"].shuffle(seed=42)
# small_eval_dataset = tokenized_datasets["test"].shuffle(seed=42)

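## Fine-Tuning Configuration

### Loading the BERT Model

Loading the checkpoint prints a warning that some weights were not initialized from the checkpoint and are newly initialized (here `classifier.weight` and `classifier.bias`). This is completely normal when fine-tuning: the masked language modeling head used for pretraining is dropped and replaced with a new classification head, for which there are no pretrained weights. The library therefore warns that the model should be fine-tuned before it is used for inference, which is exactly what we are about to do.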

In [52]:
from transformers import AutoModelForSequenceClassification

# Load the pretrained "bert-base-cased" checkpoint and set the number of classes for the classification task to 5
model = AutoModelForSequenceClassification.from_pretrained("bert-base-cased", num_labels=5)

Some weights of BertForSequenceClassification were not initialized from the model checkpoint at bert-base-cased and are newly initialized: ['classifier.bias', 'classifier.weight']
You should probably TRAIN this model on a down-stream task to be able to use it for predictions and inference.
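The two newly initialized tensors named in the warning are small compared with the rest of the model. As a rough check, assuming the classification head is a single linear layer from BERT-base's hidden size of 768 to the 5 labels, its parameter count is:

```python
HIDDEN_SIZE = 768   # hidden size of BERT-base
NUM_LABELS = 5      # number of classes passed as num_labels

weight_params = HIDDEN_SIZE * NUM_LABELS  # classifier.weight
bias_params = NUM_LABELS                  # classifier.bias
head_params = weight_params + bias_params

print(head_params)  # 3845
```

Only these few thousand parameters start from random values; everything else starts from the pretrained checkpoint.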


### Training Hyperparameters (TrainingArguments)

Full list of parameters and their defaults: https://huggingface.co/docs/transformers/v4.36.1/en/main_classes/trainer#transformers.TrainingArguments

Source definition: https://github.com/huggingface/transformers/blob/v4.36.1/src/transformers/training_args.py#L161

**The most important setting: the path where model weights are saved (`output_dir`)**

In [53]:
from transformers import TrainingArguments

model_dir = "models/bert-base-cased-finetune-yelp"

# logging_steps defaults to 500; given our training data size and step count, set it to 100
training_args = TrainingArguments(output_dir=model_dir,
                                  per_device_train_batch_size=16,
                                  num_train_epochs=5,
                                  logging_steps=100)
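To see why a logging interval of 100 is reasonable here, it helps to estimate the number of optimizer steps. The sketch below assumes a single device, no gradient accumulation, and that the last partial batch is kept; the function `training_schedule` is a hypothetical helper, not part of Transformers.

```python
import math


def training_schedule(num_samples, batch_size, num_epochs, logging_steps):
    """Estimate steps per epoch, total steps, and how many times a log line is emitted."""
    steps_per_epoch = math.ceil(num_samples / batch_size)  # last partial batch counts as a step
    total_steps = steps_per_epoch * num_epochs
    log_events = total_steps // logging_steps
    return steps_per_epoch, total_steps, log_events


# 10,000 training samples, batch size 16, 5 epochs, logging every 100 steps
print(training_schedule(10_000, 16, 5, 100))  # (625, 3125, 31)
```

With the default of 500, the same run would log only about 6 times, which gives a much coarser view of the loss curve.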

In [54]:
# The full hyperparameter configuration
print(training_args)

TrainingArguments(
_n_gpu=1,
accelerator_config={'split_batches': False, 'dispatch_batches': None, 'even_batches': True, 'use_seedable_sampler': True, 'non_blocking': False, 'gradient_accumulation_kwargs': None, 'use_configured_state': False},
adafactor=False,
adam_beta1=0.9,
adam_beta2=0.999,
adam_epsilon=1e-08,
auto_find_batch_size=False,
average_tokens_across_devices=False,
batch_eval_metrics=False,
bf16=False,
bf16_full_eval=False,
data_seed=None,
dataloader_drop_last=False,
dataloader_num_workers=0,
dataloader_persistent_workers=False,
dataloader_pin_memory=True,
dataloader_prefetch_factor=None,
ddp_backend=None,
ddp_broadcast_buffers=None,
ddp_bucket_cap_mb=None,
ddp_find_unused_parameters=None,
ddp_timeout=1800,
debug=[],
deepspeed=None,
disable_tqdm=False,
do_eval=False,
do_predict=False,
do_train=False,
eval_accumulation_steps=None,
eval_delay=0,
eval_do_concat_batches=True,
eval_on_start=False,
eval_steps=None,
eval_strategy=IntervalStrategy.NO,
eval_use_gather_object=False,


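### Evaluating Metrics During Training (Evaluate)

The **[Hugging Face Evaluate library](https://huggingface.co/docs/evaluate/index)** gives access, in a single line of code, to dozens of evaluation methods across domains (natural language processing, computer vision, reinforcement learning, and more). The full list of currently supported metrics is at https://huggingface.co/evaluate-metric

The Trainer does not automatically evaluate model performance during training, so we need to pass it a function that computes and reports metrics.

The Evaluate library provides a simple accuracy metric, which can be loaded with `evaluate.load`: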

In [55]:
import numpy as np
import evaluate

metric = evaluate.load("./metrics/accuracy")


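Next, call the metric's `compute` method to calculate the accuracy of the predictions.

Before passing predictions to `compute`, the logits must be converted into predicted class indices (**all Transformers models return logits**).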

In [56]:
def compute_metrics(eval_pred):
    logits, labels = eval_pred
    predictions = np.argmax(logits, axis=-1)
    return metric.compute(predictions=predictions, references=labels)
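The conversion that `compute_metrics` performs can be sketched in plain Python on a tiny hand-made batch. This is a stand-in for `np.argmax` and the accuracy metric, written only to show the logic; the logits and labels below are invented for illustration.

```python
def argmax(row):
    """Index of the largest value in `row` (the predicted class)."""
    return max(range(len(row)), key=row.__getitem__)


def accuracy(logits, labels):
    """Fraction of rows whose argmax matches the reference label."""
    predictions = [argmax(row) for row in logits]
    correct = sum(p == y for p, y in zip(predictions, labels))
    return {"accuracy": correct / len(labels)}


# Four examples, five classes each (toy values)
logits = [
    [2.0, 0.1, -1.0, 0.3, 0.0],  # argmax -> 0
    [0.2, 0.1, 0.4, 3.1, 0.0],   # argmax -> 3
    [0.0, 0.0, 0.1, 0.2, 1.5],   # argmax -> 4
    [1.2, 0.3, 0.1, 0.0, 0.9],   # argmax -> 0
]
labels = [0, 3, 4, 1]

print(accuracy(logits, labels))  # {'accuracy': 0.75}
```

No softmax is needed before the argmax, because softmax preserves the ordering of the logits.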

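#### Monitoring Metrics During Training

To monitor how evaluation metrics change during training, set the `eval_strategy` parameter in `TrainingArguments` (named `evaluation_strategy` in older Transformers releases) so that metrics are reported at the end of each epoch.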

In [57]:
from transformers import TrainingArguments, Trainer

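# Parameter notes:
# output_dir: directory where model outputs are saved
# eval_strategy: evaluation strategy; here evaluation runs at the end of every epoch
# per_device_train_batch_size: training batch size per device
# num_train_epochs: total number of training epochs
# logging_steps: number of steps between log entries
# run_name: name of this training run
# report_to: where logs are reported; here nothing is reported to any platform

training_args = TrainingArguments(
    output_dir=model_dir,                # model output directory
    eval_strategy="epoch",               # evaluate once per epoch
    per_device_train_batch_size=16,      # batch size per device
    num_train_epochs=3,                  # number of training epochs
    logging_steps=30,                    # logging interval in steps
    run_name="my_fine_run",              # run name
    report_to="none"                     # do not report to external platforms
)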

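## Start Training

### Instantiating the Trainer

`kernel version` issue: a warning about the kernel version may appear; it does not currently affect running this example.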

In [58]:
trainer = Trainer(
    model=model,
    args=training_args,
    train_dataset=small_train_dataset,
    eval_dataset=small_eval_dataset,
    compute_metrics=compute_metrics,
)

## Checking GPU Usage with nvidia-smi

To watch GPU usage in real time, poll with the `watch` command: `watch -n 1 nvidia-smi`

```shell
Every 1.0s: nvidia-smi                                                   Wed Dec 20 14:37:41 2023

Wed Dec 20 14:37:41 2023
+---------------------------------------------------------------------------------------+
| NVIDIA-SMI 535.129.03             Driver Version: 535.129.03   CUDA Version: 12.2     |
|-----------------------------------------+----------------------+----------------------+
| GPU  Name                 Persistence-M | Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp   Perf          Pwr:Usage/Cap |         Memory-Usage | GPU-Util  Compute M. |
|                                         |                      |               MIG M. |
|=========================================+======================+======================|
|   0  Tesla T4                       Off | 00000000:00:0D.0 Off |                    0 |
| N/A   64C    P0              69W /  70W |   6665MiB / 15360MiB |     98%      Default |
|                                         |                      |                  N/A |
+-----------------------------------------+----------------------+----------------------+

+---------------------------------------------------------------------------------------+
| Processes:                                                                            |
|  GPU   GI   CI        PID   Type   Process name                            GPU Memory |
|        ID   ID                                                             Usage      |
|=======================================================================================|
|    0   N/A  N/A     18395      C   /root/miniconda3/bin/python                6660MiB |
+---------------------------------------------------------------------------------------+
```
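If you want the memory figures as numbers rather than reading them off the table, the memory column can be pulled out with a regular expression. This is a small sketch that parses one line copied from the output above; the helper `parse_memory_usage` is hypothetical, and the pattern assumes the `usedMiB / totalMiB` layout shown there.

```python
import re


def parse_memory_usage(line):
    """Return (used_mib, total_mib, fraction_used), or None if the line has no memory column."""
    match = re.search(r"(\d+)MiB\s*/\s*(\d+)MiB", line)
    if match is None:
        return None
    used, total = int(match.group(1)), int(match.group(2))
    return used, total, used / total


line = "| N/A   64C    P0              69W /  70W |   6665MiB / 15360MiB |     98%      Default |"
used, total, fraction = parse_memory_usage(line)

print(f"{used} MiB of {total} MiB ({fraction:.1%})")  # 6665 MiB of 15360 MiB (43.4%)
```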

In [59]:
import torch

# Check whether CUDA is available
is_available = torch.cuda.is_available()
print(f"CUDA available: {is_available}")

if is_available:
    # Number of GPUs
    device_count = torch.cuda.device_count()
    print(f"Number of available GPUs: {device_count}")

    # Name of the first GPU (device index 0)
    current_device_name = torch.cuda.get_device_name(0)
    print(f"Current GPU name: {current_device_name}")
else:
    print("PyTorch could not detect any available CUDA device.")

CUDA available: True
Number of available GPUs: 1
Current GPU name: NVIDIA GeForce RTX 3050 OEM


In [60]:
trainer.train()

Epoch,Training Loss,Validation Loss


KeyboardInterrupt: 

In [None]:
small_test_dataset = tokenized_datasets["test"].shuffle(seed=64).select(range(100))

In [None]:
trainer.evaluate(small_test_dataset)

### Saving the Model and Training State

- Use `trainer.save_model` to save the model; it can later be reloaded with `from_pretrained()`
- Use `trainer.save_state` to save the training state

In [None]:
trainer.save_model(model_dir)

In [None]:
trainer.save_state()

In [None]:
# trainer.model.save_pretrained("./")
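Reloading the saved model later might look like the sketch below. The wrapper `load_finetuned_model` is a hypothetical helper that adds a directory check, and it assumes `trainer.save_model(model_dir)` has already written the model files to that directory. Transformers is imported inside the function so that the path check runs first.

```python
import os


def load_finetuned_model(model_dir):
    """Reload a sequence classification model previously saved to `model_dir`."""
    if not os.path.isdir(model_dir):
        raise FileNotFoundError(f"No saved model directory at {model_dir!r}")
    # Imported lazily: only needed once the directory is known to exist.
    from transformers import AutoModelForSequenceClassification
    return AutoModelForSequenceClassification.from_pretrained(model_dir)


# Example call, once training and saving have finished:
# model = load_finetuned_model("models/bert-base-cased-finetune-yelp")
```

The Trainer in this example was not given a tokenizer, so it is safest to assume the tokenizer must still be loaded separately from `bert-base-cased`.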

## Homework: Train on the full YelpReviewFull dataset and see how high the accuracy can go