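# Getting Started with Fine-Tuning in Hugging Face Transformers

This example walks through the main steps of fine-tuning a model with Transformers:
- Downloading the dataset
- Preprocessing the data
- Configuring training hyperparameters
- Setting up evaluation metrics
- Introducing the Trainer
- Running the training
- Saving the model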

## Download the dataset

In [1]:
from datasets import load_dataset

dataset = load_dataset("yelp_review_full")

In [2]:
dataset

DatasetDict({
    train: Dataset({
        features: ['label', 'text'],
        num_rows: 650000
    })
    test: Dataset({
        features: ['label', 'text'],
        num_rows: 50000
    })
})

## Inspect the data

In [3]:
import random
import pandas as pd
import datasets
from IPython.display import display, HTML
def show_random_elements(dataset, num_examples=10):
    assert num_examples <= len(dataset), "Can't pick more elements than there are in the dataset."
    picks = []
    for _ in range(num_examples):
        pick = random.randint(0, len(dataset)-1)
        while pick in picks:
            pick = random.randint(0, len(dataset)-1)
        picks.append(pick)
    
    df = pd.DataFrame(dataset[picks])
    for column, typ in dataset.features.items():
        if isinstance(typ, datasets.ClassLabel):
            df[column] = df[column].transform(lambda i: typ.names[i])
    display(HTML(df.to_html()))
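
The rejection loop in `show_random_elements` keeps drawing indices until it has enough distinct ones. As a sketch of an equivalent approach, `random.sample` from the standard library returns unique indices in a single call. The helper name `pick_unique_indices` and its `seed` parameter are hypothetical and not part of the original notebook.

```python
import random


def pick_unique_indices(dataset_size, num_examples=10, seed=None):
    """Return `num_examples` distinct row indices in [0, dataset_size)."""
    if num_examples > dataset_size:
        raise ValueError("Can't pick more elements than there are in the dataset.")
    rng = random.Random(seed)  # local generator, so the global random state is untouched
    return rng.sample(range(dataset_size), num_examples)


# 650000 is the size of the train split shown earlier in the notebook.
picks = pick_unique_indices(650000, num_examples=10, seed=0)
print(picks)
```

The resulting list could be used in place of `picks` when indexing the dataset.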

In [4]:
show_random_elements(dataset['train'])

Unnamed: 0,label,text
0,1 star,"This place is really pretty low budget. The day care aspect isn't terrible because it is really just a bunch of dogs running around an empty room and that isnt to hard to mess up. The staff is also friendly and but the real disaster of the place is grooming.\n\nI took my dog in to get groomed and the groomer cut my dog all the way down to his muscle!!!! When I picked him up I found out that he was cut 2-3 hours earlier and was just left alone while his arm was bleeding and muscle was exposed.\nI immediately took him to the vet who was absolutley appauled that he was left alone and untreated after such a serious injury. After the vet cleaned up all the dried blood around the wound, she cleaned all of the dirt out of the wound! She then patched his arm up with a few sutures and started him on pain meds and antibiotics for 10 days.\n\nWe will not be returning and warn all others to seriously consider using a differernt groomer."
1,4 stars,"This market was wonderful. It had a wide variety of items for sale (including chocolates :-) ) But one of the main reasons why people come here is for the coffee and fresh baked goods. One of the best croissants you will ever eat will be found here to say nothing about the bread and butter. I know these may seem like simple, common food items but when executed with excellence there is nothing better. I find myself longing to relive those food experiences that I was fortunate enough to share with my dear friend and family; I will remember it fondly."
2,3 stars,"I'm surprised that no one's bothered to review this place. I'm not even from Vegas, nor did I actually eat anything here!\n\nThat disclaimer aside, I'm giving it three stars based on the following:\n\n*The waiter may not have understood what 'vegan' meant (when I was trying to explain to him why I wasn't going to/couldn't order anything) but was ridiculously kind, and even offered to bring me a bowl of grapes.\n\n*The BF discovered that unlike all the IHOPs in CA, this IHOP had no vegetarian omelet on the menu, nor avocado as an ingredient in the create-your-own department. So he had to order a country omelet, and substitute tomatoes for the ham. The waiter looked confused, but did his best to accommodate him.\n\n*Dirty bathroom. Ewww...\n\n*Cherry Diet Coke = Yay!"
3,3 stars,"The main reason for coming to this theater is the location, it's the closest Harkins by our house. I also love the $1 refill souvenir cups! The chairs are pretty comfy but I must say AMC's are much better. The employees are nice/friendly. It's not the cleanest theater, but I wouldn't say it's dirty. The bathrooms are ok, kind of a mess at times. \n\nI have noticed they tend to encounter technical difficulties often. A few months back we went to go see The Other Guys & during previews the manager came in saying they were working on fixing the screen. There were doubles of everything, kinda like we were all drunk. Well 10 minutes later she came back saying they weren't able to fix it. WTF!? They gave each person 2 free passes & said we could go see another movie playing if we wanted to. We were already there & everything so we saw Dinner For Schmucks, awful movie! \n\nThen, this past Friday we went to go see Due Date (awesome movie, hilarious!). It was near the last 10-15 minutes of the movie & all of a sudden the sound went out! Luckily, the picture was fine so we could see what was happening but again WTF!? The manager came in while they stopped the movie & said they aren't able to \""rewind\"" the movie so everyone would get 1 free pass, pretty sweet considering we did see the movie & only missed a minute or 2."
4,3 stars,"Hmmmm. This is a Vegas classic, dark, smokey, cheap drinks and cheap food. I was there with a group of maybe 12. We did find seating, although we had to inhabit one of their 4- plex poker machine table thingys. I'm not much of a gambler but this place obviously caters to them big time. There are quite a few machines of all types stuffed into this place (and this place isn't all that big)\nThere was some kind of big wheel thing that people were spinning off and on throughout the night, not sure what you win, but it was the center of attention when someone stepped up to it.\nLike I said, cheap drinks and cheap food. Friendly bartenders are key, and they nailed that one on the proverbial head. You can tell the regulars there too, nice people they were. Easy to strike up some convo with. Not a place I'd go out of my way for, but if in the area, it's a good watering hole for sure."
5,2 star,"Reasonably priced breakfast buffet however, the buffet out of all hotel buffets on the strip is considered in my opinion as I'd say lower quality compared to for example the Mandalay bay. Food kinda blows and if you want breakfast here, you better come early to snag a table otherwise you'll be waiting forever."
6,4 stars,"One of my fav chinese spots in the valley! This is one of those places where if you choose the right menu items it's super awesome but I think they got a few things they are not great at, like their kung pow. I've been going here for years so here are some of my favorites: Hong Kong Style Chow Mein, Salt Pepper Pork, Sizzling Beef Short Ribs, Garlic Friend Shrimp, Crystal Shrimp, Lemon Chicken, Wong Jo Chicken, Madarin Beef. You'll find lots of local people here. Lots of asians! Good sign that the food is good right?"
7,2 star,This is the Walmart of electronics. They have everything you may need at good prices. Do your research online before coming in the store. They are infamous for having horrible customer service yet having tons of employees walking around. They completely ignore you and walk right by you. If you want good service and answers to questions go to best buy. This is a no-nonsense store. Get it and get out!
8,4 stars,"I don't usually venture past the strip unless it's for a good enough reason (like to catch a flight out of the country) but Todd's has been on my go to list and we had company staying on that side of town.\n\nI'll get the worst part of it over first; the ambiance fails miserably and the decor is on need of a major overhaul! Uncomfortable 80's diner drab booths, outdated doodie-brown color scheme, annoying noise from the TV and patrons in the bar area that intrudes into the dining area. \n\nOnce I moved past the absence of marble columns and chandeliers and tuned out the chatter (with the help of a delectable hibiscus coconut Mojito) I was able to see what made dining at Todd's worth the trek... the food is scrumptious.\n \nWe had:\n\nGoat cheese wontons with raspberry basil sauce.\n\nPotato and carrot samosas. \n\nFilet of beef grilled with wild mushroom Cabernet sauce.\n\nShort Rib with jalapeno mashed, caramelized onion sauce.\n\nOrganic chicken breast, substituted the jalapeno mashed for the toasted couscous.\n\nApple & blueberry cobble with buttermilk cinnamon ice cream.\n\nThe food gets a big Y for YUM and Martin, our server, was helpful and fantastic and in sync with our leisurely pace. \n\nToo bad the atmosphere doesn't meld with the food and service to make this a truly unique dining experience."
9,3 stars,"I really wanted Mexican food the day we flew in to Vegas. It' s just one of those things, you crave it and you have to have it. So I found this place, it had okay reviews so I decided to try it.\n\nThe restaurant had good ambiance. I thought we were going to be seated outside but we were sat inside which wasn't too bad. They have a huge projection screen where they were playing some channel I cannot remember. We were also next to the bar which looked to have a great selection of Tequila. \n\nI really wanted chicken flautas which they didn't have so I settled for crunchy chicken tacos. The server told us their sangrias were really good so I decided to try one of those too.\n\nThe sangria was actually pretty good, served with bits of green apple. That drink got me started for the night. \n\nThe crunchy chicken taco plate came with three tacos, black beans, white lime rice with cilantro, charred tomatillo salsa, guacamole, and sour cream. Put that all together and it made for a yummy meal. I feel the chicken could have had more flavor and also the menu stated they made their own shells but they looked more like the Taco Bell variety which I wasn't expecting. \n\nA downer to this restaurant, no free chips and salsa and the food was a bit pricey for the food. We ordered one sangria, the chicken taco plate and a ceviche appetizer. The tab was $40! Not too much bang for the buck here. Many will say that's Vegas but a different dining experience totaling the same dollar amount the next evenings makes me say I beg to differ."


## Preprocess the data: tokenize and pad to a uniform length

In [5]:
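from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("bert-base-cased")

def tokenize_func(examples):
    # Pad every review to the model's maximum length and truncate longer ones.
    return tokenizer(examples["text"], padding="max_length", truncation=True)

# batched=True passes a batch of rows to the function on each call.
tokenized_dataset = dataset.map(tokenize_func, batched=True)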
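
To make the effect of `padding="max_length"` and `truncation=True` concrete, here is a toy sketch in plain Python. It is not the tokenizer's implementation: it ignores special tokens such as `[CLS]` and `[SEP]`, the helper `pad_or_truncate` is hypothetical, and `PAD_ID = 0` is an assumption about the `[PAD]` id in the `bert-base-cased` vocabulary.

```python
PAD_ID = 0  # assumption: id of the [PAD] token in the bert-base-cased vocabulary


def pad_or_truncate(input_ids, max_length):
    """Toy version of padding="max_length" combined with truncation=True."""
    ids = input_ids[:max_length]                      # truncation: drop tokens beyond max_length
    attention_mask = [1] * len(ids)                   # 1 marks a real token
    padding = max_length - len(ids)
    ids = ids + [PAD_ID] * padding                    # padding: fill up to max_length
    attention_mask = attention_mask + [0] * padding   # 0 marks a padding position
    return {"input_ids": ids, "attention_mask": attention_mask}


short = pad_or_truncate([101, 1284, 1355, 102], max_length=8)
long = pad_or_truncate(list(range(1, 13)), max_length=8)
print(short)
print(long)
```

Every sequence comes out with the same length, which is what lets the rows be stacked into fixed-shape batches during training.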

In [6]:
tokenized_dataset

DatasetDict({
    train: Dataset({
        features: ['label', 'text', 'input_ids', 'token_type_ids', 'attention_mask'],
        num_rows: 650000
    })
    test: Dataset({
        features: ['label', 'text', 'input_ids', 'token_type_ids', 'attention_mask'],
        num_rows: 50000
    })
})

In [7]:
show_random_elements(tokenized_dataset['train'],1)

Unnamed: 0,label,text,input_ids,token_type_ids,attention_mask
0,5 stars,"We went fully prepared for great food and surly service. Upon arrival, the place was packed, so we sat at the bar right in front of the grill. Best seats in the house as far as we're concerned. The service was quick and to the point. I've spent 15 years working in restaurants and these guys are *busy*. There isn't enough time to dote on needy patrons. The food speaks for itself.\n\nWe started with the grilled octopus app. Shared a litre of the house red. She had the grilled sardines and I had the chicken as the main. Best dining experience during our Montreal trip. We'll definitely return.\n\nJust before we left, the guitarist came out and started entertaining the guests. Don't let anyone suggest that Chez Doval isn't welcoming.","[101, 1284, 1355, 3106, 4029, 1111, 1632, 2094, 1105, 8910, 1193, 1555, 119, 4352, 4870, 117, 1103, 1282, 1108, 8733, 117, 1177, 1195, 2068, 1120, 1103, 2927, 1268, 1107, 1524, 1104, 1103, 176, 11071, 119, 1798, 3474, 1107, 1103, 1402, 1112, 1677, 1112, 1195, 112, 1231, 4264, 119, 1109, 1555, 1108, 3613, 1105, 1106, 1103, 1553, 119, 146, 112, 1396, 2097, 1405, 1201, 1684, 1107, 7724, 1105, 1292, 3713, 1132, 115, 5116, 115, 119, 1247, 2762, 112, 189, 1536, 1159, 1106, 15645, 1162, 1113, 27819, 14645, 119, 1109, 2094, 8917, 1111, 2111, 119, 165, 183, 165, 183, 2924, 1162, 1408, ...]","[0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, ...]","[1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, ...]"


## Sample a small subset of the data

In [8]:
# Shuffle with a fixed seed, then keep the first 1,000 rows of each split.
small_train_dataset = tokenized_dataset["train"].shuffle(seed=42).select(range(1000))
small_eval_dataset = tokenized_dataset["test"].shuffle(seed=42).select(range(1000))
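
The shuffle-then-select pattern yields a random subset that is reproducible because the seed is fixed. The following is an analogy in plain Python, not how `datasets` implements `shuffle` and `select` internally, and the indices it picks will differ from the library's. The helper `shuffle_and_select` is hypothetical.

```python
import random


def shuffle_and_select(num_rows, n, seed):
    """Return the indices of a seeded random subset of size n."""
    indices = list(range(num_rows))
    random.Random(seed).shuffle(indices)  # same seed -> same permutation
    return indices[:n]                    # keep the first n shuffled rows


first = shuffle_and_select(650000, 1000, seed=42)
second = shuffle_and_select(650000, 1000, seed=42)
other = shuffle_and_select(650000, 1000, seed=7)
print(first == second, first == other)
```

Re-running with the same seed gives the same subset, so training runs stay comparable; changing the seed gives a different subset.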


## Load the model

In [9]:
from transformers import AutoModelForSequenceClassification

# Yelp reviews carry one of five star ratings, hence num_labels=5.
model = AutoModelForSequenceClassification.from_pretrained("bert-base-cased", num_labels=5)

Some weights of BertForSequenceClassification were not initialized from the model checkpoint at bert-base-cased and are newly initialized: ['classifier.bias', 'classifier.weight']
You should probably TRAIN this model on a down-stream task to be able to use it for predictions and inference.


## Configure hyperparameters

In [10]:
from transformers import TrainingArguments

model_dir=r"E:\model\language\fine-tuning\bert-base-cased-by-yelp"

training_arg=TrainingArguments(
    output_dir=model_dir,
    per_device_train_batch_size=4,
    num_train_epochs=1,
    logging_steps=100,
)

In [11]:
print(training_arg)

TrainingArguments(
_n_gpu=1,
accelerator_config={'split_batches': False, 'dispatch_batches': None, 'even_batches': True, 'use_seedable_sampler': True},
adafactor=False,
adam_beta1=0.9,
adam_beta2=0.999,
adam_epsilon=1e-08,
auto_find_batch_size=False,
bf16=False,
bf16_full_eval=False,
data_seed=None,
dataloader_drop_last=False,
dataloader_num_workers=0,
dataloader_persistent_workers=False,
dataloader_pin_memory=True,
dataloader_prefetch_factor=None,
ddp_backend=None,
ddp_broadcast_buffers=None,
ddp_bucket_cap_mb=None,
ddp_find_unused_parameters=None,
ddp_timeout=1800,
debug=[],
deepspeed=None,
disable_tqdm=False,
dispatch_batches=None,
do_eval=False,
do_predict=False,
do_train=False,
eval_accumulation_steps=None,
eval_delay=0,
eval_steps=None,
evaluation_strategy=no,
fp16=False,
fp16_backend=auto,
fp16_full_eval=False,
fp16_opt_level=O1,
fsdp=[],
fsdp_config={'min_num_params': 0, 'xla': False, 'xla_fsdp_v2': False, 'xla_fsdp_grad_ckpt': False},
fsdp_min_num_params=0,
fsdp_transformer_la

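### Evaluating metrics during training (Evaluate)

The **[Hugging Face Evaluate library](https://huggingface.co/docs/evaluate/index)** gives access, in a single line of code, to dozens of evaluation methods across domains (natural language processing, computer vision, reinforcement learning, and more). **Full list of currently supported metrics: https://huggingface.co/evaluate-metric**

The Trainer does not automatically evaluate model performance during training, so we need to pass it a function that computes and reports metrics.

The Evaluate library provides a simple accuracy metric, which you can load with the `evaluate.load` function: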

In [12]:
import numpy as np
import evaluate

metric = evaluate.load("accuracy")

In [13]:
def compute_metrics(eval_pred):
    logits, labels = eval_pred
    predictions = np.argmax(logits, axis=-1)
    return metric.compute(predictions=predictions, references=labels)
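
To show what `compute_metrics` does to its inputs, here is a small sketch on made-up logits. It computes accuracy by hand with NumPy instead of calling `evaluate`, so it runs without loading the metric. The values are illustrative and are not real model output.

```python
import numpy as np

# Made-up logits for 4 examples and 5 classes (not real model output).
logits = np.array([
    [2.0, 0.1, 0.3, 0.2, 0.1],   # highest score at class 0
    [0.1, 0.2, 3.0, 0.1, 0.4],   # highest score at class 2
    [0.3, 0.2, 0.1, 0.5, 2.5],   # highest score at class 4
    [0.2, 1.5, 0.3, 0.1, 0.2],   # highest score at class 1
])
labels = np.array([0, 2, 3, 1])

predictions = np.argmax(logits, axis=-1)            # predicted class per row
accuracy = float((predictions == labels).mean())    # fraction of correct predictions
print(predictions.tolist(), accuracy)
```

Three of the four predictions match their labels here. In the notebook, the same quantity is returned under the key `accuracy` and appears in the training logs as `eval_accuracy`.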

In [14]:
from transformers import TrainingArguments, Trainer

training_arg = TrainingArguments(output_dir=model_dir,
                                  evaluation_strategy="epoch", 
                                  per_device_train_batch_size=3,
                                  num_train_epochs=3,
                                  logging_steps=30)
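
The progress-bar totals in the training output (1002 training steps, 125 evaluation steps) follow from these settings by simple arithmetic. The sketch below assumes a single device (the printed arguments show `_n_gpu=1`), no gradient accumulation, and the default `per_device_eval_batch_size` of 8, which the notebook does not set explicitly.

```python
import math

train_rows = 1000          # size of small_train_dataset
eval_rows = 1000           # size of small_eval_dataset
train_batch_size = 3       # per_device_train_batch_size
eval_batch_size = 8        # assumption: default per_device_eval_batch_size
num_train_epochs = 3

steps_per_epoch = math.ceil(train_rows / train_batch_size)   # last batch may be partial
total_train_steps = steps_per_epoch * num_train_epochs
eval_steps = math.ceil(eval_rows / eval_batch_size)
print(steps_per_epoch, total_train_steps, eval_steps)
```

With `logging_steps=30`, a loss line is therefore logged roughly every 0.09 epochs (30 / 334).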

## Training

In [15]:
trainer=Trainer(
    model=model,
    args=training_arg,
    train_dataset=small_train_dataset,
    eval_dataset=small_eval_dataset,
    compute_metrics=compute_metrics
)

In [16]:
trainer.train()

  0%|          | 0/1002 [00:00<?, ?it/s]

{'loss': 1.6564, 'grad_norm': 9.190133094787598, 'learning_rate': 4.8502994011976046e-05, 'epoch': 0.09}
{'loss': 1.6423, 'grad_norm': 13.536694526672363, 'learning_rate': 4.70059880239521e-05, 'epoch': 0.18}
{'loss': 1.6969, 'grad_norm': 16.42144203186035, 'learning_rate': 4.550898203592814e-05, 'epoch': 0.27}
{'loss': 1.7041, 'grad_norm': 9.768793106079102, 'learning_rate': 4.40119760479042e-05, 'epoch': 0.36}
{'loss': 1.6783, 'grad_norm': 16.96868896484375, 'learning_rate': 4.251497005988024e-05, 'epoch': 0.45}
{'loss': 1.6081, 'grad_norm': 9.759632110595703, 'learning_rate': 4.101796407185629e-05, 'epoch': 0.54}
{'loss': 1.4754, 'grad_norm': 17.274873733520508, 'learning_rate': 3.9520958083832336e-05, 'epoch': 0.63}
{'loss': 1.4431, 'grad_norm': 11.90774917602539, 'learning_rate': 3.802395209580839e-05, 'epoch': 0.72}
{'loss': 1.2927, 'grad_norm': 26.799957275390625, 'learning_rate': 3.652694610778443e-05, 'epoch': 0.81}
{'loss': 1.3105, 'grad_norm': 9.593435287475586, 'learning_ra

  0%|          | 0/125 [00:00<?, ?it/s]

{'eval_loss': 1.2845370769500732, 'eval_accuracy': 0.397, 'eval_runtime': 41.8419, 'eval_samples_per_second': 23.9, 'eval_steps_per_second': 2.987, 'epoch': 1.0}
{'loss': 1.223, 'grad_norm': 36.22141647338867, 'learning_rate': 3.2035928143712576e-05, 'epoch': 1.08}
{'loss': 1.2044, 'grad_norm': 10.429362297058105, 'learning_rate': 3.053892215568863e-05, 'epoch': 1.17}
{'loss': 1.2009, 'grad_norm': 20.184120178222656, 'learning_rate': 2.9041916167664674e-05, 'epoch': 1.26}
{'loss': 1.0517, 'grad_norm': 17.230878829956055, 'learning_rate': 2.754491017964072e-05, 'epoch': 1.35}
{'loss': 1.0729, 'grad_norm': 33.56970977783203, 'learning_rate': 2.604790419161677e-05, 'epoch': 1.44}


Checkpoint destination directory E:\model\language\fine-tuning\bert-base-cased-by-yelp\checkpoint-500 already exists and is non-empty. Saving will proceed but saved results may be invalid.


{'loss': 1.2233, 'grad_norm': 9.063084602355957, 'learning_rate': 2.4550898203592816e-05, 'epoch': 1.53}
{'loss': 1.1228, 'grad_norm': 8.379409790039062, 'learning_rate': 2.3053892215568866e-05, 'epoch': 1.62}
{'loss': 1.1801, 'grad_norm': 23.561912536621094, 'learning_rate': 2.155688622754491e-05, 'epoch': 1.71}
{'loss': 1.1849, 'grad_norm': 9.996758460998535, 'learning_rate': 2.0059880239520957e-05, 'epoch': 1.8}
{'loss': 1.0159, 'grad_norm': 27.484691619873047, 'learning_rate': 1.8562874251497005e-05, 'epoch': 1.89}
{'loss': 0.9491, 'grad_norm': 9.6849946975708, 'learning_rate': 1.7065868263473055e-05, 'epoch': 1.98}


  0%|          | 0/125 [00:00<?, ?it/s]

{'eval_loss': 1.2204817533493042, 'eval_accuracy': 0.492, 'eval_runtime': 47.8557, 'eval_samples_per_second': 20.896, 'eval_steps_per_second': 2.612, 'epoch': 2.0}
{'loss': 0.8989, 'grad_norm': 34.590579986572266, 'learning_rate': 1.5568862275449103e-05, 'epoch': 2.07}
{'loss': 0.7418, 'grad_norm': 7.8535943031311035, 'learning_rate': 1.407185628742515e-05, 'epoch': 2.16}
{'loss': 0.6332, 'grad_norm': 14.910706520080566, 'learning_rate': 1.2574850299401197e-05, 'epoch': 2.25}
{'loss': 0.866, 'grad_norm': 23.167125701904297, 'learning_rate': 1.1077844311377246e-05, 'epoch': 2.34}
{'loss': 0.687, 'grad_norm': 15.934168815612793, 'learning_rate': 9.580838323353295e-06, 'epoch': 2.43}
{'loss': 0.9063, 'grad_norm': 27.37175178527832, 'learning_rate': 8.083832335329342e-06, 'epoch': 2.51}
{'loss': 0.744, 'grad_norm': 6.055376052856445, 'learning_rate': 6.58682634730539e-06, 'epoch': 2.6}
{'loss': 0.6884, 'grad_norm': 11.331487655639648, 'learning_rate': 5.0898203592814375e-06, 'epoch': 2.69}

Checkpoint destination directory E:\model\language\fine-tuning\bert-base-cased-by-yelp\checkpoint-1000 already exists and is non-empty. Saving will proceed but saved results may be invalid.


  0%|          | 0/125 [00:00<?, ?it/s]

{'eval_loss': 1.1581151485443115, 'eval_accuracy': 0.558, 'eval_runtime': 42.5465, 'eval_samples_per_second': 23.504, 'eval_steps_per_second': 2.938, 'epoch': 3.0}
{'train_runtime': 574.1883, 'train_samples_per_second': 5.225, 'train_steps_per_second': 1.745, 'train_loss': 1.136073023973111, 'epoch': 3.0}


TrainOutput(global_step=1002, training_loss=1.136073023973111, metrics={'train_runtime': 574.1883, 'train_samples_per_second': 5.225, 'train_steps_per_second': 1.745, 'train_loss': 1.136073023973111, 'epoch': 3.0})
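
The `learning_rate` values in the log are consistent with a linear decay from 5e-5 to zero over the 1002 steps. The sketch below reproduces them, assuming the `TrainingArguments` defaults that the notebook leaves untouched (base learning rate 5e-5, linear scheduler, no warmup). The function `linear_lr` is a hypothetical helper, not a library call.

```python
def linear_lr(step, total_steps, base_lr=5e-5):
    """Learning rate after `step` optimizer steps under linear decay, no warmup."""
    return base_lr * max(0.0, (total_steps - step) / total_steps)


# The first three log lines are emitted at steps 30, 60 and 90 (logging_steps=30).
for step in (30, 60, 90):
    print(step, linear_lr(step, total_steps=1002))
```

The printed values can be compared with the first three `learning_rate` entries in the training log.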

## Testing

In [17]:
t_dataset = tokenized_dataset["test"].shuffle(seed=64).select(range(100))

In [18]:
trainer.evaluate(t_dataset)

  0%|          | 0/13 [00:00<?, ?it/s]

{'eval_loss': 1.3711062669754028,
 'eval_accuracy': 0.5,
 'eval_runtime': 4.5916,
 'eval_samples_per_second': 21.779,
 'eval_steps_per_second': 2.831,
 'epoch': 3.0}

In [19]:
trainer.save_model(model_dir)

In [20]:
trainer.save_state()