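The main workflow for fine-tuning a model with Transformers includes:
- Downloading the dataset
- Preprocessing the data
- Configuring training hyperparameters
- Setting up training evaluation metrics
- Running the training
- Saving the model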

1. Download the dataset (train and test splits)

In [6]:
from datasets import load_dataset

dataset = load_dataset("yelp_review_full")

Generating train split: 650000 examples [00:01, 579665.69 examples/s]
Generating test split: 50000 examples [00:00, 543726.21 examples/s]


In [7]:
dataset

DatasetDict({
    train: Dataset({
        features: ['label', 'text'],
        num_rows: 650000
    })
    test: Dataset({
        features: ['label', 'text'],
        num_rows: 50000
    })
})

In [8]:
dataset["train"][1]

{'label': 1,
 'text': "Unfortunately, the frustration of being Dr. Goldberg's patient is a repeat of the experience I've had with so many other doctors in NYC -- good doctor, terrible staff.  It seems that his staff simply never answers the phone.  It usually takes 2 hours of repeated calling to get an answer.  Who has time for that or wants to deal with it?  I have run into this problem with many other doctors and I just don't get it.  You have office workers, you have patients with medical needs, why isn't anyone answering the phone?  It's incomprehensible and not work the aggravation.  It's with regret that I feel that I have to give Dr. Goldberg 2 stars."}
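
The label is zero-based: the review above carries label 1 while its text mentions 2 stars. A minimal sketch of that mapping, assuming labels 0 to 4 correspond to 1 to 5 stars (`label_to_stars` is a hypothetical helper for illustration, not part of the `datasets` library):

```python
def label_to_stars(label: int) -> int:
    """Map a zero-based class label (0-4) to a star rating (1-5)."""
    if not 0 <= label <= 4:
        raise ValueError(f"expected a label in 0..4, got {label}")
    return label + 1


# The sample shown above has label 1, which corresponds to 2 stars.
print(label_to_stars(1))
```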

2. Data preprocessing

In [9]:
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("bert-base-cased")

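def tokenize_function(examples):
    # Tokenize a batch of reviews, requesting padding and truncation
    # to the tokenizer's maximum length.
    return tokenizer(examples["text"], padding="max_length", truncation=True)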

tokenized_datasets = dataset.map(tokenize_function, batched=True)

Map:   0%|          | 0/650000 [00:00<?, ? examples/s]
Asking to pad to max_length but no maximum length is provided and the model has no predefined maximum length. Default to no padding.
Asking to truncate to max_length but no maximum length is provided and the model has no predefined maximum length. Default to no truncation.
Map: 100%|██████████| 650000/650000 [01:49<00:00, 5933.79 examples/s]
Map: 100%|██████████| 50000/50000 [00:08<00:00, 6025.95 examples/s]
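
The two warnings in this output indicate that padding and truncation were skipped in this run, because the tokenizer had no maximum length to work with. Passing an explicit `max_length` to the tokenizer call (for example `max_length=512`, a value assumed here and not set anywhere in the notebook) is a common way to address this. The sketch below only illustrates what padding and truncating to a fixed length mean; `pad_or_truncate` is a hypothetical helper, and a real tokenizer additionally keeps special tokens such as the closing `[SEP]` when it truncates.

```python
def pad_or_truncate(input_ids, max_length, pad_id=0):
    """Illustrative only: force a token-id list to exactly max_length.

    pad_id=0 is an assumption for the padding token id.
    """
    ids = list(input_ids[:max_length])   # truncation: drop ids past max_length
    mask = [1] * len(ids)                # 1 marks a real token
    n_pad = max_length - len(ids)        # how many padding slots are needed
    ids += [pad_id] * n_pad              # padding: fill up to max_length
    mask += [0] * n_pad                  # 0 marks a padding position
    return {"input_ids": ids, "attention_mask": mask}


short = pad_or_truncate([101, 1753, 102], max_length=5)
print(short["input_ids"])       # [101, 1753, 102, 0, 0]
print(short["attention_mask"])  # [1, 1, 1, 0, 0]
```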


In [11]:
import random
import pandas as pd
import datasets
from IPython.display import display, HTML

In [12]:
def show_random_elements(dataset, num_examples=10):
    """Display num_examples randomly chosen rows of dataset as an HTML table."""
    assert num_examples <= len(dataset), "Can't pick more elements than there are in the dataset."
    # Draw distinct row indices without replacement.
    picks = random.sample(range(len(dataset)), num_examples)

    df = pd.DataFrame(dataset[picks])
    # Replace integer class ids with their human-readable label names.
    for column, typ in dataset.features.items():
        if isinstance(typ, datasets.ClassLabel):
            df[column] = df[column].transform(lambda i: typ.names[i])
    display(HTML(df.to_html()))

In [13]:
show_random_elements(tokenized_datasets["train"], num_examples=1)

Unnamed: 0,label,text,input_ids,token_type_ids,attention_mask
0,2 star,"Not a very good Wal Mart. The store is messy and dirty, as well as dimly lit in spots. I only encountered one employee and she was friendly enough, but it wasn't enough to save this place for me. Would steer clear if I were you.","[101, 1753, 170, 1304, 1363, 160, 1348, 24341, 119, 1109, 2984, 1110, 20549, 1105, 7320, 117, 1112, 1218, 1112, 12563, 1193, 4941, 1107, 7152, 119, 146, 1178, 8181, 1141, 7775, 1105, 1131, 1108, 4931, 1536, 117, 1133, 1122, 1445, 112, 189, 1536, 1106, 3277, 1142, 1282, 1111, 1143, 119, 5718, 25284, 2330, 1191, 146, 1127, 1128, 119, 102]","[0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0]","[1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1]"


3. Data sampling (the full dataset is used)

In [14]:
# Despite the small_ prefix, the full train and test splits are used here.
small_train_dataset = tokenized_datasets["train"]
small_eval_dataset = tokenized_datasets["test"]
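
For a quicker trial run, a subset could be drawn instead of the full splits. The sketch below picks reproducible row indices with the standard library only; `sample_indices`, the subset size of 1,000, and the seed are assumptions for illustration. With the `datasets` library, such indices could then be passed to `Dataset.select`.

```python
import random


def sample_indices(num_rows, k, seed=42):
    """Pick k distinct row indices reproducibly (hypothetical helper)."""
    if k > num_rows:
        raise ValueError("cannot sample more rows than the dataset holds")
    return random.Random(seed).sample(range(num_rows), k)


# 650,000 is the size of the train split shown earlier; 1,000 is an arbitrary subset size.
train_idx = sample_indices(650_000, 1_000)

# Possible use with the objects defined in this notebook:
# small_train_dataset = tokenized_datasets["train"].select(train_idx)
print(len(train_idx), len(set(train_idx)))  # 1000 1000
```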

4. Load the BERT model

In [15]:
from transformers import AutoModelForSequenceClassification

model = AutoModelForSequenceClassification.from_pretrained("bert-base-cased", num_labels=5)

Some weights of BertForSequenceClassification were not initialized from the model checkpoint at ./models/bert-base-cased and are newly initialized: ['classifier.bias', 'classifier.weight']
You should probably TRAIN this model on a down-stream task to be able to use it for predictions and inference.
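
The message above is expected: the pretrained checkpoint has no classification head, so `classifier.weight` and `classifier.bias` start from a fresh initialization and are learned during fine-tuning. A rough count of those new parameters, assuming a hidden size of 768 for bert-base-cased and a single linear layer on top of the encoder:

```python
hidden_size = 768  # assumed hidden size of bert-base-cased
num_labels = 5     # matches num_labels=5 in the cell above

classifier_weight = hidden_size * num_labels  # weight matrix of shape (5, 768)
classifier_bias = num_labels                  # bias vector of shape (5,)
new_params = classifier_weight + classifier_bias

print(new_params)  # 3845
```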


5. Configure training hyperparameters (TrainingArguments)

In [1]:
from transformers import TrainingArguments

# Path where the model weights are saved (output_dir)
model_dir = "./models/bert-base-cased-finetune-yelp"

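# To monitor how the evaluation metrics change during training, set evaluation_strategy
# in TrainingArguments so that metrics are reported at the end of each epoch
# (newer Transformers releases name this argument eval_strategy).
# logging_steps defaults to 500; given our training data size and step count, set it to 100.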
training_args = TrainingArguments(output_dir=model_dir,
                                  evaluation_strategy="epoch",
                                  per_device_train_batch_size=16,
                                  num_train_epochs=3,
                                  logging_steps=100)

# Print the full hyperparameter configuration
print(training_args)
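
To see why a smaller `logging_steps` makes sense here, the number of optimizer steps can be estimated from the values above. This is back-of-the-envelope arithmetic that assumes a single device and no gradient accumulation, neither of which the notebook states:

```python
import math

num_train_examples = 650_000       # size of the train split
per_device_train_batch_size = 16   # from TrainingArguments above
num_train_epochs = 3               # from TrainingArguments above
logging_steps = 100                # from TrainingArguments above
num_devices = 1                    # assumption: one GPU/CPU
gradient_accumulation_steps = 1    # assumption: no accumulation

effective_batch = per_device_train_batch_size * num_devices * gradient_accumulation_steps
steps_per_epoch = math.ceil(num_train_examples / effective_batch)
total_steps = steps_per_epoch * num_train_epochs
log_events = total_steps // logging_steps

print(steps_per_epoch, total_steps, log_events)  # 40625 121875 1218
```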



6. Set up training evaluation metrics

In [7]:
import numpy as np
import evaluate

metric = evaluate.load("accuracy")

In [None]:
def compute_metrics(eval_pred):
    logits, labels = eval_pred
    predictions = np.argmax(logits, axis=-1)
    return metric.compute(predictions=predictions, references=labels)
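
The same computation can be traced by hand on a tiny batch. The sketch below uses plain Python instead of NumPy and `evaluate`, with made-up logits for four examples and five classes; `argmax` and `accuracy_from_logits` are hypothetical names for illustration.

```python
def argmax(row):
    """Index of the largest value in a list of scores."""
    return max(range(len(row)), key=row.__getitem__)


def accuracy_from_logits(logits, labels):
    """Take the highest-scoring class per example and compare with the labels."""
    predictions = [argmax(row) for row in logits]
    correct = sum(int(p == y) for p, y in zip(predictions, labels))
    return {"accuracy": correct / len(labels)}


# Made-up scores: one row per example, one column per class.
logits = [
    [0.1, 2.0, 0.3, 0.0, 0.0],  # predicted class 1
    [1.5, 0.2, 0.1, 0.0, 0.0],  # predicted class 0
    [0.0, 0.1, 0.2, 0.3, 3.0],  # predicted class 4
    [0.0, 0.9, 0.8, 0.1, 0.0],  # predicted class 1
]
labels = [1, 0, 4, 2]           # the last example is misclassified

print(accuracy_from_logits(logits, labels))  # {'accuracy': 0.75}
```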

7. Instantiate the Trainer

In [None]:
from transformers import Trainer

trainer = Trainer(
    model=model,
    args=training_args,
    train_dataset=small_train_dataset,
    eval_dataset=small_eval_dataset,
    compute_metrics=compute_metrics,
)

8. Run training

In [None]:
trainer.train()

9. Evaluate after training

In [None]:
small_test_dataset = tokenized_datasets["test"]

In [None]:
trainer.evaluate(small_test_dataset)

10. Save the model and training state

In [None]:
trainer.save_model(model_dir)

In [None]:
trainer.save_state()
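
`trainer.save_state()` writes a `trainer_state.json` file into the output directory. A minimal sketch for inspecting it afterwards, assuming the file contains `global_step`, `epoch`, and `log_history` entries (the helper names are hypothetical, and missing keys are tolerated):

```python
import json
from pathlib import Path


def summarize_state(state):
    """Extract a few progress fields from a parsed trainer state dict."""
    return {
        "global_step": state.get("global_step"),
        "epoch": state.get("epoch"),
        "num_log_entries": len(state.get("log_history", [])),
    }


def load_trainer_state(output_dir):
    """Read trainer_state.json from output_dir and summarize it."""
    state_path = Path(output_dir) / "trainer_state.json"
    with state_path.open(encoding="utf-8") as f:
        return summarize_state(json.load(f))
```

After the cells above have run, `load_trainer_state(model_dir)` would return the summary for this training run.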