# Fine-tuning BERT with PyTorch (Lightning Version)

Load a pretrained BERT model from the Hugging Face Hub, then fine-tune it with a hand-written PyTorch model, using the following setup:
- Pretrained model: bert-base-uncased
- Downstream task: GLUE/SST-2

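Fine-tuning BERT with PyTorch involves the following steps:
- Data preprocessing: load the dataset and define the `Dataset` and `DataLoader`
- Model definition: add a fully connected layer on top of the base BERT model as the classification head
- Model fine-tuning: fine-tune the model with the AdamW optimizer
- Model validation: compute the model's accuracy on the validation set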

In [1]:
import torch
from torch import nn
import torch.nn.functional as F
from torch.utils.data import DataLoader
from sklearn.metrics import confusion_matrix, classification_report
from transformers import (
    BertTokenizerFast,
    BertModel,
    DataCollatorWithPadding
)
from datasets import load_dataset
import pytorch_lightning as pl

## Loading the dataset

Use `datasets.load_dataset` to load the dataset for the GLUE/SST-2 task from the Hugging Face Hub.

In [2]:
raw_datasets = load_dataset("glue", "sst2")
raw_datasets

Reusing dataset glue (/home/wh/.cache/huggingface/datasets/glue/sst2/1.0.0/dacbe3125aa31d7f70367a07a8a9e72a5a0bfeb5fc42e75c9db75b96da6053ad)


DatasetDict({
    train: Dataset({
        features: ['sentence', 'label', 'idx'],
        num_rows: 67349
    })
    validation: Dataset({
        features: ['sentence', 'label', 'idx'],
        num_rows: 872
    })
    test: Dataset({
        features: ['sentence', 'label', 'idx'],
        num_rows: 1821
    })
})

Tokenize the raw dataset.

In [3]:
tokenizer = BertTokenizerFast.from_pretrained("bert-base-uncased")

def preprocessing(examples):
    """Preprocessing function that tokenizes the sentences."""
    return tokenizer(examples["sentence"], padding="max_length", max_length=60, truncation=True)

tokenized_datasets = raw_datasets.map(preprocessing, batched=True)
tokenized_datasets

Loading cached processed dataset at /home/wh/.cache/huggingface/datasets/glue/sst2/1.0.0/dacbe3125aa31d7f70367a07a8a9e72a5a0bfeb5fc42e75c9db75b96da6053ad/cache-b6e0c246248e5535.arrow
Loading cached processed dataset at /home/wh/.cache/huggingface/datasets/glue/sst2/1.0.0/dacbe3125aa31d7f70367a07a8a9e72a5a0bfeb5fc42e75c9db75b96da6053ad/cache-1fd84aa32a3195c3.arrow
Loading cached processed dataset at /home/wh/.cache/huggingface/datasets/glue/sst2/1.0.0/dacbe3125aa31d7f70367a07a8a9e72a5a0bfeb5fc42e75c9db75b96da6053ad/cache-e6aa15cb6563b44c.arrow


DatasetDict({
    train: Dataset({
        features: ['sentence', 'label', 'idx', 'input_ids', 'token_type_ids', 'attention_mask'],
        num_rows: 67349
    })
    validation: Dataset({
        features: ['sentence', 'label', 'idx', 'input_ids', 'token_type_ids', 'attention_mask'],
        num_rows: 872
    })
    test: Dataset({
        features: ['sentence', 'label', 'idx', 'input_ids', 'token_type_ids', 'attention_mask'],
        num_rows: 1821
    })
})
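To make the effect of `padding="max_length"`, `max_length=60` and `truncation=True` more concrete, here is a minimal pure-Python sketch of what happens to a single sequence of token ids. The helper `pad_or_truncate` and the example ids are hypothetical and only illustrate the idea. The real tokenizer also inserts special tokens, keeps the final `[SEP]` when truncating, and produces `token_type_ids`.

```python
def pad_or_truncate(input_ids, max_length=60, pad_id=0):
    """Hypothetical helper: cut or pad one sequence to exactly max_length ids."""
    ids = list(input_ids[:max_length])        # truncation: drop everything past max_length
    attention_mask = [1] * len(ids)           # 1 marks a real token
    n_pad = max_length - len(ids)
    return {
        "input_ids": ids + [pad_id] * n_pad,            # padding: fill up with pad_id
        "attention_mask": attention_mask + [0] * n_pad,  # 0 marks padding
    }

short = pad_or_truncate([101, 2023, 3185, 102])   # illustrative ids only
long = pad_or_truncate(list(range(1, 101)))       # 100 ids, longer than max_length

print(len(short["input_ids"]), sum(short["attention_mask"]))  # 60 4
print(len(long["input_ids"]), sum(long["attention_mask"]))    # 60 60
```

Every example therefore ends up with exactly 60 ids, and the attention mask tells the model which positions to ignore.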

Take the training and validation splits from the dataset, and remove the `sentence` and `idx` columns, which are not needed during training.

In [4]:
train_dataset = tokenized_datasets["train"].remove_columns(["sentence", "idx"])
eval_dataset = tokenized_datasets["validation"].remove_columns(["sentence", "idx"])
train_dataset

Dataset({
    features: ['label', 'input_ids', 'token_type_ids', 'attention_mask'],
    num_rows: 67349
})

Define the `DataLoader`s.

In [5]:
data_collator = DataCollatorWithPadding(tokenizer)
train_dataloader = DataLoader(train_dataset, batch_size=64, shuffle=True, collate_fn=data_collator, num_workers=8)
eval_dataloader = DataLoader(eval_dataset, batch_size=64, collate_fn=data_collator, num_workers=8)

In [6]:
batch = next(iter(train_dataloader))
batch.keys()

dict_keys(['input_ids', 'token_type_ids', 'attention_mask', 'labels'])
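Note that the dataset column is called `label`, while the batch key is `labels`: `DataCollatorWithPadding` renames it when it builds the batch. Since every example was already padded to 60 tokens during tokenization, the collator has little padding left to do here and mainly turns lists into tensors. The sketch below imitates that behavior with a hypothetical `collate` function on made-up features. It assumes all sequences already have the same length and is not the library's implementation.

```python
import torch

def collate(features):
    """Hypothetical stand-in for the collator: stack fields, rename `label` to `labels`."""
    batch = {}
    for key in features[0]:
        out_key = "labels" if key == "label" else key
        batch[out_key] = torch.tensor([f[key] for f in features])
    return batch

# Made-up, already padded examples (length 4 instead of 60 to keep it readable).
features = [
    {"label": 1, "input_ids": [101, 7592, 102, 0], "attention_mask": [1, 1, 1, 0]},
    {"label": 0, "input_ids": [101, 2919, 102, 0], "attention_mask": [1, 1, 1, 0]},
]
toy_batch = collate(features)
print(list(toy_batch.keys()))        # ['labels', 'input_ids', 'attention_mask']
print(toy_batch["input_ids"].shape)  # torch.Size([2, 4])
```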

## Model definition

In [7]:
class BertForSST2(pl.LightningModule):
    def __init__(self):
        super().__init__()
        self.bert = BertModel.from_pretrained("bert-base-uncased", output_attentions=True)
        self.dropout = nn.Dropout(0.5)
        self.classifier = nn.Linear(self.bert.config.hidden_size, 2)

    def forward(self, input_ids, attention_mask, token_type_ids):
        outputs = self.bert(
            input_ids,
            attention_mask=attention_mask,
            token_type_ids=token_type_ids
        )
        pooled_output = self.dropout(outputs.pooler_output)
        return self.classifier(pooled_output)
    
    def _calculate_loss(self, batch, mode="train"):
        logits = self(batch["input_ids"], batch["attention_mask"], batch["token_type_ids"])
        loss = F.cross_entropy(logits, batch["labels"])
        preds = F.softmax(logits, dim=1).argmax(1)
        acc = (preds == batch["labels"]).float().mean()
        self.log("%s/loss" % mode, loss)
        self.log("%s/acc" % mode, acc)
        return loss
    
    def training_step(self, batch, batch_idx):
        loss = self._calculate_loss(batch, mode="train")
        return loss
    
    def validation_step(self, batch, batch_idx):
        self._calculate_loss(batch, mode="val")
    
    def predict_step(self, batch, batch_idx):
        logits = self(batch["input_ids"], batch["attention_mask"], batch["token_type_ids"])
        # Return the predicted class index for each example in the batch.
        pred = F.softmax(logits, dim=1).argmax(1)
        return pred
    
    def configure_optimizers(self):
        optimizer = torch.optim.AdamW(self.parameters(), lr=5e-5)
        lr_scheduler = torch.optim.lr_scheduler.StepLR(optimizer, step_size=50)
        return [optimizer], [lr_scheduler]
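The classification head and the loss/accuracy computation can be tried out without loading BERT. The sketch below replaces `outputs.pooler_output` with a random tensor of the same shape (the hidden size of bert-base is 768) and uses made-up labels, so the numbers it prints are meaningless. It only shows the shapes involved and that taking `argmax` of the logits gives the same predictions as taking `argmax` after `softmax`.

```python
import torch
from torch import nn
import torch.nn.functional as F

torch.manual_seed(0)
hidden_size, num_labels, batch_size = 768, 2, 4

# Stand-in for outputs.pooler_output: one hidden_size vector per sentence.
pooled_output = torch.randn(batch_size, hidden_size)
labels = torch.tensor([1, 0, 1, 1])          # made-up labels

classifier = nn.Linear(hidden_size, num_labels)
logits = classifier(pooled_output)           # shape: (batch_size, num_labels)

loss = F.cross_entropy(logits, labels)       # applies log-softmax internally
preds = logits.argmax(dim=1)                 # predicted class per sentence
acc = (preds == labels).float().mean()

# softmax is monotonic, so it does not change which class has the largest score.
same = torch.equal(preds, F.softmax(logits, dim=1).argmax(dim=1))
print(logits.shape, same)
```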

In [8]:
pl_model = BertForSST2()
trainer = pl.Trainer(max_epochs=1, accelerator="gpu", devices=[0])
trainer.fit(pl_model, train_dataloader, eval_dataloader)

Some weights of the model checkpoint at bert-base-uncased were not used when initializing BertModel: ['cls.seq_relationship.bias', 'cls.predictions.transform.dense.bias', 'cls.predictions.decoder.weight', 'cls.predictions.transform.dense.weight', 'cls.predictions.bias', 'cls.seq_relationship.weight', 'cls.predictions.transform.LayerNorm.weight', 'cls.predictions.transform.LayerNorm.bias']
- This IS expected if you are initializing BertModel from the checkpoint of a model trained on another task or with another architecture (e.g. initializing a BertForSequenceClassification model from a BertForPreTraining model).
- This IS NOT expected if you are initializing BertModel from the checkpoint of a model that you expect to be exactly identical (initializing a BertForSequenceClassification model from a BertForSequenceClassification model).
GPU available: True, used: True
TPU available: False, using: 0 TPU cores
IPU available: False, using: 0 IPUs
HPU available: False, using: 0 HPUs
Missing lo

Sanity Checking: 0it [00:00, ?it/s]

Training: 0it [00:00, ?it/s]

Validation: 0it [00:00, ?it/s]
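A note on the learning-rate schedule: `configure_optimizers` returns a `StepLR` scheduler with `step_size=50`. When a scheduler is returned as a plain list, Lightning steps it once per epoch by default, so with `max_epochs=1` the learning rate most likely stays at 5e-5 for the whole run. The plain-PyTorch sketch below, which uses a dummy parameter instead of the model, shows what the schedule would look like if it were stepped once per batch.

```python
import torch

param = torch.nn.Parameter(torch.zeros(1))   # dummy parameter, stands in for the model
optimizer = torch.optim.AdamW([param], lr=5e-5)
scheduler = torch.optim.lr_scheduler.StepLR(optimizer, step_size=50)  # gamma defaults to 0.1

lrs = []
for step in range(100):
    lrs.append(optimizer.param_groups[0]["lr"])
    optimizer.step()      # no gradients were computed, so nothing is updated
    scheduler.step()      # one scheduler step per (simulated) batch

print(lrs[0], lrs[49], lrs[50], lrs[99])

# In a LightningModule, per-batch stepping would be requested like this:
# return {"optimizer": optimizer,
#         "lr_scheduler": {"scheduler": scheduler, "interval": "step"}}
```

The learning rate is multiplied by 0.1 after every 50 scheduler steps. Whether that decay is desirable for fine-tuning BERT is a separate question that this notebook does not settle.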

## Model fine-tuning

In [9]:
pred = trainer.predict(pl_model, eval_dataloader)

LOCAL_RANK: 0 - CUDA_VISIBLE_DEVICES: [0,1]


Predicting: 1053it [00:00, ?it/s]
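`trainer.predict` returns a list with one entry per batch, each entry being whatever `predict_step` returned. To obtain the validation accuracy mentioned in the overview, the per-batch predictions can be concatenated and compared with the labels. The sketch below uses made-up tensors in place of the real output. With the objects from this notebook, `pred` and the `label` column of `eval_dataset` would take their place, which works because `eval_dataloader` does not shuffle.

```python
import torch

# Stand-ins for the per-batch tensors returned by trainer.predict (made-up values).
pred_batches = [torch.tensor([1, 0, 1]), torch.tensor([0, 0])]
true_labels = torch.tensor([1, 0, 0, 0, 1])

all_preds = torch.cat(pred_batches)                        # shape: (num_examples,)
accuracy = (all_preds == true_labels).float().mean().item()
print(f"accuracy = {accuracy:.2f}")                        # accuracy = 0.60
```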

In [11]:
trainer.test(pl_model, eval_dataloader)

MisconfigurationException: No `test_step()` method defined to run `Trainer.test`.