# PyTorch Hands-On Tutorial (7): A Simple Chinese Text Summarization Task with BERT

[![Open In Colab](https://colab.research.google.com/assets/colab-badge.svg)](https://colab.research.google.com/github/iioSnail/chaotic-transformer-tutorials/blob/master/bert_classification_demo.ipynb)

In [1]:
# If you are not using Google Drive, do not run this cell
# from google.colab import drive
# drive.mount('/content/drive')


# Background knowledge for this article

1. [Using nn.Transformer](https://blog.csdn.net/zhaohongfei_358/article/details/126019181)
2. [A walkthrough of the Transformer source code](https://blog.csdn.net/zhaohongfei_358/article/details/126085246) (optional background)
3. [Basic usage of DataLoader and Dataset in PyTorch](https://blog.csdn.net/zhaohongfei_358/article/details/122742656)
4. [How masked attention works](https://blog.csdn.net/zhaohongfei_358/article/details/125858248)
5. [Custom loss functions in PyTorch](https://blog.csdn.net/zhaohongfei_358/article/details/125759911)
6. [Hugging Face quick start](https://blog.csdn.net/zhaohongfei_358/article/details/126224199)

# Contents


# Environment setup

This article relies on two Hugging Face libraries, datasets and transformers, so install them first:

```
transformers==4.21
datasets==2.4
```

In [None]:
!pip install datasets
!pip install transformers

Import all the dependencies used in this article:

In [1]:
import os
import pandas
import torch
from torch import nn
from torch.utils.data import Dataset, DataLoader
# Loads the tokenizer for the bert-base-uncased model
from transformers import AutoTokenizer
# Loads the bert-base-uncased model itself
from transformers import AutoModel
from pathlib import Path
from collections import Counter

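# Global configuration

Define a few global variables. I don't like passing configuration values from function to function; it is too much hassle.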

In [2]:
batch_size = 16
# Maximum text length, in tokens
text_max_length = 128
epochs = 1000
validation_ratio = 0.1
device = torch.device('cuda' if torch.cuda.is_available() else 'cpu')

# Print the loss every `log_per_step` steps
log_per_step = 20
# Save the model every `save_per_step` steps
save_per_step = 5000

dataset_dir = Path("./dataset")

# Directory where model checkpoints are stored
model_dir = Path("./drive/MyDrive/model/transformer_checkpoints")
# Create the directory if it does not exist yet
os.makedirs(model_dir, exist_ok=True)

print("Device:", device)

Device: cpu
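The checkpoint directory can also be created with `pathlib` alone. The following is a minimal sketch, assuming a throwaway temporary directory (the `demo_dir` name and location are made up for the demo and are not part of the notebook):

```python
import tempfile
from pathlib import Path

# Hypothetical location inside a temporary directory, used only for this demo.
with tempfile.TemporaryDirectory() as tmp:
    demo_dir = Path(tmp) / "model" / "transformer_checkpoints"

    # parents=True creates missing parent folders; exist_ok=True makes the
    # call safe to repeat when the folder is already there.
    demo_dir.mkdir(parents=True, exist_ok=True)
    demo_dir.mkdir(parents=True, exist_ok=True)  # second call changes nothing

    created = demo_dir.is_dir()

print(created)  # True
```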


# Data processing

## Loading the dataset

In [3]:
pd_data = pandas.read_csv(dataset_dir / 'train.csv')[['text', 'target']]

Once the data is loaded, take a look at its contents:

In [4]:
pd_data.sample(16, random_state=16)

| index | text | target |
|---|---|---|
| 3031 | Put the RIGHT person up on the block #Shelli??... | 0 |
| 1204 | I'm mentally preparing myself for a bomb ass s... | 0 |
| 220 | Cop pulls drunk driver to safety SECONDS befor... | 1 |
| 4629 | Enter the world of extreme diving ÛÓ 9 storie... | 0 |
| 7187 | BUT I will be uploading these videos ASAP so y... | 0 |
| 7202 | @Camilla_33 @CrayKain Hate to shatter your del... | 0 |
| 2847 | Could Billboard's Hot 100 chart be displaced b... | 1 |
| 3262 | @suelinflower there is no words to describe th... | 0 |
| 3675 | @Bardissimo Yes life has a 100% fatality rate. | 0 |
| 965 | @TR_jdavis Bruh you wanna fight I'm down meet ... | 0 |


## Dataset and DataLoader

In [5]:
pd_validation_data = pd_data.sample(frac=validation_ratio)
pd_train_data = pd_data[~pd_data.index.isin(pd_validation_data.index)]
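The split above draws a random fraction of rows for validation and keeps the rest for training. Note that `sample(frac=...)` without a `random_state` picks different rows on every run. The same idea can be sketched in plain Python with a fixed seed; `split_indices` and its `seed` argument are hypothetical names for illustration, not part of the notebook:

```python
import random

def split_indices(n_rows, validation_ratio, seed=0):
    """Return (train_idx, val_idx), two disjoint index lists covering range(n_rows)."""
    rng = random.Random(seed)  # fixed seed -> the same split on every run
    n_val = round(n_rows * validation_ratio)
    val_idx = sorted(rng.sample(range(n_rows), n_val))
    val_set = set(val_idx)
    # Everything that was not drawn for validation goes to training
    train_idx = [i for i in range(n_rows) if i not in val_set]
    return train_idx, val_idx

train_idx, val_idx = split_indices(n_rows=100, validation_ratio=0.1)
print(len(train_idx), len(val_idx))  # 90 10
```

With pandas, passing a `random_state` to `sample` gives the same kind of reproducibility.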

With the data loaded, we can build the Dataset. Here the Dataset simply returns each text together with its target label:

In [6]:
class MyDataset(Dataset):

    def __init__(self, mode='train'):
        super().__init__()
        # Select the split that matches the mode
        if mode == 'train':
            self.dataset = pd_train_data
        elif mode == 'validation':
            self.dataset = pd_validation_data
        else:
            raise ValueError("Unknown mode {}".format(mode))

    def __getitem__(self, index):
        # Take the row at position `index`
        data = self.dataset.iloc[index]
        # Take the text and strip the '#' and '@' symbols
        source = data['text'].replace("#", "").replace("@", "")
        # Take the corresponding label (0 or 1)
        target = data['target']
        # Return the (text, label) pair
        return source, target

    def __len__(self):
        return len(self.dataset)

In [7]:
train_dataset = MyDataset('train')
validation_dataset = MyDataset('validation')

Let's print one sample:

In [8]:
train_dataset[2]

("All residents asked to 'shelter in place' are being notified by officers. No other evacuation or shelter in place orders are expected",
 1)

With the Dataset in place, we can build the DataLoader. Before doing that, we need to define the tokenizer:

In [9]:
tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")

Let's try the tokenizer:

In [10]:
tokenizer("I'm learning deep learning", return_tensors='pt')

{'input_ids': tensor([[ 101, 1045, 1005, 1049, 4083, 2784, 4083,  102]]), 'token_type_ids': tensor([[0, 0, 0, 0, 0, 0, 0, 0]]), 'attention_mask': tensor([[1, 1, 1, 1, 1, 1, 1, 1]])}

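It works. ID 101 is the `[CLS]` token that marks the start of the input, and 102 is the `[SEP]` token that marks the end of a sentence.

Next we build the DataLoader. We need to define a collate_fn that encodes the sentences, pads them, and assembles them into a batch: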

In [11]:
def collate_fn(batch):
    """
    Convert a batch of (text, label) samples into tensors.
    :param batch: a list of samples, e.g. [('text', 1), ('text', 0), ...]
    :return: a tuple (src, target), where
             src is the tokenizer output, e.g. {'input_ids': tensor([[ 101, ..., 102, 0, 0, ...], ...]), 'attention_mask': tensor([[1, ..., 1, 0, ...], ...])}
             target is a LongTensor holding the labels, e.g. tensor([1, 0, ...])
    """
    text, target = zip(*batch)
    text, target = list(text), list(target)

    # src goes straight into BERT, so the tokenizer output needs no extra processing
    # padding='max_length' pads shorter texts up to max_length
    # truncation=True cuts off texts longer than max_length
    src = tokenizer(text, padding='max_length', max_length=text_max_length, return_tensors='pt', truncation=True)

    return src, torch.LongTensor(target)
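To make the effect of `padding='max_length'` and `truncation=True` concrete, here is a plain-Python sketch of the fixed-length layout. It is an illustration under the assumption of a single sentence per input, not the tokenizer's actual implementation; `encode_fixed_length` is a hypothetical helper, and the token ids are the ones from the tokenizer output shown earlier:

```python
CLS_ID, SEP_ID, PAD_ID = 101, 102, 0  # special-token ids of bert-base-uncased

def encode_fixed_length(token_ids, max_length):
    """Wrap token ids in [CLS]/[SEP], then truncate or pad to exactly max_length."""
    body = token_ids[: max_length - 2]       # leave room for the two special tokens
    input_ids = [CLS_ID] + body + [SEP_ID]
    attention_mask = [1] * len(input_ids)    # 1 = real token
    n_pad = max_length - len(input_ids)
    input_ids += [PAD_ID] * n_pad
    attention_mask += [0] * n_pad            # 0 = padding, ignored by attention
    return input_ids, attention_mask

ids, mask = encode_fixed_length([1045, 1005, 1049, 4083, 2784, 4083], max_length=10)
print(ids)   # [101, 1045, 1005, 1049, 4083, 2784, 4083, 102, 0, 0]
print(mask)  # [1, 1, 1, 1, 1, 1, 1, 1, 0, 0]

short_ids, short_mask = encode_fixed_length([1045, 1005, 1049, 4083, 2784, 4083], max_length=5)
print(short_ids)  # [101, 1045, 1005, 1049, 102]
```

Because every row ends up with the same length, the rows of a batch can be stacked into one tensor.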

In [12]:
train_loader = DataLoader(train_dataset, batch_size=batch_size, shuffle=True, collate_fn=collate_fn)
validation_loader = DataLoader(validation_dataset, batch_size=batch_size, shuffle=False, collate_fn=collate_fn)

Let's take a look at the data coming out of train_loader:

In [13]:
for inputs, targets in train_loader:
    print(targets)
    break

tensor([1, 1, 1, 1, 0, 0, 1, 1, 1, 1, 0, 1, 0, 0, 0, 0])
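Conceptually, a DataLoader without shuffling slices the dataset into chunks of `batch_size` samples and hands each chunk to the collate function. The sketch below mimics that in plain Python, with `zip(*batch)` separating texts from labels the same way `collate_fn` does. The helper names and the three samples are made up for illustration:

```python
def iterate_batches(samples, batch_size, collate):
    """Yield collated batches in order (no shuffling); the last one may be smaller."""
    for start in range(0, len(samples), batch_size):
        yield collate(samples[start:start + batch_size])

def split_columns(batch):
    # zip(*batch) turns [(text, label), ...] into (texts...), (labels...)
    texts, labels = zip(*batch)
    return list(texts), list(labels)

# Made-up samples in the same (text, label) shape that MyDataset returns.
samples = [("fire downtown", 1), ("nice weather", 0), ("flood warning", 1)]

batches = list(iterate_batches(samples, batch_size=2, collate=split_columns))
print(batches[0])  # (['fire downtown', 'nice weather'], [1, 0])
print(batches[1])  # (['flood warning'], [1])
```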


# Building the model

In [14]:
class MyModel(nn.Module):

    def __init__(self):
        super().__init__()

        # Load the pretrained BERT model
        self.bert = AutoModel.from_pretrained("bert-base-uncased")

        # Final prediction head: 768-dim [CLS] vector -> one probability
        self.predictor = nn.Sequential(
            nn.Linear(768, 256),
            nn.ReLU(),
            nn.Linear(256, 1),
            nn.Sigmoid()
        )

    def forward(self, src):
        """
        Forward pass.
        :param src: the tokenized text, as returned by the tokenizer
        :return: a tensor of shape (batch_size, 1) with the predicted probability of class 1
        """
        # src can be unpacked directly into BERT because the model and the tokenizer come as a matched pair.
        # Take the hidden state of the first token ([CLS]) as the representation of the whole text
        outputs = self.bert(**src).last_hidden_state[:, 0, :]
        return self.predictor(outputs)
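The shapes involved in `forward` can be checked without downloading BERT. The sketch below is only an illustration: it replaces the BERT output with a random tensor of the same shape (an assumption made purely to keep the example self-contained) and runs it through a head with the same layers:

```python
import torch
from torch import nn

torch.manual_seed(0)

batch, seq_len, hidden = 4, 128, 768
# Stand-in for `self.bert(**src).last_hidden_state`: random values, not real BERT output.
last_hidden_state = torch.randn(batch, seq_len, hidden)

# Position 0 holds the [CLS] token, so this picks one 768-dim vector per sample.
cls_vectors = last_hidden_state[:, 0, :]

# Same layer layout as the prediction head in MyModel
predictor = nn.Sequential(
    nn.Linear(hidden, 256),
    nn.ReLU(),
    nn.Linear(256, 1),
    nn.Sigmoid(),
)
probs = predictor(cls_vectors)

print(cls_vectors.shape)  # torch.Size([4, 768])
print(probs.shape)        # torch.Size([4, 1])
```

The trailing dimension of size 1 is why the loss is later computed on `outputs.view(-1)`.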

In [15]:
model = MyModel()
model = model.to(device)

Some weights of the model checkpoint at bert-base-uncased were not used when initializing BertModel: ['cls.predictions.transform.dense.weight', 'cls.predictions.transform.dense.bias', 'cls.predictions.transform.LayerNorm.bias', 'cls.predictions.bias', 'cls.predictions.transform.LayerNorm.weight', 'cls.seq_relationship.bias', 'cls.predictions.decoder.weight', 'cls.seq_relationship.weight']
- This IS expected if you are initializing BertModel from the checkpoint of a model trained on another task or with another architecture (e.g. initializing a BertForSequenceClassification model from a BertForPreTraining model).
- This IS NOT expected if you are initializing BertModel from the checkpoint of a model that you expect to be exactly identical (initializing a BertForSequenceClassification model from a BertForSequenceClassification model).


In [16]:
model(inputs.to(device))

tensor([[0.4655],
        [0.4628],
        [0.4697],
        [0.4832],
        [0.4707],
        [0.4490],
        [0.4764],
        [0.4553],
        [0.4642],
        [0.4575],
        [0.4581],
        [0.4649],
        [0.4669],
        [0.4916],
        [0.4559],
        [0.4711]], grad_fn=<SigmoidBackward0>)

# Training the model

Now we can start training the model for real. First define the loss function and the optimizer:

In [17]:
criteria = nn.BCELoss()
optimizer = torch.optim.Adam(model.parameters(), lr=3e-5)
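For reference, binary cross-entropy with mean reduction (the default of `nn.BCELoss`) averages $-[y\log p + (1-y)\log(1-p)]$ over the batch. A small hand-rolled version in plain Python shows the arithmetic; the probabilities and labels below are made up for illustration:

```python
import math

def binary_cross_entropy(probs, targets):
    """Mean of -[y*log(p) + (1-y)*log(1-p)] over the batch."""
    losses = [-(y * math.log(p) + (1 - y) * math.log(1 - p))
              for p, y in zip(probs, targets)]
    return sum(losses) / len(losses)

# A model that outputs 0.5 for everything scores ln(2), whatever the labels are.
baseline = binary_cross_entropy([0.5, 0.5, 0.5, 0.5], [1, 0, 1, 1])
print(round(baseline, 4))  # 0.6931

# Confident and correct predictions push the loss toward 0.
good = binary_cross_entropy([0.9, 0.1, 0.8, 0.95], [1, 0, 1, 1])
print(good < baseline)  # True
```

Since the untrained model above outputs values close to 0.5, a per-batch loss near 0.69 at the start of training is what one would expect.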

In [18]:
# The tokenizer output is dict-like, so define a helper that moves every tensor in it to the device
def to_device(dict_tensors):
    result_tensors = {}
    for key, value in dict_tensors.items():
        result_tensors[key] = value.to(device)
    return result_tensors

In [19]:
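def validate():
    # Switch to evaluation mode (disables dropout) and skip gradient tracking
    model.eval()
    total_loss = 0.
    total_correct = 0
    with torch.no_grad():
        for inputs, targets in validation_loader:
            inputs, targets = to_device(inputs), targets.to(device)
            outputs = model(inputs)
            loss = criteria(outputs.view(-1), targets.float())
            total_loss += float(loss)

            # Threshold the probabilities at 0.5 and count the matches
            correct_num = ((outputs.view(-1) >= 0.5).long() == targets).sum().item()
            total_correct += correct_num

    # Restore training mode, because validate() is called from inside the training loop
    model.train()
    # BCELoss already averages within a batch, so average over batches, not samples
    return total_correct / len(validation_dataset), total_loss / len(validation_loader)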
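The accuracy part of `validate` boils down to thresholding each probability at 0.5 and counting how many predictions equal the label. A plain-Python sketch of that step, with made-up numbers (`accuracy` is a hypothetical helper, not part of the notebook):

```python
def accuracy(probs, targets, threshold=0.5):
    """Fraction of samples whose thresholded probability equals the 0/1 label."""
    predictions = [1 if p >= threshold else 0 for p in probs]
    correct = sum(int(pred == y) for pred, y in zip(predictions, targets))
    return correct / len(targets)

# Made-up probabilities and labels, not results from the model above.
probs = [0.91, 0.12, 0.55, 0.40]
targets = [1, 0, 0, 1]
print(accuracy(probs, targets))  # 0.5
```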

Start training:

In [27]:
# First switch the model to training mode
model.train()

# Clear the CUDA cache
if torch.cuda.is_available():
    torch.cuda.empty_cache()

# Accumulates the loss between two log lines
total_loss = 0.
# Counts the training steps
step = 0

# Start training
for epoch in range(epochs):
    for i, (inputs, targets) in enumerate(train_loader):
        # Move the batch to the device
        inputs, targets = to_device(inputs), targets.to(device)
        # Forward pass through the model
        outputs = model(inputs)
        # Compute the loss
        loss = criteria(outputs.view(-1), targets.float())
        loss.backward()
        optimizer.step()
        optimizer.zero_grad()

        total_loss += float(loss)
        step += 1

        if step % log_per_step == 0:
            print("Epoch {}/{}, Step: {}/{}, total loss:{:.4f}".format(epoch+1, epochs, i, len(train_loader), total_loss))
            total_loss = 0

        del inputs, targets

    accuracy, validation_loss = validate()
    print("Epoch {}, accuracy: {:.4f}, validation loss: {:.4f}".format(epoch+1, accuracy, validation_loss))
    torch.save(model, model_dir / f"model_{epoch}.pt")

# Using the model