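# Automatic Grading System Based on Fine-Tuned BERT
This project builds a question-answering bot by fine-tuning the full BERT model on a Huawei Cloud server with Ascend: 1*Ascend910 and mindspore-ascend 1.10.0. Given an English question and a candidate sentence, it classifies whether the sentence answers the question.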

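## 1. Setting Up the Environment

- Install the required libraries. The mindnlp package installed from git does not pull in `tqdm`, so it has to be installed separately.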

In [None]:
!pip install git+https://openi.pcl.ac.cn/lvyufeng/mindnlp
!pip install regex
!pip uninstall tqdm -y
!pip install tqdm

- Import the required packages.

In [1]:
import os
import tqdm
import mindspore
from mindspore.dataset import text, GeneratorDataset, transforms
from mindspore import nn, context

from mindnlp.transforms import PadTransform
from mindnlp.transforms.tokenizers import BertTokenizer

from mindnlp.engine import Trainer, Evaluator
from mindnlp.engine.callbacks import CheckpointCallback, BestModelCallback
from mindnlp.metrics import Accuracy

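## 2. Data Processing
### 2.1. Loading the Data Files
The `ours` folder contains three TSV files: `WikiQA-train.tsv`, `WikiQA-dev.tsv`, and `WikiQA-test.tsv`. Each row includes a question (Question), a candidate sentence (Sentence), and a label (Label) marking whether the sentence answers the question.

The `Loader` class below reads one file into a list of dictionaries with the keys `question`, `answer`, and `label`. Indexing a `Loader` returns the tuple `(label, question, answer)`.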

In [2]:
import csv

class Loader:
    
    def __init__(self, path):
        self.path = path
        self._data = []  # This will store dictionaries
        self._load()

    def _load(self):
        with open(self.path, 'r', encoding='utf-8') as csvfile:
            spamreader = csv.reader(csvfile, delimiter='\t', quotechar='"')
            next(spamreader, None)  # skip the headers
            for row in spamreader:
                res = {}
                res['question'] = str(row[1])
                res['answer'] = str(row[5])
                res['label'] = int(row[6])
                self._data.append(res)

    def __getitem__(self, index):
        item = self._data[index]
        return item['label'], item['question'], item['answer']

    def __len__(self):
        return len(self._data)
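
As a quick check of the column indices used in `_load`, the sketch below parses a tiny in-memory sample with the same `csv.reader` settings. The header names are assumed to follow the public WikiQA TSV layout; the IDs, document title, and label in the sample row are made up for illustration and are not read from the dataset files. `parse_rows` is a hypothetical helper, not part of the project.

```python
import csv
import io

# Illustrative sample in the assumed WikiQA TSV layout; IDs and label are made up.
SAMPLE_TSV = (
    "QuestionID\tQuestion\tDocumentID\tDocumentTitle\tSentenceID\tSentence\tLabel\n"
    "Q1\thow are glacier caves formed?\tD1\tGlacier cave\tD1-0\t"
    "A glacier cave is a cave formed within the ice of a glacier .\t1\n"
)


def parse_rows(handle):
    """Yield (label, question, answer) tuples, mirroring Loader.__getitem__."""
    reader = csv.reader(handle, delimiter="\t", quotechar='"')
    next(reader, None)  # skip the header row
    for row in reader:
        # Column 1 is the question, column 5 the sentence, column 6 the label.
        yield int(row[6]), str(row[1]), str(row[5])


rows = list(parse_rows(io.StringIO(SAMPLE_TSV)))
print(rows[0])
```

Running the same check against the first few lines of a real file is a cheap way to confirm that the indices `1`, `5`, and `6` match the file at hand.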


In [3]:
train_file = Loader('ours/WikiQA-train.tsv')
valid_file = Loader('ours/WikiQA-dev.tsv')
test_file = Loader('ours/WikiQA-test.tsv')
len(train_file)

20347

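### 2.2. Preprocessing the Raw Data
The raw data differs considerably from the input BERT expects, so it has to be processed before training. The processing consists of the following steps:
1. Load the data into `mindspore.dataset.GeneratorDataset` so that later steps can operate on it directly.
2. Run the tokenizer over `question` and `answer` to turn them into token IDs, and cast `label` to `int32` with `type_cast_op`.
3. Merge `question` and `answer` into the form `[CLS]question[SEP]answer[SEP]`, stored as `input_ids`.
4. Fill the rest of `input_ids` with `[PAD]` up to a length of `max_seq_len`.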
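
Steps 3 and 4 can be illustrated without MindSpore. The following is a minimal sketch in plain Python, assuming both sequences already carry their own `[CLS] ... [SEP]` wrapper as a BERT tokenizer would produce. `build_input_ids` is a hypothetical helper, the word-piece IDs in the example are arbitrary placeholders, and the truncation step is an assumption of this sketch rather than something the notebook states.

```python
# Special-token IDs of the bert-base-uncased vocabulary.
CLS_ID, SEP_ID, PAD_ID = 101, 102, 0


def build_input_ids(question_ids, answer_ids, max_seq_len=64, pad_id=PAD_ID):
    """Merge two tokenized sequences into [CLS] q [SEP] a [SEP] and pad."""
    # Drop the answer's leading [CLS] so only one [CLS] remains.
    merged = list(question_ids) + list(answer_ids[1:])
    # Assumption of this sketch: overly long inputs are cut at max_seq_len.
    merged = merged[:max_seq_len]
    # Right-pad with [PAD] up to the fixed length.
    return merged + [pad_id] * (max_seq_len - len(merged))


question = [CLS_ID, 2129, 2024, SEP_ID]       # placeholder word-piece IDs
answer = [CLS_ID, 1037, 3000, 2003, SEP_ID]   # placeholder word-piece IDs
input_ids = build_input_ids(question, answer, max_seq_len=12)
print(input_ids)
```

Note that cutting at `max_seq_len` can remove the final `[SEP]` of a long pair; whether that matters depends on how the real padding transform treats long inputs.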

In [4]:
import numpy as np

def process_dataset(source, tokenizer, pad_value, max_seq_len=64, batch_size=32, shuffle=True):
    column_names = ["label", "question", "answer"]

    def concat_columns(data1, data2):
        # Drop the answer's leading [CLS] so the result is [CLS]question[SEP]answer[SEP]
        return np.concatenate((data1, data2[1:]), axis=0)

    dataset = GeneratorDataset(source, column_names=column_names, shuffle=shuffle)
    # transforms
    pad_op = PadTransform(max_seq_len, pad_value=pad_value)
    type_cast_op = transforms.TypeCast(mindspore.int32)
    
    # map dataset
    dataset = dataset.map(tokenizer, input_columns="question")
    dataset = dataset.map(tokenizer, input_columns="answer")
    dataset = dataset.map(operations=[type_cast_op], input_columns="label")

    # Concatenate question and answer columns and then pad the result
    dataset = dataset.map(operations=concat_columns, input_columns=["question", "answer"], output_columns=["input_ids"], column_order=["label", "input_ids"])
    dataset = dataset.map(operations=pad_op, input_columns="input_ids")  # Apply padding

    # batch dataset
    dataset = dataset.batch(batch_size)

    return dataset


In [5]:
tokenizer = BertTokenizer.from_pretrained('bert-base-uncased')
pad_value = tokenizer.token_to_id('[PAD]')
dataset_train = process_dataset(train_file, tokenizer, pad_value)
dataset_val = process_dataset(valid_file, tokenizer, pad_value)
dataset_test = process_dataset(test_file, tokenizer, pad_value, shuffle=False)

## 3. Model Training
Fine-tune `BertForSequenceClassification`, initialized from the pretrained `bert-base-uncased` weights, to fit this project's task.

In [6]:
from mindnlp.models import BertForSequenceClassification
from mindnlp._legacy.amp import auto_mixed_precision

# set bert config and define parameters for training
model = BertForSequenceClassification.from_pretrained('bert-base-uncased', num_labels=2)
model = auto_mixed_precision(model, 'O1')

loss = nn.CrossEntropyLoss()
optimizer = nn.Adam(model.trainable_params(), learning_rate=2e-5)

metric = Accuracy()

# define callbacks to save checkpoints
ckpoint_cb = CheckpointCallback(save_path='checkpoint', ckpt_name='bert_qabot', epochs=1, keep_checkpoint_max=2)
best_model_cb = BestModelCallback(save_path='checkpoint', ckpt_name='bert_qabot_best', auto_load=True)

trainer = Trainer(network=model, train_dataset=dataset_train,
                  eval_dataset=dataset_val, metrics=metric,
                  epochs=5, loss_fn=loss, optimizer=optimizer, callbacks=[ckpoint_cb, best_model_cb],
                  jit=True)

['cls.predictions.bias', 'cls.predictions.transform.dense.weight', 'cls.predictions.transform.dense.bias', 'cls.predictions.transform.layer_norm.gamma', 'cls.predictions.transform.layer_norm.beta', 'cls.predictions.decoder.weight', 'cls.seq_relationship.weight', 'cls.seq_relationship.bias']


In [None]:
# start training
trainer.run('label')

The train will start from the checkpoint saved in 'checkpoint'.


  0%|          | 0/636 [00:00<?, ?it/s]

- Load the model

In [None]:
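import mindspore
# Load the saved parameters and copy them into the model
param_dict = mindspore.load_checkpoint("checkpoint/bert_qabot_best.ckpt")
mindspore.load_param_into_net(model, param_dict)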

- Evaluate the model

In [None]:
evaluator = Evaluator(network=model, eval_dataset=dataset_test, metrics=metric)
evaluator.run(tgt_columns="label")
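
Accuracy alone can be hard to interpret when one label dominates the test set. As a complement, precision, recall, and F1 for the positive class can be computed from predicted and true labels. The sketch below is plain Python and independent of MindNLP; `binary_prf` is a hypothetical helper, and the `preds` and `labels` lists are made-up values, not results from this model.

```python
def binary_prf(preds, labels, positive=1):
    """Return (precision, recall, f1) for the positive class."""
    pairs = list(zip(preds, labels))
    tp = sum(1 for p, y in pairs if p == positive and y == positive)
    fp = sum(1 for p, y in pairs if p == positive and y != positive)
    fn = sum(1 for p, y in pairs if p != positive and y == positive)
    # Guard against division by zero when a class is never predicted or present.
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    denom = precision + recall
    f1 = 2 * precision * recall / denom if denom else 0.0
    return precision, recall, f1


# Hypothetical predictions and ground-truth labels, for illustration only.
preds = [0, 0, 1, 0, 1, 0, 0, 0]
labels = [0, 0, 1, 1, 0, 0, 0, 0]

accuracy = sum(p == y for p, y in zip(preds, labels)) / len(labels)
precision, recall, f1 = binary_prf(preds, labels)
print(accuracy, precision, recall, f1)
```

In this made-up example the accuracy is higher than the positive-class F1, which is the kind of gap that accuracy by itself would hide.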

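## 4. Model Inference
Use the trained model to run inference on an unseen question and candidate answer (in English) and decide whether the answer is correct.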

In [None]:
from mindspore import Tensor

def predict_single(question, answer):
    label_map = {0: "incorrect", 1: "correct"}
    # Tokenize and convert to numpy arrays
    ques = tokenizer.encode(question).ids
    ans = tokenizer.encode(answer).ids
    
    # Concatenate question and answer
    text_tokenized = np.concatenate((ques, ans[1:]))
    
    # Convert concatenated tokens back to tensor and get prediction
    logits = model(Tensor([text_tokenized]))
    
    predict_label = logits[0].asnumpy().argmax()
    info = f"inputs: '{question} {answer}', predict: '{label_map[predict_label]}'"
    
    return info



In [None]:
question = 'how are glacier caves formed?'
answer = 'A glacier cave is a cave formed within the ice of a glacier .'
print(predict_single(question, answer))
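
The last step of `predict_single`, turning the classifier's two raw scores into a label, can be sketched in plain Python. This version also applies a softmax so that a probability is reported alongside the label; that extra step is an addition for illustration and is not part of the notebook's function. `logits_to_label` is a hypothetical helper and the logits are made-up values.

```python
import math

LABEL_MAP = {0: "incorrect", 1: "correct"}


def logits_to_label(logits):
    """Return (label, probability) for a list of raw class scores."""
    shift = max(logits)  # subtract the max for numerical stability
    exps = [math.exp(x - shift) for x in logits]
    total = sum(exps)
    probs = [e / total for e in exps]
    # argmax over the class probabilities
    best = max(range(len(probs)), key=probs.__getitem__)
    return LABEL_MAP[best], probs[best]


label, prob = logits_to_label([-1.2, 0.8])  # hypothetical logits
print(label, round(prob, 3))
```

Softmax does not change which class wins the argmax; it only makes the margin between the two scores easier to read.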

Because of the limitations of the dataset, the model's accuracy on unseen questions is limited.

## References

- Knowledge-base question answering with BERT: [https://work.datafountain.cn/forum?id=121&type=2&source=1](https://work.datafountain.cn/forum?id=121&type=2&source=1)
- MindNLP source repository: [https://openi.pcl.ac.cn/lvyufeng/mindnlp](https://openi.pcl.ac.cn/lvyufeng/mindnlp)
- MindNLP documentation: [https://mindnlp.cqu.ai/en/latest/](https://mindnlp.cqu.ai/en/latest/)
- Summarization project based on GPT-2 and MindSpore: [https://github.com/mindspore-lab/mindnlp/blob/master/examples/summarization/gpt2_summarization.ipynb](https://github.com/mindspore-lab/mindnlp/blob/master/examples/summarization/gpt2_summarization.ipynb)
- Sentiment classification with BERT and mindnlp: [https://developer.huaweicloud.com/develop/aigallery/notebook/detail?id=e486c037-76ae-415b-90a7-7766ea189982](https://developer.huaweicloud.com/develop/aigallery/notebook/detail?id=e486c037-76ae-415b-90a7-7766ea189982)