# Fine-tuning BigBird-Pegasus with MindNLP
Base model: google/bigbird-pegasus-large-arxiv
Tokenizer: google/bigbird-pegasus-large-arxiv
Fine-tuning dataset: databricks/databricks-dolly-15k
Hardware: Ascend910B1
Environment
| Software    | Version                     |
| ----------- | --------------------------- |
| MindSpore   | MindSpore 2.4.0             |
| MindNLP     | MindNLP 0.4.1               |
| CANN        | 8.0                         |
| Python      | Python 3.9                  |
| OS platform | Ubuntu 5.4.0-42-generic     |

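## Introduction
BigBird-Pegasus is a hybrid model built on BigBird and Pegasus that combines the strengths of both and is designed for long text sequences. BigBird is a Transformer-based model that handles long sequences with a sparse attention mechanism, which lowers the computational complexity. Pegasus is a model designed for text summarization; its self-supervised pre-training task, Gap Sentences Generation (GSG), improves summary generation. BigBird-Pegasus combines BigBird's long-sequence handling with Pegasus's summarization ability and suits long-text summarization tasks such as academic papers and long documents.
Databricks Dolly 15k is a high-quality instruction-tuning dataset released by Databricks. It contains about 15,000 human-written instruction-response pairs for training and evaluating dialogue models, and it is intended for fine-tuning NLP models.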
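
To make the complexity claim about sparse attention concrete, here is a minimal sketch that counts how many query-key pairs a toy BigBird-style pattern keeps (sliding window, a few global tokens, a few random keys) compared with full attention. The function name, the token-level pattern, and the default sizes are assumptions chosen for illustration; the actual model uses a block-sparse implementation with different sizes.

```python
import random

def sparse_attention_pairs(seq_len, window=3, num_global=2, num_random=2, seed=0):
    """Return the set of (query, key) pairs kept by a toy BigBird-style pattern."""
    rng = random.Random(seed)
    half = window // 2
    pairs = set()
    for q in range(seq_len):
        # Sliding window: each token attends to its local neighbours.
        for k in range(max(0, q - half), min(seq_len, q + half + 1)):
            pairs.add((q, k))
        # Global tokens: the first tokens attend to everything and are attended by everything.
        for g in range(min(num_global, seq_len)):
            pairs.add((q, g))
            pairs.add((g, q))
        # Random attention: a few extra keys per query.
        for k in rng.sample(range(seq_len), min(num_random, seq_len)):
            pairs.add((q, k))
    return pairs

for n in (64, 256, 1024):
    kept = len(sparse_attention_pairs(n))
    print(f"seq_len={n:5d}  sparse={kept:8d}  full={n * n:8d}  ratio={kept / (n * n):.3f}")
```

The number of kept pairs grows roughly linearly with the sequence length, while full attention grows quadratically, which is why the ratio shrinks as the sequence gets longer.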
## train loss

Comparison of the training loss during fine-tuning

| epoch | mindnlp+mindspore | transformers+torch (4060) |
| ----- | ----------------- | ------------------------- |
| 1     | 2.9176            | 8.7301                    |
| 2     | 2.79              | 8.1557                    |
| 3     | 2.593             | 7.7516                    |
| 4     | 2.4875            | 7.5017                    |
| 5     | 2.3831            | 7.2614                    |
| 6     | 2.2631            | 7.0559                    |
| 7     | 2.2369            | 6.8405                    |
| 8     | 2.1732            | 6.7297                    |
| 9     | 2.1717            | 6.7136                    |
| 10    | 2.1833            | 6.6279                    |
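
As a small worked example, the sketch below computes how much each run's training loss dropped between epoch 1 and epoch 10, using the values from the table. The helper name `relative_reduction` is introduced here for illustration only.

```python
# Per-epoch training loss copied from the table above.
mindnlp_loss = [2.9176, 2.79, 2.593, 2.4875, 2.3831, 2.2631, 2.2369, 2.1732, 2.1717, 2.1833]
torch_loss = [8.7301, 8.1557, 7.7516, 7.5017, 7.2614, 7.0559, 6.8405, 6.7297, 6.7136, 6.6279]

def relative_reduction(losses):
    """Fraction by which the loss dropped between the first and the last epoch."""
    return (losses[0] - losses[-1]) / losses[0]

for name, losses in [("mindnlp+mindspore", mindnlp_loss), ("transformers+torch", torch_loss)]:
    print(f"{name}: {losses[0]:.4f} -> {losses[-1]:.4f} ({relative_reduction(losses):.1%} lower)")
```

Both runs drop by roughly a quarter of their starting loss, although their absolute loss levels differ considerably.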

## eval loss

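Comparison of evaluation loss. The mindnlp+mindspore value matches the eval_loss logged after epoch 10 of the training run below.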

| epoch | mindnlp+mindspore  | transformers+torch (4060) |
| ----- | ------------------ | ------------------------- |
| 1     | 2.6390955448150635 | 6.3235931396484375        |

**First, run the following script to set up the environment**

In [None]:
# On Ascend910B1, the following additional packages need to be installed
# !pip install mindnlp
# !pip install mindspore==2.4
# !export LD_PRELOAD=$LD_PRELOAD:/home/ma-user/anaconda3/envs/MindSpore/lib/python3.9/site-packages/torch.libs/libgomp-74ff64e9.so.1.0.0
# !yum install libsndfile

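## Import libraries
Note that several tokenizers are imported here because they were tried during testing. Unlike transformers, MindNLP requires picking the matching tokenizer class explicitly, and no tokenizer that corresponds exactly to BigBirdPegasus was found in MindNLP.
The MindSpore context must be set to Ascend.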

In [2]:
import os
from mindnlp.transformers import (
    BigBirdPegasusForCausalLM, 
    BigBirdTokenizerFast,
    PegasusTokenizer,
    PreTrainedTokenizerBase)
from datasets import load_dataset, DatasetDict
from mindspore.dataset import GeneratorDataset
from mindnlp.engine import Trainer, TrainingArguments
import mindspore as ms
# Set the execution mode and the target device
ms.set_context(mode=ms.PYNATIVE_MODE, device_target="Ascend")



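## Process the dataset
To allow quick, repeated fine-tuning runs, the processed dataset is saved locally. Note that BigBirdPegasusForCausalLM is a language model, so the dataset has to be processed accordingly (each sample is flattened into a single prompt string).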

In [3]:
# Path where the processed dataset is saved
dataset_path = "./processed_dataset"
# Check whether a processed dataset already exists
if os.path.exists(dataset_path):
    dataset = DatasetDict.load_from_disk(dataset_path)
    train_dataset = dataset["train"]
    eval_dataset = dataset["eval"]
else:
    # Load and process the dataset
    dataset = load_dataset("databricks/databricks-dolly-15k")
    print(dataset)

    def format_prompt(sample):
        instruction = f"### Instruction\n{sample['instruction']}"
        context = f"### Context\n{sample['context']}" if len(sample["context"]) > 0 else None
        response = f"### Answer\n{sample['response']}"
        prompt = "\n\n".join([i for i in [instruction, context, response] if i is not None])
        sample["prompt"] = prompt
        return sample

    dataset = dataset.map(format_prompt)
    dataset = dataset.remove_columns(['instruction', 'context', 'response', 'category'])
    train_dataset = dataset["train"].select(range(0, 40))
    eval_dataset = dataset["train"].select(range(40, 50))
    # print(train_dataset)
    # print(eval_dataset)
    # print(train_dataset[0])
    # Save the processed dataset
    dataset = DatasetDict({"train": train_dataset, "eval": eval_dataset})
    dataset.save_to_disk(dataset_path)
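
To show what the prompt format produces, here is a standalone sketch that reproduces `format_prompt` from the cell above and applies it to two hypothetical records (they are made up for illustration and are not taken from the dataset), one with a context and one without.

```python
def format_prompt(sample):
    # Same logic as the notebook cell above, reproduced so this snippet runs standalone.
    instruction = f"### Instruction\n{sample['instruction']}"
    context = f"### Context\n{sample['context']}" if len(sample["context"]) > 0 else None
    response = f"### Answer\n{sample['response']}"
    sample["prompt"] = "\n\n".join(p for p in (instruction, context, response) if p is not None)
    return sample

# Hypothetical records, not taken from databricks-dolly-15k.
with_context = {
    "instruction": "Summarize the text.",
    "context": "BigBird uses sparse attention.",
    "response": "It is a sparse Transformer.",
}
without_context = {"instruction": "Name a color.", "context": "", "response": "Blue."}

print(format_prompt(with_context)["prompt"])
print("---")
print(format_prompt(without_context)["prompt"])
```

When the context is empty, the `### Context` section is omitted entirely, so the prompt contains only the instruction and the answer.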

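## Load the model
MindNLP has no class like BigBirdPegasusTokenizer, and the tokenizer could not be loaded the way transformers does with tokenizer = AutoTokenizer.from_pretrained(tokenizer_name). A MindNLP example turned out to use PegasusTokenizer, which solved the problem.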


In [None]:
model_name = "google/bigbird-pegasus-large-arxiv"
# tokenizer_name = "google/bigbird-roberta-base"
tokenizer_name = "google/bigbird-pegasus-large-arxiv"
tokenizer = PegasusTokenizer.from_pretrained(tokenizer_name)
tokenizer.pad_token = tokenizer.eos_token 
model = BigBirdPegasusForCausalLM.from_pretrained(model_name)

BigBirdPegasusForCausalLM has generative capabilities, as `prepare_inputs_for_generation` is explicitly overwritten. However, it doesn't directly inherit from `GenerationMixin`.`PreTrainedModel` will NOT inherit from `GenerationMixin`, and this model will lose the ability to call `generate` and other related functions.
  - If you are the owner of the model architecture code, please modify your model class such that it inherits from `GenerationMixin` (after `PreTrainedModel`, otherwise you'll get an exception).
  - If you are not the owner of the model architecture class, please contact the model code owner to update it.


[MS_ALLOC_CONF]Runtime config:  enable_vmm:True  vmm_align_size:2MB




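## Preprocess the dataset into the training format
MindNLP does not seem to provide a tool like DataCollatorForLanguageModeling from transformers, so padding and truncation have to be written by hand.
The processed dataset is printed here and compared with the torch version to make sure both pipelines produce the same data.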

In [None]:
class TextDataset:
    def __init__(self, data):
        self.data = data
    # Tokenize one sample with padding and truncation to a fixed length
    def __getitem__(self, index):
        index = int(index)
        text = self.data[index]["prompt"]
        inputs = tokenizer(text, padding='max_length', max_length=256, truncation=True)
        return (
            inputs["input_ids"], 
            inputs["attention_mask"],
            inputs["input_ids"]  # labels are a copy of input_ids
        )

    def __len__(self):
        return len(self.data)
train_dataset = GeneratorDataset(
    TextDataset(train_dataset),
    column_names=["input_ids", "attention_mask", "labels"],  # include the labels column
    shuffle=True
)
eval_dataset = GeneratorDataset(
    TextDataset(eval_dataset),
    column_names=["input_ids", "attention_mask", "labels"],  # include the labels column
    shuffle=False
)
print("train_dataset:", train_dataset)
print("eval_dataset:", eval_dataset)
for data in train_dataset.create_dict_iterator():
    print(data)
    break

train_dataset: <mindspore.dataset.engine.datasets_user_defined.GeneratorDataset object at 0xffff3d7411c0>
eval_dataset: <mindspore.dataset.engine.datasets_user_defined.GeneratorDataset object at 0xffff457844f0>
{'input_ids': Tensor(shape=[256], dtype=Int64, value= [  110, 63444, 26323,   722,   171,   125,   388,   850,   152,   110, 63444, 13641,  1819,   334,   119,   179,  1359,   850,  2688,   111, 16554,   107,  3960,   122, 
 18393,  1000,   115,   653,   172,   114,   371,  1028,  1580,   107,   240,   119,   394,  3120,   269,   108,   388,  6861,   135,   114,  1102,   108,   112, 32078, 
  1102,   108,   523, 31978, 10336,   118, 62773, 33886,  4471,   107, 29022,   815,   128,   850,   166,   111,  2028,   130,   128,  2921,   476,  7997,   107,   614, 
   113,   109,   205,   356,   341,   117,  1274,   308,   111,  5154, 10285,   107,  6333,  2427,   112,   128,   513,   108,   111,   248,  1004,   390,   173,   690, 
   112,  1585,  2015,   107,     1,     1,     1,     1
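
Because the cell above uses `input_ids` as `labels` and pads with the EOS token, the padded positions appear to count toward the loss. The sketch below shows one common alternative: padding and truncating by hand while masking padded label positions. It assumes that the loss ignores the label value -100, as Hugging Face transformers does; whether MindNLP's BigBirdPegasusForCausalLM follows that convention should be verified. The function name `pad_and_mask` and the short token sequence are illustrative.

```python
IGNORE_INDEX = -100  # Assumption: the loss skips this label value, as in Hugging Face transformers.

def pad_and_mask(token_ids, max_length, pad_id):
    """Truncate or pad to max_length and build attention_mask and labels.

    Labels copy the real tokens and use IGNORE_INDEX on padded positions.
    """
    ids = list(token_ids[:max_length])
    n_real = len(ids)
    n_pad = max_length - n_real
    input_ids = ids + [pad_id] * n_pad
    attention_mask = [1] * n_real + [0] * n_pad
    labels = ids + [IGNORE_INDEX] * n_pad
    return input_ids, attention_mask, labels

# Short illustrative sequence ending in EOS (id 1), padded with the same id.
input_ids, attention_mask, labels = pad_and_mask([110, 63444, 26323, 1], max_length=8, pad_id=1)
print(input_ids)
print(attention_mask)
print(labels)
```

Switching to masked labels would change the reported loss values, so the numbers would no longer be directly comparable with the tables in this notebook.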

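## Configure the trainer and train
The parameters here should match the torch training parameters. The training loss is recorded for each epoch and then compared.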

In [6]:
EPOCHS = 10
BATCH_SIZE = 4
# Define the training arguments
training_args = TrainingArguments(
    output_dir='./MindsporeBigBirdFinetune',
    overwrite_output_dir=True,
    num_train_epochs=EPOCHS,
    per_device_train_batch_size=BATCH_SIZE,
    per_device_eval_batch_size=BATCH_SIZE,
    
    save_steps=500,                  # Save a checkpoint every 500 steps (this run has 100 steps in total)
    save_total_limit=2,              # Keep only the last 2 checkpoints
    logging_dir="./logs",            # Directory for logs
    logging_steps=100,               # Only relevant for step-based logging; logging here is per epoch
    logging_strategy="epoch",
    evaluation_strategy="epoch",
    eval_steps=500,                  # Only relevant for step-based evaluation; evaluation here is per epoch
    learning_rate=5e-5,
    weight_decay=0.01,               # Weight decay
)

# Create the trainer
trainer = Trainer(
    model=model,
    args=training_args,
    train_dataset=train_dataset,
    eval_dataset=eval_dataset,
    tokenizer=tokenizer,
    compute_metrics=None
)
trainer.train()

{'loss': 2.9176, 'learning_rate': 4.5e-05, 'epoch': 1.0}

{'eval_loss': 3.525486707687378, 'eval_runtime': 1.8043, 'eval_samples_per_second': 1.663, 'eval_steps_per_second': 0.554, 'epoch': 1.0}

{'loss': 2.79, 'learning_rate': 4e-05, 'epoch': 2.0}

{'eval_loss': 3.379671096801758, 'eval_runtime': 0.193, 'eval_samples_per_second': 15.543, 'eval_steps_per_second': 5.181, 'epoch': 2.0}

{'loss': 2.593, 'learning_rate': 3.5e-05, 'epoch': 3.0}

{'eval_loss': 3.1008880138397217, 'eval_runtime': 0.1928, 'eval_samples_per_second': 15.56, 'eval_steps_per_second': 5.187, 'epoch': 3.0}

{'loss': 2.4875, 'learning_rate': 3e-05, 'epoch': 4.0}

{'eval_loss': 2.9427363872528076, 'eval_runtime': 0.1967, 'eval_samples_per_second': 15.255, 'eval_steps_per_second': 5.085, 'epoch': 4.0}

{'loss': 2.3831, 'learning_rate': 2.5e-05, 'epoch': 5.0}

{'eval_loss': 2.9003379344940186, 'eval_runtime': 0.1942, 'eval_samples_per_second': 15.451, 'eval_steps_per_second': 5.15, 'epoch': 5.0}

{'loss': 2.2631, 'learning_rate': 2e-05, 'epoch': 6.0}

{'eval_loss': 2.8607707023620605, 'eval_runtime': 0.1931, 'eval_samples_per_second': 15.539, 'eval_steps_per_second': 5.18, 'epoch': 6.0}

{'loss': 2.2369, 'learning_rate': 1.5e-05, 'epoch': 7.0}

{'eval_loss': 2.759572744369507, 'eval_runtime': 0.189, 'eval_samples_per_second': 15.873, 'eval_steps_per_second': 5.291, 'epoch': 7.0}

{'loss': 2.1732, 'learning_rate': 1e-05, 'epoch': 8.0}

{'eval_loss': 2.7054977416992188, 'eval_runtime': 0.1896, 'eval_samples_per_second': 15.82, 'eval_steps_per_second': 5.273, 'epoch': 8.0}

{'loss': 2.1717, 'learning_rate': 5e-06, 'epoch': 9.0}

{'eval_loss': 2.651596784591675, 'eval_runtime': 0.1884, 'eval_samples_per_second': 15.924, 'eval_steps_per_second': 5.308, 'epoch': 9.0}

{'loss': 2.1833, 'learning_rate': 0.0, 'epoch': 10.0}

{'eval_loss': 2.6390955448150635, 'eval_runtime': 0.1883, 'eval_samples_per_second': 15.932, 'eval_steps_per_second': 5.311, 'epoch': 10.0}

{'train_runtime': 63.3728, 'train_samples_per_second': 6.312, 'train_steps_per_second': 1.578, 'train_loss': 2.419927463531494, 'epoch': 10.0}

TrainOutput(global_step=100, training_loss=2.419927463531494, metrics={'train_runtime': 63.3728, 'train_samples_per_second': 6.312, 'train_steps_per_second': 1.578, 'train_loss': 2.419927463531494, 'epoch': 10.0})
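
The step count and the learning rates in the log can be reproduced with a little arithmetic. The sketch below assumes a linear decay to zero without warmup, which is consistent with the logged values (4.5e-05 after epoch 1, 0.0 after epoch 10); the function name `linear_decay_lr` is introduced here for illustration and is not a MindNLP API.

```python
import math

NUM_SAMPLES = 40   # size of the train split selected in the dataset cell
BATCH_SIZE = 4
EPOCHS = 10
BASE_LR = 5e-5

steps_per_epoch = math.ceil(NUM_SAMPLES / BATCH_SIZE)
total_steps = steps_per_epoch * EPOCHS

def linear_decay_lr(step):
    """Learning rate after `step` optimizer steps, assuming linear decay to zero and no warmup."""
    return BASE_LR * max(0.0, 1.0 - step / total_steps)

print(f"steps per epoch: {steps_per_epoch}, total steps: {total_steps}")
for epoch in range(1, EPOCHS + 1):
    print(f"epoch {epoch:2d}: lr = {linear_decay_lr(epoch * steps_per_epoch):.1e}")
```

With 40 samples and a batch size of 4, each epoch has 10 steps, which matches the `global_step=100` reported at the end of training.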

## Check the evaluation results

In [7]:
eval_results = trainer.evaluate()
print(f"Evaluation results: {eval_results}")


Evaluation results: {'eval_loss': 2.6390953063964844, 'eval_runtime': 0.1845, 'eval_samples_per_second': 16.258, 'eval_steps_per_second': 5.419, 'epoch': 10.0}
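
If the evaluation loss is a mean per-token cross-entropy in nats (an assumption about how the trainer reports it), it can be converted to perplexity with an exponential. The helper name `perplexity` below is illustrative.

```python
import math

# eval_loss values copied from the training log (epoch 1) and the final evaluation above.
eval_loss_epoch_1 = 3.525486707687378
eval_loss_final = 2.6390953063964844

def perplexity(mean_cross_entropy):
    """Perplexity for a mean per-token cross-entropy measured in nats."""
    return math.exp(mean_cross_entropy)

print(f"epoch 1 perplexity: {perplexity(eval_loss_epoch_1):.2f}")
print(f"final perplexity:   {perplexity(eval_loss_final):.2f}")
```

Since the evaluation set has only 10 samples and padded positions may be included in the loss, these numbers are best read as a rough indicator rather than a benchmark result.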





## Save the fine-tuned model

In [8]:
model.save_pretrained("./mindNLPModelBigbirdPegasusFinetune")
tokenizer.save_pretrained("./mindNLPTokenizerBigbirdPegasusFinetune")

Some non-default generation parameters are set in the model config. These should go into a GenerationConfig file instead.
Non-default generation parameters: {'max_length': 256, 'num_beams': 5, 'length_penalty': 0.8}


('./mindNLPTokenizerBigbirdPegasusFinetune/tokenizer_config.json',
 './mindNLPTokenizerBigbirdPegasusFinetune/special_tokens_map.json',
 './mindNLPTokenizerBigbirdPegasusFinetune/spiece.model',
 './mindNLPTokenizerBigbirdPegasusFinetune/added_tokens.json')

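## Test the fine-tuned model
The loss decreases steadily and is lower than in the torch run. However, both runs are only brief fine-tuning sessions (40 training samples, 10 epochs), so the language model's actual output quality is poor and the generated text is not meaningful.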

In [9]:
fine_tuned_model = BigBirdPegasusForCausalLM.from_pretrained("./mindNLPModelBigbirdPegasusFinetune")
fine_tuned_tokenizer = PegasusTokenizer.from_pretrained("./mindNLPTokenizerBigbirdPegasusFinetune")
inputs = "Hello, my dog is cute"
input_tokens = fine_tuned_tokenizer(inputs, return_tensors="ms")
outputs = fine_tuned_model(**input_tokens)
logits = outputs.logits
# Take the argmax to get the predicted token ID at each position
from mindspore import ops
predicted_token_ids = ops.argmax(logits, dim=-1)  # argmax over the last dimension (vocab_size)
# Decode the predicted token IDs into text
generated_text = fine_tuned_tokenizer.decode(predicted_token_ids[0].asnumpy().tolist(), skip_special_tokens=True)
print(generated_text)

in,, have back but
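
The cell above runs a single forward pass and takes the argmax at every input position, which gives one prediction per position rather than an autoregressive continuation of the prompt. The sketch below illustrates a plain greedy decoding loop with a toy scoring function standing in for the model. `greedy_decode`, `next_token_logits`, and `toy_logits` are hypothetical names, not MindNLP APIs; with a real model, the callable would wrap a forward pass and return the logits of the last position.

```python
def greedy_decode(next_token_logits, prompt_ids, eos_id, max_new_tokens=20):
    """Append the argmax token one step at a time until EOS or the length limit.

    `next_token_logits(ids)` is a hypothetical callable that returns a list of
    vocabulary scores for the token that should follow `ids`.
    """
    ids = list(prompt_ids)
    for _ in range(max_new_tokens):
        scores = next_token_logits(ids)
        next_id = max(range(len(scores)), key=scores.__getitem__)
        ids.append(next_id)
        if next_id == eos_id:
            break
    return ids

# Toy stand-in for a model over a 5-token vocabulary: it always prefers
# (last token + 1) mod 5, so generation counts upward until it reaches EOS (id 4).
def toy_logits(ids):
    preferred = (ids[-1] + 1) % 5
    return [1.0 if i == preferred else 0.0 for i in range(5)]

print(greedy_decode(toy_logits, prompt_ids=[0], eos_id=4))  # prints [0, 1, 2, 3, 4]
```

A manual loop like this may be one option when `generate` is unavailable, as the warning printed at model load time suggests; it is slower than a cached implementation because every step recomputes the full sequence.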
