# 👾 Getting Started with Fine-Tuning the Qwen2 Large Language Model

Author: Zeyi Lin (林泽毅)

Tutorial article: https://zhuanlan.zhihu.com/p/702491999  

GPU memory required: about 10 GB  

Training run: https://swanlab.cn/@ZeyiLin/Qwen2-fintune/runs/cfg5f8dzkp6vouxzaxlx6/chart

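ModelScope is an open-source machine learning model hub and development platform built by Alibaba Cloud. It gives researchers, developers, and enterprise users a large collection of pretrained models and tools for NLP, computer vision, speech recognition, recommender systems, and other tasks, helping them integrate and deploy AI applications quickly.

Key features of ModelScope:

- Rich set of pretrained models: ready-to-use models for common AI tasks such as text classification, object detection, and speech synthesis, which can be used directly or fine-tuned further.
- Standardized development workflow: standard interfaces for model loading, training, and inference simplify development and make it easy to combine different models and tools.
- Broad task coverage: natural language processing, computer vision, speech processing, recommender systems, and more, so a suitable model can be chosen for each task.
- Cloud integration: works with Alibaba Cloud's machine learning and compute resources, so models can be trained and deployed in the cloud.
- Community support: open-source code and community support encourage developers and researchers to contribute their own models and code.

Together, these resources and tools lower the barrier to developing machine learning models, so users can build, test, and deploy AI solutions faster.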

## 1. Install the environment

This notebook was tested with modelscope==1.14.0, transformers==4.41.2, datasets==2.18.0, peft==0.11.1, accelerate==0.30.1, and swanlab==0.3.9.
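The install cell below does not pin versions, so the packages that end up installed may be newer than the tested ones. Once that cell has run, a check along the following lines can surface the differences. This is a minimal sketch: `TESTED_VERSIONS` is copied from the list above, while `installed_version`, `find_version_drift`, and the `lookup` parameter are hypothetical helpers introduced for illustration.

```python
from importlib import metadata

# Versions this notebook was tested with (copied from the list above).
TESTED_VERSIONS = {
    "modelscope": "1.14.0",
    "transformers": "4.41.2",
    "datasets": "2.18.0",
    "peft": "0.11.1",
    "accelerate": "0.30.1",
    "swanlab": "0.3.9",
}


def installed_version(package):
    """Return the installed version string, or None if the package is missing."""
    try:
        return metadata.version(package)
    except metadata.PackageNotFoundError:
        return None


def find_version_drift(tested, lookup=installed_version):
    """Map each package whose installed version differs to (tested, installed)."""
    drift = {}
    for package, tested_version in tested.items():
        current = lookup(package)
        if current != tested_version:
            drift[package] = (tested_version, current)
    return drift


if __name__ == "__main__":
    for name, (tested, current) in find_version_drift(TESTED_VERSIONS).items():
        print(f"{name}: tested with {tested}, installed {current or 'not installed'}")
```

A mismatch is not necessarily a problem, but it is the first thing to look at if a later cell behaves differently from the recorded output.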

In [1]:
%pip install torch swanlab modelscope transformers datasets peft pandas accelerate --quiet

Collecting swanlab
  Downloading swanlab-0.3.23-py3-none-any.whl (230 kB)
Collecting modelscope
  Downloading modelscope-1.19.2-py3-none-any.whl (5.8 MB)
Collecting transformers
  Downloading transformers-4.46.1-py3-none-any.whl (10.0 MB)
Collecting datasets
  Downloading datasets-3.0.2-py3-none-any.whl (472 kB)
Collecting peft
  Downloading peft-0.13.2-py3-none-any.whl (320 kB)

If this is your first time using SwanLab, register an account at [SwanLab](https://swanlab.cn), copy your API Key from [User Settings](https://swanlab.cn/settings/overview), and then run the code below:

In [2]:
# !swanlab login

## 2. Load the dataset

1. Download train.jsonl and test.jsonl from [zh_cls_fudan-news - modelscope](https://modelscope.cn/datasets/huangjintao/zh_cls_fudan-news/files) into the same directory as this notebook.

<img src="../assets/dataset.png" width=600>

2. Process train.jsonl and test.jsonl, converting them into new_train.jsonl and new_test.jsonl.

In [3]:
# Step 2: convert train.jsonl and test.jsonl into new_train.jsonl and new_test.jsonl

import json
import pandas as pd
import os

def dataset_jsonl_transfer(origin_path, new_path):
    """
    Convert the raw dataset into the instruction format used for fine-tuning.
    """
    messages = []

    # Read the original JSONL file
    with open(origin_path, "r", encoding="utf-8") as file:
        for line in file:
            # Parse one JSON record per line
            data = json.loads(line)
            context = data["text"]
            category = data["category"]
            label = data["output"]
            message = {
                "instruction": "你是一个文本分类领域的专家，你会接收到一段文本和几个潜在的分类选项，请输出文本内容的正确类型",
                "input": f"文本:{context},类型选型:{category}",
                "output": label,
            }
            messages.append(message)

    # Write the converted records to a new JSONL file
    with open(new_path, "w", encoding="utf-8") as file:
        for message in messages:
            file.write(json.dumps(message, ensure_ascii=False) + "\n")


# Convert the training and test sets (skipped if the converted files already exist)
train_dataset_path = "train.jsonl"
test_dataset_path = "test.jsonl"

train_jsonl_new_path = "new_train.jsonl"
test_jsonl_new_path = "new_test.jsonl"

if not os.path.exists(train_jsonl_new_path):
    dataset_jsonl_transfer(train_dataset_path, train_jsonl_new_path)
if not os.path.exists(test_jsonl_new_path):
    dataset_jsonl_transfer(test_dataset_path, test_jsonl_new_path)

train_df = pd.read_json(train_jsonl_new_path, lines=True)[:1000]  # Use the first 1000 records for training (optional)
test_df = pd.read_json(test_jsonl_new_path, lines=True)[:10]  # Use the first 10 records for a qualitative check
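
To make the target format concrete, here is a small sketch of what the conversion does to a single record. The sample record is invented for illustration (it is not taken from the dataset, and the assumption that `category` holds a list of candidate labels is mine), and `to_instruction_record` is a hypothetical helper that mirrors the per-line logic of `dataset_jsonl_transfer`.

```python
import json

INSTRUCTION = "你是一个文本分类领域的专家，你会接收到一段文本和几个潜在的分类选项，请输出文本内容的正确类型"


def to_instruction_record(raw):
    """Convert one raw record (text / category / output) to the fine-tuning format."""
    return {
        "instruction": INSTRUCTION,
        "input": f"文本:{raw['text']},类型选型:{raw['category']}",
        "output": raw["output"],
    }


# Invented sample record; the field layout (a list of candidate labels) is an assumption.
sample = {
    "text": "The team won the championship final last night.",
    "category": ["Sports", "Politics"],
    "output": "Sports",
}

converted = to_instruction_record(sample)
print(json.dumps(converted, ensure_ascii=False, indent=2))
```

Each converted record carries the fixed instruction, the text combined with its candidate labels as `input`, and the expected label as `output`.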

## 3. Download and load the model and tokenizer

In [4]:
from modelscope import snapshot_download, AutoTokenizer
from transformers import AutoModelForCausalLM, TrainingArguments, Trainer, DataCollatorForSeq2Seq
import torch

# Download the Qwen2 model from ModelScope into the current directory
model_dir = snapshot_download("qwen/Qwen2-1.5B-Instruct", cache_dir="./", revision="master")

# Load the tokenizer and model weights with Transformers from the downloaded directory
tokenizer = AutoTokenizer.from_pretrained(model_dir, use_fast=False, trust_remote_code=True)
model = AutoModelForCausalLM.from_pretrained(model_dir, device_map="auto", torch_dtype=torch.bfloat16)
model.enable_input_require_grads()  # Required when gradient checkpointing is enabled



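
As a rough sanity check on the memory budget, the weights alone can be estimated from the parameter count and the dtype. The sketch below assumes a nominal 1.5 billion parameters (read off the model name, not an exact count) and covers only the weights. Activations, LoRA gradients, and optimizer state come on top, so this is not a derivation of the 10 GB figure quoted at the top.

```python
BYTES_PER_PARAM = {"float32": 4, "bfloat16": 2, "float16": 2}


def weight_memory_gib(num_params, dtype):
    """Approximate memory needed for the model weights, in GiB."""
    return num_params * BYTES_PER_PARAM[dtype] / 1024**3


NOMINAL_PARAMS = 1.5e9  # assumption: read off the model name "Qwen2-1.5B"

for dtype in ("float32", "bfloat16"):
    print(f"{dtype}: {weight_memory_gib(NOMINAL_PARAMS, dtype):.2f} GiB")
```

Loading in `torch.bfloat16`, as the cell above does, halves the weight footprint compared with `float32`.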

## 4. Preprocess the training data

In [5]:
def process_func(example):
    """
    Tokenize one example into input_ids, attention_mask, and labels.
    """
    MAX_LENGTH = 384
    input_ids, attention_mask, labels = [], [], []
    instruction = tokenizer(
        f"<|im_start|>system\n你是一个文本分类领域的专家，你会接收到一段文本和几个潜在的分类选项，请输出文本内容的正确类型<|im_end|>\n<|im_start|>user\n{example['input']}<|im_end|>\n<|im_start|>assistant\n",
        add_special_tokens=False,
    )
    response = tokenizer(f"{example['output']}", add_special_tokens=False)
    input_ids = (
        instruction["input_ids"] + response["input_ids"] + [tokenizer.pad_token_id]
    )
    attention_mask = instruction["attention_mask"] + response["attention_mask"] + [1]
    labels = (
        [-100] * len(instruction["input_ids"])
        + response["input_ids"]
        + [tokenizer.pad_token_id]
    )
    if len(input_ids) > MAX_LENGTH:  # Truncate over-long sequences
        input_ids = input_ids[:MAX_LENGTH]
        attention_mask = attention_mask[:MAX_LENGTH]
        labels = labels[:MAX_LENGTH]
    return {"input_ids": input_ids, "attention_mask": attention_mask, "labels": labels}
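
The key idea in `process_func` is that prompt tokens are labeled `-100`, so only the response tokens and the terminating token contribute to the loss (`-100` is the index that PyTorch's cross-entropy loss ignores by default). Here is a toy sketch with made-up token ids standing in for real tokenizer output; `build_example` and `PAD_ID` are hypothetical names used only for this illustration.

```python
IGNORE_INDEX = -100  # label value skipped by the loss
PAD_ID = 0           # hypothetical stand-in for tokenizer.pad_token_id


def build_example(prompt_ids, response_ids, max_length):
    """Concatenate prompt and response, masking the prompt out of the labels."""
    input_ids = prompt_ids + response_ids + [PAD_ID]
    attention_mask = [1] * len(input_ids)
    labels = [IGNORE_INDEX] * len(prompt_ids) + response_ids + [PAD_ID]
    # Truncate all three sequences together so they stay aligned.
    return {
        "input_ids": input_ids[:max_length],
        "attention_mask": attention_mask[:max_length],
        "labels": labels[:max_length],
    }


example = build_example(prompt_ids=[11, 12, 13], response_ids=[21, 22], max_length=384)
print(example["labels"])  # [-100, -100, -100, 21, 22, 0]
```

Note that truncation cuts from the end, so an over-long input loses its response tokens first. If the prompt alone exceeds the limit, every label is `-100` and the example contributes nothing to training.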

In [6]:
from datasets import Dataset

train_ds = Dataset.from_pandas(train_df)
train_dataset = train_ds.map(process_func, remove_columns=train_ds.column_names)


## 5. Configure LoRA

In [7]:
from peft import LoraConfig, TaskType, get_peft_model

config = LoraConfig(
    task_type=TaskType.CAUSAL_LM,
    target_modules=[
        "q_proj",
        "k_proj",
        "v_proj",
        "o_proj",
        "gate_proj",
        "up_proj",
        "down_proj",
    ],
    inference_mode=False,  # Training mode
    r=8,  # LoRA rank
    lora_alpha=32,  # LoRA alpha; see how LoRA works for its exact role
    lora_dropout=0.1,  # Dropout probability
)

model = get_peft_model(model, config)
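
For intuition about what `r` and `lora_alpha` control: LoRA freezes each targeted weight matrix of shape `(d_out, d_in)` and learns two small matrices, `B` (`d_out × r`) and `A` (`r × d_in`), whose product is scaled by `lora_alpha / r` before being added to the frozen layer's output. The arithmetic below is a sketch; the layer dimensions are hypothetical and not read from the Qwen2 configuration.

```python
def lora_param_count(d_in, d_out, r):
    """Trainable parameters LoRA adds to one linear layer: A is r x d_in, B is d_out x r."""
    return r * d_in + d_out * r


def lora_scaling(lora_alpha, r):
    """Factor applied to the low-rank update B @ A."""
    return lora_alpha / r


# Hypothetical square projection layer; real Qwen2 layer sizes may differ.
d_in, d_out = 1024, 1024
r, lora_alpha = 8, 32  # values from the LoraConfig above

added = lora_param_count(d_in, d_out, r)
frozen = d_in * d_out
print(f"scaling = {lora_scaling(lora_alpha, r)}")
print(f"LoRA adds {added} trainable parameters next to {frozen} frozen ones "
      f"({added / frozen:.2%})")
```

After `get_peft_model`, calling `model.print_trainable_parameters()` reports the actual trainable and total parameter counts for the loaded model.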

## 6. Train

In [8]:
args = TrainingArguments(
    output_dir="./output/Qwen2",
    per_device_train_batch_size=4,
    gradient_accumulation_steps=4,
    logging_steps=10,
    num_train_epochs=2,
    save_steps=100,
    learning_rate=1e-4,
    save_on_each_node=True,
    gradient_checkpointing=True,
    report_to="none",
)
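
These settings interact: each optimizer update sees `per_device_train_batch_size × gradient_accumulation_steps` examples per device. Below is a quick calculation for the 1,000 training examples used here, assuming a single GPU. How the final partial accumulation window is counted can vary between `transformers` versions, so both bounds are shown; the helper names are hypothetical.

```python
import math


def effective_batch_size(per_device_batch, grad_accum_steps, num_devices=1):
    """Number of examples contributing to one optimizer update."""
    return per_device_batch * grad_accum_steps * num_devices


def update_steps_per_epoch(num_examples, per_device_batch, grad_accum_steps, num_devices=1):
    """Lower and upper bound on optimizer updates per epoch."""
    batch = effective_batch_size(per_device_batch, grad_accum_steps, num_devices)
    return num_examples // batch, math.ceil(num_examples / batch)


batch = effective_batch_size(4, 4)              # 16
low, high = update_steps_per_epoch(1000, 4, 4)  # (62, 63)
print(f"effective batch size: {batch}")
print(f"optimizer updates per epoch: between {low} and {high}")
```

With `logging_steps=10` the loss is reported every 10 updates, and `save_steps=100` writes a checkpoint at every 100th update.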

In [9]:
from swanlab.integration.huggingface import SwanLabCallback
import swanlab

swanlab_callback = SwanLabCallback(
    project="Qwen2-fintune",
    experiment_name="Qwen2-1.5B-Instruct",
    description="Fine-tune the Tongyi Qianwen Qwen2-1.5B-Instruct model on the zh_cls_fudan-news dataset.",
    config={
        "model": "qwen/Qwen2-1.5B-Instruct",
        "dataset": "huangjintao/zh_cls_fudan-news",
    },
)



In [10]:
trainer = Trainer(
    model=model,
    args=args,
    train_dataset=train_dataset,
    data_collator=DataCollatorForSeq2Seq(tokenizer=tokenizer, padding=True),
    callbacks=[swanlab_callback],
)

trainer.train()


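# ====== Predictions after training ===== #

def predict(messages, model, tokenizer):
    device = model.device  # run on whichever device the model was placed on
    text = tokenizer.apply_chat_template(messages, tokenize=False, add_generation_prompt=True)
    model_inputs = tokenizer([text], return_tensors="pt").to(device)
    # Pass attention_mask together with input_ids; generating from input_ids alone
    # triggers the "attention mask is not set" warning seen in the recorded output
    generated_ids = model.generate(**model_inputs, max_new_tokens=512)
    # Strip the prompt tokens so that only the newly generated tokens are decoded
    generated_ids = [
        output_ids[len(input_ids) :]
        for input_ids, output_ids in zip(model_inputs.input_ids, generated_ids)
    ]

    response = tokenizer.batch_decode(generated_ids, skip_special_tokens=True)[0]
    print(response)

    return response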

test_text_list = []
for index, row in test_df.iterrows():
    instruction = row["instruction"]
    input_value = row["input"]

    messages = [
        {"role": "system", "content": f"{instruction}"},
        {"role": "user", "content": f"{input_value}"},
    ]

    response = predict(messages, model, tokenizer)
    messages.append({"role": "assistant", "content": f"{response}"})
    result_text = f"{messages[0]}\n\n{messages[1]}\n\n{messages[2]}"
    test_text_list.append(swanlab.Text(result_text, caption=response))

swanlab.log({"Prediction": test_text_list})
swanlab.finish()
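
The loop above only logs the predictions as text for visual inspection. If a number is wanted as well, exact-match accuracy against the `output` column is a simple option. This is an addition for illustration, not part of the original run; `exact_match_accuracy` is a hypothetical helper and the sample labels below are invented.

```python
def exact_match_accuracy(predictions, references):
    """Fraction of predictions equal to their reference after trimming whitespace."""
    if len(predictions) != len(references):
        raise ValueError("predictions and references must have the same length")
    if not references:
        return 0.0
    hits = sum(p.strip() == r.strip() for p, r in zip(predictions, references))
    return hits / len(references)


# Invented labels for illustration only.
predictions = ["Art", "Military ", "Politics", "History"]
references = ["Art", "Military", "Economy", "History"]
print(exact_match_accuracy(predictions, references))  # 0.75
```

In this notebook the references would be `test_df["output"]`, and the predictions would be the strings returned by `predict`.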

[1m[34mswanlab[0m[0m: Logging into swanlab cloud.
[1m[34mswanlab[0m[0m: You can find your API key at: [33mhttps://swanlab.cn/settings[0m
[1m[34mswanlab[0m[0m: Paste an API key from your profile and hit enter, or press 'CTRL + C' to quit: 


 ········


[1m[34mswanlab[0m[0m: Tracking run with swanlab version 0.3.23                                  
[1m[34mswanlab[0m[0m: Run data will be saved locally in [35m[1m/opt/app-root/src/LLM-Finetune/notebook/swanlog/run-20241030_033631-a3b1799d[0m[0m
[1m[34mswanlab[0m[0m: 👋 Hi [1m[39malanliuxiang[0m[0m, welcome to swanlab!
[1m[34mswanlab[0m[0m: Syncing run [33mQwen2-1.5B-Instruct[0m to the cloud
[1m[34mswanlab[0m[0m: 🌟 Run `[1mswanlab watch /opt/app-root/src/LLM-Finetune/notebook/swanlog[0m` to view SwanLab Experiment Dashboard locally
[1m[34mswanlab[0m[0m: 🏠 View project at [34m[4mhttps://swanlab.cn/@alanliuxiang/Qwen2-fintune[0m[0m
[1m[34mswanlab[0m[0m: 🚀 View run at [34m[4mhttps://swanlab.cn/@alanliuxiang/Qwen2-fintune/runs/wdnox1lad9ouh07bg1nnh[0m[0m



`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`...


| Step | Training Loss |
| --- | --- |
| 10 | 2.7264 |
| 20 | 0.455 |
| 30 | 0.2805 |
| 40 | 0.2155 |
| 50 | 0.1184 |
| 60 | 0.0293 |
| 70 | 0.381 |
| 80 | 0.0082 |
| 90 | 0.0025 |
| 100 | 0.0538 |


The attention mask is not set and cannot be inferred from input because pad token is same as eos token. As a consequence, you may observe unexpected behavior. Please pass your input's `attention_mask` to obtain reliable results.


Art
Military
Politics
外国文学研究
History
Space
Transport
Literature
Economy
Art
[1m[34mswanlab[0m[0m: 🌟 Run `[1mswanlab watch /opt/app-root/src/LLM-Finetune/notebook/swanlog[0m` to view SwanLab Experiment Dashboard locally
[1m[34mswanlab[0m[0m: 🏠 View project at [34m[4mhttps://swanlab.cn/@alanliuxiang/Qwen2-fintune[0m[0m
[1m[34mswanlab[0m[0m: 🚀 View run at [34m[4mhttps://swanlab.cn/@alanliuxiang/Qwen2-fintune/runs/wdnox1lad9ouh07bg1nnh[0m[0m
                                                                                                    