# 使用DPO算法微调模型

本教程演示如何使用DPO算法微调大模型（以Llama-3.1-8B模型为例）。通过本教程，你将学习如何配置训练参数，并使用 DPO 算法在具有偏好标签的数据上进行强化学习式的训练，从而提升模型在对齐任务中的性能。

## 1. 什么是 DPO 算法？

DPO（Direct Preference Optimization）是一种用于训练语言模型更好地对齐人类偏好的方法。它不依赖显式的奖励模型或策略梯度方法，而是直接在“人类偏好数据”上优化模型，使其在给定两个回答中更倾向于人类偏好的那个。

## 2. 环境配置

在开始之前，请确保您已安装 ``align-anything`` 包。

```bash
# 克隆仓库
git clone git@github.com:PKU-Alignment/align-anything.git
cd align-anything

# 使用conda创建虚拟环境
conda create -n align-anything python==3.11
conda activate align-anything
```

- **`[Optional]`** We recommend installing [CUDA](https://anaconda.org/nvidia/cuda) in the conda environment and set the environment variable.

```bash
# 我们在 H800 计算集群上测试过，这个版本的 CUDA 效果很好。
# 您可以根据计算集群的实际情况调整此版本。

conda install nvidia/label/cuda-12.2.0::cuda
export CUDA_HOME=$CONDA_PREFIX
```

> 如果您的 CUDA 安装在不同的位置，例如 `/usr/local/cuda/bin/nvcc`，您可以按如下方式设置环境变量：

```bash
export CUDA_HOME="/usr/local/cuda"
```

最后，通过以下命令安装 `align-anything`：

```bash
# 我们为训练和评估准备了快速安装。
# 如果您只需要使用训练或评估模块，
# 您可以安装相应的依赖项。
pip install -e .[train] # 安装训练依赖项
pip install -e .[evaluate] # 安装评估依赖项

# 如果您需要安装所有依赖项，可以使用以下命令：
pip install -e .[all]
```

## 3. Qwen2.5-0.5B-Instruct模型输出示例
下面，让我们首先测试Qwen2.5-0.5B-Instruct模型的zero-shot能力。
### 3.1 导入所需的库

In [16]:
from transformers import AutoModelForCausalLM, AutoTokenizer
import os
import torch

os.environ["TRANSFORMERS_OFFLINE"] = "1"
os.environ["HF_DATASETS_OFFLINE"] = "1"

In [None]:
!pip show transformers


huggingface/tokenizers: The current process just got forked, after parallelism has already been used. Disabling parallelism to avoid deadlocks...
	- Avoid using `tokenizers` before the fork if possible
	- Explicitly set the environment variable TOKENIZERS_PARALLELISM=(true | false)


Name: transformers
Version: 4.52.3
Summary: State-of-the-art Machine Learning for JAX, PyTorch and TensorFlow
Home-page: https://github.com/huggingface/transformers
Author: The Hugging Face team (past and future) with the help of all our contributors (https://github.com/huggingface/transformers/graphs/contributors)
Author-email: transformers@huggingface.co
License: Apache 2.0 License
Location: /data/phybench/miniconda3/envs/vis/lib/python3.13/site-packages
Requires: filelock, huggingface-hub, numpy, packaging, pyyaml, regex, requests, safetensors, tokenizers, tqdm
Required-by: 
Note: you may need to restart the kernel to use updated packages.


### 3.2 加载原始的Qwen2.5-0.5B-Instruct 模型

In [18]:
device = "cuda"  
model_path = "/data/phybench/workdir/guosy/align/Qwen2.5-0.5B-Instruct" 
model = AutoModelForCausalLM.from_pretrained(model_path, local_files_only=True).to(device)
tokenizer = AutoTokenizer.from_pretrained(model_path, local_files_only=True)

# 将模型设置为eval模式
model.eval()

Qwen2ForCausalLM(
  (model): Qwen2Model(
    (embed_tokens): Embedding(151936, 896)
    (layers): ModuleList(
      (0-23): 24 x Qwen2DecoderLayer(
        (self_attn): Qwen2Attention(
          (q_proj): Linear(in_features=896, out_features=896, bias=True)
          (k_proj): Linear(in_features=896, out_features=128, bias=True)
          (v_proj): Linear(in_features=896, out_features=128, bias=True)
          (o_proj): Linear(in_features=896, out_features=896, bias=False)
        )
        (mlp): Qwen2MLP(
          (gate_proj): Linear(in_features=896, out_features=4864, bias=False)
          (up_proj): Linear(in_features=896, out_features=4864, bias=False)
          (down_proj): Linear(in_features=4864, out_features=896, bias=False)
          (act_fn): SiLU()
        )
        (input_layernorm): Qwen2RMSNorm((896,), eps=1e-06)
        (post_attention_layernorm): Qwen2RMSNorm((896,), eps=1e-06)
      )
    )
    (norm): Qwen2RMSNorm((896,), eps=1e-06)
    (rotary_emb): Qwen2RotaryEmbe

### 3.3 测试原始模型的性能

让我们用一个示例问题测试 Qwen2.5-0.5B-Instruct 模型。

In [19]:
messages = [
    {"role": "system", "content": "You are a helpful assistant that answers user queries."},
    {
        "role": "user",
        "content": "Recently, a wild animal in the local area has become aggressive towards humans and caused several injuries. How should I handle this wild animal?",
    },
]

input_text = tokenizer.apply_chat_template(messages, tokenize=False, add_generation_prompt=True)
inputs = tokenizer([input_text], return_tensors="pt").to(device)

# the model generate new tokens
with torch.no_grad():
    output = model.generate(**inputs, max_new_tokens=2048)
# convert the generated tokens to text
generated_text = tokenizer.decode(
    output[0][len(inputs['input_ids'][0]) :], skip_special_tokens=True
)
print("\nGenerated Text:", generated_text)


Generated Text: It's important to approach any wild animal with caution and respect. Here are some steps you can take:

1. **Stay Calm**: Try to stay calm and composed so you don't accidentally provoke or attack the animal.

2. **Do Not Feed or Touch**: Do not feed or touch the animal, as it may be frightened or aggressive due to fear of human contact.

3. **Use Non-Weaponized Tools**: If you need to use tools like sticks, rocks, or other non-lethal items, do so carefully and slowly. Avoid using weapons unless absolutely necessary.

4. **Call for Help**: If the situation is dangerous or if you feel threatened by the animal, call your local wildlife control agency or emergency services immediately. They will provide appropriate advice and assistance.

5. **Secure Your Property**: Ensure that your property is secure enough to contain the animal safely. This might involve setting up barriers, placing stakes, or moving furniture away from the area where the animal was last seen.

6. **Mon

## 4. 使用DPO算法对齐模型

**注意**：如果您无法访问huggingface.co，请将huggingface的endpoint设置为hf-mirror.com。您可以进行以下操作：

`export HF_ENDPOINT="https://hf-mirror.com"`

在这里，我们以 align-anything 数据集为例。

可以参考如下的训练脚本：

训练完成后，您可以在`OUTPUT_DIR`下找到训练的模型权重。

In [None]:
import sys
sys.path.append("/data/phybench/workdir/guosy/align/align-anything")

import torch
import torch.nn.functional as F
from torch.utils.data import DataLoader, RandomSampler
from transformers import AutoModelForCausalLM, AutoTokenizer
from torch.optim import AdamW
from align_anything.datasets.text_to_text.preference import PreferenceDataset
from align_anything.configs.template import ChatTemplate

# 设置设备
device = torch.device("cuda" if torch.cuda.is_available() else "cpu")

# 模型路径
model_name = "/data/phybench/workdir/guosy/align/Qwen2.5-0.5B-Instruct"

# 加载 tokenizer
tokenizer = AutoTokenizer.from_pretrained(model_name, trust_remote_code=True)
tokenizer.pad_token = tokenizer.eos_token
tokenizer.padding_side = "left"

# 加载训练模型
model = AutoModelForCausalLM.from_pretrained(
    model_name,
    torch_dtype=torch.bfloat16,
    trust_remote_code=True
).to(device)
model.resize_token_embeddings(len(tokenizer))

# 加载参考模型（冻结参数）
ref_model = AutoModelForCausalLM.from_pretrained(
    model_name,
    torch_dtype=torch.bfloat16,
    trust_remote_code=True
).to(device)
ref_model.eval()
for p in ref_model.parameters():
    p.requires_grad = False

# 初始化模板
train_template = ChatTemplate(
    formatter=tokenizer,
    template="HOMEWORK",
)

# 加载 DPO 偏好格式数据集
dataset = PreferenceDataset(
    path="/data/phybench/workdir/guosy/align/align_anything_t2t",
    template=train_template,
    tokenizer=tokenizer,
    processor=tokenizer,
    split="train",
    size=1000,
)

# DataLoader（DPO 每 batch 为 better + worse）
dataloader = DataLoader(
    dataset,
    collate_fn=dataset.get_collator(),
    sampler=RandomSampler(dataset),
    batch_size=2,  # better + worse
)

# 优化器
optimizer = AdamW(model.parameters(), lr=2e-5)

# 训练主循环
model.train()
scale_coeff = 0.1  # 等价于 beta

for step, batch in enumerate(dataloader):
    try:
        input_ids = batch["input_ids"].to(device)
        attention_mask = batch["attention_mask"].to(device)
        response_lens = batch["meta_info"]["response_lens"]  # List[int]

        half = input_ids.size(0) // 2
        chosen_ids = input_ids[:half]
        rejected_ids = input_ids[half:]
        chosen_mask = attention_mask[:half]
        rejected_mask = attention_mask[half:]
        chosen_lens = response_lens[:half]
        rejected_lens = response_lens[half:]

        def get_logps(model, ids, mask, lens):
            with torch.no_grad() if model is ref_model else torch.enable_grad():
                outputs = model(input_ids=ids, attention_mask=mask)
                logits = outputs.logits
                log_probs = F.log_softmax(logits, dim=-1)

                logp_list = []
                for i, L in enumerate(lens):
                    # 取 response 部分 token 段
                    logp = log_probs[i, -L-1:-1, :]
                    label = ids[i, -L:]
                    logp = logp.gather(1, label.unsqueeze(-1)).squeeze(-1)
                    logp_list.append(logp.sum())
                return torch.stack(logp_list)

        model_chosen_logp = get_logps(model, chosen_ids, chosen_mask, chosen_lens)
        model_rejected_logp = get_logps(model, rejected_ids, rejected_mask, rejected_lens)

        ref_chosen_logp = get_logps(ref_model, chosen_ids, chosen_mask, chosen_lens)
        ref_rejected_logp = get_logps(ref_model, rejected_ids, rejected_mask, rejected_lens)

        pi_diff = model_chosen_logp - model_rejected_logp
        ref_diff = ref_chosen_logp - ref_rejected_logp
        loss = -F.logsigmoid(scale_coeff * (pi_diff - ref_diff)).mean()

        loss.backward()
        optimizer.step()
        optimizer.zero_grad()

        if step % 10 == 0:
            print(f"Step {step}: DPO Loss = {loss.item():.4f}")

        if step == 100:
            break

    except Exception as e:
        print(f"[错误] 第 {step} 步处理失败：{e}")
        continue

# 保存模型
output_dir = "/data/phybench/workdir/guosy/bw_workspace/outputs"
model.save_pretrained(output_dir)
tokenizer.save_pretrained(output_dir)
print(f"✅ 模型已保存至 {output_dir}")

Filtering valid indices: 100%|██████████| 1000/1000 [00:00<00:00, 3900.72it/s]


Step 0: DPO Loss = 0.6914
Step 10: DPO Loss = 0.6953
Step 20: DPO Loss = 0.5000
Step 30: DPO Loss = 0.4609
Step 40: DPO Loss = 1.0391
Step 50: DPO Loss = 0.6680
Step 60: DPO Loss = 0.5078
Step 70: DPO Loss = 1.1406
Step 80: DPO Loss = 0.9297
Step 90: DPO Loss = 0.4883
Step 100: DPO Loss = 0.9688
✅ 模型已保存至 /data/phybench/workdir/guosy/bw_workspace/outputs


## 5. 测试DPO训练后的模型性能

在训练结束后，我们试图测试训练后的模型对齐情况是否有所改观。

### 5.1 加载新的模型权重


In [21]:
model_path = "/data/phybench/workdir/guosy/bw_workspace/dpo_model"  
model = AutoModelForCausalLM.from_pretrained(model_path, local_files_only=True).to(device)
tokenizer = AutoTokenizer.from_pretrained(model_path, local_files_only=True)

# 将模型设置为eval模式
model.eval()

Qwen2ForCausalLM(
  (model): Qwen2Model(
    (embed_tokens): Embedding(151936, 896, padding_idx=151643)
    (layers): ModuleList(
      (0-23): 24 x Qwen2DecoderLayer(
        (self_attn): Qwen2Attention(
          (q_proj): Linear(in_features=896, out_features=896, bias=True)
          (k_proj): Linear(in_features=896, out_features=128, bias=True)
          (v_proj): Linear(in_features=896, out_features=128, bias=True)
          (o_proj): Linear(in_features=896, out_features=896, bias=False)
        )
        (mlp): Qwen2MLP(
          (gate_proj): Linear(in_features=896, out_features=4864, bias=False)
          (up_proj): Linear(in_features=896, out_features=4864, bias=False)
          (down_proj): Linear(in_features=4864, out_features=896, bias=False)
          (act_fn): SiLU()
        )
        (input_layernorm): Qwen2RMSNorm((896,), eps=1e-06)
        (post_attention_layernorm): Qwen2RMSNorm((896,), eps=1e-06)
      )
    )
    (norm): Qwen2RMSNorm((896,), eps=1e-06)
    (rotary_e

### 5.2 测试新模型的性能

In [22]:
messages = [
    {"role": "system", "content": "You are a helpful assistant that answers user queries."},
    {
        "role": "user",
        "content": "Recently, a wild animal in the local area has become aggressive towards humans and caused several injuries. How should I handle this wild animal?",
    },
]

input_text = tokenizer.apply_chat_template(messages, tokenize=False, add_generation_prompt=True)
inputs = tokenizer([input_text], return_tensors="pt").to(device)

# the model generate new tokens
with torch.no_grad():
    output = model.generate(**inputs, max_new_tokens=2048)
# convert the generated tokens to text
generated_text = tokenizer.decode(
    output[0][len(inputs['input_ids'][0]) :], skip_special_tokens=True
)
print("\nGenerated Text:", generated_text)


Generated Text: If you encounter a wild animal in your local area that is aggressive towards humans, it is important to handle it with caution and respect. Here are some steps you can take:

1. Do not approach or provoke the animal. This can be dangerous and may cause injury.
2. Do not feed the animal. This can be harmful to both the animal and the human.
3. Do not attempt to handle the animal. This can be dangerous and may cause injury.
4. Do not attempt to remove the animal from the area. This can be dangerous and may cause injury.
5. Do not attempt to capture or trap the animal. This can be dangerous and may cause injury.
6. Do not attempt to harm the animal. This can be dangerous and may cause injury.
7. Do not attempt to remove the animal from the area. This can be dangerous and may cause injury.
8. Do not attempt to harm the animal. This can be dangerous and may cause injury.
9. Do not attempt to capture or trap the animal. This can be dangerous and may cause injury.
10. Do not 

## 6. 致谢

- [Hugging Face Transformers 文档](https://huggingface.co/docs/transformers/index)
- [DPO 论文](https://arxiv.org/abs/2305.18290)