
Commit c5ca64d

[grpo] support CHORD algorithm (#5680)
* chord wip
* wip
* wip
* update doc
* remove chord to utils
* fix
* fix
* update script
* doc
* remove unused import
* fix link
* readme
* readme en
* compute sft only for train
* fix mu=0
1 parent c56cac3 commit c5ca64d


12 files changed: +399 -22 lines changed


README.md

Lines changed: 1 addition & 0 deletions
@@ -75,6 +75,7 @@ You can contact us and communicate with us by adding our group:
 
 ## 🎉 News
+- 🎁 2025.09.07: Added support for the CHORD training algorithm. See the [documentation](./docs/source_en/Instruction/GRPO/AdvancedResearch/CHORD.md).
 - 🎁 2025.09.06: Ulysses can now be used with ring-attention, allowing sequences to be sharded into any number of chunks (no longer limited by the number of heads). The argument remains `--sequence_parallel_size N`.
 - 🎁 2025.09.02: Megatron-SWIFT now supports multimodal model training. Documentation can be found [here](./docs/source_en/Megatron-SWIFT/Multimodal-Model.md).
 - 🎁 2025.08.12: Support [Dynamic Fine-Tuning](https://arxiv.org/abs/2508.05629)(DFT) in SFT training, use parameter `--enable_dft_loss true`. Training scripts can be found [here](https://github.com/modelscope/ms-swift/blob/main/examples/train/full/dft.sh).

README_CN.md

Lines changed: 1 addition & 0 deletions
@@ -71,6 +71,7 @@
 - **Model Quantization**: Supports quantized export with AWQ, GPTQ, FP8, and BNB; exported models support accelerated inference with vLLM/SGLang/LmDeploy and can continue to be trained.
 
 ## 🎉 News
+- 🎁 2025.09.07: Added support for the CHORD training algorithm. See the [documentation](docs/source/Instruction/GRPO/AdvancedResearch/CHORD.md).
 - 🎁 2025.09.06: Ulysses can now be combined with ring-attention, allowing the input sequence to be split into any number of chunks (no longer limited by num_heads). The argument is still `--sequence_parallel_size N`.
 - 🎁 2025.09.02: Megatron-SWIFT supports multimodal model training. See the documentation [here](./docs/source/Megatron-SWIFT/多模态模型.md).
 - 🎁 2025.08.12: Support for [Dynamic Fine-Tuning](https://arxiv.org/abs/2508.05629) (DFT) in SFT training via the parameter `--enable_dft_loss true`. Training scripts can be found [here](https://github.com/modelscope/ms-swift/blob/main/examples/train/full/dft.sh).
docs/source/Instruction/GRPO/AdvancedResearch/CHORD.md

Lines changed: 65 additions & 0 deletions
@@ -0,0 +1,65 @@
# On-Policy RL Meets Off-Policy Experts: Harmonizing SFT and RL via Dynamic Weighting (CHORD)

**Version requirement**: ms-swift>=3.9

This document introduces the **CHORD** algorithm proposed in the paper [On-Policy RL Meets Off-Policy Experts: Harmonizing SFT and RL via Dynamic Weighting](https://arxiv.org/abs/2508.11408). The core idea of CHORD is to **dynamically blend off-policy expert data (SFT)** into **on-policy reinforcement learning** (e.g., GRPO/PPO), using a dual control mechanism of a **global weight μ** plus a **token-level weight φ** to balance imitation and exploration.

## Algorithm Overview

CHORD mixes the two training signals by introducing an **SFT loss** into the **GRPO loss**. The overall objective is:

$$
\mathcal{L}_{\text{CHORD}} = (1 - \mu) \cdot \mathcal{L}_{\text{GRPO}} + \mu \cdot \mathcal{L}_{\text{SFT}}
$$

where:
- $\mathcal{L}_{\text{GRPO}}$: the reinforcement learning loss computed on on-policy samples (similar to PPO).
- $\mathcal{L}_{\text{SFT}}$: the supervised fine-tuning loss.
- $\mu \in [0, 1]$: a global balancing coefficient that controls how much the SFT signal contributes to the total gradient.
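The snippet below is a minimal sketch of this objective, assuming the GRPO loss has already been computed and taking the SFT loss as a standard token-level cross-entropy on the expert mini-batch. The tensor shapes and names are illustrative only and are not the actual ms-swift implementation (that lives in `GRPOTrainer._compute_chord_loss`).

```python
import torch
import torch.nn.functional as F


def chord_loss(grpo_loss: torch.Tensor,
               expert_logits: torch.Tensor,  # [batch, seq, vocab]: policy logits on the expert batch
               expert_labels: torch.Tensor,  # [batch, seq]: expert token ids, -100 = ignore
               mu: float) -> torch.Tensor:
    # L_SFT: token-level cross-entropy on the off-policy expert mini-batch.
    sft_loss = F.cross_entropy(
        expert_logits.flatten(0, 1), expert_labels.flatten(), ignore_index=-100)
    # L_CHORD = (1 - mu) * L_GRPO + mu * L_SFT; mu = 0 recovers plain GRPO, mu = 1 is pure SFT.
    return (1 - mu) * grpo_loss + mu * sft_loss
```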
### Configuration (data and batch size)

CHORD training is implemented on top of GRPO training.

CHORD requires an additional SFT dataset and batch size to be specified at training time:
- `chord_sft_dataset`: the SFT dataset that provides the expert data.
- `chord_sft_per_device_train_batch_size`: the per-device SFT mini-batch size.

---

## Two CHORD Variants

The paper proposes two algorithm variants: **CHORD-µ** and **CHORD-ϕ**.

### CHORD-µ
CHORD-µ gradually **decays μ** during training, transitioning from imitating the expert to autonomous exploration.

**Parameters:**
- `chord_mu_peak`: the peak value of μ.
- `chord_mu_valley`: the final value μ decays to.
- `chord_mu_warmup_steps`: number of training steps over which μ ramps up to its peak.
- `chord_mu_decay_steps`: number of training steps over which μ decays from the peak to the valley.

### CHORD-ϕ (token-level weighting)
Instead of relying on a dynamic decay of μ, **CHORD-ϕ** fixes μ at a small constant (recommended **0.05–0.2**) and uses a **token-wise weighting function φ** to dynamically control the gradient contribution of each expert token.

**Definition of φ:**
$$
\phi(y_t^\star, \pi_\theta) = p_t \cdot (1 - p_t)
$$

where:
- $p_t = \pi_\theta(y_t^\star \mid x, y_{<t}^\star)$: the probability the current model assigns to the expert token.
- When $p_t \approx 0.5$ (the model is uncertain), φ is maximal → tokens the model is uncertain about are emphasized.
- When $p_t \approx 0$ or $p_t \approx 1$, φ → 0 → tokens the model has already mastered, or cannot yet produce at all, are not over-trained.
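As a quick check of the two bullets above, $\phi(p_t) = p_t(1 - p_t)$ is a downward parabola, so the weight peaks exactly where the model is most uncertain:

$$
\frac{\partial \phi}{\partial p_t} = 1 - 2p_t = 0 \;\Rightarrow\; p_t = \tfrac{1}{2}, \qquad \phi_{\max} = \tfrac{1}{4}
$$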
**Parameter to enable φ weighting:**
- `chord_enable_phi_function: bool = False`
  - Set to `True` to enable the token-wise weight φ.

Note: to use a constant μ, set `chord_mu_peak` and `chord_mu_valley` to the same value.

<details>
<summary>Code implementation of the μ schedule and loss computation</summary>
Please refer to the `_compute_chord_loss` method of `GRPOTrainer`.
</details>

For training, refer to this [script](https://github.com/modelscope/ms-swift/tree/main/examples/train/grpo/internal/chord.sh).

docs/source/Instruction/GRPO/AdvancedResearch/index.rst

Lines changed: 1 addition & 0 deletions
@@ -7,3 +7,4 @@ Advanced Research
    DAPO.md
    deepeyes.md
    GSPO.md
+   CHORD.md
docs/source_en/Instruction/GRPO/AdvancedResearch/CHORD.md

Lines changed: 65 additions & 0 deletions
@@ -0,0 +1,65 @@
# On-Policy RL Meets Off-Policy Experts: Harmonizing SFT and RL via Dynamic Weighting (CHORD)

**Version Requirement**: ms-swift>=3.9

This document describes the CHORD algorithm proposed in the paper [On-Policy RL Meets Off-Policy Experts: Harmonizing SFT and RL via Dynamic Weighting](https://arxiv.org/abs/2508.11408). The core idea of CHORD is to dynamically integrate off-policy expert data (SFT) into on-policy reinforcement learning (e.g., GRPO/PPO) by a dual control mechanism (a global weight μ plus a token-level weight φ), thereby balancing imitation and exploration.

## Algorithm Overview

CHORD mixes the two training signals by introducing the SFT loss into the GRPO loss. The overall objective is:

$$
\mathcal{L}_{\text{CHORD}} = (1 - \mu) \cdot \mathcal{L}_{\text{GRPO}} + \mu \cdot \mathcal{L}_{\text{SFT}}
$$

where:
- $\mathcal{L}_{\text{GRPO}}$: on-policy RL loss based on on-policy samples (similar to PPO).
- $\mathcal{L}_{\text{SFT}}$: supervised fine-tuning (SFT) loss.
- $\mu \in [0, 1]$: global balancing coefficient that controls the contribution of the SFT signal to the overall gradient.

### Configuration (data and batch sizes)

CHORD training is implemented on top of GRPO training.

CHORD requires specifying an additional SFT dataset and batch size at training time:
- `chord_sft_dataset`: the SFT dataset that provides expert data.
- `chord_sft_per_device_train_batch_size`: the per-device SFT mini-batch size.

---

## Two CHORD Variants

The paper proposes two variants: CHORD-μ and CHORD-φ.

### CHORD-μ
CHORD-μ gradually decays μ during training to transition from imitating experts toward autonomous exploration; one possible schedule is sketched after the parameter list below.

Parameters:
- `chord_mu_peak`: the peak value of μ.
- `chord_mu_valley`: the final value μ decays to.
- `chord_mu_warmup_steps`: number of training steps to ramp μ up to the peak.
- `chord_mu_decay_steps`: number of training steps to decay μ from peak to valley.
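To make the four parameters concrete, the following is an illustrative piecewise-linear schedule; the exact schedule shape used by ms-swift is defined in `GRPOTrainer._compute_chord_loss` and may differ.

```python
def chord_mu(step: int, mu_peak: float, mu_valley: float,
             warmup_steps: int, decay_steps: int) -> float:
    """Illustrative μ schedule: linear warmup to mu_peak, then linear decay to mu_valley."""
    if warmup_steps > 0 and step < warmup_steps:
        return mu_peak * step / warmup_steps            # ramp up toward the peak
    step_after_warmup = step - warmup_steps
    if decay_steps > 0 and step_after_warmup < decay_steps:
        frac = step_after_warmup / decay_steps
        return mu_peak + frac * (mu_valley - mu_peak)   # decay toward the valley
    return mu_valley                                    # stay at the valley afterwards
```

With `chord_mu_warmup_steps=0` and `chord_mu_peak == chord_mu_valley`, μ stays constant, which is the setting used for the CHORD-φ variant below.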
### CHORD-φ (Token-level weighting)
CHORD-φ does not rely on μ scheduling; instead it keeps μ fixed at a small constant (recommended 0.05–0.2) and uses a token-wise weighting function φ to dynamically control each expert token's gradient contribution.

Definition of φ:
$$
\phi(y_t^\star, \pi_\theta) = p_t \cdot (1 - p_t)
$$

where:
- $p_t = \pi_\theta(y_t^\star \mid x, y_{<t}^\star)$ is the model's current predicted probability of the expert token.
- When $p_t \approx 0.5$ (the model is uncertain), φ is maximal → emphasize tokens the model is uncertain about.
- When $p_t \approx 0$ or $p_t \approx 1$, φ → 0 → avoid overemphasizing tokens the model is already confident about or currently assigns near-zero probability to.

Parameter to enable φ weighting (a sketch of how φ enters the SFT loss follows this list):
- `chord_enable_phi_function: bool = False`
  - Set to `True` to enable the token-wise weight φ.
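The sketch below shows one way such a token-wise weight could enter the SFT term. The variable names and the stop-gradient on φ are assumptions for illustration; the actual behavior when `chord_enable_phi_function=true` is implemented in `GRPOTrainer._compute_chord_loss`.

```python
import torch
import torch.nn.functional as F


def phi_weighted_sft_loss(expert_logits: torch.Tensor,  # [batch, seq, vocab]
                          expert_labels: torch.Tensor   # [batch, seq], -100 marks ignored positions
                          ) -> torch.Tensor:
    mask = expert_labels != -100
    labels = expert_labels.clamp(min=0)  # make padded ids valid for gather
    log_probs = F.log_softmax(expert_logits, dim=-1)
    token_logp = log_probs.gather(-1, labels.unsqueeze(-1)).squeeze(-1)  # log p_t
    p = token_logp.exp()                                                 # p_t
    phi = (p * (1 - p)).detach()  # phi = p_t * (1 - p_t), treated as a constant weight here
    # phi-weighted negative log-likelihood, averaged over the non-masked expert tokens.
    return -(phi * token_logp * mask).sum() / mask.sum().clamp(min=1)
```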
Note: If using a constant μ, set `chord_mu_peak` and `chord_mu_valley` to the same value.

<details>
<summary>Code implementation of μ scheduling and loss computation</summary>
See the `GRPOTrainer` method `_compute_chord_loss`.
</details>

Training reference script: https://github.com/modelscope/ms-swift/tree/main/examples/train/grpo/internal/chord.sh

docs/source_en/Instruction/GRPO/AdvancedResearch/index.rst

Lines changed: 1 addition & 0 deletions
@@ -7,3 +7,4 @@ Advanced Research
    DAPO.md
    deepeyes.md
    GSPO.md
+   CHORD.md
examples/train/grpo/internal/chord.sh

Lines changed: 53 additions & 0 deletions
@@ -0,0 +1,53 @@
# 8*80G GPU
# CHORD https://arxiv.org/abs/2508.11408
# GRPO total batch = 32(prompts)*8(num_generations) = 256 = 8(gpus) * 4(per_device_train_batch_size) * 8(gradient_accumulation_steps)
# SFT total batch = 64 = 8(gpus) * 1(chord_sft_per_device_train_batch_size) * 8(gradient_accumulation_steps)

# NOTE: We use the same dataset for GRPO and SFT, which may cause overlap (i.e., the same examples to be selected).
# You can pre-download the dataset and manually split it to avoid this.

export CHORD_SYSTEM_PROMPT="You are a helpful assistant that solves MATH problems.
You should first think about the reasoning process in mind and then provide the user with the answer.
You should present your reasoning process using the format: <think>\n...your reasoning process here... </think>\n"

CUDA_VISIBLE_DEVICES=0,1,2,3,4,5,6,7 \
NPROC_PER_NODE=8 \
swift rlhf \
    --rlhf_type grpo \
    --model Qwen/Qwen2.5-7B-Instruct \
    --dataset AI-MO/NuminaMath-TIR \
    --torch_dtype bfloat16 \
    --beta 0.0 \
    --steps_per_generation 4 \
    --num_train_epochs 1 \
    --per_device_train_batch_size 4 \
    --gradient_accumulation_steps 8 \
    --chord_sft_per_device_train_batch_size 1 \
    --chord_sft_dataset AI-MO/NuminaMath-TIR \
    --chord_enable_phi_function false \
    --chord_mu_warmup_steps 0 \
    --chord_mu_decay_steps 200 \
    --chord_mu_peak 0.9 \
    --chord_mu_valley 0.05 \
    --num_generations 8 \
    --train_type full \
    --reward_funcs accuracy \
    --system "$CHORD_SYSTEM_PROMPT" \
    --use_vllm true \
    --vllm_mode colocate \
    --vllm_gpu_memory_utilization 0.4 \
    --vllm_max_model_len 8192 \
    --max_completion_length 4096 \
    --overlong_filter true \
    --offload_optimizer true \
    --offload_model true \
    --sleep_level 1 \
    --save_steps 1000 \
    --learning_rate 1e-6 \
    --save_total_limit 2 \
    --logging_steps 1 \
    --warmup_ratio 0.05 \
    --dataloader_num_workers 4 \
    --deepspeed zero3 \
    --log_completions true \
    --report_to tensorboard swanlab
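The NOTE at the top of the script suggests pre-splitting the dataset so that the GRPO prompts and the CHORD SFT expert data do not overlap. A minimal sketch using the Hugging Face `datasets` API (an assumption; any equivalent tooling works) could look like this, after which the two JSONL paths would be passed to `--dataset` and `--chord_sft_dataset`:

```python
from datasets import load_dataset

# Download once, then split deterministically into disjoint GRPO / SFT subsets.
ds = load_dataset('AI-MO/NuminaMath-TIR', split='train')
split = ds.train_test_split(test_size=0.5, seed=42)
split['train'].to_json('numina_grpo.jsonl')  # prompts for on-policy GRPO rollouts
split['test'].to_json('numina_sft.jsonl')    # expert data for the CHORD SFT term
```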

swift/llm/train/rlhf.py

Lines changed: 22 additions & 1 deletion
@@ -164,7 +164,7 @@ def prepare_model(cls, args, model, *, template=None, train_dataset=None, task_t
     def _prepare_template(self) -> None:
         args = self.args
         super()._prepare_template()
-        model_mapping = {'kto': 'kto', 'gkd': 'gkd', 'ppo': 'pt', 'grpo': 'pt'}
+        model_mapping = {'kto': 'kto', 'gkd': 'gkd', 'ppo': 'pt', 'grpo': 'train'}
         self.template.set_mode(model_mapping.get(args.rlhf_type, 'rlhf'))
 
         if args.rlhf_type == 'ppo':
@@ -177,6 +177,25 @@ def _get_dataset(self):
             train_dataset, val_dataset = prepare_kto_dataset(args, train_dataset, val_dataset)
         return train_dataset, val_dataset
 
+    def _prepare_chord_sft_dataset(self):
+        from ..dataset import load_dataset
+        from swift.llm.dataset.loader import DatasetLoader
+
+        # prepare the expert SFT dataset for CHORD
+        args = self.args
+        assert hasattr(args, 'chord_sft_dataset') and args.chord_sft_dataset
+        dataset_kwargs = args.get_dataset_kwargs()
+        chord_sft_datasets = []
+        # TODO: validation
+        chord_sft_dataset, _ = load_dataset(
+            args.chord_sft_dataset, split_dataset_ratio=0, shuffle=args.dataset_shuffle, **dataset_kwargs)
+        chord_sft_dataset, _ = self._encode_dataset(chord_sft_dataset, None, pre_process=True)
+        chord_sft_datasets.append(chord_sft_dataset)
+        chord_sft_dataset = DatasetLoader._concat_datasets(chord_sft_datasets)
+        datasets = [chord_sft_dataset, None]
+        datasets = self._post_process_datasets(datasets)
+        return datasets
+
     def _get_trainer_kwargs(self):
         trainer_kwargs = {}
         for key in ['ref', 'reward', 'value', 'teacher']:
@@ -189,6 +208,8 @@ def _get_trainer_kwargs(self):
         if self.args.rlhf_type == 'grpo':
             trainer_kwargs['reward_funcs'] = self.args.reward_funcs
             trainer_kwargs['vllm_client'] = self.args.vllm_client
+            if self.args.chord_sft_dataset:
+                trainer_kwargs['chord_sft_dataset'], _ = self._prepare_chord_sft_dataset()
         return trainer_kwargs
swift/llm/train/sft.py

Lines changed: 12 additions & 6 deletions
@@ -113,22 +113,29 @@ def _get_cached_dataset(self):
 
     def _prepare_dataset(self):
         args = self.args
+        is_grpo = hasattr(args, 'rlhf_type') and args.rlhf_type == 'grpo'
         if args.cached_dataset:
             train_datasets, val_datasets = self._get_cached_dataset()
         else:
             train_datasets, val_datasets = [], []
             if args.dataset:
                 train_dataset, val_dataset = self._get_dataset()
-                train_dataset, val_dataset = self._encode_dataset(train_dataset, val_dataset)
+                train_dataset, val_dataset = self._encode_dataset(train_dataset, val_dataset, pre_process=not is_grpo)
                 train_datasets.append(train_dataset)
                 val_datasets.append(val_dataset)
         train_dataset = DatasetLoader._concat_datasets(train_datasets)
         val_dataset = DatasetLoader._concat_datasets(val_datasets)
-        is_grpo = hasattr(args, 'rlhf_type') and args.rlhf_type == 'grpo'
-        predict_with_generate = getattr(args, 'predict_with_generate', False)
         datasets = [train_dataset, val_dataset]
         if is_grpo:
             return datasets
+        datasets = self._post_process_datasets(datasets)
+
+        return datasets
+
+    def _post_process_datasets(self, datasets: List) -> List:
+        args = self.args
+        predict_with_generate = getattr(args, 'predict_with_generate', False)
+
         template = self.template
         for i, dataset in enumerate(datasets):
             if dataset is None:
294301
if val_dataset is not None and not predict_with_generate:
295302
self.train_msg['val_dataset'] = self._stat_dataset(val_dataset)
296303

297-
def _encode_dataset(self, train_dataset, val_dataset):
304+
def _encode_dataset(self, train_dataset, val_dataset, pre_process=True):
298305
template = self.template
299306
args = self.args
300307
self._save_val_dataset(val_dataset)
301308

302-
is_grpo = hasattr(args, 'rlhf_type') and args.rlhf_type == 'grpo'
303309
predict_with_generate = getattr(args, 'predict_with_generate', False)
304310
datasets = [train_dataset, val_dataset]
305-
if is_grpo:
311+
if not pre_process:
306312
return datasets
307313

308314
origin_template_model = template.model

swift/trainers/arguments.py

Lines changed: 9 additions & 0 deletions
@@ -137,6 +137,15 @@ def __post_init__(self):
 class RLHFArgumentsMixin:
     # gkd
     sft_alpha: float = 0
+    # chord
+    chord_sft_dataset: Optional[str] = None
+    chord_sft_per_device_train_batch_size: Optional[int] = None
+
+    chord_enable_phi_function: bool = False
+    chord_mu_warmup_steps: Optional[int] = None
+    chord_mu_decay_steps: Optional[int] = None
+    chord_mu_peak: Optional[float] = None
+    chord_mu_valley: Optional[float] = None
 
 
 @dataclass
