## 🎉 News
- 🎁 2025.09.07: Added support for the CHORD training algorithm. See the [documentation](./docs/source_en/Instruction/GRPO/AdvancedResearch/CHORD.md).
- 🎁 2025.09.06: Ulysses can now be used with ring-attention, allowing sequences to be sharded into any number of chunks (no longer limited by the number of heads). The argument remains `--sequence_parallel_size N`.
- 🎁 2025.09.02: Megatron-SWIFT now supports multimodal model training. Documentation can be found [here](./docs/source_en/Megatron-SWIFT/Multimodal-Model.md).
- 🎁 2025.08.12: Support [Dynamic Fine-Tuning](https://arxiv.org/abs/2508.05629) (DFT) in SFT training; use the parameter `--enable_dft_loss true`. Training scripts can be found [here](https://github.com/modelscope/ms-swift/blob/main/examples/train/full/dft.sh).
# On-Policy RL Meets Off-Policy Experts: Harmonizing SFT and RL via Dynamic Weighting (CHORD)
**Version Requirement**: ms-swift>=3.9

This document describes the CHORD algorithm proposed in the paper ["On-Policy RL Meets Off-Policy Experts: Harmonizing SFT and RL via Dynamic Weighting"](https://arxiv.org/abs/2508.11408). The core idea of CHORD is to dynamically integrate off-policy expert data (SFT) into on-policy reinforcement learning (e.g., GRPO/PPO) through a dual control mechanism: a global weight μ plus a token-level weight φ, thereby balancing imitation and exploration.
## Algorithm Overview
CHORD mixes the two training signals by introducing the SFT loss into the GRPO loss. The overall objective is:

$$
\mathcal{L}_{\text{CHORD}} = (1 - \mu) \cdot \mathcal{L}_{\text{GRPO}} + \mu \cdot \mathcal{L}_{\text{SFT}}
$$

where:

- $\mu \in [0, 1]$: global balancing coefficient that controls the contribution of the SFT signal to the overall gradient.
### Configuration (data and batch sizes)
CHORD training is implemented on top of GRPO training.

CHORD requires specifying an additional SFT dataset and its batch size at training time:
- `chord_sft_dataset`: the SFT dataset that provides expert data.
- `chord_sft_per_device_train_batch_size`: SFT mini-batch size per device.
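
For illustration, a minimal sketch of how these options could be attached to a GRPO training command is shown below. The model is only an example, the dataset values are placeholders, and the complete reference script is linked at the end of this document.

```bash
# Sketch only: CHORD options added on top of a standard GRPO training command.
# <rl_dataset> and <expert_sft_dataset> are placeholders.
swift rlhf \
    --rlhf_type grpo \
    --model Qwen/Qwen2.5-7B-Instruct \
    --dataset <rl_dataset> \
    --chord_sft_dataset <expert_sft_dataset> \
    --chord_sft_per_device_train_batch_size 1
```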
---
## Two CHORD Variants
The paper proposes two variants: CHORD-μ and CHORD-φ.
### CHORD-μ
CHORD-μ gradually decays μ during training to transition from imitating experts toward autonomous exploration.

Parameters:

- `chord_mu_peak`: the peak value of μ.
- `chord_mu_valley`: the final decayed value of μ.
- `chord_mu_warmup_steps`: number of training steps to ramp μ up to the peak.
- `chord_mu_decay_steps`: number of training steps to decay μ from peak to valley.
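
As an illustration only (the numbers are arbitrary, not recommendations), a CHORD-μ schedule could be configured by adding the scheduling options to the command sketched in the configuration section:

```bash
# Sketch only: μ ramps up to chord_mu_peak over chord_mu_warmup_steps,
# then decays toward chord_mu_valley over chord_mu_decay_steps.
# All numeric values are illustrative; placeholders as in the earlier sketch.
swift rlhf \
    --rlhf_type grpo \
    --model Qwen/Qwen2.5-7B-Instruct \
    --dataset <rl_dataset> \
    --chord_sft_dataset <expert_sft_dataset> \
    --chord_sft_per_device_train_batch_size 1 \
    --chord_mu_warmup_steps 10 \
    --chord_mu_decay_steps 1000 \
    --chord_mu_peak 0.9 \
    --chord_mu_valley 0.05
```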
### CHORD-φ (Token-level weighting)
CHORD-φ does not rely on μ scheduling; instead it keeps μ fixed to a small constant (recommended 0.05–0.2) and uses a token-wise weighting function φ to dynamically control each expert token's gradient contribution.
Definition of φ:

$$
\phi(y_t^\star, \pi_\theta) = p_t \cdot (1 - p_t)
$$

where:

- $p_t = \pi_\theta(y_t^\star \mid x, y_{<t}^\star)$ is the model's current predicted probability of the expert token.
- When $p_t \approx 0.5$ (the model is uncertain), φ is maximal → emphasize tokens the model is uncertain about.
- When $p_t \approx 0$ or $p_t \approx 1$, φ → 0 → avoid overemphasizing tokens that are already certain or impossible.
Parameter to enable φ weighting:

- `chord_enable_phi_function: bool = False`
  - Set to `True` to enable the token-wise weight φ.

Note: If using a constant μ, set `chord_mu_peak` and `chord_mu_valley` to the same value.
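
For example (a sketch only; 0.1 is just one value inside the recommended 0.05–0.2 range), CHORD-φ with a constant μ could be configured as:

```bash
# Sketch only: CHORD-φ with a constant μ (peak == valley, so μ stays fixed)
# and the token-wise φ weighting enabled. Values and placeholders are illustrative.
swift rlhf \
    --rlhf_type grpo \
    --model Qwen/Qwen2.5-7B-Instruct \
    --dataset <rl_dataset> \
    --chord_sft_dataset <expert_sft_dataset> \
    --chord_sft_per_device_train_batch_size 1 \
    --chord_mu_peak 0.1 \
    --chord_mu_valley 0.1 \
    --chord_enable_phi_function true
```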
<details>
<summary>Code implementation of μ scheduling and loss computation</summary>
See the `GRPOTrainer` method `_compute_chord_loss`.
</details>
Training reference script: https://github.com/modelscope/ms-swift/tree/main/examples/train/grpo/internal/chord.sh