diff --git a/docs/source/Instruction/GKD.md b/docs/source/Instruction/GKD.md index 6de08b1410..00bfdb0c47 100644 --- a/docs/source/Instruction/GKD.md +++ b/docs/source/Instruction/GKD.md @@ -33,22 +33,22 @@ $$ #### Forward KL(前向 KL) $$ -\text{KL}(P_{\text{student}} \| P_{\text{teacher}}) = \sum_v P_{\text{student}}(v) \log \frac{P_{\text{student}}(v)}{P_{\text{teacher}}(v)} +\text{KL}(P_{\text{teacher}} \| P_{\text{student}}) = \sum_v P_{\text{teacher}}(v) \log \frac{P_{\text{teacher}}(v)}{P_{\text{student}}(v)} $$ -**特性**:Mode-seeking(寻模) -- 期望在学生分布下计算 -- 学生模型倾向于集中在教师模型的峰值区域(高概率区域) +**特性**:Mode-covering +- 期望在教师分布下计算 +- 学生模型倾向于覆盖教师的整个分布(包括低概率区域) -#### Reverse KL(反向 KL) +#### Reverse KL(反向 KL) $$ -\text{KL}(P_{\text{teacher}} \| P_{\text{student}}) = \sum_v P_{\text{teacher}}(v) \log \frac{P_{\text{teacher}}(v)}{P_{\text{student}}(v)} +\text{KL}(P_{\text{student}} \| P_{\text{teacher}}) = \sum_v P_{\text{student}}(v) \log \frac{P_{\text{student}}(v)}{P_{\text{teacher}}(v)} $$ -**特性**:Mode-covering(覆模) -- 期望在教师分布下计算 -- 学生模型倾向于覆盖教师的整个分布(包括低概率区域) +**特性**:Mode-seeking +- 期望在学生分布下计算 +- 学生模型倾向于集中在教师模型的峰值区域(高概率区域) ### 广义 Jensen-Shannon 散度(Generalized JSD) @@ -78,8 +78,8 @@ $$ 其中 $M = \beta \cdot P_{\text{teacher}} + (1-\beta) \cdot P_{\text{student}}$ > 对极端情况($\beta = 0$ 或 $\beta = 1$),直接计算单个 KL 散度: -> - 当 $\beta = 0$ 时:直接定义 $D = \text{KL}(P_{\text{teacher}} \| P_{\text{student}})$(Reverse KL,Mode-covering) -> - 当 $\beta = 1$ 时:直接定义 $D = \text{KL}(P_{\text{student}} \| P_{\text{teacher}})$(Forward KL,Mode-seeking) +> - 当 $\beta = 0$ 时:直接定义 $D = \text{KL}(P_{\text{teacher}} \| P_{\text{student}})$(Forward KL,Mode-covering) +> - 当 $\beta = 1$ 时:直接定义 $D = \text{KL}(P_{\text{student}} \| P_{\text{teacher}})$(Reverse KL,Mode-seeking) > - 当 $0 < \beta < 1$ 时:使用上述混合分布公式进行插值 通过调节 $\beta$ 参数,可以在不同的散度度量之间进行插值,当 $\beta = 0.5$ 时,散度为标准的对称 JSD。 @@ -142,8 +142,8 @@ loss = D_JSD(P_teacher(·|x,y), P_student(·|x,y)) | 参数 | 类型 | 默认值 | 取值范围 | 说明 | |------|------|--------|---------|------| | `--teacher_model` | str | 必需 | - | 教师模型路径或模型 ID | -| `--beta` | float | 0.5 | [0.0, 1.0] | 散度插值系数
• 0.0: Reverse KL (覆模,更多样)
• 0.5: JSD (平衡,**推荐**)
• 1.0: Forward KL (寻模,更专注) | -| `--lmbda` | float | 0.5 | [0.0, 1.0] | On-Policy 学习触发概率
• 0.0: 纯 Off-Policy
• 0.5: 混合策略 (**推荐**)
• 1.0: 纯 On-Policy | +| `--beta` | float | 0.5 | [0.0, 1.0] | 散度插值系数
• 0.0: Forward KL
• 0.5: JSD (平衡)
• 1.0: Reverse KL (寻模,更专注) | +| `--lmbda` | float | 0.5 | [0.0, 1.0] | On-Policy 学习触发概率<br>
• 0.0: 纯 Off-Policy
• 0.5: 混合策略
• 1.0: 纯 On-Policy | | `--seq_kd` | bool | False | True/False | 是否使用教师生成序列
• False: 非 on-policy 时使用数据集
• True: 非 on-policy 时使用教师生成 | | `--temperature` | float | 0.9 | > 0 | 生成采样温度,控制随机性 | | `--max_completion_length` | int | 512 | > 0 | 生成时的最大 token 数 | @@ -200,3 +200,13 @@ swift rlhf \ ``` 训练脚本参考[这里](https://github.com/modelscope/ms-swift/tree/main/examples/train/multimodal/rlhf/gkd/fast.sh) + +## On-Policy Distillation + +我们可以通过设置以下参数实现 Thinking Machine Lab blog 中的[On-Policy Distillation](https://thinkingmachines.ai/blog/on-policy-distillation/)训练。 +```bash +--lmbda 1 # on-policy +--beta 1 # reverse +``` + +相关脚本可以参考[这里](https://github.com/modelscope/ms-swift/tree/main/examples/train/on_policy_distillation.sh) diff --git a/docs/source/Instruction/GRPO/AdvancedResearch/index.rst b/docs/source/Instruction/GRPO/AdvancedResearch/index.rst index 1c7b8f2342..4ef69cc56c 100644 --- a/docs/source/Instruction/GRPO/AdvancedResearch/index.rst +++ b/docs/source/Instruction/GRPO/AdvancedResearch/index.rst @@ -7,4 +7,5 @@ Advanced Research DAPO.md deepeyes.md GSPO.md + RLOO.md CHORD.md diff --git a/docs/source_en/Instruction/GKD.md b/docs/source_en/Instruction/GKD.md index dd65ecf390..06494e1f27 100644 --- a/docs/source_en/Instruction/GKD.md +++ b/docs/source_en/Instruction/GKD.md @@ -33,22 +33,22 @@ In knowledge distillation, there are two choices depending on the order of the t #### Forward KL $$ -\text{KL}(P_{\text{student}} \| P_{\text{teacher}}) = \sum_v P_{\text{student}}(v) \log \frac{P_{\text{student}}(v)}{P_{\text{teacher}}(v)} +\text{KL}(P_{\text{teacher}} \| P_{\text{student}}) = \sum_v P_{\text{teacher}}(v) \log \frac{P_{\text{teacher}}(v)}{P_{\text{student}}(v)} $$ -**Characteristics**: Mode-seeking -- Expectation is computed under the student distribution -- The student model tends to concentrate on the peak regions (high-probability areas) of the teacher model +**Characteristics**: Mode-covering +- Expectation is computed under the teacher distribution +- The student model tends to cover the entire teacher distribution (including low-probability regions) #### Reverse KL $$ -\text{KL}(P_{\text{teacher}} \| P_{\text{student}}) = \sum_v P_{\text{teacher}}(v) \log \frac{P_{\text{teacher}}(v)}{P_{\text{student}}(v)} +\text{KL}(P_{\text{student}} \| P_{\text{teacher}}) = \sum_v P_{\text{student}}(v) \log \frac{P_{\text{student}}(v)}{P_{\text{teacher}}(v)} $$ -**Characteristics**: Mode-covering -- Expectation is computed under the teacher distribution -- The student model tends to cover the entire teacher distribution (including low-probability regions) +**Characteristics**: Mode-seeking +- Expectation is computed under the student distribution +- The student model tends to concentrate on the peak regions (high-probability areas) of the teacher model ### Generalized Jensen-Shannon Divergence (Generalized JSD) @@ -78,8 +78,8 @@ $$ Where $M = \beta \cdot P_{\text{teacher}} + (1-\beta) \cdot P_{\text{student}}$ > For extreme cases ($\beta = 0$ or $\beta = 1$), directly compute a single KL divergence: -> - When $\beta = 0$: directly define $D = \text{KL}(P_{\text{teacher}} \| P_{\text{student}})$ (Reverse KL, Mode-covering) -> - When $\beta = 1$: directly define $D = \text{KL}(P_{\text{student}} \| P_{\text{teacher}})$ (Forward KL, Mode-seeking) +> - When $\beta = 0$: directly define $D = \text{KL}(P_{\text{teacher}} \| P_{\text{student}})$ (Forward KL, Mode-covering) +> - When $\beta = 1$: directly define $D = \text{KL}(P_{\text{student}} \| P_{\text{teacher}})$ (Reverse KL, Mode-seeking) > - When $0 < \beta < 1$: use the above mixture distribution formula for interpolation By adjusting the $\beta$ 
parameter, interpolation can be performed between different divergence metrics. When $\beta = 0.5$, the divergence is the standard symmetric JSD. @@ -142,7 +142,7 @@ We can perform GKD training by setting the following parameters: | Parameter | Type | Default | Range | Description | |------|------|--------|---------|------| | `--teacher_model` | str | Required | - | Teacher model path or model ID | -| `--beta` | float | 0.5 | [0.0, 1.0] | Divergence interpolation coefficient
• 0.0: Reverse KL (mode-covering, more diverse)
• 0.5: JSD (balanced, **recommended**)
• 1.0: Forward KL (mode-seeking, more focused) | +| `--beta` | float | 0.5 | [0.0, 1.0] | Divergence interpolation coefficient
• 0.0: Forward KL
• 0.5: JSD (balanced)
• 1.0: Reverse KL (mode-seeking, more focused) | | `--lmbda` | float | 0.5 | [0.0, 1.0] | On-Policy learning trigger probability<br>
• 0.0: Pure Off-Policy
• 0.5: Mixed strategy (**recommended**)
• 1.0: Pure On-Policy | | `--seq_kd` | bool | False | True/False | Whether to use teacher-generated sequences
• False: Use dataset when not on-policy
• True: Use teacher generation when not on-policy | | `--temperature` | float | 0.9 | > 0 | Generation sampling temperature, controls randomness | @@ -201,3 +201,14 @@ swift rlhf \ ``` Training script reference [here](https://github.com/modelscope/ms-swift/tree/main/examples/train/multimodal/rlhf/gkd/fast.sh) + + +## On-Policy Distillation +We can achieve the [On-Policy Distillation](https://thinkingmachines.ai/blog/on-policy-distillation/) training described in the Thinking Machines Lab blog by setting the following parameters: + +```bash +--lmbda 1 # on-policy +--beta 1 # reverse +``` + +For a complete implementation, refer to the example script [here](https://github.com/modelscope/ms-swift/tree/main/examples/train/on_policy_distillation.sh). diff --git a/docs/source_en/Instruction/GRPO/AdvancedResearch/index.rst b/docs/source_en/Instruction/GRPO/AdvancedResearch/index.rst index 1c7b8f2342..4ef69cc56c 100644 --- a/docs/source_en/Instruction/GRPO/AdvancedResearch/index.rst +++ b/docs/source_en/Instruction/GRPO/AdvancedResearch/index.rst @@ -7,4 +7,5 @@ Advanced Research DAPO.md deepeyes.md GSPO.md + RLOO.md CHORD.md diff --git a/examples/train/on_policy_distillation.sh b/examples/train/on_policy_distillation.sh new file mode 100644 index 0000000000..75f37187b8 --- /dev/null +++ b/examples/train/on_policy_distillation.sh @@ -0,0 +1,41 @@ +# On-Policy Distillation https://thinkingmachines.ai/blog/on-policy-distillation/ + +# CUDA_VISIBLE_DEVICES=7 \ +# swift rollout \ +# --model Qwen/Qwen3-8B-Base \ +# --vllm_max_model_len 24192 + +NPROC_PER_NODE=7 \ +PYTORCH_CUDA_ALLOC_CONF='expandable_segments:True' \ +CUDA_VISIBLE_DEVICES=0,1,2,3,4,5,6 \ +swift rlhf \ + --rlhf_type gkd \ + --model Qwen/Qwen3-8B-Base \ + --teacher_model Qwen/Qwen3-32B \ + --train_type full \ + --dataset open-thoughts/OpenThoughts3-1.2M#10000 \ + --seq_kd false \ + --lmbda 1 \ + --beta 1 \ + --torch_dtype bfloat16 \ + --num_train_epochs 1 \ + --per_device_train_batch_size 1 \ + --learning_rate 1e-5 \ + --gradient_accumulation_steps 1 \ + --save_steps 1000 \ + --save_total_limit 2 \ + --logging_steps 1 \ + --max_length 16000 \ + --max_completion_length 8192 \ + --output_dir output \ + --warmup_ratio 0.05 \ + --save_only_model true \ + --dataloader_num_workers 64 \ + --dataset_num_proc 4 \ + --deepspeed zero2 \ + --teacher_deepspeed zero3 \ + --attn_impl flash_attn \ + --use_vllm true \ + --vllm_mode server \ + --vllm_server_host 127.0.0.1 \ + --vllm_server_port 8000
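
To make the β mapping documented above concrete (β = 0 → Forward KL, β = 1 → Reverse KL, 0 < β < 1 → interpolation through the mixture M), the following PyTorch sketch computes a generalized JSD loss over the vocabulary dimension. It only illustrates the formulas in this document and is not the actual ms-swift/TRL implementation; the function name `generalized_jsd` and the `batchmean` reduction are illustrative choices.

```python
import math

import torch
import torch.nn.functional as F


def generalized_jsd(student_logits, teacher_logits, beta=0.5):
    """Generalized JSD over the vocabulary dimension.

    beta = 0     -> Forward KL: KL(P_teacher || P_student) (mode-covering)
    beta = 1     -> Reverse KL: KL(P_student || P_teacher) (mode-seeking)
    0 < beta < 1 -> beta * KL(P_teacher || M) + (1 - beta) * KL(P_student || M),
                    where M = beta * P_teacher + (1 - beta) * P_student.
    """
    student_logp = F.log_softmax(student_logits, dim=-1)
    teacher_logp = F.log_softmax(teacher_logits, dim=-1)

    # F.kl_div(input=log q, target=log p, log_target=True) computes KL(p || q).
    if beta == 0.0:
        return F.kl_div(student_logp, teacher_logp, reduction="batchmean", log_target=True)
    if beta == 1.0:
        return F.kl_div(teacher_logp, student_logp, reduction="batchmean", log_target=True)

    # log M = log(beta * P_teacher + (1 - beta) * P_student), computed stably in log space.
    mixture_logp = torch.logsumexp(
        torch.stack([teacher_logp + math.log(beta), student_logp + math.log(1.0 - beta)]),
        dim=0,
    )
    kl_teacher = F.kl_div(mixture_logp, teacher_logp, reduction="batchmean", log_target=True)
    kl_student = F.kl_div(mixture_logp, student_logp, reduction="batchmean", log_target=True)
    return beta * kl_teacher + (1.0 - beta) * kl_student


# Example: per-position logits of shape (batch, seq_len, vocab_size) from both models.
student_logits = torch.randn(2, 8, 32)
teacher_logits = torch.randn(2, 8, 32)
loss = generalized_jsd(student_logits, teacher_logits, beta=1.0)  # pure reverse KL, as with --beta 1
```

With `beta=1.0` this reduces to the reverse KL optimized by the on-policy distillation setup (`--beta 1`); with `beta=0.0` it reduces to the forward KL, and `beta=0.5` gives the standard symmetric JSD.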
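
The `--lmbda` and `--seq_kd` rows of the parameter table can likewise be read as a per-step sampling decision over where the completion comes from. The sketch below only illustrates that semantics as stated in the table; `choose_completion_source` is a hypothetical helper, not part of the ms-swift API.

```python
import random


def choose_completion_source(lmbda, seq_kd):
    """Illustrates the --lmbda / --seq_kd semantics from the table above.

    With probability lmbda the step is on-policy: the student generates the
    completion and the divergence is computed on its own tokens. Otherwise the
    completion comes from the dataset, or from the teacher when seq_kd is True.
    """
    if random.random() < lmbda:
        return "student_generated"  # on-policy step; --lmbda 1 makes every step take this branch
    return "teacher_generated" if seq_kd else "dataset"


# --lmbda 0.5 --seq_kd false: roughly half the steps are on-policy, the rest use the dataset.
print(choose_completion_source(lmbda=0.5, seq_kd=False))
```

Under `--lmbda 1 --beta 1`, every step is on-policy and the divergence is the reverse KL computed on the student's own completions, which is the configuration used in `examples/train/on_policy_distillation.sh` above.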