Merged
2 changes: 1 addition & 1 deletion docs/source/BestPractices/Qwen3最佳实践.md
@@ -328,7 +328,7 @@ swift rlhf \

Best practice reference for single-node 8xH20 LoRA training with Qwen3-235B-A22B-Instruct-250718: [https://github.com/modelscope/ms-swift/pull/5033](https://github.com/modelscope/ms-swift/pull/5033).

- ms-swift introduces Megatron parallelism techniques to accelerate CPT/SFT/DPO for large models. Supported models can be found in the [supported models documentation](../Instruction/支持的模型和数据集.md).
+ ms-swift introduces Megatron parallelism techniques to accelerate CPT/SFT/DPO/KTO for large models. Supported models can be found in the [supported models documentation](../Instruction/支持的模型和数据集.md).

For environment setup and conversion between HF and MCore model weights, refer to the [Megatron-SWIFT training documentation](../Megatron-SWIFT/快速开始.md).

2 changes: 1 addition & 1 deletion docs/source/GetStarted/快速开始.md
@@ -10,7 +10,7 @@ ms-swift is the training and deployment framework for large language models and multimodal large models provided by the ModelScope community
- Quantization Training: Provides training for quantized models like BNB, AWQ, GPTQ, AQLM, HQQ, EETQ.
- 🍊 RLHF Training: Supports human alignment training methods like DPO, GRPO, RM, PPO, GKD, KTO, CPO, SimPO, ORPO for both text-based and multimodal large models.
- 🍓 Multimodal Training: Supports training models for different modalities such as images, videos, and audio; supports tasks like VQA, captioning, OCR, and grounding.
- - 🥥 Megatron Parallelism: Supports accelerating CPT/SFT/DPO using Megatron parallelism techniques, currently compatible with 200+ large language models.
+ - 🥥 Megatron Parallelism: Supports accelerating CPT/SFT/DPO/KTO using Megatron parallelism techniques, currently compatible with 200+ large language models.
- Interface-driven Training: Offers training, inference, evaluation, and quantization capabilities through a UI, covering the complete workflow for large models.
- Plugins and Extensions: Supports custom model and dataset extensions, and supports customization of components like loss, metric, trainer, loss-scale, callback, optimizer.
- 🍉 Toolbox Capabilities: Besides training for large models and multimodal large models, also supports the full pipeline of inference, evaluation, quantization, and deployment.
16 changes: 9 additions & 7 deletions docs/source/Megatron-SWIFT/命令行参数.md
@@ -186,7 +186,7 @@
- Note: In ms-swift versions earlier than 3.7.1, the default is None and the value is automatically read from config.json.
- moe_z_loss_coeff: Scaling coefficient for z-loss. Default is None.
- 🔥moe_shared_expert_overlap: Enables overlap between shared expert computation and dispatcher communication. If not enabled, shared experts execute after the routed experts. Only effective when `moe_shared_expert_intermediate_size` is set. Default is False.
- - moe_expert_capacity_factor: Capacity factor for each expert; None means no tokens are dropped. Default is None. When `--moe_expert_capacity_factor` is set, tokens exceeding an expert's capacity are dropped based on their selection probability. This can balance the training load and improve training speed.
+ - 🔥moe_expert_capacity_factor: Capacity factor for each expert; None means no tokens are dropped. Default is None. When `--moe_expert_capacity_factor` is set, tokens exceeding an expert's capacity are dropped based on their selection probability. This can balance the training load and improve training speed (for example, set it to 1).
- moe_pad_expert_input_to_capacity: Pads the input of each expert so that its length aligns with the expert capacity length. Default is False. Only takes effect when `--moe_expert_capacity_factor` is set.
- moe_token_drop_policy: Options are 'probs' and 'position'. Default is 'probs'.

@@ -233,13 +233,15 @@ LoRA training:
- reference_free: Whether to ignore the provided reference model and implicitly use a reference model that assigns equal probability to all responses. Default is False.
- label_smoothing: Default is 0.
- f_divergence_type: Default is `reverse_kl`. See the [TRL documentation](https://huggingface.co/docs/trl/main/en/dpo_trainer) for possible values.
- - loss_type: Default is 'sigmoid'. See the [TRL documentation](https://huggingface.co/docs/trl/main/en/dpo_trainer) for possible values.
+ - loss_type: Default is 'sigmoid'. See the [TRL documentation](https://huggingface.co/docs/trl/main/en/dpo_trainer#loss-functions) for possible values.

**KTO parameters**:
- - beta: Coefficient for the KL regularization term. Default is `0.1`.
- - desirable_weight: Loss weight $\lambda_D$ for desirable responses in the KTO algorithm. Default is `1.`.
- - undesirable_weight: Loss weight $\lambda_U$ for undesirable responses in the KTO algorithm. Default is `1.`.
- - calculate_KL: Whether to calculate the KL divergence. Default is True.
+ - ref_load: Same meaning as in DPO.
+ - ref_adapter_load: Same meaning as in DPO.
+ - beta: Parameter controlling the deviation from the ref_model; higher beta means less deviation. Default is `0.1`.
+ - loss_type: Default is 'kto'. See the [TRL documentation](https://huggingface.co/docs/trl/main/en/kto_trainer#trl.KTOConfig.loss_type) for possible values.
+ - desirable_weight: Factor weighting the desirable losses to counter an imbalance between desirable and undesirable pairs. Default is `1.`.
+ - undesirable_weight: Factor weighting the undesirable losses to counter an imbalance between desirable and undesirable pairs. Default is `1.`.

## Training Parameters

@@ -252,7 +254,7 @@ Megatron training parameters inherit from the Megatron parameters and basic parameters (dataset parameters etc. are shared with ms-swift)
- mlp_padding_free: Default is False. Applies padding-free optimization to the MLP when padding_free is set to false. This improves training speed and reduces memory usage while allowing a custom attention_mask.
- vit_gradient_checkpointing: Whether to enable gradient checkpointing for the ViT part during multimodal model training. Default is True.
- gradient_checkpointing_kwargs: Arguments passed to `torch.utils.checkpoint`. For example, set `--gradient_checkpointing_kwargs '{"use_reentrant": false}'`. Default is None.
- - 🔥packing: Whether to use sequence packing. Default is False. Currently supports CPT/SFT/DPO.
+ - 🔥packing: Whether to use sequence packing. Default is False. Currently supports CPT/SFT/DPO/KTO.
- packing_length: The length used for packing. Default is None, in which case it is set to max_length.
- streaming: Stream-read and process the dataset. Default is False.
- Note: Since the length of a streaming dataset cannot be determined, the `--train_iters` parameter must be set. Also set the `max_epochs` parameter to ensure training exits after the specified number of epochs, and to validate and save the model weights accordingly.
2 changes: 1 addition & 1 deletion docs/source/Megatron-SWIFT/多模态模型.md
@@ -1,6 +1,6 @@
# Multimodal Models

- ms-swift introduces Megatron's parallelization techniques to accelerate the training of large multimodal models. Currently, it supports CPT/SFT/DPO for models such as Qwen3-VL, Qwen3-Omni, Qwen2.5-VL, Qwen2.5-Omni, InternVL3.5, GLM4.5v, Kimi-VL. For the full list of supported models, refer to the [supported models and datasets documentation](../Instruction/支持的模型和数据集.md).
+ ms-swift introduces Megatron's parallelization techniques to accelerate the training of large multimodal models. Currently, it supports CPT/SFT/DPO/KTO for models such as Qwen3-VL, Qwen3-Omni, Qwen2.5-VL, Qwen2.5-Omni, InternVL3.5, GLM4.5v, Kimi-VL. For the full list of supported models, refer to the [supported models and datasets documentation](../Instruction/支持的模型和数据集.md).

For environment setup, refer to the Megatron-SWIFT [quick-start documentation](./快速开始.md).

1 change: 1 addition & 0 deletions docs/source/Megatron-SWIFT/快速开始.md
@@ -9,6 +9,7 @@ ms-swift introduces Megatron's parallelization techniques to accelerate the training of large models, including data
| Pretraining | ✅ | ✅ | ✅ | ✅ |
| Instruction-supervised fine-tuning | ✅ | ✅ | ✅ | ✅ |
| DPO | ✅ | ✅ | ✅ | ✅ |
+ | KTO | ✅ | ✅ | ✅ | ✅ |
| Classification tasks | ✅ | ✅ | ✅ | ✅ |


2 changes: 1 addition & 1 deletion docs/source_en/BestPractices/Qwen3-Best-Practice.md
@@ -332,7 +332,7 @@ swift rlhf \

Best practice reference for single-node 8xH20 LoRA training with Qwen3-235B-A22B-Instruct-250718: https://github.com/modelscope/ms-swift/pull/5033.

- ms-swift introduces Megatron parallelism techniques to accelerate CPT/SFT/DPO for large models. Supported models can be found in the [Supported Models and Datasets Document](../Instruction/Supported-models-and-datasets.md).
+ ms-swift introduces Megatron parallelism techniques to accelerate CPT/SFT/DPO/KTO for large models. Supported models can be found in the [Supported Models and Datasets Document](../Instruction/Supported-models-and-datasets.md).

For environment setup and conversion between HF and MCore model weights, refer to the [Megatron-SWIFT Training Documentation](../Megatron-SWIFT/Quick-start.md).

2 changes: 1 addition & 1 deletion docs/source_en/GetStarted/Quick-start.md
@@ -10,7 +10,7 @@ ms-swift is a comprehensive training and deployment framework for large language
- Quantization Training: Provides training for quantized models like BNB, AWQ, GPTQ, AQLM, HQQ, EETQ.
- 🍊 RLHF Training: Supports human alignment training methods like DPO, GRPO, RM, PPO, GKD, KTO, CPO, SimPO, ORPO for both text-based and multimodal large models.
- 🍓 Multimodal Training: Capable of training models for different modalities such as images, videos, and audios; supports tasks like VQA (Visual Question Answering), Captioning, OCR (Optical Character Recognition), and Grounding.
- - 🥥 Megatron Parallelism: Supports accelerating CPT/SFT/DPO using Megatron parallelism techniques, currently compatible with 200+ large language models.
+ - 🥥 Megatron Parallelism: Supports accelerating CPT/SFT/DPO/KTO using Megatron parallelism techniques, currently compatible with 200+ large language models.
- Interface-driven Training: Offers training, inference, evaluation, and quantization capabilities through an interface, enabling a complete workflow for large models.
- Plugins and Extensions: Allows customization and extension of models and datasets, and supports customizations for components like loss, metric, trainer, loss-scale, callback, optimizer, etc.
- 🍉 Toolbox Capabilities: Offers not only training support for large models and multi-modal large models but also covers the entire process of inference, evaluation, quantization, and deployment.
16 changes: 9 additions & 7 deletions docs/source_en/Megatron-SWIFT/Command-line-parameters.md
@@ -198,7 +198,7 @@ seq_length: Defaults to None, meaning it is set to `max_length`. To restrict the
- Note: In ms-swift versions earlier than 3.7.1, the default is None and the value is automatically loaded from config.json.
- moe_z_loss_coeff: Scaling coefficient for z-loss. Default is None.
- 🔥moe_shared_expert_overlap: Enables overlap between shared expert computation and the dispatcher. If not enabled, shared expert computation will be performed after routing experts. Only effective when `moe_shared_expert_intermediate_size` is set. Default is False.
- - moe_expert_capacity_factor: Capacity factor for each expert. `None` means no tokens will be dropped. Default is `None`. When `--moe_expert_capacity_factor` is set, tokens exceeding an expert’s capacity will be dropped based on their selection probability. This can balance the training load and improve training speed.
+ - 🔥moe_expert_capacity_factor: Capacity factor for each expert. `None` means no tokens will be dropped. Default is `None`. When `--moe_expert_capacity_factor` is set, tokens exceeding an expert’s capacity will be dropped based on their selection probability. This can balance the training load and improve training speed (for example, set it to 1.).
- moe_pad_expert_input_to_capacity: Pad the input of each expert so that its length aligns with the expert capacity length. Default is `False`. This option only takes effect if `--moe_expert_capacity_factor` is set.
- moe_token_drop_policy: Options are 'probs' and 'position'. Default is 'probs'.
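
To make the drop mechanics above concrete, here is a minimal sketch of a 'probs' drop policy under a capacity factor; this is an illustration with assumed shapes and a hypothetical function name, not Megatron-Core's fused implementation:

```python
import torch

def drop_by_capacity(probs: torch.Tensor, expert_ids: torch.Tensor,
                     num_experts: int, capacity_factor: float) -> torch.Tensor:
    """Return a keep-mask over token->expert assignments (top-k already flattened)."""
    num_assignments = expert_ids.numel()
    capacity = int(capacity_factor * num_assignments / num_experts)
    keep = torch.zeros(num_assignments, dtype=torch.bool)
    for e in range(num_experts):
        idx = (expert_ids == e).nonzero(as_tuple=True)[0]
        if idx.numel() > capacity:
            # moe_token_drop_policy='probs': keep the highest-probability tokens
            idx = idx[probs[idx].argsort(descending=True)[:capacity]]
        keep[idx] = True  # assignments beyond capacity stay False (dropped)
    return keep
```

With `capacity_factor=1`, each expert receives at most its even share of assignments, which balances the per-expert workload at the cost of dropping overflow tokens.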

@@ -248,13 +248,15 @@ LoRA Training:
- reference_free: Whether to ignore the provided reference model and implicitly use a reference model that assigns equal probability to all responses. Default is `False`.
- label_smoothing: Default is 0.
- f_divergence_type: Default is `reverse_kl`. See the [TRL documentation](https://huggingface.co/docs/trl/main/en/dpo_trainer) for possible values.
- - loss_type: Default is `'sigmoid'`. See the [TRL documentation](https://huggingface.co/docs/trl/main/en/dpo_trainer) for possible values.
+ - loss_type: Default is `'sigmoid'`. See the [TRL documentation](https://huggingface.co/docs/trl/main/en/dpo_trainer#loss-functions) for possible values.

**KTO Parameters**:
- - beta: Coefficient for the KL regularization term. Default is `0.1`.
- - desirable_weight: Loss weight $\lambda_D$ for desirable response in the KTO algorithm, default is `1.`.
- - undesirable_weight: Loss weight $\lambda_U$ for undesirable response in the KTO algorithm, default is `1.`.
- - calculate_KL: Whether to calculate KL divergence. Default is `True`.
+ - ref_load: same meaning as in DPO.
+ - ref_adapter_load: same meaning as in DPO.
+ - beta: parameter controlling the deviation from the ref_model. Higher `beta` means less deviation from the ref_model. Default is `0.1`.
+ - loss_type: default is `'kto'`. See possible values in the TRL docs: https://huggingface.co/docs/trl/main/en/kto_trainer#trl.KTOConfig.loss_type.
+ - desirable_weight: factor to weight desirable losses to counter imbalance between desirable and undesirable pairs. Default is `1.`.
+ - undesirable_weight: factor to weight undesirable losses to counter imbalance between desirable and undesirable pairs. Default is `1.`.
Comment on lines 253 to +259
Contributor

medium

The calculate_KL parameter seems to have been removed from the KTO parameters documentation. However, it's still a configurable parameter in the code (swift/megatron/argument/megatron_args.py). It's now optional and can be inferred, but users can still set it. It would be beneficial to document this parameter for clarity.

For example, you could add:

- calculate_KL: Whether to calculate KL divergence. Defaults to `None`, and will be inferred based on `loss_type`. For example, when `loss_type` is `'apo_zero_unpaired'`, `calculate_KL` will be set to `False`, otherwise `True`.
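
For orientation, these parameters enter the KTO objective roughly as follows (notation from the KTO paper; a sketch rather than Megatron-SWIFT's exact implementation):

$$
r_\theta(x,y) = \log\frac{\pi_\theta(y\mid x)}{\pi_{\mathrm{ref}}(y\mid x)}, \qquad
z_0 \approx \mathrm{KL}\big(\pi_\theta \,\|\, \pi_{\mathrm{ref}}\big)
$$

$$
v(x,y) =
\begin{cases}
\lambda_D\,\sigma\big(\beta\,(r_\theta(x,y) - z_0)\big) & \text{if } y \text{ is desirable} \\
\lambda_U\,\sigma\big(\beta\,(z_0 - r_\theta(x,y))\big) & \text{if } y \text{ is undesirable}
\end{cases}
$$

$$
\mathcal{L}_{\mathrm{KTO}} = \mathbb{E}_{(x,y)}\big[\lambda_y - v(x,y)\big]
$$

Here `beta` is $\beta$, `desirable_weight` and `undesirable_weight` are $\lambda_D$ and $\lambda_U$, and $z_0$ is the reference-KL baseline whose computation `calculate_KL` toggles; loss types such as `'apo_zero_unpaired'` do not need it.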


## Training Parameters

@@ -267,7 +269,7 @@ Megatron training parameters are inherited from Megatron parameters and basic pa
- mlp_padding_free: The default is False. This is used for applying padding-free optimization to the MLP when padding_free is set to false. It allows for improved training speed and reduced memory usage while customizing the attention_mask.
- vit_gradient_checkpointing: Whether to enable gradient checkpointing for the ViT part during multimodal model training. Default: True.
- gradient_checkpointing_kwargs: Arguments passed to `torch.utils.checkpoint`. For example: set `--gradient_checkpointing_kwargs '{"use_reentrant": false}'`. Default: None.
- - 🔥packing: Whether to use sequence packing, defaults to False. Currently supports CPT/SFT/DPO.
+ - 🔥packing: Whether to use sequence packing, defaults to False. Currently supports CPT/SFT/DPO/KTO.
- packing_length: the length to use for packing. Defaults to None, in which case it is set to max_length.
- streaming: Stream data loading and processing, default is False.
- Note: Since the length of a streaming dataset cannot be determined, the `--train_iters` parameter must be set. Also set the `max_epochs` parameter to ensure training exits after the specified number of epochs, and to validate and save the model weights accordingly.
2 changes: 1 addition & 1 deletion docs/source_en/Megatron-SWIFT/Multimodal-Model.md
@@ -1,6 +1,6 @@
# Multimodal Models

- ms-swift introduces Megatron's parallelization techniques to accelerate the training of large multimodal models. Currently, it supports CPT/SFT/DPO for models such as Qwen3-VL, Qwen3-Omni, Qwen2.5-VL, Qwen2.5-Omni, InternVL3.5, GLM4.5v, Kimi-VL. For a complete list of supported models, please refer to the [Supported Models and Datasets documentation](../Instruction/Supported-models-and-datasets.md).
+ ms-swift introduces Megatron's parallelization techniques to accelerate the training of large multimodal models. Currently, it supports CPT/SFT/DPO/KTO for models such as Qwen3-VL, Qwen3-Omni, Qwen2.5-VL, Qwen2.5-Omni, InternVL3.5, GLM4.5v, Kimi-VL. For a complete list of supported models, please refer to the [Supported Models and Datasets documentation](../Instruction/Supported-models-and-datasets.md).

For environment setup, please refer to the Megatron-SWIFT [Quick Start guide](./Quick-start.md).

1 change: 1 addition & 0 deletions docs/source_en/Megatron-SWIFT/Quick-start.md
@@ -8,6 +8,7 @@ ms-swift incorporates Megatron's parallelization techniques to accelerate the tr
| Pretraining | ✅ | ✅ | ✅ | ✅ |
| Instruction-supervised fine-tuning | ✅ | ✅ | ✅ | ✅ |
| DPO | ✅ | ✅ | ✅ | ✅ |
+ | KTO | ✅ | ✅ | ✅ | ✅ |
| Classification tasks | ✅ | ✅ | ✅ | ✅ |

## Environment Setup
36 changes: 36 additions & 0 deletions examples/megatron/rlhf/kto/dense.sh
@@ -0,0 +1,36 @@
# 4 * 43GiB
PYTORCH_CUDA_ALLOC_CONF='expandable_segments:True' \
NPROC_PER_NODE=4 \
CUDA_VISIBLE_DEVICES=0,1,2,3 \
megatron rlhf \
--rlhf_type kto \
--load Qwen2.5-7B-Instruct-mcore \
--dataset 'AI-ModelScope/ultrafeedback-binarized-preferences-cleaned-kto#20000' \
--load_from_cache_file true \
--split_dataset_ratio 0.01 \
--tensor_model_parallel_size 4 \
--packing true \
--micro_batch_size 1 \
--global_batch_size 4 \
--recompute_granularity full \
--recompute_method uniform \
--recompute_num_layers 1 \
--max_epochs 1 \
--finetune true \
--cross_entropy_loss_fusion true \
--lr 1e-5 \
--lr_warmup_fraction 0.05 \
--min_lr 1e-6 \
--save megatron_output/Qwen2.5-7B-Instruct \
--eval_interval 200 \
--save_interval 200 \
--max_length 8192 \
--num_workers 8 \
--dataset_num_proc 8 \
--no_save_optim true \
--no_save_rng true \
--sequence_parallel true \
--attention_backend flash \
--beta 0.1 \
--desirable_weight 1 \
--undesirable_weight 1
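
The dataset above is already in unpaired KTO form. For a custom dataset, each sample pairs a conversation with a boolean label marking the response as desirable or not; the row below is a hypothetical sample for illustration, with field names following ms-swift's custom-dataset conventions:

```python
kto_sample = {
    "messages": [
        {"role": "user", "content": "Tell me something about Hangzhou."},
        {"role": "assistant", "content": "Hangzhou is a city in Zhejiang Province."},
    ],
    "label": True,  # True marks a desirable response, False an undesirable one
}
```
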
44 changes: 44 additions & 0 deletions examples/megatron/rlhf/kto/moe.sh
@@ -0,0 +1,44 @@
# 2 * 48GiB
PYTORCH_CUDA_ALLOC_CONF='expandable_segments:True' \
NPROC_PER_NODE=2 \
CUDA_VISIBLE_DEVICES=0,1 \
megatron rlhf \
--rlhf_type kto \
--load Qwen3-30B-A3B-Instruct-2507-mcore \
--dataset 'AI-ModelScope/ultrafeedback-binarized-preferences-cleaned-kto#20000' \
--load_from_cache_file true \
--packing true \
--train_type lora \
--lora_rank 8 \
--lora_alpha 32 \
--target_modules all-linear \
--split_dataset_ratio 0.01 \
--expert_model_parallel_size 2 \
--moe_permute_fusion true \
--moe_grouped_gemm true \
--moe_shared_expert_overlap true \
--moe_aux_loss_coeff 1e-3 \
--micro_batch_size 1 \
--global_batch_size 4 \
--recompute_granularity full \
--recompute_method uniform \
--recompute_num_layers 1 \
--max_epochs 1 \
--finetune true \
--cross_entropy_loss_fusion true \
--lr 1e-4 \
--lr_warmup_fraction 0.05 \
--min_lr 1e-5 \
--save megatron_output/Qwen3-30B-A3B-Instruct-2507 \
--eval_interval 100 \
--save_interval 100 \
--max_length 8192 \
--num_workers 8 \
--dataset_num_proc 8 \
--no_save_optim true \
--no_save_rng true \
--sequence_parallel true \
--attention_backend flash \
--beta 0.1 \
--desirable_weight 1 \
--undesirable_weight 1
2 changes: 1 addition & 1 deletion swift/llm/train/kto.py
@@ -72,7 +72,7 @@ def prepare_kto_dataset(args, train_dataset, val_dataset):
f"""
You have different amounts of desirable/positive and undesirable/negative examples but the
weights on the desirable and undesirable losses don't seem to be in an ideal range. Based
- on your data, we recommend EITHER desirable_weight in [{des_weight_lower_bound}, '{des_weight_upper_bound}]
+ on your data, we recommend EITHER desirable_weight in [{des_weight_lower_bound}, {des_weight_upper_bound}]
or undesirable_weight in [{und_weight_lower_bound}, {und_weight_upper_bound}] (but NOT BOTH).
See the documentation on how to optimally set these weights.""", UserWarning)
return train_dataset, val_dataset
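
The bounds interpolated into this warning follow the KTO paper's recommendation that $\lambda_D n_D / (\lambda_U n_U)$ stay roughly within $[1, 4/3]$. A sketch of how such bounds can be derived (helper name and rounding are assumptions, not the exact code in this file):

```python
def kto_weight_bounds(num_desirable: int, num_undesirable: int,
                      desirable_weight: float = 1., undesirable_weight: float = 1.):
    # Solve 1 <= (w_D * n_D) / (w_U * n_U) <= 4/3 for each weight in turn.
    des_lower = round(undesirable_weight * num_undesirable / num_desirable, 2)
    des_upper = round(undesirable_weight * num_undesirable * 4 / 3 / num_desirable, 2)
    und_lower = round(desirable_weight * num_desirable / num_undesirable * 3 / 4, 2)
    und_upper = round(desirable_weight * num_desirable / num_undesirable, 2)
    return (des_lower, des_upper), (und_lower, und_upper)

# e.g. 15000 desirable vs 5000 undesirable examples with unit weights:
# desirable_weight in roughly [0.33, 0.44], or undesirable_weight in [2.25, 3.0].
```
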
25 changes: 24 additions & 1 deletion swift/megatron/argument/megatron_args.py
@@ -17,6 +17,7 @@

@dataclass
class RLHFMegatronArgumentsMixin:
+    rlhf_type: Literal['dpo', 'kto'] = None
    ref_load: Optional[str] = None
    ref_adapter_load: Optional[str] = None

@@ -25,7 +26,28 @@ class RLHFMegatronArgumentsMixin:
    reference_free: bool = False
    label_smoothing: float = 0.
    f_divergence_type: str = 'reverse_kl'
-    loss_type: str = 'sigmoid'
+    loss_type: Optional[str] = None
+
+    # kto
+    desirable_weight: float = 1.
+    undesirable_weight: float = 1.
+    calculate_KL: Optional[bool] = None
+
+    def _init_kto(self):
+        if self.calculate_KL is None:
+            # Not all losses require a KL calculation
+            self.calculate_KL = True
+            if self.loss_type in ['apo_zero_unpaired']:
+                self.calculate_KL = False
+
+    def __post_init__(self):
+        if self.rlhf_type is None:
+            return
+        default_loss_type = {'kto': 'kto', 'dpo': 'sigmoid'}
+        if self.loss_type is None:
+            self.loss_type = default_loss_type[self.rlhf_type]
+        if self.rlhf_type == 'kto':
+            self._init_kto()


@dataclass
@@ -403,6 +425,7 @@ def __post_init__(self):
            require_version('peft>=0.15')
        else:
            require_version('peft>=0.12')
+        RLHFMegatronArgumentsMixin.__post_init__(self)
        MegatronTunerMixin.__post_init__(self)
        os.environ['CUDA_DEVICE_MAX_CONNECTIONS'] = '1'
        self._set_default()
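
A quick illustration of the defaulting logic added above (assuming the mixin dataclass is instantiated standalone; the CLI normally constructs it for you):

```python
from swift.megatron.argument.megatron_args import RLHFMegatronArgumentsMixin

args = RLHFMegatronArgumentsMixin(rlhf_type='kto')
assert args.loss_type == 'kto' and args.calculate_KL is True

args = RLHFMegatronArgumentsMixin(rlhf_type='kto', loss_type='apo_zero_unpaired')
assert args.calculate_KL is False  # this unpaired loss needs no KL baseline

args = RLHFMegatronArgumentsMixin(rlhf_type='dpo')
assert args.loss_type == 'sigmoid'
```
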
4 changes: 0 additions & 4 deletions swift/megatron/argument/rlhf_args.py
@@ -11,7 +11,3 @@ class MegatronRLHFArguments(MegatronTrainArguments):
    loss_scale: str = 'last_round'

    calculate_per_token_loss: bool = False
-
-    desirable_weight: float = 1.
-    undesirable_weight: float = 1.
-    calculate_KL: bool = True