Release v4.3.0 · modelscope/ms-swift

中文版

新特性

Megatron-SWIFT
a. 新增 model_type 支持：deepseek_v4, gemma4, gemma4_unified, bailing_hybrid, bailing_moe, qwen3_asr。DeepSeek-V4 微调实践：https://swift.readthedocs.io/zh-cn/latest/BestPractices/deepseek-v4.html 。Gemma4 训练脚本：https://github.com/modelscope/ms-swift/blob/main/examples/models/gemma4
b. 新增 language_model_only 参数，支持只加载、训练和保存多模态模型的语言模型部分。
c. 更多任务类型支持上下文并行（长文本训练）：embedding、generative_reranker、seq_cls 和 reward_model 任务。
d. Qwen3-Next 支持 Mcore-GDN 方式运行（默认），支持序列 packing、FP8 及 CP能力。
e. Mcore-Bridge 新增极简 Forward 示例：创建模型、执行 forward并计算损失，方便接入其他项目：https://github.com/modelscope/mcore-bridge?tab=readme-ov-file#minimal-forward-example
f. test_convert_precision 功能增强：新增对 Attention Backend（fused / flash）、上下文并行（CP）及 padding_free 的支持。
g. generative_reranker 任务训练 lm_head 部分显存占用优化，只提取 positive / negative token 位置的 logits 而不是完整的 logits。
h. 训练设置 --merge_lora true 时，会将 lora 增量权重和 merged 权重都进行存储。
i. 新增 megatron_extra_kwargs 参数，支持将额外参数直接透传至 Megatron，提升配置灵活性。
j. 支持通过 --attention_backend flash_2 / flash_3 / flash_4 手动指定 Attention 后端。
k. 新增 Megatron FP4 训练参数：fp4_format、fp4_recipe 和 fp4_param_gather。
l. 新增 FP8 与 LoRA 结合使用的训练示例脚本：https://github.com/modelscope/ms-swift/blob/main/examples/megatron/fp8/lora.sh
m. embedding 与 generative_reranker 任务支持非 padding_free 方式训练。
n. batch_p2p_comm 参数支持。如遇流水线并行（PP）训练卡住的情况，可将其设置为 False。
o. mlp_padding_free 参数兼容上下文并行。
RL
a. 新增基于 Megatron + Ray 的 GRPO 和 GKD 训练支持，适用于超大规模分布式强化学习场景，参考文档 https://swift.readthedocs.io/zh-cn/latest/Instruction/Ray.html
b. Megatron GRPO 支持多轮对话场景下的训练。
c. Megatron GRPO 新增 REAL 算法支持。（感谢@li2zhi 贡献）
d. GRPO 新增 FIPO 算法支持。（感谢 @li2zhi 贡献）
e. GKD 的 teacher_server 重构为基于 swift deploy 的部署方式。
f. 新增基于 FrozenLake 环境的多轮训练完整示例，参考文档 https://swift.readthedocs.io/zh-cn/latest/Instruction/GRPO/DeveloperGuide/gym_env.html#frozenlake
g. rollout / deploy 支持 vllm 0.16+ 的 dense model 数据并行。
h. gym 模块重构，参考文档 https://swift.readthedocs.io/zh-cn/latest/Instruction/GRPO/DeveloperGuide/gym_env.html
训练
a. 新增 preserve_thinking 参数：可在推理与训练时控制是否保留历史思考内容。
b. 新增 chat_template_kwargs：训练/推理支持数据样本粒度配置 max_pixels、enable_thinking 等参数，部署侧支持通过 OpenAI 格式传入 enable_thinking / preserve_thinking。
c. 支持在昇腾 NPU 上使用 Zigzag Ring 序列并行策略。（感谢 @addsubmuldiv 的贡献）
d. 音频数据支持从视频文件读取。（感谢 @Tohrusky 的贡献）
e. 兼容python 3.13。

新模型

纯文本模型
a. deepseek-ai/DeepSeek-V4-Flash 系列
b. OpenBMB/MiniCPM5-1B 系列
c. inclusionAI/Ling-2.6-1T, inclusionAI/Ring-2.6-1T 系列
多模态模型
a. OpenBMB/MiniCPM-V-4.6（感谢 @tsjyma 的贡献）
b. PaddlePaddle/PaddleOCR-VL-1.6
c. google/gemma-4-12B-it 系列
d. moonshotai/Kimi-K2.5, moonshotai/Kimi-K2.6 支持多模态

English Version

New Features

Megatron-SWIFT
a. New model_type support: deepseek_v4, gemma4, gemma4_unified, bailing_hybrid, bailing_moe, qwen3_asr. DeepSeek-V4 fine-tuning best practices: https://swift.readthedocs.io/en/latest/BestPractices/deepseek-v4.html. Gemma4 training scripts: https://github.com/modelscope/ms-swift/blob/main/examples/models/gemma4
b. Added language_model_only parameter to support loading, training, and saving only the language model component of multimodal models.
c. Context Parallelism (CP) extended to more task types (long-sequence training): embedding, generative_reranker, seq_cls, and reward_model.
d. Qwen3-Next now defaults to Mcore-GDN, enabling sequence packing, FP8 training, and Context Parallelism.
e. Mcore-Bridge adds a minimal Forward example covering model creation, forward execution, and loss computation, making it straightforward to integrate into other projects: https://github.com/modelscope/mcore-bridge?tab=readme-ov-file#minimal-forward-example
f. test_convert_precision enhancements: added support for Attention Backend (fused / flash), Context Parallelism (CP), and padding_free mode.
g. Memory optimization for lm_head in generative_reranker training: only the logits at positive / negative token positions are extracted instead of materializing the full logits matrix, significantly reducing peak memory usage in large-vocabulary scenarios.
h. When --merge_lora true is set, both the LoRA delta weights and the merged weights are now saved simultaneously.
i. Added megatron_extra_kwargs parameter to allow arbitrary extra arguments to be passed directly to Megatron, improving configuration flexibility.
j. Support for manually specifying the Attention backend via --attention_backend flash_2 / flash_3 / flash_4.
k. Added Megatron FP4 training parameters: fp4_format, fp4_recipe, and fp4_param_gather.
l. Added a training example script for combined FP8 + LoRA usage: https://github.com/modelscope/ms-swift/blob/main/examples/megatron/fp8/lora.sh
m. embedding and generative_reranker tasks now support training without padding_free mode.
n. Added batch_p2p_comm parameter. If pipeline parallelism (PP) training hangs, setting this to False can resolve the communication blocking issue.
o. mlp_padding_free is now compatible with Context Parallelism (CP).
RL
a. Added GRPO and GKD training support based on Megatron + Ray, targeting large-scale distributed reinforcement learning scenarios. Refer to: https://swift.readthedocs.io/en/latest/Instruction/Ray.html
b. Megatron GRPO now supports multi-turn dialogue training.
c. Megatron GRPO adds support for the REAL algorithm. (Thanks to @li2zhi for the contribution.)
d. GRPO adds support for the FIPO algorithm. (Thanks to @li2zhi for the contribution.)
e. GKD teacher_server has been refactored to use swift deploy as the serving backend.
f. Added a complete multi-turn training example based on the FrozenLake environment. Refer to: https://swift.readthedocs.io/en/latest/Instruction/GRPO/DeveloperGuide/gym_env.html#frozenlake
g. rollout / deploy now supports data parallelism for dense models with vLLM 0.16+.
h. gym module refactored for a cleaner interface and improved extensibility. Refer to: https://swift.readthedocs.io/en/latest/Instruction/GRPO/DeveloperGuide/gym_env.html
Training
a. Added preserve_thinking parameter to control whether historical chain-of-thought content is retained during inference and training.
b. Added chat_template_kwargs support: training and inference now support sample-level configuration of parameters such as max_pixels and enable_thinking; on the serving side, enable_thinking / preserve_thinking can be passed via the OpenAI-compatible API format.
c. Added support for the Zigzag Ring sequence parallelism strategy on Ascend NPU. (Thanks to @addsubmuldiv for the contribution.)
d. Audio data can now be read directly from video files. (Thanks to @Tohrusky for the contribution.)
e. Python 3.13 compatibility.

New Models

Language Models
a. deepseek-ai/DeepSeek-V4-Flash series
b. OpenBMB/MiniCPM5-1B series
c. inclusionAI/Ling-2.6-1T, inclusionAI/Ring-2.6-1T series
Multimodal Models
a. OpenBMB/MiniCPM-V-4.6 (Thanks to @tsjyma for the contribution.)
b. PaddlePaddle/PaddleOCR-VL-1.6
c. google/gemma-4-12B-it series
d. moonshotai/Kimi-K2.5, moonshotai/Kimi-K2.6 with multimodal support

What's Changed

[bugfix] fix datasets cache hash by @Jintao-Huang in #9284
[readme] fix readme by @Jintao-Huang in #9285
[template] support chat_template_kwargs by @Jintao-Huang in #9272
compat python3.13 by @Jintao-Huang in #9288
[bugfix]: fix qwen2.5 omni image list mode fps by @Tohrusky in #9290
update lint compat py313 by @Jintao-Huang in #9289
[model] Support deepseek v4 template by @Jintao-Huang in #9208
[docs] update support docs by @Jintao-Huang in #9291
[docs] fix docs install swift by @Jintao-Huang in #9292
[model] Add support for MiniCPM-V-4.6 model by @tsjyma in #9286
[docs] update swift image 4.2.0 by @Jintao-Huang in #9300
Fix missing root_image_dir attribute in deepseekocr template by @hjh0119 in #9305
[bugfix] fix deepspeed lr_scheduler by @Jintao-Huang in #9304
fix vllm dp & reset_encoder_cache & fix vllm init with zero3 by @hjh0119 in #9295
Megatron REAL Support by @li2zhi in #9270
[docs] update docs & fix gemma4 megatron by @Jintao-Huang in #9308
megatron grpo data_shuffle by @hjh0119 in #9310
[template] Support preserve thinking by @Jintao-Huang in #9309
[Bugfix] Fix NPU FSDP2 parameter device placement for full_tensor access by @ys2025-AI in #9314
NPU ring attention adapt by @addsubmuldiv in #9298
refactor_mm_token_type by @Jintao-Huang in #9316
[model] support minicpmv4_6 by @Jintao-Huang in #9326
[bugfix] fix qwen3 vl template by @Jintao-Huang in #9327
[fix] support video as audio input by @Tohrusky in #9311
[bugfix] fix lora peft 019 by @Jintao-Huang in #9333
[docs] update minicpmv4.6 docs by @Jintao-Huang in #9334
[megatron] also save LoRA weights when merge_lora is set to true by @Jintao-Huang in #9332
fix grpo metric_for_best_model by @hjh0119 in #9336
[docs] update docs by @Jintao-Huang in #9338
fix ppo best model metrics by @hjh0119 in #9339
[docs] update qwen3-next GDN by @Jintao-Huang in #9340
[bugfix] fix mm_token_type_ids cp by @Jintao-Huang in #9341
[megatron] update megatron test_convert_precision by @Jintao-Huang in #9342
[bugfix] fix qwen3_5 zero3 by @Jintao-Huang in #9346
[model] support ling2.6 ring2.6 by @Jintao-Huang in #9347
[docs] update megatron docs by @Jintao-Huang in #9350
[megatron] compat megatron 0.17 by @Jintao-Huang in #9354
update dataloader_persistent_workers by @Jintao-Huang in #9248
[model] Update minicpm-v-4.6 by @Jintao-Huang in #9362
[bugfix] fix encoding by @Jintao-Huang in #9363
[fix] MiniCPM-V 4.6: Fix sliced image token splitting, supplement collator and vLLM paths by @tsjyma in #9366
[bugfix] fix agent response decode by @Jintao-Huang in #9369
Support FIPO by @li2zhi in #9328
[megatron] support gemma4 megatron by @Jintao-Huang in #9296
[megatron] support megatron fp4 by @Jintao-Huang in #9330
[Bugfix] Fix loss missing from logs when context parallelism is enabled by @Zhikaiiii in #9380
[Bug fix] Avoid OOM when casting fp32 on NPU for GRPO with vLLM colocate by @ys2025-AI in #9335
modify NPU Qwen3.5 Megatron practice by @hazelduan in #9382
[1/N] Initial support for Megatron-Ray by @hjh0119 in #9381
update ray_utils by @Jintao-Huang in #9383
[bugfix] fix truncation_strategy & mm_token by @Jintao-Huang in #9385
[bugfix] fix mm_token_id by @Jintao-Huang in #9389
[megatron] test_convert_precision support cp/padding_free by @Jintao-Huang in #9388
[model] gemma4 compat transformers 5.9 by @Jintao-Huang in #9393
[bugfix] fix yaml torchrun by @Jintao-Huang in #9396
remove NotImplementedError by @Jintao-Huang in #9397
[megatron] test_convert_precision support attention fused by @Jintao-Huang in #9395
[docs] update_docs by @Jintao-Huang in #9404
[bugfix] fix mllm grpo liger loss by @hjh0119 in #9406
[model] support minicpm-5 by @Jintao-Huang in #9418
[bugfix] fix grpo eval_use_evalscope true by @hjh0119 in #9417
[bugfix] fix vllm StatelessProcessGroup create by @hjh0119 in #9420
[bugfix] fix vllm bugs by @hjh0119 in #9419
[megatron] Support deepseek-v4 megatron by @Jintao-Huang in #9386
[megatron] add fp8 lora example by @Jintao-Huang in #9423
[bugfix] Fix LLaVA OneVision 1.5 vit_gradient_checkpointing issue by @randydl in #9428
[docs] remove obsolete packing_cache reference from FAQ by @xyzhang626 in #9426
[docs] support deepseek_v4 readme by @Jintao-Huang in #9430
[bugfix] Qwen3.5 SP compat transformers 5.9.0 by @Jintao-Huang in #9434
[megatron] support bailing_v25 megatron by @Jintao-Huang in #9442
[examples] add rm infer example by @Jintao-Huang in #9444
[agent_template] Update the default value of agent_template by @Jintao-Huang in #9445
[bugfix] fix kernels TransformersEngine & FP8 by @Jintao-Huang in #9447
Fix: restore requires_grad after _save_converted_model to work around peft inject_adapter side effect by @kiritozc in #9452
[megatron] support megatron_extra_kwargs by @Jintao-Huang in #9459
[docs] update swift image 4.2.3 by @Jintao-Huang in #9458
[CI] Fix NPU unittest discovery and runtime setup by @addsubmuldiv in #9456
support megatron/megatron ray multi turn grpo by @hjh0119 in #9405
Fix NPU Megatron checkpoint resume by @addsubmuldiv in #9460
[model] support deepseek_v4 reasoning_effort by @Jintao-Huang in #9461
[model] support paddle_ocr v1.6 by @Jintao-Huang in #9464
refactor teacher server by @hjh0119 in #9457
rollout ep by @hjh0119 in #9392
fix grpo example by @hjh0119 in #9352
[bugfix] fix gkd hf generate to use template generate by @hjh0119 in #9473
[bugfix] fix vit_gradient_checkpointing_kwargs passing in megatron tr… by @randydl in #9470
[bugfix] fix omni bugs by @Jintao-Huang in #9477
[model] use autoconfig by @Jintao-Huang in #9478
Stabilize NPU CI device assignment and enable math-verify tests by @addsubmuldiv in #9474
[bugfix] fix gkd eval by @hjh0119 in #9476
[bugfix] fix megatron merge_lora args_path by @Jintao-Huang in #9480
fix CI by @Jintao-Huang in #9482
Npu grpo Fix by @addsubmuldiv in #9468
[bugfix] fix peft>=0.19 target_parameters by @Jintao-Huang in #9483
[bugfix] fix device_map ddp (compat transformers>=5.0) by @Jintao-Huang in #9484
fix argument bug "is_binary_loss_scale" by @ZhiyuanLi218 in #9489
[bugfix]Fix Megatron model device placement when use_cpu_initialization is enabled by @ShiroNyaa in #9446
[bugfix] fix args_path megatron by @Jintao-Huang in #9491
[model] support gemma-4-12B-it by @Jintao-Huang in #9487
[examples] add gemma4 shell by @Jintao-Huang in #9492
update NPU doc support environment versions by @addsubmuldiv in #9488
[megatron] add batch_p2p_comm arguments by @Jintao-Huang in #9493
[megatron] support language_model_only by @Jintao-Huang in #9496
[rollout] Don't hard-import vllm in multi_turn (breaks use_vllm=False) by @LukeLIN-web in #9498
[model] kimi_k25 support mm by @Jintao-Huang in #9497
[bugfix] update hf_config del attr by @Jintao-Huang in #9502
[megatron] Support flash_2/flash_3/flash_4 by @Jintao-Huang in #9503
[bugfix] fix callbacks by @Jintao-Huang in #9505
[bugfix] fix perf_log by @Jintao-Huang in #9506
[2/N] support megatron-ray gkd by @hjh0119 in #9471
[misc, megatron grpo] empty cache before wake up by @hjh0119 in #9515
[megatron] fix megatron reranker pp by @Jintao-Huang in #9514
Fix transformers 5 deepspeed accelerator state by @addsubmuldiv in #9511
[bugfix] fix ray build_pkg by @Jintao-Huang in #9520
[megatron] Support megatron CP/non-padding-free more tasks by @Jintao-Huang in #9516
[template] update template decode_generate_ids by @Jintao-Huang in #9523
Update the NPU dependency versions and scripts. by @hazelduan in #9500
[bugfix] Fix megatron lr_mult by @Jintao-Huang in #9524
[bugfix] fix process_weights_after_loading & non_thinking_prefix by @hjh0119 in #9519
[bugfix] fix grpo target_parameters & chord device by @hjh0119 in #9525

New Contributors

@tsjyma made their first contribution in #9286
@randydl made their first contribution in #9428
@xyzhang626 made their first contribution in #9426
@kiritozc made their first contribution in #9452
@ZhiyuanLi218 made their first contribution in #9489
@ShiroNyaa made their first contribution in #9446
@LukeLIN-web made their first contribution in #9498

Full Changelog: v4.2.0...v4.3.0

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

v4.3.0

Choose a tag to compare

Sorry, something went wrong.

Sorry, something went wrong.

Uh oh!

No results found

中文版

新特性

新模型

English Version

New Features

New Models

What's Changed

New Contributors

Contributors

Uh oh!