English Version
New Features
- Megatron-SWIFT
a. New model_type support: deepseek_v4, gemma4, gemma4_unified, bailing_hybrid, bailing_moe, qwen3_asr. DeepSeek-V4 fine-tuning best practices: https://swift.readthedocs.io/en/latest/BestPractices/deepseek-v4.html. Gemma4 training scripts: https://github.com/modelscope/ms-swift/blob/main/examples/models/gemma4
b. Added language_model_only parameter to support loading, training, and saving only the language model component of multimodal models.
c. Context Parallelism (CP) extended to more task types (long-sequence training): embedding, generative_reranker, seq_cls, and reward_model.
d. Qwen3-Next now defaults to Mcore-GDN, enabling sequence packing, FP8 training, and Context Parallelism.
e. Mcore-Bridge adds a minimal Forward example covering model creation, forward execution, and loss computation, making it straightforward to integrate into other projects: https://github.com/modelscope/mcore-bridge?tab=readme-ov-file#minimal-forward-example
f. test_convert_precision enhancements: added support for Attention Backend (fused / flash), Context Parallelism (CP), and padding_free mode.
g. Memory optimization for lm_head in generative_reranker training: only the logits at positive / negative token positions are extracted instead of materializing the full logits matrix, reducing peak memory usage, especially with large vocabularies.
h. When --merge_lora true is set, both the LoRA delta weights and the merged weights are now saved.
i. Added megatron_extra_kwargs parameter to allow arbitrary extra arguments to be passed directly to Megatron, improving configuration flexibility.
j. Support for manually specifying the Attention backend via --attention_backend flash_2 / flash_3 / flash_4.
k. Added Megatron FP4 training parameters: fp4_format, fp4_recipe, and fp4_param_gather.
l. Added a training example script for combined FP8 + LoRA usage: https://github.com/modelscope/ms-swift/blob/main/examples/megatron/fp8/lora.sh
m. embedding and generative_reranker tasks now support training without padding_free mode.
n. Added batch_p2p_comm parameter. If pipeline parallelism (PP) training hangs, try setting it to False.
o. mlp_padding_free is now compatible with Context Parallelism (CP).
- RL
a. Added GRPO and GKD training support based on Megatron + Ray, targeting large-scale distributed reinforcement learning scenarios. Refer to: https://swift.readthedocs.io/en/latest/Instruction/Ray.html
b. Megatron GRPO now supports multi-turn dialogue training.
c. Megatron GRPO adds support for the REAL algorithm. (Thanks to @li2zhi for the contribution.)
d. GRPO adds support for the FIPO algorithm. (Thanks to @li2zhi for the contribution.)
e. GKD teacher_server has been refactored to use swift deploy as the serving backend.
f. Added a complete multi-turn training example based on the FrozenLake environment. Refer to: https://swift.readthedocs.io/en/latest/Instruction/GRPO/DeveloperGuide/gym_env.html#frozenlake
g. rollout / deploy now supports data parallelism for dense models with vLLM 0.16+.
h. gym module refactored for a cleaner interface and improved extensibility. Refer to: https://swift.readthedocs.io/en/latest/Instruction/GRPO/DeveloperGuide/gym_env.html
- Training
a. Added preserve_thinking parameter to control whether historical chain-of-thought content is retained during inference and training.
b. Added chat_template_kwargs support: training and inference now support sample-level configuration of parameters such as max_pixels and enable_thinking; on the serving side, enable_thinking / preserve_thinking can be passed via the OpenAI-compatible API format.
c. Added support for the Zigzag Ring sequence parallelism strategy on Ascend NPU. (Thanks to @addsubmuldiv for the contribution.)
d. Audio data can now be read directly from video files. (Thanks to @Tohrusky for the contribution.)
e. Python 3.13 compatibility.
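
The generative_reranker memory optimization listed under Megatron-SWIFT (item g) can be illustrated with a small pure-Python sketch. This is not the ms-swift implementation: the function names, the toy weights, the token ids, and the choice to score at the last sequence position are all assumptions made for illustration. It only shows why projecting onto the two needed vocabulary rows gives the same values as slicing a full logits matrix.

```python
# Toy sketch: score a reranker sample without building the full logits matrix.
# All names and values are illustrative; this is not ms-swift's actual code.

def dot(u, v):
    """Inner product of two equal-length vectors."""
    return sum(a * b for a, b in zip(u, v))

def full_logits(hidden_states, lm_head_weight):
    """Baseline: one logit per (position, vocab entry), seq_len x vocab_size."""
    return [[dot(h, w) for w in lm_head_weight] for h in hidden_states]

def selected_logits(hidden_states, lm_head_weight, position, token_ids):
    """Optimized: project one position onto only the requested vocab rows."""
    h = hidden_states[position]
    return [dot(h, lm_head_weight[t]) for t in token_ids]

# Tiny example: 3 positions, hidden size 2, vocabulary of 5 tokens.
hidden_states = [[1.0, 0.0], [0.0, 1.0], [2.0, -1.0]]
lm_head_weight = [[0.5, 0.5], [1.0, -1.0], [0.0, 2.0], [-1.0, 0.0], [3.0, 1.0]]
POSITIVE_ID, NEGATIVE_ID = 1, 3  # hypothetical ids of the positive / negative tokens

last = len(hidden_states) - 1  # assumption: the score is read at the last position
pos_logit, neg_logit = selected_logits(
    hidden_states, lm_head_weight, last, [POSITIVE_ID, NEGATIVE_ID]
)

# Same values as slicing the full matrix, with 2 dot products instead of 15.
reference = full_logits(hidden_states, lm_head_weight)
assert pos_logit == reference[last][POSITIVE_ID]
assert neg_logit == reference[last][NEGATIVE_ID]
```

In a real model the vocabulary has on the order of 10^5 entries, so skipping the seq_len x vocab_size matrix is where the memory saving would come from.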
New Models
- Language Models
a. deepseek-ai/DeepSeek-V4-Flash series
b. OpenBMB/MiniCPM5-1B series
c. inclusionAI/Ling-2.6-1T, inclusionAI/Ring-2.6-1T series
- Multimodal Models
a. OpenBMB/MiniCPM-V-4.6 (Thanks to @tsjyma for the contribution.)
b. PaddlePaddle/PaddleOCR-VL-1.6
c. google/gemma-4-12B-it series
d. moonshotai/Kimi-K2.5, moonshotai/Kimi-K2.6 with multimodal support
What's Changed
- [bugfix] fix datasets cache hash by @Jintao-Huang in #9284
- [readme] fix readme by @Jintao-Huang in #9285
- [template] support chat_template_kwargs by @Jintao-Huang in #9272
- compat python3.13 by @Jintao-Huang in #9288
- [bugfix]: fix qwen2.5 omni image list mode fps by @Tohrusky in #9290
- update lint compat py313 by @Jintao-Huang in #9289
- [model] Support deepseek v4 template by @Jintao-Huang in #9208
- [docs] update support docs by @Jintao-Huang in #9291
- [docs] fix docs install swift by @Jintao-Huang in #9292
- [model] Add support for MiniCPM-V-4.6 model by @tsjyma in #9286
- [docs] update swift image 4.2.0 by @Jintao-Huang in #9300
- Fix missing root_image_dir attribute in deepseekocr template by @hjh0119 in #9305
- [bugfix] fix deepspeed lr_scheduler by @Jintao-Huang in #9304
- fix vllm dp & reset_encoder_cache & fix vllm init with zero3 by @hjh0119 in #9295
- Megatron REAL Support by @li2zhi in #9270
- [docs] update docs & fix gemma4 megatron by @Jintao-Huang in #9308
- megatron grpo data_shuffle by @hjh0119 in #9310
- [template] Support preserve thinking by @Jintao-Huang in #9309
- [Bugfix] Fix NPU FSDP2 parameter device placement for full_tensor access by @ys2025-AI in #9314
- NPU ring attention adapt by @addsubmuldiv in #9298
- refactor_mm_token_type by @Jintao-Huang in #9316
- [model] support minicpmv4_6 by @Jintao-Huang in #9326
- [bugfix] fix qwen3 vl template by @Jintao-Huang in #9327
- [fix] support video as audio input by @Tohrusky in #9311
- [bugfix] fix lora peft 019 by @Jintao-Huang in #9333
- [docs] update minicpmv4.6 docs by @Jintao-Huang in #9334
- [megatron] also save LoRA weights when merge_lora is set to true by @Jintao-Huang in #9332
- fix grpo metric_for_best_model by @hjh0119 in #9336
- [docs] update docs by @Jintao-Huang in #9338
- fix ppo best model metrics by @hjh0119 in #9339
- [docs] update qwen3-next GDN by @Jintao-Huang in #9340
- [bugfix] fix mm_token_type_ids cp by @Jintao-Huang in #9341
- [megatron] update megatron test_convert_precision by @Jintao-Huang in #9342
- [bugfix] fix qwen3_5 zero3 by @Jintao-Huang in #9346
- [model] support ling2.6 ring2.6 by @Jintao-Huang in #9347
- [docs] update megatron docs by @Jintao-Huang in #9350
- [megatron] compat megatron 0.17 by @Jintao-Huang in #9354
- update dataloader_persistent_workers by @Jintao-Huang in #9248
- [model] Update minicpm-v-4.6 by @Jintao-Huang in #9362
- [bugfix] fix encoding by @Jintao-Huang in #9363
- [fix] MiniCPM-V 4.6: Fix sliced image token splitting, supplement collator and vLLM paths by @tsjyma in #9366
- [bugfix] fix agent response decode by @Jintao-Huang in #9369
- Support FIPO by @li2zhi in #9328
- [megatron] support gemma4 megatron by @Jintao-Huang in #9296
- [megatron] support megatron fp4 by @Jintao-Huang in #9330
- [Bugfix] Fix loss missing from logs when context parallelism is enabled by @Zhikaiiii in #9380
- [Bug fix] Avoid OOM when casting fp32 on NPU for GRPO with vLLM colocate by @ys2025-AI in #9335
- modify NPU Qwen3.5 Megatron practice by @hazelduan in #9382
- [1/N] Initial support for Megatron-Ray by @hjh0119 in #9381
- update ray_utils by @Jintao-Huang in #9383
- [bugfix] fix truncation_strategy & mm_token by @Jintao-Huang in #9385
- [bugfix] fix mm_token_id by @Jintao-Huang in #9389
- [megatron] test_convert_precision support cp/padding_free by @Jintao-Huang in #9388
- [model] gemma4 compat transformers 5.9 by @Jintao-Huang in #9393
- [bugfix] fix yaml torchrun by @Jintao-Huang in #9396
- remove NotImplementedError by @Jintao-Huang in #9397
- [megatron] test_convert_precision support attention fused by @Jintao-Huang in #9395
- [docs] update_docs by @Jintao-Huang in #9404
- [bugfix] fix mllm grpo liger loss by @hjh0119 in #9406
- [model] support minicpm-5 by @Jintao-Huang in #9418
- [bugfix] fix grpo eval_use_evalscope true by @hjh0119 in #9417
- [bugfix] fix vllm StatelessProcessGroup create by @hjh0119 in #9420
- [bugfix] fix vllm bugs by @hjh0119 in #9419
- [megatron] Support deepseek-v4 megatron by @Jintao-Huang in #9386
- [megatron] add fp8 lora example by @Jintao-Huang in #9423
- [bugfix] Fix LLaVA OneVision 1.5 vit_gradient_checkpointing issue by @randydl in #9428
- [docs] remove obsolete packing_cache reference from FAQ by @xyzhang626 in #9426
- [docs] support deepseek_v4 readme by @Jintao-Huang in #9430
- [bugfix] Qwen3.5 SP compat transformers 5.9.0 by @Jintao-Huang in #9434
- [megatron] support bailing_v25 megatron by @Jintao-Huang in #9442
- [examples] add rm infer example by @Jintao-Huang in #9444
- [agent_template] Update the default value of agent_template by @Jintao-Huang in #9445
- [bugfix] fix kernels TransformersEngine & FP8 by @Jintao-Huang in #9447
- Fix: restore requires_grad after _save_converted_model to work around peft inject_adapter side effect by @kiritozc in #9452
- [megatron] support megatron_extra_kwargs by @Jintao-Huang in #9459
- [docs] update swift image 4.2.3 by @Jintao-Huang in #9458
- [CI] Fix NPU unittest discovery and runtime setup by @addsubmuldiv in #9456
- support megatron/megatron ray multi turn grpo by @hjh0119 in #9405
- Fix NPU Megatron checkpoint resume by @addsubmuldiv in #9460
- [model] support deepseek_v4 reasoning_effort by @Jintao-Huang in #9461
- [model] support paddle_ocr v1.6 by @Jintao-Huang in #9464
- refactor teacher server by @hjh0119 in #9457
- rollout ep by @hjh0119 in #9392
- fix grpo example by @hjh0119 in #9352
- [bugfix] fix gkd hf generate to use template generate by @hjh0119 in #9473
- [bugfix] fix vit_gradient_checkpointing_kwargs passing in megatron tr… by @randydl in #9470
- [bugfix] fix omni bugs by @Jintao-Huang in #9477
- [model] use autoconfig by @Jintao-Huang in #9478
- Stabilize NPU CI device assignment and enable math-verify tests by @addsubmuldiv in #9474
- [bugfix] fix gkd eval by @hjh0119 in #9476
- [bugfix] fix megatron merge_lora args_path by @Jintao-Huang in #9480
- fix CI by @Jintao-Huang in #9482
- Npu grpo Fix by @addsubmuldiv in #9468
- [bugfix] fix peft>=0.19 target_parameters by @Jintao-Huang in #9483
- [bugfix] fix device_map ddp (compat transformers>=5.0) by @Jintao-Huang in #9484
- fix argument bug "is_binary_loss_scale" by @ZhiyuanLi218 in #9489
- [bugfix] Fix Megatron model device placement when use_cpu_initialization is enabled by @ShiroNyaa in #9446
- [bugfix] fix args_path megatron by @Jintao-Huang in #9491
- [model] support gemma-4-12B-it by @Jintao-Huang in #9487
- [examples] add gemma4 shell by @Jintao-Huang in #9492
- update NPU doc support environment versions by @addsubmuldiv in #9488
- [megatron] add batch_p2p_comm arguments by @Jintao-Huang in #9493
- [megatron] support language_model_only by @Jintao-Huang in #9496
- [rollout] Don't hard-import vllm in multi_turn (breaks use_vllm=False) by @LukeLIN-web in #9498
- [model] kimi_k25 support mm by @Jintao-Huang in #9497
- [bugfix] update hf_config del attr by @Jintao-Huang in #9502
- [megatron] Support flash_2/flash_3/flash_4 by @Jintao-Huang in #9503
- [bugfix] fix callbacks by @Jintao-Huang in #9505
- [bugfix] fix perf_log by @Jintao-Huang in #9506
- [2/N] support megatron-ray gkd by @hjh0119 in #9471
- [misc, megatron grpo] empty cache before wake up by @hjh0119 in #9515
- [megatron] fix megatron reranker pp by @Jintao-Huang in #9514
- Fix transformers 5 deepspeed accelerator state by @addsubmuldiv in #9511
- [bugfix] fix ray build_pkg by @Jintao-Huang in #9520
- [megatron] Support megatron CP/non-padding-free more tasks by @Jintao-Huang in #9516
- [template] update template decode_generate_ids by @Jintao-Huang in #9523
- Update the NPU dependency versions and scripts. by @hazelduan in #9500
- [bugfix] Fix megatron lr_mult by @Jintao-Huang in #9524
- [bugfix] fix process_weights_after_loading & non_thinking_prefix by @hjh0119 in #9519
- [bugfix] fix grpo target_parameters & chord device by @hjh0119 in #9525
New Contributors
- @tsjyma made their first contribution in #9286
- @randydl made their first contribution in #9428
- @xyzhang626 made their first contribution in #9426
- @kiritozc made their first contribution in #9452
- @ZhiyuanLi218 made their first contribution in #9489
- @ShiroNyaa made their first contribution in #9446
- @LukeLIN-web made their first contribution in #9498
Full Changelog: v4.2.0...v4.3.0