v4.3.0

@Jintao-Huang Jintao-Huang released this 10 Jun 06:56

English Version

New Features

  1. Megatron-SWIFT
    a. Added support for new model_type values: deepseek_v4, gemma4, gemma4_unified, bailing_hybrid, bailing_moe, qwen3_asr. DeepSeek-V4 fine-tuning best practices: https://swift.readthedocs.io/en/latest/BestPractices/deepseek-v4.html Gemma4 training scripts: https://github.com/modelscope/ms-swift/blob/main/examples/models/gemma4
    b. Added language_model_only parameter to support loading, training, and saving only the language model component of multimodal models.
    c. Context Parallelism (CP) for long-sequence training now covers more task types: embedding, generative_reranker, seq_cls, and reward_model.
    d. Qwen3-Next now defaults to Mcore-GDN, enabling sequence packing, FP8 training, and Context Parallelism.
    e. Mcore-Bridge adds a minimal Forward example covering model creation, forward execution, and loss computation, making it straightforward to integrate into other projects: https://github.com/modelscope/mcore-bridge?tab=readme-ov-file#minimal-forward-example
    f. test_convert_precision enhancements: added support for Attention Backend (fused / flash), Context Parallelism (CP), and padding_free mode.
    g. Memory optimization for lm_head in generative_reranker training: only the logits at the positive / negative token positions are extracted instead of materializing the full logits matrix, which reduces peak memory usage, especially with large vocabularies.
    h. When --merge_lora true is set during training, both the LoRA delta weights and the merged weights are saved.
    i. Added megatron_extra_kwargs parameter to allow arbitrary extra arguments to be passed directly to Megatron, improving configuration flexibility.
    j. The Attention backend can now be specified manually via --attention_backend flash_2 / flash_3 / flash_4.
    k. Added Megatron FP4 training parameters: fp4_format, fp4_recipe, and fp4_param_gather.
    l. Added a training example script for combined FP8 + LoRA usage: https://github.com/modelscope/ms-swift/blob/main/examples/megatron/fp8/lora.sh
    m. embedding and generative_reranker tasks now support training without padding_free mode.
    n. Added batch_p2p_comm parameter. If pipeline parallelism (PP) training hangs, try setting it to False.
    o. mlp_padding_free is now compatible with Context Parallelism (CP).
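
The lm_head memory optimization in item g can be illustrated with a small, library-free sketch. It is not the ms-swift implementation: the toy weights, the token ids, and the helper names (`full_logits`, `selected_logits`) are all hypothetical, and it only shows the idea of scoring the two tokens of interest instead of the whole vocabulary.

```python
from typing import List, Sequence

Vector = Sequence[float]


def dot(a: Vector, b: Vector) -> float:
    """Plain dot product of two equally sized vectors."""
    return sum(x * y for x, y in zip(a, b))


def full_logits(hidden: Vector, lm_head: Sequence[Vector]) -> List[float]:
    """Baseline: one logit per vocabulary entry (vocab_size dot products)."""
    return [dot(hidden, row) for row in lm_head]


def selected_logits(
    hidden: Vector, lm_head: Sequence[Vector], token_ids: Sequence[int]
) -> List[float]:
    """Compute logits only for the requested token ids (len(token_ids) dot products)."""
    return [dot(hidden, lm_head[t]) for t in token_ids]


# Toy lm_head with vocab_size=5 and hidden_size=3; values are illustrative only.
LM_HEAD = [
    [0.1, 0.0, 0.2],
    [0.5, -0.3, 0.1],
    [0.0, 0.4, 0.4],
    [-0.2, 0.1, 0.0],
    [0.3, 0.3, -0.1],
]
HIDDEN = [1.0, 2.0, 3.0]

# Hypothetical ids of the positive / negative tokens in the vocabulary.
POSITIVE_ID, NEGATIVE_ID = 1, 3

pos_logit, neg_logit = selected_logits(HIDDEN, LM_HEAD, [POSITIVE_ID, NEGATIVE_ID])
```

The selected logits match the corresponding entries of the full logits vector, while the amount of output stored per position drops from vocab_size values to two.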
  2. RL
    a. Added GRPO and GKD training support based on Megatron + Ray, targeting large-scale distributed reinforcement learning scenarios. Refer to: https://swift.readthedocs.io/en/latest/Instruction/Ray.html
    b. Megatron GRPO now supports multi-turn dialogue training.
    c. Megatron GRPO adds support for the REAL algorithm. (Thanks to @li2zhi for the contribution.)
    d. GRPO adds support for the FIPO algorithm. (Thanks to @li2zhi for the contribution.)
    e. GKD teacher_server has been refactored to use swift deploy as the serving backend.
    f. Added a complete multi-turn training example based on the FrozenLake environment. Refer to: https://swift.readthedocs.io/en/latest/Instruction/GRPO/DeveloperGuide/gym_env.html#frozenlake
    g. rollout / deploy now supports data parallelism for dense models with vLLM 0.16+.
    h. gym module refactored for a cleaner interface and improved extensibility. Refer to: https://swift.readthedocs.io/en/latest/Instruction/GRPO/DeveloperGuide/gym_env.html
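
For readers new to GRPO, the core idea of group-relative advantages can be sketched as follows. This is a generic illustration of the commonly described formulation (rewards normalized within the group of completions sampled for one prompt), not the Megatron-SWIFT code; the function name, the `eps` value, and the choice of population standard deviation are assumptions, and real implementations differ in such details.

```python
from statistics import mean, pstdev
from typing import List, Sequence


def group_relative_advantages(rewards: Sequence[float], eps: float = 1e-6) -> List[float]:
    """Normalize each reward against the mean and std of its own group.

    `rewards` holds the scalar rewards of all completions sampled for a single
    prompt. `eps` avoids division by zero when every reward is identical.
    """
    mu = mean(rewards)
    sigma = pstdev(rewards)  # population std; an assumption for this sketch
    return [(r - mu) / (sigma + eps) for r in rewards]


# Four completions for one prompt: two rewarded, two not.
rewards = [1.0, 0.0, 0.0, 1.0]
advantages = group_relative_advantages(rewards)
```

Completions that score above the group mean receive a positive advantage and those below receive a negative one, so no separate value model is needed to provide a baseline.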
  3. Training
    a. Added preserve_thinking parameter to control whether historical chain-of-thought content is retained during inference and training.
    b. Added chat_template_kwargs support: training and inference now support sample-level configuration of parameters such as max_pixels and enable_thinking; on the serving side, enable_thinking / preserve_thinking can be passed via the OpenAI-compatible API format.
    c. Added support for the Zigzag Ring sequence parallelism strategy on Ascend NPU. (Thanks to @addsubmuldiv for the contribution.)
    d. Audio data can now be read directly from video files. (Thanks to @Tohrusky for the contribution.)
    e. Added compatibility with Python 3.13.
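
The zigzag layout behind the Zigzag Ring strategy in item c can be sketched in a few lines. This follows the usual description of zigzag sharding (the sequence is cut into 2 × world_size chunks and each rank takes one chunk from the front and its mirror from the back) and is not taken from the ms-swift source; the function name and arguments are hypothetical.

```python
from typing import List, Sequence, TypeVar

T = TypeVar("T")


def zigzag_shard(tokens: Sequence[T], world_size: int, rank: int) -> List[T]:
    """Return the tokens held by `rank` under a zigzag split.

    The sequence is divided into 2 * world_size equal chunks. Rank r keeps
    chunk r and chunk (2 * world_size - 1 - r), pairing an early chunk with
    a late one.
    """
    num_chunks = 2 * world_size
    if len(tokens) % num_chunks != 0:
        raise ValueError("sequence length must be divisible by 2 * world_size")
    chunk = len(tokens) // num_chunks
    mirror = num_chunks - 1 - rank
    front = tokens[rank * chunk:(rank + 1) * chunk]
    back = tokens[mirror * chunk:(mirror + 1) * chunk]
    return list(front) + list(back)


# 16 token positions split across 4 ranks.
sequence = list(range(16))
shards = [zigzag_shard(sequence, world_size=4, rank=r) for r in range(4)]
```

Under a causal mask, later tokens attend to more keys than earlier ones, so pairing an early chunk with a late chunk gives every rank a similar amount of attention work.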

New Models

  1. Language Models
    a. deepseek-ai/DeepSeek-V4-Flash series
    b. OpenBMB/MiniCPM5-1B series
    c. inclusionAI/Ling-2.6-1T, inclusionAI/Ring-2.6-1T series
  2. Multimodal Models
    a. OpenBMB/MiniCPM-V-4.6 (Thanks to @tsjyma for the contribution.)
    b. PaddlePaddle/PaddleOCR-VL-1.6
    c. google/gemma-4-12B-it series
    d. moonshotai/Kimi-K2.5, moonshotai/Kimi-K2.6 with multimodal support

Full Changelog: v4.2.0...v4.3.0