diff --git a/README.md b/README.md index 710d6a0b95..80215b90be 100644 --- a/README.md +++ b/README.md @@ -75,6 +75,7 @@ You can contact us and communicate with us by adding our group: ## 🎉 News +- 🎁 2025.09.02: Megatron-SWIFT now supports multimodal model training. Documentation can be found [here](./docs/source_en/Megatron-SWIFT/Multimodal-Model.md). - 🎁 2025.08.12: Support [Dynamic Fine-Tuning](https://arxiv.org/abs/2508.05629)(DFT) in SFT training, use parameter `--enable_dft_loss true`. Training scripts can be found [here](https://github.com/modelscope/ms-swift/blob/main/examples/train/full/dft.sh). - 🎁 2025.07.12: Deployment(pt/vLLM/SGLang) of Embedding models is supported, check [here](examples/deploy/embedding/client.py). - 🎁 2025.07.09: Megatron-SWIFT supports LoRA training. Compared to ms-swift, it achieves significant speedup on MoE models. Training scripts can be found [here](https://github.com/modelscope/ms-swift/blob/main/examples/megatron/lora). diff --git a/README_CN.md b/README_CN.md index d34ac9bd85..ef6f062a76 100644 --- a/README_CN.md +++ b/README_CN.md @@ -71,6 +71,7 @@ - **模型量化**:支持AWQ、GPTQ、FP8和BNB的量化导出,导出的模型支持使用vLLM/SGLang/LmDeploy推理加速,并支持继续训练。 ## 🎉 新闻 +- 🎁 2025.09.02: Megatron-SWIFT支持多模态模型训练。文档参考[这里](./docs/source/Megatron-SWIFT/多模态模型.md)。 - 🎁 2025.08.12: 支持在SFT训练中使用[Dynamic Fine-Tuning](https://arxiv.org/abs/2508.05629)(DFT),使用参数 `--enable_dft_loss true`。训练脚本参考[这里](https://github.com/modelscope/ms-swift/blob/main/examples/train/full/dft.sh) - 🎁 2025.07.12: 支持部署Embedding模型的部署(pt/vLLM/SGLang), 查看[这里](examples/deploy/embedding/client.py). - 🎁 2025.07.09: Megatron-SWIFT支持LoRA训练。相比ms-swift,在MoE模型提速显著。训练脚本参考[这里](https://github.com/modelscope/ms-swift/blob/main/examples/megatron/lora)。 diff --git "a/docs/source/Instruction/\345\221\275\344\273\244\350\241\214\345\217\202\346\225\260.md" "b/docs/source/Instruction/\345\221\275\344\273\244\350\241\214\345\217\202\346\225\260.md" index 7886f6548b..4ba75ad711 100644 --- "a/docs/source/Instruction/\345\221\275\344\273\244\350\241\214\345\217\202\346\225\260.md" +++ "b/docs/source/Instruction/\345\221\275\344\273\244\350\241\214\345\217\202\346\225\260.md" @@ -159,7 +159,7 @@ - 🔥aligner_lr: 当训练多模态大模型时,该参数指定aligner的学习率,默认为None,等于learning_rate。 - lr_scheduler_type: lr_scheduler类型,默认为'cosine'。 - lr_scheduler_kwargs: lr_scheduler其他参数。默认为None。 -- 🔥gradient_checkpointing_kwargs: 传入`torch.utils.checkpoint`中的参数。例如设置为`--gradient_checkpointing_kwargs '{"use_reentrant": false}'`。默认为None。 +- gradient_checkpointing_kwargs: 传入`torch.utils.checkpoint`中的参数。例如设置为`--gradient_checkpointing_kwargs '{"use_reentrant": false}'`。默认为None。 - 注意:当使用DDP而不使用deepspeed/fsdp,且gradient_checkpointing_kwargs为None,会默认设置其为`'{"use_reentrant": false}'`。 - full_determinism: 确保训练中获得可重现的结果,注意:这会对性能产生负面影响。默认为False。 - 🔥report_to: 默认值为`tensorboard`。你也可以指定`--report_to tensorboard wandb swanlab`、`--report_to all`。 @@ -211,10 +211,10 @@ - hub_private_repo: 默认为False。 ### Tuner参数 -- 🔥freeze_llm: 该参数只对多模态模型生效,可用于全参和LoRA,但含义不同。若是全参数训练,将freeze_llm设置为True将会将llm部分权重进行冻结,若是LoRA训练且`target_modules`设置为'all-linear',将freeze_llm设置为True将会取消在llm部分添加LoRA模块。该参数默认为False。 -- 🔥freeze_vit: 该参数只对多模态模型生效,可用于全参和LoRA,含义参考`freeze_llm`。默认为True。 +- 🔥freeze_llm: 该参数只对多模态模型生效,可用于全参和LoRA,但会产生不同的效果。若是全参数训练,将freeze_llm设置为True将会将LLM部分权重进行冻结;若是LoRA训练且`target_modules`设置为'all-linear',将freeze_llm设置为True将会取消在LLM部分添加LoRA模块。该参数默认为False。 +- 🔥freeze_vit: 该参数只对多模态模型生效,可用于全参和LoRA,但会产生不同的效果。若是全参数训练,将freeze_vit设置为True将会将vit部分权重进行冻结;若是LoRA训练且`target_modules`设置为'all-linear',将freeze_vit设置为True将会取消在vit部分添加LoRA模块。该参数默认为True。 - 
注意:这里的vit不仅限于vision_tower, 也包括audio_tower。 -- 🔥freeze_aligner: 该参数只对多模态模型生效,可用于全参和LoRA,含义参考`freeze_llm`。默认为True。 +- 🔥freeze_aligner: 该参数只对多模态模型生效,可用于全参和LoRA,但会产生不同的效果。若是全参数训练,将freeze_aligner设置为True将会将aligner(也称为projector)部分权重进行冻结;若是LoRA训练且`target_modules`设置为'all-linear',将freeze_aligner设置为True将会取消在aligner部分添加LoRA模块。该参数默认为True。 - 🔥target_modules: 指定lora模块, 默认为`['all-linear']`。你也可以设置为module的后缀,例如:`--target_modules q_proj k_proj v_proj`。该参数不限于LoRA,可用于其他tuners。 - 注意:在LLM和多模态LLM中,'all-linear'的行为有所不同。若是LLM则自动寻找除lm_head外的linear并附加tuner;若是多模态LLM,则默认只在LLM上附加tuner,该行为可以被`freeze_llm`、`freeze_vit`、`freeze_aligner`控制。 - 🔥target_regex: 指定lora模块的regex表达式,默认为`None`。如果该值传入,则target_modules参数失效。该参数不限于LoRA,可用于其他tuners。 diff --git "a/docs/source/Instruction/\346\224\257\346\214\201\347\232\204\346\250\241\345\236\213\345\222\214\346\225\260\346\215\256\351\233\206.md" "b/docs/source/Instruction/\346\224\257\346\214\201\347\232\204\346\250\241\345\236\213\345\222\214\346\225\260\346\215\256\351\233\206.md" index b3836c2592..f9a8ba548b 100644 --- "a/docs/source/Instruction/\346\224\257\346\214\201\347\232\204\346\250\241\345\236\213\345\222\214\346\225\260\346\215\256\351\233\206.md" +++ "b/docs/source/Instruction/\346\224\257\346\214\201\347\232\204\346\250\241\345\236\213\345\222\214\346\225\260\346\215\256\351\233\206.md" @@ -652,12 +652,12 @@ |[Qwen/Qwen-VL-Chat-Int4](https://modelscope.cn/models/Qwen/Qwen-VL-Chat-Int4)|qwen_vl|qwen_vl|-|✘|vision|[Qwen/Qwen-VL-Chat-Int4](https://huggingface.co/Qwen/Qwen-VL-Chat-Int4)| |[Qwen/Qwen-Audio-Chat](https://modelscope.cn/models/Qwen/Qwen-Audio-Chat)|qwen_audio|qwen_audio|-|✘|audio|[Qwen/Qwen-Audio-Chat](https://huggingface.co/Qwen/Qwen-Audio-Chat)| |[Qwen/Qwen-Audio](https://modelscope.cn/models/Qwen/Qwen-Audio)|qwen_audio|qwen_audio|-|✘|audio|[Qwen/Qwen-Audio](https://huggingface.co/Qwen/Qwen-Audio)| -|[Qwen/Qwen2-VL-2B-Instruct](https://modelscope.cn/models/Qwen/Qwen2-VL-2B-Instruct)|qwen2_vl|qwen2_vl|transformers>=4.45, qwen_vl_utils>=0.0.6, decord|✘|vision, video|[Qwen/Qwen2-VL-2B-Instruct](https://huggingface.co/Qwen/Qwen2-VL-2B-Instruct)| -|[Qwen/Qwen2-VL-7B-Instruct](https://modelscope.cn/models/Qwen/Qwen2-VL-7B-Instruct)|qwen2_vl|qwen2_vl|transformers>=4.45, qwen_vl_utils>=0.0.6, decord|✘|vision, video|[Qwen/Qwen2-VL-7B-Instruct](https://huggingface.co/Qwen/Qwen2-VL-7B-Instruct)| -|[Qwen/Qwen2-VL-72B-Instruct](https://modelscope.cn/models/Qwen/Qwen2-VL-72B-Instruct)|qwen2_vl|qwen2_vl|transformers>=4.45, qwen_vl_utils>=0.0.6, decord|✘|vision, video|[Qwen/Qwen2-VL-72B-Instruct](https://huggingface.co/Qwen/Qwen2-VL-72B-Instruct)| -|[Qwen/Qwen2-VL-2B](https://modelscope.cn/models/Qwen/Qwen2-VL-2B)|qwen2_vl|qwen2_vl|transformers>=4.45, qwen_vl_utils>=0.0.6, decord|✘|vision, video|[Qwen/Qwen2-VL-2B](https://huggingface.co/Qwen/Qwen2-VL-2B)| -|[Qwen/Qwen2-VL-7B](https://modelscope.cn/models/Qwen/Qwen2-VL-7B)|qwen2_vl|qwen2_vl|transformers>=4.45, qwen_vl_utils>=0.0.6, decord|✘|vision, video|[Qwen/Qwen2-VL-7B](https://huggingface.co/Qwen/Qwen2-VL-7B)| -|[Qwen/Qwen2-VL-72B](https://modelscope.cn/models/Qwen/Qwen2-VL-72B)|qwen2_vl|qwen2_vl|transformers>=4.45, qwen_vl_utils>=0.0.6, decord|✘|vision, video|[Qwen/Qwen2-VL-72B](https://huggingface.co/Qwen/Qwen2-VL-72B)| +|[Qwen/Qwen2-VL-2B-Instruct](https://modelscope.cn/models/Qwen/Qwen2-VL-2B-Instruct)|qwen2_vl|qwen2_vl|transformers>=4.45, qwen_vl_utils>=0.0.6, decord|✔|vision, video|[Qwen/Qwen2-VL-2B-Instruct](https://huggingface.co/Qwen/Qwen2-VL-2B-Instruct)| 
+|[Qwen/Qwen2-VL-7B-Instruct](https://modelscope.cn/models/Qwen/Qwen2-VL-7B-Instruct)|qwen2_vl|qwen2_vl|transformers>=4.45, qwen_vl_utils>=0.0.6, decord|✔|vision, video|[Qwen/Qwen2-VL-7B-Instruct](https://huggingface.co/Qwen/Qwen2-VL-7B-Instruct)| +|[Qwen/Qwen2-VL-72B-Instruct](https://modelscope.cn/models/Qwen/Qwen2-VL-72B-Instruct)|qwen2_vl|qwen2_vl|transformers>=4.45, qwen_vl_utils>=0.0.6, decord|✔|vision, video|[Qwen/Qwen2-VL-72B-Instruct](https://huggingface.co/Qwen/Qwen2-VL-72B-Instruct)| +|[Qwen/Qwen2-VL-2B](https://modelscope.cn/models/Qwen/Qwen2-VL-2B)|qwen2_vl|qwen2_vl|transformers>=4.45, qwen_vl_utils>=0.0.6, decord|✔|vision, video|[Qwen/Qwen2-VL-2B](https://huggingface.co/Qwen/Qwen2-VL-2B)| +|[Qwen/Qwen2-VL-7B](https://modelscope.cn/models/Qwen/Qwen2-VL-7B)|qwen2_vl|qwen2_vl|transformers>=4.45, qwen_vl_utils>=0.0.6, decord|✔|vision, video|[Qwen/Qwen2-VL-7B](https://huggingface.co/Qwen/Qwen2-VL-7B)| +|[Qwen/Qwen2-VL-72B](https://modelscope.cn/models/Qwen/Qwen2-VL-72B)|qwen2_vl|qwen2_vl|transformers>=4.45, qwen_vl_utils>=0.0.6, decord|✔|vision, video|[Qwen/Qwen2-VL-72B](https://huggingface.co/Qwen/Qwen2-VL-72B)| |[Qwen/Qwen2-VL-2B-Instruct-GPTQ-Int4](https://modelscope.cn/models/Qwen/Qwen2-VL-2B-Instruct-GPTQ-Int4)|qwen2_vl|qwen2_vl|transformers>=4.45, qwen_vl_utils>=0.0.6, decord|✘|vision, video|[Qwen/Qwen2-VL-2B-Instruct-GPTQ-Int4](https://huggingface.co/Qwen/Qwen2-VL-2B-Instruct-GPTQ-Int4)| |[Qwen/Qwen2-VL-7B-Instruct-GPTQ-Int4](https://modelscope.cn/models/Qwen/Qwen2-VL-7B-Instruct-GPTQ-Int4)|qwen2_vl|qwen2_vl|transformers>=4.45, qwen_vl_utils>=0.0.6, decord|✘|vision, video|[Qwen/Qwen2-VL-7B-Instruct-GPTQ-Int4](https://huggingface.co/Qwen/Qwen2-VL-7B-Instruct-GPTQ-Int4)| |[Qwen/Qwen2-VL-72B-Instruct-GPTQ-Int4](https://modelscope.cn/models/Qwen/Qwen2-VL-72B-Instruct-GPTQ-Int4)|qwen2_vl|qwen2_vl|transformers>=4.45, qwen_vl_utils>=0.0.6, decord|✘|vision, video|[Qwen/Qwen2-VL-72B-Instruct-GPTQ-Int4](https://huggingface.co/Qwen/Qwen2-VL-72B-Instruct-GPTQ-Int4)| @@ -667,16 +667,16 @@ |[Qwen/Qwen2-VL-2B-Instruct-AWQ](https://modelscope.cn/models/Qwen/Qwen2-VL-2B-Instruct-AWQ)|qwen2_vl|qwen2_vl|transformers>=4.45, qwen_vl_utils>=0.0.6, decord|✘|vision, video|[Qwen/Qwen2-VL-2B-Instruct-AWQ](https://huggingface.co/Qwen/Qwen2-VL-2B-Instruct-AWQ)| |[Qwen/Qwen2-VL-7B-Instruct-AWQ](https://modelscope.cn/models/Qwen/Qwen2-VL-7B-Instruct-AWQ)|qwen2_vl|qwen2_vl|transformers>=4.45, qwen_vl_utils>=0.0.6, decord|✘|vision, video|[Qwen/Qwen2-VL-7B-Instruct-AWQ](https://huggingface.co/Qwen/Qwen2-VL-7B-Instruct-AWQ)| |[Qwen/Qwen2-VL-72B-Instruct-AWQ](https://modelscope.cn/models/Qwen/Qwen2-VL-72B-Instruct-AWQ)|qwen2_vl|qwen2_vl|transformers>=4.45, qwen_vl_utils>=0.0.6, decord|✘|vision, video|[Qwen/Qwen2-VL-72B-Instruct-AWQ](https://huggingface.co/Qwen/Qwen2-VL-72B-Instruct-AWQ)| -|[bytedance-research/UI-TARS-2B-SFT](https://modelscope.cn/models/bytedance-research/UI-TARS-2B-SFT)|qwen2_vl|qwen2_vl|transformers>=4.45, qwen_vl_utils>=0.0.6, decord|✘|vision, video|[bytedance-research/UI-TARS-2B-SFT](https://huggingface.co/bytedance-research/UI-TARS-2B-SFT)| -|[bytedance-research/UI-TARS-7B-SFT](https://modelscope.cn/models/bytedance-research/UI-TARS-7B-SFT)|qwen2_vl|qwen2_vl|transformers>=4.45, qwen_vl_utils>=0.0.6, decord|✘|vision, video|[bytedance-research/UI-TARS-7B-SFT](https://huggingface.co/bytedance-research/UI-TARS-7B-SFT)| -|[bytedance-research/UI-TARS-7B-DPO](https://modelscope.cn/models/bytedance-research/UI-TARS-7B-DPO)|qwen2_vl|qwen2_vl|transformers>=4.45, qwen_vl_utils>=0.0.6, 
decord|✘|vision, video|[bytedance-research/UI-TARS-7B-DPO](https://huggingface.co/bytedance-research/UI-TARS-7B-DPO)| -|[bytedance-research/UI-TARS-72B-SFT](https://modelscope.cn/models/bytedance-research/UI-TARS-72B-SFT)|qwen2_vl|qwen2_vl|transformers>=4.45, qwen_vl_utils>=0.0.6, decord|✘|vision, video|[bytedance-research/UI-TARS-72B-SFT](https://huggingface.co/bytedance-research/UI-TARS-72B-SFT)| -|[bytedance-research/UI-TARS-72B-DPO](https://modelscope.cn/models/bytedance-research/UI-TARS-72B-DPO)|qwen2_vl|qwen2_vl|transformers>=4.45, qwen_vl_utils>=0.0.6, decord|✘|vision, video|[bytedance-research/UI-TARS-72B-DPO](https://huggingface.co/bytedance-research/UI-TARS-72B-DPO)| -|[allenai/olmOCR-7B-0225-preview](https://modelscope.cn/models/allenai/olmOCR-7B-0225-preview)|qwen2_vl|qwen2_vl|transformers>=4.45, qwen_vl_utils>=0.0.6, decord|✘|vision, video|[allenai/olmOCR-7B-0225-preview](https://huggingface.co/allenai/olmOCR-7B-0225-preview)| -|[Qwen/Qwen2.5-VL-3B-Instruct](https://modelscope.cn/models/Qwen/Qwen2.5-VL-3B-Instruct)|qwen2_5_vl|qwen2_5_vl|transformers>=4.49, qwen_vl_utils>=0.0.6, decord|✘|vision, video|[Qwen/Qwen2.5-VL-3B-Instruct](https://huggingface.co/Qwen/Qwen2.5-VL-3B-Instruct)| -|[Qwen/Qwen2.5-VL-7B-Instruct](https://modelscope.cn/models/Qwen/Qwen2.5-VL-7B-Instruct)|qwen2_5_vl|qwen2_5_vl|transformers>=4.49, qwen_vl_utils>=0.0.6, decord|✘|vision, video|[Qwen/Qwen2.5-VL-7B-Instruct](https://huggingface.co/Qwen/Qwen2.5-VL-7B-Instruct)| -|[Qwen/Qwen2.5-VL-32B-Instruct](https://modelscope.cn/models/Qwen/Qwen2.5-VL-32B-Instruct)|qwen2_5_vl|qwen2_5_vl|transformers>=4.49, qwen_vl_utils>=0.0.6, decord|✘|vision, video|[Qwen/Qwen2.5-VL-32B-Instruct](https://huggingface.co/Qwen/Qwen2.5-VL-32B-Instruct)| -|[Qwen/Qwen2.5-VL-72B-Instruct](https://modelscope.cn/models/Qwen/Qwen2.5-VL-72B-Instruct)|qwen2_5_vl|qwen2_5_vl|transformers>=4.49, qwen_vl_utils>=0.0.6, decord|✘|vision, video|[Qwen/Qwen2.5-VL-72B-Instruct](https://huggingface.co/Qwen/Qwen2.5-VL-72B-Instruct)| +|[bytedance-research/UI-TARS-2B-SFT](https://modelscope.cn/models/bytedance-research/UI-TARS-2B-SFT)|qwen2_vl|qwen2_vl|transformers>=4.45, qwen_vl_utils>=0.0.6, decord|✔|vision, video|[bytedance-research/UI-TARS-2B-SFT](https://huggingface.co/bytedance-research/UI-TARS-2B-SFT)| +|[bytedance-research/UI-TARS-7B-SFT](https://modelscope.cn/models/bytedance-research/UI-TARS-7B-SFT)|qwen2_vl|qwen2_vl|transformers>=4.45, qwen_vl_utils>=0.0.6, decord|✔|vision, video|[bytedance-research/UI-TARS-7B-SFT](https://huggingface.co/bytedance-research/UI-TARS-7B-SFT)| +|[bytedance-research/UI-TARS-7B-DPO](https://modelscope.cn/models/bytedance-research/UI-TARS-7B-DPO)|qwen2_vl|qwen2_vl|transformers>=4.45, qwen_vl_utils>=0.0.6, decord|✔|vision, video|[bytedance-research/UI-TARS-7B-DPO](https://huggingface.co/bytedance-research/UI-TARS-7B-DPO)| +|[bytedance-research/UI-TARS-72B-SFT](https://modelscope.cn/models/bytedance-research/UI-TARS-72B-SFT)|qwen2_vl|qwen2_vl|transformers>=4.45, qwen_vl_utils>=0.0.6, decord|✔|vision, video|[bytedance-research/UI-TARS-72B-SFT](https://huggingface.co/bytedance-research/UI-TARS-72B-SFT)| +|[bytedance-research/UI-TARS-72B-DPO](https://modelscope.cn/models/bytedance-research/UI-TARS-72B-DPO)|qwen2_vl|qwen2_vl|transformers>=4.45, qwen_vl_utils>=0.0.6, decord|✔|vision, video|[bytedance-research/UI-TARS-72B-DPO](https://huggingface.co/bytedance-research/UI-TARS-72B-DPO)| +|[allenai/olmOCR-7B-0225-preview](https://modelscope.cn/models/allenai/olmOCR-7B-0225-preview)|qwen2_vl|qwen2_vl|transformers>=4.45, 
qwen_vl_utils>=0.0.6, decord|✔|vision, video|[allenai/olmOCR-7B-0225-preview](https://huggingface.co/allenai/olmOCR-7B-0225-preview)| +|[Qwen/Qwen2.5-VL-3B-Instruct](https://modelscope.cn/models/Qwen/Qwen2.5-VL-3B-Instruct)|qwen2_5_vl|qwen2_5_vl|transformers>=4.49, qwen_vl_utils>=0.0.6, decord|✔|vision, video|[Qwen/Qwen2.5-VL-3B-Instruct](https://huggingface.co/Qwen/Qwen2.5-VL-3B-Instruct)| +|[Qwen/Qwen2.5-VL-7B-Instruct](https://modelscope.cn/models/Qwen/Qwen2.5-VL-7B-Instruct)|qwen2_5_vl|qwen2_5_vl|transformers>=4.49, qwen_vl_utils>=0.0.6, decord|✔|vision, video|[Qwen/Qwen2.5-VL-7B-Instruct](https://huggingface.co/Qwen/Qwen2.5-VL-7B-Instruct)| +|[Qwen/Qwen2.5-VL-32B-Instruct](https://modelscope.cn/models/Qwen/Qwen2.5-VL-32B-Instruct)|qwen2_5_vl|qwen2_5_vl|transformers>=4.49, qwen_vl_utils>=0.0.6, decord|✔|vision, video|[Qwen/Qwen2.5-VL-32B-Instruct](https://huggingface.co/Qwen/Qwen2.5-VL-32B-Instruct)| +|[Qwen/Qwen2.5-VL-72B-Instruct](https://modelscope.cn/models/Qwen/Qwen2.5-VL-72B-Instruct)|qwen2_5_vl|qwen2_5_vl|transformers>=4.49, qwen_vl_utils>=0.0.6, decord|✔|vision, video|[Qwen/Qwen2.5-VL-72B-Instruct](https://huggingface.co/Qwen/Qwen2.5-VL-72B-Instruct)| |[Qwen/Qwen2.5-VL-3B-Instruct-AWQ](https://modelscope.cn/models/Qwen/Qwen2.5-VL-3B-Instruct-AWQ)|qwen2_5_vl|qwen2_5_vl|transformers>=4.49, qwen_vl_utils>=0.0.6, decord|✘|vision, video|[Qwen/Qwen2.5-VL-3B-Instruct-AWQ](https://huggingface.co/Qwen/Qwen2.5-VL-3B-Instruct-AWQ)| |[Qwen/Qwen2.5-VL-7B-Instruct-AWQ](https://modelscope.cn/models/Qwen/Qwen2.5-VL-7B-Instruct-AWQ)|qwen2_5_vl|qwen2_5_vl|transformers>=4.49, qwen_vl_utils>=0.0.6, decord|✘|vision, video|[Qwen/Qwen2.5-VL-7B-Instruct-AWQ](https://huggingface.co/Qwen/Qwen2.5-VL-7B-Instruct-AWQ)| |[Qwen/Qwen2.5-VL-32B-Instruct-AWQ](https://modelscope.cn/models/Qwen/Qwen2.5-VL-32B-Instruct-AWQ)|qwen2_5_vl|qwen2_5_vl|transformers>=4.49, qwen_vl_utils>=0.0.6, decord|✘|vision, video|[Qwen/Qwen2.5-VL-32B-Instruct-AWQ](https://huggingface.co/Qwen/Qwen2.5-VL-32B-Instruct-AWQ)| diff --git "a/docs/source/Megatron-SWIFT/\345\221\275\344\273\244\350\241\214\345\217\202\346\225\260.md" "b/docs/source/Megatron-SWIFT/\345\221\275\344\273\244\350\241\214\345\217\202\346\225\260.md" index b76390dd75..b4864b250f 100644 --- "a/docs/source/Megatron-SWIFT/\345\221\275\344\273\244\350\241\214\345\217\202\346\225\260.md" +++ "b/docs/source/Megatron-SWIFT/\345\221\275\344\273\244\350\241\214\345\217\202\346\225\260.md" @@ -192,6 +192,10 @@ **Tuner参数**: - train_type: 可选为'lora'和'full'。默认为'full'。 +- 🔥freeze_llm: 该参数只对多模态模型生效,可用于全参和LoRA,但会产生不同的效果。若是全参数训练,将freeze_llm设置为True将会将LLM部分权重进行冻结;若是LoRA训练且`target_modules`设置为'all-linear',将freeze_llm设置为True将会取消在LLM部分添加LoRA模块。该参数默认为False。 +- 🔥freeze_vit: 该参数只对多模态模型生效,可用于全参和LoRA,但会产生不同的效果。若是全参数训练,将freeze_vit设置为True将会将vit部分权重进行冻结;若是LoRA训练且`target_modules`设置为'all-linear',将freeze_vit设置为True将会取消在vit部分添加LoRA模块。该参数默认为True。 + - 注意:这里的vit不仅限于vision_tower, 也包括audio_tower。 +- 🔥freeze_aligner: 该参数只对多模态模型生效,可用于全参和LoRA,但会产生不同的效果。若是全参数训练,将freeze_aligner设置为True将会将aligner(也称为projector)部分权重进行冻结;若是LoRA训练且`target_modules`设置为'all-linear',将freeze_aligner设置为True将会取消在aligner部分添加LoRA模块。该参数默认为True。 全参数训练: - freeze_parameters: 需要被冻结参数的前缀,默认为`[]`。 @@ -234,6 +238,8 @@ Megatron训练参数继承自Megatron参数和基本参数(与ms-swift共用da - 若要自定义attention_mask,你可以设置`--padding_free false`。 - 注意:Megatron-SWIFT训练特性优先支持padding_free格式,若非特殊情况,请勿修改该值。 - mlp_padding_free: 默认为False。用于padding_free设置为false时,对mlp进行padding_free优化。这可以在自定义attention_mask的同时,提升训练速度和减少显存占用。 +- vit_gradient_checkpointing: 
多模态模型训练时,是否对vit部分开启gradient_checkpointing。默认为True。 +- gradient_checkpointing_kwargs: 传入`torch.utils.checkpoint`中的参数。例如设置为`--gradient_checkpointing_kwargs '{"use_reentrant": false}'`。默认为None。 - 🔥packing: 是否使用序列packing,默认为False。当前支持CPT/SFT/DPO。 - packing_length: packing的长度。默认为None,设置为max_length。 - streaming: 流式读取并处理数据集,默认False。 diff --git "a/docs/source/Megatron-SWIFT/\345\244\232\346\250\241\346\200\201\346\250\241\345\236\213.md" "b/docs/source/Megatron-SWIFT/\345\244\232\346\250\241\346\200\201\346\250\241\345\236\213.md" new file mode 100644 index 0000000000..e9186658c5 --- /dev/null +++ "b/docs/source/Megatron-SWIFT/\345\244\232\346\250\241\346\200\201\346\250\241\345\236\213.md" @@ -0,0 +1,156 @@ +# 多模态模型 + +ms-swift引入了Megatron的并行技术来加速多模态大模型的训练。目前支持Qwen2.5-VL等模型的CPT/SFT/DPO。完整支持的模型可以参考[支持的模型与数据集文档](../Instruction/支持的模型和数据集.md)。 + +环境准备请参考Megatron-SWIFT的[快速开始文档](./快速开始.md)。 + +## Dense模型 Full/LoRA + +这里介绍使用2卡80GiB A100对Qwen2.5-VL-7B-Instruct模型进行Latex-OCR的微调,分别使用全参数和LoRA的方式,以下最佳实践可以在10分钟内完成。 + +首先,我们需要将HF格式的权重转为Megatron格式: +```shell +CUDA_VISIBLE_DEVICES=0 \ +swift export \ + --model Qwen/Qwen2.5-VL-7B-Instruct \ + --to_mcore true \ + --torch_dtype bfloat16 \ + --output_dir Qwen2.5-VL-7B-Instruct-mcore \ + --test_convert_precision true +``` + +### Full + +全参数训练脚本如下: +```shell +# 2 * 72GiB; 4.1s/it +PYTORCH_CUDA_ALLOC_CONF='expandable_segments:True' \ +NPROC_PER_NODE=2 \ +MAX_PIXELS=1003520 \ +CUDA_VISIBLE_DEVICES=0,1 \ +megatron sft \ + --load Qwen2.5-VL-7B-Instruct-mcore \ + --dataset 'AI-ModelScope/LaTeX_OCR:human_handwrite#5000' \ + --tensor_model_parallel_size 2 \ + --sequence_parallel true \ + --packing true \ + --freeze_llm false \ + --freeze_vit true \ + --freeze_aligner true \ + --split_dataset_ratio 0.01 \ + --micro_batch_size 1 \ + --global_batch_size 4 \ + --recompute_granularity full \ + --recompute_method uniform \ + --recompute_num_layers 1 \ + --finetune true \ + --cross_entropy_loss_fusion true \ + --lr 1e-5 \ + --lr_warmup_fraction 0.05 \ + --min_lr 1e-6 \ + --max_epochs 1 \ + --save megatron_output/Qwen2.5-VL-7B-Instruct \ + --save_interval 200 \ + --vit_gradient_checkpointing true \ + --max_length 2048 \ + --num_workers 4 \ + --no_save_optim true \ + --no_save_rng true \ + --dataset_num_proc 8 +``` + +将全参数保存的Megatron格式权重转为HF格式: +- 注意:`--mcore_model`请指向`iter_xxx`的上级目录。默认会使用`latest_checkpointed_iteration.txt`中对应的checkpoint。 +```shell +CUDA_VISIBLE_DEVICES=0 \ +swift export \ + --mcore_model megatron_output/Qwen2.5-VL-7B-Instruct/vx-xxx \ + --to_hf true \ + --torch_dtype bfloat16 \ + --output_dir megatron_output/Qwen2.5-VL-7B-Instruct/vx-xxx-hf \ + --test_convert_precision true +``` + +### LoRA + +LoRA训练脚本如下: +```shell +# 2 * 23GiB; 2.3s/it +PYTORCH_CUDA_ALLOC_CONF='expandable_segments:True' \ +NPROC_PER_NODE=2 \ +MAX_PIXELS=1003520 \ +CUDA_VISIBLE_DEVICES=0,1 \ +megatron sft \ + --load Qwen2.5-VL-7B-Instruct-mcore \ + --dataset 'AI-ModelScope/LaTeX_OCR:human_handwrite#5000' \ + --train_type lora \ + --lora_rank 8 \ + --lora_alpha 32 \ + --target_modules all-linear \ + --tensor_model_parallel_size 1 \ + --sequence_parallel true \ + --freeze_llm false \ + --freeze_vit true \ + --freeze_aligner true \ + --packing true \ + --split_dataset_ratio 0.01 \ + --micro_batch_size 1 \ + --global_batch_size 4 \ + --recompute_granularity full \ + --recompute_method uniform \ + --recompute_num_layers 1 \ + --finetune true \ + --cross_entropy_loss_fusion true \ + --lr 1e-4 \ + --lr_warmup_fraction 0.05 \ + --min_lr 1e-5 \ + --max_epochs 1 \ + --save megatron_output/Qwen2.5-VL-7B-Instruct 
\ + --save_interval 200 \ + --vit_gradient_checkpointing true \ + --max_length 2048 \ + --num_workers 4 \ + --no_save_optim true \ + --no_save_rng true \ + --dataset_num_proc 8 +``` + +将LoRA保存的增量权重进行Merge-LoRA并转为HF格式: +```shell +CUDA_VISIBLE_DEVICES=0 \ +swift export \ + --mcore_adapters megatron_output/Qwen2.5-VL-7B-Instruct/vx-xxx \ + --to_hf true \ + --torch_dtype bfloat16 \ + --output_dir megatron_output/Qwen2.5-VL-7B-Instruct/vx-xxx-hf \ + --test_convert_precision true +``` + + +最后,我们使用生成的HF格式权重对验证集进行推理: +```shell +MAX_PIXELS=1003520 \ +CUDA_VISIBLE_DEVICES=0 \ +swift infer \ + --model megatron_output/Qwen2.5-VL-7B-Instruct/vx-xxx-hf \ + --attn_impl flash_attn \ + --stream true \ + --load_data_args true \ + --temperature 0 \ + --max_new_tokens 512 +``` + +推理结果如下: +``` +[QUERY] Using LaTeX to perform OCR on the image. +[LABELS] \forall x \in X , ( \alpha f ) ( x ) = \alpha f ( x ) +[RESPONSE] \forall x \in X , ( \alpha f ) ( x ) = \alpha f ( x ) +-------------------------------------------------- +[QUERY] Using LaTeX to perform OCR on the image. +[LABELS] \pi \int _ { c } ^ { d } \{ g ( y ) \} ^ { 2 } d y +[RESPONSE] \pi \int _ { c } ^ { d } \{ g ( y ) \} ^ { 2 } d y +-------------------------------------------------- +[QUERY] Using LaTeX to perform OCR on the image. +[LABELS] [ \frac 2 3 x ^ { \frac 3 2 } ] _ { 0 } ^ { 1 } +[RESPONSE] [ \frac 2 3 x ^ { \frac 3 2 } ] _ { 0 } ^ { 1 } +``` diff --git "a/docs/source/Megatron-SWIFT/\345\277\253\351\200\237\345\274\200\345\247\213.md" "b/docs/source/Megatron-SWIFT/\345\277\253\351\200\237\345\274\200\345\247\213.md" index d105487afe..782d532cab 100644 --- "a/docs/source/Megatron-SWIFT/\345\277\253\351\200\237\345\274\200\345\247\213.md" +++ "b/docs/source/Megatron-SWIFT/\345\277\253\351\200\237\345\274\200\345\247\213.md" @@ -1,7 +1,7 @@ # 快速开始 -ms-swift引入了Megatron的并行技术来加速大模型的训练,包括数据并行、张量并行、流水线并行、序列并行,上下文并行,专家并行。支持Qwen3、[Qwen3-MoE](https://github.com/modelscope/ms-swift/blob/main/examples/megatron/qwen3_moe.sh)、Qwen2.5、Llama3、Deepseek-R1、GLM4.5等模型的预训练和微调。完整支持的模型可以参考[支持的模型与数据集文档](../Instruction/支持的模型和数据集.md)。推荐在MoE训练时使用Megatron-SWIFT,这通常可以获得10倍的训练速度提升。 +ms-swift引入了Megatron的并行技术来加速大模型的训练,包括数据并行、张量并行、流水线并行、序列并行,上下文并行,专家并行。支持Qwen3、[Qwen3-MoE](https://github.com/modelscope/ms-swift/blob/main/examples/megatron/qwen3_moe.sh)、Qwen2.5、Llama3、Deepseek-R1、GLM4.5等模型的CPT/SFT/DPO。完整支持的模型可以参考[支持的模型与数据集文档](../Instruction/支持的模型和数据集.md)。推荐在MoE训练时使用Megatron-SWIFT,这通常可以获得10倍的训练速度提升。 ## 环境准备 使用Megatron-SWIFT,除了安装swift依赖外,还需要安装以下内容: diff --git a/docs/source/index.rst b/docs/source/index.rst index 7e9dbba285..af6129e628 100644 --- a/docs/source/index.rst +++ b/docs/source/index.rst @@ -38,6 +38,7 @@ Swift DOCUMENTATION Megatron-SWIFT/快速开始.md Megatron-SWIFT/命令行参数.md Megatron-SWIFT/LoRA训练.md + Megatron-SWIFT/多模态模型.md .. toctree:: :maxdepth: 2 diff --git a/docs/source_en/Instruction/Command-line-parameters.md b/docs/source_en/Instruction/Command-line-parameters.md index 2606710f79..6499184173 100644 --- a/docs/source_en/Instruction/Command-line-parameters.md +++ b/docs/source_en/Instruction/Command-line-parameters.md @@ -162,7 +162,7 @@ This parameter list inherits from transformers `Seq2SeqTrainingArguments`, with - 🔥aligner_lr: When training a multimodal large model, this parameter specifies the learning rate for the aligner. By default, it is set to None, which means it equals `learning_rate`. - lr_scheduler_type: Type of lr_scheduler, defaults to 'cosine'. - lr_scheduler_kwargs: Other parameters for the lr_scheduler, defaults to None. 
-- 🔥gradient_checkpointing_kwargs: Parameters for `torch.utils.checkpoint`. For example, set as `--gradient_checkpointing_kwargs '{"use_reentrant": false}'`. Defaults to None. +- gradient_checkpointing_kwargs: Parameters for `torch.utils.checkpoint`. For example, set as `--gradient_checkpointing_kwargs '{"use_reentrant": false}'`. Defaults to None. - Note: When using DDP without DeepSpeed/FSDP, and `gradient_checkpointing_kwargs` is `None`, it will default to `'{"use_reentrant": false}'`. - full_determinism: Ensures reproducible results during training. Note: This will negatively impact performance. Defaults to False. - 🔥report_to: Default value is `tensorboard`. You can also specify `--report_to tensorboard wandb swanlab` or `--report_to all`. @@ -215,11 +215,11 @@ Other important parameters: ### Tuner Arguments -- 🔥freeze_llm: This parameter is only effective for multimodal models and can be used for full parameter training and LoRA, but with different meanings. In full parameter training, setting freeze_llm to True will freeze some of the LLM weights. In LoRA training, if `target_modules` is set to 'all-linear', setting freeze_llm to True will prevent adding LoRA modules to the LLM part. The default is False. -- 🔥freeze_vit: This parameter is only effective for multimodal models and can be used for full parameter training and LoRA, with similar meanings as `freeze_llm`. The default is True. - - Note: Here, "vit" refers not only to the vision_tower but also includes the audio_tower. -- 🔥freeze_aligner: This parameter is only effective for multimodal models and can be used for full parameter training and LoRA, with similar meanings as `freeze_llm`. The default is True. -- 🔥 target_modules: Specifies the LoRA modules. The default is `['all-linear']`, but you can also pass layer-name suffixes, e.g. `--target_modules q_proj k_proj v_proj`. This argument is not restricted to LoRA and can be used with other tuners as well. +- 🔥freeze_llm: This parameter only takes effect for multimodal models and can be used in both full-parameter and LoRA training, but with different behaviors. In full-parameter training, setting `freeze_llm` to `True` will freeze the weights of the LLM component. In LoRA training with `target_modules` set to 'all-linear', setting `freeze_llm` to `True` will prevent LoRA modules from being added to the LLM component. The default value is `False`. +- 🔥freeze_vit: This parameter only applies to multimodal models and can be used in both full-parameter and LoRA training, though with different effects. In full-parameter training, setting `freeze_vit` to `True` will freeze the weights of the ViT component. In LoRA training with `target_modules` set to 'all-linear', setting `freeze_vit` to `True` will prevent LoRA modules from being added to the ViT component. The default value is `True`. + - Note: The term "ViT" here refers not only to the vision tower but also includes the audio tower. +- 🔥freeze_aligner: This parameter is only effective for multimodal models and can be used in both full-parameter and LoRA training, with differing outcomes. In full-parameter training, setting `freeze_aligner` to `True` will freeze the weights of the aligner (also known as the projector) component. In LoRA training with `target_modules` set to 'all-linear', setting `freeze_aligner` to `True` will prevent LoRA modules from being added to the aligner component. The default value is `True`. +- 🔥target_modules: Specifies the LoRA modules. 
The default is `['all-linear']`, but you can also pass layer-name suffixes, e.g. `--target_modules q_proj k_proj v_proj`. This argument is not restricted to LoRA and can be used with other tuners as well. - Note: The behavior of the special value `'all-linear'` differs between plain LLMs and multimodal LLMs. For a standard LLM, it automatically locates every linear layer except `lm_head` and attaches a tuner. For a multimodal LLM, it attaches the tuner only to the LLM component by default. This default can be changed with the `freeze_llm`, `freeze_vit`, and `freeze_aligner` options. - 🔥target_regex: Specifies a regex expression for LoRA modules, with a default of `None`. If this value is provided, the target_modules parameter becomes ineffective. This parameter is not limited to LoRA and can be used for other tuners. - target_parameters: List of parameter names to be replaced with LoRA. This argument behaves similarly to target_modules, but you should pass parameter names instead. This feature requires "peft>=0.17.0". For example, in many Mixture-of-Experts (MoE) layers in Hugging Face Transformers, `nn.Linear` is not used; instead, `nn.Parameter` is used. In such cases, the `target_parameters` argument can be used to apply LoRA. diff --git a/docs/source_en/Instruction/Supported-models-and-datasets.md b/docs/source_en/Instruction/Supported-models-and-datasets.md index a20d5c0bd6..1830a604a9 100644 --- a/docs/source_en/Instruction/Supported-models-and-datasets.md +++ b/docs/source_en/Instruction/Supported-models-and-datasets.md @@ -652,12 +652,12 @@ The table below introduces the models integrated with ms-swift: |[Qwen/Qwen-VL-Chat-Int4](https://modelscope.cn/models/Qwen/Qwen-VL-Chat-Int4)|qwen_vl|qwen_vl|-|✘|vision|[Qwen/Qwen-VL-Chat-Int4](https://huggingface.co/Qwen/Qwen-VL-Chat-Int4)| |[Qwen/Qwen-Audio-Chat](https://modelscope.cn/models/Qwen/Qwen-Audio-Chat)|qwen_audio|qwen_audio|-|✘|audio|[Qwen/Qwen-Audio-Chat](https://huggingface.co/Qwen/Qwen-Audio-Chat)| |[Qwen/Qwen-Audio](https://modelscope.cn/models/Qwen/Qwen-Audio)|qwen_audio|qwen_audio|-|✘|audio|[Qwen/Qwen-Audio](https://huggingface.co/Qwen/Qwen-Audio)| -|[Qwen/Qwen2-VL-2B-Instruct](https://modelscope.cn/models/Qwen/Qwen2-VL-2B-Instruct)|qwen2_vl|qwen2_vl|transformers>=4.45, qwen_vl_utils>=0.0.6, decord|✘|vision, video|[Qwen/Qwen2-VL-2B-Instruct](https://huggingface.co/Qwen/Qwen2-VL-2B-Instruct)| -|[Qwen/Qwen2-VL-7B-Instruct](https://modelscope.cn/models/Qwen/Qwen2-VL-7B-Instruct)|qwen2_vl|qwen2_vl|transformers>=4.45, qwen_vl_utils>=0.0.6, decord|✘|vision, video|[Qwen/Qwen2-VL-7B-Instruct](https://huggingface.co/Qwen/Qwen2-VL-7B-Instruct)| -|[Qwen/Qwen2-VL-72B-Instruct](https://modelscope.cn/models/Qwen/Qwen2-VL-72B-Instruct)|qwen2_vl|qwen2_vl|transformers>=4.45, qwen_vl_utils>=0.0.6, decord|✘|vision, video|[Qwen/Qwen2-VL-72B-Instruct](https://huggingface.co/Qwen/Qwen2-VL-72B-Instruct)| -|[Qwen/Qwen2-VL-2B](https://modelscope.cn/models/Qwen/Qwen2-VL-2B)|qwen2_vl|qwen2_vl|transformers>=4.45, qwen_vl_utils>=0.0.6, decord|✘|vision, video|[Qwen/Qwen2-VL-2B](https://huggingface.co/Qwen/Qwen2-VL-2B)| -|[Qwen/Qwen2-VL-7B](https://modelscope.cn/models/Qwen/Qwen2-VL-7B)|qwen2_vl|qwen2_vl|transformers>=4.45, qwen_vl_utils>=0.0.6, decord|✘|vision, video|[Qwen/Qwen2-VL-7B](https://huggingface.co/Qwen/Qwen2-VL-7B)| -|[Qwen/Qwen2-VL-72B](https://modelscope.cn/models/Qwen/Qwen2-VL-72B)|qwen2_vl|qwen2_vl|transformers>=4.45, qwen_vl_utils>=0.0.6, decord|✘|vision, video|[Qwen/Qwen2-VL-72B](https://huggingface.co/Qwen/Qwen2-VL-72B)| 
+|[Qwen/Qwen2-VL-2B-Instruct](https://modelscope.cn/models/Qwen/Qwen2-VL-2B-Instruct)|qwen2_vl|qwen2_vl|transformers>=4.45, qwen_vl_utils>=0.0.6, decord|✔|vision, video|[Qwen/Qwen2-VL-2B-Instruct](https://huggingface.co/Qwen/Qwen2-VL-2B-Instruct)| +|[Qwen/Qwen2-VL-7B-Instruct](https://modelscope.cn/models/Qwen/Qwen2-VL-7B-Instruct)|qwen2_vl|qwen2_vl|transformers>=4.45, qwen_vl_utils>=0.0.6, decord|✔|vision, video|[Qwen/Qwen2-VL-7B-Instruct](https://huggingface.co/Qwen/Qwen2-VL-7B-Instruct)| +|[Qwen/Qwen2-VL-72B-Instruct](https://modelscope.cn/models/Qwen/Qwen2-VL-72B-Instruct)|qwen2_vl|qwen2_vl|transformers>=4.45, qwen_vl_utils>=0.0.6, decord|✔|vision, video|[Qwen/Qwen2-VL-72B-Instruct](https://huggingface.co/Qwen/Qwen2-VL-72B-Instruct)| +|[Qwen/Qwen2-VL-2B](https://modelscope.cn/models/Qwen/Qwen2-VL-2B)|qwen2_vl|qwen2_vl|transformers>=4.45, qwen_vl_utils>=0.0.6, decord|✔|vision, video|[Qwen/Qwen2-VL-2B](https://huggingface.co/Qwen/Qwen2-VL-2B)| +|[Qwen/Qwen2-VL-7B](https://modelscope.cn/models/Qwen/Qwen2-VL-7B)|qwen2_vl|qwen2_vl|transformers>=4.45, qwen_vl_utils>=0.0.6, decord|✔|vision, video|[Qwen/Qwen2-VL-7B](https://huggingface.co/Qwen/Qwen2-VL-7B)| +|[Qwen/Qwen2-VL-72B](https://modelscope.cn/models/Qwen/Qwen2-VL-72B)|qwen2_vl|qwen2_vl|transformers>=4.45, qwen_vl_utils>=0.0.6, decord|✔|vision, video|[Qwen/Qwen2-VL-72B](https://huggingface.co/Qwen/Qwen2-VL-72B)| |[Qwen/Qwen2-VL-2B-Instruct-GPTQ-Int4](https://modelscope.cn/models/Qwen/Qwen2-VL-2B-Instruct-GPTQ-Int4)|qwen2_vl|qwen2_vl|transformers>=4.45, qwen_vl_utils>=0.0.6, decord|✘|vision, video|[Qwen/Qwen2-VL-2B-Instruct-GPTQ-Int4](https://huggingface.co/Qwen/Qwen2-VL-2B-Instruct-GPTQ-Int4)| |[Qwen/Qwen2-VL-7B-Instruct-GPTQ-Int4](https://modelscope.cn/models/Qwen/Qwen2-VL-7B-Instruct-GPTQ-Int4)|qwen2_vl|qwen2_vl|transformers>=4.45, qwen_vl_utils>=0.0.6, decord|✘|vision, video|[Qwen/Qwen2-VL-7B-Instruct-GPTQ-Int4](https://huggingface.co/Qwen/Qwen2-VL-7B-Instruct-GPTQ-Int4)| |[Qwen/Qwen2-VL-72B-Instruct-GPTQ-Int4](https://modelscope.cn/models/Qwen/Qwen2-VL-72B-Instruct-GPTQ-Int4)|qwen2_vl|qwen2_vl|transformers>=4.45, qwen_vl_utils>=0.0.6, decord|✘|vision, video|[Qwen/Qwen2-VL-72B-Instruct-GPTQ-Int4](https://huggingface.co/Qwen/Qwen2-VL-72B-Instruct-GPTQ-Int4)| @@ -667,16 +667,16 @@ The table below introduces the models integrated with ms-swift: |[Qwen/Qwen2-VL-2B-Instruct-AWQ](https://modelscope.cn/models/Qwen/Qwen2-VL-2B-Instruct-AWQ)|qwen2_vl|qwen2_vl|transformers>=4.45, qwen_vl_utils>=0.0.6, decord|✘|vision, video|[Qwen/Qwen2-VL-2B-Instruct-AWQ](https://huggingface.co/Qwen/Qwen2-VL-2B-Instruct-AWQ)| |[Qwen/Qwen2-VL-7B-Instruct-AWQ](https://modelscope.cn/models/Qwen/Qwen2-VL-7B-Instruct-AWQ)|qwen2_vl|qwen2_vl|transformers>=4.45, qwen_vl_utils>=0.0.6, decord|✘|vision, video|[Qwen/Qwen2-VL-7B-Instruct-AWQ](https://huggingface.co/Qwen/Qwen2-VL-7B-Instruct-AWQ)| |[Qwen/Qwen2-VL-72B-Instruct-AWQ](https://modelscope.cn/models/Qwen/Qwen2-VL-72B-Instruct-AWQ)|qwen2_vl|qwen2_vl|transformers>=4.45, qwen_vl_utils>=0.0.6, decord|✘|vision, video|[Qwen/Qwen2-VL-72B-Instruct-AWQ](https://huggingface.co/Qwen/Qwen2-VL-72B-Instruct-AWQ)| -|[bytedance-research/UI-TARS-2B-SFT](https://modelscope.cn/models/bytedance-research/UI-TARS-2B-SFT)|qwen2_vl|qwen2_vl|transformers>=4.45, qwen_vl_utils>=0.0.6, decord|✘|vision, video|[bytedance-research/UI-TARS-2B-SFT](https://huggingface.co/bytedance-research/UI-TARS-2B-SFT)| -|[bytedance-research/UI-TARS-7B-SFT](https://modelscope.cn/models/bytedance-research/UI-TARS-7B-SFT)|qwen2_vl|qwen2_vl|transformers>=4.45, 
qwen_vl_utils>=0.0.6, decord|✘|vision, video|[bytedance-research/UI-TARS-7B-SFT](https://huggingface.co/bytedance-research/UI-TARS-7B-SFT)| -|[bytedance-research/UI-TARS-7B-DPO](https://modelscope.cn/models/bytedance-research/UI-TARS-7B-DPO)|qwen2_vl|qwen2_vl|transformers>=4.45, qwen_vl_utils>=0.0.6, decord|✘|vision, video|[bytedance-research/UI-TARS-7B-DPO](https://huggingface.co/bytedance-research/UI-TARS-7B-DPO)| -|[bytedance-research/UI-TARS-72B-SFT](https://modelscope.cn/models/bytedance-research/UI-TARS-72B-SFT)|qwen2_vl|qwen2_vl|transformers>=4.45, qwen_vl_utils>=0.0.6, decord|✘|vision, video|[bytedance-research/UI-TARS-72B-SFT](https://huggingface.co/bytedance-research/UI-TARS-72B-SFT)| -|[bytedance-research/UI-TARS-72B-DPO](https://modelscope.cn/models/bytedance-research/UI-TARS-72B-DPO)|qwen2_vl|qwen2_vl|transformers>=4.45, qwen_vl_utils>=0.0.6, decord|✘|vision, video|[bytedance-research/UI-TARS-72B-DPO](https://huggingface.co/bytedance-research/UI-TARS-72B-DPO)| -|[allenai/olmOCR-7B-0225-preview](https://modelscope.cn/models/allenai/olmOCR-7B-0225-preview)|qwen2_vl|qwen2_vl|transformers>=4.45, qwen_vl_utils>=0.0.6, decord|✘|vision, video|[allenai/olmOCR-7B-0225-preview](https://huggingface.co/allenai/olmOCR-7B-0225-preview)| -|[Qwen/Qwen2.5-VL-3B-Instruct](https://modelscope.cn/models/Qwen/Qwen2.5-VL-3B-Instruct)|qwen2_5_vl|qwen2_5_vl|transformers>=4.49, qwen_vl_utils>=0.0.6, decord|✘|vision, video|[Qwen/Qwen2.5-VL-3B-Instruct](https://huggingface.co/Qwen/Qwen2.5-VL-3B-Instruct)| -|[Qwen/Qwen2.5-VL-7B-Instruct](https://modelscope.cn/models/Qwen/Qwen2.5-VL-7B-Instruct)|qwen2_5_vl|qwen2_5_vl|transformers>=4.49, qwen_vl_utils>=0.0.6, decord|✘|vision, video|[Qwen/Qwen2.5-VL-7B-Instruct](https://huggingface.co/Qwen/Qwen2.5-VL-7B-Instruct)| -|[Qwen/Qwen2.5-VL-32B-Instruct](https://modelscope.cn/models/Qwen/Qwen2.5-VL-32B-Instruct)|qwen2_5_vl|qwen2_5_vl|transformers>=4.49, qwen_vl_utils>=0.0.6, decord|✘|vision, video|[Qwen/Qwen2.5-VL-32B-Instruct](https://huggingface.co/Qwen/Qwen2.5-VL-32B-Instruct)| -|[Qwen/Qwen2.5-VL-72B-Instruct](https://modelscope.cn/models/Qwen/Qwen2.5-VL-72B-Instruct)|qwen2_5_vl|qwen2_5_vl|transformers>=4.49, qwen_vl_utils>=0.0.6, decord|✘|vision, video|[Qwen/Qwen2.5-VL-72B-Instruct](https://huggingface.co/Qwen/Qwen2.5-VL-72B-Instruct)| +|[bytedance-research/UI-TARS-2B-SFT](https://modelscope.cn/models/bytedance-research/UI-TARS-2B-SFT)|qwen2_vl|qwen2_vl|transformers>=4.45, qwen_vl_utils>=0.0.6, decord|✔|vision, video|[bytedance-research/UI-TARS-2B-SFT](https://huggingface.co/bytedance-research/UI-TARS-2B-SFT)| +|[bytedance-research/UI-TARS-7B-SFT](https://modelscope.cn/models/bytedance-research/UI-TARS-7B-SFT)|qwen2_vl|qwen2_vl|transformers>=4.45, qwen_vl_utils>=0.0.6, decord|✔|vision, video|[bytedance-research/UI-TARS-7B-SFT](https://huggingface.co/bytedance-research/UI-TARS-7B-SFT)| +|[bytedance-research/UI-TARS-7B-DPO](https://modelscope.cn/models/bytedance-research/UI-TARS-7B-DPO)|qwen2_vl|qwen2_vl|transformers>=4.45, qwen_vl_utils>=0.0.6, decord|✔|vision, video|[bytedance-research/UI-TARS-7B-DPO](https://huggingface.co/bytedance-research/UI-TARS-7B-DPO)| +|[bytedance-research/UI-TARS-72B-SFT](https://modelscope.cn/models/bytedance-research/UI-TARS-72B-SFT)|qwen2_vl|qwen2_vl|transformers>=4.45, qwen_vl_utils>=0.0.6, decord|✔|vision, video|[bytedance-research/UI-TARS-72B-SFT](https://huggingface.co/bytedance-research/UI-TARS-72B-SFT)| 
+|[bytedance-research/UI-TARS-72B-DPO](https://modelscope.cn/models/bytedance-research/UI-TARS-72B-DPO)|qwen2_vl|qwen2_vl|transformers>=4.45, qwen_vl_utils>=0.0.6, decord|✔|vision, video|[bytedance-research/UI-TARS-72B-DPO](https://huggingface.co/bytedance-research/UI-TARS-72B-DPO)| +|[allenai/olmOCR-7B-0225-preview](https://modelscope.cn/models/allenai/olmOCR-7B-0225-preview)|qwen2_vl|qwen2_vl|transformers>=4.45, qwen_vl_utils>=0.0.6, decord|✔|vision, video|[allenai/olmOCR-7B-0225-preview](https://huggingface.co/allenai/olmOCR-7B-0225-preview)| +|[Qwen/Qwen2.5-VL-3B-Instruct](https://modelscope.cn/models/Qwen/Qwen2.5-VL-3B-Instruct)|qwen2_5_vl|qwen2_5_vl|transformers>=4.49, qwen_vl_utils>=0.0.6, decord|✔|vision, video|[Qwen/Qwen2.5-VL-3B-Instruct](https://huggingface.co/Qwen/Qwen2.5-VL-3B-Instruct)| +|[Qwen/Qwen2.5-VL-7B-Instruct](https://modelscope.cn/models/Qwen/Qwen2.5-VL-7B-Instruct)|qwen2_5_vl|qwen2_5_vl|transformers>=4.49, qwen_vl_utils>=0.0.6, decord|✔|vision, video|[Qwen/Qwen2.5-VL-7B-Instruct](https://huggingface.co/Qwen/Qwen2.5-VL-7B-Instruct)| +|[Qwen/Qwen2.5-VL-32B-Instruct](https://modelscope.cn/models/Qwen/Qwen2.5-VL-32B-Instruct)|qwen2_5_vl|qwen2_5_vl|transformers>=4.49, qwen_vl_utils>=0.0.6, decord|✔|vision, video|[Qwen/Qwen2.5-VL-32B-Instruct](https://huggingface.co/Qwen/Qwen2.5-VL-32B-Instruct)| +|[Qwen/Qwen2.5-VL-72B-Instruct](https://modelscope.cn/models/Qwen/Qwen2.5-VL-72B-Instruct)|qwen2_5_vl|qwen2_5_vl|transformers>=4.49, qwen_vl_utils>=0.0.6, decord|✔|vision, video|[Qwen/Qwen2.5-VL-72B-Instruct](https://huggingface.co/Qwen/Qwen2.5-VL-72B-Instruct)| |[Qwen/Qwen2.5-VL-3B-Instruct-AWQ](https://modelscope.cn/models/Qwen/Qwen2.5-VL-3B-Instruct-AWQ)|qwen2_5_vl|qwen2_5_vl|transformers>=4.49, qwen_vl_utils>=0.0.6, decord|✘|vision, video|[Qwen/Qwen2.5-VL-3B-Instruct-AWQ](https://huggingface.co/Qwen/Qwen2.5-VL-3B-Instruct-AWQ)| |[Qwen/Qwen2.5-VL-7B-Instruct-AWQ](https://modelscope.cn/models/Qwen/Qwen2.5-VL-7B-Instruct-AWQ)|qwen2_5_vl|qwen2_5_vl|transformers>=4.49, qwen_vl_utils>=0.0.6, decord|✘|vision, video|[Qwen/Qwen2.5-VL-7B-Instruct-AWQ](https://huggingface.co/Qwen/Qwen2.5-VL-7B-Instruct-AWQ)| |[Qwen/Qwen2.5-VL-32B-Instruct-AWQ](https://modelscope.cn/models/Qwen/Qwen2.5-VL-32B-Instruct-AWQ)|qwen2_5_vl|qwen2_5_vl|transformers>=4.49, qwen_vl_utils>=0.0.6, decord|✘|vision, video|[Qwen/Qwen2.5-VL-32B-Instruct-AWQ](https://huggingface.co/Qwen/Qwen2.5-VL-32B-Instruct-AWQ)| diff --git a/docs/source_en/Megatron-SWIFT/Command-line-parameters.md b/docs/source_en/Megatron-SWIFT/Command-line-parameters.md index 0501c3db87..89963700d0 100644 --- a/docs/source_en/Megatron-SWIFT/Command-line-parameters.md +++ b/docs/source_en/Megatron-SWIFT/Command-line-parameters.md @@ -206,6 +206,10 @@ seq_length: Defaults to None, meaning it is set to `max_length`. To restrict the **Tuner Parameters**: - train_type: Options are `'lora'` and `'full'`. Default is `'full'`. +- 🔥freeze_llm: This parameter only takes effect for multimodal models and can be used in both full-parameter and LoRA training, but with different behaviors. In full-parameter training, setting `freeze_llm` to `True` will freeze the weights of the LLM component. In LoRA training with `target_modules` set to 'all-linear', setting `freeze_llm` to `True` will prevent LoRA modules from being added to the LLM component. The default value is `False`. +- 🔥freeze_vit: This parameter only applies to multimodal models and can be used in both full-parameter and LoRA training, though with different effects. 
In full-parameter training, setting `freeze_vit` to `True` will freeze the weights of the ViT component. In LoRA training with `target_modules` set to 'all-linear', setting `freeze_vit` to `True` will prevent LoRA modules from being added to the ViT component. The default value is `True`. + - Note: The term "ViT" here refers not only to the vision tower but also includes the audio tower. +- 🔥freeze_aligner: This parameter is only effective for multimodal models and can be used in both full-parameter and LoRA training, with differing outcomes. In full-parameter training, setting `freeze_aligner` to `True` will freeze the weights of the aligner (also known as the projector) component. In LoRA training with `target_modules` set to 'all-linear', setting `freeze_aligner` to `True` will prevent LoRA modules from being added to the aligner component. The default value is `True`. Full-parameter Training: @@ -249,6 +253,8 @@ Megatron training parameters are inherited from Megatron parameters and basic pa - If you wish to customize the attention_mask, you can set `--padding_free false`. - Note: The Megatron-SWIFT training feature prioritizes support for the padding-free format. Unless under special circumstances, please do not modify this value. - mlp_padding_free: The default is False. This is used for applying padding-free optimization to the MLP when padding_free is set to false. It allows for improved training speed and reduced memory usage while customizing the attention_mask. +- vit_gradient_checkpointing: Whether to enable gradient checkpointing for the ViT part during multimodal model training. Default: True. +- gradient_checkpointing_kwargs: Arguments passed to `torch.utils.checkpoint`. For example: set `--gradient_checkpointing_kwargs '{"use_reentrant": false}'`. Default: None. - 🔥packing: Whether to use sequence packing, defaults to False. Currently supports CPT/SFT/DPO. - packing_length: the length to use for packing. Defaults to None, in which case it is set to max_length. - streaming: Stream data loading and processing, default is False. diff --git a/docs/source_en/Megatron-SWIFT/Multimodal-Model.md b/docs/source_en/Megatron-SWIFT/Multimodal-Model.md new file mode 100644 index 0000000000..91becd4917 --- /dev/null +++ b/docs/source_en/Megatron-SWIFT/Multimodal-Model.md @@ -0,0 +1,158 @@ +# Multimodal Models + +ms-swift introduces Megatron's parallelization techniques to accelerate the training of large multimodal models. Currently, it supports CPT/SFT/DPO for models such as Qwen2.5-VL. For a complete list of supported models, please refer to the [Supported Models and Datasets documentation](../Instruction/Supported-models-and-datasets.md). + +For environment setup, please refer to the Megatron-SWIFT [Quick Start guide](./Quick-start.md). + +## Dense Model Full/LoRA Fine-tuning + +This section demonstrates fine-tuning the Qwen2.5-VL-7B-Instruct model on the LaTeX-OCR task using two 80GiB A100 GPUs, with both full-parameter fine-tuning and LoRA. The best practices described below can be completed within 10 minutes. 
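+
+The training and inference commands below all export `MAX_PIXELS=1003520`, which caps the pixel budget of each input image during preprocessing and therefore bounds the number of vision tokens per sample (1003520 = 1280 × 28 × 28, i.e. about 1280 vision tokens per image if one assumes a 28-pixel patch-merge unit). The snippet below is only a rough sketch of that capping idea for intuition; the real resizing is performed by `qwen_vl_utils`, and the helper name and rounding details here are assumptions rather than the library API.
+
+```python
+import math
+
+def cap_pixels(height: int, width: int, max_pixels: int = 1003520, factor: int = 28):
+    """Illustrative sketch: bound height * width by max_pixels while keeping the
+    aspect ratio and rounding both sides to a multiple of `factor` (assumed here to
+    be the 28-pixel patch-merge unit). The actual resizing is done by qwen_vl_utils."""
+    if height * width > max_pixels:
+        scale = math.sqrt(max_pixels / (height * width))
+        height, width = height * scale, width * scale
+    # Snap both sides to the patch grid, but never below one patch unit.
+    height = max(math.floor(height / factor) * factor, factor)
+    width = max(math.floor(width / factor) * factor, factor)
+    return height, width
+
+print(cap_pixels(2000, 1500))  # -> (1148, 840); 1148 * 840 = 964320 <= 1003520
+```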
+ +First, we need to convert the model weights from Hugging Face format to Megatron format: +```shell +CUDA_VISIBLE_DEVICES=0 \ +swift export \ + --model Qwen/Qwen2.5-VL-7B-Instruct \ + --to_mcore true \ + --torch_dtype bfloat16 \ + --output_dir Qwen2.5-VL-7B-Instruct-mcore \ + --test_convert_precision true +``` + +### Full + +The full-parameter training script is as follows: +```shell +# 2 * 72GiB; 4.1s/it +PYTORCH_CUDA_ALLOC_CONF='expandable_segments:True' \ +NPROC_PER_NODE=2 \ +MAX_PIXELS=1003520 \ +CUDA_VISIBLE_DEVICES=0,1 \ +megatron sft \ + --load Qwen2.5-VL-7B-Instruct-mcore \ + --dataset 'AI-ModelScope/LaTeX_OCR:human_handwrite#5000' \ + --tensor_model_parallel_size 2 \ + --sequence_parallel true \ + --packing true \ + --freeze_llm false \ + --freeze_vit true \ + --freeze_aligner true \ + --split_dataset_ratio 0.01 \ + --micro_batch_size 1 \ + --global_batch_size 4 \ + --recompute_granularity full \ + --recompute_method uniform \ + --recompute_num_layers 1 \ + --finetune true \ + --cross_entropy_loss_fusion true \ + --lr 1e-5 \ + --lr_warmup_fraction 0.05 \ + --min_lr 1e-6 \ + --max_epochs 1 \ + --save megatron_output/Qwen2.5-VL-7B-Instruct \ + --save_interval 200 \ + --vit_gradient_checkpointing true \ + --max_length 2048 \ + --num_workers 4 \ + --no_save_optim true \ + --no_save_rng true \ + --dataset_num_proc 8 +``` + +Convert Megatron-format weights saved with full parameters to Hugging Face format: + +- Note: `--mcore_model` should point to the parent directory of `iter_xxx`. By default, the checkpoint specified in `latest_checkpointed_iteration.txt` will be used. + +```shell +CUDA_VISIBLE_DEVICES=0 \ +swift export \ + --mcore_model megatron_output/Qwen2.5-VL-7B-Instruct/vx-xxx \ + --to_hf true \ + --torch_dtype bfloat16 \ + --output_dir megatron_output/Qwen2.5-VL-7B-Instruct/vx-xxx-hf \ + --test_convert_precision true +``` + +### LoRA + +The LoRA training script is as follows: +```shell +# 2 * 23GiB; 2.3s/it +PYTORCH_CUDA_ALLOC_CONF='expandable_segments:True' \ +NPROC_PER_NODE=2 \ +MAX_PIXELS=1003520 \ +CUDA_VISIBLE_DEVICES=0,1 \ +megatron sft \ + --load Qwen2.5-VL-7B-Instruct-mcore \ + --dataset 'AI-ModelScope/LaTeX_OCR:human_handwrite#5000' \ + --train_type lora \ + --lora_rank 8 \ + --lora_alpha 32 \ + --target_modules all-linear \ + --tensor_model_parallel_size 1 \ + --sequence_parallel true \ + --freeze_llm false \ + --freeze_vit true \ + --freeze_aligner true \ + --packing true \ + --split_dataset_ratio 0.01 \ + --micro_batch_size 1 \ + --global_batch_size 4 \ + --recompute_granularity full \ + --recompute_method uniform \ + --recompute_num_layers 1 \ + --finetune true \ + --cross_entropy_loss_fusion true \ + --lr 1e-4 \ + --lr_warmup_fraction 0.05 \ + --min_lr 1e-5 \ + --max_epochs 1 \ + --save megatron_output/Qwen2.5-VL-7B-Instruct \ + --save_interval 200 \ + --vit_gradient_checkpointing true \ + --max_length 2048 \ + --num_workers 4 \ + --no_save_optim true \ + --no_save_rng true \ + --dataset_num_proc 8 +``` + +Merge the LoRA-saved incremental weights and convert them to Hugging Face format: +```shell +CUDA_VISIBLE_DEVICES=0 \ +swift export \ + --mcore_adapters megatron_output/Qwen2.5-VL-7B-Instruct/vx-xxx \ + --to_hf true \ + --torch_dtype bfloat16 \ + --output_dir megatron_output/Qwen2.5-VL-7B-Instruct/vx-xxx-hf \ + --test_convert_precision true +``` + + +Finally, we use the generated Hugging Face format weights to perform inference on the validation set: +```shell +MAX_PIXELS=1003520 \ +CUDA_VISIBLE_DEVICES=0 \ +swift infer \ + --model 
megatron_output/Qwen2.5-VL-7B-Instruct/vx-xxx-hf \ + --attn_impl flash_attn \ + --stream true \ + --load_data_args true \ + --temperature 0 \ + --max_new_tokens 512 +``` + +The inference results are as follows: +``` +[QUERY] Using LaTeX to perform OCR on the image. +[LABELS] \forall x \in X , ( \alpha f ) ( x ) = \alpha f ( x ) +[RESPONSE] \forall x \in X , ( \alpha f ) ( x ) = \alpha f ( x ) +-------------------------------------------------- +[QUERY] Using LaTeX to perform OCR on the image. +[LABELS] \pi \int _ { c } ^ { d } \{ g ( y ) \} ^ { 2 } d y +[RESPONSE] \pi \int _ { c } ^ { d } \{ g ( y ) \} ^ { 2 } d y +-------------------------------------------------- +[QUERY] Using LaTeX to perform OCR on the image. +[LABELS] [ \frac 2 3 x ^ { \frac 3 2 } ] _ { 0 } ^ { 1 } +[RESPONSE] [ \frac 2 3 x ^ { \frac 3 2 } ] _ { 0 } ^ { 1 } +``` diff --git a/docs/source_en/Megatron-SWIFT/Quick-start.md b/docs/source_en/Megatron-SWIFT/Quick-start.md index 0e5ceca9d4..389ae8bea1 100644 --- a/docs/source_en/Megatron-SWIFT/Quick-start.md +++ b/docs/source_en/Megatron-SWIFT/Quick-start.md @@ -1,6 +1,6 @@ # Quick Start -ms-swift incorporates Megatron's parallelization techniques to accelerate the training of large models, including data parallelism, tensor parallelism, pipeline parallelism, sequence parallelism, context parallelism, and expert parallelism. It supports the pre-training and fine-tuning of models such as Qwen3, [Qwen3-MoE](https://github.com/modelscope/ms-swift/blob/main/examples/megatron/qwen3_moe.sh), Qwen2.5, Llama3, Deepseek-R1 and GLM4.5 series. For a complete list of supported models, please refer to the [Supported Models and Datasets documentation](../Instruction/Supported-models-and-datasets.md). We recommend using Megatron-SWIFT for MoE training; it can typically achieve a 10x speedup in training. +ms-swift incorporates Megatron's parallelization techniques to accelerate the training of large models, including data parallelism, tensor parallelism, pipeline parallelism, sequence parallelism, context parallelism, and expert parallelism. It supports CPT/SFT/DPO for models such as Qwen3, [Qwen3-MoE](https://github.com/modelscope/ms-swift/blob/main/examples/megatron/qwen3_moe.sh), Qwen2.5, Llama3, Deepseek-R1 and GLM4.5 series. For a complete list of supported models, please refer to the [Supported Models and Datasets documentation](../Instruction/Supported-models-and-datasets.md). We recommend using Megatron-SWIFT for MoE training; it can typically achieve a 10x speedup in training. ## Environment Setup diff --git a/docs/source_en/index.rst b/docs/source_en/index.rst index a7a3ac0811..c561735643 100644 --- a/docs/source_en/index.rst +++ b/docs/source_en/index.rst @@ -38,6 +38,7 @@ Swift DOCUMENTATION Megatron-SWIFT/Quick-start.md Megatron-SWIFT/Command-line-parameters.md Megatron-SWIFT/LoRA-Training.md + Megatron-SWIFT/Multimodal-Model.md .. 
toctree:: diff --git a/examples/megatron/multimodal/dense/dpo.sh b/examples/megatron/multimodal/dense/dpo.sh new file mode 100644 index 0000000000..edea6bdb35 --- /dev/null +++ b/examples/megatron/multimodal/dense/dpo.sh @@ -0,0 +1,39 @@ +# 4 * 60GiB 14s/it +PYTORCH_CUDA_ALLOC_CONF='expandable_segments:True' \ +NPROC_PER_NODE=4 \ +MAX_PIXELS=1003520 \ +CUDA_VISIBLE_DEVICES=0,1,2,3 \ +megatron rlhf \ + --rlhf_type dpo \ + --load Qwen2.5-VL-7B-Instruct-mcore \ + --dataset 'swift/RLAIF-V-Dataset#20000' \ + --train_type full \ + --tensor_model_parallel_size 4 \ + --sequence_parallel true \ + --freeze_llm false \ + --freeze_vit true \ + --freeze_aligner true \ + --packing true \ + --split_dataset_ratio 0.01 \ + --micro_batch_size 1 \ + --global_batch_size 4 \ + --recompute_granularity full \ + --recompute_method uniform \ + --recompute_num_layers 1 \ + --finetune true \ + --cross_entropy_loss_fusion true \ + --lr 1e-5 \ + --lr_warmup_fraction 0.05 \ + --min_lr 1e-6 \ + --max_epochs 1 \ + --save megatron_output/Qwen2.5-VL-7B-Instruct \ + --save_interval 200 \ + --vit_gradient_checkpointing true \ + --max_length 8192 \ + --num_workers 4 \ + --no_save_optim true \ + --no_save_rng true \ + --dataset_num_proc 16 \ + --attention_backend flash \ + --beta 0.1 \ + --loss_type sigmoid diff --git a/examples/megatron/multimodal/dense/full.sh b/examples/megatron/multimodal/dense/full.sh new file mode 100644 index 0000000000..3590fad38d --- /dev/null +++ b/examples/megatron/multimodal/dense/full.sh @@ -0,0 +1,34 @@ +# 2 * 72GiB; 4.1s/it +PYTORCH_CUDA_ALLOC_CONF='expandable_segments:True' \ +NPROC_PER_NODE=2 \ +MAX_PIXELS=1003520 \ +CUDA_VISIBLE_DEVICES=0,1 \ +megatron sft \ + --load Qwen2.5-VL-7B-Instruct-mcore \ + --dataset 'AI-ModelScope/LaTeX_OCR:human_handwrite#5000' \ + --tensor_model_parallel_size 2 \ + --sequence_parallel true \ + --packing true \ + --freeze_llm false \ + --freeze_vit true \ + --freeze_aligner true \ + --split_dataset_ratio 0.01 \ + --micro_batch_size 1 \ + --global_batch_size 4 \ + --recompute_granularity full \ + --recompute_method uniform \ + --recompute_num_layers 1 \ + --finetune true \ + --cross_entropy_loss_fusion true \ + --lr 1e-5 \ + --lr_warmup_fraction 0.05 \ + --min_lr 1e-6 \ + --max_epochs 1 \ + --save megatron_output/Qwen2.5-VL-7B-Instruct \ + --save_interval 200 \ + --vit_gradient_checkpointing true \ + --max_length 2048 \ + --num_workers 4 \ + --no_save_optim true \ + --no_save_rng true \ + --dataset_num_proc 8 diff --git a/examples/megatron/multimodal/dense/lora.sh b/examples/megatron/multimodal/dense/lora.sh new file mode 100644 index 0000000000..1e232f5f07 --- /dev/null +++ b/examples/megatron/multimodal/dense/lora.sh @@ -0,0 +1,38 @@ +# 2 * 23GiB; 2.3s/it +PYTORCH_CUDA_ALLOC_CONF='expandable_segments:True' \ +NPROC_PER_NODE=2 \ +MAX_PIXELS=1003520 \ +CUDA_VISIBLE_DEVICES=0,1 \ +megatron sft \ + --load Qwen2.5-VL-7B-Instruct-mcore \ + --dataset 'AI-ModelScope/LaTeX_OCR:human_handwrite#5000' \ + --train_type lora \ + --lora_rank 8 \ + --lora_alpha 32 \ + --target_modules all-linear \ + --tensor_model_parallel_size 1 \ + --sequence_parallel true \ + --freeze_llm false \ + --freeze_vit true \ + --freeze_aligner true \ + --packing true \ + --split_dataset_ratio 0.01 \ + --micro_batch_size 1 \ + --global_batch_size 4 \ + --recompute_granularity full \ + --recompute_method uniform \ + --recompute_num_layers 1 \ + --finetune true \ + --cross_entropy_loss_fusion true \ + --lr 1e-4 \ + --lr_warmup_fraction 0.05 \ + --min_lr 1e-5 \ + --max_epochs 1 \ + --save 
megatron_output/Qwen2.5-VL-7B-Instruct \ + --save_interval 200 \ + --vit_gradient_checkpointing true \ + --max_length 2048 \ + --num_workers 4 \ + --no_save_optim true \ + --no_save_rng true \ + --dataset_num_proc 8 diff --git a/swift/llm/template/base.py b/swift/llm/template/base.py index a1ef0b3905..c8a86cb3ad 100644 --- a/swift/llm/template/base.py +++ b/swift/llm/template/base.py @@ -1216,7 +1216,7 @@ def _encode_truncated(self, inputs: StdTemplateInputs): encoded[key] = value else: encoded = self._encode(inputs) - + self._handle_megatron_cp(encoded) # TODO: fix cp_size & cached_dataset input_ids = encoded.get('input_ids') labels = encoded.get('labels') loss_scale = encoded.get('loss_scale') @@ -1276,7 +1276,6 @@ def _encode(self, inputs: StdTemplateInputs) -> Dict[str, Any]: encoded['input_ids'] = input_ids encoded['labels'] = labels encoded['loss_scale'] = loss_scale - self._handle_megatron_cp(encoded) # TODO: fix cp_size & cached_dataset if encoded.get('labels') is not None: encoded['labels'][0] = -100 if encoded.get('loss_scale') is not None: @@ -1626,7 +1625,7 @@ def _data_collator(self, batch: List[Dict[str, Any]], *, padding_to: Optional[in res = {} if self.padding_free: assert len(batch) == 1, f'batch: {batch}' - for k in ['input_ids', 'labels', 'position_ids', 'loss_scale', 'channel']: + for k in ['input_ids', 'labels', 'position_ids', 'loss_scale', 'channel', 'real_position_ids']: v = batch[0].get(k) if v is not None: res[k] = v if k == 'channel' else [v] @@ -1648,9 +1647,10 @@ def _data_collator(self, batch: List[Dict[str, Any]], *, padding_to: Optional[in res[key] = val keys = [ - 'input_ids', 'inputs_embeds', 'attention_mask', 'labels', 'loss_scale', 'position_ids', 'token_type_ids' + 'input_ids', 'inputs_embeds', 'attention_mask', 'labels', 'loss_scale', 'position_ids', 'token_type_ids', + 'real_position_ids' ] - pad_values = [self.tokenizer.pad_token_id, 0., 0, -100, 0., 0., 0] + pad_values = [self.tokenizer.pad_token_id, 0., 0, -100, 0., 0., 0, 0.] # Convert to tensor and remove unnecessary dimensions. 
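+        # NOTE: `keys` and `pad_values` are index-aligned; the new 'real_position_ids' entry pads with 0.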
seq_lens = None for key in keys: @@ -1677,10 +1677,14 @@ def _data_collator(self, batch: List[Dict[str, Any]], *, padding_to: Optional[in if self.padding_free: cp_size = self.sequence_parallel_size if cp_size > 1: - padding_len = padding_to - seq_lens[0] - position_ids = res['position_ids'][0].tolist() - position_ids += list(range(cp_size * 2)) * (padding_len // (cp_size * 2)) - res['position_ids'] = [torch.tensor(position_ids)] + for key in ['position_ids', 'real_position_ids']: + padding_len = padding_to - seq_lens[0] + position_ids = res[key][0] + extended_position_ids = torch.arange(cp_size * 2).repeat(padding_len // (cp_size * 2)) + if position_ids.ndim == 3: # compat mrope + extended_position_ids = extended_position_ids[None, + None, :].expand(position_ids.shape[0], 1, -1) + res[key] = [torch.concat([position_ids, extended_position_ids], dim=-1)] else: seq_len = max(seq_lens) if padding_to is None else padding_to res['attention_mask'] = torch.tril(torch.ones( @@ -1694,13 +1698,16 @@ def _data_collator(self, batch: List[Dict[str, Any]], *, padding_to: Optional[in continue if self.use_megatron and not self.padding_free and key == 'attention_mask': continue - if padding_to is not None and not (self.padding_free and key == 'position_ids' + if padding_to is not None and not (self.padding_free and key in {'position_ids', 'real_position_ids'} and self.sequence_parallel_size > 1): padding_len = padding_to - seq_lens[0] if padding_len > 0: res[key][0] = F.pad(res[key][0], (0, padding_len) if padding_right else (padding_len, 0), 'constant', pad_value) - res[key] = self._pad_sequence(res[key], pad_value) + if key == 'real_position_ids': + res[key] = torch.concat(res[key], dim=-1) + else: + res[key] = self._pad_sequence(res[key], pad_value) # multimodal res.update(self._data_collator_mm_data(batch)) diff --git a/swift/llm/template/template/qwen.py b/swift/llm/template/template/qwen.py index b59bbfa2f3..f6a935f168 100644 --- a/swift/llm/template/template/qwen.py +++ b/swift/llm/template/template/qwen.py @@ -424,9 +424,7 @@ def _get_position_ids(self, inputs: Dict[str, Any]): def _data_collator(self, batch: List[Dict[str, Any]], *, padding_to: Optional[int] = None) -> Dict[str, Any]: res = super()._data_collator(batch, padding_to=padding_to) - if self.padding_free: - res['real_position_ids'] = self.concat_tensor(batch, 'real_position_ids', -1) - elif self.is_training: + if not self.padding_free and self.is_training: res['position_ids'] = self._get_position_ids(res) return res diff --git a/swift/megatron/argument/megatron_args.py b/swift/megatron/argument/megatron_args.py index 9628b4da6a..02b3721add 100644 --- a/swift/megatron/argument/megatron_args.py +++ b/swift/megatron/argument/megatron_args.py @@ -31,6 +31,9 @@ class RLHFMegatronArgumentsMixin: @dataclass class MegatronTunerMixin: train_type: Literal['lora', 'full'] = 'full' + freeze_llm: bool = False + freeze_vit: bool = True + freeze_aligner: bool = True # full freeze_parameters: List[str] = field(default_factory=list) freeze_parameters_regex: Optional[str] = None @@ -71,6 +74,8 @@ def load_tuner_config(adapter_load: Optional[str]) -> Dict[str, Any]: def __post_init__(self): if self.freeze_parameters_ratio > 0 and self.pipeline_model_parallel_size > 1: raise ValueError('`freeze_parameters_ratio` is not supported when `pipeline_model_parallel_size` > 1') + if self.target_regex: + self.target_modules = self.target_regex @dataclass @@ -94,6 +99,10 @@ class ExtraMegatronArguments(RLHFMegatronArgumentsMixin, MegatronTunerMixin): 
partial_rotary_factor: Optional[float] = None use_shared_expert_gate: Optional[bool] = None + # visual + vit_gradient_checkpointing: bool = True + gradient_checkpointing_kwargs: Optional[Union[dict, str]] = None + @dataclass class MegatronArguments(ExtraMegatronArguments): @@ -185,7 +194,8 @@ class MegatronArguments(ExtraMegatronArguments): group_query_attention: Optional[bool] = None num_query_groups: Optional[int] = None max_position_embeddings: Optional[int] = None - position_embedding_type: Literal['learned_absolute', 'rope', 'mrope', 'relative', 'none'] = 'rope' + position_embedding_type: Optional[Literal['learned_absolute', 'rope', 'mrope', 'relative', 'none']] = None + mrope_section: Optional[List[int]] = None rotary_base: Optional[int] = None rotary_percent: float = 1. rotary_interleaved: Optional[bool] = None @@ -376,10 +386,14 @@ def __post_init__(self): self.rope_scaling = json_parse_to_dict(self.rope_scaling) if 'type' in self.rope_scaling and 'rope_type' not in self.rope_scaling: self.rope_scaling['rope_type'] = self.rope_scaling['type'] + if self.gradient_checkpointing_kwargs is not None: + self.gradient_checkpointing_kwargs = json_parse_to_dict(self.gradient_checkpointing_kwargs) if self.eval_interval is None: self.eval_interval = self.save_interval if self.seq_length is None: self.seq_length = self.max_position_embeddings + if self.position_embedding_type is None: + self.position_embedding_type = 'rope' if self.tensorboard_dir is None and self.save is not None: self.tensorboard_dir = f'{self.save}/runs' self._init_moe() diff --git a/swift/megatron/argument/train_args.py b/swift/megatron/argument/train_args.py index 6f492dc7cd..9481b451df 100644 --- a/swift/megatron/argument/train_args.py +++ b/swift/megatron/argument/train_args.py @@ -17,7 +17,6 @@ class MegatronTrainArguments(MegatronArguments, BaseArguments): add_version: bool = True def init_model_args(self, tokenizer, config): - self.megatron_model_meta = get_megatron_model_meta(self.model_type) kwargs = self.megatron_model_meta.convert_hf_config(config) if self.new_special_tokens and kwargs['padded_vocab_size'] < len(tokenizer): kwargs['padded_vocab_size'] = math.ceil(len(tokenizer) / 128) * 128 @@ -28,6 +27,9 @@ def init_model_args(self, tokenizer, config): setattr(self, k, v) MegatronArguments.__post_init__(self) self.extra_args = self.parse_to_megatron() + self.extra_args['model_info'] = self.model_info + self.extra_args['model_meta'] = self.model_meta + self.extra_args['megatron_model_meta'] = self.megatron_model_meta def _init_save(self): init_process_group(backend=self.ddp_backend, timeout=self.ddp_timeout) @@ -46,6 +48,7 @@ def __post_init__(self): self.padding_free = True self.load = to_abspath(self.load, check_path_exist=True) BaseArguments.__post_init__(self) + self.megatron_model_meta = get_megatron_model_meta(self.model_type) if len(self.dataset) == 0 and len(self.cached_dataset) == 0: raise ValueError(f'self.dataset: {self.dataset}, self.cached_dataset: {self.cached_dataset}. 
' 'Please input the training dataset.') diff --git a/swift/megatron/init.py b/swift/megatron/init.py index a1d90db786..f9ae0747dd 100644 --- a/swift/megatron/init.py +++ b/swift/megatron/init.py @@ -518,6 +518,103 @@ def __repr__(self): TELinear.__repr__ = __repr__ +def _patch_mrope(): + from megatron.core.models.common.embeddings.rotary_pos_embedding import MultimodalRotaryEmbedding + from megatron.core import parallel_state + from megatron.core.models.common.embeddings.rope_utils import (get_pos_emb_on_this_cp_rank, + _apply_rotary_pos_emb_bshd) + from megatron.core.models.common.embeddings import rope_utils + from megatron.training import get_args + + def forward(self, position_ids, mrope_section: List[int], packed_seq: bool = False) -> torch.Tensor: + seq = position_ids.to(device=self.inv_freq.device, dtype=self.inv_freq.dtype) + + if self.seq_len_interpolation_factor is not None: + seq *= 1 / self.seq_len_interpolation_factor + + # shape (3, bs, dim, 1) + inv_freq_expanded = self.inv_freq[None, None, :, None].expand(3, seq.shape[1], -1, 1) + # shape (3, bs, 1, seq_length) + seq_expanded = seq[:, :, None, :].float() + # shape (3, bs, seq_length, dim) + freqs = (inv_freq_expanded @ seq_expanded).transpose(2, 3) + # first part even vector components, second part odd vector components, + # 2 * dim in dimension size + if not self.rotary_interleaved: + emb = torch.cat((freqs, freqs), dim=-1) # shape (3, bs, seq_length, 2 * dim) + else: + bs = freqs.shape[1] + emb = torch.stack((freqs.view(3, bs, -1, 1), freqs.view(3, bs, -1, 1)), + dim=-1).view(3, bs, freqs.shape[0], -1) + + # generate freqs with mrope_section + # shape (bs, seq_length, 2 * dim) + mrope_section = mrope_section * 2 + emb = torch.cat([m[i % 3] for i, m in enumerate(emb.split(mrope_section, dim=-1))], dim=-1) + + # shape (seq_length, bs, 1, 2 * dim) + emb = emb[..., None, :].transpose(0, 1).contiguous() + if parallel_state.get_context_parallel_world_size() > 1 and not packed_seq: + # slice rotary_pos_emb along sequence dimension and select the parition of the current + # CP rank + emb = get_pos_emb_on_this_cp_rank(emb, 0, parallel_state.get_context_parallel_group()) + return emb + + MultimodalRotaryEmbedding.forward = forward + _origin_apply_rotary_pos_emb_thd = rope_utils._apply_rotary_pos_emb_thd + + def _apply_rotary_pos_emb_thd( + t: torch.Tensor, + cu_seqlens: torch.Tensor, + freqs: torch.Tensor, + rotary_interleaved: bool = False, + multi_latent_attention: bool = False, + mscale: float = 1.0, + cp_group: torch.distributed.ProcessGroup = None, + ) -> torch.Tensor: + """A baseline implementation of applying RoPE for `thd` format. + + Args: + t (Tensor): Input tensor T is of shape [t, h, d] + cu_seqlens(Tensor): Cumulative sum of sequence lengths in a batch for `t`, + with shape [b + 1] and dtype torch.int32. + freqs (Tensor): Rotary Positional embedding tensor freq is of shape [max_s, 1, 1, d] + cp_group (torch.distributed.ProcessGroup): The context parallel group + + Returns: + Tensor: Shape [t, h, d]. The input tensor after applying RoPE. 
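+
+        Note: when `position_embedding_type` is 'mrope' (set automatically for
+        Qwen2-VL/Qwen2.5-VL by `convert_gpt_hf_config`), `freqs` holds per-token rotary
+        embeddings and RoPE is applied independently to each packed sub-sequence delimited
+        by `cu_seqlens`; otherwise the call falls through to the original Megatron
+        implementation.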
+ """ + args = get_args() + if args.position_embedding_type != 'mrope': + return _origin_apply_rotary_pos_emb_thd( + t, + cu_seqlens, + freqs, + rotary_interleaved=rotary_interleaved, + multi_latent_attention=multi_latent_attention, + mscale=mscale, + cp_group=cp_group, + ) + + if cp_group is None: + raise ValueError('cp_group must be provided for THD format RoPE') + cp_size = cp_group.size() + cu_seqlens = cu_seqlens // cp_size + seqlens = (cu_seqlens[1:] - cu_seqlens[:-1]).tolist() + + return torch.cat([ + _apply_rotary_pos_emb_bshd( + x.unsqueeze(1), + f, + rotary_interleaved=rotary_interleaved, + multi_latent_attention=multi_latent_attention, + mscale=mscale, + ) for x, f in zip(torch.split(t, seqlens), torch.split(freqs, seqlens)) + ]).squeeze(1) + + rope_utils._apply_rotary_pos_emb_thd = _apply_rotary_pos_emb_thd + + def _patch_megatron(): _patch_flash_attn() _patch_transformer_engine() @@ -527,6 +624,7 @@ def _patch_megatron(): _patch_TEGroupedLinear() _patch_TransformerLayer() _patch_compile_helpers() + _patch_mrope() from swift.megatron import tuners # patch lora try: _patch_torch_FileSystemReader() @@ -546,6 +644,8 @@ def _patch_megatron(): def init_megatron_env() -> None: if 'MEGATRON_LM_PATH' not in os.environ: + # TODO: Synchronization issues may occur in DDP scenarios + # if the distributed environment has not been initialized. os.environ['MEGATRON_LM_PATH'] = git_clone_github( 'https://github.com/NVIDIA/Megatron-LM', branch='core_r0.13.0') with safe_ddp_context(hash_id='megatron-lm'): diff --git a/swift/megatron/model/__init__.py b/swift/megatron/model/__init__.py index 3d13a8d1b5..3c882c9864 100644 --- a/swift/megatron/model/__init__.py +++ b/swift/megatron/model/__init__.py @@ -1,4 +1,4 @@ # Copyright (c) Alibaba, Inc. and its affiliates. -from . import gpt +from . import gpt, mm_gpt from .constant import MegatronModelType from .register import MegatronModelMeta, get_megatron_model_meta, register_megatron_model diff --git a/swift/megatron/model/constant.py b/swift/megatron/model/constant.py index 8eebb6aa76..56e2ea6707 100644 --- a/swift/megatron/model/constant.py +++ b/swift/megatron/model/constant.py @@ -1,3 +1,5 @@ # Copyright (c) Alibaba, Inc. and its affiliates. class MegatronModelType: gpt = 'gpt' + qwen2_vl = 'qwen2_vl' + qwen2_5_vl = 'qwen2_5_vl' diff --git a/swift/megatron/model/gpt/__init__.py b/swift/megatron/model/gpt/__init__.py index 32c2c9b861..9e2654620e 100644 --- a/swift/megatron/model/gpt/__init__.py +++ b/swift/megatron/model/gpt/__init__.py @@ -1,55 +1,64 @@ # Copyright (c) Alibaba, Inc. and its affiliates. 
from swift.llm import ModelType from ..constant import MegatronModelType +from ..gpt_model import GPTModel +from ..model_provider import model_provider from ..register import MegatronModelMeta, register_megatron_model from .config import convert_gpt_hf_config from .hf2mcore import convert_hf2mcore from .mcore2hf import convert_mcore2hf -from .model import model_provider register_megatron_model( - MegatronModelMeta(MegatronModelType.gpt, [ - ModelType.qwen2, - ModelType.qwen2_5, - ModelType.qwq, - ModelType.qwq_preview, - ModelType.qwen2_5_math, - ModelType.llama, - ModelType.llama3, - ModelType.llama3_1, - ModelType.llama3_2, - ModelType.longwriter_llama3_1, - ModelType.codefuse_codellama, - ModelType.marco_o1, - ModelType.deepseek, - ModelType.deepseek_r1_distill, - ModelType.yi, - ModelType.yi_coder, - ModelType.sus, - ModelType.skywork_o1, - ModelType.openbuddy_llama, - ModelType.openbuddy_llama3, - ModelType.megrez, - ModelType.reflection, - ModelType.numina, - ModelType.ziya, - ModelType.mengzi3, - ModelType.qwen3, - ModelType.qwen3_thinking, - ModelType.qwen3_nothinking, - ModelType.qwen2_moe, - ModelType.qwen3_moe, - ModelType.qwen3_moe_thinking, - ModelType.internlm3, - ModelType.mimo, - ModelType.mimo_rl, - ModelType.moonlight, - ModelType.deepseek_moe, - ModelType.deepseek_v2, - ModelType.deepseek_v2_5, - ModelType.deepseek_r1, - ModelType.dots1, - ModelType.ernie, - ModelType.glm4_5, - ModelType.deepseek_v3_1, - ], model_provider, convert_gpt_hf_config, convert_mcore2hf, convert_hf2mcore)) + MegatronModelMeta( + MegatronModelType.gpt, + [ + ModelType.qwen2, + ModelType.qwen2_5, + ModelType.qwq, + ModelType.qwq_preview, + ModelType.qwen2_5_math, + ModelType.llama, + ModelType.llama3, + ModelType.llama3_1, + ModelType.llama3_2, + ModelType.longwriter_llama3_1, + ModelType.codefuse_codellama, + ModelType.marco_o1, + ModelType.deepseek, + ModelType.deepseek_r1_distill, + ModelType.yi, + ModelType.yi_coder, + ModelType.sus, + ModelType.skywork_o1, + ModelType.openbuddy_llama, + ModelType.openbuddy_llama3, + ModelType.megrez, + ModelType.reflection, + ModelType.numina, + ModelType.ziya, + ModelType.mengzi3, + ModelType.qwen3, + ModelType.qwen3_thinking, + ModelType.qwen3_nothinking, + ModelType.qwen2_moe, + ModelType.qwen3_moe, + ModelType.qwen3_moe_thinking, + ModelType.internlm3, + ModelType.mimo, + ModelType.mimo_rl, + ModelType.moonlight, + ModelType.deepseek_moe, + ModelType.deepseek_v2, + ModelType.deepseek_v2_5, + ModelType.deepseek_r1, + ModelType.dots1, + ModelType.ernie, + ModelType.glm4_5, + ModelType.deepseek_v3_1, + ], + model_provider=model_provider, + model_cls=GPTModel, + convert_hf_config=convert_gpt_hf_config, + convert_mcore2hf=convert_mcore2hf, + convert_hf2mcore=convert_hf2mcore, + )) diff --git a/swift/megatron/model/gpt/config.py b/swift/megatron/model/gpt/config.py index ec58a28142..7b6a1803a1 100644 --- a/swift/megatron/model/gpt/config.py +++ b/swift/megatron/model/gpt/config.py @@ -39,6 +39,9 @@ def convert_gpt_hf_config(config) -> Dict[str, Any]: res['rotary_interleaved'] = True elif architectures == 'Glm4MoeForCausalLM': res['moe_router_score_function'] = 'sigmoid' + elif architectures in {'Qwen2VLForConditionalGeneration', 'Qwen2_5_VLForConditionalGeneration'}: + res['position_embedding_type'] = 'mrope' + res['mrope_section'] = res['rope_scaling']['mrope_section'] if first_k_dense_replace is not None: res['moe_layer_freq'] = f'[0]*{first_k_dense_replace}+[1]*{res["num_layers"] - first_k_dense_replace}' if res.get('moe_router_score_function', 'softmax') == 
'sigmoid': diff --git a/swift/megatron/model/gpt/hf2mcore.py b/swift/megatron/model/gpt/hf2mcore.py index 93a8d9b36e..76780be641 100644 --- a/swift/megatron/model/gpt/hf2mcore.py +++ b/swift/megatron/model/gpt/hf2mcore.py @@ -91,7 +91,7 @@ def set_mlp_state(args, mg_mlp, hf_mlp): def set_layer_state(args, mg_model, hf_model, layer_idx): mg_layer = mg_model.decoder.layers[layer_idx] - hf_layer = hf_model.model.layers[layer_idx] + hf_layer = hf_model.layers[layer_idx] if args.multi_latent_attention: set_mla_attn_state(args, mg_layer.self_attention, hf_layer.self_attn) mg_layer.input_layernorm.weight.data.copy_(hf_layer.input_layernorm.weight) @@ -115,4 +115,4 @@ def convert_hf2mcore(hf_model, mg_model): mg_model.output_layer.weight.data.copy_(hf_model.lm_head.weight) mg_model.decoder.final_layernorm.weight.data.copy_(hf_model.model.norm.weight) for layer_idx in range(args.num_layers): - set_layer_state(args, mg_model, hf_model, layer_idx) + set_layer_state(args, mg_model, hf_model.model, layer_idx) diff --git a/swift/megatron/model/gpt/mcore2hf.py b/swift/megatron/model/gpt/mcore2hf.py index bd0e480f65..3f063d4559 100644 --- a/swift/megatron/model/gpt/mcore2hf.py +++ b/swift/megatron/model/gpt/mcore2hf.py @@ -88,7 +88,7 @@ def set_mlp_state(args, mg_mlp, hf_mlp): def set_layer_state(args, mg_model, hf_model, layer_idx): mg_layer = mg_model.decoder.layers[layer_idx] - hf_layer = hf_model.model.layers[layer_idx] + hf_layer = hf_model.layers[layer_idx] if args.multi_latent_attention: set_mla_attn_state(args, mg_layer.self_attention, hf_layer.self_attn) @@ -113,4 +113,4 @@ def convert_mcore2hf(hf_model, mg_model): hf_model.lm_head.weight.data.copy_(mg_model.output_layer.weight) hf_model.model.norm.weight.data.copy_(mg_model.decoder.final_layernorm.weight) for layer_idx in range(args.num_layers): - set_layer_state(args, mg_model, hf_model, layer_idx) + set_layer_state(args, mg_model, hf_model.model, layer_idx) diff --git a/swift/megatron/model/gpt_model.py b/swift/megatron/model/gpt_model.py index 7b68c27ca8..f03a7c855e 100644 --- a/swift/megatron/model/gpt_model.py +++ b/swift/megatron/model/gpt_model.py @@ -86,7 +86,7 @@ def __init__( new_inv_freq, self.attention_scaling = get_rope_inv_freq() self.rotary_pos_emb.inv_freq = new_inv_freq.to(self.rotary_pos_emb.inv_freq.device) - if self.attention_scaling != 1 and config.apply_rope_fusion: + if (self.attention_scaling != 1 or position_embedding_type == 'mrope') and config.apply_rope_fusion: config.apply_rope_fusion = False logger.warning('`apply_rope_fusion` does not support `attention_scaling`. 
' f'Setting `config.apply_rope_fusion`: {config.apply_rope_fusion}') @@ -154,7 +154,7 @@ def forward( rotary_pos_emb = None rotary_pos_cos = None rotary_pos_sin = None - if self.position_embedding_type == 'rope': + if self.position_embedding_type in {'rope', 'mrope'}: if not self.training and self.config.flash_decode and inference_params: # Flash decoding uses precomputed cos and sin for RoPE rotary_pos_cos, rotary_pos_sin = self.rotary_pos_emb_cache.setdefault( @@ -162,16 +162,23 @@ def forward( self.rotary_pos_emb.get_cos_sin(inference_params.max_sequence_length), ) else: - rotary_seq_len = self.rotary_pos_emb.get_rotary_seq_len(inference_params, self.decoder, decoder_input, - self.config, packed_seq_params) + rotary_seq_len = RotaryEmbedding.get_rotary_seq_len(self, inference_params, self.decoder, decoder_input, + self.config, packed_seq_params) if self.hf_rope_scaling is not None: attention_scaling = dynamic_rope_update(self, self.rotary_pos_emb.inv_freq, rotary_seq_len) if attention_scaling is not None: self.attention_scaling = attention_scaling - rotary_pos_emb = self.rotary_pos_emb( - rotary_seq_len, - packed_seq=packed_seq_params is not None and packed_seq_params.qkv_format == 'thd', - ) + if self.position_embedding_type == 'mrope': + rotary_pos_emb = self.rotary_pos_emb( + position_ids, + mrope_section=self.mrope_section, + packed_seq=packed_seq_params is not None and packed_seq_params.qkv_format == 'thd', + ) + else: + rotary_pos_emb = self.rotary_pos_emb( + rotary_seq_len, + packed_seq=packed_seq_params is not None and packed_seq_params.qkv_format == 'thd', + ) if ((self.config.enable_cuda_graph or self.config.flash_decode) and rotary_pos_cos is not None and inference_params): sequence_len_offset = torch.tensor( diff --git a/swift/megatron/model/mm_gpt/__init__.py b/swift/megatron/model/mm_gpt/__init__.py new file mode 100644 index 0000000000..30f489086c --- /dev/null +++ b/swift/megatron/model/mm_gpt/__init__.py @@ -0,0 +1 @@ +from . 
import qwen2_5_vl diff --git a/swift/megatron/model/mm_gpt/qwen2_5_vl.py b/swift/megatron/model/mm_gpt/qwen2_5_vl.py new file mode 100644 index 0000000000..ba23e2ee8e --- /dev/null +++ b/swift/megatron/model/mm_gpt/qwen2_5_vl.py @@ -0,0 +1,148 @@ +import torch +from megatron.core.models.huggingface import HuggingFaceModule +from megatron.training import get_args, get_tokenizer + +from swift.llm import ModelType, get_model_tokenizer, to_device +from ..constant import MegatronModelType +from ..gpt.hf2mcore import set_layer_state as set_layer_state_hf2mcore +from ..gpt.mcore2hf import set_layer_state as set_layer_state_mcore2hf +from ..register import register_megatron_model +from .utils import MMGPTMegatronModelMeta, patch_device_map_meta + + +def convert_hf2mcore_qwen2_5_vl(hf_model, mg_model): + language_model = hf_model.model + if hasattr(language_model, 'language_model'): + language_model = language_model.language_model + visual = hf_model.visual if hasattr(hf_model, 'visual') else hf_model.model.visual + mg_language_model = mg_model.language_model + args = get_args() + mg_language_model.embedding.word_embeddings.weight.data.copy_(language_model.embed_tokens.weight) + if args.untie_embeddings_and_output_weights: + mg_language_model.output_layer.weight.data.copy_(hf_model.lm_head.weight) + mg_language_model.decoder.final_layernorm.weight.data.copy_(language_model.norm.weight) + for layer_idx in range(args.num_layers): + set_layer_state_hf2mcore(args, mg_language_model, language_model, layer_idx) + mg_model.visual.model.load_state_dict(visual.state_dict()) + + +def convert_mcore2hf_qwen2_5_vl(hf_model, mg_model): + language_model = hf_model.model + if hasattr(language_model, 'language_model'): + language_model = language_model.language_model + visual = hf_model.visual if hasattr(hf_model, 'visual') else hf_model.model.visual + mg_language_model = mg_model.language_model + args = get_args() + language_model.embed_tokens.weight.data.copy_(mg_language_model.embedding.word_embeddings.weight) + if args.untie_embeddings_and_output_weights: + hf_model.lm_head.weight.data.copy_(mg_language_model.output_layer.weight) + language_model.norm.weight.data.copy_(mg_language_model.decoder.final_layernorm.weight) + for layer_idx in range(args.num_layers): + set_layer_state_mcore2hf(args, mg_language_model, language_model, layer_idx) + visual.load_state_dict(mg_model.visual.model.state_dict()) + + +class Qwen2_5VL_Vit(HuggingFaceModule): + vision_tower = ['model'] + aligner = ['model.merger'] + version = 'v2_5' + + def __init__(self, config): + if self.version == 'v2_5': + try: + from transformers.models.qwen2_5_vl import Qwen2_5_VLTextModel + except ImportError: + from transformers.models.qwen2_5_vl import Qwen2_5_VLModel as Qwen2_5_VLTextModel + context = patch_device_map_meta(Qwen2_5_VLTextModel) + elif self.version == 'v2': + try: + from transformers.models.qwen2_vl import Qwen2VLTextModel + except ImportError: + from transformers.models.qwen2_vl import Qwen2VLModel as Qwen2VLTextModel + context = patch_device_map_meta(Qwen2VLTextModel) + super().__init__(config) + args = get_args() + model_dir = args.model_info.model_dir + kwargs = {'attn_impl': 'flash_attn'} if args.attention_backend.name == 'flash' else {} + with context: + model, _ = get_model_tokenizer(model_dir, args.torch_dtype, return_dummy_model=True, **kwargs) + self.model = model.visual.to('cuda') + self.model_config = model.config + self.processor = get_tokenizer() + + def forward(self, *args, **kwargs): + return self.model(*args, **kwargs) 
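+    # `get_inputs_embeds` below merges the ViT output back into the text embeddings via
+    # `masked_scatter` on the image/video token positions. For text-only samples, a dummy
+    # 32x32 black image is encoded and added as `image_embeds.mean() * 0.`, so the visual
+    # parameters still participate in the backward pass without changing the embeddings.
+    # The HF text model itself is built on the meta device via `patch_device_map_meta`, so
+    # only the visual tower is materialized here.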
+ + def get_inputs_embeds(self, inputs_embeds, **kwargs): + input_ids = kwargs['input_ids'] + pixel_values = kwargs.get('pixel_values') + pixel_values_videos = kwargs.get('pixel_values_videos') + image_grid_thw = kwargs.get('image_grid_thw') + video_grid_thw = kwargs.get('video_grid_thw') + dtype = self.model.dtype + if pixel_values is None and pixel_values_videos is None: # plain-text + from PIL import Image + images = [Image.new('RGB', (32, 32), (0, 0, 0))] + media_inputs = self.processor.image_processor(images=images, return_tensors='pt') + device = input_ids.device + media_inputs = to_device(media_inputs, device) + pixel_values = media_inputs['pixel_values'].type(dtype) + image_embeds = self.model(pixel_values, grid_thw=media_inputs['image_grid_thw']) + inputs_embeds = inputs_embeds + image_embeds.mean() * 0. + else: + if pixel_values is None: + pixel_values_mixed = pixel_values_videos + grid_thw = video_grid_thw + elif pixel_values_videos is None: + pixel_values_mixed = pixel_values + grid_thw = image_grid_thw + else: + pixel_values_mixed = torch.concat([pixel_values, pixel_values_videos], dim=0) + grid_thw = torch.concat([image_grid_thw, video_grid_thw], dim=0) + pixel_values_mixed = pixel_values_mixed.type(dtype) + mixed_embeds = self.model(pixel_values_mixed, grid_thw=grid_thw) + if pixel_values is None: + image_embeds = None + video_embeds = mixed_embeds + elif pixel_values_videos is None: + image_embeds = mixed_embeds + video_embeds = None + else: + merge_length = self.processor.image_processor.merge_size**2 + image_tokens = (image_grid_thw.prod(dim=-1) // merge_length).sum() + image_embeds = mixed_embeds[:image_tokens] + video_embeds = mixed_embeds[image_tokens:] + + if image_embeds is not None: + image_mask = (input_ids == self.model_config.image_token_id).unsqueeze(-1).expand_as(inputs_embeds) + image_embeds = image_embeds.to(inputs_embeds.device, inputs_embeds.dtype) + inputs_embeds = inputs_embeds.masked_scatter(image_mask, image_embeds) + + if video_embeds is not None: + video_mask = (input_ids == self.model_config.video_token_id).unsqueeze(-1).expand_as(inputs_embeds) + video_embeds = video_embeds.to(inputs_embeds.device, inputs_embeds.dtype) + inputs_embeds = inputs_embeds.masked_scatter(video_mask, video_embeds) + return inputs_embeds + + +class Qwen2VL_Vit(Qwen2_5VL_Vit): + version = 'v2' + + +register_megatron_model( + MMGPTMegatronModelMeta( + MegatronModelType.qwen2_5_vl, [ + ModelType.qwen2_5_vl, + ], + convert_hf2mcore=convert_hf2mcore_qwen2_5_vl, + convert_mcore2hf=convert_mcore2hf_qwen2_5_vl, + visual_cls=Qwen2_5VL_Vit)) + +register_megatron_model( + MMGPTMegatronModelMeta( + MegatronModelType.qwen2_vl, [ + ModelType.qwen2_vl, + ], + convert_hf2mcore=convert_hf2mcore_qwen2_5_vl, + convert_mcore2hf=convert_mcore2hf_qwen2_5_vl, + visual_cls=Qwen2VL_Vit)) diff --git a/swift/megatron/model/mm_gpt/utils.py b/swift/megatron/model/mm_gpt/utils.py new file mode 100644 index 0000000000..2b85c9f0f7 --- /dev/null +++ b/swift/megatron/model/mm_gpt/utils.py @@ -0,0 +1,44 @@ +from contextlib import contextmanager +from dataclasses import dataclass +from typing import Any, Callable, Dict, Type + +import torch +from torch import nn +from transformers import PretrainedConfig + +from ..gpt.config import convert_gpt_hf_config +from ..mm_gpt_model import MultimodalGPTModel +from ..model_provider import model_provider as model_provider_func +from ..register import MegatronModelMeta + + +@contextmanager +def patch_device_map_meta(model_cls): + __origin_init__ = model_cls.__init__ + + 
def __init__(self, *args, **kwargs): + with torch.device('meta'): + __origin_init__(self, *args, **kwargs) + + model_cls.__init__ = __init__ + + from transformers import PreTrainedModel + _origin_initialize_weight = PreTrainedModel._initialize_weights + + def _initialize_weight(self, *args, **kwargs): + return + + PreTrainedModel._initialize_weights = _initialize_weight + + try: + yield + finally: + model_cls.__init__ = __origin_init__ + PreTrainedModel._initialize_weights = _origin_initialize_weight + + +@dataclass +class MMGPTMegatronModelMeta(MegatronModelMeta): + model_cls: Type[nn.Module] = MultimodalGPTModel + model_provider: Callable[[], nn.Module] = model_provider_func + convert_hf_config: Callable[[PretrainedConfig], Dict[str, Any]] = convert_gpt_hf_config diff --git a/swift/megatron/model/mm_gpt_model.py b/swift/megatron/model/mm_gpt_model.py new file mode 100644 index 0000000000..d4871b5b54 --- /dev/null +++ b/swift/megatron/model/mm_gpt_model.py @@ -0,0 +1,98 @@ +from contextlib import contextmanager + +import torch +from megatron.core import InferenceParams +from megatron.core.packed_seq_params import PackedSeqParams +from megatron.core.tensor_parallel import VocabParallelEmbedding, scatter_to_sequence_parallel_region +from megatron.core.transformer.module import MegatronModule +from megatron.core.transformer.spec_utils import ModuleSpec +from megatron.core.transformer.transformer_config import TransformerConfig +from megatron.training import get_args + +from .gpt_model import GPTModel + + +class MultimodalGPTModel(MegatronModule): + + def __init__(self, + config: TransformerConfig, + transformer_layer_spec: ModuleSpec, + vocab_size: int, + max_sequence_length: int, + pre_process: bool = True, + post_process: bool = True, + *args, + **kwargs): + super().__init__(config) + self.pre_process = pre_process + self.post_process = post_process + self.language_model = GPTModel(config, transformer_layer_spec, vocab_size, max_sequence_length, pre_process, + post_process, *args, **kwargs) + + self.share_embeddings_and_output_weights = self.language_model.share_embeddings_and_output_weights + args = get_args() + self.visual = None + if pre_process and args.megatron_model_meta.visual_cls is not None: + self.visual = args.megatron_model_meta.visual_cls(config) + + @contextmanager + def _patch_word_embeddings(self, kwargs): + origin_forward = VocabParallelEmbedding.forward + + def forward(_self, input_): + reduce_scatter_embeddings = _self.reduce_scatter_embeddings + _self.reduce_scatter_embeddings = False + res = origin_forward(_self, input_) + _self.reduce_scatter_embeddings = reduce_scatter_embeddings + if self.visual is not None: + res = self.visual.get_inputs_embeds(res, **kwargs) + if reduce_scatter_embeddings: + res = res.transpose(0, 1).contiguous() + res = scatter_to_sequence_parallel_region(res, group=_self.tp_group) + return res + + VocabParallelEmbedding.forward = forward + try: + yield + finally: + VocabParallelEmbedding.forward = origin_forward + + # Code borrowed from NVIDIA/Megatron-LM + def forward( + self, + input_ids: torch.Tensor, + position_ids: torch.Tensor, + attention_mask: torch.Tensor = None, + decoder_input: torch.Tensor = None, + labels: torch.Tensor = None, + inference_params: InferenceParams = None, + packed_seq_params: PackedSeqParams = None, + **kwargs, + ) -> torch.Tensor: + if decoder_input is not None: + pass + elif self.pre_process: + from ..trainers.utils import get_batch_on_this_cp_rank + kwargs.update({'input_ids': input_ids}) + with 
self._patch_word_embeddings(kwargs): + decoder_input = self.language_model.embedding(input_ids=input_ids, position_ids=position_ids) + decoder_input = get_batch_on_this_cp_rank({ + 'decoder_input': decoder_input, + 'packed_seq_params': packed_seq_params + })['decoder_input'] + else: + # intermediate stage of pipeline + # decoder will get hidden_states from encoder.input_tensor + decoder_input = None + return self.language_model( + input_ids=input_ids, + position_ids=position_ids, + attention_mask=attention_mask, + decoder_input=decoder_input, + labels=labels, + inference_params=inference_params, + packed_seq_params=packed_seq_params, + ) + + def set_input_tensor(self, input_tensor: torch.Tensor) -> None: + return self.language_model.set_input_tensor(input_tensor) diff --git a/swift/megatron/model/gpt/model.py b/swift/megatron/model/model_provider.py similarity index 94% rename from swift/megatron/model/gpt/model.py rename to swift/megatron/model/model_provider.py index 42cd69f375..1eeff0b8a2 100644 --- a/swift/megatron/model/gpt/model.py +++ b/swift/megatron/model/model_provider.py @@ -1,5 +1,5 @@ # Copyright (c) Alibaba, Inc. and its affiliates. -from typing import Union +from typing import TYPE_CHECKING, Union import megatron.legacy import torch @@ -12,11 +12,14 @@ from megatron.training.arguments import core_transformer_config_from_args from megatron.training.yaml_arguments import core_transformer_config_from_yaml -from ..gpt_model import GPTModel +if TYPE_CHECKING: + from .gpt_model import GPTModel + from .mm_gpt import MultimodalGPTModel # Code borrowed from NVIDIA/Megatron-LM -def model_provider(pre_process=True, post_process=True) -> Union[GPTModel, megatron.legacy.model.GPTModel]: +def model_provider(pre_process=True, + post_process=True) -> Union['GPTModel', 'MultimodalGPTModel', megatron.legacy.model.GPTModel]: """Builds the model. If you set the use_legacy_models to True, it will return the legacy GPT model and if not the mcore GPT model. @@ -97,7 +100,7 @@ def oom_observer(device, alloc, device_alloc, device_free): # qwen2_moe for layer_spec in transformer_layer_spec.layer_specs: layer_spec.submodules.mlp.submodules.shared_experts.params = {'gate': True} - model = GPTModel( + model = args.megatron_model_meta.model_cls( config=config, transformer_layer_spec=transformer_layer_spec, vocab_size=args.padded_vocab_size, diff --git a/swift/megatron/model/register.py b/swift/megatron/model/register.py index 950a68ede2..8ed93f1ac5 100644 --- a/swift/megatron/model/register.py +++ b/swift/megatron/model/register.py @@ -1,7 +1,7 @@ # Copyright (c) Alibaba, Inc. and its affiliates. 
from argparse import ArgumentParser from dataclasses import dataclass -from typing import Any, Callable, Dict, List, Optional +from typing import Any, Callable, Dict, List, Optional, Type import torch.nn as nn from transformers import PretrainedConfig @@ -16,11 +16,14 @@ class MegatronModelMeta: megatron_model_type: str model_types: List[str] - model_provider: Callable[[], nn.Module] - convert_hf_config: Callable[[PretrainedConfig], Dict[str, Any]] convert_mcore2hf: Callable[[nn.Module, nn.Module], None] convert_hf2mcore: Callable[[nn.Module, nn.Module], None] + model_cls: Type[nn.Module] + model_provider: Callable[[], nn.Module] + convert_hf_config: Callable[[PretrainedConfig], Dict[str, Any]] + visual_cls: Optional[Type[nn.Module]] = None + extra_args_provider: Optional[Callable[[ArgumentParser], ArgumentParser]] = None diff --git a/swift/megatron/train/sft.py b/swift/megatron/train/sft.py index f7eb8b57ec..289529a5f9 100644 --- a/swift/megatron/train/sft.py +++ b/swift/megatron/train/sft.py @@ -3,6 +3,8 @@ from functools import partial from typing import List, Optional, Union +import torch + from swift.llm.train import SwiftSft from swift.utils import get_logger, is_master, plot_images from ..argument import MegatronTrainArguments @@ -24,12 +26,17 @@ def __init__(self, args: Optional[Union[List[str], MegatronTrainArguments]] = No self.train_msg = {} super(SwiftSft, self).__init__(args) args = self.args - _, self.processor = args.get_model_processor(load_model=False) + if args.model_meta.is_multimodal: + kwargs = {'return_dummy_model': True} + else: + kwargs = {'load_model': False} + with torch.device('meta'): + self.model, self.processor = args.get_model_processor(**kwargs) + self._prepare_template() patch_megatron_tokenizer(self.processor) + args.save_args(args.save) args.init_model_args(self.processor, self.processor.model_info.config) - self._prepare_template() self.template.use_megatron = True - args.save_args(args.save) self.trainer = self.prepare_trainer() def _get_data_collator(self): @@ -56,8 +63,6 @@ def run(self): if val_dataset is not None: val_dataset = build_streaming_dataloader(args, val_dataset, data_collator) - logging_path = os.path.join(args.save, 'logging.jsonl') - logger.info(f'The logging file will be saved in: {logging_path}') try: self.trainer.train(train_dataset, val_dataset, data_collator) finally: diff --git a/swift/megatron/trainers/base.py b/swift/megatron/trainers/base.py index afd6152a69..ad17437a42 100644 --- a/swift/megatron/trainers/base.py +++ b/swift/megatron/trainers/base.py @@ -242,10 +242,11 @@ def new_model_provider_func(*args, **kwargs): self.peft_model = prepare_mcore_model(self.unwrapped_model) return self.unwrapped_model + args = get_args() + self._init_multimodal_full(args) with self._patch_load_state_dict(self._load_base_checkpoint): model, optimizer, opt_param_scheduler = self._origin_setup_model_and_optimizer( new_model_provider_func, model_type, *_args, **kwargs) - args = get_args() if args.initialize_embedding: self._initialize_embedding(self.unwrapped_model) if args.train_type != 'full' and args.modules_to_save: @@ -258,8 +259,20 @@ def new_model_provider_func(*args, **kwargs): with adapter_state_dict_context(): args.iteration, args.num_floating_point_operations_so_far = load_checkpoint( model, optimizer, opt_param_scheduler, load_arg='adapter_load', strict=False) + if args.model_meta.is_multimodal: + self._prepare_vit_gradient_checkpointing() return model, optimizer, opt_param_scheduler + def _prepare_vit_gradient_checkpointing(self): + 
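+        # Activation checkpointing for the HF ViT wrapped by the Megatron model: controlled by
+        # `--vit_gradient_checkpointing` and `--gradient_checkpointing_kwargs`;
+        # `enable_input_require_grads` keeps gradients flowing into the checkpointed ViT blocks
+        # even when their inputs do not require grad.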
visual = self.unwrapped_model.visual + if visual is None: + return + visual = visual.model + args = get_args() + if args.vit_gradient_checkpointing: + visual.gradient_checkpointing_enable(**(args.gradient_checkpointing_kwargs or {})) + visual.enable_input_require_grads() + @staticmethod def _initialize_embedding(model): # compat new_special_tokens @@ -702,6 +715,25 @@ def _patch_megatron(self): self._origin_save_checkpoint = training.save_checkpoint training.save_checkpoint = self.save_checkpoint + @staticmethod + def _init_multimodal_full(args): + visual_cls = args.megatron_model_meta.visual_cls + if args.train_type == 'full' and args.model_meta.is_multimodal and visual_cls is not None: + vision_tower = [f'visual.{vit}' for vit in visual_cls.vision_tower] + aligner = [f'visual.{_aligner}' for _aligner in visual_cls.aligner] + if args.freeze_llm: + args.freeze_parameters.append('language_model') + if args.freeze_vit: + args.freeze_parameters += vision_tower + if args.freeze_aligner: + args.freeze_parameters += aligner + else: + args.trainable_parameters += aligner + if args.freeze_parameters: + logger.info(f'freeze_parameters: {args.freeze_parameters}') + if args.trainable_parameters: + logger.info(f'additional trainable_parameters: {args.trainable_parameters}') + def train(self, train_dataset, val_dataset, data_collator): args = self.args datasets_provider = get_swift_datasets_provider(train_dataset, val_dataset) diff --git a/swift/megatron/trainers/dpo_trainer.py b/swift/megatron/trainers/dpo_trainer.py index 7798de2b08..aef42560c1 100644 --- a/swift/megatron/trainers/dpo_trainer.py +++ b/swift/megatron/trainers/dpo_trainer.py @@ -179,9 +179,7 @@ def _replace_data_iterator(self, data_iterator): return iter(res) def forward_step(self, data_iterator, model): - with torch.no_grad(): - data = next(data_iterator) - + data = next(data_iterator) ref_logps = data.pop('logps') with self.stimer: output_tensor = model(**data) diff --git a/swift/megatron/trainers/utils.py b/swift/megatron/trainers/utils.py index a05b57f75f..30b90400f4 100644 --- a/swift/megatron/trainers/utils.py +++ b/swift/megatron/trainers/utils.py @@ -65,22 +65,41 @@ def get_packed_seq_params(position_ids: torch.Tensor) -> PackedSeqParams: def _split_tokens(tokens, cu_seqlens): - assert tokens.shape[0] == 1, f'tokens.shape: {tokens.shape}' + assert tokens.shape[-2] == 1, f'tokens.shape: {tokens.shape}' # [..., 1, L] new_tokens = [] cp_size = mpu.get_context_parallel_world_size() cp_rank = mpu.get_context_parallel_rank() for i in range(cu_seqlens.shape[0] - 1): - val = tokens[:, cu_seqlens[i]:cu_seqlens[i + 1]] + val = tokens[..., cu_seqlens[i]:cu_seqlens[i + 1]] val = val.view( - tokens.shape[0], + *tokens.shape[:-1], 2 * cp_size, - val.shape[1] // (2 * cp_size), + val.shape[-1] // (2 * cp_size), ) index = torch.tensor([cp_rank, (2 * cp_size - cp_rank - 1)], device='cpu', pin_memory=True).cuda(non_blocking=True) - val = val.index_select(1, index) - new_tokens.append(val.view(tokens.shape[0], -1)) - return torch.cat(new_tokens, dim=1) + val = val.index_select(-2, index) + new_tokens.append(val.view(*tokens.shape[:-1], -1)) + return torch.cat(new_tokens, dim=-1) + + +def _split_tokens_decoder_input(tokens, cu_seqlens): + assert tokens.shape[1] == 1, f'tokens.shape: {tokens.shape}' # [L, 1, E] + new_tokens = [] + cp_size = mpu.get_context_parallel_world_size() + cp_rank = mpu.get_context_parallel_rank() + for i in range(cu_seqlens.shape[0] - 1): + val = tokens[cu_seqlens[i]:cu_seqlens[i + 1], ...] 
+ val = val.view( + 2 * cp_size, + val.shape[0] // (2 * cp_size), + *tokens.shape[1:], + ) + index = torch.tensor([cp_rank, (2 * cp_size - cp_rank - 1)], device='cpu', + pin_memory=True).cuda(non_blocking=True) + val = val.index_select(0, index) + new_tokens.append(val.view(-1, *tokens.shape[1:])) + return torch.cat(new_tokens, dim=0) def get_batch_on_this_cp_rank(batch: Dict[str, Any]): @@ -96,14 +115,23 @@ def get_batch_on_this_cp_rank(batch: Dict[str, Any]): # that we can get balanced workload among GPUs in a context parallel group. cp_size = mpu.get_context_parallel_world_size() if cp_size > 1: + args = get_args() + keys = ['labels', 'attention_mask', 'position_ids', 'loss_scale'] + if args.model_meta.is_multimodal: + keys.append('decoder_input') + else: + keys.append('input_ids') packed_seq_params = batch.get('packed_seq_params') if packed_seq_params is None: return mcore_get_batch_on_this_cp_rank(batch) for key, val in batch.items(): - if key in {'packed_seq_params', 'channel'}: + if key not in keys: continue if val is not None: - batch[key] = _split_tokens(val, packed_seq_params.cu_seqlens_q) + if key == 'decoder_input': + batch[key] = _split_tokens_decoder_input(val, packed_seq_params.cu_seqlens_q) + else: + batch[key] = _split_tokens(val, packed_seq_params.cu_seqlens_q) return batch @@ -118,5 +146,8 @@ def get_batch(data_iterator): batch['packed_seq_params'] = get_packed_seq_params(batch['position_ids']) batch['packed_seq_params'].num_samples = num_samples # slice batch along sequence dimension for context parallelism + position_ids = batch.pop('real_position_ids', None) # fix Qwen2.5-VL + if position_ids is not None: + batch['position_ids'] = position_ids batch = get_batch_on_this_cp_rank(batch) return batch diff --git a/swift/megatron/utils/convert.py b/swift/megatron/utils/convert.py index cc8a70307c..19b836bd79 100644 --- a/swift/megatron/utils/convert.py +++ b/swift/megatron/utils/convert.py @@ -3,6 +3,7 @@ import math from contextlib import contextmanager from dataclasses import fields +from typing import Any, Dict import torch import torch.nn as nn @@ -39,17 +40,23 @@ def _test_params_sum(model): logger.info(f'zero_count: {zero_count}') -def _find_modules(model, recurse: bool = True): +def _find_modules(model, recurse: bool = True, prefix='', ignore_modules=None): + ignore_modules = ignore_modules or [] + for k in ignore_modules: + if prefix.startswith(k): + return [] + else: + named_children = list(model.named_children()) + modules = [] - children = list(model.children()) - for module in children: + for n, module in named_children: if module.__class__ is nn.ModuleList: - modules += _find_modules(module, False) + modules += _find_modules(module, False, prefix=f'{prefix}{n}.', ignore_modules=ignore_modules) elif recurse: - modules += _find_modules(module) + modules += _find_modules(module, prefix=f'{prefix}{n}.', ignore_modules=ignore_modules) else: modules.append(module) - if not children: + if not named_children: modules.append(model) return modules @@ -78,34 +85,68 @@ def _to_cpu_hook(module, args, output): hook.remove() +def get_examples(is_multimodal: bool) -> Dict[str, Any]: + if is_multimodal: + data = { + 'messages': [{ + 'role': 'user', + 'content': 'describe the image.' + }, { + 'role': + 'assistant', + 'content': + 'The image depicts a close-up of a kitten with striking features. ' + 'The kitten has a white and gray coat with distinct black stripes, ' + 'particularly noticeable on its face and ears. 
Its eyes are large ' + 'and expressive, with a captivating blue hue that stands out against ' + "the darker fur around them. The kitten's nose is small and pink, " + 'and it has long, delicate whiskers extending from either side of its mouth. ' + "The background is blurred, drawing attention to the kitten's face and " + 'making it the focal point of the image. The overall impression is ' + 'one of cuteness and charm.' + }], + 'images': ['http://modelscope-open.oss-cn-hangzhou.aliyuncs.com/images/cat.png'] + } + else: + data = { + 'messages': [ + { + 'role': 'user', + 'content': 'Introduction to ms-swift.' + }, + { + 'role': + 'assistant', + 'content': + 'ms-swift is an official framework provided by the ModelScope community for fine-tuning ' + 'and deploying large language models and multi-modal large models.' + }, + ] + } + return data + + def test_convert_precision(hf_model, mg_model, template, torch_dtype=torch.float32): _test_params_sum(hf_model) _test_params_sum(mg_model) template.set_mode('train') - inputs = template.encode({ - 'messages': [ - { - 'role': 'user', - 'content': 'Introduction to ms-swift.' - }, - { - 'role': - 'assistant', - 'content': - 'ms-swift is an official framework provided by the ModelScope community for fine-tuning ' - 'and deploying large language models and multi-modal large models.' - }, - ] - }) + template.register_post_encode_hook([hf_model]) + is_multimodal = template.model_meta.is_multimodal + inputs = get_examples(is_multimodal) + inputs = template.encode(inputs) inputs = to_device(template.data_collator([inputs]), 'cuda') HfConfigFactory.set_model_config_attr(hf_model, 'use_cache', False) - share_embedding = mg_model.share_embeddings_and_output_weights - hf_modules = _find_modules(hf_model) + mg_language_model = mg_model.language_model if is_multimodal else mg_model + share_embedding = mg_language_model.share_embeddings_and_output_weights + model_arch = hf_model.model_meta.model_arch + ignore_modules = (model_arch.vision_tower + model_arch.aligner) if is_multimodal else [] + + hf_modules = _find_modules(hf_model, ignore_modules=ignore_modules) with torch.inference_mode(), _model_cpu_forward_context(hf_modules, torch_dtype, share_embedding=share_embedding): hf_logits = hf_model(**inputs).logits - hf_model = hf_model.to('cpu') + hf_model.to('cpu') input_ids = inputs['input_ids'] attention_mask, _, position_ids = get_ltor_masks_and_position_ids(input_ids, -100, True, True, True) @@ -116,15 +157,15 @@ def test_convert_precision(hf_model, mg_model, template, torch_dtype=torch.float # mg_torch_dtype = None # packed_seq_params = get_packed_seq_params(position_ids) # attention_mask = None - mg_model.config.fp8 = None # compat fp8 - mg_modules = _find_modules(mg_model) + mg_language_model.config.fp8 = None # compat fp8 + mg_modules = _find_modules(mg_language_model, ignore_modules=['visual']) + kwargs = {k: v for k, v in inputs.items() if k not in ['input_ids', 'attention_mask', 'labels']} + if 'position_ids' not in kwargs: + kwargs['position_ids'] = position_ids with torch.inference_mode(), _model_cpu_forward_context( mg_modules, mg_torch_dtype, 'cuda', share_embedding=share_embedding): mg_logits = mg_model( - input_ids=input_ids, - attention_mask=attention_mask, - position_ids=position_ids, - packed_seq_params=packed_seq_params) + input_ids=input_ids, attention_mask=attention_mask, packed_seq_params=packed_seq_params, **kwargs) token_mean_diff = (mg_logits - hf_logits).abs().mean(dim=-1) mean_diff = token_mean_diff.mean().item() @@ -165,7 +206,10 @@ def 
convert_hf2mcore(args: ExportArguments) -> None: megatron_model_meta = get_megatron_model_meta(args.model_type) assert megatron_model_meta is not None, f'Model: {args.model} is not supported.' - kwargs = megatron_model_meta.convert_hf_config(processor.model_info.config) + config = processor.model_info.config + if args.model_meta.is_multimodal and hasattr(config, 'text_config'): + config = config.text_config + kwargs = megatron_model_meta.convert_hf_config(config) logger.info(f'megatron_config: {kwargs}') _check_megatron_kwargs(kwargs) current_convert_kwargs = convert_kwargs.copy() @@ -175,6 +219,9 @@ def convert_hf2mcore(args: ExportArguments) -> None: **kwargs, **current_convert_kwargs, save=args.output_dir, torch_dtype=args.torch_dtype) patch_megatron_tokenizer(processor) extra_args = megatron_args.parse_to_megatron() + extra_args['model_info'] = args.model_info + extra_args['model_meta'] = args.model_meta + extra_args['megatron_model_meta'] = megatron_model_meta extra_args_provider = megatron_model_meta.extra_args_provider initialize_megatron(extra_args_provider=extra_args_provider, args_defaults=extra_args) @@ -198,7 +245,10 @@ def convert_mcore2hf(args: ExportArguments) -> None: megatron_model_meta = get_megatron_model_meta(args.model_type) assert megatron_model_meta is not None, f'Model: {args.model} is not supported.' - kwargs = megatron_model_meta.convert_hf_config(processor.model_info.config) + config = processor.model_info.config + if args.model_meta.is_multimodal and hasattr(config, 'text_config'): + config = config.text_config + kwargs = megatron_model_meta.convert_hf_config(config) logger.info(f'megatron_config: {kwargs}') _check_megatron_kwargs(kwargs) current_convert_kwargs = convert_kwargs.copy() @@ -217,6 +267,9 @@ def convert_mcore2hf(args: ExportArguments) -> None: torch_dtype=args.torch_dtype) patch_megatron_tokenizer(processor) extra_args = megatron_args.parse_to_megatron() + extra_args['model_info'] = args.model_info + extra_args['model_meta'] = args.model_meta + extra_args['megatron_model_meta'] = megatron_model_meta extra_args_provider = megatron_model_meta.extra_args_provider initialize_megatron(extra_args_provider=extra_args_provider, args_defaults=extra_args) diff --git a/swift/megatron/utils/utils.py b/swift/megatron/utils/utils.py index db260bf3c3..e1611326c1 100644 --- a/swift/megatron/utils/utils.py +++ b/swift/megatron/utils/utils.py @@ -11,8 +11,10 @@ from megatron.core.transformer.utils import make_sharded_tensors_for_checkpoint, sharded_state_dict_default from megatron.training import checkpointing, get_args from peft.utils.other import ModulesToSaveWrapper +from torch import nn -from swift.utils import activate_parameters, find_layers, freeze_parameters, get_logger, get_model_parameter_info +from swift.utils import (activate_parameters, deep_getattr, find_layers, freeze_parameters, get_logger, + get_model_parameter_info) logger = get_logger() @@ -20,7 +22,7 @@ def find_all_linears(model): def _cond(name, module): - if isinstance(module, (TELinear, TELayerNormColumnParallelLinear, TEGroupedLinear)): + if isinstance(module, (TELinear, TELayerNormColumnParallelLinear, TEGroupedLinear, nn.Linear)): return True return False @@ -35,13 +37,64 @@ def find_embedding(model): return find_layers(model, lambda name, module: isinstance(module, LanguageModelEmbedding)) +def get_multimodal_target_regex( + args, + model, + *, + freeze_llm: bool = False, + freeze_vit: bool = True, + freeze_aligner: bool = True, +) -> str: + modules = [] + visual_cls = 
args.megatron_model_meta.visual_cls + vision_tower = [f'visual.{vit}' for vit in visual_cls.vision_tower] + aligner = [f'visual.{_aligner}' for _aligner in visual_cls.aligner] + if not freeze_llm: + modules.append('language_model') + if not freeze_vit: + modules += vision_tower + if not freeze_aligner: + modules += aligner + assert len(modules) > 0, f'modules: {modules}' + + res = [] + for module in modules: + rejected_modules = [] + if not freeze_vit: + for _aligner in aligner: + if _aligner.startswith(f'{module}.'): + rejected_modules.append(_aligner) + + sub_module = deep_getattr(model, module) + if sub_module is None: + continue + target_modules = find_all_linears(sub_module) + if not target_modules: + continue + target_modules = [tm for tm in target_modules if tm] + target_pattern = rf'.*\.({"|".join(target_modules)})' if target_modules else '' + rejected_pattern = rf'(?!({"|".join(rejected_modules)}))' if rejected_modules else '' + res.append(rf'{rejected_pattern}{module}{target_pattern}') + + return rf'^({"|".join(res)})$' + + def get_target_modules(args, model): if isinstance(args.target_modules, str): return args.target_modules target_modules = args.target_modules.copy() if 'all-linear' in target_modules: - target_modules.remove('all-linear') - target_modules += find_all_linears(model) + if args.model_meta.is_multimodal: + return get_multimodal_target_regex( + args, + model, + freeze_llm=args.freeze_llm, + freeze_vit=args.freeze_vit, + freeze_aligner=args.freeze_aligner, + ) + else: + target_modules.remove('all-linear') + target_modules += find_all_linears(model) if 'all-embedding' in target_modules: target_modules.remove('all-embedding') target_modules += find_embedding(model) diff --git a/swift/trainers/mixin.py b/swift/trainers/mixin.py index 3cda3e3a0d..89d6bc6332 100644 --- a/swift/trainers/mixin.py +++ b/swift/trainers/mixin.py @@ -839,7 +839,7 @@ def get_cu_seqlens(self, position_ids, logits_to_keep) -> torch.Tensor: start, end = cu_seqlens[i], cu_seqlens[i + 1] res_cu_seqlens[i + 1:] -= (~logits_to_keep[start:end]).sum() elif isinstance(logits_to_keep, int): - res_cu_seqlens[1:] -= position_ids.shape[0] + 1 - logits_to_keep + res_cu_seqlens[1:] -= position_ids.shape[-1] + 1 - logits_to_keep return res_cu_seqlens def get_batch_samples(self, *args, **kwargs): diff --git a/tests/megatron/test_align/test_llm.py b/tests/megatron/test_align/test_llm.py index 163fd1933a..69e62f574f 100644 --- a/tests/megatron/test_align/test_llm.py +++ b/tests/megatron/test_align/test_llm.py @@ -127,6 +127,16 @@ def test_glm4_5(): _test_model('ZhipuAI/GLM-4.5-Air') +def test_qwen2_5_vl(): + os.environ['MAX_PIXELS'] = str(1280 * 28 * 28) + _test_model('Qwen/Qwen2.5-VL-7B-Instruct') + + +def test_qwen2_vl(): + os.environ['MAX_PIXELS'] = str(1280 * 28 * 28) + _test_model('Qwen/Qwen2-VL-7B-Instruct') + + if __name__ == '__main__': # test_qwen2() # test_llama2() @@ -151,4 +161,6 @@ def test_glm4_5(): # test_kimi_dev() # test_hunyuan() # test_ernie() - test_glm4_5() + # test_glm4_5() + test_qwen2_5_vl() + # test_qwen2_vl()
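The example scripts added above load an mcore-format checkpoint (`Qwen2.5-VL-7B-Instruct-mcore`). A minimal conversion sketch, assuming the existing `swift export --to_mcore` flow of Megatron-SWIFT also covers the `qwen2_5_vl`/`qwen2_vl` model types registered in this patch (flag names are taken from the current export CLI, not from this diff):

CUDA_VISIBLE_DEVICES=0,1,2,3 \
swift export \
    --model Qwen/Qwen2.5-VL-7B-Instruct \
    --to_mcore true \
    --torch_dtype bfloat16 \
    --output_dir Qwen2.5-VL-7B-Instruct-mcore \
    --test_convert_precision true

With `--test_convert_precision true`, the extended `test_convert_precision` in swift/megatron/utils/convert.py compares HF and mcore logits on a multimodal sample before training starts.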