diff --git a/README.md b/README.md index 710d6a0b95..80215b90be 100644 --- a/README.md +++ b/README.md @@ -75,6 +75,7 @@ You can contact us and communicate with us by adding our group: ## 🎉 News +- 🎁 2025.09.02: Megatron-SWIFT now supports multimodal model training. Documentation can be found [here](./docs/source_en/Megatron-SWIFT/Multimodal-Model.md). - 🎁 2025.08.12: Support [Dynamic Fine-Tuning](https://arxiv.org/abs/2508.05629)(DFT) in SFT training, use parameter `--enable_dft_loss true`. Training scripts can be found [here](https://github.com/modelscope/ms-swift/blob/main/examples/train/full/dft.sh). - 🎁 2025.07.12: Deployment(pt/vLLM/SGLang) of Embedding models is supported, check [here](examples/deploy/embedding/client.py). - 🎁 2025.07.09: Megatron-SWIFT supports LoRA training. Compared to ms-swift, it achieves significant speedup on MoE models. Training scripts can be found [here](https://github.com/modelscope/ms-swift/blob/main/examples/megatron/lora). diff --git a/README_CN.md b/README_CN.md index d34ac9bd85..ef6f062a76 100644 --- a/README_CN.md +++ b/README_CN.md @@ -71,6 +71,7 @@ - **模型量化**:支持AWQ、GPTQ、FP8和BNB的量化导出,导出的模型支持使用vLLM/SGLang/LmDeploy推理加速,并支持继续训练。 ## 🎉 新闻 +- 🎁 2025.09.02: Megatron-SWIFT支持多模态模型训练。文档参考[这里](./docs/source/Megatron-SWIFT/多模态模型.md)。 - 🎁 2025.08.12: 支持在SFT训练中使用[Dynamic Fine-Tuning](https://arxiv.org/abs/2508.05629)(DFT),使用参数 `--enable_dft_loss true`。训练脚本参考[这里](https://github.com/modelscope/ms-swift/blob/main/examples/train/full/dft.sh) - 🎁 2025.07.12: 支持部署Embedding模型的部署(pt/vLLM/SGLang), 查看[这里](examples/deploy/embedding/client.py). - 🎁 2025.07.09: Megatron-SWIFT支持LoRA训练。相比ms-swift,在MoE模型提速显著。训练脚本参考[这里](https://github.com/modelscope/ms-swift/blob/main/examples/megatron/lora)。 diff --git "a/docs/source/Instruction/\345\221\275\344\273\244\350\241\214\345\217\202\346\225\260.md" "b/docs/source/Instruction/\345\221\275\344\273\244\350\241\214\345\217\202\346\225\260.md" index 7886f6548b..4ba75ad711 100644 --- "a/docs/source/Instruction/\345\221\275\344\273\244\350\241\214\345\217\202\346\225\260.md" +++ "b/docs/source/Instruction/\345\221\275\344\273\244\350\241\214\345\217\202\346\225\260.md" @@ -159,7 +159,7 @@ - 🔥aligner_lr: 当训练多模态大模型时,该参数指定aligner的学习率,默认为None,等于learning_rate。 - lr_scheduler_type: lr_scheduler类型,默认为'cosine'。 - lr_scheduler_kwargs: lr_scheduler其他参数。默认为None。 -- 🔥gradient_checkpointing_kwargs: 传入`torch.utils.checkpoint`中的参数。例如设置为`--gradient_checkpointing_kwargs '{"use_reentrant": false}'`。默认为None。 +- gradient_checkpointing_kwargs: 传入`torch.utils.checkpoint`中的参数。例如设置为`--gradient_checkpointing_kwargs '{"use_reentrant": false}'`。默认为None。 - 注意:当使用DDP而不使用deepspeed/fsdp,且gradient_checkpointing_kwargs为None,会默认设置其为`'{"use_reentrant": false}'`。 - full_determinism: 确保训练中获得可重现的结果,注意:这会对性能产生负面影响。默认为False。 - 🔥report_to: 默认值为`tensorboard`。你也可以指定`--report_to tensorboard wandb swanlab`、`--report_to all`。 @@ -211,10 +211,10 @@ - hub_private_repo: 默认为False。 ### Tuner参数 -- 🔥freeze_llm: 该参数只对多模态模型生效,可用于全参和LoRA,但含义不同。若是全参数训练,将freeze_llm设置为True将会将llm部分权重进行冻结,若是LoRA训练且`target_modules`设置为'all-linear',将freeze_llm设置为True将会取消在llm部分添加LoRA模块。该参数默认为False。 -- 🔥freeze_vit: 该参数只对多模态模型生效,可用于全参和LoRA,含义参考`freeze_llm`。默认为True。 +- 🔥freeze_llm: 该参数只对多模态模型生效,可用于全参和LoRA,但会产生不同的效果。若是全参数训练,将freeze_llm设置为True将会将LLM部分权重进行冻结;若是LoRA训练且`target_modules`设置为'all-linear',将freeze_llm设置为True将会取消在LLM部分添加LoRA模块。该参数默认为False。 +- 🔥freeze_vit: 该参数只对多模态模型生效,可用于全参和LoRA,但会产生不同的效果。若是全参数训练,将freeze_vit设置为True将会将vit部分权重进行冻结;若是LoRA训练且`target_modules`设置为'all-linear',将freeze_vit设置为True将会取消在vit部分添加LoRA模块。该参数默认为True。 - 
注意:这里的vit不仅限于vision_tower, 也包括audio_tower。 -- 🔥freeze_aligner: 该参数只对多模态模型生效,可用于全参和LoRA,含义参考`freeze_llm`。默认为True。 +- 🔥freeze_aligner: 该参数只对多模态模型生效,可用于全参和LoRA,但会产生不同的效果。若是全参数训练,将freeze_aligner设置为True将会将aligner(也称为projector)部分权重进行冻结;若是LoRA训练且`target_modules`设置为'all-linear',将freeze_aligner设置为True将会取消在aligner部分添加LoRA模块。该参数默认为True。 - 🔥target_modules: 指定lora模块, 默认为`['all-linear']`。你也可以设置为module的后缀,例如:`--target_modules q_proj k_proj v_proj`。该参数不限于LoRA,可用于其他tuners。 - 注意:在LLM和多模态LLM中,'all-linear'的行为有所不同。若是LLM则自动寻找除lm_head外的linear并附加tuner;若是多模态LLM,则默认只在LLM上附加tuner,该行为可以被`freeze_llm`、`freeze_vit`、`freeze_aligner`控制。 - 🔥target_regex: 指定lora模块的regex表达式,默认为`None`。如果该值传入,则target_modules参数失效。该参数不限于LoRA,可用于其他tuners。 diff --git "a/docs/source/Instruction/\346\224\257\346\214\201\347\232\204\346\250\241\345\236\213\345\222\214\346\225\260\346\215\256\351\233\206.md" "b/docs/source/Instruction/\346\224\257\346\214\201\347\232\204\346\250\241\345\236\213\345\222\214\346\225\260\346\215\256\351\233\206.md" index b3836c2592..f9a8ba548b 100644 --- "a/docs/source/Instruction/\346\224\257\346\214\201\347\232\204\346\250\241\345\236\213\345\222\214\346\225\260\346\215\256\351\233\206.md" +++ "b/docs/source/Instruction/\346\224\257\346\214\201\347\232\204\346\250\241\345\236\213\345\222\214\346\225\260\346\215\256\351\233\206.md" @@ -652,12 +652,12 @@ |[Qwen/Qwen-VL-Chat-Int4](https://modelscope.cn/models/Qwen/Qwen-VL-Chat-Int4)|qwen_vl|qwen_vl|-|✘|vision|[Qwen/Qwen-VL-Chat-Int4](https://huggingface.co/Qwen/Qwen-VL-Chat-Int4)| |[Qwen/Qwen-Audio-Chat](https://modelscope.cn/models/Qwen/Qwen-Audio-Chat)|qwen_audio|qwen_audio|-|✘|audio|[Qwen/Qwen-Audio-Chat](https://huggingface.co/Qwen/Qwen-Audio-Chat)| |[Qwen/Qwen-Audio](https://modelscope.cn/models/Qwen/Qwen-Audio)|qwen_audio|qwen_audio|-|✘|audio|[Qwen/Qwen-Audio](https://huggingface.co/Qwen/Qwen-Audio)| -|[Qwen/Qwen2-VL-2B-Instruct](https://modelscope.cn/models/Qwen/Qwen2-VL-2B-Instruct)|qwen2_vl|qwen2_vl|transformers>=4.45, qwen_vl_utils>=0.0.6, decord|✘|vision, video|[Qwen/Qwen2-VL-2B-Instruct](https://huggingface.co/Qwen/Qwen2-VL-2B-Instruct)| -|[Qwen/Qwen2-VL-7B-Instruct](https://modelscope.cn/models/Qwen/Qwen2-VL-7B-Instruct)|qwen2_vl|qwen2_vl|transformers>=4.45, qwen_vl_utils>=0.0.6, decord|✘|vision, video|[Qwen/Qwen2-VL-7B-Instruct](https://huggingface.co/Qwen/Qwen2-VL-7B-Instruct)| -|[Qwen/Qwen2-VL-72B-Instruct](https://modelscope.cn/models/Qwen/Qwen2-VL-72B-Instruct)|qwen2_vl|qwen2_vl|transformers>=4.45, qwen_vl_utils>=0.0.6, decord|✘|vision, video|[Qwen/Qwen2-VL-72B-Instruct](https://huggingface.co/Qwen/Qwen2-VL-72B-Instruct)| -|[Qwen/Qwen2-VL-2B](https://modelscope.cn/models/Qwen/Qwen2-VL-2B)|qwen2_vl|qwen2_vl|transformers>=4.45, qwen_vl_utils>=0.0.6, decord|✘|vision, video|[Qwen/Qwen2-VL-2B](https://huggingface.co/Qwen/Qwen2-VL-2B)| -|[Qwen/Qwen2-VL-7B](https://modelscope.cn/models/Qwen/Qwen2-VL-7B)|qwen2_vl|qwen2_vl|transformers>=4.45, qwen_vl_utils>=0.0.6, decord|✘|vision, video|[Qwen/Qwen2-VL-7B](https://huggingface.co/Qwen/Qwen2-VL-7B)| -|[Qwen/Qwen2-VL-72B](https://modelscope.cn/models/Qwen/Qwen2-VL-72B)|qwen2_vl|qwen2_vl|transformers>=4.45, qwen_vl_utils>=0.0.6, decord|✘|vision, video|[Qwen/Qwen2-VL-72B](https://huggingface.co/Qwen/Qwen2-VL-72B)| +|[Qwen/Qwen2-VL-2B-Instruct](https://modelscope.cn/models/Qwen/Qwen2-VL-2B-Instruct)|qwen2_vl|qwen2_vl|transformers>=4.45, qwen_vl_utils>=0.0.6, decord|✔|vision, video|[Qwen/Qwen2-VL-2B-Instruct](https://huggingface.co/Qwen/Qwen2-VL-2B-Instruct)| 
+|[Qwen/Qwen2-VL-7B-Instruct](https://modelscope.cn/models/Qwen/Qwen2-VL-7B-Instruct)|qwen2_vl|qwen2_vl|transformers>=4.45, qwen_vl_utils>=0.0.6, decord|✔|vision, video|[Qwen/Qwen2-VL-7B-Instruct](https://huggingface.co/Qwen/Qwen2-VL-7B-Instruct)| +|[Qwen/Qwen2-VL-72B-Instruct](https://modelscope.cn/models/Qwen/Qwen2-VL-72B-Instruct)|qwen2_vl|qwen2_vl|transformers>=4.45, qwen_vl_utils>=0.0.6, decord|✔|vision, video|[Qwen/Qwen2-VL-72B-Instruct](https://huggingface.co/Qwen/Qwen2-VL-72B-Instruct)| +|[Qwen/Qwen2-VL-2B](https://modelscope.cn/models/Qwen/Qwen2-VL-2B)|qwen2_vl|qwen2_vl|transformers>=4.45, qwen_vl_utils>=0.0.6, decord|✔|vision, video|[Qwen/Qwen2-VL-2B](https://huggingface.co/Qwen/Qwen2-VL-2B)| +|[Qwen/Qwen2-VL-7B](https://modelscope.cn/models/Qwen/Qwen2-VL-7B)|qwen2_vl|qwen2_vl|transformers>=4.45, qwen_vl_utils>=0.0.6, decord|✔|vision, video|[Qwen/Qwen2-VL-7B](https://huggingface.co/Qwen/Qwen2-VL-7B)| +|[Qwen/Qwen2-VL-72B](https://modelscope.cn/models/Qwen/Qwen2-VL-72B)|qwen2_vl|qwen2_vl|transformers>=4.45, qwen_vl_utils>=0.0.6, decord|✔|vision, video|[Qwen/Qwen2-VL-72B](https://huggingface.co/Qwen/Qwen2-VL-72B)| |[Qwen/Qwen2-VL-2B-Instruct-GPTQ-Int4](https://modelscope.cn/models/Qwen/Qwen2-VL-2B-Instruct-GPTQ-Int4)|qwen2_vl|qwen2_vl|transformers>=4.45, qwen_vl_utils>=0.0.6, decord|✘|vision, video|[Qwen/Qwen2-VL-2B-Instruct-GPTQ-Int4](https://huggingface.co/Qwen/Qwen2-VL-2B-Instruct-GPTQ-Int4)| |[Qwen/Qwen2-VL-7B-Instruct-GPTQ-Int4](https://modelscope.cn/models/Qwen/Qwen2-VL-7B-Instruct-GPTQ-Int4)|qwen2_vl|qwen2_vl|transformers>=4.45, qwen_vl_utils>=0.0.6, decord|✘|vision, video|[Qwen/Qwen2-VL-7B-Instruct-GPTQ-Int4](https://huggingface.co/Qwen/Qwen2-VL-7B-Instruct-GPTQ-Int4)| |[Qwen/Qwen2-VL-72B-Instruct-GPTQ-Int4](https://modelscope.cn/models/Qwen/Qwen2-VL-72B-Instruct-GPTQ-Int4)|qwen2_vl|qwen2_vl|transformers>=4.45, qwen_vl_utils>=0.0.6, decord|✘|vision, video|[Qwen/Qwen2-VL-72B-Instruct-GPTQ-Int4](https://huggingface.co/Qwen/Qwen2-VL-72B-Instruct-GPTQ-Int4)| @@ -667,16 +667,16 @@ |[Qwen/Qwen2-VL-2B-Instruct-AWQ](https://modelscope.cn/models/Qwen/Qwen2-VL-2B-Instruct-AWQ)|qwen2_vl|qwen2_vl|transformers>=4.45, qwen_vl_utils>=0.0.6, decord|✘|vision, video|[Qwen/Qwen2-VL-2B-Instruct-AWQ](https://huggingface.co/Qwen/Qwen2-VL-2B-Instruct-AWQ)| |[Qwen/Qwen2-VL-7B-Instruct-AWQ](https://modelscope.cn/models/Qwen/Qwen2-VL-7B-Instruct-AWQ)|qwen2_vl|qwen2_vl|transformers>=4.45, qwen_vl_utils>=0.0.6, decord|✘|vision, video|[Qwen/Qwen2-VL-7B-Instruct-AWQ](https://huggingface.co/Qwen/Qwen2-VL-7B-Instruct-AWQ)| |[Qwen/Qwen2-VL-72B-Instruct-AWQ](https://modelscope.cn/models/Qwen/Qwen2-VL-72B-Instruct-AWQ)|qwen2_vl|qwen2_vl|transformers>=4.45, qwen_vl_utils>=0.0.6, decord|✘|vision, video|[Qwen/Qwen2-VL-72B-Instruct-AWQ](https://huggingface.co/Qwen/Qwen2-VL-72B-Instruct-AWQ)| -|[bytedance-research/UI-TARS-2B-SFT](https://modelscope.cn/models/bytedance-research/UI-TARS-2B-SFT)|qwen2_vl|qwen2_vl|transformers>=4.45, qwen_vl_utils>=0.0.6, decord|✘|vision, video|[bytedance-research/UI-TARS-2B-SFT](https://huggingface.co/bytedance-research/UI-TARS-2B-SFT)| -|[bytedance-research/UI-TARS-7B-SFT](https://modelscope.cn/models/bytedance-research/UI-TARS-7B-SFT)|qwen2_vl|qwen2_vl|transformers>=4.45, qwen_vl_utils>=0.0.6, decord|✘|vision, video|[bytedance-research/UI-TARS-7B-SFT](https://huggingface.co/bytedance-research/UI-TARS-7B-SFT)| -|[bytedance-research/UI-TARS-7B-DPO](https://modelscope.cn/models/bytedance-research/UI-TARS-7B-DPO)|qwen2_vl|qwen2_vl|transformers>=4.45, qwen_vl_utils>=0.0.6, 
decord|✘|vision, video|[bytedance-research/UI-TARS-7B-DPO](https://huggingface.co/bytedance-research/UI-TARS-7B-DPO)| -|[bytedance-research/UI-TARS-72B-SFT](https://modelscope.cn/models/bytedance-research/UI-TARS-72B-SFT)|qwen2_vl|qwen2_vl|transformers>=4.45, qwen_vl_utils>=0.0.6, decord|✘|vision, video|[bytedance-research/UI-TARS-72B-SFT](https://huggingface.co/bytedance-research/UI-TARS-72B-SFT)| -|[bytedance-research/UI-TARS-72B-DPO](https://modelscope.cn/models/bytedance-research/UI-TARS-72B-DPO)|qwen2_vl|qwen2_vl|transformers>=4.45, qwen_vl_utils>=0.0.6, decord|✘|vision, video|[bytedance-research/UI-TARS-72B-DPO](https://huggingface.co/bytedance-research/UI-TARS-72B-DPO)| -|[allenai/olmOCR-7B-0225-preview](https://modelscope.cn/models/allenai/olmOCR-7B-0225-preview)|qwen2_vl|qwen2_vl|transformers>=4.45, qwen_vl_utils>=0.0.6, decord|✘|vision, video|[allenai/olmOCR-7B-0225-preview](https://huggingface.co/allenai/olmOCR-7B-0225-preview)| -|[Qwen/Qwen2.5-VL-3B-Instruct](https://modelscope.cn/models/Qwen/Qwen2.5-VL-3B-Instruct)|qwen2_5_vl|qwen2_5_vl|transformers>=4.49, qwen_vl_utils>=0.0.6, decord|✘|vision, video|[Qwen/Qwen2.5-VL-3B-Instruct](https://huggingface.co/Qwen/Qwen2.5-VL-3B-Instruct)| -|[Qwen/Qwen2.5-VL-7B-Instruct](https://modelscope.cn/models/Qwen/Qwen2.5-VL-7B-Instruct)|qwen2_5_vl|qwen2_5_vl|transformers>=4.49, qwen_vl_utils>=0.0.6, decord|✘|vision, video|[Qwen/Qwen2.5-VL-7B-Instruct](https://huggingface.co/Qwen/Qwen2.5-VL-7B-Instruct)| -|[Qwen/Qwen2.5-VL-32B-Instruct](https://modelscope.cn/models/Qwen/Qwen2.5-VL-32B-Instruct)|qwen2_5_vl|qwen2_5_vl|transformers>=4.49, qwen_vl_utils>=0.0.6, decord|✘|vision, video|[Qwen/Qwen2.5-VL-32B-Instruct](https://huggingface.co/Qwen/Qwen2.5-VL-32B-Instruct)| -|[Qwen/Qwen2.5-VL-72B-Instruct](https://modelscope.cn/models/Qwen/Qwen2.5-VL-72B-Instruct)|qwen2_5_vl|qwen2_5_vl|transformers>=4.49, qwen_vl_utils>=0.0.6, decord|✘|vision, video|[Qwen/Qwen2.5-VL-72B-Instruct](https://huggingface.co/Qwen/Qwen2.5-VL-72B-Instruct)| +|[bytedance-research/UI-TARS-2B-SFT](https://modelscope.cn/models/bytedance-research/UI-TARS-2B-SFT)|qwen2_vl|qwen2_vl|transformers>=4.45, qwen_vl_utils>=0.0.6, decord|✔|vision, video|[bytedance-research/UI-TARS-2B-SFT](https://huggingface.co/bytedance-research/UI-TARS-2B-SFT)| +|[bytedance-research/UI-TARS-7B-SFT](https://modelscope.cn/models/bytedance-research/UI-TARS-7B-SFT)|qwen2_vl|qwen2_vl|transformers>=4.45, qwen_vl_utils>=0.0.6, decord|✔|vision, video|[bytedance-research/UI-TARS-7B-SFT](https://huggingface.co/bytedance-research/UI-TARS-7B-SFT)| +|[bytedance-research/UI-TARS-7B-DPO](https://modelscope.cn/models/bytedance-research/UI-TARS-7B-DPO)|qwen2_vl|qwen2_vl|transformers>=4.45, qwen_vl_utils>=0.0.6, decord|✔|vision, video|[bytedance-research/UI-TARS-7B-DPO](https://huggingface.co/bytedance-research/UI-TARS-7B-DPO)| +|[bytedance-research/UI-TARS-72B-SFT](https://modelscope.cn/models/bytedance-research/UI-TARS-72B-SFT)|qwen2_vl|qwen2_vl|transformers>=4.45, qwen_vl_utils>=0.0.6, decord|✔|vision, video|[bytedance-research/UI-TARS-72B-SFT](https://huggingface.co/bytedance-research/UI-TARS-72B-SFT)| +|[bytedance-research/UI-TARS-72B-DPO](https://modelscope.cn/models/bytedance-research/UI-TARS-72B-DPO)|qwen2_vl|qwen2_vl|transformers>=4.45, qwen_vl_utils>=0.0.6, decord|✔|vision, video|[bytedance-research/UI-TARS-72B-DPO](https://huggingface.co/bytedance-research/UI-TARS-72B-DPO)| +|[allenai/olmOCR-7B-0225-preview](https://modelscope.cn/models/allenai/olmOCR-7B-0225-preview)|qwen2_vl|qwen2_vl|transformers>=4.45, 
qwen_vl_utils>=0.0.6, decord|✔|vision, video|[allenai/olmOCR-7B-0225-preview](https://huggingface.co/allenai/olmOCR-7B-0225-preview)| +|[Qwen/Qwen2.5-VL-3B-Instruct](https://modelscope.cn/models/Qwen/Qwen2.5-VL-3B-Instruct)|qwen2_5_vl|qwen2_5_vl|transformers>=4.49, qwen_vl_utils>=0.0.6, decord|✔|vision, video|[Qwen/Qwen2.5-VL-3B-Instruct](https://huggingface.co/Qwen/Qwen2.5-VL-3B-Instruct)| +|[Qwen/Qwen2.5-VL-7B-Instruct](https://modelscope.cn/models/Qwen/Qwen2.5-VL-7B-Instruct)|qwen2_5_vl|qwen2_5_vl|transformers>=4.49, qwen_vl_utils>=0.0.6, decord|✔|vision, video|[Qwen/Qwen2.5-VL-7B-Instruct](https://huggingface.co/Qwen/Qwen2.5-VL-7B-Instruct)| +|[Qwen/Qwen2.5-VL-32B-Instruct](https://modelscope.cn/models/Qwen/Qwen2.5-VL-32B-Instruct)|qwen2_5_vl|qwen2_5_vl|transformers>=4.49, qwen_vl_utils>=0.0.6, decord|✔|vision, video|[Qwen/Qwen2.5-VL-32B-Instruct](https://huggingface.co/Qwen/Qwen2.5-VL-32B-Instruct)| +|[Qwen/Qwen2.5-VL-72B-Instruct](https://modelscope.cn/models/Qwen/Qwen2.5-VL-72B-Instruct)|qwen2_5_vl|qwen2_5_vl|transformers>=4.49, qwen_vl_utils>=0.0.6, decord|✔|vision, video|[Qwen/Qwen2.5-VL-72B-Instruct](https://huggingface.co/Qwen/Qwen2.5-VL-72B-Instruct)| |[Qwen/Qwen2.5-VL-3B-Instruct-AWQ](https://modelscope.cn/models/Qwen/Qwen2.5-VL-3B-Instruct-AWQ)|qwen2_5_vl|qwen2_5_vl|transformers>=4.49, qwen_vl_utils>=0.0.6, decord|✘|vision, video|[Qwen/Qwen2.5-VL-3B-Instruct-AWQ](https://huggingface.co/Qwen/Qwen2.5-VL-3B-Instruct-AWQ)| |[Qwen/Qwen2.5-VL-7B-Instruct-AWQ](https://modelscope.cn/models/Qwen/Qwen2.5-VL-7B-Instruct-AWQ)|qwen2_5_vl|qwen2_5_vl|transformers>=4.49, qwen_vl_utils>=0.0.6, decord|✘|vision, video|[Qwen/Qwen2.5-VL-7B-Instruct-AWQ](https://huggingface.co/Qwen/Qwen2.5-VL-7B-Instruct-AWQ)| |[Qwen/Qwen2.5-VL-32B-Instruct-AWQ](https://modelscope.cn/models/Qwen/Qwen2.5-VL-32B-Instruct-AWQ)|qwen2_5_vl|qwen2_5_vl|transformers>=4.49, qwen_vl_utils>=0.0.6, decord|✘|vision, video|[Qwen/Qwen2.5-VL-32B-Instruct-AWQ](https://huggingface.co/Qwen/Qwen2.5-VL-32B-Instruct-AWQ)| diff --git "a/docs/source/Megatron-SWIFT/\345\221\275\344\273\244\350\241\214\345\217\202\346\225\260.md" "b/docs/source/Megatron-SWIFT/\345\221\275\344\273\244\350\241\214\345\217\202\346\225\260.md" index b76390dd75..b4864b250f 100644 --- "a/docs/source/Megatron-SWIFT/\345\221\275\344\273\244\350\241\214\345\217\202\346\225\260.md" +++ "b/docs/source/Megatron-SWIFT/\345\221\275\344\273\244\350\241\214\345\217\202\346\225\260.md" @@ -192,6 +192,10 @@ **Tuner参数**: - train_type: 可选为'lora'和'full'。默认为'full'。 +- 🔥freeze_llm: 该参数只对多模态模型生效,可用于全参和LoRA,但会产生不同的效果。若是全参数训练,将freeze_llm设置为True将会将LLM部分权重进行冻结;若是LoRA训练且`target_modules`设置为'all-linear',将freeze_llm设置为True将会取消在LLM部分添加LoRA模块。该参数默认为False。 +- 🔥freeze_vit: 该参数只对多模态模型生效,可用于全参和LoRA,但会产生不同的效果。若是全参数训练,将freeze_vit设置为True将会将vit部分权重进行冻结;若是LoRA训练且`target_modules`设置为'all-linear',将freeze_vit设置为True将会取消在vit部分添加LoRA模块。该参数默认为True。 + - 注意:这里的vit不仅限于vision_tower, 也包括audio_tower。 +- 🔥freeze_aligner: 该参数只对多模态模型生效,可用于全参和LoRA,但会产生不同的效果。若是全参数训练,将freeze_aligner设置为True将会将aligner(也称为projector)部分权重进行冻结;若是LoRA训练且`target_modules`设置为'all-linear',将freeze_aligner设置为True将会取消在aligner部分添加LoRA模块。该参数默认为True。 全参数训练: - freeze_parameters: 需要被冻结参数的前缀,默认为`[]`。 @@ -234,6 +238,8 @@ Megatron训练参数继承自Megatron参数和基本参数(与ms-swift共用da - 若要自定义attention_mask,你可以设置`--padding_free false`。 - 注意:Megatron-SWIFT训练特性优先支持padding_free格式,若非特殊情况,请勿修改该值。 - mlp_padding_free: 默认为False。用于padding_free设置为false时,对mlp进行padding_free优化。这可以在自定义attention_mask的同时,提升训练速度和减少显存占用。 +- vit_gradient_checkpointing: 
多模态模型训练时,是否对vit部分开启gradient_checkpointing。默认为True。 +- gradient_checkpointing_kwargs: 传入`torch.utils.checkpoint`中的参数。例如设置为`--gradient_checkpointing_kwargs '{"use_reentrant": false}'`。默认为None。 - 🔥packing: 是否使用序列packing,默认为False。当前支持CPT/SFT/DPO。 - packing_length: packing的长度。默认为None,设置为max_length。 - streaming: 流式读取并处理数据集,默认False。 diff --git "a/docs/source/Megatron-SWIFT/\345\244\232\346\250\241\346\200\201\346\250\241\345\236\213.md" "b/docs/source/Megatron-SWIFT/\345\244\232\346\250\241\346\200\201\346\250\241\345\236\213.md" new file mode 100644 index 0000000000..e9186658c5 --- /dev/null +++ "b/docs/source/Megatron-SWIFT/\345\244\232\346\250\241\346\200\201\346\250\241\345\236\213.md" @@ -0,0 +1,156 @@ +# 多模态模型 + +ms-swift引入了Megatron的并行技术来加速多模态大模型的训练。目前支持Qwen2.5-VL等模型的CPT/SFT/DPO。完整支持的模型可以参考[支持的模型与数据集文档](../Instruction/支持的模型和数据集.md)。 + +环境准备请参考Megatron-SWIFT的[快速开始文档](./快速开始.md)。 + +## Dense模型 Full/LoRA + +这里介绍使用2卡80GiB A100对Qwen2.5-VL-7B-Instruct模型进行Latex-OCR的微调,分别使用全参数和LoRA的方式,以下最佳实践可以在10分钟内完成。 + +首先,我们需要将HF格式的权重转为Megatron格式: +```shell +CUDA_VISIBLE_DEVICES=0 \ +swift export \ + --model Qwen/Qwen2.5-VL-7B-Instruct \ + --to_mcore true \ + --torch_dtype bfloat16 \ + --output_dir Qwen2.5-VL-7B-Instruct-mcore \ + --test_convert_precision true +``` + +### Full + +全参数训练脚本如下: +```shell +# 2 * 72GiB; 4.1s/it +PYTORCH_CUDA_ALLOC_CONF='expandable_segments:True' \ +NPROC_PER_NODE=2 \ +MAX_PIXELS=1003520 \ +CUDA_VISIBLE_DEVICES=0,1 \ +megatron sft \ + --load Qwen2.5-VL-7B-Instruct-mcore \ + --dataset 'AI-ModelScope/LaTeX_OCR:human_handwrite#5000' \ + --tensor_model_parallel_size 2 \ + --sequence_parallel true \ + --packing true \ + --freeze_llm false \ + --freeze_vit true \ + --freeze_aligner true \ + --split_dataset_ratio 0.01 \ + --micro_batch_size 1 \ + --global_batch_size 4 \ + --recompute_granularity full \ + --recompute_method uniform \ + --recompute_num_layers 1 \ + --finetune true \ + --cross_entropy_loss_fusion true \ + --lr 1e-5 \ + --lr_warmup_fraction 0.05 \ + --min_lr 1e-6 \ + --max_epochs 1 \ + --save megatron_output/Qwen2.5-VL-7B-Instruct \ + --save_interval 200 \ + --vit_gradient_checkpointing true \ + --max_length 2048 \ + --num_workers 4 \ + --no_save_optim true \ + --no_save_rng true \ + --dataset_num_proc 8 +``` + +将全参数保存的Megatron格式权重转为HF格式: +- 注意:`--mcore_model`请指向`iter_xxx`的上级目录。默认会使用`latest_checkpointed_iteration.txt`中对应的checkpoint。 +```shell +CUDA_VISIBLE_DEVICES=0 \ +swift export \ + --mcore_model megatron_output/Qwen2.5-VL-7B-Instruct/vx-xxx \ + --to_hf true \ + --torch_dtype bfloat16 \ + --output_dir megatron_output/Qwen2.5-VL-7B-Instruct/vx-xxx-hf \ + --test_convert_precision true +``` + +### LoRA + +LoRA训练脚本如下: +```shell +# 2 * 23GiB; 2.3s/it +PYTORCH_CUDA_ALLOC_CONF='expandable_segments:True' \ +NPROC_PER_NODE=2 \ +MAX_PIXELS=1003520 \ +CUDA_VISIBLE_DEVICES=0,1 \ +megatron sft \ + --load Qwen2.5-VL-7B-Instruct-mcore \ + --dataset 'AI-ModelScope/LaTeX_OCR:human_handwrite#5000' \ + --train_type lora \ + --lora_rank 8 \ + --lora_alpha 32 \ + --target_modules all-linear \ + --tensor_model_parallel_size 1 \ + --sequence_parallel true \ + --freeze_llm false \ + --freeze_vit true \ + --freeze_aligner true \ + --packing true \ + --split_dataset_ratio 0.01 \ + --micro_batch_size 1 \ + --global_batch_size 4 \ + --recompute_granularity full \ + --recompute_method uniform \ + --recompute_num_layers 1 \ + --finetune true \ + --cross_entropy_loss_fusion true \ + --lr 1e-4 \ + --lr_warmup_fraction 0.05 \ + --min_lr 1e-5 \ + --max_epochs 1 \ + --save megatron_output/Qwen2.5-VL-7B-Instruct 
\ + --save_interval 200 \ + --vit_gradient_checkpointing true \ + --max_length 2048 \ + --num_workers 4 \ + --no_save_optim true \ + --no_save_rng true \ + --dataset_num_proc 8 +``` + +将LoRA保存的增量权重进行Merge-LoRA并转为HF格式: +```shell +CUDA_VISIBLE_DEVICES=0 \ +swift export \ + --mcore_adapters megatron_output/Qwen2.5-VL-7B-Instruct/vx-xxx \ + --to_hf true \ + --torch_dtype bfloat16 \ + --output_dir megatron_output/Qwen2.5-VL-7B-Instruct/vx-xxx-hf \ + --test_convert_precision true +``` + + +最后,我们使用生成的HF格式权重对验证集进行推理: +```shell +MAX_PIXELS=1003520 \ +CUDA_VISIBLE_DEVICES=0 \ +swift infer \ + --model megatron_output/Qwen2.5-VL-7B-Instruct/vx-xxx-hf \ + --attn_impl flash_attn \ + --stream true \ + --load_data_args true \ + --temperature 0 \ + --max_new_tokens 512 +``` + +推理结果如下: +``` +[QUERY] Using LaTeX to perform OCR on the image. +[LABELS] \forall x \in X , ( \alpha f ) ( x ) = \alpha f ( x ) +[RESPONSE] \forall x \in X , ( \alpha f ) ( x ) = \alpha f ( x ) +-------------------------------------------------- +[QUERY] Using LaTeX to perform OCR on the image. +[LABELS] \pi \int _ { c } ^ { d } \{ g ( y ) \} ^ { 2 } d y +[RESPONSE] \pi \int _ { c } ^ { d } \{ g ( y ) \} ^ { 2 } d y +-------------------------------------------------- +[QUERY] Using LaTeX to perform OCR on the image. +[LABELS] [ \frac 2 3 x ^ { \frac 3 2 } ] _ { 0 } ^ { 1 } +[RESPONSE] [ \frac 2 3 x ^ { \frac 3 2 } ] _ { 0 } ^ { 1 } +``` diff --git "a/docs/source/Megatron-SWIFT/\345\277\253\351\200\237\345\274\200\345\247\213.md" "b/docs/source/Megatron-SWIFT/\345\277\253\351\200\237\345\274\200\345\247\213.md" index d105487afe..782d532cab 100644 --- "a/docs/source/Megatron-SWIFT/\345\277\253\351\200\237\345\274\200\345\247\213.md" +++ "b/docs/source/Megatron-SWIFT/\345\277\253\351\200\237\345\274\200\345\247\213.md" @@ -1,7 +1,7 @@ # 快速开始 -ms-swift引入了Megatron的并行技术来加速大模型的训练,包括数据并行、张量并行、流水线并行、序列并行,上下文并行,专家并行。支持Qwen3、[Qwen3-MoE](https://github.com/modelscope/ms-swift/blob/main/examples/megatron/qwen3_moe.sh)、Qwen2.5、Llama3、Deepseek-R1、GLM4.5等模型的预训练和微调。完整支持的模型可以参考[支持的模型与数据集文档](../Instruction/支持的模型和数据集.md)。推荐在MoE训练时使用Megatron-SWIFT,这通常可以获得10倍的训练速度提升。 +ms-swift引入了Megatron的并行技术来加速大模型的训练,包括数据并行、张量并行、流水线并行、序列并行,上下文并行,专家并行。支持Qwen3、[Qwen3-MoE](https://github.com/modelscope/ms-swift/blob/main/examples/megatron/qwen3_moe.sh)、Qwen2.5、Llama3、Deepseek-R1、GLM4.5等模型的CPT/SFT/DPO。完整支持的模型可以参考[支持的模型与数据集文档](../Instruction/支持的模型和数据集.md)。推荐在MoE训练时使用Megatron-SWIFT,这通常可以获得10倍的训练速度提升。 ## 环境准备 使用Megatron-SWIFT,除了安装swift依赖外,还需要安装以下内容: diff --git a/docs/source/index.rst b/docs/source/index.rst index 7e9dbba285..af6129e628 100644 --- a/docs/source/index.rst +++ b/docs/source/index.rst @@ -38,6 +38,7 @@ Swift DOCUMENTATION Megatron-SWIFT/快速开始.md Megatron-SWIFT/命令行参数.md Megatron-SWIFT/LoRA训练.md + Megatron-SWIFT/多模态模型.md .. toctree:: :maxdepth: 2 diff --git a/docs/source_en/Instruction/Command-line-parameters.md b/docs/source_en/Instruction/Command-line-parameters.md index 2606710f79..6499184173 100644 --- a/docs/source_en/Instruction/Command-line-parameters.md +++ b/docs/source_en/Instruction/Command-line-parameters.md @@ -162,7 +162,7 @@ This parameter list inherits from transformers `Seq2SeqTrainingArguments`, with - 🔥aligner_lr: When training a multimodal large model, this parameter specifies the learning rate for the aligner. By default, it is set to None, which means it equals `learning_rate`. - lr_scheduler_type: Type of lr_scheduler, defaults to 'cosine'. - lr_scheduler_kwargs: Other parameters for the lr_scheduler, defaults to None. 
-- 🔥gradient_checkpointing_kwargs: Parameters for `torch.utils.checkpoint`. For example, set as `--gradient_checkpointing_kwargs '{"use_reentrant": false}'`. Defaults to None. +- gradient_checkpointing_kwargs: Parameters for `torch.utils.checkpoint`. For example, set as `--gradient_checkpointing_kwargs '{"use_reentrant": false}'`. Defaults to None. - Note: When using DDP without DeepSpeed/FSDP, and `gradient_checkpointing_kwargs` is `None`, it will default to `'{"use_reentrant": false}'`. - full_determinism: Ensures reproducible results during training. Note: This will negatively impact performance. Defaults to False. - 🔥report_to: Default value is `tensorboard`. You can also specify `--report_to tensorboard wandb swanlab` or `--report_to all`. @@ -215,11 +215,11 @@ Other important parameters: ### Tuner Arguments -- 🔥freeze_llm: This parameter is only effective for multimodal models and can be used for full parameter training and LoRA, but with different meanings. In full parameter training, setting freeze_llm to True will freeze some of the LLM weights. In LoRA training, if `target_modules` is set to 'all-linear', setting freeze_llm to True will prevent adding LoRA modules to the LLM part. The default is False. -- 🔥freeze_vit: This parameter is only effective for multimodal models and can be used for full parameter training and LoRA, with similar meanings as `freeze_llm`. The default is True. - - Note: Here, "vit" refers not only to the vision_tower but also includes the audio_tower. -- 🔥freeze_aligner: This parameter is only effective for multimodal models and can be used for full parameter training and LoRA, with similar meanings as `freeze_llm`. The default is True. -- 🔥 target_modules: Specifies the LoRA modules. The default is `['all-linear']`, but you can also pass layer-name suffixes, e.g. `--target_modules q_proj k_proj v_proj`. This argument is not restricted to LoRA and can be used with other tuners as well. +- 🔥freeze_llm: This parameter only takes effect for multimodal models and can be used in both full-parameter and LoRA training, but with different behaviors. In full-parameter training, setting `freeze_llm` to `True` will freeze the weights of the LLM component. In LoRA training with `target_modules` set to 'all-linear', setting `freeze_llm` to `True` will prevent LoRA modules from being added to the LLM component. The default value is `False`. +- 🔥freeze_vit: This parameter only applies to multimodal models and can be used in both full-parameter and LoRA training, though with different effects. In full-parameter training, setting `freeze_vit` to `True` will freeze the weights of the ViT component. In LoRA training with `target_modules` set to 'all-linear', setting `freeze_vit` to `True` will prevent LoRA modules from being added to the ViT component. The default value is `True`. + - Note: The term "ViT" here refers not only to the vision tower but also includes the audio tower. +- 🔥freeze_aligner: This parameter is only effective for multimodal models and can be used in both full-parameter and LoRA training, with differing outcomes. In full-parameter training, setting `freeze_aligner` to `True` will freeze the weights of the aligner (also known as the projector) component. In LoRA training with `target_modules` set to 'all-linear', setting `freeze_aligner` to `True` will prevent LoRA modules from being added to the aligner component. The default value is `True`. +- 🔥target_modules: Specifies the LoRA modules. 
The default is `['all-linear']`, but you can also pass layer-name suffixes, e.g. `--target_modules q_proj k_proj v_proj`. This argument is not restricted to LoRA and can be used with other tuners as well. - Note: The behavior of the special value `'all-linear'` differs between plain LLMs and multimodal LLMs. For a standard LLM, it automatically locates every linear layer except `lm_head` and attaches a tuner. For a multimodal LLM, it attaches the tuner only to the LLM component by default. This default can be changed with the `freeze_llm`, `freeze_vit`, and `freeze_aligner` options. - 🔥target_regex: Specifies a regex expression for LoRA modules, with a default of `None`. If this value is provided, the target_modules parameter becomes ineffective. This parameter is not limited to LoRA and can be used for other tuners. - target_parameters: List of parameter names to be replaced with LoRA. This argument behaves similarly to target_modules, but you should pass parameter names instead. This feature requires "peft>=0.17.0". For example, in many Mixture-of-Experts (MoE) layers in Hugging Face Transformers, `nn.Linear` is not used; instead, `nn.Parameter` is used. In such cases, the `target_parameters` argument can be used to apply LoRA. diff --git a/docs/source_en/Instruction/Supported-models-and-datasets.md b/docs/source_en/Instruction/Supported-models-and-datasets.md index a20d5c0bd6..1830a604a9 100644 --- a/docs/source_en/Instruction/Supported-models-and-datasets.md +++ b/docs/source_en/Instruction/Supported-models-and-datasets.md @@ -652,12 +652,12 @@ The table below introduces the models integrated with ms-swift: |[Qwen/Qwen-VL-Chat-Int4](https://modelscope.cn/models/Qwen/Qwen-VL-Chat-Int4)|qwen_vl|qwen_vl|-|✘|vision|[Qwen/Qwen-VL-Chat-Int4](https://huggingface.co/Qwen/Qwen-VL-Chat-Int4)| |[Qwen/Qwen-Audio-Chat](https://modelscope.cn/models/Qwen/Qwen-Audio-Chat)|qwen_audio|qwen_audio|-|✘|audio|[Qwen/Qwen-Audio-Chat](https://huggingface.co/Qwen/Qwen-Audio-Chat)| |[Qwen/Qwen-Audio](https://modelscope.cn/models/Qwen/Qwen-Audio)|qwen_audio|qwen_audio|-|✘|audio|[Qwen/Qwen-Audio](https://huggingface.co/Qwen/Qwen-Audio)| -|[Qwen/Qwen2-VL-2B-Instruct](https://modelscope.cn/models/Qwen/Qwen2-VL-2B-Instruct)|qwen2_vl|qwen2_vl|transformers>=4.45, qwen_vl_utils>=0.0.6, decord|✘|vision, video|[Qwen/Qwen2-VL-2B-Instruct](https://huggingface.co/Qwen/Qwen2-VL-2B-Instruct)| -|[Qwen/Qwen2-VL-7B-Instruct](https://modelscope.cn/models/Qwen/Qwen2-VL-7B-Instruct)|qwen2_vl|qwen2_vl|transformers>=4.45, qwen_vl_utils>=0.0.6, decord|✘|vision, video|[Qwen/Qwen2-VL-7B-Instruct](https://huggingface.co/Qwen/Qwen2-VL-7B-Instruct)| -|[Qwen/Qwen2-VL-72B-Instruct](https://modelscope.cn/models/Qwen/Qwen2-VL-72B-Instruct)|qwen2_vl|qwen2_vl|transformers>=4.45, qwen_vl_utils>=0.0.6, decord|✘|vision, video|[Qwen/Qwen2-VL-72B-Instruct](https://huggingface.co/Qwen/Qwen2-VL-72B-Instruct)| -|[Qwen/Qwen2-VL-2B](https://modelscope.cn/models/Qwen/Qwen2-VL-2B)|qwen2_vl|qwen2_vl|transformers>=4.45, qwen_vl_utils>=0.0.6, decord|✘|vision, video|[Qwen/Qwen2-VL-2B](https://huggingface.co/Qwen/Qwen2-VL-2B)| -|[Qwen/Qwen2-VL-7B](https://modelscope.cn/models/Qwen/Qwen2-VL-7B)|qwen2_vl|qwen2_vl|transformers>=4.45, qwen_vl_utils>=0.0.6, decord|✘|vision, video|[Qwen/Qwen2-VL-7B](https://huggingface.co/Qwen/Qwen2-VL-7B)| -|[Qwen/Qwen2-VL-72B](https://modelscope.cn/models/Qwen/Qwen2-VL-72B)|qwen2_vl|qwen2_vl|transformers>=4.45, qwen_vl_utils>=0.0.6, decord|✘|vision, video|[Qwen/Qwen2-VL-72B](https://huggingface.co/Qwen/Qwen2-VL-72B)| 
+|[Qwen/Qwen2-VL-2B-Instruct](https://modelscope.cn/models/Qwen/Qwen2-VL-2B-Instruct)|qwen2_vl|qwen2_vl|transformers>=4.45, qwen_vl_utils>=0.0.6, decord|✔|vision, video|[Qwen/Qwen2-VL-2B-Instruct](https://huggingface.co/Qwen/Qwen2-VL-2B-Instruct)| +|[Qwen/Qwen2-VL-7B-Instruct](https://modelscope.cn/models/Qwen/Qwen2-VL-7B-Instruct)|qwen2_vl|qwen2_vl|transformers>=4.45, qwen_vl_utils>=0.0.6, decord|✔|vision, video|[Qwen/Qwen2-VL-7B-Instruct](https://huggingface.co/Qwen/Qwen2-VL-7B-Instruct)| +|[Qwen/Qwen2-VL-72B-Instruct](https://modelscope.cn/models/Qwen/Qwen2-VL-72B-Instruct)|qwen2_vl|qwen2_vl|transformers>=4.45, qwen_vl_utils>=0.0.6, decord|✔|vision, video|[Qwen/Qwen2-VL-72B-Instruct](https://huggingface.co/Qwen/Qwen2-VL-72B-Instruct)| +|[Qwen/Qwen2-VL-2B](https://modelscope.cn/models/Qwen/Qwen2-VL-2B)|qwen2_vl|qwen2_vl|transformers>=4.45, qwen_vl_utils>=0.0.6, decord|✔|vision, video|[Qwen/Qwen2-VL-2B](https://huggingface.co/Qwen/Qwen2-VL-2B)| +|[Qwen/Qwen2-VL-7B](https://modelscope.cn/models/Qwen/Qwen2-VL-7B)|qwen2_vl|qwen2_vl|transformers>=4.45, qwen_vl_utils>=0.0.6, decord|✔|vision, video|[Qwen/Qwen2-VL-7B](https://huggingface.co/Qwen/Qwen2-VL-7B)| +|[Qwen/Qwen2-VL-72B](https://modelscope.cn/models/Qwen/Qwen2-VL-72B)|qwen2_vl|qwen2_vl|transformers>=4.45, qwen_vl_utils>=0.0.6, decord|✔|vision, video|[Qwen/Qwen2-VL-72B](https://huggingface.co/Qwen/Qwen2-VL-72B)| |[Qwen/Qwen2-VL-2B-Instruct-GPTQ-Int4](https://modelscope.cn/models/Qwen/Qwen2-VL-2B-Instruct-GPTQ-Int4)|qwen2_vl|qwen2_vl|transformers>=4.45, qwen_vl_utils>=0.0.6, decord|✘|vision, video|[Qwen/Qwen2-VL-2B-Instruct-GPTQ-Int4](https://huggingface.co/Qwen/Qwen2-VL-2B-Instruct-GPTQ-Int4)| |[Qwen/Qwen2-VL-7B-Instruct-GPTQ-Int4](https://modelscope.cn/models/Qwen/Qwen2-VL-7B-Instruct-GPTQ-Int4)|qwen2_vl|qwen2_vl|transformers>=4.45, qwen_vl_utils>=0.0.6, decord|✘|vision, video|[Qwen/Qwen2-VL-7B-Instruct-GPTQ-Int4](https://huggingface.co/Qwen/Qwen2-VL-7B-Instruct-GPTQ-Int4)| |[Qwen/Qwen2-VL-72B-Instruct-GPTQ-Int4](https://modelscope.cn/models/Qwen/Qwen2-VL-72B-Instruct-GPTQ-Int4)|qwen2_vl|qwen2_vl|transformers>=4.45, qwen_vl_utils>=0.0.6, decord|✘|vision, video|[Qwen/Qwen2-VL-72B-Instruct-GPTQ-Int4](https://huggingface.co/Qwen/Qwen2-VL-72B-Instruct-GPTQ-Int4)| @@ -667,16 +667,16 @@ The table below introduces the models integrated with ms-swift: |[Qwen/Qwen2-VL-2B-Instruct-AWQ](https://modelscope.cn/models/Qwen/Qwen2-VL-2B-Instruct-AWQ)|qwen2_vl|qwen2_vl|transformers>=4.45, qwen_vl_utils>=0.0.6, decord|✘|vision, video|[Qwen/Qwen2-VL-2B-Instruct-AWQ](https://huggingface.co/Qwen/Qwen2-VL-2B-Instruct-AWQ)| |[Qwen/Qwen2-VL-7B-Instruct-AWQ](https://modelscope.cn/models/Qwen/Qwen2-VL-7B-Instruct-AWQ)|qwen2_vl|qwen2_vl|transformers>=4.45, qwen_vl_utils>=0.0.6, decord|✘|vision, video|[Qwen/Qwen2-VL-7B-Instruct-AWQ](https://huggingface.co/Qwen/Qwen2-VL-7B-Instruct-AWQ)| |[Qwen/Qwen2-VL-72B-Instruct-AWQ](https://modelscope.cn/models/Qwen/Qwen2-VL-72B-Instruct-AWQ)|qwen2_vl|qwen2_vl|transformers>=4.45, qwen_vl_utils>=0.0.6, decord|✘|vision, video|[Qwen/Qwen2-VL-72B-Instruct-AWQ](https://huggingface.co/Qwen/Qwen2-VL-72B-Instruct-AWQ)| -|[bytedance-research/UI-TARS-2B-SFT](https://modelscope.cn/models/bytedance-research/UI-TARS-2B-SFT)|qwen2_vl|qwen2_vl|transformers>=4.45, qwen_vl_utils>=0.0.6, decord|✘|vision, video|[bytedance-research/UI-TARS-2B-SFT](https://huggingface.co/bytedance-research/UI-TARS-2B-SFT)| -|[bytedance-research/UI-TARS-7B-SFT](https://modelscope.cn/models/bytedance-research/UI-TARS-7B-SFT)|qwen2_vl|qwen2_vl|transformers>=4.45, 
qwen_vl_utils>=0.0.6, decord|✘|vision, video|[bytedance-research/UI-TARS-7B-SFT](https://huggingface.co/bytedance-research/UI-TARS-7B-SFT)| -|[bytedance-research/UI-TARS-7B-DPO](https://modelscope.cn/models/bytedance-research/UI-TARS-7B-DPO)|qwen2_vl|qwen2_vl|transformers>=4.45, qwen_vl_utils>=0.0.6, decord|✘|vision, video|[bytedance-research/UI-TARS-7B-DPO](https://huggingface.co/bytedance-research/UI-TARS-7B-DPO)| -|[bytedance-research/UI-TARS-72B-SFT](https://modelscope.cn/models/bytedance-research/UI-TARS-72B-SFT)|qwen2_vl|qwen2_vl|transformers>=4.45, qwen_vl_utils>=0.0.6, decord|✘|vision, video|[bytedance-research/UI-TARS-72B-SFT](https://huggingface.co/bytedance-research/UI-TARS-72B-SFT)| -|[bytedance-research/UI-TARS-72B-DPO](https://modelscope.cn/models/bytedance-research/UI-TARS-72B-DPO)|qwen2_vl|qwen2_vl|transformers>=4.45, qwen_vl_utils>=0.0.6, decord|✘|vision, video|[bytedance-research/UI-TARS-72B-DPO](https://huggingface.co/bytedance-research/UI-TARS-72B-DPO)| -|[allenai/olmOCR-7B-0225-preview](https://modelscope.cn/models/allenai/olmOCR-7B-0225-preview)|qwen2_vl|qwen2_vl|transformers>=4.45, qwen_vl_utils>=0.0.6, decord|✘|vision, video|[allenai/olmOCR-7B-0225-preview](https://huggingface.co/allenai/olmOCR-7B-0225-preview)| -|[Qwen/Qwen2.5-VL-3B-Instruct](https://modelscope.cn/models/Qwen/Qwen2.5-VL-3B-Instruct)|qwen2_5_vl|qwen2_5_vl|transformers>=4.49, qwen_vl_utils>=0.0.6, decord|✘|vision, video|[Qwen/Qwen2.5-VL-3B-Instruct](https://huggingface.co/Qwen/Qwen2.5-VL-3B-Instruct)| -|[Qwen/Qwen2.5-VL-7B-Instruct](https://modelscope.cn/models/Qwen/Qwen2.5-VL-7B-Instruct)|qwen2_5_vl|qwen2_5_vl|transformers>=4.49, qwen_vl_utils>=0.0.6, decord|✘|vision, video|[Qwen/Qwen2.5-VL-7B-Instruct](https://huggingface.co/Qwen/Qwen2.5-VL-7B-Instruct)| -|[Qwen/Qwen2.5-VL-32B-Instruct](https://modelscope.cn/models/Qwen/Qwen2.5-VL-32B-Instruct)|qwen2_5_vl|qwen2_5_vl|transformers>=4.49, qwen_vl_utils>=0.0.6, decord|✘|vision, video|[Qwen/Qwen2.5-VL-32B-Instruct](https://huggingface.co/Qwen/Qwen2.5-VL-32B-Instruct)| -|[Qwen/Qwen2.5-VL-72B-Instruct](https://modelscope.cn/models/Qwen/Qwen2.5-VL-72B-Instruct)|qwen2_5_vl|qwen2_5_vl|transformers>=4.49, qwen_vl_utils>=0.0.6, decord|✘|vision, video|[Qwen/Qwen2.5-VL-72B-Instruct](https://huggingface.co/Qwen/Qwen2.5-VL-72B-Instruct)| +|[bytedance-research/UI-TARS-2B-SFT](https://modelscope.cn/models/bytedance-research/UI-TARS-2B-SFT)|qwen2_vl|qwen2_vl|transformers>=4.45, qwen_vl_utils>=0.0.6, decord|✔|vision, video|[bytedance-research/UI-TARS-2B-SFT](https://huggingface.co/bytedance-research/UI-TARS-2B-SFT)| +|[bytedance-research/UI-TARS-7B-SFT](https://modelscope.cn/models/bytedance-research/UI-TARS-7B-SFT)|qwen2_vl|qwen2_vl|transformers>=4.45, qwen_vl_utils>=0.0.6, decord|✔|vision, video|[bytedance-research/UI-TARS-7B-SFT](https://huggingface.co/bytedance-research/UI-TARS-7B-SFT)| +|[bytedance-research/UI-TARS-7B-DPO](https://modelscope.cn/models/bytedance-research/UI-TARS-7B-DPO)|qwen2_vl|qwen2_vl|transformers>=4.45, qwen_vl_utils>=0.0.6, decord|✔|vision, video|[bytedance-research/UI-TARS-7B-DPO](https://huggingface.co/bytedance-research/UI-TARS-7B-DPO)| +|[bytedance-research/UI-TARS-72B-SFT](https://modelscope.cn/models/bytedance-research/UI-TARS-72B-SFT)|qwen2_vl|qwen2_vl|transformers>=4.45, qwen_vl_utils>=0.0.6, decord|✔|vision, video|[bytedance-research/UI-TARS-72B-SFT](https://huggingface.co/bytedance-research/UI-TARS-72B-SFT)| 
+|[bytedance-research/UI-TARS-72B-DPO](https://modelscope.cn/models/bytedance-research/UI-TARS-72B-DPO)|qwen2_vl|qwen2_vl|transformers>=4.45, qwen_vl_utils>=0.0.6, decord|✔|vision, video|[bytedance-research/UI-TARS-72B-DPO](https://huggingface.co/bytedance-research/UI-TARS-72B-DPO)| +|[allenai/olmOCR-7B-0225-preview](https://modelscope.cn/models/allenai/olmOCR-7B-0225-preview)|qwen2_vl|qwen2_vl|transformers>=4.45, qwen_vl_utils>=0.0.6, decord|✔|vision, video|[allenai/olmOCR-7B-0225-preview](https://huggingface.co/allenai/olmOCR-7B-0225-preview)| +|[Qwen/Qwen2.5-VL-3B-Instruct](https://modelscope.cn/models/Qwen/Qwen2.5-VL-3B-Instruct)|qwen2_5_vl|qwen2_5_vl|transformers>=4.49, qwen_vl_utils>=0.0.6, decord|✔|vision, video|[Qwen/Qwen2.5-VL-3B-Instruct](https://huggingface.co/Qwen/Qwen2.5-VL-3B-Instruct)| +|[Qwen/Qwen2.5-VL-7B-Instruct](https://modelscope.cn/models/Qwen/Qwen2.5-VL-7B-Instruct)|qwen2_5_vl|qwen2_5_vl|transformers>=4.49, qwen_vl_utils>=0.0.6, decord|✔|vision, video|[Qwen/Qwen2.5-VL-7B-Instruct](https://huggingface.co/Qwen/Qwen2.5-VL-7B-Instruct)| +|[Qwen/Qwen2.5-VL-32B-Instruct](https://modelscope.cn/models/Qwen/Qwen2.5-VL-32B-Instruct)|qwen2_5_vl|qwen2_5_vl|transformers>=4.49, qwen_vl_utils>=0.0.6, decord|✔|vision, video|[Qwen/Qwen2.5-VL-32B-Instruct](https://huggingface.co/Qwen/Qwen2.5-VL-32B-Instruct)| +|[Qwen/Qwen2.5-VL-72B-Instruct](https://modelscope.cn/models/Qwen/Qwen2.5-VL-72B-Instruct)|qwen2_5_vl|qwen2_5_vl|transformers>=4.49, qwen_vl_utils>=0.0.6, decord|✔|vision, video|[Qwen/Qwen2.5-VL-72B-Instruct](https://huggingface.co/Qwen/Qwen2.5-VL-72B-Instruct)| |[Qwen/Qwen2.5-VL-3B-Instruct-AWQ](https://modelscope.cn/models/Qwen/Qwen2.5-VL-3B-Instruct-AWQ)|qwen2_5_vl|qwen2_5_vl|transformers>=4.49, qwen_vl_utils>=0.0.6, decord|✘|vision, video|[Qwen/Qwen2.5-VL-3B-Instruct-AWQ](https://huggingface.co/Qwen/Qwen2.5-VL-3B-Instruct-AWQ)| |[Qwen/Qwen2.5-VL-7B-Instruct-AWQ](https://modelscope.cn/models/Qwen/Qwen2.5-VL-7B-Instruct-AWQ)|qwen2_5_vl|qwen2_5_vl|transformers>=4.49, qwen_vl_utils>=0.0.6, decord|✘|vision, video|[Qwen/Qwen2.5-VL-7B-Instruct-AWQ](https://huggingface.co/Qwen/Qwen2.5-VL-7B-Instruct-AWQ)| |[Qwen/Qwen2.5-VL-32B-Instruct-AWQ](https://modelscope.cn/models/Qwen/Qwen2.5-VL-32B-Instruct-AWQ)|qwen2_5_vl|qwen2_5_vl|transformers>=4.49, qwen_vl_utils>=0.0.6, decord|✘|vision, video|[Qwen/Qwen2.5-VL-32B-Instruct-AWQ](https://huggingface.co/Qwen/Qwen2.5-VL-32B-Instruct-AWQ)| diff --git a/docs/source_en/Megatron-SWIFT/Command-line-parameters.md b/docs/source_en/Megatron-SWIFT/Command-line-parameters.md index 0501c3db87..89963700d0 100644 --- a/docs/source_en/Megatron-SWIFT/Command-line-parameters.md +++ b/docs/source_en/Megatron-SWIFT/Command-line-parameters.md @@ -206,6 +206,10 @@ seq_length: Defaults to None, meaning it is set to `max_length`. To restrict the **Tuner Parameters**: - train_type: Options are `'lora'` and `'full'`. Default is `'full'`. +- 🔥freeze_llm: This parameter only takes effect for multimodal models and can be used in both full-parameter and LoRA training, but with different behaviors. In full-parameter training, setting `freeze_llm` to `True` will freeze the weights of the LLM component. In LoRA training with `target_modules` set to 'all-linear', setting `freeze_llm` to `True` will prevent LoRA modules from being added to the LLM component. The default value is `False`. +- 🔥freeze_vit: This parameter only applies to multimodal models and can be used in both full-parameter and LoRA training, though with different effects. 
In full-parameter training, setting `freeze_vit` to `True` will freeze the weights of the ViT component. In LoRA training with `target_modules` set to 'all-linear', setting `freeze_vit` to `True` will prevent LoRA modules from being added to the ViT component. The default value is `True`. + - Note: The term "ViT" here refers not only to the vision tower but also includes the audio tower. +- 🔥freeze_aligner: This parameter is only effective for multimodal models and can be used in both full-parameter and LoRA training, with differing outcomes. In full-parameter training, setting `freeze_aligner` to `True` will freeze the weights of the aligner (also known as the projector) component. In LoRA training with `target_modules` set to 'all-linear', setting `freeze_aligner` to `True` will prevent LoRA modules from being added to the aligner component. The default value is `True`. Full-parameter Training: @@ -249,6 +253,8 @@ Megatron training parameters are inherited from Megatron parameters and basic pa - If you wish to customize the attention_mask, you can set `--padding_free false`. - Note: The Megatron-SWIFT training feature prioritizes support for the padding-free format. Unless under special circumstances, please do not modify this value. - mlp_padding_free: The default is False. This is used for applying padding-free optimization to the MLP when padding_free is set to false. It allows for improved training speed and reduced memory usage while customizing the attention_mask. +- vit_gradient_checkpointing: Whether to enable gradient checkpointing for the ViT part during multimodal model training. Default: True. +- gradient_checkpointing_kwargs: Arguments passed to `torch.utils.checkpoint`. For example: set `--gradient_checkpointing_kwargs '{"use_reentrant": false}'`. Default: None. - 🔥packing: Whether to use sequence packing, defaults to False. Currently supports CPT/SFT/DPO. - packing_length: the length to use for packing. Defaults to None, in which case it is set to max_length. - streaming: Stream data loading and processing, default is False. diff --git a/docs/source_en/Megatron-SWIFT/Multimodal-Model.md b/docs/source_en/Megatron-SWIFT/Multimodal-Model.md new file mode 100644 index 0000000000..91becd4917 --- /dev/null +++ b/docs/source_en/Megatron-SWIFT/Multimodal-Model.md @@ -0,0 +1,158 @@ +# Multimodal Models + +ms-swift introduces Megatron's parallelization techniques to accelerate the training of large multimodal models. Currently, it supports CPT/SFT/DPO for models such as Qwen2.5-VL. For a complete list of supported models, please refer to the [Supported Models and Datasets documentation](../Instruction/Supported-models-and-datasets.md). + +For environment setup, please refer to the Megatron-SWIFT [Quick Start guide](./Quick-start.md). + +## Dense Model Full/LoRA Fine-tuning + +This section demonstrates fine-tuning the Qwen2.5-VL-7B-Instruct model on the LaTeX-OCR task using two 80GiB A100 GPUs, with both full-parameter fine-tuning and LoRA. The best practices described below can be completed within 10 minutes. 
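+
+The training and inference commands below all export `MAX_PIXELS=1003520`, which caps the pixel budget of each input image during preprocessing and therefore bounds the number of vision tokens per sample (1003520 = 1280 × 28 × 28, i.e. about 1280 vision tokens per image if one assumes a 28-pixel patch-merge unit). The snippet below is only a rough sketch of that capping idea for intuition; the real resizing is performed by `qwen_vl_utils`, and the helper name and rounding details here are assumptions rather than the library API.
+
+```python
+import math
+
+def cap_pixels(height: int, width: int, max_pixels: int = 1003520, factor: int = 28):
+    """Illustrative sketch: bound height * width by max_pixels while keeping the
+    aspect ratio and rounding both sides to a multiple of `factor` (assumed here to
+    be the 28-pixel patch-merge unit). The actual resizing is done by qwen_vl_utils."""
+    if height * width > max_pixels:
+        scale = math.sqrt(max_pixels / (height * width))
+        height, width = height * scale, width * scale
+    # Snap both sides to the patch grid, but never below one patch unit.
+    height = max(math.floor(height / factor) * factor, factor)
+    width = max(math.floor(width / factor) * factor, factor)
+    return height, width
+
+print(cap_pixels(2000, 1500))  # -> (1148, 840); 1148 * 840 = 964320 <= 1003520
+```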
+ +First, we need to convert the model weights from Hugging Face format to Megatron format: +```shell +CUDA_VISIBLE_DEVICES=0 \ +swift export \ + --model Qwen/Qwen2.5-VL-7B-Instruct \ + --to_mcore true \ + --torch_dtype bfloat16 \ + --output_dir Qwen2.5-VL-7B-Instruct-mcore \ + --test_convert_precision true +``` + +### Full + +The full-parameter training script is as follows: +```shell +# 2 * 72GiB; 4.1s/it +PYTORCH_CUDA_ALLOC_CONF='expandable_segments:True' \ +NPROC_PER_NODE=2 \ +MAX_PIXELS=1003520 \ +CUDA_VISIBLE_DEVICES=0,1 \ +megatron sft \ + --load Qwen2.5-VL-7B-Instruct-mcore \ + --dataset 'AI-ModelScope/LaTeX_OCR:human_handwrite#5000' \ + --tensor_model_parallel_size 2 \ + --sequence_parallel true \ + --packing true \ + --freeze_llm false \ + --freeze_vit true \ + --freeze_aligner true \ + --split_dataset_ratio 0.01 \ + --micro_batch_size 1 \ + --global_batch_size 4 \ + --recompute_granularity full \ + --recompute_method uniform \ + --recompute_num_layers 1 \ + --finetune true \ + --cross_entropy_loss_fusion true \ + --lr 1e-5 \ + --lr_warmup_fraction 0.05 \ + --min_lr 1e-6 \ + --max_epochs 1 \ + --save megatron_output/Qwen2.5-VL-7B-Instruct \ + --save_interval 200 \ + --vit_gradient_checkpointing true \ + --max_length 2048 \ + --num_workers 4 \ + --no_save_optim true \ + --no_save_rng true \ + --dataset_num_proc 8 +``` + +Convert Megatron-format weights saved with full parameters to Hugging Face format: + +- Note: `--mcore_model` should point to the parent directory of `iter_xxx`. By default, the checkpoint specified in `latest_checkpointed_iteration.txt` will be used. + +```shell +CUDA_VISIBLE_DEVICES=0 \ +swift export \ + --mcore_model megatron_output/Qwen2.5-VL-7B-Instruct/vx-xxx \ + --to_hf true \ + --torch_dtype bfloat16 \ + --output_dir megatron_output/Qwen2.5-VL-7B-Instruct/vx-xxx-hf \ + --test_convert_precision true +``` + +### LoRA + +The LoRA training script is as follows: +```shell +# 2 * 23GiB; 2.3s/it +PYTORCH_CUDA_ALLOC_CONF='expandable_segments:True' \ +NPROC_PER_NODE=2 \ +MAX_PIXELS=1003520 \ +CUDA_VISIBLE_DEVICES=0,1 \ +megatron sft \ + --load Qwen2.5-VL-7B-Instruct-mcore \ + --dataset 'AI-ModelScope/LaTeX_OCR:human_handwrite#5000' \ + --train_type lora \ + --lora_rank 8 \ + --lora_alpha 32 \ + --target_modules all-linear \ + --tensor_model_parallel_size 1 \ + --sequence_parallel true \ + --freeze_llm false \ + --freeze_vit true \ + --freeze_aligner true \ + --packing true \ + --split_dataset_ratio 0.01 \ + --micro_batch_size 1 \ + --global_batch_size 4 \ + --recompute_granularity full \ + --recompute_method uniform \ + --recompute_num_layers 1 \ + --finetune true \ + --cross_entropy_loss_fusion true \ + --lr 1e-4 \ + --lr_warmup_fraction 0.05 \ + --min_lr 1e-5 \ + --max_epochs 1 \ + --save megatron_output/Qwen2.5-VL-7B-Instruct \ + --save_interval 200 \ + --vit_gradient_checkpointing true \ + --max_length 2048 \ + --num_workers 4 \ + --no_save_optim true \ + --no_save_rng true \ + --dataset_num_proc 8 +``` + +Merge the LoRA-saved incremental weights and convert them to Hugging Face format: +```shell +CUDA_VISIBLE_DEVICES=0 \ +swift export \ + --mcore_adapters megatron_output/Qwen2.5-VL-7B-Instruct/vx-xxx \ + --to_hf true \ + --torch_dtype bfloat16 \ + --output_dir megatron_output/Qwen2.5-VL-7B-Instruct/vx-xxx-hf \ + --test_convert_precision true +``` + + +Finally, we use the generated Hugging Face format weights to perform inference on the validation set: +```shell +MAX_PIXELS=1003520 \ +CUDA_VISIBLE_DEVICES=0 \ +swift infer \ + --model 
megatron_output/Qwen2.5-VL-7B-Instruct/vx-xxx-hf \ + --attn_impl flash_attn \ + --stream true \ + --load_data_args true \ + --temperature 0 \ + --max_new_tokens 512 +``` + +The inference results are as follows: +``` +[QUERY] Using LaTeX to perform OCR on the image. +[LABELS] \forall x \in X , ( \alpha f ) ( x ) = \alpha f ( x ) +[RESPONSE] \forall x \in X , ( \alpha f ) ( x ) = \alpha f ( x ) +-------------------------------------------------- +[QUERY] Using LaTeX to perform OCR on the image. +[LABELS] \pi \int _ { c } ^ { d } \{ g ( y ) \} ^ { 2 } d y +[RESPONSE] \pi \int _ { c } ^ { d } \{ g ( y ) \} ^ { 2 } d y +-------------------------------------------------- +[QUERY] Using LaTeX to perform OCR on the image. +[LABELS] [ \frac 2 3 x ^ { \frac 3 2 } ] _ { 0 } ^ { 1 } +[RESPONSE] [ \frac 2 3 x ^ { \frac 3 2 } ] _ { 0 } ^ { 1 } +``` diff --git a/docs/source_en/Megatron-SWIFT/Quick-start.md b/docs/source_en/Megatron-SWIFT/Quick-start.md index 0e5ceca9d4..389ae8bea1 100644 --- a/docs/source_en/Megatron-SWIFT/Quick-start.md +++ b/docs/source_en/Megatron-SWIFT/Quick-start.md @@ -1,6 +1,6 @@ # Quick Start -ms-swift incorporates Megatron's parallelization techniques to accelerate the training of large models, including data parallelism, tensor parallelism, pipeline parallelism, sequence parallelism, context parallelism, and expert parallelism. It supports the pre-training and fine-tuning of models such as Qwen3, [Qwen3-MoE](https://github.com/modelscope/ms-swift/blob/main/examples/megatron/qwen3_moe.sh), Qwen2.5, Llama3, Deepseek-R1 and GLM4.5 series. For a complete list of supported models, please refer to the [Supported Models and Datasets documentation](../Instruction/Supported-models-and-datasets.md). We recommend using Megatron-SWIFT for MoE training; it can typically achieve a 10x speedup in training. +ms-swift incorporates Megatron's parallelization techniques to accelerate the training of large models, including data parallelism, tensor parallelism, pipeline parallelism, sequence parallelism, context parallelism, and expert parallelism. It supports CPT/SFT/DPO for models such as Qwen3, [Qwen3-MoE](https://github.com/modelscope/ms-swift/blob/main/examples/megatron/qwen3_moe.sh), Qwen2.5, Llama3, Deepseek-R1 and GLM4.5 series. For a complete list of supported models, please refer to the [Supported Models and Datasets documentation](../Instruction/Supported-models-and-datasets.md). We recommend using Megatron-SWIFT for MoE training; it can typically achieve a 10x speedup in training. ## Environment Setup diff --git a/docs/source_en/index.rst b/docs/source_en/index.rst index a7a3ac0811..c561735643 100644 --- a/docs/source_en/index.rst +++ b/docs/source_en/index.rst @@ -38,6 +38,7 @@ Swift DOCUMENTATION Megatron-SWIFT/Quick-start.md Megatron-SWIFT/Command-line-parameters.md Megatron-SWIFT/LoRA-Training.md + Megatron-SWIFT/Multimodal-Model.md .. 
toctree:: diff --git a/examples/megatron/multimodal/dense/dpo.sh b/examples/megatron/multimodal/dense/dpo.sh new file mode 100644 index 0000000000..edea6bdb35 --- /dev/null +++ b/examples/megatron/multimodal/dense/dpo.sh @@ -0,0 +1,39 @@ +# 4 * 60GiB 14s/it +PYTORCH_CUDA_ALLOC_CONF='expandable_segments:True' \ +NPROC_PER_NODE=4 \ +MAX_PIXELS=1003520 \ +CUDA_VISIBLE_DEVICES=0,1,2,3 \ +megatron rlhf \ + --rlhf_type dpo \ + --load Qwen2.5-VL-7B-Instruct-mcore \ + --dataset 'swift/RLAIF-V-Dataset#20000' \ + --train_type full \ + --tensor_model_parallel_size 4 \ + --sequence_parallel true \ + --freeze_llm false \ + --freeze_vit true \ + --freeze_aligner true \ + --packing true \ + --split_dataset_ratio 0.01 \ + --micro_batch_size 1 \ + --global_batch_size 4 \ + --recompute_granularity full \ + --recompute_method uniform \ + --recompute_num_layers 1 \ + --finetune true \ + --cross_entropy_loss_fusion true \ + --lr 1e-5 \ + --lr_warmup_fraction 0.05 \ + --min_lr 1e-6 \ + --max_epochs 1 \ + --save megatron_output/Qwen2.5-VL-7B-Instruct \ + --save_interval 200 \ + --vit_gradient_checkpointing true \ + --max_length 8192 \ + --num_workers 4 \ + --no_save_optim true \ + --no_save_rng true \ + --dataset_num_proc 16 \ + --attention_backend flash \ + --beta 0.1 \ + --loss_type sigmoid diff --git a/examples/megatron/multimodal/dense/full.sh b/examples/megatron/multimodal/dense/full.sh new file mode 100644 index 0000000000..3590fad38d --- /dev/null +++ b/examples/megatron/multimodal/dense/full.sh @@ -0,0 +1,34 @@ +# 2 * 72GiB; 4.1s/it +PYTORCH_CUDA_ALLOC_CONF='expandable_segments:True' \ +NPROC_PER_NODE=2 \ +MAX_PIXELS=1003520 \ +CUDA_VISIBLE_DEVICES=0,1 \ +megatron sft \ + --load Qwen2.5-VL-7B-Instruct-mcore \ + --dataset 'AI-ModelScope/LaTeX_OCR:human_handwrite#5000' \ + --tensor_model_parallel_size 2 \ + --sequence_parallel true \ + --packing true \ + --freeze_llm false \ + --freeze_vit true \ + --freeze_aligner true \ + --split_dataset_ratio 0.01 \ + --micro_batch_size 1 \ + --global_batch_size 4 \ + --recompute_granularity full \ + --recompute_method uniform \ + --recompute_num_layers 1 \ + --finetune true \ + --cross_entropy_loss_fusion true \ + --lr 1e-5 \ + --lr_warmup_fraction 0.05 \ + --min_lr 1e-6 \ + --max_epochs 1 \ + --save megatron_output/Qwen2.5-VL-7B-Instruct \ + --save_interval 200 \ + --vit_gradient_checkpointing true \ + --max_length 2048 \ + --num_workers 4 \ + --no_save_optim true \ + --no_save_rng true \ + --dataset_num_proc 8 diff --git a/examples/megatron/multimodal/dense/lora.sh b/examples/megatron/multimodal/dense/lora.sh new file mode 100644 index 0000000000..1e232f5f07 --- /dev/null +++ b/examples/megatron/multimodal/dense/lora.sh @@ -0,0 +1,38 @@ +# 2 * 23GiB; 2.3s/it +PYTORCH_CUDA_ALLOC_CONF='expandable_segments:True' \ +NPROC_PER_NODE=2 \ +MAX_PIXELS=1003520 \ +CUDA_VISIBLE_DEVICES=0,1 \ +megatron sft \ + --load Qwen2.5-VL-7B-Instruct-mcore \ + --dataset 'AI-ModelScope/LaTeX_OCR:human_handwrite#5000' \ + --train_type lora \ + --lora_rank 8 \ + --lora_alpha 32 \ + --target_modules all-linear \ + --tensor_model_parallel_size 1 \ + --sequence_parallel true \ + --freeze_llm false \ + --freeze_vit true \ + --freeze_aligner true \ + --packing true \ + --split_dataset_ratio 0.01 \ + --micro_batch_size 1 \ + --global_batch_size 4 \ + --recompute_granularity full \ + --recompute_method uniform \ + --recompute_num_layers 1 \ + --finetune true \ + --cross_entropy_loss_fusion true \ + --lr 1e-4 \ + --lr_warmup_fraction 0.05 \ + --min_lr 1e-5 \ + --max_epochs 1 \ + --save 
megatron_output/Qwen2.5-VL-7B-Instruct \ + --save_interval 200 \ + --vit_gradient_checkpointing true \ + --max_length 2048 \ + --num_workers 4 \ + --no_save_optim true \ + --no_save_rng true \ + --dataset_num_proc 8 diff --git a/swift/llm/template/base.py b/swift/llm/template/base.py index a1ef0b3905..c8a86cb3ad 100644 --- a/swift/llm/template/base.py +++ b/swift/llm/template/base.py @@ -1216,7 +1216,7 @@ def _encode_truncated(self, inputs: StdTemplateInputs): encoded[key] = value else: encoded = self._encode(inputs) - + self._handle_megatron_cp(encoded) # TODO: fix cp_size & cached_dataset input_ids = encoded.get('input_ids') labels = encoded.get('labels') loss_scale = encoded.get('loss_scale') @@ -1276,7 +1276,6 @@ def _encode(self, inputs: StdTemplateInputs) -> Dict[str, Any]: encoded['input_ids'] = input_ids encoded['labels'] = labels encoded['loss_scale'] = loss_scale - self._handle_megatron_cp(encoded) # TODO: fix cp_size & cached_dataset if encoded.get('labels') is not None: encoded['labels'][0] = -100 if encoded.get('loss_scale') is not None: @@ -1626,7 +1625,7 @@ def _data_collator(self, batch: List[Dict[str, Any]], *, padding_to: Optional[in res = {} if self.padding_free: assert len(batch) == 1, f'batch: {batch}' - for k in ['input_ids', 'labels', 'position_ids', 'loss_scale', 'channel']: + for k in ['input_ids', 'labels', 'position_ids', 'loss_scale', 'channel', 'real_position_ids']: v = batch[0].get(k) if v is not None: res[k] = v if k == 'channel' else [v] @@ -1648,9 +1647,10 @@ def _data_collator(self, batch: List[Dict[str, Any]], *, padding_to: Optional[in res[key] = val keys = [ - 'input_ids', 'inputs_embeds', 'attention_mask', 'labels', 'loss_scale', 'position_ids', 'token_type_ids' + 'input_ids', 'inputs_embeds', 'attention_mask', 'labels', 'loss_scale', 'position_ids', 'token_type_ids', + 'real_position_ids' ] - pad_values = [self.tokenizer.pad_token_id, 0., 0, -100, 0., 0., 0] + pad_values = [self.tokenizer.pad_token_id, 0., 0, -100, 0., 0., 0, 0.] # Convert to tensor and remove unnecessary dimensions. 
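+        # NOTE: `keys` and `pad_values` are index-aligned; the new 'real_position_ids' entry pads with 0.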
seq_lens = None for key in keys: @@ -1677,10 +1677,14 @@ def _data_collator(self, batch: List[Dict[str, Any]], *, padding_to: Optional[in if self.padding_free: cp_size = self.sequence_parallel_size if cp_size > 1: - padding_len = padding_to - seq_lens[0] - position_ids = res['position_ids'][0].tolist() - position_ids += list(range(cp_size * 2)) * (padding_len // (cp_size * 2)) - res['position_ids'] = [torch.tensor(position_ids)] + for key in ['position_ids', 'real_position_ids']: + padding_len = padding_to - seq_lens[0] + position_ids = res[key][0] + extended_position_ids = torch.arange(cp_size * 2).repeat(padding_len // (cp_size * 2)) + if position_ids.ndim == 3: # compat mrope + extended_position_ids = extended_position_ids[None, + None, :].expand(position_ids.shape[0], 1, -1) + res[key] = [torch.concat([position_ids, extended_position_ids], dim=-1)] else: seq_len = max(seq_lens) if padding_to is None else padding_to res['attention_mask'] = torch.tril(torch.ones( @@ -1694,13 +1698,16 @@ def _data_collator(self, batch: List[Dict[str, Any]], *, padding_to: Optional[in continue if self.use_megatron and not self.padding_free and key == 'attention_mask': continue - if padding_to is not None and not (self.padding_free and key == 'position_ids' + if padding_to is not None and not (self.padding_free and key in {'position_ids', 'real_position_ids'} and self.sequence_parallel_size > 1): padding_len = padding_to - seq_lens[0] if padding_len > 0: res[key][0] = F.pad(res[key][0], (0, padding_len) if padding_right else (padding_len, 0), 'constant', pad_value) - res[key] = self._pad_sequence(res[key], pad_value) + if key == 'real_position_ids': + res[key] = torch.concat(res[key], dim=-1) + else: + res[key] = self._pad_sequence(res[key], pad_value) # multimodal res.update(self._data_collator_mm_data(batch)) diff --git a/swift/llm/template/template/qwen.py b/swift/llm/template/template/qwen.py index b59bbfa2f3..f6a935f168 100644 --- a/swift/llm/template/template/qwen.py +++ b/swift/llm/template/template/qwen.py @@ -424,9 +424,7 @@ def _get_position_ids(self, inputs: Dict[str, Any]): def _data_collator(self, batch: List[Dict[str, Any]], *, padding_to: Optional[int] = None) -> Dict[str, Any]: res = super()._data_collator(batch, padding_to=padding_to) - if self.padding_free: - res['real_position_ids'] = self.concat_tensor(batch, 'real_position_ids', -1) - elif self.is_training: + if not self.padding_free and self.is_training: res['position_ids'] = self._get_position_ids(res) return res diff --git a/swift/megatron/argument/megatron_args.py b/swift/megatron/argument/megatron_args.py index 9628b4da6a..02b3721add 100644 --- a/swift/megatron/argument/megatron_args.py +++ b/swift/megatron/argument/megatron_args.py @@ -31,6 +31,9 @@ class RLHFMegatronArgumentsMixin: @dataclass class MegatronTunerMixin: train_type: Literal['lora', 'full'] = 'full' + freeze_llm: bool = False + freeze_vit: bool = True + freeze_aligner: bool = True # full freeze_parameters: List[str] = field(default_factory=list) freeze_parameters_regex: Optional[str] = None @@ -71,6 +74,8 @@ def load_tuner_config(adapter_load: Optional[str]) -> Dict[str, Any]: def __post_init__(self): if self.freeze_parameters_ratio > 0 and self.pipeline_model_parallel_size > 1: raise ValueError('`freeze_parameters_ratio` is not supported when `pipeline_model_parallel_size` > 1') + if self.target_regex: + self.target_modules = self.target_regex @dataclass @@ -94,6 +99,10 @@ class ExtraMegatronArguments(RLHFMegatronArgumentsMixin, MegatronTunerMixin): 
partial_rotary_factor: Optional[float] = None use_shared_expert_gate: Optional[bool] = None + # visual + vit_gradient_checkpointing: bool = True + gradient_checkpointing_kwargs: Optional[Union[dict, str]] = None + @dataclass class MegatronArguments(ExtraMegatronArguments): @@ -185,7 +194,8 @@ class MegatronArguments(ExtraMegatronArguments): group_query_attention: Optional[bool] = None num_query_groups: Optional[int] = None max_position_embeddings: Optional[int] = None - position_embedding_type: Literal['learned_absolute', 'rope', 'mrope', 'relative', 'none'] = 'rope' + position_embedding_type: Optional[Literal['learned_absolute', 'rope', 'mrope', 'relative', 'none']] = None + mrope_section: Optional[List[int]] = None rotary_base: Optional[int] = None rotary_percent: float = 1. rotary_interleaved: Optional[bool] = None @@ -376,10 +386,14 @@ def __post_init__(self): self.rope_scaling = json_parse_to_dict(self.rope_scaling) if 'type' in self.rope_scaling and 'rope_type' not in self.rope_scaling: self.rope_scaling['rope_type'] = self.rope_scaling['type'] + if self.gradient_checkpointing_kwargs is not None: + self.gradient_checkpointing_kwargs = json_parse_to_dict(self.gradient_checkpointing_kwargs) if self.eval_interval is None: self.eval_interval = self.save_interval if self.seq_length is None: self.seq_length = self.max_position_embeddings + if self.position_embedding_type is None: + self.position_embedding_type = 'rope' if self.tensorboard_dir is None and self.save is not None: self.tensorboard_dir = f'{self.save}/runs' self._init_moe() diff --git a/swift/megatron/argument/train_args.py b/swift/megatron/argument/train_args.py index 6f492dc7cd..9481b451df 100644 --- a/swift/megatron/argument/train_args.py +++ b/swift/megatron/argument/train_args.py @@ -17,7 +17,6 @@ class MegatronTrainArguments(MegatronArguments, BaseArguments): add_version: bool = True def init_model_args(self, tokenizer, config): - self.megatron_model_meta = get_megatron_model_meta(self.model_type) kwargs = self.megatron_model_meta.convert_hf_config(config) if self.new_special_tokens and kwargs['padded_vocab_size'] < len(tokenizer): kwargs['padded_vocab_size'] = math.ceil(len(tokenizer) / 128) * 128 @@ -28,6 +27,9 @@ def init_model_args(self, tokenizer, config): setattr(self, k, v) MegatronArguments.__post_init__(self) self.extra_args = self.parse_to_megatron() + self.extra_args['model_info'] = self.model_info + self.extra_args['model_meta'] = self.model_meta + self.extra_args['megatron_model_meta'] = self.megatron_model_meta def _init_save(self): init_process_group(backend=self.ddp_backend, timeout=self.ddp_timeout) @@ -46,6 +48,7 @@ def __post_init__(self): self.padding_free = True self.load = to_abspath(self.load, check_path_exist=True) BaseArguments.__post_init__(self) + self.megatron_model_meta = get_megatron_model_meta(self.model_type) if len(self.dataset) == 0 and len(self.cached_dataset) == 0: raise ValueError(f'self.dataset: {self.dataset}, self.cached_dataset: {self.cached_dataset}. 
' 'Please input the training dataset.') diff --git a/swift/megatron/init.py b/swift/megatron/init.py index a1d90db786..f9ae0747dd 100644 --- a/swift/megatron/init.py +++ b/swift/megatron/init.py @@ -518,6 +518,103 @@ def __repr__(self): TELinear.__repr__ = __repr__ +def _patch_mrope(): + from megatron.core.models.common.embeddings.rotary_pos_embedding import MultimodalRotaryEmbedding + from megatron.core import parallel_state + from megatron.core.models.common.embeddings.rope_utils import (get_pos_emb_on_this_cp_rank, + _apply_rotary_pos_emb_bshd) + from megatron.core.models.common.embeddings import rope_utils + from megatron.training import get_args + + def forward(self, position_ids, mrope_section: List[int], packed_seq: bool = False) -> torch.Tensor: + seq = position_ids.to(device=self.inv_freq.device, dtype=self.inv_freq.dtype) + + if self.seq_len_interpolation_factor is not None: + seq *= 1 / self.seq_len_interpolation_factor + + # shape (3, bs, dim, 1) + inv_freq_expanded = self.inv_freq[None, None, :, None].expand(3, seq.shape[1], -1, 1) + # shape (3, bs, 1, seq_length) + seq_expanded = seq[:, :, None, :].float() + # shape (3, bs, seq_length, dim) + freqs = (inv_freq_expanded @ seq_expanded).transpose(2, 3) + # first part even vector components, second part odd vector components, + # 2 * dim in dimension size + if not self.rotary_interleaved: + emb = torch.cat((freqs, freqs), dim=-1) # shape (3, bs, seq_length, 2 * dim) + else: + bs = freqs.shape[1] + emb = torch.stack((freqs.view(3, bs, -1, 1), freqs.view(3, bs, -1, 1)), + dim=-1).view(3, bs, freqs.shape[0], -1) + + # generate freqs with mrope_section + # shape (bs, seq_length, 2 * dim) + mrope_section = mrope_section * 2 + emb = torch.cat([m[i % 3] for i, m in enumerate(emb.split(mrope_section, dim=-1))], dim=-1) + + # shape (seq_length, bs, 1, 2 * dim) + emb = emb[..., None, :].transpose(0, 1).contiguous() + if parallel_state.get_context_parallel_world_size() > 1 and not packed_seq: + # slice rotary_pos_emb along sequence dimension and select the parition of the current + # CP rank + emb = get_pos_emb_on_this_cp_rank(emb, 0, parallel_state.get_context_parallel_group()) + return emb + + MultimodalRotaryEmbedding.forward = forward + _origin_apply_rotary_pos_emb_thd = rope_utils._apply_rotary_pos_emb_thd + + def _apply_rotary_pos_emb_thd( + t: torch.Tensor, + cu_seqlens: torch.Tensor, + freqs: torch.Tensor, + rotary_interleaved: bool = False, + multi_latent_attention: bool = False, + mscale: float = 1.0, + cp_group: torch.distributed.ProcessGroup = None, + ) -> torch.Tensor: + """A baseline implementation of applying RoPE for `thd` format. + + Args: + t (Tensor): Input tensor T is of shape [t, h, d] + cu_seqlens(Tensor): Cumulative sum of sequence lengths in a batch for `t`, + with shape [b + 1] and dtype torch.int32. + freqs (Tensor): Rotary Positional embedding tensor freq is of shape [max_s, 1, 1, d] + cp_group (torch.distributed.ProcessGroup): The context parallel group + + Returns: + Tensor: Shape [t, h, d]. The input tensor after applying RoPE. 
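+
+        Note: when `position_embedding_type` is 'mrope' (set automatically for
+        Qwen2-VL/Qwen2.5-VL by `convert_gpt_hf_config`), `freqs` holds per-token rotary
+        embeddings and RoPE is applied independently to each packed sub-sequence delimited
+        by `cu_seqlens`; otherwise the call falls through to the original Megatron
+        implementation.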
+ """ + args = get_args() + if args.position_embedding_type != 'mrope': + return _origin_apply_rotary_pos_emb_thd( + t, + cu_seqlens, + freqs, + rotary_interleaved=rotary_interleaved, + multi_latent_attention=multi_latent_attention, + mscale=mscale, + cp_group=cp_group, + ) + + if cp_group is None: + raise ValueError('cp_group must be provided for THD format RoPE') + cp_size = cp_group.size() + cu_seqlens = cu_seqlens // cp_size + seqlens = (cu_seqlens[1:] - cu_seqlens[:-1]).tolist() + + return torch.cat([ + _apply_rotary_pos_emb_bshd( + x.unsqueeze(1), + f, + rotary_interleaved=rotary_interleaved, + multi_latent_attention=multi_latent_attention, + mscale=mscale, + ) for x, f in zip(torch.split(t, seqlens), torch.split(freqs, seqlens)) + ]).squeeze(1) + + rope_utils._apply_rotary_pos_emb_thd = _apply_rotary_pos_emb_thd + + def _patch_megatron(): _patch_flash_attn() _patch_transformer_engine() @@ -527,6 +624,7 @@ def _patch_megatron(): _patch_TEGroupedLinear() _patch_TransformerLayer() _patch_compile_helpers() + _patch_mrope() from swift.megatron import tuners # patch lora try: _patch_torch_FileSystemReader() @@ -546,6 +644,8 @@ def _patch_megatron(): def init_megatron_env() -> None: if 'MEGATRON_LM_PATH' not in os.environ: + # TODO: Synchronization issues may occur in DDP scenarios + # if the distributed environment has not been initialized. os.environ['MEGATRON_LM_PATH'] = git_clone_github( 'https://github.com/NVIDIA/Megatron-LM', branch='core_r0.13.0') with safe_ddp_context(hash_id='megatron-lm'): diff --git a/swift/megatron/model/__init__.py b/swift/megatron/model/__init__.py index 3d13a8d1b5..3c882c9864 100644 --- a/swift/megatron/model/__init__.py +++ b/swift/megatron/model/__init__.py @@ -1,4 +1,4 @@ # Copyright (c) Alibaba, Inc. and its affiliates. -from . import gpt +from . import gpt, mm_gpt from .constant import MegatronModelType from .register import MegatronModelMeta, get_megatron_model_meta, register_megatron_model diff --git a/swift/megatron/model/constant.py b/swift/megatron/model/constant.py index 8eebb6aa76..56e2ea6707 100644 --- a/swift/megatron/model/constant.py +++ b/swift/megatron/model/constant.py @@ -1,3 +1,5 @@ # Copyright (c) Alibaba, Inc. and its affiliates. class MegatronModelType: gpt = 'gpt' + qwen2_vl = 'qwen2_vl' + qwen2_5_vl = 'qwen2_5_vl' diff --git a/swift/megatron/model/gpt/__init__.py b/swift/megatron/model/gpt/__init__.py index 32c2c9b861..9e2654620e 100644 --- a/swift/megatron/model/gpt/__init__.py +++ b/swift/megatron/model/gpt/__init__.py @@ -1,55 +1,64 @@ # Copyright (c) Alibaba, Inc. and its affiliates. 
from swift.llm import ModelType from ..constant import MegatronModelType +from ..gpt_model import GPTModel +from ..model_provider import model_provider from ..register import MegatronModelMeta, register_megatron_model from .config import convert_gpt_hf_config from .hf2mcore import convert_hf2mcore from .mcore2hf import convert_mcore2hf -from .model import model_provider register_megatron_model( - MegatronModelMeta(MegatronModelType.gpt, [ - ModelType.qwen2, - ModelType.qwen2_5, - ModelType.qwq, - ModelType.qwq_preview, - ModelType.qwen2_5_math, - ModelType.llama, - ModelType.llama3, - ModelType.llama3_1, - ModelType.llama3_2, - ModelType.longwriter_llama3_1, - ModelType.codefuse_codellama, - ModelType.marco_o1, - ModelType.deepseek, - ModelType.deepseek_r1_distill, - ModelType.yi, - ModelType.yi_coder, - ModelType.sus, - ModelType.skywork_o1, - ModelType.openbuddy_llama, - ModelType.openbuddy_llama3, - ModelType.megrez, - ModelType.reflection, - ModelType.numina, - ModelType.ziya, - ModelType.mengzi3, - ModelType.qwen3, - ModelType.qwen3_thinking, - ModelType.qwen3_nothinking, - ModelType.qwen2_moe, - ModelType.qwen3_moe, - ModelType.qwen3_moe_thinking, - ModelType.internlm3, - ModelType.mimo, - ModelType.mimo_rl, - ModelType.moonlight, - ModelType.deepseek_moe, - ModelType.deepseek_v2, - ModelType.deepseek_v2_5, - ModelType.deepseek_r1, - ModelType.dots1, - ModelType.ernie, - ModelType.glm4_5, - ModelType.deepseek_v3_1, - ], model_provider, convert_gpt_hf_config, convert_mcore2hf, convert_hf2mcore)) + MegatronModelMeta( + MegatronModelType.gpt, + [ + ModelType.qwen2, + ModelType.qwen2_5, + ModelType.qwq, + ModelType.qwq_preview, + ModelType.qwen2_5_math, + ModelType.llama, + ModelType.llama3, + ModelType.llama3_1, + ModelType.llama3_2, + ModelType.longwriter_llama3_1, + ModelType.codefuse_codellama, + ModelType.marco_o1, + ModelType.deepseek, + ModelType.deepseek_r1_distill, + ModelType.yi, + ModelType.yi_coder, + ModelType.sus, + ModelType.skywork_o1, + ModelType.openbuddy_llama, + ModelType.openbuddy_llama3, + ModelType.megrez, + ModelType.reflection, + ModelType.numina, + ModelType.ziya, + ModelType.mengzi3, + ModelType.qwen3, + ModelType.qwen3_thinking, + ModelType.qwen3_nothinking, + ModelType.qwen2_moe, + ModelType.qwen3_moe, + ModelType.qwen3_moe_thinking, + ModelType.internlm3, + ModelType.mimo, + ModelType.mimo_rl, + ModelType.moonlight, + ModelType.deepseek_moe, + ModelType.deepseek_v2, + ModelType.deepseek_v2_5, + ModelType.deepseek_r1, + ModelType.dots1, + ModelType.ernie, + ModelType.glm4_5, + ModelType.deepseek_v3_1, + ], + model_provider=model_provider, + model_cls=GPTModel, + convert_hf_config=convert_gpt_hf_config, + convert_mcore2hf=convert_mcore2hf, + convert_hf2mcore=convert_hf2mcore, + )) diff --git a/swift/megatron/model/gpt/config.py b/swift/megatron/model/gpt/config.py index ec58a28142..7b6a1803a1 100644 --- a/swift/megatron/model/gpt/config.py +++ b/swift/megatron/model/gpt/config.py @@ -39,6 +39,9 @@ def convert_gpt_hf_config(config) -> Dict[str, Any]: res['rotary_interleaved'] = True elif architectures == 'Glm4MoeForCausalLM': res['moe_router_score_function'] = 'sigmoid' + elif architectures in {'Qwen2VLForConditionalGeneration', 'Qwen2_5_VLForConditionalGeneration'}: + res['position_embedding_type'] = 'mrope' + res['mrope_section'] = res['rope_scaling']['mrope_section'] if first_k_dense_replace is not None: res['moe_layer_freq'] = f'[0]*{first_k_dense_replace}+[1]*{res["num_layers"] - first_k_dense_replace}' if res.get('moe_router_score_function', 'softmax') == 
'sigmoid': diff --git a/swift/megatron/model/gpt/hf2mcore.py b/swift/megatron/model/gpt/hf2mcore.py index 93a8d9b36e..76780be641 100644 --- a/swift/megatron/model/gpt/hf2mcore.py +++ b/swift/megatron/model/gpt/hf2mcore.py @@ -91,7 +91,7 @@ def set_mlp_state(args, mg_mlp, hf_mlp): def set_layer_state(args, mg_model, hf_model, layer_idx): mg_layer = mg_model.decoder.layers[layer_idx] - hf_layer = hf_model.model.layers[layer_idx] + hf_layer = hf_model.layers[layer_idx] if args.multi_latent_attention: set_mla_attn_state(args, mg_layer.self_attention, hf_layer.self_attn) mg_layer.input_layernorm.weight.data.copy_(hf_layer.input_layernorm.weight) @@ -115,4 +115,4 @@ def convert_hf2mcore(hf_model, mg_model): mg_model.output_layer.weight.data.copy_(hf_model.lm_head.weight) mg_model.decoder.final_layernorm.weight.data.copy_(hf_model.model.norm.weight) for layer_idx in range(args.num_layers): - set_layer_state(args, mg_model, hf_model, layer_idx) + set_layer_state(args, mg_model, hf_model.model, layer_idx) diff --git a/swift/megatron/model/gpt/mcore2hf.py b/swift/megatron/model/gpt/mcore2hf.py index bd0e480f65..3f063d4559 100644 --- a/swift/megatron/model/gpt/mcore2hf.py +++ b/swift/megatron/model/gpt/mcore2hf.py @@ -88,7 +88,7 @@ def set_mlp_state(args, mg_mlp, hf_mlp): def set_layer_state(args, mg_model, hf_model, layer_idx): mg_layer = mg_model.decoder.layers[layer_idx] - hf_layer = hf_model.model.layers[layer_idx] + hf_layer = hf_model.layers[layer_idx] if args.multi_latent_attention: set_mla_attn_state(args, mg_layer.self_attention, hf_layer.self_attn) @@ -113,4 +113,4 @@ def convert_mcore2hf(hf_model, mg_model): hf_model.lm_head.weight.data.copy_(mg_model.output_layer.weight) hf_model.model.norm.weight.data.copy_(mg_model.decoder.final_layernorm.weight) for layer_idx in range(args.num_layers): - set_layer_state(args, mg_model, hf_model, layer_idx) + set_layer_state(args, mg_model, hf_model.model, layer_idx) diff --git a/swift/megatron/model/gpt_model.py b/swift/megatron/model/gpt_model.py index 7b68c27ca8..f03a7c855e 100644 --- a/swift/megatron/model/gpt_model.py +++ b/swift/megatron/model/gpt_model.py @@ -86,7 +86,7 @@ def __init__( new_inv_freq, self.attention_scaling = get_rope_inv_freq() self.rotary_pos_emb.inv_freq = new_inv_freq.to(self.rotary_pos_emb.inv_freq.device) - if self.attention_scaling != 1 and config.apply_rope_fusion: + if (self.attention_scaling != 1 or position_embedding_type == 'mrope') and config.apply_rope_fusion: config.apply_rope_fusion = False logger.warning('`apply_rope_fusion` does not support `attention_scaling`. 
' f'Setting `config.apply_rope_fusion`: {config.apply_rope_fusion}') @@ -154,7 +154,7 @@ def forward( rotary_pos_emb = None rotary_pos_cos = None rotary_pos_sin = None - if self.position_embedding_type == 'rope': + if self.position_embedding_type in {'rope', 'mrope'}: if not self.training and self.config.flash_decode and inference_params: # Flash decoding uses precomputed cos and sin for RoPE rotary_pos_cos, rotary_pos_sin = self.rotary_pos_emb_cache.setdefault( @@ -162,16 +162,23 @@ def forward( self.rotary_pos_emb.get_cos_sin(inference_params.max_sequence_length), ) else: - rotary_seq_len = self.rotary_pos_emb.get_rotary_seq_len(inference_params, self.decoder, decoder_input, - self.config, packed_seq_params) + rotary_seq_len = RotaryEmbedding.get_rotary_seq_len(self, inference_params, self.decoder, decoder_input, + self.config, packed_seq_params) if self.hf_rope_scaling is not None: attention_scaling = dynamic_rope_update(self, self.rotary_pos_emb.inv_freq, rotary_seq_len) if attention_scaling is not None: self.attention_scaling = attention_scaling - rotary_pos_emb = self.rotary_pos_emb( - rotary_seq_len, - packed_seq=packed_seq_params is not None and packed_seq_params.qkv_format == 'thd', - ) + if self.position_embedding_type == 'mrope': + rotary_pos_emb = self.rotary_pos_emb( + position_ids, + mrope_section=self.mrope_section, + packed_seq=packed_seq_params is not None and packed_seq_params.qkv_format == 'thd', + ) + else: + rotary_pos_emb = self.rotary_pos_emb( + rotary_seq_len, + packed_seq=packed_seq_params is not None and packed_seq_params.qkv_format == 'thd', + ) if ((self.config.enable_cuda_graph or self.config.flash_decode) and rotary_pos_cos is not None and inference_params): sequence_len_offset = torch.tensor( diff --git a/swift/megatron/model/mm_gpt/__init__.py b/swift/megatron/model/mm_gpt/__init__.py new file mode 100644 index 0000000000..30f489086c --- /dev/null +++ b/swift/megatron/model/mm_gpt/__init__.py @@ -0,0 +1 @@ +from . 
import qwen2_5_vl diff --git a/swift/megatron/model/mm_gpt/qwen2_5_vl.py b/swift/megatron/model/mm_gpt/qwen2_5_vl.py new file mode 100644 index 0000000000..ba23e2ee8e --- /dev/null +++ b/swift/megatron/model/mm_gpt/qwen2_5_vl.py @@ -0,0 +1,148 @@ +import torch +from megatron.core.models.huggingface import HuggingFaceModule +from megatron.training import get_args, get_tokenizer + +from swift.llm import ModelType, get_model_tokenizer, to_device +from ..constant import MegatronModelType +from ..gpt.hf2mcore import set_layer_state as set_layer_state_hf2mcore +from ..gpt.mcore2hf import set_layer_state as set_layer_state_mcore2hf +from ..register import register_megatron_model +from .utils import MMGPTMegatronModelMeta, patch_device_map_meta + + +def convert_hf2mcore_qwen2_5_vl(hf_model, mg_model): + language_model = hf_model.model + if hasattr(language_model, 'language_model'): + language_model = language_model.language_model + visual = hf_model.visual if hasattr(hf_model, 'visual') else hf_model.model.visual + mg_language_model = mg_model.language_model + args = get_args() + mg_language_model.embedding.word_embeddings.weight.data.copy_(language_model.embed_tokens.weight) + if args.untie_embeddings_and_output_weights: + mg_language_model.output_layer.weight.data.copy_(hf_model.lm_head.weight) + mg_language_model.decoder.final_layernorm.weight.data.copy_(language_model.norm.weight) + for layer_idx in range(args.num_layers): + set_layer_state_hf2mcore(args, mg_language_model, language_model, layer_idx) + mg_model.visual.model.load_state_dict(visual.state_dict()) + + +def convert_mcore2hf_qwen2_5_vl(hf_model, mg_model): + language_model = hf_model.model + if hasattr(language_model, 'language_model'): + language_model = language_model.language_model + visual = hf_model.visual if hasattr(hf_model, 'visual') else hf_model.model.visual + mg_language_model = mg_model.language_model + args = get_args() + language_model.embed_tokens.weight.data.copy_(mg_language_model.embedding.word_embeddings.weight) + if args.untie_embeddings_and_output_weights: + hf_model.lm_head.weight.data.copy_(mg_language_model.output_layer.weight) + language_model.norm.weight.data.copy_(mg_language_model.decoder.final_layernorm.weight) + for layer_idx in range(args.num_layers): + set_layer_state_mcore2hf(args, mg_language_model, language_model, layer_idx) + visual.load_state_dict(mg_model.visual.model.state_dict()) + + +class Qwen2_5VL_Vit(HuggingFaceModule): + vision_tower = ['model'] + aligner = ['model.merger'] + version = 'v2_5' + + def __init__(self, config): + if self.version == 'v2_5': + try: + from transformers.models.qwen2_5_vl import Qwen2_5_VLTextModel + except ImportError: + from transformers.models.qwen2_5_vl import Qwen2_5_VLModel as Qwen2_5_VLTextModel + context = patch_device_map_meta(Qwen2_5_VLTextModel) + elif self.version == 'v2': + try: + from transformers.models.qwen2_vl import Qwen2VLTextModel + except ImportError: + from transformers.models.qwen2_vl import Qwen2VLModel as Qwen2VLTextModel + context = patch_device_map_meta(Qwen2VLTextModel) + super().__init__(config) + args = get_args() + model_dir = args.model_info.model_dir + kwargs = {'attn_impl': 'flash_attn'} if args.attention_backend.name == 'flash' else {} + with context: + model, _ = get_model_tokenizer(model_dir, args.torch_dtype, return_dummy_model=True, **kwargs) + self.model = model.visual.to('cuda') + self.model_config = model.config + self.processor = get_tokenizer() + + def forward(self, *args, **kwargs): + return self.model(*args, **kwargs) 
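+    # `get_inputs_embeds` below merges the ViT output back into the text embeddings via
+    # `masked_scatter` on the image/video token positions. For text-only samples, a dummy
+    # 32x32 black image is encoded and added as `image_embeds.mean() * 0.`, so the visual
+    # parameters still participate in the backward pass without changing the embeddings.
+    # The HF text model itself is built on the meta device via `patch_device_map_meta`, so
+    # only the visual tower is materialized here.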
+ + def get_inputs_embeds(self, inputs_embeds, **kwargs): + input_ids = kwargs['input_ids'] + pixel_values = kwargs.get('pixel_values') + pixel_values_videos = kwargs.get('pixel_values_videos') + image_grid_thw = kwargs.get('image_grid_thw') + video_grid_thw = kwargs.get('video_grid_thw') + dtype = self.model.dtype + if pixel_values is None and pixel_values_videos is None: # plain-text + from PIL import Image + images = [Image.new('RGB', (32, 32), (0, 0, 0))] + media_inputs = self.processor.image_processor(images=images, return_tensors='pt') + device = input_ids.device + media_inputs = to_device(media_inputs, device) + pixel_values = media_inputs['pixel_values'].type(dtype) + image_embeds = self.model(pixel_values, grid_thw=media_inputs['image_grid_thw']) + inputs_embeds = inputs_embeds + image_embeds.mean() * 0. + else: + if pixel_values is None: + pixel_values_mixed = pixel_values_videos + grid_thw = video_grid_thw + elif pixel_values_videos is None: + pixel_values_mixed = pixel_values + grid_thw = image_grid_thw + else: + pixel_values_mixed = torch.concat([pixel_values, pixel_values_videos], dim=0) + grid_thw = torch.concat([image_grid_thw, video_grid_thw], dim=0) + pixel_values_mixed = pixel_values_mixed.type(dtype) + mixed_embeds = self.model(pixel_values_mixed, grid_thw=grid_thw) + if pixel_values is None: + image_embeds = None + video_embeds = mixed_embeds + elif pixel_values_videos is None: + image_embeds = mixed_embeds + video_embeds = None + else: + merge_length = self.processor.image_processor.merge_size**2 + image_tokens = (image_grid_thw.prod(dim=-1) // merge_length).sum() + image_embeds = mixed_embeds[:image_tokens] + video_embeds = mixed_embeds[image_tokens:] + + if image_embeds is not None: + image_mask = (input_ids == self.model_config.image_token_id).unsqueeze(-1).expand_as(inputs_embeds) + image_embeds = image_embeds.to(inputs_embeds.device, inputs_embeds.dtype) + inputs_embeds = inputs_embeds.masked_scatter(image_mask, image_embeds) + + if video_embeds is not None: + video_mask = (input_ids == self.model_config.video_token_id).unsqueeze(-1).expand_as(inputs_embeds) + video_embeds = video_embeds.to(inputs_embeds.device, inputs_embeds.dtype) + inputs_embeds = inputs_embeds.masked_scatter(video_mask, video_embeds) + return inputs_embeds + + +class Qwen2VL_Vit(Qwen2_5VL_Vit): + version = 'v2' + + +register_megatron_model( + MMGPTMegatronModelMeta( + MegatronModelType.qwen2_5_vl, [ + ModelType.qwen2_5_vl, + ], + convert_hf2mcore=convert_hf2mcore_qwen2_5_vl, + convert_mcore2hf=convert_mcore2hf_qwen2_5_vl, + visual_cls=Qwen2_5VL_Vit)) + +register_megatron_model( + MMGPTMegatronModelMeta( + MegatronModelType.qwen2_vl, [ + ModelType.qwen2_vl, + ], + convert_hf2mcore=convert_hf2mcore_qwen2_5_vl, + convert_mcore2hf=convert_mcore2hf_qwen2_5_vl, + visual_cls=Qwen2VL_Vit)) diff --git a/swift/megatron/model/mm_gpt/utils.py b/swift/megatron/model/mm_gpt/utils.py new file mode 100644 index 0000000000..2b85c9f0f7 --- /dev/null +++ b/swift/megatron/model/mm_gpt/utils.py @@ -0,0 +1,44 @@ +from contextlib import contextmanager +from dataclasses import dataclass +from typing import Any, Callable, Dict, Type + +import torch +from torch import nn +from transformers import PretrainedConfig + +from ..gpt.config import convert_gpt_hf_config +from ..mm_gpt_model import MultimodalGPTModel +from ..model_provider import model_provider as model_provider_func +from ..register import MegatronModelMeta + + +@contextmanager +def patch_device_map_meta(model_cls): + __origin_init__ = model_cls.__init__ + + 
def __init__(self, *args, **kwargs): + with torch.device('meta'): + __origin_init__(self, *args, **kwargs) + + model_cls.__init__ = __init__ + + from transformers import PreTrainedModel + _origin_initialize_weight = PreTrainedModel._initialize_weights + + def _initialize_weight(self, *args, **kwargs): + return + + PreTrainedModel._initialize_weights = _initialize_weight + + try: + yield + finally: + model_cls.__init__ = __origin_init__ + PreTrainedModel._initialize_weights = _origin_initialize_weight + + +@dataclass +class MMGPTMegatronModelMeta(MegatronModelMeta): + model_cls: Type[nn.Module] = MultimodalGPTModel + model_provider: Callable[[], nn.Module] = model_provider_func + convert_hf_config: Callable[[PretrainedConfig], Dict[str, Any]] = convert_gpt_hf_config diff --git a/swift/megatron/model/mm_gpt_model.py b/swift/megatron/model/mm_gpt_model.py new file mode 100644 index 0000000000..d4871b5b54 --- /dev/null +++ b/swift/megatron/model/mm_gpt_model.py @@ -0,0 +1,98 @@ +from contextlib import contextmanager + +import torch +from megatron.core import InferenceParams +from megatron.core.packed_seq_params import PackedSeqParams +from megatron.core.tensor_parallel import VocabParallelEmbedding, scatter_to_sequence_parallel_region +from megatron.core.transformer.module import MegatronModule +from megatron.core.transformer.spec_utils import ModuleSpec +from megatron.core.transformer.transformer_config import TransformerConfig +from megatron.training import get_args + +from .gpt_model import GPTModel + + +class MultimodalGPTModel(MegatronModule): + + def __init__(self, + config: TransformerConfig, + transformer_layer_spec: ModuleSpec, + vocab_size: int, + max_sequence_length: int, + pre_process: bool = True, + post_process: bool = True, + *args, + **kwargs): + super().__init__(config) + self.pre_process = pre_process + self.post_process = post_process + self.language_model = GPTModel(config, transformer_layer_spec, vocab_size, max_sequence_length, pre_process, + post_process, *args, **kwargs) + + self.share_embeddings_and_output_weights = self.language_model.share_embeddings_and_output_weights + args = get_args() + self.visual = None + if pre_process and args.megatron_model_meta.visual_cls is not None: + self.visual = args.megatron_model_meta.visual_cls(config) + + @contextmanager + def _patch_word_embeddings(self, kwargs): + origin_forward = VocabParallelEmbedding.forward + + def forward(_self, input_): + reduce_scatter_embeddings = _self.reduce_scatter_embeddings + _self.reduce_scatter_embeddings = False + res = origin_forward(_self, input_) + _self.reduce_scatter_embeddings = reduce_scatter_embeddings + if self.visual is not None: + res = self.visual.get_inputs_embeds(res, **kwargs) + if reduce_scatter_embeddings: + res = res.transpose(0, 1).contiguous() + res = scatter_to_sequence_parallel_region(res, group=_self.tp_group) + return res + + VocabParallelEmbedding.forward = forward + try: + yield + finally: + VocabParallelEmbedding.forward = origin_forward + + # Code borrowed from NVIDIA/Megatron-LM + def forward( + self, + input_ids: torch.Tensor, + position_ids: torch.Tensor, + attention_mask: torch.Tensor = None, + decoder_input: torch.Tensor = None, + labels: torch.Tensor = None, + inference_params: InferenceParams = None, + packed_seq_params: PackedSeqParams = None, + **kwargs, + ) -> torch.Tensor: + if decoder_input is not None: + pass + elif self.pre_process: + from ..trainers.utils import get_batch_on_this_cp_rank + kwargs.update({'input_ids': input_ids}) + with 
self._patch_word_embeddings(kwargs): + decoder_input = self.language_model.embedding(input_ids=input_ids, position_ids=position_ids) + decoder_input = get_batch_on_this_cp_rank({ + 'decoder_input': decoder_input, + 'packed_seq_params': packed_seq_params + })['decoder_input'] + else: + # intermediate stage of pipeline + # decoder will get hidden_states from encoder.input_tensor + decoder_input = None + return self.language_model( + input_ids=input_ids, + position_ids=position_ids, + attention_mask=attention_mask, + decoder_input=decoder_input, + labels=labels, + inference_params=inference_params, + packed_seq_params=packed_seq_params, + ) + + def set_input_tensor(self, input_tensor: torch.Tensor) -> None: + return self.language_model.set_input_tensor(input_tensor) diff --git a/swift/megatron/model/gpt/model.py b/swift/megatron/model/model_provider.py similarity index 94% rename from swift/megatron/model/gpt/model.py rename to swift/megatron/model/model_provider.py index 42cd69f375..1eeff0b8a2 100644 --- a/swift/megatron/model/gpt/model.py +++ b/swift/megatron/model/model_provider.py @@ -1,5 +1,5 @@ # Copyright (c) Alibaba, Inc. and its affiliates. -from typing import Union +from typing import TYPE_CHECKING, Union import megatron.legacy import torch @@ -12,11 +12,14 @@ from megatron.training.arguments import core_transformer_config_from_args from megatron.training.yaml_arguments import core_transformer_config_from_yaml -from ..gpt_model import GPTModel +if TYPE_CHECKING: + from .gpt_model import GPTModel + from .mm_gpt import MultimodalGPTModel # Code borrowed from NVIDIA/Megatron-LM -def model_provider(pre_process=True, post_process=True) -> Union[GPTModel, megatron.legacy.model.GPTModel]: +def model_provider(pre_process=True, + post_process=True) -> Union['GPTModel', 'MultimodalGPTModel', megatron.legacy.model.GPTModel]: """Builds the model. If you set the use_legacy_models to True, it will return the legacy GPT model and if not the mcore GPT model. @@ -97,7 +100,7 @@ def oom_observer(device, alloc, device_alloc, device_free): # qwen2_moe for layer_spec in transformer_layer_spec.layer_specs: layer_spec.submodules.mlp.submodules.shared_experts.params = {'gate': True} - model = GPTModel( + model = args.megatron_model_meta.model_cls( config=config, transformer_layer_spec=transformer_layer_spec, vocab_size=args.padded_vocab_size, diff --git a/swift/megatron/model/register.py b/swift/megatron/model/register.py index 950a68ede2..8ed93f1ac5 100644 --- a/swift/megatron/model/register.py +++ b/swift/megatron/model/register.py @@ -1,7 +1,7 @@ # Copyright (c) Alibaba, Inc. and its affiliates. 
from argparse import ArgumentParser from dataclasses import dataclass -from typing import Any, Callable, Dict, List, Optional +from typing import Any, Callable, Dict, List, Optional, Type import torch.nn as nn from transformers import PretrainedConfig @@ -16,11 +16,14 @@ class MegatronModelMeta: megatron_model_type: str model_types: List[str] - model_provider: Callable[[], nn.Module] - convert_hf_config: Callable[[PretrainedConfig], Dict[str, Any]] convert_mcore2hf: Callable[[nn.Module, nn.Module], None] convert_hf2mcore: Callable[[nn.Module, nn.Module], None] + model_cls: Type[nn.Module] + model_provider: Callable[[], nn.Module] + convert_hf_config: Callable[[PretrainedConfig], Dict[str, Any]] + visual_cls: Optional[Type[nn.Module]] = None + extra_args_provider: Optional[Callable[[ArgumentParser], ArgumentParser]] = None diff --git a/swift/megatron/train/sft.py b/swift/megatron/train/sft.py index f7eb8b57ec..289529a5f9 100644 --- a/swift/megatron/train/sft.py +++ b/swift/megatron/train/sft.py @@ -3,6 +3,8 @@ from functools import partial from typing import List, Optional, Union +import torch + from swift.llm.train import SwiftSft from swift.utils import get_logger, is_master, plot_images from ..argument import MegatronTrainArguments @@ -24,12 +26,17 @@ def __init__(self, args: Optional[Union[List[str], MegatronTrainArguments]] = No self.train_msg = {} super(SwiftSft, self).__init__(args) args = self.args - _, self.processor = args.get_model_processor(load_model=False) + if args.model_meta.is_multimodal: + kwargs = {'return_dummy_model': True} + else: + kwargs = {'load_model': False} + with torch.device('meta'): + self.model, self.processor = args.get_model_processor(**kwargs) + self._prepare_template() patch_megatron_tokenizer(self.processor) + args.save_args(args.save) args.init_model_args(self.processor, self.processor.model_info.config) - self._prepare_template() self.template.use_megatron = True - args.save_args(args.save) self.trainer = self.prepare_trainer() def _get_data_collator(self): @@ -56,8 +63,6 @@ def run(self): if val_dataset is not None: val_dataset = build_streaming_dataloader(args, val_dataset, data_collator) - logging_path = os.path.join(args.save, 'logging.jsonl') - logger.info(f'The logging file will be saved in: {logging_path}') try: self.trainer.train(train_dataset, val_dataset, data_collator) finally: diff --git a/swift/megatron/trainers/base.py b/swift/megatron/trainers/base.py index afd6152a69..ad17437a42 100644 --- a/swift/megatron/trainers/base.py +++ b/swift/megatron/trainers/base.py @@ -242,10 +242,11 @@ def new_model_provider_func(*args, **kwargs): self.peft_model = prepare_mcore_model(self.unwrapped_model) return self.unwrapped_model + args = get_args() + self._init_multimodal_full(args) with self._patch_load_state_dict(self._load_base_checkpoint): model, optimizer, opt_param_scheduler = self._origin_setup_model_and_optimizer( new_model_provider_func, model_type, *_args, **kwargs) - args = get_args() if args.initialize_embedding: self._initialize_embedding(self.unwrapped_model) if args.train_type != 'full' and args.modules_to_save: @@ -258,8 +259,20 @@ def new_model_provider_func(*args, **kwargs): with adapter_state_dict_context(): args.iteration, args.num_floating_point_operations_so_far = load_checkpoint( model, optimizer, opt_param_scheduler, load_arg='adapter_load', strict=False) + if args.model_meta.is_multimodal: + self._prepare_vit_gradient_checkpointing() return model, optimizer, opt_param_scheduler + def _prepare_vit_gradient_checkpointing(self): + 
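+        # Activation checkpointing for the HF ViT wrapped by the Megatron model: controlled by
+        # `--vit_gradient_checkpointing` and `--gradient_checkpointing_kwargs`;
+        # `enable_input_require_grads` keeps gradients flowing into the checkpointed ViT blocks
+        # even when their inputs do not require grad.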
visual = self.unwrapped_model.visual + if visual is None: + return + visual = visual.model + args = get_args() + if args.vit_gradient_checkpointing: + visual.gradient_checkpointing_enable(**(args.gradient_checkpointing_kwargs or {})) + visual.enable_input_require_grads() + @staticmethod def _initialize_embedding(model): # compat new_special_tokens @@ -702,6 +715,25 @@ def _patch_megatron(self): self._origin_save_checkpoint = training.save_checkpoint training.save_checkpoint = self.save_checkpoint + @staticmethod + def _init_multimodal_full(args): + visual_cls = args.megatron_model_meta.visual_cls + if args.train_type == 'full' and args.model_meta.is_multimodal and visual_cls is not None: + vision_tower = [f'visual.{vit}' for vit in visual_cls.vision_tower] + aligner = [f'visual.{_aligner}' for _aligner in visual_cls.aligner] + if args.freeze_llm: + args.freeze_parameters.append('language_model') + if args.freeze_vit: + args.freeze_parameters += vision_tower + if args.freeze_aligner: + args.freeze_parameters += aligner + else: + args.trainable_parameters += aligner + if args.freeze_parameters: + logger.info(f'freeze_parameters: {args.freeze_parameters}') + if args.trainable_parameters: + logger.info(f'additional trainable_parameters: {args.trainable_parameters}') + def train(self, train_dataset, val_dataset, data_collator): args = self.args datasets_provider = get_swift_datasets_provider(train_dataset, val_dataset) diff --git a/swift/megatron/trainers/dpo_trainer.py b/swift/megatron/trainers/dpo_trainer.py index 7798de2b08..aef42560c1 100644 --- a/swift/megatron/trainers/dpo_trainer.py +++ b/swift/megatron/trainers/dpo_trainer.py @@ -179,9 +179,7 @@ def _replace_data_iterator(self, data_iterator): return iter(res) def forward_step(self, data_iterator, model): - with torch.no_grad(): - data = next(data_iterator) - + data = next(data_iterator) ref_logps = data.pop('logps') with self.stimer: output_tensor = model(**data) diff --git a/swift/megatron/trainers/utils.py b/swift/megatron/trainers/utils.py index a05b57f75f..30b90400f4 100644 --- a/swift/megatron/trainers/utils.py +++ b/swift/megatron/trainers/utils.py @@ -65,22 +65,41 @@ def get_packed_seq_params(position_ids: torch.Tensor) -> PackedSeqParams: def _split_tokens(tokens, cu_seqlens): - assert tokens.shape[0] == 1, f'tokens.shape: {tokens.shape}' + assert tokens.shape[-2] == 1, f'tokens.shape: {tokens.shape}' # [..., 1, L] new_tokens = [] cp_size = mpu.get_context_parallel_world_size() cp_rank = mpu.get_context_parallel_rank() for i in range(cu_seqlens.shape[0] - 1): - val = tokens[:, cu_seqlens[i]:cu_seqlens[i + 1]] + val = tokens[..., cu_seqlens[i]:cu_seqlens[i + 1]] val = val.view( - tokens.shape[0], + *tokens.shape[:-1], 2 * cp_size, - val.shape[1] // (2 * cp_size), + val.shape[-1] // (2 * cp_size), ) index = torch.tensor([cp_rank, (2 * cp_size - cp_rank - 1)], device='cpu', pin_memory=True).cuda(non_blocking=True) - val = val.index_select(1, index) - new_tokens.append(val.view(tokens.shape[0], -1)) - return torch.cat(new_tokens, dim=1) + val = val.index_select(-2, index) + new_tokens.append(val.view(*tokens.shape[:-1], -1)) + return torch.cat(new_tokens, dim=-1) + + +def _split_tokens_decoder_input(tokens, cu_seqlens): + assert tokens.shape[1] == 1, f'tokens.shape: {tokens.shape}' # [L, 1, E] + new_tokens = [] + cp_size = mpu.get_context_parallel_world_size() + cp_rank = mpu.get_context_parallel_rank() + for i in range(cu_seqlens.shape[0] - 1): + val = tokens[cu_seqlens[i]:cu_seqlens[i + 1], ...] 
+ val = val.view( + 2 * cp_size, + val.shape[0] // (2 * cp_size), + *tokens.shape[1:], + ) + index = torch.tensor([cp_rank, (2 * cp_size - cp_rank - 1)], device='cpu', + pin_memory=True).cuda(non_blocking=True) + val = val.index_select(0, index) + new_tokens.append(val.view(-1, *tokens.shape[1:])) + return torch.cat(new_tokens, dim=0) def get_batch_on_this_cp_rank(batch: Dict[str, Any]): @@ -96,14 +115,23 @@ def get_batch_on_this_cp_rank(batch: Dict[str, Any]): # that we can get balanced workload among GPUs in a context parallel group. cp_size = mpu.get_context_parallel_world_size() if cp_size > 1: + args = get_args() + keys = ['labels', 'attention_mask', 'position_ids', 'loss_scale'] + if args.model_meta.is_multimodal: + keys.append('decoder_input') + else: + keys.append('input_ids') packed_seq_params = batch.get('packed_seq_params') if packed_seq_params is None: return mcore_get_batch_on_this_cp_rank(batch) for key, val in batch.items(): - if key in {'packed_seq_params', 'channel'}: + if key not in keys: continue if val is not None: - batch[key] = _split_tokens(val, packed_seq_params.cu_seqlens_q) + if key == 'decoder_input': + batch[key] = _split_tokens_decoder_input(val, packed_seq_params.cu_seqlens_q) + else: + batch[key] = _split_tokens(val, packed_seq_params.cu_seqlens_q) return batch @@ -118,5 +146,8 @@ def get_batch(data_iterator): batch['packed_seq_params'] = get_packed_seq_params(batch['position_ids']) batch['packed_seq_params'].num_samples = num_samples # slice batch along sequence dimension for context parallelism + position_ids = batch.pop('real_position_ids', None) # fix Qwen2.5-VL + if position_ids is not None: + batch['position_ids'] = position_ids batch = get_batch_on_this_cp_rank(batch) return batch diff --git a/swift/megatron/utils/convert.py b/swift/megatron/utils/convert.py index cc8a70307c..19b836bd79 100644 --- a/swift/megatron/utils/convert.py +++ b/swift/megatron/utils/convert.py @@ -3,6 +3,7 @@ import math from contextlib import contextmanager from dataclasses import fields +from typing import Any, Dict import torch import torch.nn as nn @@ -39,17 +40,23 @@ def _test_params_sum(model): logger.info(f'zero_count: {zero_count}') -def _find_modules(model, recurse: bool = True): +def _find_modules(model, recurse: bool = True, prefix='', ignore_modules=None): + ignore_modules = ignore_modules or [] + for k in ignore_modules: + if prefix.startswith(k): + return [] + else: + named_children = list(model.named_children()) + modules = [] - children = list(model.children()) - for module in children: + for n, module in named_children: if module.__class__ is nn.ModuleList: - modules += _find_modules(module, False) + modules += _find_modules(module, False, prefix=f'{prefix}{n}.', ignore_modules=ignore_modules) elif recurse: - modules += _find_modules(module) + modules += _find_modules(module, prefix=f'{prefix}{n}.', ignore_modules=ignore_modules) else: modules.append(module) - if not children: + if not named_children: modules.append(model) return modules @@ -78,34 +85,68 @@ def _to_cpu_hook(module, args, output): hook.remove() +def get_examples(is_multimodal: bool) -> Dict[str, Any]: + if is_multimodal: + data = { + 'messages': [{ + 'role': 'user', + 'content': 'describe the image.' + }, { + 'role': + 'assistant', + 'content': + 'The image depicts a close-up of a kitten with striking features. ' + 'The kitten has a white and gray coat with distinct black stripes, ' + 'particularly noticeable on its face and ears. 
Its eyes are large ' + 'and expressive, with a captivating blue hue that stands out against ' + "the darker fur around them. The kitten's nose is small and pink, " + 'and it has long, delicate whiskers extending from either side of its mouth. ' + "The background is blurred, drawing attention to the kitten's face and " + 'making it the focal point of the image. The overall impression is ' + 'one of cuteness and charm.' + }], + 'images': ['http://modelscope-open.oss-cn-hangzhou.aliyuncs.com/images/cat.png'] + } + else: + data = { + 'messages': [ + { + 'role': 'user', + 'content': 'Introduction to ms-swift.' + }, + { + 'role': + 'assistant', + 'content': + 'ms-swift is an official framework provided by the ModelScope community for fine-tuning ' + 'and deploying large language models and multi-modal large models.' + }, + ] + } + return data + + def test_convert_precision(hf_model, mg_model, template, torch_dtype=torch.float32): _test_params_sum(hf_model) _test_params_sum(mg_model) template.set_mode('train') - inputs = template.encode({ - 'messages': [ - { - 'role': 'user', - 'content': 'Introduction to ms-swift.' - }, - { - 'role': - 'assistant', - 'content': - 'ms-swift is an official framework provided by the ModelScope community for fine-tuning ' - 'and deploying large language models and multi-modal large models.' - }, - ] - }) + template.register_post_encode_hook([hf_model]) + is_multimodal = template.model_meta.is_multimodal + inputs = get_examples(is_multimodal) + inputs = template.encode(inputs) inputs = to_device(template.data_collator([inputs]), 'cuda') HfConfigFactory.set_model_config_attr(hf_model, 'use_cache', False) - share_embedding = mg_model.share_embeddings_and_output_weights - hf_modules = _find_modules(hf_model) + mg_language_model = mg_model.language_model if is_multimodal else mg_model + share_embedding = mg_language_model.share_embeddings_and_output_weights + model_arch = hf_model.model_meta.model_arch + ignore_modules = (model_arch.vision_tower + model_arch.aligner) if is_multimodal else [] + + hf_modules = _find_modules(hf_model, ignore_modules=ignore_modules) with torch.inference_mode(), _model_cpu_forward_context(hf_modules, torch_dtype, share_embedding=share_embedding): hf_logits = hf_model(**inputs).logits - hf_model = hf_model.to('cpu') + hf_model.to('cpu') input_ids = inputs['input_ids'] attention_mask, _, position_ids = get_ltor_masks_and_position_ids(input_ids, -100, True, True, True) @@ -116,15 +157,15 @@ def test_convert_precision(hf_model, mg_model, template, torch_dtype=torch.float # mg_torch_dtype = None # packed_seq_params = get_packed_seq_params(position_ids) # attention_mask = None - mg_model.config.fp8 = None # compat fp8 - mg_modules = _find_modules(mg_model) + mg_language_model.config.fp8 = None # compat fp8 + mg_modules = _find_modules(mg_language_model, ignore_modules=['visual']) + kwargs = {k: v for k, v in inputs.items() if k not in ['input_ids', 'attention_mask', 'labels']} + if 'position_ids' not in kwargs: + kwargs['position_ids'] = position_ids with torch.inference_mode(), _model_cpu_forward_context( mg_modules, mg_torch_dtype, 'cuda', share_embedding=share_embedding): mg_logits = mg_model( - input_ids=input_ids, - attention_mask=attention_mask, - position_ids=position_ids, - packed_seq_params=packed_seq_params) + input_ids=input_ids, attention_mask=attention_mask, packed_seq_params=packed_seq_params, **kwargs) token_mean_diff = (mg_logits - hf_logits).abs().mean(dim=-1) mean_diff = token_mean_diff.mean().item() @@ -165,7 +206,10 @@ def 
convert_hf2mcore(args: ExportArguments) -> None: megatron_model_meta = get_megatron_model_meta(args.model_type) assert megatron_model_meta is not None, f'Model: {args.model} is not supported.' - kwargs = megatron_model_meta.convert_hf_config(processor.model_info.config) + config = processor.model_info.config + if args.model_meta.is_multimodal and hasattr(config, 'text_config'): + config = config.text_config + kwargs = megatron_model_meta.convert_hf_config(config) logger.info(f'megatron_config: {kwargs}') _check_megatron_kwargs(kwargs) current_convert_kwargs = convert_kwargs.copy() @@ -175,6 +219,9 @@ def convert_hf2mcore(args: ExportArguments) -> None: **kwargs, **current_convert_kwargs, save=args.output_dir, torch_dtype=args.torch_dtype) patch_megatron_tokenizer(processor) extra_args = megatron_args.parse_to_megatron() + extra_args['model_info'] = args.model_info + extra_args['model_meta'] = args.model_meta + extra_args['megatron_model_meta'] = megatron_model_meta extra_args_provider = megatron_model_meta.extra_args_provider initialize_megatron(extra_args_provider=extra_args_provider, args_defaults=extra_args) @@ -198,7 +245,10 @@ def convert_mcore2hf(args: ExportArguments) -> None: megatron_model_meta = get_megatron_model_meta(args.model_type) assert megatron_model_meta is not None, f'Model: {args.model} is not supported.' - kwargs = megatron_model_meta.convert_hf_config(processor.model_info.config) + config = processor.model_info.config + if args.model_meta.is_multimodal and hasattr(config, 'text_config'): + config = config.text_config + kwargs = megatron_model_meta.convert_hf_config(config) logger.info(f'megatron_config: {kwargs}') _check_megatron_kwargs(kwargs) current_convert_kwargs = convert_kwargs.copy() @@ -217,6 +267,9 @@ def convert_mcore2hf(args: ExportArguments) -> None: torch_dtype=args.torch_dtype) patch_megatron_tokenizer(processor) extra_args = megatron_args.parse_to_megatron() + extra_args['model_info'] = args.model_info + extra_args['model_meta'] = args.model_meta + extra_args['megatron_model_meta'] = megatron_model_meta extra_args_provider = megatron_model_meta.extra_args_provider initialize_megatron(extra_args_provider=extra_args_provider, args_defaults=extra_args) diff --git a/swift/megatron/utils/utils.py b/swift/megatron/utils/utils.py index db260bf3c3..e1611326c1 100644 --- a/swift/megatron/utils/utils.py +++ b/swift/megatron/utils/utils.py @@ -11,8 +11,10 @@ from megatron.core.transformer.utils import make_sharded_tensors_for_checkpoint, sharded_state_dict_default from megatron.training import checkpointing, get_args from peft.utils.other import ModulesToSaveWrapper +from torch import nn -from swift.utils import activate_parameters, find_layers, freeze_parameters, get_logger, get_model_parameter_info +from swift.utils import (activate_parameters, deep_getattr, find_layers, freeze_parameters, get_logger, + get_model_parameter_info) logger = get_logger() @@ -20,7 +22,7 @@ def find_all_linears(model): def _cond(name, module): - if isinstance(module, (TELinear, TELayerNormColumnParallelLinear, TEGroupedLinear)): + if isinstance(module, (TELinear, TELayerNormColumnParallelLinear, TEGroupedLinear, nn.Linear)): return True return False @@ -35,13 +37,64 @@ def find_embedding(model): return find_layers(model, lambda name, module: isinstance(module, LanguageModelEmbedding)) +def get_multimodal_target_regex( + args, + model, + *, + freeze_llm: bool = False, + freeze_vit: bool = True, + freeze_aligner: bool = True, +) -> str: + modules = [] + visual_cls = 
args.megatron_model_meta.visual_cls + vision_tower = [f'visual.{vit}' for vit in visual_cls.vision_tower] + aligner = [f'visual.{_aligner}' for _aligner in visual_cls.aligner] + if not freeze_llm: + modules.append('language_model') + if not freeze_vit: + modules += vision_tower + if not freeze_aligner: + modules += aligner + assert len(modules) > 0, f'modules: {modules}' + + res = [] + for module in modules: + rejected_modules = [] + if not freeze_vit: + for _aligner in aligner: + if _aligner.startswith(f'{module}.'): + rejected_modules.append(_aligner) + + sub_module = deep_getattr(model, module) + if sub_module is None: + continue + target_modules = find_all_linears(sub_module) + if not target_modules: + continue + target_modules = [tm for tm in target_modules if tm] + target_pattern = rf'.*\.({"|".join(target_modules)})' if target_modules else '' + rejected_pattern = rf'(?!({"|".join(rejected_modules)}))' if rejected_modules else '' + res.append(rf'{rejected_pattern}{module}{target_pattern}') + + return rf'^({"|".join(res)})$' + + def get_target_modules(args, model): if isinstance(args.target_modules, str): return args.target_modules target_modules = args.target_modules.copy() if 'all-linear' in target_modules: - target_modules.remove('all-linear') - target_modules += find_all_linears(model) + if args.model_meta.is_multimodal: + return get_multimodal_target_regex( + args, + model, + freeze_llm=args.freeze_llm, + freeze_vit=args.freeze_vit, + freeze_aligner=args.freeze_aligner, + ) + else: + target_modules.remove('all-linear') + target_modules += find_all_linears(model) if 'all-embedding' in target_modules: target_modules.remove('all-embedding') target_modules += find_embedding(model) diff --git a/swift/trainers/mixin.py b/swift/trainers/mixin.py index 3cda3e3a0d..89d6bc6332 100644 --- a/swift/trainers/mixin.py +++ b/swift/trainers/mixin.py @@ -839,7 +839,7 @@ def get_cu_seqlens(self, position_ids, logits_to_keep) -> torch.Tensor: start, end = cu_seqlens[i], cu_seqlens[i + 1] res_cu_seqlens[i + 1:] -= (~logits_to_keep[start:end]).sum() elif isinstance(logits_to_keep, int): - res_cu_seqlens[1:] -= position_ids.shape[0] + 1 - logits_to_keep + res_cu_seqlens[1:] -= position_ids.shape[-1] + 1 - logits_to_keep return res_cu_seqlens def get_batch_samples(self, *args, **kwargs): diff --git a/tests/megatron/test_align/test_llm.py b/tests/megatron/test_align/test_llm.py index 163fd1933a..69e62f574f 100644 --- a/tests/megatron/test_align/test_llm.py +++ b/tests/megatron/test_align/test_llm.py @@ -127,6 +127,16 @@ def test_glm4_5(): _test_model('ZhipuAI/GLM-4.5-Air') +def test_qwen2_5_vl(): + os.environ['MAX_PIXELS'] = str(1280 * 28 * 28) + _test_model('Qwen/Qwen2.5-VL-7B-Instruct') + + +def test_qwen2_vl(): + os.environ['MAX_PIXELS'] = str(1280 * 28 * 28) + _test_model('Qwen/Qwen2-VL-7B-Instruct') + + if __name__ == '__main__': # test_qwen2() # test_llama2() @@ -151,4 +161,6 @@ def test_glm4_5(): # test_kimi_dev() # test_hunyuan() # test_ernie() - test_glm4_5() + # test_glm4_5() + test_qwen2_5_vl() + # test_qwen2_vl()
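The example scripts added above load an mcore-format checkpoint (`Qwen2.5-VL-7B-Instruct-mcore`). A minimal conversion sketch, assuming the existing `swift export --to_mcore` flow of Megatron-SWIFT also covers the `qwen2_5_vl`/`qwen2_vl` model types registered in this patch (flag names are taken from the current export CLI, not from this diff):

CUDA_VISIBLE_DEVICES=0,1,2,3 \
swift export \
    --model Qwen/Qwen2.5-VL-7B-Instruct \
    --to_mcore true \
    --torch_dtype bfloat16 \
    --output_dir Qwen2.5-VL-7B-Instruct-mcore \
    --test_convert_precision true

With `--test_convert_precision true`, the extended `test_convert_precision` in swift/megatron/utils/convert.py compares HF and mcore logits on a multimodal sample before training starts.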