
[bugfix] Fix lora llm resume from checkpoint #9225

Merged: Jintao-Huang merged 20 commits into modelscope:main from Jintao-Huang:fix_lora_llm_resume_from_checkpoint on Apr 28, 2026

Conversation

@Jintao-Huang (Collaborator)

No description provided.

@gemini-code-assist (Bot, Contributor) left a comment


Code Review

This pull request removes an outdated comment about checkpoint support in the SFT script and modifies the LoRA LLM tuner plugin to ensure that vision tower and aligner modules are set to require gradients after loading. Feedback was provided to improve the robustness of the new logic by adding checks for the existence of model architecture metadata and its attributes to avoid potential attribute errors.
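
A minimal sketch of the kind of change the review describes, for illustration only: the attribute names vision_tower and aligner below are placeholders, and the real lookup in swift/tuner_plugin/lora_llm.py goes through the model's architecture metadata, which the feedback suggests checking before access.

def enable_multimodal_grads(model) -> None:
    # Placeholder attribute names; the actual modules come from the architecture metadata.
    for attr in ('vision_tower', 'aligner'):
        module = getattr(model, attr, None)  # guard against a missing attribute
        if module is not None:
            module.requires_grad_(True)  # re-enable gradients after loading the checkpoint

Guarding with getattr mirrors the robustness feedback: if a model has no aligner, the loop simply skips it instead of raising an AttributeError.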

Comment thread swift/tuner_plugin/lora_llm.py
@Jintao-Huang (Collaborator, Author)

/gemini review

@gemini-code-assist (Bot, Contributor) left a comment


Code Review

This pull request enables support for resuming from checkpoints in multimodal LoRA training by updating the DeepSpeed integration. Key changes include removing the training status check in Qwen model forwards to ensure consistency with DeepSpeed, and implementing a more robust parameter loading mechanism in LoRALLMTuner that handles DeepSpeed ZeRO-3's partitioned parameters. A performance improvement was suggested to gather all parameters at once when using ZeRO-3 to reduce communication overhead during state dict loading.

Comment thread swift/tuner_plugin/lora_llm.py Outdated
Comment on lines +34 to +40
params_dict = dict(model.named_parameters())
for name, tensor in state_dict.items():
    if name in params_dict:
        param = params_dict[name]
        with deepspeed.zero.GatheredParameters([param], modifier_rank=0):
            if deepspeed.comm.get_rank() == 0:
                param.data.copy_(tensor)
@gemini-code-assist (Bot, Contributor) commented:

Severity: medium

For better performance when using DeepSpeed ZeRO-3, it's more efficient to gather all relevant parameters at once rather than iterating and gathering them one by one. This minimizes communication overhead.

Suggested change
-params_dict = dict(model.named_parameters())
-for name, tensor in state_dict.items():
-    if name in params_dict:
-        param = params_dict[name]
-        with deepspeed.zero.GatheredParameters([param], modifier_rank=0):
-            if deepspeed.comm.get_rank() == 0:
-                param.data.copy_(tensor)
+params_dict = dict(model.named_parameters())
+params_to_load = {name: params_dict[name] for name in state_dict if name in params_dict}
+if params_to_load:
+    with deepspeed.zero.GatheredParameters(list(params_to_load.values()), modifier_rank=0):
+        if deepspeed.comm.get_rank() == 0:
+            for name, param in params_to_load.items():
+                param.data.copy_(state_dict[name])
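
For context, a hedged sketch of how the batched gather above might be used when resuming training; the helper name, checkpoint path, and call site are hypothetical, while deepspeed.zero.GatheredParameters and deepspeed.comm.get_rank are real DeepSpeed APIs.

import torch
import deepspeed

def load_state_dict_zero3(model, state_dict):
    # Gather every matching ZeRO-3 partitioned parameter once, then copy on rank 0.
    params_dict = dict(model.named_parameters())
    params_to_load = {n: params_dict[n] for n in state_dict if n in params_dict}
    if params_to_load:
        with deepspeed.zero.GatheredParameters(list(params_to_load.values()), modifier_rank=0):
            if deepspeed.comm.get_rank() == 0:
                for name, param in params_to_load.items():
                    param.data.copy_(state_dict[name])

# Hypothetical call site when resuming (path is illustrative):
# state_dict = torch.load('checkpoint-100/pytorch_model.bin', map_location='cpu')
# load_state_dict_zero3(model, state_dict)

A single GatheredParameters context un-partitions all listed parameters at once and re-partitions them on exit, so the per-parameter collectives of the original loop collapse into one round of communication, which is the performance point raised above.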

Jintao-Huang merged commit ecf00d4 into modelscope:main on Apr 28, 2026
2 of 3 checks passed

Labels: none yet
Projects: none yet
2 participants