Fix megatron save oom #9166

Merged
Jintao-Huang merged 2 commits into modelscope:main from Jintao-Huang:fix_megatron_save_oom
Apr 21, 2026

Conversation

@Jintao-Huang (Collaborator)
@gemini-code-assist Bot (Contributor) left a comment


Code Review

This pull request updates the minimum Python version requirement from 3.9 to 3.10 across multiple documentation files and introduces a garbage collection call (gc_collect) in the Megatron trainer's checkpointing logic to reduce memory pressure. A review comment suggests that an additional garbage collection call might be necessary before saving weights if LoRA merging is enabled, as that process is memory-intensive and could still lead to out-of-memory errors.

```python
    model = []
else:
    model = self.wrapped_models
gc_collect()
```
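
For context, a gc_collect helper of this kind typically pairs Python's garbage collector with a release of cached CUDA memory. A minimal sketch of such a helper, assuming the common gc + torch.cuda pattern (the actual implementation in ms-swift may differ):

```python
import gc

import torch


def gc_collect() -> None:
    # Reclaim unreachable Python objects (e.g. model references dropped
    # just before checkpointing), then return cached CUDA blocks to the
    # allocator to reduce fragmentation-driven OOMs.
    gc.collect()
    if torch.cuda.is_available():
        torch.cuda.empty_cache()
```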
@gemini-code-assist Bot (Contributor)

Severity: medium

While adding gc_collect() here helps mitigate OOM issues during the initial checkpointing phase, it might also be beneficial to call it again before the save_safetensors block (around line 743), especially if merge_lora is enabled. Merging LoRA adapters can be memory-intensive and may leave fragmented memory that could cause OOM during the subsequent save_weights call.
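
A minimal sketch of the placement this comment suggests, using the gc_collect helper sketched above; merge_and_unload, save_pretrained, and the flag names are illustrative assumptions, not the actual trainer code around line 743:

```python
def save_checkpoint(model, merge_lora: bool, save_safetensors: bool) -> None:
    # Hypothetical save path; only the gc_collect placement reflects the
    # reviewer's suggestion.
    if merge_lora:
        # Merging LoRA adapters materializes full-size merged weights and
        # can leave fragmented memory behind.
        model = model.merge_and_unload()  # PEFT-style merge (assumed)
        gc_collect()  # reclaim memory before the save path runs
    if save_safetensors:
        # With memory reclaimed above, the save step is less likely to OOM.
        model.save_pretrained("./checkpoint", safe_serialization=True)
```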

@Jintao-Huang Jintao-Huang merged commit 4f379be into modelscope:main Apr 21, 2026
3 checks passed
Jintao-Huang added a commit that referenced this pull request Apr 23, 2026

Labels: None yet
Projects: None yet
2 participants