Fix megatron save OOM #9166
Conversation
Code Review
This pull request updates the minimum Python version requirement from 3.9 to 3.10 across multiple documentation files and introduces a garbage collection call (`gc_collect()`) in the Megatron trainer's checkpointing logic to reduce memory pressure. A review comment suggests that an additional garbage collection call may be needed before saving weights when LoRA merging is enabled, since that process is memory-intensive and could still lead to out-of-memory errors.
```python
    model = []
else:
    model = self.wrapped_models
gc_collect()
```
While adding `gc_collect()` here helps mitigate OOM issues during the initial checkpointing phase, it might also be beneficial to call it again before the `save_safetensors` block (around line 743), especially when `merge_lora` is enabled. Merging LoRA adapters can be memory-intensive and may leave fragmented memory that causes an OOM during the subsequent `save_weights` call.
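A minimal sketch of the pattern being suggested. The `gc_collect` name matches the call in the diff, but its body and the `save_checkpoint` flow below are assumptions for illustration, not the repository's actual implementation (`merge_lora_adapters` and `save_weights` are hypothetical method names):

```python
import gc


def gc_collect() -> None:
    """Reclaim unreachable Python objects and, if CUDA is present, cached GPU blocks."""
    gc.collect()  # free host-side garbage first
    try:
        import torch  # assumed available in the trainer; guarded for portability
        if torch.cuda.is_available():
            torch.cuda.empty_cache()  # return cached allocator blocks to the driver
    except ImportError:
        pass


def save_checkpoint(trainer, merge_lora: bool = False) -> None:
    """Hypothetical checkpointing flow: collect before the merge, and again
    before saving so fragmentation left by the merge is released."""
    gc_collect()
    if merge_lora:
        trainer.merge_lora_adapters()  # memory-intensive merge step
        gc_collect()                   # the extra call the review suggests
    trainer.save_weights()             # the save_safetensors path
```

The second `gc_collect()` matters because the merged full-precision weights created during the merge can leave large freed-but-cached allocations that the subsequent save would otherwise have to allocate around.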
#8228