
Empty ZeRO3 partition cache #3060

Merged
merged 11 commits into master from olruwase/zero_partition_cache on Mar 24, 2023

Conversation

tjruwase
Contributor

API to free GPU memory consumed by the ZeRO-3 partition cache.
Fixes #3025

@stas00
Contributor

stas00 commented Mar 22, 2023

oh, sorry, could this new method be added to the API docs please? Thank you, Tunji!

Perhaps something like:

By default, at the end of training some parameters will remain unpartitioned and use up some GPU memory. This is done on purpose as an optimization in case you resume training. If you'd like to clear out the cached parameters that use up GPU memory, you can call:

deepspeed_engine.empty_partition_cache()

as soon as the training has finished.
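To make the call pattern concrete without requiring DeepSpeed or a GPU, here is a runnable toy sketch. The `ToyZero3Engine` class is a hypothetical stand-in for a real DeepSpeed engine: its dict-backed cache only mimics the lifecycle of the real `empty_partition_cache()`, which releases gathered full parameters back to their partitions and frees the GPU memory they occupied.

```python
# Toy stand-in for a DeepSpeed ZeRO-3 engine. The real API is
# deepspeed_engine.empty_partition_cache(); this stub mimics only the
# call pattern, using a plain dict instead of GPU-resident parameters,
# so the flow runs anywhere.
class ToyZero3Engine:
    def __init__(self):
        # Simulates parameters left gathered (unpartitioned) after training.
        self.partition_cache = {"layer0.weight": [0.0] * 4}

    def train_step(self):
        # A real engine would run forward/backward/step here.
        pass

    def empty_partition_cache(self):
        # The real method releases cached full parameters back to their
        # partitions; here we just drop the simulated cache entries.
        self.partition_cache.clear()


engine = ToyZero3Engine()
for _ in range(3):
    engine.train_step()

# Per the doc suggestion above: call this as soon as training finishes.
engine.empty_partition_cache()
print(len(engine.partition_cache))  # 0: cache is empty
```

The key point of the suggested docs is the timing: the cache is kept deliberately so a resumed training run avoids re-gathering, so only call `empty_partition_cache()` once you know training is truly done.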

@stas00
Contributor

stas00 commented Mar 22, 2023

Thank you for adding the doc - looks great, Tunji!

@jeffra jeffra merged commit e80ae08 into master Mar 24, 2023
1 check failed
@jeffra jeffra deleted the olruwase/zero_partition_cache branch March 24, 2023 00:15
Successfully merging this pull request may close these issues.

[BUG] zero3 memory leak on return from training loop