Garbage collect on every train / ref model step #209
Conversation
src/forge/actors/trainer.py
question: Why do we do CP save on every step?
This is a Titan checkpointer impl detail. What actually happens is that it checks whether it should save, which is determined by the checkpoint frequency attr found in our config. If it shouldn't checkpoint, it just returns. See here.
A much much much better name would be maybe_save IMO
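For illustration, a minimal sketch of that frequency-gated behavior; the class and attribute names here (`Checkpointer`, `every_n_steps`) are hypothetical stand-ins, not Titan's actual API:

```python
import os

import torch


class Checkpointer:
    """Sketch of a frequency-gated checkpointer: `save` silently returns
    unless the current step is a multiple of the configured interval."""

    def __init__(self, state_dict_fn, folder: str, every_n_steps: int):
        self.state_dict_fn = state_dict_fn  # callable returning the state dict
        self.folder = folder
        self.every_n_steps = every_n_steps

    def save(self, step: int) -> None:
        # "maybe_save" semantics: most calls are no-ops.
        if step % self.every_n_steps != 0:
            return
        os.makedirs(self.folder, exist_ok=True)
        torch.save(self.state_dict_fn(), os.path.join(self.folder, f"step_{step}.pt"))
```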
Curious why it isn't a "finally" step that's done by default?
src/forge/actors/reference_model.py
Can you explain what this is doing internally?
What is this PR?
As reported in #201, there appeared to be a memory leak in the components using TorchTitan - RLTrainer and ReferenceModel. This PR manually calls the TorchTitan garbage collection util on every step, which eliminates the memory leak.
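As a rough sketch of what "garbage collect on every step" means in practice (the function and its arguments below are illustrative, not the actual RLTrainer code), the change amounts to an explicit `gc.collect()` at the end of each step:

```python
import gc

import torch


def train_step(model: torch.nn.Module, optimizer: torch.optim.Optimizer,
               batch: torch.Tensor, target: torch.Tensor) -> float:
    """One optimizer step followed by an explicit garbage collection pass."""
    loss = torch.nn.functional.mse_loss(model(batch), target)
    loss.backward()
    optimizer.step()
    optimizer.zero_grad()
    # The fix: reclaim Python-level garbage after every step, since the
    # automatic collector may be disabled for performance (see FAQ below).
    gc.collect()
    return loss.item()
```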
How was this PR tested?
With `gc_freq: 100000` (essentially never running): notice the yellow line (reference model) and the red line (trainer model) consistently going up.
With `gc_freq: 1` (run every step): notice the yellow line (reference model) and the red line (trainer model) stay relatively flat.
FAQs
The theory is that Monarch doesn't know that it is able to free memory after returning from an Endpoint. So every time the Endpoint (either `forward` or `train_step`) is called, new memory is allocated. This theory definitely requires further investigation, but the fix lends credibility.

Actually, Titan disables garbage collection manually to improve performance. We have to re-enable it with this PR.
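A minimal sketch of that pattern, assuming a hypothetical helper class rather than Titan's exact implementation: automatic collection is disabled up front for performance, and `gc.collect()` runs only every `gc_freq` steps, so `gc_freq: 1` collects on every train / ref model step.

```python
import gc


class GarbageCollector:
    """Sketch of a Titan-style GC helper: disable the automatic collector
    for performance, then collect explicitly every `gc_freq` steps."""

    def __init__(self, gc_freq: int = 1000):
        self.gc_freq = gc_freq
        gc.disable()   # turn off automatic collection
        gc.collect()   # start from a clean slate

    def run(self, step: int) -> None:
        if step % self.gc_freq == 0:
            gc.collect()


# Usage: with gc_freq=1 this collects after every step.
collector = GarbageCollector(gc_freq=1)
for step in range(1, 4):
    collector.run(step)
```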