Skip to content

Conversation

@joecummings
Copy link
Member

@joecummings joecummings commented Sep 22, 2025

What is this PR?

As reported in #201, there appeared to be a memory leak in the components using TorchTitan - RLTrainer and ReferenceModel. This PR manually calls the TorchTitan garbage collection util on every step, which eliminates the memory leak.

How was this PR tested?

With gc_freq: 100000 (never running essentially)
Screenshot 2025-09-22 at 1 43 19 PM

Notice the yellow line (reference model) and the ref line (trainer model) consistently going up.

With gc_freq: 1 (run every step)
Screenshot 2025-09-22 at 3 19 55 PM

Notice the yellow line (reference model) and red line (trainer model) stays relatively flat.

FAQs

  1. Why is this happening? The theory is that Monarch doesn't know that it is able to free memory after returning from an Endpoint. So every time the Endpoint (either forward or train_step) is called, new memory is allocated. This theory definitely requires further investigation, but the fix lends credibility. Actually Titan disables garbage collection manually to improve performance. We have to re-enable it with this PR.
  2. How long does garbage collection take? Could this be a bottleneck? Garbage collection times are logged. In my experiments the longest it takes is 1 ms.

@meta-cla meta-cla bot added the CLA Signed This label is managed by the Meta Open Source bot. label Sep 22, 2025
@joecummings joecummings marked this pull request as ready for review September 22, 2025 19:21
@joecummings joecummings linked an issue Sep 22, 2025 that may be closed by this pull request
@joecummings joecummings merged commit d190278 into meta-pytorch:main Sep 22, 2025
3 of 8 checks passed
@joecummings joecummings deleted the debug-titan-ooms branch September 22, 2025 19:28
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

question: Why do we do CP save on evert step?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is a Titan checkpointer impl detail. What actually happens is that it checks if it should save, which is determined by the checkpoint frequency attr found in our config. If it shouldn't checkpoint it just returns. See here.

A much much much better name would be maybe_save IMO

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Curious why isn't it a "finally" step that's done by default?

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Can you explain what is this doing internally?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

photomz pushed a commit to photomz/forge that referenced this pull request Oct 25, 2025
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

CLA Signed This label is managed by the Meta Open Source bot.

Projects

None yet

Development

Successfully merging this pull request may close these issues.

memory leak?

4 participants