Garbage collect on every train / ref model step #209
Conversation
src/forge/actors/trainer.py
question: Why do we do CP save on every step?
This is a Titan checkpointer impl detail. What actually happens is that it checks whether it should save, which is determined by the checkpoint frequency attr found in our config. If it shouldn't checkpoint, it just returns. See here.
A much much much better name would be maybe_save IMO
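For illustration, a minimal sketch of that frequency-gated behavior; the class and attribute names here (`Checkpointer`, `every_n_steps`) are hypothetical stand-ins, not Titan's actual API:

```python
import os

import torch


class Checkpointer:
    """Sketch of a frequency-gated checkpointer: `save` silently returns
    unless the current step is a multiple of the configured interval."""

    def __init__(self, state_dict_fn, folder: str, every_n_steps: int):
        self.state_dict_fn = state_dict_fn  # callable returning the state dict
        self.folder = folder
        self.every_n_steps = every_n_steps

    def save(self, step: int) -> None:
        # "maybe_save" semantics: most calls are no-ops.
        if step % self.every_n_steps != 0:
            return
        os.makedirs(self.folder, exist_ok=True)
        torch.save(self.state_dict_fn(), os.path.join(self.folder, f"step_{step}.pt"))
```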
Curious why it isn't a "finally" step that's done by default?
src/forge/actors/reference_model.py
Can you explain what this is doing internally?
What is this PR?
As reported in #201, there appeared to be a memory leak in the components using TorchTitan - RLTrainer and ReferenceModel. This PR manually calls the TorchTitan garbage collection util on every step, which eliminates the memory leak.
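As a rough sketch of what "garbage collect on every step" means in practice (the function and its arguments below are illustrative, not the actual RLTrainer code), the change amounts to an explicit `gc.collect()` at the end of each step:

```python
import gc

import torch


def train_step(model: torch.nn.Module, optimizer: torch.optim.Optimizer,
               batch: torch.Tensor, target: torch.Tensor) -> float:
    """One optimizer step followed by an explicit garbage collection pass."""
    loss = torch.nn.functional.mse_loss(model(batch), target)
    loss.backward()
    optimizer.step()
    optimizer.zero_grad()
    # The fix: reclaim Python-level garbage after every step, since the
    # automatic collector may be disabled for performance (see FAQ below).
    gc.collect()
    return loss.item()
```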
How was this PR tested?
With `gc_freq: 100000` (essentially never running): notice the yellow line (reference model) and the red line (trainer model) consistently going up.
With `gc_freq: 1` (run every step): notice the yellow line (reference model) and the red line (trainer model) stay relatively flat.
FAQs
The theory is that Monarch doesn't know that it is able to free memory after returning from an Endpoint. So every time the Endpoint (either `forward` or `train_step`) is called, new memory is allocated. This theory definitely requires further investigation, but the fix lends credibility.

Actually, Titan disables garbage collection manually to improve performance. We have to re-enable it with this PR.
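A minimal sketch of that pattern, assuming a hypothetical helper class rather than Titan's exact implementation: automatic collection is disabled up front for performance, and `gc.collect()` runs only every `gc_freq` steps, so `gc_freq: 1` collects on every train / ref model step.

```python
import gc


class GarbageCollector:
    """Sketch of a Titan-style GC helper: disable the automatic collector
    for performance, then collect explicitly every `gc_freq` steps."""

    def __init__(self, gc_freq: int = 1000):
        self.gc_freq = gc_freq
        gc.disable()   # turn off automatic collection
        gc.collect()   # start from a clean slate

    def run(self, step: int) -> None:
        if step % self.gc_freq == 0:
            gc.collect()


# Usage: with gc_freq=1 this collects after every step.
collector = GarbageCollector(gc_freq=1)
for step in range(1, 4):
    collector.run(step)
```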