Conversation

@casteryh (Contributor) commented Sep 27, 2025

Copilot generated summary follows:

This pull request introduces significant improvements to weight management in the GRPO training pipeline, focusing on more efficient handling, saving, and cleanup of model weights using torchstore and distributed checkpoints (DCP). The main changes include adding a mechanism to drop old weights, refactoring how weights are pushed and loaded (favoring DCP whole state dicts), and enhancing robustness with new utility functions and tests.

Weight Management Improvements

  • Added the drop_weights async function in apps/grpo/main.py to delete old model weights after each training step, preventing unnecessary storage growth. This function uses new utilities to locate and drop both the DCP handle and the individual parameter keys. ([1], [2])
  • Refactored the weight loading logic in PolicyWorker.update to prefer loading the entire state dict from a single DCP handle when available, falling back to individual parameters otherwise. This streamlines weight updates and reduces complexity. (src/forge/actors/policy.py, L563–R582)
  • Updated the weight saving logic in Trainer.push_weights to save the whole state dict as a single DCP handle when use_dcp is enabled, improving performance and consistency. (src/forge/actors/trainer.py, L347–R369) A sketch of this push / load / drop flow follows this list.
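
A minimal sketch of the push / load / drop flow described above. The `store` object stands in for an assumed async key/value interface (put / get / delete), and the key helpers, the DcpHandle shape, and the save/load wrappers are illustrative of this description rather than the actual torchstore or forge APIs:

```python
# Sketch only: `store`, the key helpers, and save_dcp_checkpoint /
# load_dcp_checkpoint are assumptions used for illustration; they are not
# the real torchstore / forge function names.

def dcp_key(version: int) -> str:
    return f"weights/v{version}/__whole_state_dict__"

def param_key(version: int, name: str) -> str:
    return f"weights/v{version}/{name}"

async def push_weights(store, state_dict, version: int, use_dcp: bool) -> None:
    # Trainer side: with use_dcp, save the whole state dict as one DCP
    # checkpoint and publish a single handle; otherwise push per-parameter keys.
    if use_dcp:
        handle = save_dcp_checkpoint(state_dict, version)  # assumed DCP wrapper
        await store.put(dcp_key(version), handle)
    else:
        for name, tensor in state_dict.items():
            await store.put(param_key(version, name), tensor)

async def update_weights(store, model, version: int, param_names) -> None:
    # Policy side: prefer the whole-state-dict DCP handle, fall back to
    # loading individual parameters.
    handle = await store.get(dcp_key(version), default=None)
    if handle is not None:
        model.load_weights(load_dcp_checkpoint(handle).items())  # assumed DCP wrapper
    else:
        for name in param_names:
            model.load_weights([(name, await store.get(param_key(version, name)))])

async def drop_weights(store, version: int, param_names) -> None:
    # Cleanup after a training step: drop the DCP handle (and its files) plus
    # any individual parameter keys for the old version.
    handle = await store.get(dcp_key(version), default=None)
    if handle is not None:
        handle.drop()  # deletes the checkpoint files; see the DcpHandle sketch below
        await store.delete(dcp_key(version))
    for name in param_names:
        await store.delete(param_key(version, name))
```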

Utility and Configuration Enhancements

  • Added new utility functions and improved the DcpHandle class in src/forge/actors/_torchstore_utils.py, including a robust drop() method that safely deletes checkpoints and handles manifold storage cases. ([1], [2]) A sketch of drop() follows this list.
  • Updated configuration files to enable built-in vLLM loading and DCP usage for both the policy and trainer components, aligning the pipeline with the new weight management strategy. ([1], [2])
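
A minimal sketch of the drop() behavior described above, assuming a DcpHandle dataclass that records the checkpoint location; the manifold:// prefix and the warning branch are assumptions, since the remote-storage details are not shown in this thread:

```python
import logging
import shutil
from dataclasses import dataclass
from typing import List, Optional

logger = logging.getLogger(__name__)

@dataclass
class DcpHandle:
    checkpoint_id: Optional[str] = None   # path (or manifold URI) of the saved DCP checkpoint
    param_names: Optional[List[str]] = None

    def drop(self) -> None:
        """Delete the checkpoint behind this handle; log failures rather than raising."""
        if self.checkpoint_id is None:
            return
        if self.checkpoint_id.startswith("manifold://"):
            # Remote (manifold) storage needs its own deletion call; the exact
            # client API is not shown here, so the sketch only logs it.
            logger.warning("Skipping manifold checkpoint cleanup: %s", self.checkpoint_id)
        else:
            try:
                shutil.rmtree(self.checkpoint_id, ignore_errors=False)
            except OSError as e:
                logger.error("Failed to delete checkpoint %s: %s", self.checkpoint_id, e)
        self.checkpoint_id = None
        self.param_names = None
```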

Testing and Reliability

These changes collectively make the training pipeline more efficient, reliable, and easier to maintain by improving how model weights are stored, loaded, and cleaned up.

@meta-cla bot added the CLA Signed label on Sep 27, 2025
@casteryh requested a review from Jack-Khuu on September 27, 2025 at 23:54
training_step += 1
mlogger.log("loss/training_step", loss, training_step)
await trainer.push_weights.fanout(training_step)
await policy.update_weights.fanout(training_step)
Contributor commented:

I am wondering if there is a possibility of needing "some" history of the weights. Can the RL loop be still alive after this statement finishes but the policy model goes down?

casteryh (Contributor Author) replied:

> I am wondering if there is a possibility of needing "some" history of the weights. Can the RL loop be still alive after this statement finishes but the policy model goes down?

Can you elaborate on what you meant by "the RL loop be still alive"?

* metric logger simple example

* it works

* delete old files

* refactoring + docstrings

* docstring

* comments

* update method name

* no circular import

* update command

* update arg name

* move metric actor out of asyncio lock

* fix deregister

* lint

* docstring

* fix result extraction and add logger shutdown

* fix shutdown order

* simplification + docstrings

* bug fix + register if respawn

* it works

* use procmesh as key

* docstring

* remove protected imports

* create get_metric_logger

* call became fanout

* upstream changes

---------

Co-authored-by: Felipe Mello <felipemello@fb.com>
@joecummings (Member) left a comment:

Just one comment, but otherwise very helpful

return self._tokenizer.pad_token_id


async def drop_weights(version: int):
Member commented:

Great function, but this is actually the kind of logic I don't care to see in the main.py file. Would it be possible to have this be part of torchstore / dcp itself (or a wrapper we write)? That way we can specify here "keep_last_n_weights".
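
For illustration, a wrapper along those lines might look like the sketch below; keep_last_n_weights is a hypothetical name, and drop_weights refers to the function added in this PR:

```python
async def keep_last_n_weights(current_version: int, n: int = 2) -> None:
    """Keep the newest `n` weight versions and drop the one that just aged out."""
    stale_version = current_version - n
    if stale_version >= 0:
        await drop_weights(stale_version)
```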

casteryh (Contributor Author) replied:

fair

…model config (for Titan + vLLM) (#241)

* commit

* flag

* format

* nit

* nit
@Jack-Khuu (Contributor) commented Sep 29, 2025:

nit: Can you format the clickable links in the PR description? Copilot borked it

@casteryh (Contributor Author) replied:

> nit: Can you format the clickable links in the PR description? Copilot borked it

done!

@joecummings (Member) left a comment:

I left a few comments and questions, but overall LGTM

self.param_names = None
return

import shutil
Member commented:

You can import this at the top

import shutil

try:
    shutil.rmtree(self.checkpoint_id, ignore_errors=False)
Member commented:

Why do we want to suppress the errors here?

casteryh (Contributor Author) replied:

Fair. I was just thinking that logging the error is fine, and I don't want to crash everything if the delete is not successful. Let me know what you think.

loaded = model.load_weights([(name, param)])
del param
loaded_weights.update(loaded)
logger.info(
@joecummings (Member) commented Sep 29, 2025:

The entire weight update timing is already calculated at the policy update level.

casteryh (Contributor Author) replied:

Yes, but it's different from each worker's update time. The overall update time is basically the longest of the per-worker times. (A small sketch of the distinction follows.)
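
A small sketch of the distinction, with illustrative names rather than the actual actor code:

```python
import logging
import time

logger = logging.getLogger(__name__)

async def worker_update(version: int) -> None:
    # Per-worker timing: measured inside each worker's update call.
    start = time.perf_counter()
    ...  # load this worker's weights (per-parameter or via the DCP handle)
    logger.info("Worker loaded weights for v%d in %.2fs", version, time.perf_counter() - start)

# The top-level timing measured around policy.update_weights.fanout(...) is
# roughly the maximum of these per-worker times, since the slowest worker gates it.
```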

Member replied:

Eventually, we might just want to remove the top-level timing log, I suppose.

@joecummings (Member) commented:

We'll need #255 to land.

@casteryh merged commit a1714c3 into meta-pytorch:main on Sep 29, 2025
5 checks passed

Labels: CLA Signed (managed by the Meta Open Source bot) · Projects: none yet · 6 participants