pradeepfn

This PR tries to solve two problems:

  1. Update the integration test code to work with the new policy service API. In the new API, 'setUp' is called internally during service init. If we call it again in our test/driver code, it fails with an out-of-GPU-memory error. This PR fixes that.

  2. With the new API, we have to send the torchstore references to the vLLM workers via the Policy service. I attempted to pass them via WorkerConfig (shown in this PR), but I'm running into pickling issues. Posting the code to get feedback from the experts.

Aggregated Logs (2025-09-02 06:35:21) >>>
[1 similar log lines] MAST is not supported on this platform. You can ignore this if you do not work at Meta.
[1 similar log lines] Failed to initialize replica 0: cannot pickle 'monarch._rust_bindings.monarch_hyperactor.mailbox.Mailbox' object
[1 similar log lines] CRITICAL:root:Unhandled exception in actor endpoint
[2 similar log lines] Traceback (most recent call last):
[4 similar log lines] File "/home/pradeepfdo/.conda/envs/forge/lib/python3.10/site-packages/monarch/_src/actor/actor_mesh.py", line 827, in instrumented
[2 similar log lines] result = await the_method(args, **kwargs)
[4 similar log lines] File "/home/pradeepfdo/forge_fork/src/forge/controller/service/service.py", line 115, in initialize
[2 similar log lines] await asyncio.gather(
[r.initialize() for r in replicas])
[2 similar log lines] self.actor = await self.actor_def.launch(
[2 similar log lines] File "/home/pradeepfdo/forge_fork/src/forge/actors/policy.py", line 131, in launch
[2 similar log lines] "vllm_worker", PolicyWorker, **asdict(config.worker_params)
[6 similar log lines] File "/home/pradeepfdo/.conda/envs/forge/lib/python3.10/dataclasses.py", line 1238, in asdict
[2 similar log lines] return _asdict_inner(obj, dict_factory)
[2 similar log lines] value = _asdict_inner(getattr(obj, f.name), dict_factory)
[2 similar log lines] return copy.deepcopy(obj)
[16 similar log lines] File "/home/pradeepfdo/.conda/envs/forge/lib/python3.10/copy.py", line 172, in deepcopy
[4 similar log lines] y = _reconstruct(x, memo, *rv)
[2 similar log lines] state = deepcopy(state, memo)
[2 similar log lines] y = copier(x, memo)
[2 similar log lines] y[deepcopy(key, memo)] = deepcopy(value, memo)
[2 similar log lines] y = func(*args)
[2 similar log lines] args = (deepcopy(arg, memo) for arg in args)
[2 similar log lines] rv = reductor(4)
[2 similar log lines] TypeError: cannot pickle 'monarch._rust_bindings.monarch_hyperactor.mailbox.Mailbox' object
[1 similar log lines] result = await instrumented()
[1 similar log lines] raise e
<<< Aggregated Logs (2025-09-02 06:35:24) <<<
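
The traceback above fails inside `asdict(config.worker_params)`. `dataclasses.asdict` deep-copies every field value, and `copy.deepcopy` falls back to the pickle protocol for objects it does not know how to copy, which is where the `Mailbox` error comes from. Below is a minimal sketch of that failure mode and one possible way around it. `WorkerParams` and `shallow_asdict` are hypothetical names, not the forge code, and a `threading.Lock` stands in for the unpicklable Monarch `Mailbox` held by the store reference.

```python
import threading
from dataclasses import asdict, dataclass, fields
from typing import Any


@dataclass
class WorkerParams:
    """Hypothetical stand-in for the worker params dataclass in this PR."""
    model: str
    store: Any  # stand-in for a torchstore reference holding an unpicklable handle


def shallow_asdict(obj: Any) -> dict:
    """Map field names to the original field values, without deep-copying them."""
    return {f.name: getattr(obj, f.name) for f in fields(obj)}


# A lock cannot be deep-copied or pickled, similar to the Mailbox in the log.
params = WorkerParams(model="meta-llama/Llama-3.1-8B-Instruct", store=threading.Lock())

try:
    asdict(params)  # recurses into fields and calls copy.deepcopy on the lock
    deepcopy_failed = False
except TypeError:
    deepcopy_failed = True

# The shallow version keeps the very same object, so nothing is copied.
kwargs = shallow_asdict(params)
```

This only sidesteps the deep copy inside `asdict`. If the launch path still has to serialize the keyword arguments to ship them to remote workers, the store reference would need to be picklable, or be re-created on the worker side, regardless.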

@meta-cla meta-cla bot added the CLA Signed label Sep 2, 2025
resources=num_gpus,
state_dict_key=state_dict_key,
policy_config, service_config = get_configs(
1, "meta-llama/Llama-3.1-8B-Instruct", store=store

Passing the torchstore reference here results in pickling errors; the traceback is pasted in the summary.

@pradeepfn pradeepfn closed this Sep 2, 2025
@pradeepfn

Use #115 instead.
