Open
Description
We are trying torchstore for weight transfer in torchforge with the Qwen 3 30B-A3B model. The generator's `update_weights` call fails with the following error:
[1 similar log lines] [0] [Generator-0/1] 2025-11-14 21:18:21 CRITICAL Unhandled exception in actor endpoint
[1 similar log lines] [0] Traceback (most recent call last):
[10 similar log lines] [0] File "/home/allencwang/monarch-1/python/monarch/_src/actor/actor_mesh.py", line 992, in handle
[3 similar log lines] [0] result = await the_method(*args, **kwargs)
[4 similar log lines] [0] ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
[2 similar log lines] [0] File "/home/allencwang/forge/src/forge/actors/generator.py", line 491, in update_weights
[1 similar log lines] [0] await self.worker.update_weights.call(version=version)
[2 similar log lines] [0] File "/home/allencwang/monarch-1/python/monarch/_src/actor/future.py", line 151, in mark_complete
[2 similar log lines] [0] func, value = set_result, await coro
[2 similar log lines] [0] ^^^^^^^^^^
[1 similar log lines] [0] File "/home/allencwang/monarch-1/python/monarch/_src/actor/endpoint.py", line 154, in process
[1 similar log lines] [0] rank, value = await r._recv()
[2 similar log lines] [0] ^^^^^^^^^^^^^^^
[2 similar log lines] [0] return self._process(result)
[3 similar log lines] [0] ^^^^^^^^^^^^^^^^^^^^^
[1 similar log lines] [0] return rank, super()._process(msg)
[2 similar log lines] [0] raise cast(Exception, payload)
[2 similar log lines] [0] monarch._src.actor.actor_mesh.ActorError: A remote actor call has failed.
[3 similar log lines] [0] Traceback of where the remote call failed (most recent call last):
[3 similar log lines] [0] File "/home/allencwang/torchstore/torchstore/client.py", line 38, in _locate_volumes
[1 similar log lines] [0] return await self._controller.locate_volumes.call_one(key)
[2 similar log lines] [0] ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
[2 similar log lines] [0] raise e
[1 similar log lines] [0] raise KeyError(
[1 similar log lines] [0] KeyError: "DTensor 'policy_ver_0000000001.model.layers.0.mlp.experts.32.gate_proj.weight' is only partially committed. Not all shards have been stored yet. Please ensure all ranks complete their put() operations."
[4 similar log lines] [0]
[1 similar log lines] [0] The above exception was the direct cause of the following exception:
[1 similar log lines] [0] param = await ts.get(param_key)
[2 similar log lines] [0] File "/home/allencwang/torchstore/torchstore/api.py", line 203, in get
[1 similar log lines] [0] return await cl.get(key, inplace_tensor, tensor_slice_spec)
[1 similar log lines] [0] stored_object_type = await self._get_stored_object_type(key)
[1 similar log lines] [0] ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
[1 similar log lines] [0] File "/home/allencwang/torchstore/torchstore/client.py", line 229, in _get_stored_object_type
[1 similar log lines] [0] volume_map = await self._locate_volumes(key)
[1 similar log lines] [0] raise KeyError(str(e)) from e
[1 similar log lines] [0] KeyError: 'A remote actor call has failed.\n Traceback of where the remote call failed (most recent call last):\n File "/home/allencwang/monarch-1/python/monarch/_src/actor/actor_mesh.py", line 999, in handle\n raise e\n File "/home/allencwang/monarch-1/python/monarch/_src/actor/actor_mesh.py", line 992, in handle\n result = await the_method(*args, **kwargs)\n ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^\n File "/home/allencwang/torchstore/torchstore/controller.py", line 172, in locate_volumes\n raise KeyError(\nKeyError: "DTensor \'policy_ver_0000000001.model.layers.0.mlp.experts.32.gate_proj.weight\' is only partially committed. Not all shards have been stored yet. Please ensure all ranks complete their put() operations."\n'
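A possible stopgap while the root cause is investigated, sketched under assumptions: if the failure is a race between the trainer ranks' `put()` calls and the generator's `ts.get(param_key)`, the reader could retry while the key is reported as partially committed. The helper `get_with_retry` and the `FakeStore` class below are hypothetical names for illustration and are not part of torchstore. Matching on the message text is brittle, and retrying will not help if a rank never completes its `put()`.

```python
import asyncio


async def get_with_retry(get_fn, key, retries=5, delay_s=0.01):
    """Await get_fn(key), retrying while the key is only partially committed.

    get_fn is any async callable taking a key (assumed here to behave like
    the ts.get call in the traceback). Other KeyErrors are re-raised at once.
    """
    for attempt in range(retries):
        try:
            return await get_fn(key)
        except KeyError as e:
            # Assumption: the error is identified by the message text from the log.
            is_partial = "partially committed" in str(e)
            if not is_partial or attempt == retries - 1:
                raise
            # Exponential backoff before the next attempt.
            await asyncio.sleep(delay_s * (2 ** attempt))


class FakeStore:
    """Hypothetical in-memory stand-in used only to exercise the helper."""

    def __init__(self, fail_times):
        self.calls = 0
        self.fail_times = fail_times

    async def get(self, key):
        self.calls += 1
        if self.calls <= self.fail_times:
            raise KeyError(f"DTensor '{key}' is only partially committed.")
        return f"value-for-{key}"


if __name__ == "__main__":
    store = FakeStore(fail_times=2)
    print(asyncio.run(get_with_retry(store.get, "demo.weight", delay_s=0.001)))
```

In real use, `get_fn` would be the store's own getter. A retry only hides the symptom; a synchronization point after all ranks finish `put()` and before the version is published to the generator would be the more robust fix, if that turns out to be the cause.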