Qwen MoE weight transfer tracking issue #78

@allenwang28

We are trying torchstore for weight transfer in torchforge with the Qwen 3 30B-A3B model, and `update_weights` fails with the following error:

[1 similar log lines] [0] [Generator-0/1] 2025-11-14 21:18:21 CRITICAL Unhandled exception in actor endpoint
[1 similar log lines] [0] Traceback (most recent call last):
[10 similar log lines] [0]   File "/home/allencwang/monarch-1/python/monarch/_src/actor/actor_mesh.py", line 992, in handle
[3 similar log lines] [0]     result = await the_method(*args, **kwargs)
[4 similar log lines] [0]              ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
[2 similar log lines] [0]   File "/home/allencwang/forge/src/forge/actors/generator.py", line 491, in update_weights
[1 similar log lines] [0]     await self.worker.update_weights.call(version=version)
[2 similar log lines] [0]   File "/home/allencwang/monarch-1/python/monarch/_src/actor/future.py", line 151, in mark_complete
[2 similar log lines] [0]     func, value = set_result, await coro
[2 similar log lines] [0]                               ^^^^^^^^^^
[1 similar log lines] [0]   File "/home/allencwang/monarch-1/python/monarch/_src/actor/endpoint.py", line 154, in process
[1 similar log lines] [0]     rank, value = await r._recv()
[2 similar log lines] [0]                   ^^^^^^^^^^^^^^^
[2 similar log lines] [0]     return self._process(result)
[3 similar log lines] [0]            ^^^^^^^^^^^^^^^^^^^^^
[1 similar log lines] [0]     return rank, super()._process(msg)
[2 similar log lines] [0]     raise cast(Exception, payload)
[2 similar log lines] [0] monarch._src.actor.actor_mesh.ActorError: A remote actor call has failed.
[3 similar log lines] [0]  Traceback of where the remote call failed (most recent call last):
[3 similar log lines] [0]   File "/home/allencwang/torchstore/torchstore/client.py", line 38, in _locate_volumes
[1 similar log lines] [0]     return await self._controller.locate_volumes.call_one(key)
[2 similar log lines] [0]            ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
[2 similar log lines] [0]     raise e
[1 similar log lines] [0]     raise KeyError(
[1 similar log lines] [0] KeyError: "DTensor 'policy_ver_0000000001.model.layers.0.mlp.experts.32.gate_proj.weight' is only partially committed. Not all shards have been stored yet. Please ensure all ranks complete their put() operations."
[4 similar log lines] [0] 
[1 similar log lines] [0] The above exception was the direct cause of the following exception:
[1 similar log lines] [0]     param = await ts.get(param_key)
[2 similar log lines] [0]   File "/home/allencwang/torchstore/torchstore/api.py", line 203, in get
[1 similar log lines] [0]     return await cl.get(key, inplace_tensor, tensor_slice_spec)
[1 similar log lines] [0]     stored_object_type = await self._get_stored_object_type(key)
[1 similar log lines] [0]                          ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
[1 similar log lines] [0]   File "/home/allencwang/torchstore/torchstore/client.py", line 229, in _get_stored_object_type
[1 similar log lines] [0]     volume_map = await self._locate_volumes(key)
[1 similar log lines] [0]     raise KeyError(str(e)) from e
[1 similar log lines] [0] KeyError: 'A remote actor call has failed.\n Traceback of where the remote call failed (most recent call last):\n  File "/home/allencwang/monarch-1/python/monarch/_src/actor/actor_mesh.py", line 999, in handle\n    raise e\n  File "/home/allencwang/monarch-1/python/monarch/_src/actor/actor_mesh.py", line 992, in handle\n    result = await the_method(*args, **kwargs)\n             ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^\n  File "/home/allencwang/torchstore/torchstore/controller.py", line 172, in locate_volumes\n    raise KeyError(\nKeyError: "DTensor \'policy_ver_0000000001.model.layers.0.mlp.experts.32.gate_proj.weight\' is only partially committed. Not all shards have been stored yet. Please ensure all ranks complete their put() operations."\n'
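For anyone reading the traceback: the innermost `KeyError` says the key exists but not every shard of the DTensor has been stored. The toy model below is only a sketch of that condition, not torchstore's actual controller logic. `ToyController`, `notify_put`, `locate`, and the `world_size` bookkeeping are all hypothetical names made up for illustration.

```python
class ToyController:
    """Hypothetical stand-in for a store controller; NOT torchstore's code."""

    def __init__(self, world_size: int) -> None:
        # Assumption: each key is expected to receive one shard per rank.
        self.world_size = world_size
        # key -> set of ranks whose put() for that key has completed
        self.committed: dict[str, set[int]] = {}

    def notify_put(self, key: str, rank: int) -> None:
        """Record that `rank` has stored its shard of `key`."""
        self.committed.setdefault(key, set()).add(rank)

    def locate(self, key: str) -> set[int]:
        """Return the ranks holding `key`, or raise if any shard is missing."""
        ranks = self.committed.get(key)
        if ranks is None:
            raise KeyError(f"{key!r} not found")
        if len(ranks) < self.world_size:
            missing = sorted(set(range(self.world_size)) - ranks)
            raise KeyError(
                f"{key!r} is only partially committed; missing ranks {missing}"
            )
        return ranks


if __name__ == "__main__":
    ctrl = ToyController(world_size=2)
    key = "policy_ver_0000000001.model.layers.0.mlp.experts.32.gate_proj.weight"
    ctrl.notify_put(key, rank=0)
    try:
        ctrl.locate(key)  # rank 1 has not stored its shard yet
    except KeyError as err:
        print("reader failed:", err)
    ctrl.notify_put(key, rank=1)
    print("reader succeeded, shards on ranks:", sorted(ctrl.locate(key)))
```

In this toy, a reader fails whenever it looks up a key before every expected rank has reported its shard. Whether the real failure comes from a put/get ordering race or from some ranks never storing the expert weight is not determined by the log above.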
