
Conversation

H-Huang (Contributor) commented May 20, 2025

Hitting a RuntimeError when restoring the parameters in DiLoCo: The size of tensor a (128256) must match the size of tensor b (128220) at non-singleton dimension 0.

Depends on this fix: pytorch/pytorch#154156; tests pass with PyTorch nightly.

facebook-github-bot added the CLA Signed label (managed by the Meta Open Source bot) May 20, 2025
H-Huang marked this pull request as draft May 20, 2025 21:35
```diff
 p.data.copy_(
     DTensor.from_local(
-        self.original_parameters[name], p.device_mesh, p.placements
+        self.original_parameters[name], p.device_mesh, p.placements, shape=p.shape, stride=p.stride()
```
Contributor:

I'm curious why shape changes?

H-Huang (Contributor, Author) commented May 20, 2025:

For 10 ranks, the original parameter had shape [128256, 4096].

After converting to a DTensor, each rank had a local tensor of shape [12826, 4096], except for rank 9, whose local tensor was [12822, 4096].

I saw a note in the docs about needing to pass the shape argument when the tensors are not even across ranks. It is strange that I cannot reproduce this in a test yet, but it does fix the issue when running on the cluster.
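
A minimal sketch of that situation (illustrative setup and values, not the torchft code), assuming 10 ranks and a [128256, 4096] parameter sharded on dim 0:

```python
import torch
from torch.distributed.device_mesh import init_device_mesh
from torch.distributed.tensor import DTensor, Shard

# Assumed setup: 10 ranks; ranks 0-8 hold a [12826, 4096] local shard, rank 9 holds [12822, 4096].
mesh = init_device_mesh("cuda", (10,))
rank = mesh.get_rank()
local = torch.randn(12822 if rank == 9 else 12826, 4096, device="cuda")

# Without an explicit shape, from_local assumes even sharding and infers the global
# shape from the local shard (e.g. 12822 * 10 = 128220 rows on rank 9), which no
# longer matches the [128256, 4096] parameter being copied into.
dt_inferred = DTensor.from_local(local, mesh, [Shard(0)])

# Passing the true global shape/stride keeps the DTensor consistent with the
# parameter, so p.data.copy_(...) succeeds even with uneven shards.
dt_explicit = DTensor.from_local(
    local, mesh, [Shard(0)], shape=(128256, 4096), stride=(4096, 1)
)
```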

Contributor:

The uneven-sharding part makes sense, but I recall the issue didn't happen at the first iteration. If it is an uneven-sharding issue, shouldn't it happen at the first iteration?

H-Huang (Contributor, Author):

It happens at the iteration at which the semi-sync methods perform a sync, so it depends on the sync_every config.
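
A minimal sketch of that timing, with a hypothetical sync_every value (the step check below is illustrative, not the torchft implementation):

```python
sync_every = 100  # assumed config value

def is_sync_step(step: int) -> bool:
    # The restore from original_parameters (and the shape mismatch above) only runs here.
    return step > 0 and step % sync_every == 0

assert not is_sync_step(1)         # first iteration: no restore, so no error
assert is_sync_step(sync_every)    # first sync step: restore runs and can hit the mismatch
```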

H-Huang marked this pull request as ready for review May 27, 2025 15:17
H-Huang requested a review from d4l3k May 27, 2025 15:44
H-Huang merged commit dafb968 into meta-pytorch:main May 27, 2025
6 of 8 checks passed