Skip to content

Conversation

@H-Huang
Copy link
Contributor

@H-Huang H-Huang commented Mar 10, 2025

Async NCCL collectives dont block the CPU which means their futures will also get executed. This causes an issue in manager.py when we modify the allreduce tensor in a callback (https://github.com/pytorch/torchft/blob/main/torchft/manager.py#L292) as the tensor may not have finished allreduce.

The fix is to add an event after the allreduce has been wait()ed and then wait on this event before setting the future.

Created a test to repro:

pytest torchft/manager_integ_test.py -vsk test_manager_allreduce

@H-Huang H-Huang marked this pull request as draft March 10, 2025 17:06
@facebook-github-bot facebook-github-bot added the CLA Signed This label is managed by the Meta Open Source bot. label Mar 10, 2025
@H-Huang H-Huang requested a review from d4l3k March 10, 2025 18:07
@H-Huang H-Huang force-pushed the diloco branch 3 times, most recently from 7753bbb to b23a598 Compare March 10, 2025 20:53
@H-Huang H-Huang marked this pull request as ready for review March 10, 2025 21:06
@H-Huang H-Huang changed the title [WIP] fix nccl future execution fix nccl future execution Mar 10, 2025
@H-Huang H-Huang requested a review from fegin March 10, 2025 21:08
@H-Huang H-Huang force-pushed the diloco branch 2 times, most recently from 70d648a to fa61e1d Compare March 10, 2025 21:13
Copy link
Member

@d4l3k d4l3k left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM thanks for fixing this!

@H-Huang H-Huang merged commit 8f021e1 into meta-pytorch:main Mar 10, 2025
6 checks passed
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

CLA Signed This label is managed by the Meta Open Source bot.

Projects

None yet

Development

Successfully merging this pull request may close these issues.

3 participants