
Conversation

@yifuwang
Contributor

To leverage the low-latency intra-node comm in c10d (pytorch/pytorch#114001), torch.cuda.set_device() needs to be invoked before init_process_group().
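
For reference, a minimal sketch of the call order this PR adopts (not copied from the PR itself; assumes a standard torchrun launch that provides LOCAL_RANK):

```python
import os

import torch
import torch.distributed as dist

local_rank = int(os.environ["LOCAL_RANK"])

# Bind this rank to its GPU *before* creating the process group so the
# intra-node comm path can pick up the device at init time.
torch.cuda.set_device(local_rank)
dist.init_process_group(backend="nccl")
```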

@facebook-github-bot added the CLA Signed label on Dec 16, 2023
@Chillee merged commit 88a4d77 into meta-pytorch:main on Dec 17, 2023
@carmocca

Hi @yifuwang!

Is this change something that you would recommend in general? In every resource I've found online, set_device is placed after init_process_group():

Could you elaborate on why this is necessary with ENABLE_INTRA_NODE_COMM, and what the differences (if any) are between setting the device before or after?

Thank you!

@yifuwang
Contributor Author

Hey @carmocca,

> Is this change something that you would recommend in general?

Without ENABLE_INTRA_NODE_COMM, I don't think it matters so long as you set the correct device before the first collective. There are instances of set_device before init_process_group in the links you posted (e.g. https://pytorch.org/docs/stable/distributed.html#launch-utility and https://pytorch.org/docs/stable/generated/torch.nn.parallel.DistributedDataParallel.html#torch.nn.parallel.DistributedDataParallel).
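
For illustration, a minimal sketch of the ordering described above (my reading, not from the PR): without ENABLE_INTRA_NODE_COMM, setting the device after init_process_group() is fine as long as it happens before the first collective. Assumes a torchrun launch that provides LOCAL_RANK.

```python
import os

import torch
import torch.distributed as dist

dist.init_process_group(backend="nccl")

# Without ENABLE_INTRA_NODE_COMM, setting the device here (before any
# collective) works just as well as setting it before init_process_group().
torch.cuda.set_device(int(os.environ["LOCAL_RANK"]))

t = torch.ones(1, device="cuda")
dist.all_reduce(t)  # first collective runs after the device is set
```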

> Could you elaborate on why this is necessary with ENABLE_INTRA_NODE_COMM

Technically it's not a hard requirement. It's just that the feature is still new and experimental, and we're still figuring out the UX. I'm curious whether this constraint is causing issues in your project beyond the inconvenience. Thanks!

@carmocca

Let me run Lightning's CI with the order changed to see if any issues pop up.

