
relation between apex.parallel.DistributedDataParallel and torch.distributed #75

Closed

xmyqsh opened this issue Nov 3, 2018 · 3 comments

xmyqsh commented Nov 3, 2018

I haven't gone through the code yet.
Could anyone give a quick explanation of the relation between apex.parallel.DistributedDataParallel and torch.nn.parallel.DistributedDataParallel, as well as torch.distributed.launch?


mcarilli commented Nov 5, 2018

apex.parallel.DistributedDataParallel and torch.nn.parallel.DistributedDataParallel have the same purpose. They are model wrappers that automatically take care of gradient allreduces during the backward pass. Their usage is almost identical. The Apex version offers some features that the torch version does not, but we plan to merge Apex features into upstream eventually, so for forward compatibility, you may as well just use the torch version.
- apex.parallel.DistributedDataParallel example
- torch.nn.parallel.DistributedDataParallel example (note the slightly different constructor arguments)
FP16_Optimizer happens to be used in these examples, but its presence is unrelated to the DistributedDataParallel wrappers. You can ignore it.
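
For concreteness, here is a minimal sketch (not taken from the linked examples) of how either wrapper is applied. It assumes the process group has already been initialized, e.g. by torch.distributed.launch as described below, and that `local_rank` identifies this process's GPU:

```python
import torch
import torch.nn as nn
import torch.distributed as dist

def wrap_model(local_rank: int) -> nn.Module:
    # One process per GPU; the process group must already be initialized.
    assert dist.is_initialized()
    model = nn.Linear(128, 10).cuda(local_rank)

    # torch wrapper: device_ids/output_device pin this replica to one GPU,
    # and gradients are allreduced across processes during backward().
    model = nn.parallel.DistributedDataParallel(
        model, device_ids=[local_rank], output_device=local_rank)

    # apex wrapper (constructor arguments differ slightly; it infers the
    # device from the model's parameters):
    #   from apex.parallel import DistributedDataParallel as ApexDDP
    #   model = ApexDDP(model)
    return model
```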

torch.distributed.launch is a wrapper script intended to spawn multiple processes, and supply them with the arguments and the environment necessary to set up distributed training within each process. torch.distributed.launch can be used with either apex.parallel.DistributedDataParallel or torch.nn.parallel.DistributedDataParallel.
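
And a minimal sketch of the per-process setup that torch.distributed.launch expects; the script name `train.py`, the process count, and the NCCL backend are illustrative choices, not from this thread:

```python
# Launched with one process per GPU on a single node, e.g.:
#   python -m torch.distributed.launch --nproc_per_node=4 train.py
import argparse

import torch
import torch.distributed as dist

parser = argparse.ArgumentParser()
# torch.distributed.launch passes --local_rank to each process it spawns.
parser.add_argument("--local_rank", type=int, default=0)
args = parser.parse_args()

# The launcher also sets MASTER_ADDR, MASTER_PORT, RANK, and WORLD_SIZE
# in the environment, which init_method="env://" reads.
torch.cuda.set_device(args.local_rank)
dist.init_process_group(backend="nccl", init_method="env://")

# ... build the model and wrap it with either DistributedDataParallel,
# as in the sketch above ...
```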

gbrow004 commented Jun 8, 2019

I understand that they both have the same purpose, but are there any potential/theoretical advantages to using the apex version over the torch one, aside from the extra options? Performance/speed?

@mcarilli

Right now, I'd recommend torch.nn.parallel.DistributedDataParallel for all practical purposes. It's pretty darn good (fast and robust).
