
relation between apex.parallel.DistributedDataParallel and torch.distributed #75

Closed

xmyqsh opened this issue Nov 3, 2018 · 3 comments

xmyqsh commented Nov 3, 2018

I haven't gone through the code yet.
Could anyone give a quick explanation of the relation between apex.parallel.DistributedDataParallel and torch.nn.parallel.DistributedDataParallel, as well as torch.distributed.launch?


mcarilli commented Nov 5, 2018

apex.parallel.DistributedDataParallel and torch.nn.parallel.DistributedDataParallel have the same purpose. They are model wrappers that automatically take care of gradient allreduces during the backward pass. Their usage is almost identical. The Apex version offers some features that the torch version does not, but we plan to merge Apex features into upstream eventually, so for forward compatibility, you may as well just use the torch version.
- apex.parallel.DistributedDataParallel example
- torch.nn.parallel.DistributedDataParallel example (note the slightly different constructor arguments)
FP16_Optimizer happens to be used in these examples, but its presence is unrelated to the DistributedDataParallel wrappers. You can ignore it.
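
For concreteness, here is a minimal sketch (not taken from the linked examples) of how either wrapper is applied. It assumes the process group has already been initialized, e.g. by torch.distributed.launch as described below, and that `local_rank` identifies this process's GPU:

```python
import torch
import torch.nn as nn
import torch.distributed as dist

def wrap_model(local_rank: int) -> nn.Module:
    # One process per GPU; the process group must already be initialized.
    assert dist.is_initialized()
    model = nn.Linear(128, 10).cuda(local_rank)

    # torch wrapper: device_ids/output_device pin this replica to one GPU,
    # and gradients are allreduced across processes during backward().
    model = nn.parallel.DistributedDataParallel(
        model, device_ids=[local_rank], output_device=local_rank)

    # apex wrapper (constructor arguments differ slightly; it infers the
    # device from the model's parameters):
    #   from apex.parallel import DistributedDataParallel as ApexDDP
    #   model = ApexDDP(model)
    return model
```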

torch.distributed.launch is a wrapper script intended to spawn multiple processes, and supply them with the arguments and the environment necessary to set up distributed training within each process. torch.distributed.launch can be used with either apex.parallel.DistributedDataParallel or torch.nn.parallel.DistributedDataParallel.
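
And a minimal sketch of the per-process setup that torch.distributed.launch expects; the script name `train.py`, the process count, and the NCCL backend are illustrative choices, not from this thread:

```python
# Launched with one process per GPU on a single node, e.g.:
#   python -m torch.distributed.launch --nproc_per_node=4 train.py
import argparse

import torch
import torch.distributed as dist

parser = argparse.ArgumentParser()
# torch.distributed.launch passes --local_rank to each process it spawns.
parser.add_argument("--local_rank", type=int, default=0)
args = parser.parse_args()

# The launcher also sets MASTER_ADDR, MASTER_PORT, RANK, and WORLD_SIZE
# in the environment, which init_method="env://" reads.
torch.cuda.set_device(args.local_rank)
dist.init_process_group(backend="nccl", init_method="env://")

# ... build the model and wrap it with either DistributedDataParallel,
# as in the sketch above ...
```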

gbrow004 commented Jun 8, 2019

I understand that they both have the same purpose, but are there any potential/theoretical advantages to using the apex version over the torch one, aside from the extra options? Performance/speed?

@mcarilli

Right now, I'd recommend torch.nn.parallel.DistributedDataParallel for all practical purposes. It's pretty darn good (fast and robust).
