Possible using torch DDP (DistributedDataParallel)? #2886
Hey @ericxsun, adding DDP is something we've been considering lately, but we haven't prioritized it since Horovod is currently supported and it's not clear what the added benefit of DDP would be for most users. Is there a reason you would prefer to use DDP instead of Horovod for distributed training in Ludwig?
Thanks @tgaddair. I'm just familiar with DDP, and the technical stack in our group is native PyTorch, so changing to Horovod may not be possible. Could you give me some pointers so I can do it in Ludwig with PyTorch DDP? Thank you very much.
Hey @ericxsun, are you also able to use Ray with Ludwig? If so, it shouldn't be a problem integrating the existing distributed training into your stack. The only issue would be making sure the environment has the right dependencies; however, we provide a Docker image (ludwig-ray-gpu) that has all the dependencies, including Horovod, pre-installed. If you're not able to use Ray, you can still do distributed training, but not distributed data preprocessing, which limits the scale of the data you can use for training to something that can fit entirely in memory.

To integrate DDP into Ludwig there are really two main touch points. The first is the Ray Trainer, which would need to be changed to support Ray's

Let me know if you'd like to discuss this further. I'd be happy to put together a quick prototype using DDP at some point soon if it would be helpful.
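For context, wrapping a model in DDP is itself a small change; the work described above is mostly in the trainer plumbing. Below is a minimal, hypothetical sketch of the DDP pattern (not Ludwig's actual trainer code): it uses the `gloo` backend in a single process so it runs on CPU, whereas a real launch would use `torchrun` (or Ray) to set the rank and world size across workers.

```python
import os
import torch
import torch.distributed as dist
from torch.nn.parallel import DistributedDataParallel as DDP

def ddp_train_step():
    # Single-process setup purely for illustration; a multi-worker launch
    # would get MASTER_ADDR/MASTER_PORT, rank, and world_size from the launcher.
    os.environ.setdefault("MASTER_ADDR", "127.0.0.1")
    os.environ.setdefault("MASTER_PORT", "29500")
    dist.init_process_group("gloo", rank=0, world_size=1)

    model = torch.nn.Linear(8, 1)   # stand-in for a Ludwig model
    ddp_model = DDP(model)          # gradients are all-reduced across ranks on backward()
    optimizer = torch.optim.SGD(ddp_model.parameters(), lr=0.1)

    x = torch.randn(16, 8)
    y = torch.randn(16, 1)
    loss = torch.nn.functional.mse_loss(ddp_model(x), y)
    loss.backward()
    optimizer.step()

    dist.destroy_process_group()
    return loss.item()
```

With more than one worker, each rank would also wrap its dataset in a `DistributedSampler` so every process sees a disjoint shard of the data.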
Thanks @ericxsun, let me know if it works for you. I haven't done any testing to verify correctness yet beyond small local tests, so let me know if you encounter any issues!
I made some updates earlier that fixed GPU support. After running benchmarks, the performance was almost identical to Horovod, with and without AMP, so it should be good to merge in the PR once it's reviewed.
Is your feature request related to a problem? Please describe.
How could we train a model with torch DDP?