
Possible using torch DDP (DistributedDataParallel)? #2886

Closed · ericxsun opened this issue Dec 27, 2022 · 7 comments · Fixed by #2890
Labels: feature (New feature or request)

@ericxsun

Is your feature request related to a problem? Please describe.
How could we train a model with torch DDP?

@tgaddair (Collaborator)

Hey @ericxsun, adding DDP is something we've been considering lately, but haven't prioritized it since Horovod is currently supported and it's not clear what the added benefit of DDP would be for most users. Is there a reason you would prefer to use DDP instead of Horovod for distributed training in Ludwig?

@ericxsun (Author) commented Dec 28, 2022

Thanks @tgaddair

We're already familiar with DDP, and the technical stack in our group is native PyTorch, so changing to Horovod may not be possible.

Could you give me some pointers so I can do it in Ludwig with PyTorch DDP? Thank you very much.

@tgaddair (Collaborator)

Hey @ericxsun, are you also able to use Ray with Ludwig? If so, it shouldn't be a problem integrating the existing distributed training into your stack. The only issue would be making sure the environment has the right dependencies; however, we provide a Docker image (ludwig-ray-gpu) that has all the dependencies, including Horovod, pre-installed.

If you're not able to use Ray, you can still do distributed training, but not distributed data preprocessing, which limits the scale of the data you can use for training to something that can fit entirely in memory.

To integrate DDP into Ludwig, there are really two main touch points. The first is the Ray Trainer, which would need to be changed to support Ray's TorchConfig. The other would be in trainer.py, where all the calls to self.horovod would need to be changed to the DDP-equivalent API calls.
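
For anyone following along, here's a very rough sketch of that second touch point. This is not the actual Ludwig code, just the generic mapping from the Horovod-style calls to their torch.distributed / DDP equivalents (the helper function names below are illustrative only):

```python
import os

import torch
import torch.distributed as dist
from torch.nn.parallel import DistributedDataParallel as DDP


def setup_ddp(model: torch.nn.Module) -> torch.nn.Module:
    # hvd.init() -> initialize the default process group
    backend = "nccl" if torch.cuda.is_available() else "gloo"
    dist.init_process_group(backend=backend)

    # hvd.local_rank() -> LOCAL_RANK is set by the launcher (torchrun / Ray Train)
    local_rank = int(os.environ.get("LOCAL_RANK", 0))
    device_ids = None
    if torch.cuda.is_available():
        torch.cuda.set_device(local_rank)
        model = model.to(local_rank)
        device_ids = [local_rank]

    # hvd.broadcast_parameters + hvd.DistributedOptimizer -> wrapping the model in DDP
    # broadcasts the weights from rank 0 and all-reduces gradients during backward()
    return DDP(model, device_ids=device_ids)


def allreduce_mean(value: torch.Tensor) -> torch.Tensor:
    # hvd.allreduce(value) -> sum across workers, then divide by the world size
    dist.all_reduce(value, op=dist.ReduceOp.SUM)
    return value / dist.get_world_size()


# hvd.rank()    -> dist.get_rank()
# hvd.size()    -> dist.get_world_size()
# hvd.barrier() -> dist.barrier()
```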

Let me know if you'd like to discuss this further; I'd be happy to put together a quick prototype using DDP at some point soon if it would be helpful.

@tgaddair (Collaborator)

Hey @ericxsun, I've implemented an initial version of the DDP integration with Ray in #2890. Let me know if this works for your use case. I will do some benchmarking to see how it compares with Horovod in terms of performance before landing it.
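
Roughly, trying it from the Python API should look something like the sketch below; the exact config keys (in particular the strategy option under the Ray backend) may differ, so check #2890 for the final option names:

```python
# Hypothetical usage sketch: assumes the PR exposes DDP as a trainer "strategy"
# under the Ray backend; see #2890 / the Ludwig docs for the actual config keys.
from ludwig.api import LudwigModel

config = {
    "input_features": [{"name": "text", "type": "text"}],
    "output_features": [{"name": "label", "type": "category"}],
    "backend": {
        "type": "ray",
        "trainer": {
            "strategy": "ddp",  # assumed key; Horovod was the previous strategy
            "num_workers": 2,
            "use_gpu": True,
        },
    },
}

model = LudwigModel(config)
model.train(dataset="train.csv")
```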

@ericxsun (Author)

Hi @tgaddair, sorry for the late reply, and thank you so much. I'll try your implementation of the DDP integration with Ray in #2890. Thank you again.

@tgaddair (Collaborator)

Thanks @ericxsun, let me know if it works for you. I haven't done any testing to verify correctness beyond small local tests yet, so please report any issues you encounter!

tgaddair self-assigned this Dec 30, 2022
@tgaddair (Collaborator)

I made some updates earlier that fixed GPU support. After running benchmarks, performance was almost identical to Horovod, both with and without AMP, so the PR should be good to merge once it's reviewed.

tgaddair added the feature (New feature or request) label Dec 31, 2022