Possible using torch DDP (DistributedDataParallel)? #2886
Hey @ericxsun, adding DDP is something we've been considering lately, but we haven't prioritized it since Horovod is currently supported and it's not clear what the added benefit of DDP would be for most users. Is there a reason you would prefer to use DDP instead of Horovod for distributed training in Ludwig?
Thanks @tgaddair. I'm just familiar with DDP, and the technical stack in our group is native PyTorch, so changing to Horovod may not be possible. Could you give me some pointers so I can do it in Ludwig with PyTorch DDP? Thank you very much.
Hey @ericxsun, are you also able to use Ray with Ludwig? If so, it shouldn't be a problem integrating the existing distributed training into your stack. The only issue would be making sure the environment has the right dependencies; however, we provide a Docker image (ludwig-ray-gpu) that has all the dependencies, including Horovod, pre-installed. If you're not able to use Ray, you can still do distributed training, but not distributed data preprocessing, which limits the scale of the data you can use for training to something that can fit entirely in memory.

To integrate DDP into Ludwig there are really two main touch points. The first is the Ray Trainer, which would need to be changed to support Ray's

Let me know if you'd like to discuss this further. I'd be happy to put together a quick prototype using DDP at some point soon if it would be helpful.
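For context, wrapping a model in DDP is itself a small change; the work described above is mostly in the trainer plumbing. Below is a minimal, hypothetical sketch of the DDP pattern (not Ludwig's actual trainer code): it uses the `gloo` backend in a single process so it runs on CPU, whereas a real launch would use `torchrun` (or Ray) to set the rank and world size across workers.

```python
import os
import torch
import torch.distributed as dist
from torch.nn.parallel import DistributedDataParallel as DDP

def ddp_train_step():
    # Single-process setup purely for illustration; a multi-worker launch
    # would get MASTER_ADDR/MASTER_PORT, rank, and world_size from the launcher.
    os.environ.setdefault("MASTER_ADDR", "127.0.0.1")
    os.environ.setdefault("MASTER_PORT", "29500")
    dist.init_process_group("gloo", rank=0, world_size=1)

    model = torch.nn.Linear(8, 1)   # stand-in for a Ludwig model
    ddp_model = DDP(model)          # gradients are all-reduced across ranks on backward()
    optimizer = torch.optim.SGD(ddp_model.parameters(), lr=0.1)

    x = torch.randn(16, 8)
    y = torch.randn(16, 1)
    loss = torch.nn.functional.mse_loss(ddp_model(x), y)
    loss.backward()
    optimizer.step()

    dist.destroy_process_group()
    return loss.item()
```

With more than one worker, each rank would also wrap its dataset in a `DistributedSampler` so every process sees a disjoint shard of the data.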
Thanks @ericxsun, let me know if it works for you. I haven't done any testing to verify correctness yet beyond small local tests, so let me know if you encounter any issues!
I made some updates earlier that fixed GPU support. After running benchmarks, the performance was almost identical to Horovod, with and without AMP, so it should be good to merge in the PR once it's reviewed.
Is your feature request related to a problem? Please describe.
How could we train a model with torch DDP?