
Added DistributedStrategy interface with support for DDP #2890

Merged

35 commits merged on Jan 4, 2023

Conversation

tgaddair (Collaborator) commented Dec 29, 2022

Closes #2886.

Usage:

backend:
  trainer:
    strategy: ddp

This will allow us to use PyTorch 2.0 model compilation with distributed training:

https://pytorch.org/get-started/pytorch-2.0/
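For reference, here is a minimal plain-PyTorch sketch (independent of Ludwig) of the combination this enables, assuming torch >= 2.0 and a launch via torchrun, which sets RANK, WORLD_SIZE, and LOCAL_RANK:

```python
# Standalone sketch, not Ludwig code: compile a model and wrap it in DDP.
import os

import torch
import torch.distributed as dist
from torch.nn.parallel import DistributedDataParallel as DDP


def main():
    dist.init_process_group(backend="nccl")
    local_rank = int(os.environ["LOCAL_RANK"])
    torch.cuda.set_device(local_rank)

    model = torch.nn.Linear(16, 1).cuda(local_rank)
    model = DDP(model, device_ids=[local_rank])  # distributed data parallel wrapper
    model = torch.compile(model)  # PyTorch 2.0 model compilation

    # ... training loop goes here ...

    dist.destroy_process_group()


if __name__ == "__main__":
    main()
```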

Caveats:

  • Will hang if an exception is raised in only one of the training workers, due to a known issue in Ray Train. This should be addressed by the upgrade to Ray 2.1 / 2.2 in Support Distributed Training And Ray Tune with Ray 2.1 #2709.
  • Requires that all checkpoints be readable from every worker, as DDP does not support Horovod's broadcast state operations, which allow the files to exist only on the coordinator (see the sketch after this list). This should not be an issue for Ray users, who already work under the assumption that the coordinator could land on any node.
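To make the second caveat concrete, here is a rough sketch of the difference, not Ludwig's actual code: with plain DDP every worker reads the checkpoint file itself, whereas Horovod lets only the coordinator read it and then broadcasts the state. The function names and checkpoint keys below are made up for illustration.

```python
# Illustration only; key names ("model", "optimizer") and function names are hypothetical.
import torch


def load_checkpoint_every_rank(model, optimizer, path, device):
    # DDP case: every worker calls this itself, so `path` must be readable on every
    # node (e.g. a shared filesystem or object storage).
    state = torch.load(path, map_location=device)
    model.load_state_dict(state["model"])
    optimizer.load_state_dict(state["optimizer"])


def load_checkpoint_via_horovod(model, optimizer, path, device, rank):
    # Horovod case: only the coordinator needs the file; its state is then broadcast.
    import horovod.torch as hvd

    if rank == 0:
        state = torch.load(path, map_location=device)
        model.load_state_dict(state["model"])
        optimizer.load_state_dict(state["optimizer"])
    hvd.broadcast_parameters(model.state_dict(), root_rank=0)
    hvd.broadcast_optimizer_state(optimizer, root_rank=0)
```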

@github-actions bot commented Dec 29, 2022

Unit Test Results

6 files ±0    6 suites ±0    4h 15m 57s ⏱️ +39m 29s

         Total           Passed ✔️        Skipped 💤     Failed
Tests    3 568 (+2)      3 496 (+2)       72 (±0)        0 (±0)
Runs     10 704 (+6)     10 471 (+6)      233 (±0)       0 (±0)

Results for commit b99e580. ± Comparison against base commit b320bec.

♻️ This comment has been updated with latest results.

@justinxzhao (Collaborator) left a comment

LGTM. Strategy seems like a nice abstraction for adding support for PyTorch DDP.
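As context for readers without the full diff, here is a hypothetical sketch of what a strategy interface along these lines could look like. The method names below are assumptions for illustration, not the actual API in ludwig/distributed/base.py.

```python
# Hypothetical sketch; not the actual contents of ludwig/distributed/base.py.
from abc import ABC, abstractmethod

import torch


class DistributedStrategy(ABC):
    @abstractmethod
    def prepare(self, model: torch.nn.Module) -> torch.nn.Module:
        """Wrap the model for distributed training (e.g. DDP or Horovod)."""

    @abstractmethod
    def size(self) -> int:
        """Total number of training workers."""

    @abstractmethod
    def rank(self) -> int:
        """Rank of the current worker."""

    @abstractmethod
    def barrier(self) -> None:
        """Synchronize all workers."""


class DDPStrategy(DistributedStrategy):
    """Assumes the torch.distributed process group is already initialized."""

    def prepare(self, model):
        from torch.nn.parallel import DistributedDataParallel as DDP

        return DDP(model)

    def size(self):
        return torch.distributed.get_world_size()

    def rank(self):
        return torch.distributed.get_rank()

    def barrier(self):
        torch.distributed.barrier()
```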

ludwig/distributed/base.py (review thread, resolved)
@@ -37,26 +37,28 @@ class HorovodBackend(LocalPreprocessingMixin, Backend):

    def __init__(self, **kwargs):
        super().__init__(dataset_manager=PandasDatasetManager(self), **kwargs)
        self._horovod = None
        self._distributed = None
Collaborator

Should self._distributed be an attribute of the Backend class?

tgaddair (Collaborator, Author)

Only for Horovod. In the case of Ray, the backend is not running in the right context to have access to the distributed API.
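To unpack that a bit, here is a hypothetical illustration (not Ludwig's actual code): with Horovod the driver process is itself one of the training workers, so the backend object can hold a handle to the distributed API, while with Ray the DDP process group only exists inside the Ray Train workers, not in the driver where the backend lives.

```python
# Hypothetical illustration; class and attribute names mirror the discussion above,
# not Ludwig's implementation.


class HorovodBackend:
    def initialize(self):
        import horovod.torch as hvd

        hvd.init()               # the driver itself joins the Horovod job
        self._distributed = hvd  # so it can keep a handle to the distributed API


class RayBackend:
    def initialize(self):
        # The DDP process group is created later, inside each Ray Train worker,
        # so the backend running on the driver has no distributed context to store.
        self._distributed = None
```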

Linked issues

Successfully merging this pull request may close these issues:

Possible using torch DDP (DistributedDataParallel)?