
Added DistributedStrategy interface with support for DDP #2890

Merged

35 commits merged on Jan 4, 2023

Conversation

tgaddair (Collaborator) commented Dec 29, 2022

Closes #2886.

Usage:

backend:
  trainer:
    strategy: ddp

This will allow us to use PyTorch 2.0 model compilation with distributed training:

https://pytorch.org/get-started/pytorch-2.0/
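For reference, here is a minimal plain-PyTorch sketch (independent of Ludwig) of the combination this enables, assuming torch >= 2.0 and a launch via torchrun, which sets RANK, WORLD_SIZE, and LOCAL_RANK:

```python
# Standalone sketch, not Ludwig code: compile a model and wrap it in DDP.
import os

import torch
import torch.distributed as dist
from torch.nn.parallel import DistributedDataParallel as DDP


def main():
    dist.init_process_group(backend="nccl")
    local_rank = int(os.environ["LOCAL_RANK"])
    torch.cuda.set_device(local_rank)

    model = torch.nn.Linear(16, 1).cuda(local_rank)
    model = DDP(model, device_ids=[local_rank])  # distributed data parallel wrapper
    model = torch.compile(model)  # PyTorch 2.0 model compilation

    # ... training loop goes here ...

    dist.destroy_process_group()


if __name__ == "__main__":
    main()
```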

Caveats:

  • Will hang if an exception is raised in only one of the training workers, due to a known issue in Ray Train. This should be addressed by the upgrade to Ray 2.1 / 2.2 in Support Distributed Training And Ray Tune with Ray 2.1 #2709.
  • Requires that all checkpoints be readable from every worker, as DDP does not support Horovod's broadcast state operations, which allow the files to exist only on the coordinator (see the sketch after this list). This should not be an issue for Ray users, who already work under the assumption that the coordinator could land on any node.
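To make the second caveat concrete, here is a rough sketch of the difference, not Ludwig's actual code: with plain DDP every worker reads the checkpoint file itself, whereas Horovod lets only the coordinator read it and then broadcasts the state. The function names and checkpoint keys below are made up for illustration.

```python
# Illustration only; key names ("model", "optimizer") and function names are hypothetical.
import torch


def load_checkpoint_every_rank(model, optimizer, path, device):
    # DDP case: every worker calls this itself, so `path` must be readable on every
    # node (e.g. a shared filesystem or object storage).
    state = torch.load(path, map_location=device)
    model.load_state_dict(state["model"])
    optimizer.load_state_dict(state["optimizer"])


def load_checkpoint_via_horovod(model, optimizer, path, device, rank):
    # Horovod case: only the coordinator needs the file; its state is then broadcast.
    import horovod.torch as hvd

    if rank == 0:
        state = torch.load(path, map_location=device)
        model.load_state_dict(state["model"])
        optimizer.load_state_dict(state["optimizer"])
    hvd.broadcast_parameters(model.state_dict(), root_rank=0)
    hvd.broadcast_optimizer_state(optimizer, root_rank=0)
```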

@github-actions bot commented Dec 29, 2022

Unit Test Results

6 files ±0    6 suites ±0    4h 15m 57s ⏱️ +39m 29s

         Total           Passed ✔️        Skipped 💤     Failed
Tests    3 568 (+2)      3 496 (+2)       72 (±0)        0 (±0)
Runs     10 704 (+6)     10 471 (+6)      233 (±0)       0 (±0)

Results for commit b99e580. ± Comparison against base commit b320bec.

♻️ This comment has been updated with latest results.

@justinxzhao (Collaborator) left a comment

LGTM. Strategy seems like a nice abstraction for adding support for PyTorch DDP.
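As context for readers without the full diff, here is a hypothetical sketch of what a strategy interface along these lines could look like. The method names below are assumptions for illustration, not the actual API in ludwig/distributed/base.py.

```python
# Hypothetical sketch; not the actual contents of ludwig/distributed/base.py.
from abc import ABC, abstractmethod

import torch


class DistributedStrategy(ABC):
    @abstractmethod
    def prepare(self, model: torch.nn.Module) -> torch.nn.Module:
        """Wrap the model for distributed training (e.g. DDP or Horovod)."""

    @abstractmethod
    def size(self) -> int:
        """Total number of training workers."""

    @abstractmethod
    def rank(self) -> int:
        """Rank of the current worker."""

    @abstractmethod
    def barrier(self) -> None:
        """Synchronize all workers."""


class DDPStrategy(DistributedStrategy):
    """Assumes the torch.distributed process group is already initialized."""

    def prepare(self, model):
        from torch.nn.parallel import DistributedDataParallel as DDP

        return DDP(model)

    def size(self):
        return torch.distributed.get_world_size()

    def rank(self):
        return torch.distributed.get_rank()

    def barrier(self):
        torch.distributed.barrier()
```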

ludwig/distributed/base.py (review thread, resolved)
@@ -37,26 +37,28 @@ class HorovodBackend(LocalPreprocessingMixin, Backend):

    def __init__(self, **kwargs):
        super().__init__(dataset_manager=PandasDatasetManager(self), **kwargs)
        self._horovod = None
        self._distributed = None
Collaborator

Should self._distributed be an attribute of the Backend class?

tgaddair (Collaborator, Author)

Only for Horovod. In the case of Ray, the backend is not running in the right context to have access to the distributed API.
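To unpack that a bit, here is a hypothetical illustration (not Ludwig's actual code): with Horovod the driver process is itself one of the training workers, so the backend object can hold a handle to the distributed API, while with Ray the DDP process group only exists inside the Ray Train workers, not in the driver where the backend lives.

```python
# Hypothetical illustration; class and attribute names mirror the discussion above,
# not Ludwig's implementation.


class HorovodBackend:
    def initialize(self):
        import horovod.torch as hvd

        hvd.init()               # the driver itself joins the Horovod job
        self._distributed = hvd  # so it can keep a handle to the distributed API


class RayBackend:
    def initialize(self):
        # The DDP process group is created later, inside each Ray Train worker,
        # so the backend running on the driver has no distributed context to store.
        self._distributed = None
```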

Linked issues

Successfully merging this pull request may close these issues:

Possible using torch DDP (DistributedDataParallel)?