
Tune batch size using distributed training to catch edge case CUDA OOMs #2934

Merged · 36 commits from distributed-auto-batch into master on Apr 24, 2023

Conversation

tgaddair (Collaborator):

In some cases, CUDA OOMs occur only when the additional overhead of distributed training (from the GPU fusion buffer in Horovod) is introduced. This PR runs batch size evaluation in a full distributed context so that these edge cases are accounted for when selecting the best batch size.
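For illustration, here is a minimal sketch of the idea, assuming hypothetical `make_batch` and loss-returning `model(...)` helpers (the Horovod calls are real, but this is not Ludwig's actual code): the probe step runs after `hvd.init()` with the optimizer wrapped in `hvd.DistributedOptimizer`, so the gradient fusion buffer that normally appears only during real distributed training is already allocated when each candidate batch size is tried.

```python
import horovod.torch as hvd
import torch


def setup_distributed(model, optimizer):
    """Initialize Horovod and wrap the optimizer. The DistributedOptimizer's
    gradient fusion buffer is exactly the extra GPU memory the batch-size
    search needs to account for."""
    hvd.init()
    torch.cuda.set_device(hvd.local_rank())
    return hvd.DistributedOptimizer(
        optimizer, named_parameters=model.named_parameters()
    )


def probe_step(model, optimizer, make_batch, batch_size: int) -> None:
    """One synthetic forward/backward/step at `batch_size`. A CUDA OOM
    surfaces as a RuntimeError and signals that this size does not fit."""
    inputs, targets = make_batch(batch_size)  # hypothetical helper
    optimizer.zero_grad()
    loss = model(inputs, targets)             # hypothetical loss-returning forward
    loss.backward()
    optimizer.step()                          # triggers the fused allreduce
```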

github-actions bot commented on Jan 14, 2023:

Unit Test Results

6 files ±0 · 6 suites ±0 · 2h 48m 43s ⏱️ +1h 52m 2s
1,589 tests +1,556 · 1,564 ✔️ +1,535 · 24 💤 +20 · 1 ❌ +1
1,622 runs +1,523 · 1,593 ✔️ +1,506 · 28 💤 +16 · 1 ❌ +1

For more details on these failures, see this check.

Results for commit 3db937d. Comparison against base commit 9e3a98f.

♻️ This comment has been updated with latest results.

ludwig/backend/ray.py (outdated, resolved)
@@ -349,6 +360,34 @@ def write_step_summary(cls, train_summary_writer, combined_loss, all_losses, ste

train_summary_writer.flush()

def train_for_tuning(
tgaddair (Collaborator, Author) commented:

This function is no longer used. There's a new API called BatchSizeEvaluator that we use instead.
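The thread only names the API, so as a rough, hypothetical sketch of what such an evaluator might look like (illustrative names, not Ludwig's actual signatures): the base class owns the grow-until-OOM trial loop, and callers only implement what a single step at a candidate batch size means.

```python
import torch


class BatchSizeEvaluator:
    """Hypothetical skeleton of an evaluator-style batch size tuner."""

    def step(self, batch_size: int) -> None:
        """Run one forward/backward pass at `batch_size`; raise on OOM."""
        raise NotImplementedError

    def select_best_batch_size(self, max_batch_size: int) -> int:
        """Double the candidate size until a CUDA OOM occurs, then return
        the last size that succeeded."""
        batch_size = best = 1
        while batch_size <= max_batch_size:
            try:
                self.step(batch_size)
                best, batch_size = batch_size, batch_size * 2
            except RuntimeError as e:
                if "out of memory" not in str(e).lower():
                    raise
                torch.cuda.empty_cache()  # free the failed allocation
                break
        return best
```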

)
best_samples_per_sec = 0
best_batch_size = None
try:
tgaddair (Collaborator, Author) commented:
Please go back to using the BatchSizeEvaluator which abstracts away all this boilerplate.
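Concretely, the hand-rolled `best_samples_per_sec` / `try` loop in the excerpt above would collapse to a small subclass plus one call; the selection details (OOM backoff, any throughput tracking) live inside the evaluator. A sketch, using the hypothetical evaluator skeleton above and a hypothetical trainer method and config field:

```python
class TrainerBatchSizeEvaluator(BatchSizeEvaluator):
    def __init__(self, trainer):
        self.trainer = trainer

    def step(self, batch_size: int) -> None:
        # One synthetic training step at the candidate size, executed under
        # the same distributed setup as real training.
        self.trainer.train_step_for_tuning(batch_size)  # hypothetical method


best_batch_size = TrainerBatchSizeEvaluator(trainer).select_best_batch_size(
    max_batch_size=config.max_batch_size  # hypothetical config field
)
```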

@tgaddair (Collaborator, Author) left a comment:

LGTM

@tgaddair merged commit 5fbcca7 into master on Apr 24, 2023
7 of 10 checks passed
@tgaddair deleted the distributed-auto-batch branch on April 24, 2023 at 04:28
Labels: none · Projects: none · Linked issues: none · 4 participants