
Tune batch size using distributed training to catch edge case CUDA OOMs #2934

Merged · 36 commits from distributed-auto-batch into master on Apr 24, 2023

Conversation

tgaddair (Collaborator):

In some cases, CUDA OOMs occur only when the additional overhead of distributed training (from the GPU fusion buffer in Horovod) is introduced. This PR runs batch size evaluation in a full distributed context so that these edge cases are accounted for when selecting the best batch size.
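For illustration, here is a minimal sketch of the idea, assuming hypothetical `make_batch` and loss-returning `model(...)` helpers (the Horovod calls are real, but this is not Ludwig's actual code): the probe step runs after `hvd.init()` with the optimizer wrapped in `hvd.DistributedOptimizer`, so the gradient fusion buffer that normally appears only during real distributed training is already allocated when each candidate batch size is tried.

```python
import horovod.torch as hvd
import torch


def setup_distributed(model, optimizer):
    """Initialize Horovod and wrap the optimizer. The DistributedOptimizer's
    gradient fusion buffer is exactly the extra GPU memory the batch-size
    search needs to account for."""
    hvd.init()
    torch.cuda.set_device(hvd.local_rank())
    return hvd.DistributedOptimizer(
        optimizer, named_parameters=model.named_parameters()
    )


def probe_step(model, optimizer, make_batch, batch_size: int) -> None:
    """One synthetic forward/backward/step at `batch_size`. A CUDA OOM
    surfaces as a RuntimeError and signals that this size does not fit."""
    inputs, targets = make_batch(batch_size)  # hypothetical helper
    optimizer.zero_grad()
    loss = model(inputs, targets)             # hypothetical loss-returning forward
    loss.backward()
    optimizer.step()                          # triggers the fused allreduce
```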

github-actions bot commented on Jan 14, 2023:

Unit Test Results

6 files ±0 · 6 suites ±0 · 2h 48m 43s ⏱️ +1h 52m 2s
1,589 tests +1,556 · 1,564 ✔️ +1,535 · 24 💤 +20 · 1 ❌ +1
1,622 runs +1,523 · 1,593 ✔️ +1,506 · 28 💤 +16 · 1 ❌ +1

For more details on these failures, see this check.

Results for commit 3db937d. Comparison against base commit 9e3a98f.

♻️ This comment has been updated with latest results.

ludwig/backend/ray.py (outdated, resolved)
@@ -349,6 +360,34 @@ def write_step_summary(cls, train_summary_writer, combined_loss, all_losses, ste

train_summary_writer.flush()

def train_for_tuning(
tgaddair (Collaborator, Author) commented:

This function is no longer used. There's a new API called BatchSizeEvaluator that we use instead.
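The thread only names the API, so as a rough, hypothetical sketch of what such an evaluator might look like (illustrative names, not Ludwig's actual signatures): the base class owns the grow-until-OOM trial loop, and callers only implement what a single step at a candidate batch size means.

```python
import torch


class BatchSizeEvaluator:
    """Hypothetical skeleton of an evaluator-style batch size tuner."""

    def step(self, batch_size: int) -> None:
        """Run one forward/backward pass at `batch_size`; raise on OOM."""
        raise NotImplementedError

    def select_best_batch_size(self, max_batch_size: int) -> int:
        """Double the candidate size until a CUDA OOM occurs, then return
        the last size that succeeded."""
        batch_size = best = 1
        while batch_size <= max_batch_size:
            try:
                self.step(batch_size)
                best, batch_size = batch_size, batch_size * 2
            except RuntimeError as e:
                if "out of memory" not in str(e).lower():
                    raise
                torch.cuda.empty_cache()  # free the failed allocation
                break
        return best
```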

)
best_samples_per_sec = 0
best_batch_size = None
try:
tgaddair (Collaborator, Author) commented:
Please go back to using the BatchSizeEvaluator which abstracts away all this boilerplate.
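Concretely, the hand-rolled `best_samples_per_sec` / `try` loop in the excerpt above would collapse to a small subclass plus one call; the selection details (OOM backoff, any throughput tracking) live inside the evaluator. A sketch, using the hypothetical evaluator skeleton above and a hypothetical trainer method and config field:

```python
class TrainerBatchSizeEvaluator(BatchSizeEvaluator):
    def __init__(self, trainer):
        self.trainer = trainer

    def step(self, batch_size: int) -> None:
        # One synthetic training step at the candidate size, executed under
        # the same distributed setup as real training.
        self.trainer.train_step_for_tuning(batch_size)  # hypothetical method


best_batch_size = TrainerBatchSizeEvaluator(trainer).select_best_batch_size(
    max_batch_size=config.max_batch_size  # hypothetical config field
)
```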

@tgaddair (Collaborator, Author) left a comment:

LGTM

@tgaddair merged commit 5fbcca7 into master on Apr 24, 2023
7 of 10 checks passed
@tgaddair deleted the distributed-auto-batch branch on April 24, 2023 at 04:28
Labels: none · Projects: none · Linked issues: none · 4 participants