Tune batch size using distributed training to catch edge case CUDA OOMs #2934
Conversation
Unit Test Results: 6 files ±0, 6 suites ±0, 2h 48m 43s ⏱️ (+1h 52m 2s). For more details on these failures, see this check. Results for commit 3db937d; comparison against base commit 9e3a98f. ♻️ This comment has been updated with latest results.
for more information, see https://pre-commit.ci
…/ludwig into distributed-auto-batch
ludwig/trainers/trainer.py
@@ -349,6 +360,34 @@ def write_step_summary(cls, train_summary_writer, combined_loss, all_losses, step

        train_summary_writer.flush()

    def train_for_tuning(
This function is no longer used. There's a new API called BatchSizeEvaluator that we use instead.
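For context, a BatchSizeEvaluator-style search typically doubles the candidate batch size until an out-of-memory error occurs, then keeps the largest size that trained successfully. The sketch below is a simplified, hypothetical version of that idea; the class name, the `train_step` callback, and the `max_batch_size` cap are illustrative assumptions, not Ludwig's actual API:

```python
class SimpleBatchSizeEvaluator:
    """Hypothetical sketch: double the batch size until the trial
    training step raises an OOM-style error, then return the last
    batch size that worked."""

    def __init__(self, train_step, max_batch_size=1024):
        # train_step: callable(batch_size) that raises RuntimeError on OOM
        self.train_step = train_step
        self.max_batch_size = max_batch_size

    def select_best_batch_size(self, start=1):
        best = None
        batch_size = start
        while batch_size <= self.max_batch_size:
            try:
                self.train_step(batch_size)  # run one trial step at this size
                best = batch_size            # it fit; remember it
                batch_size *= 2              # try a larger size next
            except RuntimeError:             # treated as a CUDA OOM here
                break
        return best


# Simulated "GPU" that OOMs above 64 samples per batch.
def fake_train_step(batch_size):
    if batch_size > 64:
        raise RuntimeError("CUDA out of memory")

evaluator = SimpleBatchSizeEvaluator(fake_train_step)
print(evaluator.select_best_batch_size())  # → 64
```

The point of wrapping this loop in an evaluator class is exactly what the review asks for: the try/except boilerplate lives in one place instead of being re-implemented in each trainer.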
)
best_samples_per_sec = 0
best_batch_size = None
try:
Please go back to using the BatchSizeEvaluator, which abstracts away all this boilerplate.
LGTM
In some cases, CUDA OOMs occur only when the additional overhead of distributed training (from the GPU fusion buffer in Horovod) is introduced. This PR runs batch size evaluation in a full distributed context to ensure these edge cases are accounted for when selecting the best batch size.
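To see why tuning outside the distributed context can pick an unsafe batch size, consider a toy memory model in which distributed training adds a fixed buffer on top of the per-sample cost. The function, the memory figures, and the buffer size below are all hypothetical, chosen only to illustrate the effect:

```python
def max_batch_size(total_memory_mb, per_sample_mb, fixed_overhead_mb=0.0):
    """Toy memory model: the largest batch that fits after subtracting
    a fixed overhead (e.g. a gradient fusion buffer)."""
    usable = total_memory_mb - fixed_overhead_mb
    return max(int(usable // per_sample_mb), 0)

# With 16 GB free and 100 MB per sample, batch 160 fits in the
# single-process case.  Add a hypothetical 400 MB fusion buffer and
# only 156 fits -- so a batch size tuned without the distributed
# overhead would OOM once real distributed training starts.
print(max_batch_size(16000, 100))       # → 160
print(max_batch_size(16000, 100, 400))  # → 156
```

This is the gap the PR closes: by running the evaluation with the distributed machinery active, the overhead is part of the measurement rather than a surprise at train time.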