Spark: don't shuffle row groups if training data requires non-shuffle #3369
Conversation
Unit Test Results
773 files (-29), 773 suites (-29), 9h 14m 43s ⏱️ (+33m 55s)
Results for commit 3712d5d. ± Comparison against base commit a5edcd0.
This pull request skips 4 tests.
♻️ This comment has been updated with latest results.
Unit Test Results (with flaky tests)
897 files (+7), 897 suites (+7), 10h 49m 40s ⏱️ (+1h 28m 46s)
For more details on these failures, see this check.
Results for commit 3712d5d. ± Comparison against base commit a5edcd0.
This pull request skips 4 tests.
♻️ This comment has been updated with latest results.
horovod/spark/lightning/remote.py
Outdated
@@ -312,7 +312,7 @@ def calculate_shuffle_buffer_size():
         """
         import horovod.torch as hvd

-        if user_shuffle_buffer_size:
+        if user_shuffle_buffer_size != -1:
We should also return 0 if user_shuffle_buffer_size is zero, or do the zero-check in the caller.
What if user_shuffle_buffer_size < 0 here?
I think adding the necessary zero-check in the caller is clearer; it also addresses the invalid-value issue for the < 0 case.
Will fix.
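The caller-side check agreed on above could look like the following sketch. The names `user_shuffle_buffer_size` and the calculation fallback follow the diff, but `calculate_shuffle_buffer_size_fallback` and its formula are placeholders, not Horovod's actual memory-based estimate.

```python
def calculate_shuffle_buffer_size_fallback(avg_row_size, train_row_count):
    # Placeholder for Horovod's calculated shuffle buffer size; the real
    # implementation estimates this from available memory.
    return int(avg_row_size * train_row_count)


def resolve_shuffle_buffer_size(user_shuffle_buffer_size, avg_row_size, train_row_count):
    if user_shuffle_buffer_size is None:
        # No user override: fall back to the calculated size.
        return calculate_shuffle_buffer_size_fallback(avg_row_size, train_row_count)
    if user_shuffle_buffer_size < 0:
        # Reject invalid values explicitly instead of letting them through.
        raise ValueError("user_shuffle_buffer_size cannot be negative!")
    # 0 explicitly disables shuffling; any positive value is used as-is.
    return user_shuffle_buffer_size
```

This keeps all validation in one place in the caller, so the calculation helper never sees an invalid user value.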
@EnricoMi do you know if this Github Action
horovod/spark/keras/remote.py
Outdated
    shuffle_buffer_size = calculate_shuffle_buffer_size(
        hvd, avg_row_size, train_rows / hvd.size())
else:
    assert user_shuffle_buffer_size >= 0, "user_shuffle_buffer_size cannot be negative!"
We should prefer raising a ValueError instead of asserting.
Will do
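A minimal sketch of the suggested change, replacing the assert in the hunk above with an explicit exception (the helper name `validate_shuffle_buffer_size` is hypothetical). Unlike `assert`, which is stripped when Python runs with `-O`, a `ValueError` always fires and gives the user an actionable message.

```python
def validate_shuffle_buffer_size(user_shuffle_buffer_size):
    # Raise instead of assert so the check survives `python -O`
    # and surfaces as a normal, catchable exception.
    if user_shuffle_buffer_size < 0:
        raise ValueError(
            "user_shuffle_buffer_size cannot be negative, got "
            f"{user_shuffle_buffer_size}")
    return user_shuffle_buffer_size
```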
In some cases (using pre-partitioned or sorted data), the trainer doesn't need the data loader to shuffle rows. In that case, we should disable row-group shuffling in Petastorm as well.
Signed-off-by: Chongxiao Cao <chongxiaoc@uber.com>
In some cases (using pre-partitioned or sorted data), the trainer doesn't need the data loader to shuffle rows. In that case, we should disable row-group shuffling in Petastorm as well.
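The behavior described above can be sketched as a small helper that translates the resolved shuffle buffer size into Petastorm reader arguments. `shuffle_row_groups` mirrors the Petastorm reader flag of the same name; the helper itself and its packaging as a kwargs dict are illustrative, not Horovod's actual code.

```python
def petastorm_reader_kwargs(shuffle_buffer_size):
    # A buffer size of 0 means the trainer does not want shuffled rows
    # (e.g. pre-partitioned or sorted data), so Petastorm's row-group
    # shuffling should be disabled too; otherwise keep it on.
    return {
        'shuffle_row_groups': shuffle_buffer_size > 0,
    }
```

The resulting dict would be passed through to the Petastorm reader so that row-level and row-group-level shuffling stay consistent with each other.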
Checklist before submitting
Description
Fixes # (issue).
Review process to land