
Spark: don't shuffle row groups if training data doesn't require shuffling #3369

Merged

merged 1 commit into horovod:master from no_shuffle_files on Jan 21, 2022

Conversation


@chongxiaoc (Collaborator) commented on Jan 14, 2022

In some cases (e.g., when the data is pre-partitioned or sorted), the trainer doesn't
need the data loader to shuffle rows. In that case, we should disable row-group
shuffling in Petastorm as well.
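
As a rough sketch of the idea (not this PR's actual diff), a training wrapper that knows its data is pre-partitioned or sorted could pass the request straight through to Petastorm, whose `make_batch_reader` exposes a `shuffle_row_groups` flag; the `make_training_reader` helper and its `shuffle` parameter below are hypothetical:

```python
# Hedged sketch: plumbing a "no shuffle" request down to Petastorm.
# `make_training_reader` and its `shuffle` parameter are made up for
# illustration; `shuffle_row_groups` is a keyword of make_batch_reader.
from petastorm import make_batch_reader

def make_training_reader(dataset_url, shuffle=True):
    # For pre-partitioned or sorted data the trainer passes shuffle=False,
    # which disables row-group shuffling inside Petastorm as well.
    return make_batch_reader(dataset_url, shuffle_row_groups=shuffle)
```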

Checklist before submitting

  • Did you read the contributor guide?
  • Did you update the docs?
  • Did you write any tests to validate this change?
  • Did you update the CHANGELOG, if this change affects users?

Description

Fixes # (issue).

Review process to land

  1. All tests and other checks must succeed.
  2. At least one member of the technical steering committee must review and approve.
  3. If any member of the technical steering committee requests changes, they must be addressed.

github-actions bot commented Jan 14, 2022

Unit Test Results

773 files (-29) · 773 suites (-29) · 9h 14m 43s ⏱️ (+33m 55s)
717 tests (±0): 668 passed (-4), 49 skipped (+4), 0 failed (±0)
16 646 runs (-678): 11 726 passed (-512), 4 920 skipped (-166), 0 failed (±0)

Results for commit 3712d5d. ± Comparison against base commit a5edcd0.

This pull request skips 4 tests.
test.parallel.test_mxnet2.MX2Tests ‑ test_gluon_trainer
test.parallel.test_mxnet2.MX2Tests ‑ test_gpu_required
test.parallel.test_mxnet2.MX2Tests ‑ test_horovod_allreduce_cpu_gpu_error
test.parallel.test_mxnet2.MX2Tests ‑ test_horovod_grouped_allreduce_cpu_gpu_error

♻️ This comment has been updated with latest results.

github-actions bot commented Jan 14, 2022

Unit Test Results (with flaky tests)

897 files (+7) · 897 suites (+7) · 10h 49m 40s ⏱️ (+1h 28m 46s)
717 tests (±0): 665 passed (-6), 49 skipped (+4), 3 failed (+2)
19 508 runs (+78): 13 646 passed (+69), 5 855 skipped (+3), 7 failed (+6)

For more details on these failures, see this check.

Results for commit 3712d5d. ± Comparison against base commit a5edcd0.

This pull request skips 4 tests.
test.parallel.test_mxnet2.MX2Tests ‑ test_gluon_trainer
test.parallel.test_mxnet2.MX2Tests ‑ test_gpu_required
test.parallel.test_mxnet2.MX2Tests ‑ test_horovod_allreduce_cpu_gpu_error
test.parallel.test_mxnet2.MX2Tests ‑ test_horovod_grouped_allreduce_cpu_gpu_error

♻️ This comment has been updated with latest results.

@chongxiaoc force-pushed the no_shuffle_files branch 3 times, most recently from 16088d4 to 5c00148, on January 14, 2022 at 21:07
@@ -312,7 +312,7 @@ def calculate_shuffle_buffer_size():
     """
     import horovod.torch as hvd

-    if user_shuffle_buffer_size:
+    if user_shuffle_buffer_size != -1:
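
For context on why the sentinel check matters (an illustrative snippet, not code from this PR): with the old truthiness test, an explicit user value of 0, i.e. "don't shuffle", was indistinguishable from an unset value:

```python
# Illustration of the bug the `!= -1` check avoids (not PR code).
user_shuffle_buffer_size = 0   # user explicitly disables shuffling

if user_shuffle_buffer_size:   # old check: 0 is falsy...
    size = user_shuffle_buffer_size
else:
    size = 128                 # ...so a computed default wins anyway
                               # (128 is a stand-in for the calculated value)

assert size == 128             # the user's explicit 0 was silently ignored
```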
Collaborator:
We should also return 0 if user_shuffle_buffer_size is zero or do zero-check in the caller.

Collaborator:

what if user_shuffle_buffer_size < 0 here?

Collaborator (Author):

I think adding the necessary zero-check in the caller is clearer; it also addresses the validity question for the < 0 case.
Will fix.

Collaborator (Author):

@Tixxx @EnricoMi fixed.
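
A hedged sketch of the resolution this thread converges on: treat -1 as "unset", reject other negatives, and let 0 mean "no shuffling". The function name and the injected default calculator are stand-ins, not the PR's exact code:

```python
# Hypothetical caller-side handling of user_shuffle_buffer_size:
# -1 -> unset, compute a default; 0 -> disable shuffling;
# other negatives -> invalid. calculate_default stands in for
# the estimator's calculate_shuffle_buffer_size().
def resolve_shuffle_buffer_size(user_shuffle_buffer_size, calculate_default):
    if user_shuffle_buffer_size == -1:
        return calculate_default()
    if user_shuffle_buffer_size < 0:
        raise ValueError("user_shuffle_buffer_size cannot be negative!")
    return user_shuffle_buffer_size  # 0 disables shuffling entirely

print(resolve_shuffle_buffer_size(-1, lambda: 128))  # 128 (computed default)
print(resolve_shuffle_buffer_size(0, lambda: 128))   # 0 (no shuffling)
```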

@chongxiaoc force-pushed the no_shuffle_files branch 2 times, most recently from ee1222e to 7810cc2, on January 15, 2022 at 00:54
horovod/spark/keras/remote.py: review thread resolved (outdated)
horovod/spark/keras/remote.py: review thread resolved
@chongxiaoc (Collaborator, Author):

@EnricoMi do you know if this GitHub Actions CI failure (Results / Build and Test GPU heads, on Buildkite) is a real failure or just a flaky build?

         shuffle_buffer_size = calculate_shuffle_buffer_size(
             hvd, avg_row_size, train_rows / hvd.size())
     else:
+        assert user_shuffle_buffer_size >= 0, "user_shuffle_buffer_size cannot be negative!"
Collaborator:

We should prefer raising a ValueError instead of asserting.

Collaborator (Author):

Will do
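
Sketch of the suggested change, assuming the motivation is that assert statements are stripped under python -O, so user-input validation should not rely on them:

```python
# Hedged sketch: raise ValueError instead of asserting, since asserts
# disappear when Python runs with -O / PYTHONOPTIMIZE.
def validate_shuffle_buffer_size(user_shuffle_buffer_size):
    if user_shuffle_buffer_size < 0:
        raise ValueError("user_shuffle_buffer_size cannot be negative!")
    return user_shuffle_buffer_size
```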

In some cases (e.g., when the data is pre-partitioned or sorted), the trainer doesn't
need the data loader to shuffle rows. In that case, we should disable row-group
shuffling in Petastorm as well.

Signed-off-by: Chongxiao Cao <chongxiaoc@uber.com>
@chongxiaoc merged commit 6fb3cf3 into horovod:master on Jan 21, 2022