
Spark: don't shuffle row groups if training data doesn't require shuffling #3369

Merged

merged 1 commit into horovod:master from no_shuffle_files on Jan 21, 2022

Conversation


@chongxiaoc (Collaborator) commented on Jan 14, 2022

In some cases (e.g., when the data is pre-partitioned or sorted), the trainer doesn't
need the data loader to shuffle rows. In that case, we should disable row-group
shuffling in Petastorm as well.
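
As a rough sketch of the idea (not this PR's actual diff), a training wrapper that knows its data is pre-partitioned or sorted could pass the request straight through to Petastorm, whose `make_batch_reader` exposes a `shuffle_row_groups` flag; the `make_training_reader` helper and its `shuffle` parameter below are hypothetical:

```python
# Hedged sketch: plumbing a "no shuffle" request down to Petastorm.
# `make_training_reader` and its `shuffle` parameter are made up for
# illustration; `shuffle_row_groups` is a keyword of make_batch_reader.
from petastorm import make_batch_reader

def make_training_reader(dataset_url, shuffle=True):
    # For pre-partitioned or sorted data the trainer passes shuffle=False,
    # which disables row-group shuffling inside Petastorm as well.
    return make_batch_reader(dataset_url, shuffle_row_groups=shuffle)
```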

Checklist before submitting

  • Did you read the contributor guide?
  • Did you update the docs?
  • Did you write any tests to validate this change?
  • Did you update the CHANGELOG, if this change affects users?

Description

Fixes # (issue).

Review process to land

  1. All tests and other checks must succeed.
  2. At least one member of the technical steering committee must review and approve.
  3. If any member of the technical steering committee requests changes, they must be addressed.

github-actions bot commented Jan 14, 2022

Unit Test Results

773 files (-29) · 773 suites (-29) · 9h 14m 43s ⏱️ (+33m 55s)
717 tests (±0): 668 passed (-4), 49 skipped (+4), 0 failed (±0)
16 646 runs (-678): 11 726 passed (-512), 4 920 skipped (-166), 0 failed (±0)

Results for commit 3712d5d. ± Comparison against base commit a5edcd0.

This pull request skips 4 tests.
test.parallel.test_mxnet2.MX2Tests ‑ test_gluon_trainer
test.parallel.test_mxnet2.MX2Tests ‑ test_gpu_required
test.parallel.test_mxnet2.MX2Tests ‑ test_horovod_allreduce_cpu_gpu_error
test.parallel.test_mxnet2.MX2Tests ‑ test_horovod_grouped_allreduce_cpu_gpu_error

♻️ This comment has been updated with latest results.

github-actions bot commented Jan 14, 2022

Unit Test Results (with flaky tests)

897 files (+7) · 897 suites (+7) · 10h 49m 40s ⏱️ (+1h 28m 46s)
717 tests (±0): 665 passed (-6), 49 skipped (+4), 3 failed (+2)
19 508 runs (+78): 13 646 passed (+69), 5 855 skipped (+3), 7 failed (+6)

For more details on these failures, see this check.

Results for commit 3712d5d. ± Comparison against base commit a5edcd0.

This pull request skips 4 tests.
test.parallel.test_mxnet2.MX2Tests ‑ test_gluon_trainer
test.parallel.test_mxnet2.MX2Tests ‑ test_gpu_required
test.parallel.test_mxnet2.MX2Tests ‑ test_horovod_allreduce_cpu_gpu_error
test.parallel.test_mxnet2.MX2Tests ‑ test_horovod_grouped_allreduce_cpu_gpu_error

♻️ This comment has been updated with latest results.

@chongxiaoc force-pushed the no_shuffle_files branch 3 times, most recently from 16088d4 to 5c00148, on January 14, 2022 at 21:07
@@ -312,7 +312,7 @@ def calculate_shuffle_buffer_size():
     """
     import horovod.torch as hvd

-    if user_shuffle_buffer_size:
+    if user_shuffle_buffer_size != -1:
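
For context on why the sentinel check matters (an illustrative snippet, not code from this PR): with the old truthiness test, an explicit user value of 0, i.e. "don't shuffle", was indistinguishable from an unset value:

```python
# Illustration of the bug the `!= -1` check avoids (not PR code).
user_shuffle_buffer_size = 0   # user explicitly disables shuffling

if user_shuffle_buffer_size:   # old check: 0 is falsy...
    size = user_shuffle_buffer_size
else:
    size = 128                 # ...so a computed default wins anyway
                               # (128 is a stand-in for the calculated value)

assert size == 128             # the user's explicit 0 was silently ignored
```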
Collaborator:
We should also return 0 if user_shuffle_buffer_size is zero or do zero-check in the caller.

Collaborator:

what if user_shuffle_buffer_size < 0 here?

Collaborator (Author):

I think adding the necessary zero-check in the caller is clearer; it also addresses the validity question for the < 0 case.
Will fix.

Collaborator (Author):

@Tixxx @EnricoMi fixed.
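
A hedged sketch of the resolution this thread converges on: treat -1 as "unset", reject other negatives, and let 0 mean "no shuffling". The function name and the injected default calculator are stand-ins, not the PR's exact code:

```python
# Hypothetical caller-side handling of user_shuffle_buffer_size:
# -1 -> unset, compute a default; 0 -> disable shuffling;
# other negatives -> invalid. calculate_default stands in for
# the estimator's calculate_shuffle_buffer_size().
def resolve_shuffle_buffer_size(user_shuffle_buffer_size, calculate_default):
    if user_shuffle_buffer_size == -1:
        return calculate_default()
    if user_shuffle_buffer_size < 0:
        raise ValueError("user_shuffle_buffer_size cannot be negative!")
    return user_shuffle_buffer_size  # 0 disables shuffling entirely

print(resolve_shuffle_buffer_size(-1, lambda: 128))  # 128 (computed default)
print(resolve_shuffle_buffer_size(0, lambda: 128))   # 0 (no shuffling)
```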

@chongxiaoc force-pushed the no_shuffle_files branch 2 times, most recently from ee1222e to 7810cc2, on January 15, 2022 at 00:54
horovod/spark/keras/remote.py: review thread resolved (outdated)
horovod/spark/keras/remote.py: review thread resolved
@chongxiaoc (Collaborator, Author):

@EnricoMi do you know if this GitHub Actions CI failure (Results / Build and Test GPU heads, on Buildkite) is a real failure or just a flaky build?

         shuffle_buffer_size = calculate_shuffle_buffer_size(
             hvd, avg_row_size, train_rows / hvd.size())
     else:
+        assert user_shuffle_buffer_size >= 0, "user_shuffle_buffer_size cannot be negative!"
Collaborator:

We should prefer raising a ValueError instead of asserting.

Collaborator (Author):

Will do
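
Sketch of the suggested change, assuming the motivation is that assert statements are stripped under python -O, so user-input validation should not rely on them:

```python
# Hedged sketch: raise ValueError instead of asserting, since asserts
# disappear when Python runs with -O / PYTHONOPTIMIZE.
def validate_shuffle_buffer_size(user_shuffle_buffer_size):
    if user_shuffle_buffer_size < 0:
        raise ValueError("user_shuffle_buffer_size cannot be negative!")
    return user_shuffle_buffer_size
```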

In some cases (e.g., when the data is pre-partitioned or sorted), the trainer doesn't
need the data loader to shuffle rows. In that case, we should disable row-group
shuffling in Petastorm as well.

Signed-off-by: Chongxiao Cao <chongxiaoc@uber.com>
@chongxiaoc merged commit 6fb3cf3 into horovod:master on Jan 21, 2022