Fixes unit test that had empty Dask partitions after splitting #2313

geoffreyangus · 2022-07-26T01:54:15Z

This PR adjusts unit test parameters for tests/integration_tests/test_preprocessing.py::test_dask_known_divisions so that we do not have empty partitions after running the splitter. Prior to this change, this unit test would often fail with the following error:

E                       ray.exceptions.RayTaskError(AssertionError): ray::_get_read_tasks() (pid=83627, ip=127.0.0.1)
E                         File "/Users/geoffreyangus/repositories/predibase/ludwig/venv39_fresh/lib/python3.9/site-packages/ray/data/read_api.py", line 1136, in _get_read_tasks
E                           reader = ds.create_reader(**kwargs)
E                         File "/Users/geoffreyangus/repositories/predibase/ludwig/venv39_fresh/lib/python3.9/site-packages/ray/data/datasource/parquet_datasource.py", line 167, in create_reader
E                           return _ParquetDatasourceReader(**kwargs)
E                         File "/Users/geoffreyangus/repositories/predibase/ludwig/venv39_fresh/lib/python3.9/site-packages/ray/data/datasource/parquet_datasource.py", line 230, in __init__
E                           self._encoding_ratio = self._estimate_files_encoding_ratio()
E                         File "/Users/geoffreyangus/repositories/predibase/ludwig/venv39_fresh/lib/python3.9/site-packages/ray/data/datasource/parquet_datasource.py", line 318, in _estimate_files_encoding_ratio
E                           sample_ratios = ray.get(futures)
E                       ray.exceptions.RayTaskError(AssertionError): ray::_sample_piece() (pid=83653, ip=127.0.0.1)
E                         File "/Users/geoffreyangus/repositories/predibase/ludwig/venv39_fresh/lib/python3.9/site-packages/ray/data/datasource/parquet_datasource.py", line 437, in _sample_piece
E                           assert num_rows > 0 and metadata.num_rows > 0, (
E                       AssertionError: Sampled number of rows: 0 and total number of rows: 0 should be positive

venv39_fresh/lib/python3.9/site-packages/ray/_private/worker.py:2239: RayTaskError(AssertionError)
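
For intuition on why fewer, larger partitions make the failure much less likely, here is a rough back-of-the-envelope sketch. It assumes each row is assigned to a split independently at random and that the smallest split takes 10% of the rows; both are assumptions for illustration, not values taken from this PR. The helper names are hypothetical, and the row count is held at 100 in both cases because the original row count is not shown in the diff.

```python
def prob_partition_empty(rows_per_partition: int, split_fraction: float) -> float:
    """Probability that one partition receives zero rows of a given split.

    Assumes every row joins the split independently with probability
    `split_fraction` (an illustrative model, not Ludwig's exact splitter).
    """
    return (1.0 - split_fraction) ** rows_per_partition


def expected_empty_partitions(
    num_examples: int, npartitions: int, split_fraction: float
) -> float:
    """Expected number of partitions left empty for one split."""
    rows_per_partition = num_examples // npartitions
    return npartitions * prob_partition_empty(rows_per_partition, split_fraction)


# Assumed fraction of rows going to the smallest split (hypothetical value).
SMALLEST_SPLIT_FRACTION = 0.1

if __name__ == "__main__":
    for npartitions in (10, 2):
        expected = expected_empty_partitions(100, npartitions, SMALLEST_SPLIT_FRACTION)
        print(f"npartitions={npartitions}: expected empty partitions = {expected:.4f}")
```

Under these assumptions, 10 rows per partition leaves each partition a sizeable chance of contributing nothing to the small split, while 50 rows per partition makes that chance negligible.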

github-actions · 2022-07-26T02:45:09Z

Unit Test Results

      6 files ±0       6 suites ±0 2h 37m 7s ⏱️ + 15m 3s
2 941 tests ±0 2 892 ✔️ ±0   49 💤 ±0 0 ❌ ±0
8 823 runs ±0 8 640 ✔️ ±0 183 💤 ±0 0 ❌ ±0

Results for commit 3585bb2. ± Comparison against base commit 940141e.

♻️ This comment has been updated with latest results.

arnavgarg1 · 2022-07-26T16:13:44Z

tests/integration_tests/test_preprocessing.py

-    )
-    data_df = dd.from_pandas(pd.read_csv(data_csv), npartitions=10)
+
+    # num_examples=100 and npartitions=2 to ensure the test is not flaky, by having non-empty post-split datasets.


@geoffreyangus I've actually noticed this problem quite a few times while writing my own tests. Out of curiosity, do you think we should handle empty post-split datasets more gracefully throughout Ludwig instead of relying on adjusting samples/partitions to create non-empty datasets?

geoffreyangus · 2022-07-28

That's true, and you're right that we probably should. This issue only seemed to appear when using Ray nightly, though, so it might be worth waiting until Ray 2 is released and stable. I've filed an issue to track this bug: #2324.

fix empty partitions in tests/integration_tests/test_preprocessing.py::test_dask_known_divisions

6e3ccce

geoffreyangus requested review from jeffreyftang and magdyksaleh July 26, 2022 01:54

geoffreyangus added a commit that referenced this pull request Jul 26, 2022

wip: testing if changes #2310 and #2313 allow PR to pass CI

11dd744

magdyksaleh approved these changes Jul 26, 2022

jeffreyftang approved these changes Jul 26, 2022

arnavgarg1 approved these changes Jul 26, 2022

arnavgarg1 reviewed Jul 26, 2022

Merge branch 'master' into fix-test-preprocessing-known-divisions

3585bb2

geoffreyangus merged commit 2fcd9b3 into master Jul 28, 2022

geoffreyangus deleted the fix-test-preprocessing-known-divisions branch July 28, 2022 16:34
