Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Fixes unit test that had empty Dask partitions after splitting #2313

Merged
merged 2 commits into from
Jul 28, 2022

Conversation

geoffreyangus
Copy link
Collaborator

This PR adjusts unit test parameters for tests/integration_tests/test_preprocessing.py::test_dask_known_divisions so that we do not have empty partitions after running the splitter. Prior to this change, this unit test would often fail with the following error:

E                       ray.exceptions.RayTaskError(AssertionError): ray::_get_read_tasks() (pid=83627, ip=127.0.0.1)
E                         File "/Users/geoffreyangus/repositories/predibase/ludwig/venv39_fresh/lib/python3.9/site-packages/ray/data/read_api.py", line 1136, in _get_read_tasks
E                           reader = ds.create_reader(**kwargs)
E                         File "/Users/geoffreyangus/repositories/predibase/ludwig/venv39_fresh/lib/python3.9/site-packages/ray/data/datasource/parquet_datasource.py", line 167, in create_reader
E                           return _ParquetDatasourceReader(**kwargs)
E                         File "/Users/geoffreyangus/repositories/predibase/ludwig/venv39_fresh/lib/python3.9/site-packages/ray/data/datasource/parquet_datasource.py", line 230, in __init__
E                           self._encoding_ratio = self._estimate_files_encoding_ratio()
E                         File "/Users/geoffreyangus/repositories/predibase/ludwig/venv39_fresh/lib/python3.9/site-packages/ray/data/datasource/parquet_datasource.py", line 318, in _estimate_files_encoding_ratio
E                           sample_ratios = ray.get(futures)
E                       ray.exceptions.RayTaskError(AssertionError): ray::_sample_piece() (pid=83653, ip=127.0.0.1)
E                         File "/Users/geoffreyangus/repositories/predibase/ludwig/venv39_fresh/lib/python3.9/site-packages/ray/data/datasource/parquet_datasource.py", line 437, in _sample_piece
E                           assert num_rows > 0 and metadata.num_rows > 0, (
E                       AssertionError: Sampled number of rows: 0 and total number of rows: 0 should be positive

venv39_fresh/lib/python3.9/site-packages/ray/_private/worker.py:2239: RayTaskError(AssertionError)

@github-actions
Copy link

github-actions bot commented Jul 26, 2022

Unit Test Results

       6 files  ±0         6 suites  ±0   2h 37m 7s ⏱️ + 15m 3s
2 941 tests ±0  2 892 ✔️ ±0    49 💤 ±0  0 ±0 
8 823 runs  ±0  8 640 ✔️ ±0  183 💤 ±0  0 ±0 

Results for commit 3585bb2. ± Comparison against base commit 940141e.

♻️ This comment has been updated with latest results.

)
data_df = dd.from_pandas(pd.read_csv(data_csv), npartitions=10)

# num_examples=100 and npartitions=2 to ensure the test is not flaky, by having non-empty post-split datasets.
Copy link
Contributor

@arnavgarg1 arnavgarg1 Jul 26, 2022

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@geoffreyangus I've actually noticed this problem quite a few times while writing my own tests. Out of curiosity, do you think we should handle empty post-split datasets more gracefully throughout Ludwig instead of relying on adjusting samples/partitions to create non-empty datasets?

Copy link
Collaborator Author

@geoffreyangus geoffreyangus Jul 28, 2022

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

That's true– you're right that we probably should. This issue did seem to appear only when using Ray nightly though, so it might be worth waiting until Ray 2 is released and stable. I've filed an issue to track this bug here: #2324.

@geoffreyangus geoffreyangus merged commit 2fcd9b3 into master Jul 28, 2022
@geoffreyangus geoffreyangus deleted the fix-test-preprocessing-known-divisions branch July 28, 2022 16:34
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

None yet

4 participants