Fixes unit test that had empty Dask partitions after splitting #2313
…::test_dask_known_divisions
)
data_df = dd.from_pandas(pd.read_csv(data_csv), npartitions=10)

# num_examples=100 and npartitions=2 to ensure the test is not flaky, by having non-empty post-split datasets.
@geoffreyangus I've actually noticed this problem quite a few times while writing my own tests. Out of curiosity, do you think we should handle empty post-split datasets more gracefully throughout Ludwig instead of relying on adjusting samples/partitions to create non-empty datasets?
That's true, and you're right that we probably should. This issue did seem to appear only when using Ray nightly though, so it might be worth waiting until Ray 2 is released and stable. I've filed an issue to track this bug here: #2324.
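For anyone following along, here is a rough sketch of why few rows spread over many partitions tends to leave empty post-split partitions. It is a plain-Python simulation, not Ludwig's splitter or Dask's actual chunking: the helper names, the even chunking, and the `(0.7, 0.1, 0.2)` split probabilities are all assumptions made for illustration.

```python
import random


def partition_sizes(n_rows, npartitions):
    """Row counts when n_rows are chunked as evenly as possible (assumed layout)."""
    base, extra = divmod(n_rows, npartitions)
    return [base + (1 if i < extra else 0) for i in range(npartitions)]


def count_empty_split_partitions(n_rows, npartitions, probs=(0.7, 0.1, 0.2), seed=0):
    """Randomly assign each row to a split, partition by partition, and count
    the (partition, split) pairs that end up with zero rows."""
    rng = random.Random(seed)
    empty = 0
    for size in partition_sizes(n_rows, npartitions):
        counts = [0] * len(probs)
        for _ in range(size):
            # Hypothetical random split: pick train/val/test by weight.
            split = rng.choices(range(len(probs)), weights=probs)[0]
            counts[split] += 1
        empty += sum(1 for c in counts if c == 0)
    return empty


# One row per partition: every partition is empty for two of the three splits.
few_rows_many_partitions = count_empty_split_partitions(10, 10)

# 50 rows per partition: empty splits are possible but much less likely.
many_rows_few_partitions = count_empty_split_partitions(100, 2)
```

With one row per partition, each partition can feed only one split, so the other two are empty no matter how the random draw goes. Larger partitions make an empty split unlikely rather than impossible, which is why handling empty datasets gracefully would still be the more robust fix.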
This PR adjusts unit test parameters for `tests/integration_tests/test_preprocessing.py::test_dask_known_divisions` so that we do not have empty partitions after running the splitter. Prior to this change, this unit test would often fail with the following error: