Adds regression tests for #2020 #2021

geoffreyangus · 2022-05-11T01:30:46Z

Adds regression tests for #2020, a PR that implemented a fix for NaNs introduced via an outer join concat in the dask df engine.

While writing these tests, I found that PandasEngine.df_like was also performing an outer join of proc_cols (implicitly, through the pd.DataFrame constructor) instead of an inner join. This likewise produced NaN values in columns whose preprocessing step drops rows (typically OutputFeature features). The fix is to concatenate the columns with an inner join. We would like to use an inner join concat in the dask df engine as well, but that is not currently possible because of the parallel nature of dask dataframes.
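To illustrate the difference described above, here is a minimal sketch in plain pandas. It is not the actual PandasEngine.df_like code; the column names and values are made up for illustration, and only the name proc_cols is taken from the description.

```python
import pandas as pd

# Hypothetical processed columns: the output column had row 1 dropped
# during preprocessing, so its index is missing that label.
proc_cols = {
    "text_in": pd.Series(["a", "b", "c"], index=[0, 1, 2]),
    "label_out": pd.Series([1.0, 0.0], index=[0, 2]),
}

# Building a DataFrame from a dict of Series aligns on the union of the
# indexes (outer-join behavior), so the dropped row comes back as NaN.
outer_df = pd.DataFrame(proc_cols)

# Concatenating with join="inner" keeps only the index labels that are
# present in every column, so the dropped row stays dropped.
inner_df = pd.concat(
    list(proc_cols.values()),
    axis=1,
    join="inner",
    keys=list(proc_cols.keys()),
)

print(outer_df)
print(inner_df)
```

In the outer-join frame, label_out contains a NaN at index 1; in the inner-join frame, index 1 is absent from every column.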

github-actions · 2022-05-11T01:58:13Z

Unit Test Results

      6 files ±0       6 suites ±0 1h 43m 28s ⏱️ + 11m 49s
2 774 tests ±0 2 741 ✔️ +1   33 💤 ±0 0 ❌ - 1
8 322 runs ±0 8 219 ✔️ +1 103 💤 ±0 0 ❌ - 1

Results for commit 6c98b34. ± Comparison against base commit 2c3fe2e.

♻️ This comment has been updated with latest results.

geoffreyangus · 2022-05-11T19:11:53Z

tests/integration_tests/utils.py

@@ -520,7 +520,18 @@ def run_api_experiment(input_features, output_features, data_csv):
        shutil.rmtree(output_dir, ignore_errors=True)


-def create_data_set_to_use(data_format, raw_data):
+def read_csv_with_nan(path, nan_percent=0.0):


This function converts nan_percent of the samples in each column of the CSV into NaN. This is important for tests that drop rows for exactly one feature: with this change, the number of rows that will be dropped is known in advance.

Example:
In a unit test, we simulate predicting the targets column, which is missing 10% of its samples. We choose to drop rows that are missing a value for targets. With this sampling scheme, we know we will have exactly 90% of the samples left.
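A rough sketch of how such a helper could behave is shown below. Only the name and the (path, nan_percent) parameters come from the diff above; the body is an assumption rather than the merged implementation, and the seed parameter is a hypothetical addition to make the example reproducible.

```python
import io
import random

import numpy as np
import pandas as pd


def read_csv_with_nan(path, nan_percent=0.0, seed=None):
    """Read a CSV and set a fixed fraction of the rows in each column to NaN.

    Sketch only: `seed` is a hypothetical parameter, not part of the diff.
    """
    df = pd.read_csv(path)
    rng = random.Random(seed)
    num_rows = len(df)
    num_nan = int(round(nan_percent * num_rows))
    for col in df.columns:
        # Sample an exact number of row positions without replacement,
        # so every column loses exactly num_nan values.
        nan_rows = rng.sample(range(num_rows), num_nan)
        mask = np.zeros(num_rows, dtype=bool)
        mask[nan_rows] = True
        df[col] = df[col].mask(mask)
    return df


# Usage: 20 rows with 10% NaNs per column gives exactly 2 NaNs per column.
csv_text = "feature,targets\n" + "\n".join(f"{i},{i % 2}" for i in range(20))
df = read_csv_with_nan(io.StringIO(csv_text), nan_percent=0.1, seed=0)
remaining = df.dropna(subset=["targets"])
print(len(remaining))
```

Because the rows are sampled without replacement per column, dropping rows on a single column removes a known number of rows, which is what lets a test assert on the resulting dataset size.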

geoffreyangus added 3 commits May 10, 2022 16:21

fixes nans in dask df engine

6445c97

adds tests

5b127c3

merge

75b60f1

geoffreyangus requested review from tgaddair, anmshkmr and hungcs May 11, 2022 01:30

geoffreyangus and others added 12 commits May 11, 2022 08:57

fixes with logs

228104f

fixes

e5026b0

[pre-commit.ci] auto fixes from pre-commit.com hooks

for more information, see https://pre-commit.ci

5a7d779

cleanup

783a71c

checking accuracy closeness

1396848

investigates ray batcher dropping samples with logs

c500cf5

clean up for PR review

92804d3

merge

5f7fa0f

[pre-commit.ci] auto fixes from pre-commit.com hooks

for more information, see https://pre-commit.ci

cbb6abe

cleanup

9df105b

Merge branch 'add-dask-nans-test' of https://github.com/ludwig-ai/ludwig into add-dask-nans-test

186e2e4

add missing test param

e7c5dad

tgaddair approved these changes May 11, 2022

updated sampling to nan_percent% of rows in each col

27d1993

geoffreyangus commented May 11, 2022

cleanup

6c98b34

geoffreyangus merged commit 9ae57a9 into master May 11, 2022

geoffreyangus deleted the add-dask-nans-test branch May 11, 2022 20:24

geoffreyangus mentioned this pull request May 12, 2022

Improve performance of DataFrameEngine.df_like #2029

Merged
