
Fixes NaN handling in boolean dtypes #2058

Merged (33 commits, Jun 13, 2022)
Conversation

geoffreyangus (Collaborator):
Addresses #2054. CSVs are now read without type inference to preserve NaN values. These NaNs are then filtered out by handle_missing_values. After that, each column is cast to its respective dtype based on its feature type.
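The approach described above can be sketched with pandas. This is a minimal illustration of the idea, not Ludwig's actual implementation; the column names and fill values are hypothetical:

```python
import pandas as pd
from io import StringIO

# Read without type inference: every column comes in as a string,
# so empty cells stay NaN instead of being coerced by dtype guessing.
csv = StringIO("flag,value\ntrue,1\n,2\nfalse,\n")
df = pd.read_csv(csv, dtype=str)

# Handle missing values per column (fill strategy is hypothetical).
df["flag"] = df["flag"].fillna("false")
df["value"] = df["value"].fillna("0")

# Finally, cast each column to its feature dtype.
df["flag"] = df["flag"].str.lower().map({"true": True, "false": False}).astype(bool)
df["value"] = df["value"].astype(int)
```

Reading with `dtype=str` is the key step: it keeps pandas from silently coercing a boolean column containing NaNs into a mixed or object dtype before missing values can be handled.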

github-actions bot commented May 25, 2022:

Unit Test Results

- 6 files (±0), 6 suites (±0), 2h 2m 32s (−16m 22s)
- 2,823 tests (+2): 2,789 passed (+3), 34 skipped (±0), 0 failed (−1)
- 8,469 runs (+6): 8,363 passed (+7), 106 skipped (±0), 0 failed (−1)

Results for commit 22d991d. ± Comparison against base commit 7a2bfd6.


@geoffreyangus geoffreyangus marked this pull request as ready for review May 25, 2022 22:33
logger.debug("handle missing values")
for feature_config in feature_configs:
    preprocessing_parameters = metadata[feature_config[NAME]][PREPROCESSING]
    handle_missing_values(dataset_cols, feature_config, preprocessing_parameters)
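For context on the loop above, here is a hypothetical stand-in for what a per-feature missing-value pass might look like. The function signature and fill values are illustrative assumptions, not Ludwig's real API:

```python
import pandas as pd

def handle_missing_values(dataset_cols, fill_values):
    # Hypothetical sketch: replace NaNs in each column with a
    # per-feature fill value (the real implementation reads the
    # strategy from each feature's preprocessing parameters).
    for name, series in dataset_cols.items():
        dataset_cols[name] = series.fillna(fill_values[name])

cols = {"flag": pd.Series(["true", None, "false"], dtype=object)}
handle_missing_values(cols, {"flag": "false"})
```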
Collaborator:
I just noticed that handle_missing_values is also called in build_metadata(), so we'll be handling missing values twice. Is that expected?

Collaborator:

I may be misremembering, but I believe @tgaddair originally introduced the double missing values call for a reason

geoffreyangus (Author):

Just pushed a refactor that calls handle_missing_values only once. I couldn't find a good reason for calling it twice, but super happy to revert this change @tgaddair if I missed something.

Collaborator:

That should be there to handle preprocess_for_prediction, right?

geoffreyangus (Author):

It seems like almost every DataFormatPreprocessor.preprocess_for_prediction implementation calls build_dataset, which is where handle_missing_values is called. The only exception is HDF5Preprocessor:

dataset = load_hdf5(dataset, features, split_data=False, shuffle_training=False)

My understanding, though, is that this is only called when dealing with cached data.

Collaborator:

The HDF5 file is supposed to be the output of internal processing; you can't throw an arbitrary unprocessed HDF5 file at it.

geoffreyangus added a commit that referenced this pull request Jun 3, 2022