
Fixes NaN handling in boolean dtypes #2058

Merged (33 commits, Jun 13, 2022)
Conversation

geoffreyangus (Collaborator):
Addresses #2054. CSVs are now read without type inference to preserve NaN values. These NaNs are then filtered out by handle_missing_values. After that, each column is cast to its respective dtype based on its feature type.
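The approach described above can be sketched with pandas. This is a minimal illustration of the idea, not Ludwig's actual implementation; the column names and fill values are hypothetical:

```python
import pandas as pd
from io import StringIO

# Read without type inference: every column comes in as a string,
# so empty cells stay NaN instead of being coerced by dtype guessing.
csv = StringIO("flag,value\ntrue,1\n,2\nfalse,\n")
df = pd.read_csv(csv, dtype=str)

# Handle missing values per column (fill strategy is hypothetical).
df["flag"] = df["flag"].fillna("false")
df["value"] = df["value"].fillna("0")

# Finally, cast each column to its feature dtype.
df["flag"] = df["flag"].str.lower().map({"true": True, "false": False}).astype(bool)
df["value"] = df["value"].astype(int)
```

Reading with `dtype=str` is the key step: it keeps pandas from silently coercing a boolean column containing NaNs into a mixed or object dtype before missing values can be handled.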

github-actions bot commented May 25, 2022:

Unit Test Results

- 6 files (±0), 6 suites (±0), 2h 2m 32s (−16m 22s)
- 2,823 tests (+2): 2,789 passed (+3), 34 skipped (±0), 0 failed (−1)
- 8,469 runs (+6): 8,363 passed (+7), 106 skipped (±0), 0 failed (−1)

Results for commit 22d991d. ± Comparison against base commit 7a2bfd6.


@geoffreyangus geoffreyangus marked this pull request as ready for review May 25, 2022 22:33
logger.debug("handle missing values")
for feature_config in feature_configs:
    preprocessing_parameters = metadata[feature_config[NAME]][PREPROCESSING]
    handle_missing_values(dataset_cols, feature_config, preprocessing_parameters)
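For context on the loop above, here is a hypothetical stand-in for what a per-feature missing-value pass might look like. The function signature and fill values are illustrative assumptions, not Ludwig's real API:

```python
import pandas as pd

def handle_missing_values(dataset_cols, fill_values):
    # Hypothetical sketch: replace NaNs in each column with a
    # per-feature fill value (the real implementation reads the
    # strategy from each feature's preprocessing parameters).
    for name, series in dataset_cols.items():
        dataset_cols[name] = series.fillna(fill_values[name])

cols = {"flag": pd.Series(["true", None, "false"], dtype=object)}
handle_missing_values(cols, {"flag": "false"})
```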
Collaborator:
I just noticed that handle_missing_values is also called in build_metadata(), so we'll be handling missing values twice. Is that expected?

Collaborator:

I may be misremembering, but I believe @tgaddair originally introduced the double missing values call for a reason

geoffreyangus (Author):

Just pushed a refactor that calls handle_missing_values only once. I couldn't find a good reason for calling it twice, but super happy to revert this change @tgaddair if I missed something.

Collaborator:

That should be there to handle preprocess_for_prediction, right?

geoffreyangus (Author):

It seems like almost every DataFormatPreprocessor.preprocess_for_prediction implementation calls build_dataset, which is where handle_missing_values is called. The only exception is HDF5Preprocessor:

dataset = load_hdf5(dataset, features, split_data=False, shuffle_training=False)

My understanding, though, is that this is only called when dealing with cached data.

Collaborator:

The HDF5 file is supposed to be the output of internal processing; you can't throw an arbitrary unprocessed HDF5 file at it.

geoffreyangus added a commit that referenced this pull request Jun 3, 2022