Fixes NaN handling in boolean dtypes #2058
logger.debug("handle missing values")
for feature_config in feature_configs:
    preprocessing_parameters = metadata[feature_config[NAME]][PREPROCESSING]
    handle_missing_values(dataset_cols, feature_config, preprocessing_parameters)
I just noticed that `handle_missing_values` is also called in `build_metadata()`, so we'll be making the call to handle missing values twice. Is that expected?
I may be misremembering, but I believe @tgaddair originally introduced the double missing values call for a reason
Just pushed up a refactor that makes the call to `handle_missing_values` once. I couldn't find a good reason for calling it twice, but super happy to revert this change @tgaddair if I missed something.
That should be to handle preprocess_for_prediction, right?
It seems like almost every `DataFormatPreprocessor.preprocess_for_prediction` implementation calls `build_dataset`, which is where `handle_missing_values` is called. The only exception is `HDF5Preprocessor`:
ludwig/ludwig/data/preprocessing.py
Line 979 in 3526d12
dataset = load_hdf5(dataset, features, split_data=False, shuffle_training=False)
My understanding though is that this is only called if dealing with cached data.
The HDF5 file is supposed to be the output of internal preprocessing; you cannot throw a random unprocessed HDF5 file at it.
Addresses #2054. CSVs are now read without type inference to preserve NaN values. These NaNs are then filled or dropped by `handle_missing_values`. After that, each column is cast to its respective dtype based on its feature type.
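The three steps in the description (read without type inference, handle missing values, cast per feature type) can be sketched with plain pandas. This is only an illustration, not Ludwig's implementation: the column names, the fill values, and the `FILL_VALUES` / `CASTERS` tables are hypothetical stand-ins for what `handle_missing_values` and the feature types would do.

```python
import io

import pandas as pd

# Hypothetical CSV with one missing boolean and one missing number.
CSV_TEXT = "flag,count\nTrue,1\n,2\nFalse,\n"

# Step 1: dtype=object turns off type inference, so every cell is either
# the raw string from the file or NaN for an empty cell.
raw = pd.read_csv(io.StringIO(CSV_TEXT), dtype=object)
missing_before = raw.isna().sum().to_dict()

# Step 2: resolve missing values while they are still visible as NaN.
# (Hypothetical fill values; a real pipeline would take them from the
# feature's preprocessing parameters.)
FILL_VALUES = {"flag": "False", "count": "0"}
filled = raw.fillna(FILL_VALUES)

# Step 3: cast each column according to its (assumed) feature type.
CASTERS = {
    "flag": lambda col: col.map({"True": True, "False": False}).astype(bool),
    "count": lambda col: col.astype(int),
}
typed = pd.DataFrame({name: cast(filled[name]) for name, cast in CASTERS.items()})

print(missing_before)
print(typed.dtypes.to_dict())
print(typed)
```

The ordering matters because casting before filling can silently lose the missing value: in Python, `bool(float("nan"))` is `True`, so a NaN that reaches a boolean cast would come out as a valid-looking `True`.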