Fixes FILL_WITH_MEAN missing value strategy with appropriate cast #2141

Merged: 2 commits, Jun 14, 2022

geoffreyangus
Collaborator

This PR is a follow-up to #2058, which removed the assumption that CSV columns would be loaded with dtypes inferred by Dask/Pandas. Instead, all columns are now initially loaded with dtype object, with the expectation that they are cast to their appropriate dtype downstream. Number feature columns are therefore no longer read in with dtype float, which leads to the following error when applying the FILL_WITH_MEAN missing value strategy:

ludwig/api.py:1276: in preprocess
    preprocessed_data = preprocess_for_training(
ludwig/data/preprocessing.py:1580: in preprocess_for_training
    processed = data_format_processor.preprocess_for_training(
ludwig/data/preprocessing.py:275: in preprocess_for_training
    return _preprocess_file_for_training(
ludwig/data/preprocessing.py:1661: in _preprocess_file_for_training
    data, training_set_metadata = build_dataset(
ludwig/data/preprocessing.py:1099: in build_dataset
    feature_name_to_preprocessing_parameters = build_preprocessing_parameters(
ludwig/data/preprocessing.py:1232: in build_preprocessing_parameters
    fill_value = precompute_fill_value(dataset_cols, feature_config, preprocessing_parameters, backend)
ludwig/data/preprocessing.py:1368: in precompute_fill_value
    return backend.df_engine.compute(dataset_cols[feature[COLUMN]].mean())
venv38/lib/python3.8/site-packages/dask/dataframe/core.py:94: in wrapper
    return func(self, *args, **kwargs)
venv38/lib/python3.8/site-packages/dask/dataframe/core.py:1986: in mean
    _raise_if_object_series(self, "mean")
_ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _

x = Dask Series Structure:
npartitions=1
    object
       ...
Name: num_E23AE, dtype: object
Dask Name: getitem, 2 tasks, funcname = 'mean'

    def _raise_if_object_series(x, funcname):
        """
        Utility function to raise an error if an object column does not support
        a certain operation like `mean`.
        """
        if isinstance(x, Series) and hasattr(x, "dtype") and x.dtype == object:
>           raise ValueError("`%s` not supported with object series" % funcname)
E           ValueError: `mean` not supported with object series

venv38/lib/python3.8/site-packages/dask/dataframe/core.py:3161: ValueError

This PR fixes the error by casting the column to the appropriate dtype right before calling mean(). It also adds a regression test.
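
For illustration, here is a minimal sketch of the cast-then-mean idea in plain pandas. The helper names `mean_fill_value` and `fill_with_mean` and the sample column are hypothetical, and the actual change in ludwig/data/preprocessing.py goes through backend.df_engine and may cast differently:

```python
import pandas as pd


def mean_fill_value(column: pd.Series) -> float:
    """Hypothetical helper: cast an object-dtype column to float, then take its mean."""
    # Columns arrive with dtype object, so convert to a numeric dtype first.
    # errors="coerce" turns unparseable entries into NaN (an assumption of this sketch).
    numeric = pd.to_numeric(column, errors="coerce").astype(float)
    # Series.mean() skips NaN by default, so missing rows do not affect the result.
    return float(numeric.mean())


def fill_with_mean(column: pd.Series) -> pd.Series:
    """Hypothetical helper: replace missing values with the column mean."""
    numeric = pd.to_numeric(column, errors="coerce").astype(float)
    return numeric.fillna(numeric.mean())


# A number feature as it might look after being loaded with dtype object.
raw = pd.Series(["1.0", "2.0", None, "3.0"], dtype=object)
fill_value = mean_fill_value(raw)
filled = fill_with_mean(raw)
```

The point is only the ordering: the cast happens before the aggregation, so mean() never sees an object-dtype series.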

@github-actions

Unit Test Results

4 files ±0, 4 suites ±0, duration 1h 20m 27s (-40s)
2 825 tests +2: 2 791 passed +2, 34 skipped ±0, 0 failed ±0
5 650 runs +4: 5 579 passed +4, 71 skipped ±0, 0 failed ±0

Results for commit 799fdc5. ± Comparison against base commit 520af82.

@tgaddair added the bug label Jun 14, 2022
Collaborator

@justinxzhao left a comment

LGTM. Nice parametrization of the backend.

@tgaddair merged commit dcd07d8 into master Jun 14, 2022
@tgaddair deleted the fix-fill-with-mean branch June 14, 2022 23:36