
Fix automl to treat binary as categorical when missing values present #1292

Merged: tgaddair merged 1 commit into master from fix-binary-missing-val on Sep 7, 2021


@tgaddair tgaddair commented Sep 7, 2021

This is a temporary workaround to handle the following error:

...
  File "/home/ray/anaconda3/lib/python3.7/site-packages/ludwig/api.py", line 428, in train
    **kwargs,
  File "/home/ray/anaconda3/lib/python3.7/site-packages/ludwig/api.py", line 1317, in preprocess
    random_seed=random_seed
  File "/home/ray/anaconda3/lib/python3.7/site-packages/ludwig/data/preprocessing.py", line 1496, in preprocess_for_training
    random_seed=random_seed
  File "/home/ray/anaconda3/lib/python3.7/site-packages/ludwig/data/preprocessing.py", line 197, in preprocess_for_training
    random_seed=random_seed
  File "/home/ray/anaconda3/lib/python3.7/site-packages/ludwig/data/preprocessing.py", line 1677, in _preprocess_df_for_training
    backend=backend
  File "/home/ray/anaconda3/lib/python3.7/site-packages/ludwig/data/preprocessing.py", line 1142, in build_dataset
    backend
  File "/home/ray/anaconda3/lib/python3.7/site-packages/ludwig/data/preprocessing.py", line 1255, in build_metadata
    backend
  File "/home/ray/anaconda3/lib/python3.7/site-packages/ludwig/features/binary_feature.py", line 79, in get_feature_meta
    f"Binary feature column {column.name} expects 2 distinct values, "
ValueError: Binary feature column gender expects 2 distinct values, found: ['Female', 'Male', '0']

Essentially, the feature was treated as binary because it had 2 distinct non-missing values. The column also contained NaNs, though, and those were coerced into '0', so preprocessing found 3 distinct values.
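To make the failure mode concrete, here is a minimal standalone sketch. It does not call Ludwig; the helper names and the `fill_value="0"` default are assumptions chosen only to mirror the behavior described above.

```python
import math


def is_missing(value):
    """Return True for None or a float NaN."""
    return value is None or (isinstance(value, float) and math.isnan(value))


def distinct_non_missing(values):
    """Distinct values as type inference sees them (missing entries ignored)."""
    return sorted({str(v) for v in values if not is_missing(v)})


def distinct_after_fill(values, fill_value="0"):
    """Distinct values after missing entries are coerced to fill_value.

    The "0" default is an assumption that mirrors the error message above.
    """
    return sorted({fill_value if is_missing(v) else str(v) for v in values})


# Hypothetical column resembling the `gender` feature from the traceback.
gender = ["Female", "Male", float("nan"), "Male", None]

print(distinct_non_missing(gender))  # ['Female', 'Male']
print(distinct_after_fill(gender))   # ['0', 'Female', 'Male']
```

Counting distinct values before the fill gives 2, which looks binary; counting after the fill gives 3, which is what triggers the `ValueError`.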

Long-term, we should instead coerce NaNs for binary columns into the correct "false" type by default (or replace with mode). If no false type can be determined, we should treat the column as categorical in automl.
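One possible shape for that long-term behavior is sketched below. This is an illustration under stated assumptions, not the code in this PR or in Ludwig: `FALSE_STRINGS`, `resolve_two_valued_column`, and the `use_mode` flag are all hypothetical names.

```python
import math
from collections import Counter

# Hypothetical set of strings read as "false"; not Ludwig's actual list.
FALSE_STRINGS = {"0", "false", "f", "no", "n"}


def is_missing(value):
    """Return True for None or a float NaN."""
    return value is None or (isinstance(value, float) and math.isnan(value))


def resolve_two_valued_column(values, use_mode=False):
    """Decide how to treat a column with exactly 2 distinct non-missing values.

    Returns (feature_type, filled_values), where feature_type is
    "binary" or "category".
    """
    present = [v for v in values if not is_missing(v)]
    distinct = set(present)
    if len(distinct) != 2:
        raise ValueError("expected exactly 2 distinct non-missing values")

    # No missing values: the column is safely binary as-is.
    if len(present) == len(values):
        return "binary", list(values)

    # Prefer filling with the value that clearly reads as "false".
    false_candidates = [v for v in distinct if str(v).lower() in FALSE_STRINGS]
    if len(false_candidates) == 1:
        fill = false_candidates[0]
    elif use_mode:
        # Alternative mentioned above: replace missing values with the mode.
        fill = Counter(present).most_common(1)[0][0]
    else:
        # No false value can be determined: fall back to categorical.
        return "category", list(values)

    return "binary", [fill if is_missing(v) else v for v in values]
```

With this sketch, `["yes", "no", None, "yes"]` stays binary with the gap filled by `"no"`, while `["Female", "Male", None]` falls back to `"category"` unless `use_mode=True` is passed.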


@ShreyaR ShreyaR left a comment


lgtm!


ANarayan commented Sep 7, 2021

looks good to me as well!

@tgaddair tgaddair merged commit 6f3a720 into master Sep 7, 2021
@tgaddair tgaddair deleted the fix-binary-missing-val branch September 7, 2021 19:58

3 participants