Add boolean postprocessing to dataset type inference for automl #2193
We also need a test since presumably this is fixing a bug.
ludwig/automl/utils.py
Outdated
@@ -78,6 +78,23 @@ def avg_num_tokens(field: Series) -> int:
    return avg_words


def check_if_boolean(field: Series) -> bool:
    if len(field) > 5000:
        field = field.sample(n=5000, random_state=40)
I believe this function could be given a Dask Series, which does not support sample(n). Dask only supports sampling a fraction of the data. Typically we would just use field.head(5000) here, and avoid calling len(), which is also expensive in Dask.
More general question: do we need to compute this directly from the series when we already have a function called get_distinct_values that returns the top k distinct values? We should be able to derive this info from the distinct values instead of recomputing it.
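A minimal sketch of the head-based approach suggested above, shown with pandas only. The helper name bounded_prefix and the default limit are illustrative assumptions, not code from this PR. The same head(n) call also exists on a Dask Series, where by default it reads only the first partition and so may return fewer than n rows.

```python
import pandas as pd


def bounded_prefix(field, limit=5000):
    """Return at most `limit` leading rows of a series.

    Hypothetical helper: head() works without knowing the total length,
    so no len() call or sample(n=...) is needed.
    """
    return field.head(limit)


large = pd.Series(range(10_000))
small = pd.Series([True, False, None])

large_prefix = bounded_prefix(large)
small_prefix = bounded_prefix(small)
```

Unlike sample(n=5000, random_state=40), this takes the leading rows rather than a random subset, so it trades randomness for a cheap call that behaves the same across both backends.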
Makes sense. Changed it accordingly
Agreed with @tgaddair that it would be great to add a test for this new check, perhaps in a new file like test_data_source.py.
ludwig/automl/data_source.py
Outdated
@@ -69,6 +75,20 @@ def get_audio_values(self, column: str, sample_size: int = 10) -> int:
    def get_avg_num_tokens(self, column: str) -> int:
        return avg_num_tokens(self.df[column])

    def check_if_boolean(self, column: str) -> bool:
nit: rename to is_boolean
ludwig/automl/data_source.py
Outdated
if num_unique_values <= 3:
    for entry in unique_values:
        try:
            if np.isnan(entry):
                continue
        except TypeError:
            return False
        if isinstance(entry, bool):
            continue
        return False
    return True
- if num_unique_values <= 3:
-     for entry in unique_values:
-         try:
-             if np.isnan(entry):
-                 continue
-         except TypeError:
-             return False
-         if isinstance(entry, bool):
-             continue
-         return False
-     return True
+ if num_unique_values > 3:
+     return False
+ for entry in unique_values:
+     try:
+         if np.isnan(entry):
+             continue
+     except TypeError:
+         # When does this happen?
+         return False
+     if isinstance(entry, bool):
+         continue
+     # Encountered non-boolean type.
+     return False
+ return True
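On the "When does this happen?" question: np.isnan raises TypeError for inputs it cannot treat as numbers, such as strings or None. Below is a rough standalone sketch of the suggested early-return form; the function name is_boolean_like and its list argument are assumptions for illustration and do not match the PR's actual signature.

```python
import numpy as np


def is_boolean_like(unique_values):
    """Return True if the distinct values are only bools and NaN.

    Hypothetical standalone version of the suggested check.
    """
    # True, False, and a missing-value marker make at most three values.
    if len(unique_values) > 3:
        return False
    for entry in unique_values:
        try:
            if np.isnan(entry):
                continue  # Missing values are allowed.
        except TypeError:
            # np.isnan rejects non-numeric inputs such as str or None.
            return False
        if isinstance(entry, bool):
            continue
        # Numeric but not a bool, for example 1.0 or 0.
        return False
    return True


with_missing = is_boolean_like([True, False, float("nan")])
strings = is_boolean_like(["yes", "no"])
floats = is_boolean_like([1.0, 0.0])
with_none = is_boolean_like([True, False, None])
```

Note that under this logic a column whose missing values appear as None rather than NaN is rejected, because None takes the TypeError branch.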
ludwig/automl/base_config.py
Outdated
is_boolean = source.check_if_boolean(field)
if is_boolean:
nit: Consolidate into 1 line.
Added a test and reworked the change so that it is just a function in Ludwig that checks whether a field is boolean. That way we don't need to propagate the change through the platform and can instead leverage the Datasource API.
When a CSV file is read with pandas, a column of boolean values that also contains missing values is inferred as object dtype instead of bool dtype. This PR corrects that by postprocessing the inferred type.
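The behavior described here can be reproduced with a small in-memory CSV. This is only an illustration with made-up column names (flag, full), assuming pandas' default read_csv settings.

```python
import io

import pandas as pd

# "flag" has one empty cell; "full" has a value in every row.
csv_text = "flag,full\nTrue,True\nFalse,False\n,True\n"
df = pd.read_csv(io.StringIO(csv_text))

flag_dtype = df["flag"].dtype  # object: bools mixed with NaN
full_dtype = df["full"].dtype  # bool: no missing values
flag_missing = int(df["flag"].isna().sum())
```

Because the column with a missing value comes back as object, dtype-based inference alone cannot recognize it as boolean, which is why a check over the distinct values is needed.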