Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Add boolean postprocessing to dataset type inference for automl #2193

Merged
merged 15 commits into from
Jul 6, 2022

Conversation

magdyksaleh
Copy link
Contributor

Code Pull Requests

When a csv file is read in via pandas, if it contains a column of boolean values with some missing values, it is misinterpreted as an object dtype instead of a bool dtype. This corrects that.

@magdyksaleh magdyksaleh requested review from tgaddair and removed request for justinxzhao June 25, 2022 03:41
@github-actions
Copy link

github-actions bot commented Jun 25, 2022

Unit Test Results

       6 files  ±    0         6 suites  ±0   2h 28m 6s ⏱️ + 10m 35s
2 909 tests +  54  2 863 ✔️ +  42    46 💤 +12  0 ±0 
8 727 runs  +162  8 585 ✔️ +126  142 💤 +36  0 ±0 

Results for commit 3bc010e. ± Comparison against base commit edc7c26.

♻️ This comment has been updated with latest results.

Copy link
Collaborator

@tgaddair tgaddair left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We also need a test since presumably this is fixing a bug.

@@ -78,6 +78,23 @@ def avg_num_tokens(field: Series) -> int:
return avg_words


def check_if_boolean(field: Series) -> bool:
if len(field) > 5000:
field = field.sample(n=5000, random_state=40)
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I believe this function could be given a Dask Series, which does not support sample(n). Dask only supports sampling a fraction of the data. Typically we would just use field.head(5000) here, and avoid calling len, which is also expensive in Dask.

More general question: do we need to compute this directly from the series when we already have a function called get_distinct_values that returns the top k distinct values? We should be able to derive this info from the distinct values instead of recomputing it.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Makes sense. Changed it accordingly

Copy link
Collaborator

@justinxzhao justinxzhao left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Agreed with @tgaddair that it would be great to add a test for this new check, perhaps in a new file like test_data_source.py.

@@ -69,6 +75,20 @@ def get_audio_values(self, column: str, sample_size: int = 10) -> int:
def get_avg_num_tokens(self, column: str) -> int:
return avg_num_tokens(self.df[column])

def check_if_boolean(self, column: str) -> bool:
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

nit: rename is_boolean

Comment on lines 80 to 90
if num_unique_values <= 3:
for entry in unique_values:
try:
if np.isnan(entry):
continue
except TypeError:
return False
if isinstance(entry, bool):
continue
return False
return True
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggested change
if num_unique_values <= 3:
for entry in unique_values:
try:
if np.isnan(entry):
continue
except TypeError:
return False
if isinstance(entry, bool):
continue
return False
return True
if num_unique_values > 3:
return False
for entry in unique_values:
try:
if np.isnan(entry):
continue
except TypeError:
# When does this happen?
return False
if isinstance(entry, bool):
continue
# Encountered non-boolean type.
return False
return True

Comment on lines 208 to 209
is_boolean = source.check_if_boolean(field)
if is_boolean:
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

nit: Consolidate into 1 line.

magdyksaleh and others added 6 commits June 30, 2022 10:18
branch 'b-add-csv-postprocessing' of github.com:ludwig-ai/ludwig into b-add-csv-postprocessing
@magdyksaleh
Copy link
Contributor Author

Added a test and made some changes so that it is just a function in ludwig that checks if the field is boolean. That way we don't need to propagate the change through the platform as well and instead leverage the Datasource API

@tgaddair tgaddair merged commit 27e0b9b into master Jul 6, 2022
@tgaddair tgaddair deleted the b-add-csv-postprocessing branch July 6, 2022 16:21
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

None yet

3 participants