Add boolean postprocessing to dataset type inference for automl #2193

magdyksaleh · 2022-06-25T03:40:22Z

Code Pull Requests

When a CSV file is read in via pandas, a column of boolean values that also contains missing values is inferred as object dtype instead of bool dtype. This PR corrects that.
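A small reproduction of the behavior described above, as a sketch: the CSV contents and the column names `flag` and `value` are made up for illustration and do not come from this PR.

```python
import io

import pandas as pd

# Hypothetical CSV: the "flag" column has one missing value in the last row.
with_missing = pd.read_csv(io.StringIO("flag,value\nTrue,1\nFalse,2\n,3\n"))

# Same data, but with every "flag" entry present.
complete = pd.read_csv(io.StringIO("flag,value\nTrue,1\nFalse,2\nTrue,3\n"))

print(with_missing["flag"].dtype)  # object
print(complete["flag"].dtype)      # bool
```

Because the column with a missing entry comes back as object dtype, type inference that only looks at the dtype cannot tell it apart from a text column.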

github-actions · 2022-06-25T04:20:18Z

Unit Test Results

6 files ±0, 6 suites ±0, 2h 28m 6s ⏱️ +10m 35s
2 909 tests +54: 2 863 ✔️ +42, 46 💤 +12, 0 ❌ ±0
8 727 runs +162: 8 585 ✔️ +126, 142 💤 +36, 0 ❌ ±0

Results for commit 3bc010e. ± Comparison against base commit edc7c26.

♻️ This comment has been updated with latest results.

tgaddair

We also need a test since presumably this is fixing a bug.

tgaddair · 2022-06-26T16:22:13Z

ludwig/automl/utils.py

@@ -78,6 +78,23 @@ def avg_num_tokens(field: Series) -> int:
    return avg_words


+def check_if_boolean(field: Series) -> bool:
+    if len(field) > 5000:
+        field = field.sample(n=5000, random_state=40)


I believe this function could be given a Dask Series, which does not support sample(n). Dask only supports sampling a fraction of the data. Typically we would just use field.head(5000) here, and avoid calling len, which is also expensive in Dask.

More general question: do we need to compute this directly from the series when we already have a function called get_distinct_values that returns the top k distinct values? We should be able to derive this info from the distinct values instead of recomputing it.

magdyksaleh · 2022-06-27

Makes sense. Changed it accordingly.
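A minimal sketch of the `head`-based approach suggested in this thread, assuming the goal is only to collect a small set of distinct values. The helper name `distinct_from_head` is hypothetical and not part of Ludwig. `head(n)` exists on both pandas and Dask Series. It takes the first rows rather than a random sample, so it can miss values that only appear later in the column.

```python
import pandas as pd


def distinct_from_head(field, n: int = 5000) -> list:
    """Return the distinct values found in the first n rows of a Series.

    Hypothetical helper. It avoids len() and sample(n=...), using only
    head(n), which both pandas and Dask Series provide. On Dask, head()
    returns a pandas Series, so unique() below runs on in-memory data.
    """
    return field.head(n).unique().tolist()


# Usage with a plain pandas Series (object dtype because of the None entries).
field = pd.Series([True, False, None] * 4000)
values = distinct_from_head(field)
print(len(values))  # 3
```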

justinxzhao

Agreed with @tgaddair that it would be great to add a test for this new check, perhaps in a new file like test_data_source.py.

justinxzhao · 2022-06-27T21:53:41Z

ludwig/automl/data_source.py

@@ -69,6 +75,20 @@ def get_audio_values(self, column: str, sample_size: int = 10) -> int:
    def get_avg_num_tokens(self, column: str) -> int:
        return avg_num_tokens(self.df[column])

+    def check_if_boolean(self, column: str) -> bool:


nit: rename to is_boolean

justinxzhao · 2022-06-27T22:01:15Z

ludwig/automl/data_source.py

+        if num_unique_values <= 3:
+            for entry in unique_values:
+                try:
+                    if np.isnan(entry):
+                        continue
+                except TypeError:
+                    return False
+                if isinstance(entry, bool):
+                    continue
+                return False
+        return True


Suggested change

-        if num_unique_values <= 3:
-            for entry in unique_values:
-                try:
-                    if np.isnan(entry):
-                        continue
-                except TypeError:
-                    return False
-                if isinstance(entry, bool):
-                    continue
-                return False
-        return True
+        if num_unique_values > 3:
+            return False
+        for entry in unique_values:
+            try:
+                if np.isnan(entry):
+                    continue
+            except TypeError:
+                # When does this happen?
+                return False
+            if isinstance(entry, bool):
+                continue
+            # Encountered non-boolean type.
+            return False
+        return True
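The suggested version can be exercised as a standalone function. The sketch below is an illustration, not the merged Ludwig code: the name `is_boolean` and the plain-list argument are assumptions. It also shows when the `TypeError` branch fires: `np.isnan` raises `TypeError` for inputs it cannot coerce to a number, such as strings or `None`. Note that in the original hunk a column with more than three unique values falls through to `return True`, whereas the suggested form returns `False`.

```python
import numpy as np


def is_boolean(unique_values) -> bool:
    """Hypothetical standalone form of the suggested check.

    Returns True only if every distinct value is a Python bool or NaN,
    and there are at most three distinct values (True, False, NaN).
    """
    if len(unique_values) > 3:
        return False
    for entry in unique_values:
        try:
            if np.isnan(entry):
                continue  # Missing values are allowed.
        except TypeError:
            # np.isnan rejects non-numeric inputs such as str or None.
            return False
        if isinstance(entry, bool):
            continue
        # Encountered a non-boolean value (for example an int or float).
        return False
    return True


print(is_boolean([True, False, float("nan")]))  # True
print(is_boolean(["yes", "no"]))                # False
print(is_boolean([0, 1]))                       # False
```

One caveat with this form: `isinstance(entry, bool)` is false for `np.bool_` values, so the distinct values need to be Python bools, as they are when taken from an object-dtype column.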

justinxzhao · 2022-06-27T22:06:34Z

ludwig/automl/base_config.py

+            is_boolean = source.check_if_boolean(field)
+            if is_boolean:


nit: Consolidate into 1 line.

magdyksaleh · 2022-06-30T08:33:47Z

Added a test and refactored so that this is just a function in Ludwig that checks whether the field is boolean. That way we don't need to propagate the change through the platform as well, and can instead leverage the DataSource API.

magdyksaleh added 3 commits June 24, 2022 20:36

add check_if_boolean fn

0c10f64

comment

c21ce28

Mergh branch 'master' of github.com:ludwig-ai/ludwig into b-add-csv-postprocessing

c9c56f6

magdyksaleh requested a review from justinxzhao June 25, 2022 03:41

[pre-commit.ci] auto fixes from pre-commit.com hooks

c2753cb

for more information, see https://pre-commit.ci

magdyksaleh requested review from tgaddair and removed request for justinxzhao June 25, 2022 03:41

magdyksaleh assigned tgaddair Jun 25, 2022

tgaddair requested changes Jun 26, 2022

magdyksaleh added 2 commits June 27, 2022 10:48

Use df method directly

a0e92df

Merge branch 'master' of github.com:ludwig-ai/ludwig into b-add-csv-postprocessing

f9b4a9e

magdyksaleh requested a review from tgaddair June 27, 2022 18:44

justinxzhao reviewed Jun 27, 2022

magdyksaleh and others added 6 commits June 30, 2022 10:18

add test

c718875

some refactoring

b0a93c7

revert

0671233

[pre-commit.ci] auto fixes from pre-commit.com hooks

4cb7f88

for more information, see https://pre-commit.ci

Mergh

387a833

branch 'b-add-csv-postprocessing' of github.com:ludwig-ai/ludwig into b-add-csv-postprocessing

silly

9392fd3

magdyksaleh requested a review from justinxzhao June 30, 2022 08:33

magdyksaleh and others added 3 commits June 30, 2022 10:48

make dask import optional

5ec0a4c

make automl import optional

2ca7ece

[pre-commit.ci] auto fixes from pre-commit.com hooks

3bc010e

for more information, see https://pre-commit.ci

tgaddair approved these changes Jul 6, 2022

tgaddair merged commit 27e0b9b into master Jul 6, 2022

tgaddair deleted the b-add-csv-postprocessing branch July 6, 2022 16:21
