Update Ludwig AutoML Feature Type Selection #1485

amholler · 2021-11-14T15:22:13Z

This pull request improves automatic feature type selection in Ludwig AutoML. The change set includes Justin's PR (not yet landed) to improve binary feature type selection, plus additional feature type selection updates. These updates address gaps observed when comparing the selected types with those chosen for LBT and/or by practitioner(s) on the 12 tabular datasets used to develop the AutoML heuristics. The additional updates were largely adapted from ideas discussed in the Ludwig AutoML meetings.

The PR is based on the tf-legacy branch and comprises the following:

Justin's PR to improve binary feature type selection: "Classify binary data type only if the values are bool-like." #1473
Changes in ludwig/automl/base_config.py and ludwig/utils/strings_utils.py:
** Binary type identification is refined so that a feature with a single distinct value is also typed as binary.
** Category type identification for features with a small number of distinct values is refined: the values must either not all be numerical, or all be sequential integers. The latter suggests that the numbers were chosen to represent categories.
** The fall-through case for the remaining features selects category if sampling finds non-numerical values, and numerical otherwise.
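
For readers skimming the thread, the decision order described in the list above can be sketched roughly as follows. This is a simplified illustration, not the code in ludwig/automl/base_config.py: the function names, the `SMALL_DISTINCT_COUNT` threshold, and the `BOOL_LIKE_PAIRS` set are all hypothetical, and any checks the real code performs for other feature types are omitted.

```python
from typing import Optional, Sequence

# Hypothetical constants for this sketch; not Ludwig's actual values.
SMALL_DISTINCT_COUNT = 10
BOOL_LIKE_PAIRS = [
    {"true", "false"}, {"t", "f"}, {"yes", "no"}, {"y", "n"}, {"1", "0"},
]


def is_number(value) -> bool:
    """Return True if the value can be parsed as a float."""
    try:
        float(value)
        return True
    except (TypeError, ValueError):
        return False


def are_sequential_integers(values: Sequence) -> bool:
    """Return True if all values are integers forming a gap-free run."""
    ints = []
    for v in values:
        if not is_number(v) or not float(v).is_integer():
            return False
        ints.append(int(float(v)))
    ints.sort()
    return all(b - a == 1 for a, b in zip(ints, ints[1:]))


def are_bool_like(values: Sequence) -> bool:
    """Return True if the values match one of the assumed bool-like pairs."""
    return {str(v).strip().lower() for v in values} in BOOL_LIKE_PAIRS


def infer_feature_type(
    distinct_values: Sequence, sampled_values: Optional[Sequence] = None
) -> str:
    """Apply the heuristics in the order listed in the PR description."""
    if sampled_values is None:
        sampled_values = distinct_values
    n = len(distinct_values)
    # A single distinct value, or two bool-like values, is typed as binary.
    if n == 1 or (n == 2 and are_bool_like(distinct_values)):
        return "binary"
    # Few distinct values: category unless they are non-sequential numbers.
    if n <= SMALL_DISTINCT_COUNT:
        all_numeric = all(is_number(v) for v in distinct_values)
        if not all_numeric or are_sequential_integers(distinct_values):
            return "category"
    # Fall-through: category if any sampled value is non-numerical.
    if any(not is_number(v) for v in sampled_values):
        return "category"
    return "numerical"
```

Under these assumptions, a column with distinct values `[1, 2, 3, 4]` would be typed as category, while `[0.5, 1.75, 3.2]` would be typed as numerical.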

The impact of the PR was assessed on the 12 tabular datasets used to develop the AutoML heuristics. For each dataset, the types output by the create_auto_config API with this PR were compared with those produced by the current tf-legacy branch code, and with those chosen for LBT and manually by practitioners. Given the positive impact, it is proposed to land this change on Ludwig’s tf-legacy and master branches. The results of this testing are available here: https://docs.google.com/document/d/1nLDbkYtg5J5Xb3sqRF25EGu3O2OOUwOSnbvQKdNhz68/edit?usp=sharing

tgaddair

Nice! I like these changes. Similar to @justinxzhao 's PR, we should also think about how to handle this for large datasets, where it may be expensive to carry through all the distinct values in the metadata.

My thought at the moment is we can push down a lot of these computations into the DataSource abstraction so we compute only the derived data from the distinct values (like whether the values are sequential, etc.). Happy to deep dive into this in more detail.
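
To make the push-down idea concrete, here is a rough sketch of computing only the derived facts in a single pass while capping how many distinct values are held in memory. It is not the DataSource API: `ColumnSummary`, `summarize_column`, and the `max_tracked` cap are hypothetical names chosen for illustration, and the sketch assumes the column values are hashable.

```python
from dataclasses import dataclass
from typing import Iterable, Optional


@dataclass
class ColumnSummary:
    distinct_count: Optional[int]  # None once the tracking cap is exceeded
    all_numeric: bool
    sequential_integers: bool


def _is_number(value) -> bool:
    """Return True if the value can be parsed as a float."""
    try:
        float(value)
        return True
    except (TypeError, ValueError):
        return False


def summarize_column(values: Iterable, max_tracked: int = 100) -> ColumnSummary:
    """Derive type-selection facts without keeping every distinct value."""
    distinct = set()
    overflowed = False
    all_numeric = True
    for v in values:
        if all_numeric and not _is_number(v):
            all_numeric = False
        if not overflowed:
            distinct.add(v)
            if len(distinct) > max_tracked:
                # Too many distinct values: stop tracking and free the set.
                overflowed = True
                distinct.clear()
    if overflowed:
        return ColumnSummary(None, all_numeric, False)
    sequential = False
    if distinct and all_numeric:
        floats = sorted(float(v) for v in distinct)
        sequential = all(f.is_integer() for f in floats) and all(
            b - a == 1 for a, b in zip(floats, floats[1:])
        )
    return ColumnSummary(len(distinct), all_numeric, sequential)
```

With this shape, only the small summary object would need to travel in the metadata, and a high-cardinality column costs at most `max_tracked + 1` stored values during the scan.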

amholler · 2021-11-17T22:51:19Z

Hi, @tgaddair Thank you very much for your feedback! I believe I have addressed the
scaling issues and the other comments you made on @justinxzhao 's PR (1473). Could
you please take another look? Thanks again!

tgaddair

Thanks for the fix! This looks good to me!

Co-authored-by: Anne Holler <anne@vmware.com>

Co-authored-by: amholler <86269492+amholler@users.noreply.github.com> Co-authored-by: Anne Holler <anne@vmware.com>

Update Ludwig AutoML Feature Type Selection

6c99ad7

tgaddair reviewed Nov 14, 2021

Address code review feedback

8aeeace

tgaddair approved these changes Nov 17, 2021

tgaddair merged commit 91ccb04 into ludwig-ai:tf-legacy Nov 17, 2021

tgaddair pushed a commit that referenced this pull request Nov 18, 2021

Update Ludwig AutoML Feature Type Selection (#1485)

39eeef5

Co-authored-by: Anne Holler <anne@vmware.com>

tgaddair added a commit that referenced this pull request Nov 19, 2021

Update Ludwig AutoML Feature Type Selection (#1485) (#1491)

5d562cb

Co-authored-by: amholler <86269492+amholler@users.noreply.github.com> Co-authored-by: Anne Holler <anne@vmware.com>
