Update Ludwig AutoML Feature Type Selection #1485
Merged
This pull request improves Ludwig AutoML's automatic feature type selection. The change set includes Justin's PR (not yet landed) to improve binary feature type selection, plus additional feature type selection updates. These updates address gaps observed between the automatically selected types and the types chosen for LBT and/or by practitioners on the 12 tabular datasets used to develop the AutoML heuristics. The additional updates were largely adapted from ideas discussed in the Ludwig AutoML meetings.
The PR is based on the tf-legacy branch, and comprises the following:
* Classify binary data type only if the values are bool-like. #1473
* Refinement of binary type identification so that a column with a single distinct value can also be typed as binary.
* Refinement of category type identification for columns with a small number of distinct values: the values must either not all be numerical, or all be sequential integers. The latter suggests that the numbers were chosen to represent categories.
* Refinement of the fall-through case for remaining columns: the type is category if sampling finds non-numerical values, and numerical otherwise.
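The rules listed above can be sketched roughly as follows. This is a minimal illustration under stated assumptions, not the code in this PR: the function name `infer_feature_type`, the `BOOL_LIKE` set, the `SMALL_DISTINCT_COUNT` threshold, and the reading that a single distinct value must still be bool-like are all hypothetical choices made for the example.

```python
from typing import Sequence

# Hypothetical values for illustration; not taken from Ludwig's source.
BOOL_LIKE = {"true", "false", "t", "f", "yes", "no", "y", "n", "1", "0"}
SMALL_DISTINCT_COUNT = 10  # assumed cutoff for "small-count distinct values"


def _is_number(value) -> bool:
    """Return True if the value can be parsed as a float."""
    try:
        float(value)
        return True
    except (TypeError, ValueError):
        return False


def _are_sequential_ints(values: Sequence) -> bool:
    """Return True if the values are integers that increase in steps of 1."""
    try:
        nums = sorted(float(v) for v in values)
    except (TypeError, ValueError):
        return False
    if any(not n.is_integer() for n in nums):
        return False
    return all(b - a == 1 for a, b in zip(nums, nums[1:]))


def infer_feature_type(distinct_values: Sequence, sample: Sequence) -> str:
    """Pick a feature type from a column's distinct values and a row sample."""
    distinct = [str(v).strip().lower() for v in distinct_values]

    # Binary: one or two distinct values, all of them bool-like.
    if 1 <= len(distinct) <= 2 and all(v in BOOL_LIKE for v in distinct):
        return "binary"

    # Category: few distinct values that are either not all numerical,
    # or are sequential integers (numbers likely used as category codes).
    if len(distinct) <= SMALL_DISTINCT_COUNT:
        all_numeric = all(_is_number(v) for v in distinct)
        if not all_numeric or _are_sequential_ints(distinct):
            return "category"

    # Fall-through: category if the sample has any non-numerical value.
    if any(not _is_number(v) for v in sample):
        return "category"
    return "numerical"
```

For example, under these assumptions a column with distinct values `1, 2, 3, 4` is typed as category because the integers are sequential, while `0.5, 1.7, 3.2` falls through to numerical.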
The impact of the PR was assessed on the 12 tabular datasets used to develop the AutoML heuristics. For each dataset, the types this PR produces through the create_auto_config API were compared with the types produced by the current tf-legacy branch code and with the types chosen for LBT and manually by practitioners. Given the positive impact, it is proposed to land this change on Ludwig’s tf-legacy and master branches. The results of this testing are available here: https://docs.google.com/document/d/1nLDbkYtg5J5Xb3sqRF25EGu3O2OOUwOSnbvQKdNhz68/edit?usp=sharing
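A comparison of this kind can be reproduced with a small helper such as the one below. It is only a sketch: `diff_feature_types` is a hypothetical name, and it assumes the selected types have already been extracted into plain dictionaries mapping feature name to type, rather than calling any Ludwig API.

```python
from typing import Dict, List, Tuple


def diff_feature_types(
    baseline: Dict[str, str], candidate: Dict[str, str]
) -> List[Tuple[str, str, str]]:
    """List (feature, baseline_type, candidate_type) where the types differ.

    Only features present in both mappings are compared.
    """
    shared = sorted(baseline.keys() & candidate.keys())
    return [
        (name, baseline[name], candidate[name])
        for name in shared
        if baseline[name] != candidate[name]
    ]
```

Running it once against the tf-legacy output and once against the practitioner-chosen types shows which features changed and in which direction.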