Improve AutoML Target Encoding integration (auto mode) #7847

Open
exalate-issue-sync bot opened this issue May 11, 2023 · 1 comment

@exalate-issue-sync

Ideas for improving the general performance of the basic Target Encoding (TE) integration in AutoML. TE is currently turned off by default in AutoML, but can be activated by setting preprocessing = ["target_encoding"] in the AutoML function.

  • Configure TE on a per-algorithm basis (XGBoost vs. non-XGBoost models to start, then tune each model separately). One suggestion: apply TE only to categorical columns with cardinality >= 10 for XGBoost and >= 25 for H2O tree algorithms.
  • Consider not applying TE at all to DNN models. DNNs can learn interactions more easily than tree models, and TE is usually bad for NNs because they probably overfit to the encoded values instead of learning them through backprop.
  • Try a different minimum cardinality threshold (when a column has > N categories, turn on TE; otherwise leave it off).
  • Try a different upper cardinality threshold (when a column has > N categories, drop the original categorical column; otherwise keep it in the training frame).
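To make the ideas above concrete, here is a minimal sketch of what a per-algorithm TE policy could look like. It is not the AutoML implementation: the `TEPolicy` class, the `plan_column` helper, and the algorithm keys (`"xgboost"`, `"gbm"`, `"drf"`, `"deeplearning"`) are hypothetical names chosen for illustration, and mapping "H2O tree algos" to `gbm`/`drf` is an assumption. The lower thresholds (10 and 25) come from the suggestion above; the upper threshold has no proposed value in this issue, so it is left unset.

```python
from dataclasses import dataclass
from typing import Dict, Optional


@dataclass(frozen=True)
class TEPolicy:
    """Hypothetical per-algorithm target encoding settings."""
    enabled: bool = True
    # Lower threshold: encode only columns with at least this many categories.
    min_cardinality: int = 10
    # Upper threshold: drop the original column above this cardinality.
    # None means the original column is always kept (no value is proposed in the issue).
    drop_original_above: Optional[int] = None


# Illustrative defaults; the algorithm keys are assumptions, not AutoML identifiers.
DEFAULT_POLICIES: Dict[str, TEPolicy] = {
    "xgboost": TEPolicy(min_cardinality=10),
    "gbm": TEPolicy(min_cardinality=25),
    "drf": TEPolicy(min_cardinality=25),
    "deeplearning": TEPolicy(enabled=False),  # skip TE entirely for DNNs
}


def plan_column(algo: str, cardinality: int,
                policies: Optional[Dict[str, TEPolicy]] = None) -> Dict[str, bool]:
    """Decide whether to encode a categorical column and whether to keep the original."""
    table = DEFAULT_POLICIES if policies is None else policies
    policy = table.get(algo, TEPolicy())
    encode = policy.enabled and cardinality >= policy.min_cardinality
    drop = (encode
            and policy.drop_original_above is not None
            and cardinality > policy.drop_original_above)
    return {"encode": encode, "keep_original": not drop}
```

With these illustrative defaults, a column with 12 categories would be encoded for `"xgboost"` but not for `"gbm"`, and never for `"deeplearning"`.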

Our current approach: a column is encoded if its cardinality >= 10 (hard limit) and nrows/cardinality >= 10 (blending inflection point).
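The current rule can be written as a small predicate. This is only a sketch of the two conditions as stated here, not the actual AutoML code; the function name `should_target_encode` and its parameter names are hypothetical.

```python
def should_target_encode(cardinality: int, nrows: int,
                         min_cardinality: int = 10,
                         min_rows_per_level: float = 10.0) -> bool:
    """Return True if a categorical column passes both conditions of the rule.

    cardinality: number of distinct categories in the column
    nrows: number of rows in the training frame
    """
    if cardinality <= 0:
        return False  # nothing to encode, and avoids division by zero
    return (cardinality >= min_cardinality
            and nrows / cardinality >= min_rows_per_level)
```

For example, a column with 50 categories in a 400-row frame fails the second condition (400/50 = 8), while the same column in a 500-row frame passes (500/50 = 10).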

We also want to improve the user experience by offering customizable TE encodings (for now it's just an on/off switch for an auto-TE strategy). The ticket for that is here: https://0xdata.atlassian.net/browse/PUBDEV-7803

@h2o-ops
Collaborator

h2o-ops commented May 14, 2023

JIRA Issue Migration Info

Jira Issue: PUBDEV-7795
Assignee: Sebastien Poirier
Reporter: Erin LeDell
State: Open
Fix Version: Backlog
Attachments: N/A
Development PRs: N/A
