
number of categorical features #367

Closed
xuzhang5788 opened this issue Apr 3, 2021 · 1 comment

@xuzhang5788

In your blog post, I remember you mentioned that the largest number of categorical features was 25. Is that true? My dataset has 85 categorical columns.

@pplonski
Contributor

pplonski commented Apr 6, 2021

There are two methods for handling categorical features. By default, all categorical features are converted to integers, and there is no limit on the number of categories. However, there is also an additional step during AutoML training called mix_encoding. In this step, categorical features with fewer than 21 categories are one-hot encoded. mix_encoding is enabled by default in the Compete step. If you want to enable or disable mix_encoding, set it in the AutoML() constructor.

Example:

automl = AutoML(mode="Optuna", mix_encoding=True)
automl.fit(X, y)

In the above example, AutoML will tune algorithms with Optuna and will also try Xgboost with mix_encoding (integer + one-hot encoding).

Important notes

  • mix_encoding will not be used (even if set in the constructor) if all categorical features have more than 20 categories; at least one feature must have fewer than 21 categories.
  • mix_encoding is used only for the Xgboost algorithm (this could probably be enhanced ...)
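Given the first note above, you can check in advance whether your dataset has any categorical column that would qualify for one-hot encoding. Below is a small illustrative sketch using pandas; count_low_cardinality is a hypothetical helper written for this example, not part of mljar-supervised:

```python
import pandas as pd

def count_low_cardinality(X: pd.DataFrame, max_categories: int = 20) -> int:
    """Count categorical columns with at most `max_categories` unique values,
    i.e. columns that would qualify for one-hot encoding in mix_encoding."""
    categorical_cols = X.select_dtypes(include=["object", "category"]).columns
    return sum(X[col].nunique() <= max_categories for col in categorical_cols)

# Example: one low-cardinality and one high-cardinality categorical column.
X = pd.DataFrame({
    "color": ["red", "green", "blue"] * 10,      # 3 categories -> qualifies
    "user_id": [f"u{i}" for i in range(30)],     # 30 categories -> too many
    "amount": range(30),                         # numeric, ignored
})
print(count_low_cardinality(X))  # 1 -> mix_encoding would be applied
```

If this returns 0, mix_encoding will have no effect even when enabled in the constructor.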
