max_cat_threshold Hyperparameter #2261

Closed
pford221 opened this issue Jul 12, 2019 · 3 comments

@pford221
Contributor

pford221 commented Jul 12, 2019

Hi,

Is there another way to find out what is happening with max_cat_threshold without trying to find it in the C++ source code? To save anyone kind enough to reply some time, I'll share my hypothesis, and hopefully you can just correct me if I'm wrong.

Suppose there is a categorical variable of cardinality 10 (i.e. 10 unique levels). Also suppose that we set max_cat_threshold = 1. At each split opportunity, the algorithm aggregates the sum(gradients) / sum(hessians) for all the records in each of the 10 categories and then sorts the 10 categories by that ratio from lowest to highest (or highest to lowest...doesn't matter). Then because max_cat_threshold = 1, the algorithm only has one split point to evaluate and this split point is as close to the median (as determined by number of observations or weighted observations) as possible?

I appreciate any help on this!
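
The aggregation and ordering step described in the hypothesis can be sketched as follows. This is a minimal illustration, not LightGBM's source: the function name `order_categories_by_ratio` and the toy gradient and hessian values are made up for the example, and any smoothing or filtering the library may apply is left out.

```python
from collections import defaultdict


def order_categories_by_ratio(categories, gradients, hessians):
    """Sort category levels by sum(gradients) / sum(hessians), ascending.

    Hypothetical helper for illustration only; not a LightGBM API.
    """
    grad_sum = defaultdict(float)
    hess_sum = defaultdict(float)
    # Aggregate per-record gradients and hessians within each category.
    for cat, g, h in zip(categories, gradients, hessians):
        grad_sum[cat] += g
        hess_sum[cat] += h
    # Order the levels by their aggregated ratio.
    return sorted(grad_sum, key=lambda c: grad_sum[c] / hess_sum[c])


# Toy data (assumed values): three levels, every hessian equal to 1.0.
cats = ["a", "b", "c", "a", "b", "c"]
grads = [0.5, -1.0, 2.0, 0.5, -3.0, 1.0]
hess = [1.0] * 6
ordered = order_categories_by_ratio(cats, grads, hess)
print(ordered)  # ['b', 'a', 'c']
```

Here the per-level ratios are -2.0 for "b", 0.5 for "a", and 1.5 for "c", which gives the ordering shown.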

@guolinke
Collaborator

@pford221 yeah, you are right. max_cat_threshold is to limit the categorical split points.

@pford221
Contributor Author

Thank you, @guolinke. I realize that max_cat_threshold limits split points, but would you mind commenting on whether my description of how it does so is more or less correct? I really appreciate it.

@guolinke
Collaborator

@pford221
I think most of it is correct.
But regarding this part:

"Then because max_cat_threshold = 1, the algorithm only has one split point to evaluate and this split point is as close to the median (as determined by number of observations or weighted observations) as possible?"

I want to clarify that the split point will not be near the median; it will be the category with the highest (or lowest) sum(gradients) / sum(hessians).
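
One way to read this reply: with the categories already sorted by ratio, max_cat_threshold = 1 leaves only the splits that isolate a single category at either end of the ordering. The sketch below enumerates such candidates for a general limit. Scanning from both ends, and treating the limit as a cap on how many categories sit on one side of the split, are assumptions drawn from the reply; `candidate_splits` is a hypothetical helper, not LightGBM code.

```python
def candidate_splits(ordered_categories, max_cat_threshold):
    """Enumerate (left, right) category partitions from both ends of the ordering.

    Hypothetical illustration; assumes the limit caps how many categories
    may be placed on the left side of a split.
    """
    n = len(ordered_categories)
    # Leave at least one category on the right side.
    limit = min(max_cat_threshold, n - 1)
    splits = []
    # Scan once from the lowest-ratio end and once from the highest-ratio end.
    for ordering in (ordered_categories, ordered_categories[::-1]):
        for k in range(1, limit + 1):
            left = set(ordering[:k])
            right = set(ordering[k:])
            splits.append((left, right))
    return splits


# Categories assumed to be already sorted by sum(gradients) / sum(hessians).
ordered = ["b", "a", "c", "d"]
for left, right in candidate_splits(ordered, max_cat_threshold=1):
    print(sorted(left), "|", sorted(right))
```

With the limit set to 1, the only candidates are "b" against the rest and "d" against the rest, that is, the lowest-ratio and highest-ratio categories, rather than a cut near the median.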

lock bot locked this issue as resolved and limited conversation to collaborators on Mar 11, 2020