Some clarifications for the algorithm #3010

Closed
shenkev opened this issue Apr 21, 2020 · 1 comment

@shenkev
shenkev commented Apr 21, 2020

  1. Exclusive feature bundling (EFB) has a notion of "joining features that rarely take on non-zero values simultaneously". What does "non-zero" mean when joining two continuous variables? I'd imagine continuous variables are non-zero most of the time in most datasets. Is "zero" set to some mean value, or are continuous features simply not bundled?

  2. How do we expect EFB to behave when the features are dense and there are few exclusive sets? Will bundling still occur?

  3. The algorithm seems to allow a small fraction of conflicts in feature bundling. How are conflicts handled? Are the conflicting values set to 0 by default, is the value of one feature ignored, or something else?

@guolinke
Collaborator

  1. EFB is designed for sparse data, which contains many zero/NaN values. In our implementation, we use most_freq_bin (the bucketed integer value that holds the most data) as "zero" when bundling. Therefore, EFB may also work for dense data with many repeated values.
  2. EFB is always used. For dense data, it may simply not find any bundle.
  3. The feature value in a conflicting row is treated as zero/most_freq_bin.
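
To make the three answers concrete, here is a minimal Python sketch. It is not LightGBM's actual C++ code. Only the name `most_freq_bin` comes from the answer above. The helpers `conflict_count` and `bundle`, the parameter `n_bins_a`, the offset encoding, and the rule that the first feature wins on a conflicting row are all assumptions made for illustration.

```python
import numpy as np

def most_freq_bin(binned):
    """Bin index that holds the most rows; plays the role of "zero"."""
    return int(np.bincount(binned).argmax())

def conflict_count(a, b):
    """Rows where both features differ from their own most frequent bin."""
    a_on = a != most_freq_bin(a)
    b_on = b != most_freq_bin(b)
    return int(np.sum(a_on & b_on))

def bundle(a, b, n_bins_a):
    """Merge two binned features into one column (hypothetical encoding).

    0 means both features sit at their most frequent bin. Feature a maps to
    1 + bin, feature b to 1 + n_bins_a + bin. On a conflicting row, a is
    written last, so b is effectively treated as its most frequent bin.
    """
    out = np.zeros(len(a), dtype=np.int64)
    b_on = b != most_freq_bin(b)
    out[b_on] = 1 + n_bins_a + b[b_on]
    a_on = a != most_freq_bin(a)
    out[a_on] = 1 + a[a_on]  # assumption: a wins on conflicts
    return out

# a is sparse (mostly bin 0); b is dense but mostly repeats bin 1.
a = np.array([0, 0, 3, 0, 2, 0])
b = np.array([1, 1, 1, 4, 1, 1])
# c is like b, but is non-default on the same row as a (row 2).
c = np.array([1, 1, 2, 1, 1, 1])

print(conflict_count(a, b), bundle(a, b, 4).tolist())  # 0 [0, 0, 4, 9, 3, 0]
print(conflict_count(a, c), bundle(a, c, 4).tolist())  # 1 [0, 0, 4, 0, 3, 0]
```

In the first pair, `b` never equals zero, yet it bundles cleanly with `a` because its "zero" is bin 1. In the second pair, the single conflicting row keeps the value from `a`, and the value from `c` on that row is lost.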

@shenkev shenkev closed this as completed Apr 22, 2020
@lock lock bot locked as resolved and limited conversation to collaborators Jun 24, 2020