
What is the feature importance returned by 'gain' ? #1842

Closed
AlexandraBomane opened this issue Nov 13, 2018 · 2 comments

Comments

@AlexandraBomane

Is the output of LGBMClassifier().booster_.feature_importance(importance_type='gain') equivalent to the Gini importances used by Scikit-Learn's RandomForestClassifier (https://scikit-learn.org/stable/modules/generated/sklearn.ensemble.RandomForestClassifier.html)?

@AlexandraBomane AlexandraBomane changed the title Feature importance What is the feature importance returned by 'gain' ? Nov 13, 2018
@julioasotodv
Contributor

julioasotodv commented Nov 20, 2018

Well, they are roughly equivalent. The Random Forest implementation in sklearn is based on Breiman's paper (https://www.stat.berkeley.edu/~breiman/randomforest2001.pdf); therefore, the splitting criterion used in the model is Gini impurity, and the information gain of each split is measured in terms of that criterion.
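As a concrete illustration of what a "gain" is in this sense: each split's reduction in the tree's loss is credited to the feature it splits on, and the per-feature totals are summed over all splits and all trees. A minimal stdlib-only sketch, using Gini impurity as the loss (as sklearn does) and a hypothetical toy node:

```python
def gini(labels):
    """Gini impurity of a list of 0/1 class labels."""
    if not labels:
        return 0.0
    p = sum(labels) / len(labels)
    return 2.0 * p * (1.0 - p)

def split_gain(parent, left, right):
    """Weighted impurity decrease achieved by splitting parent into left/right.

    This is the quantity credited to the split feature; a 'gain' importance
    is the sum of these decreases over every split that uses the feature.
    """
    n = len(parent)
    return gini(parent) - (len(left) / n) * gini(left) - (len(right) / n) * gini(right)

# Toy node: 8 samples, a split that partially separates the two classes.
parent = [0, 0, 0, 0, 1, 1, 1, 1]
left, right = [0, 0, 0, 1], [0, 1, 1, 1]

print(split_gain(parent, left, right))  # → 0.125
```

LightGBM accumulates the same kind of per-split improvement, just measured against its own objective rather than Gini impurity.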

On LGBM, however, you define the loss to minimize directly. With objective='binary', for instance, it is logloss (cross-entropy), not Gini impurity. If you define a different objective in the model configuration, the loss to minimize will be a different one (the available objectives are listed at https://lightgbm.readthedocs.io/en/latest/Parameters.html#objective).

However, even on the same data, feature importance estimates from RandomForestClassifier and LGBM can differ, even if both models used the exact same loss (whether Gini impurity or anything else). Don't forget that these estimates reflect what the model "thinks" about the features and the dataset; unless your model is a perfect predictor, they will not always be right.

@StrikerRUS
Collaborator

@lock lock bot locked as resolved and limited conversation to collaborators Mar 11, 2020