Now we emit split counts as the feature importance for all the boosters. But split counts have some drawbacks:

- they depend on the number of trees and the tree depth
- features with high cardinality appear in splits more often
- importance values are hard to compare across boosters and across runs
We propose to compute SHAP scores for feature importance instead:

- a linear estimate of how much each feature contributes to the final metric
- the same implementation works for all the boosters
- does not depend on the number of trees or the tree depth
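For context, the Shapley value of feature $i$ is its marginal contribution to the model output $f$, averaged over all subsets $S$ of the remaining features $F \setminus \{i\}$ (this is the quantity any SHAP method approximates):

$$\phi_i = \sum_{S \subseteq F \setminus \{i\}} \frac{|S|!\,(|F|-|S|-1)!}{|F|!}\,\bigl[f(S \cup \{i\}) - f(S)\bigr]$$

The exact sum is exponential in the number of features, which is why a sampling-based estimator is needed in practice.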
TreeSHAP is a major PITA (shap4j does not support Windows and requires Python-specific pickling), so we can just implement KernelSHAP and call it a day:

- there should be a time budget for the whole MC sampling during the KernelSHAP evaluation
- we may also emit the dispersion of each feature's estimate as a measure of confidence
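To make the time-budget and dispersion points concrete, here is a rough sketch of the sampling loop. It uses the simpler permutation-based Monte Carlo Shapley estimator rather than KernelSHAP's weighted regression, but the budget and dispersion mechanics would be the same; all names (`predict`, `background`, `shapley_mc`) are hypothetical, with `predict` standing in for a booster's scoring call:

```python
import random
import statistics
import time


def shapley_mc(predict, x, background, time_budget_s=1.0, max_perms=200):
    """Permutation-based Monte Carlo estimate of Shapley values.

    predict     -- callable: feature vector (list of floats) -> float score
    x           -- the instance to explain
    background  -- baseline values substituted for "absent" features
    Returns (phi, dispersion): per-feature mean marginal contribution
    and its standard deviation across sampled permutations.
    """
    m = len(x)
    samples = [[] for _ in range(m)]  # marginal contributions per feature
    deadline = time.monotonic() + time_budget_s
    for _ in range(max_perms):
        if time.monotonic() >= deadline:
            break  # respect the overall time budget for the MC sampling
        order = list(range(m))
        random.shuffle(order)  # one random feature ordering
        current = list(background)
        prev = predict(current)
        for i in order:
            current[i] = x[i]  # add feature i to the coalition
            cur = predict(current)
            samples[i].append(cur - prev)  # marginal contribution of i
            prev = cur
    phi = [statistics.fmean(s) for s in samples]
    disp = [statistics.stdev(s) if len(s) > 1 else 0.0 for s in samples]
    return phi, disp


if __name__ == "__main__":
    # toy additive model: f(x) = 3*x0 + 1*x1, baseline of zeros
    f = lambda v: 3 * v[0] + 1 * v[1]
    phi, disp = shapley_mc(f, [1.0, 1.0], [0.0, 0.0], time_budget_s=0.5)
    print(phi, disp)  # phi ≈ [3.0, 1.0]; dispersion 0 for a linear model
```

For an additive model every permutation yields the same marginal contributions, so the dispersion is zero; for models with feature interactions it grows, which is exactly why it is a useful confidence signal to emit alongside the importance itself.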