
[Question] SHAP library: Is it possible to have a node in LightGBM that has no coverage (no samples assigned to it)? #6388

Closed
NegatedObjectIdentity opened this issue Mar 27, 2024 · 2 comments

@NegatedObjectIdentity

Description

In shap/shap#3574 we discuss whether it is possible for a node in LightGBM to have no samples assigned to it during training. Clarifying this question is important for deciding whether it is necessary to assert coverage of the nodes when no background dataset is passed and SHAP computations are based on tree paths. One of the SHAP maintainers asked for clarification on this question.

To summarize the lengthy discussion at shap/shap#3574:

There exists a case (multiclass prediction with LightGBM, interactions=True, data=None, feature_perturbation='tree_path_dependent') where SHAP explanations fail due to an

ExplainerError: The background dataset you provided does not cover all the leaves in the model, so TreeExplainer cannot run with the feature_perturbation="tree_path_dependent" option! Try providing a larger background dataset, no background dataset, or using feature_perturbation="interventional".

But since this works in all other (non-multiclass) cases, also with interactions=False in the multiclass case (which uses the LightGBM implementation), and even if one removes the coverage assertion, the question arises whether the coverage assertion is necessary for this single case at all.

Asserting coverage would be necessary if it were possible to get "empty" nodes during training, since such a node would be uncovered. Hence the question: is it possible to have a node in LightGBM that has no samples assigned to it during training?

Please point me toward any documentation that covers this question; alternatively, a clarification from one of the maintainers would help us a lot to move forward. Thank you very much in advance!

Reproducible example

from sklearn.datasets import load_diabetes
from sklearn.datasets import load_digits
from sklearn.datasets import load_breast_cancer
from lightgbm import LGBMRegressor
from lightgbm import LGBMClassifier
from shap.explainers import Tree as TreeExplainer

# data | regression | binary | multi class
data_reg = load_diabetes(as_frame=True)
data_bin = load_breast_cancer(as_frame=True)
data_mult = load_digits(as_frame=True)

# train models | regression | binary | multi class
model_reg = LGBMRegressor(verbosity=-1).fit(data_reg.data, data_reg.target)
model_bin = LGBMClassifier(verbosity=-1).fit(data_bin.data, data_bin.target)
model_mult = LGBMClassifier(verbosity=-1).fit(data_mult.data, data_mult.target)

# Explainer
explainer_reg = TreeExplainer(
    model_reg,
    data=None,
    feature_perturbation='tree_path_dependent')
explainer_bin = TreeExplainer(
    model_bin,
    data=None,
    feature_perturbation='tree_path_dependent')
explainer_mult = TreeExplainer(
    model_mult,
    data=None,
    feature_perturbation='tree_path_dependent')

# Explanations
explanations_reg = explainer_reg(data_reg.data, interactions=False)
explanations_reg_inter = explainer_reg(data_reg.data, interactions=True)
explanations_bin = explainer_bin(data_bin.data, interactions=False)
explanations_bin_inter = explainer_bin(data_bin.data, interactions=True)
explanations_mult = explainer_mult(data_mult.data, interactions=False)
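# per the description above, the multiclass call with interactions=True below
# is the one that raises the ExplainerError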
explanations_mult_inter = explainer_mult(data_mult.data, interactions=True)

Environment info

Win11
Python 3.11.8
LightGBM 4.3.0
SHAP 0.45.0

Additional Comments

@jameslamb
Collaborator

> possible to have a node in LightGBM that has no samples assigned to it during training

Given a dataset X with label y, while performing one round of boosting on X and y I don't believe it's possible for LightGBM to produce any leaf nodes matching 0 samples in X.

These checks explicitly prevent the addition of splits that result in 0 samples on one side of the split.

CHECK_GT(best_split_info.left_count, 0);

CHECK_GT(best_split_info.right_count, 0);

Here's a minimal example using lightgbm==4.3.0 showing how to trigger those checks, by providing a "forced" split that is impossible to satisfy.

import lightgbm as lgb
import numpy as np
import json
from sklearn.datasets import make_blobs

X, y = make_blobs(n_samples=10_000, n_features=1, centers=3)

with open("forced_splits.json", "w") as f:
    f.write(json.dumps(
        {"feature": 0, "threshold": np.max(X) + 1.0}
    ))

# construct the estimator + fit the model
bst = lgb.train(
    params={
        "forcedsplits_filename": "forced_splits.json",
        "objective": "multiclass",
        "min_gain_to_split": 0.0,
        "min_data_in_leaf": 0,
        "num_classes": 3,
        "num_iterations": 10,
        "num_leaves": 31,
        "verbose": 1
    },
    train_set=lgb.Dataset(X, label=y, params={"min_data_in_bin": 1})
)
This fails with:

lightgbm.basic.LightGBMError: Check failed: (best_split_info.right_count) > (0) at /Users/jlamb/repos/LightGBM/lightgbm-python/src/treelearner/serial_tree_learner.cpp, line 856 .

Ok, so given that I just trained a LightGBM model on dataset X, every leaf in every tree will match at least one sample in X?

No.

LightGBM supports "training continuation", where you start from an already-created model and then perform additional boosting rounds to add trees to that model.

It's not necessary that the training dataset used to produce that initial model is the same as the one used to generate those new trees. If the distribution of features is very different in those two datasets, it's possible for earlier trees (from the pre-trained model) to have leaves that match 0 samples in the newly-provided dataset.

For more, see https://stackoverflow.com/questions/73664093/lightgbm-train-vs-update-vs-refit/73669068#73669068.
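
To make the scenario concrete, here is a minimal sketch of training continuation via the init_model argument of lgb.train (the synthetic datasets and parameter values are illustrative assumptions, not taken from this issue):

import lightgbm as lgb
import numpy as np

# two hypothetical datasets with very different feature distributions
rng = np.random.default_rng(0)
X_old = rng.normal(loc=0.0, size=(1_000, 1))
y_old = (X_old[:, 0] > 0.0).astype(int)
X_new = rng.normal(loc=100.0, size=(1_000, 1))
y_new = (X_new[:, 0] > 100.0).astype(int)

params = {"objective": "binary", "num_iterations": 5, "verbose": -1}

# initial model, trained only on X_old
bst = lgb.train(params, lgb.Dataset(X_old, label=y_old))

# training continuation: start from the existing model and add more trees,
# this time fitting against X_new
bst_continued = lgb.train(
    params,
    lgb.Dataset(X_new, label=y_new),
    init_model=bst,
)

# the earlier trees split on thresholds learned from X_old, so some of their
# leaves may match 0 rows of X_new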

So you CANNOT assume that every leaf in every tree in a LightGBM model m will match at least one record in training dataset X, unless every tree in the model was created from X.

A LightGBM model can be created from multiple, heterogeneous datasets, so you cannot assume that for a given model there is such a thing as "the" single training dataset.

Wait, you can set min_child_samples = 0?

Yes. In shap/shap#3574, you mentioned the parameter min_child_samples several times. That is an alias for min_data_in_leaf, which must be in the range >=0 (parameter docs).

Setting it to 0 is supported as a way to say "use other mechanisms to prevent overfitting instead" (see the sketch after this list), like:

  • min_gain_to_split = minimum total change in the loss function that a split must provide to be added to the tree
  • max_depth, num_leaves = limits on the depth and number of leaves of each tree, regardless of how many samples fall in them
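
For illustration, a minimal sketch of such a configuration via the scikit-learn interface (the specific parameter values are arbitrary assumptions):

from sklearn.datasets import load_breast_cancer
from lightgbm import LGBMClassifier

data = load_breast_cancer(as_frame=True)

# disable the per-leaf sample minimum and rely on split gain and
# tree-size limits to control overfitting instead
model = LGBMClassifier(
    min_child_samples=0,  # alias for min_data_in_leaf
    min_split_gain=0.1,   # scikit-learn API name corresponding to min_gain_to_split
    num_leaves=15,
    max_depth=6,
    verbosity=-1,
).fit(data.data, data.target)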

@NegatedObjectIdentity
Author

@jameslamb Thank you very much! I really appreciate your help!
