
[Question] SHAP library: Is it possible to have a node in LightGBM that has no coverage (no samples assigned to it)? #6388

Closed
NegatedObjectIdentity opened this issue Mar 27, 2024 · 2 comments

@NegatedObjectIdentity

Description

In shap/shap#3574 we discuss whether it is possible for a node in LightGBM to have no samples assigned to it during training. Clarifying this question is important for deciding whether it is necessary to assert coverage of the nodes when no background dataset is passed and SHAP computations are based on tree paths. One of the SHAP maintainers asked for clarification on this question.

To summarize the lengthy discussion at shap/shap#3574:

There exists a case (multiclass prediction with LightGBM, interactions=True, data=None, feature_perturbation='tree_path_dependent') where SHAP explanations fail due to an

ExplainerError: The background dataset you provided does not cover all the leaves in the model, so TreeExplainer cannot run with the feature_perturbation="tree_path_dependent" option! Try providing a larger background dataset, no background dataset, or using feature_perturbation="interventional".

But since this works in all other (non-multiclass) cases, also with interactions=False in the multiclass case (which uses the LightGBM implementation), and even if one removes the coverage assertion, the question arises whether the coverage assertion is necessary for this single case at all.

Asserting coverage would be necessary if it were possible to get "empty" nodes during training, since such a node would be uncovered. Hence the question: is it possible to have a node in LightGBM that has no samples assigned to it during training?

Please point me toward any documentation that covers this question; alternatively, a clarification from one of the maintainers would help us a lot to move forward. Thank you very much in advance!

Reproducible example

from sklearn.datasets import load_diabetes
from sklearn.datasets import load_digits
from sklearn.datasets import load_breast_cancer
from lightgbm import LGBMRegressor
from lightgbm import LGBMClassifier
from shap.explainers import Tree as TreeExplainer

# data | regression | binary | multi class
data_reg = load_diabetes(as_frame=True)
data_bin = load_breast_cancer(as_frame=True)
data_mult = load_digits(as_frame=True)

# train models | regression | binary | multi class
model_reg = LGBMRegressor(verbosity=-1).fit(data_reg.data, data_reg.target)
model_bin = LGBMClassifier(verbosity=-1).fit(data_bin.data, data_bin.target)
model_mult = LGBMClassifier(verbosity=-1).fit(data_mult.data, data_mult.target)

# Explainer
explainer_reg = TreeExplainer(
    model_reg,
    data=None,
    feature_perturbation='tree_path_dependent')
explainer_bin = TreeExplainer(
    model_bin,
    data=None,
    feature_perturbation='tree_path_dependent')
explainer_mult = TreeExplainer(
    model_mult,
    data=None,
    feature_perturbation='tree_path_dependent')

# Explanations
explanations_reg = explainer_reg(data_reg.data, interactions=False)
explanations_reg_inter = explainer_reg(data_reg.data, interactions=True)
explanations_bin = explainer_bin(data_bin.data, interactions=False)
explanations_bin_inter = explainer_bin(data_bin.data, interactions=True)
explanations_mult = explainer_mult(data_mult.data, interactions=False)
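# per the description above, the multiclass call with interactions=True below
# is the one that raises the ExplainerError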
explanations_mult_inter = explainer_mult(data_mult.data, interactions=True)

Environment info

Win11
Python 3.11.8
LightGBM 4.3.0
SHAP 0.45.0

Additional Comments

@jameslamb
Collaborator

> possible to have a node in LightGBM that has no samples assigned to it during training

Given a dataset X with label y, while performing one round of boosting on X and y I don't believe it's possible for LightGBM to produce any leaf nodes matching 0 samples in X.

These checks explicitly prevent the addition of splits that result in 0 samples on one side of the split.

CHECK_GT(best_split_info.left_count, 0);

CHECK_GT(best_split_info.right_count, 0);

Here's a minimal example using lightgbm==4.3.0 showing how to trigger those checks, by providing a "forced" split that is impossible to satisfy.

import lightgbm as lgb
import numpy as np
import json
from sklearn.datasets import make_blobs

X, y = make_blobs(n_samples=10_000, n_features=1, centers=3)

with open("forced_splits.json", "w") as f:
    f.write(json.dumps(
        {"feature": 0, "threshold": np.max(X) + 1.0}
    ))

# construct the estimator + fit the model
bst = lgb.train(
    params={
        "forcedsplits_filename": "forced_splits.json",
        "objective": "multiclass",
        "min_gain_to_split": 0.0,
        "min_data_in_leaf": 0,
        "num_classes": 3,
        "num_iterations": 10,
        "num_leaves": 31,
        "verbose": 1
    },
    train_set=lgb.Dataset(X, label=y, params={"min_data_in_bin": 1})
)
This fails with:

lightgbm.basic.LightGBMError: Check failed: (best_split_info.right_count) > (0) at /Users/jlamb/repos/LightGBM/lightgbm-python/src/treelearner/serial_tree_learner.cpp, line 856 .

Ok, so given that I just trained a LightGBM model on dataset X, every leaf in every tree will match at least one sample in X?

No.

LightGBM supports "training continuation", where you start from an already-created model and then perform additional boosting rounds to add trees to that model.

It's not necessary that the training dataset used to produce that initial model is the same as the one used to generate those new trees. If the distribution of features is very different in those two datasets, it's possible for earlier trees (from the pre-trained model) to have leaves that match 0 samples in the newly-provided dataset.

For more, see https://stackoverflow.com/questions/73664093/lightgbm-train-vs-update-vs-refit/73669068#73669068.
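
To make the scenario concrete, here is a minimal sketch of training continuation via the init_model argument of lgb.train (the synthetic datasets and parameter values are illustrative assumptions, not taken from this issue):

import lightgbm as lgb
import numpy as np

# two hypothetical datasets with very different feature distributions
rng = np.random.default_rng(0)
X_old = rng.normal(loc=0.0, size=(1_000, 1))
y_old = (X_old[:, 0] > 0.0).astype(int)
X_new = rng.normal(loc=100.0, size=(1_000, 1))
y_new = (X_new[:, 0] > 100.0).astype(int)

params = {"objective": "binary", "num_iterations": 5, "verbose": -1}

# initial model, trained only on X_old
bst = lgb.train(params, lgb.Dataset(X_old, label=y_old))

# training continuation: start from the existing model and add more trees,
# this time fitting against X_new
bst_continued = lgb.train(
    params,
    lgb.Dataset(X_new, label=y_new),
    init_model=bst,
)

# the earlier trees split on thresholds learned from X_old, so some of their
# leaves may match 0 rows of X_new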

So you CANNOT assume that every leaf in every tree in a LightGBM model m will match at least one record in training dataset X, unless every tree in the model was created from X.

A LightGBM model can be created from multiple, heterogeneous datasets, so you cannot assume that for a given model there is such a thing as "the" single training dataset.

Wait, you can set min_child_samples = 0?

Yes. In shap/shap#3574, you mentioned the parameter min_child_samples several times. That is an alias for min_data_in_leaf, which must be in the range >=0 (parameter docs).

Setting it to 0 is supported as a way to say "use other mechanisms to prevent overfitting instead" (see the sketch after this list), like:

  • min_gain_to_split = minimum total change in the loss function that a split must provide to be added to the tree
  • max_depth, num_leaves = limits on the depth and number of leaves of each tree, regardless of how many samples fall in them
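
For illustration, a minimal sketch of such a configuration via the scikit-learn interface (the specific parameter values are arbitrary assumptions):

from sklearn.datasets import load_breast_cancer
from lightgbm import LGBMClassifier

data = load_breast_cancer(as_frame=True)

# disable the per-leaf sample minimum and rely on split gain and
# tree-size limits to control overfitting instead
model = LGBMClassifier(
    min_child_samples=0,  # alias for min_data_in_leaf
    min_split_gain=0.1,   # scikit-learn API name corresponding to min_gain_to_split
    num_leaves=15,
    max_depth=6,
    verbosity=-1,
).fit(data.data, data.target)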

@NegatedObjectIdentity
Author

@jameslamb Thank you very much! I really appreciate your help!
