TreeSHAP Values are inconsistent #670
Comments
What if you use …

When I tried it, model.predict() gave me only the most probable class as output, not the class probabilities. I also tried model.score(), but that threw an error.
I tried; the problem persists, albeit with smaller variance. I now get 0.3267 vs 0.3289, which is a more realistic value given the balanced 3-class data set. With Lundberg's package that variance is much smaller.
I don't understand this part:
Why? Do you have a link to a paper that proves this?
Hi, I am a colleague of ntrost-targ. Our assumption comes from the local accuracy / additivity property of the explanations (see https://ema.drwhy.ai/breakDown.html#BDMethodGen and the Titanic example with base value = 0.2353095 and the posterior probabilities as f(x)), which should give f(x) = base_value + sum_i phi_i(x) (see also https://www.ncbi.nlm.nih.gov/pmc/articles/PMC7326367/#S10title , Property (1)). We just wanted to assert equality with Lundberg's Python implementation and did the reverse computation. Cheers,
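To make the assumption concrete, here is a minimal sketch of the check we are describing. The numbers are toy stand-ins, not actual SMILE or shap output; only the arithmetic of the local accuracy property is shown:

```python
# Local accuracy / additivity: f(x) = base_value + sum(shap_values(x)).
# The Titanic example linked above uses base value 0.2353095 in the same role.

def recovered_base(prediction, shap_values):
    """Back out the base value implied by one explained prediction."""
    return prediction - sum(shap_values)

# Two hypothetical observations explained by the same model:
phi_a = [0.10, -0.05, 0.02]   # SHAP values for observation a
phi_b = [-0.03, 0.08, 0.01]   # SHAP values for observation b
f_a = 0.40                    # model output for observation a
f_b = 0.39                    # model output for observation b

base_a = recovered_base(f_a, phi_a)  # ~0.33
base_b = recovered_base(f_b, phi_b)  # ~0.33

# A consistent TreeSHAP implementation recovers the *same* base value
# from every observation, up to floating-point noise:
assert abs(base_a - base_b) < 1e-12
```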
Hi Marc, thanks for the explanation. Although I'm not 100% sure, I think this small difference comes from the smoothing of the posterior probability. Depending on the leaf node size, this smoothing may have a slightly different impact on the posterior probability calculation. If you choose two samples hitting the same leaf node, I guess the difference will be smaller. It is hard to know whether two samples arrive at the same leaf node, so as a workaround I suggest computing the difference over all the samples of one class. I guess you will find several clusters of values with tiny differences.
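The suggested workaround could be sketched like this: recover a base value from every sample of one class and group near-identical results, to see whether the differences cluster (presumably by leaf node). The `preds` and `shaps` values here are hypothetical stand-ins for real model output:

```python
from collections import Counter

def recovered_bases(preds, shaps):
    """prediction - sum(shap values), for each sample."""
    return [p - sum(phi) for p, phi in zip(preds, shaps)]

# Hypothetical per-sample predictions and SHAP vectors for one class:
preds = [0.40, 0.39, 0.41, 0.40]
shaps = [[0.10, -0.03], [0.06, 0.004], [0.07, 0.011], [0.08, -0.006]]

# Round to group values that differ only by tiny smoothing effects:
clusters = Counter(round(b, 3) for b in recovered_bases(preds, shaps))
print(clusters)  # a few clusters of slightly different base values
```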
Hi Haifeng, I tried the same calculation with gradient boosted trees (smile.classification.gbm(formula, iris)) but arrive at 0.59 vs 0.69 (which is also an odd value in absolute terms; I'm expecting roughly 0.33). Also, the variance over the whole iris dataset in Python is negligible, on the order of 1e-16 (see the script below). Local explainability with SHAP is very important for us, and we would be thankful if you could take a deeper look into the issue.
Notice how the backward calculation matches explainer.expected_value for the first class. Best,
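The backward check amounts to the following. This is a sketch with fabricated, perfectly consistent numbers standing in for the real explainer output (in the actual script, the predictions are per-class probabilities and the SHAP vectors come from the Python shap package); it shows what "consistent" looks like, with a spread at floating-point-noise level:

```python
# Reverse computation: every explained sample should point back at the
# same expected value. With Lundberg's reference implementation the
# spread over the whole iris dataset is on the order of 1e-16.

EXPECTED_VALUE = 0.3333  # stand-in for explainer.expected_value of class 0

samples = [
    (0.91, [0.45, 0.12, 0.0067]),     # (prediction, SHAP values) per sample
    (0.05, [-0.20, -0.06, -0.0233]),
    (0.62, [0.30, -0.02, 0.0067]),
]

bases = [p - sum(phi) for p, phi in samples]
spread = max(bases) - min(bases)

# With a consistent implementation the spread is floating-point noise:
assert spread < 1e-12, f"base values disagree by {spread}"
```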
Hi Haifeng, I just came across the Commercial License Usage clause in the SMILE license, which applies when using SMILE in a commercial setting (i.e. incorporating SMILE in commercial products). I could not find any further details on the website regarding the modalities and costs of the commercial license. Could you give us details on that topic and on when commercial licensing is required? And could we request a deeper look into our SHAP issue on your side as commercial subscribers? Thanks a lot in advance + Cheers,
@mroettig please contact me by email. |
Describe the bug
When calculating TreeSHAP values for random forest classification, they don't add up.
I would expect that the prediction from .vote() minus the respective SHAP values gives me the base value, which is constant and should be the same for different observations. Note that this is the behaviour we observe in Lundberg's Python module. Also, it would be really handy if there were a function that just calculates the base value (expected_value in Python) for me.
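Such a helper would essentially return the mean model output over the background (training) data, which is how the Python shap package defines expected_value. A hedged sketch, where `toy_predict_proba` is a made-up stand-in for any predict-probability function:

```python
def expected_value(predict_proba, background, class_index=0):
    """Base value = average predicted probability over background data.
    This mirrors how expected_value is defined in the Python shap package."""
    preds = [predict_proba(x)[class_index] for x in background]
    return sum(preds) / len(preds)

# Toy stand-in model: predicted probabilities depend only on the sign of x.
def toy_predict_proba(x):
    return [0.7, 0.3] if x >= 0 else [0.2, 0.8]

background = [-1.0, 0.5, 2.0, -0.3]
print(expected_value(toy_predict_proba, background))  # ~0.45
```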
Expected behavior
When calculating TreeSHAP values, I expect them, together with the base value, to add up to the predicted probability.
Actual behavior
The calculated base values vary, even for observations from the same class.
Code snippet
Input data
Iris data set
Additional context