# Notes on biases and other issues with Explainable AI tools.

Nathan A. Mahynski

There is nothing "wrong" with these importance metrics; people just tend to forget that they only describe how the model "uses or relies" on certain features.  Bad models give nonsense answers, so always remember that you are computing these quantities to understand the MODEL, not the GROUND TRUTH.

# Random Forest Feature Importances

RF feature importances are computed from the mean decrease in impurity (MDI) at the nodes that split on each feature, accumulated over all trees in the forest.  However, low cardinality features (e.g., yes/no or [1,2,3]) can be artificially given lower importance scores than high cardinality features (those with many distinct values), lower than what might be considered "fair."  Moreover, feature correlation leads to bias as well.  These issues are discussed at length in [this blog post](https://explained.ai/rf-importance/#7), but other discussions are available on the internet as well:

* sklearn's [discussion](https://scikit-learn.org/stable/auto_examples/inspection/plot_permutation_importance.html)
* [Towards Data Science](https://towardsdatascience.com/explaining-feature-importance-by-example-of-a-random-forest-d9166011959e)
* [Medium](https://medium.com/@eng.mohammed.saad.18/detailed-explanation-of-random-forests-features-importance-bias-8755d26ac3bc)
* [Stack Exchange](https://datascience.stackexchange.com/questions/51976/why-is-random-forest-feature-importance-biased-towards-high-cadinality-features)

tl;dr

High cardinality features have many distinct values, which increases the chance that a high capacity model can find a 1:1 relationship with your response variable; for example, social security number vs. person.  This is almost like the multiple comparisons problem: if you look hard enough at features with lots of values, you can basically draw a circle/partition around each unique one and relate that to your output.  Of course, this will not generalize at all, so a well-trained model should resist this.  The bias tends to appear when the model has a large enough capacity to overfit.

[Permutation feature importances](https://scikit-learn.org/stable/modules/generated/sklearn.inspection.permutation_importance.html#sklearn.inspection.permutation_importance) are generally recommended instead, but they also underestimate the importance of correlated variables, because the model can still recover the permuted information from the correlated partners, which spreads the importance across them.  This is why (1) decorrelating your inputs (with the Spearman rank correlation, for example) and (2) using something like [BorutaSHAP](https://github.com/Ekeany/Boruta-Shap) to get rid of excess/poor features are generally good ways to improve explainability.

In particular, this bias issue was discovered by adding a random feature to a model and then computing importances, with the logic that anything less important than random should be discarded, only to find out that [the random feature was considered very important somehow](https://explained.ai/rf-importance/#7)?!  BorutaSHAP is similar (but expensive): it adds shuffled "shadow" copies of the features and keeps only the real features that are consistently more important than the shadows.
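
As a rough illustration of the decorrelation step, the sketch below clusters features on their Spearman rank correlation and keeps one representative per cluster. The synthetic data, the average-linkage choice, and the 0.5 distance threshold are all assumptions made for this example, not recommendations from the sources linked above.

```python
import numpy as np
from scipy.cluster import hierarchy
from scipy.spatial.distance import squareform
from scipy.stats import spearmanr

rng = np.random.default_rng(0)
n_samples = 500

# Synthetic data (assumption): two independent signals, each with a redundant copy.
a = rng.normal(size=n_samples)
b = rng.normal(size=n_samples)
X = np.column_stack([
    a,
    a + 0.05 * rng.normal(size=n_samples),  # nearly a duplicate of column 0
    b,
    np.exp(b),  # monotonic transform of column 2, so the ranks are identical
])

# Spearman rank correlation between every pair of columns.
corr = spearmanr(X).correlation
corr = (corr + corr.T) / 2.0  # enforce exact symmetry
np.fill_diagonal(corr, 1.0)

# Turn correlation into a distance: strongly (anti)correlated columns are "close".
distance = np.clip(1.0 - np.abs(corr), 0.0, None)
linkage = hierarchy.linkage(squareform(distance), method="average")

# Cut the dendrogram at an arbitrary threshold (assumption: 0.5).
cluster_ids = hierarchy.fcluster(linkage, t=0.5, criterion="distance")

# Keep the first column encountered in each cluster as its representative.
first_in_cluster = {}
for column, cluster in enumerate(cluster_ids):
    first_in_cluster.setdefault(int(cluster), column)
selected = sorted(first_in_cluster.values())

print("cluster ids:", cluster_ids)
print("selected columns:", selected)
```

Which column to keep from each cluster is a modeling decision; taking the first one is only a placeholder here.
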
> Note: if all your features are continuous variables (like chemical concentrations) you probably won't run into a problem because ALL of your features are like this; but if you also have some categorical variables (one-hot encoded or not) you may find that the importance of these will be suppressed relative to the continuous features.
>
> Also Note: extremely randomized trees draw their split thresholds at random instead of searching for the best one, so they might be less prone to this bias than RFs.
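
The random-feature check described above can be sketched with scikit-learn as follows. This is a toy setup, not the experiment from the linked blog post: the data are synthetic, with one informative binary feature and one continuous pure-noise feature, and the 20% label noise and forest settings are assumptions chosen so that the trees have room to overfit.

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.inspection import permutation_importance
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(0)
n_samples = 1000

# Informative, low cardinality feature (binary).
x_binary = rng.integers(0, 2, size=n_samples)
# Uninformative, high cardinality feature (continuous random noise).
x_noise = rng.normal(size=n_samples)

# The label follows the binary feature, with 20% of the labels flipped (assumption).
flip = rng.random(n_samples) < 0.2
y = np.where(flip, 1 - x_binary, x_binary)

X = np.column_stack([x_binary, x_noise])
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

# Fully grown trees (the default) have enough capacity to memorize the label noise.
rf = RandomForestClassifier(n_estimators=100, random_state=0)
rf.fit(X_train, y_train)

# Impurity-based (MDI) importances, computed from the training data.
mdi = rf.feature_importances_

# Permutation importances, computed on held-out data.
perm = permutation_importance(rf, X_test, y_test, n_repeats=10, random_state=0)

for name, m, p in zip(["binary", "noise"], mdi, perm.importances_mean):
    print(f"{name:>6}: MDI = {m:.3f}, permutation = {p:.3f}")
```

If the bias is present, the noise column receives a large share of the impurity-based importance, while its permutation importance on the held-out data stays close to zero.
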

Some possibly helpful libraries:
* [eli5](https://eli5.readthedocs.io/en/latest/index.html)
* [rfpimp](https://pypi.org/project/rfpimp/)

# SHAP vs. ACV

These notes are from [this blog post](https://towardsdatascience.com/the-right-way-to-compute-your-shapley-values-cfea30509254) comparing SHAP to the [ACV](https://github.com/salimamoukou/acv00) library.

See:
1. [Amoukou et al., The Shapley Value of coalition of variables provides better explanations (2021)](https://arxiv.org/pdf/2103.13342.pdf)
2. [Amoukou et al., Accurate and robust Shapley Values for explaining predictions and focusing on local important variables (2021)](https://arxiv.org/pdf/2106.03820.pdf)

tl;dr

---

The conventional SHAP methodology does not treat categorical variables well.  The one-hot encoding typically used makes the resulting columns look like independent features, which is incorrect: columns derived from the same categorical variable are strongly dependent on one another.  At the moment, it seems that ACV only works on tree-based models.
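
A small numerical check of why one-hot columns cannot be treated as independent: in the sketch below (a made-up 3-level categorical variable, NumPy only), every row has exactly one active column, so the columns are negatively correlated with each other.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical categorical variable with 3 levels, encoded as integers 0, 1, 2.
categories = rng.integers(0, 3, size=1000)

# One-hot encode: row i has a 1 in column categories[i] and 0 elsewhere.
one_hot = np.eye(3)[categories]

# Exactly one column is active per row, so the columns always sum to 1.
row_sums = one_hot.sum(axis=1)

# Pairwise Pearson correlation between the one-hot columns.
corr = np.corrcoef(one_hot, rowvar=False)

print("all rows sum to 1:", bool(np.all(row_sums == 1)))
print(corr.round(2))
```

Knowing that one column is 1 fixes all the others to 0, which is exactly the dependence that a per-column attribution ignores.
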

In general, the SHAP and ACV approaches will yield noticeable differences when a model is trained on a dataset with a high proportion of 
1. one-hot encoded categorical features and 
2. correlated features.

Regardless, while the differences can occasionally be noticeable, overall SHAP and ACV seem very similar to me.