This repository provides code to reproduce results in the paper "Debiasing SHAP scores in random forests" by Markus Loecher (2023)
The folder structure is as follows:
- The actual code to run simulations and the functions to create figures are all in the
src
folder. - The resulting pdfs are in the
figures
folder. - Some (most) simulations take a long time to run, so we have saved the data outputs from those runs in the
data
folder. - Figures from the paper:
- We do not include source code for Figures 1 and Figure 6, since these are legacy figures from the work leading to this paper and were created prior to this research.
- There are dedicated Rmd files to create Figures 2 and 3. The figures in the Appendix are created by
makeFigs_Appendix.R
. - Figures 4 and 5 were created by python code and requite installing the python module
TreeModelsFromScratch
, directions for which can be found on the original github page. Deviating from the general pattern, even the plots are created in thedata/titanic/
folder.
- Table 1: We imported the AUC scores for SHAP, SHAPoob, MDA, MDI from a previous paper and only recreated the entry for
$\widehat{\text{SHAP}}^{shrunk}_{in}$ . The relevant files for this areAUC_run_simulations.R
andsrc/AUC_simulations_functions.R
. (Again, these simulations take a long time to run, so we have saved the data for your convenience)
I should note the following: this code base is more complex than strictly necessary as a result of having evolved over the years by contributions from various students both at the Bachelor and Master level. Roughly speaking, there are four separate parts, all of which led to the final insights in the paper:
- The initial research ideas were tested in python (sklearn) where we successfully separated out-of-bag from inbag data for trees/forests.
- Inspired by the need for more control on the features and attributes of trees, we then developed our own random forest library (still in python) which we used to create Figures 4 and 5.
- In order to replicate our results in different settings, we eventually switched from the original shap module to treeshap in R. Most of the initial functions are found in the files
StroblData_ASTA2022.R
andhelperFuns.R
. However, those functions work only for the train/test methodology. - The actual separation of out-of-bag from inbag data for trees/forests was achieved in a separate effort and can be found in the files
treewise_shap_simulation.R
andsim_utils.R
.