
Add Sobol-MDA importance measure #604

Open · wants to merge 2 commits into master
Conversation


@clementbenard commented Mar 1, 2022

Add Sobol-MDA importance measure for regression forests, defined in https://doi.org/10.1093/biomet/asac017.

Details
The Sobol-MDA gives the proportion of explained output variance lost when a given variable is removed from the model. The complexity of computing the Sobol-MDA for all input variables is independent of the number of variables, whereas the brute-force approach has quadratic complexity.
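
Informally, with m(x) = E[Y | X = x], the quantity targeted is the total Sobol index (notation approximated from the paper, so treat this as a sketch):

    ST^(j) = E[Var(m(X) | X^(-j))] / Var(Y)

i.e., the share of the output variance explained by m that vanishes when X^(j) is removed.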

Commit: factorize projected tree predict function

@snembrini commented Jun 22, 2022

I would advise against adding this variable importance to ranger, for the following reasons:

  1. the authors are not proposing anything new: it belongs to a class of VI measures called "drop-and-relearn" (see LOCO in Lei et al. 2018)
  2. conveniently enough, the authors cite Ishwaran's permutation importance but completely omit the importance he refers to as "holdout importance", implemented in randomForestSRC since 2018, which does the exact same thing
  3. the Conditional Predictive Impact (CPI) already does a great job with correlated predictors

@clementbenard (Author)

Hello Stefano,

Thank you for having a look at the Sobol-MDA algorithm! Your concerns are addressed in our article (https://arxiv.org/pdf/2102.13347.pdf), especially in Subsection 4.1 for the comparison with competitors. I expand on the discussion below, and I would be interested in your thoughts on the following explanations.

You are right that the quantity estimated by the Sobol-MDA is not new at all: it is the accuracy decrease when a variable is removed from the model. Actually, this was formalized by Sobol in 1993, for example. LOCO, the holdout importance of randomForestSRC, and the Sobol-MDA are different algorithms targeting the same quantity.

The strong limitation of the "drop-and-relearn" approach (LOCO, holdout vimp in randomForestSRC, Hooker & Mentch (2016), Williamson (2021), etc.) is the computational complexity, which is O(p^2), i.e., quadratic in the number of input variables p: for each of the p inputs, a forest has to be retrained at a cost of O(p). When p is moderate or large, "drop-and-relearn" algorithms quickly become intractable. If you want to conduct a backward variable selection, for example, the complexity is even O(p^3). (See the sketch below.)
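
To make the p retrains concrete, here is a minimal sketch of the brute-force drop-and-relearn baseline using the standard ranger API (the object names are mine, and the OOB R^2 difference is just one way to measure the lost explained variance):

library(ranger)

x_names  <- setdiff(names(iris), "Sepal.Length")
full_fit <- ranger(Sepal.Length ~ ., data = iris)

# One retrain per input variable: p fits of O(p) cost each, hence O(p^2).
loco <- sapply(x_names, function(v) {
  drop_fit <- ranger(Sepal.Length ~ ., data = iris[, setdiff(names(iris), v)])
  full_fit$r.squared - drop_fit$r.squared  # explained variance lost without v
})
loco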

The Sobol-MDA has a computational complexity independent of p, because no forest is retrained in the procedure. Only the initial forest is used, and predictions are computed ignoring the splits on the removed variable. The theoretical analysis shows that it does converge towards the same quantity as the "drop-and-relearn" algorithms. Empirically, the accuracy is very close, but at a much cheaper computational cost for the Sobol-MDA.
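
To illustrate the "ignore the splits" idea on a toy tree (a caricature only: the paper's projected forest intersects the projected cells rather than averaging child means, and the tree structure below is invented for the example):

# Toy recursive prediction that skips splits on the removed variable by
# descending both children and averaging, weighted by training counts.
predict_projected <- function(node, x, drop_var) {
  if (is.null(node$split_var)) return(node$value)  # leaf
  if (node$split_var == drop_var) {                # ignore this split
    w <- node$n_left / (node$n_left + node$n_right)
    return(w * predict_projected(node$left, x, drop_var) +
           (1 - w) * predict_projected(node$right, x, drop_var))
  }
  if (x[[node$split_var]] <= node$split_point) {
    predict_projected(node$left, x, drop_var)
  } else {
    predict_projected(node$right, x, drop_var)
  }
}

tree <- list(split_var = "x1", split_point = 0.5, n_left = 60, n_right = 40,
             left  = list(split_var = NULL, value = 1.2),
             right = list(split_var = "x2", split_point = 0,
                          n_left = 25, n_right = 15,
                          left  = list(split_var = NULL, value = 2.0),
                          right = list(split_var = NULL, value = 3.1)))

predict_projected(tree, list(x1 = 0.7, x2 = -1.0), drop_var = "x2")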

About your last point, CPI is based on the knockoff approach (Candès et al. 2018), which is highly efficient when it is possible to sample from the conditional distributions of the inputs. In practice, however, often only a data sample is available, and estimating these conditional distributions is a very difficult task, even in small dimension.


@mikoontz commented Sep 2, 2022

I'm interested in this as a possible alternative importance metric in {ranger}. If the goal is to better handle correlated predictors, one benefit is that it doesn't require the user to go learn the whole {mlr3} ecosystem (which, as far as I know, is the only way to use the {cpi} framework).

I'm working on using a devtools::install_github() call on this particular commit from @clementbenard to try to put it into practice but am not having good luck just yet. I'll report back with some steps for future users if I get it to work.

edit:

Whoops, I just noticed that, unless this gets merged directly into {ranger}, it's probably better to install the {sobolMDA} package from GitLab instead. Then you can run either the {ranger} version of the ranger() function or the {sobolMDA} version:

# devtools::install_gitlab(repo = "drti/sobolmda")
library(sobolMDA)
library(ranger)

# Stock ranger with the built-in impurity importance:
rg.iris <- ranger::ranger(Species ~ ., data = iris, importance = "impurity")
rg.iris$variable.importance

# Same call through sobolMDA, with the new importance mode:
rg.iris <- sobolMDA::ranger(Species ~ ., data = iris, importance = "sobolMDA")
rg.iris$variable.importance

edit2:

I noticed that this importance measure doesn't appear to work for my case, where I'm using case.weights to upsample some observations: only NaN is returned. I suspect this is related to some observations never being out of bag, so that the summaries for those (non-existent) out-of-bag observations across trees return NaN.

You can recreate the error by using very few trees and relying on chance to have some observations never be out-of-bag:

# With only 10 trees, chance alone can leave some rows in-bag in every tree:
rg.iris <- sobolMDA::ranger(Species ~ ., data = iris, importance = "sobolMDA", num.trees = 10, seed = 2)
rg.iris$variable.importance  # all NaN
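
One way to check the suspected cause with plain ranger (keep.inbag and inbag.counts are standard ranger features; I'm assuming the sobolMDA fork resamples the same way, so this is only a diagnostic sketch):

rg <- ranger::ranger(Species ~ ., data = iris, num.trees = 10,
                     keep.inbag = TRUE, seed = 2)
inbag <- simplify2array(rg$inbag.counts)  # n_obs x num.trees matrix of in-bag counts
oob_trees <- rowSums(inbag == 0)          # number of trees where each row is OOB
which(oob_trees == 0)                     # rows that are never out-of-bag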

This also appears to be a challenge for the conditional predictive impact approach mentioned by @snembrini above: bips-hb/cpi#9
