
Add Sobol-MDA importance measure #604

Open · wants to merge 2 commits into master
Conversation


@clementbenard commented Mar 1, 2022

Add Sobol-MDA importance measure for regression forests, defined in https://doi.org/10.1093/biomet/asac017.

Details
The Sobol-MDA gives the proportion of explained output variance lost when a given variable is removed from the model. The complexity of computing the Sobol-MDA for all input variables is independent of the number of variables, whereas the brute-force approach has quadratic complexity.
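
Informally, with m(x) = E[Y | X = x], the quantity targeted is the total Sobol index (notation approximated from the paper, so treat this as a sketch):

    ST^(j) = E[Var(m(X) | X^(-j))] / Var(Y)

i.e., the share of the output variance explained by m that vanishes when X^(j) is removed.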

Commit: factorize projected tree predict function

@snembrini commented Jun 22, 2022

I would advise against adding this variable importance to ranger, for the following reasons:

  1. the authors are not proposing anything new: it belongs to a class of VI measures called "drop-and-relearn" (see LOCO in Lei et al. 2018)
  2. conveniently enough, the authors cite Ishwaran's permutation importance but completely omit the importance he refers to as "holdout importance", implemented in randomForestSRC since 2018, which does the exact same thing
  3. the Conditional Predictive Impact (CPI) already does a great job with correlated predictors

@clementbenard (Author)

Hello Stefano,

Thank you for having a look at the Sobol-MDA algorithm! Your concerns are addressed in our article (https://arxiv.org/pdf/2102.13347.pdf), especially in Subsection 4.1 for the comparison with competitors. I expand on the discussion below, and I would be interested in your thoughts on the following explanations.

You are right that the quantity estimated by the Sobol-MDA is not new at all: it is the accuracy decrease when a variable is removed from the model. Actually, this was formalized by Sobol in 1993, for example. LOCO, the holdout importance of randomForestSRC, and the Sobol-MDA are different algorithms targeting the same quantity.

The strong limitation of the "drop-and-relearn" approach (LOCO, holdout vimp in randomForestSRC, Hooker & Mentch (2016), Williamson (2021), etc.) is the computational complexity, which is O(p^2), i.e., quadratic in the number of input variables p: for each of the p inputs, a forest has to be retrained at a cost of O(p). When p is moderate or large, "drop-and-relearn" algorithms quickly become intractable. If you want to conduct a backward variable selection, for example, the complexity is even O(p^3). (See the sketch below.)
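
To make the p retrains concrete, here is a minimal sketch of the brute-force drop-and-relearn baseline using the standard ranger API (the object names are mine, and the OOB R^2 difference is just one way to measure the lost explained variance):

library(ranger)

x_names  <- setdiff(names(iris), "Sepal.Length")
full_fit <- ranger(Sepal.Length ~ ., data = iris)

# One retrain per input variable: p fits of O(p) cost each, hence O(p^2).
loco <- sapply(x_names, function(v) {
  drop_fit <- ranger(Sepal.Length ~ ., data = iris[, setdiff(names(iris), v)])
  full_fit$r.squared - drop_fit$r.squared  # explained variance lost without v
})
loco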

The Sobol-MDA has a computational complexity independent of p, because no forest is retrained in the procedure. Only the initial forest is used, and predictions are computed ignoring the splits on the removed variable. The theoretical analysis shows that it does converge towards the same quantity as the "drop-and-relearn" algorithms. Empirically, the accuracy is very close, but at a much cheaper computational cost for the Sobol-MDA.
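
To illustrate the "ignore the splits" idea on a toy tree (a caricature only: the paper's projected forest intersects the projected cells rather than averaging child means, and the tree structure below is invented for the example):

# Toy recursive prediction that skips splits on the removed variable by
# descending both children and averaging, weighted by training counts.
predict_projected <- function(node, x, drop_var) {
  if (is.null(node$split_var)) return(node$value)  # leaf
  if (node$split_var == drop_var) {                # ignore this split
    w <- node$n_left / (node$n_left + node$n_right)
    return(w * predict_projected(node$left, x, drop_var) +
           (1 - w) * predict_projected(node$right, x, drop_var))
  }
  if (x[[node$split_var]] <= node$split_point) {
    predict_projected(node$left, x, drop_var)
  } else {
    predict_projected(node$right, x, drop_var)
  }
}

tree <- list(split_var = "x1", split_point = 0.5, n_left = 60, n_right = 40,
             left  = list(split_var = NULL, value = 1.2),
             right = list(split_var = "x2", split_point = 0,
                          n_left = 25, n_right = 15,
                          left  = list(split_var = NULL, value = 2.0),
                          right = list(split_var = NULL, value = 3.1)))

predict_projected(tree, list(x1 = 0.7, x2 = -1.0), drop_var = "x2")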

About your last point, CPI is based on the knockoff approach (Candès et al. 2018), which is highly efficient when it is possible to sample from the conditional distributions of the inputs. In practice, however, often only a data sample is available, and estimating these conditional distributions is a very difficult task, even in small dimension.


@mikoontz commented Sep 2, 2022

I'm interested in this as a possible alternative importance metric in {ranger}. If the goal is to better handle correlated predictors, one benefit is that it doesn't require the user to go learn the whole {mlr3} ecosystem (which, as far as I know, is the only way to use the {cpi} framework).

I'm working on using a devtools::install_github() call on this particular commit from @clementbenard to try to put it into practice but am not having good luck just yet. I'll report back with some steps for future users if I get it to work.

edit:

Whoops, I just noticed that, unless this gets merged directly into {ranger}, it's probably better to install the {sobolMDA} package from GitLab instead. Then you can run either the {ranger} version of the ranger() function or the {sobolMDA} version:

# devtools::install_gitlab(repo = "drti/sobolmda")
library(sobolMDA)
library(ranger)

# Stock ranger with the built-in impurity importance:
rg.iris <- ranger::ranger(Species ~ ., data = iris, importance = "impurity")
rg.iris$variable.importance

# Same call through sobolMDA, with the new importance mode:
rg.iris <- sobolMDA::ranger(Species ~ ., data = iris, importance = "sobolMDA")
rg.iris$variable.importance

edit2:

I noticed that this importance measure doesn't appear to work for my case, where I'm using case.weights to upsample some observations: only NaN is returned. I suspect this is related to some observations never being out of bag, so that the summaries for those (non-existent) out-of-bag observations across trees return NaN.

You can recreate the error by using very few trees and relying on chance to have some observations never be out-of-bag:

# With only 10 trees, chance alone can leave some rows in-bag in every tree:
rg.iris <- sobolMDA::ranger(Species ~ ., data = iris, importance = "sobolMDA", num.trees = 10, seed = 2)
rg.iris$variable.importance  # all NaN
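
One way to check the suspected cause with plain ranger (keep.inbag and inbag.counts are standard ranger features; I'm assuming the sobolMDA fork resamples the same way, so this is only a diagnostic sketch):

rg <- ranger::ranger(Species ~ ., data = iris, num.trees = 10,
                     keep.inbag = TRUE, seed = 2)
inbag <- simplify2array(rg$inbag.counts)  # n_obs x num.trees matrix of in-bag counts
oob_trees <- rowSums(inbag == 0)          # number of trees where each row is OOB
which(oob_trees == 0)                     # rows that are never out-of-bag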

This also appears to be a challenge for the conditional predictive impact approach mentioned by @snembrini above: bips-hb/cpi#9
