Add Sobol-MDA importance measure #604
base: master
Conversation
factorize projected tree predict function
I would advise against adding this variable importance measure to ranger for the following reasons:
Hello Stefano,

Thank you for having a look at the Sobol-MDA algorithm! Your concerns are addressed in our article (https://arxiv.org/pdf/2102.13347.pdf), especially in Subsection 4.1 for comparisons with competitors. I expand on the discussion below, and I would be interested in your thoughts on the following explanations.

You are right that the quantity estimated by the Sobol-MDA is not new at all: it estimates the accuracy decrease when a variable is removed. This was formalized by Sobol in 1993, for example. LOCO, the holdout importance of randomForestSRC, and the Sobol-MDA are different algorithms that target the same quantity.

The strong limitation of the "drop-and-relearn" approach (LOCO, holdout vimp in randomForestSRC, Hooker & Mentch (2016), Williamson (2021), etc.) is its computational complexity, which is O(p^2), i.e., quadratic in the number of input variables p: for each of the p inputs, a forest has to be retrained, at a cost of O(p). When p is moderate or large, "drop-and-relearn" algorithms quickly become intractable. If you want to conduct a backward variable selection, for example, the complexity is even O(p^3).

The Sobol-MDA has a computational complexity independent of p, because no forest is retrained in the procedure. Only the initial forest is used, and predictions are computed by ignoring splits in the trees. The theoretical analysis shows that it does converge towards the same quantity as "drop-and-relearn" algorithms. Empirically, the accuracy is very close, but at a much lower computational cost for the Sobol-MDA.

About your last point, CPI is based on the knockoff approach (Candès 2018), which is highly efficient when it is possible to sample from the conditional distributions of the inputs. In practice, often only a data sample is available, and estimating the conditional distributions is a very difficult task, even in small dimension.
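To make the complexity point concrete, here is a minimal sketch (not part of this PR) of the "drop-and-relearn" / LOCO baseline using {ranger}; the data, seed, and variable names are illustrative assumptions.

```r
# Drop-and-relearn (LOCO-style) importance: retrain one forest per dropped
# predictor and measure the loss in OOB explained variance.
library(ranger)

set.seed(42)
n <- 500
dat <- data.frame(x1 = rnorm(n), x2 = rnorm(n), x3 = rnorm(n))
dat$y <- 2 * dat$x1 + dat$x2 + rnorm(n)

full_fit <- ranger(y ~ ., data = dat, num.trees = 300)
full_r2  <- full_fit$r.squared  # OOB R^2 of the full model

predictors <- setdiff(names(dat), "y")
loco <- sapply(predictors, function(p) {
  # Retraining a whole forest for each dropped predictor is what makes the
  # total cost quadratic in the number of predictors.
  reduced_fit <- ranger(y ~ ., data = dat[, setdiff(names(dat), p)],
                        num.trees = 300)
  full_r2 - reduced_fit$r.squared  # drop in OOB explained variance
})
loco
```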
I'm interested in this as a possible alternative importance metric in {ranger}. One benefit is that it doesn't require the user to learn the whole {mlr3} ecosystem in order to use it (which, as far as I know, is the only way to use the {cpi} framework), if the goal is to better handle correlated predictors. I'm working on using a

edit: Whoops, I just noticed that, unless this gets merged directly into {ranger}, it is probably better to install the {sobolMDA} package from gitlab instead (rough usage sketch at the end of this comment). Then you can run the {ranger} version of the
edit2: I noticed that this importance measure doesn't appear to work for my case, where I'm using

You can recreate the error by using very few trees and relying on chance to have some observations never be out-of-bag:
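Here is a minimal sketch of my own (names and seed illustrative) of the failure mode described above: with very few trees, some observations are never out-of-bag, so any importance measure that needs an OOB prediction for every observation has nothing to work with for them.

```r
library(ranger)

set.seed(1)
n <- 30
dat <- data.frame(x1 = rnorm(n), x2 = rnorm(n))
dat$y <- dat$x1 + rnorm(n)

fit <- ranger(y ~ ., data = dat, num.trees = 2, keep.inbag = TRUE)

# For each observation, count the trees in which it is out-of-bag.
inbag_counts <- do.call(cbind, fit$inbag.counts)  # n x num.trees matrix
never_oob <- rowSums(inbag_counts == 0) == 0
sum(never_oob)  # > 0 here: these rows have no OOB prediction, which is
                # reportedly what triggers the error with the Sobol-MDA importance
```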
This also appears to be a challenge for the conditional predictive impact approach mentioned by @snembrini above: bips-hb/cpi#9
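For reference, a hedged sketch of what calling the Sobol-MDA importance might look like, whether via this PR's ranger build or the {sobolMDA} fork mentioned above; the option string "sobolMDA" and the fork's exported `ranger()` are assumptions based on this thread, not a confirmed API.

```r
# remotes::install_gitlab("<path to the sobolMDA repository>")  # path not shown here
library(ranger)  # or library(sobolMDA) if using the fork

set.seed(1)
n <- 500
dat <- data.frame(x1 = rnorm(n), x2 = rnorm(n))
dat$y <- dat$x1 + 0.5 * dat$x2 + rnorm(n)

fit <- ranger(y ~ ., data = dat, num.trees = 300,
              importance = "sobolMDA")  # assumed option name added by this PR / the fork
fit$variable.importance  # proportion of explained output variance lost per variable
```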
Add Sobol-MDA importance measure for regression forests, defined in https://doi.org/10.1093/biomet/asac017.
Details
The Sobol-MDA gives the proportion of explained output variance lost when a given variable is removed from the model. The complexity of computing the Sobol-MDA for all input variables is independent of the number of variables (whereas the brute-force approach has quadratic complexity).
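For reference, a restatement of the targeted quantity (my paraphrase of the linked article, where m(X) = E[Y | X] is the regression function and X^(-j) denotes all inputs except X^(j)):

```latex
% Total Sobol index of X^{(j)}: the proportion of explained output variance
% lost when X^{(j)} is removed from the model.
ST^{(j)} = \frac{\mathbb{E}\left[\operatorname{Var}\left(m(X) \mid X^{(-j)}\right)\right]}{\operatorname{Var}(Y)}
```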