Allow RF Minimal Depth Filter to do its own thresholding #2687

annette987 · 2019-11-26T03:15:56Z

According to Ishwaran et al. (https://www.tandfonline.com/doi/abs/10.1198/jasa.2009.tm08622), the minimal depth of a random forest can be used for high-dimensional variable selection.

Minimal depth is implemented in mlr through the filter randomForestSRC_var.select with the argument method = "md". However, the mlr implementation returns a value for every feature, so the threshold information is lost and the built-in thresholding cannot be used for feature selection.

library(survival)
library(mlr)
#> Loading required package: ParamHelpers

data(veteran)
vet.task <- makeSurvTask(id = "VET", data = veteran, target = c("time", "status"))
vet.task <- createDummyFeatures(vet.task)
dat = generateFilterValuesData(task = vet.task, 
                         method = "randomForestSRC_var.select",
                         nselect = 5,
                         more.args=list("randomForestSRC_var.select"=list(method="md", nrep=3))
)
dat
#> FilterValues:
#> Task: VET
#>                  name    type                     method value
#> 1:     celltype.large numeric randomForestSRC_var.select 6.044
#> 2:     celltype.adeno numeric randomForestSRC_var.select 4.335
#> 3: celltype.smallcell numeric randomForestSRC_var.select 4.138
#> 4:  celltype.squamous numeric randomForestSRC_var.select 4.033
#> 5:              prior numeric randomForestSRC_var.select 3.894
#> 6:                age numeric randomForestSRC_var.select 3.614
#> 7:           diagtime numeric randomForestSRC_var.select 1.911
#> 8:              karno numeric randomForestSRC_var.select 1.844
#> 9:                trt numeric randomForestSRC_var.select 0.902

Created on 2019-11-26 by the reprex package (v0.3.0)

At the moment, in Filter.R, randomForestSRC::var.select is called with always.use = getTaskFeatureNames(task), so all features are always returned. However, randomForestSRC::var.select returns an element, topvars, containing the variables that ranked above the threshold. Although topvars is used in the filter for method = "md", it is overridden by the always.use argument. It is not used for method = "vh", but should be.
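To illustrate the point, here is a minimal sketch of calling randomForestSRC::var.select directly, outside mlr, and reading the thresholded result from topvars. It assumes the veteran data bundled with randomForestSRC (not the dummy-encoded mlr task above); the seed and ntree = 100 are arbitrary choices for the example and are not taken from mlr's filter code.

```r
library(randomForestSRC)

# Assumption: use the copy of the veteran data shipped with randomForestSRC.
data(veteran, package = "randomForestSRC")

set.seed(1)  # arbitrary seed, only to make the sketch repeatable

# Minimal depth variable selection without always.use, so the
# threshold computed by var.select decides which variables are kept.
vs <- var.select(Surv(time, status) ~ ., data = veteran,
                 method = "md", ntree = 100, verbose = FALSE)

# Variables selected by the minimal depth threshold.
print(vs$topvars)

# Per-variable statistics for the selected variables.
print(vs$varselect)
```

Passing always.use with every feature name, as the filter currently does, forces all of them into the selection, which is why the threshold is not visible from mlr.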

I propose that the always.use argument be removed and that in filterFeatures.R, nselect be adjusted to the number of non-null values.
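The nselect adjustment could look roughly like the sketch below. This is not the code in filterFeatures.R: the data frame, the feature names and the scores are made up for illustration, and it assumes that "non-null" means features whose filter value is not NA.

```r
# Hypothetical filter output: features that did not pass the threshold
# carry NA instead of a score (names and values are illustrative only).
fval <- data.frame(
  name  = c("feat_a", "feat_b", "feat_c", "feat_d", "feat_e", "feat_f"),
  value = c(1.8, 2.1, NA, NA, NA, NA),
  stringsAsFactors = FALSE
)

# Number of features the user asked for.
nselect <- 5

# Cap nselect at the number of features that actually received a score.
n_scored <- sum(!is.na(fval$value))
nselect  <- min(nselect, n_scored)

# Keep only the scored features, at most nselect of them.
scored   <- fval[!is.na(fval$value), ]
selected <- head(scored$name, nselect)
print(selected)
```

With this adjustment, a request for five features yields only the two that the threshold retained, rather than padding the selection with features the algorithm rejected.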

I have coded this and can do a PR.

larskotthoff · 2019-11-27T04:43:37Z

Thanks, but if I understand you correctly, that would be equivalent to thresholding the returned values from the current implementation. Is there anything that this change would allow us to do that we can't do at the moment?

annette987 · 2019-11-27T07:10:42Z

Yes. The minimal depth algorithm, implemented in randomForestSRC::var.select.rfsrc, calculates the best threshold for the data and, from that threshold, the most important variables. We cannot currently access that information through mlr, so we have to rely on the usual tuning methods to select a threshold, which may not give as good a result.
It is a very minor change in terms of the code and would be more in line with the way the filter was designed to work.

larskotthoff · 2019-11-27T17:12:24Z

Ah ok, sounds good. Could you make the PR please?

annette987 mentioned this issue Nov 28, 2019

Allow RF Minimal Depth Filter to do its own thresholding #2690

Merged

pat-s closed this as completed in #2690 Dec 13, 2019
