Allow RF Minimal Depth Filter to do its own thresholding #2687

Closed
annette987 opened this issue Nov 26, 2019 · 3 comments · Fixed by #2690

Comments

@annette987
Contributor

annette987 commented Nov 26, 2019

According to Ishwaran et al (https://www.tandfonline.com/doi/abs/10.1198/jasa.2009.tm08622), the minimal depth of a random forest can be used for high-dimensional variable selection.

Minimal depth is implemented in mlr through the filter randomForestSRC_var.select with the argument method = "md". However, the mlr implementation returns a value for every feature, so the threshold information is lost and the built-in thresholding cannot be used for feature selection.

library(survival)
library(mlr)
#> Loading required package: ParamHelpers

data(veteran)
vet.task <- makeSurvTask(id = "VET", data = veteran, target = c("time", "status"))
vet.task <- createDummyFeatures(vet.task)
dat = generateFilterValuesData(task = vet.task,
  method = "randomForestSRC_var.select",
  nselect = 5,
  more.args = list("randomForestSRC_var.select" = list(method = "md", nrep = 3))
)
dat
#> FilterValues:
#> Task: VET
#>                  name    type                     method value
#> 1:     celltype.large numeric randomForestSRC_var.select 6.044
#> 2:     celltype.adeno numeric randomForestSRC_var.select 4.335
#> 3: celltype.smallcell numeric randomForestSRC_var.select 4.138
#> 4:  celltype.squamous numeric randomForestSRC_var.select 4.033
#> 5:              prior numeric randomForestSRC_var.select 3.894
#> 6:                age numeric randomForestSRC_var.select 3.614
#> 7:           diagtime numeric randomForestSRC_var.select 1.911
#> 8:              karno numeric randomForestSRC_var.select 1.844
#> 9:                trt numeric randomForestSRC_var.select 0.902

Created on 2019-11-26 by the reprex package (v0.3.0)

At the moment, in Filter.R, randomForestSRC::var.select is called with always.use = getTaskFeatureNames(task), so all features are always returned. However, randomForestSRC::var.select returns a component, topvars, which holds the variables that ranked above the threshold. Although the filter uses topvars for method = "md", it is overridden by the always.use argument. It is not used for method = "vh", but should be.

I propose that the always.use argument be removed and that in filterFeatures.R, nselect be adjusted to the number of non-null values.
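
The nselect adjustment could look roughly like the sketch below. It is plain base R and does not touch mlr's internals; the helper name cap_nselect and the example values are hypothetical, and it assumes that features below the threshold come back as NA.

```r
# Hypothetical helper: limit nselect to the number of features that
# actually received a filter value (i.e. were not dropped by the threshold).
cap_nselect <- function(values, nselect) {
  n_available <- sum(!is.na(values))
  min(nselect, n_available)
}

# Made-up filter values; NA marks features that fell below the threshold.
values <- c(karno = 1.8, trt = 0.9, age = NA, prior = NA, diagtime = 1.9)

n_keep <- cap_nselect(values, nselect = 5)

# sort() drops NA entries by default, so only thresholded features remain.
selected <- names(sort(values, decreasing = TRUE))[seq_len(n_keep)]
```

With this, asking for five features when only three passed the threshold yields three, rather than padding the selection with features that have no value.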

I have coded this and can do a PR.

@larskotthoff
Sponsor Member

Thanks, but if I understand you correctly, that would be equivalent to thresholding the returned values from the current implementation. Is there anything that this change would allow us to do that we can't do at the moment?

@annette987
Contributor Author

Yes. The minimal depth algorithm, implemented in randomForestSRC::var.select.rfsrc, calculates the best threshold for the data and, as a result, identifies the most important variables. We cannot currently access that information through mlr, so we have to rely on the usual tuning methods to select a threshold, which may not give as good a result.
It is a very minor code change and would be more in line with the way the filter was designed to work.
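
For reference, a minimal sketch of reading the thresholded result directly from randomForestSRC, outside mlr. It assumes the veteran data shipped with randomForestSRC and that the returned object exposes topvars and md.obj$threshold as described in the package documentation; ntree = 100 is an arbitrary choice to keep the run short.

```r
library(survival)         # provides Surv()
library(randomForestSRC)

data(veteran, package = "randomForestSRC")

# Minimal depth variable selection on the raw data frame (no mlr task).
vs <- var.select(Surv(time, status) ~ ., data = veteran,
                 method = "md", ntree = 100, verbose = FALSE)

vs$topvars            # variables that passed the minimal depth threshold
vs$md.obj$threshold   # the threshold computed from the forest
```

These are the two pieces of information that are lost when every feature is forced into the result.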

@larskotthoff
Sponsor Member

Ah ok, sounds good. Could you make the PR please?
