Minimal depth is implemented in mlr through the filter randomForestSRC_var.select with the argument method = "md". However, the mlr implementation returns a value for every feature, so the threshold information is lost and the built-in thresholding cannot be used for feature selection.
At the moment, in Filter.R, randomForestSRC::var.select is called with always.use = getTaskFeatureNames(task), so all features are always returned. However, randomForestSRC::var.select also returns topvars, the variables that rank above the threshold. Although the filter uses topvars for method = "md", the always.use argument overrides it. topvars is not used for method = "vh", but should be.
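To make the difference concrete, here is a minimal sketch that calls randomForestSRC::var.select directly, outside mlr, once with the default settings and once with always.use set to every feature. The iris data, the ntree value, and the variable names (feats, res_md, res_all) are my own choices for illustration and are not taken from Filter.R.

```r
library(randomForestSRC)

# Feature names of the example data (iris is an assumption, not from the issue).
feats <- setdiff(names(iris), "Species")

# Minimal depth selection using the threshold computed by randomForestSRC.
res_md <- var.select(Species ~ ., data = iris, method = "md",
                     ntree = 100, verbose = FALSE)

# Same call, but forcing every feature into the selection via always.use,
# which mirrors what the issue describes for the current filter.
res_all <- var.select(Species ~ ., data = iris, method = "md",
                      ntree = 100, always.use = feats, verbose = FALSE)

# topvars holds the names of the variables that var.select keeps.
print(res_md$topvars)
print(res_all$topvars)
```

If the behaviour is as described above, the second topvars should contain every feature, while the first may be a strict subset, depending on the data and the forest.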
I propose removing the always.use argument and, in filterFeatures.R, adjusting nselect to the number of non-null values.
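As a rough sketch of the nselect adjustment, assuming the filter marks unselected features with NA and that larger scores are better: the helpers adjust_nselect and select_features and the example scores below are hypothetical and are not the actual code in filterFeatures.R or in my branch.

```r
# Hypothetical filter output: features above the threshold carry a score,
# features below it are NA (an assumption about how "null" is represented).
filter_values <- c(Sepal.Length = NA, Sepal.Width = NA,
                   Petal.Length = 1.2, Petal.Width = 1.05)

# Cap nselect at the number of features that actually received a score.
adjust_nselect <- function(values, nselect = length(values)) {
  n_available <- sum(!is.na(values))
  min(nselect, n_available)
}

# Pick the top features by score, never more than were scored.
select_features <- function(values, nselect = length(values)) {
  n <- adjust_nselect(values, nselect)
  ranked <- sort(values, decreasing = TRUE, na.last = NA)  # drops NA entries
  names(ranked)[seq_len(n)]
}

print(select_features(filter_values))
```

With this, asking for all four features would return only the two that were scored, so the threshold chosen by var.select carries through to the selection.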
I have coded this and can do a PR.
Thanks, but if I understand you correctly, that would be equivalent to thresholding the returned values from the current implementation. Is there anything that this change would allow us to do that we can't do at the moment?
Yes. The minimal depth algorithm, implemented in randomForestSRC::var.select.rfsrc, calculates the best threshold for the data and, from that, the most important variables. We cannot currently access that information through mlr, so we have to rely on the usual tuning methods to select a threshold, which may not give as good a result.
It is a very minor change in terms of the code and would be more in line with the way the filter was designed to work.
According to Ishwaran et al. (https://www.tandfonline.com/doi/abs/10.1198/jasa.2009.tm08622), the minimal depth of a random forest can be used for high-dimensional variable selection.
Created on 2019-11-26 by the reprex package (v0.3.0)