This is the internal function used by In your case, since n < 1000, it looks like the subratio is exp(-1) = 0.368. Therefore, 36.8% of the data is subsampled (without replacement) in each iteration of the subsampling procedure. Note that the double bootstrap does not make use of the
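As a quick check of the arithmetic above, here is a minimal sketch of what a ratio of exp(-1) means for the sample size of 801 reported in this thread. The rounding step is an assumption for illustration only; the package may round the subsample size differently.

```python
import math

n = 801                     # sample size reported in this thread
subratio = math.exp(-1)     # ratio quoted above, about 0.368

# Hypothetical rounding to a whole number of cases (assumption, not the package's rule).
m = round(subratio * n)

print(f"subratio = {subratio:.3f}, subsample size of about {m} out of {n} cases")
```

So each subsample would contain roughly 295 of the 801 participants, drawn without replacement.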
-
I have some follow-up questions to a recent exchange in this discussion group on the subsample function for extracting confidence intervals for VIMPs. As background, we are interested in two models: 1) overall survival, and 2) competing risks for death with 4 different causes of death. Our sample is 801 participants with 243 deaths, distributed across the 4 causes as follows: 51 due to cause 1, 61 due to cause 2, 38 due to cause 3, and 93 due to cause 4. We have 56 variables. Our RF for overall survival has a C-index of approximately 0.7 using nodesize=15, mtry=8, nsplit=10, and 10,000 trees. We specified importance="permute" in our forest run.
We are very interested in inference of our VIMPs for these 56 variables. We would like to use a set of objective settings for the subsample function. As mentioned in my previous post, we found that a previous version of the randomForestSRC package, using the default settings for subsample, gave us a larger list of important variables (based on the confidence limits) than the current default settings. Given the previous post, I have the following additional questions:
1) I believe the value above is the previous default setting for subratio? I say this because when I set subratio to this value in the current version of randomForestSRC, I obtain results somewhat similar to those we obtained with the older version of randomForestSRC. If so, can you please tell me what the current default setting for subratio is?

2) In Ishwaran and Lu (2019), you recommend a subsample size of b = n^(1/2) (a subratio of the inverse square root of n) to balance bias and MSE when the goal is to find true signal among the variables, in most cases. However, you also state that setting b = n^(3/4) might be better for weeding out variables with poor signal (true negative rate). Since users have no objective performance measure on which to base these settings when applying the subsample call, are there guidelines we can use to select a subratio (and the number of subsamples; I know the default is B=100)? We are interested in finding variables with both strong and moderate signal.

3) Given that our sample size and number of variables are not prohibitive, and we have the time to let the subsample call run longer, do you recommend using the double bootstrap method? If so, are there guidelines for setting subratio and B for the double bootstrap? Again, we hope to identify variables with both moderate and strong signal.
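To make the two candidate rates in the second question concrete for our data, here is a small sketch of the implied subsample sizes at n = 801. It reads the "inverse square root" rate as a subratio of n^(-1/2), that is b = n^(1/2), and takes subratio to mean b / n; both readings are assumptions for illustration.

```python
n = 801  # sample size reported in this post

# Candidate subsample sizes b, expressed as powers of n.
candidates = {"n^(1/2)": n ** 0.5, "n^(3/4)": n ** 0.75}

# Assumption: subratio is the subsample size divided by the sample size.
ratios = {name: b / n for name, b in candidates.items()}

for name, b in candidates.items():
    print(f"b = {name}: about {b:.1f} cases, subratio about {ratios[name]:.3f}")
```

Both candidates give far smaller subsamples than a ratio of exp(-1), which may help explain why the list of variables flagged as important changes with the setting.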
Thank you!
Mindy