Follow-up questions on subsampling for VIMP CI #450

cmnelson1 · 2024-10-11T19:35:28Z

I had some follow-up questions to a recent exchange in this discussion group on the subsample function for extracting confidence intervals on VIMPs. As background, we are interested in two different models: 1) overall survival, and 2) competing risk for death with 4 different causes of death. Our sample is 801 participants with 243 deaths. The 4 causes of death were distributed this way: 1) 51 deaths due to cause 1, 2) 61 due to cause 2, 3) 38 due to cause 3, and 4) 93 due to cause 4. We have 56 variables. Our RF for overall survival has a C index of approximately 0.7 using node size=15, mtry=8, nsplit=10, and 10,000 trees. We specified importance="permute" in our forest run.

We are very interested in inference of our VIMPs for these 56 variables. We would like to use a set of objective settings for the subsample function. As mentioned in my previous post, we found that a previous version of the randomForestSRC package, using the default settings for subsample, gave us a larger list of important variables (based on the confidence limits) than the current default settings. Given the previous post, I have the following additional questions:

The subratio argument can be set between 0 and 1. As defined in the manual:

**subratio**            
 _Ratio of subsample size to original sample size. The default is approximately equal to the inverse square root of the sample size._

If subratio is the ratio of the subsample size to the original sample size, then I believe the "inverse square root of the sample size" default describes how the subsample size is calculated, not the subratio itself. Is that correct? For our sample size of 801, the square root is 28.3. To get the subratio, we would divide 28.3 by 801 to get 0.0353, and we would put subratio=0.0353 into our subsample call?

Also, I believe the above is the previous default setting for subratio? I say this because when I set subratio to this value in the current version of randomForestSRC, I obtain results somewhat similar to those we obtained with the older version of randomForestSRC. If so, can you please tell me what the current default setting for subratio is?

In Ishwaran and Lu, 2019, you recommend the square root of n as the subsample size to balance bias and MSE when the goal is finding true signal among the variables, in most cases. However, you also state that setting b=n^(3/4) might be better for weeding out variables with poor signal (true negative rate). Given the absence of an objective, performance-based measure for users to select these settings, are there some guidelines we can use to select a subratio (and also the number of subsamples; I know the default is B=100)? We are interested in finding variables with both strong and moderate signal.

Given that our sample size and number of variables are not prohibitive, and we have the time to let the subsample call run longer, do you recommend using the double bootstrap method? If so, are there some guidelines for setting subratio and "B" for the double bootstrap? Again, we hope to ascertain variables with both moderate and strong signal.

Thank you!
Mindy

ishwaran · 2024-10-25T19:46:04Z

Collaborator

This is the internal function used by subsample to set the ratio

> randomForestSRC:::get.subsample.subratio
function (n, base.n = 1000, base.r = exp(-1)) 
{
    if (n < base.n) {
        subratio <- base.r
    }
    else {
        subratio <- base.r * sqrt(base.n)/sqrt(n)
    }
}

In your case, since n < 1000, it looks like the subratio is exp(-1) = 0.368.

Therefore, 36.8% of the data is being subsampled (without replacement) for each iteration of the subsampling procedure.
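To make the arithmetic concrete, here is a small base-R sketch that re-types the rule printed above so it runs without loading randomForestSRC, and evaluates it at n = 801 next to the "inverse square root" value from the question. The name `default_subratio` is made up for this sketch, and rounding the subsample size to a whole number is an assumption, not something confirmed from the package.

```r
# Standalone copy of the rule printed above. `default_subratio` is a
# hypothetical name for this sketch, not a package function.
default_subratio <- function(n, base.n = 1000, base.r = exp(-1)) {
  if (n < base.n) {
    base.r
  } else {
    base.r * sqrt(base.n) / sqrt(n)
  }
}

n <- 801
ratio_now      <- default_subratio(n)  # exp(-1), because 801 < 1000
ratio_inv_sqrt <- 1 / sqrt(n)          # the "inverse square root" value from the question

round(ratio_now, 4)        # 0.3679
round(ratio_inv_sqrt, 4)   # 0.0353
round(n * ratio_now)       # about 295 cases per subsample (rounding is an assumption)
round(n * ratio_inv_sqrt)  # about 28 cases per subsample
```

For n at or above 1000 the rule shrinks the ratio in proportion to 1/sqrt(n); for example, `default_subratio(4000)` is half of exp(-1).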

Note that the double bootstrap does not make use of the subratio argument. This is because the double bootstrap uses sampling with replacement.
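The difference between the two sampling schemes can be illustrated with plain index sampling in base R. This is only a sketch of the draws themselves, not the package's internal code, and it shows a single with-replacement draw rather than the full double bootstrap procedure. A subsample takes a fraction of the cases without replacement, whereas a bootstrap draw takes n cases with replacement, so there is no ratio to choose.

```r
set.seed(1)
n <- 801
subratio <- exp(-1)

# Subsampling: m < n distinct cases, drawn without replacement
m <- round(subratio * n)
sub_idx <- sample(n, size = m, replace = FALSE)

# Bootstrap: n cases drawn with replacement, so some cases repeat
boot_idx <- sample(n, size = n, replace = TRUE)

length(sub_idx)            # m
anyDuplicated(sub_idx)     # 0: no case appears twice
length(boot_idx)           # n
length(unique(boot_idx))   # fewer than n distinct cases
```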


cmnelson1 Feb 12, 2025
Author

Thank you for all of your previous responses on the vimp subsampling methods in rfsrc.

In reviewing some output from extract.subsample, we found that there are circumstances in which the "pvalue" is <0.05 and yet the CI brackets "0" and, thus, the "signif" is "FALSE"; we found other circumstances in which the "pvalue" is >0.05 and yet the CI does not bracket "0" and, thus, the "signif" is "TRUE".

We have reviewed the code for extract.subsample but did not quite figure out exactly what the "pvalue" is calculating. Can you please clarify what "pvalue" is estimating?

Here are some examples from our output of extract.subsample:
$var.sel
variable  lower        median       upper        pvalue    signif
var1      0.796624105  2.130799053  2.580163131  0.040000  TRUE
var2      0.754567208  1.387734904  1.683530076  0.020000  TRUE
var3      0.135952346  1.115181006  1.388929405  0.100000  TRUE
var4      0.26865884   0.939832711  1.23562892   0.080000  TRUE
var5      0.128472849  0.740296377  1.022851665  0.150000  TRUE
var6      0.20196012   0.655561618  0.850679504  0.060000  TRUE
var7      0.213780907  0.509531303  0.617180975  0.040000  TRUE
var8      0.009281197  0.068997871  0.103432463  0.070000  TRUE

$var.sel.Z
variable  lower         mean         upper       pvalue    signif
var1      0.812851213   1.659256224  2.50566123  0.000061  TRUE
var2      0.532561379   1.06351996   1.59447854  0.000043  TRUE
var3      0.268946857   0.92092666   1.57290646  0.002816  TRUE
var13     0.163555281   0.831497243  1.4994392   0.007346  TRUE
var4      0.277317445   0.798554255  1.31979106  0.001338  TRUE
var5      0.102163019   0.621402857  1.14064269  0.009498  TRUE
var6      0.081959896   0.513838545  0.94571719  0.009853  TRUE
var7      0.149213543   0.374109471  0.5990054   0.000556  TRUE
var9      -0.073728904  0.406152749  0.8860344   0.048574  FALSE
var10     -0.039099805  0.379072091  0.79724399  0.037808  FALSE
var11     -0.028332608  0.339163762  0.70666013  0.035237  FALSE
var12     -0.000246144  0.337512449  0.67527104  0.025084  FALSE

Thank you!

ishwaran Feb 17, 2025
Collaborator

I think the issue you are referring to relates to the $var.sel object. It holds the nonparametric estimator, which is derived without assuming normality. Because the confidence interval is not obtained by pivoting, the p-value and the confidence interval can disagree.
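As a side check on the normal-based table, two of the $var.sel.Z rows posted above can be back-calculated in base R. The sketch assumes each interval is a symmetric two-sided 95% interval of the form mean ± qnorm(0.975) × SE. That is an assumption read off the posted numbers, not something confirmed from the package source. Under it, a one-sided normal p-value closely matches the printed pvalue column for var9 and var10, which would be consistent with a pvalue below 0.05 sitting next to an interval that covers 0.

```r
# Rows copied from the $var.sel.Z output posted above
z_tab <- data.frame(
  variable = c("var9", "var10"),
  lower    = c(-0.073728904, -0.039099805),
  mean     = c(0.406152749, 0.379072091),
  upper    = c(0.8860344, 0.79724399),
  pvalue   = c(0.048574, 0.037808)
)

# Assumption: lower/upper = mean -/+ qnorm(0.975) * se
se <- (z_tab$upper - z_tab$lower) / (2 * qnorm(0.975))
z  <- z_tab$mean / se

p_one_sided <- pnorm(z, lower.tail = FALSE)  # upper-tail probability
p_two_sided <- 2 * p_one_sided

# Printed pvalue next to the two back-calculated versions
cbind(z_tab[, c("variable", "pvalue")],
      p_one_sided = round(p_one_sided, 6),
      p_two_sided = round(p_two_sided, 6))
```

If the assumption holds, doubling the one-sided value gives a two-sided p-value above 0.05 for both rows, in agreement with intervals that include 0.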

All of this is to say that the extract.subsample function is not meant for users but is an internal function which we use for development and so forth - some of which is exploratory.

I would recommend you just print the object as in the help file example: print is a wrapper to the extract.subsample function (in fact if you look at the print.subsample function you will see how we prefer for it to be used):

 ## load the package
 library(randomForestSRC)

 ## training the forest
 reg.o <- rfsrc(Ozone ~ ., airquality)

 ## default subsample call
 reg.smp.o <- subsample(reg.o)

 ## summary of results
 print(reg.smp.o)
 print(reg.smp.o, alpha=.10)
