class imbalance, sparse data, and random survival forest #138

seaspray09 · 2021-12-23T02:33:18Z

We have been performing survival analyses using random survival forests (RSF) where we have apparent class imbalance (for example, a 5:95 event to nonevent ratio) and potential sparse data. Our interests are in finding and evaluating potential risk predictors for clinical outcomes described in such datasets.

My questions are as follows:

1. What are your thoughts on how class imbalance should be handled when performing RSF calculations? If we have low outcome event rates (such as a 5% event rate), would this be an issue for RSF? What adjustments to the RSF code or hyperparameters would you recommend for data with class imbalance, so that one can have some assurance about the predictors that are found and their rank order? Does the imbalanced() function of randomForestSRC apply only to random forest (RF) or also to RSF?
2. Would RSF be suitable for finding and evaluating risk predictors in sparse data? Our dataset is a mix of dense and sparse data. What adjustments, if any, would you recommend to allow for proper use of RSF to discover and evaluate risk predictors from such a dataset?

ishwaran · 2021-12-23T17:29:13Z

Collaborator

Technically, this is a very different scenario from the class imbalanced data problem, which applies to classification with imbalanced class frequencies, whereas your scenario is survival analysis in the presence of a high censoring rate. High censoring is typical in medical settings and does not require any change of methodology for random survival forests. Therefore, you do not need to make any modifications for risk assessments and so forth. However, I would recommend that you use the Brier score as the performance metric wherever possible, since the C-index can be biased in high censoring cases. The function get.brier.survival could be useful for that. See also the vignette on random survival forests: https://luminwin.github.io/randomForestSRC/articles/survival.html
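
A minimal sketch of the Brier score calculation mentioned above, assuming the randomForestSRC package is installed. The bundled pbc data, the ntree value, and the seed are illustrative choices only, not part of the original reply:

```r
library(randomForestSRC)

# Illustrative data: the pbc data shipped with randomForestSRC,
# with incomplete rows dropped for simplicity.
data(pbc, package = "randomForestSRC")
pbc.na <- na.omit(pbc)

# Fit a random survival forest; the seed makes the run repeatable.
o <- rfsrc(Surv(days, status) ~ ., data = pbc.na, ntree = 500, seed = -1)

# Out-of-bag Brier score over time, with the censoring distribution
# estimated by Kaplan-Meier.
bs.km <- get.brier.survival(o, cens.model = "km")$brier.score

# Lower values indicate better prediction at that time point.
plot(bs.km, type = "s", col = 2)
```

Setting cens.model = "rfsrc" instead estimates the censoring distribution with a forest rather than Kaplan-Meier.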

seaspray09 Dec 27, 2021
Author

Thank you for providing this invaluable insight, Dr. Ishwaran. As a follow-up, I was wondering how the mechanism within random survival forests accounts for the presence of a high censoring rate. Is it due to the use of the log-rank test statistic to select the cutpoints for predictors at the branch points, or something else? Also, how permissible would calculations based on the log-rank statistic be in situations where the proportional hazards assumption is violated?

ishwaran · 2021-12-27T15:37:55Z

Collaborator

It's not necessarily the RSF mechanism that is at play, but rather that the survival framework itself accounts for both censored and uncensored data, and this is what makes it possible to deal with high censoring rates. Estimators used in survival analysis, such as Nelson-Aalen or Kaplan-Meier (KM), are specifically designed to handle censoring.
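
To make this concrete, here is a small base-R sketch of the Kaplan-Meier estimator on made-up data (the times and event indicators are hypothetical). Censored subjects never count as events, but they stay in the risk set until their censoring time, which is how the estimator makes use of them:

```r
# Hypothetical follow-up times and event indicators (1 = event, 0 = censored).
time   <- c(2, 3, 3, 5, 7, 8, 9, 10)
status <- c(1, 0, 1, 0, 0, 1, 0, 0)

# Distinct event times, in increasing order.
event.times <- sort(unique(time[status == 1]))

# At each event time: d = number of events, n = number still at risk.
d <- sapply(event.times, function(t) sum(time == t & status == 1))
n <- sapply(event.times, function(t) sum(time >= t))

# Kaplan-Meier estimate: running product over event times of (1 - d / n).
km <- cumprod(1 - d / n)
data.frame(time = event.times, at.risk = n, events = d, survival = km)
```

Here five of the eight subjects are censored, yet each one still contributes to the at-risk counts up to the time it leaves the study.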

The problem with class imbalanced data analysis is that the Bayes rule is used for classification. The underlying estimators used to estimate the conditional distribution are just fine (just like in survival analysis), but the problem is that classification is based on the Bayes rule, which will have extremely poor performance in highly imbalanced data.
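
A toy base-R illustration of this point, using hypothetical numbers: even when the estimated class probabilities rank the cases sensibly, thresholding at 0.5 (the Bayes rule) may never predict the rare class. The prevalence-based cutoff is shown only for contrast and is an assumption of this sketch, not something stated in the reply:

```r
# Hypothetical estimated probabilities of the minority class for 10 cases;
# the last two cases are the true minority cases.
p.hat <- c(0.02, 0.03, 0.05, 0.04, 0.06, 0.03, 0.08, 0.05, 0.30, 0.40)
truth <- c(rep(0, 8), 1, 1)

# Bayes rule: predict the minority class only when p.hat > 0.5.
pred.bayes <- as.integer(p.hat > 0.5)
sens.bayes <- sum(pred.bayes == 1 & truth == 1) / sum(truth == 1)

# For contrast: threshold at the minority prevalence (0.2 here).
pred.prev <- as.integer(p.hat > mean(truth))
sens.prev <- sum(pred.prev == 1 & truth == 1) / sum(truth == 1)

# Sensitivity (true positive rate) under each cutoff.
c(bayes = sens.bayes, prevalence = sens.prev)
```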

It is permissible to use log-rank when proportional hazards are violated. The PBC data set is a great example which shows how well log-rank can work in such settings. I think you can construct examples where you can do better under modifications, using different splitting rules, etc., but as a general rule it should work just fine.

seaspray09 Dec 29, 2021
Author

Thank you for providing these additional clarifications, Dr. Ishwaran.

Since survival frameworks account for highly censored data, is there therefore no need to be concerned about high class imbalance when looking for predictors of a disease outcome using RSF?

Specifically, for a scenario that is highly censored (for example 100 disease events among 2000 participants over a certain time period):

What would be your recommendations for performing feature selection from input data containing approximately 30 to 40 variables followed by assessing importance among predictors?
What may be appropriate ways to interpret "important predictors" in this situation?

To find important predictors towards our intended disease event outcome, we have tried to run RSF using randomForestSRC as follows:

rfsrc(Surv(time, event)~., data, ntree=1000, family="surv", err.block=1, tree.err=TRUE, importance=TRUE)

The output is yielding performance error rates of approximately 0.40, so this is also of concern.

ishwaran · 2021-12-29T14:02:02Z

Collaborator

It's just like I said: there's nothing really special about the high censoring rate case, except that the C-index may be biased. Since variable importance uses the C-index, this could be a problem; however, variable importance is a difference, and it's likely this bias is cancelled out in the differencing operation. I use highly censored data all the time (the usual scenario in medical applications) and have not noticed anything unusual. Currently we do not provide variable importance measures using the Brier score, for example, which could potentially remove such bias if it existed.
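
For reference, a minimal sketch of requesting permutation variable importance and ranking the predictors, assuming randomForestSRC is installed. The bundled pbc data and the settings are again purely illustrative:

```r
library(randomForestSRC)

# Illustrative data: pbc with incomplete rows dropped.
data(pbc, package = "randomForestSRC")
pbc.na <- na.omit(pbc)

# Fit the forest and request permutation importance (VIMP).
o <- rfsrc(Surv(days, status) ~ ., data = pbc.na, ntree = 500,
           importance = "permute", seed = -1)

# One VIMP value per predictor, sorted with the largest first.
vimp.sorted <- sort(o$importance, decreasing = TRUE)
print(vimp.sorted)
```

Values near zero or negative suggest a predictor adds little to out-of-bag prediction accuracy.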

The high performance error rate suggests you are not able to achieve high performance. That's just how it is sometimes for a machine learning problem. Did you try tuning the forest parameters, like nodesize and mtry? That might help.
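
A rough sketch of tuning nodesize and mtry with the package's tune() function, assuming randomForestSRC is installed and using the bundled pbc data as an illustration. Default search settings are used here because argument names for the search grid may differ between package versions:

```r
library(randomForestSRC)

# Illustrative data: pbc with incomplete rows dropped.
data(pbc, package = "randomForestSRC")
pbc.na <- na.omit(pbc)

# Search over nodesize and mtry using out-of-bag error.
tn <- tune(Surv(days, status) ~ ., data = pbc.na)
print(tn$optimal)  # the (nodesize, mtry) pair with the lowest OOB error

# Refit the forest with the selected values.
o <- rfsrc(Surv(days, status) ~ ., data = pbc.na, ntree = 500,
           nodesize = tn$optimal[["nodesize"]],
           mtry = tn$optimal[["mtry"]])
print(o)
```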

suengeek Dec 1, 2023

Dear Prof. Ishwaran,
I am fitting a random survival forest to predict CVD events. However, when I run the code for calculating the Brier score, I get an error:
###obtain Brier score using KM and RSF censoring distribution estimators
bs.km <- get.brier.survival(obj1, cens.model = "km")$brier.score
bs.rsf <- get.brier.survival(obj1, cens.model = "rfsrc")$brier.score

###plot the brier score
plot(bs.km, type = "s", col = 2)
lines(bs.rsf, type ="s", col = 4)
legend("bottomright", legend = c("cens.model = km", "cens.model = rfsrc"), fill = c(2,4))

and here is the error for 'bs.rsf':
Error in generic.predict.rfsrc(object, newdata, m.target = m.target, importance = importance, :
x-variables in test data do not match original training data.
I double-checked the data frame and the code but could not find the cause. Could you help me?

ishwaran Dec 11, 2023
Collaborator

That error is issued when the test data features do not match the training features; typically this occurs when you have factors in the test data that take different levels than those observed in the training data. Without a reproducible example, however, it is not possible for me to diagnose this further. Can you try to create a reproducible example?
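
One way to look for this kind of mismatch is to compare column names and factor levels between the two data frames. A base-R sketch with hypothetical data frames train and test:

```r
# Hypothetical training and test data; 'stage' has a level in test
# ("IV") that never appears in train.
train <- data.frame(age = c(50, 61, 47),
                    stage = factor(c("I", "II", "III")))
test  <- data.frame(age = c(58, 66),
                    stage = factor(c("II", "IV")))

# Columns present in train but missing from test.
missing.cols <- setdiff(names(train), names(test))

# For each factor in train, list test levels not seen in training.
is.fac <- names(train)[sapply(train, is.factor)]
new.levels <- lapply(is.fac, function(v) {
  setdiff(levels(factor(test[[v]])), levels(train[[v]]))
})
names(new.levels) <- is.fac

missing.cols
new.levels
```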

suengeek Feb 7, 2024

Dear Prof. Ishwaran, this is not a problem in the testing process; I just followed your illustration. It happens in the training process.

obj1 <- rfsrc(Surv(Mace.survival, Mace.indicator) ~ ., data = dta,
              ntree = 1000, nodesize = 57, seed = -123456, importance = "permute")
get.cindex(obj1$yvar[,1], obj1$yvar[,2], obj1$predicted.oob)

bs.km <- get.brier.survival(obj1, cens.model = "km")$brier.score
bs.rsf <- get.brier.survival(obj1, cens.model = "rfsrc")$brier.score
Error in generic.predict.rfsrc(object, newdata, m.target = m.target, importance = importance, :
x-variables in test data do not match original training data

YinanHuang11 · 2024-03-02T22:08:31Z

Currently, we are evaluating the comparative time to cardiovascular (CV) adverse events in patients after taking drug A vs. drug B vs. drug C, based on 60+ baseline predictors. We will be using your randomForestSRC package.

Should we perform any feature reduction? Of note, these CV outcomes are rare events: of more than 3000 patients, fewer than 200 developed CV events, including 5 for drug A, 62 for drug B, and 62 for drug C.

I read in your prior post that the Brier score is recommended in the scenario of imbalanced data. Should any other evaluation metrics also be considered, e.g., AUC or calibration plots?

ishwaran · 2024-03-05T03:51:14Z

Collaborator

The Brier score seems like a good idea.
