class imbalance, sparse data, and random survival forest #138
-
Technically, this is a very different scenario from the class-imbalanced data problem. That problem applies to classification with imbalanced class frequencies, whereas your scenario is survival analysis with a high censoring rate. High censoring is typical in medical settings and does not require any change to the methodology for random survival forests, so you do not need to make any modifications for risk assessment and so forth. However, I would recommend using the Brier score as the performance metric wherever possible, since the C-index can be biased when censoring is high. The function |
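To make the Brier score recommendation concrete, here is a minimal sketch of a time-dependent Brier score with inverse-probability-of-censoring weights (IPCW), written in plain Python. It is not the randomForestSRC implementation; the function names (`censoring_km`, `step_value`, `ipcw_brier`) are made up for illustration, ties between events and censorings are handled naively, and the predicted survival probabilities are assumed to come from whatever model you fitted.

```python
def censoring_km(times, events):
    """Kaplan-Meier estimate of the censoring distribution G(t) = P(C > t).

    The event indicator is flipped: a censored observation (event == 0)
    counts as the "event" of interest here. Returns sorted (time, G) steps.
    """
    g = 1.0
    steps = []
    for t in sorted(set(times)):
        at_risk = sum(1 for u in times if u >= t)
        censored = sum(1 for u, e in zip(times, events) if u == t and e == 0)
        g *= 1.0 - censored / at_risk
        steps.append((t, g))
    return steps


def step_value(steps, t, left_limit=False):
    """Evaluate the step function at t, or just before t if left_limit is set."""
    value = 1.0
    for time, g in steps:
        if time < t or (time == t and not left_limit):
            value = g
        else:
            break
    return value


def ipcw_brier(times, events, surv_at_t, t):
    """IPCW Brier score at horizon t.

    surv_at_t[i] is the model's predicted P(T_i > t) for subject i.
    Subjects censored before t contribute zero; their weight is
    redistributed to the remaining subjects through 1 / G.
    """
    steps = censoring_km(times, events)
    total = 0.0
    for time, event, s in zip(times, events, surv_at_t):
        if time <= t and event == 1:
            # Event observed before the horizon: true status is 0.
            total += (0.0 - s) ** 2 / step_value(steps, time, left_limit=True)
        elif time > t:
            # Still event-free at the horizon: true status is 1.
            total += (1.0 - s) ** 2 / step_value(steps, t)
    return total / len(times)


# Toy data: one subject is censored at time 2, before the horizon t = 2.5.
score = ipcw_brier([1, 2, 3, 4], [1, 0, 1, 1], [0.1, 0.5, 0.8, 0.9], t=2.5)
print(round(score, 5))
```

With no censoring every weight equals 1 and the score reduces to an ordinary mean squared error between predicted and observed survival status at the horizon.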
-
It's not necessarily the RSF mechanism that is at play. Rather, the survival framework itself accounts for both censored and uncensored data, and this is what makes it possible to deal with high censoring rates. The estimators used in survival analysis, Nelson-Aalen or Kaplan-Meier (KM), are specifically designed to handle censoring. The problem in class-imbalanced data analysis is that the Bayes rule is used for classification. The underlying estimators of the conditional distribution are just fine (just as in survival analysis), but classification based on the Bayes rule performs extremely poorly on highly imbalanced data. It is permissible to use log-rank when proportional hazards are violated. The PBC data set is a great example of how well log-rank can work in such settings. I think you can construct examples where you can do better with modifications, such as different splitting rules, but as a general rule it should work just fine.
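As a rough illustration of how these estimators use censored observations, here is a small plain-Python sketch of the Kaplan-Meier and Nelson-Aalen estimators. The function names and the toy data are made up for this example, and this is not how randomForestSRC computes them internally. The point is that a censored subject stays in the risk set until its censoring time, so it still contributes information even though its event is never observed.

```python
def kaplan_meier(times, events):
    """Return [(t, S(t))] at each distinct observed event time."""
    surv = 1.0
    curve = []
    for t in sorted(set(times)):
        # Censored subjects are counted as at risk up to their censoring time.
        at_risk = sum(1 for u in times if u >= t)
        deaths = sum(1 for u, e in zip(times, events) if u == t and e == 1)
        if deaths:
            surv *= 1.0 - deaths / at_risk
            curve.append((t, surv))
    return curve


def nelson_aalen(times, events):
    """Return [(t, H(t))], the cumulative hazard at each observed event time."""
    cum_hazard = 0.0
    curve = []
    for t in sorted(set(times)):
        at_risk = sum(1 for u in times if u >= t)
        deaths = sum(1 for u, e in zip(times, events) if u == t and e == 1)
        if deaths:
            cum_hazard += deaths / at_risk
            curve.append((t, cum_hazard))
    return curve


# Toy data: event indicator 1 = event observed, 0 = censored.
times = [1, 2, 3, 4, 5]
events = [1, 0, 1, 0, 1]
for t, s in kaplan_meier(times, events):
    print(t, round(s, 4))
```

Even with a very low event rate, both estimators remain well defined as long as at least one event is observed; the censored subjects simply enlarge the risk sets.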
-
As I said, there's nothing really special about the high censoring rate case, except that the C-index may be biased. Since variable importance uses the C-index, this could be a problem. However, variable importance is a difference, and this bias is likely cancelled out in the differencing operation. I use highly censored data all the time (the usual scenario in medical applications) and have not noticed anything unusual. Currently we do not provide variable importance measures based on the Brier score, for example, which could potentially remove such bias if it existed. The high error rate suggests you are not able to achieve high performance. That's just how it is sometimes for a machine learning problem. Did you try tuning the forest parameters, like |
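To show what "variable importance is a difference" means, here is a simplified plain-Python sketch: Harrell's C-index, and a permutation importance computed as the error after shuffling one predictor minus the error before. This is only an illustration under stated assumptions. The names `harrell_c` and `permutation_vimp`, the `predict` callback, and the toy data are hypothetical, and randomForestSRC computes its importance on out-of-bag data inside the forest rather than like this. Both error terms are computed on the same subjects with the same censoring pattern, which is why a censoring-induced bias in the C-index would tend to appear in both and largely cancel.

```python
import random


def harrell_c(times, events, risk):
    """Harrell's C-index: fraction of usable pairs ranked correctly by risk.

    A pair (i, j) is usable when subject i has an observed event strictly
    before subject j's follow-up time. Tied risk scores count as 0.5.
    """
    concordant = 0.0
    usable = 0
    n = len(times)
    for i in range(n):
        for j in range(n):
            if events[i] == 1 and times[i] < times[j]:
                usable += 1
                if risk[i] > risk[j]:
                    concordant += 1.0
                elif risk[i] == risk[j]:
                    concordant += 0.5
    return concordant / usable


def permutation_vimp(times, events, rows, predict, column, seed=0):
    """Importance of one predictor = error after shuffling it - baseline error."""
    rng = random.Random(seed)
    base_error = 1.0 - harrell_c(times, events, [predict(r) for r in rows])
    shuffled = [r[column] for r in rows]
    rng.shuffle(shuffled)
    permuted = [r[:column] + [v] + r[column + 1:] for r, v in zip(rows, shuffled)]
    perm_error = 1.0 - harrell_c(times, events, [predict(r) for r in permuted])
    return perm_error - base_error


# Toy data: a hypothetical risk model that uses only the first predictor.
times = [1, 2, 3, 4]
events = [1, 0, 1, 1]
rows = [[4.0, 0.3], [3.0, 0.9], [2.0, 0.1], [1.0, 0.5]]
predict = lambda row: row[0]
print(permutation_vimp(times, events, rows, predict, column=1))
```

Shuffling a predictor the model never uses leaves the predictions unchanged, so its importance is exactly zero no matter how biased the C-index itself might be.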
-
We are currently evaluating the comparative time to cardiovascular (CV) adverse events in patients taking drug A vs. drug B vs. drug C, based on 60+ baseline predictors. We will be using your randomForestSRC package. Should we perform any feature reduction? Of note, these CV outcomes are rare events: of more than 3,000 patients, fewer than 200 developed CV events, including 5 on drug A, 62 on drug B, and 62 on drug C. I read in your earlier post that the Brier score is recommended in the imbalanced-data scenario. Should any other evaluation metrics also be considered, e.g., AUC or calibration plots?
-
The Brier score seems like a good idea.
-
We have been performing survival analyses using random survival forests (RSF) on data with apparent class imbalance (for example, a 5:95 event-to-nonevent ratio) and potentially sparse data. Our interest is in finding and evaluating potential risk predictors for the clinical outcomes described in such datasets.
My questions are as follows:
What are your thoughts on how class imbalance should be handled when performing RSF calculations? If we have low outcome event rates (such as a 5% event rate), would this be an issue? What adjustments to the RSF calculations, code, or hyperparameters would you recommend when evaluating data with class imbalance, so that one can have some assurance about the predictors that are found and their rank order? Does the imbalanced() function of randomForestSRC apply only to random forest (RF) or also to RSF?
Would RSF be suitable for finding and evaluating risk predictors in sparse data? Our dataset is a mix of dense and sparse data. What adjustments, if any, would you recommend to allow proper use of RSF to discover and evaluate risk predictors in such a dataset?