## Survival Analysis: Based on Random Forest

This tutorial walks through a typical workflow for Random Forest-based survival analysis, covering data preprocessing, model selection, and training & validation, using the R package `randomForestSRC`.

The workflow consists of the following steps:
1. Data Preprocessing
  - convert variables
  - load training and test set
2. Model Selection
  - cross validation
  - tune parameters
3. Training & Validation
  - train RSF model
  - measure CI on test set
  - survival rates at a time of interest
  
If you are unsure how Random Forest-based survival analysis works, see the official documentation [here](https://kogalur.github.io/randomForestSRC/theory.html).

### Step0 - Load library and Data

In [1]:
library('survival')
library('randomForestSRC')
# set random state
set.seed(0)

"package 'randomForestSRC' was built under R version 3.5.1"
 randomForestSRC 2.7.0 
 
 Type rfsrc.news() to see new features, changes, and bug fixes. 
 



In [2]:
data(veteran, package = "randomForestSRC")
cat("Number of samples:", nrow(veteran), "\n")
cat("Columns of dataset:", colnames(veteran), "\n")
veteran[c(1:5), ]

Number of samples: 137 
Columns of dataset: trt celltype time status karno diagtime age prior 


trt,celltype,time,status,karno,diagtime,age,prior
1,1,72,1,60,7,69,0
1,1,411,1,70,5,64,10
1,1,228,1,60,3,38,0
1,1,126,1,60,9,63,10
1,1,118,1,70,11,65,10


### Step1 - Data Preprocessing

You can either split the dataset into training and test sets, as done here, or load pre-split training and test data directly from files.

In [3]:
# Sample the data and create a training subset.
train <- sample(1:nrow(veteran), round(nrow(veteran) * 0.80))
data_train <- veteran[train, ]
data_test <- veteran[-train, ]
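
If the training and test sets already live in separate files, they can be read directly instead of being sampled. The following is a minimal sketch, assuming CSV files with the same columns as `veteran`. The file names `veteran_train.csv` and `veteran_test.csv` are hypothetical, and the files are written to a temporary directory here only so that the example runs end to end.

```r
library(randomForestSRC)
data(veteran, package = "randomForestSRC")
set.seed(0)

# Hypothetical file locations (an assumption, not part of the original tutorial).
train_path <- file.path(tempdir(), "veteran_train.csv")
test_path  <- file.path(tempdir(), "veteran_test.csv")

# Write an 80/20 split once so that there is something to read back.
idx <- sample(seq_len(nrow(veteran)), round(nrow(veteran) * 0.80))
write.csv(veteran[idx, ],  train_path, row.names = FALSE)
write.csv(veteran[-idx, ], test_path,  row.names = FALSE)

# Load the pre-split data from files.
data_train <- read.csv(train_path)
data_test  <- read.csv(test_path)
cat("Training rows:", nrow(data_train), " Test rows:", nrow(data_test), "\n")
```

Passing `row.names = FALSE` keeps the row index out of the files, so the columns read back match those of `veteran`.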

### Step2 - Model Selection

This section covers hyperparameter tuning by repeated k-fold cross validation on the training set.

As described by the author of `RandomSurvivalForest`, model selection can follow the guideline below.
> Reasonable models can be formed with the judicious selection of `mtry`, `nsplit`, `nodesize`, and `nodedepth` without exhaustive and deterministic splitting

**Optional Reading:**
You can also search for the best hyperparameters with the Python package `hyperopt`.

Following this approach, repeated 4-fold cross validation (3 repeats) was run on the training set, with `mtry` fixed to its default and `nsplit` fixed to 0. The search results are:
- "ci": 0.7422341993268755 (mean value of cross validation)
- "params": {"mtry": 5, "nodesize": 8, "ntree": 100}
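
The repeated k-fold search described above can be sketched directly in R. This is a minimal illustration under stated assumptions, not the procedure that produced the numbers above: the candidate grid is made up, and `harrell_c` and `cv_ci` are hypothetical helpers written for this example (`harrell_c` is a simplified concordance index that ignores tied survival times).

```r
library(survival)
library(randomForestSRC)
data(veteran, package = "randomForestSRC")
set.seed(0)

train <- sample(seq_len(nrow(veteran)), round(nrow(veteran) * 0.80))
data_train <- veteran[train, ]

# Simplified Harrell's C (hypothetical helper): among comparable pairs, where
# subject i died strictly before time j, count the share in which subject i
# has the higher risk score. Ties in risk count as 0.5.
harrell_c <- function(time, status, risk) {
  comparable <- outer(time, time, "<") & (status == 1)
  higher <- outer(risk, risk, ">")
  tied <- outer(risk, risk, "==")
  sum(comparable * (higher + 0.5 * tied)) / sum(comparable)
}

# Mean CI over `repeats` rounds of k-fold cross validation (hypothetical helper).
cv_ci <- function(data, params, k = 4, repeats = 3) {
  scores <- c()
  for (r in seq_len(repeats)) {
    folds <- sample(rep(seq_len(k), length.out = nrow(data)))
    for (f in seq_len(k)) {
      held_out <- data[folds == f, ]
      fit <- rfsrc(Surv(time, status) ~ ., data[folds != f, ],
                   ntree = params$ntree, nodesize = params$nodesize, nsplit = 0)
      p <- predict(fit, held_out)
      # Higher predicted mortality should correspond to shorter survival.
      scores <- c(scores, harrell_c(held_out$time, held_out$status, p$predicted))
    }
  }
  mean(scores)
}

# Illustrative candidate grid (an assumption, not the grid used in the tutorial).
grid <- expand.grid(ntree = c(50, 100), nodesize = c(5, 8, 15))
grid$ci <- sapply(seq_len(nrow(grid)), function(i) cv_ci(data_train, grid[i, ]))
print(grid[which.max(grid$ci), ])
```

An exhaustive grid is only practical for a handful of candidates; for larger search spaces a sequential optimiser such as `hyperopt` may be the better choice.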

### Step3 - Model Training & Evaluation

After hyperparameter tuning, we pass the selected values to the function `rfsrc` to train the final model, and then validate the fitted model on the test set.

Evaluation in this section includes:
- calculating the CI metric
- calculating the survival rate at a specified time
- saving the results to a file

**Notes:**

The RSF prediction for individual $i$ is the cumulative hazard estimate (CHE) summed over all distinct death times $t_k^*$, $k = 1, \dots, M$.

$$
\begin{equation}
\mathcal{M}_i = \sum_{k=1}^{M}\hat{H}_e^*(t_k^*|{\bf X}_i).
\end{equation}
$$

For details, see the [theory section of the documentation](https://kogalur.github.io/randomForestSRC/theory.html).

#### 3.0 - Model Training

In [4]:
# Train the model with the hyperparameters found in Step 2.
model <- rfsrc(Surv(time, status) ~ ., data_train, ntree=100, mtry=5, nodesize=8, nsplit=0, seed=0)
# Predict on the test set.
pred <- predict(model, data_test)
# Print summaries of the fitted model and of the test set prediction.
print(model)
print(pred)

                         Sample size: 110
                    Number of deaths: 102
                     Number of trees: 100
           Forest terminal node size: 8
       Average no. of terminal nodes: 14.67
No. of variables tried at each split: 5
              Total no. of variables: 6
       Resampling used to grow trees: swr
    Resample size used to grow trees: 110
                            Analysis: RSF
                              Family: surv
                      Splitting rule: logrank
                          Error rate: 28.32%

  Sample size of test (predict) data: 27
       Number of deaths in test data: 26
                Number of grow trees: 100
  Average no. of grow terminal nodes: 14.67
         Total no. of grow variables: 6
       Resampling used to grow trees: swr
    Resample size used to grow trees: 110
                            Analysis: RSF
                              Family: surv
                 Test set error rate: 36.57%



#### 3.1 - CI (concordance index)

The concordance index (CI), or its complement $1-\text{CI}$ (reported as the error rate), can be obtained in any of three ways:
- the test set error rate in the output of `print(pred)`
- the built-in function `cindex(T, E, Pred)`
- the function `rcorr.cens` from the package `Hmisc`

In [5]:
# pred$chf holds the CHE (cumulative hazard estimate) at every death time for each individual i
# pred$predicted holds the sum of the CHE over all death times for each individual i
cindex(data_test$time, data_test$status, -pred$predicted)

In [6]:
Hmisc::rcorr.cens(-pred$predicted, Surv(data_test$time, data_test$status))
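
The relationship between `pred$predicted` and `pred$chf` stated in the comments of In [5] can be checked numerically. The sketch below refits the model so that it runs on its own. It assumes that `pred$chf` has one row per test individual and one column per entry of `pred$time.interest`; the variable name `mortality_from_chf` is introduced only for this example.

```r
library(survival)
library(randomForestSRC)
data(veteran, package = "randomForestSRC")
set.seed(0)

train <- sample(seq_len(nrow(veteran)), round(nrow(veteran) * 0.80))
model <- rfsrc(Surv(time, status) ~ ., veteran[train, ],
               ntree = 100, mtry = 5, nodesize = 8, nsplit = 0)
pred <- predict(model, veteran[-train, ])

# Rows: test individuals. Columns: death times in pred$time.interest.
print(dim(pred$chf))

# Sum the CHE over all death times for each individual.
mortality_from_chf <- rowSums(pred$chf)

# Largest absolute gap between the row sums and the reported prediction.
print(max(abs(mortality_from_chf - pred$predicted)))
```

A gap close to zero would indicate that the reported prediction is the summed CHE.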

#### 3.2 - Survival function

The predicted survival function for the test set is available in `pred$survival`.

In [7]:
# All distinct death times
print(pred$time.interest)

 [1]   1   3   4   7   8  10  11  12  13  15  16  18  19  20  21  24  25  27  29
[20]  30  31  33  35  36  42  44  45  48  49  51  52  53  54  56  59  61  63  73
[39]  80  82  84  87  92  95  99 103 105 110 112 117 126 133 139 140 143 144 151
[58] 153 156 162 177 186 200 216 228 242 250 260 278 283 287 314 357 378 389 411
[77] 467 587 991 999


In [8]:
time_idx <- 10
# Survival rate at the specified time
cat("Survival rate for each item in test set at time", pred$time.interest[time_idx], "\n")
print(pred$survival[, time_idx])

Survival rate for each item in test set at time 15 
 [1] 0.9555864 0.9454676 0.9241747 0.8264598 0.9511024 0.9492909 0.8080744
 [8] 0.9699793 0.3235795 0.9883923 0.9851045 0.5597983 0.9789650 0.9924810
[15] 0.9065259 0.9132871 0.6536961 0.5232299 0.8691400 0.9617306 0.9305516
[22] 0.9306988 0.5388088 0.9556418 0.9743820 0.9436184 0.9608178
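
Picking `time_idx` by hand works when the time of interest is one of the listed death times. For an arbitrary time, one option is to look up the last death time at or before it. This is a minimal sketch, assuming the predicted survival curve is a step function that only changes at the listed death times and equals 1 before the first of them. `surv_at` is a hypothetical helper, and the toy inputs are made up for illustration.

```r
# time_interest: sorted death times.
# survival: matrix with one row per individual and one column per death time
#           (the same layout as pred$survival).
surv_at <- function(time_interest, survival, t) {
  idx <- findInterval(t, time_interest)  # number of death times <= t
  if (idx == 0) {
    return(rep(1, nrow(survival)))       # before the first death time
  }
  survival[, idx]
}

# Toy inputs (made up for illustration, not model output).
toy_times <- c(1, 3, 4, 7)
toy_surv <- rbind(c(0.9, 0.8, 0.7, 0.5),
                  c(1.0, 0.9, 0.9, 0.6))

print(surv_at(toy_times, toy_surv, 5))    # uses the column for time 4
print(surv_at(toy_times, toy_surv, 0.5))  # earlier than any death time
```

With the fitted model, the equivalent call would be `surv_at(pred$time.interest, pred$survival, 90)` for the survival rate at time 90.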


#### 3.3 - Saving as file

Here, we combine the test data with the predictions and survival rates, then write the result to a CSV file.

In [9]:
res_test <- data_test
# predicted outcome for test set
res_test$pred <- pred$predicted
res_test$survival_rate <- pred$survival[, time_idx]

In [10]:
write.csv(res_test, file = "result_rsf.csv")

### In the end

If you find anything wrong or confusing, feel free to contact me by raising an **issue on GitHub** or sending an e-mail to **yuukilp@163.com**.