## Survival Analysis: Based on Random Forest

This tutorial walks through a typical workflow for Random Forest-based survival analysis: data preprocessing, model selection, and training and validation.

The workflow consists of the following steps:
1. Data Preprocessing
  - convert variables
  - load training and test sets
2. Model Selection
  - cross-validation
  - tune parameters
3. Training & Validation
  - train the RSF model
  - evaluate CI on the test set
  - compute survival rates at a time of interest
  
If you are unsure how Random Forest-based survival analysis works, see the official documentation [here](https://kogalur.github.io/randomForestSRC/theory.html).

### Step0 - Load library and Data

In [1]:
library('survival')
library('randomForestSRC')
# set random state
set.seed(0)

"package 'randomForestSRC' was built under R version 3.5.1"
 randomForestSRC 2.7.0 
 
 Type rfsrc.news() to see new features, changes, and bug fixes. 
 



In [2]:
data(veteran, package = "randomForestSRC")
cat("Number of samples:", nrow(veteran), "\n")
cat("Columns of dataset:", colnames(veteran), "\n")
veteran[c(1:5), ]

Number of samples: 137 
Columns of dataset: trt celltype time status karno diagtime age prior 


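| trt | celltype | time | status | karno | diagtime | age | prior |
|---|---|---|---|---|---|---|---|
| 1 | 1 | 72 | 1 | 60 | 7 | 69 | 0 |
| 1 | 1 | 411 | 1 | 70 | 5 | 64 | 10 |
| 1 | 1 | 228 | 1 | 60 | 3 | 38 | 0 |
| 1 | 1 | 126 | 1 | 60 | 9 | 63 | 10 |
| 1 | 1 | 118 | 1 | 70 | 11 | 65 | 10 |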


### Step1 - Data Preprocessing

You can either split the dataset into training and test sets, as done below, or load pre-split data directly from files.

In [3]:
# Sample the data and create a training subset.
train <- sample(1:nrow(veteran), round(nrow(veteran) * 0.80))
data_train <- veteran[train, ]
data_test <- veteran[-train, ]

### Step2 - Model Selection

This section covers hyperparameter tuning by k-fold cross-validation on the training set.
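A minimal sketch of such a tuning loop is shown below. It assumes that `mtry` and `nodesize` are the parameters of interest and scores each setting by the mean test-fold error rate ($1-\text{CI}$) reported by `predict`. The helper `cv_error`, the grid values, and the choice of 5 folds are illustrative assumptions, not part of the original workflow.

```r
library(survival)
library(randomForestSRC)

set.seed(0)
data(veteran, package = "randomForestSRC")
train <- sample(1:nrow(veteran), round(nrow(veteran) * 0.80))
data_train <- veteran[train, ]

# Assign every training row to one of k folds (k = 5 is an illustrative choice).
k <- 5
folds <- sample(rep(seq_len(k), length.out = nrow(data_train)))

# Hypothetical helper: mean k-fold error (1 - CI) for one parameter setting.
cv_error <- function(data, folds, mtry, nodesize, ntree = 100) {
  errs <- sapply(sort(unique(folds)), function(i) {
    fit <- rfsrc(Surv(time, status) ~ ., data[folds != i, ],
                 ntree = ntree, mtry = mtry, nodesize = nodesize)
    p <- predict(fit, data[folds == i, ])
    # err.rate may hold NA for intermediate trees; keep the last available value.
    err <- p$err.rate[!is.na(p$err.rate)]
    err[length(err)]
  })
  mean(errs)
}

# Small illustrative grid; the same folds are reused for every setting.
grid <- expand.grid(mtry = c(2, 3), nodesize = c(3, 15))
grid$cv_err <- mapply(function(m, n) cv_error(data_train, folds, m, n),
                      grid$mtry, grid$nodesize)

# Setting with the lowest cross-validated error.
best <- grid[which.min(grid$cv_err), ]
print(grid)
print(best)
```

The values in `best` could then be passed to `rfsrc` in Step3 in place of the defaults.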

### Step3 - Model Training & Evaluation

After hyperparameter tuning, we pass the chosen arguments to the `rfsrc` function to train the model, and then validate the fitted model on the test set.

The evaluation in this section includes:
- calculating the CI metric
- calculating the survival rate at a specified time
- saving the results to a file

**Notes:**

The RSF prediction for individual $i$ is the sum of the CHE (cumulative hazard estimate) over all $M$ unique death times $t_k^*$:

$$
\begin{equation}
\mathcal{M}_i = \sum_{k=1}^{M}\hat{H}_e^*(t_k^*|{\bf X}_i).
\end{equation}
$$

For details, see the [theory documentation](https://kogalur.github.io/randomForestSRC/theory.html).
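The formula can be checked numerically with a short sketch. It assumes that the columns of `pred$chf` correspond to the death times in `pred$time.interest`, so that summing each row reproduces `pred$predicted`. The name `mortality_manual` is introduced here for illustration only.

```r
library(survival)
library(randomForestSRC)

set.seed(0)
data(veteran, package = "randomForestSRC")
train <- sample(1:nrow(veteran), round(nrow(veteran) * 0.80))
model <- rfsrc(Surv(time, status) ~ ., veteran[train, ], ntree = 100)
pred <- predict(model, veteran[-train, ])

# Sum the cumulative hazard estimate over all death times (one row per individual).
mortality_manual <- as.numeric(rowSums(pred$chf))

# Compare with the prediction reported by the package.
print(all.equal(mortality_manual, as.numeric(pred$predicted)))
print(head(cbind(manual = mortality_manual, predicted = as.numeric(pred$predicted))))
```

If the assumption holds for the installed package version, the two columns should agree up to numerical tolerance.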

#### 3.0 - Model Training

In [4]:
# Train the model with the arguments selected during tuning.
model <- rfsrc(Surv(time, status) ~ ., data_train, ntree = 100)
# Predict on the test set.
pred <- predict(model, data_test)
# Print summaries of the fitted model and of the test-set predictions.
print(model)
print(pred)

                         Sample size: 110
                    Number of deaths: 102
                     Number of trees: 100
           Forest terminal node size: 3
       Average no. of terminal nodes: 36.99
No. of variables tried at each split: 3
              Total no. of variables: 6
       Resampling used to grow trees: swr
    Resample size used to grow trees: 110
                            Analysis: RSF
                              Family: surv
                      Splitting rule: logrank *random*
       Number of random split points: 10
                          Error rate: 28.88%

  Sample size of test (predict) data: 27
       Number of deaths in test data: 26
                Number of grow trees: 100
  Average no. of grow terminal nodes: 36.99
         Total no. of grow variables: 6
       Resampling used to grow trees: swr
    Resample size used to grow trees: 110
                            Analysis: RSF
                              Family: surv
                 Test 

#### 3.1 - CI (concordance index)

We can get $1-\text{CI}$ (one minus the concordance index) in either of two ways:
- the test set error in the output of `print(pred)`
- the built-in method `cindex(T, E, Pred)`

In [5]:
# pred$chf holds the CHE (cumulative hazard estimate) at every death time for each individual
# pred$predicted holds the sum of the CHE over all death times for each individual
cindex(data_test$time, data_test$status, pred$predicted)
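As a cross-check that does not depend on which helpers a given randomForestSRC version exports, the CI can also be computed with `concordance()` from the survival package. This is a sketch that assumes a survival release providing `concordance()`. Because a larger predicted value means a higher risk, `reverse = TRUE` is set. Tie handling may differ slightly from the error rate printed by `print(pred)`, so small discrepancies are possible. Recent randomForestSRC releases also appear to export `get.cindex(time, censoring, predicted)`, which may be an alternative if `cindex` is not available.

```r
library(survival)
library(randomForestSRC)

set.seed(0)
data(veteran, package = "randomForestSRC")
train <- sample(1:nrow(veteran), round(nrow(veteran) * 0.80))
data_train <- veteran[train, ]
data_test <- veteran[-train, ]

model <- rfsrc(Surv(time, status) ~ ., data_train, ntree = 100)
pred <- predict(model, data_test)

# Collect observed outcomes and the predicted risk score in one data frame.
df <- data.frame(time = data_test$time,
                 status = data_test$status,
                 risk = as.numeric(pred$predicted))

# reverse = TRUE: a higher risk score should correspond to a shorter survival time.
cc <- concordance(Surv(time, status) ~ risk, data = df, reverse = TRUE)
ci <- unname(cc$concordance)
cat("CI:", ci, "\n")
cat("1 - CI:", 1 - ci, "\n")
```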

#### 3.2 - Survival function

The predicted survival function on the test set can be accessed via `pred$survival`.

In [6]:
# All unique death times
pred$time.interest

In [7]:
time_idx <- 10
# Survival rate at the specified time
cat("Survival rate for each item in test set at time", pred$time.interest[time_idx], "\n")
print(pred$survival[, time_idx])

Survival rate for each item in test set at time 15 
 [1] 0.9470000 0.8618333 0.8783333 0.8556667 0.9143333 0.9241667 0.8698333
 [8] 0.9565000 0.6515000 0.9830000 0.9905000 0.5318333 0.9806667 1.0000000
[15] 0.8590000 0.8431667 0.6905000 0.6711667 0.8493333 0.9675000 0.9361667
[22] 0.9480000 0.5396667 0.9343333 0.9625000 0.8960000 0.9455000
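The example above selects a column by index. To query an arbitrary time instead, one option is to look up the last death time not exceeding it, since the estimated survival curve is a step function that only changes at the times in `pred$time.interest`. The sketch below assumes a hypothetical time of interest `t0 = 90` (days); the names `t0`, `idx`, and `surv_t0` are illustrative.

```r
library(survival)
library(randomForestSRC)

set.seed(0)
data(veteran, package = "randomForestSRC")
train <- sample(1:nrow(veteran), round(nrow(veteran) * 0.80))
data_test <- veteran[-train, ]
model <- rfsrc(Surv(time, status) ~ ., veteran[train, ], ntree = 100)
pred <- predict(model, data_test)

# Hypothetical time of interest, in days.
t0 <- 90

# Index of the last death time <= t0 (0 if t0 precedes every death time).
idx <- findInterval(t0, pred$time.interest)

# Before the first death time, the estimated survival rate is 1 for everyone.
surv_t0 <- if (idx == 0) rep(1, nrow(pred$survival)) else pred$survival[, idx]

cat("Survival rate for each item in test set at time", t0, "\n")
print(surv_t0)
```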


#### 3.3 - Saving as file

Here, we concatenate the test data with the predictions and survival rates, and then write the result to a CSV file.

In [8]:
res_test <- data_test
# predicted outcome for test set
res_test$pred <- pred$predicted
res_test$survival_rate <- pred$survival[, time_idx]

In [9]:
write.csv(res_test, file = "result_rsf.csv")

### In the end

If you find something wrong or confusing, feel free to contact me by raising an **issue on GitHub** or sending an e-mail to **yuukilp@163.com**.