## Survival Analysis: Based on eXtreme Gradient Boosting (XGB)

This tutorial walks through a typical workflow for survival analysis based on eXtreme Gradient Boosting, covering data preprocessing, model selection, and training and validation, using the **R package** `xgboost`.

The workflow consists of three steps:
1. Data Preprocessing
  - convert variables
  - load training and test set
2. Model Selection
  - cross validation
  - tune parameters
3. Training & Validation
  - train the gradient boosting model
  - measure the concordance index (CI) on the test set
  - estimate survival rates at times of interest

### Concept and Related

#### Concept of coxph
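Under the assumption of the Cox proportional hazards model, the hazard function is:
$$h(t|x) = h_0(t) \cdot e^{f(x)}$$
where $h_0(t)$ is the baseline hazard. $f(x)$ is the **log hazard ratio** (risk score), and $e^{f(x)}$ is the **hazard ratio** relative to the baseline.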
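
The hazard form above also determines the survival function, which is what the later step on survival rates at times of interest relies on. The following is a standard consequence of the model, written out as a short derivation; the symbols $H_0(t)$ (baseline cumulative hazard) and $S_0(t)$ (baseline survival function) are introduced here for illustration and are not used elsewhere in this tutorial:

$$
S(t|x) = \exp\Bigl(-\int_0^t h(u|x)\,du\Bigr) = \exp\bigl(-H_0(t)\,e^{f(x)}\bigr) = S_0(t)^{\,e^{f(x)}}, \qquad H_0(t)=\int_0^t h_0(u)\,du,\quad S_0(t)=e^{-H_0(t)}
$$

So a larger $f(x)$ lowers the survival probability at every time $t$, while the shape over time comes entirely from the baseline.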

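Given the observations $\{(x_i,T_i,E_i)|i=1,\dots ,n \}$, where $T_i$ is the observed time and $E_i$ indicates whether the death was observed ($E_i=1$) or censored ($E_i=0$), the negative log partial likelihood, which serves as the loss function, is:
$$
\begin{aligned}
\mathcal L &= -\sum_{i\ :\ E_i=1}\Bigl(f(x_i)-\log\sum_{j\in \mathcal R(i)}e^{f(x_j)}\Bigr) \\
\mathcal R(i) &= \{j \mid T_j\ge T_i\}
\end{aligned}
$$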
If there are ties among the death times, the corresponding negative log partial likelihood (Breslow approximation) is:
$$
\begin{aligned}
\mathcal L &= -\sum_{i=1}^{k}\Bigl(\sum_{j\in q(i)}f(x_j)-d_i\log\sum_{j\in \mathcal R(i)}e^{f(x_j)}\Bigr) \\
k &= \big|\text{unique}(\{T_i \mid E_i=1\})\big| \\
\mathcal R(i) &= \{j \mid T_j\ge t_i\} \\
q(i) &= \{j \mid T_j=t_i,\ E_j=1\} \\
d_i &= \vert q(i) \vert
\end{aligned}
$$
where $t_1,\dots,t_k$ are the distinct death times.
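
To make the two formulas above concrete, here is a minimal base-R sketch that evaluates both on a made-up toy sample containing a tie. The function names `neg_log_pl` and `neg_log_pl_ties` and the toy vectors are hypothetical, introduced only for illustration; this is not how `xgboost` computes the loss internally.

```r
# Negative log partial likelihood, one term per observed death.
# f: risk scores f(x_i); time: observed times T_i; event: E_i (1 = death).
neg_log_pl <- function(f, time, event) {
  loss <- 0
  for (i in which(event == 1)) {
    at_risk <- time >= time[i]                 # risk set R(i)
    loss <- loss - (f[i] - log(sum(exp(f[at_risk]))))
  }
  loss
}

# Tie-aware form, one term per distinct death time.
neg_log_pl_ties <- function(f, time, event) {
  loss <- 0
  for (death_time in unique(time[event == 1])) {
    deaths  <- time == death_time & event == 1 # q(i)
    at_risk <- time >= death_time              # R(i)
    d <- sum(deaths)                           # d_i
    loss <- loss - (sum(f[deaths]) - d * log(sum(exp(f[at_risk]))))
  }
  loss
}

# Toy data (made up for illustration) with two deaths tied at time 5.
toy_f     <- c(0.2, -0.1, 0.5, 0.0, -0.3)
toy_time  <- c(5, 5, 8, 3, 10)
toy_event <- c(1, 1, 0, 1, 1)

loss_per_event <- neg_log_pl(toy_f, toy_time, toy_event)
loss_grouped   <- neg_log_pl_ties(toy_f, toy_time, toy_event)
```

With the risk set defined as $\{j|T_j\ge T_i\}$, the two functions return the same value on this sample: grouping tied deaths only collects identical log terms into one term multiplied by $d_i$.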

### Step0 - Load library

In [1]:
library('survival')
library('gbm')
# set random state
set.seed(0)

"package 'gbm' was built under R version 3.5.1"
Loaded gbm 2.1.4


In [2]:
data(veteran, package = "randomForestSRC")
cat("Number of samples:", nrow(veteran), "\n")
cat("Columns of dataset:", colnames(veteran), "\n")
veteran[c(1:5), ]

Number of samples: 137 
Columns of dataset: trt celltype time status karno diagtime age prior 


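| trt | celltype | time | status | karno | diagtime | age | prior |
|-----|----------|------|--------|-------|----------|-----|-------|
| 1 | 1 | 72 | 1 | 60 | 7 | 69 | 0 |
| 1 | 1 | 411 | 1 | 70 | 5 | 64 | 10 |
| 1 | 1 | 228 | 1 | 60 | 3 | 38 | 0 |
| 1 | 1 | 126 | 1 | 60 | 9 | 63 | 10 |
| 1 | 1 | 118 | 1 | 70 | 11 | 65 | 10 |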


### Step1 - Data Preprocessing

In [3]:
# Randomly assign 80% of the rows to the training set; the remaining 20% form the test set.
train <- sample(1:nrow(veteran), round(nrow(veteran) * 0.80))
data_train <- veteran[train, ]
data_test <- veteran[-train, ]
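
The "convert variables" step is not spelled out here, so the following is one plausible sketch rather than the tutorial's own code. It assumes the `survival:cox` objective of `xgboost`, whose label is the survival time with negative values marking right-censored samples. The helper `make_cox_data`, the toy data frame, and the choice to dummy-code `celltype` are assumptions made for illustration.

```r
# Toy data frame (made-up values) with the same columns as `veteran`.
toy <- data.frame(
  trt      = c(1, 1, 2, 2, 1, 2),
  celltype = c(1, 2, 3, 4, 1, 2),
  time     = c(72, 411, 228, 126, 118, 10),
  status   = c(1, 1, 0, 1, 0, 1),
  karno    = c(60, 70, 60, 60, 70, 20),
  diagtime = c(7, 5, 3, 9, 11, 5),
  age      = c(69, 64, 38, 63, 65, 49),
  prior    = c(0, 10, 0, 10, 10, 0)
)

# Hypothetical helper: build the feature matrix and the signed-time label.
make_cox_data <- function(df) {
  df$celltype <- factor(df$celltype)   # treat cell type as categorical
  # Dummy-code factors, exclude the outcome columns, drop the intercept.
  x <- model.matrix(~ . - time - status, data = df)[, -1, drop = FALSE]
  # survival:cox convention: positive time = death, negative = censored.
  y <- ifelse(df$status == 1, df$time, -df$time)
  list(x = x, y = y)
}

toy_cox <- make_cox_data(toy)

# Wrap in an xgb.DMatrix only if xgboost is installed.
if (requireNamespace("xgboost", quietly = TRUE)) {
  dtoy <- xgboost::xgb.DMatrix(data = toy_cox$x, label = toy_cox$y)
}
```

The same helper could be applied to `data_train` and `data_test`, provided both contain every `celltype` level so that their feature matrices have matching columns.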

### Step2 - Model Selection

The following hyperparameters should be tuned:
- eta: learning rate
- max_depth: maximum depth of a tree
- min_child_weight: minimum sum of instance weight (hessian) needed in a child
- subsample: subsample ratio of the training instances
- colsample_bytree: subsample ratio of columns when constructing each tree
- gamma: minimum loss reduction required to make a further partition on a leaf node of the tree
- alpha: L1 regularization term on leaf weights
- lambda: L2 regularization term on leaf weights
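
As a rough sketch of how such a search could be organized, the block below builds a small candidate grid and a helper that cross-validates one candidate with `xgb.cv`. The candidate values, the helper names `as_cox_params` and `cv_one`, and the settings `nrounds`, `nfold` and `early_stopping_rounds` are all assumptions chosen for illustration, not tuned results. `cv_one` is only defined, not called, since it needs an `xgb.DMatrix` of training data.

```r
# Small illustrative grid; the candidate values are assumptions.
param_grid <- expand.grid(
  eta              = c(0.01, 0.1),
  max_depth        = c(2, 4),
  min_child_weight = c(1, 5),
  subsample        = 0.8,
  colsample_bytree = 0.8,
  gamma            = 0,
  alpha            = 0,
  lambda           = 1
)

# Turn one grid row into an xgboost parameter list for Cox regression.
as_cox_params <- function(row) {
  c(list(objective = "survival:cox", eval_metric = "cox-nloglik"),
    as.list(row))
}

# Hypothetical helper: cross-validate one candidate. `dtrain` is assumed
# to be an xgb.DMatrix whose label is the signed survival time.
cv_one <- function(dtrain, row, nrounds = 200, nfold = 5) {
  xgboost::xgb.cv(
    params = as_cox_params(row),
    data = dtrain,
    nrounds = nrounds,
    nfold = nfold,
    early_stopping_rounds = 20,
    verbose = FALSE
  )
}

first_params <- as_cox_params(param_grid[1, ])
```

Looping `cv_one` over the rows of `param_grid` and comparing the held-out `cox-nloglik` values in each result's `evaluation_log` is one way to pick a candidate; with only 137 samples, a coarse grid like this keeps the search cheap.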

### Step3 - Model Training & Evaluation

After the hyperparameters have been tuned in Step 2, we pass them to `xgboost` to train the final model, then validate the fitted model on the test set.

Evaluation in this section includes:

- computing the concordance index (CI)
- computing the survival function at specified times
- saving the results to a file
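
For reference, the concordance index can be sketched in a few lines of base R. This is a simplified version of Harrell's C that ignores ties in the observed times; the function name `c_index` and the toy vectors are hypothetical, and the tutorial's own evaluation code may compute the metric differently.

```r
# Simplified Harrell's concordance index.
# risk: predicted risk score (higher = expected to die sooner);
# time: observed times; event: 1 = death observed, 0 = censored.
c_index <- function(risk, time, event) {
  concordant <- 0
  comparable <- 0
  n <- length(time)
  for (i in seq_len(n)) {
    if (event[i] != 1) next               # the earlier time must be a death
    for (j in seq_len(n)) {
      if (time[j] > time[i]) {            # j outlived i: comparable pair
        comparable <- comparable + 1
        if (risk[i] > risk[j]) {
          concordant <- concordant + 1
        } else if (risk[i] == risk[j]) {
          concordant <- concordant + 0.5  # tied predictions count half
        }
      }
    }
  }
  concordant / comparable
}

# Toy data (made up for illustration); the third sample is censored.
ci_time  <- c(2, 4, 6, 8)
ci_event <- c(1, 1, 0, 1)

ci_perfect  <- c_index(c(4, 3, 2, 1), ci_time, ci_event)
ci_reversed <- c_index(c(1, 2, 3, 4), ci_time, ci_event)
ci_constant <- c_index(rep(0, 4), ci_time, ci_event)
```

On this toy sample there are five comparable pairs, so a perfectly ordered risk score gives 1, a fully reversed one gives 0, and a constant score gives 0.5. Since a `survival:cox` model predicts on the hazard-ratio scale, its predictions can be passed as `risk` directly.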

#### 3.0 - Model Training & Prediction