## Step 1: load datasets
Below we load a set of survival datasets from a MATLAB .mat file with scipy.io.loadmat(). Toy datasets can instead be generated with generate_toy_datasets() as defined in utils.py. Users can load their own survival datasets into "datasets", which should be a list of (X, time, event) tuples, where X, time, and event are the design matrix, survival times, and event indicators for a given dataset.

In [1]:
import scipy.io

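# Load the MATLAB file; 'dat', 'days' and 'status' each hold one entry per dataset
data = scipy.io.loadmat('E:/R projects/metaCox/surv_breast_os_raw.mat')
datasets = []
num_datasets = len(data['dat'][0][0])
for k in range(num_datasets):
    X = data['dat'][0][0][k]                 # design matrix
    time = data['days'][0][0][k].ravel()     # survival times
    event = data['status'][0][0][k].ravel()  # event indicators
    datasets.append((X, time, event))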
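
To make the expected layout of "datasets" concrete, here is a minimal sketch that builds a list of (X, time, event) tuples from random numbers. The helper name make_example_datasets, the sizes, and the event coding (1 = event observed, 0 = censored) are assumptions for illustration and are not part of utils.py.

```python
import numpy as np

def make_example_datasets(sizes=(50, 80, 65), n_features=20, seed=0):
    """Build a list of (X, time, event) tuples filled with random values.

    Hypothetical helper, for illustration only.
    """
    rng = np.random.default_rng(seed)
    datasets = []
    for n_samples in sizes:
        X = rng.normal(size=(n_samples, n_features))          # design matrix
        time = rng.exponential(scale=365.0, size=n_samples)   # survival times
        event = rng.integers(0, 2, size=n_samples)            # assumed: 1 = event, 0 = censored
        datasets.append((X, time, event))
    return datasets

example_datasets = make_example_datasets()
for X, time, event in example_datasets:
    # Every tuple must describe the same samples, in the same order
    assert X.shape[0] == time.shape[0] == event.shape[0]
```

Each dataset may have a different number of samples, but all of them should share the same feature columns so that one model can be applied to every dataset.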

## Step 2 (optional): feature transformation
If necessary, we can first preprocess X so that it is standardized. We provide in preprocessing.py two basic types of feature transformation functions:
- __rank_transform()__: transform features of each sample into normalized ranks
- __zscore_transform()__: transform each feature to be zero mean and unit std across samples

We can then wrap either function in a FeatureTransformer object, which defines the fit_transform method for our "datasets" list.

In [2]:
from preprocessing import rank_transform, FeatureTransformer

feature_transformer = FeatureTransformer(rank_transform)
datasets_transformed = feature_transformer.fit_transform(datasets)
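
As a rough illustration of what the two transformations do, the sketch below reimplements them with NumPy. The function names rank_transform_sketch and zscore_transform_sketch are hypothetical, and details such as tie handling and the scaling of ranks to [0, 1] are assumptions; the versions in preprocessing.py may differ.

```python
import numpy as np

def rank_transform_sketch(X):
    """Replace each sample's (row's) feature values by ranks scaled to [0, 1].

    Ties are broken arbitrarily in this sketch.
    """
    X = np.asarray(X, dtype=float)
    ranks = X.argsort(axis=1).argsort(axis=1)  # 0 = smallest value in the row
    return ranks / max(X.shape[1] - 1, 1)

def zscore_transform_sketch(X, eps=1e-12):
    """Shift and scale each feature (column) to zero mean and unit std."""
    X = np.asarray(X, dtype=float)
    return (X - X.mean(axis=0)) / (X.std(axis=0) + eps)

X_demo = np.array([[3.0, 1.0, 2.0],
                   [10.0, 30.0, 20.0]])
X_ranked = rank_transform_sketch(X_demo)   # works within each row
X_scaled = zscore_transform_sketch(X_demo) # works within each column
```

The rank transform depends only on the ordering of features within a sample, so it is insensitive to per-sample scale differences between datasets, whereas the z-score depends on the statistics of the samples it is fitted on.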

## Step 3: split into training and validation
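We provide train_test_split() in utils.py to split the "datasets" list in a stratified way. That is, each dataset in "datasets" is split according to test_size. In the cell below we instead hold out one whole dataset (index 2) for validation and train on the datasets at indices 0, 1, 3, 4 and 5.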

In [3]:
datasets_train = datasets_transformed[:2]+datasets_transformed[3:6]
datasets_val = datasets_transformed[2:3]
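
For comparison, the per-dataset split described above can be sketched as follows. This is only an illustration of the idea of splitting every dataset by the same test_size; the function name split_each_dataset and its arguments are hypothetical and do not reproduce the signature of train_test_split() in utils.py.

```python
import numpy as np

def split_each_dataset(datasets, test_size=0.2, seed=0):
    """Split every (X, time, event) tuple into a train part and a test part."""
    rng = np.random.default_rng(seed)
    train, test = [], []
    for X, time, event in datasets:
        order = rng.permutation(X.shape[0])          # shuffle sample indices
        n_test = int(round(test_size * X.shape[0]))  # samples held out from this dataset
        test_idx, train_idx = order[:n_test], order[n_test:]
        train.append((X[train_idx], time[train_idx], event[train_idx]))
        test.append((X[test_idx], time[test_idx], event[test_idx]))
    return train, test

toy = [
    (np.arange(20.0).reshape(10, 2), np.arange(10.0), np.ones(10, dtype=int)),
    (np.arange(40.0).reshape(20, 2), np.arange(20.0), np.zeros(20, dtype=int)),
]
toy_train, toy_test = split_each_dataset(toy, test_size=0.2)
```

Splitting within each dataset keeps every cohort represented on both sides, while holding out a whole dataset tests how well the model transfers to an unseen cohort.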

## Step 4 (optional): feature selection
We can additionally perform a feature selection step to reduce the number of features before model training. In feature_selection.py we provide a feature selection method based on the concordance index, which is commonly used to characterize a feature's correlation with survival.

Note that our feature selection for multiple datasets is based on meta-analysis. The concordance index is calculated for each dataset, and the per-dataset values are combined into a meta-score based on the size of each dataset. This is done by wrapping the score function in a SelectKBestMeta object, which defines the fit and transform methods.

In [4]:
from feature_selection import concordance_index, SelectKBestMeta

topK = 1024 # select top 1024 features
feature_selector = SelectKBestMeta(concordance_index, topK)
feature_selector.fit(datasets_train)
datasets_train_new = feature_selector.transform(datasets_train)
datasets_val_new = feature_selector.transform(datasets_val)
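
The idea behind the meta-score can be sketched in plain NumPy. The code below is an assumption-laden illustration, not the implementation in feature_selection.py: it uses Harrell's concordance index, scores each feature by its distance from 0.5, and averages across datasets with weights proportional to sample count. The names harrell_cindex, meta_scores, and select_top_k are hypothetical.

```python
import numpy as np

def harrell_cindex(score, time, event):
    """Harrell's C: fraction of comparable pairs that are ordered concordantly.

    A pair (i, j) is comparable when sample i has an observed event and
    time[i] < time[j]. It is concordant when score[i] > score[j]
    (higher score = higher risk); tied scores count as 0.5.
    """
    score, time, event = map(np.asarray, (score, time, event))
    concordant, comparable = 0.0, 0
    for i in np.flatnonzero(event == 1):
        later = time > time[i]
        comparable += int(later.sum())
        concordant += float((score[i] > score[later]).sum())
        concordant += 0.5 * float((score[i] == score[later]).sum())
    return concordant / comparable if comparable else 0.5

def meta_scores(datasets):
    """Size-weighted average over datasets of each feature's |C - 0.5| (assumed rule)."""
    total = sum(X.shape[0] for X, _, _ in datasets)
    n_features = datasets[0][0].shape[1]
    scores = np.zeros(n_features)
    for X, time, event in datasets:
        per_feature = np.array(
            [harrell_cindex(X[:, j], time, event) for j in range(n_features)]
        )
        scores += (X.shape[0] / total) * np.abs(per_feature - 0.5)
    return scores

def select_top_k(datasets, k):
    """Keep the k features with the highest meta-score in every dataset."""
    keep = np.sort(np.argsort(meta_scores(datasets))[::-1][:k])
    return keep, [(X[:, keep], time, event) for X, time, event in datasets]

rng = np.random.default_rng(0)
demo = []
for n in (30, 40):
    X = rng.normal(size=(n, 5))
    time = np.exp(-X[:, 0])          # feature 0 fully determines survival time
    event = np.ones(n, dtype=int)    # no censoring in this toy example
    demo.append((X, time, event))
keep, demo_selected = select_top_k(demo, k=2)
```

Fitting the selector on the training datasets only, as in the cell above, avoids leaking validation information into the choice of features.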

## Step 5: user-defined Keras model
This is the core input required of the user. Below we provide a simple fully-connected network with 2 hidden layers. Note that there is no need to apply any activation function to the output layer, since we are building a survival regression model.

In [5]:
from keras.models import Model
from keras.layers import Input, Dense, Activation, Dropout
import keras.backend as K


def build_model(feature_dim):
    '''
    Define a callable keras model yourself
    model input should be a (None, feature_dim) tensor,
    model output should be a (None, 1) tensor
    '''
    x = Input(shape=(feature_dim,))
    #--------START OF USER CODE-------
    a0 = Dropout(0.3)(x)
    z1 = Dense(units=1024, activation=None)(a0)
    a1 = Activation(activation='elu')(z1)
    a1 = Dropout(0.5)(a1)
    z2 = Dense(units=1024, activation=None)(a1)
    a2 = Activation(activation='elu')(z2)
    a2 = Dropout(0.5)(a2)
    y = Dense(units=1, activation=None)(a2)
    #--------END OF USER CODE-------
    
    model = Model(inputs=x, outputs=y)
    return model



## Step 6: create a model and train
We provide a high-level class SurvivalModel to facilitate model training. SurvivalModel.fit() supports two training modes: 'merge' and 'decentral'. With mode='decentral', each dataset is treated as a mini-batch. With mode='merge', the datasets are merged into a single dataset and mini-batches are sampled from the merged dataset. If your datasets are very heterogeneous (e.g., different cancers), consider mode='decentral'; otherwise, mode='merge' should be the choice.
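
The difference between the two modes can be illustrated with a small batch generator. This is a sketch of the batching logic only, written under our own assumptions; the function iterate_batches, its batch_size argument, and the random visiting order are hypothetical and are not taken from models.py.

```python
import numpy as np

def iterate_batches(datasets, mode="decentral", batch_size=32, seed=0):
    """Yield (X, time, event) mini-batches for one epoch."""
    rng = np.random.default_rng(seed)
    if mode == "decentral":
        # One mini-batch per dataset, visited in random order
        for k in rng.permutation(len(datasets)):
            yield datasets[k]
    elif mode == "merge":
        # Pool all samples, then draw mini-batches from the pooled data
        X = np.concatenate([d[0] for d in datasets], axis=0)
        time = np.concatenate([d[1] for d in datasets])
        event = np.concatenate([d[2] for d in datasets])
        order = rng.permutation(X.shape[0])
        for start in range(0, len(order), batch_size):
            idx = order[start:start + batch_size]
            yield X[idx], time[idx], event[idx]
    else:
        raise ValueError(f"unknown mode: {mode!r}")

toy = [
    (np.zeros((10, 3)), np.ones(10), np.ones(10)),
    (np.zeros((25, 3)), np.ones(25), np.ones(25)),
]
decentral_sizes = sorted(b[0].shape[0] for b in iterate_batches(toy, "decentral"))
merge_sizes = [b[0].shape[0] for b in iterate_batches(toy, "merge", batch_size=16)]
```

In the decentral case, samples are only ever compared with samples from the same dataset, which is why it suits heterogeneous cohorts; in the merge case, a mini-batch can mix samples from several datasets.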

In [6]:
from models import SurvivalModel

survival_model = SurvivalModel(build_model)
survival_model.fit(datasets_train_new, datasets_val_new, loss_func='logloss', epochs=2000, lr=0.001, mode='decentral')


Epoch 0: loss_train=0.6609, ci_train=0.7106, loss_val=0.8288, ci_val=0.6390
Epoch 100: loss_train=0.5413, ci_train=0.7794, loss_val=0.7687, ci_val=0.6438
Epoch 200: loss_train=0.4945, ci_train=0.7991, loss_val=0.7688, ci_val=0.6449
Epoch 300: loss_train=0.4524, ci_train=0.8147, loss_val=0.7607, ci_val=0.6514
Epoch 400: loss_train=0.4175, ci_train=0.8312, loss_val=0.7287, ci_val=0.6696
Epoch 500: loss_train=0.3989, ci_train=0.8445, loss_val=0.7250, ci_val=0.6769
Epoch 600: loss_train=0.3609, ci_train=0.8591, loss_val=0.7102, ci_val=0.6902
Epoch 700: loss_train=0.3648, ci_train=0.8598, loss_val=0.7084, ci_val=0.6953
Epoch 800: loss_train=0.3012, ci_train=0.8851, loss_val=0.6887, ci_val=0.7016


KeyboardInterrupt: 

In the run above, training was interrupted after the epoch-800 log line. At that point the model reached a concordance index of 0.8851 on the training datasets but only 0.7016 on the validation dataset. A gap of this kind is expected with the toy datasets from generate_toy_datasets(), which are randomly generated and contain nothing to learn (it would be surprising if the model learned anything useful from them). You can provide your own datasets, design your own model, and check whether it also works on the testing dataset.

## Step 7: predict on testing data

In [None]:
y_pred = survival_model.predict(datasets_val_new[0][0])

In [None]:
cindex, _ = concordance_index(y_pred, datasets_val_new[0][1], datasets_val_new[0][2])
print(cindex)