## Cox regression and multitask learning

This notebook demonstrates how to get started with SurvivalNet using a simple Cox regression model. 

A dataset containing protein expression profiles for gliomas from TCGA is provided. Two outcomes are available: overall survival (OS) and progression-free interval (PFI). First, we show how a simple Keras model can be trained using a Cox Efron loss to optimize the partial likelihood of PFI. Then, we develop a two-task model that learns from both PFI and OS to improve prediction accuracy.

Topics covered in this notebook:
1. Data formatting
2. Applying SurvivalNet losses to train Keras models
3. Handling of missing/NaN labels
4. Using SurvivalNet metrics to monitor training and evaluate performance
5. Multi-task learning
6. Generating Kaplan-Meier plots

In [1]:
import numpy as np
import os
import pandas as pd
import sys
import tensorflow as tf

import survivalnet2
from survivalnet2.data.labels import stack_labels, unstack_labels
from survivalnet2.losses import efron
from survivalnet2.metrics.concordance import HarrellsC
from survivalnet2.visualization import km_plot

np.random.seed(51)
tf.random.set_seed(51)

### Load the data

A dataset contains features and labels. In this example, features are represented by a 565 x 412 matrix where each row contains the features for one patient. The PFI and OS labels are each represented by a 565 x 2 matrix, where the first column contains the event or last follow-up time, and the second column contains the event indicator (1 for samples where the event was observed). These label and data formats are used throughout SurvivalNet.
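
As a small illustration of this layout, the sketch below builds a label matrix for four hypothetical patients with NumPy. The values and variable names are made up for the example and are not taken from the TCGA data.

```python
import numpy as np

# Hypothetical follow-up times and event indicators for 4 patients.
times = np.array([12.0, 30.5, 7.2, 45.0], dtype=np.float32)
events = np.array([1, 0, 1, 0], dtype=np.float32)  # 1 = event observed, 0 = censored

# Stack into an N x 2 label matrix: column 0 = time, column 1 = event indicator.
labels = np.stack([times, events], axis=1)

# Hypothetical N x D feature matrix with one row per patient.
example_features = np.zeros((4, 3), dtype=np.float32)

print(labels.shape)            # (4, 2)
print(example_features.shape)  # (4, 3)
```

Row `i` of the feature matrix and row `i` of the label matrix always refer to the same patient.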

In [8]:
def load_example(file):
    # load example data and return features, OS labels, and PFI labels
    data = pd.read_csv(file, index_col=0)

    # retrieve protein expression features
    features = data.iloc[13:, :].to_numpy().T

    # get outcomes
    osr = data.iloc[[6, 5], :].to_numpy().T
    pfi = data.iloc[[12, 11], :].to_numpy().T

    # convert types
    features = features.astype(np.float32)
    osr = osr.astype(np.float32)
    pfi = pfi.astype(np.float32)

    return features, osr, pfi


# add package install path to python
install_dir = os.path.dirname(os.path.dirname(survivalnet2.__file__))
sys.path.append(install_dir)

# load example data
data_path = os.path.join(install_dir, "examples/TCGA_glioma.csv")
features, osr, pfi = load_example(data_path)
# print(features.shape, features)
# print(osr.shape, osr)
# print(pfi.shape, pfi)
# get data shape
(N, D) = features.shape

### Simulate missing labels

Datasets often have missing labels. In a multitask learning problem where we are training with both OS and PFI, we may have some samples that have one label but are missing the other. As long as one label is available, a sample can be used in training.

Here, we simulate missing labels by randomly deleting 10% of labels from OS and PFI. SurvivalNet losses implement masking, so that `NaN` time or event values are treated as missing labels. This masking is convenient for users and allows utilization of samples in datasets with sparse labels.
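
The idea behind this masking can be sketched in NumPy as follows. This is an illustration only, under the assumption that a label row counts as missing when either its time or its event value is `NaN`; the helper `label_mask` is hypothetical and is not SurvivalNet's implementation.

```python
import numpy as np


def label_mask(labels):
    """Return a boolean vector that is True for rows where time and event are both present."""
    return ~np.isnan(labels).any(axis=1)


example_labels = np.array(
    [[12.0, 1.0], [np.nan, np.nan], [7.2, 0.0], [30.0, np.nan]], dtype=np.float32
)

mask = label_mask(example_labels)
usable = example_labels[mask]  # only these rows would contribute to a loss

print(mask)          # [ True False  True False]
print(usable.shape)  # (2, 2)
```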

In [163]:
# randomly discard 10% of OS labels and 10% of PFI labels
# (the count is based on the number of samples N, sampled without replacement)
n_missing = int(round(0.1 * N))
osr[np.random.choice(N, n_missing, replace=False), :] = np.nan
pfi[np.random.choice(N, n_missing, replace=False), :] = np.nan

### Train and evaluate PFI-only model using Cox Efron loss

After splitting the data into training and testing sets, we create a `tf.data.Dataset` object that can emit batches for training. A two-layer model is built using the Keras functional interface and trained with the Cox proportional hazards loss, using the Efron approximation to handle tied times. This model predicts a dimensionless risk score that can be used to rank samples in terms of predicted outcomes, with higher scores corresponding to worse predicted outcomes.
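
To make the Efron approximation concrete, here is a minimal NumPy sketch of the negative log partial likelihood for a label matrix in the format described earlier. It assumes complete labels (no `NaN` values) and returns an unnormalized sum; the function name is hypothetical, and the `efron` loss in SurvivalNet operates on tensors and may differ in masking and normalization.

```python
import numpy as np


def efron_negative_log_likelihood(labels, risks):
    """Negative Cox partial log-likelihood with Efron's correction for tied event times."""
    times, events = labels[:, 0], labels[:, 1]
    risks = np.asarray(risks, dtype=np.float64).reshape(-1)
    hazards = np.exp(risks)
    loss = 0.0
    for t in np.unique(times[events == 1]):
        at_risk = times >= t                    # samples still under observation at t
        tied = (times == t) & (events == 1)     # events that share this time
        d = int(tied.sum())
        risk_sum = hazards[at_risk].sum()
        tied_sum = hazards[tied].sum()
        loss -= risks[tied].sum()
        # Efron: remove a growing fraction of the tied hazards from the risk set.
        for k in range(d):
            loss += np.log(risk_sum - (k / d) * tied_sum)
    return loss


# Two tied events at time 1 and one censored sample, all with zero risk.
tied_labels = np.array([[1.0, 1.0], [1.0, 1.0], [2.0, 0.0]])
# With zero risks the two terms are log(3) and log(3 - 1), so the sum is log(6).
print(efron_negative_log_likelihood(tied_labels, np.zeros(3)))
```

When no event times are tied, each event contributes a single term and the expression reduces to the standard Cox partial likelihood.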

The trained model is evaluated on the held-out test samples using Harrell's concordance index (c-index), which measures how well the ordering of predicted risks agrees with the ordering of observed event times. The predicted risk scores are also used to split test samples into low-risk and high-risk PFI groups at the median score, and these groups are visualized using a Kaplan-Meier plot.
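
For intuition, the c-index can be computed by brute force over sample pairs, as in the NumPy sketch below. It assumes complete labels, treats a pair as comparable when the sample with the shorter time had an observed event, and gives half credit for tied risk scores. The function name is hypothetical, and the `HarrellsC` metric in SurvivalNet may treat ties and missing labels differently.

```python
import numpy as np


def harrell_c(labels, risks):
    """Fraction of comparable pairs in which the earlier event has the higher risk."""
    times, events = labels[:, 0], labels[:, 1]
    risks = np.asarray(risks, dtype=np.float64).reshape(-1)
    concordant, comparable = 0.0, 0
    for i in range(len(times)):
        if events[i] != 1:
            continue  # a censored sample cannot anchor a comparable pair
        for j in range(len(times)):
            if times[j] > times[i]:
                comparable += 1
                if risks[i] > risks[j]:
                    concordant += 1.0
                elif risks[i] == risks[j]:
                    concordant += 0.5
    return concordant / comparable if comparable else float("nan")


example_labels = np.array([[1.0, 1.0], [2.0, 1.0], [3.0, 0.0], [4.0, 1.0]])
print(harrell_c(example_labels, [4.0, 3.0, 2.0, 1.0]))  # 1.0, risks decrease with time
print(harrell_c(example_labels, [1.0, 2.0, 3.0, 4.0]))  # 0.0, risks increase with time
```

A value of 0.5 corresponds to random ordering, and 1.0 to perfect agreement between predicted risks and observed outcomes.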

In [6]:
# generate train/test split
index = np.argsort(np.random.rand(N))
train = np.zeros(N, np.bool_)
train[index[0 : np.int32(0.8 * N)].astype(np.int32)] = True
test = ~train

# create tf Dataset for Keras training from the training samples only
dataset = tf.data.Dataset.from_tensor_slices((features[train, :], pfi[train, :]))

# build a simple 2 layer model that takes one D-dimensional feature vector per sample
inputs = tf.keras.Input((D,))
beta1 = tf.keras.layers.Dense(units=10, activation="selu")
beta2 = tf.keras.layers.Dense(units=1, activation="linear")
outputs = beta2(beta1(inputs))
model = tf.keras.Model(inputs=inputs, outputs=outputs)

# train PFI network using cox efron loss and Harrell's c-index as a metric
model.compile(
    loss=efron,
    metrics=[HarrellsC()],
    optimizer=tf.keras.optimizers.Adam(learning_rate=1e-4),
)
model.fit(x=dataset.batch(64), epochs=200, verbose=0)

# evaluate on testing data
risks = model(features[test, :])
cindex = HarrellsC()
print("Testing c-index: %0.3f" % cindex(pfi[test, :], risks))

# visualize: split test samples at the median predicted risk
risk_groups = np.squeeze(np.array(risks > np.median(risks), np.int32)) + 1
km_plot(
    np.array(pfi[test, :]),
    groups=risk_groups,
    xlabel="Time",
    ylabel="Progression probability",
    legend=["predicted low risk", "predicted high risk"],
)


### Train and evaluate PFI+OS multitask model using Cox Efron loss

To improve performance, we build a new model that has a single shared layer, followed by independent layers for PFI and OS prediction. We train this model using equally-weighted Efron losses for PFI and OS, and evaluate its accuracy in predicting PFI.
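
The structure of such a model can be sketched as a plain NumPy forward pass: one shared hidden layer feeds two separate linear heads, and the training objective is a weighted sum of the per-task losses. The sizes, random weights, and the `total_loss` helper are hypothetical and only illustrate the idea; biases are omitted for brevity.

```python
import numpy as np

rng = np.random.default_rng(0)
n_samples, n_features, n_hidden = 5, 8, 10  # hypothetical sizes


def selu(x, alpha=1.6732632423543772, scale=1.0507009873554805):
    """Scaled exponential linear unit, the activation used in the shared layer."""
    return scale * np.where(x > 0, x, alpha * (np.exp(x) - 1.0))


x = rng.normal(size=(n_samples, n_features))
w_shared = 0.1 * rng.normal(size=(n_features, n_hidden))
w_pfi = 0.1 * rng.normal(size=(n_hidden, 1))
w_os = 0.1 * rng.normal(size=(n_hidden, 1))

hidden = selu(x @ w_shared)  # representation shared by both tasks
risk_pfi = hidden @ w_pfi    # PFI risk score, one per sample
risk_os = hidden @ w_os      # OS risk score, one per sample


def total_loss(loss_pfi, loss_os, weights=(1.0, 1.0)):
    """Combine per-task losses; equal weights correspond to equally-weighted tasks."""
    return weights[0] * loss_pfi + weights[1] * loss_os


print(risk_pfi.shape, risk_os.shape)  # (5, 1) (5, 1)
```

Because both heads read the same hidden representation, gradients from either task update the shared weights, which is how a sample with only one available label can still inform the other task.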

In [209]:
# create tf Dataset for Keras training, with one label matrix per task
dataset = tf.data.Dataset.from_tensor_slices(
    (features[train, :], (pfi[train, :], osr[train, :]))
)

# build a multitask model: one shared layer followed by separate PFI and OS layers
inputs = tf.keras.Input((features.shape[1],))
beta1 = tf.keras.layers.Dense(units=10, activation="selu")
beta_pfi = tf.keras.layers.Dense(units=1, activation="linear", name="pfi")
beta_osr = tf.keras.layers.Dense(units=1, activation="linear", name="os")
output1 = beta_pfi(beta1(inputs))
output2 = beta_osr(beta1(inputs))
model = tf.keras.Model(inputs=inputs, outputs=[output1, output2])

# train multitask network using cox efron loss and Harrell's c-index as a metric
model.compile(
    loss=[efron, efron],
    metrics=[HarrellsC()],
    optimizer=tf.keras.optimizers.Adam(learning_rate=1e-4),
)
model.fit(x=dataset.batch(64), epochs=200, verbose=0)

# evaluate on testing data; the first model output is the PFI prediction
risks = model(features[test, :])[0]
cindex = HarrellsC()
print("Testing c-index: %0.3f" % cindex(pfi[test, :], risks))

# visualize: split test samples at the median predicted risk
risk_groups = np.squeeze(np.array(risks > np.median(risks), np.int32)) + 1
km_plot(
    np.array(pfi[test, :]),
    groups=risk_groups,
    xlabel="Time",
    ylabel="Progression probability",
    legend=["predicted low risk", "predicted high risk"],
)