# Transfer learning using MLNN

In this notebook, we follow a transfer learning (TL) approach to solve a cancer prediction task on gene-expression samples from a specific tumor type. We pre-train an MLNN on the non-Lung cancer samples from the TCGA PanCancer dataset and then fine-tune the model on the Lung cancer dataset (see the `PanCancer_Lung_Split` notebook). As input data, we use the gene expression profiles modeled directly as numeric vectors.

In [1]:
import numpy as np
import pandas as pd
import warnings

# Auxiliary components
from bio_dl_utils import *

Using TensorFlow backend.


## Progression-free interval

Here, we predict the discretized progression-free interval (PFI) of each patient (sample), which corresponds to a binary classification task:

In [2]:
# Define survival variable of interest
surv_variable = "PFI"
surv_variable_time = "PFI.time"

### non-Lung cancer

We use only the non-Lung tumor samples from the TCGA PanCancer dataset that have the survival information of interest:

In [4]:
# Load samples-info dataset
Y_info = pd.read_hdf("data/PanCancer/non_Lung_pancan.h5", 
                     key="sample_type")
Y_info.shape

(9374, 4)

In [5]:
# Load survival clinical outcome dataset
Y_surv = pd.read_hdf("data/PanCancer/non_Lung_pancan.h5", 
                     key="sample_clinical")
Y_surv.shape

(9374, 33)

In [6]:
# tumor-normal distribution
Y_info.tumor_normal.value_counts(normalize=False, dropna=False)

Tumor     8771
Normal     603
Name: tumor_normal, dtype: int64

In [7]:
# Filter tumor samples from survival clinical outcome dataset
Y_surv = Y_surv.loc[Y_info.tumor_normal=="Tumor"]
Y_surv.shape

(8771, 33)

In [8]:
# Drop rows where surv_variable or surv_variable_time is NA
# (reassign instead of using inplace=True, to avoid modifying a filtered slice)
Y_surv = Y_surv.dropna(subset=[surv_variable, surv_variable_time])
Y_surv.shape

(8563, 33)

In [9]:
# Event PFI samples time distribution
Y_surv.loc[Y_surv.PFI==1.0]['PFI.time'].describe()

count     2992.000000
mean       625.818516
std        817.163242
min          0.000000
25%        188.000000
50%        370.000000
75%        729.250000
max      10334.000000
Name: PFI.time, dtype: float64

In [10]:
# Censored PFI samples time distribution
Y_surv.loc[Y_surv.PFI==0.0]['PFI.time'].describe()

count     5571.000000
mean      1050.329564
std       1017.383207
min          0.000000
25%        388.000000
50%        741.000000
75%       1409.000000
max      11217.000000
Name: PFI.time, dtype: float64

We create a discrete time class variable using the fixed time point selected in the `Lung_PFI_Prediction` notebook:

In [11]:
time = 230
Y_surv_disc = Y_surv[['PFI', 'PFI.time']].apply(
    lambda row: survival_fixed_time(time, row['PFI.time'], row['PFI']), axis=1)

Y_surv_disc.dropna(inplace=True)
Y_surv_disc.shape

(7707,)
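
The helper `survival_fixed_time` comes from `bio_dl_utils` and its source is not shown here. As a rough illustration only, the sketch below assumes the usual fixed-time-point rule: a sample is labelled 1 if progression was observed at or before the time point, 0 if its follow-up extends beyond the time point, and NaN (later dropped) if it was censored before the time point. The function name `fixed_time_label` and the handling of boundary cases are assumptions, not the actual implementation.

```python
import math

def fixed_time_label(t, event_time, event):
    """Hypothetical stand-in for `survival_fixed_time` (assumed behaviour)."""
    if event_time > t:
        return 0.0       # follow-up extends beyond t: progression-free at t
    if event == 1:
        return 1.0       # progression observed at or before t
    return math.nan      # censored at or before t: status at t is unknown

# Toy rows as (PFI.time, PFI) pairs
rows = [(100, 1), (500, 1), (500, 0), (100, 0)]
labels = [fixed_time_label(230, event_time, event) for event_time, event in rows]
print(labels)
```

Under this assumption, the rows removed by `dropna` are exactly the samples censored before day 230, which would explain why the sample count shrinks after discretization.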

In [12]:
# Event class fraction
sum(Y_surv_disc)/len(Y_surv_disc)

0.12222654729466718

In [None]:
%%time
# Load gene-exp vectors: this dataset was obtained from the final KEGG BRITE functional hierarchies dataset generated in
# the 1-KEGG_BRITE_Hierarchy notebook, by selecting only the columns corresponding to PanCancer samples, removing the
# duplicated genes (rows) and transposing it.
# index_col=0 keeps the sample IDs as the row index, which the label-based selection below relies on
df_gene_exp = pd.read_csv("./KEGG_gene_exp.csv", index_col=0)

In [13]:
df_gene_exp.shape

(10535, 7509)

In [16]:
df_gene_exp.head()

| | ENSG00000187961.13 | ENSG00000188290.10 | ENSG00000187608.8 | ENSG00000188157.13 | ENSG00000186891.13 | ENSG00000186827.10 | ENSG00000184163.3 | ENSG00000162572.19 | ENSG00000131584.18 | ENSG00000169962.4 | ... | ENSG00000067048.16 | ENSG00000183878.15 | ENSG00000154620.5 | ENSG00000165246.12 | ENSG00000012817.15 | ENSG00000198692.9 | ENSG00000105227.14 | ENSG00000164237.8 | ENSG00000175048.16 | ENSG00000188706.12 |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| TCGA.02.0047.01 | 1.3225 | 4.1604 | 5.8166 | 6.3983 | -1.9942 | 0.7493 | 0.3346 | 0.7321 | 5.7493 | -2.1779 | ... | 4.576 | 2.1013 | 1.2815 | 3.6497 | 3.7614 | 4.6508 | 1.2815 | 4.3618 | 4.9426 | 5.7748 |
| TCGA.02.0055.01 | 2.3135 | 3.6148 | 6.9599 | 4.3356 | 2.9281 | 1.5266 | 0.4016 | 1.1316 | 4.1692 | -3.458 | ... | -3.816 | -6.5064 | -9.9658 | -5.5735 | -3.0469 | -4.035 | 0.2881 | 2.5924 | 2.9488 | 5.6056 |
| TCGA.02.2483.01 | 2.5707 | 3.8729 | 5.9072 | 6.3946 | -1.9379 | 2.2813 | 0.2029 | 0.9419 | 5.3995 | -2.9324 | ... | 3.8391 | 1.2085 | 1.7744 | 3.0428 | 2.727 | 5.3042 | -1.1172 | 3.5523 | 3.345 | 4.836 |
| TCGA.02.2485.01 | 3.3814 | 5.8875 | 9.9433 | 6.2132 | -0.8599 | 1.3051 | 0.0014 | 1.8801 | 6.0637 | -2.4659 | ... | 4.1036 | 1.5661 | 0.5568 | 2.7095 | 4.0019 | 4.809 | 0.9642 | 3.6635 | 3.9468 | 4.5571 |
| TCGA.04.1331.01 | 2.05 | 4.7661 | 8.6119 | 6.6414 | -1.685 | 1.3846 | 0.7664 | 2.4831 | 3.6961 | -3.1714 | ... | -9.9658 | -9.9658 | -9.9658 | -9.9658 | -9.9658 | -9.9658 | 0.5955 | 4.366 | 1.4547 | 5.1486 |


In [17]:
# Select samples with discrete time survival information associated
df_gene_exp_disc = df_gene_exp.loc[[s.replace("-", ".") for s in Y_surv_disc.index]]

In [18]:
df_gene_exp_disc.shape

(7707, 7509)

We now create the binary class variables:

In [19]:
from sklearn.preprocessing import LabelEncoder

# Convert discrete time survival numerical variables into binary variables
Y_surv_disc_class = LabelEncoder().fit_transform(Y_surv_disc)
np.unique(Y_surv_disc_class)

array([0, 1])

In [20]:
Y_surv_disc_class.shape

(7707,)

### Lung

We also load the Lung tumor samples from the TCGA PanCancer dataset that have the survival information of interest:

In [21]:
# Load samples-info dataset
Y_info_ft = pd.read_hdf("../data/PanCancer/Lung_pancan.h5", key="sample")
# Load survival clinical outcome dataset
Y_surv_ft = pd.read_hdf("../data/PanCancer/Lung_pancan.h5", key="survival_outcome")
# Filter tumor samples from survival clinical outcome dataset
Y_surv_ft = Y_surv_ft.loc[Y_info_ft.tumor_normal=="Tumor"]
# Drop rows where surv_variable or surv_variable_time is NA
# (reassign instead of using inplace=True, to avoid modifying a filtered slice)
Y_surv_ft = Y_surv_ft.dropna(subset=[surv_variable, surv_variable_time])
Y_surv_ft.shape

(999, 33)

In [22]:
time = 230
Y_surv_disc_ft = Y_surv_ft[['PFI', 'PFI.time']].apply(
    lambda row: survival_fixed_time(time, row['PFI.time'], row['PFI']), axis=1)

Y_surv_disc_ft.dropna(inplace=True)
Y_surv_disc_ft.shape

(855,)

In [23]:
# Event class fraction
sum(Y_surv_disc_ft)/len(Y_surv_disc_ft)

0.09473684210526316

In [24]:
# Select samples with discrete time survival information associated
df_gene_exp_disc_ft = df_gene_exp.loc[[s.replace("-", ".") for s in Y_surv_disc_ft.index]]

In [25]:
df_gene_exp_disc_ft.shape

(855, 7509)

We now create the binary class variables:

In [26]:
# Convert discrete time survival numerical variables into binary variables
Y_surv_disc_class_ft = LabelEncoder().fit_transform(Y_surv_disc_ft)
Y_surv_disc_class_ft.shape

(855,)

### Joint pre-training (PT) and fine-tuning (FT)

We use Bayesian optimization to tune the hyperparameters of an MLNN architecture under a cross-validation (CV) procedure. Random over-sampling is applied in both the pre-training and fine-tuning phases to deal with the severe class imbalance present in both the non-Lung and Lung cancer datasets.
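
To make the over-sampling step concrete, here is a minimal pure-Python sketch of random over-sampling with a target ratio. It is not imbalanced-learn's implementation: the helper `random_oversample` is hypothetical, and it only mirrors the meaning of a float `sampling_strategy`, namely the desired ratio of minority to majority samples after resampling. The toy label vector is invented for illustration.

```python
import random
from collections import Counter

def random_oversample(y, ratio, seed=0):
    """Return indices into y after randomly duplicating minority samples.

    ratio is the desired (#minority / #majority) after resampling.
    """
    rng = random.Random(seed)
    counts = Counter(y)
    minority = min(counts, key=counts.get)
    majority = max(counts, key=counts.get)
    minority_idx = [i for i, label in enumerate(y) if label == minority]
    target = round(ratio * counts[majority])
    n_extra = max(0, target - counts[minority])
    # Draw the extra minority samples with replacement
    extra = [rng.choice(minority_idx) for _ in range(n_extra)]
    return list(range(len(y))) + extra

# Toy labels: 120 negatives, 12 positives
y_toy = [0] * 120 + [1] * 12
for ratio in (1, 1/2, 1/3, 1/4):
    idx = random_oversample(y_toy, ratio)
    print(ratio, dict(Counter(y_toy[i] for i in idx)))
```

In the notebook, the sampler is a step of an `imblearn` `Pipeline`, so resampling happens only when the pipeline is fitted on a training fold; the held-out fold is scored on its original class distribution.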

In [27]:
from sklearn.preprocessing import StandardScaler
from sklearn.model_selection import RepeatedStratifiedKFold
from sklearn.metrics import make_scorer, roc_auc_score
from imblearn.over_sampling import RandomOverSampler

# Define training datasets
X = df_gene_exp_disc_ft
y = Y_surv_disc_class_ft

# Define the scaler
sc = StandardScaler()

# Define re-sampling method
ros = RandomOverSampler(random_state=69)

# Define cross-validation train-test splits
cv_split = RepeatedStratifiedKFold(n_splits=5, n_repeats=10, random_state=23)

# Define evaluation metrics
model_sel_metric = 'auc'
eval_metric = {'auc': make_scorer(roc_auc_score, needs_proba=True), 
               'acc': make_scorer(opt_accuracy_score, needs_proba=True), 
               'sens': make_scorer(opt_sensitivity_score, needs_proba=True),
               'spec': make_scorer(opt_specificity_score, needs_proba=True),
               'prec': make_scorer(opt_precision_score, needs_proba=True),
               'f1': make_scorer(opt_f1_score, needs_proba=True),
               'mcc': make_scorer(opt_mcc_score, needs_proba=True),
               'thres': make_scorer(opt_threshold_score, needs_proba=True)}

# Bayesian-Optimization parameters
n_iter = 100
random_state = 666
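
`RepeatedStratifiedKFold(n_splits=5, n_repeats=10)` yields 5 × 10 = 50 train/test splits, each keeping the class ratio roughly constant across folds. The sketch below illustrates that idea in pure Python; it is not scikit-learn's implementation, and the function name `repeated_stratified_kfold` is hypothetical. The toy label vector uses the class counts implied by the Lung event fraction reported earlier (81 events out of 855 samples).

```python
import random
from collections import defaultdict

def repeated_stratified_kfold(y, n_splits, n_repeats, seed=0):
    """Yield (train_idx, test_idx) pairs with each class spread evenly over folds."""
    rng = random.Random(seed)
    by_class = defaultdict(list)
    for i, label in enumerate(y):
        by_class[label].append(i)
    for _ in range(n_repeats):
        folds = [[] for _ in range(n_splits)]
        for indices in by_class.values():
            shuffled = indices[:]
            rng.shuffle(shuffled)
            # Deal the shuffled samples of this class round-robin into the folds
            for pos, i in enumerate(shuffled):
                folds[pos % n_splits].append(i)
        for k in range(n_splits):
            test_idx = sorted(folds[k])
            train_idx = sorted(i for j, fold in enumerate(folds) if j != k for i in fold)
            yield train_idx, test_idx

y_toy = [1] * 81 + [0] * 774
splits = list(repeated_stratified_kfold(y_toy, n_splits=5, n_repeats=10))
print(len(splits))  # 50
print(sorted({sum(y_toy[i] for i in test) for _, test in splits}))
```

With only about 16 events in each test fold, per-fold metrics such as sensitivity move in coarse steps, which is one reason to repeat the 5-fold split several times and summarize the distribution.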

## Over-sampling

### Fine-tune all

In [28]:
from hyperopt import hp

# Define the MLNN hyperparameters space
params_space = {
    # PT hyper-param
    # We assume that the base model contains 2 dense layers
    'clf__pt_params': hp.choice('pt_params', 
                [{'clf__dense_choice': hp.choice('dense_num_layers',
                             [{#layers 2
                               'clf__add_dense': 0,
                               'clf__dense_unit_1': hp.choice('2dense_unit_1', [1000, 1500, 2000, 2350, 2700]),
                               'clf__dense_unit_2': hp.choice('2dense_unit_2', [50, 100, 250, 500, 800]),
                               
                               'clf__dense_dropout_1': hp.choice('2dense_dropout_1', [0.2, 0.4, 0.6, 0.8]),
                               'clf__dense_dropout_2': hp.choice('2dense_dropout_2', [0.2, 0.4, 0.6, 0.8])
                              },
                              {#layers 3
                               'clf__add_dense': 1,
                               'clf__dense_unit_1': hp.choice('3dense_unit_1', [1500, 1750, 2000, 2250, 2500, 2700]),
                               'clf__dense_unit_2': hp.choice('3dense_unit_2', [200, 400, 700, 1000]),
                               'clf__dense_unit_3': hp.choice('3dense_unit_3', [30, 80, 120, 160]),
                               
                               'clf__dense_activation_3': 'relu',
                               
                               'clf__dense_dropout_1': hp.choice('3dense_dropout_1', [0.2, 0.4, 0.6, 0.8]),
                               'clf__dense_dropout_2': hp.choice('3dense_dropout_2', [0.2, 0.4, 0.6, 0.8]),
                               'clf__dense_dropout_3': hp.choice('3dense_dropout_3', [0.2, 0.4, 0.6, 0.8])
                              }]),
                  
               'clf__batch_size': hp.choice('batch_size_pt', [64, 128, 256, 384, 512]),
               'clf__lr': hp.loguniform('lr_pt', np.log(1e-3), np.log(1e-1)),
               # Re-sampling hyper-params
               're_sample_pt__sampling_strategy': hp.choice('sampling_strategy_pt', [1, 1/2, 1/3, 1/4])}]),
    
    'clf__ft_params': hp.choice('ft_params', 
                [{'clf__batch_size': hp.choice('batch_size_ft', [32, 80, 128, 192, 256]),
                  'clf__lr': hp.loguniform('lr_ft', np.log(5e-4), np.log(1e-1)),
                  # Re-sampling hyper-params
                  're_sample_ft__sampling_strategy': hp.choice('sampling_strategy_ft', [1, 1/2, 1/3, 1/4])}])
}
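
The learning-rate ranges above are given in log space because `hp.loguniform(label, low, high)` returns `exp(uniform(low, high))`. The short standard-library sketch below reproduces that sampling rule to show its effect: each decade of the range is equally likely. The helper `sample_loguniform` is a hypothetical name used only for this illustration.

```python
import math
import random

def sample_loguniform(low, high, rng):
    """Draw a value whose logarithm is uniform on [log(low), log(high)]."""
    return math.exp(rng.uniform(math.log(low), math.log(high)))

rng = random.Random(0)
draws = [sample_loguniform(1e-3, 1e-1, rng) for _ in range(10_000)]

# About half of the draws fall in [1e-3, 1e-2), the other half in [1e-2, 1e-1]
fraction_below = sum(d < 1e-2 for d in draws) / len(draws)
print(f"fraction of draws below 1e-2: {fraction_below:.2f}")
```

A plain uniform draw on the same interval would place roughly 90% of the samples above 1e-2, so small learning rates would rarely be tried.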

In [29]:
from imblearn.pipeline import Pipeline

warnings.filterwarnings('ignore')

# Define PT MLNN estimator
pre_model_path = 'keras-models/ros_mlnn_pt_ft_disc_pfi_x.h5'
mlnn_pt = SklearnMLNN(input_shape=X.shape[1], dense_unit={}, dense_activation={1: 'relu', 2: 'relu'}, 
                 dense_dropout={}, output_unit=1, output_activation='sigmoid', 
                 optimizer_name='adam', loss_function='binary_crossentropy', epoch=200, patience=10, verbose=0, 
                 model_path=pre_model_path)

mlnn_pipe_pt = Pipeline([('scaler', sc), ('re_sample_pt', ros), ('clf', mlnn_pt)])

# Define FT MLNN estimator
mlnn_ft = SklearnFT(pre_layer=10, n_freeze=0, dense_unit={}, dense_activation={}, dense_dropout={},
                 optimizer_name='adam', loss_function='binary_crossentropy', epoch=200, patience=10, verbose=0, 
                 pre_model=pre_model_path, model_path='keras-models/ros_mlnn_pt_ft_disc_pfi_x_fine.h5')

mlnn_pipe_ft = Pipeline([('scaler', sc), ('re_sample_ft', ros), ('clf', mlnn_ft)])

# Define Bayesian-Optimization
hyper_search = HyperoptCV_TL(estimator_pt=mlnn_pipe_pt,
                             X_pt=df_gene_exp_disc,
                             y_pt=Y_surv_disc_class,
                             estimator_ft=mlnn_pipe_ft,
                             hyper_space=params_space, cv=cv_split,
                             scoring=eval_metric, opt_metric=model_sel_metric,
                             n_iter=n_iter, random_seed=random_state,
                             verbose_file='class_imb_mlnn_new_pt_ft_verbose.txt')

In [None]:
%%time
best_trial = hyper_search.model_selection(X, y)

In [31]:
best_trial['result']['params']

{'clf__ft_params': {'clf__batch_size': 128,
  'clf__lr': 0.007139777328003964,
  're_sample_ft__sampling_strategy': 1},
 'clf__pt_params': {'clf__batch_size': 128,
  'clf__dense_choice': {'clf__add_dense': 0,
   'clf__dense_dropout_1': 0.8,
   'clf__dense_dropout_2': 0.8,
   'clf__dense_unit_1': 2700,
   'clf__dense_unit_2': 500},
  'clf__lr': 0.0655685703917845,
  're_sample_pt__sampling_strategy': 1}}

In [33]:
scores = best_trial['result']['test_score']
res = pd.DataFrame({'AUC': scores['test_auc'], 
              'ACC': scores['test_acc'], 
              'Sens': scores['test_sens'], 
              'Spec': scores['test_spec'],
              'Prec': scores['test_prec'],
              'F-1': scores['test_f1'],
              'MCC': scores['test_mcc'],
              'Thres': scores['test_thres']})
res.describe()

| | ACC | AUC | F-1 | MCC | Prec | Sens | Spec | Thres |
|---|---|---|---|---|---|---|---|---|
| count | 50.0 | 50.0 | 50.0 | 50.0 | 50.0 | 50.0 | 50.0 | 50.0 |
| mean | 0.726901 | 0.709365 | 0.308105 | 0.239307 | 0.217497 | 0.611691 | 0.738827 | 0.03572214 |
| std | 0.09902 | 0.056254 | 0.053412 | 0.062848 | 0.064553 | 0.146852 | 0.121793 | 0.07001453 |
| min | 0.467836 | 0.601613 | 0.229508 | 0.127204 | 0.144231 | 0.3125 | 0.422078 | 1.055455e-09 |
| 25% | 0.660819 | 0.665827 | 0.270499 | 0.191119 | 0.170628 | 0.5 | 0.653226 | 4.917509e-05 |
| 50% | 0.72807 | 0.704354 | 0.304531 | 0.232946 | 0.2 | 0.625 | 0.751613 | 0.001340464 |
| 75% | 0.815789 | 0.749194 | 0.342919 | 0.274052 | 0.25 | 0.701287 | 0.839568 | 0.02813395 |
| max | 0.894737 | 0.847984 | 0.4375 | 0.37976 | 0.4375 | 0.9375 | 0.941935 | 0.2786947 |
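
The threshold-dependent columns (ACC, Sens, Spec, Prec, F-1, MCC) all derive from a confusion matrix at a given probability threshold. The sketch below applies the standard definitions to toy data. How the `opt_*` scorers from `bio_dl_utils` choose the threshold is not shown in this notebook, so here the threshold is simply passed in; `thresholded_metrics` and the toy arrays are hypothetical.

```python
import math

def thresholded_metrics(y_true, y_prob, threshold):
    """Compute standard binary metrics after thresholding predicted probabilities."""
    y_pred = [int(p >= threshold) for p in y_prob]
    tp = sum(t == 1 and p == 1 for t, p in zip(y_true, y_pred))
    tn = sum(t == 0 and p == 0 for t, p in zip(y_true, y_pred))
    fp = sum(t == 0 and p == 1 for t, p in zip(y_true, y_pred))
    fn = sum(t == 1 and p == 0 for t, p in zip(y_true, y_pred))

    def ratio(num, den):
        return num / den if den else 0.0   # guard against empty denominators

    sens = ratio(tp, tp + fn)
    prec = ratio(tp, tp + fp)
    mcc_den = math.sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    return {
        'acc': ratio(tp + tn, len(y_true)),
        'sens': sens,
        'spec': ratio(tn, tn + fp),
        'prec': prec,
        'f1': ratio(2 * prec * sens, prec + sens),
        'mcc': ratio(tp * tn - fp * fn, mcc_den),
    }

# Toy fold: 2 events among 10 samples
y_true = [1, 1, 0, 0, 0, 0, 0, 0, 0, 0]
y_prob = [0.9, 0.2, 0.8, 0.1, 0.1, 0.3, 0.2, 0.1, 0.05, 0.4]
metrics = thresholded_metrics(y_true, y_prob, threshold=0.5)
print(metrics)
```

In this toy fold the accuracy is 0.8 while sensitivity and precision are both 0.5, which illustrates why accuracy alone is a weak summary when events are rare.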


In [34]:
# Save results
file_path = 'results/ros_mlnn_new_disc_pfi_lung_ft_100_iter_rep_kfold.csv'
res.to_csv(file_path, index=False)