# Transfer learning using CNN

This notebook follows a transfer learning (TL) approach to a cancer prediction task on gene-expression samples from a specific tumor type. We pre-train a CNN on the non-Lung cancer samples from the TCGA PanCancer dataset and then fine-tune the model on the Lung cancer dataset (see `PanCancer_Lung_Split` notebook). As input, each sample's gene-expression profile is modeled as an image (matrix) whose pixels (gene-expression values) are placed next to each other according to their KEGG-BRITE functional hierarchy (see `2-KEGG_BRITE_Treemap` R notebook).

In [1]:
import numpy as np
import pandas as pd
import warnings

# Auxiliary components
from bio_dl_utils import *

Using TensorFlow backend.


## Progression-free interval

Here, we predict the discrete progression-free interval (PFI) of each patient (sample), which corresponds to a binary classification task:

In [2]:
# Define survival variable of interest
surv_variable = "PFI"
surv_variable_time = "PFI.time"

### non-Lung cancer

We only use the non-Lung tumor samples from the TCGA PanCancer dataset that have the survival information of interest:

In [3]:
# Load samples-info dataset
Y_info = pd.read_hdf("../data/PanCancer/non_Lung_pancan.h5", 
                     key="sample_type")
Y_info.shape

(9374, 4)

In [4]:
# Load survival clinical outcome dataset
Y_surv = pd.read_hdf("../data/PanCancer/non_Lung_pancan.h5", 
                     key="sample_clinical")
Y_surv.shape

(9374, 33)

In [5]:
# tumor-normal distribution
Y_info.tumor_normal.value_counts(normalize=False, dropna=False)

Tumor     8771
Normal     603
Name: tumor_normal, dtype: int64

In [6]:
# Filter tumor samples from survival clinical outcome dataset
Y_surv = Y_surv.loc[Y_info.tumor_normal=="Tumor"]
Y_surv.shape

(8771, 33)

In [7]:
# Drop rows where surv_variable or surv_variable_time is NA
# (re-assign instead of inplace=True, since Y_surv is a filtered slice)
Y_surv = Y_surv.dropna(subset=[surv_variable, surv_variable_time])
Y_surv.shape

(8563, 33)

In [8]:
# Event PFI samples time distribution
Y_surv.loc[Y_surv.PFI == 1.0, 'PFI.time'].describe()

count     2992.000000
mean       625.818516
std        817.163242
min          0.000000
25%        188.000000
50%        370.000000
75%        729.250000
max      10334.000000
Name: PFI.time, dtype: float64

In [9]:
# Censored PFI samples time distribution
Y_surv.loc[Y_surv.PFI == 0.0, 'PFI.time'].describe()

count     5571.000000
mean      1050.329564
std       1017.383207
min          0.000000
25%        388.000000
50%        741.000000
75%       1409.000000
max      11217.000000
Name: PFI.time, dtype: float64

We create a discrete-time class variable using the fixed time point selected in the `Lung_PFI_Prediction` notebook:

In [10]:
time = 230
Y_surv_disc = Y_surv[['PFI', 'PFI.time']].apply(
    lambda row: survival_fixed_time(time, row['PFI.time'], row['PFI']), axis=1)

Y_surv_disc.dropna(inplace=True)
Y_surv_disc.shape

(7707,)
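
The helper `survival_fixed_time` comes from `bio_dl_utils` and its code is not shown here. The sketch below is only a plausible reading of what such a discretization does, written under stated assumptions: an event at or before the threshold is labelled 1, any follow-up longer than the threshold is labelled 0, and a sample censored before the threshold is returned as NaN (which would explain the `dropna` call above). The function name `fixed_time_label` and the handling of the boundary case are hypothetical.

```python
import math

def fixed_time_label(threshold, time, event):
    """Hypothetical stand-in for survival_fixed_time (not the notebook's code)."""
    if event == 1 and time <= threshold:
        return 1.0       # progression observed within the window
    if time > threshold:
        return 0.0       # still progression-free when the window closes
    return math.nan      # censored before the threshold: outcome unknown

# (PFI.time, PFI) pairs: early event, late event, long censored, early censored
samples = [(100, 1), (500, 1), (900, 0), (120, 0)]
labels = [fixed_time_label(230, t, e) for t, e in samples]
print(labels)
```

Under these assumptions a late event (500 days) counts as class 0 at the 230-day horizon, and only the early-censored sample is discarded.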

In [11]:
# Event class fraction
sum(Y_surv_disc)/len(Y_surv_disc)

0.12222654729466718

In [12]:
%%time
# Load gene-exp images
n_width = 175
n_height = 175
dir_name = "gene_exp_treemap_" + str(n_width) + "_" + str(n_height) + "_npy/"

X_gene_exp = np.array([np.load(dir_name + s.replace("-", ".") + ".npy") for s in Y_surv_disc.index]) 

CPU times: user 2.9 s, sys: 1.98 s, total: 4.88 s
Wall time: 17 s


In [13]:
X_gene_exp.shape

(7707, 175, 175)

We now create the binary class variables:

In [14]:
from sklearn.preprocessing import LabelEncoder

# Convert discrete time survival numerical variables into binary variables
Y_surv_disc_class = LabelEncoder().fit_transform(Y_surv_disc)
np.unique(Y_surv_disc_class)

array([0, 1])

In [15]:
Y_surv_disc_class.shape

(7707,)

### Lung

We also load the Lung tumor samples from the TCGA PanCancer dataset that have the survival information of interest:

In [18]:
# Load samples-info dataset
Y_info_ft = pd.read_hdf("../data/PanCancer/Lung_pancan.h5", key="sample")
# Load survival clinical outcome dataset
Y_surv_ft = pd.read_hdf("../data/PanCancer/Lung_pancan.h5", key="survival_outcome")
# Filter tumor samples from survival clinical outcome dataset
Y_surv_ft = Y_surv_ft.loc[Y_info_ft.tumor_normal=="Tumor"]
# Drop rows where surv_variable or surv_variable_time is NA
# (re-assign instead of inplace=True, since Y_surv_ft is a filtered slice)
Y_surv_ft = Y_surv_ft.dropna(subset=[surv_variable, surv_variable_time])
Y_surv_ft.shape

(999, 33)

In [19]:
time = 230
Y_surv_disc_ft = Y_surv_ft[['PFI', 'PFI.time']].apply(
    lambda row: survival_fixed_time(time, row['PFI.time'], row['PFI']), axis=1)

Y_surv_disc_ft.dropna(inplace=True)
Y_surv_disc_ft.shape

(855,)

In [20]:
%%time
# Load gene-exp matrices
X_gene_exp_ft = np.array([np.load(dir_name + s.replace("-", ".") + ".npy") for s in Y_surv_disc_ft.index]) 

CPU times: user 347 ms, sys: 120 ms, total: 466 ms
Wall time: 1.99 s


In [21]:
X_gene_exp_ft.shape

(855, 175, 175)

We now create the binary class variables:

In [22]:
# Convert discrete time survival numerical variables into binary variables
Y_surv_disc_class_ft = LabelEncoder().fit_transform(Y_surv_disc_ft)
Y_surv_disc_class_ft.shape

(855,)

### Join pre-training (PT) and fine-tuning (FT)

We use Bayesian optimization to tune the hyper-parameters of a 2D-CNN architecture under a cross-validation (CV) procedure. Random over-sampling is applied in both the pre-training and fine-tuning phases to deal with the severe class imbalance present in both the non-Lung and Lung cancer datasets.
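
To make the effect of the `sampling_strategy` values searched later (1, 1/2, 1/3, 1/4) more concrete, here is a minimal pure-Python sketch of random over-sampling. It is not the `imblearn` implementation: the function `random_oversample` is hypothetical, and it assumes a binary problem where label 1 is the minority class and the ratio means minority count divided by majority count after re-sampling.

```python
import random

def random_oversample(y, ratio, seed=0):
    """Return sample indices after duplicating minority samples at random
    until the minority/majority ratio reaches `ratio` (illustrative only)."""
    rng = random.Random(seed)
    minority = [i for i, label in enumerate(y) if label == 1]
    majority = [i for i, label in enumerate(y) if label == 0]
    target = int(ratio * len(majority))          # desired minority count
    n_extra = max(0, target - len(minority))     # never remove samples
    extra = [rng.choice(minority) for _ in range(n_extra)]
    return majority + minority + extra

y = [0] * 88 + [1] * 12                          # roughly 12% events
idx = random_oversample(y, ratio=0.5)
n_pos = sum(y[i] for i in idx)
n_neg = len(idx) - n_pos
print(n_neg, n_pos)
```

With a ratio of 0.5 the 12 minority samples are re-drawn until there is one event for every two non-events, while the majority class is left untouched.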

In [23]:
from sklearn.preprocessing import StandardScaler, FunctionTransformer
from sklearn.model_selection import RepeatedStratifiedKFold
from sklearn.metrics import make_scorer
from imblearn.over_sampling import RandomOverSampler

# Define training datasets
X = X_gene_exp_ft
y = Y_surv_disc_class_ft

# Define the scaler
sc = StandardScaler()

# Define re-sampling method
ros = RandomOverSampler(random_state=69)

# Define image input shape
image_shape = (*X.shape[1:3], 1)

# Define custom transformer
ft = FunctionTransformer(func=reshape_transformer, kw_args={'final_shape': image_shape})

# Define cross-validation train-test splits
cv_split = RepeatedStratifiedKFold(n_splits=5, n_repeats=10, random_state=23)

# Define evaluation metrics
model_sel_metric = 'auc'
eval_metric = {'auc': 'roc_auc', 
               'acc': make_scorer(opt_accuracy_score, needs_proba=True), 
               'sens': make_scorer(opt_sensitivity_score, needs_proba=True),
               'spec': make_scorer(opt_specificity_score, needs_proba=True),
               'prec': make_scorer(opt_precision_score, needs_proba=True),
               'f1': make_scorer(opt_f1_score, needs_proba=True),
               'mcc': make_scorer(opt_mcc_score, needs_proba=True),
               'thres': make_scorer(opt_threshold_score, needs_proba=True)}

# Bayesian-Optimization parameters
n_iter = 100
random_state = 666
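
The images are later passed to the pipelines as flattened rows (so that `StandardScaler` can operate on a 2D array) and turned back into image tensors by the `FunctionTransformer`. The helper `reshape_transformer` lives in `bio_dl_utils` and is not shown; the sketch below assumes it simply reshapes each row to `final_shape`, and the name `reshape_images` is hypothetical.

```python
import numpy as np

def reshape_images(X, final_shape):
    """Hypothetical equivalent of reshape_transformer: rows -> image tensors."""
    return X.reshape((-1, *final_shape))

# Three toy 4x4 "images" instead of 175x175 gene-expression treemaps
images = np.arange(3 * 4 * 4, dtype=float).reshape(3, 4, 4)

flat = images.reshape(images.shape[0], -1)               # (3, 16), scaler-friendly
restored = reshape_images(flat, final_shape=(4, 4, 1))   # (3, 4, 4, 1), CNN-friendly
print(flat.shape, restored.shape)
```

Because both reshapes use the same row-major order, the round trip preserves every pixel position.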

## Over-sampling

### Fine-tune all

In [23]:
from hyperopt import hp

# Define the 2D-CNN hyperparameters space
params_space = {
    # PT hyper-param
    # We assume that the base model contains 2 CNN layers and 1 dense layer
    'clf__pt_params': hp.choice('pt_params', 
                [{'clf__dense_choice': hp.choice('dense_num_layers', 
                                         [{'clf__add_dense': 0,
                                           'clf__dense_unit_1': hp.choice('1dense_unit_1', [120, 160, 200, 240]),

                                           'clf__dense_dropout_1': hp.choice('1dense_dropout_1', [0.2, 0.4, 0.6, 0.8])
                                          },
                                          {'clf__add_dense': 1,
                                           'clf__dense_unit_1': hp.choice('2dense_unit_1', [120, 160, 200, 240]),
                                           'clf__dense_unit_2': hp.choice('2dense_unit_2', [25, 50, 75, 100]),

                                           'clf__dense_activation_2': 'relu',

                                           'clf__dense_dropout_1': hp.choice('2dense_dropout_1', [0.2, 0.4, 0.6, 0.8]),
                                           'clf__dense_dropout_2': hp.choice('2dense_dropout_2', [0.2, 0.4, 0.6, 0.8])
                                          }]),
                'clf__cnn_filter_1': hp.choice('cnn_filter_1', [2, 4, 8, 12, 16]),
                'clf__cnn_filter_2': hp.choice('cnn_filter_2', [8, 12, 16, 32, 40]),
                              
                # Input dim = 175, Output dim = [78, 86]
                'clf__cnn_kernel_1': hp.choice('cnn_kernel_1', [4, 8, 12, 16, 20]),
                # Input dim = [78, 86] (~ 175/2), Output dim = [32, 43]
                'clf__cnn_kernel_2': hp.choice('cnn_kernel_2', [2, 4, 8, 12, 16]),
                                   
                'clf__cnn_dropout_1': hp.choice('cnn_dropout_1', [0.2, 0.4, 0.6, 0.8]),
                'clf__cnn_dropout_2': hp.choice('cnn_dropout_2', [0.2, 0.4, 0.6, 0.8]),
                  
               'clf__batch_size': hp.choice('batch_size_pt', [64, 128, 256, 384, 512]),
               'clf__lr': hp.loguniform('lr_pt', np.log(1e-3), np.log(1e-1)),
               # Re-sampling hyper-params
               're_sample_pt__sampling_strategy': hp.choice('sampling_strategy_pt', [1, 1/2, 1/3, 1/4])}]),
    'clf__ft_params': hp.choice('ft_params', 
                [{'clf__batch_size': hp.choice('batch_size_ft', [32, 80, 128, 192, 256]),
                  'clf__lr': hp.loguniform('lr_ft', np.log(5e-4), np.log(1e-1)),
                  # Re-sampling hyper-params
                  're_sample_ft__sampling_strategy': hp.choice('sampling_strategy_ft', [1, 1/2, 1/3, 1/4])}])
}
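
The dimension comments in the search space can be checked with a little arithmetic. The sketch below assumes stride-1 convolutions without padding followed by 2x2 pooling; the actual layer settings inside `SklearnCNN` are not shown in this notebook, so the function `pooled_dim` and its rounding options are illustrative only.

```python
import math

def pooled_dim(size, kernel, pool=2, rounding=math.floor):
    """Side length after an unpadded stride-1 convolution and a pooling step."""
    conv = size - kernel + 1
    return rounding(conv / pool)

kernels_1 = [4, 8, 12, 16, 20]
kernels_2 = [2, 4, 8, 12, 16]

# First block: 175 -> between 78 and 86
layer1 = sorted({pooled_dim(175, k) for k in kernels_1})

# Second block, with pooling rounded down or up
layer2_floor = [pooled_dim(d, k) for d in layer1 for k in kernels_2]
layer2_ceil = [pooled_dim(d, k, rounding=math.ceil) for d in layer1 for k in kernels_2]

print(layer1[0], layer1[-1])
print(min(layer2_floor), max(layer2_floor))
print(min(layer2_ceil), max(layer2_ceil))
```

Under these assumptions the first block reproduces the commented range [78, 86]. The second block gives [31, 42] when pooling rounds down and [32, 43] when it rounds up, so the commented range [32, 43] is consistent with pooling that rounds up.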

In [24]:
from imblearn.pipeline import Pipeline

warnings.filterwarnings('ignore')

# Define PT 2D-CNN estimator
pre_model_path = 'keras-models/ros_new_cnn_pt_ft_disc_pfi_x.h5'
cnn_pt = SklearnCNN(input_shape=image_shape, cnn_filter={}, cnn_kernel={}, cnn_pool={1: 2, 2: 2}, 
                 cnn_activation={1: 'relu', 2: 'relu'}, cnn_dropout={}, dense_unit={}, dense_activation={1: 'relu'}, 
                 dense_dropout={}, output_unit=1, output_activation='sigmoid', 
                 optimizer_name='adam', loss_function='binary_crossentropy', epoch=200, patience=10, verbose=0, 
                 model_path=pre_model_path)

cnn_pipe_pt = Pipeline([('scaler', sc), ('re_sample_pt', ros), ('transf', ft), ('clf', cnn_pt)])

# Define FT 2D-CNN estimator
cnn_ft = SklearnFT(pre_layer=17, n_freeze=0, dense_unit={}, dense_activation={}, dense_dropout={},
                      output_unit=1, output_activation='sigmoid', optimizer_name='adam', 
                      loss_function='binary_crossentropy', epoch=200, patience=10, verbose=0,
                      pre_model=pre_model_path, 
                      model_path='keras-models/ros_new_cnn_pt_ft_disc_pfi_x_fine.h5')

cnn_pipe_ft = Pipeline([('scaler', sc), ('re_sample_ft', ros), ('transf', ft), ('clf', cnn_ft)])

# Define Bayesian-Optimization
hyper_search = HyperoptCV_TL(estimator_pt=cnn_pipe_pt, 
                          X_pt=X_gene_exp.reshape(X_gene_exp.shape[0], X_gene_exp.shape[1]*X_gene_exp.shape[2]), 
                          y_pt=Y_surv_disc_class, estimator_ft=cnn_pipe_ft, 
                          hyper_space=params_space, cv=cv_split, scoring=eval_metric, opt_metric=model_sel_metric,
                          n_iter=n_iter, random_seed=random_state, verbose_file = 'class_imb_pt_new_pt_ft_verbose.txt')

In [None]:
%%time
best_trial = hyper_search.model_selection(X.reshape(X.shape[0], X.shape[1]*X.shape[2]), y)

PT Hyper-params:                                     
{'clf__batch_size': 64, 'clf__cnn_dropout_1': 0.6, 'clf__cnn_dropout_2': 0.4, 'clf__cnn_filter_1': 2, 'clf__cnn_filter_2': 40, 'clf__cnn_kernel_1': 16, 'clf__cnn_kernel_2': 8, 'clf__dense_activation_2': 'relu', 'clf__dense_dropout_1': 0.8, 'clf__dense_dropout_2': 0.2, 'clf__dense_unit_1': 160, 'clf__dense_unit_2': 25, 'clf__lr': 0.027972625688359478, 're_sample_pt__sampling_strategy': 0.5}
PT params:                                           
SklearnPtCNN(batch_size=64, cnn_activation={1: 'relu', 2: 'relu'},
             cnn_dropout={1: 0.6, 2: 0.4}, cnn_filter={1: 2, 2: 40},
             cnn_kernel={1: 16, 2: 8}, cnn_pool={1: 2, 2: 2},
             dense_activation={1: 'relu', 2: 'relu'},
             dense_dropout={1: 0.8, 2: 0.2}, dense_unit={1: 160, 2: 25},
             epoch=200, input_shape=(175, 175, 1),
             loss_function='binary_crossentropy', lr=0.027972625688359478,
             model_path='keras-models/ros_new_cn

PT Hyper-params:                                                                      
{'clf__batch_size': 512, 'clf__cnn_dropout_1': 0.2, 'clf__cnn_dropout_2': 0.8, 'clf__cnn_filter_1': 4, 'clf__cnn_filter_2': 16, 'clf__cnn_kernel_1': 12, 'clf__cnn_kernel_2': 16, 'clf__dense_dropout_1': 0.6, 'clf__dense_unit_1': 120, 'clf__lr': 0.001180533046053466, 're_sample_pt__sampling_strategy': 0.3333333333333333}
PT params:                                                                            
SklearnPtCNN(batch_size=512, cnn_activation={1: 'relu', 2: 'relu'},                   
             cnn_dropout={1: 0.2, 2: 0.8}, cnn_filter={1: 4, 2: 16},
             cnn_kernel={1: 12, 2: 16}, cnn_pool={1: 2, 2: 2},
             dense_activation={1: 'relu'}, dense_dropout={1: 0.6},
             dense_unit={1: 120}, epoch=200, input_shape=(175, 175, 1),
             loss_function='binary_crossentropy', lr=0.001180533046053466,
             model_path='keras-models/ros_new_cnn_pt_ft_disc_pfi_x.h5',


PT Hyper-params:                                                                      
{'clf__batch_size': 512, 'clf__cnn_dropout_1': 0.2, 'clf__cnn_dropout_2': 0.6, 'clf__cnn_filter_1': 2, 'clf__cnn_filter_2': 8, 'clf__cnn_kernel_1': 12, 'clf__cnn_kernel_2': 16, 'clf__dense_dropout_1': 0.8, 'clf__dense_unit_1': 120, 'clf__lr': 0.027469191566294744, 're_sample_pt__sampling_strategy': 0.3333333333333333}
PT params:                                                                            
SklearnPtCNN(batch_size=512, cnn_activation={1: 'relu', 2: 'relu'},                   
             cnn_dropout={1: 0.2, 2: 0.6}, cnn_filter={1: 2, 2: 8},
             cnn_kernel={1: 12, 2: 16}, cnn_pool={1: 2, 2: 2},
             dense_activation={1: 'relu'}, dense_dropout={1: 0.8},
             dense_unit={1: 120}, epoch=200, input_shape=(175, 175, 1),
             loss_function='binary_crossentropy', lr=0.027469191566294744,
             model_path='keras-models/ros_new_cnn_pt_ft_disc_pfi_x.h5',
  

PT Hyper-params:                                                                       
{'clf__batch_size': 384, 'clf__cnn_dropout_1': 0.8, 'clf__cnn_dropout_2': 0.2, 'clf__cnn_filter_1': 12, 'clf__cnn_filter_2': 40, 'clf__cnn_kernel_1': 4, 'clf__cnn_kernel_2': 8, 'clf__dense_dropout_1': 0.8, 'clf__dense_unit_1': 200, 'clf__lr': 0.0010141410834517972, 're_sample_pt__sampling_strategy': 0.5}
PT params:                                                                             
SklearnPtCNN(batch_size=384, cnn_activation={1: 'relu', 2: 'relu'},                    
             cnn_dropout={1: 0.8, 2: 0.2}, cnn_filter={1: 12, 2: 40},
             cnn_kernel={1: 4, 2: 8}, cnn_pool={1: 2, 2: 2},
             dense_activation={1: 'relu'}, dense_dropout={1: 0.8},
             dense_unit={1: 200}, epoch=200, input_shape=(175, 175, 1),
             loss_function='binary_crossentropy', lr=0.0010141410834517972,
             model_path='keras-models/ros_new_cnn_pt_ft_disc_pfi_x.h5',
            

PT Hyper-params:                                                                       
{'clf__batch_size': 128, 'clf__cnn_dropout_1': 0.8, 'clf__cnn_dropout_2': 0.6, 'clf__cnn_filter_1': 4, 'clf__cnn_filter_2': 12, 'clf__cnn_kernel_1': 4, 'clf__cnn_kernel_2': 8, 'clf__dense_activation_2': 'relu', 'clf__dense_dropout_1': 0.6, 'clf__dense_dropout_2': 0.8, 'clf__dense_unit_1': 160, 'clf__dense_unit_2': 50, 'clf__lr': 0.07498127123920909, 're_sample_pt__sampling_strategy': 0.25}
PT params:                                                                             
SklearnPtCNN(batch_size=128, cnn_activation={1: 'relu', 2: 'relu'},                    
             cnn_dropout={1: 0.8, 2: 0.6}, cnn_filter={1: 4, 2: 12},
             cnn_kernel={1: 4, 2: 8}, cnn_pool={1: 2, 2: 2},
             dense_activation={1: 'relu', 2: 'relu'},
             dense_dropout={1: 0.6, 2: 0.8}, dense_unit={1: 160, 2: 50},
             epoch=200, input_shape=(175, 175, 1),
             loss_function='binary_

PT Hyper-params:                                                                       
{'clf__batch_size': 256, 'clf__cnn_dropout_1': 0.4, 'clf__cnn_dropout_2': 0.4, 'clf__cnn_filter_1': 8, 'clf__cnn_filter_2': 40, 'clf__cnn_kernel_1': 8, 'clf__cnn_kernel_2': 2, 'clf__dense_dropout_1': 0.2, 'clf__dense_unit_1': 240, 'clf__lr': 0.002839471923594719, 're_sample_pt__sampling_strategy': 0.25}
PT params:                                                                             
SklearnPtCNN(batch_size=256, cnn_activation={1: 'relu', 2: 'relu'},                    
             cnn_dropout={1: 0.4, 2: 0.4}, cnn_filter={1: 8, 2: 40},
             cnn_kernel={1: 8, 2: 2}, cnn_pool={1: 2, 2: 2},
             dense_activation={1: 'relu'}, dense_dropout={1: 0.2},
             dense_unit={1: 240}, epoch=200, input_shape=(175, 175, 1),
             loss_function='binary_crossentropy', lr=0.002839471923594719,
             model_path='keras-models/ros_new_cnn_pt_ft_disc_pfi_x.h5',
             mo

PT Hyper-params:                                                                       
{'clf__batch_size': 64, 'clf__cnn_dropout_1': 0.4, 'clf__cnn_dropout_2': 0.6, 'clf__cnn_filter_1': 8, 'clf__cnn_filter_2': 40, 'clf__cnn_kernel_1': 20, 'clf__cnn_kernel_2': 8, 'clf__dense_dropout_1': 0.6, 'clf__dense_unit_1': 120, 'clf__lr': 0.04480881925875512, 're_sample_pt__sampling_strategy': 1}
PT params:                                                                             
SklearnPtCNN(batch_size=64, cnn_activation={1: 'relu', 2: 'relu'},                     
             cnn_dropout={1: 0.4, 2: 0.6}, cnn_filter={1: 8, 2: 40},
             cnn_kernel={1: 20, 2: 8}, cnn_pool={1: 2, 2: 2},
             dense_activation={1: 'relu'}, dense_dropout={1: 0.6},
             dense_unit={1: 120}, epoch=200, input_shape=(175, 175, 1),
             loss_function='binary_crossentropy', lr=0.04480881925875512,
             model_path='keras-models/ros_new_cnn_pt_ft_disc_pfi_x.h5',
             moment

In [26]:
best_trial['result']['params']

{'clf__ft_params': {'clf__batch_size': 128,
  'clf__lr': 0.0014544476342579621,
  're_sample_ft__sampling_strategy': 0.5},
 'clf__pt_params': {'clf__batch_size': 256,
  'clf__cnn_dropout_1': 0.8,
  'clf__cnn_dropout_2': 0.2,
  'clf__cnn_filter_1': 12,
  'clf__cnn_filter_2': 12,
  'clf__cnn_kernel_1': 8,
  'clf__cnn_kernel_2': 4,
  'clf__dense_choice': {'clf__add_dense': 1,
   'clf__dense_activation_2': 'relu',
   'clf__dense_dropout_1': 0.8,
   'clf__dense_dropout_2': 0.6,
   'clf__dense_unit_1': 160,
   'clf__dense_unit_2': 100},
  'clf__lr': 0.0029640098355867947,
  're_sample_pt__sampling_strategy': 0.5}}
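
In the verbose log, flat keys such as `clf__cnn_filter_1` and `clf__cnn_filter_2` end up as layer-indexed dictionaries on the estimator (`cnn_filter={1: 2, 2: 40}`). How `HyperoptCV_TL` performs this mapping is not shown; the following is a minimal sketch of one way it could be done, with the hypothetical helper `group_indexed_params`.

```python
import re

def group_indexed_params(flat_params, prefix="clf__"):
    """Collect keys like 'clf__cnn_filter_1' into {'cnn_filter': {1: ...}}.
    Illustrative only; not the notebook's actual implementation."""
    grouped, plain = {}, {}
    for key, value in flat_params.items():
        name = key[len(prefix):] if key.startswith(prefix) else key
        match = re.fullmatch(r"(.+)_(\d+)", name)
        if match:
            layer = int(match.group(2))
            grouped.setdefault(match.group(1), {})[layer] = value
        else:
            plain[name] = value
    return {**plain, **grouped}

trial = {'clf__batch_size': 64, 'clf__cnn_filter_1': 2, 'clf__cnn_filter_2': 40,
         'clf__dense_unit_1': 160, 'clf__dense_unit_2': 25}
kwargs = group_indexed_params(trial)
print(kwargs)
```

Keys without a numeric suffix, such as the batch size, pass through unchanged apart from the stripped prefix.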

In [27]:
scores = best_trial['result']['test_score']
res = pd.DataFrame({'AUC': scores['test_auc'], 
              'ACC': scores['test_acc'], 
              'Sens': scores['test_sens'], 
              'Recall': scores['test_recall'], 
              'Spec': scores['test_spec'],
              'Prec': scores['test_prec'],
              'F-1': scores['test_f1'],
              'MCC': scores['test_mcc'],
              'Thres': scores['test_thres']})
res.describe()

|       | ACC      | AUC      | F-1      | MCC      | Prec     | Recall   | Sens     | Spec     | Thres    |
|-------|----------|----------|----------|----------|----------|----------|----------|----------|----------|
| count | 50.0     | 50.0     | 50.0     | 50.0     | 50.0     | 50.0     | 50.0     | 50.0     | 50.0     |
| mean  | 0.688538 | 0.732602 | 0.310144 | 0.246918 | 0.217582 | 0.675147 | 0.675147 | 0.689874 | 0.255182 |
| std   | 0.131418 | 0.05257  | 0.067082 | 0.075432 | 0.084078 | 0.145862 | 0.145862 | 0.15768  | 0.231786 |
| min   | 0.415205 | 0.631048 | 0.21875  | 0.136791 | 0.125    | 0.3125   | 0.3125   | 0.367742 | 0.014001 |
| 25%   | 0.580409 | 0.702353 | 0.257525 | 0.190115 | 0.151974 | 0.625    | 0.625    | 0.558065 | 0.104518 |
| 50%   | 0.684211 | 0.730242 | 0.303977 | 0.24627  | 0.20487  | 0.696691 | 0.696691 | 0.676372 | 0.140271 |
| 75%   | 0.80848  | 0.771069 | 0.352544 | 0.30732  | 0.25     | 0.764706 | 0.764706 | 0.829032 | 0.360953 |
| max   | 0.912281 | 0.842742 | 0.545455 | 0.49724  | 0.5625   | 0.882353 | 0.882353 | 0.954545 | 0.847045 |
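
The `opt_*` scorers receive predicted probabilities and a `Thres` value is reported per fold, which suggests that the threshold-dependent metrics are computed at a decision threshold chosen for each fold. The selection criterion used in `bio_dl_utils` is not shown. As a hedged illustration only, the sketch below assumes the threshold that maximizes Youden's J (sensitivity + specificity - 1); the function `best_threshold` is hypothetical and expects both classes to be present.

```python
def best_threshold(y_true, y_prob):
    """Return (threshold, sensitivity, specificity) maximizing Youden's J.
    Assumed criterion, not necessarily the one used by the opt_* scorers."""
    best_j, best = -1.0, None
    for thr in sorted(set(y_prob)):
        pred = [int(p >= thr) for p in y_prob]
        tp = sum(1 for t, p in zip(y_true, pred) if t == 1 and p == 1)
        fn = sum(1 for t, p in zip(y_true, pred) if t == 1 and p == 0)
        tn = sum(1 for t, p in zip(y_true, pred) if t == 0 and p == 0)
        fp = sum(1 for t, p in zip(y_true, pred) if t == 0 and p == 1)
        sens = tp / (tp + fn)
        spec = tn / (tn + fp)
        if sens + spec - 1 > best_j:
            best_j, best = sens + spec - 1, (thr, sens, spec)
    return best

y_true = [0, 0, 0, 1, 0, 1]
y_prob = [0.05, 0.10, 0.20, 0.30, 0.35, 0.80]
thr, sens, spec = best_threshold(y_true, y_prob)
print(thr, sens, spec)
```

A threshold well below 0.5 can be optimal when predicted probabilities for the event class are low, which would be in line with the median `Thres` of about 0.14 reported above.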


In [28]:
# Save results
file_path = 'results/ros_pt_ft_new_disc_pfi_lung_ft_100_iter_rep_kfold.csv'
res.to_csv(file_path, index=False)