## Measure training time for transfer learning

- This notebook shows how to quickly build a new model by pre-computing the activations of the frozen part of the network from Figure 3 and then training a simple MLP on top of them.

In [3]:
from m_kipoi.config import get_data_dir
ddir = get_data_dir()

In [4]:
%env CUDA_VISIBLE_DEVICES=0

env: CUDA_VISIBLE_DEVICES=0


In [5]:
# Get the model
import kipoi
m = kipoi.get_model("Divergent421")

Using downloaded and verified file: /users/avsec/.kipoi/models/Divergent421/downloaded/model_files/weights/2a0ae0a29337eb8106d65e1baeda85d1
Using downloaded and verified file: /users/avsec/.kipoi/models/Divergent421/downloaded/model_files/arch/6903bcab337a6753ad010f43f208df42


Using TensorFlow backend.


In [7]:
m.model.summary()

____________________________________________________________________________________________________
Layer (type)                     Output Shape          Param #     Connected to                     
data/genome_data_dir (InputLayer (None, 1000, 4)       0                                            
____________________________________________________________________________________________________
convolution1d_1 (Convolution1D)  (None, 982, 300)      23100       data/genome_data_dir[0][0]       
____________________________________________________________________________________________________
batchnormalization_1 (BatchNorma (None, 982, 300)      1200        convolution1d_1[0][0]            
____________________________________________________________________________________________________
prelu_1 (PReLU)                  (None, 982, 300)      294600      batchnormalization_1[0][0]       
___________________________________________________________________________________________

### Run kipoi predict and measure execution time

#### Command
```bash
kipoi predict Divergent421 \
  --dataloader_args='{
"intervals_file":"data/raw/tlearn/intervals_files_complete.bed3",
"fasta_file":"data/raw/dataloader_files/shared/hg19.fa"}' \
  --layer=dropout_1 \
  -n 10 \
  --output=data/processed/tlearn/bottlenecks/Divergent420/dropout_1.h5
```

This took 3:01:06 (h:mm:ss) on a Titan X (Pascal) GPU.
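
A wall-clock figure like the one above can be obtained by wrapping the command in a timer. The following is only a minimal sketch: `timed_run` and `format_hms` are hypothetical helpers (not part of kipoi), and the stand-in command should be replaced with the actual `kipoi predict ...` argument list.

```python
import subprocess
import sys
import time
from datetime import timedelta


def timed_run(cmd):
    """Run `cmd` (a list of arguments) and return (return code, elapsed seconds)."""
    start = time.perf_counter()
    completed = subprocess.run(cmd, check=False)
    elapsed = time.perf_counter() - start
    return completed.returncode, elapsed


def format_hms(seconds):
    """Format a duration in seconds as H:MM:SS."""
    return str(timedelta(seconds=int(seconds)))


# Stand-in command (hypothetical); replace it with the kipoi predict arguments.
returncode, elapsed = timed_run([sys.executable, "-c", "pass"])
print(f"exit code {returncode}, took {format_hms(elapsed)}")
```

`time.perf_counter` measures elapsed wall time, which is the relevant quantity here because most of the run is spent waiting on the GPU and on disk.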

### Train a new model

In [10]:
from m_kipoi.exp.tlearn import get_all_task_names, get_evaluated_task_names, get_heldout_names, get_multitask_names, get_exp
from kipoi.readers import HDF5Reader
import numpy as np
import pandas as pd

#### Load the data

In [11]:
%time f = HDF5Reader.load(f"{ddir}/processed/tlearn/bottlenecks/Divergent420/dropout_1.h5")

CPU times: user 8min 8s, sys: 1min 44s, total: 9min 53s
Wall time: 10min 6s


Loading the HDF5 file takes a while (about 10 minutes here). Stored as .npy, the same data loads in ~2 min but takes more space. However, these activations are useful for any cell type and need to be computed and loaded only once.
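
As an illustration of the .npy alternative, the sketch below saves a small toy array and reads it back. The file name and the toy array are assumptions made for the example, and the use of `mmap_mode="r"` (memory mapping, so that only the accessed rows are read) goes beyond what this notebook did.

```python
import os
import tempfile

import numpy as np

# Toy stand-in for the activation matrix (the real one is far larger).
activations = np.random.RandomState(0).rand(1000, 8).astype(np.float32)

with tempfile.TemporaryDirectory() as tmpdir:
    npy_path = os.path.join(tmpdir, "dropout_1.npy")  # hypothetical file name
    np.save(npy_path, activations)

    # mmap_mode="r" maps the file read-only instead of loading it fully.
    mapped = np.load(npy_path, mmap_mode="r")
    first_rows = np.array(mapped[:10])  # copy only the slice that is needed
    del mapped  # release the mapping before the directory is removed

print(first_rows.shape)
```

Uncompressed .npy files store the raw array bytes, which is consistent with the trade-off noted above: faster to read, but larger than a compressed container.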

In [35]:
f.keys()

dict_keys(['metadata', 'preds'])

In [36]:
f['preds'][0].shape

(16551625, 1000)
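
The shape gives a sense of why loading is slow. A back-of-the-envelope estimate of the in-memory size, assuming the activations are stored as 4-byte `float32` values (an assumption, since the dtype is not shown in this notebook):

```python
# Shape reported above for f['preds'][0]
n_rows, n_features = 16551625, 1000

# Assumption: float32 activations (4 bytes each); the dtype is not shown here.
bytes_per_value = 4

total_bytes = n_rows * n_features * bytes_per_value
print(f"{total_bytes / 1e9:.1f} GB ({total_bytes / 2**30:.1f} GiB)")
```

Under that assumption the matrix occupies roughly 66 GB, so the machine needs at least that much RAM to hold it in one piece.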

In [15]:
all_tasks = np.array(get_all_task_names())
evaluated_tasks = np.array(get_evaluated_task_names())
heldout_tasks = np.array(get_heldout_names())

In [16]:
evaluated_tasks

array(['ENCSR000EMT', 'ENCSR000EPC', 'ENCSR452SPC', 'ENCSR714DIF',
       'ENCSR917VCP', 'ENCSR000EOS', 'ENCSR731QLJ', 'ENCSR000EMX',
       'ENCSR076YBB;ENCSR456KDF;ENCSR482HQE;ENCSR930AUG',
       'ENCSR122VUW;ENCSR191EII;ENCSR320TUJ;ENCSR468ZXN;ENCSR603BXE'],
      dtype='<U59')

In [38]:
# This notebook only shows the results for one sample
task_name = evaluated_tasks[0]
task_name

'ENCSR000EMT'

In [19]:
# Extract y for each specific task
idx_task = [(i, task) for i, task in enumerate(all_tasks) if task in evaluated_tasks]

In [21]:
# Setup the training set
test_chr = ['chr1', 'chr8', 'chr9', 'chr21']

# Read the chromosomes
chr_vec = pd.Series(f['metadata']['ranges']['chr'])

is_test = chr_vec.isin(test_chr)

x_train = f['preds'][0][~is_test]
x_test = f['preds'][0][is_test]

y = np.loadtxt(f"{ddir}/raw/tlearn/labels/{task_name}.txt.gz", dtype=bool)

y_train = y[~is_test]
y_test = y[is_test]
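
The split above holds out whole chromosomes rather than random rows. A common motivation for this in genomics is to keep neighbouring intervals from ending up on both sides of the split. The same masking logic on toy data, as a sketch (all `toy_*` arrays are made up for illustration, and `np.isin` stands in for the pandas `isin` call used above):

```python
import numpy as np

# Toy stand-ins for the chromosome column, the activations and the labels.
toy_chr = np.array(["chr1", "chr2", "chr8", "chr3", "chr21", "chr2"])
toy_x = np.arange(12, dtype=np.float32).reshape(6, 2)
toy_y = np.array([True, False, False, True, False, True])

test_chr = ["chr1", "chr8", "chr9", "chr21"]

# Boolean mask: True for rows located on a held-out chromosome.
toy_is_test = np.isin(toy_chr, test_chr)

toy_x_train, toy_x_test = toy_x[~toy_is_test], toy_x[toy_is_test]
toy_y_train, toy_y_test = toy_y[~toy_is_test], toy_y[toy_is_test]

print(toy_x_train.shape, toy_x_test.shape)
```

Because the same mask indexes both the features and the labels, rows stay aligned as long as both arrays follow the order of the intervals file.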

#### Train

In [39]:
import keras
import keras.layers as kl
from keras.models import Model, Sequential, load_model
from keras.callbacks import EarlyStopping, ModelCheckpoint
from sklearn.metrics import average_precision_score
from sklearn.model_selection import train_test_split
import os
import numpy as np
from tqdm import tqdm

In [27]:
X_train_train, X_valid, y_train_train, y_valid = train_test_split(x_train, y_train, test_size=0.2, random_state=42)

In [28]:
m.model.summary()

____________________________________________________________________________________________________
Layer (type)                     Output Shape          Param #     Connected to                     
data/genome_data_dir (InputLayer (None, 1000, 4)       0                                            
____________________________________________________________________________________________________
convolution1d_1 (Convolution1D)  (None, 982, 300)      23100       data/genome_data_dir[0][0]       
____________________________________________________________________________________________________
batchnormalization_1 (BatchNorma (None, 982, 300)      1200        convolution1d_1[0][0]            
____________________________________________________________________________________________________
prelu_1 (PReLU)                  (None, 982, 300)      294600      batchnormalization_1[0][0]       
___________________________________________________________________________________________

Let's reuse the architecture that follows the `dropout_1` layer in the original model, with one modification: the output should be one-dimensional.

In [29]:
model = Sequential()
model.add(kl.Dense(1000, input_shape=(1000,)))
model.add(m.model.get_layer('batchnormalization_5'))
model.add(m.model.get_layer('prelu_5'))
model.add(kl.Dense(1, activation='sigmoid', name='final_layer'))
model.compile("adam", "binary_crossentropy", metrics={"auPR": average_precision_score})



We can initialize the first dense layer of the new model with the weights of the original model's `dense_2` layer to match the transfer-learning example from Figure 3.

In [30]:
# transfer the weights from before
model.layers[0].set_weights(m.model.get_layer("dense_2").get_weights())

Since we saw that a single epoch is basically enough to fit the model well, I'll train for just one epoch:

In [31]:
for i in range(1):
    print(f"Batch: {i}")
    print("Training the model")
    %time model.fit(X_train_train, y_train_train, nb_epoch=1, batch_size=512)
    print("Getting the prediction")
    %time y_pred = model.predict(X_valid)
    aupr = average_precision_score(y_valid, y_pred)
    print(f"auPR: {aupr}")

Batch: 0
Training the model
Epoch 1/1
CPU times: user 5min 16s, sys: 1min 27s, total: 6min 43s
Wall time: 3min 42s
Getting the prediction
CPU times: user 3min 17s, sys: 35.6 s, total: 3min 53s
Wall time: 2min 11s
auPR: 0.44433984245195457
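
For readers unfamiliar with the metric: scikit-learn's `average_precision_score` summarizes the precision-recall curve as the mean of the precisions at each threshold, weighted by the increase in recall. The pure-Python sketch below reproduces that definition for the simple case where all scores are distinct. It is an illustration only, not the library's implementation, and it does not handle tied scores.

```python
def average_precision(y_true, y_score):
    """Average precision for binary labels, assuming all scores are distinct."""
    # Rank examples from the highest to the lowest score.
    order = sorted(range(len(y_score)), key=lambda i: y_score[i], reverse=True)
    n_pos = sum(1 for label in y_true if label)
    if n_pos == 0:
        raise ValueError("average precision is undefined without positives")

    hits = 0
    total = 0.0
    for rank, i in enumerate(order, start=1):
        if y_true[i]:
            hits += 1
            # Precision at this rank; recall increases by 1 / n_pos here.
            total += hits / rank
    return total / n_pos


toy_labels = [1, 0, 1, 0]
toy_scores = [0.9, 0.8, 0.7, 0.1]
print(average_precision(toy_labels, toy_scores))
```

In the toy example the two positives sit at ranks 1 and 3, giving precisions of 1 and 2/3 and an average of 5/6.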


In [None]:
# Evaluate on the test set
y_pred_test = model.predict(x_test)

In [None]:
aupr = average_precision_score(y_test, y_pred_test)
print(f"test auPR: {aupr}")

test auPR: 0.4082066697326232


The test auPR of the other example was 0.406, almost the same as the 0.408 obtained here.