## Drug-Kinase Interaction Prediction with CNN

## Problem:

* This is a Multi-Label Classification problem where multiple labels may be assigned to each instance.

https://nickcdryan.com/2017/01/23/multi-label-classification-a-guided-tour/

https://stats.stackexchange.com/questions/12702/what-are-the-measure-for-accuracy-of-multilabel-data?utm_medium=organic&utm_source=google_rich_qa&utm_campaign=google_rich_qa

## Network Architecture


* INPUT->[CONV -> RELU -> CONV ->RELU ->POOL] ->FC -> RELU -> FC ->SIGMOID ->OUTPUT

![title](img/CNN1.png)

## Cost Function

There are two ways to penalize the instances:
* if you do not want to miss any label in an image then if the classification gets all right but one,you should consider the whole things wrong,
* you can also that the label missed or misclassified is an error

We use the second method:

`sigmoid_cross_entropy_with_logits` is a TensorFlow function that penalizes each output node independently. It uses binary loss and model the output of the netowrk as an independed Bernouli distribution per label.

In [1]:
%matplotlib inline
import matplotlib.pyplot as plt
import tensorflow as tf
import numpy as np
from sklearn.metrics import confusion_matrix
import time
from datetime import timedelta
import math

## Load data

In [2]:
drug_fingerprints_fh = 'sample/sample_fingerprints.csv'
drug_targets_fh      = 'sample/sample_targets.csv'
drug_weights_fh      = 'sample/sample_weights.csv'

### Dimensions of data 

In [3]:
sample_size       = 10000
fingerprint_size  = 1024
fingerprint_width = 32
targets_num       = 420
weights_num       = 420
num_channels      = 1

### Function helper that populates data structures with the actual data

In [4]:
import re
def populate_data(file_handle,data_matrix, data_size):
    with open(file_handle) as fh:
        j=0
        content = fh.readlines()
        content = [x.strip() for x in content]
        for line in content:
            result = re.split(r'[,\t]\s*',line)
            for i in range(1,data_size+1):
                data_matrix[j][i-1] = np.float32(result[i])
            j = j+1
    print(j)
    fh.close()

### Data structures for loaded data

In [5]:
drug_fingerprints = []
drug_targets      = []
drug_weights      = []


for i in range(sample_size):
    fingerprint_holder = [0]* fingerprint_size
    drug_fingerprints.append(fingerprint_holder)
    
for i in range(sample_size):
    target_holder = [0]* targets_num
    drug_targets.append(target_holder)

for i in range(sample_size):
    weight_holder = [0]* weights_num
    drug_weights.append(weight_holder)

In [6]:
populate_data(drug_weights_fh, drug_weights, weights_num)
populate_data(drug_targets_fh, drug_targets, targets_num)
populate_data(drug_fingerprints_fh, drug_fingerprints, fingerprint_size)

10000
10000
10000


In [7]:
drug_fingerprints = np.array(drug_fingerprints)
drug_targets      = np.array(drug_targets)
drug_weights      = np.array(drug_weights)

# TensorFlow

## Placeholders
Placeholder for the flat 'array' with **fingerprint** of each compound. `None` means that this tensor can hold arbitrary number of arrays with the fingerprints.

In [8]:
x = tf.placeholder(tf.float32, [None, fingerprint_size],name = "In_Flat_Drug_Fingerprint")

X is first define as and vecor of size `fingerprint_size` and its then redefine and reshape input as a 2D matrix (image)

In [9]:
drug_image = tf.reshape(x, [-1, fingerprint_width, fingerprint_width, num_channels], name="Drug_Image_32x32")

Placeholder for the true labels (true targets) for each compound. (Here we have 420 targets - kinases).

In [10]:
y_true = tf.placeholder(tf.float32, [None, targets_num],name='True_Labels')

Placeholder for weights. The weights will be used to calculate cross-entropy cost function.

In [11]:
cross_entropy_weights = tf.placeholder(tf.float32, [None, weights_num],name = "Cross_Entropy_Weights")

## Variables to Optimize

In [12]:
def new_weights(shape):
    return tf.Variable(tf.truncated_normal(shape, stddev=0.05), name="Weights")

In [13]:
def new_biases(length):
    return tf.Variable(tf.constant(0.05, shape=[length]), name="Biases")

## Network Architecture

* INPUT->[CONV -> RELU -> CONV ->RELU ->POOL] ->FC ->RELU->FC->SIGMOID->FC

## Helper Functions to Create the Network's Layers

### NEW CONVOLUTION LAYER

In [14]:
def new_conv_layer_with_RELU(input, num_input_channels, filter_size, num_filters):        
    
    #number and shape of the filters
    shape = [filter_size, filter_size, num_input_channels, num_filters]
    
    #create the filters
    weights = new_weights(shape=shape)

    #create bias for each depth slice of the filter volumne
    biases = new_biases(length=num_filters)

    #convolute stride =[1,1,1,1] move by pixel, padding = "SAME" -> add zero padding to keep the dimensions
    layer = tf.nn.conv2d(input   = input,filter  = weights, strides = [1, 1, 1, 1],
                         padding = 'SAME', name="CONVOLUTION_LAYER")
    
    # Add the biases 
    layer += biases
    
    #  non-linearity (ReLU).
    layer = tf.nn.relu(layer)
                   
    return layer

### NEW POOLING LAYER

In [15]:
def new_pooling_layer(input, stride):
    
    layer = tf.nn.max_pool(value=input,ksize=[1, 2, 2, 1],strides=[1, stride, stride, 1],
                           padding='SAME', name = "POOLING_LAYER")
    
    return layer

### Helper Function for Flattening a Layer

In [16]:
def flatten_layer(layer):
        
    # get the shape of the input layer in a format [num_images, img_height, img_width, num_channels]
    layer_shape = layer.get_shape()

    # number of features is  img_height * img_width * num_channels
    num_features = int(layer_shape[1] * layer_shape[2] * layer_shape[3])
    
    # reshape the layer to [num_images, num_features].
    # -1  means the size in that dimension is calculated so the total size of the tensor is unchanged from the reshaping.
    layer_flat = tf.reshape(layer, [-1, num_features],name = "FLAT_LAYER")

    # return both the flattened layer and the number of features.
    return layer_flat, num_features

### NEW FULLY-CONNECTED LAYER

In [17]:
def new_fc_layer(input, num_inputs,num_outputs,use_relu, use_sigmoid): 

    # new weights and biases for the layer
    weights = new_weights(shape = [num_inputs, num_outputs])
    biases = new_biases(length = num_outputs)

    # calculate the layer as the matrix multiplication of the input and weights, and then add the bias-values.
    layer = tf.matmul(input, weights) + biases

    if use_relu:
        layer = tf.nn.relu(layer, name = "FULLY_CONNECTED_WITH_RELU")
       
    if use_sigmoid:
        layer = tf.nn.sigmoid(layer,name = "FULLY_CONNECTED_WITH_SIGMOID")

    return layer

# Design Computational Graph for CNN

### HYPER-PARAMETERS OF THE NETWORK

INPUT->[CONV -> RELU -> CONV ->RELU ->POOL] ->FC -> RELU -> FC ->SIGMOID ->OUTPUT

* **CONV 1**

In [18]:
filter_size1 = 2
num_filters1 = 16

* **CONV 2**

In [19]:
filter_size2 = 3
num_filters2 = 32

* **FC 1 & FC 2**

In [20]:
fc_size1 = 2576
fc_size2 = 420

DATA DIMENSION REMINDER:
* sample_size       = 10000
* fingerprint_size  = 1024
* fingerprint_width = 32
* targets_num       = 420
* weights_num       = 420
* num_channels      = 1

## DEFINE THE LAYERS

### Convolutional Layer #1 with RELU

In [21]:
conv_layer1 = new_conv_layer_with_RELU(input=drug_image,
                                       num_input_channels = num_channels,
                                       filter_size = filter_size1,
                                       num_filters = num_filters1 )  

In [22]:
conv_layer1

<tf.Tensor 'Relu:0' shape=(?, 32, 32, 16) dtype=float32>

### Convolutional Layer #2 with RELU

In [23]:
conv_layer2 = new_conv_layer_with_RELU(input = conv_layer1,
                                       num_input_channels = num_filters1 ,
                                       filter_size = filter_size2,
                                       num_filters = num_filters2)  

In [24]:
conv_layer2

<tf.Tensor 'Relu_1:0' shape=(?, 32, 32, 32) dtype=float32>

### Pooling Layer

In [25]:
pooling_layer = new_pooling_layer(input= conv_layer2, stride = 2)

In [26]:
pooling_layer 

<tf.Tensor 'POOLING_LAYER:0' shape=(?, 16, 16, 32) dtype=float32>

### Prepare input to FC layer aka flattering the layer

In [27]:
layer_flat, features_num = flatten_layer(pooling_layer)

In [28]:
layer_flat

<tf.Tensor 'FLAT_LAYER:0' shape=(?, 8192) dtype=float32>

### Fully- Connected Layer 1

In [29]:
fc_layer1 = new_fc_layer(input = layer_flat,
                         num_inputs = features_num,
                         num_outputs = fc_size1,
                         use_relu = True,
                         use_sigmoid = False)

In [30]:
fc_layer1

<tf.Tensor 'FULLY_CONNECTED_WITH_RELU:0' shape=(?, 2576) dtype=float32>

### Fully- Connected Layer 2

In [31]:
fc_layer2 = new_fc_layer(input = fc_layer1,
                         num_inputs = fc_size1,
                         num_outputs = fc_size2,
                         use_relu = False,
                         use_sigmoid = True)

In [32]:
fc_layer2

<tf.Tensor 'FULLY_CONNECTED_WITH_SIGMOID:0' shape=(?, 420) dtype=float32>

### OUTPUT -> Predicted Classes 

In [33]:
output = tf.round(fc_layer2)

# Cost Function to Optimize

Because we want to penalize each output node independently, we pick a binary loss and model the ouput of the network as an independent Bernouli Distribution per label.

In [34]:
cross_entropy = tf.nn.sigmoid_cross_entropy_with_logits(logits = fc_layer2,
                                                        labels = y_true)

### Multiply logistic loss with weights (ELEMENT-WISE) 

In [35]:
# sum of cost for all labels with weight 1
cost_sum = tf.reduce_sum(tf.multiply(cross_entropy_weights,cross_entropy))

# number of labels with weight 1
num_nonzero_weights = tf.count_nonzero(input_tensor=cross_entropy_weights,dtype = tf.float32)

# average cost
cost = tf.divide(cost_sum, num_nonzero_weights, name= "COST")

### Optimization Method

In [36]:
optimizer = tf.train.AdamOptimizer(learning_rate=1e-4).minimize(cost)

In [37]:
accuracy, accuracy_ops =tf.metrics.accuracy(labels=y_true,predictions=output, weights = cross_entropy_weights)

In [38]:
from sklearn import *
from sklearn.metrics import hamming_loss

In [39]:
# Local variables need to show updated accuracy on each iteration 
stream_vars = [i for i in tf.local_variables()]

# Create TensorFlow session

In [40]:
session = tf.Session()
init = tf.group(tf.global_variables_initializer(), tf.local_variables_initializer())
session.run(init)
saver = tf.train.Saver()

In [41]:
train_batch_size = 50

In [42]:
def fetch_batch(batch_size, available_indexes):
    chosen = np.random.choice(available_indexes,batch_size, replace=False)
    available_indexes = set(available_indexes) - set(chosen)
    X_batch = drug_fingerprints[chosen, :]
    y_batch = drug_targets[chosen, :]
    cross_entropy_weights = drug_weights[chosen,:]
    return X_batch,y_batch,cross_entropy_weights, list(available_indexes)

In [54]:
# counter for total number of epochs
total_epochs = 0

def optimize(num_epochs):
    
    # update the global variable rather than a local copy.
    global total_epochs

    # start-time 
    start_time = time.time()

    for i in range(total_epochs, total_epochs + num_epochs):

        for j in range(int(len(drug_targets)/train_batch_size)):
            if j == 0:
                available_indexes = list(range(len(drug_targets)))                         
            x_batch,y_true_batch, weights_batch, available_indexes = fetch_batch(train_batch_size, available_indexes)

            # put the batch into a dict with the proper names for placeholder variables
            feed_dict_train = {x: x_batch,
                               y_true: y_true_batch,
                              cross_entropy_weights: weights_batch}

            # run the optimizer with the btch training data
            session.run(optimizer, feed_dict=feed_dict_train)
            # save the model's weights at the end of each epoch
            saver.save(session, "./temp/my_model.ckpt")

            # print update every 10 iterations
            if j % 20 == 0:

                # calculate the accuracy on the training-set.
                acc_ops = session.run(accuracy_ops, feed_dict=feed_dict_train)
                curr_hamming_loss = hamming_loss(y_true_batch, x_batch, None,weights_batch)
                print("Hamming_loss: {}".format(curr_hamming_loss))
                # print update
                print('[Total correct, Total count]:',session.run(stream_vars)) 
                print("Epoch: {}, Optimization Iteration (batch #): {}, Training Accuracy: {} \n".format(i+1,j+1,acc_ops))                        

        # update the total number of iterations
    total_epochs += num_epochs

    # end time
    end_time = time.time()

    # difference between start and end-times.
    time_dif = end_time - start_time

    #time-usage
    print("Time usage: " + str(timedelta(seconds=int(round(time_dif)))))

### How total count is calculated?
<br>
Per iteration we have 50 example (batch size) each of them has 420 variable
<br>
This is wrong because many entries are 0 because of inativity as well as luck of knowlegde.
<br>

**(In that case the Network could perform very well if it starts outputs zeros all the time as the data is unbalanced )**


**Accuracy should only calculate among the examples with weight 1 as cross_entropy is.**

In [56]:
optimize(num_epochs = 1)

Hamming_loss: (array([[0., 0., 0., ..., 0., 0., 0.],
       [0., 0., 0., ..., 0., 0., 0.],
       [0., 0., 0., ..., 0., 0., 0.],
       ...,
       [0., 0., 0., ..., 0., 0., 0.],
       [0., 0., 0., ..., 0., 0., 0.],
       [0., 0., 0., ..., 0., 0., 0.]], dtype=float32), array([[0., 0., 0., ..., 0., 0., 0.],
       [0., 1., 0., ..., 0., 0., 0.],
       [0., 0., 0., ..., 0., 0., 0.],
       ...,
       [0., 0., 0., ..., 0., 0., 0.],
       [0., 0., 0., ..., 0., 0., 0.],
       [0., 0., 0., ..., 0., 0., 0.]], dtype=float32), None, array([[0., 0., 0., ..., 0., 0., 0.],
       [0., 0., 0., ..., 0., 0., 0.],
       [0., 0., 0., ..., 0., 0., 0.],
       ...,
       [0., 0., 0., ..., 0., 0., 0.],
       [0., 0., 0., ..., 0., 0., 0.],
       [0., 0., 0., ..., 0., 0., 0.]], dtype=float32))
[Total correct, Total count]: [20556.0, 22497.0]
Epoch: 2, Optimization Iteration (batch #): 1, Training Accuracy: 0.9137217998504639 

Hamming_loss: (array([[0., 0., 0., ..., 0., 0., 0.],
       [0., 0., 0.,

Hamming_loss: (array([[0., 0., 0., ..., 0., 0., 0.],
       [0., 0., 0., ..., 0., 0., 0.],
       [0., 0., 0., ..., 0., 0., 0.],
       ...,
       [0., 0., 0., ..., 0., 0., 0.],
       [0., 0., 0., ..., 0., 0., 0.],
       [0., 0., 0., ..., 0., 0., 0.]], dtype=float32), array([[0., 0., 0., ..., 0., 0., 0.],
       [0., 0., 0., ..., 0., 0., 0.],
       [0., 0., 0., ..., 0., 0., 0.],
       ...,
       [0., 0., 0., ..., 0., 0., 0.],
       [0., 1., 0., ..., 0., 0., 0.],
       [0., 0., 0., ..., 0., 0., 0.]], dtype=float32), None, array([[0., 0., 0., ..., 0., 0., 0.],
       [0., 0., 0., ..., 0., 0., 0.],
       [1., 0., 0., ..., 0., 0., 0.],
       ...,
       [1., 0., 0., ..., 0., 0., 0.],
       [0., 0., 0., ..., 0., 0., 0.],
       [0., 0., 0., ..., 0., 0., 0.]], dtype=float32))
[Total correct, Total count]: [25005.0, 27397.0]
Epoch: 2, Optimization Iteration (batch #): 181, Training Accuracy: 0.9126911759376526 

Time usage: 0:06:54


In [44]:
writer = tf.summary.FileWriter("./logs/CNN_1", session.graph)

In [45]:
save_path= saver.save(session, "./temp/my_model_final.ckpt")

In [46]:
# ! tensorboard --logdir=log+

21000

In [49]:
output_array = output.eval(session=session)

InvalidArgumentError: You must feed a value for placeholder tensor 'In_Flat_Drug_Fingerprint' with dtype float and shape [?,1024]
	 [[node In_Flat_Drug_Fingerprint (defined at <ipython-input-8-79b151949cc7>:1)  = Placeholder[dtype=DT_FLOAT, shape=[?,1024], _device="/job:localhost/replica:0/task:0/device:CPU:0"]()]]

Caused by op 'In_Flat_Drug_Fingerprint', defined at:
  File "/usr/lib/python3.6/runpy.py", line 193, in _run_module_as_main
    "__main__", mod_spec)
  File "/usr/lib/python3.6/runpy.py", line 85, in _run_code
    exec(code, run_globals)
  File "/home/patrycja/dnn/deep_env/lib/python3.6/site-packages/ipykernel_launcher.py", line 16, in <module>
    app.launch_new_instance()
  File "/home/patrycja/dnn/deep_env/lib/python3.6/site-packages/traitlets/config/application.py", line 658, in launch_instance
    app.start()
  File "/home/patrycja/dnn/deep_env/lib/python3.6/site-packages/ipykernel/kernelapp.py", line 505, in start
    self.io_loop.start()
  File "/home/patrycja/dnn/deep_env/lib/python3.6/site-packages/tornado/platform/asyncio.py", line 132, in start
    self.asyncio_loop.run_forever()
  File "/usr/lib/python3.6/asyncio/base_events.py", line 427, in run_forever
    self._run_once()
  File "/usr/lib/python3.6/asyncio/base_events.py", line 1440, in _run_once
    handle._run()
  File "/usr/lib/python3.6/asyncio/events.py", line 145, in _run
    self._callback(*self._args)
  File "/home/patrycja/dnn/deep_env/lib/python3.6/site-packages/tornado/ioloop.py", line 758, in _run_callback
    ret = callback()
  File "/home/patrycja/dnn/deep_env/lib/python3.6/site-packages/tornado/stack_context.py", line 300, in null_wrapper
    return fn(*args, **kwargs)
  File "/home/patrycja/dnn/deep_env/lib/python3.6/site-packages/tornado/gen.py", line 1233, in inner
    self.run()
  File "/home/patrycja/dnn/deep_env/lib/python3.6/site-packages/tornado/gen.py", line 1147, in run
    yielded = self.gen.send(value)
  File "/home/patrycja/dnn/deep_env/lib/python3.6/site-packages/ipykernel/kernelbase.py", line 370, in dispatch_queue
    yield self.process_one()
  File "/home/patrycja/dnn/deep_env/lib/python3.6/site-packages/tornado/gen.py", line 346, in wrapper
    runner = Runner(result, future, yielded)
  File "/home/patrycja/dnn/deep_env/lib/python3.6/site-packages/tornado/gen.py", line 1080, in __init__
    self.run()
  File "/home/patrycja/dnn/deep_env/lib/python3.6/site-packages/tornado/gen.py", line 1147, in run
    yielded = self.gen.send(value)
  File "/home/patrycja/dnn/deep_env/lib/python3.6/site-packages/ipykernel/kernelbase.py", line 357, in process_one
    yield gen.maybe_future(dispatch(*args))
  File "/home/patrycja/dnn/deep_env/lib/python3.6/site-packages/tornado/gen.py", line 326, in wrapper
    yielded = next(result)
  File "/home/patrycja/dnn/deep_env/lib/python3.6/site-packages/ipykernel/kernelbase.py", line 267, in dispatch_shell
    yield gen.maybe_future(handler(stream, idents, msg))
  File "/home/patrycja/dnn/deep_env/lib/python3.6/site-packages/tornado/gen.py", line 326, in wrapper
    yielded = next(result)
  File "/home/patrycja/dnn/deep_env/lib/python3.6/site-packages/ipykernel/kernelbase.py", line 534, in execute_request
    user_expressions, allow_stdin,
  File "/home/patrycja/dnn/deep_env/lib/python3.6/site-packages/tornado/gen.py", line 326, in wrapper
    yielded = next(result)
  File "/home/patrycja/dnn/deep_env/lib/python3.6/site-packages/ipykernel/ipkernel.py", line 294, in do_execute
    res = shell.run_cell(code, store_history=store_history, silent=silent)
  File "/home/patrycja/dnn/deep_env/lib/python3.6/site-packages/ipykernel/zmqshell.py", line 536, in run_cell
    return super(ZMQInteractiveShell, self).run_cell(*args, **kwargs)
  File "/home/patrycja/dnn/deep_env/lib/python3.6/site-packages/IPython/core/interactiveshell.py", line 2819, in run_cell
    raw_cell, store_history, silent, shell_futures)
  File "/home/patrycja/dnn/deep_env/lib/python3.6/site-packages/IPython/core/interactiveshell.py", line 2845, in _run_cell
    return runner(coro)
  File "/home/patrycja/dnn/deep_env/lib/python3.6/site-packages/IPython/core/async_helpers.py", line 67, in _pseudo_sync_runner
    coro.send(None)
  File "/home/patrycja/dnn/deep_env/lib/python3.6/site-packages/IPython/core/interactiveshell.py", line 3020, in run_cell_async
    interactivity=interactivity, compiler=compiler, result=result)
  File "/home/patrycja/dnn/deep_env/lib/python3.6/site-packages/IPython/core/interactiveshell.py", line 3185, in run_ast_nodes
    if (yield from self.run_code(code, result)):
  File "/home/patrycja/dnn/deep_env/lib/python3.6/site-packages/IPython/core/interactiveshell.py", line 3267, in run_code
    exec(code_obj, self.user_global_ns, self.user_ns)
  File "<ipython-input-8-79b151949cc7>", line 1, in <module>
    x = tf.placeholder(tf.float32, [None, fingerprint_size],name = "In_Flat_Drug_Fingerprint")
  File "/home/patrycja/dnn/deep_env/lib/python3.6/site-packages/tensorflow/python/ops/array_ops.py", line 1747, in placeholder
    return gen_array_ops.placeholder(dtype=dtype, shape=shape, name=name)
  File "/home/patrycja/dnn/deep_env/lib/python3.6/site-packages/tensorflow/python/ops/gen_array_ops.py", line 5206, in placeholder
    "Placeholder", dtype=dtype, shape=shape, name=name)
  File "/home/patrycja/dnn/deep_env/lib/python3.6/site-packages/tensorflow/python/framework/op_def_library.py", line 787, in _apply_op_helper
    op_def=op_def)
  File "/home/patrycja/dnn/deep_env/lib/python3.6/site-packages/tensorflow/python/util/deprecation.py", line 488, in new_func
    return func(*args, **kwargs)
  File "/home/patrycja/dnn/deep_env/lib/python3.6/site-packages/tensorflow/python/framework/ops.py", line 3274, in create_op
    op_def=op_def)
  File "/home/patrycja/dnn/deep_env/lib/python3.6/site-packages/tensorflow/python/framework/ops.py", line 1770, in __init__
    self._traceback = tf_stack.extract_stack()

InvalidArgumentError (see above for traceback): You must feed a value for placeholder tensor 'In_Flat_Drug_Fingerprint' with dtype float and shape [?,1024]
	 [[node In_Flat_Drug_Fingerprint (defined at <ipython-input-8-79b151949cc7>:1)  = Placeholder[dtype=DT_FLOAT, shape=[?,1024], _device="/job:localhost/replica:0/task:0/device:CPU:0"]()]]
