## Drug-Kinase Interaction Prediction with CNN

## Problem:

* This is a Multi-Label Classification problem where multiple labels may be assigned to each instance.

https://nickcdryan.com/2017/01/23/multi-label-classification-a-guided-tour/

https://stats.stackexchange.com/questions/12702/what-are-the-measure-for-accuracy-of-multilabel-data?utm_medium=organic&utm_source=google_rich_qa&utm_campaign=google_rich_qa

## Network Architecture


* INPUT->[CONV -> RELU -> CONV ->RELU ->POOL] ->FC -> RELU -> FC ->SIGMOID ->OUTPUT

![title](img/CNN1.png)

## Cost Function

There are two ways to penalize the instances:
* Exact match: if you do not want to miss any label of an instance, then a prediction that gets every label right but one counts as entirely wrong.
* Per label: each label that is missed or misclassified counts as one error.
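The two penalization schemes above can be sketched in plain Python. The function names `exact_match_loss` and `per_label_loss` are illustrative and not part of the notebook; this is a minimal sketch for a single instance with binary labels.

```python
def exact_match_loss(truth, pred):
    """Return 1.0 if any label differs, 0.0 only when every label is correct."""
    return 0.0 if list(truth) == list(pred) else 1.0


def per_label_loss(truth, pred):
    """Return the fraction of labels that are wrong (each miss is one error)."""
    wrong = sum(1 for t, p in zip(truth, pred) if t != p)
    return wrong / len(truth)


truth = [1, 0, 1, 0]
pred = [1, 0, 0, 0]   # one label out of four is missed

print(exact_match_loss(truth, pred))  # 1.0
print(per_label_loss(truth, pred))    # 0.25
```

With 420 labels per compound, the exact-match scheme would mark almost every prediction as wrong, which is one reason to prefer the per-label scheme.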

We use the second method:

`sigmoid_cross_entropy_with_logits` is a TensorFlow function that penalizes each output node independently. It uses a binary loss and models the output of the network as an independent Bernoulli distribution per label.

In [1]:
%matplotlib inline
import matplotlib.pyplot as plt
import tensorflow as tf
import numpy as np
from sklearn.metrics import confusion_matrix
import time
from datetime import timedelta
import math

## Load data

In [2]:
drug_fingerprints_fh = 'sample/sample_fingerprints.csv'
drug_targets_fh      = 'sample/sample_targets.csv'
drug_weights_fh      = 'sample/sample_weights.csv'

### Dimensions of data 

In [3]:
sample_size       = 10000
fingerprint_size  = 1024
fingerprint_width = 32
targets_num       = 420
weights_num       = 420
num_channels      = 1

### Function helper that populates data structures with the actual data

In [4]:
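import re
def populate_data(file_handle, data_matrix, data_size):
    # fill data_matrix row by row from a comma- or tab-separated file;
    # column 0 of each line is skipped, the next data_size columns are stored
    with open(file_handle) as fh:
        j=0
        content = fh.readlines()
        content = [x.strip() for x in content]
        for line in content:
            result = re.split(r'[,\t]\s*',line)
            for i in range(1,data_size+1):
                data_matrix[j][i-1] = np.float32(result[i])
            j = j+1
    # number of rows read (the with-block has already closed the file)
    print(j)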
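To make the parsing step concrete, here is a small sketch of how one line is split. It assumes, as the loop above suggests, that the first column is an identifier that is skipped; the helper `parse_row` and the sample line are hypothetical.

```python
import re


def parse_row(line, data_size):
    """Split a comma- or tab-separated line and keep columns 1..data_size as floats."""
    fields = re.split(r'[,\t]\s*', line.strip())
    return [float(value) for value in fields[1:data_size + 1]]


# hypothetical line: an identifier followed by three values
row = parse_row("drug_0, 1, 0\t1\n", 3)
print(row)  # [1.0, 0.0, 1.0]
```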

### Data structures for loaded data

In [5]:
drug_fingerprints = []
drug_targets      = []
drug_weights      = []


for i in range(sample_size):
    fingerprint_holder = [0]* fingerprint_size
    drug_fingerprints.append(fingerprint_holder)
    
for i in range(sample_size):
    target_holder = [0]* targets_num
    drug_targets.append(target_holder)

for i in range(sample_size):
    weight_holder = [0]* weights_num
    drug_weights.append(weight_holder)

In [6]:
populate_data(drug_weights_fh, drug_weights, weights_num)
populate_data(drug_targets_fh, drug_targets, targets_num)
populate_data(drug_fingerprints_fh, drug_fingerprints, fingerprint_size)

10000
10000
10000


In [7]:
drug_fingerprints = np.array(drug_fingerprints)
drug_targets      = np.array(drug_targets)
drug_weights      = np.array(drug_weights)

# TensorFlow

## Placeholders
Placeholder for the flat array holding the **fingerprint** of each compound. `None` means that this tensor can hold an arbitrary number of fingerprints.

In [8]:
x = tf.placeholder(tf.float32, [None, fingerprint_size],name = "In_Flat_Drug_Fingerprint")

`x` is first defined as a flat vector of size `fingerprint_size`; it is then reshaped into a 2D matrix (image) of `fingerprint_width` x `fingerprint_width` pixels.

In [9]:
drug_image = tf.reshape(x, [-1, fingerprint_width, fingerprint_width, num_channels], name="Drug_Image_32x32")
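The reshape fills the image row by row (row-major order), so bit `i` of the flat fingerprint ends up at row `i // 32`, column `i % 32`. A minimal plain-Python sketch of the same mapping for one fingerprint, leaving out the batch and channel dimensions; `reshape_to_image` is an illustrative name:

```python
def reshape_to_image(flat, width):
    """Cut a flat list of width*width values into `width` rows of `width` values."""
    assert len(flat) == width * width
    return [flat[r * width:(r + 1) * width] for r in range(width)]


fingerprint = list(range(1024))            # stand-in for a 1024-bit fingerprint
image = reshape_to_image(fingerprint, 32)

print(len(image), len(image[0]))  # 32 32
print(image[1][0])                # 32, i.e. flat index 1 * 32 + 0
```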

Placeholder for the true labels (true targets) for each compound. (Here we have 420 targets - kinases).

In [10]:
y_true = tf.placeholder(tf.float32, [None, targets_num],name='True_Labels')

Placeholder for weights. The weights will be used to calculate cross-entropy cost function.

In [11]:
cross_entropy_weights = tf.placeholder(tf.float32, [None, weights_num],name = "Cross_Entropy_Weights")

## Variables to Optimize

In [12]:
def new_weights(shape):
    return tf.Variable(tf.truncated_normal(shape, stddev=0.05), name="Weights")

In [13]:
def new_biases(length):
    return tf.Variable(tf.constant(0.05, shape=[length]), name="Biases")
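`tf.truncated_normal` draws from a normal distribution and re-draws any value that lies more than two standard deviations from the mean. With `stddev=0.05`, every initial weight therefore lies within ±0.1. A rough plain-Python sketch of that sampling rule (the helper name and the `seed` parameter are assumptions for illustration):

```python
import random


def truncated_normal(n, stddev=0.05, mean=0.0, seed=0):
    """Draw n normal samples, rejecting any further than 2*stddev from the mean."""
    rng = random.Random(seed)
    values = []
    while len(values) < n:
        v = rng.gauss(mean, stddev)
        if abs(v - mean) <= 2 * stddev:
            values.append(v)
    return values


weights = truncated_normal(1000)
print(max(abs(v) for v in weights) <= 0.1)  # True
```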

## Network Architecture

* INPUT->[CONV -> RELU -> CONV ->RELU ->POOL] ->FC -> RELU -> FC ->SIGMOID ->OUTPUT

## Helper Functions to Create the Network's Layers

### NEW CONVOLUTION LAYER

In [14]:
def new_conv_layer_with_RELU(input, num_input_channels, filter_size, num_filters):        
    
    #number and shape of the filters
    shape = [filter_size, filter_size, num_input_channels, num_filters]
    
    #create the filters
    weights = new_weights(shape=shape)

    #create bias for each depth slice of the filter volume
    biases = new_biases(length=num_filters)

    #convolve: strides=[1,1,1,1] moves the filter one pixel at a time; padding="SAME" adds zero padding to keep the spatial dimensions
    layer = tf.nn.conv2d(input   = input,filter  = weights, strides = [1, 1, 1, 1],
                         padding = 'SAME', name="CONVOLUTION_LAYER")
    
    # Add the biases 
    layer += biases
    
    #  non-linearity (ReLU).
    layer = tf.nn.relu(layer)
                   
    return layer
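To illustrate what one filter of this layer computes, here is a plain-Python sketch for a single input channel and a single filter: stride 1, zero padding so that the output keeps the input size, bias, then ReLU. It is a simplified illustration rather than the TensorFlow implementation; for an even filter size such as 2 it puts the extra padding on the bottom and right, which is the TensorFlow 'SAME' convention.

```python
def conv2d_same_relu(image, kernel, bias=0.0):
    """Single-channel cross-correlation, stride 1, zero 'SAME' padding, then ReLU."""
    h, w = len(image), len(image[0])
    k = len(kernel)
    pad_before = (k - 1) // 2      # any odd leftover padding goes to bottom/right
    out = []
    for r in range(h):
        row = []
        for c in range(w):
            acc = bias
            for i in range(k):
                for j in range(k):
                    rr, cc = r + i - pad_before, c + j - pad_before
                    # positions outside the image are zero padding
                    if 0 <= rr < h and 0 <= cc < w:
                        acc += image[rr][cc] * kernel[i][j]
            row.append(max(acc, 0.0))   # ReLU
        out.append(row)
    return out


image = [[1.0, 2.0],
         [3.0, 4.0]]
kernel = [[1.0, 1.0],
          [1.0, 1.0]]
print(conv2d_same_relu(image, kernel, bias=-5.0))  # [[5.0, 1.0], [2.0, 0.0]]
```

The output has the same height and width as the input, which is why the convolutional layers in this notebook keep the 32 x 32 shape.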

### NEW POOLING LAYER

In [15]:
def new_pooling_layer(input, stride):
    
    layer = tf.nn.max_pool(value=input,ksize=[1, 2, 2, 1],strides=[1, stride, stride, 1],
                           padding='SAME', name = "POOLING_LAYER")
    
    return layer
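A plain-Python sketch of 2 x 2 max pooling for one channel, to show how a stride of 2 halves each spatial dimension. Windows that run past the edge are clipped, which mimics 'SAME' padding for this window size; the function name is illustrative.

```python
def max_pool_2x2(image, stride=2):
    """2x2 max pooling over one channel; windows are clipped at the image border."""
    h, w = len(image), len(image[0])
    out_h = -(-h // stride)   # ceil(h / stride)
    out_w = -(-w // stride)   # ceil(w / stride)
    out = []
    for r in range(out_h):
        row = []
        for c in range(out_w):
            window = [image[rr][cc]
                      for rr in range(r * stride, min(r * stride + 2, h))
                      for cc in range(c * stride, min(c * stride + 2, w))]
            row.append(max(window))
        out.append(row)
    return out


image = [[1, 2, 3, 4],
         [5, 6, 7, 8],
         [9, 10, 11, 12],
         [13, 14, 15, 16]]
print(max_pool_2x2(image))  # [[6, 8], [14, 16]]
```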

### Helper Function for Flattening a Layer

In [16]:
def flatten_layer(layer):
        
    # get the shape of the input layer in a format [num_images, img_height, img_width, num_channels]
    layer_shape = layer.get_shape()

    # number of features is  img_height * img_width * num_channels
    num_features = int(layer_shape[1] * layer_shape[2] * layer_shape[3])
    
    # reshape the layer to [num_images, num_features].
    # -1  means the size in that dimension is calculated so the total size of the tensor is unchanged from the reshaping.
    layer_flat = tf.reshape(layer, [-1, num_features],name = "FLAT_LAYER")

    # return both the flattened layer and the number of features.
    return layer_flat, num_features

### NEW FULLY-CONNECTED LAYER

In [17]:
def new_fc_layer(input, num_inputs,num_outputs,use_relu, use_sigmoid): 

    # new weights and biases for the layer
    weights = new_weights(shape = [num_inputs, num_outputs])
    biases = new_biases(length = num_outputs)

    # calculate the layer as the matrix multiplication of the input and weights, and then add the bias-values.
    layer = tf.matmul(input, weights) + biases

    if use_relu:
        layer = tf.nn.relu(layer, name = "FULLY_CONNECTED_WITH_RELU")
       
    if use_sigmoid:
        layer = tf.nn.sigmoid(layer,name = "FULLY_CONNECTED_WITH_SIGMOID")

    return layer
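The fully-connected layer is a matrix multiplication plus bias, followed by an optional activation. A plain-Python sketch for a single input vector (names and values are illustrative only):

```python
import math


def fc_layer(inputs, weights, biases, use_relu=False, use_sigmoid=False):
    """inputs: num_inputs values; weights: num_inputs x num_outputs; biases: num_outputs."""
    outputs = []
    for j in range(len(biases)):
        z = sum(inputs[i] * weights[i][j] for i in range(len(inputs))) + biases[j]
        if use_relu:
            z = max(z, 0.0)
        if use_sigmoid:
            z = 1.0 / (1.0 + math.exp(-z))
        outputs.append(z)
    return outputs


weights = [[1.0, -1.0],
           [0.5, -2.0]]
print(fc_layer([1.0, 2.0], weights, [0.0, 0.0], use_relu=True))  # [2.0, 0.0]
```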

# Design Computational Graph for CNN

### HYPER-PARAMETERS OF THE NETWORK

INPUT->[CONV -> RELU -> CONV ->RELU ->POOL] ->FC -> RELU -> FC ->SIGMOID ->OUTPUT

* **CONV 1**

In [18]:
filter_size1 = 2
num_filters1 = 16

* **CONV 2**

In [19]:
filter_size2 = 3
num_filters2 = 32

* **FC 1 & FC 2**

In [20]:
fc_size1 = 2576
fc_size2 = 420

DATA DIMENSION REMINDER:
* sample_size       = 10000
* fingerprint_size  = 1024
* fingerprint_width = 32
* targets_num       = 420
* weights_num       = 420
* num_channels      = 1

## DEFINE THE LAYERS

### Convolutional Layer #1 with RELU

In [21]:
conv_layer1 = new_conv_layer_with_RELU(input=drug_image,
                                       num_input_channels = num_channels,
                                       filter_size = filter_size1,
                                       num_filters = num_filters1 )  

In [22]:
conv_layer1

<tf.Tensor 'Relu:0' shape=(?, 32, 32, 16) dtype=float32>

### Convolutional Layer #2 with RELU

In [23]:
conv_layer2 = new_conv_layer_with_RELU(input = conv_layer1,
                                       num_input_channels = num_filters1 ,
                                       filter_size = filter_size2,
                                       num_filters = num_filters2)  

In [24]:
conv_layer2

<tf.Tensor 'Relu_1:0' shape=(?, 32, 32, 32) dtype=float32>

### Pooling Layer

In [25]:
pooling_layer = new_pooling_layer(input= conv_layer2, stride = 2)

In [26]:
pooling_layer 

<tf.Tensor 'POOLING_LAYER:0' shape=(?, 16, 16, 32) dtype=float32>

### Prepare input to FC layer, aka flattening the layer

In [27]:
layer_flat, features_num = flatten_layer(pooling_layer)

In [28]:
layer_flat

<tf.Tensor 'FLAT_LAYER:0' shape=(?, 8192) dtype=float32>
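The shapes printed so far can be checked by hand. With 'SAME' padding the spatial output size is `ceil(input_size / stride)`, so the two stride-1 convolutions keep 32 x 32 and the stride-2 pooling gives 16 x 16. A short worked check (the helper name is illustrative):

```python
import math


def same_padding_output(size, stride):
    """Spatial output size of a 'SAME'-padded convolution or pooling layer."""
    return math.ceil(size / stride)


width = 32                               # fingerprint_width
width = same_padding_output(width, 1)    # CONV 1, stride 1
width = same_padding_output(width, 1)    # CONV 2, stride 1
width = same_padding_output(width, 2)    # POOL, stride 2
depth = 32                               # num_filters2
num_features = width * width * depth

print(width, depth, num_features)  # 16 32 8192
```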

### Fully-Connected Layer 1

In [29]:
fc_layer1 = new_fc_layer(input = layer_flat,
                         num_inputs = features_num,
                         num_outputs = fc_size1,
                         use_relu = True,
                         use_sigmoid = False)

In [30]:
fc_layer1

<tf.Tensor 'FULLY_CONNECTED_WITH_RELU:0' shape=(?, 2576) dtype=float32>

### Fully-Connected Layer 2

In [31]:
fc_layer2 = new_fc_layer(input = fc_layer1,
                         num_inputs = fc_size1,
                         num_outputs = fc_size2,
                         use_relu = False,
                         use_sigmoid = True)

In [32]:
fc_layer2

<tf.Tensor 'FULLY_CONNECTED_WITH_SIGMOID:0' shape=(?, 420) dtype=float32>

### OUTPUT -> Predicted Classes 

In [33]:
output = tf.round(fc_layer2)

# Cost Function to Optimize

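Because we want to penalize each output node independently, we pick a binary loss and model the output of the network as an independent Bernoulli distribution per label.

In [34]:
# sigmoid_cross_entropy_with_logits applies the sigmoid itself and expects raw
# (pre-sigmoid) logits. fc_layer2 was built with use_sigmoid = True, so take the
# input of its sigmoid op, i.e. the pre-activation values, as the logits.
logits = fc_layer2.op.inputs[0]
cross_entropy = tf.nn.sigmoid_cross_entropy_with_logits(logits = logits,
                                                        labels = y_true)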
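For reference, the per-label loss can be written out in plain Python. The sketch below uses the numerically stable form given in the TensorFlow documentation for this function, `max(x, 0) - x * z + log(1 + exp(-|x|))`, where `x` is the logit and `z` the label, and compares it with the textbook formula. The function names are illustrative.

```python
import math


def stable_sigmoid_ce(logit, label):
    """Binary cross-entropy on a raw logit, safe for large |logit|."""
    return max(logit, 0.0) - logit * label + math.log1p(math.exp(-abs(logit)))


def naive_sigmoid_ce(logit, label):
    """Textbook form: -(z * log(p) + (1 - z) * log(1 - p)) with p = sigmoid(logit)."""
    p = 1.0 / (1.0 + math.exp(-logit))
    return -(label * math.log(p) + (1.0 - label) * math.log(1.0 - p))


# both forms agree for moderate logits
print(abs(stable_sigmoid_ce(2.0, 1.0) - naive_sigmoid_ce(2.0, 1.0)) < 1e-12)  # True

# the stable form stays finite where the naive one would hit log(0)
print(stable_sigmoid_ce(1000.0, 0.0))  # 1000.0
```

Note that the input must be a raw logit: feeding in a value that has already passed through a sigmoid squeezes it into (0, 1) and yields a different loss.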

### Multiply logistic loss with weights (ELEMENT-WISE) 

In [35]:
# sum of cost for all labels with weight 1
cost_sum = tf.reduce_sum(tf.multiply(cross_entropy_weights,cross_entropy))

# number of labels with weight 1
num_nonzero_weights = tf.count_nonzero(input_tensor=cross_entropy_weights,dtype = tf.float32)

# average cost
cost = tf.divide(cost_sum, num_nonzero_weights, name= "COST")
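The weighting step is a masked average: losses of entries with weight 0 are dropped, and the sum is divided by the number of entries that remain. A plain-Python sketch on a tiny 2 x 2 example, assuming weights are 0 or 1 as in this notebook (the function name and numbers are made up for illustration):

```python
def masked_mean_loss(losses, weights):
    """Average the per-label losses over entries whose weight is non-zero."""
    cost_sum = sum(w * l
                   for loss_row, weight_row in zip(losses, weights)
                   for l, w in zip(loss_row, weight_row))
    num_nonzero = sum(1 for weight_row in weights for w in weight_row if w != 0)
    return cost_sum / num_nonzero


losses = [[0.5, 9.0],
          [1.5, 1.0]]
weights = [[1, 0],     # the 9.0 entry is masked out
           [1, 1]]
print(masked_mean_loss(losses, weights))  # 1.0
```

As in the TensorFlow cell above, a batch whose weights are all zero would divide by zero, so in practice such batches need to be avoided or guarded against.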

### Optimization Method

In [36]:
optimizer = tf.train.AdamOptimizer(learning_rate=1e-4).minimize(cost)

In [37]:
accuracy, accuracy_ops =tf.metrics.accuracy(labels=y_true,predictions=output, weights = cross_entropy_weights)

In [38]:
# Local variables needed to show the updated accuracy on each iteration
stream_vars = [i for i in tf.local_variables()]
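`tf.metrics.accuracy` is a streaming metric: it keeps two local variables, a running `total` of weighted correct predictions and a running `count` of weights, and reports `total / count` over everything seen so far. A plain-Python sketch of that bookkeeping (the class name is illustrative):

```python
class StreamingAccuracy:
    """Running weighted accuracy over all batches seen so far."""

    def __init__(self):
        self.total = 0.0   # weighted number of correct predictions
        self.count = 0.0   # sum of weights

    def update(self, labels, predictions, weights):
        for y, p, w in zip(labels, predictions, weights):
            self.total += w * (1.0 if y == p else 0.0)
            self.count += w
        return self.total / self.count if self.count else 0.0


metric = StreamingAccuracy()
print(metric.update([1, 0], [1, 1], [1, 1]))   # 0.5
print(metric.total, metric.count)              # 1.0 2.0
```

Because the counters are never reset, the reported value is an average over all iterations so far, not the accuracy of the latest batch alone.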

# Create TensorFlow session

In [39]:
session = tf.Session()
init = tf.group(tf.global_variables_initializer(), tf.local_variables_initializer())
session.run(init)

In [40]:
train_batch_size = 50

In [41]:
def fetch_batch(batch_size):
    chosen = np.random.randint(len(drug_fingerprints), size = batch_size)
    X_batch = drug_fingerprints[chosen, :]
    y_batch = drug_targets[chosen, :]
    # use a distinct local name so the placeholder cross_entropy_weights is not shadowed
    weights_batch = drug_weights[chosen,:]
    return X_batch,y_batch,weights_batch
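`np.random.randint` draws the row indices independently, so a batch is sampled with replacement and may contain the same compound more than once. A plain-Python sketch of the same index sampling (the helper name and the `seed` parameter are assumptions for illustration):

```python
import random


def sample_batch_indices(num_examples, batch_size, seed=None):
    """Draw batch_size row indices uniformly at random, with replacement."""
    rng = random.Random(seed)
    return [rng.randrange(num_examples) for _ in range(batch_size)]


indices = sample_batch_indices(10000, 50, seed=0)
print(len(indices))  # 50
```

If batches without duplicates were wanted instead, `random.sample(range(num_examples), batch_size)` would be one way to draw them.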

In [42]:
# counter for total number of iterations
total_iterations = 0

def optimize(num_iterations):
    
    # update the global variable rather than a local copy.
    global total_iterations

    # start-time 
    start_time = time.time()

    for i in range(total_iterations, total_iterations + num_iterations):

        # batch of training examples
        x_batch, y_true_batch, weights_batch = fetch_batch(train_batch_size)

        # put the batch into a dict with the proper names for placeholder variables
        feed_dict_train = {x: x_batch,
                           y_true: y_true_batch,
                          cross_entropy_weights: weights_batch}

        # run the optimizer with the batch of training data
        session.run(optimizer, feed_dict=feed_dict_train)

        # print an update on every iteration (i % 1 is always 0)
        if i % 1 == 0:
            
            # calculate the accuracy on the training-set.
            acc_ops = session.run(accuracy_ops, feed_dict=feed_dict_train)
            
            # print update
            print('[Total correct, Total count]:',session.run(stream_vars)) 
            print("Optimization Iteration: {}, Training Accuracy: {} \n".format(i+1,acc_ops))                        

    # update the total number of iterations
    total_iterations += num_iterations

    # end time
    end_time = time.time()

    # difference between start and end-times.
    time_dif = end_time - start_time

    #time-usage
    print("Time usage: " + str(timedelta(seconds=int(round(time_dif)))))

### How is the total count calculated?
<br>
Per iteration we have 50 examples (the batch size), each with 420 target variables, so counting every entry would add 50 x 420 = 21000 to the total count per iteration.
<br>
Counting every entry is wrong, because many entries are 0 either because of inactivity or because of lack of knowledge.
<br>

**(In that case the network could score very well by outputting zeros all the time, as the data is unbalanced.)**


**Accuracy should only be calculated over the entries with weight 1, as the cross-entropy is.**
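A toy illustration of the point above, with made-up numbers: a predictor that always outputs 0 looks good when every entry is counted, but not when only the entries with weight 1 are counted. The last line checks that the first logged pair in the output below, 60815 correct out of a count of 67395, reproduces the accuracy reported next to it.

```python
def accuracy(labels, predictions, weights=None):
    """Weighted fraction of correct predictions; all weights are 1 by default."""
    if weights is None:
        weights = [1] * len(labels)
    correct = sum(w for y, p, w in zip(labels, predictions, weights) if y == p)
    return correct / sum(weights)


labels      = [1, 0, 0, 0, 0, 0, 0, 0]
weights     = [1, 1, 0, 0, 0, 0, 0, 0]   # only the first two entries were measured
predictions = [0] * 8                    # always predict "inactive"

print(accuracy(labels, predictions))            # 0.875
print(accuracy(labels, predictions, weights))   # 0.5

# total / count from the first logged line matches the logged accuracy
print(abs(60815.0 / 67395.0 - 0.9023666381835938) < 1e-6)  # True
```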

In [47]:
optimize(num_iterations=501)

[Total correct, Total count]: [60815.0, 67395.0]
Optimization Iteration: 103, Training Accuracy: 0.9023666381835938 

[Total correct, Total count]: [61995.0, 68661.0]
Optimization Iteration: 104, Training Accuracy: 0.9029143452644348 

[Total correct, Total count]: [62502.0, 69213.0]
Optimization Iteration: 105, Training Accuracy: 0.9030384421348572 

[Total correct, Total count]: [63586.0, 70394.0]
Optimization Iteration: 106, Training Accuracy: 0.9032872319221497 

[Total correct, Total count]: [63933.0, 70761.0]
Optimization Iteration: 107, Training Accuracy: 0.9035061597824097 

[Total correct, Total count]: [64667.0, 71546.0]
Optimization Iteration: 108, Training Accuracy: 0.9038520455360413 

[Total correct, Total count]: [65355.0, 72290.0]
Optimization Iteration: 109, Training Accuracy: 0.9040669798851013 

[Total correct, Total count]: [65871.0, 72829.0]
Optimization Iteration: 110, Training Accuracy: 0.904461145401001 

[Total correct, Total count]: [66271.0, 73264.0]
Optimiza [... output truncated ...]

[Total correct, Total count]: [105953.0, 116493.0]
Optimization Iteration: 173, Training Accuracy: 0.9095224738121033 

[Total correct, Total count]: [106725.0, 117314.0]
Optimization Iteration: 174, Training Accuracy: 0.9097379446029663 

[Total correct, Total count]: [107921.0, 118544.0]
Optimization Iteration: 175, Training Accuracy: 0.9103876948356628 

[Total correct, Total count]: [108667.0, 119319.0]
Optimization Iteration: 176, Training Accuracy: 0.9107267260551453 

[Total correct, Total count]: [109954.0, 120685.0]
Optimization Iteration: 177, Training Accuracy: 0.9110825657844543 

[Total correct, Total count]: [110605.0, 121414.0]
Optimization Iteration: 178, Training Accuracy: 0.9109740257263184 

[Total correct, Total count]: [111963.0, 122932.0]
Optimization Iteration: 179, Training Accuracy: 0.9107717871665955 

[Total correct, Total count]: [112709.0, 123830.0]
Optimization Iteration: 180, Training Accuracy: 0.9101914167404175 

[Total correct, Total count]: [113271.0, [... output truncated ...]

[Total correct, Total count]: [155191.0, 169782.0]
Optimization Iteration: 243, Training Accuracy: 0.9140603542327881 

[Total correct, Total count]: [155816.0, 170446.0]
Optimization Iteration: 244, Training Accuracy: 0.9141663908958435 

[Total correct, Total count]: [156442.0, 171135.0]
Optimization Iteration: 245, Training Accuracy: 0.9141438007354736 

[Total correct, Total count]: [157355.0, 172095.0]
Optimization Iteration: 246, Training Accuracy: 0.9143496155738831 

[Total correct, Total count]: [157958.0, 172739.0]
Optimization Iteration: 247, Training Accuracy: 0.9144315719604492 

[Total correct, Total count]: [158221.0, 173097.0]
Optimization Iteration: 248, Training Accuracy: 0.9140597581863403 

[Total correct, Total count]: [158909.0, 173824.0]
Optimization Iteration: 249, Training Accuracy: 0.9141948223114014 

[Total correct, Total count]: [159289.0, 174263.0]
Optimization Iteration: 250, Training Accuracy: 0.9140723943710327 

[Total correct, Total count]: [159902.0, [... output truncated ...]

[Total correct, Total count]: [205112.0, 223788.0]
Optimization Iteration: 313, Training Accuracy: 0.9165459871292114 

[Total correct, Total count]: [205973.0, 224741.0]
Optimization Iteration: 314, Training Accuracy: 0.9164905548095703 

[Total correct, Total count]: [206462.0, 225297.0]
Optimization Iteration: 315, Training Accuracy: 0.9163992404937744 

[Total correct, Total count]: [206653.0, 225519.0]
Optimization Iteration: 316, Training Accuracy: 0.9163441061973572 

[Total correct, Total count]: [207317.0, 226229.0]
Optimization Iteration: 317, Training Accuracy: 0.9164032936096191 

[Total correct, Total count]: [207572.0, 226517.0]
Optimization Iteration: 318, Training Accuracy: 0.9163638949394226 

[Total correct, Total count]: [208013.0, 226987.0]
Optimization Iteration: 319, Training Accuracy: 0.9164093136787415 

[Total correct, Total count]: [208978.0, 228005.0]
Optimization Iteration: 320, Training Accuracy: 0.9165500998497009 

[Total correct, Total count]: [209961.0, [... output truncated ...]

[Total correct, Total count]: [251988.0, 274359.0]
Optimization Iteration: 383, Training Accuracy: 0.9184608459472656 

[Total correct, Total count]: [252087.0, 274497.0]
Optimization Iteration: 384, Training Accuracy: 0.9183597564697266 

[Total correct, Total count]: [252719.0, 275200.0]
Optimization Iteration: 385, Training Accuracy: 0.9183103442192078 

[Total correct, Total count]: [253507.0, 276065.0]
Optimization Iteration: 386, Training Accuracy: 0.9182873368263245 

[Total correct, Total count]: [254191.0, 276785.0]
Optimization Iteration: 387, Training Accuracy: 0.9183698296546936 

[Total correct, Total count]: [254931.0, 277589.0]
Optimization Iteration: 388, Training Accuracy: 0.9183757305145264 

[Total correct, Total count]: [256230.0, 278931.0]
Optimization Iteration: 389, Training Accuracy: 0.9186142683029175 

[Total correct, Total count]: [256913.0, 279671.0]
Optimization Iteration: 390, Training Accuracy: 0.9186258316040039 

[Total correct, Total count]: [257836.0, [... output truncated ...]

[Total correct, Total count]: [292251.0, 318065.0]
Optimization Iteration: 453, Training Accuracy: 0.9188404679298401 

[Total correct, Total count]: [293479.0, 319359.0]
Optimization Iteration: 454, Training Accuracy: 0.9189626574516296 

[Total correct, Total count]: [294509.0, 320448.0]
Optimization Iteration: 455, Training Accuracy: 0.9190539717674255 

[Total correct, Total count]: [295015.0, 321031.0]
Optimization Iteration: 456, Training Accuracy: 0.9189611077308655 

[Total correct, Total count]: [295941.0, 322065.0]
Optimization Iteration: 457, Training Accuracy: 0.9188859462738037 

[Total correct, Total count]: [296493.0, 322685.0]
Optimization Iteration: 458, Training Accuracy: 0.9188310503959656 

[Total correct, Total count]: [297229.0, 323464.0]
Optimization Iteration: 459, Training Accuracy: 0.918893575668335 

[Total correct, Total count]: [297670.0, 323935.0]
Optimization Iteration: 460, Training Accuracy: 0.9189189076423645 

[Total correct, Total count]: [298548.0, [... output truncated ...]

[Total correct, Total count]: [335717.0, 365338.0]
Optimization Iteration: 523, Training Accuracy: 0.9189216494560242 

[Total correct, Total count]: [336414.0, 366076.0]
Optimization Iteration: 524, Training Accuracy: 0.9189730882644653 

[Total correct, Total count]: [337029.0, 366759.0]
Optimization Iteration: 525, Training Accuracy: 0.9189385771751404 

[Total correct, Total count]: [337459.0, 367224.0]
Optimization Iteration: 526, Training Accuracy: 0.9189459085464478 

[Total correct, Total count]: [338030.0, 367866.0]
Optimization Iteration: 527, Training Accuracy: 0.918894350528717 

[Total correct, Total count]: [338181.0, 368050.0]
Optimization Iteration: 528, Training Accuracy: 0.9188452363014221 

[Total correct, Total count]: [338989.0, 368919.0]
Optimization Iteration: 529, Training Accuracy: 0.9188711047172546 

[Total correct, Total count]: [339553.0, 369554.0]
Optimization Iteration: 530, Training Accuracy: 0.9188183546066284 

[Total correct, Total count]: [339641.0, [... output truncated ...]

[Total correct, Total count]: [379623.0, 413156.0]
Optimization Iteration: 593, Training Accuracy: 0.9188369512557983 

[Total correct, Total count]: [380654.0, 414283.0]
Optimization Iteration: 594, Training Accuracy: 0.9188260436058044 

[Total correct, Total count]: [381137.0, 414825.0]
Optimization Iteration: 595, Training Accuracy: 0.9187898635864258 

[Total correct, Total count]: [382418.0, 416290.0]
Optimization Iteration: 596, Training Accuracy: 0.9186336398124695 

[Total correct, Total count]: [383008.0, 416931.0]
Optimization Iteration: 597, Training Accuracy: 0.9186364412307739 

[Total correct, Total count]: [383951.0, 417930.0]
Optimization Iteration: 598, Training Accuracy: 0.918696939945221 

[Total correct, Total count]: [384535.0, 418555.0]
Optimization Iteration: 599, Training Accuracy: 0.9187203645706177 

[Total correct, Total count]: [384976.0, 419028.0]
Optimization Iteration: 600, Training Accuracy: 0.9187357425689697 

[Total correct, Total count]: [385068.0, [... output truncated ...]

In [44]:
writer = tf.summary.FileWriter("./logs/CNN_1", session.graph)

In [45]:
# ! tensorboard --logdir=log+

21000