## Drug-Kinase Interaction Prediction with CNN

## Problem:

* This is a Multi-Label Classification problem where multiple labels may be assigned to each instance.

https://nickcdryan.com/2017/01/23/multi-label-classification-a-guided-tour/

https://stats.stackexchange.com/questions/12702/what-are-the-measure-for-accuracy-of-multilabel-data?utm_medium=organic&utm_source=google_rich_qa&utm_campaign=google_rich_qa
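
To make the multi-label setting concrete, the sketch below builds a tiny multi-hot label matrix and scores it in two common ways: exact-match (subset) accuracy and per-label accuracy (one minus the Hamming loss). The arrays `y_true_toy` and `y_pred_toy` are invented toy values, not data from this project.

```python
import numpy as np

# Toy example: 3 compounds, 4 kinase targets (values invented for illustration).
y_true_toy = np.array([[1, 0, 1, 0],
                       [0, 1, 0, 0],
                       [1, 1, 0, 1]])
y_pred_toy = np.array([[1, 0, 1, 0],   # every label correct
                       [0, 1, 1, 0],   # one label wrong
                       [1, 1, 0, 1]])  # every label correct

# Exact-match accuracy: a compound counts only if all of its labels are right.
exact_match = np.mean(np.all(y_true_toy == y_pred_toy, axis=1))

# Per-label accuracy (1 - Hamming loss): each label is scored on its own.
per_label = np.mean(y_true_toy == y_pred_toy)

print(exact_match, per_label)
```

A single wrong label costs a whole compound under exact match but only one cell out of twelve under the per-label score.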

## Network Architecture

* INPUT -> [CONV -> RELU -> CONV -> RELU -> POOL] -> FC -> SIGMOID -> FC
<br>
**OR**
<br>
* INPUT -> [CONV -> RELU -> CONV -> RELU -> POOL] -> FC -> RELU -> FC -> SIGMOID -> FC

## Cost Function

There are two ways to penalize the instances:
* If no label may be missed, treat a prediction as entirely wrong whenever even one label is incorrect, even if all the others are right.
* Alternatively, count each missed or misclassified label as a separate error.

We use the second method:

`sigmoid_cross_entropy_with_logits` is a TensorFlow function that penalizes each output node independently. It uses a binary cross-entropy loss and models the output of the network as an independent Bernoulli distribution per label.
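
As a rough illustration of the per-label loss, here is a NumPy sketch of the numerically stable expression given in the TensorFlow documentation for this function, `max(x, 0) - x * z + log(1 + exp(-|x|))`, where `x` is a logit and `z` its 0/1 label. The toy arrays and the weighted average at the end are assumptions for illustration; the notebook does not show in this section how its weights enter the cost.

```python
import numpy as np

def sigmoid_cross_entropy(logits, labels):
    """Element-wise binary cross-entropy computed from raw logits."""
    # Stable form: max(x, 0) - x * z + log(1 + exp(-|x|))
    return (np.maximum(logits, 0.0)
            - logits * labels
            + np.log1p(np.exp(-np.abs(logits))))

# Toy values (invented): one compound, three targets.
logits_toy = np.array([[2.0, -1.0, 0.5]])
labels_toy = np.array([[1.0, 0.0, 1.0]])
weights_toy = np.array([[1.0, 1.0, 0.0]])  # hypothetical per-label weights

# Each output node gets its own loss value.
per_label_loss = sigmoid_cross_entropy(logits_toy, labels_toy)

# One possible way (an assumption) to fold the weights into a scalar cost.
weighted_mean_loss = np.sum(per_label_loss * weights_toy) / np.sum(weights_toy)
```

Because the function expects raw logits, the sigmoid itself is not applied before the loss is computed.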

In [8]:
%matplotlib inline
import matplotlib.pyplot as plt
import tensorflow as tf
import numpy as np
from sklearn.metrics import confusion_matrix
import time
from datetime import timedelta
import math

## Load data

In [3]:
drug_fingerprints_fh = 'sample/sample_fingerprints.csv'
drug_targets_fh      = 'sample/sample_targets.csv'
drug_weights_fh      = 'sample/sample_weights.csv'

### Dimensions of data 

In [14]:
sample_size       = 10000
fingerprint_size  = 1024
fingerprint_width = 32
targets_num       = 420
weights_num       = 420
num_channels      = 1

### Function helper that populates data structures with the actual data

In [21]:
import re
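def populate_data(file_handle, data_matrix, data_size):
    # file_handle is the path of the file to read
    with open(file_handle) as fh:
        j = 0
        content = fh.readlines()
        content = [x.strip() for x in content]
        for line in content:
            # fields are separated by commas or tabs
            result = re.split(r'[,\t]\s*', line)
            # column 0 is skipped; columns 1..data_size hold the values
            for i in range(1, data_size + 1):
                data_matrix[j][i-1] = np.float32(result[i])
            j = j + 1
    # report the number of rows read
    print(j)
    # no explicit close is needed: the with block has already closed the file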
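
The parsing step can be tried on a single in-memory row, as in the sketch below. `parse_line` is a hypothetical helper, and the sample row (including the assumption that the first column is an identifier) is invented, since the layout of the CSV files is not shown here.

```python
import re
import numpy as np

def parse_line(line, data_size):
    """Split one row on commas or tabs and return columns 1..data_size as float32."""
    fields = re.split(r'[,\t]\s*', line.strip())
    # Column 0 is skipped, mirroring the range(1, data_size + 1) loop.
    return np.array([float(f) for f in fields[1:data_size + 1]], dtype=np.float32)

# Hypothetical row: an identifier followed by four values.
row = parse_line("compound_1,0,1,1,0\n", data_size=4)
print(row)
```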

### Data structures for loaded data

In [22]:
drug_fingerprints = []
drug_targets      = []
drug_weights      = []


for i in range(sample_size):
    fingerprint_holder = [0]* fingerprint_size
    drug_fingerprints.append(fingerprint_holder)
    
for i in range(sample_size):
    target_holder = [0]* targets_num
    drug_targets.append(target_holder)

for i in range(sample_size):
    weight_holder = [0]* weights_num
    drug_weights.append(weight_holder)

In [23]:
populate_data(drug_weights_fh, drug_weights, weights_num)
populate_data(drug_targets_fh, drug_targets, targets_num)
populate_data(drug_fingerprints_fh, drug_fingerprints, fingerprint_size)

10000
10000
10000


In [25]:
drug_fingerprints = np.array(drug_fingerprints)
drug_targets      = np.array(drug_targets)
drug_weights      = np.array(drug_weights)

# TensorFlow

## Placeholders
Placeholder for the flat array holding the **fingerprint** of each compound. `None` means that this tensor can hold an arbitrary number of fingerprint arrays.

In [26]:
x = tf.placeholder(tf.float32, [None, fingerprint_size])

`x` is first defined as a vector of size `fingerprint_size` and is then reshaped into a 2D matrix (an image).

In [27]:
x_image = tf.reshape(x, [-1, fingerprint_width, fingerprint_width, num_channels])
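
The same reshape can be checked in NumPy, which lays out elements in the same row-major order. In this sketch the batch of two fingerprints is filled with running integers (toy values) so that it is easy to see where each position ends up.

```python
import numpy as np

fingerprint_size = 1024
fingerprint_width = 32
num_channels = 1

# Two toy "fingerprints" whose entries are just their flat positions.
batch = np.arange(2 * fingerprint_size, dtype=np.float32).reshape(2, fingerprint_size)

# -1 lets NumPy infer the batch dimension, as in the tf.reshape call.
images = batch.reshape(-1, fingerprint_width, fingerprint_width, num_channels)

print(images.shape)
```

Position `k` of a fingerprint lands at row `k // 32`, column `k % 32` of its image.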

Placeholder for the true labels (true targets) for each compound. (Here we have 420 targets - kinases).

In [28]:
y_true = tf.placeholder(tf.float32, [None, targets_num])

Placeholder for the weights. The weights will be used to calculate the cross-entropy cost function.

In [29]:
cross_entropy_weights = tf.placeholder(tf.float32, [None, weights_num])

## Variables to Optimize

In [30]:
def new_weights(shape):
    return tf.Variable(tf.truncated_normal(shape, stddev=0.05))

In [31]:
def new_biases(length):
    return tf.Variable(tf.constant(0.05, shape=[length]))
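
For intuition about the initial weight values: a truncated normal re-draws any sample that falls more than two standard deviations from the mean. The NumPy sketch below imitates that behaviour by resampling. `truncated_normal_np` is a hypothetical helper written for illustration, not TensorFlow's implementation.

```python
import numpy as np

def truncated_normal_np(shape, stddev=0.05, seed=0):
    """Draw zero-mean normal samples, re-drawing any beyond 2 standard deviations."""
    rng = np.random.default_rng(seed)
    values = rng.normal(0.0, stddev, size=shape)
    out_of_range = np.abs(values) > 2 * stddev
    # Keep re-drawing only the offending entries until none remain.
    while np.any(out_of_range):
        values[out_of_range] = rng.normal(0.0, stddev, size=int(out_of_range.sum()))
        out_of_range = np.abs(values) > 2 * stddev
    return values

weights_toy = truncated_normal_np((5, 5, 1, 16))
print(np.abs(weights_toy).max())
```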

## Network Architecture

* INPUT -> [CONV -> RELU -> CONV -> RELU -> POOL] -> FC -> RELU -> FC -> SIGMOID -> FC

## Helper Functions to Create the Network's Layers

### NEW CONVOLUTION LAYER

In [36]:
def new_conv_layer_with_RELU(input, num_input_channels, filter_size, num_filters):        
    
    #number and shape of the filters
    shape = [filter_size, filter_size, num_input_channels, num_filters]
    
    #create the filters
    weights = new_weights(shape=shape)

    #create bias for each depth slice of the filter volume
    biases = new_biases(length=num_filters)

    #convolve: strides=[1,1,1,1] moves the filter one pixel at a time; padding='SAME' adds zero padding to keep the dimensions
    layer = tf.nn.conv2d(input   = input,filter  = weights, strides = [1, 1, 1, 1], padding = 'SAME')
    
    # Add the biases 
    layer += biases
    
    #  non-linearity (ReLU).
    layer = tf.nn.relu(layer)
                   
    return layer
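
To see what this layer computes, here is a simplified NumPy sketch for a single image, one input channel and one filter: a stride-1 sliding-window sum with zero padding that preserves the input size, plus a bias, followed by ReLU. `conv2d_same_relu` and the toy inputs are hypothetical, and the loops favour clarity over speed.

```python
import numpy as np

def conv2d_same_relu(image, kernel, bias=0.0):
    """Single-channel sliding-window filter, stride 1, zero 'SAME' padding, then ReLU."""
    k_h, k_w = kernel.shape
    # Split the padding so the output keeps the input's height and width.
    top, left = (k_h - 1) // 2, (k_w - 1) // 2
    padded = np.pad(image, ((top, k_h - 1 - top), (left, k_w - 1 - left)),
                    mode='constant')
    out = np.zeros(image.shape, dtype=np.float64)
    for r in range(image.shape[0]):
        for c in range(image.shape[1]):
            # Weighted sum of the window under the filter, plus the bias.
            out[r, c] = np.sum(padded[r:r + k_h, c:c + k_w] * kernel) + bias
    # ReLU clips negative responses to zero.
    return np.maximum(out, 0.0)

image_toy = np.arange(16, dtype=np.float64).reshape(4, 4)
result = conv2d_same_relu(image_toy, np.ones((3, 3)))
print(result.shape)
```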

### NEW POOLING LAYER

In [37]:
def new_pooling_layer(input):
    
    layer = tf.nn.max_pool(value=input,ksize=[1, 2, 2, 1],strides=[1, 2, 2, 1],padding='SAME')
    
    return layer
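
A 2x2 max pool with stride 2 halves the height and width, so a 32x32 feature map becomes 16x16. The NumPy sketch below shows the operation on one 2D array with even sides, where `'SAME'` padding adds nothing. `max_pool_2x2` is a hypothetical helper for illustration.

```python
import numpy as np

def max_pool_2x2(image):
    """2x2 max pooling with stride 2 on a 2D array whose sides are even."""
    h, w = image.shape
    # Group pixels into non-overlapping 2x2 blocks and keep each block's maximum.
    return image.reshape(h // 2, 2, w // 2, 2).max(axis=(1, 3))

pooled_toy = max_pool_2x2(np.arange(16).reshape(4, 4))
print(pooled_toy)
```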

### Helper Function for Flattening a Layer

In [41]:
def flatten_layer(layer):
        
    # get the shape of the input layer in the format [num_images, img_height, img_width, num_channels]
    layer_shape = layer.get_shape()

    # number of features is img_height * img_width * num_channels;
    # num_elements() returns it as a plain Python int
    num_features = layer_shape[1:4].num_elements()
    
    # reshape the layer to [num_images, num_features].
    # -1  means the size in that dimension is calculated so the total size of the tensor is unchanged from the reshaping.
    layer_flat = tf.reshape(layer, [-1, num_features])

    # return both the flattened layer and the number of features.
    return layer_flat, num_features
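
The feature count can be worked through in NumPy. This sketch assumes a pooled output of 16x16 with `num_filters = 16`; the filter count is a hypothetical value, since the hyper-parameters are defined later in the notebook.

```python
import numpy as np

num_filters = 16  # hypothetical value for illustration
# A batch of 5 pooled feature maps: [num_images, img_height, img_width, num_channels].
pooled = np.zeros((5, 16, 16, num_filters), dtype=np.float32)

# Features per image: img_height * img_width * num_channels.
num_features = int(np.prod(pooled.shape[1:]))

# -1 keeps the batch dimension; each image becomes one row.
flat = pooled.reshape(-1, num_features)
print(flat.shape)
```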

### NEW FULLY-CONNECTED LAYER

In [42]:
def new_fc_layer(input, num_inputs,num_outputs,use_relu, use_sigmoid): 

    # new weights and biases for the layer
    weights = new_weights(shape=[num_inputs, num_outputs])
    biases = new_biases(length=num_outputs)

    # calculate the layer as the matrix multiplication of the input and weights, and then add the bias-values.
    layer = tf.matmul(input, weights) + biases

    if use_relu:
        layer = tf.nn.relu(layer)
       
    if use_sigmoid:
        layer = tf.nn.sigmoid(layer)

    return layer
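
The arithmetic of this layer can be mirrored in NumPy. `fc_forward` is a hypothetical stand-in with the same switches, and the random inputs are toy values.

```python
import numpy as np

def fc_forward(inputs, weights, biases, use_relu=False, use_sigmoid=False):
    """Matrix multiply plus bias, with optional ReLU and/or sigmoid."""
    layer = inputs @ weights + biases
    if use_relu:
        layer = np.maximum(layer, 0.0)
    if use_sigmoid:
        layer = 1.0 / (1.0 + np.exp(-layer))
    return layer

rng = np.random.default_rng(0)
inputs_toy = rng.normal(size=(3, 8))                # 3 samples, 8 features
weights_toy = rng.normal(scale=0.05, size=(8, 4))   # 8 inputs -> 4 outputs
biases_toy = np.full(4, 0.05)

out_sigmoid = fc_forward(inputs_toy, weights_toy, biases_toy, use_sigmoid=True)
out_both = fc_forward(inputs_toy, weights_toy, biases_toy, use_relu=True, use_sigmoid=True)
print(out_sigmoid.shape)
```

If both flags are set, the sigmoid sees only non-negative values, so every output is at least 0.5. In most cases only one of the two switches should be enabled for a given layer.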

# Design Computational Graph for CNN

### HYPER-PARAMETERS OF THE NETWORK