Standard imports

In [15]:
import deepchem as dc
import tensorflow as tf
from sklearn.metrics import accuracy_score
import matplotlib.pyplot as plt
import pandas as pd

Load the data using the DeepChem module.

The labels are binary: each compound either interacts with the androgen receptor or does not.

Note that the dataset is imbalanced: there are far fewer positive examples than negative examples.

To counter this imbalance, the dataset's weight array `w` provides recommended per-example weights that give more emphasis to positive examples.
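To make the idea of per-example weights concrete, here is a minimal sketch of one common balancing scheme on toy labels. The scheme (scaling positives by the negative-to-positive ratio) and the names `y_toy` and `w_toy` are assumptions for illustration; the weights shipped with the DeepChem dataset may be computed differently.

```python
import numpy as np

# Toy labels: 8 negatives and 2 positives (hypothetical, not Tox21 data).
y_toy = np.array([0, 0, 0, 0, 0, 0, 0, 0, 1, 1], dtype=float)

n_pos = y_toy.sum()
n_neg = len(y_toy) - n_pos

# Assumed scheme: negatives keep weight 1, positives are scaled by n_neg / n_pos
# so that both classes carry the same total weight.
w_toy = np.where(y_toy == 1, n_neg / n_pos, 1.0)

print(w_toy)
print(w_toy[y_toy == 1].sum(), w_toy[y_toy == 0].sum())
```

With weights like these, a mistake on a rare positive example counts as much as several mistakes on negative examples.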

In [2]:
_, (train, valid, test), _ = dc.molnet.load_tox21()
train_X, train_y, train_w = train.X, train.y, train.w
valid_X, valid_y, valid_w = valid.X, valid.y, valid.w
test_X, test_y, test_w = test.X, test.y, test.w

Loading dataset from disk.
Loading dataset from disk.
Loading dataset from disk.


Tox21 contains more datasets (tasks) than we need, so we keep only the first column of the label and weight arrays, which drops the labels for the other tasks.

In [3]:
train_y = train_y[:, 0]
valid_y = valid_y[:, 0]
test_y = test_y[:, 0]
train_w = train_w[:, 0]
valid_w = valid_w[:, 0]
test_w = test_w[:, 0]

We are using the MoleculeNet dataset collection curated as part of DeepChem. DeepChem processes each Tox21 molecule into a bit vector of length 1024.

In the placeholders for $x$ and $y$ below, we use `None` for the number of rows so that each mini-batch we iterate over can contain any number of rows.

To implement the fully connected hidden layer, we use the matrix multiplication order $xW$ instead of $Wx$, which makes it more convenient to process a whole mini-batch at once: each row of $x$ is one example, so a single mini-batch can have whatever size $N$ we want.

Then we apply the ReLU nonlinearity using the built-in `tf.nn.relu` activation function.
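The shape bookkeeping behind the $xW$ ordering and the ReLU step can be checked with NumPy alone. This is a sketch rather than the TensorFlow graph itself: the random inputs are placeholders, and `np.maximum(..., 0)` stands in for `tf.nn.relu`.

```python
import numpy as np

d, n_hidden = 1024, 50
rng = np.random.default_rng(0)

# Same parameter shapes as the hidden layer: W is (d, n_hidden), b is (n_hidden,).
W = rng.normal(size=(d, n_hidden))
b = rng.normal(size=(n_hidden,))

shapes = []
for N in (1, 32, 100):
    # A mini-batch of N random bit vectors, one example per row.
    x = rng.integers(0, 2, size=(N, d)).astype(np.float32)
    # x @ W has shape (N, n_hidden); b broadcasts across the N rows.
    hidden = np.maximum(x @ W + b, 0.0)  # ReLU
    shapes.append(hidden.shape)

print(shapes)
```

The same `W` and `b` work for every batch size because only the row count of `x` changes.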

Here, we build the entire computation graph:

In [4]:
d = 1024
n_hidden = 50
learning_rate = .001
n_epochs = 10
batch_size = 100
dropout_prob = 0.5

In [5]:
with tf.name_scope("placeholders"):
    x = tf.placeholder(tf.float32, (None, d))  # feature rows; None allows any batch size
    y = tf.placeholder(tf.float32, (None,))    # one label per row of x
    # keep_prob = tf.placeholder(tf.float32)   # dropout keep probability (disabled)
with tf.name_scope("hidden-layer"):
    W = tf.Variable( tf.random_normal((d, n_hidden)))
    b = tf.Variable( tf.random_normal((n_hidden, )))
    x_hidden = tf.nn.relu( tf.matmul(x, W) + b )
    # apply dropout
#     x_hidden = tf.nn.dropout( x_hidden, keep_prob )
with tf.name_scope("output"):
    W = tf.Variable(tf.random_normal((n_hidden,1)))
    b = tf.Variable(tf.random_normal((1,)))
    y_logit = tf.matmul(x_hidden, W) + b
    y_one_prob = tf.sigmoid(y_logit)
    y_pred = tf.round(y_one_prob)
with tf.name_scope("loss"):
    y_expand = tf.expand_dims(y,1)
    entropy = tf.nn.sigmoid_cross_entropy_with_logits(logits=y_logit, labels=y_expand)
    l = tf.reduce_sum(entropy)
    
with tf.name_scope("optim"):
    train_op = tf.train.AdamOptimizer(learning_rate).minimize(l)
    
with tf.name_scope("summaries"):
    tf.summary.scalar("loss",l)
    merged = tf.summary.merge_all()



In [6]:
train_writer = tf.summary.FileWriter('/tmp/fcnet-tox21',
                                     tf.get_default_graph())

Next, we train the network using mini-batches.

Each mini-batch of data requires its own call to `sess.run`.
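The slicing logic of mini-batching can be isolated in a small generator. This is a sketch: the helper name `iterate_minibatches` is hypothetical, and the zero-filled arrays are placeholders that only mimic the shape of the Tox21 training set (6264 rows, 1024 features).

```python
import numpy as np

def iterate_minibatches(X, y, batch_size):
    """Yield consecutive (X, y) slices; the last slice may be smaller."""
    for pos in range(0, X.shape[0], batch_size):
        yield X[pos:pos + batch_size], y[pos:pos + batch_size]

# Placeholder arrays with the same shape as the training data.
X_toy = np.zeros((6264, 1024), dtype=np.float32)
y_toy = np.zeros((6264,), dtype=np.float32)

batch_sizes = [len(bx) for bx, _ in iterate_minibatches(X_toy, y_toy, 100)]
print(len(batch_sizes), batch_sizes[-1])  # 63 batches, the last with 64 rows
```

Because the placeholders use `None` for the row count, the smaller final batch can be fed to the graph without padding.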

In [7]:
print(train_X.shape)
print(train_y.shape)

(6264, 1024)
(6264,)


In [8]:
N = train_X.shape[0]
with tf.Session() as sess:
    sess.run(tf.global_variables_initializer())
    step = 0
    for epoch in range(n_epochs):
        pos = 0
        while pos < N:
            batch_X = train_X[pos:pos+batch_size]
            batch_y = train_y[pos:pos+batch_size]
            feed_dict = {x: batch_X, y: batch_y}
            _, summary, loss = sess.run([train_op, merged, l], feed_dict=feed_dict)
            print("epoch %d, step %d, loss: %f" % (epoch, step, loss))
            train_writer.add_summary(summary, step)

            step += 1
            pos += batch_size
    # Make Predictions
    valid_y_pred = sess.run(y_pred, feed_dict={x: valid_X})

epoch 0, step 0, loss: 465.333679
epoch 0, step 1, loss: 550.427124
epoch 0, step 2, loss: 732.376709
epoch 0, step 3, loss: 534.642700
epoch 0, step 4, loss: 591.761780
epoch 0, step 5, loss: 538.295776
epoch 0, step 6, loss: 552.799744
epoch 0, step 7, loss: 398.739838
epoch 0, step 8, loss: 461.339905
epoch 0, step 9, loss: 387.447632
epoch 0, step 10, loss: 620.415222
epoch 0, step 11, loss: 664.459106
epoch 0, step 12, loss: 316.986786
epoch 0, step 13, loss: 410.047913
epoch 0, step 14, loss: 389.063873
epoch 0, step 15, loss: 435.414124
epoch 0, step 16, loss: 334.097595
epoch 0, step 17, loss: 433.203308
epoch 0, step 18, loss: 406.787109
epoch 0, step 19, loss: 381.007202
epoch 0, step 20, loss: 479.168030
epoch 0, step 21, loss: 265.992615
epoch 0, step 22, loss: 366.928619
epoch 0, step 23, loss: 295.207825
epoch 0, step 24, loss: 331.018372
epoch 0, step 25, loss: 392.106201
epoch 0, step 26, loss: 313.021729
epoch 0, step 27, loss: 388.259064
epoch 0, step 28, loss: 239.09

epoch 3, step 249, loss: 171.662201
epoch 3, step 250, loss: 69.805496
epoch 3, step 251, loss: 100.209732
epoch 4, step 252, loss: 10.694839
epoch 4, step 253, loss: 51.636086
epoch 4, step 254, loss: 51.075825
epoch 4, step 255, loss: 137.131439
epoch 4, step 256, loss: 168.768784
epoch 4, step 257, loss: 176.651917
epoch 4, step 258, loss: 97.034019
epoch 4, step 259, loss: 56.121288
epoch 4, step 260, loss: 65.213737
epoch 4, step 261, loss: 104.311432
epoch 4, step 262, loss: 133.579285
epoch 4, step 263, loss: 101.272522
epoch 4, step 264, loss: 79.795746
epoch 4, step 265, loss: 166.654266
epoch 4, step 266, loss: 177.122650
epoch 4, step 267, loss: 67.156273
epoch 4, step 268, loss: 74.364006
epoch 4, step 269, loss: 137.096451
epoch 4, step 270, loss: 234.714355
epoch 4, step 271, loss: 60.245285
epoch 4, step 272, loss: 98.633804
epoch 4, step 273, loss: 48.436493
epoch 4, step 274, loss: 138.755127
epoch 4, step 275, loss: 48.128147
epoch 4, step 276, loss: 84.072327
epoch 4

epoch 7, step 494, loss: 7.168288
epoch 7, step 495, loss: 51.849518
epoch 7, step 496, loss: 34.809265
epoch 7, step 497, loss: 11.077578
epoch 7, step 498, loss: 47.410999
epoch 7, step 499, loss: 190.971649
epoch 7, step 500, loss: 83.303085
epoch 7, step 501, loss: 120.162445
epoch 7, step 502, loss: 50.159286
epoch 7, step 503, loss: 63.377087
epoch 8, step 504, loss: 4.884189
epoch 8, step 505, loss: 17.922279
epoch 8, step 506, loss: 21.515381
epoch 8, step 507, loss: 86.758476
epoch 8, step 508, loss: 117.856796
epoch 8, step 509, loss: 135.427582
epoch 8, step 510, loss: 42.019623
epoch 8, step 511, loss: 39.861877
epoch 8, step 512, loss: 36.025208
epoch 8, step 513, loss: 88.110184
epoch 8, step 514, loss: 89.409103
epoch 8, step 515, loss: 58.499126
epoch 8, step 516, loss: 41.354362
epoch 8, step 517, loss: 105.648102
epoch 8, step 518, loss: 116.583359
epoch 8, step 519, loss: 29.259830
epoch 8, step 520, loss: 24.914564
epoch 8, step 521, loss: 110.241150
epoch 8, step 5

Now we can score our predictions to see what our accuracy is.

Because this is binary classification on an imbalanced dataset, a real analysis should go beyond accuracy and also report precision, recall, and the confusion matrix.
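As a rough sketch of what such an analysis involves, the confusion-matrix counts, precision, and recall can be computed by hand. The labels and predictions below are hypothetical toy values, not results from this model.

```python
import numpy as np

# Hypothetical labels and predictions, for illustration only.
y_true = np.array([0, 0, 0, 0, 0, 0, 1, 1, 1, 0])
y_hat = np.array([0, 0, 0, 0, 0, 1, 1, 0, 0, 0])

# Confusion-matrix entries.
tp = int(np.sum((y_true == 1) & (y_hat == 1)))  # true positives
fp = int(np.sum((y_true == 0) & (y_hat == 1)))  # false positives
fn = int(np.sum((y_true == 1) & (y_hat == 0)))  # false negatives
tn = int(np.sum((y_true == 0) & (y_hat == 0)))  # true negatives

# Precision: of the predicted positives, how many are correct.
precision = tp / (tp + fp) if (tp + fp) else 0.0
# Recall: of the actual positives, how many were found.
recall = tp / (tp + fn) if (tp + fn) else 0.0
accuracy = (tp + tn) / len(y_true)

print("TN=%d FP=%d FN=%d TP=%d" % (tn, fp, fn, tp))
print("precision=%.3f recall=%.3f accuracy=%.3f" % (precision, recall, accuracy))
```

In this toy case accuracy is 0.7 even though only one of the three positives is found. In practice, `sklearn.metrics` provides `confusion_matrix`, `precision_score`, and `recall_score` for the same quantities.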

In [11]:
score = accuracy_score(valid_y, valid_y_pred)
print("Unweighted Classification Accuracy: %f" % score)

weighted_score = accuracy_score(valid_y, valid_y_pred, sample_weight=valid_w)
print("Weighted Classification Accuracy: %f" % weighted_score)

Unweighted Classification Accuracy: 0.931034
Weighted Classification Accuracy: 0.644231


We get a high unweighted classification accuracy but a much lower weighted classification accuracy, which is a warning sign. We already know the dataset is imbalanced, so let's count the examples in each class.

In [16]:
train_y_series = pd.Series(train_y)

In [17]:
train_y_series.value_counts()

0.0    6023
1.0     241
dtype: int64

There are far more instances of class 0 (6023) than of class 1 (241).

This means that a model that always guessed 0 would reach about 96% accuracy (6023 of 6264). That figure is misleading, which is why we cannot judge our model on accuracy alone.
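The always-0 baseline can be checked numerically from the class counts above. The sketch below also reports balanced accuracy (the mean of the per-class recalls), which is a swapped-in metric for illustration and not the weighted accuracy computed earlier with the dataset's own weights.

```python
import numpy as np

# Class counts taken from the value_counts() output above.
n_neg, n_pos = 6023, 241
y_true = np.concatenate([np.zeros(n_neg), np.ones(n_pos)])
y_always_zero = np.zeros_like(y_true)

# Plain accuracy of a model that predicts 0 for every compound.
accuracy = (y_true == y_always_zero).mean()

# Recall per class: the fraction of each class the baseline gets right.
recall_neg = (y_always_zero[y_true == 0] == 0).mean()
recall_pos = (y_always_zero[y_true == 1] == 1).mean()
balanced_accuracy = (recall_neg + recall_pos) / 2

print("accuracy: %.4f" % accuracy)
print("balanced accuracy: %.4f" % balanced_accuracy)
```

The baseline scores about 0.96 on plain accuracy but only 0.5 once each class counts equally, which is no better than chance.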

You can view the logged training loss in TensorBoard with:

```bash
tensorboard --logdir /tmp/fcnet-tox21&
```