First CNN for SMILES multilabel classification.

I slightly changed the approach from the one I mentioned to you (the one followed in the "Learning to SMILES" paper), because Keras handles text input in a different way.

Here are the code and the explanation.

In [1]:
import os
import sys
import numpy as np
import tensorflow as tf

from sklearn.model_selection import train_test_split

from keras.preprocessing.text import Tokenizer
from keras.preprocessing.sequence import pad_sequences
from keras.models import Sequential
from keras.layers import Convolution1D, MaxPooling1D, GlobalMaxPooling1D, Dropout, Flatten, Dense
from keras.layers.embeddings import Embedding
from keras.optimizers import SGD
from keras import backend as K

# By default TensorFlow allocates memory on all the available GPUs, even if it runs only on the selected
# one. To avoid this, only the selected GPU (given as a command-line argument) is made visible
gpu = str(sys.argv[1])
os.environ["CUDA_VISIBLE_DEVICES"] = gpu
# Allocate GPU memory gradually, as it is needed
config = tf.ConfigProto()
config.gpu_options.allow_growth=True
sess = tf.Session(config=config)
K.set_session(sess)

Using TensorFlow backend.


Load the previously transformed and saved data: a list of SMILES strings and a label matrix. Each label is a 213-dimensional vector with 1 at the indices of the associated terms.

In [2]:
DATA_LOC = '../data/'
smiles = np.load(DATA_LOC+'smiles.npy')
y = np.load(DATA_LOC+'multi_labels.npy')

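Text transformation: each string is tokenized at character level. The fit_on_texts method learns a vocabulary that maps characters to integer indices (starting from 1); the strings are then transformed into sequences of integers and padded with 0 to a common length. Note that pad_sequences pads at the beginning of each sequence by default (padding='pre'); padding='post' has to be passed to pad at the end.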
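
To make the transformation concrete, here is a minimal NumPy-only sketch of what the tokenization and padding steps do on three toy SMILES strings. It only mimics the behaviour of the Keras utilities and is not Keras code: the helpers `build_char_index` and `encode_and_pad` are made up for this illustration, and indices are assigned in alphabetical order here, whereas the Keras Tokenizer orders them by character frequency.

```python
import numpy as np

def build_char_index(strings):
    """Map every distinct character to an integer index starting at 1 (0 is kept for padding)."""
    chars = sorted(set("".join(strings)))
    return {c: i + 1 for i, c in enumerate(chars)}

def encode_and_pad(strings, char_index):
    """Turn strings into integer sequences and pad them with 0 at the beginning."""
    max_len = max(len(s) for s in strings)
    X = np.zeros((len(strings), max_len), dtype=np.int32)
    for row, s in enumerate(strings):
        if s:  # X[row, -0:] would address the whole row, so empty strings are skipped
            X[row, -len(s):] = [char_index[c] for c in s]
    return X

toy_smiles = ["CCO", "c1ccccc1", "C=O"]   # toy examples, not taken from the dataset
char_index = build_char_index(toy_smiles)
X_toy = encode_and_pad(toy_smiles, char_index)
print(char_index)
print(X_toy)
```

Note that "C" and "c" get different indices, which is why lower=False matters: in SMILES lowercase letters denote aromatic atoms.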

The data are then split into training and test sets. This is just for a first evaluation of the model; a more accurate evaluation should be performed using k-fold CV.
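
As a rough idea of what the k-fold evaluation could look like, the sketch below generates the train/test indices for each fold with NumPy only. The helper `kfold_indices` is hypothetical (scikit-learn's KFold class offers the same functionality), and the model fitting is only indicated in the comments.

```python
import numpy as np

def kfold_indices(n_samples, n_splits=5, seed=0):
    """Yield (train_idx, test_idx) pairs; every sample lands in a test fold exactly once."""
    rng = np.random.RandomState(seed)
    folds = np.array_split(rng.permutation(n_samples), n_splits)
    for k in range(n_splits):
        test_idx = folds[k]
        train_idx = np.concatenate([folds[j] for j in range(n_splits) if j != k])
        yield train_idx, test_idx

# 10336 = 8268 + 2068, the total number of examples in this notebook
for fold, (train_idx, test_idx) in enumerate(kfold_indices(10336, n_splits=5)):
    # Here one would build a fresh model, fit it on X[train_idx], y[train_idx]
    # and evaluate it on X[test_idx], y[test_idx]
    print(fold, len(train_idx), len(test_idx))
```

The per-fold scores would then be averaged to get a more stable estimate than a single 80/20 split.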

In [3]:
t = Tokenizer(filters='', lower=False, char_level=True)
t.fit_on_texts(smiles)
seqs = t.texts_to_sequences(smiles)
X = pad_sequences(seqs)

# Split in train and test
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2)
print('Number of training examples: ', X_train.shape[0])
print('Number of test examples: ', X_test.shape[0])
print('Multi-label classification, number of labels: ', y_train.shape[1])

Number of training examples:  8268
Number of test examples:  2068
Multi-label classification, number of labels:  213


In [4]:
# Model
sequence_length = X.shape[1]
# +1 because the Tokenizer indices start at 1 and index 0 is used for padding
vocabulary_size = len(t.word_index) + 1
n_class = y_train.shape[1]
embedding_size = 32

model = Sequential()

From what I've seen, this is how integer sequences representing text are usually handled in Keras. The embedding layer basically turns the integer indices into dense vectors of fixed size.
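
Conceptually, the embedding layer is a lookup table with one row per integer index. The NumPy sketch below shows only the lookup: the toy vocabulary, the embedding size of 4 and the random weights are assumptions for illustration (in Keras the rows are trainable weights).

```python
import numpy as np

vocab = {"C": 1, "O": 2, "=": 3}      # toy char -> index map, indices start at 1
embedding_dim = 4                      # toy size for readability

rng = np.random.RandomState(0)
# One row per index 0..len(vocab); row 0 belongs to the padding value
emb_matrix = rng.uniform(-0.05, 0.05, size=(len(vocab) + 1, embedding_dim))

seq_batch = np.array([[0, 1, 1, 2],    # "CCO" padded to length 4
                      [0, 1, 3, 2]])   # "C=O" padded to length 4
embedded = emb_matrix[seq_batch]       # plain row lookup
print(embedded.shape)                  # (2, 4, 4): batch, sequence length, embedding size
```

Since index 0 is used for padding and the character indices start at 1, the table needs one more row than the number of distinct characters.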

In [5]:
model = Sequential()
model.add(Embedding(output_dim=embedding_size, input_dim=vocabulary_size,
                    input_length=sequence_length))
model.add(Convolution1D(32, 2, activation='relu'))
model.add(MaxPooling1D(pool_size=2))
model.add(Convolution1D(32, 3, activation='relu'))
model.add(MaxPooling1D(pool_size=3))
model.add(Flatten())
model.add(Dense(512, activation='relu'))
model.add(Dense(n_class, activation='sigmoid'))

From what I have understood reading about this, in multi-label classification problems the top layer should have a sigmoid activation instead of softmax, in order to predict the probability of each node (each label) independently.
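
A small numeric sketch of the difference, using made-up scores for three labels (the values and the helper names `sigmoid` and `softmax` are just for illustration):

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def softmax(z):
    e = np.exp(z - z.max())            # subtract the max for numerical stability
    return e / e.sum()

logits = np.array([3.0, 2.5, -4.0])    # toy scores: the first two labels are "active"
p_sigmoid = sigmoid(logits)
p_softmax = softmax(logits)
print(p_sigmoid, p_sigmoid.sum())      # each label scored on its own, sum is not 1
print(p_softmax, p_softmax.sum())      # labels compete, probabilities sum to 1
```

With sigmoid both active labels end up above 0.5, while with softmax they share a total mass of 1, so the second one falls below 0.5 even though its score is high.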

Binary cross-entropy and the SGD optimizer are also recommended for this type of problem, but I have noticed that the Adam optimizer outperformed SGD.

In [None]:
model.compile(loss='binary_crossentropy', optimizer='adam', metrics=['accuracy'])
model.summary()  # summary() prints by itself; wrapping it in print() also prints "None"
model.fit(X_train, y_train, epochs=100, batch_size=64, verbose=1)

The high accuracy value I got (0.99 with the standard "evaluate" method of the Keras model) was probably due to the sparse label vectors, according to some issue comments I found on GitHub: the high number of correctly predicted 0s pushes the accuracy up.
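
The effect can be reproduced with a toy example, assuming that the reported accuracy is computed element-wise over the whole label matrix (as the GitHub comments suggest). The label matrix below is synthetic, with exactly two active labels per example chosen at random, and the "model" never predicts any label.

```python
import numpy as np

rng = np.random.RandomState(0)
n_samples, n_labels = 1000, 213
# Synthetic sparse labels: exactly two active labels per example (an assumption)
y_true = np.zeros((n_samples, n_labels), dtype=int)
for row in y_true:
    row[rng.choice(n_labels, size=2, replace=False)] = 1

y_zero = np.zeros_like(y_true)         # a "model" that always predicts all zeros

elementwise_acc = (y_true == y_zero).mean()
exact_match_acc = (y_true == y_zero).all(axis=1).mean()
print(elementwise_acc)   # 1 - 2/213, about 0.9906
print(exact_match_acc)   # 0.0
```

So a useless predictor already reaches about 0.99 element-wise accuracy, while it never gets a complete label vector right.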

I found [this example](https://github.com/suraj-deshmukh/Multi-Label-Image-Classification/blob/master/miml.ipynb "Multi-Label-Image-Classification") where they use the following approach to get a more reliable accuracy measure. I'd like to know if you agree with this or not.

Basically, the predicted labels have to be obtained by thresholding the probabilities computed by the output sigmoid layer. This threshold can be fixed (e.g. 0.5) or adapted for each label, choosing the threshold that gives the highest MCC value. I got better accuracy values by tuning a different threshold for each label.
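
For a single label the threshold search can be sketched as follows. The data are made up, and `mcc` is a hand-written version of the Matthews correlation coefficient used only to keep the example self-contained (scikit-learn provides matthews_corrcoef).

```python
import numpy as np

def mcc(y_true, y_pred):
    """Matthews correlation coefficient for binary vectors; 0.0 when it is undefined."""
    tp = np.sum((y_true == 1) & (y_pred == 1))
    tn = np.sum((y_true == 0) & (y_pred == 0))
    fp = np.sum((y_true == 0) & (y_pred == 1))
    fn = np.sum((y_true == 1) & (y_pred == 0))
    denom = np.sqrt(float((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn)))
    return (tp * tn - fp * fn) / denom if denom > 0 else 0.0

# Toy label with few positives and low predicted probabilities
y_true = np.array([0, 0, 0, 0, 0, 0, 1, 1])
y_prob = np.array([0.05, 0.12, 0.08, 0.22, 0.15, 0.35, 0.32, 0.45])

thresholds = np.arange(0.1, 1, 0.1)
scores = [mcc(y_true, (y_prob >= t).astype(int)) for t in thresholds]
best = thresholds[int(np.argmax(scores))]
print(best, max(scores))
print(mcc(y_true, (y_prob >= 0.5).astype(int)))   # fixed threshold: no positive predicted
```

Here the fixed 0.5 threshold predicts no positives at all, while a lower threshold recovers both of them. Since the thresholds are fitted to the data they are scored on, it may be safer to choose them on a separate validation split and report the scores on the test set.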

In [None]:
from sklearn.metrics import matthews_corrcoef, hamming_loss

out = model.predict(X_test)
out = np.array(out, dtype=np.float32)

# # Thresholding probabilities at 0.5
# y_pred = np.zeros(out.shape)
# y_pred[np.where(out>=0.5)] = 1

# Thresholding probabilities, adapting the threshold for each label:
# for every label keep the threshold that gives the highest MCC
threshold = np.arange(0.1, 1, 0.1)
best_mcc = []
best_threshold = np.zeros(out.shape[1])

for i in range(out.shape[1]):
    y_prob = out[:, i]
    mcc = np.array([matthews_corrcoef(y_test[:, i], (y_prob >= j).astype(int))
                    for j in threshold])
    best_mcc.append(mcc.max())
    best_threshold[i] = threshold[mcc.argmax()]  # first threshold reaching the maximum

# Apply the per-label thresholds (broadcast over the rows)
y_pred = (out >= best_threshold).astype(int)
# Exact-match count: examples whose labels are all predicted correctly
total_correctly_predicted = int((y_test == y_pred).all(axis=1).sum())

print('hamming_loss: ', hamming_loss(y_test, y_pred))
print('Acc: ', str(total_correctly_predicted/y_test.shape[0]))
print('total_correctly_predicted: ', total_correctly_predicted)

hamming_loss:  0.006002488172101597

Acc:  0.3176982591876209

total_correctly_predicted:  657 (out of 2068 examples in test set)
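
The two numbers measure different things: the Hamming loss is the fraction of wrong labels over the whole label matrix, while "Acc" counts only the examples whose label vector is entirely correct. A toy sketch with made-up predictions shows the difference, and the last line converts the Hamming loss above into an average number of wrong labels per example.

```python
import numpy as np

y_true = np.array([[1, 0, 0, 0],
                   [0, 1, 0, 0],
                   [0, 0, 1, 1]])
y_pred = np.array([[1, 0, 0, 0],     # all labels right
                   [0, 1, 1, 0],     # one label wrong
                   [0, 0, 1, 0]])    # one label wrong

hamming = (y_true != y_pred).mean()                   # fraction of wrong labels
exact_match = (y_true == y_pred).all(axis=1).mean()   # fraction of fully correct rows
print(hamming, exact_match)

# Average number of wrong labels per example implied by the results above
wrong_per_example = 0.006002488172101597 * 213
print(wrong_per_example)             # about 1.28
```

With about 1.28 wrong labels per example on average, a low Hamming loss and an exact-match accuracy of about 0.32 are compatible with each other.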