First CNN for SMILES multilabel classification.

I slightly changed the approach from the one I mentioned to you (the one followed in the "Learning to SMILES" paper), because Keras handles text input in a different way.

Here's the code with the explanation.

In [1]:
import os
import sys
import numpy as np
import tensorflow as tf

from sklearn.model_selection import train_test_split

from keras.preprocessing.text import Tokenizer
from keras.preprocessing.sequence import pad_sequences
from keras.models import Sequential
from keras.layers import Convolution1D, MaxPooling1D, GlobalMaxPooling1D, Dropout, Dense
from keras.layers.embeddings import Embedding
from keras.optimizers import SGD
from keras import backend as K

# By default TensorFlow allocates memory on all the available GPUs, even if it runs only on the selected
# one. To avoid this, only the selected GPU (chosen via command-line argument) is made visible
gpu = str(sys.argv[1])
os.environ["CUDA_VISIBLE_DEVICES"] = gpu
# For allocating memory gradually as it is needed 
config = tf.ConfigProto()
config.gpu_options.allow_growth=True
sess = tf.Session(config=config)
K.set_session(sess)

Using TensorFlow backend.


Load the previously transformed and saved data: a list of SMILES strings and a label matrix. Each label is a 213-dimensional vector with 1 at the indices of the associated terms.

In [2]:
DATA_LOC = '../data/'
smiles = np.load(DATA_LOC+'smiles.npy')
y = np.load(DATA_LOC+'multi_labels.npy')

Text transformation: each string is tokenized at char level. The fit_on_texts method learns a vocabulary that maps each char to an integer index (starting from 1), then the strings are transformed to sequences of integers and padded with 0. Note that pad_sequences pads at the beginning by default; pass padding='post' to pad at the end instead.
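
To make the transformation concrete, here is a minimal NumPy sketch of char-level indexing and zero padding. It is not the Keras implementation: the helper names (`build_char_index`, `encode`, `pad`) and the toy SMILES are made up for illustration, and indices are assigned in sorted character order, whereas the Keras Tokenizer orders them by frequency.

```python
import numpy as np

def build_char_index(texts):
    """Map each distinct character to an integer >= 1 (0 is kept for padding)."""
    chars = sorted({ch for text in texts for ch in text})
    return {ch: i + 1 for i, ch in enumerate(chars)}

def encode(texts, char_index):
    """Turn each string into a list of integer indices."""
    return [[char_index[ch] for ch in text] for text in texts]

def pad(seqs, padding="pre"):
    """Zero-pad every sequence to the length of the longest one."""
    maxlen = max(len(s) for s in seqs)
    out = np.zeros((len(seqs), maxlen), dtype=np.int64)
    for row, s in enumerate(seqs):
        if padding == "pre":
            out[row, maxlen - len(s):] = s   # zeros first, tokens last
        else:
            out[row, :len(s)] = s            # tokens first, zeros last
    return out

# Hypothetical toy inputs, not taken from smiles.npy.
toy_smiles = ["CCO", "c1ccccc1", "C=O"]
char_index = build_char_index(toy_smiles)
X_toy = pad(encode(toy_smiles, char_index))
print(char_index)
print(X_toy)
```

Because 0 never appears in `char_index`, the padding value cannot be confused with a real character.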

The data are then split into training and test sets. This is just for a first evaluation of the model; a more accurate evaluation should be performed using k-fold CV.

In [3]:
t = Tokenizer(filters='', lower=False, char_level=True)
t.fit_on_texts(smiles)
seqs = t.texts_to_sequences(smiles)
X = pad_sequences(seqs)

# Split in train and test
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2)
print('Number of training examples: ', X_train.shape[0])
print('Number of test examples: ', X_test.shape[0])
print('Multi-label classification, number of labels: ', y_train.shape[1])

Number of training examples:  8268
Number of test examples:  2068
Multi-label classification, number of labels:  213


In [4]:
# Model
sequence_length = X.shape[1]
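# +1 because Tokenizer indices start at 1 and index 0 is used for padding,
# so the Embedding layer must accept indices from 0 to len(t.word_index)
vocabulary_size = len(t.word_index) + 1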
n_class = y_train.shape[1]
embedding_size = 32

model = Sequential()

From what I've seen, this is how integer sequences representing text are usually handled in Keras. The embedding layer basically turns each positive integer into a dense vector of fixed size.
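
Conceptually the lookup is just row indexing into a weight table. The NumPy sketch below illustrates that idea under made-up sizes (`vocab_size_toy`, `embedding_dim_toy`) and random weights; in Keras the table is a trainable parameter of the layer.

```python
import numpy as np

rng = np.random.default_rng(0)

vocab_size_toy = 5       # number of distinct characters (hypothetical)
embedding_dim_toy = 4    # size of each dense vector (hypothetical)

# One row per integer index. Row 0 belongs to the padding index,
# so the table needs vocab_size_toy + 1 rows.
table = rng.normal(size=(vocab_size_toy + 1, embedding_dim_toy))

# A batch of two padded integer sequences of length 6.
batch = np.array([[0, 0, 0, 3, 3, 4],
                  [5, 1, 5, 5, 5, 1]])

# Each integer is replaced by the corresponding row of the table.
embedded = table[batch]
print(embedded.shape)  # (2, 6, 4)
```

The output has one extra axis: (batch, sequence length, embedding size), which is the shape the 1D convolution then slides over.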

In [5]:
model.add(Embedding(output_dim=embedding_size, input_dim=vocabulary_size, input_length=sequence_length))
model.add(Convolution1D(64, 3, activation='relu'))
model.add(MaxPooling1D(pool_size=2))
model.add(Dropout(0.25))
model.add(Convolution1D(64, 3, activation='relu'))

At this point I tried both MaxPooling1D and GlobalMaxPooling1D, with the same result.

In [6]:
model.add(GlobalMaxPooling1D())
model.add(Dropout(0.25))

The output of this Dense layer could serve as our fingerprint, so I set its size to 1024 (a usual size for fingerprints).

In [None]:
model.add(Dense(1024, activation='relu'))
model.add(Dropout(0.25))

From what I have understood reading about this, in multi-label classification problems the top layer should have a sigmoid activation instead of softmax, in order to predict the probability of each node (each label) independently.

Binary cross-entropy and the SGD optimizer are also recommended for this type of problem.
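
The difference between the two activations can be checked on a toy example. This is a NumPy sketch with made-up logits and labels, not output from the model: softmax forces the scores to sum to 1, so two labels that are both present have to share the probability mass, while sigmoid scores each label on its own.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def softmax(z):
    e = np.exp(z - z.max(axis=-1, keepdims=True))  # shift for numerical stability
    return e / e.sum(axis=-1, keepdims=True)

def binary_crossentropy(y_true, p, eps=1e-7):
    """Mean of the per-label binary cross-entropy terms."""
    p = np.clip(p, eps, 1.0 - eps)
    return -np.mean(y_true * np.log(p) + (1 - y_true) * np.log(1 - p))

# Hypothetical logits for one molecule with 4 labels, 2 of them active.
logits = np.array([[3.0, 2.5, -4.0, -3.0]])
y_true = np.array([[1, 1, 0, 0]])

p_sig = sigmoid(logits)
p_soft = softmax(logits)

print("sigmoid:", p_sig.round(3), "sum =", p_sig.sum().round(3))
print("softmax:", p_soft.round(3), "sum =", p_soft.sum().round(3))
print("BCE with sigmoid:", binary_crossentropy(y_true, p_sig))
print("BCE with softmax:", binary_crossentropy(y_true, p_soft))
```

With sigmoid both active labels can get a score close to 1 at the same time, which softmax cannot express.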

In [None]:
model.add(Dense(n_class, activation='sigmoid'))
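model.summary()  # summary() prints by itself and returns None

sgd = SGD(lr=0.01, decay=1e-6, momentum=0.9, nesterov=True)
# Pass the configured optimizer object; the string 'sgd' would use default settings instead
model.compile(loss='binary_crossentropy', optimizer=sgd, metrics=['accuracy'])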
model.fit(X_train, y_train, epochs=10, batch_size=32, verbose=1)
score = model.evaluate(X_test, y_test, batch_size=32)
print('Test score: ', model.metrics_names, score)
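
The accuracy value deserves some care with multi-label targets. With a binary cross-entropy loss, Keras most likely computes 'accuracy' element-wise over all label slots, so when most entries of the label matrix are 0 a model can score very high while finding none of the true labels. The sketch below shows the effect on synthetic data; the 1% label density is an arbitrary assumption, since the real density of multi_labels.npy is not stated here.

```python
import numpy as np

rng = np.random.default_rng(0)

n_samples, n_labels = 1000, 213
# Hypothetical sparse label matrix: about 1% of the entries are 1.
y_toy = (rng.random((n_samples, n_labels)) < 0.01).astype(int)

# A useless "model" that predicts 0 for every label.
y_pred = np.zeros_like(y_toy)

# Element-wise accuracy over all label slots.
elementwise_acc = (y_pred == y_toy).mean()

# Micro-averaged recall: fraction of the true labels that were recovered.
true_pos = np.logical_and(y_pred == 1, y_toy == 1).sum()
recall = true_pos / y_toy.sum()

print("element-wise accuracy:", elementwise_acc)
print("micro recall:", recall)
```

If the real labels are similarly sparse, per-label precision/recall or F1 on the test set would be more informative than accuracy.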

Some considerations:
I tried changing some parameters, for example the number of filters in the conv layers, the embedding size, or using Max Pooling instead of Global Max Pooling, but I always get the same high accuracy value (always 0.99), which makes me suspect overfitting. I don't know if there's something wrong in the network architecture, but if it is ok I'd like to ask your opinion on these points:

 The maximum SMILES length is 1021, which means that all the sequences are padded to this length. I don't think this is good, since the average length of the strings is about 60 and there are also very short strings. I don't know whether it would be better to discard SMILES longer than a certain length or just truncate them (but I suppose truncation would lead to duplicated SMILES again).
 
 The approach followed in the paper, where they say they don't use padding, seems impossible to implement in Keras. They only say that they handle it by fixing the number of pooling units and determining the size of the pooling window dynamically, based on the sequence length. This is not very clear to me, but in any case I don't think it's possible to handle inputs of different lengths, since the Embedding and Convolutional layers seem to require a fixed input size.
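
One possible reading of "fixed number of pooling units, window size derived from the sequence length" is sketched below in NumPy. This is an interpretation, not the paper's implementation; `dynamic_max_pool` and `n_units` are hypothetical names, and the input stands for the (steps, filters) output of a conv layer.

```python
import numpy as np

def dynamic_max_pool(features, n_units):
    """Max-pool a (steps, filters) array into exactly n_units rows.

    The time axis is cut into n_units nearly equal windows, so the window
    size depends on the input length while the output shape does not.
    """
    steps = features.shape[0]
    if steps < n_units:
        raise ValueError("sequence is shorter than the number of pooling units")
    chunks = np.array_split(features, n_units, axis=0)
    return np.stack([chunk.max(axis=0) for chunk in chunks])

rng = np.random.default_rng(0)
short_seq = rng.normal(size=(10, 64))     # short molecule, 64 filters
long_seq = rng.normal(size=(1021, 64))    # longest molecule, 64 filters

print(dynamic_max_pool(short_seq, 4).shape)   # (4, 64)
print(dynamic_max_pool(long_seq, 4).shape)    # (4, 64)
```

With `n_units=1` this reduces to global max pooling, which is what GlobalMaxPooling1D already does in the model above. It may also be worth verifying, against the installed Keras version, whether leaving input_length unset in the Embedding layer lets the time axis vary from batch to batch when a global pooling layer follows the convolutions.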