### CNN-based embedding (fingerprinting)

A CNN that embeds (fingerprints) molecules, trained on multi-label classification of pharmacologic-action MeSH terms.

The input to the network is a SMILES string, a text notation for a molecule, which is converted into a sequence of integers, each integer being the key of a symbol in a vocabulary. The sequence is then embedded into a dense vector representation, to which the convolution is applied.
The vocabulary was built according to the [openSMILES specification](http://opensmiles.org/opensmiles.html), so that the model can represent every possible SMILES string and not only those in the dataset used for training.
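
The conversion described above can be sketched in plain Python. This is only an illustration under stated assumptions: `toy_vocabulary` is a hypothetical five-symbol vocabulary (the real one is loaded from `smiles_vocabulary.pickle`), index 0 is assumed to be reserved for padding, and `encode` and `pad_post` are illustrative helpers that mimic the character-level lookup and the `pad_sequences(..., padding='post')` call used later in this notebook.

```python
# Hypothetical toy vocabulary, for illustration only.
# Index 0 is assumed to be reserved for padding.
toy_vocabulary = {'C': 1, '(': 2, ')': 3, '=': 4, 'O': 5}


def encode(smiles, vocabulary):
    """Map each character of a SMILES string to its integer key."""
    return [vocabulary[ch] for ch in smiles]


def pad_post(seqs, value=0):
    """Right-pad every sequence to the length of the longest one."""
    max_len = max(len(s) for s in seqs)
    return [s + [value] * (max_len - len(s)) for s in seqs]


encoded = [encode(s, toy_vocabulary) for s in ['CC(=O)O', 'CCO']]
padded = pad_post(encoded)
print(padded)  # [[1, 1, 2, 4, 5, 3, 5], [1, 1, 5, 0, 0, 0, 0]]
```

Padding on the right keeps the symbols of every molecule aligned at the start of the sequence, so all inputs share one fixed length that the embedding layer can accept.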

The output of the penultimate layer of the network is the fingerprint, a real-valued vector of length 512.

In [1]:
import os
import sys
parent_path = os.path.abspath(os.path.join('..'))
if parent_path not in sys.path:
    sys.path.append(parent_path)
    
import pickle
import tensorflow as tf

from keras.preprocessing.sequence import pad_sequences
from keras.models import Sequential, Model, load_model
from keras.layers import Convolution1D, MaxPooling1D, Dropout, Dense, Flatten
from keras.layers.embeddings import Embedding
from keras import backend as K

from preprocess.data_handler import load_data, load_pickle, categorical_labels

Using TensorFlow backend.


Load the dataset, a dictionary for converting terms to categorical labels, and the vocabulary for turning SMILES into sequences of integers.

In [2]:
DATA_LOC = '../data/'
termdict = load_pickle(DATA_LOC+'termdict.pickle')
vocabulary = load_pickle(DATA_LOC+'smiles_vocabulary.pickle')
dataset = load_data(DATA_LOC+'dataset.csv')
smiles = dataset['SMILES']

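# Character-level encoding: one vocabulary key per SMILES character
seqs = [[vocabulary[c] for c in s] for s in smiles]
# Right-pad with zeros to the length of the longest sequence
X = pad_sequences(seqs, padding='post')
y = categorical_labels(dataset['Terms'], termdict)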

sequence_length = X.shape[1]
vocabulary_size = len(vocabulary)
n_class = y.shape[1]
embedding_size = 64

# Model
model = Sequential()
model.add(Embedding(output_dim=embedding_size, input_dim=vocabulary_size,
                    input_length=sequence_length))
model.add(Convolution1D(32, 3, activation='relu'))
model.add(MaxPooling1D(pool_size=2))
model.add(Convolution1D(32, 3, activation='relu'))
model.add(MaxPooling1D(pool_size=2))
model.add(Dropout(0.25))

model.add(Flatten())
model.add(Dense(512, activation='relu'))
model.add(Dropout(0.5))
model.add(Dense(n_class, activation='sigmoid'))
model.compile(loss='binary_crossentropy', optimizer='adam', metrics=['accuracy'])
model.summary()  # prints the summary itself and returns None

# Model trained on a Titan GPU:
# model.fit(X, y, epochs=100, batch_size=64, verbose=1)
# model.save('fp-embedder.h5')

_________________________________________________________________
Layer (type)                 Output Shape              Param #   
embedding_1 (Embedding)      (None, 1021, 64)          4544      
_________________________________________________________________
conv1d_1 (Conv1D)            (None, 1019, 32)          6176      
_________________________________________________________________
max_pooling1d_1 (MaxPooling1 (None, 509, 32)           0         
_________________________________________________________________
conv1d_2 (Conv1D)            (None, 507, 32)           3104      
_________________________________________________________________
max_pooling1d_2 (MaxPooling1 (None, 253, 32)           0         
_________________________________________________________________
dropout_1 (Dropout)          (None, 253, 32)           0         
_________________________________________________________________
flatten_1 (Flatten)          (None, 8096)              0         
__________
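
The output shapes and parameter counts in the summary above can be checked by hand. The sketch below assumes the Keras defaults used in the model definition: unpadded ("valid") convolutions with stride 1, and max pooling whose stride equals its pool size. The helper names are illustrative, and the vocabulary size of 71 is inferred from the embedding parameter count (4544 / 64), not read from the vocabulary file.

```python
def conv1d_out_len(length, kernel_size):
    """Output length of an unpadded, stride-1 1D convolution."""
    return length - kernel_size + 1


def pool1d_out_len(length, pool_size):
    """Output length of unpadded max pooling with stride == pool_size."""
    return (length - pool_size) // pool_size + 1


def conv1d_params(kernel_size, in_channels, filters):
    """Weights plus one bias per filter."""
    return kernel_size * in_channels * filters + filters


# Sequence length after each convolution / pooling layer
lengths = []
length = 1021                       # padded sequence length from the summary
for layer, size in [('conv', 3), ('pool', 2), ('conv', 3), ('pool', 2)]:
    if layer == 'conv':
        length = conv1d_out_len(length, size)
    else:
        length = pool1d_out_len(length, size)
    lengths.append(length)

flattened = lengths[-1] * 32        # 32 filters in the last convolution

embedding_params = 71 * 64          # inferred vocabulary size * embedding size
conv1_params = conv1d_params(3, 64, 32)
conv2_params = conv1d_params(3, 32, 32)

print(lengths)                      # [1019, 509, 507, 253]
print(flattened)                    # 8096
print(embedding_params, conv1_params, conv2_params)  # 4544 6176 3104
```

These values match the `Output Shape` and `Param #` columns shown for the layers in the truncated summary.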

Input representation and output fingerprint for a single SMILES string.

In [7]:
fp_model = load_model('../analysis/fp-embedder.h5')
# layers[0] is the Embedding layer, so this sub-model returns the embedded sequence
seq_embedder = Model(inputs=fp_model.input, outputs=fp_model.layers[0].output)
emb = seq_embedder.predict(X[0:1], batch_size=1000)

print('SMILES: ', smiles[0])
print('Sequence: ', X[0])
print('Embedded sequence:\n', emb[0])
# print('Fingerprint: ', fp[0])

SMILES:  CC(=O)OC(CC(=O)[O-])C[N+](C)(C)C
Sequence:  [26 26  4 ...,  0  0  0]
Embedded sequence:
 [[ 0.03640511  0.11953979  0.13228315 ..., -0.24824536  0.31792849
  -0.29289258]
 [ 0.03640511  0.11953979  0.13228315 ..., -0.24824536  0.31792849
  -0.29289258]
 [ 0.08647189  0.10602581  0.02729583 ...,  0.11304798  0.19153979
   0.28203148]
 ..., 
 [ 0.06509708 -0.01844917 -0.06520764 ...,  0.03294779 -0.05305507
   0.01718416]
 [ 0.06509708 -0.01844917 -0.06520764 ...,  0.03294779 -0.05305507
   0.01718416]
 [ 0.06509708 -0.01844917 -0.06520764 ...,  0.03294779 -0.05305507
   0.01718416]]


In [8]:
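# In the architecture defined above, layers[-2] is the Dropout that follows Dense(512).
# Dropout is inactive at prediction time, so this returns the 512 ReLU activations.
fp_embedder = Model(inputs=fp_model.input, outputs=fp_model.layers[-2].output)
fp = fp_embedder.predict(X[0:1], batch_size=1000)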

print('Fingerprint:\n', fp[0])

Fingerprint:
 [ 0.          0.          0.          0.28545275  0.          0.
  1.20412922  0.          0.          0.          0.10057554  0.          0.1966182
  3.41473317  0.          0.          1.57863247  0.          0.
  0.43136638  0.          0.          1.05833948  0.          0.          0.
  0.          0.          0.          0.          0.          0.
  0.92761707  0.          0.          0.0114923   0.          0.          0.
  0.          0.          0.          0.          0.          0.          0.
  0.          0.          0.          0.          0.          0.          0.
  0.          2.04378629  1.98045826  0.          0.          0.          0.
  0.          0.          0.          0.42110276  0.          0.          0.
  0.          0.          0.          0.          0.          0.          0.
  0.          0.          1.47133362  0.          0.          0.          0.
  0.          0.          2.79303241  0.          0.29305971  0.          0.
  0.          