## Build a QSAR model with PyTorch and RDKit #RDKit


There are many deep learning frameworks for Python, for example Chainer, Keras, Theano, TensorFlow and PyTorch.
I have already tried Keras, Chainer and TensorFlow for QSAR modeling, so this time I tried building a QSAR model with PyTorch and RDKit.
Like Chainer, PyTorch builds dynamic neural networks in a “Define-by-Run” style.

Let’s start coding.
First, I imported the packages needed for QSAR and defined some utility functions.

In [1]:
import pprint
import torch
import torch.optim as optim
from torch import nn as nn
import torch.nn.functional as F
from torch.autograd import Variable
 
from rdkit import Chem
from rdkit.Chem import AllChem
from rdkit.Chem import DataStructs
import numpy as np
#from sklearn import preprocessing

In [2]:
traindata = [mol for mol in Chem.SDMolSupplier("solubility.train.sdf") if mol is not None]
testdata = [mol for mol in Chem.SDMolSupplier("solubility.test.sdf") if mol is not None]
 
def molsfeaturizer(mols):
    fps = []
    for mol in mols:
        arr = np.zeros((0,))
        fp = AllChem.GetMorganFingerprintAsBitVect(mol, 2)
        DataStructs.ConvertToNumpyArray(fp, arr)
        fps.append(arr)
    fps = np.array(fps, dtype=np.float64)  # np.float was removed in NumPy 1.24
    return fps
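To make the featurizer concrete, here is a minimal sketch applied to a single molecule. The SMILES string (ethanol) is only an illustrative input and is not part of the solubility dataset. `nBits=2048` is written out explicitly to match the 2048 input features the network expects later.

```python
import numpy as np
from rdkit import Chem, DataStructs
from rdkit.Chem import AllChem

# Ethanol is an illustrative molecule, not taken from the solubility set.
mol = Chem.MolFromSmiles("CCO")

# Morgan fingerprint with radius 2, folded to 2048 bits.
fp = AllChem.GetMorganFingerprintAsBitVect(mol, 2, nBits=2048)

# Copy the bit vector into a NumPy array of 0/1 values.
arr = np.zeros((2048,), dtype=np.float64)
DataStructs.ConvertToNumpyArray(fp, arr)

print(arr.shape)       # one feature per bit
print(int(arr.sum()))  # number of bits that are set
```

Each molecule becomes one such row, and stacking the rows gives the feature matrix returned by `molsfeaturizer`.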

In [3]:
classes = {"(A) low":0, "(B) medium":1, "(C) high":2}
 
trainx = molsfeaturizer(traindata)
testx = molsfeaturizer(testdata)
# PyTorch's cross-entropy loss expects class labels as long (int64) values
trainy = np.array([classes[mol.GetProp("SOL_classification")] for mol in traindata], dtype=np.int64)
testy = np.array([classes[mol.GetProp("SOL_classification")] for mol in testdata], dtype=np.int64)
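The dtype matters because `F.cross_entropy` takes class indices as an int64 (long) tensor, while `nn.Linear` layers work on float32 inputs by default. Below is a small sketch with made-up arrays. The values are placeholders, not dataset entries.

```python
import numpy as np
import torch
import torch.nn.functional as F

# Placeholder data: 4 samples, 3 classes (not taken from the dataset).
labels = np.array([0, 2, 1, 0], dtype=np.int64)
scores = np.zeros((4, 3), dtype=np.float64)

y = torch.from_numpy(labels)          # dtype is preserved: torch.int64
x = torch.from_numpy(scores).float()  # float64 -> float32

loss = F.cross_entropy(x, y)
print(y.dtype, x.dtype, loss.item())
```

Passing float labels instead would raise an error in the loss function, which is why the label arrays are created with `dtype=np.int64`.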


The torch.from_numpy function converts a NumPy array to a torch tensor, which is very convenient.
Then I defined the neural network. This approach feels quite different to me because I mostly use Keras for deep learning.
To build a model in PyTorch, I need to define each layer and the overall structure (the forward pass).

In [4]:
X_train = torch.from_numpy(trainx)
X_test = torch.from_numpy(testx)
Y_train = torch.from_numpy(trainy)
Y_test = torch.from_numpy(testy)
print(X_train.size(),Y_train.size())
print(X_test.size(), Y_test.size())
 
class QSAR_mlp(nn.Module):
    def __init__(self):
        super(QSAR_mlp, self).__init__()
        self.fc1 = nn.Linear(2048, 524)
        self.fc2 = nn.Linear(524, 10)
        self.fc3 = nn.Linear(10, 10)
        self.fc4 = nn.Linear(10,3)
    def forward(self, x):
        x = x.view(-1, 2048)
        h1 = F.relu(self.fc1(x))
        h2 = F.relu(self.fc2(h1))
        h3 = F.relu(self.fc3(h2))
        output = torch.sigmoid(self.fc4(h3))  # F.sigmoid is deprecated; torch.sigmoid is equivalent
        return output

torch.Size([1024, 2048]) torch.Size([1024])
torch.Size([257, 2048]) torch.Size([257])
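One thing worth noting about the network above: `F.cross_entropy` applies log-softmax internally, so it is normally given raw scores (logits) rather than sigmoid outputs. The following is a sketch of a variant that returns logits. The class name `QSARMlpLogits` is hypothetical, and this is not the model that produced the results in this post.

```python
import torch
from torch import nn
import torch.nn.functional as F


class QSARMlpLogits(nn.Module):
    """Hypothetical variant of QSAR_mlp that returns raw logits."""

    def __init__(self):
        super().__init__()
        self.fc1 = nn.Linear(2048, 524)
        self.fc2 = nn.Linear(524, 10)
        self.fc3 = nn.Linear(10, 10)
        self.fc4 = nn.Linear(10, 3)

    def forward(self, x):
        x = x.view(-1, 2048)
        h1 = F.relu(self.fc1(x))
        h2 = F.relu(self.fc2(h1))
        h3 = F.relu(self.fc3(h2))
        # No sigmoid here: cross_entropy applies log-softmax itself.
        return self.fc4(h3)


model = QSARMlpLogits()
dummy = torch.zeros(5, 2048)      # placeholder batch of 5 fingerprints
logits = model(dummy)
probs = F.softmax(logits, dim=1)  # class probabilities, if needed
print(logits.shape, probs.sum(dim=1))
```

The predicted class is the same whether it is taken from the logits or from the softmax probabilities, because softmax preserves the ordering of the scores.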


After defining the model, I trained it and made predictions.
The following code covers the training and prediction steps.

In [5]:
model = QSAR_mlp()
print(model)
epochs = 150
 
losses = []
optimizer = optim.Adam( model.parameters(), lr=0.005)
for epoch in range(epochs):
    data, target = Variable(X_train).float(), Variable(Y_train).long()
    optimizer.zero_grad()
    y_pred = model(data)
    loss = F.cross_entropy(y_pred, target)
    losses.append(loss.item())  # record the training loss for each epoch
    #print("Loss: {}".format(loss.item()))
    loss.backward()
    optimizer.step()
 
pred_y = model(Variable(X_test).float())
predicted = torch.max(pred_y, 1)[1]  # index of the highest-scoring class for each molecule

for i in range(len(predicted)):
    print("pred:{}, target:{}".format(predicted[i].item(), Y_test[i].item()))

# fraction of test molecules whose predicted class matches the target
print("Accuracy: {}".format((predicted == Y_test).float().mean().item()))


QSAR_mlp(
  (fc1): Linear(in_features=2048, out_features=524, bias=True)
  (fc2): Linear(in_features=524, out_features=10, bias=True)
  (fc3): Linear(in_features=10, out_features=10, bias=True)
  (fc4): Linear(in_features=10, out_features=3, bias=True)
)




pred:2, target:0
pred:1, target:0
pred:0, target:1
pred:1, target:1
pred:0, target:1
pred:1, target:1
pred:0, target:0
pred:0, target:1
pred:1, target:0
pred:0, target:0
pred:1, target:0
pred:0, target:0
pred:0, target:0
pred:0, target:0
pred:0, target:0
pred:0, target:0
pred:0, target:0
pred:0, target:0
pred:0, target:0
pred:0, target:1
pred:0, target:0
pred:0, target:0
pred:2, target:1
pred:1, target:1
pred:1, target:1
pred:1, target:1
pred:1, target:1
pred:0, target:1
pred:1, target:1
pred:0, target:1
pred:1, target:1
pred:0, target:1
pred:0, target:0
pred:0, target:0
pred:1, target:1
pred:0, target:0
pred:1, target:0
pred:0, target:1
pred:0, target:0
pred:0, target:0
pred:0, target:0
pred:0, target:0
pred:1, target:1
pred:0, target:0
pred:0, target:0
pred:0, target:0
pred:0, target:0
pred:0, target:0
pred:0, target:0
pred:0, target:0
pred:0, target:0
pred:0, target:0
pred:0, target:0
pred:0, target:0
pred:0, target:0
pred:0, target:0
pred:2, target:2
pred:1, target:2
pred:2, target

Hmm, the accuracy is not very high. I believe there is still room for improvement.
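One way to see where the errors come from is a confusion matrix over the three solubility classes. This is a minimal sketch, assuming `predicted` and `target` are 1-D integer tensors like those produced above. The helper name `confusion_matrix` and the tensor values are placeholders for illustration, not the real predictions.

```python
import torch


def confusion_matrix(predicted, target, n_classes=3):
    """Rows are true classes, columns are predicted classes."""
    cm = torch.zeros(n_classes, n_classes, dtype=torch.int64)
    for t, p in zip(target.tolist(), predicted.tolist()):
        cm[t, p] += 1
    return cm


# Placeholder tensors for illustration only.
target = torch.tensor([0, 0, 1, 1, 2, 2])
predicted = torch.tensor([0, 1, 1, 1, 2, 0])

cm = confusion_matrix(predicted, target)
accuracy = cm.diag().sum().item() / cm.sum().item()
print(cm)
print("Accuracy: {:.3f}".format(accuracy))
```

Off-diagonal entries show which classes get confused with each other, which can be more informative than a single accuracy number.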