### CNNFP and ECFP comparison on the ClinTox dataset

The ClinTox dataset contrasts drugs approved by the FDA with drugs that failed clinical trials for toxicity reasons. It defines two classification tasks for 1491 drug compounds with known chemical structures: (1) clinical trial toxicity (or absence of toxicity) and (2) FDA approval status. The CSV file loaded below contains 1484 rows.

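For both tasks, Logistic Regression and Random Forest classifiers were evaluated with 10-fold cross-validation (ROC AUC) on three inputs: CNNFP, ECFP, and a fingerprint obtained by concatenating the two. The cells below run the `CT_TOX` task; setting `class_field` to `FDA_APPROVED` should select the other task.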
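
Concatenating two fingerprints only makes sense when both feature sets describe the same molecules in the same order, and a featurizer may drop molecules it cannot parse. The following is a minimal sketch of that alignment step in plain Python, under the assumption that each molecule is identified by its SMILES string. The helper `concat_aligned` and the toy vectors are hypothetical and are not part of this notebook or of DeepChem.

```python
def concat_aligned(ids_a, feats_a, ids_b, feats_b):
    """Concatenate per-molecule feature vectors from two fingerprint sets.

    Only molecules present in both sets are kept; rows follow the order of ids_b.
    """
    # Map each identifier in set A to its feature vector (first occurrence wins).
    lookup = {}
    for mol_id, vec in zip(ids_a, feats_a):
        lookup.setdefault(mol_id, vec)

    kept_ids, combined = [], []
    for mol_id, vec_b in zip(ids_b, feats_b):
        if mol_id in lookup:
            kept_ids.append(mol_id)
            combined.append(list(lookup[mol_id]) + list(vec_b))
    return kept_ids, combined


# Toy data (made up): three SMILES with a 2-dimensional learned fingerprint,
# and a 3-bit fingerprint that is missing one molecule.
ids_cnn = ["CCO", "c1ccccc1", "CC(=O)O"]
fp_cnn = [[0.1, 0.9], [0.4, 0.2], [0.7, 0.3]]
ids_ecfp = ["CCO", "CC(=O)O"]
fp_ecfp = [[1, 0, 1], [0, 1, 1]]

kept_ids, fp_combo = concat_aligned(ids_cnn, fp_cnn, ids_ecfp, fp_ecfp)
print(kept_ids)   # ['CCO', 'CC(=O)O']
print(fp_combo)   # [[0.1, 0.9, 1, 0, 1], [0.7, 0.3, 0, 1, 1]]
```

Looking vectors up by identifier, rather than relying on position alone, keeps the rows aligned even when one featurizer drops molecules; the labels passed to the classifier must be filtered the same way.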

In [13]:
import os
import sys
parent_path = os.path.abspath(os.path.join('..'))
if parent_path not in sys.path:
    sys.path.append(parent_path)
import time
import numpy as np
import pickle

import deepchem as dc
from deepchem.utils.save import load_from_disk

import tensorflow as tf

from preprocess.smiles_embedder import get_cnn_fingerprint

from keras.models import load_model, Model
from keras.preprocessing.sequence import pad_sequences

from sklearn.model_selection import cross_val_score
from sklearn.linear_model import LogisticRegression
from sklearn.ensemble import RandomForestClassifier


dataset_file = "../comparison/clintox.csv"
dataset = load_from_disk(dataset_file)
pretty_columns = (
    "[" + ",".join(["'%s'" % column for column in dataset.columns.values]) + "]")
print("Columns of dataset: %s" % pretty_columns)
print("Number of examples in dataset: %s" % str(dataset.shape[0]))

smiles_field = 'smiles'
class_field = 'CT_TOX'

Columns of dataset: ['smiles','FDA_APPROVED','CT_TOX']
Number of examples in dataset: 1484


In [14]:
smiles = [m for m in dataset[smiles_field]]
labels = [c for c in dataset[class_field]]

# Embedding smiles
fps = get_cnn_fingerprint(smiles)

# 10-fold CV on CNN embedding
rf = RandomForestClassifier(n_estimators=500)
logreg = LogisticRegression()
auc_lr_cnn = cross_val_score(logreg, fps, labels, cv=10, scoring='roc_auc', n_jobs=-1)
auc_rf_cnn = cross_val_score(rf, fps, labels, cv=10, scoring='roc_auc', n_jobs=-1)

# ECFP embedding using deepchem utilities
featurizer_func = dc.feat.CircularFingerprint(size=512)
loader = dc.data.CSVLoader(tasks=[class_field], smiles_field=smiles_field, id_field=smiles_field,
                           featurizer=featurizer_func)
dataset = loader.featurize(dataset_file)

X = np.array(dataset.X)
y = np.array(dataset.y, dtype=np.int32)
y = y.reshape(y.shape[0],)

# 10-fold CV on ECFP embedding
rf = RandomForestClassifier(n_estimators=500)
logreg = LogisticRegression()
auc_lr = cross_val_score(logreg, X, y, cv=10, scoring='roc_auc', n_jobs=-1)
auc_rf = cross_val_score(rf, X, y, cv=10, scoring='roc_auc', n_jobs=-1)

# Results
print('\nLogReg, 10-fold CV on CNN fingerprint')
print("%.2f (+/- %.2f)" % (auc_lr_cnn.mean(), auc_lr_cnn.std()))
print('LogReg, 10-fold CV on ECFP fingerprint')
print("%.2f (+/- %.2f)" % (auc_lr.mean(), auc_lr.std()))

print('\nRF, 10-fold CV on CNN fingerprint')
print("%.2f (+/- %.2f)" % (auc_rf_cnn.mean(), auc_rf_cnn.std()))
print('RF, 10-fold CV on ECFP fingerprint')
print("%.2f (+/- %.2f)" % (auc_rf.mean(), auc_rf.std()))

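# Combine features, CNNFP + ECFP
# Keep only the molecules that survived ECFP featurization so rows line up with X
featurized_ids = set(dataset.ids)
valid_inds = [i for i, s in enumerate(smiles) if s in featurized_ids]

# Row-wise concatenation; raises if the two matrices have different row counts
fp_combo = np.concatenate((np.asarray(fps)[valid_inds], X), axis=1)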

auc_lr_combo = cross_val_score(logreg, fp_combo, y, cv=10, scoring='roc_auc', n_jobs=-1)
auc_rf_combo = cross_val_score(rf, fp_combo, y, cv=10, scoring='roc_auc', n_jobs=-1)

print('\nLogReg, 10-fold CV on combined fingerprint')
print("%.2f (+/- %.2f)" % (auc_lr_combo.mean(), auc_lr_combo.std()))
print('RF, 10-fold CV on combined fingerprint')
print("%.2f (+/- %.2f)" % (auc_rf_combo.mean(), auc_rf_combo.std()))

Embedding smiles...
Embedding complete - 1.1613032817840576 seconds, 1484 smiles
Loading raw samples now.
shard_size: 8192
About to start loading CSV from ../comparison/clintox.csv
Loading shard 1 of size 8192.
Featurizing sample 0
Featurizing sample 1000
TIMING: featurizing shard 0 took 2.391 s
TIMING: dataset construction took 2.461 s
Loading dataset from disk.

LogReg, 10-fold CV on CNN fingerprint
0.93 (+/- 0.05)
LogReg, 10-fold CV on ECFP fingerprint
0.72 (+/- 0.07)

RF, 10-fold CV on CNN fingerprint
0.94 (+/- 0.07)
RF, 10-fold CV on ECFP fingerprint
0.74 (+/- 0.11)

LogReg, 10-fold CV on combined fingerprint
0.95 (+/- 0.05)
RF, 10-fold CV on combined fingerprint
0.96 (+/- 0.05)
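
The values above are ROC AUC scores (`scoring='roc_auc'`), printed as the mean and standard deviation over the 10 folds. As a reminder of what the metric measures, here is a small sketch that computes ROC AUC as the fraction of (positive, negative) pairs that the classifier ranks correctly, then summarises per-fold scores in the same format. The function `roc_auc`, the toy labels and scores, and the fold values are illustrative assumptions, not data from this notebook; scikit-learn computes the metric differently internally but should agree on the result.

```python
from statistics import mean, pstdev


def roc_auc(labels, scores):
    """ROC AUC as the share of (positive, negative) pairs ranked correctly.

    Ties between a positive and a negative score count as half a correct pair.
    """
    pos = [s for y, s in zip(labels, scores) if y == 1]
    neg = [s for y, s in zip(labels, scores) if y == 0]
    if not pos or not neg:
        raise ValueError("ROC AUC needs at least one positive and one negative example")

    correct = 0.0
    for p in pos:
        for n in neg:
            if p > n:
                correct += 1.0
            elif p == n:
                correct += 0.5
    return correct / (len(pos) * len(neg))


# Toy example: one positive (0.35) is ranked below one negative (0.4).
toy_labels = [0, 0, 1, 1]
toy_scores = [0.1, 0.4, 0.35, 0.8]
toy_auc = roc_auc(toy_labels, toy_scores)
print(toy_auc)  # 0.75

# Hypothetical per-fold scores, summarised with the population standard
# deviation, which is what NumPy's .std() returns by default.
fold_scores = [0.75, 1.0, 0.5]
summary = "%.2f (+/- %.2f)" % (mean(fold_scores), pstdev(fold_scores))
print(summary)  # 0.75 (+/- 0.20)
```

A score of 0.5 corresponds to random ranking and 1.0 to a perfect separation of the two classes.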


Complete results, including comparisons on other datasets, are available [here](https://docs.google.com/spreadsheets/d/1Pc8MnpoEmvonWnoFyi4lWyqYSCn7aNviDb8evSpfPBw/edit?usp=sharing).