
## Installing DeepChem

DeepChem is a Python-based, open-source deep learning framework. It offers a feature-rich toolchain that democratizes the use of deep learning in drug discovery, materials science, quantum chemistry, and biology.

In [1]:
!curl -Lo conda_installer.py https://raw.githubusercontent.com/deepchem/deepchem/master/scripts/colab_install.py
import conda_installer
conda_installer.install()
!/root/miniconda/bin/conda info -e

add /root/miniconda/lib/python3.7/site-packages to PYTHONPATH
python version: 3.7.10
fetching installer from https://repo.continuum.io/miniconda/Miniconda3-latest-Linux-x86_64.sh
done
installing miniconda to /root/miniconda
done
installing rdkit, openmm, pdbfixer
added conda-forge to channels
added omnia to channels
done
conda packages installation finished!


# conda environments:
#
base                  *  /root/miniconda



In [2]:
!pip install --pre deepchem

Collecting deepchem
  Downloading https://files.pythonhosted.org/packages/04/7f/3f678587e621d1b904ed6d1af65e353b1d681d6b9f4ffaf243c79745c654/deepchem-2.6.0.dev20210403043508-py3-none-any.whl (552kB)

We can now import the `deepchem` package and check its version.

In [3]:
import deepchem as dc
dc.__version__

'2.6.0.dev'

## Baseline - Graph Convolutions

This section implements and records a graph convolution baseline for the Tox21 dataset.

In [4]:
tasks, datasets, transformers = dc.molnet.load_tox21(featurizer='GraphConv')
train_dataset, valid_dataset, test_dataset = datasets
print(train_dataset)

<DiskDataset X.shape: (6264,), y.shape: (6264, 12), w.shape: (6264, 12), task_names: ['NR-AR' 'NR-AR-LBD' 'NR-AhR' ... 'SR-HSE' 'SR-MMP' 'SR-p53']>


The graph convolution model is similar to a recurrent neural network in that the set of descriptors for each atom is updated at every iteration based on the descriptors of its neighbours. The final layer is a fully connected layer that predicts the outputs for all tasks in a multitask setting.

Graph convolutions start with an initial set of descriptors for each atom, then combine and recombine them over successive convolutional layers.
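The neighbour-based update described above can be illustrated with a small pure-Python sketch. This is a deliberately simplified toy, not DeepChem's layer implementation: it uses one made-up scalar descriptor per atom, a plain sum over neighbours, and no learned weights or nonlinearities. The graph, the values, and the function names `aggregate_step` and `readout` are all hypothetical.

```python
# Toy molecular graph: 4 atoms in a chain 0-1-2-3, stored as an adjacency list.
neighbours = {0: [1], 1: [0, 2], 2: [1, 3], 3: [2]}

# One made-up scalar descriptor per atom (real featurizers use vectors).
descriptors = {0: 1.0, 1: 2.0, 2: 3.0, 3: 4.0}


def aggregate_step(descriptors, neighbours):
    """Return new descriptors: each atom's value plus the sum over its neighbours."""
    return {
        atom: value + sum(descriptors[n] for n in neighbours[atom])
        for atom, value in descriptors.items()
    }


def readout(descriptors):
    """Sum the atom descriptors into a single molecule-level value."""
    return sum(descriptors.values())


# Two rounds of updates: after round k, each atom has seen atoms up to k bonds away.
after_one = aggregate_step(descriptors, neighbours)
after_two = aggregate_step(after_one, neighbours)

print(after_one)
print(after_two)
print(readout(after_two))
```

In a trained model, each step would also apply learned weight matrices and a nonlinearity, and the molecule-level vector would feed the fully connected output layer.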

In [5]:
tasks

['NR-AR',
 'NR-AR-LBD',
 'NR-AhR',
 'NR-Aromatase',
 'NR-ER',
 'NR-ER-LBD',
 'NR-PPAR-gamma',
 'SR-ARE',
 'SR-ATAD5',
 'SR-HSE',
 'SR-MMP',
 'SR-p53']

Above are the tasks in the Tox21 dataset. There are 12 tasks, each corresponding to a different toxicity target, such as nuclear receptors (the `NR-` tasks) and stress response pathways (the `SR-` tasks).

In [6]:
datasets[0]

<DiskDataset X.shape: (6264,), y.shape: (6264, 12), w.shape: (6264, 12), task_names: ['NR-AR' 'NR-AR-LBD' 'NR-AhR' ... 'SR-HSE' 'SR-MMP' 'SR-p53']>

There are three dataset objects: the training, validation, and test splits. Each split contains features `X` and output labels `y`.

For example, the training split has `X.shape` (6264,) and `y.shape` (6264, 12), so it holds 6264 samples, each with one label per task. `X` is one-dimensional because each entry is a single featurized molecule object rather than a fixed-length vector.

## Training a Graph Convolution Model

In [7]:
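import warnings
warnings.filterwarnings("ignore")  # silence library warnings during training

n_tasks = len(tasks)  # one output per Tox21 task (12 in total)
model = dc.models.GraphConvModel(n_tasks, mode='classification')
model.fit(train_dataset, nb_epoch=50)  # the value shown below is the loss returned by fit()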

0.2773268127441406

The GraphConv method is based on Duvenaud et al. As described above, it updates each atom's descriptors from those of its neighbours at every layer, and a final fully connected layer predicts all tasks jointly in a multitask setting.


In [8]:
import numpy as np

metric = dc.metrics.Metric(dc.metrics.roc_auc_score)
print('training set score:', model.evaluate(train_dataset, [metric], transformers))
print('test set score:', model.evaluate(test_dataset, [metric], transformers))

training set score: {'roc_auc_score': 0.9725301575583809}
test set score: {'roc_auc_score': 0.7081477950456065}


The training set score is much higher than the test set score, which indicates overfitting. This is why metrics should also be measured on the validation set: to detect overfitting and to tune hyperparameters.
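For reference, the ROC AUC reported above can be read as the probability that a randomly chosen positive sample receives a higher score than a randomly chosen negative one. The sketch below implements that pairwise definition in pure Python for a single task. It is meant only to illustrate the metric and is not how DeepChem or scikit-learn compute it. The function name `roc_auc` and the toy labels and scores are made up for illustration.

```python
def roc_auc(labels, scores):
    """Pairwise ROC AUC: fraction of (positive, negative) pairs ranked correctly.

    Ties count as half a correct pair. Runs in O(n_pos * n_neg) time,
    which is fine for small illustrative inputs.
    """
    pos = [s for y, s in zip(labels, scores) if y == 1]
    neg = [s for y, s in zip(labels, scores) if y == 0]
    if not pos or not neg:
        raise ValueError("ROC AUC needs at least one positive and one negative sample")
    wins = 0.0
    for p in pos:
        for n in neg:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5
    return wins / (len(pos) * len(neg))


# Toy, made-up data for a single task (not taken from Tox21).
toy_labels = [0, 0, 1, 1]
toy_scores = [0.1, 0.4, 0.35, 0.8]
toy_auc = roc_auc(toy_labels, toy_scores)
print(toy_auc)  # 3 of the 4 positive/negative pairs are ordered correctly: 0.75
```

A score of 0.5 corresponds to random ranking and 1.0 to a perfect ranking, which helps put the gap between the training score (about 0.97) and the test score (about 0.71) in context.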

In [16]:
print(test_dataset.ids[0])

CC1(C)S[C@@H]2[C@H](NC(=O)Cc3ccccc3)C(=O)N2[C@H]1C(=O)O.CC1(C)S[C@@H]2[C@H](NC(=O)Cc3ccccc3)C(=O)N2[C@H]1C(=O)O.c1ccc(CNCCNCc2ccccc2)cc1


In [17]:
print(test_dataset.y[0])
model.predict(test_dataset)[0]

[0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0.]


array([[1.9777729e-01, 8.0222273e-01],
       [9.0129119e-01, 9.8708786e-02],
       [8.1991351e-01, 1.8008654e-01],
       [9.3443751e-01, 6.5562524e-02],
       [8.3893728e-01, 1.6106269e-01],
       [9.9980229e-01, 1.9778100e-04],
       [9.9659061e-01, 3.4094481e-03],
       [7.5596905e-01, 2.4403101e-01],
       [9.9961579e-01, 3.8413037e-04],
       [9.9699223e-01, 3.0077701e-03],
       [9.5980692e-01, 4.0193070e-02],
       [9.9769384e-01, 2.3061496e-03]], dtype=float32)
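
To read the prediction above: each row corresponds to one task, in the order of `tasks`, and the two columns are taken here to be the predicted probabilities of class 0 (inactive) and class 1 (active). The following sketch uses probabilities rounded from that output. It picks the more likely class for each task and lists the tasks where the prediction disagrees with the true label. The variable names are illustrative only.

```python
task_names = [
    'NR-AR', 'NR-AR-LBD', 'NR-AhR', 'NR-Aromatase', 'NR-ER', 'NR-ER-LBD',
    'NR-PPAR-gamma', 'SR-ARE', 'SR-ATAD5', 'SR-HSE', 'SR-MMP', 'SR-p53',
]

# True labels of the first test molecule (all zeros in the output above).
true_labels = [0] * 12

# Predicted probabilities, rounded from the output above.
# Assumed column order: [P(class 0), P(class 1)].
probs = [
    [0.1978, 0.8022],
    [0.9013, 0.0987],
    [0.8199, 0.1801],
    [0.9344, 0.0656],
    [0.8389, 0.1611],
    [0.9998, 0.0002],
    [0.9966, 0.0034],
    [0.7560, 0.2440],
    [0.9996, 0.0004],
    [0.9970, 0.0030],
    [0.9598, 0.0402],
    [0.9977, 0.0023],
]

# Pick the more probable class for each task.
predicted_labels = [1 if p_active > p_inactive else 0 for p_inactive, p_active in probs]

# Collect the tasks where the prediction differs from the true label.
mismatches = [
    name
    for name, truth, pred in zip(task_names, true_labels, predicted_labels)
    if truth != pred
]

print(predicted_labels)
print(mismatches)
```

For this molecule, the model is wrong only on `NR-AR`, where it assigns about 0.80 probability to the active class although the true label is 0.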