
## Installing DeepChem

DeepChem is a Python-based, open-source deep learning framework. It offers a feature-rich toolchain that democratizes the use of deep learning in drug discovery, materials science, quantum chemistry, and biology.

In [1]:
!curl -Lo conda_installer.py https://raw.githubusercontent.com/deepchem/deepchem/master/scripts/colab_install.py
import conda_installer
conda_installer.install()
!/root/miniconda/bin/conda info -e


add /root/miniconda/lib/python3.7/site-packages to PYTHONPATH
python version: 3.7.10
fetching installer from https://repo.continuum.io/miniconda/Miniconda3-latest-Linux-x86_64.sh
done
installing miniconda to /root/miniconda
done
installing rdkit, openmm, pdbfixer
added conda-forge to channels
added omnia to channels
done
conda packages installation finished!


# conda environments:
#
base                  *  /root/miniconda



In [2]:
!pip install --pre deepchem
!pip install dgl
!pip install dgllife

Collecting deepchem
  Downloading https://files.pythonhosted.org/packages/04/7f/3f678587e621d1b904ed6d1af65e353b1d681d6b9f4ffaf243c79745c654/deepchem-2.6.0.dev20210403043508-py3-none-any.whl (552kB)

We can now import the `deepchem` package and check which version was installed.

In [3]:
import deepchem as dc
dc.__version__

'2.6.0.dev'

## Using the MolGraphConv featurizer

This section loads the Tox21 dataset with a graph featurizer so that a baseline can be implemented and recorded.

In [4]:
featurizer = dc.feat.MolGraphConvFeaturizer()
tasks, datasets, transformers = dc.molnet.load_tox21(featurizer=featurizer)
train_dataset, valid_dataset, test_dataset = datasets
print(train_dataset)

Failed to featurize datapoint 95, [I-].[K+]. Appending empty array
Exception message: zero-size array to reduction operation maximum which has no identity
Failed to featurize datapoint 255, [Hg+2]. Appending empty array
Exception message: zero-size array to reduction operation maximum which has no identity
Failed to featurize datapoint 659, [Ba+2]. Appending empty array
Exception message: zero-size array to reduction operation maximum which has no identity
Failed to featurize datapoint 985, [TlH2+]. Appending empty array
Exception message: zero-size array to reduction operation maximum which has no identity
Failed to featurize datapoint 1423, [Cr+3]. Appending empty array
Exception message: zero-size array to reduction operation maximum which has no identity
Failed to featurize datapoint 1534, [Fe+2]. Appending empty array
Exception message: zero-size array to reduction operation maximum which has no identity
Failed to featurize datapoint 1722, [Co+2]. Appending empty array
Exception m

<DiskDataset X.shape: (6249,), y.shape: (6249, 12), w.shape: (6249, 12), task_names: ['NR-AR' 'NR-AR-LBD' 'NR-AhR' ... 'SR-HSE' 'SR-MMP' 'SR-p53']>


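`MolGraphConvFeaturizer` is a general-purpose featurizer for graph convolution networks on molecules. It represents each molecule as a graph, with atoms as nodes and bonds as edges. The datapoints that fail in the log above are bare ions and salts such as `[Hg+2]` and `[I-].[K+]`, which contain no bonds. Each of them is replaced by an empty array.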
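To make the graph representation concrete, here is a minimal hand-written sketch of a molecular graph. It is not DeepChem's implementation: the helper `to_edge_index` and the variable names are hypothetical, and only NumPy is used. It also shows one plausible reason for the featurization warnings above, on the assumption (not confirmed by the source) that the featurizer takes a maximum over the edge array: a molecule with no bonds has an empty edge array, and NumPy refuses to take the maximum of an empty array.

```python
import numpy as np

# Hand-written graph for ethanol (SMILES "CCO"), heavy atoms only.
# Nodes are atoms; edges are bonds, given as pairs of atom indices.
atoms = ["C", "C", "O"]
bonds = [(0, 1), (1, 2)]


def to_edge_index(bonds):
    """Return a (2, num_edges) integer array listing every bond in both directions."""
    src = [i for i, j in bonds] + [j for i, j in bonds]
    dst = [j for i, j in bonds] + [i for i, j in bonds]
    return np.array([src, dst], dtype=int)


edge_index = to_edge_index(bonds)
print(edge_index.shape)  # (2, 4): two bonds, each stored in both directions

# A bare ion such as [Hg+2] has a single atom and no bonds at all.
empty_edge_index = to_edge_index([])
print(empty_edge_index.shape)  # (2, 0)

# Taking the maximum of an empty array raises a ValueError in NumPy.
try:
    np.max(empty_edge_index)
    error_message = None
except ValueError as exc:
    error_message = str(exc)
print(error_message)
```

The message printed on the last line has the same wording as the "Exception message" lines in the featurizer log.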

In [5]:
tasks

['NR-AR',
 'NR-AR-LBD',
 'NR-AhR',
 'NR-Aromatase',
 'NR-ER',
 'NR-ER-LBD',
 'NR-PPAR-gamma',
 'SR-ARE',
 'SR-ATAD5',
 'SR-HSE',
 'SR-MMP',
 'SR-p53']

The Tox21 dataset has 12 tasks, listed above. Each task corresponds to a different toxicity target: the `NR-` tasks are nuclear receptors and the `SR-` tasks are stress response pathways.
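As a small illustration, the task names can be grouped by their prefix to see how the 12 tasks divide between the two families (`NR` for nuclear receptor, `SR` for stress response). The `tasks` list below is copied from the output above so that the snippet runs on its own.

```python
from collections import defaultdict

tasks = ['NR-AR', 'NR-AR-LBD', 'NR-AhR', 'NR-Aromatase', 'NR-ER', 'NR-ER-LBD',
         'NR-PPAR-gamma', 'SR-ARE', 'SR-ATAD5', 'SR-HSE', 'SR-MMP', 'SR-p53']

# Group the task names by the part before the first hyphen.
groups = defaultdict(list)
for name in tasks:
    prefix = name.split('-', 1)[0]
    groups[prefix].append(name)

for prefix, names in groups.items():
    print(prefix, len(names), names)
```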

In [6]:
print(datasets[0])

<DiskDataset X.shape: (6249,), y.shape: (6249, 12), w.shape: (6249, 12), task_names: ['NR-AR' 'NR-AR-LBD' 'NR-AhR' ... 'SR-HSE' 'SR-MMP' 'SR-p53']>


There are three dataset objects: the train split, the validation split, and the test split. Each split holds `X` (the features), `y` (the output labels), and `w` (a weight for every sample and task).

For example, the train split has `X.shape` (6249,) and `y.shape` (6249, 12). This means that the train split contains 6249 samples, each with 12 labels, one per task.
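The sketch below uses small synthetic arrays, not the real Tox21 data, to show how `y` and `w` arrays of matching shape can be read together. It assumes that a weight of 0 marks a (sample, task) pair whose label should be ignored, which is how multitask datasets with missing measurements are commonly handled. All values are made up for illustration.

```python
import numpy as np

# Toy stand-in for one split: 4 samples and 3 tasks
# (the real train split is 6249 samples by 12 tasks).
y = np.array([[0, 1, 0],
              [1, 0, 0],
              [0, 0, 1],
              [0, 0, 0]], dtype=float)

# Assumption: w == 0 means "no usable label for this sample and task".
w = np.array([[1, 0, 0],
              [1, 0, 1],
              [1, 1, 1],
              [0, 1, 1]], dtype=float)

n_samples, n_tasks = y.shape

# Count, per task, how many samples carry a usable label,
# and how many of those labels are positive.
labeled_per_task = (w > 0).sum(axis=0)
positives_per_task = ((y == 1) & (w > 0)).sum(axis=0)

print(n_samples, n_tasks)    # 4 3
print(labeled_per_task)      # [3 2 3]
print(positives_per_task)    # [1 0 1]
```

On the real data, the same two reductions applied to `train_dataset.y` and `train_dataset.w` would give per-task counts for the 12 Tox21 tasks.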

## Training the GCN

In [7]:
from deepchem.models import GCNModel
model = GCNModel(mode='classification', n_tasks=len(tasks))
model.fit(train_dataset, nb_epoch=50)

DGL backend not selected or invalid.  Assuming PyTorch for now.
Using backend: pytorch


Setting the default backend to "pytorch". You can change it in the ~/.dgl/config.json file or export the DGLBACKEND environment variable.  Valid options are: pytorch, mxnet, tensorflow (all lowercase)


0.4685784149169922

The GCN model is a variant of the GraphConv model. The main differences are:
1. It uses a different method of computing graph-level representations.
2. Its learnable weight is shared across all nodes.
3. It differs in minor ways in its use of dropout, skip connections, and batch normalization.
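To illustrate what a weight shared across all nodes means, here is a minimal NumPy sketch of a single graph convolution. It assumes the commonly cited propagation rule ReLU(D^-1/2 (A + I) D^-1/2 H W); whether `GCNModel` uses exactly this normalization internally is not stated in this notebook. The function `gcn_layer` and all variable names are hypothetical.

```python
import numpy as np


def gcn_layer(adjacency, node_features, weight):
    """One graph convolution: ReLU(D^-1/2 (A + I) D^-1/2 H W)."""
    a_hat = adjacency + np.eye(adjacency.shape[0])  # add self-loops
    degree = a_hat.sum(axis=1)                      # degree of each node, self-loop included
    d_inv_sqrt = np.diag(1.0 / np.sqrt(degree))
    propagated = d_inv_sqrt @ a_hat @ d_inv_sqrt @ node_features
    # The same weight matrix multiplies every node's aggregated features.
    return np.maximum(propagated @ weight, 0.0)


# Path graph with 3 nodes (0 - 1 - 2), 4 input features per node.
adjacency = np.array([[0, 1, 0],
                      [1, 0, 1],
                      [0, 1, 0]], dtype=float)
rng = np.random.default_rng(0)
node_features = rng.normal(size=(3, 4))
weight = rng.normal(size=(4, 2))  # a single matrix, regardless of node or degree

hidden = gcn_layer(adjacency, node_features, weight)
print(hidden.shape)  # (3, 2)
```

Because `weight` has shape (input features, output features) and no node dimension, the number of parameters does not depend on the size of the molecule.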


In [8]:
import numpy as np

metric = dc.metrics.Metric(dc.metrics.roc_auc_score)
print('training set score:', model.evaluate(train_dataset, [metric], transformers))
print('test set score:', model.evaluate(test_dataset, [metric], transformers))

training set score: {'roc_auc_score': 0.9337739933216821}
test set score: {'roc_auc_score': 0.7116885043270934}


The training set score (about 0.93) is much higher than the test set score (about 0.71). This indicates overfitting, and it is why metrics on the validation set need to be measured in order to tune hyperparameters and detect overfitting.
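For readers unfamiliar with the metric, the sketch below computes ROC-AUC for a single task from scratch. It uses the pairwise definition: the fraction of (positive, negative) pairs in which the positive sample gets the higher score, with ties counted as one half. The function `roc_auc` and the toy data are hypothetical, and this is not how DeepChem or scikit-learn implement the metric internally.

```python
def roc_auc(labels, scores):
    """ROC-AUC as the fraction of positive/negative pairs ranked correctly."""
    positives = [s for l, s in zip(labels, scores) if l == 1]
    negatives = [s for l, s in zip(labels, scores) if l == 0]
    if not positives or not negatives:
        raise ValueError("ROC-AUC needs at least one positive and one negative label")
    wins = 0.0
    for p in positives:
        for n in negatives:
            if p > n:
                wins += 1.0   # positive ranked above negative
            elif p == n:
                wins += 0.5   # a tie counts as half
    return wins / (len(positives) * len(negatives))


labels = [0, 0, 1, 1]
scores = [0.1, 0.4, 0.35, 0.8]
auc = roc_auc(labels, scores)
print(auc)  # 0.75
```

A score of 0.5 corresponds to random ranking and 1.0 to perfect ranking. In this notebook, a validation score can be obtained by passing `valid_dataset` to the same `model.evaluate` call used above for the train and test splits.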

In [9]:
print(test_dataset.ids[0])

Cc1cc(C(C)(C)C)c(O)c(C)c1Cn1c(=O)n(Cc2c(C)cc(C(C)(C)C)c(O)c2C)c(=O)n(Cc2c(C)cc(C(C)(C)C)c(O)c2C)c1=O


In [10]:
print(test_dataset.y[0])
model.predict(test_dataset)[0]

[0. 0. 0. 1. 0. 0. 0. 0. 0. 0. 1. 0.]


array([[0.19085124, 0.8091488 ],
       [0.99550456, 0.00449542],
       [0.5820026 , 0.41799742],
       [0.64814484, 0.3518552 ],
       [0.5408875 , 0.45911252],
       [0.56046486, 0.4395351 ],
       [0.94069886, 0.05930116],
       [0.40026724, 0.59973276],
       [0.9904799 , 0.0095201 ],
       [0.49269363, 0.50730634],
       [0.05511509, 0.9448849 ],
       [0.68707335, 0.31292665]], dtype=float32)
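The prediction above has one row per task and two columns. The sketch below reads it under the assumption that column 0 is the probability of class 0 and column 1 the probability of class 1. It then thresholds at 0.5 and compares the result with the label vector printed above. Both arrays are copied from the outputs so that the snippet runs on its own. The comparison is only illustrative, since it ignores the weights in `w`.

```python
import numpy as np

labels = np.array([0., 0., 0., 1., 0., 0., 0., 0., 0., 0., 1., 0.])
probs = np.array([[0.19085124, 0.8091488],
                  [0.99550456, 0.00449542],
                  [0.5820026, 0.41799742],
                  [0.64814484, 0.3518552],
                  [0.5408875, 0.45911252],
                  [0.56046486, 0.4395351],
                  [0.94069886, 0.05930116],
                  [0.40026724, 0.59973276],
                  [0.9904799, 0.0095201],
                  [0.49269363, 0.50730634],
                  [0.05511509, 0.9448849],
                  [0.68707335, 0.31292665]], dtype=np.float32)

# Assumption: column 1 holds the probability of the positive class.
p_positive = probs[:, 1]
predicted = (p_positive >= 0.5).astype(int)

# Compare the thresholded predictions with the labels, task by task.
agree = predicted == labels.astype(int)
print(predicted)         # [1 0 0 0 0 0 0 1 0 1 1 0]
print(int(agree.sum()))  # 8
```

For this molecule, 8 of the 12 thresholded predictions match the labels.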