<a href="https://colab.research.google.com/github/nihal-rao/deepchem/blob/master/baselines/Tox21_GCN.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

## Installing DeepChem

DeepChem is a Python-based, open-source deep learning framework. It offers a feature-rich toolchain that democratizes the use of deep learning in drug discovery, materials science, quantum chemistry, and biology.

In [1]:
!curl -Lo conda_installer.py https://raw.githubusercontent.com/deepchem/deepchem/master/scripts/colab_install.py
import conda_installer
conda_installer.install()
!/root/miniconda/bin/conda info -e

  % Total    % Received % Xferd  Average Speed   Time    Time     Time  Current
                                 Dload  Upload   Total   Spent    Left  Speed
100  3501  100  3501    0     0  16671      0 --:--:-- --:--:-- --:--:-- 16671


add /root/miniconda/lib/python3.7/site-packages to PYTHONPATH
python version: 3.7.10
fetching installer from https://repo.continuum.io/miniconda/Miniconda3-latest-Linux-x86_64.sh
done
installing miniconda to /root/miniconda
done
installing rdkit, openmm, pdbfixer
added conda-forge to channels
added omnia to channels
done
conda packages installation finished!


# conda environments:
#
base                  *  /root/miniconda



In [2]:
!pip install --pre deepchem
!pip install dgl
!pip install dgllife

Collecting deepchem
  Downloading https://files.pythonhosted.org/packages/1f/b7/b2f36388bdd60420d2f6923076a30c57ca08557b7fa6b63e720440188c13/deepchem-2.6.0.dev20210323214627-py3-none-any.whl (552kB)

We can now import the `deepchem` package to play with.

In [3]:
import deepchem as dc
dc.__version__

'2.6.0.dev'

## Using the GraphConv featurizer

This section implements and records a baseline for the Tox21 dataset.

In [4]:
featurizer = dc.feat.MolGraphConvFeaturizer()
tasks, datasets, transformers = dc.molnet.load_tox21(featurizer=featurizer)
train_dataset, valid_dataset, test_dataset = datasets
print(train_dataset)

Failed to featurize datapoint 95, [I-].[K+]. Appending empty array
Exception message: zero-size array to reduction operation maximum which has no identity
Failed to featurize datapoint 255, [Hg+2]. Appending empty array
Exception message: zero-size array to reduction operation maximum which has no identity
Failed to featurize datapoint 659, [Ba+2]. Appending empty array
Exception message: zero-size array to reduction operation maximum which has no identity
Failed to featurize datapoint 985, [TlH2+]. Appending empty array
Exception message: zero-size array to reduction operation maximum which has no identity
Failed to featurize datapoint 1423, [Cr+3]. Appending empty array
Exception message: zero-size array to reduction operation maximum which has no identity
Failed to featurize datapoint 1534, [Fe+2]. Appending empty array
Exception message: zero-size array to reduction operation maximum which has no identity
Failed to featurize datapoint 1722, [Co+2]. Appending empty array
Exception m

<DiskDataset X.shape: (6249,), y.shape: (6249, 12), w.shape: (6249, 12), task_names: ['NR-AR' 'NR-AR-LBD' 'NR-AhR' ... 'SR-HSE' 'SR-MMP' 'SR-p53']>


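Note that the cell above uses `MolGraphConvFeaturizer`, which represents each molecule as a graph of atoms and bonds, rather than the ECFP featurizer. For comparison, Extended Connectivity Fingerprints (ECFP), sometimes called "circular fingerprints", are a **fingerprinting** method. The ECFP algorithm begins by classifying atoms based only on their direct properties and bonds.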

Each unique pattern is a feature. For example, "carbon atom bonded to two hydrogens and two heavy atoms" would be a feature, and a particular element of the fingerprint is set to 1 for any molecule that contains that feature. It then iteratively identifies new features by looking at larger circular neighborhoods. 

One specific feature bonded to two other specific features becomes a higher level feature, and the corresponding element is set for any molecule that contains it. This continues for a fixed number of iterations, most often two.
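The iterative procedure described above can be sketched in plain Python. This is a toy illustration only, not the RDKit or DeepChem implementation: the function names (`circular_features`, `fold_to_bits`), the choice of element and degree as the initial atom properties, the SHA-256 hashing, and the hydrogen-free toy molecules are all assumptions made for this example.

```python
import hashlib


def stable_hash(value):
    """Deterministic integer hash (Python's built-in hash() is salted for strings)."""
    return int(hashlib.sha256(repr(value).encode("utf-8")).hexdigest(), 16)


def circular_features(atoms, bonds, radius=2):
    """Collect ECFP-style feature identifiers for a toy molecule.

    atoms: list of element symbols, e.g. ["C", "C", "O"]
    bonds: list of (i, j) atom-index pairs
    """
    neighbors = {i: [] for i in range(len(atoms))}
    for i, j in bonds:
        neighbors[i].append(j)
        neighbors[j].append(i)

    # Iteration 0: classify each atom by its own properties (element, degree).
    identifiers = [stable_hash((atoms[i], len(neighbors[i])))
                   for i in range(len(atoms))]
    features = set(identifiers)

    # Each further iteration combines an atom's identifier with those of its
    # neighbors, so the feature describes a larger circular neighborhood.
    for _ in range(radius):
        identifiers = [
            stable_hash((identifiers[i],
                         tuple(sorted(identifiers[j] for j in neighbors[i]))))
            for i in range(len(atoms))
        ]
        features.update(identifiers)
    return features


def fold_to_bits(features, n_bits=1024):
    """Set one bit per feature; distinct features may collide on the same bit."""
    bits = [0] * n_bits
    for feature in features:
        bits[feature % n_bits] = 1
    return bits


# Toy heavy-atom skeletons (hydrogens omitted for brevity).
ethanol = (["C", "C", "O"], [(0, 1), (1, 2)])
ethanol_reordered = (["O", "C", "C"], [(0, 1), (1, 2)])
propane = (["C", "C", "C"], [(0, 1), (1, 2)])

fp_ethanol = fold_to_bits(circular_features(*ethanol))
fp_reordered = fold_to_bits(circular_features(*ethanol_reordered))
fp_propane = fold_to_bits(circular_features(*propane))

print("ethanol bits set:", sum(fp_ethanol))
print("same fingerprint after reordering atoms:", fp_ethanol == fp_reordered)
```

Because neighbor identifiers are sorted and features are collected in a set, the result does not depend on the order in which atoms are listed. The default `radius=2` mirrors the "most often two" iterations mentioned above.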

In [12]:
tasks

['NR-AR',
 'NR-AR-LBD',
 'NR-AhR',
 'NR-Aromatase',
 'NR-ER',
 'NR-ER-LBD',
 'NR-PPAR-gamma',
 'SR-ARE',
 'SR-ATAD5',
 'SR-HSE',
 'SR-MMP',
 'SR-p53']

Above are the tasks in the Tox21 dataset. There are 12 tasks, each corresponding to a different toxicity target: nuclear receptors (the `NR-` prefix) and stress response pathways (the `SR-` prefix).

In [8]:
print(datasets[0])

<DiskDataset X.shape: (6249,), y.shape: (6249, 12), w.shape: (6249, 12), task_names: ['NR-AR' 'NR-AR-LBD' 'NR-AhR' ... 'SR-HSE' 'SR-MMP' 'SR-p53']>


There are three dataset objects: the train, validation, and test splits. Each split holds `X` (the features), `y` (the output labels), and `w` (the per-label weights shown in the output above).

For example, the train split has X.shape (6249,) and y.shape (6249, 12). This implies that there are 6249 samples in the train split, each with one label per task.
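To make the `(n_samples, n_tasks)` layout of the label and weight arrays concrete, here is a small sketch on made-up data. The values are invented for illustration, the helper name `labelled_counts` is hypothetical, and the sketch assumes that a weight of 0 marks a label that was not measured for that sample.

```python
# Toy stand-ins for one split: 4 samples, 3 tasks (values are made up).
y = [[1, 0, 0],
     [0, 0, 1],
     [0, 1, 0],
     [1, 1, 0]]

# Assumption: a weight of 0 means the label is missing for that sample/task.
w = [[1.0, 1.0, 0.0],
     [1.0, 0.0, 1.0],
     [1.0, 1.0, 1.0],
     [0.0, 1.0, 1.0]]


def labelled_counts(y, w):
    """Return (number of labelled samples, number of positives) per task."""
    n_tasks = len(y[0])
    counts = []
    for task in range(n_tasks):
        labelled = sum(1 for w_row in w if w_row[task] > 0)
        positives = sum(1 for y_row, w_row in zip(y, w)
                        if w_row[task] > 0 and y_row[task] == 1)
        counts.append((labelled, positives))
    return counts


for task, (labelled, positives) in enumerate(labelled_counts(y, w)):
    print(f"task {task}: {labelled} labelled samples, {positives} positive")
```

On the real data, the same kind of inspection could start from `train_dataset.y` and `train_dataset.w`, which are NumPy arrays with the shapes printed above.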

## Training the GCN

In [9]:
from deepchem.models import GCNModel
model = GCNModel(mode='classification', n_tasks=len(tasks))
model.fit(train_dataset, nb_epoch=50)

DGL backend not selected or invalid.  Assuming PyTorch for now.
Using backend: pytorch


Setting the default backend to "pytorch". You can change it in the ~/.dgl/config.json file or export the DGLBACKEND environment variable.  Valid options are: pytorch, mxnet, tensorflow (all lowercase)
0.4454317855834961


`GCNModel` is a graph convolutional network. DeepChem builds it on DGL and DGL-LifeSci, which is why those packages were installed above. Each input is a molecular graph produced by `MolGraphConvFeaturizer`, and the model produces predictions for 12 different tasks.

Note that the above network performs multitask learning: a single network is used for all 12 tasks. Because inter-task correlations exist in the data, one shared neural network can take advantage of them across tasks.
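To give a feel for what a graph convolution does with the molecular graphs fed to `GCNModel`, here is a minimal plain-Python sketch. It is a simplified neighbor-averaging variant written for illustration, not the DGL-LifeSci implementation that `GCNModel` actually uses. The function names, the one-hot atom features, and the identity weight matrix are all assumptions.

```python
def graph_conv_layer(node_feats, edges, weight):
    """One simplified graph convolution.

    Each atom's features are averaged with those of its bonded neighbors
    (including itself), passed through a linear map, then through ReLU.
    """
    n_nodes, in_dim, out_dim = len(node_feats), len(weight), len(weight[0])

    # Every atom starts with a self-loop, then gains its bonded neighbors.
    neighborhood = {i: [i] for i in range(n_nodes)}
    for i, j in edges:
        neighborhood[i].append(j)
        neighborhood[j].append(i)

    updated = []
    for i in range(n_nodes):
        members = neighborhood[i]
        mean = [sum(node_feats[j][k] for j in members) / len(members)
                for k in range(in_dim)]
        updated.append([
            max(0.0, sum(mean[k] * weight[k][o] for k in range(in_dim)))
            for o in range(out_dim)
        ])
    return updated


def sum_readout(node_feats):
    """Pool per-atom features into one fixed-size vector for the molecule."""
    return [sum(column) for column in zip(*node_feats)]


# Toy molecule C-C-O with one-hot features [is_carbon, is_oxygen].
node_feats = [[1.0, 0.0], [1.0, 0.0], [0.0, 1.0]]
edges = [(0, 1), (1, 2)]
identity = [[1.0, 0.0], [0.0, 1.0]]  # stand-in for learned weights

hidden = graph_conv_layer(node_feats, edges, identity)
graph_vector = sum_readout(hidden)
print("per-atom features:", hidden)
print("molecule-level vector:", graph_vector)
```

In a multitask setting, a final layer would map the molecule-level vector to one output per task (12 here), so all tasks share the same graph representation.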

In [10]:
import numpy as np

metric = dc.metrics.Metric(dc.metrics.roc_auc_score)
print('training set score:', model.evaluate(train_dataset, [metric], transformers))
print('test set score:', model.evaluate(test_dataset, [metric], transformers))

training set score: {'roc_auc_score': 0.9363091418658644}
test set score: {'roc_auc_score': 0.7183378974396456}


The training set score is much higher than the test set score. This indicates overfitting, and it is why metrics on the validation set need to be measured in order to tune parameters and detect overfitting.
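The ROC AUC metric reported above can be read as the probability that a randomly chosen positive sample receives a higher score than a randomly chosen negative one. The sketch below computes it directly from that definition for a single task. It is a plain-Python illustration on made-up labels and scores, not the scikit-learn routine that DeepChem wraps, and the function name `roc_auc` is hypothetical.

```python
def roc_auc(y_true, y_score):
    """Pairwise ROC AUC for one binary task; tied scores count as half a win."""
    positives = [s for t, s in zip(y_true, y_score) if t == 1]
    negatives = [s for t, s in zip(y_true, y_score) if t == 0]
    if not positives or not negatives:
        raise ValueError("ROC AUC needs at least one positive and one negative")

    wins = 0.0
    for p in positives:
        for n in negatives:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5
    return wins / (len(positives) * len(negatives))


# Made-up labels and predicted scores for four samples.
labels = [0, 0, 1, 1]
scores = [0.1, 0.4, 0.35, 0.8]
print("toy ROC AUC:", roc_auc(labels, scores))

# Gap between the train and test scores reported in the output above.
train_score = 0.9363091418658644
test_score = 0.7183378974396456
print("train/test gap:", round(train_score - test_score, 3))
```

A score of 0.5 corresponds to random ranking and 1.0 to a perfect one. To measure the validation score with the trained model, the same call used above can be pointed at the validation split: `model.evaluate(valid_dataset, [metric], transformers)`.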