<a href="https://colab.research.google.com/github/nihal-rao/deepchem/blob/master/baselines/Lipo_Overview.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

## Installing DeepChem

DeepChem is a Python-based open source deep learning framework that offers a feature-rich toolchain, democratizing the use of deep learning in drug discovery, materials science, quantum chemistry, and biology.

In [1]:
!curl -Lo conda_installer.py https://raw.githubusercontent.com/deepchem/deepchem/master/scripts/colab_install.py
import conda_installer
conda_installer.install()
!/root/miniconda/bin/conda info -e

  % Total    % Received % Xferd  Average Speed   Time    Time     Time  Current
                                 Dload  Upload   Total   Spent    Left  Speed
100  3501  100  3501    0     0  21090      0 --:--:-- --:--:-- --:--:-- 20964


add /root/miniconda/lib/python3.7/site-packages to PYTHONPATH
python version: 3.7.10
fetching installer from https://repo.continuum.io/miniconda/Miniconda3-latest-Linux-x86_64.sh
done
installing miniconda to /root/miniconda
done
installing rdkit, openmm, pdbfixer
added conda-forge to channels
added omnia to channels
done
conda packages installation finished!


# conda environments:
#
base                  *  /root/miniconda



In [2]:
!pip install --pre deepchem

Collecting deepchem
  Downloading https://files.pythonhosted.org/packages/04/7f/3f678587e621d1b904ed6d1af65e353b1d681d6b9f4ffaf243c79745c654/deepchem-2.6.0.dev20210403043508-py3-none-any.whl (552kB)

We can now import the `deepchem` package to play with.

In [3]:
import deepchem as dc
dc.__version__

'2.6.0.dev'

## Baseline - Fingerprints + NN

Implementing and recording the baseline for the Lipophilicity (Lipo) dataset.

In [4]:
tasks, datasets, transformers = dc.molnet.load_lipo(featurizer='ECFP')
train_dataset, valid_dataset, test_dataset = datasets
print(train_dataset)

<DiskDataset X.shape: (3360, 1024), y.shape: (3360, 1), w.shape: (3360, 1), task_names: ['exp']>


The ECFP featurizer is used here. Extended Connectivity Fingerprints (ECFP) are a **fingerprinting** method, sometimes also called "circular fingerprints". The ECFP algorithm begins by classifying atoms based only on their direct properties and bonds. Each unique pattern is a feature.

For example, "carbon atom bonded to two hydrogens and two heavy atoms" would be a feature, and a particular element of the fingerprint is set to 1 for any molecule that contains that feature. It then iteratively identifies new features by looking at larger circular neighborhoods. 

One specific feature bonded to two other specific features becomes a higher level feature, and the corresponding element is set for any molecule that contains it. This continues for a fixed number of iterations, most often two.
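
The iterative idea described above can be illustrated with a toy sketch in pure Python. This is not the RDKit or DeepChem implementation: the function names (`stable_hash`, `toy_circular_fingerprint`), the atom description as `(element, hydrogen count)` pairs, and the hashing scheme are all assumptions made for illustration. It only shows how per-atom identifiers grow to cover larger neighborhoods and are then folded into a fixed-length bit vector.

```python
import hashlib


def stable_hash(value):
    """Deterministic integer hash of any value with a stable repr."""
    digest = hashlib.sha256(repr(value).encode()).hexdigest()
    return int(digest[:8], 16)


def toy_circular_fingerprint(atoms, bonds, n_bits=1024, n_iterations=2):
    """Toy circular fingerprint.

    atoms: list of (element, n_hydrogens) tuples for the heavy atoms.
    bonds: list of (i, j) index pairs between heavy atoms.
    """
    neighbors = {i: [] for i in range(len(atoms))}
    for i, j in bonds:
        neighbors[i].append(j)
        neighbors[j].append(i)

    # Iteration 0: each atom is described only by its own properties
    # (element, hydrogen count, number of heavy-atom neighbors).
    ids = [stable_hash((elem, n_h, len(neighbors[i])))
           for i, (elem, n_h) in enumerate(atoms)]
    features = set(ids)

    for _ in range(n_iterations):
        # Each new identifier combines an atom's current identifier with
        # those of its neighbors, so it covers a larger neighborhood.
        ids = [stable_hash((ids[i], tuple(sorted(ids[j] for j in neighbors[i]))))
               for i in range(len(atoms))]
        features.update(ids)

    # Fold every feature into a fixed-length bit vector.
    bits = [0] * n_bits
    for feature in features:
        bits[feature % n_bits] = 1
    return bits


# Ethanol (CH3-CH2-OH), heavy atoms only.
ethanol_atoms = [("C", 3), ("C", 2), ("O", 1)]
ethanol_bonds = [(0, 1), (1, 2)]
fp = toy_circular_fingerprint(ethanol_atoms, ethanol_bonds)
print(len(fp), sum(fp))
```

Because features are folded with a modulo, two different substructures can collide on the same bit, which is also true of real fixed-length fingerprints.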

In [5]:
tasks

['exp']

Above are the tasks in the Lipo dataset. The set contains a single task, `exp`, which measures the octanol/water distribution coefficient (logD) of each compound.

In [6]:
datasets[0]

<DiskDataset X.shape: (3360, 1024), y.shape: (3360, 1), w.shape: (3360, 1), task_names: ['exp']>

There are three dataset objects: the train, validation, and test splits. Each split holds a feature array X and a label array y, along with per-sample weights w.

X contains the features and y the output labels. For example, the train split has X.shape (3360, 1024) and y.shape (3360, 1). This means that there are 3360 samples in the train split, and each sample is represented by an ECFP vector of size 1024.

## Training a Model on Fingerprints

In [7]:
model = dc.models.MultitaskRegressor(n_tasks=1, n_features=1024, layer_sizes=[1000])

A `MultitaskRegressor` model provides many options for customizing the model: the number and widths of layers, the activation functions, regularization methods, etc.

It can optionally compose the model from pre-activation residual blocks rather than a simple stack of dense layers. This often leads to easier training, especially when using a large number of layers. The residual blocks can only be used when successive layers have the same width. Wherever the layer width changes, a simple dense layer will be used even if `residual=True`.
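
To make the width rule concrete, here is a minimal NumPy sketch of a forward pass with optional skip connections. It is an illustration under stated assumptions, not DeepChem's code: the `forward` and `relu` helpers, the random weights, and the layer widths are all hypothetical. A skip connection is added only where a layer's input and output widths match. Elsewhere the layer falls back to a plain dense layer.

```python
import numpy as np


def relu(x):
    return np.maximum(x, 0.0)


def forward(x, weights, biases, residual=True):
    """Forward pass through a stack of dense layers with optional skips."""
    h = x
    for k, (W, b) in enumerate(zip(weights, biases)):
        # Pre-activation: the nonlinearity is applied to the layer input
        # (the raw features entering the first layer are left untouched).
        z = h if k == 0 else relu(h)
        out = z @ W + b
        if residual and W.shape[0] == W.shape[1]:
            # Same input and output width: add the skip connection.
            h = h + out
        else:
            # Width changes (or residual is off): plain dense layer.
            h = out
    return h


rng = np.random.default_rng(0)
widths = [16, 8, 8, 8]  # input width 16, then three hidden layers of width 8
weights = [rng.normal(scale=0.1, size=(a, b))
           for a, b in zip(widths[:-1], widths[1:])]
biases = [np.zeros(b) for b in widths[1:]]

x = rng.normal(size=(4, 16))  # a batch of 4 hypothetical samples
out_res = forward(x, weights, biases, residual=True)
out_plain = forward(x, weights, biases, residual=False)
print(out_res.shape, out_plain.shape)
```

In this sketch the first layer (16 to 8) is always a plain dense layer, while the two 8-to-8 layers use skip connections when `residual=True`.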

Here the model has a single hidden layer of width 1000 and is trained on one task only.

In [8]:
import numpy as np

model.fit(train_dataset, nb_epoch=10)
metric = dc.metrics.Metric(dc.metrics.pearson_r2_score)
print('training set score:', model.evaluate(train_dataset, [metric], transformers))
print('test set score:', model.evaluate(test_dataset, [metric], transformers))

training set score: {'pearson_r2_score': 0.6974517176731267}
test set score: {'pearson_r2_score': 0.22355885227147854}


The training set score is much higher than the test set score, which indicates overfitting. This is why metrics on the validation set need to be measured in order to tune hyperparameters and detect overfitting.
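
The scores reported here come from `pearson_r2_score`, the squared Pearson correlation between labels and predictions. The pure-Python sketch below shows how that quantity is computed. The helper name `pearson_r2` and the small example lists are hypothetical and are not taken from the Lipo dataset.

```python
import math


def pearson_r2(y_true, y_pred):
    """Squared Pearson correlation between two equal-length sequences."""
    n = len(y_true)
    mean_t = sum(y_true) / n
    mean_p = sum(y_pred) / n
    cov = sum((t - mean_t) * (p - mean_p) for t, p in zip(y_true, y_pred))
    var_t = sum((t - mean_t) ** 2 for t in y_true)
    var_p = sum((p - mean_p) ** 2 for p in y_pred)
    r = cov / math.sqrt(var_t * var_p)
    return r * r


# Predictions that are exactly twice the labels still correlate perfectly.
perfect = pearson_r2([1.0, 2.0, 3.0], [2.0, 4.0, 6.0])

# Swapping two of the predictions lowers the score.
noisy = pearson_r2([1.0, 2.0, 3.0, 4.0], [1.0, 3.0, 2.0, 4.0])

print(perfect, noisy)
```

Note that this score is insensitive to the scale and offset of the predictions, so a high value does not by itself guarantee that individual predicted values are close to the labels.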

In [9]:
print(test_dataset.ids[0])

O[C@@H](CNCCCOCCNCCc1cccc(Cl)c1)c2ccc(O)c3NC(=O)Sc23


In [10]:
print(test_dataset.y[0])
model.predict(test_dataset)[0]

[-1.81083219]


array([[-0.722492]], dtype=float32)