<a href="https://colab.research.google.com/github/nihal-rao/deepchem/blob/master/baselines/FREESOLV_Overview.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

## Installing DeepChem

In [1]:
!curl -Lo conda_installer.py https://raw.githubusercontent.com/deepchem/deepchem/master/scripts/colab_install.py
import conda_installer
conda_installer.install()
!/root/miniconda/bin/conda info -e



add /root/miniconda/lib/python3.6/site-packages to PYTHONPATH
python version: 3.6.9
fetching installer from https://repo.continuum.io/miniconda/Miniconda3-latest-Linux-x86_64.sh
done
installing miniconda to /root/miniconda
done
installing rdkit, openmm, pdbfixer
added omnia to channels
added conda-forge to channels
done
conda packages installation finished!


# conda environments:
#
base                  *  /root/miniconda



In [2]:
!pip install --pre deepchem

Collecting deepchem
  Downloading https://files.pythonhosted.org/packages/5a/7a/456cd8731b855d33b6db7043627a71f88d89bcf2f570110b8f7603558b06/deepchem-2.4.0rc1.dev20201110082157.tar.gz (402kB)
Building wheels for collected packages: deepchem
  Building wheel for deepchem (setup.py) ... done
  Created wheel for deepchem: filename=deepchem-2.4.0rc1.dev20201111072123-cp36-none-any.whl size=513167 sha256=81fff405976f5b435f0f32f3740692836fb4446b00bca092074ce5264f032066
  Stored in directory: /root/.cache/pip/wheels/dc/48/f7/fe7a5c16e27692765c9f01bdb939afbeebde95cad7e35bac30
Successfully built deepchem
Installing collected packages: deepchem
Successfully installed deepchem-2.4.0rc1.dev20201111072123


We can now import the `deepchem` package and check which version was installed.

In [3]:
import deepchem as dc
dc.__version__

'2.4.0-rc1.dev'

## Baseline - Fingerprints + NN

Implementing and recording the baseline for the FreeSolv (SAMPL) dataset.

In [4]:
tasks, datasets, transformers = dc.molnet.load_sampl(featurizer='ECFP')
train_dataset, valid_dataset, test_dataset = datasets
print(train_dataset)

<DiskDataset X.shape: (513, 1024), y.shape: (513, 1), w.shape: (513, 1), ids: ['CS(=O)(=O)Cl' 'CC(C)C=C' 'CCCCCCCO' ... 'C1CCC(=O)CC1'
 'C[C@@H]1CC[C@H](C(=O)C1)C(C)C' 'C[C@@H]1CC[C@H](CC1=O)C(=C)C'], task_names: ['expt']>


The ECFP featurizer is used here. Extended Connectivity Fingerprints (ECFP), sometimes called "circular fingerprints", are a **fingerprinting** method. The ECFP algorithm begins by classifying atoms based only on their direct properties and bonds. Each unique pattern is a feature. For example, "carbon atom bonded to two hydrogens and two heavy atoms" would be a feature, and a particular element of the fingerprint is set to 1 for any molecule that contains that feature. The algorithm then iteratively identifies new features by looking at larger circular neighborhoods: one specific feature bonded to two other specific features becomes a higher-level feature, and the corresponding element is set for any molecule that contains it. This continues for a fixed number of iterations, most often two.
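The iterative idea described above can be illustrated with a toy sketch. This is not the ECFP implementation that DeepChem or RDKit use: the function name `toy_circular_fingerprint`, the hand-written molecular graphs, and the use of SHA-256 as the hash are all assumptions made for illustration, and details such as bond orders, hydrogens, and duplicate-substructure removal are ignored.

```python
import hashlib


def _stable_hash(text):
    """Deterministic integer hash of a string (Python's hash() is salted)."""
    return int(hashlib.sha256(text.encode("utf-8")).hexdigest(), 16)


def toy_circular_fingerprint(atoms, bonds, radius=2, n_bits=1024):
    """Toy circular fingerprint for a molecule given as a heavy-atom graph.

    atoms: list of element symbols, e.g. ["C", "C", "O"]
    bonds: list of (i, j) index pairs into `atoms`
    """
    neighbors = {i: [] for i in range(len(atoms))}
    for i, j in bonds:
        neighbors[i].append(j)
        neighbors[j].append(i)

    # Iteration 0: each atom is described only by its own direct properties
    # (here just the element and its number of heavy-atom neighbors).
    ids = [_stable_hash(f"{atoms[i]}|{len(neighbors[i])}") for i in range(len(atoms))]
    features = set(ids)

    # Each further iteration combines an atom's identifier with those of its
    # neighbors, so the identifier describes a larger circular neighborhood.
    for _ in range(radius):
        new_ids = []
        for i in range(len(atoms)):
            environment = sorted(ids[j] for j in neighbors[i])
            new_ids.append(_stable_hash(f"{ids[i]}|{environment}"))
        ids = new_ids
        features.update(ids)

    # Fold every feature identifier into a fixed-length bit vector.
    bits = [0] * n_bits
    for feature in features:
        bits[feature % n_bits] = 1
    return bits


# Hand-written heavy-atom graphs for ethanol (CCO) and 1-propanol (CCCO).
ethanol = toy_circular_fingerprint(["C", "C", "O"], [(0, 1), (1, 2)])
propanol = toy_circular_fingerprint(["C", "C", "C", "O"], [(0, 1), (1, 2), (2, 3)])

shared = sum(a & b for a, b in zip(ethanol, propanol))
print("vector length:", len(ethanol))
print("bits set in ethanol:", sum(ethanol))
print("bits shared with 1-propanol:", shared)
```

Both molecules map to vectors of the same length regardless of their size, which is what lets a fixed-width network consume them. Molecules that share local substructures also share some set bits.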

In [5]:
tasks

['expt']

Above is the only task in this dataset: `expt`, the experimentally measured hydration free energy of each molecule. Predicting it is a single-task regression problem.

In [6]:
datasets

(<DiskDataset X.shape: (513, 1024), y.shape: (513, 1), w.shape: (513, 1), ids: ['CS(=O)(=O)Cl' 'CC(C)C=C' 'CCCCCCCO' ... 'C1CCC(=O)CC1'
  'C[C@@H]1CC[C@H](C(=O)C1)C(C)C' 'C[C@@H]1CC[C@H](CC1=O)C(=C)C'], task_names: ['expt']>,
 <DiskDataset X.shape: (64, 1024), y.shape: (64, 1), w.shape: (64, 1), ids: ['c1ccc2cc(ccc2c1)O' 'Cc1ccc2cc(ccc2c1)C' 'c1ccc2ccccc2c1' ...
  'C([C@@H]1[C@H]([C@@H]([C@H]([C@@H](O1)O)O)O)O)O'
  'Cc1c(nc(nc1OC(=O)N(C)C)N(C)C)C' 'CCOP(=S)(OCC)Oc1cc(nc(n1)C(C)C)C'], task_names: ['expt']>,
 <DiskDataset X.shape: (65, 1024), y.shape: (65, 1), w.shape: (65, 1), ids: ['c1cnc[nH]1' 'Cn1ccnc1' 'Cc1c[nH]cn1' ... 'Cc1cccc(c1C)Nc2ccccc2C(=O)O'
  'C1CCCC(CC1)O' 'c1ccc2c(c1)CCC2'], task_names: ['expt']>)

There are three dataset objects - the train split, the validation split and the test split. Each split contains X, y and w - X is the features, y is the output label and w is the per-sample weight. For example, the train split has X.shape (513, 1024) and y.shape (513, 1). This means that there are 513 samples in the train split, each represented by an ECFP vector of size 1024 and labeled with a single value.
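As a quick check on the shapes printed above, the following sketch computes what fraction of the data falls in each split. The sizes are copied from the output above; the variable names are only illustrative.

```python
# Number of molecules in each split, read off the dataset shapes above.
split_sizes = {"train": 513, "valid": 64, "test": 65}

total = sum(split_sizes.values())
fractions = {name: size / total for name, size in split_sizes.items()}

for name, size in split_sizes.items():
    print(f"{name}: {size} molecules ({fractions[name]:.1%})")
print("total:", total)
```

The 642 molecules are divided in roughly an 80/10/10 ratio, so the validation and test scores are each estimated from only about 65 molecules.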

## Training a Model on Fingerprints

In [8]:
model = dc.models.MultitaskRegressor(n_tasks=1, n_features=1024, layer_sizes=[1000])

`MultitaskRegressor` is a simple stack of fully connected layers. A single hidden layer of width 1000 is used. Each input has 1024 features, and the model produces a prediction for a single task.

Note that `MultitaskRegressor` supports multitask learning, in which a single network predicts several tasks at once in order to take advantage of inter-task correlations in the data. Here `n_tasks=1`, so the network is trained on a single regression task.
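To get a feel for the size of this network, the sketch below counts trainable parameters, assuming plain dense layers with one bias per output unit. The helper `dense_param_count` is hypothetical, and the actual `MultitaskRegressor` may hold additional parameters depending on its options.

```python
def dense_param_count(layer_widths):
    """Weights plus biases for a stack of fully connected layers."""
    return sum(
        n_in * n_out + n_out
        for n_in, n_out in zip(layer_widths, layer_widths[1:])
    )


# 1024 input features -> 1000 hidden units -> 1 output (n_tasks=1)
n_params = dense_param_count([1024, 1000, 1])
n_train = 513  # training samples, from the dataset shapes above

print("parameters:", n_params)
print("parameters per training sample:", round(n_params / n_train))
```

Under these assumptions the model has about a million parameters, roughly 2000 for every training molecule, so it has ample capacity to memorize the training set.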

In [9]:
import numpy as np

# Train for 10 passes over the training split
model.fit(train_dataset, nb_epoch=10)

# Score with the squared Pearson correlation between predicted and true values
metric = dc.metrics.Metric(dc.metrics.pearson_r2_score)
print('training set score:', model.evaluate(train_dataset, [metric], transformers))
print('test set score:', model.evaluate(test_dataset, [metric], transformers))

training set score: {'pearson_r2_score': 0.849167746922176}
test set score: {'pearson_r2_score': 0.16165382288093907}


The training set score is much higher than the test set score. This indicates overfitting - and is why metrics on the validation set need to be measured in order to tune hyperparameters and detect overfitting.
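The scores reported above are squared Pearson correlation coefficients between predicted and true values. The following is a minimal pure-Python sketch of that quantity, not DeepChem's own code; the function name `pearson_r2` and the example values are made up for illustration.

```python
def pearson_r2(y_true, y_pred):
    """Squared Pearson correlation coefficient between two sequences."""
    n = len(y_true)
    mean_true = sum(y_true) / n
    mean_pred = sum(y_pred) / n
    covariance = sum((t - mean_true) * (p - mean_pred) for t, p in zip(y_true, y_pred))
    var_true = sum((t - mean_true) ** 2 for t in y_true)
    var_pred = sum((p - mean_pred) ** 2 for p in y_pred)
    return covariance ** 2 / (var_true * var_pred)


# Hypothetical values, chosen only to show how the score behaves.
y_true = [1.0, 2.0, 3.0, 4.0]
print(pearson_r2(y_true, [2.0, 4.0, 6.0, 8.0]))  # perfectly linear predictions
print(pearson_r2(y_true, [1.0, 3.0, 2.0, 4.0]))  # two predictions swapped
```

Because the score only measures linear correlation, predictions that are off by a constant factor or offset still score close to 1. It is best read alongside an error metric such as RMSE.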