# DeepChem

- [Installation](https://github.com/deepchem/deepchem#installation)
- [Tutorial](https://deepchem.readthedocs.io/en/latest/get_started/tutorials.html)
- [Sample Notebooks](https://github.com/deepchem/deepchem/tree/master/examples/tutorials) 

DeepChem implements several graph models.

Some of them are built on Keras and others on PyTorch.

Here we compare several of the Keras and PyTorch models.


**Additional Dependencies:**

WeaveModel requires the "TensorFlow Probability" library:
- https://www.tensorflow.org/probability
 
 
PyTorch models require "DGL" and "DGL LifeSci" to be installed:
- https://www.dgl.ai/
- https://github.com/awslabs/dgl-lifesci


**Documentation:**

https://deepchem.readthedocs.io/en/latest/api_reference/models.html#keras-models

https://deepchem.readthedocs.io/en/latest/api_reference/models.html#pytorch-models


**Note:**

https://deepchem.readthedocs.io/en/latest/api_reference/featurizers.html#graph-convolution-featurizers

Keras and PyTorch models use different featurizers, and some models have their own dedicated featurizer.
In summary:

- ConvMolFeaturizer and WeaveFeaturizer are used with graph convolution models that inherit from KerasModel.

- ConvMolFeaturizer is used with the Keras graph convolution models except WeaveModel; WeaveFeaturizer is used only with WeaveModel.

- MolGraphConvFeaturizer is used with graph convolution models that inherit from TorchModel.
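
The pairing above can be written down as a small lookup table. The sketch below is only an illustration: the dictionary and the helper `featurizer_for` are hypothetical names, not part of DeepChem, while the model names, featurizer names and the `use_edges` remark are taken from this notebook.

```python
# Hypothetical lookup table (not a DeepChem API): which featurizer this
# notebook pairs with each model class.
MODEL_TO_FEATURIZER = {
    "GraphConvModel": "ConvMolFeaturizer",         # Keras
    "WeaveModel": "WeaveFeaturizer",               # Keras
    "GATModel": "MolGraphConvFeaturizer",          # PyTorch
    "GCNModel": "MolGraphConvFeaturizer",          # PyTorch
    "AttentiveFPModel": "MolGraphConvFeaturizer",  # PyTorch, needs use_edges=True
}


def featurizer_for(model_name):
    """Return the featurizer name paired with `model_name` in this notebook."""
    try:
        return MODEL_TO_FEATURIZER[model_name]
    except KeyError:
        raise ValueError(f"No featurizer recorded for model {model_name!r}") from None


print(featurizer_for("WeaveModel"))   # WeaveFeaturizer
```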


**Important:**

Before fitting a model, run every graph featurizer over the full dataset to make sure it can featurize every molecule.
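
One way to run this check is to scan the featurizer output for failures. The sketch below assumes, as the MolGraphConvFeaturizer output later in this notebook shows, that a molecule that fails to featurize is stored as an empty NumPy array. `find_failed` is a hypothetical helper name, and the stand-in list replaces real featurizer output so that the snippet runs without DeepChem.

```python
import numpy as np


def find_failed(features):
    """Return indices of entries that are empty arrays (failed featurizations)."""
    return [
        i for i, f in enumerate(features)
        if isinstance(f, np.ndarray) and f.size == 0
    ]


# Stand-in for featurizer output: plain objects play the role of graph objects
# and an empty array marks the molecule that could not be featurized.
features = [object(), np.array([]), object()]
failed = find_failed(features)
print(failed)   # [1]
```

With real data, `features` would be the result of `featurizer.featurize(smiles)`, and the returned indices can be dropped from both the features and `y` before a dataset is built.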


In [1]:
import numpy as np
import pandas as pd

import deepchem as dc

from sklearn.metrics import r2_score

# Parameters , Settings

In [2]:
data_filepath = './data/ESOL.csv'   

model_dir = './result'   # folder to save fitted model

n_tasks = 1   # No. of tasks (No. of dependent variables)

nb_epoch = 100   # No. of epochs

# Initialize the metrics
# https://deepchem.readthedocs.io/en/latest/api_reference/metrics.html
metric_r2 = dc.metrics.Metric(dc.metrics.r2_score)
metric_mse = dc.metrics.Metric(dc.metrics.mean_squared_error)
metrics = [metric_r2, metric_mse]

# Data

In [3]:
data = pd.read_csv(data_filepath)   
smiles = data['smiles']   # should be 1D
y = data['measured log solubility in mols per litre']   # can be 1D or 2D

# Keras Models

# GraphConvModel

### Make Dataset by ConvMolFeaturizer

In [4]:
''' ConvMolFeaturizer '''
# Duvenaud graph convolutions ; Can be used with Keras models
# https://deepchem.readthedocs.io/en/latest/api_reference/featurizers.html#convmolfeaturizer

featurizer = dc.feat.ConvMolFeaturizer(use_chirality=False, per_atom_fragmentation=False)
features = featurizer.featurize(smiles)   # numpy array, it returns a graph object for every molecule
type(features[0])   # deepchem.feat.mol_graphs.ConvMol

deepchem.feat.mol_graphs.ConvMol

In [5]:
dataset = dc.data.NumpyDataset(X=features, y=np.array(y))
print(dataset)

<NumpyDataset X.shape: (1128,), y.shape: (1128,), w.shape: (1128,), task_names: [0]>


In [6]:
# Split Dataset
splitter = dc.splits.RandomSplitter()
train_dataset, test_dataset = splitter.train_test_split(dataset=dataset, frac_train=0.8, seed=0)
print(test_dataset)

y_test = test_dataset.y

<NumpyDataset X.shape: (226,), y.shape: (226,), w.shape: (226,), ids: [148 1014 1021 ... 835 559 684], task_names: [0]>
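
To make the 80/20 split concrete, the sketch below shuffles indices with a fixed seed and cuts them at `int(frac_train * n)`. It is an illustration in plain Python, not DeepChem's implementation, so the indices it selects will differ from the `ids` printed above; `random_split_indices` is a hypothetical name. The sizes do agree with the output: 1128 molecules give 902 training and 226 test samples.

```python
import random


def random_split_indices(n, frac_train=0.8, seed=0):
    """Shuffle range(n) reproducibly and cut it into train/test index lists."""
    indices = list(range(n))
    random.Random(seed).shuffle(indices)   # local RNG, global state untouched
    cutoff = int(frac_train * n)           # 1128 * 0.8 = 902.4 -> 902
    return indices[:cutoff], indices[cutoff:]


train_idx, test_idx = random_split_indices(1128)
print(len(train_idx), len(test_idx))   # 902 226
```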


### Model , Train

In [8]:
# https://deepchem.readthedocs.io/en/latest/api_reference/models.html#graphconvmodel
# This Class uses Keras models
model = dc.models.GraphConvModel(n_tasks=n_tasks,              # No. of tasks 
                                 mode='regression',            # Either “classification” or “regression”
                                 batch_size=100,               # Batch size for training and evaluating
                                 learning_rate=0.001,
                                 dropout=0.2,                  # Dropout probability to use for each layer. The length of this list should equal len(graph_conv_layers)+1 (one value for each convolution layer, and one for the dense layer). Alternatively this may be a single value instead of a list, in which case the same value is used for every layer.
                                 graph_conv_layers=[64, 64],   # Width of channels for the Graph Convolution Layers
                                 dense_layer_size=128          # Width of channels for Atom Level Dense Layer before GraphPool
                                 )
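
The `dropout` comment above describes two accepted forms: a single value, or a list with one entry per convolution layer plus one for the dense layer. The helper below sketches that rule in plain Python. `expand_dropout` is a hypothetical name used only for illustration, and DeepChem's own handling may differ in detail.

```python
def expand_dropout(dropout, graph_conv_layers):
    """Return one dropout value per graph convolution layer plus the dense layer."""
    expected = len(graph_conv_layers) + 1
    if isinstance(dropout, (list, tuple)):
        if len(dropout) != expected:
            raise ValueError(
                f"expected {expected} dropout values, got {len(dropout)}"
            )
        return list(dropout)
    # A single number is reused for every layer.
    return [dropout] * expected


print(expand_dropout(0.2, [64, 64]))               # [0.2, 0.2, 0.2]
print(expand_dropout([0.1, 0.1, 0.5], [64, 64]))   # [0.1, 0.1, 0.5]
```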

In [9]:
loss_avg = model.fit(train_dataset, nb_epoch=nb_epoch)
loss_avg

  "shape. This may consume a large amount of memory." % value)


In [10]:
print("Training set score:", model.evaluate(train_dataset, metrics))
print("Test set score:", model.evaluate(test_dataset, metrics))

Training set score: {'r2_score': 0.7699905728761282, 'mean_squared_error': 1.0149233396525408}
Test set score: {'r2_score': 0.7233671241375563, 'mean_squared_error': 1.1911568851754661}


In [11]:
y_pred = model.predict(test_dataset)   # numpy array [n, 1]
r2_score(y_test, y_pred)

0.7233671241375563
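
The value returned by `r2_score` matches the `r2_score` entry from `model.evaluate`. For reference, the coefficient of determination and the mean squared error can be computed by hand as below. This is a plain-Python sketch with toy numbers, not the code that DeepChem or scikit-learn runs; `r2` and `mse` are hypothetical helper names.

```python
def mse(y_true, y_pred):
    """Mean squared error: average of the squared residuals."""
    return sum((t - p) ** 2 for t, p in zip(y_true, y_pred)) / len(y_true)


def r2(y_true, y_pred):
    """Coefficient of determination: 1 - SS_res / SS_tot."""
    mean_true = sum(y_true) / len(y_true)
    ss_res = sum((t - p) ** 2 for t, p in zip(y_true, y_pred))
    ss_tot = sum((t - mean_true) ** 2 for t in y_true)
    return 1.0 - ss_res / ss_tot


# Toy values for illustration only (not ESOL data).
y_true = [0.0, 0.0, 2.0, 2.0]
y_pred = [0.0, 0.0, 2.0, 1.0]
print(mse(y_true, y_pred))   # 0.25
print(r2(y_true, y_pred))    # 0.75
```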

# WeaveModel

### Make Dataset by WeaveFeaturizer

In [12]:
''' WeaveFeaturizer '''
# Weave convolutions ; Can be used with WeaveModel (WeaveModel has its own featurizer)
# Compared to "ConvMolFeaturizer", it has extra descriptors and may provide for additional descriptive power but at the cost of a larger featurized dataset.
# https://deepchem.readthedocs.io/en/latest/api_reference/featurizers.html#weavefeaturizer

# "graph_distance": if True, use graph distance for the distance features; otherwise use Euclidean distance, which requires molecules with valid conformer information.
featurizer = dc.feat.WeaveFeaturizer(graph_distance=True)
features = featurizer.featurize(smiles)
type(features[0])   # deepchem.feat.mol_graphs.WeaveMol

deepchem.feat.mol_graphs.WeaveMol

In [13]:
dataset = dc.data.NumpyDataset(X=features, y=np.array(y))
print(dataset)

<NumpyDataset X.shape: (1128,), y.shape: (1128,), w.shape: (1128,), task_names: [0]>


In [14]:
# Split Dataset
splitter = dc.splits.RandomSplitter()
train_dataset, test_dataset = splitter.train_test_split(dataset=dataset, frac_train=0.8, seed=0)
print(test_dataset)

y_test = test_dataset.y

<NumpyDataset X.shape: (226,), y.shape: (226,), w.shape: (226,), ids: [148 1014 1021 ... 835 559 684], task_names: [0]>


### Model , Train

In [15]:
# https://deepchem.readthedocs.io/en/latest/api_reference/models.html#weavemodel
# This model implements the Weave style graph convolution. This Class uses Keras models.
# Weave model has different architectures. The default settings in this class correspond to the W2N2 variant which is the most commonly used variant.
# This model cannot compute uncertainties

del model

model = dc.models.WeaveModel(n_tasks=n_tasks,              # No. of tasks 
                             mode='regression',            # Either “classification” or “regression”
                             batch_size=100,               # Batch size for training and evaluating
                             learning_rate=0.001,
                             dropouts=0.25,                # Dropout probability to use for each fully connected layer (default 0.25). Note that WeaveModel names this parameter 'dropouts', whereas the other models call it 'dropout'.
                             n_weave=2,                    # No. of weave layers
                             fully_connected_layer_sizes=[2000, 100],   # Size of each dense layer in the network. The length of this list determines the number of layers.                                 
                             batch_normalize=False         # Use of batch normalization can cause issues with NaNs. If you’re having trouble with NaNs while using this model, consider setting batch_normalize=False.
                             )

In [16]:
# Slower than GraphConvModel
model.fit(train_dataset, nb_epoch=nb_epoch)



0.7614833831787109

In [17]:
print("Training set score:", model.evaluate(train_dataset, metrics))
print("Test set score:", model.evaluate(test_dataset, metrics))

Training set score: {'r2_score': 0.9269669655017868, 'mean_squared_error': 0.322260405604883}
Test set score: {'r2_score': 0.9001402970449006, 'mean_squared_error': 0.4299871168808256}


In [18]:
y_pred = model.predict(test_dataset)   # numpy array [n, 1]
r2_score(y_test, y_pred)

0.9001402970449006

# PyTorch Models

# GAT Model

### Make Dataset by MolGraphConvFeaturizer

In [19]:
''' MolGraphConvFeaturizer '''
# General graph convolution networks for molecules , Can be used with PyTorch models
# https://deepchem.readthedocs.io/en/latest/api_reference/featurizers.html#molgraphconvfeaturizer
# Note: with "use_edges=True", a single-atom SMILES (e.g. "C") has no bonds and fails with "Failed to featurize datapoint".
# Some PyTorch models require "use_edges=True", so we set it here; both node features and edge features are then used.
featurizer = dc.feat.MolGraphConvFeaturizer(use_edges=True)
features = featurizer.featurize(smiles)   # numpy array, it returns a graph object for every molecule

Failed to featurize datapoint 934, C. Appending empty array
Exception message: zero-size array to reduction operation maximum which has no identity
  return array(a, dtype, copy=False, order=order)


Featurization failed for datapoint 934 (SMILES "C"), which was stored as an empty array. We remove this observation from the dataset.

In [20]:
features[934]   # empty array

array([], dtype=float64)

In [21]:
type(features[0])   # deepchem.feat.graph_data.GraphData

deepchem.feat.graph_data.GraphData

In [22]:
# Datapoint 934 failed to featurize ("Failed to featurize datapoint 934, C. Appending empty array"),
# so remove its entry from the features
features = np.delete(features, 934)

# Remove the same entry from y. We also reshape y to 2D to avoid a warning when fitting the model ("Using a target size (torch.Size([100])) that is different to the input size (torch.Size([100, 1])). This will likely lead to incorrect results due to broadcasting.")
# Some PyTorch models require y as a 2D array
y = np.array(y)
y = np.delete(y, 934).reshape(-1, 1)
y.shape

(1127, 1)
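
The reshape to `(n, 1)` matters because of broadcasting. The NumPy sketch below (PyTorch follows NumPy-style broadcasting rules) shows how combining a `(3,)` array with a `(3, 1)` array silently produces a `(3, 3)` matrix instead of three element-wise differences. The arrays are toy values, not ESOL data.

```python
import numpy as np

y_1d = np.array([1.0, 2.0, 3.0])           # shape (3,), like an un-reshaped target
pred_2d = np.array([[1.0], [2.0], [3.0]])  # shape (3, 1), like a model output

# Mismatched shapes broadcast to a full (3, 3) grid of pairwise differences.
wrong = y_1d - pred_2d
print(wrong.shape)   # (3, 3)

# Reshaping the target to (3, 1) gives the intended element-wise differences.
right = y_1d.reshape(-1, 1) - pred_2d
print(right.shape)   # (3, 1)

# The predictions are perfect, yet only the correctly shaped version reports
# a zero mean squared error.
print(float((right ** 2).mean()))   # 0.0
print(float((wrong ** 2).mean()) > 0)   # True
```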

In [26]:
dataset = dc.data.NumpyDataset(X=features, y=y)
print(dataset)

<NumpyDataset X.shape: (1127,), y.shape: (1127, 1), w.shape: (1127, 1), task_names: [0]>


In [24]:
# Split Dataset (splitter object is already initialized)
train_dataset, test_dataset = splitter.train_test_split(dataset=dataset, frac_train=0.8, seed=0)

# Take y_test from the new split, after test_dataset has been rebuilt
y_test = test_dataset.y

### Model , Train

In [25]:
# https://deepchem.readthedocs.io/en/latest/api_reference/models.html#gatmodel
# Model for Graph Property Prediction Based on Graph Attention Networks (GAT)
# It works with either "use_edges=True" or "use_edges=False" (parameter of MolGraphConvFeaturizer)
# This model cannot compute uncertainties

del model

model = dc.models.GATModel(n_tasks=n_tasks,              # No. of tasks 
                           mode='regression',            # Either “classification” or “regression”
                           batch_size=100,               # Batch size for training and evaluating
                           learning_rate=0.001,
                           dropout=0                     # Dropout probability within each GAT layer
                           )

print(model)

Using backend: pytorch


GATModel(activation=None, alpha=None, dropout=None, mode=None,
         n_attention_heads=None, n_classes=None, n_tasks=None,
         number_atom_features=None, predictor_dropout=None,
         predictor_hidden_feats=None, residual=None, self_loop=None)




In [27]:
# https://deepchem.readthedocs.io/en/latest/api_reference/models.html#pytorch-models
loss_avg = model.fit(train_dataset, nb_epoch=nb_epoch)

In [28]:
print("Training set score:", model.evaluate(train_dataset, metrics))
print("Test set score:", model.evaluate(test_dataset, metrics))

Training set score: {'r2_score': 0.9435374576239887, 'mean_squared_error': 0.25392900477060387}
Test set score: {'r2_score': 0.8751975961613548, 'mean_squared_error': 0.49507168892227976}


In [29]:
y_pred = model.predict(test_dataset)   # numpy array [n, 1]
r2_score(y_test, y_pred)

0.8751975961613548

# GCN Model

### Make Dataset by MolGraphConvFeaturizer

We reuse the dataset prepared for the GAT model because the featurizer is the same.

### Model , Train

In [40]:
# https://deepchem.readthedocs.io/en/latest/api_reference/models.html#gcnmodel
# Model for Graph Property Prediction Based on Graph Convolution Networks (GCN)
# This model is different from deepchem.models.GraphConvModel
# It works with either "use_edges=True" or "use_edges=False" (parameter of MolGraphConvFeaturizer)
# This model cannot compute uncertainties

del model

model = dc.models.GCNModel(n_tasks=n_tasks,              # No. of tasks 
                           mode='regression',            # Either “classification” or “regression”
                           batch_size=100,               # Batch size for training and evaluating
                           learning_rate=0.001,
                           graph_conv_layers=[64, 64],   # Width of channels for GCN layers
                           dropout=0.1,                  # Dropout probability for the output of each GCN layer
                           predictor_dropout=0.1,        # Dropout probability in the output MLP predictor
                           batchnorm=False               # Whether to apply batch normalization to the output of each GCN layer
                           )

In [41]:
loss_avg = model.fit(train_dataset, nb_epoch=nb_epoch)
loss_avg

0.38655326843261717

In [43]:
print("Training set score:", model.evaluate(train_dataset, metrics))
print("Test set score:", model.evaluate(test_dataset, metrics))

Training set score: {'r2_score': 0.8913996043320103, 'mean_squared_error': 0.48840858433223366}
Test set score: {'r2_score': 0.8709806313139318, 'mean_squared_error': 0.511799731371196}


In [44]:
y_pred = model.predict(test_dataset)   # numpy array [n, 1]
r2_score(y_test, y_pred)

0.8709806313139318

# AttentiveFPModel

### Make Dataset by MolGraphConvFeaturizer

We reuse the dataset prepared for the GAT model because the featurizer is the same.

### Model , Train

In [46]:
# https://deepchem.readthedocs.io/en/latest/api_reference/models.html#attentivefpmodel
# Model for Graph Property Prediction. This model combines node features and edge features for initializing node representations.
# For each graph, compute its representation by combining the representations of all nodes in it, which involves a gated recurrent unit (GRU).
# It requires "use_edges=True" (Parameter of MolGraphConvFeaturizer)
# This model cannot compute uncertainties

del model

model = dc.models.AttentiveFPModel(n_tasks=n_tasks,              # No. of tasks 
                                   mode='regression',            # Either “classification” or “regression”
                                   batch_size=100,               # Batch size for training and evaluating
                                   learning_rate=0.001,
                                   num_layers=2,                 # No. of graph neural network layers
                                   dropout=0.1                   # Dropout probability 
                                   )

In [47]:
loss_avg = model.fit(train_dataset, nb_epoch=nb_epoch)
loss_avg

0.13810452461242675

In [48]:
print("Training set score:", model.evaluate(train_dataset, metrics))
print("Test set score:", model.evaluate(test_dataset, metrics))

Training set score: {'r2_score': 0.9744812788976148, 'mean_squared_error': 0.11476535026344031}
Test set score: {'r2_score': 0.917496779318387, 'mean_squared_error': 0.32727742053095}


In [49]:
y_pred = model.predict(test_dataset)   # numpy array [n, 1]
r2_score(y_test, y_pred)

0.917496779318387
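
To compare the five models, the test-set scores printed in this notebook can be gathered and ranked. The numbers below are copied from the outputs above and rounded to four decimals. They come from a single random split with `seed=0`, so the ranking is indicative only.

```python
# Test-set scores reported above, rounded to four decimals: (R^2, MSE).
test_scores = {
    "GraphConvModel": (0.7234, 1.1912),
    "WeaveModel": (0.9001, 0.4300),
    "GATModel": (0.8752, 0.4951),
    "GCNModel": (0.8710, 0.5118),
    "AttentiveFPModel": (0.9175, 0.3273),
}

# Rank by R^2, best first.
ranking = sorted(test_scores.items(), key=lambda item: item[1][0], reverse=True)
for name, (r2, mse) in ranking:
    print(f"{name:<18} R2={r2:.4f}  MSE={mse:.4f}")
```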