<a href="https://colab.research.google.com/github/kensingera24/DeepChem/blob/main/Introduction_to_MoleculeNet.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

In [1]:
# This notebook will explore uses of the MoleculeNet suite of datasets

In [2]:
!pip install --pre deepchem

Collecting deepchem
  Downloading deepchem-2.8.1.dev20240419190927-py3-none-any.whl (1.0 MB)
Collecting rdkit (from deepchem)
  Downloading rdkit-2023.9.5-cp310-cp310-manylinux_2_17_x86_64.manylinux2014_x86_64.whl (34.4 MB)
Installing collected packages: rdkit, deepchem
Successfully installed deepchem-2.8.1.dev20240419190927 rdkit-2023.9.5


In [3]:
import deepchem as dc
dc.__version__

Instructions for updating:
experimental_relax_shapes is deprecated, use reduce_retracing instead


'2.8.1.dev'

In [5]:
# Load in the Delaney dataset of molecular solubilities
tasks, datasets, transformers = dc.molnet.load_delaney(featurizer='GraphConv', splitter='random')

In [6]:
# Take a look at the full collection of loaders available
[method for method in dir(dc.molnet) if "load_" in method]

['load_Platinum_Adsorption',
 'load_bace_classification',
 'load_bace_regression',
 'load_bandgap',
 'load_bbbc001',
 'load_bbbc002',
 'load_bbbc003',
 'load_bbbc004',
 'load_bbbc005',
 'load_bbbp',
 'load_cell_counting',
 'load_chembl',
 'load_chembl25',
 'load_clearance',
 'load_clintox',
 'load_delaney',
 'load_factors',
 'load_freesolv',
 'load_function',
 'load_hiv',
 'load_hopv',
 'load_hppb',
 'load_kaggle',
 'load_kinase',
 'load_lipo',
 'load_mp_formation_energy',
 'load_mp_metallicity',
 'load_muv',
 'load_nci',
 'load_pcba',
 'load_pdbbind',
 'load_perovskite',
 'load_ppb',
 'load_qm7',
 'load_qm8',
 'load_qm9',
 'load_sampl',
 'load_sider',
 'load_sweet',
 'load_thermosol',
 'load_tox21',
 'load_toxcast',
 'load_uspto',
 'load_uv',
 'load_zinc15']

In [7]:
# Count the loader functions, a rough proxy for how many datasets MoleculeNet has today
len([method for method in dir(dc.molnet) if "load_" in method])

45
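
The `"load_" in method` test above matches the substring anywhere in a name. A slightly stricter filter, sketched below, checks the prefix with `str.startswith` and keeps only callables. To stay runnable without DeepChem it uses a stand-in namespace; `fake_molnet` and `list_loaders` are hypothetical names for illustration and are not part of `dc.molnet`.

```python
from types import SimpleNamespace


def list_loaders(module, prefix="load_"):
    """Return the sorted names of callable attributes that start with `prefix`."""
    return sorted(
        name
        for name in dir(module)
        if name.startswith(prefix) and callable(getattr(module, name))
    )


# Hypothetical stand-in for dc.molnet, for illustration only.
fake_molnet = SimpleNamespace(
    load_delaney=lambda: None,
    load_tox21=lambda: None,
    reload_cache=lambda: None,     # contains "load_" but does not start with it
    load_defaults="not callable",  # starts with "load_" but is not a function
)

loaders = list_loaders(fake_molnet)
print(loaders)
```

Passing `dc.molnet` in place of `fake_molnet` should give the same kind of list as the cell above, provided every loader there is a plain callable.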

In [11]:
# Here is a list of the types of datasets:
# 1. Quantum Mechanical Datasets
  # QM property prediction tasks
# 2. Physical Chemistry Datasets
  # Physical property prediction tasks
# 3. Chemical Reaction Datasets
  # Computational retrosynthesis / forward synthesis
# 4. Biochemical/Biophysical Datasets
  # Quantities like binding affinity of compounds to proteins
# 5. Molecular Catalog Datasets
  # Molecular datasets which have no associated properties
  # beyond the raw SMILES formula or structure
# 6. Physiology Datasets
  # How molecules interact with human patients
# 7. Structural Biology Datasets
  # Contains 3D structures of macromolecules along with associated properties
# 8. Microscopy Datasets
  # Contains microscopy images, typically of cell lines
# 9. Materials Properties Datasets
  # Compute properties of various materials


In [13]:
# Task names, i.e. the labels to predict; the Delaney set has a single task
tasks

['measured log solubility in mols per litre']

In [14]:
# Take a look at datasets
datasets

(<DiskDataset X.shape: (902,), y.shape: (902, 1), w.shape: (902, 1), ids: ['Clc1c(Cl)c(Cl)c(N(=O)=O)c(Cl)c1Cl' 'Clc1ccccc1N(=O)=O' 'c1ccc2ncccc2c1'
  ... 'CC(C)Nc1nc(Cl)nc(NC(C)C)n1' 'CCOc1ccc(NC(N)=O)cc1'
  'Cn1c(=O)n(C)c2nc[nH]c2c1=O'], task_names: ['measured log solubility in mols per litre']>,
 <DiskDataset X.shape: (113,), y.shape: (113, 1), w.shape: (113, 1), ids: ['C1OC(O)C(O)C(O)C1O' 'FC(F)(F)C(Cl)Br '
  'CC(=O)N(S(=O)c1ccc(N)cc1)c2onc(C)c2C ' ... 'O=N(=O)c1cc(Cl)c(Cl)cc1'
  'Cc1c[nH]c2ccccc12 ' 'CNC(=O)Oc1cc(C)cc(C)c1'], task_names: ['measured log solubility in mols per litre']>,
 <DiskDataset X.shape: (113,), y.shape: (113, 1), w.shape: (113, 1), ids: ['COP(=O)(OC)OC(=CCl)c1cc(Cl)c(Cl)cc1Cl' 'NC(=O)NC1NC(=O)NC1=O '
  'Cc1ccc(cc1)S(=O)(=O)N' ... 'OCC(NC(=O)C(Cl)Cl)C(O)c1ccc(cc1)N(=O)=O'
  'Oc1c(Cl)ccc(Cl)c1Cl' 'ClCCC#N '], task_names: ['measured log solubility in mols per litre']>)

In [15]:
# Split out the tuple of 3 datasets
train, valid, test = datasets

In [16]:
train

<DiskDataset X.shape: (902,), y.shape: (902, 1), w.shape: (902, 1), ids: ['Clc1c(Cl)c(Cl)c(N(=O)=O)c(Cl)c1Cl' 'Clc1ccccc1N(=O)=O' 'c1ccc2ncccc2c1'
 ... 'CC(C)Nc1nc(Cl)nc(NC(C)C)n1' 'CCOc1ccc(NC(N)=O)cc1'
 'Cn1c(=O)n(C)c2nc[nH]c2c1=O'], task_names: ['measured log solubility in mols per litre']>

In [17]:
valid

<DiskDataset X.shape: (113,), y.shape: (113, 1), w.shape: (113, 1), ids: ['C1OC(O)C(O)C(O)C1O' 'FC(F)(F)C(Cl)Br '
 'CC(=O)N(S(=O)c1ccc(N)cc1)c2onc(C)c2C ' ... 'O=N(=O)c1cc(Cl)c(Cl)cc1'
 'Cc1c[nH]c2ccccc12 ' 'CNC(=O)Oc1cc(C)cc(C)c1'], task_names: ['measured log solubility in mols per litre']>

In [18]:
test

<DiskDataset X.shape: (113,), y.shape: (113, 1), w.shape: (113, 1), ids: ['COP(=O)(OC)OC(=CCl)c1cc(Cl)c(Cl)cc1Cl' 'NC(=O)NC1NC(=O)NC1=O '
 'Cc1ccc(cc1)S(=O)(=O)N' ... 'OCC(NC(=O)C(Cl)Cl)C(O)c1ccc(cc1)N(=O)=O'
 'Oc1c(Cl)ccc(Cl)c1Cl' 'ClCCC#N '], task_names: ['measured log solubility in mols per litre']>

In [19]:
train.X[0]

<deepchem.feat.mol_graphs.ConvMol at 0x7917f150f2e0>

In [21]:
# Take a look at the transformer field
transformers

[<deepchem.trans.transformers.NormalizationTransformer at 0x79194809cac0>]

In [22]:
# See that one transformer was applied,
# the dc.trans.transformers.NormalizationTransformer
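
To make the idea of a normalization transformer concrete, here is a minimal pure-Python sketch of z-score normalization and its inverse. It assumes the transformer is applied to the labels `y` and uses the population standard deviation; the library's exact conventions may differ. The label values are made up for illustration, and `fit_normalizer`, `transform` and `untransform` are hypothetical helper names, not DeepChem functions.

```python
from statistics import fmean, pstdev


def fit_normalizer(values):
    """Compute the mean and population standard deviation of the labels."""
    return fmean(values), pstdev(values)


def transform(values, mean, std):
    """Rescale the values to zero mean and unit standard deviation."""
    return [(v - mean) / std for v in values]


def untransform(values, mean, std):
    """Map normalized values back to the original units."""
    return [v * std + mean for v in values]


# Made-up log-solubility labels, for illustration only.
y = [-0.77, -3.30, -2.06, -7.87, -1.33]

mean, std = fit_normalizer(y)
y_norm = transform(y, mean, std)
y_back = untransform(y_norm, mean, std)

print([round(v, 3) for v in y_norm])
print([round(v, 2) for v in y_back])
```

A model trained on normalized labels predicts in normalized units, so the inverse mapping is needed before reading predictions as log solubility.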

The 'featurizer' and 'splitter' keyword arguments each accept a string. Common choices for 'featurizer' are 'ECFP', 'GraphConv', 'Weave' and 'smiles2img', corresponding to the dc.feat.CircularFingerprint, dc.feat.ConvMolFeaturizer, dc.feat.WeaveFeaturizer and dc.feat.SmilesToImage featurizers. Common choices for 'splitter' are None, 'index', 'random', 'scaffold' and 'stratified', corresponding to no split, dc.splits.IndexSplitter, dc.splits.RandomSplitter, dc.splits.ScaffoldSplitter and dc.splits.SingletaskStratifiedSplitter. Splitters have not been covered in detail yet, but intuitively they partition a dataset according to different criteria.
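
As a rough illustration of what the 'index' and 'random' options do, the sketch below partitions row indices into train, validation and test sets. It assumes an 80/10/10 split and integer truncation at the cutoffs; `split_indices` is a hypothetical helper, not the DeepChem implementation.

```python
import random


def split_indices(n, frac_train=0.8, frac_valid=0.1, shuffle=False, seed=0):
    """Split range(n) into train/valid/test index lists.

    shuffle=False keeps the original order (an index-style split);
    shuffle=True permutes the indices first (a random-style split).
    """
    indices = list(range(n))
    if shuffle:
        random.Random(seed).shuffle(indices)
    train_end = int(frac_train * n)
    valid_end = int((frac_train + frac_valid) * n)
    return indices[:train_end], indices[train_end:valid_end], indices[valid_end:]


# Total number of molecules implied by the Delaney split sizes shown earlier.
n_total = 902 + 113 + 113

train_idx, valid_idx, test_idx = split_indices(n_total, shuffle=True)
sizes = (len(train_idx), len(valid_idx), len(test_idx))
print(sizes)  # (902, 113, 113)
```

Under these assumptions the subset sizes match the shapes reported for the Delaney datasets earlier in the notebook.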



In [23]:
# Reload Delaney with ECFP fingerprints and a scaffold split.
# Instead of these strings, you can also pass any Featurizer or Splitter object.
tasks, datasets, transformers = dc.molnet.load_delaney(featurizer="ECFP", splitter="scaffold")

In [24]:
(train, valid, test) = datasets

In [25]:
train

<DiskDataset X.shape: (902, 1024), y.shape: (902, 1), w.shape: (902, 1), ids: ['CC(C)=CCCC(C)=CC(=O)' 'CCCC=C' 'CCCCCCCCCCCCCC' ...
 'Nc2cccc3nc1ccccc1cc23 ' 'C1CCCCCC1' 'OC1CCCCCC1'], task_names: ['measured log solubility in mols per litre']>
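
The idea behind a scaffold split is that molecules sharing a core structure stay in the same subset, so the test set probes generalization to unseen scaffolds. The sketch below shows that grouping logic in pure Python. It assumes the scaffold of each molecule is already known (in practice a cheminformatics toolkit would compute it), and the scaffold labels, the 80/10/10 fractions and the `group_split` helper are all illustrative assumptions rather than the library's code.

```python
from collections import defaultdict


def group_split(group_keys, frac_train=0.8, frac_valid=0.1):
    """Assign whole groups to train, then valid, then test, largest groups first."""
    groups = defaultdict(list)
    for idx, key in enumerate(group_keys):
        groups[key].append(idx)

    # Largest groups first; ties are broken by the first index for determinism.
    ordered = sorted(groups.values(), key=lambda g: (-len(g), g[0]))

    n = len(group_keys)
    train_cutoff = frac_train * n
    valid_cutoff = (frac_train + frac_valid) * n

    train, valid, test = [], [], []
    for group in ordered:
        if len(train) + len(group) <= train_cutoff:
            train.extend(group)
        elif len(train) + len(valid) + len(group) <= valid_cutoff:
            valid.extend(group)
        else:
            test.extend(group)
    return train, valid, test


# Made-up scaffold labels for ten molecules, for illustration only.
scaffolds = ["benzene"] * 5 + ["pyridine"] * 3 + ["acyclic"] + ["indole"]

train_ids, valid_ids, test_ids = group_split(scaffolds)
print(len(train_ids), len(valid_ids), len(test_ids))  # 8 1 1
```

Because groups are never divided, the subset sizes only approximate the requested fractions when group sizes do not line up with the cutoffs.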