
# DeepChem tutorial

[reference1](https://github.com/StillWork/AIDD-2208-add/blob/main/c_83_2_Datasets_MoleculeNet.ipynb)  
[reference2](https://github.com/deepchem/deepchem/tree/master/examples/tutorials)

In [1]:
!pip install deepchem

Looking in indexes: https://pypi.org/simple, https://us-python.pkg.dev/colab-wheels/public/simple/
Collecting deepchem
  Downloading deepchem-2.6.1-py3-none-any.whl (608 kB)
Collecting rdkit-pypi
  Downloading rdkit_pypi-2022.3.5-cp37-cp37m-manylinux_2_17_x86_64.manylinux2014_x86_64.whl (36.8 MB)
Installing collected packages: rdkit-pypi, deepchem
Successfully installed deepchem-2.6.1 rdkit-pypi-2022.3.5


In [2]:
import rdkit
from rdkit import Chem
from rdkit.Chem import AllChem, Draw, Descriptors

from rdkit.Chem.Draw import IPythonConsole

import deepchem as dc

## Understanding `Datasets` in DeepChem
- `tasks, datasets, transformers = dc.molnet.load_delaney()` (the other `dc.molnet.load_*` loaders follow the same pattern)
- load_delaney() options
    - **featurizer**  
        - Options include 'ECFP', 'GraphConv', 'Weave', and 'smiles2img', which select the `dc.feat.CircularFingerprint`, `dc.feat.ConvMolFeaturizer`, `dc.feat.WeaveFeaturizer`, and `dc.feat.SmilesToImage` featurizers, respectively.

    - **splitter**  
        - Options include `None`, 'index', 'random', 'scaffold', and 'stratified', which select no split, `dc.splits.IndexSplitter`, `dc.splits.RandomSplitter`, `dc.splits.ScaffoldSplitter`, and `dc.splits.SingletaskStratifiedSplitter`, respectively.

- load_delaney() return values
 - `tasks`: the names of the task or tasks, i.e., the prediction targets
 - `datasets`: a tuple of `dc.data.Dataset` objects, split into `(train, valid, test)`
 - `transformers`: a list of `dc.trans.Transformer` objects describing the preprocessing applied to the data

- `DiskDataset` is a dataset that has been saved to disk.  
- `NumpyDataset` is an in-memory dataset that holds all the data in NumPy arrays.  
- `ImageDataset` stores image files on disk. 

- Every Dataset stores a list of *samples*.  In this case, each sample is a molecule.  
- For every sample the dataset stores the following information.

 - The *features*, referred to as `X`.  
 - The *labels*, referred to as `y`.  
 - The *weights*, referred to as `w` - This can be used to indicate that some data values are more important than others.  
 - An *ID*, a unique identifier for the sample.  This can be anything as long as it is unique.  In this dataset the ID is a SMILES string describing the molecule.

 - Notice that, for the test set, `X`, `y`, and `w` all have 113 as the size of their first dimension: one entry per sample.

 - `task_names`.  Some datasets contain multiple pieces of information for each sample.  For example, if a sample represents a molecule, the dataset might record the results of several different experiments on that molecule.  
 - This dataset has only a single task: "measured log solubility in mols per litre".  
 - Also notice that `y` and `w` each have shape (113, 1).  The second dimension of these arrays usually matches the number of tasks.
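
The shape conventions above can be illustrated with plain NumPy arrays. This is only a sketch with made-up toy values, not DeepChem code: the array sizes, the SMILES strings, and the `weighted_mse` helper are all assumptions chosen for illustration. It shows that every array shares its first dimension, and how a weight of 0 removes a sample from a loss.

```python
import numpy as np

n_samples, n_features, n_tasks = 4, 3, 1  # toy sizes, chosen only for illustration

X = np.arange(n_samples * n_features, dtype=float).reshape(n_samples, n_features)
y = np.array([[0.5], [-1.0], [2.0], [0.0]])            # shape (n_samples, n_tasks)
w = np.array([[1.0], [1.0], [0.0], [1.0]])             # weight 0 masks the third sample
ids = np.array(["CCO", "CCC", "c1ccccc1", "CC(C)Cl"])  # unique SMILES strings as IDs

# Every array shares the same first dimension: one row per sample.
assert X.shape[0] == y.shape[0] == w.shape[0] == ids.shape[0]


def weighted_mse(y_true, y_pred, weights):
    """Mean squared error in which each entry counts in proportion to its weight."""
    return float(np.sum(weights * (y_true - y_pred) ** 2) / np.sum(weights))


y_pred = np.zeros_like(y)  # a trivial "model" that always predicts 0
loss = weighted_mse(y, y_pred, w)
print(y.shape, w.shape, loss)
```

Because the third sample has weight 0, its large error does not contribute to the loss. Multitask datasets can use the same mechanism to ignore labels that were never measured.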


In [3]:
task, datasets, transforms = dc.molnet.load_delaney(featurizer='GraphConv')

In [5]:
print(f'task = {task}')
print(f'transforms = {transforms}')

task = ['measured log solubility in mols per litre']
transforms = [<deepchem.trans.transformers.NormalizationTransformer object at 0x7feb915eebd0>]
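
The `NormalizationTransformer` listed above suggests that the labels were rescaled before training, most likely to zero mean and unit variance. The sketch below shows that idea in plain NumPy. It is not DeepChem's implementation, and the raw solubility values are made up for illustration.

```python
import numpy as np

# Made-up raw log-solubility values, used only to illustrate the transform.
y_raw = np.array([[-0.77], [-3.30], [-2.06], [-7.87], [-1.33]])

mean = y_raw.mean(axis=0)
std = y_raw.std(axis=0)

y_norm = (y_raw - mean) / std  # z-scores: what a model would train on
y_back = y_norm * std + mean   # undo the scaling to recover the original units

print(y_norm.ravel())
print(np.allclose(y_back, y_raw))
```

Predictions made on normalized labels are in z-score units, so they need the inverse step before they can be read as log solubility. DeepChem transformers provide an `untransform` method for this purpose.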


In [7]:
train_dataset, val_dataset, test_dataset = datasets

In [8]:
train_dataset

<DiskDataset X.shape: (902,), y.shape: (902, 1), w.shape: (902, 1), ids: ['CC(C)=CCCC(C)=CC(=O)' 'CCCC=C' 'CCCCCCCCCCCCCC' ...
 'Nc2cccc3nc1ccccc1cc23 ' 'C1CCCCCC1' 'OC1CCCCCC1'], task_names: ['measured log solubility in mols per litre']>
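
The 902 training samples reported above are consistent with an 80/10/10 train/valid/test split. The sketch below reproduces that arithmetic with a simple random split in NumPy. The total of 1128 molecules, the fractions, and the `split_indices` helper are assumptions for illustration. The loader may well use a scaffold-based split instead, which groups molecules by core structure; a random split is used here only because it needs no chemistry toolkit.

```python
import numpy as np


def split_indices(n_samples, frac_train=0.8, frac_valid=0.1, seed=0):
    """Shuffle sample indices and cut them into train/valid/test parts."""
    rng = np.random.default_rng(seed)
    order = rng.permutation(n_samples)
    train_end = int(frac_train * n_samples)
    valid_end = int((frac_train + frac_valid) * n_samples)
    return order[:train_end], order[train_end:valid_end], order[valid_end:]


# 1128 is the assumed total size of the ESOL (Delaney) dataset.
train_idx, valid_idx, test_idx = split_indices(1128)
print(len(train_idx), len(valid_idx), len(test_idx))
```

With these fractions the sketch yields 902, 113, and 113 indices, and no index appears in more than one part.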

In [11]:
train_dataset.to_dataframe()

| | X | y | w | ids |
|---|---|---|---|---|
| 0 | `<deepchem.feat.mol_graphs.ConvMol object at 0x...` | 0.390413 | 1.0 | CC(C)=CCCC(C)=CC(=O) |
| 1 | `<deepchem.feat.mol_graphs.ConvMol object at 0x...` | 0.090421 | 1.0 | CCCC=C |
| 2 | `<deepchem.feat.mol_graphs.ConvMol object at 0x...` | -2.464346 | 1.0 | CCCCCCCCCCCCCC |
| 3 | `<deepchem.feat.mol_graphs.ConvMol object at 0x...` | 0.704920 | 1.0 | CC(C)Cl |
| 4 | `<deepchem.feat.mol_graphs.ConvMol object at 0x...` | 1.159746 | 1.0 | CCC(C)CO |
| ... | ... | ... | ... | ... |
| 897 | `<deepchem.feat.mol_graphs.ConvMol object at 0x...` | -0.649881 | 1.0 | CC(=O)OCC(=O)C3(O)CCC4C2CCC1=CC(=O)CCC1(C)C2C(... |
| 898 | `<deepchem.feat.mol_graphs.ConvMol object at 0x...` | -0.388598 | 1.0 | c3ccc2nc1ccccc1cc2c3 |
| 899 | `<deepchem.feat.mol_graphs.ConvMol object at 0x...` | -0.654719 | 1.0 | Nc2cccc3nc1ccccc1cc23 |
| 900 | `<deepchem.feat.mol_graphs.ConvMol object at 0x...` | -0.311180 | 1.0 | C1CCCCCC1 |


In [12]:
for X, y, w, ids in test_dataset.itersamples():
    print(y, ids)
    break

[-1.60114461] c1cc2ccc3cccc4ccc(c1)c2c34


In [13]:
for X, y, w, ids in test_dataset.iterbatches(batch_size=50):
    print(y.shape)

(50, 1)
(50, 1)
(13, 1)
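
The three shapes printed above follow from slicing 113 test samples into chunks of 50: two full batches and a final partial one. A minimal NumPy sketch of that behaviour is given below; the `iter_batches` helper is hypothetical and only mimics the slicing, not DeepChem's `iterbatches` itself.

```python
import numpy as np


def iter_batches(y, batch_size):
    """Yield consecutive slices of y; the final slice may be shorter."""
    for start in range(0, len(y), batch_size):
        yield y[start:start + batch_size]


y_test = np.zeros((113, 1))  # placeholder labels with the test set's shape
batch_shapes = [batch.shape for batch in iter_batches(y_test, batch_size=50)]
print(batch_shapes)
```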


## Generation of Custom Datasets

In [15]:
import numpy as np

X = np.random.random((10, 5))
y = np.random.random((10, 2))
dataset = dc.data.NumpyDataset(X=X, y=y)
dataset.to_dataframe()

| | X1 | X2 | X3 | X4 | X5 | y1 | y2 | w | ids |
|---|---|---|---|---|---|---|---|---|---|
| 0 | 0.290688 | 0.334935 | 0.586817 | 0.721915 | 0.168566 | 0.18641 | 0.626511 | 1.0 | 0 |
| 1 | 0.084236 | 0.022929 | 0.564743 | 0.162688 | 0.218657 | 0.518383 | 0.056544 | 1.0 | 1 |
| 2 | 0.67817 | 0.719031 | 0.950113 | 0.963679 | 0.798815 | 0.775811 | 0.915999 | 1.0 | 2 |
| 3 | 0.220355 | 0.678659 | 0.967092 | 0.293465 | 0.242667 | 0.348655 | 0.912953 | 1.0 | 3 |
| 4 | 0.311997 | 0.427651 | 0.369925 | 0.309545 | 0.650012 | 0.452896 | 0.397915 | 1.0 | 4 |
| 5 | 0.767783 | 0.956806 | 0.856695 | 0.498917 | 0.035278 | 0.446732 | 0.333616 | 1.0 | 5 |
| 6 | 0.608119 | 0.940144 | 0.705194 | 0.406852 | 0.700647 | 0.225698 | 0.431483 | 1.0 | 6 |
| 7 | 0.341454 | 0.997788 | 0.208238 | 0.109503 | 0.42364 | 0.491798 | 0.377888 | 1.0 | 7 |
| 8 | 0.120179 | 0.70548 | 0.9034 | 0.633186 | 0.925771 | 0.175898 | 0.208549 | 1.0 | 8 |
| 9 | 0.007994 | 0.791005 | 0.586123 | 0.897975 | 0.767511 | 0.659343 | 0.984535 | 1.0 | 9 |


## MoleculeNet Dataset Categories

- The original MoleculeNet paper [1] provides details about a subset of these datasets, marked "V1" below. All remaining datasets are marked "V2".

## Quantum Mechanical Datasets

- These datasets contain various quantum mechanical property prediction tasks. The current set includes QM7, QM7b, QM8, and QM9. The associated loaders are:

- [`dc.molnet.load_qm7`](https://deepchem.readthedocs.io/en/latest/moleculenet.html#deepchem.molnet.load_qm7): V1
- [`dc.molnet.load_qm7b_from_mat`](https://deepchem.readthedocs.io/en/latest/moleculenet.html#deepchem.molnet.load_qm7): V1
- [`dc.molnet.load_qm8`](https://deepchem.readthedocs.io/en/latest/moleculenet.html#deepchem.molnet.load_qm8): V1
- [`dc.molnet.load_qm9`](https://deepchem.readthedocs.io/en/latest/moleculenet.html#deepchem.molnet.load_qm9): V1

## Physical Chemistry Datasets

- These datasets contain tasks for predicting various physical properties of molecules.

- [`dc.molnet.load_delaney`](https://deepchem.readthedocs.io/en/latest/moleculenet.html#deepchem.molnet.load_delaney): V1. This dataset is also referred to as ESOL in the original paper.
- [`dc.molnet.load_sampl`](https://deepchem.readthedocs.io/en/latest/moleculenet.html#deepchem.molnet.load_sampl): V1. This dataset is also referred to as FreeSolv in the original paper.
- [`dc.molnet.load_lipo`](https://deepchem.readthedocs.io/en/latest/moleculenet.html#deepchem.molnet.load_lipo): V1. This dataset is also referred to as Lipophilicity in the original paper.
- [`dc.molnet.load_thermosol`](https://deepchem.readthedocs.io/en/latest/moleculenet.html#deepchem.molnet.load_thermosol): V2.
- [`dc.molnet.load_hppb`](https://deepchem.readthedocs.io/en/latest/moleculenet.html#deepchem.molnet.load_hppb): V2.
- [`dc.molnet.load_hopv`](https://deepchem.readthedocs.io/en/latest/moleculenet.html#deepchem.molnet.load_hopv): V2. This dataset is drawn from a recent publication [3].

## Chemical Reaction Datasets

- These are chemical reaction datasets for use in computational retrosynthesis and forward synthesis.

- [`dc.molnet.load_uspto`](https://deepchem.readthedocs.io/en/latest/moleculenet.html#deepchem.molnet.load_uspto)

## Biochemical/Biophysical Datasets

- These datasets measure biochemical or biophysical properties, e.g., the binding affinity of compounds to proteins.

- [`dc.molnet.load_pcba`](https://deepchem.readthedocs.io/en/latest/moleculenet.html#deepchem.molnet.load_pcba): V1
- [`dc.molnet.load_nci`](https://deepchem.readthedocs.io/en/latest/moleculenet.html#deepchem.molnet.load_nci): V2.
- [`dc.molnet.load_muv`](https://deepchem.readthedocs.io/en/latest/moleculenet.html#deepchem.molnet.load_muv): V1
- [`dc.molnet.load_hiv`](https://deepchem.readthedocs.io/en/latest/moleculenet.html#deepchem.molnet.load_hiv): V1
- [`dc.molnet.load_ppb`](https://deepchem.readthedocs.io/en/latest/moleculenet.html#ppb-datasets): V2.
- [`dc.molnet.load_bace_classification`](https://deepchem.readthedocs.io/en/latest/moleculenet.html#deepchem.molnet.load_bace_classification): V1. This loader loads the classification task for the BACE dataset from the original MoleculeNet paper.
- [`dc.molnet.load_bace_regression`](https://deepchem.readthedocs.io/en/latest/moleculenet.html#deepchem.molnet.load_bace_regression): V1. This loader loads the regression task for the BACE dataset from the original MoleculeNet paper.
- [`dc.molnet.load_kaggle`](https://deepchem.readthedocs.io/en/latest/moleculenet.html#deepchem.molnet.load_kaggle): V2. This dataset is from Merck's drug discovery kaggle contest and is described in [4].
- [`dc.molnet.load_factors`](https://deepchem.readthedocs.io/en/latest/moleculenet.html#deepchem.molnet.load_factors): V2. This dataset is from [4].
- [`dc.molnet.load_uv`](https://deepchem.readthedocs.io/en/latest/moleculenet.html#deepchem.molnet.load_uv): V2. This dataset is from [4].
- [`dc.molnet.load_kinase`](https://deepchem.readthedocs.io/en/latest/moleculenet.html#deepchem.molnet.load_kinase): V2. This dataset is from [4].

## Molecular Catalog Datasets

These datasets provide molecular datasets which have no associated properties beyond the raw SMILES formula or structure. These types of datasets are useful for generative modeling tasks.

- [`dc.molnet.load_zinc15`](https://deepchem.readthedocs.io/en/latest/moleculenet.html#deepchem.molnet.load_zinc15): V2
- [`dc.molnet.load_chembl`](https://deepchem.readthedocs.io/en/latest/moleculenet.html#deepchem.molnet.load_chembl): V2
- [`dc.molnet.load_chembl25`](https://deepchem.readthedocs.io/en/latest/moleculenet.html#chembl25-datasets): V2

## Physiology Datasets

These datasets measure physiological properties of how molecules interact with human patients.

- [`dc.molnet.load_bbbp`](https://deepchem.readthedocs.io/en/latest/moleculenet.html#deepchem.molnet.load_bbbp): V1
- [`dc.molnet.load_tox21`](https://deepchem.readthedocs.io/en/latest/moleculenet.html#deepchem.molnet.load_tox21): V1
- [`dc.molnet.load_toxcast`](https://deepchem.readthedocs.io/en/latest/moleculenet.html#deepchem.molnet.load_toxcast): V1
- [`dc.molnet.load_sider`](https://deepchem.readthedocs.io/en/latest/moleculenet.html#deepchem.molnet.load_sider): V1
- [`dc.molnet.load_clintox`](https://deepchem.readthedocs.io/en/latest/moleculenet.html#deepchem.molnet.load_clintox): V1
- [`dc.molnet.load_clearance`](https://deepchem.readthedocs.io/en/latest/moleculenet.html#deepchem.molnet.load_clearance): V2.

## Structural Biology Datasets

These datasets contain 3D structures of macromolecules along with associated properties.

- [`dc.molnet.load_pdbbind`](https://deepchem.readthedocs.io/en/latest/moleculenet.html#deepchem.molnet.load_pdbbind): V1


## Microscopy Datasets

These datasets contain microscopy image datasets, typically of cell lines. These datasets were not in the original MoleculeNet paper.

- [`dc.molnet.load_bbbc001`](https://deepchem.readthedocs.io/en/latest/moleculenet.html#deepchem.molnet.load_bbbc001): V2
- [`dc.molnet.load_bbbc002`](https://deepchem.readthedocs.io/en/latest/moleculenet.html#deepchem.molnet.load_bbbc002): V2
- [`dc.molnet.load_cell_counting`](https://deepchem.readthedocs.io/en/latest/moleculenet.html#cell-counting-datasets): V2

## Materials Properties Datasets

These datasets compute properties of various materials.

- [`dc.molnet.load_bandgap`](https://deepchem.readthedocs.io/en/latest/moleculenet.html#deepchem.molnet.load_bandgap): V2
- [`dc.molnet.load_perovskite`](https://deepchem.readthedocs.io/en/latest/moleculenet.html#deepchem.molnet.load_perovskite): V2
- [`dc.molnet.load_mp_formation_energy`](https://deepchem.readthedocs.io/en/latest/moleculenet.html#deepchem.molnet.load_mp_formation_energy): V2
- [`dc.molnet.load_mp_metallicity`](https://deepchem.readthedocs.io/en/latest/moleculenet.html#deepchem.molnet.load_mp_metallicity): V2

[1] Wu, Zhenqin, et al. "MoleculeNet: a benchmark for molecular machine learning." Chemical science 9.2 (2018): 513-530.

[3] Lopez, Steven A., et al. "The Harvard organic photovoltaic dataset." Scientific data 3.1 (2016): 1-7.

[4] Ramsundar, Bharath, et al. "Is multitask deep learning practical for pharma?." Journal of chemical information and modeling 57.8 (2017): 2068-2076.

## Dataset and prediction using simple DC models

The “Toxicology in the 21st Century” (Tox21) initiative created a public database measuring toxicity of compounds, which has been used in the 2014 Tox21 Data Challenge. This dataset contains qualitative toxicity measurements for 8k compounds on 12 different targets, including nuclear receptors and stress response pathways.

In [18]:
tasks, datasets, transforms = dc.molnet.load_tox21(featurizer='ECFP')
train_dataset, val_dataset, test_dataset = datasets
len(train_dataset)  # number of training samples

6264

In [20]:
train_dataset.to_dataframe().shape

(6264, 1049)

In [22]:
train_dataset.y.shape

(6264, 12)
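
The 1049 columns of the DataFrame can be accounted for by simple arithmetic. The sketch below assumes a 1024-bit ECFP fingerprint and one weight column per task; neither is confirmed by the outputs above, but together they reproduce the observed width.

```python
n_features = 1024  # ECFP bit-vector length (assumed fingerprint size)
n_tasks = 12       # Tox21 targets, matching y.shape[1]

# One column per feature, per task label, per task weight, plus one ID column.
n_columns = n_features + n_tasks + n_tasks + 1
print(n_columns)
```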

In [36]:
# One shared 512-unit hidden layer feeding 12 classification tasks
model = dc.models.MultitaskClassifier(n_tasks=12, n_features=1024, layer_sizes=[512])
model.fit(train_dataset, nb_epoch=10)  # returns a training loss value

0.5265356381734212
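
To make the shape flow of such a multitask classifier concrete, here is a rough forward pass in plain NumPy: 1024 input bits, one shared 512-unit ReLU layer, and a two-class softmax for each of the 12 tasks. This is a sketch under stated assumptions, not DeepChem's implementation: the weights are random, two classes per task are assumed, and details such as dropout and the training loop are left out.

```python
import numpy as np

rng = np.random.default_rng(0)
n_samples, n_features, n_hidden, n_tasks, n_classes = 4, 1024, 512, 12, 2

# Random fingerprints and random weights; a trained model would learn W1, b1, W2, b2.
X = rng.integers(0, 2, size=(n_samples, n_features)).astype(float)
W1 = rng.normal(scale=0.02, size=(n_features, n_hidden))
b1 = np.zeros(n_hidden)
W2 = rng.normal(scale=0.02, size=(n_hidden, n_tasks * n_classes))
b2 = np.zeros(n_tasks * n_classes)

hidden = np.maximum(X @ W1 + b1, 0.0)  # shared ReLU layer
logits = (hidden @ W2 + b2).reshape(n_samples, n_tasks, n_classes)

# Softmax over the class axis gives per-task class probabilities.
shifted = logits - logits.max(axis=-1, keepdims=True)
probs = np.exp(shifted) / np.exp(shifted).sum(axis=-1, keepdims=True)
print(probs.shape)
```

All tasks share the hidden representation, which is what lets a multitask model reuse what it learns about one target when predicting another.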

In [38]:
metric1 = dc.metrics.Metric(dc.metrics.roc_auc_score)
metric2 = dc.metrics.Metric(dc.metrics.accuracy_score)
print('training set score:', model.evaluate(train_dataset, [metric1, metric2]))
print('test set score:', model.evaluate(test_dataset, [metric1, metric2]))

training set score: {'roc_auc_score': 0.9539797618856087, 'accuracy_score': 0.8621753937845891}
test set score: {'roc_auc_score': 0.6908563865619751, 'accuracy_score': 0.7955994897959183}
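
For reference, the two metrics reported above can be sketched for a single task in plain NumPy. The `roc_auc` and `accuracy` helpers and the toy labels are illustrative assumptions, not the functions DeepChem calls internally. ROC AUC is computed here as the fraction of (positive, negative) pairs that the scores rank correctly.

```python
import numpy as np


def roc_auc(y_true, scores):
    """Fraction of (positive, negative) pairs ranked correctly; ties count half."""
    pos = scores[y_true == 1]
    neg = scores[y_true == 0]
    diff = pos[:, None] - neg[None, :]
    return float(((diff > 0).sum() + 0.5 * (diff == 0).sum()) / diff.size)


def accuracy(y_true, scores, threshold=0.5):
    """Share of samples whose thresholded score matches the label."""
    return float(np.mean((scores >= threshold).astype(int) == y_true))


# Toy labels and predicted probabilities for a single task.
y_true = np.array([0, 0, 1, 1])
scores = np.array([0.1, 0.4, 0.35, 0.8])

auc = roc_auc(y_true, scores)
acc = accuracy(y_true, scores)
print(auc, acc)
```

The gap between the training and test ROC AUC above (about 0.95 versus 0.69) suggests that the model overfits the training set. Accuracy is less informative here: when most labels are negative, as is likely for toxicity data, a model can score well on accuracy while ranking actives poorly.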
