# TDC @ DGL GNN User Group Demo

Therapeutics Data Commons (TDC) is the first unifying framework to systematically access and evaluate machine learning across the entire range of therapeutics. It supports the entire lifecycle of ML research for therapeutics. In this demo, we walk through the key features of TDC.

TDC Website: [tdcommons.ai](https://tdcommons.ai/)

## Installation

TDC can be installed hassle-free via pip. The core package requires only minimal external dependencies.

In [None]:
!pip install PyTDC

## Section 1: TDC Data Loaders

TDC provides three-line data loaders for 66 datasets across 22 tasks. Here, we demonstrate how to use them.

Detailed documentation for each task and dataset is available on our website: [https://tdcommons.ai/overview/](https://tdcommons.ai/overview/)

### Sample Dataset: Single-instance Problem

To retrieve the hERG dataset in the Tox task under the single-instance prediction problem: 

In [1]:
from tdc.single_pred import Tox
data = Tox(name = 'hERG')
data.get_data().head(2)

Downloading...
100%|██████████| 50.2k/50.2k [00:00<00:00, 288kiB/s]
Loading...
Done!


Unnamed: 0,Drug_ID,Drug,Y
0,DEMETHYLASTEMIZOLE,Oc1ccc(CCN2CCC(Nc3nc4ccccc4n3Cc3ccc(F)cc3)CC2)cc1,1.0
1,GBR-12909,Fc1ccc(C(OCC[NH+]2CC[NH+](CCCc3ccccc3)CC2)c2cc...,1.0


The `data` object provides various utility functions, such as generating data splits under different splitting schemes. Here, we use a scaffold split:

In [2]:
split = data.get_split(method = 'scaffold', 
                       seed = 42, 
                       frac = [0.7, 0.1, 0.2])
split['train'].head(2)

100%|██████████| 655/655 [00:00<00:00, 1455.23it/s]


Unnamed: 0,Drug_ID,Drug,Y
0,DELAVIRDINE,CC(C)Nc1cccnc1N1CCN(C(=O)C2=CC3=C[C@H](NS(C)(=...,0.0
1,SOPHOCARPINE,O=C1C=CC[C@@H]2[C@H]3CCCN4CCC[C@H](CN12)[C@H]34,0.0
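
To illustrate the idea behind a scaffold split, here is a minimal sketch. It is not TDC's implementation: it assumes every molecule already carries a scaffold label, and both the `scaffold_of` mapping and the `group_split` helper are hypothetical names introduced for illustration. The point is that whole scaffold groups are assigned to a single partition, so structurally similar molecules never end up on both sides of the split.

```python
from collections import defaultdict


def group_split(scaffold_of, frac=(0.7, 0.1, 0.2)):
    """Assign whole scaffold groups to train/valid/test (illustrative only).

    scaffold_of: dict mapping molecule ID -> scaffold label (hypothetical input).
    frac[2] is implicit: whatever fits in neither train nor valid goes to test.
    """
    groups = defaultdict(list)
    for mol_id, scaffold in scaffold_of.items():
        groups[scaffold].append(mol_id)

    # Largest groups first; IDs are sorted so the result is deterministic.
    ordered = sorted((sorted(ids) for ids in groups.values()),
                     key=lambda ids: (-len(ids), ids))

    n = len(scaffold_of)
    train_cap, valid_cap = round(frac[0] * n), round(frac[1] * n)
    split = {"train": [], "valid": [], "test": []}
    for ids in ordered:
        if len(split["train"]) + len(ids) <= train_cap:
            split["train"].extend(ids)
        elif len(split["valid"]) + len(ids) <= valid_cap:
            split["valid"].extend(ids)
        else:
            split["test"].extend(ids)
    return split


# Ten toy molecules spread over four made-up scaffolds.
scaffold_of = {
    "m0": "A", "m1": "A", "m2": "A", "m3": "A", "m4": "A",
    "m5": "B", "m6": "B",
    "m7": "C", "m8": "C",
    "m9": "D",
}
split = group_split(scaffold_of)
print({name: len(ids) for name, ids in split.items()})
```

Because groups are indivisible, the realized partition sizes only approximate the requested fractions.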


### Sample Dataset: Multi-instance Problem

Similarly, to retrieve the GDSC1 dataset for drug response prediction under the multi-instance problem:

In [3]:
from tdc.multi_pred import DrugRes
data = DrugRes(name = 'GDSC1')
data.get_data().head(2)

Downloading...
100%|██████████| 140M/140M [00:07<00:00, 19.5MiB/s] 
Loading...
Done!


Unnamed: 0,Drug_ID,Drug,Cell Line_ID,Cell Line,Y
0,Erlotinib,COCCOC1=C(C=C2C(=C1)C(=NC=N2)NC3=CC=CC(=C3)C#C...,MC-CAR,"[3.23827250519154, 2.98225419469807, 10.235490...",2.395685
1,Erlotinib,COCCOC1=C(C=C2C(=C1)C(=NC=N2)NC3=CC=CC(=C3)C#C...,ES3,"[8.690197905033282, 3.0914731119366, 9.9924871...",3.140923


### Sample Dataset: Generation Problem

Lastly, to retrieve the USPTO-50K dataset for retrosynthesis prediction under the generation problem:

In [4]:
from tdc.generation import RetroSyn
data = RetroSyn(name = 'USPTO-50K')
data.get_data().head(2)

Downloading...
100%|██████████| 5.22M/5.22M [00:00<00:00, 5.49MiB/s]
Loading...
Done!


Unnamed: 0,input,output
0,COC(=O)CCC(=O)c1ccc(OC2CCCCO2)cc1O,C1=COCCC1.COC(=O)CCC(=O)c1ccc(O)cc1O
1,COC(=O)c1cccc(-c2nc3cccnc3[nH]2)c1,COC(=O)c1cccc(C(=O)O)c1.Nc1cccnc1N


You can see that all data are ML-ready!

## Section 2: TDC Data Functions

TDC provides numerous data functions to support ML for therapeutics research! See [https://tdcommons.ai/fct_overview/](https://tdcommons.ai/fct_overview/) for more information.

TDC includes many evaluators for measuring model performance and various realistic data splits. In addition, it provides many data processing helpers. For example, molecule data is provided as SMILES strings, the most accessible format, which you can convert to DGL graph objects:

In [5]:
from tdc.chem_utils import MolConvert
converter = MolConvert(src = 'SMILES', dst = 'DGL')
# Raw string: this SMILES contains a backslash (a bond stereo marker).
converter([r'Clc1ccccc1C2C(=C(/N/C(=C2/C(=O)OCC)COCCN)C)\C(=O)OC',
           'CCCOc1cc2ncnc(Nc3ccc4ncsc4c3)c2cc1S(=O)(=O)C(C)(C)C'])

Using backend: pytorch


[DGLGraph(num_nodes=28, num_edges=58,
          ndata_schemes={}
          edata_schemes={}),
 DGLGraph(num_nodes=31, num_edges=68,
          ndata_schemes={}
          edata_schemes={})]
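
As a conceptual sketch of what such a conversion produces, the snippet below builds the node and edge lists of a molecular graph by hand. The atom and bond lists for ethanol are written out manually as an assumption for illustration; in practice a cheminformatics toolkit parses the SMILES string. Atoms become nodes, and each bond is stored as two directed edges, which is consistent with the even edge counts in the output above.

```python
# Ethanol (SMILES "CCO"), written out by hand as heavy atoms and bonds.
atoms = ["C", "C", "O"]
bonds = [(0, 1), (1, 2)]  # pairs of atom indices

# Store each undirected bond as two directed edges (u -> v and v -> u).
src, dst = [], []
for u, v in bonds:
    src += [u, v]
    dst += [v, u]

num_nodes = len(atoms)
num_edges = len(src)
print(num_nodes, num_edges)  # 3 4
```

If DGL is installed, passing these lists as `dgl.graph((src, dst), num_nodes=num_nodes)` should build the corresponding graph object.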

You can also request the conversion directly when calling a data loader:

In [6]:
from tdc.single_pred import Tox
data = Tox(name = 'hERG', convert_format = 'DGL')
data.get_data().head(2)

Found local copy...
Loading...
Done!


Unnamed: 0,Drug_ID,Drug,Drug_DGL,Y
0,DEMETHYLASTEMIZOLE,Oc1ccc(CCN2CCC(Nc3nc4ccccc4n3Cc3ccc(F)cc3)CC2)cc1,"DGLGraph(num_nodes=33, num_edges=74,\n ...",1.0
1,GBR-12909,Fc1ccc(C(OCC[NH+]2CC[NH+](CCCc3ccccc3)CC2)c2cc...,"DGLGraph(num_nodes=33, num_edges=72,\n ...",1.0


As another example, many multi-instance prediction problems can be formulated as link prediction on a biomedical entity graph, where an interaction between two biomedical entities is an edge. We also provide helper functions to construct this graph from the raw data:

In [7]:
from tdc.multi_pred import DTI
data = DTI(name = 'DAVIS')
output = data.to_graph(threshold = 30, 
                       format = 'dgl', 
                       split = True, 
                       frac = [0.7, 0.1, 0.2], 
                       seed = 42, 
                       order = 'descending')

Downloading...
100%|██████████| 26.3M/26.3M [00:01<00:00, 13.4MiB/s]
Loading...
Done!
The dataset label consists of affinity scores. Binarization using threshold 30 is conducted to construct the positive edges in the network. Adjust the threshold by to_graph(threshold = X)


In [8]:
output['dgl_graph']

DGLGraph(num_nodes=379, num_edges=1474,
         ndata_schemes={}
         edata_schemes={})
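
The binarization step described in the log message can be sketched as follows. This is an illustration under stated assumptions rather than TDC's code: the records are invented, the `to_positive_edges` helper is hypothetical, and the direction of the comparison assumes that lower affinity values mean stronger binding, so pairs below the threshold become positive edges.

```python
# Toy (drug, target, affinity) records; the values are invented for illustration.
records = [
    ("drug_a", "target_x", 12.0),
    ("drug_a", "target_y", 4500.0),
    ("drug_b", "target_x", 28.5),
    ("drug_b", "target_z", 10000.0),
]


def to_positive_edges(records, threshold):
    """Keep pairs whose affinity is below the threshold (assumed: lower = stronger)."""
    return [(drug, target) for drug, target, y in records if y < threshold]


edges = to_positive_edges(records, threshold=30)

# Map entity names to integer node IDs, as graph libraries usually expect.
nodes = sorted({name for edge in edges for name in edge})
index = {name: i for i, name in enumerate(nodes)}
edge_ids = [(index[d], index[t]) for d, t in edges]

print(edges)     # [('drug_a', 'target_x'), ('drug_b', 'target_x')]
print(edge_ids)  # [(0, 2), (1, 2)]
```

Only entities that take part in at least one positive pair appear as nodes in this sketch.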

Another major category in TDC is molecule generation, which aims to generate new molecules with desirable properties. These properties are measured by oracle functions, and TDC provides 17 of them. Molecule generation can be formulated as a graph generation problem.

You can access each oracle in three lines of code. For example, to access the synthetic accessibility (SA) oracle:

In [9]:
from tdc import Oracle
oracle = Oracle(name = 'SA')
oracle(['CC(C)(C)[C@H]1CCc2c(sc(NC(=O)COc3ccc(Cl)cc3)c2C(N)=O)C1',
        'CCNC(=O)c1ccc(NC(=O)N2CC[C@H](C)[C@H](O)C2)c(C)c1',
        'C[C@@H]1CCN(C(=O)CCCc2ccccc2)C[C@@H]1O'])

Downloading Oracle...
100%|██████████| 9.05M/9.05M [00:01<00:00, 5.23MiB/s]
Done!


[2.706977149048555, 2.8548373344538067, 2.659973244931228]
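
An oracle returns one score per input molecule, in input order, so the scores can be paired back with the SMILES strings to rank candidates. The sketch below reuses the three scores printed above and needs no download. It relies on the usual convention that a lower SA score means a molecule is easier to synthesize.

```python
smiles = [
    "CC(C)(C)[C@H]1CCc2c(sc(NC(=O)COc3ccc(Cl)cc3)c2C(N)=O)C1",
    "CCNC(=O)c1ccc(NC(=O)N2CC[C@H](C)[C@H](O)C2)c(C)c1",
    "C[C@@H]1CCN(C(=O)CCCc2ccccc2)C[C@@H]1O",
]
# Scores copied from the oracle output above, in the same order as `smiles`.
scores = [2.706977149048555, 2.8548373344538067, 2.659973244931228]

# Sort ascending: the lowest SA score comes first.
ranked = sorted(zip(scores, smiles))
best_score, best_smiles = ranked[0]
print(best_smiles, round(best_score, 2))
```

A generation loop would typically call the oracle on each batch of proposed molecules and keep the best-ranked ones.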

## Section 3: TDC Leaderboard

The TDC leaderboard provides a place for fair model comparison on important therapeutics ML tasks. We designed the benchmarks to reflect realistic drug discovery challenges.

Here, we demonstrate how to use a GNN to predict an ADMET property and submit the results to the TDC ADMET Caco2_Wang benchmark.

### Step 1: Load the benchmark dataset.

In [10]:
from tdc import BenchmarkGroup
group = BenchmarkGroup(name = 'ADMET_Group', path = 'data/')
benchmark = group.get('Caco2_Wang')

train_val, test = benchmark['train_val'], benchmark['test']

Downloading Benchmark Group...
100%|██████████| 1.47M/1.47M [00:00<00:00, 2.43MiB/s]
Extracting zip file...
Done!


### Step 2: Train Your Models With Five Runs

We use [DeepPurpose](https://github.com/kexinhuang12345/DeepPurpose), our latest framework for encoding compounds and proteins, as an example.

In [12]:
from DeepPurpose import CompoundPred as models
from DeepPurpose.utils import data_process, generate_config

drug_encoding = 'MPNN'
prediction_runs = []

for seed in [1, 2, 3, 4, 5]:
    ### Generate Different Train, Valid Splits Given Seed ###
    train, valid = group.get_train_valid_split(benchmark = benchmark['name'], split_type = 'default', seed = seed)
    
    ### Train the Model on Train, Valid Set ###
    train = data_process(X_drug = train.Drug.values, y = train.Y.values, drug_encoding = drug_encoding, split_method='no_split')
    val = data_process(X_drug = valid.Drug.values, y = valid.Y.values, drug_encoding = drug_encoding, split_method='no_split')
    test = data_process(X_drug = benchmark['test'].Drug.values, y = benchmark['test'].Y.values, drug_encoding = drug_encoding, split_method='no_split')

    config = generate_config(drug_encoding = drug_encoding, train_epoch = 3, LR = 0.001, batch_size = 128)
    model = models.model_initialize(**config)
    model.train(train, val, test, verbose = False)
    
    ### Generate Predictions on the Test Set ###
    y_pred = model.predict(test)
    prediction_runs.append({benchmark['name']: y_pred})

generating training, validation splits...
100%|██████████| 728/728 [00:00<00:00, 1169.60it/s]


Drug Property Prediction Mode...
in total: 637 drugs
encoding drug...
unique drugs: 634
do not do train/test split on the data for already splitted data
Drug Property Prediction Mode...
in total: 91 drugs
encoding drug...
unique drugs: 91
do not do train/test split on the data for already splitted data
Drug Property Prediction Mode...
in total: 182 drugs
encoding drug...
unique drugs: 181
do not do train/test split on the data for already splitted data
predicting...


generating training, validation splits...
100%|██████████| 728/728 [00:00<00:00, 1295.03it/s]


Drug Property Prediction Mode...
in total: 637 drugs
encoding drug...
unique drugs: 635
do not do train/test split on the data for already splitted data
Drug Property Prediction Mode...
in total: 91 drugs
encoding drug...
unique drugs: 90
do not do train/test split on the data for already splitted data
Drug Property Prediction Mode...
in total: 182 drugs
encoding drug...
unique drugs: 181
do not do train/test split on the data for already splitted data
predicting...


generating training, validation splits...
100%|██████████| 728/728 [00:00<00:00, 1194.70it/s]


Drug Property Prediction Mode...
in total: 637 drugs
encoding drug...
unique drugs: 634
do not do train/test split on the data for already splitted data
Drug Property Prediction Mode...
in total: 91 drugs
encoding drug...
unique drugs: 91
do not do train/test split on the data for already splitted data
Drug Property Prediction Mode...
in total: 182 drugs
encoding drug...
unique drugs: 181
do not do train/test split on the data for already splitted data
predicting...


generating training, validation splits...
100%|██████████| 728/728 [00:00<00:00, 1260.67it/s]


Drug Property Prediction Mode...
in total: 637 drugs
encoding drug...
unique drugs: 634
do not do train/test split on the data for already splitted data
Drug Property Prediction Mode...
in total: 91 drugs
encoding drug...
unique drugs: 91
do not do train/test split on the data for already splitted data
Drug Property Prediction Mode...
in total: 182 drugs
encoding drug...
unique drugs: 181
do not do train/test split on the data for already splitted data
predicting...


generating training, validation splits...
100%|██████████| 728/728 [00:00<00:00, 1228.35it/s]


Drug Property Prediction Mode...
in total: 637 drugs
encoding drug...
unique drugs: 635
do not do train/test split on the data for already splitted data
Drug Property Prediction Mode...
in total: 91 drugs
encoding drug...
unique drugs: 90
do not do train/test split on the data for already splitted data
Drug Property Prediction Mode...
in total: 182 drugs
encoding drug...
unique drugs: 181
do not do train/test split on the data for already splitted data
predicting...


### Step 3: Evaluate the test set predictions with the pre-specified TDC evaluator

This reports the mean and standard deviation of model performance across the five runs.

In [13]:
group.evaluate_many(prediction_runs)

{'caco2_wang': [1.221, 0.524]}
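
The aggregation behind this `[mean, standard deviation]` pair can be sketched in a few lines. This is an illustration under assumptions, not the evaluator itself: the labels and predictions are invented, mean absolute error is assumed as the benchmark metric, and the population standard deviation is assumed for the spread.

```python
from statistics import mean, pstdev


def mean_absolute_error(y_true, y_pred):
    """Average absolute difference between labels and predictions."""
    return sum(abs(t - p) for t, p in zip(y_true, y_pred)) / len(y_true)


# Invented labels and per-seed predictions, for illustration only.
y_true = [1.0, 2.0, 3.0, 4.0]
runs = [
    [1.5, 2.0, 2.5, 4.0],  # MAE 0.25
    [1.0, 3.0, 3.0, 5.0],  # MAE 0.50
    [2.0, 2.0, 4.0, 5.0],  # MAE 0.75
]

# Score each run separately, then summarize across runs.
per_run = [mean_absolute_error(y_true, y_pred) for y_pred in runs]
summary = [round(mean(per_run), 3), round(pstdev(per_run), 3)]
print(summary)  # [0.5, 0.204]
```

Reporting the spread across seeds makes it clear how much of a difference between two models could come from the train/validation split alone.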

### Step 4: Copy the above results and submit to [THIS FORM](https://forms.gle/HYupGaV7WDuutbr9A).

### That's it! Your results will be reflected on the [leaderboard website](https://tdcommons.ai/benchmark/admet_group/01caco2/) soon!

We can see that model performance is far from perfect! There are many opportunities for algorithmic innovation! Submit your state-of-the-art models to the TDC leaderboard!

* Visit our website: [https://tdcommons.ai/](https://tdcommons.ai/)

* Star our GitHub repo: [mims-harvard/TDC](https://github.com/mims-harvard/TDC)

* Join our Slack Workspace: [https://rb.gy/bv0ffd](https://rb.gy/bv0ffd) 

We are looking for contributors!

You can find this notebook at [https://github.com/mims-harvard/TDC/blob/master/tutorials/DGL_User_Group_Demo.ipynb](https://github.com/mims-harvard/TDC/blob/master/tutorials/DGL_User_Group_Demo.ipynb)