# Dataset and Benchmark creation with `Polaris`
The first step in creating a benchmark is to set up a standard dataset that provides access to the curated data (see <01_ADME_data_curation.ipynb>) together with all necessary information about it, such as the data source, the description of the endpoints, and the units.

In [1]:
%load_ext autoreload
%autoreload 2
import pandas as pd
import datamol as dm
import numpy as np
from sklearn.model_selection import ShuffleSplit

import polaris 
from polaris.curation._chemistry_curator import SMILES_COL, UNIQUE_ID
from polaris.dataset import Dataset, ColumnAnnotation
from polaris.dataset._column import Modality
from polaris.benchmark import SingleTaskBenchmarkSpecification, MultiTaskBenchmarkSpecification

## Create the ADME dataset with `polaris.Dataset` 

A dataset in Polaris is, at its core, a tabular data structure in which each row stores a datapoint. Here, we process the ADME dataset from [`Fang et al. 2023`](https://doi.org/10.1021/acs.jcim.3c00160).

In [2]:
# Load data
PATH = 'gs://polaris-private/curated_datasets/ADME/fang2023_public_set_3521_curated.csv'
table = pd.read_csv(PATH)



In [3]:
table.head(5)

Unnamed: 0,Internal ID,Vendor ID,SMILES,CollectionName,LOG HLM_CLint (mL/min/kg),LOG MDR1-MDCK ER (B-A/A-B),LOG SOLUBILITY PH 6.8 (ug/mL),LOG PLASMA PROTEIN BINDING (HUMAN) (% unbound),LOG PLASMA PROTEIN BINDING (RAT) (% unbound),LOG RLM_CLint (mL/min/kg),...,LOG PLASMA PROTEIN BINDING (HUMAN) (% unbound)_zscore,LOG PLASMA PROTEIN BINDING (HUMAN) (% unbound)_stereo_cliff,LOG PLASMA PROTEIN BINDING (RAT) (% unbound)_zscore,LOG PLASMA PROTEIN BINDING (RAT) (% unbound)_stereo_cliff,LOG MDR1-MDCK ER (B-A/A-B)_zscore,LOG MDR1-MDCK ER (B-A/A-B)_stereo_cliff,LOG SOLUBILITY PH 6.8 (ug/mL)_zscore,LOG SOLUBILITY PH 6.8 (ug/mL)_stereo_cliff,UMAP_0,UMAP_1
0,Mol2754,49006909,O=C(NCC1(Sc2ccccc2)CC1)c1ccc(=O)[nH]n1,emolecules,0.896416,,,,,2.753398,...,,,,,,,,,0.916336,4.428758
1,Mol1188,LN01313047,CC(/C=C/C(=O)NO)=C\[C@@H](C)C(=O)c1ccc(N(C)C)cc1,labnetworkBB,1.36661,0.723417,1.344392,,,2.180917,...,,,,,0.989082,,-0.732932,,1.486225,3.721844
2,Mol1585,32419804,CCNc1ccnc(N(C)Cc2nc3ccccc3n2C)n1,emolecules,1.4691,0.107651,1.567849,,,2.637425,...,,,,,-0.079177,,0.092429,,-1.574456,2.46546
3,Mol1297,32278068,Clc1ccc(C2(c3ccc(-c4cn[nH]c4)cc3)CCNCC2)cc1,emolecules,0.675687,1.995635,1.267172,,,1.02792,...,,,,,3.196185,,-1.018154,,1.231019,3.479056
4,Mol1364,4752649,c1ccc(-n2ncc3c(-n4ccnc4)ncnc32)cc1,emolecules,1.204093,-0.209238,0.696356,,,2.575138,...,,,,,-0.628933,,-3.126516,,0.307897,3.615973


In [4]:
# Here we simplify the column names 
table = table.rename(columns={"molhash_id": "UNIQUE_ID",
                             "LOG HLM_CLint (mL/min/kg)": "LOG_HLM_CLint",
                            "LOG RLM_CLint (mL/min/kg)": "LOG_RLM_CLint",
                            "LOG MDR1-MDCK ER (B-A/A-B)":"LOG_MDR1-MDCK_ER",
                            "LOG PLASMA PROTEIN BINDING (HUMAN) (% unbound)": "LOG_HPPB",
                            "LOG PLASMA PROTEIN BINDING (RAT) (% unbound)": "LOG_RPPB",
                            "LOG SOLUBILITY PH 6.8 (ug/mL)": "LOG_SOLUBILITY"})
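By default, `DataFrame.rename` skips mapping keys that are not present in the table, so a misspelled or absent source column would pass unnoticed. The sketch below uses a toy frame (the values and the reduced column set are illustrative, not taken from the curated file) to show how `errors="raise"` surfaces such cases.

```python
import pandas as pd

# Toy frame standing in for the curated table (illustrative values only).
toy = pd.DataFrame(
    {"LOG HLM_CLint (mL/min/kg)": [0.9, 1.4], "SMILES": ["CCO", "CCN"]}
)

mapping = {
    "LOG HLM_CLint (mL/min/kg)": "LOG_HLM_CLint",
    "molhash_id": "UNIQUE_ID",  # deliberately absent from the toy frame
}

# Default behaviour: the missing key is skipped without any warning.
renamed = toy.rename(columns=mapping)
print(list(renamed.columns))

# Strict behaviour: a KeyError reports the labels that were not found.
try:
    toy.rename(columns=mapping, errors="raise")
    strict_failed = False
except KeyError as exc:
    strict_failed = True
    print(f"rename failed: {exc}")
```

Whether strict renaming is appropriate depends on how stable the column names of the input file are; it is mainly useful as a guard while the curation output is still changing.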

Not all columns are necessary; only the columns that are useful for the benchmarks are annotated. Here we use only the columns that were used for training in the original paper.

The key bioactivity columns, molecular structures, and identifiers in the dataset must be specified with `ColumnAnnotation`. Where needed, `user_attributes` can hold arbitrary key-value pairs, such as `unit`, `organism`, `scale`, and the optimization `objective`.

**Abbreviations for the endpoint objective**
- THTB: the higher the better
- TLTB: the lower the better
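As a small illustration of how these tags could be consumed downstream, the hypothetical helper below (`is_improvement` is not part of Polaris) compares two endpoint values according to the objective stored in `user_attributes`.

```python
def is_improvement(objective: str, new: float, old: float) -> bool:
    """Return True if `new` is better than `old` under the given objective tag."""
    if objective == "THTB":  # the higher the better
        return new > old
    if objective == "TLTB":  # the lower the better
        return new < old
    raise ValueError(f"Unknown objective: {objective!r}")


# A higher value counts as an improvement for THTB but not for TLTB.
print(is_improvement("THTB", new=1.8, old=1.2))
print(is_improvement("TLTB", new=1.8, old=1.2))
```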

In [5]:
?ColumnAnnotation

Init signature:
ColumnAnnotation(
    *,
    isPointer: bool = False,
    modality: Union[str, polaris.dataset._column.Modality] = <Modality.UNKNOWN: 'unknown'>,
    description: Optional[str] = None,
    userAttributes: Dict[str, str] = None,
) -> None
Docstring:

In [6]:
adme_annotations = {
    "UNIQUE_ID": ColumnAnnotation(
        description="Molecular hash ID. See <datamol.mol.hash_mol>"
    ),
    "smiles": ColumnAnnotation(
        description="Molecule SMILES string after cleaning and standardization.",
        modality=Modality.MOLECULE
    ),
    "ORIGINAL_SMILES": ColumnAnnotation(
        description="Original molecule SMILES string from the publication."
    ),  
    "LOG_HLM_CLint": ColumnAnnotation(
        description="Human liver microsomal stability reported as intrinsic clearance",
        user_attributes={
            "unit": "mL/min/kg",
            "scale": "log",
            "organism": "human",
            "objective": "TLTB",
        },
    ),
    "LOG_RLM_CLint": ColumnAnnotation(
        description="Rat liver microsomal stability reported as intrinsic clearance",
        user_attributes={
            "unit": "mL/min/kg",
            "scale": "log",
            "organism": "rat",
            "objective": "TLTB",
        },
    ),
    "LOG_MDR1-MDCK_ER": ColumnAnnotation(
        description="MDR1-MDCK efflux ratio (B-A/A-B)",
        user_attributes={"unit": "mL/min/kg", "scale": "log", "objective": "THTB"},
    ),
    "LOG_HPPB": ColumnAnnotation(
        description="Human plasma protein binding",
        user_attributes={"unit": "% unbound", "objective": "TLTB"},
    ),
    "LOG_RPPB": ColumnAnnotation(
        description="Rat plasma protein binding",
        user_attributes={"unit": "% unbound", "objective": "TLTB"},
    ),
    "LOG_SOLUBILITY": ColumnAnnotation(
        description="Solubility was measured after equilibrium between the dissolved and solid state",
        user_attributes={
            "unit": "ug/mL",
            "scale": "log",
            "PH": "6.8",
            "objective": "THTB",
        },
    )
}
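Before the annotations are passed to `Dataset`, it can help to confirm that every annotated column actually exists in the table. The following is a minimal sketch with a hypothetical helper, `check_annotation_coverage`, run on toy column names rather than on the real table.

```python
def check_annotation_coverage(table_columns, annotation_keys):
    """Compare table columns with annotation keys.

    Returns (missing, unannotated): annotation keys absent from the table,
    and table columns that carry no annotation.
    """
    columns = set(table_columns)
    keys = set(annotation_keys)
    return sorted(keys - columns), sorted(columns - keys)


# Toy inputs; the real check would use the table columns and annotation dict.
toy_columns = ["UNIQUE_ID", "smiles", "LOG_HLM_CLint", "UMAP_0"]
toy_keys = ["UNIQUE_ID", "smiles", "ORIGINAL_SMILES", "LOG_HLM_CLint"]

missing, unannotated = check_annotation_coverage(toy_columns, toy_keys)
print("missing from table:", missing)
print("not annotated:", unannotated)
```

With the objects defined in this notebook, the call would presumably be `check_annotation_coverage(table.columns, adme_annotations)`; unannotated columns are expected here, since only a subset of the table is kept.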

## Create `Dataset` object

In [7]:
from polaris.utils.types import HubOwner
owner = HubOwner(organizationId="PolarisTest", slug="polaristest")
owner.owner

'PolarisTest'

In [8]:
dataset = Dataset(
    table=table[adme_annotations.keys()],
    name="Fang_2023_ADME_public",
    description="Disclosed ADME datasets collected over 20 months across six ADME in vitro endpoints",
    source="https://doi.org/10.1021/acs.jcim.3c00160",
    annotations=adme_annotations,
    owner=owner,
    tags=["ADME"]
)

In [9]:
# save the dataset
SAVE_DIR = "gs://polaris-private/Datasets/ADME/fang2023_public_set"
dataset.to_json(SAVE_DIR)

  Expected `url` but got `str` - serialized value may not be as expected
  return self.__pydantic_serializer__.to_python(


'gs://polaris-private/Datasets/ADME/fang2023_public_set/dataset.json'

In [10]:
fs = dm.fs.get_mapper(SAVE_DIR).fs
fs.ls(SAVE_DIR)

['polaris-private/Datasets/ADME/fang2023_public_set/dataset.json',
 'polaris-private/Datasets/ADME/fang2023_public_set/table.parquet']

## Benchmark creation with `Polaris`
Creating a benchmark involves setting up a standard dataset, designing the train-validation-test split, and defining the evaluation metrics that are used to establish a baseline performance level.

#### Load existing Dataset object

In [11]:
dataset = polaris.load_dataset("gs://polaris-private/Datasets/ADME/fang2023_public_set/dataset.json")

In [12]:
# Visualize all information about the dataset
dataset

0,1
name,Fang_2023_ADME_public
description,Disclosed ADME datasets collected over 20 months across six ADME in vitro endpoints
tags,ADME
user_attributes,
owner,(see "owner" table)
md5sum,9a3167e2aac5adc16c4abbfa762e0387
readme,
annotations,(see "annotations" table)
source,https://doi.org/10.1021/acs.jcim.3c00160
license,

owner
0,1
slug,polaristest
organization_id,PolarisTest
user_id,None
owner,PolarisTest

annotations
column,is_pointer,modality,description,user_attributes
UNIQUE_ID,False,UNKNOWN,Molecular hash ID. See <datamol.mol.hash_mol>,
smiles,False,MOLECULE,Molecule SMILES string after cleaning and standardization.,
ORIGINAL_SMILES,False,UNKNOWN,Original molecule SMILES string from the publication.,
LOG_HLM_CLint,False,UNKNOWN,Human liver microsomal stability reported as intrinsic clearance,unit=mL/min/kg; scale=log; organism=human; objective=TLTB
LOG_RLM_CLint,False,UNKNOWN,Rat liver microsomal stability reported as intrinsic clearance,unit=mL/min/kg; scale=log; organism=rat; objective=TLTB
LOG_MDR1-MDCK_ER,False,UNKNOWN,MDR1-MDCK efflux ratio (B-A/A-B),unit=mL/min/kg; scale=log; objective=THTB
LOG_HPPB,False,UNKNOWN,Human plasma protein binding,unit=% unbound; objective=TLTB
LOG_RPPB,False,UNKNOWN,Rat plasma protein binding,unit=% unbound; objective=TLTB
LOG_SOLUBILITY,False,UNKNOWN,Solubility was measured after equilibrium between the dissolved and solid state,unit=ug/mL; scale=log; PH=6.8; objective=THTB


## Single-task benchmarks with the results from Fang et al. 2023 as baseline
The tasks use the same test sets as in the Fang et al. 2023 paper.
Here we create a single-task benchmark for each of the six ADME endpoints.
The test sets are based on the train/test split provided at https://github.com/molecularinformatics/Computational-ADME/tree/main/MPNN. \
The dataset differs slightly from the one published in Fang et al. 2023 because molecules that are undesirable in a small-molecule context were removed during curation.
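The membership-based split described above can be sketched on toy data. The frames below are illustrative stand-ins for the curated table and one published test file, and the sketch assumes that both sides use the same SMILES representation, since matching is done by exact string comparison. It also lists published test molecules that no longer appear after curation.

```python
import pandas as pd

# Toy stand-ins for the curated table and one published test file.
curated = pd.DataFrame({"SMILES": ["CCO", "CCN", "CCC", "c1ccccc1"]})
# "CCCl" plays the role of a molecule removed during curation.
published_test = pd.DataFrame({"smiles": ["CCN", "c1ccccc1", "CCCl"]})

# Rows whose SMILES appear in the published test file form the test split;
# all remaining rows form the train split.
in_test = curated["SMILES"].isin(published_test["smiles"])
train_idx = curated.loc[~in_test].index.to_numpy()
test_idx = curated.loc[in_test].index.to_numpy()

# Published test molecules that are absent from the curated table.
not_found = ~published_test["smiles"].isin(curated["SMILES"])
dropped = published_test.loc[not_found, "smiles"].tolist()

print(train_idx, test_idx, dropped)
```

Reporting the dropped molecules makes it explicit how far the reconstructed test set deviates from the published one.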

In [13]:
# Specify names and extract the test set from their dataset
endpoints = {
    "HLM": "HLM_CLint",
    "RLM": "RLM_CLint",
    "hPPB": "HPPB",
    "rPPB": "RPPB",
    "MDR1_ER": "MDR1-MDCK_ER",
    "Sol": "SOLUBILITY",
}

_endpoint = list(endpoints.keys())
INDIR = "gs://polaris-private/original_datasets/ADME/fang2023/MPNN"

In [14]:
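split_key = "fang2023_split"
paper_splits = {}
for endpoint in _endpoint:
    # Test set published with the paper for this endpoint
    testset = dm.read_csv(f"{INDIR}/ADME_{endpoint}_test.csv")
    # Rows whose SMILES appear in the published test set form the test split;
    # all remaining rows form the train split.
    in_test = table.SMILES.isin(testset.smiles)
    paper_splits[endpoints[endpoint]] = (
        table.loc[~in_test].index.values,
        table.loc[in_test].index.values,
    )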

In [15]:
testset.smiles.loc[testset.smiles.isin(table.SMILES)]

0               Cc1cc(C)cc(C(=O)Nc2nn(C(C)(C)C)cc2C#N)c1
1             COc1ccc(C(=O)Nc2cc(-c3ccc(F)c(F)c3)no2)cc1
2                         CCOC(=O)c1cccnc1-c1cccc2ccnn12
3      CN(c1ccccc1)S(=O)(=O)c1csc(C(=O)Nc2ccc(C(=O)O)...
4          CCn1c(=O)c2cc(OC)c(OC)cc2n(Cc2ccc(Cl)cc2)c1=O
                             ...                        
430                   Fc1ccc2oc(Cn3nnc(-c4ccsc4)n3)nc2c1
431                   COc1ccccc1-c1csc(-n2ncc(C#N)c2N)n1
432                 Cn1c(C2CC2)nc2c1CCN(c1ncnc3ccsc13)C2
433    Cc1ncsc1C(=O)N1CCCCC1c1nc(N)ncc1-c1cccc(C(F)(F...
434      Cc1ccc2[nH]c(C3CN(C(=O)Cc4cccc5ccccc45)C3)nc2c1
Name: smiles, Length: 400, dtype: object

In [16]:
table.loc[table.SMILES.isin(testset.smiles)]

Unnamed: 0,Internal ID,Vendor ID,SMILES,CollectionName,LOG_HLM_CLint,LOG_MDR1-MDCK_ER,LOG_SOLUBILITY,LOG_HPPB,LOG_RPPB,LOG_RLM_CLint,...,LOG PLASMA PROTEIN BINDING (HUMAN) (% unbound)_zscore,LOG PLASMA PROTEIN BINDING (HUMAN) (% unbound)_stereo_cliff,LOG PLASMA PROTEIN BINDING (RAT) (% unbound)_zscore,LOG PLASMA PROTEIN BINDING (RAT) (% unbound)_stereo_cliff,LOG MDR1-MDCK ER (B-A/A-B)_zscore,LOG MDR1-MDCK ER (B-A/A-B)_stereo_cliff,LOG SOLUBILITY PH 6.8 (ug/mL)_zscore,LOG SOLUBILITY PH 6.8 (ug/mL)_stereo_cliff,UMAP_0,UMAP_1
7,Mol1269,20740589,CC(=O)Nc1ccnn1C1CCN(Cc2ccccc2C#Cc2ccccc2)CC1,emolecules,1.916343,1.423183,1.412293,,,2.705213,...,,,,,2.203069,,-0.482136,,0.894489,3.733960
10,Mol3337,27444778,Clc1cccc(Nc2ncnc3[nH]ncc23)c1Cl,emolecules,,,0.509874,,,,...,,,,,,,-3.815305,,0.802149,2.417912
17,Mol3334,1397911,NC(=O)Cn1c2ccccc2c2nc3ccccc3nc21,emolecules,,,-0.823909,,,,...,,,,,,,-8.741762,,-0.189674,4.285650
20,Mol1487,32138137,Cc1noc(CCNc2ncnc3ccccc23)n1,emolecules,0.675687,-0.033532,1.581608,,,1.974811,...,,,,,-0.324109,,0.143249,,-0.229339,1.459318
27,Mol82,13329354,CCN1CCN(S(=O)(=O)Cc2ccc(Cl)c(Cl)c2)CC1,emolecules,1.917836,-0.387053,1.409764,1.249125,1.071845,3.159066,...,0.381164,,0.336831,,-0.937414,,-0.491475,,2.306123,2.169771
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
3432,Mol1634,37468092,CN(C)C(=O)Cn1cc(C(=O)O)ccc1=O,emolecules,0.675687,,1.323252,,,1.027920,...,,,,,,,-0.811015,,1.623073,3.882370
3474,Mol409,43397821,Cc1cc(C)cc(C(=O)Nc2nn(C(C)(C)C)cc2C#N)c1,emolecules,1.049993,0.323142,1.586700,,,2.430891,...,,,,,0.294666,,0.162054,,2.324817,5.079394
3485,Mol885,181210453,CCN(Cc1ccccc1)C(=O)c1cccnc1N,emolecules,0.675687,-0.187283,1.564666,,,2.239815,...,,,,,-0.590844,,0.080671,,0.273193,4.057323
3490,Mol155,71008775,Cc1cc(F)c(C(=O)Nc2cccc(-c3nncn3C(C)C)n2)cc1-n1...,emolecules,1.113241,1.800037,1.811240,,,1.775967,...,,,,,2.856852,,0.991414,,1.945578,5.463241


In [17]:
testset.dropna()

Unnamed: 0,smiles,activity
0,Cc1cc(C)cc(C(=O)Nc2nn(C(C)(C)C)cc2C#N)c1,1.586700
1,COc1ccc(C(=O)Nc2cc(-c3ccc(F)c(F)c3)no2)cc1,-0.455932
2,CCOC(=O)c1cccnc1-c1cccc2ccnn12,1.660676
3,CN(c1ccccc1)S(=O)(=O)c1csc(C(=O)Nc2ccc(C(=O)O)...,1.904607
4,CCn1c(=O)c2cc(OC)c(OC)cc2n(Cc2ccc(Cl)cc2)c1=O,-0.397940
...,...,...
430,Fc1ccc2oc(Cn3nnc(-c4ccsc4)n3)nc2c1,1.480582
431,COc1ccccc1-c1csc(-n2ncc(C#N)c2N)n1,-0.886057
432,Cn1c(C2CC2)nc2c1CCN(c1ncnc3ccsc13)C2,1.702431
433,Cc1ncsc1C(=O)N1CCCCC1c1nc(N)ncc1-c1cccc(C(F)(F...,1.155336


In [18]:
table

Unnamed: 0,Internal ID,Vendor ID,SMILES,CollectionName,LOG_HLM_CLint,LOG_MDR1-MDCK_ER,LOG_SOLUBILITY,LOG_HPPB,LOG_RPPB,LOG_RLM_CLint,...,LOG PLASMA PROTEIN BINDING (HUMAN) (% unbound)_zscore,LOG PLASMA PROTEIN BINDING (HUMAN) (% unbound)_stereo_cliff,LOG PLASMA PROTEIN BINDING (RAT) (% unbound)_zscore,LOG PLASMA PROTEIN BINDING (RAT) (% unbound)_stereo_cliff,LOG MDR1-MDCK ER (B-A/A-B)_zscore,LOG MDR1-MDCK ER (B-A/A-B)_stereo_cliff,LOG SOLUBILITY PH 6.8 (ug/mL)_zscore,LOG SOLUBILITY PH 6.8 (ug/mL)_stereo_cliff,UMAP_0,UMAP_1
0,Mol2754,49006909,O=C(NCC1(Sc2ccccc2)CC1)c1ccc(=O)[nH]n1,emolecules,0.896416,,,,,2.753398,...,,,,,,,,,0.916336,4.428758
1,Mol1188,LN01313047,CC(/C=C/C(=O)NO)=C\[C@@H](C)C(=O)c1ccc(N(C)C)cc1,labnetworkBB,1.366610,0.723417,1.344392,,,2.180917,...,,,,,0.989082,,-0.732932,,1.486225,3.721844
2,Mol1585,32419804,CCNc1ccnc(N(C)Cc2nc3ccccc3n2C)n1,emolecules,1.469100,0.107651,1.567849,,,2.637425,...,,,,,-0.079177,,0.092429,,-1.574456,2.465460
3,Mol1297,32278068,Clc1ccc(C2(c3ccc(-c4cn[nH]c4)cc3)CCNCC2)cc1,emolecules,0.675687,1.995635,1.267172,,,1.027920,...,,,,,3.196185,,-1.018154,,1.231019,3.479056
4,Mol1364,4752649,c1ccc(-n2ncc3c(-n4ccnc4)ncnc32)cc1,emolecules,1.204093,-0.209238,0.696356,,,2.575138,...,,,,,-0.628933,,-3.126516,,0.307897,3.615973
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
3511,Mol2093,82736647,CC(=O)c1ccccc1-c1cccc(C(=O)NCCCN2CCCC2)c1,emolecules,0.675687,2.124927,,,,2.870665,...,,,,,3.420487,,,,1.986493,2.524152
3512,Mol2519,30194144,COc1ccccc1C(=O)CSc1nncn1C1CC1,emolecules,1.390847,0.113544,,,,2.665128,...,,,,,-0.068955,,,,1.149241,3.716617
3513,Mol3304,11847209,CCOC(=O)N1CCC(C(=O)Nc2cccc(C)c2)CC1,emolecules,,,1.750354,,,,...,,,,,,,0.766527,,3.016378,4.726665
3514,Mol2518,30663852,O=C(CSc1ccccn1)N1CCSc2ccccc21,emolecules,2.629072,-0.232005,,,,3.805995,...,,,,,-0.668430,,,,2.516532,1.928639


In [19]:
data_cols = ['LOG_HLM_CLint', 'LOG_RLM_CLint', 'LOG_MDR1-MDCK_ER', 'LOG_HPPB', 'LOG_RPPB','LOG_SOLUBILITY']
BENCHMARK_DIR = "gs://polaris-private/benchmarks/ADME/fang2023"

In [20]:
benchmark_path = {}
split_key = 'fang2023_split'
for target_col in data_cols:
    name = f"singletask_{target_col}_{split_key}"
    print(f"{target_col}-{name}")
    benchmark = SingleTaskBenchmarkSpecification(
        name=name,
        dataset=dataset,
        target_cols=target_col,
        input_cols=["smiles"],
        split=paper_splits[target_col.replace("LOG_", "")],
        metrics=["mean_squared_error"],
        tags=['ADME', 'Singletask'], 
        owner=owner, 
        description=f"Single task benchmark for {target_col}"
    )
    SAVE_DIR = f"{BENCHMARK_DIR}/{split_key}/{target_col}"
    path = benchmark.to_json(SAVE_DIR)
    benchmark_path[target_col]= path

LOG_HLM_CLint-singletask_LOG_HLM_CLint_fang2023_split


  Expected `Union[Dataset, str, dict[str, any]]` but got `Dataset` - serialized value may not be as expected
  Expected `url` but got `str` - serialized value may not be as expected
  return self.__pydantic_serializer__.to_python(
  Expected `url` but got `str` - serialized value may not be as expected
  return self.__pydantic_serializer__.to_python(


LOG_RLM_CLint-singletask_LOG_RLM_CLint_fang2023_split
LOG_MDR1-MDCK_ER-singletask_LOG_MDR1-MDCK_ER_fang2023_split
LOG_HPPB-singletask_LOG_HPPB_fang2023_split
LOG_RPPB-singletask_LOG_RPPB_fang2023_split
LOG_SOLUBILITY-singletask_LOG_SOLUBILITY_fang2023_split

In [21]:
benchmark_path

{'LOG_HLM_CLint': 'gs://polaris-private/benchmarks/ADME/fang2023/fang2023_split/LOG_HLM_CLint/benchmark.json',
 'LOG_RLM_CLint': 'gs://polaris-private/benchmarks/ADME/fang2023/fang2023_split/LOG_RLM_CLint/benchmark.json',
 'LOG_MDR1-MDCK_ER': 'gs://polaris-private/benchmarks/ADME/fang2023/fang2023_split/LOG_MDR1-MDCK_ER/benchmark.json',
 'LOG_HPPB': 'gs://polaris-private/benchmarks/ADME/fang2023/fang2023_split/LOG_HPPB/benchmark.json',
 'LOG_RPPB': 'gs://polaris-private/benchmarks/ADME/fang2023/fang2023_split/LOG_RPPB/benchmark.json',
 'LOG_SOLUBILITY': 'gs://polaris-private/benchmarks/ADME/fang2023/fang2023_split/LOG_SOLUBILITY/benchmark.json'}

## Multitask benchmark for all six ADME endpoints with a common random split

In [22]:
# regression
TEST_SIZE = 0.2
SEED = 111

# random split
random_splitter = ShuffleSplit(n_splits=5, test_size=TEST_SIZE, random_state=SEED)
random_split = next(random_splitter.split(X=dataset.table.smiles.values))
split_key = "random"
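For reference, the behaviour of the splitter can be checked on a toy array. The sketch below reuses the same `ShuffleSplit` settings on ten dummy samples; the array `X` is an assumption for illustration only.

```python
import numpy as np
from sklearn.model_selection import ShuffleSplit

X = np.arange(10)  # ten toy samples
splitter = ShuffleSplit(n_splits=5, test_size=0.2, random_state=111)

# `split` returns a generator; `next` takes only the first of the five splits.
train_idx, test_idx = next(splitter.split(X))

# With a fixed random_state, a fresh splitter reproduces the same first split.
train_again, test_again = next(
    ShuffleSplit(n_splits=5, test_size=0.2, random_state=111).split(X)
)

print(len(train_idx), len(test_idx))
print(set(train_idx).isdisjoint(test_idx))
```

Because only the first split is consumed, `n_splits=5` has no effect on the benchmark beyond defining how many splits the generator could produce.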

In [23]:
name = f"multitask_sixADME_{split_key}"
print(f"{name}")
benchmark_multi = MultiTaskBenchmarkSpecification(
    name=name,
    dataset=dataset,
    target_cols=data_cols,
    input_cols="smiles",
    split=random_split,
    metrics="mean_squared_error",
    tags=["ADME", "Multitask"], 
    owner=owner,
    description="A multitask benchmark for all the ADME endpoints with a common random split. "
)
SAVE_DIR = f"{BENCHMARK_DIR}/{split_key}/multitask_sixADME"
path = benchmark_multi.to_json(SAVE_DIR)
print(path)

multitask_sixADME_random
gs://polaris-private/benchmarks/ADME/fang2023/random/multitask_sixADME/benchmark.json
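Since most molecules in this dataset have measurements for only some of the six endpoints, a multitask evaluation has to deal with missing labels. The sketch below shows one possible way to compute a per-task mean squared error that skips missing values. The helper `masked_mse_per_task` is hypothetical and is not necessarily how Polaris evaluates `mean_squared_error` internally.

```python
import numpy as np


def masked_mse_per_task(y_true: np.ndarray, y_pred: np.ndarray) -> np.ndarray:
    """Mean squared error per column, skipping entries where y_true is NaN."""
    mask = ~np.isnan(y_true)
    # Squared errors, with unlabeled entries set to zero so they do not count.
    sq_err = np.where(mask, (y_true - y_pred) ** 2, 0.0)
    counts = mask.sum(axis=0)
    # Columns without any label yield NaN rather than a division error.
    result = np.full(y_true.shape[1], np.nan)
    np.divide(sq_err.sum(axis=0), counts, out=result, where=counts > 0)
    return result


# Toy labels for two tasks; NaN marks a missing measurement.
y_true = np.array([[1.0, np.nan], [2.0, 0.5], [np.nan, 1.5]])
y_pred = np.array([[1.5, 0.0], [2.0, 1.5], [0.0, 1.5]])

per_task = masked_mse_per_task(y_true, y_pred)
print(per_task)
```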
