## This script demonstrates how to use the `polaris.curation` module to perform data curation
- The DMPK datasets were published in Fang et al. 2023 (DOI:10.1021/acs.jcim.3c00160).

- Curate the chemistry information of the molecules.
  - Clean the molecules: fix and sanitize them, standardize them, and remove salts and solvents.
  - Remove stereochemistry information if `ignore_stereo` is set to `True`. This is recommended if the downstream molecular representation cannot differentiate stereoisomers.

- Curate the measured endpoint values in the datasets.
  - Merge measurements of repeated molecules in the dataset. Repeated molecules are identified with `dm.hash_mol`, with or without stereochemistry information.
  - Classify the measured values using the provided threshold values for classification tasks.
  - Detect activity cliffs between stereoisomers. When `mask_stereo_cliff` is set to `True`, the target activity values of those molecule pairs are set to `None`. This is recommended if the downstream molecular representation cannot differentiate stereoisomers.
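
The merge step above can be pictured with a small, library-free sketch. It assumes each measurement already carries a molecule key (a stand-in for the output of `dm.hash_mol`) and that repeats are combined with the median. Both the `merge_repeated` helper and the choice of the median are illustrative assumptions, not the documented behavior of `polaris.curation`.

```python
from collections import defaultdict
from statistics import median


def merge_repeated(records):
    """Collapse repeated molecules into one value per molecule key.

    `records` is an iterable of (mol_key, value) pairs. `mol_key` is a
    hypothetical precomputed identifier standing in for `dm.hash_mol`.
    Missing values (None) are ignored; a molecule with no measured value
    keeps None. Using the median is an assumption made for illustration.
    """
    grouped = defaultdict(list)
    for mol_key, value in records:
        values = grouped[mol_key]  # registers the key even if value is None
        if value is not None:
            values.append(value)
    return {key: (median(vals) if vals else None) for key, vals in grouped.items()}


records = [
    ("hash_a", 1.0),
    ("hash_b", 2.5),
    ("hash_a", 2.0),   # repeated measurement of the same molecule
    ("hash_a", None),  # missing measurement, ignored
    ("hash_c", None),  # never measured
]
merged = merge_repeated(records)
print(merged)
```

Whether two stereoisomers share a key depends on how the key was computed, which is why the identification can be run with or without stereochemistry information.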


In [1]:
%load_ext autoreload
%autoreload 2
import datamol as dm
import pandas as pd
from polaris import curation

### Data curation for DMPK datasets

In [2]:
INDIR = "gs://polaris-private/dataset/DMPK/Fang2023"
OUTDIR = "gs://polaris-private/dataset/DMPK"

In [3]:
# Define data column names
endpoints = {
    "HLM": "LOG HLM_CLint (mL/min/kg)",
    "RLM": "LOG RLM_CLint (mL/min/kg)",
    "hPPB": "LOG PLASMA PROTEIN BINDING (HUMAN) (% unbound)",
    "rPPB": "LOG PLASMA PROTEIN BINDING (RAT) (% unbound)",
    "MDR1_ER": "LOG MDR1-MDCK ER (B-A/A-B)",
    "Sol": "LOG SOLUBILITY PH 6.8 (ug/mL)",
}

# Define thresholds for class conversions
class_thresholds = {
    "hPPB": {"thresholds": [0.3, 1], "label_order": "descending"},
    "rPPB": {"thresholds": [0.3, 1], "label_order": "descending"},
    "MDR1_ER": {"thresholds": [1, 2]},
    "Sol": {
        "thresholds": [
            0,
            1,
        ]
    },
}
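
As a rough illustration of how a pair of thresholds turns a continuous readout into class labels, the sketch below bins a value with `bisect_right`. The ascending case agrees with the MDR1-MDCK ER rows that appear in this notebook's output (for example 0.178072 maps to class 0, 1.836054 to class 1 and 2.725057 to class 2). The handling of values exactly on a threshold and the interpretation of `label_order="descending"` as a reversed label order are assumptions; the `classify` helper is hypothetical and is not part of `polaris.curation`.

```python
from bisect import bisect_right


def classify(value, thresholds, label_order="ascending"):
    """Map a continuous value to an integer class label.

    With n thresholds there are n + 1 classes. In ascending order the
    lowest bin gets label 0. For "descending" the labels are reversed,
    which is an assumption about what `label_order` means.
    """
    if value is None:
        return None
    label = bisect_right(sorted(thresholds), value)
    if label_order == "descending":
        label = len(thresholds) - label
    return label


# Ascending labels with thresholds [1, 2]
mdr1_labels = [classify(v, [1, 2]) for v in (0.178072, 1.836054, 2.725057)]

# Descending labels with thresholds [0.3, 1]
ppb_labels = [classify(v, [0.3, 1], "descending") for v in (0.1, 0.5, 1.5)]

print(mdr1_labels, ppb_labels)
```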

### Perform curation that takes stereochemistry information into account

It's important to detect and analyze activity cliffs between stereoisomers.
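
To make the idea of a stereoisomer activity cliff concrete, here is a minimal sketch that works on plain tuples. It assumes each row carries two precomputed identifiers, one that ignores stereochemistry and one that includes it, and it flags a group of stereoisomers when their values differ by more than a chosen gap. The helper names, the identifiers and the `max_gap` value are all hypothetical; the criterion actually used by `polaris.curation` may differ.

```python
from collections import defaultdict


def find_stereo_cliffs(rows, max_gap=1.0):
    """Return the stereo-aware ids of molecules involved in a cliff.

    `rows` holds (flat_id, stereo_id, value) tuples, where `flat_id`
    ignores stereochemistry and `stereo_id` includes it. `max_gap` is an
    illustrative threshold, not a polaris default.
    """
    groups = defaultdict(list)
    for flat_id, stereo_id, value in rows:
        if value is not None:
            groups[flat_id].append((stereo_id, value))

    cliffs = set()
    for members in groups.values():
        stereo_ids = {stereo_id for stereo_id, _ in members}
        values = [value for _, value in members]
        # A cliff needs at least two distinct stereoisomers with distant values
        if len(stereo_ids) > 1 and max(values) - min(values) > max_gap:
            cliffs.update(stereo_ids)
    return cliffs


def mask_cliffs(rows, cliffs):
    """Replace the value with None for every molecule flagged as a cliff."""
    return [
        (flat_id, stereo_id, None if stereo_id in cliffs else value)
        for flat_id, stereo_id, value in rows
    ]


rows = [
    ("flat_A", "A_R", 0.2),
    ("flat_A", "A_S", 1.9),  # far from its stereoisomer: a cliff
    ("flat_B", "B_R", 1.0),
    ("flat_B", "B_S", 1.1),  # close to its stereoisomer: no cliff
    ("flat_C", "C", 3.0),    # no stereoisomer in the data
]
cliffs = find_stereo_cliffs(rows, max_gap=1.0)
masked_values = [value for _, _, value in mask_cliffs(rows, cliffs)]
print(sorted(cliffs), masked_values)
```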

In [4]:
data = dm.read_csv(f"{INDIR}/ADME_public_set_3521.csv")



In [5]:
data.describe()

| | LOG HLM_CLint (mL/min/kg) | LOG MDR1-MDCK ER (B-A/A-B) | LOG SOLUBILITY PH 6.8 (ug/mL) | LOG PLASMA PROTEIN BINDING (HUMAN) (% unbound) | LOG PLASMA PROTEIN BINDING (RAT) (% unbound) | LOG RLM_CLint (mL/min/kg) |
|---|---|---|---|---|---|---|
| count | 3087.0 | 2642.0 | 2173.0 | 194.0 | 168.0 | 3054.0 |
| mean | 1.320019 | 0.397829 | 1.259943 | 0.765722 | 0.764177 | 2.256207 |
| std | 0.623952 | 0.688465 | 0.683416 | 0.847902 | 0.798988 | 0.750422 |
| min | 0.675687 | -1.162425 | -1.0 | -1.59346 | -1.638272 | 1.02792 |
| 25% | 0.675687 | -0.162356 | 1.15351 | 0.168067 | 0.226564 | 1.688291 |
| 50% | 1.205313 | 0.153291 | 1.542825 | 0.867555 | 0.776427 | 2.311068 |
| 75% | 1.803115 | 0.905013 | 1.687351 | 1.501953 | 1.375962 | 2.835274 |
| max | 3.372714 | 2.725057 | 2.179264 | 2.0 | 2.0 | 3.969622 |


In [6]:
data_cols = list(endpoints.values())
mol_col = "SMILES"

In [7]:
data_cols

['LOG HLM_CLint (mL/min/kg)',
 'LOG RLM_CLint (mL/min/kg)',
 'LOG PLASMA PROTEIN BINDING (HUMAN) (% unbound)',
 'LOG PLASMA PROTEIN BINDING (RAT) (% unbound)',
 'LOG MDR1-MDCK ER (B-A/A-B)',
 'LOG SOLUBILITY PH 6.8 (ug/mL)']

In [8]:
# Run the curation on all endpoint columns; the class thresholds are re-keyed by column name
curator_with_stereo = curation.MolecularCurator(
    data=data,
    data_cols=data_cols,
    mol_col=mol_col,
    mask_stereo_undefined_mols=True,
    class_thresholds={endpoints[ep]: class_thresholds[ep] for ep in class_thresholds.keys()},
)
df_full = curator_with_stereo.run()



The curation raised warnings about potential outliers in the bioactivity readouts 'LOG HLM_CLint (mL/min/kg)', 'LOG PLASMA PROTEIN BINDING (RAT) (% unbound)', 'LOG MDR1-MDCK ER (B-A/A-B)', 'LOG SOLUBILITY PH 6.8 (ug/mL)'.

The outlier labels are added to the curated output. It's important to review those data points and verify whether they are real outliers that should be removed from the dataset.

Other outlier detection methods can be used by passing parameters to `outlier_params`. See `polaris.curation.utils.outlier_detection` for more details.
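
For intuition about what an outlier label represents, the sketch below flags values by z-score using only the standard library. It is an assumption-laden illustration: the `zscore_outliers` helper and its threshold are hypothetical, and the detection method and defaults used by `polaris.curation` are not reproduced here.

```python
from statistics import mean, stdev


def zscore_outliers(values, threshold=3.0):
    """Flag values whose absolute z-score exceeds `threshold`.

    Missing values (None) are skipped when computing the statistics and
    are never flagged. At least two observed values are required. The
    default threshold is illustrative, not a polaris default.
    """
    observed = [v for v in values if v is not None]
    mu = mean(observed)
    sigma = stdev(observed)  # sample standard deviation
    if sigma == 0:
        return [False] * len(values)
    return [v is not None and abs(v - mu) / sigma > threshold for v in values]


values = [1.0, 1.1, 0.9, 1.0, 1.2, 0.8, 1.0, None, 5.0]
flags = zscore_outliers(values, threshold=2.0)
print(flags)
```

A flagged value is only a candidate for removal; as noted above, each one should be checked before it is dropped.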

In [13]:
for ep in [
    "LOG HLM_CLint (mL/min/kg)",
    "LOG PLASMA PROTEIN BINDING (RAT) (% unbound)",
    "LOG MDR1-MDCK ER (B-A/A-B)",
    "LOG SOLUBILITY PH 6.8 (ug/mL)",
]:
    display(df_full.query(f"`OUTLIER_{ep}` == True"))

Unnamed: 0,Internal ID,Vendor ID,SMILES,CollectionName,LOG HLM_CLint (mL/min/kg),LOG MDR1-MDCK ER (B-A/A-B),LOG SOLUBILITY PH 6.8 (ug/mL),LOG PLASMA PROTEIN BINDING (HUMAN) (% unbound),LOG PLASMA PROTEIN BINDING (RAT) (% unbound),LOG RLM_CLint (mL/min/kg),...,CLASS_LOG MDR1-MDCK ER (B-A/A-B),CLASS_LOG SOLUBILITY PH 6.8 (ug/mL),LOG HLM_CLint (mL/min/kg)_zscore,LOG HLM_CLint (mL/min/kg)_stereo_cliff,LOG RLM_CLint (mL/min/kg)_zscore,LOG RLM_CLint (mL/min/kg)_stereo_cliff,CLASS_LOG PLASMA PROTEIN BINDING (HUMAN) (% unbound)_stereo_cliff,CLASS_LOG PLASMA PROTEIN BINDING (RAT) (% unbound)_stereo_cliff,CLASS_LOG MDR1-MDCK ER (B-A/A-B)_stereo_cliff,CLASS_LOG SOLUBILITY PH 6.8 (ug/mL)_stereo_cliff
594,Mol304,32033566,COc1ccccc1CNC(=O)C(C)N1CCCN(c2ccccc2C#N)CC1,emolecules,3.339893,0.178072,1.67071,,,3.604436,...,0.0,2.0,2.718437,,1.5229,,,,,
2784,Mol2788,1821515,Cc1cc(C)nc(SCC(=O)N2c3ccccc3C(C)(c3ccccc3)CC2(...,emolecules,3.372714,,,,,3.495299,...,,,2.760236,,1.394394,,,,,
3406,Mol910,32033578,CC(C(=O)N1CCc2sccc2C1)N1CCCN(c2ccccc2C#N)CC1,emolecules,3.328241,-0.059773,1.40824,,,3.626562,...,0.0,2.0,2.703598,,1.548953,,,,,


Unnamed: 0,Internal ID,Vendor ID,SMILES,CollectionName,LOG HLM_CLint (mL/min/kg),LOG MDR1-MDCK ER (B-A/A-B),LOG SOLUBILITY PH 6.8 (ug/mL),LOG PLASMA PROTEIN BINDING (HUMAN) (% unbound),LOG PLASMA PROTEIN BINDING (RAT) (% unbound),LOG RLM_CLint (mL/min/kg),...,CLASS_LOG MDR1-MDCK ER (B-A/A-B),CLASS_LOG SOLUBILITY PH 6.8 (ug/mL),LOG HLM_CLint (mL/min/kg)_zscore,LOG HLM_CLint (mL/min/kg)_stereo_cliff,LOG RLM_CLint (mL/min/kg)_zscore,LOG RLM_CLint (mL/min/kg)_stereo_cliff,CLASS_LOG PLASMA PROTEIN BINDING (HUMAN) (% unbound)_stereo_cliff,CLASS_LOG PLASMA PROTEIN BINDING (RAT) (% unbound)_stereo_cliff,CLASS_LOG MDR1-MDCK ER (B-A/A-B)_stereo_cliff,CLASS_LOG SOLUBILITY PH 6.8 (ug/mL)_stereo_cliff
1184,Mol98,901943,CCC1=C(C)CN(C(=O)NCCc2ccc(S(=O)(=O)NC(=O)N[C@H...,emolecules,1.284273,1.836054,,-1.180456,-1.638272,2.484587,...,1.0,,0.100558,,0.204313,,,,,


Unnamed: 0,Internal ID,Vendor ID,SMILES,CollectionName,LOG HLM_CLint (mL/min/kg),LOG MDR1-MDCK ER (B-A/A-B),LOG SOLUBILITY PH 6.8 (ug/mL),LOG PLASMA PROTEIN BINDING (HUMAN) (% unbound),LOG PLASMA PROTEIN BINDING (RAT) (% unbound),LOG RLM_CLint (mL/min/kg),...,CLASS_LOG MDR1-MDCK ER (B-A/A-B),CLASS_LOG SOLUBILITY PH 6.8 (ug/mL),LOG HLM_CLint (mL/min/kg)_zscore,LOG HLM_CLint (mL/min/kg)_stereo_cliff,LOG RLM_CLint (mL/min/kg)_zscore,LOG RLM_CLint (mL/min/kg)_stereo_cliff,CLASS_LOG PLASMA PROTEIN BINDING (HUMAN) (% unbound)_stereo_cliff,CLASS_LOG PLASMA PROTEIN BINDING (RAT) (% unbound)_stereo_cliff,CLASS_LOG MDR1-MDCK ER (B-A/A-B)_stereo_cliff,CLASS_LOG SOLUBILITY PH 6.8 (ug/mL)_stereo_cliff
3085,Mol1915,258223286,c1nn(C2CCOCC2)cc1Nc1ncc2nnn(-c3ccc4cn[nH]c4c3)...,emolecules,1.985754,2.725057,,,,2.883853,...,2.0,,0.99391,,0.674436,,,,,


Unnamed: 0,Internal ID,Vendor ID,SMILES,CollectionName,LOG HLM_CLint (mL/min/kg),LOG MDR1-MDCK ER (B-A/A-B),LOG SOLUBILITY PH 6.8 (ug/mL),LOG PLASMA PROTEIN BINDING (HUMAN) (% unbound),LOG PLASMA PROTEIN BINDING (RAT) (% unbound),LOG RLM_CLint (mL/min/kg),...,CLASS_LOG MDR1-MDCK ER (B-A/A-B),CLASS_LOG SOLUBILITY PH 6.8 (ug/mL),LOG HLM_CLint (mL/min/kg)_zscore,LOG HLM_CLint (mL/min/kg)_stereo_cliff,LOG RLM_CLint (mL/min/kg)_zscore,LOG RLM_CLint (mL/min/kg)_stereo_cliff,CLASS_LOG PLASMA PROTEIN BINDING (HUMAN) (% unbound)_stereo_cliff,CLASS_LOG PLASMA PROTEIN BINDING (RAT) (% unbound)_stereo_cliff,CLASS_LOG MDR1-MDCK ER (B-A/A-B)_stereo_cliff,CLASS_LOG SOLUBILITY PH 6.8 (ug/mL)_stereo_cliff
17,Mol3334,1397911,NC(=O)Cn1c2ccccc2c2nc3ccccc3nc21,emolecules,,,-0.823909,,,,...,,0.0,,,,,,,,
160,Mol3387,1528442,O=C1CCCc2nc(Nc3nc(-c4ccccc4)c4ccccc4n3)ncc21,emolecules,,,-0.79588,,,,...,,0.0,,,,,,,,
161,Mol3420,1542031,COc1ccc(Nc2nc3ccccc3c3nncn23)cc1,emolecules,,,-0.853872,,,,...,,0.0,,,,,,,,
289,Mol3212,1542039,CCc1nnc2c3ccccc3nc(Nc3ccc(OC)cc3)n12,emolecules,,-1.034437,-0.853872,,,,...,0.0,0.0,,,,,,,,
448,Mol3191,11930272,Cc1cc(C)c(-c2csc(NC(=O)Cn3cnnn3)n2)c(C)c1,emolecules,,0.885943,-0.920819,,,,...,0.0,0.0,,,,,,,,
758,Mol3202,27448904,Fc1ccc(Nc2nc(N3CCCCC3)c3nccnc3n2)cc1,emolecules,,-0.516295,-0.920819,,,,...,0.0,0.0,,,,,,,,
1115,Mol3428,49839446,c1cn2cc(-c3ccc4cn[nH]c4c3)nc(Nc3ccc(N4CCOCC4)c...,emolecules,,,-1.0,,,,...,,0.0,,,,,,,,
1339,Mol3448,1185031,CCCn1c(=O)c2c(nc3n2CCN3c2ccc(C)cc2)n(C)c1=O,emolecules,,,-0.823909,,,,...,,0.0,,,,,,,,
1505,Mol3159,31362621,COc1ccc2nc(C(=O)Nc3cnc4c(cnn4C(C)C)c3)ccc2c1,emolecules,,-0.333575,-0.823909,,,,...,0.0,0.0,,,,,,,,
1530,Mol3251,29665183,COc1cccc(-c2nnn(Cc3cc(F)cc4cccnc34)n2)c1,emolecules,,-0.313073,-0.853872,,,,...,0.0,0.0,,,,,,,,


In [14]:
df_full.to_csv(f"{OUTDIR}/ADME_public_set_3521_curated_v2.csv", index=False)

### Extend the PPB datasets with the public datasets used in Fang et al.

In [16]:
_endpoint = ["hPPB", "rPPB"]

In [17]:
data_dict = {}
for endpoint in _endpoint:
    data_dict[endpoint] = dm.read_sdf(f"{INDIR}/ADME_{endpoint}.sdf", as_df=True)

In [18]:
data_dict["hPPB"]["CollectionName"].value_counts()

CollectionName
chembl           1614
emolecules        187
mcule               3
labnetworkBB        2
enamineBB_pmc       1
enamineHTS          1
Name: count, dtype: int64

In [20]:
data = dm.read_csv(f"{INDIR}/ADME_public_set_3521.csv")

In [21]:
data.dropna(subset=[endpoints["hPPB"]])["CollectionName"].value_counts()

CollectionName
emolecules       187
mcule              3
labnetworkBB       2
enamineBB_pmc      1
enamineHTS         1
Name: count, dtype: int64

In [22]:
chembl_dict = {}
chembl_dict["hPPB"] = data_dict["hPPB"].query("CollectionName == 'chembl'")
chembl_dict["rPPB"] = data_dict["rPPB"].query("CollectionName == 'chembl'")

In [23]:
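# Keep only the rat PPB readout plus the join keys from the rat table
cols = ["LOG PLASMA PROTEIN BINDING (RAT) (% unbound)", "SMILES", "Internal ID", "Source", "CollectionName"]
# Outer merge so that molecules measured in only one species are kept
pbb_df = chembl_dict["hPPB"].merge(
    chembl_dict["rPPB"][cols], on=["SMILES", "Internal ID", "Source", "CollectionName"], how="outer"
)
# Drop the lowercase `smiles` column and keep the original `SMILES` column
pbb_df.drop(columns="smiles", inplace=True)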

In [24]:
extended_data = pd.concat([data, pbb_df], axis=0)

In [25]:
extended_data.reset_index(drop=True).to_csv(f"{OUTDIR}/ADME_public_set_extended.csv", index=False)

In [26]:
for col in data_cols:
    print(col)
    print(extended_data.dropna(subset=[col])["CollectionName"].value_counts())
    print("------------------")

LOG HLM_CLint (mL/min/kg)
CollectionName
emolecules       3027
enamineHTS         20
labnetworkBB       17
mcule              17
enamineBB_pmc       6
Name: count, dtype: int64
------------------
LOG RLM_CLint (mL/min/kg)
CollectionName
emolecules       2997
enamineHTS         19
labnetworkBB       17
mcule              15
enamineBB_pmc       6
Name: count, dtype: int64
------------------
LOG PLASMA PROTEIN BINDING (HUMAN) (% unbound)
CollectionName
chembl           1614
emolecules        187
mcule               3
labnetworkBB        2
enamineBB_pmc       1
enamineHTS          1
Name: count, dtype: int64
------------------
LOG PLASMA PROTEIN BINDING (RAT) (% unbound)
CollectionName
chembl           717
emolecules       162
labnetworkBB       3
enamineBB_pmc      2
mcule              1
Name: count, dtype: int64
------------------
LOG MDR1-MDCK ER (B-A/A-B)
CollectionName
emolecules       2594
labnetworkBB       16
enamineHTS         14
mcule              13
enamineBB_pmc       5
Name: c

In [27]:
file = f"{OUTDIR}/ADME_public_set_extended.csv"

In [29]:
data = pd.read_csv(file)
curator_with_stereo = curation.MolecularCurator(
    data=data,
    data_cols=data_cols,
    mol_col="SMILES",
    mask_stereo_undefined_mols=True,
    class_thresholds={endpoints[ep]: class_thresholds[ep] for ep in class_thresholds.keys()},
)
df_full = curator_with_stereo.run()



In [30]:
file_out = f"{OUTDIR}/ADME_public_set_extended_curated_v2.csv"
df_full.to_csv(file_out, index=False)