![ADME](https://storage.googleapis.com/polaris-public/icons/icon_fang.png) 

# `biogen/adme-fang-1` data curation


## Background

The goal of assessing ADME properties is to understand how a potential drug candidate interacts with the human body in terms of absorption, distribution, metabolism, and excretion. This knowledge is crucial for evaluating efficacy, safety, and clinical potential, and it guides drug development toward optimal therapeutic outcomes. Fang et al. (2023) disclosed DMPK datasets collected over 20 months across six ADME in vitro endpoints: human and rat liver microsomal stability, MDR1-MDCK efflux ratio, solubility, and human and rat plasma protein binding. The dataset contains 885 to 3087 measures for the corresponding endpoints. The compounds are chemically diverse across the full value range of each endpoint.

## Description of readout
- Microsomal stability (human and rat):  `LOG HLM_CLint (mL/min/kg)`, `LOG RLM_CLint (mL/min/kg)`
- Plasma protein binding (human and rat): `LOG PLASMA PROTEIN BINDING (HUMAN) (% unbound)`, `LOG PLASMA PROTEIN BINDING (RAT) (% unbound)`
- Permeability: `LOG MDR1-MDCK ER (B-A/A-B)`
- Solubility: `LOG SOLUBILITY PH 6.8 (ug/mL)`
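
All six readouts are stored as log-transformed values. The sketch below shows how a value could be converted back to its original unit, assuming the transform is base 10 (the column names only say `LOG`, so the base is an assumption here). The helper name `from_log10` is hypothetical.

```python
def from_log10(log_value: float) -> float:
    """Convert a log10-transformed readout back to its original unit.

    Assumption: the LOG columns use base 10.
    """
    return 10.0 ** log_value


# Example: a LOG SOLUBILITY PH 6.8 value of -1.0 would correspond to 0.1 ug/mL.
solubility_ug_per_ml = from_log10(-1.0)
print(f"{solubility_ug_per_ml:.3f} ug/mL")
```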


## Data resource
**Reference**: [Prospective Validation of Machine Learning Algorithms for Absorption, Distribution, Metabolism, and Excretion Prediction: An Industrial Perspective](https://doi.org/10.1021/acs.jcim.3c00160)

**Github**: https://github.com/molecularinformatics/Computational-ADME

**Raw data**: https://github.com/molecularinformatics/Computational-ADME/blob/main/ADME_public_set_3521.csv 

## Dataset entry point on Polaris
The dataset is available on Polaris [polaris/adme-fang-1](https://polarishub.io/datasets/polaris/adme-fang-1). 

## Curation reproducibility
The curation process in this notebook can be reproduced from the command line:

```shell
auroris curate org-Biogen/fang2023_ADME/curation_config.json org-Biogen/fang2023_ADME
```

In [1]:
%load_ext autoreload
%autoreload 2

import os
import sys
import pathlib

import pandas as pd
import datamol as dm

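# Set the working directory to the recipe root.
# "__file__" is a literal placeholder here (notebooks do not define `__file__`),
# so `parents[2]` resolves to two levels above the current working directory.
root = pathlib.Path("__file__").absolute().parents[2]
os.chdir(root)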
sys.path.insert(0, str(root))

In [2]:
org = "biogen"
data_name = "fang2023_ADME"
dirname = dm.fs.join(root, f"org-{org}", data_name)
gcp_root = f"gs://polaris-public/polaris-recipes/org-{org}/{data_name}"


# Load the data
source_data_path = f"{gcp_root}/raw/fang2023_ADME_public_set_3521.csv"
data = pd.read_csv(source_data_path)



In [3]:
data.describe(include="all")

Unnamed: 0,Internal ID,Vendor ID,SMILES,CollectionName,LOG HLM_CLint (mL/min/kg),LOG MDR1-MDCK ER (B-A/A-B),LOG SOLUBILITY PH 6.8 (ug/mL),LOG PLASMA PROTEIN BINDING (HUMAN) (% unbound),LOG PLASMA PROTEIN BINDING (RAT) (% unbound),LOG RLM_CLint (mL/min/kg)
count,3521,3521.0,3521,3521,3087.0,2642.0,2173.0,194.0,168.0,3054.0
unique,3521,3521.0,3521,5,,,,,,
top,Mol1,317714313.0,CNc1cc(Nc2cccn(-c3ccccn3)c2=O)nn2c(C(=O)N[C@@H...,emolecules,,,,,,
freq,1,1.0,1,3452,,,,,,
mean,,,,,1.320019,0.397829,1.259943,0.765722,0.764177,2.256207
std,,,,,0.623952,0.688465,0.683416,0.847902,0.798988,0.750422
min,,,,,0.675687,-1.162425,-1.0,-1.59346,-1.638272,1.02792
25%,,,,,0.675687,-0.162356,1.15351,0.168067,0.226564,1.688291
50%,,,,,1.205313,0.153291,1.542825,0.867555,0.776427,2.311068
75%,,,,,1.803115,0.905013,1.687351,1.501953,1.375962,2.835274


In [4]:
# Define data column names
data_cols = [
    "LOG HLM_CLint (mL/min/kg)",
    "LOG RLM_CLint (mL/min/kg)",
    "LOG PLASMA PROTEIN BINDING (HUMAN) (% unbound)",
    "LOG PLASMA PROTEIN BINDING (RAT) (% unbound)",
    "LOG MDR1-MDCK ER (B-A/A-B)",
    "LOG SOLUBILITY PH 6.8 (ug/mL)",
]
mol_col = "SMILES"

### Perform data curation with `auroris.curation` module
The curation process includes:
- assign a unique identifier to each molecule
- detect the stereochemistry information of molecules
- inspect potential outliers in the bioactivity values
- merge rows of replicated molecules
- detect stereoisomers that show activity shifts

Check out the curation module in [Auroris](https://github.com/polaris-hub/auroris). 
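
To illustrate the "merge rows of replicated molecules" step, here is a minimal, dependency-free sketch that collapses rows sharing the same molecule key and aggregates each readout by its median. This is not the Auroris implementation: the function name `merge_replicates`, the median aggregation, and the toy rows are all assumptions made for illustration.

```python
from collections import defaultdict
from statistics import median


def merge_replicates(rows, key, y_cols):
    """Merge rows that share `key`, taking the per-column median of non-missing values."""
    groups = defaultdict(list)
    for row in rows:
        groups[row[key]].append(row)

    merged = []
    for mol_id, group in groups.items():
        out = {key: mol_id}
        for col in y_cols:
            values = [r[col] for r in group if r.get(col) is not None]
            # Keep the readout missing when no replicate has a value.
            out[col] = median(values) if values else None
        merged.append(out)
    return merged


# Toy data (made up): two replicate measurements for one molecule.
col = "LOG SOLUBILITY PH 6.8 (ug/mL)"
rows = [
    {"SMILES": "CCO", col: 1.0},
    {"SMILES": "CCO", col: 2.0},
    {"SMILES": "c1ccccc1", col: None},
]
merged = merge_replicates(rows, key="SMILES", y_cols=[col])
print(merged)
```

Using the median rather than the mean keeps a single aberrant replicate from dominating the merged value, which is one reason it is a common choice for this kind of aggregation.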

In [5]:
# import key curation components from auroris
from auroris.curation import Curator
from auroris.curation.actions import (
    MoleculeCuration,
    OutlierDetection,
    Deduplication,
    StereoIsomerACDetection,
    ContinuousDistributionVisualization,
)

# Define the curation workflow
curator = Curator(
    data_path=source_data_path,
    steps=[
        MoleculeCuration(input_column=mol_col, y_cols=data_cols),
        ContinuousDistributionVisualization(y_cols=data_cols),
        OutlierDetection(
            method="zscore", columns=data_cols, threshold=3, use_modified_zscore=True
        ),
        StereoIsomerACDetection(y_cols=data_cols, threshold=3),
    ],
    parallelized_kwargs={"n_jobs": -1},
)

curator.to_json(f"{dirname}/inspection_config.json")

In [6]:
# Run the curation step defined as above
data_inspection, report = curator(data)

2024-07-09 23:50:52.882 | INFO     | auroris.curation._curator:transform:106 - Performing step: mol_curation
2024-07-09 23:51:21.070 | INFO     | auroris.curation._curator:transform:106 - Performing step: distribution
2024-07-09 23:51:21.346 | INFO     | auroris.curation._curator:transform:106 - Performing step: outlier_detection
2024-07-09 23:51:21.638 | INFO     | auroris.curation._curator:transform:106 - Performing step: ac_stereoisomer


In [7]:
#  get the curation logger
from auroris.report.broadcaster import LoggerBroadcaster

broadcaster = LoggerBroadcaster(report)
broadcaster.broadcast()

===== Curation Report =====
Time: 2024-07-09 23:50:52
Version: dev
===== mol_curation =====
[LOG]: New column added: MOL_smiles
[LOG]: New column added: MOL_molhash_id
[LOG]: New column added: MOL_molhash_id_no_stereo
[LOG]: New column added: MOL_num_stereoisomers
[LOG]: New column added: MOL_num_undefined_stereoisomers
[LOG]: New column added: MOL_num_defined_stereo_center
[LOG]: New column added: MOL_num_undefined_stereo_center
[LOG]: New column added: MOL_num_stereo_center
[LOG]: New column added: MOL_undefined_E_D
[LOG]: New column added: MOL_undefined_E/Z
[LOG]: Default `ecfp` fingerprint is used to visualize the chemical space.
[LOG]: Molecules with undefined stereocenter detected: 186.
[IMG]: Dimensions 2400 x 1800
[IMG]: Dimensions 1200 x 600
===== distribution

In [8]:
# Generate an HTML report with embedded visualizations showcasing the data analysis.
from utils.auroris_utils import HTMLBroadcaster

# export report to local directory
broadcaster = HTMLBroadcaster(report, f"{root}/inspection_report")
report_path = broadcaster.broadcast()

In [9]:
# check the curated data
data_inspection.describe(include="all")

Unnamed: 0,Internal ID,Vendor ID,SMILES,CollectionName,LOG HLM_CLint (mL/min/kg),LOG MDR1-MDCK ER (B-A/A-B),LOG SOLUBILITY PH 6.8 (ug/mL),LOG PLASMA PROTEIN BINDING (HUMAN) (% unbound),LOG PLASMA PROTEIN BINDING (RAT) (% unbound),LOG RLM_CLint (mL/min/kg),...,OUTLIER_LOG PLASMA PROTEIN BINDING (HUMAN) (% unbound),OUTLIER_LOG PLASMA PROTEIN BINDING (RAT) (% unbound),OUTLIER_LOG MDR1-MDCK ER (B-A/A-B),OUTLIER_LOG SOLUBILITY PH 6.8 (ug/mL),AC_LOG HLM_CLint (mL/min/kg),AC_LOG RLM_CLint (mL/min/kg),AC_LOG PLASMA PROTEIN BINDING (HUMAN) (% unbound),AC_LOG PLASMA PROTEIN BINDING (RAT) (% unbound),AC_LOG MDR1-MDCK ER (B-A/A-B),AC_LOG SOLUBILITY PH 6.8 (ug/mL)
count,3521,3521.0,3521,3521,3087.0,2642.0,2173.0,194.0,168.0,3054.0,...,3521,3521,3521,3521,3521,3521,3521,3521,3521,3521
unique,3521,3521.0,3521,5,,,,,,,...,1,2,2,2,1,1,1,1,1,1
top,Mol1,317714313.0,CNc1cc(Nc2cccn(-c3ccccn3)c2=O)nn2c(C(=O)N[C@@H...,emolecules,,,,,,,...,False,False,False,False,False,False,False,False,False,False
freq,1,1.0,1,3452,,,,,,,...,3521,3520,3520,3501,3521,3521,3521,3521,3521,3521
mean,,,,,1.320019,0.397829,1.259943,0.765722,0.764177,2.256207,...,,,,,,,,,,
std,,,,,0.623952,0.688465,0.683416,0.847902,0.798988,0.750422,...,,,,,,,,,,
min,,,,,0.675687,-1.162425,-1.0,-1.59346,-1.638272,1.02792,...,,,,,,,,,,
25%,,,,,0.675687,-0.162356,1.15351,0.168067,0.226564,1.688291,...,,,,,,,,,,
50%,,,,,1.205313,0.153291,1.542825,0.867555,0.776427,2.311068,...,,,,,,,,,,
75%,,,,,1.803115,0.905013,1.687351,1.501953,1.375962,2.835274,...,,,,,,,,,,


## Signals or outliers
This process uses `zscore` as the outlier detection method; a different method can be selected through the `method` parameter. \
For more details, please refer to `auroris.curation.actions.OutlierDetection`.
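
The workflow above sets `use_modified_zscore=True` with `threshold=3`. As a rough illustration of what a modified z-score test does, the sketch below uses the textbook definition based on the median and the median absolute deviation (MAD), with the usual 0.6745 scaling constant. Whether Auroris uses exactly this scaling is an assumption; the function names and toy values are hypothetical.

```python
from statistics import median


def modified_zscores(values):
    """Return modified z-scores; missing values (None) map to None."""
    observed = [v for v in values if v is not None]
    med = median(observed)
    # Median absolute deviation from the median.
    mad = median(abs(v - med) for v in observed)
    if mad == 0:
        raise ValueError("MAD is zero; the modified z-score is undefined")
    return [None if v is None else 0.6745 * (v - med) / mad for v in values]


def flag_outliers(values, threshold=3.0):
    """Flag values whose absolute modified z-score exceeds `threshold`."""
    scores = modified_zscores(values)
    return [s is not None and abs(s) > threshold for s in scores]


# Toy data (made up): one value far from the rest and one missing measurement.
values = [1.0, 1.2, 0.9, 1.1, None, 1.0, 5.0]
flags = flag_outliers(values)
print(flags)
```

Because the median and MAD are barely affected by a few extreme values, this variant is less prone than the mean/standard-deviation z-score to having the outliers mask themselves.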

During the curation process, several potential outliers were flagged across multiple endpoints. These outliers have been marked and included in the curated output. 


Below is the probability plot for `LOG_HLM_CLint__mL_min_kg`, which is also available in the `Outlier detection` section of the curation report.

![LOG_HLM_CLint_mL_min_kg](inspection_report/images/Outlier_detection_-_LOG_HLM_CLint__mL_min_kg_.png)

Note that the flagged outliers (highlighted in red) lie at the extremes of the data distribution but remain within the value range of the `LOG_HLM_CLint__mL_min_kg` readout, so they are likely false positives. They should therefore be examined closely rather than removed automatically.

The readouts `LOG PLASMA PROTEIN BINDING (RAT) (% unbound)`, `LOG MDR1-MDCK ER (B-A/A-B)`, and `LOG SOLUBILITY PH 6.8 (ug/mL)` show a similar pattern.
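
The number of flagged rows per endpoint can be read off the `describe` output of the inspected data: each visible `OUTLIER_` column reports `False` as its most frequent value, so the flagged count is the total row count minus that frequency. The short calculation below uses only the numbers shown in that output; the flag columns for the two microsomal stability readouts are hidden behind the `...` and are therefore left out.

```python
# Values copied from the `describe` output of the inspected data.
total_rows = 3521
false_freq = {
    "LOG PLASMA PROTEIN BINDING (HUMAN) (% unbound)": 3521,
    "LOG PLASMA PROTEIN BINDING (RAT) (% unbound)": 3520,
    "LOG MDR1-MDCK ER (B-A/A-B)": 3520,
    "LOG SOLUBILITY PH 6.8 (ug/mL)": 3501,
}

# Rows flagged as outliers = rows that are not `False`.
flagged = {col: total_rows - freq for col, freq in false_freq.items()}
for col, n in flagged.items():
    print(f"{col}: {n} flagged")
```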

## Chemical space coverage of the dataset

![chemical space chem_all](inspection_report/images/Distribution_in_Chemical_Space_-_ECFP.png)

The plots above show the chemical space coverage of the molecules in the dataset for each of the six endpoints.

### Examples of molecules with incomplete stereochemical information

![undefined_stereo](inspection_report/images/Molecules_with_undefined_stereocenters.png)

No activity shifts were detected among the stereoisomers. Therefore, it is not necessary to remove the molecules with undefined stereocenters from the dataset.
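
For intuition, an activity shift among stereoisomers can be checked by grouping molecules on a stereo-agnostic identifier (such as the `MOL_molhash_id_no_stereo` column added during curation) and comparing the readouts within each group. The sketch below uses a deliberately simple criterion, the spread (max minus min) within a group exceeding a threshold. The criterion, the function name, the threshold value, and the toy rows are assumptions for illustration and may differ from what `StereoIsomerACDetection` actually computes.

```python
from collections import defaultdict


def stereo_activity_shifts(rows, group_key, y_col, threshold):
    """Return group ids whose members differ in `y_col` by more than `threshold`."""
    groups = defaultdict(list)
    for row in rows:
        if row.get(y_col) is not None:
            groups[row[group_key]].append(row[y_col])
    return [
        gid
        for gid, vals in groups.items()
        # A shift needs at least two measured stereoisomers in the group.
        if len(vals) > 1 and max(vals) - min(vals) > threshold
    ]


# Toy data (made up): "hashA" holds two stereoisomers with similar activity,
# "hashB" holds two stereoisomers with a large difference.
y_col = "LOG HLM_CLint (mL/min/kg)"
rows = [
    {"MOL_molhash_id_no_stereo": "hashA", y_col: 1.2},
    {"MOL_molhash_id_no_stereo": "hashA", y_col: 1.3},
    {"MOL_molhash_id_no_stereo": "hashB", y_col: 0.7},
    {"MOL_molhash_id_no_stereo": "hashB", y_col: 2.9},
    {"MOL_molhash_id_no_stereo": "hashC", y_col: None},
]
shifted = stereo_activity_shifts(rows, "MOL_molhash_id_no_stereo", y_col, threshold=1.0)
print(shifted)
```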

## Rerun data curation and export curated data for downstream tasks

In [10]:
# Define the final curation workflow
curator = Curator(
    data_path=source_data_path,  # same keyword as in the inspection workflow above
    steps=[
        MoleculeCuration(input_column=mol_col, y_cols=data_cols),
        ContinuousDistributionVisualization(y_cols=data_cols),
        Deduplication(
            deduplicate_on=mol_col, y_cols=data_cols
        ),  # remove the replicated molecules
        OutlierDetection(
            method="zscore", columns=data_cols, threshold=3, use_modified_zscore=True
        ),
        StereoIsomerACDetection(y_cols=data_cols, threshold=3),
    ],
    parallelized_kwargs={"n_jobs": -1},
)

In [11]:
# The final curation configuration is exported for reproducibility
path = f"{gcp_root}/data/curation/curation_config.json"
curator.to_json(path)

In [12]:
# Run the curation step defined as above
data_curated, report = curator(data)

2024-07-09 23:51:24.809 | INFO     | auroris.curation._curator:transform:106 - Performing step: mol_curation
2024-07-09 23:51:38.951 | INFO     | auroris.curation._curator:transform:106 - Performing step: distribution
2024-07-09 23:51:39.229 | INFO     | auroris.curation._curator:transform:106 - Performing step: deduplicate
2024-07-09 23:51:41.468 | INFO     | auroris.curation._curator:transform:106 - Performing step: outlier_detection
2024-07-09 23:51:41.774 | INFO     | auroris.curation._curator:transform:106 - Performing step: ac_stereoisomer


In [13]:
broadcaster = LoggerBroadcaster(report)
broadcaster.broadcast()

===== Curation Report =====
Time: 2024-07-09 23:51:24
Version: dev
===== mol_curation =====
[LOG]: New column added: MOL_smiles
[LOG]: New column added: MOL_molhash_id
[LOG]: New column added: MOL_molhash_id_no_stereo
[LOG]: New column added: MOL_num_stereoisomers
[LOG]: New column added: MOL_num_undefined_stereoisomers
[LOG]: New column added: MOL_num_defined_stereo_center
[LOG]: New column added: MOL_num_undefined_stereo_center
[LOG]: New column added: MOL_num_stereo_center
[LOG]: New column added: MOL_undefined_E_D
[LOG]: New column added: MOL_undefined_E/Z
[LOG]: Default `ecfp` fingerprint is used to visualize the chemical space.
[LOG]: Molecules with undefined stereocenter detected: 186.
[IMG]: Dimensions 2400 x 1800
[IMG]: Dimensions 1200 x 600
===== distribution

In [14]:
# Export the report to the Polaris public directory on GCP.
# The report is ready to be reviewed in the HTML file.
broadcaster = HTMLBroadcaster(
    report, f"{gcp_root}/data/curation/report", embed_images=True
)
broadcaster.broadcast()

'gs://polaris-public/polaris-recipes/org-biogen/fang2023_ADME/data/curation/report/index.html'

## Export the final curated data

In [15]:
fout = f"{gcp_root}/data/curation/{data_name}_curated.csv"
data_curated.reset_index(drop=True).to_csv(fout, index=False)