Note: this notebook is set up to run with the env.yml containing the name 'polaris_datasets'

## Background
ADME@NCATS is a resource developed by NCATS to host in silico prediction models for various ADME (Absorption, Distribution, Metabolism and Excretion) properties. The resource serves as an important tool for the drug discovery community with potential uses in compound optimization and prioritization. The models were retrospectively validated on a subset of marketed drugs which resulted in very good accuracies.

Data that were used for developing the models are made publicly accessible by depositing them into the PubChem database. In some instances, when the complete data cannot be made public, a subset of the data is deposited into PubChem. Links to the PubChem assays can be found in the individual model pages. Users are highly encouraged to use these data for development and validation of QSAR models.

## Assay Information
Aqueous solubility is one of the most important properties in drug discovery, as it has a profound impact on various drug properties, including biological activity, pharmacokinetics (PK), toxicity, and in vivo efficacy. Both kinetic and thermodynamic solubilities are determined during different stages of drug discovery and development. One way of assessing solubility is as follows:

![image.png](https://storage.googleapis.com/polaris-public/readme/datasets/img/04_02_ADME_NCATS_Solubility_data_curation.jpeg)

Image is from [here](https://www.emdmillipore.com/CA/en/product/MultiScreenHTS-PCF-Filter-Plates-for-Solubility-Assays,MM_NF-C8875?ReferrerURL=https%3A%2F%2Fwww.google.com%2F).


## Description of readout:
- **PUBCHEM_ACTIVITY_OUTCOME**: Corresponds to the phenotype observed. Compounds with a Moderate/High phenotype are "active" (class = 1); compounds with a Low phenotype are "inactive" (class = 0).
- **PUBCHEM_ACTIVITY_SCORE**: Solubility of the compound (ug/mL), reported as a whole number.
- **PHENOTYPE**: Indicates the type of activity observed: 0-10 ug/mL is Low solubility (class = 0); >10 ug/mL is Moderate/High solubility (class = 1).
- **KINETIC_AQUEOUS_SOLUBILITY**: Numerical value of the observed aqueous solubility, measured in ug/mL.
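
The threshold rule in the readout description can be written down in a few lines. This is only an illustrative sketch: the function name `solubility_class` and the example values are hypothetical, and the 10 ug/mL cutoff is taken from the PHENOTYPE description above.

```python
# Cutoff between Low and Moderate/High solubility, per the PHENOTYPE description
LOW_SOLUBILITY_CUTOFF = 10.0  # ug/mL


def solubility_class(solubility_ug_ml: float) -> int:
    """Return 1 for Moderate/High (> 10 ug/mL) and 0 for Low (0-10 ug/mL)."""
    return int(solubility_ug_ml > LOW_SOLUBILITY_CUTOFF)


# Hypothetical solubility values, not taken from the assay
examples = [0.5, 10.0, 14.83, 100.0]
classes = [solubility_class(v) for v in examples]
print(dict(zip(examples, classes)))
```

Note that a value of exactly 10 ug/mL falls in the Low class under this reading of the description.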

## Data resource

**Reference**: https://pubmed.ncbi.nlm.nih.gov/31176566/ 

**Raw data**: https://pubchem.ncbi.nlm.nih.gov/bioassay/1645848

In [17]:
%load_ext autoreload
%autoreload 2

import os
import sys
import pathlib

import pandas as pd
import datamol as dm

root = pathlib.Path("__file__").absolute().parents[3]
# set to recipe root directory
os.chdir(root)
sys.path.insert(0, str(root))

The autoreload extension is already loaded. To reload it, use:
  %reload_ext autoreload


In [18]:
org = "polaris"
data_name = "ncats_adme/Solubility"
dirname = dm.fs.join(root, f"org-{org}", data_name)
gcp_root = f"gs://polaris-public/polaris-recipes/org-{org}/{data_name}"

All datasets were downloaded directly from PubChem on 2024-03-21 by following the PubChem Bioassay links on https://opendata.ncats.nih.gov/adme/data.

In [19]:
# Load the data
source_data_path = f"{gcp_root}/data/raw/AID_1645848_raw.parquet"
data = pd.read_parquet(source_data_path)

Rows 0 and 1 are metadata; we will keep them separate.

In [20]:
meta_start = 0  # Start row index
meta_end = 2  # End row index + 1
data_meta = data.iloc[meta_start:(meta_end), :].copy()  # Save the metadata rows
data = data.drop(labels=list(range(meta_start, meta_end)), axis=0).reset_index(
    drop=True
)  # Drop those rows from the main dataframe
data_meta

Unnamed: 0,PUBCHEM_RESULT_TAG,PUBCHEM_SID,PUBCHEM_CID,PUBCHEM_EXT_DATASOURCE_SMILES,PUBCHEM_ACTIVITY_OUTCOME,PUBCHEM_ACTIVITY_SCORE,PUBCHEM_ACTIVITY_URL,PUBCHEM_ASSAYDATA_COMMENT,Phenotype,Kinetic Aqueous Solubility (ug/mL),Analysis Comment
0,RESULT_TYPE,,,,,,,,STRING,STRING,STRING
1,RESULT_DESCR,,,,,,,,Indicates type of activity observed: 0-10: Low...,Numerical value of the observed aqueous solubi...,Annotation/notes on a particular compound's da...


With the metadata rows dropped, keep only the SMILES, ID and outcome columns.

In [21]:
# Keep only the SMILES, ID and outcome columns
columns_to_keep = [
    "PUBCHEM_SID",
    "PUBCHEM_EXT_DATASOURCE_SMILES",
    "PUBCHEM_ACTIVITY_OUTCOME",
    "PUBCHEM_ACTIVITY_SCORE",
    "Phenotype",
    "Kinetic Aqueous Solubility (ug/mL)",
]
data = data[columns_to_keep]
data.rename(columns={"PUBCHEM_EXT_DATASOURCE_SMILES": "SMILES"}, inplace=True)
# Rename Kinetic Aqueous Solubility (ug/mL) (we will specify the units in the metadata)
data.rename(
    columns={"Kinetic Aqueous Solubility (ug/mL)": "Kinetic_Aqueous_Solubility"},
    inplace=True,
)
# Rename all columns to uppercase
data.rename(columns=str.upper, inplace=True)
# Drop rows where we don't have a solubility score (the other columns depend on this)
print(data.shape)
data.dropna(inplace=True, ignore_index=True)
print(f"after dropping inconclusive solubility: {data.shape}")

(2532, 6)
after dropping inconclusive solubility: (2456, 6)


We dropped 76 compounds (2532 - 2456) with inconclusive results.

In [22]:
data.describe()

Unnamed: 0,PUBCHEM_SID,PUBCHEM_ACTIVITY_SCORE
count,2456.0,2456.0
mean,185126400.0,19.962541
std,134446700.0,24.440681
min,843706.0,0.0
25%,89650120.0,1.0
50%,161004100.0,4.0
75%,363681100.0,39.0
max,404904700.0,100.0


Looking at the kinds of variables in each column:

In [23]:
data[
    [
        "PUBCHEM_ACTIVITY_OUTCOME",
        "PUBCHEM_ACTIVITY_SCORE",
        "PHENOTYPE",
        "KINETIC_AQUEOUS_SOLUBILITY",
    ]
].value_counts()

PUBCHEM_ACTIVITY_OUTCOME  PUBCHEM_ACTIVITY_SCORE  PHENOTYPE      KINETIC_AQUEOUS_SOLUBILITY
Inactive                  1.0                     Low            <1                            915
Active                    51.0                    Moderate/High  >51                            21
                          58.0                    Moderate/High  >58                            21
                          45.0                    Moderate/High  >45                            20
                          61.0                    Moderate/High  >61                            19
                                                                                              ... 
                          37.0                    Moderate/High  36.88                           1
                                                                 36.89                           1
                                                                 37.15                           1
                 

In [24]:
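# Count the unique censored (">") solubility values outside the f-string,
# so the cell also runs on Python versions before 3.12 (no nested same-type quotes)
n_censored = sum(">" in str(v) for v in data["KINETIC_AQUEOUS_SOLUBILITY"].unique())
print(f'number of ">" assignments in kinetic aqueous solubility: {n_censored}')
print(f"unique values in phenotype: {data['PHENOTYPE'].unique()}")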

number of ">" assignments in kinetic aqueous solubility: 74
unique values in phenotype: ['Moderate/High' 'Low']
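
Readouts such as `<1` and `>51` are censored values: the true solubility lies below or above the reported number. The notebook strips the qualifier and keeps only the number. As a minimal sketch of an alternative, the hypothetical helper `parse_censored` below keeps the qualifier alongside the numeric value, so the censoring information is not lost. It is not part of the curation workflow.

```python
def parse_censored(value: str) -> tuple[float, str]:
    """Split a readout such as '>51' into (51.0, '>'); uncensored values get '='."""
    text = str(value).strip()
    if text[:1] in ("<", ">"):
        return float(text[1:]), text[0]
    return float(text), "="


# Example strings in the same format as the KINETIC_AQUEOUS_SOLUBILITY column
parsed = [parse_censored(v) for v in ["<1", ">51", "36.88"]]
print(parsed)
```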


In [25]:
# Map Active/Inactive and Moderate/High/Low to 1 and 0
data["PUBCHEM_ACTIVITY_OUTCOME"] = data["PUBCHEM_ACTIVITY_OUTCOME"].map(
    {"Active": 1.0, "Inactive": 0.0}
)
data["PHENOTYPE"] = data["PHENOTYPE"].map({"Moderate/High": 1.0, "Low": 0.0})

# Remove the '>' and '<' symbols from 'KINETIC_AQUEOUS_SOLUBILITY' values
data["KINETIC_AQUEOUS_SOLUBILITY"] = data["KINETIC_AQUEOUS_SOLUBILITY"].apply(
    lambda x: float(x.replace(">", "").replace("<", ""))
)

In [26]:
print(
    f"Number of soluble compounds based on column `PUBCHEM_ACTIVITY_OUTCOME`: {sum(data['PUBCHEM_ACTIVITY_OUTCOME']==1)}"
)
print(
    f"Number of soluble compounds based on column `PUBCHEM_ACTIVITY_SCORE`: {sum(data['PUBCHEM_ACTIVITY_SCORE']>10)}"
)
print(
    f"Number of soluble compounds based on column `KINETIC_AQUEOUS_SOLUBILITY`: {sum(data['KINETIC_AQUEOUS_SOLUBILITY']>10)}"
)

Number of soluble compounds based on column `PUBCHEM_ACTIVITY_OUTCOME`: 1056
Number of soluble compounds based on column `PUBCHEM_ACTIVITY_SCORE`: 1046
Number of soluble compounds based on column `KINETIC_AQUEOUS_SOLUBILITY`: 1056


The counts show that PUBCHEM_ACTIVITY_OUTCOME is computed from KINETIC_AQUEOUS_SOLUBILITY rather than from PUBCHEM_ACTIVITY_SCORE: the outcome and solubility columns both give 1056 soluble compounds, while the score gives 1046.
Therefore, we keep PUBCHEM_ACTIVITY_OUTCOME and KINETIC_AQUEOUS_SOLUBILITY as the categorical and continuous data columns, respectively.
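
One plausible explanation for the gap of 10 compounds between the score-based and solubility-based counts is rounding: a measured value slightly above 10 ug/mL rounds down to a whole-number score of 10, which no longer passes a `> 10` test. This is an assumption about how the score was derived, not something PubChem states here. The sketch below uses hypothetical values to show the effect.

```python
THRESHOLD = 10.0  # ug/mL cutoff between Low and Moderate/High

# Hypothetical measured solubilities (ug/mL), not taken from the dataset
measured = [9.6, 10.2, 10.4, 10.6, 36.88]
scores = [round(v) for v in measured]  # whole-number score (assumed rounding)

by_measured = [v > THRESHOLD for v in measured]
by_score = [s > THRESHOLD for s in scores]

# Values whose class changes depending on which column is thresholded
disagree = [v for v, a, b in zip(measured, by_measured, by_score) if a != b]
print(disagree)
```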

In [27]:
data.isna().any()

PUBCHEM_SID                   False
SMILES                        False
PUBCHEM_ACTIVITY_OUTCOME      False
PUBCHEM_ACTIVITY_SCORE        False
PHENOTYPE                     False
KINETIC_AQUEOUS_SOLUBILITY    False
dtype: bool

In [28]:
data.describe()

Unnamed: 0,PUBCHEM_SID,PUBCHEM_ACTIVITY_OUTCOME,PUBCHEM_ACTIVITY_SCORE,PHENOTYPE,KINETIC_AQUEOUS_SOLUBILITY
count,2456.0,2456.0,2456.0,2456.0,2456.0
mean,185126400.0,0.429967,19.962541,0.429967,19.969126
std,134446700.0,0.495172,24.440681,0.495172,24.440162
min,843706.0,0.0,0.0,0.0,0.005
25%,89650120.0,0.0,1.0,0.0,1.0
50%,161004100.0,0.0,4.0,0.0,4.382
75%,363681100.0,1.0,39.0,1.0,39.28
max,404904700.0,1.0,100.0,1.0,100.0


In [29]:
# Define data column names
data_cols = [
    "PUBCHEM_ACTIVITY_OUTCOME",
    "KINETIC_AQUEOUS_SOLUBILITY",
]
mol_col = "SMILES"

### Run preliminary curation for data inspection

In [30]:
data_cols

['PUBCHEM_ACTIVITY_OUTCOME', 'KINETIC_AQUEOUS_SOLUBILITY']

In [31]:
# import key curation components from auroris
from auroris.curation import Curator
from auroris.curation.actions import (
    MoleculeCuration,
    OutlierDetection,
    Discretization,
    Deduplication,
    StereoIsomerACDetection,
    ContinuousDistributionVisualization,
)

# Define the curation workflow
curator = Curator(
    data_path=source_data_path,
    steps=[
        MoleculeCuration(input_column=mol_col, y_cols=data_cols),
        ContinuousDistributionVisualization(
            y_cols=["PUBCHEM_ACTIVITY_OUTCOME"], bins=[0.5]
        ),
        ContinuousDistributionVisualization(
            y_cols=["KINETIC_AQUEOUS_SOLUBILITY"], bins=[10]
        ),
        OutlierDetection(
            method="zscore", columns=data_cols, threshold=3, use_modified_zscore=True
        ),
        StereoIsomerACDetection(y_cols=data_cols, threshold=3),
    ],
    parallelized_kwargs={"n_jobs": -1},
)

curator.to_json(f"{dirname}/inspection_config.json")

In [32]:
# Run the curation step defined as above
data_inspection, report = curator(data)

2024-07-10 02:19:36.830 | INFO     | auroris.curation._curator:transform:106 - Performing step: mol_curation
2024-07-10 02:19:45.428 | INFO     | auroris.curation._curator:transform:106 - Performing step: distribution
2024-07-10 02:19:45.484 | INFO     | auroris.curation._curator:transform:106 - Performing step: distribution
2024-07-10 02:19:45.543 | INFO     | auroris.curation._curator:transform:106 - Performing step: outlier_detection
2024-07-10 02:19:45.648 | INFO     | auroris.curation._curator:transform:106 - Performing step: ac_stereoisomer


In [33]:
#  get the curation logger
from auroris.report.broadcaster import LoggerBroadcaster

broadcaster = LoggerBroadcaster(report)
broadcaster.broadcast()

===== Curation Report =====
Time: 2024-07-10 02:19:36
Version: dev
===== mol_curation =====
[LOG]: New column added: MOL_smiles
[LOG]: New column added: MOL_molhash_id
[LOG]: New column added: MOL_molhash_id_no_stereo
[LOG]: New column added: MOL_num_stereoisomers
[LOG]: New column added: MOL_num_undefined_stereoisomers
[LOG]: New column added: MOL_num_defined_stereo_center
[LOG]: New column added: MOL_num_undefined_stereo_center
[LOG]: New column added: MOL_num_stereo_center
[LOG]: New column added: MOL_undefined_E_D
[LOG]: New column added: MOL_undefined_E/Z
[LOG]: Default `ecfp` fingerprint is used to visualize the chemical space.
[LOG]: Molecules with undefined stereocenter detected: 425.
[IMG]: Dimensions 2400 x 600
[IMG]: Dimensions 1200 x 2400
===== distribution

In [34]:
# Generate an HTML report with embedded visualizations showcasing the data analysis.
from utils.auroris_utils import HTMLBroadcaster

# export report to local directory
broadcaster = HTMLBroadcaster(report, f"{dirname}/inspection_report")
report_path = broadcaster.broadcast()

In [35]:
# check the curated data
data_inspection.describe(include="all")

Unnamed: 0,PUBCHEM_SID,SMILES,PUBCHEM_ACTIVITY_OUTCOME,PUBCHEM_ACTIVITY_SCORE,PHENOTYPE,KINETIC_AQUEOUS_SOLUBILITY,MOL_smiles,MOL_molhash_id,MOL_molhash_id_no_stereo,MOL_num_stereoisomers,MOL_num_undefined_stereoisomers,MOL_num_defined_stereo_center,MOL_num_undefined_stereo_center,MOL_num_stereo_center,MOL_undefined_E_D,MOL_undefined_E/Z,OUTLIER_PUBCHEM_ACTIVITY_OUTCOME,OUTLIER_KINETIC_AQUEOUS_SOLUBILITY,AC_PUBCHEM_ACTIVITY_OUTCOME,AC_KINETIC_AQUEOUS_SOLUBILITY
count,2456.0,2456,2456.0,2456.0,2456.0,2456.0,2456,2456,2456,2456.0,2456.0,2456.0,2456.0,2456.0,2456,2456.0,2456,2456,2456,2456
unique,,2454,,,,,2454,2454,2454,,,,,,2,1.0,1,2,2,2
top,,CC(=O)NC1=CC=C(C=C1)OCC2=C(C=CC(=C2)C3NC4=CC=C...,,,,,COc1ccc(C2Nc3ccccc3C(=O)N2Cc2ccco2)cc1COc1ccc(...,8e83d520d92f8a07dfc22d0c9c91c6a3ec3f7a6b,5b722d3c844e6405c6c6603c319ae804f34e183d,,,,,,False,0.0,False,False,False,False
freq,,2,,,,,2,2,2,,,,,,2058,2456.0,2456,2439,2454,2454
mean,185126400.0,,0.429967,19.962541,0.429967,19.969126,,,,939.1535,1.337134,0.197068,0.202769,0.399837,,,,,,
std,134446700.0,,0.495172,24.440681,0.495172,24.440162,,,,42418.56,2.536484,0.980589,0.522808,1.113399,,,,,,
min,843706.0,,0.0,0.0,0.0,0.005,,,,1.0,1.0,0.0,0.0,0.0,,,,,,
25%,89650120.0,,0.0,1.0,0.0,1.0,,,,1.0,1.0,0.0,0.0,0.0,,,,,,
50%,161004100.0,,0.0,4.0,0.0,4.382,,,,1.0,1.0,0.0,0.0,0.0,,,,,,
75%,363681100.0,,1.0,39.0,1.0,39.28,,,,2.0,1.0,0.0,0.0,0.0,,,,,,
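
The `OUTLIER_*` columns in the table above come from the `OutlierDetection` step, which is configured with `method="zscore"`, `use_modified_zscore=True` and `threshold=3`. For orientation, here is a sketch of the textbook modified z-score, `0.6745 * (x - median) / MAD`. It assumes nothing about auroris internals, which may compute the statistic differently, and the input values are hypothetical.

```python
import statistics


def modified_zscores(values):
    """Textbook modified z-score: 0.6745 * (x - median) / MAD."""
    med = statistics.median(values)
    # Median absolute deviation from the median
    mad = statistics.median(abs(v - med) for v in values)
    if mad == 0:
        # Degenerate case: no spread around the median, flag nothing
        return [0.0 for _ in values]
    return [0.6745 * (v - med) / mad for v in values]


# Hypothetical solubility values (ug/mL)
values = [1.0, 1.0, 4.0, 5.0, 6.0, 100.0]
zscores = modified_zscores(values)
flagged = [v for v, z in zip(values, zscores) if abs(z) > 3]
print(flagged)
```

Because the median and MAD are robust to extreme values, a single large measurement does not inflate the scale the way it would with a mean and standard deviation.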


### Check the data distribution

<img src="inspection_report/images/3-Data_distribution_KINETIC_AQUEOUS_SOLUBILITY.png" width=600 height=300>

The solubility values are fairly balanced between the `soluble` and `insoluble` classes.
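
As a quick check of that balance, the counts printed earlier in the notebook (1056 compounds labelled soluble out of 2456 retained) give the class fraction directly. The variable names below are illustrative.

```python
# Counts reported earlier in the notebook
n_total = 2456    # compounds left after dropping inconclusive results
n_soluble = 1056  # compounds with PUBCHEM_ACTIVITY_OUTCOME == 1

fraction_soluble = n_soluble / n_total
print(f"soluble: {fraction_soluble:.1%}, insoluble: {1 - fraction_soluble:.1%}")
```

The fraction matches the mean of PUBCHEM_ACTIVITY_OUTCOME (0.429967) shown by `data.describe()`.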

### Check activity shift between stereoisomers

Two entries were flagged with an activity shift in the dataset.
Let's check those molecules.

In [36]:
data_inspection.loc[[964, 1803]]

Unnamed: 0,PUBCHEM_SID,SMILES,PUBCHEM_ACTIVITY_OUTCOME,PUBCHEM_ACTIVITY_SCORE,PHENOTYPE,KINETIC_AQUEOUS_SOLUBILITY,MOL_smiles,MOL_molhash_id,MOL_molhash_id_no_stereo,MOL_num_stereoisomers,MOL_num_undefined_stereoisomers,MOL_num_defined_stereo_center,MOL_num_undefined_stereo_center,MOL_num_stereo_center,MOL_undefined_E_D,MOL_undefined_E/Z,OUTLIER_PUBCHEM_ACTIVITY_OUTCOME,OUTLIER_KINETIC_AQUEOUS_SOLUBILITY,AC_PUBCHEM_ACTIVITY_OUTCOME,AC_KINETIC_AQUEOUS_SOLUBILITY
964,124888752.0,CC(=O)NC1=CC=C(C=C1)OCC2=C(C=CC(=C2)C3NC4=CC=C...,1.0,15.0,1.0,14.83,COc1ccc(C2Nc3ccccc3C(=O)N2Cc2ccco2)cc1COc1ccc(...,8e83d520d92f8a07dfc22d0c9c91c6a3ec3f7a6b,5b722d3c844e6405c6c6603c319ae804f34e183d,2,2,0,1,1,True,False,False,False,True,False
1803,57655021.0,CC(=O)NC1=CC=C(C=C1)OCC2=C(C=CC(=C2)C3NC4=CC=C...,0.0,1.0,0.0,1.0,COc1ccc(C2Nc3ccccc3C(=O)N2Cc2ccco2)cc1COc1ccc(...,8e83d520d92f8a07dfc22d0c9c91c6a3ec3f7a6b,5b722d3c844e6405c6c6603c319ae804f34e183d,2,2,0,1,1,True,False,False,False,True,False


As we suspected, two samples with the same SMILES but different SIDs are reported with opposite outcomes: one active and one inactive. Looking at the molecules, they seem identical. We can't know which call is correct, so we should remove both. We'll do that below, before the final curation.

SID 124888752 is [NCGC00066484-02](https://pubchem.ncbi.nlm.nih.gov/substance/124888752). SID 57655021 is [NCGC00168126-01](https://pubchem.ncbi.nlm.nih.gov/substance/57655021). The 2D representations are the same.

![image-2.png](inspection_report/images/6-Activity_shifts_among_stereoisomers__PUBCHEM_ACTIVITY_OUTCOME.png)
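
The reasoning above (when the same structure carries contradictory labels, drop every record for it) can be sketched generically. The records, SIDs and SMILES below are hypothetical. The notebook itself removes the two affected SIDs by hand.

```python
from collections import defaultdict

# Hypothetical records: the first two share a structure but disagree on outcome
records = [
    {"sid": 1, "smiles": "CCO", "outcome": 1},
    {"sid": 2, "smiles": "CCO", "outcome": 0},
    {"sid": 3, "smiles": "c1ccccc1", "outcome": 0},
    {"sid": 4, "smiles": "CCN", "outcome": 1},
    {"sid": 5, "smiles": "CCN", "outcome": 1},
]

# Collect the set of outcomes seen for each structure
labels = defaultdict(set)
for record in records:
    labels[record["smiles"]].add(record["outcome"])

# A structure is conflicting if it has more than one distinct outcome
conflicting = {smiles for smiles, outcomes in labels.items() if len(outcomes) > 1}

# Drop every record of a conflicting structure; consistent replicates are kept
kept_sids = [r["sid"] for r in records if r["smiles"] not in conflicting]
print(conflicting, kept_sids)
```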

In [37]:
# Remove the above two samples
data_to_curate = data.query("PUBCHEM_SID not in [124888752, 57655021]").reset_index(
    drop=True
)

### Re-run curation
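
The final workflow adds a `Deduplication` step, which collapses replicate measurements of the same molecule into one row. The aggregation rule auroris applies is not shown in this notebook. The sketch below assumes a median over replicates, purely for illustration, and uses hypothetical measurements.

```python
import statistics
from collections import defaultdict

# Hypothetical replicate measurements: (molecule key, solubility in ug/mL)
measurements = [("CCO", 40.0), ("CCO", 44.0), ("CCO", 41.0), ("CCN", 3.2)]

# Group the replicate values by molecule
grouped = defaultdict(list)
for key, value in measurements:
    grouped[key].append(value)

# Assumed aggregation: one median value per molecule
deduplicated = {key: statistics.median(vals) for key, vals in grouped.items()}
print(deduplicated)
```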

In [38]:
# Define the final curation workflow
curator = Curator(
    source_data=source_data_path,
    steps=[
        MoleculeCuration(input_column=mol_col, y_cols=data_cols),
        Deduplication(
            deduplicate_on=mol_col, y_cols=data_cols
        ),  # remove the replicated molecules
        ContinuousDistributionVisualization(
            y_cols=["PUBCHEM_ACTIVITY_OUTCOME"], bins=[0.5]
        ),
        ContinuousDistributionVisualization(
            y_cols=["KINETIC_AQUEOUS_SOLUBILITY"], bins=[10]
        ),
        OutlierDetection(
            method="zscore", columns=data_cols, threshold=3, use_modified_zscore=True
        ),
        StereoIsomerACDetection(y_cols=data_cols, threshold=3),
    ],
    parallelized_kwargs={"n_jobs": -1},
)

In [39]:
# The final curation configuration is exported for reproducibility
path = f"{gcp_root}/data/curation/curation_config.json"
curator.to_json(path)

In [40]:
# Run the curation step defined as above
data_curated, report = curator(data_to_curate)

2024-07-10 02:19:48.132 | INFO     | auroris.curation._curator:transform:106 - Performing step: mol_curation
2024-07-10 02:19:56.688 | INFO     | auroris.curation._curator:transform:106 - Performing step: deduplicate
2024-07-10 02:19:58.144 | INFO     | auroris.curation._curator:transform:106 - Performing step: distribution
2024-07-10 02:19:58.201 | INFO     | auroris.curation._curator:transform:106 - Performing step: distribution
2024-07-10 02:19:58.260 | INFO     | auroris.curation._curator:transform:106 - Performing step: outlier_detection
2024-07-10 02:19:58.365 | INFO     | auroris.curation._curator:transform:106 - Performing step: ac_stereoisomer


In [41]:
broadcaster = LoggerBroadcaster(report)
broadcaster.broadcast()

===== Curation Report =====
Time: 2024-07-10 02:19:48
Version: dev
===== mol_curation =====
[LOG]: New column added: MOL_smiles
[LOG]: New column added: MOL_molhash_id
[LOG]: New column added: MOL_molhash_id_no_stereo
[LOG]: New column added: MOL_num_stereoisomers
[LOG]: New column added: MOL_num_undefined_stereoisomers
[LOG]: New column added: MOL_num_defined_stereo_center
[LOG]: New column added: MOL_num_undefined_stereo_center
[LOG]: New column added: MOL_num_stereo_center
[LOG]: New column added: MOL_undefined_E_D
[LOG]: New column added: MOL_undefined_E/Z
[LOG]: Default `ecfp` fingerprint is used to visualize the chemical space.
[LOG]: Molecules with undefined stereocenter detected: 423.
[IMG]: Dimensions 2400 x 600
[IMG]: Dimensions 1200 x 2400
===== deduplicate

In [42]:
# Export report to polaris public directory on GCP
# The report is ready to be reviewed in the HTML file.
broadcaster = HTMLBroadcaster(
    report, f"{gcp_root}/data/curation/report", embed_images=True
)
broadcaster.broadcast()

'gs://polaris-public/polaris-recipes/org-polaris/ncats_adme/Solubility/data/curation/report/index.html'

## Export the final curated data

In [43]:
fout = f"{gcp_root}/data/curation/{data_name}_curated.csv"
data_curated.reset_index(drop=True).to_csv(fout, index=False)