Note: this notebook is set up to run in the environment defined by the `env.yml` file, whose environment name is `polaris_datasets`.

## Background
This is part of a release of experimental data determined at AstraZeneca on a set of compounds in the following assays: pKa, lipophilicity (LogD7.4), aqueous solubility, plasma protein binding (human, rat, dog, mouse and guinea pig), and intrinsic clearance (human liver microsomes, human and rat hepatocytes).

## Assay Information
Hepatic metabolic stability is a key pharmacokinetic parameter in drug discovery. Metabolic stability is usually assessed in microsomal fractions and only the best compounds progress in the drug discovery process.

![image.png](https://media.springernature.com/lw685/springer-static/image/art%3A10.1007%2Fs00216-016-9929-6/MediaObjects/216_2016_9929_Fig1_HTML.gif)

The image is from [this paper](https://link.springer.com/article/10.1007/s00216-016-9929-6). Biologically active compounds can be transformed or destroyed by the action of enzymes in the liver. Microsomes are small membrane vesicles, formed mainly from fragments of the endoplasmic reticulum when liver cells are homogenized, and they can be used as a proxy for how well a drug survives a trip through the liver.

## Description of readout
- **HLM_CLEARANCE**: Intrinsic clearance measured in human liver microsomes following incubation at 37 °C. Experimental range: <3 to >150 µL/min/mg. Reference: Rapid Commun. Mass Spectrom. 2010, 24, 1730-1736.
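
The quoted experimental range suggests that readouts sitting at the limits may be censored, so a reported 3 could stand for "<3" and a reported 150 for ">150". This is an interpretation, not something the source states. The sketch below shows one way such readouts could be flagged before modelling; the names `LOWER_LIMIT`, `UPPER_LIMIT` and `flag_censored` are hypothetical and the values are made up.

```python
import pandas as pd

# Assumed limits, taken from the experimental range quoted above (µL/min/mg).
LOWER_LIMIT = 3.0
UPPER_LIMIT = 150.0


def flag_censored(values: pd.Series) -> pd.DataFrame:
    """Flag readouts that sit at or beyond the assay's reporting limits."""
    return pd.DataFrame(
        {
            "value": values,
            "left_censored": values <= LOWER_LIMIT,
            "right_censored": values >= UPPER_LIMIT,
        }
    )


# Toy values for illustration only; these are not rows from the dataset.
toy = pd.Series([3.0, 12.7, 42.7, 150.0, 3.0])
flags = flag_censored(toy)
print(flags[["left_censored", "right_censored"]].sum())
```

Counting the flagged rows gives a quick sense of how much of the distribution is piled up at the limits.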

## Data resource

**Reference**: https://www.ebi.ac.uk/chembl/document_report_card/CHEMBL3301361/

**Raw data**: https://www.ebi.ac.uk/chembl/assay_report_card/CHEMBL3301370/

## Curation reproducibility
The curation process in this notebook can be reproduced by command line:

```shell
auroris curate org-Polaris/astra_zeneca/HLM_clearance/curation_config.json org-Polaris/astra_zeneca/HLM_clearance
```

In [1]:
%load_ext autoreload
%autoreload 2

import os
import sys
import pathlib

import pandas as pd
import datamol as dm

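# `__file__` is not defined inside a notebook, so the literal string "__file__" is used
# as a placeholder file name in the current working directory.
root = pathlib.Path("__file__").absolute().parents[3]
# Set the working directory to the recipe root and make it importable
os.chdir(root)
sys.path.insert(0, str(root))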

We can get the dataset directly from ChEMBL. The document report card at https://www.ebi.ac.uk/chembl/document_report_card/CHEMBL3301361/ describes the overall experimental data, and the ChEMBL IDs that follow it each hold the data for one specific endpoint. The following map describes each of them:

In [2]:
# chembl_map = {
#     'CHEMBL3301362' : 'Most basic pKa value (pKa B1)', # 261
#     'CHEMBL3301363' : 'Octan-1-ol/water (pH7.4) distribution coefficient', # 4200
#     'CHEMBL3301364' : 'Solubility in pH7.4 buffer', #1763
#     'CHEMBL3301365' : '% bound to plasma by equilibrium dialysis, human plasma', # PPB # 1614
#     'CHEMBL3301366' : '% bound to plasma by equilibrium dialysis, rat plasma', # 717
#     'CHEMBL3301367' : '% bound to plasma by equilibrium dialysis, dog plasma', # 244
#     'CHEMBL3301368' : '% bound to plasma by equilibrium dialysis, mouse plasma', # 162
#     'CHEMBL3301369' : '% bound to plasma by equilibrium dialysis, guinea pig plasma', # 91
#     'CHEMBL3301370' : 'Intrinsic clearance measured in human liver microsomes', #1102,
#     'CHEMBL3301371' : 'Intrinsic clearance measured in rat hepatocytes', # 837
#     'CHEMBL3301372' : 'Intrinsic clearance measured in human hepatocytes', # 408
# }

In [3]:
org = "polaris"
data_name = "HLM_clearance"
dirname = dm.fs.join(root, f"org-{org}", "astra_zeneca", data_name)
gcp_root = f"gs://polaris-public/polaris-recipes/org-{org}/astra_zeneca/{data_name}"


# Load the data
source_data_path = "gs://polaris-public/polaris-recipes/org-polaris/astra_zeneca/HLM_clearance/CHEMBL3301370_raw.parquet"
data = pd.read_parquet(source_data_path)

The raw dataframe contains many columns that we don't need. Below we keep only the SMILES string and the measured value.

In [4]:
columns_to_keep = [
    "canonical_smiles",
    "standard_value",
]
data = data[columns_to_keep].copy()

# Convert the readout to numeric values
data["standard_value"] = pd.to_numeric(data["standard_value"])

# Rename columns
data = data.rename(
    columns={
        "canonical_smiles": "SMILES",
        "standard_value": "HLM_CLEARANCE",
    }
)

data

| | SMILES | HLM_CLEARANCE |
|---|---|---|
| 0 | Cc1nc(-c2ccc(OCC(C)C)c(C#N)c2)sc1C(=O)O | 3.0 |
| 1 | O=S(=O)(c1ccccc1)N(CC(F)(F)F)c1ccc(C(O)(C(F)(F... | 3.0 |
| 2 | O=C(O)CCc1nc(-c2ccccc2)c(-c2ccccc2)o1 | 3.0 |
| 3 | Cc1cc(C)nc(SCC(N)=O)n1 | 3.0 |
| 4 | CN1CCN(CC(=O)N2c3ccccc3C(=O)Nc3cccnc32)CC1 | 3.0 |
| ... | ... | ... |
| 1097 | O=C(NCCc1ccccc1)c1cc(-n2ncc(=O)[nH]c2=O)ccc1Cl | 3.0 |
| 1098 | C[C@H](CO)n1ccc2c(NC(=O)Cc3ccc(Cl)c(C(F)(F)F)c... | 3.0 |
| 1099 | Cc1nocc1C(=O)Nc1ccc(-c2ccccc2OC(F)(F)F)c(N)n1 | 3.0 |
| 1100 | CCCCc1nc2c(N)nc3ccccc3c2n1CC(C)C | 3.0 |


### Run preliminary curation for data inspection

In [5]:
# Define data column names
data_cols = ["HLM_CLEARANCE"]
mol_col = "SMILES"

In [6]:
# import key curation components from auroris
from auroris.curation import Curator
from auroris.curation.actions import (
    MoleculeCuration,
    OutlierDetection,
    Deduplication,
    StereoIsomerACDetection,
    ContinuousDistributionVisualization,
)

# Define the curation workflow
curator = Curator(
    data_path=source_data_path,
    steps=[
        MoleculeCuration(input_column=mol_col, y_cols=data_cols),
        ContinuousDistributionVisualization(y_cols=data_cols),
        OutlierDetection(
            method="zscore", columns=data_cols, threshold=3, use_modified_zscore=True
        ),
        StereoIsomerACDetection(y_cols=data_cols, threshold=3),
    ],
    parallelized_kwargs={"n_jobs": -1},
)

curator.to_json(f"{dirname}/inspection_config.json")
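
The `OutlierDetection` step above is configured with `use_modified_zscore=True` and `threshold=3`. As a rough illustration of what a modified z-score is, the sketch below uses the common median/MAD definition, 0.6745 * (x - median) / MAD. Whether auroris uses exactly this formula is an assumption, and `modified_zscores` is a hypothetical helper that is not part of auroris. The values are made up.

```python
from statistics import median


def modified_zscores(values: list[float]) -> list[float]:
    """Modified z-score based on the median and the median absolute deviation (MAD)."""
    med = median(values)
    mad = median(abs(v - med) for v in values)
    if mad == 0:
        # MAD is zero when more than half of the values equal the median;
        # the score is undefined in that case, so this sketch returns zeros.
        return [0.0 for _ in values]
    return [0.6745 * (v - med) / mad for v in values]


# Toy readouts for illustration only; these are not rows from the dataset.
values = [3.0, 5.0, 8.0, 12.0, 20.0, 400.0]
scores = modified_zscores(values)

# Flag points whose absolute score exceeds the threshold used in the workflow above.
THRESHOLD = 3
outliers = [v for v, z in zip(values, scores) if abs(z) > THRESHOLD]
print(outliers)
```

Because the score is built on the median rather than the mean, a single extreme value has little influence on the scale used to judge the other points.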

In [7]:
# Run the curation step defined as above
data_inspection, report = curator(data)

2024-07-10 01:51:08.338 | INFO     | auroris.curation._curator:transform:106 - Performing step: mol_curation
2024-07-10 01:51:25.594 | INFO     | auroris.curation._curator:transform:106 - Performing step: distribution
2024-07-10 01:51:25.639 | INFO     | auroris.curation._curator:transform:106 - Performing step: outlier_detection
2024-07-10 01:51:25.688 | INFO     | auroris.curation._curator:transform:106 - Performing step: ac_stereoisomer


In [8]:
# Broadcast the curation report through the logger
from auroris.report.broadcaster import LoggerBroadcaster

broadcaster = LoggerBroadcaster(report)
broadcaster.broadcast()

===== Curation Report =====
Time: 2024-07-10 01:51:08
Version: dev
===== mol_curation =====
[LOG]: New column added: MOL_smiles
[LOG]: New column added: MOL_molhash_id
[LOG]: New column added: MOL_molhash_id_no_stereo
[LOG]: New column added: MOL_num_stereoisomers
[LOG]: New column added: MOL_num_undefined_stereoisomers
[LOG]: New column added: MOL_num_defined_stereo_center
[LOG]: New column added: MOL_num_undefined_stereo_center
[LOG]: New column added: MOL_num_stereo_center
[LOG]: New column added: MOL_undefined_E_D
[LOG]: New column added: MOL_undefined_E/Z
[LOG]: Default `ecfp` fingerprint is used to visualize the chemical space.
[LOG]: Molecules with undefined stereocenter detected: 110.
[IMG]: Dimensions 1200 x 600
[IMG]: Dimensions 1200 x 2400
===== distribution

In [9]:
# Generate an HTML report with embedded visualizations showcasing the data analysis.
from utils.auroris_utils import HTMLBroadcaster

# export report to local directory
broadcaster = HTMLBroadcaster(report, f"{dirname}/inspection_report")
report_path = broadcaster.broadcast()

In [10]:
# check the curated data
data_inspection.describe(include="all")

| | SMILES | HLM_CLEARANCE | MOL_smiles | MOL_molhash_id | MOL_molhash_id_no_stereo | MOL_num_stereoisomers | MOL_num_undefined_stereoisomers | MOL_num_defined_stereo_center | MOL_num_undefined_stereo_center | MOL_num_stereo_center | MOL_undefined_E_D | MOL_undefined_E/Z | OUTLIER_HLM_CLEARANCE | AC_HLM_CLEARANCE |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| count | 1102 | 1102.0 | 1102 | 1102 | 1102 | 1102.0 | 1102.0 | 1102.0 | 1102.0 | 1102.0 | 1102 | 1102.0 | 1102 | 1102 |
| unique | 1102 | | 1102 | 1102 | 1088 | | | | | | 2 | 1.0 | 1 | 1 |
| top | Cc1nc(-c2ccc(OCC(C)C)c(C#N)c2)sc1C(=O)O | | Cc1nc(-c2ccc(OCC(C)C)c(C#N)c2)sc1C(=O)O | 78ab8361a140f6853a7bce3e128b86de30f34ba2 | 3cfbb6042a2b1f1888e8902ec41acd97affa7a27 | | | | | | False | 0.0 | False | False |
| freq | 1 | | 1 | 1 | 3 | | | | | | 1005 | 1102.0 | 1102 | 1102 |
| mean | | 34.221633 | | | | 2.622505 | 1.129764 | 0.475499 | 0.111615 | 0.587114 | | | | |
| std | | 44.805477 | | | | 10.049874 | 0.465371 | 0.923425 | 0.353099 | 0.949754 | | | | |
| min | | 3.0 | | | | 1.0 | 1.0 | 0.0 | 0.0 | 0.0 | | | | |
| 25% | | 3.0 | | | | 1.0 | 1.0 | 0.0 | 0.0 | 0.0 | | | | |
| 50% | | 12.735 | | | | 1.0 | 1.0 | 0.0 | 0.0 | 0.0 | | | | |
| 75% | | 42.66 | | | | 2.0 | 1.0 | 1.0 | 0.0 | 1.0 | | | | |
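
The summary reports 1102 unique values of `MOL_molhash_id` but only 1088 of `MOL_molhash_id_no_stereo`, which indicates that some molecules differ only in their stereochemistry. The sketch below shows one way to list such groups and the spread of their readouts with pandas. It is an illustration on a made-up table, not the logic of `StereoIsomerACDetection`, and the hashes and values are invented.

```python
import pandas as pd

# Toy table for illustration; hashes and values are made up.
toy = pd.DataFrame(
    {
        "MOL_molhash_id_no_stereo": ["a", "a", "b", "c", "c", "c"],
        "HLM_CLEARANCE": [3.0, 45.0, 12.0, 8.0, 9.0, 8.5],
    }
)

# Size and readout spread of each group sharing a stereo-agnostic hash.
groups = (
    toy.groupby("MOL_molhash_id_no_stereo")["HLM_CLEARANCE"]
    .agg(n="size", spread=lambda s: s.max() - s.min())
    .reset_index()
)

# Keep only groups with more than one member, i.e. candidate stereoisomer sets.
stereo_sets = groups[groups["n"] > 1]
print(stereo_sets)
```

Groups with a large spread are the ones worth inspecting by hand, since they are candidates for activity cliffs between stereoisomers.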


### Re-run curation, removing molecules as needed

In [11]:
# import key curation components from auroris
from auroris.curation import Curator
from auroris.curation.actions import (
    MoleculeCuration,
    OutlierDetection,
    Deduplication,
    StereoIsomerACDetection,
    ContinuousDistributionVisualization,
)

# Define the curation workflow
curator = Curator(
    data_path=source_data_path,
    steps=[
        MoleculeCuration(input_column=mol_col, y_cols=data_cols),
        ContinuousDistributionVisualization(y_cols=data_cols),
        Deduplication(
            deduplicate_on=mol_col, y_cols=data_cols
        ),  # remove the replicated molecules
        OutlierDetection(
            method="zscore", columns=data_cols, threshold=3, use_modified_zscore=True
        ),
        StereoIsomerACDetection(y_cols=data_cols, threshold=3),
    ],
    parallelized_kwargs={"n_jobs": -1},
)
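
The `Deduplication` step added above merges rows that describe the same molecule. As a rough sketch of the idea, and not of auroris' actual implementation, replicate measurements can be collapsed to one row per molecule by taking the median. The choice of the median and the toy values below are assumptions made for illustration.

```python
import pandas as pd

# Toy replicate measurements for illustration; these are not rows from the dataset.
toy = pd.DataFrame(
    {
        "SMILES": ["CCO", "CCO", "c1ccccc1", "CCN", "CCN", "CCN"],
        "HLM_CLEARANCE": [10.0, 14.0, 3.0, 20.0, 22.0, 30.0],
    }
)

# Collapse replicates to one row per molecule, keeping the order of first appearance.
deduplicated = (
    toy.groupby("SMILES", sort=False)["HLM_CLEARANCE"].median().reset_index()
)
print(deduplicated)
```

The inspection summary lists 1102 unique SMILES among 1102 rows, so few or no rows are likely to be merged for this particular dataset.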

In [12]:
# The final curation configuration is exported for reproducibility
path = f"{gcp_root}/data/curation/curation_config.json"
curator.to_json(path)

In [13]:
# Run the curation step defined as above
data_curated, report = curator(data)

2024-07-10 01:51:27.699 | INFO     | auroris.curation._curator:transform:106 - Performing step: mol_curation
2024-07-10 01:51:31.255 | INFO     | auroris.curation._curator:transform:106 - Performing step: distribution
2024-07-10 01:51:31.298 | INFO     | auroris.curation._curator:transform:106 - Performing step: deduplicate
2024-07-10 01:51:31.862 | INFO     | auroris.curation._curator:transform:106 - Performing step: outlier_detection
2024-07-10 01:51:31.919 | INFO     | auroris.curation._curator:transform:106 - Performing step: ac_stereoisomer


In [14]:
broadcaster = LoggerBroadcaster(report)
broadcaster.broadcast()

===== Curation Report =====
Time: 2024-07-10 01:51:27
Version: dev
===== mol_curation =====
[LOG]: New column added: MOL_smiles
[LOG]: New column added: MOL_molhash_id
[LOG]: New column added: MOL_molhash_id_no_stereo
[LOG]: New column added: MOL_num_stereoisomers
[LOG]: New column added: MOL_num_undefined_stereoisomers
[LOG]: New column added: MOL_num_defined_stereo_center
[LOG]: New column added: MOL_num_undefined_stereo_center
[LOG]: New column added: MOL_num_stereo_center
[LOG]: New column added: MOL_undefined_E_D
[LOG]: New column added: MOL_undefined_E/Z
[LOG]: Default `ecfp` fingerprint is used to visualize the chemical space.
[LOG]: Molecules with undefined stereocenter detected: 110.
[IMG]: Dimensions 1200 x 600
[IMG]: Dimensions 1200 x 2400
===== distribution

In [15]:
# Export report to polaris public directory on GCP
# The report is ready to be reviewed in the HTML file.
broadcaster = HTMLBroadcaster(
    report, f"{gcp_root}/data/curation/report", embed_images=True
)
broadcaster.broadcast()

'gs://polaris-public/polaris-recipes/org-polaris/astra_zeneca/HLM_clearance/data/curation/report/index.html'

## Export the final curated data

In [16]:
fout = f"{gcp_root}/data/curation/{data_name}_curated.csv"
data_curated.reset_index(drop=True).to_csv(fout, index=False)