# Dataset creation with [Polaris](https://github.com/polaris-hub/polaris)
The first step in creating a benchmark is to define a standard dataset that gives access to the curated data (the curation is demonstrated in <01_polaris_adme-fang_data_curation.ipynb>) together with all necessary information about the dataset, such as the data source, a description of the endpoints, and units.

In [1]:
%load_ext autoreload
%autoreload 2

import os
import sys
import pathlib

import pandas as pd
import datamol as dm

# polaris dataset
from polaris.dataset import Dataset, ColumnAnnotation
from polaris.utils.types import HubOwner


root = pathlib.Path("__file__").absolute().parents[2]
os.chdir(root)
sys.path.insert(0, str(root))
from utils.docs_utils import load_readme

In [2]:
# Get the owner and organization
org = "biogen"
data_name = "fang2023_ADME"
dirname = dm.fs.join(root, f"org-{org}", "biogen", data_name)
gcp_root = f"gs://polaris-public/polaris-recipes/org-{org}/{data_name}"

owner = HubOwner(slug=org, type="organization")
owner

HubOwner(slug='biogen', external_id=None, type='organization')

In [3]:
# Sub-directories of the recipe root defined above
BENCHMARK_DIR = f"{gcp_root}/benchmarks"
DATASET_DIR = f"{gcp_root}/datasets"
FIGURE_DIR = f"{gcp_root}/figures"

## Load existing data
> **Attention:** \
> The original dataset is published in [`Fang et al. 2023`](https://doi.org/10.1021/acs.jcim.3c00160) and is available at <https://github.com/molecularinformatics/Computational-ADME/blob/main/ADME_public_set_3521.csv>. \
> To **maintain consistency** with other benchmarks in the Polaris Hub, a thorough data curation process is carried out to ensure the accuracy of the molecular representations.
> Therefore, the raw data from the data resource is not used here.
> See more curation details in [01_polaris_adme-fang-1_data_curation.ipynb](https://github.com/polaris-hub/polaris-recipes/org-Biogen/fang2023_ADME/01_polaris_adme-fang-1_data_curation.ipynb).

In [4]:
# Load the curated data
PATH = "gs://polaris-public/polaris-recipes/org-biogen/fang2023_ADME/data/curation/fang2023_ADME_curated.csv"
table = pd.read_csv(PATH)

### Below we specify the meta information of data columns

In [5]:
# Here we simplify the column names
table = table.rename(
    columns={
        "MOL_molhash_id": "UNIQUE_ID",
        "LOG HLM_CLint (mL/min/kg)": "LOG_HLM_CLint",
        "LOG RLM_CLint (mL/min/kg)": "LOG_RLM_CLint",
        "LOG MDR1-MDCK ER (B-A/A-B)": "LOG_MDR1-MDCK_ER",
        "LOG PLASMA PROTEIN BINDING (HUMAN) (% unbound)": "LOG_HPPB",
        "LOG PLASMA PROTEIN BINDING (RAT) (% unbound)": "LOG_RPPB",
        "LOG SOLUBILITY PH 6.8 (ug/mL)": "LOG_SOLUBILITY",
    }
)

# molecule column
mol_col = "MOL_smiles"

In [6]:
table.reset_index(drop=True, inplace=True)

Not all columns are needed; only those useful for the benchmarks are annotated. Here we use only the columns that were used for training in the original paper.

The key bioactivity columns, molecule structures, and identifiers in the dataset must be described with `ColumnAnnotation`. Arbitrary key-value pairs can be added through `user_attributes` when needed, such as `unit`, `organism`, `scale`, and the optimization `objective`.

**Abbreviations for the endpoint objective**
- THTB: the higher the better
- TLTB: the lower the better
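
The annotations in this notebook spell the objective out as `"Higher value"` or `"Lower value"` rather than using the abbreviations. A small sketch of how the two notations could be mapped onto each other follows. The dictionary `OBJECTIVE_LABELS` and the helper `objective_label` are hypothetical names and not part of Polaris.

```python
# Hypothetical mapping between the abbreviations and the spelled-out labels.
OBJECTIVE_LABELS = {
    "THTB": "Higher value",  # the higher the better
    "TLTB": "Lower value",   # the lower the better
}


def objective_label(abbreviation):
    """Translate an abbreviation (case-insensitive) to its spelled-out label."""
    key = abbreviation.strip().upper()
    if key not in OBJECTIVE_LABELS:
        raise ValueError(f"Unknown objective abbreviation: {abbreviation!r}")
    return OBJECTIVE_LABELS[key]


print(objective_label("thtb"))  # Higher value
```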

In [7]:
annotations = {
    "UNIQUE_ID": ColumnAnnotation(
        description="Molecular hash ID. See <datamol.mol.hash_mol>"
    ),
    "MOL_smiles": ColumnAnnotation(
        description="Molecule SMILES string after cleaning and standardization.",
        modality="molecule",
    ),
    "SMILES": ColumnAnnotation(
        description="Original molecule SMILES string from the publication."
    ),
    "LOG_HLM_CLint": ColumnAnnotation(
        description="Human liver microsomal stability reported as intrinsic clearance",
        user_attributes={
            "unit": "mL/min/kg",
            "scale": "log",
            "organism": "human",
            "objective": "Lower value",
        },
    ),
    "LOG_RLM_CLint": ColumnAnnotation(
        description="Rat liver microsomal stability reported as intrinsic clearance",
        user_attributes={
            "unit": "mL/min/kg",
            "scale": "log",
            "organism": "rat",
            "objective": "Lower value",
        },
    ),
    "LOG_MDR1-MDCK_ER": ColumnAnnotation(
        description="MDR1-MDCK efflux ratio (B-A/A-B)",
        user_attributes={
            "unit": "dimensionless ratio",
            "scale": "log",
            "objective": "Higher value",
        },
    ),
    "LOG_HPPB": ColumnAnnotation(
        description="Human plasma protein binding",
        user_attributes={"unit": "% unbound", "objective": "Lower value"},
    ),
    "LOG_RPPB": ColumnAnnotation(
        description="Rat plasma protein binding",
        user_attributes={"unit": "% unbound", "objective": "Lower value"},
    ),
    "LOG_SOLUBILITY": ColumnAnnotation(
        description="Solubility was measured after equilibrium between the dissolved and solid state",
        user_attributes={
            "unit": "ug/mL",
            "scale": "log",
            "PH": "6.8",
            "objective": "Higher value",
        },
    ),
}
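
Because `user_attributes` accepts arbitrary keys, nothing enforces that every endpoint carries the same metadata. The following is a minimal consistency check, sketched with plain dictionaries standing in for `ColumnAnnotation` objects so that it runs without Polaris installed. The required-key set and the helper name `missing_attributes` are assumptions chosen for illustration, not a Polaris rule.

```python
REQUIRED_KEYS = {"unit", "scale", "objective"}  # assumed convention, not a Polaris rule


def missing_attributes(user_attributes_by_column, required=REQUIRED_KEYS):
    """Map each column to the sorted list of required keys it lacks."""
    report = {}
    for column, attrs in user_attributes_by_column.items():
        lacking = sorted(required - set(attrs))
        if lacking:
            report[column] = lacking
    return report


# Plain-dict stand-ins modelled on two endpoint annotations.
toy_attributes = {
    "LOG_SOLUBILITY": {
        "unit": "ug/mL",
        "scale": "log",
        "PH": "6.8",
        "objective": "Higher value",
    },
    "LOG_HPPB": {"unit": "% unbound", "objective": "Lower value"},
}

print(missing_attributes(toy_attributes))
```

With real annotations, the input could be built as `{name: ann.user_attributes for name, ann in annotations.items()}`, restricted to the endpoint columns, since identifier and structure columns carry no such attributes.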

### Define `Dataset` object

In [8]:
dataset_version = "v2"
dataset_name = f"adme-fang-{dataset_version}"

In [9]:
dataset = Dataset(
    table=table[list(annotations)].copy(),  # keep only the annotated columns
    name=dataset_name,
    description="A DMPK datasets of six ADME in vitro endpoints from fang et al. 2023. ",
    source="https://doi.org/10.1021/acs.jcim.3c00160",
    annotations=annotations,
    owner=owner,
    tags=["adme"],
    readme=load_readme(f"org-Biogen/{data_name}/fang2023_ADME_readme.md"),
    license="CC-BY-4.0",
    curation_reference="https://github.com/polaris-hub/polaris-recipes/org-Biogen/fang2023_ADME/01_polaris_adme-fang-1_data_curation.ipynb",
)

In [10]:
# save the dataset to GCP
SAVE_DIR = f"{DATASET_DIR}/{dataset_name}"
dataset_path = dataset.to_json(SAVE_DIR)
dataset_path

'gs://polaris-public/polaris-recipes/org-biogen/fang2023_ADME/datasets/adme-fang-v2/dataset.json'

In [13]:
from polaris.hub.client import PolarisHubClient

client = PolarisHubClient()
client.login()

client.upload_dataset(dataset=dataset, access="private", owner=owner)

2024-07-30 13:02:28.788 | SUCCESS | polaris.hub.client:login:260 - You are successfully logged in to the Polaris Hub.


✅ SUCCESS: Your dataset has been successfully uploaded to the Hub. View it here: https://polarishub.io/datasets/biogen/adme-fang-v2
 




{'id': 'bXClU6RThexkUaj4cDHF6',
 'createdAt': '2024-07-30T17:02:28.969Z',
 'deletedAt': None,
 'name': 'adme-fang-v2',
 'slug': 'adme-fang-v2',
 'description': 'A DMPK datasets of six ADME in vitro endpoints from fang et al. 2023. ',
 'tags': ['adme'],
 'userAttributes': {},
 'access': 'private',
 'isCertified': False,
 'polarisVersion': '0.7.9',
 'readme': '![ADME](https://storage.googleapis.com/polaris-public/icons/icon_fang.png) \n\n## Background\n\nThe goal of assessing ADME properties is to understand how a potential drug candidate interacts with the human body, including absorption, distribution, metabolism, and excretion. This knowledge is crucial for evaluating efficacy, safety, and clinical potential, guiding drug development for optimal therapeutic outcomes. [Fang et al. 2023](https://doi.org/10.1021/acs.jcim.3c00160) disclosed DMPK datasets collected over 20 months across six ADME in vitro endpoints: human and rat liver microsomal stability, MDR1-MDCK efflux ratio, solubilit

## Disclaimers

<div style="background-color: lightyellow; padding: 10px; border: 1px solid black;">
    <span>Here are some additional details that may be of use when deciding whether or not to use these datasets.</span><br /><br />
    <!-- <strong><span style="color: red;">Disclaimer:</span></strong>  -->
     <strong>Some advantages include: </strong>
        <ul>
        <li>The assays were carried out by one group under a consistent set of conditions.</li>
        <li>The dataset contains only a small number of molecules with unspecified stereocenters.</li>
        <li>There are no duplicated structures in the dataset.</li>
        <li>The data is based on well-defined ADME endpoints.</li>
        </ul>
     <strong>Some limitations to consider: </strong>
        <ul>
        <li>The PPB datasets are small, making it challenging to determine a statistically significant difference between methods on these sets.</li>
        </ul>
        

</div>
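
The point about the small PPB sets can be checked by counting the non-missing measurements per endpoint. The following is a minimal sketch on made-up rows; the values are invented for illustration and are not taken from the dataset. On the real table, `table[endpoint_columns].notna().sum()` would give the same counts, where `endpoint_columns` is a hypothetical list of the endpoint column names.

```python
import math


def count_measurements(rows, endpoints):
    """Count rows with a non-missing value (not None, not NaN) per endpoint."""
    counts = {name: 0 for name in endpoints}
    for row in rows:
        for name in endpoints:
            value = row.get(name)
            is_nan = isinstance(value, float) and math.isnan(value)
            if value is not None and not is_nan:
                counts[name] += 1
    return counts


# Invented example rows; real values come from the curated table.
toy_rows = [
    {"LOG_HLM_CLint": 1.2, "LOG_HPPB": None},
    {"LOG_HLM_CLint": 0.7, "LOG_HPPB": 0.3},
    {"LOG_HLM_CLint": float("nan"), "LOG_HPPB": None},
]

toy_counts = count_measurements(toy_rows, ["LOG_HLM_CLint", "LOG_HPPB"])
print(toy_counts)
```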
