![molprop](https://storage.googleapis.com/polaris-public/icons/icons8-bear-100-Molprop.png)
## Molecular representation benchmarks - MolProp250K

## Background

Molecular representations are central to understanding molecular structure, predicting properties, QSAR studies, toxicology, chemical modeling, and other drug discovery tasks. Benchmarks for molecular representations are therefore critical tools that drive progress in computational chemistry and drug design. In recent years, many large models have been trained to learn molecular representations. This benchmark evaluates whether large pretrained models can predict a range of “easy-to-compute” molecular properties.

## Description of molecular properties
The computed properties are molecular weight, fraction of sp3 carbon atoms (fsp3), number of rotatable bonds, topological polar surface area, computed logP, formal charge, number of charged atoms, refractivity, and number of aromatic rings. These properties are widely used in molecule design and prioritization.

## Data resource
**Reference**: https://pubs.acs.org/doi/10.1021/acs.jcim.5b00559 

**Raw data**: https://raw.githubusercontent.com/aspuru-guzik-group/chemical_vae/master/models/zinc_properties/250k_rndm_zinc_drugs_clean_3.csv

In [1]:
%load_ext autoreload
%autoreload 2

In [2]:
import pandas as pd
import datamol as dm

import os
import sys
import pathlib

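# utils
# `__file__` is not defined inside a notebook, so the literal string "__file__"
# is resolved against the current working directory; parents[2] is then the
# directory two levels above it (assumed here to be the repository root).
root = pathlib.Path("__file__").absolute().parents[2]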
os.chdir(root)
sys.path.insert(0, str(root))

In [3]:
org = "polaris"
data_name = "molprop"
dirname = dm.fs.join(root, f"org-{org}", data_name)
gcp_root = f"gs://polaris-public/polaris-recipes/org-{org}/{data_name}"


# Load the data
source_data_path = f"{gcp_root}/data/raw/250k_rndm_zinc_drugs_clean_3.csv.zip"
data = pd.read_csv(source_data_path, compression="zip")

In [4]:
mol_col = "smiles"

# Keep only the SMILES column
columns_to_keep = [mol_col]
data = data[columns_to_keep].copy()

In [5]:
data.describe(include="all")

Unnamed: 0,smiles
count,249455
unique,249455
top,CC(C)(C)c1ccc2occ(CC(=O)Nc3ccccc3F)c2c1\n
freq,1
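
The `top` entry in the summary above ends with `\n`, which suggests that the raw SMILES strings carry a trailing newline. The following is a minimal sketch of stripping that whitespace before further processing. It uses only the standard library on a hypothetical two-row excerpt (the `row` column and both molecules are illustrative, not taken from the dataset); in the notebook itself, something like `data[mol_col].str.strip()` would play the same role.

```python
import csv
import io

# Hypothetical excerpt mimicking the raw file: each quoted SMILES field
# ends with an embedded newline character.
raw_csv = 'smiles,row\n"CCO\n",0\n"c1ccccc1\n",1\n'

# csv handles newlines inside quoted fields, so the "\n" stays in the value.
rows = list(csv.DictReader(io.StringIO(raw_csv)))
raw_smiles = [row["smiles"] for row in rows]

# Strip leading and trailing whitespace from every SMILES string.
clean_smiles = [s.strip() for s in raw_smiles]

print(raw_smiles)
print(clean_smiles)
```

Whether the downstream parser tolerates the trailing newline is not shown in this notebook, so stripping it explicitly is a defensive choice rather than a confirmed requirement.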


## Define the molecular property functions

In [6]:
properties_fn = {
    "mw": dm.descriptors.mw,
    "fsp3": dm.descriptors.fsp3,
    "n_rotatable_bonds": dm.descriptors.n_rotatable_bonds,
    "tpsa": dm.descriptors.tpsa,
    "clogp": dm.descriptors.clogp,
    "formal_charge": dm.descriptors.formal_charge,
    "n_charged_atoms": dm.descriptors.n_charged_atoms,
    "refractivity": dm.descriptors.refractivity,
    "n_aromatic_rings": dm.descriptors.n_aromatic_rings,
}

In [7]:
mols = dm.utils.parallelized(
    fn=dm.to_mol, inputs_list=data[mol_col].values, n_jobs=-1, progress=True
)

100%|██████████| 249455/249455 [00:14<00:00, 17572.92it/s]


In [9]:
results = dm.descriptors.batch_compute_many_descriptors(
    mols=mols,
    progress=True,
    n_jobs=-1,
    batch_size=1000,
    properties_fn=properties_fn,
    add_properties=False,
)
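
As a rough mental model of the call above, a mapping from property names to callables is applied to every input, batch by batch, yielding one row of named values per input. The sketch below illustrates only that pattern; it is not datamol's implementation, it runs sequentially, and the helper names (`iter_batches`, `compute_many`) and the string-based stand-in "descriptors" are hypothetical.

```python
def iter_batches(items, batch_size):
    """Yield consecutive slices of at most `batch_size` items."""
    for start in range(0, len(items), batch_size):
        yield items[start:start + batch_size]


def compute_many(items, properties_fn, batch_size):
    """Return one dict per item with every named property evaluated."""
    rows = []
    for batch in iter_batches(items, batch_size):
        for item in batch:
            rows.append({name: fn(item) for name, fn in properties_fn.items()})
    return rows


# Stand-in "descriptors" that operate on plain strings, not molecules.
toy_properties_fn = {
    "n_chars": len,
    "n_lowercase_c": lambda s: s.count("c"),
}
toy_inputs = ["CCO", "c1ccccc1", "CC(=O)O"]
toy_results = compute_many(toy_inputs, toy_properties_fn, batch_size=2)

for smiles, row in zip(toy_inputs, toy_results):
    print(smiles, row)
```

The result keeps the input order, which is what allows the computed columns to be concatenated next to the original SMILES column afterwards.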

In [None]:
data = pd.concat([data, results], axis=1)

### Perform data curation with the `auroris.curation` module
The curation workflow defined below:
- assigns a unique identifier to each molecule
- detects the stereochemistry information of the molecules
- merges rows of replicated molecules
- visualizes the distribution of each computed property

Check out the curation module in [Auroris](https://github.com/polaris-hub/auroris).
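
Merging rows of replicated molecules can be pictured as a group-and-aggregate operation: rows that share a molecule identifier are collapsed into one, and each property column is reduced to a single value. The following is a minimal standard-library sketch of that idea, not Auroris's implementation. The `mol_id` key, the sample values, the `merge_replicates` helper, and the choice of the median as the aggregate are all assumptions made for illustration.

```python
from collections import defaultdict
from statistics import median

# Hypothetical rows: the first two records describe the same molecule.
rows = [
    {"mol_id": "a1", "mw": 46.0, "tpsa": 20.0},
    {"mol_id": "a1", "mw": 47.0, "tpsa": 21.0},
    {"mol_id": "b2", "mw": 78.0, "tpsa": 0.0},
]
y_cols = ["mw", "tpsa"]


def merge_replicates(rows, key, y_cols):
    """Collapse rows sharing `key` into one row, taking the median of each y column."""
    groups = defaultdict(list)
    for row in rows:
        groups[row[key]].append(row)

    merged = []
    for mol_id, members in groups.items():
        record = {key: mol_id}
        for col in y_cols:
            record[col] = median(m[col] for m in members)
        merged.append(record)
    return merged


merged = merge_replicates(rows, key="mol_id", y_cols=y_cols)
print(merged)
```

For deterministically computed properties such as the ones in this dataset, replicated molecules would be expected to carry identical values, so the choice of aggregate matters less than it would for measured data.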

In [None]:
# Materialize the property names as a list so they can be reused across steps
data_cols = list(properties_fn.keys())

In [None]:
# Define the final curation workflow
# import key curation components from auroris
from auroris.curation import Curator
from auroris.curation.actions import (
    MoleculeCuration,
    Deduplication,
    ContinuousDistributionVisualization,
)

curator = Curator(
    source_data=source_data_path,
    steps=[
        MoleculeCuration(input_column=mol_col, y_cols=data_cols),
        Deduplication(
            deduplicate_on=mol_col, y_cols=data_cols
        ),  # remove the replicated molecules
        ContinuousDistributionVisualization(y_cols=data_cols),
    ],
    parallelized_kwargs={"n_jobs": -1},
)

In [None]:
# Run the curation step defined as above
data_curated, report = curator(data)

2024-06-03 11:46:53.379 | INFO     | auroris.curation._curator:transform:106 - Performing step: mol_curation
2024-06-03 11:52:35.034 | INFO     | auroris.curation._curator:transform:106 - Performing step: deduplicate
2024-06-03 11:56:00.426 | INFO     | auroris.curation._curator:transform:106 - Performing step: distribution


In [None]:
# Print the curation report through the logger
from auroris.report.broadcaster import LoggerBroadcaster

broadcaster = LoggerBroadcaster(report)
broadcaster.broadcast()

===== Curation Report =====
Time: 2024-06-03 11:46:53
Version: dev
===== mol_curation =====
[LOG]: New column added: MOL_smiles
[LOG]: New column added: MOL_molhash_id
[LOG]: New column added: MOL_molhash_id_no_stereo
[LOG]: New column added: MOL_num_stereoisomers
[LOG]: New column added: MOL_num_undefined_stereoisomers
[LOG]: New column added: MOL_num_defined_stereo_center
[LOG]: New column added: MOL_num_undefined_stereo_center
[LOG]: New column added: MOL_num_stereo_center
[LOG]: New column added: MOL_undefined_E_D
[LOG]: New column added: MOL_undefined_E/Z
[LOG]: Default `ecfp` fingerprint is used to visualize the chemical space.
[LOG]: Molecules with undefined stereocenter detected: 17111.
[IMG]: Dimensions 2400 x 3000
[IMG]: Dimensions 1200 x 2400
===== deduplica

In [None]:
# Generate an HTML report with embedded visualizations showcasing the data analysis.
from utils.auroris_utils import HTMLBroadcaster

# export report to local directory
broadcaster = HTMLBroadcaster(report, f"{dirname}/inspection_report")
report_path = broadcaster.broadcast()

In [None]:
# check the curated data
data_curated.describe(include="all")

Unnamed: 0,smiles,mw,fsp3,n_rotatable_bonds,tpsa,clogp,formal_charge,n_charged_atoms,refractivity,n_aromatic_rings,MOL_smiles,MOL_molhash_id,MOL_molhash_id_no_stereo,MOL_num_stereoisomers,MOL_num_undefined_stereoisomers,MOL_num_defined_stereo_center,MOL_num_undefined_stereo_center,MOL_num_stereo_center,MOL_undefined_E_D,MOL_undefined_E/Z
count,249455,249455.0,249455.0,249455.0,249455.0,249455.0,249455.0,249455.0,249455.0,249455.0,249455,249455,249455,249455.0,249455.0,249455.0,249455.0,249455.0,249455,249455.0
unique,249455,,,,,,,,,,249455,249455,247804,,,,,,2,1.0
top,BrC#Cc1ccccc1\n,,,,,,,,,,BrC#Cc1ccccc1,844d0566ac0446222fcca319a4a201de2d98a034,897d96e980fc2d255edc16c6e06ead5066323ea2,,,,,,False,0.0
freq,1,,,,,,,,,,1,1,3,,,,,,246610,249455.0
mean,,331.754739,0.411125,4.560173,64.820972,2.457121,0.202638,0.413209,89.161769,1.849833,,,,3.040204,1.072029,0.89081,0.070117,0.960927,,
std,,61.843063,0.220376,1.550658,22.93468,1.434336,0.543069,0.668481,17.147501,0.969474,,,,13.33079,0.280362,0.961192,0.261426,1.029837,,
min,,149.975153,0.0,0.0,0.0,-6.8762,-3.0,0.0,17.49,0.0,,,,1.0,1.0,0.0,0.0,0.0,,
25%,,290.197537,0.25,3.0,49.31,1.5749,0.0,0.0,77.7507,1.0,,,,1.0,1.0,0.0,0.0,0.0,,
50%,,333.205242,0.384615,5.0,64.11,2.6056,0.0,0.0,89.1882,2.0,,,,2.0,1.0,1.0,0.0,1.0,,
75%,,368.153621,0.555556,6.0,79.71,3.48676,0.0,1.0,100.28125,2.0,,,,4.0,1.0,1.0,0.0,2.0,,
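
In the summary above, `MOL_molhash_id` has 249455 unique values while `MOL_molhash_id_no_stereo` has 247804, a difference of 1651, and the most frequent stereo-agnostic hash occurs 3 times. This indicates that some molecules differ only in their stereochemistry. The sketch below shows one way to count such groups; the hash values are hypothetical placeholders, and in the notebook a call such as `data_curated["MOL_molhash_id_no_stereo"].value_counts()` would presumably serve the same purpose.

```python
from collections import Counter

# Hypothetical stereo-agnostic hashes, one per row of the curated table.
no_stereo_ids = ["h1", "h2", "h2", "h3", "h3", "h3"]

counts = Counter(no_stereo_ids)

# Hashes shared by more than one row: candidate groups of stereoisomers.
stereo_groups = {h: n for h, n in counts.items() if n > 1}

# Rows in excess of the number of distinct hashes.
n_extra_rows = len(no_stereo_ids) - len(counts)

# The same arithmetic applied to the figures reported in the summary.
reported_extra_rows = 249455 - 247804

print(stereo_groups, n_extra_rows, reported_extra_rows)
```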


## Chemical space coverage of the dataset

![chemical space chem_all](inspection_report/images/0-Distribution_in_Chemical_Space_ECFP.png)

## Export the final curated data

In [None]:
fout = f"{gcp_root}/data/curation/{data_name}_curated.csv.gz"
data_curated.reset_index(drop=True).to_csv(fout, index=False, compression="gzip")

In [None]:
fout

'gs://polaris-public/polaris-recipes/org-polaris/molprop/data/curation/molprop_curated.csv.gz'