## Molecular representation benchmarks - MolProp250K leadlike

**Molecular Properties Datasets**
ZINC is a widely used public-access database and tool set that supports virtual screening, ligand discovery, pharmacophore screens, benchmarking, and force field development. The **MolProp250KLeadlike** dataset consists of 250,000 leadlike compounds randomly selected from ZINC22.

**Benchmarking goal:** The goal is to measure how well a model predicts these 'easy', directly computable properties. Ideally, any pre-trained model should at least perform well on these tasks before it is applied to downstream tasks.

**Molecule data resource**: https://cartblanche22.docking.org/search/random

In [1]:
%load_ext autoreload
%autoreload 2

In [2]:
import pandas as pd
import datamol as dm
import numpy as np
from sklearn.model_selection import ShuffleSplit
import polaris 
from polaris.curation._chemistry_curator import SMILES_COL, UNIQUE_ID
from polaris.dataset import Dataset, ColumnAnnotation
from polaris.benchmark import SingleTaskBenchmarkSpecification, MultiTaskBenchmarkSpecification
from polaris.utils.types import HubOwner

In [3]:
owner = HubOwner(organizationId="PolarisTest", slug="polaristest")
owner.owner

'PolarisTest'

## Load existing Dataset object

In [4]:
url = "gs://polaris-private/Data/zinc250k_leadlike/molecular_properties_2023-08-14.parquet"
table = pd.read_parquet(url)

In [5]:
annotations = {
    "smiles": ColumnAnnotation(
        description="Molecule SMILES string",
        modality="molecule"
    ),
    "mw": ColumnAnnotation(
        description="Molecular weight, computed with <datamol.descriptors.mw>"
    ),
    "fsp3": ColumnAnnotation(
        description="Fraction of sp3-hybridized (saturated) carbons, computed with <datamol.descriptors.fsp3>"
    ),
    "n_rotatable_bonds": ColumnAnnotation(
        description="Number of rotatable bonds, where a rotatable bond is any single non-ring bond attached to a non-terminal, non-hydrogen atom. Computed with <datamol.descriptors.n_rotatable_bonds>"
    ),
    "tpsa": ColumnAnnotation(
        description="Topological polar surface area: the surface sum over all polar atoms, primarily oxygen and nitrogen, including their attached hydrogen atoms. Computed with <datamol.descriptors.tpsa>"
    ),
    "clogp": ColumnAnnotation(
        description="Wildman-Crippen LogP value, computed with <datamol.descriptors.clogp>"
    ),
    "formal_charge": ColumnAnnotation(
        description="Formal charge of the molecule. A formal charge is the charge assigned to an atom under the assumption that all electrons in bonds are shared equally. Computed with <datamol.descriptors.formal_charge>"
    ),
    "n_charged_atoms": ColumnAnnotation(
        description="Number of charged atoms in the molecule, computed with <datamol.descriptors.n_charged_atoms>"
    ),
    "refractivity": ColumnAnnotation(
        description="Molar refractivity, a measure of the total polarizability of a mole of a substance; it depends on the temperature, the index of refraction, and the pressure. Computed with <datamol.descriptors.refractivity>"
    ),
    "n_aromatic_rings": ColumnAnnotation(
        description="Number of aromatic rings in the molecule, computed with <datamol.descriptors.n_aromatic_rings>"
    ),
}
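
Two of the descriptors annotated above have simple closed forms. As a reading aid, and assuming the standard textbook definitions rather than anything specific to the library code, the fraction of sp3 carbons and the formal charge of a single atom can be written as:

$$
\mathrm{Fsp^3} = \frac{N_{\mathrm{sp^3\,C}}}{N_{\mathrm{C}}}, \qquad
\mathrm{FC} = V - N - \frac{B}{2}
$$

Here $N_{\mathrm{sp^3\,C}}$ is the number of sp3-hybridized carbon atoms, $N_{\mathrm{C}}$ is the total number of carbon atoms, $V$ is the number of valence electrons of the neutral atom, $N$ is the number of non-bonding electrons, and $B$ is the number of electrons shared in bonds. Under these definitions, a molecule-level value such as the `formal_charge` column would be the sum of the atomic formal charges.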

In [6]:
dataset = Dataset(
    table=table[list(annotations)],
    name="MolProp250KLeadlike",
    description="Molecular properties computed for the ZINC22 250K leadlike dataset. These properties are used to examine the usefulness of pretrained models. In particular, any model intended for generation should not fail on these tasks.",
    source="https://www.valencelabs.com",
    annotations=annotations,
    owner=owner,
    tags=["Representation"]
)

In [7]:
# save the dataset
SAVE_DIR = "gs://polaris-private/Datasets/MolProp/MolProp250KLeadlike"
dataset.to_json(SAVE_DIR)

'gs://polaris-private/Datasets/MolProp/MolProp250KLeadlike/dataset.json'

## Create scaffold split for MolProp250KLeadlike

In [8]:
# scaffold split
from partitio._scaffold_split import ScaffoldSplit

TEST_SIZE = 0.2
SEED = 111
splitter = ScaffoldSplit(smiles=dataset.table.smiles.values, n_jobs=-1, test_size=TEST_SIZE, random_state=SEED)
scaffold_split = next(splitter.split(X=dataset.table.smiles.values))
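
The idea behind a scaffold split is that all molecules sharing a scaffold land on the same side of the train/test boundary, so the test set probes generalization to unseen chemotypes. The following is a minimal sketch of that grouping logic in plain Python. It is not the `partitio` implementation: the `grouped_split` helper and the scaffold labels are hypothetical, and the labels stand in for scaffolds that would normally be derived from the SMILES strings.

```python
import random
from collections import defaultdict


def grouped_split(groups, test_size=0.2, seed=111):
    """Split sample indices so that no group spans both train and test."""
    by_group = defaultdict(list)
    for idx, group in enumerate(groups):
        by_group[group].append(idx)

    # Shuffle the distinct groups reproducibly, then fill the test set
    # one whole group at a time until it reaches the requested size.
    keys = sorted(by_group)
    random.Random(seed).shuffle(keys)

    n_test_target = round(test_size * len(groups))
    train_idx, test_idx = [], []
    for key in keys:
        if len(test_idx) < n_test_target:
            test_idx.extend(by_group[key])
        else:
            train_idx.extend(by_group[key])
    return sorted(train_idx), sorted(test_idx)


# Hypothetical scaffold labels, one per molecule (placeholders, not real data).
scaffolds = ["benzene", "benzene", "pyridine", "indole", "indole",
             "furan", "pyridine", "thiophene", "benzene", "furan"]
train_idx, test_idx = grouped_split(scaffolds, test_size=0.2, seed=111)

# No scaffold may appear on both sides of the split.
train_groups = {scaffolds[i] for i in train_idx}
test_groups = {scaffolds[i] for i in test_idx}
assert train_groups.isdisjoint(test_groups)
```

Because whole groups are assigned at once, the realized test fraction can overshoot `test_size` when a large scaffold family is drawn; the same disjointness check can be applied to the `(train, test)` index pair held in `scaffold_split`.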

## Multitask for all properties with a shared scaffold split

In [9]:
from polaris.benchmark import MultiTaskBenchmarkSpecification

benchmark = MultiTaskBenchmarkSpecification(
    name="MolProp250KLeadlike_multitask_reg",
    dataset=dataset,
    target_cols=[
        "mw",
        "fsp3",
        "n_rotatable_bonds",
        "tpsa",
        "clogp",
        "formal_charge",
        "n_charged_atoms",
        "refractivity",
        "n_aromatic_rings",
    ],
    input_cols="smiles",
    split=scaffold_split,
    metrics="mean_absolute_error",
    tags=["Representation", 'multitask', 'Regression'],
    description='A multitask benchmark to predict nine molecular properties ("mw", "fsp3", "n_rotatable_bonds", "tpsa", "clogp", "formal_charge", "n_charged_atoms", "refractivity", "n_aromatic_rings") for 250K leadlike compounds from ZINC22. A scaffold-based splitter was used to define the training and test sets.',
    owner=owner
)
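
With `metrics="mean_absolute_error"` on a multitask benchmark, it helps to picture one error value per target column. The sketch below illustrates that reading in plain Python and pairs it with a trivial train-mean baseline, which any useful pretrained model should beat on these properties. It is an illustration only, not how Polaris computes its metrics; the helper names and the numbers are hypothetical.

```python
def mean_absolute_error(y_true, y_pred):
    """Average absolute difference between targets and predictions."""
    return sum(abs(t - p) for t, p in zip(y_true, y_pred)) / len(y_true)


def multitask_mae(y_true, y_pred):
    """Return one MAE per target; inputs map column name -> list of values."""
    return {col: mean_absolute_error(y_true[col], y_pred[col]) for col in y_true}


# Hypothetical values for two of the nine targets (not taken from the dataset).
y_train = {"mw": [300.0, 320.0, 340.0], "n_aromatic_rings": [1, 2, 3]}
y_test = {"mw": [310.0, 330.0], "n_aromatic_rings": [2, 2]}

# Baseline: predict the training mean of each target for every test molecule.
baseline = {
    col: [sum(values) / len(values)] * len(y_test[col])
    for col, values in y_train.items()
}

scores = multitask_mae(y_test, baseline)
print(scores)  # {'mw': 10.0, 'n_aromatic_rings': 0.0}
```

Reporting the scores per target keeps properties on very different scales, such as `mw` and `fsp3`, from being averaged into a single number that the large-valued targets dominate.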

### Save the benchmark

In [10]:
name = "MolProp250KLeadlike_multitask_reg"
BENCHMARK_DIR = f"gs://polaris-private/benchmarks/molprop/{name}"
path = benchmark.to_json(BENCHMARK_DIR)

In [11]:
fs = dm.fs.get_mapper(BENCHMARK_DIR).fs
fs.ls(BENCHMARK_DIR)

['polaris-private/benchmarks/molprop/MolProp250KLeadlike_multitask_reg/benchmark.json',
 'polaris-private/benchmarks/molprop/MolProp250KLeadlike_multitask_reg/dataset.json',
 'polaris-private/benchmarks/molprop/MolProp250KLeadlike_multitask_reg/table.parquet']