# Custom datasets and benchmarks
We have already seen how easy it is to load a benchmark or dataset from the Polaris Hub. Let's now see how you could create your own!  

## Create the dataset

A dataset in Polaris is, at its core, a tabular data-structure in which each row stores a single datapoint. For this example, we will process a multi-task DMPK dataset from [`Fang et al.`](https://doi.org/10.1021/acs.jcim.3c00160). For the sake of simplicity, we don't do any curation and download the dataset as is from their GitHub repository.

<div class="admonition warning highlight">
    <p class="admonition-title">The importance of curation</p>
    <p>While we do not address it in this tutorial, data curation is essential to an impactful benchmark. Because of this, we have not just made several high-quality benchmarks readily available on the Polaris Hub, but also open-sourced some of the tools we've built to curate these datasets.</p>
</div>

In [1]:
import platformdirs
import datamol as dm

# We will save the data for this tutorial to our cache dir!
SAVE_DIR = dm.fs.join(platformdirs.user_cache_dir(appname="polaris-tutorials"), "001")

In [2]:
import pandas as pd

PATH = "https://raw.githubusercontent.com/molecularinformatics/Computational-ADME/main/ADME_public_set_3521.csv"
table = pd.read_csv(PATH)

Since all data is contained within the table, creating a dataset is simple. While optional, we will specify some additional meta-data to demonstrate the API.

In [3]:
from polaris.dataset import Dataset, ColumnAnnotation

dataset = Dataset(
    # The table is the core data-structure required to construct a dataset
    table=table, 
    
    # All other arguments provide additional meta-data and are optional.
    # The exception is the `is_pointer` attribute in the `ColumnAnnotation` object, which
    # we will get back to in a later tutorial. 
    name="Fang_2023_DMPK", 
    description="120 prospective data sets, collected over 20 months across six ADME in vitro endpoints", 
    source="https://doi.org/10.1021/acs.jcim.3c00160",
    annotations={
        "SMILES": ColumnAnnotation(modality="molecule"),
        "LOG HLM_CLint (mL/min/kg)": ColumnAnnotation(user_attributes={"unit": "mL/min/kg"}),
        "LOG SOLUBILITY PH 6.8 (ug/mL)": ColumnAnnotation(
            protocol="Solubility was measured after equilibrium between the dissolved and solid state"
        ),
    }
)
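
Annotations are keyed by column name, so a typo in a key is easy to miss. The sketch below shows a small sanity check one might run before constructing the `Dataset`. It is only an illustration: `unknown_annotation_keys` is a hypothetical helper (not part of Polaris), plain dicts stand in for `ColumnAnnotation` objects, and the column list is a hand-written subset of the table.

```python
def unknown_annotation_keys(columns, annotations):
    """Return annotation keys that match no column name (hypothetical helper)."""
    known = set(columns)
    return sorted(key for key in annotations if key not in known)


# Illustrative subset of column names; in the tutorial you would pass `table.columns`.
columns = ["SMILES", "LOG HLM_CLint (mL/min/kg)", "LOG SOLUBILITY PH 6.8 (ug/mL)"]

# Plain dicts stand in for `ColumnAnnotation` objects here.
annotations = {
    "SMILES": {"modality": "molecule"},
    "LOG SOLUBILITY PH 6.8 (ug/ml)": {"protocol": "..."},  # note the lower-case "ml"
}

typos = unknown_annotation_keys(columns, annotations)
print(typos)
```

An empty list means every annotation refers to an existing column.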

## Create the benchmark specification
A benchmark is represented by the `BenchmarkSpecification`, which wraps a `Dataset` with additional data to produce a benchmark. 

Specifically, it specifies:
1. Which dataset to use (see Dataset);
2. Which columns are used as input and which columns are used as target;
3. Which metrics should be used to evaluate performance on this task;
4. A predefined, static train-test split to use during evaluation.

In [4]:
import numpy as np
from polaris.benchmark import SingleTaskBenchmarkSpecification

# For the sake of simplicity, we use a very simple, ordered split
split = (
    np.arange(3000).tolist(), # train
    (np.arange(521) + 3000).tolist()  # test
)

benchmark = SingleTaskBenchmarkSpecification(
    dataset=dataset, 
    target_cols="LOG SOLUBILITY PH 6.8 (ug/mL)",
    input_cols="SMILES",
    split=split,
    metrics="mean_absolute_error",
)
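
Because the split is given as raw row indices, it can be worth checking that the two partitions do not overlap and stay within the table. The following is a minimal sketch of such a check, not something Polaris requires you to write: `check_split` is a hypothetical helper, and the row count of 3521 is an assumption based on the file name (in the tutorial you would use `len(table)`).

```python
def check_split(train_idx, test_idx, n_rows):
    """Raise ValueError if the split overlaps or points outside the table (hypothetical helper)."""
    train, test = set(train_idx), set(test_idx)
    overlap = train & test
    if overlap:
        raise ValueError(f"{len(overlap)} indices appear in both train and test")
    out_of_range = [i for i in train | test if not 0 <= i < n_rows]
    if out_of_range:
        raise ValueError(f"{len(out_of_range)} indices fall outside 0..{n_rows - 1}")
    # Number of rows that are used by neither partition
    return n_rows - len(train | test)


# Same ordered split as above: the first 3000 rows for training, the next 521 for testing.
train_idx = list(range(3000))
test_idx = list(range(3000, 3521))

# Assumed row count, taken from the file name of the CSV.
unused = check_split(train_idx, test_idx, n_rows=3521)
print(unused)
```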

Metrics must be supported by the Polaris framework.
For more information, see the `Metric` class.

In [5]:
from polaris.evaluate import Metric

Metric.list_supported_metrics()

['accuracy', 'mean_absolute_error', 'mean_squared_error']
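
For reference, the mean absolute error used throughout this tutorial is the average absolute difference between targets and predictions. Below is a minimal pure-Python sketch of that textbook definition. It is not the Polaris implementation, and the numbers are made up purely to show the arithmetic.

```python
def mean_absolute_error(y_true, y_pred):
    """Average absolute difference between targets and predictions."""
    if len(y_true) != len(y_pred):
        raise ValueError("y_true and y_pred must have the same length")
    return sum(abs(t - p) for t, p in zip(y_true, y_pred)) / len(y_true)


# Made-up values, only to show the arithmetic: (0.5 + 0.0 + 1.0) / 3
mae = mean_absolute_error([1.0, 2.0, 3.0], [1.5, 2.0, 2.0])
print(mae)  # 0.5
```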

To support the vast flexibility in specifying a benchmark, we have different classes that correspond to different types of benchmarks. Each of these sub-classes makes the data-model or logic more specific to a particular case. For example, trying to create a multi-task benchmark with the same arguments will throw an error, as there is just a single target column specified.

In [6]:
from polaris.benchmark import MultiTaskBenchmarkSpecification

benchmark = MultiTaskBenchmarkSpecification(
    dataset=dataset, 
    target_cols="LOG SOLUBILITY PH 6.8 (ug/mL)",
    input_cols="SMILES",
    split=split,
    metrics="mean_absolute_error",
)

ValidationError: 1 validation error for MultiTaskBenchmarkSpecification
target_cols
  Value error, A multi-task benchmark should specify at least two target columns [type=value_error, input_value='LOG SOLUBILITY PH 6.8 (ug/mL)', input_type=str]
    For further information visit https://errors.pydantic.dev/2.1.2/v/value_error

In [7]:
# Let's try that again, but now with two target columns
benchmark = MultiTaskBenchmarkSpecification(
    dataset=dataset, 
    target_cols=["LOG SOLUBILITY PH 6.8 (ug/mL)", "LOG HLM_CLint (mL/min/kg)"],
    input_cols="SMILES",
    split=split,
    metrics="mean_absolute_error",
)
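
The rule behind the validation error above can be pictured as follows. This is a plain-Python sketch of the idea rather than the Polaris code (which, judging by the error message, relies on pydantic validation). The name `normalize_target_cols` is hypothetical, and the single-task rule and its message are assumptions added for symmetry.

```python
def normalize_target_cols(target_cols, multi_task):
    """Coerce `target_cols` to a list and enforce the single- vs multi-task rule."""
    cols = [target_cols] if isinstance(target_cols, str) else list(target_cols)
    if multi_task and len(cols) < 2:
        raise ValueError("A multi-task benchmark should specify at least two target columns")
    if not multi_task and len(cols) != 1:
        # Assumed counterpart of the rule above; not taken from the Polaris source.
        raise ValueError("A single-task benchmark should specify exactly one target column")
    return cols


single = normalize_target_cols("LOG SOLUBILITY PH 6.8 (ug/mL)", multi_task=False)

try:
    normalize_target_cols("LOG SOLUBILITY PH 6.8 (ug/mL)", multi_task=True)
    message = None
except ValueError as err:
    message = str(err)

print(single, message)
```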

## Save the benchmark
Saving the benchmark is easy and can be done with a single line of code.

In [8]:
path = benchmark.to_json(SAVE_DIR)

In [9]:
fs = dm.fs.get_mapper(SAVE_DIR).fs
fs.ls(SAVE_DIR)

['/home/cas/.cache/polaris-tutorials/001/benchmark.json',
 '/home/cas/.cache/polaris-tutorials/001/dataset.json',
 '/home/cas/.cache/polaris-tutorials/001/table.parquet']

This created three files: two `json` files and a single `parquet` file. The `parquet` file saves the tabular structure at the base of the `Dataset` class, whereas the `json` files save all the meta-data for the `Dataset` and `BenchmarkSpecification`.

## Load the benchmark
Loading the benchmark is easy!

In [10]:
import polaris as po

benchmark = po.load_benchmark(path)

## Use the benchmark

Using your custom benchmark is seamless. It supports the exact same API as any benchmark that would be loaded through the hub:

1. `get_train_test_split()`: For creating objects through which we can access the different dataset partitions. 
2. `evaluate()`: For evaluating a set of predictions in accordance with the benchmark protocol. 

### Data access

In [11]:
train, test = benchmark.get_train_test_split()

The created objects support several ways of accessing the data:
1. The objects are iterable;
2. The objects can be indexed; 
3. The objects have properties to access all data at once.

In [12]:
for x, y in train: 
    pass

In [13]:
for i in range(len(train)):
    x, y = train[i]

In [14]:
x = train.inputs
y = train.targets
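
All three access patterns above map onto standard Python protocols. The toy class below is a sketch of how such an object could be built with `__iter__`, `__getitem__` and plain attributes. It is a stand-in for illustration only, not the Polaris class, and the SMILES strings and target values are made up.

```python
class Subset:
    """Toy stand-in for a dataset partition (not the Polaris class)."""

    def __init__(self, inputs, targets):
        self.inputs = list(inputs)    # all inputs at once
        self.targets = list(targets)  # all targets at once

    def __len__(self):
        return len(self.inputs)

    def __getitem__(self, i):
        # Indexing returns one (input, target) pair
        return self.inputs[i], self.targets[i]

    def __iter__(self):
        # Iteration yields the same pairs, in order
        return iter(zip(self.inputs, self.targets))


# Made-up molecules and target values
toy_train = Subset(["CCO", "c1ccccc1"], [0.3, 1.2])

pairs = [(x, y) for x, y in toy_train]
first = toy_train[0]
print(pairs, first, toy_train.inputs)
```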

To avoid accidental access to the test targets, the test object does not expose the labels and will throw an error if you try to access them explicitly.

In [15]:
for x in test: 
    pass

In [16]:
for i in range(len(test)):
    x = test[i]

In [17]:
x = test.inputs
y = test.targets

TestAccessError: Within Polaris, you should not need to access the targets of the test set
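
One way to get this behaviour is to keep the targets in a private attribute and make the public property raise. The sketch below illustrates that pattern under stated assumptions: `HiddenTargetSubset` and `TargetAccessError` are hypothetical names (the latter is analogous to the `TestAccessError` shown above), and this is not how Polaris is necessarily implemented.

```python
class TargetAccessError(Exception):
    """Raised when code tries to read the hidden targets (hypothetical name)."""


class HiddenTargetSubset:
    """Toy partition that stores targets privately and refuses to expose them."""

    def __init__(self, inputs, targets):
        self.inputs = list(inputs)
        self._targets = list(targets)  # kept for evaluation only

    @property
    def targets(self):
        raise TargetAccessError("You should not need to access the targets of the test set")

    def __len__(self):
        return len(self.inputs)

    def __getitem__(self, i):
        # Only the input is returned, never the target
        return self.inputs[i]

    def __iter__(self):
        return iter(self.inputs)


# Made-up molecules and target values
toy_test = HiddenTargetSubset(["CCO", "c1ccccc1"], [0.3, 1.2])

try:
    toy_test.targets
    blocked = False
except TargetAccessError:
    blocked = True

print(list(toy_test), blocked)
```

Note that a leading underscore is only a convention in Python, so this guards against accidental access rather than deliberate circumvention.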

### Evaluation
To evaluate a set of predictions within Polaris, you should use the `evaluate()` endpoint. This requires you to provide just the predictions. The targets of the test set are extracted automatically, so that the chance of the user accessing the test labels is minimal.

In [18]:
# Since we have a multi-task dataset, we should provide predictions for both targets
y_pred = {
    "LOG SOLUBILITY PH 6.8 (ug/mL)": np.random.random(len(test)),
    "LOG HLM_CLint (mL/min/kg)": np.random.random(len(test)),
}

results = benchmark.evaluate(y_pred)

The resulting object does not just store the results, but also allows for additional meta-data that can be uploaded to the hub. 

In [19]:
results.results

{'LOG SOLUBILITY PH 6.8 (ug/mL)': {'mean_absolute_error': 0.9311836043999315},
 'LOG HLM_CLint (mL/min/kg)': {'mean_absolute_error': 0.7306570958492925}}
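
Conceptually, multi-task evaluation applies the metric to each target column separately, which yields the nested dictionary above. The sketch below mimics that shape in plain Python. `evaluate_per_target`, the task names and the toy values are all hypothetical, and the code does not reflect how Polaris implements `evaluate()`.

```python
from statistics import fmean


def evaluate_per_target(y_true, y_pred):
    """Return {target: {"mean_absolute_error": value}} for every target column."""
    missing = set(y_true) - set(y_pred)
    if missing:
        raise KeyError(f"No predictions provided for: {sorted(missing)}")
    return {
        target: {
            "mean_absolute_error": fmean(
                abs(t - p) for t, p in zip(y_true[target], y_pred[target])
            )
        }
        for target in y_true
    }


# Toy targets and predictions for two tasks (made-up numbers)
y_true = {"solubility": [1.0, 2.0], "clearance": [0.5, 1.5]}
y_pred = {"solubility": [1.5, 2.5], "clearance": [0.5, 0.5]}

scores = evaluate_per_target(y_true, y_pred)
print(scores)
```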

In [20]:
results._created_at

datetime.datetime(2023, 7, 20, 16, 36, 20, 24731)

Uploading the results to the Hub will currently fail, as we do not have a client (or a Hub, for that matter) yet.

In [21]:
# results.upload_to_hub()

The End. 