# Custom datasets and benchmarks
We have already seen how easy it is to load a benchmark or dataset from the Polaris Hub. Let's now see how you could create your own!



**Overview**:
- [How to use the polaris library to curate your own datasets?](#curate)
- [How to create a Dataset?](#dataset)

## Create the dataset

A dataset in Polaris is, at its core, a tabular data structure in which each row stores a single datapoint. For this example, we used the curated dataset from [02.Data_curation.ipynb](https://github.com/polaris-hub/polaris/blob/hub-integration/docs/tutorials/02.Data_curation.ipynb).

In [54]:
%load_ext autoreload
%autoreload 2
import tempfile
import datamol as dm
import pandas as pd
import polaris as po
from polaris import curation
import warnings

warnings.filterwarnings("ignore")
from IPython import display
from polaris.curation.utils import visulize_distribution, verify_stereoisomers, check_undefined_stereocenters
from polaris.benchmark import MultiTaskBenchmarkSpecification
from polaris.dataset import Dataset, ColumnAnnotation



In [24]:
from polaris.hub.client import PolarisHubClient

client = PolarisHubClient()
client.login()

2023-10-12 14:57:34.940 | INFO     | polaris.hub.client:login:234 - You are already logged in to the Polaris Hub as luzhu (lu@valencediscovery.com). Set `overwrite=True` to force re-authentication.


In [47]:
# Define the owner of the dataset and benchmark (optional)
from polaris.utils.types import HubOwner

owner = HubOwner(organizationId="polaristest", slug="polaristest")

In [17]:
import platformdirs
import datamol as dm

# We will save the data for this tutorial to our cache dir!
SAVE_DIR = dm.fs.join(platformdirs.user_cache_dir(appname="polaris-tutorials"), "001")

In [91]:
# Load dataset
# table = pd.read_csv("data/tutorial_data_curated.csv")
table = dm.data.freesolv()  # datamol is imported as `dm` above

<a id='dataset'></a>
## How to create a Dataset?

**Define the dataset annotations with `polaris.dataset.ColumnAnnotation`**

Use `ColumnAnnotation` to describe the key columns in the dataset: the bioactivity columns, the molecular structures, and the identifiers. When needed, you can also add `user_attributes` with arbitrary keys and values, such as `unit`, `organism`, `scale`, and the optimization `objective`.

In [94]:
annotations = {
    "iupac": ColumnAnnotation(description="Compound ID from original dataset."),
    "smiles": ColumnAnnotation(description="Molecule SMILES string after cleaning and standardization."),
    "expt": ColumnAnnotation(description="Experimental small molecule hydration free energies"),
    "calc": ColumnAnnotation(description="Calculated small molecule hydration free energies"),
}

In [95]:
# Define Dataset object
dataset = Dataset(
    table=table,
    name="tutorial_freesol",
    description="FreeSolv: Experimental and Calculated Small Molecule Hydration Free Energies",
    source="https://github.com/MobleyLab/FreeSolv",
    annotations=annotations,
    owner=owner,
    tags=["polaris-tutorial"],
)

**Save the dataset to a local path**

In [96]:
# mkdtemp() creates a temporary directory that persists until it is deleted explicitly
temp_dir = tempfile.mkdtemp()

In [97]:
SAVE_DIR = f"{temp_dir}/dataset/rdkit_solubility"
dataset.to_json(SAVE_DIR)

'/var/folders/_7/ffxc1f251dbb5msn977xl4sm0000gr/T/tmp3h1u3cx4/dataset/rdkit_solubility/dataset.json'

**Or upload the dataset to the Hub**

In [98]:
from polaris.hub.client import PolarisHubClient

client = PolarisHubClient()
client.login()

2023-10-12 15:48:41.453 | INFO     | polaris.hub.client:login:234 - You are already logged in to the Polaris Hub as luzhu (lu@valencediscovery.com). Set `overwrite=True` to force re-authentication.


In [99]:
# response = client.upload_dataset(dataset=dataset)

2023-10-12 15:48:49.958 | SUCCESS  | polaris.hub.client:upload_dataset:416 - Your dataset has been successfully uploaded to the Hub. View it here: https://polarishub.io//datasets/polaristest/tutorial_freesol


## Create the benchmark specification
A benchmark is represented by the `BenchmarkSpecification`, which wraps a `Dataset` with additional data to produce a benchmark.

Specifically, it specifies:
1. Which dataset to use (see Dataset);
2. Which columns are used as input and which columns are used as target;
3. Which metrics should be used to evaluate performance on this task;
4. A predefined, static train-test split to use during evaluation.
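
Point 4 relies on the split being well formed. As a minimal sketch in plain Python (the helper `check_split` is hypothetical and not part of the Polaris API), one can check that the train and test indices fall inside the table and do not overlap. The row count of 642 is an assumption chosen to match the 600 + 42 ordered split used in this tutorial.

```python
def check_split(split, n_rows):
    """Return True if the (train, test) index lists are in range and disjoint.

    Hypothetical helper for illustration; Polaris performs its own validation.
    """
    train, test = split
    all_indices = list(train) + list(test)
    in_range = all(0 <= i < n_rows for i in all_indices)
    disjoint = set(train).isdisjoint(test)
    return in_range and disjoint


# An ordered split: the first 600 rows for training, the next 42 for testing.
split = (list(range(600)), list(range(600, 642)))
print(check_split(split, n_rows=642))
```

A split whose test indices overlap the training indices, or point past the end of the table, makes the function return `False`.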

### SingleTaskBenchmark

In [102]:
import numpy as np
from polaris.benchmark import SingleTaskBenchmarkSpecification

# For the sake of simplicity, we use a very simple, ordered split:
# the first 600 rows are the train set, the next 42 rows are the test set
split = (np.arange(600).tolist(), (np.arange(42) + 600).tolist())

benchmark = SingleTaskBenchmarkSpecification(
    dataset=dataset,
    target_cols="expt",
    input_cols="smiles",
    split=split,
    metrics="mean_absolute_error",
    description="Single task benchmark for solubility",
    tags=["SingleTask", "Regression"],
    owner=owner,  # optional, but required for Polaris Hub upload
)

#### Evaluation metrics

Metrics must be supported by the Polaris framework.
For more information, see the `Metric` class.

In [109]:
from polaris.evaluate import Metric

list(Metric)

[<Metric.mean_absolute_error: MetricInfo(fn=<function mean_absolute_error at 0x169014720>, is_multitask=False)>,
 <Metric.mean_squared_error: MetricInfo(fn=<function mean_squared_error at 0x169014ae0>, is_multitask=False)>,
 <Metric.accuracy: MetricInfo(fn=<function accuracy_score at 0x1689e7240>, is_multitask=False)>]
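
To make the first metric in that list concrete, here is a minimal plain-Python sketch of mean absolute error. It is an illustration only and not the function Polaris calls; the values in `y_true` and `y_pred` are made up for the example.

```python
def mean_absolute_error(y_true, y_pred):
    """Average of the absolute differences between targets and predictions."""
    if len(y_true) != len(y_pred):
        raise ValueError("y_true and y_pred must have the same length")
    return sum(abs(t - p) for t, p in zip(y_true, y_pred)) / len(y_true)


# Made-up values, for illustration only.
y_true = [-3.0, -1.5, 0.5]
y_pred = [-2.5, -1.5, 1.5]
mae = mean_absolute_error(y_true, y_pred)
print(mae)  # 0.5
```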

### MultiTaskBenchmark

To support the vast flexibility in specifying a benchmark, we have different classes that correspond to different types of benchmarks. Each of these subclasses makes the data model or logic more specific to a particular case. For example, trying to create a multitask benchmark with the same arguments as above will throw an error, because only a single target column is specified.
For demonstration purposes, this example simply uses the columns `expt` and `calc` as targets.

In [118]:
# Let's try that again, but now with two target columns
benchmark = MultiTaskBenchmarkSpecification(
    dataset=dataset,
    target_cols=["expt", "calc"],
    input_cols="smiles",
    split=split,
    metrics="mean_absolute_error",
    description="Multitask regression benchmark for experimental and calculated free energy",
    owner=owner,
    name="tutorial_freesolv",
)
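
The error described above for a single target column can be pictured as a small validation step. The sketch below is an assumption about the kind of check involved, not the library's actual implementation; `normalize_cols` and `check_multitask_targets` are hypothetical names.

```python
def normalize_cols(cols):
    """Accept a single column name or a list of names; always return a list."""
    return [cols] if isinstance(cols, str) else list(cols)


def check_multitask_targets(target_cols):
    """Raise if fewer than two target columns are given (hypothetical check)."""
    cols = normalize_cols(target_cols)
    if len(cols) < 2:
        raise ValueError("A multitask benchmark needs at least two target columns.")
    return cols


print(check_multitask_targets(["expt", "calc"]))

try:
    check_multitask_targets("expt")
except ValueError as err:
    print(f"Rejected: {err}")
```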

#### Save the benchmark
Saving the benchmark is easy and can be done with a single line of code.

In [119]:
path = benchmark.to_json(SAVE_DIR)
fs = dm.fs.get_mapper(SAVE_DIR).fs
fs.ls(SAVE_DIR)

['/var/folders/_7/ffxc1f251dbb5msn977xl4sm0000gr/T/tmp3h1u3cx4/dataset/rdkit_solubility/table.parquet',
 '/var/folders/_7/ffxc1f251dbb5msn977xl4sm0000gr/T/tmp3h1u3cx4/dataset/rdkit_solubility/benchmark.json',
 '/var/folders/_7/ffxc1f251dbb5msn977xl4sm0000gr/T/tmp3h1u3cx4/dataset/rdkit_solubility/dataset.json']

This created three files: two `json` files and a single `parquet` file. The `parquet` file saves the tabular structure at the base of the `Dataset` class, whereas the `json` files save all the metadata for the `Dataset` and `BenchmarkSpecification`.
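
The separation between tabular data and metadata can be sketched with the standard library alone. This is an illustration of the layout, not of how Polaris serializes its objects: CSV stands in for `parquet`, `save_sketch` is a hypothetical helper, and the rows are made-up values.

```python
import csv
import json
import tempfile
from pathlib import Path


def save_sketch(rows, metadata, out_dir):
    """Write the table and the metadata to separate files; return the file names."""
    out = Path(out_dir)
    out.mkdir(parents=True, exist_ok=True)

    # The tabular data goes into its own file (CSV here, parquet in Polaris).
    table_path = out / "table.csv"
    with table_path.open("w", newline="") as f:
        writer = csv.DictWriter(f, fieldnames=list(rows[0]))
        writer.writeheader()
        writer.writerows(rows)

    # The metadata goes into a JSON file that points at the table file.
    meta = dict(metadata, table=table_path.name)
    (out / "dataset.json").write_text(json.dumps(meta, indent=2))
    return sorted(p.name for p in out.iterdir())


# Made-up rows, for illustration only.
rows = [{"smiles": "CCO", "expt": -5.0}, {"smiles": "C", "expt": 2.0}]
files = save_sketch(rows, {"name": "demo"}, tempfile.mkdtemp())
print(files)  # ['dataset.json', 'table.csv']
```

Keeping the table in a columnar file and the metadata in JSON means the metadata stays human-readable while the table can be stored efficiently.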

### Upload benchmark to Polaris Hub

In [120]:
response = client.upload_benchmark(benchmark)

2023-10-12 16:25:08.736 | SUCCESS  | polaris.hub.client:upload_benchmark:453 - Your benchmark has been successfully uploaded to the Hub. View it here: https://polarishub.io//benchmarks/polaristest/tutorial_freesolv
