# The Basics

<div class="admonition abstract highlight">
    <p class="admonition-title">In short</p>
    <p>This tutorial walks you through the basic usage of Polaris. We will first log in to the Hub and then see how easy it is to load a dataset or benchmark from it. Finally, we will train a simple baseline and submit a first set of results!</p>
</div>

Polaris is designed to standardize the process of constructing datasets, specifying benchmarks and evaluating novel machine learning techniques within the realms of biology, chemistry, and drug discovery.

While the Polaris library can be used independently of the <a href="https://polarishub.io/">Polaris Hub</a>, the two were designed to work together seamlessly. The Hub provides various pre-made, high-quality datasets and benchmarks for developing and evaluating novel ML methods. In this tutorial, we will see how easy it is to load and use these datasets and benchmarks.

In [2]:
import polaris as po
from polaris.hub.client import PolarisHubClient

### Login
To be able to complete this step, you will require a Polaris Hub account. Go to [https://polarishub.io/](https://polarishub.io/) to create one. You only have to log in once at the start or when you haven't used your account in a while.

In [3]:
client = PolarisHubClient()
client.login()

2023-11-27 14:54:08.788 | INFO     | polaris.hub.client:login:262 - You are already logged in to the Polaris Hub as cwognum (cas@valencediscovery.com). Set `overwrite=True` to force re-authentication.


Instead of using the Python API, you can also log in through the Polaris CLI. See:
```sh
polaris login --help
```

### Load from the Hub
Both datasets and benchmarks are identified by an `owner/name` ID. You can find and copy these IDs through the Hub. Once you have the ID, loading a dataset or benchmark takes a single line of code.

In [4]:
dataset = po.load_dataset("polaris/hello_world_dataset")
benchmark = po.load_benchmark("polaris/hello_world_benchmark")
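
To illustrate the `owner/name` convention, here is a minimal sketch of a helper that splits such an ID into its two parts. The name `split_artifact_id` is hypothetical and introduced for illustration only. It is not part of the Polaris API, and the validation rule (exactly one slash, two non-empty parts) is an assumption.

```python
def split_artifact_id(artifact_id: str) -> tuple[str, str]:
    """Split an `owner/name` ID into (owner, name).

    Hypothetical helper for illustration; not part of the Polaris API.
    """
    owner, sep, name = artifact_id.partition("/")
    # Assumption: a valid ID has exactly one slash and two non-empty parts.
    if not sep or not owner or not name or "/" in name:
        raise ValueError(
            f"Expected an ID of the form 'owner/name', got {artifact_id!r}"
        )
    return owner, name


owner, name = split_artifact_id("polaris/hello_world_dataset")
print(owner, name)
```

A check like this can catch a mistyped ID before any request is made to the Hub.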

### Use the benchmark
The Polaris library is designed to make it easy to participate in a benchmark. In just a few lines of code, we can get the train and test partitions, access the associated data in various ways, and evaluate our predictions. There are two main API endpoints:

- `get_train_test_split()`: For creating objects through which we can access the different dataset partitions.
- `evaluate()`: For evaluating a set of predictions in accordance with the benchmark protocol.

In [5]:
train, test = benchmark.get_train_test_split()

The created objects support several ways of accessing the data:

- The objects are iterable;
- The objects can be indexed;
- The objects have properties to access all data at once.

In [6]:
for x, y in train:
    pass

In [7]:
for i in range(len(train)):
    x, y = train[i]

In [8]:
x = train.inputs
y = train.targets

To avoid accidental access to the test targets, the test object does not expose the labels and will throw an error if you try to access them explicitly.

In [9]:
for x in test:
    pass

In [10]:
for i in range(len(test)):
    x = test[i]

In [11]:
x = test.inputs

# NOTE: The below will throw an error!
# y = test.targets
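
The access pattern above can be pictured with a minimal sketch. The `Subset` class below is a hypothetical stand-in, not Polaris's actual implementation. It supports iteration, indexing and bulk access, and raises an error when the targets of a hidden split are requested. The SMILES strings and target values are made up.

```python
class Subset:
    """Toy stand-in for a benchmark partition (hypothetical, not the Polaris class)."""

    def __init__(self, inputs, targets, hide_targets=False):
        self._inputs = list(inputs)
        self._targets = list(targets)
        self._hide_targets = hide_targets

    @property
    def inputs(self):
        return self._inputs

    @property
    def targets(self):
        # Refuse explicit access to the labels of a hidden split.
        if self._hide_targets:
            raise RuntimeError("The targets of this split are hidden.")
        return self._targets

    def __len__(self):
        return len(self._inputs)

    def __getitem__(self, i):
        # A hidden split yields only the input; otherwise an (input, target) pair.
        if self._hide_targets:
            return self._inputs[i]
        return self._inputs[i], self._targets[i]

    def __iter__(self):
        return (self[i] for i in range(len(self)))


toy_train = Subset(["CCO", "CCN"], [0.5, 1.2])
toy_test = Subset(["CCC"], [0.9], hide_targets=True)

for x, y in toy_train:  # yields (input, target) pairs
    print(x, y)

for x in toy_test:  # yields inputs only
    print(x)
```

Keeping the labels behind a property that raises makes leaking test targets an explicit error rather than a silent mistake.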

### Partake in the benchmark

To complete our example, let's participate in the benchmark. We will train a simple random forest model on the ECFP representation through scikit-learn and datamol.

In [12]:
import datamol as dm
from sklearn.ensemble import RandomForestRegressor

# Convert SMILES strings to ECFP fingerprints
train_fps = [dm.to_fp(smi) for smi in train.inputs]

# Define a model and train
model = RandomForestRegressor(max_depth=2, random_state=0)
model.fit(train_fps, train.targets)

To evaluate a model within Polaris, use the `evaluate()` endpoint. You only need to provide the predictions. The test targets are extracted automatically, which minimizes the chance of the user accessing the test labels.

In [13]:
test_fps = [dm.to_fp(smi) for smi in test.inputs]
predictions = model.predict(test_fps)

In [14]:
results = benchmark.evaluate(predictions)
results

| Test set | Target label | Metric | Score |
|---|---|---|---|
| test | SOL | mean_squared_error | 2.6875139821 |
| test | SOL | mean_absolute_error | 1.2735690161 |

| Field | Value |
|---|---|
| name | |
| description | |
| tags | |
| user_attributes | |
| owner | |
| benchmark_name | hello_world_benchmark |
| benchmark_owner | slug: polaris, external_id: org_2WG9hRFgKNIRtGw4orsMPcr1F4S, type: organization |
| github_url | |
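
For reference, the two metrics reported above are standard regression metrics, and lower is better for both. The sketch below computes them in plain Python on made-up numbers, not on the benchmark data. It only illustrates the definitions and makes no claim about how `evaluate()` is implemented internally.

```python
def mean_squared_error(y_true, y_pred):
    """Average of the squared differences between targets and predictions."""
    return sum((t - p) ** 2 for t, p in zip(y_true, y_pred)) / len(y_true)


def mean_absolute_error(y_true, y_pred):
    """Average of the absolute differences between targets and predictions."""
    return sum(abs(t - p) for t, p in zip(y_true, y_pred)) / len(y_true)


# Made-up values for illustration only
y_true = [1.0, 2.0, 4.0]
y_pred = [1.5, 2.0, 2.0]

print(mean_squared_error(y_true, y_pred))   # (0.25 + 0 + 4) / 3
print(mean_absolute_error(y_true, y_pred))  # (0.5 + 0 + 2) / 3
```

Because the squared error penalizes large deviations more heavily, a single poor prediction affects the mean squared error far more than the mean absolute error.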


Before uploading the results to the Hub, you can provide some additional information about the results that will be displayed on the Polaris Hub.

In [15]:
results.name = "hello-world-result"
results.github_url = "https://github.com/polaris-hub/polaris-hub"
results.paper_url = "https://polaris-hub.vercel.app"
results.description = "Hello, World!"

Finally, let's upload the results to the Hub! The result will be private by default, but by visiting the link in the logs, you can make it public through the Hub.

In [16]:
client.upload_results(results, owner="cwognum")
client.close()

  Expected `url` but got `str` - serialized value may not be as expected
  Expected `url` but got `str` - serialized value may not be as expected
  return self.__pydantic_serializer__.to_python(
2023-11-27 14:54:46.649 | SUCCESS  | polaris.hub.client:upload_results:428 - Your result has been successfully uploaded to the Hub. View it here: https://polarishub.io/benchmarks/polaris/hello_world_benchmark/ns4JrC3hQNK9M1hbVPchy


That's it! Just like that, you have taken part in your first Polaris benchmark. In the next tutorials, we will consider more advanced use cases of Polaris, such as creating and uploading your own datasets and benchmarks.

The End.

---