# Creating datasets with zarr

<div class="admonition abstract highlight">
    <p class="admonition-title">In short</p>
    <p>This tutorial shows how to create datasets with more advanced data modalities through the .zarr format.</p>
</div>

## Pointer columns

Not all data fits a tabular format, e.g. images or conformers. For such data, we use _pointer_ columns. Pointer columns do not contain the data itself, but rather store a reference to an external file from which the content can be loaded.

For now, we only support `.zarr` files as references. To learn more about `.zarr`, visit their documentation. Their [tutorial](https://zarr.readthedocs.io/en/stable/tutorial.html) is a particularly good read to better understand the main features.

### Dummy example
For the sake of simplicity, let's assume we have just two datapoints. We will use this to demonstrate the idea behind pointer columns. 

In [1]:
import zarr
import platformdirs

import numpy as np
import datamol as dm
import pandas as pd

SAVE_DIR = dm.fs.join(platformdirs.user_cache_dir(appname="polaris-tutorials"), "002")

In [2]:
# Create two random images and save each one to its own .zarr directory
images = np.random.random((2, 64, 64, 3))

train_path = dm.fs.join(SAVE_DIR, "single_train.zarr")
zarr.save(train_path, images[0])

test_path = dm.fs.join(SAVE_DIR, "single_test.zarr")
zarr.save(test_path, images[1])

In [3]:
table = pd.DataFrame(
    {
        "images": [train_path, test_path],  # Instead of the content, we specify paths
        "target": np.random.random(2),
    }
)

In [4]:
from polaris.dataset import Dataset, ColumnAnnotation

dataset = Dataset(
    table=table,
    # To indicate that we are dealing with a pointer column here,
    # we need to annotate the column.
    annotations={"images": ColumnAnnotation(is_pointer=True)},
)

Note how the table does not contain the image data, but rather stores a path. 

In [5]:
dataset.table.loc[0, "images"]

'/home/cas/.cache/polaris-tutorials/002/single_train.zarr'

To load the data that is being pointed to, you can simply use the `Dataset.get_data()` utility method. 

In [6]:
dataset.get_data(col="images", row=0).shape

(64, 64, 3)

Creating a benchmark and its associated `Subset` objects will load the referenced data automatically!

In [7]:
from polaris.benchmark import SingleTaskBenchmarkSpecification

benchmark = SingleTaskBenchmarkSpecification(
    dataset=dataset,
    input_cols="images",
    target_cols="target",
    metrics="mean_absolute_error",
    split=([0], [1]),
)

In [8]:
train, test = benchmark.get_train_test_split()

for x, y in train:
    # At this point, the content is loaded from the path specified in the table
    print(x.shape)

(64, 64, 3)
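
Conceptually, the lazy loading seen in the loop above can be pictured with the plain-Python sketch below. This is not how Polaris implements `Subset`; the `LazySubset` class and the `fake_loader` function are hypothetical names introduced only for illustration. The point is that the pointer is resolved at iteration time, not when the object is created.

```python
class LazySubset:
    """Hypothetical stand-in for a subset that resolves pointers on access."""

    def __init__(self, pointers, targets, loader):
        self.pointers = pointers
        self.targets = targets
        self.loader = loader

    def __len__(self):
        return len(self.pointers)

    def __iter__(self):
        for pointer, target in zip(self.pointers, self.targets):
            # The referenced content is only read here, one row at a time.
            yield self.loader(pointer), target


loaded = []  # Records which pointers have been resolved so far


def fake_loader(pointer):
    # Hypothetical loader: a real one would open the referenced file.
    loaded.append(pointer)
    return f"<content of {pointer}>"


subset = LazySubset(["a.zarr", "b.zarr"], [0.1, 0.9], fake_loader)
print(loaded)  # Nothing has been loaded yet: []

for x, y in subset:
    print(x, y)

print(loaded)  # Both pointers were resolved during iteration
```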


## Creating datasets from `.zarr` arrays

While the above example works, creating the table with all paths from scratch is time-consuming when datasets get large. Instead, you can also automatically parse a `.zarr` hierarchy into the expected tabular data structure. 

A little more about zarr: A `.zarr` file can contain groups and arrays, where each group can again contain groups and arrays. Each array can be saved as one or multiple chunks. Additional user attributes (for any array or group) are saved as JSON files.

Within Polaris:

1. Each subgroup of the root group corresponds to a single column.
2. Each subgroup can contain:
    - A single array with all datapoints.
    - A single array per datapoint.
3. Additional meta-data is saved to the user attributes of the root group.
4. The indices are required to be integers.

To better explain how this works, let's look at two examples corresponding to the two cases in point 2 above.

### A single array _per_ data point
In this first example we will create a zarr array _per_ data point. The structure of the zarr will look like: 

```
/
  column_a/
      array_1
      array_2
      ...
      array_N
```

and as we will see, this will get parsed into

| column_a                             |
| ------------------------------------ |
| /path/to/root.zarr/column_a/array_1  |
| /path/to/root.zarr/column_a/array_2  |
|                  ...                 |
| /path/to/root.zarr/column_a/array_N  |


<div class="admonition info highlight">
    <p class="admonition-title">Note</p>
    <p>Notice that the dataset no longer stores the content of the array itself, but rather a reference to the array.</p>
</div>
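
The mapping from hierarchy to table can be sketched in plain Python as follows. This is an illustration under stated assumptions, not the Polaris parser: the function name `pointers_per_datapoint` is hypothetical, and the exact path format is assumed to be a simple join of the root path, the column name, and the array name. It also enforces the rule that array names must be integer indices.

```python
import posixpath


def pointers_per_datapoint(root_path, column, array_names):
    """Build one pointer per array, ordered by the integer array name."""
    indices = []
    for name in array_names:
        # Rule 4: the indices are required to be integers.
        if not str(name).isdigit():
            raise ValueError(f"Array name {name!r} is not an integer index")
        indices.append(int(name))
    return [posixpath.join(root_path, column, str(i)) for i in sorted(indices)]


paths = pointers_per_datapoint("/path/to/root.zarr", "column_a", ["2", "0", "1"])
for path in paths:
    print(path)
# /path/to/root.zarr/column_a/0
# /path/to/root.zarr/column_a/1
# /path/to/root.zarr/column_a/2
```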

In [9]:
# Let's first create a dummy dataset with 1000 64x64 "images"
images = np.random.random((1000, 64, 64, 3))

To be able to use these images in Polaris, we need to save them in the zarr hierarchy.

In [10]:
path = dm.fs.join(SAVE_DIR, "zarr", "archive_multi.zarr")

with zarr.open(path, "w") as root:
    with root.create_group("images") as group:
        for i, arr in enumerate(images):
            # If you're saving an array per datapoint,
            # the name of the array needs to be an integer
            group.array(i, arr)

    # The root directory can furthermore contain all additional meta-data in its user attributes.
    root.attrs["name"] = "dummy_image_dataset"
    root.attrs["description"] = "Randomly generated 64x64 images"
    root.attrs["source"] = "https://doi.org/xx.xxxx"

    # To ensure proper processing, it is important that we annotate the column.
    # As this has to be JSON serializable, we create a dict instead of the object.
    # Due to using Pydantic, this will work seamlessly.
    root.attrs["annotations"] = {"images": {"is_pointer": True}}

In [11]:
dataset = Dataset.from_zarr(path)

In [12]:
dataset.get_data(col="images", row=0).shape

(64, 64, 3)

### A single array for _all_ datapoints 
Instead of having an array per datapoint, you can also batch all datapoints into a single array. This could, for example, speed up compression.

In this case, our zarr hierarchy will look like this: 
```
/
  column_a/
      array
```

This will get parsed into a table like:

| column_a                             |
| ------------------------------------ |
| /path/to/root.zarr/column_a/array#1  |
| /path/to/root.zarr/column_a/array#2  |
|                 ...                  |
| /path/to/root.zarr/column_a/array#N  |

<div class="admonition info highlight">
    <p class="admonition-title">Note</p>
    <p>Notice the # suffix in the path, which indicates the index at which the datapoint is stored within the big array.</p>
</div>
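
To make the `#` convention concrete, here is a plain-Python sketch of how such pointers could be built and split back into an array path and a row index. The helper names `make_pointers` and `split_pointer` are hypothetical and this is not the Polaris implementation; a nested list stands in for the batched zarr array.

```python
def make_pointers(array_path, n_rows):
    """Create one pointer per row of a batched array."""
    return [f"{array_path}#{i}" for i in range(n_rows)]


def split_pointer(pointer):
    """Split a pointer into (array path, row index or None)."""
    path, sep, index = pointer.rpartition("#")
    if not sep:
        # No suffix: the pointer refers to a whole array.
        return pointer, None
    return path, int(index)


array_path = "/path/to/root.zarr/column_a/array"
batched = [[0.0, 0.1], [1.0, 1.1], [2.0, 2.1]]  # Stand-in for the big array

pointers = make_pointers(array_path, len(batched))
path, index = split_pointer(pointers[1])
print(path, index)     # /path/to/root.zarr/column_a/array 1
print(batched[index])  # [1.0, 1.1]
```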

In [13]:
path = dm.fs.join(SAVE_DIR, "zarr", "archive_single.zarr")

with zarr.open(path, "w") as root:
    with root.create_group("images") as group:
        group.array("data", images)

In [14]:
dataset = Dataset.from_zarr(path)

# The path refers to the original zarr directory we created in the above code block
dataset.table.iloc[0]["images"]

'/home/cas/.cache/polaris-tutorials/002/zarr/archive_single.zarr//images/data#0'

In [15]:
dataset.get_data(col="images", row=0).shape

(64, 64, 3)

## Saving the dataset

We can still easily save the dataset. All the pointer columns will be automatically updated. 

In [16]:
savedir = dm.fs.join(SAVE_DIR, "json")
json_path = dataset.to_json(savedir)

In [17]:
fs = dm.fs.get_mapper(path).fs
fs.ls(SAVE_DIR)

['/home/cas/.cache/polaris-tutorials/002/benchmark.json',
 '/home/cas/.cache/polaris-tutorials/002/single_train.zarr',
 '/home/cas/.cache/polaris-tutorials/002/dataset.json',
 '/home/cas/.cache/polaris-tutorials/002/table.parquet',
 '/home/cas/.cache/polaris-tutorials/002/zarr',
 '/home/cas/.cache/polaris-tutorials/002/single_test.zarr',
 '/home/cas/.cache/polaris-tutorials/002/json']

Besides the `table.parquet` and `dataset.json`, we can now also see a `data` folder which stores the content of the pointer columns. Alternatively, we might want to save the dataset as a single `.zarr` file. With the `array_mode` argument, we can choose between the two structures we outlined in this tutorial.

In [18]:
savedir = dm.fs.join(SAVE_DIR, "zarr")
zarr_path = dataset.to_zarr(savedir, array_mode="single")

## Load the dataset

In [19]:
Dataset.from_json(json_path)

In [20]:
Dataset.from_zarr(zarr_path)

The End. 