# Quick Overview

This chapter gives a quick overview of how to use the package. Detailed explanations can be found in the [user guide](../userguide/index.rst).

Start by importing the package:

In [None]:
import h5rdmtoolbox as h5tbx

## Difference to `h5py` package

The `h5RDMtoolbox` is built upon the `h5py` package. The base functionality is kept, but convenient features and interfaces are added.

### Filename
A filename does not have to be provided when creating a new file. If none is provided, a temporary file is created. In addition, the property `hdf_filename` makes it possible to work with the filename even after the file has been closed *and* to work with `pathlib.Path` objects instead of strings:

In [None]:
with h5tbx.use(None):
    with h5tbx.File() as h5:
        pass
h5.hdf_filename.name  # equal to h5.filename but a pathlib.Path and exists also after the file is closed
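
The idea behind `hdf_filename` can be illustrated without the toolbox. The following is a minimal sketch using only the standard library. The class `TempBackedFile` is hypothetical and is not how `h5RDMtoolbox` implements the property. It shows a file object that creates a temporary file when no name is given and keeps the path as a `pathlib.Path` after the handle is closed:

```python
import os
import pathlib
import tempfile


class TempBackedFile:
    """Hypothetical stand-in for a file object (not the toolbox's class)."""

    def __init__(self, name=None):
        if name is None:
            # No filename given: create a temporary file and remember its path.
            fd, name = tempfile.mkstemp(suffix=".hdf")
            os.close(fd)
        self.path = pathlib.Path(name)  # stays available after closing
        self._handle = open(self.path, "ab")

    def __enter__(self):
        return self

    def __exit__(self, *exc):
        self._handle.close()
        return False


with TempBackedFile() as f:
    pass

# The handle is closed, but the path is still available as a Path object.
print(f.path.name, f.path.exists())
f.path.unlink()  # clean up the temporary file
```

Keeping the path separate from the open handle is what allows later cells to reopen the same file by name.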

---

## Dataset features

The toolbox aims to simplify working with HDF5 files. One example is functionality added to object creation, e.g. attributes can be set while creating groups and datasets.

Inspecting the file content is also made easy and user-friendly.

Below we create an HDF5 dataset with two attributes and dump the content. Use `dump()` in notebooks and `dumps()` in scripts or a Python console. Note how the file representation is interactive. Also note that there is even more interactivity when we explore the [RDF feature](#Semantification-of-HDF-data-using-RDF).

In [None]:
with h5tbx.File() as h5:
    ds_time = h5.create_dataset(
        name='time',
        data=[0, 1, 2, 3],
        attrs=dict(units='s', long_name='measurement time'),
        make_scale=True
    )
    h5.dump(collapsed=False)

### Datasets/xarray interface

Data access does not return an `np.ndarray` but an `xr.DataArray` object. It can store attributes and coordinates (a concept similar to HDF dimension scales). Find out about all the possibilities this offers in [xarray's documentation](https://xarray.pydata.org/).

Let's create some sample data and see how this new return object can help:

In [None]:
import numpy as np

time = np.linspace(0, np.pi/4, 21) # units [s]
signal = np.sin(2*np.pi*3*time) # units [V], physical: [m/s]

with h5tbx.File() as h5:
    vel_hdf_filename = h5.hdf_filename # store for later use
    
    ds_time = h5.create_dataset(name='time',
                                data=time,
                                attrs=dict(units='s',
                                           long_name='measurement time'),
                                make_scale=True)
    
    ds_signal = h5.create_dataset(name='vel',
                                  data=signal,
                                  attrs=dict(units='m/s',
                                             long_name='air velocity in pipe'),
                                  attach_scale=ds_time)

Inspired by `xarray`, the methods `sel` and `isel` are implemented:

In [None]:
with h5tbx.File(vel_hdf_filename) as h5:
    vel2 = h5['vel'].sel(time=2, method='nearest')
vel2
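
To make clear what a nearest-neighbour selection does, here is a minimal sketch in plain Python. The helper `sel_nearest` is hypothetical and is not xarray's or the toolbox's implementation. It only illustrates the idea of picking the sample whose coordinate is closest to the requested value:

```python
import math

# Same sample data as above, built without NumPy.
n = 21
time = [i * (math.pi / 4) / (n - 1) for i in range(n)]
signal = [math.sin(2 * math.pi * 3 * t) for t in time]


def sel_nearest(coords, values, target):
    """Return (coordinate, value) of the sample closest to `target`."""
    idx = min(range(len(coords)), key=lambda i: abs(coords[i] - target))
    return coords[idx], values[idx]


t_near, v_near = sel_nearest(time, signal, target=2)
# The time axis ends at pi/4 (about 0.785), so the sample nearest to 2 is the last one.
print(t_near, v_near)
```

Because the requested value lies outside the coordinate range, an exact match would fail, while the nearest-neighbour lookup falls back to the last sample.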

Another advantage is the plotting utility from `xarray`:

In [None]:
with h5tbx.File(vel_hdf_filename) as h5:
    vel_data = h5['vel'][:]
    vel_data.plot(marker='o')
    
vel_data  # this returns the interactive view of the array and its meta data

### Natural Naming
So far, we have addressed datasets and groups in the conventional dictionary-like style. `h5RDMtoolbox` allows using "natural naming", which means that we can address those objects as if they were attributes. Make sure `h5tbx.config.natural_naming` is set to `True` (the default).

Let's first disable `natural_naming`:

In [None]:
with h5tbx.set_config(natural_naming=False):
    with h5tbx.File(vel_hdf_filename, 'r') as h5:
        try:
            ds = h5.vel[:]
        except Exception as e:
            print(e)

Enable it:

In [None]:
with h5tbx.set_config(natural_naming=True):
    with h5tbx.File(vel_hdf_filename, 'r') as h5:
        ds = h5.vel[:]
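
How attribute-style access can sit on top of dictionary-style access is sketched below. The class `NaturalNamingGroup` is hypothetical and only demonstrates the general Python mechanism (`__getattr__`). It is not the toolbox's implementation:

```python
class NaturalNamingGroup:
    """Hypothetical dict-like container with optional attribute-style access."""

    def __init__(self, items, natural_naming=True):
        self._items = dict(items)
        self._natural_naming = natural_naming

    def __getitem__(self, name):
        return self._items[name]

    def __getattr__(self, name):
        # Only called when normal attribute lookup fails.
        if name.startswith("_") or not self._natural_naming:
            raise AttributeError(name)
        try:
            return self._items[name]
        except KeyError:
            raise AttributeError(name) from None


grp = NaturalNamingGroup({"vel": [0.0, 0.5, 1.0]})
print(grp.vel is grp["vel"])  # both access styles return the same object

strict = NaturalNamingGroup({"vel": [0.0, 0.5, 1.0]}, natural_naming=False)
try:
    strict.vel
except AttributeError as e:
    print("AttributeError:", e)
```

With the flag disabled, only the dictionary-style access `strict["vel"]` works, which mirrors the behaviour of the two cells above.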

---

## Conventions
The file content is controlled by means of a `convention`. This means that specific attributes are required for HDF groups or datasets.

They can be understood as rules that are validated during usage. For those rules to take effect, the convention must be imported and enabled. Users can also create their own conventions. More on this [here](../userguide/convention/index.rst).

For now, we select an existing one, which is published on [Zenodo](https://zenodo.org/record/10428795):

In [None]:
from h5rdmtoolbox.repository.zenodo import ZenodoRecord

cv = h5tbx.convention.from_repo(
    ZenodoRecord(source=12526361),
    name="tutorial_convention.yaml"
)
cv

From the above representation string of the convention object we can read which attributes are *optional* or **required** for file creation (`__init__`), dataset creation (`create_dataset`) or group creation (`create_group`).

Without enabling the convention, working with HDF5 files through the `h5rdmtoolbox` is almost the same as using `h5py` (apart from a few additional features that make life a bit easier):

In [None]:
with h5tbx.File() as h5:
    h5.dump()

**Now, we enable the convention ...**

In [None]:
h5tbx.use(cv)

... and get an error, because we are not providing a "data_type":

In [None]:
try:
    with h5tbx.File() as h5:
        pass
except Exception as e:
    print(e)

The next cell passes the attributes defined by the convention (`contact` and `data_type` for the file, `units` and `long_name` for the datasets) directly as keyword arguments:

In [None]:
import numpy as np

time = np.linspace(0, np.pi/4, 21) # units [s]
signal = np.sin(2*np.pi*3*time) # units [V], physical: [m/s]

with h5tbx.File(contact=h5tbx.__author_orcid__, data_type='experimental') as h5:
    vel_hdf_filename = h5.hdf_filename # store for later use
    
    ds_time = h5.create_dataset(name='time',
                                data=time, 
                                units='s',
                                long_name='measurement time',
                                make_scale=True)
    
    ds_signal = h5.create_dataset(name='vel',
                                  data=signal,
                                  units='m/s',
                                  long_name='air velocity in pipe',
                                  attach_scale=ds_time)
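
The principle of a convention, rules that are checked when an object is created, can be sketched in a few lines. The rule set `RULES` and the function `check_required` are hypothetical and are not taken from the tutorial convention or from the toolbox:

```python
# Hypothetical rule set for illustration; a real convention defines its own.
RULES = {
    "__init__": ["data_type"],
    "create_dataset": ["units", "long_name"],
}


def check_required(method, provided, rules=RULES):
    """Raise ValueError if a required attribute for `method` is missing."""
    missing = [name for name in rules.get(method, []) if name not in provided]
    if missing:
        raise ValueError(f"{method}: missing required attribute(s): {', '.join(missing)}")
    return True


# File creation without "data_type" is rejected ...
try:
    check_required("__init__", {})
except ValueError as e:
    print(e)

# ... while complete attribute sets pass.
check_required("__init__", {"data_type": "experimental"})
check_required("create_dataset", {"units": "m/s", "long_name": "air velocity in pipe"})
```

Checking at creation time means an incomplete file is rejected before it is written, rather than being detected later.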

---

## Semantification of HDF data using RDF

The files can be described by RDF triples:

In [None]:
import rdflib

In [None]:
h5tbx.use(None)

with h5tbx.File() as h5:
    # use `frdf` for "file attributes":
    h5.attrs["title"] = "Test file"
    h5.frdf["title"].predicate = rdflib.DCTERMS.title
    h5.frdf["title"].object = [rdflib.Literal("Test file", "en"), rdflib.Literal("Test Datei", "de")]

    h5.attrs["created"] = "2025-05-05"
    h5.frdf["created"].predicate = "http://purl.org/dc/terms/created"
    
    grp = h5.create_group('contact', attrs=dict(orcid='https://orcid.org/0000-0001-8729-0482'))   
    grp.rdf.type = 'http://xmlns.com/foaf/0.1/Person'  # what the content of group is, namely a foaf:Person
    grp.frdf.file_predicate = rdflib.DCTERMS.creator  # the creator shall be related to the file
    grp.rdf.subject = 'https://orcid.org/0000-0001-8729-0482'  # corresponds to @ID in JSON-LD
    
    grp.rdf['orcid'].predicate =  'http://w3id.org/nfdi4ing/metadata4ing#orcidId'
    
    grp.attrs['first_name', rdflib.FOAF.givenName] = 'Matthias'
    grp.attrs['last_name', rdflib.FOAF.familyName] = 'Probst'

    h5.dump()

One benefit is that both users and machines can interpret the meaning of the data. A machine-interpretable, standardised exchange format used by Semantic Web technology is JSON-LD; Turtle is another common RDF serialization. The toolbox can export the RDF description of a file. The call below serializes it to Turtle (`format="ttl"`):

In [None]:
print(h5tbx.serialize(h5.hdf_filename, format="ttl", file_uri="https://example.org#"))

### Validation with SHACL

Being able to generate RDF data means that we can also use [SHACL](https://www.w3.org/TR/shacl/) to validate files.

The following SHACL shape requires that each `hdf:File` has a creation date specified via `dcterms:created`:

In [None]:
shacl_shape = '''@prefix sh: <http://www.w3.org/ns/shacl#> .
@prefix dcterms: <http://purl.org/dc/terms/> .
@prefix xsd: <http://www.w3.org/2001/XMLSchema#> .
@prefix hdf: <http://purl.allotrope.org/ontologies/hdf5/1.8#> .
@prefix ex: <http://example.org/ns#> .

ex:HDFFileCreatedShape
    a sh:NodeShape ;
    sh:targetClass hdf:File ;                # apply only to hdf:File instances
    sh:property [
        sh:path dcterms:created ;            # must have this property
        sh:datatype xsd:date ;               # value must be a date
        sh:minCount 1 ;                      # at least one occurrence
        sh:maxCount 1 ;                      # optional but recommended
        sh:message "Each hdf:File must have exactly one dcterms:created value of type xsd:date." ;
    ] .
'''

The toolbox provides the validation function via the module `ld` (linked-data). In our case, the validation succeeds:

In [None]:
from h5rdmtoolbox.ld.shacl import validate_hdf

res = validate_hdf(
    hdf_source=h5.hdf_filename,
    shacl_data=shacl_shape
)
print(res.results_text)
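
What the shape above demands can be restated in plain Python. This is a simplified sketch and not a SHACL engine. The function `check_created` is hypothetical, and it tests whether a string parses as an ISO date, whereas SHACL checks the datatype of the RDF literal:

```python
from datetime import date


def check_created(values):
    """Mimic the shape: exactly one value, and it must parse as an ISO date."""
    if len(values) != 1:
        return False, f"expected exactly one dcterms:created value, found {len(values)}"
    try:
        date.fromisoformat(values[0])
    except ValueError:
        return False, f"{values[0]!r} is not a valid date"
    return True, "conforms"


print(check_created(["2025-05-05"]))  # one valid date
print(check_created([]))              # violates sh:minCount 1
print(check_created(["2025-05-05", "2025-05-06"]))  # violates sh:maxCount 1
```

The three calls correspond to the constraints `sh:datatype`, `sh:minCount` and `sh:maxCount` of the shape.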

---

## Databases

The `h5rdmtoolbox` currently implements two ways of using databases with HDF5 files. One maps metadata into a [MongoDB](https://www.mongodb.com/) database. The other uses the HDF5 file itself as a database and allows querying without any further step.

In this quick tutorial, we use the second solution. More on the topic can be found in the [documentation](https://h5rdmtoolbox.readthedocs.io/en/latest/database/index.html).

Let's find the dataset with name "/vel" (yes, trivial in this case, but just to get an idea). We use `find_one`, because we want to find only one (the first) occurrence:

In [None]:
from h5rdmtoolbox.database import FileDB

In [None]:
res = FileDB(vel_hdf_filename).find_one({'$name': '/vel'})
print(res.name)

The same can be done from an opened file, too:

In [None]:
with h5tbx.File(vel_hdf_filename) as h5:
    res = h5.find_one({'$name': '/vel'})
res.name

Let's find all (`find`) datasets with the attribute "units" and any value:

In [None]:
res = FileDB(vel_hdf_filename).find({'units': {'$regex': '.*'}})
for r in res:
    print(r)
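
The filter syntax follows the style of MongoDB queries. The sketch below shows how such a filter could be evaluated against plain attribute dictionaries. The function `matches` is hypothetical, supports only `$regex`, `$exists` and equality, and is not the toolbox's query engine. The entry `/notes` is made up to show a non-matching object:

```python
import re


def matches(attrs, flt):
    """Return True if the attribute dict satisfies a small Mongo-style filter."""
    for key, cond in flt.items():
        if isinstance(cond, dict):
            for op, arg in cond.items():
                if op == "$regex":
                    if key not in attrs or re.search(arg, str(attrs[key])) is None:
                        return False
                elif op == "$exists":
                    if (key in attrs) != bool(arg):
                        return False
                else:
                    raise NotImplementedError(f"operator {op} not supported in this sketch")
        elif attrs.get(key) != cond:
            return False
    return True


objects = {
    "/time": {"units": "s", "long_name": "measurement time"},
    "/vel": {"units": "m/s", "long_name": "air velocity in pipe"},
    "/notes": {},  # hypothetical object without "units"
}
found = [name for name, attrs in objects.items()
         if matches(attrs, {"units": {"$regex": ".*"}})]
print(found)  # ['/time', '/vel']
```

The regular expression `.*` matches any value, so the filter effectively selects every object that has a `units` attribute.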

---

## Layouts

Layouts define how a file is expected to be organized: which groups and datasets must exist, which attributes are expected, and much more. Layouts define expectations and thus help file exchange where multiple users are involved. In the jargon of the toolbox, we call these expectations "specifications".

*Note*: In future versions of the toolbox, the layout definition may be formulated as a SHACL shape. However, this is less pythonic and more difficult to write, since it requires knowledge of the SHACL syntax.

**Design concept**<br>
The `layout` module makes use of the database solution for HDF5 files. The idea is that we should be able to formulate our expectations/specifications in the form of queries. When a file is validated later, each query is expected to find matching HDF5 objects in it. For more detailed information, see [here](https://h5rdmtoolbox.readthedocs.io/en/latest/layouts.html).

Let's design a simple one, which requires all datasets to have the attribute "units":

In [None]:
from h5rdmtoolbox.layout import Layout

In [None]:
lay = Layout()

spec_all_dataset = lay.add(
    FileDB.find,  # query function
    flt={},
    objfilter='dataset',
    n=None
)

# The following specification is added to the previous one.
# Its query is applied only to the results found by the previous query.
spec_units = spec_all_dataset.add(
    FileDB.find,
    flt={'units': {'$exists': True}},  # attribute "units" exists
    n=1
)

# We added one specification to the layout. Let's check:
lay.specifications  # note that the second specification is not shown, because it is part of the first one

In [None]:
res = lay.validate(vel_hdf_filename)
res.is_valid()

In [None]:
res.print_summary(exclude_keys=('kwargs', 'target_name', 'target_type'))

The above layout successfully validates the file.

Now, let's add another specification:
- The file must have one dataset named "pressure".
- The exact location within the file does not matter.
- This dataset must have the unit "Pa".
- The shape of the dataset must be equal to `(21,)`.

In [None]:
lay.add(
    FileDB.find_one,  # query function
    flt={'$name': {'$regex': 'pressure'}, 
         '$shape': (21, ),
         'units': 'Pa'},
    objfilter='dataset',
    n=1
)
lay.specifications

The validation now fails:

In [None]:
res = lay.validate(vel_hdf_filename)
res.is_valid()

Let's add such a dataset:

In [None]:
with h5tbx.File(vel_hdf_filename, 'r+') as h5:
    h5.create_dataset('subgrp/pressure', shape=(21,), attrs={'units': 'Pa'})

And perform the validation again:

In [None]:
res = lay.validate(vel_hdf_filename)
res.is_valid()
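
The fail-then-pass sequence above can be reproduced with a small stand-in. In this sketch a layout is a list of `(description, predicate, expected_count)` tuples. The function `validate_layout` and the dictionary `objects` are hypothetical and greatly simplified compared to the toolbox's `Layout` class:

```python
def validate_layout(objects, specs):
    """Check each (description, predicate, expected_count) spec against `objects`.

    `expected_count=None` means any number of matches is accepted.
    Returns a list of failure messages (empty if the layout is satisfied).
    """
    failures = []
    for description, predicate, expected in specs:
        hits = [name for name, obj in objects.items() if predicate(name, obj)]
        if expected is not None and len(hits) != expected:
            failures.append(f"{description}: expected {expected}, found {len(hits)}")
    return failures


objects = {
    "/time": {"shape": (21,), "attrs": {"units": "s"}},
    "/vel": {"shape": (21,), "attrs": {"units": "m/s"}},
}
specs = [
    # "All datasets have units" is expressed as "zero datasets lack units".
    ("datasets without 'units'", lambda name, obj: "units" not in obj["attrs"], 0),
    ("dataset 'pressure' with units 'Pa' and shape (21,)",
     lambda name, obj: name.rsplit("/", 1)[-1] == "pressure"
     and obj["attrs"].get("units") == "Pa"
     and obj["shape"] == (21,), 1),
]

print(validate_layout(objects, specs))  # one failure: the pressure dataset is missing
objects["/subgrp/pressure"] = {"shape": (21,), "attrs": {"units": "Pa"}}
print(validate_layout(objects, specs))  # no failures left
```

Matching on the last path component is how this sketch expresses that the location of the dataset within the file does not matter.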

Feel free to play with the layout specifications and the HDF5 file content. Note that writing specifications requires some knowledge of the query syntax of the database being used.

---

## Repositories

Finally, we can publish our data. The toolbox implements an interface to [Zenodo](https://zenodo.org/). Using it with the sandbox (testing) environment requires an API token. Provide it via the environment variable `ZENODO_SANDBOX_API_TOKEN`:

In [None]:
# %set_env ZENODO_SANDBOX_API_TOKEN=<your token>
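
Before contacting the repository, it can help to verify that the token is actually set. The helper `get_sandbox_token` below is a hypothetical pre-check based on the standard library and is not part of the toolbox:

```python
import os


def get_sandbox_token(env=None):
    """Return the sandbox token or raise a helpful error if it is not set."""
    env = os.environ if env is None else env
    token = env.get("ZENODO_SANDBOX_API_TOKEN", "").strip()
    if not token:
        raise RuntimeError(
            "Set the environment variable ZENODO_SANDBOX_API_TOKEN before uploading."
        )
    return token


# Demonstration with an explicit mapping instead of the real environment:
print(get_sandbox_token({"ZENODO_SANDBOX_API_TOKEN": "dummy-token"}))
```

Calling `get_sandbox_token()` without arguments reads from `os.environ`, so a missing token is reported before any upload is attempted.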

In [None]:
from h5rdmtoolbox.repository import zenodo
from datetime import datetime

Create a new deposit (repo in the testing environment):

In [None]:
deposit = zenodo.ZenodoRecord(None, sandbox=True)

Prepare metadata according to the Zenodo API: 

In [None]:
meta = zenodo.metadata.Metadata(
    version="1.0.0",
    title='H5TBX Quick Overview Test',
    description=f'The file created in the quick overview script using the h5rdmtoolbox version {h5tbx.__version__}.',
    creators=[zenodo.metadata.Creator(name="Probst, Matthias",
                                      affiliation="Karlsruhe Institute of Technology, Institute for Thermal Turbomachinery",
                                      orcid="0000-0001-8729-0482")],
    upload_type='dataset',
    access_right='open',
    keywords=['h5rdmtoolbox', 'tutorial', 'repository-test'],
    publication_date=datetime.now(),
)

Push the metadata to the repository:

In [None]:
deposit.metadata = meta

**Upload the HDF5 file:**<br>
Because HDF5 files cannot be previewed in the repository web interface and may become very large, it is handy to upload a metadata file as an additional resource. This makes it possible to preview the metadata before downloading the larger HDF5 file.

This functionality is provided by the upload method `.upload_file`, which takes the optional parameter `metamapper`: a function that creates a file with metadata. The toolbox comes with a solution that maps the HDF5 metadata content into a JSON-LD file:

In [None]:
from h5rdmtoolbox import jsonld

In [None]:
deposit.upload_file(filename=vel_hdf_filename, metamapper=jsonld.hdf2jsonld, skipND=1) # skipND is needed by hdf2jsonld