# Introduction to Conventions

Adhering to **metadata standards** is essential to achieving **findability** and efficient **re-usability** of data. Typos and attribute names based on personal preference must be avoided. Humans may still be able to understand such data, but automatic exploration and processing become impossible.

The `h5RDMtoolbox` addresses this by introducing **conventions**. A convention is a selection of **standard attributes** that clearly define how metadata must be provided and for which HDF5 objects. It is written in a YAML file and defined by the stakeholders (see the left side of the figure below).

The heart of a standard attribute is its **validator**. Validators take effect when metadata is written or read. The flow chart illustrates the writing and reading procedure, using the attribute `units` as an example.

1. Writing (`ds.units = 'm/s'`): If `units` is defined in the convention, the validator checks the value. "m/s" is a valid unit, so it is written. Invalid values raise an error.
2. Reading (`print(ds.units)`): The validator also takes effect when standardized attributes are read. In this case, however, invalid values only raise warnings, so that the user can still work with the file and fix the issue.
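
The asymmetry between writing (error) and reading (warning) can be illustrated with a minimal, library-independent sketch. Everything in it is an assumption made for illustration: `ToyDataset`, `is_valid_unit` and the fixed set of accepted units are hypothetical and do not reflect how the `h5RDMtoolbox` implements its validators.

```python
import warnings

# Hypothetical validator: accepts only a small, fixed set of unit strings.
VALID_UNITS = {"m", "s", "m/s", "kg", "K"}


def is_valid_unit(value):
    """Return True if `value` is one of the accepted unit strings."""
    return value in VALID_UNITS


class ToyDataset:
    """Stand-in for an HDF5 dataset; attributes live in a plain dict."""

    def __init__(self):
        self.attrs = {}

    def write_units(self, value):
        # Writing: an invalid value is rejected with an error.
        if not is_valid_unit(value):
            raise ValueError(f'Invalid value for "units": {value!r}')
        self.attrs["units"] = value

    def read_units(self):
        # Reading: an invalid value only triggers a warning,
        # and the stored value is still returned.
        value = self.attrs["units"]
        if not is_valid_unit(value):
            warnings.warn(f'Attribute "units" has invalid value {value!r}')
        return value


ds = ToyDataset()
ds.write_units("m/s")   # valid, so it is stored
print(ds.read_units())

# Simulate a file whose attribute was written by another tool without validation.
ds.attrs["units"] = "meter per sec"
with warnings.catch_warnings(record=True) as caught:
    warnings.simplefilter("always")
    print(ds.read_units())  # still readable, but a warning is recorded
    print(len(caught))
```

Raising on write keeps new files clean, while warning on read keeps existing files usable so that the faulty value can be corrected.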

<img src="../_static/h5RDMtoolbox_standard_attribute_concept.png" alt="h5RDMtoolbox_standard_attribute_concept.png" width="800"/>

In [1]:
import h5rdmtoolbox as h5tbx

## The convention YAML file

A convention is defined in a YAML file and consists of three parts:

1. General information at the header indicated by double underscores:

2. Definition of standard attributes

A `standard_attribute` has various properties, such as the `description`, the `target_method` (the method during which the standard attribute can be passed), the `validator`, which validates the input, and the `default_value`. The latter can be `$EMPTY`, indicating that no default value is set and the attribute is therefore obligatory, or `$NONE`, indicating that the attribute is optional. Finally, a valid value can be given, which is written if no input is provided for this standard attribute.
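
The three cases for `default_value` can be summarized in a short sketch. The function `resolve_value` and its arguments are hypothetical and written only to mirror the rules stated above; the toolbox's own resolution logic may differ.

```python
EMPTY = "$EMPTY"  # no default is set: the attribute is obligatory
NONE = "$NONE"    # the attribute is optional


def resolve_value(name, default_value, provided=None):
    """Return the value to write for a standard attribute, or None if nothing is written.

    Hypothetical helper that mirrors the default-value rules described in the text.
    """
    if provided is not None:
        return provided           # user input always takes precedence
    if default_value == EMPTY:
        raise ValueError(f'Standard attribute "{name}" must be provided')
    if default_value == NONE:
        return None               # optional attribute: nothing is written
    return default_value          # a concrete default is written instead


print(resolve_value("comment", NONE))                   # optional, not given
print(resolve_value("data_type", "experimental"))       # concrete default is used
print(resolve_value("contact_id", EMPTY, "id1722"))     # obligatory and given
```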

3. Special type definitions

Here, the allowed values for the standard attribute `data_type` are listed:

## Reading a Convention

Let's read the above example file into the class `Convention`:

In [2]:
cv = h5tbx.conventions.from_yaml('example_convention.yaml')
cv

Convention("h5rdmtoolbox-tuturial-convention")
contact: https://orcid.org/0000-0001-8729-0482
  File.__init__():
    * contact_id:
		ID of a person to contact for questions.
    * data_type (default=None):
		Type of data in file. Can be numerical or experimental.
  Group.create_dataset():
    * comment (default=None):
		Comment to a dataset.

In order to make the convention effective in this session, it must be enabled. We do this by calling `use()`:

In [3]:
h5tbx.use(cv)

using("h5rdmtoolbox-tuturial-convention")

Now, we will get an error if we create an HDF5 file without providing the attribute `contact_id`. Because it is a required attribute, it must be provided during file initialization:

In [4]:
try:
    with h5tbx.File() as h5:
        pass
except h5tbx.errors.StandardAttributeError as e:
    print(e)

Parameter (standard attribute) "contact_id" must be provided and cannot be None!


Providing an invalid value raises an error, too. Here, the value passed for `comment` does not match the pattern required by its validator:

In [5]:
try:
    with h5tbx.File(contact_id='id1722') as h5:
        h5.create_dataset(name='velocity', shape=(3, 4), comment='velocity field')
except h5tbx.errors.StandardAttributeError as e:
    print(e)

2023-11-20_06:57:21,847 ERROR    [core.py:824] Could not set attribute "comment" with value "velocity field" to dataset "velocity"


Setting "velocity field" for standard attribute "comment" failed. Original error: 1 validation error for comment
value
  Value error, Invalid format for pattern [type=value_error, input_value='velocity field', input_type=str]
    For further information visit https://errors.pydantic.dev/2.3/v/value_error


If `contact_id` is provided and no invalid attribute values are passed, the file is created without errors:

In [6]:
with h5tbx.File(contact_id='id1722') as h5:
    h5.dump()

Note that if we reopen the file in read-write mode (`r+`) rather than read-only mode (`r`), standard attributes that already exist are not checked again. So if the HDF5 file was written with another package, e.g. `h5py`, a stored value might be invalid:

In [7]:
with h5tbx.File(name=h5.hdf_filename, mode='r+') as h5:
    pass  # note that we were not required to pass "contact_id", as it was already present!

Note that a convention can also be enabled only **temporarily** using the context manager syntax:

In [8]:
with h5tbx.use(cv):
    with h5tbx.File(contact_id='id1722') as h5:
        pass

## Importing/Loading an online convention

Conventions are intended to be distributed via online repositories. The YAML file should therefore be uploaded so that it is accessible to all users. The `h5RDMtoolbox` currently favors [Zenodo](https://zenodo.org) repositories, which offer long-term storage and assign a DOI. However, files accessible via a URL can also be downloaded.

A tutorial convention is published [here](https://zenodo.org/record/8276817). By calling `from_zenodo()` the convention object is created:

In [9]:
cv = h5tbx.conventions.from_zenodo(doi='https://zenodo.org/record/8357399')
cv


## Effect of enabling a convention

The convention above defines which attributes are used with which methods. For example, `data_type` is to be used when an HDF5 file is created. When the convention is enabled, the **signature of the respective methods is changed**. To prove this, let's implement a small function that prints all parameters of a given method, and inspect the effect of the convention on the `__init__` method:

In [None]:
cv.properties[h5tbx.Dataset]['standard_name']

In [None]:
import inspect

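def print_method_parameters(method):
    """Print the parameters of `method`; standard attributes of File.__init__ are printed in bold."""
    # standard attributes registered for File.__init__ in the currently enabled convention
    std_attrs = h5tbx.conventions.get_current_convention().methods[h5tbx.File].get('__init__', {})
    print(f'\nParameters for "{method.__name__}":')
    for param in inspect.signature(method).parameters.values():
        if param.name == 'self':
            continue
        if param.name in std_attrs:
            print(f'  - {h5tbx._repr.make_bold(param.name)}')
        else:
            print(f'  - {param.name}')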

methods = (h5tbx.File.__init__, h5tbx.Group.create_group, h5tbx.Group.create_dataset)

print('no convention: ')
h5tbx.use(None)
print_method_parameters(h5tbx.File.__init__)

print(f'\n------------\nwith convention {cv.name}: (standard attributes are made bold)')
h5tbx.use(cv)
print_method_parameters(h5tbx.File.__init__)

In [None]:
h5tbx.use(None)  # fall back to the default convention
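
To see how a method signature can change at runtime in general, consider the following self-contained sketch. It is not the toolbox's implementation: `add_keyword_parameters` and `create_file` are hypothetical names, and the approach shown (wrapping a function and assigning a new `__signature__`) is just one way to obtain the effect observed above.

```python
import functools
import inspect


def add_keyword_parameters(func, names):
    """Wrap `func` so that its signature also lists `names` as keyword-only parameters.

    Hypothetical helper: the extra keywords are stored in the "attrs" dict of the result.
    """
    sig = inspect.signature(func)
    extra = [
        inspect.Parameter(n, inspect.Parameter.KEYWORD_ONLY, default=None)
        for n in names
    ]
    params = list(sig.parameters.values())
    # Keyword-only parameters must precede a trailing **kwargs parameter, if there is one.
    if params and params[-1].kind is inspect.Parameter.VAR_KEYWORD:
        params = params[:-1] + extra + params[-1:]
    else:
        params = params + extra

    @functools.wraps(func)
    def wrapper(*args, **kwargs):
        # Split off the added keywords before calling the original function.
        extras = {n: kwargs.pop(n) for n in names if n in kwargs}
        result = func(*args, **kwargs)
        result["attrs"].update(extras)
        return result

    # inspect.signature() reports this attribute instead of the wrapped function's signature.
    wrapper.__signature__ = sig.replace(parameters=params)
    return wrapper


def create_file(name, mode="r"):
    """Toy stand-in for a file constructor."""
    return {"name": name, "mode": mode, "attrs": {}}


print(list(inspect.signature(create_file).parameters))

patched = add_keyword_parameters(create_file, ["contact_id", "data_type"])
print(list(inspect.signature(patched).parameters))
print(patched("test.hdf", contact_id="id1722"))
```

Because the added parameters show up in the signature, tools such as `help()` and IDE auto-completion can display them like any other parameter.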