# Introduction to Conventions

Adhering to **metadata standards** is essential to achieve **findability** and efficient **re-usability** of data. Typos and attribute names based on personal preferences must be avoided. Even if humans can still understand such data, automatic exploration and processing become unreliable or impossible.

The `h5RDMtoolbox` addresses this by introducing **conventions**. A convention is a selection of **standard_attributes** which clearly define how metadata must be provided and for which HDF5 objects. It is written in a YAML file and is defined by the stakeholders (see left side in the figure below). Please note that it is up to the user who defines the standard attributes to ensure that the standard reflects good scientific practice and fulfills the requirements of FAIR metadata.

## Learning from a simple practical example
To make the concept of a convention easier to understand, let's start by examining a simple example:

Scientific data should always come with its physical unit. So, as data managers, we can define our requirement like so:

    Each dataset in an HDF5 file must have the attribute "units"

<!-- The heart of standard attributes is the **validator**. They become effective when metadata is written. The Flow chart illustrates the writing and reading procedure. The example is made for writing and reading the attribute ('units').

1. Writing (`ds.units = 'm/s'`): If "units" is defined in the convention, the validator checks the value. "m/s" is a correct unit, so it will be written. Otherwise, invalid values raise an error.
3. Reading (`print(ds.units)`): The validator becomes effective upon reading attributes, that are standardized. However, then invalid value only raise warnings, in order to allow the user to still work with the file and fix the issue.

<img src="../_static/h5RDMtoolbox_standard_attribute_concept.png" alt="h5RDMtoolbox_standard_attribute_concept.png" width="800"/> -->

Consider the following code. It shows the creation of an HDF5 dataset:

In [1]:
import h5rdmtoolbox as h5tbx

# h5tbx.conventions.logger.setLevel('DEBUG')

with h5tbx.File() as h5:
    ds = h5.create_dataset('velocity', data=[1, -4, 3])
    ds.attrs['unit'] = 'meter_per_second'

non_compliant_filename = h5.hdf_filename

The creator of the dataset tried to comply with the above requirement by adding a unit attribute to the dataset. However, a spelling error occurred: the attribute is named "unit" instead of "units". Also, with a bit of experience, we can already guess that the value "meter_per_second" is not easily handled by further (Python) processing. Using "m/s" would probably be easier, because we could then consider using the package `pint`. In addition, "meter_per_second" does not look nice when printed as the unit in a plot.

Now, without yet knowing how conventions are designed and how they work, let's load a convention definition file (a [YAML](https://pyyaml.org/wiki/PyYAMLDocumentation) file). We activate it by calling `use()`:

In [2]:
cv = h5tbx.conventions.from_yaml('units-convention.yaml', overwrite=True)
h5tbx.use(cv)  # or if we know the name: h5tbx.use("tutorial-units-convention")

using("tutorial-units-convention")

It turns out that `create_dataset` now requires an additional parameter, namely "units". This was defined in the YAML file (we will take a look at the file later):

In [3]:
with h5tbx.File() as h5:
    try:
        ds = h5.create_dataset('velocity', data=[1, -4, 3])
    except h5tbx.errors.StandardAttributeError as e:
        print(e)

2023-12-10_19:02:03,377 ERROR    [core.py:855] Could not set attribute "units" with value "_SpecialDefaults.EMPTY" to dataset "velocity"


Parameter (standard attribute) "units" must be provided and cannot be None!


As a result, the code needs to be fixed.

The above is direct feedback when creating HDF5 files with the `h5RDMtoolbox` and is therefore helpful for developers or data creators.

But what if we have a file from another source and want to know whether it complies with our convention? Then we can call `.validate()` on our convention and pass the filename of the file that we want to validate. We get a list of dictionaries, which tell us which attributes are invalid and why:

In [4]:
cv.validate(non_compliant_filename)

[{'name': '/velocity', 'attr_name': 'units', 'reason': 'missing_attribute'}]
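
Conceptually, such a check walks over all datasets and reports every required attribute that is absent. The following is a minimal, library-independent sketch of that idea. The dictionary-based file model and the function name `find_missing_attributes` are hypothetical and only serve as an illustration; the sketch does not reflect how the `h5RDMtoolbox` implements `validate()`.

```python
from typing import Dict, List

# Hypothetical in-memory stand-in for an HDF5 file: dataset path -> attributes.
FileModel = Dict[str, Dict[str, str]]


def find_missing_attributes(file_model: FileModel, required: List[str]) -> List[dict]:
    """Return one report entry per dataset and required attribute that is absent."""
    report = []
    for dataset_path, attrs in file_model.items():
        for attr_name in required:
            if attr_name not in attrs:
                report.append({'name': dataset_path,
                               'attr_name': attr_name,
                               'reason': 'missing_attribute'})
    return report


# The non-compliant file from above: the attribute is called "unit", not "units".
non_compliant = {'/velocity': {'unit': 'meter_per_second'}}
print(find_missing_attributes(non_compliant, required=['units']))
```

For the misspelled attribute, the sketch yields a single entry of the same shape as the report shown above.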

## The convention file

A convention is defined in a YAML (or JSON) file and consists of three parts:

1. General information at the header indicated by double underscores:

2. Definition of standard attributes

A `standard_attribute` has various properties, such as the `description`, the `target_method` (the method during which the standard attribute can be passed), the `validator`, which validates the input, and the `default_value`. The latter can be `$EMPTY`, indicating that no default value is set and thus the attribute is obligatory. `$NONE` indicates that the attribute is optional. Finally, a valid value can be given, which is written if no input is provided for this standard attribute.

3. Special type definitions

Here, the allowed values for the standard attribute `data_type` are listed:
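
The three kinds of `default_value` described under item 2 can be summarized in a few lines of Python. This is only a sketch of the decision logic under the semantics stated above; the function `resolve_standard_attribute` is hypothetical and not part of the `h5RDMtoolbox` API.

```python
EMPTY = '$EMPTY'  # no default value: the attribute is obligatory
NONE = '$NONE'    # the attribute is optional


def resolve_standard_attribute(name, provided, default_value):
    """Return the value to write, or None if nothing is to be written."""
    if provided is not None:
        # The user passed a value explicitly, so it takes precedence.
        return provided
    if default_value == EMPTY:
        raise ValueError(
            f'Parameter (standard attribute) "{name}" must be provided and cannot be None!'
        )
    if default_value == NONE:
        # Optional attribute without input: nothing is written.
        return None
    # Any other default is a concrete value that is written as a fallback.
    return default_value


print(resolve_standard_attribute('units', 'm/s', EMPTY))
print(resolve_standard_attribute('data_type', None, 'experimental'))
```

In a real convention, the returned value would additionally be passed through the validator before it is written.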

## Reading a Convention

Let's read the above example file into the class `Convention`:

In [5]:
cv = h5tbx.conventions.from_yaml('example_convention.yaml')
cv

**Convention("h5rdmtoolbox-tuturial-convention")**
contact: https://orcid.org/0000-0001-8729-0482
  File.__init__():
    * **contact_id**:
		ID of a person to contact for questions.
    * *data_type* (default=None):
		Type of data in file. Can be numerical or experimental.
  Group.create_dataset():
    * *comment* (default=None):
		Comment to a dataset.

In order to make the convention effective in this session, it must be enabled. We do this by calling `use()`:

In [6]:
h5tbx.use(cv)

using("h5rdmtoolbox-tuturial-convention")

Now, we will get an error if we create an HDF5 file without providing the attribute `contact_id`. As it is a required attribute, it must be provided during file initialization:

In [7]:
try:
    with h5tbx.File() as h5:
        pass
except h5tbx.errors.StandardAttributeError as e:
    print(e)

Parameter (standard attribute) "contact_id" must be provided and cannot be None!


Providing a wrong value raises an error, too:

In [8]:
try:
    with h5tbx.File(contact_id='id1722') as h5:
        h5.create_dataset(name='velocity', shape=(3, 4), comment='velocity field')
except h5tbx.errors.StandardAttributeError as e:
    print(e)

2023-12-10_19:02:03,506 ERROR    [core.py:855] Could not set attribute "comment" with value "velocity field" to dataset "velocity"


Setting "velocity field" for standard attribute "comment" failed. Original error: 1 validation error for comment
value
  Value error, Invalid format for pattern [type=value_error, input_value='velocity field', input_type=str]
    For further information visit https://errors.pydantic.dev/2.5/v/value_error
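
The message "Invalid format for pattern" suggests that the value is checked against a regular expression. A minimal sketch of such a pattern-based validator is shown here. The pattern used (the value must start with a capital letter) is an assumption made for illustration and is not taken from the convention file, and `make_regex_validator` is a hypothetical helper, not a toolbox function.

```python
import re


def make_regex_validator(pattern):
    """Build a validator that accepts only values fully matching `pattern`."""
    compiled = re.compile(pattern)

    def validate(value):
        if not compiled.fullmatch(value):
            raise ValueError(f'Invalid format for pattern: {value!r}')
        return value

    return validate


# Assumed rule for illustration: a comment must start with a capital letter.
validate_comment = make_regex_validator(r'[A-Z].*')

print(validate_comment('Velocity field'))
try:
    validate_comment('velocity field')
except ValueError as e:
    print(e)
```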


If we provide a valid `contact_id` and no invalid `comment`, the file is created without errors:

In [9]:
with h5tbx.File(contact_id='id1722') as h5:
    h5.dump()

Note that if we reopen the file in read-write mode (`r+`) rather than in read-only mode (`r`), the standard attributes which already exist are not checked again. So if the HDF5 file was written with another package, e.g. `h5py`, a value might be wrong:

In [10]:
with h5tbx.File(name=h5.hdf_filename, mode='r+') as h5:
    pass  # note that we were not required to pass "contact_id", as it was already present

Note that a convention can also be enabled only **temporarily** using the context manager syntax:

In [11]:
with h5tbx.use(cv):
    with h5tbx.File(contact_id='id1722') as h5:
        pass

## Importing/Loading an online convention

Conventions are intended to be distributed via online repositories. The YAML file should hence be uploaded such that it is accessible to all users. The `h5RDMtoolbox` currently favors the usage of [Zenodo](https://zenodo.org) repositories. The advantages are long-term storage and the assignment of a DOI. However, files accessible via a URL can also be downloaded.

A tutorial convention is published [here](https://zenodo.org/record/8276817). By calling `from_zenodo()` the convention object is created:

In [12]:
cv = h5tbx.conventions.from_zenodo(doi='https://zenodo.org/records/10156750')
cv

**Convention("h5rdmtoolbox-tutorial-convention")**
contact: https://orcid.org/0000-0001-8729-0482
  File.__init__():
    * **data_type**:
		Type of data in file. Can be numerical, analytical or experimental.
    * **contact**:
		Contact or responsible person for the full file. Contact is represented by an ORCID.
    * *standard_name_table* (default=<h5rdmtoolbox.conventions.consts.DefaultValue object at 0x0000021B79E33820>):
		The standard name table of the convention.
    * *comment* (default=None):
		Comment describes the file content in more detail.
    * *references* (default=None):
		Web resources serving as references for the full file.
  Group.create_dataset():
    * *units* (default=None):
		The physical unit of the dataset. If dimensionless, the unit is ''.
    * *standard_name* (default=None):
		Standard name of the dataset. If not set, the long_name attribute must be given.
    * *long_name* (default=None):
		An comprehensive d

## Effect of enabling a convention

The convention above defines the usage of certain attributes with certain methods. For example, "data_type" is to be used when an HDF5 file is created. When the convention is enabled, the **signature of the respective methods is changed**. To prove this, let's implement a small function which prints all parameters of a given function, and inspect the effect of the convention on the `__init__` method:

In [13]:
cv.properties[h5tbx.Dataset]['standard_name']

<StandardAttribute [keyword]("standard_name"): default_value="None" | "Standard name of the dataset. If not set, the long_name attribute must be given.">

In [14]:
import inspect

def print_method_parameters(method):
    """Print the parameters of `method`; standard attributes of File.__init__ are made bold."""
    # standard attributes which the current convention adds to File.__init__
    standard_attrs = h5tbx.conventions.get_current_convention().methods[h5tbx.File].get('__init__', {})
    print(f'\nParameters for "{method.__name__}":')
    for param in inspect.signature(method).parameters.values():
        if param.name == 'self':
            continue
        if param.name in standard_attrs:
            print(f'  - {h5tbx._repr.make_bold(param.name)}')
        else:
            print(f'  - {param.name}')

print('no convention: ')
h5tbx.use(None)
print_method_parameters(h5tbx.File.__init__)

print(f'\n------------\nwith convention {cv.name}: (standard attributes are made bold)')
h5tbx.use(cv)
print_method_parameters(h5tbx.File.__init__)

no convention: 

Parameters for "__init__":
  - name
  - mode
  - layout
  - attrs
  - kwargs

------------
with convention h5rdmtoolbox-tutorial-convention: (standard attributes are made bold)

Parameters for "__init__":
  - name
  - mode
  - layout
  - attrs
  - **data_type**
  - **standard_name_table**
  - **comment**
  - **contact**
  - **references**
  - kwargs


In [15]:
h5tbx.use(None)  # fall back to the default convention

using("h5py")
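
How a method signature can be extended at runtime is sketched below with the standard library only. The sketch assumes that the additional parameters are inserted as keyword-only parameters before a trailing `**kwargs`. The helper `add_standard_attributes` and the stand-in function `create_file` are hypothetical, and the `h5RDMtoolbox` may realize this differently.

```python
import functools
import inspect


def add_standard_attributes(func, names):
    """Wrap `func` so that its signature also lists `names` as keyword-only parameters."""
    signature = inspect.signature(func)
    params = list(signature.parameters.values())
    extra = [inspect.Parameter(name, inspect.Parameter.KEYWORD_ONLY, default=None)
             for name in names]
    # Keyword-only parameters must be placed before a trailing **kwargs.
    if params and params[-1].kind is inspect.Parameter.VAR_KEYWORD:
        params = params[:-1] + extra + params[-1:]
    else:
        params = params + extra

    @functools.wraps(func)
    def wrapper(*args, **kwargs):
        # Split the standard attributes off before calling the original function.
        standard_attrs = {name: kwargs.pop(name) for name in names if name in kwargs}
        result = func(*args, **kwargs)
        result['attrs'].update(standard_attrs)
        return result

    # inspect.signature() reports this explicitly set signature for the wrapper.
    wrapper.__signature__ = signature.replace(parameters=params)
    return wrapper


def create_file(name=None, mode='r', **kwargs):
    """Hypothetical stand-in for a file constructor; returns a plain dict."""
    return {'name': name, 'mode': mode, 'attrs': {}}


patched = add_standard_attributes(create_file, ['data_type', 'contact'])
print(list(inspect.signature(create_file).parameters))
print(list(inspect.signature(patched).parameters))
```

The original function keeps its signature, while the wrapped one additionally lists `data_type` and `contact`, which mirrors the before/after comparison printed above.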