# Introduction to Conventions

This section will take you through the steps of using and creating a convention.

First, we need to import the convention sub-package and the toolbox itself:

In [1]:
from h5rdmtoolbox import convention
import h5rdmtoolbox as h5tbx

A convention is enabled at all times. Currently, it is the convention named "h5py". In fact, it does nothing beyond mimicking the behavior of the `h5py` package. This means that methods like `create_dataset` expect no parameters other than those known from the `h5py` package.

To find out which convention is currently active, call the following (we expect "h5py"):

In [2]:
cv = convention.get_current_convention()
cv

Convention("h5py")

To list all available conventions, call `get_registered_conventions()`:

In [3]:
convention.get_registered_conventions()

{'h5tbx': Convention("h5tbx"), 'h5py': Convention("h5py")}

There is also another convention, named "h5tbx".

The string representation of the `Convention` objects tells us which attributes are expected for which method. In this example, the following is defined:
- For the **root group**, indicated by `File.__init__()`: "creation_mode"
- For any dataset, indicated by `Group.create_dataset()`: "units" and "symbol". *Note that "units" is obligatory and "symbol" is optional.*

This means that we *can* provide "creation_mode" during file creation and we *must* provide the attribute "units" during dataset creation. We will test this in the next section.

In [4]:
cv = convention.get_registered_conventions()['h5tbx']
print(cv)

Convention("h5tbx")
contact: https://orcid.org/0000-0001-8729-0482
  File.__init__():
    * creation_mode:
        Creation mode of the data. Specific for engineering.
  Group.create_dataset():
    * units (obligatory):
        The physical unit of the dataset. If dimensionless, the unit is ''.
    * symbol:
        The mathematical symbol of the dataset.
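
To make the obligatory/optional distinction concrete, here is a minimal plain-Python sketch of the idea. It does not use the toolbox at all: the names `REQUIREMENTS` and `check_arguments` are hypothetical, and the table only mirrors the printout above.

```python
# Hypothetical stand-in for the printed convention: per method, which
# attribute names are obligatory and which are optional.
REQUIREMENTS = {
    "File.__init__": {"obligatory": set(), "optional": {"creation_mode"}},
    "Group.create_dataset": {"obligatory": {"units"}, "optional": {"symbol"}},
}


def check_arguments(method, **attrs):
    """Return the sorted names of obligatory attributes missing from `attrs`."""
    rules = REQUIREMENTS[method]
    return sorted(rules["obligatory"] - set(attrs))


print(check_arguments("File.__init__"))                      # []
print(check_arguments("Group.create_dataset", symbol="u"))   # ['units']
print(check_arguments("Group.create_dataset", units="m/s"))  # []
```

An empty list means the call would satisfy the convention; a non-empty list corresponds to the situation in which an error is raised.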



## Applying a convention

Without knowing how to build a convention, let's first apply one. We enable the "h5tbx" convention by calling `use()`:

In [5]:
h5tbx.use('h5tbx')

using("h5tbx")

Now the convention is enabled, and we can start creating a new HDF5 file. As expected, if we don't provide the attribute "units" as a parameter, an error is raised:

In [6]:
with h5tbx.File() as h5:
    try:
        h5.create_dataset('ds', shape=(3, ))
    except convention.errors.StandardAttributeError as e:
        print(e)

Convention "h5tbx" expects standard attribute "units" to be provided as an argument during dataset creation.


But there is more to it. Let's now pass the attribute "units" but purposely set an invalid value. Normally, an attribute accepts any value. A standard attribute, however, can have a validator attached, as is the case for "units". We will find out later how this is done.

In [7]:
with h5tbx.File() as h5:
    try:
        h5.create_dataset('ds', shape=(3, ), units='invalid')
    except convention.errors.StandardAttributeError as e:
        print(e)

The value "invalid" for standard attribute "units" could not be set. Please check the convention file wrt. the rule for this attribute. The following error message might not always explain the origin of the problem:
1 validation error for units
value
  Value error, Units cannot be understood using ureg package: invalid. Original error: 'invalid' is not defined in the unit registry [type=value_error, input_value='invalid', input_type=str]
    For further information visit https://errors.pydantic.dev/2.5/v/value_error


The error tells us that "invalid" is not understood by the validator.
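
The error message suggests that the check is delegated to a unit registry. As a rough illustration of what such a validator does, here is a toy version in plain Python. The names `KNOWN_UNITS` and `validate_units` are hypothetical, and the tiny lookup table is an assumption standing in for a real unit registry.

```python
import re

# Toy registry: an assumption for illustration, far smaller than a real one.
KNOWN_UNITS = {"m", "s", "kg", "K", "Pa", "N"}


def validate_units(value):
    """Return `value` if every unit symbol in it is known, else raise ValueError."""
    if value == "":  # dimensionless, as described in the convention printout
        return value
    # Normalise '**' to '^', then split the expression on '*' and '/'.
    for token in re.split(r"[*/]", value.replace("**", "^")):
        symbol = token.strip().split("^")[0]  # drop an exponent such as '^2'
        if symbol not in KNOWN_UNITS:
            raise ValueError(f"{symbol!r} is not defined in the toy unit registry")
    return value


print(validate_units("m/s"))       # m/s
print(validate_units("kg*m/s^2"))  # kg*m/s^2
try:
    validate_units("invalid")
except ValueError as e:
    print(e)  # 'invalid' is not defined in the toy unit registry
```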

Finally, we provide a valid SI unit:

In [8]:
with h5tbx.File() as h5:
    h5.create_dataset('ds', shape=(3, ), units='m/s')

We can also access all registered standard attributes of the convention like so:

In [9]:
cv.registered_standard_attributes

{'units': <StandardAttribute[positional/obligatory]("units"): "The physical unit of the dataset. If dimensionless, the unit is ''.">,
 'symbol': <StandardAttribute [keyword/optional]("symbol"): default_value="_SpecialDefaults.NONE" | "The mathematical symbol of the dataset.">,
 'creation_mode': <StandardAttribute [keyword/optional]("creation_mode"): default_value="_SpecialDefaults.NONE" | "Creation mode of the data. Specific for engineering.">}

## 1. Creating a new Convention (with code)

We want to create our own convention with our own standard attribute.

Create the object:

In [10]:
cv = convention.Convention(name='MyFirstConvention', contact='John Doe')
cv.register()

Enable it:

In [11]:
h5tbx.use('MyFirstConvention')

using("MyFirstConvention")

Make sure it worked:

In [12]:
convention.get_current_convention()

Convention("MyFirstConvention")

### Define a standard attribute

#### Create a validator

Standard attribute validators are built with [`pydantic`](https://docs.pydantic.dev/latest/concepts/validators/). Please refer to the [pydantic documentation](https://docs.pydantic.dev/latest/concepts/validators/) if you are unsure how to use it, or carefully follow the steps below.

Let's assume we want to require users to provide a `publication_type` as a root attribute (a parameter of the `__init__` call).<br>
First, we need to define the validator:

In [13]:
from pydantic import BaseModel
from typing_extensions import Literal

PublicationType = Literal[
    "book",
    "conferencepaper",
    "article",
    "patent",
    "report",
    "other",
]

class PublicationValidator(BaseModel):
    """Validate the publication type against the allowed literal values"""
    value: PublicationType

Let's check if it works:

In [14]:
PublicationValidator.model_validate({'value': 'book'})

PublicationValidator(value='book')

In [15]:
from pydantic import ValidationError
try:
    PublicationValidator.model_validate({'value': 'invalid'})
except ValidationError as e:
    print(e)

1 validation error for PublicationValidator
value
  Input should be 'book', 'conferencepaper', 'article', 'patent', 'report' or 'other' [type=literal_error, input_value='invalid', input_type=str]
    For further information visit https://errors.pydantic.dev/2.5/v/literal_error
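
For readers without `pydantic` at hand, the same membership check can be approximated with the standard library alone. This is only a sketch of what the `Literal` constraint expresses, not of how `pydantic` works internally; the function name `validate_publication_type` is hypothetical.

```python
from typing import Literal, get_args

PublicationType = Literal[
    "book", "conferencepaper", "article", "patent", "report", "other"
]


def validate_publication_type(value):
    """Return `value` if it is one of the allowed literals, else raise ValueError."""
    allowed = get_args(PublicationType)  # tuple of the literal values
    if value not in allowed:
        raise ValueError(f"Input should be one of {allowed}, got {value!r}")
    return value


print(validate_publication_type("book"))  # book
```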


#### Define a standard attribute

The `StandardAttribute` takes 5 inputs:
- The `name` of the parameter in the function. This will also be the attribute name!
- The `validator` class, which we just created.
- The `target_method`, i.e. the method in which the attribute should appear as a parameter.
- A `description` of the attribute.
- A `default_value`. We want to enforce the usage, so we put `$empty`. Other options are `$none` (which makes the attribute optional) or an actual value, e.g. "book", which is written to the attribute if no other value is provided.

In [16]:
std_attr = h5tbx.convention.standard_attributes.StandardAttribute(
    name='publication_type',
    validator=PublicationValidator,
    target_method='__init__',
    description='Publication type',
    default_value='$empty'
)

In [17]:
cv.add_standard_attribute(std_attr)
cv.registered_standard_attributes

{'publication_type': <StandardAttribute[positional/obligatory]("publication_type"): "Publication type.">}
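
The three kinds of default value can be summarised in a small plain-Python sketch. The helper `resolve_value` and the sentinel `MISSING` are hypothetical; only the `$empty` and `$none` markers are taken from the description above, and how the toolbox handles them internally is not shown here.

```python
MISSING = object()  # hypothetical sentinel: the caller passed nothing


def resolve_value(name, provided=MISSING, default="$empty"):
    """Decide what gets written for attribute `name`.

    Returns the value to write, or None if nothing should be written.
    """
    if provided is not MISSING:
        return provided
    if default == "$empty":  # obligatory: there is no fallback
        raise ValueError(f'standard attribute "{name}" must be provided')
    if default == "$none":  # optional: skip the attribute
        return None
    return default  # an actual default value is written instead


print(resolve_value("publication_type", "article"))       # article
print(resolve_value("publication_type", default="book"))  # book
print(resolve_value("symbol", default="$none"))           # None
```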

Let's try it (first re-enable the convention to make the changes effective):

In [18]:
h5tbx.use(None)
h5tbx.use(cv.name)

using("MyFirstConvention")

In [19]:
try:
    with h5tbx.File() as h5:
        pass
except convention.errors.StandardAttributeError as e:
    print(e)

Convention "MyFirstConvention" expects standard attribute "publication_type" to be provided as an argument during file creation.


In [20]:
with h5tbx.File(publication_type='book') as h5:
    pass

## 2. Creating a convention (from a file)

A convention can also be defined in a YAML (or JSON) file. It consists of three parts:

    1. General information (keywords must start and end with double underscores)
    2. Definition of *standard attributes*
    3. Definition of *standard attribute validators*

1. General information at the header indicated by double underscores:

2. Definition of standard attributes

A `standard_attribute` has various properties: the `description`, the `target_method` (the method during which the standard attribute can be passed), the `validator`, which validates the input, and the `default_value`. The latter can be `$EMPTY`, indicating that no default value is set and the attribute is therefore obligatory, or `$NONE`, indicating that the attribute is optional. Finally, a valid value can be given, which is written if no input is provided for this standard attribute.

3. Special type definitions

Here the allowed values for the standard attribute `data_type` are listed:

The heart of standard attributes is the **validator**. A validator becomes effective when metadata is written.
The flow chart below illustrates the writing and reading procedure for the example of the attribute "units".

1. Writing (`ds.units = 'm/s'`): If "units" is defined in the convention, the validator checks the value. "m/s" is a correct unit, so it is written. An invalid value raises an error.
2. Reading (`print(ds.units)`): The validator also becomes effective when standardized attributes are read. In that case, however, invalid values only raise warnings, so that the user can still work with the file and fix the issue.
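
The asymmetry between writing (error) and reading (warning) can be sketched in plain Python as follows. `ToyDataset` and `is_valid_unit` are hypothetical stand-ins and the validity rule is an assumption; the sketch only illustrates the behavior described in the two steps, not the toolbox's implementation.

```python
import warnings


def is_valid_unit(value):
    """Toy rule (an assumption): only these two unit strings are valid."""
    return value in {"m/s", "m"}


class ToyDataset:
    """Hypothetical stand-in for a dataset with one standardized attribute."""

    def __init__(self):
        self.raw_attrs = {}  # plays the role of the HDF5 attributes

    @property
    def units(self):
        value = self.raw_attrs["units"]
        if not is_valid_unit(value):
            # Reading stays possible so the user can inspect and fix the file.
            warnings.warn(f'invalid value "{value}" for standard attribute "units"')
        return value

    @units.setter
    def units(self, value):
        if not is_valid_unit(value):
            raise ValueError(f'invalid value "{value}" for standard attribute "units"')
        self.raw_attrs["units"] = value


ds = ToyDataset()
ds.units = "m/s"                 # valid: the value is written
ds.raw_attrs["units"] = "bogus"  # simulate a file written by another tool
with warnings.catch_warnings(record=True) as caught:
    warnings.simplefilter("always")
    value = ds.units             # invalid on read: a warning, not an error
print(value, len(caught))        # bogus 1
```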

<img src="../_static/h5RDMtoolbox_standard_attribute_concept.png" alt="h5RDMtoolbox_standard_attribute_concept.png" width="800"/>

## Reading a Convention from a file

Let's read the above example file into the class `Convention`. The object representation displays the standard attributes which are expected for the root group (`File.__init__()`), group creation (`Group.create_group()`) and dataset creation (`Group.create_dataset()`).

Note that the standard attributes marked in **bold** are obligatory. The others may or may not be provided during object creation:

In [21]:
h5tbx.convention.utils.yaml2json('example_convention.yaml')

WindowsPath('example_convention.json')

In [22]:
cv = h5tbx.convention.from_yaml('example_convention.yaml')
# you may also use .from_json('example_convention.json')
cv

Convention("h5rdmtoolbox-tuturial-convention")

In order to make the convention effective in this session, it must be enabled. We do this by calling `use()`:

In [23]:
h5tbx.use(cv)

using("h5rdmtoolbox-tuturial-convention")

Now, we will get an error if we create an HDF5 file without providing the attribute `contact_id`. As we made it a required attribute, it must be provided during file initialization:

In [24]:
try:
    with h5tbx.File() as h5:
        pass
except h5tbx.errors.StandardAttributeError as e:
    print(e)

Convention "h5rdmtoolbox-tuturial-convention" expects standard attribute "contact_id" to be provided as an argument during file creation.


Providing a wrong value raises an error, too:

In [25]:
try:
    with h5tbx.File(contact_id='id1722') as h5:
        h5.create_dataset(name='velocity', shape=(3, 4), units='m/s', comment='velocity field')
except h5tbx.errors.StandardAttributeError as e:
    print(e)

The value "velocity field" for standard attribute "comment" could not be set. Please check the convention file wrt. the rule for this attribute. The following error message might not always explain the origin of the problem:
1 validation error for comment
value
  Value error, Invalid format for pattern [type=value_error, input_value='velocity field', input_type=str]
    For further information visit https://errors.pydantic.dev/2.5/v/value_error


With a valid `contact_id` and no invalid attribute, the file is created without errors:

In [26]:
with h5tbx.File(contact_id='id1722') as h5:
    h5.dump()

Note that if we reopen the file not in read-only (`r`) but in read-write mode, the standard attributes that already exist are not checked again. So if the HDF5 file was written with another package, e.g. `h5py`, a stored value might be invalid:

In [27]:
with h5tbx.File(name=h5.hdf_filename, mode='r+') as h5:
    pass  # note that we did not have to pass "contact_id", as it was already present

Note that a convention can also be enabled only **temporarily** using the context manager syntax:

In [28]:
with h5tbx.use(cv):
    with h5tbx.File(contact_id='id1722') as h5:
        pass

## Importing/Loading an online convention

Conventions are intended to be distributed via online repositories. The YAML file should therefore be uploaded such that it is accessible to all users. The `h5RDMtoolbox` currently favors the use of [Zenodo](https://zenodo.org) repositories. The advantages are long-term storage and the assignment of a DOI. However, files accessible via a URL can also be downloaded.

A tutorial convention is published [here](https://zenodo.org/record/8276817). By calling `from_zenodo()` the convention object is created:

In [29]:
cv = h5tbx.convention.from_zenodo(doi_or_recid='10156750')
cv

Convention("h5rdmtoolbox-tutorial-convention")

## Effect of enabling a convention

The convention above defines the usage of certain attributes with certain methods. For example, "data_type" is to be used when an HDF5 file is created. When the convention is enabled, the **signature of the respective methods is changed**. To prove this, let's implement a small function that prints all parameters of a given function, and inspect the effect of the convention on the `__init__` method:

In [30]:
cv.properties[h5tbx.Dataset]['standard_name']

<StandardAttribute [keyword/optional]("standard_name"): default_value="None" | "Standard name of the dataset. If not set, the long_name attribute must be given.">

In [31]:
import inspect

def print_method_parameters(method):
    """Print the parameters of `method`; standard attributes are made bold."""
    print(f'\nParameters for "{method.__name__}":')
    # standard attributes the current convention defines for File.__init__
    std_attrs = h5tbx.convention.get_current_convention().methods[h5tbx.File].get('__init__', {})
    for param in inspect.signature(method).parameters.values():
        if param.name == 'self':
            continue
        if param.name in std_attrs:
            print(f'  - {h5tbx._repr.make_bold(param.name)}')
        else:
            print(f'  - {param.name}')

methods = (h5tbx.File.__init__, h5tbx.Group.create_group, h5tbx.Group.create_dataset)

print('no convention: ')
h5tbx.use(None)
print_method_parameters(h5tbx.File.__init__)

print(f'\n------------\nwith convention {cv.name}: (standard attributes are made bold)')
h5tbx.use(cv)
print_method_parameters(h5tbx.File.__init__)

no convention: 

Parameters for "__init__":
  - name
  - mode
  - attrs
  - kwargs

------------
with convention h5rdmtoolbox-tutorial-convention: (standard attributes are made bold)

Parameters for "__init__":
  - name
  - mode
  - attrs
  - **data_type**
  - **standard_name_table**
  - **comment**
  - **contact**
  - **references**
  - kwargs
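
One way such a signature change can be achieved in plain Python is by attaching a modified `__signature__` to a function, which `inspect.signature` then reports. The sketch below is an assumption about the general technique, not the toolbox's actual mechanism; `create_file` and `add_keyword_parameters` are hypothetical names.

```python
import inspect


def create_file(name=None, mode="r", attrs=None, **kwargs):
    """Hypothetical stand-in for a file constructor."""


def add_keyword_parameters(func, names):
    """Give `func` a `__signature__` that lists `names` before **kwargs."""
    sig = inspect.signature(func)
    params = list(sig.parameters.values())
    # Keep **kwargs last, as in the printout above.
    var_kw = [p for p in params if p.kind is inspect.Parameter.VAR_KEYWORD]
    others = [p for p in params if p.kind is not inspect.Parameter.VAR_KEYWORD]
    extra = [
        inspect.Parameter(n, inspect.Parameter.KEYWORD_ONLY, default=None)
        for n in names
    ]
    func.__signature__ = sig.replace(parameters=others + extra + var_kw)


add_keyword_parameters(create_file, ["data_type", "contact"])
print(list(inspect.signature(create_file).parameters))
# ['name', 'mode', 'attrs', 'data_type', 'contact', 'kwargs']
```

Note that this only changes the reported signature. In the sketch, the added names still reach the function through `**kwargs`.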


In [32]:
h5tbx.use(None)  # fall back to the default convention

using("h5py")