# Standard Attributes

Data alone is meaningless. Only when it is associated with auxiliary data (metadata) does it become interpretable and (re-)usable for others. In HDF5 files this is realized by using **attributes**, which are assigned to groups and datasets. HDF5 attributes are like dictionaries: you provide a name and a value. However, which names and values you use is generally up to you.

The `h5RDMtoolbox` lets you specify rules for specific attributes. These attributes are simply called **standard attributes**. They are part of a convention, which can be defined by anyone. A convention needs to be created or imported and then enabled in order to become effective.

<!-- If an attribute is addressed by the user, e.g. the attribute `units`, and a standard attribute implementation exists for this name, then the value is processed by the respective rule and the attribute is set, or an error is raised in case of an invalid input.

Standard attributes can be made required **during dataset creation**, for instance. This forces users to pass certain meta information and validates it at the same time. Consequently, data becomes re-usable and explorable.

Additionally, so-called [layouts](./layouts.ipynb) can be defined, too. They are used to specify the content of an HDF5 file after it has been written. This concept applies best during file exchange, as the layout validates whether a file is complete and meets the expectation of the project or collaborative user. -->

In [1]:
import h5rdmtoolbox as h5tbx

## Defining a new standard attribute

Let's define two standard attributes: one that can be assigned to any HDF5 object (group or dataset) and one that applies only to the *root group*.

The first one shall be "comment". It can be assigned to any group or dataset and shall be optional. The user input is verified by a regular expression match. In this example, the requirement is that the passed string starts with a capital letter:

In [2]:
comment = h5tbx.conventions.StandardAttribute(
    name='comment',
    validator={'$regex': r'^[A-Z].*$'},
    method={'__init__': {'optional': True},
           'create_dataset': {'optional': True},
           'create_group': {'optional': True}},
    description='Additional information about the file'
)
comment

<StdAttr("comment"): "Additional information about the file">
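To get a feeling for which inputs this regular expression accepts, the pattern can be tried out with Python's built-in `re` module. This is only a sketch of the pattern itself. How the toolbox applies the `$regex` validator internally is not shown here, and the helper name `starts_with_capital` is made up for illustration.

```python
import re

# Same pattern as passed to the "$regex" validator above
PATTERN = re.compile(r'^[A-Z].*$')


def starts_with_capital(value: str) -> bool:
    """Hypothetical helper: True if `value` matches the pattern."""
    return PATTERN.match(value) is not None


for candidate in ('Measured at room temperature', 'measured at night', '123'):
    print(f'{candidate!r}: {starts_with_capital(candidate)}')
```

Only the first candidate starts with a capital letter, so it is the only one that would pass a check based on this pattern.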

The second standardized attribute is called "contact". The attribute is mandatory for the root group and must be one or multiple researcher IDs (ORCID IDs). To check whether an ORCID ID is valid, the built-in `Validator` "$orcid" is used:

In [3]:
contact = h5tbx.conventions.StandardAttribute(
    name='contact',
    validator='$orcid',
    method={'__init__':
            {'optional': False}
           },
    description='One or multiple ORCID IDs representing responsible persons to be contacted upon questions about the file content.'
)
contact

<StdAttr("contact"): "One or multiple ORCID IDs representing responsible persons to be contacted upon questions about the file content.">
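For orientation, the structure of an ORCID ID can be checked offline: the last character is a check digit computed with the ISO 7064 MOD 11-2 algorithm. The following is a minimal sketch of such a check, assuming the ID is given either bare or with the `https://orcid.org/` prefix. The function name `orcid_checksum_ok` is hypothetical, and the toolbox's `$orcid` validator may work differently (for example, it might also look up whether the ID exists).

```python
import re

# Optional URL prefix followed by four blocks of four characters
ORCID_PATTERN = re.compile(
    r'^(?:https://orcid\.org/)?([0-9]{4}-[0-9]{4}-[0-9]{4}-[0-9]{3}[0-9X])$'
)


def orcid_checksum_ok(orcid: str) -> bool:
    """Hypothetical helper: check the format and the MOD 11-2 check digit."""
    match = ORCID_PATTERN.match(orcid)
    if match is None:
        return False
    digits = match.group(1).replace('-', '')
    total = 0
    for char in digits[:-1]:  # the first 15 digits
        total = (total + int(char)) * 2
    result = (12 - total % 11) % 11
    expected = 'X' if result == 10 else str(result)
    return digits[-1] == expected


print(orcid_checksum_ok('https://orcid.org/0000-0001-8729-0482'))
print(orcid_checksum_ok('0000-0001-8729-0483'))
```

A check like this only catches typing errors. It cannot tell whether the ID is actually registered to a person.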

## Create and enable a new convention

Based on the two standard attributes above, we can now implement a new convention:

In [4]:
my_convention = h5tbx.conventions.Convention('my_convention')
my_convention.add(comment)
my_convention.add(contact)

my_convention.register()

h5tbx.use('my_convention')

The new standard attributes have been designed to be available during file creation and, in the case of "comment", also during group and dataset creation, i.e. when the methods `__init__`, `create_group` or `create_dataset` are called.

By enabling the convention (calling `use(...)`), the signatures of these methods change. Let's convince ourselves by inspecting the parameters of the methods:

In [5]:
import inspect

methods = (h5tbx.File.__init__, h5tbx.Group.create_group, h5tbx.Group.create_dataset)

for method in methods:
    print(f'\nParameters for "{method.__name__}":')
    for param in inspect.signature(method).parameters.values():
        if param.name != 'self':
            if param.name in ('contact', 'comment'):
                print(f'  - {h5tbx._repr.make_bold(param.name)}')
            else:
                print(f'  - {param.name}')


Parameters for "__init__":
  - name
  - mode
  - layout
  - attrs
  - **comment**
  - **contact**
  - kwargs

Parameters for "create_group":
  - name
  - overwrite
  - attrs
  - update_attrs
  - track_order
  - kwargs

Parameters for "create_dataset":
  - name
  - shape
  - dtype
  - data
  - offset
  - scale
  - overwrite
  - chunks
  - make_scale
  - attach_scales
  - ancillary_datasets
  - attrs
  - kwargs
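How a signature can change at runtime may be easier to see in plain Python. The sketch below wraps a function and attaches a new `inspect.Signature` that advertises extra keyword-only parameters. It is an illustration of the general technique only. The names `add_keyword_parameters` and `create_group_demo` are hypothetical, and the toolbox may implement this differently.

```python
import functools
import inspect


def add_keyword_parameters(func, names):
    """Hypothetical helper: wrap `func` so its signature lists extra keyword-only parameters."""
    signature = inspect.signature(func)
    params = list(signature.parameters.values())
    extra = [inspect.Parameter(name, inspect.Parameter.KEYWORD_ONLY, default=None)
             for name in names]
    # Keyword-only parameters must be placed before a trailing **kwargs
    if params and params[-1].kind is inspect.Parameter.VAR_KEYWORD:
        params = params[:-1] + extra + params[-1:]
    else:
        params = params + extra

    @functools.wraps(func)
    def wrapper(*args, **kwargs):
        # Split off the extra parameters before calling the original function
        extras = {name: kwargs.pop(name) for name in names if name in kwargs}
        return func(*args, **kwargs), extras

    # inspect.signature() reports this attribute instead of the wrapped function's signature
    wrapper.__signature__ = signature.replace(parameters=params)
    return wrapper


def create_group_demo(name, overwrite=None, **kwargs):
    """Stand-in for a creation method; it only returns the name."""
    return name


wrapped = add_keyword_parameters(create_group_demo, ['comment'])
print(list(inspect.signature(wrapped).parameters))
print(wrapped('grp', comment='Hello'))
```

The wrapper returns the original result together with the collected extra parameters, which keeps the example short. A real implementation would write them as attributes instead.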


A wrong or missing input will raise an error:

In [6]:
try:
    with h5tbx.File(comment='123') as h5:
        h5.dump()
except Exception as e:
    print(e)

The standard attribute "contact" is required but not provided.


In [7]:
try:
    with h5tbx.File(contact='https://orcid.org/0000-0001-8729-0482',
                    comment='123') as h5:
        h5.dump()
except Exception as e:
    print(e)

The attribute "comment" is standardized. It seems, that the input "123" is not valid. Here is the description of the standard attribute, which may help to find the issue: "Additional information about the file"


This is correct:

In [8]:
with h5tbx.File(contact='https://orcid.org/0000-0001-8729-0482') as h5:
    h5.dump()
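The three cases above (missing required attribute, invalid value, valid input) can be mimicked with a few lines of plain Python. This is a sketch under the assumption that each rule consists of an `optional` flag and an optional regular expression. The function `check_standard_attributes` and the rule layout are hypothetical and do not reflect the toolbox's internals.

```python
import re


def check_standard_attributes(rules, provided):
    """Hypothetical helper: raise ValueError on a missing or invalid attribute.

    rules: name -> {'optional': bool, 'pattern': regex string or None}
    provided: name -> value passed by the user
    """
    for name, rule in rules.items():
        if name not in provided:
            if not rule['optional']:
                raise ValueError(f'The standard attribute "{name}" is required but not provided.')
            continue
        pattern = rule.get('pattern')
        if pattern is not None and re.match(pattern, provided[name]) is None:
            raise ValueError(f'The input "{provided[name]}" is not valid for "{name}".')


rules = {
    'contact': {'optional': False, 'pattern': None},  # required, format not checked here
    'comment': {'optional': True, 'pattern': r'^[A-Z].*$'},
}

orcid = 'https://orcid.org/0000-0001-8729-0482'
for kwargs in ({'comment': '123'},
               {'contact': orcid, 'comment': '123'},
               {'contact': orcid}):
    try:
        check_standard_attributes(rules, kwargs)
        print('ok')
    except ValueError as e:
        print(e)
```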

## Import a convention

Conventions are defined for a project. Standard attributes can be defined in a single or multiple YAML files. Those files can be loaded from local storage or from a remote web resource. We first have a look at loading a local definition of standard names.

### Load a local convention

In [9]:
local_cv = h5tbx.conventions.Convention.from_yaml('tbx_convention_standard_names.yaml')
local_cv.register()
local_cv



**Convention("tbx_convention_standard_names")**
  File.__init__():
    * **contact**
    * **title**
    * **references**
    * **comment**
    * *institution* (optional)
  Group.create_dataset():
    * **scale**
    * **units**
    * **long_name**
    * *offset* (optional)

In [10]:
h5tbx.use(local_cv)

### Load a remote convention

Loading a remote convention is generally done only once, or a few times if the convention gets revised. Such a convention therefore needs a version or, even better, a persistent identifier like a DOI.

The toolbox suggests using Zenodo as a repository. The following shows how a convention that was uploaded to Zenodo can be integrated into the user's workflow.

The example convention is registered under the DOI 123123 on Zenodo. It contains multiple \*.yaml-files.

In [11]:
cv = h5tbx.conventions.from_zenodo(doi=123123)
h5tbx.use(cv)  # enable the downloaded convention

AttributeError: module 'h5rdmtoolbox.conventions' has no attribute 'from_zenodo'

## List of available conventions

Conventions, i.e. the sets of standard attributes for the respective HDF5 objects, can be registered. The registered conventions can be obtained from the dictionary `conventions.registered_conventions`:

In [None]:
h5tbx.conventions.registered_conventions.keys()
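The difference between registering and enabling can be pictured with a small registry pattern: registering stores a convention under its name, while enabling selects one of the stored conventions as the active one. The sketch below is a plain-Python illustration. `DemoConvention`, `registered` and `use` are hypothetical stand-ins and not the toolbox's implementation.

```python
registered = {}   # name -> convention, analogous to a registry dictionary
_current = None   # the convention that is currently enabled


class DemoConvention:
    """Hypothetical stand-in for a convention: only a name and a register hook."""

    def __init__(self, name):
        self.name = name

    def register(self):
        # Make the convention findable by its name
        registered[self.name] = self


def use(name_or_convention):
    """Enable a registered convention, given by name or object, and return its name."""
    global _current
    name = getattr(name_or_convention, 'name', name_or_convention)
    if name not in registered:
        raise KeyError(f'Convention "{name}" is not registered')
    _current = registered[name]
    return name


DemoConvention('my_convention').register()
print(list(registered.keys()))
print(use('my_convention'))
```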

We have now regulated what happens when this special (standard) attribute is written (`set`) and read (`get`).

## Add to a convention
Next, we need to add this attribute to a convention and assign it to the `Group` class and the method `create_dataset` in order to make "source" available to the user and enforce its usage.

Let's initialize a new convention and register it (make it available in the package):

In [None]:
cv = h5tbx.conventions.Convention('my_convention')
cv

The output shows which attributes are associated with the objects `File`, `Group` and `Dataset` and the methods `__init__`, `create_group` and `create_dataset`. What exactly this means will become clear shortly. Let's add `SourceAttribute` to the method `create_dataset` of the class `Group`:

In [None]:
cv['create_dataset'].add(SourceAttribute,
                         add_to_method=True,
                         optional=True,
                         position={'after': 'data'})

The `SourceAttribute` is now added to the class `Group`:

In [None]:
cv

For now, it is only registered as a property. This means the user is still responsible for setting the "source".

## Register and enable
We need to register the convention `cv` and enable it (and thus enable the "source" attribute):

In [None]:
cv.register()
h5tbx.use('my_convention')
h5tbx.get_current_convention()

## Example
Let's create a dataset and get the source. As we do not pass the argument `source` (we set it to optional) and we do not set it via the attribute manager, we expect a warning:

In [None]:
with h5tbx.File() as h5:
    ds = h5.create_dataset('data', (4, 5))
    print(ds.source)

We may pass "source" directly as an argument or via "attrs". Both ways check whether the source is "numerical" or "experimental", because the `set()` method is called in both cases:

In [None]:
with h5tbx.File() as h5:
    ds1 = h5.create_dataset('data1', (4, 5), attrs={'source': 'numerical'})
    ds2 = h5.create_dataset('data2', (4, 5), source='experimental')
    # two examples that fail:
    try:
        h5.create_dataset('data3', (4, 5), attrs={'source': 'model-based'})
    except ValueError as e:
        print(e)
    try:
        h5.create_dataset('data4', (4, 5), source='model-based')
    except ValueError as e:
        print(e)
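The definition of `SourceAttribute` is not shown in this section. As a rough sketch of the behavior described above, the following plain-Python functions route both input paths (`attrs` and a keyword argument) through the same check. The names `validate_source` and `collect_attributes` are hypothetical, and the sketch assumes exact string matching. Whether the real attribute normalizes the case of the input is not stated here.

```python
ALLOWED_SOURCES = ('numerical', 'experimental')


def validate_source(value):
    """Hypothetical check: return `value` if it is allowed, else raise ValueError."""
    if value not in ALLOWED_SOURCES:
        raise ValueError(f'Invalid source "{value}". Expected one of {ALLOWED_SOURCES}.')
    return value


def collect_attributes(attrs=None, **std_attrs):
    """Merge `attrs` and keyword arguments, validating "source" in either case."""
    merged = dict(attrs or {})
    merged.update(std_attrs)
    if 'source' in merged:
        merged['source'] = validate_source(merged['source'])
    return merged


print(collect_attributes(attrs={'source': 'numerical'}))
print(collect_attributes(source='experimental'))
try:
    collect_attributes(source='model-based')
except ValueError as e:
    print(e)
```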

Until now, the source attribute was **optional**. We want to enforce its use, so let's change this property of the standard attribute:

In [None]:
cv.make_required('create_dataset', 'source')

In [None]:
cv

In [None]:
with h5tbx.File() as h5:
    try:
        ds = h5.create_dataset('data', (4, 5))
    except h5tbx.conventions.StandardAttributeError as e:
        print(e)

In [None]:
with h5tbx.File() as h5:
    ds = h5.create_dataset('data', (4, 5), source='Experimental')
    ds.dump()