# Introduction to Standard Attributes

Data alone is meaningless. Only when it is associated with auxiliary data (metadata) does it become interpretable and (re-)usable for other users or machines. In HDF5 files, this is realized by **attributes**, which are assigned to groups or datasets. For a given application, one would like to require specific attributes and restrict them to specific values. HDF5 itself provides no mechanism to enforce such rules.

The `h5RDmtoolbox` lets you specify rules for those "special attributes". We will call them `standard attributes` and a collection of them a `convention` (more on this [here](conventions.ipynb)).

<!-- If an attribute is addressed by the user, e.g. the attribute `units`, and a standard attribute implementation exists for this name, then the value is processed by the respective rule and the attribute is set or an error is raised in case of a invalid input.

Standard attributes can be made required **during dataset creation,** for instance. This enforces users to pass certain meta information and validates it at the same time. Consequently, data becomes re-usable and explorable.

Additionally, so-called [layouts](./layouts.ipynb) can be defined, too. They are used to specify the content of an HDF5 file after it has been written. This concept applies best during file exchange, as the layout validates if a file is complete and meets the expectation of the project or collaborative user. -->

## Concept
The figure below illustrates the general concept. Standard attributes are defined by the user and added to a convention. A registered convention is activated by calling `.use(<name of convention>)`. By doing so, the signatures of the methods `create_dataset`, `create_group` and `__init__` are modified according to the standard attributes of the convention. The docstrings are updated as well, as we will see later.


<img src=concept_of_std_attrs.png width=800px>
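To make the idea of a "modified signature" more tangible, here is a minimal sketch that uses only the Python standard library. It is not the toolbox's implementation. The helper `add_keyword` and the stand-in function `create_dataset` are hypothetical names chosen for illustration. The sketch wraps a function so that it accepts one extra keyword and reports that keyword through `inspect.signature`.

```python
import functools
import inspect


def add_keyword(func, name, default=None):
    """Wrap `func` so that it accepts and advertises an extra keyword `name`.

    Toy model only: `func` is expected to return a dict with an 'attrs' entry.
    """
    sig = inspect.signature(func)
    params = list(sig.parameters.values())
    extra = inspect.Parameter(name, inspect.Parameter.KEYWORD_ONLY, default=default)
    # A trailing **kwargs must stay last in a valid signature.
    if params and params[-1].kind is inspect.Parameter.VAR_KEYWORD:
        params.insert(len(params) - 1, extra)
    else:
        params.append(extra)

    @functools.wraps(func)
    def wrapper(*args, **kwargs):
        value = kwargs.pop(name, default)  # take the extra keyword out ...
        result = func(*args, **kwargs)     # ... call the original function ...
        result['attrs'][name] = value      # ... and store the value as an attribute
        return result

    # inspect.signature() reports this signature for the wrapper
    wrapper.__signature__ = sig.replace(parameters=params)
    return wrapper


def create_dataset(name, data=None, **kwargs):
    """Hypothetical stand-in for a dataset-creating method."""
    return {'name': name, 'data': data, 'attrs': {}}


patched = add_keyword(create_dataset, 'units')
print(list(inspect.signature(patched).parameters))
print(patched('velocity', data=1.3, units='m/s'))
```

The wrapped function keeps its original name and docstring through `functools.wraps`, while the explicit `__signature__` makes the extra keyword visible to introspection tools.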


Let's see how this is done in practice:

In [1]:
import h5rdmtoolbox as h5tbx

## Standard attributes

Based on the figure above, we need to define two standard attributes. The first one is called "units" and becomes relevant when a user creates a new dataset. The second one is called "comment" and can be passed during file, dataset or group creation. This attribute is optional, while "units" is mandatory.

**The comment attribute:**
The module `h5tbx.conventions` provides the class `StandardAttribute`. It requires the `name`, a `validator`, information about where to apply the standard attribute (`target_methods`) and a `description`:

In [2]:
comment = h5tbx.conventions.StandardAttribute(
    name='comment',
    validator={'$regex': "^[A-Z].*$"},
    target_methods=('__init__', 'create_dataset','create_group'),
    description='Additional information about the file'
)
comment

<PositionalStdAttr("comment"): "Additional information about the file">

The `validator` used here is a regular expression. This means that the user input is matched against the given pattern (`^[A-Z].*$`), so a valid comment must start with an upper-case letter.
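The effect of the pattern can be tried out with Python's built-in `re` module. This is only a sketch of the matching step. How the toolbox applies the pattern internally is not shown here, and the helper name `matches_comment_pattern` is an assumption made for illustration.

```python
import re

PATTERN = r"^[A-Z].*$"  # same pattern as used for the "comment" attribute


def matches_comment_pattern(value: str) -> bool:
    """Return True if `value` starts with an upper-case ASCII letter."""
    return re.match(PATTERN, value) is not None


for candidate in ("My first file", "123", "lower case start"):
    print(f"{candidate!r}: {matches_comment_pattern(candidate)}")
```

Only the first candidate satisfies the pattern. The other two start with a digit and a lower-case letter, respectively.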

Further validators are already implemented. For example, the built-in validator "$orcid" checks whether a value is a valid ORCID iD, which suits an attribute like "contact" that holds one or multiple researcher IDs.

For the "units" attribute, we use another already implemented `validator`, namely "$pintunit":

In [3]:
units = h5tbx.conventions.StandardAttribute(
    name='units',
    validator='$pintunit',
    target_methods='create_dataset',
    description='The physical units of the dataset'
)
units

<PositionalStdAttr("units"): "The physical units of the dataset">

### Validators

The following `validators` are available:

In [4]:
list(h5tbx.conventions.standard_attributes.av_validators.keys())

['$type',
 '$in',
 '$regex',
 '$pintunit',
 '$pintquantity',
 '$orcid',
 '$url',
 '$ref',
 '$bibtex',
 '$standard_name',
 '$standard_name_table',
 '$minlength',
 '$maxlength',
 '$datetime',
 'None']

Some validators **require reference values**. One example is the `$in` validator, for which a list of expected values must be provided. To find out how a validator is used, call `help()` on the respective validator:

In [5]:
help(h5tbx.conventions.standard_attributes.av_validators['$in'])

Help on class InValidator in module h5rdmtoolbox.conventions.validator:

class InValidator(StandardAttributeValidator)
 |  InValidator(expectation: List[str])
 |  
 |  Validates if the attribute value is in the list of expected values.
 |  During definition, the list of expected values is passed as a list of strings,
 |  see the example usage below, where the validator is used in the standard
 |  attribute "data_source"
 |  
 |  Parameters
 |  ----------
 |  expectation: List[str]
 |      List of expected values
 |  
 |  Example
 |  -------
 |  >>> import h5rdmtoolbox as h5tbx
 |  >>> data_source = h5tbx.conventions.StandardAttribute(
 |  >>>         name='units',
 |  >>>         validator={'$in': ['numerical', 'experimental', 'analytical']},
 |  >>>         method='__init__'
 |  >>>         description='The source of data'
 |  >>>     )
 |  
 |  Method resolution order:
 |      InValidator
 |      StandardAttributeValidator
 |      builtins.object
 |  
 |  Methods defined here:
 |  
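The behaviour described in the help text above can be mimicked in a few lines of plain Python. The class `ToyInValidator` below is a hypothetical stand-in written for this explanation, not the toolbox's `InValidator`. It only illustrates what "validating against a list of expected values" means.

```python
class ToyInValidator:
    """Hypothetical stand-in: accept a value only if it is in `expectation`."""

    def __init__(self, expectation):
        self.expectation = list(expectation)

    def __call__(self, value):
        if value not in self.expectation:
            raise ValueError(f'The value "{value}" is not in {self.expectation}')
        return value


validate_source = ToyInValidator(['numerical', 'experimental', 'analytical'])
print(validate_source('numerical'))  # accepted and returned unchanged

try:
    validate_source('guess')
except ValueError as e:
    print(e)
```

The reference values are fixed when the validator is created, so the same list does not have to be repeated each time an attribute is set.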
 

## Conventions: Enable standard attributes

Conventions contain one or multiple standard attributes. Below, we create one with the previously defined attributes:

In [6]:
# provide a name and an ORCID for the creator(s) of the convention:
my_convention = h5tbx.conventions.Convention('my_convention',
                                            contact='https://orcid.org/0000-0001-8729-0482')
my_convention.add(comment)
my_convention.add(units)

my_convention.register()  # only now can we enable it

h5tbx.use('my_convention')  # enable the convention

# print an overview:
my_convention

**Convention("my_convention")**
contact: https://orcid.org/0000-0001-8729-0482
  File.__init__():
    * **comment**:
		Additional information about the file
  Group.create_dataset():
    * **comment**:
		Additional information about the file
    * **units**:
		The physical units of the dataset
  Group.create_group():
    * **comment**:
		Additional information about the file
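The order of the calls above matters: a convention has to be registered before it can be enabled by name. The following toy model shows this register-then-use pattern with a plain dictionary. It is a sketch under simplifying assumptions. The class `ToyConvention` and the function `use` defined here are hypothetical and do not reflect how the toolbox stores its conventions.

```python
class ToyConvention:
    """Hypothetical stand-in for a convention: a named list of attribute names."""

    registry = {}  # name -> ToyConvention, shared by all instances

    def __init__(self, name):
        self.name = name
        self.attributes = []

    def add(self, attribute_name):
        self.attributes.append(attribute_name)

    def register(self):
        ToyConvention.registry[self.name] = self


def use(name):
    """Look up a convention by name; fail if it was never registered."""
    try:
        return ToyConvention.registry[name]
    except KeyError:
        raise ValueError(f'Convention "{name}" is not registered') from None


cv = ToyConvention('my_convention')
cv.add('comment')
cv.add('units')

try:
    use('my_convention')  # fails: the convention is not registered yet
except ValueError as e:
    print(e)

cv.register()
print(use('my_convention').attributes)
```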

Let's check whether the signatures of `__init__`, `create_group` and `create_dataset` have changed:

In [7]:
import inspect

methods = (h5tbx.File.__init__, h5tbx.Group.create_group, h5tbx.Group.create_dataset)

for method in methods:
    print(f'\nParameters for "{method.__name__}":')
    for param in inspect.signature(method).parameters.values():
        if not param.name == 'self':
            if param.name in ('contact', 'comment'):
                print(f'  - {h5tbx._repr.make_bold(param.name)}')
            else:
                print(f'  - {param.name}')


Parameters for "__init__":
  - name
  - mode
  - layout
  - attrs
  - **comment**
  - kwargs

Parameters for "create_group":
  - name
  - overwrite
  - attrs
  - update_attrs
  - track_order
  - **comment**
  - kwargs

Parameters for "create_dataset":
  - name
  - shape
  - dtype
  - data
  - overwrite
  - chunks
  - make_scale
  - attach_scales
  - ancillary_datasets
  - attrs
  - **comment**
  - units
  - kwargs


The docstrings of the methods also changed. Call `help()` on them:

In [8]:
help(h5tbx.File.__init__)

Help on function __init__ in module h5rdmtoolbox.wrapper.core:

__init__(self, name: pathlib.Path = None, mode: str = None, layout: Union[pathlib.Path, str, h5rdmtoolbox.conventions.layout.core.Layout, NoneType] = None, attrs: Dict = None, comment: str = <_SpecialDefaults.EMPTY: 1>, **kwargs)
    Main wrapper around h5py.File.
    
    Adds additional features and methods to h5py.File in order to streamline the work with
    HDF5 files and to incorporate usage of metadata (attribute naming) conventions and layouts.
    An additional argument is added to the h5py.File with "layout" to specify the layout of the file.
    The layout specifies the structure of the file and the expected content of each group and dataset.
    A check can be performed to verify that the file is in accordance with the layout.
    
    
    .. seealso:: :meth:`check`
    
    
    .. note:: All features from h5py packages are preserved.
    
    
    
    
    Parameters
    ----------
    filename: str = None


In [9]:
help(h5tbx.Group.create_dataset)

Help on function create_dataset in module h5rdmtoolbox.wrapper.core:

create_dataset(self, name, shape=None, dtype=None, data=None, overwrite=None, chunks=True, make_scale=False, attach_scales=None, ancillary_datasets=None, attrs=None, comment: str = <_SpecialDefaults.EMPTY: 1>, units: str = <_SpecialDefaults.EMPTY: 1>, **kwargs)
    Creating a dataset. Allows attaching/making scale, overwriting and setting attributes simultaneously.
    
    
    
    Parameters
    ----------
    name: str = None
            Name of dataset
    shape: tuple = None
            Dataset shape. see h5py doc. Default None. Required if data=None.
    dtype: str = None
            dtype of dataset. see h5py doc. Default is dtype('f')
    data: numpy ndarray, default=None = None
            Provide data to initialize the dataset.  If not used,
            provide shape and optionally dtype via kwargs (see more in
            h5py documentation regarding arguments for create_dataset
    overwrite: bool, defau

## Working with the convention 

First we test the comment attribute:

A wrong or missing input will raise an error:

In [10]:
try:
    with h5tbx.File(comment='123') as h5:
        h5.dump()
except Exception as e:
    print(e)

Setting "123" for standard attribute "comment" failed. Original error: The value "123" does not match the pattern "^[A-Z].*$"


Passing unexpected parameters to the methods will also raise an error:

In [11]:
try:
    with h5tbx.File(contact='https://orcid.org/0000-0001-8729-0482',
                    comment='123') as h5:
        h5.dump()
except Exception as e:
    print(e)

'contact' is an invalid keyword argument for this function


This is correct:

In [12]:
with h5tbx.File(comment='My first file') as h5:
    h5.dump()

Next we test the units attribute:

In [13]:
with h5tbx.File(comment='My first file') as h5:
    h5.create_dataset('velocity', data=1.3, units='m/s', comment='Hello')
    h5.dump()

## Import a convention

Conventions are defined for a project. Their standard attributes can be defined in a single or in multiple YAML files. Those files can be loaded into the current session from local storage or from a remote web resource. We first take a look at loading a local definition of standard attributes.

### Load a local convention

In [14]:
from h5rdmtoolbox import tutorial

In [15]:
convention_filename = tutorial.get_standard_attribute_yaml_filename()

local_cv = h5tbx.conventions.Convention.from_yaml(convention_filename)
local_cv.register()
local_cv

**Convention("planar-piv-convention")**
contact: https://orcid.org/0000-0001-8729-0482
  File.__init__():
    * **contact**:
		Contact or responsible person represented by an ORCID-ID.
    * *standard_name_table* (default=<h5rdmtoolbox.conventions.consts.DefaultValue object at 0x000001C7CA169040>):
		The standard name table of the convention.
    * *comment* (default=None):
		A comment to further describe the data.
    * *references* (default=None):
		Web resources servering as references for the data
  Group.create_dataset():
    * **units**:
		Physical unit of the dataset.
    * **standard_name**:
		Standard name of the dataset. If not set, the long_name attribute must be given.
    * **long_name**:
		An comprehensive description of the dataset. If not set, the standard_name attribute must be given.
    * *scale* (default=None):
		Scale factor for the dataset values.
    * *offset* (default=None):
		Scale factor for the dataset values

In [16]:
h5tbx.use(local_cv)

<h5rdmtoolbox.conventions.core.use at 0x1c7ca137a00>

In [17]:
with h5tbx.File(contact='https://orcid.org/0000-0001-8729-0482', mode='r') as h5:
    h5.dump()



### Load a remote convention

Loading a remote convention is generally done only once, or a few times if the convention is revised. Such a convention therefore needs a version or, even better, a persistent identifier such as a DOI.

The toolbox suggests using Zenodo as a repository. The following shows how a convention that was uploaded to Zenodo can be integrated into the user's workflow.

The example convention is registered on Zenodo under the identifier 8276817, which is passed as `doi` in the call below. It contains multiple \*.yaml files.

In [23]:
cv = h5tbx.conventions.from_zenodo(doi='8276817')
h5tbx.use(cv)  # enable the downloaded convention

<h5rdmtoolbox.conventions.core.use at 0x1c7b6fc3760>

## List of available conventions

Registered conventions can be listed together with their standard attributes for the respective HDF5 objects. Calling `h5tbx.conventions.get_registered_conventions()` returns them as a dictionary:

In [24]:
h5tbx.conventions.get_registered_conventions()

{'h5py': [1mConvention("h5py")[0m
 contact: https://orcid.org/0000-0001-8729-0482,
 'h5tbx': [1mConvention("h5tbx")[0m
 contact: https://orcid.org/0000-0001-8729-0482
   Group.create_dataset():
     * [1munits[0m:
 		Physical unit of the dataset.
     * [3mscale[0m (default=None):
 		Scale factor for the dataset values.
     * [3moffset[0m (default=None):
 		Scale factor for the dataset values.,
 'my_convention': [1mConvention("my_convention")[0m
 contact: https://orcid.org/0000-0001-8729-0482
   File.__init__():
     * [1mcomment[0m:
 		Additional information about the file
   Group.create_dataset():
     * [1mcomment[0m:
 		Additional information about the file
     * [1munits[0m:
 		The physical units of the dataset
   Group.create_group():
     * [1mcomment[0m:
 		Additional information about the file,
 'planar-piv-convention': [1mConvention("planar-piv-convention")[0m
 contact: https://orcid.org/0000-0001-8729-0482
   File.__init__():
     * [1mcontact[0m:
