# Layouts

The organization of data, the vocabulary and other important attributes are typically specified in documents. In the case of HDF5 files, we need to specify how groups and datasets are arranged and which attributes are mandatory. It could also be defined which compression level each dataset must have. Attributes are of major importance when data is analyzed by many users, and in particular when it is processed by other software. Hence, specifying which attributes are used in which cases is important.

All these rules must be followed by the file creator and checked by the file receiver who analyzes the file (which can also be a machine).
The `h5RDMtoolbox` provides a way to translate the rules/specifications made by a collaboration or project into Python objects, which can then validate HDF5 files. This not only increases the value of the data, it also saves time and costs: the risk of missing or incomplete meta information is minimized, and data analysts do not need to contact the file creator for clarification, for instance.

---
We introduce the concept of *layouts* with an imaginary but realistic example of how data should be organized and provided in a small fluid mechanics project.

We first list our requirements/specifications as we would in a formal document. Then we translate them into a `Layout` object:

1. Datasets:
    1. The compression option is `gzip` for all datasets
    2. All datasets must have the attributes `units` and `standard_name`
2. Root group:
    1. The *root group* must hold the attribute `version`, which is the current version of the toolbox. It must also contain the attributes `title` and `user`
3. Groups:
    1. All groups must have an attribute `comment`. A comment must not start with a blank or a number.
    2. The group `measurement_devices` must exist at group `devices`
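
Before turning to the toolbox API, a requirement such as 2.1 can be read as plain data plus a check. The following is only a minimal sketch in plain Python, not the `h5RDMtoolbox` implementation: the root group is mimicked by a dictionary of attributes, and the names `REQUIRED_ROOT_ATTRS` and `missing_attrs` as well as the sample values are hypothetical and introduced purely for illustration.

```python
# Hypothetical stand-in: the attributes of the root group as a plain dict.
REQUIRED_ROOT_ATTRS = ("version", "title", "user")  # requirement 2.1


def missing_attrs(attrs, required):
    """Return the required attribute names that are absent from `attrs`."""
    return [name for name in required if name not in attrs]


# Invented example data: "version" was forgotten by the file creator.
root_attrs = {"title": "Pipe flow measurement", "user": "jane"}
print(missing_attrs(root_attrs, REQUIRED_ROOT_ATTRS))
```

A `Layout` object plays the role of the requirement list here, with the added benefit that it understands the HDF5 hierarchy.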

## Creating a layout
The `Layout` object is designed such that its syntax is similar to that of the `h5py.File` class.

In [1]:
import h5rdmtoolbox as h5tbx
from h5rdmtoolbox.conventions.layout import Layout, Any, Regex

Initialize a `Layout` object:

In [2]:
lay = Layout()

### 1. Datasets

Instead of calling `create_dataset`, we call `specify_dataset` on the `Layout` object.

#### A. Dataset properties

All datasets must use gzip compression, so the dataset name is irrelevant. To specify this, we can use the `Ellipsis` object of Python. The other parameters of `specify_dataset` are the parameters set during `create_dataset` of the h5py dataset. We only need `compression` in this case:

In [3]:
dv = lay['*'].specify_dataset(name='...', compression='gzip')
print(dv)

["/"].dataset(name="...", compression=gzip)


#### B. Dataset attributes
We can either use the attribute manager `.attrs` as known from the `h5py` implementation or we use `specify_attrs()`:

In [9]:
ds = lay['*'].specify_dataset()

ds.attrs = {'units': ..., 'standard_name': ...}
# we could also have done this:
# ds.attrs['units'] = ...
# ds.attrs['standard_name'] = ...
# ... or:
# ds.specify_attrs(units=..., standard_name=...)

### 2. Root Group
#### A. Root attributes

In [13]:
lay['/'].attrs['__version__'] = h5tbx.__version__
lay['/'].specify_attrs(title=..., user=...)

### 3. Group
#### A. All groups must have an attribute `comment`.
The attribute `comment` must not start with a blank or a number.

In [16]:
lay['*'].specify_group().attrs['comment'] = Regex(r'^[^ 0-9].*')
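
To see what the pattern `^[^ 0-9].*` accepts, it can be tried with Python's standard `re` module, independent of the toolbox. This sketch only exercises the regular expression itself. The sample strings and the name `COMMENT_PATTERN` are made up for illustration, and it is an assumption that `Regex` applies the pattern in the same way as `re.match`.

```python
import re

# Same pattern as passed to Regex(...) above
COMMENT_PATTERN = re.compile(r'^[^ 0-9].*')

samples = ['This is a valid comment', ' leading blank', '1st measurement', '']
for s in samples:
    # The first character must be neither a blank nor a digit
    ok = COMMENT_PATTERN.match(s) is not None
    print(f'{s!r}: {"accepted" if ok else "rejected"}')
```

Note that an empty string is rejected as well, because the pattern requires at least one character.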

#### B. The group `measurement_devices` must exist at group `devices`

In [17]:
lay['devices'].specify_group('measurement_devices')

GroupValidation(ExistIn('measurement_devices'), opt=False)>

## Performing the validation

Now it is time to test (validate) the specifications.

We start with an empty file. Every specification is expected to fail unless it is optional, such as the specification for the "comment" attribute. Note that not all specifications can be tested.

In [18]:
with h5tbx.File() as h5:
    lay.validate(h5)
lay.print_failed_validations()
lay.fails

["/"].dataset(name="...", compression=gzip)
["/"].attr(name="comment", value="re:^[^ 0-9].*")
["/"].attr(name="__version__", value="0.5.1a1")
["/"].attr(name="title", value="...")
["/"].attr(name="user", value="...")
GroupValidation(ExistIn('devices'), opt=False)>


6

In [19]:
lay.report()

Layout Validation report
------------------------
Number of validations (called/specified): 9/19
Number of inactive validations: 10
Success rate: 0.0% (n_fails=6)


In [20]:
with h5tbx.File() as h5:
    h5.create_dataset('velocity',
                      shape=(10, 20),
                      compression='gzip',
                      attrs={'units': 'm/s'})
    h5.attrs['comment'] = 'This is a valid comment'
    g = h5.create_group('devices/measurement_devices')
    h5['devices'].attrs['comment'] = 'This is a valid comment'
    h5['devices'].attrs['long_name'] = 'an_attribute'
    h5['devices/measurement_devices'].attrs['comment'] = 'This is a valid comment'
    h5.attrs['__version__'] = h5tbx.__version__
    
    res = lay.validate(h5)
lay.print_failed_validations(n=1)

["/"].dataset(name="...", compression=gzip)


In [21]:
lay.report()

Layout Validation report
------------------------
Number of validations (called/specified): 15/19
Number of inactive validations: 4
Success rate: 66.7% (n_fails=6)


### Dataset property specification

Each dataset (regardless of its location within the hierarchical structure) whose name starts with either "x", "y" or "z" and ends with "_coordinate" shall be one-dimensional. This can be specified by the dataset property "ndim":

In [22]:
lay['*'].specify_dataset(name=Regex('^[x-z]_coordinate'), ndim=1)

DatasetValidation(Regex('^[x-z]_coordinate'), opt=False)>
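
The pattern `^[x-z]_coordinate` anchors only the start of the name. As a sketch with Python's `re` module (the dataset names below are invented, and it is an assumption that `Regex` matches in the same way as `re.match`), one can check which names the pattern selects and how a trailing `$` would restrict it to names that also end with `_coordinate`:

```python
import re

loose = re.compile(r'^[x-z]_coordinate')    # pattern used above
strict = re.compile(r'^[x-z]_coordinate$')  # hypothetical stricter variant

names = ['x_coordinate', 'z_coordinate', 'w_coordinate', 'x_coordinate_raw']
for name in names:
    # Columns: name, matched by loose pattern, matched by strict pattern
    print(name, bool(loose.match(name)), bool(strict.match(name)))
```

Only `x_coordinate_raw` is treated differently by the two patterns: the loose one accepts it, the strict one does not.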

## Perform a layout validation

Let's create an empty HDF5 file first:

In [23]:
h5tbx.use(None)

with h5tbx.File() as h5:
    h5.dump()

Running the validation with `lay.validate()` and then reading `lay.fails` gives us the total number of failed validations:

In [24]:
res = lay.validate(h5.hdf_filename)

In [25]:
lay.fails

7

To find out what the issues are, it is best to `report()` the issue messages. Note that some issues are "hidden".

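Adding the version would reduce the number of issues by one. Note, however, that the layout defined above expects the attribute `__version__`, so the attribute name used in the next cell does not satisfy that specification and the number of fails stays the same: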

In [26]:
with h5tbx.File() as h5:
    h5.attrs['h5rdmtoolbox_version'] = h5tbx.__version__
    lay.validate(h5)
lay.fails

7

In [27]:
lay.get_failed_validations()

[ValidationResult(DatasetValidation(Equal('...'), opt=False)>, False, obj=None),
 ValidationResult(AttributeValidation(Equal('comment'), opt=False)>, False, obj=<class "File" convention: "h5py">),
 ValidationResult(DatasetValidation(Regex('^[x-z]_coordinate'), opt=False)>, False, obj=None),
 ValidationResult(AttributeValidation(Equal('__version__'), opt=False)>, False, obj=<class "File" convention: "h5py">),
 ValidationResult(AttributeValidation(Equal('title'), opt=False)>, False, obj=<class "File" convention: "h5py">),
 ValidationResult(AttributeValidation(Equal('user'), opt=False)>, False, obj=<class "File" convention: "h5py">),
 ValidationResult(GroupValidation(ExistIn('devices'), opt=False)>, False, obj=<class "File" convention: "h5py">)]

Adding a group "fluid" that carries a `units` attribute but contains no dataset:

In [28]:
with h5tbx.File() as h5:
    g = h5.create_group('fluid')
    g.attrs['units'] = 'm/s'
    lay.validate(h5)
lay.fails
for f in lay.get_failed_validations():
    print(f.validation)

["/"].dataset(name="...", compression=gzip)
["/"].attr(name="comment", value="re:^[^ 0-9].*")
["/"].dataset(name="re:^[x-z]_coordinate", ndim=1)
["/"].dataset(name="...", compression=gzip)
["/"].attr(name="comment", value="re:^[^ 0-9].*")
["/"].dataset(name="re:^[x-z]_coordinate", ndim=1)
["/"].attr(name="__version__", value="0.5.1a1")
["/"].attr(name="title", value="...")
["/"].attr(name="user", value="...")
GroupValidation(ExistIn('devices'), opt=False)>


## User-defined Validators

Above we specified that comments must not start with a space or a number. We could also define a validator class to handle this. To do so, inherit from the class `Validator` and adjust the `__call__` method:

In [29]:
from h5rdmtoolbox.conventions.layout import Validator
import re

class Comment(Validator):
    
    def __init__(self):
        reference = r'^[^ 0-9].*'
        super().__init__(reference, '=')

    def __str__(self):
        return f're:{self.reference}'
    
    def __call__(self, value):
        return re.match(self.reference, value) is not None

In [30]:
lay = Layout()
lay['/'].attrs['comment'] = Comment()

In [31]:
with h5tbx.File() as h5:
    h5.attrs['comment'] = '0 wrong comment'
    lay.validate(h5)
lay.report()

Layout Validation report
------------------------
Number of validations (called/specified): 2/2
Number of inactive validations: 0
Success rate: 50.0% (n_fails=1)


In [32]:
with h5tbx.File() as h5:
    h5.attrs['comment'] = 'Correct comment'
    lay.validate(h5)
lay.report()

Layout Validation report
------------------------
Number of validations (called/specified): 2/2
Number of inactive validations: 0
Success rate: 100.0% (n_fails=0)
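
The core of such a validator is its `__call__` method, which can be tried on its own. The following sketch deliberately does not inherit from the toolbox's `Validator`: `CommentRule` is a hypothetical stand-alone class that mirrors only the matching logic of `Comment`, so it runs without the toolbox installed.

```python
import re


class CommentRule:
    """Stand-alone mimic of the matching logic of the `Comment` validator."""

    def __init__(self, reference=r'^[^ 0-9].*'):
        self.reference = reference

    def __str__(self):
        return f're:{self.reference}'

    def __call__(self, value):
        # True if the value matches the reference pattern from its start
        return re.match(self.reference, value) is not None


rule = CommentRule()
# Same two comments as used in the validation cells above
print(rule, rule('0 wrong comment'), rule('Correct comment'))
```

Keeping the check in `__call__` means the object can be used wherever a plain function taking the attribute value would be accepted.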


## Registered layouts

One pre-defined layout exists in the toolbox. It is intended to be used with the convention `tbx`. Thus, it specifies attributes like "title" at the root of the file.

In [33]:
from h5rdmtoolbox.conventions.layout import Registry

The representational string of the class `Registry` tells us which layouts are registered, namely "tbx", which checks basic requirements corresponding to the "tbx"-conventions:

In [34]:
Registry

LayoutRegistry("tbx",)

In [35]:
lay = Registry['tbx']
with h5tbx.File() as h5:
    h5.attrs['title'] = 'This is a title'
    lay.validate(h5)
lay.report()

Layout Validation report
------------------------
Number of validations (called/specified): 6/10
Number of inactive validations: 4
Success rate: 100.0% (n_fails=0)
