# Layouts

Layouts specify the expected structure and metadata of HDF5 files, e.g. which attributes are required or which shape certain datasets must have. A layout is published by a project management team or a collaboration, for instance, and helps during both data generation and data usage: the creator of an HDF5 file can verify that all required data has been written. Likewise, the receiver of an HDF5 file, e.g. an analyst, can check whether the file being inspected is "complete". Standardising the file layout reduces back-and-forth, as it minimizes errors and missing data, and ultimately saves costs.

## Creating a layout

The `Layout` object is designed so that its syntax resembles that of the `h5py.File` class.

In [1]:
import h5rdmtoolbox as h5tbx
from h5rdmtoolbox.conventions.layout import Layout, Any, Regex

2023-04-17_14:44:32,478 DEBUG    [__init__.py:35] changed logger level for h5rdmtoolbox from 20 to DEBUG


Initialize a `Layout` object:

In [2]:
lay = Layout()

Next, we will specify various attributes of groups and datasets, as well as properties of datasets such as their shape. Using explicit HDF5 paths or wildcards, we define whether a specification applies to one specific HDF5 object or to several.

### Attribute specifications

Let's require the user to set the root attribute "h5rdmtoolbox_version", which holds the current version of this package:

In [3]:
lay['/'].attrs['h5rdmtoolbox_version'] = h5tbx.__version__

Any attribute named `version` shall match the regular expression `^v\d+\.\d+\.\d+$`. We do not specify the location of the attribute: it could belong to a group or to a dataset. Wherever the attribute `version` is used, the regular expression is checked.

In [4]:
lay['*'].attrs['version'] = Regex(reference=r'^v\d+\.\d+\.\d+$')
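To see which values such a pattern accepts, the regular expression can be tried out with Python's standard `re` module. This is only a sketch of the pattern's behaviour; the candidate strings are made up for illustration, and it does not show how the toolbox applies the pattern internally.

```python
import re

# Pattern copied from the layout definition above.
VERSION_PATTERN = re.compile(r'^v\d+\.\d+\.\d+$')

# Hypothetical attribute values, chosen for illustration only.
candidates = ['v1.0.0', 'v0.4.12', '1.0.0', 'v1.0', 'v1.0.0a1']

# Keep only the values that satisfy the pattern.
accepted = [c for c in candidates if VERSION_PATTERN.search(c)]
print(accepted)
```

Values without the leading `v`, with fewer than three numeric parts, or with a trailing suffix such as `a1` are rejected because the pattern is anchored at both ends.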

If we only require an attribute to exist but do not specify its value any further, we can use the `Ellipsis` object `...`:

In [5]:
lay['/'].attrs['title'] = ... # or: layout.attrs.Exists()

Furthermore, we define that each group must have an attribute called "long_name". We do not specify its value; we only require that the attribute is present. The wildcard (`*`) indicates that the location of the group does not matter, so the specification applies to any group within an HDF5 file:

In [6]:
lay['*'].attrs['long_name'] = ...

Now we specify that a group "device" must exist. We explicitly state that the device group must be located at the root level:

In [7]:
lay['/'].define_group('device')

GroupValidation(Equal("device"), opt=False)>

Note that if we did not specify the exact parent group, i.e. wrote `lay['*'].define_group('device')`, this would have no effect, because it would make "device" optional.

Now suppose that each *dataset* (in any group) shall have an attribute called "standard_name". Again, the wildcard is used and no specific dataset name is set:

In [8]:
lay['*'].define_dataset().attrs['standard_name'] = Any()

Each *dataset* in the group "fluid" (and below) shall have an attribute called "units" with the value "m/s":

In [9]:
lay['fluid/*'].define_dataset().attrs['units'] = 'm/s'
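As a rough mental model for path wildcards such as `*` and `fluid/*`, shell-style matching from Python's standard `fnmatch` module can be used. This is an assumption made for illustration only: the toolbox may resolve wildcards differently, and the object paths below are hypothetical.

```python
from fnmatch import fnmatchcase

# Hypothetical object paths inside an HDF5 file.
paths = ['/fluid/velocity', '/fluid/sub/pressure', '/device/temperature', '/fluid']


def select(pattern, candidates):
    """Return the paths whose name (without leading slash) matches the glob pattern."""
    return [p for p in candidates if fnmatchcase(p.lstrip('/'), pattern)]


below_fluid = select('fluid/*', paths)
anywhere = select('*', paths)
print(below_fluid)
print(anywhere)
```

In this sketch `*` also matches the `/` separator, so `fluid/*` selects objects in nested subgroups of "fluid" as well, which mirrors the "(and below)" wording above.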

In [10]:
lay.dumps()

AttributeValidation(Equal("h5rdmtoolbox_version"), opt=False)>
  AttributeValidation(Equal("0.4.0a1"), opt=False)>
GroupValidation(Equal("*"), opt=False)>
  AttributeValidation(Equal("version"), opt=False)>
    AttributeValidation(Regex("^v\d+\.\d+\.\d+$"), opt=False)>
  AttributeValidation(Equal("long_name"), opt=False)>
    AttributeValidation(Any(opt=True), opt=False)>
  DatasetValidation(Any(opt=True), opt=True)>
    AttributeValidation(Equal("standard_name"), opt=False)>
      AttributeValidation(Any(opt=True), opt=False)>
AttributeValidation(Equal("title"), opt=False)>
  AttributeValidation(Any(opt=True), opt=False)>
GroupValidation(Equal("device"), opt=False)>
GroupValidation(Equal("fluid/*"), opt=False)>
  DatasetValidation(Any(opt=True), opt=True)>
    AttributeValidation(Equal("units"), opt=False)>
      AttributeValidation(Equal("m/s"), opt=False)>


### Dataset property specification

Each dataset (regardless of its location within the hierarchical structure) whose name starts with either "x", "y" or "z" and ends with "_coordinate" shall be one-dimensional. This can be specified by the dataset property "ndim":

In [12]:
lay['*'].define_dataset(name=Regex('^[x-z]_coordinate'), ndim=1)

DatasetValidation(Regex("^[x-z]_coordinate"), opt=False)>
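The pattern used above is anchored at the start of the name but not at its end. The following sketch, using only the standard `re` module, shows what that means for a few made-up dataset names. Whether the toolbox matches against the bare dataset name or the full path is not shown here, and `STRICT_PATTERN` is a hypothetical variant, not part of the layout above.

```python
import re

# Pattern copied from the dataset specification above.
COORD_PATTERN = re.compile(r'^[x-z]_coordinate')
# Hypothetical stricter variant that also anchors the end of the name.
STRICT_PATTERN = re.compile(r'^[x-z]_coordinate$')

# Made-up dataset names for illustration.
names = ['x_coordinate', 'z_coordinate', 'y_coordinate_raw', 'w_coordinate', 'my_x_coordinate']

matched = [n for n in names if COORD_PATTERN.search(n)]
strict = [n for n in names if STRICT_PATTERN.search(n)]
print(matched)
print(strict)
```

With the original pattern a name such as `y_coordinate_raw` is also selected; adding `$` would restrict the rule to names that end with "_coordinate".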

## Perform a layout validation

Let's create an empty HDF5 file first:

In [13]:
h5tbx.use(None)

with h5tbx.File() as h5:
    h5.dump()

Running the validation with `lay.validate()` returns one `ValidationResult` per check. For the empty file, seven of them fail (`lay.fails`):

In [14]:
lay.validate(h5.hdf_filename)

[ValidationResult(AttributeValidation(Equal("h5rdmtoolbox_version"), opt=False)>, False),
 ValidationResult(AttributeValidation(Equal("version"), opt=False)>, False),
 ValidationResult(AttributeValidation(Equal("long_name"), opt=False)>, False),
 ValidationResult(DatasetValidation(Regex("^[x-z]_coordinate"), opt=False)>, False),
 ValidationResult(GroupValidation(Equal("*"), opt=False)>, True),
 ValidationResult(AttributeValidation(Equal("title"), opt=False)>, False),
 ValidationResult(GroupValidation(Equal("device"), opt=False)>, False),
 ValidationResult(GroupValidation(Equal("fluid/*"), opt=False)>, False)]

In [15]:
lay.fails

7
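The number of fails corresponds to the results flagged `False` in the list printed above. The tally can be reproduced with plain Python; the `Result` tuple below is a hypothetical stand-in for the toolbox's `ValidationResult`, filled in by hand from that output.

```python
from collections import namedtuple

# Hypothetical stand-in for a validation result: a label and a pass/fail flag.
Result = namedtuple('Result', ['check', 'passed'])

# Transcribed from the validation output shown above.
results = [
    Result('attribute h5rdmtoolbox_version', False),
    Result('attribute version', False),
    Result('attribute long_name', False),
    Result('dataset ^[x-z]_coordinate', False),
    Result('group *', True),
    Result('attribute title', False),
    Result('group device', False),
    Result('group fluid/*', False),
]

# Collect every check that did not pass.
failed = [r for r in results if not r.passed]
print(len(results), len(failed))
```

Only the wildcard group check passes for the empty file, which leaves seven failed checks out of eight.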

To find out what the issues are, it is best to `report()` the issue messages. Note that some issues are "hidden".

Adding the version reduces the number of issues by one:

In [16]:
with h5tbx.File() as h5:
    h5.attrs['h5rdmtoolbox_version'] = h5tbx.__version__
    lay.validate(h5)
lay.fails

6

In [17]:
lay.get_failed()

[ValidationResult(AttributeValidation(Equal("version"), opt=False)>, False),
 ValidationResult(AttributeValidation(Equal("long_name"), opt=False)>, False),
 ValidationResult(DatasetValidation(Regex("^[x-z]_coordinate"), opt=False)>, False),
 ValidationResult(AttributeValidation(Equal("title"), opt=False)>, False),
 ValidationResult(GroupValidation(Equal("device"), opt=False)>, False),
 ValidationResult(GroupValidation(Equal("fluid/*"), opt=False)>, False)]

Next, we create the group "fluid" and set "units" on the group itself; no dataset is added. The layout still reports failures:

In [18]:
with h5tbx.File() as h5:
    g = h5.create_group('fluid')
    g.attrs['units'] = 'm/s'
    lay.validate(h5)
lay.fails
lay.get_failed()

[ValidationResult(AttributeValidation(Equal("h5rdmtoolbox_version"), opt=False)>, False),
 ValidationResult(AttributeValidation(Equal("version"), opt=False)>, False),
 ValidationResult(AttributeValidation(Equal("long_name"), opt=False)>, False),
 ValidationResult(DatasetValidation(Regex("^[x-z]_coordinate"), opt=False)>, False),
 ValidationResult(AttributeValidation(Equal("version"), opt=False)>, False),
 ValidationResult(AttributeValidation(Equal("long_name"), opt=False)>, False),
 ValidationResult(DatasetValidation(Regex("^[x-z]_coordinate"), opt=False)>, False),
 ValidationResult(AttributeValidation(Equal("title"), opt=False)>, False),
 ValidationResult(GroupValidation(Equal("device"), opt=False)>, False),
 ValidationResult(GroupValidation(Equal("fluid/*"), opt=False)>, False)]

## User-defined Validators

In [None]:
# see tests

## Registered layouts

In [None]:
#layout.Layout.Registry()