# Basics

The package `h5wrapper` adds functionality to the core HDF5 interface, which is implemented in the package `h5py` (https://docs.h5py.org/en/stable/). This is done by providing so-called wrapper classes. They
 - facilitate and streamline the work with HDF5 files and
 - integrate meta-conventions and file-layout specifications.

Besides high-level methods that enhance usability, naming `conventions`, the usage of `units` and the definition of so-called `layouts` are core features. They are motivated by the FAIR principles of sustainable data management.

This notebook guides you through the high-level methods and the concepts mentioned above.

In [None]:
import h5rdmtoolbox as h5tbx

## Create an HDF5 file

To work with HDF5 files, the "wrapper" class `H5File` is used (equivalent to `h5py.File`).

There are several ways to open a file with this package.<br>
The first thing to note is that we don't need to specify a filename when calling the class. In that case, a temporary file is generated and the default mode is `r+`:

In [None]:
h5 = h5tbx.H5File()  # note: if `mode` is not passed and the file does not exist yet, 'r+' is used; otherwise the default is 'r'
h5.close()
print(h5.hdf_filename.name)  # same as h5.filename, but a pathlib.Path that remains available after the file is closed

Note that we have an additional property `hdf_filename`, which, in contrast to `filename`, also works after the file has been closed.

**A safer way to work with files** is to use Python's context manager. This is highly recommended and used throughout the whole documentation and package.

Thus, the above cell changes to:

In [None]:
with h5tbx.H5File('test.hdf', 'w') as h5:
    print(h5.hdf_filename.name)
h5.hdf_filename.unlink()  # it's a pathlib.Path object, thus easy to delete
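
Why the context manager is the safer choice can be illustrated without HDF5 at all. The following is a minimal sketch using a hypothetical `TrackedFile` class (an assumption for illustration, not part of the package) that wraps a plain text file. The `with` block closes the file even if an exception is raised inside it, and the path stored as a `pathlib.Path` stays usable afterwards, similar in spirit to `hdf_filename`.

```python
import pathlib
import tempfile


class TrackedFile:
    """Hypothetical stand-in for a file wrapper; not the package's H5File."""

    def __init__(self, path, mode='w'):
        self.path = pathlib.Path(path)  # stays valid after the file is closed
        self._fh = open(self.path, mode)

    @property
    def closed(self):
        return self._fh.closed

    def __enter__(self):
        return self

    def __exit__(self, exc_type, exc_value, traceback):
        self._fh.close()  # runs on normal exit and when an exception occurred
        return False      # do not swallow the exception


tmp_dir = pathlib.Path(tempfile.mkdtemp())
try:
    with TrackedFile(tmp_dir / 'test.txt') as f:
        raise RuntimeError('something went wrong while writing')
except RuntimeError as e:
    print(e)

print(f.closed)         # the file was closed despite the error
print(f.path.exists())  # the path object still works after closing
f.path.unlink()         # clean up the file ...
tmp_dir.rmdir()         # ... and the temporary directory
```

Without the `with` statement, the exception would leave the file open until `close()` is called explicitly.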

## Default content and content exploration

Whenever a file is created, some default attributes are written to it. We can inspect the content by "dumping" it to the screen with `dump()`, which displays an interactive HTML representation of the file content. At the moment, only attributes at the root level exist. They were created when the file was opened in write mode. While `creation_time` is a regular attribute, `__h5rdmtoolbox_version__` and `__wrcls__` (attributes starting and ending with `__`) are special attributes reserved for use by the package.

In [None]:
with h5tbx.H5File() as h5file:  # default mode is 'r+'
    h5file.dump()   # call .sdump() outside of notebooks to get a similar but non-interactive representation

## Registration of properties
To facilitate and speed up certain tasks and workflows, it might be useful to have additional properties. Let's assume we want to store a user name as an attribute in the HDF5 file. Let's further assume that a user name consists of a first name and a surname, and both must start with a capital letter. To catch possibly erroneous input by the user, we would need an additional method, e.g. "set_username", or a property "username" of `H5File`.<br>There are **two** possible ways to achieve this: **inheritance or composition**. The quick and recommended way to add a new class property is to use composition by "registering" the property. This way we don't have to rewrite (or at least inherit) the API but only write the property class.<br>
Let's first write a property class "username". It needs **three methods**: `get`, `set` and `delete`:

In [None]:
@h5tbx.h5wrapper.register_special_property(h5tbx.H5File, overwrite=True)
class username:
    
    def get(self):
        _username = self.attrs.get('username', None)
        if _username is not None:
            return _username
        raise AttributeError('No user found')
        
    def set(self, _username):
        """Write the user name to the HDF attribute if the naming convention is matched"""
        _username_split = _username.split(' ')
        if not len(_username_split) == 2:
            raise ValueError(f'User name must have first name and surname separated by a space, but got: {_username}')
        # check the first letter of both the first name and the surname
        if _username_split[0][0].islower() or _username_split[1][0].islower():
            raise ValueError(f'Names must have capitalized first letters, but got: {_username}')
        self.attrs.create('username', _username)
        
    def delete(self):
        _username = self.attrs.get('username', None)
        if _username:
            print(f"deleting '{_username}'")
            self.attrs.__delitem__('username')
        else:
            raise AttributeError('No user to be deleted')

Some words about the lines above:
- `@h5tbx.h5wrapper.register_special_property(h5tbx.H5File)` registers the class below with the class `H5File` (and only with it).
- don't use existing property names, otherwise an error is raised. You may, however, pass `overwrite=True` to the registration method. Be careful though!
- provide `set` and optionally also `get` and `delete`. The method `get` usually makes sense, while `delete` is not really needed most of the time.

Let's check if it worked out:

In [None]:
with h5tbx.H5File() as h5file:
    try:
        h5file.username = 'adam Username'
    except ValueError as e:
        print(e)
    h5file.username = 'Adam Username'
    print('The user name is: ', h5file.attrs['username'])
    
    del h5file.username
    try:
        del h5file.username
    except AttributeError as e:
        print(e)
        
    h5file.attrs['username'] = 'lower lower'
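
The idea behind registering a property by composition can be sketched in plain Python. The decorator `register_property` and the class `FakeFile` below are hypothetical names chosen for illustration; this is an assumption about how such a mechanism could work, not the package's actual implementation.

```python
def register_property(cls, overwrite=False):
    """Hypothetical decorator: attach the decorated class to `cls` as a property."""
    def decorator(accessor):
        name = accessor.__name__
        if hasattr(cls, name) and not overwrite:
            raise AttributeError(f'{cls.__name__} already has an attribute "{name}"')
        # get/set/delete are plain functions that receive the host instance as `self`
        prop = property(fget=getattr(accessor, 'get', None),
                        fset=getattr(accessor, 'set', None),
                        fdel=getattr(accessor, 'delete', None))
        setattr(cls, name, prop)
        return accessor
    return decorator


class FakeFile:
    """Stand-in for a wrapper class; attributes live in a plain dict."""

    def __init__(self):
        self.attrs = {}


@register_property(FakeFile)
class username:

    def get(self):
        return self.attrs['username']

    def set(self, value):
        parts = value.split(' ')
        if len(parts) != 2 or any(p[:1].islower() for p in parts):
            raise ValueError(f'Invalid user name: {value}')
        self.attrs['username'] = value

    def delete(self):
        del self.attrs['username']


f = FakeFile()
f.username = 'Adam Username'
print(f.username)
try:
    f.username = 'adam Username'
except ValueError as e:
    print(e)
del f.username
print('username' in f.attrs)
```

The host class itself is left untouched apart from the new attribute, which is why no subclass of the wrapper is needed.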

# (Naming) Conventions

To meet the FAIR principles of sustainable data management, the package introduces conventions that define not only which attributes must be attached to a dataset but also which names are allowed. Specifically, the following attributes are obligatory for each HDF5 dataset:
 - `units`
 - `standard_name`
 - `long_name` (not needed if `standard_name` exists and vice versa)

**units**<br>
We expect that each data set written to the HDF5 file has a physical unit or no unit at all. It is registered in the attribute `units`.

**standard_name and long_name**<br>
For the sake of improved readability and interpretability, we suggest using `long_name` or `standard_name` as additional attributes. While `long_name` is a human-readable and interpretable attribute, `standard_name` is intended to be read by a machine (other software). This makes it possible to automate exploration and processing work.

The `standard_name` generally should not be chosen freely but must follow a certain convention. Such a naming convention may be defined by a project or a community, e.g. the climate and forecast community [cfconventions.org], from which the concept of standard names is adopted. A convention is described in an **XML** file and associated with the wrapper class `H5File`. Again, this is adopted from [cfconventions.org]. The XML file contains the standard names, a description for each one and the respective unit. Thus, if a convention is associated with a wrapper class, the standard name and unit cannot be chosen freely but are verified.

[cfconventions.org]: http://cfconventions.org/
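
To give an idea of what such an XML description could look like and how it might be read, here is a minimal sketch using only the standard library. The element names (`entry`, `canonical_units`, `description`) follow the layout of the CF standard name table; whether the package's own files use exactly this layout is an assumption, and the two entries as well as the helper `load_table` are made up for illustration.

```python
import xml.etree.ElementTree as ET

# Hypothetical table content, modelled on the CF standard name table layout
XML_TABLE = """
<standard_name_table>
  <entry id="x_coordinate">
    <canonical_units>m</canonical_units>
    <description>Coordinate along the x-axis.</description>
  </entry>
  <entry id="x_velocity">
    <canonical_units>m/s</canonical_units>
    <description>Velocity component in x-direction.</description>
  </entry>
</standard_name_table>
"""


def load_table(xml_text):
    """Return a dict mapping each standard name to its unit and description."""
    root = ET.fromstring(xml_text.strip())
    table = {}
    for entry in root.findall('entry'):
        table[entry.get('id')] = {
            'units': entry.findtext('canonical_units'),
            'description': entry.findtext('description'),
        }
    return table


table = load_table(XML_TABLE)
for name, meta in table.items():
    print(f"{name} [{meta['units']}]: {meta['description']}")
```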

Let's have a look into one of the implemented conventions:

In [None]:
h5tbx.conventions.FluidStandardNameTable.dump()

### Checks
We can use the standard name table to check a name. Name compliance can be checked in a strict or a non-strict way: a strict check verifies that the name actually exists in the table, whereas a non-strict check only verifies that the name formally satisfies the naming rules (e.g. a name starting with a digit is rejected):

In [None]:
try:
    h5tbx.conventions.FluidStandardNameTable.check_name('test', strict=True)
except h5tbx.conventions.StandardizedNameError as e:
    print(e)
    
try:
    h5tbx.conventions.FluidStandardNameTable.check_name('1234test', strict=False)
except h5tbx.conventions.StandardizedNameError as e:
    print(e)
    
print(h5tbx.conventions.FluidStandardNameTable.check_name('test', strict=False))

In [None]:
print(h5tbx.conventions.FluidStandardNameTable.check_name('x_velocity', strict=False))

try:
    h5tbx.conventions.FluidStandardNameTable.check_units('test', 'm')
except h5tbx.conventions.StandardizedNameError as e:
    print(e)
    
try:
    h5tbx.conventions.FluidStandardNameTable.check_units('x_velocity', 'm')
except h5tbx.conventions.StandardizedNameError as e:
    print(e)
    
h5tbx.conventions.FluidStandardNameTable.check_units('x_velocity', 'm/s')
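
The difference between the strict and the non-strict name check can be sketched as follows. The table content, the formal rule encoded in `NAME_PATTERN` and the exception class `StandardNameError` are assumptions made for this illustration; the package's actual rules may differ.

```python
import re

# Hypothetical table: standard name -> canonical unit
TABLE = {'x_coordinate': 'm', 'x_velocity': 'm/s'}

# Assumed formal rule: lowercase letters, digits and underscores, starting with a letter
NAME_PATTERN = re.compile(r'^[a-z][a-z0-9_]*$')


class StandardNameError(ValueError):
    """Raised when a name violates the (assumed) convention."""


def check_name(name, strict=False):
    """Check the form of `name`; with strict=True also require a table entry."""
    if not NAME_PATTERN.match(name):
        raise StandardNameError(f'"{name}" is not a formally valid standard name')
    if strict and name not in TABLE:
        raise StandardNameError(f'"{name}" is not listed in the table')
    return True


print(check_name('test', strict=False))  # formally fine, table is not consulted
for name, strict in [('test', True), ('1234test', False)]:
    try:
        check_name(name, strict=strict)
    except StandardNameError as e:
        print(e)
```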

## Dataset creation

As motivated above, the package requires us to provide certain meta information to meet the FAIR principles. Creating a dataset exactly as in the `h5py` package is therefore not possible, because we have to pass `units` and either `standard_name` or `long_name`:

In [None]:
with h5tbx.H5File(standard_name_table=None) as h5:
    try:
        h5.create_dataset('x', shape=(4,))
    except h5tbx.conventions.UnitsError as e:
        print(e)
    h5.create_dataset('x', shape=(4,), units='m', long_name='a coordinate')

For now we only used a long name. What about standard name?

In [None]:
with h5tbx.H5File(standard_name_table=None) as h5:
    h5.create_dataset('x', shape=(4,), units='m', standard_name='a coordinate')

No problem so far because standard names are not regulated yet since we did not specify a `convention` with the `H5File`-object. In fact we even passed `standard_name_table=None`.

Let's pass the already implemented fluid convention to the wrapper class (the convention is again motivated by the CF conventions). We run into various errors first:

In [None]:
with h5tbx.H5File(standard_name_table=h5tbx.conventions.FluidStandardNameTable) as h5:
    try:
        h5.create_dataset('x', shape=(4,), units='m', standard_name='a coordinate')
    except h5tbx.conventions.StandardizedNameError as e:
        print(e)
    
    try:
        h5.create_dataset('x', shape=(4,), units='m', standard_name='a_coordinate')
    except h5tbx.conventions.StandardizedNameError as e:
        print(e)
    
    try:
        h5.create_dataset('x', shape=(4,), units='kg', standard_name='x_coordinate')  # note the wrong units!
    except h5tbx.conventions.StandardizedNameError as e:
        print(e)
        
    h5.create_dataset('x', shape=(4,), units='m', standard_name='x_coordinate')  # now finally correct
    h5.create_dataset('y', shape=(4,), units='km', standard_name='y_coordinate')  # only the base unit is checked

### Advanced dataset creation

There is more to dataset creation. You can:
- add attributes

In [None]:
with h5tbx.H5File() as h5:
    h5.create_dataset('ds', shape=(10, ), units='', attrs=dict(long_name='a long name', anothera='another attr'))  # unitless dataset. long_name is passed via parameter attrs

- make and attach scales (note the output of `dump()`: the scale "link" is shown)

In [None]:
with h5tbx.H5File() as h5:
    h5.create_dataset('x', data=[1,2,3], units='m', standard_name='x_coordinate', make_scale=True)
    h5.create_dataset('t', data=[20.1, 18.5, 24.7], units='degC', standard_name='temperature', attach_scale=h5['x'])
    h5.dump()

- add `xarray.DataArray` objects

In [None]:
import xarray as xr
import numpy as np
arr = xr.DataArray(
    dims=('y', 'x'),
    data=np.random.rand(3, 2),
    coords={
        'y': xr.DataArray(dims='y', data=[1, 2, 3],
                          attrs={'units': 'm', 'standard_name': 'y_coordinate'}),
        'x': xr.DataArray(dims='x', data=[0, 1],
                          attrs={'standard_name': 'x_coordinate'}),
    },
    attrs={'long_name': 'a long name', 'units': 'm/s'},
)

with h5tbx.H5File() as h5:
    h5.create_dataset('temperature', data=arr)
    h5.dump()

## Dataset slicing

In contrast to the `h5py` package, slicing a dataset returns an `xarray.DataArray` and not a `numpy.ndarray`. Note how the attached dimension scales are taken into account as well:

In [None]:
with h5tbx.H5File() as h5:
    dsx = h5.create_dataset('x', data=np.linspace(0, 10, 5), units='mm', long_name='x', make_scale=True)
    dsy = h5.create_dataset('y', data=np.linspace(0, 5, 3), units='mm', long_name='y', make_scale=True)
    h5.create_dataset('data', data=np.random.random((3, 5)), units='m/s', long_name='velocity', attach_scales=(dsy, dsx))
    data_arr = h5['data'][:]
data_arr

## Group creation
Group creation is not much different from `h5py`, except that you can pass a `long_name` and additional attributes.

In [None]:
with h5tbx.H5File() as h5:
    h5.create_group('mygrp')
    h5.create_group('othergrp', long_name='my other group', attrs=dict(one=2, two='a second attr'))

## Meta convention: Static layouts

Each HDF wrapper class must fulfill certain requirements, e.g.
 - to have specific attributes
 - to have specific attributes with a certain value
 - to have specific groups
 - to have specific datasets
 - to have a specific dataset with a specific shape

The basic wrapper class reflects the "minimum standard". It must have certain attributes, such as `creation_time` or `title`. To see which requirements a file does not yet fulfill, let's create a file with plain `h5py` and call `check()` on it with the wrapper class:

In [None]:
import h5py
with h5py.File(h5tbx.generate_temporary_filename(suffix='.hdf'), 'w') as h5:
    h5.attrs['title'] = 'Tutorial data'
    h5.attrs['__version__'] = '0.1.12'
    test_filename = h5.filename  # or h5.hdf_filename

with h5tbx.H5File(test_filename) as h5:
    h5.check(silent=False)
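
What such a layout check does conceptually can be sketched with plain dictionaries. The layout specification format, the attribute `data_source` and the function `check_layout` are hypothetical and only illustrate the idea of required attributes with or without a required value; the package's own layout definition may look different.

```python
# Hypothetical layout: attribute name -> required value (None means "any value")
LAYOUT = {'title': None, 'creation_time': None, 'data_source': 'simulation'}


def check_layout(attrs, layout, silent=True):
    """Compare root attributes with the layout and return the number of issues."""
    issues = []
    for name, required in layout.items():
        if name not in attrs:
            issues.append(f'missing attribute "{name}"')
        elif required is not None and attrs[name] != required:
            issues.append(f'attribute "{name}" is {attrs[name]!r}, expected {required!r}')
    if not silent:
        for issue in issues:
            print(issue)
    return len(issues)


# Root attributes similar to the file written with plain h5py above
root_attrs = {'title': 'Tutorial data', '__version__': '0.1.12', 'data_source': 'experiment'}
n_issues = check_layout(root_attrs, LAYOUT, silent=False)
print(f'{n_issues} issue(s) found')
```

Requirements on groups, datasets and dataset shapes could be expressed in the same way by extending the specification.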

## Visualization

Since slicing a dataset returns an `xarray.DataArray` instead of a `numpy.ndarray`, the plotting features of `xarray` can be used directly. For more information about `xarray`, see https://docs.xarray.dev/en/stable/

In [None]:
import matplotlib.pyplot as plt
with h5tbx.H5File() as h5:
    dsx = h5.create_dataset('x', data=np.linspace(0, 10, 20), units='mm', long_name='x', make_scale=True)
    dsy = h5.create_dataset('y', data=np.linspace(0, 5, 10), units='mm', long_name='y', make_scale=True)
    h5.create_dataset('data', data=np.random.random((10, 20)), units='m/s', long_name='velocity', attach_scales=(dsy, dsx))
    
    # some plotting
    plt.figure()
    h5['data'][:].plot()
    plt.figure()
    h5['data'][:].plot.contourf()
    plt.figure()
    h5['data'][:,0].plot.line(marker='o')
    plt.figure()
    h5['data'][:].plot.hist()

---
## Other useful features


### Creating datasets and groups from a yaml file

Sometimes it may be useful to write standard datasets and groups to your file (e.g. always the same attributes for repetitive tasks). These can be defined in a YAML file:

In [None]:
import yaml

dictionary = {
    'datasets': {
        'boundary/outlet/y': {
            'data': 2, 'units': 'm', 'standard_name': 'y_coordinate',
            'attrs': {'comment': 'test', 'another_attr': 100.2, 'array': [1, 2, 3]},
        },
    },
    'groups': {'test/grp': {'long_name': 'a test group'}},
}

yaml_filename = h5tbx.generate_temporary_filename(suffix='.yml')
with open(yaml_filename, 'w') as f:
    yaml.safe_dump(dictionary, f)

with h5tbx.H5File() as h5:
    h5.from_yaml(yaml_filename)
    h5.dump()