# Getting started: H5File

The package `h5wrapper` adds general and application/domain-specific functionality to the core HDF5 interface implemented in the package `h5py` (https://docs.h5py.org/en/stable/). This is done by providing so-called wrapper classes. They
 - facilitate and streamline the work with HDF5 files and
 - integrate meta-conventions and file-layout specifications.

Multiple wrapper classes are implemented, each of which extends the functionality and introduces specific layout and naming conventions. The basic wrapper class is called `H5File`. We will first demonstrate the functionality and concepts around it before the more specialized wrapper classes are shown.

The wrapper classes extend but don't limit the functionality of the `h5py` package, so everything known from it is also available through this package. An HDF file that is created, read or otherwise processed with `H5File` is required to respect certain conventions and principles. It is possible to ignore them, but warnings will be raised. The following sections walk you through the concept and basic functionality of the first wrapper class, `H5File`:

In [None]:
import h5rdmtoolbox as h5tbx

## Create a HDF file

There are several ways to open a file with this package.

In [None]:
h5 = h5tbx.H5File()  # note: if `mode` is not passed and the filename does not exist, 'r+' is used; otherwise the default is 'r'
print(h5.hdf_filename.name)  # equal to h5.filename but a pathlib.Path and exists also after file is closed
h5.close()

A safer way to work with files is to use Python's context manager. This is highly recommended and used throughout the whole documentation and package.

Thus, the above cell changes to:

In [None]:
with h5tbx.H5File() as h5:
    print(h5.hdf_filename.name)  # equal to h5.filename but a pathlib.Path and exists also after file is closed
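
Why the context manager is safer can be seen in a minimal sketch. It uses a plain temporary text file as a stand-in for an HDF5 file (an assumption, so that it runs with the standard library only), and the names `path` and `handle` are purely illustrative. The file handle is closed when the `with` block is left, even if an error occurs inside it:

```python
import os
import tempfile

# Stand-in for an HDF5 file: a plain temporary text file (assumption made
# so that this sketch needs no HDF5 library).
fd, path = tempfile.mkstemp(suffix='.txt')
os.close(fd)

handle = None
try:
    with open(path, 'w') as f:
        handle = f  # keep a reference to inspect the handle afterwards
        f.write('some data')
        raise RuntimeError('something went wrong while writing')
except RuntimeError as e:
    print(f'caught: {e}')

# The with-statement closed the file although an exception was raised
closed_after_error = handle.closed
print('file closed despite the error:', closed_after_error)

os.remove(path)  # clean up the temporary file
```

Without the `with` statement, the explicit `close()` call would have been skipped by the exception and the file would have stayed open.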

## Open a file
... and display (dump) its content. An interactive HTML representation of the file content is shown. At the moment, only attributes at the root level exist. They were created when the file was opened in write mode. While `creation_time` and `modification_time` are treated as regular attributes, `__h5rdmtoolbox_version__` and `__wrcls__` are special: attributes starting and ending with `__` are reserved for and used by the package.

In [None]:
with h5tbx.H5File() as h5file:  # default mode is 'r+'
    h5file.dump()   # call .sdump() outside of notebooks to get a similar but non-interactive representation

Using `open_wrapper()` opens the file and returns an instance of the respective wrapper class, namely the one that previously wrote to the file, here `H5File`.

In [None]:
with h5tbx.open_wrapper(h5file.hdf_filename) as h5:
    h5.dump()
    print('Wrapper class instance name: ', type(h5))

# Conventions and layout schema:

 - Conventions: Mainly naming conventions
 - Layout (schema): required datasets, data dimensions, attributes and groups that must always exist in an HDF5 file


Conventions are generally defined by a community (e.g. the concept of standard names used here: [cfconventions.org]) and regulate the usage of certain names. This repository supports the principle set by the climate and forecast community [cfconventions.org]. This means that each dataset must have one or both of the following attributes:
  - `standard_name`
  - `long_name`.
  
While `long_name` is a user-defined and human-readable (as opposed to machine-readable) string without restrictions (e.g. w.r.t. length), the attribute `standard_name` cannot be chosen freely but is defined by a `convention`. Each wrapper class (except the very basic one, `H5Base`) is associated with a convention. A convention can be defined in an `XML` file, which is read in and passed to the wrapper class. For `H5Flow`, for example, the `FluidConvention` is set by default. If, for instance, a dataset `x` is created using the wrapper class `H5Flow` and `standard_name` is set, the name is verified against the associated standard name convention.

[cfconventions.org]: http://cfconventions.org/
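
The verification described above can be pictured with a small, self-contained sketch. It is not the package's implementation: the table `STANDARD_NAMES`, the exception `StandardNameError` and the function `check_standard_name` are hypothetical names introduced for illustration, and units are compared as plain strings, whereas a real convention would more likely check whether the units are physically compatible.

```python
# Hypothetical standard name table: name -> canonical units (illustrative entries)
STANDARD_NAMES = {
    'x_coordinate': 'm',
    'y_coordinate': 'm',
    'temperature': 'K',
}


class StandardNameError(ValueError):
    """Hypothetical error for unknown names or mismatching units."""


def check_standard_name(name, units, table=STANDARD_NAMES):
    """Raise StandardNameError if `name` is unknown or `units` do not match."""
    if name not in table:
        raise StandardNameError(f'"{name}" is not a registered standard name')
    expected = table[name]
    if units != expected:  # simplification: exact string comparison
        raise StandardNameError(
            f'units "{units}" do not match "{expected}" for "{name}"')
    return True


results = []
for name, units in [('x coordinate', 'm'),    # not registered (contains a space)
                    ('x_coordinate', 'kg'),   # registered, but wrong units
                    ('x_coordinate', 'm')]:   # registered and consistent
    try:
        check_standard_name(name, units)
        results.append('ok')
    except StandardNameError as e:
        print(f' > {e}')
        results.append('error')
```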

In [None]:
with h5tbx.H5File(standard_name_table=None) as h5f:
    print(h5f.standard_name_table)

The `H5Flow` class has a *non-empty* convention class set by default. Thus, when a dataset is created and the parameter `standard_name` is passed, it is checked whether this name exists in the convention. If so, the units of the created dataset are verified against those registered in the convention:

In [None]:
with h5tbx.H5Flow() as h5f:
    print(h5f.standard_name_table)
    try:
        h5f.create_dataset('x', data=1, standard_name='x coordinate', units='m')
    except h5tbx.conventions.StandardizedNameError as e:
        print(f' > Incorrect standard name: {e}')
    try:
        h5f.create_dataset('x', data=1, standard_name='x_coordinate', units='kg')
    except h5tbx.conventions.StandardizedNameError as e:
        print(f' > Incorrect units: {e}')
    h5f.create_dataset('x', data=1, standard_name='x_coordinate', units='m')
    h5f.dump()

---
## File creation
Files are created as you are used to with `h5py`, either with or without defining a filename (when no filename is passed, the mode is automatically set to 'r+'). If no filename is set, a temporary file is created in a tmp folder. You may also set the attribute `title` by passing it as a parameter during initialization:

In [None]:
with h5tbx.H5File('test.hdf', mode='w') as h5:
    h5.dump()
    print(h5.layout.filename)

In [None]:
from pathlib import Path

with h5tbx.H5File() as h5:  # equal to "with H5File(mode='w') as h5:"
    print(f'HDF5 files initialized without any parameters\n--> Mode is {h5.mode} and filename is {h5.filename}')

In [None]:
try:
    print(h5.filenme)
except AttributeError as e:
    print(e)

In [None]:
print(h5.hdf_filename)

## Meta convention: Static layouts:

Each HDF wrapper class must fulfill certain requirements, e.g.
 - to have specific attributes
 - to have specific attributes with a certain value
 - to have specific groups
 - to have specific datasets
 - to have a specific dataset with a specific shape

The basic wrapper class reflects the "minimum standard". It must have certain attributes, such as "creation_time" or "title". To check which ones exactly, let's print the content of the file to the screen using `dump()` (outside of Jupyter notebooks, print the instance or call `sdump()`).

In [None]:
import h5py
with h5py.File(h5tbx.generate_temporary_filename(suffix='.hdf'), 'w') as h5:
    h5.attrs['title'] = 'Tutorial data'
    h5.attrs['__version__'] = '0.1.12'
    test_filename = h5.filename  # or h5.hdf_filename

with h5tbx.H5File(test_filename) as h5:
    h5.check(silent=False)
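
Conceptually, such a check compares what is in the file with what the layout requires and counts the deviations. The following sketch illustrates the idea for root attributes only. The function `check_layout` and the tuple of required attribute names are assumptions made for illustration, and a plain dictionary stands in for the file's attributes:

```python
def check_layout(attrs, required_attrs, silent=True):
    """Return the number of required attributes missing from `attrs`."""
    issues = [name for name in required_attrs if name not in attrs]
    if not silent:
        for name in issues:
            print(f'missing attribute: {name}')
    return len(issues)


# Attributes as written to the file with plain h5py in the cell above
file_attrs = {'title': 'Tutorial data', '__version__': '0.1.12'}

# Hypothetical layout: the two attributes named in the text
n_issues = check_layout(file_attrs,
                        required_attrs=('title', 'creation_time'),
                        silent=False)
print(f'number of issues: {n_issues}')
```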

---
## Dataset creation
Basic dataset creation is no different from the `h5py` package. However, `H5File` will encourage you to use an attribute `units` and either `long_name` or `standard_name`. You may pass those parameters to the `create_dataset()` function. If you don't, you will be warned (which you can suppress in the usual Python way with `warnings.filterwarnings("ignore")`), but it has no other consequences. You may also directly specify dimension scales during dataset creation.

First, let's create a simple dataset with a `long_name` but without passing `units`. The cell below wraps the call in a `try` block and prints the `UnitsError` if one is raised:

In [None]:
import numpy as np
import xarray as xr

In [None]:
#from h5rdmtoolbox.h5wrapper.h5file import UnitsError
with h5tbx.H5File('test.hdf', mode='w', standard_name_table=None) as h5:
    try:
        h5.create_dataset('temperature', data=np.random.rand(3, 2), long_name='temperature of something')
    except h5tbx.conventions.UnitsError as e:
        print(e)
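
Regarding the warnings mentioned above: instead of switching them off globally with `warnings.filterwarnings("ignore")`, they can be silenced for a single block with the standard library's `warnings.catch_warnings()`. The sketch below uses a hypothetical stand-in function, `create_dataset_stub`, that only mimics the described warning, so that it runs without the package:

```python
import warnings


def create_dataset_stub(name, long_name=None, standard_name=None):
    """Hypothetical stand-in: warns if neither name attribute is given."""
    if long_name is None and standard_name is None:
        warnings.warn(f'No long_name or standard_name given for "{name}"',
                      UserWarning)
    return name


# Record warnings to show that the call does warn by default
with warnings.catch_warnings(record=True) as caught:
    warnings.simplefilter('always')
    create_dataset_stub('temperature')
n_caught = len(caught)

# Silence warnings locally; the filter is restored when the block is left
with warnings.catch_warnings(record=True) as caught_silenced:
    warnings.simplefilter('ignore')
    create_dataset_stub('temperature')
n_silenced = len(caught_silenced)

print(f'warnings recorded: {n_caught}, after silencing: {n_silenced}')
```

In everyday use, `record=True` can be dropped. It is only used here to make the effect of the filter visible.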

In [None]:
with h5tbx.H5File('test.hdf', mode='w', standard_name_table=None) as h5:
    h5.create_dataset('temperature', data=np.random.rand(3, 2), long_name='Surface temperature', units='')

In [None]:
with h5tbx.H5File('test.hdf', mode='w', standard_name_table=h5tbx.conventions.FluidStandardNameTable) as h5:
    try:
        h5.create_dataset('temperature', data=np.random.rand(3, 2), standard_name='Surface temperature', units='')
    except h5tbx.conventions.StandardizedNameError as e:
        print(e)

In [None]:
with h5tbx.H5File('test.hdf', mode='w', standard_name_table=h5tbx.conventions.FluidStandardNameTable) as h5:
    try:
        h5.create_dataset('temperature', data=np.random.rand(3, 2), standard_name='surface temperature', units='')
    except h5tbx.conventions.StandardizedNameError as e:
        print(e)

In [None]:
with h5tbx.H5File('test.hdf', mode='w', standard_name_table=h5tbx.conventions.FluidStandardNameTable) as h5:
    h5.create_dataset('temperature', data=np.random.rand(3, 2), standard_name='temperature', units='K')

Now, let's provide a `long_name` and a unit, so that no warning is shown because all requirements are fulfilled. Also note that we added the parameter `overwrite=True`, as we are using the same HDF file and want to overwrite the existing dataset:

In [None]:
with h5tbx.H5File('test.hdf', mode='r+', title='tutorial file') as h5:
    h5.create_dataset('temperature', data=np.random.rand(3, 2),
                     units='degC', overwrite=True, long_name='surface temperature')
    print(h5['temperature'])

Even better is to pass a standard name. Then the unit is checked for consistency (e.g. temperature has the unit kelvin and not meters):

In [None]:
with h5tbx.H5File('test.hdf', mode='r+', title='tutorial file') as h5:
    try:
        h5.create_dataset('temperature', data=np.random.rand(3, 2),
                          overwrite=True, standard_name='temperature', units='m')
    except Exception as e:
        print(e)

The following temperature dataset has the correct unit:

In [None]:
with h5tbx.H5File('test.hdf', mode='r+', title='tutorial file') as h5:
    h5.create_dataset('temperature', data=np.random.rand(3, 2),
                     units='degC', overwrite=True, standard_name='temperature')
    print(h5['temperature'])
    print('---')
    h5.sdump()

## Creating datasets and groups from a yaml file

Sometimes it may be useful to write standard datasets and groups to your file (e.g. always the same attributes for repetitive tasks). These can be defined in a yaml file:

In [None]:
import yaml

dictionary = {'datasets': {'boundary/outlet/y': {'data': 2, 'units': 'm', 'standard_name': 'y_coordinate',
                                                         'attrs': {'comment': 'test', 'another_attr': 100.2,
                                                                   'array': [1, 2, 3]}}},
                      'groups': {'test/grp': {'long_name': 'a test group'}}
                      }
with open('test.yaml', 'w') as f:
    yaml.safe_dump(dictionary, f)

with h5tbx.H5File('test.hdf', 'w') as h5:
    h5.from_yaml('test.yaml')
    h5.dump()
    
# delete the yaml and hdf file again:
Path('test.yaml').unlink()
h5.hdf_filename.unlink()

---
## Attributes
Attributes can be added to the file as known from `h5py`. During file creation some are automatically created, such as package version and file creation/modification time:

In [None]:
with h5tbx.H5File(mode='w', title='a test file') as h5:
    print(h5.attrs)

Besides adding strings or floats as attributes, **datasets** or **groups** can also be **assigned to an attribute**. Effectively, the internal HDF path is stored. When the attribute is requested, this is recognized and the respective dataset or group is returned:

In [None]:
with h5tbx.H5File(mode='w', title='a test file') as h5:
    grp = h5.create_group('a group')
    h5.attrs['a root group'] = grp
    print(h5.attrs['a root group'])
    grp.create_dataset('ds', units='m/s', data=[1,2,3], standard_name='x_velocity')
    grp.attrs['ref_to_own_ds'] = grp['ds']
    d = h5['a group'].attrs['ref_to_own_ds']
    print(d)

---
## Data exploration and Natural Naming
Besides the "rules" stated above, the class gives quick and easy insight into the file content by using `.print()` or `.explore()` (the latter gives an interactive HTML representation that is only available in notebooks).

In [None]:
with h5tbx.H5File(mode='w') as h5:
    print(h5)
    # h5.info()  # equal to print(h5)

In [None]:
with h5tbx.H5File(mode='w') as h5:
    h5.dump()

The following raises a warning for not setting `long_name` or `standard_name` and for not defining the `units` of the dataset. Inspection will therefore raise 2 issues (this time `title` was set though):

In [None]:
with h5tbx.H5File(mode='w', title='tutorial test file content') as h5:
    h5.create_dataset('test', shape=(2,3), long_name='test dataset', units='')
    n = h5.check(silent=False)  # n=1 because title is not set!

"Natural Naming" (enable/disable in yaml config see above) allows to address datasets and attributes as if they were attributes of the class:

In [None]:
with h5tbx.H5File(mode='w', title='tutorial test file content') as h5:
    h5.create_dataset('test', shape=(2,3), long_name='test dataset', units='')
    ds = h5['test']
    ds = h5.test
    print(h5.attrs.title)
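
How natural naming can work in principle is sketched below. This is not the package's code: `NaturalNamingGroup` is a hypothetical class that stores its members in a dictionary and falls back to it in `__getattr__`, which Python only calls when the normal attribute lookup fails.

```python
class NaturalNamingGroup:
    """Hypothetical container allowing item access via attribute syntax."""

    def __init__(self, items):
        self._items = dict(items)

    def __getitem__(self, name):
        return self._items[name]

    def __getattr__(self, name):
        # Only reached if regular attribute lookup failed.
        # Private names are never looked up in the items (avoids recursion).
        if name.startswith('_'):
            raise AttributeError(name)
        try:
            return self._items[name]
        except KeyError:
            raise AttributeError(
                f'no attribute or item named "{name}"') from None


grp = NaturalNamingGroup({'test': [1, 2, 3]})

# Both access styles return the very same object
same_object = grp.test is grp['test']
print(same_object)

# A misspelled name still raises an AttributeError
try:
    grp.tset
    misspelled_raises = False
except AttributeError as e:
    misspelled_raises = True
    print(e)
```

One consequence of this design is that item names which collide with real attributes or methods of the class remain reachable only through the bracket syntax.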

Interaction with `xarray.DataArray`: slicing a dataset returns an `xarray.DataArray` instead of an `np.ndarray`:

In [None]:
with h5tbx.H5File(mode='w', title='tutorial test file content') as h5:
    h5.create_dataset('test', data=np.random.rand(2,3), long_name='test dataset', units='')
    print(type(h5.test[:]))

---
# Interaction with `xarray`

In [None]:
import xarray as xr

It is possible to create datasets by passing an `xarray.DataArray`:

In [None]:
arr = xr.DataArray(
    dims=('y', 'x'),
    data=np.random.rand(3, 2),
    coords={
        'y': xr.DataArray(dims='y', data=[1, 2, 3],
                          attrs={'units': 'm', 'standard_name': 'y_coordinate'}),
        'x': xr.DataArray(dims='x', data=[0, 1],
                          attrs={'standard_name': 'x_coordinate'}),
    },
    attrs={'long_name': 'a long name', 'units': 'm/s'},
)

As the `DataArray` has `units` and `long_name` attributes, no warning will be raised in the following lines and no issues are detected. Coordinates will also be created (if they do not already exist) as HDF dimension scales.

In [None]:
with h5tbx.H5File() as h5:
    h5['velocity'] = arr
    h5.dump()

## Reading and plotting datasets

Accessing a dataset without slicing returns (as expected) the h5py dataset. However, slicing returns an `xarray.DataArray` instead of an `np.ndarray`. With that `xarray.DataArray`, quick and easy plotting can be performed. For more information about `xarray`, see https://docs.xarray.dev/en/stable/

In [None]:
import matplotlib.pyplot as plt
with h5tbx.H5File(h5.hdf_filename, mode='r+') as h5:
    velocity_h5ds = h5['velocity']  # returns h5py dataset
    velocity_xr = h5['velocity'][:]  # slicing returns xarray
    
    # some plotting
    plt.figure()
    h5['velocity'][:].plot()
    plt.figure()
    h5['velocity'][:].plot.contourf()
    plt.figure()
    h5['velocity'][:,0].plot.line(marker='o')
    plt.figure()
    h5['velocity'][:].plot.hist()

Datasets, groups and attributes can be addressed by natural naming:

In [None]:
with h5tbx.H5File(h5.hdf_filename, mode='r+') as h5:
    print(h5.creation_time)
    print(h5.attrs)
    print(h5.velocity)
    h5.create_group('test_group', long_name='a test group')
    print(h5.test_group)
    print('\n')
    h5.dump()

In [None]:
with h5tbx.H5File(h5.hdf_filename, 'r') as h5:
    print(h5['x'])
    vel = h5['velocity'][:,:]
    print(vel)
    x=h5['x'][:]

In [None]:
vel