# Working with HDF5 datasets

Unlike the h5py package, which returns a `numpy.ndarray` when accessing the values of a dataset, the `h5rdmtoolbox` returns `xarray.DataArray` objects (<https://xarray.pydata.org/>). An `xarray.DataArray` carries attributes along with the numpy-like multi-dimensional array. It also supports dimensions and coordinates, which make it possible to give the array axes meaningful (meta)data.
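
As a minimal sketch of what such an object holds, the following builds an `xarray.DataArray` by hand using only `numpy` and `xarray`. The variable names and the zero-filled data are assumptions made for illustration; they are not taken from the toolbox.

```python
import numpy as np
import xarray as xr

# Illustrative data only: a 2D array with labelled axes and attributes
x = np.linspace(0, 10, 5)
y = np.linspace(0, 5, 11)
vel = xr.DataArray(
    np.zeros((11, 5)),
    dims=("y", "x"),                                  # names of the two axes
    coords={"y": y, "x": x},                          # 1D coordinate per axis
    attrs={"units": "m/s", "long_name": "velocity"},  # metadata travels along
)

print(vel.dims)
print(vel.attrs["units"])
print(vel.coords["x"].values)
```

The array data, the axis labels and the attributes live in one object, which is what the following sections rely on.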

Let's dive in and explore the practical implications of retrieving an `xarray.DataArray`:

In [None]:
import h5rdmtoolbox as h5tbx
import numpy as np

Let's create an example file. Note that we pass `make_scale` and `attach_scales` as arguments to set up the coordinates and associate them with the HDF5 dataset "vel". The benefits will become visible when we access the dataset values in the next steps.

In [None]:
with h5tbx.File() as h5:
    # coordinate datasets; make_scale=True marks them as dimension scales
    dsx = h5.create_dataset('x', data=np.linspace(0, 10, 5),
                            attrs=dict(units='mm', long_name='x'),
                            make_scale=True)
    dsy = h5.create_dataset('y', data=np.linspace(0, 5, 11),
                            attrs=dict(units='mm', long_name='y'),
                            make_scale=True)
    # scales are attached in axis order: axis 0 -> y (11 values), axis 1 -> x (5 values)
    h5.create_dataset('vel', data=np.random.random((11, 5)),
                      attrs=dict(units='m/s', long_name='velocity'),
                      attach_scales=(dsy, dsx))
    h5.dump()

## Array Slicing

Slicing an HDF5 dataset returns an `xarray.DataArray`:

In [None]:
with h5tbx.File(h5.hdf_filename) as h5:
    data = h5.vel[:]
data

## Advantages of retrieving `xarray.DataArray`

A few of the advantages:
- attributes (auxiliary information) travel with the array via `.attrs` (copied from the HDF5 dataset to the `xarray.DataArray`)
- dimensions and coordinates (1D arrays) allow addressing the axes by label rather than by index
- operations (computations, visualizations) can make use of the metadata

In [None]:
data.attrs

Select a subarray by specifying a coordinate value for a given axis (coordinate):

In [None]:
data.sel(y=2.0)
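
Label-based selection can also be tried in plain `xarray`, independent of any HDF5 file. The sketch below assumes a hand-built array with the same coordinate layout as the example file. It contrasts an exact match, a nearest-neighbour lookup and a label slice.

```python
import numpy as np
import xarray as xr

# Illustrative array with the same coordinate layout as the example file
da = xr.DataArray(
    np.arange(55.0).reshape(11, 5),
    dims=("y", "x"),
    coords={"y": np.linspace(0, 5, 11), "x": np.linspace(0, 10, 5)},
)

row = da.sel(y=2.0)                     # exact label match, drops the y axis
near = da.sel(y=2.2, method="nearest")  # closest available label is y=2.0
band = da.sel(y=slice(1.0, 2.0))        # label slices include both end points

print(row.shape, float(near.y), band.shape)
```

An exact `.sel` raises a `KeyError` if the label does not exist in the coordinate, so `method="nearest"` is often the more forgiving choice for floating-point coordinates.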

Plot the data using information from the attributes and coordinates:

In [None]:
data.plot.contourf()

## Circumventing the return of `xarray.DataArray` objects

In some cases, `xarray.DataArray` objects are not needed, and it may be more convenient to work with the default h5py behavior, i.e. `numpy.ndarray` objects.

If we already have the `xarray` object, we can simply use its `.values` property. Otherwise, there are two options to retrieve a `numpy.ndarray`. The first is to slice via `.values` on the dataset:

In [None]:
with h5tbx.File(h5.hdf_filename) as h5:
    data_np = h5.vel.values[:]
type(data_np)

The second is to use the configuration setter, here only for this code snippet (using context-manager syntax):

In [None]:
with h5tbx.set_config(return_xarray=False):
    with h5tbx.File(h5.hdf_filename) as h5:
        # with return_xarray=False, plain slicing is expected to yield a numpy array
        data_np = h5.vel[:]
type(data_np)

## Selecting data (`.sel`)

HDF5 datasets can be very large. It is therefore inefficient to first load a large array by slicing and only then apply xarray's useful [selection](https://docs.xarray.dev/en/stable/user-guide/indexing.html) methods. The `h5rdmtoolbox` allows calling `.sel` before slicing, which reduces the amount of data loaded into RAM:

In [None]:
with h5tbx.File(h5.hdf_filename) as h5:
    print('available coords to select from: ', h5.vel.coords().keys())
    xdata = h5.vel.sel(y=2.0)
xdata
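
To illustrate why selecting before loading saves memory, here is a rough sketch in plain `h5py`. It is not how the toolbox implements `.sel`. It only shows the general idea under the assumption that the coordinate is stored as a small 1D dataset: read the coordinate, find the matching index, then read just that part of the large dataset. The file name and dataset contents are made up for the example.

```python
import os
import tempfile

import h5py
import numpy as np

# Write a small throwaway file (illustrative stand-in for a large dataset)
path = os.path.join(tempfile.mkdtemp(), "example.hdf")
with h5py.File(path, "w") as f:
    f.create_dataset("y", data=np.linspace(0, 5, 11))
    f.create_dataset("vel", data=np.arange(55.0).reshape(11, 5))

with h5py.File(path, "r") as f:
    y = f["y"][:]                          # the 1D coordinate is cheap to read
    idx = int(np.argmin(np.abs(y - 2.0)))  # index of the label closest to y=2.0
    row = f["vel"][idx, :]                 # only this row is read from disk

print(idx, row.shape)
```

The full `vel` array never has to be held in memory; only the coordinate and the requested row are read.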

## HDF Dataset with ancillary datasets

Ancillary datasets are datasets that exist in the HDF5 file and are associated with another (parent) dataset. An ancillary dataset must have the same shape as its parent dataset.

A common use case is the association of validation flags or uncertainty data.

Let's add a relative uncertainty of about 2.5% to the dataset "vel". For this, we create the dataset "uncertainty" and attach it to the already existing dataset "vel":

In [None]:
rel_uncertainty = np.clip(np.random.normal(loc=0.025, scale=0.001, size=(11, 5)), 0, None)

In [None]:
with h5tbx.File(h5.hdf_filename, mode='r+') as h5:
    h5.create_dataset('uncertainty', data=rel_uncertainty,
                      units='',
                      attach_scales=('y', 'x'))
    h5.vel.attach_ancillary_dataset(h5.uncertainty)

In [None]:
h5tbx.dump(h5)

The ancillary dataset appears as an `xarray` coordinate when the dataset is sliced:

In [None]:
with h5tbx.File(h5.hdf_filename) as h5:
    u = h5.vel[()]

In [None]:
with h5tbx.File(h5.hdf_filename) as h5:
    print('available ancillary datasets: ', h5.vel.ancillary_datasets)
    data = h5.vel.sel(y=3.1, method='nearest')
data.coords
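
The resulting structure can be approximated in plain `xarray`, where a same-shaped array is attached as a non-index coordinate. This is a sketch of the concept with randomly generated data, not the toolbox's internal code; the names `vel` and `unc` are chosen for illustration.

```python
import numpy as np
import xarray as xr

rng = np.random.default_rng(0)
vel = xr.DataArray(
    rng.random((11, 5)),
    dims=("y", "x"),
    coords={"y": np.linspace(0, 5, 11), "x": np.linspace(0, 10, 5)},
    name="vel",
)

# Attach a same-shaped array as a non-index coordinate
unc = rng.normal(loc=0.025, scale=0.001, size=(11, 5))
vel = vel.assign_coords(uncertainty=(("y", "x"), unc))

# Selecting along y also subsets the attached coordinate
sub = vel.sel(y=3.1, method="nearest")
print(list(sub.coords))
```

Because the coordinate shares the dimensions of the data, any selection on the data is applied to the coordinate as well, so values and their uncertainties stay aligned.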

## Conditional data selection

Because the ancillary dataset is available as a coordinate, it can be used to mask the data, for example to keep only the values whose uncertainty is below a threshold:

In [None]:
with h5tbx.File(h5.hdf_filename) as h5:
    data = h5.vel[()]

# data.uncertainty.plot.hist()
data.where(data.uncertainty<0.025).plot()
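
The effect of `.where` is easiest to see on a tiny array. The following sketch uses made-up values and plain `xarray`; entries that do not satisfy the condition are replaced by `NaN` while the shape is preserved.

```python
import numpy as np
import xarray as xr

# Illustrative 2x2 data with an attached uncertainty coordinate
da = xr.DataArray(
    np.array([[1.0, 2.0], [3.0, 4.0]]),
    dims=("y", "x"),
    coords={
        "uncertainty": (("y", "x"), np.array([[0.01, 0.03], [0.02, 0.04]])),
    },
)

# Keep values with uncertainty below the threshold, mask the rest with NaN
masked = da.where(da.uncertainty < 0.025)
print(masked.values)
```

Passing `drop=True` to `.where` additionally removes rows or columns along which every value is masked, which can be useful before plotting.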