# Dataset creation

Dataset creation works almost as known from `h5py`. However, to achieve FAIRness, some additional parameters are mandatory and some new ones are optional. These parameters and additional features are explained here. Note that slicing an HDF5 dataset through the `h5rdmtoolbox` returns an `xarray` object rather than a `numpy` array. This is the default setting, although it can be bypassed. In any case, it is worth checking out the `xarray` package [here](https://docs.xarray.dev/en/stable/).

In [16]:
import h5rdmtoolbox as h5tbx
import numpy as np
import xarray as xr

Mandatory parameters during dataset creation known from `h5py` are `name` and either `data` or `shape`. The toolbox adds new obligatory parameters: either `standard_name` or `long_name`, and a physical unit parameter called `units`. These add the minimal required information to the raw values.

The user may select between `standard_name` and `long_name`. The latter is a human-readable (short) description of the dataset without any syntax restriction:

In [17]:
with h5tbx.H5File() as h5:
    h5.create_dataset('x', shape=(4,),
                      units='m', long_name='coordinate in x direction')

Standard names are defined in a [standard name table](./Conventions.ipynb) and are subject to certain syntax rules, e.g. no spaces are allowed:

In [18]:
with h5tbx.H5File() as h5:
    h5.create_dataset('x', shape=(4,),
                      units='m', standard_name='x_coordinate')
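
The "no spaces" rule can be illustrated with a small, purely hypothetical checker. The function `looks_like_standard_name` and its pattern (lowercase letters, digits and underscores, starting with a letter) are assumptions made for illustration only; the toolbox's actual validation against the standard name table may be stricter or different.

```python
import re

# Assumed pattern for illustration, not taken from the toolbox:
# a lowercase letter followed by lowercase letters, digits or underscores.
_NAME_PATTERN = re.compile(r'[a-z][a-z0-9_]*')


def looks_like_standard_name(name: str) -> bool:
    """Return True if `name` matches the assumed syntax (no spaces etc.)."""
    return _NAME_PATTERN.fullmatch(name) is not None


print(looks_like_standard_name('x_coordinate'))               # True
print(looks_like_standard_name('coordinate in x direction'))  # False
```

A free-text description such as the second string fails this check, which is why it belongs in `long_name` rather than in `standard_name`.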

**Note**: If the data is unitless, pass `units=''`

The name of the dataset is its path within the HDF5 file. The dataset can be created even if the (sub-)groups on that path do not exist yet.

In [19]:
with h5tbx.H5File() as h5:
    h5.create_dataset('grp/subgrp/x', shape=(4,),
                      units='m', standard_name='x_coordinate')
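
The underlying `h5py` behaviour can be checked with a short sketch: plain `h5py` creates missing intermediate groups when a dataset is created with a nested path. The sketch uses `h5py` directly (so no `units` or name attributes are involved) and an in-memory file; the file name `groups_demo.h5` is a placeholder and nothing is written to disk.

```python
import h5py

# In-memory HDF5 file: the 'core' driver with backing_store=False
# keeps everything in RAM and never writes 'groups_demo.h5' to disk.
with h5py.File('groups_demo.h5', 'w', driver='core', backing_store=False) as f:
    f.create_dataset('grp/subgrp/x', shape=(4,))

    # Both intermediate groups now exist although they were never created explicitly.
    groups_exist = 'grp' in f and 'grp/subgrp' in f
    is_group = isinstance(f['grp/subgrp'], h5py.Group)

print(groups_exist, is_group)
```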

## Dimension scales

Dimension scales can be defined during dataset creation. Let `time` be the dimension scale and `pressure` be the dataset to which it is attached.

In [20]:
fname_dimcales = h5tbx.generate_temporary_filename()
with h5tbx.H5File(fname_dimcales, 'w') as h5:
    h5.create_dataset('time', data=[0,1,2,3,4,5],
                      units='s', standard_name='time', make_scale=True)
    h5.create_dataset('pressure', data=np.random.rand(6),
                      units='Pa', standard_name='pressure', attach_scale=h5['time'])
    p = h5.pressure[:]
p

To remain compatible with `xarray`, single-value "dimension scales" are set via the attribute `COORDINATES`. In our case, an example is the location of the pressure sensor. Let's first create the datasets and then register them in the `COORDINATES` attribute of "pressure":

In [21]:
with h5tbx.H5File(fname_dimcales, 'r+') as h5:
    h5.create_dataset('x', data=5.32, units='m',
                      standard_name='x_coordinate')
    h5.create_dataset('y', data=-3.1, units='m',
                      standard_name='y_coordinate')
    h5['pressure'].attrs['COORDINATES'] = ('x', 'y')
    p = h5.pressure[:]
p
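
On the `xarray` side, such single values correspond to scalar (zero-dimensional) coordinates. The following sketch builds a comparable object by hand with plain `xarray`; it only illustrates the target data model and makes no claim about how the toolbox assembles its return value.

```python
import numpy as np
import xarray as xr

pressure = xr.DataArray(
    np.random.rand(6),
    dims=('time',),
    coords={
        'time': [0, 1, 2, 3, 4, 5],  # dimension coordinate (the "scale")
        'x': 5.32,                   # scalar coordinate: sensor position
        'y': -3.1,
    },
    attrs={'units': 'Pa'},
)

# Scalar coordinates do not add a dimension to the array.
print(pressure.dims)
print(pressure.coords['x'].ndim, float(pressure.coords['x']))
```

The scalar coordinates travel with the array through slicing, which is what makes them a natural place for metadata such as a sensor position.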

### String datasets
String datasets can be created very quickly: neither `standard_name`, `long_name` nor `units` is required. Units generally make no sense for strings, but a long name or a standard name can still be passed via the method parameters.<br>
The `dump()` method displays single strings but not lists of strings.<br>
Slicing still returns an `xarray.DataArray`, so that attributes remain attached to the object. Use `.values` to get the raw string:

In [22]:
with h5tbx.H5File() as h5:
    h5.create_string_dataset('astr', 'hello_world')
    h5.create_string_dataset('string_list', ['hello', 'world'])
    h5.dump()

    print('---\n', h5['astr'][()])
    print('---\n', h5['astr'].values[()])

    print('---\n', h5['string_list'][:])
    print('---\n', h5['string_list'].values[:])

---
 <xarray.DataArray 'astr' ()>
array(b'hello_world', dtype='|S11')
---
 b'hello_world'
---
 <xarray.DataArray 'string_list' (dim_0: 2)>
array([b'hello', b'world'], dtype='|S5')
Dimensions without coordinates: dim_0
---
 [b'hello' b'world']
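
As the output shows, the raw values are byte strings (`dtype='|S11'`, `'|S5'`). A minimal sketch of turning such values into Python `str` objects is given below; it works on hand-made `numpy` arrays that mimic the output above and assumes UTF-8 encoded content.

```python
import numpy as np

# Hand-made stand-ins for the raw values shown in the output above.
single = np.array(b'hello_world')
many = np.array([b'hello', b'world'])

# A 0-d array is unpacked with [()]; the resulting bytes object is then decoded.
text = single[()].decode('utf-8')
# For 1-d arrays, decode element by element.
texts = [item.decode('utf-8') for item in many]

print(text)   # hello_world
print(texts)  # ['hello', 'world']
```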


### Advanced dataset creation

There is more to dataset creation. You can:
- add attributes

In [23]:
with h5tbx.H5File() as h5:
    # unitless dataset; long_name is passed via the parameter attrs
    h5.create_dataset('ds', shape=(10, ), units='',
                      attrs=dict(long_name='a long name', anothera='another attr'))

- make and attach scales (Note the output using `dump()`: the scale "link" is shown)

In [24]:
with h5tbx.H5File() as h5:
    h5.create_dataset('x', data=[1,2,3], units='m', standard_name='x_coordinate', make_scale=True)
    h5.create_dataset('t', data=[20.1, 18.5, 24.7], units='degC', standard_name='temperature', attach_scale=h5['x'])
    print(h5.t.x)  # note, that you can access the dimension scale using attribute-style-syntax
    h5.dump()

<HDF5 dataset "x": shape (3,), type "<i4">


- add `xarray.DataArray` objects

In [25]:
arr = xr.DataArray(
    dims=('y', 'x'),
    data=np.random.rand(3, 2),
    coords={
        'y': xr.DataArray(dims='y', data=[1, 2, 3],
                          attrs={'units': 'm',
                                 'standard_name': 'y_coordinate'}),
        'x': xr.DataArray(dims='x', data=[0, 1],
                          attrs={'standard_name': 'x_coordinate'}),
    },
    attrs={'long_name': 'a long name',
           'units': 'm/s'},
)

with h5tbx.H5File() as h5:
    h5.create_dataset('temperature', data=arr)
    h5.dump()

- add an `xarray.Dataset`

In [26]:
ds = xr.Dataset({'foo': [1,2,3], 'bar': ('x', [1, 2]), 'baz': np.pi})
ds

None of the variables in `ds` carries a `units` attribute yet, so the first attempt to write it raises a `UnitsError`:

In [27]:
try:
    with h5tbx.H5File() as h5:
        h5.create_dataset_from_xarray_dataset(ds)
except h5tbx.errors.UnitsError as e:
    print(e)

Units cannot be None. A dimensionless dataset has units ""
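
To avoid running into this error, the variables of an `xarray.Dataset` can be inspected before writing. The sketch below uses plain `xarray` on a small stand-in dataset; the variable names `u` and `v` are made up for illustration, and the check only looks for a `units` attribute, not for the name attributes the toolbox also requires.

```python
import xarray as xr

# Stand-in dataset: 'u' has units, 'v' does not.
demo = xr.Dataset({
    'u': ('x', [1.0, 2.0], {'units': 'm/s'}),
    'v': ('x', [3.0, 4.0]),
})

# Collect every data variable that lacks a 'units' attribute.
missing_units = [name for name, var in demo.data_vars.items()
                 if var.attrs.get('units') is None]

print(missing_units)  # ['v']
```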


In [28]:
ds.foo.attrs['units']='m'
ds.foo.attrs['long_name']='foo'

ds.bar.attrs['units']='m'
ds.bar.attrs['long_name']='bar'

ds.baz.attrs['units']='m'
ds.baz.attrs['long_name']='baz'

ds

In [29]:
with h5tbx.H5File() as h5:
    h5.create_dataset_from_xarray_dataset(ds)

Datasets can also be created via `__setitem__`, i.e. by assigning the data together with the units and further parameters:

In [30]:
with h5tbx.H5File() as h5:
    h5['x'] = [1,2,3], 'm/s', {'long_name':'hallo'}
with h5tbx.H5File() as h5:
    h5['x'] = ([1,2,3], 'm/s', 'long_name', 'standard_name')
with h5tbx.H5File() as h5:
    h5['x'] = ([1,2,3], dict(units='m/s', long_name='long_name',
                             attrs={'hello': 'world'}, compression='gzip'))