# Create test file `B`

- Replace a dataset with a group
    - e.g. by turning `run1` into `run1/a` and `run1/b`
    - Maybe `run1` -> `run1/a`?
- Change dtype of dataset
- Change ndim of dataset
- Change shape of dataset while keeping ndim the same

In [1]:
!pwd

/home/ludo/lbl/deduce/try-hdf5


In [2]:
!ls -lh *.h5

-rw-r--r-- 1 ludo ludo 7.6K Jul 16 14:55 A.h5
-rw-r--r-- 1 ludo ludo  14K Jul 16 17:21 B.h5


In [3]:
!cp A.h5 B.h5

In [4]:
!ls -lh *.h5

-rw-r--r-- 1 ludo ludo 7.6K Jul 16 14:55 A.h5
-rw-r--r-- 1 ludo ludo 7.6K Jul 16 17:39 B.h5


In [5]:
import h5py
import numpy as np

In [6]:
B = h5py.File('B.h5', mode='r+')
B

<HDF5 file "B.h5" (mode r+)>

In [7]:
B.keys()

<KeysViewHDF5 ['analysis', 'data_clean', 'data_raw']>

In [8]:
B['data_raw']

<HDF5 group "/data_raw" (7 members)>

### Change dataset ndim and shape

- In general, a dataset's content can be modified in place using the `dataset[...] = arr` syntax
- This requires the shape to be identical, so in this case we have to delete the existing dataset and recreate it under the same name
- Attributes/comments, if present, belong to the deleted object, so they must be copied to the new dataset manually
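
The delete-and-recreate pattern described above can be wrapped in a small helper. This is only a sketch: `replace_dataset` is a hypothetical name, not part of h5py, and the demo uses an in-memory file (`driver='core'`) with made-up data and a made-up `uid` attribute so that it does not touch `B.h5`.

```python
import h5py
import numpy as np


def replace_dataset(parent, name, data):
    """Hypothetical helper: replace parent[name] with a new dataset, keeping its attributes."""
    # Read the attributes into memory before the old dataset is unlinked
    saved_attrs = dict(parent[name].attrs)
    del parent[name]
    new_dset = parent.create_dataset(name, data=data)
    for key, val in saved_attrs.items():
        new_dset.attrs[key] = val
    return new_dset


# In-memory file used only for this demo (nothing is written to disk)
with h5py.File('replace_demo.h5', 'w', driver='core', backing_store=False) as f:
    dset = f.create_dataset('exp0', data=np.arange(6))
    dset.attrs['uid'] = 'demo-uid'

    # Same shape: the content can be overwritten in place
    dset[...] = np.arange(6) * 2

    # Different ndim/shape: the dataset has to be replaced
    new = replace_dataset(f, 'exp0', dset[()].reshape(2, 3))
    new_shape = new.shape
    new_uid = new.attrs['uid']
    new_first_row = new[0].tolist()
```

Reading the attributes into a plain `dict` first means the helper does not depend on the old dataset handle staying usable after `del`.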

In [9]:
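# Keep a handle to the original dataset: it stays readable after its link is deleted
dset0 = B['data_raw/exp0']

del B['data_raw/exp0']
B.create_dataset('data_raw/exp0', data=dset0[()].reshape(2, 3))

# Copy the attributes from the original dataset to its replacement
for key, val in dset0.attrs.items():
    B['data_raw/exp0'].attrs.create(key, val)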

B['data_raw/exp0']

<HDF5 dataset "exp0": shape (2, 3), type "<i8">

### Change dataset `shape` with same `ndim`

In [10]:
dset5 = B['data_raw/exp5']
dset5

<HDF5 dataset "exp5": shape (5,), type "<i8">

In [11]:
del B['data_raw/exp5']
B.create_dataset('data_raw/exp5', data=np.append(dset5[()], dset5[()][::-1]))

for key, val in dset5.attrs.items():
    B['data_raw/exp5'].attrs.create(key, val)

B['data_raw/exp5']

<HDF5 dataset "exp5": shape (10,), type "<i8">

### Change dataset dtype

In [12]:
dset2 = B['data_raw/exp2']
dset2

<HDF5 dataset "exp2": shape (7,), type "<i8">

In [13]:
del B['data_raw/exp2']
B.create_dataset('data_raw/exp2', data=dset2[()].astype(float))

for key, val in dset2.attrs.items():
    B['data_raw/exp2'].attrs.create(key, val)

B['data_raw/exp2']

<HDF5 dataset "exp2": shape (7,), type "<f8">

### Replace a dataset with a group

- Split `exp1` into two runs, `run0` and `run1`
- `exp1` is now a group instead of a dataset
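
To check which kind of object sits at a given path before and after such a replacement, `isinstance` against `h5py.Dataset` and `h5py.Group` is enough. The sketch below repeats the split on an in-memory file with made-up data; it reads the values into memory before deleting the link, which is a slightly more defensive variant of the cells that follow.

```python
import h5py
import numpy as np

# In-memory file used only for this demo (nothing is written to disk)
with h5py.File('group_demo.h5', 'w', driver='core', backing_store=False) as f:
    # Intermediate groups ('data_raw') are created automatically
    f.create_dataset('data_raw/exp1', data=np.arange(8))
    was_dataset = isinstance(f['data_raw/exp1'], h5py.Dataset)

    # Read the values first, then replace the dataset with a group
    data = f['data_raw/exp1'][()]
    del f['data_raw/exp1']
    grp = f.create_group('data_raw/exp1')
    grp.create_dataset('run0', data=data)
    grp.create_dataset('run1', data=data[::-1])

    is_group = isinstance(f['data_raw/exp1'], h5py.Group)
    members = sorted(grp.keys())
```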

In [14]:
dset1 = B['data_raw/exp1']

del B['data_raw/exp1']
dset1

<HDF5 dataset ("anonymous"): shape (8,), type "|S1">
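
The `("anonymous")` in the output above reflects that `del` only removes the link from the file hierarchy: the open handle keeps the object readable, but it no longer has a path. A minimal sketch on an in-memory file with made-up data:

```python
import h5py
import numpy as np

# In-memory file used only for this demo (nothing is written to disk)
with h5py.File('anonymous_demo.h5', 'w', driver='core', backing_store=False) as f:
    dset = f.create_dataset('exp1', data=np.arange(8))
    del f['exp1']

    still_linked = 'exp1' in f       # the path is gone from the file...
    name_after = dset.name           # ...so the open handle has no name
    values = dset[()].tolist()       # but its data can still be read
```

This is why the notebook can read `dset1[()]` and `dset1.attrs` after the `del`, as long as the handle and the file stay open.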

In [15]:
grp1 = B['data_raw'].create_group('exp1')
grp1

<HDF5 group "/data_raw/exp1" (0 members)>

In [16]:
data1 = dset1[()]

grp1.create_dataset('run0', data=data1)
grp1.create_dataset('run1', data=data1[::-1])

<HDF5 dataset "run1": shape (8,), type "|S1">

In [17]:
# Both runs inherit the attributes of the original dataset
for k, v in dset1.attrs.items():
    grp1['run0'].attrs.create(k, v)
    grp1['run1'].attrs.create(k, v)
# Give run1 its own uid and a description
grp1['run1'].attrs.modify('uid', b'2019-07-03_002')
grp1['run1'].attrs.create('description', b'Second version of the same dataset')

### Change dataset `value` (same `shape`)

- Add an array of a single repeated value with alternating sign (-, +) so that:
    - The mean delta is 0 (when the length is even)
    - Signed and unsigned (absolute) deltas can be told apart
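
As a quick numerical check of the two points above, here is a sketch using NumPy only. The `old` array is a hypothetical stand-in for the content of `exp6`, which is not shown in this notebook.

```python
import numpy as np

# Hypothetical stand-in for the original content of exp6
old = np.linspace(10.0, 15.0, 6)

# Same alternating delta as used in the notebook cell
delta = np.array([-1.234, 1.234] * 3)
new = old + delta

# The signed deltas cancel out for an even length, the absolute ones do not
signed_mean = (new - old).mean()
abs_mean = np.abs(new - old).mean()

# With an odd length, one delta is left unpaired and the signed mean is nonzero
odd_signed_mean = np.array([-1.234, 1.234, -1.234]).mean()
```

A comparison that only reports the mean signed difference would see no change here, while one based on absolute differences would report 1.234.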

In [18]:
delta6 = [-1.234, 1.234] * 3
B['data_raw/exp6'][...] = B['data_raw/exp6'][...] + delta6
B['data_raw/exp6']

<HDF5 dataset "exp6": shape (6,), type "<f8">

### Change dataset attributes

In [19]:
dset3_attrs = B['data_raw/exp3'].attrs
dset3_attrs.modify('temperature', dset3_attrs['temperature'] + 5.)
dset3_attrs.create('temperature_comment', b'Temperature was actually higher')

### Add/remove datasets

#### Move ("rename") existing dataset

- If datasets are matched using `name` as key, the move will show up as one `REMOVED` and one `ADDED` entry
- If they are matched using e.g. `attrs['uid']` as key, it will show up as renamed
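
The effect of the key choice can be sketched with a toy comparison. Everything here is an assumption made for illustration: `compare`, `key_by_name` and `key_by_uid` are hypothetical helpers, not the comparison tool this test file is meant for, and the two in-memory files only mimic `A.h5` and `B.h5` with a made-up `uid`.

```python
import h5py
import numpy as np


def key_by_name(name, dset):
    return name


def key_by_uid(name, dset):
    return dset.attrs['uid']


def compare(group_a, group_b, key_func):
    """Hypothetical comparison: match the members of two groups by key_func."""
    names_a = {key_func(n, d): n for n, d in group_a.items()}
    names_b = {key_func(n, d): n for n, d in group_b.items()}
    common = names_a.keys() & names_b.keys()
    return {
        'added': sorted(names_b.keys() - names_a.keys()),
        'removed': sorted(names_a.keys() - names_b.keys()),
        'renamed': sorted((names_a[k], names_b[k])
                          for k in common if names_a[k] != names_b[k]),
    }


# Two in-memory files standing in for A.h5 and B.h5
with h5py.File('a_demo.h5', 'w', driver='core', backing_store=False) as a, \
        h5py.File('b_demo.h5', 'w', driver='core', backing_store=False) as b:
    for f in (a, b):
        dset = f.create_dataset('data_raw/exp4', data=np.arange(3.0))
        dset.attrs['uid'] = 'uid-004'

    # Same operation as in the notebook: the attributes move with the dataset
    b['data_raw'].move('exp4', 'experiment_4')

    by_name = compare(a['data_raw'], b['data_raw'], key_by_name)
    by_uid = compare(a['data_raw'], b['data_raw'], key_by_uid)
```

Keyed by name, the result lists `exp4` as removed and `experiment_4` as added. Keyed by `uid`, both names map to the same key and the pair is reported as renamed.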

In [20]:
B['data_raw'].move('exp4', 'experiment_4')

# introducing a single-value change for verification
B['data_raw/experiment_4'][0] = 0.

#### Simulating a data cleaning step

In [21]:
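def clean_data(file, key_raw, key_clean, threshold=10.):
    """Store a copy of `key_raw` at `key_clean`, with values <= threshold replaced by NaN."""
    raw = file[key_raw][()]
    # Keep values strictly above the threshold; the result is always float because of NaN
    clean = np.where(raw > threshold, raw, np.nan)
    dset_clean = file.create_dataset(key_clean, data=clean)
    # Record the threshold used, so the cleaning step can be traced later
    dset_clean.attrs.create('threshold', threshold)
    return dset_clean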

In [22]:
clean_data(B, 'data_raw/exp2', 'data_clean/exp2', threshold=20)

<HDF5 dataset "exp2": shape (7,), type "<f8">

In [23]:
clean_data(B, 'data_raw/exp0', 'data_clean/exp0', threshold=4)

<HDF5 dataset "exp0": shape (2, 3), type "<f8">

In [24]:
B['data_clean']

<HDF5 group "/data_clean" (2 members)>
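
To see what the masking in `clean_data` does to the values, here is a NumPy-only sketch with hypothetical input (the actual content of `exp0` is not shown in this notebook). It illustrates two details: the comparison is strict, so a value equal to the threshold is masked, and integer input becomes float because NaN has no integer representation.

```python
import numpy as np

# Hypothetical integer input with the same shape as data_raw/exp0
raw = np.arange(6).reshape(2, 3)
threshold = 4

# Same expression as in clean_data
clean = np.where(raw > threshold, raw, np.nan)

n_masked = int(np.isnan(clean).sum())   # values <= threshold become NaN
kept = clean[~np.isnan(clean)].tolist() # values strictly above the threshold
result_kind = clean.dtype.kind          # 'f' for floating point
```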

In [25]:
B.close()