# Purpose of package *h5py*

- provide a Python interface that stays as close as possible to the interface of the HDF5 C library
  and exposes almost all of its features, and
- map the feature set of HDF5 as closely as possible onto *NumPy* features, e.g.:

  + high level type system uses NumPy *dtype* objects
  + attribute and naming conventions like NumPy
  + mapping between HDF5 errors and exceptions
  
h5py makes HDF5 "pythonic" but adds no features of its own (e.g. no indexed search; see *PyTables* for that).
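
A minimal sketch of two of the mappings listed above, assuming h5py and NumPy are installed. The file name and dataset name are made up for illustration, and the file lives only in memory (`core` driver):

```python
import h5py
import numpy as np

# In-memory file (core driver, nothing is written to disk); the name is arbitrary.
with h5py.File("purpose-demo.h5", "w", driver="core", backing_store=False) as f:
    ds = f.create_dataset("x", data=np.arange(5, dtype="int16"))

    # The HDF5 datatype is exposed as a NumPy dtype, the dataspace as a shape tuple.
    dtype_is_numpy = isinstance(ds.dtype, np.dtype)
    shape = ds.shape

    # A failed HDF5 lookup surfaces as a regular Python exception.
    try:
        f["missing"]
        error_name = None
    except KeyError as exc:
        error_name = type(exc).__name__

print(dtype_is_numpy, shape, error_name)
```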
  

# Installation

Conda (binary package incl. dependencies):

    conda install h5py
   
pip:

    pip install h5py
   
From the h5py tutorial:

> Installing on Windows from source is practically impossible because
> of the C library dependencies involved.
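
To check that the installation worked, it may help to print the versions of the package and of the HDF5 library it was built against. A small sketch, assuming the install succeeded:

```python
import h5py

# Versions of the Python package and of the HDF5 C library it was built against.
h5py_version = h5py.version.version
hdf5_version = h5py.version.hdf5_version
print("h5py", h5py_version, "/ HDF5", hdf5_version)
```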


# Creating an HDF5 file

In [272]:
try:
    h5.close()  # close a handle left over from a previous run
except NameError:
    pass  # h5 has not been defined yet

In [275]:
!rm example-h5py.h5

rm: cannot remove 'example-h5py.h5': No such file or directory


In [276]:
import h5py
h5py.enable_ipython_completer()

In [277]:
h5 = h5py.File('example-h5py.h5', 'w')
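
Outside of an interactive session, a `with` block is usually the safer way to open a file, because it closes the file even if an error occurs. A sketch under the assumption that a temporary directory is acceptable as scratch space (the path is hypothetical):

```python
import os
import tempfile

import h5py

# Hypothetical scratch location so the example does not touch existing files.
path = os.path.join(tempfile.mkdtemp(), "context-demo.h5")

# "w" creates the file (truncating an existing one); the block closes it on exit.
with h5py.File(path, "w") as f:
    f.create_group("sim")

# "r" reopens the file read-only.
with h5py.File(path, "r") as f:
    names = list(f.keys())
    mode = f.mode

# A closed File object evaluates to False.
is_open_after_block = bool(f)
print(names, mode, is_open_after_block)
```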

## Creating some groups and a link

The `File` object is at the same time the *root* group:

In [278]:
h5.name

'/'

In [279]:
g_sim = h5.create_group('sim')

In [280]:
g_sim

<HDF5 group "/sim" (0 members)>

In [281]:
g1 = h5.create_group('/sim/001')

In [282]:
g2 = g_sim.create_group('002')

In [283]:
g2

<HDF5 group "/sim/002" (0 members)>

Groups are addressed with POSIX-style paths:

In [284]:
g2.name

'/sim/002'

In [285]:
list(g_sim.keys())

['001', '002']
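
A short sketch of path-like addressing, using an in-memory file with made-up group names:

```python
import h5py

with h5py.File("paths-demo.h5", "w", driver="core", backing_store=False) as f:
    # Missing intermediate groups ("sim", "001") are created automatically.
    run = f.create_group("sim/001/run")

    # Absolute paths and step-by-step indexing reach the same object.
    same_object = f["/sim/001/run"] == f["sim"]["001"]["run"]

    # Membership tests accept paths as well.
    has_group = "sim/001" in f
    parent_name = run.parent.name

    # require_group returns the existing group instead of raising an error.
    existing_name = f.require_group("sim/001").name

print(same_object, has_group, parent_name, existing_name)
```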

In [286]:
# g.create_dataset?

Creating a hard link (not a copy!):

In [287]:
h5['link-to-002'] = g2

In [288]:
list(h5.keys()) 

['link-to-002', 'sim']

<div class="alert alert-block alert-info">
When using h5py from Python 3, the keys(), values() and items() methods will return view-like objects instead of lists. These objects support containership testing and iteration, but can’t be sliced like lists.
</div>

In [289]:
h5['link-to-002'] == h5['/sim/002']

True
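
The difference between a hard link and a soft link shows up once the original name is deleted. A sketch with hypothetical link names, again on an in-memory file:

```python
import h5py

with h5py.File("links-demo.h5", "w", driver="core", backing_store=False) as f:
    g = f.create_group("sim/002")
    g.attrs["run"] = 2

    f["hard"] = g                          # hard link: a second name for the same object
    f["soft"] = h5py.SoftLink("/sim/002")  # soft link: stores only the path

    # Removing the original name does not delete the group while a hard link remains.
    del f["sim/002"]
    run_via_hard = int(f["hard"].attrs["run"])

    # The soft link now points to a path that no longer exists.
    try:
        f["soft"]
        soft_resolves = True
    except KeyError:
        soft_resolves = False

print(run_via_hard, soft_resolves)
```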

# Create datasets

Datasets:

- like NumPy arrays: homogeneous collections of data elements, with an immutable datatype and (hyper)rectangular shape
- additionally: compression, error-detection, and chunked I/O
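
The additional storage features are requested when the dataset is created. A sketch, assuming the gzip and Fletcher-32 filters are available (they are in standard HDF5 builds); the chunk shape and compression level are arbitrary choices for illustration:

```python
import h5py

with h5py.File("filters-demo.h5", "w", driver="core", backing_store=False) as f:
    ds = f.create_dataset(
        "A",
        shape=(1000, 10, 11),
        dtype="float32",
        chunks=(100, 10, 11),   # unit of I/O and compression
        compression="gzip",     # lossless compression filter
        compression_opts=4,     # gzip level (0-9)
        fletcher32=True,        # checksum per chunk for error detection
    )
    ds[0] = 1.0                 # write one (10, 11) slab; the rest stays zero

    settings = (ds.chunks, ds.compression, ds.compression_opts, ds.fletcher32)
    first_slab_sum = float(ds[0].sum())
    rest_is_zero = not ds[1:].any()

print(settings, first_slab_sum, rest_is_zero)
```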


Creating a dataset filled with zeros:

In [290]:
dsA = g1.create_dataset("A", shape=(10000,10,11), dtype='float32')
# dsA = g1.create_dataset("A", shape=(10000,10,11), dtype='float32', compression=7)

In [291]:
dsA

<HDF5 dataset "A": shape (10000, 10, 11), type "<f4">

Or from existing data:

In [292]:
import numpy as np

In [293]:
arr = np.random.randn(200,3)

In [294]:
dsB = g1.create_dataset("B", data=arr)

In [295]:
list(dsB.dims.keys())

[<"" dimension 0 of HDF5 dataset at 140144412940472>,
 <"" dimension 1 of HDF5 dataset at 140144412940472>]

In [296]:
# broadcast the values 1..11 to every second index along axis 1
dsA[:, ::2, :] = np.arange(1, 12)
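
Reading works with the same slicing syntax and only transfers the selected part from the file. A sketch on a smaller dataset (the shape is chosen for illustration only):

```python
import h5py
import numpy as np

with h5py.File("slicing-demo.h5", "w", driver="core", backing_store=False) as f:
    ds = f.create_dataset("A", shape=(4, 6, 11), dtype="float32")

    # The 11-element row is broadcast to every even index along axis 1.
    ds[:, ::2, :] = np.arange(1, 12)

    # Slicing reads only the selected part into a NumPy array.
    even_row = ds[0, 0, :3]
    odd_row = ds[0, 1, :3]
    everything = ds[()]   # the whole dataset as one array

print(even_row, odd_row, everything.shape)
```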

In [297]:
dsA[()]  # read the whole dataset; `.value` is deprecated and was removed in h5py 3.0

array([[[  1.,   2.,   3., ...,   9.,  10.,  11.],
        [  0.,   0.,   0., ...,   0.,   0.,   0.],
        [  1.,   2.,   3., ...,   9.,  10.,  11.],
        ..., 
        [  0.,   0.,   0., ...,   0.,   0.,   0.],
        [  1.,   2.,   3., ...,   9.,  10.,  11.],
        [  0.,   0.,   0., ...,   0.,   0.,   0.]],

       [[  1.,   2.,   3., ...,   9.,  10.,  11.],
        [  0.,   0.,   0., ...,   0.,   0.,   0.],
        [  1.,   2.,   3., ...,   9.,  10.,  11.],
        ..., 
        [  0.,   0.,   0., ...,   0.,   0.,   0.],
        [  1.,   2.,   3., ...,   9.,  10.,  11.],
        [  0.,   0.,   0., ...,   0.,   0.,   0.]],

       [[  1.,   2.,   3., ...,   9.,  10.,  11.],
        [  0.,   0.,   0., ...,   0.,   0.,   0.],
        [  1.,   2.,   3., ...,   9.,  10.,  11.],
        ..., 
        [  0.,   0.,   0., ...,   0.,   0.,   0.],
        [  1.,   2.,   3., ...,   9.,  10.,  11.],
        [  0.,   0.,   0., ...,   0.,   0.,   0.]],

       ...,
       (output truncated)

In [298]:
!ls -al *.h5

-rw-rw-r-- 1 mcrot mcrot  4409240 Apr 12 12:03 example-h5py.h5
-rw-rw-r-- 1 mcrot mcrot 31265398 Apr 12 11:39 example-pytables.h5
-rw-rw-r-- 1 mcrot mcrot   252554 Apr 11 13:58 stock.h5


In [299]:
# raw size of dsA in bytes: 10000*10*11 float32 values, 4 bytes each
10000*10*11*4

4400000
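
The expected size can also be derived from the dataset itself instead of being typed in by hand. A sketch; `get_storage_size()` belongs to the low-level dataset identifier and reports the space allocated for the raw data:

```python
import h5py

with h5py.File("size-demo.h5", "w", driver="core", backing_store=False) as f:
    ds = f.create_dataset("A", shape=(10000, 10, 11), dtype="float32")

    # Uncompressed size: number of elements times bytes per element (4 for float32).
    expected_bytes = int(ds.size * ds.dtype.itemsize)

    # Fill the dataset, then ask HDF5 how much space the raw data occupies.
    ds[...] = 1.0
    stored_bytes = ds.id.get_storage_size()

print(expected_bytes, stored_bytes)
```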

In [300]:
h5['/sim/002/C'] = np.random.randint(0,10,1000)

In [301]:
# del h5['/sim/002/C']

In [302]:
h5.flush()

In [303]:
# h5.close()

# Resizing a dataset

In [304]:
dsD = h5.create_dataset("unlimited", (100, 10), maxshape=(None, 10))
dsD[:] = np.random.randn(100,10)
dsD.size

1000

In [305]:
h5.flush()

In [306]:
dsD

<HDF5 dataset "unlimited": shape (100, 10), type "<f4">

In [307]:
dsD.resize(200, axis=0)

In [308]:
dsD.size

2000

In [309]:
dsD[100:,:] = 1

In [310]:
h5.flush()
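
A common use of resizable datasets is appending rows as they arrive. A minimal sketch; `append_rows` is a hypothetical helper, not part of h5py, and the chunk shape is an arbitrary choice:

```python
import h5py
import numpy as np


def append_rows(ds, rows):
    """Grow `ds` along axis 0 and write `rows` into the new space (hypothetical helper)."""
    rows = np.asarray(rows)
    old_len = ds.shape[0]
    ds.resize(old_len + rows.shape[0], axis=0)
    ds[old_len:] = rows


with h5py.File("append-demo.h5", "w", driver="core", backing_store=False) as f:
    # None in maxshape makes axis 0 unlimited; resizable datasets are stored chunked.
    ds = f.create_dataset("log", shape=(0, 10), maxshape=(None, 10),
                          dtype="float32", chunks=(64, 10))
    append_rows(ds, np.zeros((3, 10)))
    append_rows(ds, np.ones((5, 10)))

    final_shape = ds.shape
    first_column = ds[:, 0].tolist()

print(final_shape, first_column)
```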

Inspect the result in *HDFView* or *HDF Compass*!

# Attributes (Metadata)

In [236]:
dsB.attrs['experiment'] = 123
dsB.attrs['description'] = "random data and integers mixed"

In [237]:
list(dsB.attrs.keys())

['DIMENSION_LIST', 'DIMENSION_LABELS', 'experiment', 'description']
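
`attrs` behaves much like a dictionary. A sketch with made-up attribute names and values:

```python
import h5py
import numpy as np

with h5py.File("attrs-demo.h5", "w", driver="core", backing_store=False) as f:
    ds = f.create_dataset("B", data=np.zeros((200, 3)))

    # Scalars, strings and small arrays can all be stored as attributes.
    ds.attrs["experiment"] = 123
    ds.attrs["description"] = "random data"
    ds.attrs["offsets"] = np.array([0.5, 1.5])

    # Iteration, indexing and get() with a default work as for a dict.
    names = sorted(ds.attrs)
    experiment = int(ds.attrs["experiment"])
    offsets = ds.attrs["offsets"].tolist()
    missing = ds.attrs.get("operator", "unknown")

print(names, experiment, offsets, missing)
```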

In [271]:
# h5.close()  # left open here: the following sections still use the file

# References

HDF5 supports not only hard and soft links but also *references*.

<div class="alert alert-block alert-info">
 References are low-level pointers to other objects which can be stored and used like data.
</div>

For example, you can create datasets that contain references, or store references in attributes, e.g. as a pointer to a time vector:

In [254]:
dsB.attrs['D'] = dsD.ref

In [255]:
h5.flush()

In [256]:
h5[dsB.attrs['D']]

<HDF5 dataset "unlimited": shape (200, 10), type "<f4">
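
References can also be the elements of a dataset. A sketch with hypothetical dataset names; the reference dtype is built with `h5py.special_dtype`:

```python
import h5py
import numpy as np

with h5py.File("refs-demo.h5", "w", driver="core", backing_store=False) as f:
    a = f.create_dataset("a", data=np.arange(3))
    b = f.create_dataset("sim/b", data=np.arange(4))

    # Dataset whose elements are object references.
    ref_dtype = h5py.special_dtype(ref=h5py.Reference)
    refs = f.create_dataset("refs", shape=(2,), dtype=ref_dtype)
    refs[0] = a.ref
    refs[1] = b.ref

    # Indexing the file with a reference returns the referenced object.
    target_names = [f[ref].name for ref in refs]

print(target_names)
```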

# Conclusion *h5py*

- relatively easy to use
- well-suited if you mainly work with NumPy arrays
  and want to access those arrays on disk in a similar way
- closely follows HDF5 but adds no features on top, e.g. no indexed search on disk
- also interesting for implementing customized formats on top of HDF5





# (Using dimensions)

Dimensions can be labeled, and dimension scales (datasets holding e.g. coordinates) can be attached to them.

In [311]:
list(dsB.dims.keys())

[<"" dimension 0 of HDF5 dataset at 140144412940472>,
 <"" dimension 1 of HDF5 dataset at 140144412940472>]

In [326]:
# create the scale dataset only once, so that re-running the cell does not fail
if 'position' not in h5:
    h5['position'] = np.arange(200)

In [313]:
h5['position'].make_scale()  # replaces the deprecated dsB.dims.create_scale(h5['position'])

In [314]:
d = dsB.dims[0]

In [315]:
d.attach_scale(h5['position'])

In [316]:
d.label = 'position'

In [317]:
[dim.label for dim in dsB.dims]

['position', 'flag']

In [330]:
list(d.items())

[('', <HDF5 dataset "position": shape (200,), type "<i8">)]
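
The steps above, collected into one self-contained sketch. It assumes h5py 2.10 or later, where `Dataset.make_scale` is available; dataset names and the scale name are chosen for illustration:

```python
import h5py
import numpy as np

with h5py.File("dims-demo.h5", "w", driver="core", backing_store=False) as f:
    data = f.create_dataset("B", data=np.zeros((200, 3)))
    position = f.create_dataset("position", data=np.arange(200))

    # Turn the 1-D dataset into a dimension scale, then attach it to axis 0.
    position.make_scale("position")
    data.dims[0].attach_scale(position)
    data.dims[0].label = "position"

    # An unlabeled dimension reports an empty string.
    labels = [dim.label for dim in data.dims]

    # The attached scale is an ordinary dataset and can be read as such.
    first_scale_name = data.dims[0][0].name
    scale_values = data.dims[0][0][:3].tolist()

print(labels, first_scale_name, scale_values)
```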

In [222]:
h5.flush()