# Author & Notes

This Jupyter Notebook was written by Benjamin S. Meyers ([bsm9339@rit.edu](mailto:bsm9339@rit.edu)). It is a very simple tutorial to get you started with HDF5. It covers datasets, but HDF5 is more than just data storage: it is also a set of functions for working with that data.

# Background on HDF5

[HDF5](https://www.hdfgroup.org/solutions/hdf5/) has been completely revamped since HDF4 (sort of like Python 2.7 vs. Python 3+). HDF5 has seven _concepts_ that you need to worry about:
- **File:** The HDF5 file containing any number of datasets. Typically organized like a file directory (see Groups).
- **Dataset:** Datasets contain data in the form of an n-dimensional array, typically with elements of the same type.
- **Datatype:** Metadata describing the individual elements of the dataset, e.g. 32bit integers, structs, strings, etc.
- **Dataspace:** Metadata describing the layout of the data, e.g. 3-D array.
- **Attribute:** Simple metadata describing basically anything you want.
- **Group:** A group is a collection of related datasets. Starting from the root group of the file, you may have a group of visualizations and a group of datasets, for example.
- **Link:** Links between related objects (Datasets and Groups) inside an HDF5 file (or even in another HDF5 file).

As you can see, HDF5 is object-oriented in its design, but it's actually implemented in C for the sake of efficiency.
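To make the seven concepts concrete, here is a minimal sketch that touches each of them once through h5py (introduced in the next section). The file name `concepts.hdf5`, the group `measurements`, the dataset `temps`, and the `units` attribute are made up for illustration and are not part of this tutorial's `testfile.hdf5`.

```python
import os
import tempfile

import h5py
import numpy as np

# Hypothetical scratch file, kept separate from the tutorial's testfile.hdf5
path = os.path.join(tempfile.mkdtemp(), "concepts.hdf5")

with h5py.File(path, "w") as f:                           # File
    grp = f.create_group("measurements")                  # Group
    data = np.arange(6, dtype="int32").reshape(2, 3)
    dset = grp.create_dataset("temps", data=data)         # Dataset
    datatype = dset.dtype                                 # Datatype (32-bit integers)
    dataspace = dset.shape                                # Dataspace (a 2 x 3 array)
    dset.attrs["units"] = "celsius"                       # Attribute
    f["latest"] = h5py.SoftLink("/measurements/temps")    # Link
    linked_shape = f["latest"].shape                      # the link resolves to the same dataset
    units = dset.attrs["units"]

print(datatype, dataspace, linked_shape, units)
```

The soft link stores only a path, so `f["latest"]` and `f["measurements/temps"]` resolve to the same dataset.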

Here is a very good [video tutorial](https://www.youtube.com/watch?v=BAjsCldRMMc) to get started.

# Background on H5py

We'll be using [H5py](https://docs.h5py.org/en/stable/index.html), which is a popular Python wrapper for HDF5. [PyTables](https://www.pytables.org/) is an alternative, but it takes a "there is only one good way to do X" approach, so it's a bit more limited.

The key thing to remember when using H5py is that HDF5 groups work like dictionaries, and HDF5 datasets work like NumPy arrays.
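As a rough illustration of that rule of thumb, the sketch below uses dictionary-style operations on a file's root group and NumPy-style indexing on a dataset. The file name `dictlike.hdf5` and the dataset `squares` are hypothetical.

```python
import os
import tempfile

import h5py
import numpy as np

# Hypothetical scratch file for this example only
path = os.path.join(tempfile.mkdtemp(), "dictlike.hdf5")

with h5py.File(path, "w") as f:
    f.create_dataset("squares", data=np.arange(10) ** 2)

    # Dictionary-style operations on the root group
    has_squares = "squares" in f
    names = list(f.keys())

    # NumPy-style operations on the dataset
    dset = f["squares"]
    first_three = dset[:3]        # slicing reads from the file and returns an ndarray
    dset[0] = 100                 # item assignment writes through to the file
    total = int(dset[...].sum())  # `[...]` reads the whole dataset into memory

print(has_squares, names, first_three, total)
```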

# Setup

Install HDF5:
- Debian/Ubuntu: `sudo apt install libhdf5-dev`
- Others: [See Documentation](https://www.hdfgroup.org/downloads/hdf5)

Install H5py: `pip3 install h5py`

In [1]:
# Imports
import h5py
import numpy as np

# (1) Creating an HDF5 File

In [2]:
with h5py.File("testfile.hdf5", "w") as f:
    # This creates a file called `testfile.hdf5` containing a dataset called
    # `test_dataset` with a 1-D array of size 100 with integer elements
    test_dataset = f.create_dataset("test_dataset", (100, ), dtype="i")

NOTE: `testfile.hdf5` is a binary file. If you open it in vim or another text editor, you'll see mostly unreadable bytes rather than text.
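If you want to confirm that a file really is HDF5 without opening it in an editor, one option is sketched below: `h5py.is_hdf5` checks the file for you, and reading the first eight bytes shows the HDF5 format signature. The scratch file `signature.hdf5` is hypothetical, and the byte check assumes the file has no user block in front of the HDF5 data (the default).

```python
import os
import tempfile

import h5py

# Hypothetical scratch file for this example only
path = os.path.join(tempfile.mkdtemp(), "signature.hdf5")

with h5py.File(path, "w") as f:
    f.create_dataset("test_dataset", (100,), dtype="i")

# Read the raw bytes at the start of the file
with open(path, "rb") as raw:
    signature = raw.read(8)

is_hdf5 = h5py.is_hdf5(path)

print(signature)
print(is_hdf5)
```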

# (2) Reading, Writing, and Closing an HDF5 File

In [3]:
# Note the read-only mode ("r")
with h5py.File("testfile.hdf5", "r") as f:
    pass # Do stuff
# OR
f = h5py.File("testfile.hdf5", "r")

So, how do we know what we have inside this file?

In [4]:
list(f.keys())

['test_dataset']

Now we know this file contains a dataset called "test_dataset". Let's create another:

In [5]:
# Variable length unicode strings
dt = h5py.special_dtype(vlen=str)
f.create_dataset("apples", (10,), dtype=dt)

ValueError: Unable to create dataset (no write intent on file)

We opened the HDF5 file in read-only mode (`"r"`), so we can't modify it. Let's fix that:

In [6]:
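# Close the read-only handle before reopening the file
f.close()

# "a" = read/write if the file exists, create it otherwise
f = h5py.File("testfile.hdf5", "a")
list(f.keys())

['test_dataset']

**Always close the file when you're done.** `f.close()` and `h5py.File.close(f)` are the same call. What can leave an HDF5 file corrupted is a writable handle that never gets closed, for example if the kernel dies before the data is flushed. A `with` block closes the file for you.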
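Using a `with` block is a simple way to make sure the file is closed even when something goes wrong inside the block. The sketch below, on a hypothetical scratch file `modes.hdf5`, walks through the `"w"`, `"r"`, and `"a"` modes and checks that the handle is closed afterwards. The exact exception type raised on a read-only write may differ between h5py versions, so the example catches a broad `Exception`.

```python
import os
import tempfile

import h5py

# Hypothetical scratch file for this example only
path = os.path.join(tempfile.mkdtemp(), "modes.hdf5")
dt = h5py.special_dtype(vlen=str)

# "w" creates the file (overwriting any existing one)
with h5py.File(path, "w") as f:
    f.create_dataset("test_dataset", (100,), dtype="i")
closed_after_with = not bool(f)  # a closed File object is falsy

# "r" is read-only, so creating a dataset fails
with h5py.File(path, "r") as f:
    try:
        f.create_dataset("apples", (10,), dtype=dt)
        write_failed = False
    except Exception:  # exact type may vary across h5py versions
        write_failed = True

# "a" opens read/write and keeps the existing contents
with h5py.File(path, "a") as f:
    f.create_dataset("apples", (10,), dtype=dt)
    names = sorted(f.keys())

print(closed_after_with, write_failed, names)
```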

# (3) Modifying and Accessing Datasets

Okay, now let's try creating another dataset:

In [7]:
# Variable length unicode strings
dt = h5py.special_dtype(vlen=str)
apples_dataset = f.create_dataset("apples", (10,), dtype=dt)

In [8]:
print(list(f.keys()))
print(apples_dataset.shape)
print(apples_dataset.dtype)

['apples', 'test_dataset']
(10,)
object


We've got two datasets. Let's add some apples:

In [9]:
apples = ["Red Delicious", "Gala", "Granny Smith", "Golden Delicious", "Lady", "Baldwin", "McIntosh", "Honey Crisp", "Fuji", "Cortland"]
apples_dataset = np.array(apples, dtype=dt)
print(apples_dataset)
print(list(f["apples"]))

['Red Delicious' 'Gala' 'Granny Smith' 'Golden Delicious' 'Lady' 'Baldwin'
 'McIntosh' 'Honey Crisp' 'Fuji' 'Cortland']
[b'', b'', b'', b'', b'', b'', b'', b'', b'', b'']


So, rebinding the name `apples_dataset` to a NumPy array **does not** change the actual dataset; it only makes the variable refer to a new in-memory array. Instead, we need to write into the dataset itself:

In [11]:
apples_dataset = f["apples"]
# The `[...]` is critical
apples_dataset[...] = np.array(apples, dtype=dt)
print(list(f["apples"]))

[b'Red Delicious', b'Gala', b'Granny Smith', b'Golden Delicious', b'Lady', b'Baldwin', b'McIntosh', b'Honey Crisp', b'Fuji', b'Cortland']


In [12]:
print(apples_dataset[0])

b'Red Delicious'
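The `b'...'` prefix means the strings come back as `bytes`. Assuming h5py 3.0 or later, the dataset's `asstr()` wrapper decodes them to `str` on read. A minimal sketch on a hypothetical scratch file `strings.hdf5`:

```python
import os
import tempfile

import h5py
import numpy as np

# Hypothetical scratch file for this example only
path = os.path.join(tempfile.mkdtemp(), "strings.hdf5")
dt = h5py.special_dtype(vlen=str)

with h5py.File(path, "w") as f:
    dset = f.create_dataset("apples", (3,), dtype=dt)
    dset[...] = np.array(["Red Delicious", "Gala", "Fuji"], dtype=dt)

    raw_first = dset[0]                    # bytes under h5py 3.x
    decoded_first = dset.asstr()[0]        # decoded to a Python str
    decoded_all = list(dset.asstr()[...])  # the whole dataset as str values

print(raw_first, decoded_first, decoded_all)
```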


In [13]:
# This should give us an IndexError
apples_dataset[10] = "Empire"

IndexError: Index (10) out of range for (0-9)

If we want to add more data, we need to change the shape of the dataset:

In [14]:
apples_dataset.resize((11,))
print(apples_dataset)

TypeError: Only chunked datasets can be resized

Alright, so we didn't plan very well. We need our dataset to be _chunked_ if we want to resize it. To understand what this means, we need to look at how HDF5 stores data:
- **Contiguous Layout:** The default. Datasets are serialized into a monolithic block, which maps directly to a memory buffer the size of the dataset.
- **Chunked Layout:** Datasets are split into chunks which are stored separately in the file. Storage order doesn't matter. The benefit of chunking is that (1) datasets can be resized, and (2) chunks can be read/written individually, improving performance when manipulating a subset of the dataset. [More details](https://support.hdfgroup.org/HDF5/doc/Advanced/Chunking/).
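You can check which layout a dataset uses through its `chunks` and `maxshape` properties. The sketch below compares the two layouts on a hypothetical scratch file `layouts.hdf5`; the chunk shape `(5,)` is an arbitrary choice for illustration.

```python
import os
import tempfile

import h5py

# Hypothetical scratch file for this example only
path = os.path.join(tempfile.mkdtemp(), "layouts.hdf5")

with h5py.File(path, "w") as f:
    contiguous = f.create_dataset("contiguous", (10,), dtype="i")
    chunked = f.create_dataset(
        "chunked", (10,), maxshape=(None,), dtype="i", chunks=(5,)
    )

    contiguous_chunks = contiguous.chunks      # None means contiguous layout
    chunked_chunks = chunked.chunks            # the chunk shape we asked for
    contiguous_maxshape = contiguous.maxshape  # fixed at the creation shape
    chunked_maxshape = chunked.maxshape        # None means unlimited along that axis

    # Only the chunked dataset can grow
    chunked.resize((25,))
    new_shape = chunked.shape

print(contiguous_chunks, chunked_chunks)
print(contiguous_maxshape, chunked_maxshape, new_shape)
```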

In [21]:
# Delete our "apples" dataset
del f["apples"]

# Recreate it, but make it chunked
apples_dataset = f.create_dataset("apples", (10,), maxshape=(None,), dtype=dt, chunks=True)
apples_dataset[...] = np.array(apples, dtype=dt)
print(list(f["apples"]))

[b'Red Delicious', b'Gala', b'Granny Smith', b'Golden Delicious', b'Lady', b'Baldwin', b'McIntosh', b'Honey Crisp', b'Fuji', b'Cortland']


Now we should be able to resize the dataset:

In [22]:
apples_dataset.resize((11,))
print(apples_dataset)

<HDF5 dataset "apples": shape (11,), type "|O">


In [24]:
apples_dataset[10] = "Empire"
print(list(f["apples"]))

[b'Red Delicious', b'Gala', b'Granny Smith', b'Golden Delicious', b'Lady', b'Baldwin', b'McIntosh', b'Honey Crisp', b'Fuji', b'Cortland', b'Empire']
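Resizing and then writing into the new slots is a pattern you may repeat often, so it can be wrapped in a small helper. The function `append_values` below is a hypothetical convenience, not part of h5py, and assumes a 1-D dataset created with `maxshape=(None,)`.

```python
import os
import tempfile

import h5py


def append_values(dset, values):
    """Grow a resizable 1-D dataset and write `values` at the end."""
    old_size = dset.shape[0]
    dset.resize((old_size + len(values),))
    dset[old_size:] = values


# Hypothetical scratch file for this example only
path = os.path.join(tempfile.mkdtemp(), "append.hdf5")
dt = h5py.special_dtype(vlen=str)

with h5py.File(path, "w") as f:
    apples = f.create_dataset("apples", (2,), maxshape=(None,), dtype=dt, chunks=True)
    apples[...] = ["Gala", "Fuji"]

    append_values(apples, ["Empire", "Jonagold"])
    final_shape = apples.shape
    contents = list(apples[...])

print(final_shape, contents)
```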


# (4) Attributes

In [27]:
print(dict(apples_dataset.attrs))

{}


So, our "apples" dataset has no attributes, i.e. no metadata. Let's fix that:

In [28]:
attr_name = "Description"
attr_data = "List of the most popular species of apples."
apples_dataset.attrs.create(name=attr_name, data=attr_data)

In [29]:
print(dict(apples_dataset.attrs))

{'Description': 'List of the most popular species of apples.'}
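Attributes also support plain dictionary-style access, which is often shorter than calling `attrs.create`. The sketch below uses a hypothetical scratch file `attrs.hdf5`; the `count` and `tags` attributes and the file-level `source` attribute are invented for illustration.

```python
import os
import tempfile

import h5py
import numpy as np

# Hypothetical scratch file for this example only
path = os.path.join(tempfile.mkdtemp(), "attrs.hdf5")

with h5py.File(path, "w") as f:
    dset = f.create_dataset("apples", (10,), dtype=h5py.special_dtype(vlen=str))

    # Set attributes like dictionary entries
    dset.attrs["Description"] = "List of popular apples."
    dset.attrs["count"] = 10
    dset.attrs["tags"] = np.array([1, 2, 3])

    # Files and groups carry attributes too
    f.attrs["source"] = "tutorial"

    attr_names = sorted(dset.attrs.keys())
    count = int(dset.attrs["count"])

    # Delete an attribute like a dictionary entry
    del dset.attrs["tags"]
    has_tags = "tags" in dset.attrs
    source = f.attrs["source"]

print(attr_names, count, has_tags, source)
```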


# (5) Groups

Groups allow us to group related datasets together. So let's make a group for vegetable datasets. First, let's see what the existing group hierarchy is:

In [30]:
print(f.name)

/


That's the root group. What about our apples?

In [31]:
print(f["apples"].name)

/apples


And now some veggies:

In [32]:
veggie_group = f.create_group("vegetables")

In [33]:
print(veggie_group)

<HDF5 group "/vegetables" (0 members)>


Now we need some datasets for our veggie group:

In [34]:
root_veggies = veggie_group.create_dataset("root_veggies", (10,), maxshape=(None,), dtype=dt, chunks=True)
print(root_veggies)
leafy_veggies = veggie_group.create_dataset("leafy_veggies", (10,), maxshape=(None,), dtype=dt, chunks=True)
print(leafy_veggies)

<HDF5 dataset "root_veggies": shape (10,), type "|O">
<HDF5 dataset "leafy_veggies": shape (10,), type "|O">


In [40]:
root_veggies[...] = np.array(["Onions", "Sweet Potatoes", "Turnips", "Ginger", "Beets", "Garlic", "Radishes", "Turnips", "Fennel", "Carrots"])
print(list(root_veggies))

[b'Onions', b'Sweet Potatoes', b'Turnips', b'Ginger', b'Beets', b'Garlic', b'Radishes', b'Turnips', b'Fennel', b'Carrots']


In [43]:
print(root_veggies.name)
print(leafy_veggies.name)
print(veggie_group.name)

/vegetables/root_veggies
/vegetables/leafy_veggies
/vegetables


As you can see, groups are basically directories of related datasets.

In [46]:
print(dict(f.items()))

{'apples': <HDF5 dataset "apples": shape (11,), type "|O">, 'test_dataset': <HDF5 dataset "test_dataset": shape (100,), type "<i4">, 'vegetables': <HDF5 group "/vegetables" (2 members)>}


We can recursively visit every object in a group and call a function with each object's name, like this:

In [47]:
def printname(name):
    print(name)
f["vegetables"].visit(printname)

leafy_veggies
root_veggies
