# Using dataloaders with HDF5 files in torch

An example of how to use `h5py` and `torch.utils.data.Dataset` to load data in batches without reading the whole file into memory.

**Note:** this will often be slower than loading all of the data at once, but it avoids having to load large files into memory, which can cause issues on shared compute resources.

Michael J. Williams 2022

In [None]:
import h5py
import numpy as np
import time
import torch
from typing import Optional, Tuple

In [None]:
np.__path__

# Dataset class

This class is designed for use in a supervised learning problem but can easily be adapted for unsupervised learning.

In [None]:
class H5Dataset(torch.utils.data.Dataset):
    """A dataset to handle large HDF5 files.

    Based on this post:
        https://discuss.pytorch.org/t/efficiently-saving-and-loading-data-using-h5py-or-other-methods/74153
    
    Parameters
    ----------
    h5_path
      Path to the HDF5 file
    x_key
      Key for the x data
    y_key
      Key for the y data
    """
    def __init__(
        self,
        h5_path: str,
        x_key: str,
        y_key: str,
    ) -> None:

        self.h5_path = h5_path
        self._h5_gen = None
        self.x_key = x_key
        self.y_key = y_key
    
    def __getitem__(self, index: int) -> Tuple[np.ndarray, np.ndarray]:
        if self._h5_gen is None:
            self._h5_gen = self._get_generator()
            next(self._h5_gen)
        return self._h5_gen.send(index)

    def _get_generator(self):
        with h5py.File(self.h5_path, 'r') as record:
            index = yield
            while True:
                X = record[self.x_key][index]
                y = record[self.y_key][index]
                index = yield X, y

    def __len__(self) -> int:
        with h5py.File(self.h5_path, 'r') as record:
            length = record[self.x_key].shape[0]
            return length
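
For the unsupervised case, one possible adaptation is a dataset that returns only the x data. The following is a minimal sketch rather than part of the original class: the name `H5UnlabelledDataset` is hypothetical, it opens the file lazily on first access instead of using the generator approach above, and the small demo file is created only so the example can run on its own.

```python
import os
import tempfile

import h5py
import numpy as np
import torch


class H5UnlabelledDataset(torch.utils.data.Dataset):
    """Hypothetical x-only variant of the dataset class (illustration only)."""

    def __init__(self, h5_path: str, x_key: str) -> None:
        self.h5_path = h5_path
        self.x_key = x_key
        self._file = None  # opened on first access, not in __init__

    def __getitem__(self, index: int) -> np.ndarray:
        if self._file is None:
            self._file = h5py.File(self.h5_path, 'r')
        return self._file[self.x_key][index]

    def __len__(self) -> int:
        with h5py.File(self.h5_path, 'r') as record:
            return record[self.x_key].shape[0]


# Small demo file in a temporary directory (illustrative name and sizes)
demo_path = os.path.join(tempfile.mkdtemp(), 'unlabelled_demo.h5')
with h5py.File(demo_path, 'w') as f:
    f.create_dataset('time_series', data=np.random.randn(10, 3), dtype='float32')

unlabelled = H5UnlabelledDataset(demo_path, x_key='time_series')
unlabelled_loader = torch.utils.data.DataLoader(unlabelled, batch_size=4)
first_batch = next(iter(unlabelled_loader))
print(first_batch.shape)
```

Each batch is then a single tensor rather than an `(x, y)` pair.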


# Make an example HDF5 file.

In [None]:
filename = 'test.h5'

Let's check the RAM usage before we create the large dataset. With the `-m` flag, `free` reports the values in mebibytes.

In [None]:
!free -m | head -n 2

Now let's create the large dataset.

In a real use case you would do this in a separate script and only load the data in the training script. You could also make use of the option to add data to the file in batches, so you don't have to generate it all at once.

See: https://docs.h5py.org/en/stable/high/dataset.html
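
Adding data in batches can be sketched with a resizable dataset. This is a minimal sketch, assuming each batch fits comfortably in memory; the file name `batched_example.h5`, the batch sizes and the temporary directory are illustrative choices and not part of the original example.

```python
import os
import tempfile

import h5py
import numpy as np

# Illustrative file location and sizes (assumptions for this sketch)
batched_filename = os.path.join(tempfile.mkdtemp(), 'batched_example.h5')
n_batches = 5
batch_len = 100
n_features = 1000

with h5py.File(batched_filename, 'w') as f:
    # maxshape=(None, ...) makes the first axis resizable.
    # Resizable datasets are stored in chunks; chunks=True lets h5py pick the chunk shape.
    dset = f.create_dataset(
        'time_series',
        shape=(0, n_features),
        maxshape=(None, n_features),
        dtype='float32',
        chunks=True,
    )
    for _ in range(n_batches):
        # Generate (or load) one batch at a time
        batch = np.random.randn(batch_len, n_features)
        start = dset.shape[0]
        # Grow the dataset along the first axis, then fill the new rows
        dset.resize(start + batch_len, axis=0)
        dset[start:] = batch

with h5py.File(batched_filename, 'r') as f:
    final_shape = f['time_series'].shape
print(final_shape)
```

Only one batch is held in memory at a time, so the file can grow well beyond the available RAM.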

In [None]:
N = 100_000
x_data = np.random.randn(N, 1000)
y_data = np.random.randn(N, 2)

Now save the data into an HDF5 file using `h5py`. We'll use one dataset for the x data and another for the y data.

In [None]:
with h5py.File(filename, 'w') as f:
    f.create_dataset('time_series', data=x_data, dtype='float32')
    f.create_dataset('targets', data=y_data, dtype='float32')

We can check the size of the file. It should be roughly 400 MB, since it holds 100,000 × 1,002 `float32` values.

In [None]:
!ls -lh

We can also check the RAM usage again

In [None]:
!free -m | head -n 2

# Make the dataloader

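We then just create a normal torch `DataLoader` using the custom dataset class. We specify the filename and the names of the x and y datasets using `x_key` and `y_key`.

Setting `num_workers=2` will use two worker subprocesses (not threads) to pre-load batches, which can help to speed things up. Because `H5Dataset` only opens the file on the first call to `__getitem__`, each worker opens its own file handle.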

In [None]:
loader = torch.utils.data.DataLoader(
    dataset=H5Dataset(filename, x_key='time_series', y_key='targets'), 
    batch_size=1000, 
    shuffle=True,
    num_workers=2
)

# Example training loop

We can now use the dataloader in a normal training loop:

In [None]:
%%time
for i, (x_batch, y_batch) in enumerate(loader):
    # Do stuff here
    a = 0
    if not i % 10:
        print(f'it: {i} / {len(loader)}')

If we check the RAM usage again after running the loop, we see that it has barely changed despite having loaded all of the data.

In [None]:
!free -m | head -n 2

If you have separate training and validation sets then you'll either need to have separate files or customise `H5Dataset` to only use part of the file.
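
One possible customisation is to restrict the dataset to a contiguous range of rows. The sketch below is an illustration under stated assumptions: the class name `H5SliceDataset` and its `start` and `stop` parameters are hypothetical, it opens the file lazily on first access rather than through the generator used in `H5Dataset`, and the small demo file exists only to make the example self-contained.

```python
import os
import tempfile

import h5py
import numpy as np
import torch


class H5SliceDataset(torch.utils.data.Dataset):
    """Hypothetical variant of H5Dataset limited to rows [start, stop)."""

    def __init__(self, h5_path, x_key, y_key, start=0, stop=None):
        self.h5_path = h5_path
        self.x_key = x_key
        self.y_key = y_key
        with h5py.File(h5_path, 'r') as record:
            n_total = record[x_key].shape[0]
        self.start = start
        self.stop = n_total if stop is None else min(stop, n_total)
        self._file = None  # opened on first access, not in __init__

    def __len__(self):
        return self.stop - self.start

    def __getitem__(self, index):
        if not 0 <= index < len(self):
            raise IndexError(index)
        if self._file is None:
            self._file = h5py.File(self.h5_path, 'r')
        row = self.start + index  # map the local index to a row in the file
        return self._file[self.x_key][row], self._file[self.y_key][row]


# Small demo file: row i of 'targets' holds the value i (illustrative data)
split_path = os.path.join(tempfile.mkdtemp(), 'split_demo.h5')
with h5py.File(split_path, 'w') as f:
    f.create_dataset('time_series', data=np.random.randn(100, 4), dtype='float32')
    f.create_dataset('targets', data=np.arange(100).reshape(100, 1), dtype='float32')

# First 80 rows for training, the remaining 20 for validation
train_set = H5SliceDataset(split_path, 'time_series', 'targets', start=0, stop=80)
val_set = H5SliceDataset(split_path, 'time_series', 'targets', start=80)
print(len(train_set), len(val_set))
```

Each of these can be passed to its own `torch.utils.data.DataLoader` in the same way as above. A contiguous split like this assumes the rows in the file are not ordered in a way that would bias the validation set.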