## Requirements

In [1]:
import h5py as h5
import numpy as np
import pathlib

## Problem setting

For some applications, I/O takes up a significant fraction of the execution time.  In that case, it is worth paying attention to I/O performance and choosing an appropriate data format.

The numpy library supports text I/O and binary I/O.  Additionally, HDF5 library wrappers such as h5py or pytables can be used to save numpy arrays in HDF5 files, or read them from such files.

In this notebook you can compare the performance of various options.

## Data

You can simply use a one-dimensional array to experiment with.  For a useful benchmark, this array should be neither too small nor too large.  The array `data` will be used as the test array.

In [2]:
array_size = 10_000_000
data = np.random.uniform(-1.0, 1.0, size=array_size)

## Text I/O

Saving your data as text has the advantage that you can read the resulting file using an editor, and that it can be read without hassle by applications written in any programming language, as well as by a great many existing tools.

However, with respect to performance, this may not be your best option.
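
To make this trade-off concrete, here is a minimal sketch that writes three hand-picked values as text and inspects the result.  The file name `sample.txt` and the values are illustrative assumptions, not part of the benchmark.

```python
import os
import tempfile

import numpy as np

# Three hand-picked values, so the text representation is easy to inspect.
sample = np.array([0.5, -1.0, 0.25])

with tempfile.TemporaryDirectory() as tmp_dir:
    txt_name = os.path.join(tmp_dir, 'sample.txt')  # hypothetical file name
    np.savetxt(txt_name, sample)  # default format '%.18e', one value per line
    with open(txt_name) as txt_file:
        lines = txt_file.read().splitlines()
    # average number of bytes on disk per value; a float64 occupies 8 bytes
    txt_bytes_per_value = os.path.getsize(txt_name) / sample.size

for line in lines:
    print(line)
print(f'{txt_bytes_per_value:.1f} bytes per value as text, '
      f'{sample.itemsize} bytes per value in memory')
```

Each line is directly readable, but every value has to be formatted on writing and parsed on reading, and it takes more space than its 8-byte binary representation.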

For all experiments with text I/O, the file name is stored in `txt_file_name`.

In [3]:
txt_file_name = 'tmp_data.txt'

In [4]:
%time np.savetxt(txt_file_name, data)

CPU times: user 11.7 s, sys: 323 ms, total: 12 s
Wall time: 12.2 s


The numpy library has at least three functions to read data from a text file.  They differ in features and especially in performance.

### `np.loadtxt`

In [5]:
%timeit np.loadtxt(txt_file_name)

2.14 s ± 13.7 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)


Verifying the loaded data is always wise.

In [6]:
loaded_data = np.loadtxt(txt_file_name)
print(loaded_data.shape, loaded_data.dtype)

(10000000,) float64


### `np.fromfile`

In [7]:
%timeit np.fromfile(txt_file_name, sep='\n')

6.22 s ± 89.4 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)


Verifying the loaded data is always wise.

In [8]:
loaded_data = np.fromfile(txt_file_name, sep='\n')
print(loaded_data.shape, loaded_data.dtype)

(10000000,) float64


### `np.genfromtxt`

In [9]:
%timeit np.genfromtxt(txt_file_name)

12.1 s ± 289 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)


Verifying the loaded data is always wise.

In [10]:
loaded_data = np.genfromtxt(txt_file_name)
print(loaded_data.shape, loaded_data.dtype)

(10000000,) float64


### Conclusion

Although all three functions offer similar functionality, `np.loadtxt` has a distinct performance edge in these measurements: it is approximately three times faster than `np.fromfile`, and between five and six times faster than `np.genfromtxt`.
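
Timings like these depend on the numpy version and the hardware, so it can be worth repeating the comparison on your own system.  The following is a rough sketch that does so outside a notebook with the standard `timeit` module; the array size, the file name and the number of repeats are arbitrary assumptions chosen to keep it quick.

```python
import os
import tempfile
import timeit

import numpy as np

# Small illustrative array; much smaller than the benchmark array.
rng = np.random.default_rng(seed=1234)
sample = rng.uniform(-1.0, 1.0, size=20_000)

# The three text readers, called the same way as in the cells above.
readers = {
    'np.loadtxt': lambda name: np.loadtxt(name),
    'np.fromfile': lambda name: np.fromfile(name, sep='\n'),
    'np.genfromtxt': lambda name: np.genfromtxt(name),
}

results = {}
timings = {}
with tempfile.TemporaryDirectory() as tmp_dir:
    file_name = os.path.join(tmp_dir, 'sample.txt')  # hypothetical file name
    np.savetxt(file_name, sample)
    for label, reader in readers.items():
        results[label] = reader(file_name)
        # best of three single runs, in seconds
        timings[label] = min(
            timeit.repeat(lambda: reader(file_name), number=1, repeat=3)
        )

for label, seconds in timings.items():
    print(f'{label}: {seconds * 1000.0:.1f} ms')
```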

In [11]:
pathlib.Path(txt_file_name).unlink()

## Binary I/O

A drawback of storing data in a binary format is that it can't be read directly using an editor.  However, the performance gains are perhaps worth the bother.

First, create the data file in binary format.

In [12]:
bin_file_name = 'tmp_data.npy'

In [13]:
%time np.save(bin_file_name, data)

CPU times: user 8.4 ms, sys: 110 ms, total: 118 ms
Wall time: 119 ms


### `np.load`

In [14]:
%timeit np.load(bin_file_name)

19.3 ms ± 2.65 ms per loop (mean ± std. dev. of 7 runs, 10 loops each)


Verifying the loaded data is always wise.

In [15]:
loaded_data = np.load(bin_file_name)
print(loaded_data.shape, loaded_data.dtype)

(10000000,) float64


### `np.fromfile`

In [16]:
%timeit np.fromfile(bin_file_name)

16.7 ms ± 1.3 ms per loop (mean ± std. dev. of 7 runs, 100 loops each)


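Verifying the loaded data is always wise.  Here the check fails: `np.fromfile` reads raw bytes, so it also interprets the header written by `np.save` as `float64` values, and the resulting array has 16 elements too many.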

In [17]:
loaded_data = np.fromfile(bin_file_name)
print(loaded_data.shape, loaded_data.dtype)

(10000016,) float64
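
One way to avoid the surplus elements reported above is to write headerless raw data with `ndarray.tofile`, which is the counterpart of `np.fromfile`.  The sketch below contrasts the two cases on a small array; the array size and the file names are illustrative assumptions.

```python
import os
import tempfile

import numpy as np

rng = np.random.default_rng(seed=1234)
sample = rng.uniform(-1.0, 1.0, size=1_000)

with tempfile.TemporaryDirectory() as tmp_dir:
    npy_name = os.path.join(tmp_dir, 'sample.npy')  # hypothetical file names
    raw_name = os.path.join(tmp_dir, 'sample.bin')
    np.save(npy_name, sample)    # .npy format: header followed by the data
    sample.tofile(raw_name)      # raw bytes only, no header
    from_npy = np.fromfile(npy_name)  # header bytes are read as float64 too
    from_raw = np.fromfile(raw_name)  # exactly the original values
    header_size = os.path.getsize(npy_name) - sample.nbytes

n_extra = from_npy.size - sample.size
print(f'.npy header: {header_size} bytes, read as {n_extra} extra values')
print(f'raw file round trip is exact: {np.array_equal(from_raw, sample)}')
```

Note that a raw file stores neither the shape nor the dtype, so both have to be known when reading it back.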


### Conclusion

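In these measurements, binary I/O is roughly two orders of magnitude faster than text I/O: `np.load` takes about 19 ms where `np.loadtxt` needs 2.14 s, and `np.save` takes about 0.12 s where `np.savetxt` needs 12.2 s.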

In [18]:
pathlib.Path(bin_file_name).unlink()

## HDF5

As a binary format allows much more efficient I/O, is there a way to combine it with the advantages of text-based I/O?  HDF5 to the rescue.  It stores data in binary format, but is self-documenting and, to some extent, human readable (using `h5dump`).
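
As a small illustration of what self-documenting means here, the sketch below stores an array together with two attributes and then recovers the shape, the dtype and the attributes from the file alone.  The file name and the attribute names `description` and `units` are hypothetical choices for this example.

```python
import os
import tempfile

import h5py as h5
import numpy as np

sample = np.linspace(-1.0, 1.0, 11)

with tempfile.TemporaryDirectory() as tmp_dir:
    file_name = os.path.join(tmp_dir, 'sample.h5')  # hypothetical file name
    with h5.File(file_name, 'w') as h5_file:
        dataset = h5_file.create_dataset('data', data=sample)
        # attribute names and values are made up for this illustration
        dataset.attrs['description'] = 'evenly spaced test values'
        dataset.attrs['units'] = 'dimensionless'
    with h5.File(file_name, 'r') as h5_file:
        dataset_names = list(h5_file.keys())
        dataset = h5_file['data']
        # shape, dtype and attributes are all stored in the file itself
        stored_shape = dataset.shape
        stored_dtype = dataset.dtype
        stored_attrs = dict(dataset.attrs)

print(dataset_names, stored_shape, stored_dtype)
print(stored_attrs)
```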

In [19]:
h5_file_name = 'tmp_data.h5'

In [20]:
%%time
with h5.File(h5_file_name, 'w') as h5_file:
    h5_file.create_dataset('data', data.shape, data.dtype)
    h5_file['data'][:] = data[:]

CPU times: user 24 ms, sys: 69.1 ms, total: 93.1 ms
Wall time: 94.4 ms


In [21]:
%%timeit
with h5.File(h5_file_name, 'r') as h5_file:
    data = np.empty(h5_file['data'].shape, dtype=h5_file['data'].dtype)
    data[:] = h5_file['data'][:]

26.9 ms ± 308 μs per loop (mean ± std. dev. of 7 runs, 10 loops each)


Note that a copy of the data is made to get an apples-to-apples comparison.  However, very often there is no need to do this, since computations can be made directly on the dataset.
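
The following sketch shows one way to compute on a dataset without copying it into memory as a whole: it sums the values block by block, reading only one slice at a time.  The block size, the array size and the file name are arbitrary assumptions.

```python
import os
import tempfile

import h5py as h5
import numpy as np

rng = np.random.default_rng(seed=1234)
sample = rng.uniform(-1.0, 1.0, size=10_000)
block_size = 1_000  # arbitrary choice for this sketch

with tempfile.TemporaryDirectory() as tmp_dir:
    file_name = os.path.join(tmp_dir, 'sample.h5')  # hypothetical file name
    with h5.File(file_name, 'w') as h5_file:
        h5_file.create_dataset('data', data=sample)
    with h5.File(file_name, 'r') as h5_file:
        dataset = h5_file['data']
        total = 0.0
        for start in range(0, dataset.shape[0], block_size):
            # slicing a dataset reads just that part into a numpy array
            block = dataset[start:start + block_size]
            total += block.sum()
        first_ten = dataset[:10]

print(f'sum computed block by block: {total:.6f}')
```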

Verifying the loaded data is always wise.

In [22]:
with h5.File(h5_file_name, 'r') as h5_file:
    loaded_data = np.empty(h5_file['data'].shape, dtype=h5_file['data'].dtype)
    loaded_data[:] = h5_file['data'][:]
print(loaded_data.shape, loaded_data.dtype)

(10000000,) float64


### Conclusion

There is some overhead involved in using HDF5 (reading takes 26.9 ms, versus 19.3 ms for `np.load` in these measurements), but this is offset by the advantages of HDF5 as a format for long-term data storage.

In [23]:
pathlib.Path(h5_file_name).unlink()

## Conclusion

HDF5 is a very nice compromise between performance and transparency.