In [1]:
from expelliarmus import Wizard
import pathlib
import h5py
import numpy as np
import timeit
import requests
import pickle
import os

In [2]:
## Download original data in evt3 raw format

fname = "driving_sample"
extension_map = {
    'dat': 'dat',
    'evt2': 'raw',
    'evt3': 'raw',
    'hdf5': 'hdf5',
    'hdf5_lzf': 'hdf5',
    'hdf5_gzip': 'hdf5',
    'numpy': 'npy',
}
get_fpath = lambda encoding: f"{fname}_{encoding}.{extension_map[encoding]}"

if not pathlib.Path(get_fpath('evt3')).is_file():
    # Download the EVT3 sample once and cache it locally.
    print("Downloading EVT3 file... ", end="")
    r = requests.get("https://dataset.prophesee.ai/index.php/s/nVcLLdWAnNzrmII/download", allow_redirects=True)
    r.raise_for_status()  # fail early instead of writing an error page to disk
    with open(get_fpath('evt3'), 'wb') as f:
        f.write(r.content)
    print("done!")

wizard = Wizard(encoding="evt3")
data = wizard.read(get_fpath('evt3'))

In [3]:
## Generate all comparison files

# dat and evt2 (the evt3 file already exists, so it is skipped)
raw_encodings = ["dat", "evt2", "evt3"]
for encoding in raw_encodings:
    if not os.path.exists(get_fpath(encoding)):
        print(f"Generating file for {encoding} encoding.")
        wizard = Wizard(encoding="evt3")
        wizard.set_encoding(encoding)
        wizard.save(fpath=get_fpath(encoding), arr=data)

# variants of hdf5
hdf5_encodings = ["hdf5", "hdf5_lzf", "hdf5_gzip"]
for encoding in hdf5_encodings:
    fpath = pathlib.Path(get_fpath(encoding))
    if not fpath.exists():
        print(f"Generating file for {encoding} encoding.")
        # Map each variant to its h5py compression filter (None means uncompressed).
        compression = {"hdf5": None, "hdf5_lzf": "lzf", "hdf5_gzip": "gzip"}[encoding]
        with h5py.File(fpath, "w") as fp:
            fp.create_dataset(name="events", data=data, compression=compression)

# numpy
fpath = get_fpath('numpy')
if not os.path.exists(fpath):
    print("Generating file for numpy encoding.")
    np.save(fpath, data, allow_pickle=True)

In [4]:
## Run benchmarks

REPEAT = 10
get_fsize_MB = lambda fpath: round(fpath.stat().st_size/(1024*1024))

# evt2, evt3, dat
raw_times = []
raw_sizes = []
for encoding in raw_encodings:
    fpath = get_fpath(encoding)
    wizard = Wizard(encoding)
    wizard.set_file(fpath)
    raw_times.append(sum(timeit.repeat(lambda: wizard.read(fpath), number=1, repeat=REPEAT))/REPEAT)
    raw_sizes.append(get_fsize_MB(pathlib.Path(fpath)))

# hdf5 variants
hdf5_times = []
hdf5_sizes = []
for encoding in hdf5_encodings:
    fpath = get_fpath(encoding)
    # Open read-only; the context manager closes the file even if a read fails.
    with h5py.File(fpath, "r") as fp:
        hdf5_times.append(sum(timeit.repeat(lambda: fp["events"][:], number=1, repeat=REPEAT))/REPEAT)
    hdf5_sizes.append(get_fsize_MB(pathlib.Path(fpath)))

# numpy
fpath = get_fpath('numpy')
numpy_time = sum(timeit.repeat(lambda: np.load(fpath), number=1, repeat=REPEAT))/REPEAT
numpy_size = get_fsize_MB(pathlib.Path(fpath))

In [5]:
## Aggregate results

import pandas as pd

df = pd.DataFrame({
    'Encoding': raw_encodings + hdf5_encodings + ["numpy"],
    'Framework': ["expelliarmus"] * len(raw_encodings) + ["h5py"] * len(hdf5_encodings) + ["numpy"],
    'Read time [s]': raw_times + hdf5_times + [numpy_time],
    'File size [MB]': raw_sizes + hdf5_sizes + [numpy_size],
})

In [9]:
## Plot results

import plotly.express as px
from IPython.display import Image

title = f"Reading the same {int(len(data)/1e6)} million events from different files."
fig = px.scatter(df, x='Read time [s]', y='File size [MB]', color='Framework', symbol='Encoding', title=title)
fig.update_traces(marker_size=13)
fig.write_image('file_read_benchmark.png')
# img_bytes = fig.to_image(format="png")

Only cells below here will be included in the article! To convert the notebook to markdown, run
```
jupyter nbconvert index.ipynb --to markdown --TagRemovePreprocessor.enabled=True --TagRemovePreprocessor.remove_input_tags remove_input --TagRemovePreprocessor.remove_cell_tags remove_cell
```

---
title: "Reading events from disk, fast"
date: 2023-01-11
description: "Drastically reduce loading times and disk footprint."
draft: true
image: file_read_benchmark.png
tags: ["file encoding", "events"]
---
In contrast to images, which have standard formats such as PNG and JPEG, there is no standard file format for events. When streaming data from an event camera, we get tuples of (time, x, y, polarity) that look something like this:

In [17]:
print(data[100:1100:100])

[(11718661,  762, 147, 1) (11718665,  833, 184, 1)
 (11718669, 1161,  72, 1) (11718674, 1110, 100, 0)
 (11718679, 1073,  23, 1) (11718684, 1134,  56, 1)
 (11718688,  799, 304, 0) (11718691,  391, 289, 0)
 (11718694,  234, 275, 1) (11718699,  512, 335, 1)]
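
To make the tuple layout concrete, here is a minimal sketch of how such events can be held in a NumPy structured array. The field names and integer widths (`t` as 64-bit, `x`/`y` as 16-bit, `p` as 8-bit) are assumptions chosen to fit the printed values, not necessarily the exact dtype the reader library returns.

```python
import numpy as np

# Assumed event layout (hypothetical, chosen to fit the printed tuples):
# 64-bit timestamp, 16-bit x/y coordinates, 8-bit polarity.
event_dtype = np.dtype(
    [("t", np.int64), ("x", np.int16), ("y", np.int16), ("p", np.uint8)]
)

# The first three events from the output above.
events = np.array(
    [(11718661, 762, 147, 1), (11718665, 833, 184, 1), (11718669, 1161, 72, 1)],
    dtype=event_dtype,
)

# Each field is accessed by name and behaves like a regular NumPy array.
print(events["t"])
print(events["x"].max())

# Uncompressed footprint: bytes per event times the number of events.
bytes_per_event = event_dtype.itemsize
print(bytes_per_event, events.nbytes)
```

Under these assumptions every event occupies the same fixed number of bytes, so an uncompressed file grows linearly with the event count.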


With the emergence of event-based sensors came numerous ways of storing their data. Some of the better ideas are hdf5 and numpy, some of the worse ones are text files. When training spiking neural networks, file reading speed is a bottleneck we need to keep in mind, and it matters more as datasets grow: as the spatial resolution of event cameras increases, we receive more events for the same signal. Here we list the results of our benchmark of different file encodings and the software frameworks that decode them.

![benchmark](file_read_benchmark.png)

The file size depends on the encoding, whereas the reading speed depends on the particular implementation that reads the file. In terms of file size, numpy doesn't use any compression whatsoever, resulting in a file of some 1.7GB for our events. Prophesee's [evt3](https://docs.prophesee.ai/stable/data/encoding_formats/evt3.html) format achieves the best compression by cleverly encoding differences in timestamps. In terms of reading speed, numpy is the fastest, as it doesn't have to decompress anything. Decompressing the events from disk with h5py, on the other hand, is by far the slowest. Using [Expelliarmus](https://github.com/open-neuromorphic/expelliarmus) and the [evt2](https://docs.prophesee.ai/stable/data/encoding_formats/evt2.html) file format, we get very close to numpy reading speeds while using only a fourth of the disk space. This becomes particularly important for larger datasets, which can easily reach some 3-4TB because of inefficient file encodings.
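
To give an intuition for why storing timestamp differences saves space, here is a minimal delta-encoding sketch. It is not the EVT3 layout (that is specified in the Prophesee documentation linked above); it only illustrates the general idea, using a few timestamps from the sample output earlier in this post. The variable names are hypothetical.

```python
import numpy as np

# A few timestamps (in microseconds) from the sample output earlier in this post.
t = np.array([11718661, 11718665, 11718669, 11718674, 11718679], dtype=np.int64)

# Keep the first timestamp in full and store only the differences for the rest.
base = t[0]
deltas = np.diff(t)

# Events arrive in time order, so the differences are small and non-negative.
# Here they fit into a single byte each; check before narrowing the type.
assert deltas.min() >= 0 and deltas.max() <= np.iinfo(np.uint8).max
deltas_small = deltas.astype(np.uint8)

# Decoding is a cumulative sum on top of the base timestamp.
decoded = np.concatenate(([base], base + np.cumsum(deltas_small, dtype=np.int64)))
assert np.array_equal(decoded, t)

# Compare the raw size with the size of base plus deltas, in bytes.
raw_bytes = t.nbytes
encoded_bytes = base.nbytes + deltas_small.nbytes
print(raw_bytes, encoded_bytes)
```

The trade-off is that the reader has to do extra work to reconstruct absolute timestamps, which is why the decoding speed depends so much on the implementation.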