# Data compression

*legend-pydataobj* gives the user a lot of flexibility in choosing how to compress LGDOs, on disk or in memory, through traditional HDF5 filters or custom waveform compression algorithms.

In [None]:
import lgdo
import numpy as np

Let's start by creating a dummy LGDO Table:

In [None]:
data = lgdo.Table(
    size=1000,
    col_dict={
        "col1": lgdo.Array(np.arange(0, 100, 0.1)),
        "col2": lgdo.Array(np.random.rand(1000)),
    },
)
data

and writing it to disk with default settings:

In [None]:
store = lgdo.lh5.LH5Store()
store.write(data, "data", "data.lh5", wo_mode="of")
lgdo.show("data.lh5")

Let's inspect the data on disk:

In [None]:
import h5py


def show_h5ds_opts(obj):
    """Print the HDF5 filter settings and on-disk size of a dataset in data.lh5."""
    with h5py.File("data.lh5", "r") as f:
        print(obj)
        for attr in ["compression", "compression_opts", "shuffle", "chunks"]:
            print(">", attr, ":", getattr(f[obj], attr))
        print("> size :", f[obj].id.get_storage_size(), "B")


show_h5ds_opts("data/col1")

Looks like the data is compressed with [Gzip](http://www.gzip.org) (compression level 4) by default! This default setting is stored in the global `DEFAULT_HDF5_SETTINGS` variable:

In [None]:
lgdo.lh5.store.DEFAULT_HDF5_SETTINGS

This dictionary holds the default keyword arguments forwarded to [h5py.Group.create_dataset()](https://docs.h5py.org/en/stable/high/group.html#h5py.Group.create_dataset) and can be overridden by the user.

Examples:

In [None]:
# use another built-in filter
lgdo.lh5.store.DEFAULT_HDF5_SETTINGS = {"compression": "lzf"}

# specify filter name and options
lgdo.lh5.store.DEFAULT_HDF5_SETTINGS = {"compression": "gzip", "compression_opts": 7}

# specify a registered filter provided by hdf5plugin
import hdf5plugin

lgdo.lh5.store.DEFAULT_HDF5_SETTINGS = {"compression": hdf5plugin.Blosc()}

# shuffle bytes before compressing (typically a better compression ratio at negligible
# performance cost). Each assignment replaces the previous one, so this one stays in effect
lgdo.lh5.store.DEFAULT_HDF5_SETTINGS = {"shuffle": True, "compression": "lzf"}

Useful resources and lists of HDF5 filters:

- [Registered HDF5 filters](https://confluence.hdfgroup.org/display/support/HDF5+Filter+Plugins)
- [Built-in HDF5 filters from h5py](https://docs.h5py.org/en/stable/high/dataset.html#filter-pipeline)
- [Extra filters from hdf5plugin](https://hdf5plugin.readthedocs.io/en/stable/usage.html)
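
To give an intuition for why the `shuffle` option tends to help, here is a minimal pure-Python sketch of byte shuffling. It is an illustration of the idea only, not the HDF5 implementation: `zlib` stands in for the Gzip filter, the helper names `shuffle()` and `unshuffle()` are made up for this example, and the synthetic 32-bit samples (a constant offset plus one noisy low byte) are an assumption chosen to make the effect visible.

```python
import random
import struct
import zlib


def shuffle(raw: bytes, itemsize: int) -> bytes:
    """Group the i-th byte of every item together (byte transposition)."""
    return b"".join(raw[i::itemsize] for i in range(itemsize))


def unshuffle(shuffled: bytes, itemsize: int) -> bytes:
    """Invert shuffle(): interleave the byte planes back into items."""
    n_items = len(shuffled) // itemsize
    planes = [shuffled[i * n_items : (i + 1) * n_items] for i in range(itemsize)]
    return bytes(b for item in zip(*planes) for b in item)


# synthetic data: only the lowest byte of each 32-bit sample is noisy
random.seed(0)
samples = [65536 + random.randrange(256) for _ in range(4000)]
raw = struct.pack(f"<{len(samples)}I", *samples)

plain_size = len(zlib.compress(raw, 4))
shuffled_size = len(zlib.compress(shuffle(raw, 4), 4))
print("plain:", plain_size, "B, shuffled:", shuffled_size, "B")
```

After shuffling, the three constant bytes of every sample form long runs that the compressor handles almost for free, while the noisy bytes are confined to one contiguous region. The transformation is lossless, since `unshuffle()` restores the original byte order.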

Let's now re-write the data with the updated default settings:

In [None]:
store.write(data, "data", "data.lh5", wo_mode="of")
show_h5ds_opts("data/col1")

Nice. Shuffling bytes before compression significantly reduced the size on disk. Finally, `create_dataset()` keyword arguments can also be passed directly to `write()`: they are forwarded as-is and override the default settings.

In [None]:
store.write(data, "data", "data.lh5", wo_mode="of", shuffle=True, compression="gzip")
show_h5ds_opts("data/col1")

Object-specific compression settings are supported via the `hdf5_settings` LGDO attribute:

In [None]:
data["col2"].attrs["hdf5_settings"] = {"compression": "gzip"}
store.write(data, "data", "data.lh5", wo_mode="of")

show_h5ds_opts("data/col1")
show_h5ds_opts("data/col2")

We are now storing table columns with different compression settings.

<div class="alert alert-info">
**Note:** since any [h5py.Group.create_dataset()](https://docs.h5py.org/en/stable/high/group.html#h5py.Group.create_dataset) keyword argument can be used in `write()` or set in the `hdf5_settings` attribute, other HDF5 dataset settings can be configured, like the chunk size.
</div>

In [None]:
store.write(data, "data", "data.lh5", wo_mode="of", chunks=2)
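
To see what these keyword arguments mean at the HDF5 level, independently of LGDOs, the following sketch calls `h5py` directly and reads the resulting dataset properties back. The helper `dataset_opts()` and the file name `sketch.h5` are hypothetical names used only for this illustration, and the file lives in memory (h5py's `core` driver with `backing_store=False`), so nothing is written to disk.

```python
import h5py
import numpy as np


def dataset_opts(**create_dataset_kwargs):
    """Create a dataset with the given h5py options and report what was stored."""
    values = np.arange(0, 100, 0.1)
    # in-memory HDF5 file: the name is arbitrary and no file is created on disk
    with h5py.File("sketch.h5", "w", driver="core", backing_store=False) as f:
        ds = f.create_dataset("col1", data=values, **create_dataset_kwargs)
        return {
            "compression": ds.compression,
            "compression_opts": ds.compression_opts,
            "shuffle": ds.shuffle,
            "chunks": ds.chunks,
        }


opts = dataset_opts(
    compression="gzip", compression_opts=4, shuffle=True, chunks=(100,)
)
print(opts)
```

These are the same properties that `show_h5ds_opts()` prints for the LH5 file above.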

## Waveform compression

*legend-pydataobj* implements fast custom waveform compression routines in the [lgdo.compression](https://legend-pydataobj.readthedocs.io/en/stable/api/lgdo.compression.html) subpackage.

Let's try them out on some waveform test data:

In [None]:
from legendtestdata import LegendTestData

ldata = LegendTestData()
wfs, n_rows = store.read(
    "geds/raw/waveform",
    ldata.get_path("lh5/LDQTA_r117_20200110T105115Z_cal_geds_raw.lh5"),
)
wfs

Let's encode the waveform values with the [RadwareSigcompress](https://legend-pydataobj.readthedocs.io/en/stable/api/lgdo.compression.html#lgdo.compression.radware.RadwareSigcompress) codec.

<div class="alert alert-info">
**Note:** samples from these test waveforms must be shifted by -32768 for compatibility reasons, see [lgdo.compression.radware.encode()](https://legend-pydataobj.readthedocs.io/en/stable/api/lgdo.compression.html#lgdo.compression.radware.encode).
</div>

In [None]:
from lgdo.compression import encode, RadwareSigcompress

enc_values = encode(wfs.values, RadwareSigcompress(codec_shift=-32768))
enc_values

The output LGDO is an [ArrayOfEncodedEqualSizedArrays](https://legend-pydataobj.readthedocs.io/en/stable/api/lgdo.types.html#lgdo.types.encoded.ArrayOfEncodedEqualSizedArrays), which is basically an array of bytes representing the compressed data. How big is this compressed object in bytes?

In [None]:
enc_values.encoded_data.flattened_data.nda.nbytes

How big was the original data structure?

In [None]:
wfs.values.nda.nbytes

It shrank quite a bit!

Let's now make a `WaveformTable` object wrapping these encoded values, instead of the uncompressed ones, and dump it to disk.

In [None]:
enc_wfs = lgdo.WaveformTable(
    values=enc_values,
    t0=wfs.t0,
    dt=wfs.dt,
)
store.write(enc_wfs, "waveforms", "data.lh5", wo_mode="o")
lgdo.show("data.lh5", attrs=True)

The LH5 structure is more complex now. Note how the compression settings are stored as HDF5 attributes.

<div class="alert alert-warning">
**Warning:** HDF5 compression is never applied to waveforms compressed with these custom filters.
</div>

Let's try to read the data back in memory:

In [None]:
obj, _ = store.read("waveforms", "data.lh5")
obj.values

Wait, this is not the compressed data we just wrote to disk: it got decompressed on the fly! It is still possible to read back the compressed data as-is, though:

In [None]:
obj, _ = store.read("waveforms", "data.lh5", decompress=False)
obj.values

And then decompress it manually:

In [None]:
from lgdo.compression import decode

decode(obj.values)

Waveform compression settings can also be specified at the LGDO level by setting the `compression` attribute on the `values` field of a `WaveformTable` object:

In [None]:
from lgdo.compression import ULEB128ZigZagDiff

wfs.values.attrs["compression"] = ULEB128ZigZagDiff()
store.write(wfs, "waveforms", "data.lh5", wo_mode="of")

obj, _ = store.read("waveforms", "data.lh5", decompress=False)
obj.values.attrs["codec"]
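
Judging by its name, the `ULEB128ZigZagDiff` codec combines three standard techniques: taking differences between consecutive samples, ZigZag-mapping the signed differences to unsigned integers, and storing them as ULEB128 variable-length integers. The sketch below is a pure-Python illustration of those textbook techniques under that assumption. It is not the library's implementation, and the byte layout produced by *legend-pydataobj* may differ. All function names are hypothetical.

```python
def zigzag(n: int) -> int:
    """Map signed to unsigned integers: 0, -1, 1, -2, ... -> 0, 1, 2, 3, ..."""
    return 2 * n if n >= 0 else -2 * n - 1


def unzigzag(z: int) -> int:
    """Invert zigzag()."""
    return (z >> 1) ^ -(z & 1)


def uleb128_encode(value: int) -> bytes:
    """Encode an unsigned integer in 7-bit groups, least significant group first."""
    out = bytearray()
    while True:
        byte = value & 0x7F
        value >>= 7
        if value:
            out.append(byte | 0x80)  # high bit set: more bytes follow
        else:
            out.append(byte)
            return bytes(out)


def encode_waveform(samples) -> bytes:
    """Encode the differences between consecutive samples (first one vs. 0)."""
    out = bytearray()
    prev = 0
    for s in samples:
        out += uleb128_encode(zigzag(s - prev))
        prev = s
    return bytes(out)


def decode_waveform(data: bytes) -> list:
    """Invert encode_waveform()."""
    samples, prev, value, shift = [], 0, 0, 0
    for byte in data:
        value |= (byte & 0x7F) << shift
        if byte & 0x80:
            shift += 7  # continuation byte: keep accumulating
        else:
            prev += unzigzag(value)
            samples.append(prev)
            value, shift = 0, 0
    return samples


waveform = [1000, 1002, 1001, 1005, 990, 990]
encoded = encode_waveform(waveform)
print(len(encoded), "bytes instead of", 2 * len(waveform), "as 16-bit integers")
```

Because neighbouring waveform samples are usually close to each other, most differences fit in a single byte, which is where the size reduction comes from.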

Further reading:

- [Available waveform compression algorithms](https://legend-pydataobj.readthedocs.io/en/stable/api/lgdo.compression.html)
- [read() docstring](https://legend-pydataobj.readthedocs.io/en/stable/api/lgdo.html#lgdo.lh5.store.LH5Store.read)
- [write() docstring](https://legend-pydataobj.readthedocs.io/en/stable/api/lgdo.html#lgdo.lh5_store.LH5Store.write)