
Add ZSTD support (#104)
Add support for compressing the data with ZSTD, a new fast and efficient compression algorithm. ZSTD is pulled in as a submodule. This bumps the version to 0.4.0.

Co-authored-by: Richard Shaw <richard@phas.ubc.ca>
james-s-willis and jrs65 committed Feb 25, 2022
1 parent 0aee87e commit 9ba2580
Showing 22 changed files with 578 additions and 70 deletions.
5 changes: 4 additions & 1 deletion .github/workflows/main.yml
@@ -45,8 +45,11 @@ jobs:
          pip install -r requirements.txt
          pip install pytest
+         # Pull in ZSTD repo
+         git submodule update --init
          # Installing the plugin to arbitrary directory to check the install script.
-         python setup.py install --h5plugin --h5plugin-dir ~/hdf5/lib
+         python setup.py install --h5plugin --h5plugin-dir ~/hdf5/lib --zstd
      - name: Run tests
        run: pytest .
7 changes: 4 additions & 3 deletions .github/workflows/wheels.yml
@@ -26,14 +26,15 @@ jobs:
        run: python -m cibuildwheel --output-dir wheelhouse-hdf5-${{ matrix.hdf5 }}
        env:
          CIBW_ARCHS_LINUX: "x86_64"
-         CIBW_BEFORE_BUILD_LINUX: chmod +x .github/workflows/install_hdf5.sh; .github/workflows/install_hdf5.sh ${{ matrix.hdf5 }}
-         CIBW_ENVIRONMENT: "LD_LIBRARY_PATH=$LD_LIBRARY_PATH:/usr/local/lib"
+         CIBW_BEFORE_BUILD_LINUX: chmod +x .github/workflows/install_hdf5.sh; .github/workflows/install_hdf5.sh ${{ matrix.hdf5 }};
+           git submodule update --init
+         CIBW_ENVIRONMENT: "LD_LIBRARY_PATH=$LD_LIBRARY_PATH:/usr/local/lib ENABLE_ZSTD=1"
          CIBW_TEST_REQUIRES: pytest
          # Install different version of HDF5 for unit tests to ensure the
          # wheels are independent of HDF5 installation
          CIBW_BEFORE_TEST: chmod +x .github/workflows/install_hdf5.sh; .github/workflows/install_hdf5.sh 1.8.11;
          # Run unit tests but disable test_h5plugin.py
-         CIBW_TEST_COMMAND: CI_BUILD_WHEEL=1 pytest {package}/tests
+         CIBW_TEST_COMMAND: pytest {package}/tests

# Package wheels and host on CI
- uses: actions/upload-artifact@v2
3 changes: 3 additions & 0 deletions .gitmodules
@@ -0,0 +1,3 @@
+[submodule "zstd"]
+	path = zstd
+	url = https://github.com/facebook/zstd
29 changes: 20 additions & 9 deletions README.rst
@@ -21,12 +21,12 @@ is performed within blocks of data roughly 8kB long [1]_.

This does not in itself compress data, only rearranges it for more efficient
compression. To perform the actual compression you will need a compression
-library. Bitshuffle has been designed to be well matched Marc Lehmann's
-LZF_ as well as LZ4_. Note that because Bitshuffle modifies the data at the bit
+library. Bitshuffle has been designed to be well matched to Marc Lehmann's
+LZF_ as well as LZ4_ and ZSTD_. Note that because Bitshuffle modifies the data at the bit
level, sophisticated entropy reducing compression libraries such as GZIP and
BZIP are unlikely to achieve significantly better compression than simpler and
-faster duplicate-string-elimination algorithms such as LZF and LZ4. Bitshuffle
-thus includes routines (and HDF5 filter options) to apply LZ4 compression to
+faster duplicate-string-elimination algorithms such as LZF, LZ4 and ZSTD. Bitshuffle
+thus includes routines (and HDF5 filter options) to apply LZ4 and ZSTD compression to
each block after shuffling [2]_.

The Bitshuffle algorithm relies on neighbouring elements of a dataset being
@@ -50,7 +50,7 @@ used outside of python and in command line utilities such as ``h5dump``.
.. [1] Chosen to fit comfortably within L1 cache as well as be well matched
   to the window of the LZF compression library.
-.. [2] Over applying bitshuffle to the full dataset then applying LZ4
+.. [2] Over applying bitshuffle to the full dataset then applying LZ4/ZSTD
   compression, this has the tremendous advantage that the block is
   already in the L1 cache.
@@ -62,6 +62,8 @@ used outside of python and in command line utilities such as ``h5dump``.

.. _LZ4: https://code.google.com/p/lz4/

+.. _ZSTD: https://github.com/facebook/zstd


Applications
------------
@@ -97,11 +99,14 @@ Installation for Python

Installation requires python 2.7+ or 3.3+, HDF5 1.8.4 or later, HDF5 for python
(h5py), Numpy and Cython. Bitshuffle is linked against HDF5. To use the dynamically
-loaded HDF5 filter requires HDF5 1.8.11 or later.
+loaded HDF5 filter requires HDF5 1.8.11 or later. If ZSTD support is enabled, the ZSTD
+repo needs to be pulled into bitshuffle before installation with::
+
+    git submodule update --init
+
-To install::
+To install bitshuffle::

-    python setup.py install [--h5plugin [--h5plugin-dir=spam]]
+    python setup.py install [--h5plugin [--h5plugin-dir=spam] --zstd]

To get finer control of installation options, including whether to compile
with OpenMP multi-threading, copy the ``setup.cfg.example`` to ``setup.cfg``
@@ -112,6 +117,8 @@ Bitshuffle and LZF filters outside of python), set the environment variable
``HDF5_PLUGIN_PATH`` to the value of ``--h5plugin-dir`` or use HDF5's default
search location of ``/usr/local/hdf5/lib/plugin``.

+ZSTD support is enabled with ``--zstd``.
+
If you get an error about missing source files when building the extensions,
try upgrading setuptools. There is a weird bug where setuptools prior to 0.7
doesn't work properly with Cython in some cases.
@@ -133,9 +140,13 @@ the filter will be available only within python and only after importing
The filter can be added to new datasets either through the `h5py` low level
interface or through the convenience functions provided in
`bitshuffle.h5`. See the docstrings and unit tests for examples. For `h5py`
-version 2.5.0 and later Bitshuffle can added to new datasets through the
+version 2.5.0 and later Bitshuffle can be added to new datasets through the
high level interface, as in the example below.

+The compression algorithm can be configured using the `filter_opts` in
+`bitshuffle.h5.create_dataset()`. LZ4 is chosen with:
+`(BLOCK_SIZE, h5.H5_COMPRESS_LZ4)` and ZSTD with:
+`(BLOCK_SIZE, h5.H5_COMPRESS_ZSTD, COMP_LVL)`. See `test_h5filter.py` for an example.
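
For illustration (not part of the diff), a minimal sketch of selecting ZSTD
through the `h5py` high level interface; the file name, dataset shape, and
compression level 3 are arbitrary assumptions::

    import h5py
    import numpy as np
    import bitshuffle.h5

    # Filter options: (block size, compressor flag, compression level).
    # A block size of 0 lets Bitshuffle choose one automatically.
    f = h5py.File("example.h5", "w")
    dataset = f.create_dataset(
        "data",
        (100, 100),
        dtype=np.float32,
        compression=bitshuffle.h5.H5FILTER,
        compression_opts=(0, bitshuffle.h5.H5_COMPRESS_ZSTD, 3),
    )
    dataset[:] = np.random.standard_normal((100, 100)).astype(np.float32)
    f.close()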

Example h5py
------------
16 changes: 15 additions & 1 deletion bitshuffle/__init__.py
@@ -1,3 +1,4 @@
+# flake8: noqa
"""
Filter for improving compression of typed binary data.
@@ -11,6 +12,8 @@
    bitunshuffle
    compress_lz4
    decompress_lz4
+    compress_zstd
+    decompress_zstd
"""

@@ -19,6 +22,7 @@

from bitshuffle.ext import (
    __version__,
+    __zstd__,
    bitshuffle,
    bitunshuffle,
    using_NEON,
@@ -28,6 +32,16 @@
    decompress_lz4,
)

+# Import ZSTD API if enabled
+zstd_api = []
+if __zstd__:
+    from bitshuffle.ext import (
+        compress_zstd,
+        decompress_zstd,
+    )
+
+    zstd_api += ["compress_zstd", "decompress_zstd"]

__all__ = [
    "__version__",
    "bitshuffle",
@@ -37,4 +51,4 @@
    "using_AVX2",
    "compress_lz4",
    "decompress_lz4",
-]
+] + zstd_api
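
As a usage note (not part of the diff): because the ZSTD names are exported
only when the extension was built with ZSTD, callers can feature-detect at
runtime. A small sketch with illustrative data::

    import numpy as np
    import bitshuffle

    data = np.arange(2048, dtype=np.uint16)

    # __zstd__ reports whether this build exposes the ZSTD codecs.
    if bitshuffle.__zstd__:
        buf = bitshuffle.compress_zstd(data)
    else:
        buf = bitshuffle.compress_lz4(data)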
122 changes: 119 additions & 3 deletions bitshuffle/ext.pyx
@@ -33,14 +33,23 @@ cdef extern from b"bitshuffle.h":
                             int block_size) nogil
    int bshuf_decompress_lz4(void *A, void *B, int size, int elem_size,
                             int block_size) nogil
+    IF ZSTD_SUPPORT:
+        int bshuf_compress_zstd_bound(int size, int elem_size, int block_size)
+        int bshuf_compress_zstd(void *A, void *B, int size, int elem_size,
+                                int block_size, const int comp_lvl) nogil
+        int bshuf_decompress_zstd(void *A, void *B, int size, int elem_size,
+                                  int block_size) nogil
    int BSHUF_VERSION_MAJOR
    int BSHUF_VERSION_MINOR
    int BSHUF_VERSION_POINT

-__version__ = "%d.%d.%d" % (BSHUF_VERSION_MAJOR, BSHUF_VERSION_MINOR,
-                            BSHUF_VERSION_POINT)
+__version__ = "{0}.{1}.{2}".format(BSHUF_VERSION_MAJOR, BSHUF_VERSION_MINOR,
+                                   BSHUF_VERSION_POINT)
+
+IF ZSTD_SUPPORT:
+    __zstd__ = True
+ELSE:
+    __zstd__ = False

# Prototypes from bitshuffle.c
cdef extern int bshuf_copy(void *A, void *B, int size, int elem_size)
@@ -451,3 +460,110 @@ def decompress_lz4(np.ndarray arr not None, shape, dtype, int block_size=0):
    return out


+IF ZSTD_SUPPORT:
+    @cython.boundscheck(False)
+    @cython.wraparound(False)
+    def compress_zstd(np.ndarray arr not None, int block_size=0, int comp_lvl=1):
+        """Bitshuffle then compress an array using ZSTD.
+
+        Parameters
+        ----------
+        arr : numpy array
+            Data to be processed.
+        block_size : positive integer
+            Block size in number of elements. By default, block size is chosen
+            automatically.
+        comp_lvl : positive integer
+            Compression level applied by ZSTD
+
+        Returns
+        -------
+        out : array with np.uint8 data type
+            Buffer holding compressed data.
+
+        """
+
+        cdef int ii, size, itemsize, count=0
+        shape = (arr.shape[i] for i in range(arr.ndim))
+        if not arr.flags['C_CONTIGUOUS']:
+            msg = "Input array must be C-contiguous."
+            raise ValueError(msg)
+        size = arr.size
+        dtype = arr.dtype
+        itemsize = dtype.itemsize
+
+        max_out_size = bshuf_compress_zstd_bound(size, itemsize, block_size)
+
+        cdef np.ndarray out
+        out = np.empty(max_out_size, dtype=np.uint8)
+
+        cdef np.ndarray[dtype=np.uint8_t, ndim=1, mode="c"] arr_flat
+        arr_flat = arr.view(np.uint8).ravel()
+        cdef np.ndarray[dtype=np.uint8_t, ndim=1, mode="c"] out_flat
+        out_flat = out.view(np.uint8).ravel()
+        cdef void* arr_ptr = <void*> &arr_flat[0]
+        cdef void* out_ptr = <void*> &out_flat[0]
+        with nogil:
+            for ii in range(REPEATC):
+                count = bshuf_compress_zstd(arr_ptr, out_ptr, size, itemsize, block_size, comp_lvl)
+        if count < 0:
+            msg = "Failed. Error code %d."
+            excp = RuntimeError(msg % count, count)
+            raise excp
+        return out[:count]

+    @cython.boundscheck(False)
+    @cython.wraparound(False)
+    def decompress_zstd(np.ndarray arr not None, shape, dtype, int block_size=0):
+        """Decompress a buffer using ZSTD then bitunshuffle it yielding an array.
+
+        Parameters
+        ----------
+        arr : numpy array
+            Input data to be decompressed.
+        shape : tuple of integers
+            Shape of the output (decompressed array). Must match the shape of the
+            original data array before compression.
+        dtype : numpy dtype
+            Datatype of the output array. Must match the data type of the original
+            data array before compression.
+        block_size : positive integer
+            Block size in number of elements. Must match value used for
+            compression.
+
+        Returns
+        -------
+        out : numpy array with shape *shape* and data type *dtype*
+            Decompressed data.
+
+        """
+
+        cdef int ii, size, itemsize, count=0
+        if not arr.flags['C_CONTIGUOUS']:
+            msg = "Input array must be C-contiguous."
+            raise ValueError(msg)
+        size = np.prod(shape)
+        itemsize = dtype.itemsize
+
+        cdef np.ndarray out
+        out = np.empty(tuple(shape), dtype=dtype)
+
+        cdef np.ndarray[dtype=np.uint8_t, ndim=1, mode="c"] arr_flat
+        arr_flat = arr.view(np.uint8).ravel()
+        cdef np.ndarray[dtype=np.uint8_t, ndim=1, mode="c"] out_flat
+        out_flat = out.view(np.uint8).ravel()
+        cdef void* arr_ptr = <void*> &arr_flat[0]
+        cdef void* out_ptr = <void*> &out_flat[0]
+        with nogil:
+            for ii in range(REPEATC):
+                count = bshuf_decompress_zstd(arr_ptr, out_ptr, size, itemsize,
+                                              block_size)
+        if count < 0:
+            msg = "Failed. Error code %d."
+            excp = RuntimeError(msg % count, count)
+            raise excp
+        if count != arr.size:
+            msg = "Decompressed different number of bytes than input buffer size."
+            msg += " Input buffer %d, decompressed %d." % (arr.size, count)
+            raise RuntimeError(msg, count)
+        return out
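
Taken together, the two new functions round-trip as below; a sketch assuming a
build with ZSTD enabled (the array contents and ``comp_lvl=5`` are arbitrary)::

    import numpy as np
    import bitshuffle

    data = np.random.randint(0, 8, size=(64, 64)).astype(np.uint32)

    # Bitshuffle then ZSTD-compress; returns a uint8 buffer.
    compressed = bitshuffle.compress_zstd(data, block_size=0, comp_lvl=5)

    # Shape, dtype and block_size must match those used for compression.
    restored = bitshuffle.decompress_zstd(compressed, data.shape, data.dtype,
                                          block_size=0)
    assert np.array_equal(data, restored)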
3 changes: 3 additions & 0 deletions bitshuffle/h5.pyx
@@ -14,6 +14,7 @@ Constants
H5FILTER : The Bitshuffle HDF5 filter integer identifier.
H5_COMPRESS_LZ4 : Filter option flag for LZ4 compression.
+H5_COMPRESS_ZSTD : Filter option flag for ZSTD compression.
Functions
=========
@@ -54,13 +55,15 @@ cdef extern from b"bshuf_h5filter.h":
    int bshuf_register_h5filter()
    int BSHUF_H5FILTER
    int BSHUF_H5_COMPRESS_LZ4
+    int BSHUF_H5_COMPRESS_ZSTD

cdef extern int init_filter(const char* libname)

cdef int LZF_FILTER = 32000

H5FILTER = BSHUF_H5FILTER
H5_COMPRESS_LZ4 = BSHUF_H5_COMPRESS_LZ4
+H5_COMPRESS_ZSTD = BSHUF_H5_COMPRESS_ZSTD

# Init HDF5 dynamic loading with HDF5 library used by h5py
if not sys.platform.startswith('win'):
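
A brief sketch (not part of the diff) of how the new constant is visible from
python: importing ``bitshuffle.h5`` registers the filter with the HDF5 library
used by h5py, after which the filter and the ZSTD flag can be inspected::

    import h5py
    import bitshuffle.h5

    # The Bitshuffle filter should be registered once bitshuffle.h5 is imported.
    assert h5py.h5z.filter_avail(bitshuffle.h5.H5FILTER)

    # The ZSTD flag goes in the filter options when creating datasets.
    print("ZSTD option flag:", bitshuffle.h5.H5_COMPRESS_ZSTD)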