Add ZSTD support #104

Merged
merged 55 commits into from
Feb 25, 2022
55 commits
050477e
Added zstd directories to include paths.
james-s-willis May 5, 2021
ddaabbe
Updated API with calls to zstd compress/decompress functions.
james-s-willis May 5, 2021
2774f40
Added zstd source v1.4.9.
james-s-willis May 5, 2021
26653a8
Added ZSTD compression to H5 filter.
james-s-willis May 6, 2021
ec9085d
Added ZSTD python interface.
james-s-willis May 6, 2021
65670fa
Removed unused ZSTD files.
james-s-willis May 6, 2021
930a3b3
Added compression level as argument to compress_ZSTD.
james-s-willis May 13, 2021
a82fdb5
Retrieve compression level from cd_values array.
james-s-willis May 13, 2021
e933d0f
Added extra source files for ZSTD.
james-s-willis May 17, 2021
30bbe72
Include all source files from zstd/common, zstd/compress and zstd/dec…
james-s-willis May 17, 2021
c31023f
Added H5_COMPRESS_ZSTD constant for ZSTD compression.
james-s-willis May 18, 2021
d729fd1
Test ZSTD compression with h5filter.
james-s-willis May 18, 2021
c4eb82e
Fixed check for ZSTD decompression.
james-s-willis May 18, 2021
40efa63
Fix bitshuffle __version__ string. Update bitshuffle version to 0.3.6…
james-s-willis May 19, 2021
7979cb1
Added ZSTD compression to regression tests.
james-s-willis May 19, 2021
b65519f
Removed old test data and replaced with lz4 and zstd specific files.
james-s-willis May 19, 2021
a1dc362
Update documentation and remove debug.
james-s-willis May 19, 2021
e7220f4
Re-added old regression test data.
james-s-willis May 19, 2021
b02bd98
Removed lz4 and zstd specific test files.
james-s-willis May 19, 2021
392d234
Add ZSTD compressed data as another dataset to regression test file.
james-s-willis May 19, 2021
3838e0a
New 0.3.6 regression test file.
james-s-willis May 19, 2021
1646cbb
Update README with ZSTD comments.
james-s-willis May 20, 2021
910eefa
Merge branch 'master' into jsw/zstd-support
james-s-willis May 26, 2021
877dab4
Formatting.
james-s-willis May 26, 2021
5a65422
Missed HEAD from merge.
james-s-willis May 26, 2021
07c77e9
Formatting.
james-s-willis May 26, 2021
928a20a
Formatting.
james-s-willis May 26, 2021
7e51067
Removed duplicate variable.
james-s-willis May 26, 2021
eb7e85c
Move new regression test data.
james-s-willis May 26, 2021
1e77673
Remove ZSTD source.
james-s-willis May 31, 2021
21467ba
Add ZSTD as a submodule.
james-s-willis May 31, 2021
ec69b88
New path for zstd library.
james-s-willis May 31, 2021
18a06ce
Fix bitshuffle __version__ string.
james-s-willis May 31, 2021
f4dc23c
Update src/bshuf_h5filter.c
james-s-willis Jun 18, 2021
4d0a88e
Merge branch 'master' into jsw/zstd-sub
james-s-willis Aug 5, 2021
d832710
Add nogil to ZSTD functions.
james-s-willis Aug 5, 2021
b370026
Updated Bitshuffle minor version to 0.4.0.
james-s-willis Aug 5, 2021
2cb8bdb
Formatting.
james-s-willis Aug 5, 2021
83df9d0
Pull in ZSTD repo when running CI.
james-s-willis Aug 5, 2021
3282095
Update regression file version.
james-s-willis Aug 5, 2021
8c245e3
Fixed typo in name of 'origional' dataset inside regression data files.
james-s-willis Aug 5, 2021
00f897b
Bump version to 0.4.0 in setup.py
james-s-willis Aug 10, 2021
1314cd7
Add instructions for pulling the ZSTD repo.
james-s-willis Aug 10, 2021
ef01823
Make ZSTD build optional. Enable with --zstd to install command.
james-s-willis Aug 10, 2021
7fcb0f1
Provide --zstd option when building bitshuffle.
james-s-willis Aug 11, 2021
7e33d74
Formatting.
james-s-willis Aug 11, 2021
7149779
Only define ZSTD_SUPPORT when --zstd is used.
james-s-willis Aug 12, 2021
6d44c8d
Ignore __init__.py when using flake8.
james-s-willis Aug 12, 2021
e865764
Enable ZSTD support when building wheel using environment variable.
james-s-willis Aug 13, 2021
c9ab26f
Use inbuilt CIBUILDWHEEL for plugin test instead of user defined CI_B…
james-s-willis Aug 13, 2021
a75a165
Flake8.
james-s-willis Aug 13, 2021
abe0cbb
Only run certain unit tests when ZSTD support is present.
james-s-willis Aug 13, 2021
4a98bf3
Build with ZSTD when running unit tests.
james-s-willis Aug 13, 2021
f5c6de1
Flake8.
james-s-willis Aug 13, 2021
3596158
Flake8.
james-s-willis Aug 13, 2021
5 changes: 4 additions & 1 deletion .github/workflows/main.yml
@@ -45,8 +45,11 @@ jobs:
pip install -r requirements.txt
pip install pytest

# Pull in ZSTD repo
git submodule update --init

# Installing the plugin to arbitrary directory to check the install script.
python setup.py install --h5plugin --h5plugin-dir ~/hdf5/lib
python setup.py install --h5plugin --h5plugin-dir ~/hdf5/lib --zstd

- name: Run tests
run: pytest .
7 changes: 4 additions & 3 deletions .github/workflows/wheels.yml
@@ -26,14 +26,15 @@ jobs:
run: python -m cibuildwheel --output-dir wheelhouse-hdf5-${{ matrix.hdf5 }}
env:
CIBW_ARCHS_LINUX: "x86_64"
CIBW_BEFORE_BUILD_LINUX: chmod +x .github/workflows/install_hdf5.sh; .github/workflows/install_hdf5.sh ${{ matrix.hdf5 }}
CIBW_ENVIRONMENT: "LD_LIBRARY_PATH=$LD_LIBRARY_PATH:/usr/local/lib"
CIBW_BEFORE_BUILD_LINUX: chmod +x .github/workflows/install_hdf5.sh; .github/workflows/install_hdf5.sh ${{ matrix.hdf5 }};
git submodule update --init
CIBW_ENVIRONMENT: "LD_LIBRARY_PATH=$LD_LIBRARY_PATH:/usr/local/lib ENABLE_ZSTD=1"
CIBW_TEST_REQUIRES: pytest
# Install different version of HDF5 for unit tests to ensure the
# wheels are independent of HDF5 installation
CIBW_BEFORE_TEST: chmod +x .github/workflows/install_hdf5.sh; .github/workflows/install_hdf5.sh 1.8.11;
# Run unit tests but disable test_h5plugin.py
CIBW_TEST_COMMAND: CI_BUILD_WHEEL=1 pytest {package}/tests
CIBW_TEST_COMMAND: pytest {package}/tests

# Package wheels and host on CI
- uses: actions/upload-artifact@v2
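
The two workflow changes above enable ZSTD in different ways: `--zstd` on the install command and `ENABLE_ZSTD=1` in the wheel build environment. A minimal sketch of how a build script could combine the two switches is given below. `zstd_requested` is a hypothetical helper, and the rule for interpreting the environment variable is an assumption for illustration, not necessarily what bitshuffle's `setup.py` does.

```python
import os
import sys


def zstd_requested(argv=None, environ=None):
    """Return True if ZSTD support was requested via CLI flag or environment.

    Hypothetical helper: the flag and variable names come from the PR, but
    the parsing rules here are assumptions made for illustration.
    """
    argv = sys.argv[1:] if argv is None else argv
    environ = os.environ if environ is None else environ
    if "--zstd" in argv:
        return True
    # Assumption: unset, empty, "0" and "false" all mean "disabled".
    value = environ.get("ENABLE_ZSTD", "").strip().lower()
    return value not in ("", "0", "false")
```

Passing `argv` and `environ` explicitly keeps the check easy to exercise without touching the real process state.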
3 changes: 3 additions & 0 deletions .gitmodules
@@ -0,0 +1,3 @@
[submodule "zstd"]
path = zstd
url = https://github.com/facebook/zstd
29 changes: 20 additions & 9 deletions README.rst
@@ -21,12 +21,12 @@ is performed within blocks of data roughly 8kB long [1]_.

This does not in itself compress data, only rearranges it for more efficient
compression. To perform the actual compression you will need a compression
library. Bitshuffle has been designed to be well matched Marc Lehmann's
LZF_ as well as LZ4_. Note that because Bitshuffle modifies the data at the bit
library. Bitshuffle has been designed to be well matched to Marc Lehmann's
LZF_ as well as LZ4_ and ZSTD_. Note that because Bitshuffle modifies the data at the bit
level, sophisticated entropy reducing compression libraries such as GZIP and
BZIP are unlikely to achieve significantly better compression than simpler and
faster duplicate-string-elimination algorithms such as LZF and LZ4. Bitshuffle
thus includes routines (and HDF5 filter options) to apply LZ4 compression to
faster duplicate-string-elimination algorithms such as LZF, LZ4 and ZSTD. Bitshuffle
thus includes routines (and HDF5 filter options) to apply LZ4 and ZSTD compression to
each block after shuffling [2]_.

The Bitshuffle algorithm relies on neighbouring elements of a dataset being
@@ -50,7 +50,7 @@ used outside of python and in command line utilities such as ``h5dump``.
.. [1] Chosen to fit comfortably within L1 cache as well as be well matched
window of the LZF compression library.

.. [2] Over applying bitshuffle to the full dataset then applying LZ4
.. [2] Over applying bitshuffle to the full dataset then applying LZ4/ZSTD
compression, this has the tremendous advantage that the block is
already in the L1 cache.

@@ -62,6 +62,8 @@ used outside of python and in command line utilities such as ``h5dump``.

.. _LZ4: https://code.google.com/p/lz4/

.. _ZSTD: https://github.com/facebook/zstd


Applications
------------
@@ -97,11 +99,14 @@ Installation for Python

Installation requires python 2.7+ or 3.3+, HDF5 1.8.4 or later, HDF5 for python
(h5py), Numpy and Cython. Bitshuffle is linked against HDF5. To use the dynamically
loaded HDF5 filter requires HDF5 1.8.11 or later.
loaded HDF5 filter requires HDF5 1.8.11 or later. If ZSTD support is enabled, the ZSTD
repo needs to be pulled into bitshuffle before installation with::

git submodule update --init

To install::
To install bitshuffle::

python setup.py install [--h5plugin [--h5plugin-dir=spam]]
python setup.py install [--h5plugin [--h5plugin-dir=spam] --zstd]

To get finer control of installation options, including whether to compile
with OpenMP multi-threading, copy the ``setup.cfg.example`` to ``setup.cfg``
@@ -112,6 +117,8 @@ Bitshuffle and LZF filters outside of python), set the environment variable
``HDF5_PLUGIN_PATH`` to the value of ``--h5plugin-dir`` or use HDF5's default
search location of ``/usr/local/hdf5/lib/plugin``.

ZSTD support is enabled with ``--zstd``.

If you get an error about missing source files when building the extensions,
try upgrading setuptools. There is a weird bug where setuptools prior to 0.7
doesn't work properly with Cython in some cases.
@@ -133,9 +140,13 @@ the filter will be available only within python and only after importing
The filter can be added to new datasets either through the `h5py` low level
interface or through the convenience functions provided in
`bitshuffle.h5`. See the docstrings and unit tests for examples. For `h5py`
version 2.5.0 and later Bitshuffle can added to new datasets through the
version 2.5.0 and later Bitshuffle can be added to new datasets through the
high level interface, as in the example below.

The compression algorithm can be configured using the `filter_opts` in
`bitshuffle.h5.create_dataset()`. LZ4 is chosen with:
`(BLOCK_SIZE, h5.H5_COMPRESS_LZ4)` and ZSTD with:
`(BLOCK_SIZE, h5.H5_COMPRESS_ZSTD, COMP_LVL)`. See `test_h5filter.py` for an example.
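
As a rough illustration of the two `filter_opts` layouts described above, the sketch below builds the tuples without importing bitshuffle. `make_filter_opts` is a hypothetical helper, and the integer values of the two constants are assumptions standing in for `bitshuffle.h5.H5_COMPRESS_LZ4` and `bitshuffle.h5.H5_COMPRESS_ZSTD`; real code should import the constants from `bitshuffle.h5` instead.

```python
# Stand-in values (assumed); real code should use
# bitshuffle.h5.H5_COMPRESS_LZ4 and bitshuffle.h5.H5_COMPRESS_ZSTD.
H5_COMPRESS_LZ4 = 2
H5_COMPRESS_ZSTD = 3


def make_filter_opts(codec, block_size=0, comp_lvl=1):
    """Build an options tuple in the layout the README text describes.

    codec is "lz4" or "zstd"; comp_lvl is only used for ZSTD.
    """
    if codec == "lz4":
        return (block_size, H5_COMPRESS_LZ4)
    if codec == "zstd":
        return (block_size, H5_COMPRESS_ZSTD, comp_lvl)
    raise ValueError("unknown codec: %r" % (codec,))
```

The only difference between the two layouts is the trailing compression level, which is why the ZSTD tuple has three entries and the LZ4 tuple two.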

Example h5py
------------
16 changes: 15 additions & 1 deletion bitshuffle/__init__.py
@@ -1,3 +1,4 @@
# flake8: noqa
"""
Filter for improving compression of typed binary data.

@@ -11,6 +12,8 @@
bitunshuffle
compress_lz4
decompress_lz4
compress_zstd
decompress_zstd

"""

@@ -19,6 +22,7 @@

from bitshuffle.ext import (
__version__,
__zstd__,
bitshuffle,
bitunshuffle,
using_NEON,
@@ -28,6 +32,16 @@
decompress_lz4,
)

# Import ZSTD API if enabled
zstd_api = []
if __zstd__:
from bitshuffle.ext import (
compress_zstd,
decompress_zstd,
)

zstd_api += ["compress_zstd", "decompress_zstd"]

__all__ = [
"__version__",
"bitshuffle",
@@ -37,4 +51,4 @@
"using_AVX2",
"compress_lz4",
"decompress_lz4",
]
] + zstd_api
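
Because `compress_zstd` and `decompress_zstd` are only exported when the extension was built with ZSTD, calling code may want to check the `__zstd__` flag before using them. A minimal sketch follows, assuming a hypothetical helper `pick_compressor` that falls back to LZ4. The module is passed in as an argument so the logic can be exercised without bitshuffle installed.

```python
def pick_compressor(bs_module):
    """Return (name, compress_function) for the best codec the build offers.

    bs_module is expected to look like the bitshuffle package: it provides
    compress_lz4 and, when __zstd__ is true, compress_zstd as well.
    """
    if getattr(bs_module, "__zstd__", False):
        return "zstd", bs_module.compress_zstd
    return "lz4", bs_module.compress_lz4
```

With bitshuffle installed this would be called as `pick_compressor(bitshuffle)`.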
122 changes: 119 additions & 3 deletions bitshuffle/ext.pyx
@@ -33,14 +33,23 @@ cdef extern from b"bitshuffle.h":
int block_size) nogil
int bshuf_decompress_lz4(void *A, void *B, int size, int elem_size,
int block_size) nogil
IF ZSTD_SUPPORT:
int bshuf_compress_zstd_bound(int size, int elem_size, int block_size)
int bshuf_compress_zstd(void *A, void *B, int size, int elem_size,
int block_size, const int comp_lvl) nogil
int bshuf_decompress_zstd(void *A, void *B, int size, int elem_size,
int block_size) nogil
int BSHUF_VERSION_MAJOR
int BSHUF_VERSION_MINOR
int BSHUF_VERSION_POINT

__version__ = "%d.%d.%d" % (BSHUF_VERSION_MAJOR, BSHUF_VERSION_MINOR,
BSHUF_VERSION_POINT)

__version__ = str("%d.%d.%d").format(BSHUF_VERSION_MAJOR, BSHUF_VERSION_MINOR,
BSHUF_VERSION_POINT)

IF ZSTD_SUPPORT:
__zstd__ = True
ELSE:
__zstd__ = False

# Prototypes from bitshuffle.c
cdef extern int bshuf_copy(void *A, void *B, int size, int elem_size)
@@ -451,3 +460,110 @@ def decompress_lz4(np.ndarray arr not None, shape, dtype, int block_size=0):
return out


IF ZSTD_SUPPORT:
@cython.boundscheck(False)
@cython.wraparound(False)
def compress_zstd(np.ndarray arr not None, int block_size=0, int comp_lvl=1):
"""Bitshuffle then compress an array using ZSTD.

Parameters
----------
arr : numpy array
Data to be processed.
block_size : positive integer
Block size in number of elements. By default, block size is chosen
automatically.
comp_lvl : positive integer
Compression level applied by ZSTD

Returns
-------
out : array with np.uint8 data type
Buffer holding compressed data.

"""

cdef int ii, size, itemsize, count=0
shape = (arr.shape[i] for i in range(arr.ndim))
if not arr.flags['C_CONTIGUOUS']:
msg = "Input array must be C-contiguous."
raise ValueError(msg)
size = arr.size
dtype = arr.dtype
itemsize = dtype.itemsize

max_out_size = bshuf_compress_zstd_bound(size, itemsize, block_size)

cdef np.ndarray out
out = np.empty(max_out_size, dtype=np.uint8)

cdef np.ndarray[dtype=np.uint8_t, ndim=1, mode="c"] arr_flat
arr_flat = arr.view(np.uint8).ravel()
cdef np.ndarray[dtype=np.uint8_t, ndim=1, mode="c"] out_flat
out_flat = out.view(np.uint8).ravel()
cdef void* arr_ptr = <void*> &arr_flat[0]
cdef void* out_ptr = <void*> &out_flat[0]
with nogil:
for ii in range(REPEATC):
count = bshuf_compress_zstd(arr_ptr, out_ptr, size, itemsize, block_size, comp_lvl)
if count < 0:
msg = "Failed. Error code %d."
excp = RuntimeError(msg % count, count)
raise excp
return out[:count]

@cython.boundscheck(False)
@cython.wraparound(False)
def decompress_zstd(np.ndarray arr not None, shape, dtype, int block_size=0):
"""Decompress a buffer using ZSTD then bitunshuffle it yielding an array.

Parameters
----------
arr : numpy array
Input data to be decompressed.
shape : tuple of integers
Shape of the output (decompressed array). Must match the shape of the
original data array before compression.
dtype : numpy dtype
Datatype of the output array. Must match the data type of the original
data array before compression.
block_size : positive integer
Block size in number of elements. Must match value used for
compression.

Returns
-------
out : numpy array with shape *shape* and data type *dtype*
Decompressed data.

"""

cdef int ii, size, itemsize, count=0
if not arr.flags['C_CONTIGUOUS']:
msg = "Input array must be C-contiguous."
raise ValueError(msg)
size = np.prod(shape)
itemsize = dtype.itemsize

cdef np.ndarray out
out = np.empty(tuple(shape), dtype=dtype)

cdef np.ndarray[dtype=np.uint8_t, ndim=1, mode="c"] arr_flat
arr_flat = arr.view(np.uint8).ravel()
cdef np.ndarray[dtype=np.uint8_t, ndim=1, mode="c"] out_flat
out_flat = out.view(np.uint8).ravel()
cdef void* arr_ptr = <void*> &arr_flat[0]
cdef void* out_ptr = <void*> &out_flat[0]
with nogil:
for ii in range(REPEATC):
count = bshuf_decompress_zstd(arr_ptr, out_ptr, size, itemsize,
block_size)
if count < 0:
msg = "Failed. Error code %d."
excp = RuntimeError(msg % count, count)
raise excp
if count != arr.size:
msg = "Decompressed different number of bytes than input buffer size. "
msg += "Input buffer %d, decompressed %d." % (arr.size, count)
raise RuntimeError(msg, count)
return out
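
The two functions above are meant to be used as a pair: `decompress_zstd` takes the original `shape` and `dtype` as arguments, so the caller has to keep them alongside the compressed buffer. A minimal round-trip sketch under that assumption is shown below. `roundtrip_zstd` is a hypothetical helper, and its `bs` parameter exists only so that a stand-in module can be injected; by default it imports `bitshuffle`, which must have been built with `--zstd`.

```python
def roundtrip_zstd(arr, comp_lvl=1, block_size=0, bs=None):
    """Compress arr with ZSTD, decompress it again, and return both results."""
    if bs is None:
        import bitshuffle as bs  # requires a build with ZSTD support
    if not getattr(bs, "__zstd__", False):
        raise RuntimeError("bitshuffle was built without ZSTD support")
    packed = bs.compress_zstd(arr, block_size, comp_lvl)
    # shape, dtype and block_size must match the values used for compression.
    restored = bs.decompress_zstd(packed, arr.shape, arr.dtype, block_size)
    return packed, restored
```

With NumPy and a ZSTD-enabled build, `roundtrip_zstd(np.arange(1024, dtype=np.uint32))` should return a `restored` array equal to the input.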
3 changes: 3 additions & 0 deletions bitshuffle/h5.pyx
@@ -14,6 +14,7 @@ Constants

H5FILTER : The Bitshuffle HDF5 filter integer identifier.
H5_COMPRESS_LZ4 : Filter option flag for LZ4 compression.
H5_COMPRESS_ZSTD : Filter option flag for ZSTD compression.

Functions
=========
@@ -54,13 +55,15 @@ cdef extern from b"bshuf_h5filter.h":
int bshuf_register_h5filter()
int BSHUF_H5FILTER
int BSHUF_H5_COMPRESS_LZ4
int BSHUF_H5_COMPRESS_ZSTD

cdef extern int init_filter(const char* libname)

cdef int LZF_FILTER = 32000

H5FILTER = BSHUF_H5FILTER
H5_COMPRESS_LZ4 = BSHUF_H5_COMPRESS_LZ4
H5_COMPRESS_ZSTD = BSHUF_H5_COMPRESS_ZSTD

# Init HDF5 dynamic loading with HDF5 library used by h5py
if not sys.platform.startswith('win'):