df.to_hdf() blocks some supported pytables compression types 'blosc:lz4', 'blosc:lz4hc', 'blosc:snappy', 'blosc:zlib' and 'blosc:zstd' #14478

Closed
dragoljub opened this Issue Oct 23, 2016 · 5 comments


df.to_hdf() blocks access to the following compressors offered in PyTables 3.3.0: 'blosc:lz4', 'blosc:lz4hc', 'blosc:snappy', 'blosc:zlib' and 'blosc:zstd'.

I would like to try blosc:lz4 compression on some of the bigger data I have, to compare size and speed against LZO.

df.to_hdf(path, 'df', complib='blosc:lz4')
D:\Python27\lib\site-packages\pandas\io\pytables.pyc in __init__(self, path, mode, complevel, complib, fletcher32, **kwargs)
    434 
    435         if complib not in (None, 'blosc', 'bzip2', 'lzo', 'zlib'):
--> 436             raise ValueError("complib only supports 'blosc', 'bzip2', lzo' "
    437                              "or 'zlib' compression.")
    438 

ValueError: complib only supports 'blosc', 'bzip2', lzo' or 'zlib' compression.
jreback (Contributor) commented Oct 23, 2016

This was done originally here: https://github.com/pandas-dev/pandas/pull/10341/files

It should be easy enough to expand this list if you would like to do a PR.

I think the check should directly introspect PyTables for this validation.
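
A minimal sketch of what that introspection could look like, assuming PyTables 3.x's tables.filters.all_complibs list is the right source of truth (the _validate_complib helper name is illustrative):

import tables

def _validate_complib(complib):
    # Ask the installed PyTables what it supports instead of hard-coding
    # the list in pandas; all_complibs includes the 'blosc:*' variants
    # when PyTables was built with blosc support.
    if complib is not None and complib not in tables.filters.all_complibs:
        raise ValueError("complib only supports one of: %s"
                         % ", ".join(tables.filters.all_complibs))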

@jreback jreback added this to the Next Major Release milestone Oct 23, 2016

dragoljub commented Oct 23, 2016

@jreback thanks for the link with the details. I'll do some local testing and let you know if the blosc compressors work.


bashtage (Contributor) commented Oct 24, 2016

@dragoljub When this patch was submitted pandas did not work with the multi-compression filters. Things might have changed.
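
One quick way to check whether that has changed is to round-trip a frame through one of the blosc sub-compressors and compare the result. A minimal sketch, assuming a pandas build whose complib check already allows 'blosc:lz4' (the frame and file name are illustrative):

import numpy as np
import pandas as pd

# Illustrative frame; substitute whatever dtype mix you care about.
df = pd.DataFrame({'a': np.random.randn(100000),
                   'b': np.random.randint(0, 10, 100000)})

df.to_hdf('roundtrip.h5', 'df', mode='w', complib='blosc:lz4', complevel=9)
result = pd.read_hdf('roundtrip.h5', 'df')

# If the multi-compression filters work end to end, this holds.
assert df.equals(result)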

bashtage (Contributor) commented Oct 24, 2016

There is a verbal description of a case that produced incorrect results here:

#8874

dragoljub commented Oct 24, 2016

I made this simple change at line 435 of \pandas\io\pytables.py, and now the different compression libraries seem to work with PyTables 3.3.0 and pandas 0.19.0. I'll need some time to get a PR prepared with tests, etc.

if complib not in (None, 'blosc', 'bzip2', 'lzo', 'zlib', 'blosc:lz4', 'blosc:lz4hc', 'blosc:snappy', 'blosc:zlib', 'blosc:zstd'):

Blosc:LZ4 reads my data about 30% faster than LZO with about the same compression ratio. I'll have to play more with different combinations of strings and floats, but so far it seems to be a nice option to have.

Unfortunately, nothing comes close to the compression ratio I get with gzipped pickle files. I suspect the HDF5 CArray chunkshape being row-major across float32 blocks removes some of the benefit we would otherwise see from purely columnar chunked compression on columns with repeated values.
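
For reference, the gzipped-pickle baseline can be produced like this; to_pickle in pandas 0.19 has no compression option, so the pickle is gzipped separately (file names are illustrative):

import gzip
import shutil

df.to_pickle('df.pkl')
# Gzip the pickle for the size comparison.
with open('df.pkl', 'rb') as src, gzip.open('df.pkl.gz', 'wb') as dst:
    shutil.copyfileobj(src, dst)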

Some Benchmarks:

In [64]: %time df.to_hdf(r'df_none.h5', 'df', mode='w')
Wall time: 1.12 s

In [67]: %time df.to_hdf(r'df_lzo.h5', 'df', mode='w', complib='lzo', complevel=9)
Wall time: 378 ms

In [68]: %time df.to_hdf(r'df_lz4.h5', 'df', mode='w', complib='blosc:lz4', complevel=9)
Wall time: 357 ms

In [69]: %time df.to_hdf(r'df_zstd.h5', 'df', mode='w', complib='blosc:zstd', complevel=9)
Wall time: 28.4 s

In [70]: %time df.to_hdf(r'df_lz4hc.h5', 'df', mode='w', complib='blosc:lz4hc', complevel=9)
Wall time: 33.2 s

In [71]: %timeit  pd.read_hdf(r'df_none.h5', mode='r')
10 loops, best of 3: 134 ms per loop

In [72]: %timeit  pd.read_hdf(r'df_lzo.h5', mode='r')
1 loop, best of 3: 389 ms per loop

In [73]: %timeit  pd.read_hdf(r'df_lz4.h5', mode='r')
1 loop, best of 3: 277 ms per loop

In [74]: %timeit  pd.read_hdf(r'df_zstd.h5', mode='r')
1 loop, best of 3: 471 ms per loop

In [75]: %timeit  pd.read_hdf(r'df_lz4hc.h5', mode='r')
1 loop, best of 3: 260 ms per loop

In [76]: %time df.to_pickle(r'df.pkl')
Wall time: 1.25 s

In [77]: %timeit pd.read_pickle(r'df.pkl')
1 loop, best of 3: 228 ms per loop
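
For anyone who wants to rerun this comparison on their own data, a small loop over the complibs is enough. A sketch, where df is whatever frame you are testing and the file names are illustrative:

import os
import time
import pandas as pd

for complib in (None, 'lzo', 'blosc:lz4', 'blosc:lz4hc', 'blosc:zstd'):
    path = 'df_%s.h5' % (complib or 'none').replace(':', '_')
    start = time.time()
    df.to_hdf(path, 'df', mode='w', complib=complib,
              complevel=9 if complib else 0)
    write_s = time.time() - start
    start = time.time()
    pd.read_hdf(path, 'df')
    read_s = time.time() - start
    # Report write/read time and on-disk size for each compressor.
    print('%-12s write %6.2fs read %6.2fs size %8.1f KB'
          % (complib, write_s, read_s, os.path.getsize(path) / 1024.0))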
