
HDF file compression not working #29310

Open
WuraolaOyewusi opened this issue Oct 31, 2019 · 20 comments
Labels: Bug, IO HDF5 read_hdf, HDFStore

WuraolaOyewusi (Contributor) commented Oct 31, 2019

xref #28890 (review)

While updating the performance comparison part of the IO docs it was found that compressed size values for .hdf file formats were the same as uncompressed .hdf file formats.

    24009288 Oct 10 06:43 test_fixed.hdf
    24009288 Oct 10 06:43 test_fixed_compress.hdf
    24458940 Oct 10 06:44 test_table.hdf
    24458940 Oct 10 06:44 test_table_compress.hdf

This seems to be caused by the following two lines producing identical files:

    df.to_hdf('test.hdf', 'test', mode='w')
    df.to_hdf('test.hdf', 'test', mode='w', complib='blosc')

We need to see why the complib parameter is being ignored, and fix it so the HDF5 file is saved compressed when it is used.
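The reported behaviour can be modelled with a small sketch. This is not pandas' actual implementation, just a hypothetical model of the resolution described above, where complevel defaults to 0 (compression off), so setting complib alone has no effect:

```python
# Hypothetical model of the compression resolution reported above;
# the function name and defaults are illustrative, not pandas internals.
def effective_compression(complevel=None, complib=None):
    """Return the (complevel, complib) pair actually applied."""
    level = complevel or 0          # complevel defaults to 0 (off)
    if level == 0:
        return (0, None)            # complib is silently ignored
    return (level, complib or "zlib")

# complib alone does nothing -- matching the identical file sizes above
assert effective_compression(complib="blosc") == (0, None)
# passing complevel explicitly enables compression
assert effective_compression(complevel=9, complib="blosc") == (9, "blosc")
```

Under this model, the practical workaround is to always pass complevel explicitly alongside complib.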

CC @datapythonista

@jbrockmendel jbrockmendel added the IO HDF5 read_hdf, HDFStore label Oct 31, 2019
yp1996 commented Nov 4, 2019

Hi @WuraolaOyewusi, are you currently working on this issue? If not, do you mind if I give it a shot?

WuraolaOyewusi (Contributor, Author)

@yp1996, please go ahead.

joybh98 (Contributor) commented Dec 24, 2019

@yp1996, are you still working on this? If not, I'd like to work on it!

MarcoGorelli (Member)

@joybhallaa we haven't heard from them in over 2 weeks, so if this is something you'd like to work on I think you're OK to go ahead - please comment 'take' so the issue will be assigned to you

joybh98 (Contributor) commented Feb 7, 2020

take

MarcoGorelli (Member)

Sure - you can see #29404 for a starting point if you like

joybh98 (Contributor) commented Feb 7, 2020

@MarcoGorelli , sure will do!

joybh98 (Contributor) commented Feb 25, 2020

Hey all,
So I was going through the tests in test_store.py, and they contain these lines:

assert node.filters.complevel == 0
assert node.filters.complib is None

These tests check conditions and return true or false, but they don't change the values of these two parameters.
Can I use pytest to set values in case of errors and use it to set a "default" value, or should I do it in pytables.py?

I may be wrong about some things, as I am still fairly new to pytest.
Open to any suggestions.

sathyz (Contributor) commented Mar 11, 2020

Hi joybhalla,
are you still working on this?

joybh98 (Contributor) commented Mar 12, 2020

@sathyz yes !

joybh98 (Contributor) commented Mar 17, 2020

@sathyz feel free to take this issue !

quangngd (Contributor)

Hi, as a newbie, I would love to take on this issue if it's okay.

datapythonista (Member)

@quangngd if it's your first contribution, I'd recommend starting on one of the files of #32550 instead, and once you've got that, come back to this. I think it'll make your life easier to start with something trivial. But it's totally up to you.

quangngd (Contributor)

> @quangngd if it's your first contribution, I'd recommend to start working on one of the files of #32550 instead. And once you've got that come back to this. I think it'll make your life easier to start with something trivial. But totally up to you.

Thank you for the suggestion, I will definitely do that. I will come back if @sathyz doesn't want to work on this.

quangngd (Contributor)

take

quangngd (Contributor)

So I have looked into the code, and here are the problems I found:

  1. In df.to_hdf, if you don't set complevel, the file doesn't get compressed even if complib is explicitly set.
  2. The API of HDFStore is confusing because you can set complevel/complib in two places - when constructing the HDFStore instance and when calling put/append. The latter takes priority if complib is explicitly set there.
    For example,
    with HDFStore(tmpfile, "w", complevel=5, complib='bzip2') as store:
        store.put(gname, df, format='t')

would result in bzip2(5), whereas

    with HDFStore(tmpfile, "w", complevel=5, complib='bzip2') as store:
        store.put(gname, df, format='t', complevel=2, complib='zlib')

would result in zlib(2).
Specifying only complib='zlib' in put would result in zlib(5), and specifying only complevel=2 in put would result in bzip2(5) (since complib is not explicitly set in store.put).
  3. In HDFStore, there is a check that only accepts the table format when compressing, but the check is not working.
  4. With the fixed format, the file size of the "compressed" version is bigger than that of the uncompressed one (maybe that's why we have no. 3).
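The precedence rules in (2) can be captured by a small resolver. The function below is a hypothetical model mirroring the four examples above, not pandas' actual code:

```python
def resolve(store_level, store_lib, put_level=None, put_lib=None):
    """Model of how put/append args override HDFStore constructor args.

    Per the observations above: put-time values win only when complib
    is explicitly passed to put; otherwise the store's settings apply.
    """
    if put_lib is not None:
        level = put_level if put_level is not None else store_level
        return (level, put_lib)
    return (store_level, store_lib)

# The four cases described above, for a store built with (5, 'bzip2'):
assert resolve(5, "bzip2") == (5, "bzip2")                         # bzip2(5)
assert resolve(5, "bzip2", put_level=2, put_lib="zlib") == (2, "zlib")
assert resolve(5, "bzip2", put_lib="zlib") == (5, "zlib")          # zlib(5)
assert resolve(5, "bzip2", put_level=2) == (5, "bzip2")            # bzip2(5)
```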

So my suggestions are:

  • For (1), we can add a note to the docs about this, or use a default complevel value when complib is explicitly set.
  • Regarding HDFStore, a quick fix could be to make the args in the put/append methods always take priority over those stored on the instance. In the long run, complevel/complib should only live in one place.

But IMO, doing just the former (and maybe fixing the check) would already resolve this issue. The latter deserves its own issue.

quangngd (Contributor)

Update:
I found this test, so maybe this is expected behaviour after all?
Can anyone confirm this? @datapythonista

pandas/tests/io/pytables/test_store.py::TestHDFStore::test_complibs_default_settings

    # Set complib and check to see if compression is disabled
    with ensure_clean_path(setup_path) as tmpfile:
        df.to_hdf(tmpfile, "df", complib="zlib")
        result = pd.read_hdf(tmpfile, "df")
        tm.assert_frame_equal(result, df)

        with tables.open_file(tmpfile, mode="r") as h5file:
            for node in h5file.walk_nodes(where="/df", classname="Leaf"):
                assert node.filters.complevel == 0
                assert node.filters.complib is None

datapythonista (Member)

Thanks @quangngd for all the research on this. I'm not an expert in HDF, but it feels like there is one compression setting for the whole store, and another for each table.

If I'm understanding it correctly, the test is checking that the tables are not compressed when using complib, but I guess the store should still be, otherwise that parameter would be pointless.

Can you run a git blame on that test and check who implemented it? It's probably better to ask the right person than to guess.

quangngd (Contributor)

Tracing back to PR #16355, @jreback had already mentioned this issue in the code review (#16355 (comment)), but in the end it was still merged.
Do you want to add a check for when complib is explicitly set while complevel is not, and then set complevel to a default value, say 1?
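That proposed check might look something like this. The function name and the default of 1 are hypothetical, following the suggestion above rather than any actual pandas patch:

```python
def normalize_compression(complevel=None, complib=None):
    """If complib is set but complevel is not, default complevel to 1
    so that asking for a compression library actually compresses."""
    if complib is not None and not complevel:
        complevel = 1   # hypothetical default, per the suggestion above
    return complevel, complib

# complib alone now implies a non-zero complevel
assert normalize_compression(complib="zlib") == (1, "zlib")
# explicit settings pass through unchanged
assert normalize_compression(9, "blosc") == (9, "blosc")
# no compression requested, none applied
assert normalize_compression() == (None, None)
```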

@mroeschke mroeschke added the Bug label Jul 22, 2021