Improve HDFStore.groups performance #21372

spott · 2018-06-07T23:17:54Z

I ran into this issue (#17593) recently, when I was trying to figure out what was causing the order of magnitude difference between h5ls and HDFStore.groups() or HDFStore.keys() and was trying to figure out more information about the problem. This was shrugged off as a Tables problem at the time and then closed.

I've come to the conclusion that this is actually Pandas problem. Hopefully, I can convince you of that as well.

If I look at a large flat HDF5 file (one where I have only created pandas dataframes at the root node /), then:

%%timeit
groups_iter_nodes = set()
for x in store._handle.iter_nodes(store._handle.root):
    groups_iter_nodes |= {x._v_pathname}

78.7 ms ± 1.54 ms per loop (mean ± std. dev. of 7 runs, 10 loops each)

where:

%%timeit
groups_walk_nodes = set()
for x in store._handle.walk_nodes(store._handle.root):
    groups_walk_nodes |= {x._v_pathname}

7.08 s ± 168 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)

And the iter_nodes method has found every node that is relevant. The bottom version is (an abbreviated) version of what the Pandas library does, while the top version is just looking at the first level of keys.

Obviously, just looking at the root level of the HDF file doesn't work for the vast majority of cases, but I think there is still significant room to improve.

The root cause is that there are a whole bunch of leaves of groups that don't need to be looked at because we already know the answer. In my file, each of those groups has the following arrays under it: ['axis0', 'axis1', 'block0_items', 'block0_values']. These arrays are where the data lies for the pandas dataframe. In the above example, walk_nodes must actually look at every one of these nodes, while the iter_nodes version doesn't.

If we check if there are any children not in this list:

%%timeit
groups_iter_nodes = set()
for x in store._handle.iter_nodes(store._handle.root):
    if len(set(x._v_children.keys()) - {'axis0', 'axis1', 'block0_items', 'block0_values'}) > 0:
        print("oops")
    groups_iter_nodes |= {x._v_pathname}

88.4 ms ± 427 µs per loop (mean ± std. dev. of 7 runs, 10 loops each)

This is still much faster than actually visiting the nodes in order to determine that they aren't pandas dataframes.

I'm not entirely sure how pandas deals with dataframes in hdf files. But if pandas really doesn't ever store anything except as children of groups, then this is a better way to look for all the groups:

%%timeit
groups_walk_nodes = set()
for x in store._handle.walk_groups(store._handle.root):
    groups_walk_nodes |= {x._v_pathname}

149 ms ± 2.65 ms per loop (mean ± std. dev. of 7 runs, 10 loops each)

Ultimately, I don't know enough about the history of pandas and HDFStores, or the details about how pandas stores dataframes in hdf files to fix this -- and I expect that there is a lot of historical backwards compatibility baggage that needs to be navigated -- but I think it is clear that this could be much smarter about how it finds groups.

If backwards compatability isn't an issue, I would be happy to submit a pull request for this. I can only assume that walk_groups is the least that could be done to improve things.

The text was updated successfully, but these errors were encountered:

gfyoung · 2018-06-08T04:22:30Z

@spott : We have no issues with backwards compatibility so long as the tests pass for all relevant dependencies (and their versions). Go for it!

spott · 2018-06-08T16:21:18Z

Can I make the assumption that all pandas dataframes are stored as a group? (Is there ever a situation where a pandas object will be stored as a single array?)

gfyoung · 2018-06-08T17:10:20Z

@spott : I'm not entirely sure if that's true or not, but unless someone says otherwise, run with that assumption and see where it takes you. I see no reason to put roadblocks in front of your efforts!

gfyoung added the IO HDF5 read_hdf, HDFStore label Jun 8, 2018

gfyoung added the Performance Memory or execution speed performance label Jun 8, 2018

TomAugspurger changed the title ~~HDFStore.groups() is too dumb.~~ Improve HDFStore.groups performance Jun 8, 2018

spott mentioned this issue Jun 19, 2018

Fixed HDFSTore.groups() performance. #21543

Merged

4 tasks

jreback added this to the 0.23.2 milestone Jun 19, 2018

jreback closed this as completed in #21543 Jun 21, 2018

st-bender mentioned this issue Nov 28, 2019

HDF5: empty groups and keys #29916

Closed

st-bender mentioned this issue Feb 20, 2020

dd.read_hdf fails with pandas >= 0.24.0 dask/dask#5934

Closed

st-bender mentioned this issue Mar 4, 2020

HDFStore: Fix empty result of keys() method on non-pandas hdf5 file #32401

Closed

5 tasks

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Improve HDFStore.groups performance #21372

Improve HDFStore.groups performance #21372

spott commented Jun 7, 2018

gfyoung commented Jun 8, 2018

spott commented Jun 8, 2018

gfyoung commented Jun 8, 2018

Improve HDFStore.groups performance #21372

Improve HDFStore.groups performance #21372

Comments

spott commented Jun 7, 2018

gfyoung commented Jun 8, 2018

spott commented Jun 8, 2018

gfyoung commented Jun 8, 2018