Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Improve HDFStore.groups performance #21372

Closed
spott opened this issue Jun 7, 2018 · 3 comments

Comments

Projects
None yet
3 participants
@spott
Copy link
Contributor

commented Jun 7, 2018

I ran into this issue (#17593) recently, when I was trying to figure out what was causing the order of magnitude difference between h5ls and HDFStore.groups() or HDFStore.keys() and was trying to figure out more information about the problem. This was shrugged off as a Tables problem at the time and then closed.

I've come to the conclusion that this is actually Pandas problem. Hopefully, I can convince you of that as well.

If I look at a large flat HDF5 file (one where I have only created pandas dataframes at the root node /), then:

%%timeit
groups_iter_nodes = set()
for x in store._handle.iter_nodes(store._handle.root):
    groups_iter_nodes |= {x._v_pathname} 
78.7 ms ± 1.54 ms per loop (mean ± std. dev. of 7 runs, 10 loops each)

where:

%%timeit
groups_walk_nodes = set()
for x in store._handle.walk_nodes(store._handle.root):
    groups_walk_nodes |= {x._v_pathname}
7.08 s ± 168 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)

And the iter_nodes method has found every node that is relevant. The bottom version is (an abbreviated) version of what the Pandas library does, while the top version is just looking at the first level of keys.

Obviously, just looking at the root level of the HDF file doesn't work for the vast majority of cases, but I think there is still significant room to improve.

The root cause is that there are a whole bunch of leaves of groups that don't need to be looked at because we already know the answer. In my file, each of those groups has the following arrays under it: ['axis0', 'axis1', 'block0_items', 'block0_values']. These arrays are where the data lies for the pandas dataframe. In the above example, walk_nodes must actually look at every one of these nodes, while the iter_nodes version doesn't.

If we check if there are any children not in this list:

%%timeit
groups_iter_nodes = set()
for x in store._handle.iter_nodes(store._handle.root):
    if len(set(x._v_children.keys()) - {'axis0', 'axis1', 'block0_items', 'block0_values'}) > 0:
        print("oops")
    groups_iter_nodes |= {x._v_pathname}
88.4 ms ± 427 µs per loop (mean ± std. dev. of 7 runs, 10 loops each)

This is still much faster than actually visiting the nodes in order to determine that they aren't pandas dataframes.

I'm not entirely sure how pandas deals with dataframes in hdf files. But if pandas really doesn't ever store anything except as children of groups, then this is a better way to look for all the groups:

%%timeit
groups_walk_nodes = set()
for x in store._handle.walk_groups(store._handle.root):
    groups_walk_nodes |= {x._v_pathname}
149 ms ± 2.65 ms per loop (mean ± std. dev. of 7 runs, 10 loops each)

Ultimately, I don't know enough about the history of pandas and HDFStores, or the details about how pandas stores dataframes in hdf files to fix this -- and I expect that there is a lot of historical backwards compatibility baggage that needs to be navigated -- but I think it is clear that this could be much smarter about how it finds groups.

If backwards compatability isn't an issue, I would be happy to submit a pull request for this. I can only assume that walk_groups is the least that could be done to improve things.

@gfyoung gfyoung added the IO HDF5 label Jun 8, 2018

@gfyoung

This comment has been minimized.

Copy link
Member

commented Jun 8, 2018

@spott : We have no issues with backwards compatibility so long as the tests pass for all relevant dependencies (and their versions). Go for it!

@gfyoung gfyoung added the Performance label Jun 8, 2018

@TomAugspurger TomAugspurger changed the title HDFStore.groups() is too dumb. Improve HDFStore.groups performance Jun 8, 2018

@spott

This comment has been minimized.

Copy link
Contributor Author

commented Jun 8, 2018

Can I make the assumption that all pandas dataframes are stored as a group? (Is there ever a situation where a pandas object will be stored as a single array?)

@gfyoung

This comment has been minimized.

Copy link
Member

commented Jun 8, 2018

@spott : I'm not entirely sure if that's true or not, but unless someone says otherwise, run with that assumption and see where it takes you. I see no reason to put roadblocks in front of your efforts!

@spott spott referenced this issue Jun 19, 2018

Merged

Fixed HDFSTore.groups() performance. #21543

3 of 4 tasks complete

@jreback jreback added this to the 0.23.2 milestone Jun 19, 2018

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
You can’t perform that action at this time.