Improve HDFStore.groups performance #21372
I ran into this issue (#17593) recently, when I was trying to figure out what was causing the order of magnitude difference between
I've come to the conclusion that this is actually a pandas problem. Hopefully, I can convince you of that as well.
If I look at a large flat HDF5 file (one where I have only created pandas dataframes at the root node
Obviously, just looking at the root level of the HDF file doesn't work for the vast majority of cases, but I think there is still significant room to improve.
The root cause is that there are a whole bunch of leaves of groups that don't need to be looked at because we already know the answer. In my file, each of those groups has the following arrays under it:
If we check if there are any children not in this list:
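A minimal sketch of such a check (not the original snippet), assuming the group behaves like a PyTables `Group`, whose `_v_children` mapping can be iterated for child names. `KNOWN_ARRAY_NAMES` and `unknown_children` are hypothetical names, and the array names listed are illustrative placeholders rather than the list from my file:

```python
# Illustrative placeholder only: substitute the array names found in your own file.
KNOWN_ARRAY_NAMES = frozenset({"axis0", "axis1", "block0_items", "block0_values"})


def unknown_children(group, known_names=KNOWN_ARRAY_NAMES):
    """Return the sorted names of children of `group` not in `known_names`.

    `group` is assumed to behave like a PyTables Group: iterating over its
    `_v_children` mapping yields child names, so only names are compared
    and no child node is accessed here.
    """
    return sorted(set(group._v_children) - set(known_names))
```

An empty result means every child is an array we already expected, so there should be no reason to descend into that group. With PyTables installed, this could be called as `unknown_children(h5.root.some_key)` on a handle opened with `tables.open_file(path, mode="r")`, where `some_key` is a hypothetical node name.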
This is still much faster than actually visiting the nodes in order to determine that they aren't pandas dataframes.
I'm not entirely sure how pandas deals with dataframes in hdf files. But if pandas really doesn't ever store anything except as children of groups, then this is a better way to look for all the groups:
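A sketch of what a groups-only search might look like (not the original snippet), assuming an open PyTables `File` handle and assuming pandas marks each stored object's group with a `pandas_type` attribute. `pandas_groups` is a hypothetical helper name, and legacy files may need additional checks:

```python
def pandas_groups(h5file):
    """Return the groups that look like pandas objects, visiting groups only.

    `h5file` is assumed to be an open PyTables File. Its `walk_groups()`
    method yields Group nodes and never yields leaves (arrays, tables),
    so the individual arrays under each group are not inspected.
    """
    return [
        group
        for group in h5file.walk_groups()
        # Assumption: pandas sets `pandas_type` on the group's attribute set.
        if getattr(group._v_attrs, "pandas_type", None) is not None
    ]
```

The design choice here is simply to let the traversal skip leaves altogether, rather than visiting every node and discarding the ones that turn out not to be groups.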
Ultimately, I don't know enough about the history of pandas and HDFStores, or the details about how pandas stores dataframes in hdf files to fix this -- and I expect that there is a lot of historical backwards compatibility baggage that needs to be navigated -- but I think it is clear that this could be much smarter about how it finds groups.
If backwards compatibility isn't an issue, I would be happy to submit a pull request for this. I can only assume that