
HDF5: empty groups and keys #29916

Closed
st-bender opened this issue Nov 28, 2019 · 16 comments · Fixed by #32723
Labels
Enhancement IO HDF5 read_hdf, HDFStore
Milestone

Comments

@st-bender

Hi,

With some of the hdf5 files I have, pandas.HDFStore.groups() returns an empty list, as does .keys(), which iterates over the groups. However, the data are still accessible via .get() or .get_node().

This is related to #21543 and #21372, where the .groups() logic was changed to use self._handle.walk_groups() instead of self._handle.walk_nodes(); the relevant line is now:

for g in self._handle.walk_groups()

Current Output

>>> hdf.groups()
[]
>>> hdf.keys()
[]

Expected Output

List of groups and keys as visible with e.g. h5dump.
Note: Changing the aforementioned line back to use .walk_nodes() fixes the issue and lists the groups and keys properly:

>>> hdf.groups()
[/Data/Table Layout (Table(69462,), zlib(4)) ''
   description := {
...
/Data/Array Layout/2D Parameters/Data Parameters (Table(15,)) ''
   description := {
   "mnemonic": StringCol(itemsize=8, shape=(), dflt=b'', pos=0),
   "description": StringCol(itemsize=48, shape=(), dflt=b'', pos=1),
   "isError": Int64Col(shape=(), dflt=0, pos=2),
   "units": StringCol(itemsize=7, shape=(), dflt=b'', pos=3),
   "category": StringCol(itemsize=31, shape=(), dflt=b'', pos=4)}
   byteorder := 'little'
   chunkshape := (642,)]
>>> hdf.keys()
['/Data/Table Layout',
 '/Metadata/Data Parameters',
 '/Metadata/Experiment Notes',
 '/Metadata/Experiment Parameters',
 '/Metadata/Independent Spatial Parameters',
 '/Metadata/_record_layout',
 '/Data/Array Layout/Layout Description',
 '/Data/Array Layout/1D Parameters/Data Parameters',
 '/Data/Array Layout/2D Parameters/Data Parameters']
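As a stop-gap that avoids hand-patching pandas, one can drop down to the underlying PyTables handle and walk the nodes directly. This is only a sketch: store._handle is a private pandas attribute and may change between versions, while walk_nodes and _v_pathname are part of the PyTables API.

```python
def all_leaf_paths(handle):
    """Return the full path of every leaf node reachable from the root.

    ``handle`` is expected to expose PyTables' ``walk_nodes`` iterator,
    whose nodes carry their full path in ``_v_pathname``.
    """
    return [n._v_pathname for n in handle.walk_nodes(classname="Leaf")]

# With a real file this would look like:
#   import pandas as pd
#   with pd.HDFStore("data.h5", mode="r") as store:
#       paths = all_leaf_paths(store._handle)
```

Because this bypasses HDFStore.keys() entirely, it lists tables written by non-pandas software as well.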

Fix

One solution would be (I guess) to revert #21543; another would be to fix at least .keys() to use ._handle.walk_nodes() instead of .groups() in

return [n._v_pathname for n in self.groups()]

It could also be a bug in PyTables.

Problem background

I was trying to figure out why some hdf5 files open fine with pandas but fail with dask.
The reason is that dask allows wildcards and iterates over the keys to find valid ones. If .keys() is empty, reading the files with dask fails.
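To illustrate why an empty .keys() breaks that workflow, here is a minimal stand-in for the wildcard selection dask performs (glob-style matching over key paths via the standard fnmatch module; the helper name is made up for this sketch):

```python
import fnmatch

def match_keys(keys, pattern):
    """Select HDF5 keys matching a glob-style pattern, the way a
    wildcard-based reader picks datasets out of a file."""
    return [k for k in keys if fnmatch.fnmatch(k, pattern)]

keys = ["/Data/Table Layout", "/Metadata/Data Parameters"]
match_keys(keys, "/Data/*")  # matches only the /Data key
match_keys([], "/Data/*")    # with an empty keys(), nothing can ever match
```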

Output of pd.show_versions()

INSTALLED VERSIONS

commit : None
python : 3.7.3.final.0
python-bits : 64
OS : Linux
OS-release : 3.10.0-957.27.2.el7.x86_64
machine : x86_64
processor : x86_64
byteorder : little
LC_ALL : None
LANG : C
LOCALE : en_US.UTF-8

pandas : 0.25.3
numpy : 1.17.3
pytz : 2019.3
dateutil : 2.8.1
pip : 19.3.1
setuptools : 42.0.1.post20191125
Cython : None
pytest : 5.0.1
hypothesis : None
sphinx : None
blosc : None
feather : None
xlsxwriter : None
lxml.etree : 4.4.2
html5lib : None
pymysql : None
psycopg2 : None
jinja2 : 2.10.3
IPython : 7.10.0
pandas_datareader: None
bs4 : None
bottleneck : None
fastparquet : None
gcsfs : None
lxml.etree : 4.4.2
matplotlib : 3.1.2
numexpr : 2.7.0
odfpy : None
openpyxl : None
pandas_gbq : None
pyarrow : None
pytables : None
s3fs : None
scipy : 1.3.2
sqlalchemy : None
tables : 3.6.1
xarray : 0.14.1
xlrd : None
xlwt : None
xlsxwriter : None

@jbrockmendel jbrockmendel added the IO HDF5 read_hdf, HDFStore label Nov 30, 2019
@roberthdevries
Contributor

I get the impression that there are two distinct users of the .keys() and .groups() methods:

  1. the read_hdf() method that is only interested in pandas specific groups
  2. the users who want to use the .get() method to retrieve individual tables from an HDF5 file created by other software than pandas.

User #1 is probably the one who was hit by the performance bug of #21543 and #21372.
User #2 feels the effects of the fix for user #1.
The two have conflicting interests, so one fix would be a dedicated method, say pandas_groups(), to be used by user #1, keeping a plain unfiltered variant for user #2.
Alternatively, we could add a parameter to the groups() method to enable the filter.

@jreback
Contributor

jreback commented Feb 16, 2020

we are not going to add methods, but wouldn’t object to a filter keyword that defaults to the current behavior

@st-bender
Author

Hi,
Thanks for looking into it.

I get the impression that there are two distinct users of the .keys() and .groups() methods:

  1. the read_hdf() method that is only interested in pandas specific groups
  2. the users who want to use the .get() method to retrieve individual tables from an HDF5 file created by other software than pandas.

Shouldn't no. 2 be the majority of use cases?
I mean, when using hdf5 files from some source, you can't really expect them to be created by pandas or pytables. Manually iterating over the nodes, and hence duplicating the code from the pandas source, seems somewhat counterintuitive for those cases.

User #1 is probably the one who was hit by the performance bug of #21543 and #21372.
User #2 feels the effects of the fix for user #1.
The two have conflicting interests, so one fix would be a dedicated method, say pandas_groups(), to be used by user #1, keeping a plain unfiltered variant for user #2.
Alternatively, we could add a parameter to the groups() method to enable the filter.

It could also just be a documentation issue: people should be aware that .groups() returns only groups created with pandas. The other data can still be read with pandas, you just can't list it easily.

@roberthdevries
Contributor

The current behavior of keys() and groups() differs from that of version 0.23.3, so which behavior is the desired one depends on your point of view.
The documentation is not really clear about what to expect from the keys() method. Also, the iterator method __iter__() (undocumented, by the way), which uses keys(), does not return any data objects. So, all in all, there is currently no user-friendly way to get a list of the tables in an HDF5 file, or to iterate through these objects.
The question is what the original author of HDFStore had in mind when implementing this class.

@roberthdevries
Contributor

What about adding a kind parameter to the keys() method? The default value, "pandas", would preserve the existing behavior; the proposed other value, "table", would return the list of Tables in the HDF5 file. This could be further extended to other node types, but they are probably less relevant, as they cannot be converted to DataFrames.
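A rough sketch of how such a kind parameter could dispatch, using stand-in node objects (the real implementation would walk the PyTables handle; every name below is illustrative, not actual pandas internals):

```python
class FakeNode:
    """Stand-in for an HDF5 node. Groups written by pandas carry a
    ``pandas_type`` attribute; Tables from other software do not."""
    def __init__(self, path, is_table=False, pandas_type=None):
        self.path = path
        self.is_table = is_table
        self.pandas_type = pandas_type

def keys(nodes, kind="pandas"):
    if kind == "pandas":
        # existing behavior: only nodes written by pandas
        return [n.path for n in nodes if n.pandas_type is not None]
    if kind == "table":
        # proposed: every Table leaf, regardless of origin
        return [n.path for n in nodes if n.is_table]
    raise ValueError(f"unknown kind: {kind!r}")

nodes = [
    FakeNode("/df", is_table=True, pandas_type="frame_table"),
    FakeNode("/Data/Table Layout", is_table=True),  # written by other software
]
keys(nodes)                # default: only the pandas-written key
keys(nodes, kind="table")  # proposed: both Table keys
```

The design point is that the default stays backward-compatible, so read_hdf() and other internal callers are unaffected.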

@st-bender
Author

The question is what the original author of HDFStore had in mind when implementing this class.

I can't speak for the original author, and the code seems to have evolved gradually, but for the last few years it read self._handle.walk_nodes() (or the equivalent for earlier PyTables APIs), until it was recently changed because of performance issues.

Maybe the PyTables version is not the most efficient, but it would be nice to have some way of getting all the nodes in an hdf5 file with pandas (as was possible before), regardless of whether they can be converted or not. HDFStore.get() should fail appropriately in those cases. But that's just my opinion.

@roberthdevries
Contributor

@jreback What do you think about my proposal of adding an optional kind parameter to the HDFStore.keys() method?

@jreback
Contributor

jreback commented Feb 19, 2020

it’s ok, kind of -0 on it as intermixing other non pandas tables is an anti pattern

probably wouldn’t object to a PR though

@roberthdevries
Contributor

But isn't reading data from hdf5 files produced by other software one of the ways of getting data into pandas? I'm a little confused here.
I don't think the use case here is storing pandas tables and non-pandas tables in one hdf5 file. However, HDFStore does not know which kind of hdf5 file it gets, and it is up to the user to figure out what to do with a specific file.

@roberthdevries
Contributor

roberthdevries commented Feb 19, 2020

I am working on a small patch, but I am struggling a little to get my pandas test environment up and running (I went the non-Anaconda route, and it does not really work as documented).

@st-bender
Author

it’s ok, kind of -0 on it as intermixing other non pandas tables is an anti pattern

probably wouldn’t object to a PR though

Hi,
I fear this is a misunderstanding: this issue isn't about mixing anything, it's about restoring the previous behaviour (pre-"optimization") of listing the "groups" using walk_nodes() instead of walk_groups(). Both are methods of the underlying PyTables handle and should be fully compatible with pandas. I don't know if there was anything else seriously wrong with the "old" approach, apart from the performance hit in some cases.

All the data sets behind the keys listed in the "expected output" (OP) import perfectly fine with pandas. They are just no longer listed, which leads to downstream problems with e.g. dask.

Right now I am using a hand-patched version of pandas to have at least a working setup.

@st-bender
Author

But isn't reading data from hdf5 files produced by other software one of the ways of getting data into pandas? I'm a little confused here.

Thanks, yes that was my impression too and I use it in this way.

Instead of adding a kind parameter or anything, one could also just fall back to the old method if the groups list comes up empty, as in

groups = [... self._handle.walk_groups() ...]
if not groups:
    groups = [... self._handle.walk_nodes() ...]
return groups

@roberthdevries
Contributor

@jreback Could you have a look at my pull request? It passes all tests except for the typing validation, but I don't see what is actually wrong with the code or how to fix it.

@jamesbooker

Please could a moderator review what's happening with this? If @roberthdevries's solution gets merged, I can still use pandas for my project (which processes data from a non-pandas system).

@st-bender
Author

Please could a moderator review what's happening with this? If @roberthdevries's solution gets merged, I can still use pandas for my project (which processes data from a non-pandas system).

Hi, this issue has been fixed downstream in dask.

As long as the files contain hdf5 "tables", they should work by accessing the hdf5 path directly; at least this worked, and still works, for me. The problem in the current issue is not that pandas could not import the files, but that the previous change in the API led to downstream problems, i.e., the files could no longer be imported with dask.
If your files don't work with pandas at all, that may be a separate issue.

@st-bender
Author

I get the impression that there are two distinct users of the .keys() and .groups() methods:

  1. the read_hdf() method that is only interested in pandas specific groups
  2. the users who want to use the .get() method to retrieve individual tables from an HDF5 file created by other software than pandas.

Btw, re-reading your comment @roberthdevries, I noticed that no. 1 is not quite true: read_hdf() reads anything (it actually uses PyTables' .get_node() under the hood); it does not (yet) care whether a file is pandas-only. Restricting it would somewhat defeat the purpose of having a general import function.

I can only assume that it was a design decision to list only pandas-native dataframes, and to mitigate the performance issues.

@jreback jreback added this to the 1.1 milestone Jun 14, 2020