
HDF5: empty groups and keys #29916

Closed
st-bender opened this issue Nov 28, 2019 · 16 comments · Fixed by #32723
Labels
Enhancement IO HDF5 read_hdf, HDFStore
Milestone

Comments

@st-bender

Hi,

With some of the hdf5 files I have, pandas.HDFStore.groups() returns an empty list, as does .keys(), which iterates over the groups. However, the data are still accessible via .get() or .get_node().

This is related to #21543 and #21372, where the .groups() logic was changed to use self._handle.walk_groups() instead of self._handle.walk_nodes(); the relevant line is now:

for g in self._handle.walk_groups()

Current Output

>>> hdf.groups()
[]
>>> hdf.keys()
[]

Expected Output

List of groups and keys as visible with e.g. h5dump.
Note: Changing the aforementioned line back to use .walk_nodes() fixes the issue and lists the groups and keys properly:

>>> hdf.groups()
[/Data/Table Layout (Table(69462,), zlib(4)) ''
   description := {
...
/Data/Array Layout/2D Parameters/Data Parameters (Table(15,)) ''
   description := {
   "mnemonic": StringCol(itemsize=8, shape=(), dflt=b'', pos=0),
   "description": StringCol(itemsize=48, shape=(), dflt=b'', pos=1),
   "isError": Int64Col(shape=(), dflt=0, pos=2),
   "units": StringCol(itemsize=7, shape=(), dflt=b'', pos=3),
   "category": StringCol(itemsize=31, shape=(), dflt=b'', pos=4)}
   byteorder := 'little'
   chunkshape := (642,)]
>>> hdf.keys()
['/Data/Table Layout',
 '/Metadata/Data Parameters',
 '/Metadata/Experiment Notes',
 '/Metadata/Experiment Parameters',
 '/Metadata/Independent Spatial Parameters',
 '/Metadata/_record_layout',
 '/Data/Array Layout/Layout Description',
 '/Data/Array Layout/1D Parameters/Data Parameters',
 '/Data/Array Layout/2D Parameters/Data Parameters']
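As a stop-gap that avoids hand-patching pandas, one can drop down to the underlying PyTables handle and walk the nodes directly. This is only a sketch: store._handle is a private pandas attribute and may change between versions, while walk_nodes and _v_pathname are part of the PyTables API.

```python
def all_leaf_paths(handle):
    """Return the full path of every leaf node reachable from the root.

    ``handle`` is expected to expose PyTables' ``walk_nodes`` iterator,
    whose nodes carry their full path in ``_v_pathname``.
    """
    return [n._v_pathname for n in handle.walk_nodes(classname="Leaf")]

# With a real file this would look like:
#   import pandas as pd
#   with pd.HDFStore("data.h5", mode="r") as store:
#       paths = all_leaf_paths(store._handle)
```

Because this bypasses HDFStore.keys() entirely, it lists tables written by non-pandas software as well.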

Fix

One solution would be (I guess) to revert #21543; another would be to fix at least .keys() to use ._handle.walk_nodes() instead of .groups() in

return [n._v_pathname for n in self.groups()]

It could also be a bug in PyTables.

Problem background

I was trying to figure out why some hdf5 files open fine with pandas but fail with dask.
The reason is that dask allows wildcards and iterates over the keys to find valid ones. If .keys() is empty, reading the files with dask fails.
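To illustrate why an empty .keys() breaks that workflow, here is a minimal stand-in for the wildcard selection dask performs (glob-style matching over key paths via the standard fnmatch module; the helper name is made up for this sketch):

```python
import fnmatch

def match_keys(keys, pattern):
    """Select HDF5 keys matching a glob-style pattern, the way a
    wildcard-based reader picks datasets out of a file."""
    return [k for k in keys if fnmatch.fnmatch(k, pattern)]

keys = ["/Data/Table Layout", "/Metadata/Data Parameters"]
match_keys(keys, "/Data/*")  # matches only the /Data key
match_keys([], "/Data/*")    # with an empty keys(), nothing can ever match
```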

Output of pd.show_versions()

INSTALLED VERSIONS

commit : None
python : 3.7.3.final.0
python-bits : 64
OS : Linux
OS-release : 3.10.0-957.27.2.el7.x86_64
machine : x86_64
processor : x86_64
byteorder : little
LC_ALL : None
LANG : C
LOCALE : en_US.UTF-8

pandas : 0.25.3
numpy : 1.17.3
pytz : 2019.3
dateutil : 2.8.1
pip : 19.3.1
setuptools : 42.0.1.post20191125
Cython : None
pytest : 5.0.1
hypothesis : None
sphinx : None
blosc : None
feather : None
xlsxwriter : None
lxml.etree : 4.4.2
html5lib : None
pymysql : None
psycopg2 : None
jinja2 : 2.10.3
IPython : 7.10.0
pandas_datareader: None
bs4 : None
bottleneck : None
fastparquet : None
gcsfs : None
lxml.etree : 4.4.2
matplotlib : 3.1.2
numexpr : 2.7.0
odfpy : None
openpyxl : None
pandas_gbq : None
pyarrow : None
pytables : None
s3fs : None
scipy : 1.3.2
sqlalchemy : None
tables : 3.6.1
xarray : 0.14.1
xlrd : None
xlwt : None
xlsxwriter : None

@jbrockmendel jbrockmendel added the IO HDF5 read_hdf, HDFStore label Nov 30, 2019
@roberthdevries
Contributor

I get the impression that there are two distinct users of the .keys() and .groups() methods:

  1. the read_hdf() method that is only interested in pandas specific groups
  2. the users who want to use the .get() method to retrieve individual tables from an HDF5 file created by other software than pandas.

User #1 is probably the one who was hit by the performance bug of #21543 and #21372.
User #2 feels the effects of the fix for user #1.
The two have conflicting interests, so one fix would be a dedicated method, say pandas_groups(), to be used by user #1, keeping a plain unfiltered variant for user #2.
Alternatively, we could add a parameter to the groups() method to enable the filter.

@jreback
Contributor

jreback commented Feb 16, 2020

we are not going to add methods, but wouldn’t object to a filter keyword that defaults to the current behavior

@st-bender
Author

Hi,
Thanks for looking into it.

I get the impression that there are two distinct users of the .keys() and .groups() methods:

  1. the read_hdf() method that is only interested in pandas specific groups
  2. the users who want to use the .get() method to retrieve individual tables from an HDF5 file created by other software than pandas.

Shouldn't no. 2 be the majority of use cases?
I mean, when using hdf5 files from some source, you can't really expect them to be created by pandas or pytables. Manually iterating over the nodes, and hence duplicating the code from the pandas source, seems somewhat counterintuitive for those cases.

User #1 is probably the one who was hit by the performance bug of #21543 and #21372.
User #2 feels the effects of the fix for user #1.
The two have conflicting interests, so one fix would be a dedicated method, say pandas_groups(), to be used by user #1, keeping a plain unfiltered variant for user #2.
Alternatively, we could add a parameter to the groups() method to enable the filter.

It could also just be a documentation issue: people should be aware that .groups() returns only groups created with pandas. The other data can still be read with pandas, you just can't list it easily.

@roberthdevries
Contributor

The current behavior of keys() and groups() differs from that of version 0.23.3, so which behavior is the desired one depends on your point of view.
The documentation is not really clear about what to expect from the keys() method. Also, the iterator method __iter__() (undocumented, by the way), which uses keys(), does not return any data objects. So, all in all, there is currently no user-friendly way to get a list of the tables in an HDF5 file, or to iterate through these objects.
The question is what the original author of HDFStore had in mind when implementing this class.

@roberthdevries
Contributor

What about adding a kind parameter to the keys() method? The default value, "pandas", would preserve the existing behavior; the proposed other value, "table", would return the list of Tables in the HDF5 file. This could be further extended to other node types, but they are probably less relevant, as they cannot be converted to DataFrames.
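A rough sketch of how such a kind parameter could dispatch, using stand-in node objects (the real implementation would walk the PyTables handle; every name below is illustrative, not actual pandas internals):

```python
class FakeNode:
    """Stand-in for an HDF5 node. Groups written by pandas carry a
    ``pandas_type`` attribute; Tables from other software do not."""
    def __init__(self, path, is_table=False, pandas_type=None):
        self.path = path
        self.is_table = is_table
        self.pandas_type = pandas_type

def keys(nodes, kind="pandas"):
    if kind == "pandas":
        # existing behavior: only nodes written by pandas
        return [n.path for n in nodes if n.pandas_type is not None]
    if kind == "table":
        # proposed: every Table leaf, regardless of origin
        return [n.path for n in nodes if n.is_table]
    raise ValueError(f"unknown kind: {kind!r}")

nodes = [
    FakeNode("/df", is_table=True, pandas_type="frame_table"),
    FakeNode("/Data/Table Layout", is_table=True),  # written by other software
]
keys(nodes)                # default: only the pandas-written key
keys(nodes, kind="table")  # proposed: both Table keys
```

The design point is that the default stays backward-compatible, so read_hdf() and other internal callers are unaffected.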

@st-bender
Author

The question is what the original author of HDFStore had in mind when implementing this class.

I can't speak for the original author, and the code seems to have evolved gradually, but for the last few years it read self._handle.walk_nodes() (or the equivalent for earlier PyTables APIs), until it was recently changed because of performance issues.

Maybe the PyTables version is not the most efficient, but it would be nice to have some way of getting all the nodes in an hdf5 file with pandas (as was possible before), regardless of whether they can be converted or not. HDFStore.get() should fail appropriately in those cases. But that's just my opinion.

@roberthdevries
Contributor

@jreback What do you think about my proposal of adding an optional kind parameter to the HDFStore.keys() method?

@jreback
Contributor

jreback commented Feb 19, 2020

it’s ok, kind of -0 on it as intermixing other non pandas tables is an anti pattern

probably wouldn’t object to a PR though

@roberthdevries
Contributor

But isn't reading data from hdf5 files produced by other software one of the ways of getting data into pandas? I'm a little confused here.
I don't think the use case here is storing pandas tables and non-pandas tables in one hdf5 file. However, HDFStore does not know which kind of hdf5 file it gets, and it is up to the user to figure out what to do with a specific file.

@roberthdevries
Contributor

roberthdevries commented Feb 19, 2020

I am working on a small patch, but I am struggling a little to get my pandas test environment up and running (I went the non-Anaconda route, and it does not really work as documented).

@st-bender
Author

it’s ok, kind of -0 on it as intermixing other non pandas tables is an anti pattern

probably wouldn’t object to a PR though

Hi,
I fear this is a misunderstanding: this issue isn't about mixing anything, it's about restoring the previous behaviour (pre-"optimization") of listing the "groups" using walk_nodes() instead of walk_groups(). Both are methods of the underlying PyTables handle and should be fully compatible with pandas. I don't know if there was anything else seriously wrong with the "old" approach, apart from the performance hit in some cases.

All the data sets behind the keys listed in the "expected output" (OP) import perfectly fine with pandas. They are just no longer listed, which leads to downstream problems with e.g. dask.

Right now I am using a hand-patched version of pandas to have at least a working setup.

@st-bender
Author

But isn't reading data from hdf5 files produced by other software one of the ways of getting data into pandas? I'm a little confused here.

Thanks, yes that was my impression too and I use it in this way.

Instead of adding a kind parameter or anything, one could also just fall back to the old method if the groups list comes up empty, as in

groups = [... self._handle.walk_groups() ...]
if not groups:
    groups = [... self._handle.walk_nodes() ...]
return groups

@roberthdevries
Contributor

@jreback Could you have a look at my pull request? It passes all tests except for the typing validation, but I don't see what is actually wrong with the code or how to fix it.

@jamesbooker

Please could a moderator review what's happening with this? If @roberthdevries's solution gets merged, I can still use pandas for my project (which processes data from a non-pandas system).

@st-bender
Author

Please could a moderator review what's happening with this? If @roberthdevries's solution gets merged, I can still use pandas for my project (which processes data from a non-pandas system).

Hi, this issue has been fixed downstream in dask.

As long as the files contain hdf5 "tables", they should work by accessing the hdf5 path directly; at least this worked, and still works, for me. The problem in the current issue is not that pandas could not import the files, but that the previous change in the API led to downstream problems, i.e., the files could no longer be imported with dask.
If your files don't work with pandas at all, that may be a separate issue.

@st-bender
Author

I get the impression that there are two distinct users of the .keys() and .groups() methods:

  1. the read_hdf() method that is only interested in pandas specific groups
  2. the users who want to use the .get() method to retrieve individual tables from an HDF5 file created by other software than pandas.

Btw, re-reading your comment @roberthdevries, I noticed that no. 1 is not quite true: read_hdf() reads anything (it actually uses PyTables' .get_node() under the hood); it does not (yet) care whether a file is pandas-only. Restricting it would somewhat defeat the purpose of having a general import function.

I can only assume that it was a design decision to list only pandas-native dataframes, and to mitigate the performance issues.

@jreback jreback added this to the 1.1 milestone Jun 14, 2020