BUG: entries missing when reading from pytables hdf store using "where" statement #9676
Comments
Sounds maybe similar to #8265. |
@rockg Yes looks like the same problem, sorry I missed that post. I also wanted to update that the problem persists in 0.16.0rc1 |
It's a pytables issue, unfortunately. We bugged them a few times, but nothing has been done so far. |
yep, issue is here: PyTables/PyTables#319 |
Thanks @jreback - I just came across that topic myself as well. Given that nothing seems to be happening on the PyTables side, are there any workarounds in the meantime? Even "nasty" workarounds (as alluded to by @rockg) are better than nothing... I'm guessing either setting index=False when saving or calling ptrepack each time might solve the problem - do we know if either of those is foolproof, or can the problem still persist? |
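For reference, a minimal sketch of the first workaround mentioned above, writing the table without PyTables column indexes. The helper name `append_without_index` is hypothetical; `index=False` is an existing argument of `HDFStore.append`. Nothing in this thread confirms that skipping the index avoids the missing rows, so treat it as something to test, not as a fix.

```python
def append_without_index(store, key, frame, data_columns):
    """Append `frame` as a table without building PyTables column indexes.

    `store` is assumed to be an open pandas HDFStore (hypothetical helper).
    """
    # The data columns stay queryable through `where`; queries are just unindexed.
    store.append(key, frame, data_columns=data_columns, index=False)
```

With a real file this would be called as `append_without_index(pd.HDFStore("out.h5"), "df", df, ["tid"])`, where the file name, key, and column name are placeholders.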
So here's my example directly using HDFStore. The issue comes up only when using start/stop AND where, IIRC. If you use a 'big' chunksize this will work; it's only a smaller value that seems to trigger it. Selecting on the entire file is safe, so in practice it is pretty hard to actually make this fail.
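The original example is not preserved here. As a rough sketch of the kind of check described, assuming an open `HDFStore` whose table has the queried field as a data column, one can apply the same `where` to consecutive start/stop windows and compare the total with a single whole-table select. The helper names and the `window` parameter are hypothetical; `select(..., start=, stop=)` and `get_storer(key).nrows` are existing pandas calls.

```python
import pandas as pd


def select_in_windows(store, key, where, window):
    """Apply `where` to consecutive [start, stop) row windows and join the results."""
    nrows = store.get_storer(key).nrows  # total rows in the stored table
    parts = []
    for start in range(0, nrows, window):
        stop = min(start + window, nrows)
        parts.append(store.select(key, where=where, start=start, stop=stop))
    return pd.concat(parts) if parts else pd.DataFrame()


def windowed_vs_full(store, key, where, window):
    """Return (row count from windowed reads, row count from one whole-table read)."""
    windowed = select_in_windows(store, key, where, window)
    full = store.select(key, where=where)
    return len(windowed), len(full)
```

If the two counts differ for a small `window` but agree for a large one, that matches the behaviour described in this comment.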
|
This has been addressed in PyTables: PyTables/PyTables@035dbd5 and the fix will be part of the forthcoming 3.2.0 release. |
this is going to need a validation test which we can put in once 3.2 comes out |
I have added a new test in PyTables itself, but it is a good idea to re-check that in pandas too. Sorry for closing too fast. |
np, it's just very subtle and I wanted to mention it in our next release |
It was subtle indeed. Thanks for taking the time to produce the self-contained example. It proved to be very useful. |
http://pytables.github.io/release-notes/RELEASE_NOTES_v3.2.x.html is in rc, please give it a try |
@FrancescAlted Unfortunately I believe that the behavior exhibited in #8265 still exists. I just tried with the PyTables release candidate and the mismatch in lengths between using the index and not is still there. I'm using the file generated prior to the bug fix (don't know if this is the issue or not).
|
@rockg you need to regenerate the original file. The problem is that the indexes are off in the written file. Please let me know. |
I took the dataframe returned from p2 and wrote that out to a new HDF5 file. I then ran the above code and still see differences. |
Hmm, I would like to fix that before 3.2.0 final, but I would need the data file to have a self-contained example. Could you provide a minimal example exposing the problem, please? |
You can download the file from https://www.dropbox.com/s/122q55g5ubcf4fl/indexIssue.h5?dl=0. Then I think all you have to do is the below (it relies on pandas, but I don't think that impacts the underlying problem). The first part of the code below regenerates the file with the new code, then selects out of the database using the index, and then selects out of the frame without using the index.
|
This has been fixed in PyTables in the 'release-3.2.0' branch. Could you please give it a try? |
Good news, I too see this working properly. |
@rockg awesome! Can you do a PR to note this in the whatsnew for 0.16.1 and, more importantly, put a warning in the HDF5 section (e.g. recommend PyTables 3.2 for issues like this)? Thanks |
@rockg can you do a doc pr for this? |
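A small sketch of the version check such a warning could be based on, assuming only that the fix ships with PyTables 3.2 as stated above. The function name `predates_fix` is hypothetical, and it expects a version string with at least a major and minor part. Release candidates count as 3.2 here, even though an early candidate still showed the mismatch in this thread.

```python
def predates_fix(version, fixed=(3, 2)):
    """Return True if a PyTables version string is older than the fixed release."""
    # Only major and minor are compared, so "3.2.0rc1" is treated as 3.2.
    major, minor = (int(part) for part in version.split(".")[:2])
    return (major, minor) < fixed
```

In practice this would be called as `predates_fix(tables.__version__)` after `import tables`. Note that, per the comments above, files written before the fix also need to be regenerated, since the indexes stored in them are already off.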
When I select from an HDF store using a "where" string (locating entries in which one field matches a particular string value), the function returns fewer rows than when I load the entire dataframe into memory and then match on that field. Below is some code that reproduces the problem; unfortunately, I can't easily provide the code that generates the source HDF store, but I'm happy to provide the kept_tids_20150310.h5 file if it would help. There are no NaN values in the dataframe.
Running ptrepack on the HDF file solves the problem, but I don't believe this should happen in the first place.
I am using pandas 0.15.2 but have not tried 0.16.0.
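The reproduction code from the report is not preserved here. A minimal sketch of the comparison it describes, assuming the matched field was written as a data column, could look like the following. The helper names, the key, and the column are hypothetical placeholders, not the reporter's actual code.

```python
import pandas as pd


def compare_where_to_memory(store, key, column, value):
    """Count rows matched by an on-disk `where` query and by an in-memory filter."""
    # On-disk query; `column` is assumed to have been written as a data column.
    on_disk = store.select(key, where=f"{column} == {value!r}")
    # Reference result: load the whole table, then filter with a boolean mask.
    frame = store.select(key)
    in_memory = frame[frame[column] == value]
    return len(on_disk), len(in_memory)


def check_file(path, key, column, value):
    """Open an HDF5 file read-only and run the comparison above on it."""
    with pd.HDFStore(path, mode="r") as store:
        return compare_where_to_memory(store, key, column, value)
```

The two counts should be equal; a smaller first number is the symptom reported in this issue.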