BUG: entries missing when reading from pytables hdf store using "where" statement #9676

Closed
alexfields opened this issue Mar 18, 2015 · 21 comments · Fixed by #10069
Labels
Bug Docs IO HDF5 read_hdf, HDFStore
Milestone

Comments

@alexfields

When I select from an HDF store using a "where" string (locating entries in which one field matches a particular string value), the function returns fewer rows than when I load the entire dataframe into memory and then match on that field. Below is some code that reproduces the problem; unfortunately, I can't easily provide the code that generates the source HDF store, but I'm happy to provide the kept_tids_20150310.h5 file if it would help. There are no NaN values in the dataframe.

Running ptrepack on the file solves the problem, but I don't believe this should happen in the first place.

I am using pandas 0.15.2 but have not tried 0.16.0.

>>> import pandas as pd
>>> pd.show_versions()

INSTALLED VERSIONS
------------------
commit: None
python: 2.7.3.final.0
python-bits: 64
OS: Linux
OS-release: 3.13.0-46-generic
machine: x86_64
processor: x86_64
byteorder: little
LC_ALL: None
LANG: en_US.UTF-8

pandas: 0.15.2
nose: 1.3.4
Cython: 0.19.2
numpy: 1.9.2
scipy: 0.15.1
statsmodels: 0.5.0
IPython: 2.4.0
sphinx: 1.2.3
patsy: 0.2.1
dateutil: 2.4.1
pytz: 2014.10
bottleneck: 0.8.0
tables: 3.1.0
numexpr: 2.3
matplotlib: 1.4.2
openpyxl: None
xlrd: None
xlwt: 0.7.2
xlsxwriter: None
lxml: 2.3.2
bs4: 4.3.2
html5lib: 0.999
httplib2: 0.7.2
apiclient: None
rpy2: 2.4.2
sqlalchemy: None
pymysql: None
psycopg2: None
>>> kept_tids = pd.read_hdf('kept_tids_20150310.h5', 'kept_tids', mode='r')
>>> kept_tids.to_hdf('kept_tids_20150310_resave.h5', 'kept_tids', mode='w', format='t', data_columns=True)
>>> chroms = kept_tids['chrom'].drop_duplicates().order().tolist()
>>> print chroms
['chr1', 'chr10', 'chr11', 'chr12', 'chr13', 'chr14', 'chr15', 'chr16', 'chr17', 'chr18', 'chr19', 'chr2', 'chr20', 'chr21', 'chr22', 'chr3', 'chr4', 'chr5', 'chr6', 'chr7', 'chr8', 'chr9', 'chrM', 'chrX', 'chrY']
>>> len(kept_tids)
202836
>>> sum(len(pd.read_hdf('kept_tids_20150310_resave.h5', 'kept_tids', mode='r', where="chrom == '%s'"%x)) for x in chroms)
193757
>>> (kept_tids['chrom']=='chr16').sum()
10157
>>> len(pd.read_hdf('kept_tids_20150310_resave.h5', 'kept_tids', mode='r', where="chrom == 'chr16'"))
6278
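The session above checks a simple invariant that the indexed reads violate: summing per-value match counts must recover the full row count. A minimal in-memory sketch of that invariant, using a hypothetical tiny frame and pandas only (no HDF5 involved):

```python
import pandas as pd

# Hypothetical stand-in for kept_tids. The invariant: per-chromosome match
# counts sum to the full length. The buggy "where" reads above returned only
# 193757 of 202836 rows, breaking exactly this check.
df = pd.DataFrame({'chrom': ['chr1', 'chr2', 'chr1', 'chrX', 'chr16']})
per_chrom = {c: int((df['chrom'] == c).sum()) for c in df['chrom'].unique()}
total = sum(per_chrom.values())
print(total == len(df))  # → True
```

With a correct store, replacing the in-memory comparison with per-value `where` reads should leave this invariant intact.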
@rockg
Contributor

rockg commented Mar 18, 2015

Sounds like it may be similar to #8265.

@alexfields
Author

@rockg Yes, it looks like the same problem; sorry I missed that post. I also wanted to note that the problem persists in 0.16.0rc1.

@rockg
Contributor

rockg commented Mar 18, 2015

It's a PyTables issue, unfortunately. We've bugged them a few times, but nothing has been done so far.

@jreback jreback added the IO HDF5 read_hdf, HDFStore label Mar 18, 2015
@jreback
Contributor

jreback commented Mar 18, 2015

yep, issue is here: PyTables/PyTables#319

@jreback jreback added the Bug label Mar 18, 2015
@alexfields
Author

Thanks @jreback - I just came across that topic myself as well. Given that nothing seems to be happening on the PyTables side, are there any workarounds in the meantime? Even "nasty" workarounds (as alluded to by @rockg) are better than nothing...

I'm guessing either setting index=False when saving or calling ptrepack each time might solve the problem - do we know if either of those is foolproof, or can the problem still persist?

@jreback
Contributor

jreback commented Mar 19, 2015

So here's my example directly using HDFStore. The issue comes up only when using start/stop AND where, IIRC. And if you use a 'big' chunksize this will work; it's only a smaller value that seems to trigger it. Selecting on the entire file is safe.

So in practice it is pretty hard to actually make this fail.

import os

import numpy as np
import pandas as pd

def pr(result):
    print(result.A.unique())

print("creating test file")
N = 7000000
df = pd.DataFrame({'A': np.random.randint(-20000, -15000, size=N),
                   'B': np.random.randn(N)})

df.to_hdf('test.h5', 'df', mode='w', complib='blosc', format='table',
          data_columns=True, index=False)
print("creating test file : done")

uniques = df.A.unique()
v1 = uniques[0]
v2 = uniques[1]
selector = "A=[v1,v2]"

print("in-memory")
pr(pd.DataFrame({'A': [v1, v2]}))

print("full selection (no-index)")
pr(pd.read_hdf('test.h5', 'df', where=selector))

print("indexing")
with pd.HDFStore('test.h5') as store:
    store.create_table_index('df')

print("full selection (with-index)")
pr(pd.read_hdf('test.h5', 'df', where=selector))

print("selecting by chunks")
cs = 500000
chunks = N // cs  # integer division so range() gets an int
for i in range(chunks):
    pr(pd.read_hdf('test.h5', 'df', start=i * cs, stop=(i + 1) * cs, where=selector))

print("cleaning up")
os.remove('test.h5')
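The chunked loop at the end is plain [start, stop) window arithmetic. A hypothetical helper (not part of the example above) that produces the same boundaries `read_hdf` receives via start/stop:

```python
# Hypothetical helper mirroring the chunk loop above: split n rows into
# [start, stop) windows of size cs, i.e. the values passed to
# read_hdf(start=..., stop=...). A trailing partial window is kept.
def chunk_bounds(n, cs):
    return [(i * cs, min((i + 1) * cs, n)) for i in range((n + cs - 1) // cs)]

bounds = chunk_bounds(7000000, 500000)
print(len(bounds))  # → 14 (N divides evenly into cs-sized chunks here)
print(bounds[0])    # → (0, 500000)
```

The bug surfaced only when such start/stop windows were combined with a `where` selector on an indexed column; each window alone is a well-defined row slice.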

@FrancescAlted

This has been addressed in PyTables: PyTables/PyTables@035dbd5 and the fix will be part of the forthcoming 3.2.0 release.

@jreback
Contributor

jreback commented Apr 18, 2015

this is going to need a validation test which we can put in once 3.2 comes out

@jreback jreback reopened this Apr 18, 2015
@jreback jreback added this to the 0.17.0 milestone Apr 18, 2015
@FrancescAlted

I have added a new test in PyTables itself, but it is a good idea to re-check that in pandas too. Sorry for closing too fast.

@jreback
Contributor

jreback commented Apr 18, 2015

np it's just very subtle and wanted to mention it in our next release

@FrancescAlted

It was subtle indeed. Thanks for taking the time to produce the self-contained example. It proved to be very useful.

@jreback
Contributor

jreback commented Apr 21, 2015

@rockg
Contributor

rockg commented Apr 27, 2015

@FrancescAlted Unfortunately I believe that the behavior exhibited in #8265 still exists. I just tried with the PyTables release candidate, and the mismatch in lengths between using the index and not using it is still there. I'm using the file generated prior to the bug fix (I don't know whether this matters or not).

In [1]: from pandas import HDFStore, Term
In [2]: store = HDFStore('/mnt/home/gfr/Downloads/indexIssue (8).h5', mode='r')
In [3]: p1 = store.select('ts', where=Term('Path', '=', 6), auto_close=False)
In [4]: len(p1)
Out[4]: 2892
In [5]: p2 = store.select('ts', auto_close=False)
In [6]: p2s = p2[p2.index.get_level_values('Path') == 6]
In [7]: len(p2s)
Out[7]: 2972

In [8]: import tables
In [9]: tables.__version__
Out[9]: '3.2.0rc2'

@jreback
Contributor

jreback commented Apr 27, 2015

@rockg you need to regenerate the original file. The problem is that the indexes are off in the written file. Please let me know.

@rockg
Contributor

rockg commented Apr 27, 2015

I took the dataframe returned from p2 and wrote that out to a new HDF5 file. I then ran the above code and still see differences.

@FrancescAlted

Hmm, I would like to fix that before 3.2.0 final, but I would need the data file to have a self-contained example. Could you please provide a minimal example exposing the problem?

@rockg
Contributor

rockg commented Apr 28, 2015

You can download the file from https://www.dropbox.com/s/122q55g5ubcf4fl/indexIssue.h5?dl=0. Then I think all you have to do is the below (it relies on pandas, but I don't think that impacts the underlying problem). The code first re-creates the file under the new code, then selects from the store using the index, and finally selects from the in-memory frame without using the index.

from pandas import HDFStore, Term

store = HDFStore('indexIssue.h5', mode='r')
data = store.select('ts')
data.to_hdf('indexIssueNew.h5', 'ts', format='table')

store = HDFStore('indexIssueNew.h5', mode='r')
p1 = store.select('ts', where=Term('Path', '=', 6), auto_close=False)
print(len(p1))
p2 = store.select('ts', auto_close=False)
p2s = p2[p2.index.get_level_values('Path') == 6]
print(len(p2s))
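The level filter in the last two lines can be exercised entirely in memory, with no HDF5 file. A sketch using a hypothetical small frame whose index carries the same 'Path' level name:

```python
import pandas as pd

# Hypothetical stand-in for the 'ts' table: a MultiIndex with a 'Path' level,
# filtered the same way as p2s above. This is the ground truth the indexed
# "where" selection is compared against.
idx = pd.MultiIndex.from_product([[5, 6], range(3)], names=['Path', 'i'])
df = pd.DataFrame({'v': range(6)}, index=idx)
subset = df[df.index.get_level_values('Path') == 6]
print(len(subset))  # → 3
```

With a healthy store, `store.select('ts', where=Term('Path', '=', 6))` should return the same rows as this in-memory mask.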

@FrancescAlted

This has been fixed in PyTables in the 'release-3.2.0' branch. Could you please give it a try?

@rockg
Contributor

rockg commented May 4, 2015

Good news, I too see this working properly.

@jreback
Contributor

jreback commented May 4, 2015

@rockg awesome !

can u do a pr to note this for the whatsnew for 0.16.1 and more importantly put a warning in the HDF5 section (eg recommend pytables 3.2 for issues like this)

thxs
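The docs warning requested here could be backed by a small version gate. A hypothetical helper (not pandas API) that flags PyTables releases older than the 3.2 fix discussed in this thread:

```python
import re

# Hypothetical helper for the requested docs warning: PyTables < 3.2 can drop
# rows on indexed "where" reads (this issue); 3.2.0 carries the fix. The crude
# parse keeps only the major/minor numbers, so pre-releases like '3.2.0rc2'
# compare as 3.2.
def needs_pytables_upgrade(version):
    major_minor = tuple(int(x) for x in re.findall(r'\d+', version)[:2])
    return major_minor < (3, 2)

print(needs_pytables_upgrade('3.1.0'))     # → True
print(needs_pytables_upgrade('3.2.0rc2'))  # → False
```

In practice one would feed this `tables.__version__` and warn (or document) accordingly, as the whatsnew note suggests.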

@jreback jreback modified the milestones: 0.16.1, 0.17.0 May 4, 2015
@jreback jreback added the Docs label May 4, 2015
@jreback
Contributor

jreback commented May 6, 2015

@rockg can you do a doc pr for this?
