BUG: entries missing when reading from pytables hdf store using "where" statement #9676

Closed
alexfields opened this issue Mar 18, 2015 · 21 comments · Fixed by #10069
Labels
Bug Docs IO HDF5 read_hdf, HDFStore
Milestone

Comments

@alexfields

When I select from an HDF store using a "where" string (locating entries in which one field matches a particular string value), the function returns fewer rows than when I load the entire dataframe into memory and then match on that field. Below is some code that reproduces the problem; unfortunately, I can't easily provide the code that generates the source HDF store, but I'm happy to provide the kept_tids_20150310.h5 file if it would help. There are no NaN values in the dataframe.

Running ptrepack on the file solves the problem, but I don't believe this should happen in the first place.

I am using pandas 0.15.2 but have not tried 0.16.0.

>>> import pandas as pd
>>> pd.show_versions()

INSTALLED VERSIONS
------------------
commit: None
python: 2.7.3.final.0
python-bits: 64
OS: Linux
OS-release: 3.13.0-46-generic
machine: x86_64
processor: x86_64
byteorder: little
LC_ALL: None
LANG: en_US.UTF-8

pandas: 0.15.2
nose: 1.3.4
Cython: 0.19.2
numpy: 1.9.2
scipy: 0.15.1
statsmodels: 0.5.0
IPython: 2.4.0
sphinx: 1.2.3
patsy: 0.2.1
dateutil: 2.4.1
pytz: 2014.10
bottleneck: 0.8.0
tables: 3.1.0
numexpr: 2.3
matplotlib: 1.4.2
openpyxl: None
xlrd: None
xlwt: 0.7.2
xlsxwriter: None
lxml: 2.3.2
bs4: 4.3.2
html5lib: 0.999
httplib2: 0.7.2
apiclient: None
rpy2: 2.4.2
sqlalchemy: None
pymysql: None
psycopg2: None
>>> kept_tids = pd.read_hdf('kept_tids_20150310.h5', 'kept_tids', mode='r')
>>> kept_tids.to_hdf('kept_tids_20150310_resave.h5', 'kept_tids', mode='w', format='t', data_columns=True)
>>> chroms = kept_tids['chrom'].drop_duplicates().order().tolist()
>>> print chroms
['chr1', 'chr10', 'chr11', 'chr12', 'chr13', 'chr14', 'chr15', 'chr16', 'chr17', 'chr18', 'chr19', 'chr2', 'chr20', 'chr21', 'chr22', 'chr3', 'chr4', 'chr5', 'chr6', 'chr7', 'chr8', 'chr9', 'chrM', 'chrX', 'chrY']
>>> len(kept_tids)
202836
>>> sum(len(pd.read_hdf('kept_tids_20150310_resave.h5', 'kept_tids', mode='r', where="chrom == '%s'"%x)) for x in chroms)
193757
>>> (kept_tids['chrom']=='chr16').sum()
10157
>>> len(pd.read_hdf('kept_tids_20150310_resave.h5', 'kept_tids', mode='r', where="chrom == 'chr16'"))
6278
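The session above checks a simple invariant that the indexed reads violate: summing per-value match counts must recover the full row count. A minimal in-memory sketch of that invariant, using a hypothetical tiny frame and pandas only (no HDF5 involved):

```python
import pandas as pd

# Hypothetical stand-in for kept_tids. The invariant: per-chromosome match
# counts sum to the full length. The buggy "where" reads above returned only
# 193757 of 202836 rows, breaking exactly this check.
df = pd.DataFrame({'chrom': ['chr1', 'chr2', 'chr1', 'chrX', 'chr16']})
per_chrom = {c: int((df['chrom'] == c).sum()) for c in df['chrom'].unique()}
total = sum(per_chrom.values())
print(total == len(df))  # → True
```

With a correct store, replacing the in-memory comparison with per-value `where` reads should leave this invariant intact.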
@rockg
Contributor

rockg commented Mar 18, 2015

Sounds like it may be similar to #8265.

@alexfields
Author

@rockg Yes, it looks like the same problem; sorry I missed that post. I also wanted to note that the problem persists in 0.16.0rc1.

@rockg
Contributor

rockg commented Mar 18, 2015

It's a PyTables issue, unfortunately. We've bugged them a few times, but nothing has been done so far.

@jreback jreback added the IO HDF5 read_hdf, HDFStore label Mar 18, 2015
@jreback
Contributor

jreback commented Mar 18, 2015

yep, issue is here: PyTables/PyTables#319

@jreback jreback added the Bug label Mar 18, 2015
@alexfields
Author

Thanks @jreback - I just came across that topic myself as well. Given that nothing seems to be happening on the PyTables side, are there any workarounds in the meantime? Even "nasty" workarounds (as alluded to by @rockg) are better than nothing...

I'm guessing either setting index=False when saving or calling ptrepack each time might solve the problem - do we know if either of those is foolproof, or can the problem still persist?

@jreback
Contributor

jreback commented Mar 19, 2015

So here's my example directly using HDFStore. The issue comes up only when using start/stop AND where, IIRC. And if you use a 'big' chunksize this will work; it's only a smaller value that seems to trigger it. Selecting on the entire file is safe.

So in practice it is pretty hard to actually make this fail.

import os

import numpy as np
import pandas as pd

def pr(result):
    print(result.A.unique())

print("creating test file")
N = 7000000
df = pd.DataFrame({'A': np.random.randint(-20000, -15000, size=N),
                   'B': np.random.randn(N)})

df.to_hdf('test.h5', 'df', mode='w', complib='blosc', format='table',
          data_columns=True, index=False)
print("creating test file : done")

uniques = df.A.unique()
v1 = uniques[0]
v2 = uniques[1]
selector = "A=[v1,v2]"

print("in-memory")
pr(pd.DataFrame({'A': [v1, v2]}))

print("full selection (no-index)")
pr(pd.read_hdf('test.h5', 'df', where=selector))

print("indexing")
with pd.HDFStore('test.h5') as store:
    store.create_table_index('df')

print("full selection (with-index)")
pr(pd.read_hdf('test.h5', 'df', where=selector))

print("selecting by chunks")
cs = 500000
chunks = N // cs  # integer division so range() gets an int
for i in range(chunks):
    pr(pd.read_hdf('test.h5', 'df', start=i * cs, stop=(i + 1) * cs, where=selector))

print("cleaning up")
os.remove('test.h5')
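The chunked loop at the end is plain [start, stop) window arithmetic. A hypothetical helper (not part of the example above) that produces the same boundaries `read_hdf` receives via start/stop:

```python
# Hypothetical helper mirroring the chunk loop above: split n rows into
# [start, stop) windows of size cs, i.e. the values passed to
# read_hdf(start=..., stop=...). A trailing partial window is kept.
def chunk_bounds(n, cs):
    return [(i * cs, min((i + 1) * cs, n)) for i in range((n + cs - 1) // cs)]

bounds = chunk_bounds(7000000, 500000)
print(len(bounds))  # → 14 (N divides evenly into cs-sized chunks here)
print(bounds[0])    # → (0, 500000)
```

The bug surfaced only when such start/stop windows were combined with a `where` selector on an indexed column; each window alone is a well-defined row slice.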

@FrancescAlted

This has been addressed in PyTables: PyTables/PyTables@035dbd5 and the fix will be part of the forthcoming 3.2.0 release.

@jreback
Contributor

jreback commented Apr 18, 2015

this is going to need a validation test which we can put in once 3.2 comes out

@jreback jreback reopened this Apr 18, 2015
@jreback jreback added this to the 0.17.0 milestone Apr 18, 2015
@FrancescAlted

I have added a new test in PyTables itself, but it is a good idea to re-check that in pandas too. Sorry for closing too fast.

@jreback
Contributor

jreback commented Apr 18, 2015

np it's just very subtle and wanted to mention it in our next release

@FrancescAlted

It was subtle indeed. Thanks for taking the time to produce the self-contained example. It proved to be very useful.

@jreback
Contributor

jreback commented Apr 21, 2015

@rockg
Contributor

rockg commented Apr 27, 2015

@FrancescAlted Unfortunately I believe that the behavior exhibited in #8265 still exists. I just tried with the PyTables release candidate, and the mismatch in lengths between using the index and not using it is still there. I'm using the file generated prior to the bug fix (I don't know whether this matters or not).

In [1]: from pandas import HDFStore, Term
In [2]: store = HDFStore('/mnt/home/gfr/Downloads/indexIssue (8).h5', mode='r')
In [3]: p1 = store.select('ts', where=Term('Path', '=', 6), auto_close=False)
In [4]: len(p1)
Out[4]: 2892
In [5]: p2 = store.select('ts', auto_close=False)
In [6]: p2s = p2[p2.index.get_level_values('Path') == 6]
In [7]: len(p2s)
Out[7]: 2972

In [8]: import tables
In [9]: tables.__version__
Out[9]: '3.2.0rc2'

@jreback
Contributor

jreback commented Apr 27, 2015

@rockg you need to regenerate the original file. The problem is that the indexes are off in the written file. Please let me know.

@rockg
Contributor

rockg commented Apr 27, 2015

I took the dataframe returned from p2 and wrote that out to a new HDF5 file. I then ran the above code and still see differences.

@FrancescAlted

Hmm, I would like to fix that before 3.2.0 final, but I would need the data file to have a self-contained example. Could you please provide a minimal example exposing the problem?

@rockg
Contributor

rockg commented Apr 28, 2015

You can download the file from https://www.dropbox.com/s/122q55g5ubcf4fl/indexIssue.h5?dl=0. Then I think all you have to do is the below (it relies on pandas, but I don't think that impacts the underlying problem). The code first re-creates the file under the new code, then selects from the store using the index, and finally selects from the in-memory frame without using the index.

from pandas import HDFStore, Term

store = HDFStore('indexIssue.h5', mode='r')
data = store.select('ts')
data.to_hdf('indexIssueNew.h5', 'ts', format='table')

store = HDFStore('indexIssueNew.h5', mode='r')
p1 = store.select('ts', where=Term('Path', '=', 6), auto_close=False)
print(len(p1))
p2 = store.select('ts', auto_close=False)
p2s = p2[p2.index.get_level_values('Path') == 6]
print(len(p2s))
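The level filter in the last two lines can be exercised entirely in memory, with no HDF5 file. A sketch using a hypothetical small frame whose index carries the same 'Path' level name:

```python
import pandas as pd

# Hypothetical stand-in for the 'ts' table: a MultiIndex with a 'Path' level,
# filtered the same way as p2s above. This is the ground truth the indexed
# "where" selection is compared against.
idx = pd.MultiIndex.from_product([[5, 6], range(3)], names=['Path', 'i'])
df = pd.DataFrame({'v': range(6)}, index=idx)
subset = df[df.index.get_level_values('Path') == 6]
print(len(subset))  # → 3
```

With a healthy store, `store.select('ts', where=Term('Path', '=', 6))` should return the same rows as this in-memory mask.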

@FrancescAlted

This has been fixed in PyTables in the 'release-3.2.0' branch. Could you please give it a try?

@rockg
Contributor

rockg commented May 4, 2015

Good news, I too see this working properly.

@jreback
Contributor

jreback commented May 4, 2015

@rockg awesome !

can u do a pr to note this for the whatsnew for 0.16.1 and more importantly put a warning in the HDF5 section (eg recommend pytables 3.2 for issues like this)

thxs
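The docs warning requested here could be backed by a small version gate. A hypothetical helper (not pandas API) that flags PyTables releases older than the 3.2 fix discussed in this thread:

```python
import re

# Hypothetical helper for the requested docs warning: PyTables < 3.2 can drop
# rows on indexed "where" reads (this issue); 3.2.0 carries the fix. The crude
# parse keeps only the major/minor numbers, so pre-releases like '3.2.0rc2'
# compare as 3.2.
def needs_pytables_upgrade(version):
    major_minor = tuple(int(x) for x in re.findall(r'\d+', version)[:2])
    return major_minor < (3, 2)

print(needs_pytables_upgrade('3.1.0'))     # → True
print(needs_pytables_upgrade('3.2.0rc2'))  # → False
```

In practice one would feed this `tables.__version__` and warn (or document) accordingly, as the whatsnew note suggests.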

@jreback jreback modified the milestones: 0.16.1, 0.17.0 May 4, 2015
@jreback jreback added the Docs label May 4, 2015
@jreback
Contributor

jreback commented May 6, 2015

@rockg can you do a doc pr for this?
