
Pandas pytables interface doesn't create empty table datasets #13016

Open
damionw opened this issue Apr 28, 2016 · 27 comments
Labels: Bug, IO HDF5 (read_hdf, HDFStore)

Comments

@damionw

damionw commented Apr 28, 2016

Pandas used to allow the writing of empty HDF5 datasets through its pytables interface code. However, after upgrading to 0.17 (from 0.11), we've discovered that this behaviour is intentionally short-circuited: the library behaves as though the dataset is being written, but it simply ignores the request, and the resulting HDF5 file doesn't contain the requested table.

The offending code is in pandas/io/pytables.py:_write_to_group()

    # we don't want to store a table node at all if are object is 0-len
    # as there are not dtypes
    if getattr(value, 'empty', None) and (format == 'table' or append):
        return

We've worked around it by patching our installed copy of pandas, but we'd like to understand the motivation behind this code before submitting a pull request. The comment implies that the lack of dtypes in the dataset is the cause; however, each pandas column carries dtype information even when the frame is empty.

Any clarification would be appreciated

@jreback
Contributor

jreback commented Apr 28, 2016

pls show a copy-pastable example as well as versions for old/new (esp. PyTables versions)

@jreback jreback added the IO HDF5 read_hdf, HDFStore label Apr 28, 2016
@jreback
Contributor

jreback commented Apr 28, 2016

I don't believe this was ever supported in PyTables for table format.

@damionw
Author

damionw commented Apr 28, 2016

Hard to believe since we were using it :-)

In 2012, Wes McKinney merged a patch into pandas related to this issue:

#1707

Seen here:

603e5ae

And, as to the current behaviour, why is it not an error? It doesn't fail; it just doesn't write any of the supporting structures.

@jreback
Contributor

jreback commented Apr 28, 2016

you are referring to fixed format. Pls show an example and version, including a clean file with generated metadata.

@damionw
Author

damionw commented Apr 28, 2016

Example

#!/usr/bin/env python
import pandas as pd

h5 = pd.HDFStore("/tmp/test.h5", mode="w")

try:
    df = pd.DataFrame([[_x, _x * 2] for _x in range(12)], columns=['one', 'two'], index=None)

    h5.put("full", df[:], format='table', data_columns=list(df.keys()))     # Write full dataset
    h5.put("empty", df[0:0], format='table', data_columns=list(df.keys()))  # Write empty dataset
finally:
    h5.close()

h5 = pd.HDFStore("/tmp/test.h5", mode="r")

for key in ["full", "empty"]:
    print "*** Examining table [{}] ***".format(key)

    try:
        print h5[key].head()
    except KeyError as _exception:
        print "{} does not exist in hdf5 file".format(key)

    print

h5.close()

@jreback
Contributor

jreback commented Apr 28, 2016

In [2]: import pandas as pd

In [3]: pd.__version__
Out[3]: '0.11.0'

In [4]: import numpy as np

In [5]: np.__version__
Out[5]: '1.7.1'

In [6]: import tables

In [7]: tables.__version__
Out[7]: '2.4.0'

In [8]: df = pd.DataFrame([[_x, _x * 2] for _x in range(12)], columns=['one', 'two'], index=None)

In [9]: 

In [9]: df
Out[9]: 
    one  two
0     0    0
1     1    2
2     2    4
3     3    6
4     4    8
5     5   10
6     6   12
7     7   14
8     8   16
9     9   18
10   10   20
11   11   22

In [10]: h5 = pd.HDFStore("test.h5", mode="w")
In [11]: h5.put("full", df[:], format='table', data_columns=list(df.keys())) # Write full dataset

In [12]: h5.put("empty", df[0:0], format='table', data_columns=list(df.keys())) # Write empty dataset

In [13]: h5.close()

In [14]: h5 = pd.HDFStore("test.h5", mode="r")

In [15]: h5
Out[15]: 
<class 'pandas.io.pytables.HDFStore'>
File path: test.h5
/empty            frame        (shape->[1,2]) 
/full             frame        (shape->[12,2])

In [16]: h5.root.empty
Out[16]: 
/empty (Group) ''
  children := ['block0_items' (Array), 'block0_values' (Array), 'axis0' (Array), 'axis1' (Array)]

In [20]: h5['empty']
ValueError: Shape of passed values is (2, 0), indices imply (2, 1)

In [22]: h5['full']
Out[22]: 
    one  two
0     0    0
1     1    2
2     2    4
3     3    6
4     4    8
5     5   10
6     6   12
7     7   14
8     8   16
9     9   18
10   10   20
11   11   22

So this looks broken in that version.

So you will have to be more specific on what you are doing.

@damionw
Author

damionw commented Apr 28, 2016

The above example writes the underlying data structures in version 0.11, but reading it back fails because the zero-length dimension (y == 0) isn't handled.

In >0.16 the write is ignored

@jreback
Contributor

jreback commented Apr 28, 2016

not sure I understand then. you said it worked in 0.11; what does that have to do with it?

Hard to believe since we were using it :-)

@damionw
Author

damionw commented Apr 28, 2016

We're writing a data structure that can be empty. Then we're reading the data structure in another program. The current method silently elides the existence of the table, so the reading program would have to catch an exception and fake the data structure.
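
For illustration, a minimal sketch of that reader-side fallback (the /tmp/test.h5 path and the one/two columns are taken from the example above; this is not code from the thread):

import pandas as pd

def read_or_fake(path, key, columns):
    """Fall back to a fabricated empty DataFrame when the writer's
    put was silently dropped and the key never made it into the file."""
    with pd.HDFStore(path, mode="r") as store:
        try:
            return store[key]
        except KeyError:
            return pd.DataFrame(columns=columns)

df = read_or_fake("/tmp/test.h5", "empty", columns=["one", "two"])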

@jreback
Contributor

jreback commented Apr 28, 2016

and from the comments in the PR

fixed this, though required a bit of a hackjob (pytables doesn't like zero-length objects)

@jreback
Contributor

jreback commented Apr 28, 2016

@damionw you are saying that it worked though. I want you to prove it. I don't recall this EVER working for table=True (different option back then), because of the PyTables limitation; it was a work-around for fixed.

@damionw
Author

damionw commented Apr 28, 2016

It worked because we aren't reading the data structure using pandas (in this use case). However, that should work in pandas too, shouldn't it? Simply ignoring the write request doesn't seem to me to be the way to handle it.

I'll happily fix it if there isn't a reason not to, which is the question I'm asking

@jreback
Contributor

jreback commented Apr 28, 2016

I am still not clear on what you aim to fix or even what's broken. Pls show a complete example. Yes you can't read back the empty in that version, but that has been fixed since.

In [19]: pd.read_hdf('../test.h5','empty').dtypes
Out[19]: 
one    int64
two    int64
dtype: object

In [20]: pd.__version__
Out[20]: '0.18.0+176.gb13ddd5'

@damionw
Author

damionw commented Apr 28, 2016

Pandas normally writes supporting information for tables into the HDF5 file; the table name is actually used for the group that contains the various indices and data tables. In version 0.11 this information is written to the HDF5 file whether the data is empty or not. Subsequently, it is not written at all.

The problem being addressed is that, now, the very existence of the desired table is prevented when it is empty.

Your above example will read the entry once it's created, but won't allow it to be written (at least on 0.17). Please advise if 0.18 now allows writing empty tables (using table=True).

I'm unsure what more of an example you require to show the behaviour. Could you help me understand what else I need to provide?

@jreback
Contributor

jreback commented Apr 28, 2016

@damionw you need to show a complete example of what you are expecting given a certain input set, on a certain version. I showed that above. That works. So show something that does what you think you want.

@damionw
Author

damionw commented Apr 28, 2016

The same example I gave works as expected when pandas is patched by removing the described return statement.

@damionw
Author

damionw commented Apr 28, 2016

I can provide the resulting hdf5 files for all cases offline if that would help

@damionw
Author

damionw commented Apr 28, 2016

Correction: For 0.16 and beyond, the table=True format creates a group with the desired table name and a table named "table" underneath. All of that is omitted when the dataset is empty.
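
For illustration, the missing structure can be seen from outside pandas by walking the file with PyTables directly (a sketch assuming PyTables 3.x and the /tmp/test.h5 file written by the example above):

import tables

with tables.open_file("/tmp/test.h5", mode="r") as f:
    # On 0.16+ only /full and its children are listed; the /empty group
    # and its "table" child are never created for the 0-row frame.
    for node in f.walk_nodes("/"):
        print(node._v_pathname)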

@jreback
Contributor

jreback commented Apr 28, 2016

@damionw pls write a test, using the current version of pandas that fails.
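
For reference, a minimal sketch of such a test (pytest style; the tmp_path fixture and the expected keys describe the desired fixed behaviour, not what current pandas does):

import pandas as pd

def test_put_empty_frame_table_format(tmp_path):
    # A 0-row frame written in table format should still create the node
    # so it can be read back with its column dtypes intact.
    path = str(tmp_path / "empty.h5")
    df = pd.DataFrame({"one": pd.Series(dtype="int64"),
                       "two": pd.Series(dtype="int64")})

    with pd.HDFStore(path, mode="w") as store:
        store.put("empty", df, format="table", data_columns=True)

    with pd.HDFStore(path, mode="r") as store:
        assert "/empty" in store.keys()
        result = store["empty"]

    assert result.empty
    assert list(result.columns) == ["one", "two"]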

@damionw
Author

damionw commented May 19, 2016

Will do. Thanks

DKW


@lafrech

lafrech commented Oct 30, 2017

I think I'm facing the same issue.

Here's how to reproduce:

import pandas as pd
from pandas import HDFStore


# Prints 0.20.3
print(pd.__version__)

emptydf = pd.DataFrame({'col_1': [], 'col_2': []}, index=[])

with HDFStore("test.h5", 'w') as store:

    assert not store.keys()

    # append -> no table created
    store.append('empty', emptydf)
    assert not store.keys()

    # put, 'table' format -> no table created
    store.put('empty', emptydf, format='table')
    assert not store.keys()

    # put, default format -> array created
    store.put('empty', emptydf)
    assert store.keys() == ['/empty']

store.close()

My use case

I'm writing an API to store timeseries and I would like to separate creation/deletion of timeseries ID and data write/delete in a timeseries.

In other words, I want to be able to do

# Returns empty list []
list_ids()

# Raises "ID does not exist" exception
save(new_id, new_data)

# Creates new timeseries ID
create(new_id)

# Returns [new_id, ]
list_ids()

# Writes data (this time, ID exists)
save(new_id, new_data)

but I don't know how to create an empty timeseries because it won't be written to the file. I could allow save to auto-create the timeseries, but that wouldn't solve the problem of the ID not being listed until there is actually data in it, and therefore not being advertised in the list.

The only workaround I see is to maintain an ID list somewhere else, which I'd rather avoid.
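
One possible compromise, building on the observation above that put with the default (fixed) format does store an empty frame: register the ID as a fixed-format placeholder and switch to the table format once data exists. A rough sketch (the create/list_ids/save names mirror the hypothetical API above, and the column layout is an assumption):

import pandas as pd

STORE_PATH = "timeseries.h5"
EMPTY = pd.DataFrame({'col_1': pd.Series(dtype='float64'),
                      'col_2': pd.Series(dtype='float64')})

def create(ts_id):
    # Fixed format accepts a 0-row frame, so the key shows up in keys().
    with pd.HDFStore(STORE_PATH) as store:
        store.put(ts_id, EMPTY)

def list_ids():
    with pd.HDFStore(STORE_PATH) as store:
        return [key.lstrip('/') for key in store.keys()]

def save(ts_id, data):
    with pd.HDFStore(STORE_PATH) as store:
        if '/' + ts_id not in store.keys():
            raise KeyError("ID does not exist: {}".format(ts_id))
        # Replace the placeholder with a queryable table once data exists.
        store.put(ts_id, data, format='table')

The drawback is that an empty ID stays a fixed-format node until its first save, so it can't be queried with where= clauses; this only works around the listing problem.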

@damionw
Author

damionw commented Oct 30, 2017 via email

@vss888

vss888 commented Mar 22, 2019

Could you please let me know what the conclusion is: will it be fixed? I can see that it was scheduled for the "Next Major Release" milestone, but that milestone has since been deleted, which (I guess) means the fix is no longer scheduled, even though it is an "Effort Low" issue.

@jreback
Contributor

jreback commented Mar 22, 2019

this is a problem with PyTables and would have to be fixed there

we have 3000 issues so things get fixed when someone submits a patch

@vss888

vss888 commented Mar 22, 2019

In this case, a fix has effectively been proposed above: the check at

pytables.py#L1365

explicitly skips writing an object if object.empty is True. All we need to do is delete that condition (assuming it does not break anything else).

@damionw
Author

damionw commented Mar 22, 2019 via email

@arw2019
Member

arw2019 commented Oct 11, 2020

xref PyTables/PyTables#592

@mroeschke mroeschke removed this from the Contributions Welcome milestone Oct 13, 2022