
HDFStore.append_to_multiple doesn't write rows that are all np.nan #4698

Merged
merged 1 commit into pandas-dev:master from adgaudio/master on Sep 3, 2013

Conversation

@adgaudio (Contributor) commented Aug 27, 2013

Hi,

Using HDFStore.append_to_multiple, if a row written to any one table consists entirely of np.nan, that row is not written to that table, but it is written to the other tables. The following code reproduces and fixes the issue.

I would prefer that append_to_multiple maintain synchronized rows across tables, and to my knowledge, the best way to do that is to drop that row from the other tables. We would probably need a fix that looks something like this PR.

I'm not sure if this is the best way to do this, and would love some feedback!

Thanks,
Alex

The Fix:

# setup
import pandas
from functools import reduce  # reduce is not a builtin on python 3

d = {'table1': ['col1', 'col2'], 'table2': ['col3', 'col4']}
value = pandas.DataFrame([[1, 2, 3, 4], [5, 6, None, None]],
                         columns=['col1', 'col2', 'col3', 'col4'])

# fix (as in PR): keep only the index labels that survive dropna(how='all')
# in every sub-table, so the rows stay synchronized across tables
dfs = (value[cols] for cols in d.values())
valid_index = reduce(
        lambda index1, index2: index2.intersection(index1),
        [df.dropna(how='all').index for df in dfs])
value = value.ix[valid_index]
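
With this setup, table2's columns (col3, col4) are entirely NaN in row 1, so the intersection keeps only row 0 and a single synchronized row is written to every table. A quick illustrative check (not part of the PR code):

# only the first row's index label survives the intersection
assert list(valid_index) == [0]
assert len(value) == 1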

To reproduce the error:

In [1]: from pandas import DataFrame, HDFStore

In [2]: store = HDFStore('tmp1.h5')

In [3]: df = DataFrame([[1,2], [3,None]], columns=['a', 'b'])

In [4]: store.append_to_multiple({'table1': ['a'], 'table2': ['b']}, df, 'table1')

In [5]: store
Out[5]:
<class 'pandas.io.pytables.HDFStore'>
File path: tmp1.h5
/table1            frame_table  (typ->appendable,nrows->2,ncols->1,indexers->[index],dc->[a])
/table2            frame_table  (typ->appendable,nrows->1,ncols->1,indexers->[index])

In [6]: store.select_as_multiple(['table1', 'table2'])
---------------------------------------------------------------------------
ValueError                                Traceback (most recent call last)
<ipython-input-6-1a2861fcaa34> in <module>()
----> 1 store.select_as_multiple(['table1', 'table2'])

/Users/alexgaudio/.virtualenvs/ds/lib/python2.7/site-packages/pandas/io/pytables.pyc in select_as_multiple(self, keys, where, selector, columns, start, stop, iterator, chunksize, auto_close, **kwargs)
    540                 nrows = t.nrows
    541             elif t.nrows != nrows:
--> 542                 raise ValueError("all tables must have exactly the same nrows!")
    543
    544         # select coordinates from the selector table

ValueError: all tables must have exactly the same nrows!
@jreback (Contributor) commented Aug 27, 2013

related #4625

this is a more general issue; I think a better way to handle it is to add a dropna=True kwarg (which preserves the existing behavior), and then your solution would work

please add some tests for this as well

@adgaudio (Author, Contributor) commented Aug 28, 2013

@jreback - thanks for your prompt feedback! I just added to the existing tests for that method and added a dropna kwarg + docstring to the code.

For dropna, I deviated from the normal true/false syntax. Specifically, dropna= 'any' | 'all' | False. If dropna evaluates to True, then call DF.dropna(how=dropna). You might prefer to revert to normal True/False behavior...

Also, if I entirely misinterpreted what you meant by dropna to begin with, please let me know!
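
For reference, here is a minimal sketch of the semantics described above, combining the proposed dropna kwarg with the index-intersection fix from the PR description. The helper name _synchronized_value is hypothetical and the surrounding append_to_multiple plumbing is omitted; this is illustrative, not the merged implementation:

from functools import reduce

def _synchronized_value(d, value, dropna):
    # dropna may be 'any', 'all', or False; when truthy, each sub-table's
    # frame is filtered with DataFrame.dropna(how=dropna) and only the index
    # labels common to all sub-tables are kept
    if not dropna:
        return value
    idxs = [value[cols].dropna(how=dropna).index for cols in d.values()]
    valid_index = reduce(lambda a, b: b.intersection(a), idxs)
    return value.ix[valid_index]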

@jreback (Contributor) commented Aug 28, 2013

can you hook up Travis (see contributing.md) and add a release notes entry

@adgaudio (Author, Contributor) commented Aug 28, 2013

Travis is enabled for my fork, and I've squashed. When tests pass, it's good to go.

I also modified existing documentation, which you may want to review.

Thanks!
Alex

@jreback reviewed changes to doc/source/io.rst (Outdated):
defines which table is the selector table (which you can make queries from).
The argument ``dropna`` will remove rows from the input DataFrame with
'any' or 'all' ``np.NaN`` values, unless it evaluates to False. ``dropna``
will also ensure that tables are synchronized across rows. For example,

@jreback (Contributor) Aug 28, 2013

this is good - maybe break into 2 paragraphs

@adgaudio (Author, Contributor) Aug 29, 2013

thanks - I attempted to simplify a little further and broke into 2 paragraphs

@jreback reviewed changes to doc/source/release.rst (Outdated):
@@ -40,6 +40,8 @@ pandas 0.13

**Improvements to existing features**

- ``append_to_multiple`` automatically synchronizes writing rows to multiple
  tables and adds a ``dropna`` kwarg (:issue:`4698`)

@jreback (Contributor) Aug 28, 2013

can you put it in the API section (under the HDFStore sub-heading that is already there)?

@adgaudio (Author, Contributor) Aug 29, 2013

done

@jreback reviewed changes to pandas/io/pytables.py (Outdated):
@@ -776,6 +776,10 @@ def append_to_multiple(self, d, value, selector, data_columns=None, axes=None, *
selector : a string that designates the indexable table; all of its columns will
be designed as data_columns, unless data_columns is passed, in which
case these are used
data_columns : list of columns to create as data columns, or True to use all columns
dropna : {'any', 'all'}

@jreback (Contributor) Aug 29, 2013

I think this only makes sense as True or False/None (going back to your original way, I think), as there is no difference between 'any' and 'all', so we might as well make it True by default (to synchronize)

though I am going to make an option on an individual table to be able to write an all-nan row (maybe by setting dropna=False; the default will still be True), so that may be confusing

@adgaudio (Author, Contributor) Aug 29, 2013

I reverted to True/False syntax. The idea behind any/all/False was actually to support dropna(how='any'), which would potentially be a lot more rigorous than just 'all', but that idea did break convention and could be a source of potential confusion.

Ability to write nan rows would be great. When that happens, I vote to change the default kwarg to dropna=False and update the documentation accordingly.

@adgaudio (Author, Contributor) commented Aug 29, 2013

Thanks for your review! I made the changes you suggested, and we should be good for review 2 if Travis tests pass for python3.

@jreback (Contributor) commented Aug 29, 2013

gr8 I'll look further later this week

@jreback (Contributor) commented Aug 31, 2013

@adgaudio please rebase on master; I merged #4714. This should allow you to simply pass dropna=False when creating the tables in append_to_multiple.

@adgaudio (Author, Contributor) commented Aug 31, 2013

I just rebased and pushed. Thanks!

@jreback reviewed changes to pandas/io/pytables.py:

@@ -833,6 +836,14 @@ def append_to_multiple(self, d, value, selector, data_columns=None, axes=None, *
if data_columns is None:
data_columns = d[selector]

# ensure rows are synchronized across the tables

@jreback (Contributor) Aug 31, 2013

what I meant was that for the individual table appends, you can just pass dropna=False; when dropna=True is passed to append_to_multiple, this will achieve the same effect.

@jreback (Contributor) Aug 31, 2013

nvm....what you have is correct

@jreback (Contributor) commented Aug 31, 2013

can you add your example as 2 test cases, 1 with dropna=True, the other with dropna=False (and assert that it still raises the error)? thxs.....then I think it's ready to merge

@jreback reviewed changes to doc/source/io.rst (Outdated):
tables are synchronized. This means that if a row for one of the tables
being written to is entirely ``np.NaN``, that row will be dropped from all tables.

If ``dropna`` is False, the user is responsible for synchronizing tables.

@jreback (Contributor) Aug 31, 2013

can you put the 'USER IS RESPONSIBLE FOR SYNCHRONIZING TABLES' back in CAPS?

@adgaudio (Author, Contributor) Aug 31, 2013

done

@adgaudio (Author, Contributor) commented Aug 31, 2013

I removed the doc section that raises the error, and I agree that the documentation addition showing it is probably unnecessary (which is why it wasn't there to begin with).

Unless I'm misunderstanding you, the tests already test both dropna=True (the # regular operation subsection) and dropna=False (the # don't maintain synchronized rows across tables if dropna=False subsection). Regarding tests, I'm not sure what you're asking for.

If it's simpler for you to just make changes to the test, please feel free to modify the PR however you wish. Otherwise, I'll need a little more explanation on what you're looking for.

@jreback (Contributor) commented Aug 31, 2013

@adgaudio the tests can't have tested exactly the case that you have, because otherwise they would fail. I want you to add the example as a set of tests (make a new function), rather than modifying an existing test (as you did).

BUG: HDFStore.append_to_multiple - ensures rows are synchronized before writing

adds dropna kwarg + docstring + tests + documentation + release note

python3 compatible
@adgaudio (Author, Contributor) commented Sep 3, 2013

@jreback, I broke out the tests into their own function and rebased onto master. Let me know if you need anything else!

Thanks,
Alex

jreback added a commit that referenced this pull request Sep 3, 2013
Merge pull request #4698 from adgaudio/master
HDFStore.append_to_multiple doesn't write rows that are all np.nan

@jreback merged commit 6ffed43 into pandas-dev:master Sep 3, 2013

@jreback (Contributor) commented Sep 3, 2013

Thanks for the pr!

@adgaudio (Author, Contributor) commented Sep 3, 2013

Awesome, thanks!

@jreback (Contributor) commented Sep 3, 2013

gr8

also been thinking about a class to help out in splitting and reconstructing the shards
(e.g. instead of actually specifying a dictionary d, you would create a class to do this, which could then be serialized as well), to facilitate automatic table reconstruction

if you have suggestions or maybe a nice use case, that would be interesting
(as I don't actually use this much)

@adgaudio (Author, Contributor) commented Sep 10, 2013

Hey @jreback,

I bet I significantly misinterpreted what you were thinking, but here's my attempt to rephrase your idea... Also, if you like this, let me know how we should take it to the next step:

Problem (or my interpretation):
We'd like a way to write and collect large amounts of data into an HDFStore, and we need to write more data than fits in memory at any one time. The existing append_to_multiple method requires one massive DataFrame, which takes up too much memory to be particularly useful.

Possible Solution:
To resolve this problem, we want to buffer an incoming data stream in memory and then periodically flush it to the HDFStore according to some specific configuration. The incoming data stream could be smaller DataFrames, or an "infinite" list of dictionaries (or whatever else DataFrames accept in their constructor). The buffer should be flushed to the HDFStore whenever it reaches a certain size. We can implement this buffer as a class with "write", "flush", and "close" methods.

I've actually implemented this and would be willing to contribute it to pandas, if it's suitable for pandas. On the other hand, I'm not sure this is a direction we all should be going. What do you think?

The class I implemented (and regularly use) to write data to HDFStore looks basically like this:

class BufferedHDFStore(object):
    # constructor and flush threshold are illustrative (the original sketch flushed at a size in MB)
    def __init__(self, path, key, max_rows=100000):
        self._store = pandas.HDFStore(path)
        self._key, self._max_rows, self._buffer = key, max_rows, []

    def write(self, dict_of_data):
        # validate if needed, buffer the record, and flush once the buffer is large enough
        self._buffer.append(dict_of_data)
        if len(self._buffer) >= self._max_rows:
            self.flush()

    def flush(self):
        # convert the in-memory buffer to a DataFrame and append it to the HDFStore
        if self._buffer:
            self._store.append(self._key, pandas.DataFrame(self._buffer))
            self._buffer = []

    def close(self):
        # flush any remaining rows, then close the HDFStore
        self.flush()
        self._store.close()
@jreback (Contributor) commented Sep 10, 2013

@adgaudio hmm....I was talking about a different problem actually

basically the idea of 'sharding' a wide table into small sub-tables, but with a nicer interface and the saving of the meta data into the table node so that reconstruction becomes simpler. e.g. right now you have to have the dict d that you constructed the table with in append_to_multiple to reconstruct it properly

your idea is a different one, but how is it different from just periodically appending data?
(e.g. I do this when I have a store and then every day add a bunch of rows to it for that day; I am storing Panels)

@adgaudio (Author, Contributor) commented Sep 10, 2013

haha yea, I totally misinterpreted that! Not requiring the table names in append_to_multiple could be pretty nice. I would need some time to play around with this idea, though.

Regarding how a buffered implementation is different from just periodically appending data: I have a particular use-case in which the stream of dicts gets chunked into (buffered) lists of Series, and each list then gets converted to a DataFrame and finally appended to the HDFStore. If the stream of dicts were already in DataFrame form, then, as you suggested, I wouldn't need to build this BufferedHDFStore implementation. On the other hand, I'm not completely happy with this solution, because something about it seems over-engineered. But given low memory constraints and some other details inspiring the work, it was a difficult project and is now doing the job well.

@jreback (Contributor) commented Sep 10, 2013

Why are you saving the intermediate form as a dict of Series? Why not just add to the frame as you create/modify a list, then periodically append?

All for doing chunked/iterator based solutions. Yours 'sounds' like a write-iterator. Can you show a test-example?

An example off the top of my head where I know we probably need a better solution is iterating over a csv and then putting it into an hdf iteratively...

which itself is worthy of a pandas function....maybe csv_to_hdf (which is just an exercise in figuring out dtypes ahead of time and chunking the read and write) - because an int column in, say, the first chunk could later have missing data, so you really have to read it twice to be sure
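
To make the csv_to_hdf idea concrete, a rough sketch; the function name and parameters are hypothetical, and the two-pass dtype reconciliation mentioned above is deliberately not handled:

import pandas

def csv_to_hdf(csv_path, hdf_path, key, chunksize=50000):
    # stream the csv into an HDFStore chunk by chunk so the whole file never
    # has to fit in memory; an int column that gains missing values in a later
    # chunk would still need the two-pass dtype check described above
    store = pandas.HDFStore(hdf_path)
    try:
        for chunk in pandas.read_csv(csv_path, chunksize=chunksize):
            store.append(key, chunk)
    finally:
        store.close()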

@jreback referenced this pull request Sep 10, 2013