HDFStore.append_to_multiple doesn't write rows that are all np.nan #4698
Using HDFStore.append_to_multiple, if a row written to any one table consists entirely of np.nan, the row is not written to that table but is written to the other tables. The following code reproduces and fixes the issue.
I would prefer that append_to_multiple maintain synchronized rows across tables, and to my knowledge the best way to do that is to drop that row from the other tables. We would probably need a fix that looks something like this PR.
I'm not sure if this is the best way to do this, and would love some feedback!
To reproduce the error:
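For example, something along these lines shows the mismatch (a sketch; the column names and file path are just illustrative):

```python
import numpy as np
import pandas as pd

# One row is entirely NaN in the columns that get routed to 'table2'.
df = pd.DataFrame({'a': [1.0, 2.0, 3.0],
                   'b': [1.0, np.nan, 3.0],
                   'c': [1.0, np.nan, 3.0]})

store = pd.HDFStore('test.h5', mode='w')
store.append_to_multiple({'table1': ['a'], 'table2': None},  # b, c go to table2
                         df, selector='table1')

# table1 keeps all 3 rows, but the all-NaN row is silently dropped from
# table2, so the two tables are no longer row-aligned.
print(len(store.select('table1')))  # 3
print(len(store.select('table2')))  # 2
store.close()
```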
@jreback - thanks for your prompt feedback! I just added to the existing tests for that method and added a dropna kwarg + docstring to the code.
For dropna, I deviated from the normal True/False syntax. Specifically, dropna = 'any' | 'all' | False. If dropna evaluates to True, we call DF.dropna(how=dropna). You might prefer to revert to normal True/False behavior...
Also, if I entirely misinterpreted what you meant by dropna to begin with, please let me know!
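To make my interpretation concrete, the handling inside append_to_multiple would be roughly this (a sketch, not the actual diff; `value` is the frame being written):

```python
# dropna is 'any', 'all', or False; drop rows from the whole frame
# *before* it is split across the sub-tables, so every table stays
# row-aligned
if dropna:
    value = value.dropna(how=dropna)
```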
I removed the doc section that raises the error, and agree that a documentation addition showing this is probably unnecessary (which is why it wasn't there to begin with).
Unless I'm misunderstanding you, the tests already test both the dropna=True and dropna=False cases.
If it's simpler for you to just make changes to the test, please feel free to modify the PR however you wish. Otherwise, I'll need a little more explanation on what you're looking for.
also been thinking about a class to help out in splitting and reconstructing the shards
if you have suggestions, or maybe a nice use case, that would be interesting
I bet I significantly misinterpreted what you were thinking, but here's my attempt to rephrase your idea... Also, if you like this, let me know how we should take it to the next step:
Problem (or my interpretation):
I've actually implemented this and would be willing to contribute it to pandas, if it's suitable for pandas. On the other hand, I'm not sure this is a direction we all should be going. What do you think?
The class I implemented (and regularly use) to write data to HDFStore looks basically like this:
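(A stripped-down sketch to show the idea; the real class has more bookkeeping, and the path/key names here are illustrative.)

```python
import pandas as pd

class BufferedHDFStore(object):
    """Buffer incoming records and flush them to an HDFStore in chunks,
    so each append writes a reasonably sized DataFrame."""

    def __init__(self, path, key, buffer_size=10000):
        self.store = pd.HDFStore(path)
        self.key = key
        self.buffer_size = buffer_size
        self._buffer = []

    def write(self, record):
        # record is a dict representing one row
        self._buffer.append(pd.Series(record))
        if len(self._buffer) >= self.buffer_size:
            self.flush()

    def flush(self):
        # convert the buffered Series to a DataFrame and append it
        if self._buffer:
            self.store.append(self.key, pd.DataFrame(self._buffer))
            self._buffer = []

    def close(self):
        self.flush()
        self.store.close()
```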
@adgaudio hmm....I was talking about a different problem actually
basically the idea of 'sharding' a wide table into small sub-tables, but with a nicer interface and with the meta data saved into the table node so that reconstruction becomes simpler. e.g. right now you have to keep around the dict d that you constructed the table with in order to reconstruct it
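i.e. today you end up with roughly this, where the mapping lives only in user code (a sketch; the table names and path are illustrative):

```python
import numpy as np
import pandas as pd

df = pd.DataFrame(np.random.randn(10, 4), columns=list('abcd'))
store = pd.HDFStore('shards.h5', mode='w')

# the user has to construct and hang on to this mapping themselves...
d = {'df1': ['a', 'b'], 'df2': None}     # remaining columns go to df2
store.append_to_multiple(d, df, selector='df1')

# ...and later has to know the table names again to reassemble the frame
result = store.select_as_multiple(['df1', 'df2'], selector='df1')
store.close()
```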
your idea is a different one, but how is it different from just periodically appending data?
haha yea, I totally misinterpreted that! Not requiring the table names in append_to_multiple could be pretty nice. I would need some time to play around with this idea, though.
Regarding how a buffered implementation is different from just periodically appending data: I have a particular use-case in which the stream of dicts gets chunked into (buffered) lists of Series, and each list then gets converted to a DataFrame and finally appended to the HDFStore. If the stream of dicts were already in DataFrame form, then, as you suggested, I wouldn't need to build this BufferedHDFStore implementation. On the other hand, I'm not completely happy with this solution, because something about it seems over-engineered. But given the low memory constraints and some other details that inspired the work, it was a difficult project and is now doing the job well.
Why are you saving the intermediate form as a dict of Series? Why not just add to the frame as you create/modify a list, then periodically append?
All for doing chunked/iterator based solutions. Yours 'sounds' like a write-iterator. Can you show a test-example?
An example off the top of my head where I know we probably need a better solution is iterating over a csv and then putting it into an hdf iteratively...
which itself is worthy of a pandas function....maybe
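roughly this pattern, for instance (a sketch; the paths and chunksize are illustrative):

```python
import pandas as pd

store = pd.HDFStore('out.h5', mode='w')
# read the csv in manageable chunks and append each one to a single table
for chunk in pd.read_csv('big.csv', chunksize=50000):
    store.append('data', chunk, data_columns=True)
store.close()
```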