Don't make dropping missing rows a default behavior for HDF append()? #9382

Closed
nickeubank opened this Issue Jan 31, 2015 · 8 comments

Contributor

nickeubank commented Jan 31, 2015

Hi All,

At the moment, the default behavior for the HDF append() function ( docs: http://pandas.pydata.org/pandas-docs/stable/generated/pandas.HDFStore.append.html?highlight=append#pandas.HDFStore.append ) is to silently drop all rows that are all NaN except for the index.

As I understand it from a PyData exchange with Jeff, the reason is that people working with panels often have sparse datasets, so this is a very reasonable default.

However, while I appreciate the appeal for time-series analysis, I think this is a dangerous default. It assumes that a row whose index has a value but whose columns are all missing contains no meaningful data. That holds in a time-series context, where the dropped index values are easy to reconstruct. But if the index contains information like user IDs, sensor codes, or place names, the index itself is meaningful and not easy to reconstruct. The default behavior can therefore delete user data without a warning.

Given the trade-off between a default that may lead to inefficient storage (dropna = False) and one that potentially erases user data (dropna = True), I think we should err on the side of data preservation.
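
To make the concern concrete, here is a minimal sketch of which rows are at risk. It does not touch HDF5 at all: it only uses an in-memory DataFrame to flag the rows whose columns are all NaN, which are the rows described above as being dropped on append. The sensor names and column names are made up for illustration.

```python
import numpy as np
import pandas as pd

# Hypothetical data: the index (sensor codes) is meaningful on its own,
# while the columns are sparse.
df = pd.DataFrame(
    {"temp": [20.5, np.nan, 21.0], "humidity": [0.4, np.nan, np.nan]},
    index=pd.Index(["sensor_a", "sensor_b", "sensor_c"], name="sensor"),
)

# Rows where every column is NaN: these are the ones a dropna=True
# append would leave out, taking their index labels with them.
all_nan = df.isna().all(axis=1)
lost_labels = df.index[all_nan].tolist()

# In-memory equivalent of what would remain after such a drop.
kept = df.dropna(how="all")

print("labels that would be lost:", lost_labels)
print(kept)
```

Here `sensor_b` disappears entirely, even though knowing that the sensor exists may matter. Passing `dropna=False` explicitly to `append()` should keep such rows regardless of what the default is.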

Contributor

nickeubank commented Feb 5, 2015

I'm a little new to the open-source world -- should I be doing something more than waiting for input at this point, and if none comes, should I do nothing, or make changes? Thanks!

Contributor

jreback commented Feb 5, 2015

well you can go ahead and make a pull request if you would like

Contributor

nickeubank commented Feb 5, 2015

OK -- Do you have a position, Jeff? I know you did the hard work of creating this, so I don't want to adjust without your input!

Contributor

jreback commented Feb 5, 2015

I think changing the default is ok

you will have to adjust some tests
pls provide a release note that shows the prior and new behavior as this is an api change

Contributor

nickeubank commented Feb 12, 2015

OK, great. This will be my first edit on a big project, so it will likely take a few days to figure out how to do it right, but I'm on it!

Contributor

nickeubank commented Feb 13, 2015

Submitted as Pull Request #9484

Where do I add notes for API change?

Contributor

jreback commented Feb 13, 2015

you would need to add a mini section in the whatsnew for 0.16.0 under api changes

Contributor

nickeubank commented Feb 13, 2015

Great, done! Thanks for the hand-holding!

jreback added this to the 0.17.0 milestone May 10, 2015

jreback closed this in #10097 Jul 31, 2015
