Don't make dropping missing rows a default behavior for HDF append()? #9382
jreback added the API Design and HDF5 labels on Feb 2, 2015
I'm a little new to the open-source world -- should I be doing something more than waiting for input at this point? And if none comes, should I do nothing, or make the changes? Thanks!

Well, you can go ahead and make a pull request if you would like.

OK -- do you have a position, John? I know you did the hard work of creating this, so I don't want to adjust without your input!

I think changing the default is OK; you will have to adjust some tests.

OK, great. This will be my first edit on a big project, so it will likely take a few days to figure out how to do it right, but I'm on it!

Submitted as Pull Request #9484. Where do I add notes for the API change?
nickeubank referenced this issue on Feb 13, 2015: Default values for dropna to "False" (issue 9382) #9484 (closed)
You would need to add a mini-section in the whatsnew for 0.16.0 under API changes.

Great, done! Thanks for the hand-holding!
nickeubank commented on Jan 31, 2015
Hi All,
At the moment, the default behavior for the HDF append() function ( docs: http://pandas.pydata.org/pandas-docs/stable/generated/pandas.HDFStore.append.html?highlight=append#pandas.HDFStore.append ) is to silently drop all rows that are all NaN except for the index.
As I understand it from a PyData exchange with Jeff, the reason is that people working with panels often have sparse datasets, so this is a very reasonable default.
However, while I appreciate the appeal for time-series analysis, I think this is a dangerous default. It rests on the assumption that if an index has a value but the columns do not, the row contains no meaningful data. That may hold in a time-series context -- where it's easy to reconstruct the index values that are dropped -- but if indexes contain information like user IDs, sensor codes, or place names, the index itself is meaningful and not easy to reconstruct. The default behavior can therefore silently delete user data.
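To make the failure mode concrete, here is a minimal sketch of the row filter in question. It uses `DataFrame.dropna(how="all")` to stand in for the all-NaN filter that `append(..., dropna=True)` applies (the HDF5 round-trip itself is omitted so the example needs only pandas); the DataFrame and its index values are hypothetical:

```python
import numpy as np
import pandas as pd

# Index values carry meaning here (user IDs), not just positions.
df = pd.DataFrame(
    {"score": [1.0, np.nan, 3.0], "visits": [2.0, np.nan, np.nan]},
    index=["alice", "bob", "carol"],
)

# The same filter append(..., dropna=True) applies: rows whose
# columns are all NaN are dropped, and the index entry goes with them.
kept = df.dropna(how="all")
print(list(kept.index))  # ['alice', 'carol'] -- "bob" silently disappears
```

Note that "carol" survives because only *some* of her columns are NaN; "bob" is lost entirely, and nothing in the written file records that his row ever existed.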
Given the trade-off between a default that may lead to inefficient storage (dropna=False) and one that potentially erases user data (dropna=True), I think we should err on the side of data preservation.