Don't make dropping missing rows a default behavior for HDF append()? #9382

Closed
nickeubank opened this issue Jan 31, 2015 · 8 comments · Fixed by #10097
Labels
API Design IO HDF5 read_hdf, HDFStore

Comments

@nickeubank
Contributor

Hi All,

At the moment, the default behavior of the HDF append() function (docs: http://pandas.pydata.org/pandas-docs/stable/generated/pandas.HDFStore.append.html?highlight=append#pandas.HDFStore.append) is to silently drop every row whose columns are all NaN, regardless of what the index holds.

As I understand it from a PyData exchange with Jeff, the reason is that people working with panels often have sparse datasets, so this is a very reasonable default.

However, while I appreciate the appeal for time-series analysis, I think this is a dangerous default. It assumes that if a row has an index value but no column values, the row carries no meaningful data. That holds in a time-series context, where dropped index values are easy to reconstruct. But if the index contains information like user IDs, sensor codes, or place names, the index itself is meaningful and not easy to reconstruct. Thus the default behavior can delete user data without a warning.

Given the trade-off between a default that may lead to inefficient storage (dropna=False) and one that potentially erases user data (dropna=True), I think we should err on the side of data preservation.
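
To make the concern concrete, here is a minimal sketch of the two outcomes. It does not touch an HDF file: it assumes that `dropna=True` has roughly the same effect as calling `DataFrame.dropna(how="all")` before writing, and uses that as a stand-in. The frame, the column names, and the `user_id` labels are made up for illustration.

```python
import numpy as np
import pandas as pd

# Hypothetical data: the index carries meaning (user IDs), and one user
# has no measurements recorded yet.
df = pd.DataFrame(
    {"score": [1.0, np.nan, 3.0], "visits": [10.0, np.nan, 30.0]},
    index=pd.Index(["user_a", "user_b", "user_c"], name="user_id"),
)

# Stand-in for dropna=True (an assumption, not the library's code path):
# rows whose columns are all NaN are discarded, index label included.
kept_with_dropna = df.dropna(how="all")

# Stand-in for dropna=False: every row survives, including user_b.
kept_without_dropna = df

print(list(kept_with_dropna.index))     # ['user_a', 'user_c']
print(list(kept_without_dropna.index))  # ['user_a', 'user_b', 'user_c']

# With PyTables installed, the explicit call would look like this
# (file name and key are hypothetical):
#   store = pd.HDFStore("example.h5")
#   store.append("users", df, dropna=False)
```

In this sketch `user_b` disappears under the first setting even though the label itself is information, which is the case the proposal is meant to protect.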

@jreback jreback added API Design IO HDF5 read_hdf, HDFStore labels Feb 2, 2015
@nickeubank
Contributor Author

I'm a little new to the open-source world -- should I be doing something more than waiting for input at this point, and if none comes, should I do nothing, or make changes? Thanks!

@jreback
Contributor

jreback commented Feb 5, 2015

well you can go ahead and make a pull request if you would like

@nickeubank
Contributor Author

OK -- Do you have a position, Jeff? I know you did the hard work of creating this, so I don't want to adjust it without your input!

@jreback
Contributor

jreback commented Feb 5, 2015

I think changing the default is ok

you will have to adjust some tests
pls provide a release note that shows the prior and new behavior as this is an api change

@nickeubank
Contributor Author

OK, great. This will be my first edit on a big project, so it will likely take a few days to figure out how to do it right, but I'm on it!

@nickeubank
Contributor Author

Submitted as Pull Request #9484

Where do I add the notes for the API change?

@jreback
Contributor

jreback commented Feb 13, 2015

you would need to add a mini section in the whatsnew for 0.16.0 under api changes

@nickeubank
Contributor Author

Great, done! Thanks for the hand-holding!
