savez method for DataFrame, Series: porting data between python2 and python3 #3151

Closed
cottrell opened this issue Mar 23, 2013 · 12 comments
Labels
Ideas Long-Term Enhancement Discussions IO Data IO issues that don't fit into a more specific label

Comments

@cottrell
Contributor

I sometimes work across Python 2 and Python 3 installs, and using pickle to pass data between them is problematic. I am having a look at adding some simple functions like:

pandas.loadnpz
pandas.obj.savenpz (where obj would be DataFrame, Series, Panel etc ...)

Any opinions on this?
Is it already there and I haven't found it?
Is there a natural (efficient) place to do this? It seems save/load are fairly generic and are attached to all PandasObjects. Maybe the level of NDFrame would make the most sense?

Also, supposing there is interest in this (or at least no objection), is there any way to add a test for it under the current framework, since the full functionality would span both python2.* and python3.* pandas?

@ghost

ghost commented Mar 23, 2013

pandas uses 2to3, and tests are cross-python by using

if PY3:
    foo
else:
    bar

That shouldn't be a problem.
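
As a rough illustration of the branching pattern above, here is a minimal sketch. The `PY3` flag is derived from `sys.version_info`; the names `text_type`, `binary_type`, and `ensure_text` are hypothetical and are not taken from pandas.

```python
import sys

# Module-level flag, evaluated once at import time.
PY3 = sys.version_info[0] >= 3

if PY3:
    text_type = str
    binary_type = bytes
else:
    # Python 2 only: `unicode` does not exist on Python 3.
    text_type = unicode  # noqa: F821
    binary_type = str


def ensure_text(value, encoding="utf-8"):
    """Return `value` as text on either interpreter (hypothetical helper)."""
    if isinstance(value, binary_type):
        # Byte strings are decoded explicitly.
        return value.decode(encoding)
    return text_type(value)
```

Code written against aliases like these can run on both interpreters, so a single test can exercise the same path on each.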

I don't think pickle across pythons has been raised as an issue before, so thanks for that. ( edit: #686 )
It would probably be a hard sell to merge a new binary serialization format into pandas core;
perhaps HDFStore can serve as a de facto storage format? @jreback has moved it
light years ahead in the last couple of releases.

@jreback
Contributor

jreback commented Mar 23, 2013

have you considered

http://pandas.pydata.org/pandas-docs/dev/io.html#hdf5-pytables

it offers all of the savez-type functionality, is faster, has compression options, and offers tables (optional) as another storage option

only downside is a couple of additional dependencies
py3k should be coming soon for PyTables btw

@jreback
Contributor

jreback commented Mar 23, 2013

See this PyTables issue, which provides a savez/PyTables comparison:
PyTables/PyTables#185

Here is HDFStore export capability to R table format:
http://pandas.pydata.org/pandas-docs/dev/io.html#external-compatibility

Here (see 6) is something that could be useful: #2391

I could see adding an export method to HDFStore that takes a format; one of these could be npz (and R table format too). (Think of HDFStore as managing the binary file save formats for pandas.)

@cottrell
Contributor Author

Is pytables necessary for running pandas? I thought it was optional.

@cottrell
Contributor Author

Also, npz and npy are not new serializations. They are part of numpy, which pandas is built upon. Serializing the object and serializing the data are two fundamentally different things.

@jreback
Contributor

jreback commented Mar 23, 2013

pytables is optional, but highly recommended, esp when dealing with data of any non-trivial size

you can simply do this I believe
np.savez(file, series.index, series.values)

what I think you are talking about is supporting this method officially. I have no problem with it, but it's essentially deprecated, as it's a numpy-only format.

after all, just because something is in numpy does not mean pandas should support it; just my 2c
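
To make the `np.savez` suggestion above concrete, here is a minimal round-trip sketch for a numeric Series. The helper names `save_series_npz` and `load_series_npz` and the archive keys `index` and `values` are assumptions made for illustration; neither helper is a pandas API. Passing the arrays positionally, as in the one-liner above, would store them as `arr_0` and `arr_1` instead.

```python
import io

import numpy as np
import pandas as pd


def save_series_npz(series, file):
    """Write the index and values of `series` as two named arrays (hypothetical helper)."""
    # Keyword arguments become the array names inside the .npz archive.
    np.savez(file, index=series.index.values, values=series.values)


def load_series_npz(file):
    """Rebuild a Series from an archive written by save_series_npz (hypothetical helper)."""
    with np.load(file) as archive:
        return pd.Series(archive["values"], index=archive["index"])


s = pd.Series([1.0, 2.0, 3.0], index=[10, 20, 30])

# An in-memory buffer stands in for a file on disk.
buf = io.BytesIO()
save_series_npz(s, buf)
buf.seek(0)
restored = load_series_npz(buf)
```

This only carries the raw arrays: the Series name and any index metadata are lost, and object-dtype data (for example a string index) would be pickled by numpy, which brings back the cross-version problem the issue is about.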

@wesm
Member

wesm commented Mar 23, 2013

Reminder that we need to implement a binary data format using msgpack or some such that is not dependent on pickle (and preferably not dependent on too many internal details of pandas objects).

@cottrell
Contributor Author

Had a go at getting PyTables working with python3 ... still a lot of work, I think. It looks like PyTables depends on numexpr, which is not yet py3k'd. I've had moderate success hacking away at these kinds of conversions, but I don't really know what I'm doing, which makes me a less than ideal contributor.

HDFStore looks like a great option. But it would be much better if it were a required dependency of pandas. On the other hand, this would make pandas harder to install.

@jreback
Contributor

jreback commented Mar 24, 2013

looks like both Numexpr and pytables are going to be py3 very soon in any event (the branches are merged)

the dependency doesn't matter; the user can install it if they want. in fact, for 0.11 we made Numexpr a highly recommended dependency so that it can be used internally (but all that means is doc warnings!)
if the user wants extra performance, they would install it

another really good option is read_csv/to_csv; they are quite fast
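
For reference, a minimal sketch of the to_csv/read_csv round trip mentioned above. The frame contents are made up for illustration, and an in-memory buffer stands in for a file that would be written by one Python version and read by the other.

```python
import io

import pandas as pd

df = pd.DataFrame(
    {"a": [1, 2, 3], "b": [0.5, 1.5, 2.5]},
    index=["x", "y", "z"],
)

# Write the frame as text; the index becomes the first column.
buf = io.StringIO()
df.to_csv(buf)

# Read it back, telling pandas that the first column holds the index.
buf.seek(0)
restored = pd.read_csv(buf, index_col=0)
```

Because CSV is plain text, dtypes are re-inferred on read, so columns such as dates or mixed types may need explicit parsing options to come back unchanged.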

@jreback
Contributor

jreback commented Mar 24, 2013

fwiw both Numexpr and pytables are maintained by the same team, so they should be released together

@cottrell
Contributor Author

Sounds great! I'll try to find the dev branches and try it out ...

@jreback
Contributor

jreback commented Sep 20, 2013

PyTables 3.0.0 and numexpr 2.1 solve this problem to a large extent

@jreback closed this as completed Sep 20, 2013
3 participants