IPython.parallel issue with pushing pandas TimeSeries #2793

Closed
richbwood opened this issue Jan 15, 2013 · 6 comments · Fixed by #2800

@richbwood

A pandas.TimeSeries is cast to a numpy.ndarray when pushed using IPython.parallel.

Here is an example:

In [1]: from IPython import parallel

In [2]: dview = parallel.Client()[:]

In [3]: with dview.sync_imports(): import pandas
importing pandas on engine(s)

In [4]: a = pandas.TimeSeries([1,2,3],index=[1,2,3])

In [5]: a.__class__
pandas.core.series.Series

In [6]: dview.push({'b':a})
<AsyncResult: _push>

In [7]: ret = dview.apply_async(lambda x: b.__class__, range(0,1))

In [8]: ret.result
[numpy.ndarray,
 numpy.ndarray]

This seems to result from dview.push handling numpy arrays differently (see http://ipython.org/ipython-doc/dev/parallel/parallel_details.html),
combined with pandas.TimeSeries being built on numpy arrays, as its pickle shows:

In [9]: cPickle.dumps(a)
"cnumpy.core.multiarray\n_reconstruct\np1\n(cpandas.core.series\nSeries\np2\n(I0\ntS'b'\ntRp3\n((I1\n(I3\ntcnumpy\ndtype\np4\n(S'i8'\nI0\nI1\ntRp5\n(I3\nS'<'\nNNNI-1\nI-1\nI0\ntbI00\nS'\\x01\\x00\\x00\\x00\\x00\\x00\\x00\\x00\\x02\\x00\\x00\\x00\\x00\\x00\\x00\\x00\\x03\\x00\\x00\\x00\\x00\\x00\\x00\\x00'\nt(g1\n(cpandas.core.index\nInt64Index\np6\n(I0\ntS'b'\ntRp7\n((I1\n(I3\ntg5\nI00\nS'\\x01\\x00\\x00\\x00\\x00\\x00\\x00\\x00\\x02\\x00\\x00\\x00\\x00\\x00\\x00\\x00\\x03\\x00\\x00\\x00\\x00\\x00\\x00\\x00'\nt(NttbNttb."

Is there a way to cast back to pandas.TimeSeries on the ipcluster engines?

Many thanks.

@minrk
Member

minrk commented Jan 15, 2013

Pinging @wesm on this one. There are really two issues here:

  1. updating serialization to be aware of TimeSeries (and, I presume, data frames as a whole).
  2. the interim workaround for how to reconstruct a TimeSeries from numpy arrays, if possible.

So the direct question to @wesm: Can you detail serialization of pandas objects? What, in addition to the arrays themselves, is necessary to reconstruct a pandas object? Which pandas objects actually subclass ndarray? (It is an isinstance(obj, ndarray) check that is causing IPython to use its efficient numpy path here, rather than simply pickling.)
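To illustrate why a buffer-based fast path can drop the subclass, here is a minimal sketch. It uses the standard library's `array.array` as a stand-in for `numpy.ndarray`, and `LabeledArray`, `takes_fast_path`, `to_buffer` and `from_buffer` are hypothetical names invented for this example. It is not IPython's actual canning code.

```python
from array import array


class LabeledArray(array):
    """Hypothetical stand-in for an ndarray subclass such as pandas.TimeSeries."""

    def __new__(cls, typecode, values, label=None):
        obj = super().__new__(cls, typecode, values)
        obj.label = label  # extra state that lives outside the raw buffer
        return obj


def takes_fast_path(obj):
    # isinstance() also matches subclasses, which is the behaviour under discussion
    return isinstance(obj, array)


def to_buffer(obj):
    # Fast-path sketch: keep only the typecode and the raw bytes
    return obj.typecode, obj.tobytes()


def from_buffer(typecode, raw):
    # The receiving side only knows how to rebuild a plain array
    return array(typecode, raw)


series = LabeledArray("q", [1, 2, 3], label="b")
if takes_fast_path(series):
    received = from_buffer(*to_buffer(series))

print(type(series).__name__)       # LabeledArray
print(type(received).__name__)     # array
print(hasattr(received, "label"))  # False
```

The values survive the round trip, but the subclass and its extra state do not, which matches the symptom reported above.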

@richbwood
Author

I don't think this is a problem with pandas. It is more general than that: any class that subclasses numpy.ndarray will break under the current implementation. Therefore, I think it makes sense for the IPython.parallel code to use the efficient numpy path only if the object satisfies

type(self.obj) == numpy.ndarray

rather than

isinstance(self.obj, numpy.ndarray)

Making DirectView.push efficient for pandas.TimeSeries objects is a nice-to-have, but I would have thought that it is ultimately less important than having DirectView.push work as expected.
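The difference between the two checks can be sketched as follows. This again uses `array.array` from the standard library as a stand-in for `numpy.ndarray`; `Tagged`, `choose_path_isinstance` and `choose_path_exact` are hypothetical names for illustration only, and the "buffer"/"pickle" strings simply label which serialization route would be taken.

```python
from array import array


class Tagged(array):
    """Hypothetical subclass standing in for pandas.TimeSeries."""


def choose_path_isinstance(obj):
    # Matches the base class and every subclass
    return "buffer" if isinstance(obj, array) else "pickle"


def choose_path_exact(obj):
    # Matches only the base class itself; subclasses fall back to pickle
    return "buffer" if type(obj) is array else "pickle"


plain = array("d", [1.0, 2.0])
tagged = Tagged("d", [1.0, 2.0])

for name, obj in (("plain", plain), ("tagged", tagged)):
    print(name, choose_path_isinstance(obj), choose_path_exact(obj))
# plain buffer buffer
# tagged buffer pickle
```

With the exact-type check, only objects known to be fully described by their buffer take the fast route, at the cost of losing the optimization for subclasses.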

As an interim workaround, the pandas.TimeSeries object can be wrapped in a pandas.DataFrame to stop the efficient numpy path from being used.

Many thanks

@minrk
Member

minrk commented Jan 17, 2013

An excellent point. I will change the check from isinstance(obj, cls) to type(obj) is cls.

I would still love to hear from @wesm about efficient serialization of pandas objects.

@minrk
Member

minrk commented Jan 17, 2013

typecheck change is in #2800

@wesm

wesm commented Jan 17, 2013

Sorry it's taken me a while to have a look.

This is probably another case of "Series probably shouldn't be an ndarray (subclass)". I would recommend using pickle whenever the class is not exactly ndarray.

@minrk
Member

minrk commented Jan 17, 2013

@wesm - yes, that's the right answer in general, and what is done in #2800. But I would still like to give pandas the special treatment we do for numpy. So any time you can write up (or code up, if necessary) a representation of pandas data structures that is buffers + metadata, so that we can add it to our zero-copy stuff, that would be great.
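As a rough idea of what a "buffers + metadata" representation could look like, here is a small sketch. `MiniSeries`, `split` and `join` are hypothetical names, the metadata layout is an assumption made up for this example, and `array.array` stands in for the underlying numpy arrays. It is not the format used by pandas or by IPython.

```python
from array import array


class MiniSeries:
    """Hypothetical stand-in for a labelled series: values plus an index."""

    def __init__(self, values, index, name=None):
        self.values = array("d", values)
        self.index = array("q", index)
        self.name = name


def split(series):
    """Return (metadata, buffers); the buffers are byte views, not copies."""
    metadata = {
        "class": "MiniSeries",
        "name": series.name,
        "typecodes": [series.values.typecode, series.index.typecode],
    }
    buffers = [
        memoryview(series.values).cast("B"),
        memoryview(series.index).cast("B"),
    ]
    return metadata, buffers


def join(metadata, buffers):
    """Rebuild the object from small metadata plus raw byte buffers."""
    values = array(metadata["typecodes"][0])
    values.frombytes(buffers[0])
    index = array(metadata["typecodes"][1])
    index.frombytes(buffers[1])
    return MiniSeries(values, index, name=metadata["name"])


original = MiniSeries([1.0, 2.0, 3.0], [1, 2, 3], name="b")
metadata, buffers = split(original)
wire = [bytes(b) for b in buffers]  # simulate sending the raw buffers
restored = join(metadata, wire)

print(type(restored).__name__, restored.name, list(restored.values))
```

The point of the split is that only the small metadata dictionary needs general-purpose serialization, while the large buffers can be handed to the transport without being copied into a pickle.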

minrk closed this as completed in a6d0b5e Jan 18, 2013
minrk added a commit that referenced this issue Jan 18, 2013
use `type(obj) is cls` as switch when canning

`isinstance(obj, cls)` would trigger the canning shortcuts for subclasses,
which can be inappropriate (e.g. pandas.TimeSeries).

closes #2793