IPython.parallel issue with pushing pandas TimeSeries #2793

richbwood · 2013-01-15T11:15:34Z

A pandas.TimeSeries is cast to a numpy.ndarray when pushed using IPython.parallel.

Here is an example:

In [1]: from IPython import parallel

In [2]: dview = parallel.Client()[:]

In [3]: with dview.sync_imports(): import pandas
importing pandas on engine(s)

In [4]: a = pandas.TimeSeries([1,2,3],index=[1,2,3])

In [5]: a.__class__
pandas.core.series.Series

In [6]: dview.push({'b':a})
<AsyncResult: _push>

In [7]: ret = dview.apply_async(lambda x: b.__class__, range(0,1))

In [8]: ret.result
[numpy.ndarray,
 numpy.ndarray]

This seems to be the result of dview.push handling numpy arrays differently (see http://ipython.org/ipython-doc/dev/parallel/parallel_details.html) combined with pandas.TimeSeries being built on numpy arrays, as its pickle shows:

In [9]: import cPickle; cPickle.dumps(a)
"cnumpy.core.multiarray\n_reconstruct\np1\n(cpandas.core.series\nSeries\np2\n(I0\ntS'b'\ntRp3\n((I1\n(I3\ntcnumpy\ndtype\np4\n(S'i8'\nI0\nI1\ntRp5\n(I3\nS'<'\nNNNI-1\nI-1\nI0\ntbI00\nS'\\x01\\x00\\x00\\x00\\x00\\x00\\x00\\x00\\x02\\x00\\x00\\x00\\x00\\x00\\x00\\x00\\x03\\x00\\x00\\x00\\x00\\x00\\x00\\x00'\nt(g1\n(cpandas.core.index\nInt64Index\np6\n(I0\ntS'b'\ntRp7\n((I1\n(I3\ntg5\nI00\nS'\\x01\\x00\\x00\\x00\\x00\\x00\\x00\\x00\\x02\\x00\\x00\\x00\\x00\\x00\\x00\\x00\\x03\\x00\\x00\\x00\\x00\\x00\\x00\\x00'\nt(NttbNttb."

Is there a way to cast back to pandas.TimeSeries on the ipcluster engines?

Many thanks.


minrk · 2013-01-15T17:33:29Z

Pinging @wesm on this one. There are really two issues here:

1. Updating serialization to be aware of TimeSeries (and, I presume, data frames as a whole).
2. An interim workaround: how to reconstruct a TimeSeries from numpy arrays, if possible.

So the direct question to @wesm: Can you detail serialization of pandas objects? What, in addition to the arrays themselves, is necessary to reconstruct a pandas object? Which pandas objects actually subclass ndarray? (It is an isinstance(obj, ndarray) check that causes IPython to use its efficient numpy path here, rather than simply pickling.)

richbwood · 2013-01-17T12:45:37Z

I don't think this is a problem with pandas. It is more general than that: any class that subclasses numpy.ndarray will break under the current implementation. Therefore, I think it makes sense for the IPython.parallel code to use the efficient numpy path only if the object is

type(self.obj) == numpy.ndarray

rather than

isinstance(self.obj, numpy.ndarray)
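A minimal sketch of why the two checks behave differently, using numpy.ma.MaskedArray as a stand-in for an ndarray subclass such as pandas.TimeSeries. The can and uncan functions are hypothetical illustrations, not IPython's actual canning code. The fast path here keeps only dtype, shape, and raw bytes, so anything a subclass adds on top of the array is lost.

```python
import pickle

import numpy as np


def can(obj, exact=True):
    """Hypothetical serializer: raw-bytes fast path for arrays, pickle otherwise."""
    if exact:
        # Only objects whose class is exactly ndarray take the fast path.
        use_fast_path = type(obj) is np.ndarray
    else:
        # Subclasses of ndarray take the fast path too.
        use_fast_path = isinstance(obj, np.ndarray)
    if use_fast_path:
        arr = np.ascontiguousarray(np.asarray(obj))  # drops any subclass
        return ("ndarray", (arr.dtype.str, arr.shape), arr.tobytes())
    return ("pickle", None, pickle.dumps(obj))


def uncan(kind, meta, payload):
    """Rebuild the object from the tuple produced by can()."""
    if kind == "ndarray":
        dtype, shape = meta
        return np.frombuffer(payload, dtype=dtype).reshape(shape)
    return pickle.loads(payload)


masked = np.ma.masked_array([1, 2, 3], mask=[False, True, False])

loose = uncan(*can(masked, exact=False))   # isinstance check
strict = uncan(*can(masked, exact=True))   # exact type check

print(type(loose) is np.ndarray)              # True: the mask is gone
print(isinstance(strict, np.ma.MaskedArray))  # True: subclass survives
```

With the isinstance check the subclass comes back as a bare ndarray, which matches the behaviour reported for TimeSeries; with the exact type check it falls through to pickle and round-trips intact.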

Making DirectView.push efficient for pandas.TimeSeries objects is a nice-to-have, but I would have thought that it is ultimately less important than having DirectView.push work as expected.

As an interim workaround, the pandas.TimeSeries object can be wrapped in a pandas.DataFrame, which stops the efficient numpy path from being used.

Many thanks

minrk · 2013-01-17T18:20:48Z

An excellent point. I will change the check from isinstance(obj, cls) to type(obj) is cls.

I would still love to hear from @wesm about efficient serialization of pandas objects.

minrk · 2013-01-17T18:22:18Z

typecheck change is in #2800

wesm · 2013-01-17T19:43:34Z

Sorry it's taken me a while to have a look.

This is probably another case of "Series probably shouldn't be an ndarray (subclass)". I would recommend using pickle whenever the class is not exactly ndarray.

minrk · 2013-01-17T19:52:35Z

@wesm - yes, that's the right answer in general, and what is done in #2800. But I would still like to give pandas the special treatment we do for numpy. So any time you can write up (or code up, if necessary) a representation of pandas data structures that is buffers + metadata, so that we can add it to our zero-copy stuff, that would be great.
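One possible shape for such a buffers + metadata representation, sketched with plain numpy arrays standing in for the values and index of a Series. The function names to_frames and from_frames and the metadata layout are assumptions made for illustration; they are not part of pandas or IPython.

```python
import numpy as np


def to_frames(values, index, name=None):
    """Split a labeled array into small metadata plus raw buffers (hypothetical)."""
    values = np.ascontiguousarray(values)
    index = np.ascontiguousarray(index)
    metadata = {
        "name": name,
        "values": {"dtype": values.dtype.str, "shape": values.shape},
        "index": {"dtype": index.dtype.str, "shape": index.shape},
    }
    # memoryview exposes the array memory without copying it.
    # Note: datetime64 arrays do not support the buffer protocol and
    # would need to be viewed as int64 first.
    return metadata, [memoryview(values), memoryview(index)]


def from_frames(metadata, buffers):
    """Rebuild (name, values, index) from metadata and buffers."""
    def rebuild(spec, buf):
        return np.frombuffer(buf, dtype=spec["dtype"]).reshape(spec["shape"])

    values = rebuild(metadata["values"], buffers[0])
    index = rebuild(metadata["index"], buffers[1])
    return metadata["name"], values, index


values = np.array([1.0, 2.0, 3.0])
index = np.array([1, 2, 3])

metadata, buffers = to_frames(values, index, name="b")
name, new_values, new_index = from_frames(metadata, buffers)

print(name, new_values, new_index)
print(np.shares_memory(new_values, values))  # True: no copy was made
```

The metadata is small enough to pickle cheaply, while the buffers could be handed to a transport as-is; the receiving side would then construct the pandas object from the rebuilt arrays.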

use `type(obj) is cls` as switch when canning

`isinstance(obj, cls)` would trigger the canning shortcuts for subclasses, which can be inappropriate (e.g. pandas.TimeSeries). closes #2793


minrk mentioned this issue Jan 17, 2013

use type(obj) is cls as switch when canning #2800

Merged

minrk closed this as completed in a6d0b5e Jan 18, 2013

minrk added a commit to minrk/ipython that referenced this issue Jan 26, 2013

use type(obj) is cls as switch when canning

036105e

`isinstance(obj, cls)` would trigger the canning shortcuts for subclasses, which can be inappropriate (e.g. pandas.TimeSeries). closes ipython#2793
