
Serializing Pandas Functions #12021

Closed
mrocklin opened this issue Jan 12, 2016 · 36 comments
Labels
Compat pandas objects compatability with Numpy or Python functions
@mrocklin
Contributor

In recent efforts using Pandas on multiple machines I've found that some of the functions are tricky to serialize. Apparently this might be due to runtime generation. Here are a few examples of serialization breaking, occasionally in unpleasant ways:

In [1]: import pandas as pd
In [2]: import pickle
In [3]: pd.read_csv
Out[3]: <function pandas.io.parsers._make_parser_function.<locals>.parser_f>
In [4]: pickle.loads(pickle.dumps(pd.read_csv))
AttributeError: Can't pickle local object '_make_parser_function.<locals>.parser_f'

Lest you think that this is just a problem with pickle (which has many flaws): dill, a much more robust function-serialization library, also fails (the failure here is py35-only). (cc @mmckerns)

In [5]: import dill
In [6]: dill.loads(dill.dumps(pd.read_csv))
PicklingError: Can't pickle <function _make_parser_function.<locals>.parser_f at 0x7f71f5ec1158>: it's not found as pandas.io.parsers._make_parser_function.<locals>.parser_f

In this particular case though cloudpickle will work.

Other functions have this problem as well. Consider the series methods:

In [7]: pickle.loads(pickle.dumps(pd.Series.sum))
AttributeError: Can't pickle local object '_make_stat_function.<locals>.stat_func'

In this case, concerningly, cloudpickle completes but returns a wrong result:

In [9]: import cloudpickle
In [11]: pd.Series.sum
Out[11]: <function pandas.core.generic._make_stat_function.<locals>.stat_func>

In [12]: cloudpickle.loads(cloudpickle.dumps(pd.Series.sum))
Out[12]: <function stat_func>

I've been able to fix some of these in cloudpipe/cloudpickle#46, but generally speaking I'm running into a number of problems here. It would be useful if, during the generation of these functions, we could at least pay attention to assigning metadata like __name__ correctly. This one in particular confused me for a while:

In [15]: pd.Series.cumsum.__name__
Out[15]: 'sum'

What would help?

  • Testing that most of the API is serializable
  • Looking at what metadata the serialization libraries use, and making sure that this metadata is enough to properly identify the function. Some relevant snippets from cloudpickle follow:
    def save_instancemethod(self, obj):
        # Memoization rarely is ever useful due to python bounding
        if obj.__self__ is None:
            self.save_reduce(getattr, (obj.im_class, obj.__name__))
        else:
            if PY3:
                self.save_reduce(types.MethodType, (obj.__func__, obj.__self__), obj=obj)
            else:
                self.save_reduce(types.MethodType, (obj.__func__, obj.__self__, obj.__self__.__class__),
                         obj=obj)

    def _reduce_method_descriptor(obj):
        return (getattr, (obj.__objclass__, obj.__name__))
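The first bullet could start from something like the following sketch: plain pickle round-trips a module-level callable by reference, but fails on anything generated inside a factory (the helper name here is illustrative, not an existing pandas test):

```python
import pickle

def survives_pickle(func):
    """Return True if func round-trips through pickle by reference."""
    try:
        return pickle.loads(pickle.dumps(func)) is func
    except (AttributeError, pickle.PicklingError):
        return False

def make_func():
    # Nested function: its qualified name contains '<locals>',
    # so pickle cannot look it up as a module attribute.
    def inner():
        return 1
    return inner

print(survives_pickle(len))          # True: module-level, found by name
print(survives_pickle(make_func()))  # False: '<locals>' function
```

A test like this could be run over the public pandas API to flag every function that fails the round-trip.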
@max-sixty
Contributor

Why do you need to serialize the functions? If pandas is on the other machine, that shouldn't be necessary?

@mrocklin
Contributor Author

I need to communicate to a process on the other machine which pandas function to run.

@mmckerns

@mrocklin: I'm not seeing dill have issue with pd.Series.sum. It would be nice to have a list of the serialization issues you are seeing with pandas. (Thanks for the pointer on the breakage for pandas with dill in python 3.5.)

Some adjustments to dill could be made to help you here; however, changes to the package (pandas) are typically the better answer. It's common that very small changes make all the difference, and they usually have negligible negative impact on the speed and functionality of the package. pandas is a fundamental enough package that I'd be willing to digest some special cases.

@jreback
Contributor

jreback commented Jan 12, 2016

So the problem with the cum functions is trivial: they are just named incorrectly. A PR for that is fairly easy.

diff --git a/pandas/core/generic.py b/pandas/core/generic.py
index 958571f..9764ae2 100644
--- a/pandas/core/generic.py
+++ b/pandas/core/generic.py
@@ -4701,7 +4701,7 @@ class NDFrame(PandasObject):
             lambda y, axis: np.minimum.accumulate(y, axis),
             np.inf, np.nan)
         cls.cumsum = _make_cum_function(
-            'sum', name, name2, axis_descr,
+            'cumsum', name, name2, axis_descr,
             "cumulative sum",
             lambda y, axis: y.cumsum(axis), 0., np.nan)
         cls.cumprod = _make_cum_function(

How is the cloudpickle result for Series.sum wrong? It's just a closure.

@jreback jreback added the Compat pandas objects compatability with Numpy or Python functions label Jan 12, 2016
@jreback jreback added this to the 0.18.0 milestone Jan 12, 2016
@mrocklin
Contributor Author

In [1]: import pandas as pd

In [2]: import cloudpickle

In [3]: s = pd.Series([1, 2, 3])

In [4]: sum2 = cloudpickle.loads(cloudpickle.dumps(pd.Series.sum))

In [5]: pd.Series.sum(s)
Out[5]: 6

In [6]: sum2(s)
---------------------------------------------------------------------------
TypeError                                 Traceback (most recent call last)
<ipython-input-6-49effaf88fb4> in <module>()
----> 1 sum2(s)

TypeError: unbound method stat_func() must be called with NoneType instance as first argument (got Series instance instead)

This I think was more an error with cloudpickle than with pandas. This was fixed in a recent cloudpickle PR.

@jreback
Contributor

jreback commented Jan 12, 2016

So in your example above it's serializing an unbound method.

In [5]: sum2.__self__

If you serialize a bound method, then it works:

In [14]: sum2 = cloudpickle.loads(cloudpickle.dumps(s.sum))

In [15]: sum2
Out[15]: 
<bound method Series.stat_func of 0    1
1    2
2    3
dtype: int64>

In [16]: sum2()
Out[16]: 6

@mrocklin
Contributor Author

In this case I want to serialize an unbound method.

@jreback
Contributor

jreback commented Jan 12, 2016

So this is a cloudpickle/dill issue then (aside from the naming in the cum* funcs)?

@mrocklin
Contributor Author

It can be handled in either. Like Mike said earlier, a little bit of help from Pandas can go a long way here. It's important to make sure that all functions maintain metadata like __name__ and __self__ and im_class through all of the manipulations that Pandas does.
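For reference, the attributes a serializer leans on differ between bound and unbound methods in py3; a quick illustration in plain Python (no pandas required):

```python
class C:
    def m(self):
        return 42

bound = C().m
# A bound method carries its instance and its underlying function:
print(bound.__self__.__class__ is C)      # True
print(bound.__func__ is C.__dict__['m'])  # True

# In py3 an "unbound method" is just the plain function: there is no
# __self__ (and no im_class), so serializers must fall back to the
# qualified name to locate it on the class:
print(hasattr(C.m, '__self__'))           # False
print(C.m.__qualname__)                   # 'C.m'
```

If a factory-generated function loses this metadata, both the __self__/__func__ path and the __qualname__ path break.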

@jreback
Contributor

jreback commented Jan 12, 2016

no problem with changing things. maybe @mmckerns has an example of how you are constructing the closures. maybe we aren't setting something up correctly. (we are setting __name__ and __doc__), then assigning to the appropriate class.

def _make_stat_function(name, name1, name2, axis_descr, desc, f):

    @Substitution(outname=name, desc=desc, name1=name1, name2=name2, axis_descr=axis_descr)
    @Appender(_num_doc)
    def stat_func(self, axis=None, skipna=None, level=None,
                  numeric_only=None, **kwargs):
        if skipna is None:
            skipna = True
        if axis is None:
            axis = self._stat_axis_number
        if level is not None:
            return self._agg_by_level(name, axis=axis, level=level,
                                      skipna=skipna)
        return self._reduce(f, name, axis=axis,
                            skipna=skipna, numeric_only=numeric_only)
    stat_func.__name__ = name
    return stat_func
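One way to see the gap in a factory like the one above: setting __name__ alone still leaves a '<locals>' qualified name, which is what by-reference lookup (module plus __qualname__, as pickle does it) trips over. A stripped-down sketch with illustrative names, not the pandas code:

```python
def _make_stat_function(name, f):
    def stat_func(self):
        return f(self.data)
    stat_func.__name__ = name
    return stat_func

class Frame:
    def __init__(self, data):
        self.data = data

Frame.total = _make_stat_function('total', sum)
# __name__ is fixed up, but the qualified name still points nowhere useful:
print(Frame.total.__qualname__)  # '_make_stat_function.<locals>.stat_func'

# Rebinding __qualname__ to the owning class gives serializers a real
# path (module attribute 'Frame', then attribute 'total') to resolve:
Frame.total.__qualname__ = 'Frame.total'
print(Frame.total(Frame([1, 2, 3])))  # 6
```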

@mmckerns

Not sure what cloudpickle does for closures… but I think it has failed on them for a long while (as @mrocklin noted above) and may recently have been fixed. One major issue is that python uses/used the wrong fully qualified name for closures, and until __qualname__ was introduced, it caused a serializer to have a lot of difficulty referencing the code.

dill handles closures in most cases (but can still miss w/ bad qualnames) -- and it handles bound and unbound methods in most cases as well. One issue tends to be referenced here: https://github.com/uqfoundation/dill/blob/cccbea9b715e16b742288e1e5a21a687a4d4081b/dill/dill.py#L1000… also when a reference to a class method is used at the module level, this is also difficult. Also I think both dill and cloudpickle try to identify code in closures by looking at the source (to identify where the code lives)... and this can be aided by, say, not always using inner for all closured functions in the same file.

I guess, in short, it's not so much about how dill or cloudpickle constructs closures… it's about which closure constructs a given serializer can handle. Small things, like naming, and whether the decorator is in the same file as where it is used, etc, can all make a difference.

Messing with __name__ can also cause problems, but there are workarounds. I think I see what you are doing… that function then gets referenced by some class as a method? Yikes. Maybe a reference to a full example would help.

@jreback jreback modified the milestones: Next Major Release, 0.18.0 Jan 30, 2016
@jreback
Contributor

jreback commented Feb 17, 2016

@mrocklin
#12372

anything else you need?

@jreback jreback modified the milestones: 0.18.0, Next Major Release Feb 17, 2016
@mrocklin
Contributor Author

The problem is more extensive. I suspect you would need to be more careful about how you wrap things. Here is a failing case not handled by that PR:

In [1]: import pandas as pd

In [2]: from cloudpickle import dumps, loads

In [3]: pd.Series.sum
Out[3]: <function pandas.core.generic._make_stat_function.<locals>.stat_func>

In [4]: loads(dumps(pd.Series.sum))
Out[4]: <function stat_func>

@jreback
Contributor

jreback commented Feb 17, 2016

I checked out 0.2.1 of cloudpickle:

In [1]: import cloudpickle

In [2]: from cloudpickle import dumps, loads

In [3]: loads(dumps(pd.Series.sum))
Out[3]: <unbound method Series.sum>

In [4]: pd.__version__
Out[4]: '0.18.0rc1+23.g8fba893.dirty'

In [5]: cloudpickle.__version__
Out[5]: '0.2.1'

@mrocklin
Contributor Author

Try Python 3

@jreback
Contributor

jreback commented Feb 17, 2016

so if you have

def make_function(.....):

   def a_func(.....):
         ....
   a_func.__name__ = 'foo'
   return a_func

how do I get the name displayed instead of a_func? (I tried a_func.func_name = 'foo', but it doesn't seem to do anything.)

@mrocklin
Contributor Author

Looks like the attributes __name__ and __qualname__ come up in the definition of functools.wraps
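Indeed, both attributes are listed in functools.WRAPPER_ASSIGNMENTS, the set of attribute names that functools.wraps copies onto a wrapper; a factory can set the same ones by hand:

```python
import functools

# The attribute names functools.wraps copies from the wrapped function
# (exact contents vary slightly by Python version):
print(functools.WRAPPER_ASSIGNMENTS)

# Both of the attributes under discussion are on the list:
print('__name__' in functools.WRAPPER_ASSIGNMENTS)      # True
print('__qualname__' in functools.WRAPPER_ASSIGNMENTS)  # True
```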

@jreback
Contributor

jreback commented Feb 18, 2016

@kawochen do you have a reference on how/what to do with __qualname__? I mean, can I simply substitute it? Is that the right way?

@kawochen
Contributor

Yes, I think you just change __qualname__ upon (re)binding. See PEP 3155.

@jreback
Contributor

jreback commented Feb 18, 2016

@mrocklin I updated #12372 can you see if this works for you?

I could make this anything actually. What do you think cloudpickle actually wants?

In [1]: pd.Series.sum.__qualname__
Out[1]: '_make_stat_function.<locals>.sum'

In [2]: pd.Series.cumsum.__qualname__
Out[2]: '_make_cum_function.<locals>.cumsum'

In [3]: pd.Series.any.__qualname__
Out[3]: '_make_logical_function.<locals>.any'

In [3]: pd.Series.cumsum.__name__
Out[3]: 'cumsum'

@mrocklin
Contributor Author

In [1]: import pandas as pd
In [2]: pd.__version__
Out[2]: '0.18.0rc1+28.gdcfadad'

In [3]: from cloudpickle import dumps, loads

In [4]: pd.Series.sum
Out[4]: <function pandas.core.generic._make_stat_function.<locals>.sum>

In [5]: loads(dumps(pd.Series.sum))
Out[5]: <function stat_func>

In [6]: s = pd.Series([1, 2, 3])

In [7]: pd.Series.sum(s)
Out[7]: 6

In [8]: loads(dumps(pd.Series.sum))(s)
Out[8]: 6

In [9]: pd.Series.sum.__name__
Out[9]: 'sum'

In [10]: pd.Series.sum.__module__
Out[10]: 'pandas.core.generic'

In [11]: pd.Series.sum.__qualname__
Out[11]: '_make_stat_function.<locals>.sum'

In [12]: loads(dumps(pd.Series.sum)).__name__
Out[12]: 'stat_func'

In [13]: loads(dumps(pd.Series.sum)).__module__

In [14]: loads(dumps(pd.Series.sum)).__qualname__
Out[14]: 'stat_func'

@jreback
Contributor

jreback commented Feb 18, 2016

ok, the latest push does this, but I'm still not sure what cloudpickle is looking at. I think we actually need to go deeper into the closure:

In [1]: from cloudpickle import loads, dumps

In [2]: pd.Series.sum.__qualname__
Out[2]: 'pandas.core.series.Series.sum'

In [3]: loads(dumps(pd.Series.sum)).__qualname__
Out[3]: 'stat_func'

@jreback
Contributor

jreback commented Feb 18, 2016

I just tried adding on __module__ but no joy.

@mrocklin
Contributor Author

I added breakpoints inside of cloudpickle to see where this would end up. Arrived at the save_function method: https://github.com/cloudpipe/cloudpickle/blob/master/cloudpickle/cloudpickle.py#L173-L224

Although it also ended up there seven times for one call to dumps, so it's presumably moving around a fair amount.

Cloudpickle is only 700 lines. I think it's worth skimming it to see the kinds of things they look for.

@jreback
Contributor

jreback commented Feb 18, 2016

Yep, at first glance it looks like the module is good (pandas.core.series), but it wants the name Series; not sure how to get it to find the actual method.

-> if getattr(themodule, name, None) is obj:
(Pdb) l
191             if modname == '__main__':
192                 themodule = None
193     
194             if themodule:
195                 self.modules.add(themodule)
196  ->             if getattr(themodule, name, None) is obj:
197                     return self.save_global(obj, name)
198     
199             # if func is lambda, def'ed at prompt, is in main, or is nested, then
200             # we'll pickle the actual function object rather than simply saving a
201             # reference (as is done in default pickler), via save_function_tuple.
(Pdb) n
> /Users/jreback/miniconda/envs/py3.5/lib/python3.5/site-packages/cloudpickle-0.2.1-py3.5.egg/cloudpickle/cloudpickle.py(202)save_function()

@jreback
Contributor

jreback commented Feb 18, 2016

Yeah, I think we are not setting it up correctly to save a closure that is actually an instance method. It tries to save it like a module-level function, I think.

@mmckerns

If you want to see what dill does, all you have to do is turn on trace… so use dill.detect.trace(True). And if you want to use a version of dill that is the most similar to what cloudpickle does, first do dill.settings['recurse'] = True. This will print out the path dill takes in serializing each object. You can easily look at the matching code that is found in the source in dill.py for each object in the trace. It's not going to be exactly the same as cloudpickle, but it should be close.

@mrocklin
Contributor Author

@mmckerns FYI I've run into strange behavior when using dill.settings['recurse'] = True (things like int not being found upon reserialization), hence my recent move over to cloudpickle.

Providing traces though is really slick.

@jreback
Contributor

jreback commented Feb 19, 2016

thanks @mmckerns, ok will get to this prob next week. In any event, it seems that what we are doing is using a static function which we then assign to a class at run-time (so it's a method now, at least in py3; in py2 we have this bound nonsense...). I am thinking that on deserialization it cannot be found (even though I am setting the __qualname__ and __module__, which are supposed to provide the places to look it up).

is that a reasonable description?

@jreback
Contributor

jreback commented Feb 19, 2016

In [1]: import dill

In [2]: dill.detect.trace(True)

In [3]: from dill import loads, dumps

In [4]: loads(dumps(pd.Series.sum))
F1: <function Series.sum at 0x10b8e8730>
F2: <function _create_function at 0x10c00c6a8>
# F2
Co: <code object stat_func at 0x10b2d3150, file "/Users/jreback/miniconda/envs/py3.5/pandas/pandas/core/generic.py", line 5240>
T1: <class 'code'>
F2: <function _load_type at 0x10c00c598>
# F2
# T1
# Co
D4: <dict object at 0x10b29e2c8>
# D4
Ce: <cell at 0x10b8e19a8: function object at 0x10ab9d268>
F2: <function _create_cell at 0x10c00c9d8>
# F2
F2: <function nansum at 0x10ab9d268>
# F2
# Ce
Ce: <cell at 0x10b8e19d8: str object at 0x1003bc3b0>
# Ce
D2: <dict object at 0x10bfa13c8>
# D2
# F1
Out[4]: <function pandas.core.generic.stat_func>

@mmckerns

@jreback: I think you have a reasonable picture of it.

Note that the trace starts by going into F1 (which is your method), and ends with # F1 (which is when F1 is actually serialized). So, it first serializes the helper function (this is F2) that will be used to create your target function. Then it tries to serialize your target function's code object (Co), and apparently needs the code class to do this, and also needs another helper function to create this class object. It gets those two (T1 and F2), then finishes serializing the code object (# Co). The dict D4 tells me that it's grabbing the module dict as the global dict for the function… and I expect that things like nansum that follow are because they are referenced by the object at the module dict level.

@mmckerns

@mrocklin: I'm guessing that by int being missing you mean that numpy.int64 is not found as numpy.int64 upon deserialization. I don't want to hijack the thread for that, so please feel free to open a ticket for whatever issues you are running into, and I'll try to fix them.

jreback added a commit to jreback/pandas that referenced this issue Apr 5, 2016
@jreback jreback closed this as completed in a69d628 Apr 5, 2016
@jreback
Contributor

jreback commented Apr 5, 2016

@mrocklin

I merged the fixes for __name__ and __qualname__

lmk if anything else.
