ENH: Support out-of-band pickling (protocol 5) #34244

jakirkham · 2020-05-19T05:06:47Z

Is your feature request related to a problem?

It would be nice if Pandas objects supported pickle's protocol 5 for out-of-band serialization. This would allow the underlying data to be captured in PickleBuffers (specialized memoryview). For libraries using pickle's protocol 5 to transmit data over the wire, this would allow for zero-copy data transmission.
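
For illustration, here is a minimal sketch of what out-of-band transmission looks like with the standard library alone. The `payload` bytearray is only a stand-in for a large block of column data (an assumption for the example, not how pandas stores anything), and the round trip happens inside one process rather than over a real wire.

```python
import pickle

# Stand-in for a large block of column data (any buffer-protocol object works).
payload = bytearray(b"\x00" * 1_000_000)

# With buffer_callback, the pickler hands each PickleBuffer to the callback
# instead of copying its bytes into the pickle stream.
buffers = []
stream = pickle.dumps(
    pickle.PickleBuffer(payload), protocol=5, buffer_callback=buffers.append
)

# The stream holds only a small placeholder; the data travels separately.
print(len(stream), len(buffers))

# The receiver supplies the buffers back in the same order.
restored = pickle.loads(stream, buffers=buffers)

# No copy was made in this same-process round trip: a write to the original
# is visible through the restored buffer.
payload[0] = 255
print(memoryview(restored)[0])
```

A transport layer would send `stream` and each entry of `buffers` as separate frames, which is where the zero-copy benefit would come from.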

Describe the solution you'd like

Pandas objects implement __reduce_ex__ and if the protocol argument is 5 or greater, they construct PickleBuffers out of any data arguments.
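
A rough sketch of that idea, assuming a hypothetical `Column` container (not a pandas class) whose data lives in a bytearray. The `_rebuild` helper is likewise made up for the example, and it copies on load to keep things short.

```python
import pickle


class Column:
    """Hypothetical data container; not a pandas class."""

    def __init__(self, data):
        self.data = bytearray(data)

    def __reduce_ex__(self, protocol):
        if protocol >= 5:
            # Wrap the data so the pickler can pass it out-of-band
            # when the caller supplies buffer_callback.
            return type(self)._rebuild, (pickle.PickleBuffer(self.data),)
        # PickleBuffer cannot be pickled with protocols <= 4; fall back to a copy.
        return type(self)._rebuild, (bytes(self.data),)

    @classmethod
    def _rebuild(cls, buf):
        # This sketch copies on load for simplicity; a zero-copy
        # variant would keep a view on buf instead.
        return cls(buf)


col = Column(b"abc" * 1000)

buffers = []
new_style = pickle.dumps(col, protocol=5, buffer_callback=buffers.append)
old_style = pickle.dumps(col, protocol=4)

restored_new = pickle.loads(new_style, buffers=buffers)
restored_old = pickle.loads(old_style)
```

Older protocols keep working unchanged, and the protocol 5 stream stays small because the data is handed to `buffer_callback` rather than embedded.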

API breaking implications

NA, as it should be possible to fall back to existing behavior for older pickle protocols. Users have to actively opt in at a higher-level API (through pickle) to see any effect.

Describe alternatives you've considered

NA

Additional context

This would be useful in libraries that support distributed dataframes ;)


TomAugspurger · 2020-05-19T13:48:34Z

Thanks, looks interesting.

At a glance, it looks like we're successfully using pickle5 protocol when pickling underlying ndarrays.

import pandas as pd
import numpy as np
import pickle
import pickletools

a = np.arange(4)
b = pd.Series(a)

pickletools.dis(pickletools.optimize(pickle.dumps(a, protocol=5)))

pickletools.dis(pickletools.optimize(pickle.dumps(b, protocol=5)))
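
One thing to keep in mind when reading such disassemblies: without `buffer_callback`, protocol 5 still serializes buffers in-band, so the out-of-band opcode only shows up when a callback is passed. A possible way to check this programmatically, sketched with the standard library only (`opcode_names` is a helper made up for this example, and a plain bytearray stands in for array data):

```python
import pickle
import pickletools


def opcode_names(stream):
    """Return the set of opcode names that appear in a pickle stream."""
    return {opcode.name for opcode, arg, pos in pickletools.genops(stream)}


data = pickle.PickleBuffer(bytearray(b"0123456789"))

# No callback: the buffer contents are embedded in the stream.
in_band = pickle.dumps(data, protocol=5)

# With a callback: the stream only records that a buffer follows.
buffers = []
out_of_band = pickle.dumps(data, protocol=5, buffer_callback=buffers.append)

print("in-band:    ", sorted(opcode_names(in_band)))
print("out-of-band:", sorted(opcode_names(out_of_band)))
```

The same helper could be pointed at a pickled Series or DataFrame to see whether `NEXT_BUFFER` appears once a `buffer_callback` is supplied.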

So the primary work to do here is:

- Ensure that that's actually correct, including for DataFrame?
- Check Series / DataFrame for large objects that could also support out-of-band pickling?

jakirkham · 2020-05-19T19:15:59Z

Good point! Yeah if the objects Pandas uses for data storage already support pickle protocol 5 then it should just work. NumPy arrays are a good example (since they already support pickle protocol 5). Not sure what other objects might be used.

Certainly testing would help build confidence :)

My guess is size of objects shouldn't matter unless Pandas does something different with data representation of large objects.

TomAugspurger · 2020-05-19T20:23:01Z

> Not sure what other objects might be used.

The other potentially large objects would be extension arrays (Categorical, etc.). All of pandas' extension arrays do consist of one or more NumPy ndarrays.

jakirkham · 2020-05-19T21:45:10Z

Ok, so this may already just work then. FWIW this seems to be the case with DataFrame:

In [1]: import pickle

In [2]: import numpy

In [3]: import pandas

In [4]: d = pandas.DataFrame({"a": [1, 2, 3], "b": [0.5, 0.2, 0.3]})

In [5]: f = []
   ...: h = pickle.dumps(d, protocol=5, buffer_callback=f.append)

In [6]: [numpy.asarray(e) for e in f]                                           
Out[6]: [array([[0.5, 0.2, 0.3]]), array([[1, 2, 3]])]

Where are the current pickle tests?

TomAugspurger · 2020-05-20T10:51:06Z

Should all be in pandas/tests/io/test_pickle.py.
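
A test along these lines might be a starting point. This is only a sketch: `roundtrip_out_of_band` and the test name are hypothetical, not existing helpers in the pandas test suite, and it deliberately does not assert how many buffers are produced since that may depend on the pandas version.

```python
import pickle

import pandas as pd
import pandas.testing as tm


def roundtrip_out_of_band(obj):
    """Pickle obj with protocol 5, collecting buffers out-of-band, then restore it."""
    buffers = []
    payload = pickle.dumps(obj, protocol=5, buffer_callback=buffers.append)
    restored = pickle.loads(payload, buffers=buffers)
    return restored, buffers


def test_dataframe_protocol5_roundtrip():
    # Same frame as in the IPython session above.
    df = pd.DataFrame({"a": [1, 2, 3], "b": [0.5, 0.2, 0.3]})
    restored, buffers = roundtrip_out_of_band(df)
    # The restored frame should match the original exactly.
    tm.assert_frame_equal(restored, df)
```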

jakirkham · 2020-05-28T18:09:26Z

One other observation is if a column is represented with many small NumPy arrays, this will be true of the pickled form as well. During unpickling would Pandas keep the small NumPy arrays or would it consolidate them into a single one?

jakirkham added the Enhancement and Needs Triage labels May 19, 2020

TomAugspurger added the Performance label and removed the Needs Triage label May 19, 2020

TomAugspurger added this to the Contributions Welcome milestone May 19, 2020

TomAugspurger added the IO Pickle label May 19, 2020

jakirkham mentioned this issue May 19, 2020

Efficient Pandas serialization dask/distributed#614

Closed

jakirkham mentioned this issue May 21, 2020

Supporting out-of-band buffers with pickle protocol 5 explosion/spaCy#5472

Open

jakirkham mentioned this issue May 29, 2020

Support Pickle's protocol 5 dask/distributed#3784

Merged

ig248 mentioned this issue Oct 11, 2020

Write pickle to file-like without intermediate in-memory buffer #37056

Merged

jreback modified the milestones: Contributions Welcome, 1.2 Oct 12, 2020

jreback closed this as completed in #37056 Oct 14, 2020

jakirkham mentioned this issue Dec 3, 2021

Custom serializer is not transfered to the workers dask/distributed#5561

Closed
