ENH: Support out-of-band pickling (protocol 5) #34244

Closed
jakirkham opened this issue May 19, 2020 · 6 comments · Fixed by #37056
Labels
Enhancement, IO Pickle (read_pickle, to_pickle), Performance (memory or execution speed performance)
Milestone

Comments

@jakirkham
Contributor

Is your feature request related to a problem?

It would be nice if Pandas objects supported pickle's protocol 5 for out-of-band serialization. This would allow the underlying data to be captured in PickleBuffers (specialized memoryview). For libraries using pickle's protocol 5 to transmit data over the wire, this would allow for zero-copy data transmission.

Describe the solution you'd like

Pandas objects implement __reduce_ex__ and if the protocol argument is 5 or greater, they construct PickleBuffers out of any data arguments.
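For illustration, here is a minimal sketch of that pattern using only the standard library. The `Column` class is hypothetical and is not pandas code; it only shows how `__reduce_ex__` can hand its data to pickle as a `PickleBuffer` under protocol 5 and fall back to an in-band copy for older protocols.

```python
import pickle
from pickle import PickleBuffer


class Column:
    """Hypothetical container wrapping any object that exposes the buffer protocol."""

    def __init__(self, data):
        self.data = data

    def __reduce_ex__(self, protocol):
        if protocol >= 5:
            # Let pickle decide whether the buffer travels in-band or out-of-band.
            return (Column, (PickleBuffer(self.data),))
        # Older protocols: fall back to copying the data into the pickle stream.
        return (Column, (bytes(self.data),))


col = Column(bytearray(b"abcdef"))

# Out-of-band: the callback collects the buffers instead of copying them.
buffers = []
payload = pickle.dumps(col, protocol=5, buffer_callback=buffers.append)
restored = pickle.loads(payload, buffers=buffers)

# Older protocol: same class, data copied in-band.
legacy = pickle.loads(pickle.dumps(col, protocol=4))
```

With `buffer_callback` supplied, `payload` carries only a placeholder for the data, and the consumer has to pass the same buffers, in the same order, to `pickle.loads`.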

API breaking implications

NA, as it should be possible to fall back to the existing behavior for older pickle protocols. Users have to actively opt in at a higher-level API (through pickle) to see any effect.

Describe alternatives you've considered

NA

Additional context

This would be useful in libraries that support distributed dataframes ;)

@jakirkham added the Enhancement and Needs Triage labels May 19, 2020
@TomAugspurger added the Performance label and removed the Needs Triage label May 19, 2020
@TomAugspurger added this to the Contributions Welcome milestone May 19, 2020
@TomAugspurger added the IO Pickle label May 19, 2020
@TomAugspurger
Contributor

Thanks, looks interesting.

At a glance, it looks like we're already using pickle protocol 5 successfully when pickling the underlying ndarrays.

import pandas as pd
import numpy as np
import pickle
import pickletools

a = np.arange(4)
b = pd.Series(a)

pickletools.dis(pickletools.optimize(pickle.dumps(a, protocol=5)))

pickletools.dis(pickletools.optimize(pickle.dumps(b, protocol=5)))

So the primary work to do here is:

  1. Ensure that that's actually correct, including for DataFrame?
  2. Check Series / DataFrame for large objects that could also support out-of-band pickling?
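One way to run both checks is to pickle with a `buffer_callback` and look at what ends up out-of-band. The helper below is a sketch; `oob_summary` is a made-up name, and the demo uses a plain `PickleBuffer` over a `bytearray` so it runs without pandas. The same call should apply to a Series or DataFrame, though that is an assumption to verify rather than a tested claim.

```python
import pickle
from pickle import PickleBuffer


def oob_summary(obj, protocol=5):
    """Return (in-band payload size, sizes of the out-of-band buffers) for obj."""
    buffers = []
    # list.append returns None, which tells pickle to keep each buffer out-of-band.
    payload = pickle.dumps(obj, protocol=protocol, buffer_callback=buffers.append)
    sizes = [memoryview(b).nbytes for b in buffers]
    return len(payload), sizes


# Stand-in for a large data block; with pandas this would be the Series/DataFrame.
big = bytearray(1_000_000)
payload_size, buffer_sizes = oob_summary({"values": PickleBuffer(big), "name": "a"})
```

If a large object is pickled and `buffer_sizes` comes back empty, its data was copied into the pickle stream instead of being exposed as a buffer.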

@jakirkham
Contributor Author

Good point! Yeah if the objects Pandas uses for data storage already support pickle protocol 5 then it should just work. NumPy arrays are a good example (since they already support pickle protocol 5). Not sure what other objects might be used.

Certainly testing would help build confidence :)

My guess is that the size of objects shouldn't matter unless Pandas does something different with the data representation of large objects.

@TomAugspurger
Contributor

> Not sure what other objects might be used.

The other potentially large objects would be extension arrays (Categorical, etc.). All of pandas' extension arrays do consist of one or more NumPy ndarrays.

@jakirkham
Contributor Author

Ok, so this may already just work then. FWIW this seems to be the case with DataFrame:

In [1]: import pickle                                                           

In [2]: import numpy                                                            

In [3]: import pandas                                                           

In [4]: d = pandas.DataFrame({"a": [1, 2, 3], "b": [0.5, 0.2, 0.3]})            

In [5]: f = [] 
   ...: h = pickle.dumps(d, protocol=5, buffer_callback=f.append)               

In [6]: [numpy.asarray(e) for e in f]                                           
Out[6]: [array([[0.5, 0.2, 0.3]]), array([[1, 2, 3]])]

Where are the current pickle tests?
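A round-trip check along these lines could serve as a starting point for a test. This is only a sketch built on the session above, not an existing pandas test; it assumes a Python version with pickle protocol 5 and NumPy/pandas versions that emit out-of-band buffers.

```python
import pickle

import pandas as pd

d = pd.DataFrame({"a": [1, 2, 3], "b": [0.5, 0.2, 0.3]})

# Serialize with the data buffers collected out-of-band.
buffers = []
header = pickle.dumps(d, protocol=5, buffer_callback=buffers.append)

# The receiving side must supply the same buffers, in the same order.
restored = pickle.loads(header, buffers=buffers)
```

Here `restored.equals(d)` is the natural equality check, and a non-empty `buffers` list confirms that the data was handed over out-of-band.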

@TomAugspurger
Contributor

Should all be in pandas/tests/io/test_pickle.py.

@jakirkham
Contributor Author

One other observation is that if a column is represented with many small NumPy arrays, this will be true of the pickled form as well. During unpickling, would Pandas keep the small NumPy arrays or would it consolidate them into a single one?


3 participants