Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Write pickle to file-like without intermediate in-memory buffer #37056

Merged
merged 9 commits into from
Oct 14, 2020

Conversation

ig248
Copy link
Contributor

@ig248 ig248 commented Oct 11, 2020

Before this change, calling pickle.dumps() created an in-memory byte buffer, negating the advantage
of zero-copy pickle protocol 5. After this change, pickle.dump writes directly to open file(-like),
cutting peak memory in half in most cases.

Profiling was done with pandas@master and python 3.8.5

Related issues:

image

Update: ASV results

$ asv continuous -f 1.1 origin/master HEAD -b pickle yields:

       before           after         ratio
     [58f74686]       [3c1ee270]
                      <zero-copy-pickle>
-           91.9M            80.1M     0.87  io.pickle.Pickle.peakmem_write_pickle

SOME BENCHMARKS HAVE CHANGED SIGNIFICANTLY.
PERFORMANCE INCREASED.

Before this change, calling pickle.dumps() created an in-memory byte buffer, negating the advantage
of zero-copy pickle protocol 5. After this change, pickle.dump writes directly to open file(-like),
cutting peak memory in half in most cases.
@ig248 ig248 changed the title [Draft] Write pickle to file-like without intermediate storage Write pickle to file-like without intermediate in-memory buffer Oct 11, 2020
Copy link
Contributor

@jreback jreback left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

nice, can you add a memory asv and show the results of this & a whatsnew note for 1.2 perf section.

@jreback
Copy link
Contributor

jreback commented Oct 11, 2020

also to actually close this issue i think would be good to add some tests that explicity set the pickle protocal to 5

@jreback jreback added IO Pickle read_pickle, to_pickle Performance Memory or execution speed performance labels Oct 11, 2020
@ig248
Copy link
Contributor Author

ig248 commented Oct 11, 2020

also to actually close this issue i think would be good to add some tests that explicity set the pickle protocal to 5

Isn't that covered by py38 tests where pickle.HIGHEST_PROTOCOL (==5) is used by default?

@jreback
Copy link
Contributor

jreback commented Oct 11, 2020

also to actually close this issue i think would be good to add some tests that explicity set the pickle protocal to 5

Isn't that covered by py38 tests where pickle.HIGHEST_PROTOCOL (==5) is used by default?

probably but let's make it explict as well (e.g. parameterize on 4,5,highest when PY38 or greater)

@ig248
Copy link
Contributor Author

ig248 commented Oct 11, 2020

Added ASV peakmem_ benchmark. The benchmark currently uses an ~11MB dataframe - I have no experience with ASV to tell how reliable memory benchmarking is for small-ish objects, any advice appreciated.

@jreback
Copy link
Contributor

jreback commented Oct 11, 2020

Added ASV peakmem_ benchmark. The benchmark currently uses an ~11MB dataframe - I have no experience with ASV to tell how reliable memory benchmarking is for small-ish objects, any advice appreciated.

it should show a diff here in any event. what kind of results are you seeing?

pandas/tests/io/test_pickle.py Outdated Show resolved Hide resolved
pandas/tests/io/test_pickle.py Outdated Show resolved Hide resolved
@ig248
Copy link
Contributor Author

ig248 commented Oct 11, 2020

Added ASV peakmem_ benchmark. The benchmark currently uses an ~11MB dataframe - I have no experience with ASV to tell how reliable memory benchmarking is for small-ish objects, any advice appreciated.

it should show a diff here in any event. what kind of results are you seeing?

Yep, added results above - peakmem_write_pickle shows the 11MB drop nicely - in practice the relative gain is obv much more significant than shown for larger dataframes.

@pep8speaks
Copy link

pep8speaks commented Oct 11, 2020

Hello @ig248! Thanks for updating this PR. We checked the lines you've touched for PEP 8 issues, and found:

There are currently no PEP 8 issues detected in this Pull Request. Cheers! 🍻

Comment last updated at 2020-10-12 22:51:15 UTC

Copy link
Contributor

@jreback jreback left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

lgtm ping on green. can you post the results of the asv (in the top of the PR, just edit it)

doc/source/whatsnew/v1.2.0.rst Outdated Show resolved Hide resolved
pandas/tests/io/test_pickle.py Show resolved Hide resolved
@jreback jreback added this to the 1.2 milestone Oct 12, 2020
@jreback
Copy link
Contributor

jreback commented Oct 12, 2020

lgtm @ig248 just resolved the conflict and can merge on green.

@jreback jreback merged commit 0fa47b6 into pandas-dev:master Oct 14, 2020
@jreback
Copy link
Contributor

jreback commented Oct 14, 2020

thanks @ig248 very nice!

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
IO Pickle read_pickle, to_pickle Performance Memory or execution speed performance
Projects
None yet
Development

Successfully merging this pull request may close these issues.

ENH: Support out-of-band pickling (protocol 5)
3 participants