
Use msgpack for cache serialization #115

Merged: 1 commit merged into psf:master on Jan 11, 2017

Conversation

@StephanErb (Contributor) commented Feb 27, 2016:

This is a pull request meant to improve the efficiency of wheel download caching in pip (pypa/pip#3515).

Msgpack is fast, supports all major Python versions, and does not add overhead for the serialization of large binary values (as commonly handled by pip).

Benchmark results:

# Before
$ python ./examples/benchmark.py
Total time for 1000 requests: 0:00:00.670020
# After
$ python ./examples/benchmark.py
Total time for 1000 requests: 0:00:00.574051
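
For context, here is a minimal illustrative sketch of the difference being exploited (an editor's example, not part of the patch): msgpack has a native binary type, while a JSON-based scheme has to base64-encode response bodies before storing them.

# Illustrative comparison only; names and sizes are made up.
import base64
import json

import msgpack

body = b"\x00" * (1024 * 1024)  # stand-in for a downloaded wheel

# JSON cannot hold raw bytes, so a JSON-based cache entry has to
# base64-encode the body, inflating it by roughly a third.
json_blob = json.dumps({"body": base64.b64encode(body).decode("ascii")})

# msgpack stores the body as a native binary value, byte for byte.
msgpack_blob = msgpack.packb({"body": body}, use_bin_type=True)

assert len(msgpack_blob) < len(json_blob)
assert msgpack.unpackb(msgpack_blob, raw=False)["body"] == body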

@StephanErb (Contributor Author) commented:

The code-quality check seems to have failed due to an internal server error on their end:

Check number 12 has failed. This is almost always due to a bug in Landscape - we have been notified. Sorry about that.
Failure reason: Unknown
We don't know exactly why this error happened yet

@StephanErb (Contributor Author) commented:

Anything missing here or things you would want me to address?

/cc @dstufft

@ionrock (Contributor) commented Mar 7, 2016:

@StephanErb sorry for the slow response. My only concern is adding the dependency. Right now, CacheControl only requires requests, and this adds another dependency that arguably doesn't provide a huge advantage in the general case.

What do you think about making it optional: use msgpack if it is installed, and fall back to the default encoding scheme otherwise? I realize this could get weird if you had more than one process using the cache, one with msgpack installed and another without, so please take this as a desire to discuss the issue and not a demand!

Let me know what you think!
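
A rough sketch of the optional-dependency idea floated above, assuming entries carry a marker recording how they were encoded (all names here are illustrative, not CacheControl's actual serializer):

import json

try:
    import msgpack
except ImportError:
    msgpack = None  # behave as if msgpack had never been requested

def dumps(data):
    # Prefer msgpack when it is importable; otherwise fall back to a
    # JSON-style encoding. The prefix records which path was taken so a
    # reader on a different setup can at least recognize what it is seeing.
    if msgpack is not None:
        return b"fmt=msgpack," + msgpack.packb(data, use_bin_type=True)
    return b"fmt=json," + json.dumps(data).encode("utf-8")

The mixed-process concern above is visible here: a process without msgpack can recognize an "fmt=msgpack," entry but still cannot decode it.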

@dstufft (Contributor) commented Mar 7, 2016:

It does have some advantages, even in the general case.

Of course, it comes with the downside of adding another mandatory dependency (in this case, one with an optional C extension for CPython speedups that fails gracefully if it cannot be compiled for any reason).

All in all, my personal opinion is that msgpack is great and it makes sense to do this. However, if you would prefer not to force it, it's likely better to have some sort of flag to force this on/off rather than auto-detecting (though I'm not sure how to handle that exactly; the versioned cache serialization format doesn't really handle downgrades or "features", just upgrades).
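
For reference, a rough sketch of the upgrade-only, version-prefixed loading described here (prefixes and loader bodies are illustrative, not CacheControl's actual code): writers always emit the newest format, and a reader treats any version it does not know about as a cache miss, which is why downgrades and optional "features" are awkward.

import json
import zlib

import msgpack

def _loads_v2(payload):
    # assumed stand-in for an older compressed-JSON style entry
    return json.loads(zlib.decompress(payload).decode("utf-8"))

def _loads_v3(payload):
    # assumed stand-in for a msgpack-encoded entry
    return msgpack.unpackb(payload, raw=False)

def loads(data):
    # Each cache entry starts with a version marker; the reader dispatches
    # on the markers it knows and misses on anything newer.
    ver, _, payload = data.partition(b",")
    if ver == b"cc=2":
        return _loads_v2(payload)
    if ver == b"cc=3":
        return _loads_v3(payload)
    return None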

@StephanErb (Contributor Author) commented:

When preparing the pull request, I also pondered whether to keep the non-msgpack fallback or not.

I concluded that the primary concern of cachecontrol is caching of HTTP responses. Having a dependency on a first-class serialization library therefore did not seem unreasonable.

@StephanErb (Contributor Author) commented:

Have you put any more thought into this? If deemed necessary, I'd be happy to update the pull request with a fallback mode to the old serialization format.

@ionrock (Contributor) commented Mar 14, 2016:

@StephanErb Actually yes, I think it would be best to avoid the extra dependency, simply b/c people do have hangups about that sort of thing. I'm honestly not one of them, but the fact that msgpack can include C deps makes it more confusing and cumbersome for folks. I can imagine new users seeing a message in some output that it can't compile something and feeling they are doing something wrong. Even though it doesn't make a difference, it is one of those things that makes getting started harder. I've always been a fan of CherryPy, partially for this reason.

With that in mind, since I'm sure this is helpful for pip, if you and @dstufft wanted to pull the file cache bits out into their own library, that doesn't bother me. I've always avoided having CacheControl include a ton of storage backends b/c they are difficult to maintain. The file cache was the one exception, but if it seems like it needs to be pulled out into its own library to ensure yours truly is not a bottleneck, I'm not completely opposed to the idea.

Lastly, if it is a total pain in the neck to have a fallback path, I'm curious how the benchmark looks with msgpack using C vs. the pure Python msgpack. If the pure Python version is slower, I'm thinking it would be best to use an explicit method of including msgpack:

# this is completely made-up code
import cachecontrol.config
cachecontrol.config.use_msgpack = True

That way we help move folks towards the best solution for their requirements (i.e. no C runtime to build during deployment).
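
Continuing the made-up example above, one way such an explicit opt-in could look on the encoding side (again purely illustrative; the flag and the fallback format are not real):

import json
import zlib

USE_MSGPACK = False  # hypothetical switch a user would flip explicitly

def dumps(data):
    if USE_MSGPACK:
        import msgpack  # only needed when the user opted in
        return msgpack.packb(data, use_bin_type=True)
    # otherwise use a compressed-JSON encoding similar in spirit to the
    # existing scheme (not the exact CacheControl format)
    return zlib.compress(json.dumps(data).encode("utf-8"))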

Thanks for your patience and I hope I'm not putting any undue burden on you. Part of me just wants to merge it, but, at the same time, after dealing with deployment issues as of late, adding an extra dependency that does have the potential of compiling C does make me hesitant.

@dstufft (Contributor) commented Mar 14, 2016:

I don't have a problem using another library that extends cachecontrol, but this isn't something that the storage backend controls; it's the serializer, which is currently storage agnostic. The serializer appears to be pluggable (it can be passed into the controller as an argument). Assuming that is part of the API and isn't just something for testing, I wouldn't be opposed to depending on some other library that implements a msgpack serializer for CacheControl (or just doing it in pip in that case, I guess).
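
For illustration, plugging an alternative serializer in from the outside might look roughly like this, assuming (as the comment above does) that the serializer keyword is public API rather than a testing hook:

import requests
from cachecontrol import CacheControl
from cachecontrol.serialize import Serializer

# A third-party msgpack-backed serializer could be passed here in place of
# the default Serializer(); this sketch only shows where the hook sits.
sess = CacheControl(requests.Session(), serializer=Serializer())
resp = sess.get("https://example.com/")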

I do want to point out that all modern versions of pip will not show any message about not being able to compile msgpack; msgpack itself will attempt to compile and then fall back to pure Python without failing the install. Mentioning it in case that changes your opinion on the matter at all.

@StephanErb (Contributor Author) commented:

@ionrock I can absolutely relate to you not wanting an additional dependency. I will reconsider doing the pull request directly in pip.

@StephanErb closed this May 18, 2016
@ionrock (Contributor) commented Sep 20, 2016:

@StephanErb @dstufft

After some conversation in IRC, I'd be open to merging this. @StephanErb, do you mind taking a look at your patch to make sure things are still working?

This will also need some documentation updates to help communicate the change.

Thanks!

@ionrock reopened this Sep 20, 2016
@dsully commented Sep 28, 2016:

+1 - this would be great to have.

@StephanErb (Contributor Author) commented:

I have rebased this pull request. No code modifications were necessary and all tests are still passing. From my perspective this would be ready to merge.

def _b64_encode(s):
    if isinstance(s, text_type):
        return _b64_encode_str(s)
    return _b64_encode_bytes(s)
Review comment (Contributor):

These should probably be left around for the time being, as they are still used by the previous versioned loader. I would think that if a v3 cache item gets loaded, we'll see an exception.

Reply (Contributor Author):

I have only deleted the encode methods. The decode methods required for reading are still around, so I think loading an old compressed JSON item still works. There is also a test case for this; see my comment below.

I am happy to re-add the code if needed. Just want to make sure we are on the same page :)

Reply (Contributor Author):

Have you had time to think about this? Should I revert the change as you have indicated above?

Reply (Contributor):

@StephanErb you were right! Thanks!

assert _b64_decode_str(unicode_result) == self.unicode_string

bytes_result = _b64_encode(self.unicode_string)
assert _b64_decode_str(bytes_result) == self.unicode_string
Review comment (Contributor):

As per the comment regarding these functions, this test can stick around as well.

@ionrock (Contributor) commented Nov 25, 2016:

@StephanErb Just a couple small comments. Thanks for rebasing!

        # We have to decode our urllib3 data back into a unicode string.
        assert resp.data == 'Hello World'.encode('utf-8')

    def test_read_version_v2(self):
Review comment (Contributor Author):

This test ensures we can still read old data.

    if isinstance(s, text_type):
        return _b64_encode_str(s)
    return _b64_encode_bytes(s)
from .compat import HTTPResponse, pickle


def _b64_decode_bytes(b):
Review comment (Contributor Author):

The decode functions have not been removed, to ensure backwards compatibility.

Msgpack is fast, supports all major Python versions, and does not add
overhead for the serialization of large binary values (as commonly
handled by pip).
@StephanErb (Contributor Author) commented:

I have rebased this branch. Is there anything else that needs to be addressed?

@ionrock merged commit 5cf2852 into psf:master on Jan 11, 2017
@ionrock (Contributor) commented Jan 11, 2017:

Thanks @StephanErb for your patience with this patch!

@dstufft (Contributor) commented Jan 12, 2017:

💯

@StephanErb (Contributor Author) commented:

@ionrock would it be possible to make a new release including this change?

@ionrock (Contributor) commented Jan 27, 2017:

@StephanErb Yes! I just started at a new job and have been a bit swamped. I'll get it out today.

@dsully commented Jan 29, 2017:

@ionrock - any chance we can get a release soon? Thanks!

@ionrock (Contributor) commented Jan 30, 2017:

@dsully @StephanErb Just released 0.12.0. Let me know if you find any issues. Thanks!
