New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Use msgpack for cache serialization #115
Conversation
The code-quality check seems to have failed due to a internal server error at their end:
|
Anything missing here or things you would want me to address? /cc @dstufft |
@StephanErb sorry the slow response. My only concern is adding the dependency. Right now, CacheControl only requires requests and this adds another dependency that is arguably doesn't add a huge advantage in the general case. What do you think about making it optional if msgpack is installed, falling back to the default encoding scheme otherwise? I realize this could get weird if you had more than one process using with msgpack installed and another without, so please take this as a desire to discuss the issue and not a demand! Let me know what you think! |
It does have some advantages, even in the general case:
Of course, it comes with the downside of adding another mandatory dependency (in this case, a dependency that does have an optional C extension for CPython speedups, but it fails gracefully if that fails to compile for any reason). All in all, my personal opinion is that I think msgpack is great, and I think it makes sense to do this, however if you would prefer not to force it I would suggest that it's likely to be a better idea to have some sort of flag to force this on/off rather than auto detecting (though I'm not sure how to handle that exactly, the versioned cache serialization format doesn't really handle downgrades or "features", just upgrades). |
When preparing the pull request, I also pondered whether to keep the non-msgpack fallback or not. I concluded that the primary concern of cachecontrol is caching of HTTP responses. Having a dependency on a first-class serialization library therefore did not seem unreasonable. |
Have you put any more thought into this? If deemed necessary, I'd be happy to update the pull request with a fallback mode to the old serialization format. |
@StephanErb Actually yes, I think it would be best to avoid the extra dependency simply b/c people do have hangups about that sort of thing. I'm honestly not one of them, but the fact that msgpack can include C deps makes it more confusing and cumbersome for folks. I can imagine new users seeing a message it can't compile something in some output would feel they are doing something wrong. Even though it doesn't make a difference, it is one of those things that makes it harder getting started. I've always been a fan of CherryPy, partially for this reason. With that in mind, as I'm sure this is helpful for pip, if you and @dstufft wanted to pull the file cache bits out into its own library, that doesn't bother me. I've always avoided CacheControl including a ton of storage backends b/c they are difficult to maintain. The file cache was the one exception, but if it seems like it needs to be pulled out into its own library in order to ensure yours truly is not a bottleneck, I'm not completely opposed to the idea. Lastly, if it is a total pain in the neck to have a fallback path, I'm curious how the benchmark looks with msgpack using C vs. the pure Python msgpack. If the pure Python version is slower, I'm thinking it would be best to use an explicit method of including msgpack:
That way we help move folks towards the best solution depending on their requirements (ie no C runtime to build in deployment). Thanks for your patience and I hope I'm not putting any undue burden on you. Part of me just wants to merge it, but, at the same time, after dealing with deployment issues as of late, adding an extra dependency that does have the potential of compiling C does make me hesitant. |
I don't have a problem using another library that extends cachecontrol, but this isn't something that the storage backend controls, this is the serializer which currently is storage agnostic. The serializer appears to be pluggable (can be passed into the controller as an argument). Assuming that is part of the API and isn't just something for testing I wouldn't be opposed to depending on some other library that implements a msgpack serializer for CacheControl (or just doing it in pip in that case I guess). I do want to point out that all modern versions of pip will not show any message about not being able to compile msgpack and msgpack itself will attempt to compile, and then will fallback to pure python without failing the install. If that changes your opinion on the matter at all. |
@ionrock I can absolutely relate to you, now wanting an additional dependency. I will reconsider doing the pull request directly in pip |
After some conversation in irc, I'd be open to merging this. @StephanErb do you mind taking a look at your patch to make sure things are still working? This will also need some documentation updates to help communicate the change. Thanks! |
+1 - this would be great to have. |
I have rebased this pull request. No code modifications were necessary and all tests are still passing. From my perspective this would be ready to merge. |
def _b64_encode(s): | ||
if isinstance(s, text_type): | ||
return _b64_encode_str(s) | ||
return _b64_encode_bytes(s) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This should probably be left around for the time being as they are still used in the previous versioned loader. I would think that if a v3 cache item gets loaded we'll see an exception.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I have only deleted the encode methods. The decode methods required for reading are still around. I therefore think that loading an old compressed json item is still working. There is also a test case for this. See my comment below.
I am happy re-add if the code. Just want to make sure we are on the same page :)
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Have you had time to think about this? Should I revert the change as you have indicated above?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@StephanErb you were right! Thanks!
assert _b64_decode_str(unicode_result) == self.unicode_string | ||
|
||
bytes_result = _b64_encode(self.unicode_string) | ||
assert _b64_decode_str(bytes_result) == self.unicode_string |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
As per the comment regarding these functions, this test can stick around as well.
@StephanErb Just a couple small comments. Thanks for rebasing! |
# We have to decode our urllib3 data back into a unicode string. | ||
assert resp.data == 'Hello World'.encode('utf-8') | ||
|
||
def test_read_version_v2(self): |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This test ensures we can still read old data.
if isinstance(s, text_type): | ||
return _b64_encode_str(s) | ||
return _b64_encode_bytes(s) | ||
from .compat import HTTPResponse, pickle | ||
|
||
|
||
def _b64_decode_bytes(b): |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Decode has not been removed to ensure backwards compatibility.
Msgpack is fast, supports all major Python versions, and does not add overhead for the serialization of large binary values (as commonly handled by pip).
I have rebased this branch. Is there anything else that need to be addressed? |
Thanks @StephanErb for your patience with this patch! |
💯 |
@ionrock would it be possible to make a new release including this chance? |
@StephanErb Yes! I just started at a new job and have been a bit swamped. I'll get it out today. |
@ionrock - any chance we can get a release soon? Thanks! |
@dsully @StephanErb Just released 0.12.0. Let me know if you find any issues. Thanks! |
This is a pull request meant to improve the efficiency of wheel download caching in pip (pypa/pip#3515).
Msgpack is fast, supports all major Python versions, and does not add overhead for the serialization of large binary values (as commonly handled by pip).
Benchmark results: