
Use msgpack for cache serialization #115

Merged: 1 commit merged into psf:master on Jan 11, 2017

Conversation

@StephanErb (Contributor) commented Feb 27, 2016:

This is a pull request meant to improve the efficiency of wheel download caching in pip (pypa/pip#3515).

Msgpack is fast, supports all major Python versions, and does not add overhead for the serialization of large binary values (as commonly handled by pip).

Benchmark results:

# Before
$ python ./examples/benchmark.py
Total time for 1000 requests: 0:00:00.670020
# After
$ python ./examples/benchmark.py
Total time for 1000 requests: 0:00:00.574051
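
For context, here is a minimal illustrative sketch of the difference being exploited (an editor's example, not part of the patch): msgpack has a native binary type, while a JSON-based scheme has to base64-encode response bodies before storing them.

# Illustrative comparison only; names and sizes are made up.
import base64
import json

import msgpack

body = b"\x00" * (1024 * 1024)  # stand-in for a downloaded wheel

# JSON cannot hold raw bytes, so a JSON-based cache entry has to
# base64-encode the body, inflating it by roughly a third.
json_blob = json.dumps({"body": base64.b64encode(body).decode("ascii")})

# msgpack stores the body as a native binary value, byte for byte.
msgpack_blob = msgpack.packb({"body": body}, use_bin_type=True)

assert len(msgpack_blob) < len(json_blob)
assert msgpack.unpackb(msgpack_blob, raw=False)["body"] == body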

@StephanErb (Contributor Author) commented:

The code-quality check seems to have failed due to an internal server error on their end:

Check number 12 has failed. This is almost always due to a bug in Landscape - we have been notified. Sorry about that.
Failure reason: Unknown
We don't know exactly why this error happened yet

@StephanErb (Contributor Author) commented:

Anything missing here or things you would want me to address?

/cc @dstufft

@ionrock (Contributor) commented Mar 7, 2016:

@StephanErb sorry for the slow response. My only concern is adding the dependency. Right now, CacheControl only requires requests, and this adds another dependency that arguably doesn't provide a huge advantage in the general case.

What do you think about making it optional: use msgpack if it is installed, and fall back to the default encoding scheme otherwise? I realize this could get weird if you had more than one process using the cache, one with msgpack installed and another without, so please take this as a desire to discuss the issue and not a demand!

Let me know what you think!
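
A rough sketch of the optional-dependency idea floated above, assuming entries carry a marker recording how they were encoded (all names here are illustrative, not CacheControl's actual serializer):

import json

try:
    import msgpack
except ImportError:
    msgpack = None  # behave as if msgpack had never been requested

def dumps(data):
    # Prefer msgpack when it is importable; otherwise fall back to a
    # JSON-style encoding. The prefix records which path was taken so a
    # reader on a different setup can at least recognize what it is seeing.
    if msgpack is not None:
        return b"fmt=msgpack," + msgpack.packb(data, use_bin_type=True)
    return b"fmt=json," + json.dumps(data).encode("utf-8")

The mixed-process concern above is visible here: a process without msgpack can recognize an "fmt=msgpack," entry but still cannot decode it.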

@dstufft (Contributor) commented Mar 7, 2016:

It does have some advantages, even in the general case.

Of course, it comes with the downside of adding another mandatory dependency (in this case, one with an optional C extension for CPython speedups that fails gracefully if it cannot be compiled for any reason).

All in all, my personal opinion is that msgpack is great and it makes sense to do this. However, if you would prefer not to force it, it's likely better to have some sort of flag to force this on/off rather than auto-detecting (though I'm not sure how to handle that exactly; the versioned cache serialization format doesn't really handle downgrades or "features", just upgrades).
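
For reference, a rough sketch of the upgrade-only, version-prefixed loading described here (prefixes and loader bodies are illustrative, not CacheControl's actual code): writers always emit the newest format, and a reader treats any version it does not know about as a cache miss, which is why downgrades and optional "features" are awkward.

import json
import zlib

import msgpack

def _loads_v2(payload):
    # assumed stand-in for an older compressed-JSON style entry
    return json.loads(zlib.decompress(payload).decode("utf-8"))

def _loads_v3(payload):
    # assumed stand-in for a msgpack-encoded entry
    return msgpack.unpackb(payload, raw=False)

def loads(data):
    # Each cache entry starts with a version marker; the reader dispatches
    # on the markers it knows and misses on anything newer.
    ver, _, payload = data.partition(b",")
    if ver == b"cc=2":
        return _loads_v2(payload)
    if ver == b"cc=3":
        return _loads_v3(payload)
    return None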

@StephanErb (Contributor Author) commented:

When preparing the pull request, I also pondered whether to keep the non-msgpack fallback or not.

I concluded that the primary concern of cachecontrol is caching of HTTP responses. Having a dependency on a first-class serialization library therefore did not seem unreasonable.

@StephanErb (Contributor Author) commented:

Have you put any more thought into this? If deemed necessary, I'd be happy to update the pull request with a fallback mode to the old serialization format.

@ionrock (Contributor) commented Mar 14, 2016:

@StephanErb Actually yes, I think it would be best to avoid the extra dependency, simply b/c people do have hangups about that sort of thing. I'm honestly not one of them, but the fact that msgpack can include C deps makes it more confusing and cumbersome for folks. I can imagine new users seeing a message in some output that it can't compile something and feeling they are doing something wrong. Even though it doesn't make a difference, it is one of those things that makes getting started harder. I've always been a fan of CherryPy, partially for this reason.

With that in mind, since I'm sure this is helpful for pip, if you and @dstufft wanted to pull the file cache bits out into their own library, that doesn't bother me. I've always avoided having CacheControl include a ton of storage backends b/c they are difficult to maintain. The file cache was the one exception, but if it seems like it needs to be pulled out into its own library to ensure yours truly is not a bottleneck, I'm not completely opposed to the idea.

Lastly, if it is a total pain in the neck to have a fallback path, I'm curious how the benchmark looks with msgpack using C vs. the pure Python msgpack. If the pure Python version is slower, I'm thinking it would be best to use an explicit method of including msgpack:

# this is completely made-up code
import cachecontrol.config
cachecontrol.config.use_msgpack = True

That way we help move folks towards the best solution for their requirements (i.e. no C runtime to build during deployment).
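
Continuing the made-up example above, one way such an explicit opt-in could look on the encoding side (again purely illustrative; the flag and the fallback format are not real):

import json
import zlib

USE_MSGPACK = False  # hypothetical switch a user would flip explicitly

def dumps(data):
    if USE_MSGPACK:
        import msgpack  # only needed when the user opted in
        return msgpack.packb(data, use_bin_type=True)
    # otherwise use a compressed-JSON encoding similar in spirit to the
    # existing scheme (not the exact CacheControl format)
    return zlib.compress(json.dumps(data).encode("utf-8"))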

Thanks for your patience and I hope I'm not putting any undue burden on you. Part of me just wants to merge it, but, at the same time, after dealing with deployment issues as of late, adding an extra dependency that does have the potential of compiling C does make me hesitant.

@dstufft (Contributor) commented Mar 14, 2016:

I don't have a problem using another library that extends cachecontrol, but this isn't something that the storage backend controls; it's the serializer, which is currently storage agnostic. The serializer appears to be pluggable (it can be passed into the controller as an argument). Assuming that is part of the API and isn't just something for testing, I wouldn't be opposed to depending on some other library that implements a msgpack serializer for CacheControl (or just doing it in pip in that case, I guess).
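
For illustration, plugging an alternative serializer in from the outside might look roughly like this, assuming (as the comment above does) that the serializer keyword is public API rather than a testing hook:

import requests
from cachecontrol import CacheControl
from cachecontrol.serialize import Serializer

# A third-party msgpack-backed serializer could be passed here in place of
# the default Serializer(); this sketch only shows where the hook sits.
sess = CacheControl(requests.Session(), serializer=Serializer())
resp = sess.get("https://example.com/")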

I do want to point out that all modern versions of pip will not show any message about not being able to compile msgpack; msgpack itself will attempt to compile and then fall back to pure Python without failing the install. Mentioning it in case that changes your opinion on the matter at all.

@StephanErb (Contributor Author) commented:

@ionrock I can absolutely relate to you not wanting an additional dependency. I will reconsider doing the pull request directly in pip.

@StephanErb closed this May 18, 2016
@ionrock (Contributor) commented Sep 20, 2016:

@StephanErb @dstufft

After some conversation in IRC, I'd be open to merging this. @StephanErb, do you mind taking a look at your patch to make sure things are still working?

This will also need some documentation updates to help communicate the change.

Thanks!

@ionrock reopened this Sep 20, 2016
@dsully commented Sep 28, 2016:

+1 - this would be great to have.

@StephanErb (Contributor Author) commented:

I have rebased this pull request. No code modifications were necessary and all tests are still passing. From my perspective this would be ready to merge.

def _b64_encode(s):
    if isinstance(s, text_type):
        return _b64_encode_str(s)
    return _b64_encode_bytes(s)
Review comment (Contributor):

These should probably be left around for the time being, as they are still used by the previous versioned loader. I would think that if a v3 cache item gets loaded, we'll see an exception.

Reply (Contributor Author):

I have only deleted the encode methods. The decode methods required for reading are still around, so I think loading an old compressed JSON item still works. There is also a test case for this; see my comment below.

I am happy to re-add the code if needed. Just want to make sure we are on the same page :)

Reply (Contributor Author):

Have you had time to think about this? Should I revert the change as you have indicated above?

Reply (Contributor):

@StephanErb you were right! Thanks!

assert _b64_decode_str(unicode_result) == self.unicode_string

bytes_result = _b64_encode(self.unicode_string)
assert _b64_decode_str(bytes_result) == self.unicode_string
Review comment (Contributor):

As per the comment regarding these functions, this test can stick around as well.

@ionrock (Contributor) commented Nov 25, 2016:

@StephanErb Just a couple small comments. Thanks for rebasing!

        # We have to decode our urllib3 data back into a unicode string.
        assert resp.data == 'Hello World'.encode('utf-8')

    def test_read_version_v2(self):
Review comment (Contributor Author):

This test ensures we can still read old data.

    if isinstance(s, text_type):
        return _b64_encode_str(s)
    return _b64_encode_bytes(s)
from .compat import HTTPResponse, pickle


def _b64_decode_bytes(b):
Review comment (Contributor Author):

The decode functions have not been removed, to ensure backwards compatibility.

Msgpack is fast, supports all major Python versions, and does not add
overhead for the serialization of large binary values (as commonly
handled by pip).
@StephanErb (Contributor Author) commented:

I have rebased this branch. Is there anything else that needs to be addressed?

@ionrock merged commit 5cf2852 into psf:master on Jan 11, 2017
@ionrock (Contributor) commented Jan 11, 2017:

Thanks @StephanErb for your patience with this patch!

@dstufft (Contributor) commented Jan 12, 2017:

💯

@StephanErb (Contributor Author) commented:

@ionrock would it be possible to make a new release including this change?

@ionrock (Contributor) commented Jan 27, 2017:

@StephanErb Yes! I just started at a new job and have been a bit swamped. I'll get it out today.

@dsully commented Jan 29, 2017:

@ionrock - any chance we can get a release soon? Thanks!

@ionrock (Contributor) commented Jan 30, 2017:

@dsully @StephanErb Just released 0.12.0. Let me know if you find any issues. Thanks!
