iter_content slow with large chunk size on HTTPS connection #3729

Closed
pdknsk opened this Issue Nov 25, 2016 · 22 comments


@pdknsk
pdknsk commented Nov 25, 2016
import requests

https = requests.get(url, stream=True)
for content in https.iter_content(100 * 2 ** 20):  # 100 MB chunks
    pass

This causes 100% CPU and slows down throughput to less than 1MB/s.

@Lukasa
Collaborator
Lukasa commented Nov 25, 2016

Sorry, but right now you haven't given us nearly enough to go on. Please answer all of the following questions:

  • How does this behave on smaller chunk sizes?
  • What about larger ones?
  • What version of Requests are you using?
  • What version of Python are you using?
  • Can you share the URL in question?
  • What is the throughput when using .content directly?
  • What do the response headers include?
@pdknsk
pdknsk commented Nov 25, 2016 edited

I had wrongly assumed it was a universal bug affecting every HTTPS connection, which is why I didn't provide further information. After some more investigation, it seems to affect only some CDNs, like Akamai and Azure.

Example (affected). Counter-example (not affected).

At 10MB, there is no noticeable increase in CPU load, and no impact on throughput. At 1GB, CPU load is at 100%, just like with 100MB, but throughput is reduced to below 100KB/s, compared to 1MB/s at 100MB.

I'm using the latest version of Requests on Python 2.7.12, installed through pip.

@Lukasa
Collaborator
Lukasa commented Nov 25, 2016

Is this true even with empty loop bodies as in your sample code? I ask because my normal instinct here would be that the repeated context switches back into Python are causing your performance problem, but larger chunks shouldn't exhibit that problem at all: they should be proportionally more efficient than the smaller chunk sizes.

This suggests that the problem is the manipulation of increasingly large chunks of memory. If this happens to you even with empty loop bodies, then we have something really interesting to investigate.

Do you have access to Python 3 on your machine?

@pdknsk
pdknsk commented Nov 25, 2016 edited

Is this true even with empty loop bodies as in your sample code?

Yes.

Do you have access to Python 3 on your machine?

I've tried Python 3.4.3 with requests 2.2.1 as provided by Ubuntu repositories, and the problem does not occur, even at 1GB chunk size.

I have figured out why. When I Ctrl+C the connection, the stack traces hold the clue.

Python 2.7.12

File "/home/user/.local/lib/python2.7/site-packages/OpenSSL/SSL.py", line 1299, in recv
  buf = _ffi.new("char[]", bufsiz)

Python 3.4.3

File "/usr/lib/python3.4/ssl.py", line 641, in read
  v = self._sslobj.read(len, buffer)

Maybe this works.

$ pip uninstall PyOpenSSL

Python 2.7.12 again.

File "/usr/local/lib/python2.7/ssl.py", line 643, in read
  v = self._sslobj.read(len)

And the problem does not occur. No decrease in throughput or increase in CPU load.

It still doesn't answer the question why it manifests only with large chunk size in the first place.

@pdknsk
pdknsk commented Nov 25, 2016 edited

I've also figured out why the counter-example provided a few posts earlier does not exhibit the problem: it redirects to an HTTP server. So it does affect every HTTPS connection after all.

@pdknsk
pdknsk commented Nov 25, 2016 edited

I had only installed PyOpenSSL to silence the InsecurePlatformWarning on Python 2.7.6, which I've since replaced with a self-compiled Python 2.7.12. PyOpenSSL is now obsolete for me, and the problem is fixed.

I think there is still a bug here somewhere related to chunk size.

@pdknsk
pdknsk commented Nov 25, 2016 edited

Unfortunately, I've just noticed that some other packages installed through pip since then also require PyOpenSSL.

@Lukasa
Collaborator
Lukasa commented Nov 25, 2016

Ok, so then the next question is: is this merely a case of CFFI buffer allocation being slow, or is there something else going on? I'll need to investigate this further, as PyOpenSSL may not be at fault directly: it may be something to do with how we use it.

@Lukasa
Collaborator
Lukasa commented Nov 26, 2016

@pdknsk It would be interesting to see if you continue to have this problem on Python 3 with pip install requests[security]. That will also bring in PyOpenSSL, which should cause the problem again.

@Lukasa
Collaborator
Lukasa commented Nov 26, 2016

So, a quick test reveals that our use of CFFI is almost certainly the bottleneck here. I ran the following script (note the import of PyOpenSSL's FFI object; this was intended only for consistency and ease):

from OpenSSL._util import ffi

MB = 1024 * 1024

def test(alloc_size):
    # Allocate a zeroed buffer and immediately discard it.
    ffi.new("char[]", alloc_size)

import timeit

for size in [1, 10, 100, 1000]:
    print("Doing allocations of {}MB".format(size))
    size = size * MB
    print(timeit.repeat(
        "test({})".format(size), setup="from __main__ import test", number=100
    ))

That gave me the following results:

% python test_cffi.py                           
Doing allocations of 1MB
[0.004097938537597656, 0.003545999526977539, 0.003551006317138672]
Doing allocations of 10MB
[0.07877898216247559, 0.07785892486572266, 0.06536316871643066]
Doing allocations of 100MB
[3.7023580074310303, 3.6858859062194824, 3.602185010910034]
Doing allocations of 1000MB
[65.49478697776794, 67.42595791816711, 67.89322805404663]

Note, essentially, that each tenfold increase in allocation size increases the cost of allocation by a factor of between 20 and 40. That's really quite unexpected. The next step is going to be working out whether this is a problem with CFFI or with malloc more generally.

@Lukasa
Collaborator
Lukasa commented Nov 26, 2016

So I can't reproduce this behaviour in plain C code. That suggests that the problem is elsewhere. We might need to look at CFFI here.

@Lukasa
Collaborator
Lukasa commented Nov 27, 2016

Ok, this is kinda CFFI's fault, but not really.

When we do the initial buffer allocation CFFI kindly gives us a zeroed buffer. This zeroed buffer requires something that is basically the equivalent of malloc followed by memset (it's actually calloc, but that's neither here nor there). That means that allocations of buffers like this in CFFI have an approximately linear cost in the size of the allocation.

So, this isn't really CFFI's fault. Its strategy is reasonable. This is a problem higher up the stack. In the case of reading from a socket we don't need a zeroed buffer: we know how many bytes the code will write into it, so we can avoid doing anything silly. That's why CPython doesn't encounter the issue: I think they just directly malloc the underlying byte buffer.

I guess the first line of attack would be to see if CFFI is interested in adding a non-zeroed buffer acquisition API. That, however, only solves the problem on newer CFFIs.

Another option is to adjust the logic in either PyOpenSSL or Requests to quietly ignore really large arguments to recv and to clamp them to the size of the socket buffer in the kernel. The socket buffer represents a hard upper limit on how much data the kernel can actually return in one call to recv on a socket. So we could pursue that fix as well.
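The clamping idea can be sketched in a few lines. This is a hypothetical illustration (the names `clamped_recv` and `requested` are mine, not from any library): since a single recv can never return more than the kernel's receive buffer holds, there is no point allocating more than that.

```python
import socket

# Hypothetical sketch: cap each recv() at the kernel's receive buffer size.
# One call to recv() can never return more bytes than SO_RCVBUF anyway, so
# a huge requested size only wastes time zeroing memory that goes unused.
def clamped_recv(sock, requested):
    rcvbuf = sock.getsockopt(socket.SOL_SOCKET, socket.SO_RCVBUF)
    return sock.recv(min(requested, rcvbuf))
```

A caller asking for a 100 MB chunk would then trigger many small, cheap allocations instead of one enormous zeroed one.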

@pdknsk
pdknsk commented Nov 27, 2016 edited

I read the CFFI documentation briefly yesterday and noticed that ffi.new returns zero-initialised memory, but I didn't mention it because I'm unfamiliar with the machinery involved here and wasn't sure that was the performance limitation.

I also noticed the function ffi.new_allocator, which provides a way to use malloc directly.
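For illustration, here is a minimal sketch of that API, assuming the cffi package is installed. Passing should_clear_after_alloc=False makes the allocator skip the zeroing pass, so the memory comes back uninitialised and every byte must be written before it is read:

```python
import cffi

ffi = cffi.FFI()

# An allocator that skips the memset: the returned buffer is uninitialised,
# which is fine for a socket read buffer that will be overwritten anyway.
new_nonzero = ffi.new_allocator(should_clear_after_alloc=False)

# Unlike ffi.new(), this does not pay a cost linear in the buffer size.
buf = new_nonzero("char[]", 100 * 1024 * 1024)
```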

@Lukasa
Collaborator
Lukasa commented Nov 27, 2016

Yup, this has already been proposed as an alternative solution in the relevant pyopenssl issue.

@pdknsk
pdknsk commented Nov 27, 2016 edited

I just tried the benchmark you posted earlier, with the line from pyca/pyopenssl#577, and it returns instantly regardless of allocation size, as you'd expect. Problem solved.

@Lukasa
Collaborator
Lukasa commented Nov 28, 2016

Ok, the underlying issue at PyOpenSSL has had a fix merged. This should drastically improve our performance in this edge case. Thanks for the report @pdknsk and thanks for working with us to narrow down the problem!

@Lukasa Lukasa closed this Nov 28, 2016
@pdknsk
pdknsk commented Nov 28, 2016

It's interesting how a rare problem produced a fix with much wider application. I figure this will save quite a bit of CPU and thus electricity, and perhaps even yield noticeable performance improvements for every HTTPS connection on slow devices. A problem well worth reporting.

@njsmith
njsmith commented Dec 5, 2016

...What's particularly mysterious here is that for large buffers (and these definitely qualify), calloc should be about as cheap as malloc, and much cheaper than malloc+memset! Once you pass the size threshold where the heap allocator decides to go to the kernel to satisfy the request (128 KiB for glibc by default), the kernel returns lazily-zeroed memory regardless. So malloc and calloc should be indistinguishable, and both should be O(1), because the kernel zeroes the memory lazily as it's faulted in.
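This claim can be checked from Python via ctypes. A rough sketch, assuming a glibc-style libc where allocations this large are satisfied by mmap; the timing will vary by platform, which is exactly the macOS discrepancy discussed below:

```python
import ctypes
import ctypes.util
import time

libc = ctypes.CDLL(ctypes.util.find_library("c"))
libc.malloc.restype = ctypes.c_void_p
libc.calloc.restype = ctypes.c_void_p
libc.free.argtypes = [ctypes.c_void_p]

SIZE = 256 * 1024 * 1024  # well above the allocator's mmap threshold

# calloc of a huge block: the kernel hands back lazily-zeroed pages,
# so this should complete in roughly constant time on glibc.
start = time.perf_counter()
p = libc.calloc(1, SIZE)
calloc_time = time.perf_counter() - start
libc.free(p)

print("calloc of 256MB took {:.6f}s".format(calloc_time))
```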

@njsmith
njsmith commented Dec 5, 2016

...darn, I got curious and investigated further. I have poor impulse control. Anyway, this is totally a bug in cffi: https://bitbucket.org/cffi/cffi/issues/295/cffinew-is-way-slower-than-it-should-be-it

@njsmith
njsmith commented Dec 5, 2016

...annnnnd it's also a bug in MacOS: pyca/pyopenssl#578 (comment)

@pdknsk what have you done

@Lukasa
Collaborator
Lukasa commented Dec 5, 2016

So @pdknsk, after further investigation prodded by @njsmith, you've actually stumbled onto two bugs that are wasting a lot of CPU and electricity.

The first is that CFFI should be using a more efficient allocation strategy. The second, and much more important, is that there is a range of allocation sizes between about 128 kB and 125 MB where macOS (and presumably iOS/watchOS/tvOS as well) forcibly pages in and zeroes memory when it doesn't need to, wasting a ton of CPU in the process.

Nicely done!

@Lukasa
Collaborator
Lukasa commented Dec 5, 2016

This bug now also filed as rdar://29508271
