When stream=True iter_content(chunk_size=None) reads the input as a single big chunk #5536
Comments
@sigmavirus24 I don't think the server sends the file all at once. The example above produces no output for ~30 seconds and then prints 533830860. This starts printing right away:

```python
from requests import get

URL = 'https://dl.fedoraproject.org/pub/alt/iot/32/IoT/x86_64/images/Fedora-IoT-32-20200603.0.x86_64.raw.xz'
r = get(URL, stream=True)
for b in r.iter_content(chunk_size=2**23):
    print(len(b))
```
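The buffering difference can be modeled locally without any network access. A toy sketch (my own illustration, not requests internals; it assumes `chunk_size=None` behaves like a single read-to-EOF while an explicit chunk size bounds each read):

```python
import io

# Toy stand-in for a response body (not a real socket)
stream = io.BytesIO(b"x" * 100)

# chunk_size=None behaves like one unbounded read: everything at once
print(len(stream.read()))  # 100

# An explicit chunk_size bounds each read, so data is yielded piecewise
stream.seek(0)
sizes = []
while chunk := stream.read(32):
    sizes.append(len(chunk))
print(sizes)  # [32, 32, 32, 4]
```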
I have the same issue with 2.24.0. When I use a …
Can confirm the same is occurring. Works as expected with an explicit chunk size. Can try to put together a reproducible example if that's helpful?
As promised, here's a reproducible example against httpbin.org:

Run it and you'll see that the whole body arrives as a single chunk. Change the chunk_size to 1 and everything works nicely (albeit with high overhead). If somebody can point me in the right direction, I'm happy to investigate this and do what is required to fix it.
Any resolution to this? I am also still seeing this on v2.25.1
Hi @stephen-goveia, this is a behavior in urllib3 as noted in urllib3/urllib3#2123. We aren't able to change it in Requests, so the outcome will be determined by whether this makes it into the urllib3 v2 release. |
thanks @nateprewitt! |
Hi. I don't understand why this issue is still open.
Even after setting stream=True this is still an issue:

```python
import requests
import time

chunk_size = None
URL = 'https://httpbin.org/drip?duration=20&numbytes=4'
r = requests.get(URL, stream=True)
t = time.monotonic()
for x in r.iter_content(chunk_size=chunk_size):
    t2 = time.monotonic()
    print(f'{t2 - t}')
    t = time.monotonic()
```

prints:
Please keep in mind that I'm making this comment as a user, not as a contributor. You're right, it is... but please read the documentation.
What should the module do when you ask it not to download everything at once, but also to download chunks of "nothing"? Just check the Content-Length header and set a suitable chunk size when dealing with large files.
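That suggestion can be sketched as a small helper (hypothetical, not part of requests; the name `pick_chunk_size`, the target of ~100 reads, and the bounds are my own assumptions):

```python
def pick_chunk_size(content_length, target_chunks=100,
                    minimum=8192, maximum=2 ** 23):
    """Pick an iter_content chunk size from a Content-Length value.

    Hypothetical helper: aims for roughly `target_chunks` reads,
    clamped to the [minimum, maximum] byte range.
    """
    if content_length is None:  # header absent, e.g. chunked encoding
        return minimum
    return max(minimum, min(maximum, content_length // target_chunks))

# For the ~534 MB Fedora image mentioned above this gives ~5.3 MB chunks
print(pick_chunk_size(533830860))  # 5338308
```

The result could then be passed as `r.iter_content(chunk_size=...)` instead of `None`.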
It is not only about large files, it is also about SSE (server-sent events). They are streamed, and clients expect them to arrive directly after the server sends them.
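To illustrate why the buffering breaks SSE: events are delimited by blank lines and must be parsed incrementally from whatever chunks happen to arrive. A minimal sketch of such a parser (my own simplified version, handling only `data:` fields and `\n` line endings, fed here from a list instead of a live response):

```python
def iter_sse_data(chunks):
    """Yield the data payload of each SSE event from an iterator of
    text chunks. Simplified: 'data:' fields only, '\n' line endings."""
    buffer = ""
    for chunk in chunks:
        buffer += chunk
        # A blank line terminates an event
        while "\n\n" in buffer:
            raw_event, buffer = buffer.split("\n\n", 1)
            data = [line[len("data:"):].lstrip()
                    for line in raw_event.split("\n")
                    if line.startswith("data:")]
            if data:
                yield "\n".join(data)

# Chunk boundaries need not align with event boundaries
chunks = ["data: first\n\nda", "ta: second\n\n"]
print(list(iter_sse_data(chunks)))  # ['first', 'second']
```

With working streaming one could feed this from `r.iter_content(chunk_size=None, decode_unicode=True)`; with the buffering behavior described here, no event is yielded until the connection closes.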
No movement on this in ~8 months... Any update? |
Possible workaround using the underlying `resp.raw.stream()`:

```python
resp = requests.get("something", stream=True)
for chunk in resp.raw.stream():
    print(f"chunk size: {len(chunk)}")
```
@mbhynes Not sure what you were doing to have that "work", but it certainly doesn't do what I'd expect...

```python
import requests

url = "https://httpbin.org/drip?duration=2&numbytes=8"
resp = requests.get(url, stream=True)
for chunk in resp.raw.stream():
    print(f"chunk size: {len(chunk)}")
```

just gives me a single 8-byte chunk back after 2 seconds, rather than 8 single-byte chunks every few hundred milliseconds. I'd assume your endpoint happens to be returning the data via a "chunked transfer encoding", which has been able to handle streaming data in chunks for a long time already, but you could check by doing:

```python
print(resp.headers.get("transfer-encoding"))
```

That said, I've created a pull request with
According to the documentation, when stream=True, iter_content(chunk_size=None) "will read data as it arrives in whatever size the chunks are received", but it actually collects all input into a single big bytes object, consuming large amounts of memory and entirely defeating the purpose of iter_content().
Expected Result
iter_content(chunk_size=None) yields "data as it arrives in whatever size the chunks are received".
Actual Result
A single big chunk
Reproduction Steps
prints
System Information