
Don't read whole files into memory for multipart file uploads. #295

Closed

zacharyvoase opened this issue Nov 29, 2011 · 19 comments

Comments

@zacharyvoase

This: https://github.com/kennethreitz/requests/blob/develop/requests/models.py#L375 indicates that files passed in the files argument are read into memory in their entirety. This is likely to get pathological for large files; the underlying poster library supports streaming requests. The solution to this issue is probably related to #292.
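For context, a minimal sketch of the pattern being described, with hypothetical names rather than the actual requests internals: an eager encoder's memory use grows with the total size of the uploaded files.

    def encode_multipart_eagerly(files, boundary="BoUnDaRy"):
        # Hypothetical eager encoder, not the real requests code: every
        # file is pulled into RAM in full before anything is sent.
        parts = []
        for name, fp in files.items():
            parts.append(
                ('--%s\r\nContent-Disposition: form-data; name="%s"\r\n\r\n'
                 % (boundary, name)).encode("ascii"))
            parts.append(fp.read())  # whole file read into memory at once
            parts.append(b"\r\n")
        parts.append(("--%s--\r\n" % boundary).encode("ascii"))
        return b"".join(parts)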

@kennethreitz
Contributor

Correct. This is a bit of a fundamental limitation of the current architecture, but it's something I'd like to revisit on the long-term roadmap.

Requests no longer utilizes Poster internally.

I believe @mitsuhiko has some thoughts on this.

@cpatrick

I see that this is planned. I just wanted to verify that this is definitely an issue with large files. I'm taking a look at poster for handling large uploads.

@cpatrick

@kennethreitz, if I wanted to take a look at implementing this and contributing it to requests, where would you suggest I start? I don't want to step on any toes :).

@rlskoeser

Has anyone made any progress on this? I need it for a project where I'd like to use requests but have to handle large files. If there's any way I can help get something implemented, I'd be willing to contribute. If anyone knows of a workaround (e.g., poster + requests) that would be good enough until requests handles this better, I'd be interested in that as well.

@zigmonty
Contributor

OK, so this is something that's been bothering me as well (I've got use cases involving posting large files, potentially multi-GB), so I took a crack at it.

My changes are based on the observation that, except for a couple of lines of code in models.py which read any passed files into RAM, and the encode_multipart_formdata() function itself, everything else is happy with the files remaining as file objects. In particular, the httplib backend is happy for the body to be a file.

The complication, of course, is that we can't send the file directly as the entire body; it has to be framed by multipart/form-data boundaries and headers. My solution was to create a file-like class that stores its contents as fragments of the whole body. These fragments may be either strings (framing text, string fields) or file objects. When httplib ultimately tries to send it, it walks through its fragments, coalescing them into blocks of the requested size. No more than 8 KB of each file is read at a time (the block size httplib passes to read()).
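A minimal sketch of that fragment idea as described (not the actual commit referenced below): read(n) walks the fragment list and never pulls more than n bytes of any file into memory per call.

    class MultipartBody(object):
        # Sketch of the described approach: fragments are bytes
        # (framing text, string fields) or file-like objects.
        def __init__(self, fragments):
            self.fragments = list(fragments)
            self.index = 0

        def read(self, size=8192):
            chunks, remaining = [], size
            while remaining > 0 and self.index < len(self.fragments):
                frag = self.fragments[self.index]
                if isinstance(frag, bytes):
                    chunk, rest = frag[:remaining], frag[remaining:]
                    if rest:
                        self.fragments[self.index] = rest
                    else:
                        self.index += 1
                else:
                    # File-like fragment: read at most `remaining` bytes,
                    # so a multi-GB file is never held in memory in full.
                    chunk = frag.read(remaining)
                    if not chunk:  # fragment exhausted
                        self.index += 1
                        continue
                chunks.append(chunk)
                remaining -= len(chunk)
            return b"".join(chunks)

httplib can send such an object as a body because it only ever calls read() on it in fixed-size blocks.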

Technically the change is to the urllib3 package, with a small change to models.py to take advantage of it.

The commit is 0254278, although I can't say I'm particularly happy with it (hence no pull request). I have a feeling I'm missing something and probably breaking code (although it passes the tests). Anyone got any ideas? Is this the sort of thing that should be fixed at an architectural level instead?

@Lukasa
Member

Lukasa commented Aug 22, 2012

The bulk of your changes are in urllib3, so we should probably see if @shazow is interested in them. (NB: the reason the tests pass is that requests doesn't test urllib3. =D )

@zigmonty
Contributor

Yes and no. The encode_multipart_formdata() function is in urllib3, but the call to it is in requests, in models.py (there's another call to it in urllib3, but that's unused by requests as far as I can see, since the body is already encoded by the time it gets that far). Requests encodes the body before handing the request off to urllib3; it just happens to reuse that function by importing it from urllib3. There's no reason they have to share the same copy of the function other than sanity. :)

Anyway, I'm just pointing out that, if @shazow doesn't want these sorts of changes in urllib3, the patch can be reworked to keep that functionality local to requests. By the way, does anyone know why requests doesn't just hand the unencoded files off to urllib3, rather than encoding them itself?

In any case, I'm only posting this as a "could we do something like this?" suggestion rather than a pull request. There are a few flaws in my approach that even I noticed, and I don't really have enough knowledge of the code and all its interactions to make this change. It's also almost certainly a performance regression for small files.

@kennethreitz
Contributor

So, the design decision that we came up with was to use iterators for streaming uploads.

If you provide a generator, for example, instead of a string for the request body, it'll be uploaded as a stream.

That's what urllib3 needs to support.
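Assuming that generator support (it later landed via #952), usage might look like the sketch below; the endpoint and file name are placeholders.

    import requests

    def file_chunks(path, chunk_size=8192):
        # Yield a large file in fixed-size blocks instead of reading it whole.
        with open(path, "rb") as fp:
            for chunk in iter(lambda: fp.read(chunk_size), b""):
                yield chunk

    # A generator body is sent as a stream rather than buffered in memory.
    requests.post("http://httpbin.org/post", data=file_chunks("big.iso"))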

@zigmonty
Contributor

Yeah, that sounds like a much more flexible approach. One thing I noticed implementing my version, though, is that if you don't send a Content-Length, httpbin.org won't parse the request properly, and I think a lot of other servers/web services will also break. And urllib3/httplib isn't going to be able to calculate the Content-Length if the body is an arbitrary generator, right? So the calling code in requests will have to do that? Or use chunked encoding? Not a criticism, I'm just curious. :)

I'm assuming this also means a move away from httplib? As far as I can see, it only allows the body to be a string or a file-like object.
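For reference, chunked transfer encoding sidesteps the Content-Length problem by framing each block with its size; a rough sketch of that framing around an arbitrary generator body (hypothetical helper, not urllib3 code):

    def chunked_frames(body_iter):
        # Wrap a byte iterator in HTTP/1.1 chunked framing: each chunk
        # is "<size in hex>\r\n<data>\r\n", ended by a zero-length chunk.
        for block in body_iter:
            if block:
                yield ("%x\r\n" % len(block)).encode("ascii")
                yield block
                yield b"\r\n"
        yield b"0\r\n\r\n"  # terminator; no Content-Length needed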

@kennethreitz
Contributor

I believe we'll have to do chunked encoding, yes.

As for httplib, @shazow would be a better person to ask ;)

@shazow
Contributor

shazow commented Aug 22, 2012

See: urllib3/urllib3#51

@zigmonty
Contributor

Right, the approach you took is essentially the same as mine, but using a separate multipart encoding generator with a file-like object wrapped around it, whereas I combined the two into a single object (sort of). Your approach is much nicer, I think. One benefit of my approach, though, is that once you're in the middle of a large file, the read() call from httplib is passed fairly directly to the underlying file, whereas in that implementation the reads from the underlying file and the reads from httplib may use mismatched block sizes, which means a bunch of extra string concatenation and data always being left in leftover. It might be worth returning less data than requested by httplib once, to sync up, or something. Although my implementation has far too much string copying as well, and the block sizes involved in both seem a little wastefully small.
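A minimal sketch of the wrapper pattern being compared here (hypothetical, not the urllib3/urllib3#51 code): when httplib's read(n) doesn't line up with the generator's block sizes, the surplus has to be buffered and re-concatenated on the next call.

    class IterBodyAdapter(object):
        # Present a generator body as a file-like object; `leftover`
        # is the block-size mismatch buffer discussed above.
        def __init__(self, iterable):
            self.iterator = iter(iterable)
            self.leftover = b""

        def read(self, size=8192):
            data = self.leftover
            try:
                while len(data) < size:
                    data += next(self.iterator)  # extra concatenation here
            except StopIteration:
                pass
            self.leftover = data[size:]  # surplus carried to the next call
            return data[:size]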

Anyway, the last comment in that issue is six months old, and the discussion appears to have died with you asking if they wanted to do it. Are you still looking for someone to take this on, or do you already have a plan to do it yourself? If you're looking, what would be required before you'd accept a pull request? A specific approach? Lots of testing?

(Apologies for posting this on the wrong project btw)

@shazow
Contributor

shazow commented Aug 22, 2012

Don't think anyone is doing it. You're more than welcome to have at it. Requirements:

  1. Post in the right project. :)
  2. 100% unit test coverage.
  3. If possible, backwards-compatible changes. If not, then changes in as few places as possible.
  4. Consistent code style with the rest of the project (mostly PEP8).

@shrredd

shrredd commented Oct 20, 2012

Has there been any progress on this? We currently use poster + urllib2 for streaming uploads but having it built directly into requests would be much, much cleaner.

@sigmavirus24
Contributor

Just an update: this is entirely related to #952.

@sigmavirus24
Contributor

Since streaming requests were finished in #952, I'm going to close this. If this is still a problem, let me know and I'll reopen it.

@piotr-dobrogost

@sigmavirus24

I don't see how making it possible to stream a request by passing a generator/iterator as the data param's value could magically change how another part of the API (the files param) works.
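To make that distinction concrete (URL and file name hypothetical): the streaming support from #952 applies to the data param, while the files param still builds the whole multipart body in memory.

    import requests

    url = "http://httpbin.org/post"  # placeholder endpoint

    def file_chunks(path, size=8192):
        with open(path, "rb") as fp:
            for chunk in iter(lambda: fp.read(size), b""):
                yield chunk

    # Streams: a generator passed as `data` is sent chunked (#952).
    requests.post(url, data=file_chunks("big.iso"))

    # Does not stream: `files` still encodes the entire multipart body,
    # file contents included, in memory before sending.
    requests.post(url, files={"upload": open("big.iso", "rb")})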

@sigmavirus24
Contributor

@piotr-dobrogost I missed this line when I originally skimmed the commit. This is making me wonder how difficult streaming multipart file uploads could be.

@sigmavirus24 sigmavirus24 reopened this Jan 26, 2013
@sigmavirus24
Contributor

I answered my own question: RFC 2388. Now to get a cup of hot chocolate and a blanket to make consuming this document more enjoyable.
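For anyone following along, the multipart/form-data framing the RFC specifies looks roughly like this on the wire (boundary and names arbitrary), which is why a streaming encoder has to interleave framing text with file contents:

    POST /upload HTTP/1.1
    Content-Type: multipart/form-data; boundary=AaB03x

    --AaB03x
    Content-Disposition: form-data; name="upload"; filename="big.iso"
    Content-Type: application/octet-stream

    ...file bytes, streamed...
    --AaB03x--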
