
Don't read whole files into memory for multipart file uploads. #295

Closed

zacharyvoase opened this issue Nov 29, 2011 · 19 comments

Comments

@zacharyvoase

This: https://github.com/kennethreitz/requests/blob/develop/requests/models.py#L375 indicates that files passed in the files argument are read into memory in their entirety. This is likely to get pathological for large files; the underlying poster library supports streaming requests. The solution to this issue is probably related to #292.
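For context, a minimal sketch of the pattern being described, with hypothetical names rather than the actual requests internals: an eager encoder's memory use grows with the total size of the uploaded files.

    def encode_multipart_eagerly(files, boundary="BoUnDaRy"):
        # Hypothetical eager encoder, not the real requests code: every
        # file is pulled into RAM in full before anything is sent.
        parts = []
        for name, fp in files.items():
            parts.append(
                ('--%s\r\nContent-Disposition: form-data; name="%s"\r\n\r\n'
                 % (boundary, name)).encode("ascii"))
            parts.append(fp.read())  # whole file read into memory at once
            parts.append(b"\r\n")
        parts.append(("--%s--\r\n" % boundary).encode("ascii"))
        return b"".join(parts)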

@kennethreitz
Contributor

Correct. This is a bit of a fundamental limitation of the current architecture, but it's something I'd like to revisit on the long-term roadmap.

Requests no longer utilizes Poster internally.

I believe @mitsuhiko has some thoughts on this.

@cpatrick

I see that this is planned. I just wanted to verify that this is definitely an issue with large files. I'm taking a look at poster for handling large uploads.

@cpatrick

@kennethreitz, if I wanted to take a look at implementing this and contributing it to requests, where would you suggest I start? I don't want to step on any toes :).

@rlskoeser

Has anyone made any progress on this? I need it for a project where I'd like to use requests but have to handle large files. If there's any way I can help get something implemented, I'd be willing to contribute. If anyone knows of a workaround (e.g., poster + requests) that would be good enough until requests handles this better, I'd be interested in that as well.

@zigmonty
Contributor

OK, so this is something that's been bothering me as well (I've got use cases involving posting large files, potentially multi-GB), so I took a crack at it.

My changes are based on the observation that, except for a couple of lines of code in models.py which read any passed files into RAM, and the encode_multipart_formdata() function itself, everything else is happy with the files remaining as file objects. In particular, the httplib backend is happy for the body to be a file.

The complication, of course, is that we can't send the file directly as the entire body; it has to be framed by multipart/form-data boundaries and headers. My solution was to create a file-like class that stores its contents as fragments of the whole body. These fragments may be either strings (framing text, string fields) or file objects. When httplib ultimately tries to send it, it walks through its fragments, coalescing them into blocks of the requested size. No more than 8 KB of each file is read at a time (the block size httplib passes to read()).
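A minimal sketch of that fragment idea as described (not the actual commit referenced below): read(n) walks the fragment list and never pulls more than n bytes of any file into memory per call.

    class MultipartBody(object):
        # Sketch of the described approach: fragments are bytes
        # (framing text, string fields) or file-like objects.
        def __init__(self, fragments):
            self.fragments = list(fragments)
            self.index = 0

        def read(self, size=8192):
            chunks, remaining = [], size
            while remaining > 0 and self.index < len(self.fragments):
                frag = self.fragments[self.index]
                if isinstance(frag, bytes):
                    chunk, rest = frag[:remaining], frag[remaining:]
                    if rest:
                        self.fragments[self.index] = rest
                    else:
                        self.index += 1
                else:
                    # File-like fragment: read at most `remaining` bytes,
                    # so a multi-GB file is never held in memory in full.
                    chunk = frag.read(remaining)
                    if not chunk:  # fragment exhausted
                        self.index += 1
                        continue
                chunks.append(chunk)
                remaining -= len(chunk)
            return b"".join(chunks)

httplib can send such an object as a body because it only ever calls read() on it in fixed-size blocks.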

Technically the change is to the urllib3 package, with a small change to models.py to take advantage of it.

The commit is 0254278, although I can't say I'm particularly happy with it (hence no pull request). I have a feeling I'm missing something and probably breaking code (although it passes the tests). Anyone got any ideas? Is this the sort of thing that should be fixed at an architectural level instead?

@Lukasa
Member

Lukasa commented Aug 22, 2012

The bulk of your changes are in urllib3, so we should probably see if @shazow is interested in them. (NB: the reason the tests pass is that requests doesn't test urllib3. =D )

@zigmonty
Contributor

Yes and no. The encode_multipart_formdata() function is in urllib3, but the call to it is in requests, in models.py (there's another call to it in urllib3, but that's unused by requests as far as I can see, since the body is already encoded by the time it gets that far). Requests encodes the body before handing the request off to urllib3; it just happens to reuse that function by importing it from urllib3. There's no reason they have to share the same copy of the function other than sanity. :)

Anyway, I'm just pointing out that, if @shazow doesn't want these sorts of changes in urllib3, the patch can be reworked to keep that functionality local to requests. By the way, does anyone know why requests doesn't just hand the unencoded files off to urllib3, rather than encoding them itself?

In any case, I'm only posting this as a "could we do something like this?" suggestion rather than a pull request. There are a few flaws in my approach that even I noticed, and I don't really have enough knowledge of the code and all its interactions to make this change. It's also almost certainly a performance regression for small files.

@kennethreitz
Contributor

So, the design decision that we came up with was to use iterators for streaming uploads.

If you provide a generator, for example, instead of a string for the request body, it'll be uploaded as a stream.

That's what urllib3 needs to support.
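Assuming that generator support (it later landed via #952), usage might look like the sketch below; the endpoint and file name are placeholders.

    import requests

    def file_chunks(path, chunk_size=8192):
        # Yield a large file in fixed-size blocks instead of reading it whole.
        with open(path, "rb") as fp:
            for chunk in iter(lambda: fp.read(chunk_size), b""):
                yield chunk

    # A generator body is sent as a stream rather than buffered in memory.
    requests.post("http://httpbin.org/post", data=file_chunks("big.iso"))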

@zigmonty
Contributor

Yeah, that sounds like a much more flexible approach. One thing I noticed implementing my version, though, is that if you don't send a Content-Length, httpbin.org won't parse the request properly, and I think a lot of other servers/web services will also break. And urllib3/httplib isn't going to be able to calculate the Content-Length if the body is an arbitrary generator, right? So the calling code in requests will have to do that? Or use chunked encoding? Not a criticism, I'm just curious. :)

I'm assuming this also means a move away from httplib? As far as I can see, it only allows the body to be a string or a file-like object.
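For reference, chunked transfer encoding sidesteps the Content-Length problem by framing each block with its size; a rough sketch of that framing around an arbitrary generator body (hypothetical helper, not urllib3 code):

    def chunked_frames(body_iter):
        # Wrap a byte iterator in HTTP/1.1 chunked framing: each chunk
        # is "<size in hex>\r\n<data>\r\n", ended by a zero-length chunk.
        for block in body_iter:
            if block:
                yield ("%x\r\n" % len(block)).encode("ascii")
                yield block
                yield b"\r\n"
        yield b"0\r\n\r\n"  # terminator; no Content-Length needed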

@kennethreitz
Contributor

I believe we'll have to do chunked encoding, yes.

As for httplib, @shazow would be a better person to ask ;)

@shazow
Contributor

shazow commented Aug 22, 2012

See: urllib3/urllib3#51

@zigmonty
Contributor

Right, the approach you took is essentially the same as mine, but using a separate multipart encoding generator with a file-like object wrapped around it, whereas I combined the two into a single object (sort of). Your approach is much nicer, I think. One benefit of my approach, though, is that once you're in the middle of a large file, the read() call from httplib is passed fairly directly to the underlying file, whereas in that implementation the reads from the underlying file and the reads from httplib may use mismatched block sizes, which means a bunch of extra string concatenation and data always being left in leftover. It might be worth returning less data than requested by httplib once, to sync up, or something. Although my implementation has far too much string copying as well, and the block sizes involved in both seem a little wastefully small.
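A minimal sketch of the wrapper pattern being compared here (hypothetical, not the urllib3/urllib3#51 code): when httplib's read(n) doesn't line up with the generator's block sizes, the surplus has to be buffered and re-concatenated on the next call.

    class IterBodyAdapter(object):
        # Present a generator body as a file-like object; `leftover`
        # is the block-size mismatch buffer discussed above.
        def __init__(self, iterable):
            self.iterator = iter(iterable)
            self.leftover = b""

        def read(self, size=8192):
            data = self.leftover
            try:
                while len(data) < size:
                    data += next(self.iterator)  # extra concatenation here
            except StopIteration:
                pass
            self.leftover = data[size:]  # surplus carried to the next call
            return data[:size]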

Anyway, the last comment in that issue is six months old, and the discussion appears to have died with you asking if they wanted to do it. Are you still looking for someone to take this on, or do you already have a plan to do it yourself? If you're looking, what would be required before you'd accept a pull request? A specific approach? Lots of testing?

(Apologies for posting this on the wrong project btw)

@shazow
Contributor

shazow commented Aug 22, 2012

Don't think anyone is doing it. You're more than welcome to have at it. Requirements:

  1. Post in the right project. :)
  2. 100% unit test coverage.
  3. If possible, backwards-compatible changes. If not, then changes in as few places as possible.
  4. Consistent code style with the rest of the project (mostly PEP8).

@shrredd

shrredd commented Oct 20, 2012

Has there been any progress on this? We currently use poster + urllib2 for streaming uploads but having it built directly into requests would be much, much cleaner.

@sigmavirus24
Contributor

Just an update: this is entirely related to #952.

@sigmavirus24
Contributor

Since streaming requests were finished in #952, I'm going to close this. If this is still a problem, let me know and I'll reopen it.

@piotr-dobrogost

@sigmavirus24

I don't see how making it possible to stream a request by passing a generator/iterator as the data param's value could magically change how another part of the API (the files param) works.
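To make that distinction concrete (URL and file name hypothetical): the streaming support from #952 applies to the data param, while the files param still builds the whole multipart body in memory.

    import requests

    url = "http://httpbin.org/post"  # placeholder endpoint

    def file_chunks(path, size=8192):
        with open(path, "rb") as fp:
            for chunk in iter(lambda: fp.read(size), b""):
                yield chunk

    # Streams: a generator passed as `data` is sent chunked (#952).
    requests.post(url, data=file_chunks("big.iso"))

    # Does not stream: `files` still encodes the entire multipart body,
    # file contents included, in memory before sending.
    requests.post(url, files={"upload": open("big.iso", "rb")})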

@sigmavirus24
Contributor

@piotr-dobrogost I missed this line when I originally skimmed the commit. This is making me wonder how difficult streaming multipart file uploads could be.

@sigmavirus24 sigmavirus24 reopened this Jan 26, 2013
@sigmavirus24
Contributor

I answered my own question: RFC 2388. Now to get a cup of hot chocolate and a blanket to make consuming this document more enjoyable.
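For anyone following along, the multipart/form-data framing the RFC specifies looks roughly like this on the wire (boundary and names arbitrary), which is why a streaming encoder has to interleave framing text with file contents:

    POST /upload HTTP/1.1
    Content-Type: multipart/form-data; boundary=AaB03x

    --AaB03x
    Content-Disposition: form-data; name="upload"; filename="big.iso"
    Content-Type: application/octet-stream

    ...file bytes, streamed...
    --AaB03x--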
