Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Enable request body streaming with an IO object #409

Merged
merged 11 commits into from
May 19, 2017

Conversation

janko
Copy link
Member

@janko janko commented May 1, 2017

This change adds the ability to stream content of an IO object (an object that responds to #read) into the request body. This is convenient when you want to send large amounts of data as the request body, but you don't want to load all of it into memory.

HTTP.post("http://example.org/upload", body: File.open("video.mp4"))
HTTP.post("http://example.org/upload", body: StringIO.new("file data"))

It was already possible to send the request body in chunks by passing an Enumerable to :body, but that can (and should) only be used for making "Transfer-Encoding: chunked" requests.

This will also allow us to extend form_data.rb to stream the multipart-encoded request, instead having to load the whole body into memory before writing it to the socket.

@janko janko force-pushed the enable-request-body-streaming branch from 68c2f43 to c4798f5 Compare May 1, 2017 09:28
@ixti ixti self-assigned this May 1, 2017
@request_header << "#{Headers::CONTENT_LENGTH}: 0"
elsif @body.is_a?(Enumerable) && CHUNKED != @headers[Headers::TRANSFER_ENCODING]
raise(RequestError, "invalid transfer encoding")
unless @headers[Headers::CONTENT_LENGTH]
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

it's better to use guard here:

return if @headers[Headers::CONTENT_LENGTH]

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I agree, updated in abe1b08.

if @body.is_a?(String)
@request_header << "#{Headers::CONTENT_LENGTH}: #{@body.bytesize}"
elsif @body.respond_to?(:read)
@request_header << "#{Headers::CONTENT_LENGTH}: #{@body.size}"
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

if we are checking respond_to? :read we should also check here respond_to? size, no? So I believe the best will be:

if @body.respond_to?(:bytesize)
  @request_header << "#{Headers::CONTENT_LENGTH}: #{@body.bytesize}"
elsif @body.respond_to?(:size) && @body.respond_to?(:read)
  @request_header << "#{Headers::CONTENT_LENGTH}: #{@body.size}"
elsif @body.nil?
  @request_header << "#{Headers::CONTENT_LENGTH}: 0"
elsif ...
end

@httprb/core please take a look if you see any gotchas here

Copy link
Member Author

@janko janko May 1, 2017

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The reason why I wasn't checking here whether the IO object responds to size was because I wanted that it's required for the IO object to respond to size if Content-Length wasn't passed in, because for non-chunked requests Content-Length header is required (I would document this requirement in the wiki).

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I mean you are checking for #read method, but using #size then which might be not available:

reader, writer = IO.pipe

reader.respond_to? :read # => true
reader.respond_to? :size # => false

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Oh. I got what you mean. I guess then we should raise an exception:

elsif @body.respond_to? :read
  raise "IO object must respond to #size" unless @body.respond_to? :size
  # ...

Copy link
Member Author

@janko janko May 1, 2017

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yes, when IO.pipe is passed in and Content-Length is not explicitly passed in, then the request will fail with a NoMethodError because IO.pipe doesn't respond to #size. This can be solved by passing Content-Length explicitly:

reader, writer = IO.pipe

writer.write("content")
writer.close

HTTP.headers("Content-Length" => 7).post("http://example.org", body: reader)

If the exception isn't raised when Content-Length is absent and the IO doesn't respond to #size, the user could accidentally send the request without a Content-Length, which isn't a valid HTTP request (unless it's a chunked request).

The error is currently just a NoMethodError, which probably doesn't communicate well to the user, would it help if I improved the error message? Or are you in disagreement with this behaviour in general?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I saw your last comment only after I posted the previous one, great, I'll update the error message then.

Copy link
Member Author

@janko janko May 1, 2017

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Updated in e53c71f

Copy link
Member

@ixti ixti left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM!

@ixti
Copy link
Member

ixti commented May 1, 2017

Let's wait for input (if any) of @httprb/core and then I'll fix rubocop and merge!

@tarcieri
Copy link
Member

tarcieri commented May 1, 2017

IO is Enumerable:

http://ruby-doc.org/core-2.0.0/IO.html

I'm curious why this wasn't working with the existing Enumerable support

@janko
Copy link
Member Author

janko commented May 1, 2017

@tarcieri While IO objects can be streamed using their Enumerable interface, there are several limitations of this:

First, HTTP.rb's Enumerable support implies Transfer-Encoding: chunked requests, so currently it's not possible to use it for streaming the body of a regular request.

Second, IO#each iterates over lines of the file, which means if we have big binary file which doesn't have a single newline, the line will equal to the whole content, so the whole file will potentially be loaded into memory during upload.

Third, by relying only on #read and #size, we allow a much broader array of inputs that aren't necessarily descendants of IO, such as:

  • Tempfile (it's just a delegator to the underlying File object, so it doesn't implement Enumerable)
  • ActionDispatch::Http::UploadedFile (a delegator to the underlying Tempfile object)
  • FormData::Multipart which I plan to modify to respond to #read for streaming multipart uploads
  • Rack input of any web server (Unicorn::StreamInput, Passenger::TeeInput, ...)
  • Shrine::UploadedFile from the Shrine gem (and also IO objects created by data_uri and rack_file plugins)
  • Down::ChunkedIO from the Down gem

@ixti
Copy link
Member

ixti commented May 1, 2017

We definitely need to improve work with IO bodies... so that it will allow streaming chunked data of undetermined length (e.g. stream of JSON documents)

@janko
Copy link
Member Author

janko commented May 1, 2017

We definitely need to improve work with IO bodies... so that it will allow streaming chunked data of undetermined length (e.g. stream of JSON documents)

Hmm, isn't that what we already have with the Enumerable support?

@ixti
Copy link
Member

ixti commented May 1, 2017

@janko-m I guess so:

require "http"
require "yaml"

rd, wd = IO.pipe

writer = Thread.new(wd) do |io|
  Thread.stop

  puts "READER: Start feeding data"

  %w(a b c).each do |char|
    io.write(%({"#{char}":#{rand}}\n))
    sleep 1
  end

  puts "READER: All data was feeded"

  io.close
end

reader = Thread.new(rd) do |io|
  client   = HTTP.headers({ "Transfer-Encoding" => "chunked" })

  puts "WRITER: start post request"

  response = client.post("http://httpbin.org/post", :body => io).flush

  puts "WRITER: post request finished"

  puts "== RESPONSE ==\n#{response.parse.to_yaml}"
end

sleep 0.1 until "sleep" == writer.status

writer.wakeup
reader.join

@ixti
Copy link
Member

ixti commented May 1, 2017

Ignore me! It's httpbin is not understanding chunked requests. Tried that with netcat - all correct:

$ ncat -l 8080
POST / HTTP/1.1
Transfer-Encoding: chunked
Connection: close
Host: localhost:8080
User-Agent: http.rb/2.2.2

19
{"a":0.6089059717403343}

19
{"b":0.7655504586106395}

19
{"c":0.6511286447469131}

0

@britishtea
Copy link
Contributor

I think HTTP::Request::Writer#validate_body_type! should also be updated. Currently it will raise when given a body that responds to #read and #size but is not Enumerable. Perhaps change the test such that it doesn't provide an Enumerable?

Looks like the else clause in #send_request can also be removed, since #validate_body_type! is called before.

Could bodies with a #read but without a #size (or a #size returning nil) be handled as Enumerable bodies are right now?

@ixti
Copy link
Member

ixti commented May 1, 2017

IMHO it worth to create Request::Body wrapper class that will encapsulate and hide away all logic of content-length calculation, type validation and streaming.

@janko
Copy link
Member Author

janko commented May 1, 2017

I think HTTP::Request::Writer#validate_body_type! should also be updated. Currently it will raise when given a body that responds to #read and #size but is not Enumerable. Perhaps change the test such that it doesn't provide an Enumerable?

You're right, the current code wouldn't work with an IO object which isn't Enumerable, I need to add a test that uses something other than a StringIO.

Could bodies with a #read but without a #size (or a #size returning nil) be handled as Enumerable bodies are right now?

You mean, that if a body doesn't respond to #size, that it implies a Transfer-Encoding: chunked request (with the corresponding encoding)? Yeah, we could convert such an IO into an Enumerable, and have the same case clause which handles Enumerable handle it.

Enumerator.new do |yielder|
  loop do
    data = @io.read(Connection::BUFFER_SIZE, buffer ||= String.new)
    break if data.nil?
    yielder << data
  end
end

IMHO it worth to create Request::Body wrapper class that will encapsulate and hide away all logic of content-length calculation, type validation and streaming.

I'm sold 💰

@janko
Copy link
Member Author

janko commented May 2, 2017

I extracted body-related behaviour into an Request::Body class, which deals with validating input, determining size of the body, and retrieving content to be written to the socket. It doesn't know anything about request headers, chunked encodings or sockets.

During this refactoring, I implemented @britishtea's suggestion of allowing Transfer-Encoding: chunked to be used with an IO object. With the extraction of Request::Body it was simpler to generalize how we treat the scenario when Transfer-Encoding: chunked is set, to support chunked requests using any body type, not just Enumerable. I like this change.

I also modified tests to match all of the content that was written to the socket, not just prefixes and suffixes, so that I have greater confidence that I didn't break anything. I also tested streaming to the socket with both Enumerable and non-Enumerable IOs, to verify that Enumerable IOs are treated as IOs, not Enumerables. I'm not sure whether this is a breaking change, because this will change the chunked request body for Enumerable IOs, but then again it shouldn't matter in what format is the chunked request body, as long as all of the content is sent.

I see now that I've triggered quite a few Rubocop offences due to my style, I will let you correct or ignore offences which suit your preference. Investigating the JRuby failure now.

@janko
Copy link
Member Author

janko commented May 2, 2017

I opened a bug report for the JRuby failure: jruby/jruby#4583.

Normally this bug would be harmless, but we are reading chunks using a buffer (to avoid creating a new string for each chunk), so each retrieved chunk will override the next chunk. This means that if we request the first chunk, and JRuby starts executing the code after the first yield (even though we did not tell it to), it will read the next chunk and override the yielded first one.

I think at this point it probably makes more sense to work around this by not using a buffer, even at the cost of more string objects being allocated. This wouldn't introduce a performance regression, since this is a new feature.

chunk = String.new # rubocop:disable Style/EmptyLiteral
yield chunk while @body.read(BUFFER_SIZE, chunk)
loop do
data = @body.read(BUFFER_SIZE) or break
Copy link
Member

@tarcieri tarcieri May 2, 2017

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Personally I'd prefer something like:

while (data = @body.read(BUFFER_SIZE))
  yield data
end

It better communicates intent, IMO, despite assigning in a branch.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@tarcieri I would also prefer that, but Ruby prints a warning when you attempt to use an assignment in conditionals (Ruby thinks that you wanted to use ==). It happens even when warning aren't enabled, just when running ruby -e 'nil if a = 1'.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Try the exact code I gave (with parens). It does not print a warning.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

That is wonderful, I'll update the code then 😃

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@tarcieri Updated.

@tarcieri
Copy link
Member

tarcieri commented May 2, 2017

RuboCop ProTip(TM): rubocop -a (auto-correct)

ixti pushed a commit to httprb/form_data that referenced this pull request May 8, 2017
By changing all components to use an IO object as a base, we can
implement a common IO interface for all components, which delegates to
the underlying IO object.

This enables streaming multipart data into the request body, avoiding
loading the whole multipart data into memory when File parts are backed
by File objects.

See httprb/http#409 for the new streaming API.
ixti pushed a commit to httprb/form_data that referenced this pull request May 8, 2017
This is a squashed commit of [PR#12][] with some tiny cleanups
applied on top of that.

[PR#12]: #12

Allow any IO object in FormData::File
-------------------------------------

Previously we allowed only File and StringIO objects as an input
to `FormData::File`, but we can generalize that to any IO object that
responds to `#read` and `#size` (which includes `Tempfile`,
`ActionDispatch::Http::UploadedFile` etc).

Open File for given path in binary mode
---------------------------------------

That way different operating systems won't attempt to convert newline
characters to their internal representation, instead the file content
will always be retrieved byte-for-byte as is.

Officially support Pathname in FormData::File.new
-------------------------------------------------

Previously Pathname was implicitly supported, though extracting
filename wasn't working. With the recent refactoring this stopped
working, so we make the Pathname support explicit.

Make all components into IO objects
-----------------------------------

By changing all components to use an IO object as a base, we can
implement a common IO interface for all components, which delegates to
the underlying IO object.

This enables streaming multipart data into the request body, avoiding
loading the whole multipart data into memory when File parts are backed
by File objects.

See httprb/http#409 for the new streaming API.

Make CompositeIO convert strings to StringIOs
---------------------------------------------

By delegating handling strings to CompositeIO we can remove a lot of the
StringIO.new clutter when instantiating CompositeIO objects.

Use a buffer when reading IO files in CompositeIO
-------------------------------------------------

This way we're not creating a new string for each chunk read, instead
each chunk will be read into an existing string object (a "buffer"),
replacing any previous content.
@janko
Copy link
Member Author

janko commented May 18, 2017

I updated http-form_data to the prerelease version that has the streaming capabilities, and removed to_s so that it's used as an IO by the HTTP::Request::Writer. I also removed assigning Content-Length from HTTP::Client, as it will automatically be inferred from HTTP::FormData#size by the HTTP::Request::Writer, or omitted if Transfer-Encoding is chunked.

I was thinking whether we need some integration tests between HTTP and HTTP::FormData, but considering that both gems are in my opinion very well-tested, and that the only thing HTTP needs here is #read and #size (which both HTTP::FormData::Multipart and HTTP::FormData::Urlencoded implement), I thought that it's not necessary.

This PR should be complete now.

@ixti
Copy link
Member

ixti commented May 19, 2017

Please rebase this branch and I'll be happy to merge it down

janko added 7 commits May 19, 2017 14:51
This change adds the ability to stream content of an IO object (an
object that responds to #read) into the request body. This is convenient
when you want to send large amounts of data as the request body, but you
don't want to load all of it into memory.

  HTTP.post("http://example.org/upload", body: File.open("video.mp4"))
  HTTP.post("http://example.org/upload", body: StringIO.new("file data"))

It was already possible to send the request body in chunks by passing an
Enumeable to :body, but that's only defined for "Transfer-Encoding:
chunked" requests.

This will also allow us to extend form_data.rb to stream the
multipart-encoded request, instead having to load the whole body into
memory before writing it to the socket.
Before creating chunked requests was only possible with Enumerable body.
If "Transfer-Encoding: chunked" was set but the body wasn't Enumerable,
request would be calculated as if it was a regular request, which meant
inclusion of "Content-Length" header and writing request body as is. If
body was an Enumerable but "Transfer-Encoding: chunked" wasn't set, an
error would be raised.

This commit enables sending chunked requests with any body type, by
abstracting how the request body is read into an #each method. If
"Transfer-Encoding: chunked" is detected, the yielded chunks are encoded
accordingly before writing, otherwise the chunks are written as is.

This allows making a streaming chunked request using an IO object, where
file would be written in chunks of 16 KB, which wasn't previously
possible.
As described in jruby/jruby#4583, JRuby starts
executing code after the first `yield` even though we requested only the
first element, resulting in the first chunk being overriden with the
second chunk before it was even returned.

We work around this by not using a buffer string, therefore each
retrieved chunk is a new string, so even if JRuby immediately retrieves
the second chunk, it won't affect the first chunk.
@janko janko force-pushed the enable-request-body-streaming branch from 3cd6931 to bb4479f Compare May 19, 2017 04:52
@janko
Copy link
Member Author

janko commented May 19, 2017

@ixti Done!

@ixti
Copy link
Member

ixti commented May 19, 2017

Awesome! Thanks!

@ixti ixti merged commit 4445835 into httprb:master May 19, 2017
@janko janko deleted the enable-request-body-streaming branch May 19, 2017 16:44
netbsd-srcmastr pushed a commit to NetBSD/pkgsrc that referenced this pull request Dec 9, 2017
pkgsrc changes:
- sort DEPENDS

Upstream changes (from CHANGES.md):

## 3.0.0 (2017-10-01)

* Drop support of Ruby `2.0` and Ruby `2.1`.
  ([@ixti])

* [#410](httprb/http#410)
  Infer `Host` header upon redirects.
  ([@janko-m])

* [#409](httprb/http#409)
  Enables request body streaming on any IO object.
  ([@janko-m])

* [#413](httprb/http#413),
  [#414](httprb/http#414)
  Fix encoding of body chunks.
  ([@janko-m])

* [#368](httprb/http#368),
  [#357](httprb/http#357)
  Fix timeout issue.
  ([@HoneyryderChuck])
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

5 participants