Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Buffer encoding #19

Merged
merged 7 commits into from
Jan 29, 2018
Merged

Buffer encoding #19

merged 7 commits into from
Jan 29, 2018

Conversation

HoneyryderChuck
Copy link
Contributor

This should fix #18

@@ -29,14 +29,16 @@ def initialize(ios)
#
# @return [String, nil]
def read(length = nil, outbuf = nil)
outbuf = outbuf.to_s.replace("")
outbuf = outbuf.to_s.clear
Copy link
Member

@tarcieri tarcieri Dec 18, 2017

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Though this is likely a bug in the previous version as well, #to_s may make a new object, at which point the purpose of passing outbuf is defeated.

Perhaps give outbuf a default value (e.g. outbuf = "".b instead of outbuf = nil as the argument definition), and raise if it's not a String?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

#to_s will not create a new object if the object is a string. But the replace call might change its encoding. That was the reasoning behind it.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Hence "may make a new object". We never want to make a new object for the buffer... that defeats the point.

How about raising if the supplied buffer is not a String type with Encoding::BINARY?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We never want to make a new object for the buffer... that defeats the point.

ahh, I see what you mean.

How about raising if the supplied buffer is ...

I'm still not sure. The encoding requirement is usually not mandatory, and the buffer passed may be a specialized buffer. Besides "fixing" jruby, I'm not even sure that forcing binary encoding there is the right thing to do.

By the way, what's the reasoning behind allowing length = nil? The length argument is usually required for IO#read and File#read.

Copy link
Member

@tarcieri tarcieri Dec 20, 2017

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

the buffer passed may be a specialized buffer

In what case would it make sense for it to be anything but a String? Ultimately it needs to thunk to a Ruby I/O function which also accepts a preallocated buffer, and those all need to be a String unless JRuby has some fancy behavior I don't know about where it lets you pass a Java ByteBuffer or something.

Besides "fixing" jruby, I'm not even sure that forcing binary encoding there is the right thing to do.

For these types of buffers, I think mandating Encoding::BINARY is the only right option. They should really be a ByteArray/ByteBuffer type, but Ruby doesn't have those, so we need the next best thing, which is a String with Encoding::BINARY.

For other encodings there are all sorts of pathological performance implications, and the contract at this point in the code is to handle arbitrary binary data that may not be whatever-encoding-was-passed-in.

Given this is a rarely used power-user feature, I think it should have some pretty exacting semantics, which are:

  • Avoid all allocations. The whole point of this API is to avoid allocations. This means no #to_s or anything else that might potentially allocate memory
  • String is probably the only argument type that makes sense
  • Encoding::BINARY (i.e. Encoding::ASCII_8BIT) is probably the only sensible encoding for the buffer String. If you think other encodings should be supported, do you really think the answer to the question "Will outbuf always be a valid string under encoding X?" is yes?

By the way, what's the reasoning behind allowing length = nil?

No idea, like the outbuf being nil I think that probably doesn't make sense.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I agree with this change, I'll try to answer the questions below.

@ixti
Copy link
Member

ixti commented Dec 20, 2017

@janko-m Please, chip in.

@janko
Copy link
Member

janko commented Dec 20, 2017

I'll try to make some clarifications on the implementation and behaviour.

Though this is likely a bug in the previous version as well, #to_s may make a new object, at which point the purpose of passing outbuf is defeated.

The reason we're creating a new object is because when we're reading the chunks of internal @ios, we're reading them into the same @buffer string object that was allocated on CompositeIO.new, so no new string objects are created on this line:

current_io.read(length, @buffer)

Since Ruby's IO#read always returns a new string object when the outbuf is not passed in, we mimic that here as well by calling outbuf.to_s to create a new string in case outbuf wasn't passed in. That way when outbuf is not passed in we're allocating exactly 1 new string object per CompositeIO#read call, and when it's passed in it's 0, which is what we want.

Perhaps give outbuf a default value (e.g. outbuf = "".b instead of outbuf = nil as the argument definition), and raise if it's not a String?

Ruby's IO#read still allows explicitly passing the outbuf argument as nil (i.e. io.read(10, nil)), so a default argument wouldn't handle that. I recall experiencing that some library called IO#read that way. I know it's an edge case, but I think it makes sense to aim for 100% compatibility with IO#read.

#to_s will not create a new object if the object is a string. But the replace call might change its encoding. That was the reasoning behind it.

Good call, replace("") does in fact change the encoding on MRI 2.4.1:

[1] pry(main)> "".b.replace("").encoding
=> #<Encoding:UTF-8>

By the way, what's the reasoning behind allowing length = nil?
No idea, like the outbuf being nil I think that probably doesn't make sense.

Both length and outbuf are optional arguments in Ruby's IO#read, so that's why they're optional here as well. When length is not passed in, then the whole content should be returned. HTTP::Multipart#to_s (inherited from Multipart::FormData::Readable) calls HTTP::CompositeIO#read without any arguments. FYI, IO#readpartial and IO#read_nonblock require the length argument, those are the only ones I think.

@HoneyryderChuck
Copy link
Contributor Author

@janko-m thank you for the clarification. I had the impression that this was to follow IO#read, but wasn't sure anymore if length was optional or not.

Besides the buffer issue, anything else you'd like to add to the PR? I'd consider it ready to go otherwise.

Gemfile Outdated
@@ -18,7 +18,7 @@ group :test do
end

group :doc do
gem "redcarpet"
gem "redcarpet", platform: :mri

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Travis CI is blowing up on this line, because Rubocop for the project is requiring hash rockets. I am guessing that's a personal preference for the project, as I prefer Ruby 1.9+ colon syntax...

@nightsurge
Copy link

Can we get this CI issue resolved? I think my usage of the gem may depend on this change as well.

@tarcieri
Copy link
Member

tarcieri commented Jan 8, 2018

@nightsurge there's an open upstream issue re: keyword arguments httprb/http#447

@ixti
Copy link
Member

ixti commented Jan 9, 2018

@tarcieri as far as I thought kwargs issue was about changing options hashes to kwargs. But not about changing hash-rocket usage to new hash style. :)) I'm still hard follower of hash rockets stile as it keeps code much more consistent :D

@ixti
Copy link
Member

ixti commented Jan 9, 2018

Will review/merge this pr asap

@nightsurge
Copy link

nightsurge commented Jan 17, 2018

@HoneyryderChuck I have some oddities in my code:

    10: def initialize(ios)
    11:   @index  = 0
    12:   @buffer = String.new
 => 13:   binding.pry
    14:   @ios    = ios.map do |io|
    15:     if io.is_a?(String)
    16:       StringIO.new(io)
    17:     elsif io.respond_to?(:read)
    18:       io
    19:     else
    20:       raise ArgumentError,
    21:         "#{io.inspect} is neither a String nor an IO object"
    22:     end
    23:   end
    24: end

[1] pry(#<HTTP::FormData::CompositeIO>)> @buffer.encoding
=> #<Encoding:ASCII-8BIT>
[2] pry(#<HTTP::FormData::CompositeIO>)> String.new.encoding
=> #<Encoding:ASCII-8BIT>

@buffer/String.new appears to be ASCII-8BIT, which is correct. Also, when it gets to the io handling, I have to add a force_encoding(Encoding::BINARY) to both the outbuf and @buffer. I'm just doing standard image uploads of various types (png/jpg).

def read(length = nil, outbuf = nil)
        outbuf = outbuf.to_s.replace("")
        outbuf.force_encoding(Encoding::BINARY)

        while current_io
          current_io.read(length, @buffer)
          outbuf << @buffer.force_encoding(Encoding::BINARY)

I'm using JRuby 9.1.15.0

@HoneyryderChuck
Copy link
Contributor Author

I have to add a force_encoding(Encoding::BINARY) to both the outbuf and @buffer

I feel that this might be a jruby bug. You shouldn't have to force-encode between two binary string, or I don't see the edge case where you should.

@nightsurge
Copy link

@HoneyryderChuck agreed. I wonder if I can setup a test script using io.read and a sample file to reproduce the issue. Is it detrimental to performance to use both force_encodings? I hate to have to monkeypatch the gem to keep it working for our enterprise app. But then again it is silly to ask a gem to sacrifice potential performance for a JRuby bug. Both scenarios are potentially just temporary.

@HoneyryderChuck
Copy link
Contributor Author

@nightsurge I'd say that you could easily turn it into a reproducible script (in theory, at least), as you're just reading chunks into a buffer. I'd suggest you to present this in the jruby issue tracker, and they're usually quite responsive folks.

@janko
Copy link
Member

janko commented Jan 21, 2018

I have to add a force_encoding(Encoding::BINARY) to both the outbuf and @buffer

I feel that this might be a jruby bug. You shouldn't have to force-encode between two binary string, or I don't see the edge case where you should.

Ruby strings can sometimes change encodings depending on what you append to them. This is not JRuby-specific, it's the same on MRI.

require "securerandom"
string = ""
string.encoding # => #<Encoding:UTF-8>
string << SecureRandom.random_bytes
string # => ",\xCA\xBA\xEE<E\xB0\x12\xB0+\xD8\x1Ebh}\e"
string.encoding # => #<Encoding:ASCII-8BIT>

We can see that when we appended a binary string to an UTF-8 encoded string, its encoding automatically changed to binary as well.

Is it detrimental to performance to use both force_encodings?

I would be surprised if it is, because force_encoding doesn't change the bytes, it only changes how those bytes should be interpreted (the encoding).

If the goal is to have all string returned in binary encoding, which if you ask me is definitely a safe option, we should add .force_encoding(Encoding::BINARY) calls whenever we're appending to outbuf.

@nightsurge
Copy link

nightsurge commented Jan 21, 2018 via email

@HoneyryderChuck
Copy link
Contributor Author

Ruby strings can sometimes change encodings depending on what you append to them. This is not JRuby-specific, it's the same on MRI.

@janko-m the discussed case (if I remember correctly) was where the binary chunk was being appended to the binary outbuf AND then switching the encoding to UTF-8. And this does seem like a bug, although it might also be just undefined and one has to ensure encoding itself.

I could also just easily employ the fix and add the .force_encoding calls, but I was under the impression that it had a non-trivial cost. But if my impression is wrong, I can just simply add it.

@janko
Copy link
Member

janko commented Jan 29, 2018

the discussed case (if I remember correctly) was where the binary chunk was being appended to the binary outbuf AND then switching the encoding to UTF-8. And this does seem like a bug, although it might also be just undefined and one has to ensure encoding itself.

Oh, that definitely seems like a bug then. But we'd need the .force_encoding call in any case.

I could also just easily employ the fix and add the .force_encoding calls, but I was under the impression that it had a non-trivial cost. But if my impression is wrong, I can just simply add it.

We know that String#force_encoding changes only the encoding associated with the string, so I think we can safely assume that it's a very cheap operation.

I would like to see this merged, because that fact that currently String#length is used instead of String#bytesize means that this gem doesn't work correctly with files that have multi-byte UTF-8 characters and aren't opened in binary mode, on MRI as well as JRuby (sorry for having introduced that bug). Perhaps it would be also useful to add a test for that as part of this PR.

@ixti
Copy link
Member

ixti commented Jan 29, 2018

@HoneyryderChuck can you please add specs (if it's feasible)?

@HoneyryderChuck
Copy link
Contributor Author

@ixti I'll try to come up with smth, + the force_encoding call. But can we get a reproducible script to present the bug case to the jruby guys? @nightsurge any inputs?

@nightsurge
Copy link

nightsurge commented Jan 29, 2018

@HoneyryderChuck @ixti @janko-m I already posted to JRuby and they confirmed that since it was also a bug with MRI/C-Ruby, there was nothing they planned to do at this point and closed the issue:

(script is in that issue post)
jruby/jruby#4986

@HoneyryderChuck
Copy link
Contributor Author

@ixti decided not to write any test, as composite_io module tests are following a pattern I find hard to add other ios into. Also, I wouldn't know how to test the specific case of the bytesize call. If you think this is important, I'd suggest you to change one of the ios strings to a different encoding, although that's up to you.

@nightsurge did you file a bug report in ruby tracker? I'd like to follow that.

@janko
Copy link
Member

janko commented Jan 29, 2018

@HoneyryderChuck That's ok, since I wrote that bit I can add the tests after the PR is merged.

@ixti ixti merged commit 31eefbc into httprb:master Jan 29, 2018
@ixti
Copy link
Member

ixti commented Jan 29, 2018

@HoneyryderChuck Thank you!

@janko
Copy link
Member

janko commented Mar 4, 2018

@ixti @tarcieri Shall we make a new release with this fix?

@ixti
Copy link
Member

ixti commented Mar 4, 2018

@janko-m Oh, right!. Seems like so. Will handle this today! Thanks for the reminder.

@ixti
Copy link
Member

ixti commented Mar 5, 2018

@janko-m v2.1.0 is here :D

@janko
Copy link
Member

janko commented Mar 5, 2018

@ixti Awesome, thank you!

netbsd-srcmastr pushed a commit to NetBSD/pkgsrc that referenced this pull request Mar 17, 2018
## 2.1.0 (2018-03-05)

* [#21](httprb/form_data#21)
  Rewind content at the end of `Readable#to_s`.
  [@janko-m][]

* [#19](httprb/form_data#19)
  Fix buffer encoding.
  [@HoneyryderChuck][]


## 2.0.0 (2017-10-01)

* [#17](httprb/form_data#17)
  Add CRLF character to end of multipart body.
  [@mhickman][]


## 2.0.0.pre2 (2017-05-11)

* [#14](httprb/form_data#14)
  Enable streaming for urlencoded form data.
  [@janko-m][]


## 2.0.0.pre1 (2017-05-10)

* [#12](httprb/form_data#12)
  Enable form data streaming.
  [@janko-m][]
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

buffer encoding issues
5 participants