
String#encode raises different errors on 1.7.8 than on 1.7.4 (and both are different from MRI) #1254

Closed
myronmarston opened this Issue Nov 22, 2013 · 7 comments


myronmarston commented Nov 22, 2013

Given encoding_test.rb:

# encoding: utf-8

string = "\xAE".force_encoding("ASCII-8BIT")
string.encode("UTF-8")

MRI 1.9.2, 1.9.3 and 2.0.0 consistently raise Encoding::UndefinedConversionError

$ chruby 1.9.2
$ ruby -v
ruby 1.9.2p320 (2012-04-20 revision 35421) [x86_64-darwin12.4.0]
$ ruby encoding_test.rb
encoding_test.rb:4:in `encode': "\xAE" from ASCII-8BIT to UTF-8 (Encoding::UndefinedConversionError)
     from encoding_test.rb:4:in `<main>'

$ chruby 1.9.3
$ ruby -v
ruby 1.9.3p448 (2013-06-27 revision 41675) [x86_64-darwin12.4.0]
$ ruby encoding_test.rb
encoding_test.rb:4:in `encode': "\xAE" from ASCII-8BIT to UTF-8 (Encoding::UndefinedConversionError)
    from encoding_test.rb:4:in `<main>'

$ chruby 2.0
$ ruby -v
ruby 2.0.0p247 (2013-06-27 revision 41674) [x86_64-darwin12.4.0]
$ ruby encoding_test.rb
encoding_test.rb:4:in `encode': "\xAE" from ASCII-8BIT to UTF-8 (Encoding::UndefinedConversionError)
    from encoding_test.rb:4:in `<main>'

JRuby 1.7.4 raises Encoding::ConverterNotFoundError:

$ chruby jruby-1.7.4
$ ruby -v
jruby 1.7.4 (1.9.3p392) 2013-05-16 2390d3b on Java HotSpot(TM) 64-Bit Server VM 1.6.0_65-b14-462-11M4609 [darwin-x86_64]
$ ruby encoding_test.rb
Encoding::ConverterNotFoundError: code converter not found (ASCII-8BIT to UTF-8)
  encode at org/jruby/RubyString.java:7590
  (root) at encoding_test.rb:4

JRuby 1.7.8 raises Encoding::InvalidByteSequenceError:

$ chruby jruby-1.7.8
$ ruby -v
jruby 1.7.8 (1.9.3p392) 2013-11-14 0ce429e on Java HotSpot(TM) 64-Bit Server VM 1.6.0_65-b14-462-11M4609 [darwin-x86_64]
$ ruby encoding_test.rb
Encoding::InvalidByteSequenceError: ""\xAE"" on ASCII-8BIT
  encode at org/jruby/RubyString.java:7599
  (root) at encoding_test.rb:4

Honestly, I'm not sure which behavior is correct, but changing the error between 1.7.4 and 1.7.8 feels like a breaking API change in a patch release. It caused sudden Travis failures for rspec-expectations when Travis upgraded its VMs.

Is this a regression? Or should we be handling Encoding::InvalidByteSequenceError in RSpec as well?
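For reference, here is a minimal sketch of a defensive workaround (not from this thread, just an assumption about how a caller could cope with either runtime): all three errors seen above -- Encoding::UndefinedConversionError, Encoding::InvalidByteSequenceError and Encoding::ConverterNotFoundError -- inherit from EncodingError, so rescuing the parent class tolerates every variant:

# Hypothetical defensive wrapper, shown only as a sketch.
string = "\xAE".force_encoding("ASCII-8BIT")
begin
  string.encode("UTF-8")
rescue EncodingError => e
  # Covers UndefinedConversionError, InvalidByteSequenceError and
  # ConverterNotFoundError alike, regardless of which one the runtime raises.
  warn "could not transcode: #{e.class}: #{e.message}"
end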

enebo commented Nov 22, 2013

This was an unintentional change. Our transcoding is supposed to match MRI, but you can see that neither 1.7.4 nor what we currently do matches MRI's behavior. So I would say this is less a regression than something that has never worked correctly. Without digging in, I am hoping we can raise the same error as MRI (under the covers we are using Java transcoding and we might be limited in our ability to differentiate errors).

headius commented Nov 22, 2013

This is likely the same issue as #1255. I noticed something like it while working on #1242 and #1204 and only patched around it.

I believe the problem lies in transcoding itself not handling ASCII-8BIT properly.

Need to fix. Thanks for repro.

headius commented Nov 22, 2013

Note that the 1.7.4 output was before the complete rework of our transcoding subsystem, so ignore that.

The 1.7.8 output is closer to MRI's, but it reports the "invalid" error rather than the "undefined" error. This is partly due to how we handle 8-bit ASCII passing through the encoding subsystem.

There's some discussion of which error is more appropriate here: http://bugs.ruby-lang.org/issues/8630

My argument was that since ASCII-8BIT does map to a known set of 256 characters, there should be no issue converting to UTF-8. However, my mistake was that there is no single "ASCII-8BIT" encoding; there are many different 8-bit encodings, each of which has assigned the high range in its own way. As Martin says in the bug above, it's better to look at the high range of "ASCII-8BIT" as unassigned values, and therefore it's the mapping that's undefined, not the source material that's invalid.
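To make the distinction concrete, a small sketch (my own illustration, assuming standard MRI behavior): an undefined conversion means the source bytes are valid in the source encoding but have no mapping in the target, while an invalid byte sequence means the source bytes are not even valid in the source encoding:

# Undefined conversion: \xAE is a valid ASCII-8BIT character, but has no UTF-8 mapping.
"\xAE".force_encoding("ASCII-8BIT").encode("UTF-8")
#=> raises Encoding::UndefinedConversionError

# Invalid byte sequence: \xAE is not a valid byte in UTF-8 at all.
"\xAE".force_encoding("UTF-8").encode("UTF-16")
#=> raises Encoding::InvalidByteSequenceError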

myronmarston commented Nov 22, 2013

My argument was that since ASCII-8BIT does map to a known set of 256 characters, there should be no issue converting to UTF-8. However, my mistake was that there is no single "ASCII-8BIT" encoding; there are many different 8-bit encodings, each of which has assigned the high range in its own way. As Martin says in the bug above, it's better to look at the high range of "ASCII-8BIT" as unassigned values, and therefore it's the mapping that's undefined, not the source material that's invalid.

Interesting issue. I have no idea what the right behavior is, but your comment here goes a bit against my understanding of ASCII-8BIT. Previously, I had understood ASCII-8BIT to be an alias of "BINARY" -- in other words, it tags the string as raw binary data that shouldn't be assumed to correspond to any character set.

headius commented Nov 22, 2013

ASCII-8BIT and BINARY are aliases in MRI, and that is part of the problem translating this logic to JRuby. MRI will happily translate ASCII-8BIT low range characters to UTF-8, but will refuse to do so for high-range characters. If it were treating it all as arbitrary binary, it should also kick out the low range.

I believe this stems from ASCII-8BIT (BINARY) being considered "ascii-compatible" wrt those low-range characters. All "ascii-compatible" encodings can translate directly into each other because there's no loss of information. So I guess the only debate would be whether BINARY should not be considered "ascii-compatible", which would prevent it from transcoding to any non-binary character encoding (since it would mean all 256 bytes are undefined mappings).
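A quick sketch of that asymmetry (again my own illustration, assuming ordinary MRI behavior): low-range bytes transcode cleanly because ASCII-8BIT is treated as ascii-compatible, while high-range bytes do not:

# Low range: 7-bit bytes pass straight through to UTF-8.
"abc".force_encoding("ASCII-8BIT").encode("UTF-8")
#=> "abc"

# High range: no defined mapping, so the conversion is rejected.
"\xAE".force_encoding("ASCII-8BIT").encode("UTF-8")
#=> raises Encoding::UndefinedConversionError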

headius commented Nov 22, 2013

We do have a brute-force path forward: we could define our own ASCII-8BIT JVM Charset that only supports transcoding to itself or low-range bytes to character encodings.

The reason this is broken right now is that in order for us to get the JDK to treat high-range bytes as unmappable (undefined conversion) we substitute US-ASCII when we see the source encoding is ASCII-8BIT and the target encoding is a text encoding. This gives us an error, as desired, since the high range of US-ASCII is undefined; however, the US-ASCII Charset considers the original string to be malformed, rather than having undefined codepoints. A custom charset could do the latter.
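For comparison, a hedged illustration of why that substitution changes the error class (MRI behavior, as I understand it): with an ASCII-8BIT source the high byte is a valid character whose mapping is undefined, whereas with a US-ASCII source the same byte is not a valid character at all, so the string itself is reported as malformed:

"\xAE".force_encoding("ASCII-8BIT").encode("UTF-8")
#=> Encoding::UndefinedConversionError ("\xAE" from ASCII-8BIT to UTF-8)

"\xAE".force_encoding("US-ASCII").encode("UTF-8")
#=> Encoding::InvalidByteSequenceError ("\xAE" on US-ASCII)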

headius added a commit that referenced this issue Nov 23, 2013

Unmask transcode errors test.
Last issues fixed for #1254.

headius closed this in 15db5e7 Nov 23, 2013

myronmarston commented Nov 23, 2013

Thanks for the quick fix, @headius!
