String#encode raises different errors on 1.7.8 than on 1.7.4 (and both are different from MRI) #1254

Closed
myronmarston opened this Issue Nov 22, 2013 · 7 comments

3 participants

@myronmarston

Given encoding_test.rb:

# encoding: utf-8

string = "\xAE".force_encoding("ASCII-8BIT")
string.encode("UTF-8")

MRI 1.9.2, 1.9.3 and 2.0.0 consistently raise Encoding::UndefinedConversionError

$ chruby 1.9.2
$ ruby -v
ruby 1.9.2p320 (2012-04-20 revision 35421) [x86_64-darwin12.4.0]
$ ruby encoding_test.rb
encoding_test.rb:4:in `encode': "\xAE" from ASCII-8BIT to UTF-8 (Encoding::UndefinedConversionError)
     from encoding_test.rb:4:in `<main>'

$ chruby 1.9.3
$ ruby -v
ruby 1.9.3p448 (2013-06-27 revision 41675) [x86_64-darwin12.4.0]
$ ruby encoding_test.rb
encoding_test.rb:4:in `encode': "\xAE" from ASCII-8BIT to UTF-8 (Encoding::UndefinedConversionError)
    from encoding_test.rb:4:in `<main>'

$ chruby 2.0
$ ruby -v
ruby 2.0.0p247 (2013-06-27 revision 41674) [x86_64-darwin12.4.0]
$ ruby encoding_test.rb
encoding_test.rb:4:in `encode': "\xAE" from ASCII-8BIT to UTF-8 (Encoding::UndefinedConversionError)
    from encoding_test.rb:4:in `<main>'

JRuby 1.7.4 raises Encoding::ConverterNotFoundError:

$ chruby jruby-1.7.4
$ ruby -v
jruby 1.7.4 (1.9.3p392) 2013-05-16 2390d3b on Java HotSpot(TM) 64-Bit Server VM 1.6.0_65-b14-462-11M4609 [darwin-x86_64]
$ ruby encoding_test.rb
Encoding::ConverterNotFoundError: code converter not found (ASCII-8BIT to UTF-8)
  encode at org/jruby/RubyString.java:7590
  (root) at encoding_test.rb:4

JRuby 1.7.8 raises Encoding::InvalidByteSequenceError:

$ chruby jruby-1.7.8
$ ruby -v
jruby 1.7.8 (1.9.3p392) 2013-11-14 0ce429e on Java HotSpot(TM) 64-Bit Server VM 1.6.0_65-b14-462-11M4609 [darwin-x86_64]
$ ruby encoding_test.rb
Encoding::InvalidByteSequenceError: "\xAE" on ASCII-8BIT
  encode at org/jruby/RubyString.java:7599
  (root) at encoding_test.rb:4

Honestly, I'm not sure which behavior is correct, but changing the error class between 1.7.4 and 1.7.8 feels like a breaking API change for a patch release. It caused sudden Travis failures for rspec-expectations when Travis upgraded its VMs.

Is this a regression? Or should we be handling Encoding::InvalidByteSequenceError in RSpec as well?
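Until the behavior is settled, one defensive option is to rescue all three error classes seen across interpreters. This is only a sketch of that idea; `safe_utf8` is a hypothetical helper, not RSpec's actual code:

```ruby
# Hypothetical helper: tolerate whichever encoding error the current
# interpreter raises (MRI, JRuby 1.7.4, or JRuby 1.7.8).
def safe_utf8(string)
  string.encode("UTF-8")
rescue Encoding::UndefinedConversionError,  # MRI 1.9.2 / 1.9.3 / 2.0.0
       Encoding::InvalidByteSequenceError,  # JRuby 1.7.8
       Encoding::ConverterNotFoundError     # JRuby 1.7.4
  string.inspect  # fall back to an escaped representation
end

puts safe_utf8("abc")                                # abc
puts safe_utf8("\xAE".force_encoding("ASCII-8BIT"))  # "\xAE"
```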

@myronmarston myronmarston referenced this issue in rspec/rspec-expectations Nov 22, 2013
Merged

Fix JRuby handling of EncodedStrings #366

@enebo
JRuby Team member

This was an unintentional change. Our transcoding is supposed to match MRI, but you can see that neither 1.7.4 nor what we currently do matches MRI's behavior. So I would say this is less a regression than something that has never worked correctly. Without digging in, I am hoping we can match MRI's error (under the covers we are using Java transcoding, and we may be limited in our ability to differentiate errors).

@headius
JRuby Team member

This is likely the same issue as #1255. I noticed something like it while working on #1242 and #1204 and only added a patch around it.

I believe the problem lies in transcoding itself not handling ASCII-8BIT properly.

Need to fix. Thanks for repro.

@headius
JRuby Team member

Note that the 1.7.4 output was before the complete rework of our transcoding subsystem, so ignore that.

The 1.7.8 output is closer to MRI but reports the "invalid" error rather than the "undefined" error. This is in part due to how we handle 8-bit ASCII passing through the encoding subsystem.

There's some discussion of which error is more appropriate here: http://bugs.ruby-lang.org/issues/8630

My argument was that since ASCII-8BIT does map to a known set of 256 characters, there should be no issue converting to UTF-8. However, my mistake was that there is no single "ASCII-8BIT" encoding; there are many different 8-bit encodings, each of which has assigned the high range its own way. As Martin says in the bug above, it's better to look at the high range of "ASCII-8BIT" as unassigned values, and therefore it's the mapping that's undefined, not the source material that's invalid.

@myronmarston

My argument was that since ASCII-8BIT does map to a known set of 256 characters, there should be no issue converting to UTF-8. However, my mistake was that there is no single "ASCII-8BIT" encoding; there are many different 8-bit encodings, each of which has assigned the high range its own way. As Martin says in the bug above, it's better to look at the high range of "ASCII-8BIT" as unassigned values, and therefore it's the mapping that's undefined, not the source material that's invalid.

Interesting issue. I have no idea what the right behavior is, but your comment here goes a bit against my understanding of ASCII-8BIT. Previously, I had understood ASCII-8BIT to be an alias of "BINARY": in other words, it tags the string as raw binary data that shouldn't be assumed to correspond to any character set.

@headius
JRuby Team member

ASCII-8BIT and BINARY are aliases in MRI, and that is part of the problem translating this logic to JRuby. MRI will happily translate ASCII-8BIT low range characters to UTF-8, but will refuse to do so for high-range characters. If it were treating it all as arbitrary binary, it should also kick out the low range.

I believe this stems from ASCII-8BIT (BINARY) being considered "ascii-compatible" with respect to those low-range characters. All "ascii-compatible" encodings can translate directly into each other because there's no loss of information. So I guess the only debate would be whether BINARY should not be considered "ascii-compatible", which would prevent it from transcoding to any non-binary character encoding (since it would mean all 256 bytes are undefined mappings).
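The split behavior described above is easy to observe on MRI. A minimal sketch, assuming MRI 1.9+ semantics: BINARY and ASCII-8BIT resolve to the same encoding, low-range (7-bit) bytes transcode cleanly, and only high-range bytes raise:

```ruby
# BINARY is just an alias for ASCII-8BIT:
puts Encoding.find("BINARY") == Encoding.find("ASCII-8BIT")  # true

# Low-range bytes pass through because ASCII-8BIT is treated as
# "ascii-compatible" for the 7-bit range:
low = "hello".force_encoding("ASCII-8BIT")
puts low.encode("UTF-8")  # hello

# A high-range byte has no defined mapping, so MRI raises:
high = "\xAE".force_encoding("ASCII-8BIT")
begin
  high.encode("UTF-8")
rescue Encoding::UndefinedConversionError => e
  puts e.class  # Encoding::UndefinedConversionError
end
```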

@headius
JRuby Team member

We do have a brute-force path forward: we could define our own ASCII-8BIT JVM Charset that only supports transcoding to itself or low-range bytes to character encodings.

The reason this is broken right now is that in order for us to get the JDK to treat high-range bytes as unmappable (undefined conversion) we substitute US-ASCII when we see the source encoding is ASCII-8BIT and the target encoding is a text encoding. This gives us an error, as desired, since the high range of US-ASCII is undefined; however, the US-ASCII Charset considers the original string to be malformed, rather than having undefined codepoints. A custom charset could do the latter.
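The undefined/invalid distinction being worked around here can be reproduced directly in MRI: the same byte is an undefined conversion when the source is ASCII-8BIT, but an invalid byte sequence when the source is US-ASCII (which is effectively what JRuby's substitution produces). A sketch, assuming MRI 1.9+ behavior; `error_class_for` is a hypothetical helper for illustration:

```ruby
# Hypothetical helper: report which encoding error class a conversion
# from the given source encoding raises.
def error_class_for(enc)
  "\xAE".force_encoding(enc).encode("UTF-8")
  nil
rescue Encoding::UndefinedConversionError,
       Encoding::InvalidByteSequenceError => e
  e.class
end

puts error_class_for("ASCII-8BIT")  # Encoding::UndefinedConversionError
puts error_class_for("US-ASCII")    # Encoding::InvalidByteSequenceError
```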

@headius headius added a commit that referenced this issue Nov 23, 2013
@headius headius Unmask transcode errors test.
Last issues fixed for #1254.
243f5d9
@headius headius added a commit that closed this issue Nov 23, 2013
@headius headius Present the appropriate error for ASCII-8BIT high-range to chars.
This is mildly hacky.

In order to get the JDK Charset logic to present an error for
converting from ASCII-8BIT to a normal character encoding, we
use the US-ASCII Charset as the input. However, that charset
considers high-range bytes to be malformed rather than unmappable,
causing us to present the wrong error. This hack informs our
transcoder that we're doing a binary-to-character transcode and
that malformed errors should be treated as unmappable errors,
allowing us to raise the appropriate Ruby error.

Fixes #1254.
15db5e7
@headius headius closed this in 15db5e7 Nov 23, 2013
@myronmarston

Thanks for the quick fix, @headius!
