Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

String#encode raises different errors on 1.7.8 than on 1.7.4 (and both are different from MRI) #1254

Closed
myronmarston opened this issue Nov 22, 2013 · 7 comments
Milestone

Comments

@myronmarston
Copy link

@myronmarston myronmarston commented Nov 22, 2013

Given encoding_test.rb:

# encoding: utf-8

string = "\xAE".force_encoding("ASCII-8BIT")
string.encode("UTF-8")

MRI 1.9.2, 1.9.3 and 2.0.0 consistently raise Encoding::UndefinedConversionError

$ chruby 1.9.2
$ ruby -v
ruby 1.9.2p320 (2012-04-20 revision 35421) [x86_64-darwin12.4.0]
$ ruby encoding_test.rb
encoding_test.rb:4:in `encode': "\xAE" from ASCII-8BIT to UTF-8 (Encoding::UndefinedConversionError)
     from encoding_test.rb:4:in `<main>'

$ chruby 1.9.3
$ ruby -v
ruby 1.9.3p448 (2013-06-27 revision 41675) [x86_64-darwin12.4.0]
$ ruby encoding_test.rb
encoding_test.rb:4:in `encode': "\xAE" from ASCII-8BIT to UTF-8 (Encoding::UndefinedConversionError)
    from encoding_test.rb:4:in `<main>'

$ chruby 2.0
$ ruby -v
ruby 2.0.0p247 (2013-06-27 revision 41674) [x86_64-darwin12.4.0]
$ ruby encoding_test.rb
encoding_test.rb:4:in `encode': "\xAE" from ASCII-8BIT to UTF-8 (Encoding::UndefinedConversionError)
    from encoding_test.rb:4:in `<main>'

JRuby 1.7.4 raises Encoding::ConverterNotFoundError:

$ chruby jruby-1.7.4
$ ruby -v
jruby 1.7.4 (1.9.3p392) 2013-05-16 2390d3b on Java HotSpot(TM) 64-Bit Server VM 1.6.0_65-b14-462-11M4609 [darwin-x86_64]
$ ruby encoding_test.rb
Encoding::ConverterNotFoundError: code converter not found (ASCII-8BIT to UTF-8)
  encode at org/jruby/RubyString.java:7590
  (root) at encoding_test.rb:4

JRuby 1.7.8 raises Encoding::InvalidByteSequenceError:

$ chruby jruby-1.7.8
$ ruby -v
jruby 1.7.8 (1.9.3p392) 2013-11-14 0ce429e on Java HotSpot(TM) 64-Bit Server VM 1.6.0_65-b14-462-11M4609 [darwin-x86_64]
$ ruby encoding_test.rb
Encoding::InvalidByteSequenceError: ""\xAE"" on ASCII-8BIT
  encode at org/jruby/RubyString.java:7599
  (root) at encoding_test.rb:4

Honestly, I'm not sure which behavior is correct, but changing the error between 1.7.4 and 1.7.8 feels like a breaking API change in a patch release. It caused sudden travis failures for rspec-expectations due to them upgrading their VMs.

Is this a regression? Or should we be handling Encoding::InvalidByteSequenceError in RSpec as well?

@enebo
Copy link
Member

@enebo enebo commented Nov 22, 2013

This was an unintentional change. Our transcoding is supposed to match MRI but you can see that neither 1.7.4 nor what we currently do matches MRI's behavior. So I would say less this is less a regression as much as it has never worked correctly. Without digging I am am hoping we can match to have the same error as MRI (under the covers we are using Java transcoding and we might be limited in our ability to differentiate errors).

Loading

@headius
Copy link
Member

@headius headius commented Nov 22, 2013

This is likely the same issue as #1255. I noticed something like it while working on #1242 and #1204 and only added a patch around it.

I believe the problem lies in transcoding itself not handling ASCII-8BIT properly.

Need to fix. Thanks for repro.

Loading

@headius
Copy link
Member

@headius headius commented Nov 22, 2013

Note that the 1.7.4 output was before the complete rework of our transcoding subsystem, so ignore that.

The 1.7.8 output is closer to MRI but reporting the "invalid" error rather than the "undefined" error. This is in part due to how we handle 8-bit ascii passing through the encoding subsystem.

There's some discussion of which error is more appropriate here: http://bugs.ruby-lang.org/issues/8630

My argument was that since ASCII-8BIT does map to a known set of 256 characters, there should be no issue converting to UTF-8. However my mistake was that there is no single "ASCII-8BIT" encoding; there's many different 8-bit encodings that have assigned the high range their own way. As Martin says in the bug above, it's better to look at the high range in "ASCII-8BIT" as being unassigned values, and therefore it's the mapping that's undefined, not the source material that's invalid.

Loading

@myronmarston
Copy link
Author

@myronmarston myronmarston commented Nov 22, 2013

My argument was that since ASCII-8BIT does map to a known set of 256 characters, there should be no issue converting to UTF-8. However my mistake was that there is no single "ASCII-8BIT" encoding; there's many different 8-bit encodings that have assigned the high range their own way. As Martin says in the bug above, it's better to look at the high range in "ASCII-8BIT" as being unassigned values, and therefore it's the mapping that's undefined, not the source material that's invalid.

Interesting issue. I have no idea what the right behavior is, but your comment here goes a bit against my understanding of ASCII-8BIT. Previously, I had understood ASCII-8BIT to be an alias of "BINARY" -- in other words, it tags the string as being raw binary data, that shouldn't be assumed to correspond to any character set.

Loading

@headius
Copy link
Member

@headius headius commented Nov 22, 2013

ASCII-8BIT and BINARY are aliases in MRI, and that is part of the problem translating this logic to JRuby. MRI will happily translate ASCII-8BIT low range characters to UTF-8, but will refuse to do so for high-range characters. If it were treating it all as arbitrary binary, it should also kick out the low range.

I believe this stems from ASCII-8BIT (BINARY) being considered "ascii-compatible" wrt those low-range characters. All "ascii-compatible" encodings can translate directly into each other because there's no loss of information. So I guess the only debate would be whether BINARY should not be considered "ascii-compatible", which would prevent it from transcoding to any non-binary character encoding (since it would mean all 256 bytes are undefined mappings).

Loading

@headius
Copy link
Member

@headius headius commented Nov 22, 2013

We do have a brute-force path forward: we could define our own ASCII-8BIT JVM Charset that only supports transcoding to itself or low-range bytes to character encodings.

The reason this is broken right now is that in order for us to get the JDK to treat high-range bytes as unmappable (undefined conversion) we substitute US-ASCII when we see the source encoding is ASCII-8BIT and the target encoding is a text encoding. This gives us an error, as desired, since the high range of US-ASCII is undefined; however, the US-ASCII Charset considers the original string to be malformed, rather than having undefined codepoints. A custom charset could do the latter.

Loading

headius added a commit that referenced this issue Nov 23, 2013
Last issues fixed for #1254.
@headius headius closed this in 15db5e7 Nov 23, 2013
@myronmarston
Copy link
Author

@myronmarston myronmarston commented Nov 23, 2013

Thanks for the quick fix, @headius!

Loading

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Linked pull requests

Successfully merging a pull request may close this issue.

None yet
3 participants