JRuby 1.7.x String.encode not using Unicode replacement character #856

Closed
dekellum opened this Issue · 6 comments

3 participants

@dekellum

The following test script works correctly on MRI and JRuby 1.6.8, but gives invalid output on all released versions of JRuby 1.7.x.

#!/usr/bin/env ruby

input = "\xC3\x81".force_encoding('Windows-1252')

# 0x81 is undefined in Windows-1252, should be mapped to Unicode
# replacement character below

out = input.encode("UTF-8", :undef => :replace )
# Passing :replace => "\uFFFD" explicitly doesn't help on 1.7.x either

puts( ( [out] + out.codepoints.map { |i| sprintf('%04X', i) } ).join ' ' )

Output follows. MRI and JRuby 1.6.8 correctly emit the Unicode replacement character '�' (U+FFFD). Every released version of JRuby 1.7.x incorrectly emits '?' (U+003F) instead.

ruby 1.9.3p429 (2013-05-15 revision 40747) [x86_64-linux]
� 00C3 FFFD

jruby 1.6.8 (ruby-1.9.2-p312) (2012-09-18 1772b40) (Java HotSpot(TM) Server VM 1.7.0_21) [linux-i386-java]
� 00C3 FFFD

jruby 1.7.0 (1.9.3p203) 2012-10-22 ff1ebbe on Java HotSpot(TM) Server VM 1.7.0_21-b11 [linux-i386]
Ã? 00C3 003F

jruby 1.7.1 (1.9.3p327) 2012-12-03 30a153b on Java HotSpot(TM) Server VM 1.7.0_21-b11 [linux-i386]
Ã? 00C3 003F

jruby 1.7.2 (1.9.3p327) 2013-01-04 302c706 on Java HotSpot(TM) Server VM 1.7.0_21-b11 [linux-i386]
Ã? 00C3 003F

jruby 1.7.3 (1.9.3p385) 2013-02-21 dac429b on Java HotSpot(TM) Server VM 1.7.0_21-b11 [linux-i386]
Ã? 00C3 003F

jruby 1.7.4 (1.9.3p392) 2013-05-16 2390d3b on Java HotSpot(TM) Server VM 1.7.0_21-b11 [linux-i386]
Ã? 00C3 003F
@dekellum

Also, I'm happy to convert the above test script into a formal unit test on request, given a suggestion of where it should go. I didn't find any obvious existing tests for encode and replacement.
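As a rough idea of what such a test might look like, here is a minimal sketch using Minitest. The class name, helper, and test method names are hypothetical, and the framework choice is an assumption; nothing here reflects where the test would actually live in the JRuby tree.

```ruby
require 'minitest/autorun'

# Hypothetical regression test for the behavior described in this report.
class EncodeReplacementTest < Minitest::Test
  # Bytes 0xC3 0x81 tagged as Windows-1252; 0x81 has no mapping there.
  def windows_1252_input
    [0xC3, 0x81].pack('C*').force_encoding('Windows-1252')
  end

  # With :undef => :replace and a UTF-8 target, the undefined byte
  # should become U+FFFD rather than '?'.
  def test_undef_replace_uses_u_fffd_for_utf8
    out = windows_1252_input.encode('UTF-8', :undef => :replace)
    assert_equal [0x00C3, 0xFFFD], out.codepoints.to_a
  end

  # Without :undef => :replace, the conversion should raise.
  def test_strict_encode_raises_on_undefined_byte
    assert_raises(Encoding::UndefinedConversionError) do
      windows_1252_input.encode('UTF-8')
    end
  end
end
```

Both cases describe the MRI behavior reported above, so the first one would be expected to fail on the affected 1.7.x releases.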

@BanzaiMan
Owner

Thanks for the report. This is a dupe of #375.

@BanzaiMan BanzaiMan closed this
@dekellum

How is this a duplicate of #375? #375 looks like a completely different problem. Is #375 also a regression since jruby 1.6.8? Why not make sure the problem is actually fixed before closing this?

@BanzaiMan
Owner

You're right. I apologize.

@BanzaiMan BanzaiMan reopened this
@dekellum dekellum referenced this issue from a commit
@headius headius Rework replacement string to use Java string, target charset.
We originally passed around a Ruby string with no specified
encoding, but this appeared to decode into invalid replacement
bytes under some encodings. Since we always want to replace with
the '?' character (rather than e.g. the unicode <?> character in
UTF-16), I altered the replaceWith field of CodingErrorActions to
be a plain String, and added the target charset to getBytes.

Fixes #375.
Fixes #861 (same cause).
a84e6d8
@headius
Owner

I've just pushed a bunch of code that tries to use the default replacement character for multibyte encodings rather than hardcoding to ?, but it still seems to use ? for UTF-8. Still needs more investigation.

@headius
Owner

FWIW, e5b6c9f makes it possible to force the replace character to the one you want. This may just be a difference in how Java handles default replacement character for Unicode encodings.
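For reference, forcing the replacement character explicitly might look like the sketch below. It passes :replace alongside :undef, both standard String#encode options; the variable names are illustrative only. Per the original report this did not help on the released 1.7.x versions, so it is only expected to behave as described on MRI or on a build that includes the commit mentioned here.

```ruby
# Bytes 0xC3 0x81 tagged as Windows-1252; 0x81 has no mapping there.
input = [0xC3, 0x81].pack('C*').force_encoding('Windows-1252')

# Ask for U+FFFD explicitly instead of relying on the default replacement.
forced = input.encode('UTF-8', :undef => :replace, :replace => "\uFFFD")

# Print the resulting codepoints as four-digit hex values.
puts forced.codepoints.map { |cp| format('%04X', cp) }.join(' ')
```

On MRI this prints 00C3 FFFD.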

@headius headius closed this in 7c53e5b