Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

String#encode hard errors without replacement for ASCII 8-bit range to multibyte #461

headius opened this issue Dec 22, 2012 · 4 comments


Copy link

@headius headius commented Dec 22, 2012

This is the remaining failure in resque's 1-x-stable branch in 1.9 mode. The following code should work in 1.9 mode:

str = "\xFA\x8D\xA8\x8A\xBF4\xD3\x9FV]_"
str.force_encoding('BINARY').encode('UTF-8', :invalid => :replace, :undef => :replace)

We have a hard check for ASCII to multibyte that always raises error for 8-bit characters. In the case that :undef => :replace is specified, we should be allowing the encoding process to proceed to replacing those bytes, rather than hard erroring.

Fix pending.

Copy link
Member Author

@headius headius commented Dec 22, 2012

Hmm...I have a partial fix for this, but it seems like our replace logic isn't right in any case. The patch allows transcoding to proceed through to the replacement phase, but doesn't appear to replace all the characters as expected.

I'm pondering whether perhaps our double-transcoding logic (using Java Charset) is not quite right yet...

diff --git a/src/org/jruby/ b/src/org/jruby/
index d5ad1e7..dbc7aa8 100644
--- a/src/org/jruby/
+++ b/src/org/jruby/
@@ -63,6 +63,7 @@ import static org.jruby.util.StringSupport.unpackResult;

 import java.nio.charset.Charset;
+import java.nio.charset.CodingErrorAction;
 import java.util.Locale;

 import org.jcodings.Encoding;
@@ -7600,8 +7601,13 @@ public class RubyString extends RubyObject implements EncodingCapable {
             return new ByteList(value.bytes(), toEncoding);

+        CharsetTranscoder.CodingErrorActions actions = CharsetTranscoder.getCodingErrorActions(context, opts);
         // MRI does not allow ASCII-8BIT chars > 127 to transcode to multibyte encodings
-        if (fromName.equals("ASCII-8BIT") && toEncoding.maxLength() > 1) {
+        // if :invalid/onUnmappable option is unspecified or :error
+        if (fromName.equals("ASCII-8BIT") &&
+                toEncoding.maxLength() > 1 &&
+                actions.onUnmappableCharacter == CodingErrorAction.REPORT) {
             int length = value.length();

             for (int byteidx = 0; byteidx < length; byteidx++) {
diff --git a/src/org/jruby/util/ b/src/org/jruby/util/
index 735feee..bbffabf 100644
--- a/src/org/jruby/util/
+++ b/src/org/jruby/util/
@@ -83,6 +83,20 @@ public class CharsetTranscoder {

+     * This will try and transcode the supplied ByteList to the supplied toEncoding.
+     * 
+     * @see CharsetTranscoder#transcode(org.jruby.runtime.ThreadContext, org.jruby.util.ByteList, org.jcodings.Encoding, org.jcodings.Encoding, org.jruby.util.CharsetTranscoder.CodingErrorActions) 
+     * 
+     * c: rb_str_conv_enc_opts
+     */
+    public static ByteList transcode(ThreadContext context, ByteList value, Encoding forceEncoding,
+            Encoding toEncoding, IRubyObject opts) {
+        if (toEncoding == null) return value;
+        return transcode(context, value, forceEncoding, toEncoding, getCodingErrorActions(context, opts));
+    }
+    /**
      * This will try and transcode the supplied ByteList to the supplied toEncoding.  It will use
      * forceEncoding as its encoding if it is supplied; otherwise it will use the encoding it has
      * tucked away in the bytelist.  This will return a new copy of a ByteList in the request
@@ -91,16 +105,16 @@ public class CharsetTranscoder {
      * c: rb_str_conv_enc_opts
     public static ByteList transcode(ThreadContext context, ByteList value, Encoding forceEncoding,
-            Encoding toEncoding, IRubyObject opts) {
+            Encoding toEncoding, CodingErrorActions actions) {
         if (toEncoding == null) return value;

-        return new CharsetTranscoder(context, toEncoding, forceEncoding, getCodingErrorActions(context, opts)).transcode(context, value);
+        return new CharsetTranscoder(context, toEncoding, forceEncoding, actions).transcode(context, value);

     public static class CodingErrorActions {
-        final CodingErrorAction onUnmappableCharacter;
-        final CodingErrorAction onMalformedInput;
-        final RubyString replaceWith;
+        public final CodingErrorAction onUnmappableCharacter;
+        public final CodingErrorAction onMalformedInput;
+        public final RubyString replaceWith;

         CodingErrorActions(CodingErrorAction onUnmappableCharacter,
                 CodingErrorAction onMalformedInput, RubyString replaceWith) {
@@ -145,10 +159,6 @@ public class CharsetTranscoder {
             onUnmappableCharacter = CodingErrorAction.REPLACE;

-        if (replaceWith == null && (onUnmappableCharacter == CodingErrorAction.REPLACE || onMalformedInput == CodingErrorAction.REPLACE)) {
-            replaceWith = context.runtime.newString("?");
-        }
         return new CodingErrorActions(onUnmappableCharacter, onMalformedInput, replaceWith);

Copy link
Member Author

@headius headius commented Dec 22, 2012

I think in order for replace to work properly, we need the following:

  • A table mapping target encodings to their default replacement character. For example, in UTF-8 it would be � or \xEF\xBF\xBD.
  • During the first transcode into UTF-16, also transcode the replacement character. This is the first-line replacement.
  • During the second transcode, use the native replacement, so anything that can't go from UTF-16 to the target encoding will be replaced with the proper byte sequence.

The absolute correct fix is for is to not round-trip through Charset anymore, of course, but that requires a lot of work.

Thoughts? /cc @enebo

Copy link

@enebo enebo commented Jan 2, 2013

I have a tree which allows general multi-character replacement support and support for newline transcoding. It is hooked up for IO but not for much of File. I planned on working on this after ripper and 1.7.2 (or when ripper drives me crazy and I need a break).

As to the substance of your bullets I think the logic is right. There is one severe limitation with Charset transcoding in Java (which is why I mentioned my work above): it cannot replace more than a single character via replace in transcoding. For the UTF-16 thing to work all those 8-bit values must be still resolve to be a single character in UTF-16. If not then it can piggy-back off my newline transcoding stuff (which basically processes the string separately from the transcoder). Obviously, the real solution is oni transcoding port, but as you said that is a lot of work.

Copy link
Member Author

@headius headius commented Jul 13, 2013

Fixed by 74a57ef.

@headius headius closed this Jul 13, 2013
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
None yet
Linked pull requests

Successfully merging a pull request may close this issue.

None yet
2 participants