Skip to content

HTTPS clone URL

Subversion checkout URL

You can clone with HTTPS or Subversion.

Download ZIP

Loading…

String#encode hard errors without replacement for ASCII 8-bit range to multibyte #461

Closed
headius opened this Issue · 4 comments

2 participants

@headius
Owner

This is the remaining failure in resque's 1-x-stable branch in 1.9 mode. The following code should work in 1.9 mode:

str = "\xFA\x8D\xA8\x8A\xBF4\xD3\x9FV]_"
str.force_encoding('BINARY').encode('UTF-8', :invalid => :replace, :undef => :replace)

We have a hard check for ASCII to multibyte that always raises error for 8-bit characters. In the case that :undef => :replace is specified, we should be allowing the encoding process to proceed to replacing those bytes, rather than hard erroring.

Fix pending.

@headius
Owner

Hmm...I have a partial fix for this, but it seems like our replace logic isn't right in any case. The patch allows transcoding to proceed through to the replacement phase, but doesn't appear to replace all the characters as expected.

I'm pondering whether perhaps our double-transcoding logic (using Java Charset) is not quite right yet...

diff --git a/src/org/jruby/RubyString.java b/src/org/jruby/RubyString.java
index d5ad1e7..dbc7aa8 100644
--- a/src/org/jruby/RubyString.java
+++ b/src/org/jruby/RubyString.java
@@ -63,6 +63,7 @@ import static org.jruby.util.StringSupport.unpackResult;

 import java.io.UnsupportedEncodingException;
 import java.nio.charset.Charset;
+import java.nio.charset.CodingErrorAction;
 import java.util.Locale;

 import org.jcodings.Encoding;
@@ -7600,8 +7601,13 @@ public class RubyString extends RubyObject implements EncodingCapable {
             return new ByteList(value.bytes(), toEncoding);
         }

+        CharsetTranscoder.CodingErrorActions actions = CharsetTranscoder.getCodingErrorActions(context, opts);
+        
         // MRI does not allow ASCII-8BIT chars > 127 to transcode to multibyte encodings
-        if (fromName.equals("ASCII-8BIT") && toEncoding.maxLength() > 1) {
+        // if :invalid/onUnmappable option is unspecified or :error
+        if (fromName.equals("ASCII-8BIT") &&
+                toEncoding.maxLength() > 1 &&
+                actions.onUnmappableCharacter == CodingErrorAction.REPORT) {
             int length = value.length();

             for (int byteidx = 0; byteidx < length; byteidx++) {
diff --git a/src/org/jruby/util/CharsetTranscoder.java b/src/org/jruby/util/CharsetTranscoder.java
index 735feee..bbffabf 100644
--- a/src/org/jruby/util/CharsetTranscoder.java
+++ b/src/org/jruby/util/CharsetTranscoder.java
@@ -83,6 +83,20 @@ public class CharsetTranscoder {
     }

     /**
+     * This will try and transcode the supplied ByteList to the supplied toEncoding.
+     * 
+     * @see CharsetTranscoder#transcode(org.jruby.runtime.ThreadContext, org.jruby.util.ByteList, org.jcodings.Encoding, org.jcodings.Encoding, org.jruby.util.CharsetTranscoder.CodingErrorActions) 
+     * 
+     * c: rb_str_conv_enc_opts
+     */
+    public static ByteList transcode(ThreadContext context, ByteList value, Encoding forceEncoding,
+            Encoding toEncoding, IRubyObject opts) {
+        if (toEncoding == null) return value;
+        
+        return transcode(context, value, forceEncoding, toEncoding, getCodingErrorActions(context, opts));
+    }
+
+    /**
      * This will try and transcode the supplied ByteList to the supplied toEncoding.  It will use
      * forceEncoding as its encoding if it is supplied; otherwise it will use the encoding it has
      * tucked away in the bytelist.  This will return a new copy of a ByteList in the request
@@ -91,16 +105,16 @@ public class CharsetTranscoder {
      * c: rb_str_conv_enc_opts
      */
     public static ByteList transcode(ThreadContext context, ByteList value, Encoding forceEncoding,
-            Encoding toEncoding, IRubyObject opts) {
+            Encoding toEncoding, CodingErrorActions actions) {
         if (toEncoding == null) return value;

-        return new CharsetTranscoder(context, toEncoding, forceEncoding, getCodingErrorActions(context, opts)).transcode(context, value);
+        return new CharsetTranscoder(context, toEncoding, forceEncoding, actions).transcode(context, value);
     }

     public static class CodingErrorActions {
-        final CodingErrorAction onUnmappableCharacter;
-        final CodingErrorAction onMalformedInput;
-        final RubyString replaceWith;
+        public final CodingErrorAction onUnmappableCharacter;
+        public final CodingErrorAction onMalformedInput;
+        public final RubyString replaceWith;

         CodingErrorActions(CodingErrorAction onUnmappableCharacter,
                 CodingErrorAction onMalformedInput, RubyString replaceWith) {
@@ -145,10 +159,6 @@ public class CharsetTranscoder {
             onUnmappableCharacter = CodingErrorAction.REPLACE;
         }

-        if (replaceWith == null && (onUnmappableCharacter == CodingErrorAction.REPLACE || onMalformedInput == CodingErrorAction.REPLACE)) {
-            replaceWith = context.runtime.newString("?");
-        }
-        
         return new CodingErrorActions(onUnmappableCharacter, onMalformedInput, replaceWith);

         /*
@headius
Owner

I think in order for replace to work properly, we need the following:

  • A table mapping target encodings to their default replacement character. For example, in UTF-8 it would be � or \xEF\xBF\xBD.
  • During the first transcode into UTF-16, also transcode the replacement character. This is the first-line replacement.
  • During the second transcode, use the native replacement, so anything that can't go from UTF-16 to the target encoding will be replaced with the proper byte sequence.

The absolute correct fix is for is to not round-trip through Charset anymore, of course, but that requires a lot of work.

Thoughts? /cc @enebo

@enebo
Owner

I have a tree which allows general multi-character replacement support and support for newline transcoding. It is hooked up for IO but not for much of File. I planned on working on this after ripper and 1.7.2 (or when ripper drives me crazy and I need a break).

As to the substance of your bullets I think the logic is right. There is one severe limitation with Charset transcoding in Java (which is why I mentioned my work above): it cannot replace more than a single character via replace in transcoding. For the UTF-16 thing to work all those 8-bit values must be still resolve to be a single character in UTF-16. If not then it can piggy-back off my newline transcoding stuff (which basically processes the string separately from the transcoder). Obviously, the real solution is oni transcoding port, but as you said that is a lot of work.

@headius
Owner

Fixed by 74a57ef.

@headius headius closed this
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Something went wrong with that request. Please try again.