String#encode fails with :invalid => :replace option #375

Closed
dudleyf opened this Issue Nov 6, 2012 · 8 comments

Projects

None yet

7 participants

@dudleyf
dudleyf commented Nov 6, 2012

Under MRI (1.9.3-p286):

dudley@kima ~ :> ruby -e 'puts "foo".encode("UTF-16", :invalid => :replace)'
??foo

Under JRuby (1.7.0):

dudley@kima ~ :> ruby -e '"foo".encode("UTF-16", :invalid => :replace)'     
CharsetEncoder.java:285:in `replaceWith': java.lang.IllegalArgumentException: Illegal replacement
    from CharsetTranscoder.java:201:in `getCharsetEncoder'
    from CharsetTranscoder.java:69:in `transcode'
    from CharsetTranscoder.java:58:in `transcode'
    from CharsetTranscoder.java:97:in `transcode'
    from RubyString.java:7559:in `transcode'
    from RubyString.java:7525:in `encode'
    from RubyString$INVOKER$i$encode.gen:-1:in `call'
    from MethodHandle.java:599:in `invokeWithArguments'
    from InvocationLinker.java:183:in `invocationFallback'
    from -e:1:in `__file__'
    from -e:-1:in `load'
    from Ruby.java:770:in `runScript'
    from Ruby.java:763:in `runScript'
    from Ruby.java:640:in `runNormally'
    from Ruby.java:489:in `runFromMain'
    from Main.java:375:in `doRunFromMain'
    from Main.java:264:in `internalRun'
    from Main.java:230:in `run'
    from Main.java:214:in `run'
    from Main.java:194:in `main'

According to the docs (http://www.ruby-doc.org/core-1.9.3/String.html#method-i-encode), the replacement string should default to "\uFFFD" for Unicode encoding forms, and "?" otherwise.

@headius
Member
headius commented Nov 8, 2012

I believe the problem is simply that we're not using the right replacement bytes for some encodings in the following two methods. We need appropriate default replacement logic for the various targets.

In CharsetTranscoder:

    private CharsetDecoder getCharsetDecoder(Charset charset) {
        CharsetDecoder decoder = charset.newDecoder();

        decoder.onUnmappableCharacter(actions.onUnmappableCharacter);
        decoder.onMalformedInput(actions.onMalformedInput);

        if (actions.replaceWith != null) decoder.replaceWith(actions.replaceWith.toString());

        return decoder;
    }

    private CharsetEncoder getCharsetEncoder(Charset charset) {
        CharsetEncoder encoder = charset.newEncoder();

        encoder.onUnmappableCharacter(actions.onUnmappableCharacter);
        encoder.onMalformedInput(actions.onMalformedInput);
        if (actions.replaceWith != null) {
            encoder.replaceWith(actions.replaceWith.getBytes());
        }

        return encoder;
    } 
@toreriklinnerud

I can't be terribly helpful, but whilst testing a lightweight email processing app we currently have in production, this was the biggest issue we hit so far and is currently stopping us from moving forward with a switch from MRI, unless we develop a workaround.

@BanzaiMan
Member
@BanzaiMan
Member

I looked at this issue in a little more detail.

  1. CharsetTranscoder$CodingErrorActions will need to be made encoding aware. This is tedious, but doable.
  2. Our encoding library (jcodings) will need to be made aware of the "UTF-16" encoding, which is a bit dubious. In Java, it is expected that the string contains byte order marker; I'm not sure if MRI has that expectation. At any rate, since jcodings does not know about "UTF-16" encoding, by the time execution gets to CharsetTranscoder, we are treating it like it is ASCII. This loss of information is obviously problematic here, though there might be a good reason against having it.
@pmahoney
Contributor

Just a vote for this issue. exception_notification uses this pattern.

@toreriklinnerud

Does anyone have a suggested workaround? Our app accepts text (email) with any encoding (specified in the email) . Occasionally emails are not in the encoding they say they are, and or they have invalid characters. We need to a way to convert the incoming content to something that is guaranteed valid UTF-8, even if it involves data loss.

@dekellum
Contributor
dekellum commented Jul 5, 2013

@BanzaiMan pointed me at this issue from #856, which I believe is a distinct enough problem, though is clearly related.

FWIW: This issues does appear to be a regression from jruby 1.6.8:

jruby-1.6 -v -e 'puts "foo".encode("UTF-16", :invalid => :replace)'
jruby 1.6.8 (ruby-1.9.2-p312) (2012-09-18 1772b40) (Java HotSpot(TM) Server VM 1.7.0_21) [linux-i386-java]
��foo
@headius headius added a commit that closed this issue Jul 12, 2013
@headius headius Rework replacement string to use Java string, target charset.
We originally passed around a Ruby string with no specified
encoding, but this appeared to decode into invalid replacement
bytes under some encodings. Since we always want to replace with
the '?' character (rather than e.g. the unicode <?> character in
UTF-16), I altered the replaceWith field of CodingErrorActions to
be a plain String, and added the target charset to getBytes.

Fixes #375.
Fixes #861 (same cause).
a84e6d8
@headius headius closed this in a84e6d8 Jul 12, 2013
@bf4
bf4 commented Sep 16, 2013

Great that this will be fixed in 1.7.5! In the meantime, this worked for me:

  def as_utf8(str)
    if str.respond_to?(:encode!)
      if defined?(JRUBY_VERSION)
        # don't blow up https://github.com/jruby/jruby/issues/375
        str = as_ascii(str)
      else
        # Converting it to a higher higher character set (UTF-16) and then
        #  back (to UTF-8) ensures that you will strip away invalid or undefined byte sequences.
        str.encode!(Encoding::UTF_16LE, :invalid => :replace, :undef => :replace, :replace => '_')
      end
      str.encode!(Encoding::UTF_8)
    else
      str
    end
  rescue => e
    # Airbrake.notify(e)
    warn e.inspect
    force_utf8_encoding(str)
  end

  # help out copy and pasting errors of good-looking email addresses
  # by stripping out non-ASCII characters
  def as_ascii(str)
    if str.respond_to?(:to_ascii)
      # with stringex or talentbox-unidecoder
      str.to_ascii
    else
      # avoids invalid multi-byte escape error
      ascii_text = str.encode( Encoding::ASCII, invalid: :replace, undef: :replace, replace: '' )
      # see http://www.ruby-forum.com/topic/183413
      pattern = Regexp.new('[\x80-\xff]', nil, 'n')
      ascii_text.gsub(pattern, '')
    end
  end
  def force_utf8_encoding(str)
    if str.is_a?(String) && str.respond_to?(:force_encoding)
      str = str.dup if str.frozen?

      str.force_encoding(Encoding::UTF_8)

      if !str.valid_encoding?
        #logger.warn("encoding: forcing invalid UTF-8 string; text is #{str}")
        str.encode!(Encoding::UTF_8, Encoding::ISO_8859_1)
      end
    end

    str
  end
@eregon eregon added a commit that referenced this issue Jan 27, 2017
@eregon eregon Squashed 'spec/ruby/' changes from f4a4f0e..fb3e3ac
fb3e3ac Remove redundant specs
89b75f6 Write to an external file instead of using output_to_fd
a1837cb If you change the max of a sized queue, the old items beyond the max remain
f3b9b2d change TruffleRuby's RUBY_ENGINE to ruby until we are recognised by rubygems
5792e29 Simplify Array#sample spec and avoid #itself
1af4646 add array sample randomness spec
086ccc5 Fix a spec typo
ac4cabf Add link to TruffleRuby
8dcc2ca Remove old JRuby+Truffle link
c8eda60 Add spec for Enumerable#chunk_while with single element.
f8e0ea5 Add spec for invalid rescue clauses
19ec73a Spec for rescue comparing against classes using ===
61b4c61 add spec for String#scan's output when passed block has multiple arguments
6047836 Fix bad spec which inadvertently passed due to have_data.
2268696 Replace have_data matchers with File.read().should ==
ca87e50 Merge pull request #383 from manu-crealytics/gzipreader_enumerator_specs
1c4d467 Extend GzipReader specs to cover returned Enumerators.
859eeb0 Kerenel#warn no longer splats an array in 2.5 (#382)
cee9571 Extract WIN32OLE.new to a fixture to ease retrying
0079c36 Merge pull request #379 from kachick/2.4-match-p
9b95f54 Merge pull request #377 from kachick/hash-transform_values
96735c5 Merge pull request #376 from kachick/hash-compact
5be340b Write specs around {String|Symbol}#match?
95b9bbb Write specs around Regexp#match?
ef69806 Merge pull request #375 from kachick/concat-prepend-with-multi-arg
9496ea5 Remove suspicious spec
28a88ed Remove empty WIN32OLE specs
ab2925b Retry on WIN32OLE.new failure
db38e03 Avoid double error
2d281a2 Temporarily disable new spec for String#each_line in 2.5 as Travis trunk lags
2d08e6d Enable 2.4.0 in CI
80f92c2 Write Hash#transform_values! specs around frozen
5611438 Write Hash#transform_values! specs
4d31731 Write Hash#transform_values specs
906bd11 suppress error for a while
dd2360f Write Hash#compact{!} specs
6f66a66 Add Array#concat with multiple arguments
d650d57 Add String#concat with multiple arguments
9020e16 Add String#prepend with multiple arguments
c4a0aa2 Sort by versions
85b2434 Merge pull request #371 from vais/windows
eaf3da5 There is no /dev/tty on Windows, but the rest of the tty spec works
9b81e7d Don't leave zombie IE processes hanging around after specs finish running
f9e6b34 There is no /dev/tty on Windows
dc9c3e1 Silence the nuisance "'does_not_exist' is not recognized..." message

git-subtree-dir: spec/ruby
git-subtree-split: fb3e3ac203386b2b82a7787490ed4f861ab4169e
d83cbf7
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment