Can't properly parse emoji stored in JSON using jruby-1.7 #290

Closed
jpignata opened this Issue Sep 5, 2012 · 12 comments

5 participants

jpignata commented Sep 5, 2012

I'm trying to parse a JSON string with both the ext and pure JSON parsers. The string contains an emoji, which is stored as a four-byte UTF-8 sequence (a character that would need a surrogate pair in UTF-16). I'm getting the following exception:

Encoding::UndefinedConversionError: Input length = 1
/app/jruby/lib/ruby/1.9/json/pure/parser.rb:171:in `convert_encoding'
/app/jruby/lib/ruby/1.9/json/pure/parser.rb:76:in `initialize'
/app/jruby/lib/ruby/1.9/json/common.rb:148:in `parse'
org/jruby/RubyString.java:7446:in `encode'

Here's a sample of the input:

"{\"name\":\"John Pignata\",\"text\":\"\xF0\x9F\x8E\x83\"}"

And here's what I see when I try to call encode directly on the string as is happening in the parser:

irb(main):037:0> "{\"name\":\"John Pignata\",\"text\":\"\xF0\x9F\x8E\x83\"}".encode(Encoding::UTF_8)
Encoding::UndefinedConversionError: "\xF0" from ASCII-8BIT to UTF-8
    from org/jruby/RubyString.java:7446:in `encode'
    from (irb):37:in `evaluate'
    from org/jruby/RubyKernel.java:1037:in `eval'
    from org/jruby/RubyKernel.java:1353:in `loop'
    from org/jruby/RubyKernel.java:1146:in `catch'
    from org/jruby/RubyKernel.java:1146:in `catch'
    from jruby/bin/irb:13:in `(root)'
Owner
enebo commented Sep 5, 2012

I think the reduction is not quite right. This also fails in MRI 1.9.3:

/Users/enebo/Developer/.rvm/rubies/ruby-1.9.3-p194/bin/irb:4: warning: Insecure world writable dir /Users/enebo/Applications/jay/bin in PATH, mode 040777
1.9.3p194 :001 > "{\"name\":\"John Pignata\",\"text\":\"\xF0\x9F\x8E\x83\"}".encode(Encoding::UTF_8)
"{\"name\":\"John Pignata\",\"text\":\"\xF0\x9F\x8E\x83\"}".encode(Encoding::UTF_8)
Encoding::UndefinedConversionError: "\xF0" from ASCII-8BIT to UTF-8
    from (irb):1:in `encode'
    from (irb):1
    from /Users/enebo/Developer/.rvm/rubies/ruby-1.9.3-p194/bin/irb:16:in `<main>'
jpignata commented Sep 5, 2012

Setting the LANG environment variable to en_US.UTF-8 makes the above work. That's the difference in my environment that I didn't have in our deployed system. Additionally, it actually allows the pure JSON parser to properly parse but I'm still seeing an error in the ext parser.

Exception in thread "pool-1-thread-95" org.jruby.exceptions.RaiseException: (Encoding::UndefinedConversionError) "\xF0" from ASCII-8BIT to UTF-8
org.jruby.RubyString.encode(org/jruby/RubyString.java:7446)
json.ext.GeneratorMethods$RbHash.to_json(json/ext/GeneratorMethods.java:71)
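
For readers hitting the same error, here is a minimal sketch of what seems to be going on, assuming the bytes arrive tagged as ASCII-8BIT (for example because the default external encoding is not UTF-8 when LANG is unset). The variable names are illustrative only. Transcoding binary data to UTF-8 fails on the first high byte, while retagging the same bytes with force_encoding does not convert anything and succeeds:

```ruby
require "json"

# The emoji bytes from the report, deliberately tagged as binary (ASCII-8BIT)
# to imitate input read under a non-UTF-8 default external encoding.
raw = "{\"name\":\"John Pignata\",\"text\":\"\xF0\x9F\x8E\x83\"}"
        .dup.force_encoding(Encoding::BINARY)

# encode tries to transcode each byte; bytes above 0x7F have no defined
# mapping from ASCII-8BIT, so this raises.
transcode_error = nil
begin
  raw.encode(Encoding::UTF_8)
rescue Encoding::UndefinedConversionError => e
  transcode_error = e
end

# force_encoding only changes the encoding tag; the bytes stay the same.
utf8 = raw.dup.force_encoding(Encoding::UTF_8)
parsed = JSON.parse(utf8) if utf8.valid_encoding?
```

This only helps when the bytes really are UTF-8; checking valid_encoding? after retagging guards against mislabelled input.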
jpignata commented Sep 6, 2012
15:15 < enebo> pignata: to work around it you can try setting -J-Dfile.encoding=UTF-8 (if you can control JVM flags)

That unfortunately did not address the issue.

Owner

I suspect that this is the same issue as #314, which we fixed on the master branch. Please test the binary from http://ci.jruby.org/snapshots/master.

Thank you.

jpignata commented Oct 5, 2012

This turned out to be due to some other environmental settings on Heroku. Setting LANG to en_US.UTF-8 eliminated the issue. Thanks.

@jpignata jpignata closed this Oct 5, 2012
sairam commented Nov 5, 2012

Hi,

I am still able to see the error with the following string on JRuby 1.7 with the json-1.7.5-java gem.

 JSON.parse("{\"sample\": \"Hello, \x96 world!\"}")
Encoding::UndefinedConversionError: Input length = 1
    from org/jruby/RubyString.java:7508:in `encode'
    from json/ext/Parser.java:175:in `initialize'
    from json/ext/Parser.java:151:in `new'
    from ...../gems/json-1.7.5-java/lib/json/common.rb:155:in `parse'
    from (irb):7:in `evaluate'
    from org/jruby/RubyKernel.java:1065:in `eval'
    from org/jruby/RubyKernel.java:1390:in `loop'
    from org/jruby/RubyKernel.java:1173:in `catch'
    from org/jruby/RubyKernel.java:1173:in `catch'
    from ..../gems/railties-3.2.8/lib/rails/commands/console.rb:47:in `start'
Owner
headius commented Nov 6, 2012

It seems like this is not valid behavior in MRI either...it allows this \x96 character in a UTF-8 string, which is not a valid UTF-8 character.

Investigating.

Owner
headius commented Nov 6, 2012

@sairam Although your example works on MRI, I think it is their bug. \x96 is not a valid UTF-8 character, and should not be left as-is in a UTF-8 character string or stream.

Note the following output, where MRI will happily encode the already-UTF-8 string (which is malformed) to UTF-8 (probably just does nothing at all) but won't transcode to some other encoding (because it's malformed):

system ~/projects/jruby $ ruby-1.9.3 -e 'p "{\"sample\": \"Hello, \x96 world!\"}".encode("UTF-8")'
"{\"sample\": \"Hello, \x96 world!\"}"

system ~/projects/jruby $ ruby-1.9.3 -e 'p "{\"sample\": \"Hello, \x96 world!\"}".encode("UTF-16")'
-e:1:in `encode': "\x96" on UTF-8 (Encoding::InvalidByteSequenceError)
    from -e:1:in `<main>'

system ~/projects/jruby $ ruby-2.0.0 -e 'p "{\"sample\": \"Hello, \x96 world!\"}".encode("UTF-8")'
"{\"sample\": \"Hello, \x96 world!\"}"

system ~/projects/jruby $ ruby-2.0.0 -e 'p "{\"sample\": \"Hello, \x96 world!\"}".encode("UTF-16")'
-e:1:in `encode': "\x96" on UTF-8 (Encoding::InvalidByteSequenceError)
    from -e:1:in `<main>'
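
The behaviour in the output above can be checked programmatically. The following is a small sketch (variable names are illustrative) that assumes an MRI-compatible Ruby: valid_encoding? reports the malformed byte, a same-encoding encode call is a no-op, and transcoding to another encoding raises.

```ruby
# A UTF-8 tagged string containing a lone continuation byte (\x96).
malformed = "{\"sample\": \"Hello, \x96 world!\"}".dup.force_encoding(Encoding::UTF_8)

# The string is tagged UTF-8 but its bytes are not well-formed UTF-8.
well_formed = malformed.valid_encoding?

# Converting to the encoding the string already has does nothing,
# so the malformed byte passes through untouched.
same = malformed.encode("UTF-8")

# Converting to a different encoding has to decode the bytes first,
# which is where the malformed sequence is detected.
transcode_error = nil
begin
  malformed.encode("UTF-16")
rescue Encoding::InvalidByteSequenceError => e
  transcode_error = e
end
```

Calling valid_encoding? before handing a string to a parser is a cheap way to find out which of the two cases applies.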
Owner
headius commented Nov 6, 2012

@sairam I filed this bug for what I believe is incorrect behavior in MRI: https://bugs.ruby-lang.org/issues/7282

sairam commented Nov 6, 2012

@headius This works fine on JRuby 1.6.7, which falls back safely, but it does not work on JRuby 1.7.0.

jruby-1.6.7 :003 > JSON.parse("{\"sample\": \"Hello, \x96 world!\"}")
 => {"sample"=>"Hello, ? world!"} 
jruby-1.6.7 :003 > JSON.parse("{\"sample\": \"Hello, \xE0 world!\"}")
 => {"sample"=>"Hello, ? world!"} 
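
The lenient result shown above can be approximated explicitly on a stricter runtime. This is only a sketch: parse_lenient is a hypothetical helper, not part of the json gem, and it assumes a Ruby that provides String#scrub (2.1 or later), which the 1.9-mode JRuby discussed here does not.

```ruby
require "json"

# Hypothetical helper: replace malformed UTF-8 bytes before parsing,
# mimicking the "?" substitution seen in the JRuby 1.6.7 output.
def parse_lenient(source, replacement = "?")
  text = source.dup.force_encoding(Encoding::UTF_8)
  # scrub swaps each invalid byte sequence for the replacement string.
  text = text.scrub(replacement) unless text.valid_encoding?
  JSON.parse(text)
end

lone_continuation = parse_lenient("{\"sample\": \"Hello, \x96 world!\"}")
truncated_lead    = parse_lenient("{\"sample\": \"Hello, \xE0 world!\"}")
```

Silently replacing bytes loses information, so this kind of fallback is probably best kept to input whose original encoding cannot be recovered.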
Owner
headius commented Nov 6, 2012

The 1.6.7 behavior is somewhat akin to Ruby 1.8, which will ignore pretty much any invalid text. That it fails now in JRuby 1.7.0 is probably due to 1.7.0 being more strict about encodings to support 1.9 mode correctly.

However, we also do not parse it correctly if it has been encoded as ASCII-8BIT, so there may be something to investigate.

@sairam Can you open a separate bug for this?

sairam commented Nov 6, 2012

That's great to know about the behaviour.

Created an issue for this bug at #374.
