Non-ASCII Symbol gives ArgumentError when calling inspect on the symbol #4070

Closed
donv opened this Issue Aug 12, 2016 · 12 comments

Member

donv commented Aug 12, 2016

Environment

$ ruby -v
jruby 9.1.3.0-SNAPSHOT (2.3.0) 2016-08-12 93bd82f Java HotSpot(TM) 64-Bit Server VM 25.92-b14 on 1.8.0_92-b14 +jit [darwin-x86_64]
$ uname -a
Darwin macbeth-3.local 14.5.0 Darwin Kernel Version 14.5.0: Thu Jun 16 19:58:21 PDT 2016; root:xnu-2782.50.4~1/RELEASE_X86_64 x86_64

Expected Behavior

$ ruby -v
ruby 2.3.1p112 (2016-04-26 revision 54768) [x86_64-darwin14]
$ irb
irb(main):001:0> :Renè
=> :Renè

Actual Behavior

$ ruby -v
jruby 9.1.3.0-SNAPSHOT (2.3.0) 2016-08-12 93bd82f Java HotSpot(TM) 64-Bit Server VM 25.92-b14 on 1.8.0_92-b14 +jit [darwin-x86_64]
$ irb
irb(main):002:0> :René
ArgumentError: invalid byte sequence in UTF-8
from org/jruby/RubySymbol.java:274:in `inspect'
from org/jruby/RubySymbol.java:259:in `inspect'
from org/jruby/RubyKernel.java:1295:in `loop'
from org/jruby/RubyKernel.java:1114:in `catch'
from org/jruby/RubyKernel.java:1114:in `catch'
from /Users/uwe/.rubies/jruby-9.1.3.0-snapshot/bin/irb:13:in `<main>'

enebo Aug 12, 2016

Member

This appears to be specific to irb itself. If I make a symbol in a file or via -e then :Renè parses fine.

phluid61 Aug 19, 2016

Contributor

It's an error in Symbol#inspect

$ bin/jruby -v -e 'p :Renè'
jruby 9.1.3.0-SNAPSHOT (2.3.0) 2016-07-07 6600d9f OpenJDK 64-Bit Server VM 25.91-b14 on 1.8.0_91-8u91-b14-3ubuntu1~16.04.1-b14 +jit [linux-x86_64]
ArgumentError: invalid byte sequence in UTF-8
  inspect at org/jruby/RubySymbol.java:274
  inspect at org/jruby/RubySymbol.java:259
        p at org/jruby/RubyKernel.java:476
   <main> at -e:1

phluid61 Aug 19, 2016

Contributor

Actually, at some point preciseLength() in utils/StringSupport.java returns -3, so codePoint() throws the ArgumentError, which bubbles up via isPrintable() in RubySymbol.java.

The root cause seems to be in the org.jcodings.Encoding child's length(), but I don't have time right now to keep digging.

enebo Aug 19, 2016

Member

Following up on @phluid61: I see that the bytelist for Renè contains only a single byte for è, and the -3 represents -1 - missing, i.e. we are missing 2 bytes. So we are definitely storing the wrong byte data for symbols. Thanks for figuring out that this is just inspect doing the wrong thing. In retrospect, it makes sense that this is why irb was unhappy.
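
To make the "-1 - missing" arithmetic concrete, here is a minimal Ruby sketch of how a length check could arrive at -3 for the stored bytes. The method names `utf8_expected_length` and `precise_length_sketch` are hypothetical and only model the convention described in this comment; they are not the jcodings or JRuby implementation, and the continuation-byte check is deliberately simplified.

```ruby
# Hypothetical model of the "-1 - missing" return convention discussed above.
# This is an illustration, not the jcodings/JRuby code.

# Expected length of a UTF-8 sequence, judged from its lead byte.
# Returns nil for continuation bytes and bytes that can never start a sequence.
def utf8_expected_length(lead)
  case lead
  when 0x00..0x7F then 1
  when 0xC2..0xDF then 2
  when 0xE0..0xEF then 3
  when 0xF0..0xF4 then 4
  end
end

# Returns the character length at +index+, -1 for an invalid sequence,
# or (-1 - missing) when the buffer ends before the sequence is complete.
def precise_length_sketch(bytes, index)
  expected = utf8_expected_length(bytes[index])
  return -1 if expected.nil?

  available = bytes.length - index
  missing = expected - available
  return -1 - missing if missing > 0

  # Simplified check: every trailing byte must look like 10xxxxxx.
  tail = bytes[index + 1, expected - 1]
  return -1 unless tail.all? { |b| (0x80..0xBF).cover?(b) }

  expected
end

stored = [0x52, 0x65, 0x6E, 0xE8] # the bytes reported for the unquoted :Renè

ascii_result = precise_length_sketch(stored, 0) # "R" is a complete 1-byte character
tail_result  = precise_length_sketch(stored, 3) # 0xE8 announces 3 bytes, only 1 is present

p ascii_result
p tail_result
```

Under these assumptions, 0xE8 reads as the lead byte of a three-byte sequence, so two bytes are missing and the result is -1 - 2 = -3.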

donv Aug 19, 2016

Member

@enebo @phluid61 Maybe I misunderstood what you are saying, but this definitely happens outside of IRB as well. We have this pop up in our applications. Our workaround is to quote the symbols. Quoted symbols work as expected:

$ ruby -v
jruby 9.1.3.0-SNAPSHOT (2.3.1) 2016-08-19 6600d9f Java HotSpot(TM) 64-Bit Server VM 25.92-b14 on 1.8.0_92-b14 +jit [darwin-x86_64]
$ ruby -e "puts :Renè"
Ren?
$ ruby -e "puts :'Renè'"
Renè

phluid61 Aug 20, 2016

Contributor

@enebo I see the problem: the raw symbol :Renè has bytes 52 65 6E E8, which corresponds to the ISO-8859-1 encoding of "Renè"; however, the symbol believes it is encoded in UTF-8.

When the symbol is quoted (:"Renè" or :'Renè') it has the correct UTF-8 bytes 52 65 6E C3 A8, so the call to #inspect works properly.

Again, I don't have time to dig further into it right now.
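
The byte sequences in this comment can be reproduced on any Ruby with plain String methods. The sketch below is an illustration only: it works on a String rather than a Symbol, and the `hex` helper is a hypothetical name introduced here for display.

```ruby
# Compare the bytes of "Renè" in UTF-8 and ISO-8859-1, then mislabel the
# ISO-8859-1 bytes as UTF-8 the way the unquoted symbol appears to be stored.

name = "Ren\u00E8" # "Renè" as a UTF-8 string

# Hypothetical display helper: space-separated upper-case hex bytes.
hex = ->(s) { s.bytes.map { |b| format("%02X", b) }.join(" ") }

utf8_hex   = hex.call(name)
latin1     = name.encode("ISO-8859-1") # transcodes, so the bytes change
latin1_hex = hex.call(latin1)

# force_encoding only changes the label; the bytes stay as they are.
mislabeled = latin1.dup.force_encoding("UTF-8")

puts "UTF-8 bytes:      #{utf8_hex}"
puts "ISO-8859-1 bytes: #{latin1_hex}"
puts "mislabeled valid? #{mislabeled.valid_encoding?}"
```

The last line reports whether the ISO-8859-1 bytes form valid UTF-8 once they carry the UTF-8 label, which is the state the unquoted symbol seems to end up in.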

phluid61 Aug 20, 2016

Contributor

Okay, in org.jruby.RubySymbol.newSymbol(Ruby, String, Encoding) the newSymbol's String and ByteList objects don't match.

In org.jruby.RubySymbol:

  • newSymbol(Ruby, String, Encoding) calls...
  • newSymbol(Ruby, String) calls...
  • SymbolTable.getSymbol(String, boolean false) calls...
  • symbolBytesFromString(Ruby, String) calls...
  • new ByteList(ByteList.plain(internedSymbol), USASCIIEncoding.INSTANCE, false);

In that final line, ByteList.plain is essentially just return encode(s, "ISO-8859-1");

After that call stack, newSymbol() calls newSymbol.associateEncoding(encoding), which directly sets the encoding of the ByteList object to UTF-8. So it holds ISO-8859-1 bytes, but it thinks its encoding is UTF-8.

Not sure what the appropriate fix is.
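
The two steps described above (encode as ISO-8859-1, then relabel as UTF-8) can be imitated in plain Ruby to see why a later code point lookup fails. This is only a sketch: `plain_bytes` and `associate_encoding` are hypothetical Ruby stand-ins named after the Java methods in the call chain, not JRuby's code, and the final lines show one way bytes and label would agree rather than a proposed fix.

```ruby
# Hypothetical stand-ins for the Java steps named in the call chain above.

# Models ByteList.plain as described: the string's ISO-8859-1 bytes.
# (Assumption: the input only contains characters that exist in ISO-8859-1.)
def plain_bytes(str)
  str.encode("ISO-8859-1").b
end

# Models associateEncoding as described: relabel the bytes, no transcoding.
def associate_encoding(bytes, encoding)
  bytes.dup.force_encoding(encoding)
end

source = "Ren\u00E8" # "Renè" as a UTF-8 string
stored = associate_encoding(plain_bytes(source), Encoding::UTF_8)

puts stored.encoding        # the label says UTF-8
puts stored.valid_encoding? # but the bytes are not valid UTF-8

# Walking the code points of the mislabeled data raises, similar to the
# codePoint() failure reported earlier in the thread.
codepoint_error =
  begin
    stored.codepoints
    nil
  rescue ArgumentError => e
    e
  end
puts codepoint_error.class

# For contrast: transcoding into the target encoding keeps bytes and label in sync.
consistent = source.encode(Encoding::UTF_8)
puts consistent.valid_encoding?
```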

enebo Aug 20, 2016

Member

@phluid61 I think this is running into what we need to do but haven't. We have made some stuff work but it is inconsistent and the broken way is the way which allows symbols to work in some cases. See: #3880 (comment)

enebo Aug 20, 2016

Member

I should add to that comment by saying properly encoded values are not just strictly for display purposes but are also needed in cases which will cross a resource gap like to a native extension or to a Java type via our Java Integration.

brocktimus added a commit to brocktimus/jruby that referenced this issue Aug 21, 2016

enebo changed the title from "Non-ASCII Symbol gives ArgumentError: invalid byte sequence in UTF-8" to "Non-ASCII Symbol gives ArgumentError when calling inspect on the symbol" Aug 22, 2016

enebo modified the milestones: JRuby 9.1.4.0, JRuby 9.1.3.0 Aug 22, 2016

enebo Aug 22, 2016

Member

Too risky before 9.1.3.0 but this along with #3880 should be one of the first things we do for 9.1.4.0 so it can bake.

headius Apr 27, 2017

Member

Fixed, likely by changes for #4564.

headius closed this Apr 27, 2017

headius modified the milestones: JRuby 9.1.9.0, JRuby 9.2.0.0 Apr 27, 2017

donv May 6, 2017

Member

Confirmed fixed in my application!

kares added a commit that referenced this issue Jan 10, 2018

Merge pull request #4096 from brocktimus/stickers_symbols_with_encoding
Extra tests for symbol encoding Re: #4070 + #3719 and possibly #3880