Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Encoding of symbol literals does not respect the encoding of the source file #1328

Closed
DavidEGrayson opened this Issue Dec 12, 2013 · 14 comments

Comments

Projects
None yet
4 participants
@DavidEGrayson
Copy link
Contributor

DavidEGrayson commented Dec 12, 2013

Hello. JRuby does not seem to respect encoding marking at the top of source files when deciding the encoding of symbol literals. In a file marked as UTF-8, all symbols get marked as US-ASCII even if they contain special characters. Here is the code that reproduces the issue:

# coding: UTF-8
puts :µ.encoding  # jruby says US-ASCII, MRI says UTF-8
puts :a.encoding  # jruby and MRI both say US-ASCII

It is dangerous to have an symbol like µ with its encoding set to US-ASCII because there are many common things you might do, like calling inspect, that result in "ArgumentError: invalid byte sequence in US-ASCII".

One workaround is to use a string literal followed by .to_sym.

Here is the output of my jruby -v:

jruby 1.7.9 (1.9.3p392) 2013-12-06 87b108a on Java HotSpot(TM) 64-Bit Server VM
1.7.0_07-b10 [Windows 8-amd64]

The MRI I compared this to is: "ruby 2.0.0p0 (2013-02-24) [x64-mingw32]".

Sorry if this is a duplicate; I did check every closed or open github issue tagged with "encoding" trying to avoid that.

@k77ch7

This comment has been minimized.

Copy link
Contributor

k77ch7 commented Dec 21, 2014

This works on master and jruby-1_7. #2172 and #1685 are duplicate of this one.

@DavidEGrayson

This comment has been minimized.

Copy link
Contributor Author

DavidEGrayson commented Dec 21, 2014

Those bug reports got it wrong: the problem only affected symbol literals, not all symbols.

@enebo

This comment has been minimized.

Copy link
Member

enebo commented Dec 22, 2014

Are you saying we still have this wrong or that just the reports were wrong?

-Tom

On Sun, Dec 21, 2014 at 2:11 PM, David Grayson notifications@github.com
wrote:

Those bug reports got it wrong: the problem only affected symbol literals,
not all symbols.


Reply to this email directly or view it on GitHub
#1328 (comment).

blog: http://blog.enebo.com twitter: tom_enebo
mail: tom.enebo@gmail.com

@DavidEGrayson

This comment has been minimized.

Copy link
Contributor Author

DavidEGrayson commented Dec 22, 2014

I was just talking about the other bug reports. I haven't tested jruby recently regarding this.

@enebo

This comment has been minimized.

Copy link
Member

enebo commented Dec 22, 2014

Cool. Thanks for the clarification. This last round fixed all remaining
rubyspec failures, so if we are not passing something I don't think neither
MRI nor rubyspec has coverage for it.

-Tom

On Mon, Dec 22, 2014 at 10:36 AM, David Grayson notifications@github.com
wrote:

I was just talking about the other bug reports. I haven't tested jruby
recently regarding this.


Reply to this email directly or view it on GitHub
#1328 (comment).

blog: http://blog.enebo.com twitter: tom_enebo
mail: tom.enebo@gmail.com

@DavidEGrayson

This comment has been minimized.

Copy link
Contributor Author

DavidEGrayson commented Jan 22, 2015

I tested the script from my original post on JRuby 9.0.0.0-pre1 and I got a more serious error:

$ jruby -v sym_enc.rb
jruby 9.0.0.0.pre1 (2.2.0p0) 2015-01-20 d537cab Java HotSpot(TM) 64-Bit Server VM 24.45-b08 on 1.7.0
_45-b18 +jit [Windows 8-amd64]
io/console not supported; tty will not be manipulated
UTF8Encoding.java:35:in `length': java.lang.ArrayIndexOutOfBoundsException: -1
        from MultiByteEncoding.java:209:in `strLength'
        from ByteList.java:593:in `lengthEnc'
        from SymbolNode.java:71:in `<init>'
        from RubyParser.java:4224:in `execute'
        from RubyParser.java:1647:in `yyparse'
        from RubyParser.java:1538:in `yyparse'
        from RubyParser.java:5248:in `parse'
        from Parser.java:108:in `parse'
        from Parser.java:89:in `parse'
        from Ruby.java:2691:in `parseFileAndGetAST'
        from Ruby.java:2684:in `parseFileFromMainAndGetAST'
        from Ruby.java:2672:in `parseFileFromMain'
        from Ruby.java:602:in `parseFromMain'
        from Ruby.java:548:in `runFromMain'
        from Main.java:405:in `doRunFromMain'
        from Main.java:300:in `internalRun'
        from Main.java:227:in `run'
        from Main.java:199:in `main'

To help people reproduce the issue, here is what my hex editor (WinHex) shows for sym_enc.rb:

Offset      0  1  2  3  4  5  6  7   8  9  A  B  C  D  E  F

00000000   23 20 63 6F 64 69 6E 67  3A 20 55 54 46 2D 38 0D   # coding: UTF-8 
00000010   0A 70 75 74 73 20 3A C2  B5 2E 65 6E 63 6F 64 69    puts :µ.encodi
00000020   6E 67 0D 0A 70 75 74 73  20 3A 61 2E 65 6E 63 6F   ng  puts :a.enco
00000030   64 69 6E 67 0D 0A                                  ding  
@DavidEGrayson

This comment has been minimized.

Copy link
Contributor Author

DavidEGrayson commented Jan 22, 2015

I put the script into a gist because headius asked me to on IRC: https://gist.github.com/DavidEGrayson/5a03462b705f6d96e2fe

@cheald

This comment has been minimized.

Copy link
Contributor

cheald commented Jan 22, 2015

I couldn't reproduce this on Fedora, and @headius couldn't on OS X, but I did successfully reproduce it on Windows 8.1 with pre1. Yay for platform bugs!

@DavidEGrayson

This comment has been minimized.

Copy link
Contributor Author

DavidEGrayson commented Jan 22, 2015

If I run this slightly modified script, JRuby seems to get stuck in an infinite loop:

# coding: UTF-8
puts :aµ.encoding
puts :a.encoding

cheald added a commit to cheald/jruby that referenced this issue Jan 23, 2015

Fix ascii-only detection of symbol names.
If you run the issue described in jruby#1328 with -J-Dfile.encoding=CP1252
then the string length comparison fails. This smoothes over the issue,
but it is still possible that there are problems with bytes being read
with the JVM's file.encoding rather than the encoding specified in the
magic comment.

Closes jruby#1328
@DavidEGrayson

This comment has been minimized.

Copy link
Contributor Author

DavidEGrayson commented Jul 24, 2015

I just tested this in JRuby 9.0.0.0. Unfortunately, JRuby still gets stuck in an infinite loop when I run the script posted in my last comment.

@enebo enebo added this to the JRuby 9.0.1.0 milestone Jul 27, 2015

@enebo

This comment has been minimized.

Copy link
Member

enebo commented Jul 27, 2015

@DavidEGrayson Thanks for the update. This bug involves Windows having a file.encoding=cp1252 (or potentially another one). I can get an infinite loop as well with your script on MacOS if I invoke it as:

jruby -J-Dfile.encoding=CP1252 ../snippets/parser19.rb 

Clearly, we are doing something wrong. If you want to work around this until we fix this you can pass -J-Dfile.encoding=UTF-8.

@enebo enebo closed this in e590e5d Aug 21, 2015

@enebo

This comment has been minimized.

Copy link
Member

enebo commented Aug 21, 2015

For posterity I will write a note on this especially since I think we can make a better fix later (which was too invasive to consider now). All identifiers are created as Java intern'd Strings. This is also how symbols are made (token ':' + tIDENT -- more cases but you get the idea). When we make SymbolNodes we called getBytes() on that Java String which would return bytes as presented via Java file.encoding property. This led to bogus byte sequences which did not map to the encoding Ruby thought the symbol was.

The better fix long-term would be to puts bytelist for all identifiers and then for values actually destined to be things like method names or lvars as intern'd strings. The problem with this long term fix is constructing a ByteList per real identifier and it also means more complicated logic in the parser to convert as appropriate.

In thinking about this I also thought about an extra field on RubyLexer. The problem with this is for productions like label where we have tLABEL expr_value where expr_value could have an embedded identifier.

@DavidEGrayson

This comment has been minimized.

Copy link
Contributor Author

DavidEGrayson commented Aug 22, 2015

Thanks @enebo! I assume your commit fixes the infinite loop and the original problem with symbol literal encoding.

@enebo

This comment has been minimized.

Copy link
Member

enebo commented Aug 22, 2015

@DavidEGrayson yeah I did check that case as an extra sanity check but the reason for the infinite loop was we were taking CP1252 bytes and trying to calculate character length as if they were a UTF-8 set of bytes. I am not sure why jcodings can get stuck in an infinite loop in that case but that case should never happen.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
You can’t perform that action at this time.