string encoding is incorrect when compiling to .class files via jrubyc #4445

Closed
amarkowitz opened this Issue Jan 18, 2017 · 8 comments

amarkowitz commented Jan 18, 2017

Environment

user@ ~/dev/jruby (master)$ jruby -v
jruby 9.1.5.0 (2.3.1) 2016-09-07 036ce39 Java HotSpot(TM) 64-Bit Server VM 25.112-b16 on 1.8.0_112-b16 +jit [darwin-x86_64]
user@ ~/dev/jruby (master)$ uname -a
Darwin computer_name 16.3.0 Darwin Kernel Version 16.3.0: Thu Nov 17 20:23:58 PST 2016; root:xnu-3789.31.2~1/RELEASE_X86_64 x86_64

Expected Behavior

I expect string literals in a Ruby 2.x-compatible JRuby to default to UTF-8, not ASCII-8BIT, when the source file has no encoding header.

Actual Behavior

user@ ~/dev$ rvm use jruby-9.1.5.0
Using /Users/user/.rvm/gems/jruby-9.1.5.0
user@ ~/dev$ cat enc_test.rb 
puts ''.encoding
user@ ~/dev$ jrubyc enc_test.rb 
user@ ~/dev$  java -cp .:/Users/user/.rvm/rubies/jruby-9.1.5.0/lib/jruby.jar enc_test 
ASCII-8BIT
user@ ~/dev$  jruby enc_test.rb
UTF-8

Notice that the empty string literal reports ASCII-8BIT when the script is run from the .class file generated by jrubyc, but UTF-8 when the same script is run via jruby.

The docs here indicate:

All Ruby script code has an associated Encoding which any String literal created in the source code will be associated to.

The default script encoding is Encoding::UTF-8 after v2.0...
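To make the quoted rule concrete, here is a minimal sketch (not taken from the JRuby sources) that compares the script encoding with the encoding of a literal. The method name `literal_matches_script_encoding?` is made up for illustration.

```ruby
# Hypothetical helper: a string literal should carry the same encoding
# as the script it appears in (available as __ENCODING__).
def literal_matches_script_encoding?
  ''.encoding == __ENCODING__
end

puts __ENCODING__                       # script (source) encoding
puts ''.encoding                        # encoding of an empty literal
puts literal_matches_script_encoding?
```

Running the same file through both `jruby` and the jrubyc-compiled class should, in principle, print identical values; the report above shows that they differ.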

jruby 9.1.6.0 seems to also suffer from the same problem:

user@ ~/dev$  rvm use jruby-9.1.6.0
Using /Users/user/.rvm/gems/jruby-9.1.6.0
user@ ~/dev$ jrubyc enc_test.rb 
user@ ~/dev$  java -cp .:/Users/user/.rvm/rubies/jruby-9.1.6.0/lib/jruby.jar enc_test 
ASCII-8BIT
user@ ~/dev$ jruby enc_test.rb
UTF-8

jruby 9.1.7.0 looks to have a separate problem that prevents even this simple example from running:

user@ ~/dev$ rvm use jruby-9.1.7.0
Using /Users/user/.rvm/gems/jruby-9.1.7.0
user@ ~/dev$  jrubyc enc_test.rb 
TypeError: failed to coerce org.objectweb.asm.ClassWriter to org.jruby.org.objectweb.asm.ClassVisitor
  block in compile_files_with_options at /Users/user/.rvm/rubies/jruby-9.1.7.0/lib/ruby/stdlib/jruby/compiler.rb:189
  block in compile_files_with_options at /Users/user/.rvm/rubies/jruby-9.1.7.0/lib/ruby/stdlib/jruby/compiler.rb:297
                                 each at org/jruby/RubyArray.java:1733
           compile_files_with_options at /Users/user/.rvm/rubies/jruby-9.1.7.0/lib/ruby/stdlib/jruby/compiler.rb:281
                         compile_argv at /Users/user/.rvm/rubies/jruby-9.1.7.0/lib/ruby/stdlib/jruby/compiler.rb:94
                               <main> at /Users/user/.rvm/rubies/jruby-9.1.7.0/bin/jrubyc:5
user@ ~/dev$ jruby enc_test.rb
UTF-8

@amarkowitz amarkowitz changed the title from Default encoding is incorrect when compiling to .class files via jrubyc to string encoding is incorrect when compiling to .class files via jrubyc Jan 18, 2017

amarkowitz commented Jan 19, 2017

Possibly related to this code which opens the file to compile with ASCII-8BIT?

https://github.com/jruby/jruby/blob/9.1.5.0/lib/ruby/stdlib/jruby/compiler.rb#L128

Member

headius commented Jan 19, 2017

Are you using jrubyc to produce normal-looking Java classes (using --java or --javac) or are you just using it to obfuscate code?

If it's the former, your analysis is probably correct; the code is being parsed using an incorrect encoding. This feature was always experimental, but it shouldn't be a hard fix. Simplest answer might be to just allow the default encoding, omitting "ASCII-8BIT" altogether. That should be simple enough for you to put in a PR.

If it's the latter, we have more investigation to do.

@headius headius added this to the JRuby 9.1.8.0 milestone Jan 19, 2017

amarkowitz commented Jan 19, 2017

In our case we're using it so we don't ship our native ruby sources, so more in the vein of obfuscation.

Member

headius commented Jan 24, 2017

Ok, then somewhere in the pipeline we're reading those sources in with an incorrect encoding. The line you linked to is probably the right place. I'll modify the code to just use system default encoding (or encoding pragma) and add a flag to allow specifying encoding.

Member

headius commented Jan 24, 2017

I traced that line to 1778496 by @enebo a couple years back, intended to fix #3175. It may be fine for it to read the content in as binary, but I believe it then passes that unencoded content along to the parser as if it were a properly encoded string.

amarkowitz commented Jan 24, 2017

@headius - thanks! Yes, that looks like what is happening. The script file is loaded as a binary string tagged ASCII-8BIT and passed to the parser, which then uses that encoding as the script encoding, and that in turn determines the encoding of strings created in the script file. If the script file does not specify an encoding comment at the top, the encoding of the binary source string, as currently loaded by jrubyc, is used for compilation. Since Ruby 2 seems to use UTF-8 by default for script encoding (http://ruby-doc.org/core-2.3.1/Encoding.html#class-Encoding-label-Script+encoding), it might be more proper to load the file as UTF-8 by default (instead of the system default encoding), while allowing the user to override it via a flag.
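A rough sketch of the difference being discussed, using only core Ruby. This is not the code in jruby/compiler.rb; `read_source` and its `encoding:` parameter are hypothetical names used for illustration.

```ruby
require 'tempfile'

# Hypothetical helper. File.binread always returns a string tagged
# ASCII-8BIT; retagging with force_encoding changes only the label,
# not the bytes.
def read_source(path, encoding: nil)
  bytes = File.binread(path)
  encoding ? bytes.force_encoding(encoding) : bytes
end

Tempfile.create(['enc_test', '.rb']) do |f|
  f.write("puts ''.encoding\n")
  f.flush

  raw  = read_source(f.path)                             # binary-tagged source
  utf8 = read_source(f.path, encoding: Encoding::UTF_8)  # same bytes, UTF-8 tag

  puts raw.encoding
  puts utf8.encoding
  puts raw.bytes == utf8.bytes
end
```

If a parser takes its script encoding from the tag on the source string, the first form would yield binary literals and the second UTF-8 literals, which matches the behaviour described above.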

Member

headius commented Jan 24, 2017

I have a workaround for you: specify the file encoding with an # encoding: utf-8 line at the top of the files in question. We appear to honor that even though we read it in as ASCII-8BIT.

I'm working on a fix. The two paths (plain jrubyc versus jrubyc --java) appear to parse the content differently...for no obvious reason.
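For reference, the workaround applied to the enc_test.rb file from the original report would look roughly like this. The magic comment has to be the first line of the file (or the second, after a shebang) to take effect.

```ruby
# encoding: utf-8
# enc_test.rb with the workaround applied: the magic comment above
# declares the script encoding explicitly instead of relying on the default.
puts ''.encoding
```

Per the comment above, jrubyc appears to honor this declaration even though it reads the file in as ASCII-8BIT.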

@headius headius closed this in d7f5669 Jan 24, 2017

Member

headius commented Jan 24, 2017

I have pushed a fix but we need a test for this.
