New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

JRuby creates symbols with US-ASCII encoding but non-ASCII bytes #4828

Closed
jmiettinen opened this Issue Oct 25, 2017 · 9 comments

Comments

Projects
None yet
5 participants
@jmiettinen
Contributor

jmiettinen commented Oct 25, 2017

When JRuby creates symbols for undefined local variable, the symbols' ByteList have US-ASCII encoding but bytes in it may not actually be within US-ASCII range.

Environment

Reproduces at least on JRuby 1.7.27 && JRuby 9.1.12.0. Master seems to currently be the same.

This is JRuby internal, so platform is mostly irrevelant. However:

  • uname -a says Darwin jmiettinen.local 16.7.0 Darwin Kernel Version 16.7.0: Thu Jun 15 17:36:27 PDT 2017; root:xnu-3789.70.16~2/RELEASE_X86_64 x86_64
  • locale says LC_ALL="fi_FI.UTF-8"

Expected Behavior

Given this small script (named utf8_fail.rb in my example outputs):

# encoding: utf-8                                                                                                                                                                                                                                                             
begin
  öÖa
rescue => e
  puts e.message.encoding
  puts e.message
  puts /foo/ === e.message
end

I would expect to get the following output (this is from 1.9.3-p448 and 2.3.1):

UTF-8
undefined local variable or method `öÖa' for main:Object
false

Actual Behavior

However, when the same file is run with JRuby 1.7.27 / JRuby 9.1.12.0, we get problems with bytes in the created symbol öÖa:

US-ASCII
undefined local variable or method `??a' for main:Object
ArgumentError: invalid byte sequence in US-ASCII
     === at org/jruby/RubyRegexp.java:1078
  <main> at utf8_fail.rb:8

Here the error message differs and RubyRegexp notices that there are some non-ASCII bytes in the string with US-ASCII encoding and throws ArgumentError.

If we run this through hexdump (ruby utf8_fail.rb 2>&1| hexdump -C), we get

00000000  55 53 2d 41 53 43 49 49  0a 75 6e 64 65 66 69 6e  |US-ASCII.undefin|
00000010  65 64 20 6c 6f 63 61 6c  20 76 61 72 69 61 62 6c  |ed local variabl|
00000020  65 20 6f 72 20 6d 65 74  68 6f 64 20 60 f6 d6 61  |e or method `..a|
00000030  27 20 66 6f 72 20 6d 61  69 6e 3a 4f 62 6a 65 63  |' for main:Objec|
00000040  74 0a 41 72 67 75 6d 65  6e 74 45 72 72 6f 72 3a  |t.ArgumentError:|
00000050  20 69 6e 76 61 6c 69 64  20 62 79 74 65 20 73 65  | invalid byte se|
00000060  71 75 65 6e 63 65 20 69  6e 20 55 53 2d 41 53 43  |quence in US-ASC|
00000070  49 49 0a 20 20 20 20 20  3d 3d 3d 20 61 74 20 6f  |II.     === at o|
00000080  72 67 2f 6a 72 75 62 79  2f 52 75 62 79 52 65 67  |rg/jruby/RubyReg|
00000090  65 78 70 2e 6a 61 76 61  3a 31 30 37 38 0a 20 20  |exp.java:1078.  |
000000a0  3c 6d 61 69 6e 3e 20 61  74 20 75 74 66 38 5f 66  |<main> at utf8_f|
000000b0  61 69 6c 2e 72 62 3a 38  0a                       |ail.rb:8.|
000000b9

Here it can be seen that codepoints for ö and Ö (f6 and d6) are copied just directly to the ByteList used in that message.

@headius

This comment has been minimized.

Show comment
Hide comment
@headius

headius Oct 26, 2017

Member

Nice...this shouldn't be hard to fix. Currently we still process most identifiers as Java strings, though @enebo has been experimenting with moving that to a ByteList (byte[] + encoding) or Symbol. I'm guessing we don't track encoding properly here and then produce the error wrong as a result.

Member

headius commented Oct 26, 2017

Nice...this shouldn't be hard to fix. Currently we still process most identifiers as Java strings, though @enebo has been experimenting with moving that to a ByteList (byte[] + encoding) or Symbol. I'm guessing we don't track encoding properly here and then produce the error wrong as a result.

@headius

This comment has been minimized.

Show comment
Hide comment
@headius

headius Oct 26, 2017

Member

This is unlikely to get fixed for 9.1.14 because there's some wide-ranging changes I'm not sure I'd be comfortable with. Is this affecting you or did you just happen to notice it?

Member

headius commented Oct 26, 2017

This is unlikely to get fixed for 9.1.14 because there's some wide-ranging changes I'm not sure I'd be comfortable with. Is this affecting you or did you just happen to notice it?

@headius headius added this to the JRuby 9.2.0.0 milestone Oct 26, 2017

@jmiettinen

This comment has been minimized.

Show comment
Hide comment
@jmiettinen

jmiettinen Oct 26, 2017

Contributor

We have a work-around so it's no biggie.

Contributor

jmiettinen commented Oct 26, 2017

We have a work-around so it's no biggie.

@cmthakur

This comment has been minimized.

Show comment
Hide comment
@cmthakur

cmthakur Jan 25, 2018

@jmiettinen Would you please let us know the workaround so that we can fix the issue until it gets fixed in upcoming versions?

cmthakur commented Jan 25, 2018

@jmiettinen Would you please let us know the workaround so that we can fix the issue until it gets fixed in upcoming versions?

@jmiettinen

This comment has been minimized.

Show comment
Hide comment
@jmiettinen

jmiettinen Jan 25, 2018

Contributor

I am not sure if our work-around helps as it was related to printing backtrace and message of an Exception. But it might give you some ideas @cmthakur.

We encountered the problem in situation where we had

class Exception
  def stacktrace
    "#{message}\n\t#{(backtrace || []).join("\n\t")}"
  end
end

which was to be written to an output stream. But writing did not succeed as we'd get bytes outside ASCII range. We just ended up sanitizing the result by replacing all bytes > 127 with ?. One could try to interpret those bytes as UTF-8 bytes too which they most probably are. So here's the simple approach:

# Usage: sanitize_nonascii_stacktrace(exception.stacktrace)
ASCII_ENCODING = Encoding.find("US-ASCII")
def sanitize_nonascii_stacktrace(stacktrace)
  if stacktrace.encoding == ASCII_ENCODING && !stacktrace.valid_encoding?
    question_mark_byte = '?'.encode(ASCII_ENCODING).getbyte(0)
    (0...stacktrace.bytesize).each do |index|
      if stacktrace.getbyte(index) > 0x7f
        stacktrace.setbyte(index, question_mark_byte)
      end
    end
  end
  stacktrace
end
Contributor

jmiettinen commented Jan 25, 2018

I am not sure if our work-around helps as it was related to printing backtrace and message of an Exception. But it might give you some ideas @cmthakur.

We encountered the problem in situation where we had

class Exception
  def stacktrace
    "#{message}\n\t#{(backtrace || []).join("\n\t")}"
  end
end

which was to be written to an output stream. But writing did not succeed as we'd get bytes outside ASCII range. We just ended up sanitizing the result by replacing all bytes > 127 with ?. One could try to interpret those bytes as UTF-8 bytes too which they most probably are. So here's the simple approach:

# Usage: sanitize_nonascii_stacktrace(exception.stacktrace)
ASCII_ENCODING = Encoding.find("US-ASCII")
def sanitize_nonascii_stacktrace(stacktrace)
  if stacktrace.encoding == ASCII_ENCODING && !stacktrace.valid_encoding?
    question_mark_byte = '?'.encode(ASCII_ENCODING).getbyte(0)
    (0...stacktrace.bytesize).each do |index|
      if stacktrace.getbyte(index) > 0x7f
        stacktrace.setbyte(index, question_mark_byte)
      end
    end
  end
  stacktrace
end
@headius

This comment has been minimized.

Show comment
Hide comment
@headius

headius Jan 25, 2018

Member

@enebo actually chatted with me about the encoding of exception messages and backtraces, relating to his work on the bytelist_love branch.

Member

headius commented Jan 25, 2018

@enebo actually chatted with me about the encoding of exception messages and backtraces, relating to his work on the bytelist_love branch.

@enebo

This comment has been minimized.

Show comment
Hide comment
@enebo

enebo Jan 25, 2018

Member

@jmiettinen I have been working towards merging a large branch now to upcoming 9.2.x which will largely solve these problems. The main issue we have today is that at some point all data for method and variable names end up as a Java String and we run into lots of scenarios where we try and make it back into a Ruby String or symbol and we have lost the ability to regain its encoding. The new code works around this by leveraging our symbol tables so we can use the strings we are passing around to regain the original symbol we used (thus getting the encoding back).

I actually suspect once this lands we will spend many point releases correcting missing piece of logic for this but we have a MASSIVE codebase. Your particular issue I will record against the feature.

Member

enebo commented Jan 25, 2018

@jmiettinen I have been working towards merging a large branch now to upcoming 9.2.x which will largely solve these problems. The main issue we have today is that at some point all data for method and variable names end up as a Java String and we run into lots of scenarios where we try and make it back into a Ruby String or symbol and we have lost the ability to regain its encoding. The new code works around this by leveraging our symbol tables so we can use the strings we are passing around to regain the original symbol we used (thus getting the encoding back).

I actually suspect once this lands we will spend many point releases correcting missing piece of logic for this but we have a MASSIVE codebase. Your particular issue I will record against the feature.

@enebo

This comment has been minimized.

Show comment
Hide comment
@enebo

enebo Jan 25, 2018

Member

Oh I should add we have no current plans to entertain this for 9.1.x. It is a lot of work and a few small (which no one should experience) breaking changes. 9.1.x will still see innovations but just not in this area. It is too icky to guarantee on a stable line.

Member

enebo commented Jan 25, 2018

Oh I should add we have no current plans to entertain this for 9.1.x. It is a lot of work and a few small (which no one should experience) breaking changes. 9.1.x will still see innovations but just not in this area. It is too icky to guarantee on a stable line.

@kares

This comment has been minimized.

Show comment
Hide comment
@kares

kares May 24, 2018

Member

thanks for @enebo this is now fixed (on master -- JRuby >= 9.2) and outputs exactly as MRI :

UTF-8
undefined local variable or method `öÖa' for main:Object
false
Member

kares commented May 24, 2018

thanks for @enebo this is now fixed (on master -- JRuby >= 9.2) and outputs exactly as MRI :

UTF-8
undefined local variable or method `öÖa' for main:Object
false

@kares kares closed this May 24, 2018

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment