Nice... this shouldn't be hard to fix. Currently we still process most identifiers as Java strings, though @enebo has been experimenting with moving that to a ByteList (bytes + encoding) or Symbol. I'm guessing we don't track the encoding properly here and produce the wrong error as a result.
which was to be written to an output stream. But writing did not succeed, as we'd get bytes outside the ASCII range. We just ended up sanitizing the result by replacing all bytes > 127 with `?`. One could also try to interpret those bytes as UTF-8, which they most probably are. So here's the simple approach:
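A minimal Ruby sketch of the two options mentioned above (replace every byte above 127 with `?`, or try the bytes as UTF-8 first). The helper names `sanitize_to_ascii` and `decode_as_utf8` are hypothetical and only illustrate the idea; this is not the code the comment refers to.

```ruby
# Hypothetical helpers, for illustration only.

# Replace every byte above 127 with "?" so the result is pure ASCII.
def sanitize_to_ascii(bytes)
  bytes.map { |b| b > 127 ? "?".ord : b }
       .pack("C*")
       .force_encoding("US-ASCII")
end

# Alternative: treat the bytes as UTF-8 and only fall back to
# sanitizing when they do not form a valid UTF-8 sequence.
def decode_as_utf8(bytes)
  str = bytes.pack("C*").force_encoding("UTF-8")
  str.valid_encoding? ? str : sanitize_to_ascii(bytes)
end

raw = "öÖa".bytes            # UTF-8 bytes of the identifier
puts sanitize_to_ascii(raw)  # every non-ASCII byte becomes "?"
puts decode_as_utf8(raw)     # bytes are valid UTF-8, so the name survives
```

The second helper keeps the original name whenever the bytes really are UTF-8 and degrades to the lossy form otherwise, which is why it may be the friendlier default for messages shown to users.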
@jmiettinen I have been working towards merging a large branch into the upcoming 9.2.x which will largely solve these problems. The main issue we have today is that at some point all data for method and variable names ends up as a Java String, and we run into lots of scenarios where we try to make it back into a Ruby String or Symbol and have lost the ability to regain its encoding. The new code works around this by leveraging our symbol tables, so we can use the strings we are passing around to look up the original symbol we used (thus getting the encoding back).
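To make the symbol-table idea concrete, here is a toy Ruby model. It is only a sketch under stated assumptions: `ToySymbolTable`, `raw_name`, `intern`, and `recover` are hypothetical names, and the "one character per byte" string merely stands in for an identifier that has lost its encoding. It is not how JRuby implements this.

```ruby
# Toy model (hypothetical, not JRuby's implementation).
class ToySymbolTable
  def initialize
    @by_raw_name = {}
  end

  # Lossy form of a name: one character per byte, original encoding forgotten.
  def self.raw_name(str)
    str.b.force_encoding("ISO-8859-1").encode("UTF-8")
  end

  # Register a name and remember which symbol its lossy form belongs to.
  def intern(str)
    sym = str.to_sym
    @by_raw_name[self.class.raw_name(str)] = sym
    sym
  end

  # Given only the lossy form, get back the original, correctly encoded symbol.
  def recover(raw)
    @by_raw_name[raw]
  end
end

table = ToySymbolTable.new
table.intern("öÖa")

raw = ToySymbolTable.raw_name("öÖa") # what gets "passed around"
sym = table.recover(raw)
p sym
p sym.encoding
```

The point of the lookup is that the lossy string is never converted back by guessing an encoding; it is only used as a key to find the symbol that still knows its encoding.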
I actually suspect that once this lands we will spend many point releases correcting missing pieces of logic for this, but we have a MASSIVE codebase. I will record your particular issue against the feature.
Oh, I should add that we have no current plans to entertain this for 9.1.x. It is a lot of work and a few small breaking changes (which no one should experience). 9.1.x will still see innovations, just not in this area. It is too icky to guarantee on a stable line.
When JRuby creates symbols for undefined local variables, the symbol's encoding is `US-ASCII`, but the bytes in it may not actually be within the `US-ASCII` range.
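What such a mislabeled value looks like can be shown in plain Ruby. This is a sketch, assuming the name is stored as its UTF-8 bytes with a `US-ASCII` label; the helper `mislabel_as_ascii` is hypothetical and just forces that state on a string.

```ruby
# Hypothetical helper: keep the bytes, change only the encoding label.
def mislabel_as_ascii(name)
  name.dup.force_encoding("US-ASCII")
end

s = mislabel_as_ascii("öÖa")

p s.encoding         # the label says US-ASCII
p s.bytes            # but the bytes are still the UTF-8 bytes of the name
p s.valid_encoding?  # => false, bytes above 127 are not valid US-ASCII
p s.ascii_only?      # => false
```

Any code that trusts the label (for example, when matching or transcoding) can then fail or produce garbage, because the label and the bytes disagree.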
Reproduces at least on JRuby 1.7.27 && JRuby 184.108.40.206. Master seems to currently be the same.
This is internal to JRuby, so the platform is mostly irrelevant. However:
Darwin jmiettinen.local 16.7.0 Darwin Kernel Version 16.7.0: Thu Jun 15 17:36:27 PDT 2017; root:xnu-3789.70.16~2/RELEASE_X86_64 x86_64
Given this small script (named `utf8_fail.rb` in my example outputs):
I would expect to get the following output (this is from 1.9.3-p448 and 2.3.1):
However, when the same file is run with JRuby 1.7.27 / JRuby 220.127.116.11, we get problems with bytes in the created symbol öÖa:
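One way to inspect the symbol attached to such an error is sketched below. It assumes the failure comes from referencing an undefined local named `öÖa`; it is a hypothetical check, not the original `utf8_fail.rb`, and `undefined_local_symbol` is an illustrative name.

```ruby
# Hypothetical check, not the original reproduction script.
# Referencing an undefined local raises NameError; NameError#name
# returns the symbol that was created for the missing identifier.
def undefined_local_symbol
  öÖa
rescue NameError => e
  e.name
end

sym = undefined_local_symbol
p sym
p sym.encoding               # the encoding the runtime attached to the symbol
p sym.to_s.valid_encoding?   # whether the bytes actually fit that encoding
```

On MRI one would expect a UTF-8 symbol whose bytes are valid; a `US-ASCII` label together with `false` on the last line would match the behaviour described in this report.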
Here the error message differs, and `RubyRegexp` notices that there are some non-ASCII bytes in the string with `US-ASCII` encoding and throws
If we run this through hexdump (`ruby utf8_fail.rb 2>&1 | hexdump -C`), we get
Here it can be seen that the code points for ö and Ö (f6 and d6) are copied directly into the `ByteList` used in that message.
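For comparison, the following Ruby sketch prints the code point of each character next to its UTF-8 and ISO-8859-1 byte sequences. The helper `hex_bytes` is a hypothetical name introduced only for this illustration.

```ruby
# Hypothetical helper: hex dump of a string's bytes in a given encoding.
def hex_bytes(str, encoding)
  str.encode(encoding).bytes.map { |b| format("%02x", b) }
end

"öÖ".each_char do |ch|
  codepoint = format("%02x", ch.ord)
  utf8      = hex_bytes(ch, "UTF-8").join(" ")
  latin1    = hex_bytes(ch, "ISO-8859-1").join(" ")
  puts "#{ch}: codepoint=#{codepoint} utf8=#{utf8} latin1=#{latin1}"
end
# ö: codepoint=f6 utf8=c3 b6 latin1=f6
# Ö: codepoint=d6 utf8=c3 96 latin1=d6
```

A single f6 or d6 byte is therefore what one gets when the code point is written out as one byte (as in ISO-8859-1), whereas a UTF-8 message would contain the two-byte sequences instead.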