New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Broken JRuby and UTF-8 method names #1285
Comments
This appears to be a problem with define_method that might also affect instance_method. I suspect it is not converting the incoming string into a symbolic identifier correctly. |
I have some partial fixes for these methods but there appears to be issues throughout method definition and querying as far as how they coerce strings to identifiers. You should be able to work around most of these issues by using symbols. Replacing your string identifiers with symbols makes the script work properly. |
There may also still be parser/compiler/interprester issues with multi-byte identifiers. /cc @enebo |
So in both cases we store the actual identifier as a Java String. Why symbol makes a proper one while String does not is a bit of a mystery to me (although I have not went sleuthing yet). I remember we had an issue with this in the past but atm I cannot remember that aha explanation. |
Still an issue with JRuby 9.0.0.0:
|
Hate to bump this but it's making my specs fail. jruby 9.0.5.0 |
Some quick notes on this.... We have two paths to looking up symbols: String and ByteList. When we create symbols from String (which btw is how all Symbol operands get created in IR) we defer to String.hashCode for its hashing function. When we create symbols from a bytelist we pretend the bytelist is in fact a raw string (e.g. iso8859_1) and then stuff it into a String of encoding iso8859_1 and call hashCode on it. In the first case the symbols end up being bytes which are really encded as their charset (e.g. UTF-8). This means if we try and look up things from symbols later we end up hashing completely differently because we assume raw string. So the explanation is actually really simple. Surprisingly simple. The solution might be less so... At this point I know we have some code which depends on thinking all symbols are raw but we are clearly not storing them that way anymore from parsed code. If I try and change how we lookup symbols (latter lookup above) to use real encoded bytes maybe everything is happy or maybe I will play whack a mole? |
Ok. So I am running with a solution and I will document it here. @headius should definitely go over this and figure out if there is an issue. This change makes me nervous but the other shoe has not dropped. Also my fix seems to pass all common tests thus far and I think it now makes more sense. At some point we flipped strategies internally but not fully. Base assumptions:
Question:
newSymbol used by all Ruby-land code uses the intern'd java string with the proper encded bytes to store into our internal symbol table. So the encoded strings hashCode is used as the key. with newHardSymbol we would save our bytes as iso88591 and then use hashCode on that string. These two obviously will give different results so we will not be able to find symbols. My guess is at some point we fixed integration with Java by properly encoding the intern'd Java String so it was not rubbish (???????). This ended up trickling into how we calc'd hash for stored symbol. By switching newHardSymbol to use it's stored intern'd values (or encoded value in the case of RubyString) our hashing lines up again. I did track these two methods and I do not think there are other used mechanisms for making symbols. |
* JRuby doesn't seem to be able to handle method names containing multi-byte strings correctly. * see jruby/jruby#1285 * see jruby/jruby#3880
Here's a test case for irb:
Fails with
NameError: undefined method
on JRuby-1.7.6Pass ok on MRI and RBX
Here's a fix:
For me it seems broken. JRuby 1.7.4 passed this test.
The text was updated successfully, but these errors were encountered: