Broken JRuby and UTF-8 method names #1285

Closed
ujifgc opened this Issue Nov 30, 2013 · 8 comments

Projects

None yet

5 participants

@ujifgc
ujifgc commented Nov 30, 2013

Here's a test case for irb:

class A
  define_method 'проверка', lambda { p 'nunu' }
  p instance_method 'проверка'
end

Fails with NameError: undefined method on JRuby-1.7.6
Pass ok on MRI and RBX

Here's a fix:

class A
  define_method :'проверка', lambda { p 'nunu' }
  p instance_method :'проверка'
end

For me it seems broken. JRuby 1.7.4 passed this test.

@headius
Member
headius commented Dec 1, 2013

This appears to be a problem with define_method that might also affect instance_method. I suspect it is not converting the incoming string into a symbolic identifier correctly.

@headius
Member
headius commented Dec 1, 2013

I have some partial fixes for these methods but there appears to be issues throughout method definition and querying as far as how they coerce strings to identifiers.

You should be able to work around most of these issues by using symbols. Replacing your string identifiers with symbols makes the script work properly.

@headius
Member
headius commented Dec 1, 2013

There may also still be parser/compiler/interprester issues with multi-byte identifiers. /cc @enebo

@enebo
Member
enebo commented Dec 2, 2013

So in both cases we store the actual identifier as a Java String. Why symbol makes a proper one while String does not is a bit of a mystery to me (although I have not went sleuthing yet). I remember we had an issue with this in the past but atm I cannot remember that aha explanation.

@rtyler
rtyler commented Aug 7, 2015

Still an issue with JRuby 9.0.0.0:

[1] pry(main)> class A
[1] pry(main)*   define_method 'проверка', lambda { p 'nunu' }  
[1] pry(main)*   p instance_method 'проверка'  
[1] pry(main)* end  
NameError: undefined method `проверка' for class `A'
from org/jruby/RubyModule.java:2315:in `instance_method'
[2] pry(main)> class A
[2] pry(main)*   define_method :'проверка', lambda { p 'nunu' }  
[2] pry(main)*   p instance_method :'проверка'  
[2] pry(main)* end  
#<UnboundMethod: A#п�ове�ка>
=> #<UnboundMethod: A#п�ове�ка>
[3] pry(main)> JRUBY_VERSION
=> "9.0.0.0"
[4] pry(main)> 
@Kagetsuki

Hate to bump this but it's making my specs fail. jruby 9.0.5.0

@enebo
Member
enebo commented Mar 31, 2016

Some quick notes on this....

We have two paths to looking up symbols: String and ByteList. When we create symbols from String (which btw is how all Symbol operands get created in IR) we defer to String.hashCode for its hashing function. When we create symbols from a bytelist we pretend the bytelist is in fact a raw string (e.g. iso8859_1) and then stuff it into a String of encoding iso8859_1 and call hashCode on it.

In the first case the symbols end up being bytes which are really encded as their charset (e.g. UTF-8). This means if we try and look up things from symbols later we end up hashing completely differently because we assume raw string. So the explanation is actually really simple. Surprisingly simple. The solution might be less so...

At this point I know we have some code which depends on thinking all symbols are raw but we are clearly not storing them that way anymore from parsed code. If I try and change how we lookup symbols (latter lookup above) to use real encoded bytes maybe everything is happy or maybe I will play whack a mole?

@enebo
Member
enebo commented Apr 1, 2016

Ok. So I am running with a solution and I will document it here. @headius should definitely go over this and figure out if there is an issue. This change makes me nervous but the other shoe has not dropped. Also my fix seems to pass all common tests thus far and I think it now makes more sense. At some point we flipped strategies internally but not fully.

Base assumptions:

  • Interpreter and JIT both call newSymbol(Ruby, String, Encoding). The String itself seems to have always been encoded as bytes represented by Encoding. String's hashCode used for looking up in the symbol table.
  • define_method and things which need to pin symbols or make them 'hard' call newHardSymbol(runtime, IRubyObject). Both RubyString and RubySymbol provide properly encoded bytes as per its encoding.

Question:

  • We have a third catch-all type for newHardSymbol which just punts over to newSymbol. What case is this for?

newSymbol used by all Ruby-land code uses the intern'd java string with the proper encded bytes to store into our internal symbol table. So the encoded strings hashCode is used as the key. with newHardSymbol we would save our bytes as iso88591 and then use hashCode on that string. These two obviously will give different results so we will not be able to find symbols.

My guess is at some point we fixed integration with Java by properly encoding the intern'd Java String so it was not rubbish (???????). This ended up trickling into how we calc'd hash for stored symbol. By switching newHardSymbol to use it's stored intern'd values (or encoded value in the case of RubyString) our hashing lines up again.

I did track these two methods and I do not think there are other used mechanisms for making symbols.

@enebo enebo closed this in 851974d Apr 1, 2016
@enebo enebo added this to the JRuby 9.1.0.0 milestone Apr 1, 2016
@enebo enebo added the core label Apr 1, 2016
@namusyaka namusyaka referenced this issue in sinatra/sinatra Nov 6, 2016
Merged

Drop symbolization in #generate_method #1198

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment