New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We鈥檒l occasionally send you account related emails.

Already on GitHub? Sign in to your account

Emojis in method names breaks alias/alias_method and __callee__/__method__ #4878

Open
ivoanjo opened this Issue Nov 30, 2017 · 10 comments

Comments

Projects
None yet
3 participants
@ivoanjo
Contributor

ivoanjo commented Nov 30, 2017

Hello there!

I'm working on a new gem called Persistent馃拵 that uses creatively-named methods with the 馃拵 emoji to have a very clean syntax for creating immutable data structures. (馃帀)

When I tried running my test suite on JRuby it failed, as I make use of alias and JRuby seems to be having trouble with encodings in this case. I also noticed the same problem with __callee__ and __method__ while working on the example below.

Environment

  • JRuby: jruby 9.1.14.0 (2.3.3) 2017-11-08 2176f24 Java HotSpot(TM) 64-Bit Server VM 25.151-b12 on 1.8.0_151-b12 [linux-x86_64]
  • Kernel: Linux maruhime 4.10.0-40-generic #44-Ubuntu SMP Thu Nov 9 14:49:09 UTC 2017 x86_64 x86_64 x86_64 GNU/Linux
  • Distro: Ubuntu 17.04

Expected Behavior

Example code:

puts RUBY_DESCRIPTION

def to_馃拵
  puts "hello there! callee is #{__callee__}, method is #{__method__} and backtrace top is #{Thread.current.backtrace(1..1)}"
end

to_馃拵

alias :to_dmnd :to_馃拵

to_dmnd

class << self
  alias_method :to_dmnd_2, :to_馃拵
end

to_dmnd_2

Output on MRI:

$ ruby testcase.rb
ruby 2.4.2p198 (2017-09-14 revision 59899) [x86_64-linux]
hello there! callee is to_馃拵, method is to_馃拵 and backtrace top is ["testcase.rb:4:in `to_馃拵'"]
hello there! callee is to_dmnd, method is to_馃拵 and backtrace top is ["testcase.rb:4:in `to_馃拵'"]
hello there! callee is to_dmnd_2, method is to_馃拵 and backtrace top is ["testcase.rb:4:in `to_馃拵'"]

Actual Behavior

Output on JRuby:

$ jruby testcase.rb 
jruby 9.1.14.0 (2.3.3) 2017-11-08 2176f24 Java HotSpot(TM) 64-Bit Server VM 25.151-b12 on 1.8.0_151-b12 [linux-x86_64]
hello there! callee is to_?, method is to_? and backtrace top is ["testcase.rb:4:in `to_馃拵'"]
NameError: undefined method `to_?' for class `Object'
  <main> at testcase.rb:9
@headius

This comment has been minimized.

Show comment
Hide comment
@headius

headius Dec 1, 2017

Member

Relates to @enebo symbol work.

Member

headius commented Dec 1, 2017

Relates to @enebo symbol work.

@enebo

This comment has been minimized.

Show comment
Hide comment
@enebo

enebo Dec 1, 2017

Member

@ivoanjo definitely our plan to move to either RubySymbol (wip on symbol_love branch) or ByteList (which might still be in contention due to complicated bootstrapping issues using RubySymbol) will solve all the issues.

Unfortunately master (and 9.1) will not 100% use iso8859_1 nor proper encoding when registering symbols. I am guessing this is fallout from that inconsistency. The direct call does work but the aliases are messed up.

Member

enebo commented Dec 1, 2017

@ivoanjo definitely our plan to move to either RubySymbol (wip on symbol_love branch) or ByteList (which might still be in contention due to complicated bootstrapping issues using RubySymbol) will solve all the issues.

Unfortunately master (and 9.1) will not 100% use iso8859_1 nor proper encoding when registering symbols. I am guessing this is fallout from that inconsistency. The direct call does work but the aliases are messed up.

@ivoanjo

This comment has been minimized.

Show comment
Hide comment
@ivoanjo

ivoanjo Dec 1, 2017

Contributor

Thanks for the heads up!

I've since discovered by a bit more trial and error that using symbols for everything works (e.g. if I define with define_method, alias with alias_method and then public_send), but then the direct call doesn't because it's not looking for the correctly-named method.

So for the direct call to work I do need to use def because it mangles the name in exactly the same way as the direct call does. (E.g. if I call methods I'll see a :"to_?\x00" which is what actually gets defined). I think I can live with that ;)

Contributor

ivoanjo commented Dec 1, 2017

Thanks for the heads up!

I've since discovered by a bit more trial and error that using symbols for everything works (e.g. if I define with define_method, alias with alias_method and then public_send), but then the direct call doesn't because it's not looking for the correctly-named method.

So for the direct call to work I do need to use def because it mangles the name in exactly the same way as the direct call does. (E.g. if I call methods I'll see a :"to_?\x00" which is what actually gets defined). I think I can live with that ;)

@headius

This comment has been minimized.

Show comment
Hide comment
@headius

headius Dec 5, 2017

Member

@ivoanjo Yeah that fits what we know about this issue. The problem is that all our internal tables (methods, constants, etc) are keyed off java.lang.String. In order to make that work, we have three options:

  • Always transcode identifiers to UTF-16 so the table has proper Java strings in it.
    This is cleanest for integration with Java (and Java tools) but we lose the original encoding of the identifier. That has been reported enough times that we try a compromise...
  • Never transcode identifiers to UTF-16 and just stuff the raw bytes into String.
    This approach is used some of the time right now, primarily for identifiers coming directly from the parser (i.e. normal def, constant assignment, etc). It mostly works, but if two identifiers have the same bytes but different encoding, we'll only be able to represent one of them in those tables, and we won't be able to return the expected identifier via reflective methods. There's also a number of places that are not doing this, leaning more toward Java integration and transcoding to UTF-16.
  • Stop using Java strings as keys for internal tables.
    This is the work I mentioned, which both @enebo and I have tackled in the past. My current vote goes for ByteList, but ByteList itself needs a lot of cleanup to behave properly as a key (hashcode, equals don't consider encoding right now, for example).

So that basically explains why you ran into this and why using all symbols or all plain identifiers both work fine.

Member

headius commented Dec 5, 2017

@ivoanjo Yeah that fits what we know about this issue. The problem is that all our internal tables (methods, constants, etc) are keyed off java.lang.String. In order to make that work, we have three options:

  • Always transcode identifiers to UTF-16 so the table has proper Java strings in it.
    This is cleanest for integration with Java (and Java tools) but we lose the original encoding of the identifier. That has been reported enough times that we try a compromise...
  • Never transcode identifiers to UTF-16 and just stuff the raw bytes into String.
    This approach is used some of the time right now, primarily for identifiers coming directly from the parser (i.e. normal def, constant assignment, etc). It mostly works, but if two identifiers have the same bytes but different encoding, we'll only be able to represent one of them in those tables, and we won't be able to return the expected identifier via reflective methods. There's also a number of places that are not doing this, leaning more toward Java integration and transcoding to UTF-16.
  • Stop using Java strings as keys for internal tables.
    This is the work I mentioned, which both @enebo and I have tackled in the past. My current vote goes for ByteList, but ByteList itself needs a lot of cleanup to behave properly as a key (hashcode, equals don't consider encoding right now, for example).

So that basically explains why you ran into this and why using all symbols or all plain identifiers both work fine.

@headius

This comment has been minimized.

Show comment
Hide comment
@headius

headius Dec 5, 2017

Member

@enebo Finally land symbols or bytelist identifiers in 9.2?

Member

headius commented Dec 5, 2017

@enebo Finally land symbols or bytelist identifiers in 9.2?

@enebo

This comment has been minimized.

Show comment
Hide comment
@enebo

enebo Dec 5, 2017

Member

@headius I really wanted RubySymbol to work and it still may but I am now thinking ByteList is much less complicated. I love the removal of needing weak referencing of symbols with RubySymbol but the bootstrapping is heinous. ByteList is an incremental step towards what we want and will solve this issue but we will retain the weak references.

Member

enebo commented Dec 5, 2017

@headius I really wanted RubySymbol to work and it still may but I am now thinking ByteList is much less complicated. I love the removal of needing weak referencing of symbols with RubySymbol but the bootstrapping is heinous. ByteList is an incremental step towards what we want and will solve this issue but we will retain the weak references.

@headius

This comment has been minimized.

Show comment
Hide comment
@headius

headius Dec 5, 2017

Member

@enebo The weak references should only be necessary for the intern table, right?

Member

headius commented Dec 5, 2017

@enebo The weak references should only be necessary for the intern table, right?

@ivoanjo

This comment has been minimized.

Show comment
Hide comment
@ivoanjo

ivoanjo Dec 5, 2017

Contributor

Thanks for the great writeup! I ended up hacking my way around this for JRuby 1.7 too for persistent-馃拵 (was not pretty), but nothing that users will notice or have to bother which is awesome.

Contributor

ivoanjo commented Dec 5, 2017

Thanks for the great writeup! I ended up hacking my way around this for JRuby 1.7 too for persistent-馃拵 (was not pretty), but nothing that users will notice or have to bother which is awesome.

@enebo

This comment has been minimized.

Show comment
Hide comment
@enebo

enebo Dec 5, 2017

Member

@headius yeah Ruby.getSymbolTable() weak vs hard. As RubySymbol this just completely disappears since everything already has the thing. Obviously it also allows GC of any unused symbols. As nice as that is we are ok with our current table and it already stores bytelist data so we just need better identity in bytelist at most.

Member

enebo commented Dec 5, 2017

@headius yeah Ruby.getSymbolTable() weak vs hard. As RubySymbol this just completely disappears since everything already has the thing. Obviously it also allows GC of any unused symbols. As nice as that is we are ok with our current table and it already stores bytelist data so we just need better identity in bytelist at most.

@headius

This comment has been minimized.

Show comment
Hide comment
@headius

headius Dec 8, 2017

Member

There's another justification for using ByteList: it won't leak across runtimes if someone's in an embedding situation. They can use the same keys for method tables in two different runtimes, which is impossible to do using RubySymbol (without first looking up the runtime-local symbol).

Member

headius commented Dec 8, 2017

There's another justification for using ByteList: it won't leak across runtimes if someone's in an embedding situation. They can use the same keys for method tables in two different runtimes, which is impossible to do using RubySymbol (without first looking up the runtime-local symbol).

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment