New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

define_method using symbols string syntax works incorrectly #3880

Closed
catfishuang opened this Issue May 12, 2016 · 12 comments

Comments

Projects
None yet
3 participants
@catfishuang

catfishuang commented May 12, 2016

Environment

jruby 9.1.0.0 (2.2.3) 2016-01-26 7bee00d Java HotSpot(TM) 64-Bit Server VM 24.79-b02 on 1.7.0_79-b15 +jit [Windows 7-amd64]

Expected Behavior

def 牛逼
puts '111'
end

牛逼 # It's OK

@name = "中文"
send :define_method,:"#{@name}" do
puts "中文test"
end

中文 # Error occur but it run well in cruby

Actual Behavior

NameError: undefined local variable or method `??' for main:Object

it seems that the issue has not been resolved....... i checked it just now

@headius

This comment has been minimized.

Member

headius commented May 13, 2016

Trivial repro:

define_method(:"中文")do
  puts "中文test"
end
中文
@headius

This comment has been minimized.

Member

headius commented May 13, 2016

Another example that shows we're encoding identifiers and symbols differently:

class X
  def 你好; end
end
nihao = X.instance_methods(false)[0]
zhongwen = :"中文"

p nihao # => :??
p nihao.encoding # => #<Encoding:ASCII-8BIT>
p zhongwen # => :中文
p zhongwen.encoding # => #<Encoding:UTF-8>

This isn't really a problem with define_method, it's a known problem with how we encode method names versus symbols. I know we have other issues open for this...will try to locate.

@headius

This comment has been minimized.

Member

headius commented May 13, 2016

Comparative output from MRI:

$ ruby23 blah.rb
:你好
#<Encoding:UTF-8>
:中文
#<Encoding:UTF-8>
@headius

This comment has been minimized.

Member

headius commented May 13, 2016

Related to #1340, #3719, #3697 at least.

@enebo enebo changed the title from define_method still works incorrectly to define_method using symbols string syntax works incorrectly May 13, 2016

@enebo

This comment has been minimized.

Member

enebo commented May 13, 2016

I corrected the title to be more descriptive of what is broken. Note that this works in cases with no quote chars in the symbol:

# coding: utf-8
define_method(:中文)do
  puts "中文test"
end
中文

This is very specific to encoding with the newer-ish quote syntax (2.1?) of symbols specifically. I also wonder whether the reduced case and the interpolated version originally reported are the same problem but they are both in the parser even if so. As such, I think the other issues linked are problems but fixing this issue will fix none of the other linked issues (although #3719 might be the same kind of fix).

@headius

This comment has been minimized.

Member

headius commented May 13, 2016

@enebo Very strange! The parser is definitely handling the quoted case differently:

[] ~/projects/jruby $ ast -e "puts :'中文'"
AST:
RootNode 0
  FCallNode:puts 0
    ArrayNode 0
      SymbolNode:中� 0
, null


[] ~/projects/jruby $ ast -e "puts :中文"
AST:
RootNode 0
  FCallNode:puts 0
    ArrayNode 0
      SymbolNode:中文 0
, null

Oddly enough, both cases do have encoding of UTF-8, so there's simply a bug here turning string-like symbols into real symbols the same way non-string-like symbols are handled.

@headius

This comment has been minimized.

Member

headius commented May 13, 2016

Fun:

[] ~/projects/jruby $ jruby -e "puts :'中文'"
中文

[] ~/projects/jruby $ jruby -e "puts :中文"
??

And @enebo discovered this:

$ jruby -e 'a= :"中文"; p a.to_s.bytes'
[228, 184, 173, 230, 150, 135]
$ jruby -e 'a= :中文; p a.to_s.bytes'
[63, 63]

@enebo enebo added this to the JRuby 9.1.2.0 milestone May 13, 2016

@enebo

This comment has been minimized.

Member

enebo commented May 13, 2016

So here is where we stand. We inconsistently store symbols in our symbol table. Specifically we lookup the symbol entry based on the hash of a Java String. The Strings we use are inconsistent and is either:

  1. iso8859-1 (raw bytes if you want to think of it that way)
  2. properly Charset encoded Java String

1 is what MRI also does. It is interesting to note this is mildly broken by design in that two symbols with same byte sequence but different encoding will trip over each other. It is what Ruby does so this is just an interesting aside.

We started partially doing 2 because all methods in JRuby are keyed off properly encoded Java Strings. So if we use reflective API or try and call a method using something like send then we need to go from a symbol to the properly encoded Java String. By using 2 then our symbol table lines up with these method identifiers.

After some discussion we think we need to fully embrace 1 with some changes:

  1. All methods, variables, constants will need to use iso859_1 string as opposed to properly encoded string. This means identifiers in parser (and string symbols :"foo") will generate {iso8859, encoding}.
  2. symbol table will be like 1 from above (java hashing of iso8859_1)
  3. Anything which displays variables, constants, symbols, etc... will need to retrieve the symbol entry from the table to get the properly encoded string for display purposes. At a Java method table level we just have a sequence of bytes but have no idea what those bytes represent.

I think the main risk will be making all our code uniform. When making anything which can be a symbol or identifier we need to audit all creation paths and make sure they are 8859_1 and know their encoding. Anything which can possibly print out anything involving symbols or identifiers must end up calling a helper method symbolDisplay(iso) -> encoded String.

After having said all this ... THIS IS WAY TOO RISKY to try and change for 9.1.1.0 since it is a quick regression fix release. So we need to punt. This work also needs to happen on the front end of a new dev cycle to make sure we can address any fallout of making this change.

Addendum: The iso8859_1 string as common lookup mechanism (this is also intern'd as well) is almost a complete analogue to MRI's ID. Under the covers it should be as cheap as a number and we cannot really use this string for anything we display with. I guess the only advantage of it will be that in debugging more of the time we will know what the ID represents at a glance.

@enebo enebo modified the milestones: Post-9000, JRuby 9.1.2.0 May 13, 2016

@catfishuang

This comment has been minimized.

catfishuang commented Aug 10, 2016

what is milestone "Post-9000" ? will the issue be fixed?

@catfishuang catfishuang reopened this Aug 10, 2016

@enebo

This comment has been minimized.

Member

enebo commented Aug 10, 2016

@catfishuang It is just not targeted for our next point release atm. We need to prioritize when we fix this as it involves some largish changes.

@headius

This comment has been minimized.

Member

headius commented Apr 27, 2017

Still not working in JRuby master (9.1.9.0 in process) despite some symbol fixes.

@enebo There's a good chance this might get fixed if we carry the bytelist symbol stuff through all the way.

@headius headius removed this from the JRuby 9.1.9.0 milestone Apr 27, 2017

kares added a commit that referenced this issue Jan 10, 2018

Merge pull request #4096 from brocktimus/stickers_symbols_with_encoding
Extra tests for symbol encoding Re: #4070 + #3719 and possibly #3880
@headius

This comment has been minimized.

Member

headius commented May 16, 2018

This is fixed by @enebo's work on identifiers and symbols for 9.2.

@headius headius closed this May 16, 2018

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment