Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Dispatching a string through send to method_missing causes encoding to revert to ASCII #736

Closed
bri3d opened this Issue May 14, 2013 · 9 comments

Comments

Projects
None yet
3 participants
@bri3d
Copy link

bri3d commented May 14, 2013

When dispatching a string through send to method_missing, the string loses its encoding and becomes ASCII. However, it is not re-encoded and hence causes all sorts of undefined behavior down the road.

JRuby 1.7.3, java version "1.7.0_15"
Java(TM) SE Runtime Environment (build 1.7.0_15-b03)
Java HotSpot(TM) 64-Bit Server VM (build 23.7-b01, mixed mode)

External and internal encodings are UTF-8 as well as JVM's file encoding.

here is a simple test case:

class TestDispatch
  def method_missing(name, *args)
    puts name.encoding
  end
end

nyay = 'ñ'
puts nyay.encoding # =>  #<Encoding:UTF-8>
puts nyay.to_sym.encoding # => #<Encoding:UTF-8>
t = TestDispatch.new
t.send(nyay) # => US-ASCII, should be UTF-8
t.send(nyay.to_sym) # => UTF-8

additionally

class TestDispatch
  def method_missing(name, *args)
    puts name.encoding
    name.to_s.bytes.map {|b| puts b }
  end
end
t = TestDispatch.new
t.send('ñ') # => US-ASCII \n 241

Note that 241 is the Unicode decimal representation of ñ, and not real ASCII. So the string is not being re-encoded into ASCII but rather is losing its UTF-8 encoding.

@bri3d

This comment has been minimized.

Copy link
Author

bri3d commented May 23, 2013

https://gist.github.com/bri3d/5639849

I made a more robust test case and verified against JRuby 1.7.4. Will look deeper into the internals later to see if I can hunt this myself.

The issue is admittedly esoteric; I was able to build a workaround for where I was doing this in a DSL in production.

@enebo

This comment has been minimized.

Copy link
Member

enebo commented May 24, 2013

This might be a more difficult problem to solve. All method names internally are Java Strings and when we end up passing name to method_missing it has long since lost its Ruby Stringness (including original encoding). We will need to ponder on this one a bit...

@bri3d

This comment has been minimized.

Copy link
Author

bri3d commented May 24, 2013

The part that's fun to me is that interning the method name as a symbol ahead of time causes the whole chain to work.

The difference appears to be that the Symbol is interned into the symbol table, preserving the original ByteList and hence the encoding information in the RubySymbol. Eventually prepareMethodMissingArgs uses newSymbol to construct a symbol for the method name, and since the hashes match and a symbol with the correct encoding has already been interned it gets looked up and is returned instead with Ruby encoding intact.

I agree that this one is tricky to fix; interning all method names in the symbol table when they're sent would be untenable but without using bytes rather than a Java String there's no way to preserve the encoding information.

I suspect using ByteList for method names would affect performance in a very negative way as well.

@bri3d

This comment has been minimized.

Copy link
Author

bri3d commented May 24, 2013

I'm wondering if ensuring the ByteList in the newSymbol created in method_missing is constructed as UTF-8 would help. There might be an issue introduced there but I can't think of what it would be OTOH.

@headius

This comment has been minimized.

Copy link
Member

headius commented Jul 24, 2013

The problem here is that it's actually creating two symbols. The first is from nyay.to_sym, which uses the actual ByteList of the string to create a symbol. I believe the second one is coming from the #send call, which is using a different mechanism.

Both cases should resolve to the same Symbol object. Investigating.

@headius

This comment has been minimized.

Copy link
Member

headius commented Jul 24, 2013

Ok, I found the issue. Inside the symbol table logic where it produces a Java String to store along with the symbol object, it's using ByteList.toString. ByteList.toString is a dirt-stupid implementation that just converts the bytes using ISO-8859-1, losing any encoding specified.

I'm modifying the logic to use appropriate decoding logic and we'll see how it goes.

@bri3d

This comment has been minimized.

Copy link
Author

bri3d commented Jul 24, 2013

Makes sense. My attempted workaround was trying to convert the String used for Symbol lookup in method_missing to a ByteList ahead of time, which bypassed the busted String lookup but caused other issues.

Seems better to fix the root cause 👍

@headius

This comment has been minimized.

Copy link
Member

headius commented Jul 24, 2013

So, I got pretty close to fixing this, and then everything went to hell.

The problem I describe above was the basic issue; we have multiple paths constructing symbols in different ways, resulting in multiple symbols. My patch fixes this...but unfortunately parts of the multiple-symbol behavior are correct.

MRI represents all symbols in a table keyed off their raw bytes. So if you have two strings that have the same content with different bytes, they result in different symbols.

$ ruby2.0.0 -e "sym1 = 'ñ'.to_sym; sym2 = 'ñ'.force_encoding('ISO-8859-1').to_sym; p sym2.equal?(sym1); p sym1.encoding; p sym2.encoding"
false
#<Encoding:UTF-8>
#<Encoding:ISO-8859-1>

I won't try to judge the merits of this logic except to say that I don't see the value in having the same string encoded differently become a different symbol.

In any case, this is basically the logic we try to emulate by forcing all symbol strings to be decoded as raw bytes; we store them as a Java String internally that's a UTF-16 representation of those raw bytes interpreted as ISO-8859-1. This allows us to usually key the symbol table the same way as MRI.

String#to_sym, however, actually decodes the string using its correct encoding, which results in a UTF-8 to_sym producing the properly decoded UTF-16 representation of that symbol rather than a raw representation of UTF-8 bytes. Going by MRI's logic for symbols, this is actually incorrect behavior, since all symbols should be held internally as their original bytes rather than their decoded form.

My fix attempted to expand proper decoding to the rest of the system, ensuring that all incoming symbols decoded as their actual character representation. This allowed differently-encoded strings to resolve to the same symbol, which seems much more sane to me. However, it broke tests and specs that expected differently-encoded strings to produce different symbols. What's an implementer to do?

The fix I'm contemplating now is to actually remove the decoding logic from String#to_sym. It is currently the only place in the system where we attempt to properly decode a string into a symbol, and "breaking" that logic should also fix this issue. I don't feel great about it, since I don't agree with having symbols basically be a representation of raw bytes that happens to have an encoding attached...but at the moment that's too big an issue to force.

@headius

This comment has been minimized.

Copy link
Member

headius commented Jul 24, 2013

A note of clarification... it was not String#to_sym that was actually decoding the string, it was the logic in #send that turns a String into a Symbol. There, it used IRubyObject#asJavaString, which for Ruby Strings does properly decode. So the send was triggering the creation of a new symbol whose bytes were properly-decoded UTF-8 rather than raw UTF-8 represented as ISO-8859-1, and the resulting method_missing argument did not match expectations.

I'm now attempting to modify the send logic to decode these strings in the symbolic fashion, which will allow them to retain their differently-encoded nature consistently across all uses.

@headius headius closed this in 1ea388d Jul 25, 2013

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
You can’t perform that action at this time.