double-quoted UTF8 hash key has the wrong encoding #2591
The following test.rb works on MRI Ruby 2.2.0,
MRI Ruby 2.2.0
Rather than open a new issue, I'll add this here: I've just found out that JRuby is basically not meeting m17n compliance the way Ruby 1.9+ does. Note:
I have some code that needs UTF-8 hash keys and would love to run it on JRuby, but it's a no-go. Both 9k and 1.7.19 have the same issue, and specifying code pages and language for the Java environment fails to remedy it.
I'm pretty sure the root of the issue is a mix of failing to recognize its nature (we're NOT talking about UTF-8 strings held in variables or file system paths) and perhaps unfamiliarity with the fact that mainline Ruby supports UTF-8 variable names, method names, hash keys, symbols, etc. However, the reason the problem exists in the first place is likely how Java imposes the system/environment code set on whatever is being run.
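For anyone who wants to see what "UTF-8 hash keys" means here, this is a minimal sketch of the kind of check involved, not the original test.rb. It assumes the file is saved as UTF-8 and that the double-quoted `"key":` hash syntax from Ruby 2.2 is what the issue title refers to; the variable names are made up for illustration.

```ruby
# encoding: utf-8
# Sketch only: compares a string key with a double-quoted symbol key.
# Both variable names are hypothetical.

by_string = { "ключ" => 1 }   # plain string key
by_symbol = { "ключ": 2 }     # Ruby 2.2+ syntax, produces a Symbol key

[by_string, by_symbol].each do |hash|
  key = hash.keys.first
  # On MRI with a UTF-8 source file, both keys are expected to report UTF-8.
  puts "#{key.class}: #{key.inspect} encoding=#{key.encoding}"
end
```

If the second line reports anything other than UTF-8, or the key prints as mangled bytes, that would match the behaviour described in this issue.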
@headius and @enebo: I actually just happened to be playing around with some encoding issues a few days ago, so good timing on the post, @headius. I think I understand the issue better now and can say with much higher certainty that the root of it is very likely linked to the Java runtime more than to JRuby itself. That isn't to say JRuby is meeting m17n compliance for UTF-8 in actual code, because it isn't; but without figuring out how to force the JVM/runtime/whatever to use UTF-8 by default, it will be hard to progress further.
From my investigation, that isn't simple. Apparently, once Java is running, telling it to change its code page is non-trivial, and because it's a system-level operation with no standard for how it should work, it differs with every Java implementation. To make matters worse, it looks like different Java implementations handle UTF-8 slightly differently (e.g. Dalvik will apparently give different results for some UTF-8 string comparisons than other Java implementations...).
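A quick way to see what encodings the runtime thinks it is using is sketched below. It only reports values and does not fix anything. The `report` hash is a made-up name, and reading the `file.encoding` system property on JRuby is my assumption about where the JVM default shows up; that branch is skipped on MRI.

```ruby
# encoding: utf-8
# Sketch only: collect the encodings Ruby reports for this process.

report = {
  source:   __ENCODING__,                # encoding of this source file
  external: Encoding.default_external,   # default for IO read from outside
  internal: Encoding.default_internal    # may be nil when unset
}

if defined?(JRUBY_VERSION)
  require 'java'
  # Assumption: the JVM's default charset is visible via this property.
  report[:jvm_file_encoding] = java.lang.System.getProperty('file.encoding')
end

report.each { |name, value| puts "#{name}: #{value.inspect}" }
```

Comparing this output between MRI and JRuby on the same machine should at least show whether the two runtimes disagree about the defaults.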
@enebo may have a few headaches waiting for him when he decides to tackle this. To be honest I have a feeling this could require way more work than anyone is expecting. I can only hope I'm wrong.
Also, just in case anyone wants to try to make an argument like "but you shouldn't be using UTF-8 hash keys anyway!": there's a whole lot of JSON out there using UTF-8 strings as keys. Try parsing that. YEAH, THAT'S WHAT I THOUGHT!
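To make the JSON point concrete, here is a small sketch using the standard `json` library. The payload is invented for illustration, and the file is assumed to be saved as UTF-8.

```ruby
# encoding: utf-8
require 'json'

# Hypothetical payload with non-ASCII keys.
payload = '{"名前": "Matz", "città": "Roma"}'

# String keys: each key should come back as a valid UTF-8 String.
data = JSON.parse(payload)
data.each_key do |key|
  puts "#{key.inspect} encoding=#{key.encoding} valid=#{key.valid_encoding?}"
end

# Symbol keys: the same check with symbolize_names.
symbolized = JSON.parse(payload, symbolize_names: true)
symbolized.each_key do |key|
  puts "#{key.inspect} encoding=#{key.encoding}"
end
```

On MRI every key here should report UTF-8; a runtime that hands back keys in another encoding will break lookups like `data["名前"]`.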
As for UTF-8 method names, I can think of two examples where I've seen this that were nifty. My favourite is the math symbols:
Not "necessary" but if we've got UTF-8 symbols then is there really any added cost to having UTF-8 compatible methods too?
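As a rough idea of what math-symbol method names can look like on MRI, here is a sketch. These definitions are invented for illustration and are not taken from any particular library; the file is assumed to be saved as UTF-8.

```ruby
# encoding: utf-8
# Sketch only: non-ASCII identifiers used as method and variable names.

# Square root as a method named with the radical sign.
def √(x)
  Math.sqrt(x)
end

# Summation over any number of arguments.
def ∑(*terms)
  terms.inject(0) { |acc, term| acc + term }
end

π = Math::PI   # a local variable with a non-ASCII name

puts √(16)          # prints 4.0
puts ∑(1, 2, 3)     # prints 6
puts π.round(2)     # prints 3.14
```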