New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
inherit encoding for interned symbols #7618
Conversation
I believe this does not actually fix the issue; rather, the last attempt to intern an encoded string will win, replacing the existing symbol's encoding with a different one. You can see this by just accessing the `:fée" symbol directly and watching its encoding change. We do not have a way to intern symbols with multiple encodings and the same bytes, since we key off the bytes and not the encoding. |
yes, you're right... CRuby doesn't seem to be sharing symbols with a different encoding, but the same bytes
could we extend the hashcode and cache it as 2 independent symbols, perhaps here? I'll open up an issue instead. Thanks |
It is a known problem mostly due to our using Java String as a key in method and constant tables etc. We can only key off the actual string bytes, since String is not multi-encoding, so there's no way to have two entries with different encodings. It might be possible to extend the Symbol table to include encoding in the hash/equality but it would not be a complete solution since you still couldn't use those symbols for reflective operations. The longer-term fix would be to change the key to something like ID in CRuby, but of course that would mean massive changes across JRuby to change every identifier-based table as well as all Java APIs to look things up. In other words, very unlikely to ever happen. You may file a bug, but there may be others already...possibly closed as WONTFIX. There are very few (if any) real-world cases where you need the same bytes with different encodings to yield different symbols. |
somehow it works on TruffleRuby. Maybe there's a trick to work around it? anyway, I come to the same conclusion. Massive changes to the ID table for a case like this weren't worth it. |
@ahorek I see it is in another test suite but I have to wonder if anyone has ever done this in a real application? As much work as it is...we may be more motivated if this exists in a more concrete example. As much as it annoys us to have this limitation we have never seen a real world issue involving it. |
actually, it comes from a real change msgpack/msgpack-ruby#248 |
@ahorek What I mean is did this test come from real code which ran into a problem or did they discover there were issues with encodings+symbols and synthesized a test to exercise what they discovered? I see mention that they only supported US-ASCII/BINARY and then added UTF-8 and this led to this. So I guess if this is a real issue they somehow decide to save a symbol as BINARY and UTF-8 when the string is the bytes of a UTF-8 string? |
I think the original bug that led to msgpack/msgpack-ruby#248 was ruby-i18n/i18n#606 basically, because of this
no special encodings were involved. note that msgpack without extensions doesn't serialize encodings at all, all strings and symbols are assumed to be UTF-8. msgpack/msgpack-ruby#248 added a test for other encodings (that suppose to fall back to BINARY with the new logic). It wasn't crafted for JRuby and there wasn't a JRuby issue before. After submitting a PR, the test started failing only on JRuby, so obviously we've investigated why. here's an example based on the test: require 'msgpack'
symbol = "fàe".encode(Encoding::ISO_8859_1).to_sym
factory = MessagePack::Factory.new
factory.register_type(0x00, ::Symbol)
dumped = factory.dump(symbol)
factory.load(dumped).encoding
CRuby
=> :"f\xE0e" # Encoding:ASCII-8BIT
JRuby
=> :"f\xE0e" # Encoding:ISO-8859-1 it is trying to serialize a symbol with an exotic encoding to a format that doesn't support encodings... IMO it would be better to simply raise an error (it does not for compatibility reasons). the difference between CRuby & JRuby could cause problems, but it seems rather artificial to me. It's ok to test corner cases, but I don't think this is a real application use case or at least behavior someone should depend on. |
originally, I thought it would be something simple to fix. Well, I was wrong. Sry, for the noise :) |
fixes a rare case with incorrect encoding sharing
it applies to 9.3 as well. Is it worth backporting?