Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

inherit encoding for interned symbols #7618

Closed
wants to merge 2 commits into from

Conversation

ahorek
Copy link
Contributor

@ahorek ahorek commented Feb 2, 2023

fixes a rare case with incorrect encoding sharing

source = "fée"
iso_symbol = source.force_encoding(Encoding::ISO_8859_1).to_sym
iso_symbol.encoding # Encoding::ISO_8859_1
binary_symbol = source.force_encoding(Encoding::BINARY).to_sym
binary_symbol.encoding # Encoding::ISO_8859_1 (Encoding::BINARY is expected)

it applies to 9.3 as well. Is it worth backporting?

@headius
Copy link
Member

headius commented Feb 6, 2023

I believe this does not actually fix the issue; rather, the last attempt to intern an encoded string will win, replacing the existing symbol's encoding with a different one. You can see this by just accessing the `:fée" symbol directly and watching its encoding change.

We do not have a way to intern symbols with multiple encodings and the same bytes, since we key off the bytes and not the encoding.

@ahorek ahorek closed this Feb 6, 2023
@ahorek
Copy link
Contributor Author

ahorek commented Feb 6, 2023

yes, you're right...

CRuby doesn't seem to be sharing symbols with a different encoding, but the same bytes

iso_symbol = source.force_encoding(Encoding::ISO_8859_1).to_sym.object_id => 60
binary_symbol = source.force_encoding(Encoding::BINARY).to_sym.object_id => 80
iso2_symbol = source.force_encoding(Encoding::ISO_8859_1).to_sym.object_id => 60

could we extend the hashcode and cache it as 2 independent symbols, perhaps here?
https://github.com/jruby/jruby/blob/master/core/src/main/java/org/jruby/RubySymbol.java#L1035

I'll open up an issue instead. Thanks

@headius
Copy link
Member

headius commented Feb 6, 2023

It is a known problem mostly due to our using Java String as a key in method and constant tables etc. We can only key off the actual string bytes, since String is not multi-encoding, so there's no way to have two entries with different encodings. It might be possible to extend the Symbol table to include encoding in the hash/equality but it would not be a complete solution since you still couldn't use those symbols for reflective operations.

The longer-term fix would be to change the key to something like ID in CRuby, but of course that would mean massive changes across JRuby to change every identifier-based table as well as all Java APIs to look things up. In other words, very unlikely to ever happen.

You may file a bug, but there may be others already...possibly closed as WONTFIX. There are very few (if any) real-world cases where you need the same bytes with different encodings to yield different symbols.

@ahorek
Copy link
Contributor Author

ahorek commented Feb 6, 2023

somehow it works on TruffleRuby. Maybe there's a trick to work around it?

anyway, I come to the same conclusion. Massive changes to the ID table for a case like this weren't worth it.

@enebo
Copy link
Member

enebo commented Feb 6, 2023

@ahorek I see it is in another test suite but I have to wonder if anyone has ever done this in a real application? As much work as it is...we may be more motivated if this exists in a more concrete example. As much as it annoys us to have this limitation we have never seen a real world issue involving it.

@ahorek
Copy link
Contributor Author

ahorek commented Feb 6, 2023

actually, it comes from a real change msgpack/msgpack-ruby#248
these tests are testing that one of the serialization methods doesn't change the encoding unintentionally, but I must agree it's unlikely to be a problem outside of the scope of these tests.

@enebo
Copy link
Member

enebo commented Feb 6, 2023

@ahorek What I mean is did this test come from real code which ran into a problem or did they discover there were issues with encodings+symbols and synthesized a test to exercise what they discovered? I see mention that they only supported US-ASCII/BINARY and then added UTF-8 and this led to this. So I guess if this is a real issue they somehow decide to save a symbol as BINARY and UTF-8 when the string is the bytes of a UTF-8 string?

@ahorek
Copy link
Contributor Author

ahorek commented Feb 6, 2023

I think the original bug that led to msgpack/msgpack-ruby#248 was ruby-i18n/i18n#606

basically, because of this

"foo".encoding
=> #Encoding:UTF-8
"foo".to_sym.encoding
=> #Encoding:US-ASCII

no special encodings were involved.

note that msgpack without extensions doesn't serialize encodings at all, all strings and symbols are assumed to be UTF-8.

msgpack/msgpack-ruby#248 added a test for other encodings (that suppose to fall back to BINARY with the new logic). It wasn't crafted for JRuby and there wasn't a JRuby issue before. After submitting a PR, the test started failing only on JRuby, so obviously we've investigated why.

here's an example based on the test:

require 'msgpack'

symbol = "fàe".encode(Encoding::ISO_8859_1).to_sym
factory = MessagePack::Factory.new
factory.register_type(0x00, ::Symbol)
dumped = factory.dump(symbol)
factory.load(dumped).encoding

CRuby
=> :"f\xE0e" # Encoding:ASCII-8BIT

JRuby
=> :"f\xE0e" # Encoding:ISO-8859-1

it is trying to serialize a symbol with an exotic encoding to a format that doesn't support encodings... IMO it would be better to simply raise an error (it does not for compatibility reasons).

the difference between CRuby & JRuby could cause problems, but it seems rather artificial to me. It's ok to test corner cases, but I don't think this is a real application use case or at least behavior someone should depend on.

@ahorek
Copy link
Contributor Author

ahorek commented Feb 6, 2023

originally, I thought it would be something simple to fix. Well, I was wrong. Sry, for the noise :)

@enebo enebo added this to the Invalid or Duplicate milestone Feb 7, 2023
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

None yet

3 participants