inherit encoding for interned symbols #7618

ahorek · 2023-02-02T22:03:25Z

fixes a rare case with incorrect encoding sharing

source = "fée"
iso_symbol = source.force_encoding(Encoding::ISO_8859_1).to_sym
iso_symbol.encoding # Encoding::ISO_8859_1
binary_symbol = source.force_encoding(Encoding::BINARY).to_sym
binary_symbol.encoding # Encoding::ISO_8859_1 (Encoding::BINARY is expected)

it applies to 9.3 as well. Is it worth backporting?

headius · 2023-02-06T15:46:00Z

I believe this does not actually fix the issue; rather, the last attempt to intern an encoded string will win, replacing the existing symbol's encoding with a different one. You can see this by just accessing the `:fée" symbol directly and watching its encoding change.

We do not have a way to intern symbols with multiple encodings and the same bytes, since we key off the bytes and not the encoding.

ahorek · 2023-02-06T18:12:22Z

yes, you're right...

CRuby doesn't seem to be sharing symbols with a different encoding, but the same bytes

iso_symbol = source.force_encoding(Encoding::ISO_8859_1).to_sym.object_id => 60
binary_symbol = source.force_encoding(Encoding::BINARY).to_sym.object_id => 80
iso2_symbol = source.force_encoding(Encoding::ISO_8859_1).to_sym.object_id => 60

could we extend the hashcode and cache it as 2 independent symbols, perhaps here?
https://github.com/jruby/jruby/blob/master/core/src/main/java/org/jruby/RubySymbol.java#L1035

I'll open up an issue instead. Thanks

headius · 2023-02-06T18:57:27Z

It is a known problem mostly due to our using Java String as a key in method and constant tables etc. We can only key off the actual string bytes, since String is not multi-encoding, so there's no way to have two entries with different encodings. It might be possible to extend the Symbol table to include encoding in the hash/equality but it would not be a complete solution since you still couldn't use those symbols for reflective operations.

The longer-term fix would be to change the key to something like ID in CRuby, but of course that would mean massive changes across JRuby to change every identifier-based table as well as all Java APIs to look things up. In other words, very unlikely to ever happen.

You may file a bug, but there may be others already...possibly closed as WONTFIX. There are very few (if any) real-world cases where you need the same bytes with different encodings to yield different symbols.

ahorek · 2023-02-06T19:19:24Z

somehow it works on TruffleRuby. Maybe there's a trick to work around it?

anyway, I come to the same conclusion. Massive changes to the ID table for a case like this weren't worth it.

enebo · 2023-02-06T19:39:05Z

@ahorek I see it is in another test suite but I have to wonder if anyone has ever done this in a real application? As much work as it is...we may be more motivated if this exists in a more concrete example. As much as it annoys us to have this limitation we have never seen a real world issue involving it.

ahorek · 2023-02-06T20:01:56Z

actually, it comes from a real change msgpack/msgpack-ruby#248
these tests are testing that one of the serialization methods doesn't change the encoding unintentionally, but I must agree it's unlikely to be a problem outside of the scope of these tests.

enebo · 2023-02-06T20:25:05Z

@ahorek What I mean is did this test come from real code which ran into a problem or did they discover there were issues with encodings+symbols and synthesized a test to exercise what they discovered? I see mention that they only supported US-ASCII/BINARY and then added UTF-8 and this led to this. So I guess if this is a real issue they somehow decide to save a symbol as BINARY and UTF-8 when the string is the bytes of a UTF-8 string?

ahorek · 2023-02-06T22:56:21Z

I think the original bug that led to msgpack/msgpack-ruby#248 was ruby-i18n/i18n#606

basically, because of this

"foo".encoding
=> #Encoding:UTF-8
"foo".to_sym.encoding
=> #Encoding:US-ASCII

no special encodings were involved.

note that msgpack without extensions doesn't serialize encodings at all, all strings and symbols are assumed to be UTF-8.

msgpack/msgpack-ruby#248 added a test for other encodings (that suppose to fall back to BINARY with the new logic). It wasn't crafted for JRuby and there wasn't a JRuby issue before. After submitting a PR, the test started failing only on JRuby, so obviously we've investigated why.

here's an example based on the test:

require 'msgpack'

symbol = "fàe".encode(Encoding::ISO_8859_1).to_sym
factory = MessagePack::Factory.new
factory.register_type(0x00, ::Symbol)
dumped = factory.dump(symbol)
factory.load(dumped).encoding

CRuby
=> :"f\xE0e" # Encoding:ASCII-8BIT

JRuby
=> :"f\xE0e" # Encoding:ISO-8859-1

it is trying to serialize a symbol with an exotic encoding to a format that doesn't support encodings... IMO it would be better to simply raise an error (it does not for compatibility reasons).

the difference between CRuby & JRuby could cause problems, but it seems rather artificial to me. It's ok to test corner cases, but I don't think this is a real application use case or at least behavior someone should depend on.

ahorek · 2023-02-06T22:57:58Z

originally, I thought it would be something simple to fix. Well, I was wrong. Sry, for the noise :)

ahorek added 2 commits February 2, 2023 21:57

inherit encoding for interned symbols

459e91e

ascii only

d7934cb

ahorek closed this Feb 6, 2023

ahorek mentioned this pull request Feb 6, 2023

Encoding issue with interned symbols #7627

Closed

enebo added this to the Invalid or Duplicate milestone Feb 7, 2023

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

inherit encoding for interned symbols #7618

inherit encoding for interned symbols #7618

ahorek commented Feb 2, 2023 •

edited

headius commented Feb 6, 2023

ahorek commented Feb 6, 2023

headius commented Feb 6, 2023 •

edited

ahorek commented Feb 6, 2023

enebo commented Feb 6, 2023

ahorek commented Feb 6, 2023

enebo commented Feb 6, 2023

ahorek commented Feb 6, 2023 •

edited

ahorek commented Feb 6, 2023

inherit encoding for interned symbols #7618

inherit encoding for interned symbols #7618

Conversation

ahorek commented Feb 2, 2023 • edited

headius commented Feb 6, 2023

ahorek commented Feb 6, 2023

headius commented Feb 6, 2023 • edited

ahorek commented Feb 6, 2023

enebo commented Feb 6, 2023

ahorek commented Feb 6, 2023

enebo commented Feb 6, 2023

ahorek commented Feb 6, 2023 • edited

ahorek commented Feb 6, 2023

ahorek commented Feb 2, 2023 •

edited

headius commented Feb 6, 2023 •

edited

ahorek commented Feb 6, 2023 •

edited