Propagate code range into Symbol and calculate hash like String #6341

headius · 2020-07-23T18:21:15Z

In #1348 a case was reported where 7-bit ASCII bytes forced to UTF-16 did not intern separately from the same bytes as US-ASCII. This was mostly due to the hash calculation for the Symbol table not considering the String's encoding, in the way that String#hash does.

This patch makes the following changes:

Propagate code range into RubySymbol and make it a CodeRangeable
Use the same hash logic for Symbol's ByteList + CR that String uses.
Fix lookup and construction paths that did not provide CR to scan for it.
Modify paths that have CR available to pass it in, avoiding the scan.
Move generic CodeRangeable logic to the interface as default methods.

This should finally fix all remaining issues with Symbol encodings not reflecting the original encoding they were interned with. In jruby#1348 a case was reported where 7-bit ASCII bytes forced to UTF-16 did not intern separately from the same bytes as US-ASCII. This was mostly due to the hash calculation for the Symbol table not considering the String's encoding, in the way that String#hash does. This patch makes the following changes: * Propagate code range into RubySymbol and make it a CodeRangeable * Use the same hash logic for Symbol's ByteList + CR that String uses. * Fix lookup and construction paths that did not provide CR to scan for it. * Modify paths that have CR available to pass it in, avoiding the scan. * Move generic CodeRangeable logic to the interface as default methods.

This logic mimics the way CRuby unmarshals Symbol, by registering the bytes as the link (ByteList in SymbolTuple) before proceeding to read any encoding information. By updating the tuple with the final Symbol object we avoid subsequent lookups.

With the work in jruby#6341 to properly hash symbols using bytes and encoding, the previous logic that could do a "fast" retrieval of a Symbol for a given Java String is now incorrect. The logic to retrieve a Symbol must always match the String hash logic, which requires the encoded bytes plus the encoding, with an optional code range to avoid doing a scan. The "fast" path to get a Symbol is now via passing a ByteList and code range int, or optionally omitting the code range and accepting that it will be scanned for.

headius · 2020-07-23T21:31:41Z

This needed additional work after the initial commit.

Unmarshal of Symbol

Symbols unmarshal oddly, by pulling the bytes off the stream first and registering that as the symbol "link" target. Only after registration can the bytes' actual encoding be pulled off the stream.

Our logic always registered the Symbol object, which led to a complicated (and probably unsafe) workflow that created a temporary Symbol and then modified it after we got the encoding, but that broke once we started considering the encoding as part of the internal symbol table hashing.

The new logic instead registers a tuple based on the ByteList we get from the stream, updating its encoding once we have it and only then creating (or retrieving) the Symbol. This avoids having to modify the temporary Symbol and properly adds the entry to our symbol table.

"fast" Java String to Symbol paths

A few paths for getting a Symbol from a Java String were "fast", in that they did not intern the string and used a "raw" calculation of the hash value based on iso-8859-1 bytes. Since we now require the bytes and the encoding, these paths
were wrongly using the old hash value.

In order to fix this, the fast paths were deprecated and modified to use logic that creates a ByteList from the Java String and then proceeds as normal. This means all paths passing in a plain Java String will be creating a new ByteList even if the
Symbol they represent has already been created and registered.

I am looking into specialized logic for Java String that avoids allocation by using a stream of the String's UTF-8 bytes to calculate the hash code and coderange without requiring a ByteList.

headius · 2020-07-23T23:37:35Z

A new version of "fast" logic has been introduced for getSymbol(String) that avoids constructing an intermediate ByteList. Using our existing cached UTF-8 encoding logic, it will calculate code range and hash without requiring a ByteList, and use those for lookup of an existing Symbol. Only if a new Symbol must be created is the ByteList eventually constructed.

There are a bunch of symbol-related failures I will need to look into, but I expect they're mostly due to other parts of JRuby expecting the old/wrong lookup logic.

This logic avoids constructing the ByteList until absolutely needed, using our existing cached UTF-8 encoding logic to calculate code range and hash for lookup and only constructing a ByteList if it ends up being necessary to create the Symbol. This addresses the removal of the "fast" path symbol lookup logic removed earlier due to incorrect hash calculation.

headius · 2020-07-27T19:18:35Z

After all is said and done I ran into the same ultimate limitation that let me to attempt this work: we can't use Java String as an identifier and also guarantee that differently-encoded Symbol with the same bytes can coexist in the symbol table.

The main problem eventually comes down to the translation from Java String identifiers into symbols and strings for reflection, error messages, and other places where we must put those identifiers back into Ruby.

The bulk of failures here are due to a mechanical problem: the logic we have for converting a Java String into a Symbol is still based on the old logic of "ID strings" which are Java String objects that have been decoded as raw ISO-8859-1 bytes rather than as their actual encoding. These ID strings can be hashed the same way as the raw bytes in a Ruby String's ByteList, allowing us to look up the related symbols with either form.

With the patches in this PR, we now require an encoding to come along with those bytes, which is where the complication arises. In order to make the Java String identifiers look up the proper symbol, we would need to assume an encoding for those original bytes. Since ID strings have been decoded as ISO-8859-1 bytes, it's not possible anymore to know what the original encoding actually was. We could assume UTF-8, but then there would be no reason to decode as ISO-8859-1 bytes and we could decode them as proper strings. However then we run into cases where the identifier was not originally UTF-8 (e.g. differently-encoded source) and so the symbols don't match when round-tripped through a Java String identifer.

Ultimately there's no way to reconcile these differences without major surgery of some form.

The most complete answer would be to use RubySymbol or a similarly-structured ID type that contains bytes + encoding. This means a rework of every metastructure in JRuby as well as all consumers of those structures. For example, method tables would be keyed by ID, which means call sites would have to be keyed by ID, the call logic from jitted code would have to have a way to do lookup by ID, and so on. This would also complicate calling these methods from Java code, which would either need to accept that a Java String-based lookup might fail to find the method, or force everything to a single encoding.

The "give up" answer would be to only allow UTF-8 as an encoding throughout JRuby. This would of course break any code that wanted to use an alternative encoding, and would impact pass rates for specs and tests that try to use multiple encodings. It would "fix" the problem, in that all identifiers would round-trip, properly, but it would actually aggravate the issues in #1348 since we would be unable to accept symbols with different encodings at all (or at least we'd have to accept that they would never be able to interact with metadata structures).

It still seems the "best" option remains the middle ground we have today, where we can support multiple Symbol encodings and differently-encoded identifiers in source, but without the ability to support multiple differently-encoded symbols or identifiers that happen to have the same bytes.

We may want to look into normalizing all ASCII-compatible 7-bit identifiers and symbols to US-ASCII as a way to explicitly indicate that 7-bit clean identifiers will always be the same symbol.

headius · 2020-09-09T17:53:04Z

This is unfortunately a failed experiment. With most of the JRuby internal tables using Java String, there's no way to round-trip symbols in all situations, and moving either toward a single encoding (UTF-8) or trying to support multiple encodings more consistently breaks some aspect of the current design.

headius added this to the JRuby 9.3.0.0 milestone Jul 23, 2020

headius linked an issue Jul 23, 2020 that may be closed by this pull request

Cannot make two symbols with same bytes and different encodings #1348

Closed

headius mentioned this pull request Jul 23, 2020

Cannot make two symbols with same bytes and different encodings #1348

Closed

headius added 3 commits July 23, 2020 14:27

Update MRI test_symbol excludes

b65549b

headius force-pushed the symbol_encoding_fixes branch from 865dfd9 to fb24409 Compare July 23, 2020 23:57

headius mentioned this pull request Jul 27, 2020

Ensure all paths to Symbol normalize 7-bit ASCII to US-ASCII encoding #6344

Open

headius closed this Sep 9, 2020

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Propagate code range into Symbol and calculate hash like String #6341

Propagate code range into Symbol and calculate hash like String #6341

headius commented Jul 23, 2020

headius commented Jul 23, 2020

headius commented Jul 23, 2020

headius commented Jul 27, 2020 •

edited

Loading

headius commented Sep 9, 2020

Propagate code range into Symbol and calculate hash like String #6341

Propagate code range into Symbol and calculate hash like String #6341

Conversation

headius commented Jul 23, 2020

headius commented Jul 23, 2020

Unmarshal of Symbol

"fast" Java String to Symbol paths

headius commented Jul 23, 2020

headius commented Jul 27, 2020 • edited Loading

headius commented Sep 9, 2020

headius commented Jul 27, 2020 •

edited

Loading