Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Propagate code range into Symbol and calculate hash like String #6341

Closed
wants to merge 5 commits into from

Conversation

headius
Copy link
Member

@headius headius commented Jul 23, 2020

In #1348 a case was reported where 7-bit ASCII bytes forced to UTF-16 did not intern separately from the same bytes as US-ASCII. This was mostly due to the hash calculation for the Symbol table not considering the String's encoding, in the way that String#hash does.

This patch makes the following changes:

  • Propagate code range into RubySymbol and make it a CodeRangeable
  • Use the same hash logic for Symbol's ByteList + CR that String uses.
  • Fix lookup and construction paths that did not provide CR to scan for it.
  • Modify paths that have CR available to pass it in, avoiding the scan.
  • Move generic CodeRangeable logic to the interface as default methods.

This should finally fix all remaining issues with Symbol encodings
not reflecting the original encoding they were interned with.

In jruby#1348 a case was reported where 7-bit ASCII bytes forced to
UTF-16 did not intern separately from the same bytes as US-ASCII.
This was mostly due to the hash calculation for the Symbol table
not considering the String's encoding, in the way that String#hash
does.

This patch makes the following changes:

* Propagate code range into RubySymbol and make it a CodeRangeable
* Use the same hash logic for Symbol's ByteList + CR that String
  uses.
* Fix lookup and construction paths that did not provide CR to
  scan for it.
* Modify paths that have CR available to pass it in, avoiding the
  scan.
* Move generic CodeRangeable logic to the interface as default
  methods.
This logic mimics the way CRuby unmarshals Symbol, by registering
the bytes as the link (ByteList in SymbolTuple) before proceeding
to read any encoding information. By updating the tuple with the
final Symbol object we avoid subsequent lookups.
With the work in jruby#6341 to properly hash symbols using bytes and
encoding, the previous logic that could do a "fast" retrieval of
a Symbol for a given Java String is now incorrect. The logic to
retrieve a Symbol must always match the String hash logic, which
requires the encoded bytes plus the encoding, with an optional
code range to avoid doing a scan.

The "fast" path to get a Symbol is now via passing a ByteList and
code range int, or optionally omitting the code range and
accepting that it will be scanned for.
@headius
Copy link
Member Author

headius commented Jul 23, 2020

This needed additional work after the initial commit.

Unmarshal of Symbol

Symbols unmarshal oddly, by pulling the bytes off the stream first and registering that as the symbol "link" target. Only after registration can the bytes' actual encoding be pulled off the stream.

Our logic always registered the Symbol object, which led to a complicated (and probably unsafe) workflow that created a temporary Symbol and then modified it after we got the encoding, but that broke once we started considering the encoding as part of the internal symbol table hashing.

The new logic instead registers a tuple based on the ByteList we get from the stream, updating its encoding once we have it and only then creating (or retrieving) the Symbol. This avoids having to modify the temporary Symbol and properly adds the entry to our symbol table.

"fast" Java String to Symbol paths

A few paths for getting a Symbol from a Java String were "fast", in that they did not intern the string and used a "raw" calculation of the hash value based on iso-8859-1 bytes. Since we now require the bytes and the encoding, these paths
were wrongly using the old hash value.

In order to fix this, the fast paths were deprecated and modified to use logic that creates a ByteList from the Java String and then proceeds as normal. This means all paths passing in a plain Java String will be creating a new ByteList even if the
Symbol they represent has already been created and registered.

I am looking into specialized logic for Java String that avoids allocation by using a stream of the String's UTF-8 bytes to calculate the hash code and coderange without requiring a ByteList.

@headius
Copy link
Member Author

headius commented Jul 23, 2020

A new version of "fast" logic has been introduced for getSymbol(String) that avoids constructing an intermediate ByteList. Using our existing cached UTF-8 encoding logic, it will calculate code range and hash without requiring a ByteList, and use those for lookup of an existing Symbol. Only if a new Symbol must be created is the ByteList eventually constructed.

There are a bunch of symbol-related failures I will need to look into, but I expect they're mostly due to other parts of JRuby expecting the old/wrong lookup logic.

This logic avoids constructing the ByteList until absolutely
needed, using our existing cached UTF-8 encoding logic to
calculate code range and hash for lookup and only constructing a
ByteList if it ends up being necessary to create the Symbol.

This addresses the removal of the "fast" path symbol lookup logic
removed earlier due to incorrect hash calculation.
@headius
Copy link
Member Author

headius commented Jul 27, 2020

After all is said and done I ran into the same ultimate limitation that let me to attempt this work: we can't use Java String as an identifier and also guarantee that differently-encoded Symbol with the same bytes can coexist in the symbol table.

The main problem eventually comes down to the translation from Java String identifiers into symbols and strings for reflection, error messages, and other places where we must put those identifiers back into Ruby.

The bulk of failures here are due to a mechanical problem: the logic we have for converting a Java String into a Symbol is still based on the old logic of "ID strings" which are Java String objects that have been decoded as raw ISO-8859-1 bytes rather than as their actual encoding. These ID strings can be hashed the same way as the raw bytes in a Ruby String's ByteList, allowing us to look up the related symbols with either form.

With the patches in this PR, we now require an encoding to come along with those bytes, which is where the complication arises. In order to make the Java String identifiers look up the proper symbol, we would need to assume an encoding for those original bytes. Since ID strings have been decoded as ISO-8859-1 bytes, it's not possible anymore to know what the original encoding actually was. We could assume UTF-8, but then there would be no reason to decode as ISO-8859-1 bytes and we could decode them as proper strings. However then we run into cases where the identifier was not originally UTF-8 (e.g. differently-encoded source) and so the symbols don't match when round-tripped through a Java String identifer.

Ultimately there's no way to reconcile these differences without major surgery of some form.

The most complete answer would be to use RubySymbol or a similarly-structured ID type that contains bytes + encoding. This means a rework of every metastructure in JRuby as well as all consumers of those structures. For example, method tables would be keyed by ID, which means call sites would have to be keyed by ID, the call logic from jitted code would have to have a way to do lookup by ID, and so on. This would also complicate calling these methods from Java code, which would either need to accept that a Java String-based lookup might fail to find the method, or force everything to a single encoding.

The "give up" answer would be to only allow UTF-8 as an encoding throughout JRuby. This would of course break any code that wanted to use an alternative encoding, and would impact pass rates for specs and tests that try to use multiple encodings. It would "fix" the problem, in that all identifiers would round-trip, properly, but it would actually aggravate the issues in #1348 since we would be unable to accept symbols with different encodings at all (or at least we'd have to accept that they would never be able to interact with metadata structures).

It still seems the "best" option remains the middle ground we have today, where we can support multiple Symbol encodings and differently-encoded identifiers in source, but without the ability to support multiple differently-encoded symbols or identifiers that happen to have the same bytes.

We may want to look into normalizing all ASCII-compatible 7-bit identifiers and symbols to US-ASCII as a way to explicitly indicate that 7-bit clean identifiers will always be the same symbol.

@headius
Copy link
Member Author

headius commented Sep 9, 2020

This is unfortunately a failed experiment. With most of the JRuby internal tables using Java String, there's no way to round-trip symbols in all situations, and moving either toward a single encoding (UTF-8) or trying to support multiple encodings more consistently breaks some aspect of the current design.

@headius headius closed this Sep 9, 2020
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

Cannot make two symbols with same bytes and different encodings
1 participant