Identifier-keyed tables must always use raw or always use encoded identifiers #3697

headius opened this issue Feb 24, 2016 · 2 comments


headius commented Feb 24, 2016

With the move to M17N in Ruby 1.9, it became possible to name variables, constants, etc. using arbitrary encodings.

In JRuby, we have never fully supported this because all our identifier-keyed tables (method table, constant table, etc.) are keyed by a Java String, and traditionally that was a properly decoded string. This works fine when all identifiers use the same encoding, but breaks when different encodings are mixed, since we lose the original bytes and encoding when going to a UTF-16 String.
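To make the loss concrete, here is a minimal sketch in plain Java. It does not use any JRuby classes; `DecodedKeyCollision` and its table are hypothetical stand-ins for an identifier table keyed by decoded strings. The same character, written in two different source encodings, decodes to the same UTF-16 String, so the two identifiers can no longer be told apart:

```java
import java.nio.charset.StandardCharsets;
import java.util.HashMap;
import java.util.Map;

public class DecodedKeyCollision {
    // "é" as a single ISO-8859-1 byte and as a two-byte UTF-8 sequence.
    public static final byte[] LATIN1_BYTES = {(byte) 0xE9};
    public static final byte[] UTF8_BYTES = {(byte) 0xC3, (byte) 0xA9};

    // Hypothetical identifier table keyed by fully decoded Java Strings.
    public static Map<String, String> buildTable() {
        Map<String, String> table = new HashMap<>();
        table.put(new String(LATIN1_BYTES, StandardCharsets.ISO_8859_1), "defined in ISO-8859-1 source");
        // Decodes to the same String "\u00E9", so this overwrites the entry above.
        table.put(new String(UTF8_BYTES, StandardCharsets.UTF_8), "defined in UTF-8 source");
        return table;
    }

    public static void main(String[] args) {
        Map<String, String> table = buildTable();
        System.out.println(table.size());        // 1
        System.out.println(table.get("\u00E9")); // defined in UTF-8 source
    }
}
```

Once both entries have collapsed into one key, nothing in the table records which encoding either identifier originally had.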

In order to support this better, we attempted to represent our identifiers like MRI represents its IDs: as the raw bytes of whatever parsed identifier came in. This allows uniquely referencing a given symbol given just its raw bytes, provided the symbol is still alive. What we didn't do is propagate raw bytes throughout all identifier-related APIs; only some of them actually use the raw string, while others still use fully-decoded strings as characters.
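A rough illustration of why mixing the two forms hurts, again in plain Java with hypothetical names. It assumes that a "raw" string means one Java char per source byte (decoding as ISO-8859-1), which keeps the bytes recoverable but is not readable text for multibyte characters:

```java
import java.nio.charset.StandardCharsets;
import java.util.HashMap;
import java.util.Map;

public class RawVersusDecoded {
    // UTF-8 bytes of the identifier "é" as they would appear in source.
    public static final byte[] SOURCE_BYTES = {(byte) 0xC3, (byte) 0xA9};

    // Assumed "raw" form: one char per byte, so the original bytes survive.
    public static String raw(byte[] bytes) {
        return new String(bytes, StandardCharsets.ISO_8859_1);
    }

    // Decoded form: the characters the bytes actually represent.
    public static String decoded(byte[] bytes) {
        return new String(bytes, StandardCharsets.UTF_8);
    }

    // One API stores under the raw key; another looks up with the decoded key.
    public static boolean lookupAcrossApis() {
        Map<String, Integer> methodTable = new HashMap<>();
        methodTable.put(raw(SOURCE_BYTES), 1);
        return methodTable.containsKey(decoded(SOURCE_BYTES));
    }

    public static void main(String[] args) {
        System.out.println(raw(SOURCE_BYTES).length());     // 2
        System.out.println(decoded(SOURCE_BYTES).length()); // 1
        System.out.println(lookupAcrossApis());             // false
    }
}
```

The raw key round-trips to the original bytes, but it never equals the decoded key for non-ASCII identifiers, and printing it directly shows two unrelated characters instead of "é".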

If we wish to fix this, we can't do it part way: mixing raw and decoded identifiers across APIs leads to conflicts that are hard or impossible to resolve.

There are two paths forward, as I see them:

  • Complete the transition to an ID-like system where every identifier can be properly converted back into an original byte[]+encoding tuple. I started this process in f7f5417 in the new_ids branch. This is a very large effort and may need to wait until a "JRuby 10k" given the wide-reaching API breakage that will result. It may never be feasible.
  • Accept that we will only ever be able to represent identifiers as UTF-16 and make that explicit. Use UTF-16 throughout all identifier APIs. In this approach, I'm not sure what would happen to symbols, since people do depend on them preserving their encoding, and then use those encoded symbols as identifiers.

Neither approach is really great.
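For reference, the first path could look roughly like the sketch below. This is not the code in the new_ids branch; `EncodedIdSketch.Id` is a hypothetical key type, and `java.nio.charset.Charset` stands in for whatever encoding object would really be used. It also ignores details such as how ASCII-only identifiers should be treated. The point is only that identity comes from the byte[]+encoding tuple, while decoding is reserved for display:

```java
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;
import java.util.Arrays;
import java.util.HashMap;
import java.util.Map;

public class EncodedIdSketch {
    // Hypothetical key type: raw bytes plus the encoding they were parsed in.
    public static final class Id {
        private final byte[] bytes;
        private final Charset encoding;

        public Id(byte[] bytes, Charset encoding) {
            this.bytes = bytes.clone(); // defensive copy keeps the key immutable
            this.encoding = encoding;
        }

        public byte[] bytes() { return bytes.clone(); }
        public Charset encoding() { return encoding; }

        // Decoding is for display only (e.g. error messages), never for identity.
        public String display() { return new String(bytes, encoding); }

        @Override public boolean equals(Object o) {
            if (!(o instanceof Id)) return false;
            Id other = (Id) o;
            return encoding.equals(other.encoding) && Arrays.equals(bytes, other.bytes);
        }

        @Override public int hashCode() {
            return 31 * Arrays.hashCode(bytes) + encoding.hashCode();
        }
    }

    // Hypothetical constant table keyed by Id instead of String.
    public static Map<Id, String> buildTable() {
        Map<Id, String> table = new HashMap<>();
        table.put(new Id(new byte[] {(byte) 0xE9}, StandardCharsets.ISO_8859_1), "latin1 constant");
        table.put(new Id(new byte[] {(byte) 0xC3, (byte) 0xA9}, StandardCharsets.UTF_8), "utf8 constant");
        return table;
    }

    public static void main(String[] args) {
        Map<Id, String> table = buildTable();
        System.out.println(table.size()); // 2: same character, distinct identifiers
        for (Id id : table.keySet()) {
            System.out.println(id.display() + " (" + id.encoding() + ")");
        }
    }
}
```

Every API that currently takes a String name would have to accept something like `Id` instead, which is where the wide-reaching breakage comes from.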

headius added a commit that referenced this issue Feb 24, 2016
This works properly, but because it uses a "raw" string the
resulting error message is mangled when MBC are present.

See #3697.

headius commented Apr 27, 2017

Likely to be fixed by identifier work by me and @enebo for 9.2.

headius commented May 16, 2018

Largely fixed by @enebo's symbol work.

@headius headius closed this as completed May 16, 2018