mangling for string literals #64

Open
zygoloid opened this issue Oct 2, 2018 · 0 comments

The ABI says that string literals in instantiation-dependent expressions are mangled thusly:

<expr-primary> ::= L <string type> E # string literal

... presumably because, in C++98, the type of the literal was the only property that could affect the validity of the instantiation-dependent expression. That is no longer the case; a C++11 program can inspect the contents of such a string literal in an instantiation-dependent expression, so we need to mangle said contents.

Proposal:

<expr-primary> ::= L <string type> <char>* <hash>? E    # string literal
<char> ::= <0-9a-zA-DF-Z>                               # values 48-57, 97-122, 65-68, 70-90
       ::= _<hex><hex>                                  # other chars encoded in (big-endian) hexadecimal
       ::= __<hex><hex><hex><hex>
       ::= ___<hex><hex><hex><hex><hex><hex>
       ::= ____<hex><hex><hex><hex><hex><hex><hex><hex>
<hex> ::= <0-9a-f>
<hash> ::= <hex>{M}

... where the first N (say, 16) characters of the string are encoded directly. If the string is longer than N characters, they are followed by a 4M-bit hash of the entire string (algorithm TBD, but following target endianness). For all purposes other than determining the type, the terminating nul character is ignored.
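
A minimal sketch of the encoding rules above, in Python for brevity. The names `encode_unit` and `mangle_string_literal` are hypothetical. The defaults N = 16 and M = 8, the use of MD5, and the choice to hash the code units serialized at their natural width in target byte order are all assumptions, since the proposal leaves the hash algorithm open.

```python
import hashlib

# Characters emitted as themselves: 0-9, a-z, A-D, F-Z. 'E' is excluded
# because it terminates the <expr-primary>.
DIRECT = set("0123456789abcdefghijklmnopqrstuvwxyzABCDFGHIJKLMNOPQRSTUVWXYZ")


def encode_unit(unit):
    """Encode one code unit as a <char>."""
    if unit < 128 and chr(unit) in DIRECT:
        return chr(unit)
    # Use the shortest escape that fits: 1..4 underscores, 2..8 hex digits.
    for underscores in (1, 2, 3, 4):
        digits = 2 * underscores
        if unit < 16 ** digits:
            return "_" * underscores + format(unit, "0%dx" % digits)
    raise ValueError("code unit does not fit in 32 bits")


def mangle_string_literal(units, elem_type="c", elem_width=1,
                          n=16, m=8, byteorder="little"):
    """Mangle a literal given its code units (terminating nul excluded)."""
    head = "".join(encode_unit(u) for u in units[:n])
    tail = ""
    if len(units) > n:
        # ASSUMPTION: hash the code units at their natural width in target
        # byte order and keep the leading m hex digits of the MD5 digest.
        data = b"".join(u.to_bytes(elem_width, byteorder) for u in units)
        tail = hashlib.md5(data).hexdigest()[:m]
    # The array bound counts the terminating nul; nothing else does.
    return "LA%d_%s%s%sE" % (len(units) + 1, elem_type, head, tail)


hello = [ord(c) for c in "Hello, world!"]
print(mangle_string_literal(hello))  # LA14_cHello_2c_20world_21E
```

Strings of at most N characters never reach the hash, so their mangling does not depend on the hash choice at all.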

The idea here is to preserve the string literal contents (at least the start of it) so that demanglers can display it, while avoiding mangling the entire contents of very long strings.

As an example, if we take N = 16, M = 8, and use MD5 as our hashing algorithm (taking the high-order 32 bits of its output), "Hello, world!" would mangle as LA14_cHello_2c_20world_21E, and U"this is a very long string indeed" would mangle as LA34_Dithis_20is_20a_20very_20l1cf8df38E.
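
Going the other way, here is a rough sketch of what a demangler could recover from such a mangling. The function name `demangle_string_literal` is hypothetical, and the sketch assumes N = 16 and recognises only a handful of element-type encodings (c, w, Ds, Di, optionally preceded by K). A real demangler would reuse its full type parser instead.

```python
import re

# ASSUMPTION: only these element types are recognised (char, wchar_t,
# char16_t, char32_t), optionally const-qualified with K.
LITERAL = re.compile(r"LA(\d+)_(K?(?:c|w|Ds|Di))(.*)E")


def demangle_string_literal(mangled, n=16):
    """Return (element type, decoded prefix, hash digits or None)."""
    match = LITERAL.fullmatch(mangled)
    if match is None:
        raise ValueError("not a string-literal mangling")
    length = int(match.group(1)) - 1  # drop the terminating nul
    elem_type, body = match.group(2), match.group(3)
    chars, pos = [], 0
    for _ in range(min(length, n)):
        if body[pos] == "_":
            # k underscores introduce 2*k hexadecimal digits.
            start = pos
            while body[pos] == "_":
                pos += 1
            digits = 2 * (pos - start)
            chars.append(chr(int(body[pos:pos + digits], 16)))
            pos += digits
        else:
            chars.append(body[pos])
            pos += 1
    # Whatever follows the first n characters is the hash.
    digest = body[pos:] if length > n else None
    return elem_type, "".join(chars), digest


print(demangle_string_literal("LA14_cHello_2c_20world_21E"))
# ('c', 'Hello, world!', None)

# Written with the terminating E that the grammar requires.
print(demangle_string_literal("LA34_Dithis_20is_20a_20very_20l1cf8df38E"))
# ('Di', 'this is a very l', '1cf8df38')
```

Note that the demangler needs both the array bound and N to tell where the directly encoded characters end and the hash begins, because hexadecimal digits are also valid `<char>`s.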


If we like this direction, there are a few open questions:

  • Should we encode the remainder of the string if that would be shorter than the hash?
  • What hash algorithm should we use (and what values of N and M)? How much do we care about collision resistance, given that almost any choice will shield us from accidental collisions? It seems plausible that someone will use a pair of strings with known-colliding MD5 sums as template arguments in (e.g.) test code for an MD5 algorithm, and at least one common way of generating such a pair produces two strings with the same prefix. How much should we care about that? (It would be easy to "fix" such cases by first applying some simple invertible transform to the string data, so that the colliding pairs people are likely to use in practice differ from the colliding pairs for our hash.)
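
To make the two questions above concrete, here is a small sketch of one possible answer to each. Both helpers (`choose_tail`, `pre_hash_transform`) and the XOR constant are hypothetical illustrations, not part of the proposal, and MD5 is used only because the example above uses it.

```python
import hashlib


def choose_tail(encoded_remainder, digest_hex):
    """Emit the directly encoded remainder when it is shorter than the hash."""
    # Strictly shorter: a tail of exactly len(digest_hex) digits is then
    # always a hash, so a demangler can tell the two cases apart by length.
    if len(encoded_remainder) < len(digest_hex):
        return encoded_remainder
    return digest_hex


def pre_hash_transform(data, key=0x5A):
    """A simple invertible transform (XOR with a constant; its own inverse)."""
    return bytes(b ^ key for b in data)


def literal_hash(data, m=8):
    """Hash the transformed bytes and keep the leading m hex digits."""
    return hashlib.md5(pre_hash_transform(data)).hexdigest()[:m]


# A 17-character string leaves a 1-character remainder after N = 16,
# which is shorter than an 8-digit hash.
print(choose_tail("x", "1cf8df38"))         # x
print(choose_tail("abcdefghi", "1cf8df38"))  # 1cf8df38
```

With a transform like this, a pair of strings built to collide under plain MD5 is hashed as a different pair of byte sequences, which would be expected not to collide, although anyone who knows the transform can still construct pairs that do.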