mangling for string literals #64

zygoloid · 2018-10-02T19:38:03Z

The ABI says that string literals in instantiation-dependent expressions are mangled thusly:

<expr-primary> ::= L <string type> E # string literal

... presumably because, in C++98, the type of the literal was the only property that could affect the validity of the instantiation-dependent expression. That is no longer the case; a C++11 program can inspect the contents of such a string literal in an instantiation-dependent expression, so we need to mangle said contents.

Proposal:

<expr-primary> ::= L <string type> <char>* <hash>? E    # string literal
<char> ::= <0-9a-zA-DF-Z>                               # values 48-57, 97-122, 65-68, 70-90
       ::= _<hex><hex>                                  # other chars encoded in (big-endian) hexadecimal
       ::= __<hex><hex><hex><hex>
       ::= ___<hex><hex><hex><hex><hex><hex>
       ::= ____<hex><hex><hex><hex><hex><hex><hex><hex>
<hex> ::= <0-9a-f>
<hash> ::= <hex>{M}

... where the first N (say, 16) characters of the string are encoded directly. If the string is longer than N characters, the encoded prefix is followed by a 4M-bit hash of the entire string (algorithm TBD, but following target endianness). For all purposes other than determining the type, the terminating nul character is ignored.
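To make the proposed <char> production concrete, here is a minimal Python sketch of encoding a single code unit. The function name `encode_char` and the rule "use the shortest escape form that fits the value" are assumptions made for illustration; the grammar above does not say whether the escape width depends on the value or on the character type.

```python
# Code units that the proposed <char> production emits unchanged:
# 0-9, a-z, A-D and F-Z ('E' is left out, presumably because it
# terminates <expr-primary>).
DIRECT = frozenset(b"0123456789abcdefghijklmnopqrstuvwxyzABCDFGHIJKLMNOPQRSTUVWXYZ")


def encode_char(value: int) -> str:
    """Encode one code unit (hypothetical helper, not part of the ABI)."""
    if not 0 <= value <= 0xFFFFFFFF:
        raise ValueError("code unit out of range for the four escape forms")
    if value in DIRECT:
        return chr(value)
    # Assumption: pick the shortest escape form that can hold the value.
    nbytes = max(1, (value.bit_length() + 7) // 8)
    # One underscore per byte, then lowercase big-endian hex digits.
    return "_" * nbytes + format(value, "0{}x".format(2 * nbytes))
```

Under these assumptions, `encode_char(ord(","))` gives `_2c`, and a 16-bit code unit such as 0x20ac gives `__20ac`.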

The idea here is to preserve the string literal contents (at least the start of it) so that demanglers can display it, while avoiding mangling the entire contents of very long strings.

As an example, if we take N = 16, M = 8, and use MD5 as our hashing algorithm (taking the high-order 32 bits of its output), "Hello, world!" would mangle as LA14_cHello_2c_20world_21E, and U"this is a very long string indeed" would mangle as LA34_Dithis_20is_20a_20very_20l1cf8df38E.
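The example can be reproduced roughly with the Python sketch below, using N = 16, M = 8 and MD5. Everything named here (`encode_unit`, `mangle_string_literal`, the `width` parameter) is hypothetical. The proposal leaves the hash input unspecified, so the sketch assumes that MD5 is taken over the code units serialized little-endian without the terminating nul, and that "high-order 32 bits" means the first 8 hex digits of the digest. Because of those assumptions, the hash digits it produces for the long string need not match the ones quoted above.

```python
import hashlib

# Code units emitted unchanged by the proposed <char> production.
DIRECT = frozenset(b"0123456789abcdefghijklmnopqrstuvwxyzABCDFGHIJKLMNOPQRSTUVWXYZ")


def encode_unit(value):
    """Encode one code unit; assumes the shortest escape form that fits."""
    if value in DIRECT:
        return chr(value)
    nbytes = max(1, (value.bit_length() + 7) // 8)
    return "_" * nbytes + format(value, "0{}x".format(2 * nbytes))


def mangle_string_literal(units, elem, width, n=16, m=8):
    """Sketch of the proposed mangling.

    units: code units of the literal, without the terminating nul
    elem:  mangling of the element type, e.g. "c" or "Di"
    width: bytes per code unit (only used to build the hash input)
    """
    # The array bound in the string type counts the terminating nul.
    string_type = "A{}_{}".format(len(units) + 1, elem)
    prefix = "".join(encode_unit(u) for u in units[:n])
    suffix = ""
    if len(units) > n:
        # Assumption: hash the little-endian bytes of all code units.
        data = b"".join(u.to_bytes(width, "little") for u in units)
        suffix = hashlib.md5(data).hexdigest()[:m]
    return "L" + string_type + prefix + suffix + "E"


hello = mangle_string_literal(list(b"Hello, world!"), "c", 1)
long_text = "this is a very long string indeed"
long_ = mangle_string_literal([ord(ch) for ch in long_text], "Di", 4)
print(hello)
print(long_)
```

The first call yields `LA14_cHello_2c_20world_21E`, matching the example. The second starts with `LA34_Dithis_20is_20a_20very_20l`, followed by 8 hex digits and the closing `E`.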

If we like this direction, there are a few open questions:

Should we encode the remainder of the string if that would be shorter than the hash?
What hash algorithm should we use (and what values of N and M)? How much do we care about collision-resistance, given that almost any choice will shield us from accidental collisions? It seems plausible that someone will use a pair of strings with known-colliding MD5 sums as template arguments in (e.g.) test code for an MD5 algorithm, and at least one common way of generating such a pair produces two strings with the same prefix. How much should we care about that? (It'd be easy to "fix" such cases by applying some simple invertible transform to the string data first, so that the colliding pairs that people are likely to want to use in practice are different from the colliding pairs for our hash.)
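As a rough illustration of the "simple invertible transform" idea, the Python sketch below XORs every byte with a fixed constant before hashing. The constant `MASK`, the function names, and the choice of XOR are all assumptions; the issue does not name a specific transform. The point is only that a published MD5-colliding pair (x, y) is fed to the hash as (T(x), T(y)), which is a different pair of inputs and would not be expected to collide.

```python
import hashlib

MASK = 0x5A  # hypothetical constant; any fixed nonzero byte would do


def transform(data: bytes) -> bytes:
    """XOR each byte with MASK; applying it twice restores the input."""
    return bytes(b ^ MASK for b in data)


def transformed_hash(data: bytes, m: int = 8) -> str:
    """First m hex digits of MD5 over the transformed bytes (sketch only)."""
    return hashlib.md5(transform(data)).hexdigest()[:m]
```

Since the transform is a bijection on byte strings, it cannot by itself create or remove collisions in general. It only moves them away from the inputs people are likely to have on hand.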


zygoloid mentioned this issue Oct 2, 2018

mangling for non-type template arguments of class type #63

Open

zygoloid mentioned this issue Apr 10, 2019

need a mangling for string literals in inline variable initializers #78

Open
