Approach for library-specific mangling compression #70
Comments
I’m worried about tying mangling to non-source attributes like “vendor”: How does a compiler “know” which vendor’s library it is compiling?
Have you instead considered a general “catalog substitution” attribute? E.g.:
[[cxxabi::mangling("Scup")]] template<typename T, …> …
?
Daveed
On Nov 19, 2018, at 2:37 AM, John McCall wrote:
libc++ is revising its ABI, at least for some of its clients, and is very interested in using new "catalog" substitutions for the new ABI.
Some of its clients that wish to use a new ABI also correspond to new targets, but libc++ is not suggesting that they would use target-specific mangling rules; instead, they will also be changing their versioning namespace from __1 to __2 for these clients, and so manglings will not change for any existing entities.
We should recognize that the list of "catalog" substitutions is likely to keep growing. This will surely not be the last ABI version of libc++; further, the C++ committee will surely add more entities to the standard library; and then, libc++ may only belatedly realize that a particular entity was worth compressing, such that it will only be in the catalog for ABI versions N and higher. And, of course, this catalog offer also has to be extended to other standard library implementations, and in some cases they may need to put slightly different entries in the catalog. So the cataloging work will scale by the number of implementations, and the number of ABI versions, and the size of the standard library.
Nevertheless, I personally feel that it's appropriate for the Itanium ABI to support a large catalog here. If we're careful about the structure of these substitutions, we can keep the costs from getting too obviously combinatorial. But I'd like to get consensus on this before encouraging libc++ to start investigating which substitutions to include.
My current thinking is that we should add this in a fairly structured way to the grammar:
<substitution> ::= S <library-vendor> <library-version number> <library-entity>
<library-vendor> ::= c # libc++
etc.
<library-entity> ::= s # lib::basic_string<char, lib::char_traits<char>, lib::allocator<char>>
<library-entity> ::= up # lib::unique_ptr
etc.
with the expectation that there's an ad hoc rule for turning a combination of a library vendor and version into a namespace. Manglers and demanglers then only need to know three things:
the mapping of library-entities to/from relative entities within the library namespace,
the mapping of a particular vendor+version to/from a particular library namespace, and
the set of library-entities that are substituted in any particular vendor+version.
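The three pieces of knowledge listed above can be sketched as small lookup tables. This is only an illustration, not a proposed implementation: the type and function names are hypothetical, the pairing of vendor `c` version 2 with `std::__2` is an assumed example of the "ad hoc rule", the version number is assumed to be written in decimal, and the entity code is assumed to run to the end of the input. Only the vendor letter `c` and the entity codes `s` and `up` come from the grammar above.

```cpp
#include <cstddef>
#include <string_view>

// Hypothetical table rows. `since` records the first version of that vendor's
// library in which the entity is substituted (item 3 in the list above).
struct LibraryNamespace { char vendor; unsigned version; std::string_view ns; };
struct LibraryEntity { char vendor; unsigned since; std::string_view code; std::string_view name; };

// Item 2: vendor+version -> library namespace (assumed pairing, for illustration).
constexpr LibraryNamespace kNamespaces[] = {
    {'c', 2, "std::__2"},
};

// Item 1: entity code -> entity relative to the library namespace ("lib::").
constexpr LibraryEntity kEntities[] = {
    {'c', 2, "s", "lib::basic_string<char, lib::char_traits<char>, lib::allocator<char>>"},
    {'c', 2, "up", "lib::unique_ptr"},
};

constexpr std::string_view lookupNamespace(char vendor, unsigned version) {
    for (const auto& n : kNamespaces)
        if (n.vendor == vendor && n.version == version) return n.ns;
    return {};  // unknown vendor+version
}

constexpr std::string_view lookupEntity(char vendor, unsigned version, std::string_view code) {
    for (const auto& e : kEntities)
        if (e.vendor == vendor && version >= e.since && e.code == code) return e.name;
    return {};  // not substituted in this vendor+version
}

// Result of splitting S <library-vendor> <number> <library-entity>.
struct CatalogRef { bool ok; char vendor; unsigned version; std::string_view entity; };

constexpr CatalogRef parseCatalogRef(std::string_view s) {
    if (s.size() < 4 || s[0] != 'S') return {};
    if (s[1] < 'a' || s[1] > 'z') return {};  // vendor is a lowercase letter here
    std::size_t i = 2;
    unsigned version = 0;
    while (i < s.size() && s[i] >= '0' && s[i] <= '9') {
        version = version * 10 + static_cast<unsigned>(s[i] - '0');
        ++i;
    }
    if (i == 2 || i == s.size()) return {};  // no digits, or no entity code
    return {true, s[1], version, s.substr(i)};
}
```

Under these assumptions, `parseCatalogRef("Sc2up")` splits into vendor `c`, version 2, and entity `up`, which the two tables then resolve to a namespace and a relative entity.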
We should be relatively parsimonious about adding new library-vendor abbreviations, especially one-byte ones; there are only 19 characters available following S. This could create a bit of a political minefield in the future.
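The figure of 19 is consistent with a simple count, assuming vendor codes are restricted to lowercase letters: the existing Itanium abbreviations St, Sa, Sb, Ss, Si, So, and Sd already use seven of the 26, while digits, uppercase letters, and `_` belong to seq-id substitutions. The helper names below are hypothetical.

```cpp
#include <string_view>

// Second letters of the existing S-abbreviations: St Sa Sb Ss Si So Sd.
constexpr std::string_view kTaken = "tabsiod";

constexpr bool isTaken(char c) {
    return kTaken.find(c) != std::string_view::npos;
}

// Counts lowercase letters not yet claimed after 'S'.
constexpr int countFreeLowercase() {
    int n = 0;
    for (char c = 'a'; c <= 'z'; ++c)
        if (!isTaken(c)) ++n;
    return n;
}

static_assert(countFreeLowercase() == 19, "26 lowercase letters minus 7 taken");
```

Assigning `c` to libc++ as proposed would leave 18 one-byte vendor codes.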
Library version numbers don't have to correspond to any versioning scheme used elsewhere. In particular, they do not have to correspond to the number used in e.g. std::__2. Note that one advantage of adding these compressions is that it eliminates some of the pressure for library vendors to use short names for their versioning namespaces in the first place. In fact, we may want to encourage libraries to use namespaces that are systematized the same way as the mangling, e.g. std::__c2 — although they might not want to do that, since such names have a habit of making their way into user-visible diagnostics.
We may want to consider whether these substitutions should introduce candidate substitutions for the seq-id compression. seq-id substitutions will often be shorter than these 4–5-byte catalog substitutions, which isn't possible for the current catalog. Of course, introducing candidates this way may also lengthen other candidates.
|
It doesn't, sorry for being unclear. Let me try to explain that part better.
|
It strikes me that there are three major stakeholders here:
I don't think a system whereby the only indication that a substitution has been added can satisfy the needs of standard library authors and compiler vendors regarding ABI here. And I don't think a system with an ad-hoc grammar satisfies the needs of demangler vendors. So I have an alternative suggestion:
(In order to permit demanglers to skip unknown substitutions, no backreferences are introduced for elements within a substitution.)
Possible grammar:
<substitution> ::= S <library-vendor> <vendor-specific-substitution>
For example, libc++ could add:
... and also add the matching patterns to our substitution table:
(Because we don't hardcode the knowledge that |
Being able to avoid a compiler cycle to add a substitution is a really nice feature. That said, I have some concerns:
Combined, I think this points towards keeping the basic schema from my proposal but allowing the individual components to be source-directed: the namespace would declare the I'm not sure that allowing demanglers to silently skip over unrecognized library-entities is particularly important, especially not when it comes at a real cost for symbol lengths. The specialization-mangling idea is nice and does address my biggest concern about a source-directed approach, namely that it won't handle manglings well. It would be nice if we could extend this to partial specializations, e.g. |
@zygoloid Does my suggestion make sense? |
I think having an ad-hoc substitution format, where there's no guarantee that even a version-locked-to-your-compiler demangler can demangle your program's symbols, is a problem. There are lots of use cases that can cope with unknown substitutions, but that would fail if the name can't be demangled at all (e.g., using an old c++filt on a new symbol, or LLVM's profile data remapping system).
The symbol length question is interesting. How we build what is effectively a Huffman code depends entirely on the probabilities of the different substitutions. My scheme would be good in the case where there are <= 52 very common substitutions; yours would work better if the distribution is flatter. If we think there are more than 52 common substitutions, we could use a different encoding scheme, such as: a capital letter indicates more letters follow; a lowercase letter indicates the substitution has ended. So you then get 26^N substitutions of length N.
I also want us to be thinking about assigning vendor prefixes to common but non-standard C++ libraries, such as boost, eigen, absl, gsl (and reserve one for end-user use). Having a fixed global set of suffixes doesn't make any sense in that space. I think I would mostly buy into requiring all standard library vendor prefixes to have the same set of suffixes (and letting those other cases define their own suffixes), but that doesn't extend well to cover vendor extensions, and will at least create confusion if vendors adopt abbreviations for the same type at different times. (E.g., if Sc2up would be a demangleable mangling for std::__2::unique_ptr, but libc++ doesn't actually use the up suffix until std::__3 for ABI compatibility reasons.) Maybe that's not such a big deal; going from demangled name to mangled name without the original source isn't really something we support anyway.
We also need to figure out how this interacts with |
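The variable-length encoding floated in this comment (capital letters continue, a lowercase letter terminates) is self-delimiting, which is what would let a demangler skip a suffix it does not recognize. A minimal sketch, with hypothetical function names and no claim about how a real demangler would structure this:

```cpp
#include <cstddef>
#include <string_view>

constexpr bool isUpper(char c) { return c >= 'A' && c <= 'Z'; }
constexpr bool isLower(char c) { return c >= 'a' && c <= 'z'; }

// Length of the suffix at the start of `s`: zero or more capitals followed by
// one lowercase letter. Returns 0 if `s` does not begin with a complete suffix.
constexpr std::size_t suffixLength(std::string_view s) {
    for (std::size_t i = 0; i < s.size(); ++i) {
        if (isLower(s[i])) return i + 1;  // lowercase ends the suffix
        if (!isUpper(s[i])) return 0;     // anything else is malformed
    }
    return 0;  // ran out of input before a terminating lowercase letter
}

// Number of distinct suffixes of exactly n letters: 26 choices for each of the
// n-1 capitals times 26 choices for the final lowercase letter, i.e. 26^n.
constexpr unsigned long long suffixesOfLength(unsigned n) {
    unsigned long long total = 1;
    for (unsigned i = 0; i < n; ++i) total *= 26;
    return total;
}
```

A demangler that does not know a given suffix could still advance past `suffixLength` characters and continue, at the cost of printing a placeholder for that component.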
I was thinking about third-party libraries, and I agree that a source-directed approach makes this feasible for manglers, but I'm really worried about the impact on demanglers. We should set an expectation that demanglers won't be tracking any third-party suffixes — tracking even a single library's suffixes puts the demangler in a really nasty political position where it suddenly has to justify which libraries it includes. Instead, there should be some standard way to seed a demangler with the suffixes used by your libraries — maybe an environment variable containing a list of files containing suffix-to-entity mappings? But that's an idea we need to run by the actual demangler implementors. The set of |
FWIW: I think the ability to compress the mangling for third party libraries would be awesome (I've done crazy/stupid stuff in Boost to make symbols shorter). However, I think it would be great to have a solution that works for To get started, would it be useful to make some kind of survey of which symbols we want to compress (e.g. by checking a non-trivial code base and looking at which symbols occur more frequently)? This is something I could probably help with. |
Please note that there is zero reason for libc++ to use |
We should restart this conversation and try to reach consensus. I think @zygoloid and I were in agreement about this:
|
We notably do not have buy-in from any other compiler vendors about this, though. |
I like @ldionne's suggestion of collecting stdlib provided classes/templates for C++17 and for what's more or less decided for C++20, in terms of The primary focus as far as demanglers go should be on LLVM and GNU binutils (while hopefully letting other demanglers catch up in the meantime, i.e. the Abseil/Google demangler). That is the main reason for putting the stdlib ahead of vendor-specific namespaces: it is easier to centralize, and it can (and probably should) be done in a manner where both libstdc++ and libc++ use a format similar enough to be understood by common demanglers without adjustments for every stdlib implementation. This makes it far more realistic to introduce the new stdlib mangling schemes around the time C++20 support is more or less complete, since it seems reasonable to roll out ABI changes around the time a new C++ specification is finalized and implemented by compiler vendors.
As far as stdlib compression goes, if we stick with the numbered ABI versioning scheme (which, for libc++, is considered to be reserved for stdlib vendors), I don't think there is going to be a lot of pushback from GCC, although I'm speculating; hopefully someone who works more closely with GCC can give a better answer. And while not ideal, having the committee reserve certain numbers for stdlib vendors may be a good idea, given that libc++ and libstdc++ already do that; otherwise it seems improbable that such an ABI change could land around the time C++20 support arrives in major compilers. Hopefully what I said more or less makes sense. |
C++20 is feature complete now. Can we move this forward? |
I think we're blocked on three things:
|