Approach for library-specific mangling compression #70

Open
rjmccall opened this issue Nov 19, 2018 · 15 comments

@rjmccall
Collaborator

libc++ is revising its ABI, at least for some of its clients, and is very interested in using new "catalog" substitutions for the new ABI.

Some of its clients that wish to use a new ABI also correspond to new targets, but libc++ is not suggesting that they would use target-specific mangling rules; instead, they will also be changing their versioning namespace from __1 to __2 for these clients, and so manglings will not change for any existing entities.

We should recognize that the list of "catalog" substitutions is likely to keep growing. This will surely not be the last ABI version of libc++; further, the C++ committee will surely add more entities to the standard library; and then, libc++ may only belatedly realize that a particular entity was worth compressing, such that it will only be in the catalog for ABI versions N and higher. And, of course, this catalog offer also has to be extended to other standard library implementations, and in some cases they may need to put slightly different entries in the catalog. So the cataloging work will scale by the number of implementations, and the number of ABI versions, and the size of the standard library.

Nevertheless, I personally feel that it's appropriate for the Itanium ABI to support a large catalog here. If we're careful about the structure of these substitutions, we can keep the costs from getting too obviously combinatorial. But I'd like to get consensus on this before encouraging libc++ to start investigating which substitutions to include.

My current thinking is that we should add this in a fairly structured way to the grammar:

  <substitution>   ::= S <library-vendor> <library-version number> <library-entity>

  <library-vendor> ::= c     # libc++
  etc.

  <library-entity> ::= s     # lib::basic_string<char, lib::char_traits<char>, lib::allocator<char>>
  <library-entity> ::= up    # lib::unique_ptr
  etc.

with the expectation that there's an ad hoc rule for turning a combination of a library vendor and version into a namespace. Manglers and demanglers then only need to know three things:

  • the mapping of library-entities to/from relative entities within the library namespace,
  • the mapping of a particular vendor+version to/from a particular library namespace, and
  • the set of library-entities that are substituted in any particular vendor+version.
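The three mappings above can be sketched from the demangler's side roughly as follows. This is only an illustration: the table contents (`std::__2` for vendor `c`, version 2) and the name `expand_catalog_substitution` are assumptions made for the example, and the function expects the substitution on its own rather than embedded in a longer mangled name.

```cpp
#include <cassert>
#include <cctype>
#include <map>
#include <string>
#include <utility>

// Hypothetical tables; real contents would come from the ABI document.
// (vendor letter, version) -> library namespace.
static const std::map<std::pair<char, unsigned>, std::string> kNamespaces = {
    {{'c', 2u}, "std::__2"},  // assumed: libc++ ABI version 2
};
// library-entity code -> name relative to the library namespace ("lib").
static const std::map<std::string, std::string> kEntities = {
    {"s", "basic_string<char, lib::char_traits<char>, lib::allocator<char>>"},
    {"up", "unique_ptr"},
};

// Expands a substitution such as "Sc2up" to a qualified name. Returns an
// empty string if the vendor/version pair or the entity is not in the tables.
std::string expand_catalog_substitution(const std::string& m) {
    if (m.size() < 4 || m[0] != 'S') return "";
    const char vendor = m[1];
    std::size_t i = 2;
    unsigned version = 0;
    while (i < m.size() && std::isdigit(static_cast<unsigned char>(m[i]))) {
        version = version * 10 + static_cast<unsigned>(m[i] - '0');
        ++i;
    }
    if (i == 2) return "";  // no version digits after the vendor letter
    const auto ns = kNamespaces.find(std::make_pair(vendor, version));
    const auto entity = kEntities.find(m.substr(i));
    if (ns == kNamespaces.end() || entity == kEntities.end()) return "";
    // Replace every "lib::" placeholder with the concrete namespace.
    std::string name = entity->second;
    const std::string placeholder = "lib::";
    const std::string qualifier = ns->second + "::";
    for (std::size_t pos = name.find(placeholder); pos != std::string::npos;
         pos = name.find(placeholder, pos + qualifier.size())) {
        name.replace(pos, placeholder.size(), qualifier);
    }
    return qualifier + name;
}
```

With these tables, `Sc2up` expands to `std::__2::unique_ptr`, while an unregistered version such as `Sc3up` yields an empty result.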

We should be relatively parsimonious about adding new library-vendor abbreviations, especially one-byte ones; there are only 19 characters available following S. This could create a bit of a political minefield in the future.
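The figure of 19 follows from the existing grammar: digits and uppercase letters after `S` belong to `<seq-id>`, and seven lowercase letters are already taken by `St`, `Sa`, `Sb`, `Ss`, `Si`, `So` and `Sd`. A small sketch that spells out the arithmetic (the function name is made up for the example):

```cpp
#include <cassert>
#include <string>

// Returns the lowercase letters still free to follow 'S'.
// Digits and uppercase letters are excluded because <seq-id> uses them.
std::string free_vendor_letters() {
    // Already used by the ABI: St (std), Sa (allocator), Sb (basic_string),
    // Ss (string), Si (istream), So (ostream), Sd (iostream).
    static const std::string taken = "tabsiod";
    std::string free_letters;
    for (char c = 'a'; c <= 'z'; ++c) {
        if (taken.find(c) == std::string::npos) free_letters.push_back(c);
    }
    return free_letters;
}
```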

Library version numbers don't have to correspond to any versioning scheme used elsewhere. In particular, they do not have to correspond to the number used in e.g. std::__2. Note that one advantage of adding these compressions is that it eliminates some of the pressure for library vendors to use short names for their versioning namespaces in the first place. In fact, we may want to encourage libraries to use namespaces that are systematized the same way as the mangling, e.g. std::__c2 — although they might not want to do that, since such names have a habit of making their way into user-visible diagnostics.

We may want to consider whether these substitutions should introduce candidate substitutions for the seq-id compression. seq-id substitutions will often be shorter than these 4–5-byte catalog substitutions, which can't happen with the current catalog, whose entries are only two bytes long. Of course, introducing candidates this way may also lengthen other candidates.

@rjmccall
Collaborator Author

Tagging @jicama and @zygoloid especially.

@daveedvdv

daveedvdv commented Nov 19, 2018 via email

@rjmccall
Collaborator Author

rjmccall commented Nov 19, 2018

> I’m worried about tying mangling to non-source attributes like “vendor”: How does a compiler “know” which vendor’s library it is compiling?

It doesn't, sorry for being unclear. Let me try to explain that part better.

  1. As part of this proposal, the Itanium ABI would start keeping track of the inline namespaces used by various library vendors. Library implementors using ABI-versioning namespaces are generally already trying to avoid using namespaces that have been used by other implementors, so even completely ignoring compression, it makes sense for someone to act as a central registrar for these namespaces as a service to the community. Part of the purpose of the Itanium ABI is to encourage coordination between implementors, so it's abstractly reasonable for us to take that on; the only more central organization I can think of is the ISO committee, and if they'd like to do it, that's fine, too.

  2. The Itanium ABI would gradually grow a catalog of inline-namespace-relative substitutions, like std::<ns>::basic_string => b. We can grow this catalog without regard to ABI concerns: an entity being in the catalog doesn't mean it's abbreviated for any particular library. This catalog would be seeded with the seven substitutions that we apply in std:: (or at least it could be — arguably some of those don't merit single-character abbreviations).

  3. The Itanium ABI's registry of inline namespaces would optionally associate them with a vendor prefix + version and a subset of the catalog. New namespaces would presumably incorporate the entire current catalog as of that date.

  4. Manglers and demanglers would be responsible for turning entities within registered namespaces into the appropriate short mangling and vice-versa. So there's no extrinsic knowledge of the library vendor required here, just recognition that an entity is declared within a particular namespace which is known to the ABI.

  5. We would encourage library vendors to use a systematized approach for naming their inline namespaces, like std::__libcxx_v2. Maybe we'd even require this as a condition of getting the compression. This is much longer than they'd currently consider, but it no longer matters for symbol-table sizes because they're getting compressed to (at least) Sc2t, and the inline namespace names don't occur often enough in source code to make a difference to compile times.
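To make steps 2 to 4 concrete, here is a minimal sketch of the mangler's side of such a registry. Everything in the tables is hypothetical except the `basic_string => b` pairing used as the example above; the point is only that the lookup is driven by the namespace an entity is declared in, not by any knowledge of the vendor.

```cpp
#include <cassert>
#include <map>
#include <set>
#include <string>

// One registered inline namespace: its vendor prefix + version, and the
// subset of the shared catalog that is abbreviated within it.
struct RegisteredNamespace {
    std::string prefix;             // e.g. "Sc2"
    std::set<std::string> enabled;  // catalog suffixes in effect
};

// Shared catalog: namespace-relative entity -> suffix (hypothetical entries).
static const std::map<std::string, std::string> kCatalog = {
    {"basic_string", "b"},
    {"unique_ptr", "u"},
};

// Registry of inline namespaces known to the ABI (hypothetical contents).
static const std::map<std::string, RegisteredNamespace> kRegistry = {
    {"std::__libcxx_v2", {"Sc2", {"b"}}},
};

// Returns the short mangling for `entity` declared directly in `ns`, or an
// empty string if the mangler must fall back to the ordinary mangling.
std::string short_mangling(const std::string& ns, const std::string& entity) {
    const auto reg = kRegistry.find(ns);
    const auto cat = kCatalog.find(entity);
    if (reg == kRegistry.end() || cat == kCatalog.end()) return "";
    if (reg->second.enabled.count(cat->second) == 0) return "";
    return reg->second.prefix + cat->second;
}
```

Note that `unique_ptr` is in the catalog here but not enabled for the namespace, so it is not abbreviated: being in the catalog does not by itself change any library's ABI.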

@zygoloid
Contributor

It strikes me that there are three major stakeholders here:

  • Standard library authors want to be able to add new mangling shorthands, and have these be respected by existing compilers. They need to have stable ABIs, so it's unacceptable to introduce a name and then have a time delay before all compilers mangle it using a new substitution.
  • Compiler vendors want to be able to support shortened manglings, and don't want to be continuously updating the code that detects mangling shorthands for various different libraries. They also don't want to introduce ABI breaks when doing so.
  • Demangler vendors (more generally, anyone who wants to be able to interpret mangled names) would like to understand what such mangled names mean. Assumption: it's not completely critical for them to be able to demangle new shorthands added since the demangler was built, but they should still be able to produce a demangling for the name as a whole even if they need to replace the substitution with placeholder text.

I don't think a system whereby the only indication that a substitution has been added can satisfy the needs of standard library authors and compiler vendors regarding ABI here. And I don't think a system with an ad-hoc grammar satisfies the needs of demangler vendors.

So I have an alternative suggestion:

  • As in the original proposal, allocate prefix letters to vendors and allow them to choose what follows (but following a grammar that /we/ define, so that demangler vendors can skip substitutions they don't know)
  • Require a source-level annotation (an attribute on the first declaration of the entity) to describe the substitution
  • Require vendors to update a machine-readable list that we host with all substitutions they use, intended for use by authors of demanglers (maybe a list of lines where each line contains the substitution followed by a space followed by the mangled form of the replacement)

(In order to permit demanglers to skip unknown substitutions, no backreferences are introduced for elements within a substitution.)

Possible grammar:

<substitution> ::= S <library-vendor> <vendor-specific-substitution>
<library-vendor> ::= c # libc++
etc.
<vendor-specific-substitution> ::= [A-Za-z]
<vendor-specific-substitution> ::= <source-name>

For example, libc++ could add:

namespace std {
  [[abi::mangling_substitution("ScB")]] // ScB == ::std::__2
  namespace __2 {
    template<typename T, /*...*/>
      struct [[abi::mangling_substitution("Scb", <char>: "Scs")]]
        basic_string

... and also add the matching patterns to our substitution table:

ScB St3__2
Scb NScB12basic_stringE
Scs ScbIcNScB11char_traitsIcEENScB9allocatorIcEEE

(Because we don't hardcode the knowledge that ScB is a namespace substitution, we can't shorten NScB12basic_stringE to Sc12basic_string like we do for St. Maybe we could reserve uppercase letters for namespace substitutions and lowercase letters for other kinds of substitutions to fix that?)
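A demangler could load a table in that line format with very little code. The following is a minimal sketch assuming exactly the layout described above (substitution, one space, mangled replacement); the function name is invented for the example and no such loader exists in any current demangler as far as this thread establishes.

```cpp
#include <cassert>
#include <map>
#include <sstream>
#include <string>

// Parses the table format sketched above: one entry per line, holding the
// substitution, a single space, and the mangled form it stands for.
// Lines with no space, or with nothing before the space, are ignored.
std::map<std::string, std::string> parse_substitution_table(const std::string& text) {
    std::map<std::string, std::string> table;
    std::istringstream in(text);
    std::string line;
    while (std::getline(in, line)) {
        const std::size_t space = line.find(' ');
        if (space == std::string::npos || space == 0) continue;
        table[line.substr(0, space)] = line.substr(space + 1);
    }
    return table;
}
```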

@rjmccall
Collaborator Author

Being able to avoid a compiler cycle to add a substitution is a really nice feature. That said, I have some concerns:

  • Having a quadratically-scaling database of substitutions with ad hoc structure seems really problematic for demanglers even if we're technically maintaining that database for them. I think we really want to encourage substitution suffixes to be the same across libraries/versions.

  • Putting the entire substitution in source, rather than some "universal" suffix of it, also seems really problematic. In practice, most libraries will be supporting multiple ABI versions out of the same source tree. We shouldn't make errors of omission (failing to update a particular attribute) into silent ABI problems.

Combined, I think this points towards keeping the basic schema from my proposal but allowing the individual components to be source-directed: the namespace would declare the Sc2 part, and the nested declaration would just declare b (and any specialization information). The nested declaration would only be allowed within an attributed namespace. Libraries would still be expected to wait for ABI approval before rolling out new library-entities, but ABI approval takes about a week and back-deploys to old compilers, so that seems acceptable.

I'm not sure that allowing demanglers to silently skip over unrecognized library-entities is particularly important, especially not when it comes at a real cost for symbol lengths.

The specialization-mangling idea is nice and does address my biggest concern about a source-directed approach, namely that it won't handle manglings well. It would be nice if we could extend this to partial specializations, e.g. <?>: "up". There are some very important templates which have unpredictable first arguments but frequently-defaulted later arguments. For example, we can't anticipate that std::unique_ptr<foo> is worth an abbreviation, but it would be very nice to mangle it as Sc2u3foo, or at least Sc2uI3fooE, instead of Sc2uI3fooSc2ddS_E.

@rjmccall
Collaborator Author

@zygoloid Does my suggestion make sense?

@zygoloid
Contributor

I think having an ad-hoc substitution format, where there's no guarantee that even a version-locked-to-your-compiler demangler can demangle your program's symbols, is a problem. There are lots of use cases that can cope with unknown substitutions, but that would fail if the name can't be demangled at all (eg, using an old c++filt on a new symbol, or LLVM's profile data remapping system).

The symbol length question is interesting. How we build what is effectively a Huffman code depends entirely on the probabilities of the different substitutions. My scheme would be good in the case where there are <= 52 very-common substitutions; yours would work better if the distribution is flatter. If we think there are more than 52 common substitutions, we could use a different encoding scheme, such as: a capital letter indicates more letters follow; a lowercase letter indicates the substitution has ended. So you then get 26^N substitutions of length N.
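The capital-continues, lowercase-ends scheme can be sketched in a few lines. The numbering of codes within each length is an assumption of this example (the comment above does not specify one); what matters is that a reader can find the end of a code without knowing what it stands for.

```cpp
#include <cassert>
#include <cctype>
#include <string>

// Encodes the index-th code of length n: n-1 uppercase "more follows"
// letters, then one lowercase "end" letter. Requires index < 26^n.
std::string encode_running(unsigned long index, unsigned n) {
    assert(n >= 1);
    std::string code(n, 'a');
    for (unsigned i = n; i-- > 0;) {
        const char base = (i + 1 == n) ? 'a' : 'A';  // only the last is lowercase
        code[i] = static_cast<char>(base + index % 26);
        index /= 26;
    }
    assert(index == 0 && "index out of range for this length");
    return code;
}

// Returns the length of the running-character code at the start of `s`,
// or 0 if `s` does not start with a well-formed code.
std::size_t running_length(const std::string& s) {
    std::size_t i = 0;
    while (i < s.size() && std::isupper(static_cast<unsigned char>(s[i]))) ++i;
    if (i < s.size() && std::islower(static_cast<unsigned char>(s[i]))) return i + 1;
    return 0;
}
```

Each position contributes a factor of 26, which is where the 26^N count for length N comes from: 26 one-letter codes, 676 two-letter codes, and so on.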

I also want us to be thinking about assigning vendor prefixes to common but non-standard C++ libraries, such as boost, eigen, absl, gsl (and reserve one for end-user use). Having a fixed global set of suffixes doesn't make any sense in that space. I think I would mostly buy into requiring all standard library vendor prefixes to have the same set of suffixes (and letting those other cases define their own suffixes), but that doesn't extend well to cover vendor extensions, and will at least create confusion if vendors adopt abbreviations for the same type at different times. (Eg, if Sc2up would be a demangleable mangling for std::__2::unique_ptr, but libc++ doesn't actually use the up suffix until std::__3 for ABI compatibility reasons.) Maybe that's not such a big deal; going from demangled name to mangled name without the original source isn't really something we support anyway.

We also need to figure out how this interacts with abi_tag: should we assume that you only get one ABI tag per version (and so not mangle it) or that the ABI tag is orthogonal (and so mangle it in addition to the substitution)?

@rjmccall
Collaborator Author

c++filt doesn't partially demangle symbols that contain bad substitutions; it refuses to demangle them at all. That could be changed, of course, but otherwise I don't see a difference in observable behavior whether or not the symbols can be demangled without an up-to-date catalog. But I really like the uppercase-as-running-character idea: we can always demangle, it produces pretty short symbols, and it's literally impossible for a library to run out of compressions. Let's go with that.

I was thinking about third-party libraries, and I agree that a source-directed approach makes this feasible for manglers, but I'm really worried about the impact on demanglers. We should set an expectation that demanglers won't be tracking any third-party suffixes — tracking even a single library's suffixes puts the demangler in a really nasty political position where it suddenly has to justify which libraries it includes. Instead, there should be some standard way to seed a demangler with the suffixes used by your libraries — maybe an environment variable containing a list of files containing suffix-to-entity mappings? But that's an idea we need to run by the actual demangler implementors.

The set of abi_tags for an entity can vary by template instantiation, right?

@ldionne

ldionne commented Jan 23, 2019

FWIW: I think the ability to compress the mangling for third party libraries would be awesome (I've done crazy/stupid stuff in Boost to make symbols shorter). However, I think it would be great to have a solution that works for __2 at a minimum, sooner rather than later, given that some folks have expressed an interest in revving the ABI number in libc++ and that's an occasion we don't want to miss.

To get started, would it be useful to make some kind of survey of which symbols we want to compress (e.g. by checking a non-trivial code base and looking at which symbols occur more frequently)? This is something I could probably help with.
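One crude way to start such a survey is to count how many symbols in a dump contain each candidate mangled fragment. The sketch below does plain substring matching, so it is only a rough first pass rather than a real analysis of the mangled grammar; the function name and the choice of fragments are assumptions of the example.

```cpp
#include <cassert>
#include <map>
#include <string>
#include <vector>

// Counts, for each candidate fragment, how many symbols contain it.
std::map<std::string, std::size_t> count_fragments(
    const std::vector<std::string>& symbols,
    const std::vector<std::string>& fragments) {
    std::map<std::string, std::size_t> counts;
    for (const std::string& f : fragments) {
        std::size_t hits = 0;
        for (const std::string& s : symbols) {
            if (s.find(f) != std::string::npos) ++hits;
        }
        counts[f] = hits;
    }
    return counts;
}
```

Feeding it the symbol names from a large binary together with fragments such as `6vectorI` or `9allocatorI` would give a first ranking of which entities are worth abbreviating.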

@rjmccall
Collaborator Author

Please note that there is zero reason for libc++ to use __2 if you have an ABI-mandated compression for it.

@rjmccall
Collaborator Author

rjmccall commented Jan 24, 2019

We should restart this conversation and try to reach consensus. I think @zygoloid and I were in agreement about this:

  • We should use the grammar from my proposal in the first post, except that library-entity should follow Richard's running-character idea so that a demangler can skip an unknown entity. (An unknown library-vendor can be skipped by scanning to the next digit.)
  • Standard libraries will use a common set of library-entity manglings, which the ABI must approve additions to.
  • library-entitys should be source-directed in as-yet-unspecified ways, with as-yet-unspecified ways of specifying entities for concrete specializations and (maybe) patterns of specializations.
  • Non-std library vendors can ask for library-vendor prefixes, and we don't yet have consensus for how demanglers will handle that.
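Under the points agreed so far, a demangler that does not recognize a vendor or an entity could still step over the substitution. The sketch below assumes the shape `S`, vendor characters up to the next digit, version digits, then a running-character entity (uppercase letters followed by one lowercase letter); it also assumes the vendor starts with one of the lowercase letters not already used after `S`, so that the existing `St`/`Sa`/`Sb`/`Ss`/`Si`/`So`/`Sd` and seq-id forms are left alone. None of this is settled grammar.

```cpp
#include <cassert>
#include <cctype>
#include <string>

static bool is_digit(char c) { return std::isdigit(static_cast<unsigned char>(c)) != 0; }
static bool is_upper(char c) { return std::isupper(static_cast<unsigned char>(c)) != 0; }
static bool is_lower(char c) { return std::islower(static_cast<unsigned char>(c)) != 0; }

// Returns the index just past a catalog substitution starting at s[pos],
// or std::string::npos if the text there does not have the assumed shape.
std::size_t skip_catalog_substitution(const std::string& s, std::size_t pos) {
    static const std::string taken = "tabsiod";  // existing S<letter> forms
    const std::size_t n = s.size();
    if (pos >= n || s[pos] != 'S') return std::string::npos;
    std::size_t i = pos + 1;
    if (i >= n || !is_lower(s[i]) || taken.find(s[i]) != std::string::npos) {
        return std::string::npos;
    }
    while (i < n && !is_digit(s[i])) ++i;  // vendor: scan to the next digit
    if (i == n) return std::string::npos;  // no version number found
    while (i < n && is_digit(s[i])) ++i;   // version number
    while (i < n && is_upper(s[i])) ++i;   // "more follows" letters
    if (i == n || !is_lower(s[i])) return std::string::npos;
    return i + 1;                          // past the terminating letter
}
```

A caller that gets a valid index back can emit placeholder text for the skipped span and continue demangling the rest of the name.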

@rjmccall
Collaborator Author

We notably do not have buy-in from any other compiler vendors about this, though.

@christinaa

I like @ldionne's suggestion of collecting the stdlib-provided classes/templates in the std:: namespace for C++17 and for what's more or less decided for C++20. std::string_view and std::span would be good examples of things that are commonly used and yet lack shorthand mangling. As far as vendor-specific mangling goes, especially for something like the boost:: namespace, I definitely agree, and there the mentioned syntax for defining shorthand mangling forms may be more appropriate. In the case of stdlib implementations, though, I still think it would make more sense to just extend the scope of what's covered by shorthand mangling schemes, as I don't see this switch being viable, especially considering it has to be coordinated between Clang/libc++, GCC/libstdcxx, and other compiler vendors that use this ABI, as well as any other standard library vendors that may use Clang, GCC, or their own compilers with other standard libraries.

The primary focus as far as demanglers go should be on LLVM and GNU binutils (while hopefully letting other demanglers catch up in the meantime, e.g. Abseil/Google Demangler). That is the main reason for putting stdlib ahead of vendor-specific namespaces: it's easier to centralize, and it can (and probably should) be done in a manner where both libstdcxx and libc++ use a format similar enough to be understood by common demanglers without having to make adjustments for every stdlib implementation.

This makes it a far more realistic goal for introducing the new stdlib mangling schemes hopefully around the same time C++20 support is more or less complete as it seems reasonable to roll out ABI changes around the time the new C++ specifications from the C++ committee are finalized and implemented by compiler vendors.

> We notably do not have buy-in from any other compiler vendors about this, though.

I think as far as stdlib compression goes, if we stick with the numbered ABI versioning scheme (which, for libc++ is considered to be reserved for stdlib vendors), I don't think there is going to be a lot of pushback from GCC, although I'm speculating; hopefully someone who works more closely with GCC can give a better answer/opinion. And while not ideal, the committee reserving certain numbers for stdlib vendors, given that libc++ and libstdcxx already do that, may be a good idea, as otherwise it seems improbable that such an ABI change could land around the time C++20 support arrives in major compilers.

Hopefully what I said more or less makes sense.

@jfbastien

C++20 is feature complete now. Can we move this forward?

@rjmccall
Collaborator Author

rjmccall commented Aug 1, 2019

I think we're blocked on three things:

  • We need an actual implementation (or two) to demonstrate feasibility.
  • We need a draft change to the ABI.
  • We need input from implementors other than Clang.
