Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Use of std::move in libc++ leads to worsened debug performance #53689

Closed
vittorioromeo opened this issue Feb 9, 2022 · 42 comments
Closed

Use of std::move in libc++ leads to worsened debug performance #53689

vittorioromeo opened this issue Feb 9, 2022 · 42 comments
Labels
libc++ libc++ C++ Standard Library. Not GNU libstdc++. Not libc++abi. performance

Comments

@vittorioromeo
Copy link

std::accumulate is defined as follows in libc++:

template <class _InputIterator, class _Tp>
_LIBCPP_INLINE_VISIBILITY _LIBCPP_CONSTEXPR_AFTER_CXX17
_Tp
accumulate(_InputIterator __first, _InputIterator __last, _Tp __init)
{
    for (; __first != __last; ++__first)
#if _LIBCPP_STD_VER > 17
        __init = _VSTD::move(__init) + *__first;
#else
        __init = __init + *__first;
#endif
    return __init;
}

When compiling a program using std::accumulate in debug mode, under -Og, there is a noticeable performance impact due to the presence of std::move.

This performance degradation is one example of why many people (especially in the gamedev community) are not adopting standard library algorithms and modern C++ more widely.

Would it be possible to replace std::move calls internal to libc++ with a cast, or some sort of compiler intrinsic? Or maybe mark std::move as "always inline" even without optimizations enabled?

@EugeneZelenko EugeneZelenko added libc++ libc++ C++ Standard Library. Not GNU libstdc++. Not libc++abi. performance and removed new issue labels Feb 9, 2022
@efriedma-quic
Copy link
Collaborator

The behavior of -O1/-Og changed in clang 13 when we switched on the new pass manager by default, so it now does simple inlining. gcc similarly does inlining at -O1/-Og. That's probably better than sprinkling always_inline annotations in libc++... unless you're worried about -O0.

@vittorioromeo
Copy link
Author

The behavior of -O1/-Og changed in clang 13 when we switched on the new pass manager by default, so it now does simple inlining. gcc similarly does inlining at -O1/-Og.

Thanks for the info, it is good to hear that things have improved in 13.

That's probably better than sprinkling always_inline annotations in libc++... unless you're worried about -O0.

Well, let's put it this way. Why should a std::move invocation ever result in a function call? Same thing with std::forward, and probably more similar other functions. It's just an unnecessary performance overhead, it makes the call stack unnecessarily deeper, and it has an effect on the overall reputation and popularity of the language. I have seen multiple reports of people bewildered by functions like std::move appearing in their profiler outputs, especially if they come internally from their standard library.

If we want to improve the situation for libc++ internals specifically, it would be sufficient to change internal usages of std::move with a cast. I would be happy to give it a go as my first ever LLVM PR, if deemed a reasonable change.

However, if we want to improve the situation for everyone, to me it sounds like a good idea to also mark functions such as std::move or std::forward with some special attribute that forces inlining even in -O0. These functions are basically wrappers around casts -- people expect them to behave as such even in terms of performance and impact on the call stack.

@philnik777
Copy link
Contributor

The problem with your approach is that then the question then becomes 'Why not also mark begin() and end() as __attribute__((__always_inline__))?'
I don't really see a problem if it is optimized away at -O1. I think it's OK to expect people who need more performance when debugging to let the compiler optimize a little bit. And remember, the standard library is also a code base which people work on. It's already completely unreadable because of _Uglification, don't make it worse without much of a benefit (and writing static_cast<T&&>(__val) is worse than writing std::move(__val)).

@vittorioromeo
Copy link
Author

vittorioromeo commented Feb 16, 2022

The problem with your approach is that then the question then becomes 'Why not also mark begin() and end() as __attribute__((__always_inline__))?'

Is that actually a bad idea? I cannot think of a reason why you wouldn't want these functions inlined for all standard containers, unless it significantly decreases compilation speed at -O0.

I would want every function that is either (1) basically a cast (e.g. std::move, std::forward) or (2) a hint to the compiler (e.g. std::launder) or (3) trivial factory function (e.g. std::vector<T>::begin()) to not affect debug performance at all. Honest question -- would marking these as [[always_inline]] be unreasonable?

the standard library is also a code base which people work on. It's already completely unreadable because of _Uglification, don't make it worse without much of a benefit (and writing static_cast<T&&>(__val) is worse than writing std::move(__val)).

I am genuinely sympathetic to the suffering caused by having to work with an "ugly" codebase. But as you said, it is already completely unreadable. While it is technically true that static_cast<_Tp&&>(__init) is slightly uglier than _VSTD::move(__init) (note, not std::move...), the readability impact is miniscule given the context we are in. Anybody who is used to work in a "completely unreadable" (by necessity) codebase such as libc++ will be the kind of person who knows that static_cast<T&&>(x) is a move (assuming that x is not a forwarding reference)`.

I believe that miniscule readability loss in an already unreadable situation is worth avoiding a 3x performance loss for the end user (in a micro-benchmark). Additionally, the positive impact such a change would have on the reputation of C++ in the gamedev community cannot be understated.

@philnik777
Copy link
Contributor

Is that actually a bad idea? I cannot think of a reason why you wouldn't want these functions inlined for all standard containers, unless it significantly decreases compilation speed at -O0.
I would want every function that is either (1) basically a cast (e.g. std::move, std::forward) or (2) a hint to the compiler (e.g. std::launder) or (3) trivial factory function (e.g. std::vector::begin()) to not affect debug performance at all. Honest question -- would marking these as [[always_inline]] be unreasonable?

I don't think it's a good idea to just put [[gnu::always_inline]] just everywhere. As stated earlier, if you want these things inlined you just have to add -O1 to your command line args. That's not much of a task. But it just gets harder to read all the function signatures. For example move is currently defined as:

template <class _Tp>
_LIBCPP_NODISCARD_EXT inline _LIBCPP_INLINE_VISIBILITY _LIBCPP_CONSTEXPR typename remove_reference<_Tp>::type&&
move(_Tp&& __t) _NOEXCEPT {
  typedef _LIBCPP_NODEBUG typename remove_reference<_Tp>::type _Up;
  return static_cast<_Up&&>(__t);
}

Using another attribute and some reformatting we get:

_LIBCPP_NODISCARD_EXT _LIBCPP_ALWAYS_INLINE inline _LIBCPP_HIDE_FROM_ABI _LIBCPP_CONSTEXPR
typename remove_reference<_Tp>::type&&
move(_Tp&& __t) _NOEXCEPT {
  typedef _LIBCPP_NODEBUG typename remove_reference<_Tp>::type _Up;
  return static_cast<_Up&&>(__t);
}

It just doesn't get better. Note that we can't use [[gnu::always_inline]] because we support C++03 (kind of).

(note, not std::move...)

Actually, it is std::move() since https://llvm.org/D117811 (yay!).

Anybody who is used to work in a "completely unreadable" (by necessity) codebase such as libc++ will be the kind of person who knows that static_cast<T&&>(x) is a move (assuming that x is not a forwarding reference)`.

Yes, I know that. But when I see std::move(x) I don't have to think. Seeing static_cast<T&&>(x) makes me think "What was the intention of this cast?". I don't like to think unnecessarily. You yourself said:

assuming that x is not a forwarding reference

So I have to look at a few things instead of just seeing std::move() and instantly knowing what the intention was and what it actually does.

I believe that miniscule readability loss in an already unreadable situation is worth avoiding a 3x performance loss for the end user (in a micro-benchmark). Additionally, the positive impact such a change would have on the reputation of C++ in the gamedev community cannot be understated.

I don't see anybody who actually needs some performance in a debug build using -O0, and since it's inlined in -O1 I don't see a reason we should make libc++ even uglier than it already is. Maybe it would be more appropriate to make -Og actually a nice to debug optimization setting and - who knows - even make it default? But that's not libc++'s problem, that's clang's and GCC's problem to solve.

@vittorioromeo
Copy link
Author

As stated earlier, if you want these things inlined you just have to add -O1 to your command line args. That's not much of a task.

By doing that, the debugging experience suffers greatly as optimizations can make it much harder for humans to step through the code in the debugger.

But it just gets harder to read all the function signatures.

You are technically correct, but going from "75% unreadable" to "76% unreadable" for concrete benefits to the end-user experience and debugging performance seems like a worthwhile sacrifice to me. I hope you get the point I'm trying to make: adding an extra attribute doesn't really change the overall readability situation.

Yes, I know that. But when I see std::move(x) I don't have to think. Seeing static_cast<T&&>(x) makes me think "What was the intention of this cast?". I don't like to think unnecessarily.

That's fair enough, I would prefer to see std::move as well. But I would always have the user experience as a priority in my mind. If I were tasked with giving out priorities, this is what I would say:

  1. Ensure that std::move is always inlined via attribute or special compiler magic and use it (no negative impact on end user);
  2. Use static_cast<T&&> (no negative impact on end user);
  3. Use std::move with the risk of degrading the end user experience.

I don't see anybody who actually needs some performance in a debug build using -O0

Here are a few examples:

I can find many many more. The gamedev community is very vocal about this issue, and the message that they're sending out is: "avoid Modern C++". Truthfully speaking, I cannot blame them, and -- sadly -- they are getting a lot of popularity and consensus.

We need to do better and start supporting their use cases. If an extra attribute on a few function signatures can significantly improve debugging performance, I think it's a huge win for the C++ community as a whole.

Maybe it would be more appropriate to make -Og actually a nice to debug optimization setting and - who knows - even make it default? But that's not libc++'s problem, that's clang's and GCC's problem to solve.

It's easy to say it's somebody else's problem... Don't get me wrong, I would love compilers to become even smarter and figure out that things like std::move and std::vector::iterator member functions should always be inlined, but that's wishful thinking. The "poor debugging experience" problem has been around for decades. I think it's time to solve it, maybe not in the best possible way, but at least in a way that benefits the end user until we get a better and more general solution. "Perfect" is the enemy of "good".

I politely ask you to reconsider my request, and discuss it with other colleagues to see what the general opinion is. It would also be wise to discuss this with people involved in SG14.

I would be happy to send PRs to libc++ to implement these improvements. Let me know what you think.

Cheers!

@zeux
Copy link
Contributor

zeux commented Feb 28, 2022

It feels like there's an extra option here that balances readability with performance impact reasonably well: add a macro, _LIBCXX_MOVE(v), that expands to static_cast<decltype(v)&&>(v), plus or minus removing the reference qualifiers.

This keeps the intent clear for the reader while avoiding the function call cost in debug.

@philnik777
Copy link
Contributor

You know what, I think it might be reasonable for std::move. Since you said you want to make a PR, let's begin with that. Please read https://libcxx.llvm.org/Contributing.html carefully (I know I didn't do that when I started) and also add a benchmark to the PR to show how much of a difference it makes. I don't think it's worth for things like vector::begin, since I can't imagine a scenario where it's called in a hot loop and inlining it makes a big difference.

It feels like there's an extra option here that balances readability with performance impact reasonably well: add a macro, _LIBCXX_MOVE(v), that expands to static_cast<decltype(v)&&>(v), plus or minus removing the reference qualifiers.

This keeps the intent clear for the reader while avoiding the function call cost in debug.

I'd say definitely no. Adding an attribute to std::move is by far the superior approach here.

@manxorist
Copy link

manxorist commented Feb 28, 2022

I don't think it's worth for things like vector::begin, since I can't imagine a scenario where it's called in a hot loop and inlining it makes a big difference.

Yes, it does make a difference. Iterators are often passed to algorithms, and not inlining them cripples performance of such algorithms in debug builds. It's exactly situations like this, that make adopting modern C++ algorithms difficult.

The overall goal that (I think) @vittorioromeo wants to target here, is making it feasible to use algorithms in hot loops and still get usable performance in debug builds

(I'm the one who probably triggered this issue in the first place with this tweet: https://twitter.com/manxorist/status/1491493715382358025 )

@vittorioromeo
Copy link
Author

vittorioromeo commented Feb 28, 2022

It feels like there's an extra option here that balances readability with performance impact reasonably well: add a macro, _LIBCXX_MOVE(v), that expands to static_cast<decltype(v)&&>(v), plus or minus removing the reference qualifiers.

This keeps the intent clear for the reader while avoiding the function call cost in debug.

@philnik777: I don't think this idea should be immediately discarded. We C++ developers tend to abhor macros, but they are the only tool in the language that guarantees predictable performance due to textual expansion and still carry semantic meaning via good naming choice.

One advantage of using a macro or cast is that compilation time also gets improved, see this commit from Boost.Hana as an example.

Related reading material:

@philnik777
Copy link
Contributor

I don't think it's worth for things like vector::begin, since I can't imagine a scenario where it's called in a hot loop and inlining it makes a big difference.

Yes, it does make a difference. Iterators are often passed to algorithms, and not inlining them cripples performance of such algorithms in debug builds. It's exactly situations like this, that make adopting modern C++ algorithms difficult.

The overall goal that (I think) @vittorioromeo wants to target here, is making it feasible to use algorithms in hot loops and still get usable performance in debug builds

(I'm the one who probably triggered this issue in the first place with this tweet: https://twitter.com/manxorist/status/1491493715382358025 )

But where do you call vector::begin in a loop? If you go through a vector, you don't call vector::begin every time. You call it once.

@philnik777
Copy link
Contributor

It feels like there's an extra option here that balances readability with performance impact reasonably well: add a macro, _LIBCXX_MOVE(v), that expands to static_cast<decltype(v)&&>(v), plus or minus removing the reference qualifiers.
This keeps the intent clear for the reader while avoiding the function call cost in debug.

I don't think this idea should be immediately discarded. One advantage of using a macro or cast is that compilation time also gets improved, see this commit from Boost.Hana as an example.

Related reading material:

* https://www.foonathan.net/2020/09/move-forward/

I ran the libc++ test suite with _LIBCPP_HIDE_FROM_ABI with _LIBCPP_ALWAYS_INLINE and I think it didn't make a difference of more than a second overall. But I might remember it wrong.

@vittorioromeo
Copy link
Author

I don't think it's worth for things like vector::begin, since I can't imagine a scenario where it's called in a hot loop and inlining it makes a big difference.

Yes, it does make a difference. Iterators are often passed to algorithms, and not inlining them cripples performance of such algorithms in debug builds. It's exactly situations like this, that make adopting modern C++ algorithms difficult.
The overall goal that (I think) @vittorioromeo wants to target here, is making it feasible to use algorithms in hot loops and still get usable performance in debug builds
(I'm the one who probably triggered this issue in the first place with this tweet: https://twitter.com/manxorist/status/1491493715382358025 )

But where do you call vector::begin in a loop? If you go through a vector, you don't call vector::begin every time. You call it once.

Not exactly the same issue, but consider any algorithm from <algorithm> or <numeric>. Any operation on the iterators (e.g. operator*, operator++, operator==) might result in a extra function call per algorithm iteration if functions like std::vector<T>::iterator::operator* are not inlined, which completely destroy the performance.

@zeux
Copy link
Contributor

zeux commented Feb 28, 2022

One other reason that I suggested a macro is that when libc++ is used by MSVC, there's no option of using always-inline (MSVC doesn't inline __forceinline functions unless inlining is enabled for inline functions as well).

That said, of course the macro is limited to internal implementation (like uses of std::move) and for externally facing API like iterator operations the attribute will need to be used instead anyhow. Given that libc++ primarily targets clang the runtime performance gains for that compiler will hopefully be the same for std::move as well.

@jeremyong-az
Copy link

Commenting to second the macro suggestion from @zeux. I think it's a pragmatic choice that addresses the root problem (at the expense of some sensibilities depending on how macro adverse you are).

From @philnik777

But where do you call vector::begin in a loop? If you go through a vector, you don't call vector::begin every time. You call it once.

The parent post is likely referring to iterating through a vector within a loop. In a debug build, you don't expect amazing performance, but you also don't want performance to fall off a cliff if you can help it.

@philnik777
Copy link
Contributor

Let's do one thing first. I haven't talked to anybody about this yet, so there is no guarantee you get std::move always inlined.

One other reason that I suggested a macro is that when libc++ is used by MSVC, there's no option of using always-inline (MSVC doesn't inline __forceinline functions unless inlining is enabled for inline functions as well).

That said, of course the macro is limited to internal implementation (like uses of std::move) and for externally facing API like iterator operations the attribute will need to be used instead anyhow. Given that libc++ primarily targets clang the runtime performance gains for that compiler will hopefully be the same for std::move as well.

We don't support MSVC. We only support clang-cl, which supports __attribute__((__always_inline__)).

@vittorioromeo
Copy link
Author

vittorioromeo commented Feb 28, 2022

Let's do one thing first. I haven't talked to anybody about this yet, so there is no guarantee you get std::move always inlined.

@philnik777: Would it be worthwhile to bring another maintainer or two into this discussion to see what they think, before I start putting some effort into the creation of a PR?

@philnik777
Copy link
Contributor

Let's do one thing first. I haven't talked to anybody about this yet, so there is no guarantee you get std::move always inlined.

@philnik777: Would it be worthwhile to bring another maintainer or two into this discussion to see what they think, before I start putting some effort into the creation of a PR?

I think the best way to sell it is to provide a benchmark, which I think is most of the work. The change inside the headers should only be one or two lines anyways. I too still want to see benchmark results before I approve such a change. So even getting someone else's opinion wouldn't guarantee that such a change is approved.

@jsonn
Copy link
Contributor

jsonn commented Feb 28, 2022

There is also the downside that __always_inline can seriously harm the debug experience. If some game developers are told to use -g and still expect inlining of some parts to happen, that seems to be a problem of their use. I bet those same people are also complaining about the compiler doing stupid things in other parts of the code that would be solved by constant folding etc. I'm opposed to springing __always_inline for non-trivial functions for performance reasons in explicitly non-optimized builds.

@vittorioromeo
Copy link
Author

There is also the downside that __always_inline can seriously harm the debug experience.

I do not believe that is true for functions like std::move, std::forward, or std::vector<T>::iterator::operator*. Stepping into any of these functions is not useful at all to the debug experience -- in fact, seeing them as part of the call stack is just noise.

If some game developers are told to use -g and still expect inlining of some parts to happen, that seems to be a problem of their use.

No, game developers are not told to use -g. Game developers create a debug build of their product to debug issues. In order to effectively debug logic issues and step through the code, -g plus -O0 or -Og are needed to prevent excessive compiler optimization. However, as seen by the benchmarks I posted in the issue opener, any use of trivial functions like std::move or std::forward in a tight loop results in a frankly unacceptable performance slowdown.

I bet those same people are also complaining [...]

This doesn't seem like a very constructive attitude. I would advise you to consider other people and industries' needs and points of views. We are all trying to make C++ better.

I'm opposed to springing __always_inline for non-trivial functions for performance reasons in explicitly non-optimized builds.

I did not suggest sprinkling __always_inline on non-trivial functions. I suggested the opposite: carefully apply that attribute to functions that should always be inlined and that should not have any impact on the performance of a program. std::move is a prime candidate of that class of functions.

@philnik777
Copy link
Contributor

I think I have to take back what I said yesterday. Playing around with the benchmark a bit shows that there is almost no difference. https://quick-bench.com/q/HLD8AS7d1TZhXTY77bjexGvXBB0 shows almost no difference in performance. Using -Og with clang 13 results in identical assembly (https://godbolt.org/z/oWeb936bv). Note that we don't support clang 12 anymore.

@mcencora
Copy link

mcencora commented Mar 1, 2022

There is a simple way to de-uglify most (all?) of libc++.
We just need one small feature from compiler: preprocessor should not trigger macros expansion inside STL headers, when macro has non-reserved name and was defined outside of STL header (to allow assert, offsetof, NULL, etc).

@vittorioromeo
Copy link
Author

vittorioromeo commented Mar 1, 2022

I think I have to take back what I said yesterday. Playing around with the benchmark a bit shows that there is almost no difference. https://quick-bench.com/q/HLD8AS7d1TZhXTY77bjexGvXBB0 shows almost no difference in performance. Using -Og with clang 13 results in identical assembly (https://godbolt.org/z/oWeb936bv). Note that we don't support clang 12 anymore.

According to Clang's documentation, -Og is a placeholder for -O1. From https://clang.llvm.org/docs/CommandGuide/clang.html:

-Og Like -O1. In future versions, this option might disable different optimizations in order to improve debuggability.

(Related: #53681)

Therefore, it is not an acceptable solution at the moment as it might negatively affect the debugging experience. There are also reasons to compile with -O0, namely maximizing compilation speed, and I don't think it would hurt to get better performance in those situations as well.

@philnik777
Copy link
Contributor

I don't think 1.1x performance improvement at -O0 (in a micro benchmark) is significant enough to warrant adding always inline. If you can produce a (realistic) scenario where it has a significant impact (something like 2x in a micro benchmark maybe) I would be more open to adding always inline.

I get that -O1 isn't always an option, but having a 1.1x speedup in a micro benchmark isn't a very significant improvement. 1.1x speedup in a macro benchmark would be another story.

@vittorioromeo
Copy link
Author

I don't think 1.1x performance improvement at -O0 (in a micro benchmark) is significant enough to warrant adding always inline. If you can produce a (realistic) scenario where it has a significant impact (something like 2x in a micro benchmark maybe) I would be more open to adding always inline.

I get that -O1 isn't always an option, but having a 1.1x speedup in a micro benchmark isn't a very significant improvement. 1.1x speedup in a macro benchmark would be another story.

This benchmark shows a 1.4x performance improvement with -O0 by simply adding [[gnu::always_inline]] to std::move. Minimal effort for a significant improvement, IMHO:
https://quick-bench.com/q/K_ASahxgCA9n02Gju9ol-fKEbsU

This benchmark shows a 5.7x performance improvement with -Og by simply adding [[gnu::always_inline]] to std::move. Again... minimal effort for a massive improvement, IMHO:
https://quick-bench.com/q/zU2A38EmazbRAh_Wmvtv7MDn16M

@eteran
Copy link

eteran commented Mar 1, 2022

@philnik777 Just to chime in, because I think that more people than you may realize, think this is an important issue. The standards committee often will refer to concerns over performance as a "quality of implementation" issue. And they are probably right to do so since often it's not a language level issue, because C++ defers so much to the standard library and expects good competition between compilers.

So, here's the way that I see it:

For things like unique_ptr, move, forward, operator*(), etc... the base assumption in the implementation is that they will get inlined. The performance is basically expected to be notably worse if these kinds of functions aren't inlined because a simple call instruction is known to have non-trivial overhead in hot code. This is why inlining exists.

I also think we can agree that -O0 is a valid use case, because otherwise... you wouldn't offer it as an option to users.

So we have a class of users, who need to debug code and do so with reasonable performance since maybe a bug doesn't happen until several minutes into a game and having to painfully wait for that is just a terrible debugging experience if that wait time is notably inflated.

I think that @vittorioromeo 's solution is pretty reasonable. I think that the reason for always_inline is to indicate to the compiler "this is expected to be inlined, always, even when not optimizing"... which I think perfectly describes the cases that he's referring too.

So why not help the compiler along and tell it that these things are particularly important to always inline? I'm not sure I'm sold on the given downsides. As discussed above, the library is already "expert friendly" and hard to read unless you have a lot of experience with it. I don't think adding one more macro/keyword in front of a function declaration in some key spots changes that.

For reference, I've implemented a hand rolled version of unique_ptr which at -O0 has nearly no overhead. In gcc/clang, sure it has a few redundant instructions for shuffling data in and out of places because it doesn't optimize that out (nor should it at -O0) But it only has the two calls a user should expect. new and delete

I don't think my addition of _ALWAYS_INLINE in a dozen-ish places reduced readability and it has at worst, half of the calls as std::unique_ptr does (assuming that std::default_delete gets inlined... which it might not be).

https://godbolt.org/z/qqY5cbeMe

Just some food for thought. C++ is easily my favorite language to develop in, but I do feel that debug performance is a place where we can make some relatively low cost improvements.

@duk-37
Copy link
Contributor

duk-37 commented Mar 1, 2022

Just my two cents here --

It's important to keep in mind that LLVM does not optimize stack slot lifetimes with -O0, meaning that every inlined variable has its own stack slot.

This causes a big blowup in stack usage if too many functions are inlined, so we may want to be careful about this.

The macro approach would probably sidestep this issue.

@vittorioromeo
Copy link
Author

Just my two cents here --

It's important to keep in mind that LLVM does not optimize stack slot lifetimes with -O0, meaning that every inlined variable has its own stack slot.

This causes a big blowup in stack usage if too many functions are inlined, so we may want to be careful about this.

The macro approach would probably sidestep this issue.

It really bugs me that the compiler is generating code for a function that's conceptually a cast that should emit no instructions. Not only it slows down debug performance, but it makes the compiler perform additional work. The inlining would be nice for debug performance, but that also means the compiler will be doing even more work... for a cast.

Doesn't this bug anyone else? I think that macros or compiler intrinsics would be a great idea for things like std::move or std::forward.

@duk-37
Copy link
Contributor

duk-37 commented Mar 1, 2022

Just my two cents here --
It's important to keep in mind that LLVM does not optimize stack slot lifetimes with -O0, meaning that every inlined variable has its own stack slot.
This causes a big blowup in stack usage if too many functions are inlined, so we may want to be careful about this.
The macro approach would probably sidestep this issue.

It really bugs me that the compiler is generating code for a function that's conceptually a cast that should emit no instructions. Not only it slows down debug performance, but it makes the compiler perform additional work. The inlining would be nice for debug performance, but that also means the compiler will be doing even more work... for a cast.

Doesn't this bug anyone else? I think that macros or compiler intrinsics would be a great idea for things like std::move or std::forward.

Macros or builtins for move/forward could work in libcxx but the issue would still arise in user code. The problem is that the code you see there is always generated for any function with arguments.

Maybe clang could pattern-match the move idiom or function, but that sounds somewhat error-prone. Something like __attribute__((move)) (like the already existing __attribute__((malloc))) could work as well.

At that point something more generic like __attribute__((special("move"))) could be used so that a new attribute doesn't need to be implemented for move, forward, and anywhere else you'd want to do this.

@vittorioromeo
Copy link
Author

@duk-37: the GCC implementers are working towards implementing special cases in the frontend to automagically fold calls to std::move and std::forward -- I think that's also a reasonable solution. I like your idea of special attributes as well.

Here's the GCC patch:
https://gcc.gnu.org/bugzilla/attachment.cgi?id=51732

+++ b/gcc/cp/cp-gimplify.c

+	extern bool is_std_move_p (tree);
+	extern bool is_std_forward_p (tree);
+	if (is_std_move_p (x) || is_std_forward_p (x))
+	  {
+	    tree arg = CALL_EXPR_ARG (x, 0);
+	    x = cp_convert (TREE_TYPE (x), arg, tf_none);
+	    break;
+	  }

And related ticket:
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=96780

@eteran
Copy link

eteran commented Mar 1, 2022

Just as a followup, just adding always inline to my unique_ptr implementation shows up to 2x+ improvement in performance at -O0, even with the register spilliage that @duk-37 mentioned (thanks for that info, that's a very interesting aspect of this discussion).

https://quick-bench.com/q/ph6e1zAcrPxoTczrC2uOCwW9vlw

and potentially multiple 1000's x performance difference at -Og / -O1 ...

https://quick-bench.com/q/v90Aw6jl5Pmo4DP8S_phT40cDZ4

Maybe I'm doing something wrong with the benchmark, that seems like a HUGE difference, but either way, strategic always_inline seems to positively affect performance in general.

@vittorioromeo
Copy link
Author

Maybe I'm doing something wrong with the benchmark, that seems like a HUGE difference, but either way, strategic always_inline seems to positively affect performance in general.

You have to wrap parts of your benchmarks in benchmark::DoNotOptimize to prevent the compiler from completely optimizing them out. Here's a fixed version of your second benchmark: https://quick-bench.com/q/_dGZA8R09piEFULYWUN7RSB-yI8

@duk-37
Copy link
Contributor

duk-37 commented Mar 1, 2022

Just as a followup, just adding always inline to my unique_ptr implementation shows up to 2x+ improvement in performance at -O0, even with the register spilliage that @duk-37 mentioned (thanks for that info, that's a very interesting aspect of this discussion).

https://quick-bench.com/q/ph6e1zAcrPxoTczrC2uOCwW9vlw

and potentially multiple 1000's x performance difference at -Og / -O1 ...

https://quick-bench.com/q/v90Aw6jl5Pmo4DP8S_phT40cDZ4

Maybe I'm doing something wrong with the benchmark, that seems like a HUGE difference, but either way, strategic always_inline seems to positively affect performance in general.

Was about to say exactly the same thing as @vittorioromeo.

I'd also like to make it clear that I'm not trying to say that always_inline is necessarily bad at -O0 (far from it, as your benchmarks show!), just that the stack usage is something that should be kept in mind.

@eteran
Copy link

eteran commented Mar 1, 2022

Was about to say exactly the same thing as @vittorioromeo.

I'd also like to make it clear that I'm not trying to say that always_inline is necessarily bad at -O0 (far from it, as your benchmarks show!), just that the stack usage is something that should be kept in mind.

Yeah, I knew something was off with the benchmark. I wasn't expecting a 19,000x difference, LOL. But yeah, think a 5x difference in debug performance is notable enough that the juice is worth the squeeze. I actually kinda love the idea of an attribute or similar for move and forward as that would also help with user code. For things which are supposed to be zero-cost wrappers around basic operations, I'd love to see some more inlining to help debug builds be less painful to work with.

It's also worth noting that even for cases where the performance is not hugely different. a simpler callstack goes a long way into being able to understand "how did I get here". I know that's a weaker argument than just raw numbers... but this discussion is for me, largely about how we can make debug builds less painful to work with in general.

@manxorist
Copy link

Let's look at a simple audio algorithm (attenuates a 4 channel audio signal), implemented with <algorithm>, <array>, <span>,

#include <algorithm>
#include <array>
#include <span>
#include <cstddef>
template <std::size_t channels>
static inline void process(std::span<std::array<float, channels>> frames) {
    std::transform(std::begin(frames), std::end(frames), std::begin(frames),
        [](std::array<float, channels> frame) {
            std::transform(std::begin(frame), std::end(frame), std::begin(frame),
                [](float sample) {
                    return sample * 0.7f;
                }
            );
            return frame;
        }
    );
}
void process4(std::span<std::array<float, 4>> frames) {
     process(frames);
}

vs

#include <cstddef>
template <std::size_t channels>
static inline void process(float * frames, std::size_t countFrames) {
    std::size_t count = countFrames * channels;
    while (count--) {
        *frames *= 0.7f;
        frames++;
    }
}
void process4(float * frames, std::size_t countFrames) {
    process<4>(frames, countFrames);
}

https://godbolt.org/z/TeqqPzbas

Let's ignore the fact that Clang's codegen for the low-level implementation (bottom) is maybe worse in the optimized case (middle) than it could be (compare with GCC). That's a separate issue.

We can see a few things:

  1. Codegen in the optimized case is good-ish with both implementations.
  2. Codegen in the non-optimized case (right) is significantly worse when we are using algorithms (top) and other modern constructs, compared to the low-level implementation (bottom) where codegen is almost as good as the optimized case.
  3. Debug performance in the non-optimized case with algorithms (top right) is unusable (there is really no need to even benchmark this), and this is 2/3 due to not inlining trivial members and helper functions of span, or array, or iterator: https://godbolt.org/z/v3r61oaaa
  4. Switching to -Og fixes all codegen issues (not surprising with Clang, as it is -O1; not shown).

From a user perspective, I am limited to the following choices:

  • Never use -O0, and only ever use -Og. However, especially with Clang, where -Og equals -O1, there are easily cases where -Og optimizes too much or at least significantly much that it hinders gathering required clues while debugging.
  • Do not use modern C++ features.

If using array or algorithms or other similarly thin simple wrappers always induces the fear of not being able to debug my code ever again with -O0, I am frankly just choosing to ignore all these modern C++ features and implement in a lower level dialect. If the C++ community wants to see the STL being used in the affected fields, things really need to change here with respect to debug performance. Also, I would suggest to reconsider speedups of "only 1.1x" here. If you dismiss 10 of such speedups because they individually are too small, you may in the end have dismissed the difference between unusable and usable. Improvements have to start somewhere.

In an audio application, the algorithm implemented above is probably the simplest of all that ever runs. And it runs often. Each instance runs at least 48kHz/chunkSize(<=128) times per second, with easily thousands of instances of such (and more complex) algorithms. This is not a purely contrived toy micro benchmark. Code very similar to that exists in the real world.

The question one should ask here really comes down to: Is there really any justification whatsoever at all why with -O0 move/forward should ever perform worse than a raw static_cast, or why any members of std::array should perform worse than a C-style array, or why any members of unique_ptr or iterator (for span/array/vector, as well as related begin, end, data, and size members) should perform worse than a raw pointer? Other functions to consider here are also container access wrapper free functions like std::begin, std::end, std::data, std::size. The one thing that all functions mentioned here have in common, is, that they do not actually do ANYTHING. They are just abstractions around already existing trivial functionality. These may be zero-cost abstractions in optimized builds, but they are very much not in debug builds, up to the point of costing too much to be even usable for a lot of people. Also, stepping through any of the functions mentioned here does not provide any benefit even in debug builds, as they are too trivial to reveal any useful information while debugging, meaning there is not even anything lost with respect to debuggability from inlining them.

I think tagging such functions in the STL with either [[gnu::always_inline]] or some new attribute [[debug_inline]] or [[simple]], would be a good decision. Using an attribute or macro distinct from [[gnu::always_inline]] could have the advantage of giving users the ability (via a compiler switch or even a macro) to opt-in or opt-out of inlining such functions at -O0.

Also note that just re-using [[gnu::always_inline]] could also result in performance pessimization when applied to algorithms. For algorithms operating on complex objects, always inlining them could easily result in code bloat. For optimized builds, compiler heuristics are probably a better choice here than forcing it unconditionally with an attribute.

Just rambling, but, one could maybe even consider something like [[frontend_inline]], which is I think what could in fact be the most useful implementation here from a debug experience point-of-view, though of course this would (I guess) require a significant implementation effort on the compiler side. There are certainly lower hanging fruits to take first.

@philnik777
Copy link
Contributor

First of all, please keep the discussion somewhat on topic. This thread is primarily about adding [[gnu::always_inline]] or something similar (which would require compiler support) to std::move and not any other functions. @vittorioromeo mentioned that there is a patch for GCC to inline move and forward. I think that is a better approach, as the compiler could make these things behave exactly like a cast, while marking move as always inline leaves behind useless code (https://godbolt.org/z/or4bxG4v9).

Addressing @vittorioromeo's benchmarks:
(First benchmark) Re-running the benchmark shows that algorithm and algorithm_inline are on par (https://quick-bench.com/q/K_ASahxgCA9n02Gju9ol-fKEbsU). That leads me to the conclusion that
(1) something is wrong with your benchmark, or
(2) the difference isn't actually statistically significant enough to show in a benchmark.
There might be other options, but I think this is significant evidence that adding [[gnu::always_inline]] to std::move doesn't actually make much of a difference. Running the benchmark locally shows about the same results.

I don't think the second benchmark is relevant here, because it relies on the fact that clang 12 didn't inline any code in -O1. That has changed in clang 13, which is the oldest currently supported compiler.

@manxorist
Copy link

Addressing @vittorioromeo's benchmarks: (First benchmark) Re-running the benchmark shows that algorithm and algorithm_inline are on par (https://quick-bench.com/q/K_ASahxgCA9n02Gju9ol-fKEbsU).

Corrected benchmark (sum_vector_algorithm and sum_vector_algorithm_inline were both using copied_accumulate instead of copied_accumulate_inline): https://quick-bench.com/q/lmyjMzr1O-9MP94NWjY00s9mOSY

@philnik777
Copy link
Contributor

Thanks @manxorist!
With the updated benchmark I see a 1.1x speedup locally. I think my other points still stand.
Just to have a list (in no particular order):
(1) It would be better to let the compiler know that std::move is a cast and let it generate the same code as if it were one
(2) [[gnu::always_inline]] would be the wrong tool for the job. Make one that would tell the compiler to consider inlining and optimizing a function.
(3) -Og should be worked on to have a better debugging experience, which should make it the go-to for debugging with some performance. I haven't tried GCC's -Og, but I heard it's pretty good.
(4) (Probably the most controversial) 1.1x in a micro benchmark isn't significant enough to warrant adding [[gnu::always_inline]] to std::move.

I think (1) would be quite easy to implement and give a better result than adding [[gnu::always_inline]] and (3) would be the most beneficial, but is probably a lot of work.

@eteran
Copy link

eteran commented Mar 2, 2022

@philnik777 since we want to keep issues to be as focused as possible, is there any interest in a similar but different issue regarding more debug inlining of trivial operations and thin wrapper classes?

I think these issues are related... But clearly separate.

So as long as you don't think the request is dead on arrival (which would be regrettable), I'm happy to open an issue and possibly submit a PR for it.

As shown above, strategic usage of always inline on unique_ptr improves debug performance by 2-5x and results in a simpler and easier to grok call stack.

@philnik777
Copy link
Contributor

@eteran I think a discussion of it on discord is the way to go. This is clearly controversial and Github isn't really the right place for that. If you give me your discord handle I can open a new thread for that discussion.

There is a thread for this topic, which we should probably switch to. @vittorioromeo Is that OK for you?

@ldionne
Copy link
Member

ldionne commented Mar 2, 2022

Please move to https://discord.com/channels/636084430946959380/948605372728487956.

IMO, we should look into why the compiler doesn't inline these simple functions at -Og instead of trying to tweak libc++ -- that would benefit all users and would automatically select other "trivial" functions like .begin() without us having to lift a finger. Doing this sort of stuff manually is error prone (in the sense that we'll forget some or get it wrong sometimes).

Locking for now.

@llvm llvm locked and limited conversation to collaborators Mar 2, 2022
@mordante
Copy link
Member

It seems clang implemented a solution for the original issue. That solution works at -O0. See https://reviews.llvm.org/D123345 for more details.

Sign up for free to subscribe to this conversation on GitHub. Already have an account? Sign in.
Labels
libc++ libc++ C++ Standard Library. Not GNU libstdc++. Not libc++abi. performance
Projects
None yet
Development

No branches or pull requests