Use of std::move in libc++ leads to worsened debug performance #53689
Comments
The behavior of |
Thanks for the info, it is good to hear that things have improved in 13.
Well, let's put it this way. Why should a If we want to improve the situation for However, if we want to improve the situation for everyone, to me it sounds like a good idea to also mark functions such as |
The problem with your approach is that the question then becomes 'Why not also mark |
Is that actually a bad idea? I cannot think of a reason why you wouldn't want these functions inlined for all standard containers, unless it significantly decreases compilation speed at I would want every function that is either (1) basically a cast (e.g.
I am genuinely sympathetic to the suffering caused by having to work with an "ugly" codebase. But as you said, it is already completely unreadable. While it is technically true that I believe that minuscule readability loss in an already unreadable situation is worth avoiding a 3x performance loss for the end user (in a micro-benchmark). Additionally, the positive impact such a change would have on the reputation of C++ in the gamedev community cannot be overstated. |
I don't think it's a good idea to just put
Using another attribute and some reformatting we get:
It just doesn't get better. Note that we can't use
Actually, it is
Yes, I know that. But when I see
So I have to look at a few things instead of just seeing
I don't see anybody who actually needs some performance in a debug build using |
By doing that, the debugging experience suffers greatly as optimizations can make it much harder for humans to step through the code in the debugger.
You are technically correct, but going from "75% unreadable" to "76% unreadable" for concrete benefits to the end-user experience and debugging performance seems like a worthwhile sacrifice to me. I hope you get the point I'm trying to make: adding an extra attribute doesn't really change the overall readability situation.
That's fair enough, I would prefer to see
Here are a few examples:
I can find many many more. The gamedev community is very vocal about this issue, and the message that they're sending out is: "avoid Modern C++". Truthfully speaking, I cannot blame them, and -- sadly -- they are getting a lot of popularity and consensus. We need to do better and start supporting their use cases. If an extra attribute on a few function signatures can significantly improve debugging performance, I think it's a huge win for the C++ community as a whole.
It's easy to say it's somebody else's problem... Don't get me wrong, I would love compilers to become even smarter and figure out that things like I politely ask you to reconsider my request, and discuss it with other colleagues to see what the general opinion is. It would also be wise to discuss this with people involved in SG14. I would be happy to send PRs to Cheers! |
It feels like there's an extra option here that balances readability with performance impact reasonably well: add a macro, _LIBCXX_MOVE(v), that expands to This keeps the intent clear for the reader while avoiding the function call cost in debug. |
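For readers following along, here is a minimal sketch of what such a macro could look like. The macro name `SKETCH_MOVE` and its expansion are assumptions made for illustration; the expansion proposed in the comment above is not shown in this thread, and this is not libc++ code. The idea is that a `static_cast` to an rvalue reference does the same job as `std::move` without any function call for a debug build to emit.

```cpp
#include <cassert>
#include <string>
#include <type_traits>
#include <utility>

// Hypothetical macro (name and expansion are assumptions, not libc++ code):
// casts its argument to an rvalue reference without calling any function.
#define SKETCH_MOVE(v) \
    static_cast<std::remove_reference_t<decltype(v)>&&>(v)

// The macro yields the same type as std::move for an lvalue std::string.
static_assert(
    std::is_same<decltype(SKETCH_MOVE(std::declval<std::string&>())),
                 decltype(std::move(std::declval<std::string&>()))>::value,
    "SKETCH_MOVE and std::move should produce the same type");

// Moves `src` into the returned string using the macro instead of std::move.
inline std::string take(std::string& src) {
    return SKETCH_MOVE(src);
}

// Small usage check: the moved-to string holds the original contents.
inline void check_sketch_move() {
    std::string s(64, 'x');
    std::string t = take(s);
    assert(t == std::string(64, 'x'));
}
```

Like any function-like macro, this one evaluates nothing twice at run time (the `decltype` operand is unevaluated), but it does lose the name lookup and overload semantics of a real function, which is part of the trade-off discussed here.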
You know what, I think it might be reasonable for
I'd say definitely no. Adding an attribute to |
Yes, it does make a difference. Iterators are often passed to algorithms, and not inlining them cripples performance of such algorithms in debug builds. It's exactly situations like this, that make adopting modern C++ algorithms difficult. The overall goal that (I think) @vittorioromeo wants to target here, is making it feasible to use algorithms in hot loops and still get usable performance in debug builds (I'm the one who probably triggered this issue in the first place with this tweet: https://twitter.com/manxorist/status/1491493715382358025 ) |
@philnik777: I don't think this idea should be immediately discarded. We C++ developers tend to abhor macros, but they are the only tool in the language that guarantees predictable performance due to textual expansion and still carry semantic meaning via good naming choice. One advantage of using a macro or cast is that compilation time also gets improved, see this commit from Boost.Hana as an example. Related reading material: |
But where do you call |
I ran the libc++ test suite with |
Not exactly the same issue, but consider any algorithm from |
One other reason that I suggested a macro is that when libc++ is used by MSVC, there's no option of using always-inline (MSVC doesn't inline That said, of course the macro is limited to internal implementation (like uses of |
Commenting to second the macro suggestion from @zeux. I think it's a pragmatic choice that addresses the root problem (at the expense of some sensibilities depending on how macro-averse you are). From @philnik777
The parent post is likely referring to iterating through a vector within a loop. In a debug build, you don't expect amazing performance, but you also don't want performance to fall off a cliff if you can help it. |
Let's do one thing first. I haven't talked to anybody about this yet, so there is no guarantee you get
We don't support MSVC. We only support clang-cl, which supports |
@philnik777: Would it be worthwhile to bring another maintainer or two into this discussion to see what they think, before I start putting some effort into the creation of a PR? |
I think the best way to sell it is to provide a benchmark, which I think is most of the work. The change inside the headers should only be one or two lines anyways. I too still want to see benchmark results before I approve such a change. So even getting someone else's opinion wouldn't guarantee that such a change is approved. |
There is also the downside that |
I do not believe that is true for functions like
No, game developers are not told to use
This doesn't seem like a very constructive attitude. I would advise you to consider other people and industries' needs and points of views. We are all trying to make C++ better.
I did not suggest sprinkling |
I think I have to take back what I said yesterday. Playing around with the benchmark a bit shows that there is almost no difference. https://quick-bench.com/q/HLD8AS7d1TZhXTY77bjexGvXBB0 shows almost no difference in performance. Using |
There is a simple way to de-uglify most (all?) of libc++. |
According to Clang's documentation,
(Related: #53681) Therefore, it is not an acceptable solution at the moment as it might negatively affect the debugging experience. There are also reasons to compile with |
I don't think 1.1x performance improvement at I get that |
This benchmark shows a 1.4x performance improvement with This benchmark shows a 5.7x performance improvement with |
@philnik777 Just to chime in, because I think that more people than you may realize, think this is an important issue. The standards committee often will refer to concerns over performance as a "quality of implementation" issue. And they are probably right to do so since often it's not a language level issue, because C++ defers so much to the standard library and expects good competition between compilers. So, here's the way that I see it: For things like I also think we can agree that So we have a class of users, who need to debug code and do so with reasonable performance since maybe a bug doesn't happen until several minutes into a game and having to painfully wait for that is just a terrible debugging experience if that wait time is notably inflated. I think that @vittorioromeo 's solution is pretty reasonable. I think that the reason for So why not help the compiler along and tell it that these things are particularly important to always inline? I'm not sure I'm sold on the given downsides. As discussed above, the library is already "expert friendly" and hard to read unless you have a lot of experience with it. I don't think adding one more macro/keyword in front of a function declaration in some key spots changes that. For reference, I've implemented a hand rolled version of unique_ptr which at I don't think my addition of https://godbolt.org/z/qqY5cbeMe Just some food for thought. C++ is easily my favorite language to develop in, but I do feel that debug performance is a place where we can make some relatively low cost improvements. |
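To make the "always inline on thin wrappers" idea concrete, here is a rough sketch of a minimal owning pointer whose accessors carry the attribute. The class `owned_ptr` and the macro `SKETCH_ALWAYS_INLINE` are hypothetical names chosen for this example; this is neither the hand-rolled `unique_ptr` from the comment above nor libc++'s implementation. It assumes the GCC/Clang `__attribute__((always_inline))` spelling and falls back to plain `inline` elsewhere.

```cpp
#include <cassert>
#include <utility>

// Assumption: GCC/Clang attribute spelling; other compilers get plain inline.
#if defined(__GNUC__) || defined(__clang__)
#  define SKETCH_ALWAYS_INLINE __attribute__((always_inline)) inline
#else
#  define SKETCH_ALWAYS_INLINE inline
#endif

// Hypothetical minimal owning pointer, for illustration only.
template <class T>
class owned_ptr {
public:
    explicit owned_ptr(T* p = nullptr) noexcept : ptr_(p) {}
    owned_ptr(const owned_ptr&) = delete;
    owned_ptr& operator=(const owned_ptr&) = delete;
    owned_ptr(owned_ptr&& other) noexcept : ptr_(other.ptr_) { other.ptr_ = nullptr; }
    owned_ptr& operator=(owned_ptr&& other) noexcept {
        if (this != &other) {
            delete ptr_;
            ptr_ = other.ptr_;
            other.ptr_ = nullptr;
        }
        return *this;
    }
    ~owned_ptr() { delete ptr_; }

    // Thin accessors: the attribute asks the compiler to inline them
    // even when optimizations are disabled.
    SKETCH_ALWAYS_INLINE T& operator*() const noexcept { return *ptr_; }
    SKETCH_ALWAYS_INLINE T* operator->() const noexcept { return ptr_; }
    SKETCH_ALWAYS_INLINE T* get() const noexcept { return ptr_; }

private:
    T* ptr_;
};

// Small usage check: dereference, then transfer ownership.
inline void check_owned_ptr() {
    owned_ptr<int> p(new int(41));
    *p += 1;
    assert(*p == 42);
    owned_ptr<int> q(std::move(p));
    assert(p.get() == nullptr);
    assert(*q == 42);
}
```

Only the one-line accessors are marked; the constructors and destructor are left to the compiler, which keeps the number of forced-inline sites small.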
Just my two cents here -- It's important to keep in mind that LLVM does not optimize stack slot lifetimes with -O0, meaning that every inlined variable has its own stack slot. This causes a big blowup in stack usage if too many functions are inlined, so we may want to be careful about this. The macro approach would probably sidestep this issue. |
It really bugs me that the compiler is generating code for a function that's conceptually a cast that should emit no instructions. Not only does it slow down debug performance, but it also makes the compiler perform additional work. The inlining would be nice for debug performance, but that also means the compiler will be doing even more work... for a cast. Doesn't this bug anyone else? I think that macros or compiler intrinsics would be a great idea for things like |
Macros or builtins for Maybe clang could pattern-match the move idiom or function, but that sounds somewhat error-prone. Something like At that point something more generic like |
@duk-37: the GCC implementers are working towards implementing special cases in the frontend to automagically fold calls to Here's the GCC patch:
+++ b/gcc/cp/cp-gimplify.c
+ extern bool is_std_move_p (tree);
+ extern bool is_std_forward_p (tree);
+ if (is_std_move_p (x) || is_std_forward_p (x))
+ {
+   tree arg = CALL_EXPR_ARG (x, 0);
+   x = cp_convert (TREE_TYPE (x), arg, tf_none);
+   break;
+ }
And related ticket: |
Just as a followup, just adding always inline to my https://quick-bench.com/q/ph6e1zAcrPxoTczrC2uOCwW9vlw and potentially multiple 1000's x performance difference at https://quick-bench.com/q/v90Aw6jl5Pmo4DP8S_phT40cDZ4 Maybe I'm doing something wrong with the benchmark, that seems like a HUGE difference, but either way, strategic |
You have to wrap parts of your benchmarks in |
Was about to say exactly the same thing as @vittorioromeo. I'd also like to make it clear that I'm not trying to say that |
Yeah, I knew something was off with the benchmark. I wasn't expecting a 19,000x difference, LOL. But yeah, I think a 5x difference in debug performance is notable enough that the juice is worth the squeeze. I actually kinda love the idea of an attribute or similar for It's also worth noting that even for cases where the performance is not hugely different, a simpler call stack goes a long way toward being able to understand "how did I get here". I know that's a weaker argument than just raw numbers... but this discussion is for me, largely about how we can make debug builds less painful to work with in general. |
Let's look at a simple audio algorithm (attenuates a 4 channel audio signal), implemented with
vs
https://godbolt.org/z/TeqqPzbas Let's ignore the fact that Clang's codegen for the low-level implementation (bottom) is maybe worse in the optimized case (middle) than it could be (compare with GCC). That's a separate issue. We can see a few things:
From a user perspective, I am limited to the following choices:
If using array or algorithms or other similarly thin simple wrappers always induces the fear of not being able to debug my code ever again with In an audio application, the algorithm implemented above is probably the simplest of all that ever runs. And it runs often. Each instance runs at least 48kHz/chunkSize(<=128) times per second, with easily thousands of instances of such (and more complex) algorithms. This is not a purely contrived toy micro benchmark. Code very similar to that exists in the real world. The question one should ask here really comes down to: Is there really any justification whatsoever at all why with I think tagging such functions in the STL with either Also note that just re-using Just rambling, but, one could maybe even consider something like |
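The two implementations compared in this comment are not reproduced in this thread, so the following is only an assumed reconstruction of the kind of code being discussed: attenuating a 4-channel signal once with `std::array` plus `std::transform`, and once with a raw pointer loop. The names `Frame`, `attenuate_algorithm`, and `attenuate_raw` are made up for the example.

```cpp
#include <algorithm>
#include <array>
#include <cassert>
#include <cstddef>

// Hypothetical frame type: one sample per channel, four channels.
using Frame = std::array<float, 4>;

// Version built from std::array iterators and std::transform.
inline void attenuate_algorithm(Frame* frames, std::size_t count, float gain) {
    for (std::size_t i = 0; i < count; ++i) {
        std::transform(frames[i].begin(), frames[i].end(), frames[i].begin(),
                       [gain](float s) { return s * gain; });
    }
}

// Version built from a raw pointer and an explicit index loop
// over count * 4 interleaved samples.
inline void attenuate_raw(float* samples, std::size_t count, float gain) {
    for (std::size_t i = 0; i < count * 4; ++i) {
        samples[i] *= gain;
    }
}

// Small usage check: both versions compute the same result.
inline void check_attenuate() {
    Frame frames[2] = {{{1.0f, 2.0f, 3.0f, 4.0f}}, {{2.0f, 4.0f, 6.0f, 8.0f}}};
    float raw[8] = {1.0f, 2.0f, 3.0f, 4.0f, 2.0f, 4.0f, 6.0f, 8.0f};
    attenuate_algorithm(frames, 2, 0.5f);
    attenuate_raw(raw, 2, 0.5f);
    for (std::size_t i = 0; i < 8; ++i) {
        assert(frames[i / 4][i % 4] == raw[i]);
    }
}
```

In an unoptimized build the first version goes through `begin()`, `end()`, the `std::transform` body, and the lambda call operator as separate function calls, which is the overhead this comment is concerned with; the second version has none of those calls.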
First of all, please keep the discussion somewhat on topic. This thread is primarily about adding Addressing @vittorioromeo's benchmarks: I don't think the second benchmark is relevant here, because it relies on the fact that clang 12 didn't inline any code in |
Corrected benchmark (sum_vector_algorithm and sum_vector_algorithm_inline were both using copied_accumulate instead of copied_accumulate_inline): https://quick-bench.com/q/lmyjMzr1O-9MP94NWjY00s9mOSY |
Thanks @manxorist! I think (1) would be quite easy to implement and give a better result than adding |
@philnik777 since we want to keep issues to be as focused as possible, is there any interest in a similar but different issue regarding more debug inlining of trivial operations and thin wrapper classes? I think these issues are related... But clearly separate. So as long as you don't think the request is dead on arrival (which would be regrettable), I'm happy to open an issue and possibly submit a PR for it. As shown above, strategic usage of always inline on unique_ptr improves debug performance by 2-5x and results in a simpler and easier to grok call stack. |
@eteran I think a discussion of it on discord is the way to go. This is clearly controversial and Github isn't really the right place for that. If you give me your discord handle I can open a new thread for that discussion. There is a thread for this topic, which we should probably switch to. @vittorioromeo Is that OK for you? |
Please move to https://discord.com/channels/636084430946959380/948605372728487956. IMO, we should look into why the compiler doesn't inline these simple functions at Locking for now. |
It seems clang implemented a solution for the original issue. That solution works at |
std::accumulate is defined as follows in libc++:
When compiling a program using std::accumulate in debug mode, under -Og, there is a noticeable performance impact due to the presence of std::move.
std::move: https://quick-bench.com/q/h_M_AUs3pgBE3bYr82rsA1_VtjU
std::move: https://quick-bench.com/q/ysis2b1CgIZkRsO2cqfjZm9Jkio
This performance degradation is one example of why many people (especially in the gamedev community) are not adopting standard library algorithms and modern C++ more widely.
Would it be possible to replace std::move calls internal to libc++ with a cast, or some sort of compiler intrinsic? Or maybe mark std::move as "always inline" even without optimizations enabled?
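For context, the loop at issue can be sketched roughly as follows. This is not libc++'s verbatim source: the function names `accumulate_with_move` and `accumulate_with_cast` are hypothetical, and the second variant is only one possible form of the "replace with a cast" request, assuming `T` is a non-reference type.

```cpp
#include <cassert>
#include <string>
#include <utility>

// Sketch of an accumulate loop in the C++20 style, where the running
// value is moved on every iteration. Not libc++'s actual code.
template <class InputIt, class T>
T accumulate_with_move(InputIt first, InputIt last, T init) {
    for (; first != last; ++first) {
        init = std::move(init) + *first;  // one std::move call per element
    }
    return init;
}

// Same loop with the std::move call replaced by an equivalent cast
// (assumes T is not a reference type).
template <class InputIt, class T>
T accumulate_with_cast(InputIt first, InputIt last, T init) {
    for (; first != last; ++first) {
        init = static_cast<T&&>(init) + *first;  // no function call involved
    }
    return init;
}

// Small usage check: both variants produce the same result.
inline void check_accumulate() {
    const std::string words[] = {"a", "b", "c"};
    assert(accumulate_with_move(words, words + 3, std::string()) == "abc");
    assert(accumulate_with_cast(words, words + 3, std::string()) == "abc");
}
```

The two variants are semantically the same; the only difference is whether an unoptimized build has a `std::move` call to emit inside the loop.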