Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Slowdown from the final keyword #90685

Open
define-private-public opened this issue Apr 30, 2024 · 10 comments
Open

Slowdown from the final keyword #90685

define-private-public opened this issue Apr 30, 2024 · 10 comments
Labels
clang Clang issues not falling into any other category

Comments

@define-private-public
Copy link

Hi. I published this blog post last week Monday about the final keyword. I made a note of a slowdown that was found when using clang/LLVM to compile and run the code.

On Hacker News discussion an LLVM developer commented:

As an LLVM developer, I really wish the author filed a bug report and waited for some analysis BEFORE publishing an article (that may never get amended) that recommends not using this keyword with clang for performance reasons. I suspect there's just a bug in clang.

  1. Is this an actual bug?
  2. Would someone be willing to provide analysis as to what's going on?

I plan on writing a follow up post shortly since the discussion surrounding everything seemed to be a bit more than I anticipated; and I feel a lot of people got the intent of the article wrong.

@EugeneZelenko EugeneZelenko added clang Clang issues not falling into any other category and removed new issue labels Apr 30, 2024
@shafik
Copy link
Collaborator

shafik commented May 1, 2024

It is not clear to me if this is a bug on quick read through the blog. If you could reduce this and put in a godbolt example that would really help. Then we could more easily look at the AST and IR and see if anything obvious pops up.

This is not my area, CC @erichkeane

@erichkeane
Copy link
Collaborator

Without a good minimal example (and there isn't any actionable code on that blog post), I can't really see any reason why this would be the case. It isn't particularly reproducible unfortunately.

As far as I know/can tell, the only difference from the CFE's perspective is that we do a de-virtualization in the case of a 'final' class: https://godbolt.org/z/8q9qM91o5

See how the use of UF in baz does a direct call to UF::f, rather than via a vtable load. SO, clang ended up devirtualizing, which, on its face, should be gain.

Even under opt(https://godbolt.org/z/xqefEE1cj) that is effectively the difference you see. UF has its function called directly.

@nickdesaulniers is the one who commented on the original HN article, but he's away apparently, so unknown if he can comment. IF we can get some sort of reproducer, it would be nice to see some LLVM folks around to analyze.

MY only (CFE only-knowledge though) guesses are:
1- There is something wrong with the testing methodology. I have no way of knowing one way or the other here, but it is actually quite shocking to me that a: Devirtualization showed so little gain across two compilers, and b; that it showed much negative effects.

2- This hit some sort of pathological case, where the inlining of some now-not-virtual functions resulted in LLVM optimizing it in a way that is non-optimal.

So without some better reproducer, I don't really have an idea, nor even if we did, do I have the expertise here to help.

@cor3ntin
Copy link
Contributor

cor3ntin commented May 1, 2024

According to https://news.ycombinator.com/item?id=40156196 and https://gitlab.com/define-private-public/PSRayTracing/-/issues/85, it's related to uniform_real_distribution / logl not being inlined

@cor3ntin
Copy link
Contributor

cor3ntin commented May 1, 2024

It might be #19916

@cor3ntin
Copy link
Contributor

cor3ntin commented May 1, 2024

#19916 (comment)

@define-private-public
Copy link
Author

I'm a little confused, is the actually that the use of final is actually causing an issue with the uniform_real_distribution? I was thinking that they might be separate issues and Mr. Zhechev might have uncovered a separate issue aside from the A B testing of final.

@pkasting
Copy link
Member

pkasting commented May 16, 2024

Is it possible that final -> devirtualization -> unconditionally inlining more -> increased icache pressure -> performance drops?

We've seen a variety of performance issues in Chromium due to overly-aggressive inlining (leading to me asking our compiler team to look at tuning inlining heuristics if they can), but I don't know how plausible that is here.

@antoniofrighetto
Copy link
Contributor

Is it possible that final -> devirtualization -> unconditionally inlining more -> increased icache pressure -> performance drops?

I think it can be the case once they are turned into direct calls, indeed final seems to aid devirtualization as per:

// If that method is marked final, we can devirtualize it.
if (DevirtualizedMethod->hasAttr<FinalAttr>())
return DevirtualizedMethod;

Out of curiosity, do you have some IR at hand (from Chromium) of overly-aggressive inlining that showed up to be detrimental for performance?

@pkasting
Copy link
Member

I don't have any offhand, but Chromium installs a custom clang plugin to force authors to move class constructors/destructors out-of-line for any non-tiny class, because of this problem; this is irritating in C++20 because then people can't use aggregates nearly as often. We are also looking at some binary size changes due to different inlining as part of https://issues.chromium.org/issues/40255003; see comments 15 and 16 for some LLVM IR that changed due to using __builtin_trap() more and modifying inlining as a result. I don't know what the perf impact of the changes on that bug is, though (that one is being pursued to understand the binary size changes more than because we saw a perf hit).

@nickdesaulniers
Copy link
Member

nickdesaulniers commented May 22, 2024

According to https://news.ycombinator.com/item?id=40156196 and https://gitlab.com/define-private-public/PSRayTracing/-/issues/85, it's related to uniform_real_distribution / logl not being inlined

Yeah, looks like we're not folding calls to logl with constants! EDIT: or any long double libcalls!

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
clang Clang issues not falling into any other category
Projects
None yet
Development

No branches or pull requests

8 participants