cmd/compile: PGO opportunities umbrella issue #62463
For the item "Select between branches and conditional MOVs", hardware features such as Arm's Branch Record Buffer Extension (BRBE) may help.
Another possibility is to automatically apply Function Multi-Versioning (FMV) to hot functions with large loops, which could greatly improve the performance of math-heavy workloads. Intel's Clear Linux uses this approach to take advantage of modern CPU instructions automatically while remaining backward-compatible with x86-64-v1. As I understand it, Go doesn't have a particularly mature auto-vectorizer, but PGO-guided FMV could open up new opportunities in this area.
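As a rough illustration of what FMV produces, here is a hand-written Go sketch: two versions of the same hot function plus a dispatcher that is chosen once at startup. Under PGO-guided FMV the compiler would, in principle, generate such clones only for hot functions. The feature flag `hasWideVectors` and all function names are hypothetical, and the "wide" clone merely unrolls the loop to stand in for real vector instructions.

```go
package main

import "fmt"

// hasWideVectors is a hypothetical stand-in for a real CPU feature check
// (for example, one of the flags in golang.org/x/sys/cpu). It is fixed to
// false so the sketch behaves the same on every machine.
var hasWideVectors = false

// sumBaseline is the version that runs on any CPU (think x86-64-v1).
func sumBaseline(xs []int64) int64 {
	var s int64
	for _, x := range xs {
		s += x
	}
	return s
}

// sumWide stands in for a clone built for a newer ISA level. Real FMV would
// emit vector instructions; this sketch only unrolls by four.
func sumWide(xs []int64) int64 {
	var s0, s1, s2, s3 int64
	i := 0
	for ; i+4 <= len(xs); i += 4 {
		s0 += xs[i]
		s1 += xs[i+1]
		s2 += xs[i+2]
		s3 += xs[i+3]
	}
	for ; i < len(xs); i++ {
		s0 += xs[i]
	}
	return s0 + s1 + s2 + s3
}

// sum is resolved once at startup, so callers do not repeat the feature
// check on every invocation.
var sum func([]int64) int64

func init() {
	if hasWideVectors {
		sum = sumWide
	} else {
		sum = sumBaseline
	}
}

func main() {
	fmt.Println(sum([]int64{1, 2, 3, 4, 5})) // 15
}
```

Both clones must compute the same result; only their instruction selection differs, which is what makes the dispatch safe.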
@erifan LBR support is on our roadmap, and we've been thinking about it. I'm not sure it belongs in this issue, though (perhaps it could).
@erifan Does BRBE record conditional moves? That would certainly be nice. x86's LBR does not, which would make this pretty tricky on x86.
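To make the branch-versus-select decision concrete, here is a minimal sketch of a profile-driven heuristic, assuming the branch records can be summarized into per-branch execution and misprediction counts. The `branchStats` type, its fields, and the 10% threshold are all hypothetical illustrations, not a real profile format or a tuned value.

```go
package main

import "fmt"

// branchStats is a hypothetical per-branch summary that a tool might build
// from hardware branch records (LBR on x86, BRBE on Arm), assuming those
// records carry a misprediction flag. Field names are illustrative.
type branchStats struct {
	Executed    uint64 // sampled executions of the branch
	Mispredicts uint64 // sampled executions flagged as mispredicted
}

// mispredictThreshold is an arbitrary illustrative cutoff, not a tuned value.
const mispredictThreshold = 0.10

// preferCondMove reports whether a two-way branch looks unpredictable enough
// that a conditional move is likely to beat a branch.
func preferCondMove(s branchStats) bool {
	if s.Executed == 0 {
		return false // no evidence: do not force a conditional move
	}
	rate := float64(s.Mispredicts) / float64(s.Executed)
	return rate >= mispredictThreshold
}

func main() {
	fmt.Println(preferCondMove(branchStats{Executed: 1000, Mispredicts: 300})) // true
	fmt.Println(preferCondMove(branchStats{Executed: 1000, Mispredicts: 5}))   // false
}
```

A well-predicted branch is usually cheaper than a conditional move because the CPU can speculate past it; the select only wins when mispredictions are frequent, which is exactly what this kind of profile data would reveal.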
These are things we're considering doing, but we're not staking a claim or anything. :) Currently, we're definitely working on "Indirect call devirtualization" (1.21 had a version of this, but with significant limitations we're hoping to lift in 1.22). We're also thinking seriously about "Dynamic escapes on cold paths", but I don't think we've started implementing it. We haven't made inroads on the others ourselves. I believe Uber has done some work on "Function ordering", but I haven't heard any updates on that in quite a while.
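For readers unfamiliar with "Indirect call devirtualization", the following hand-written Go sketch shows roughly the shape of the rewrite for an interface method call: when the profile shows one concrete receiver type dominating a call site, the compiler can guard a direct call with a type check and keep the original indirect call as a fallback. The `Shape`, `Square`, and `Rect` types are hypothetical, and this is an equivalent written by hand, not the compiler's actual output.

```go
package main

import "fmt"

// Shape, Square, and Rect are hypothetical types for illustration.
type Shape interface{ Area() float64 }

type Square struct{ Side float64 }

func (s *Square) Area() float64 { return s.Side * s.Side }

type Rect struct{ W, H float64 }

func (r *Rect) Area() float64 { return r.W * r.H }

// totalArea makes an indirect (interface) call on every iteration.
func totalArea(shapes []Shape) float64 {
	t := 0.0
	for _, s := range shapes {
		t += s.Area()
	}
	return t
}

// totalAreaDevirt is a hand-written equivalent of the rewrite, assuming the
// profile shows *Square is the dominant receiver at this call site.
func totalAreaDevirt(shapes []Shape) float64 {
	t := 0.0
	for _, s := range shapes {
		if sq, ok := s.(*Square); ok {
			t += sq.Area() // direct call, which the compiler can inline
		} else {
			t += s.Area() // fallback: the original indirect call
		}
	}
	return t
}

func main() {
	shapes := []Shape{&Square{Side: 2}, &Rect{W: 2, H: 3}, &Square{Side: 1}}
	fmt.Println(totalArea(shapes), totalAreaDevirt(shapes)) // 11 11
}
```

The main payoff is usually not the avoided indirect jump itself but the fact that the direct call becomes a candidate for inlining.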
@myaaaaaaaaa, thanks. That's what I meant by "Architecture feature check unswitching", but I've added the term "function multi-versioning" to that item in my list. Function multi-versioning is one specific way to do this, and it has a nice advantage: if there is a call from A -> B and both A and B are multi-versioned, A can call the right version of B directly.
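The A -> B point can be sketched by hand in Go. Here `mulBase`/`mulFast` play the role of B's two versions and `scaleBase`/`scaleFast` play A's. A caller that is not multi-versioned must go through B's dispatcher (an indirect call through a function value), whereas each clone of A calls the matching clone of B directly, and A's own dispatcher performs the feature check once on entry instead of inside the loop. All names, and the `hasFeature` parameter standing in for a CPU feature check, are hypothetical.

```go
package main

import "fmt"

// mulBase and mulFast are hypothetical versions of B; both compute x*k.
func mulBase(x, k int) int { return x * k }
func mulFast(x, k int) int { return x * k } // stand-in for an ISA-specific body

// mul is B's dispatcher. A real one would be chosen from CPU features;
// calls through this function value are indirect.
var mul = mulBase

// scaleViaDispatch is a caller that is not multi-versioned: every
// iteration goes through B's dispatcher.
func scaleViaDispatch(xs []int, k int) {
	for i := range xs {
		xs[i] = mul(xs[i], k)
	}
}

// scaleBase and scaleFast are the two versions of A. Each one calls the
// matching version of B directly.
func scaleBase(xs []int, k int) {
	for i := range xs {
		xs[i] = mulBase(xs[i], k)
	}
}

func scaleFast(xs []int, k int) {
	for i := range xs {
		xs[i] = mulFast(xs[i], k)
	}
}

// scale is A's dispatcher: the loop-invariant feature check is hoisted
// out of the loop ("unswitched") and done once per call.
func scale(xs []int, k int, hasFeature bool) {
	if hasFeature {
		scaleFast(xs, k)
		return
	}
	scaleBase(xs, k)
}

func main() {
	xs := []int{1, 2, 3}
	scale(xs, 10, false)
	fmt.Println(xs) // [10 20 30]
}
```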
Thanks @cherrymui |
There may also be an opportunity to scan for code regions that can be safely parallelized (such as pure functions or loops) and automatically rewrite them to run as separate goroutines that send their results back over a channel, effectively a function-scale version of instruction-level parallelism. This would normally risk adding synchronization and context-switching overhead, but PGO would let the compiler apply the optimization only to functions or loops that typically run for a long time per invocation (say, 1-10 ms). I'm unsure how broadly applicable this would be in practice. On the other hand, I would imagine that detecting just a few such functions could create enough parallelism to saturate even a 32-thread machine, since goroutine counts would grow exponentially with each parallelized function in the stack.
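A hand-written sketch of the proposed rewrite for a pure loop: split the iteration space into one chunk per P, run each chunk in its own goroutine, and combine the partial results received on a channel. The `minParallelLen` cutoff is a hypothetical stand-in for whatever signal would mark the loop as long-running; the rewrite is only valid because `square` is pure and integer addition gives the same total in any order.

```go
package main

import (
	"fmt"
	"runtime"
)

// minParallelLen is a hypothetical cutoff standing in for the signal
// "this loop usually runs long enough to amortize goroutine overhead".
const minParallelLen = 1 << 16

// square is pure: it touches no shared state, so chunks may run in any order.
func square(x int64) int64 { return x * x }

// sumSquaresSerial is the original loop.
func sumSquaresSerial(xs []int64) int64 {
	var s int64
	for _, x := range xs {
		s += square(x)
	}
	return s
}

// sumSquaresParallel runs one chunk per P in its own goroutine and combines
// the partial sums received on a channel.
func sumSquaresParallel(xs []int64) int64 {
	if len(xs) < minParallelLen {
		return sumSquaresSerial(xs) // too short: not worth the overhead
	}
	workers := runtime.GOMAXPROCS(0)
	chunk := (len(xs) + workers - 1) / workers
	results := make(chan int64, workers) // buffered so no sender blocks
	n := 0
	for lo := 0; lo < len(xs); lo += chunk {
		hi := lo + chunk
		if hi > len(xs) {
			hi = len(xs)
		}
		n++
		go func(part []int64) {
			results <- sumSquaresSerial(part)
		}(xs[lo:hi])
	}
	var total int64
	for i := 0; i < n; i++ {
		total += <-results
	}
	return total
}

func main() {
	xs := make([]int64, 1<<17)
	for i := range xs {
		xs[i] = int64(i % 1000)
	}
	fmt.Println(sumSquaresParallel(xs) == sumSquaresSerial(xs)) // true
}
```

For floating-point accumulations the same rewrite would change the summation order and therefore possibly the result, so a compiler could only apply it where reordering is known to be safe.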
I don't think PGO (CPU profiles) knows anything about how many times a function is invoked or how long each invocation lasts.
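The limitation follows from how CPU profiles are collected: they are periodic stack samples, so aggregating them tells you how often each caller -> callee edge was on the stack when a sample fired. That is proportional to time spent, not to the number of calls; one 10 ms call and ten 1 ms calls through the same edge yield roughly the same weight. The sketch below uses a hypothetical `edgeWeights` helper over plain string stacks rather than a real pprof profile.

```go
package main

import "fmt"

// edge is a caller -> callee pair.
type edge struct{ Caller, Callee string }

// edgeWeights counts, for each caller -> callee edge, how many sampled
// stacks contain it. Each stack is listed leaf first, as in pprof samples.
func edgeWeights(stacks [][]string) map[edge]int {
	w := make(map[edge]int)
	for _, st := range stacks {
		for i := 0; i+1 < len(st); i++ {
			w[edge{Caller: st[i+1], Callee: st[i]}]++
		}
	}
	return w
}

func main() {
	// Three samples: two landed in b (called via a), one landed in c.
	// Nothing here says whether b ran once or many times.
	stacks := [][]string{
		{"b", "a", "main"},
		{"b", "a", "main"},
		{"c", "main"},
	}
	w := edgeWeights(stacks)
	fmt.Println(w[edge{"a", "b"}], w[edge{"main", "a"}], w[edge{"main", "c"}]) // 2 2 1
}
```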
This issue is to track the list of PGO optimization opportunities we're considering. As we begin work on any of these, it should be broken into its own issue. We'll edit and add to this list over time.
(This list was originally based on an old comment of mine, and this issue is partly to surface and track this list better.)