Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

cmd/compile: add basic block counters for PGO #65466

Open
alexanius opened this issue Feb 2, 2024 · 7 comments
Open

cmd/compile: add basic block counters for PGO #65466

alexanius opened this issue Feb 2, 2024 · 7 comments
Labels
compiler/runtime Issues related to the Go compiler and/or runtime. NeedsInvestigation Someone must examine and confirm this is a valid issue and not a duplicate of an existing one.
Milestone

Comments

@alexanius
Copy link

Proposal Details

This is a proposal for implementing PGO basic block counters in the Go compiler. The issue #62463 describes profile-based optimizations useful for the Go compiler. Most of them (basic block ordering, loop unroll, PGO register allocation and others) need counters inside the basic blocks. Currently, the Go compiler has the weighted call graph, which cannot be used for such optimizations directly.

Here I propose to add the basic block counters to make possible implementation of profile guided optimizations.

General approach

The general approach is based on adding counter values to the AST and SSA IR nodes, getting these values from the pprof file and correcting them during the compilation.

Step 1. Load the counter values to the AST nodes. The counters from the samples can easily be loaded to the corresponding AST nodes. As we use sampling profile, not all the nodes will have the values.

Step 2. Propagate the values to the remaining nodes. Here we traverse the AST nodes and propagate existing values to the nodes with no values. This is needed for further steps

Step 3. Correct values after devirtualization and inline. The callee function nodes contains the summary value of all the calls, but after inline, we should re-evaluate these values according to the inline point counter.

Step 4. Assign counters to the basic blocks during the SSA generation.

Step 5. Correct the counters of the basic blocks if any optimization changes the control flow.

Step 6. Implement the optimizations that rely on basic block counters.

Notes on implementation

  1. Alternative approach. The suggested approach assumes storing and correcting the counters during the whole compilation pipeline. This will add additional field to the IR nodes and can complicate the optimization implementation (at least additional steps to the inline). As an alternative, we could try to load counters to the particular SSA basic blocks, basing on the position information of the operations. This approach has the following disadvantages: we still need counter correction, based on top-down and bottom-up control flow graph traversing, and additional correction based on inline tree information. If there exists an optimization, that changes the control flow, we still need correction. Also, the dynamic escapes on cold paths optimization needs the counters on the AST nodes. So, loading the counters to the AST nodes is not more complicated (probably even easier) and gives more opportunities.

  2. One of the non-trivial parts is Step 2 - propagating nodes on the AST. Probably, this algorithm will be implemented as a down-top and top-down walk through the tree. The particular algorithm will be designed during implementation.

  3. To make the profile more precise, we need line discriminators. Currently, the debug information in the Go binary contains only per-line information. This will play a role in the cases of a few conditions in "if" construction, for example, but even without this information, the profile will be useful. The approach for loading this information is described in issue cmd/compile: add intra-line discrimination to PGO profiles #59612.

Implementation plan

I made a prototype that loads counters into the AST IR nodes and going to pass them to the SSA basic blocks. After that, I will implement the Steps 2 and 3. Then I'm going to add discriminators and implement the rest. After that I'm going to implement some of the optimizations like local basic block ordering.

I would like to get feedback from the community and understand if the community finds this useful.

@gopherbot gopherbot added the compiler/runtime Issues related to the Go compiler and/or runtime. label Feb 2, 2024
@prattmic
Copy link
Member

prattmic commented Feb 2, 2024

We certainly would like a way to precisely identify basic blocks in profiles so we can use that information for basic block level optimizations. That is what #59612 is intended to cover.

That issue doesn't cover precisely how the discriminator values are determined or how they get matched back to IR nodes and/or SSA blocks/values during the next build. It seems like this is something your design above tries to define, which is great.

That said, I don't quite follow exactly how you are defining them, or how propagation works as IR is mutated and eventually becomes SSA. Are you assigning discriminator values to high-level IR nodes (like ir.ForStmt) and then propagating through mutations and SSA?

I suspect this kind of propagation will get very complicated. I'd be tempted, at least for an initial version, to keep everything in SSA. Discriminators are based on basic block numbers, and optimizations prior to SSA simply cannot use them. Even in this case, I think tracking correlation to the profile through the different layers of SSA passes will be difficult.

I happen to be working on a prototype to assign discriminators to each PC, and plumb that into the binary metadata and the pprof profile. Once I have that done, you may be able to use this prototype to play with the PGO side of matching the samples-with-discriminators back to the build and applying basic block optimizations.

cc @cherrymui @aclements @jinlin-bayarea

@alexanius
Copy link
Author

@prattmic thank you for your answer.

Yes, you are right that discriminators are based on the basic blocks. My proposal needs the information about the column for the sampled instruction, so maybe the dwarf column information will be more suitable for my proposal. Thanks for highlighting that. Answering the question about the discriminator assignment - I do not do that for now (and need not discriminators itself, but the column number).

Yet, the column information is useful, we can load the counters to the AST nodes without it. The profile may be less precise, but still we can implement the profile-based optimizations. My motivation for loading the counters to the AST level is an opportunity to implement not only inline optimization on the AST level, so I think we should start loading the profile counters at the AST level.

@gopherbot
Copy link

Change https://go.dev/cl/560781 mentions this issue: PROTOTYPE: cmd/compile,runtime: add discriminator, plumb to pprof

@prattmic
Copy link
Member

prattmic commented Feb 2, 2024

https://go.dev/cl/560781 plumbs a discriminator value from the compiler to the pprof Line.Column field (even though the value is not actually the column number.

Feel free to use this if you'd like to play with using discriminators for PGO. If you'd like just the column number itself, see my comment at https://go-review.googlesource.com/c/go/+/560781/2/src/cmd/internal/obj/pcln.go.

@ianlancetaylor
Copy link
Contributor

I don't see why this has to be a proposal; taking it out of the proposal process.

@mknyszek mknyszek added this to the Backlog milestone Feb 7, 2024
@mknyszek mknyszek added the NeedsInvestigation Someone must examine and confirm this is a valid issue and not a duplicate of an existing one. label Feb 7, 2024
@gopherbot
Copy link

Change https://go.dev/cl/564055 mentions this issue: [WIP] DO NOT REVIEW cmd/compile: add basic block counters

@alexanius
Copy link
Author

I implemented some proof-of-concept prototype for basic block counters. Currently done:

  • Loading counters to AST nodes by line number information
  • The AST counter propagation algorithm
  • Moving counters to SSA nodes from AST nodes
  • Adding profile usage in basic block layout pass (just proof-of-concept)
  • Print counters to the html dumps

The patch shows how we can load basic block counters from the pprof file, how we can propagate them, and how to use them. The patch was tested on the go1 benchmark set, and it showed that depending on bb-layout pass the test fannkuch can be faster from 3.5% to 12.2% on the intel core machine (the xeon machine does not show this improvement).

To use basic block counters you should add an option -bbpgo: go test -a -bbpgo -pgo=go1.prof

This patch just shows the main idea of basic block counters loading. Before we can use it in the compiler, the following steps should be done:

  • Improve the counter propagation on the AST. Currently it does not take in account the returns from the middle of function. Also switches are not evaluated as needed.
  • Implement the counter corrections after inline
  • Improve the counter moving to SSA. At least we should process PanicBounds generation in a proper way.
  • Improve basic block layout and register allocation passes with profile information
  • Fix formatting when generating html dumps

This patch itself is not for review, but I would like to get some feedback for the general approach - the propagation algorithm and the Node structure modifications itself. Please, take a look at the irgraph.go and the ssa part.

cc @cherrymui @aclements @jinlin-bayarea @prattmic

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
compiler/runtime Issues related to the Go compiler and/or runtime. NeedsInvestigation Someone must examine and confirm this is a valid issue and not a duplicate of an existing one.
Projects
Development

No branches or pull requests

5 participants