Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

cmd/compile: profile-guided optimization #55022

Open
cherrymui opened this issue Sep 12, 2022 · 67 comments
Open

cmd/compile: profile-guided optimization #55022

cherrymui opened this issue Sep 12, 2022 · 67 comments
Assignees
Labels
Milestone

Comments

@cherrymui
Copy link
Member

cherrymui commented Sep 12, 2022

We propose adding support for profile-guided optimization (PGO) to the Go gc toolchain. PGO will enable the toolchain to perform application- and workload-specific optimizations based on run-time information. Unlike many compiler optimizations, PGO requires user involvement to collect profiles and feed them back into the build process. Hence, we propose a design that centers user experience and ease of deployment and fits naturally into the broader Go build ecosystem.

Detailed design can be found here.

In summary, we propose

  • The go command will search for a profile named default.pgo in the source directory of the main package and, if present, will supply it to the compiler for all packages in the transitive dependencies of the main package. We will also add a -pgo=<path> command line flag to go build that specifies a profile location to use for a PGO build, specifically,
    • -pgo=<path> will select the profile at path,
    • -pgo=auto will select the profile stored in the source directory of the main package (the behavior mentioned above)
    • -pgo=off will turn off PGO entirely
    • For Go 1.20, it will be default to off. In a future release we plan to make it default to auto.
  • The content of the profile will be considered an input to the build, and will be incorporated into the build cache key calculation and BuildInfo.
  • The compiler will support reading pprof profiles passed in from the go command, and modify its optimization heuristics to use this profile information. Initially we plan to add PGO-based inlining. More optimizations may be added in the future.
  • We will initially support pprof CPU profiles, which can be collected through usual means, such as the the runtime/pprof or net/http/pprof packages.

Input welcome. Beyond input on the general approach, we're particularly looking for input on whether PGO should be default enabled in 1.20, and flag and file naming conventions.

If accepted, we plan to implement a preview of PGO in Go 1.20.
Raj Barik and Jin Lin plan to contribute their work on the compiler implementation (initial prototype in CL 429863, but not yet following proposed API).

See also previous discussions at #28262. Filing a new issue to make it clearer what we are proposing.
Thanks.

cc @aclements @prattmic @rajbarik

@gopherbot gopherbot added this to the Proposal milestone Sep 12, 2022
@gopherbot
Copy link

gopherbot commented Sep 12, 2022

Change https://go.dev/cl/430355 mentions this issue: design/55022-pgo.md: add design doc

gopherbot pushed a commit to golang/proposal that referenced this issue Sep 12, 2022
We propose adding support for profile-guided optimization (PGO) to
Go.

For golang/go#55022.

Change-Id: I9bf8afc4858a530bc32fc2b40698209a45695203
Reviewed-on: https://go-review.googlesource.com/c/proposal/+/430355
Reviewed-by: Michael Pratt <mpratt@google.com>
@prattmic prattmic added the compiler/runtime Issues related to the Go compiler and/or runtime. label Sep 12, 2022
@rhysh
Copy link
Contributor

rhysh commented Sep 12, 2022

This looks very interesting! Starting with the pprof format sounds good, but I have some concerns about its verbosity in two ways.

First, the files can be really big and processing them can be slow. When I work with profiling data manually, I often start with pprof-format files that include hundreds or thousands of samples, usually covering less than one thread-minute. For PGO, I'd be inclined to use "the most and best data available". For some applications I support, that means multi-megabyte pprof-format files which take multiple seconds to load. That's a big file to add to an application's source control repository (for build environments that work that way), and a big build speed hit if it applies to every package. The approach of "Some of these downsides can be solved transparently by converting the profile to a simpler, indexed format at build time and storing this processed profile in the build cache" sounds promising, but I wonder if that format should be obtainable through some explicit go tool subcommand, which could then be used until the next Go version series (re-generated when upgrading from go1.20.x to go1.21.y).

Second, how can application owners understand what the compiler is learning from the profile? When I encounter a bug that might be affected by or otherwise involve PGO, I'd like to be able to report it (including steps to reproduce) without accidentally disclosing proprietary information about the rest of my applications' performance and without accidentally changing the compiler's PGO-related decisions. When I have the choice between using a 100kB pprof file or a 10MB pprof file, I'd like to know if using the more detailed file actually changes the results. When reviewing a change to the PGO input file (maybe to adjust its weight towards more latency-critical parts of the app), I'd like to know what it's communicating to the compiler. These could also be addressed by a "simpler, indexed format" that is semi-stable and human-readable / human-redactable.

@cherrymui
Copy link
Member Author

cherrymui commented Sep 12, 2022

@rhysh thanks for the feedback!

Yes, if we later switch to a different format, we will provide a tool to convert profiles between the formats. And yes, if we choose a different format we will consider how to make it more human-readable. (We considered this possibility but it was unclear to me how to do that. There is also a range of possibilities. The human readable form could be very low level, basically a textual representation of the raw profile, to very high level, basically a list of "compiler optimization directives", or something in between.)

Related, I think the compiler will have an option to output what optimizations are done based on the profile, either through the -m, -json flag, or a new flag. Maybe that helps?

@rhysh
Copy link
Contributor

rhysh commented Sep 12, 2022

Visibility into the compiler's decisions via -m/-json/etc sounds like a big help, yes. That lets users report bugs like "I have this problem with the 'encoding/json' package when the compiler uses 'this list' of PGO decisions". Will the maintainers who receive a bug report like that be able to convince their copy of the compiler to apply the same set of PGO decisions?

If so, it sounds like we have the "simpler [...] format" described in the proposal, in an ad-hoc way. Maybe ad-hoc is all we need, and would keep the messier and less stable parts of the ecosystem out of the main go toolchain. But I can see it being useful enough to earn top-tier support over time.

This sounds like the "basically a list of 'compiler optimization directives'" option. What design challenges did you encounter with that? Thanks @cherrymui !

@josharian
Copy link
Contributor

josharian commented Sep 12, 2022

Like @rhysh, I feel strongly that the .pgo file format should be something more like a hand-editable list of compiler optimization directives, and definitely not a raw input like a pprof file. Reciting the list of reasons I gave for this previously, and adding some more...

  • the compiler can stay simpler (internal)
  • it is easier/possible to write compiler tests (internal)
  • it is easier to debug: is the issue the interpretation of the pprof profile? (internal)
  • users can easily see what it is doing
  • users can bisect/minimize to figure out what optimization directive is causing issues (or providing speed-ups)
  • provides a mechanism for advanced tuning, that everyone always wants compiler directives for
  • it opens the door for other mechanisms/tooling to add information and optimizations, such as: offline static analysis, offline human analysis, bespoke profiling techniques, or combining profiles taken across a fleet of machines
  • it at least has a chance of supporting other common forms of optimization information, such as dynamic type info, starting goroutine stack sizes, scheduler hints
  • if optimizations are not getting applied correctly (function renamed?), instead of having to generated a brand new profile, which could be expensive/difficult, it can be fixed directly by hand-editing
  • easier to write CI tests around

@cherrymui
Copy link
Member Author

cherrymui commented Sep 12, 2022

Interesting idea about compiler optimization directive list for bug report. Thanks.

I thought for bug report due to PGO we would expect one to share the profile in order to reproduce the buggy build. It is understandable that this can be difficult in some cases. Maybe we can consider supporting an optimization list, even just for debugging purposes. (FWIW, we even had a prototype for optimization list (thanks @dr2chase) which I used for early experiments.)

One challenge is how to generate such an optimization directive list from raw profiles. For a streamlined user experience, one would take profiles from production binary, and go build could use that to build a new binary, without much manual work. So if we use optimization list, we will need a tool to convert/compute it from the raw profile. I would think the tool will need to, besides parse the profile, also parse the source files, do some analysis such as control flow propagation, etc., in order to map profile samples to optimization decisions. This has a lot overlap with what the compiler already does. So, instead of building a separate tool (and keep in in sync with the compiler), probably we just let the compiler do it.

Another benefit for profiles is that we can do multiple optimizations based on, say, a CPU profile. If we add a new type of PGO-based optimization, one would automatically get the optimization without taking a new profile. For optimization lists, one would need to explicitly add it to the file.

I'm not really clear what the optimization list would include. Should it be binary (boolean) options, or weights? Binary options sound simpler, but maybe weights are better for optimizations, so the compiler can take into account the information from source code and its own analysis?

There is also the question about the file becoming stale. Of course profiles can also become stale. It might be possible to design an optimization directive list format that is resilient to staleness.

Maybe one option is to have the compiler

  • accept pprof profiles
  • generate an optimization list based on profile
  • also accept an optimization list as an alternative / debugging option ?

@rajbarik
Copy link
Contributor

rajbarik commented Sep 13, 2022

@klauspost
Copy link
Contributor

klauspost commented Sep 13, 2022

Non-main packages

As a package writer, I cannot supply any improvements. It would require package users to implement pgo benchmarks that hits my package. Writing these benchmarks to build cpu profiles are far from trivial. As a result 99% of all compiles with my packages would be without (useful) pgo.

You mention this as future work, but it seems like a big miss.

Platforms

This will only cover platform independent code. Much optimized code has separate code paths for different platforms. As proposed I don't see any reasonable way to handle multiple platforms automatically, for example when cross-compiling.

You could make this a user problem, and say they should use go build -pgo=$GOOS_$GOARCH.pgo, but it isn't very friendly, and will probably hinder adoption.

Future Optimization Scope

I may be lacking imagination, but I don't see the scope of this going much beyond inline decisions.

Maybe with branch-information, you could re-order unordered switch statements to check the most common targets first. Maybe some auto-vectorization could be done in a few cases.

To justify such a rather complex setup it is important to keep the end-goal in sight.

Counter Proposal: Simplify Proposal

I appreciate the intent to make this automatic and low-effort. But writing the benchmarks and/or capture useful profiles is a big task and for a big application with many subsystems it will be a massive. While it proposes to just run pprof on a production server, apply the profile -> faster application, it isn't really that simple, since profiles would need to continuously be captured as code changes.

I very much agree with Josh, et al, that this should be simpler and in the hands of the developer.

To add to that, I know that Go doesn't like pragmas, but it seems to me with a few, well designed pragmas we could get the same result, that doesn't have all the problems.
I feel that that would bring us the benefit we are looking for in a much more controlled manner as described by Josh.

As far as I can see having a program propose pragmas (or code reordering) from a profile would be just as simple, and allows you to review the proposed changes before they are committed.

If Go still isn't ready for performance related pragmas, then an external file describing the same would also be an (IMO suboptimal) solution. I dread a bit how you would specify inlining at callers and auto-vectorization of a for loop without resorting to line numbers in such a file.

@merykitty
Copy link

merykitty commented Sep 13, 2022

@klauspost Given inlining is the most significant optimisation mechanism, I believe even at the current status, pgo is a much valuable improvement. In addition to inlining, pgo can be applied to interface dispatch, which helps devirtualise method invocations, profiling on branch-taken frequency helps produce better code layout at the very least, and guides the optimiser toward better decisions in other operations such as loop unrolling, cloning operations through phi nodes, in extreme cases branch pruning is also a possibility.

While pragmas or external decorative files work on call graphs, optimisation decisions are most optimal when having information in the call trees, this would be a suboptimal proposal. Additionally, it is not internal details, so I would disagree with this idea.

@klauspost
Copy link
Contributor

klauspost commented Sep 13, 2022

@merykitty Thanks! I see your point with interface dispatch, where knowing common implementation could output implementation specific branches.

The other cases I don't really see as something that couldn't be done with pragsmas. Granted, generic implementations could make this hard to control, since they would probably apply to all instances.

I don't see how you can make branch pruning without the compiler being able to prove that branches are impossible, in which case you don't really need pgo. Branch reordering and loop unrolling (without vectorization) has very limited impact on modern CPUs.

Either way, I stand by my point, that as proposed seems like a complex system that is hard to use correctly, and will provide limited benefits. It is not a "set up and forget", which is my main problem.

I am hoping for a simpler system that is more controllable, that doesn't require 'main-function' implementation with continuous updates required, which provides 90% of the benefits.

@mvdan
Copy link
Member

mvdan commented Sep 13, 2022

One worry I have with using profiles directly is that it can easily be an unstable mechanism. If I profile a program's current build and use that profile to rebuild it with PGO, grabbing another profile would likely result in fairly different results. For example, if a function call used to take ~2% of CPU time due to being in a very hot path and PGO now inlines it, then it could become far less relevant in the second profile.

I imagine we want developers to only use PGO with profiles obtained from runs without PGO. Can we restrict that to prevent confusion?

@prattmic
Copy link
Member

prattmic commented Sep 13, 2022

@mvdan The ability to collect profiles from a PGO-optimized binary is actually a core requirement in our design because it simplifies profile collection significantly (you can collect directly from production deployments without need for some kind of unoptimized "canary" deployment used only for profile collection).

It certainly has the challenge you mention, which we call "iterative stability" and discuss here and here. Sections 4.2.1 and 5.2 of the AutoFDO paper also discuss this. This is something we will need to pay close attention to and test well to make sure results are stable, and may be an area where additional metadata could help (e.g., collecting a CPU "last branch record" (LBR) profile would tell us which basic blocks are frequently executed, even if they have become much cheaper, so, used carefully, that could further mitigate this issue).

@prattmic
Copy link
Member

prattmic commented Sep 13, 2022

@josharian Having a list of optimization directives (or pragmas in the source) is something we've thought about, and definitely has pros and cons. I certainly think we want PGO to be applicable from a profile (which I think you agree with?), but whether translation to optimization decisions happens directly in the compiler, or as a pre-processing step.

Some thoughts on list of advantages below. Note, this list actually talks about two axes: profile vs optimization list and binary vs plain text format. I think we could decide along either of these axes (e.g., we could use a plain-text format that describes the pprof profile).

  • the compiler can stay simpler (internal)

I'm not sure I fully agree with this. While I agree that the compiler will be simpler if it only takes a list of directives rather than having to process a profile, something still needs to convert the profile to a list of directives. That new tool will contain the complexity, and share a lot of similarities with the compiler, perhaps even directly sharing code, depending on the specifics of the directives.

e.g., if inlining directives are binary decisions ("inline this function"), then the conversion tool should probably contain a near copy of the compiler's inlining heuristics so that it makes very similar decisions. That is perhaps less important if the directives are more abstract (providing an inlining importance "weight"?), but if we go too abstract then we are probably just describing a profile anyways.

As Cherry mentioned, one option here is that the compiler could accept either a profile or optimization list. When given a profile, it generates the optimization list to be used for future compilations.

  • it is easier/possible to write compiler tests (internal)
  • it is easier to debug: is the issue the interpretation of the pprof profile? (internal)

I agree this would be easier to test/debug because it adds an intermediate format which you could test both sides of.

  • users can easily see what it is doing
  • users can bisect/minimize to figure out what optimization directive is causing issues (or providing speed-ups)

100% agreed that a plain test format and an optimization list are more transparent about what is happening in the build.

  • provides a mechanism for advanced tuning, that everyone always wants compiler directives for
  • it opens the door for other mechanisms/tooling to add information and optimizations, such as: offline static analysis, offline human analysis, bespoke profiling techniques, or combining profiles taken across a fleet of machines

While I agree that an optimization list makes manual optimization tuning easier/possible, that has not been a goal of this proposal, and it is something we have avoided adding pragmas for (also likely a reason we wouldn't have PGO work by adding pragmas to source code). I think it is a bigger discussion if we want to support custom tuning of optimizations.

or combining profiles taken across a fleet of machines

This strikes me as out of place in your list. We absolutely want to support (and encourage!) merging profiles from multiple instances in order to get a more representative sample. This seems to be a point in favor of pprof, as it is easy to merge pprof profiles (go tool pprof -proto 1.pprof 2.pprof > merged.pprof).

On the other hand, optimization lists seem difficult to merge. If one list says "inline foo" and the other says "do not inline foo", how do you merged? It seems in this case you'd still want to merge the profiles prior to generating the optimization list.

  • it at least has a chance of supporting other common forms of optimization information, such as dynamic type info, starting goroutine stack sizes, scheduler hints

Earlier in our design we were planning to create a new PGO format specifically to have the flexibility of adding information beyond CPU profiles. We switched to proposing pprof directly because the format, while not perfect, is actually quite flexible.

e.g., dynamic type info could be encoded as a Label on Samples of calls describing the type being called (though for calls the type is already obvious from the call destination, so not the best example). Stack sizes could be sampled as Samples with Location == gp.startpc, and value == stack size.

The format is not perfect and we may find ourselves limited in the future (discussion here), but I think there is a lot of runway.

  • if optimizations are not getting applied correctly (function renamed?), instead of having to generated a brand new profile, which could be expensive/difficult, it can be fixed directly by hand-editing

Agreed that a plain-text format makes fix-ups easier. We've discussed tooling for renames for pprof files, but plain text is easier.

  • easier to write CI tests around

I'm not sure what you mean by this, could you expand?

@ianlancetaylor
Copy link
Contributor

ianlancetaylor commented Sep 13, 2022

@klauspost My opinion is that history tells us that using pragmas to guide optimizations doesn't sustain over time. What happens is that someone does a bunch of analysis with a specific compiler version and a specific set of benchmarks, and writes a bunch of pragmas that makes those benchmarks faster. So far so good. But two years later the compiler has changed, the libraries have changed, and the program has changed. The existing optimization pragmas have not changed, and no longer do any good, and occasionally actually do harm.

Profiling can have the same problem, of course, if nobody updates the profiles. But updating the profiles is a much simpler task than analyzing performance in detail, and therefore tends to happen more frequently. In the best case, updating the profiling is simply automated.

So over the long term, I believe that profile guided optimization is a better approach for most projects. It's unlikely to beat careful analysis and hand tuning at the point in time when that happens. But profile guided optimization is much cheaper in what is for most projects the greatest expense: developer time.

Just my opinion, of course, but based on experience.

@merykitty
Copy link

merykitty commented Sep 13, 2022

@klauspost, just a small point

I don't see how you can make branch pruning without the compiler being able to prove that branches are impossible

Branch pruning here means splitting everything below the if to separate hot paths and cold paths in the whole function.

@prattmic
Copy link
Member

prattmic commented Sep 13, 2022

@klauspost

Platforms

This will only cover platform independent code. Much optimized code has separate code paths for different platforms. As proposed I don't see any reasonable way to handle multiple platforms automatically, for example when cross-compiling.

You could make this a user problem, and say they should use go build -pgo=$GOOS_$GOARCH.pgo, but it isn't very friendly, and will probably hinder adoption.

This is something we've been thinking about recently and are open to suggestions. I think we may want automatic build configuration selection eventually, but the details are tricky. Do we use suffixes, like .go files (default_linux_amd64.pgo)? That could make directories quite noisy with a dozen or more files (should we move to a subdirectory like testdata?). How should we handle selection for build tags?

What we've proposed I think is the MVP option which remains forward compatible with future auto-selection we might do.

(FWIW, I think that a single profile for all platforms is likely good enough for more users. A platform-specific one will of course be at least a bit better, but my intuition says that the vast majority of most profiles will be common across platforms. We could verify this with data by comparing profiles).)

@aclements
Copy link
Member

aclements commented Sep 13, 2022

@klauspost

I may be lacking imagination, but I don't see the scope of this going much beyond inline decisions.

To add to what others have said, we've actually been keeping an eye toward many possible PGO-based optimizations while designing this. Here's a non-exhaustive list:

  • Inlining just seems like the obvious first step, since it's likely to provide a lot of bang for your buck. We've been talking about using PGO for inlining for many, many years. :)
  • Local basic block ordering: ordering blocks within a function to cluster hot blocks and potentially improve branch prediction (though the win of the latter on modern CPUs is small).
  • Register allocation: register allocation currently uses heuristics to determine a hot path and move spills off that hot path. PGO can tell it the true hot path.
  • Function ordering: Clustering functions at a whole-binary level for better locality.
  • Global block ordering: A step beyond function ordering. The basic form of this is hot/cold splitting, but it can be more aggressive than that.
  • Indirect call devirtualization: If the profile shows that a particular function is by far the most common target of an indirect call, specialize the caller to check that that's the target and make a direct call if it is (possibly even inlining the target). (Edit: Note that this would apply to both closure calls and interface method calls.)
  • Stenciling: Stencil hot generic functions, possibly based on type information from the profile.
  • Map/slice pre-sizing: Pre-size maps and slices based on allocation site. (This requires more than a CPU profile.)
  • Lifetime allocation: Co-locate allocations with similar lifetimes by allocation site. (This also requires more than a CPU profile.)

@cherrymui
Copy link
Member Author

cherrymui commented Sep 13, 2022

@klauspost thanks for the feedback!

Non-main packages

I totally agree that PGO for non-main packages are important (we want to do it for the standard library as well), and we spent quite some time considering it. But many details are still unclear to us. At the moment we're proposing what we know how to do, and leave the door open for future improvements including PGO for non-main packages. I think this is better than waiting.

It would require package users to implement pgo benchmarks that hits my package.

This is not what we expected. Instead, we expect the user to take profiles from their main program. If in that program your package is in the hot path, PGO will apply to your package.

Platforms

As @prattmic mentioned, we are proposing an MVP option which remains forward compatible with future improvements.

As mentioned in the design doc

In the future, we may support automatic lookup for different profiles for different build configurations (e.g., GOOS/GOARCH), but for now we expect profiles to be fairly portable between configurations.

By "portable", for example, if the workloads are similar across platforms, the hot functions on one platform is likely also hot on another platform.

See also "Input welcome" from the original post. We're particularly welcome input for naming default profiles.

Future Optimization Scope

I'm not really sure what your comment meant. But here is a (non-exhaustive) list of optimizations we plan to do in the future https://go.googlesource.com/proposal/+/master/design/55022-pgo.md#optimizations

@apparentlymart
Copy link

apparentlymart commented Sep 14, 2022

I will first admit that I do not have direct experience with profile-guided optimization in any other language, and so my line of questioning here may be naive. If this seems like a nonsense question, I'm happy to accept that as an answer to avoid anyone feeling the need to do my homework for me! 😄


The proposal and some of the commentary above discusses the fact that a particular set of profile information is specific to a particular version of the compiler (because it may generate code with different execution characteristics) and to particular input source code (because changing the program will at least change the behavior of the part of the program that changed, and may also have knock-on effects elsewhere from optimizations that are able to consider global information).

I quite enjoy the fact that go test is able to (mostly) reliably determine when a particular part of the test cache has been invalidated by changes to the input source code. It works well enough that I rarely question whether it's working correctly, and am surprised in the rare cases where I catch it out by doing something particularly odd!

Of course I understand that profile-guided optimization cannot possibly be implemented with an automatically-maintained cache, because gathering profile information is always an explicit step and similarly it's up to the person running go build to decide which profile information is relevant to the current situation.

However, I do wonder if it seems feasible for the toolchain to automatically announce when a particular profile information file has been invalidated, so that I could be prompted to regenerate it. I assume "invalidated" is not a boolean value in this case but perhaps more like a code coverage report: how much of the program is still the same now as it was when this profile information was generated? What specific parts of the program are not covered by this profile because they have changed since the profile was generated? Was the profile generated from a program compiled with the same toolchain version as I'm currently using? (Hopefully this could also somehow take into account the degree to which the system can "make do" with mismatching profiles, giving higher precedence to changes that are outside the set that the compiler and trace format can account for automatically, as described in Stability to source code and compiler changes.)

With what I've understood from reviewing the proposal so far it seems like it would be too expensive to track this sort of "profile coverage" on a per-line or per-statement basis, since it would seem to require retaining some sort of checksum for every single line/statement. But I wonder if it would be feasible to track at a coarser grain, such as per-function, per-file, or per-package, just to give the user something meaningful to understand the results against, so that they can then use their intuition to estimate whether the indicated objects have changed enough to warrant the effort of capturing fresh profile data.

(For this question I'm imagining a situation where a team collects profile information semi-manually once and uses it for some time before semi-manually generating it again. I expect that the capability I'm describing would be far less useful in situations where a team is able to constantly gather profile information for a currently-running and feed it into the next build, as described in Iterative Stability. The software I spend most of my time working on is "shrinkwrap" software which we package and ship to users to run on their own computers, and so our ability to capture traces from real user runs is limited. "This feature is primarily intended for servers you can profile constantly" would be a perfectly reasonable argument to dismiss my questioning above, of course.)

@josharian
Copy link
Contributor

josharian commented Sep 14, 2022

easier to write CI tests around

I'm not sure what you mean by this, could you expand?

Yes, that was rather vague. :) It's similar to @apparentlymart's point (which is a good one).

In many systems, performance properties are correctness properties. Losing key optimizations can cause performance regressions, which can cause existing provisioning to be insufficient, which can take down a system. Even merely slowing down a system can cause cascading failures. (I speak from experience, unfortunately.)

The more opaque the toolchain and its inputs are, the harder it is to (a) write safety checks that detect performance problems before they make it to production and (b) diagnose performance issues after they make it to production. This is in some ways even worse with less severe regressions, which don't result in an obvious, immediate problem.

The obvious approach here using pprof files is to have excellent compiler diagnostic output. This puts us in the same boat as (for example) current inlining tests: exec a build from inside a test and check for magic output strings. It's kind of a miserable existence, but it works.

@apparentlymart's notion of a PGO fitness/staleness score would also help, albeit in a less fine-grained way.

@josharian
Copy link
Contributor

josharian commented Sep 14, 2022

Or to put it a different way: How do I code review a commit that replaces one pprof profile with a new one?

@prattmic
Copy link
Member

prattmic commented Sep 14, 2022

Thanks for the clarification, that makes sense. Having an optimization list doesn't solve the problem of determining if a new profile has "good" results, but it is at least easier to look at a diff and notice potentially worrying changes.

(Though if the list is thousands of lines long and changes a lot from profile-to-profile, then I imagine that could also be difficult to review).

@cherrymui
Copy link
Member Author

cherrymui commented Sep 14, 2022

Thanks.

I think we all agree that we want the compiler to emit an optimization list for optimizations it does based on a given profile.

Interesting idea about profile invalidation. I think that is a good point. And we can make the compiler emit some information when some profile information doesn't apply, either in the same optimization list file our some other file. I think it shouldn't be too hard or expensive.

gopherbot pushed a commit that referenced this issue Oct 31, 2022
Since pgo is a new package, it is reasonably straightforward to
encapsulate its state into a non-global object that we pass around,
which will help keep it isolated.

There are no functional changes in this CL, just packaging up the
globals into a new object.

There are two major pieces of cleanup remaining:

1. reflectdata and noder have separate InlineCalls calls for method
   wrappers. The Profile is not plumbed there yet, but this is not a
   regression as the globals were previously set only right around the
   main inlining pass in gc.Main.

2. pgo.ListOfHotCallSites is still global, as it will require more work
   to clean up. It is effectively a local variable in InlinePackage,
   except that it assumes that InlineCalls is immediately preceded by a
   CanInline call for the same function. This is not necessarily true
   due to the recursive nature of CanInline. This also means that some
   InlineCalls calls may be missing the list of hot callsites right now.

For #55022.

Change-Id: Ic1fe41f73df96861c65f8bfeecff89862b367290
Reviewed-on: https://go-review.googlesource.com/c/go/+/446303
Reviewed-by: Cherry Mui <cherryyz@google.com>
TryBot-Result: Gopher Robot <gobot@golang.org>
Run-TryBot: Michael Pratt <mpratt@google.com>
@gopherbot
Copy link

gopherbot commented Nov 1, 2022

Change https://go.dev/cl/447016 mentions this issue: cmd/compile: use CDF to determine PGO inline threshold

@gopherbot
Copy link

gopherbot commented Nov 1, 2022

Change https://go.dev/cl/447015 mentions this issue: cmd/compile: use edge weights to decide inlineability in PGO

@gopherbot
Copy link

gopherbot commented Nov 1, 2022

Change https://go.dev/cl/446977 mentions this issue: cmd/compile/internal/pgo: unexport local types and fields

@gopherbot
Copy link

gopherbot commented Nov 1, 2022

Change https://go.dev/cl/446976 mentions this issue: cmd/compile/internal/inline,cmd/compile/internal/pgo: remove candHotNodeMap and candHotEdgeMap

gopherbot pushed a commit to golang/benchmarks that referenced this issue Nov 1, 2022
With -pgo, for each configuration sweet automatically runs an initial
profiling configuration. The merged profiles from these runs are used as
the PGO input to a ".pgo" variant of the configuration. Comparing the
base configuration to the ".pgo" variant indicates the effect of PGO.

At a lower level, the config format adds a "pgofiles" map, which can be
used to specify PGO profiles to use for each benchmark.

Ultimately this sets GOFLAGS=-pgo=/path in BuildEnv. Some benchmarks may
not currently properly plumb this into their build (e.g., the GoBuild
benchmarks don't build the compiler at all). Existing benchmarks need to
be double-checked that they actually get PGO enabled.

For golang/go#55022.

Change-Id: I81c0cb085ef3b5196a05d2565dd0e2f83057b9fa
Reviewed-on: https://go-review.googlesource.com/c/benchmarks/+/444557
Reviewed-by: Michael Knyszek <mknyszek@google.com>
TryBot-Result: Gopher Robot <gobot@golang.org>
Auto-Submit: Michael Pratt <mpratt@google.com>
Run-TryBot: Michael Pratt <mpratt@google.com>
gopherbot pushed a commit to golang/benchmarks that referenced this issue Nov 1, 2022
go-build is a subtly strange benchmark. For every other benchmark, sweet
builds the benchmarked program in set up, using Config.BuildEnv.
go-build does not do that because the benchmarked programs (cmd/compile,
cmd/link, and (to a lesser degree) cmd/go) are conveniently prebuilt in
GOROOT.

But this breaks the intuition that BuildEnv can control build flags for
the benchmarked program, particularly for sweet run -pgo, where sweet
itself expects a PGO flag in BuildEnv to impact the benchmarked program.

Adjust go-build to follow these standards by copying GOROOT for each
config and running `go install cmd/compile cmd/link` in the copied
GOROOT, allowing BuildEnv to take effect.

This fixes the general case for PGO, though it could use more work as
really cmd/compile and cmd/link should receive _different_ PGO profiles.

We also run the dummy build with the ExecEnv (matching how the actual
benchmark will run) to avoid passing -pgo to those builds.

For golang/go#55022.

Change-Id: I36e4486c79ee4200c2e10ccba913d559187bdad2
Reviewed-on: https://go-review.googlesource.com/c/benchmarks/+/444895
Reviewed-by: Michael Knyszek <mknyszek@google.com>
Run-TryBot: Michael Pratt <mpratt@google.com>
TryBot-Result: Gopher Robot <gobot@golang.org>
@gopherbot
Copy link

gopherbot commented Nov 1, 2022

Change https://go.dev/cl/447135 mentions this issue: runtime: yield in goschedIfBusy if gp.preempt

gopherbot pushed a commit that referenced this issue Nov 1, 2022
The global ListOfHotCallSites set is used to communicate between
CanInline and InlineCalls the set of call sites that InlineCalls may
increase the budget for.

CanInline clears this map on each call, thus assuming that
InlineCalls(x) is called immediately after CanInline(x). This assumption
is false, as CanInline (among other cases) is recursive (CanInline ->
hairyVisitor.doNode -> inlCallee -> CanInline).

When this assumption proves false, we will lose the opportunity to
inline hot calls.

This CL is the least invasive fix for this. ListOfHotCallSites is
actually just a subset of the candHotEdgeMap, with CallSiteInfo.Callee
cleared. candHotEdgeMap doesn't actually need to distinguish based on
Callee, so we can drop callee from candHotEdgeMap as well and just use
that directly [1].

Later CLs should do more work to remove the globals entirely.

For cmd/compile, this inceases the number of PGO inlined functions by
~50% for one set of PGO parameters. I have no evaluated performance
impact.

[1] This is something that we likely want to change in the future.

For #55022.

Change-Id: I57735958d651f6dfa9bd296499841213d20e1706
Reviewed-on: https://go-review.googlesource.com/c/go/+/446755
Auto-Submit: Michael Pratt <mpratt@google.com>
Reviewed-by: Cherry Mui <cherryyz@google.com>
Run-TryBot: Michael Pratt <mpratt@google.com>
TryBot-Result: Gopher Robot <gobot@golang.org>
@gopherbot
Copy link

gopherbot commented Nov 2, 2022

Change https://go.dev/cl/447315 mentions this issue: cmd/compile/internal/pgo: match on call line offsets

gopherbot pushed a commit that referenced this issue Nov 2, 2022
Currently, with PGO, the inliner uses node weights to decide if a
function is inlineable (with a larger budget). But the actual
inlining is determined by the weight of the call edge. There is a
discrepancy that, if a callee node is hot but the call edge is not,
it would not inlined, and marking the callee inlineable would of
no use.

Instead of using two kinds of weights, we just use the edge
weights to decide inlineability. If a function is the callee of a
hot call edge, its inlineability is determined with a larger
threshold. For a function that exceeds the regular inlining budget,
it is still inlined only when the call edge is hot, as it would
exceed the regular inlining cost for non-hot call sites, even if
it is marked inlineable.

For #55022.

Change-Id: I93fa9919fc6bcbb394e6cfe54ec96a96eede08f7
Reviewed-on: https://go-review.googlesource.com/c/go/+/447015
Run-TryBot: Cherry Mui <cherryyz@google.com>
Reviewed-by: Michael Pratt <mpratt@google.com>
TryBot-Result: Gopher Robot <gobot@golang.org>
gopherbot pushed a commit that referenced this issue Nov 3, 2022
Currently in PGO we use a percentage threshold to determine if a
callsite is hot. This CL uses a different method -- treating the
hottest callsites that make up cumulatively top X% of total edge
weights as hot (X=95 for now). This default might work better for
a wider range of profiles. (The absolute threshold can still be
changed by a flag.)

For #55022.

Change-Id: I7e3b6f0c3cf23f9a89dd5994c10075b498bf14ee
Reviewed-on: https://go-review.googlesource.com/c/go/+/447016
TryBot-Result: Gopher Robot <gobot@golang.org>
Run-TryBot: Cherry Mui <cherryyz@google.com>
Reviewed-by: Michael Pratt <mpratt@google.com>
romaindoumenc pushed a commit to TroutSoftware/go that referenced this issue Nov 3, 2022
This adds the function "start line number" to runtime._func and
runtime.inlinedCall objects. The "start line number" is the line number
of the func keyword or TEXT directive for assembly.

Subtracting the start line number from PC line number provides the
relative line offset of a PC from the the start of the function. This
helps with source stability by allowing code above the function to move
without invalidating samples within the function.

Encoding start line rather than relative lines directly is convenient
because the pprof format already contains a start line field.

This CL uses a straightforward encoding of explictly including a start
line field in every _func and inlinedCall. It is possible that we could
compress this further in the future. e.g., functions with a prologue
usually have <line of PC 0> == <start line>. In runtime.test, 95% of
functions have <line of PC 0> == <start line>.

According to bent, this is geomean +0.83% binary size vs master and
-0.31% binary size vs 1.19.

Note that //line directives can change the file and line numbers
arbitrarily. The encoded start line is as adjusted by //line directives.
Since this can change in the middle of a function, `line - start line`
offset calculations may not be meaningful if //line directives are in
use.

For golang#55022.

Change-Id: Iaabbc6dd4f85ffdda294266ef982ae838cc692f6
Reviewed-on: https://go-review.googlesource.com/c/go/+/429638
Run-TryBot: Michael Pratt <mpratt@google.com>
Auto-Submit: Michael Pratt <mpratt@google.com>
Reviewed-by: Matthew Dempsky <mdempsky@google.com>
TryBot-Result: Gopher Robot <gobot@golang.org>
Reviewed-by: Cherry Mui <cherryyz@google.com>
romaindoumenc pushed a commit to TroutSoftware/go that referenced this issue Nov 3, 2022
Now that we plumb the start line to the runtime, we can include in pprof
files. Since runtime.Frame.startLine is not (currently) exported, we
need a runtime helper to get the value.

For golang#55022.
Updates golang#56135.

Change-Id: Ifc5b68a7b7170fd7895e4099deb24df7977b22ea
Reviewed-on: https://go-review.googlesource.com/c/go/+/438255
TryBot-Result: Gopher Robot <gobot@golang.org>
Reviewed-by: Cherry Mui <cherryyz@google.com>
Run-TryBot: Michael Pratt <mpratt@google.com>
Reviewed-by: Austin Clements <austin@google.com>
romaindoumenc pushed a commit to TroutSoftware/go that referenced this issue Nov 3, 2022
For golang#55022

Change-Id: I51f1ba166d5a66dcaf4b280756be4a6bf9545c5e
Reviewed-on: https://go-review.googlesource.com/c/go/+/429863
TryBot-Result: Gopher Robot <gobot@golang.org>
Reviewed-by: Cherry Mui <cherryyz@google.com>
Reviewed-by: Michael Pratt <mpratt@google.com>
Run-TryBot: Cherry Mui <cherryyz@google.com>
romaindoumenc pushed a commit to TroutSoftware/go that referenced this issue Nov 3, 2022
The most important change here is to log output from the child, making
it easier to diagnose problems when the child 'go test' fails.

We can also eliminate the cmd.Wait goroutine by using an os.Pipe, whose
reader will return io.EOF when the child exits.

For golang#55022.

Change-Id: I1573ea444407d545bdca8525c9ff7b0a2baebf5e
Reviewed-on: https://go-review.googlesource.com/c/go/+/446300
Run-TryBot: Michael Pratt <mpratt@google.com>
Reviewed-by: Cherry Mui <cherryyz@google.com>
TryBot-Result: Gopher Robot <gobot@golang.org>
romaindoumenc pushed a commit to TroutSoftware/go that referenced this issue Nov 3, 2022
Parts of package pgo fetch the line number of a node by parsing the
number out of the string returned from ir.Line().

This is indirect and inefficient, so it should be replaced with a more
direct lookup. It is also potentially buggy: ir.Line uses
ctxt.OutermostPos, i.e., the line number where an inlined node in
inlined. We want ctxt.InnermostPos, because that is the line number used
in pprof profiles that we are matching against (See comments on
OutermostPos and InnermostPos).

I'm not sure whether this was an active, as we use ir.Line before and
during inlining. I think we could see CALL nodes with OutermostPos !=
InnermostPos during midstack inlining, but I am not sure. Regardless,
explicitly using the desired position is clearer.

For golang#55022.

Change-Id: Ic640761c9e1d01cacbf91f3aaeaf284ad7e38dbd
Reviewed-on: https://go-review.googlesource.com/c/go/+/446302
Reviewed-by: Cherry Mui <cherryyz@google.com>
TryBot-Result: Gopher Robot <gobot@golang.org>
Run-TryBot: Michael Pratt <mpratt@google.com>
Auto-Submit: Michael Pratt <mpratt@google.com>
romaindoumenc pushed a commit to TroutSoftware/go that referenced this issue Nov 3, 2022
Since pgo is a new package, it is reasonably straightforward to
encapsulate its state into a non-global object that we pass around,
which will help keep it isolated.

There are no functional changes in this CL, just packaging up the
globals into a new object.

There are two major pieces of cleanup remaining:

1. reflectdata and noder have separate InlineCalls calls for method
   wrappers. The Profile is not plumbed there yet, but this is not a
   regression as the globals were previously set only right around the
   main inlining pass in gc.Main.

2. pgo.ListOfHotCallSites is still global, as it will require more work
   to clean up. It is effectively a local variable in InlinePackage,
   except that it assumes that InlineCalls is immediately preceded by a
   CanInline call for the same function. This is not necessarily true
   due to the recursive nature of CanInline. This also means that some
   InlineCalls calls may be missing the list of hot callsites right now.

For golang#55022.

Change-Id: Ic1fe41f73df96861c65f8bfeecff89862b367290
Reviewed-on: https://go-review.googlesource.com/c/go/+/446303
Reviewed-by: Cherry Mui <cherryyz@google.com>
TryBot-Result: Gopher Robot <gobot@golang.org>
Run-TryBot: Michael Pratt <mpratt@google.com>
romaindoumenc pushed a commit to TroutSoftware/go that referenced this issue Nov 3, 2022
The global ListOfHotCallSites set is used to communicate between
CanInline and InlineCalls the set of call sites that InlineCalls may
increase the budget for.

CanInline clears this map on each call, thus assuming that
InlineCalls(x) is called immediately after CanInline(x). This assumption
is false, as CanInline (among other cases) is recursive (CanInline ->
hairyVisitor.doNode -> inlCallee -> CanInline).

When this assumption proves false, we will lose the opportunity to
inline hot calls.

This CL is the least invasive fix for this. ListOfHotCallSites is
actually just a subset of the candHotEdgeMap, with CallSiteInfo.Callee
cleared. candHotEdgeMap doesn't actually need to distinguish based on
Callee, so we can drop callee from candHotEdgeMap as well and just use
that directly [1].

Later CLs should do more work to remove the globals entirely.

For cmd/compile, this inceases the number of PGO inlined functions by
~50% for one set of PGO parameters. I have no evaluated performance
impact.

[1] This is something that we likely want to change in the future.

For golang#55022.

Change-Id: I57735958d651f6dfa9bd296499841213d20e1706
Reviewed-on: https://go-review.googlesource.com/c/go/+/446755
Auto-Submit: Michael Pratt <mpratt@google.com>
Reviewed-by: Cherry Mui <cherryyz@google.com>
Run-TryBot: Michael Pratt <mpratt@google.com>
TryBot-Result: Gopher Robot <gobot@golang.org>
romaindoumenc pushed a commit to TroutSoftware/go that referenced this issue Nov 3, 2022
Currently, with PGO, the inliner uses node weights to decide if a
function is inlineable (with a larger budget). But the actual
inlining is determined by the weight of the call edge. There is a
discrepancy that, if a callee node is hot but the call edge is not,
it would not inlined, and marking the callee inlineable would of
no use.

Instead of using two kinds of weights, we just use the edge
weights to decide inlineability. If a function is the callee of a
hot call edge, its inlineability is determined with a larger
threshold. For a function that exceeds the regular inlining budget,
it is still inlined only when the call edge is hot, as it would
exceed the regular inlining cost for non-hot call sites, even if
it is marked inlineable.

For golang#55022.

Change-Id: I93fa9919fc6bcbb394e6cfe54ec96a96eede08f7
Reviewed-on: https://go-review.googlesource.com/c/go/+/447015
Run-TryBot: Cherry Mui <cherryyz@google.com>
Reviewed-by: Michael Pratt <mpratt@google.com>
TryBot-Result: Gopher Robot <gobot@golang.org>
romaindoumenc pushed a commit to TroutSoftware/go that referenced this issue Nov 3, 2022
Currently in PGO we use a percentage threshold to determine if a
callsite is hot. This CL uses a different method -- treating the
hottest callsites that make up cumulatively top X% of total edge
weights as hot (X=95 for now). This default might work better for
a wider range of profiles. (The absolute threshold can still be
changed by a flag.)

For golang#55022.

Change-Id: I7e3b6f0c3cf23f9a89dd5994c10075b498bf14ee
Reviewed-on: https://go-review.googlesource.com/c/go/+/447016
TryBot-Result: Gopher Robot <gobot@golang.org>
Run-TryBot: Cherry Mui <cherryyz@google.com>
Reviewed-by: Michael Pratt <mpratt@google.com>
@gopherbot
Copy link

gopherbot commented Nov 4, 2022

Change https://go.dev/cl/447997 mentions this issue: DO NOT SUBMIT: tool for parallel PGO testing on gomotes

gopherbot pushed a commit that referenced this issue Nov 4, 2022
Rather than matching calls to edges in the profile based directly on
line number in the source file, use the line offset from the start of
the function. This makes matching robust to changes in the source file
above the function containing the call.

The start line in the profile comes from Function.start_line, which is
included in Go pprof output since CL 438255.

Currently it is an error if no samples set start_line to help users
detect profiles missing this information. In the future, we should
fallback to using absolute lines, which is better than nothing.

For #55022.

Change-Id: Ie621950cfee1fef8fb200907a2a3f1ded41d04fa
Reviewed-on: https://go-review.googlesource.com/c/go/+/447315
Reviewed-by: Cherry Mui <cherryyz@google.com>
TryBot-Result: Gopher Robot <gobot@golang.org>
Run-TryBot: Michael Pratt <mpratt@google.com>
Auto-Submit: Michael Pratt <mpratt@google.com>
gopherbot pushed a commit that referenced this issue Nov 7, 2022
runtime.bgsweep contains an infinite loop. With aggressive enough
inlining, it may not perform any CALLs on a typical iteration. If
the runtime trying to preempt this goroutine, the lack of CALLs may
prevent preemption for ever occurring.

bgsweep does happen to call goschedIfBusy. Add a preempt check there to
make sure we yield eventually.

For #55022.

Change-Id: If22eb86fd6a626094b3c56dc745c8e4243b0fb40
Reviewed-on: https://go-review.googlesource.com/c/go/+/447135
Run-TryBot: Michael Pratt <mpratt@google.com>
TryBot-Result: Gopher Robot <gobot@golang.org>
Reviewed-by: Michael Knyszek <mknyszek@google.com>
Reviewed-by: Cherry Mui <cherryyz@google.com>
gopherbot pushed a commit that referenced this issue Nov 8, 2022
Add a -pgo flag for "go build" (and other build actions), to
specify the file path of a profile used for PGO. Special name
"off" turns off PGO.

The given profile path is passed to the compiler.

The build cache is sensitive to the content of the given PGO
profile.

TODO: auto mode.

For #55022.

Change-Id: Ieee1b131b4c041f9502fd0a1acf112f3e44246be
Reviewed-on: https://go-review.googlesource.com/c/go/+/438736
TryBot-Result: Gopher Robot <gobot@golang.org>
Run-TryBot: Cherry Mui <cherryyz@google.com>
Reviewed-by: Michael Pratt <mpratt@google.com>
Reviewed-by: Bryan Mills <bcmills@google.com>
@gopherbot
Copy link

gopherbot commented Nov 10, 2022

Change https://go.dev/cl/449500 mentions this issue: DO NOT SUBMIT: experiemntal pprof trimming tool

@lkarlslund
Copy link

lkarlslund commented Nov 13, 2022

Excellent discussion here, and this initiative is really great overall. Reading the above, I think @klauspost has some very valid concerns, and I know how much effort he puts into making things perform really well.

I think that some pragmas (forced inlining hints) would be a huge help for package maintainers, and looking at older discussions it was denied with the argument that you could just "copy and paste code" to do the inlining yourself, which to me sounds absurd. Defining functions and methods is about clarity and human understanding, a layer that is removed at compile time, to make efficient execution the primary concern. There is a clash here, which could be improved upon.

My use case is primarily as app developer, but I have one or two modules that others use. I often do profiling of my primary application, which does some heavy reading, decompression, interning and correlations ... but that's just one profile. Another run might be to create data, not to read it. So are multiple profiles supported, and how are they prioritized?

For module developers, there is no reason not to think that they are actually the specialists on what performs or not, and not allowing them to "chip in" with knowledge (either by manual inline pragma hints or by supplying profiling data to be merged at compile time) would be a shame.

@cherrymui
Copy link
Member Author

cherrymui commented Nov 14, 2022

So are multiple profiles supported, and how are they prioritized?

Sorry, I'm not sure I really understand your question. I'll try to answer as I understand it. For multiple profiles for the same program, user can specify which profile to use in the command line (go build -pgo=<profile>). If a program has different workloads, only the user would know what workloads to optimize for in a particular build. Or, if one wants to build a single binary that works reasonably well for different workloads, they can merge the profiles before the build.

For module developers, there is no reason not to think that they are actually the specialists on what performs or not, and not allowing them to "chip in" with knowledge (either by manual inline pragma hints or by supplying profiling data to be merged at compile time) would be a shame.

Module/library profiles are on our road map, and we definitely want to support it in the future. It is just that it still requires more thinking and designing, so not done in Go 1.20. Thanks for understanding.

gopherbot pushed a commit that referenced this issue Nov 16, 2022
Add "auto" mode for the -pgo build flag. When -pgo=auto is
specified, if there is a default.pgo file in the directory of the
main package, it will be selected and used for the build.

Currently it requires exactly one main package when -pgo=auto is
specified. (We'll support multiple main packages in the future.)

Also apply to other build-related subcommands, "go install", "go
run", "go test", and "go list".

For #55022.

Change-Id: Iab7974ab8932daf0e83506de505e044a8e412466
Reviewed-on: https://go-review.googlesource.com/c/go/+/438737
TryBot-Result: Gopher Robot <gobot@golang.org>
Reviewed-by: Bryan Mills <bcmills@google.com>
Run-TryBot: Bryan Mills <bcmills@google.com>
@gopherbot
Copy link

gopherbot commented Nov 28, 2022

Change https://go.dev/cl/453636 mentions this issue: doc/go1.20: add release notes for PGO

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
Projects
Status: In Progress
Status: Accepted
Development

No branches or pull requests