-
Notifications
You must be signed in to change notification settings - Fork 17.5k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
cmd/compile: merge temporaries and named variables by GC shape #62737
Comments
Change https://go.dev/cl/566177 mentions this issue: |
Change https://go.dev/cl/553055 mentions this issue: |
Introduce a helper type "Intervals" that contains sets of sorted disjoint ranges corresponding to live ranges within a function. Example: the Intervals set "{ [0,1), [4,10) }" would indicate that something is live starting at instruction 0, then up to but not including instruction 1, then dead from 1-3, then live again at instruction 4 up to (but not including) instruction 10. This patch provides APIs for constructing interval sets, testing to see whether two sets overlap, and unioning/merging together two intervals sets. Updates #62737. Updates #65532. Updates #65495. Change-Id: I7140a5989eba93bf3b8762d9224261f5eba0646d Reviewed-on: https://go-review.googlesource.com/c/go/+/566177 LUCI-TryBot-Result: Go LUCI <golang-scoped@luci-project-accounts.iam.gserviceaccount.com> Reviewed-by: Cherry Mui <cherryyz@google.com>
Change https://go.dev/cl/575415 mentions this issue: |
I have a roll-forward CL currently being tested that should resolve the PPC64/S390x problems that caused the original CL to be reverted. |
Change https://go.dev/cl/576680 mentions this issue: |
Change https://go.dev/cl/576735 mentions this issue: |
Change https://go.dev/cl/577615 mentions this issue: |
Change https://go.dev/cl/576681 mentions this issue: |
[This is a partial roll-forward of CL 553055, the main change here is that the stack slot overlap operation is flagged off by default (can be enabled by hand with -gcflags=-d=mergelocals=1) ] Preliminary compiler support for merging/overlapping stack slots of local variables whose access patterns are disjoint. This patch includes changes in AllocFrame to do the actual merging/overlapping based on information returned from a new liveness.MergeLocals helper. The MergeLocals helper identifies candidates by looking for sets of AUTO variables that either A) have the same size and GC shape (if types contain pointers), or B) have the same size (but potentially different types as long as those types have no pointers). Variables must be greater than (3*types.PtrSize) in size to be considered for merging. After forming candidates, MergeLocals collects variables into "can be overlapped" equivalence classes or partitions; this process is driven by an additional liveness analysis pass. Ideally it would be nice to move the existing stackmap liveness pass up before AllocFrame and "widen" it to include merge candidates so that we can do just a single liveness as opposed to two passes, however this may be difficult given that the merge-locals liveness has to take into account writes corresponding to dead stores. This patch also required a change to the way ssa.OpVarDef pseudo-ops are generated; prior to this point they would only be created for variables whose type included pointers; if stack slot merging is enabled then the ssagen code creates OpVarDef ops for all auto vars that are merge candidates. Note that some temporaries created late in the compilation process (e.g. during ssa backend) are difficult to reason about, especially in cases where we take the address of a temp and pass it to the runtime. For the time being we mark most of the vars created post-ssagen as "not a merge candidate". Stack slot merging for locals/autos is enabled by default if "-N" is not in effect, and can be disabled via "-gcflags=-d=mergelocals=0". Fixmes/todos/restrictions: - try lowering size restrictions - re-evaluate the various skips that happen in SSA-created autotmps Updates #62737. Updates #65532. Updates #65495. Change-Id: Ifda26bc48cde5667de245c8a9671b3f0a30bb45d Reviewed-on: https://go-review.googlesource.com/c/go/+/575415 LUCI-TryBot-Result: Go LUCI <golang-scoped@luci-project-accounts.iam.gserviceaccount.com> Reviewed-by: Cherry Mui <cherryyz@google.com>
For -gcflags=-d=mergelocalstrace=1 (which reports estimated savings from stack slot merging), emit separate values for pointerful vs non-pointerful variables, for a bit more detail. Updates #62737. Updates #65532. Updates #65495. Change-Id: I9dd27d2a254036448c85c13d189d1ed36157c9d7 Reviewed-on: https://go-review.googlesource.com/c/go/+/576680 LUCI-TryBot-Result: Go LUCI <golang-scoped@luci-project-accounts.iam.gserviceaccount.com> Reviewed-by: Cherry Mui <cherryyz@google.com>
…didates It is possible to have situations where a given ir.Name is non-address-taken at the source level, but whose address is materialized in order to accommodate the needs of arch-dependent memory ops. The issue here is that the SymAddr op will show up as touching a variable of interest, but the subsequent memory op will not. This is generally not an issue for computing whether something is live across a call, but it is problematic for collecting the more fine-grained live interval info that drives stack slot merging. As an example, consider this Go code: package p type T struct { x [10]int f float64 } func ABC(i, j int) int { var t T t.x[i&3] = j return t.x[j&3] } On amd64 the code sequences we'll see for accesses to "t" might look like v10 = VarDef <mem> {t} v1 v5 = MOVOstoreconst <mem> {t} [val=0,off=0] v2 v10 v23 = LEAQ <*T> {t} [8] v2 : DI v12 = DUFFZERO <mem> [80] v23 v5 v14 = ANDQconst <int> [3] v7 : AX v19 = MOVQstoreidx8 <mem> {t} v2 v14 v8 v12 v22 = ANDQconst <int> [3] v8 : BX v24 = MOVQloadidx8 <int> {t} v2 v22 v19 : AX v25 = MakeResult <int,mem> v24 v19 : <> Note that the the loads and stores (ex: v19, v24) all refer directly to "t", which means that regular live analysis will work fine for identifying variable lifetimes. The DUFFZERO is (in effect) an indirect write, but since there are accesses immediately after it we wind up with the same live intervals. Now the same code with GOARCH=ppc64: v10 = VarDef <mem> {t} v1 v20 = MOVDaddr <*T> {t} v2 : R20 v12 = LoweredZero <mem> [88] v20 v10 v3 = CLRLSLDI <int> [212543] v7 : R5 v15 = MOVDaddr <*T> {t} v2 : R6 v19 = MOVDstoreidx <mem> v15 v3 v8 v12 v29 = CLRLSLDI <int> [212543] v8 : R4 v24 = MOVDloadidx <int> v15 v29 v19 : R3 v25 = MakeResult <int,mem> v24 v19 : <> Here instead of memory ops that refer directly to the symbol, we take the address of "t" (ex: v15) and then pass the address to memory ops (where the ops themselves no longer refer to the symbol). This patch enhances the stack slot merging liveness analysis to handle cases like the PPC64 one above. We add a new phase in candidate selection that collects more precise use information for merge candidates, and screens out candidates that are too difficult to analyze. The phase make a forward pass over each basic block looking for instructions of the form vK := SymAddr(N) where N is a raw candidate. It then creates an entry in a map with key vK and value holding name and the vK use count. As the walk continues, we check for uses of of vK: when we see one, record it in a side table as an upwards exposed use of N. At each vK use we also decrement the use count in the map entry, and if we hit zero, remove the map entry. If we hit the end of the basic block and we still have map entries, this implies that the address in question "escapes" the block -- at that point to be conservative we just evict the name in question from the candidate set. Although this CL fixes the issues that forced a revert of the original merging CL, this CL doesn't enable stack slot merging by default; a subsequent CL will do that. Updates #62737. Updates #65532. Updates #65495. Change-Id: Id41d359a677767a8e7ac1e962ae23f7becb4031f Reviewed-on: https://go-review.googlesource.com/c/go/+/576735 Reviewed-by: Cherry Mui <cherryyz@google.com> LUCI-TryBot-Result: Go LUCI <golang-scoped@luci-project-accounts.iam.gserviceaccount.com>
Flag flip to enable stack slot merging by default when optimizing. Please see the earlier CL for details on what this is doing. Updates #62737. Updates #65532. Updates #65495. Change-Id: I8e30d553e74ace43d418f883199721f05320d3d7 Reviewed-on: https://go-review.googlesource.com/c/go/+/576681 LUCI-TryBot-Result: Go LUCI <golang-scoped@luci-project-accounts.iam.gserviceaccount.com> Reviewed-by: Cherry Mui <cherryyz@google.com>
This patch revises the algorithm/strategy used for overlapping the stack slots of disjointly accessed local variables. The main change here is to allow merging the stack slot of B into the slot for A if B's size is less then A (prior to this they had to be identical), and to also allow merging a non-pointer variables into pointer-variable slots. The new algorithm sorts the candidate list first by pointerness (pointer variables first), then by alignment, then by size, and finally by name. We no longer check that two variables have the same GC shape before merging: since it should never be the case that we have two vars X and Y both live across a given callsite where X and Y share a stack slot, their gc shape doesn't matter. Doing things this new way increases the total number of bytes saved (across all functions) from 91256 to 124336 for the sweet benchmarks. Updates #62737. Updates #65532. Updates #65495. Change-Id: I1daaac1b1240aa47a6975e98ccd24e03304ab602 Reviewed-on: https://go-review.googlesource.com/c/go/+/577615 LUCI-TryBot-Result: Go LUCI <golang-scoped@luci-project-accounts.iam.gserviceaccount.com> Reviewed-by: Cherry Mui <cherryyz@google.com>
Change https://go.dev/cl/581439 mentions this issue: |
Postscript on this issue: the CLs I've submitted enable merging of stack slots for disjointly accessed local variables. The optimization targets regular user-declared variables, variables that are imported into a function as a result of inlining, and most autotmp vars (we stay away from those created late in the SSA backend). The new optimization does not handle stack-allocated vars that are address-taken and read/written indirectly; I have some ideas on how to do this but it is a bit of a larger project. One other note is that the merging/overlapping algorithm doesn't actually look at GC shape, this turns out to not be necessary. If you have two vars X and Y whose access patterns are entirely disjoint, then it will never be the case that both X and Y are live across a call, so putting them in the same stack slot is fine even if they have different types. Closing out this issue. |
Add a blurb to the toolchain section talking about stack slot merging. Updates #62737. Change-Id: I26193a6a381c95ff5d79ce80b77c10c7561d00cc Reviewed-on: https://go-review.googlesource.com/c/go/+/581439 LUCI-TryBot-Result: Go LUCI <golang-scoped@luci-project-accounts.iam.gserviceaccount.com> Reviewed-by: Austin Clements <austin@google.com> Reviewed-by: Cherry Mui <cherryyz@google.com>
Once upon a time (~10 years ago) the Go compiler would merge temporaries on the stack (https://golang.org/cl/12829043). At some point, however, this optimization was removed.
Stack space is generally cheap compared to heap space (since it doesn't generally increase the rate of garbage collection), but in some circumstances, the lack of merged temporaries can lead to a substantial amount of stack space waste.
Some applications have a lot of stacks floating around (e.g. Kubernetes' API server, via "watch" requests), in which case the cost is direct: additional memory use. Another cost is that a larger stack may means more stack growth, which takes CPU time. Lastly, the GC has to scan these stacks, and while the cost mostly proportional to the number of live pointers on the stack, the GC would still have less work to do overall for each stack.
Work by Uber and Kubernetes has shown that stack space overheads can be significantly reduced by manually re-scoping named variables.
It would be tricky to change the locations of pointers on the stack mid-function, but we can at least merge stack slots where the pointer/scalar data (GC shape) is the same. Credit to @mdempsky for the idea for reviving this.
This is a more general version of #62077, which focuses on aggregates.
The text was updated successfully, but these errors were encountered: