cmd/compile: pgo can dramatically increase goroutine stack size #65532

Open

felixge opened this issue Feb 5, 2024 · 4 comments
Labels: compiler/runtime, NeedsInvestigation
Milestone: Backlog

Comments

felixge (Contributor) commented Feb 5, 2024

Go version

go1.21.5

Output of go env in your module/workspace:

GOARCH=amd64
GOOS=linux
GOAMD64=v1

What did you do?

Build my application using a default.pgo CPU profile from production.

What did you see happen?

Go memory usage (/memory/classes/total:bytes − /memory/classes/heap/released:bytes) increased from 720 MB to 850 MB (18%) until the rollback; see below.

[Datadog screenshot: koutris-forwarder-intake Go memory usage rising after the PGO rollout and recovering at rollback]

This increase in memory usage seems to have been caused by an increase in goroutine stack size (/memory/classes/heap/stacks:bytes) from 207 MB to 280 MB (35%).

[Datadog screenshot: koutris-forwarder-intake goroutine stack memory (/memory/classes/heap/stacks:bytes) before/after PGO]

This increase was not due to an increase in the number of active goroutines, but to an increase in the average stack size (/memory/classes/heap/stacks:bytes divided by /sched/goroutines:goroutines).

[Datadog screenshot: koutris-forwarder-intake average goroutine stack size before/after PGO]
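For anyone who wants to reproduce these measurements: the metric names above come from Go's runtime/metrics package. A minimal sketch of sampling them (illustrative only, not the monitoring code behind the graphs):

```go
package main

import (
	"fmt"
	"runtime/metrics"
)

func main() {
	samples := []metrics.Sample{
		{Name: "/memory/classes/total:bytes"},
		{Name: "/memory/classes/heap/released:bytes"},
		{Name: "/memory/classes/heap/stacks:bytes"},
		{Name: "/sched/goroutines:goroutines"},
	}
	metrics.Read(samples)

	total := samples[0].Value.Uint64()
	released := samples[1].Value.Uint64()
	stacks := samples[2].Value.Uint64()
	goroutines := samples[3].Value.Uint64()

	// "Go memory usage" as defined in this issue: total minus released.
	fmt.Printf("go memory usage: %d bytes\n", total-released)
	fmt.Printf("stacks: %d bytes / %d goroutines = %d bytes avg\n",
		stacks, goroutines, stacks/goroutines)
}
```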

To debug this further, I built a hacky goroutine stack frame profiler. This pointed me to google.golang.org/grpc/internal/transport.(*loopyWriter).run.
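The profiler itself isn't included here; a much cruder approximation of the same idea (my sketch below, not the actual tool) counts goroutines per function from a full stack dump, which could then be combined with per-function frame sizes, e.g. extracted from DWARF:

```go
package main

import (
	"fmt"
	"runtime"
	"strings"
)

// countGoroutinesByFunc dumps every goroutine stack and counts how many
// stacks contain each function. Multiplying such counts by per-function
// frame sizes gives the kind of estimate described above.
func countGoroutinesByFunc() map[string]int {
	buf := make([]byte, 1<<22) // may truncate with very many goroutines
	n := runtime.Stack(buf, true)
	counts := make(map[string]int)
	for _, line := range strings.Split(string(buf[:n]), "\n") {
		// Frames look like "pkg.Func(args)"; skip goroutine headers,
		// tab-indented file:line lines, and "created by" lines.
		if line == "" || strings.HasPrefix(line, "goroutine ") ||
			strings.HasPrefix(line, "\t") || strings.HasPrefix(line, "created by ") {
			continue
		}
		name := line
		// Trim the argument list; LastIndex so method receivers like
		// "(*loopyWriter)" survive.
		if i := strings.LastIndex(line, "("); i > 0 {
			name = line[:i]
		}
		counts[name]++
	}
	return counts
}

func main() {
	for name, n := range countGoroutinesByFunc() {
		fmt.Printf("%6d %s\n", n, name)
	}
}
```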

For the binary compiled without pgo, my tool estimated 2MB of stack usage for ~1000 goroutines:

[Screenshot: goroutine_space output without PGO, ~2 MB of stack across ~1000 loopyWriter.run goroutines]

And for the binary compiled with pgo, my tool estimated 71MB of stack usage for ~1000 goroutines:

[Screenshot: goroutine_space output with PGO, ~71 MB of stack across ~1000 loopyWriter.run goroutines]

Looking at the assembly, it becomes clear that this is due to the frame size increasing from 0x50 (80) bytes to 0xc1f8 (49656) bytes.

Assembly:

before pgo:

TEXT google.golang.org/grpc/internal/transport.(*loopyWriter).run(SB) /go/pkg/mod/google.golang.org/grpc@v1.58.2/internal/transport/controlbuf.go
  0x8726e0              493b6610                CMPQ SP, 0x10(R14)                   // cmp 0x10(%r14),%rsp
  0x8726e4              0f86ab020000            JBE 0x872995                         // jbe 0x872995
  0x8726ea              55                      PUSHQ BP                             // push %rbp
  0x8726eb              4889e5                  MOVQ SP, BP                          // mov %rsp,%rbp
  0x8726ee              4883ec50                SUBQ $0x50, SP                       // sub $0x50,%rsp

after pgo:

TEXT google.golang.org/grpc/internal/transport.(*loopyWriter).run(SB) /go/pkg/mod/google.golang.org/grpc@v1.58.2/internal/transport/controlbuf.go
  0x8889a0              4989e4                          MOVQ SP, R12                         // mov %rsp,%r12
  0x8889a3              4981ec80c10000                  SUBQ $0xc180, R12                    // sub $0xc180,%r12
  0x8889aa              0f82c0300000                    JB 0x88ba70                          // jb 0x88ba70
  0x8889b0              4d3b6610                        CMPQ R12, 0x10(R14)                  // cmp 0x10(%r14),%r12
  0x8889b4              0f86b6300000                    JBE 0x88ba70                         // jbe 0x88ba70
  0x8889ba              55                              PUSHQ BP                             // push %rbp
  0x8889bb              4889e5                          MOVQ SP, BP                          // mov %rsp,%rbp
  0x8889be              4881ecf8c10000                  SUBQ $0xc1f8, SP                     // sub $0xc1f8,%rsp

And the root cause appears to be the inlining of 3 calls to processData, each of which allocates a 16 KiB byte array on its stack.
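The effect is easy to reproduce with a toy program (hypothetical names below, not the actual gRPC code): when a function that keeps a large array on its stack is inlined at several call sites in the same caller, each inlined copy currently gets its own stack slot:

```go
// processChunk stands in for processData: it keeps a 16 KiB buffer
// on the stack (localBuf does not escape, so it is not heap-allocated).
func processChunk(p []byte) int {
	var localBuf [16 * 1024]byte
	return copy(localBuf[:], p)
}

// run stands in for loopyWriter.run. If the compiler inlines all three
// calls, run's frame contains three separate 16 KiB slots (~48 KiB),
// even though at most one buffer is live at any given time.
func run(a, b, c []byte) int {
	return processChunk(a) + processChunk(b) + processChunk(c)
}
```

Building with `go build -gcflags=-m` confirms the inlining decisions, and the resulting frame size shows up in the TEXT line of `-gcflags=-S` output.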

What did you expect to see?

No significant increase in memory usage.

Maybe PGO could take frame sizes into account for inlining, especially if multiple calls are being made to a function that has a large frame size.

Meanwhile, maybe we should send a PR that adds a //go:noinline pragma to the processData func in gRPC. Given the current code structure, it seems highly undesirable to inline this function up to 3 times in the run method.
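In toy terms, the mitigation is a one-line pragma on the function that owns the big buffer (a sketch against the example above, not an actual gRPC patch):

```go
// The pragma forces a real call, so the 16 KiB frame only exists
// while processChunk is actually executing.
//go:noinline
func processChunk(p []byte) int {
	var localBuf [16 * 1024]byte
	return copy(localBuf[:], p)
}
```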

cc @prattmic

gopherbot added the compiler/runtime label on Feb 5, 2024
prattmic (Member) commented Feb 6, 2024

Thanks for the detailed report! I know you are aware, but just to be clear to others skimming the issue: this stack size estimate tool will likely underestimate the actual stack allocation size. We are seeing the goroutines sitting parked in loopyWriter.run, below a call to processData. But presumably these goroutines do call processData, at which point they will need 16KB of stack.

Inlining three copies of this function is a particularly bad case because without inlining you can't call processData three times at once, so the peak stack use is 16KB, but with inlining, the frame size of the caller is 16KB*3.

I agree with you that frame size does seem reasonable to consider when making inlining decisions. We probably want to avoid making frames too large. This is potentially in scope for #61502, or a follow-up to that (cc @mdempsky @thanm). I don't think this would really be PGO-specific, though PGO inlining may make us more likely to hit general frame size thresholds. I'm also not sure how good of a sense of final frame size we have during inlining; this is pretty early in compilation and even before we do escape analysis.

At the other end of the spectrum, I wonder if we could more aggressively reuse stack slots for large objects. Since localBuf doesn't escape the function, it seems likely that even after inlining, the three uses of localBuf are mutually exclusive and could technically use the same stack slot. I have no idea how complicated this would be.
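For comparison, that sharing can be simulated by hand today by hoisting the buffer into the caller (a hand-written sketch of the transformation, using the toy names from above; not what the compiler would actually emit):

```go
// processChunkInto is the toy processData rewritten to borrow its
// buffer from the caller instead of declaring its own.
func processChunkInto(buf *[16 * 1024]byte, p []byte) int {
	return copy(buf[:], p)
}

// runShared keeps a single 16 KiB buffer in its frame; all three
// (still inlinable) calls reuse it, since their uses never overlap.
func runShared(a, b, c []byte) int {
	var localBuf [16 * 1024]byte
	return processChunkInto(&localBuf, a) +
		processChunkInto(&localBuf, b) +
		processChunkInto(&localBuf, c)
}
```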

cc @golang/compiler @cherrymui @aclements

thanm (Contributor) commented Feb 7, 2024

As part of my work on inlining heuristics I considered adding a heuristic that would bias against inlining functions with large stack frames, but didn't get as far as implementing it. It is a tricky thing to get right, since the inliner runs at an early stage in the compiler, which makes it hard to compute an accurate stack size estimate.

I think a better long term solution here is something like #62737, e.g. do the inlines but then arrange for the three inlined blobs to share the same instance of the array in question. This can be done in the back end (e.g. late during stack frame layout) or we can change the inliner itself to reuse temps (this is something that the LLVM inliner does IIRC).

felixge (Contributor, Author) commented Feb 7, 2024

> I know you are aware, but just to be clear to others skimming the issue: this stack size estimate tool will likely underestimate the actual stack allocation size.

Yup, it's a rough estimate. As discussed during the last diagnostic sync, I'll try to file my proposal for a runtime implementation of this profile type, which would allow us to overcome some of the estimation issues.

> I think a better long term solution here is something like #62737, e.g. do the inlines but then arrange for the three inlined blobs to share the same instance of the array in question.

That's probably a good solution in most cases, including this one.

There will still be edge cases, e.g. A inlining B (which has a big frame) and then calling C in a for loop forever; see the sketch below. But maybe those edge cases are rare enough to not be a problem in practice.
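In the toy terms from above, that edge case looks roughly like this (hypothetical sketch):

```go
// consume stands in for C: a cheap call in a long-running loop.
func consume(n int, p []byte) {}

// a inlines processChunk (B), so a's frame permanently contains the
// 16 KiB buffer even though it is only needed before the loop; slot
// sharing doesn't help here because there is only one use.
func a(ch chan []byte) {
	n := processChunk(<-ch)
	for p := range ch {
		consume(n, p)
	}
}
```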

mknyszek added the NeedsInvestigation label on Feb 7, 2024
mknyszek added this to the Backlog milestone on Feb 7, 2024
aclements (Member) commented:
This is closely related to #62077, #62737, and #65495.
