
runtime: long latency of sweep assists #57523

Closed
zyguan opened this issue Dec 30, 2022 · 13 comments
Labels: compiler/runtime (Issues related to the Go compiler and/or runtime), FrozenDueToAge, NeedsFix (The path to resolution is known, but the work has not been done)
Milestone: Go1.21
zyguan commented Dec 30, 2022

What version of Go are you using (go version)?

$ go version
go version go1.19.1 linux/amd64

Does this issue reproduce with the latest release?

Not sure, but it reproduces at least with go1.19.1.

What operating system and processor architecture are you using (go env)?

go env Output
$ go env
GO111MODULE=""
GOARCH="amd64"
GOBIN=""
GOCACHE="/home/zyguan/.cache/go-build"
GOENV="/home/zyguan/.config/go/env"
GOEXE=""
GOEXPERIMENT=""
GOFLAGS=""
GOHOSTARCH="amd64"
GOHOSTOS="linux"
GOINSECURE=""
GOMODCACHE="/home/zyguan/go/pkg/mod"
GONOPROXY=""
GONOSUMDB=""
GOOS="linux"
GOPATH="/home/zyguan/go"
GOPRIVATE=""
GOPROXY="https://proxy.golang.org,direct"
GOROOT="/usr/local/lib/go"
GOSUMDB="sum.golang.org"
GOTMPDIR=""
GOTOOLDIR="/usr/local/lib/go/pkg/tool/linux_amd64"
GOVCS=""
GOVERSION="go1.19.1"
GCCGO="gccgo"
GOAMD64="v1"
AR="ar"
CC="gcc"
CXX="g++"
CGO_ENABLED="1"
GOMOD="/usr/local/lib/go/src/go.mod"
GOWORK=""
CGO_CFLAGS="-g -O2"
CGO_CPPFLAGS=""
CGO_CXXFLAGS="-g -O2"
CGO_FFLAGS="-g -O2"
CGO_LDFLAGS="-g -O2"
PKG_CONFIG="pkg-config"
GOGCCFLAGS="-fPIC -m64 -pthread -Wl,--no-gc-sections -fmessage-length=0 -fdebug-prefix-map=/tmp/go-build2199602951=/tmp/go-build -gno-record-gcc-switches"

What did you do?

I'm investigating a long-tail latency issue in TiDB and found that sweep assists are what lead to the slow queries.

[screenshot: execution trace showing goroutines blocked in sweep assist]

I also added a debug log and found that heapLive might be less than sweepHeapLiveBasis here. As a result, we get an extremely large pagesTarget (because of unsigned underflow) and then spend a lot of time sweeping (needing to sweep 2767037498505953 pages just to allocate 8192 bytes).

[screenshot: debug log output showing the huge pagesTarget]

diff --git a/src/runtime/mgcsweep.go b/src/runtime/mgcsweep.go
index de57f18..d26d03b 100644
--- a/src/runtime/mgcsweep.go
+++ b/src/runtime/mgcsweep.go
@@ -810,6 +810,7 @@ func deductSweepCredit(spanBytes uintptr, callerSweepPages uintptr) {
                return
        }
 
+       startTime := nanotime()
        if trace.enabled {
                traceGCSweepStart()
        }
@@ -818,7 +819,8 @@ retry:
        sweptBasis := mheap_.pagesSweptBasis.Load()
 
        // Fix debt if necessary.
-       newHeapLive := uintptr(atomic.Load64(&gcController.heapLive)-mheap_.sweepHeapLiveBasis) + spanBytes
+       heapLive := atomic.Load64(&gcController.heapLive)
+       newHeapLive := uintptr(heapLive-mheap_.sweepHeapLiveBasis) + spanBytes
        pagesTarget := int64(mheap_.sweepPagesPerByte*float64(newHeapLive)) - int64(callerSweepPages)
        for pagesTarget > int64(mheap_.pagesSwept.Load()-sweptBasis) {
                if sweepone() == ^uintptr(0) {
@@ -834,6 +836,9 @@ retry:
        if trace.enabled {
                traceGCSweepDone()
        }
+       if dur := nanotime()-startTime; dur > 100000000 {
+               print(">>> deductSweepCredit dur:", dur/1000000, "ms span-bytes:", spanBytes, " heap-live:", heapLive, " heap-live-basis:", mheap_.sweepHeapLiveBasis, " target:", pagesTarget, "\n")
+       }
 }
 
 // clobberfree sets the memory content at x to bad content, for debugging

What did you expect to see?

Reasonable latency for sweep assists.

What did you see instead?

It takes hundreds of milliseconds in deductSweepCredit.

@gopherbot gopherbot added the compiler/runtime Issues related to the Go compiler and/or runtime. label Dec 30, 2022
@dmitshur dmitshur added the NeedsInvestigation Someone must examine and confirm this is a valid issue and not a duplicate of an existing one. label Dec 31, 2022
@dmitshur dmitshur added this to the Backlog milestone Dec 31, 2022
dmitshur (Contributor):

CC @golang/runtime.

mknyszek (Contributor) commented Jan 3, 2023

Thanks for the detailed report and investigation!

I also added a debug log and found that the heapLive might be less than the sweepHeapLiveBasis here. As a result, we got a extremely large pagesTarget (because of underflow) and then spent a lot of time on sweeping (need to sweep 2767037498505953 pages for allocating 8192 bytes).

Yeah, this just seems like a bug, and like it would cause exactly this latency issue. We should check for overflow defensively. I think that code assumes heapLive never drops below sweepHeapLiveBasis (set here: https://cs.opensource.google/go/go/+/master:src/runtime/mgcsweep.go;l=941;bpv=1;bpt=1?q=sweepHeapLiveBasis&ss=go%2Fgo), because that code normally runs during a STW at the beginning of the sweep phase.

However, if your application is calling runtime/debug.SetMemoryLimit or runtime/debug.SetGCPercent semi-regularly, then that could absolutely cause this kind of overflow, since sweepHeapLiveBasis might get set to a heapLive value that later decreases. That sweepHeapLiveBasis access should also at the very least be atomic (though that alone is insufficient to fix the issue).

Does TiDB call runtime/debug.SetMemoryLimit or runtime/debug.SetGCPercent (and if so, regularly)? If not, I think this needs some more investigation to understand how this happens (though checking for overflow should still fix the problem).

@mknyszek mknyszek self-assigned this Jan 3, 2023
zyguan (Author) commented Jan 4, 2023

Does TiDB call runtime/debug.SetMemoryLimit or runtime/debug.SetGCPercent (and if so, regularly?)?

Yes, TiDB calls both of them regularly by registering finalizers here and here.

mknyszek (Contributor) commented Jan 4, 2023

Got it, thanks. Then I think I'm satisfied with calls to these functions as the complete explanation for how we can see this overflow. I'll send out a patch shortly.

mknyszek (Contributor) commented Jan 4, 2023

Actually, looking again at your execution trace, the fact that so many goroutines get stuck in sweep assist all at the same time is suspicious. There is another facet to this story: heapLive doesn't increase monotonically, and it may decrease if an mcache is force-flushed. Something as innocent as a ReadMemStats call could cause this.

That's how we can get into this persistently bad state: mcaches get flushed, and inevitably each goroutine tries to refill its mcache. Then they all get stuck because heapLive < sweepHeapLiveBasis.

mknyszek (Contributor) commented Jan 4, 2023

Sent for review: https://go.dev/cl/460376

gopherbot (Contributor):

Change https://go.dev/cl/460376 mentions this issue: runtime: check for overflow in sweep assist

mknyszek (Contributor) commented Jan 4, 2023

An interesting question is whether this is bad enough to deserve a backport. The fix is very small and pretty safe.

zyguan (Author) commented Jan 4, 2023

Thank you for your prompt response, @mknyszek .

The issue seems to be related to the calls to runtime/debug.SetMemoryLimit. I cannot reproduce it after setting tidb_server_memory_limit to 0, which prevents TiDB from calling runtime/debug.SetMemoryLimit regularly.

mknyszek (Contributor):

@gopherbot Please open up backport issues for Go 1.19 and Go 1.20. This issue has the potential to cause severe latency issues if runtime/debug.SetMemoryLimit or runtime/debug.SetGCPercent is called often. The fix is also small and safe (an overflow check on an unsigned subtraction).

Just about to land the fix. I think I'm in favor of backporting.

gopherbot (Contributor):

Backport issue(s) opened: #58535 (for 1.19), #58536 (for 1.20).

Remember to create the cherry-pick CL(s) as soon as the patch is submitted to master, according to https://go.dev/wiki/MinorReleases.

gopherbot (Contributor):

Change https://go.dev/cl/468375 mentions this issue: [release-branch.go1.20] runtime: check for overflow in sweep assist

gopherbot (Contributor):

Change https://go.dev/cl/468376 mentions this issue: [release-branch.go1.19] runtime: check for overflow in sweep assist

@dmitshur dmitshur modified the milestones: Backlog, Go1.21 Feb 15, 2023
@dmitshur dmitshur added NeedsFix The path to resolution is known, but the work has not been done. and removed NeedsInvestigation Someone must examine and confirm this is a valid issue and not a duplicate of an existing one. labels Feb 15, 2023
gopherbot pushed a commit that referenced this issue Feb 15, 2023
The sweep assist computation is intentionally racy for performance,
since the specifics of sweep assist aren't super sensitive to error.
However, if overflow occurs when computing the live heap delta, we can
end up with a massive sweep target that causes the sweep assist to sweep
until sweep termination, causing severe latency issues. In fact, because
heapLive doesn't always increase monotonically then anything that
flushes mcaches will cause _all_ allocating goroutines to inevitably get
stuck in sweeping.

Consider the following scenario:
1. SetGCPercent is called, updating sweepHeapLiveBasis to heapLive.
2. Very shortly after, ReadMemStats is called, flushing mcaches and
   decreasing heapLive below the value sweepHeapLiveBasis was set to.
3. Every allocating goroutine goes to refill its mcache, calls into
   deductSweepCredit for sweep assist, and gets stuck sweeping until
   the sweep phase ends.

Fix this by just checking for overflow in the delta live heap calculation
and if it would overflow, pick a small delta live heap. This probably
means that no sweeping will happen at all, but that's OK. This is a
transient state and the runtime will recover as soon as heapLive
increases again.

Note that deductSweepCredit doesn't check overflow on other operations
but that's OK: those operations are signed and extremely unlikely to
overflow. The subtraction targeted by this CL is only a problem because
it's unsigned. An alternative fix would be to make the operation signed,
but being explicit about the overflow situation seems worthwhile.

For #57523.
Fixes #58536.

Change-Id: Ib18f71f53468e913548aac6e5358830c72ef0215
Reviewed-on: https://go-review.googlesource.com/c/go/+/468375
Reviewed-by: Michael Pratt <mpratt@google.com>
Run-TryBot: Michael Knyszek <mknyszek@google.com>
TryBot-Result: Gopher Robot <gobot@golang.org>
gopherbot pushed a commit that referenced this issue Feb 15, 2023
The sweep assist computation is intentionally racy for performance,
since the specifics of sweep assist aren't super sensitive to error.
However, if overflow occurs when computing the live heap delta, we can
end up with a massive sweep target that causes the sweep assist to sweep
until sweep termination, causing severe latency issues. In fact, because
heapLive doesn't always increase monotonically then anything that
flushes mcaches will cause _all_ allocating goroutines to inevitably get
stuck in sweeping.

Consider the following scenario:
1. SetGCPercent is called, updating sweepHeapLiveBasis to heapLive.
2. Very shortly after, ReadMemStats is called, flushing mcaches and
   decreasing heapLive below the value sweepHeapLiveBasis was set to.
3. Every allocating goroutine goes to refill its mcache, calls into
   deductSweepCredit for sweep assist, and gets stuck sweeping until
   the sweep phase ends.

Fix this by just checking for overflow in the delta live heap calculation
and if it would overflow, pick a small delta live heap. This probably
means that no sweeping will happen at all, but that's OK. This is a
transient state and the runtime will recover as soon as heapLive
increases again.

Note that deductSweepCredit doesn't check overflow on other operations
but that's OK: those operations are signed and extremely unlikely to
overflow. The subtraction targeted by this CL is only a problem because
it's unsigned. An alternative fix would be to make the operation signed,
but being explicit about the overflow situation seems worthwhile.

For #57523.
Fixes #58535.

Change-Id: Ib18f71f53468e913548aac6e5358830c72ef0215
Reviewed-on: https://go-review.googlesource.com/c/go/+/468376
Reviewed-by: Michael Pratt <mpratt@google.com>
Auto-Submit: Michael Pratt <mpratt@google.com>
Run-TryBot: Michael Pratt <mpratt@google.com>
TryBot-Result: Gopher Robot <gobot@golang.org>
johanbrandhorst pushed a commit to Pryz/go that referenced this issue Feb 22, 2023 (same commit message; the original fix on master, "Fixes golang#57523", Reviewed-on: https://go-review.googlesource.com/c/go/+/460376.)
romaindoumenc pushed a commit to TroutSoftware/go that referenced this issue Mar 3, 2023 (same commit message as the Feb 15 push).
golang locked and limited conversation to collaborators Feb 15, 2024