
runtime: long latency of sweep assists #57523

Open
zyguan opened this issue Dec 30, 2022 · 9 comments

Comments

zyguan commented Dec 30, 2022

What version of Go are you using (go version)?

$ go version
go version go1.19.1 linux/amd64

Does this issue reproduce with the latest release?

Not sure, but it reproduces at least with go1.19.1.

What operating system and processor architecture are you using (go env)?

go env Output
$ go env
GO111MODULE=""
GOARCH="amd64"
GOBIN=""
GOCACHE="/home/zyguan/.cache/go-build"
GOENV="/home/zyguan/.config/go/env"
GOEXE=""
GOEXPERIMENT=""
GOFLAGS=""
GOHOSTARCH="amd64"
GOHOSTOS="linux"
GOINSECURE=""
GOMODCACHE="/home/zyguan/go/pkg/mod"
GONOPROXY=""
GONOSUMDB=""
GOOS="linux"
GOPATH="/home/zyguan/go"
GOPRIVATE=""
GOPROXY="https://proxy.golang.org,direct"
GOROOT="/usr/local/lib/go"
GOSUMDB="sum.golang.org"
GOTMPDIR=""
GOTOOLDIR="/usr/local/lib/go/pkg/tool/linux_amd64"
GOVCS=""
GOVERSION="go1.19.1"
GCCGO="gccgo"
GOAMD64="v1"
AR="ar"
CC="gcc"
CXX="g++"
CGO_ENABLED="1"
GOMOD="/usr/local/lib/go/src/go.mod"
GOWORK=""
CGO_CFLAGS="-g -O2"
CGO_CPPFLAGS=""
CGO_CXXFLAGS="-g -O2"
CGO_FFLAGS="-g -O2"
CGO_LDFLAGS="-g -O2"
PKG_CONFIG="pkg-config"
GOGCCFLAGS="-fPIC -m64 -pthread -Wl,--no-gc-sections -fmessage-length=0 -fdebug-prefix-map=/tmp/go-build2199602951=/tmp/go-build -gno-record-gcc-switches"

What did you do?

I'm investigating a long-tail latency issue in TiDB and found that sweep assists are what cause the slow queries.

[screenshot 2022-12-30_110129: execution trace showing many goroutines blocked in sweep assists]

I also added a debug log and found that heapLive can be less than sweepHeapLiveBasis here. As a result, we get an extremely large pagesTarget (because of unsigned underflow) and then spend a lot of time sweeping (needing to sweep 2767037498505953 pages just to allocate 8192 bytes).

[screenshot 2022-12-30_121552: debug log output showing the underflowed pagesTarget]

diff --git a/src/runtime/mgcsweep.go b/src/runtime/mgcsweep.go
index de57f18..d26d03b 100644
--- a/src/runtime/mgcsweep.go
+++ b/src/runtime/mgcsweep.go
@@ -810,6 +810,7 @@ func deductSweepCredit(spanBytes uintptr, callerSweepPages uintptr) {
                return
        }
 
+       startTime := nanotime()
        if trace.enabled {
                traceGCSweepStart()
        }
@@ -818,7 +819,8 @@ retry:
        sweptBasis := mheap_.pagesSweptBasis.Load()
 
        // Fix debt if necessary.
-       newHeapLive := uintptr(atomic.Load64(&gcController.heapLive)-mheap_.sweepHeapLiveBasis) + spanBytes
+       heapLive := atomic.Load64(&gcController.heapLive)
+       newHeapLive := uintptr(heapLive-mheap_.sweepHeapLiveBasis) + spanBytes
        pagesTarget := int64(mheap_.sweepPagesPerByte*float64(newHeapLive)) - int64(callerSweepPages)
        for pagesTarget > int64(mheap_.pagesSwept.Load()-sweptBasis) {
                if sweepone() == ^uintptr(0) {
@@ -834,6 +836,9 @@ retry:
        if trace.enabled {
                traceGCSweepDone()
        }
+       if dur := nanotime()-startTime; dur > 100000000 {
+               print(">>> deductSweepCredit dur:", dur/1000000, "ms span-bytes:", spanBytes, " heap-live:", heapLive, " heap-live-basis:", mheap_.sweepHeapLiveBasis, " target:", pagesTarget, "\n")
+       }
 }
 
 // clobberfree sets the memory content at x to bad content, for debugging
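
For reference, here is a minimal standalone sketch (illustrative values only, not runtime code) of the unsigned underflow described above: when heapLive drops below sweepHeapLiveBasis, the subtraction wraps around and pagesTarget explodes.

package main

import "fmt"

func main() {
	// Hypothetical values: heapLive has fallen below sweepHeapLiveBasis.
	heapLive := uint64(100 << 20)           // 100 MiB
	sweepHeapLiveBasis := uint64(120 << 20) // 120 MiB
	spanBytes := uintptr(8192)
	sweepPagesPerByte := 0.002 // made-up pacing ratio

	// Mirrors the shape of the runtime's expression:
	//   newHeapLive := uintptr(heapLive-mheap_.sweepHeapLiveBasis) + spanBytes
	// The uint64 subtraction wraps around, so newHeapLive becomes enormous.
	newHeapLive := uintptr(heapLive-sweepHeapLiveBasis) + spanBytes
	pagesTarget := int64(sweepPagesPerByte * float64(newHeapLive))

	fmt.Println("newHeapLive:", newHeapLive) // ~1.8e19 on 64-bit
	fmt.Println("pagesTarget:", pagesTarget) // an absurd number of pages for a single 8 KiB span
}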

What did you expect to see?

Reasonable latency for sweep assists.

What did you see instead?

A single call to deductSweepCredit takes hundreds of milliseconds.

gopherbot added the compiler/runtime label Dec 30, 2022
dmitshur added the NeedsInvestigation label Dec 31, 2022
dmitshur added this to the Backlog milestone Dec 31, 2022
@dmitshur
Contributor

CC @golang/runtime.

mknyszek commented Jan 3, 2023

Thanks for the detailed report and investigation!

> I also added a debug log and found that heapLive can be less than sweepHeapLiveBasis here. As a result, we get an extremely large pagesTarget (because of unsigned underflow) and then spend a lot of time sweeping (needing to sweep 2767037498505953 pages just to allocate 8192 bytes).

Yeah, this just seems like a bug, and it would cause exactly this latency issue. We should just check for overflow defensively. I think that code assumes that heapLive is never less than sweepHeapLiveBasis (set here: https://cs.opensource.google/go/go/+/master:src/runtime/mgcsweep.go;l=941;bpv=1;bpt=1?q=sweepHeapLiveBasis&ss=go%2Fgo), because that code normally runs during a STW at the beginning of the sweep phase.

However, if your application is calling runtime/debug.SetMemoryLimit or runtime/debug.SetGCPercent semi-regularly, then that could absolutely cause this kind of overflow, since sweepHeapLiveBasis can then end up greater than heapLive. sweepHeapLiveBasis should also, at the very least, be accessed atomically (but that alone is insufficient to fix the issue).

Does TiDB call runtime/debug.SetMemoryLimit or runtime/debug.SetGCPercent (and if so, regularly)? If not, I think this needs some more investigation to understand how this happens (though checking for overflow should still fix the problem).
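
To illustrate the idea of checking for overflow defensively, here is a standalone sketch (an assumption about the general shape of such a guard, not the actual patch) that only applies the heapLive delta when the subtraction cannot wrap:

package main

import "fmt"

// pagesTargetFor mirrors the shape of the computation in deductSweepCredit,
// with a guard so that the delta (live - basis) is only added when it cannot
// underflow. The name and signature are made up for this sketch.
func pagesTargetFor(live, basis uint64, spanBytes uintptr, pagesPerByte float64, callerSweepPages uintptr) int64 {
	newHeapLive := spanBytes
	if live > basis {
		newHeapLive += uintptr(live - basis)
	}
	return int64(pagesPerByte*float64(newHeapLive)) - int64(callerSweepPages)
}

func main() {
	// live below basis: the guarded version yields a small, sane target
	// instead of an underflowed value in the quadrillions.
	fmt.Println(pagesTargetFor(100<<20, 120<<20, 8192, 0.002, 1)) // prints 15
}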

mknyszek self-assigned this Jan 3, 2023

zyguan commented Jan 4, 2023

> Does TiDB call runtime/debug.SetMemoryLimit or runtime/debug.SetGCPercent (and if so, regularly)?

Yes, TiDB calls both of them regularly by registering finalizers here and here.
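
For readers unfamiliar with the pattern: below is a generic sketch of a self-re-registering finalizer that runs a tuning callback roughly once per GC cycle. It is an assumption about the general shape of such code, not TiDB's actual implementation; gcHook, tune, startTuner, and the GOGC/memory-limit values are all made up for illustration.

package main

import (
	"runtime"
	"runtime/debug"
	"time"
)

// gcHook is a sentinel object; its finalizer re-registers itself so that the
// tuning callback fires roughly once per GC cycle.
type gcHook struct{}

func tune(h *gcHook) {
	debug.SetGCPercent(100)       // placeholder value
	debug.SetMemoryLimit(8 << 30) // placeholder value (8 GiB)
	runtime.SetFinalizer(h, tune) // re-arm for the next GC cycle
}

func startTuner() {
	h := &gcHook{}
	runtime.SetFinalizer(h, tune)
	// h becomes unreachable when this function returns, so the finalizer
	// (and hence the tuning callback) runs after each subsequent GC.
}

func main() {
	startTuner()
	for i := 0; i < 3; i++ {
		runtime.GC()
		time.Sleep(10 * time.Millisecond) // give the finalizer goroutine time to run
	}
}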

mknyszek commented Jan 4, 2023

Got it, thanks. Then I think I'm satisfied with calls to these functions as the complete explanation for how we can see this overflow. I'll send out a patch shortly.

mknyszek commented Jan 4, 2023

Actually, looking again at your execution trace, the fact that so many goroutines get stuck in sweep assist all at the same time is suspicious. There is another facet to this story, and I think it's the fact that heapLive doesn't increase monotonically but may decrease if an mcache is force-flushed. Something as innocent as a ReadMemStats call could cause this.

That's how we can get into this persistently bad state: mcaches get flushed, and inevitably each goroutine tries to refill its mcache. Then they all get stuck because heapLive < sweepHeapLiveBasis.
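
As a concrete (hypothetical, not taken from the report) example of such an "innocent" call: a periodic metrics collector that polls runtime.ReadMemStats stops the world and flushes all mcaches on each call, after which heapLive may have decreased and, combined with the basis issue above, can end up below sweepHeapLiveBasis.

package main

import (
	"runtime"
	"time"
)

// collectMemStats polls runtime.ReadMemStats on a ticker. Each call stops the
// world and flushes all mcaches, which is enough to make heapLive decrease.
func collectMemStats(interval time.Duration) {
	ticker := time.NewTicker(interval)
	defer ticker.Stop()
	for range ticker.C {
		var m runtime.MemStats
		runtime.ReadMemStats(&m)
		_ = m.HeapAlloc // e.g. export to a metrics system
	}
}

func main() {
	go collectMemStats(10 * time.Second)
	// ... application workload would run here ...
	time.Sleep(time.Minute)
}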

mknyszek commented Jan 4, 2023

Sent for review: https://go.dev/cl/460376

@gopherbot

Change https://go.dev/cl/460376 mentions this issue: runtime: check for overflow in sweep assist

mknyszek commented Jan 4, 2023

An interesting question is whether this is bad enough to deserve a backport. The fix is very small and pretty safe.

zyguan commented Jan 4, 2023

Thank you for your prompt response, @mknyszek.

The issue seems to be related to calling runtime/debug.SetMemoryLimit. I cannot reproduce it after setting tidb_server_memory_limit to 0, which prevents TiDB from calling runtime/debug.SetMemoryLimit regularly.
