runtime: long latency of sweep assists #57523
Comments
CC @golang/runtime.
Thanks for the detailed report and investigation!
Yeah, this just seems like a bug, and like it would cause exactly this latency issue. We should just check for overflow defensively. I think that code assumes that However, if your application is calling Does TiDB call
Got it, thanks. Then I think I'm satisfied with calls to these functions as the complete explanation for how we can see this overflow. I'll send out a patch shortly.
Actually, looking again at your execution trace, the fact that so many goroutines get stuck in sweep assist all at the same time is suspicious. There is another facet to this story and I think it's the fact that That's how we can get into this persistently bad state.
Sent for review: https://go.dev/cl/460376
Change https://go.dev/cl/460376 mentions this issue:
An interesting question is whether this is bad enough to deserve a backport. The fix is very small and pretty safe.
Thank you for your prompt response, @mknyszek. The issue seems to be related to the calling of
@gopherbot Please open up backport issues for Go 1.19 and Go 1.20. This issue has the potential to cause severe latency issues if Just about to land the fix. I think I'm in favor of backporting.
Backport issue(s) opened: #58535 (for 1.19), #58536 (for 1.20). Remember to create the cherry-pick CL(s) as soon as the patch is submitted to master, according to https://go.dev/wiki/MinorReleases.
Change https://go.dev/cl/468375 mentions this issue:
Change https://go.dev/cl/468376 mentions this issue:
The sweep assist computation is intentionally racy for performance, since the specifics of sweep assist aren't super sensitive to error. However, if overflow occurs when computing the live heap delta, we can end up with a massive sweep target that causes the sweep assist to sweep until sweep termination, causing severe latency issues.

In fact, because heapLive doesn't always increase monotonically then anything that flushes mcaches will cause _all_ allocating goroutines to inevitably get stuck in sweeping.

Consider the following scenario:
1. SetGCPercent is called, updating sweepHeapLiveBasis to heapLive.
2. Very shortly after, ReadMemStats is called, flushing mcaches and decreasing heapLive below the value sweepHeapLiveBasis was set to.
3. Every allocating goroutine goes to refill its mcache, calls into deductSweepCredit for sweep assist, and gets stuck sweeping until the sweep phase ends.

Fix this by just checking for overflow in the delta live heap calculation and if it would overflow, pick a small delta live heap. This probably means that no sweeping will happen at all, but that's OK. This is a transient state and the runtime will recover as soon as heapLive increases again.

Note that deductSweepCredit doesn't check overflow on other operations but that's OK: those operations are signed and extremely unlikely to overflow. The subtraction targeted by this CL is only a problem because it's unsigned. An alternative fix would be to make the operation signed, but being explicit about the overflow situation seems worthwhile.

For #57523.
Fixes #58536.

Change-Id: Ib18f71f53468e913548aac6e5358830c72ef0215
Reviewed-on: https://go-review.googlesource.com/c/go/+/468375
Reviewed-by: Michael Pratt <mpratt@google.com>
Run-TryBot: Michael Knyszek <mknyszek@google.com>
TryBot-Result: Gopher Robot <gobot@golang.org>
The sweep assist computation is intentionally racy for performance, since the specifics of sweep assist aren't super sensitive to error. However, if overflow occurs when computing the live heap delta, we can end up with a massive sweep target that causes the sweep assist to sweep until sweep termination, causing severe latency issues.

In fact, because heapLive doesn't always increase monotonically then anything that flushes mcaches will cause _all_ allocating goroutines to inevitably get stuck in sweeping.

Consider the following scenario:
1. SetGCPercent is called, updating sweepHeapLiveBasis to heapLive.
2. Very shortly after, ReadMemStats is called, flushing mcaches and decreasing heapLive below the value sweepHeapLiveBasis was set to.
3. Every allocating goroutine goes to refill its mcache, calls into deductSweepCredit for sweep assist, and gets stuck sweeping until the sweep phase ends.

Fix this by just checking for overflow in the delta live heap calculation and if it would overflow, pick a small delta live heap. This probably means that no sweeping will happen at all, but that's OK. This is a transient state and the runtime will recover as soon as heapLive increases again.

Note that deductSweepCredit doesn't check overflow on other operations but that's OK: those operations are signed and extremely unlikely to overflow. The subtraction targeted by this CL is only a problem because it's unsigned. An alternative fix would be to make the operation signed, but being explicit about the overflow situation seems worthwhile.

For #57523.
Fixes #58535.

Change-Id: Ib18f71f53468e913548aac6e5358830c72ef0215
Reviewed-on: https://go-review.googlesource.com/c/go/+/468376
Reviewed-by: Michael Pratt <mpratt@google.com>
Auto-Submit: Michael Pratt <mpratt@google.com>
Run-TryBot: Michael Pratt <mpratt@google.com>
TryBot-Result: Gopher Robot <gobot@golang.org>
The sweep assist computation is intentionally racy for performance, since the specifics of sweep assist aren't super sensitive to error. However, if overflow occurs when computing the live heap delta, we can end up with a massive sweep target that causes the sweep assist to sweep until sweep termination, causing severe latency issues.

In fact, because heapLive doesn't always increase monotonically then anything that flushes mcaches will cause _all_ allocating goroutines to inevitably get stuck in sweeping.

Consider the following scenario:
1. SetGCPercent is called, updating sweepHeapLiveBasis to heapLive.
2. Very shortly after, ReadMemStats is called, flushing mcaches and decreasing heapLive below the value sweepHeapLiveBasis was set to.
3. Every allocating goroutine goes to refill its mcache, calls into deductSweepCredit for sweep assist, and gets stuck sweeping until the sweep phase ends.

Fix this by just checking for overflow in the delta live heap calculation and if it would overflow, pick a small delta live heap. This probably means that no sweeping will happen at all, but that's OK. This is a transient state and the runtime will recover as soon as heapLive increases again.

Note that deductSweepCredit doesn't check overflow on other operations but that's OK: those operations are signed and extremely unlikely to overflow. The subtraction targeted by this CL is only a problem because it's unsigned. An alternative fix would be to make the operation signed, but being explicit about the overflow situation seems worthwhile.

Fixes golang#57523.

Change-Id: Ib18f71f53468e913548aac6e5358830c72ef0215
Reviewed-on: https://go-review.googlesource.com/c/go/+/460376
Auto-Submit: Michael Knyszek <mknyszek@google.com>
Reviewed-by: Michael Pratt <mpratt@google.com>
TryBot-Result: Gopher Robot <gobot@golang.org>
Run-TryBot: Michael Knyszek <mknyszek@google.com>
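The guard described in the commit message can be sketched roughly as follows. This is a simplified standalone model, not the runtime source: `sweepTargetBytes` is a hypothetical helper, and it assumes the sweep target is driven by the bytes allocated since the basis plus the size of the span being requested.

```go
package main

import "fmt"

// sweepTargetBytes is a hypothetical helper (not a runtime function).
// It returns the number of bytes the sweep target would be based on:
// the span being allocated plus whatever the live heap has grown by
// since the basis was recorded.
func sweepTargetBytes(heapLive, basis, spanBytes uint64) uint64 {
	target := spanBytes
	if basis < heapLive {
		// Only subtract when the result cannot wrap around.
		target += heapLive - basis
	}
	// Otherwise treat the live heap delta as zero. Little or no sweeping
	// is requested until heapLive grows past the basis again.
	return target
}

func main() {
	const spanBytes = 8192
	basis := uint64(1 << 30) // hypothetical basis value

	// Normal case: heapLive has grown by 64 KiB since the basis was set.
	fmt.Println(sweepTargetBytes(basis+65536, basis, spanBytes)) // 73728

	// Problem case: heapLive dropped below the basis. The guard keeps the
	// target small instead of letting the subtraction wrap.
	fmt.Println(sweepTargetBytes(basis-4096, basis, spanBytes)) // 8192
}
```

Making the subtraction signed, the alternative the commit message mentions, would also avoid the wraparound, but the explicit comparison keeps the unusual case visible at the point where it is handled.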
What version of Go are you using (go version)?
Does this issue reproduce with the latest release?
Not sure, but at least with v1.19.1.
What operating system and processor architecture are you using (go env)?
go env Output
What did you do?
I'm investigating a long-tail latency issue in TiDB and found that sweep assists are what lead to the slow queries.
I also added a debug log and found that heapLive might be less than sweepHeapLiveBasis here. As a result, we got an extremely large pagesTarget (because of underflow) and then spent a lot of time sweeping (2767037498505953 pages needed to be swept in order to allocate 8192 bytes).
What did you expect to see?
Reasonable latency of sweep assists.
What did you see instead?
It takes hundreds of milliseconds for deductSweepCredit.
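To illustrate the underflow described in the report, here is a minimal sketch of how an unchecked unsigned subtraction behaves when the current value is smaller than the recorded basis. The function name `wrappedDelta` and the concrete numbers are hypothetical and chosen only for illustration. They are not taken from the runtime or from the TiDB trace.

```go
package main

import "fmt"

// wrappedDelta mimics an unchecked unsigned subtraction. In Go, unsigned
// integer arithmetic wraps modulo 2^64, so the result is huge whenever
// heapLive is smaller than basis.
func wrappedDelta(heapLive, basis uint64) uint64 {
	return heapLive - basis
}

func main() {
	// Hypothetical values: the basis was recorded at 512 MiB, then the
	// live heap figure dropped by 4 KiB.
	basis := uint64(512 << 20)
	heapLive := basis - 4096

	// Unsigned: the small negative difference wraps to nearly 2^64.
	fmt.Println(wrappedDelta(heapLive, basis)) // 18446744073709547520

	// Signed: the same difference is just a small negative number.
	fmt.Println(int64(heapLive) - int64(basis)) // -4096
}
```

Any page target derived from a delta of that magnitude would be far larger than the heap itself, which is consistent with the enormous page count seen in the debug log.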