runtime: broken work conservation due to CL 310850 #45867
Comments
2021-11-17T04:31:29-fceca2c/freebsd-arm64-dmgk
Marking as release-blocker for Go 1.18 due to the high failure rate. (Note that at least one failure within the current cycle is on a …)
Looking at the test, and specifically the comment: …
I suspect that what's wrong here is that a GC is already in progress, and this is fallout from the new minimum heap size making that more likely. This is probably already mitigated by https://go.dev/cl/368137, but a failure here is still theoretically possible. Indeed, the last failure was on Dec 2nd (going by the …). I'll send a CL to fix this for good.
Change https://golang.org/cl/369747 mentions this issue:
These tests disable GC because of the potential for a deadlock, but don't consider that a GC could be in progress due to other tests. The likelihood of this case was increased when the minimum heap size was lowered during the Go 1.18 cycle. The issue was then mitigated by CL 368137 but in theory is always a problem.

This change is intended specifically for #45867, but I just walked over a whole bunch of other tests that don't take this precaution where it seems like it could be relevant (some tests it's not, like the UserForcedGC test, or testprogs where no other code has run before it).

Fixes #45867.

Change-Id: I6a1b4ae73e05cab5a0b2d2cce14126bd13be0ba5
Reviewed-on: https://go-review.googlesource.com/c/go/+/369747
Reviewed-by: Michael Pratt <mpratt@google.com>
Reviewed-by: David Chase <drchase@google.com>
Trust: Michael Knyszek <mknyszek@google.com>
Run-TryBot: Michael Knyszek <mknyszek@google.com>
TryBot-Result: Gopher Robot <gobot@golang.org>
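For illustration, here is a minimal sketch of the precaution the commit message describes. This is not the CL's actual code; `withGCDisabled` is a hypothetical helper name:

```go
package main

import (
	"runtime"
	"runtime/debug"
)

// withGCDisabled sketches the precaution described above: finish any
// in-progress GC cycle before disabling automatic GC, so the wrapped test
// body cannot race with a concurrent mark phase started by earlier tests.
func withGCDisabled(body func()) {
	runtime.GC()                  // force a full GC cycle to complete first
	old := debug.SetGCPercent(-1) // disable further automatic GC
	defer debug.SetGCPercent(old) // restore the previous setting afterwards
	body()
}

func main() {
	withGCDisabled(func() {
		// test body that must not be interrupted by a concurrent GC
	})
}
```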
Update on the root cause here: the fixes for #49680, #49695, and #45867 all assumed that SetGCPercent(-1) does not block until the GC's mark phase is done, but it actually does.
Change https://golang.org/cl/369815 mentions this issue:
Fixes for #49680, #49695, #45867, and #49370 all assumed that SetGCPercent(-1) doesn't block until the GC's mark phase is done, but it actually does. The cause of 3 of those 4 failures comes from the fact that at the beginning of the sweep phase, the GC does try to preempt every P once, and this may run concurrently with test code. In the fourth case, the issue was likely that only *one* of the debug_test.go tests was missing a call to SetGCPercent(-1). Just to be safe, leave a TODO there for now to remove the extraneous runtime.GC calls, but leave the calls in.

Updates #49680, #49695, #45867, and #49370.

Change-Id: Ibf4e64addfba18312526968bcf40f1f5d54eb3f1
Reviewed-on: https://go-review.googlesource.com/c/go/+/369815
Reviewed-by: Austin Clements <austin@google.com>
Trust: Michael Knyszek <mknyszek@google.com>
Run-TryBot: Michael Knyszek <mknyszek@google.com>
TryBot-Result: Gopher Robot <gobot@golang.org>
Unfortunately this symptom is still present: https://build.golang.org/log/9c21b1044b95e285295c9cf6040932dd3f5d63b2 occurred after CL 369747.
Hmm. The test still deadlocked, but … Perhaps at least the failure mechanism changed!
This regexp matches the one new failure observed since that change.
Two makes it a pattern, right?
2021-12-08T17:55:13-c1c303f/freebsd-arm64-dmgk
I'm having a really hard time reproducing this on ppc64. I'm spinning it in a loop, and after about 60 minutes on one gomote I've got nothing. I may scale this up. Looking more closely at the failure we do have, it looks like a problem in the scheduler! Note that in https://build.golang.org/log/9c21b1044b95e285295c9cf6040932dd3f5d63b2 we're currently in the …
The main goroutine (177239) is blocked on the … If that goroutine (177301) were to run, the test would get unstuck. But the question is: why is it not running? What scheduler list is it sitting in for 10 whole minutes such that nothing can see it? Sure, 3 of 4 Ps are taken up by non-preemptible goroutines that are actively spinning, but 1 P should be free to run code, and there are no other running goroutines that appear to be taking up the 4th P. This suggests that what we have here is a missing wakeup somewhere.

I don't see any major changes that landed in the scheduler this cycle. I do see two potential culprit CLs in the Gerrit history (my search methods were crude, so perhaps there are more). There's https://golang.org/cl/310850 from May, but if that were the culprit we'd be seeing this go back much further in time, to the 1.17 cycle. The only other CL I see is https://golang.org/cl/317509 from August. If I could reproduce the failure, I'd try reverting either of these, but so far no dice.

(Side note: taking a more fundamental approach to the code, I'm also curious why this test deadlocks due to preemption at all, even before I added the …)
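To make the state described above concrete, here is a rough sketch of the failure shape. This is my own illustration, not the actual runtime test: all but one P is occupied by a tight loop, and the goroutine everything is waiting on is runnable but depends on the scheduler actually running it on the remaining P.

```go
package main

import (
	"runtime"
	"sync/atomic"
)

func main() {
	var stop uint32
	done := make(chan struct{})

	// Occupy all but one P with tight, non-yielding loops.
	for i := 0; i < runtime.GOMAXPROCS(0)-1; i++ {
		go func() {
			for atomic.LoadUint32(&stop) == 0 {
				// spin
			}
		}()
	}

	// The goroutine the main goroutine is waiting on. It is runnable, but
	// if the scheduler loses the wakeup for the one idle P, it can sit on
	// a run queue indefinitely and main never gets unblocked.
	go func() {
		atomic.StoreUint32(&stop, 1)
		close(done)
	}()

	<-done
}
```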
It's been about 4 hours now and I haven't gotten anything yet. :( Maybe I need to run all of …
I have not been able to reproduce this yet, but I did make an interesting observation: even though it passes, this test takes a very long time on …
That is a rather extreme delay, which I will be investigating further.
Update: I have managed to reproduce this on …
I've got a couple of core files now (including one with the same symptoms in …). What I am seeing is that the runnable goroutines are on local run queues, but there is a P that is unexpectedly idle. For example:
Goroutine 85460 is on P 1's run queue:
P 3 is not running:
P 3 not running is a bug, and likely the root cause of the deadlock. We have available work and an idle P, so we should have woken a P to run this work. This is where I will focus investigation. That said, in theory async preemption should still preempt the running goroutines and allow the runnable goroutine to execute. It is possible that the tight loop those goroutines are not async safe points, in which case we would indeed deadlock here. I'm not sure if this is the case or not. The loop [1] seems like it should be preemptible to me, but I haven't dug into the PCDATA. [1]
Yeah, that loop is async preemptible.
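For background on what "async preemptible" means in practice, here is a small self-contained sketch of my own (not code from the issue): a call-free tight loop has no cooperative preemption points, so stopping the world relies on signal-based asynchronous preemption. Note that, as later comments in this thread explain, preemption still failed to kick in for the actual failure for separate reasons.

```go
package main

import (
	"fmt"
	"runtime"
)

func main() {
	// Saturate every P with a call-free tight loop: no function calls means
	// no cooperative preemption points inside the loop body.
	for i := 0; i < runtime.GOMAXPROCS(0); i++ {
		go func() {
			x := 0
			for {
				x++
			}
		}()
	}

	// Stopping the world for GC requires preempting the loops above. With
	// asynchronous preemption (the default since Go 1.14) this completes;
	// running with GODEBUG=asyncpreemptoff=1 will typically stall here.
	runtime.GC()
	fmt.Println("GC finished: the tight loops were asynchronously preempted")
}
```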
End-of-evening summary: I've added decent tracing; here's a simple case:
I'm at end-of-day fried-brain status, but this feels like a bug in the wakep conditions. When a spinning thread finds work, it must wake another spinning thread to compensate for new work submission (see the long comment at the top of proc.go). This is the job of wakep.

But one of these Ps may be held by a non-spinning M in the final steps of releasing the P. In such a case, the non-spinning M will not discover the remaining work and will block in netpoll. Thus we lose our work conservation. Normally we could still make progress [1] because the other running Ps would eventually enter the scheduler and discover the work, but these tight loops are never preempted (even with async preemption, for unclear reasons).

[1] That said, I don't see a fundamental reason why all other Ps couldn't be in the final steps of releasing their P, in which case we would completely deadlock. (In particular because the spinning check doesn't use CAS.)
To summarize, what I see in the trace above is: …
@aclements pointed out that "[M 4] drops P 1 ... and goes directly into blocking netpoll." is new behavior in https://go.dev/cl/310850, which was submitted just before the first one of these test failures. |
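To illustrate the window being described, here is a deliberately simplified toy model under my own names, not the runtime's scheduler code: the worker checks the queue, finds it empty, and only afterwards publishes itself as idle and parks; a submitter running inside that gap enqueues work but sees nobody parked to wake. The result is queued work sitting next to a permanently parked worker, the analogue of an M dropping its P and going straight into blocking netpoll.

```go
package main

import (
	"fmt"
	"sync"
	"time"
)

// toySched is a hypothetical model: submit plays the role of wakep (wake a
// worker only if one is visibly idle right now), and releaseAndPark plays the
// role of an M that found no work and blocks, publishing itself as idle only
// after its emptiness check.
type toySched struct {
	mu     sync.Mutex
	queue  []string
	parked chan struct{} // non-nil once the worker has published itself idle
}

func (s *toySched) submit(w string) {
	s.mu.Lock()
	s.queue = append(s.queue, w)
	p := s.parked
	s.parked = nil
	s.mu.Unlock()
	if p != nil {
		close(p) // wake the parked worker
	}
	// If the worker had already checked the queue but not yet parked,
	// p is nil here and nobody gets woken: the lost-wakeup window.
}

func (s *toySched) releaseAndPark() {
	s.mu.Lock()
	empty := len(s.queue) == 0
	s.mu.Unlock()

	// ...window: a submit landing here is neither seen by us nor sees us...
	time.Sleep(10 * time.Millisecond)

	if empty {
		p := make(chan struct{})
		s.mu.Lock()
		s.parked = p
		s.mu.Unlock()
		<-p // analogue of blocking in netpoll
		fmt.Println("woken") // not reached when the wakeup was lost
	}
}

func main() {
	s := &toySched{}
	go s.releaseAndPark()
	time.Sleep(5 * time.Millisecond) // land inside the window on purpose
	s.submit("work")
	time.Sleep(50 * time.Millisecond)
	fmt.Println("work is queued, the worker is parked, and no wakeup is coming")
}
```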
#50749 explains why async preemption did not kick in to allow forward progress. Disabling …
A few ideas I've brainstormed: …
FWIW, explicitly setting …
Interesting. That explains why I never reproduced the failure when I tried to run that test in isolation...
Correct, the test should no longer fail.
To be clear, this isn't quite "only a performance issue": the failing tests were deadlocked and would never recover. You are correct that it "requires a pretty specific and subtle setup to trigger." I think dropping this as a release blocker is OK, since this issue is new in 1.17, not 1.18. We may still want to backport the eventual fix to both 1.17 and 1.18.
Change https://go.dev/cl/389014 mentions this issue:
@prattmic This issue is in the 1.19 milestone. I'm not clear on the status. Should this be 1.20? Backlog? Closed? Thanks. |
2021-04-30T00:32:42-303b194/linux-amd64-noopt
2020-06-26T17:07:58-d1015f3/freebsd-386-11_2
Also seen on a recent TryBot:
https://storage.googleapis.com/go-build-log/eb7ebec0/freebsd-amd64-12_2_913d0c87.log
CC @jeremyfaller @aclements