runtime: race between stack shrinking and channel send/recv leads to bad sudog values #40641
Comments
This should be fixed for Go 1.14 and Go 1.15. It's a bug that was introduced in Go 1.14, and may cause random and unavoidable crashes at any point in time. There may not be enough time to fix this for 1.15 (the failure is very rare, but we've seen it internally), and if not, it should definitely go in a point release. @gopherbot please open a backport issue for 1.14.
Backport issue(s) opened: #40642 (for 1.14), #40643 (for 1.15). Remember to create the cherry-pick CL(s) as soon as the patch is submitted to master, according to https://golang.org/wiki/MinorReleases.
Change https://golang.org/cl/247050 mentions this issue:
Oh no! That fix doesn't work because if there's a stack growth at any point in between, we could self-deadlock. Ugh. OK, I don't know what the right fix is yet...
Could we set GS as not safe for stack shrinking?
Yeah, that might work. I'm not sure if it's technically safe for stack scanning though because if it's a
I think stack scanning should be fine, as it doesn't write to the stack. It is okay to read the old or the new value. As long as the write has a write barrier, it should be fine.
Thanks Cherry, that makes sense. Your solution would mean that we could probably get rid of
If we set it as not safe for stack shrinking, that means we can't shrink stacks of anything blocked on chan send, right? Would that be too limiting, given that such G's may be blocked indefinitely? I certainly regularly see thousands of G's blocked in chan recv, though I imagine chan send is less common.
It applies to both chan send and chan recv in this case. Maybe it is too limiting? I hadn't considered the case where Gs are blocked indefinitely and you want to shrink their stacks. If the G will run soon, then the next synchronous preemption will likely shrink its stack (unless it's another channel op...).
E.g., in one of the applications where we saw this crash, 44/557 G's were blocked in chan receive, most for >1000 minutes. That is a decent amount of stack space you'll effectively never be able to reclaim.
There was a state of the world where all this worked, prior to https://golang.org/cl/172982. For 1.14 it's probably safest to revert that CL, since I think it can be cleanly reverted. For 1.15 (if not backporting) and 1.16, though, @prattmic and I agree it would be nice to retain the explicit tracking of when we need to be careful in stack copying that https://golang.org/cl/172982 added, but I'm not sure how to do it safely. It may require some surgery in gopark. @cherrymui, what do you think?
Change https://golang.org/cl/247679 mentions this issue:
Change https://golang.org/cl/247680 mentions this issue:
OK, actually I think I'm abandoning the idea of the revert. Enough changed after that CL that it doesn't make sense, and after I read some of the old comments, it feels like going back to that state would be subtle.
Yeah, that CL's description says that it will break the rule when stack shrink can happen synchronously (as it is now). It may still be okay if we don't shrink stacks when we are holding locks?
Re stack space, for blocked goroutines, we cannot release the used part of the stack regardless. The question is how much stack space we can actually shrink.
I think it is only unsafe for stack shrinking between setting the G to _Gwaiting and the G actually parking, due to the race on activeStackChans.
That idea makes sense to me. Block stack shrinking before calling gopark.
By the way, I look forward to the test case.
Thanks @cherrymui, that sounds right to me.
Yeah, that's going to be fun...
OK I think I have a fix (trybots look good so far, 🤞), but still no working test. The goal of the test I'm trying to write is to create a situation where a goroutine whose stack is ripe for shrinking keeps going to sleep and waking up on channels. Meanwhile, a GC is triggered to try and get a mark worker to race with that goroutine going to sleep. I wrote a test which I've confirmed races stack shrinks with channel sends and receives, but I'm noticing a couple problems in trying to reproduce the issue.
As a result of these problems, I'm not certain that this is feasible to write a test for, at least not a short-running one, if we want it to fail reliably. But maybe I'm just going about this the wrong way (i.e. a stress test is just the wrong choice here)? Any advice?
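For reference, the general shape of the stress test being described might look something like the sketch below: one goroutine repeatedly grows its stack and then blocks on channel operations (so its stack is a shrink candidate), while the main loop keeps forcing GCs so that stack shrinking races with the parks. All names, frame sizes, and iteration counts here are illustrative guesses, not the test that eventually landed.

```go
package main

import (
	"runtime"
	"sync"
)

// grow burns stack frames so the goroutine's stack is worth shrinking later.
func grow(n int) int {
	var pad [64]uint64
	pad[0] = uint64(n)
	if n <= 0 {
		return int(pad[0])
	}
	return grow(n-1) + int(pad[63])
}

func main() {
	done := make(chan struct{})
	ping, pong := make(chan int), make(chan int)

	var wg sync.WaitGroup
	wg.Add(2)

	// The goroutine whose stack we want shrunk while it parks.
	go func() {
		defer wg.Done()
		for i := 0; i < 1000; i++ {
			grow(64)  // inflate the stack
			ping <- i // park on a send...
			<-pong    // ...and on a receive
		}
		close(done)
	}()

	// Its partner on the other side of the channels.
	go func() {
		defer wg.Done()
		for i := 0; i < 1000; i++ {
			<-ping
			pong <- i
		}
	}()

	// Keep the GC (and therefore stack shrinking) busy while the pair runs.
	for {
		select {
		case <-done:
			wg.Wait()
			return
		default:
			runtime.GC()
		}
	}
}
```

As the discussion above suggests, the window is so small that a test of this shape is unlikely to fail reliably without some hook inside the runtime itself.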
Added release-blocker because it seems we should not release 1.16 without either resolving this issue, or revisiting it and making some sort of decision.
Change https://golang.org/cl/256058 mentions this issue:
unlockf is called after the G is put into _Gwaiting, meaning another G may have readied this one before unlockf is called. This is implied by the current doc, but add additional notes to call out this behavior, as it can be quite surprising.

Updates #40641

Change-Id: I60b1ccc6a4dd9ced8ad2aa1f729cb2e973100b59
Reviewed-on: https://go-review.googlesource.com/c/go/+/256058
Trust: Michael Pratt <mpratt@google.com>
Run-TryBot: Michael Pratt <mpratt@google.com>
TryBot-Result: Go Bot <gobot@golang.org>
Reviewed-by: Michael Knyszek <mknyszek@google.com>
Change https://golang.org/cl/256300 mentions this issue:
Change https://golang.org/cl/256301 mentions this issue:
…ckChans race window

Currently activeStackChans is set before a goroutine blocks on a channel operation in an unlockf passed to gopark. The trouble is that the unlockf is called *after* the G's status is changed, and the G's status is what is used by a concurrent mark worker (calling suspendG) to determine that a G has successfully been suspended. In this window between the status change and unlockf, the mark worker could try to shrink the G's stack, and in particular observe that activeStackChans is false. This observation will cause the mark worker to *not* synchronize with concurrent channel operations when it should, and so updating pointers in the sudog for the blocked goroutine (which may point to the goroutine's stack) races with channel operations which may also manipulate the pointer (read it, dereference it, update it, etc.).

Fix the problem by adding a new atomically-updated flag to the g struct called parkingOnChan, which is non-zero in the race window above. Then, in isShrinkStackSafe, check if parkingOnChan is zero. The race is resolved like so:

* Blocking G sets parkingOnChan, then changes status in gopark.
* Mark worker successfully suspends blocking G.
* If the mark worker observes parkingOnChan is non-zero when checking isShrinkStackSafe, then it's not safe to shrink (we're in the race window).
* If the mark worker observes parkingOnChan as zero, then because the mark worker observed the G status change, it can be sure that gopark's unlockf completed, and gp.activeStackChans will be correct.

The risk of this change is low, since although it reduces the number of places that stack shrinking is allowed, the window here is incredibly small. Essentially, every place that it might crash now is replaced with no shrink.

This change adds a test, but the race window is so small that it's hard to trigger without a well-placed sleep in park_m.

Also, this change fixes stackGrowRecursive in proc_test.go to actually allocate a 128-byte stack frame. It turns out the compiler was destructuring the "pad" field and only allocating one uint64 on the stack.

For #40641.
Fixes #40643.

Change-Id: I7dfbe7d460f6972b8956116b137bc13bc24464e8
Reviewed-on: https://go-review.googlesource.com/c/go/+/247050
Run-TryBot: Michael Knyszek <mknyszek@google.com>
TryBot-Result: Go Bot <gobot@golang.org>
Reviewed-by: Michael Pratt <mpratt@google.com>
Trust: Michael Knyszek <mknyszek@google.com>
(cherry picked from commit eb3c6a9)
Reviewed-on: https://go-review.googlesource.com/c/go/+/256300
Reviewed-by: Austin Clements <austin@google.com>
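To make the resolution above concrete, here is a compilable sketch of the shape of the fix described in this commit message. It is not the runtime's actual code: the `g` struct, `chanParkCommit`, and `isShrinkStackSafe` below are simplified stand-ins using standard-library atomics, whereas the real implementation lives inside the runtime and uses its internal types.

```go
package main

import (
	"fmt"
	"sync/atomic"
)

// g is a stand-in for the runtime's g struct, with only the two fields
// relevant to this fix.
type g struct {
	parkingOnChan    atomic.Uint32 // non-zero while in the race window
	activeStackChans bool          // trustworthy only once parkingOnChan is zero again
}

// chanParkCommit mimics the unlockf used when parking on a channel:
// it publishes activeStackChans first, then closes the race window.
func chanParkCommit(gp *g) bool {
	gp.activeStackChans = true
	// Clearing the flag last is the point: once a shrinker sees
	// parkingOnChan == 0 (having already observed the _Gwaiting status),
	// activeStackChans is guaranteed to be correct.
	gp.parkingOnChan.Store(0)
	return true
}

// isShrinkStackSafe mirrors the new check: no shrinking while the G is
// between "status set to _Gwaiting" and "unlockf finished".
func isShrinkStackSafe(gp *g) bool {
	return gp.parkingOnChan.Load() == 0
}

func main() {
	gp := &g{}

	gp.parkingOnChan.Store(1) // about to park on a channel operation
	fmt.Println("shrink allowed mid-park:", isShrinkStackSafe(gp)) // false

	chanParkCommit(gp)
	fmt.Println("shrink allowed after park:", isShrinkStackSafe(gp)) // true
}
```

The ordering in `chanParkCommit` is what makes the bullet list above work: the flag is only cleared after `activeStackChans` has been published, so a shrinker that sees the flag at zero can trust the value it reads.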
…ckChans race window

Currently activeStackChans is set before a goroutine blocks on a channel operation in an unlockf passed to gopark. The trouble is that the unlockf is called *after* the G's status is changed, and the G's status is what is used by a concurrent mark worker (calling suspendG) to determine that a G has successfully been suspended. In this window between the status change and unlockf, the mark worker could try to shrink the G's stack, and in particular observe that activeStackChans is false. This observation will cause the mark worker to *not* synchronize with concurrent channel operations when it should, and so updating pointers in the sudog for the blocked goroutine (which may point to the goroutine's stack) races with channel operations which may also manipulate the pointer (read it, dereference it, update it, etc.).

Fix the problem by adding a new atomically-updated flag to the g struct called parkingOnChan, which is non-zero in the race window above. Then, in isShrinkStackSafe, check if parkingOnChan is zero. The race is resolved like so:

* Blocking G sets parkingOnChan, then changes status in gopark.
* Mark worker successfully suspends blocking G.
* If the mark worker observes parkingOnChan is non-zero when checking isShrinkStackSafe, then it's not safe to shrink (we're in the race window).
* If the mark worker observes parkingOnChan as zero, then because the mark worker observed the G status change, it can be sure that gopark's unlockf completed, and gp.activeStackChans will be correct.

The risk of this change is low, since although it reduces the number of places that stack shrinking is allowed, the window here is incredibly small. Essentially, every place that it might crash now is replaced with no shrink.

This change adds a test, but the race window is so small that it's hard to trigger without a well-placed sleep in park_m.

Also, this change fixes stackGrowRecursive in proc_test.go to actually allocate a 128-byte stack frame. It turns out the compiler was destructuring the "pad" field and only allocating one uint64 on the stack.

For #40641.
Fixes #40642.

Change-Id: I7dfbe7d460f6972b8956116b137bc13bc24464e8
Reviewed-on: https://go-review.googlesource.com/c/go/+/247050
Run-TryBot: Michael Knyszek <mknyszek@google.com>
TryBot-Result: Go Bot <gobot@golang.org>
Reviewed-by: Michael Pratt <mpratt@google.com>
Trust: Michael Knyszek <mknyszek@google.com>
(cherry picked from commit eb3c6a9)
Reviewed-on: https://go-review.googlesource.com/c/go/+/256301
Reviewed-by: Austin Clements <austin@google.com>
Going from Go 1.14.9 to Go 1.14.10 brings in compiler and runtime fixes, including a fix for a crash in the garbage collector due to a race condition: golang/go#40642, golang/go#40641. Tested on helloworld SR.
Internally we've seen a rare crash arise in the runtime since Go 1.14. The error message is typically

`sudog with non-nil elem`

stemming from a call to `releaseSudog` from `chansend` or `chanrecv`.
The issue here is a race between a mark worker and a channel operation. Consider the following sequence of events. GW is a worker G. GS is a G trying to send on a channel. GR is a G trying to receive on that same channel.

1. GW suspends GS via `suspendG`.
2. GS is about to `gopark` in `chansend`. It calls into `gopark`, and changes its status to `_Gwaiting` BEFORE calling its `unlockf`, which sets `gp.activeStackChans`.
3. GW observes `_Gwaiting` and returns from `suspendG`. It continues into `scanstack`, where it checks if it's safe to shrink the stack. In this case, it's fine. So, it reads `gp.activeStackChans`, and sees it as false. It begins adjusting `sudog` pointers without synchronization. It reads the `sudog`'s `elem` pointer from the `chansend`, but has not written it back yet.
4. GS sets `gp.activeStackChans` and parks. It doesn't really matter when this happens at this point.
5. GR performs a `chanrecv` on the channel. It grabs the channel lock, reads from the `sudog`'s `elem` field, and clears it. GR readies GS.
6. GW writes back the adjusted `elem` pointer, undoing GR's clear. GS wakes up and releases its `sudog`, which has a non-nil `elem` field.

The fix here, I believe, is to set `gp.activeStackChans` before the `unlockf` is called. Doing this ensures that the value is updated before any worker that could shrink GS's stack observes a useful G status in `suspendG`. This could alternatively be fixed by changing the G status after `unlockf` is called, but I worry that will break a lot of things.

CC @aclements @prattmic
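As a self-contained illustration of the ordering problem (not the runtime's code: the struct below and its fields only borrow the runtime's names for readability), the following toy program shows why a watcher that trusts the status alone can read activeStackChans too early:

```go
package main

import (
	"fmt"
	"runtime"
	"sync/atomic"
	"time"
)

// g borrows the runtime's field names purely for readability; it is a toy.
type g struct {
	status           atomic.Uint32 // 0 ~ _Grunning, 1 ~ _Gwaiting (simplified)
	activeStackChans atomic.Bool   // set by the unlockf in the real runtime
}

func main() {
	gs := &g{}

	// GS: parks on a channel. The status flips to _Gwaiting BEFORE the
	// unlockf equivalent sets activeStackChans; that gap is the race window.
	go func() {
		gs.status.Store(1)              // gopark: status -> _Gwaiting
		time.Sleep(time.Millisecond)    // widen the window for the demo
		gs.activeStackChans.Store(true) // unlockf: gp.activeStackChans = true
	}()

	// GW: a mark worker that has observed _Gwaiting (as suspendG would) and
	// now decides whether sudog adjustment needs channel-lock synchronization.
	for gs.status.Load() != 1 {
		runtime.Gosched()
	}
	if !gs.activeStackChans.Load() {
		fmt.Println("GW saw activeStackChans=false: would adjust sudogs unsynchronized (the bug)")
	} else {
		fmt.Println("GW saw activeStackChans=true: would lock the channels first (safe)")
	}
}
```

With the deliberately widened window, the program almost always takes the "unsynchronized" branch, which is the analogue of the crash sequence described above.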