Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

runtime: race between stack shrinking and channel send/recv leads to bad sudog values [1.15 backport] #40643

Closed
gopherbot opened this issue Aug 7, 2020 · 5 comments

Comments

@gopherbot
Copy link

@gopherbot gopherbot commented Aug 7, 2020

@mknyszek requested issue #40641 to be considered for backport to the next 1.15 minor release.

This should be fixed for Go 1.14 and Go 1.15. It's a bug that was introduced in Go 1.14, and may cause random and unavoidable crashes at any point in time. There may not be enough time to fix this for 1.15 (the failure is very rare, but we've seen it internally), and if not, it should definitely go in a point release.

@gopherbot please open a backport issue for 1.14.

@gopherbot
Copy link
Author

@gopherbot gopherbot commented Sep 21, 2020

Change https://golang.org/cl/256300 mentions this issue: [release-branch.go1.15] runtime: disable stack shrinking in activeStackChans race window

@mknyszek
Copy link
Contributor

@mknyszek mknyszek commented Sep 30, 2020

@dmitshur

Could you please take a look and see if the cherry pick should be approved? I think it should, since it's actively causing relatively rare crashes with no workaround. Also, could you submit https://golang.org/cl/256300 too, if it's approved? I've got 2 +2s on it now.

Thank you!

@dmitshur
Copy link
Contributor

@dmitshur dmitshur commented Oct 1, 2020

@mknyszek Thanks for providing a rationale for this backport and for getting the CLs into a ready state (here, and in #40642). We will discuss it in a release meeting soon, and take care of submission.

@dmitshur
Copy link
Contributor

@dmitshur dmitshur commented Oct 7, 2020

Approving per discussion in a release meeting. This backport applies to both 1.15 (this issue) and 1.14 (#40642).

@gopherbot
Copy link
Author

@gopherbot gopherbot commented Oct 7, 2020

Closed by merging bf79f91 to release-branch.go1.15.

@gopherbot gopherbot closed this Oct 7, 2020
gopherbot pushed a commit that referenced this issue Oct 7, 2020
…ckChans race window

Currently activeStackChans is set before a goroutine blocks on a channel
operation in an unlockf passed to gopark. The trouble is that the
unlockf is called *after* the G's status is changed, and the G's status
is what is used by a concurrent mark worker (calling suspendG) to
determine that a G has successfully been suspended. In this window
between the status change and unlockf, the mark worker could try to
shrink the G's stack, and in particular observe that activeStackChans is
false. This observation will cause the mark worker to *not* synchronize
with concurrent channel operations when it should, and so updating
pointers in the sudog for the blocked goroutine (which may point to the
goroutine's stack) races with channel operations which may also
manipulate the pointer (read it, dereference it, update it, etc.).

Fix the problem by adding a new atomically-updated flag to the g struct
called parkingOnChan, which is non-zero in the race window above. Then,
in isShrinkStackSafe, check if parkingOnChan is zero. The race is
resolved like so:

* Blocking G sets parkingOnChan, then changes status in gopark.
* Mark worker successfully suspends blocking G.
* If the mark worker observes parkingOnChan is non-zero when checking
  isShrinkStackSafe, then it's not safe to shrink (we're in the race
  window).
* If the mark worker observes parkingOnChan as zero, then because
  the mark worker observed the G status change, it can be sure that
  gopark's unlockf completed, and gp.activeStackChans will be correct.

The risk of this change is low, since although it reduces the number of
places that stack shrinking is allowed, the window here is incredibly
small. Essentially, every place that it might crash now is replaced with
no shrink.

This change adds a test, but the race window is so small that it's hard
to trigger without a well-placed sleep in park_m. Also, this change
fixes stackGrowRecursive in proc_test.go to actually allocate a 128-byte
stack frame. It turns out the compiler was destructuring the "pad" field
and only allocating one uint64 on the stack.

For #40641.
Fixes #40643.

Change-Id: I7dfbe7d460f6972b8956116b137bc13bc24464e8
Reviewed-on: https://go-review.googlesource.com/c/go/+/247050
Run-TryBot: Michael Knyszek <mknyszek@google.com>
TryBot-Result: Go Bot <gobot@golang.org>
Reviewed-by: Michael Pratt <mpratt@google.com>
Trust: Michael Knyszek <mknyszek@google.com>
(cherry picked from commit eb3c6a9)
Reviewed-on: https://go-review.googlesource.com/c/go/+/256300
Reviewed-by: Austin Clements <austin@google.com>
claucece added a commit to claucece/go that referenced this issue Oct 22, 2020
…ckChans race window

Currently activeStackChans is set before a goroutine blocks on a channel
operation in an unlockf passed to gopark. The trouble is that the
unlockf is called *after* the G's status is changed, and the G's status
is what is used by a concurrent mark worker (calling suspendG) to
determine that a G has successfully been suspended. In this window
between the status change and unlockf, the mark worker could try to
shrink the G's stack, and in particular observe that activeStackChans is
false. This observation will cause the mark worker to *not* synchronize
with concurrent channel operations when it should, and so updating
pointers in the sudog for the blocked goroutine (which may point to the
goroutine's stack) races with channel operations which may also
manipulate the pointer (read it, dereference it, update it, etc.).

Fix the problem by adding a new atomically-updated flag to the g struct
called parkingOnChan, which is non-zero in the race window above. Then,
in isShrinkStackSafe, check if parkingOnChan is zero. The race is
resolved like so:

* Blocking G sets parkingOnChan, then changes status in gopark.
* Mark worker successfully suspends blocking G.
* If the mark worker observes parkingOnChan is non-zero when checking
  isShrinkStackSafe, then it's not safe to shrink (we're in the race
  window).
* If the mark worker observes parkingOnChan as zero, then because
  the mark worker observed the G status change, it can be sure that
  gopark's unlockf completed, and gp.activeStackChans will be correct.

The risk of this change is low, since although it reduces the number of
places that stack shrinking is allowed, the window here is incredibly
small. Essentially, every place that it might crash now is replaced with
no shrink.

This change adds a test, but the race window is so small that it's hard
to trigger without a well-placed sleep in park_m. Also, this change
fixes stackGrowRecursive in proc_test.go to actually allocate a 128-byte
stack frame. It turns out the compiler was destructuring the "pad" field
and only allocating one uint64 on the stack.

For golang#40641.
Fixes golang#40643.

Change-Id: I7dfbe7d460f6972b8956116b137bc13bc24464e8
Reviewed-on: https://go-review.googlesource.com/c/go/+/247050
Run-TryBot: Michael Knyszek <mknyszek@google.com>
TryBot-Result: Go Bot <gobot@golang.org>
Reviewed-by: Michael Pratt <mpratt@google.com>
Trust: Michael Knyszek <mknyszek@google.com>
(cherry picked from commit eb3c6a9)
Reviewed-on: https://go-review.googlesource.com/c/go/+/256300
Reviewed-by: Austin Clements <austin@google.com>
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Linked pull requests

Successfully merging a pull request may close this issue.

None yet
3 participants