runtime: race between stack shrinking and channel send/recv leads to bad sudog values #40641

Closed
mknyszek opened this issue Aug 7, 2020 · 26 comments

@mknyszek mknyszek commented Aug 7, 2020

Internally we've seen a rare crash arise in the runtime since Go 1.14. The error message is typically sudog with non-nil elem stemming from a call to releaseSudog from chansend or chanrecv.

The issue here is a race between a mark worker and a channel operation. Consider the following sequence of events. GW is a worker G. GS is a G trying to send on a channel. GR is a G trying to receive on that same channel.

  1. GW wants to suspend GS to scan its stack. It calls suspendG.
  2. GS is about to gopark in chansend. It calls into gopark, and changes its status to _Gwaiting BEFORE calling its unlockf, which sets gp.activeStackChans.
  3. GW observes _Gwaiting and returns from suspendG. It continues into scanstack, where it checks whether it's safe to shrink the stack; in this case it is. So it reads gp.activeStackChans, sees it as false, and begins adjusting sudog pointers without synchronization. It reads the elem pointer from the chansend's sudog, but has not yet written back the adjusted value.
  4. GS continues on its merry way, sets gp.activeStackChans, and parks. It doesn't really matter when this happens at this point.
  5. GR comes in and wants to chanrecv on the channel. It grabs the channel lock, reads from the sudog's elem field, and clears it. GR readies GS.
  6. GW then writes the updated sudog's elem pointer and continues on its merry way.
  7. Sometime later, GS wakes up because it was readied by GR, and tries to release the sudog, which has a non-nil elem field.

The fix here, I believe, is to set gp.activeStackChans before the unlockf is called. Doing this ensures that the value is updated before any worker that could shrink GS's stack observes a useful G status in suspendG. This could alternatively be fixed by changing the G status after unlockf is called, but I worry that will break a lot of things.
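
To make the window concrete, here is a heavily simplified model of the ordering; the types and names below are invented for illustration, and the real code paths are gopark, chanparkcommit, scanstack, and shrinkstack:

```go
package main

import "sync/atomic"

const gWaiting uint32 = 1 // stand-in for _Gwaiting

// Toy stand-in for the parts of a g that matter here.
type g struct {
	status           atomic.Uint32
	activeStackChans bool // "sudogs on my stack may be touched by channel ops"
}

// park models gopark as it stands today: the status flips to _Gwaiting
// BEFORE the unlock function (chanparkcommit) sets activeStackChans (steps 2 and 4).
func park(gp *g, unlockf func(*g)) {
	gp.status.Store(gWaiting)
	// ---- race window: a worker may read activeStackChans == false here ----
	unlockf(gp)
}

// tryShrink models the worker side (step 3): once it observes _Gwaiting it
// trusts activeStackChans to decide whether sudog adjustment needs channel locks.
func tryShrink(gp *g) {
	if gp.status.Load() == gWaiting && !gp.activeStackChans {
		// adjust sudog pointers without taking channel locks (unsafe in the window above)
		_ = gp
	}
}

func main() {
	gp := &g{}
	go tryShrink(gp) // the GC worker racing with the park below
	park(gp, func(gp *g) { gp.activeStackChans = true })
}
```

The proposed fix amounts to doing the activeStackChans store before the status store, so a worker can never observe _Gwaiting together with a stale false.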

CC @aclements @prattmic

@mknyszek mknyszek added this to the Go1.16 milestone Aug 7, 2020
@mknyszek mknyszek self-assigned this Aug 7, 2020

@mknyszek mknyszek commented Aug 7, 2020

This should be fixed for Go 1.14 and Go 1.15. It's a bug that was introduced in Go 1.14, and may cause random and unavoidable crashes at any point in time. There may not be enough time to fix this for 1.15 (the failure is very rare, but we've seen it internally), and if not, it should definitely go in a point release.

@gopherbot please open a backport issue for 1.14.

@gopherbot gopherbot commented Aug 7, 2020

Backport issue(s) opened: #40642 (for 1.14), #40643 (for 1.15).

Remember to create the cherry-pick CL(s) as soon as the patch is submitted to master, according to https://golang.org/wiki/MinorReleases.

@gopherbot gopherbot commented Aug 7, 2020

Change https://golang.org/cl/247050 mentions this issue: runtime: set activeStackChans before gopark

@mknyszek mknyszek commented Aug 7, 2020

Oh no! That fix doesn't work because if there's a stack growth at any point in between, we could self-deadlock. Ugh. OK I don't know what the right fix is yet...
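
For context on the self-deadlock (my reading, not spelled out above): chansend still holds the channel lock when it calls gopark, since the unlockf is what releases it, and copystack takes the channel locks whenever activeStackChans is set. So with the flag set early, a stack growth in that window tries to take a lock the goroutine itself already holds. A toy model of that shape, with invented names (running it deadlocks on purpose):

```go
package main

import "sync"

type hchan struct{ lock sync.Mutex }

type g struct {
	activeStackChans bool
	waiting          *hchan
}

// copystack models the real invariant: when activeStackChans is set, stack
// copying must take the channel lock to synchronize sudog adjustment.
func copystack(gp *g) {
	if gp.activeStackChans {
		gp.waiting.lock.Lock() // self-deadlock if this G already holds the lock
		defer gp.waiting.lock.Unlock()
	}
	// ... copy the stack, adjust sudogs ...
}

// chansendPark models chansend: the channel lock is held until gopark's
// unlockf releases it, so the early flag store leaves a window where a
// stack growth re-enters the same lock.
func chansendPark(c *hchan, gp *g) {
	c.lock.Lock()
	gp.waiting = c
	gp.activeStackChans = true // the attempted fix: set the flag before gopark
	copystack(gp)              // stands in for a stack growth in this window
	c.lock.Unlock()            // in the real code, gopark's unlockf does this
}

func main() {
	chansendPark(&hchan{}, &g{})
}
```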

@mknyszek mknyszek added the NeedsFix label Aug 7, 2020

@cherrymui cherrymui commented Aug 7, 2020

Could we mark GS as not safe for stack shrinking?

@mknyszek mknyszek commented Aug 7, 2020

Yeah, that might work. I'm not sure it's technically safe for stack scanning, though: if it's a chanrecv falling into this path, then a sender could write into the stack in that same window? I guess that's already true (or we handle this case) today?

@cherrymui cherrymui commented Aug 7, 2020

I think stack scanning should be fine, since it doesn't write to the stack, and it's okay for it to read either the old or the new value. As long as the write has a write barrier, we should be fine.

@mknyszek mknyszek commented Aug 8, 2020

Thanks Cherry, that makes sense. sendDirect has an explicit write barrier, and we're not worried about reads of a partially-written pointer in stack scanning because memmove guarantees it'll copy a pointer-word at a time.

Your solution would mean that we could probably get rid of syncadjustsudogs altogether, which would be a nice simplification! 😀 A stack shrink will still be enqueued for the next synchronous preemption, so it's probably also fine from a memory-use perspective.

@prattmic prattmic commented Aug 10, 2020

If we set it as not safe for stack shrinking, that means we can't shrink stacks of anything blocked on chan send, right? Would that be too limiting, given that such G's may be blocked indefinitely? I certainly regularly see thousands of G's blocked in chan recv, though I imagine chan send is less common.

@mknyszek mknyszek commented Aug 10, 2020

> If we set it as not safe for stack shrinking, that means we can't shrink stacks of anything blocked on chan send, right? Would that be too limiting, given that such G's may be blocked indefinitely? I certainly regularly see thousands of G's blocked in chan recv, though I imagine chan send is less common.

It applies to both chan send and chan recv in this case. Maybe it is too limiting? I hadn't considered the case where Gs are blocked indefinitely and you want to shrink their stacks. If the G will run soon, then the next synchronous preemption will likely shrink its stack (unless it's another channel op...).

@prattmic prattmic commented Aug 10, 2020

e.g., in one of the applications where we saw this crash, 44/557 G's were blocked in chan receive, most for >1000 minutes. That is a decent amount of stack space you'll effectively never be able to reclaim.

@mknyszek mknyszek commented Aug 10, 2020

There was a state of the world where all this worked, prior to https://golang.org/cl/172982. For 1.14 it's probably safest to revert that CL, since I think it can be cleanly reverted.

For 1.15 (if not backporting) and 1.16, though, @prattmic and I agree it would be nice to retain the explicit tracking that https://golang.org/cl/172982 added of when we need to be careful in stack copying, but I'm not sure how to do it safely. It may require some surgery in gopark, which is a little scary.

@cherrymui what do you think?

@gopherbot gopherbot commented Aug 10, 2020

Change https://golang.org/cl/247679 mentions this issue: Revert "runtime: make copystack/sudog synchronization more explicit"

@gopherbot gopherbot commented Aug 10, 2020

Change https://golang.org/cl/247680 mentions this issue: [release-branch.go1.14] Revert "runtime: make copystack/sudog synchronization more explicit"

@mknyszek mknyszek commented Aug 10, 2020

OK actually I think I'm abandoning the idea of the revert. Enough changed after that CL that it doesn't make sense, and after I read some of the old comments, it feels like going back to that state would be subtle.

@cherrymui cherrymui commented Aug 10, 2020

Yeah, that CL's description says that it will break the rule when stack shrink can happen synchronously (as it is now).

It may still be okay if we don't shrink stacks when we are holding locks?

@cherrymui cherrymui commented Aug 10, 2020

Re stack space, for blocked goroutines, we cannot release the used part of the stack regardless. The question is how much stack space we can actually shrink.

@cherrymui cherrymui commented Aug 10, 2020

I think it is only unsafe to shrink the stack between setting the G to _Gwaiting and the G actually parking, due to the race on gp.activeStackChans? Could we block stack shrinking only for that small window? Then in most cases a blocked goroutine can still have its stack shrunk.

@ianlancetaylor ianlancetaylor commented Aug 10, 2020

That idea makes sense to me. Block stack shrinking before calling gopark, and permit stack shrinking in chanparkcommit.
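
A rough sketch of that shape, using an invented per-G flag; the field and function names below are made up and may not match whatever fix eventually lands:

```go
package main

import "sync/atomic"

type g struct {
	parkingOnChan    atomic.Bool // hypothetical: set only between deciding to park on a channel and chanparkcommit
	activeStackChans bool
}

// Channel-op side: raise the flag before gopark flips the status to _Gwaiting,
// so the flag is already visible by the time a worker can observe _Gwaiting.
func parkOnChan(gp *g, commit func(*g)) {
	gp.parkingOnChan.Store(true)
	// gopark changes the status to _Gwaiting here, then calls commit.
	commit(gp)
}

// chanparkcommit side: once activeStackChans is set, shrinking is safe again,
// because copystack will now synchronize via the channel locks.
func chanparkcommit(gp *g) {
	gp.activeStackChans = true
	gp.parkingOnChan.Store(false)
	// ... unlock the channel ...
}

// Shrink side: refuse to shrink while the window is open.
func shrinkStackSafe(gp *g) bool {
	return !gp.parkingOnChan.Load()
}

func main() {
	gp := &g{}
	parkOnChan(gp, chanparkcommit)
	_ = shrinkStackSafe(gp)
}
```

A goroutine that has finished parking has the flag cleared, so long-blocked goroutines (the case @prattmic raised) can still have their stacks shrunk.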

@ianlancetaylor ianlancetaylor commented Aug 10, 2020

By the way, I look forward to the test case.

@mknyszek mknyszek commented Aug 10, 2020

Thanks @cherrymui, that sounds right to me.

> By the way, I look forward to the test case.

Yeah, that's going to be fun...

@mknyszek mknyszek commented Aug 10, 2020

OK, I think I have a fix (trybots look good so far, 🤞), but still no working test. The goal of the test I'm trying to write is to create a situation where a goroutine whose stack is ripe for shrinking keeps going to sleep and waking up on channels. Meanwhile, a GC is triggered to try to get a mark worker to race with that goroutine going to sleep. I wrote a test which I've confirmed races stack shrinks with channel sends and receives, but I'm noticing a couple of problems in trying to reproduce the issue.

  1. It's relatively expensive to generate a stack shrink in the runtime, so stress testing this isn't very effective. I could maybe get around this by writing an unsafe function in export_test.go which does suspendG followed by shrinkstack in a loop. This way we don't need to wait for the whole GC to end before trying again.
  2. The race window is really, really small compared to everything around it. This fact is evidenced not only by looking at the code, but also by the fact that this crash was fairly rare.

As a result of these problems, I'm not certain it's feasible to write a test for this, at least not a short-running one, if we want it to fail reliably. But maybe I'm just going about this the wrong way (i.e., a stress test is just the wrong choice here)? Any advice?
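
For reference, a bare-bones version of the stress test described above might look roughly like this. The recursion depth, iteration counts, and channel shape are all invented, and as noted, the race window is probably far too small for this to fail reliably:

```go
package main

import "runtime"

// growStack forces the goroutine's stack to grow so a later shrink has
// something to reclaim.
func growStack(n int) byte {
	var buf [256]byte
	if n == 0 {
		return buf[0]
	}
	return growStack(n-1) + buf[0]
}

func main() {
	ping, pong := make(chan int), make(chan int)
	done := make(chan struct{})

	// The goroutine whose stack we want shrunk: grow it, then repeatedly
	// park on channel operations so it keeps racing with shrink attempts.
	go func() {
		growStack(1 << 10)
		for i := 0; i < 1000; i++ {
			ping <- i
			<-pong
		}
		close(done)
	}()

	// Partner goroutine keeping the channel operations flowing.
	go func() {
		for i := 0; i < 1000; i++ {
			<-ping
			pong <- i
		}
	}()

	// Meanwhile, trigger GCs so mark workers (and the stack shrinks they
	// perform) race with the park/unpark traffic above.
	for {
		select {
		case <-done:
			return
		default:
			runtime.GC()
		}
	}
}
```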

@dmitshur dmitshur commented Sep 3, 2020

Added release-blocker because it seems we should not release 1.16 without either resolving this issue or at least revisiting it and making some sort of decision.

@gopherbot gopherbot commented Sep 18, 2020

Change https://golang.org/cl/256058 mentions this issue: runtime: expand gopark documentation

@gopherbot gopherbot closed this in eb3c6a9 Sep 21, 2020
gopherbot pushed a commit that referenced this issue Sep 21, 2020
unlockf is called after the G is put into _Gwaiting, meaning another G
may have readied this one before unlockf is called.

This is implied by the current doc, but add additional notes to call out
this behavior, as it can be quite surprising.

Updates #40641

Change-Id: I60b1ccc6a4dd9ced8ad2aa1f729cb2e973100b59
Reviewed-on: https://go-review.googlesource.com/c/go/+/256058
Trust: Michael Pratt <mpratt@google.com>
Run-TryBot: Michael Pratt <mpratt@google.com>
TryBot-Result: Go Bot <gobot@golang.org>
Reviewed-by: Michael Knyszek <mknyszek@google.com>

@gopherbot gopherbot commented Sep 21, 2020

Change https://golang.org/cl/256300 mentions this issue: [release-branch.go1.15] runtime: disable stack shrinking in activeStackChans race window

@gopherbot gopherbot commented Sep 21, 2020

Change https://golang.org/cl/256301 mentions this issue: [release-branch.go1.14] runtime: disable stack shrinking in activeStackChans race window
