Internally we've seen a rare crash arise in the runtime since Go 1.14. The error message is typically "sudog with non-nil elem", stemming from a call to releaseSudog from chansend or chanrecv.
The issue here is a race between a mark worker and a channel operation. Consider the following sequence of events. GW is a worker G. GS is a G trying to send on a channel. GR is a G trying to receive on that same channel.
- GW wants to suspend GS to scan its stack. It calls suspendG.
- GS is about to gopark in chansend. It calls into gopark, and changes its status to _Gwaiting BEFORE calling its unlockf, which sets gp.activeStackChans.
- GW observes _Gwaiting and returns from suspendG. It continues into scanstack where it checks if it's safe to shrink the stack. In this case, it's fine. So, it reads gp.activeStackChans, and sees it as false. It begins adjusting sudog pointers without synchronization. It reads the sudog's elem pointer from the chansend, but has not written it back yet.
- GS continues on its merry way and sets gp.activeStackChans and parks. It doesn't really matter when this happens at this point.
- GR comes in and wants to chanrecv on the channel. It grabs the channel lock, reads from the sudog's elem field, and clears it. GR readies GS.
- GW then writes the updated sudog's elem pointer and continues on its merry way.
- Sometime later, GS wakes up because it was readied by GR, and tries to release the sudog, which has a non-nil elem field.
The fix here, I believe, is to set gp.activeStackChans before the unlockf is called. Doing this ensures that the value is updated before any worker that could shrink GS's stack observes a useful G status in suspendG. This could alternatively be fixed by changing the G status after unlockf is called, but I worry that will break a lot of things.
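To make the ordering concrete, here is a deterministic replay of the interleaving above as a toy model. It is not runtime code: the sudog and g types are stripped-down stand-ins, replay and its flagBeforeStatus parameter are hypothetical names, and the "stack move" is modeled by copying the value to a new location. Holding the channel lock is modeled by doing GW's read and write-back in one step, so GR cannot land in between.

```go
package main

import "fmt"

// sudog and g are toy stand-ins for the runtime structures, reduced to the
// fields this race touches. They are not the real runtime types.
type sudog struct {
	elem *int
}

type g struct {
	status           string
	activeStackChans bool
}

// replay steps through the interleaving in a fixed order on one goroutine.
// flagBeforeStatus selects the proposed ordering (set activeStackChans before
// the status change). It reports whether GS finds a non-nil elem when it
// releases its sudog, which is the condition that crashes.
func replay(flagBeforeStatus bool) bool {
	val := 42
	gs := &g{status: "_Grunning"}
	sd := &sudog{elem: &val}

	// GS enters gopark. With the proposed ordering the flag is set first.
	if flagBeforeStatus {
		gs.activeStackChans = true
	}
	gs.status = "_Gwaiting"

	// GW: suspendG has observed _Gwaiting. GW reads the flag to decide
	// whether it must synchronize with channel operations.
	synced := gs.activeStackChans
	var moved *int
	if synced {
		// Modeled as "channel lock held": read and write-back happen
		// together, before GR can touch the sudog.
		copied := *sd.elem
		sd.elem = &copied
	} else {
		// Unsynchronized: elem has been read but not yet written back.
		copied := *sd.elem
		moved = &copied
	}

	// GS: unlockf runs (this is where the original ordering sets the flag)
	// and GS parks.
	gs.activeStackChans = true

	// GR: under the channel lock, consumes the value, clears elem, and
	// readies GS.
	sd.elem = nil
	gs.status = "_Grunnable"

	// GW: the late write-back resurrects the pointer GR just cleared.
	if !synced {
		sd.elem = moved
	}

	// GS: releaseSudog would throw if elem is non-nil.
	return sd.elem != nil
}

func main() {
	fmt.Println("original ordering, non-nil elem at release:", replay(false))
	fmt.Println("flag set first, non-nil elem at release:", replay(true))
}
```

In this model the original ordering leaves elem non-nil at release, while setting the flag before the status change leaves it nil. The model only captures the ordering argument; it says nothing about other interleavings in the real runtime.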
CC @aclements @prattmic