runtime: make sure blocked channels run operations in FIFO order #11506
I've had mail conversations with two different projects using Go recently, both centered around the same surprise: if many goroutines are blocked on a particular channel and that channel becomes available, it is still possible for running goroutines that "drive by" at the right time to get their operation in before the goroutines that are blocked, and the blocked goroutines can be reordered as a result. If this happens repeatedly, the effect is that the blocked goroutines can block arbitrarily long even though the channel is known to be ready at regular intervals. I wonder if we should adjust the channel implementation to ensure that when a channel does become available for sending or receiving, blocked operations take priority over "drive by" operations. It seems to me that this can be done by completing the blocked operation as part of the operation that unblocks it.
Sends and receives on unbuffered channels already behave this way: a send with blocked receivers picks the first receiver goroutine off the queue, delivers the value to it, readies it (puts it on a run queue), and continues execution. That is, the send completes the blocked operation. Receives on unbuffered channels similarly complete blocked sends.
Sends and receives on buffered channels do not behave this way:
A send into a buffered channel with blocked receivers stores the value in the buffer, wakes up a receiver, and continues executing. When the receiver is eventually scheduled, it checks the channel, and maybe it gets lucky and the value is still there. But maybe not, in which case it goes to the end of the queue.
A receive out of a buffered channel copies a value out of the buffer, wakes a blocked sender, and continues. When the sender is eventually scheduled, it checks the channel, and maybe it gets lucky and there is still room in the buffer for the send. But maybe not, in which case it goes to the end of the queue.
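The wake-then-retry shape of both buffered paths can be sketched as pseudocode (the names are illustrative, not the runtime's actual code):

```
// current buffered receive (sketch):
recv(c):
  lock(c)
  if c.buf not empty:
    v = dequeue(c.buf)
    if c.sendq not empty:
      wake(head of c.sendq)    // sender must re-check on its own later
    unlock(c); return v
  enqueue(self, c.recvq); unlock(c); park()
  // on wake-up there is no guarantee the value is still in the buffer:
  // a "drive by" goroutine may have taken it, so retry from the top
  goto recv
```

The buffered send path is symmetric: it wakes a blocked receiver but leaves the receiver to race for the value when it is eventually scheduled.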
It seems to me that it would be easy and symmetric with the unbuffered channel operations for a send with blocked receivers to deliver the value straight to a receiver, completing the first pending receive operation. Similarly a receive with blocked senders could take its own value out of the channel buffer and then complete the send into the newly emptied space. That would be the only operation of the four that needs to transfer a pair of values instead of just one.
It doesn't seem to me that it would hurt performance to do this, and in fact it might help, since there would not be all the unnecessary wakeups that happen now. It would give much more predictable behavior when goroutines queue up on a stuck channel: once queued, that's the order they're going to be served, guaranteed.
From a "happens before" point of view, I'm talking nonsense. There is no "happens before" for two different goroutines blocking on the same channel, and so there is no difference between all the execution orders. I understand that. But from a "what actually happens" point of view, there certainly is a difference: if you know that the goroutines have been stacking up one per minute for an hour before the channel finally gets unblocked and they all complete, you know what order they blocked and might reasonably expect them to unblock in that same order. This remains roughly as true at shorter time scales.
Especially since so many Go programs care about latency, making channels first come, first served seems worth doing. And it looks to me like it can be done trivially and at no performance cost. The current behavior looks more like a bug than an intended result.
I'm not talking about a language change, just a change to our implementation.
Language semantics require that the values be FIFO, but the implications of that with multiple goroutines reading the values are unspecified and muddy at best. That said, I agree with the claim here that it is more natural for blocked goroutines to go first; it just seems intuitive.
I think you could even make a case that the current situation is almost a bug.
There may be a noticeable performance hit under heavy contention, though, because I believe this style requires asymptotically more context switches. I still think channels should behave as you suggest.
From a specification perspective, it's true that our current approach satisfies the happens-before graph, but it does not satisfy liveness. That's a formally defined, reasonable, and desirable property we could specify as a requirement of channels without specifying something as specific as FIFO blocking.
Implementation-wise, I'm concerned about making blocking strictly FIFO order. This seems like the exact same mistake as making map order deterministic or goroutine scheduling deterministic. People can and will come to depend on this and eventually we may want to weaken it. In particular, this is asking for scalability problems. One of the cardinal rules that came out of my thesis was to not impose strict ordering on operations unless it is a fundamental requirement. It is by no means a fundamental requirement of the blocking order.
I fully support addressing the liveness issue, and I like the approach of eliminating the wake-up race, but if we're going to do that, we should think about whether we can introduce bounded randomness into the blocking order to prevent people from depending on strict FIFO blocking.
I think we're overthinking the randomization. With randomized goroutine scheduling (which is already going in, at least for -race testing), we'll get randomized channel ordering for free. The random scheduling should change the order in which waiters get queued and the order in which sends get matched to the waiters.
You seem to assume that the entity that benefits from reduced latency is a goroutine only. This is true when a goroutine services a request and needs to acquire some auxiliary resource (e.g. a database connection). But this is not true for all producer-consumer/pipelining scenarios, where the entity that benefits from reduced latency is a message in a chan. Today messages in chans are serviced strictly FIFO and with minimal latency: the next running goroutine picks up the first message in the chan. What you propose improves the database connection pool scenario but equally worsens the producer-consumer scenario, because under your proposal we hand off the first message in the chan to a goroutine that will run who-knows-when, while a goroutine that drives by the chan the next moment either blocks or services the second message ahead of the first one. If we switch the implementation we will start receiving complaints about the other scenarios.
I like the imagery of a drive-by goroutine a lot, the literature also uses