runtime: blocked channel operations starvation #21187

Closed
johnsnow4514 opened this issue Jul 27, 2017 · 8 comments

@johnsnow4514 commented Jul 27, 2017

What version of Go are you using (go version)?

1.8.0

What operating system and processor architecture are you using (go env)?

windows/amd64

What did you do?

Related to #11506
I have a messaging app that generates messages and passes them over a channel to goroutines that transform those messages into small packets of data. The packets are then passed over another channel to goroutines that read them and send them over a socket.
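
For illustration, here is a minimal sketch of that pipeline shape. All names, types, and the packet size are hypothetical; the original code is not shown in this issue.

package pipeline

import "net"

// Message and Packet are hypothetical stand-ins for the app's real types.
type Message struct{ Payload []byte }
type Packet struct{ Data []byte }

// transform reads messages and splits each one into small packets.
func transform(messages <-chan Message, packets chan<- Packet) {
	const chunk = 512 // hypothetical packet size
	for m := range messages {
		for off := 0; off < len(m.Payload); off += chunk {
			end := off + chunk
			if end > len(m.Payload) {
				end = len(m.Payload)
			}
			packets <- Packet{Data: m.Payload[off:end]}
		}
	}
}

// send reads packets and writes them to a socket.
func send(packets <-chan Packet, conn net.Conn) {
	for p := range packets {
		if _, err := conn.Write(p.Data); err != nil {
			return // error handling is app-specific; this sketch just stops
		}
	}
}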

What did you expect to see?

Packets that are generated around the same point in time should be sent around the same time. There shouldn't be more than a few milliseconds between sending a packet on a channel and receiving it in the receiving goroutine.

What did you see instead?

In some cases a goroutine can starve for around 30 seconds while other packets are transferred just fine. This caused problems in my app, and I had to implement my own wrapper around a channel to ensure FIFO ordering, which fixed the problem.
Are there any plans to fix #11506? @rsc @randall77
It really causes rare and hard-to-diagnose behavior in production environments.
Here's my implementation of a FIFO channel, in case anyone else is facing the same problem...

package utils

import (
	"sync"
	"sync/atomic"
)

type FifoChannel struct {
	innerChan                       chan interface{}
	conditions                      map[uint64]*sync.WaitGroup
	mapLock                         *sync.RWMutex
	currentPriority, latestPriority *uint64
}

func NewFifoChannel(innerChan chan interface{}) *FifoChannel {
	zero1 := uint64(0)
	zero2 := uint64(0)
	return &FifoChannel{
		innerChan:       innerChan,
		currentPriority: &zero1,
		latestPriority:  &zero2,
		conditions:      make(map[uint64]*sync.WaitGroup),
		mapLock:         new(sync.RWMutex),
	}
}

func (fc *FifoChannel) Add(item interface{}) {
	// Take a ticket; tickets are handed out in arrival order.
	myPriority := atomic.AddUint64(fc.latestPriority, 1) - 1
	if myPriority != atomic.LoadUint64(fc.currentPriority) {
		// Not our turn yet: register a gate and wait for the previous
		// sender to open it.
		gate := new(sync.WaitGroup)
		gate.Add(1)
		fc.mapLock.Lock()
		fc.conditions[myPriority] = gate
		fc.mapLock.Unlock()
		// Re-check after registering, in case our turn arrived in the meantime.
		if myPriority != atomic.LoadUint64(fc.currentPriority) {
			gate.Wait()
		}
		fc.mapLock.Lock()
		delete(fc.conditions, myPriority)
		fc.mapLock.Unlock()
	}
	fc.innerChan <- item
	// Advance the turn and release the next waiting sender, if it has
	// already registered its gate.
	atomic.AddUint64(fc.currentPriority, 1)
	fc.mapLock.RLock()
	if gate, exist := fc.conditions[myPriority+1]; exist {
		gate.Done()
	}
	fc.mapLock.RUnlock()
}

func (fc *FifoChannel) Get() interface{} {
	return <-fc.innerChan
}
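
For context, here is a hypothetical usage sketch of the wrapper above, as a separate file in the same package: several goroutines call Add concurrently, one goroutine drains the inner channel with Get, and Add hands items to the inner channel in the order the senders took their tickets. The function name and the counts are made up for illustration.

package utils

import (
	"fmt"
	"sync"
)

func fifoChannelDemo() {
	const senders, itemsPerSender = 4, 100
	fifo := NewFifoChannel(make(chan interface{}, 8))

	// One receiver drains the inner channel.
	done := make(chan struct{})
	go func() {
		for i := 0; i < senders*itemsPerSender; i++ {
			_ = fifo.Get()
		}
		close(done)
	}()

	// Several senders push items concurrently; Add serializes them so the
	// sender that arrived first hands its item to the channel first.
	var wg sync.WaitGroup
	for id := 0; id < senders; id++ {
		wg.Add(1)
		go func(id int) {
			defer wg.Done()
			for j := 0; j < itemsPerSender; j++ {
				fifo.Add(fmt.Sprintf("sender %d item %d", id, j))
			}
		}(id)
	}
	wg.Wait()
	<-done
}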

@mvdan changed the title from "blocked channel operations starvation" to "runtime: blocked channel operations starvation" on Jul 27, 2017

@mvdan (Member) commented Jul 27, 2017

Have you tried on 1.9rc1? What about other operating systems, like Linux?

Also CC @aclements

@johnsnow4514 (Author) commented Jul 27, 2017

@mvdan Not yet. I find it hard to reproduce in my program.
The thing is, I am using my FIFO channel only for the packets producer/consumer logic.
After 6 months in production, the problem occurred on a higher-level channel (the messages channel, instead of the packets channel).
The symptom was that some messages starved for around 30-40 seconds and triggered a timeout.
Fortunately, I also found issue #6205, which links to reproducible playground code.
I'll soon test that code on 1.9rc1 and Linux.

@ianlancetaylor (Contributor) commented Jul 27, 2017

I don't see a bug in the playground code in the last comment. It's true that the number of values sent on the different channels can be very unequal, but that is not a bug. It also doesn't seem to directly correspond to the original problem description, which was about multiple receiving goroutines rather than multiple sending goroutines.

With multiple receiving goroutines, it's still not a bug if one of them is starved. Why does it matter as long as all the values sent on the channel are being delivered to some goroutine?
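
To make that point concrete, here is a small hypothetical example (not from this issue): several receivers share one channel, and while the per-receiver counts can be lopsided, every value sent is still received by one of them.

package main

import (
	"fmt"
	"sync"
)

func main() {
	const receivers, values = 4, 100000
	ch := make(chan int)
	counts := make([]int, receivers)
	var mu sync.Mutex
	var wg sync.WaitGroup

	for i := 0; i < receivers; i++ {
		wg.Add(1)
		go func(id int) {
			defer wg.Done()
			for range ch {
				mu.Lock()
				counts[id]++
				mu.Unlock()
			}
		}(i)
	}

	for v := 0; v < values; v++ {
		ch <- v // every value is received by one of the receivers
	}
	close(ch)
	wg.Wait()

	total := 0
	for _, c := range counts {
		total += c
	}
	fmt.Println(counts, "total:", total) // total is always 100000, even if counts are uneven
}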

@johnsnow4514 (Author) commented Jul 29, 2017

@ianlancetaylor I was talking about multiple sending goroutines. One can starve, and its message will not be passed on for a long time.

@ianlancetaylor (Contributor) commented Jul 29, 2017

Your playground example does show that if the sending goroutines run a busy loop, then one or more of them can starve while trying to send a value. I'm not sure what, if anything, we can do about that. As you say above, it's possible to restructure the program to avoid the problem.

Other than that, it's usually a good idea for a single processor to keep running the same set of goroutines, and it's usually a good idea to start the goroutine receiving from a channel when the channel is full (though I'm not sure the current scheduler actually does this). For an example like your playground example, it's possible for the sending and receiving goroutines to be on the same processor, and for that processor to keep flipping back and forth and, apparently, to tie up the channel sufficiently that no other goroutine gets in.

But anything we do to change that will tend to slow down the scheduler and may penalize other programs. It's not obvious to me that we want to optimize for the case of multiple goroutines sending to the same channel in a busy loop. Of course, if you have any suggestions I would like to hear them.
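
For readers without the playground link handy, the pattern under discussion looks roughly like the following hypothetical sketch (it is not the original playground code from #6205): several goroutines send to one unbuffered channel in a tight loop, and the per-sender send counts can come out very unequal.

package main

import (
	"fmt"
	"time"
)

func main() {
	const senders = 4
	ch := make(chan int)
	counts := make([]int, senders)

	for i := 0; i < senders; i++ {
		go func(id int) {
			for {
				ch <- id // busy loop: send as fast as possible
			}
		}(i)
	}

	deadline := time.After(2 * time.Second)
	for {
		select {
		case id := <-ch:
			counts[id]++
		case <-deadline:
			fmt.Println(counts) // counts are often far from equal
			return
		}
	}
}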

@johnsnow4514 (Author) commented Jul 29, 2017

I think @rsc described it best (#11506):

Especially since so many Go programs care about latency, making channels first come, first served seems worth doing. And it looks to me like it can be done trivially and at no performance cost. The current behavior looks more like a bug than an intended result.

I believe that a channel should be FIFO when multiple senders are blocked.
The first sender to get blocked should be the first to be released.

@RLH (Contributor) commented Jul 29, 2017

@ianlancetaylor (Contributor) commented Mar 30, 2018

As far as I can tell there is nothing to do here. Please comment if you disagree.

@golang locked and limited conversation to collaborators on Mar 30, 2019
