Skip to content

Commit

Permalink
pkg/kvstore: Fix for deadlock in etcd status checker
Browse files Browse the repository at this point in the history
Etcd quorum checks are falsely reported as failing even though connection
to etcd is intact. This can cause health checks to fail in both the agent
and the operator.

This happens due to a deadlock in pkg/kvstore/etcd after a prolonged
downtime of etcd. Status check errors are being sent into a channel for the
purpose of recreating kvstore connections in clustermesh. However when
clustermesh is not used, messages from this channel are never read. The
channel uses a buffer of size 128. After etcd has been down long enough to
generate 128 errors, we enter a deadlock state. Agent / operator will
continue to report etcd quorum failures and inturn health check failures
until they're restarted.

statusChecker()
	-> isConnectedAndHasQuorum()
		-> waitForInitLock()
			-> goroutine -> for -> ( initLockSucceeded <- err )
                        -> chan initLockSucceeded returned
		-> Block on receiving messages from initLockSucceeded channel
	-> e.statusCheckErrors <- e.latestErrorStatus [Blocked after 128 entries]

Blocked goroutines captured from cilium 1.10 operator:

goroutine 3309 [chan send, 13456 minutes]:
github.com/cilium/cilium/pkg/kvstore.(*etcdClient).statusChecker(0xc00017db30)
	/go/src/github.com/cilium/cilium/pkg/kvstore/etcd.go:1171 +0x75a
created by github.com/cilium/cilium/pkg/kvstore.connectEtcdClient
	/go/src/github.com/cilium/cilium/pkg/kvstore/etcd.go:801 +0x679

goroutine 7838665 [chan send, 13505 minutes]:
g.com/c/cilium/pkg/kvstore.(*etcdClient).waitForInitLock.func1(-,-,-,-)
	/go/src/github.com/cilium/cilium/pkg/kvstore/etcd.go:433 +0x449
created by github.com/cilium/cilium/pkg/kvstore.(*etcdClient).waitForInitLock
	/go/src/github.com/cilium/cilium/pkg/kvstore/etcd.go:425 +0x7f

Signed-off-by: Hemanth Malla <hemanth.malla@datadoghq.com>
  • Loading branch information
hemanthmalla authored and tklauser committed Apr 14, 2023
1 parent 102170e commit 9bb669b
Showing 1 changed file with 7 additions and 1 deletion.
8 changes: 7 additions & 1 deletion pkg/kvstore/etcd.go
Original file line number Diff line number Diff line change
Expand Up @@ -1195,7 +1195,13 @@ func (e *etcdClient) statusChecker() {

e.statusLock.Unlock()
if e.latestErrorStatus != nil {
e.statusCheckErrors <- e.latestErrorStatus
select {
case e.statusCheckErrors <- e.latestErrorStatus:
default:
// Channel's buffer is full, skip sending errors to the channel but log warnings instead
log.WithError(e.latestErrorStatus).
Warning("Status check error channel is full, dropping this error")
}
}

select {
Expand Down

0 comments on commit 9bb669b

Please sign in to comment.