pkg/kvstore: Fix for deadlock in etcd status checker
Etcd quorum checks are falsely reported as failing even though the
connection to etcd is intact. This can cause health checks to fail in
both the agent and the operator. This happens due to a deadlock in
pkg/kvstore/etcd after a prolonged downtime of etcd.

Status check errors are sent into a channel for the purpose of
recreating kvstore connections in clustermesh. However, when
clustermesh is not used, messages from this channel are never read.
The channel uses a buffer of size 128. After etcd has been down long
enough to generate 128 errors, we enter a deadlock state. The agent /
operator will continue to report etcd quorum failures, and in turn
health check failures, until they're restarted.

statusChecker()
  -> isConnectedAndHasQuorum()
  -> waitForInitLock()
  -> goroutine
  -> for
  -> ( initLockSucceeded <- err )
  -> chan initLockSucceeded returned
  -> Block on receiving messages from initLockSucceeded channel
  -> e.statusCheckErrors <- e.latestErrorStatus [Blocked after 128 entries]

Blocked goroutines captured from cilium 1.10 operator:

goroutine 3309 [chan send, 13456 minutes]:
github.com/cilium/cilium/pkg/kvstore.(*etcdClient).statusChecker(0xc00017db30)
        /go/src/github.com/cilium/cilium/pkg/kvstore/etcd.go:1171 +0x75a
created by github.com/cilium/cilium/pkg/kvstore.connectEtcdClient
        /go/src/github.com/cilium/cilium/pkg/kvstore/etcd.go:801 +0x679

goroutine 7838665 [chan send, 13505 minutes]:
g.com/c/cilium/pkg/kvstore.(*etcdClient).waitForInitLock.func1(-,-,-,-)
        /go/src/github.com/cilium/cilium/pkg/kvstore/etcd.go:433 +0x449
created by github.com/cilium/cilium/pkg/kvstore.(*etcdClient).waitForInitLock
        /go/src/github.com/cilium/cilium/pkg/kvstore/etcd.go:425 +0x7f

Signed-off-by: Hemanth Malla <hemanth.malla@datadoghq.com>