
Possible thread leak in "calico-node -felix" #5018

Closed
igcherkaev opened this issue Oct 25, 2021 · 16 comments
@igcherkaev

Expected Behavior

Thread count should stay about same over time.

Current Behavior

It appears to be steadily growing.

[Graph: calico-node thread count steadily increasing over time]

Overall, we don't see any issues with Calico, but our monitoring system raised several alerts about thread count (our monitoring team has just begun adding various checks for our Kubernetes nodes). It differs node by node, and we have yet to see a pattern: some nodes appear to be fine, but on others the count is growing fast. On one node it's already over 9000!

rnqkbs401 ~ # ps -eLf | grep calico-node | wc -l
9971

Nothing in the calico logs suggests the reason for having so many threads.

Context

We upgraded from 3.19 to 3.20.2 not long ago, but we haven't been monitoring thread count on the nodes, so we can't say whether it's related to the upgrade. At this point we'd like to confirm whether this is expected behavior for calico-node.

Your Environment

  • Calico version
$ calicoctl version
Client Version:    v3.20.2
Git commit:        dcb4b76a
Cluster Version:   v3.20.2
Cluster Type:      typha,kdd,k8s,bgp,lvp
  • Orchestrator version (e.g. kubernetes, mesos, rkt):
Kubernetes v1.21.5 and v1.20.9
  • Operating System and version:
Flatcar Container Linux by Kinvolk 2765.2.5 (Oklo)   5.10.38-flatcar   containerd://1.4.4
@tomastigera
Contributor

Could you provide the output of ps -eLf | grep calico-node

@igcherkaev
Author

Could you provide the output of ps -eLf | grep calico-node

Absolutely, it's here: https://gist.github.com/igcherkaev/e080618692e002fdfd6d6826042068ef

@igcherkaev
Author

So, it seems like there's a hard limit of 10k threads allowed per process, and once calico-node reached it, it dumped a stack trace and crashed/restarted:

Excerpt from the very long listing of goroutines stack traces:

calico-node-l99l7 calico-node runtime: program exceeds 10000-thread limit
calico-node-l99l7 calico-node fatal error: thread exhaustion
calico-node-l99l7 calico-node 
calico-node-l99l7 calico-node runtime stack:
calico-node-l99l7 calico-node runtime.throw(0x299fb17, 0x11)
calico-node-l99l7 calico-node   /usr/local/go-cgo/src/runtime/panic.go:1116 +0x72
calico-node-l99l7 calico-node runtime.checkmcount()
calico-node-l99l7 calico-node   /usr/local/go-cgo/src/runtime/proc.go:622 +0xac
calico-node-l99l7 calico-node runtime.mReserveID(0x3fca718)
calico-node-l99l7 calico-node   /usr/local/go-cgo/src/runtime/proc.go:636 +0x3e
calico-node-l99l7 calico-node runtime.startm(0x0, 0x7eecaa938d01)
calico-node-l99l7 calico-node   /usr/local/go-cgo/src/runtime/proc.go:1957 +0x79
calico-node-l99l7 calico-node runtime.wakep()
calico-node-l99l7 calico-node   /usr/local/go-cgo/src/runtime/proc.go:2053 +0x66
calico-node-l99l7 calico-node runtime.resetspinning()
calico-node-l99l7 calico-node   /usr/local/go-cgo/src/runtime/proc.go:2525 +0x59
calico-node-l99l7 calico-node runtime.schedule()
calico-node-l99l7 calico-node   /usr/local/go-cgo/src/runtime/proc.go:2676 +0x2b9
calico-node-l99l7 calico-node runtime.mstart1()
calico-node-l99l7 calico-node   /usr/local/go-cgo/src/runtime/proc.go:1179 +0x93
calico-node-l99l7 calico-node runtime.mstart()
calico-node-l99l7 calico-node   /usr/local/go-cgo/src/runtime/proc.go:1137 +0x6e
calico-node-l99l7 calico-node 
calico-node-l99l7 calico-node goroutine 1 [select, 4992 minutes]:
calico-node-l99l7 calico-node github.com/projectcalico/felix/daemon.monitorAndManageShutdown(0xc0000be7e0, 0x0, 0x0, 0x0, 0x0)
calico-node-l99l7 calico-node   /go/pkg/mod/github.com/projectcalico/felix@v0.0.0-20210924151702-ab06c3940caa/daemon/daemon.go:671 +0x294
calico-node-l99l7 calico-node github.com/projectcalico/felix/daemon.Run(0x29a7270, 0x15, 0x2d07178, 0x7, 0x2d6fe60, 0x18, 0x2db6580, 0x28)
calico-node-l99l7 calico-node   /go/pkg/mod/github.com/projectcalico/felix@v0.0.0-20210924151702-ab06c3940caa/daemon/daemon.go:642 +0x2f30
calico-node-l99l7 calico-node main.main()
calico-node-l99l7 calico-node   /go/src/github.com/projectcalico/node/cmd/calico-node/main.go:116 +0x714
calico-node-l99l7 calico-node 
calico-node-l99l7 calico-node goroutine 34 [chan receive]:
calico-node-l99l7 calico-node k8s.io/klog/v2.(*loggingT).flushDaemon(0x3fca340)
calico-node-l99l7 calico-node   /go/pkg/mod/k8s.io/klog/v2@v2.8.0/klog.go:1164 +0x8b
calico-node-l99l7 calico-node created by k8s.io/klog/v2.init.0
calico-node-l99l7 calico-node   /go/pkg/mod/k8s.io/klog/v2@v2.8.0/klog.go:418 +0xdf
calico-node-l99l7 calico-node 
...

Full listing is available here: https://gist.github.com/igcherkaev/fd2d416e0326b288104bbd5b77013b5d (5.7Mb of text)
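
For context on the crash itself: the 10000-thread ceiling in the trace above is Go's default runtime limit, adjustable via runtime/debug.SetMaxThreads. Below is a minimal standalone sketch (not Calico code) of the failure mode: goroutines that lock themselves to an OS thread and then block forever each pin one thread, so leaking them steadily eats into the limit until the runtime aborts with "fatal error: thread exhaustion".

package main

import (
	"fmt"
	"runtime"
	"runtime/debug"
	"runtime/pprof"
	"time"
)

func main() {
	// 10000 is already the default; set explicitly here only to make the ceiling visible.
	debug.SetMaxThreads(10000)

	for i := 0; i < 50; i++ {
		go func() {
			runtime.LockOSThread() // pin this goroutine to a dedicated OS thread
			select {}              // block forever; the pinned thread can never be reused
		}()
	}

	time.Sleep(500 * time.Millisecond)
	fmt.Println("goroutines:", runtime.NumGoroutine())
	fmt.Println("threads created:", pprof.Lookup("threadcreate").Count())
}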

@igcherkaev
Author

Actually, GitHub Gist truncates it to the first 10k lines... Here's the full version: https://www.dropbox.com/s/07s39cxa9ivkbfe/calico-stack-traces.txt?dl=0

@tomastigera
Contributor

A ton of goroutines stuck in this forever:

calico-node-l99l7 calico-node goroutine 224275 [select, 4561 minutes, locked to thread]:
calico-node-l99l7 calico-node github.com/mdlayher/netlink.newLockedNetNSGoroutine.func1(0xc006ac9401, 0xc002eee0c0, 0x2aa3688, 0xc006ac9440, 0xc0037a8b70, 0xc0047f03c0)
calico-node-l99l7 calico-node 	/go/pkg/mod/github.com/mdlayher/netlink@v1.1.0/conn_linux.go:738 +0x211
calico-node-l99l7 calico-node created by github.com/mdlayher/netlink.newLockedNetNSGoroutine
calico-node-l99l7 calico-node 	/go/pkg/mod/github.com/mdlayher/netlink@v1.1.0/conn_linux.go:669 +0x205

@igcherkaev
Author

Yes, I noticed that, though it's unclear to me what it means; I haven't yet looked into the respective Go package and source code lines...

@tomastigera
Contributor

tomastigera commented Oct 25, 2021

Upgrading the lib to at least v1.2.0 might help a little https://github.com/mdlayher/netlink/blob/main/CHANGELOG.md#v120

[Performance] #171: netlink.Conn no longer requires a locked OS thread for the vast majority of operations, which should result in a significant speedup for highly concurrent callers. Thanks @ti-mo.
[Bug Fix] #169: calls to netlink.Conn.Close are now able to unblock concurrent calls to netlink.Conn.Receive and other blocking operations.

but that is likely imported by a dependency.

@tomastigera
Contributor

tomastigera commented Oct 25, 2021

Imported by:

golang.zx2c4.com/wireguard/wgctrl@v0.0.0-20200324154536-ceff61240acf github.com/mdlayher/netlink@v1.1.0
github.com/jsimonetti/rtnetlink@v0.0.0-20190606172950-9527aa82566a github.com/mdlayher/netlink@v0.0.0-20190409211403-11939a169225
github.com/jsimonetti/rtnetlink@v0.0.0-20200117123717-f846d4f6c1f4 github.com/mdlayher/netlink@v1.0.0

cc @petercork
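
As a quick way to confirm which netlink version a particular build actually links, go version -m on the calico-node binary lists the embedded module versions; a Go program can also report this about itself via runtime/debug.ReadBuildInfo. A small sketch (not part of Calico; the module path is the only thing taken from the list above):

package main

import (
	"fmt"
	"runtime/debug"
)

func main() {
	info, ok := debug.ReadBuildInfo()
	if !ok {
		fmt.Println("binary was built without module support")
		return
	}
	// Print the resolved version of the transitive netlink dependency.
	for _, dep := range info.Deps {
		if dep.Path == "github.com/mdlayher/netlink" {
			fmt.Printf("%s %s\n", dep.Path, dep.Version)
		}
	}
}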

@tomastigera
Contributor

tomastigera commented Oct 25, 2021

AFAICT there is no finalizer, so if a user (probably the WireGuard library) does not call Close and just forgets about a socket (as you can with the "net" package, etc.), then the thread is never stopped and cleaned up 🤷
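
For illustration only, a hypothetical sketch of the kind of finalizer being described (this is not the library's API; io.Closer stands in for the netlink connection type): a wrapper that closes its connection when garbage-collected, so a caller that forgets Close does not pin an OS thread forever.

package connwrap

import (
	"io"
	"runtime"
)

// TrackedConn wraps an underlying connection (assumed to behave like a
// *netlink.Conn) and guarantees it is eventually closed.
type TrackedConn struct {
	io.Closer
}

// New registers a finalizer as a safety net. Callers should still Close
// explicitly; the finalizer only limits the damage when they forget.
func New(c io.Closer) *TrackedConn {
	t := &TrackedConn{Closer: c}
	runtime.SetFinalizer(t, func(tc *TrackedConn) {
		tc.Close() // release the socket so its locked goroutine and OS thread can exit
	})
	return t
}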

@ti-mo
Contributor

ti-mo commented Oct 25, 2021

A ton of goroutines stuck in this forever:

calico-node-l99l7 calico-node goroutine 224275 [select, 4561 minutes, locked to thread]:
calico-node-l99l7 calico-node github.com/mdlayher/netlink.newLockedNetNSGoroutine.func1(0xc006ac9401, 0xc002eee0c0, 0x2aa3688, 0xc006ac9440, 0xc0037a8b70, 0xc0047f03c0)
calico-node-l99l7 calico-node 	/go/pkg/mod/github.com/mdlayher/netlink@v1.1.0/conn_linux.go:738 +0x211
calico-node-l99l7 calico-node created by github.com/mdlayher/netlink.newLockedNetNSGoroutine
calico-node-l99l7 calico-node 	/go/pkg/mod/github.com/mdlayher/netlink@v1.1.0/conn_linux.go:669 +0x205

This is an (idle) netlink socket sitting in a netns. v1.2.0 no longer does this, but upgrading won't solve the fact that something is maintaining a bunch of open netlink.Conns.
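
To make the leak pattern concrete: per the stack traces above, with mdlayher/netlink v1.1.0 every open Conn keeps a goroutine locked to a dedicated OS thread, so a stats poller that opens a wgctrl client per cycle and never closes it leaks one thread per poll. A hedged sketch of the correct pattern (assumed wgctrl usage; this is not Felix's actual code):

package main

import (
	"log"

	"golang.zx2c4.com/wireguard/wgctrl"
)

func pollWireguardStats() error {
	c, err := wgctrl.New()
	if err != nil {
		return err
	}
	// Without this Close, the client's underlying netlink.Conn (and, on
	// v1.1.0, its locked goroutine and OS thread) lives for the rest of
	// the process.
	defer c.Close()

	devs, err := c.Devices()
	if err != nil {
		return err
	}
	for _, d := range devs {
		log.Printf("device %s: %d peers", d.Name, len(d.Peers))
	}
	return nil
}

func main() {
	if err := pollWireguardStats(); err != nil {
		log.Fatal(err)
	}
}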

@tomastigera
Contributor

tomastigera commented Oct 25, 2021

upgrading won't solve the fact something's maintaining a bunch of open netlink.Conns.

True, that would only solve the runaway system threads; it would not solve the goroutine leak, which is likely in the WG lib.

@mikestephen
Contributor

It sounds like the issue is that we need to upgrade to the latest version of https://github.com/WireGuard/wgctrl-go.
We're currently working on reproducing the issue and then we'll see if upgrading fixes it.

@igcherkaev
Author

Per @mikestephen's suggestion in Slack, I disabled Prometheus metrics in calico-node, and after an hour of uptime since the restart there has been no thread count increase at all on any node in the cluster.

@igcherkaev
Author

Since Prometheus metrics were disabled, the thread count has stabilized:

[Graph: thread count flat after disabling Prometheus metrics]

@caseydavenport
Member

Looks like projectcalico/felix#3052 is a candidate fix - @mikestephen, do we have confirmation that this fixes the issue?

@caseydavenport
Member

This should be fixed in v3.21, which was released yesterday.
