
Possible thread leak in "calico-node -felix" #5018

Closed
igcherkaev opened this issue Oct 25, 2021 · 16 comments
@igcherkaev

Expected Behavior

Thread count should stay about same over time.

Current Behavior

It appears to be steadily growing.

[Graph: calico-node thread count steadily increasing over time]

Overall, we don't see any issues with Calico, but our monitoring system raised several alerts about thread count (our monitoring team has just begun adding various checks for our Kubernetes nodes). It differs node by node, and we have yet to see a pattern: some nodes appear to be fine, but on others the count is growing fast. On one node it's already over 9000!

rnqkbs401 ~ # ps -eLf | grep calico-node | wc -l
9971

Nothing in the calico logs suggests the reason for having so many threads.

Context

We upgraded from 3.19 to 3.20.2 not long ago, but we haven't been monitoring thread count on the nodes, so we can't say whether it's related to the upgrade. At this point we'd like to confirm whether this is expected behavior for calico-node.

Your Environment

  • Calico version
$ calicoctl version
Client Version:    v3.20.2
Git commit:        dcb4b76a
Cluster Version:   v3.20.2
Cluster Type:      typha,kdd,k8s,bgp,lvp
  • Orchestrator version (e.g. kubernetes, mesos, rkt):
Kubernetes v1.21.5 and v1.20.9
  • Operating System and version:
Flatcar Container Linux by Kinvolk 2765.2.5 (Oklo)   5.10.38-flatcar   containerd://1.4.4
@tomastigera
Contributor

Could you provide the output of ps -eLf | grep calico-node

@igcherkaev
Author

Could you provide the output of ps -eLf | grep calico-node

Absolutely, it's here: https://gist.github.com/igcherkaev/e080618692e002fdfd6d6826042068ef

@igcherkaev
Author

So, it seems like there's a hard limit of 10k threads allowed per process, and once calico-node reached it, it dumped a stack trace and crashed/restarted:

Excerpt from the very long listing of goroutines stack traces:

calico-node-l99l7 calico-node runtime: program exceeds 10000-thread limit
calico-node-l99l7 calico-node fatal error: thread exhaustion
calico-node-l99l7 calico-node 
calico-node-l99l7 calico-node runtime stack:
calico-node-l99l7 calico-node runtime.throw(0x299fb17, 0x11)
calico-node-l99l7 calico-node   /usr/local/go-cgo/src/runtime/panic.go:1116 +0x72
calico-node-l99l7 calico-node runtime.checkmcount()
calico-node-l99l7 calico-node   /usr/local/go-cgo/src/runtime/proc.go:622 +0xac
calico-node-l99l7 calico-node runtime.mReserveID(0x3fca718)
calico-node-l99l7 calico-node   /usr/local/go-cgo/src/runtime/proc.go:636 +0x3e
calico-node-l99l7 calico-node runtime.startm(0x0, 0x7eecaa938d01)
calico-node-l99l7 calico-node   /usr/local/go-cgo/src/runtime/proc.go:1957 +0x79
calico-node-l99l7 calico-node runtime.wakep()
calico-node-l99l7 calico-node   /usr/local/go-cgo/src/runtime/proc.go:2053 +0x66
calico-node-l99l7 calico-node runtime.resetspinning()
calico-node-l99l7 calico-node   /usr/local/go-cgo/src/runtime/proc.go:2525 +0x59
calico-node-l99l7 calico-node runtime.schedule()
calico-node-l99l7 calico-node   /usr/local/go-cgo/src/runtime/proc.go:2676 +0x2b9
calico-node-l99l7 calico-node runtime.mstart1()
calico-node-l99l7 calico-node   /usr/local/go-cgo/src/runtime/proc.go:1179 +0x93
calico-node-l99l7 calico-node runtime.mstart()
calico-node-l99l7 calico-node   /usr/local/go-cgo/src/runtime/proc.go:1137 +0x6e
calico-node-l99l7 calico-node 
calico-node-l99l7 calico-node goroutine 1 [select, 4992 minutes]:
calico-node-l99l7 calico-node github.com/projectcalico/felix/daemon.monitorAndManageShutdown(0xc0000be7e0, 0x0, 0x0, 0x0, 0x0)
calico-node-l99l7 calico-node   /go/pkg/mod/github.com/projectcalico/felix@v0.0.0-20210924151702-ab06c3940caa/daemon/daemon.go:671 +0x294
calico-node-l99l7 calico-node github.com/projectcalico/felix/daemon.Run(0x29a7270, 0x15, 0x2d07178, 0x7, 0x2d6fe60, 0x18, 0x2db6580, 0x28)
calico-node-l99l7 calico-node   /go/pkg/mod/github.com/projectcalico/felix@v0.0.0-20210924151702-ab06c3940caa/daemon/daemon.go:642 +0x2f30
calico-node-l99l7 calico-node main.main()
calico-node-l99l7 calico-node   /go/src/github.com/projectcalico/node/cmd/calico-node/main.go:116 +0x714
calico-node-l99l7 calico-node 
calico-node-l99l7 calico-node goroutine 34 [chan receive]:
calico-node-l99l7 calico-node k8s.io/klog/v2.(*loggingT).flushDaemon(0x3fca340)
calico-node-l99l7 calico-node   /go/pkg/mod/k8s.io/klog/v2@v2.8.0/klog.go:1164 +0x8b
calico-node-l99l7 calico-node created by k8s.io/klog/v2.init.0
calico-node-l99l7 calico-node   /go/pkg/mod/k8s.io/klog/v2@v2.8.0/klog.go:418 +0xdf
calico-node-l99l7 calico-node 
...

Full listing is available here: https://gist.github.com/igcherkaev/fd2d416e0326b288104bbd5b77013b5d (5.7Mb of text)
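
For context on the crash itself: the 10000-thread ceiling in the trace above is Go's default runtime limit, adjustable via runtime/debug.SetMaxThreads. Below is a minimal standalone sketch (not Calico code) of the failure mode: goroutines that lock themselves to an OS thread and then block forever each pin one thread, so leaking them steadily eats into the limit until the runtime aborts with "fatal error: thread exhaustion".

package main

import (
	"fmt"
	"runtime"
	"runtime/debug"
	"runtime/pprof"
	"time"
)

func main() {
	// 10000 is already the default; set explicitly here only to make the ceiling visible.
	debug.SetMaxThreads(10000)

	for i := 0; i < 50; i++ {
		go func() {
			runtime.LockOSThread() // pin this goroutine to a dedicated OS thread
			select {}              // block forever; the pinned thread can never be reused
		}()
	}

	time.Sleep(500 * time.Millisecond)
	fmt.Println("goroutines:", runtime.NumGoroutine())
	fmt.Println("threads created:", pprof.Lookup("threadcreate").Count())
}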

@igcherkaev
Author

Actually, GitHub Gist truncates it to the first 10k lines... Here's the full version: https://www.dropbox.com/s/07s39cxa9ivkbfe/calico-stack-traces.txt?dl=0

@tomastigera
Contributor

A ton of goroutines stuck in this forever:

calico-node-l99l7 calico-node goroutine 224275 [select, 4561 minutes, locked to thread]:
calico-node-l99l7 calico-node github.com/mdlayher/netlink.newLockedNetNSGoroutine.func1(0xc006ac9401, 0xc002eee0c0, 0x2aa3688, 0xc006ac9440, 0xc0037a8b70, 0xc0047f03c0)
calico-node-l99l7 calico-node 	/go/pkg/mod/github.com/mdlayher/netlink@v1.1.0/conn_linux.go:738 +0x211
calico-node-l99l7 calico-node created by github.com/mdlayher/netlink.newLockedNetNSGoroutine
calico-node-l99l7 calico-node 	/go/pkg/mod/github.com/mdlayher/netlink@v1.1.0/conn_linux.go:669 +0x205

@igcherkaev
Author

Yes, I noticed that, though it's unclear to me what it means; I haven't yet looked into the respective Go package and source code lines...

@tomastigera
Contributor

tomastigera commented Oct 25, 2021

Upgrading the lib to at least v1.2.0 might help a little https://github.com/mdlayher/netlink/blob/main/CHANGELOG.md#v120

[Performance] #171: netlink.Conn no longer requires a locked OS thread for the vast majority of operations, which should result in a significant speedup for highly concurrent callers. Thanks @ti-mo.
[Bug Fix] #169: calls to netlink.Conn.Close are now able to unblock concurrent calls to netlink.Conn.Receive and other blocking operations.

but that is likely imported by a dependency.

@tomastigera
Contributor

tomastigera commented Oct 25, 2021

Imported by:

golang.zx2c4.com/wireguard/wgctrl@v0.0.0-20200324154536-ceff61240acf github.com/mdlayher/netlink@v1.1.0
github.com/jsimonetti/rtnetlink@v0.0.0-20190606172950-9527aa82566a github.com/mdlayher/netlink@v0.0.0-20190409211403-11939a169225
github.com/jsimonetti/rtnetlink@v0.0.0-20200117123717-f846d4f6c1f4 github.com/mdlayher/netlink@v1.0.0

cc @petercork
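
As a quick way to confirm which netlink version a particular build actually links, go version -m on the calico-node binary lists the embedded module versions; a Go program can also report this about itself via runtime/debug.ReadBuildInfo. A small sketch (not part of Calico; the module path is the only thing taken from the list above):

package main

import (
	"fmt"
	"runtime/debug"
)

func main() {
	info, ok := debug.ReadBuildInfo()
	if !ok {
		fmt.Println("binary was built without module support")
		return
	}
	// Print the resolved version of the transitive netlink dependency.
	for _, dep := range info.Deps {
		if dep.Path == "github.com/mdlayher/netlink" {
			fmt.Printf("%s %s\n", dep.Path, dep.Version)
		}
	}
}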

@tomastigera
Contributor

tomastigera commented Oct 25, 2021

AFAICT there is no finalizer, so if a user (probably the WireGuard library) does not call Close and just forgets about a socket (as you can with the "net" package, etc.), then the thread is never stopped and cleaned up 🤷
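
For illustration only, a hypothetical sketch of the kind of finalizer being described (this is not the library's API; io.Closer stands in for the netlink connection type): a wrapper that closes its connection when garbage-collected, so a caller that forgets Close does not pin an OS thread forever.

package connwrap

import (
	"io"
	"runtime"
)

// TrackedConn wraps an underlying connection (assumed to behave like a
// *netlink.Conn) and guarantees it is eventually closed.
type TrackedConn struct {
	io.Closer
}

// New registers a finalizer as a safety net. Callers should still Close
// explicitly; the finalizer only limits the damage when they forget.
func New(c io.Closer) *TrackedConn {
	t := &TrackedConn{Closer: c}
	runtime.SetFinalizer(t, func(tc *TrackedConn) {
		tc.Close() // release the socket so its locked goroutine and OS thread can exit
	})
	return t
}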

@ti-mo
Contributor

ti-mo commented Oct 25, 2021

A ton of goroutines stuck in this forever:

calico-node-l99l7 calico-node goroutine 224275 [select, 4561 minutes, locked to thread]:
calico-node-l99l7 calico-node github.com/mdlayher/netlink.newLockedNetNSGoroutine.func1(0xc006ac9401, 0xc002eee0c0, 0x2aa3688, 0xc006ac9440, 0xc0037a8b70, 0xc0047f03c0)
calico-node-l99l7 calico-node 	/go/pkg/mod/github.com/mdlayher/netlink@v1.1.0/conn_linux.go:738 +0x211
calico-node-l99l7 calico-node created by github.com/mdlayher/netlink.newLockedNetNSGoroutine
calico-node-l99l7 calico-node 	/go/pkg/mod/github.com/mdlayher/netlink@v1.1.0/conn_linux.go:669 +0x205

This is an (idle) netlink socket sitting in a netns. v1.2.0 no longer does this, but upgrading won't solve the fact that something is maintaining a bunch of open netlink.Conns.
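
To make the leak pattern concrete: per the stack traces above, with mdlayher/netlink v1.1.0 every open Conn keeps a goroutine locked to a dedicated OS thread, so a stats poller that opens a wgctrl client per cycle and never closes it leaks one thread per poll. A hedged sketch of the correct pattern (assumed wgctrl usage; this is not Felix's actual code):

package main

import (
	"log"

	"golang.zx2c4.com/wireguard/wgctrl"
)

func pollWireguardStats() error {
	c, err := wgctrl.New()
	if err != nil {
		return err
	}
	// Without this Close, the client's underlying netlink.Conn (and, on
	// v1.1.0, its locked goroutine and OS thread) lives for the rest of
	// the process.
	defer c.Close()

	devs, err := c.Devices()
	if err != nil {
		return err
	}
	for _, d := range devs {
		log.Printf("device %s: %d peers", d.Name, len(d.Peers))
	}
	return nil
}

func main() {
	if err := pollWireguardStats(); err != nil {
		log.Fatal(err)
	}
}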

@tomastigera
Contributor

tomastigera commented Oct 25, 2021

upgrading won't solve the fact something's maintaining a bunch of open netlink.Conns.

True, that would only solve the runaway system threads; it would not solve the goroutine leak, which is likely in the WG lib.

@mikestephen
Contributor

It sounds like the issue is that we need to upgrade to the latest version of https://github.com/WireGuard/wgctrl-go.
We're currently working on reproducing the issue and then we'll see if upgrading fixes it.

@igcherkaev
Author

Per @mikestephen's suggestion in Slack, I disabled Prometheus metrics in calico-node, and after an hour of uptime since the restart there has been no thread count increase at all on any node in the cluster.

@igcherkaev
Author

Since Prometheus metrics were disabled, the thread count has stabilized:

[Graph: thread count flat after disabling Prometheus metrics]

@caseydavenport
Member

Looks like projectcalico/felix#3052 is a candidate fix - @mikestephen, do we have confirmation that this fixes the issue?

@caseydavenport
Member

This should be fixed in v3.21, which was released yesterday.
