Possible thread leak in "calico-node -felix" #5018
Comments
Could you provide the output of …?
Absolutely, it's here: https://gist.github.com/igcherkaev/e080618692e002fdfd6d6826042068ef
So, it seems like there's a hard limit of 10k threads allowed per process, and once calico-node reached it, it dumped a stack trace and crashed/restarted. Excerpt from the very long listing of goroutine stack traces:
Full listing is available here: https://gist.github.com/igcherkaev/fd2d416e0326b288104bbd5b77013b5d (5.7 MB of text)
Actually, GitHub Gist truncates it to the first 10k lines... Here's the full version: https://www.dropbox.com/s/07s39cxa9ivkbfe/calico-stack-traces.txt?dl=0
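The 10k figure lines up with the Go runtime's built-in cap on OS threads: once a program needs more than 10,000 of them, the runtime dumps every goroutine's stack and aborts, which matches the crash described above. A minimal sketch (not Felix code) of the knob that controls this cap, runtime/debug.SetMaxThreads:

```go
package main

import (
	"fmt"
	"runtime/debug"
)

func main() {
	// SetMaxThreads returns the previous setting; setting it straight back
	// leaves the runtime untouched, so this just reports the current cap
	// (10,000 by default). Exceeding the cap is a fatal, unrecoverable
	// error that dumps all goroutine stacks, as seen in the gist above.
	prev := debug.SetMaxThreads(10000)
	debug.SetMaxThreads(prev)
	fmt.Printf("OS-thread cap: %d\n", prev)
}
```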
A ton of goroutines are stuck in this forever:
Yes, I noticed that, though it's unclear to me what it means; I haven't yet looked into the respective Go package and source code lines...
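For reference, a goroutine dump like the ones linked above can be pulled from any Go binary that exposes net/http/pprof; a minimal, generic sketch (how Felix actually wires up its debug endpoints is not shown in this thread):

```go
package main

import (
	"log"
	"net/http"
	_ "net/http/pprof" // registers /debug/pprof/* handlers on the default mux
)

func main() {
	// With this running, a full goroutine stack listing is available at
	// http://localhost:6060/debug/pprof/goroutine?debug=2
	log.Fatal(http.ListenAndServe("localhost:6060", nil))
}
```

A plain GET of the `?debug=2` URL, or `go tool pprof http://localhost:6060/debug/pprof/goroutine`, yields the same kind of stack listing as the crash dump above.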
Upgrading the lib to at least v1.2.0 might help a little (https://github.com/mdlayher/netlink/blob/main/CHANGELOG.md#v120), but that is likely imported by a dependency.
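If the top-level module is willing to pin it, a transitive dependency can be bumped from go.mod directly; a sketch, assuming nothing else in the graph requires an older API:

```
// In the importing module's go.mod (sketch): Go's MVS picks the highest
// required version, so this holds as long as no dependency demands newer.
require github.com/mdlayher/netlink v1.2.0
```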
Imported by:
cc @petercork
Afaict there is no finalizer, so if a user (probably the wireguard library) does not call Close and just forgets about a socket (as you can with the "net" package etc.), then the thread is never stopped and cleaned up 🤷
This is an (idle) netlink socket sitting in a netns. v1.2.0 no longer does this, but upgrading won't solve the fact that something's maintaining a bunch of open sockets.
True, that would only solve the runaway system threads, but would not solve the goroutine leak, likely in the WG lib.
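For illustration, the Close discipline being discussed, sketched against wgctrl-go's public API (wgctrl.New / Client.Devices / Client.Close; this is not Felix's actual code):

```go
package main

import (
	"log"

	"golang.zx2c4.com/wireguard/wgctrl"
)

func main() {
	c, err := wgctrl.New()
	if err != nil {
		log.Fatalf("opening wgctrl client: %v", err)
	}
	// Without this Close the underlying netlink socket is never released:
	// there is no finalizer, so a dropped client leaks the socket (and, on
	// pre-v1.2.0 mdlayher/netlink, a dedicated OS thread) until the
	// process exits.
	defer c.Close()

	devs, err := c.Devices()
	if err != nil {
		log.Fatalf("listing WireGuard devices: %v", err)
	}
	for _, d := range devs {
		log.Printf("device %s has %d peers", d.Name, len(d.Peers))
	}
}
```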
It sounds like the issue is that we need to upgrade to the latest version of https://github.com/WireGuard/wgctrl-go. |
Per @mikestephen's suggestion in Slack, I disabled Prometheus metrics in calico-node, and after an hour of uptime since the restart there has been no thread-count increase at all on any node in the cluster.
Looks like projectcalico/felix#3052 is a candidate fix - @mikestephen, do we have confirmation that this fixes the issue?
This should be fixed in v3.21, which was released yesterday. |
Expected Behavior
Thread count should stay about the same over time.
Current Behavior
It appears to be steadily growing.
Overall, we don't see any issues with Calico, but our monitoring system raised several alerts about thread count (our monitoring team just began adding various checks for our Kubernetes nodes). It differs node by node, and we have yet to see a pattern: some nodes appear to be OK, but on some it's growing pretty fast. On one node it's already over 9000!
Nothing in the calico logs suggests the reason for having so many threads.
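For anyone reproducing the measurement: on Linux, a process's thread count is simply the number of entries under /proc/<pid>/task. A small Go sketch (generic procfs, not tied to calico-node):

```go
package main

import (
	"fmt"
	"os"
)

// threadCount returns the number of OS threads of a process by counting
// the entries under /proc/<pid>/task (one directory per thread on Linux).
func threadCount(pid int) (int, error) {
	entries, err := os.ReadDir(fmt.Sprintf("/proc/%d/task", pid))
	if err != nil {
		return 0, err
	}
	return len(entries), nil
}

func main() {
	n, err := threadCount(os.Getpid())
	if err != nil {
		fmt.Fprintln(os.Stderr, err)
		os.Exit(1)
	}
	fmt.Printf("this process currently has %d OS threads\n", n)
}
```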
Context
We upgraded from 3.19 to 3.20.2 not long ago, but we haven't been monitoring thread count on the nodes, so we can't say whether it's related to the upgrade or not. At this point we'd like to make sure this is expected behavior from calico-node.
Your Environment