fix(core): ignore duplicate ADDED pod watch events #1553
Conversation
Hey, are there any updates on this PR? It's been 2 months and this seems like a major issue. Without this fix we couldn't run KubeArmor for >8 hours in production before this resource leak started causing significant issues.
Hey @dqsully,
This completely makes sense. Apologies for missing it earlier; the holiday season got to us and this PR got overlooked.
We had been considering moving to informers and even have a commit for that (6dd22ef), but that hasn't gone through yet.
I have a small nitpick; it might be a bit opinionated, but maybe it's easier to understand as well?
I tried using a Kubernetes informer instead of the raw pod watch, and while it fixed the memory leak, it used significantly more memory in our Kubernetes cluster compared to the raw watch. You can see that in the cliff around 12/14 17:45 UTC on my original screenshot. (Although this probably wouldn't matter nearly as much if the KubeArmor DaemonSet pods weren't watching all pods and instead filtered the informer to only watch pods on the current node. I've actually already done this filtering in my fork, and I'll be making a PR for it soon, along with lots of other bugfixes I've made in my environment over the past couple months.)
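For illustration, here is a minimal sketch of such a node-scoped informer using client-go. This is not the fork's actual code; the `NODE_NAME` environment variable (injected via the downward API) and the resync interval are assumptions made for the sketch.

```go
package main

import (
	"os"
	"time"

	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
	"k8s.io/client-go/informers"
	"k8s.io/client-go/kubernetes"
	"k8s.io/client-go/rest"
)

func main() {
	// In-cluster config; assumes the pod runs with a service account that can list/watch pods.
	cfg, err := rest.InClusterConfig()
	if err != nil {
		panic(err)
	}
	client := kubernetes.NewForConfigOrDie(cfg)

	// NODE_NAME is assumed to be injected via the downward API (spec.nodeName)
	// in the DaemonSet manifest.
	nodeName := os.Getenv("NODE_NAME")

	// Scope the informer's list/watch to pods scheduled on this node only,
	// so the informer cache doesn't hold every pod in the cluster.
	factory := informers.NewSharedInformerFactoryWithOptions(
		client,
		30*time.Second, // resync period: arbitrary choice for this sketch
		informers.WithTweakListOptions(func(opts *metav1.ListOptions) {
			opts.FieldSelector = "spec.nodeName=" + nodeName
		}),
	)

	podInformer := factory.Core().V1().Pods().Informer()
	_ = podInformer // event handlers (AddFunc/UpdateFunc/DeleteFunc) would go here

	stop := make(chan struct{})
	factory.Start(stop)
	factory.WaitForCacheSync(stop)
}
```

The extra memory observed with the unfiltered informer comes largely from its full-cluster cache; a field selector like the one above should shrink that cache to only the node's own pods.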
Agreed, this was discussed in one of the community calls recently, but it seems like we forgot to open an issue.
I am really looking forward to them.
I will wait for the CI to pass before we merge this. Appreciate your help here.
I just went through main...dqsully:KubeArmor:other-improvements and want to pick up a lot of the changes. I would love to chat with you and get your feedback on KubeArmor in general.
You can join us on our Slack if you want real time interaction on anything.
I think I need to rebase my changes for CI to pass, lemme do that real quick.
Signed-off-by: Dakota Sullivan <djqballer@outlook.com>
Sorry, had to do one last force push to fix the commit verification (I need to fix my GitKraken config...).
Looks good now, will merge after the CI passes 🙌🏽 Great to know that you are trying to run KubeArmor in production. How are you using KubeArmor? We maintain an ADOPTERS.md and would love to hear about the use-cases if that's fine with you.
Purpose of PR?:
Fixes a memory leak I identified on our infrastructure while trying to deploy KubeArmor in our environment:
The `WatchK8sPods()` function uses raw Kubernetes watches instead of the Kubernetes SDK informer to track pods in a cluster. This isn't inherently an issue, and in KubeArmor's case it actually results in significantly lower memory usage, since KubeArmor doesn't care about most pod information. However, raw Kubernetes watches will eventually get stopped by the Kubernetes API server and you have to start a new watch, at which point every existing pod will be sent back as an "ADDED" event in the watch.

KubeArmor already restarts these watches, and even deduplicates stored pod information in an array, but it forwards every "ADDED" event verbatim to the `dm.UpdateEndPointWithPod()` function, which has no inherent event deduplication. This means that every time a watch restarts, duplicate endpoint configurations are added in memory for every pod in the cluster. And in a large cluster, this memory leak can grow pretty quickly.

My PR uses the duplicate status computed when storing pod information to determine whether an "ADDED" event should be forwarded to `dm.UpdateEndPointWithPod()` as a "MODIFIED" event instead, which greatly reduces KubeArmor memory leakage in our environment while still fully processing every pod change in the cluster.
Does this PR introduce a breaking change?
No
If the changes in this PR are manually verified, list down the scenarios covered:
Here's the average memory usage of the KubeArmor DaemonSet containers on our non-production Kubernetes cluster over the past 4 days, where I initially observed this memory leak and have been doing some testing on how to solve it:
tl;dr: the flat parts are my own changes, mostly just the changes in this PR. The positive slopes are KubeArmor v1.0.3 and v1.1.1, both with memory leaks after each redeployment.
Exactly what I did and when in this graph:

- `main` but using k8s SDK informers instead of raw watches
- `main` but ignoring duplicate "ADDED" events for `dm.UpdateEndPointWithPod()`
- `main` but with this PR's changes

This shows that with the unmodified v1.0.3 or v1.1.1 KubeArmor images, there's a clear memory leak, but with this PR's changes, while there still appears to be a small memory leak, the overall memory usage is much better.
We also ran KubeArmor v1.0.3 in production for only 8 hours until we had to take it down due to this memory leak and a knock-on CPU leak effect as well. This PR's changes have been running in our production environment for at least 24 hours now with very reasonable CPU and memory usage.