
fix(core): ignore duplicate ADDED pod watch events #1553

Merged
merged 2 commits into kubearmor:main on Feb 22, 2024

Conversation

@dqsully (Contributor) commented on Dec 15, 2023

Purpose of PR?:

Fixes a memory leak I identified on our infrastructure while trying to deploy KubeArmor in our environment:

The WatchK8sPods() function uses raw Kubernetes watches instead of the Kubernetes SDK informer to track pods in a cluster. This isn't inherently an issue, and in KubeArmor's case, this actually results in significantly lower memory usage since KubeArmor doesn't care about most pod information. However, raw Kubernetes watches will eventually get stopped by the Kubernetes API server and you have to start a new watch, at which point every existing pod will be sent back as an "ADDED" event in the watch.
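
For context, here is a minimal sketch of what such a raw pod watch loop looks like with client-go. This is only an illustration of the restart-and-replay behavior described above (not KubeArmor's actual WatchK8sPods() code), and it assumes an in-cluster config:

```go
package main

import (
	"context"
	"fmt"

	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
	"k8s.io/client-go/kubernetes"
	"k8s.io/client-go/rest"
)

func main() {
	config, err := rest.InClusterConfig()
	if err != nil {
		panic(err)
	}
	client := kubernetes.NewForConfigOrDie(config)

	for {
		// A fresh watch (with no resourceVersion) effectively starts from a new
		// list, so every existing pod is replayed as an ADDED event after each restart.
		w, err := client.CoreV1().Pods("").Watch(context.Background(), metav1.ListOptions{})
		if err != nil {
			continue
		}
		for event := range w.ResultChan() {
			fmt.Println(event.Type) // ADDED, MODIFIED, DELETED, ...
		}
		// ResultChan closed: the API server ended the watch, so loop and restart.
	}
}
```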

KubeArmor already restarts these watches, and even deduplicates stored pod information in an array, but it forwards every "ADDED" event verbatim to the dm.UpdateEndPointWithPod() function, which has no inherent event deduplication. This means that every time a watch restarts, duplicate endpoint configurations are added in memory for every pod in the cluster, and in a large cluster this memory leak can grow pretty quickly.

My PR uses the duplicate status computed when storing pod information to determine if an "ADDED" event should be forwarded to dm.UpdateEndPointWithPod() as a "MODIFIED" event instead, which greatly reduces KubeArmor memory leakage in our environment while still fully processing every pod change in the cluster.
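
Roughly, the idea can be sketched as below. The seenPods map and handlePodEvent function are illustrative placeholders (the real change reuses the duplicate check KubeArmor already performs when storing pod information), but they show how a replayed "ADDED" event gets demoted to "MODIFIED":

```go
package main

import (
	"fmt"

	corev1 "k8s.io/api/core/v1"
)

// seenPods is a hypothetical stand-in for KubeArmor's existing pod store.
var seenPods = map[string]bool{}

// handlePodEvent rewrites a duplicate ADDED event as MODIFIED so downstream
// endpoint state is updated in place instead of duplicated.
func handlePodEvent(eventType string, pod *corev1.Pod) {
	key := pod.Namespace + "/" + pod.Name
	if eventType == "ADDED" && seenPods[key] {
		eventType = "MODIFIED"
	}
	seenPods[key] = true

	// In KubeArmor this is where the event would be forwarded to
	// dm.UpdateEndPointWithPod(); here we just print the (possibly rewritten) type.
	fmt.Println(eventType, key)
}

func main() {
	pod := &corev1.Pod{}
	pod.Namespace, pod.Name = "default", "example"
	handlePodEvent("ADDED", pod) // first ADDED passes through unchanged
	handlePodEvent("ADDED", pod) // replayed ADDED after a watch restart becomes MODIFIED
}
```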

Does this PR introduce a breaking change?

No

If the changes in this PR are manually verified, list down the scenarios covered:

Here's the average memory usage of the KubeArmor DaemonSet containers on our non-production Kubernetes cluster over the past 4 days, where I initially observed this memory leak and have been doing some testing on how to solve it:

[Screenshot_20231215_120505: graph of average KubeArmor DaemonSet container memory usage over the past 4 days]

tl;dr: the flat parts are my own changes, mostly just the changes in this PR. The positive slopes are KubeArmor v1.0.3 and v1.1.1, both with memory leaks after each redeployment.

Exactly what I did and when in this graph:

  1. 12/12 00:00 UTC - Deployed KubeArmor v1.0.3 with default configuration
  2. 12/13 01:30 UTC - Updated KubeArmor configuration to disable network visibility
  3. 12/14 17:45 UTC - Updated KubeArmor w/ custom Docker image based on main but using k8s SDK informers instead of raw watches
  4. 12/14 20:30 UTC - Updated KubeArmor w/ custom Docker image based on main but ignoring duplicate "ADDED" events for dm.UpdateEndPointWithPod()
  5. 12/14 22:30 UTC - Updated KubeArmor w/ custom Docker image based on main but with this PR's changes
  6. 12/15 14:45 UTC - Updated KubeArmor to newly-released v1.1.1

This shows that with the unmodified v1.0.3 or v1.1.1 KubeArmor images there's a clear memory leak, whereas with this PR's changes, although a small leak still appears to remain, the overall memory usage is much better.

We also ran KubeArmor v1.0.3 in production for only 8 hours before we had to take it down due to this memory leak and a knock-on CPU leak effect. This PR's changes have been running in our production environment for at least 24 hours now with very reasonable CPU and memory usage.

@dqsully (Contributor, Author) commented on Feb 22, 2024

Hey, are there any updates on this PR? It's been 2 months and this seems like a major issue. Without this fix we couldn't run KubeArmor for >8 hours in production before this resource leak started causing significant issues.

@daemon1024 (Member) left a comment


Hey @dqsully,

This completely makes sense. Apologies for missing it earlier; the holiday season got to us and this PR got overlooked.

We had been considering moving to informers and even have a commit for that (6dd22ef), but that hasn't gone through yet.

I have a small nitpick; it might be a bit opinionated, but maybe it's easier to understand as well?

Review comment on KubeArmor/core/kubeUpdate.go (outdated, resolved)
@dqsully (Contributor, Author) commented on Feb 22, 2024

I tried using a Kubernetes informer instead of the raw pod watch, and while it fixed the memory leak, it did use significantly more memory in our Kubernetes cluster compared to the raw watch. You can see that in the cliff around 12/14 17:45 UTC on my original screenshot.

(Although this probably wouldn't matter nearly as much if the KubeArmor DaemonSet pods weren't watching all pods and instead added a filter to the informer to only watch pods on the current node, as sketched below. I've actually already done this filtering in my fork and I'll be making a PR for it soon, along with lots of other bugfixes I've made in my environment over the past couple of months.)
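
As a rough sketch of that node-local filtering (not the exact code in my fork), a client-go shared informer can be restricted with a field selector so each DaemonSet pod only lists and watches pods scheduled on its own node; the NODE_NAME environment variable is assumed to be injected via the downward API:

```go
package main

import (
	"fmt"
	"os"
	"time"

	corev1 "k8s.io/api/core/v1"
	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
	"k8s.io/client-go/informers"
	"k8s.io/client-go/kubernetes"
	"k8s.io/client-go/rest"
	"k8s.io/client-go/tools/cache"
)

func main() {
	config, err := rest.InClusterConfig()
	if err != nil {
		panic(err)
	}
	client := kubernetes.NewForConfigOrDie(config)

	nodeName := os.Getenv("NODE_NAME") // assumed to be set via the downward API

	// Only pods scheduled on this node are listed and watched, so the informer's
	// cache scales with the node rather than the whole cluster.
	factory := informers.NewSharedInformerFactoryWithOptions(client, 30*time.Minute,
		informers.WithTweakListOptions(func(opts *metav1.ListOptions) {
			opts.FieldSelector = "spec.nodeName=" + nodeName
		}),
	)

	podInformer := factory.Core().V1().Pods().Informer()
	podInformer.AddEventHandler(cache.ResourceEventHandlerFuncs{
		AddFunc: func(obj interface{}) {
			pod := obj.(*corev1.Pod)
			fmt.Println("added:", pod.Namespace+"/"+pod.Name)
		},
	})

	stop := make(chan struct{})
	factory.Start(stop)
	factory.WaitForCacheSync(stop)
	select {} // block; event handlers run on informer goroutines
}
```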

@daemon1024 (Member) commented

> KubeArmor DaemonSet pods weren't watching all pods and added a filter to the informer to only watch pods on the current node

Agreed, this was discussed in one of the community calls recently, but it seems like we forgot to open an issue.

> I'll be making a PR for it soon, along with lots of other bugfixes I've made in my environment over the past couple months

I am really looking forward to them.

@daemon1024 (Member) left a comment


I will wait for the CI to pass before we merge this. Appreciate your help here.

I just went through main...dqsully:KubeArmor:other-improvements and want to pick up a lot of those changes. I would love to chat with you and get your feedback on KubeArmor in general.

You can join us on our Slack if you want real time interaction on anything.

@dqsully (Contributor, Author) commented on Feb 22, 2024

I think I need to rebase my changes for CI to pass, lemme do that real quick.

@daemon1024 daemon1024 changed the title fix(enforcer): ignore duplicate ADDED pod watch events fix(core): ignore duplicate ADDED pod watch events Feb 22, 2024
Signed-off-by: Dakota Sullivan <djqballer@outlook.com>
@dqsully (Contributor, Author) commented on Feb 22, 2024

Sorry, had to do one last force push to fix the commit verification (I need to fix my GitKraken config...).

@daemon1024 (Member) commented

Looks good now, will merge after the CI passes 🙌🏽

Great to know that you are trying to run KubeArmor in production. How are you using KubeArmor? We maintain an ADOPTERS.md and would love to hear about your use cases if that's fine with you.

@daemon1024 daemon1024 merged commit 19acf50 into kubearmor:main Feb 22, 2024
18 checks passed
PrimalPimmy pushed a commit to PrimalPimmy/KubeArmor that referenced this pull request Mar 8, 2024
PrimalPimmy pushed a commit to PrimalPimmy/KubeArmor that referenced this pull request Mar 14, 2024
Shreyas220 pushed a commit to Shreyas220/KubeArmor that referenced this pull request May 31, 2024