kubeshark deployment DOSes kube-apiserver if k8s audit events enabled #1500

Closed
MMquant opened this issue Feb 21, 2024 · 10 comments

MMquant commented Feb 21, 2024

We just successfully killed our k8s control plane nodes by deploying Kubeshark. The Kubeshark deployment created thousands of k8s audit events, thus DoSing the kube-apiservers, which led to memory exhaustion on the control plane nodes. We use the audit policy from the k8s documentation: https://kubernetes.io/docs/tasks/debug/debug-cluster/audit/.

How can we protect against such events?

I could remove the kubeshark namespace from the audit policy file, but is there any more general solution to protect the kube-apiserver and nodes against an audit-event DoS?
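For illustration, a namespace-scoped exclusion in audit-policy.yaml could look like the sketch below, assuming Kubeshark is installed in a namespace named kubeshark. Note that it only suppresses events for requests targeting resources in that namespace; requests the Kubeshark service account makes against other namespaces would still be logged:

  - level: None
    namespaces: ["kubeshark"]   # hypothetical namespace; match your install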

I analyzed the k8s audit events in our ELK before and during the crash, and I'm not able to identify any common DOS events that could be filtered out in the audit-policy.yaml file.

For now we're going to:

  • set memory limits on the kube-apiserver pods so that Kubeshark doesn't kill the node;
  • limit the hub, sniffer and tracer pods (see the sketch after this list);
  • target a specific namespace.
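As a generic backstop for the second point, a LimitRange in the Kubeshark namespace can cap memory for every container in it, independent of any Helm chart values. A minimal sketch, assuming Kubeshark is installed in a namespace named kubeshark; the numbers are placeholders, not recommendations:

apiVersion: v1
kind: LimitRange
metadata:
  name: kubeshark-memory-cap
  namespace: kubeshark          # hypothetical namespace; match your install
spec:
  limits:
    - type: Container
      defaultRequest:
        memory: 256Mi           # applied when a container sets no memory request
      default:
        memory: 1Gi             # applied when a container sets no memory limit
      max:
        memory: 2Gi             # containers requesting a higher limit are rejected at admission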
alongir (Member) commented Feb 21, 2024

@MMquant thanks for reporting this. We are actively looking into this and will report back our findings.

corest self-assigned this Feb 21, 2024
corest (Contributor) commented Feb 22, 2024

Hi @MMquant, thanks for reporting this issue.

I've tried to reproduce this in our EKS test cluster (5 t3.large nodes, ~100 pods).

[image: audit event rate graph]

The graph shows the number of audit log events in the cluster.
I installed Kubeshark at 7:30.
There was a small spike in the number of events at that point, which stabilized after Kubeshark finished its initial discovery. After that, no anomalies in the number of events were detected.
Also, there is no visible additional load on the Kubernetes API server.
We did similar tests before on a cluster with 100 nodes and ~1000 pods and didn't find any issues.
This doesn't prove that there are no such issues though.
Maybe EKS does not use such a verbose audit policy; I'm not sure.

Please provide more details on your setup.

  1. Is it cloud, on-prem, some test env with kind/minikube/etc
  2. Number of nodes and approximate number of pods
  3. Nodes resources (CPU, RAM, storage)
  4. Exact audit policy used

Also, maybe ELK can provide some details on the anomalies? You wrote that no common "DOS" events were found, but maybe you can at least provide the difference in the count of events before and after Kubeshark,
e.g. the average rate was 1k events/s before Kubeshark and 10k events/s after it was installed.
That would at least help us identify the magnitude of the issue.

MMquant (Author) commented Feb 22, 2024

  1. Is it cloud, on-prem, some test env with kind/minikube/etc

The k8s cluster is deployed on-prem on Proxmox: 3 master nodes, 3 worker nodes.

  2. Number of nodes and approximate number of pods

~150 pods

  3. Nodes resources (CPU, RAM, storage)

maple@ubuntu:tk8s-mon$ k top nodes
NAME      CPU(cores)   CPU%   MEMORY(bytes)   MEMORY%
tk8sm01   423m         10%    4086Mi          53%
tk8sm02   316m         7%     3477Mi          45%
tk8sm03   252m         6%     3216Mi          42%
tk8sw01   381m         4%     11859Mi         37%
tk8sw02   671m         8%     8623Mi          27%
tk8sw03   1051m        13%    18054Mi         56%

  4. Exact audit policy used

https://falco.org/docs/install-operate/third-party/learning/#falco-with-multiple-sources

ELK event count screenshot from OpenSearch:

[image: audit event rate graph]

You can see that the normal event rate[5m] is around 6k-7k. When Kubeshark was deployed, the rate jumped to 13k.
The empty bars are kube-apiserver crashes. We had to temporarily disable auditing in kube-apiserver.yaml so that we could uninstall the Kubeshark Helm release.

Additionally, see the screenshots from Grafana:

Kubeshark DaemonSet memory load
[screenshot, 2024-02-20 13:42]

kube-apiserver memory load
[screenshot, 2024-02-20 11:14]

At the moment we are going to test the following:

  • upgrade nodes to use cgroup v2
  • limit the kube-apiserver memory resources so that next time at least the node itself survives (see the sketch after this list)
  • limit the Kubeshark memory resources according to the guide https://docs.kubeshark.co/en/performance
  • scope Kubeshark to a specific namespace
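For the second point, a minimal sketch of capping kube-apiserver memory, assuming a kubeadm-style static pod manifest at /etc/kubernetes/manifests/kube-apiserver.yaml (the kubelet recreates the pod when the file changes; the values are placeholders):

# /etc/kubernetes/manifests/kube-apiserver.yaml (excerpt)
spec:
  containers:
    - name: kube-apiserver
      # command, volumeMounts, etc. omitted
      resources:
        requests:
          cpu: 250m
          memory: 1Gi
        limits:
          memory: 4Gi           # the apiserver container gets OOM-killed instead of exhausting the node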

You are right that the audit policy definition has a huge impact on the number of events generated. The one we are using is pretty verbose, as it's needed for analysis by Falco.

corest (Contributor) commented Feb 22, 2024

Thanks for the info, @MMquant.

To confirm whether Kubeshark itself generates those events, can you please exclude the Kubeshark service account from auditing?
If you installed Kubeshark in the default namespace, this can be done with the following rule:

  - level: None
    userGroups: ["system:serviceaccounts"]
    users: ["system:serviceaccount:default:kubeshark-service-account"]

So, when you have time: remove Kubeshark, add the rule, restart the API servers, reinstall Kubeshark, and check whether the number of events is that high again.
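For context, audit policy rules are evaluated in order and the first matching rule wins, so the exclusion should come before any broader rules. A minimal sketch of where it could sit in the full audit-policy.yaml (the surrounding rules are placeholders, not the actual Falco policy):

apiVersion: audit.k8s.io/v1
kind: Policy
rules:
  # Drop Kubeshark service-account requests first, before broader rules can match them.
  - level: None
    userGroups: ["system:serviceaccounts"]
    users: ["system:serviceaccount:default:kubeshark-service-account"]
  # ...the rest of the (verbose) policy follows, for example:
  - level: Metadata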

MMquant (Author) commented Feb 27, 2024

Hi @corest,
we have just tested the rule, and it seems that it indeed filtered out the "DOS events".

alongir (Member) commented Feb 27, 2024 via email

MMquant (Author) commented Feb 28, 2024

@corest I think we can close this issue now, can't we?
@alongir The logs I posted have been sorted out by our DevOps team. Moreover, I wouldn't discuss that log error in this issue, as I think these things are not related.

corest (Contributor) commented Feb 28, 2024

@MMquant we will keep this open for now, as I have a few things to work on:

  1. Create an environment with an extensive audit policy + Falco to replicate your issue.
  2. Find out why Kubeshark generates so many events.
  3. Update the docs on our side regarding excluding Kubeshark from audit events.

corest (Contributor) commented Mar 1, 2024

  1. Recreated a cluster with 3 nodes and the audit policy provided by Falco.
  2. Installed Falco and some workloads. Left the cluster for 1h. Average rate of audit events: 345 events/minute.
  3. Installed Kubeshark and enabled scripts to have some activity. Left it for 1h. The average rate of audit events increased to 352 events/minute.

Overall, in 1h the Kubeshark service account generated ~300 events, which is expected and normal.
Also, no additional visible load on the Kubernetes API server was observed.

So for this issue, I think the reason behind the high volume of events is very specific to the cluster setup and can't be fixed on the Kubeshark side for now.

One last thing for this issue: I'll add a section at https://docs.kubeshark.co/en/troubleshooting on how to exclude Kubeshark audit events from monitoring.

FYI @alongir

corest (Contributor) commented Mar 2, 2024

Done

corest closed this as completed Mar 2, 2024