kubeshark deployment DOSes kube-apiserver if k8s audit events enabled #1500

Closed
MMquant opened this issue Feb 21, 2024 · 10 comments

MMquant commented Feb 21, 2024

We just successfully killed our k8s control plane nodes by deploying Kubeshark. The Kubeshark deployment created thousands of k8s audit events, thus DoSing the kube-apiservers, which led to memory exhaustion on the control plane nodes. We use the audit policy from the k8s documentation: https://kubernetes.io/docs/tasks/debug/debug-cluster/audit/.

How can we protect against such events?

I could remove the kubeshark namespace from the audit policy file, but is there any more general solution to protect the kube-apiserver and nodes against an audit-event DoS?
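For illustration, a namespace-scoped exclusion in audit-policy.yaml could look like the sketch below, assuming Kubeshark is installed in a namespace named kubeshark. Note that it only suppresses events for requests targeting resources in that namespace; requests the Kubeshark service account makes against other namespaces would still be logged:

  - level: None
    namespaces: ["kubeshark"]   # hypothetical namespace; match your install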

I analyzed the k8s audit events in our ELK before and during the crash, and I'm not able to identify any common DOS events that could be filtered out in the audit-policy.yaml file.

For now we're going to:

  • set memory limits on the kube-apiserver pods so that Kubeshark doesn't kill the node;
  • limit the hub, sniffer and tracer pods (see the sketch after this list);
  • target a specific namespace.
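As a generic backstop for the second point, a LimitRange in the Kubeshark namespace can cap memory for every container in it, independent of any Helm chart values. A minimal sketch, assuming Kubeshark is installed in a namespace named kubeshark; the numbers are placeholders, not recommendations:

apiVersion: v1
kind: LimitRange
metadata:
  name: kubeshark-memory-cap
  namespace: kubeshark          # hypothetical namespace; match your install
spec:
  limits:
    - type: Container
      defaultRequest:
        memory: 256Mi           # applied when a container sets no memory request
      default:
        memory: 1Gi             # applied when a container sets no memory limit
      max:
        memory: 2Gi             # containers requesting a higher limit are rejected at admission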
alongir (Member) commented Feb 21, 2024

@MMquant thanks for reporting this. We are actively looking into this and will report back our findings.

corest self-assigned this Feb 21, 2024
corest (Contributor) commented Feb 22, 2024

Hi @MMquant, thanks for reporting this issue.

I've tried to reproduce this in our EKS test cluster (5 t3.large nodes, ~100 pods).

[image: audit event rate graph]

The graph shows the number of audit log events in the cluster.
I installed Kubeshark at 7:30.
There was a small spike in the number of events at that point, which stabilized after Kubeshark finished its initial discovery. After that, no anomalies in the number of events were detected.
Also, there is no visible additional load on the Kubernetes API server.
We did similar tests before on a cluster with 100 nodes and ~1000 pods and didn't find any issues.
This doesn't prove that there are no such issues though.
Maybe EKS does not use such a verbose audit policy; I'm not sure.

Please provide more details on your setup.

  1. Is it cloud, on-prem, some test env with kind/minikube/etc
  2. Number of nodes and approximate number of pods
  3. Nodes resources (CPU, RAM, storage)
  4. Exact audit policy used

Also, maybe ELK can provide some details on the anomalies? You wrote that no common "DOS" events were found, but maybe you can at least provide the difference in the count of events before and after Kubeshark,
e.g. the average rate was 1k events/s before Kubeshark and 10k events/s after it was installed.
That would at least help us identify the magnitude of the issue.

MMquant (Author) commented Feb 22, 2024

  1. Is it cloud, on-prem, some test env with kind/minikube/etc

The k8s cluster is deployed on-prem on Proxmox: 3 master nodes, 3 worker nodes.

  2. Number of nodes and approximate number of pods

~150 pods

  3. Nodes resources (CPU, RAM, storage)

maple@ubuntu:tk8s-mon$ k top nodes
NAME      CPU(cores)   CPU%   MEMORY(bytes)   MEMORY%
tk8sm01   423m         10%    4086Mi          53%
tk8sm02   316m         7%     3477Mi          45%
tk8sm03   252m         6%     3216Mi          42%
tk8sw01   381m         4%     11859Mi         37%
tk8sw02   671m         8%     8623Mi          27%
tk8sw03   1051m        13%    18054Mi         56%

  4. Exact audit policy used

https://falco.org/docs/install-operate/third-party/learning/#falco-with-multiple-sources

ELK event count screenshot from OpenSearch:

[image: audit event rate graph]

You can see that the normal event rate[5m] is around 6k-7k. When Kubeshark was deployed, the rate jumped to 13k.
The empty bars are kube-apiserver crashes. We had to temporarily disable auditing in kube-apiserver.yaml so that we could uninstall the Kubeshark Helm release.

Additionally, see the screenshots from Grafana:

Kubeshark DaemonSet memory load
[screenshot, 2024-02-20 13:42]

kube-apiserver memory load
[screenshot, 2024-02-20 11:14]

At the moment we are going to test the following:

  • upgrade nodes to use cgroup v2
  • limit the kube-apiserver memory resources so that next time at least the node itself survives (see the sketch after this list)
  • limit the Kubeshark memory resources according to the guide https://docs.kubeshark.co/en/performance
  • scope Kubeshark to a specific namespace
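For the second point, a minimal sketch of capping kube-apiserver memory, assuming a kubeadm-style static pod manifest at /etc/kubernetes/manifests/kube-apiserver.yaml (the kubelet recreates the pod when the file changes; the values are placeholders):

# /etc/kubernetes/manifests/kube-apiserver.yaml (excerpt)
spec:
  containers:
    - name: kube-apiserver
      # command, volumeMounts, etc. omitted
      resources:
        requests:
          cpu: 250m
          memory: 1Gi
        limits:
          memory: 4Gi           # the apiserver container gets OOM-killed instead of exhausting the node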

You are right that the audit policy definition has a huge impact on the number of events generated. The one we are using is pretty verbose, as it's needed for analysis by Falco.

corest (Contributor) commented Feb 22, 2024

Thanks for the info, @MMquant.

To confirm whether Kubeshark itself generates those events, can you please exclude the Kubeshark service account from auditing?
If you installed Kubeshark in the default namespace, this can be done with the following rule:

  - level: None
    userGroups: ["system:serviceaccounts"]
    users: ["system:serviceaccount:default:kubeshark-service-account"]

So, when you have time: remove Kubeshark, add the rule, restart the API servers, reinstall Kubeshark, and check whether the number of events is that high again.
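For context, audit policy rules are evaluated in order and the first matching rule wins, so the exclusion should come before any broader rules. A minimal sketch of where it could sit in the full audit-policy.yaml (the surrounding rules are placeholders, not the actual Falco policy):

apiVersion: audit.k8s.io/v1
kind: Policy
rules:
  # Drop Kubeshark service-account requests first, before broader rules can match them.
  - level: None
    userGroups: ["system:serviceaccounts"]
    users: ["system:serviceaccount:default:kubeshark-service-account"]
  # ...the rest of the (verbose) policy follows, for example:
  - level: Metadata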

MMquant (Author) commented Feb 27, 2024

Hi @corest,
we have just tested the rule, and it seems that it indeed filtered out the "DOS events".

alongir (Member) commented Feb 27, 2024 via email

MMquant (Author) commented Feb 28, 2024

@corest I think we can close this issue now, can't we?
@alongir The logs I posted have been sorted out by our DevOps team. Moreover, I wouldn't discuss that log error in this issue, as I think these things are not related.

corest (Contributor) commented Feb 28, 2024

@MMquant we will keep this open for now, as I have a few things to work on:

  1. Create an environment with an extensive audit policy + Falco to replicate your issue.
  2. Find out why Kubeshark generates so many events.
  3. Update the docs on our side regarding excluding Kubeshark from audit events.

corest (Contributor) commented Mar 1, 2024

  1. Recreated a cluster with 3 nodes and the audit policy provided by Falco.
  2. Installed Falco and some workloads. Left the cluster for 1h. Average rate of audit events: 345 events/minute.
  3. Installed Kubeshark and enabled scripts to have some activity. Left it for 1h. The average rate of audit events increased to 352 events/minute.

Overall, in 1h the Kubeshark service account generated ~300 events, which is expected and normal.
Also, no additional visible load on the Kubernetes API server was observed.

So for this issue, I think the reason behind the high volume of events is very specific to the cluster setup and can't be fixed on the Kubeshark side for now.

One last thing for this issue: I'll add a section at https://docs.kubeshark.co/en/troubleshooting on how to exclude Kubeshark audit events from monitoring.

FYI @alongir

corest (Contributor) commented Mar 2, 2024

Done

corest closed this as completed Mar 2, 2024