kubeshark deployment DOSes kube-apiserver if k8s audit events enabled #1500
Comments
@MMquant thanks for reporting this. We are actively looking into this and will report back our findings. |
Hi @MMquant, I've tried to verify this on our test environment in an EKS cluster (5 t3.large nodes, ~100 pods). This graph shows the number of audit log events in the cluster. Please provide more details on your setup.
Also, maybe ELK can provide some details on anomalies? You wrote that no common "DDOS" events were found, but maybe you can provide at least the difference in the count of events before Kubeshark and after. |
The k8s cluster is deployed on-prem on Proxmox: 3 master nodes, 3 worker nodes, ±150 pods.
https://falco.org/docs/install-operate/third-party/learning/#falco-with-multiple-sources
ELK event count screenshot from OpenSearch: you can see that the normal event rate[5m] is around 6k-7k. When Kubeshark was deployed, the rate jumped to 13k. Additionally, see the screenshots from Grafana (kubeshark daemonset memory load and kube-apiserver memory load).
At the moment we are going to test the following:
You are right that the audit policy definition has a huge impact on the amount of events generated. The one we are using is pretty verbose, as it's needed for analysis by Falco. |
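As a side note on measuring this: if kube-apiserver metrics are also scraped by Prometheus (the Grafana screenshots suggest they are), the `apiserver_audit_event_total` counter gives another way to compare the audit-event rate before and after deploying Kubeshark. A minimal recording-rule sketch (the group and rule names below are made up):

```yaml
groups:
  - name: apiserver-audit
    rules:
      # Cluster-wide rate of audit events emitted by the API servers,
      # averaged over 5 minutes; compare before/after deploying Kubeshark.
      - record: cluster:apiserver_audit_events:rate5m
        expr: sum(rate(apiserver_audit_event_total[5m]))
```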
Thx for the info @MMquant. To confirm whether Kubeshark itself generates those events, can you please exclude the Kubeshark service account from auditing?
So, when you have time: remove Kubeshark, add the rule, restart the API servers, install Kubeshark again, and check whether the number of events is still that high. |
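A minimal sketch of such an audit-policy rule, assuming Kubeshark was installed into the `default` namespace with a service account named `kubeshark-service-account` (both are assumptions; adjust them to your deployment). Audit policy rules are evaluated in order and the first match wins, so this rule should come before any broader rules:

```yaml
apiVersion: audit.k8s.io/v1
kind: Policy
rules:
  # Don't record any audit events for requests made by the Kubeshark
  # service account (the user name below is an assumption; adjust as needed).
  - level: None
    users: ["system:serviceaccount:default:kubeshark-service-account"]
  # ... existing (Falco-oriented) rules continue here.
```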
@MMquant The logs don't show a problem. Do you still experience the containers failing?
Also, one log line implies you're using an older version. It would be good to test one of the recent versions (e.g. the latest one).
On Tue, Feb 27, 2024 at 1:36 AM Petr Javorik wrote:
Hi @corest,
we have just tested the rule and it seems that the rule indeed filtered out the "DOS events".
However, we are concurrently facing another issue with Kubeshark, and we don't know whether it's related to this events issue.
Kubeshark containers are in CrashLoopBackOff state:
Defaulted container "sniffer" out of: sniffer, tracer
{"level":"debug","time":"2024-02-27T09:24:01Z","message":"packet-capture flag is deprecated!"}
2024-02-27T09:24:01Z INF main.go:75 > Starting worker...
2024-02-27T09:24:01Z INF misc/data.go:25 > Set the data directory to: data-dir=/app/data
2024-02-27T09:24:01Z INF kubernetes/memory/limit.go:47 > Memory limit is set to limit=8301034833169294539
2024-02-27T09:24:01Z INF main.go:106 > Starting worker...
2024-02-27T09:24:01Z WRN kubernetes/resolver/resolver.go:126 > Failed reading the name resolution history dump: error="open /app/data/name_resolution_history.json: no such file or directory" path=/app/data/name_resolution_history.json
2024-02-27T09:24:01Z INF main.go:126 > Linux kernel: version=4.18.0-513.11.1.el8_9.x86_64
2024-02-27T09:24:01Z INF utils/kernel/loader.go:80 > Downloading kernel module: dst=/app/kernel_modules/pf_ring.ko url=https://api.kubeshark.co/kernel-modules/4.18.0-513.11.1.el8_9.x86_64/pf_ring.ko
2024-02-27T09:24:01Z INF kubernetes/resolver/target.go:115 > Targeted pod: ......-01-0
2024-02-27T09:24:01Z INF kubernetes/resolver/target.go:115 > Targeted pod: .......-02-0
2024-02-27T09:24:01Z INF kubernetes/resolver/target.go:115 > Targeted pod: ....dsr7k
2024-02-27T09:24:01Z INF kubernetes/resolver/target.go:115 > Targeted pod: linux-tools
2024-02-27T09:24:02Z WRN main.go:131 > error="bad response code: 404"
2024-02-27T09:24:02Z INF assemblers/tcp_streams_map.go:75 > Using 1000 ms as the close timed out TCP stream channels interval
2024-02-27T09:24:02Z WRN source/tcp_packet_source.go:88 > Can't use PF_RING socket error="pfring NewRing error: address family not supported by protocol"
2024-02-27T09:24:02Z INF source/tcp_packet_source.go:103 > Using AF_PACKET socket as the capture source
2024-02-27T09:24:02Z INF server/server.go:62 > Starting the server... port=30001
|
@MMquant we will keep this open for now as I have a few things to work on:
|
Overall, in 1h the Kubeshark service account generated ~300 events, which is expected and normal. So for this issue I think the reason behind the high volume of events is very specific to the cluster setup and can't be fixed on the Kubeshark side for now. Last thing for this issue: I'll add a section to https://docs.kubeshark.co/en/troubleshooting on how to exclude Kubeshark audit events from monitoring. FYI @alongir |
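If the goal is to keep the verbose audit policy intact and instead drop Kubeshark's events on the Falco side, a macro along these lines might work. This is only a sketch: it assumes the k8s audit source exposes the requesting user as `ka.user.name` and that the service account is `kubeshark-service-account` in the `default` namespace; it also does not reduce load on the API server itself, only the alert noise.

```yaml
# Hypothetical helper macro matching audit events produced by Kubeshark's
# service account (field name and SA name/namespace are assumptions).
- macro: kubeshark_sa
  condition: ka.user.name = "system:serviceaccount:default:kubeshark-service-account"

# Append "and not kubeshark_sa" to the conditions of the audit rules
# that fire too often, rather than changing the API server audit policy.
```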
Done |
We just successfully killed our k8s control plane nodes by deploying `kubeshark`. The `kubeshark` deployment created thousands of k8s audit events and thus DOSed the kube-apiservers, which led to memory exhaustion on the control plane nodes. We use the audit policy from the k8s documentation: https://kubernetes.io/docs/tasks/debug/debug-cluster/audit/. How can we protect against such events?

I could remove the kubeshark namespace from the audit policy file, but is there any more general solution to protect the kube-apiserver and the node against an audit-event DOS? I analyzed the k8s audit events in our ELK before and during the crash, and it seems I'm not able to identify any common DOS events which could be filtered out in the `audit-policy.yaml` file.

Currently we're going to set resource limits on:
- the `kube-apiserver` pods, so that `kubeshark` doesn't kill the node;
- the `hub`, `sniffer` and `tracer` pods;
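For the namespace-based exclusion mentioned above, a minimal audit-policy sketch might look like the following; it assumes Kubeshark's components run in the `default` namespace (adjust to your install):

```yaml
apiVersion: audit.k8s.io/v1
kind: Policy
rules:
  # Drop audit events for requests against objects in the namespace
  # where Kubeshark runs (the namespace name is an assumption).
  - level: None
    namespaces: ["default"]
  # ... keep your existing rules below this point.
```

Note that a `namespaces`-based rule only matches requests targeting objects in that namespace; excluding by user (the Kubeshark service account, as discussed in the comments above) is usually the more targeted approach, since Kubeshark also reads resources in other namespaces.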