
add podresources DOS prevention using rate limit #116459

Merged


ffromani (Contributor) commented Mar 10, 2023

What type of PR is this?

/kind cleanup

What this PR does / why we need it:

Implement server-side denial-of-service (DoS) prevention using a rate limit for the podresources API endpoint, which is served locally on the node through a unix domain socket. This is a GA graduation blocker: kubernetes/enhancements#3863

Which issue(s) this PR fixes:

Related to (but not sufficient to close) k/e 3743.

Special notes for your reviewer:

Alternative take to #115852.
Implements a suggestion made during the review: #115852 (comment)
Acknowledges some drawbacks that emerged during the API review (most notably #115852 (comment); please see the review of #115852 for all the details).

Added basic Denial Of Service prevention for the node-local kubelet `podresource` API

Additional documentation e.g., KEPs (Kubernetes Enhancement Proposals), usage docs, etc.:

- KEP: https://github.com/kubernetes/enhancements/blob/master/keps/sig-node/606-compute-device-assignment/README.md

To enable rate limiting, needed for GA graduation,
we need to pass more parameters to the already crowded
`ListenAndServePodresources` function.

To tidy up a bit, pack the parameters in a helper struct,
with no intended changes in behavior.

Signed-off-by: Francesco Romani <fromani@redhat.com>
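The parameter-packing refactor described in this commit message can be sketched in Go as follows. This is a minimal sketch under stated assumptions: `podResourcesProviders`, `listenAndServePodResources` and the three provider fields are hypothetical names chosen for illustration, not the kubelet's actual types or the real signature of `ListenAndServePodresources`. The stub does not start a server; it only shows the shape of the call.

```go
package main

import (
	"errors"
	"fmt"
)

// podResourcesProviders is a hypothetical helper struct packing the data
// sources a podresources server needs. Function-typed fields keep the
// sketch short; real code would more likely use interfaces.
type podResourcesProviders struct {
	ListPods    func() []string // names of pods known to the node
	ListDevices func() []string // IDs of devices assigned to containers
	ListCPUs    func() []int    // exclusively assigned CPU IDs
}

// Before (hypothetical): every new dependency widened the signature, e.g.
//   listenAndServePodResources(endpoint string, pods func() []string,
//       devices func() []string, cpus func() []int)
//
// After: a single struct argument, so adding a field does not change
// the function signature.
//
// listenAndServePodResources is a stub: instead of starting a gRPC server
// it validates its input and returns a summary of what would be served.
func listenAndServePodResources(endpoint string, p podResourcesProviders) (string, error) {
	if p.ListPods == nil || p.ListDevices == nil || p.ListCPUs == nil {
		return "", errors.New("all providers must be set")
	}
	return fmt.Sprintf("%s: %d pods, %d devices, %d cpus",
		endpoint, len(p.ListPods()), len(p.ListDevices()), len(p.ListCPUs())), nil
}
```

Since callers build one struct, a later change (such as passing rate limit settings) can extend the struct instead of growing an already crowded parameter list.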
@k8s-ci-robot added the release-note, size/L, kind/cleanup, cncf-cla: yes, do-not-merge/needs-sig, needs-triage and needs-priority labels Mar 10, 2023
@ffromani changed the title from "Podresources ratelimit minimal" to "WIP: add podresources DOS prevention using rate limit" Mar 10, 2023
@k8s-ci-robot added the do-not-merge/work-in-progress label Mar 10, 2023
ffromani (Contributor, Author) commented:

/sig node
Marked as WIP because I still need to verify the e2e test.

@k8s-ci-robot added the sig/node, area/kubelet, area/test and sig/testing labels and removed the do-not-merge/needs-sig label Mar 10, 2023
ffromani (Contributor, Author) commented Mar 10, 2023

/priority important-soon

Copied from #115852 (comment); the same reasoning applies here.

@k8s-ci-robot added the priority/important-soon label and removed the needs-priority label Mar 10, 2023
rjsadow (Contributor) commented Mar 10, 2023

/release-note-edit

release-note Added basic Denial Of Service prevention for the node-local kubelet `podresource` API

@ffromani changed the title from "WIP: add podresources DOS prevention using rate limit" to "add podresources DOS prevention using rate limit" Mar 10, 2023
@k8s-ci-robot removed the do-not-merge/work-in-progress label Mar 10, 2023
ffromani (Contributor, Author) commented:

e2e test re-validated locally:

[sig-node] POD Resources [Serial] [Feature:PodResources][NodeFeature:PodResources] with the builtin rate limit values should hit throttling when calling podresources List in a tight loop
test/e2e_node/podresources_test.go:869
  STEP: Creating a kubernetes client @ 03/10/23 09:42:36.925
  STEP: Building a namespace api object, basename podresources-test @ 03/10/23 09:42:36.925
  Mar 10 09:42:36.927: INFO: Skipping waiting for service account
  STEP: Connecting to the kubelet endpoint @ 03/10/23 09:42:36.927
  STEP: Issuing 200 List() calls in a tight loop @ 03/10/23 09:42:36.927
  STEP: Checking return codes for 200 List() calls in 10.014719ms @ 03/10/23 09:42:36.937
  Mar 10 09:42:36.937: INFO: got 190/200 rate limit errors, at least one needed, the more the better
  Mar 10 09:42:36.937: INFO: Waiting up to 3m0s for all (but 0) nodes to be ready
  STEP: Destroying namespace "podresources-test-2758" for this suite. @ 03/10/23 09:42:36.938
• [0.015 seconds]

@ffromani force-pushed the podresources-ratelimit-minimal branch from 9da9acb to e6a2149 March 10, 2023 14:46
@bart0sh added this to WIP in SIG Node PR Triage Mar 10, 2023
SergeyKanzhelev (Member) left a comment:

A small note about the field documentation; otherwise LGTM.

@SergeyKanzhelev moved this from WIP to Needs Approver in SIG Node PR Triage Mar 10, 2023
SergeyKanzhelev (Member) commented:

/triage accepted

@k8s-ci-robot added the triage/accepted label and removed the needs-triage label Mar 10, 2023
Implement DoS prevention by wiring a global rate limit for the podresources
API. The goal here is not to introduce a general rate-limiting solution
for the kubelet (we need more research and discussion to get there),
but rather to prevent misuse of the API.

Known limitations:
- the rate limit values (QPS, BurstTokens) are hardcoded to
  "high enough" values.
  Enabling user configuration would require more discussion
  and sweeping changes to the other kubelet endpoints, so it
  is postponed for now.
- the rate limiting is global. A malicious client can starve other
  clients by consuming the whole QPS quota.

Add an e2e test to exercise the flow, since the wiring itself
is mostly boilerplate and API adaptation.
@ffromani force-pushed the podresources-ratelimit-minimal branch from e6a2149 to b837a0c March 11, 2023 07:06
SergeyKanzhelev (Member) left a comment:

/lgtm

@k8s-ci-robot added the lgtm label Mar 12, 2023
k8s-ci-robot (Contributor) commented:

LGTM label has been added.

Git tree hash: 4b10493450b97c1930f3648a6e54d2a0b13c1ad1

derekwaynecarr (Member) commented:

Rate limiting looks good.

/approve

k8s-ci-robot (Contributor) commented:

[APPROVALNOTIFIER] This PR is APPROVED

This pull-request has been approved by: derekwaynecarr, ffromani, SergeyKanzhelev

The full list of commands accepted by this bot can be found here.

The pull request process is described here

Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

@k8s-ci-robot added the approved label Mar 14, 2023
@k8s-ci-robot merged commit 204a9a1 into kubernetes:master Mar 14, 2023
SIG Node CI/Test Board automation moved this from Triage to Done Mar 14, 2023
SIG Node PR Triage automation moved this from Needs Approver to Done Mar 14, 2023
@k8s-ci-robot added this to the v1.27 milestone Mar 14, 2023
@ffromani deleted the podresources-ratelimit-minimal branch July 17, 2023 08:37
Labels
approved, area/kubelet, area/test, cncf-cla: yes, kind/cleanup, lgtm, priority/important-soon, release-note, sig/node, sig/testing, size/L, triage/accepted

6 participants