
Conversation


@SaranBalaji90 SaranBalaji90 commented Jul 6, 2021

This KEP adds support for "out of tree" plugins in the pod admission handler. It is currently not possible to add additional validations before admitting a Pod without changing kubelet code. This PR provides the flexibility to add such validations without updating the kubelet source code.

Co-authored-by: xujyan <xujyan@apple.com>
Co-authored-by: jaypipes <jaypipes@gmail.com>

@k8s-ci-robot k8s-ci-robot added cncf-cla: yes Indicates the PR's author has signed the CNCF CLA. kind/kep Categorizes KEP tracking issues and PRs modifying the KEP directory sig/node Categorizes an issue or PR as relevant to SIG Node. labels Jul 6, 2021
@k8s-ci-robot k8s-ci-robot added the size/L Denotes a PR that changes 100-499 lines, ignoring generated files. label Jul 6, 2021
@SaranBalaji90 SaranBalaji90 force-pushed the podadmithandler branch 3 times, most recently from b3af251 to b8a8838 Compare July 6, 2021 22:38

SaranBalaji90 commented Jul 7, 2021

Slides for KEP presentation - NodeLocal Pluggable admit handler-v2.pptx

@SaranBalaji90 SaranBalaji90 changed the title Add plugin support for pod admission handler Add node-level plugin support for pod admission handler Jul 8, 2021
@@ -0,0 +1,314 @@
---
@ehashman (Member)

Please convert your KEP to the latest template. I see that the version skew strategy and production readiness questionnaires are missing, among other key sections of the design.

@SaranBalaji90 (Author)

Thanks @ehashman for checking. I will update it.

@SaranBalaji90 (Author)

@ehashman I have updated this KEP per the latest template.

Comment on lines 106 to 111
The team initially investigated using a Pod Security Policy (PSP) that would
prevent Pods with a Fargate scheduler type from having an elevated security
context or using host networking. However, because the EKS user has
administrative rights on the Kubernetes cluster, API-layer constructs such as a
Pod Security Policy may be deleted, which would effectively disable the effect
of that PSP. Likewise, the second solution the team landed on — using Node
@neolit123 (Member) commented Jul 13, 2021

thank you for presenting in today's SIG Node meeting.
something that i found quite puzzling is the fact that you'd like the design to gate out on the cluster admin.
this was mentioned by @Random-Liu during the call.

not trusting the cluster admin is a peculiar case, since the cluster admin can already render a cluster inoperable, or change key RBAC roles.
can you elaborate more on why you would not trust them about potential disruptions, such as tampering with PSPs or webhooks? why isn't documenting a best practice sufficient here?

@SaranBalaji90 (Author) commented Jul 13, 2021

Thank you @neolit123 for the quick read :) There are two risks that need to be addressed:

  1. Restrict pods that cannot execute on these managed worker nodes, and
  2. Prevent pods from accessing parts of the file system on the node or tampering with other processes on the host.

While #1 can be solved with a webhook or PSP, and deleting them only impacts the cluster admins' own operations on the cluster (e.g., in some scenarios pods get stuck in a strange state instead of getting the appropriate error message emitted by a pod admit handler), #2 requires validation in the kubelet before admitting the pods: cluster admins can delete the webhook/PSP, schedule their pods onto these worker nodes directly by adding NodeName to their podSpec, and perform elevated actions. This also bypasses the scheduler, since the cluster user populates the pod spec with a node name. I will elaborate in the KEP as well.

My thought here is that extending the pod admit handler to support node-local plugins will help other providers who want to perform similar last-minute checks in the kubelet.
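
To make the bypass concrete: a pod whose spec.nodeName is pre-populated skips the scheduler and goes straight to the kubelet, so node-level admission is the only remaining gate. A minimal sketch in Go; the node and image names are hypothetical, not from the KEP:

```go
package main

import (
	"fmt"

	v1 "k8s.io/api/core/v1"
	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
)

func main() {
	privileged := true
	// Pre-populating Spec.NodeName binds the pod directly to the node without
	// going through the scheduler; only kubelet-level admission can still
	// reject it. Node and image names below are hypothetical.
	pod := &v1.Pod{
		ObjectMeta: metav1.ObjectMeta{Name: "direct-bind"},
		Spec: v1.PodSpec{
			NodeName: "managed-node-1",
			Containers: []v1.Container{{
				Name:            "app",
				Image:           "busybox",
				SecurityContext: &v1.SecurityContext{Privileged: &privileged},
			}},
		},
	}
	fmt.Println(pod.Spec.NodeName)
}
```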

@neolit123 (Member) commented Jul 14, 2021

i can see the resulting complications if the cluster admins perform these actions, but it's still not clear to me why the platform would want to prevent them from doing that. is this about restricted scheduling that benefits the provider and not the cluster admin (in one form or another)?

doing some high-level thinking here:
instead of adding more extension mechanisms, perhaps this should be considered as a problem in the privileges ladder.
if there is a cluster-admin with full scope, there could be a persona on top of them that can lock certain objects.

this would overlap with the ongoing discussion about locking certain objects and blocking an accidental kubectl delete <everything>.
https://groups.google.com/d/msgid/kubernetes-sig-cli/CAN9Ncmx-a6qLr_%3D74Mv%2B%2Bp5rJJkPA%3Dk8vtFNTKs5LY1xB4x_Xw%40mail.gmail.com?utm_medium=email&utm_source=footer

@SaranBalaji90 (Author) commented Jul 14, 2021

> is this about restricted scheduling that benefits the provider and not the cluster admin (in one form or another)?

Yes, this is also one of the requirements I called out earlier: prevent pods from accessing parts of the file system on the node or tampering with other processes on the host.

if there is a cluster-admin with full scope, there could be a persona on top of them that can lock certain objects.

Yes, I like your thought process: something like blocking the cluster-admin from deleting certain objects, for example webhooks of certain types. This is where a manifest-based webhook would have really helped us, but there were a few caveats, as Daniel pointed out in #2074 (comment). Thank you for taking the time on this :)

@k8s-ci-robot (Contributor)

[APPROVALNOTIFIER] This PR is NOT APPROVED

This pull-request has been approved by: SaranBalaji90
To complete the pull request process, please assign dchen1107, ehashman after the PR has been reviewed.
You can assign the PR to them by writing /assign @dchen1107 @ehashman in a comment when ready.

The full list of commands accepted by this bot can be found here.

Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

@SaranBalaji90 SaranBalaji90 force-pushed the podadmithandler branch 3 times, most recently from 95952b5 to 5223ef6 Compare July 14, 2021 17:04
mount, [pulling secrets](https://github.com/kubernetes/kubernetes/blob/v1.14.6/pkg/kubelet/kubelet.go#L1644)
for pods etc.

2. OCI hook is invoked just before running the container, therefore Kubelet

Also Kubelet retries the pod indefinitely in this case.

Incorporating updates from xujyan@apple.com

Co-authored-by: xujyan <xujyan@apple.com>
Co-authored-by: jaypipes <jaypipes@gmail.com>
@SaranBalaji90 (Author)

Hello @derekwaynecarr and @dchen1107, will you be able to take a quick look at this? Based on your review, I will add this to our agenda to discuss further during our sig-node weekly meeting.

@SaranBalaji90 (Author) commented Aug 12, 2021

A similar proposal for rejecting non-Windows pods from Windows worker nodes: #2803

@ehashman (Member) left a comment

/assign @derekwaynecarr

Comment on lines +21 to +25
latest-milestone:
milestone:
alpha:
beta:
stable:
@ehashman (Member)

You'll need to fill these milestones out or the validation on the metadata won't work properly.

@swatisehgal (Contributor)

/cc

@swatisehgal (Contributor) left a comment

Thanks for working on this KEP, this is super interesting! Left some comments, hope it helps.

resource allocation. The in-tree PodAdmitHandlers mentioned below are good
examples for this.

* Eviction manager rejects pods if the node is under high pressure (e.g., PIDs)
@swatisehgal (Contributor) commented Sep 14, 2021

You might want to add here the recent addition of the CPU Manager option full-pcpus-only, which allows rejecting non-SMT-aligned workloads.

also to prevent pods from accessing some portion of file system on the node or
tampering other process on the host.

The team initially investigated using a webhook or Pod Security Policy (PSP)
@swatisehgal (Contributor) commented Sep 14, 2021

I think it would be more suitable to move this information to the Alternative Solutions section, along with the pros and cons of each approach.

architecture. With CNI, kubelet invokes one or more CNI plugin binaries on the
host to set up a Pod’s networking. kubelet discovers available CNI plugins
by [examining](https://github.com/kubernetes/kubernetes/blob/dd5272b76f07bea60628af0bb793f3cca385bf5e/pkg/kubelet/dockershim/docker_service.go#L242)
a well-known directory (`/etc/cni/net.d`) for configuration files
@swatisehgal (Contributor) commented Sep 14, 2021

We should define the well-known directory path for the admission plugins as part of this proposal.

Also, I see later in the proposal you have plugin type as grpc. Does that mean that plugins would be deployed as a daemon set and interact with kubelet over gRPC? Where is the gRPC API defined? Deploying it as a daemon set would give us the ability to deploy/remove custom admit handlers during the lifecycle of kubelet. Am I understanding this correctly?

Reply:

  1. Ack. We will definitely add and define one. The next iteration may not have it, though.

  2. A DaemonSet won't be applicable in this case, for these reasons:
    A) A cluster administrator can change the properties of the DaemonSet.
    B) By the time a DaemonSet runs this plugin, its own container will already have run. As a result, the DaemonSet may be able to get privileged access on Fargate, which is exactly what the plugin intends to prevent.

Note: The intention right now is to use a unix connection, working in the same way as node plugin registration.
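
For reference, a minimal sketch of what the kubelet side of such a unix connection could look like, assuming a gRPC transport like the one used for device/CSI plugin registration; the socket path is hypothetical and the Admit RPC itself is not yet defined by this KEP:

```go
package main

import (
	"log"

	"google.golang.org/grpc"
	"google.golang.org/grpc/credentials/insecure"
)

func main() {
	// Hypothetical socket path under the kubelet plugin directory; the KEP
	// does not fix this path yet.
	conn, err := grpc.Dial(
		"unix:///var/lib/kubelet/plugins/admit/example.sock",
		grpc.WithTransportCredentials(insecure.NewCredentials()),
	)
	if err != nil {
		log.Fatalf("dialing admit plugin: %v", err)
	}
	defer conn.Close()
	// The actual Admit RPC would be generated from the (to be defined) gRPC
	// API and invoked on conn for every pod the kubelet is about to admit.
}
```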


### Risks and Mitigations

These out-of-tree plugins should be provisioned and managed along with kubelet.
@swatisehgal (Contributor) commented Sep 14, 2021

Let's capture as part of this KEP, from a sys-admin's perspective, where these out-of-tree plugins would be consumed from. Do we handle this just with documentation, or do we plan to have a common repository (e.g., in https://github.com/kubernetes-sigs) to house or point to third-party out-of-tree admission plugins? Here are a few examples:

operator (e.g., the descheduler) that watches over pod rejections and delete the
pods based on the Reason field.

### Test Plan
@swatisehgal (Contributor)

All these sections (Test Plan, Graduation Criteria, PRR review questionnaire) need to be filled in.

Pod object and can be invoked in parallel since there is no dependency between
them.

These validations will be applied to both static pods which are managed by
@swatisehgal (Contributor) commented Sep 14, 2021

It might be useful to have a mechanism to only run the admission handler on a subset of pods. Maybe we can have a way to exclude certain kinds of pods, or pods with a particular label. WDYT?

Reply:

It is a nice idea. We would potentially like to revisit it in the beta phase. If you have a use-case, please let us know.

Kubelet will encode the pod spec and invoke each admission plugin. After
decoding the pod spec, plugins can perform additional validations and return the
encoded form of the struct mentioned below to kubelet to decide whether to admit
the pod or not. For shell plugin type, request will be sent over stdin and
@swatisehgal (Contributor) commented Sep 14, 2021

You mention a shell plugin type here; can you please explain what you mean by that?

Reply:

The kubelet will be able to call an executable using os/exec and communicate with the plugin over stdout and stderr. A diagram is needed; I will add one.

We will also update the KEP to present the unix socket type as the first point of discussion, as that is what we would want to do in Alpha.

response will be received over stdout and stderr.

```
AdmitResult {
@swatisehgal (Contributor) commented Sep 14, 2021

I think we should capture how the rejection manifests itself from a user perspective, e.g., the Pod status shows the Reason (e.g., CustomAdmissionFailure) along with a description, which is essentially the Message. Let's explain this with concrete examples. It is implicit, but it might be worth mentioning that this gives each plugin a mechanism for a more plugin-specific pod rejection error message.

Reply:

Definitely, we will add diagrams and examples in user stories.
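
For illustration only, here is a minimal sketch of the shell-type flow described in the quoted text, assuming JSON over stdin/stdout and an AdmitResult shaped like the kubelet's existing lifecycle.PodAdmitResult (Admit, Reason, Message); the encoding and field names are assumptions, not the KEP's final API:

```go
package admitplugin

import (
	"bytes"
	"context"
	"encoding/json"
	"fmt"
	"os/exec"

	v1 "k8s.io/api/core/v1"
)

// AdmitResult mirrors the struct sketched in the KEP; field names assumed.
type AdmitResult struct {
	Admit   bool   `json:"admit"`
	Reason  string `json:"reason,omitempty"`  // surfaced as the Pod status Reason on rejection
	Message string `json:"message,omitempty"` // surfaced as the Pod status Message on rejection
}

// runShellPlugin sends the JSON-encoded pod to the plugin binary over stdin
// and decodes the AdmitResult the plugin writes to stdout.
func runShellPlugin(ctx context.Context, pluginPath string, pod *v1.Pod) (*AdmitResult, error) {
	in, err := json.Marshal(pod)
	if err != nil {
		return nil, err
	}
	cmd := exec.CommandContext(ctx, pluginPath)
	cmd.Stdin = bytes.NewReader(in)
	var stdout, stderr bytes.Buffer
	cmd.Stdout = &stdout
	cmd.Stderr = &stderr
	if err := cmd.Run(); err != nil {
		return nil, fmt.Errorf("admit plugin %s failed: %v: %s", pluginPath, err, stderr.String())
	}
	var result AdmitResult
	if err := json.Unmarshal(stdout.Bytes(), &result); err != nil {
		return nil, err
	}
	return &result, nil
}
```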

@@ -0,0 +1,3 @@
kep-number: 2823
alpha:
approver: "@dchen1107"
@swatisehgal (Contributor) commented Sep 14, 2021

Node approver(s) should be provided in kep.yaml (as you already have). Here, one of the approvers from the list of production readiness approvers should be provided:

enhancements/OWNERS_ALIASES

Lines 191 to 195 in 0ec64c8

prod-readiness-approvers:
- johnbelamaric
- deads2k
- wojtek-t
- ehashman

@SaranBalaji90 (Author)

> Thanks for working on this KEP, this is super interesting! Left some comments, hope it helps.

Thank you @swatisehgal for the review. @manugupt1 is updating the PR with your feedback; we will keep you posted when the PR is updated.


manugupt1 commented Sep 16, 2021

/assign @manugupt1 so that I can be notified.

/cc

Also, Kubernetes scheduler is extensible but isn’t omniscient about everything
on the node hence there can be cases where a scheduling decision cannot be
fulfilled by the node based on the current node conditions such as node-level
resource allocation. The in-tree PodAdmitHandlers mentioned below are good


How about adding a new scheduler plugin that checks the Pod against the host's access control attributes/policy? You can register the Fargate nodes' attributes/policy to the Kubernetes control plane through the node feature discovery path. Doing this validation at the scheduler is more efficient, since it prevents the scheduler from selecting the Fargate nodes and eliminates all the unnecessary interaction between the Kubernetes control plane and the kubelet of the Fargate node.

Reply:

kubernetes/website#21128 (comment) shows that scheduler plugins can be disabled.

In the case of Fargate, the fleet is owned by Fargate but the customer has admin rights to the cluster. A customer can choose to disable the scheduler plugin.


## Proposal

The approach taken is similar to the container networking interface (CNI) plugin
@smarterclayton (Contributor)

From a design perspective, CNI seems to have the most downsides of all node-level plugin mechanisms. It's an RPC framework modulated via a process interface. It's hard to extend, it has a narrow channel of error responses (limiting the amount of info that can be returned to users), and it complicates how security is handled (we have to defend against certain types of escalations that come due to process permission inheritance).

I would hold up CRI, CSI, and auth which leverage GRPC as more coherent, efficient, and evolvable interfaces. I would expect a discussion in this kep about why binary shell execution was chosen over gRPC registration (in general we pick gRPC for most RPC callouts today, with the exception of webhook admission, which has a long history). I would probably expect gRPC to be chosen as the "default" mechanism unless a very strong argument was made against it.

In long term, I consider it an open topic of discussion to move CNI to a gRPC style interface, and I would expect to raise the bar on components that don't either extend via the controller pattern or a gRPC pattern. So it's good to start that discussion here.
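
To make that comparison concrete, one possible shape for a gRPC-based callout, analogous to how CRI/CSI/device plugins define their services; this is an illustrative sketch only, not an API from the KEP:

```go
package admitapi

import "context"

// AdmitRequest and AdmitResponse are hypothetical message types; in practice
// they would be generated from a .proto definition.
type AdmitRequest struct {
	Pod []byte // serialized Pod object (encoding to be defined by the KEP)
}

type AdmitResponse struct {
	Admit   bool
	Reason  string // surfaced in the Pod status on rejection
	Message string
}

// PodAdmitService is the shape of the callout the kubelet would invoke over a
// local unix socket for each pod it is about to admit.
type PodAdmitService interface {
	Admit(ctx context.Context, req *AdmitRequest) (*AdmitResponse, error)
}
```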

type PluginType string

const (
PluginTypeShell PluginType = "shell" // binary to execute.
@smarterclayton (Contributor) commented Jan 5, 2022

It's not clear why we need 3 types of plugins. Why is gRPC not sufficient?

I would probably expect the configuration for this to look more like kubelet configuration than "on disk configuration". It's possible that the different types of kubelet configuration that exist have complicated the perception of this, but if this config is part of the kubelet, the real questions are:

  1. Is this a static config tied to the kubelet process lifecycle? Why or why not?
  2. What flexibility does disk configuration add? Why is it necessary?
  3. What flexibility does having 3 types of plugin add? Why is it necessary?
  4. What are the requirements of this plugin's invocation pattern? Latency, throughput, discoverability, extensibility, security? Knowing those, we can identify what a recommended plugin configuration should be.

Also, another question I have is:

  1. Was it easy for you to understand what the current guidance for plugins to Kube components or the Kubelet is, or were you having to look at a bunch of different examples and guess at what the recommendation was?

If it's the latter, that seems like something sig-node needs to clarify in a KEP or other doc "recommended design and tradeoffs for kubelet extension" so that we can save people time inferring it.

architecture. With CNI, kubelet invokes one or more CNI plugin binaries on the
host to set up a Pod’s networking. kubelet discovers available CNI plugins
by [examining](https://github.com/kubernetes/kubernetes/blob/dd5272b76f07bea60628af0bb793f3cca385bf5e/pkg/kubelet/dockershim/docker_service.go#L242)
a well-known directory (`/etc/cni/net.d`) for configuration files
@smarterclayton (Contributor)

If the configuration of these plugins is not dynamic, I don't see why we would use a node-specific directory vs. simply requiring kubelet config. Is there a use case for dynamic reconfiguration while a kubelet is running? If not, a mechanism that supports dynamic registration probably wouldn't be encouraged.

* Alternative resource allocator implemented by the CRI which needs to reject
pods in conditions similar to the built-in resource managers.

These can also benefit from a pluggable admission handler to reject pods before
@smarterclayton (Contributor)

I think a key question this KEP doesn't really clarify is whether this set of hooks applies ONLY to things that are representable in the API model (things that show up in Pod) or whether it applies to other parts of the Kubelet logic that are internal to how Kubelet models pods (the runtime pod, volumes, csi driver interactions). If it's the former, you're effectively duplicating apiserver admission (literally pod admission hooks take a Pod and say yes or no), and you should probably simply leverage that mechanism - "A user should be able to reject a Pod for admission at the Kubelet and the mechanism should look identical to rejecting a pod for admission at the APIserver". If it's the latter, we would need to define formal APIs for those, and if we do that those APIs should probably still look like an apiserver admission (if we have an API, it should be kubelike, if it's kubelike, the existing webhook pattern solves it, so let's not invent another pattern).

A second concern as well is whether mutation is allowed, or whether this is only validation. If it's mutation, some of the implications of the apiserver apply and should be addressed (and it's also an even stronger argument that if you want kubelet admission hooks, it should look like apiserver hooks).

I think before we get into the guts of discussing "how you register these" we need to address the questions above.

@249043822 (Member)

/cc

@derekwaynecarr (Member)

@smarterclayton can we find some time on sig-node slack to try and reconcile your comments with @SaranBalaji90 ?

@k8s-triage-robot

The Kubernetes project currently lacks enough contributors to adequately respond to all issues and PRs.

This bot triages issues and PRs according to the following rules:

  • After 90d of inactivity, lifecycle/stale is applied
  • After 30d of inactivity since lifecycle/stale was applied, lifecycle/rotten is applied
  • After 30d of inactivity since lifecycle/rotten was applied, the issue is closed

You can:

  • Mark this issue or PR as fresh with /remove-lifecycle stale
  • Mark this issue or PR as rotten with /lifecycle rotten
  • Close this issue or PR with /close
  • Offer to help out with Issue Triage

Please send feedback to sig-contributor-experience at kubernetes/community.

/lifecycle stale

@k8s-ci-robot k8s-ci-robot added the lifecycle/stale Denotes an issue or PR has remained open with no activity and has become stale. label May 3, 2022
@k8s-triage-robot

The Kubernetes project currently lacks enough active contributors to adequately respond to all issues and PRs.

This bot triages issues and PRs according to the following rules:

  • After 90d of inactivity, lifecycle/stale is applied
  • After 30d of inactivity since lifecycle/stale was applied, lifecycle/rotten is applied
  • After 30d of inactivity since lifecycle/rotten was applied, the issue is closed

You can:

  • Mark this issue or PR as fresh with /remove-lifecycle rotten
  • Close this issue or PR with /close
  • Offer to help out with Issue Triage

Please send feedback to sig-contributor-experience at kubernetes/community.

/lifecycle rotten

@k8s-ci-robot k8s-ci-robot added lifecycle/rotten Denotes an issue or PR that has aged beyond stale and will be auto-closed. and removed lifecycle/stale Denotes an issue or PR has remained open with no activity and has become stale. labels Jun 2, 2022
@k8s-triage-robot

The Kubernetes project currently lacks enough active contributors to adequately respond to all issues and PRs.

This bot triages issues and PRs according to the following rules:

  • After 90d of inactivity, lifecycle/stale is applied
  • After 30d of inactivity since lifecycle/stale was applied, lifecycle/rotten is applied
  • After 30d of inactivity since lifecycle/rotten was applied, the issue is closed

You can:

  • Reopen this issue or PR with /reopen
  • Mark this issue or PR as fresh with /remove-lifecycle rotten
  • Offer to help out with Issue Triage

Please send feedback to sig-contributor-experience at kubernetes/community.

/close

@k8s-ci-robot (Contributor)

@k8s-triage-robot: Closed this PR.

In response to the /close command above.

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.
