Add node-level plugin support for pod admission handler #2811
Force-pushed from b3af251 to b8a8838
Slides for KEP presentation - NodeLocal Pluggable admit handler-v2.pptx
Please convert your KEP to the latest template. I see that the version skew strategy and production readiness questionnaires are missing, among other key sections of the design.
Thanks @ehashman for checking. I will update it.
@ehashman I have updated this KEP as per the latest template.
| The team initially investigated using a Pod Security Policy (PSP) that would
| prevent Pods with a Fargate scheduler type from having an elevated security
| context or using host networking. However, because the EKS user has
| administrative rights on the Kubernetes cluster, API-layer constructs such as a
| Pod Security Policy may be deleted, which would effectively disable the effect
| of that PSP. Likewise, the second solution the team landed on — using Node
Thank you for presenting in today's SIG Node meeting.
Something I found quite puzzling is that you'd like the design to guard against the cluster admin; this was mentioned by @Random-Liu during the call. Not trusting the cluster admin is a peculiar case, since the cluster admin can already render a cluster inoperable or change key RBAC roles.
Can you elaborate on why you would not trust them regarding potential disruptions, such as tampering with PSPs or webhooks? Why isn't documenting a best practice sufficient here?
Thank you @neolit123 for giving this a quick read :) There are two risks, mentioned below, that need to be addressed:
- Restrict pods that cannot execute on these managed worker nodes, and
- Prevent pods from accessing parts of the file system on the node or tampering with other processes on the host.
While #1 can be solved with a webhook or PSP (and if cluster admins delete them, it only impacts their own operations on the cluster; for example, in some scenarios pods get stuck in a weird state instead of receiving the appropriate error message emitted by a pod admit handler), #2 requires some validation in the kubelet before admitting the pods, since cluster admins can delete the webhook/PSP, schedule their pods directly onto these worker nodes by adding nodeName to their pod spec, and perform elevated actions. This also bypasses the scheduler, because cluster users populate the pod spec with the node name. I will elaborate in the KEP as well.
My thought here is that extending the pod admit handler to support node-local plugins will help other providers who want to perform similar last-minute checks in the kubelet.
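To illustrate the bypass: a pod spec that pins itself to a node skips the scheduler (and any scheduler-side checks) entirely. A minimal sketch, with made-up node and image names:

```yaml
apiVersion: v1
kind: Pod
metadata:
  name: bypass-scheduler-example   # hypothetical pod name
spec:
  nodeName: managed-worker-1       # binds the pod directly to a node, skipping the scheduler
  hostNetwork: true                # the kind of elevated access a node-local admit check would reject
  containers:
  - name: app
    image: registry.example.com/app:latest   # placeholder image
```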
I can see the resulting complications if the cluster admins perform these actions, but it's still not clear to me why the platform would want to prevent them from doing that. Is this about restricted scheduling such that it can benefit the provider and not the cluster admin (in one form or another)?
Doing some high-level thinking here:
Instead of adding more extension mechanisms, perhaps this should be considered a problem in the privileges ladder.
If there is a cluster-admin with full scope, there could be a persona on top of them that can lock certain objects.
This would overlap with the ongoing discussion about locking certain objects and blocking an accidental kubectl delete <everything>.
https://groups.google.com/d/msgid/kubernetes-sig-cli/CAN9Ncmx-a6qLr_%3D74Mv%2B%2Bp5rJJkPA%3Dk8vtFNTKs5LY1xB4x_Xw%40mail.gmail.com?utm_medium=email&utm_source=footer
> Is this about restricted scheduling such that it can benefit the provider and not the cluster admin (in one form or another)?

Yes, this is also one of our requirements that I called out earlier: prevent pods from accessing parts of the file system on the node or tampering with other processes on the host.

> If there is a cluster-admin with full scope, there could be a persona on top of them that can lock certain objects.

Yes, I like your thought process: something like blocking the cluster-admin from deleting certain objects, for example webhooks of certain types. This is where a manifest-based webhook would have really helped us, but there were a few caveats, as Daniel pointed out in #2074 (comment). Thank you for taking the time on this :)
Force-pushed from b8a8838 to deba9da
[APPROVALNOTIFIER] This PR is NOT APPROVED. This pull request has been approved by: SaranBalaji90. The full list of commands accepted by this bot can be found here.
Needs approval from an approver in each of these files. Approvers can indicate their approval by writing `/approve` in a comment.
Force-pushed from 95952b5 to 5223ef6
| mount, [pulling secrets](https://github.com/kubernetes/kubernetes/blob/v1.14.6/pkg/kubelet/kubelet.go#L1644)
| for pods etc.
|
| 2. OCI hook is invoked just before running the container, therefore Kubelet
Also, Kubelet retries the pod indefinitely in this case.
Force-pushed from 8596b29 to 58d8dbd
Incorporating updates from xujyan@apple.com Co-authored-by: xujyan <xujyan@apple.com> Co-authored-by: jaypipes <jaypipes@gmail.com>
Force-pushed from 58d8dbd to 7b472a2
Hello @derekwaynecarr and @dchen1107, will you be able to take a quick look at this? Based on your review, I will add this to our agenda to discuss further during our sig-node weekly meeting.
Similar proposal for rejecting non-Windows pods from Windows worker nodes: #2803
ehashman left a comment
/assign @derekwaynecarr
| latest-milestone:
| milestone:
|   alpha:
|   beta:
|   stable:
You'll need to fill these milestones out or the validation on the metadata won't work properly.
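For example, filled out it would look something like this (the specific releases here are placeholders, not a commitment):

```yaml
latest-milestone: "v1.23"
milestone:
  alpha: "v1.23"
  beta: "v1.25"
  stable: "v1.27"
```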
/cc
Thanks for working on this KEP, this is super interesting! Left some comments, hope it helps.
| resource allocation. The in-tree PodAdmitHandlers mentioned below are good
| examples for this.
|
| * Eviction manager rejects pods if the node is under high pressure (e.g., PIDs)
You might want to add here the recent addition of the CPU Manager option full-pcpus-only, which allows rejecting workloads that are not SMT-aligned.
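For reference, enabling that option looks roughly like this in the kubelet configuration (a sketch; exact feature-gate requirements depend on the Kubernetes version):

```yaml
apiVersion: kubelet.config.k8s.io/v1beta1
kind: KubeletConfiguration
cpuManagerPolicy: static
cpuManagerPolicyOptions:
  full-pcpus-only: "true"   # reject pods whose CPU request cannot be aligned to whole physical cores
```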
| also to prevent pods from accessing some portion of file system on the node or
| tampering other process on the host.
|
| The team initially investigated using a webhook or Pod Security Policy (PSP)
I think it would be more suitable to move this information to the Alternative Solutions section along with pros and cons of each approach.
| architecture. With CNI, kubelet invokes one or more CNI plugin binaries on the
| host to set up a Pod’s networking. kubelet discovers available CNI plugins
| by [examining](https://github.com/kubernetes/kubernetes/blob/dd5272b76f07bea60628af0bb793f3cca385bf5e/pkg/kubelet/dockershim/docker_service.go#L242)
| a well-known directory (`/etc/cni/net.d`) for configuration files
We should define the well-known directory path for the admission plugins as part of this proposal.
Also, I see later in the proposal that you have a plugin type of grpc. Does that mean that plugins would be deployed as a DaemonSet and interact with the kubelet over gRPC? Where is the gRPC API defined? Deploying it as a DaemonSet would give us the ability to deploy/remove custom admit handlers during the lifecycle of the kubelet. Am I understanding this correctly?
- Ack. We will definitely add and define one. The next iteration may not have it though.
- A DaemonSet won't be applicable in this case, for these reasons:
  A) A cluster administrator can change the properties of the DaemonSet.
  B) By the time a DaemonSet runs this plugin, it will already have run its (the DaemonSet's) container. As a result, the DaemonSet may be able to get privileged access on Fargate, which is exactly what the plugin intends to prevent.

Note: The intention right now is to use a unix socket connection, working in the same way as the kubelet's plugin registration mechanism.
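To make the unix-socket direction concrete, here is a rough Go sketch of a gRPC-style admit plugin serving on a unix socket. The socket path, service shape, and type names are assumptions for illustration only, not the API defined by this KEP:

```go
package main

import (
	"context"
	"log"
	"net"

	"google.golang.org/grpc"
)

// AdmitRequest would carry the encoded Pod spec sent by the kubelet (assumed shape).
type AdmitRequest struct {
	EncodedPod []byte
}

// AdmitResponse mirrors the AdmitResult discussed in the KEP: admit or reject,
// with a machine-readable Reason and a human-readable Message (assumed fields).
type AdmitResponse struct {
	Admit   bool
	Reason  string
	Message string
}

// PodAdmitPlugin is the hypothetical service surface a plugin would implement;
// in practice this would be generated from a .proto definition.
type PodAdmitPlugin interface {
	Admit(ctx context.Context, req *AdmitRequest) (*AdmitResponse, error)
}

func main() {
	// The kubelet would dial this socket, similar to device-plugin registration.
	lis, err := net.Listen("unix", "/var/lib/kubelet/pod-admit-plugins/example.sock") // path is an assumption
	if err != nil {
		log.Fatalf("failed to listen on unix socket: %v", err)
	}
	srv := grpc.NewServer()
	// A generated RegisterPodAdmitPluginServer(srv, &myPlugin{}) call from the
	// yet-to-be-defined proto would go here.
	if err := srv.Serve(lis); err != nil {
		log.Fatalf("grpc serve failed: %v", err)
	}
}
```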
| ### Risks and Mitigations
|
| These out-of-tree plugins should be provisioned and managed along with kubelet.
Let's capture as part of this KEP, from a sysadmin's perspective, where these out-of-tree plugins would be consumed from. Do we handle this just with documentation, or do we plan to have a common repository, e.g. in https://github.com/kubernetes-sigs, to house or point to third-party out-of-tree admission plugins? Here are a few examples:
- scheduler plugins: https://github.com/kubernetes-sigs/scheduler-plugins (this repo houses scheduler plugins)
- CNI plugins: https://github.com/containernetworking/plugins (this repo has CNI network plugins maintained by the containernetworking team) and https://github.com/containernetworking/cni#3rd-party-plugins points to the location where third party plugins can be obtained from.
| operator (e.g., the descheduler) that watches over pod rejections and delete the
| pods based on the Reason field.
|
| ### Test Plan
All these sections (Test Plan, Graduation Criteria, PRR review questionnaire) need to be filled.
| Pod object and can be invoked in parallel since there is no dependency between
| them.
|
| These validations will be applied to both static pods which are managed by
It might be useful to have a mechanism to run the admission handler only on a subset of pods. Maybe we can have a way to exclude certain kinds of pods, or pods with a particular label. WDYT?
It is a nice idea. We would potentially like to look at it in the beta phase. If you have a use case, please let us know.
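For what it's worth, one way such filtering could look is a per-plugin exclusion label selector; all names in this sketch are hypothetical, just to illustrate the idea:

```go
package admission

import (
	v1 "k8s.io/api/core/v1"
	"k8s.io/apimachinery/pkg/labels"
)

// shouldInvokePlugin decides whether a node-local admit plugin should see this
// pod, given a hypothetical per-plugin exclusion selector from its config.
func shouldInvokePlugin(pod *v1.Pod, exclude labels.Selector) bool {
	if exclude == nil || exclude.Empty() {
		return true
	}
	// Pods whose labels match the exclusion selector bypass this plugin.
	return !exclude.Matches(labels.Set(pod.Labels))
}
```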
| Kubelet will encode the pod spec and invoke each admission plugin. After
| decoding the pod spec, plugins can perform additional validations and return the
| encoded form of the struct mentioned below to kubelet to decide whether to admit
| the pod or not. For shell plugin type, request will be sent over stdin and
You mention a shell plugin type here; can you please explain what you mean by that?
The kubelet will be able to call an executable using os/exec and communicate with the plugin over stdin, stdout, and stderr. A diagram is needed; I will add one.
We will also update the KEP to present the unix socket type as the first point of discussion, as that is what we would want to do in Alpha.
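To make the shell type concrete, here is a minimal sketch of what such a plugin binary might look like, assuming the kubelet writes a JSON-encoded Pod to the plugin's stdin and reads a JSON-encoded result from stdout; the encoding and the result fields are assumptions for illustration, not the KEP's final interface:

```go
package main

import (
	"encoding/json"
	"fmt"
	"os"

	v1 "k8s.io/api/core/v1"
)

type admitResult struct {
	Admit   bool   `json:"admit"`
	Reason  string `json:"reason,omitempty"`
	Message string `json:"message,omitempty"`
}

func main() {
	var pod v1.Pod
	if err := json.NewDecoder(os.Stdin).Decode(&pod); err != nil {
		fmt.Fprintf(os.Stderr, "failed to decode pod: %v\n", err)
		os.Exit(1)
	}

	result := admitResult{Admit: true}
	// Example policy: reject pods that request host networking.
	if pod.Spec.HostNetwork {
		result = admitResult{
			Admit:   false,
			Reason:  "HostNetworkNotAllowed", // surfaced in the Pod status Reason
			Message: "pods on this node may not use host networking",
		}
	}

	if err := json.NewEncoder(os.Stdout).Encode(result); err != nil {
		fmt.Fprintf(os.Stderr, "failed to encode result: %v\n", err)
		os.Exit(1)
	}
}
```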
| response will be received over stdout and stderr.
|
| ```
| AdmitResult {
I think we should capture how the rejection manifests itself from the user's perspective, e.g. the Pod status shows the Reason (e.g. CustomAdmissionFailure) along with a description, which essentially is the Message. Let's explain this with concrete examples. It is implicit, but it might be worth mentioning that this gives each plugin a mechanism for a more plugin-specific pod rejection error message.
Definitely, we will add diagrams and examples in user stories.
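As one possible illustration of how such a rejection could surface to the user (the Reason and Message values below simply echo the examples in this discussion, not anything defined by the KEP):

```yaml
# Illustrative only: what a rejected pod's status might look like after a
# node-local plugin denies admission.
status:
  phase: Failed
  reason: CustomAdmissionFailure                              # from the plugin's Reason field
  message: "pods on this node may not use host networking"    # from the plugin's Message field
```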
| kep-number: 2823
| alpha:
|   approver: "@dchen1107"
Node approver(s) should be provided in the kep.yaml (like you already have). Here one of the approvers from the list of Production readiness approvers should be provided:
Lines 191 to 195 in 0ec64c8
| prod-readiness-approvers:
| - johnbelamaric
| - deads2k
| - wojtek-t
| - ehashman
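In other words, the prod-readiness file should name one of those four, something like the following (illustrative; use whichever PRR approver actually takes the review):

```yaml
kep-number: 2823
alpha:
  approver: "@johnbelamaric"
```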
Thank you @swatisehgal for the review. @manugupt1 is updating the PR with your feedback; we will keep you posted when the PR is updated.
/assign @manugupt1 so that I can be notified. /cc
| Also, Kubernetes scheduler is extensible but isn’t omniscient about everything
| on the node hence there can be cases where a scheduling decision cannot be
| fulfilled by the node based on the current node conditions such as node-level
| resource allocation. The in-tree PodAdmitHandlers mentioned below are good
How about adding a new scheduler plugin that checks the Pod against the host's access control attributes/policy? You could register the Fargate nodes' attributes/policy with the K8s control plane through the node feature discovery path. Doing this validation at the scheduler is more efficient, since it prevents the scheduler from selecting the Fargate nodes in the first place and eliminates all the unnecessary interaction between the K8s control plane and the kubelet of the Fargate node.
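For illustration, such a check could be written as a scheduler framework Filter plugin along these lines; the plugin name, node label, and policy check are assumptions for the sketch, not part of this KEP:

```go
package fargatefilter

import (
	"context"

	v1 "k8s.io/api/core/v1"
	"k8s.io/kubernetes/pkg/scheduler/framework"
)

// fargatePolicyFilter rejects nodes whose (hypothetical) access-control policy
// the pod would violate, so such pods never get scheduled there.
type fargatePolicyFilter struct{}

func (f *fargatePolicyFilter) Name() string { return "FargatePolicyFilter" }

func (f *fargatePolicyFilter) Filter(ctx context.Context, state *framework.CycleState, pod *v1.Pod, nodeInfo *framework.NodeInfo) *framework.Status {
	node := nodeInfo.Node()
	if node == nil {
		return framework.NewStatus(framework.Error, "node not found")
	}
	// Suppose node-feature-discovery labeled managed nodes like this (assumed label).
	if node.Labels["example.com/managed-by"] == "fargate" && pod.Spec.HostNetwork {
		return framework.NewStatus(framework.Unschedulable, "host networking is not allowed on managed nodes")
	}
	return nil
}
```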
kubernetes/website#21128 (comment) shows that scheduler plugins can be disabled.
In the case of Fargate, the fleet is owned by Fargate but the customer has admin rights to the cluster. A customer can choose to disable the scheduler plugin.
| ## Proposal
|
| The approach taken is similar to the container networking interface (CNI) plugin
From a design perspective, CNI seems to have the most downsides of all node-level plugin mechanisms. It's an RPC framework modulated via a process interface. It's hard to extend, it has a narrow channel of error responses (limiting the amount of info that can be returned to users), and it complicates how security is handled (we have to defend against certain types of escalations that come due to process permission inheritance).
I would hold up CRI, CSI, and auth, which leverage gRPC, as more coherent, efficient, and evolvable interfaces. I would expect a discussion in this KEP about why binary shell execution was chosen over gRPC registration (in general we pick gRPC for most RPC callouts today, with the exception of webhook admission, which has a long history). I would probably expect gRPC to be chosen as the "default" mechanism unless a very strong argument was made against it.
In the long term, I consider it an open topic of discussion to move CNI to a gRPC-style interface, and I would expect to raise the bar on components that don't extend via either the controller pattern or a gRPC pattern. So it's good to start that discussion here.
| type PluginType string
|
| const (
|     PluginTypeShell PluginType = "shell" // binary to execute.
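The quoted declaration is cut off here; from the rest of the thread, at least "shell" and "grpc" types are discussed, so the full block presumably looks something like the following (the third type is not visible in this excerpt and is deliberately left out rather than guessed):

```go
type PluginType string

const (
	PluginTypeShell PluginType = "shell" // binary to execute; request over stdin, response over stdout/stderr
	PluginTypeGRPC  PluginType = "grpc"  // gRPC service, e.g. reachable over a unix socket
	// A third plugin type is referenced in the review discussion but not shown in this excerpt.
)
```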
It's not clear why we need 3 types of plugins. Why is gRPC not sufficient?
I would probably expect the configuration for this to look more like kubelet configuration than "on disk configuration". It's possible that the different types of kubelet configuration that exist have complicated the perception of this, but if this config is part of the kubelet, the real questions are:
- Is this a static config tied to the kubelet process lifecycle? Why or why not?
- What flexibility does disk configuration add? Why is it necessary?
- What flexibility does having 3 types of plugin add? Why is it necessary?
- What are the requirements of this plugin's invocation pattern? Latency, throughput, discoverability, extensibility, security? Knowing those, we can identify what a recommended plugin configuration should be.
Also, another question I have is:
- Was it easy for you to understand what the current guidance for plugins to Kube components or the Kubelet is, or were you having to look at a bunch of different examples and guess at what the recommendation was?
If it's the latter, that seems like something sig-node needs to clarify in a KEP or other doc "recommended design and tradeoffs for kubelet extension" so that we can save people time inferring it.
| architecture. With CNI, kubelet invokes one or more CNI plugin binaries on the
| host to set up a Pod’s networking. kubelet discovers available CNI plugins
| by [examining](https://github.com/kubernetes/kubernetes/blob/dd5272b76f07bea60628af0bb793f3cca385bf5e/pkg/kubelet/dockershim/docker_service.go#L242)
| a well-known directory (`/etc/cni/net.d`) for configuration files
If the configuration of these plugins is not dynamic, I don't see why we would use a node-specific directory vs. simply requiring kubelet config. Is there a use case for dynamic reconfiguration while a kubelet is running? If not, a mechanism that supports dynamic registration probably wouldn't be encouraged.
| * Alternative resource allocator implemented by the CRI which needs to reject
|   pods in conditions similar to the built-in resource managers.
|
| These can also benefit from a pluggable admission handler to reject pods before
I think a key question this KEP doesn't really clarify is whether this set of hooks applies ONLY to things that are representable in the API model (things that show up in Pod) or whether it applies to other parts of the Kubelet logic that are internal to how Kubelet models pods (the runtime pod, volumes, csi driver interactions). If it's the former, you're effectively duplicating apiserver admission (literally pod admission hooks take a Pod and say yes or no), and you should probably simply leverage that mechanism - "A user should be able to reject a Pod for admission at the Kubelet and the mechanism should look identical to rejecting a pod for admission at the APIserver". If it's the latter, we would need to define formal APIs for those, and if we do that those APIs should probably still look like an apiserver admission (if we have an API, it should be kubelike, if it's kubelike, the existing webhook pattern solves it, so let's not invent another pattern).
A second concern as well is whether mutation is allowed, or whether this is only validation. If it's mutation, some of the implications of the apiserver apply and should be addressed (and it's also an even stronger argument that if you want kubelet admission hooks, it should look like apiserver hooks).
I think before we get into the guts of discussing "how you register these" we need to address the questions above.
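For reference, the existing apiserver-side pattern mentioned above is a ValidatingWebhookConfiguration; a trimmed example that subjects pod creation to a webhook looks like this (the names, namespace, and service are placeholders):

```yaml
apiVersion: admissionregistration.k8s.io/v1
kind: ValidatingWebhookConfiguration
metadata:
  name: pod-policy.example.com              # placeholder name
webhooks:
- name: pod-policy.example.com
  rules:
  - apiGroups: [""]
    apiVersions: ["v1"]
    operations: ["CREATE"]
    resources: ["pods"]
  clientConfig:
    service:
      namespace: example-system             # placeholder service location
      name: pod-policy-webhook
    caBundle: "<base64-encoded CA bundle>"  # placeholder
  admissionReviewVersions: ["v1"]
  sideEffects: None
  failurePolicy: Fail
```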
/cc
@smarterclayton can we find some time on the sig-node Slack to try and reconcile your comments with @SaranBalaji90?
The Kubernetes project currently lacks enough contributors to adequately respond to all issues and PRs. Please send feedback to sig-contributor-experience at kubernetes/community.
/lifecycle stale
The Kubernetes project currently lacks enough active contributors to adequately respond to all issues and PRs. Please send feedback to sig-contributor-experience at kubernetes/community.
/lifecycle rotten
The Kubernetes project currently lacks enough active contributors to adequately respond to all issues and PRs. Please send feedback to sig-contributor-experience at kubernetes/community.
/close
@k8s-triage-robot: Closed this PR.
Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.
This KEP adds support for "out-of-tree" plugins for the pod admission handler. It is not currently possible to add additional validations (without changing kubelet code) before admitting a Pod. This PR provides the flexibility to add such validations without updating the kubelet source code.
Co-authored-by: @xujyan xujyan@apple.com
Co-authored-by: @jaypipes jaypipes@gmail.com