244 changes: 244 additions & 0 deletions keps/sig-node/20200115-pod-addmission-plugin.md
---
title: Node-local pluggable pod admission framework
authors:
- "@SaranBalaji90"
- "@jaypipes"
owning-sig: sig-node
reviewers: TBD
approvers: TBD
editor: TBD
creation-date: 2020-01-15
last-updated: 2020-01-15
status: provisional
---

# Plugin support for pod admission handlers

## Table of Contents

<!-- toc -->
- [Release Signoff Checklist](#release-signoff-checklist)
- [Summary](#summary)
- [Motivation](#motivation)
- [Goals](#goals)
- [Non-Goals](#non-goals)
- [Proposal](#proposal)
- [Design Details](#design-details)
- [Test Plan](#test-plan)
- [Graduation Criteria](#graduation-criteria)
- [Alpha -&gt; Beta Graduation](#alpha---beta-graduation)
- [Beta -&gt; GA Graduation](#beta---ga-graduation)
- [Implementation History](#implementation-history)
- [References](#references)
<!-- /toc -->

## Release Signoff Checklist

- [ ] kubernetes/enhancements issue in release milestone, which links to KEP (this should be a link to the KEP location in kubernetes/enhancements, not the initial KEP PR)
- [ ] KEP approvers have set the KEP status to `implementable`
- [ ] Design details are appropriately documented
- [ ] Test plan is in place, giving consideration to SIG Architecture and SIG Testing input
- [ ] Graduation criteria is in place
- [ ] "Implementation History" section is up-to-date for milestone
- [ ] User-facing documentation has been created in [kubernetes/website], for publication to [kubernetes.io]
- [ ] Supporting documentation e.g., additional design documents, links to mailing list discussions/SIG meetings, relevant PRs/issues, release notes

## Summary

Today, kubelet is responsible for determining if a Pod can execute on the node. Kubelet compares the required capabilities of the Pod against the discovered capabilities of both the worker node and the container runtime.

Kubelet will reject the Pod if any required capabilities in the Pod.Spec are not supported by the container engine running on the node. Such capabilities might include the ability to set sysctl parameters, use of elevated system privileges or use of a non-default process mount. Likewise, kubelet checks the Pod against node capabilities; for example, the presence of a specific apparmor profile or host kernel.

These validations represent final, last-minute checks immediately before the Pod is started by the container runtime. These node-local checks differ from API-layer validations like Pod Security Policies or Validating Admission webhooks. Whereas the latter may be deactivated or removed by Kubernetes cluster administrators, the former node-local checks cannot be disabled. As such, they represent a final defense against malicious actors and misconfigured Pods.

> Whereas the latter may be deactivated or removed by Kubernetes cluster administrators, the former node-local checks cannot be disabled.

**Member:** This doesn't make much sense to me. Why are node-local checks different?

**Author:** @tallclair The node-local checks can be the same; it's just that these validations can't be removed by a cluster administrator, whereas validating webhooks or PSPs can be removed by cluster admins.

**Member:** See #1712, which proposes static admission webhooks. Does that address this concern?

> they represent a final defense against malicious actors and misconfigured Pods

**Member:** I think it's worth keeping these two use cases separate. It makes sense to me to have some protection against misconfigured pods, since not all configuration details are available at the cluster level. However, I'm more skeptical of node-level admission offering an increase in security over cluster-level admission.

**Author:** Makes sense, will update this. Also, we do perform validations in the control plane and block such misconfigured pods. But if cluster admins remove the webhook/PSP, or have their own scheduler that bypasses our checks, then we need something on the node to block such pods from running.

It is not currently possible to add additional validations before admitting the Pod. This document proposes a framework for enabling additional node-local Pod admission checks.

## Motivation

Amazon Elastic Kubernetes Service (EKS) provides users a managed Kubernetes control plane. EKS users are provisioned a Kubernetes cluster running on AWS cloud infrastructure. While the EKS user does not have host-level administrative access to the master nodes, it is important to point out that they do have administrative rights on that Kubernetes cluster.

The EKS user’s worker node administrative access depends on the type of worker node the EKS user chooses. EKS users have three options. The first option is to bring their own EC2 instances as worker nodes. The second option is for EKS users to launch a managed worker node group. These first two options both result in the EKS user maintaining full host-level administrative rights on the worker nodes. The final option — the option that motivated this proposal — is for the EKS user to forego worker node management entirely using AWS Fargate, a serverless computing environment. With AWS Fargate, the EKS user does not have host-level administrative access to their worker node; in fact, the worker node runs on a serverless computing platform that abstracts away the entire notion of a host.

**Member:** nit: please line-wrap the whole document (I like 80 chars) so that it's easier to leave comments on parts of a paragraph.

**Author:** Sorry about that. Will update the doc.

In building the AWS EKS support for AWS Fargate, the AWS Kubernetes engineering team faced a dilemma: how could they prevent Pods destined to run on Fargate nodes from using host networking or assuming elevated host user privileges?

**Member:** Are daemonset pods targeting this node type? Is there a DNS plugin or SDN configured via a daemonset? Do things like a node exporter or similar monitoring components exclude this node type?

**Author:** Currently daemonset pods don't run on these node types, but that's something we are looking into.

The team initially investigated using a Pod Security Policy (PSP) that would prevent Pods with a Fargate scheduler type from having an elevated security context or using host networking. However, because the EKS user has administrative rights on the Kubernetes cluster, API-layer constructs such as a Pod Security Policy may be deleted, which would effectively disable the effect of that PSP. Likewise, the second solution the team landed on — using Node taints and tolerations — was similarly bound to the Kubernetes API layer, which meant EKS users could modify those Node taints and tolerations, effectively disabling the effects. A third potential solution involving OCI hooks was then investigated. OCI hooks are separate executables that an OCI-compatible container runtime invokes that can modify the behaviour of the containers in a sandbox. While this solution would have solved the API-layer problem, it introduced other issues, such as the inefficiency of downloading the container image to the Node before the OCI hook was run.

**Member:** This reads like you considered some built-in policy controls that didn't work, and then jumped to building a new node-level custom policy enforcement mechanism. What about custom cluster-level policy? We already have AdmissionWebhooks for exactly that reason. If your concern is a cluster admin being able to mess with the admission webhook, then I would rather consider a statically configured admission webhook (I think this has already been proposed elsewhere?) before proposing a completely new mechanism in the kubelet.

**Author (@SaranBalaji90, May 19, 2020):** Even with a static admission webhook, the problem is that we will put a heavy load on the controllers, because the webhook keeps rejecting and the controller keeps creating the pods. Adding it as a soft admit handler in kubelet will not put pressure on the controllers.

**Author (@SaranBalaji90, May 19, 2020):** Also, if I'm not wrong, users in the system:masters group can delete the validating webhook, right?

**Member:** BTW, #1712 is the proposal I was referring to.

> the problem is we will put heavy load on the controllers right, because webhook is going to reject and controller will keep creating the pods.

Controllers will back off. I'm not sure if they treat a rejection on pod creation differently from a failed pod.

The final solution the EKS team settled on involved changing kubelet itself to support additional node-local Pod admission checks. This KEP outlines the EKS team’s approach and proposes upstreaming these changes to kubelet in order to allow extensible node-local last-minute validation checks. This functionality will enable cloud providers who support nodeless/serverless worker nodes to restrict Pods based on fields other than those already being validated by kubelet.

### Goals

- Allow deployers of fully managed worker nodes to have control over Pods running on those nodes.

- Enable node-local Pod admission checks without requiring changes to kubelet.

### Non-Goals

- Move existing validations to “out of tree” plugins.

- Change existing API-layer validation solutions such as Pod Security Policies and validating admission webhooks.

## Proposal

The approach taken is similar to the container networking interface (CNI) plugin architecture. With CNI, kubelet invokes one or more CNI plugin binaries on the host to set up a Pod’s networking. kubelet discovers available CNI plugins by [examining](https://github.com/kubernetes/kubernetes/blob/dd5272b76f07bea60628af0bb793f3cca385bf5e/pkg/kubelet/dockershim/docker_service.go#L242) a well-known directory (`/etc/cni/net.d`) for configuration files and [loading](https://github.com/kubernetes/kubernetes/blob/dd5272b76f07bea60628af0bb793f3cca385bf5e/pkg/kubelet/dockershim/docker_service.go#L248) plugin [descriptors](https://github.com/kubernetes/kubernetes/blob/f4db8212be53c69a27d893d6a4111422fbce8008/pkg/kubelet/dockershim/network/plugins.go#L52) upon startup.

**Contributor:** I dealt with CNI while developing the kuryr-kubernetes CNI plugin, and I found the CNI command-line interface a little outdated; it doesn't reflect the current state of things, since every CNI plugin I know of nowadays runs a persistent daemon (in a pod or on the host) with a thin command-line client that connects to that daemon. So RPC is better; what kind of RPC is another question. I liked the way podresources and device plugins were implemented. If the necessity of this feature is proven, I vote for a solution based on RPC (gRPC) over a Unix domain socket.

**Author:** Under design, I have included the structs for the configuration file. There I specified three options for plugin types: binary file, Unix socket, and local gRPC server. We can decide if we need to support just the Unix socket and remove the other two.

**Member:** Unlike device plugins or CNI, there is an implication that no other pod is running on this host as part of bootstrapping the host. DNS and SDN are configured outside of the kube-api model itself.

To support pluggable validation for Pod admission on the worker node, we propose to have kubelet similarly discover node-local Pod admission plugins from a directory specified by a new `PodAdmissionPluginDir` flag.

**Member:** The implication here is that dynamic kubelet configuration is disabled on managed nodes.

**Author:** That's right. I guess you mean that, if it's enabled, users can update the kubelet configuration and change the plugin dir?

Another option is to enable this functionality through a feature flag, `enablePodAdmissionPlugin`, and have the directory path defined inside kubelet itself.
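
For illustration, the discovery surface might look roughly like the following. The directory paths and the exact flag spelling below are assumptions for the sketch, not part of this proposal:

```
# Hypothetical layout (paths and flag name are illustrative only)
/etc/kubernetes/pod-admission.d/      # confDir: plugin configuration file(s)
    00-admission-plugin.conf
/opt/pod-admission/bin/               # binDir: plugin executables for the "shell" type
    sysctlcheck
    fargatecheck

# kubelet pointed at the configuration directory via the new flag
kubelet --pod-admission-plugin-dir=/etc/kubernetes/pod-admission.d ...
```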

### Design Details

**Member:** A problem with the existing kubelet admission approach is that it can cause controllers to thrash. E.g. what if a DaemonSet controller is trying to schedule a pod on the kubelet, and the kubelet keeps rejecting it?

**Author:** You're right about this, but if a pod is rejected by a soft admit handler then this doesn't apply, right?

**Member:** Yeah. Historically soft-reject was added because controllers treated a failed pod differently from an error on create. I think this has since been resolved, and controllers should properly back off on failed pods. It would be good to clarify these interactions in the KEP.

**Member:** Would admission still apply to static pods?

**Author:** Good question, I will look into this. Not sure if kubelet invokes admit handlers for static pods.

**Member:** Whatever you conclude, please update the KEP to include it.

**Author:** Looks like admit handlers are invoked on static pods too. I will update the KEP to reflect this. Thanks.

#### Configuration file

Node-local Pod admission plugins will be listed in a configuration file. The `plugins` field lists the plugins to invoke before admitting the Pod.

```
{
  "name": "admission-plugin",
  "version": "0.1",
  "plugins": [
    {
      "name": "sysctlcheck",
      "type": "shell"
    },
    {
      "type": "fargatecheck",
      "type": "shell"
    }
  ]
}
```

**Member:** Suggested change: replace `"type": "fargatecheck",` with `"name": "fargatecheck",`.

**Member:** Similar to kube-apiserver admission configuration, I could see wanting to enable more flexibility here beyond a file-based configuration surface on the local kubelet. Maybe kubelet can source its admission configuration from multiple sources, so future scenarios could allow node-local extension where desired.

**Author:** That's a good idea. I guess you mean we shouldn't stick with just file-based configuration to read the plugin details; this could also be extended in the future to read from different sources.

The node-local Pod admission plugin configuration and manager have the following structure:

```go
package podadmission

import (
	"k8s.io/kubernetes/pkg/kubelet/lifecycle"
)

// PluginType indicates the type of the admission plugin.
type PluginType string

const (
	PluginTypeShell  PluginType = "shell"  // binary to execute.
	PluginTypeGRPC   PluginType = "grpc"   // local port on the host.
	PluginTypeSocket PluginType = "socket" // fd to connect to.
)

// unsupportedPodSpecMessage is a placeholder rejection reason; the final wording is TBD.
const unsupportedPodSpecMessage = "UnsupportedPodSpec"

// AdmissionPluginManager is the podAdmitHandler shim for external plugins.
type AdmissionPluginManager struct {
	confDir      string
	binDir       string
	pluginConfig *PluginConfig
}

// PluginConfig represents the plugin configuration file.
type PluginConfig struct {
	Name    string             `json:"name,omitempty"`
	Version string             `json:"version,omitempty"`
	Plugins []*AdmissionPlugin `json:"plugins"`
}

// AdmissionPlugin represents an individual plugin specified in the configuration.
type AdmissionPlugin struct {
	Name     string     `json:"name"`
	Type     PluginType `json:"type"`
	GrpcPort int        `json:"grpcPort,omitempty"` // not required for shell/socket type
	Socket   string     `json:"socket,omitempty"`   // not required for shell/grpc
}

// NewManager creates an AdmissionPluginManager for the given configuration and binary directories.
func NewManager(confDir, binDir string) *AdmissionPluginManager {
	admissionPluginMgr := &AdmissionPluginManager{
		confDir: confDir,
		binDir:  binDir,
	}
	admissionPluginMgr.initializePlugins() // sorts and reads the conf file and populates the list of plugins.

	return admissionPluginMgr
}

// Admit invokes each configured plugin before the Pod is admitted; the
// plugin-type cases below are filled in by the implementation.
func (apm *AdmissionPluginManager) Admit(attrs *lifecycle.PodAdmitAttributes) lifecycle.PodAdmitResult {
	var admit bool
	var message string

	for _, plugin := range apm.pluginConfig.Plugins {
		switch plugin.Type {
		case PluginTypeShell:
			// exec the plugin binary and collect its verdict.
		case PluginTypeGRPC:
			// fetch a connection to the gRPC service.
		case PluginTypeSocket:
			// fetch a connection through the Unix socket.
		}
	}

	response := lifecycle.PodAdmitResult{
		Admit:   admit,
		Message: message,
	}
	if !admit {
		response.Reason = unsupportedPodSpecMessage
	}
	return response
}
```
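
For illustration, the `PluginTypeShell` case above might be implemented roughly as follows. This is a sketch only: the helper name `runShellPlugin`, the 10-second timeout, and the assumption that the encoded Pod is passed on stdin and the verdict is read as JSON from stdout are illustrative, not decided by this KEP.

```go
import (
	"bytes"
	"context"
	"encoding/json"
	"os/exec"
	"path/filepath"
	"time"

	"k8s.io/kubernetes/pkg/kubelet/lifecycle"
)

// runShellPlugin executes a shell-type plugin binary from binDir, feeding it the
// encoded Pod on stdin and parsing its verdict from stdout. Any error (crash,
// timeout, malformed output) is treated as a rejection so the handler fails closed.
func (apm *AdmissionPluginManager) runShellPlugin(plugin *AdmissionPlugin, attrs *lifecycle.PodAdmitAttributes) (bool, string, error) {
	podJSON, err := json.Marshal(attrs.Pod)
	if err != nil {
		return false, "", err
	}

	ctx, cancel := context.WithTimeout(context.Background(), 10*time.Second)
	defer cancel()

	cmd := exec.CommandContext(ctx, filepath.Join(apm.binDir, plugin.Name))
	cmd.Stdin = bytes.NewReader(podJSON)

	out, err := cmd.Output()
	if err != nil {
		return false, "", err
	}

	var result struct {
		Admit   bool   `json:"admit"`
		Reason  string `json:"reason"`
		Message string `json:"message"`
	}
	if err := json.Unmarshal(out, &result); err != nil {
		return false, "", err
	}
	return result.Admit, result.Message, nil
}
```
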
#### Feature gate

This functionality adds a new feature gate named `PodAdmissionPlugin`, which controls whether the admission plugins are invoked.
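
As a sketch, the gate would be enabled like any other kubelet feature gate, for example via the standard `--feature-gates` flag:

```
kubelet --feature-gates=PodAdmissionPlugin=true ...
```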

#### Kubelet to pod admission plugin communication

**Member:** What is the interface for the shell type? Send over stdin, and get a response over stdout?

**Author:** Yes, this is similar to how CNI operates today (passing the required parameters via environment variables and reading the response over stdout).

Kubelet will encode the Pod spec and invoke each admission plugin's Admit() method. After decoding the Pod spec, a plugin can perform additional validations and return the encoded form of the struct below, which kubelet uses to decide whether to admit the Pod.

```go
type AdmitResult struct {
	Admit   bool
	Reason  string
	Message string
}
```
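
To illustrate the plugin side of this contract, the following is a minimal sketch of a "shell"-type plugin, assuming the encoded Pod arrives on stdin and the AdmitResult is written as JSON to stdout (the exact transport, CNI-style environment variables versus stdin, is still open per the discussion above). The example policy of rejecting host networking is purely illustrative.

```go
package main

import (
	"encoding/json"
	"os"

	corev1 "k8s.io/api/core/v1"
)

// AdmitResult mirrors the structure kubelet expects back from a plugin.
type AdmitResult struct {
	Admit   bool   `json:"admit"`
	Reason  string `json:"reason,omitempty"`
	Message string `json:"message,omitempty"`
}

func main() {
	var pod corev1.Pod
	result := AdmitResult{Admit: true}

	// Decode the Pod handed to us by kubelet (assumed to arrive on stdin).
	if err := json.NewDecoder(os.Stdin).Decode(&pod); err != nil {
		result = AdmitResult{Admit: false, Reason: "DecodeError", Message: err.Error()}
	} else if pod.Spec.HostNetwork {
		// Example policy: reject Pods that request host networking.
		result = AdmitResult{Admit: false, Reason: "UnsupportedPodSpec", Message: "host networking is not allowed on this node"}
	}

	// Report the verdict back to kubelet on stdout.
	json.NewEncoder(os.Stdout).Encode(result)
}
```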

#### Implementation detail

**Member (@tallclair, May 7, 2020):** How does bootstrapping work? For the non-shell types, it looks like the assumption is that the server is running prior to the kubelet?

**Author:** Yes, you're right, the server should be running before kubelet can accept pods. I was initially thinking about managing these with the same component that manages kubelet, e.g. in a Linux environment, through systemd. But the more I think about this, the shell type might be simpler for this approach: users don't have to monitor one more component on the host. Happy to hear your feedback here.

**Member:** What are the failure modes? Fail open or fail closed? How would a failure be debugged? Would kubelet start if it couldn't connect to an admission hook?

**Author:** It should be fail closed: if we can't validate the spec but admit the pod anyway, whatever we are trying to protect might be violated. Kubelet can start even if it couldn't connect to an admission hook, but it will not accept pods without validating the pod spec with the plugin.

**Member:** Please add these details to the KEP.

As part of this implementation, a new sub-package will be added to pkg/kubelet/lifecycle. An in-tree admission handler shim will be included in this package; it will be responsible for discovering and invoking the pod admission plugins available on the host to decide whether to admit the Pod.

If a plugin does not respond or keeps crashing, the Pod will not be accepted by kubelet: if kubelet ignored the failure and ran the Pod anyway, the intent of keeping Pods with specific needs off these worker nodes would be violated. These plugins do not mutate the Pod object and can be invoked in parallel, since there are no dependencies between them.
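
To make the parallel, fail-closed invocation concrete, the aggregation could look roughly like the sketch below. The `invoke` helper (dispatching on the plugin type) is assumed rather than defined here, and the real shim may structure this differently.

```go
import (
	"sync"

	"k8s.io/kubernetes/pkg/kubelet/lifecycle"
)

// admitAll runs every configured plugin concurrently and fails closed on the
// first rejection or error.
func (apm *AdmissionPluginManager) admitAll(attrs *lifecycle.PodAdmitAttributes) lifecycle.PodAdmitResult {
	type verdict struct {
		admit   bool
		message string
	}
	results := make(chan verdict, len(apm.pluginConfig.Plugins))

	var wg sync.WaitGroup
	for _, plugin := range apm.pluginConfig.Plugins {
		wg.Add(1)
		go func(p *AdmissionPlugin) {
			defer wg.Done()
			admit, msg, err := apm.invoke(p, attrs) // invoke dispatches on p.Type (assumed helper)
			if err != nil {
				// A plugin that crashes or does not respond counts as a rejection.
				admit, msg = false, err.Error()
			}
			results <- verdict{admit, msg}
		}(plugin)
	}
	wg.Wait()
	close(results)

	for r := range results {
		if !r.admit {
			return lifecycle.PodAdmitResult{Admit: false, Reason: unsupportedPodSpecMessage, Message: r.message}
		}
	}
	return lifecycle.PodAdmitResult{Admit: true}
}
```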

### Test Plan

- Apart from unit tests, add integration tests that invoke a process running on the node to decide on Pod admission:
  * The process returns true and the Pod is executed on the worker node.
  * The process returns false and the Pod is not admitted on the worker node.
  * The process does not respond and the Pod is not admitted on the worker node.
  * Tests with multiple node-local pod admission handlers.
  * Tests of multiple node-local pod admission handlers with one intentionally non-functioning or crashing plugin.

### Graduation Criteria

#### Alpha -> Beta Graduation

- Ensure there is minimal or acceptable performance degradation when external plugins are enabled on the node, including monitoring kubelet CPU and memory utilization, since kubelet performs additional calls to invoke the plugins.

- Have the feature tested by other community members and address the feedback.

#### Beta -> GA Graduation

- TODO

## Implementation History

- 2020-01-15: Initial KEP sent out for initial review, including Summary, Motivation and Proposal

## Other solutions discussed

Why didn't we use a CRI shim or the OCI hook approach?

1. Even before kubelet invokes the container [runtime](https://github.com/kubernetes/kubernetes/blob/v1.14.6/pkg/kubelet/kubelet.go#L1665), it sets up a few things for the Pod, including the [cgroup](https://github.com/kubernetes/kubernetes/blob/v1.14.6/pkg/kubelet/kubelet.go#L1605), volume mounts, [pulling secrets](https://github.com/kubernetes/kubernetes/blob/v1.14.6/pkg/kubelet/kubelet.go#L1644) for the Pod, etc.

2. An OCI hook is invoked just before running the container, so kubelet would already have downloaded the image as well. Even if the hook rejects the Pod, there is no good way to emit events explaining why the hook failed Pod creation.

## References

- https://kubernetes.io/docs/tutorials/clusters/apparmor/
- https://github.com/kubernetes/enhancements/blob/master/keps/sig-node/0035-20190130-topology-manager.md
- https://github.com/kubernetes/community/pull/1934/files
- https://kubernetes.io/docs/concepts/policy/pod-security-policy/
- https://kubernetes.io/docs/concepts/configuration/taint-and-toleration/