---
kep-number: 14
title: Runtime Class
authors:
  - "@tallclair"
owning-sig: sig-node
participating-sigs:
  - sig-architecture
reviewers:
  - TBD
approvers:
  - TBD
editor: TBD
creation-date: 2018-06-19
status: provisional
---

Runtime Class

Summary

RuntimeClass is a new cluster-scoped resource that surfaces container runtime properties to the control plane. RuntimeClasses are assigned to pods through a runtimeClassName field on the PodSpec. This provides a new mechanism for supporting multiple runtimes in a cluster and/or node.

Motivation

There is growing interest in using different runtimes within a cluster. Sandboxes are the primary motivator for this right now, with both Kata Containers and gVisor looking to integrate with Kubernetes. Other runtime models such as Windows containers or even remote runtimes will also require support in the future. RuntimeClass provides a way to select between different runtimes configured in the cluster and surface their properties (both to the cluster & the user).

In addition to selecting the runtime to use, supporting multiple runtimes raises other problems to the control plane level, including: accounting for runtime overhead, scheduling to nodes that support the runtime, and surfacing which optional features are supported by different runtimes. Although these problems are not tackled by this initial proposal, RuntimeClass provides a cluster-scoped resource tied to the runtime that can help solve these problems in a future update.

Goals

  • Provide a mechanism for surfacing container runtime properties to the control plane
  • Support multiple runtimes per-cluster, and provide a mechanism for users to select the desired runtime

Non-Goals

  • RuntimeClass is NOT RuntimeComponentConfig.
  • RuntimeClass is NOT a general policy mechanism.
  • RuntimeClass is NOT "NodeClass". Although different nodes may run different runtimes, in general RuntimeClass should not be a cross product of runtime properties and node properties.

The following goals are out-of-scope for the initial implementation, but may be explored in a future iteration:

  • Surfacing support for optional features by runtimes, and surfacing errors caused by incompatible features & runtimes earlier.
  • Automatic runtime or feature discovery - initially RuntimeClasses are manually defined (by the cluster admin or provider), and are asserted to be an accurate representation of the runtime.
  • Scheduling in heterogeneous clusters - it is possible to operate a heterogeneous cluster (different runtime configurations on different nodes) through scheduling primitives like NodeAffinity and Taints+Tolerations, but the user is responsible for setting these up and automatic runtime-aware scheduling is out-of-scope.
  • Define standardized or conformant runtime classes - although I would like to declare some predefined RuntimeClasses with specific properties, doing so is out-of-scope for this initial KEP.
  • Pod Overhead - Although RuntimeClass is likely to be the configuration mechanism of choice, the details of how pod resource overhead will be implemented is out of scope for this KEP.
  • Provide a mechanism to dynamically register or provision additional runtimes.
  • Requiring specific RuntimeClasses according to policy. This should be addressed by other cluster-level policy mechanisms, such as PodSecurityPolicy.
  • "Fitting" a RuntimeClass to pod requirements - In other words, specifying runtime properties and letting the system match an appropriate RuntimeClass, rather than explicitly assigning a RuntimeClass by name. This approach can increase portability, but can be added seamlessly in a future iteration.

User Stories

  • As a cluster operator, I want to provide multiple runtime options to support a wide variety of workloads. Examples include native Linux containers, "sandboxed" containers, and Windows containers.
  • As a cluster operator, I want to provide stable rolling upgrades of runtimes, for example when rolling out an update with backwards-incompatible changes or previously unsupported features.
  • As an application developer, I want to select the runtime that best fits my workload.
  • As an application developer, I don't want to study the nitty-gritty details of different runtime implementations, but rather choose from pre-configured classes.
  • As an application developer, I want my application to be portable across clusters that use similar but different variants of a "class" of runtimes.

Proposal

The initial design includes:

  • RuntimeClass API resource definition
  • RuntimeClass pod field for specifying the RuntimeClass the pod should be run with
  • Kubelet implementation for fetching & interpreting the RuntimeClass
  • CRI API & implementation for passing along the RuntimeHandler.

API

RuntimeClass is a new cluster-scoped resource in the node.k8s.io API group.

The node.k8s.io API group would eventually hold the Node resource when core is retired. Alternatives considered: runtime.k8s.io, cluster.k8s.io

(This is a simplified declaration, syntactic details will be covered in the API PR review)

type RuntimeClass struct {
    metav1.TypeMeta
    // ObjectMeta minimally includes the RuntimeClass name, which is used to reference the class.
    // Namespace should be left blank.
    metav1.ObjectMeta

    Spec RuntimeClassSpec
}

type RuntimeClassSpec struct {
    // RuntimeHandler specifies the underlying runtime the CRI calls to handle pod and/or container
    // creation. The possible values are specific to a given configuration & CRI implementation.
    // The empty string is equivalent to the default behavior.
    // +optional
    RuntimeHandler string
}

A pod selects its runtime by specifying a RuntimeClass in its PodSpec. Once the pod is scheduled, the RuntimeClass cannot be changed.

type PodSpec struct {
    ...
    // RuntimeClassName refers to a RuntimeClass object with the same name,
    // which should be used to run this pod.
    // +optional
    RuntimeClassName string
    ...
}

The legacy RuntimeClass name is reserved. The legacy RuntimeClass is defined to be fully backwards compatible with current Kubernetes. This means that the legacy runtime does not specify any RuntimeHandler or perform any feature validation (all features are "supported").

const (
    // RuntimeClassNameLegacy is a reserved RuntimeClass name. The legacy
    // RuntimeClass does not specify a runtime handler or perform any
    // feature validation.
    RuntimeClassNameLegacy = "legacy"
)

An unspecified RuntimeClassName "" is equivalent to the legacy RuntimeClass, though the field is not defaulted to legacy (to leave room for configurable defaults in a future update).
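
To make these semantics concrete, here is a minimal, illustrative sketch of the name-to-handler resolution. The runtimeClassLister interface and resolveRuntimeHandler helper are assumptions for illustration only, not part of the proposed API:

// Illustrative only: how a pod's RuntimeClassName might resolve to a CRI
// runtime handler under the rules above.
type runtimeClassLister interface {
    // Get returns the RuntimeClass with the given name, or an error
    // (e.g. not found).
    Get(name string) (*RuntimeClass, error)
}

func resolveRuntimeHandler(runtimeClassName string, lister runtimeClassLister) (string, error) {
    // Unspecified ("") and the reserved "legacy" name both mean: no
    // handler, i.e. the CRI implementation's default behavior.
    if runtimeClassName == "" || runtimeClassName == RuntimeClassNameLegacy {
        return "", nil
    }
    rc, err := lister.Get(runtimeClassName)
    if err != nil {
        return "", err
    }
    return rc.Spec.RuntimeHandler, nil
}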

Examples

Suppose we operate a cluster that lets users choose between native runc containers and gVisor or Kata Containers sandboxes. We might create the following runtime classes:

kind: RuntimeClass
apiVersion: node.k8s.io/v1alpha1
metadata:
  name: native  # equivalent to 'legacy' for now
spec:
  runtimeHandler: runc
---
kind: RuntimeClass
apiVersion: node.k8s.io/v1alpha1
metadata:
  name: gvisor
spec:
  runtimeHandler: gvisor
---
kind: RuntimeClass
apiVersion: node.k8s.io/v1alpha1
metadata:
  name: kata-containers
spec:
  runtimeHandler: kata-containers
---
# Provides the default sandbox runtime when users don't care which
# sandbox they're getting.
kind: RuntimeClass
apiVersion: node.k8s.io/v1alpha1
metadata:
  name: sandboxed
spec:
  runtimeHandler: gvisor

Then when a user creates a workload, they can choose the desired runtime class to use (or not, if they want the default).

apiVersion: extensions/v1beta1
kind: Deployment
metadata:
  name: sandboxed-nginx
spec:
  replicas: 2
  selector:
    matchLabels:
      app: sandboxed-nginx
  template:
    metadata:
      labels:
        app: sandboxed-nginx
    spec:
      runtimeClassName: sandboxed   #   <----  Reference the desired RuntimeClass
      containers:
      - name: nginx
        image: nginx
        ports:
        - containerPort: 80
          protocol: TCP

Runtime Handler

The RuntimeHandler is passed to the CRI as part of the RunPodSandboxRequest:

message RunPodSandboxRequest {
    // Configuration for creating a PodSandbox.
    PodSandboxConfig config = 1;
    // Named runtime configuration to use for this PodSandbox.
    string runtime_handler = 2;
}

The RuntimeHandler is provided as a mechanism for CRI implementations to select between different predetermined configurations. The initial use case is replacing the experimental pod annotations currently used for selecting a sandboxed runtime by various CRI implementations:

CRI Runtime   Pod Annotation
-----------   --------------
CRI-O         io.kubernetes.cri-o.TrustedSandbox: "false"
containerd    io.kubernetes.cri.untrusted-workload: "true"
frakti        runtime.frakti.alpha.kubernetes.io/OSContainer: "true"
              runtime.frakti.alpha.kubernetes.io/Unikernel: "true"
Windows       experimental.windows.kubernetes.io/isolation-type: "hyperv"

These implementations could keep a binary scheme ("trusted" and "untrusted"), but the preferred approach is a non-binary one wherein arbitrary handlers can be configured with a name that can be matched against the specified RuntimeHandler. For example, containerd might have a configuration corresponding to a "kata-runtime" handler:

[plugins.cri.containerd.kata-runtime]
    runtime_type = "io.containerd.runtime.v1.linux"
    runtime_engine = "/opt/kata/bin/kata-runtime"
    runtime_root = ""

This non-binary approach is more flexible: it can still map to a binary RuntimeClass selection (e.g. sandboxed or untrusted RuntimeClasses), but can also support multiple parallel sandbox types (e.g. kata-containers or gvisor RuntimeClasses).

Versioning, Updates, and Rollouts

Getting upgrades and rollouts right is a very nuanced and complicated problem. For the initial alpha implementation, we will kick the can down the road by making the RuntimeClassSpec immutable, thereby requiring changes to be pushed as a newly named RuntimeClass instance. This means that pods must be updated to reference the new RuntimeClass, and comes with the advantage of native support for rolling updates through the same mechanisms as any other application update. The RuntimeClassName pod field is also immutable post scheduling.

This conservative approach is preferred since it's much easier to relax constraints in a backwards compatible way than tighten them. We should revisit this decision prior to graduating RuntimeClass to beta.

Implementation Details

The Kubelet uses an Informer to keep a local cache of all RuntimeClass objects. When a new pod is added, the Kubelet resolves the Pod's RuntimeClass against the local RuntimeClass cache. Once resolved, the RuntimeHandler field is passed to the CRI as part of the RunPodSandboxRequest. At that point, the interpretation of the RuntimeHandler is left to the CRI implementation, which should cache it if needed for subsequent calls.
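
As a rough sketch of this flow, reusing the resolveRuntimeHandler helper from above (illustrative only: the generatePodSandboxConfig helper is hypothetical, and runtimeapi stands in for the generated CRI client with the new RuntimeHandler field):

// Illustrative only: plumbing the resolved handler into the CRI call.
func runPodSandbox(ctx context.Context, pod *v1.Pod, lister runtimeClassLister, client runtimeapi.RuntimeServiceClient) error {
    handler, err := resolveRuntimeHandler(pod.Spec.RuntimeClassName, lister)
    if err != nil {
        // Unresolvable at this point: the Kubelet fails the pod (see below).
        return err
    }
    config := generatePodSandboxConfig(pod) // hypothetical helper
    _, err = client.RunPodSandbox(ctx, &runtimeapi.RunPodSandboxRequest{
        Config:         config,
        RuntimeHandler: handler,
    })
    return err
}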

If the RuntimeClass cannot be resolved (e.g. doesn't exist) at Pod creation, then the request will be rejected in admission (controller to be detailed in a following update). If the RuntimeClass cannot be resolved by the Kubelet when RunPodSandbox should be called, then the Kubelet will fail the Pod. The admission check on a replica recreation will prevent the scheduler from thrashing. If the RuntimeHandler is not recognized by the CRI implementation, then RunPodSandbox will return an error.
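
A minimal sketch of that admission check, under the same illustrative assumptions as above (the actual admission controller will be detailed in a follow-up):

// Illustrative only: reject pod creation when the referenced RuntimeClass
// cannot be resolved. "" and "legacy" never require resolution.
func admitRuntimeClass(pod *v1.Pod, lister runtimeClassLister) error {
    name := pod.Spec.RuntimeClassName
    if name == "" || name == RuntimeClassNameLegacy {
        return nil
    }
    if _, err := lister.Get(name); err != nil {
        return fmt.Errorf("cannot resolve RuntimeClass %q: %v", name, err)
    }
    return nil
}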

Risks and Mitigations

Scope creep. RuntimeClass has a fairly broad charter, but it should not become a default dumping ground for every new feature exposed by the node. For each feature, careful consideration should be made about whether it belongs on the Pod, Node, RuntimeClass, or some other resource. The non-goals should be kept in mind when considering RuntimeClass features.

Becoming a general policy mechanism. RuntimeClass should not be used as a replacement for PodSecurityPolicy. The use cases for defining multiple RuntimeClasses for the same underlying runtime implementation should be extremely limited (generally only around updates & rollouts). To enforce this, no authorization or restrictions are placed directly on RuntimeClass use; in order to restrict a user to a specific RuntimeClass, you must use another policy mechanism such as PodSecurityPolicy.

Pushing complexity to the user. RuntimeClass is a new resource in order to hide the complexity of runtime configuration from most users (aside from the cluster admin or provisioner). However, we are still side-stepping the issue of precisely defining specific types of runtimes like "sandboxed", and it is still up for debate whether precisely defining such runtime categories is even possible. RuntimeClass allows us to decouple this specification from the implementation, but it is still something I hope we can address in a future iteration through the concept of pre-defined or "conformant" RuntimeClasses.

Non-portability. We are already in a world of non-portability for many features (see the appendix's examples of runtime variation). Future improvements to RuntimeClass can help address this issue by formally declaring supported features, or by automatically matching a workload to a runtime that supports it. Another issue is that pods need to refer to a RuntimeClass by name, which may not be defined in every cluster. This is something that can be addressed through pre-defined runtime classes (see previous risk), and/or by "fitting" pod requirements to compatible RuntimeClasses.

Graduation Criteria

Alpha:

  • Everything described in the current proposal:
    • Introduce the RuntimeClass API resource
    • Add a RuntimeClassName field to the PodSpec
    • Add a RuntimeHandler field to the CRI RunPodSandboxRequest
    • Lookup the RuntimeClass for pods & plumb through the RuntimeHandler in the Kubelet (feature gated)
  • RuntimeClass support in at least one CRI runtime & dockershim
    • Runtime Handlers can be statically configured by the runtime, and referenced via RuntimeClass
    • An error is reported when the handler is unknown or unsupported
  • Testing

Beta:

  • Most runtimes support RuntimeClass, and the current untrusted annotations are deprecated.
  • RuntimeClasses are configured in the E2E environment with test coverage of a non-legacy RuntimeClass
  • The update & upgrade story is revisited, and a longer-term approach is implemented as necessary.
  • The cluster admin can choose which RuntimeClass is the default in a cluster.
  • Additional requirements TBD

Implementation History

  • 2018-06-11: SIG-Node decision to move forward with proposal
  • 2018-06-19: Initial KEP published.

Appendix

Examples of runtime variation

  • Linux Security Module (LSM) choice - Kubernetes supports both AppArmor & SELinux options on pods, but those are mutually exclusive, and support of either is not required by the runtime. The default configuration is also not well defined.
  • Seccomp-bpf - Kubernetes has alpha support for specifying a seccomp profile, but the default is defined by the runtime, and support is not guaranteed.
  • Windows containers - isolation features are very OS-specific, and most of the current features are limited to Linux. As we build out Windows container support, we'll need to add Windows-specific features as well.
  • Host namespaces (network, PID, IPC) may not be supported by virtualization-based runtimes (e.g. Kata Containers & gVisor).
  • Per-pod and Per-container resource overhead varies by runtime.
  • Device support (e.g. GPUs) varies wildly by runtime & nodes.
  • Supported volume types vary by node - it remains TBD whether this information belongs in RuntimeClass.
  • The list of default capabilities is defined in Docker, but not Kubernetes. Future runtimes may have differing defaults, or support a subset of capabilities.
  • Privileged mode is not well defined, and thus may have differing implementations.
  • Support for resource over-commit and dynamic resource sizing (e.g. Burstable vs Guaranteed workloads)