e0cf343 Feb 23, 2017
@michelleN @wanghaoran1988

Note: this is a design doc, which describes features that have not been completely implemented. User documentation of the current state is here. The tracking issue for implementation of this model is #168. Currently, both limits and requests of memory and cpu on containers (not pods) are supported. "memory" is in bytes and "cpu" is in milli-cores.

The Kubernetes resource model

To do good pod placement, Kubernetes needs to know how big pods are, as well as the sizes of the nodes onto which they are being placed. The definition of "how big" is given by the Kubernetes resource model — the subject of this document.

The resource model aims to be:

  • simple, for common cases;
  • extensible, to accommodate future growth;
  • regular, with few special cases; and
  • precise, to avoid misunderstandings and promote pod portability.

The resource model

A Kubernetes resource is something that can be requested by, allocated to, or consumed by a pod or container. Examples include memory (RAM), CPU, disk-time, and network bandwidth.

Once resources on a node have been allocated to one pod, they should not be allocated to another until that pod is removed or exits. This means that Kubernetes schedulers should ensure that the sum of the resources allocated (requested and granted) to the pods on a node never exceeds the usable capacity of that node. Testing whether a pod will fit on a node is called feasibility checking.
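As a rough illustration of feasibility checking, the sketch below sums what is already allocated on a node and tests whether one more request still fits. The function name `fits` and the dictionary layout are assumptions made for this example, not part of any Kubernetes API; quantities are plain integers in internal units.

```python
from typing import Dict, List

# Hypothetical layout: typename -> quantity in internal integer units.
Resources = Dict[str, int]


def fits(node_capacity: Resources, allocated: List[Resources], request: Resources) -> bool:
    """Return True if `request` fits on the node without over-committing."""
    for typename, wanted in request.items():
        in_use = sum(pod.get(typename, 0) for pod in allocated)
        # A resource type the node does not advertise has capacity 0.
        if in_use + wanted > node_capacity.get(typename, 0):
            return False
    return True


node = {"cpu": 4000, "memory": 8 * 1024**3}            # 4 CPUs in milli-units, 8Gi
running = [
    {"cpu": 1500, "memory": 2 * 1024**3},
    {"cpu": 2000, "memory": 4 * 1024**3},
]

print(fits(node, running, {"cpu": 500, "memory": 1024**3}))   # True: exactly fills the CPU
print(fits(node, running, {"cpu": 600, "memory": 1024**3}))   # False: CPU would be over-committed
```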

Note that the resource model currently prohibits over-committing resources; we will want to relax that restriction later.

Resource types

All resources have a type that is identified by their typename (a string, e.g., "memory"). Several resource types are predefined by Kubernetes (a full list is below), although only two will be supported at first: CPU and memory. Users and system administrators can define their own resource types if they wish (e.g., Hadoop slots).

A fully-qualified resource typename is constructed from a DNS-style subdomain, followed by a slash /, followed by a name.

  • The subdomain must conform to RFC 1123 (e.g., kubernetes.io, example.com).
  • The name must be no more than 63 characters long, consisting of upper- or lower-case alphanumeric characters, with the -, _, and . characters allowed anywhere except the first or last character.
  • As a shorthand, any resource typename that does not start with a subdomain and a slash will automatically be prefixed with the built-in Kubernetes namespace, kubernetes.io/ in order to fully-qualify it. This namespace is reserved for code in the open source Kubernetes repository; as a result, all user typenames MUST be fully qualified, and cannot be created in this namespace.

Some example typenames include memory (which will be fully-qualified as kubernetes.io/memory), and example.com/Shiny_New-Resource.Type.
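The naming rules above can be sketched as a small validation helper. This is a minimal sketch under stated assumptions: `qualify` is a hypothetical name, the subdomain check is a simplified reading of RFC 1123 (lower-case labels, 253 characters overall), and the rule that users may not create names in kubernetes.io is not enforced here.

```python
import re

DEFAULT_NAMESPACE = "kubernetes.io"

# Name: 1-63 characters, alphanumeric at both ends, with -, _ and . allowed in between.
NAME_RE = re.compile(r"[A-Za-z0-9]([A-Za-z0-9._-]{0,61}[A-Za-z0-9])?")

# Simplified RFC 1123 subdomain: dot-separated labels of lower-case alphanumerics and '-'.
LABEL = r"[a-z0-9]([a-z0-9-]{0,61}[a-z0-9])?"
SUBDOMAIN_RE = re.compile(rf"{LABEL}(\.{LABEL})*")


def qualify(typename: str) -> str:
    """Return the fully-qualified form of a resource typename, or raise ValueError."""
    if "/" in typename:
        subdomain, name = typename.split("/", 1)
    else:
        # Shorthand form: prefix the built-in namespace.
        subdomain, name = DEFAULT_NAMESPACE, typename
    if len(subdomain) > 253 or not SUBDOMAIN_RE.fullmatch(subdomain):
        raise ValueError(f"invalid subdomain: {subdomain!r}")
    if not NAME_RE.fullmatch(name):
        raise ValueError(f"invalid resource name: {name!r}")
    return f"{subdomain}/{name}"


print(qualify("memory"))                                # kubernetes.io/memory
print(qualify("example.com/Shiny_New-Resource.Type"))   # unchanged
```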

For future reference, note that some resources, such as CPU and network bandwidth, are compressible, which means that their usage can potentially be throttled in a relatively benign manner. All other resources are incompressible, which means that any attempt to throttle them is likely to cause grief. This distinction will be important if a Kubernetes implementation supports over-committing of resources.

Resource quantities

Initially, all Kubernetes resource types are quantitative, and have an associated unit for quantities of the associated resource (e.g., bytes for memory, bytes per second for bandwidth, instances for software licences). The units will always be a resource type's natural base units (e.g., bytes, not MB), to avoid confusion between binary and decimal multipliers and the underlying unit multiplier (e.g., is memory measured in MiB, MB, or GB?).

Resource quantities can be added and subtracted: for example, a node has a fixed quantity of each resource type that can be allocated to pods/containers; once such an allocation has been made, the allocated resources cannot be made available to other pods/containers without over-committing the resources.

To make life easier for people, quantities can be represented externally as unadorned integers, or as fixed-point integers with one of these SI suffixes (E, P, T, G, M, K, m) or their power-of-two equivalents (Ei, Pi, Ti, Gi, Mi, Ki). For example, the following represent roughly the same value: 128974848, "129e6", "129M", "123Mi". Small quantities can be represented directly as decimals (e.g., 0.3), or using milli-units (e.g., "300m").

  • "Externally" means in user interfaces, reports, graphs, and in JSON or YAML resource specifications that might be generated or read by people.
  • Case is significant: "m" and "M" are not the same, so "k" is not a valid SI suffix. There are no power-of-two equivalents for SI suffixes that represent multipliers less than 1.
  • These conventions only apply to resource quantities, not arbitrary values.
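To make the suffix rules concrete, here is a minimal parsing sketch. The name `parse_quantity` and the use of Python's `decimal` module are choices made for this example only; the suffix tables follow the lists given in this document (so "K" is accepted and "k" is rejected), and the result is still in external units rather than scaled to an internal representation.

```python
from decimal import Decimal, InvalidOperation

# Suffix tables taken from the lists in this document; matching is case-sensitive.
BINARY_SUFFIXES = {"Ki": 2**10, "Mi": 2**20, "Gi": 2**30, "Ti": 2**40, "Pi": 2**50, "Ei": 2**60}
DECIMAL_SUFFIXES = {"m": Decimal("0.001"), "K": 10**3, "M": 10**6, "G": 10**9,
                    "T": 10**12, "P": 10**15, "E": 10**18}


def parse_quantity(text) -> Decimal:
    """Parse an external quantity such as 128974848, "129e6", "129M", "123Mi" or "300m"."""
    text = str(text).strip()
    multiplier = 1
    # Try the two-character binary suffixes before the one-character decimal ones.
    if text[-2:] in BINARY_SUFFIXES:
        multiplier, text = BINARY_SUFFIXES[text[-2:]], text[:-2]
    elif text[-1:] in DECIMAL_SUFFIXES:
        multiplier, text = DECIMAL_SUFFIXES[text[-1:]], text[:-1]
    try:
        number = Decimal(text)
    except InvalidOperation:
        raise ValueError(f"not a valid quantity: {text!r}") from None
    if not number.is_finite():
        raise ValueError(f"not a finite quantity: {text!r}")
    return number * multiplier


print(parse_quantity("123Mi"))   # 128974848
print(parse_quantity("129M"))    # 129000000
print(parse_quantity("300m") == parse_quantity(0.3))   # True
```

Note that "129M" and "123Mi" differ by about 25 KB, which is why the text above says "roughly the same value".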

Internally (i.e., everywhere else), Kubernetes will represent resource quantities as integers so it can avoid problems with rounding errors, and will not use strings to represent numeric values. To achieve this, quantities that naturally have fractional parts (e.g., CPU seconds/second) will be scaled to integral numbers of milli-units (e.g., milli-CPUs) as soon as they are read in. Internal APIs, data structures, and protobufs will use these scaled integer units. Raw measurement data such as usage may still need to be tracked and calculated using floating point values, but internally they should be rescaled to avoid some values being in milli-units and some not.

  • Note that reading in a resource quantity and writing it out again may change the way its values are represented, and truncate precision (e.g., 1.0001 may become 1.000), so comparison and difference operations (e.g., by an updater) must be done on the internal representations.
  • Avoiding milli-units in external representations has advantages for people who will use Kubernetes, but runs the risk of developers forgetting to rescale or accidentally using floating-point representations. On balance, that seems like the right trade-off. We will try to reduce the risk by providing libraries that automatically do the quantization for JSON/YAML inputs.
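The scaling step described above might look like the following sketch, which converts an external decimal value to an integral number of milli-units and back. The helper names `to_internal` and `to_external` are hypothetical, and truncation (rather than rounding) is assumed because the text mentions truncated precision.

```python
from decimal import Decimal, ROUND_DOWN


def to_internal(external: str, scale_exponent: int = 3) -> int:
    """Scale an external value by 10**scale_exponent and truncate to an integer."""
    scaled = Decimal(external).scaleb(scale_exponent)
    return int(scaled.to_integral_value(rounding=ROUND_DOWN))


def to_external(internal: int, scale_exponent: int = 3) -> str:
    """Render an internal integer back in external units."""
    return str(Decimal(internal).scaleb(-scale_exponent))


print(to_internal("2.5"))                    # 2500 milli-units
print(to_external(to_internal("1.0001")))    # 1.000 (precision is truncated)

# Floating point accumulates error; the scaled integers do not.
print(0.1 + 0.2 == 0.3)                                             # False
print(to_internal("0.1") + to_internal("0.2") == to_internal("0.3"))  # True
```

Comparing the integers returned by `to_internal`, rather than the external strings, is what the first bullet above asks of an updater.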

Resource specifications

Both users and a number of system components, such as schedulers, (horizontal) auto-scalers, (vertical) auto-sizers, load balancers, and worker-pool managers, need to reason about resource requirements of workloads, resource capacities of nodes, and resource usage. Kubernetes distinguishes between specifications of desired state, aka the Spec, and representations of current state, aka the Status. Resource requirements and total node capacity fall into the specification category, while resource usage, characterizations derived from usage (e.g., maximum usage, histograms), and other resource demand signals (e.g., CPU load) clearly fall into the status category and are discussed in the Appendix for now.

Resource requirements for a container or pod should have the following form:

resourceRequirementSpec: [
  request:   [ cpu: 2.5, memory: "40Mi" ],
  limit:     [ cpu: 4.0, memory: "99Mi" ],
]

Where:

  • request [optional]: the amount of resources being requested, or that were requested and have been allocated. Scheduler algorithms will use these quantities to test feasibility (whether a pod will fit onto a node). If a container (or pod) tries to use more resources than its request, any associated SLOs are voided — e.g., the program it is running may be throttled (compressible resource types), or the attempt may be denied. If request is omitted for a container, it defaults to limit if that is explicitly specified, otherwise to an implementation-defined value; this will always be 0 for a user-defined resource type. If request is omitted for a pod, it defaults to the sum of the (explicit or implicit) request values for the containers it encloses.

  • limit [optional]: an upper bound or cap on the maximum amount of resources that will be made available to a container or pod; if a container or pod uses more resources than its limit, it may be terminated. The limit defaults to "unbounded"; in practice, this probably means the capacity of an enclosing container, pod, or node, but may result in non-deterministic behavior, especially for memory.
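The defaulting rules for request can be sketched as follows. The function names and the values in `IMPLEMENTATION_DEFAULT_REQUEST` are invented for illustration (the document leaves them implementation-defined); quantities are integers in internal units, and `None` stands for "omitted".

```python
from typing import Dict, List, Optional

# Hypothetical implementation-defined defaults; the real values are not specified here.
IMPLEMENTATION_DEFAULT_REQUEST: Dict[str, int] = {
    "kubernetes.io/cpu": 100,              # milli-units
    "kubernetes.io/memory": 64 * 2**20,    # bytes
}


def container_request(typename: str, request: Optional[int], limit: Optional[int]) -> int:
    """Effective request for one resource type of one container."""
    if request is not None:
        return request
    if limit is not None:
        return limit  # an omitted request defaults to an explicit limit
    # Otherwise fall back to the implementation default, which is 0 for user-defined types.
    return IMPLEMENTATION_DEFAULT_REQUEST.get(typename, 0)


def pod_request(typename: str, request: Optional[int], containers: List[dict]) -> int:
    """Effective request for a pod: explicit value, else the sum over its containers."""
    if request is not None:
        return request
    return sum(container_request(typename, c.get("request"), c.get("limit")) for c in containers)


containers = [
    {"request": 2500, "limit": 4000},   # explicit request wins
    {"limit": 1000},                    # request defaults to limit
    {},                                 # request defaults to the implementation value
]
print(pod_request("kubernetes.io/cpu", None, containers))      # 3600
print(pod_request("example.com/hadoop-slot", None, [{}, {}]))  # 0
```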

Total capacity for a node should have a similar structure:

resourceCapacitySpec: [
  total:     [ cpu: 12,  memory: "128Gi" ]
]

Where:

  • total: the total allocatable resources of a node. Initially, the resources at a given scope will bound the sum of the resources of its inner scopes.

Notes

  • It is an error to specify the same resource type more than once in each list.

  • It is an error for the request or limit values for a pod to be less than the sum of the (explicit or defaulted) values for the containers it encloses. (We may relax this later.)

  • If multiple pods are running on the same node and attempting to use more resources than they have requested, the result is implementation-defined. For example: unallocated or unused resources might be spread equally across claimants, or the assignment might be weighted by the size of the original request, or as a function of limits, or priority, or the phase of the moon, perhaps modulated by the direction of the tide. Thus, although it's not mandatory to provide a request, it's probably a good idea. (Note that the request could be filled in by an automated system that is observing actual usage and/or historical data.)

  • Internally, the Kubernetes master can decide the defaulting behavior and the kubelet implementation may expect an absolute specification. For example, if the master decided that "the default is unbounded" it would pass 2^64 to the kubelet.
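The two error conditions in the notes above (a duplicated resource type, and a pod value smaller than the sum of its containers) could be checked along these lines. This is a sketch only: the helper names are hypothetical, and each list is modelled as a sequence of (typename, value) pairs so that duplicates are representable.

```python
from collections import Counter
from typing import Dict, List, Tuple


def to_map(entries: List[Tuple[str, int]]) -> Dict[str, int]:
    """Convert a request/limit list to a map, rejecting repeated resource types."""
    counts = Counter(name for name, _ in entries)
    duplicates = sorted(name for name, n in counts.items() if n > 1)
    if duplicates:
        raise ValueError(f"resource type specified more than once: {duplicates}")
    return dict(entries)


def check_pod_covers_containers(pod: Dict[str, int], containers: List[Dict[str, int]]) -> None:
    """Raise ValueError if a pod-level value is less than the sum over its containers."""
    for typename, pod_value in pod.items():
        total = sum(c.get(typename, 0) for c in containers)
        if pod_value < total:
            raise ValueError(f"pod {typename}={pod_value} is less than container sum {total}")


pod = to_map([("cpu", 4000), ("memory", 99 * 2**20)])
containers = [to_map([("cpu", 2500)]), to_map([("cpu", 1500), ("memory", 40 * 2**20)])]
check_pod_covers_containers(pod, containers)   # passes: 4000 >= 2500 + 1500
```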

Kubernetes-defined resource types

The following resource types are predefined ("reserved") by Kubernetes in the kubernetes.io namespace, and so cannot be used for user-defined resources. Note that the syntax of all resource types in the resource spec is deliberately similar, but some resource types (e.g., CPU) may receive significantly more support than simply tracking quantities in the schedulers and/or the Kubelet.

Processor cycles

  • Name: cpu (or kubernetes.io/cpu)
  • Units: Kubernetes Compute Unit seconds/second (i.e., CPU cores normalized to a canonical "Kubernetes CPU")
  • Internal representation: milli-KCUs
  • Compressible? yes
  • Qualities: this is a placeholder for the kind of thing that may be supported in the future — see #147
    • [future] schedulingLatency: as per lmctfy
    • [future] cpuConversionFactor: property of a node: the speed of a CPU core on the node's processor divided by the speed of the canonical Kubernetes CPU (a floating point value; default = 1.0).

To reduce performance portability problems for pods, and to avoid worst-case provisioning behavior, the units of CPU will be normalized to a canonical "Kubernetes Compute Unit" (KCU, pronounced ˈko͝oko͞o), which will roughly be equivalent to a single CPU hyperthreaded core for some recent x86 processor. The normalization may be implementation-defined, although some reasonable defaults will be provided in the open-source Kubernetes code.

Note that requesting 2 KCU won't guarantee that precisely 2 physical cores will be allocated — control of aspects like this will be handled by resource qualities (a future feature).
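One possible reading of the normalization, using the [future] cpuConversionFactor described above (node core speed divided by canonical core speed), is sketched below. Both function names and the rounding directions are assumptions for this example; the document leaves the actual normalization implementation-defined.

```python
import math


def node_capacity_milli_kcu(physical_cores: int, cpu_conversion_factor: float = 1.0) -> int:
    """Node CPU capacity in milli-KCUs, rounded down so capacity is never overstated."""
    return math.floor(physical_cores * 1000 * cpu_conversion_factor)


def node_milli_cores_needed(milli_kcu: int, cpu_conversion_factor: float = 1.0) -> int:
    """Milli-cores of this node's CPU needed to supply a request, rounded up."""
    return math.ceil(milli_kcu / cpu_conversion_factor)


# A node whose cores are 1.5x the canonical speed offers more KCUs per core...
print(node_capacity_milli_kcu(8, 1.5))     # 12000
# ...and needs fewer of its own cores to satisfy a 2 KCU request.
print(node_milli_cores_needed(2000, 1.5))  # 1334
```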

Memory

  • Name: memory (or kubernetes.io/memory)
  • Units: bytes
  • Compressible? no (at least initially)

The precise meaning of "memory" is implementation-dependent, but the basic idea is to rely on the underlying memcg mechanisms, support, and definitions.

Note that most people will want to use power-of-two suffixes (Mi, Gi) for memory quantities rather than decimal ones: "64Mi" rather than "64M".

Resource metadata

A resource type may have an associated read-only ResourceType structure, that contains metadata about the type. For example:

resourceTypes: [
  "kubernetes.io/memory": [
    isCompressible: false, ... 
  ]
  "kubernetes.io/cpu": [
    isCompressible: true,
    internalScaleExponent: 3, ...
  ]
  "kubernetes.io/disk-space": [ ... ]
]

Kubernetes will provide ResourceType metadata for its predefined types. If no resource metadata can be found for a resource type, Kubernetes will assume that it is a quantified, incompressible resource that is not specified in milli-units, and has no default value.

The defined properties are as follows:

| field name | type | contents |
| --- | --- | --- |
| name | string, required | the typename, as a fully-qualified string (e.g., kubernetes.io/cpu) |
| internalScaleExponent | int, default=0 | external values are multiplied by 10 to this power for internal storage (e.g., 3 for milli-units) |
| units | string, required | format: unit* [per unit+] (e.g., second, byte per second). An empty unit field means "dimensionless". |
| isCompressible | bool, default=false | true if the resource type is compressible |
| defaultRequest | string, default=none | in the same format as a user-supplied value |
| [future] quantization | number, default=1 | smallest granularity of allocation: requests may be rounded up to a multiple of this unit; implementation-defined unit (e.g., the page size for RAM). |
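The metadata fields and their defaults can be modelled with a small record type, together with the fallback used when no metadata is found. This is an illustrative sketch: the Python names, the `REGISTRY` contents, and the `units` strings chosen here are assumptions, not values defined by Kubernetes.

```python
from dataclasses import dataclass
from typing import Dict, Optional


@dataclass(frozen=True)  # frozen: the structure is described as read-only
class ResourceType:
    name: str                              # fully-qualified typename
    units: str                             # "" means dimensionless
    internal_scale_exponent: int = 0       # 3 means stored in milli-units
    is_compressible: bool = False
    default_request: Optional[str] = None  # same format as a user-supplied value


# Hypothetical registry of predefined types; the unit strings are illustrative.
REGISTRY: Dict[str, ResourceType] = {
    "kubernetes.io/cpu": ResourceType(
        name="kubernetes.io/cpu", units="second per second",
        internal_scale_exponent=3, is_compressible=True),
    "kubernetes.io/memory": ResourceType(name="kubernetes.io/memory", units="byte"),
}


def lookup(typename: str) -> ResourceType:
    """Return metadata, falling back to: incompressible, not in milli-units, no default."""
    return REGISTRY.get(typename, ResourceType(name=typename, units=""))


print(lookup("kubernetes.io/cpu").internal_scale_exponent)     # 3
print(lookup("example.com/hadoop-slot").is_compressible)       # False
```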

Appendix: future extensions

The following are planned future extensions to the resource model, included here to encourage comments.

Usage data

Because resource usage and related metrics change continuously, need to be tracked over time (i.e., historically), can be characterized in a variety of ways, and are fairly voluminous, we will not include usage in core API objects, such as Pods and Nodes, but will provide separate APIs for accessing and managing that data. Possible representations of usage data are sketched in this appendix, but the representation we'll use is TBD.

Singleton values for observed and predicted future usage will rapidly prove inadequate, so we will support the following structure for extended usage information:

resourceStatus: [
  usage:     [ cpu: <CPU-info>, memory: <memory-info> ],
  maxusage:  [ cpu: <CPU-info>, memory: <memory-info> ],
  predicted: [ cpu: <CPU-info>, memory: <memory-info> ],
]

where a <CPU-info> or <memory-info> structure looks like this:

{
    mean: <value>    # arithmetic mean
    max: <value>     # maximum value
    min: <value>     # minimum value
    count: <value>   # number of data points
    percentiles: [   # map from %iles to values
      "10": <10th-percentile-value>,
      "50": <median-value>,
      "99": <99th-percentile-value>,
      "99.9": <99.9th-percentile-value>,
      ...
    ]
}

All parts of this structure are optional, although we strongly encourage including quantities for 50, 90, 95, 99, 99.5, and 99.9 percentiles. [In practice, it will be important to include additional info such as the length of the time window over which the averages are calculated, the confidence level, and information-quality metrics such as the number of dropped or discarded data points.]
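As an illustration of how such a structure could be filled in from raw samples, the sketch below computes the summary fields with a nearest-rank percentile. The function names are hypothetical and the nearest-rank method is one choice among several; the document does not prescribe how percentiles are calculated.

```python
import math
from typing import Dict, Sequence, Union

Number = Union[int, float]


def nearest_rank(ordered: Sequence[Number], pct: float) -> Number:
    """Nearest-rank percentile of an already sorted, non-empty sequence."""
    rank = max(1, math.ceil(pct * len(ordered) / 100))
    return ordered[rank - 1]


def summarize(samples: Sequence[Number],
              percentiles: Sequence[float] = (50, 90, 95, 99, 99.5, 99.9)) -> Dict[str, object]:
    """Build a <CPU-info>/<memory-info>-style summary from usage samples."""
    if not samples:
        raise ValueError("at least one sample is required")
    ordered = sorted(samples)
    return {
        "mean": sum(ordered) / len(ordered),
        "max": ordered[-1],
        "min": ordered[0],
        "count": len(ordered),
        "percentiles": {str(p): nearest_rank(ordered, p) for p in percentiles},
    }


# Example: 100 CPU usage samples in milli-units, 1 through 100.
info = summarize(list(range(1, 101)))
print(info["mean"], info["count"])     # 50.5 100
print(info["percentiles"]["99"])       # 99
```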

Future resource types

[future] Network bandwidth

  • Name: "network-bandwidth" (or kubernetes.io/network-bandwidth)
  • Units: bytes per second
  • Compressible? yes

[future] Network operations

  • Name: "network-iops" (or kubernetes.io/network-iops)
  • Units: operations (messages) per second
  • Compressible? yes

[future] Storage space

  • Name: "storage-space" (or kubernetes.io/storage-space)
  • Units: bytes
  • Compressible? no

The amount of secondary storage space available to a container. The main target is local disk drives and SSDs, although this could also be used to qualify remotely-mounted volumes. Specifying whether a resource is a raw disk, an SSD, a disk array, or a file system fronting any of these, is left for future work.

[future] Storage time

  • Name: storage-time (or kubernetes.io/storage-time)
  • Units: seconds per second of disk time
  • Internal representation: milli-units
  • Compressible? yes

This is the amount of time a container spends accessing disk, including actuator and transfer time. A standard disk drive provides 1.0 diskTime seconds per second.

[future] Storage operations

  • Name: "storage-iops" (or kubernetes.io/storage-iops)
  • Units: operations per second
  • Compressible? yes
