HPA Algorithm Information Improvements (#9780)
* Update HPA docs with more algorithm details

The HPA docs pointed to an out-of-date document for information on the
algorithm details, which users were finding confusing.  This sticks a
section on the algorithm in the HPA docs instead, documenting both
general behavior and corner cases.

* Add glossary info, HPA docs on quantities

People often ask about the quantity notation when working with the
metrics APIs, so this adds a glossary entry on quantities (since they're
used elsewhere in the system), and a short explanation in the HPA walkthrough.

* Information about HPA readiness and stabilization

This adds information about the new changes to HPA readiness and
stabilization from kubernetes/enhancements#591, and other minor changes that
landed in Kubernetes 1.12.

* Update horizontal-pod-autoscale.md
DirectXMan12 authored and k8s-ci-robot committed Sep 12, 2018
1 parent e1e6555 commit 93c0bb9
Showing 3 changed files with 126 additions and 8 deletions.
30 changes: 30 additions & 0 deletions content/en/docs/reference/glossary/quantity.md
@@ -0,0 +1,30 @@
---
title: Quantity
id: quantity
date: 2018-08-07
full_link:
short_description: >
A whole-number representation of small or large numbers using SI suffixes.
aka:
tags:
---
A whole-number representation of small or large numbers using SI suffixes.

<!--more-->

Quantities are representations of small or large numbers using a compact,
whole-number notation with SI suffixes. Fractional numbers are represented
using milli-units, while large numbers can be represented using kilo-units,
mega-units, giga-units, etc.

For instance, the number `1.5` is represented as `1500m`, while the number `1000`
can be represented as `1k`, and `1000000` as `1M`. You can also specify
binary-notation suffixes; the number `2048` can be written as `2Ki`.

The accepted decimal (power-of-10) units are `m` (milli), `k` (kilo,
intentionally lowercase), `M` (mega), `G` (giga), `T` (tera), `P` (peta),
`E` (exa).

The accepted binary (power-of-2) units are `Ki` (kibi), `Mi` (mebi), `Gi` (gibi),
`Ti` (tebi), `Pi` (pebi), `Ei` (exbi).
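
The suffix arithmetic can be sketched in Python (a simplified parser for illustration only; Kubernetes' actual `resource.Quantity` type in `k8s.io/apimachinery` also handles scientific notation and canonical forms):

```python
# Simplified quantity parser; an illustration, not the Kubernetes implementation.
SUFFIXES = {
    "m": 10**-3, "k": 10**3, "M": 10**6, "G": 10**9,
    "T": 10**12, "P": 10**15, "E": 10**18,
    "Ki": 2**10, "Mi": 2**20, "Gi": 2**30,
    "Ti": 2**40, "Pi": 2**50, "Ei": 2**60,
}

def parse_quantity(quantity: str) -> float:
    """Convert a quantity string such as "1500m" or "2Ki" to a float."""
    # Check longer (binary) suffixes before single-letter decimal ones.
    for suffix in sorted(SUFFIXES, key=len, reverse=True):
        if quantity.endswith(suffix):
            return float(quantity[: -len(suffix)]) * SUFFIXES[suffix]
    return float(quantity)
```

For example, `parse_quantity("1500m")` yields `1.5`, and `parse_quantity("2Ki")` yields `2048.0`.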
@@ -432,6 +432,16 @@ was capped by the maximum or minimum of the HorizontalPodAutoscaler. This is an
you may wish to raise or lower the minimum or maximum replica count constraints on your
HorizontalPodAutoscaler.

## Appendix: Quantities

All metrics in the HorizontalPodAutoscaler and metrics APIs are specified using
a special whole-number notation known in Kubernetes as a *quantity*. For example,
the quantity `10500m` would be written as `10.5` in decimal notation. The metrics APIs
will return whole numbers without a suffix when possible, and will generally return
quantities in milli-units otherwise. This means you might see your metric value fluctuate
between `1` and `1500m`, or `1` and `1.5` when written in decimal notation. See the
[glossary entry on quantities](/docs/reference/glossary/quantity.md) for more information.

## Appendix: Other possible scenarios

### Creating the autoscaler declaratively
94 changes: 86 additions & 8 deletions content/en/docs/tasks/run-application/horizontal-pod-autoscale.md
@@ -50,9 +50,10 @@ or the custom metrics API (for all other metrics).
the number of desired replicas.

Please note that if some of the pod's containers do not have the relevant resource request set,
CPU utilization for the pod will not be defined and the autoscaler will not take any action
for that metric. See the [autoscaling algorithm design document](https://git.k8s.io/community/contributors/design-proposals/autoscaling/horizontal-pod-autoscaler.md#autoscaling-algorithm) for further
details about how the autoscaling algorithm works.
CPU utilization for the pod will not be defined and the autoscaler will
not take any action for that metric. See the [algorithm
details](#algorithm-details) section below for more information about
how the autoscaling algorithm works.

* For per-pod custom metrics, the controller functions similarly to per-pod resource metrics,
except that it works with raw values, not utilization values.
@@ -81,6 +82,85 @@ by using the scale sub-resource. Scale is an interface that allows you to dynami
each of their current states. More details on scale sub-resource can be found
[here](https://git.k8s.io/community/contributors/design-proposals/autoscaling/horizontal-pod-autoscaler.md#scale-subresource).

### Algorithm Details

From the most basic perspective, the Horizontal Pod Autoscaler controller
operates on the ratio between desired metric value and current metric
value:

```
desiredReplicas = ceil[currentReplicas * ( currentMetricValue / desiredMetricValue )]
```

For example, if the current metric value is `200m`, and the desired value
is `100m`, the number of replicas will be doubled, since `200.0 / 100.0 ==
2.0`. If the current value is instead `50m`, we'll halve the number of
replicas, since `50.0 / 100.0 == 0.5`. We'll skip scaling if the ratio is
sufficiently close to 1.0 (within a globally-configurable tolerance, set
by the `--horizontal-pod-autoscaler-tolerance` flag, which defaults to 0.1).
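
The ratio computation and tolerance check described above can be sketched in a few lines of Python (function and parameter names are illustrative, not taken from the controller source):

```python
from math import ceil

def desired_replicas(current_replicas: int, current_metric: float,
                     desired_metric: float, tolerance: float = 0.1) -> int:
    """Sketch of the basic HPA scaling rule with the tolerance band."""
    ratio = current_metric / desired_metric
    # Skip scaling when the ratio is sufficiently close to 1.0.
    if abs(ratio - 1.0) <= tolerance:
        return current_replicas
    return ceil(current_replicas * ratio)
```

With 4 replicas and a `200m` current value against a `100m` target, the ratio is 2.0 and the sketch returns 8 replicas; at `105m` the ratio falls within the default tolerance and the replica count is unchanged.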

When a `targetAverageValue` or `targetAverageUtilization` is specified,
the `currentMetricValue` is computed by taking the average of the given
metric across all Pods in the HorizontalPodAutoscaler's scale target.
However, before checking the tolerance and deciding on the final values,
we take pod readiness and missing metrics into consideration.

All Pods with a deletion timestamp set (i.e. Pods in the process of being
shut down) and all failed Pods are discarded.

If a particular Pod is missing metrics, it is set aside for later; Pods
with missing metrics will be used to adjust the final scaling amount.

When scaling on CPU, if any pod has yet to become ready (i.e. it's still
initializing) *or* the most recent metric point for the pod was before it
became ready, that pod is set aside as well.

Due to technical constraints, the HorizontalPodAutoscaler controller
cannot exactly determine the first time a pod becomes ready when
determining whether to set aside certain CPU metrics. Instead, it
considers a Pod "not yet ready" if it's unready and transitioned to
unready within a short, configurable window of time since it started.
This value is configured with the `--horizontal-pod-autoscaler-initial-readiness-delay` flag, and its default is 30
seconds. Once a pod has become ready, it considers any transition to
ready to be the first if it occurred within a longer, configurable time
since it started. This value is configured with the `--horizontal-pod-autoscaler-cpu-initialization-period` flag, and its
default is 5 minutes.
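
The unready-pod heuristic might look roughly like this (a deliberate simplification with invented names; the real controller works from Pod conditions and their timestamps):

```python
def pod_not_yet_ready(is_ready: bool,
                      seconds_from_start_to_transition: float,
                      initial_readiness_delay: float = 30.0) -> bool:
    """A pod is treated as "not yet ready" if it is unready and its last
    readiness transition happened within the initial readiness delay
    after it started (simplified sketch of the heuristic above)."""
    return (not is_ready) and (
        seconds_from_start_to_transition < initial_readiness_delay)
```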

The `currentMetricValue / desiredMetricValue` base scale ratio is then
calculated using the remaining pods not set aside or discarded from above.

If there were any missing metrics, we recompute the average more
conservatively, assuming those pods were consuming 100% of the desired
value in case of a scale down, and 0% in case of a scale up. This dampens
the magnitude of any potential scale.
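
The conservative recomputation can be illustrated as follows (a sketch of the behavior described above, with invented names, not the controller's code):

```python
def dampened_usage_ratio(known_values, num_missing, desired_value):
    """Recompute the average assuming pods with missing metrics consume
    100% of the desired value on a scale down and 0% on a scale up."""
    raw_average = sum(known_values) / len(known_values)
    scaling_up = raw_average > desired_value
    assumed = 0.0 if scaling_up else desired_value
    total = sum(known_values) + num_missing * assumed
    average = total / (len(known_values) + num_missing)
    return average / desired_value
```

With two pods at `200m`, two pods missing metrics, and a `100m` target, the raw ratio of 2.0 is dampened to 1.0, suppressing the scale up.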

Furthermore, if any not-yet-ready pods were present, and we would have
scaled up without factoring in missing metrics or not-yet-ready pods, we
conservatively assume the not-yet-ready pods are consuming 0% of the
desired metric, further dampening the magnitude of a scale up.

After factoring in the not-yet-ready pods and missing metrics, we
recalculate the usage ratio. If the new ratio reverses the scale
direction, or is within the tolerance, we skip scaling. Otherwise, we use
the new ratio to scale.
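
The final decision step can be sketched as (illustrative names, not controller source):

```python
def should_scale(raw_ratio: float, dampened_ratio: float,
                 tolerance: float = 0.1) -> bool:
    """Skip scaling when the dampened ratio is within the tolerance band
    or reverses the direction implied by the raw ratio."""
    if abs(dampened_ratio - 1.0) <= tolerance:
        return False
    if (raw_ratio > 1.0) != (dampened_ratio > 1.0):
        return False
    return True
```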

Note that the *original* value for the average utilization is reported
back via the HorizontalPodAutoscaler status, without factoring in the
not-yet-ready pods or missing metrics, even when the new usage ratio is
used.

If multiple metrics are specified in a HorizontalPodAutoscaler, this
calculation is done for each metric, and then the largest of the desired
replica counts is chosen. If any of those metrics cannot be converted
into a desired replica count (e.g. due to an error fetching the metrics
from the metrics APIs), scaling is skipped.
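
Combining multiple metrics then reduces to a simple maximum (a sketch; here `None` stands in for a metric that could not be converted):

```python
def combine_desired_counts(desired_counts):
    """Pick the largest desired replica count across all metrics; skip
    scaling entirely (return None) if any metric failed to convert."""
    if any(count is None for count in desired_counts):
        return None
    return max(desired_counts)
```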

Finally, just before HPA scales the target, the scale recommendation is recorded. The
controller considers all recommendations within a configurable window, choosing the
highest recommendation from within that window. This value can be configured using the `--horizontal-pod-autoscaler-downscale-stabilization-window` flag, which defaults to 5 minutes.
This means that scaledowns will occur gradually, smoothing out the impact of rapidly
fluctuating metric values.
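
The stabilization window can be sketched with a small helper (an assumed structure for illustration; the real controller stores and prunes recommendations differently):

```python
import time
from collections import deque

class DownscaleStabilizer:
    """Track recent scale recommendations and return the highest one
    seen within the stabilization window (default 5 minutes)."""

    def __init__(self, window_seconds: float = 300.0):
        self.window_seconds = window_seconds
        self.recommendations = deque()  # (timestamp, replicas) pairs

    def recommend(self, replicas: int, now: float = None) -> int:
        now = time.monotonic() if now is None else now
        self.recommendations.append((now, replicas))
        # Drop recommendations older than the stabilization window.
        while self.recommendations[0][0] < now - self.window_seconds:
            self.recommendations.popleft()
        return max(r for _, r in self.recommendations)
```

A drop from 10 desired replicas to 4 within the window still yields 10, so the scale down takes effect only once the older, higher recommendation ages out of the window.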

## API Object

The Horizontal Pod Autoscaler is an API resource in the Kubernetes `autoscaling` API group.
@@ -129,16 +209,14 @@ dynamic nature of the metrics evaluated. This is sometimes referred to as *thras
Starting from v1.6, a cluster operator can mitigate this problem by tuning
the global HPA settings exposed as flags for the `kube-controller-manager` component:

Starting from v1.12, a new algorithmic update removes the need for the
upscale delay.

- `--horizontal-pod-autoscaler-downscale-delay`: The value for this option is a
duration that specifies how long the autoscaler has to wait before another
downscale operation can be performed after the current one has completed.
The default value is 5 minutes (`5m0s`).

- `--horizontal-pod-autoscaler-upscale-delay`: The value for this option is a
duration that specifies how long the autoscaler has to wait before another
upscale operation can be performed after the current one has completed.
The default value is 3 minutes (`3m0s`).

{{< note >}}
**Note**: When tuning these parameter values, a cluster operator should be aware of
the possible consequences. If the delay (cooldown) value is set too long, there
