From 93c0bb93476b3e227fb4c9736a4d3c7a622b6738 Mon Sep 17 00:00:00 2001 From: Solly Ross Date: Wed, 12 Sep 2018 00:17:42 -0400 Subject: [PATCH] HPA Algorithm Information Improvements (#9780) * Update HPA docs with more algorithm details The HPA docs pointed to an out-of-date document for information on the algorithm details, which users were finding confusing. This sticks a section on the algorithm in the HPA docs instead, documenting both general behavior and corner cases. * Add glossary info, HPA docs on quantities People often ask about the quantity notation when working with the metrics APIs, so this adds a glossary entry on quantities (since they're used elsewhere in the system), and a short explanation in the HPA walkthrough. * Information about HPA readiness and stabilization This adds information about the new changes to HPA readiness and stabilization from kubernetes/features#591, and other minor changes that landed in Kubernetes 1.12. * Update horizontal-pod-autoscale.md --- .../en/docs/reference/glossary/quantity.md | 30 ++++++ .../horizontal-pod-autoscale-walkthrough.md | 10 ++ .../horizontal-pod-autoscale.md | 94 +++++++++++++++++-- 3 files changed, 126 insertions(+), 8 deletions(-) create mode 100644 content/en/docs/reference/glossary/quantity.md diff --git a/content/en/docs/reference/glossary/quantity.md b/content/en/docs/reference/glossary/quantity.md new file mode 100644 index 0000000000000..c0c0edd52b537 --- /dev/null +++ b/content/en/docs/reference/glossary/quantity.md @@ -0,0 +1,30 @@ +--- +title: Quantity +id: quantity +date: 2018-08-07 +full_link: +short_description: > + A whole-number representation of small or large numbers using SI suffixes. + +aka: +tags: +--- + A whole-number representation of small or large numbers using SI suffixes. + + + +Quantities are representations of small or large numbers using a compact, +whole-number notation with SI suffixes.
Fractional numbers are represented +using milli-units, while large numbers can be represented using kilo-units, +mega-units, giga-units, etc. + +For instance, the number `1.5` is represented as `1500m`, while the number `1000` +can be represented as `1k`, and `1000000` as `1M`. You can also specify +binary-notation suffixes; the number 2048 can be written as `2Ki`. + +The accepted decimal (power-of-10) units are `m` (milli), `k` (kilo, +intentionally lowercase), `M` (mega), `G` (giga), `T` (tera), `P` (peta), +`E` (exa). + +The accepted binary (power-of-2) units are `Ki` (kibi), `Mi` (mebi), `Gi` (gibi), +`Ti` (tebi), `Pi` (pebi), `Ei` (exbi). diff --git a/content/en/docs/tasks/run-application/horizontal-pod-autoscale-walkthrough.md b/content/en/docs/tasks/run-application/horizontal-pod-autoscale-walkthrough.md index 1963fa4293669..ca1ec171bd932 100644 --- a/content/en/docs/tasks/run-application/horizontal-pod-autoscale-walkthrough.md +++ b/content/en/docs/tasks/run-application/horizontal-pod-autoscale-walkthrough.md @@ -432,6 +432,16 @@ was capped by the maximum or minimum of the HorizontalPodAutoscaler. This is an you may wish to raise or lower the minimum or maximum replica count constraints on your HorizontalPodAutoscaler. +## Appendix: Quantities + +All metrics in the HorizontalPodAutoscaler and metrics APIs are specified using +a special whole-number notation known in Kubernetes as a *quantity*. For example, +the quantity `10500m` would be written as `10.5` in decimal notation. The metrics APIs +will return whole numbers without a suffix when possible, and will generally return +quantities in milli-units otherwise. This means you might see your metric value fluctuate +between `1` and `1500m`, or `1` and `1.5` when written in decimal notation. See the +[glossary entry on quantities](/docs/reference/glossary/quantity.md) for more information.
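The suffix arithmetic above can be sketched in a few lines. This is an illustrative helper only, not the real Kubernetes quantity parser (which lives in `k8s.io/apimachinery`), and it ignores details such as fixed-point precision:

```python
# Illustrative sketch of Kubernetes-style quantity suffixes (not the real
# apimachinery parser): converts a quantity string to a plain decimal value.
SUFFIXES = {
    "m": 10**-3, "k": 10**3, "M": 10**6, "G": 10**9,
    "T": 10**12, "P": 10**15, "E": 10**18,
    "Ki": 2**10, "Mi": 2**20, "Gi": 2**30,
    "Ti": 2**40, "Pi": 2**50, "Ei": 2**60,
}

def parse_quantity(q: str) -> float:
    """Return the decimal value of a quantity such as '1500m' or '2Ki'."""
    # Try two-character binary suffixes ('Ki') before one-character ones ('k').
    for suffix in sorted(SUFFIXES, key=len, reverse=True):
        if q.endswith(suffix):
            return float(q[: -len(suffix)]) * SUFFIXES[suffix]
    return float(q)  # no suffix: a plain whole number

print(parse_quantity("1500m"))   # 1.5
print(parse_quantity("2Ki"))     # 2048.0
print(parse_quantity("10500m"))  # 10.5
```

As in the walkthrough text above, `10500m` and `10.5` denote the same value; the suffixed form is just the whole-number spelling.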
+ ## Appendix: Other possible scenarios ### Creating the autoscaler declaratively diff --git a/content/en/docs/tasks/run-application/horizontal-pod-autoscale.md b/content/en/docs/tasks/run-application/horizontal-pod-autoscale.md index edb6f927a73f0..4a5b88edf0784 100644 --- a/content/en/docs/tasks/run-application/horizontal-pod-autoscale.md +++ b/content/en/docs/tasks/run-application/horizontal-pod-autoscale.md @@ -50,9 +50,10 @@ or the custom metrics API (for all other metrics). the number of desired replicas. Please note that if some of the pod's containers do not have the relevant resource request set, - CPU utilization for the pod will not be defined and the autoscaler will not take any action - for that metric. See the [autoscaling algorithm design document](https://git.k8s.io/community/contributors/design-proposals/autoscaling/horizontal-pod-autoscaler.md#autoscaling-algorithm) for further - details about how the autoscaling algorithm works. + CPU utilization for the pod will not be defined and the autoscaler will + not take any action for that metric. See the [algorithm + details](#algorithm-details) section below for more information about + how the autoscaling algorithm works. * For per-pod custom metrics, the controller functions similarly to per-pod resource metrics, except that it works with raw values, not utilization values. @@ -81,6 +82,85 @@ by using the scale sub-resource. Scale is an interface that allows you to dynamically set the number of replicas and examine each of their current states. More details on scale sub-resource can be found [here](https://git.k8s.io/community/contributors/design-proposals/autoscaling/horizontal-pod-autoscaler.md#scale-subresource).
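The per-pod resource utilization behavior described above can be sketched as follows. This is a simplified, hypothetical helper (not the controller's actual code) illustrating why a missing resource request means the autoscaler takes no action for that metric:

```python
# Simplified sketch (not the actual HPA controller code): per-pod CPU
# utilization is usage divided by the pod's resource request. Without a
# request, utilization is undefined and that metric is skipped.
from typing import Optional

def pod_utilization(usage_milli: int, request_milli: Optional[int]) -> Optional[float]:
    """Return CPU utilization as a fraction of the request, or None if undefined."""
    if not request_milli:
        return None  # no resource request set: utilization is not defined
    return usage_milli / request_milli

# A pod using 100m of CPU against a 200m request is at 50% utilization.
print(pod_utilization(100, 200))   # 0.5
print(pod_utilization(100, None))  # None
```

A `None` result here corresponds to the documented behavior: the autoscaler simply does not act on that metric for pods lacking the relevant request.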
+### Algorithm Details + +From the most basic perspective, the Horizontal Pod Autoscaler controller +operates on the ratio between the current metric value and the desired +metric value: + +``` +desiredReplicas = ceil[currentReplicas * ( currentMetricValue / desiredMetricValue )] +``` + +For example, if the current metric value is `200m`, and the desired value +is `100m`, the number of replicas will be doubled, since `200.0 / 100.0 == +2.0`. If the current value is instead `50m`, we'll halve the number of +replicas, since `50.0 / 100.0 == 0.5`. We'll skip scaling if the ratio is +sufficiently close to 1.0 (within a globally-configurable tolerance, from +the `--horizontal-pod-autoscaler-tolerance` flag, which defaults to 0.1). + +When a `targetAverageValue` or `targetAverageUtilization` is specified, +the `currentMetricValue` is computed by taking the average of the given +metric across all Pods in the HorizontalPodAutoscaler's scale target. +However, before checking the tolerance and deciding on the final values, +we take pod readiness and missing metrics into consideration. + +All Pods with a deletion timestamp set (i.e. Pods in the process of being +shut down) and all failed Pods are discarded. + +If a particular Pod is missing metrics, it is set aside for later; Pods +with missing metrics will be used to adjust the final scaling amount. + +When scaling on CPU, if any pod has yet to become ready (i.e. it's still +initializing) *or* the most recent metric point for the pod was before it +became ready, that pod is set aside as well. + +Due to technical constraints, the HorizontalPodAutoscaler controller +cannot exactly determine the first time a pod becomes ready when +determining whether to set aside certain CPU metrics. Instead, it +considers a Pod "not yet ready" if it's unready and transitioned to +unready within a short, configurable window of time since it started.
This value is configured with the `--horizontal-pod-autoscaler-initial-readiness-delay` flag, and its default is 30 +seconds. Once a pod has become ready, it considers any transition to +ready to be the first if it occurred within a longer, configurable time +since it started. This value is configured with the `--horizontal-pod-autoscaler-cpu-initialization-period` flag, and its +default is 5 minutes. + +The `currentMetricValue / desiredMetricValue` base scale ratio is then +calculated using the remaining pods not set aside or discarded from above. + +If there were any missing metrics, we recompute the average more +conservatively, assuming those pods were consuming 100% of the desired +value in case of a scale down, and 0% in case of a scale up. This dampens +the magnitude of any potential scale. + +Furthermore, if any not-yet-ready pods were present, and we would have +scaled up without factoring in missing metrics or not-yet-ready pods, we +conservatively assume the not-yet-ready pods are consuming 0% of the +desired metric, further dampening the magnitude of a scale up. + +After factoring in the not-yet-ready pods and missing metrics, we +recalculate the usage ratio. If the new ratio reverses the scale +direction, or is within the tolerance, we skip scaling. Otherwise, we use +the new ratio to scale. + +Note that the *original* value for the average utilization is reported +back via the HorizontalPodAutoscaler status, without factoring in the +not-yet-ready pods or missing metrics, even when the new usage ratio is +used. + +If multiple metrics are specified in a HorizontalPodAutoscaler, this +calculation is done for each metric, and then the largest of the desired +replica counts is chosen. If any of those metrics cannot be converted +into a desired replica count (e.g. due to an error fetching the metrics +from the metrics APIs), scaling is skipped. + +Finally, just before HPA scales the target, the scale recommendation is recorded.
The +controller considers all recommendations within a configurable window, choosing the +highest recommendation from within that window. This value can be configured using the `--horizontal-pod-autoscaler-downscale-stabilization` flag, which defaults to 5 minutes. +This means that scaledowns will occur gradually, smoothing out the impact of rapidly +fluctuating metric values. + ## API Object The Horizontal Pod Autoscaler is an API resource in the Kubernetes `autoscaling` API group. @@ -129,16 +209,14 @@ dynamic nature of the metrics evaluated. This is sometimes referred to as *thrashing*. Starting from v1.6, a cluster operator can mitigate this problem by tuning the global HPA settings exposed as flags for the `kube-controller-manager` component: +Starting from v1.12, a new algorithmic update removes the need for the +upscale delay. + - `--horizontal-pod-autoscaler-downscale-delay`: The value for this option is a duration that specifies how long the autoscaler has to wait before another downscale operation can be performed after the current one has completed. The default value is 5 minutes (`5m0s`). -- `--horizontal-pod-autoscaler-upscale-delay`: The value for this option is a - duration that specifies how long the autoscaler has to wait before another - upscale operation can be performed after the current one has completed. - The default value is 3 minutes (`3m0s`). - {{< note >}} **Note**: When tuning these parameter values, a cluster operator should be aware of the possible consequences. If the delay (cooldown) value is set too long, there