HPA Algorithm Information Improvements (#9780)
* Update HPA docs with more algorithm details

The HPA docs pointed to an out-of-date document for information on the
algorithm details, which users were finding confusing.  This sticks a
section on the algorithm in the HPA docs instead, documenting both
general behavior and corner cases.

* Add glossary info, HPA docs on quantities

People often ask about the quantity notation when working with the
metrics APIs, so this adds a glossary entry on quantities (since they're
used elsewhere in the system), and a short explanation in the HPA walkthrough.

* Information about HPA readiness and stabilization

This adds information about the new changes to HPA readiness and
stabilization from kubernetes/enhancements#591, and other minor changes that
landed in Kubernetes 1.12.

* Update horizontal-pod-autoscale.md
DirectXMan12 authored and k8s-ci-robot committed Sep 12, 2018
1 parent e1e6555 commit 93c0bb9
Showing 3 changed files with 126 additions and 8 deletions.
30 changes: 30 additions & 0 deletions content/en/docs/reference/glossary/quantity.md
@@ -0,0 +1,30 @@
---
title: Quantity
id: quantity
date: 2018-08-07
full_link:
short_description: >
A whole-number representation of small or large numbers using SI suffixes.
aka:
tags:
---
A whole-number representation of small or large numbers using SI suffixes.

<!--more-->

Quantities are representations of small or large numbers using a compact,
whole-number notation with SI suffixes. Fractional numbers are represented
using milli-units, while large numbers can be represented using kilo-units,
mega-units, giga-units, etc.

For instance, the number `1.5` is represented as `1500m`, while the number `1000`
can be represented as `1k`, and `1000000` as `1M`. You can also specify
binary-notation suffixes; the number `2048` can be written as `2Ki`.

The accepted decimal (power-of-10) units are `m` (milli), `k` (kilo,
intentionally lowercase), `M` (mega), `G` (giga), `T` (tera), `P` (peta),
`E` (exa).

The accepted binary (power-of-2) units are `Ki` (kibi), `Mi` (mebi), `Gi` (gibi),
`Ti` (tebi), `Pi` (pebi), `Ei` (exbi).
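
The suffix arithmetic can be sketched in Python (a simplified parser for illustration only; Kubernetes' actual `resource.Quantity` type in `k8s.io/apimachinery` also handles scientific notation and canonical forms):

```python
# Simplified quantity parser; an illustration, not the Kubernetes implementation.
SUFFIXES = {
    "m": 10**-3, "k": 10**3, "M": 10**6, "G": 10**9,
    "T": 10**12, "P": 10**15, "E": 10**18,
    "Ki": 2**10, "Mi": 2**20, "Gi": 2**30,
    "Ti": 2**40, "Pi": 2**50, "Ei": 2**60,
}

def parse_quantity(quantity: str) -> float:
    """Convert a quantity string such as "1500m" or "2Ki" to a float."""
    # Check longer (binary) suffixes before single-letter decimal ones.
    for suffix in sorted(SUFFIXES, key=len, reverse=True):
        if quantity.endswith(suffix):
            return float(quantity[: -len(suffix)]) * SUFFIXES[suffix]
    return float(quantity)
```

For example, `parse_quantity("1500m")` yields `1.5`, and `parse_quantity("2Ki")` yields `2048.0`.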
@@ -432,6 +432,16 @@ was capped by the maximum or minimum of the HorizontalPodAutoscaler. This is an
you may wish to raise or lower the minimum or maximum replica count constraints on your
HorizontalPodAutoscaler.

## Appendix: Quantities

All metrics in the HorizontalPodAutoscaler and metrics APIs are specified using
a special whole-number notation known in Kubernetes as a *quantity*. For example,
the quantity `10500m` would be written as `10.5` in decimal notation. The metrics APIs
will return whole numbers without a suffix when possible, and will generally return
quantities in milli-units otherwise. This means you might see your metric value fluctuate
between `1` and `1500m`, or `1` and `1.5` when written in decimal notation. See the
[glossary entry on quantities](/docs/reference/glossary/quantity.md) for more information.

## Appendix: Other possible scenarios

### Creating the autoscaler declaratively
94 changes: 86 additions & 8 deletions content/en/docs/tasks/run-application/horizontal-pod-autoscale.md
@@ -50,9 +50,10 @@ or the custom metrics API (for all other metrics).
the number of desired replicas.

Please note that if some of the pod's containers do not have the relevant resource request set,
CPU utilization for the pod will not be defined and the autoscaler will not take any action
for that metric. See the [autoscaling algorithm design document](https://git.k8s.io/community/contributors/design-proposals/autoscaling/horizontal-pod-autoscaler.md#autoscaling-algorithm) for further
details about how the autoscaling algorithm works.
CPU utilization for the pod will not be defined and the autoscaler will
not take any action for that metric. See the [algorithm
details](#algorithm-details) section below for more information about
how the autoscaling algorithm works.

* For per-pod custom metrics, the controller functions similarly to per-pod resource metrics,
except that it works with raw values, not utilization values.
@@ -81,6 +82,85 @@ by using the scale sub-resource. Scale is an interface that allows you to dynami
each of their current states. More details on scale sub-resource can be found
[here](https://git.k8s.io/community/contributors/design-proposals/autoscaling/horizontal-pod-autoscaler.md#scale-subresource).

### Algorithm Details

From the most basic perspective, the Horizontal Pod Autoscaler controller
operates on the ratio between desired metric value and current metric
value:

```
desiredReplicas = ceil[currentReplicas * ( currentMetricValue / desiredMetricValue )]
```

For example, if the current metric value is `200m`, and the desired value
is `100m`, the number of replicas will be doubled, since `200.0 / 100.0 ==
2.0`. If the current value is instead `50m`, we'll halve the number of
replicas, since `50.0 / 100.0 == 0.5`. We'll skip scaling if the ratio is
sufficiently close to 1.0 (within a globally-configurable tolerance, set
by the `--horizontal-pod-autoscaler-tolerance` flag, which defaults to 0.1).
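
The ratio computation and tolerance check described above can be sketched in a few lines of Python (function and parameter names are illustrative, not taken from the controller source):

```python
from math import ceil

def desired_replicas(current_replicas: int, current_metric: float,
                     desired_metric: float, tolerance: float = 0.1) -> int:
    """Sketch of the basic HPA scaling rule with the tolerance band."""
    ratio = current_metric / desired_metric
    # Skip scaling when the ratio is sufficiently close to 1.0.
    if abs(ratio - 1.0) <= tolerance:
        return current_replicas
    return ceil(current_replicas * ratio)
```

With 4 replicas and a `200m` current value against a `100m` target, the ratio is 2.0 and the sketch returns 8 replicas; at `105m` the ratio falls within the default tolerance and the replica count is unchanged.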

When a `targetAverageValue` or `targetAverageUtilization` is specified,
the `currentMetricValue` is computed by taking the average of the given
metric across all Pods in the HorizontalPodAutoscaler's scale target.
However, before checking the tolerance and deciding on the final values,
we take pod readiness and missing metrics into consideration.

All Pods with a deletion timestamp set (i.e. Pods in the process of being
shut down) and all failed Pods are discarded.

If a particular Pod is missing metrics, it is set aside for later; Pods
with missing metrics will be used to adjust the final scaling amount.

When scaling on CPU, if any pod has yet to become ready (i.e. it's still
initializing) *or* the most recent metric point for the pod was before it
became ready, that pod is set aside as well.

Due to technical constraints, the HorizontalPodAutoscaler controller
cannot exactly determine the first time a pod becomes ready when
determining whether to set aside certain CPU metrics. Instead, it
considers a Pod "not yet ready" if it's unready and transitioned to
unready within a short, configurable window of time since it started.
This value is configured with the `--horizontal-pod-autoscaler-initial-readiness-delay` flag, and its default is 30
seconds. Once a pod has become ready, it considers any transition to
ready to be the first if it occurred within a longer, configurable time
since it started. This value is configured with the `--horizontal-pod-autoscaler-cpu-initialization-period` flag, and its
default is 5 minutes.
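
The unready-pod heuristic might look roughly like this (a deliberate simplification with invented names; the real controller works from Pod conditions and their timestamps):

```python
def pod_not_yet_ready(is_ready: bool,
                      seconds_from_start_to_transition: float,
                      initial_readiness_delay: float = 30.0) -> bool:
    """A pod is treated as "not yet ready" if it is unready and its last
    readiness transition happened within the initial readiness delay
    after it started (simplified sketch of the heuristic above)."""
    return (not is_ready) and (
        seconds_from_start_to_transition < initial_readiness_delay)
```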

The `currentMetricValue / desiredMetricValue` base scale ratio is then
calculated using the remaining pods not set aside or discarded from above.

If there were any missing metrics, we recompute the average more
conservatively, assuming those pods were consuming 100% of the desired
value in case of a scale down, and 0% in case of a scale up. This dampens
the magnitude of any potential scale.
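
The conservative recomputation can be illustrated as follows (a sketch of the behavior described above, with invented names, not the controller's code):

```python
def dampened_usage_ratio(known_values, num_missing, desired_value):
    """Recompute the average assuming pods with missing metrics consume
    100% of the desired value on a scale down and 0% on a scale up."""
    raw_average = sum(known_values) / len(known_values)
    scaling_up = raw_average > desired_value
    assumed = 0.0 if scaling_up else desired_value
    total = sum(known_values) + num_missing * assumed
    average = total / (len(known_values) + num_missing)
    return average / desired_value
```

With two pods at `200m`, two pods missing metrics, and a `100m` target, the raw ratio of 2.0 is dampened to 1.0, suppressing the scale up.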

Furthermore, if any not-yet-ready pods were present, and we would have
scaled up without factoring in missing metrics or not-yet-ready pods, we
conservatively assume the not-yet-ready pods are consuming 0% of the
desired metric, further dampening the magnitude of a scale up.

After factoring in the not-yet-ready pods and missing metrics, we
recalculate the usage ratio. If the new ratio reverses the scale
direction, or is within the tolerance, we skip scaling. Otherwise, we use
the new ratio to scale.
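
The final decision step can be sketched as (illustrative names, not controller source):

```python
def should_scale(raw_ratio: float, dampened_ratio: float,
                 tolerance: float = 0.1) -> bool:
    """Skip scaling when the dampened ratio is within the tolerance band
    or reverses the direction implied by the raw ratio."""
    if abs(dampened_ratio - 1.0) <= tolerance:
        return False
    if (raw_ratio > 1.0) != (dampened_ratio > 1.0):
        return False
    return True
```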

Note that the *original* value for the average utilization is reported
back via the HorizontalPodAutoscaler status, without factoring in the
not-yet-ready pods or missing metrics, even when the new usage ratio is
used.

If multiple metrics are specified in a HorizontalPodAutoscaler, this
calculation is done for each metric, and then the largest of the desired
replica counts is chosen. If any of those metrics cannot be converted
into a desired replica count (e.g. due to an error fetching the metrics
from the metrics APIs), scaling is skipped.
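
Combining multiple metrics then reduces to a simple maximum (a sketch; here `None` stands in for a metric that could not be converted):

```python
def combine_desired_counts(desired_counts):
    """Pick the largest desired replica count across all metrics; skip
    scaling entirely (return None) if any metric failed to convert."""
    if any(count is None for count in desired_counts):
        return None
    return max(desired_counts)
```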

Finally, just before HPA scales the target, the scale recommendation is recorded. The
controller considers all recommendations within a configurable window, choosing the
highest recommendation from within that window. This value can be configured using the `--horizontal-pod-autoscaler-downscale-stabilization-window` flag, which defaults to 5 minutes.
This means that scaledowns will occur gradually, smoothing out the impact of rapidly
fluctuating metric values.
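
The stabilization window can be sketched with a small helper (an assumed structure for illustration; the real controller stores and prunes recommendations differently):

```python
import time
from collections import deque

class DownscaleStabilizer:
    """Track recent scale recommendations and return the highest one
    seen within the stabilization window (default 5 minutes)."""

    def __init__(self, window_seconds: float = 300.0):
        self.window_seconds = window_seconds
        self.recommendations = deque()  # (timestamp, replicas) pairs

    def recommend(self, replicas: int, now: float = None) -> int:
        now = time.monotonic() if now is None else now
        self.recommendations.append((now, replicas))
        # Drop recommendations older than the stabilization window.
        while self.recommendations[0][0] < now - self.window_seconds:
            self.recommendations.popleft()
        return max(r for _, r in self.recommendations)
```

A drop from 10 desired replicas to 4 within the window still yields 10, so the scale down takes effect only once the older, higher recommendation ages out of the window.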

## API Object

The Horizontal Pod Autoscaler is an API resource in the Kubernetes `autoscaling` API group.
@@ -129,16 +209,14 @@ dynamic nature of the metrics evaluated. This is sometimes referred to as *thras
Starting from v1.6, a cluster operator can mitigate this problem by tuning
the global HPA settings exposed as flags for the `kube-controller-manager` component:

Starting from v1.12, a new algorithmic update removes the need for the
upscale delay.

- `--horizontal-pod-autoscaler-downscale-delay`: The value for this option is a
duration that specifies how long the autoscaler has to wait before another
downscale operation can be performed after the current one has completed.
The default value is 5 minutes (`5m0s`).

- `--horizontal-pod-autoscaler-upscale-delay`: The value for this option is a
duration that specifies how long the autoscaler has to wait before another
upscale operation can be performed after the current one has completed.
The default value is 3 minutes (`3m0s`).

{{< note >}}
**Note**: When tuning these parameter values, a cluster operator should be aware of
the possible consequences. If the delay (cooldown) value is set too long, there
