
Support scaling HPA to/from zero pods for object/external metrics #74526

Merged
merged 2 commits into kubernetes:master from DXist:feature/hpa-scale-to-zero on Jul 16, 2019

Conversation

@DXist (Contributor) commented Feb 25, 2019

What type of PR is this?

/kind feature

What this PR does / why we need it:
This PR targets worker deployments that consume from queues, where scaling is based on an object or external metric that depends on queue size. When the workers are idle, the corresponding deployment can be scaled to zero replicas to save resources.

This technique is especially useful when workers request GPU resources and the number of different idle worker types exceeds the number of available GPUs.
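For illustration, here is a minimal sketch of what such an HPA object could look like when built with the autoscaling/v2beta2 Go types. The deployment name, the external metric name (queue_messages_ready), and the target value are hypothetical, and minReplicas=0 requires the HPAScaleToZero feature gate added by this PR.

    package main

    import (
        "fmt"

        autoscalingv2beta2 "k8s.io/api/autoscaling/v2beta2"
        "k8s.io/apimachinery/pkg/api/resource"
        metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
    )

    func main() {
        minReplicas := int32(0)                  // only valid with the HPAScaleToZero gate enabled
        targetBacklog := resource.MustParse("5") // desired average queue backlog per pod

        hpa := autoscalingv2beta2.HorizontalPodAutoscaler{
            ObjectMeta: metav1.ObjectMeta{Name: "queue-worker"},
            Spec: autoscalingv2beta2.HorizontalPodAutoscalerSpec{
                ScaleTargetRef: autoscalingv2beta2.CrossVersionObjectReference{
                    APIVersion: "apps/v1",
                    Kind:       "Deployment",
                    Name:       "queue-worker", // hypothetical worker deployment
                },
                MinReplicas: &minReplicas,
                MaxReplicas: 10,
                Metrics: []autoscalingv2beta2.MetricSpec{{
                    Type: autoscalingv2beta2.ExternalMetricSourceType,
                    External: &autoscalingv2beta2.ExternalMetricSource{
                        // Hypothetical external metric exported for the queue backlog.
                        Metric: autoscalingv2beta2.MetricIdentifier{Name: "queue_messages_ready"},
                        Target: autoscalingv2beta2.MetricTarget{
                            Type:         autoscalingv2beta2.AverageValueMetricType,
                            AverageValue: &targetBacklog,
                        },
                    },
                }},
            },
        }
        fmt.Printf("%+v\n", hpa.Spec)
    }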

Which issue(s) this PR fixes:

Fixes #69687

Special notes for your reviewer:
The PR is based on changes made in #61423

  1. Scale to/from zero changes are made.
  2. Applied changes from #61423.
  3. HPA continues to scale as long as at least one metric value is available.
    The conservative scale-down behaviour introduced in #61423 is not included:
    scaling down works even if we have just one metric value.
  4. The HPA tolerance set through the --horizontal-pod-autoscaler-tolerance flag is ignored when scaling up from zero pods (see the sketch below).
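The following is a rough, self-contained Go sketch of the behaviour described in points 3 and 4, not the PR's actual replica calculator code; the helper name and the example tolerance value are assumptions for illustration.

    package main

    import (
        "fmt"
        "math"
    )

    // desiredReplicas mimics the scale decision: when the current replica count is
    // zero, the tolerance band around usageRatio == 1.0 is skipped and the desired
    // count comes straight from the metric ratio.
    func desiredReplicas(usageRatio float64, currentReplicas int32, tolerance float64) int32 {
        if currentReplicas == 0 {
            // Scaling up from zero ignores tolerance (point 4).
            return int32(math.Ceil(usageRatio))
        }
        if math.Abs(1.0-usageRatio) <= tolerance {
            // Within tolerance: keep the current replica count.
            return currentReplicas
        }
        return int32(math.Ceil(usageRatio * float64(currentReplicas)))
    }

    func main() {
        fmt.Println(desiredReplicas(0.95, 0, 0.1)) // 1: scales up from zero even though 0.95 is within tolerance
        fmt.Println(desiredReplicas(0.0, 3, 0.1))  // 0: scales down to zero when the metric reports no load
    }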

Does this PR introduce a user-facing change?:

When the HPAScaleToZero feature gate is enabled, HPA supports scaling to zero pods based on object or external metrics. HPA remains active as long as at least one metric value is available.

To downgrade the cluster to a version that does not support the scale-to-zero feature:
1. Make sure there are no HPA objects with minReplicas=0. Here is a one-liner to update them to 1:
    $ kubectl get hpa --all-namespaces  --no-headers=true | awk  '{if($6==0) printf "kubectl patch hpa/%s --namespace=%s -p \"{\\\"spec\\\":{\\\"minReplicas\\\":1}}\"\n", $2, $1 }' | sh
2. Disable the HPAScaleToZero feature gate.

@k8s-ci-robot (Contributor) commented Feb 25, 2019

Hi @DXist. Thanks for your PR.

I'm waiting for a kubernetes member to verify that this patch is reasonable to test. If it is, they should reply with /ok-to-test on its own line. Until that is done, I will not automatically test new commits in this PR, but the usual testing commands by org members will still work. Regular contributors should join the org to skip this step.

Once the patch is verified, the new status will be reflected by the ok-to-test label.

I understand the commands that are listed here.

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.

@spiffxp (Member) commented Feb 25, 2019

/ok-to-test

@Rajat-0 commented Feb 25, 2019

@DXist I am also working on this issue. I have discussed the proposal below in the SIG weekly meeting; as per the discussion, @mwielgus will discuss this with the networking team, and based on that we can modify the PR. I'll be happy to coordinate with you on this.

https://docs.google.com/document/d/1p_Xlk8e5V32WOBVeRbJqbxhEKcJy_mDu9JgVp5ZNDxs/

@DXist (Contributor, Author) commented Feb 26, 2019

@Rajat-0

@DXist I am also working on this issue. I have discussed the proposal below in the SIG weekly meeting; as per the discussion, the SIG owner will discuss this with the networking team, and based on that we can modify the PR. I'll be happy to coordinate with you on this.

https://docs.google.com/document/d/1p_Xlk8e5V32WOBVeRbJqbxhEKcJy_mDu9JgVp5ZNDxs/

This PR handles a more general case and is not specific to HTTP workloads.

If an HTTP load balancer/Ingress could buffer requests and export the number of queued requests for a given service as a custom metric, then HPA could use this metric for scaling. To speed up scaling, there could be some signaling about metric changes, or a push-based model of metric delivery to the HPA.

@mwielgus (Contributor) left a comment

Couple of high-level items:

  1. How was it tested?
  2. Why didn't you handle: https://github.com/kubernetes/kubernetes/blob/master/pkg/controller/podautoscaler/horizontal.go#L567
  3. Did you consider explicitly validating that there is Object/External metric configured?
  4. Maybe we should have a comment update in the API.

@mwielgus mwielgus self-assigned this Feb 28, 2019

@DXist (Contributor, Author) commented Feb 28, 2019

Couple of high-level items:

  1. How was it tested?

There are unit tests that cover scaling up from zero pods and scaling down to zero for custom object and external metrics.

For end-to-end testing I:

  • built the K8s binaries
  • ran them using https://github.com/kubernetes-sigs/kubeadm-dind-cluster
  • set up RabbitMQ, the Prometheus Operator, and the Prometheus adapter for custom metrics
  • set up the service I wanted to scale and configured Prometheus, the Prometheus adapter, and HPA.
    The HPA used two Object metrics: worker utilization and the queue ingress/egress ratio
  • ran scripts to generate test load: a single message, or a series of messages at a rate of ~5x the processing throughput of a single pod

Only one metric value (worker utilization) was available when there were no pods.
The metrics used a 1m window to calculate the averaged value, so I delayed the start of the pod readiness probes via initialDelaySeconds: 60. Due to this delay, the next scaling decision was based on a full metric window since the previous rescale. The delay worked in the same way as the replica calculator code for resource metrics.

  2. Why didn't you handle: https://github.com/kubernetes/kubernetes/blob/master/pkg/controller/podautoscaler/horizontal.go#L567

GitHub collapsed the diff for this module. I've added an extra condition to enable scaling when minReplicas==0.

  3. Did you consider explicitly validating that there is an Object/External metric configured?

No, I didn't. It seems like good protection against HPA misconfiguration.

  4. Maybe we should have a comment update in the API.

Should the comment update go near MinReplicas?

@DXist (Contributor, Author) commented Mar 1, 2019

Added metrics validation for the minReplicas=0 case and a comment to the API.

@fejta-bot commented Mar 1, 2019

This PR may require API review.

If so, when the changes are ready, complete the pre-review checklist and request an API review.

Status of requested reviews is tracked in the API Review project.

@thockin (Member) commented Mar 1, 2019

API LGTM

@mwielgus ping me when LGTM and I will approve

@DXist DXist referenced this pull request Mar 5, 2019

Closed

Allow HPA to scale to 0 #69687

@liggitt liggitt added this to Assigned in API Reviews Jul 2, 2019

@thockin (Member) left a comment

I think this is sane now. @liggitt want another look?

/approve

@k8s-ci-robot (Contributor) commented Jul 12, 2019

[APPROVALNOTIFIER] This PR is APPROVED

This pull-request has been approved by: DXist, mwielgus, thockin

The full list of commands accepted by this bot can be found here.

The pull request process is described here

Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

    replicaCount = int32(math.Ceil(usageRatio * float64(readyPodCount)))
} else {
    // Scale to zero or n pods depending on usageRatio
    replicaCount = int32(math.Ceil(usageRatio))

@alok87 (Contributor) commented Jul 12, 2019

We have a use case in which we need to scale worker deployments from min=0 to min=100 (for example) based on the SQS queue length. The worker deployments consume from the SQS queue.

Do you see this logic working with AWS SQS? SQS does not have a single metric that gives the usageRatio: because the consumers long-poll the queue, ApproximateMessage may show 0 all the time if the consumers are consuming messages very quickly. So in such a case we would need to scale down on a different metric, NumberOfEmptyReceives.

@DXist (Author, Contributor) commented Jul 12, 2019

@alok87, personally I use this code with RabbitMQ and Prometheus. I have a worker-utilization heuristic that depends on the number of messages in the queue and the number of workers.

@felippe-mendonca commented Jul 13, 2019

@DXist, sorry, but it wasn't clear to me which Prometheus exporter for RabbitMQ you used in your tests: was it prometheus_rabbitmq_exporter or rabbitmq_exporter? Moreover, what metric did you use with it? In a rollback scenario where the scale-to/from-zero HPA feature is not available, will it still work? Does it scale to one instead of zero?

@DXist (Author, Contributor) commented Jul 13, 2019

@felippe-mendonca, I use the second exporter, which is configured in the rabbitmq-ha Helm chart.

I have 2 metrics: worker utilisation and the ingress/egress message rate ratio. Here is a Helm template of the PrometheusRule object I've configured.

If you want to roll back the cluster to a version without scale-to-zero support, you have to update the HPA objects explicitly before the rollback. See the release notes for this PR.

@felippe-mendonca commented Jul 13, 2019

Thanks for the information @DXist. I'll try this approach with an application I'm running with RabbitMQ that doesn't scale well on CPU. I was considering using average response time, but I think the ingress/egress rate ratio might be better! I hope this new functionality comes to k8s ASAP, and later to GKE.

@DXist DXist force-pushed the DXist:feature/hpa-scale-to-zero branch from a0303fc to 5fa77da Jul 12, 2019

@DXist (Contributor, Author) commented Jul 12, 2019

I've rebased onto current master.

var MinReplicasLowerBound int32

if !utilfeature.DefaultFeatureGate.Enabled(features.HPAScaleToZero) {
MinReplicasLowerBound = 1

@liggitt (Member) commented Jul 12, 2019

validateHorizontalPodAutoscalerSpec is also called from ValidateHorizontalPodAutoscalerUpdate.

Once we enable this feature by default (e.g. in 1.17), an API server will allow minReplicas of 0. In a skewed HA apiserver scenario, a prior-version apiserver with the feature disabled (e.g. 1.16) would reject updates to an object that the 1.17 server allowed to be persisted. If a 1.16 server encounters a persisted object that already has minReplicas set to 0, it must allow updates of that object with minReplicas=0.

One way to do this:

  • pass in minReplicasLowerBound as a parameter to validateHorizontalPodAutoscalerSpec.
  • when called from ValidateHorizontalPodAutoscaler (on creation), compute minReplicasLowerBound the way you do here (based only on the feature gate)
  • when called from ValidateHorizontalPodAutoscalerUpdate (on update), allow a lower bound of 0 if the feature gate is enabled OR if the existing HPA has minReplicas=0.
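For instance, here is a rough Go sketch of the suggested create/update split; the stand-in types and the helper name are simplified assumptions, not the actual kubernetes validation code.

    package validation

    // Simplified stand-ins for the real API types, for illustration only.
    type HorizontalPodAutoscalerSpec struct {
        MinReplicas *int32
    }

    type HorizontalPodAutoscaler struct {
        Spec HorizontalPodAutoscalerSpec
    }

    // minReplicasLowerBound computes the lowest minReplicas value that validation
    // should accept. oldHPA is nil on create and non-nil on update.
    func minReplicasLowerBound(scaleToZeroEnabled bool, oldHPA *HorizontalPodAutoscaler) int32 {
        if scaleToZeroEnabled {
            return 0
        }
        // Even with the gate disabled, tolerate objects that were already persisted
        // with minReplicas=0 by a newer (or skewed) API server.
        if oldHPA != nil && oldHPA.Spec.MinReplicas != nil && *oldHPA.Spec.MinReplicas == 0 {
            return 0
        }
        return 1
    }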

@liggitt (Member) commented Jul 12, 2019

This also needs unit tests exercising all of these scenarios:

  • feature enabled, ValidateHorizontalPodAutoscaler() allows minReplicas=0
  • feature disabled, ValidateHorizontalPodAutoscaler() forbids minReplicas=0
  • feature enabled, old HPA has minReplicas=1, ValidateHorizontalPodAutoscalerUpdate() allows minReplicas=0
  • feature enabled, old HPA has minReplicas=0, ValidateHorizontalPodAutoscalerUpdate() allows minReplicas=0
  • feature disabled, old HPA has minReplicas=1, ValidateHorizontalPodAutoscalerUpdate() forbids minReplicas=0
  • feature disabled, old HPA has minReplicas=0, ValidateHorizontalPodAutoscalerUpdate() allows minReplicas=0
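A condensed, self-contained sketch of how those scenarios could be covered in a table-driven test; it exercises the same lower-bound rule as the sketch above rather than the real ValidateHorizontalPodAutoscaler* functions, so the helper and test names are illustrative.

    package validation

    import "testing"

    // lowerBound mirrors the skew-tolerant rule: minReplicas=0 is accepted when the
    // feature gate is enabled, or when the existing object already has minReplicas=0.
    func lowerBound(gateEnabled bool, oldMinReplicas *int32) int32 {
        if gateEnabled || (oldMinReplicas != nil && *oldMinReplicas == 0) {
            return 0
        }
        return 1
    }

    func TestMinReplicasZeroScenarios(t *testing.T) {
        zero, one := int32(0), int32(1)
        cases := []struct {
            name        string
            gateEnabled bool
            oldMin      *int32 // nil means "create"
            wantAllowed bool   // should minReplicas=0 be accepted?
        }{
            {"create, feature enabled", true, nil, true},
            {"create, feature disabled", false, nil, false},
            {"update from 1, feature enabled", true, &one, true},
            {"update from 0, feature enabled", true, &zero, true},
            {"update from 1, feature disabled", false, &one, false},
            {"update from 0, feature disabled", false, &zero, true},
        }
        for _, tc := range cases {
            allowed := lowerBound(tc.gateEnabled, tc.oldMin) == 0
            if allowed != tc.wantAllowed {
                t.Errorf("%s: got allowed=%v, want %v", tc.name, allowed, tc.wantAllowed)
            }
        }
    }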

@DXist (Author, Contributor) commented Jul 13, 2019

@liggitt I've updated validation logic and moved minReplicas tests into separate test functions to isolate feature gate toggling.

@@ -96,6 +99,10 @@ func TestValidateHorizontalPodAutoscaler(t *testing.T) {
if err != nil {
t.Errorf("unable to parse label selector: %v", err)
}

// Enable HPAScaleToZero feature gate.
defer featuregatetesting.SetFeatureGateDuringTest(t, utilfeature.DefaultFeatureGate, features.HPAScaleToZero, true)()

@liggitt (Member) commented Jul 12, 2019

rather than enabling unconditionally for all tests, we need tests exercising behavior with the feature enabled and disabled

@DXist (Author, Contributor) commented Jul 13, 2019

I moved the gate switching to separate test functions.

@@ -38,7 +38,9 @@ type HorizontalPodAutoscalerSpec struct {
// reference to scaled resource; horizontal pod autoscaler will learn the current resource consumption
// and will set the desired number of pods by using its Scale subresource.
ScaleTargetRef CrossVersionObjectReference `json:"scaleTargetRef" protobuf:"bytes,1,opt,name=scaleTargetRef"`
// lower limit for the number of pods that can be set by the autoscaler, default 1.
// MinReplicas is the lower limit for the number of replicas to which the autoscaler can scale down.
// It defaults to 1 pod. MinReplicas is allowed to be 0 if at least one Object or External metric is configured.

@liggitt (Member) commented Jul 12, 2019

nit: use minReplicas to match the serialized name API consumers will see.

include a mention that setting minReplicas=0 requires the alpha feature gate HPAScaleToZero to be enabled.

(comment applies to all types.go files)

@DXist (Author, Contributor) commented Jul 13, 2019

Updated comments in all API versions.

@liggitt (Member) commented Jul 12, 2019

I think this is sane now. @liggitt want another look?

this is close, but needs a couple tweaks to handle skew cases properly. left a description at https://github.com/kubernetes/kubernetes/pull/74526/files#r302972197

@DXist DXist force-pushed the DXist:feature/hpa-scale-to-zero branch from 5fa77da to ff0a753 Jul 13, 2019

@DXist (Contributor, Author) commented Jul 13, 2019

/test pull-kubernetes-bazel-test

@DXist (Contributor, Author) commented Jul 16, 2019

@liggitt I've updated the PR.

    if errs := ValidateHorizontalPodAutoscalerUpdate(&nonZeroCase, &zeroCase); len(errs) != 0 {
        t.Errorf("expected success: %v", errs)
    }
}

@liggitt (Member) commented Jul 16, 2019

needs one final case where both old and new are zero, and update validation should pass even with the feature disabled

@DXist (Author, Contributor) commented Jul 16, 2019

@liggitt I've added an extra check to cover this case.

@liggitt (Member) commented Jul 16, 2019

validation logic looks correct. one test case missing: feature disabled, old HPA has minReplicas=0, ValidateHorizontalPodAutoscalerUpdate() allows minReplicas=0

@liggitt (Member) commented Jul 16, 2019

/hold cancel
validation looks good, will lgtm once test case is added

DXist added some commits Feb 18, 2019

  • HPA scale-to-zero for custom object/external metrics
      Add support for scaling to zero pods
      minReplicas is allowed to be zero
      condition is set once
      Based on #61423
      set original valid condition
      add scale to/from zero and invalid metric tests
      Scaling up from zero pods ignores tolerance
      validate metrics when minReplicas is 0
      Document HPA behaviour when minReplicas is 0
      Documented minReplicas field in autoscaling APIs
  • Update generated files
      update generated protobufs
      swagger docs are regenerated
      update openapi-spec
      update generated openapi

@DXist DXist force-pushed the DXist:feature/hpa-scale-to-zero branch from ff0a753 to 19d93ee Jul 16, 2019

@liggitt (Member) commented Jul 16, 2019

/lgtm
/retest

@k8s-ci-robot k8s-ci-robot added the lgtm label Jul 16, 2019

@liggitt liggitt moved this from Assigned to API review completed, 1.16 in API Reviews Jul 16, 2019

@DXist (Contributor, Author) commented Jul 16, 2019

/retest

@k8s-ci-robot k8s-ci-robot merged commit 5ece88c into kubernetes:master Jul 16, 2019

23 checks passed

cla/linuxfoundation: DXist authorized
pull-kubernetes-bazel-build: Job succeeded.
pull-kubernetes-bazel-test: Job succeeded.
pull-kubernetes-conformance-image-test: Skipped.
pull-kubernetes-cross: Skipped.
pull-kubernetes-dependencies: Job succeeded.
pull-kubernetes-e2e-gce: Job succeeded.
pull-kubernetes-e2e-gce-100-performance: Job succeeded.
pull-kubernetes-e2e-gce-csi-serial: Skipped.
pull-kubernetes-e2e-gce-device-plugin-gpu: Job succeeded.
pull-kubernetes-e2e-gce-iscsi: Skipped.
pull-kubernetes-e2e-gce-iscsi-serial: Skipped.
pull-kubernetes-e2e-gce-storage-slow: Skipped.
pull-kubernetes-godeps: Skipped.
pull-kubernetes-integration: Job succeeded.
pull-kubernetes-kubemark-e2e-gce-big: Job succeeded.
pull-kubernetes-local-e2e: Skipped.
pull-kubernetes-node-e2e: Job succeeded.
pull-kubernetes-node-e2e-containerd: Job succeeded.
pull-kubernetes-typecheck: Job succeeded.
pull-kubernetes-verify: Job succeeded.
pull-publishing-bot-validate: Skipped.
tide: In merge pool.