HPA doesn't scale down to minReplicas even though metric is under target #78761

Closed
max-rocket-internet opened this issue Jun 6, 2019 · 113 comments
Labels
kind/bug Categorizes issue or PR as related to a bug. sig/autoscaling Categorizes an issue or PR as relevant to SIG Autoscaling.

Comments

@max-rocket-internet

What happened:

HPA scales to Spec.MaxReplicas even though metric is always under target.

Here's the HPA in YAML:

apiVersion: autoscaling/v1
kind: HorizontalPodAutoscaler
metadata:
  annotations:
    autoscaling.alpha.kubernetes.io/conditions: '[{"type":"AbleToScale","status":"True","lastTransitionTime":"2019-06-06T10:46:13Z","reason":"ReadyForNewScale","message":"recommended
      size matches current size"},{"type":"ScalingActive","status":"True","lastTransitionTime":"2019-06-06T10:46:13Z","reason":"ValidMetricFound","message":"the
      HPA was able to successfully calculate a replica count from cpu resource utilization
      (percentage of request)"},{"type":"ScalingLimited","status":"True","lastTransitionTime":"2019-06-06T10:46:13Z","reason":"TooManyReplicas","message":"the
      desired replica count is more than the maximum replica count"}]'
    autoscaling.alpha.kubernetes.io/current-metrics: '[{"type":"Resource","resource":{"name":"cpu","currentAverageUtilization":0,"currentAverageValue":"9m"}}]'
  creationTimestamp: "2019-06-06T10:45:58Z"
  name: my-app-1
  namespace: default
  resourceVersion: "55041251"
  selfLink: /apis/autoscaling/v1/namespaces/default/horizontalpodautoscalers/my-app-1
  uid: 44fedc1a-8848-11e9-8465-025acf90d81e
spec:
  maxReplicas: 4
  minReplicas: 2
  scaleTargetRef:
    apiVersion: extensions/v1beta1
    kind: Deployment
    name: my-app-1
  targetCPUUtilizationPercentage: 40
status:
  currentCPUUtilizationPercentage: 0
  currentReplicas: 4
  desiredReplicas: 4

And here's a description output:

$ kubectl describe hpa my-app-1
  Name:                                                  my-app-1
  Namespace:                                             default
  Labels:                                                <none>
  Annotations:                                           <none>
  CreationTimestamp:                                     Thu, 06 Jun 2019 12:45:58 +0200
  Reference:                                             Deployment/my-app-1
  Metrics:                                               ( current / target )
    resource cpu on pods  (as a percentage of request):  0% (9m) / 40%
  Min replicas:                                          2
  Max replicas:                                          4
  Deployment pods:                                       4 current / 4 desired
  Conditions:
    Type            Status  Reason            Message
    ----            ------  ------            -------
    AbleToScale     True    ReadyForNewScale  recommended size matches current size
    ScalingActive   True    ValidMetricFound  the HPA was able to successfully calculate a replica count from cpu resource utilization (percentage of request)
    ScalingLimited  True    TooManyReplicas   the desired replica count is more than the maximum replica count
  Events:           <none>

What you expected to happen:

The HPA only scales up when the metric is above target, and scales down when it is under target, until Spec.MinReplicas is reached.

How to reproduce it (as minimally and precisely as possible):

I'm not sure. We have 9 HPAs and only one has this problem. I can't see anything unique about this HPA compared to the others. If I delete and recreate the HPA using Helm, the problem persists. It's the same if I recreate the HPA using kubectl autoscale Deployment/my-app-1 --min=2 --max=4 --cpu-percent=40.

Environment:

  • Kubernetes version (use kubectl version): v1.12.6-eks-d69f1b
  • Cloud provider or hardware configuration: AWS EKS
  • OS (e.g: cat /etc/os-release): EKS AMI release v20190327
  • Kernel (e.g. uname -a): 4.14.104-95.84.amzn2.x86_64
  • Network plugin and version (if this is a network-related bug): AWS CNI
  • Metrics-server version: 0.3.2
@max-rocket-internet max-rocket-internet added the kind/bug Categorizes issue or PR as related to a bug. label Jun 6, 2019
@k8s-ci-robot k8s-ci-robot added the needs-sig Indicates an issue or PR lacks a `sig/foo` label and requires one. label Jun 6, 2019
@max-rocket-internet
Author

@kubernetes/sig-autoscaling-bugs

@k8s-ci-robot k8s-ci-robot added sig/autoscaling Categorizes an issue or PR as relevant to SIG Autoscaling. and removed needs-sig Indicates an issue or PR lacks a `sig/foo` label and requires one. labels Jun 6, 2019
@k8s-ci-robot
Contributor

@max-rocket-internet: Reiterating the mentions to trigger a notification:
@kubernetes/sig-autoscaling-bugs

In response to this:

@kubernetes/sig-autoscaling-bugs

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.

@max-rocket-internet
Author

If I scale down manually, the HPA immediately scales back up:

I0606 12:06:57.269751 1 horizontal.go:592] Successful rescale of my-app-1, old size: 2, new size: 4, reason: cpu resource utilization (percentage of request) above target

And these are the resources specified in the deployment:

        resources:
          limits:
            cpu: 2048m
            memory: 4Gi
          requests:
            cpu: 2048m
            memory: 4Gi

@tedyu
Contributor

tedyu commented Jun 7, 2019

Can you attach the log to this issue?

Thanks

@max-rocket-internet
Author

Hi @tedyu

I am using AWS EKS, so the only HPA-related log entries I can see are like this, and nothing more:

I0612 09:51:36.511060 1 horizontal.go:777] Successfully updated status for xxxx
I0612 09:52:04.080816 1 horizontal.go:777] Successfully updated status for yyyy
I0612 09:52:34.415303 1 horizontal.go:777] Successfully updated status for zzzz

@max-rocket-internet
Author

@tedyu Is there some other way I can get more debug information?

@tedyu
Contributor

tedyu commented Jun 12, 2019

There are logs at higher verbosity. e.g. (not that this would be logged in your cluster)

		klog.V(4).Infof("proposing %v desired replicas (based on %s from %s) for %s", metricDesiredReplicas, metricName, metricTimestamp, reference)

See if you can turn up the verbosity.
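
For reference, on a self-managed control plane the verbosity can usually be raised by passing the --v flag to kube-controller-manager; a minimal sketch, assuming a kubeadm-style static pod at /etc/kubernetes/manifests/kube-controller-manager.yaml:

# /etc/kubernetes/manifests/kube-controller-manager.yaml (excerpt, illustrative)
spec:
  containers:
  - name: kube-controller-manager
    command:
    - kube-controller-manager
    - --v=4          # klog verbosity; makes V(4) lines like the one above appear
    # ...existing flags unchanged...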

@max-rocket-internet
Author

@tedyu thanks for the suggestion, but I don't think we have that option on EKS as it's a managed K8S service: https://docs.aws.amazon.com/eks/latest/userguide/control-plane-logs.html

I guess I have to chase it up with AWS support?

@max-rocket-internet max-rocket-internet changed the title HPA scales to maximum even though metric is under target HPA doesn't scale down to minReplicas even though metric is under target Jun 17, 2019
@max-rocket-internet
Author

Our deployment strategy could also be relevant:

spec:
  strategy:
    rollingUpdate:
      maxSurge: 100%
      maxUnavailable: 0
    type: RollingUpdate

@max-rocket-internet
Author

Just look at these events and the missing reasons:

Events:
  Type     Reason                        Age                    From                       Message
  ----     ------                        ----                   ----                       -------
  Normal   SuccessfulRescale             4m59s                  horizontal-pod-autoscaler  New size: 8; reason: All metrics below target
  Normal   SuccessfulRescale             4m44s                  horizontal-pod-autoscaler  New size: 16; reason:
  Normal   SuccessfulRescale             4m29s                  horizontal-pod-autoscaler  New size: 32; reason:
  Normal   SuccessfulRescale             4m14s                  horizontal-pod-autoscaler  New size: 64; reason:
  Normal   SuccessfulRescale             3m59s                  horizontal-pod-autoscaler  New size: 128; reason:
  Normal   SuccessfulRescale             3m44s                  horizontal-pod-autoscaler  New size: 200; reason:
  Normal   SuccessfulRescale             0s (x2 over 5m14s)     horizontal-pod-autoscaler  New size: 4; reason: All metrics below target

@vdemonchy

I'm having the exact same issue as you @max-rocket-internet, also running on EKS with their latest version available to date. This is frustrating :(

@SocietyCao

@vdemonchy There may be sudden bursts of traffic that sometimes push CPU utilization to 100%; in that case, it won't scale down.

@max-rocket-internet
Author

> sometimes push CPU utilization to 100%; in that case, it won't scale down

This is not the case.

@SocietyCao

Are the pods all ready?
https://kubernetes.io/docs/tasks/run-application/horizontal-pod-autoscale/#algorithm-details

> If there were any missing metrics, we recompute the average more conservatively, assuming those pods were consuming 100% of the desired value in case of a scale down, and 0% in case of a scale up.
I can only speculate that this is the case.
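
To make that paragraph concrete, here is a rough sketch of the conservative averaging it describes; this is illustrative logic with made-up numbers, not the controller's actual code (which also applies a tolerance and re-checks whether the assumption flips the decision):

package main

import (
	"fmt"
	"math"
)

// Pods with missing metrics are assumed to be at 100% of the target when
// deciding a scale-down, and at 0% when deciding a scale-up (per the linked docs).
func desiredReplicas(current int, reported []float64, missing int, target float64, scaleDown bool) int {
	sum := 0.0
	for _, u := range reported {
		sum += u
	}
	if scaleDown {
		sum += float64(missing) * target // count missing pods at 100% of target
	} // for scale-up, missing pods contribute 0%
	avg := sum / float64(len(reported)+missing)
	return int(math.Ceil(float64(current) * avg / target))
}

func main() {
	// 4 pods, target 40%: two report 5% CPU, two have no metrics yet.
	fmt.Println(desiredReplicas(4, []float64{5, 5}, 2, 40, true)) // prints 3 -- the scale-down is damped
}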

@cmanzi

cmanzi commented Aug 31, 2019

@max-rocket-internet Try increasing your metrics resolution from the default. I was experiencing similar behavior; I added the flag --metric-resolution=5s (the default is 60s), and it seems to be behaving in a much more expected manner now.

As @SocietyCao said, in my case it appears that the HPA was rapidly scaling up my service, creating a bunch of pods that didn't have any metrics yet, which in turn caused the HPA to assume the pods were under load. Seems like it can create a feedback loop of sorts.
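
For anyone who wants to try the same thing: the flag goes on the metrics-server container itself. A minimal sketch, assuming metrics-server is deployed as a Deployment in kube-system (existing args vary per install):

# kubectl -n kube-system edit deployment metrics-server  (excerpt, illustrative)
spec:
  template:
    spec:
      containers:
      - name: metrics-server
        args:
        - --kubelet-preferred-address-types=InternalIP   # example of a pre-existing arg
        - --metric-resolution=5s                         # scrape more frequently than the default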

@wxwang33

We are seeing the same behavior. Has this issue been resolved?

@cmanzi

cmanzi commented Sep 27, 2019

@wxwang33 What is your metrics-server resolution set to? That fixed it for me (on 1.14.6).

@wxwang33

I will check later as I don't have direct access to it. Will update and thanks for the quick response!

@itninja-hue

I am having the same issue. At first I thought it wasn't scaling down, but as time went by, exactly 6 minutes, the HPA scaled down the pods.
Looking at the events log, an event to scale down the pods was triggered the moment the CPU load went away, but it got executed 6 minutes later. I guess this is a feature, or how it is supposed to work; it would be nice if we could get a gracePeriod option to narrow down or reduce that time.
I am running k8s 1.16 on Vagrant (test cluster), 1 master, 2 workers.
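
The roughly five-to-six minute delay described above matches the HPA's downscale stabilization window, which defaults to 5 minutes (the kube-controller-manager flag --horizontal-pod-autoscaler-downscale-stabilization). With the autoscaling/v2 API (autoscaling/v2beta2 on older clusters) it can be tuned per HPA via spec.behavior; a minimal sketch with illustrative names and values:

apiVersion: autoscaling/v2
kind: HorizontalPodAutoscaler
metadata:
  name: my-app-1                       # illustrative
spec:
  scaleTargetRef:
    apiVersion: apps/v1
    kind: Deployment
    name: my-app-1
  minReplicas: 2
  maxReplicas: 4
  behavior:
    scaleDown:
      stabilizationWindowSeconds: 60   # default is 300 (5 minutes)
  metrics:
  - type: Resource
    resource:
      name: cpu
      target:
        type: Utilization
        averageUtilization: 40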

@k8s-ci-robot k8s-ci-robot added the lifecycle/stale Denotes an issue or PR has remained open with no activity and has become stale. label Jun 21, 2022
@rosoft2001

still an issue

@coding-bunny

This is also an issue for us with the HPA when using 2 metrics.
With memory added to our HPA, we do see that the downscaling works, but at some point the downscaling stops, even though all reported metrics are below the configured threshold for the averageUtilization.

[Screenshot 2022-06-22 at 17 19 19]

We haven't tried setting the threshold to double the expected size to see if this makes a difference.
The HPA does downscale, but only up to a certain point.
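
For context, a minimal sketch of the kind of two-metric (CPU plus memory) autoscaling/v2 HPA being described here; names and thresholds are illustrative:

apiVersion: autoscaling/v2
kind: HorizontalPodAutoscaler
metadata:
  name: my-service                     # illustrative
spec:
  scaleTargetRef:
    apiVersion: apps/v1
    kind: Deployment
    name: my-service
  minReplicas: 2
  maxReplicas: 10
  metrics:
  - type: Resource
    resource:
      name: cpu
      target:
        type: Utilization
        averageUtilization: 70
  - type: Resource
    resource:
      name: memory
      target:
        type: Utilization
        averageUtilization: 70

With multiple metrics the HPA takes the largest replica count proposed by any single metric, so a memory metric that stays elevated after load drops can hold the replica count up even when CPU is well below its target.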

@vitobotta

> This is also an issue for us with the HPA when using 2 metrics. [...] The HPA does downscale, but only up to a certain point.

OT - which dashboard is that?

@coding-bunny

> OT - which dashboard is that?

ArgoCD from our private clusters

@vitobotta

> ArgoCD from our private clusters

Thanks!

@otakuinside

Based on my experience, the key factor is that the metric is based on the request, not the limit. The condition that matters is usage vs. request, combined with the scaling criteria.
On the other side, if you have 2 replicas and both are at 49%, scaling down to 1 would leave the single remaining replica at roughly 98% (its own 49% plus the 49% from the terminated, scaled-down pod). If usage is not under 50%, a scale-down would push the lone remaining pod over 100% of usage, which would immediately trigger a scale-up condition again (which would be a ridiculous loop :p).
If you have 3 pods instead of 2, the percentage changes from 50% to 66%, but the underlying analysis remains the same.
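
For reference, the documented replica calculation is desiredReplicas = ceil(currentReplicas * currentMetricValue / desiredMetricValue); a small sketch with made-up numbers (the real controller additionally applies a ~10% tolerance and the downscale stabilization window, omitted here):

package main

import (
	"fmt"
	"math"
)

// Documented HPA scaling rule: desired = ceil(current * currentUtilization / targetUtilization).
func desired(current int, currentUtil, targetUtil float64) int {
	return int(math.Ceil(float64(current) * currentUtil / targetUtil))
}

func main() {
	fmt.Println(desired(2, 49, 50)) // 2 -- no scale-down: the survivor would sit near 98%
	fmt.Println(desired(2, 20, 50)) // 1 -- scale-down is safe: the survivor lands around 40%
}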

@k8s-triage-robot

The Kubernetes project currently lacks enough active contributors to adequately respond to all issues and PRs.

This bot triages issues and PRs according to the following rules:

  • After 90d of inactivity, lifecycle/stale is applied
  • After 30d of inactivity since lifecycle/stale was applied, lifecycle/rotten is applied
  • After 30d of inactivity since lifecycle/rotten was applied, the issue is closed

You can:

  • Mark this issue or PR as fresh with /remove-lifecycle rotten
  • Close this issue or PR with /close
  • Offer to help out with Issue Triage

Please send feedback to sig-contributor-experience at kubernetes/community.

/lifecycle rotten

@k8s-ci-robot k8s-ci-robot added lifecycle/rotten Denotes an issue or PR that has aged beyond stale and will be auto-closed. and removed lifecycle/stale Denotes an issue or PR has remained open with no activity and has become stale. labels Aug 6, 2022
@jimethn

jimethn commented Sep 1, 2022

Seeing this in kubernetes 1.21 when using custom metrics. The metric drops below target and the HPA responds by scaling up.

  Type    Reason             Age                   From                       Message
  ----    ------             ----                  ----                       -------
  Normal  SuccessfulRescale  5m52s (x49 over 15h)  horizontal-pod-autoscaler  New size: 4; reason: Service metric cortex_query_scheduler_queue_length above target
  Normal  SuccessfulRescale  2m20s (x34 over 9h)   horizontal-pod-autoscaler  New size: 8; reason: All metrics below target

@ismashal

ismashal commented Sep 7, 2022

I'm facing the same issue; scale down is not working for me:

NAME                                  READY   STATUS    RESTARTS   AGE
pod/device-service-78668474b5-fwd75   1/1     Running   0          156m
pod/device-service-78668474b5-g9cvg   1/1     Running   0          89m
pod/device-service-78668474b5-hz79v   1/1     Running   0          37m

NAME                                                 REFERENCE                   TARGETS           MINPODS   MAXPODS   REPLICAS   AGE
horizontalpodautoscaler.autoscaling/device-service   Deployment/device-service   33%/40%, 0%/70%   1         3         3          3h46m

@markandersontrocme

Having the same issues with some HPAs in 1.21 using the autoscaling/v1 API. Can anyone confirm whether this works better with autoscaling/v2?

@k8s-triage-robot

The Kubernetes project currently lacks enough active contributors to adequately respond to all issues and PRs.

This bot triages issues according to the following rules:

  • After 90d of inactivity, lifecycle/stale is applied
  • After 30d of inactivity since lifecycle/stale was applied, lifecycle/rotten is applied
  • After 30d of inactivity since lifecycle/rotten was applied, the issue is closed

You can:

  • Reopen this issue with /reopen
  • Mark this issue as fresh with /remove-lifecycle rotten
  • Offer to help out with Issue Triage

Please send feedback to sig-contributor-experience at kubernetes/community.

/close not-planned

@k8s-ci-robot
Contributor

@k8s-triage-robot: Closing this issue, marking it as "Not Planned".

In response to this:

The Kubernetes project currently lacks enough active contributors to adequately respond to all issues and PRs.

This bot triages issues according to the following rules:

  • After 90d of inactivity, lifecycle/stale is applied
  • After 30d of inactivity since lifecycle/stale was applied, lifecycle/rotten is applied
  • After 30d of inactivity since lifecycle/rotten was applied, the issue is closed

You can:

  • Reopen this issue with /reopen
  • Mark this issue as fresh with /remove-lifecycle rotten
  • Offer to help out with Issue Triage

Please send feedback to sig-contributor-experience at kubernetes/community.

/close not-planned

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.

@k8s-ci-robot k8s-ci-robot closed this as not planned Won't fix, can't repro, duplicate, stale Oct 16, 2022
@h0jeZvgoxFepBQ2C

/remove-lifecycle rotten

this is still a valid issue

@k8s-ci-robot k8s-ci-robot removed the lifecycle/rotten Denotes an issue or PR that has aged beyond stale and will be auto-closed. label Oct 16, 2022
@h0jeZvgoxFepBQ2C

/reopen

@k8s-ci-robot
Contributor

@h0jeZvgoxFepBQ2C: You can't reopen an issue/PR unless you authored it or you are a collaborator.

In response to this:

/reopen

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.

@h0jeZvgoxFepBQ2C

h0jeZvgoxFepBQ2C commented Oct 16, 2022

Could someone reopen this issue? I'm not allowed to do it

@coding-bunny

Yeah, this issue needs to be reopened as this is a blocker for using memory HPA in Kubernetes.

@dani-newman

Any news?

@h0jeZvgoxFepBQ2C

@liggitt @wojtek-t @pohly @smarterclayton Could anyone of you reopen this issue maybe?

@nicon89

nicon89 commented Jan 16, 2023

Is there any update on this?

@WFA-hhsieh

@markandersontrocme Still seeing the same issue on Kubernetes 1.24 with autoscaling/v2.

@max-rocket-internet
Author

So much has changed in Kubernetes since I opened this issue but I guess some things never change: I have this issue again 😅

@max-rocket-internet
Author

It seems some others also have the same problem. Rather than reopen an old issue with tonnes of comments I've created a new one to start fresh with autoscaling/v2: #120875
