
Respect the cooldownPeriod for the first deployment and let the service stay up and running based on the replica count the first time. #5008

Closed
nuved opened this issue Sep 27, 2023 · 22 comments · Fixed by #5478 or kedacore/keda-docs#1337
Labels
feature-request, needs-discussion

Comments

@nuved

nuved commented Sep 27, 2023

Proposal

Hey
From my understanding based on the current documentation, the cooldownPeriod in KEDA only takes effect after a scaling trigger has occurred. When a Deployment or StatefulSet is first deployed, KEDA immediately scales it to minReplicaCount, regardless of the cooldownPeriod.

It would be incredibly beneficial if the cooldownPeriod could also apply when scaling resources for the first time. Specifically, this would mean that upon deployment, the resource scales based on the replicas defined in the Deployment or StatefulSet and respects the cooldownPeriod before any subsequent scaling operations.

Use-Case

This enhancement would provide teams with a more predictable deployment behavior, especially during CI/CD processes. Ensuring that a new version of a service is stable upon deployment is critical, and this change would give teams more confidence during releases.

Is this a feature you are interested in implementing yourself?

No

Anything else?

No response

@nuved added the feature-request and needs-discussion labels on Sep 27, 2023
@JorTurFer
Member

Hello,
During the CD process KEDA doesn't modify the workload. I mean, IIRC you are right about the first-time deployment: KEDA doesn't take it into account (for scaling to 0, never for scaling to minReplicaCount).

Do you see this behaviour on every CD? I mean, does this happen every time you deploy your workload? Is your workload scaled to 0 or to minReplicaCount? Could you share an example of your ScaledObject and also an example of your workload?

@nuved
Author

nuved commented Sep 28, 2023

Hello, During the CD process KEDA doesn't modify the workload. I mean, IIRC you are right about the first-time deployment: KEDA doesn't take it into account (for scaling to 0, never for scaling to minReplicaCount).

Do you see this behaviour on every CD? I mean, does this happen every time you deploy your workload? Is your workload scaled to 0 or to minReplicaCount? Could you share an example of your ScaledObject and also an example of your workload?

Hey,

This is my configuration for KEDA; the minimum replica count is set to 0.

spec:
  advanced:
    horizontalPodAutoscalerConfig:
      behavior:
        scaleDown:
          policies:
          - periodSeconds: 300
            type: Pods
            value: 1
          stabilizationWindowSeconds: 1800
        scaleUp:
          policies:
          - periodSeconds: 300
            type: Percent
            value: 100
          stabilizationWindowSeconds: 0
    restoreToOriginalReplicaCount: true
  cooldownPeriod: 1800
  fallback:
    failureThreshold: 3
    replicas: 1
  maxReplicaCount: 10
  minReplicaCount: 0
  pollingInterval: 20
  scaleTargetRef:
    name: test
  triggers:
  - authenticationRef:
      name: test
    metadata:
      mode: QueueLength
      protocol: amqp
      queueName: test
      value: "150"
    type: rabbitmq

On the other side, the replica count of the service's Deployment is set to 1. The deployment also has liveness and readiness probes, and most of the time the service needs about 3 minutes to be up and ready.

This is the command that our CD runs each time it deploys the service.

helm upgrade test ./ --install -f value.yaml -n test --set 'image.tag=test_6.0.0' --atomic --timeout 1200s

When using Helm with the --atomic flag, Helm expects the service to be up and its readiness/liveness probes to pass before marking the deployment as successful. However, with KEDA's minReplicaCount set to 0, our service is immediately scaled down to zero replicas, even before the triggers are recognized.

This behavior leads Helm to assume the deployment was successful, while that's not necessarily true. Actually, the service was not up and running for even 20 seconds; it was killed by KEDA because the minimum replica count is set to 0.

I believe that respecting the cooldownPeriod and using the Deployment's replica count when deploying the service would be beneficial in these cases.

For the moment, I have to set the minimum replica count to 1 to work around this issue.

@JorTurFer
Member

On the other side, the replica count of the service's Deployment is set to 1. The deployment also has liveness and readiness probes, and most of the time the service needs about 3 minutes to be up and ready.

Do you mean that your helm chart always sets replicas: 1? Don't you have any condition to skip this setting? The Deployment manifest is idempotent; I mean, whatever you set there will be applied at least for a few seconds: if you set 1, your workload will scale to 1 until the next HPA Controller cycle.

As I said, this could happen the first time you deploy a ScaledObject, but not on subsequent deployments, and the reason behind this behavior on later deployments could be that you are explicitly setting replicas in the Deployment manifest.
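
For illustration, a common Helm pattern for this (the value names here are just an example, not taken from your chart) is to render replicas only when KEDA is not managing the workload, so a re-deploy doesn't reset the replica count that the HPA controls:

spec:
  {{- if not .Values.keda.enabled }}
  replicas: {{ .Values.replicaCount }}
  {{- end }}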

@nuved
Author

nuved commented Sep 28, 2023

On the other side, the replica count of the service's Deployment is set to 1. The deployment also has liveness and readiness probes, and most of the time the service needs about 3 minutes to be up and ready.

Do you mean that your helm chart always sets replicas: 1? Don't you have any condition to skip this setting? The Deployment manifest is idempotent; I mean, whatever you set there will be applied at least for a few seconds: if you set 1, your workload will scale to 1 until the next HPA Controller cycle.

As I said, this could happen the first time you deploy a ScaledObject, but not on subsequent deployments, and the reason behind this behavior on later deployments could be that you are explicitly setting replicas in the Deployment manifest.

Yes, replicas is set to 1 in the Deployment of the service.
I even increased initialDelaySeconds to 300 for the liveness and readiness probes, so normally, when I set KEDA's minimum replica count to 1, Helm waits 300 seconds to get confirmation that the service is up and running.

When I set the minimum replica count to 0, the service is shut down by KEDA after 5 seconds, and Helm reports that the service was deployed successfully, which is not right!

kubectl describe ScaledObject -n test
Normal  KEDAScalersStarted          5s   keda-operator  Started scalers watch
Normal  ScaledObjectReady           5s   keda-operator  ScaledObject is ready for scaling
Normal  KEDAScaleTargetDeactivated  5s   keda-operator  Deactivated apps/v1.Deployment test/test from 1 to 0

And please consider that the ScaledObject is applied by Helm alongside other resources like the Deployment, Ingress, and Service.

Moreover, we use KEDA in our staging environments, which are not under load most of the time.
So most of the time there are no messages in the queue and the queue length is 0.
So the replica count is set to 0 and that's fine!
The issue arises when we deploy a new version: how can we make sure the service is working well and is not crashing when it gets shut down by KEDA?

As a result, it would be great if KEDA used the Deployment's replicas value as the base each time.
With this configuration, KEDA should set the replica count to 1 on deploy:
replicas: 1
maxReplicaCount: 10
minReplicaCount: 0

With this one, KEDA should still set the replica count to 1 when deploying the service:
replicas: 1
maxReplicaCount: 10
minReplicaCount: 5

And in this case, KEDA can set the replica count to 5 when re-deploying the service:
replicas: 5
maxReplicaCount: 10
minReplicaCount: 5

I can even set a time-based annotation on the ScaledObject (with the help of Helm) so the ScaledObject is updated after each deploy.
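
For example, a minimal sketch of that time-based annotation in a Helm-templated ScaledObject (the annotation key is made up for illustration, and this assumes the sprig unixEpoch function available in Helm):

metadata:
  annotations:
    deployed-at: {{ now | unixEpoch | quote }}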

@JorTurFer
Member

I guess that we could implement some initialDelay or something like that, but I'm still not sure why this happens after the first deployment. The first time it can happen, but after that I thought it shouldn't.
Am I missing any important point, @zroubalik?

@zroubalik
Member

Yeah, this is something we can add. Right now KEDA immediately scales to minReplicas if there's no load.

@pintonunes

+1.

We have exactly the same requirement. KEDA should have an initialDelay before starting to make scaling decisions. This is very helpful when you deploy something and need it immediately available. Then KEDA should scale things to idle/minimum if not used.

Imagine a deployment with Prometheus as the trigger (or any other pull-based trigger). The deployment is immediately scaled to zero, and only after the next polling interval will it be available again.
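
For example, a minimal sketch of such a pull-based trigger (the server address and query are placeholders):

triggers:
- type: prometheus
  metadata:
    serverAddress: http://prometheus.monitoring.svc:9090
    query: sum(rate(http_requests_total[2m]))
    threshold: "5"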

@helloxjade

Proposal

Hey From my understanding based on the current documentation, the cooldownPeriod in KEDA only takes effect after a scaling trigger has occurred. When a Deployment or StatefulSet is first deployed, KEDA immediately scales it to minReplicaCount, regardless of the cooldownPeriod.

It would be incredibly beneficial if the cooldownPeriod could also apply when scaling resources for the first time. Specifically, this would mean that upon deployment, the resource scales based on the replicas defined in the Deployment or StatefulSet and respects the cooldownPeriod before any subsequent scaling operations.

Use-Case

This enhancement would provide teams with a more predictable deployment behavior, especially during CI/CD processes. Ensuring that a new version of a service is stable upon deployment is critical, and this change would give teams more confidence during releases.

Is this a feature you are interested in implementing yourself?

No

Anything else?

No response

I agree that implementing a cooldown period for initial scaling in KEDA is extremely beneficial, especially when using KEDA for serverless architectures. It's crucial to have a cooldown period after the first deployment, before allowing the system to scale down to zero. This cooldown would provide a stabilization phase for the system, ensuring that the service runs smoothly post-deployment before scaling down. Such a design not only enhances post-deployment stability but also aids in assessing the deployment's effectiveness before the service scales down to zero. This cooldown period is particularly important for ensuring smooth and predictable scaling behavior in serverless environments.

@JorTurFer
Member

Maybe we can easily fix this by just honoring cooldownPeriod in this case too. I think we check whether lastActive has a value, but we could just assign a default value. WDYT @kedacore/keda-core-contributors?

@zroubalik
Member

This is implementable, but probably as a new setting, to not break existing behavior?

@JorTurFer
Member

The bug has become a feature? xD
Yep, we can use a new field for it

@zroubalik
Member

The bug has become a feature? xD Yep, we can use a new field for it

Well, it has been there since the beginning 🤷‍♂️ 😄 I am open to discussion.

@pintonunes

pintonunes commented Dec 20, 2023

The workaround we have in place right now, since we deploy ScaledObjects with an operator, is to not add idleReplicaCount while the ScaledObject has existed (per its creationTimestamp) for less than the cooldownPeriod.
After that we set idleReplicaCount to zero.
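
Roughly, the rendered ScaledObject looks like this at first, and idleReplicaCount: 0 is only added once the object is older than the cooldownPeriod (the values here are illustrative):

spec:
  cooldownPeriod: 1800
  minReplicaCount: 1
  maxReplicaCount: 10
  # idleReplicaCount: 0 is added by the operator once the cooldownPeriod has elapsed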

@528548004

I have a problem. When I use a ScaledObject to manage a Deployment, the cooldownPeriod configured at creation works normally. After an update, the modified cooldownPeriod no longer takes effect. The cooldown also does not take effect when the minimum replica count is set to one, but it works when it is zero. #5321

@528548004

@JorTurFer Does cooldownPeriod only take effect when minReplicaCount is equal to 0?

@JorTurFer
Member

Yes, it only works when minReplicaCount or idleReplicaCount is zero

@thincal

thincal commented Jan 22, 2024

@JorTurFer Hello, is there any plan for this fix? Thanks.

@JorTurFer
Member

I'm not sure whether there is consensus about how to fix it. @zroubalik, is a new field like initialDelay the way to go?

Once a solution is agreed on, anyone who is willing to contribute can help with the fix

@zroubalik
Member

Yeah, a new field, maybe initialCooldownPeriod, to be consistent in naming?

And maybe put it into the advanced section? Not sure.
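
For illustration, it might end up looking something like this on the ScaledObject spec (the exact field name and placement were still open at this point):

spec:
  pollingInterval: 20
  cooldownPeriod: 1800
  initialCooldownPeriod: 1800
  minReplicaCount: 0
  maxReplicaCount: 10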

@lgy1027

lgy1027 commented Feb 2, 2024

I support the proposal

@thincal

thincal commented Mar 20, 2024

Is there any progress here? We really need this feature :) Thanks.

@JorTurFer
Member

The feature is almost ready; small changes are pending (but KubeCon has been in the middle)
