
KEDA treats un-paused ScaledObject as paused when the operator restarts after going down #5490

Closed
tr1et opened this issue Feb 7, 2024 · 4 comments
Labels
bug (Something isn't working) · stale (All issues that are marked as stale due to inactivity)

Comments

@tr1et

tr1et commented Feb 7, 2024

Report

We are using KEDA 2.12.1 on Azure Kubernetes Service version 1.27.3. When the keda-operator hit an issue, went down, and restarted, it flagged the previously un-paused ScaledObjects as paused and scaled their targets to 0 pods instead of keeping them running with 1 pod (as set in minReplicaCount).
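For context, this is roughly how the configured floor and the current replica count can be checked on one of the affected services (the names below match the masked names in the logs; adjust to your own objects):

kubectl get scaledobject a-service-green -n prd -o jsonpath='{.spec.minReplicaCount}'
kubectl get deployment a-service-green -n prd -o jsonpath='{.spec.replicas}'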

Expected Behavior

The operator should keep the pods running after it restarts.

Actual Behavior

It scaled the Deployments to 0 replicas and killed all running pods of those Deployments.

Steps to Reproduce the Problem

  1. Pause a ScaledObject: kubectl annotate scaledobject "$deployment" autoscaling.keda.sh/paused-replicas="0" -n "$NAMESPACE" --overwrite
  2. Un-pause the ScaledObject: kubectl annotate scaledobject "$deployment" autoscaling.keda.sh/paused-replicas- -n "$NAMESPACE" --overwrite
  3. Take the keda-operator down so it restarts (see the example below).

Logs from KEDA operator

Service names are masked.
Operator down logs:

2024-02-06T20:06:25Z    ERROR    Reconciler error    {"controller": "cert-rotator", "object": {"name":"kedaorg-certs","namespace":"keda"}, "namespace": "keda", "name": "kedaorg-certs", "reconcileID": "cd95ed0d-c848-4a17-8b1a-d4b2c21724e4", "error": "Operation cannot be fulfilled on apiservices.apiregistration.k8s.io \"v1beta1.external.metrics.k8s.io\": the object has been modified; please apply your changes to the latest version and try again"}
sigs.k8s.io/controller-runtime/pkg/internal/controller.(*Controller).reconcileHandler
/workspace/vendor/sigs.k8s.io/controller-runtime/pkg/internal/controller/controller.go:329
sigs.k8s.io/controller-runtime/pkg/internal/controller.(*Controller).processNextWorkItem
/workspace/vendor/sigs.k8s.io/controller-runtime/pkg/internal/controller/controller.go:266
sigs.k8s.io/controller-runtime/pkg/internal/controller.(*Controller).Start.func2.2
/workspace/vendor/sigs.k8s.io/controller-runtime/pkg/internal/controller/controller.go:227
2024-02-06T20:06:25Z    INFO    cert-rotation    no cert refresh needed
2024-02-06T20:06:25Z    INFO    cert-rotation    Ensuring CA cert    {"name": "keda-admission", "gvk": "admissionregistration.k8s.io/v1, Kind=ValidatingWebhookConfiguration", "name": "keda-admission", "gvk": "admissionregistration.k8s.io/v1, Kind=ValidatingWebhookConfiguration"}
2024-02-06T20:06:33Z    INFO    cert-rotation    Ensuring CA cert    {"name": "v1beta1.external.metrics.k8s.io", "gvk": "apiregistration.k8s.io/v1, Kind=APIService", "name": "v1beta1.external.metrics.k8s.io", "gvk": "apiregistration.k8s.io/v1, Kind=APIService"}
E0206 20:06:35.635590       1 leaderelection.go:332] error retrieving resource lock keda/operator.keda.sh: Get "https://10.8.0.1:443/apis/coordination.k8s.io/v1/namespaces/keda/leases/operator.keda.sh": context deadline exceeded
I0206 20:06:35.635622       1 leaderelection.go:285] failed to renew lease keda/operator.keda.sh: timed out waiting for the condition
2024-02-06T20:06:35Z    ERROR    setup    problem running manager    {"error": "leader election lost"}
main.main
/workspace/cmd/operator/main.go:300
runtime.main
/usr/local/go/src/runtime/proc.go:250

Operator restarted logs:

2024-02-06T20:06:37Z    INFO    setup    Starting manager
2024-02-06T20:06:37Z    INFO    setup    KEDA Version: 2.12.1
2024-02-06T20:06:37Z    INFO    setup    Git Commit: dc76ca70f19c22e8f0c806f84d95256d771f3dc9
2024-02-06T20:06:37Z    INFO    setup    Go Version: go1.20.8
2024-02-06T20:06:37Z    INFO    setup    Go OS/Arch: linux/amd64
2024-02-06T20:06:37Z    INFO    setup    Running on Kubernetes 1.27    {"version": "v1.27.3"}
2024-02-06T20:06:37Z    INFO    starting server    {"kind": "health probe", "addr": "[::]:8081"}
I0206 20:06:37.889130       1 leaderelection.go:250] attempting to acquire leader lease keda/operator.keda.sh...
2024-02-06T20:06:37Z    INFO    controller-runtime.metrics    Starting metrics server
2024-02-06T20:06:37Z    INFO    controller-runtime.metrics    Serving metrics server    {"bindAddress": ":8080", "secure": false}
I0206 20:06:56.178956       1 leaderelection.go:260] successfully acquired lease keda/operator.keda.sh
...
2024-02-06T20:06:58Z    INFO    scaleexecutor    Successfully scaled target to paused replicas count    {"scaledobject.Name": "a-service-green", "scaledObject.Namespace": "prd", "scaleTarget.Name": "a-service-green", "paused replicas": 0}
2024-02-06T20:06:58Z    INFO    scaleexecutor    Successfully scaled target to paused replicas count    {"scaledobject.Name": "b-service-green", "scaledObject.Namespace": "prd", "scaleTarget.Name": "b-service-green", "paused replicas": 0}
2024-02-06T20:06:58Z    INFO    scaleexecutor    Successfully scaled target to paused replicas count    {"scaledobject.Name": "c-service-green", "scaledObject.Namespace": "prd", "scaleTarget.Name": "c-service-green", "paused replicas": 0}
2024-02-06T20:06:58Z    INFO    scaleexecutor    Successfully scaled target to paused replicas count    {"scaledobject.Name": "d-service-green", "scaledObject.Namespace": "prd", "scaleTarget.Name": "d-service-green", "paused replicas": 0}

KEDA Version

2.12.1

Kubernetes Version

1.27

Platform

Microsoft Azure

Scaler Details

Azure Service Bus, CPU, Memory

Anything else?

No response

@tr1et added the bug label Feb 7, 2024
@JorTurFer
Member

JorTurFer commented Feb 7, 2024

Hi,
I've tested it and I can't reproduce the issue :(

I don't think this is related to KEDA, because KEDA is stateless: after a restart, KEDA pulls the ScaledObjects from the API Server and works based on them.
Based on these lines:

I0206 20:06:35.635622       1 leaderelection.go:285] failed to renew lease keda/operator.keda.sh: timed out waiting for the condition
2024-02-06T20:06:35Z    ERROR    setup    problem running manager    {"error": "leader election lost"}

I think something has happened on the control-plane side, and maybe there was a rollback in etcd that caused the annotation to be placed again (permanently or just for a while). Have you checked whether the annotation is still there?
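
For example, something like this dumps the annotations on one of the affected ScaledObjects (masked names taken from your logs), so you can see whether autoscaling.keda.sh/paused-replicas is still set:

kubectl get scaledobject a-service-green -n prd -o jsonpath='{.metadata.annotations}'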

@tomkerkhove, do you know if this is something @tr1et can ask about via a support ticket?

@tr1et
Author

tr1et commented Feb 19, 2024

Thanks @JorTurFer, let me try to reproduce the issue in our dev environment.

I don't have the exact annotation info from after the restart right now (we deleted and re-created the ScaledObjects after finding the issue, to be safe), but I remember that the "paused" annotations were not there.


stale bot commented Apr 19, 2024

This issue has been automatically marked as stale because it has not had recent activity. It will be closed in 7 days if no further activity occurs. Thank you for your contributions.

stale bot added the stale label Apr 19, 2024

stale bot commented Apr 27, 2024

This issue has been automatically closed due to inactivity.

@stale stale bot closed this as completed Apr 27, 2024