
Non-leading replica fails due to not started cert-controller #1445

Closed
woehrl01 opened this issue Dec 12, 2023 · 22 comments · Fixed by #1509
Assignees
Labels
kind/bug Categorizes issue or PR as related to a bug.

Comments

@woehrl01
Contributor

What happened:

If I run two replicas, the manager crashes after a while. It looks like the health probe fails and the pod gets restarted.

What you expected to happen:

Both pods keep running fine.

How to reproduce it (as minimally and precisely as possible):

Anything else we need to know?:

I1212 18:02:44.725166 1 leaderelection.go:250] attempting to acquire leader lease kueue-system/c1f6bfd2.kueue.x-k8s.io...
{"level":"info","ts":"2023-12-12T18:02:44.725142061Z","caller":"controller/controller.go:178","msg":"Starting EventSource","controller":"cert-rotator","source":"kind source: *v1.Secret"}
{"level":"info","ts":"2023-12-12T18:02:44.726132561Z","caller":"controller/controller.go:178","msg":"Starting EventSource","controller":"cert-rotator","source":"kind source: *unstructured.Unstructured"}
{"level":"info","ts":"2023-12-12T18:02:44.726421287Z","caller":"controller/controller.go:178","msg":"Starting EventSource","controller":"cert-rotator","source":"kind source: *unstructured.Unstructured"}
{"level":"info","ts":"2023-12-12T18:02:44.726453798Z","caller":"controller/controller.go:186","msg":"Starting Controller","controller":"cert-rotator"}
{"level":"error","ts":"2023-12-12T18:04:44.726776367Z","caller":"controller/controller.go:203","msg":"Could not wait for Cache to sync","controller":"cert-rotator","error":"failed to wait for cert-rotator caches to sync: timed out waiting for cache to be synced for Kind *v1.Secret","stacktrace":"sigs.k8s.io/controller-runtime/pkg/internal/controller.(*Controller).Start.func2.1\n\t/go/pkg/mod/sigs.k8s.io/controller-runtime@v0.16.2/pkg/internal/controller/controller.go:203\nsigs.k8s.io/controller-runtime/pkg/internal/controller.(*Controller).Start.func2\n\t/go/pkg/mod/sigs.k8s.io/controller-runtime@v0.16.2/pkg/internal/controller/controller.go:208\nsigs.k8s.io/controller-runtime/pkg/internal/controller.(*Controller).Start\n\t/go/pkg/mod/sigs.k8s.io/controller-runtime@v0.16.2/pkg/internal/controller/controller.go:234\nsigs.k8s.io/controller-runtime/pkg/manager.(*runnableGroup).reconcile.func1\n\t/go/pkg/mod/sigs.k8s.io/controller-runtime@v0.16.2/pkg/manager/runnable_group.go:223"}
{"level":"info","ts":"2023-12-12T18:04:44.72690754Z","caller":"manager/internal.go:516","msg":"Stopping and waiting for non leader election runnables"}
{"level":"info","ts":"2023-12-12T18:04:44.72693391Z","caller":"manager/internal.go:520","msg":"Stopping and waiting for leader election runnables"}
{"level":"error","ts":"2023-12-12T18:04:44.726934751Z","caller":"manager/internal.go:490","msg":"error received after stop sequence was engaged","error":"failed waiting for reader to sync","errorVerbose":"failed waiting for reader to sync\ngithub.com/open-policy-agent/cert-controller/pkg/rotator.(*CertRotator).Start\n\t/go/pkg/mod/github.com/open-policy-agent/cert-controller@v0.10.0/pkg/rotator/rotator.go:258\nsigs.k8s.io/controller-runtime/pkg/manager.(*runnableGroup).reconcile.func1\n\t/go/pkg/mod/sigs.k8s.io/controller-runtime@v0.16.2/pkg/manager/runnable_group.go:223\nruntime.goexit\n\t/usr/local/go/src/runtime/asm_amd64.s:1650","stacktrace":"sigs.k8s.io/controller-runtime/pkg/manager.(*controllerManager).engageStopProcedure.func1\n\t/go/pkg/mod/sigs.k8s.io/controller-runtime@v0.16.2/pkg/manager/internal.go:490"}
{"level":"info","ts":"2023-12-12T18:04:44.72773714Z","caller":"manager/internal.go:526","msg":"Stopping and waiting for caches"}
{"level":"info","ts":"2023-12-12T18:04:44.727923434Z","caller":"manager/internal.go:530","msg":"Stopping and waiting for webhooks"}
{"level":"info","ts":"2023-12-12T18:04:44.727976656Z","caller":"manager/internal.go:533","msg":"Stopping and waiting for HTTP servers"}
{"level":"info","ts":"2023-12-12T18:04:44.727990216Z","logger":"controller-runtime.metrics","caller":"server/server.go:231","msg":"Shutting down metrics server with timeout of 1 minute"}
{"level":"info","ts":"2023-12-12T18:04:44.727998496Z","caller":"manager/server.go:43","msg":"shutting down server","kind":"health probe","addr":"[::]:8081"}
{"level":"info","ts":"2023-12-12T18:04:44.728063958Z","caller":"manager/internal.go:537","msg":"Wait completed, proceeding to shutdown the manager"}
{"level":"error","ts":"2023-12-12T18:04:44.728084368Z","logger":"setup","caller":"kueue/main.go:182","msg":"Could not run manager","error":"failed to wait for cert-rotator caches to sync: timed out waiting for cache to be synced for Kind *v1.Secret","stacktrace":"main.main\n\t/workspace/cmd/kueue/main.go:182\nruntime.main\n\t/usr/local/go/src/runtime/proc.go:267"}

Environment:

  • Kubernetes version (use kubectl version): Server Version: v1.28.3-eks-4f4795d
  • Kueue version (use git describe --tags --dirty --always): v0.5.1
  • Cloud provider or hardware configuration: EKS
  • OS (e.g: cat /etc/os-release): bottlerocket os
  • Kernel (e.g. uname -a):
  • Install tools: helm
  • Others:
@woehrl01 added the kind/bug label Dec 12, 2023
@alculquicondor
Contributor

/assign @trasc

@kbakk

kbakk commented Dec 13, 2023

I'm able to reproduce this in Kind:

$ kind create cluster --image kindest/node:v1.28.0
# on v0.5.1 tag in kueue repo
$ helm install  kueue charts/kueue/ --create-namespace --namespace kueue-system --set=controllerManager.replicas=2 --set=controllerManager.manager.image.tag=v0.5.1

The non-leader pod restarts after ~1.5 min and keeps restarting.

Logs:
{"level":"info","ts":"2023-12-13T08:43:32.418662328Z","logger":"setup","caller":"kueue/main.go:120","msg":"Initializing","gitVersion":"","gitCommit":""}
{"level":"info","ts":"2023-12-13T08:43:32.41998334Z","logger":"setup","caller":"kueue/main.go:385","msg":"Successfully loaded configuration","config":"apiVersion: config.kueue.x-k8s.io/v1beta1\nclientConnection:\n  burst: 100\n  qps: 50\ncontroller:\n  groupKindConcurrency:\n    ClusterQueue.kueue.x-k8s.io: 1\n    Job.batch: 5\n    LocalQueue.kueue.x-k8s.io: 1\n    ResourceFlavor.kueue.x-k8s.io: 1\n    Workload.kueue.x-k8s.io: 1\nhealth:\n  healthProbeBindAddress: :8081\nintegrations:\n  frameworks:\n  - batch/job\n  - kubeflow.org/mpijob\n  - ray.io/rayjob\n  - jobset.x-k8s.io/jobset\n  - kubeflow.org/mxjob\n  - kubeflow.org/paddlejob\n  - kubeflow.org/pytorchjob\n  - kubeflow.org/tfjob\n  - kubeflow.org/xgboostjob\n  podOptions:\n    namespaceSelector:\n      matchExpressions:\n      - key: kubernetes.io/metadata.name\n        operator: NotIn\n        values:\n        - kube-system\n        - kueue-system\n    podSelector: {}\ninternalCertManagement:\n  enable: true\n  webhookSecretName: kueue-webhook-server-cert\n  webhookServiceName: kueue-webhook-service\nkind: Configuration\nleaderElection:\n  leaderElect: true\n  leaseDuration: 0s\n  renewDeadline: 0s\n  resourceLock: \"\"\n  resourceName: c1f6bfd2.kueue.x-k8s.io\n  resourceNamespace: \"\"\n  retryPeriod: 0s\nmanageJobsWithoutQueueName: false\nmetrics:\n  bindAddress: :8080\nnamespace: kueue-system\nqueueVisibility:\n  clusterQueues:\n    maxCount: 10\n  updateIntervalSeconds: 5\nwebhook:\n  port: 9443\n"}
{"level":"Level(-2)","ts":"2023-12-13T08:43:32.420171397Z","logger":"setup","caller":"kueue/main.go:136","msg":"K8S Client","qps":50,"burst":100}
{"level":"info","ts":"2023-12-13T08:43:32.43573053Z","logger":"setup","caller":"kueue/main.go:315","msg":"Probe endpoints are configured on healthz and readyz"}
{"level":"info","ts":"2023-12-13T08:43:32.435855285Z","logger":"setup","caller":"kueue/main.go:180","msg":"Starting manager"}
{"level":"info","ts":"2023-12-13T08:43:32.435892166Z","logger":"setup","caller":"kueue/main.go:217","msg":"Waiting for certificate generation to complete"}
{"level":"info","ts":"2023-12-13T08:43:32.435957174Z","caller":"manager/server.go:50","msg":"starting server","kind":"health probe","addr":"[::]:8081"}
{"level":"info","ts":"2023-12-13T08:43:32.435990491Z","logger":"controller-runtime.metrics","caller":"server/server.go:185","msg":"Starting metrics server"}
{"level":"info","ts":"2023-12-13T08:43:32.436136724Z","logger":"controller-runtime.metrics","caller":"server/server.go:224","msg":"Serving metrics server","bindAddress":":8080","secure":false}
{"level":"info","ts":"2023-12-13T08:43:32.537488746Z","caller":"controller/controller.go:178","msg":"Starting EventSource","controller":"cert-rotator","source":"kind source: *v1.Secret"}
{"level":"info","ts":"2023-12-13T08:43:32.537563402Z","caller":"controller/controller.go:178","msg":"Starting EventSource","controller":"cert-rotator","source":"kind source: *unstructured.Unstructured"}
{"level":"info","ts":"2023-12-13T08:43:32.537578014Z","caller":"controller/controller.go:178","msg":"Starting EventSource","controller":"cert-rotator","source":"kind source: *unstructured.Unstructured"}
{"level":"info","ts":"2023-12-13T08:43:32.537587442Z","caller":"controller/controller.go:186","msg":"Starting Controller","controller":"cert-rotator"}
I1213 08:43:32.537593       1 leaderelection.go:250] attempting to acquire leader lease kueue-system/c1f6bfd2.kueue.x-k8s.io...
{"level":"error","ts":"2023-12-13T08:45:32.538752584Z","caller":"controller/controller.go:203","msg":"Could not wait for Cache to sync","controller":"cert-rotator","error":"failed to wait for cert-rotator caches to sync: timed out waiting for cache to be synced for Kind *v1.Secret","stacktrace":"sigs.k8s.io/controller-runtime/pkg/internal/controller.(*Controller).Start.func2.1\n\t/go/pkg/mod/sigs.k8s.io/controller-runtime@v0.16.2/pkg/internal/controller/controller.go:203\nsigs.k8s.io/controller-runtime/pkg/internal/controller.(*Controller).Start.func2\n\t/go/pkg/mod/sigs.k8s.io/controller-runtime@v0.16.2/pkg/internal/controller/controller.go:208\nsigs.k8s.io/controller-runtime/pkg/internal/controller.(*Controller).Start\n\t/go/pkg/mod/sigs.k8s.io/controller-runtime@v0.16.2/pkg/internal/controller/controller.go:234\nsigs.k8s.io/controller-runtime/pkg/manager.(*runnableGroup).reconcile.func1\n\t/go/pkg/mod/sigs.k8s.io/controller-runtime@v0.16.2/pkg/manager/runnable_group.go:223"}
{"level":"info","ts":"2023-12-13T08:45:32.538943092Z","caller":"manager/internal.go:516","msg":"Stopping and waiting for non leader election runnables"}
{"level":"info","ts":"2023-12-13T08:45:32.539056196Z","caller":"manager/internal.go:520","msg":"Stopping and waiting for leader election runnables"}
{"level":"error","ts":"2023-12-13T08:45:32.539191455Z","caller":"manager/internal.go:490","msg":"error received after stop sequence was engaged","error":"failed waiting for reader to sync","errorVerbose":"failed waiting for reader to sync\ngithub.com/open-policy-agent/cert-controller/pkg/rotator.(*CertRotator).Start\n\t/go/pkg/mod/github.com/open-policy-agent/cert-controller@v0.10.0/pkg/rotator/rotator.go:258\nsigs.k8s.io/controller-runtime/pkg/manager.(*runnableGroup).reconcile.func1\n\t/go/pkg/mod/sigs.k8s.io/controller-runtime@v0.16.2/pkg/manager/runnable_group.go:223\nruntime.goexit\n\t/usr/local/go/src/runtime/asm_amd64.s:1650","stacktrace":"sigs.k8s.io/controller-runtime/pkg/manager.(*controllerManager).engageStopProcedure.func1\n\t/go/pkg/mod/sigs.k8s.io/controller-runtime@v0.16.2/pkg/manager/internal.go:490"}
{"level":"info","ts":"2023-12-13T08:45:32.541570151Z","caller":"manager/internal.go:526","msg":"Stopping and waiting for caches"}
{"level":"info","ts":"2023-12-13T08:45:32.541829866Z","caller":"manager/internal.go:530","msg":"Stopping and waiting for webhooks"}
{"level":"info","ts":"2023-12-13T08:45:32.541923813Z","caller":"manager/internal.go:533","msg":"Stopping and waiting for HTTP servers"}
{"level":"info","ts":"2023-12-13T08:45:32.541944674Z","logger":"controller-runtime.metrics","caller":"server/server.go:231","msg":"Shutting down metrics server with timeout of 1 minute"}
{"level":"info","ts":"2023-12-13T08:45:32.542043913Z","caller":"manager/server.go:43","msg":"shutting down server","kind":"health probe","addr":"[::]:8081"}
{"level":"info","ts":"2023-12-13T08:45:32.542234342Z","caller":"manager/internal.go:537","msg":"Wait completed, proceeding to shutdown the manager"}
{"level":"error","ts":"2023-12-13T08:45:32.542345851Z","logger":"setup","caller":"kueue/main.go:182","msg":"Could not run manager","error":"failed to wait for cert-rotator caches to sync: timed out waiting for cache to be synced for Kind *v1.Secret","stacktrace":"main.main\n\t/workspace/cmd/kueue/main.go:182\nruntime.main\n\t/usr/local/go/src/runtime/proc.go:267"}

@kbakk

kbakk commented Dec 13, 2023

I see the restarts on v0.5.0 as well. Not able to reproduce this on v0.4.2.

@alculquicondor
Contributor

Does it only reproduce with helm?
I wonder if this is the culprit #986

@yyzxw what did you base the probe configuration on?

@alculquicondor
Contributor

Maybe it is the probe implementations themselves that are wrong.
The behavior should be:

  • The liveness probe should always be healthy if the pod is up.
  • The readiness probe should be healthy only if it's the leader (see the sketch below).
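
A minimal sketch of that behavior with controller-runtime's healthz checks (illustrative only, not Kueue's current wiring; the check names are made up):

import (
	"fmt"
	"net/http"

	ctrl "sigs.k8s.io/controller-runtime"
	"sigs.k8s.io/controller-runtime/pkg/healthz"
)

func setupProbes(mgr ctrl.Manager) error {
	// Liveness: always healthy while the process is up.
	if err := mgr.AddHealthzCheck("healthz", healthz.Ping); err != nil {
		return err
	}
	// Readiness: only report ready once this replica has won (or does not
	// need) leader election; Elected() returns a channel that is closed at
	// that point.
	elected := mgr.Elected()
	return mgr.AddReadyzCheck("readyz", func(_ *http.Request) error {
		select {
		case <-elected:
			return nil
		default:
			return fmt.Errorf("not the elected leader yet")
		}
	})
}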

@woehrl01
Contributor Author

woehrl01 commented Dec 13, 2023

I've recently started using Kueue and I'm intrigued by its implementation. Given that it operates as an admission webhook, I'm wondering if there should be redundancy for the pods that serve those requests. Wouldn't it make sense for the non-leader pods to at least handle the admission webhooks, while the leader pod also takes care of background processing? That would also mean returning a healthy readiness probe on non-leaders.

@alculquicondor
Contributor

alculquicondor commented Dec 13, 2023

Yeah, ideally all replicas should reply to webhooks.

However, we recently introduced this feature https://github.com/kubernetes-sigs/kueue/tree/main/keps/168-2-pending-workloads-visibility. In this case, only the leader can respond. Another alternative would be for non-leaders to also maintain the queues (we do this in kube-scheduler), so that they can also respond to api-extensions requests.

I'm not actually sure what behavior controller-runtime applies here.

@astefanutti
Member

controller-runtime exposes the LeaderElectionRunnable interface, so controllers can implement its NeedLeaderElection method to control whether the manager should start them in non-leader instances. Also managed webhooks are always started, irrespective of leader election.
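
For reference, a minimal sketch of a runnable that opts out of leader-election gating via that interface (hypothetical type, not Kueue code):

import (
	"context"

	"sigs.k8s.io/controller-runtime/pkg/manager"
)

// alwaysOnRunnable is a hypothetical runnable that should run on every
// replica, leader or not.
type alwaysOnRunnable struct{}

func (r *alwaysOnRunnable) Start(ctx context.Context) error {
	<-ctx.Done() // do work until the manager shuts down
	return nil
}

// NeedLeaderElection implements manager.LeaderElectionRunnable. Returning
// false lets the manager start this runnable on non-leader instances too;
// runnables that don't implement the interface are gated behind the lease.
func (r *alwaysOnRunnable) NeedLeaderElection() bool { return false }

var (
	_ manager.Runnable               = (*alwaysOnRunnable)(nil)
	_ manager.LeaderElectionRunnable = (*alwaysOnRunnable)(nil)
)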

In the case of the OPA cert-controller, there is a RequireLeaderElection option that's correctly set by Kueue, but I suspect there is an issue in cert-controller that prevents it from being taken into account, which is the root cause of this issue. I'll fix it upstream.

For the visibility extension API server, we would need to make sure it's safe to run multiple instances of ClusterQueueReconciler concurrently, or find a way to only run the read-only part?

@astefanutti
Member

/assign

@trasc
Contributor

trasc commented Dec 14, 2023

@astefanutti do you want to work on this?

@astefanutti
Member

astefanutti commented Dec 14, 2023

@trasc yes, I can work on it. The only part where I may need your input is the visibility API server HA, but that may be tackled separately.

@alculquicondor
Contributor

/unassign @trasc

@alculquicondor
Contributor

For the visibility extension API server, we would need to make sure it's safe to run multiple instances of ClusterQueueReconciler concurrently, or find a way to only run the read-only part?

That might be tricky. The CQReconciler updates status based on usage. If each replica maintains the cache and queues, in theory the usage should be in sync, but we could face race conditions. Ideally we would only like to run the event handlers, but not the reconciler itself.

Is there a boolean given by the interface that can tell us whether it's the leader? Then we can just return false from the event handlers.

@alculquicondor
Contributor

In the case of the OPA cert-controller, there is a RequireLeaderElection option that's correctly set by Kueue, but I suspect there is an issue in cert-controller that prevents it from being taken into account, which is the root cause of this issue. I'll fix it upstream.

Do you mean you'll fix ca-cert itself? I wonder if this is somehow related to the e2e failures that we are seeing: #1372 (comment)

@alculquicondor
Contributor

For the sake of truthfulness

/retitle Non-leading replica is restarted due to inaccurate probe implementation

@k8s-ci-robot changed the title from "Non leading replica is crashing" to "Non-leading replica is restarted due to inaccurate probe implementation" Dec 14, 2023
@astefanutti
Member

Is there a boolean given by the interface that can tell us whether it's the leader? Then we can just return false from the event handlers.

The controller-runtime Manager interface exposes an Elected() <-chan struct{} method that returns a channel that's closed only once that manager has been elected leader (or when leader election is disabled). So in theory, it could be possible to pass it to the ClusterQueueReconciler and adapt the handlers' logic depending on it.
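
A rough sketch of how that could be wired, using a predicate that does a non-blocking read on the channel (hypothetical helper, not the actual ClusterQueueReconciler):

import (
	"sigs.k8s.io/controller-runtime/pkg/event"
	"sigs.k8s.io/controller-runtime/pkg/predicate"
)

// leaderOnly drops events on replicas that have not been elected leader.
// The elected channel is the one returned by mgr.Elected().
func leaderOnly(elected <-chan struct{}) predicate.Funcs {
	isLeader := func() bool {
		select {
		case <-elected:
			return true
		default:
			return false
		}
	}
	return predicate.Funcs{
		CreateFunc:  func(event.CreateEvent) bool { return isLeader() },
		UpdateFunc:  func(event.UpdateEvent) bool { return isLeader() },
		DeleteFunc:  func(event.DeleteEvent) bool { return isLeader() },
		GenericFunc: func(event.GenericEvent) bool { return isLeader() },
	}
}

Something like WithEventFilter(leaderOnly(mgr.Elected())) on the controller builder would then keep non-leaders from enqueueing reconciles, while their informer caches stay populated.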

@astefanutti
Member

In the case of the OPA cert-controller, there is a RequireLeaderElection option that's correctly set by Kueue, but I suspect there is an issue in cert-controller that prevents it from being taken into account, which is the root cause of this issue. I'll fix it upstream.

Do you mean you'll fix ca-cert itself?

I mean I'll fix the following issue in cert-controller itself (assuming my analysis is correct), and upgrade the cert-controller dependency in Kueue:

Could not wait for Cache to sync","controller":"cert-rotator","error":"failed to wait for cert-rotator caches to sync: timed out waiting for cache to be synced for Kind *v1.Secret

I wonder if this is somehow related to the e2e failures that we are seeing: #1372 (comment)

I need to look more at how the e2e tests are setup. I'll check it once I have the fix.

@astefanutti
Member

I've opened open-policy-agent/cert-controller#166, which fixes the cache timeout issue in cert-controller.

@alculquicondor
Contributor

@astefanutti can you explain why the binary is terminating given the bug in cert-controller?

I thought this had something to do with the probes, but our probes just use Ping, which would return Ready/Live as long as the binary is able to respond to the HTTP request.

@astefanutti
Member

astefanutti commented Dec 19, 2023

@astefanutti can you explain why the binary is terminating given the bug in cert-controller?

I thought this had something to do with the probes, but our probes just use Ping, which would return Ready/Live as long as the binary is able to respond to the HTTP request.

@alculquicondor right, I don't think it has anything to do with the probes.

So what happens on the non-leader-elected replica is the following:

  • cert-controller is added to the manager (as a runnable)
  • the manager starts
  • the cert-controller runnable is started
  • that waits for its cache to sync, but that cache is never started, because controller-runtime defaults to starting runnables only in the leader-elected instance (see the sketch after this list)
  • that wait times out and returns an error
  • the manager's Start call returns an error, which is confirmed by the Could not run manager message
  • the binary exits
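
To make the gating concrete, here is a self-contained sketch with a plain runnable (hypothetical names, nothing Kueue-specific):

package main

import (
	"context"
	"log"

	ctrl "sigs.k8s.io/controller-runtime"
	"sigs.k8s.io/controller-runtime/pkg/manager"
)

func main() {
	mgr, err := ctrl.NewManager(ctrl.GetConfigOrDie(), ctrl.Options{
		LeaderElection:          true,
		LeaderElectionID:        "example.lock",
		LeaderElectionNamespace: "default",
	})
	if err != nil {
		log.Fatal(err)
	}
	// A plain runnable added like this does not implement
	// LeaderElectionRunnable, so the manager only starts it after this
	// replica acquires the lease. The failure above is the inverse
	// composition: cert-rotator's Start runs on every replica, but the
	// cache it waits on sits in this leader-gated group, so on a
	// non-leader the wait can never complete.
	if err := mgr.Add(manager.RunnableFunc(func(ctx context.Context) error {
		<-ctx.Done()
		return nil
	})); err != nil {
		log.Fatal(err)
	}
	if err := mgr.Start(ctrl.SetupSignalHandler()); err != nil {
		log.Fatal(err)
	}
}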

@alculquicondor changed the title from "Non-leading replica is restarted due to inaccurate probe implementation" to "Non-leading replica is restarted due to unstarted cert-controller" Dec 19, 2023
@alculquicondor changed the title from "Non-leading replica is restarted due to unstarted cert-controller" to "Non-leading replica fails due to not started cert-controller" Dec 19, 2023
@astefanutti
Member

A new version of cert-controller has been released with the fix, and I've opened #1509, which upgrades our dependency. That fixes the startup issue for non-leading replicas.

I've also opened #1510 to track the work of making the visibility extension API server highly available.

@alculquicondor
Contributor

Thanks a lot for fixing this!
