kubernetes-management job stuck in executor queue #3350

NotMyFault · 2023-01-23T08:06:28Z

Service(s)

infra.ci.jenkins.io

Summary

I was checking whether something had been deployed yet, and noticed that the kubernetes-management job has been stuck in the queue since Friday the 20th, failing at the “Apply” state.

Reproduction steps

No response

dduportal · 2023-01-23T08:22:53Z

Current state:

It's the build #11436
Logs indicate the following:

# ...
kubernetes-jobs-kubernetes-management-main-11436-6qp4f-61-cq06p has been removed for 5 min 0 sec, assuming it is not coming back
# ...
Body did not finish within grace period; terminating with extreme prejudice

=> either our pipeline setup is incorrect with regard to pipeline durability (retry, agent, etc.), or an issue occurred.

Cancelling the build to unblock the queue.

dduportal · 2023-01-23T08:24:42Z

Same for build #6551 of the updatecli job for kubernetes-management.

Stopping it.

dduportal · 2023-01-23T08:36:45Z

The next kubernetes-management job deployed a new controller image: infra.ci is restarting but seems stuck in the startup step (HTTP/503 errors).

Currently diagnosing
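A minimal sketch of how the "stuck in startup" state could be watched from a terminal, assuming the controller keeps answering HTTP 503 until it is ready. The `/login` path, the function names, and the attempt/delay defaults are illustrative assumptions, not something taken from this thread.

```shell
#!/bin/sh
# Sketch only: poll the controller until it stops answering HTTP 503.
# CONTROLLER_URL (and its /login path) is an assumption for illustration.
CONTROLLER_URL="${CONTROLLER_URL:-https://infra.ci.jenkins.io/login}"

# Map an HTTP status code to a coarse state.
classify_status() {
  case "$1" in
    503) echo "starting" ;;        # what the thread observed during restart
    2??|3??) echo "up" ;;
    000) echo "unreachable" ;;     # curl could not get any HTTP response
    *) echo "unexpected" ;;
  esac
}

# Poll up to $1 times, sleeping $2 seconds between attempts.
wait_for_controller() {
  attempts="${1:-60}"
  delay="${2:-10}"
  i=0
  while [ "$i" -lt "$attempts" ]; do
    code="$(curl -s -o /dev/null -w '%{http_code}' --max-time 10 "$CONTROLLER_URL" || true)"
    state="$(classify_status "$code")"
    echo "attempt $i: HTTP $code ($state)"
    [ "$state" = "up" ] && return 0
    i=$((i + 1))
    sleep "$delay"
  done
  return 1
}
```

Calling `wait_for_controller 60 10` would poll for roughly ten minutes, which is in line with the ~6-7 min restart time reported later in this thread.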

dduportal · 2023-01-23T10:33:34Z

infra.ci.jenkins.io has been back online for 1h
kubernetes-management is disabled because it's stuck in a "restart controller/break builds" loop => WiP

dduportal · 2023-01-23T16:10:34Z

We had to "operate" (as a team) on infra.ci.

To break the "loop" in which the jenkins-infra pod is re-created (a restart takes ~6-7 min), we had to increase the timeout from 5 min (300s) to 10 min (600s) in jenkins-infra/kubernetes-management@87ca94e
The restart of the controller was slowed down by low IOPS while scanning the persistent volume. We realized that the PV was using the default storage class (with a StandardSSD_LRS disk type). So we had to update the associated disk (inspired by https://tothecloud.dev/convert-aks-pv-to-premium/):
- Scale down the statefulset to zero
- Patch the PV's reclaim policy from Delete to Retain with kubectl patch pv pvc-23fa0e93-9dec-4ee6-b7b2-408150abcae9 -p '{"spec":{"persistentVolumeReclaimPolicy":"Retain"}}'
- Update the PV by removing the spec.claimRef.uid and spec.claimRef.resourceVersion (⚠️ do it EACH time you delete the PVC otherwise it will create a new PV ⚠️)
- Update the PV's disk type to Premium managed SSD (a 50 GB disk falls into the P6 performance tier by default)
- Delete PVC
- Update the Helm chart values in jenkins-infra/kubernetes-management@87ca94e with:
  - A nitpick about naming of the svc account for the controller
  - A cleanup: we do not create svc account for kube agents (default will be used unless specified so) to avoid confusion
  - Fix the specified storageClass to managed-csi-premium
- Re-apply manually the deployment with helmfile -f <...> apply (it will create a new PVC that will be automatically re-assigned to the existing PV IF you removed the spec.claimRef.uid and spec.claimRef.resourceVersion in the PV, right before)
- Side note: we had to "die and retry" for at least 2 hours before perfecting this procedure 😅
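The procedure above can be sketched as a dry-run script. This is a reconstruction under assumptions, not the exact commands the team ran: only the PV name and the `Retain` patch come from this thread. The namespace, statefulset, PVC, disk, and resource group names are hypothetical placeholders, the use of `az disk update --sku Premium_LRS` for the disk change is an assumption, and the helmfile path stays elided as in the original. The claimRef cleanup is placed after the PVC deletion, following the "right before" remark in the re-apply step.

```shell
#!/bin/sh
# Sketch only: prints the commands by default (DRY_RUN=1); set DRY_RUN=0 to execute.
set -eu

DRY_RUN="${DRY_RUN:-1}"
PV_NAME="${PV_NAME:-pvc-23fa0e93-9dec-4ee6-b7b2-408150abcae9}"  # from the thread
NAMESPACE="${NAMESPACE:-jenkins-infra}"            # assumed
STATEFULSET="${STATEFULSET:-jenkins-infra}"        # assumed
PVC_NAME="${PVC_NAME:-jenkins-infra}"              # assumed
DISK_NAME="${DISK_NAME:-<disk-name>}"              # placeholder
RESOURCE_GROUP="${RESOURCE_GROUP:-<resource-group>}"  # placeholder
HELMFILE="${HELMFILE:-<...>}"                      # elided in the thread

# Print the command in dry-run mode, execute it otherwise.
run() {
  if [ "$DRY_RUN" = "1" ]; then printf '%s\n' "$*"; else "$@"; fi
}

# Strategic merge patch that keeps the disk when the PVC is deleted.
retain_patch() {
  printf '%s' '{"spec":{"persistentVolumeReclaimPolicy":"Retain"}}'
}

# JSON Patch removing the fields that bind the PV to the old PVC instance.
claimref_patch() {
  printf '%s' '[{"op":"remove","path":"/spec/claimRef/uid"},{"op":"remove","path":"/spec/claimRef/resourceVersion"}]'
}

# Show which storage class the PV currently uses.
show_storage_class() {
  run kubectl get pv "$PV_NAME" -o "jsonpath={.spec.storageClassName}"
}

migrate() {
  run kubectl -n "$NAMESPACE" scale statefulset "$STATEFULSET" --replicas=0
  run kubectl patch pv "$PV_NAME" -p "$(retain_patch)"
  run az disk update --resource-group "$RESOURCE_GROUP" --name "$DISK_NAME" --sku Premium_LRS
  run kubectl -n "$NAMESPACE" delete pvc "$PVC_NAME"
  # Must be repeated after every PVC deletion, otherwise a new PV gets created.
  run kubectl patch pv "$PV_NAME" --type=json -p "$(claimref_patch)"
  run helmfile -f "$HELMFILE" apply
}

migrate
```

Running it as-is only prints the six commands, which makes it easy to review the order before trying anything against a real cluster.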

infra.ci.jenkins.io is back and kubernetes-management works well now

dduportal · 2023-01-23T16:57:10Z

We are trying the new setup by merging jenkins-infra/kubernetes-management#3478 (core version + plugin): let's see how the controller infra.ci behaves.

dduportal · 2023-01-23T17:13:28Z

Nice, it works as expected!

Side note before closing:

We ran into a java.lang.StackOverflowError while Jenkins was trying to resume the builds. We'll have to check the memory usage.
One of the agents failed (but not its counterparts) while reconnecting to the controller: the StackOverflowError timestamp is the same as the agent-side handshake error.

NotMyFault added the triage Incoming issues that need review label Jan 23, 2023

jenkins-infra-helpdesk-app bot added the infra.ci.jenkins.io label Jan 23, 2023

dduportal removed the triage Incoming issues that need review label Jan 23, 2023

dduportal self-assigned this Jan 23, 2023

dduportal added this to the infra-team-sync-2023-01-24 milestone Jan 23, 2023

dduportal closed this as completed Jan 23, 2023
