kubernetes-management job stuck in executor queue #3350

Closed
NotMyFault opened this issue Jan 23, 2023 · 7 comments

Comments

@NotMyFault
Member

Service(s)

infra.ci.jenkins.io

Summary

I was checking whether something had been deployed yet, and noticed that the kubernetes-management job has been stuck in the queue since Friday the 20th, failing at the “Apply” state.

Reproduction steps

No response

@NotMyFault NotMyFault added the triage Incoming issues that need review label Jan 23, 2023
@dduportal dduportal removed the triage Incoming issues that need review label Jan 23, 2023
@dduportal dduportal self-assigned this Jan 23, 2023
@dduportal dduportal added this to the infra-team-sync-2023-01-24 milestone Jan 23, 2023
@dduportal
Contributor

Current state:

  • The stuck build is #11436
  • Logs indicate the following:
# ...
kubernetes-jobs-kubernetes-management-main-11436-6qp4f-61-cq06p has been removed for 5 min 0 sec, assuming it is not coming back
# ...
Body did not finish within grace period; terminating with extreme prejudice

=> either our pipeline setup is not correct with regard to pipeline durability (retry, agent, etc.) or there has been an issue.

Cancelling the build to unblock the queue.

@dduportal
Contributor

Same for build #6551 of the updatecli job for kubernetes-management.

Stopping it.

@dduportal
Contributor

The next kubernetes-management job deployed a new controller image: infra.ci is restarting but seems stuck in the startup step (HTTP/503 errors).

Currently diagnosing

@dduportal
Contributor

  • infra.ci.jenkins.io has been back online for 1h
  • kubernetes-management is disabled because it's stuck in a "restart controller/break builds" loop => WiP

@dduportal
Contributor

We had to "operate" (as a team) on infra.ci.

  • To avoid the "loop" where the jenkins-infra pod is re-created (which takes ~6-7 min), we had to increase the timeout from 5 min (300s) to 10 min (600s) in jenkins-infra/kubernetes-management@87ca94e
  • The restart of the controller was slowed down by low IOPS while scanning the persistent volume. We realized that the PV was using the default storage class (with a StandardSSD_LRS disk type), so we had to update the associated disk (inspired by https://tothecloud.dev/convert-aks-pv-to-premium/):
    • Scale down the statefulset to zero
    • Patch the PV's reclaim policy from Delete to Retain with kubectl patch pv pvc-23fa0e93-9dec-4ee6-b7b2-408150abcae9 -p '{"spec":{"persistentVolumeReclaimPolicy":"Retain"}}'
    • Update the PV by removing the spec.claimRef.uid and spec.claimRef.resourceVersion (⚠️ do it EACH time you delete the PVC otherwise it will create a new PV ⚠️)
    • Update the PV's disk type to Premium managed SSD (a 50 GB disk gets the P6 performance tier by default)
    • Delete PVC
    • Update the Helm chart values in jenkins-infra/kubernetes-management@87ca94e with:
      • A nitpick about naming of the svc account for the controller
      • A cleanup: we do not create svc account for kube agents (default will be used unless specified so) to avoid confusion
      • Fix the specified storageClass to managed-csi-premium
    • Re-apply manually the deployment with helmfile -f <...> apply (it will create a new PVC that will be automatically re-assigned to the existing PV IF you removed the spec.claimRef.uid and spec.claimRef.resourceVersion in the PV, right before)
    • Side note: we had to "die and retry" for at least 2 hours before perfecting this procedure 😅
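
For future reference, the disk steps above could be scripted roughly as follows. This is only a sketch, not the exact commands that were run: the namespace, StatefulSet, PVC, disk, resource group and helmfile path are hypothetical placeholders (the helmfile path is elided above and has to be supplied), and the ordering (delete the PVC, then clear the claimRef fields, then re-apply) is one reading of the list. It defaults to a dry run that only prints each command.

```shell
#!/usr/bin/env bash
# Sketch of the PV migration procedure. Dry run by default: commands are
# printed, not executed, unless DRY_RUN=0 is set.
set -euo pipefail

# All defaults below are placeholders (assumptions), except the PV name,
# which is the one quoted in this thread.
NAMESPACE="${NAMESPACE:-jenkins-infra}"
STATEFULSET="${STATEFULSET:-jenkins-infra}"
PVC="${PVC:-jenkins-infra}"
PV="${PV:-pvc-23fa0e93-9dec-4ee6-b7b2-408150abcae9}"
DISK="${DISK:-example-disk}"
RESOURCE_GROUP="${RESOURCE_GROUP:-example-rg}"
HELMFILE="${HELMFILE:-helmfile.yaml}"   # real path is not given in the thread
DRY_RUN="${DRY_RUN:-1}"

# Print the command in dry-run mode, execute it otherwise.
run() {
  if [ "$DRY_RUN" = "1" ]; then
    printf '+ %s\n' "$*"
  else
    "$@"
  fi
}

migrate_pv() {
  # 1. Stop the controller so the disk is detached.
  run kubectl --namespace "$NAMESPACE" scale statefulset "$STATEFULSET" --replicas=0
  # 2. Keep the PV (and its disk) when the PVC goes away.
  run kubectl patch pv "$PV" -p '{"spec":{"persistentVolumeReclaimPolicy":"Retain"}}'
  # 3. Switch the managed disk to Premium SSD.
  run az disk update --resource-group "$RESOURCE_GROUP" --name "$DISK" --sku Premium_LRS
  # 4. Delete the PVC; the PV becomes Released but still points at the old claim.
  run kubectl --namespace "$NAMESPACE" delete pvc "$PVC"
  # 5. Drop the stale binding so a new PVC with the same name can bind to the PV.
  run kubectl patch pv "$PV" --type json -p '[{"op":"remove","path":"/spec/claimRef/uid"},{"op":"remove","path":"/spec/claimRef/resourceVersion"}]'
  # 6. Re-create the PVC and the StatefulSet from the chart values.
  run helmfile -f "$HELMFILE" apply
  # 7. Wait for the controller; 600s mirrors the new timeout but is not
  #    the setting changed in 87ca94e.
  run kubectl --namespace "$NAMESPACE" rollout status "statefulset/$STATEFULSET" --timeout=600s
}

migrate_pv
```

Run it once as-is to review the printed commands, then again with DRY_RUN=0 and the real names once they look right. Step 5 has to be repeated every time the PVC is deleted, as warned above.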

infra.ci.jenkins.io is back and kubernetes-management works well now

@dduportal
Contributor

We are trying the new setup by merging jenkins-infra/kubernetes-management#3478 (core version + plugin): let's see how the controller infra.ci behaves.

@dduportal
Contributor

Nice, it works as expected!

Side note before closing:

  • We hit a java.lang.StackOverflowError while Jenkins was trying to resume the builds. We'll have to check the memory usage.
  • One of the agents failed (but not its counterparts) while reconnecting to the controller: the StackOverflowError timestamp is the same as the agent-side handshake error.
