[job failure] periodic-kubernetes-e2e-kubeadm-gce-selfhosting #59766
@krzyzacy is this one getting traction in the SIG? cc: @luxas @jessicaochen
Seems like there is an issue with scheduling a DNS pod.
Trying to reach out on Slack to see if there is someone familiar with self-hosting who can take a closer look.
Does this have anything to do with issue #59762?
Though that might not be the root cause.
@xiangpengzhao - It does not seem so at the moment. The error saying it cannot fetch the kubeconfig means the master did not come up successfully, but it does not tell us why. Looking at the last three failures, I triaged out two classes of issue, including: [2] The etcd pod on the master is failing, which makes the apiserver unavailable.
@jessicaochen, @xiangpengzhao: From what I can tell, the "run_after_success:" mechanism in prow/config.yaml isn't working as expected. The "run_after_success" job appears to run while its prerequisite build job is still running. Here is the test result I looked at: ci-kubernetes-e2e-kubeadm-gce #9642.
It's looking for the kubeadm binary at Feb 28 15:24:22. I believe the corresponding build job is ci-kubernetes-bazel-build #228346, but that copy doesn't happen until 15:27:50. So "run_after_success" isn't guaranteeing serialization of build and test jobs.
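For illustration, here is a minimal sketch of the kind of nesting being described. The job names come from this thread, but the surrounding fields (interval, images, and the exact prow config schema of that era) are assumptions, not quotes from the real prow/config.yaml:

```yaml
# Hypothetical excerpt: the e2e job is declared as a child of the build job
# via run_after_success, which is meant to start it only after the parent
# succeeds. The report above suggests the child was being triggered before
# the parent finished uploading its artifacts.
periodics:
- name: ci-kubernetes-bazel-build
  interval: 1h                                          # assumed interval
  agent: kubernetes
  spec:
    containers:
    - image: gcr.io/k8s-testimages/bazelbuild:latest    # assumed image
  run_after_success:
  - name: ci-kubernetes-e2e-kubeadm-gce
    agent: kubernetes
    spec:
      containers:
      - image: gcr.io/k8s-testimages/kubekins-e2e:latest  # assumed image
```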
Adding critical-urgent because we are now in Code Freeze.
/priority critical-urgent
@stealthybox - After the change to kubeadm's etcd behavior (#57415), this test developed failure [2], where the etcd pod on the master fails. Do you have any idea what might be going on?
@jessicaochen I was able to reproduce the etcd pod failure locally. We missed that self-hosting can depend on the etcd static pod. The liveness probe kills the etcd pod on all kubeadm installs after 95 seconds, and then the crash loop starts.
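For reference, this failure mode matches a plain-HTTP liveness probe aimed at an endpoint that now requires mTLS. A minimal sketch, with assumed values that happen to add up to the 95-second window mentioned above (not quoted from the actual kubeadm manifest):

```yaml
# Illustrative static-pod probe: plain HTTP against an etcd that now only
# serves mTLS, so every probe fails. After roughly
#   initialDelaySeconds + failureThreshold * periodSeconds = 15 + 8*10 = 95s
# the kubelet kills the container and the crash loop begins.
livenessProbe:
  httpGet:
    host: 127.0.0.1
    port: 2379
    path: /health
  initialDelaySeconds: 15
  timeoutSeconds: 15
  periodSeconds: 10
  failureThreshold: 8
```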
/status in-progress
You must be a member of the kubernetes/kubernetes-milestone-maintainers GitHub team to add status labels.
xref #60608
@jessicaochen the above PR works to address this 👍
…tcd_tls Automatic merge from submit-queue. If you want to cherry-pick this change to another branch, please follow the instructions [here](https://github.com/kubernetes/community/blob/master/contributors/devel/cherry-picks.md).

Add mTLS to kubeadm etcd liveness probe.

**What this PR does / why we need it**: We switched etcd over to using mTLS, but the liveness probe is still using HTTP. Disabling the liveness probe allows etcd to continue operating. The real fix isn't simple, because we need to generate a client certificate for health checking and update the probe to exec `etcdctl`, like so: https://sourcegraph.com/github.com/coreos/etcd-operator/-/blob/pkg/util/k8sutil/pod_util.go#L71-89 ~~Working on patching this now.~~ This PR now generates the healthcheck identity and updates the liveness probe to use it.

**Which issue(s) this PR fixes**: Fixes #59766 Fixes kubernetes/kubeadm#720

**Special notes for your reviewer**: We should generate a client cert specifically for etcd health checks so that the apiserver certs can be revoked independently. It will be stored in `/etc/kubernetes/pki/etcd/` so that we don't have to change the pod's hostMount.

**Release note**:
```release-note
NONE
```
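As a rough sketch of the approach the PR describes — an exec probe that authenticates with a dedicated healthcheck client cert — the following shows the shape such a probe could take. The certificate file names and etcdctl invocation are illustrative assumptions, not the PR's exact manifest:

```yaml
# Sketch: an exec-based liveness probe that authenticates with a dedicated
# client certificate instead of plain HTTP. The kubelet runs this inside
# the etcd container; a successful read means etcd is healthy.
livenessProbe:
  exec:
    command:
    - /bin/sh
    - -ec
    - >-
      ETCDCTL_API=3 etcdctl
      --endpoints=https://127.0.0.1:2379
      --cacert=/etc/kubernetes/pki/etcd/ca.crt
      --cert=/etc/kubernetes/pki/etcd/healthcheck-client.crt
      --key=/etc/kubernetes/pki/etcd/healthcheck-client.key
      get foo
  initialDelaySeconds: 15
  timeoutSeconds: 15
  failureThreshold: 8
```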
Not sure why this was closed, given that we have not verified the change actually makes the test green. Could someone re-open this? Perhaps @krzyzacy?
I linked the PR as a fix -- thanks for reopening
@jessicaochen: GitHub automatically closes issues when a merged PR's description contains "Fixes #ISSUE".
[MILESTONENOTIFIER] Milestone Issue Needs Attention
@krzyzacy @kubernetes/sig-cluster-lifecycle-misc
Action required: During code freeze, issues in the milestone should be in progress.
this is also fixed
/priority failing-test
/kind bug
/status approved-for-milestone
/sig cluster-lifecycle
https://k8s-testgrid.appspot.com/sig-release-master-blocking#kubeadm-gce-selfhosting
The job is on the master-blocking dashboard and has been failing to bring up the cluster.
cc @jdumars @luxas @jessicaochen