
[job failure] periodic-kubernetes-e2e-kubeadm-gce-selfhosting #59766

Closed
krzyzacy opened this issue Feb 12, 2018 · 17 comments · Fixed by #60728
Labels
kind/bug · kind/failing-test · milestone/needs-attention · priority/critical-urgent · sig/cluster-lifecycle
Milestone
v1.10

Comments

@krzyzacy (Member)

/priority failing-test
/kind bug
/status approved-for-milestone
/sig cluster-lifecycle

https://k8s-testgrid.appspot.com/sig-release-master-blocking#kubeadm-gce-selfhosting
The job is on the master-blocking dashboard and has been failing to bring up the cluster.

cc @jdumars @luxas @jessicaochen

@k8s-ci-robot added status/approved-for-milestone, kind/failing-test, kind/bug, needs-sig, and sig/cluster-lifecycle labels, and removed the needs-sig label Feb 12, 2018
@cblecker added this to the v1.10 milestone Feb 12, 2018
@jdumars (Member) commented Feb 23, 2018

@krzyzacy is this one getting traction in the SIG? cc: @luxas @jessicaochen

@jessicaochen (Member)

Seems like there is an issue with scheduling a DNS pod.

Feb 23 15:20:31.315: Error waiting for all pods to be running and ready: 1 / 15 pods in namespace "kube-system" are NOT in RUNNING and READY state in 10m0s
POD                       NODE PHASE   GRACE CONDITIONS
kube-dns-865b8d59f4-zcndv      Pending       [{Type:PodScheduled Status:False LastProbeTime:0001-01-01 00:00:00 +0000 UTC LastTransitionTime:2018-02-23 15:02:14 +0000 UTC Reason:Unschedulable Message:0/1 nodes are available: 1 node(s) were not ready.}]

Trying to reach out on Slack to see if there is someone familiar with self-hosting who can take a closer look.

@xiangpengzhao (Contributor)

Does this have anything to do with issue #59762?
I saw a similar failure:

W0227 15:22:05.130] + echo Trying to fetch kubeconfig from master... 60/60
W0227 15:22:05.130] + gcloud compute ssh --project gce-cvm-upg-1-3-lat-ctl-skew --zone us-central1-f prow@e2e-269-master --command 'echo STARTFILE; sudo cat /etc/kubernetes/admin.conf'
I0227 15:22:05.230] Trying to fetch kubeconfig from master... 60/60
W0227 15:22:06.285] cat: /etc/kubernetes/admin.conf: No such file or directory
W0227 15:22:06.345] + sleep 5
W0227 15:22:11.346] + echo Exhausted attempts to fetch kubeconfig.
W0227 15:22:11.346] Exhausted attempts to fetch kubeconfig.
W0227 15:22:11.347] + exit 1
W0227 15:22:11.347] make[1]: *** [do] Error 1
W0227 15:22:11.347] make: *** [deploy-cluster] Error 2
W0227 15:22:11.347] 2018/02/27 15:22:11 process.go:152: Step 'make -C /workspace/kubernetes-anywhere WAIT_FOR_KUBECONFIG=y deploy' finished in 7m13.741693578s

Though this might not be the root cause.

@jessicaochen (Member)

@xiangpengzhao - It does not seem so at the moment. The error about failing to fetch the kubeconfig means the master did not set up successfully, but it does not tell us why.

Looking at the last three failures, I triaged two classes of issues:
[1] The GCS bucket with kubeadm was not ready, so the master could not set up. This looks like some sort of timing issue.
Feb 27 15:17:07 e2e-269-master startup-script: INFO startup-script: CommandException: No URLs matched: gs://kubernetes-release-dev/bazel/v1.11.0-alpha.0.762+44c166cd73d21e/bin/linux/amd64/kubeadm
From https://storage.googleapis.com/kubernetes-jenkins/logs/periodic-kubernetes-e2e-kubeadm-gce-selfhosting/269/artifacts/e2e-269-master/serial-1.log for test https://k8s-gubernator.appspot.com/build/kubernetes-jenkins/logs/periodic-kubernetes-e2e-kubeadm-gce-selfhosting/269

[2] The etcd pod on the master is failing. This makes the apiserver unavailable.
Feb 27 03:45:47 e2e-267-master kubelet[5562]: E0227 03:45:47.201521 5562 pod_workers.go:186] Error syncing pod fb435261e8a6806cc7bf2238686ffdf7 ("etcd-e2e-267-master_kube-system(fb435261e8a6806cc7bf2238686ffdf7)"), skipping: failed to "StartContainer" for "etcd" with CrashLoopBackOff: "Back-off 5m0s restarting failed container=etcd pod=etcd-e2e-267-master_kube-system(fb435261e8a6806cc7bf2238686ffdf7)"
for tests https://storage.googleapis.com/kubernetes-jenkins/logs/periodic-kubernetes-e2e-kubeadm-gce-selfhosting/267/artifacts/e2e-267-master/serial-1.log
https://storage.googleapis.com/kubernetes-jenkins/logs/periodic-kubernetes-e2e-kubeadm-gce-selfhosting/268/artifacts/e2e-268-master/serial-1.log

@leblancd

@jessicaochen, @xiangpengzhao:
Excellent triage! Failure mode [1] in @jessicaochen's analysis matches what is happening in at least one test case for ci-kubernetes-e2e-kubeadm-gce (issue #59762). The test tries to download kubeadm from a GCS bucket, but the kubeadm binary has not yet been copied to that bucket.

From what I can tell, the "run_after_success:" mechanism in prow/config.yaml isn't working as expected: the "run_after_success" job starts while the prerequisite build job is still running.

Here is the test result I looked at: ci-kubernetes-e2e-kubeadm-gce # 9642
The kube master serial log shows that the kubeadm binary has not yet been uploaded to the GCS bucket:

Feb 28 15:24:22 e2e-9642-master startup-script: INFO startup-script: CommandException: No URLs matched: gs://kubernetes-release-dev/bazel/v1.11.0-alpha.0.914+24adcd59f2ea47/bin/linux/amd64/kubeadm

So the test is looking for the kubeadm binary at 15:24:22 on Feb 28.

I believe that the corresponding build job is here: ci-kubernetes-bazel-build # 228346.
The build log shows the kubeadm binary being copied:

W0228 15:27:50.323] / [21/108 files][ 22.2 MiB/  2.1 GiB]   1% Done                                 
Copying file:///tmp/bazel-gcs.HClg_9/bin/linux/amd64/kubeadm [Content-Type=application/octet-stream]...
W0228 15:27:50.326] Copying file:///tmp/bazel-gcs.HClg_9/bin/linux/amd64/kube-scheduler.tar.sha1 [Content-Type=application/octet-stream]...

But that copy doesn't happen until 15:27:50, more than three minutes after the test looked for the binary.

So the "run_after_success" isn't guaranteeing serialization of build vs. test jobs.

@jberkus commented Feb 28, 2018

Adding critical-urgent because we are now in Code Freeze.

/priority critical-urgent

@k8s-ci-robot added the priority/critical-urgent label Feb 28, 2018
@jessicaochen (Member)

@stealthybox - After the change to kubeadm's etcd behavior (#57415), this test developed failure [2], where the etcd pod on the master fails. Do you have any idea what might be going on?

@stealthybox (Member) commented Mar 2, 2018

@jessicaochen I was able to reproduce the etcd pod failure locally.
The pod is failing because the liveness probe still uses HTTP.

We missed that self-hosting can depend on the etcd static pod.
I'm now patching the probe to support mTLS (mTLS itself is already working with self-hosted etcd).

The liveness probe kills the etcd pod on all kubeadm installs after 95 seconds, and then the crash loop starts.
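
For illustration, a minimal sketch of the shape of the fix (not the exact patch): swap the static pod's plain-HTTP livenessProbe for an exec probe that speaks TLS through etcdctl with a dedicated healthcheck client certificate. The flags are standard etcdctl v3 TLS flags; the certificate file names and threshold values here are illustrative:

```yaml
# Sketch of an mTLS-aware liveness probe for the kubeadm etcd static pod.
# The old probe was an httpGet against 127.0.0.1:2379/health, which an
# mTLS-only etcd rejects, so the kubelet restarted etcd into a crash loop.
livenessProbe:
  exec:
    command:
    - /bin/sh
    - -ec
    - ETCDCTL_API=3 etcdctl --endpoints=https://127.0.0.1:2379
      --cacert=/etc/kubernetes/pki/etcd/ca.crt
      --cert=/etc/kubernetes/pki/etcd/healthcheck-client.crt
      --key=/etc/kubernetes/pki/etcd/healthcheck-client.key
      get foo
  initialDelaySeconds: 15   # illustrative values
  timeoutSeconds: 15
  failureThreshold: 8
```

Keeping the healthcheck certs under /etc/kubernetes/pki/etcd/ means the pod's existing hostPath mount already covers them, the same consideration the eventual PR calls out below.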

@stealthybox (Member)

/status in-progress

@k8s-ci-robot (Contributor)

You must be a member of the kubernetes/kubernetes-milestone-maintainers GitHub team to add status labels.

@stealthybox (Member)

xref #60608

@stealthybox (Member)

@jessicaochen the above PR works to address this 👍

k8s-github-robot pushed a commit that referenced this issue Mar 5, 2018
…tcd_tls

Automatic merge from submit-queue. If you want to cherry-pick this change to another branch, please follow the instructions here: https://github.com/kubernetes/community/blob/master/contributors/devel/cherry-picks.md

Add mTLS to kubeadm etcd liveness probe.

**What this PR does / why we need it**:
We switched etcd over to using mTLS, but the liveness probe is still using HTTP.
Disabling the liveness probe allows etcd to continue operating.

The real fix isn't simple, because we need to generate a client certificate for healthchecking and update the probe to exec `etcdctl` like so: 
https://sourcegraph.com/github.com/coreos/etcd-operator/-/blob/pkg/util/k8sutil/pod_util.go#L71-89

~Working on patching this now.~
This PR now generates the healthcheck identity and updates the liveness probe to use it.

**Which issue(s) this PR fixes**
Fixes #59766
Fixes kubernetes/kubeadm#720

**Special notes for your reviewer**:
We should generate a client cert specifically for etcd health checks so that the apiserver certs can be revoked independently.
This will be stored in `/etc/kubernetes/pki/etcd/` so that we don't have to change the pod's hostMount.

**Release note**:
```release-note
NONE
```
@jessicaochen (Member)

Not sure why this was closed, given that we have not verified that the change actually makes the test green. Could someone re-open this? Perhaps @krzyzacy?

@krzyzacy reopened this Mar 5, 2018
@stealthybox (Member)

I linked the PR as a fix -- thanks for reopening

@jberkus commented Mar 6, 2018

@jessicaochen: GitHub automatically closes an issue when a merged PR's description contains "Fixes #ISSUE".

@k8s-github-robot

[MILESTONENOTIFIER] Milestone Issue Needs Attention

@krzyzacy @kubernetes/sig-cluster-lifecycle-misc

Action required: During code freeze, issues in the milestone should be in progress.
If this issue is not being actively worked on, please remove it from the milestone.
If it is being worked on, please add the status/in-progress label so it can be tracked with other in-flight issues.

Note: This issue is marked as priority/critical-urgent, and must be updated every 1 day during code freeze.

Example update:

ACK.  In progress
ETA: DD/MM/YYYY
Risks: Complicated fix required
Issue Labels
  • sig/cluster-lifecycle: Issue will be escalated to these SIGs if needed.
  • priority/critical-urgent: Never automatically move issue out of a release milestone; continually escalate to contributor and SIG through all available channels.
  • kind/bug: Fixes a bug discovered during the current release.

@krzyzacy (Member, Author) commented Mar 7, 2018

this is also fixed

@krzyzacy closed this as completed Mar 7, 2018