
[job failure] periodic-kubernetes-e2e-kubeadm-gce-selfhosting #59766

Closed
krzyzacy opened this issue Feb 12, 2018 · 17 comments · Fixed by #60728
Labels
kind/bug · kind/failing-test · milestone/needs-attention · priority/critical-urgent · sig/cluster-lifecycle
Milestone
v1.10

Comments

@krzyzacy (Member)

/priority failing-test
/kind bug
/status approved-for-milestone
/sig cluster-lifecycle

https://k8s-testgrid.appspot.com/sig-release-master-blocking#kubeadm-gce-selfhosting
The job is on the master-blocking dashboard and has been failing to bring up the cluster.

cc @jdumars @luxas @jessicaochen

@k8s-ci-robot added status/approved-for-milestone, kind/failing-test, kind/bug, needs-sig, and sig/cluster-lifecycle labels, and removed the needs-sig label Feb 12, 2018
@cblecker added this to the v1.10 milestone Feb 12, 2018
@jdumars (Member) commented Feb 23, 2018

@krzyzacy is this one getting traction in the SIG? cc: @luxas @jessicaochen

@jessicaochen (Member)

Seems like there is an issue with scheduling a DNS pod.

Feb 23 15:20:31.315: Error waiting for all pods to be running and ready: 1 / 15 pods in namespace "kube-system" are NOT in RUNNING and READY state in 10m0s
POD                       NODE PHASE   GRACE CONDITIONS
kube-dns-865b8d59f4-zcndv      Pending       [{Type:PodScheduled Status:False LastProbeTime:0001-01-01 00:00:00 +0000 UTC LastTransitionTime:2018-02-23 15:02:14 +0000 UTC Reason:Unschedulable Message:0/1 nodes are available: 1 node(s) were not ready.}]

Trying to reach out on Slack to see if there is someone familiar with self-hosting who can take a closer look.

@xiangpengzhao (Contributor)

Does this have anything to do with issue #59762?
I saw a similar failure:

W0227 15:22:05.130] + echo Trying to fetch kubeconfig from master... 60/60
W0227 15:22:05.130] + gcloud compute ssh --project gce-cvm-upg-1-3-lat-ctl-skew --zone us-central1-f prow@e2e-269-master --command 'echo STARTFILE; sudo cat /etc/kubernetes/admin.conf'
I0227 15:22:05.230] Trying to fetch kubeconfig from master... 60/60
W0227 15:22:06.285] cat: /etc/kubernetes/admin.conf: No such file or directory
W0227 15:22:06.345] + sleep 5
W0227 15:22:11.346] + echo Exhausted attempts to fetch kubeconfig.
W0227 15:22:11.346] Exhausted attempts to fetch kubeconfig.
W0227 15:22:11.347] + exit 1
W0227 15:22:11.347] make[1]: *** [do] Error 1
W0227 15:22:11.347] make: *** [deploy-cluster] Error 2
W0227 15:22:11.347] 2018/02/27 15:22:11 process.go:152: Step 'make -C /workspace/kubernetes-anywhere WAIT_FOR_KUBECONFIG=y deploy' finished in 7m13.741693578s

Though this might not be the root cause.

@jessicaochen (Member)

@xiangpengzhao - It does not seem so at the moment. The error about failing to fetch the kubeconfig means the master did not set up successfully, but it does not tell us why.

Looking at the last three failures, I triaged two classes of issues:
[1] The GCS bucket with kubeadm was not ready, so the master could not set up. This looks like some sort of timing issue.
Feb 27 15:17:07 e2e-269-master startup-script: INFO startup-script: CommandException: No URLs matched: gs://kubernetes-release-dev/bazel/v1.11.0-alpha.0.762+44c166cd73d21e/bin/linux/amd64/kubeadm
From https://storage.googleapis.com/kubernetes-jenkins/logs/periodic-kubernetes-e2e-kubeadm-gce-selfhosting/269/artifacts/e2e-269-master/serial-1.log for test https://k8s-gubernator.appspot.com/build/kubernetes-jenkins/logs/periodic-kubernetes-e2e-kubeadm-gce-selfhosting/269

[2] The etcd pod on the master is failing. This makes the apiserver unavailable.
Feb 27 03:45:47 e2e-267-master kubelet[5562]: E0227 03:45:47.201521 5562 pod_workers.go:186] Error syncing pod fb435261e8a6806cc7bf2238686ffdf7 ("etcd-e2e-267-master_kube-system(fb435261e8a6806cc7bf2238686ffdf7)"), skipping: failed to "StartContainer" for "etcd" with CrashLoopBackOff: "Back-off 5m0s restarting failed container=etcd pod=etcd-e2e-267-master_kube-system(fb435261e8a6806cc7bf2238686ffdf7)"
for tests https://storage.googleapis.com/kubernetes-jenkins/logs/periodic-kubernetes-e2e-kubeadm-gce-selfhosting/267/artifacts/e2e-267-master/serial-1.log
https://storage.googleapis.com/kubernetes-jenkins/logs/periodic-kubernetes-e2e-kubeadm-gce-selfhosting/268/artifacts/e2e-268-master/serial-1.log

@leblancd

@jessicaochen, @xiangpengzhao:
Excellent triage! Failure mode [1] in @jessicaochen's analysis matches what is happening in at least one test case for ci-kubernetes-e2e-kubeadm-gce (issue #59762). The test tries to download kubeadm from a GCS bucket, but the kubeadm binary has not yet been copied to that bucket.

From what I can tell, the "run_after_success:" mechanism in prow/config.yaml isn't working as expected: the "run_after_success" job starts while the prerequisite build job is still running.

Here is the test result I looked at: ci-kubernetes-e2e-kubeadm-gce # 9642
The kube master serial log shows that the kubeadm binary has not yet been uploaded to the GCS bucket:

Feb 28 15:24:22 e2e-9642-master startup-script: INFO startup-script: CommandException: No URLs matched: gs://kubernetes-release-dev/bazel/v1.11.0-alpha.0.914+24adcd59f2ea47/bin/linux/amd64/kubeadm

So the test is looking for the kubeadm binary at 15:24:22 on Feb 28.

I believe that the corresponding build job is here: ci-kubernetes-bazel-build # 228346.
The build log shows the kubeadm binary being copied:

W0228 15:27:50.323] / [21/108 files][ 22.2 MiB/  2.1 GiB]   1% Done                                 
Copying file:///tmp/bazel-gcs.HClg_9/bin/linux/amd64/kubeadm [Content-Type=application/octet-stream]...
W0228 15:27:50.326] Copying file:///tmp/bazel-gcs.HClg_9/bin/linux/amd64/kube-scheduler.tar.sha1 [Content-Type=application/octet-stream]...

But that copy doesn't happen until 15:27:50, more than three minutes after the test looked for the binary.

So the "run_after_success" isn't guaranteeing serialization of build vs. test jobs.

@jberkus commented Feb 28, 2018

Adding critical-urgent because we are now in Code Freeze.

/priority critical-urgent

@k8s-ci-robot added the priority/critical-urgent label Feb 28, 2018
@jessicaochen (Member)

@stealthybox - After the change to kubeadm's etcd behavior (#57415), this test developed failure [2], where the etcd pod on the master fails. Do you have any idea what might be going on?

@stealthybox (Member) commented Mar 2, 2018

@jessicaochen I was able to reproduce the etcd pod failure locally.
The pod is failing because the liveness probe still uses HTTP.

We missed that self-hosting can depend on the etcd static pod.
I'm now patching the probe to support mTLS (mTLS itself is already working with self-hosted etcd).

The liveness probe kills the etcd pod on all kubeadm installs after 95 seconds, and then the crash loop starts.
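
For illustration, a minimal sketch of the shape of the fix (not the exact patch): swap the static pod's plain-HTTP livenessProbe for an exec probe that speaks TLS through etcdctl with a dedicated healthcheck client certificate. The flags are standard etcdctl v3 TLS flags; the certificate file names and threshold values here are illustrative:

```yaml
# Sketch of an mTLS-aware liveness probe for the kubeadm etcd static pod.
# The old probe was an httpGet against 127.0.0.1:2379/health, which an
# mTLS-only etcd rejects, so the kubelet restarted etcd into a crash loop.
livenessProbe:
  exec:
    command:
    - /bin/sh
    - -ec
    - ETCDCTL_API=3 etcdctl --endpoints=https://127.0.0.1:2379
      --cacert=/etc/kubernetes/pki/etcd/ca.crt
      --cert=/etc/kubernetes/pki/etcd/healthcheck-client.crt
      --key=/etc/kubernetes/pki/etcd/healthcheck-client.key
      get foo
  initialDelaySeconds: 15   # illustrative values
  timeoutSeconds: 15
  failureThreshold: 8
```

Keeping the healthcheck certs under /etc/kubernetes/pki/etcd/ means the pod's existing hostPath mount already covers them, the same consideration the eventual PR calls out below.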

@stealthybox (Member)

/status in-progress

@k8s-ci-robot (Contributor)

You must be a member of the kubernetes/kubernetes-milestone-maintainers GitHub team to add status labels.

@stealthybox (Member)

xref #60608

@stealthybox (Member)

@jessicaochen the above PR works to address this 👍

k8s-github-robot pushed a commit that referenced this issue Mar 5, 2018
…tcd_tls

Automatic merge from submit-queue. If you want to cherry-pick this change to another branch, please follow the instructions here: https://github.com/kubernetes/community/blob/master/contributors/devel/cherry-picks.md

Add mTLS to kubeadm etcd liveness probe.

**What this PR does / why we need it**:
We switched etcd over to using mTLS, but the liveness probe is still using HTTP.
Disabling the liveness probe allows etcd to continue operating.

The real fix isn't simple, because we need to generate a client certificate for healthchecking and update the probe to exec `etcdctl` like so: 
https://sourcegraph.com/github.com/coreos/etcd-operator/-/blob/pkg/util/k8sutil/pod_util.go#L71-89

~Working on patching this now.~
This PR now generates the healthcheck identity and updates the liveness probe to use it.

**Which issue(s) this PR fixes**
Fixes #59766
Fixes kubernetes/kubeadm#720

**Special notes for your reviewer**:
We should generate a client cert specifically for etcd health checks so that the apiserver certs can be revoked independently.
This will be stored in `/etc/kubernetes/pki/etcd/` so that we don't have to change the pod's hostMount.

**Release note**:
```release-note
NONE
```
@jessicaochen (Member)

Not sure why this was closed, given that we have not verified that the change actually makes the test green. Could someone re-open this? Perhaps @krzyzacy?

@krzyzacy reopened this Mar 5, 2018
@stealthybox (Member)

I linked the PR as a fix -- thanks for reopening

@jberkus commented Mar 6, 2018

@jessicaochen: GitHub automatically closes an issue when a merged PR's description contains "Fixes #ISSUE".

@k8s-github-robot

[MILESTONENOTIFIER] Milestone Issue Needs Attention

@krzyzacy @kubernetes/sig-cluster-lifecycle-misc

Action required: During code freeze, issues in the milestone should be in progress.
If this issue is not being actively worked on, please remove it from the milestone.
If it is being worked on, please add the status/in-progress label so it can be tracked with other in-flight issues.

Note: This issue is marked as priority/critical-urgent, and must be updated every 1 day during code freeze.

Example update:

ACK.  In progress
ETA: DD/MM/YYYY
Risks: Complicated fix required
Issue Labels
  • sig/cluster-lifecycle: Issue will be escalated to these SIGs if needed.
  • priority/critical-urgent: Never automatically move issue out of a release milestone; continually escalate to contributor and SIG through all available channels.
  • kind/bug: Fixes a bug discovered during the current release.

@krzyzacy (Member, Author) commented Mar 7, 2018

this is also fixed

@krzyzacy closed this as completed Mar 7, 2018