e2e Flake: Cluster failed to initialize within 300 seconds (possible metadata server issue?) #20916
Comments
Happened on kubernetes-e2e-gce-slow as well.
@mml do you think you could take a look at this?
@davidopp okie doke.
I don't see much to debug with, but I'll try to reproduce. I think when any test fails at this stage, we could at least output whether or not we can ping the master, and then include the apiserver logfile if we can find it. Maybe a stack trace from the apiserver, too.
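For illustration, a minimal sketch of what that extra output could look like; MASTER_IP, MASTER_NAME, and ZONE are placeholders, not the actual harness variables:

    # Sketch: extra diagnostics when cluster validation fails.
    # MASTER_IP, MASTER_NAME, and ZONE are assumed placeholders.
    if ping -c 3 -W 2 "${MASTER_IP}"; then
      echo "Master ${MASTER_IP} is reachable"
    else
      echo "Master ${MASTER_IP} is NOT reachable"
    fi
    # Best-effort dump of the apiserver log from the master.
    gcloud compute ssh "${MASTER_NAME}" --zone "${ZONE}" \
      --command "sudo cat /var/log/kube-apiserver.log" \
      || echo "could not fetch apiserver log"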
Most generic error in the world. Happened twice this morning in gce-serial: http://kubekins.dls.corp.google.com/view/Critical%20Builds/job/kubernetes-e2e-gce-serial/462/
OK. Mine just failed on the first try.
and then:

    mml@e2e-test-mml-master:~$ curl https://storage.googleapis.com/kubernetes-staging-e71034d0da/e2e-test-mml-devel/kubernetes-salt.tar.gz
    <?xml version='1.0' encoding='UTF-8'?><Error><Code>NoSuchKey</Code><Message>The specified key does not exist.</Message></Error>

Two things seem weird:
No. 2 is obvious, because that's what the code says to do: validate, then keep going no matter what was returned. https://github.com/kubernetes/kubernetes/blob/master/cluster/gce/configure-vm.sh#L571 This is incidental to the flaky test, but I'd like to fix it unless it's on purpose. I fear that if we change this now, a bunch of clusters might fail to boot due to hidden bugs in this part of the process, so maybe it's a change to save until after the 1.2 branch is cut.
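The shape of the problem, paraphrased rather than quoted from configure-vm.sh (validate-hash stands in for whatever the real helper is named):

    # Paraphrase of the permissive pattern: the hash check runs,
    # but its exit status is never inspected, so boot continues.
    validate-hash "${file}" "${expected_sha1}"
    echo "== Downloaded ${url} =="   # reached even on a bad hash

    # The stricter variant, saved for after the 1.2 branch cut:
    if ! validate-hash "${file}" "${expected_sha1}"; then
      echo "== Hash validation of ${url} failed ==" >&2
      exit 1
    fi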
OK, so I'm planning to print the contents of /var/log/startupscript.log and /var/log/kube-apiserver.log on failure, and I plan to toggle this on an env var setting:
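Roughly like this, with KUBE_DUMP_LOGS_ON_FAILURE as a placeholder for whatever the variable ends up being named:

    # Sketch: dump boot logs on failure, gated on an env var.
    if [[ "${KUBE_DUMP_LOGS_ON_FAILURE:-false}" == "true" ]]; then
      for f in /var/log/startupscript.log /var/log/kube-apiserver.log; do
        echo "==== ${f} ===="
        sudo cat "${f}" || echo "(missing)"
      done
    fi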
Nice detective work. We wouldn't get to the SHA validation if we were handling status codes >= 400 in download-or-bust. We should retry on these as well. I'm hoping that GCS is just taking a second to converge after pushing the artifact, and that there aren't other errors we are failing to handle when pushing the artifact.
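For the record, plain curl exits 0 on an HTTP 404 (the error body just lands in the output file), which is presumably how we sailed past the check; -f turns HTTP >= 400 into a failing exit code (22). A sketch of the retry, with the function body being illustrative rather than the real script:

    # Illustrative retry loop for download-or-bust: -f makes curl
    # fail on HTTP >= 400 instead of saving the error page.
    download-or-bust() {
      local url="$1" file="${url##*/}"
      until curl -f --silent --show-error -o "${file}" "${url}"; do
        echo "download of ${url} failed; sleeping and retrying" >&2
        sleep 5
      done
    }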
Actually that's not what it is... Looks like it is returning a bad error code on 404. I'm wondering why we got past download-or-bust. What is in the file when you cat it after seeing this error?
Could this have possibly just happened in a GKE cluster? http://kubekins.dls.corp.google.com/view/Critical%20Builds/job/kubernetes-e2e-gke-slow/528/console I'm wondering if these are all the same issue or different ones.
I don't think the missing tar file is happening elsewhere. I tracked it down to having switched projects but not removed the *.uploaded.sha1 files. I doubt that happens in Jenkins.
With a bug inserted and the timeout reduced...
Any thoughts on next steps for this issue? (It's marked P0 v1.2.) It doesn't sound like we've identified any specific problem here that doesn't already have a separate issue filed for it (and all three of the things Dawn mentioned have been merged). The fix for #10423 has been merged, so we can get better logs in the future. I'd propose we close this issue for now. Any objections?
We just merged #22192, which is the fallout from the debugging in #20916 (comment). I want to see why/if/for how long the metadata server is hanging. Previously we weren't timing out requests, so there's a chance we won't hit that failure mode if it was a GET hang of some sort that came after the string of 404s.
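The gist of the poll+timeout, as a sketch rather than the merged diff (the 10s per-request cap and the 120s overall deadline are invented numbers for illustration):

    # Sketch: bounded polling of the GCE metadata server, so a
    # hanging GET surfaces as a retry instead of blocking forever.
    deadline=$((SECONDS + 120))
    until curl -f --max-time 10 -H "Metadata-Flavor: Google" -o kube-env \
        "http://metadata.google.internal/computeMetadata/v1/instance/attributes/kube-env"; do
      if (( SECONDS >= deadline )); then
        echo "metadata server unresponsive after 120s" >&2
        exit 1
      fi
      sleep 3
    done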
I'm reassigning to myself so I can take a look with the next failure.
OK, but can you re-title this issue so it's just about trying to read from the metadata server?
Hmm, but I don't even know if that's the problem.
This happened again in http://kubekins.dls.corp.google.com:8081/job/kubernetes-pull-build-test-e2e-gce/32023/console and it looks like we just didn't create the master.
I'm spinning that off into another bug.
@bprashanth what remains to close this bug? Should it be closed so that we only track #22655?
The last issue that Wojtek posted logs to seems to be that etcd wasn't up; filed: I haven't observed the metadata server issue that I'm keeping this open for since I put in the poll+timeout, so I'm closing this for now as suggested by David in #20916 (comment). Please re-open if it reoccurs, and we can dig through logs and triage.
This just happened again: it seems that the master components didn't even start, although we have some logs from the master kubelet:
That's really strange. The test ran 9 hours ago: #22682 (comment). And the kubelet logs have:
Which I removed in https://github.com/kubernetes/kubernetes/pull/22192/files#diff-d983c6311a6046434cdbf5d6edfc84f9R66 and which went in a week ago. So I'm assuming the run didn't have my timeout fix either, from the same PR. @kubernetes/goog-testing am I missing something, or is the builder confused?
Maybe it was an older PR that had not been rebased recently. Do we attempt to merge with master before building the test artifact, like Travis does?
Actually, I agree with @bprashanth - there is something extremely strange here.
@mikedanese Yes, PR Jenkins checks out the PR merge tag when possible. If the PR has a merge conflict then we will get whatever the user has in their branch, which may be very out of date. We should probably disallow building/testing PRs with merge conflicts.
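For context, GitHub precomputes a merge commit per PR that CI can fetch; when the PR has conflicts, that ref is absent (1234 below is a placeholder PR number):

    # Check out GitHub's precomputed PR-merged-with-master commit.
    # refs/pull/1234/merge only exists while the PR merges cleanly.
    git fetch origin refs/pull/1234/merge && git checkout FETCH_HEAD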
The build log shows no merge conflicts in that build, and indeed shows:
I really don't know what happened. |
I'm putting the reason this was re-opened down to: infra weirdness. #22933 failed because of:
which has an associated internal bug and is not this issue. The last failure is the only concerning one, for which I've forked #23967:
https://cloud.google.com/console/storage/kubernetes-jenkins/logs/kubernetes-e2e-gce-scalability/4496/