
e2e Flake: Cluster failed to initialize within 300 seconds (possible metadata server issue?) #20916

Closed
krousey opened this issue Feb 9, 2016 · 55 comments
Labels
kind/flake Categorizes issue or PR as related to a flaky test. priority/critical-urgent Highest priority. Must be actively worked on as someone's top priority right now.

Comments

@krousey
Contributor

krousey commented Feb 9, 2016

10:12:14 Using master: e2e-scalability-master (external IP: 104.197.128.205)
10:12:14 Waiting up to 300 seconds for cluster initialization.
10:12:14 
10:12:14   This will continually check to see if the API for kubernetes is reachable.
10:12:14   This may time out if there was some uncaught error during start up.
10:12:14 
10:12:14 ......................................................................................................................................................Cluster failed to initialize within 300 seconds.
10:17:15 2016/02/09 10:17:15 e2e.go:309: Error running up: exit status 2
10:17:15 2016/02/09 10:17:15 e2e.go:305: Step 'up' finished in 8m0.913745413s

https://cloud.google.com/console/storage/kubernetes-jenkins/logs/kubernetes-e2e-gce-scalability/4496/

@krousey krousey added the kind/flake Categorizes issue or PR as related to a flaky test. label Feb 9, 2016
@ikehz
Contributor

ikehz commented Feb 9, 2016

@ikehz ikehz changed the title e2e Flake: kubernetes-e2e-gce-scalability Cluster failed to initialize within 300 seconds e2e Flake: Cluster failed to initialize within 300 seconds Feb 9, 2016
@krousey
Contributor Author

krousey commented Feb 9, 2016

@krousey krousey added the priority/important-soon Must be staffed and worked on either currently, or very soon, ideally in time for the next release. label Feb 9, 2016
@krousey
Contributor Author

krousey commented Feb 10, 2016

@davidopp
Member

@mml do you think you could take a look at this?

@mml mml self-assigned this Feb 11, 2016
@mml
Contributor

mml commented Feb 11, 2016

@davidopp okie doke.

@mml
Contributor

mml commented Feb 11, 2016

I don't see much to debug with, but I'll try to reproduce.

I think when any test fails at this stage, we could at least output whether or not we can ping the master and then include the apiserver logfile if we can find it. Maybe a stacktrace from the apiserver, too.
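
Something along these lines would do it (a rough sketch of the idea, not a patch; MASTER_IP is a stand-in for whatever kube-up already has in scope):

# Rough sketch only: on timeout, report whether the master answers at all.
if ping -c 3 -W 2 "${MASTER_IP}" >/dev/null 2>&1; then
  echo "master ${MASTER_IP}: responds to ping"
else
  echo "master ${MASTER_IP}: no ping response"
fi
# The wait loop already polls the API; surfacing the last curl error would help too.
curl --max-time 5 --insecure "https://${MASTER_IP}/healthz" \
  || echo "apiserver on ${MASTER_IP} not reachable"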

@mikedanese
Member

@mml
Contributor

mml commented Feb 11, 2016

OK. Mine just failed on the first try.

Feb 11 19:54:26 e2e-test-mml-master startupscript: == kubernetes-salt.tar.gz corrupted, sha1 6cf22a7f474695a7b02c4f8e6bbe35b2441c8eb2 doesn't match expected f18d39b707eb8101b18be679d687507cde1d0a60 ==
Feb 11 19:54:26 e2e-test-mml-master startupscript: Validated https://storage.googleapis.com/kubernetes-staging-e71034d0da/e2e-test-mml-devel/kubernetes-salt.tar.gz SHA1 = f18d39b707eb8101b18be679d687507cde1d0a60
Feb 11 19:54:26 e2e-test-mml-master startupscript: Unpacking Salt tree and checking integrity of binary release tar
Feb 11 19:54:26 e2e-test-mml-master startupscript: 
Feb 11 19:54:26 e2e-test-mml-master startupscript: gzip: stdin: not in gzip format
Feb 11 19:54:26 e2e-test-mml-master startupscript: tar: Child returned status 1
Feb 11 19:54:26 e2e-test-mml-master startupscript: tar: Error is not recoverable: exiting now
Feb 11 19:54:41 e2e-test-mml-master startupscript: Couldn't download release. Retrying...
Feb 11 19:54:41 e2e-test-mml-master startupscript: Downloading binary release tar (https://storage.googleapis.com/kubernetes-staging-e71034d0da/e2e-test-mml-devel/kubernetes-server-linux-amd64.tar.gz)
Feb 11 19:54:41 e2e-test-mml-master startupscript:   % Total    % Received % Xferd  Average Speed   Time    Time     Time  Current
Feb 11 19:54:41 e2e-test-mml-master startupscript:                                  Dload  Upload   Total   Spent    Left  Speed
Feb 11 19:54:41 e2e-test-mml-master startupscript: #015  0     0    0     0    0     0      0      0 --:--:-- --:--:-- --:--:--     0#015100   127  100   127    0     0   5797      0 --:--:-- --:--:-- --:--:--  6684
Feb 11 19:54:41 e2e-test-mml-master startupscript: == kubernetes-server-linux-amd64.tar.gz corrupted, sha1 6cf22a7f474695a7b02c4f8e6bbe35b2441c8eb2 doesn't match expected 3a0bfc3dd36a357c3c79a203260577937ee9af76 ==
Feb 11 19:54:41 e2e-test-mml-master startupscript: Validated https://storage.googleapis.com/kubernetes-staging-e71034d0da/e2e-test-mml-devel/kubernetes-server-linux-amd64.tar.gz SHA1 = 3a0bfc3dd36a357c3c79a203260577937ee9af76
Feb 11 19:54:41 e2e-test-mml-master startupscript: Downloading Salt tar (https://storage.googleapis.com/kubernetes-staging-e71034d0da/e2e-test-mml-devel/kubernetes-salt.tar.gz)
Feb 11 19:54:41 e2e-test-mml-master startupscript:   % Total    % Received % Xferd  Average Speed   Time    Time     Time  Current
Feb 11 19:54:41 e2e-test-mml-master startupscript:                                  Dload  Upload   Total   Spent    Left  Speed
Feb 11 19:54:41 e2e-test-mml-master startupscript: #015  0     0    0     0    0     0      0      0 --:--:-- --:--:-- --:--:--     0#015100   127  100   127    0     0   5075      0 --:--:-- --:--:-- --:--:--  5772
Feb 11 19:54:41 e2e-test-mml-master startupscript: == kubernetes-salt.tar.gz corrupted, sha1 6cf22a7f474695a7b02c4f8e6bbe35b2441c8eb2 doesn't match expected f18d39b707eb8101b18be679d687507cde1d0a60 ==
Feb 11 19:54:41 e2e-test-mml-master startupscript: Validated https://storage.googleapis.com/kubernetes-staging-e71034d0da/e2e-test-mml-devel/kubernetes-salt.tar.gz SHA1 = f18d39b707eb8101b18be679d687507cde1d0a60
Feb 11 19:54:41 e2e-test-mml-master startupscript: Unpacking Salt tree and checking integrity of binary release tar
Feb 11 19:54:41 e2e-test-mml-master startupscript: 
Feb 11 19:54:41 e2e-test-mml-master startupscript: gzip: stdin: not in gzip format
Feb 11 19:54:41 e2e-test-mml-master startupscript: tar: Child returned status 1
Feb 11 19:54:41 e2e-test-mml-master startupscript: tar: Error is not recoverable: exiting now

and then

mml@e2e-test-mml-master:~$ curl https://storage.googleapis.com/kubernetes-staging-e71034d0da/e2e-test-mml-devel/kubernetes-salt.tar.gz
<?xml version='1.0' encoding='UTF-8'?><Error><Code>NoSuchKey</Code><Message>The specified key does not exist.</Message></Error>mml@e2e-test-mml-master:~$ 

Two things seem weird:

  1. Where's the salt tarball?
  2. After the sha1 doesn't match, why does it say "Validated" and then appear to keep going?

@mml
Contributor

mml commented Feb 11, 2016

No. 2 is obvious: that's what the code says to do. It validates and then keeps going no matter what the check returned. https://github.com/kubernetes/kubernetes/blob/master/cluster/gce/configure-vm.sh#L571

This is incidental to the flaky test, but I'd like to fix it unless it's intentional. I fear that if we change this now, a bunch of clusters might fail to boot due to hidden bugs in this part of the process, so it may be a change to save until after the 1.2 branch is cut.
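
For reference, the stricter behavior would look roughly like this (a sketch only, not the actual configure-vm.sh code; the function and variable names are illustrative):

# Sketch: make the hash check fatal so the caller retries the download instead
# of unpacking a corrupted tarball.
validate_hash() {
  local file="$1" expected="$2"
  local actual
  actual=$(sha1sum "${file}" | awk '{ print $1 }')
  if [[ "${actual}" != "${expected}" ]]; then
    echo "== ${file} corrupted, sha1 ${actual} doesn't match expected ${expected} =="
    return 1
  fi
  echo "Validated ${file} SHA1 = ${actual}"
}

# Caller: retry until the downloaded file actually matches.
until validate_hash "kubernetes-salt.tar.gz" "${SALT_TAR_HASH}"; do
  echo "Couldn't download a valid release. Retrying..."
  # re-download here
done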

@mml
Contributor

mml commented Feb 11, 2016

OK, so I'm planning to print the contents of /var/log/startupscript.log and /var/log/kube-apiserver.log on failure, and to toggle this with an env var: KUBE_TEST_DEBUG. That seems to follow the existing patterns, and I couldn't find an existing env var to reuse.
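
Roughly this shape (a sketch, not the final patch; MASTER_NAME and ZONE are stand-ins for the kube-up environment variables):

# Sketch: only dump master logs when KUBE_TEST_DEBUG is set, following the
# existing env-var pattern.
if [[ "${KUBE_TEST_DEBUG:-}" == "true" ]]; then
  for log in /var/log/startupscript.log /var/log/kube-apiserver.log; do
    echo "${log} contents:"
    if gcloud compute copy-files "${MASTER_NAME}:${log}" /tmp/ --zone "${ZONE}"; then
      cat "/tmp/$(basename "${log}")"
    else
      echo "could not fetch ${log} from ${MASTER_NAME}"
    fi
  done
fi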

@mikedanese
Member

Nice detective work. We wouldn't get to the sha validation if we were handling status codes >= 400 in download-or-bust; we should retry on those as well. I'm hoping that GCS is just taking a second to converge after the artifact is pushed and that there aren't other errors we're failing to handle.

➜  kubernetes git:(controller-manager-componentconfig) curl https://storage.googleapis.com/asdlkfjaskl/lsajdflkdsajfl && echo -e "\n$?"  
<?xml version='1.0' encoding='UTF-8'?><Error><Code>NoSuchBucket</Code><Message>The specified bucket does not exist.</Message></Error>
0
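
A plain curl exits 0 even when the server returns an error page, so the 404 body gets saved as the tarball. A sketch of the suggested handling (not the real download-or-bust code; curl's --fail flag exits non-zero on HTTP >= 400):

# Sketch: treat HTTP errors as download failures and retry, instead of letting
# the XML error body be saved and fail the sha1 check later.
download_with_retry() {
  local url="$1" dest="$2"
  local attempt
  for attempt in 1 2 3 4 5; do
    # --fail: exit non-zero (22) on a 4xx/5xx instead of writing the error page to ${dest}.
    if curl --fail --silent --show-error --location -o "${dest}" "${url}"; then
      return 0
    fi
    echo "Attempt ${attempt} to fetch ${url} failed; retrying..."
    sleep 5
  done
  return 1
}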

@mikedanese
Member

Actually that's not what it is... Looks like curl returns a success exit code even on a 404. I'm wondering why we got past download-or-bust. What is in the file when you cat it after seeing this error?

@mikedanese
Member

Could this have possibly just happened in a GKE cluster?

http://kubekins.dls.corp.google.com/view/Critical%20Builds/job/kubernetes-e2e-gke-slow/528/console

I'm wondering if these are all the same issue or different ones.

@mml
Contributor

mml commented Feb 11, 2016

I don't think the missing tar file is happening elsewhere. I tracked it down to having switched projects without removing the *.uploaded.sha1 files. I doubt that happens in Jenkins.
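
(For anyone who hits the same thing locally: clearing the stale markers forces the next kube-up to re-push the tars to the current project's staging bucket. The path below is a guess at where they live, not a documented location.)

# Rough cleanup sketch; adjust the path to wherever the release tars were built.
find _output/ -name '*.uploaded.sha1' -print -delete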

@mml
Contributor

mml commented Feb 11, 2016

With a bug inserted and the timeout reduced...


  This will continually check to see if the API for kubernetes is reachable.
  This may time out if there was some uncaught error during start up.

..Cluster failed to initialize within 3 seconds.
/var/log/startupscript.log contents:
Warning: Permanently added '104.197.85.192' (ECDSA) to the list of known hosts.
startupscript.log                                                                                                                                                    100%   36KB  35.8KB/s   00:00
Feb 11 22:10:01 e2e-test-mml-master startupscript: Running startup script /var/run/google.startup.script
Feb 11 22:10:01 e2e-test-mml-master startupscript: == kube-up node config starting ==
Feb 11 22:10:01 e2e-test-mml-master startupscript: Networking functional on e2e-test-mml-master (10.240.0.2)
[...]
Feb 11 22:10:32 e2e-test-mml-master startupscript: `/srv/salt-new/reactor' -> `/srv/reactor'
Feb 11 22:10:33 e2e-test-mml-master startupscript: !!!!!!!! DOING SOMETHING BUGGY
Feb 11 22:10:33 e2e-test-mml-master startupscript: Finished running startup script /var/run/google.startup.script
/var/log/kube-apiserver.log contents:
Warning: Permanently added '104.197.85.192' (ECDSA) to the list of known hosts.
scp: /var/log/kube-apiserver.log: No such file or directory
ERROR: (gcloud.compute.copy-files) [/usr/bin/scp] exited with return code [1]. See https://cloud.google.com/compute/docs/troubleshooting#ssherrors for troubleshooting hints.
2016/02/11 14:11:18 e2e.go:309: Error running up: exit status 2
2016/02/11 14:11:18 e2e.go:305: Step 'up' finished in 2m41.007675547s
2016/02/11 14:11:18 e2e.go:141: Error starting e2e cluster. Aborting.
exit status 1

@mikedanese
Member

@davidopp
Member

davidopp commented Mar 4, 2016

Any thoughts on next steps for this issue? (It's marked P0 v1.2).

It doesn't sound like we've identified any specific problem here that doesn't already have a separate issue filed for it (and all three of the things Dawn mentioned have been merged). The fix for #10423 has been merged, so we can get better logs in the future. I'd propose we close this issue for now. Any objections?

@bprashanth
Contributor

We just merged #22192, which is fallout from the debugging in #20916 (comment). I want to see whether, why, and for how long the metadata server is hanging. Previously we weren't timing out requests, so there's a chance we won't hit that failure mode if it was a GET hang of some sort that came after the string of 404s.
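
Roughly the shape of the poll+timeout (a sketch, not the actual #22192 diff; the URL is just an example metadata attribute):

# Sketch: bound every metadata request with --max-time and poll, so a hanging
# GET shows up as repeated retries instead of stalling startup silently.
MD_URL="http://metadata.google.internal/computeMetadata/v1/instance/attributes/kube-env"
until curl --fail --silent --max-time 10 -H "Metadata-Flavor: Google" -o /tmp/kube-env.yaml "${MD_URL}"; do
  echo "$(date): metadata server unreachable or hanging, retrying..."
  sleep 3
done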

@bprashanth bprashanth assigned bprashanth and unassigned mml Mar 4, 2016
@bprashanth
Contributor

I'm reassigning this to myself so I can take a look at the next failure.

@davidopp
Member

davidopp commented Mar 4, 2016

OK, but can you re-title this issue so it's just about trying to read from the metadata server?

@bprashanth
Contributor

Hmm, but I don't even know if that's the problem

@bprashanth bprashanth changed the title e2e Flake: Cluster failed to initialize within 300 seconds e2e Flake: Cluster failed to initialize within 300 seconds (possible metadata server issue?) Mar 4, 2016
@bprashanth
Contributor

This happened again in http://kubekins.dls.corp.google.com:8081/job/kubernetes-pull-build-test-e2e-gce/32023/console and it looks like we just didn't create the master.

INSTANCE_GROUPS=e2e-gce-builder-2-0-minion-group
NODE_NAMES=e2e-gce-builder-2-0-minion-0501 e2e-gce-builder-2-0-minion-4pdi e2e-gce-builder-2-0-minion-ekcs e2e-gce-builder-2-0-minion-gj5m e2e-gce-builder-2-0-minion-h9co e2e-gce-builder-2-0-minion-pv6j
ERROR: (gcloud.compute.instances.describe) Could not fetch resource:
 - The resource 'projects/kubernetes-jenkins-pull/zones/us-central1-f/instances/e2e-gce-builder-2-0-master' was not found
2016/03/07 10:20:02 e2e.go:200: Error running up: exit status 1
2016/03/07 10:20:02 e2e.go:196: Step 'up' finished in 7m42.604863565s
2016/03/07 10:20:02 e2e.go:110: Error starting e2e cluster. Aborting.

I'm spinning that off into another bug.

@goltermann
Contributor

@bprashanth what remains to close this bug? Should it be closed and we only track #22655?

@bprashanth
Contributor

The last failure that Wojtek posted logs for seems to be that etcd wasn't up; filed #22819.

I haven't observed the metadata server issue that I'm keeping this open for since I put in the poll+timeout, so I'm closing this for now, as David suggested in #20916 (comment). Please re-open if it recurs, and we can dig through logs and triage.

@wojtek-t
Member

This just happened again:
https://storage.googleapis.com/kubernetes-jenkins/pr-logs/pull/22682/kubernetes-pull-build-test-e2e-gce/32360/build-log.txt

It seems that the master components didn't even start, although we have some logs from the master kubelet. It looks like some metadata-server issue, but I'm not sure.

@bprashanth
Contributor

That's really strange. The test ran 9 hours ago: #22682 (comment)

And the kubelet logs have:

W0311 09:18:34.803067    2661 http.go:62] Failed to read pods from URL. Won't log this message anymore: http://metadata.google.internal/computeMetadata/v1/instance/attributes/google-container-manifest: 404 Not Found

I removed that line in https://github.com/kubernetes/kubernetes/pull/22192/files#diff-d983c6311a6046434cdbf5d6edfc84f9R66, which went in a week ago. So I'm assuming the run also didn't have my timeout fix from the same PR.

@kubernetes/goog-testing am I missing something or is the builder confused?

@mikedanese
Member

Maybe it was an older PR that had not been rebased recently. Do we attempt to merge with master before building the test artifact, like Travis does?

@wojtek-t
Member

Actually, I agree with @bprashanth - there is something extremely strange here.
@kubernetes/goog-testing - any thoughts on it?

@spxtr
Contributor

spxtr commented Mar 11, 2016

@mikedanese Yes, PR Jenkins checks out the PR merge tag when possible. If the PR has a merge conflict then we will get whatever the user has in their branch, which may be very out of date. We should probably disallow building/testing PRs with merge conflicts.

@ixdy
Member

ixdy commented Mar 12, 2016

The build log shows no merge conflicts in that build, and indeed shows

Checking out Revision cbb2d1a868465d30c8cb61f4bfcabb7cf8a61eb4 (refs/remotes/origin/pr/22682/merge)

I really don't know what happened.

@bprashanth
Contributor

I'm putting the reason this was re-opened down to infra weirdness.
Of the last two PRs that were linked against this:

#22933 failed because of:

e2e-gce-master-1-master-https e2e-gce-master-1 0.0.0.0/0  tcp:443          e2e-gce-master-1-master
ERROR: (gcloud.compute.addresses.create) Some requests did not succeed:
 - Backend Error
Attempt 1 failed to create static ip e2e-gce-master-1-master-ip. Retrying.
ERROR: (gcloud.compute.addresses.create) Some requests did not succeed:
 - The resource 'projects/kubernetes-jenkins-pull/regions/us-central1/addresses/e2e-gce-master-1-master-ip' already exists
Attempt 2 failed to create static ip e2e-gce-master-1-master-ip. Retrying.
ERROR: (gcloud.compute.addresses.create) Some requests did not succeed:

which has an associated internal bug, and is not this issue.

This last failure is the only concerning one; I've forked #23967 for it.
