
e2e Flake: Cluster failed to initialize within 300 seconds (possible metadata server issue?) #20916

Closed
krousey opened this issue Feb 9, 2016 · 55 comments
Labels
kind/flake Categorizes issue or PR as related to a flaky test. priority/critical-urgent Highest priority. Must be actively worked on as someone's top priority right now.

Comments

@krousey
Contributor

krousey commented Feb 9, 2016

10:12:14 Using master: e2e-scalability-master (external IP: 104.197.128.205)
10:12:14 Waiting up to 300 seconds for cluster initialization.
10:12:14 
10:12:14   This will continually check to see if the API for kubernetes is reachable.
10:12:14   This may time out if there was some uncaught error during start up.
10:12:14 
10:12:14 ......................................................................................................................................................Cluster failed to initialize within 300 seconds.
10:17:15 2016/02/09 10:17:15 e2e.go:309: Error running up: exit status 2
10:17:15 2016/02/09 10:17:15 e2e.go:305: Step 'up' finished in 8m0.913745413s

https://cloud.google.com/console/storage/kubernetes-jenkins/logs/kubernetes-e2e-gce-scalability/4496/

@krousey krousey added the kind/flake Categorizes issue or PR as related to a flaky test. label Feb 9, 2016
@ikehz
Contributor

ikehz commented Feb 9, 2016

@ikehz ikehz changed the title e2e Flake: kubernetes-e2e-gce-scalability Cluster failed to initialize within 300 seconds e2e Flake: Cluster failed to initialize within 300 seconds Feb 9, 2016
@krousey
Contributor Author

krousey commented Feb 9, 2016

@krousey krousey added the priority/important-soon Must be staffed and worked on either currently, or very soon, ideally in time for the next release. label Feb 9, 2016
@krousey
Contributor Author

krousey commented Feb 10, 2016

@davidopp
Member

@mml do you think you could take a look at this?

@mml mml self-assigned this Feb 11, 2016
@mml
Contributor

mml commented Feb 11, 2016

@davidopp okie doke.

@mml
Contributor

mml commented Feb 11, 2016

I don't see much to debug with, but I'll try to reproduce.

I think when any test fails at this stage, we could at least output whether or not we can ping the master and then include the apiserver logfile if we can find it. Maybe a stacktrace from the apiserver, too.
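
Something along these lines would do it (a rough sketch of the idea, not a patch; MASTER_IP is a stand-in for whatever kube-up already has in scope):

# Rough sketch only: on timeout, report whether the master answers at all.
if ping -c 3 -W 2 "${MASTER_IP}" >/dev/null 2>&1; then
  echo "master ${MASTER_IP}: responds to ping"
else
  echo "master ${MASTER_IP}: no ping response"
fi
# The wait loop already polls the API; surfacing the last curl error would help too.
curl --max-time 5 --insecure "https://${MASTER_IP}/healthz" \
  || echo "apiserver on ${MASTER_IP} not reachable"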

@mikedanese
Member

@mml
Contributor

mml commented Feb 11, 2016

OK. Mine just failed on the first try.

Feb 11 19:54:26 e2e-test-mml-master startupscript: == kubernetes-salt.tar.gz corrupted, sha1 6cf22a7f474695a7b02c4f8e6bbe35b2441c8eb2 doesn't match expected f18d39b707eb8101b18be679d687507cde1d0a60 ==
Feb 11 19:54:26 e2e-test-mml-master startupscript: Validated https://storage.googleapis.com/kubernetes-staging-e71034d0da/e2e-test-mml-devel/kubernetes-salt.tar.gz SHA1 = f18d39b707eb8101b18be679d687507cde1d0a60
Feb 11 19:54:26 e2e-test-mml-master startupscript: Unpacking Salt tree and checking integrity of binary release tar
Feb 11 19:54:26 e2e-test-mml-master startupscript: 
Feb 11 19:54:26 e2e-test-mml-master startupscript: gzip: stdin: not in gzip format
Feb 11 19:54:26 e2e-test-mml-master startupscript: tar: Child returned status 1
Feb 11 19:54:26 e2e-test-mml-master startupscript: tar: Error is not recoverable: exiting now
Feb 11 19:54:41 e2e-test-mml-master startupscript: Couldn't download release. Retrying...
Feb 11 19:54:41 e2e-test-mml-master startupscript: Downloading binary release tar (https://storage.googleapis.com/kubernetes-staging-e71034d0da/e2e-test-mml-devel/kubernetes-server-linux-amd64.tar.gz)
Feb 11 19:54:41 e2e-test-mml-master startupscript:   % Total    % Received % Xferd  Average Speed   Time    Time     Time  Current
Feb 11 19:54:41 e2e-test-mml-master startupscript:                                  Dload  Upload   Total   Spent    Left  Speed
Feb 11 19:54:41 e2e-test-mml-master startupscript: #015  0     0    0     0    0     0      0      0 --:--:-- --:--:-- --:--:--     0#015100   127  100   127    0     0   5797      0 --:--:-- --:--:-- --:--:--  6684
Feb 11 19:54:41 e2e-test-mml-master startupscript: == kubernetes-server-linux-amd64.tar.gz corrupted, sha1 6cf22a7f474695a7b02c4f8e6bbe35b2441c8eb2 doesn't match expected 3a0bfc3dd36a357c3c79a203260577937ee9af76 ==
Feb 11 19:54:41 e2e-test-mml-master startupscript: Validated https://storage.googleapis.com/kubernetes-staging-e71034d0da/e2e-test-mml-devel/kubernetes-server-linux-amd64.tar.gz SHA1 = 3a0bfc3dd36a357c3c79a203260577937ee9af76
Feb 11 19:54:41 e2e-test-mml-master startupscript: Downloading Salt tar (https://storage.googleapis.com/kubernetes-staging-e71034d0da/e2e-test-mml-devel/kubernetes-salt.tar.gz)
Feb 11 19:54:41 e2e-test-mml-master startupscript:   % Total    % Received % Xferd  Average Speed   Time    Time     Time  Current
Feb 11 19:54:41 e2e-test-mml-master startupscript:                                  Dload  Upload   Total   Spent    Left  Speed
Feb 11 19:54:41 e2e-test-mml-master startupscript: #015  0     0    0     0    0     0      0      0 --:--:-- --:--:-- --:--:--     0#015100   127  100   127    0     0   5075      0 --:--:-- --:--:-- --:--:--  5772
Feb 11 19:54:41 e2e-test-mml-master startupscript: == kubernetes-salt.tar.gz corrupted, sha1 6cf22a7f474695a7b02c4f8e6bbe35b2441c8eb2 doesn't match expected f18d39b707eb8101b18be679d687507cde1d0a60 ==
Feb 11 19:54:41 e2e-test-mml-master startupscript: Validated https://storage.googleapis.com/kubernetes-staging-e71034d0da/e2e-test-mml-devel/kubernetes-salt.tar.gz SHA1 = f18d39b707eb8101b18be679d687507cde1d0a60
Feb 11 19:54:41 e2e-test-mml-master startupscript: Unpacking Salt tree and checking integrity of binary release tar
Feb 11 19:54:41 e2e-test-mml-master startupscript: 
Feb 11 19:54:41 e2e-test-mml-master startupscript: gzip: stdin: not in gzip format
Feb 11 19:54:41 e2e-test-mml-master startupscript: tar: Child returned status 1
Feb 11 19:54:41 e2e-test-mml-master startupscript: tar: Error is not recoverable: exiting now

and then

mml@e2e-test-mml-master:~$ curl https://storage.googleapis.com/kubernetes-staging-e71034d0da/e2e-test-mml-devel/kubernetes-salt.tar.gz
<?xml version='1.0' encoding='UTF-8'?><Error><Code>NoSuchKey</Code><Message>The specified key does not exist.</Message></Error>mml@e2e-test-mml-master:~$ 

Two things seem weird:

  1. Where's the salt tarball?
  2. After the sha1 doesn't match, why does it say "Validated" and then appear to keep going?

@mml
Contributor

mml commented Feb 11, 2016

No. 2 is obvious: that's what the code says to do. It validates and then keeps going no matter what the check returned. https://github.com/kubernetes/kubernetes/blob/master/cluster/gce/configure-vm.sh#L571

This is incidental to the flaky test, but I'd like to fix it unless it's intentional. I fear that if we change this now, a bunch of clusters might fail to boot due to hidden bugs in this part of the process, so it may be a change to save until after the 1.2 branch is cut.
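
For reference, the stricter behavior would look roughly like this (a sketch only, not the actual configure-vm.sh code; the function and variable names are illustrative):

# Sketch: make the hash check fatal so the caller retries the download instead
# of unpacking a corrupted tarball.
validate_hash() {
  local file="$1" expected="$2"
  local actual
  actual=$(sha1sum "${file}" | awk '{ print $1 }')
  if [[ "${actual}" != "${expected}" ]]; then
    echo "== ${file} corrupted, sha1 ${actual} doesn't match expected ${expected} =="
    return 1
  fi
  echo "Validated ${file} SHA1 = ${actual}"
}

# Caller: retry until the downloaded file actually matches.
until validate_hash "kubernetes-salt.tar.gz" "${SALT_TAR_HASH}"; do
  echo "Couldn't download a valid release. Retrying..."
  # re-download here
done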

@mml
Contributor

mml commented Feb 11, 2016

OK, so I'm planning to print the contents of /var/log/startupscript.log and /var/log/kube-apiserver.log on failure, and to toggle this with an env var: KUBE_TEST_DEBUG. That seems to follow the existing patterns, and I couldn't find an existing env var to reuse.
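
Roughly this shape (a sketch, not the final patch; MASTER_NAME and ZONE are stand-ins for the kube-up environment variables):

# Sketch: only dump master logs when KUBE_TEST_DEBUG is set, following the
# existing env-var pattern.
if [[ "${KUBE_TEST_DEBUG:-}" == "true" ]]; then
  for log in /var/log/startupscript.log /var/log/kube-apiserver.log; do
    echo "${log} contents:"
    if gcloud compute copy-files "${MASTER_NAME}:${log}" /tmp/ --zone "${ZONE}"; then
      cat "/tmp/$(basename "${log}")"
    else
      echo "could not fetch ${log} from ${MASTER_NAME}"
    fi
  done
fi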

@mikedanese
Member

Nice detective work. We wouldn't get to the sha validation if we were handling status codes >= 400 in download-or-bust; we should retry on those as well. I'm hoping that GCS is just taking a second to converge after the artifact is pushed and that there aren't other errors we're failing to handle.

➜  kubernetes git:(controller-manager-componentconfig) curl https://storage.googleapis.com/asdlkfjaskl/lsajdflkdsajfl && echo -e "\n$?"  
<?xml version='1.0' encoding='UTF-8'?><Error><Code>NoSuchBucket</Code><Message>The specified bucket does not exist.</Message></Error>
0
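
A plain curl exits 0 even when the server returns an error page, so the 404 body gets saved as the tarball. A sketch of the suggested handling (not the real download-or-bust code; curl's --fail flag exits non-zero on HTTP >= 400):

# Sketch: treat HTTP errors as download failures and retry, instead of letting
# the XML error body be saved and fail the sha1 check later.
download_with_retry() {
  local url="$1" dest="$2"
  local attempt
  for attempt in 1 2 3 4 5; do
    # --fail: exit non-zero (22) on a 4xx/5xx instead of writing the error page to ${dest}.
    if curl --fail --silent --show-error --location -o "${dest}" "${url}"; then
      return 0
    fi
    echo "Attempt ${attempt} to fetch ${url} failed; retrying..."
    sleep 5
  done
  return 1
}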

@mikedanese
Member

Actually that's not what it is... Looks like curl returns a success exit code even on a 404. I'm wondering why we got past download-or-bust. What is in the file when you cat it after seeing this error?

@mikedanese
Member

Could this have possibly just happened in a GKE cluster?

http://kubekins.dls.corp.google.com/view/Critical%20Builds/job/kubernetes-e2e-gke-slow/528/console

I'm wondering if these are all the same issue or different ones.

@mml
Contributor

mml commented Feb 11, 2016

I don't think the missing tar file is happening elsewhere. I tracked it down to having switched projects without removing the *.uploaded.sha1 files. I doubt that happens in Jenkins.
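
(For anyone who hits the same thing locally: clearing the stale markers forces the next kube-up to re-push the tars to the current project's staging bucket. The path below is a guess at where they live, not a documented location.)

# Rough cleanup sketch; adjust the path to wherever the release tars were built.
find _output/ -name '*.uploaded.sha1' -print -delete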

@mml
Contributor

mml commented Feb 11, 2016

With a bug inserted and the timeout reduced...


  This will continually check to see if the API for kubernetes is reachable.
  This may time out if there was some uncaught error during start up.

..Cluster failed to initialize within 3 seconds.
/var/log/startupscript.log contents:
Warning: Permanently added '104.197.85.192' (ECDSA) to the list of known hosts.
startupscript.log                                                                                                                                                    100%   36KB  35.8KB/s   00:00
Feb 11 22:10:01 e2e-test-mml-master startupscript: Running startup script /var/run/google.startup.script
Feb 11 22:10:01 e2e-test-mml-master startupscript: == kube-up node config starting ==
Feb 11 22:10:01 e2e-test-mml-master startupscript: Networking functional on e2e-test-mml-master (10.240.0.2)
[...]
Feb 11 22:10:32 e2e-test-mml-master startupscript: `/srv/salt-new/reactor' -> `/srv/reactor'
Feb 11 22:10:33 e2e-test-mml-master startupscript: !!!!!!!! DOING SOMETHING BUGGY
Feb 11 22:10:33 e2e-test-mml-master startupscript: Finished running startup script /var/run/google.startup.script
/var/log/kube-apiserver.log contents:
Warning: Permanently added '104.197.85.192' (ECDSA) to the list of known hosts.
scp: /var/log/kube-apiserver.log: No such file or directory
ERROR: (gcloud.compute.copy-files) [/usr/bin/scp] exited with return code [1]. See https://cloud.google.com/compute/docs/troubleshooting#ssherrors for troubleshooting hints.
2016/02/11 14:11:18 e2e.go:309: Error running up: exit status 2
2016/02/11 14:11:18 e2e.go:305: Step 'up' finished in 2m41.007675547s
2016/02/11 14:11:18 e2e.go:141: Error starting e2e cluster. Aborting.
exit status 1

@mikedanese
Member

@davidopp
Member

davidopp commented Mar 4, 2016

Any thoughts on next steps for this issue? (It's marked P0 v1.2).

It doesn't sound like we've identified any specific problem here that doesn't already have a separate issue filed for it (and all three of the things Dawn mentioned have been merged). The fix for #10423 has been merged, so we can get better logs in the future. I'd propose we close this issue for now. Any objections?

@bprashanth
Contributor

We just merged #22192, which is fallout from the debugging in #20916 (comment). I want to see whether, why, and for how long the metadata server is hanging. Previously we weren't timing out requests, so there's a chance we won't hit that failure mode if it was a GET hang of some sort that came after the string of 404s.
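
Roughly the shape of the poll+timeout (a sketch, not the actual #22192 diff; the URL is just an example metadata attribute):

# Sketch: bound every metadata request with --max-time and poll, so a hanging
# GET shows up as repeated retries instead of stalling startup silently.
MD_URL="http://metadata.google.internal/computeMetadata/v1/instance/attributes/kube-env"
until curl --fail --silent --max-time 10 -H "Metadata-Flavor: Google" -o /tmp/kube-env.yaml "${MD_URL}"; do
  echo "$(date): metadata server unreachable or hanging, retrying..."
  sleep 3
done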

@bprashanth bprashanth assigned bprashanth and unassigned mml Mar 4, 2016
@bprashanth
Contributor

I'm reassigning this to myself so I can take a look at the next failure.

@davidopp
Member

davidopp commented Mar 4, 2016

OK, but can you re-title this issue so it's just about trying to read from the metadata server?

@bprashanth
Contributor

Hmm, but I don't even know if that's the problem

@bprashanth bprashanth changed the title e2e Flake: Cluster failed to initialize within 300 seconds e2e Flake: Cluster failed to initialize within 300 seconds (possible metadata server issue?) Mar 4, 2016
@bprashanth
Contributor

This happened again in http://kubekins.dls.corp.google.com:8081/job/kubernetes-pull-build-test-e2e-gce/32023/console and it looks like we just didn't create the master.

INSTANCE_GROUPS=e2e-gce-builder-2-0-minion-group
NODE_NAMES=e2e-gce-builder-2-0-minion-0501 e2e-gce-builder-2-0-minion-4pdi e2e-gce-builder-2-0-minion-ekcs e2e-gce-builder-2-0-minion-gj5m e2e-gce-builder-2-0-minion-h9co e2e-gce-builder-2-0-minion-pv6j
ERROR: (gcloud.compute.instances.describe) Could not fetch resource:
 - The resource 'projects/kubernetes-jenkins-pull/zones/us-central1-f/instances/e2e-gce-builder-2-0-master' was not found
2016/03/07 10:20:02 e2e.go:200: Error running up: exit status 1
2016/03/07 10:20:02 e2e.go:196: Step 'up' finished in 7m42.604863565s
2016/03/07 10:20:02 e2e.go:110: Error starting e2e cluster. Aborting.

I'm spinning that off into another bug.

@goltermann
Contributor

@bprashanth what remains to close this bug? Should it be closed and we only track #22655?

@bprashanth
Contributor

The last failure that Wojtek posted logs for seems to be that etcd wasn't up; filed #22819.

I haven't observed the metadata server issue that I'm keeping this open for since I put in the poll+timeout, so I'm closing this for now, as David suggested in #20916 (comment). Please re-open if it recurs, and we can dig through logs and triage.

@wojtek-t
Member

This just happened again:
https://storage.googleapis.com/kubernetes-jenkins/pr-logs/pull/22682/kubernetes-pull-build-test-e2e-gce/32360/build-log.txt

It seems that the master components didn't even start, although we have some logs from the master kubelet. It looks like some metadata-server issue, but I'm not sure.

@bprashanth
Contributor

That's really strange. The test ran 9 hours ago: #22682 (comment)

And the kubelet logs have:

W0311 09:18:34.803067    2661 http.go:62] Failed to read pods from URL. Won't log this message anymore: http://metadata.google.internal/computeMetadata/v1/instance/attributes/google-container-manifest: 404 Not Found

I removed that line in https://github.com/kubernetes/kubernetes/pull/22192/files#diff-d983c6311a6046434cdbf5d6edfc84f9R66, which went in a week ago. So I'm assuming the run also didn't have my timeout fix from the same PR.

@kubernetes/goog-testing am I missing something or is the builder confused?

@mikedanese
Member

Maybe it was an older PR that had not been rebased recently. Do we attempt to merge with master before building the test artifact, like Travis does?

@wojtek-t
Member

Actually, I agree with @bprashanth - there is something extremely strange here.
@kubernetes/goog-testing - any thoughts on it?

@spxtr
Contributor

spxtr commented Mar 11, 2016

@mikedanese Yes, PR Jenkins checks out the PR merge tag when possible. If the PR has a merge conflict then we will get whatever the user has in their branch, which may be very out of date. We should probably disallow building/testing PRs with merge conflicts.

@ixdy
Member

ixdy commented Mar 12, 2016

The build log shows no merge conflicts in that build, and indeed shows

Checking out Revision cbb2d1a868465d30c8cb61f4bfcabb7cf8a61eb4 (refs/remotes/origin/pr/22682/merge)

I really don't know what happened.

@bprashanth
Contributor

I'm putting the reason this was re-opened down to infra weirdness.
Of the last two PRs that were linked against this:

#22933 failed because of:

e2e-gce-master-1-master-https e2e-gce-master-1 0.0.0.0/0  tcp:443          e2e-gce-master-1-master
ERROR: (gcloud.compute.addresses.create) Some requests did not succeed:
 - Backend Error
Attempt 1 failed to create static ip e2e-gce-master-1-master-ip. Retrying.
ERROR: (gcloud.compute.addresses.create) Some requests did not succeed:
 - The resource 'projects/kubernetes-jenkins-pull/regions/us-central1/addresses/e2e-gce-master-1-master-ip' already exists
Attempt 2 failed to create static ip e2e-gce-master-1-master-ip. Retrying.
ERROR: (gcloud.compute.addresses.create) Some requests did not succeed:

which has an associated internal bug, and is not this issue.

This last failure is the only concerning one; I've forked #23967 for it.
