e2e clusters sometimes fail to create master #22655

Closed · bprashanth opened this issue Mar 7, 2016 · 28 comments

Labels: area/platform/gce, kind/flake, priority/backlog

@bprashanth (Contributor)

Observed in #18672 (comment)
Probably contributing to: #20916 (comment)

INSTANCE_GROUPS=e2e-gce-builder-2-0-minion-group
NODE_NAMES=e2e-gce-builder-2-0-minion-0501 e2e-gce-builder-2-0-minion-4pdi e2e-gce-builder-2-0-minion-ekcs e2e-gce-builder-2-0-minion-gj5m e2e-gce-builder-2-0-minion-h9co e2e-gce-builder-2-0-minion-pv6j
ERROR: (gcloud.compute.instances.describe) Could not fetch resource:
 - The resource 'projects/kubernetes-jenkins-pull/zones/us-central1-f/instances/e2e-gce-builder-2-0-master' was not found
2016/03/07 10:20:02 e2e.go:200: Error running up: exit status 1
2016/03/07 10:20:02 e2e.go:196: Step 'up' finished in 7m42.604863565s
2016/03/07 10:20:02 e2e.go:110: Error starting e2e cluster. Aborting.

Of course the kubelets are complaining:

I0307 18:22:10.110980    3364 kubelet.go:1129] Unable to register e2e-gce-builder-2-0-minion-ekcs with the apiserver: Post https://e2e-gce-builder-2-0-master/api/v1/nodes: dial tcp: lookup e2e-gce-builder-2-0-master: no such host
I0307 18:22:10.170923    3364 kubelet.go:2355] skipping pod synchronization - [ConfigureCBR0 requested, but PodCIDR not set. Will not configure CBR0 right now container runtime is down]
E0307 18:22:10.175011    3364 kubelet.go:2696] Container runtime sanity check failed: docker: failed to get docker version: cannot connect to Docker endpoint

Looks like we didn't even create the master VM, but the error is lost.

bprashanth added the area/test-infra and kind/flake labels on Mar 7, 2016
@dchen1107 (Member)

Shouldn't #10423 address the log-related issue for us? cc @spxtr

@bprashanth (Contributor, author)

In this case I don't think the VM even existed for us to collect logs from.

@dchen1107 (Member)

Ah, I see from the error message above that this failure can occur with a real production cluster, not just in tests. cc @kubernetes/goog-gke

@fejta (Contributor), Mar 8, 2016

Yeah, this is a https://github.com/kubernetes/kubernetes/blob/master/cluster/gce/debian/helper.sh#L42 flake (or a gcloud flake). Should this be in a retry loop? Or is there some way we can get the gcloud debug logs?
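
A retry wrapper along these lines is one way to do it. This is only a minimal sketch of the pattern being discussed; the function name, attempt count, and sleep are made up, not the actual kube-up.sh code:

# Hypothetical helper, not the real kube-up.sh code: retry a flaky
# gcloud call a few times before giving up.
function run-gcloud-with-retries() {
  local -r max_attempts=3
  local attempt=1
  until "$@"; do
    if (( attempt >= max_attempts )); then
      echo "Command failed after ${max_attempts} attempts: $*" >&2
      return 1
    fi
    echo "Attempt ${attempt} failed, retrying: $*" >&2
    (( attempt++ ))
    sleep 5
  done
}

# Example usage, wrapping the master-instance creation:
# run-gcloud-with-retries gcloud compute instances create "${MASTER_NAME}" ...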

@roberthbailey (Contributor)

Ideally all of the gcloud calls would be wrapped in a loop. This one is particularly insidious because the create-master-instance call is run in the background and therefore doesn't abort the cluster creation until much later in the process (and without a useful error).
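
One way to surface that failure sooner would be to capture the background job's exit status with wait. A minimal sketch, assuming a create-master-instance function like the one in cluster/gce (not the actual script):

# Sketch: run master creation in the background, but check its exit
# status before continuing, so a failed gcloud call aborts cluster
# creation with a useful error instead of being silently lost.
create-master-instance &
master_pid=$!

# ... other setup work can proceed in parallel here ...

if ! wait "${master_pid}"; then
  echo "Failed to create master instance; aborting." >&2
  exit 1
fi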

@lavalamp (Member)

Yup, we need a retry loop. @thockin to delegate? Not sure who owns our setup scripts.

@bprashanth (Contributor, author)

As Dawn noted in #22655 (comment), this appears to be a problem with kube-up in general and not just our e2e setup script, which makes me think we should fix it for 1.2.

@roberthbailey (Contributor)

On the other hand, it's been this way since it was written and shouldn't be any flakier now than it's been for the last year. I wouldn't block a release on it (but I would cherry-pick the fix so that it lands on the release branch in 1.2.1 if it misses 1.2.0).

lavalamp added the priority/critical-urgent label on Apr 13, 2016
@lavalamp (Member)

Another occurrence. https://console.cloud.google.com/storage/kubernetes-jenkins/logs/kubernetes-e2e-gce-slow/4164/

12:52:38 NAME                        MACHINE_TYPE  PREEMPTIBLE CREATION_TIMESTAMP
12:52:38 jenkins-e2e-minion-template n1-standard-2             2016-04-13T12:52:34.856-07:00
12:52:46 Created [https://www.googleapis.com/compute/v1/projects/k8s-jkns-e2e-gce-slow/zones/us-central1-f/instanceGroupManagers/jenkins-e2e-minion-group].
12:52:47 NAME                     ZONE          BASE_INSTANCE_NAME SIZE TARGET_SIZE INSTANCE_TEMPLATE           AUTOSCALED
12:52:47 jenkins-e2e-minion-group us-central1-f jenkins-e2e-minion      3           jenkins-e2e-minion-template
12:53:59 Waiting for group to become stable, current operations: creating: 3
12:53:59 Waiting for group to become stable, current operations: creating: 3
12:53:59 Waiting for group to become stable, current operations: creating: 3
12:53:59 Waiting for group to become stable, current operations: creating: 3
12:53:59 Waiting for group to become stable, current operations: creating: 2
12:53:59 Waiting for group to become stable, current operations: creating: 2
12:53:59 Waiting for group to become stable, current operations: creating: 2
12:53:59 Group is stable
12:54:02 INSTANCE_GROUPS=jenkins-e2e-minion-group
12:54:02 NODE_NAMES=jenkins-e2e-minion-0m7m jenkins-e2e-minion-jsld jenkins-e2e-minion-ur7c
12:54:04 Using master: jenkins-e2e-master (external IP: 104.154.93.220)
12:54:04 Waiting up to 300 seconds for cluster initialization.
12:54:04 
12:54:04   This will continually check to see if the API for kubernetes is reachable.
12:54:04   This may time out if there was some uncaught error during start up.
12:54:04 
12:54:04 ....................................................................................................................................................Cluster failed to initialize within 300 seconds.
12:59:05 2016/04/13 12:59:05 e2e.go:200: Error running up: exit status 2
12:59:05 2016/04/13 12:59:05 e2e.go:196: Step 'up' finished in 7m57.662694392s
12:59:05 2016/04/13 12:59:05 e2e.go:110: Error starting e2e cluster. Aborting.

@thockin (Member), Apr 18, 2016

I'm probably not the right assignee - I have almost no context on this area. It looks like it is flaking once a month?

Who has the most context on kube-up? Names from git blame:

@zmerlynn @gmarek

This is a pretty nefarious failure mode. Can either of you shake loose a little time to estimate it and see what we would have to push to fix it?

@bweston92

I don't know if this is related, but with a brand new Google account and the latest pull of Kubernetes, I get the following issue.

./kube-up.sh 
... Starting cluster in us-central1-b using provider gce
... calling verify-prereqs

All components are up to date.

All components are up to date.

All components are up to date.
... calling kube-up
Your active configuration is: [default]

Project: fundbay-1297
Zone: us-central1-b
+++ Staging server tars to Google Storage: gs://kubernetes-staging-a6deffac44/kubernetes-devel
+++ kubernetes-server-linux-amd64.tar.gz uploaded (sha1 = ddeae4ce0540fa8fd2bae4c3c88b553a561e708b)
+++ kubernetes-salt.tar.gz uploaded (sha1 = 7e30e38bfafcb1abd23c0a2519fc7e77e656df2a)
INSTANCE_GROUPS=
NODE_NAMES=
Looking for already existing resources
Listed 0 items.
Listed 0 items.
Starting master and configuring firewalls
Created [https://www.googleapis.com/compute/v1/projects/fundbay-1297/zones/us-central1-b/disks/kubernetes-master-pd].
NAME                  ZONE           SIZE_GB  TYPE    STATUS
kubernetes-master-pd  us-central1-b  20       pd-ssd  READY
Created [https://www.googleapis.com/compute/v1/projects/fundbay-1297/regions/us-central1/addresses/kubernetes-master-ip].
Created [https://www.googleapis.com/compute/v1/projects/fundbay-1297/global/firewalls/default-default-internal].
NAME                      NETWORK  SRC_RANGES  RULES                         SRC_TAGS  TARGET_TAGS
default-default-internal  default  10.0.0.0/8  tcp:1-65535,udp:1-65535,icmp
Generating certs for alternate-names: IP:104.154.66.17,IP:10.0.0.1,DNS:kubernetes,DNS:kubernetes.default,DNS:kubernetes.default.svc,DNS:kubernetes.default.svc.cluster.local,DNS:kubernetes-master
+++ Logging using Fluentd to gcp
Created [https://www.googleapis.com/compute/v1/projects/fundbay-1297/global/firewalls/kubernetes-master-https].
Created [https://www.googleapis.com/compute/v1/projects/fundbay-1297/global/firewalls/default-default-ssh].
NAME                 NETWORK  SRC_RANGES  RULES   SRC_TAGS  TARGET_TAGS
NAME                     NETWORK  SRC_RANGES  RULES    SRC_TAGS  TARGET_TAGS
default-default-ssh  default  0.0.0.0/0   tcp:22
kubernetes-master-https  default  0.0.0.0/0   tcp:443            kubernetes-master
Created [https://www.googleapis.com/compute/v1/projects/fundbay-1297/global/firewalls/kubernetes-minion-all].
NAME                   NETWORK  SRC_RANGES     RULES                     SRC_TAGS  TARGET_TAGS
kubernetes-minion-all  default  10.244.0.0/16  tcp,udp,icmp,esp,ah,sctp            kubernetes-minion
Created [https://www.googleapis.com/compute/v1/projects/fundbay-1297/zones/us-central1-b/instances/kubernetes-master].
NAME               ZONE           MACHINE_TYPE   PREEMPTIBLE  INTERNAL_IP  EXTERNAL_IP    STATUS
kubernetes-master  us-central1-b  n1-standard-1               10.128.0.2   104.154.66.17  RUNNING
Creating minions.
Attempt 1 to create kubernetes-minion-template
WARNING: You have selected a disk size of under [200GB]. This may result in poor I/O performance. For more information, see: https://developers.google.com/compute/docs/disks#pdperformance.
Created [https://www.googleapis.com/compute/v1/projects/fundbay-1297/global/instanceTemplates/kubernetes-minion-template].
NAME                        MACHINE_TYPE   PREEMPTIBLE  CREATION_TIMESTAMP
kubernetes-minion-template  n1-standard-2               2016-04-30T01:06:43.574-07:00
Created [https://www.googleapis.com/compute/v1/projects/fundbay-1297/zones/us-central1-b/instanceGroupManagers/kubernetes-minion-group].
NAME                     LOCATION       SCOPE  BASE_INSTANCE_NAME  SIZE  TARGET_SIZE  INSTANCE_TEMPLATE           AUTOSCALED
kubernetes-minion-group  us-central1-b  zone   kubernetes-minion         3            kubernetes-minion-template
Waiting for group to become stable, current operations: creating: 3
Waiting for group to become stable, current operations: creating: 3
Waiting for group to become stable, current operations: creating: 3
Waiting for group to become stable, current operations: creating: 3
Waiting for group to become stable, current operations: creating: 3
Waiting for group to become stable, current operations: creating: 3
Waiting for group to become stable, current operations: creating: 1,recreating: 2
Waiting for group to become stable, current operations: recreating: 3
Waiting for group to become stable, current operations: recreating: 3
Waiting for group to become stable, current operations: recreating: 3
Waiting for group to become stable, current operations: recreating: 3
Waiting for group to become stable, current operations: recreating: 3 (this goes on forever)

@roberthbailey (Contributor)

@bweston92 that looks like an issue creating the node VMs.
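
A couple of gcloud commands could help show why the managed instance group keeps recreating its VMs. The group name and zone below are taken from the log above; the instance name is only an example:

# List the instances in the group and their current actions/status.
gcloud compute instance-groups managed list-instances kubernetes-minion-group \
    --zone us-central1-b

# Dump the serial console of one of the recreated nodes to look for
# boot or startup-script errors (replace with a real instance name).
gcloud compute instances get-serial-port-output kubernetes-minion-abcd \
    --zone us-central1-b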

@k8s-github-robot

[FLAKE-PING] @mikedanese

This flaky-test issue would love to have more attention...

1 similar comment

@k8s-github-robot

[FLAKE-PING] @mikedanese

This flaky-test issue would love to have more attention.

2 similar comments

@liggitt (Member), Mar 6, 2017

Hasn't been referenced in five months... I vote to close this one.

@grodrigues3 (Contributor)

Seems reasonable.
