e2e clusters sometimes fail to create master #22655

Closed · bprashanth opened this issue Mar 7, 2016 · 28 comments

Labels: area/platform/gce, kind/flake, priority/backlog

@bprashanth (Contributor)

Observed in #18672 (comment)
Probably contributing to: #20916 (comment)

INSTANCE_GROUPS=e2e-gce-builder-2-0-minion-group
NODE_NAMES=e2e-gce-builder-2-0-minion-0501 e2e-gce-builder-2-0-minion-4pdi e2e-gce-builder-2-0-minion-ekcs e2e-gce-builder-2-0-minion-gj5m e2e-gce-builder-2-0-minion-h9co e2e-gce-builder-2-0-minion-pv6j
ERROR: (gcloud.compute.instances.describe) Could not fetch resource:
 - The resource 'projects/kubernetes-jenkins-pull/zones/us-central1-f/instances/e2e-gce-builder-2-0-master' was not found
2016/03/07 10:20:02 e2e.go:200: Error running up: exit status 1
2016/03/07 10:20:02 e2e.go:196: Step 'up' finished in 7m42.604863565s
2016/03/07 10:20:02 e2e.go:110: Error starting e2e cluster. Aborting.

Of course the kubelets are complaining:

I0307 18:22:10.110980    3364 kubelet.go:1129] Unable to register e2e-gce-builder-2-0-minion-ekcs with the apiserver: Post https://e2e-gce-builder-2-0-master/api/v1/nodes: dial tcp: lookup e2e-gce-builder-2-0-master: no such host
I0307 18:22:10.170923    3364 kubelet.go:2355] skipping pod synchronization - [ConfigureCBR0 requested, but PodCIDR not set. Will not configure CBR0 right now container runtime is down]
E0307 18:22:10.175011    3364 kubelet.go:2696] Container runtime sanity check failed: docker: failed to get docker version: cannot connect to Docker endpoint

Looks like we didn't even create the master VM, but the error is lost.

bprashanth added the area/test-infra and kind/flake labels on Mar 7, 2016
@dchen1107 (Member)

Shouldn't #10423 address the log-related issue for us? cc @spxtr

@bprashanth (Contributor, author)

In this case I don't think the VM even existed for us to collect logs from.

@dchen1107 (Member)

Ah, I see from the error message above that this failure can occur with a real production cluster, not just in tests. cc @kubernetes/goog-gke

@fejta (Contributor), Mar 8, 2016

Yeah, this is a https://github.com/kubernetes/kubernetes/blob/master/cluster/gce/debian/helper.sh#L42 flake (or a gcloud flake). Should this be in a retry loop? Or is there some way we can get the gcloud debug logs?
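
A retry wrapper along these lines is one way to do it. This is only a minimal sketch of the pattern being discussed; the function name, attempt count, and sleep are made up, not the actual kube-up.sh code:

# Hypothetical helper, not the real kube-up.sh code: retry a flaky
# gcloud call a few times before giving up.
function run-gcloud-with-retries() {
  local -r max_attempts=3
  local attempt=1
  until "$@"; do
    if (( attempt >= max_attempts )); then
      echo "Command failed after ${max_attempts} attempts: $*" >&2
      return 1
    fi
    echo "Attempt ${attempt} failed, retrying: $*" >&2
    (( attempt++ ))
    sleep 5
  done
}

# Example usage, wrapping the master-instance creation:
# run-gcloud-with-retries gcloud compute instances create "${MASTER_NAME}" ...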

@roberthbailey (Contributor)

Ideally all of the gcloud calls would be wrapped in a loop. This one is particularly insidious because the create-master-instance call is run in the background and therefore doesn't abort the cluster creation until much later in the process (and without a useful error).
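
One way to surface that failure sooner would be to capture the background job's exit status with wait. A minimal sketch, assuming a create-master-instance function like the one in cluster/gce (not the actual script):

# Sketch: run master creation in the background, but check its exit
# status before continuing, so a failed gcloud call aborts cluster
# creation with a useful error instead of being silently lost.
create-master-instance &
master_pid=$!

# ... other setup work can proceed in parallel here ...

if ! wait "${master_pid}"; then
  echo "Failed to create master instance; aborting." >&2
  exit 1
fi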

@lavalamp (Member)

Yup, we need a retry loop. @thockin to delegate? Not sure who owns our setup scripts.

@bprashanth (Contributor, author)

As Dawn noted in #22655 (comment), this appears to be a problem with kube-up in general and not just our e2e setup script, which makes me think we should fix it for 1.2.

@roberthbailey (Contributor)

On the other hand, it's been this way since it was written and shouldn't be any flakier now than it's been for the last year. I wouldn't block a release on it (but I would cherry-pick the fix so that it lands on the release branch in 1.2.1 if it misses 1.2.0).

lavalamp added the priority/critical-urgent label on Apr 13, 2016
@lavalamp (Member)

Another occurrence. https://console.cloud.google.com/storage/kubernetes-jenkins/logs/kubernetes-e2e-gce-slow/4164/

12:52:38 NAME                        MACHINE_TYPE  PREEMPTIBLE CREATION_TIMESTAMP
12:52:38 jenkins-e2e-minion-template n1-standard-2             2016-04-13T12:52:34.856-07:00
12:52:46 Created [https://www.googleapis.com/compute/v1/projects/k8s-jkns-e2e-gce-slow/zones/us-central1-f/instanceGroupManagers/jenkins-e2e-minion-group].
12:52:47 NAME                     ZONE          BASE_INSTANCE_NAME SIZE TARGET_SIZE INSTANCE_TEMPLATE           AUTOSCALED
12:52:47 jenkins-e2e-minion-group us-central1-f jenkins-e2e-minion      3           jenkins-e2e-minion-template
12:53:59 Waiting for group to become stable, current operations: creating: 3
12:53:59 Waiting for group to become stable, current operations: creating: 3
12:53:59 Waiting for group to become stable, current operations: creating: 3
12:53:59 Waiting for group to become stable, current operations: creating: 3
12:53:59 Waiting for group to become stable, current operations: creating: 2
12:53:59 Waiting for group to become stable, current operations: creating: 2
12:53:59 Waiting for group to become stable, current operations: creating: 2
12:53:59 Group is stable
12:54:02 INSTANCE_GROUPS=jenkins-e2e-minion-group
12:54:02 NODE_NAMES=jenkins-e2e-minion-0m7m jenkins-e2e-minion-jsld jenkins-e2e-minion-ur7c
12:54:04 Using master: jenkins-e2e-master (external IP: 104.154.93.220)
12:54:04 Waiting up to 300 seconds for cluster initialization.
12:54:04 
12:54:04   This will continually check to see if the API for kubernetes is reachable.
12:54:04   This may time out if there was some uncaught error during start up.
12:54:04 
12:54:04 ....................................................................................................................................................Cluster failed to initialize within 300 seconds.
12:59:05 2016/04/13 12:59:05 e2e.go:200: Error running up: exit status 2
12:59:05 2016/04/13 12:59:05 e2e.go:196: Step 'up' finished in 7m57.662694392s
12:59:05 2016/04/13 12:59:05 e2e.go:110: Error starting e2e cluster. Aborting.

@thockin (Member), Apr 18, 2016

I'm probably not the right assignee - I have almost no context on this area. It looks like it is flaking once a month?

Who has the most context on kube-up? Names from git blame:

@zmerlynn @gmarek

This is a pretty nefarious failure mode. Can either of you shake loose a little time to estimate it and see what we would have to push to fix it?

@bweston92

I don't know if this is related, but with a brand new Google account and the latest pull of Kubernetes, I get the following issue.

./kube-up.sh 
... Starting cluster in us-central1-b using provider gce
... calling verify-prereqs

All components are up to date.

All components are up to date.

All components are up to date.
... calling kube-up
Your active configuration is: [default]

Project: fundbay-1297
Zone: us-central1-b
+++ Staging server tars to Google Storage: gs://kubernetes-staging-a6deffac44/kubernetes-devel
+++ kubernetes-server-linux-amd64.tar.gz uploaded (sha1 = ddeae4ce0540fa8fd2bae4c3c88b553a561e708b)
+++ kubernetes-salt.tar.gz uploaded (sha1 = 7e30e38bfafcb1abd23c0a2519fc7e77e656df2a)
INSTANCE_GROUPS=
NODE_NAMES=
Looking for already existing resources
Listed 0 items.
Listed 0 items.
Starting master and configuring firewalls
Created [https://www.googleapis.com/compute/v1/projects/fundbay-1297/zones/us-central1-b/disks/kubernetes-master-pd].
NAME                  ZONE           SIZE_GB  TYPE    STATUS
kubernetes-master-pd  us-central1-b  20       pd-ssd  READY
Created [https://www.googleapis.com/compute/v1/projects/fundbay-1297/regions/us-central1/addresses/kubernetes-master-ip].
Created [https://www.googleapis.com/compute/v1/projects/fundbay-1297/global/firewalls/default-default-internal].
NAME                      NETWORK  SRC_RANGES  RULES                         SRC_TAGS  TARGET_TAGS
default-default-internal  default  10.0.0.0/8  tcp:1-65535,udp:1-65535,icmp
Generating certs for alternate-names: IP:104.154.66.17,IP:10.0.0.1,DNS:kubernetes,DNS:kubernetes.default,DNS:kubernetes.default.svc,DNS:kubernetes.default.svc.cluster.local,DNS:kubernetes-master
+++ Logging using Fluentd to gcp
Created [https://www.googleapis.com/compute/v1/projects/fundbay-1297/global/firewalls/kubernetes-master-https].
Created [https://www.googleapis.com/compute/v1/projects/fundbay-1297/global/firewalls/default-default-ssh].
NAME                 NETWORK  SRC_RANGES  RULES   SRC_TAGS  TARGET_TAGS
NAME                     NETWORK  SRC_RANGES  RULES    SRC_TAGS  TARGET_TAGS
default-default-ssh  default  0.0.0.0/0   tcp:22
kubernetes-master-https  default  0.0.0.0/0   tcp:443            kubernetes-master
Created [https://www.googleapis.com/compute/v1/projects/fundbay-1297/global/firewalls/kubernetes-minion-all].
NAME                   NETWORK  SRC_RANGES     RULES                     SRC_TAGS  TARGET_TAGS
kubernetes-minion-all  default  10.244.0.0/16  tcp,udp,icmp,esp,ah,sctp            kubernetes-minion
Created [https://www.googleapis.com/compute/v1/projects/fundbay-1297/zones/us-central1-b/instances/kubernetes-master].
NAME               ZONE           MACHINE_TYPE   PREEMPTIBLE  INTERNAL_IP  EXTERNAL_IP    STATUS
kubernetes-master  us-central1-b  n1-standard-1               10.128.0.2   104.154.66.17  RUNNING
Creating minions.
Attempt 1 to create kubernetes-minion-template
WARNING: You have selected a disk size of under [200GB]. This may result in poor I/O performance. For more information, see: https://developers.google.com/compute/docs/disks#pdperformance.
Created [https://www.googleapis.com/compute/v1/projects/fundbay-1297/global/instanceTemplates/kubernetes-minion-template].
NAME                        MACHINE_TYPE   PREEMPTIBLE  CREATION_TIMESTAMP
kubernetes-minion-template  n1-standard-2               2016-04-30T01:06:43.574-07:00
Created [https://www.googleapis.com/compute/v1/projects/fundbay-1297/zones/us-central1-b/instanceGroupManagers/kubernetes-minion-group].
NAME                     LOCATION       SCOPE  BASE_INSTANCE_NAME  SIZE  TARGET_SIZE  INSTANCE_TEMPLATE           AUTOSCALED
kubernetes-minion-group  us-central1-b  zone   kubernetes-minion         3            kubernetes-minion-template
Waiting for group to become stable, current operations: creating: 3
Waiting for group to become stable, current operations: creating: 3
Waiting for group to become stable, current operations: creating: 3
Waiting for group to become stable, current operations: creating: 3
Waiting for group to become stable, current operations: creating: 3
Waiting for group to become stable, current operations: creating: 3
Waiting for group to become stable, current operations: creating: 1,recreating: 2
Waiting for group to become stable, current operations: recreating: 3
Waiting for group to become stable, current operations: recreating: 3
Waiting for group to become stable, current operations: recreating: 3
Waiting for group to become stable, current operations: recreating: 3
Waiting for group to become stable, current operations: recreating: 3 (this goes on forever)

@roberthbailey (Contributor)

@bweston92 that looks like an issue creating the node VMs.
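
A couple of gcloud commands could help show why the managed instance group keeps recreating its VMs. The group name and zone below are taken from the log above; the instance name is only an example:

# List the instances in the group and their current actions/status.
gcloud compute instance-groups managed list-instances kubernetes-minion-group \
    --zone us-central1-b

# Dump the serial console of one of the recreated nodes to look for
# boot or startup-script errors (replace with a real instance name).
gcloud compute instances get-serial-port-output kubernetes-minion-abcd \
    --zone us-central1-b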

@k8s-github-robot

[FLAKE-PING] @mikedanese

This flaky-test issue would love to have more attention...

1 similar comment

@k8s-github-robot

[FLAKE-PING] @mikedanese

This flaky-test issue would love to have more attention.

2 similar comments

@liggitt (Member), Mar 6, 2017

Hasn't been referenced in five months... I vote to close this one.

@grodrigues3 (Contributor)

Seems reasonable.
