Investigate etcd unavailability leading to 300s cluster up timeout #22819

Closed

bprashanth opened this issue Mar 10, 2016 · 7 comments
Labels
  • kind/flake: Categorizes issue or PR as related to a flaky test.
  • priority/important-soon: Must be staffed and worked on either currently, or very soon, ideally in time for the next release.
  • sig/api-machinery: Categorizes an issue or PR as relevant to SIG API Machinery.
  • sig/node: Categorizes an issue or PR as relevant to SIG Node.

Comments

@bprashanth
Contributor

#20931 (comment)
Apiserver logs:

E0306 01:08:48.052514       6 cacher.go:217] unexpected ListAndWatch error: pkg/storage/cacher.go:160: Failed to list *api.LimitRange: client: etcd cluster is unavailable or misconfigured
E0306 01:08:48.053808       6 cacher.go:217] unexpected ListAndWatch error: pkg/storage/cacher.go:160: Failed to list *api.ResourceQuota: client: etcd cluster is unavailable or misconfigured
E0306 01:08:48.054264       6 cacher.go:217] unexpected ListAndWatch error: pkg/storage/cacher.go:160: Failed to list *api.Secret: client: etcd cluster is unavailable or misconfigured
E0306 01:08:48.054349       6 cacher.go:217] unexpected ListAndWatch error: pkg/storage/cacher.go:160: Failed to list *api.Namespace: client: etcd cluster is unavailable or misconfigured
E0306 01:08:48.054435       6 cache

Didn't find anything immediately suspicious in the master kubelet or docker logs.
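
For context, here is a minimal editorial sketch (not part of the original report) of polling etcd's health while the cluster is coming up, to confirm the "etcd cluster is unavailable or misconfigured" symptom from the apiserver log above. The client URL (127.0.0.1:4001), the 2s request timeout, and the retry cadence are assumptions; adjust them to the master's actual etcd client address.

// Minimal sketch, not from this issue: poll etcd's /health endpoint to see
// whether etcd ever becomes reachable during cluster bring-up.
// The client URL and the 10x5s retry loop are illustrative assumptions.
package main

import (
	"fmt"
	"io"
	"net/http"
	"time"
)

func main() {
	client := &http.Client{Timeout: 2 * time.Second}
	for attempt := 1; attempt <= 10; attempt++ {
		resp, err := client.Get("http://127.0.0.1:4001/health") // assumed etcd client URL
		if err != nil {
			fmt.Printf("attempt %d: etcd not reachable: %v\n", attempt, err)
		} else {
			body, _ := io.ReadAll(resp.Body)
			resp.Body.Close()
			fmt.Printf("attempt %d: etcd /health: %s\n", attempt, body)
		}
		time.Sleep(5 * time.Second)
	}
}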

@bprashanth bprashanth added priority/important-soon Must be staffed and worked on either currently, or very soon, ideally in time for the next release. team/control-plane kind/flake Categorizes issue or PR as related to a flaky test. labels Mar 10, 2016
@bprashanth bprashanth changed the title Investigate etcd crashing leading to 300s cluster up timeout Investigate etcd unavailability leading to 300s cluster up timeout Mar 10, 2016
@dchen1107 dchen1107 added sig/api-machinery Categorizes an issue or PR as relevant to SIG API Machinery. and removed team/control-plane labels Mar 11, 2016
@dchen1107
Member

@lavalamp could you please take a look at this or delegate? Thanks!

@lavalamp lavalamp assigned dchen1107 and unassigned lavalamp Mar 11, 2016
@lavalamp
Member

@dchen1107 something bad happened to the node, maybe to docker. I see lots of

I0306 01:04:03.535184    3507 worker.go:161] Probe target container not found: etcd-server-e2e-gce-master-0-master_kube-system(047397a03841de24aac3acbd0cab7cfa) - etcd-container

messages in the master node's kubelet.log. It looks like it wasn't able to run much. The etcd log is empty. Nothing obvious in the docker log.

But anyway, the problem is clearly that etcd never started. Sorry, but I have to give this back to you :)

@lavalamp lavalamp added the sig/node Categorizes an issue or PR as relevant to SIG Node. label Mar 11, 2016
@dchen1107
Member

I looked at the logs of the master node.

  • The etcd-server container didn't run because its docker image was never loaded. We side-load all docker images for the master components, including the etcd image.
  • I checked the docker log; it doesn't report any failure of the docker load operation.
  • Docker is not hung, either, since all the other images were loaded properly and their containers were created and started.
  • I checked the salt-related logs; no errors there.

I am wondering whether the etcd image tar file was copied to the master node properly by the PR builder. I tried to reproduce the failure through kube-up but couldn't.
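
As an editorial aside, not part of the original thread: a quick way to confirm the first bullet above, i.e. that the etcd image never made it into the node's docker daemon, is to list the images the daemon actually knows about. The etcd:2.2.1 tag comes from the follow-up comments; everything else here is an illustrative assumption.

// Minimal sketch (assumption-laden): check whether the etcd image was actually
// loaded into the node's local docker daemon. The etcd:2.2.1 tag is taken from
// the follow-up comments; the repository prefix is left as a substring match.
package main

import (
	"fmt"
	"os/exec"
	"strings"
)

func main() {
	out, err := exec.Command("docker", "images", "--format", "{{.Repository}}:{{.Tag}}").Output()
	if err != nil {
		fmt.Printf("failed to list docker images: %v\n", err)
		return
	}
	for _, line := range strings.Split(strings.TrimSpace(string(out)), "\n") {
		if strings.Contains(line, "etcd:2.2.1") {
			fmt.Println("etcd image is present:", line)
			return
		}
	}
	fmt.Println("etcd:2.2.1 not found locally; side-loading (or the pull) never completed")
}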

@dchen1107
Member

Together with @bprashanth, we double-checked our master component tarball; it doesn't include the etcd:2.2.1 image tar file. It looks like etcd:2.2.1 is the only master component image that is pulled from gcr.io.

I checked kubelet.log on the master node again: there is only an image "pulling" event, with no corresponding "pulled" event for etcd:2.2.1. Based on my understanding, the reason we keep side-loading images for the master components is that the gcr.io repository is not very reliable. We used to include the etcd image in the master component tarball too. cc/ @roberthbailey and @zmerlynn
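
Purely illustrative (editor's sketch, not from the issue): scanning a master component tarball for a bundled etcd image archive, which is the check described above. The tarball filename is an assumption; only the "etcd" substring match reflects what the comment says was found to be missing.

// Minimal sketch: list the contents of a master component tarball and flag
// whether any etcd image archive is bundled. The tarball path is an assumed
// placeholder for illustration only.
package main

import (
	"archive/tar"
	"compress/gzip"
	"fmt"
	"os"
	"strings"
)

func main() {
	f, err := os.Open("kubernetes-server-linux-amd64.tar.gz") // assumed tarball name
	if err != nil {
		fmt.Println(err)
		return
	}
	defer f.Close()

	gz, err := gzip.NewReader(f)
	if err != nil {
		fmt.Println(err)
		return
	}
	tr := tar.NewReader(gz)
	found := false
	for {
		hdr, err := tr.Next()
		if err != nil {
			break // io.EOF or a read error: stop scanning
		}
		if strings.Contains(hdr.Name, "etcd") {
			fmt.Println("found:", hdr.Name)
			found = true
		}
	}
	if !found {
		fmt.Println("no etcd image tar bundled in this tarball")
	}
}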

@roberthbailey
Contributor

I don't recall a time when we were side-loading the etcd image. We load it from gcr.io in both the release-1.1 and release-1.0 branches, and we haven't seen it have a significant impact on cluster creation reliability.

/cc @jlowdermilk

@dchen1107
Member

Do we have more instances of the same failure here? The problem is that this particular docker pull of etcd:2.2.1 took a long time.
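
If the pull itself is the suspect, a hedged sketch like the one below could make a slow pull visible by running it under an explicit deadline, so it surfaces as an error instead of silently eating the 300s cluster-up budget. The 120-second limit and the gcr.io/google_containers/etcd:2.2.1 image name are assumptions, not taken from this issue.

// Minimal sketch, editor-added: time a docker pull and bound it with a deadline
// so a slow gcr.io pull shows up as an explicit failure. Deadline and image
// name are assumptions.
package main

import (
	"context"
	"fmt"
	"os/exec"
	"time"
)

func main() {
	ctx, cancel := context.WithTimeout(context.Background(), 120*time.Second)
	defer cancel()

	start := time.Now()
	out, err := exec.CommandContext(ctx, "docker", "pull", "gcr.io/google_containers/etcd:2.2.1").CombinedOutput()
	fmt.Printf("pull took %v\n", time.Since(start))
	if err != nil {
		fmt.Printf("pull failed or timed out: %v\n%s\n", err, out)
		return
	}
	fmt.Println("pull succeeded")
}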

@dchen1107
Member

Since there are no other instances of the same failure, I am closing this one for now.
