Investigate etcd unavailability leading to 300s cluster up timeout #22819

Closed

bprashanth opened this issue Mar 10, 2016 · 7 comments
Labels
  • kind/flake: Categorizes issue or PR as related to a flaky test.
  • priority/important-soon: Must be staffed and worked on either currently, or very soon, ideally in time for the next release.
  • sig/api-machinery: Categorizes an issue or PR as relevant to SIG API Machinery.
  • sig/node: Categorizes an issue or PR as relevant to SIG Node.

Comments

@bprashanth
Contributor

#20931 (comment)
Apiserver logs:

E0306 01:08:48.052514       6 cacher.go:217] unexpected ListAndWatch error: pkg/storage/cacher.go:160: Failed to list *api.LimitRange: client: etcd cluster is unavailable or misconfigured
E0306 01:08:48.053808       6 cacher.go:217] unexpected ListAndWatch error: pkg/storage/cacher.go:160: Failed to list *api.ResourceQuota: client: etcd cluster is unavailable or misconfigured
E0306 01:08:48.054264       6 cacher.go:217] unexpected ListAndWatch error: pkg/storage/cacher.go:160: Failed to list *api.Secret: client: etcd cluster is unavailable or misconfigured
E0306 01:08:48.054349       6 cacher.go:217] unexpected ListAndWatch error: pkg/storage/cacher.go:160: Failed to list *api.Namespace: client: etcd cluster is unavailable or misconfigured
E0306 01:08:48.054435       6 cache

Didn't find anything immediately suspicious in the master kubelet or docker logs.
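
For context, here is a minimal editorial sketch (not part of the original report) of polling etcd's health while the cluster is coming up, to confirm the "etcd cluster is unavailable or misconfigured" symptom from the apiserver log above. The client URL (127.0.0.1:4001), the 2s request timeout, and the retry cadence are assumptions; adjust them to the master's actual etcd client address.

// Minimal sketch, not from this issue: poll etcd's /health endpoint to see
// whether etcd ever becomes reachable during cluster bring-up.
// The client URL and the 10x5s retry loop are illustrative assumptions.
package main

import (
	"fmt"
	"io"
	"net/http"
	"time"
)

func main() {
	client := &http.Client{Timeout: 2 * time.Second}
	for attempt := 1; attempt <= 10; attempt++ {
		resp, err := client.Get("http://127.0.0.1:4001/health") // assumed etcd client URL
		if err != nil {
			fmt.Printf("attempt %d: etcd not reachable: %v\n", attempt, err)
		} else {
			body, _ := io.ReadAll(resp.Body)
			resp.Body.Close()
			fmt.Printf("attempt %d: etcd /health: %s\n", attempt, body)
		}
		time.Sleep(5 * time.Second)
	}
}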

@bprashanth bprashanth added priority/important-soon Must be staffed and worked on either currently, or very soon, ideally in time for the next release. team/control-plane kind/flake Categorizes issue or PR as related to a flaky test. labels Mar 10, 2016
@bprashanth bprashanth changed the title Investigate etcd crashing leading to 300s cluster up timeout Investigate etcd unavailability leading to 300s cluster up timeout Mar 10, 2016
@dchen1107 dchen1107 added sig/api-machinery Categorizes an issue or PR as relevant to SIG API Machinery. and removed team/control-plane labels Mar 11, 2016
@dchen1107
Member

@lavalamp could you please take a look at this or delegate? Thanks!

@lavalamp lavalamp assigned dchen1107 and unassigned lavalamp Mar 11, 2016
@lavalamp
Member

@dchen1107 something bad happened to the node, maybe to docker. I see lots of

I0306 01:04:03.535184    3507 worker.go:161] Probe target container not found: etcd-server-e2e-gce-master-0-master_kube-system(047397a03841de24aac3acbd0cab7cfa) - etcd-container

messages in the master node's kubelet.log. It looks like it wasn't able to run much. The etcd log is empty. Nothing obvious in the docker log.

But anyway, the problem is clearly that etcd never started. Sorry, but I have to give this back to you :)

@lavalamp lavalamp added the sig/node Categorizes an issue or PR as relevant to SIG Node. label Mar 11, 2016
@dchen1107
Member

I looked at the logs of the master node.

  • The etcd-server container didn't run because its docker image was never loaded. We side-load all docker images for the master components, including the etcd image.
  • I checked the docker log; it doesn't report any failure of the docker load operation.
  • Docker is not hung, either, since all the other images were loaded properly and their containers were created and started.
  • I checked the salt-related logs; no errors there.

I am wondering whether the etcd image tar file was copied to the master node properly by the PR builder. I tried to reproduce the failure through kube-up but couldn't.
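
As an editorial aside, not part of the original thread: a quick way to confirm the first bullet above, i.e. that the etcd image never made it into the node's docker daemon, is to list the images the daemon actually knows about. The etcd:2.2.1 tag comes from the follow-up comments; everything else here is an illustrative assumption.

// Minimal sketch (assumption-laden): check whether the etcd image was actually
// loaded into the node's local docker daemon. The etcd:2.2.1 tag is taken from
// the follow-up comments; the repository prefix is left as a substring match.
package main

import (
	"fmt"
	"os/exec"
	"strings"
)

func main() {
	out, err := exec.Command("docker", "images", "--format", "{{.Repository}}:{{.Tag}}").Output()
	if err != nil {
		fmt.Printf("failed to list docker images: %v\n", err)
		return
	}
	for _, line := range strings.Split(strings.TrimSpace(string(out)), "\n") {
		if strings.Contains(line, "etcd:2.2.1") {
			fmt.Println("etcd image is present:", line)
			return
		}
	}
	fmt.Println("etcd:2.2.1 not found locally; side-loading (or the pull) never completed")
}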

@dchen1107
Member

Together with @bprashanth, we double-checked our master component tarball; it doesn't include the etcd:2.2.1 image tar file. It looks like etcd:2.2.1 is the only master component image that is pulled from gcr.io.

I checked kubelet.log on the master node again: there is only an image "pulling" event, with no corresponding "pulled" event for etcd:2.2.1. Based on my understanding, the reason we keep side-loading images for the master components is that the gcr.io repository is not very reliable. We used to include the etcd image in the master component tarball too. cc/ @roberthbailey and @zmerlynn
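
Purely illustrative (editor's sketch, not from the issue): scanning a master component tarball for a bundled etcd image archive, which is the check described above. The tarball filename is an assumption; only the "etcd" substring match reflects what the comment says was found to be missing.

// Minimal sketch: list the contents of a master component tarball and flag
// whether any etcd image archive is bundled. The tarball path is an assumed
// placeholder for illustration only.
package main

import (
	"archive/tar"
	"compress/gzip"
	"fmt"
	"os"
	"strings"
)

func main() {
	f, err := os.Open("kubernetes-server-linux-amd64.tar.gz") // assumed tarball name
	if err != nil {
		fmt.Println(err)
		return
	}
	defer f.Close()

	gz, err := gzip.NewReader(f)
	if err != nil {
		fmt.Println(err)
		return
	}
	tr := tar.NewReader(gz)
	found := false
	for {
		hdr, err := tr.Next()
		if err != nil {
			break // io.EOF or a read error: stop scanning
		}
		if strings.Contains(hdr.Name, "etcd") {
			fmt.Println("found:", hdr.Name)
			found = true
		}
	}
	if !found {
		fmt.Println("no etcd image tar bundled in this tarball")
	}
}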

@roberthbailey
Contributor

I don't recall a time when we were side-loading the etcd image. We load it from gcr.io in both the release-1.1 and release-1.0 branches, and we haven't seen it have a significant impact on cluster creation reliability.

/cc @jlowdermilk

@dchen1107
Member

Do we have more instances of the same failure here? The problem is that this particular docker pull of etcd:2.2.1 took a long time.
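
If the pull itself is the suspect, a hedged sketch like the one below could make a slow pull visible by running it under an explicit deadline, so it surfaces as an error instead of silently eating the 300s cluster-up budget. The 120-second limit and the gcr.io/google_containers/etcd:2.2.1 image name are assumptions, not taken from this issue.

// Minimal sketch, editor-added: time a docker pull and bound it with a deadline
// so a slow gcr.io pull shows up as an explicit failure. Deadline and image
// name are assumptions.
package main

import (
	"context"
	"fmt"
	"os/exec"
	"time"
)

func main() {
	ctx, cancel := context.WithTimeout(context.Background(), 120*time.Second)
	defer cancel()

	start := time.Now()
	out, err := exec.CommandContext(ctx, "docker", "pull", "gcr.io/google_containers/etcd:2.2.1").CombinedOutput()
	fmt.Printf("pull took %v\n", time.Since(start))
	if err != nil {
		fmt.Printf("pull failed or timed out: %v\n%s\n", err, out)
		return
	}
	fmt.Println("pull succeeded")
}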

@dchen1107
Member

Since there are no other instances of the same failure, I am closing this one for now.
