cluster frequently failing to initialize on Jenkins #10868
We're seeing clusters frequently fail to initialize on the pull request Jenkins; I'm seeing similar failures (though less frequently) on the post-commit Jenkins runs, too. The failures usually appear like the following:
I ssh'd in to the nodes and master on one such build as it was failing to come up, and saved everything from /var/log on each. I haven't examined everything in full yet, but a few things stick out to me as worrisome: the master seems to be failing to start up, with the kubelet log there showing image failures.
Anyone have any ideas as to what's going on? I can upload all of the logs I saved to GCS in a bit. @alex-mohr @fabioy @quinton-hoole, can you assign folks who may be able to help investigate?
From my desktop: From https://pantheon.corp.google.com/project/google-containers/kubernetes/images/list: I don't know how the pull request Jenkins built this, but it looks like the kube-apiserver container only has one live tag at a time, so it looks like the PR runner only supports single-threaded operation. Presumably something is deleting or changing the tag value before the above run finished?
The pull request Jenkins is supposed to be starting up a cluster with the Kubernetes components it just built. Is that supposed to be using gcr.io at all? More data for comparison: a kubelet.log from a cluster which succeeded in starting up on the pull request Jenkins:
Containers like "gcr.io/google_containers/kube-apiserver:9c5b6dabad3eff8bd9e3af646cc8bd45" are never pushed to a repository, so pulling them is futile; they are shipped (via scp or salt or something) directly to the VMs when we start a cluster.
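For concreteness, that side-loading flow looks roughly like this (a sketch only; the tarball path and image tag are taken from logs later in this thread, and the real salt plumbing differs):
# Build output is copied to the VM and loaded straight into the local
# docker daemon; the gcr.io-style name is only a local tag, never pushed.
gcloud compute copy-files kube-apiserver.tar kubernetes-master:/srv/salt/kube-bins/
gcloud compute ssh kubernetes-master --command 'docker load -i /srv/salt/kube-bins/kube-apiserver.tar'
gcloud compute ssh kubernetes-master --command 'docker images'   # the tag now exists locally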
When I start a regular cluster and use:
I repro this. When I start an e2e cluster and test like this, I don't:
If it is useful, kube-master-addons is the service which runs on the master and does the docker load of the master component images.
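Schematically, that service amounts to something like the following loop (a sketch, not the real script, which has more plumbing; the tarball directory is the one that appears in the logs later in this thread):
# Side-load every shipped component image into the local docker daemon.
for tarball in /srv/salt/kube-bins/*.tar; do
  docker load -i "${tarball}"
done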
There's been a lot of churn in …
Not sure if this is the same issue, but my apiserver wouldn't start on the new master when running the cluster upgrade test (with a freshly created e2e cluster): the apiserver couldn't be started because its image was not available.
@yujuhong, do you mind adding me to your GCE project? I can ssh into your master and take a look while the problem is occurring.
Done.
Ok, I know what is going on here: docker load sometimes hangs forever, and restarting the docker daemon should fix the issue. I know how to fix this, but I'm not sure we want to make such a change now, right before Friday's cut. How about documenting this as a known issue and including a proper fix in v1.1? cc/ @alex-mohr @roberthbailey @brendandburns @erickhan Again, #5011 is for dev cluster upgrading and shouldn't be used as a solution for production. We should fix this by pushing release images to gcr.io, instead of salt-copying tar files to the nodes and running docker load through scripts.
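The manual recovery is roughly the following (a sketch assuming the paths above, not the exact commands anyone ran):
# Kill the wedged load, bounce the daemon, then load again.
pkill -f 'docker load'
service docker restart
docker load -i /srv/salt/kube-bins/kube-apiserver.tar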
@dchen1107 is a short-term workaround to change the salt and/or ssh scripts to do more docker restarts? If so, that seems lowish-risk to production k8s and would help the developer experience in the near term.
@alex-mohr Updating the existing script to detect the issue and restart docker is easy. My concern is that I would also need to introduce another dependency between the master component manifest update and the docker image load at such a late stage.
Ok, let me quickly come up with a PR for you guys to review.
I assigned this p0 to myself and moved it to 1.0 after talking to a few folks here. I now have a pending PR which times out docker load when it hangs, but I couldn't reproduce the issue last night over many, many runs. Do we really want to include a fix for such a rare failure? It is an ugly issue, but as I mentioned yesterday, GKE shouldn't run into this in production, and for the GCE case we could document how to fix it. If the problem occurs a lot later on, we could patch v1.0.x, so that this fix gets more soak time. Or we could even install the Linux timeout utility in our image.
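The timeout idea would look something like this (a sketch of the approach, not the contents of the pending PR):
# Bound each docker load with the coreutils timeout utility; if it hangs,
# restart docker and retry instead of blocking cluster bring-up forever.
for attempt in 1 2 3; do
  timeout 60 docker load -i /srv/salt/kube-bins/kube-apiserver.tar && break
  echo "docker load timed out (attempt ${attempt}); restarting docker"
  service docker restart
done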
I'm OK with holding this PR until we've got real evidence it changes something material.
Agreed; I'm happy to hold this in my pocket until we see it reported as a problem by users.
Ok, I'm moving this one to post-1.0 and will link this issue to: v1.0.0 known issues / FAQ accumulator #10760
+1 to happy to wait. The PR builder flakes are definitely frustrating for on-calls and anyone in the hotseat waiting for a release to build, but it's a manageable, known pain.
I'm less concerned about the minor frustrations for on-calls, and more concerned about the percentage of GCE cluster builds that are failing, since presumably our new and excited v1.0 customers will experience the same cluster build failure rate until we improve it. For the record:
Are we happy with these failure rates? Do they correlate with @dchen1107's attempts to reproduce? Note that I have not dug into the details of all 48 of the above cluster build failures to confirm that the failure is always the one in this bug.
@quinton-hoole Good point! For those build failures, do we have any logs collected from the master node?
I collected /var/log/* from the master and minion nodes for one failure I observed. I can send you a link internally.
@ixdy Thanks. I just looked through it, and the one you sent me is this one: docker load somehow hangs. But this is just one of those build failures. I talked to @quinton-hoole: the build failure rate is pretty high, but it is caused by various different things.
This happened again, multiple times within four hours: http://kubekins.dls.corp.google.com/job/kubernetes-e2e-gce-parallel/3323/
@davidopp Thanks for the update. Unfortunately, by the time I saw the issue the cluster had already been torn down, so I cannot verify whether it was caused by the same docker load hang or something else. Among yesterday's 20 builds, one failed with a cluster build failure.
Found another cluster (kubernetes-e2e-gce-parallel-flaky) which appears to be stuck with the docker load issue:
root@parallel-flaky-master:~# ps -ef | grep "docker load"
root      3359  3024  0 23:02 ?        00:00:00 docker load -i /srv/salt/kube-bins/kube-apiserver.tar
root     23376 21283  0 23:51 pts/0    00:00:00 grep docker load
root@parallel-flaky-master:~# date
Wed Jul 15 23:51:32 UTC 2015
Are there any logs I should save which would help narrow this down to a simpler repro? (@thockin requested a simpler repro on #10998.)
Some obvious ones I can think of:
@dchen1107 would know more.
@fabioy recently found a test GKE cluster that failed to initialize with the docker load issue. I manually killed docker, ran the side-loading commands, and then it worked as expected. AFAIK this was the first GKE cluster that failed to initialize due to this problem.
@roberthbailey Thanks for reporting. PR #10998 was verified to fix the issue.
Yep. I was just making a note for posterity, since we hadn't ever seen this in GKE before.
Yes, we agreed to merge #10998 tomorrow, right after v1.0.X is cut, so that the PR gets some soak time before we patch it into v1.0.X+1.
FWIW, I ran multiple …
@yujuhong that might be on the right track. A few things I noticed while looking at the last failure: the tarball that's being loaded had 4 layer tarballs:
# tar tvf /srv/salt/kube-bins/kube-apiserver.tar
drwxr-xr-x 0/0 0 2015-07-15 22:02 2fc1e49943cfabf66d04c3b6176ab18e426194d96ea20d1fa86af1836cb553f1/
-rw-r--r-- 0/0 3 2015-07-15 22:02 2fc1e49943cfabf66d04c3b6176ab18e426194d96ea20d1fa86af1836cb553f1/VERSION
-rw-r--r-- 0/0 1502 2015-07-15 22:02 2fc1e49943cfabf66d04c3b6176ab18e426194d96ea20d1fa86af1836cb553f1/json
-rw-r--r-- 0/0 39876096 2015-07-15 22:02 2fc1e49943cfabf66d04c3b6176ab18e426194d96ea20d1fa86af1836cb553f1/layer.tar
drwxr-xr-x 0/0 0 2015-07-15 22:02 6ce2e90b0bc7224de3db1f0d646fe8e2c4dd37f1793928287f6074bc451a57ea/
-rw-r--r-- 0/0 3 2015-07-15 22:02 6ce2e90b0bc7224de3db1f0d646fe8e2c4dd37f1793928287f6074bc451a57ea/VERSION
-rw-r--r-- 0/0 1467 2015-07-15 22:02 6ce2e90b0bc7224de3db1f0d646fe8e2c4dd37f1793928287f6074bc451a57ea/json
-rw-r--r-- 0/0 2643968 2015-07-15 22:02 6ce2e90b0bc7224de3db1f0d646fe8e2c4dd37f1793928287f6074bc451a57ea/layer.tar
drwxr-xr-x 0/0 0 2015-07-15 22:02 8c2e06607696bd4afb3d03b687e361cc43cf8ec1a4a725bc96e39f05ba97dd55/
-rw-r--r-- 0/0 3 2015-07-15 22:02 8c2e06607696bd4afb3d03b687e361cc43cf8ec1a4a725bc96e39f05ba97dd55/VERSION
-rw-r--r-- 0/0 1408 2015-07-15 22:02 8c2e06607696bd4afb3d03b687e361cc43cf8ec1a4a725bc96e39f05ba97dd55/json
-rw-r--r-- 0/0 1024 2015-07-15 22:02 8c2e06607696bd4afb3d03b687e361cc43cf8ec1a4a725bc96e39f05ba97dd55/layer.tar
drwxr-xr-x 0/0 0 2015-07-15 22:02 cf2616975b4a3cba083ca99bc3f0bf25f5f528c3c52be1596b30f60b0b1c37ff/
-rw-r--r-- 0/0 3 2015-07-15 22:02 cf2616975b4a3cba083ca99bc3f0bf25f5f528c3c52be1596b30f60b0b1c37ff/VERSION
-rw-r--r-- 0/0 1243 2015-07-15 22:02 cf2616975b4a3cba083ca99bc3f0bf25f5f528c3c52be1596b30f60b0b1c37ff/json
-rw-r--r-- 0/0      1024 2015-07-15 22:02 cf2616975b4a3cba083ca99bc3f0bf25f5f528c3c52be1596b30f60b0b1c37ff/layer.tar
In docker's tmp directory, all but one of these (the cf2616… layer is missing) were present:
root@parallel-flaky-master:/var/lib/docker/tmp/docker-import-011961600/repo# ls -l
total 16
drwxr-xr-x 2 root root 4096 Jul 15 22:02 2fc1e49943cfabf66d04c3b6176ab18e426194d96ea20d1fa86af1836cb553f1
drwxr-xr-x 2 root root 4096 Jul 15 22:02 6ce2e90b0bc7224de3db1f0d646fe8e2c4dd37f1793928287f6074bc451a57ea
drwxr-xr-x 2 root root 4096 Jul 15 22:02 8c2e06607696bd4afb3d03b687e361cc43cf8ec1a4a725bc96e39f05ba97dd55
Though it did exist in docker's graph directory?
/var/lib/docker# ls graph/
2a141ce650849144be11de0187f274e5bcfb3f560f3fd5d765ff82388ecd1833 6d4946999d4fb403f40e151ecbd13cb866da125431eb1df0cdfd4dc72674e3c6
2c40b0526b6358710fd09e7b8c022429268cc61703b4777e528ac9d469a07ca1 73ef6762e29ac4c46c5639469edf31dedefa8f50a914e61e048e1418c4038ad9
428b411c28f0c33e561a95400a729552db578aee0553f87053b96fc0008cca6a 9fd3c8c9af32dddb1793ccb5f6535e12d735eacae16f8f8c4214f42f33fe3d29
435050075b3f881611b0f4c141bb723f38603caacd31a13a185c1a38acfb4ade a2d2a4d3aff19020ebb41fec3d0a1023870f4b4f890c1fa466419039007b7dd9
4b344aede0e3494ccb7e28f5fa266ae6447f45b374d2be52426c3cd9629af648 bbb13d2df30555493b98405a3e2436c0f1645ebb258631c41d2295c8547c6e96
503d9be0c075ca24029af706f54970afc62bdaa89a7a9bde52234dd8a36d022e bfaec288b825141fdc8f434e7578ac967a4beb07cffe963301c3436cf100948f
511136ea3c5a64f264b78b5433614aec563103b4d4702f3ba7d4d2698e22c158 cc5d5e239fac2b7766a72d7fb6c74566887ec4e79183011ce1ba600eee1b2d53
56ba5533a2dbf18b017ed12d99c6c83485f7146ed0eb3a2e9966c27fc5a5dd7b cf2616975b4a3cba083ca99bc3f0bf25f5f528c3c52be1596b30f60b0b1c37ff
5b578b621bf596fd4067981cd295a7146918dafd31501a68e646392085027f50 dccc6a178e09614ac1b708fdba6af57923e90e2700f2ccdd1052835e3187ea9f
6ce2e90b0bc7224de3db1f0d646fe8e2c4dd37f1793928287f6074bc451a57ea _tmp
I wasn't sure if I found anything anomalous or not.
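One hypothetical way to script the comparison above (directory names taken from the listings here; the docker-import path varies per failure):
# Layer IDs present in the tarball vs. layers docker actually extracted;
# the ID appearing only in the tarball is presumably where the load got stuck.
tar tf /srv/salt/kube-bins/kube-apiserver.tar | cut -d/ -f1 | sort -u > /tmp/tarball-layers
ls /var/lib/docker/tmp/docker-import-011961600/repo | sort > /tmp/extracted-layers
diff /tmp/tarball-layers /tmp/extracted-layers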
@yujuhong I think the problem here is slightly different from #10623 and the other related docker pull issues, but the root cause in the docker graph driver might be the same. The issues you listed above are corrupted layers; normally in that case docker pull returns successfully and docker run / create fails with an Image_Not_Exist error. The issue we have here is docker load hanging forever, which looks like a deadlock somewhere in the docker graph driver. I personally think this is more closely related to moby/moby#12823, but there is no fix from docker yet.
It appears that Jenkins has been healthier since #10998 merged, so it seems likely that the docker load hang was the main culprit.
Haven't seen any cluster initialization failures recently, so closing. |