cluster frequently failing to initialize on Jenkins #10868
We're seeing clusters frequently fail to initialize on the pull request Jenkins; I'm seeing similar failures (though less frequently) on the post-commit Jenkins runs, too. The failures usually appear like the following:
I ssh'd in to the nodes and master on one such build as it was failing to come up, and saved everything from /var/log on each. I haven't examined everything in full yet, but a few things stick out to me as worrisome: the master seems to be failing to start up, with the kubelet log there showing image failures.
Anyone have any ideas as to what's going on? I can upload all of the logs I saved to GCS in a bit. @alex-mohr @fabioy @quinton-hoole, can you assign folks who may be able to help investigate?
From my desktop: From https://pantheon.corp.google.com/project/google-containers/kubernetes/images/list: I don't know how the pull request Jenkins built this, but it looks like the kube-apiserver container only has one live tag at a time, so it looks like the PR runner only supports single-threaded operation. Presumably something is deleting or changing the tag value before the above run finished?
The pull request Jenkins is supposed to be starting up a cluster with the Kubernetes components it just built. Is that supposed to be using gcr.io at all? More data for comparison: a kubelet.log from a cluster which succeeded in starting up on the pull request Jenkins:
Containers like "gcr.io/google_containers/kube-apiserver:9c5b6dabad3eff8bd9e3af646cc8bd45" are never pushed to a repository, so pulling them is futile; they are shipped (via scp or salt or something) directly to the VMs when we start a cluster.
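For concreteness, that side-loading flow looks roughly like this (a sketch only; the tarball path and image tag are taken from logs later in this thread, and the real salt plumbing differs):
# Build output is copied to the VM and loaded straight into the local
# docker daemon; the gcr.io-style name is only a local tag, never pushed.
gcloud compute copy-files kube-apiserver.tar kubernetes-master:/srv/salt/kube-bins/
gcloud compute ssh kubernetes-master --command 'docker load -i /srv/salt/kube-bins/kube-apiserver.tar'
gcloud compute ssh kubernetes-master --command 'docker images'   # the tag now exists locally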
When I start a regular cluster and use:
I repro this. When I start an e2e cluster and test like this, I don't:
If it is useful, kube-master-addons is the service which runs on the master and does the docker load of the master component images.
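Schematically, that service amounts to something like the following loop (a sketch, not the real script, which has more plumbing; the tarball directory is the one that appears in the logs later in this thread):
# Side-load every shipped component image into the local docker daemon.
for tarball in /srv/salt/kube-bins/*.tar; do
  docker load -i "${tarball}"
done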
There's been a lot of churn in …
Not sure if this is the same issue, but my apiserver wouldn't start on the new master when running the cluster upgrade test (with a freshly created e2e cluster): the apiserver couldn't be started because its image was not available.
@yujuhong, do you mind adding me to your GCE project? I can ssh into your master and take a look while the problem is occurring.
Done.
Ok, I know what is going on here: docker load sometimes hangs forever, and restarting the docker daemon should fix the issue. I know how to fix this, but I'm not sure we want to make such a change now, right before Friday's cut. How about documenting this as a known issue and including a proper fix in v1.1? cc/ @alex-mohr @roberthbailey @brendandburns @erickhan Again, #5011 is for dev cluster upgrading and shouldn't be used as a solution for production. We should fix this by pushing release images to gcr.io, instead of salt-copying tar files to the nodes and running docker load through scripts.
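The manual recovery is roughly the following (a sketch assuming the paths above, not the exact commands anyone ran):
# Kill the wedged load, bounce the daemon, then load again.
pkill -f 'docker load'
service docker restart
docker load -i /srv/salt/kube-bins/kube-apiserver.tar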
@dchen1107 is a short-term workaround to change the salt and/or ssh scripts to do more docker restarts? If so, that seems lowish-risk to production k8s and would help the developer experience in the near term.
@alex-mohr Updating the existing script to detect the issue and restart docker is easy. My concern is that I would also need to introduce another dependency between the master component manifest update and the docker image load at such a late stage.
Ok, let me quickly come up with a PR for you guys to review.
I assigned this p0 to myself and moved it to 1.0 after talking to a few folks here. I now have a pending PR which times out docker load when it hangs, but I couldn't reproduce the issue last night over many, many runs. Do we really want to include a fix for such a rare failure? It is an ugly issue, but as I mentioned yesterday, GKE shouldn't run into this in production, and for the GCE case we could document how to fix it. If the problem occurs a lot later on, we could patch v1.0.x, so that this fix gets more soak time. Or we could even install the Linux timeout utility in our image.
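The timeout idea would look something like this (a sketch of the approach, not the contents of the pending PR):
# Bound each docker load with the coreutils timeout utility; if it hangs,
# restart docker and retry instead of blocking cluster bring-up forever.
for attempt in 1 2 3; do
  timeout 60 docker load -i /srv/salt/kube-bins/kube-apiserver.tar && break
  echo "docker load timed out (attempt ${attempt}); restarting docker"
  service docker restart
done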
I'm OK with holding this PR until we've got real evidence it changes something material.
Agreed; I'm happy to hold this in my pocket until we see it reported as a problem by users.
Ok, I'm moving this one to post-1.0 and will link this issue to: v1.0.0 known issues / FAQ accumulator #10760
+1 to happy to wait. The PR builder flakes are definitely frustrating for on-calls and anyone in the hotseat waiting for a release to build, but it's a manageable, known pain.
I'm less concerned about the minor frustrations for on-calls, and more concerned about the percentage of GCE cluster builds that are failing, since presumably our new and excited v1.0 customers will experience the same cluster build failure rate until we improve it. For the record:
Are we happy with these failure rates? Do they correlate with @dchen1107's attempts to reproduce? Note that I have not dug into the details of all 48 of the above cluster build failures to confirm that the failure is always the one in this bug.
@quinton-hoole Good point! For those build failures, do we have any logs collected from the master node?
I collected /var/log/* from the master and minion nodes for one failure I observed. I can send you a link internally.
@ixdy Thanks. I just looked through it, and the one you sent me is this one: docker load somehow hangs. But this is just one of those build failures. I talked to @quinton-hoole: the build failure rate is pretty high, but it is caused by various different things.
This happened again, multiple times within four hours: http://kubekins.dls.corp.google.com/job/kubernetes-e2e-gce-parallel/3323/
@davidopp Thanks for the update. Unfortunately, by the time I saw the issue the cluster had already been torn down, so I cannot verify whether it was caused by the same docker load hang or something else. Among yesterday's 20 builds, one failed with a cluster build failure.
Found another cluster (kubernetes-e2e-gce-parallel-flaky) which appears to be stuck with the docker load issue:
root@parallel-flaky-master:~# ps -ef | grep "docker load"
root      3359  3024  0 23:02 ?        00:00:00 docker load -i /srv/salt/kube-bins/kube-apiserver.tar
root     23376 21283  0 23:51 pts/0    00:00:00 grep docker load
root@parallel-flaky-master:~# date
Wed Jul 15 23:51:32 UTC 2015
Are there any logs I should save which would help narrow this down to a simpler repro? (@thockin requested a simpler repro on #10998.)
Some obvious ones I can think of:
@dchen1107 would know more.
@fabioy recently found a test GKE cluster that failed to initialize with the docker load issue. I manually killed docker, ran the side-loading commands, and then it worked as expected. AFAIK this was the first GKE cluster that failed to initialize due to this problem.
@roberthbailey Thanks for reporting. PR #10998 was verified to fix the issue.
Yep. I was just making a note for posterity, since we hadn't ever seen this in GKE before.
Yes, we agreed to merge #10998 tomorrow, right after v1.0.X is cut, so that the PR gets some soak time before we patch it into v1.0.X+1.
FWIW, I ran multiple …
@yujuhong that might be on the right track. A few things I noticed while looking at the last failure: the tarball that's being loaded had 4 layer tarballs:
# tar tvf /srv/salt/kube-bins/kube-apiserver.tar
drwxr-xr-x 0/0 0 2015-07-15 22:02 2fc1e49943cfabf66d04c3b6176ab18e426194d96ea20d1fa86af1836cb553f1/
-rw-r--r-- 0/0 3 2015-07-15 22:02 2fc1e49943cfabf66d04c3b6176ab18e426194d96ea20d1fa86af1836cb553f1/VERSION
-rw-r--r-- 0/0 1502 2015-07-15 22:02 2fc1e49943cfabf66d04c3b6176ab18e426194d96ea20d1fa86af1836cb553f1/json
-rw-r--r-- 0/0 39876096 2015-07-15 22:02 2fc1e49943cfabf66d04c3b6176ab18e426194d96ea20d1fa86af1836cb553f1/layer.tar
drwxr-xr-x 0/0 0 2015-07-15 22:02 6ce2e90b0bc7224de3db1f0d646fe8e2c4dd37f1793928287f6074bc451a57ea/
-rw-r--r-- 0/0 3 2015-07-15 22:02 6ce2e90b0bc7224de3db1f0d646fe8e2c4dd37f1793928287f6074bc451a57ea/VERSION
-rw-r--r-- 0/0 1467 2015-07-15 22:02 6ce2e90b0bc7224de3db1f0d646fe8e2c4dd37f1793928287f6074bc451a57ea/json
-rw-r--r-- 0/0 2643968 2015-07-15 22:02 6ce2e90b0bc7224de3db1f0d646fe8e2c4dd37f1793928287f6074bc451a57ea/layer.tar
drwxr-xr-x 0/0 0 2015-07-15 22:02 8c2e06607696bd4afb3d03b687e361cc43cf8ec1a4a725bc96e39f05ba97dd55/
-rw-r--r-- 0/0 3 2015-07-15 22:02 8c2e06607696bd4afb3d03b687e361cc43cf8ec1a4a725bc96e39f05ba97dd55/VERSION
-rw-r--r-- 0/0 1408 2015-07-15 22:02 8c2e06607696bd4afb3d03b687e361cc43cf8ec1a4a725bc96e39f05ba97dd55/json
-rw-r--r-- 0/0 1024 2015-07-15 22:02 8c2e06607696bd4afb3d03b687e361cc43cf8ec1a4a725bc96e39f05ba97dd55/layer.tar
drwxr-xr-x 0/0 0 2015-07-15 22:02 cf2616975b4a3cba083ca99bc3f0bf25f5f528c3c52be1596b30f60b0b1c37ff/
-rw-r--r-- 0/0 3 2015-07-15 22:02 cf2616975b4a3cba083ca99bc3f0bf25f5f528c3c52be1596b30f60b0b1c37ff/VERSION
-rw-r--r-- 0/0 1243 2015-07-15 22:02 cf2616975b4a3cba083ca99bc3f0bf25f5f528c3c52be1596b30f60b0b1c37ff/json
-rw-r--r-- 0/0      1024 2015-07-15 22:02 cf2616975b4a3cba083ca99bc3f0bf25f5f528c3c52be1596b30f60b0b1c37ff/layer.tar
In docker's tmp directory, all but one of these (the cf2616… layer is missing) were present:
root@parallel-flaky-master:/var/lib/docker/tmp/docker-import-011961600/repo# ls -l
total 16
drwxr-xr-x 2 root root 4096 Jul 15 22:02 2fc1e49943cfabf66d04c3b6176ab18e426194d96ea20d1fa86af1836cb553f1
drwxr-xr-x 2 root root 4096 Jul 15 22:02 6ce2e90b0bc7224de3db1f0d646fe8e2c4dd37f1793928287f6074bc451a57ea
drwxr-xr-x 2 root root 4096 Jul 15 22:02 8c2e06607696bd4afb3d03b687e361cc43cf8ec1a4a725bc96e39f05ba97dd55
Though it did exist in docker's graph directory?
/var/lib/docker# ls graph/
2a141ce650849144be11de0187f274e5bcfb3f560f3fd5d765ff82388ecd1833 6d4946999d4fb403f40e151ecbd13cb866da125431eb1df0cdfd4dc72674e3c6
2c40b0526b6358710fd09e7b8c022429268cc61703b4777e528ac9d469a07ca1 73ef6762e29ac4c46c5639469edf31dedefa8f50a914e61e048e1418c4038ad9
428b411c28f0c33e561a95400a729552db578aee0553f87053b96fc0008cca6a 9fd3c8c9af32dddb1793ccb5f6535e12d735eacae16f8f8c4214f42f33fe3d29
435050075b3f881611b0f4c141bb723f38603caacd31a13a185c1a38acfb4ade a2d2a4d3aff19020ebb41fec3d0a1023870f4b4f890c1fa466419039007b7dd9
4b344aede0e3494ccb7e28f5fa266ae6447f45b374d2be52426c3cd9629af648 bbb13d2df30555493b98405a3e2436c0f1645ebb258631c41d2295c8547c6e96
503d9be0c075ca24029af706f54970afc62bdaa89a7a9bde52234dd8a36d022e bfaec288b825141fdc8f434e7578ac967a4beb07cffe963301c3436cf100948f
511136ea3c5a64f264b78b5433614aec563103b4d4702f3ba7d4d2698e22c158 cc5d5e239fac2b7766a72d7fb6c74566887ec4e79183011ce1ba600eee1b2d53
56ba5533a2dbf18b017ed12d99c6c83485f7146ed0eb3a2e9966c27fc5a5dd7b cf2616975b4a3cba083ca99bc3f0bf25f5f528c3c52be1596b30f60b0b1c37ff
5b578b621bf596fd4067981cd295a7146918dafd31501a68e646392085027f50 dccc6a178e09614ac1b708fdba6af57923e90e2700f2ccdd1052835e3187ea9f
6ce2e90b0bc7224de3db1f0d646fe8e2c4dd37f1793928287f6074bc451a57ea _tmp
I wasn't sure if I found anything anomalous or not.
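One hypothetical way to script the comparison above (directory names taken from the listings here; the docker-import path varies per failure):
# Layer IDs present in the tarball vs. layers docker actually extracted;
# the ID appearing only in the tarball is presumably where the load got stuck.
tar tf /srv/salt/kube-bins/kube-apiserver.tar | cut -d/ -f1 | sort -u > /tmp/tarball-layers
ls /var/lib/docker/tmp/docker-import-011961600/repo | sort > /tmp/extracted-layers
diff /tmp/tarball-layers /tmp/extracted-layers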
@yujuhong I think the problem here is slightly different from #10623 and the other related docker pull issues, but the root cause in the docker graph driver might be the same. The issues you listed above are corrupted layers; normally in that case docker pull returns successfully and docker run / create fails with an Image_Not_Exist error. The issue we have here is docker load hanging forever, which looks like a deadlock somewhere in the docker graph driver. I personally think this is more closely related to moby/moby#12823, but there is no fix from docker yet.
It appears that Jenkins has been healthier since #10998 merged, so it seems likely that the docker load hang was the main culprit.
Haven't seen any cluster initialization failures recently, so closing. |