
A lot of tests of prow tests are not finishing #2708

Closed
wojtek-t opened this issue May 10, 2017 · 19 comments
Labels
kind/bug Categorizes issue or PR as related to a bug.

Comments

@wojtek-t
Member

Looking into:
https://k8s-testgrid.appspot.com/perf-tests
I'm seeing tens of tests that are reported as "Build Still Running!"

There has to be something wrong in prow here - I'm not even able to debug those runs because there are no logs.

@krzyzacy
Member

/assign @spxtr

@wojtek-t
Member Author

BTW - now when I'm looking at:
https://k8s-testgrid.appspot.com/perf-tests

it seems that the majority of runs aren't even there, e.g. there isn't any run between 371 and 400, nor anything after 400 (and we now seem to be at 470):
https://prow.k8s.io/log?pod=ci-perf-tests-e2e-gce-clusterloader-470

@wojtek-t
Member Author

Clicking from here: https://prow.k8s.io/?type=periodic

@krzyzacy
Member

The pod might have been evicted.

@wojtek-t
Member Author

OK - so it seems there isn't anything clusterloader-specific here:
https://k8s-testgrid.appspot.com/google-gke#ubuntu-gke-1.6-slow

@wojtek-t wojtek-t changed the title A lot of tests of clusterloader is not finishing A lot of tests of prow tests are not finishing May 10, 2017
@wojtek-t
Member Author

wojtek-t commented May 10, 2017

It seems that almost all the jobs are being Evicted due to disk-pressure:

  8m		8m		1	{kubelet gke-prow-build-pool-a89df2af-bg4x}			Warning		Evicted			The node was low on resource: [DiskPressure].
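For reference, pods evicted this way remain in the Failed phase with status.Reason set to "Evicted" until something deletes them, so they can be found programmatically. A minimal client-go sketch, assuming only that the default kubeconfig points at the prow build cluster (this is not tooling from the thread):

```go
// List pods that kubelet has evicted; they stay in the Failed phase with
// Reason "Evicted" and carry the eviction message (e.g. "The node was low
// on resource: [DiskPressure].") in their status.
package main

import (
	"context"
	"fmt"

	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
	"k8s.io/client-go/kubernetes"
	"k8s.io/client-go/tools/clientcmd"
)

func main() {
	// Assumes ~/.kube/config points at the prow build cluster.
	config, err := clientcmd.BuildConfigFromFlags("", clientcmd.RecommendedHomeFile)
	if err != nil {
		panic(err)
	}
	client, err := kubernetes.NewForConfig(config)
	if err != nil {
		panic(err)
	}

	pods, err := client.CoreV1().Pods(metav1.NamespaceAll).List(context.TODO(), metav1.ListOptions{})
	if err != nil {
		panic(err)
	}
	for _, p := range pods.Items {
		if p.Status.Reason == "Evicted" {
			fmt.Printf("%s/%s: %s\n", p.Namespace, p.Name, p.Status.Message)
		}
	}
}
```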

@wojtek-t
Member Author

@dchen1107 @yujuhong - I guess we are configuring things badly for the prow cluster, but you may be able to provide some insight into why this is happening and how to solve it.

@wojtek-t wojtek-t added priority/P0 kind/bug Categorizes issue or PR as related to a bug. labels May 10, 2017
@spxtr
Contributor

spxtr commented May 10, 2017

I opened kubernetes/kubernetes#45558 to track the evictions. Prow will restart the pods once they're evicted, so your job will eventually start.

@spxtr
Contributor

spxtr commented May 10, 2017

We hit a point last night when every single node had disk pressure so we couldn't make any progress. I recreated all nodes so we should be okay for another couple days. This is obviously not a good state to be in.

@krzyzacy
Member

Worth adding more nodes?

@wojtek-t
Member Author

I recreated all nodes so we should be okay for another couple days.

I'm not sure how that was supposed to help. IIUC, every test run by prow uses disk heavily, because it downloads the whole repo etc., so heavy disk usage is expected in my opinion.

And it seems we hit this problem again today, just a few hours later.

I think that there are basically two potential solutions:

  • increase disks of nodes in the cluster significantly
  • add more nodes to the cluster (so that there are fewer pods per node on average).

@wojtek-t
Member Author

BTW - it seems that **none of the last 50 runs** at https://k8s-testgrid.appspot.com/perf-tests produced anything useful (i.e. there are no logs from them). So I would suggest working on the stability of prow first before migrating more to it. @fejta

@yujuhong
Contributor

We should check first whether kubelet is cleaning up disk space properly.

increase disks of nodes in the cluster significantly
add more nodes to the cluster (so that there are fewer pods per node on average).

Also, if we know how much disk space each pod consumes on average, we can set a bogus resource request to make sure pods are spread out evenly.
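To illustrate that idea (a sketch only: disk is not a schedulable resource in 1.6, so the spreading would have to go through something like an inflated memory request; the function name and the 10Gi figure are placeholders, not values from this thread):

```go
// Sketch of an inflated resource request on the test container, so the
// scheduler packs fewer test pods per node and each node's disk is shared
// by fewer jobs. The 10Gi value is a placeholder, not a measured number.
package example

import (
	corev1 "k8s.io/api/core/v1"
	"k8s.io/apimachinery/pkg/api/resource"
)

func spreadingRequests() corev1.ResourceRequirements {
	return corev1.ResourceRequirements{
		Requests: corev1.ResourceList{
			corev1.ResourceMemory: resource.MustParse("10Gi"),
		},
	}
}
```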

@fejta
Contributor

fejta commented May 11, 2017

The issue is in Kubernetes itself -- 1.6 doesn't work -- and we're resolving it by downgrading to 1.4.

@fejta
Contributor

fejta commented May 11, 2017

kubernetes/kubernetes#45558 (that Joe mentioned earlier)

@wojtek-t
Member Author

In the meantime, maybe we can add new nodes to the cluster? Or mitigate this problem in a different way?

@yujuhong
Contributor

@wojtek-t we've found the cause in kubernetes/kubernetes#45558.

There are too many terminated pods in the cluster, and each of them takes up a significant amount of disk space, since kubelet keeps the dead containers around for their logs.
Kubelet's eviction manager should evict those terminated pods, but it doesn't do that today. The pod GC controller doesn't help either, since it usually has a high threshold.
The temporary workaround is to scan and delete those pods more frequently (sketched below).

I believe a 1.4 cluster would have the same problem as terminated pods start accumulating.
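For context, the pod GC controller's threshold is kube-controller-manager's --terminated-pod-gc-threshold flag, which has a high default. A minimal client-go sketch of the scan-and-delete workaround (the function name and error handling are illustrative, not the actual cleanup used in the prow cluster):

```go
// Sketch of the workaround: periodically delete pods that have finished
// (Succeeded or Failed, which includes Evicted pods) so kubelet can reclaim
// the disk space held by their dead containers.
package example

import (
	"context"
	"log"

	corev1 "k8s.io/api/core/v1"
	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
	"k8s.io/client-go/kubernetes"
)

func cleanupTerminatedPods(ctx context.Context, client kubernetes.Interface) error {
	pods, err := client.CoreV1().Pods(metav1.NamespaceAll).List(ctx, metav1.ListOptions{})
	if err != nil {
		return err
	}
	for _, p := range pods.Items {
		if p.Status.Phase != corev1.PodSucceeded && p.Status.Phase != corev1.PodFailed {
			continue
		}
		if err := client.CoreV1().Pods(p.Namespace).Delete(ctx, p.Name, metav1.DeleteOptions{}); err != nil {
			log.Printf("failed to delete %s/%s: %v", p.Namespace, p.Name, err)
		}
	}
	return nil
}
```

Run on a schedule, something like this keeps the number of dead containers per node bounded regardless of the GC controller's threshold.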

@spxtr
Contributor

spxtr commented May 12, 2017

Fixed in the short term by #2736. The long-term fix will have to address kubernetes/kubernetes#45558.

@spxtr spxtr closed this as completed May 12, 2017
@wojtek-t
Member Author

Cool - thanks a lot for debugging and fixing!
