
A lot of tests of prow tests are not finishing #2708

Closed
wojtek-t opened this issue May 10, 2017 · 19 comments
Labels
kind/bug Categorizes issue or PR as related to a bug.

Comments

@wojtek-t
Member

Looking into:
https://k8s-testgrid.appspot.com/perf-tests
I'm seeing tens of tests that are reported as "Build Still Running!"

There has to be something wrong in prow here - I'm not even able to debug those runs because there are no logs.

@krzyzacy
Member

/assign @spxtr

@wojtek-t
Member Author

BTW - now when I'm looking at:
https://k8s-testgrid.appspot.com/perf-tests

it seems that the majority of runs aren't even there, e.g. there isn't any run between 371 and 400, nor anything after 400 (and we now seem to be at 470):
https://prow.k8s.io/log?pod=ci-perf-tests-e2e-gce-clusterloader-470

@wojtek-t
Member Author

Clicking from here: https://prow.k8s.io/?type=periodic

@krzyzacy
Member

The pod might have been evicted.

@wojtek-t
Member Author

OK - so it seems there isn't anything clusterloader-specific here:
https://k8s-testgrid.appspot.com/google-gke#ubuntu-gke-1.6-slow

@wojtek-t wojtek-t changed the title A lot of tests of clusterloader is not finishing A lot of tests of prow tests are not finishing May 10, 2017
@wojtek-t
Member Author

wojtek-t commented May 10, 2017

It seems that almost all the jobs are being Evicted due to disk-pressure:

  8m		8m		1	{kubelet gke-prow-build-pool-a89df2af-bg4x}			Warning		Evicted			The node was low on resource: [DiskPressure].
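For reference, pods evicted this way remain in the Failed phase with status.Reason set to "Evicted" until something deletes them, so they can be found programmatically. A minimal client-go sketch, assuming only that the default kubeconfig points at the prow build cluster (this is not tooling from the thread):

```go
// List pods that kubelet has evicted; they stay in the Failed phase with
// Reason "Evicted" and carry the eviction message (e.g. "The node was low
// on resource: [DiskPressure].") in their status.
package main

import (
	"context"
	"fmt"

	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
	"k8s.io/client-go/kubernetes"
	"k8s.io/client-go/tools/clientcmd"
)

func main() {
	// Assumes ~/.kube/config points at the prow build cluster.
	config, err := clientcmd.BuildConfigFromFlags("", clientcmd.RecommendedHomeFile)
	if err != nil {
		panic(err)
	}
	client, err := kubernetes.NewForConfig(config)
	if err != nil {
		panic(err)
	}

	pods, err := client.CoreV1().Pods(metav1.NamespaceAll).List(context.TODO(), metav1.ListOptions{})
	if err != nil {
		panic(err)
	}
	for _, p := range pods.Items {
		if p.Status.Reason == "Evicted" {
			fmt.Printf("%s/%s: %s\n", p.Namespace, p.Name, p.Status.Message)
		}
	}
}
```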

@wojtek-t
Member Author

@dchen1107 @yujuhong - I guess we are configuring things badly for the prow cluster, but you may be able to provide some insight into why this is happening and how to solve it.

@wojtek-t wojtek-t added priority/P0 kind/bug Categorizes issue or PR as related to a bug. labels May 10, 2017
@spxtr
Contributor

spxtr commented May 10, 2017

I opened kubernetes/kubernetes#45558 to track the evictions. Prow will restart the pods once they're evicted, so your job will eventually start.

@spxtr
Contributor

spxtr commented May 10, 2017

We hit a point last night when every single node had disk pressure so we couldn't make any progress. I recreated all nodes so we should be okay for another couple days. This is obviously not a good state to be in.

@krzyzacy
Member

Worth adding more nodes?

@wojtek-t
Member Author

I recreated all nodes so we should be okay for another couple days.

I'm not sure how that was supposed to help. IIUC, every test run by prow uses disk heavily, because it downloads the whole repo etc., so heavy disk usage is expected in my opinion.

And it seems we hit this problem again today, just a few hours later.

I think that there are basically two potential solutions:

  • increase disks of nodes in the cluster significantly
  • add more nodes to the cluster (so that there are fewer pods per node on average).

@wojtek-t
Member Author

BTW - it seems that **none of the last 50 runs** at https://k8s-testgrid.appspot.com/perf-tests produced anything useful (i.e. there are no logs from them). So I would suggest working on the stability of prow first before migrating more to it. @fejta

@yujuhong
Contributor

We should check first whether kubelet is cleaning up disk space properly.

increase disks of nodes in the cluster significantly
add more nodes to the cluster (so that there are fewer pods per node on average).

Also, if we know how much disk space each pod consumes on average, we can set a bogus resource request to make sure pods are spread out evenly.
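To illustrate that idea (a sketch only: disk is not a schedulable resource in 1.6, so the spreading would have to go through something like an inflated memory request; the function name and the 10Gi figure are placeholders, not values from this thread):

```go
// Sketch of an inflated resource request on the test container, so the
// scheduler packs fewer test pods per node and each node's disk is shared
// by fewer jobs. The 10Gi value is a placeholder, not a measured number.
package example

import (
	corev1 "k8s.io/api/core/v1"
	"k8s.io/apimachinery/pkg/api/resource"
)

func spreadingRequests() corev1.ResourceRequirements {
	return corev1.ResourceRequirements{
		Requests: corev1.ResourceList{
			corev1.ResourceMemory: resource.MustParse("10Gi"),
		},
	}
}
```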

@fejta
Contributor

fejta commented May 11, 2017

The issue is in Kubernetes itself -- 1.6 doesn't work -- and we're resolving it by downgrading to 1.4.

@fejta
Contributor

fejta commented May 11, 2017

kubernetes/kubernetes#45558 (that Joe mentioned earlier)

@wojtek-t
Member Author

In the meantime, maybe we can add new nodes to the cluster? Or mitigate this problem in a different way?

@yujuhong
Contributor

@wojtek-t we've found the cause in kubernetes/kubernetes#45558.

There are too many terminated pods in the cluster, and each of them takes up a significant amount of disk space, since kubelet keeps the dead containers around for their logs.
Kubelet's eviction manager should evict those terminated pods, but it doesn't do that today. The pod GC controller doesn't help either, since it usually has a high threshold.
The temporary workaround is to scan and delete those pods more frequently (sketched below).

I believe a 1.4 cluster would have the same problem as terminated pods start accumulating.
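For context, the pod GC controller's threshold is kube-controller-manager's --terminated-pod-gc-threshold flag, which has a high default. A minimal client-go sketch of the scan-and-delete workaround (the function name and error handling are illustrative, not the actual cleanup used in the prow cluster):

```go
// Sketch of the workaround: periodically delete pods that have finished
// (Succeeded or Failed, which includes Evicted pods) so kubelet can reclaim
// the disk space held by their dead containers.
package example

import (
	"context"
	"log"

	corev1 "k8s.io/api/core/v1"
	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
	"k8s.io/client-go/kubernetes"
)

func cleanupTerminatedPods(ctx context.Context, client kubernetes.Interface) error {
	pods, err := client.CoreV1().Pods(metav1.NamespaceAll).List(ctx, metav1.ListOptions{})
	if err != nil {
		return err
	}
	for _, p := range pods.Items {
		if p.Status.Phase != corev1.PodSucceeded && p.Status.Phase != corev1.PodFailed {
			continue
		}
		if err := client.CoreV1().Pods(p.Namespace).Delete(ctx, p.Name, metav1.DeleteOptions{}); err != nil {
			log.Printf("failed to delete %s/%s: %v", p.Namespace, p.Name, err)
		}
	}
	return nil
}
```

Run on a schedule, something like this keeps the number of dead containers per node bounded regardless of the GC controller's threshold.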

@spxtr
Contributor

spxtr commented May 12, 2017

Fixed in the short term by #2736. The long-term fix will have to address kubernetes/kubernetes#45558.

@spxtr spxtr closed this as completed May 12, 2017
@wojtek-t
Member Author

Cool - thanks a lot for debugging and fixing!
