A lot of prow tests are not finishing #2708
Comments
/assign @spxtr
BTW - now that I'm looking into it: it seems that the majority of runs are not even there, e.g. there isn't any run between 371 and 400, nor anything after 400 (and we now seem to be at 470):
Clicking from here: https://prow.k8s.io/?type=periodic
the pod might be evicted...
OK - so it seems there isn't anything clusterloader-specific here:
It seems that almost all the jobs are being Evicted due to disk-pressure:
@dchen1107 @yujuhong - I guess we are configuring things badly for the prow cluster, but you may be able to provide some insight into why this is happening and how to solve it.
I opened kubernetes/kubernetes#45558 to track the evictions. Prow will restart the pods once they're evicted, so your job will eventually start.
We hit a point last night when every single node had disk pressure, so we couldn't make any progress. I recreated all nodes so we should be okay for another couple of days. This is obviously not a good state to be in.
Worth adding more nodes?
I'm not sure how that is supposed to help. IIUC, every test run by prow uses the disk heavily, because it downloads the whole repo etc., so heavy disk usage is to be expected in my opinion. And it seems that we hit this problem again today, just a few hours later. I think that there are basically two potential solutions:
BTW - it seems that **none of the last 50 runs** at https://k8s-testgrid.appspot.com/perf-tests produced anything useful (i.e. there are no logs from them). So I would suggest working on the stability of prow first before migrating more to it. @fejta
We should check first whether kubelet is cleaning up disk space properly.
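For reference, how much kubelet cleans up is governed by its garbage-collection flags. A sketch for a 1.6-era kubelet (the flag names are real; the values are illustrative, not the prow cluster's actual settings):

```shell
kubelet \
  --maximum-dead-containers=100 \              # cap on dead containers kept cluster-node-wide
  --maximum-dead-containers-per-container=1 \  # keep at most 1 dead instance per container
  --minimum-container-ttl-duration=1m \        # dead containers become collectible after 1m
  --image-gc-high-threshold=85 \               # start image GC at 85% disk usage
  --image-gc-low-threshold=80                  # GC images down to 80% disk usage
```

If these are left at permissive defaults, dead containers and images can pile up faster than they are reclaimed.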
Also, if we know how much disk space each pod consumes on average, we can set a bogus resource request to make sure pods are spread out evenly.
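The bogus-request idea could look like this in a pod spec (a hypothetical fragment; this Kubernetes version has no first-class disk resource, so an inflated memory request stands in for the pod's expected disk footprint to make the scheduler spread pods across nodes):

```yaml
# Hypothetical fragment of a prow test pod spec.
resources:
  requests:
    memory: "4Gi"   # assumed: roughly the average disk each test pod consumes
    cpu: "500m"     # illustrative value
```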
The issue is in Kubernetes itself -- 1.6 doesn't work -- and we're resolving it by downgrading to 1.4.
kubernetes/kubernetes#45558 (that Joe mentioned earlier)
In the meantime, maybe we can add new nodes to the cluster? Or mitigate this problem in a different way?
@wojtek-t we've found the cause in kubernetes/kubernetes#45558. There are too many terminated pods in the cluster, and each of them takes up a significant amount of disk space, since kubelet keeps the dead containers for their logs. I believe a 1.4 cluster would have the same problem once terminated pods start accumulating.
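A back-of-envelope sketch of why this adds up (all numbers below are hypothetical assumptions for illustration, not measurements from the prow cluster):

```python
# Rough estimate of disk held by dead containers that kubelet keeps for logs.
# All constants are assumed values, chosen only to show the order of magnitude.
TERMINATED_PODS = 500        # assumed number of terminated pods in the cluster
CONTAINERS_PER_POD = 1       # assumed containers per pod
DISK_PER_CONTAINER_GB = 2.0  # assumed writable layer + logs per dead container
NODES = 20                   # assumed node count

total_gb = TERMINATED_PODS * CONTAINERS_PER_POD * DISK_PER_CONTAINER_GB
per_node_gb = total_gb / NODES
print(f"~{total_gb:.0f} GiB cluster-wide, ~{per_node_gb:.0f} GiB per node")
```

Even with modest per-container usage, a few hundred retained dead containers can exhaust the disk on every node, which matches the cluster-wide disk pressure described above.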
Fixed in the short term by #2736. The long-term fix will have to address kubernetes/kubernetes#45558.
Cool - thanks a lot for debugging and fixing!
Looking into:
https://k8s-testgrid.appspot.com/perf-tests
I'm seeing tens of tests that are claimed to be "Build Still Running!"
There has to be something wrong in prow here - I'm not even able to debug those runs because there are no logs.