Seeing a decrease in scheduler throughput #56714
Are alpha features enabled in these tests? The scheduler uses a new alpha scheduling queue in 1.9. That might be causing the slowdown. |
@bsalamat Are those features turned on by default? Those jobs are mostly following the default configs of kube-up. The only ones we're overriding are these. |
@shyamjvs No, the new queue is off by default and the overrides that you linked do not enable it. |
In the past, the throughput was significantly higher initially, and it was dropping with the number of scheduled pods in the system. Is it still the case? Or is the throughput roughly constant over time this time? |
Wojtek: IIRC it was constant for almost the whole time at ~300 pods per 10s. Hmm, it's strange that the throughput was affected by the number of scheduled pods already in the system, especially given that we're not even using any affinity-related features. After thinking a bit, I have the following hypothesis:
If the above is correct, the only part which remains unexplained is why we saw the increase from 1h5m -> 1h28m. A regression? |
That isn't strange. There are a bunch of things in the scheduler that still depend on the number of scheduled pods. One of them is spreading priority.
That isn't fully true - see above. |
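For context, a rough sketch of why a spreading-style priority gets more expensive as the number of already-scheduled pods grows: scoring a pod means scanning the pods already scheduled on each node and counting the ones that match the same selector, so the work scales with how many pods the cluster is already running. This is my own illustration, not the real kube-scheduler code, and all names in it are made up.

// Hypothetical illustration (not the real kube-scheduler code) of a
// spreading-style priority: every time a pod is scored, the pods already
// scheduled on each node are scanned, so the cost grows with the number of
// scheduled pods in the cluster.
package main

import "fmt"

// countMatching counts pods on a node whose labels include the selector.
func countMatching(selector map[string]string, podLabels []map[string]string) int {
	count := 0
	for _, labels := range podLabels {
		match := true
		for k, v := range selector {
			if labels[k] != v {
				match = false
				break
			}
		}
		if match {
			count++
		}
	}
	return count
}

func main() {
	selector := map[string]string{"app": "density"}
	// Pods already scheduled, keyed by node; in a 5k-node run this map holds
	// tens of thousands of entries, which is what makes the scan expensive.
	scheduled := map[string][]map[string]string{
		"node-1": {{"app": "density"}, {"app": "density"}},
		"node-2": {{"app": "density"}},
		"node-3": {},
	}
	for node, pods := range scheduled {
		// Fewer matching pods => better spreading score for this node.
		fmt.Printf("%s: %d matching pods already scheduled\n", node, countMatching(selector, pods))
	}
}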
Seeing that it matches the mean value of the samples in the above graphs pretty well, I took a look at the throughput reported in the |
Unfortunately on a 100-node cluster the throughput has been almost always exactly |
In small clusters, it's limited by qps limits. |
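As background on the QPS point (a minimal sketch of my own, not something taken from these tests): the per-component --kube-api-qps / --kube-api-burst flags end up configuring the client-side rate limiter of that component's API client, so in a small cluster the scheduler can be throttled by its own client long before it is CPU-bound. In client-go terms, roughly:

// Minimal sketch of the client-side throttling that --kube-api-qps / --kube-api-burst
// control; the kubeconfig path and the numbers below are illustrative.
package main

import (
	"fmt"

	"k8s.io/client-go/kubernetes"
	"k8s.io/client-go/tools/clientcmd"
)

func main() {
	cfg, err := clientcmd.BuildConfigFromFlags("", "/path/to/kubeconfig") // illustrative path
	if err != nil {
		panic(err)
	}
	// These are the fields the flags ultimately set: if the scheduler tries to
	// issue more requests (e.g. pod binding calls) per second than QPS allows,
	// the client itself throttles them, capping observed scheduling throughput.
	cfg.QPS = 100
	cfg.Burst = 100

	client, err := kubernetes.NewForConfig(cfg)
	if err != nil {
		panic(err)
	}
	fmt.Printf("created %T with QPS=%v Burst=%v\n", client, cfg.QPS, cfg.Burst)
}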
Looks like the way to go ahead is to bisect on a 5k-node cluster :-/ |
Or bump QPS limits in 100-node cluster locally to do bisection. |
Unfortunately I'm not able to make the scheduler the bottleneck in a 100-node test:
even using the |
You can try bumping it even more, though I'm not sure whether that might cause some other issue. |
Thanks guys for looking into this. IIUC, the scheduler's throughput is at least 75 pods/s in a 100-node cluster. At that rate we can achieve our SLO of scheduling 1000 pods in 15 seconds (1000 / 75 ≈ 13.3 s). That said, I think we should continue our investigation and make the scheduler faster if possible. We are in the process of adding an ordering for the execution of predicates, so that the predicates with the least computational overhead run first; if one of them fails, we won't run any other predicate for that node. |
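To illustrate that idea (a hypothetical sketch only; the predicate names and types below are made up and are not the actual kube-scheduler API): run the cheapest predicates first and stop at the first failure, so the expensive checks never run for nodes that would be rejected anyway.

// Hypothetical sketch of ordered, short-circuiting predicate evaluation.
package main

import "fmt"

type Node struct {
	Name       string
	FreeCPU    int // millicores
	PortsInUse map[int]bool
}

type Predicate struct {
	Name string
	Fits func(requestCPU int, hostPort int, n Node) bool
}

// Cheapest checks first; the expensive ones (e.g. inter-pod affinity in the
// real scheduler) go last so they only run for nodes that passed everything else.
var ordered = []Predicate{
	{"PodFitsHostPorts", func(_ int, port int, n Node) bool { return port == 0 || !n.PortsInUse[port] }},
	{"PodFitsResources", func(cpu int, _ int, n Node) bool { return cpu <= n.FreeCPU }},
}

// nodeFits stops at the first failing predicate instead of evaluating all of them.
func nodeFits(requestCPU, hostPort int, n Node) (bool, string) {
	for _, p := range ordered {
		if !p.Fits(requestCPU, hostPort, n) {
			return false, p.Name
		}
	}
	return true, ""
}

func main() {
	nodes := []Node{
		{"node-1", 500, map[int]bool{8080: true}},
		{"node-2", 4000, map[int]bool{}},
	}
	for _, n := range nodes {
		ok, failed := nodeFits(1000, 8080, n)
		fmt.Printf("%s fits=%v failedPredicate=%q\n", n.Name, ok, failed)
	}
}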
Bobby, we didn't change anything yet; so far I'm just trying (unsuccessfully) to reproduce the issue of throughput being ~28 in a 5k-node cluster. |
Still no luck with the following. Looks like I'll need to bring up a full-sized cluster after all:
export SCHEDULER_TEST_ARGS='--kube-api-qps=100'
export CONTROLLER_MANAGER_TEST_ARGS='--kube-api-qps=100 --kube-api-burst=100'
export APISERVER_TEST_ARGS='--max-requests-inflight=9000 --max-mutating-requests-inflight=4000'
|
@porridge Maybe we don't need to go all the way to 5000 nodes (which our project quota won't allow at the moment anyway) - can we check whether we see some difference at 1000 or 2000 nodes? Also, bisection should be relatively fast if you run just the density test (it should take about 2-3 hrs per run, IIRC). |
See above; 2k-node clusters don't show this regression. |
I'm also using a doctored density test which runs the bare minimum we need.
|
@wojtek-t here is how all RCs grew over time |
@porridge It kind of does, but the fact that throughput is higher in larger clusters does not make sense. |
@bsalamat that in turn was answered in #56714 (comment) :-) |
Oops, it seems that the latency graphs for the manual test in #56714 (comment) and #56714 (comment) were off by an hour, because the test output for the manual run was in CET while the master-side logs were in UTC. So I actually graphed latency for the time period when the test was over and the RCs were being deleted. The pod creations happening in that period are still surprising to me, and perhaps should be investigated, but are likely irrelevant to the question of low scheduler throughput which I'm debugging here. |
In other words, I would like to know if we can count on at least 75 pods/s that you had seen in 100-node clusters. |
You're asking purely about the scheduler, right? I'm not sure the system as a whole would be able to handle that throughput (unless we increase the master size or something like that); we would need to check. But if we focus just on the scheduler, then in most cases 75 should be easily achievable. However, if we have many pods with pod affinity/anti-affinity, this would no longer be the case. |
Thanks, @wojtek-t! Yes, I am asking about scheduler and assuming that API server handles the associated load. I am aware that affinity/anti-affinity can drop performance a lot, but we are interested in scheduler's performance in scheduling simple pods, i.e., no fancy spec like affinity/anti-affinity. |
Adding a graph of CPU usage of kube-scheduler across gce-scale-performance runs. There is a major difference between runs and it DOES seem to be correlated with the drop in throughput that we're seeing in #56714 (comment) |
I ran another test today (blue) with the same commit b262585, on the same cluster that I used in #56714 (comment), and obtained a very similar result to the previous one (red). One interesting thing is how the throughput visibly improved in the last quarter: whatever changed at that point made a big difference. I checked the CPU usage of kube-scheduler with htop a few times during the test and it was between 7.2 and 8.2 cores. Unfortunately I did not look during that last part. |
I ran another test, and the throughput was similarly low. Then in the middle of the run I restarted kube-scheduler. The throughput did not increase. Then I tried to similarly restart kube-apiserver, but this was not a good decision. With ~100k pods running, ~50k pods still waiting to be scheduled, and ~70 pending, the load on the API server is so high that it is not able to recover. Most requests are refused with a 429, and the health checks done by the master kubelet fail, causing a crashloop. Bumping the grace period of the health check to several minutes is not enough. |
Summary of findings so far:
Given:
I decided to stop here and un-assign this bug for now. If anyone has ideas on how to approach it, I might come back to this. |
/unassign |
/cc @misterikkit |
Following from an offline discussion.
I'm noticing that the scheduler throughput of our latest successful 5k-node run seems to be ~300 pods per 10s:
IIRC it was considerably higher before. Also, it's weird that it is that high in comparison with the 2k-node test:
/assign @porridge
cc @kubernetes/sig-scheduling-bugs @kubernetes/sig-scalability-bugs @bsalamat @wojtek-t
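For anyone trying to reproduce the measurement, here is a sketch of how a "pods per 10s" series like the one above can be derived from pod scheduling timestamps. This is purely illustrative and is not the density test's own code; the function name and the synthetic timestamps are made up.

// Illustrative only: bucket pod scheduling timestamps into 10s windows to get
// a "pods scheduled per 10s" series like the one discussed above.
package main

import (
	"fmt"
	"sort"
	"time"
)

func throughputPer10s(scheduledAt []time.Time) []int {
	if len(scheduledAt) == 0 {
		return nil
	}
	sort.Slice(scheduledAt, func(i, j int) bool { return scheduledAt[i].Before(scheduledAt[j]) })
	start := scheduledAt[0]
	last := scheduledAt[len(scheduledAt)-1]
	buckets := make([]int, int(last.Sub(start)/(10*time.Second))+1)
	for _, t := range scheduledAt {
		buckets[int(t.Sub(start)/(10*time.Second))]++
	}
	return buckets
}

func main() {
	// Synthetic data: 100 pods scheduled 300ms apart, i.e. roughly 33 per 10s.
	base := time.Now()
	var ts []time.Time
	for i := 0; i < 100; i++ {
		ts = append(ts, base.Add(time.Duration(i)*300*time.Millisecond))
	}
	fmt.Println(throughputPer10s(ts))
}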