
Seeing a decrease in scheduler throughput #56714

Closed
shyamjvs opened this issue Dec 1, 2017 · 66 comments
Assignees
Labels
kind/bug: Categorizes issue or PR as related to a bug.
lifecycle/rotten: Denotes an issue or PR that has aged beyond stale and will be auto-closed.
sig/scalability: Categorizes an issue or PR as relevant to SIG Scalability.
sig/scheduling: Categorizes an issue or PR as relevant to SIG Scheduling.

Comments

@shyamjvs
Member

shyamjvs commented Dec 1, 2017

Following up on an offline discussion.
I'm noticing that the scheduler throughput of our latest successful 5k-node run seems to be ~300 pods per 10s:

I1122 21:22:07.066] Nov 22 21:22:07.065: INFO: Density Pods: 150000 out of 150000 created, 147773 running, 49 pending, 2178 waiting, 0 inactive, 0 terminating, 0 unknown, 0 runningButNotReady 
I1122 21:22:17.020] Nov 22 21:22:17.020: INFO: Density Pods: 150000 out of 150000 created, 148027 running, 53 pending, 1920 waiting, 0 inactive, 0 terminating, 0 unknown, 0 runningButNotReady 
I1122 21:22:26.990] Nov 22 21:22:26.989: INFO: Density Pods: 150000 out of 150000 created, 148310 running, 45 pending, 1645 waiting, 0 inactive, 0 terminating, 0 unknown, 0 runningButNotReady 
I1122 21:22:37.017] Nov 22 21:22:37.017: INFO: Density Pods: 150000 out of 150000 created, 148566 running, 53 pending, 1381 waiting, 0 inactive, 0 terminating, 0 unknown, 0 runningButNotReady 
I1122 21:22:46.998] Nov 22 21:22:46.997: INFO: Density Pods: 150000 out of 150000 created, 148843 running, 56 pending, 1101 waiting, 0 inactive, 0 terminating, 0 unknown, 0 runningButNotReady
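
For context, the ~300 pods per 10s figure roughly matches the deltas of the "running" count between consecutive ~10s samples in these lines. A minimal sketch of that computation, with the counts copied from the log above:

package main

import "fmt"

func main() {
	// "running" counts from the consecutive Density Pods lines above, ~10s apart.
	running := []int{147773, 148027, 148310, 148566, 148843}
	for i := 1; i < len(running); i++ {
		fmt.Printf("interval %d: %d pods started per ~10s\n", i, running[i]-running[i-1])
	}
	// prints 254, 283, 256, 277 -- i.e. roughly 250-300 pods per 10s at this point of the run
}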

IIRC it was considerably higher before. Also,

E2E startup time for 150000 pods: 1h28m53.009192542s

It's odd that it's so much higher than in the 2k-node test:

E2E startup time for 60000 pods: 11m10.450099342s

/assign @porridge

cc @kubernetes/sig-scheduling-bugs @kubernetes/sig-scalability-bugs @bsalamat @wojtek-t

@k8s-ci-robot added the sig/scheduling, kind/bug, and sig/scalability labels Dec 1, 2017
@shyamjvs
Member Author

shyamjvs commented Dec 1, 2017

Indeed it seems like it decreased. Earlier it was:

E2E startup time for 150000 pods: 1h4m31.349005395s (in run 52)
E2E startup time for 150000 pods: 1h5m11.387550071s (in run 54)

It's roughly 1.4 times slower now.

@bsalamat
Member

bsalamat commented Dec 1, 2017

Are alpha features enabled in these tests? The scheduler uses a new alpha scheduling queue in 1.9; that might be causing the slowdown.

@shyamjvs
Member Author

shyamjvs commented Dec 1, 2017

@bsalamat Are those features turned on by default? Those jobs mostly follow the default kube-up configs; the only flags we're overriding are these.

@bsalamat
Member

bsalamat commented Dec 1, 2017

@shyamjvs No, the new queue is off by default and the overrides that you linked do not enable it.

@wojtek-t
Member

wojtek-t commented Dec 3, 2017

In the past, the throughput was significantly higher initially and dropped as the number of scheduled pods in the system grew. Is that still the case, or is the throughput roughly constant over time now?

@shyamjvs
Member Author

shyamjvs commented Dec 4, 2017

Wojtek: IIRC it was constant at ~300 pods per 10s for almost the whole time. Hmm... it's strange that the throughput was affected by the number of pods already scheduled in the system, especially given that we're not even using any affinity-related features.

After thinking a bit, I have the following hypothesis:

  • Throughput in small clusters is bottlenecked by QPS limits, while in large clusters it is bottlenecked by scheduling computation. This explains why we're scheduling far fewer than 1000 pods per 10s (even though 100 QPS would allow for it).
  • Scheduling computation takes time linear in the number of nodes (which IMO sounds reasonable). This explains why throughput in the 2k-node test (~800 pods per 10s) is roughly 2.5 times that of the 5k-node test (~300). @bsalamat - could you confirm whether this is true?
  • The overall E2E pod-startup times I mentioned above (effectively the overall scheduling time, since scheduling seemed to be the bottleneck anyway) differ so much between the 2k-node test (11m) and the 5k-node test (1h5m) because one factor of ~2.5 comes from scheduling speed and another factor of ~2.5 from the number of pods we're creating; see the rough check below.
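
A rough back-of-the-envelope check of these factors, using the approximate numbers quoted in this thread (so the figures are only indicative):

  2k-node:  60000 pods at ~800 pods per 10s  ->  ~750s  ≈ 12.5m  (observed: ~11m)
  5k-node: 150000 pods at ~300 pods per 10s  ->  ~5000s ≈ 83m    (observed: 1h5m-1h28m)

That is, ~2.5x the pods at ~1/2.5 the throughput gives a saturation phase roughly 6x longer, which is consistent with ~11m vs ~1h5m.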

If the above is correct, the only part that remains unexplained is why we saw the increase from 1h5m to 1h28m. A regression?

@wojtek-t
Member

wojtek-t commented Dec 4, 2017

Wojtek: IIRC it was constant at ~300 pods per 10s for almost the whole time. Hmm... it's strange that the throughput was affected by the number of pods already scheduled in the system, especially given that we're not even using any affinity-related features.

That isn't strange. There are a bunch of things in the scheduler that still depend on the number of scheduled pods; one of them is the spreading priority.

Scheduling computation takes time linear in the number of nodes (which IMO sounds reasonable). This explains why throughput in the 2k-node test (~800 pods per 10s) is roughly 2.5 times that of the 5k-node test (~300). @bsalamat - could you confirm whether this is true?

That isn't fully true - see above.
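
For intuition on the spreading priority: it scores nodes by counting already-scheduled pods that belong to the same controller as the incoming pod, so its cost grows with the number of pods in the cluster even when no affinity features are used. A very rough sketch of that shape of computation (hypothetical names, not the actual kube-scheduler code):

package sketch

// Pod is a heavily simplified stand-in for the scheduler's pod representation.
type Pod struct {
	NodeName      string
	ControllerRef string // owning RC/RS, used for selector spreading
}

// spreadingScores counts, per node, how many already-scheduled pods are owned
// by the same controller as the incoming pod. Note the loop over all scheduled
// pods: this is why the cost grows with the number of pods in the system.
func spreadingScores(incoming Pod, scheduled []Pod) map[string]int {
	matchesPerNode := map[string]int{}
	for _, p := range scheduled {
		if p.ControllerRef == incoming.ControllerRef {
			matchesPerNode[p.NodeName]++
		}
	}
	return matchesPerNode // fewer matches on a node => better spreading score for it
}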

@porridge
Member

porridge commented Dec 4, 2017

Obligatory pretty graph. We can clearly see that the current best is worse than the old worst...

chart

@porridge
Member

porridge commented Dec 4, 2017

FWIW, there is a slight decrease in throughput as pods get scheduled, but that rate of decline seems similar across runs and is negligible compared to the drop we're seeing between runs 52 and 67.

time

@porridge
Member

porridge commented Dec 4, 2017

Since it matches the mean of the samples in the graphs above pretty well, I took a look at the throughput reported in the "Throughput (pods/s) during cluster saturation phase" message in past runs, and it seems there were several regressions that took it from the ~38 pods/s level to the ~28 pods/s level, not a single one. However, I think we should first focus on what happened between runs 64 and 66. Let me see if I can reproduce this on a 100-node cluster first.

throughput-run

throughput-time

@porridge
Member

porridge commented Dec 4, 2017

Unfortunately, on a 100-node cluster the throughput has almost always been exactly 18.75 pods/s since the beginning of October - I guess it's bound by something other than the scheduler.

@porridge
Member

porridge commented Dec 4, 2017

Interestingly, there is also no recent throughput performance drop in the kubemark-5000 runs:

kubemark5k

@porridge
Member

porridge commented Dec 4, 2017

And the data for the 2k-node tests is all over the place:
large

@wojtek-t
Member

wojtek-t commented Dec 4, 2017

In small clusters, it's limited by QPS limits.

@porridge
Member

porridge commented Dec 4, 2017

Looks like the way to go ahead is to bisect on a 5k-node cluster :-/

@wojtek-t
Member

wojtek-t commented Dec 4, 2017

Or bump the QPS limits in a 100-node cluster locally to do the bisection.

@porridge
Member

porridge commented Dec 4, 2017

Unfortunately, I'm not able to make the scheduler the bottleneck in a 100-node test:

Dec  4 14:33:22.040: INFO: Created replication controller with name: density3000-0-b1afb355-d8f7-11e7-a13a-ecb1d7404c25, namespace: e2e-tests-density-30-1-xn862, replica count: 3000
I1204 14:33:22.040446   78313 runners.go:178] Created replication controller with name: density3000-0-b1afb355-d8f7-11e7-a13a-ecb1d7404c25, namespace: e2e-tests-density-30-1-xn862, replica count: 3000
Dec  4 14:33:31.486: INFO: Density Pods: 1018 out of 3000 created, 798 running, 219 pending, 1 waiting, 0 inactive, 0 terminating, 0 unknown, 0 runningButNotReady 
Dec  4 14:33:41.486: INFO: Density Pods: 2016 out of 3000 created, 1573 running, 442 pending, 1 waiting, 0 inactive, 0 terminating, 0 unknown, 0 runningButNotReady 
Dec  4 14:33:51.487: INFO: Density Pods: 3000 out of 3000 created, 2454 running, 546 pending, 0 waiting, 0 inactive, 0 terminating, 0 unknown, 0 runningButNotReady 
Dec  4 14:34:01.488: INFO: Density Pods: 3000 out of 3000 created, 3000 running, 0 pending, 0 waiting, 0 inactive, 0 terminating, 0 unknown, 0 runningButNotReady 
Dec  4 14:34:02.049: INFO: E2E startup time for 3000 pods: 40.564111831s
Dec  4 14:34:02.049: INFO: Throughput (pods/s) during cluster saturation phase: 75
STEP: Scheduling additional Pods to measure startup latencies

even when using the *_ARGS settings from gce-scale-performance.

@wojtek-t
Member

wojtek-t commented Dec 4, 2017

You can try bumping it even further, though I'm not sure whether that might cause some other issue.

@bsalamat
Member

bsalamat commented Dec 4, 2017

Thanks, guys, for looking into this. IIUC, the scheduler's throughput is at least 75 pods/s in a 100-node cluster. With this throughput we can achieve our SLO of 1000 pods in 15 seconds. That said, I think we should continue our investigation and make the scheduler faster if possible.
One potential problem I found while looking into the scheduler's code several weeks ago is that we keep running all predicates for a node even when some of them have already failed. I am specifically referring to this loop. I think we should break out of the loop as soon as fit is false.
@davidopp believes that we continue running all predicates so that we can report every predicate that failed for the node. I think it would be fine to report just the first predicate that failed instead of all of them. An optimization like this will probably not change performance numbers much for scheduling simple pods, but it should help for pods with more advanced scheduling requirements, particularly affinity/anti-affinity, and the effect will be even more visible in large clusters.

We are in the process of adding an ordering for the execution of predicates, so that the predicates with the least computational overhead are executed first; if one of them fails, we won't run any of the others. A rough sketch of both ideas follows.
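
A minimal sketch of the two ideas above (stopping at the first failed predicate, and running the cheaper predicates first); the names here are hypothetical and this is not the actual scheduler API:

package sketch

// predicate is a simplified fit check with a rough relative cost used for ordering.
type predicate struct {
	name string
	cost int // cheaper predicates should run first
	fits func(pod, node string) bool
}

// podFitsOnNode walks the predicates in increasing-cost order and returns as
// soon as one fails, reporting only that first failing predicate rather than
// evaluating all of them.
func podFitsOnNode(pod, node string, ordered []predicate) (fits bool, failed string) {
	for _, p := range ordered { // assumed to be pre-sorted by cost
		if !p.fits(pod, node) {
			return false, p.name // break out as soon as fit is false
		}
	}
	return true, ""
}

With this shape, an expensive check such as inter-pod affinity would only run for nodes that have already passed the cheap resource and port checks.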

@porridge
Member

porridge commented Dec 5, 2017 via email

@wojtek-t
Member

wojtek-t commented Dec 5, 2017

@bsalamat - as @porridge wrote above, here we are not thinking about how to improve the scheduler in general. This issue is about the regressions that happened, and we need to investigate them.

@porridge
Member

porridge commented Dec 5, 2017

Still no luck with the following. Looks like I'll need to bring up a full-sized cluster after all:

export SCHEDULER_TEST_ARGS='--kube-api-qps=100'
export CONTROLLER_MANAGER_TEST_ARGS='--kube-api-qps=100 --kube-api-burst=100'
export APISERVER_TEST_ARGS='--max-requests-inflight=9000 --max-mutating-requests-inflight=4000'

@shyamjvs
Member Author

shyamjvs commented Dec 5, 2017 via email

@porridge
Member

porridge commented Dec 5, 2017 via email

@porridge
Member

porridge commented Dec 5, 2017 via email

@porridge
Copy link
Member

@wojtek-t here is how all the RCs grew over time:
growth

@bsalamat
Member

@bsalamat didn't #56714 (comment) answer your question?

@porridge It kind of does, but the fact that throughput is higher in larger clusters does not make sense.

@porridge
Member

@bsalamat that in turn was answered in #56714 (comment) :-)

@porridge
Member

porridge commented Dec 14, 2017

Oops, it seems that the latency graphs for the manual test in #56714 (comment) and #56714 (comment) were off by an hour, because the test output for the manual run was in CET while the master-side logs were in UTC. So I actually graphed latency for the period when the test was already over and the RCs were being deleted.

The pod creations happening in that period are still surprising to me, and perhaps should be investigated, but are likely irrelevant to the question of low scheduler throughput which I'm debugging here.

@porridge
Member

The binding submission request latency for the correct time period of the manual run looks much more in line with the graph from the automatic run:

[binding request latency graph, manual run]

@bsalamat
Member

@bsalamat that in turn was answered in #56714 (comment) :-)

I saw that too, but it means we still don't know the actual throughput of the scheduler itself in a 100-node cluster without rate limiting and possibly other restrictions.

@bsalamat
Member

In other words, I would like to know whether we can count on at least the 75 pods/s that you had seen in 100-node clusters.

@wojtek-t
Member

In other words, I would like to know whether we can count on at least the 75 pods/s that you had seen in 100-node clusters.

You're asking purely about the scheduler, right? I'm not sure whether the system as a whole would be able to handle that throughput (if we don't increase the master size or something like that). We would need to check.

But if we focus just on the scheduler, then in most cases 75 should be easily achievable. However, if we have many pods with pod affinity/anti-affinity, I think this would no longer be the case.

@bsalamat
Member

Thanks, @wojtek-t! Yes, I am asking about the scheduler and assuming that the API server handles the associated load. I am aware that affinity/anti-affinity can drop performance a lot, but we are interested in the scheduler's performance when scheduling simple pods, i.e., ones with no fancy spec such as affinity/anti-affinity.

@porridge
Member

porridge commented Dec 18, 2017

Adding a graph of kube-scheduler CPU usage across gce-scale-performance runs. There is a major difference between runs, and it DOES seem to be correlated with the drop in throughput that we're seeing in #56714 (comment).

[kube-scheduler CPU usage graph across runs]

@porridge
Member

I ran another test today (blue) with the same commit b262585, on the same cluster I used in #56714 (comment), and obtained a very similar result to the previous one (red).

One interesting thing is how the throughput visibly improved in the last quarter:

chart 1.

Whatever changed at that point made a big difference.

I checked the CPU usage of kube-scheduler with htop a few times during the test, and it was between 7.2 and 8.2 cores. Unfortunately, I did not look during that last part.

@porridge
Member

I shut down the master, changed it to use Intel Ivy Bridge rather than Intel Broadwell (which that cluster's master had been using so far), and re-ran the test. Now I see the throughput observed in the automatic test (34.3 pods/s).

chart 2

@porridge
Member

The strangeness continues: I changed the master back to Broadwell and repeated the test, and got the same high-throughput result as with Ivy Bridge. So clearly it's not variance in CPU type, but perhaps some cache that gets cleared when one of the components restarts.

chart 3

@porridge
Member

I bounced a couple of large things in the cluster: the kube-dns deployment and the fluentd-gcp-v2.0.10 daemonset, and ran the test again. This time I got low throughput:
chart 4

@porridge
Member

I ran another test, and the throughput was similarly low. Then, in the middle of the run, I restarted kube-scheduler; the throughput did not increase. Then I tried to similarly restart kube-apiserver, but that was not a good decision: with ~100k pods running, ~50k pods still waiting to be scheduled, and ~70 pending, the load on the API server is so high that it is not able to recover. Most requests are rejected with a 429, and the health checks done by the master's kubelet fail, causing a crashloop. Bumping the grace period of the health check to several minutes is not enough.

@porridge
Member

Summary of findings so far:

Given:

  • the small amount of progress so far despite a relatively large time investment,
  • the relatively small effect on performance (a 10-20% drop in scheduler throughput, which is not normally the bottleneck except in the largest clusters, and which does not put us outside our SLOs),
  • the fact that my test cluster is blocking other users of the testing GCP project,

I decided to stop here and un-assign this bug for now. If anyone has ideas on how to approach it, I might come back to this.

@porridge
Member

/unassign

@misterikkit

/cc @misterikkit

@dhilipkumars

cc: @deepak-vij & @shivramsrivastava

@fejta-bot

Issues go stale after 90d of inactivity.
Mark the issue as fresh with /remove-lifecycle stale.
Stale issues rot after an additional 30d of inactivity and eventually close.

If this issue is safe to close now please do so with /close.

Send feedback to sig-testing, kubernetes/test-infra and/or fejta.
/lifecycle stale

@k8s-ci-robot added the lifecycle/stale label May 3, 2018
@fejta-bot

Stale issues rot after 30d of inactivity.
Mark the issue as fresh with /remove-lifecycle rotten.
Rotten issues close after an additional 30d of inactivity.

If this issue is safe to close now please do so with /close.

Send feedback to sig-testing, kubernetes/test-infra and/or fejta.
/lifecycle rotten
/remove-lifecycle stale

@k8s-ci-robot added the lifecycle/rotten label and removed the lifecycle/stale label Jun 2, 2018
@fejta-bot

Rotten issues close after 30d of inactivity.
Reopen the issue with /reopen.
Mark the issue as fresh with /remove-lifecycle rotten.

Send feedback to sig-testing, kubernetes/test-infra and/or fejta.
/close
