
Seeing a decrease in scheduler throughput #56714

Closed
shyamjvs opened this issue Dec 1, 2017 · 66 comments
Assignees
Labels
kind/bug: Categorizes issue or PR as related to a bug.
lifecycle/rotten: Denotes an issue or PR that has aged beyond stale and will be auto-closed.
sig/scalability: Categorizes an issue or PR as relevant to SIG Scalability.
sig/scheduling: Categorizes an issue or PR as relevant to SIG Scheduling.

Comments

@shyamjvs
Member

shyamjvs commented Dec 1, 2017

Following up on an offline discussion.
I'm noticing that the scheduler throughput of our latest successful 5k-node run seems to be ~300 pods per 10s:

I1122 21:22:07.066] Nov 22 21:22:07.065: INFO: Density Pods: 150000 out of 150000 created, 147773 running, 49 pending, 2178 waiting, 0 inactive, 0 terminating, 0 unknown, 0 runningButNotReady 
I1122 21:22:17.020] Nov 22 21:22:17.020: INFO: Density Pods: 150000 out of 150000 created, 148027 running, 53 pending, 1920 waiting, 0 inactive, 0 terminating, 0 unknown, 0 runningButNotReady 
I1122 21:22:26.990] Nov 22 21:22:26.989: INFO: Density Pods: 150000 out of 150000 created, 148310 running, 45 pending, 1645 waiting, 0 inactive, 0 terminating, 0 unknown, 0 runningButNotReady 
I1122 21:22:37.017] Nov 22 21:22:37.017: INFO: Density Pods: 150000 out of 150000 created, 148566 running, 53 pending, 1381 waiting, 0 inactive, 0 terminating, 0 unknown, 0 runningButNotReady 
I1122 21:22:46.998] Nov 22 21:22:46.997: INFO: Density Pods: 150000 out of 150000 created, 148843 running, 56 pending, 1101 waiting, 0 inactive, 0 terminating, 0 unknown, 0 runningButNotReady
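
For context, the ~300 pods per 10s figure roughly matches the deltas of the "running" count between consecutive ~10s samples in these lines. A minimal sketch of that computation, with the counts copied from the log above:

package main

import "fmt"

func main() {
	// "running" counts from the consecutive Density Pods lines above, ~10s apart.
	running := []int{147773, 148027, 148310, 148566, 148843}
	for i := 1; i < len(running); i++ {
		fmt.Printf("interval %d: %d pods started per ~10s\n", i, running[i]-running[i-1])
	}
	// prints 254, 283, 256, 277 -- i.e. roughly 250-300 pods per 10s at this point of the run
}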

IIRC it was considerably higher before. Also,

E2E startup time for 150000 pods: 1h28m53.009192542s

It's odd that it's so much higher than in the 2k-node test:

E2E startup time for 60000 pods: 11m10.450099342s

/assign @porridge

cc @kubernetes/sig-scheduling-bugs @kubernetes/sig-scalability-bugs @bsalamat @wojtek-t

@k8s-ci-robot added the sig/scheduling, kind/bug, and sig/scalability labels Dec 1, 2017
@shyamjvs
Member Author

shyamjvs commented Dec 1, 2017

Indeed it seems like it decreased. Earlier it was:

E2E startup time for 150000 pods: 1h4m31.349005395s (in run 52)
E2E startup time for 150000 pods: 1h5m11.387550071s (in run 54)

It's roughly 1.4 times slower now.

@bsalamat
Member

bsalamat commented Dec 1, 2017

Are alpha features enabled in these tests? The scheduler uses a new alpha scheduling queue in 1.9; that might be causing the slowdown.

@shyamjvs
Member Author

shyamjvs commented Dec 1, 2017

@bsalamat Are those features turned on by default? Those jobs mostly follow the default kube-up configs; the only flags we're overriding are these.

@bsalamat
Member

bsalamat commented Dec 1, 2017

@shyamjvs No, the new queue is off by default and the overrides that you linked do not enable it.

@wojtek-t
Member

wojtek-t commented Dec 3, 2017

In the past, the throughput was significantly higher initially and dropped as the number of scheduled pods in the system grew. Is that still the case, or is the throughput roughly constant over time now?

@shyamjvs
Member Author

shyamjvs commented Dec 4, 2017

Wojtek: IIRC it was constant at ~300 pods per 10s for almost the whole time. Hmm... it's strange that the throughput was affected by the number of pods already scheduled in the system, especially given that we're not even using any affinity-related features.

After thinking a bit, I have the following hypothesis:

  • Throughput in small clusters is bottlenecked by QPS limits, while in large clusters it is bottlenecked by scheduling computation. This explains why we're scheduling far fewer than 1000 pods per 10s (even though 100 QPS would allow for it).
  • Scheduling computation takes time linear in the number of nodes (which IMO sounds reasonable). This explains why throughput in the 2k-node test (~800 pods per 10s) is roughly 2.5 times that of the 5k-node test (~300). @bsalamat - could you confirm whether this is true?
  • The overall E2E pod-startup times I mentioned above (effectively the overall scheduling time, since scheduling seemed to be the bottleneck anyway) differ so much between the 2k-node test (11m) and the 5k-node test (1h5m) because one factor of ~2.5 comes from scheduling speed and another factor of ~2.5 from the number of pods we're creating; see the rough check below.
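
A rough back-of-the-envelope check of these factors, using the approximate numbers quoted in this thread (so the figures are only indicative):

  2k-node:  60000 pods at ~800 pods per 10s  ->  ~750s  ≈ 12.5m  (observed: ~11m)
  5k-node: 150000 pods at ~300 pods per 10s  ->  ~5000s ≈ 83m    (observed: 1h5m-1h28m)

That is, ~2.5x the pods at ~1/2.5 the throughput gives a saturation phase roughly 6x longer, which is consistent with ~11m vs ~1h5m.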

If the above is correct, the only part that remains unexplained is why we saw the increase from 1h5m to 1h28m. A regression?

@wojtek-t
Member

wojtek-t commented Dec 4, 2017

Wojtek: IIRC it was constant at ~300 pods per 10s for almost the whole time. Hmm... it's strange that the throughput was affected by the number of pods already scheduled in the system, especially given that we're not even using any affinity-related features.

That isn't strange. There are a bunch of things in the scheduler that still depend on the number of scheduled pods; one of them is the spreading priority.

Scheduling computation takes time linear in the number of nodes (which IMO sounds reasonable). This explains why throughput in the 2k-node test (~800 pods per 10s) is roughly 2.5 times that of the 5k-node test (~300). @bsalamat - could you confirm whether this is true?

That isn't fully true - see above.
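
For intuition on the spreading priority: it scores nodes by counting already-scheduled pods that belong to the same controller as the incoming pod, so its cost grows with the number of pods in the cluster even when no affinity features are used. A very rough sketch of that shape of computation (hypothetical names, not the actual kube-scheduler code):

package sketch

// Pod is a heavily simplified stand-in for the scheduler's pod representation.
type Pod struct {
	NodeName      string
	ControllerRef string // owning RC/RS, used for selector spreading
}

// spreadingScores counts, per node, how many already-scheduled pods are owned
// by the same controller as the incoming pod. Note the loop over all scheduled
// pods: this is why the cost grows with the number of pods in the system.
func spreadingScores(incoming Pod, scheduled []Pod) map[string]int {
	matchesPerNode := map[string]int{}
	for _, p := range scheduled {
		if p.ControllerRef == incoming.ControllerRef {
			matchesPerNode[p.NodeName]++
		}
	}
	return matchesPerNode // fewer matches on a node => better spreading score for it
}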

@porridge
Member

porridge commented Dec 4, 2017

Obligatory pretty graph. We can clearly see that the current best is worse than the old worst...

chart

@porridge
Member

porridge commented Dec 4, 2017

FWIW, there is a slight decrease in throughput as pods get scheduled, but that rate of decline seems similar across runs and is negligible compared to the drop we're seeing between runs 52 and 67.

time

@porridge
Member

porridge commented Dec 4, 2017

Since it matches the mean of the samples in the graphs above pretty well, I took a look at the throughput reported in the "Throughput (pods/s) during cluster saturation phase" message in past runs, and it seems there were several regressions that took it from the ~38 pods/s level to the ~28 pods/s level, not a single one. However, I think we should first focus on what happened between runs 64 and 66. Let me see if I can reproduce this on a 100-node cluster first.

throughput-run

throughput-time

@porridge
Member

porridge commented Dec 4, 2017

Unfortunately, on a 100-node cluster the throughput has almost always been exactly 18.75 pods/s since the beginning of October - I guess it's bound by something other than the scheduler.

@porridge
Member

porridge commented Dec 4, 2017

Interestingly, there is also no recent throughput performance drop in the kubemark-5000 runs:

kubemark5k

@porridge
Member

porridge commented Dec 4, 2017

And the data for the 2k-node tests is all over the place:
large

@wojtek-t
Member

wojtek-t commented Dec 4, 2017

In small clusters, it's limited by QPS limits.

@porridge
Member

porridge commented Dec 4, 2017

Looks like the way to go ahead is to bisect on a 5k-node cluster :-/

@wojtek-t
Member

wojtek-t commented Dec 4, 2017

Or bump the QPS limits in a 100-node cluster locally to do the bisection.

@porridge
Member

porridge commented Dec 4, 2017

Unfortunately, I'm not able to make the scheduler the bottleneck in a 100-node test:

Dec  4 14:33:22.040: INFO: Created replication controller with name: density3000-0-b1afb355-d8f7-11e7-a13a-ecb1d7404c25, namespace: e2e-tests-density-30-1-xn862, replica count: 3000
I1204 14:33:22.040446   78313 runners.go:178] Created replication controller with name: density3000-0-b1afb355-d8f7-11e7-a13a-ecb1d7404c25, namespace: e2e-tests-density-30-1-xn862, replica count: 3000
Dec  4 14:33:31.486: INFO: Density Pods: 1018 out of 3000 created, 798 running, 219 pending, 1 waiting, 0 inactive, 0 terminating, 0 unknown, 0 runningButNotReady 
Dec  4 14:33:41.486: INFO: Density Pods: 2016 out of 3000 created, 1573 running, 442 pending, 1 waiting, 0 inactive, 0 terminating, 0 unknown, 0 runningButNotReady 
Dec  4 14:33:51.487: INFO: Density Pods: 3000 out of 3000 created, 2454 running, 546 pending, 0 waiting, 0 inactive, 0 terminating, 0 unknown, 0 runningButNotReady 
Dec  4 14:34:01.488: INFO: Density Pods: 3000 out of 3000 created, 3000 running, 0 pending, 0 waiting, 0 inactive, 0 terminating, 0 unknown, 0 runningButNotReady 
Dec  4 14:34:02.049: INFO: E2E startup time for 3000 pods: 40.564111831s
Dec  4 14:34:02.049: INFO: Throughput (pods/s) during cluster saturation phase: 75
STEP: Scheduling additional Pods to measure startup latencies

even when using the *_ARGS settings from gce-scale-performance.

@wojtek-t
Member

wojtek-t commented Dec 4, 2017

You can try bumping it even further, though I'm not sure whether that might cause some other issue.

@bsalamat
Member

bsalamat commented Dec 4, 2017

Thanks, guys, for looking into this. IIUC, the scheduler's throughput is at least 75 pods/s in a 100-node cluster. With this throughput we can achieve our SLO of 1000 pods in 15 seconds. That said, I think we should continue our investigation and make the scheduler faster if possible.
One potential problem I found while looking into the scheduler's code several weeks ago is that we keep running all predicates for a node even when some of them have already failed. I am specifically referring to this loop. I think we should break out of the loop as soon as fit is false.
@davidopp believes that we continue running all predicates so that we can report every predicate that failed for the node. I think it would be fine to report just the first predicate that failed instead of all of them. An optimization like this will probably not change performance numbers much for scheduling simple pods, but it should help for pods with more advanced scheduling requirements, particularly affinity/anti-affinity, and the effect will be even more visible in large clusters.

We are in the process of adding an ordering for the execution of predicates, so that the predicates with the least computational overhead are executed first; if one of them fails, we won't run any of the others. A rough sketch of both ideas follows.
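
A minimal sketch of the two ideas above (stopping at the first failed predicate, and running the cheaper predicates first); the names here are hypothetical and this is not the actual scheduler API:

package sketch

// predicate is a simplified fit check with a rough relative cost used for ordering.
type predicate struct {
	name string
	cost int // cheaper predicates should run first
	fits func(pod, node string) bool
}

// podFitsOnNode walks the predicates in increasing-cost order and returns as
// soon as one fails, reporting only that first failing predicate rather than
// evaluating all of them.
func podFitsOnNode(pod, node string, ordered []predicate) (fits bool, failed string) {
	for _, p := range ordered { // assumed to be pre-sorted by cost
		if !p.fits(pod, node) {
			return false, p.name // break out as soon as fit is false
		}
	}
	return true, ""
}

With this shape, an expensive check such as inter-pod affinity would only run for nodes that have already passed the cheap resource and port checks.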

@porridge
Member

porridge commented Dec 5, 2017 via email

@wojtek-t
Member

wojtek-t commented Dec 5, 2017

@bsalamat - as @porridge wrote above, here we are not thinking about how to improve the scheduler in general. This issue is about the regressions that happened, and we need to investigate them.

@porridge
Member

porridge commented Dec 5, 2017

Still no luck with the following. Looks like I'll need to bring up a full-sized cluster after all:

export SCHEDULER_TEST_ARGS='--kube-api-qps=100'
export CONTROLLER_MANAGER_TEST_ARGS='--kube-api-qps=100 --kube-api-burst=100'
export APISERVER_TEST_ARGS='--max-requests-inflight=9000 --max-mutating-requests-inflight=4000'

@shyamjvs
Member Author

shyamjvs commented Dec 5, 2017 via email

@porridge
Member

porridge commented Dec 5, 2017 via email

@porridge
Member

porridge commented Dec 5, 2017 via email

@porridge
Copy link
Member

@wojtek-t here is how all the RCs grew over time:
growth

@bsalamat
Member

@bsalamat didn't #56714 (comment) answer your question?

@porridge It kind of does, but the fact that throughput is higher in larger clusters does not make sense.

@porridge
Member

@bsalamat that in turn was answered in #56714 (comment) :-)

@porridge
Member

porridge commented Dec 14, 2017

Oops, it seems that the latency graphs for the manual test in #56714 (comment) and #56714 (comment) were off by an hour, because the test output for the manual run was in CET while the master-side logs were in UTC. So I actually graphed latency for the period when the test was already over and the RCs were being deleted.

The pod creations happening in that period are still surprising to me, and perhaps should be investigated, but are likely irrelevant to the question of low scheduler throughput which I'm debugging here.

@porridge
Member

The binding submission request latency for the correct time period of the manual run looks much more in line with the graph from the automatic run:

[binding request latency graph, manual run]

@bsalamat
Member

@bsalamat that in turn was answered in #56714 (comment) :-)

I saw that too, but it means we still don't know the actual throughput of the scheduler itself in a 100-node cluster without rate limiting and possibly other restrictions.

@bsalamat
Member

In other words, I would like to know whether we can count on at least the 75 pods/s that you had seen in 100-node clusters.

@wojtek-t
Member

In other words, I would like to know whether we can count on at least the 75 pods/s that you had seen in 100-node clusters.

You're asking purely about the scheduler, right? I'm not sure whether the system as a whole would be able to handle that throughput (if we don't increase the master size or something like that). We would need to check.

But if we focus just on the scheduler, then in most cases 75 should be easily achievable. However, if we have many pods with pod affinity/anti-affinity, I think this would no longer be the case.

@bsalamat
Member

Thanks, @wojtek-t! Yes, I am asking about the scheduler and assuming that the API server handles the associated load. I am aware that affinity/anti-affinity can drop performance a lot, but we are interested in the scheduler's performance when scheduling simple pods, i.e., ones with no fancy spec such as affinity/anti-affinity.

@porridge
Member

porridge commented Dec 18, 2017

Adding a graph of kube-scheduler CPU usage across gce-scale-performance runs. There is a major difference between runs, and it DOES seem to be correlated with the drop in throughput that we're seeing in #56714 (comment).

[kube-scheduler CPU usage graph across runs]

@porridge
Member

I ran another test today (blue) with the same commit b262585, on the same cluster I used in #56714 (comment), and obtained a very similar result to the previous one (red).

One interesting thing is how the throughput visibly improved in the last quarter:

chart 1.

Whatever changed at that point made a big difference.

I checked the CPU usage of kube-scheduler with htop a few times during the test, and it was between 7.2 and 8.2 cores. Unfortunately, I did not look during that last part.

@porridge
Member

I shut down the master, changed it to use Intel Ivy Bridge rather than Intel Broadwell (which that cluster's master had been using so far), and re-ran the test. Now I see the throughput observed in the automatic test (34.3 pods/s).

chart 2

@porridge
Member

The strangeness continues: I changed the master back to Broadwell and repeated the test, and got the same high-throughput result as with Ivy Bridge. So clearly it's not variance in CPU type, but perhaps some cache that gets cleared when one of the components restarts.

chart 3

@porridge
Member

I bounced a couple of large things in the cluster: the kube-dns deployment and the fluentd-gcp-v2.0.10 daemonset, and ran the test again. This time I got low throughput:
chart 4

@porridge
Member

I ran another test, and the throughput was similarly low. Then, in the middle of the run, I restarted kube-scheduler; the throughput did not increase. Then I tried to similarly restart kube-apiserver, but that was not a good decision: with ~100k pods running, ~50k pods still waiting to be scheduled, and ~70 pending, the load on the API server is so high that it is not able to recover. Most requests are rejected with a 429, and the health checks done by the master's kubelet fail, causing a crashloop. Bumping the grace period of the health check to several minutes is not enough.

@porridge
Member

Summary of findings so far:

Given:

  • the small amount of progress so far despite a relatively large time investment,
  • the relatively small effect on performance (a 10-20% drop in scheduler throughput, which is not normally the bottleneck except in the largest clusters, and which does not put us outside our SLOs),
  • the fact that my test cluster is blocking other users of the testing GCP project,

I decided to stop here and un-assign this bug for now. If anyone has ideas on how to approach it, I might come back to this.

@porridge
Member

/unassign

@misterikkit

/cc @misterikkit

@dhilipkumars

cc: @deepak-vij & @shivramsrivastava

@fejta-bot

Issues go stale after 90d of inactivity.
Mark the issue as fresh with /remove-lifecycle stale.
Stale issues rot after an additional 30d of inactivity and eventually close.

If this issue is safe to close now please do so with /close.

Send feedback to sig-testing, kubernetes/test-infra and/or fejta.
/lifecycle stale

@k8s-ci-robot added the lifecycle/stale label May 3, 2018
@fejta-bot

Stale issues rot after 30d of inactivity.
Mark the issue as fresh with /remove-lifecycle rotten.
Rotten issues close after an additional 30d of inactivity.

If this issue is safe to close now please do so with /close.

Send feedback to sig-testing, kubernetes/test-infra and/or fejta.
/lifecycle rotten
/remove-lifecycle stale

@k8s-ci-robot added the lifecycle/rotten label and removed the lifecycle/stale label Jun 2, 2018
@fejta-bot

Rotten issues close after 30d of inactivity.
Reopen the issue with /reopen.
Mark the issue as fresh with /remove-lifecycle rotten.

Send feedback to sig-testing, kubernetes/test-infra and/or fejta.
/close
