How many and how big services should kubernetes scale to? #48938

Open
shyamjvs opened this Issue Jul 14, 2017 · 23 comments

@shyamjvs
Member

shyamjvs commented Jul 14, 2017

A placeholder issue for the discussion we had on the email thread just to have a proper record.
Here's a (somewhat) abridged copy of that discussion:


@gmarek :
I'd like to get your opinion on the number of Services that we want to support in the cluster and on how many Pods/Service we can have. Note that our goal is to support 150k Pods in a 5k Node cluster, and the product of the number of Services and max Pods/Service probably shouldn't be smaller than that.

@bowei :
+minhan, as he has first-hand experience with kube-proxy + services scalability.
It seems like 15k - 20k services would be a good stress-test max for a cluster of that size (that's around ~10 - 20 pods/service).
What is the number we are testing today? I recall that we are doing something already wrt the number of services.

@freehan :
Let us say each service has only one port. A service with 2 ports will almost double the number of iptables rules.
Based on the current design of kube-proxy, I do not think we can scale over 10k services without significant overhead.
I have seen a customer with 2k service ports run into resource-usage problems with kube-proxy.
I would say 10k services is enough for a stress test on a 5k node cluster.

@shyamjvs :
Of the 150k pods we start, 1/2 of the pods are from replicasets of size 5 (= 15k RSs), 1/4th from those of size 30 (= 1250 RSs) and 1/4th from those of size 250 (= 150 RSs). And we create one service for each RS, so that's 16.4k services. So from what Minhan says, if we halve the current no. of our services it should be enough.
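
For reference, a quick arithmetic check of that distribution (the constants come from the description above; the snippet itself is just an illustration, not part of the test code):

```go
package main

import "fmt"

func main() {
	const totalPods = 150000

	// Split described above: 1/2 of pods in RSs of 5, 1/4 in RSs of 30, 1/4 in RSs of 250.
	smallRSs := totalPods / 2 / 5   // 15000 ReplicaSets
	mediumRSs := totalPods / 4 / 30 // 1250 ReplicaSets
	largeRSs := totalPods / 4 / 250 // 150 ReplicaSets

	// One Service per ReplicaSet.
	fmt.Println(smallRSs + mediumRSs + largeRSs) // 16400, i.e. ~16.4k services
}
```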

@gmarek :
+1. We can group together 4 small RSs into one service.

@shyamjvs :
IIUC we need to also reduce the no. of service endpoints (basically no. of pods involved in services). Grouping RSs into one service will just reduce no. of services, but size of iptables, size of endpoints objects and no. of watch events on pods (that are part of services) will still be the same. WDYT?

@gmarek :
I don't see where Minhan mentioned number of endpoints. He wrote only about # of services.

@shyamjvs :
That's true. But he mentioned about multiple ports within services breaking things due to number of iptables rules. Multiple pods within a service would create a similar effect, no?
minhan@: Could you clarify if just decreasing no. of services is enough or we also need to decrease no. of endpoints?

@wojtek-t :
We have an entry in iptables for every single endpoint (I mean real endpoint, not endpoint object):
https://github.com/kubernetes/kubernetes/blob/master/pkg/proxy/iptables/proxier.go#L1436
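
For readers who haven't looked at the proxier: a rough sketch of the rule shape it renders - one load-balancing jump rule and one KUBE-SEP chain per endpoint under the service's KUBE-SVC chain. Chain names, flags and probabilities below are simplified illustrations, not the proxier's exact output (see the linked proxier.go for the real code):

```go
package main

import "fmt"

// sketchRules illustrates why iptables size grows with both service ports and
// endpoints: every endpoint behind a service port gets its own KUBE-SEP-* chain
// plus a jump rule in the service's KUBE-SVC-* chain. Not the proxier's code.
func sketchRules(svcChain string, endpointIPs []string, port int) []string {
	var rules []string
	n := len(endpointIPs)
	for i, ip := range endpointIPs {
		sep := fmt.Sprintf("KUBE-SEP-%d", i)
		// Probabilistic jump so traffic is spread across endpoints.
		rules = append(rules, fmt.Sprintf(
			"-A %s -m statistic --mode random --probability %.4f -j %s",
			svcChain, 1.0/float64(n-i), sep))
		// DNAT to this particular endpoint.
		rules = append(rules, fmt.Sprintf(
			"-A %s -p tcp -j DNAT --to-destination %s:%d", sep, ip, port))
	}
	return rules
}

func main() {
	endpoints := []string{"10.0.0.1", "10.0.0.2", "10.0.0.3"}
	for _, r := range sketchRules("KUBE-SVC-EXAMPLE", endpoints, 8080) {
		fmt.Println(r)
	}
}
```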

That said, if we have problems with the current setup as described by Shyam above, we should just remove some of the services instead of merging a few of them into a single one.

For now, I'm fine with cutting the number of services e.g. by 2 (i.e. create a service only for RSs with an even index or sth like that). That's just to make the test pass in large clusters.

However, we definitely need to understand what is happening there. It's obvious that kube-proxy will be using a lot of resources if we have all the services, but if we give it all of them it should still work (potentially slowly, but it should work). It shouldn't be in some crashlooping or non-working state, which, from what I've heard from Shyam, is the case now.
So if we reduce the number of services in the test, I would like this to be accompanied by an opened issue with:

  • understanding it as a 1.8 goal
  • and fixing it for some time in the future (potentially not in 1.8)

@gmarek :
There's also a wider question of how useful Pods that are not covered by a Service are. I guess they may be useful for batch workloads, though I'm not 100% sure. Having said that, I'd like to add one more 1.8 goal to the ones Wojtek wrote, i.e. figure out how to announce how many services we support and what that means (IIUC currently it means that our tests don't explode, but maybe we can do better ;).

Ping Bowei, Minhan - can you clarify a bit what you included in those 10k services? Those were plain 'Service' objects, or number of Pods (x number of ports per Pod) covered by some service?

@freehan :
Let's be precise: 10k service ports. Each service port has an entrance iptables rule.
From what I have observed, the number of service ports has a larger impact than the number of backends.
Last time I checked production data, the service with the highest number of backends had ~2000 endpoints. Also, considering the scalability goal of 5000 nodes, the test should at least include a service with 5000 backends. To be conservative, let us raise it up to 10k.
So, a few services with a high number of backends (10k, 5k, 2k, 1k) are needed. The rest can just reuse the existing distribution.
What do you think?
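
One way this proposal could be encoded in a test config, as a sketch only: the 10k/5k/2k/1k backend sizes come from the comment above, while the assumption of exactly one service per size (and the struct/field names) is mine:

```go
package main

import "fmt"

// serviceSpec describes one class of services in a hypothetical test config.
type serviceSpec struct {
	backends int // pods (endpoints) behind each service
	count    int // how many services of this size to create
}

func main() {
	// The proposed large-service "tail"; one service per size is an assumption.
	largeTail := []serviceSpec{
		{backends: 10000, count: 1},
		{backends: 5000, count: 1},
		{backends: 2000, count: 1},
		{backends: 1000, count: 1},
	}
	// "The rest just reuse the existing distribution", i.e. the ~16.4k
	// services of size 5/30/250 described earlier in the thread.
	total := 0
	for _, s := range largeTail {
		total += s.backends * s.count
	}
	fmt.Println("extra backends from the large tail:", total) // 18000
}
```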

@bowei :
As an aside -- do we have a master doc for scalability parameters?
I would like to put these values (and the reasoning) in a doc that is more accessible than this e-mail thread, so we don't lose context the next time someone discusses the params.

@davidopp :
+1, would be great to have a doc or spreadsheet that describes all the scalability tests and their parameters
Currently the choices are blog posts (not necessarily up to date) or reading the code of the tests.

@shyamjvs :
minhan@ - Thanks for the explanation. Interesting that the no. of service ports matters more than the no. of backends, as I was under the impression that the size of the iptables ruleset was the bottleneck. So does this mean that without reducing the no. of endpoints, but just by regrouping them into larger services, we should be able to pass the load test (we can test this hypothesis)?

@thockin :
I don't understand how that can be. The ruleset generates rules per backend and per port. Minhan - do you have something that shows a behavior difference of more backends vs more ports?

@freehan :
I meant that 1. rules in KUBE-SERVICES chains are evaluated much more often than KUBE-SEP/KUBE-SVC chains. 2. During an experiment, I got the impression that manipulating a chain with lots of rules (e.g. KUBE-SERVICES) takes noticeably longer than others.
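
A toy cost model of that observation may help (purely illustrative numbers, plus an assumed "match halfway down the chain" heuristic - this is not kube-proxy code): every packet walks the KUBE-SERVICES chain linearly across all service ports, but afterwards only the matched service's own KUBE-SVC/KUBE-SEP rules.

```go
package main

import "fmt"

// rulesEvaluatedPerPacket approximates how many iptables rules a packet
// traverses: a linear walk of KUBE-SERVICES (assume a match halfway down on
// average), then only the matched service's KUBE-SVC/KUBE-SEP rules.
func rulesEvaluatedPerPacket(servicePorts, endpointsPerService int) int {
	return servicePorts/2 + endpointsPerService
}

func main() {
	fmt.Println(rulesEvaluatedPerPacket(10000, 10)) // ~5010: service ports dominate
	fmt.Println(rulesEvaluatedPerPacket(1000, 250)) // ~750: backends matter less here
}
```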

@thockin :
Interesting. I assumed it was bounded mostly by net footprint, but I guess there are some paths that will linear-search through a single chain. We should maybe spend time figuring out what operations are bounded by what factor.

@wojtek-t :
Yeah - that's very interesting. It would be good to actually understand this.


@shyamjvs

Member Author

shyamjvs commented Jul 14, 2017

@shyamjvs changed the title from "How many and how big services should kube-proxy support?" to "How many and how big services should kubernetes scale to?" on Jul 14, 2017

@kubernetes kubernetes deleted a comment from k8s-github-robot Jul 14, 2017

k8s-github-robot pushed a commit that referenced this issue Jul 17, 2017

Kubernetes Submit Queue
Merge pull request #48908 from shyamjvs/reduce-services-loadtest
Automatic merge from submit-queue (batch tested with PRs 48991, 48908)

Group every two services into one in load test

Ref #48938

Following from discussion with @bowei and @freehan .
This reduces the #services to 8200 while keeping the no. of backends the same.

/cc @wojtek-t @gmarek
@bowei

Member

bowei commented Jul 18, 2017

@shyamjvs could you drop a link to the scale parameter doc here?

@shyamjvs

Member Author

shyamjvs commented Jul 18, 2017

It's currently under review. Here's the PR - kubernetes/community#811

@fejta-bot


fejta-bot commented Jan 1, 2018

Issues go stale after 90d of inactivity.
Mark the issue as fresh with /remove-lifecycle stale.
Stale issues rot after an additional 30d of inactivity and eventually close.

Prevent issues from auto-closing with an /lifecycle frozen comment.

If this issue is safe to close now please do so with /close.

Send feedback to sig-testing, kubernetes/test-infra and/or @fejta.
/lifecycle stale

@shyamjvs

Member Author

shyamjvs commented Jan 9, 2018

/remove-lifecycle stale
/lifecycle frozen

@shyamjvs

Member Author

shyamjvs commented Jan 10, 2018

@porridge FYI

@porridge

Member

porridge commented Jan 10, 2018

See also #58050 which tracks ability to handle updates to Endpoints in large clusters.

@wojtek-t

Member

wojtek-t commented Jan 10, 2018

> See also #58050 which tracks ability to handle updates to Endpoints in large clusters.
>
> quick update of (relatively) large services

Note that e.g. in kubemark-5000 (where we have 5000 nodes - from the apiserver's perspective it doesn't really matter whether they're fake or not) we are creating services there.
And the number of updates there is higher than 4 from what I can tell.

The difference is that most of those services are pretty small.

And that works fine, see e.g.:
https://storage.googleapis.com/kubernetes-jenkins/logs/ci-kubernetes-kubemark-gce-scale/801/build-log.txt

@shyamjvs

Member Author

shyamjvs commented Jan 10, 2018

+1 to what @wojtek-t wrote - I mentioned that too offline regarding #58050.

@porridge

Member

porridge commented Jan 10, 2018

@wojtek-t

Member

wojtek-t commented Jan 10, 2018

It's worth taking a closer look, then - could it really be that the difference between a 250-member Endpoints (which is what I believe the kubemark job has) and a 400-member one (in #58050) is so large? Or perhaps the update rate is also smaller?

There are a couple of things that differ:

  • most of the services have O(5) endpoints; there is just a relatively small number of those with O(250)
  • we rely on the endpoint controller there (which is sending the appropriate Endpoints updates); that means it is allowed to batch updates to the same Endpoints object. So e.g. creating 5 new pods in a service may actually result in just a single Endpoints object update (I'm not saying it always does, but it at least can) - see the toy sketch below
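
A toy model of that batching effect, under the assumption that the controller coalesces all events arriving within one sync window into a single write (the window length and event spacing are made-up numbers, and this is not the real endpoints controller logic):

```go
package main

import (
	"fmt"
	"time"
)

// countUpdates models the batching described above: all events that land
// within `window` of the first unflushed event collapse into one Endpoints
// write. It is NOT the real endpoints controller logic.
func countUpdates(eventTimes []time.Duration, window time.Duration) int {
	if len(eventTimes) == 0 {
		return 0
	}
	updates := 1
	windowStart := eventTimes[0]
	for _, t := range eventTimes[1:] {
		if t-windowStart >= window {
			updates++
			windowStart = t
		}
	}
	return updates
}

func main() {
	// 5 pods of one service created 100ms apart.
	events := []time.Duration{
		0,
		100 * time.Millisecond,
		200 * time.Millisecond,
		300 * time.Millisecond,
		400 * time.Millisecond,
	}
	fmt.Println(countUpdates(events, time.Second))      // 1: all batched into one update
	fmt.Println(countUpdates(events, time.Millisecond)) // 5: effectively no batching
}
```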
@porridge

Member

porridge commented Jan 10, 2018

@wojtek-t

Member

wojtek-t commented Jan 10, 2018

Maybe the EndpointController in scalability tests has a bigger lag? And thus performs better batching?
[Just thinking out loud here]

@shyamjvs

Member Author

shyamjvs commented Jan 10, 2018

IIUC, even in the worst case where each update is made separately, we ideally would still have just 250 one-off updates (during those pod creations, which btw would also be spread out by our load test). But the one @porridge is pointing to created a consistent qps of 8.

@shyamjvs

Member Author

shyamjvs commented Jan 10, 2018

To clarify: I meant that updates in our load test are one-off (at least that's what I'd expect), while @porridge created a case where they were constantly flipping.

@porridge

Member

porridge commented Jan 10, 2018

@wojtek-t

Member

wojtek-t commented Jan 10, 2018

@shyamjvs - I think I'm not really following what you wrote above.
We are creating/deleting pods with a throughput of ~10 pods/s in our tests. That means that, without any batching, they are producing ~10 Endpoints updates per second (because each pod belongs to one of the services).

@smarterclayton

Contributor

smarterclayton commented Jan 10, 2018

@shyamjvs

Member Author

shyamjvs commented Jan 10, 2018

@wojtek-t Yes, but not all of them would be 250-sized updates. Assume those 10 updates are for endpoints of size 5 (i.e. our small services); then the total size of updates in that second would be 50 (which is quite small compared to, say, 2500). Now compare that with 8 qps of 400-sized updates (as @porridge mentioned in #48938 (comment)), which would be 3200 in total. FMU the 250-sized ones matter more, so I was talking about those.
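
Spelling that comparison out as a back-of-the-envelope sketch ("volume" here just means endpoint addresses carried per second, ignoring object metadata; the numbers are the ones quoted in the thread):

```go
package main

import "fmt"

func main() {
	// Load test: ~10 pod creations/s; if those pods belong to size-5 services,
	// each (unbatched) Endpoints update carries ~5 addresses.
	smallServiceVolume := 10 * 5 // 50 addresses/s

	// Only ~1/4 of pod creations hit the 250-sized services, so roughly
	// 10/4 = 2.5 updates/s there, each carrying up to 250 addresses.
	largeServiceVolume := 10.0 / 4 * 250 // 625 addresses/s at the high end

	// #58050 scenario: a steady 8 updates/s against a ~400-endpoint object.
	issueVolume := 8 * 400 // 3200 addresses/s

	fmt.Println(smallServiceVolume, largeServiceVolume, issueVolume)
}
```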

@wojtek-t

Member

wojtek-t commented Jan 10, 2018

@shyamjvs - ok, now I see what you meant. Yes, I generally agree with the above.
One thing to keep in mind is that an Endpoints object is not just the addresses of the endpoints; it also carries metadata. But in general yes - 8 x 400 may still be visibly more than what we have in our test.

@shyamjvs

Member Author

shyamjvs commented Jan 10, 2018

In short, I was trying to point out these 2 differences:

  • the 250-sized updates are spread out more than the 400-sized ones (for 250 it is 10/4 = 2.5 updates/s, as 1/4th of our pods belong to the 250 category, while for 400 it is 8 updates/s)
  • the 400-sized ones are continuously generated with an endpoints size of ~400, while the 250-sized ones are generated one-off (so the sizes of the updates would be 1, 2, 3, ..., 250 and that's it)
@thockin

Member

thockin commented Aug 22, 2018

So IPVS is GA now. We haven't adopted it for GKE yet, but we're looking at it. Would be nice to understand how it impacts perceived scalability here, and whether it is "good enough". There's an alt proposal for nftables and we know some folks are doing EBPF.

I am disinclined to go back to iptables mode and massively refactor the rulesets - the risk is just too high given the alternatives.

@wojtek-t

Member

wojtek-t commented Aug 23, 2018

@thockin

We have already started looking into that. There were some issues with running it, and now we need to get our tests back to green.
But I think you can expect some answer about its impact on scalability within the next month.
