
Increase maximum pods per node #23349

Open
jeremyeder opened this Issue Mar 22, 2016 · 39 comments

@jeremyeder
Member

jeremyeder commented Mar 22, 2016

As discussed on the sig-node call on March 22:

max-pods on kube-1.1 was 40; on kube-1.2 it is 110 pods per node.

We have use-cases expressed by customers for increased node vertical scalability. This is (generally) for environments using fewer, larger-capacity nodes, perhaps running lighter-weight pods.

For kube-1.3 we would like to discuss targeting a 100 node cluster running 500 pods per node. This will require coordination with @kubernetes/sig-scalability as it would increase the total pods-per-cluster.

/cc @kubernetes/sig-node @kubernetes/sig-scalability @dchen1107 @timothysc @derekwaynecarr @ncdc @smarterclayton @pmorie

Thoughts?
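For reference, the knob itself already exists; bumping it is just a matter of passing a larger value to the kubelet. The flag below is real, but the surrounding invocation is only a sketch:

```sh
# Raise the per-node pod limit above the 110 default (illustrative invocation;
# the rest of your kubelet flags depend on how the node is provisioned).
kubelet \
  --max-pods=500 \
  --kubeconfig=/var/lib/kubelet/kubeconfig
```

The open question is what a node needs to look like to actually sustain that.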

@spiffxp

Member

spiffxp commented Mar 22, 2016

How do we vet that a given node configuration is "qualified enough" to meet the 500 pods per node goal?

@yujuhong

Contributor

yujuhong commented Mar 22, 2016

As discussed in the meeting, using a single number (max pods) can be misleading for the users, given the huge variation in machine specs, workload, and environment. If we have a node benchmark, we can let users profile their nodes and decide what is the best configuration for them. The benchmark can exist as a node e2e test, or in the contrib repository.

@jeremyeder, you mentioned you've tried running more pods in a test environment. What's the machine spec and could you share the numbers?

@yujuhong yujuhong added the sig/node label Mar 22, 2016

@yujuhong

Contributor

yujuhong commented Mar 22, 2016

How do we vet that a given node configuration is "qualified enough" to meet the 500 pods per node goal?

That's where the benchmark can play an important role. The community can also share results on different platforms with each other using the standardized benchmark.

I'd suggest we look at:

  • management overhead in terms of resource usage
  • performance (responsiveness) in terms of latency for various operations (create/delete pod, etc.), or even detecting container changes

So far we only have very limited testing for the above (with non-realistic workloads), and Red Hat should be able to contribute more.
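As a concrete (if crude) example of the resource-usage side, something like the sketch below is what I have in mind: sample the daemons' CPU and RSS as the pod count grows. Process names and the sampling interval are assumptions, not part of any existing test:

```sh
#!/usr/bin/env bash
# Sample the node daemons' resource usage alongside the running pod count.
# Adjust the process names (kubelet, dockerd vs. docker) for your distro.
interval=10
while true; do
  ts=$(date +%s)
  kubelet_stats=$(ps -C kubelet -o rss=,pcpu= | awk '{r+=$1; c+=$2} END {print r, c}')
  docker_stats=$(ps -C dockerd -o rss=,pcpu= | awk '{r+=$1; c+=$2} END {print r, c}')
  pods=$(kubectl get pods --all-namespaces --no-headers 2>/dev/null | grep -c Running)
  echo "$ts pods=$pods kubelet(rss_kb,cpu%)=$kubelet_stats docker(rss_kb,cpu%)=$docker_stats"
  sleep "$interval"
done
```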

@yujuhong yujuhong added this to the next-candidate milestone Mar 22, 2016

@yujuhong

Contributor

yujuhong commented Mar 22, 2016

Let's target having a proper benchmark for the next release, so that we can decide how many pods to support.

@wojtek-t

Member

wojtek-t commented Mar 23, 2016

@jeremyeder - from the "cluster" perspective, what really matters is the total number of pods. So if you have fewer nodes in the cluster, you can put more pods on them without affecting apiserver, etcd, or controller performance. So that doesn't seem to be a problem given that you are talking about smaller deployments in terms of number of nodes.

Also, we are planning to increase the total number of pods in 1.3. The final number is not decided yet, but I hope it will be 100,000 or even more (pods per cluster).

@jeremyeder

Member

jeremyeder commented Mar 23, 2016

@wojtek-t understood, thank you -- that's basically what I was wondering: whether the pods-per-cluster limits would be increased during the 1.3 cycle.

@zhouhaibing089

Contributor

zhouhaibing089 commented Mar 23, 2016

@yujuhong I agree with using benchmark testing to decide the max pod number before deploying a cluster. :)

@dchen1107

Member

dchen1107 commented Mar 23, 2016

@jeremyeder Thanks for filing the issue so we can carry on the discussion.

Here is a small summary of what I talked about in the sig-node meeting, for the record:

  • max-pods is configurable. The default value we chose today targets users who want an out-of-the-box solution, and it is decided under several constraints:
    • Docker's performance and limitations. Docker has improved a lot since the 1.8 release, but there is still a lot of room to improve. Unfortunately, today Docker's management overhead depends heavily on the workloads on the node, which makes a single default hard to choose.
    • Kubelet's performance and management overhead. We improved this dramatically in the 1.2 release through several projects: PLEG, the new metrics API, cAdvisor cleanup, etc. We will continue improving it in each release.
    • Available resources on the node, including cpu, memory, ip, etc.
    • Cluster-level component limits (apiserver, scheduler, controller-manager, heapster, etc.), since you have to think about both num-of-nodes and num-of-pods-per-node.
    • Performance SLOs we made for the users
    • etc.
  • Based on my experience with Google's internal system, even though our node team improved node-level scalability and performance dramatically over time, in reality most (>99%) nodes in a shared cluster (shared meaning it hosts both service and batch jobs, both production and testing jobs, etc.) host no more than ~100 jobs due to other resource constraints.
  • On the other hand, I understand that certain users want to run simple jobs with limited management overhead. For example, I ran an experimental test against the 1.1 release on a big node: KNode (Kubelet and docker) can easily host > 200 do-nothing pods with reasonable performance. The 1.2 release would do a much better job if I re-ran the same experiment.
  • I have been suggesting publishing a node benchmark since the 1.1 release (#14754 etc.), and we are working toward that goal in each release:
    • Improving kubelet's overall performance and reducing its management overhead, for both the baseline and per-pod management (#16943, #22542, etc.)
    • Carefully measuring docker's performance stats under Kubelet's usage patterns (https://github.com/kubernetes/contrib/tree/master/docker-micro-benchmark)
    • Introduced a node-level resource usage tracking test to monitor the daemons' resource usage and performance stats
    • Introduced --kube-reserve and --system-reserve (#17201) to Kubelet so that the admin can configure the proportion of overall resources (cpu and memory) a node devotes to daemons, including Kubelet and docker along with other daemons. Today it is hardly configured properly due to the lack of a benchmark here. :-) (See the sketch at the end of this comment.)
    • etc.
    • In the 1.3 node team roadmap (proposed), dropping custom metrics and other unnecessary stats from Kubelet and providing a standalone cAdvisor are in line with this direction of reducing system daemon (e.g. Kubelet) overhead and improving predictability and manageability on a node.

cc/ @bgrant0607 since we talked about this before.
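For concreteness, the kind of reservation an admin could set once a benchmark tells them what the overhead actually is looks something like the sketch below (current kubelet releases spell the flags --kube-reserved/--system-reserved; the amounts are placeholders, not recommendations):

```sh
# Carve out resources for the Kubernetes daemons and for the rest of the
# system before deciding how many pods to admit. Values are placeholders.
kubelet \
  --max-pods=250 \
  --kube-reserved=cpu=500m,memory=1Gi \
  --system-reserved=cpu=500m,memory=1Gi
```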

@timothysc

Member

timothysc commented Mar 23, 2016

Introduced --kube-reserve and --system-reserve (#17201) to Kubelet so that the admin can configure the proportion of overall resources (cpu and memory) a node devotes to daemons, including Kubelet and docker along with other daemons. Today it is hardly configured properly due to the lack of a benchmark here. :-)

System reserve is arguably a systemd.slice provisioning constraint from our side, but +1 on reserve. Going forward, I think the only limits should be resource constraints. If needed, we could put pod limits in admission control.

/cc @quinton-hoole as he was conversing on the topic in the @kubernetes/sig-scalability meeting last week.

@jeremyeder

Member

jeremyeder commented Mar 24, 2016

Agree with all the comments about the benchmark. @dchen1107 perhaps we should file a new issue to deliver that, and leave this one for delivering the increase, should we be able to agree on something.

This test is (a rough sketch of the driver loop is at the end of this comment):

  1. sleep 60
  2. schedule 100 "hello-openshift" pods across 2 nodes https://github.com/openshift/origin/tree/master/examples/hello-openshift
  3. wait til they are all running
  4. sleep 60
  5. schedule 100 more
  6. loop up through 800 pods per node.
  7. sleep 60

[graph: stacked CPU usage of the node process and docker during the test]

[graph: stacked memory (RSS) usage during the test]

At 500 pods (somewhere around the 700-second mark), CPU usage is very reasonable (around 1 CPU core) for the node process and docker combined. This is the openshift-node process in this test, not strictly Kubernetes, but it is based on kubernetes v1.2.0-36-g4a3f9c5.

About the memory graphs: I did not restart the node service before running the test, so the numbers are about 300MB higher than they should be, but the growth/trend in RSS is accurate.
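For reference, the driver loop is roughly the following (assuming a replication controller named hello-openshift with a matching app=hello-openshift label already exists; batch size and targets are as described above):

```sh
#!/usr/bin/env bash
# Add pods in batches of 100 across 2 nodes, waiting for each batch to reach
# Running before the next one, up to 800 pods per node.
batch=100
nodes=2
per_node_target=800
sleep 60
for ((total=batch; total<=per_node_target*nodes; total+=batch)); do
  kubectl scale rc hello-openshift --replicas="$total"
  # Block until every requested pod reports Running.
  while [ "$(kubectl get pods -l app=hello-openshift --no-headers | grep -c Running)" -lt "$total" ]; do
    sleep 5
  done
  sleep 60   # let the node settle before the next batch
done
```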

@jeremyeder

Member

jeremyeder commented Mar 24, 2016

@dchen1107 as far as what a "benchmark" may look like...

Perhaps we generate a "node scaling score" out of factors like cpu_generation+core_count+GB_RAM+kube_version+other_factors. That score would set max-pods dynamically.

This way we don't have to inject a "test" into the admission control pipeline or product install paths; the node process could compute the score/max-pods dynamically during its startup phase.

Thoughts ?
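To make that concrete, a toy sketch of the startup-time computation (the per-core and per-GB weights are invented purely for illustration; cpu_generation, kube_version, and the other factors would feed in the same way):

```sh
#!/usr/bin/env bash
# Hypothetical "node scaling score" -> max-pods mapping, clamped to a sane range.
cores=$(nproc)
mem_gb=$(awk '/MemTotal/ {printf "%d", $2 / 1024 / 1024}' /proc/meminfo)
score=$(( cores * 10 + mem_gb * 5 ))           # made-up weights
max_pods=$(( score > 500 ? 500 : score ))      # respect cluster-wide limits
max_pods=$(( max_pods < 110 ? 110 : max_pods ))
echo "computed --max-pods=${max_pods} (cores=${cores}, mem=${mem_gb}GB)"
# The node process would then start the kubelet with --max-pods="${max_pods}".
```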

@vishh

Member

vishh commented Mar 24, 2016

+1 for dynamic limits. Those limits should also take into account kubelet's internal design though, specifically around latency.


@timstclair

timstclair commented Mar 24, 2016

Another factor we need to consider is probing. As #16943 (comment) shows, aggressive liveness/readiness probing can have a significant impact on performance. We may eventually need to figure out how to charge probe usage to the containers being probed.
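For a sense of what "aggressive" means here, a pod along these lines (image and port are placeholders) has the kubelet firing two probes every second; multiply that by a few hundred pods and the probing alone becomes visible in the node's CPU profile:

```sh
cat <<'EOF' | kubectl apply -f -
apiVersion: v1
kind: Pod
metadata:
  name: probe-heavy
spec:
  containers:
  - name: web
    image: nginx            # placeholder image
    livenessProbe:
      httpGet: {path: /, port: 80}
      periodSeconds: 1      # aggressive: probe every second
    readinessProbe:
      httpGet: {path: /, port: 80}
      periodSeconds: 1
EOF
```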

@vishh

Member

vishh commented Mar 24, 2016

@timstclair: Unless we can move to an exec model for all probing, it will be difficult to tackle charging.

@yujuhong

Contributor

yujuhong commented Mar 24, 2016

Another factor we need to consider is probing. As #16943 (comment) shows, aggressive liveness/readiness probing can have a significant impact on performance. We may eventually need to figure out how to charge probe usage to the containers being probed.

That's why I think a benchmark with realistic/customizable workloads is valuable. Users can benchmark their cluster and adjust if they want (e.g., determine the max pods allowed with 10% of dedicated resources).

@timothysc

Member

timothysc commented Mar 24, 2016

We also need to take disk resources into account going forward; right now that's a level of overhead that we haven't really captured.

@dchen1107

Member

dchen1107 commented Mar 24, 2016

@jeremyeder I can file a separate benchmark issue. Actually I was in the middle of filing one when I saw this issue and everyone had already jumped in to talk about the benchmark.

But on the other hand, I think publishing the benchmark can serve the purpose without continually increasing --max-pods per node. The node team signed up to:

  • work together with other teams and the community to define the performance SLOs
  • work together with the community to choose one or several representative workloads to generate a benchmark for node performance and scalability
  • continue improving performance
  • continue reducing the system overhead introduced by management
  • improve our node test suite to detect regressions in both performance and resource consumption

But as I mentioned in #23349 (comment), and as the examples listed by others above show, there are too many variables and a lot of them are out of our control. I don't think we are ready to have a formula that applies to everyone to dynamically figure out --max-pods.

Ideally, each cluster should easily be able to have its own formula with fudge factors based on its own environment and requirements, come up with the max-pods for its nodes, and apply that to the node config. If applying the value of max-pods to the node config object is too hard, we should solve that usability issue.

cc/ @davidopp @bgrant0607

@yujuhong

Contributor

yujuhong commented Mar 24, 2016

I think we all agreed that developing a node benchmark should be the next step. The benchmark will allow users to test their nodes and adjust the kubelet configuration (e.g., --max-pods) accordingly. They can also publish the results and share them with the community. The results can serve as a ballpark for users who just want some configuration to start with. In addition, published results will also help us discover issues on different platforms.

Some initial thoughts about what we want in the benchmark:

  1. user-observed latency (e.g., pod startup/deletion, time to restart a container that just died). We need to adhere to the k8s SLOs, and maybe add new kubelet SLOs on top of that.
  2. resource usage (e.g., cpu, memory, disk)

We should have diverse workloads (e.g., probing) and test the node in different scenarios (e.g., steady state vs. batch creation).

What we have now in our e2e tests is very limited. Red Hat and/or the community may want to chime in and share what they have. Anyway, below is what we use today (a rough sketch of the density measurement follows the list):

  • density test: starts up N pods, then starts an additional pod and tracks its latency.
  • resource usage test: tracks the cpu/memory usage of kubelet and docker on a steady node with N pods. The cpu usage collection in this test relies on cAdvisor, at a granularity of 10s by default.
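
A crude shell version of the density measurement, to make the description above concrete (pod name and image are placeholders):

```sh
#!/usr/bin/env bash
# With N pods already running on the node, time how long one more pod takes
# to go from creation to Running.
pod=density-probe
start=$(date +%s)
kubectl run "$pod" --image=gcr.io/google-containers/pause --restart=Never
until [ "$(kubectl get pod "$pod" -o jsonpath='{.status.phase}')" = "Running" ]; do
  sleep 1
done
echo "pod startup latency: $(( $(date +%s) - start ))s"
kubectl delete pod "$pod"
```
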
@bgrant0607

Member

bgrant0607 commented Mar 25, 2016

@timothysc Could you please clarify the following comment?

Going forward, I think the only limits should be resource constraints. If needed, we could put pod limits in admission control.

Which resource constraints? The containers'?

Each pod and container requires some amount of resources from the management agents (Kubelet, cadvisor, docker) and kernel. These resources can't be attributed to the cgroups of the containers (we've been working on such things for years internally). Depending on the resources allocated to the system and management agents, on the average number of containers per pod, on the rate of container churn, on the number of containers with probes, assumptions about container failure rates, etc., a different number of pods might be supportable, though there are also inherent limits in the management agents, since they are not perfectly scalable. These factors are complex and numerous.

Additionally, we currently allow best-effort pods to not reserve any fixed amount of resources. This is a deliberate choice. Admins may choose to impose a minimum resource request, but that's independent of the issues mentioned above.

@timothysc

Member

timothysc commented Mar 25, 2016

OBJECTIVE
Users would like to achieve much higher density per machine, with a large number of underutilized pods. The current --max-pods represents an artificial governor for machines that have ample resources available. Instead, admins would prefer to set some system reserve, as well as resource thresholds (watermarks) after which pods are no longer accepted.

Which resource constraints? The containers'?

I meant available machine resources that exist for the kubelet + container subtree "slice". At some point the machine simply passes an acceptable watermark and no further containers should be accepted. We are not aiming for optimal packing in this use case, so "fudging" the reserve and watermarks can be controlled by the administrator based on their load profiles and history. Admins would like to "set it and forget it".

In conversations on sig-node we all agree on the premise of not putting false limits in place, but at this point I believe we need to hash out the designs that were discussed and make it a reality in 1.3.
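A sketch of the kind of "set it and forget it" node configuration that implies, expressed with today's kubelet flags (the flag names are real; the values are arbitrary examples, not recommendations):

```sh
# Reserve a slice for the system and the Kubernetes daemons, and define
# watermarks past which the node sheds or refuses work instead of degrading.
kubelet \
  --system-reserved=cpu=1,memory=2Gi \
  --kube-reserved=cpu=1,memory=2Gi \
  --eviction-hard=memory.available<500Mi,nodefs.available<10% \
  --max-pods=500
```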

@bgrant0607

Member

bgrant0607 commented Mar 28, 2016

@timothysc Other considerations:

  • Kubelet, Docker, network and storage plugins, etc. are not perfectly scalable
  • Increased load on these agents impacts quality of service from these components
  • An explicit, predictable limit (as opposed to opaquely just denying requests at some point) helps schedulers make better decisions (e.g., avoiding resource stranding) and helps users understand placement decisions

I agree that accurate, simple, automatically set limits would be desirable.

@derekwaynecarr

Member

derekwaynecarr commented Mar 31, 2016

In my experience there is also a real difference between the time it takes to go from 99 to 100 running pods and the time it takes to go from 0 to 100 running pods. Right now I have been running a loop in a three-node cluster that creates a namespace with a single RC of 500 pods, and I wait for at least 200 of those pods to report back as running before terminating the namespace (I am trying to debug a stuck-terminating-pod flake that is hard to reproduce). It seems extremely obvious to me that we are less stable going from 0 to 100 pods running on a node than we are going from 99 to 100. (A sketch of the loop is below.)
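The loop, roughly (names, image, and timings are placeholders):

```sh
#!/usr/bin/env bash
# Create a namespace with a single 500-replica RC, wait for >= 200 Running
# pods, then delete the namespace (exercising bulk termination) and repeat.
while true; do
  ns="churn-$(date +%s)"
  kubectl create namespace "$ns"
  cat <<EOF | kubectl create -n "$ns" -f -
apiVersion: v1
kind: ReplicationController
metadata:
  name: churn
spec:
  replicas: 500
  selector:
    app: churn
  template:
    metadata:
      labels:
        app: churn
    spec:
      containers:
      - name: pause
        image: gcr.io/google-containers/pause
EOF
  until [ "$(kubectl get pods -n "$ns" --no-headers | grep -c Running)" -ge 200 ]; do
    sleep 5
  done
  kubectl delete namespace "$ns"
done
```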


@yujuhong

Contributor

yujuhong commented Mar 31, 2016

it seems extremely obvious to me that we are less stable going from 0 to 100 pods running on a node than we are going from 99 to 100.

@derekwaynecarr, do you mean that batch creation of 100 pods makes the node less stable than starting a single pod when 99 are already running? We don't limit the docker QPS in the kubelet, and creating/deleting pods are the heaviest operations for now. We discussed mitigating this before v1.2, but decided it wasn't necessary for that release. If we are going to claim support for higher pod density, we'd need to add tests and re-evaluate this.

@dchen1107

Member

dchen1107 commented Apr 4, 2016

cc/ @gmarek @wojtek-t This is the issue I mentioned to @gmarek earlier. For the 1.3 release, we plan to publish a node-level benchmark (#23349 (comment)). To do that, we need to define our performance SLOs at the node level.

@derekwaynecarr

Member

derekwaynecarr commented Apr 4, 2016

@yujuhong - yes. I think QPS is an important thing to keep in mind as we change this number. I was able to discover the pod-stuck-in-terminating problem by overwhelming the docker daemon with this scenario.

@dchen1107

Member

dchen1107 commented Jun 17, 2016

cc/ @coufon

@warmchang

Contributor

warmchang commented Sep 15, 2017

mark.

@bgrant0607

Member

bgrant0607 commented Sep 15, 2017

@yujuhong Can this be closed? If not, the title definitely needs to be changed

@bgrant0607 bgrant0607 removed this from the next-candidate milestone Sep 15, 2017

@zhangxiaoyu-zidif

Member

zhangxiaoyu-zidif commented Sep 15, 2017

mark.

@yujuhong yujuhong changed the title from Increase maximum pods per node for kube-1.3 release to Increase maximum pods per node Sep 15, 2017

@yujuhong

Contributor

yujuhong commented Sep 15, 2017

@yujuhong Can this be closed? If not, the title definitely needs to be changed

Updated the title. We never increased the maximum number of pods above 110. There is still work to do for scaling the pod capacity based on the machine size.

@warmchang

Contributor

warmchang commented Sep 16, 2017

Hi @yujuhong, the goal for max pods per machine is 500: https://github.com/kubernetes/community/blob/master/sig-scalability/goals.md

And the OpenShift team increased the default from 110 to 250 (as of OCP 3.4):
https://docs.openshift.com/container-platform/3.4/install_config/install/planning.html

@timothysc

Member

timothysc commented Sep 25, 2017

You'll always be limited by the pod CIDR range, which depends on the networking solution chosen. The changes are already there; they just need to be defaulted.
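For the arithmetic behind that: with the common default of a /24 per-node pod CIDR (kube-controller-manager --node-cidr-mask-size=24) there are only ~254 usable pod IPs, so anything near 500 pods per node needs a /23 or wider:

```sh
mask=24                               # per-node pod CIDR mask size
usable=$(( (1 << (32 - mask)) - 2 ))  # subtract network + broadcast addresses
echo "a /${mask} per-node range allows at most ${usable} pod IPs"
# /23 -> 510, /22 -> 1022, ...
```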

@fejta-bot

fejta-bot commented Jan 6, 2018

Issues go stale after 90d of inactivity.
Mark the issue as fresh with /remove-lifecycle stale.
Stale issues rot after an additional 30d of inactivity and eventually close.

Prevent issues from auto-closing with an /lifecycle frozen comment.

If this issue is safe to close now please do so with /close.

Send feedback to sig-testing, kubernetes/test-infra and/or @fejta.
/lifecycle stale

@errordeveloper

Member

errordeveloper commented Jan 15, 2018

/lifecycle frozen

@errordeveloper

Member

errordeveloper commented Jan 15, 2018

It'd be very interesting to hear whether cri-containerd might provide some improvements over the latest version of the Docker daemon, and whether getting rid of pause containers could bring any additional gains.

@spiffxp

Member

spiffxp commented Jan 16, 2018

/lifecycle frozen
/remove-lifecycle stale
per @errordeveloper's comment above

@orangefiredragon

orangefiredragon commented Jan 18, 2018

/remove-lifecycle stale

@pruthvipn

pruthvipn commented Jun 27, 2018

-- Maximum pods per single node on Kubernetes 1.10.x --
Adding my questions to this existing thread as they are along the same lines.

It looks like there were scaling issues with Kubernetes and Docker. I believe some of these issues have been addressed by now. For example, the “context deadline exceeded” errors from Docker are resolved in recent releases (moby/moby#29369).

I am trying to understand what the situation is with the latest versions of Kubernetes (1.10.x), Docker, and etcd. Assuming that the node has a HIGH hardware configuration, are there still challenges to Docker/Kubernetes scale?

  1. The default --max-pods per node is 110. What is the system/functionality/performance impact of changing this to 200, 300, or 500, given that the hardware has enough resources?
  2. Are there still any known scaling issues in Kubernetes, Docker, etcd, or any other software, and what are the parameters affecting them?
  3. Are there any related benchmarks done recently with the latest versions of Kubernetes and Docker? Please share details.

@obriensystems

obriensystems commented Sep 5, 2018

(+) procedure
https://wiki.onap.org/display/DW/Cloud+Native+Deployment#CloudNativeDeployment-Changemax-podsfromdefault110podlimit
Manual procedure: change the Kubernetes template (1pt2) before using it to create an environment (1a7).

Add --max-pods=500 to the "Additional Kubelet Flags" box in the v1.10.13 version of the Kubernetes template, found under the "Manage Environments" dropdown on the left of the Rancher console (port 8880).
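To confirm the new limit took effect, plain kubectl (nothing Rancher-specific) shows the pod capacity each kubelet registered:

```sh
kubectl get nodes -o custom-columns=NAME:.metadata.name,PODS:.status.capacity.pods
```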
