
KEP-2837: Pod level resource limits #1592

Closed
wants to merge 19 commits

Conversation

liorokman

keps/sig-node: add KEP for shared-burstable-limits - pod level resource limits

@k8s-ci-robot k8s-ci-robot added cncf-cla: yes Indicates the PR's author has signed the CNCF CLA. needs-ok-to-test Indicates a PR that requires an org member to verify it is safe to test. labels Mar 4, 2020
@k8s-ci-robot
Contributor

Welcome @liorokman!

It looks like this is your first PR to kubernetes/enhancements 🎉. Please refer to our pull request process documentation to help your PR have a smooth ride to approval.

You will be prompted by a bot to use commands during the review process. Do not be afraid to follow the prompts! It is okay to experiment. Here is the bot commands documentation.

You can also check if kubernetes/enhancements has its own contribution guidelines.

You may want to refer to our testing guide if you run into trouble with your tests not passing.

If you are having difficulty getting your pull request seen, please follow the recommended escalation practices. Also, for tips and tricks in the contribution process you may want to read the Kubernetes contributor cheat sheet. We want to make sure your contribution gets all the attention it needs!

Thank you, and welcome to Kubernetes. 😃

@k8s-ci-robot
Contributor

Hi @liorokman. Thanks for your PR.

I'm waiting for a kubernetes member to verify that this patch is reasonable to test. If it is, they should reply with /ok-to-test on its own line. Until that is done, I will not automatically test new commits in this PR, but the usual testing commands by org members will still work. Regular contributors should join the org to skip this step.

Once the patch is verified, the new status will be reflected by the ok-to-test label.

I understand the commands that are listed here.

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.

@k8s-ci-robot k8s-ci-robot added the size/L Denotes a PR that changes 100-499 lines, ignoring generated files. label Mar 4, 2020
@k8s-ci-robot k8s-ci-robot added kind/kep Categorizes KEP tracking issues and PRs modifying the KEP directory sig/node Categorizes an issue or PR as relevant to SIG Node. labels Mar 4, 2020
@liorokman
Author

/assign @dchen1107

@liorokman
Author

Kind reminder - can anyone review this?

@k8s-ci-robot k8s-ci-robot added size/XL Denotes a PR that changes 500-999 lines, ignoring generated files. sig/cli Categorizes an issue or PR as relevant to SIG CLI. sig/cloud-provider Categorizes an issue or PR as relevant to SIG Cloud Provider. and removed size/L Denotes a PR that changes 100-499 lines, ignoring generated files. labels Mar 6, 2020
@derekwaynecarr
Member

@liorokman the PR seems to have carried along a number of other KEPs. Can you update the PR to just have the KEP presented? Are you able to present at a future SIG Node meeting the high level overview of the proposal?

@derekwaynecarr
Member

/assign

@derekwaynecarr (Member) left a comment

I understand the motivation for the feature.

Have you thought about the implications of this feature with vertical pod autoscaling?
Have you thought about how we could map this pod bounding behavior potentially to a RuntimeClass or kubelet configuration rather than expose the knob directly in the end-user pod spec? I am not sure the average end-user would understand `shareBurstableLimits`.


This proposal aims to:

* Allow a `Burstable` pod to define that memory and CPU resource limits should be set on the level of the Pod.
Member:

The same arguments for the Burstable tier would apply to the Guaranteed tier. I could see users that do not want to specify explicit requests/limits for each container, but just want to bound the pod with fixed resources.

@liorokman (Author), Mar 6, 2020:

The current behavior for the Guaranteed tier is already exactly this. A pod is in the guaranteed QoS if and only if it always has a request equal to the limit for all the containers in the pod. If this is the case, the current code already sets the pod level cgroup to the sum of the limits for each of the resources.

But because in the Guaranteed case all of the containers also have a limit set, setting the pod-level cgroup to any limit is meaningless: the pod-level cgroup limit is by definition greater than any individual container's limit, so each container's own limit applies before the pod-level limit ever could.

A pod that wants to set pod-level fixed resources and not container level resources is by definition not in the Guaranteed QoS level.
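For illustration, a minimal sketch of the Guaranteed case described above (names and images are hypothetical): every container sets requests equal to limits, so the kubelet already sets the pod-level cgroup to the sum of the container limits.

<pre>
# Hypothetical Guaranteed pod: each container sets requests == limits,
# so the pod-level cgroup memory limit is already 100Mi + 200Mi = 300Mi.
apiVersion: v1
kind: Pod
metadata:
  name: guaranteed-example
spec:
  containers:
  - name: app
    image: example/app
    resources:
      requests: {cpu: "500m", memory: "100Mi"}
      limits:   {cpu: "500m", memory: "100Mi"}
  - name: sidecar
    image: example/sidecar
    resources:
      requests: {cpu: "250m", memory: "200Mi"}
      limits:   {cpu: "250m", memory: "200Mi"}
</pre>

Each container's own limit applies before the 300Mi pod-level limit could ever take effect, which is why a pod-level limit adds nothing in this tier.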

<pre>
QoS CGroup (one of guaranteed, burstable, or besteffort)
|
\ pod (memory: unlimited, cpu: unlimited)
</pre>
Member:

for clarity, cpu is not unlimited... it is setting shares, just quota is unbounded.

Author:

Correct, I will fix the KEP.
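For reference, a corrected sketch of that annotation, matching the comment above, might read:

<pre>
QoS CGroup (one of guaranteed, burstable, or besteffort)
|
\ pod (memory: unlimited, cpu: shares set from requests, quota unbounded)
</pre>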

<pre>
QoS CGroup (one of guaranteed, burstable, or besteffort)
|
\ pod (memory: 256M limit, CPU: 1 core)
</pre>
Member:

the biggest benefit for this is when it's difficult to size two containers that may individually write to a memory-backed volume.

Member:

pod resource overhead feature is in flight, it would be good to combine this example with a pod that has a runtimeclass that has an overhead. i also wonder as i type this if it's better to associate an attribute on the runtime class to say if the pod should be bounded or not rather than change the pod spec itself. have you explored this idea?

Author:

Can you link to the pod resource overhead feature? I'm not familiar with this.

Intuitively the pod level seems to me to be a better fit since this sort of resource allocation is not dependent on any specific runtime, and has no meaning outside of the specific Burstable QoS level. But I'd love to read about the pod resource overhead feature, and give it some thought...

Comment:

While we are waiting for merge of the updated Beta docs, I'd suggest checking out the PR: https://github.com/kubernetes/website/pull/19059/files

Author:

Thanks, I'll go over this.

keps/sig-node/20200303-shared-pod-limits.md (outdated review thread, resolved)
@liorokman
Author

liorokman commented Mar 6, 2020

@liorokman the PR seems to have carried along a number of other KEPs. Can you update the PR to just have the KEP presented?

Done.

@liorokman
Author

Are you able to present at a future SIG Node meeting the high level overview of the proposal?

I'd be happy to. When is the next SIG Node meeting?

@liorokman
Author

Have you thought about the implications of this feature with vertical pod autoscaling?

If this feature is turned on, the VPA could be set to update the budget of one single container in the pod, and also set to monitor the pod-level cgroup for actual maximum resource consumption. This would make the VPA able to also work correctly for pods with this behavior enabled. In effect, the VPA would work the same, except that it would scale the memory for the entire pod instead of working on the container level.

By allowing the deployment to specify a resource budget for the entire pod and letting the Linux kernel handle the details of granting each container what it needs, when it needs it, the end result is a more flexible arrangement. This is a win-win scenario:

  • the developer doesn't need to micro-manage container limits but is not forced to grant unlimited resources to critical containers,
  • the application has less disruption because the VPA isn't evicting the pod to fine-tune the specific container limits,
  • the node is less susceptible to the noisy-neighbor problem with pods in the Burstable QoS level (also making the scheduler placement decisions better for this same reason).

Have you thought about how we could map this pod bounding behavior potentially to a RuntimeClass or kubelet configuration rather than expose the knob directly in the end-user pod spec? I am not sure the average end-user would understand `shareBurstableLimits`

I see this feature as specific configuration for a specific pod, not as a general configuration that applies to all of the pods running on the node. I'm not sure how it would map to RuntimeClass either.

Maybe a different attribute name would be clearer? Maybe podResourceSharing instead of shareBurstableLimits?

@dashpole
Contributor

dashpole commented Mar 6, 2020

/cc

@k8s-ci-robot k8s-ci-robot added size/L Denotes a PR that changes 100-499 lines, ignoring generated files. and removed size/XL Denotes a PR that changes 500-999 lines, ignoring generated files. labels Mar 8, 2020
@thockin
Member

thockin commented Jan 16, 2023

@n4j - you want to take it over and get it merged as provisional?


## Summary

Pod resources are currently enforced on a container-by-container basis. Each container defines a limit for each managed resource, and the Kubelet translates these limits into the appropriate cgroup definitions. The Kubelet creates a three-level-deep hierarchy of cgroups for each pod: the top level is the QoS grouping of the pod (`Guaranteed`, `Burstable`, or `BestEffort`), the second level is the pod itself, and the bottom level contains the pod's containers. The current Pod API doesn't enable developers to define resource limits at the pod level. This KEP proposes a method for developers to define resource requests and limits at the pod level, in addition to the resource requests and limits currently possible at the individual container level.
Comment:

I wonder if we should have the phrase "the Kubelet translates these limits into the appropriate cgroup definitions". For runtimes like kata-containers, there is an additional level of indirection, because each pod is a VM and the cgroups (when they exist) are actually inside the VM.

@vinaykul vinaykul mentioned this pull request Feb 1, 2023
@n4j
Member

n4j commented Mar 2, 2023

@n4j - you want to take it over and get it merged as provisional?

@thockin yes

@n4j
Member

n4j commented Mar 30, 2023

@thockin Can this be provisionally merged?

@thockin (Member) left a comment

@n4j, I can't approve the PR as it stands - you need to at least set the metadata properly. Also, please make sure to use the most up-to-date KEP template.

I just skimmed today, but you could also fix the open comments or put them as UNRESOLVED so we don't lose them

to the resource limit. The current implementation considers a pod to be `Burstable` if at least one container defines resource requests or limits. Any
pods not matching the `Guaranteed` or `Burstable` QoS definitions are placed in the `BestEffort` QoS level.

This proposal suggests modifying these definitions as follows:
Member:

This section is awkwardly worded. I think we can find a better way to express it. Something like:

"""

For each core resource type (cpu, memory) a pod may have a pod-scoped resource request and/or limit. Pods which have a defined request or limit for a given resource type are said to "have explicit resources" or "be explicitly resourced". Pods which DO NOT have a defined request or limit for a given resource type are said to "have implicit resources" or "be implicitly resourced".

A pod is considered Guaranteed if, for each core resource type, either of the following is true:

  • It is explicitly resourced for that resource type AND the pod's request is equal to its limit.
  • It is implicitly resourced for that resource type AND all containers' requests are equal to their limits.

A pod is considered Burstable if, for each core resource type, any of the following is true:

  • It meets the criteria for "Guaranteed" for that resource
  • It is explicitly resourced for that resource type AND the pod's request is specified and less than or equal to its limit.
  • It is implicitly resourced for that resource type AND all containers' requests are specified and less than or equal to their limits.

A pod is considered BestEffort if, for each core resource type:

  • It is implicitly resourced for that resource type AND any container's request is not specified
"""

Member:

We could maybe simplify more. Could we say that:

"""
For each core resource type, a Pod's resources are either (evaluated in order):

  • the values specified in the pod's resources stanza
  • the sum of the values specified in all of the pod's containers' resources stanzas, where a single unspecified value means the result is unspecified

A pod is considered guaranteed if, for each core resource type, the pod's request equals the limit.
"""

owning-sig: sig-node
participating-sigs:
- sig-api-machinery
status: provisional|implementable|implemented|deferred|rejected|withdrawn|replaced
Member:

set this

@k8s-triage-robot

The Kubernetes project currently lacks enough contributors to adequately respond to all PRs.

This bot triages PRs according to the following rules:

  • After 90d of inactivity, lifecycle/stale is applied
  • After 30d of inactivity since lifecycle/stale was applied, lifecycle/rotten is applied
  • After 30d of inactivity since lifecycle/rotten was applied, the PR is closed

You can:

  • Mark this PR as fresh with /remove-lifecycle stale
  • Close this PR with /close
  • Offer to help out with Issue Triage

Please send feedback to sig-contributor-experience at kubernetes/community.

/lifecycle stale

@k8s-ci-robot k8s-ci-robot added the lifecycle/stale Denotes an issue or PR has remained open with no activity and has become stale. label Jan 19, 2024
@kad
Member

kad commented Jan 23, 2024

/cc @kad
/cc @marquiz

than the sum of limits for all the containers, then the limit requested by the containers is unreachable. It becomes impossible for all of the containers to
reach their requested limit, since the total memory available to them is the pod-level limit. Note that even though setting `pod.limit[memory] < sum(pod.containers[*].limit[memory])`
would not allow all of the containers to allocate memory up to their limit, this is not considered an error by this proposal (though a warning should be emitted).
If it were considered an error, then any admission controller that injects containers to pods with defined container-level limits might cause the pod to fail in unexpected ways.
Member:

Eventually I think the right behavior of these admission controllers would be to add the resources of the injected containers to the pod-level resources.
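A hedged sketch of the under-provisioned case described above (the pod-level stanza and names are assumed for illustration):

<pre>
# pod.limit[memory] < sum(container limits): 256Mi < 256Mi + 128Mi = 384Mi.
# Per the proposal this is allowed with a warning rather than rejected,
# so that injected sidecars cannot break the pod at admission time.
spec:
  resources:
    limits: {memory: "256Mi"}      # pod-level cap
  containers:
  - name: app
    resources:
      limits: {memory: "256Mi"}
  - name: injected-sidecar
    resources:
      limits: {memory: "128Mi"}
</pre>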

If there is a resource request definition on the pod level, even without any definition on any of the included containers, then the pod should be considered
`Burstable`, since there is at least some knowledge as to the amount of resources the pod should receive.
Member:

What do container-level resource requests mean if pod-level requests are set? Are they completely ignored in that case? If so, should that be an error case?

EDIT: Actually, I don't think it should be an error case to account for injected sidecars. Maybe a warning though.

1. Require the pod-level cgroup (if it exists) to apply to the `InitContainers` as well.
1. Allow specifying pod-level restrictions separately for Init containers and sidecars.

Since init containers run sequentially, there is no requirement to share resources with any other container. Therefore the first option is the best - the pod-level cgroup should be restricted only after all of the init containers have finished running.
Member:

I disagree. I think we should go with the second option for consistency (apply pod-level cgroup to init containers). Consider the case where pod-level limits are defined, and the user didn't bother to set container-level limits. I would expect the pod limits to cap the init containers.

Contributor:

@tallclair I agree with you from the UX point of view.

One problem I see is that if the init container requires much more resources than the regular + sidecar containers, users will be forced to set very high pod resource limits to accommodate the initial surge. This might not be a concern given that we already account for the init container resource requirements during scheduling, but I just want to bring it up as it's a legitimate use case.

@thockin
Copy link
Member

thockin commented Feb 7, 2024

I'd be happy to see this topic come back to life, but we probably need to re-evaluate all preconditions - a lot of things have changed since it was written :)


A pod is considered `Guaranteed` if either of the following two conditions applies:
1. All of the containers included in the pod define memory and CPU resource requests equal to resource limits
1. The pod includes a pod-level memory and CPU resource request which is equal to the resource limit, even if one or more of the included containers doesn't define either a CPU or a memory request or limit.

Comment:

Should we address how will pod level resource limits affect ephemeral containers?

Conversely, if `pod.limit[memory] > sum(pod.containers[*].limit[memory])` then the excess is not relevant: the individual containers
can never use more than their defined limits, so even if every container uses the maximum its memory limit allows, their combined usage can never reach
the pod-level limit. Setting such a limit on the pod level does not _reserve_ any memory for the pod and will not negatively affect the rest of the
workloads running on the node. Therefore this is not considered an error.
Contributor:

Might be okay (not worth a warning) if we account for ephemeral containers
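Sketch of the over-provisioned case, under the same assumed pod-level stanza:

<pre>
# pod.limit[memory] > sum(container limits): 1Gi > 256Mi + 256Mi = 512Mi.
# The excess is unreachable and reserves nothing, so the proposal treats
# it as harmless rather than as an error.
spec:
  resources:
    limits: {memory: "1Gi"}
  containers:
  - name: app
    resources:
      limits: {memory: "256Mi"}
  - name: helper
    resources:
      limits: {memory: "256Mi"}
</pre>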


### User Stories

#### Story 1
Contributor:

Another use case is to allow better resource utilization, since we don't need to specify the limits (i.e., peak usage) for each individual container. Instead, we do this at the pod level, which allows more flexibility through sharing.


The current implementation (before this proposal) sets up the pod-level cgroup using the maximum between the sum of all the sidecars' limits and the biggest limit of any single init container.

There are a few options:
1. Limit the pod-level cgroup only after the `InitContainers` have successfully terminated
Contributor:

  • What's the behavior if resources are not specified for the init containers?
  • Also, wouldn't this complicate the implementation significantly?
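For reference, the pre-proposal sizing rule quoted above can be illustrated with a sketch (values made up):

<pre>
# Effective pod-level cgroup limit today:
#   max(sum(long-running container limits), max(init container limits))
# = max(128Mi + 128Mi, 512Mi) = 512Mi
spec:
  initContainers:
  - name: migrate
    resources:
      limits: {memory: "512Mi"}
  containers:
  - name: app
    resources:
      limits: {memory: "128Mi"}
  - name: sidecar
    resources:
      limits: {memory: "128Mi"}
</pre>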

@k8s-triage-robot

The Kubernetes project currently lacks enough active contributors to adequately respond to all PRs.

This bot triages PRs according to the following rules:

  • After 90d of inactivity, lifecycle/stale is applied
  • After 30d of inactivity since lifecycle/stale was applied, lifecycle/rotten is applied
  • After 30d of inactivity since lifecycle/rotten was applied, the PR is closed

You can:

  • Mark this PR as fresh with /remove-lifecycle rotten
  • Close this PR with /close
  • Offer to help out with Issue Triage

Please send feedback to sig-contributor-experience at kubernetes/community.

/lifecycle rotten

@k8s-ci-robot k8s-ci-robot added lifecycle/rotten Denotes an issue or PR that has aged beyond stale and will be auto-closed. and removed lifecycle/stale Denotes an issue or PR has remained open with no activity and has become stale. labels Mar 21, 2024
@thockin thockin removed the lifecycle/rotten Denotes an issue or PR that has aged beyond stale and will be auto-closed. label Mar 25, 2024
RomanBednar pushed a commit to RomanBednar/enhancements that referenced this pull request Apr 5, 2024
…-team-lead

add suleymanakbas91 as team lead for edge enablement
@SergeyKanzhelev
Member

will this be worked on in 1.31? please let me know so I can mark the KEP #1592 with the correct metadata.

@yujuhong
Contributor

will this be worked on in 1.31? please let me know so I can mark the KEP #1592 with the correct metadata.

@ndixita is planning to work on this in 1.31.

@@ -0,0 +1,48 @@
---
Contributor:

Would you also add a keps/prod-readiness/sig-auth/1592.yaml file with:

kep-number: 1592
alpha:
  approver: "@jpbetz"

? Thanks!

- 2020-06-04 - Updating the KEP after a trial-run on some real-world workloads
- 2020-09-22 - Updating the KEP to allow pod-level resource requests
- 2020-10-07 - Clarified how container-level memory and cpu limits are expected to relate to defined pod limits
- 2020-11-23 - Updated the KEP to the new KEP format.
Contributor:

Is this still targeting 1.31 for alpha? If so, please copy in the Production Readiness Review Questionnaire section from the template and fill out the alpha sections.

Contributor:

@jpbetz #4678 is the revised version of this KEP that is being worked on actively. May I please add you as the PRR reviewer, since John might be busy?

Contributor:

Yes, please sign me up

@ndixita
Contributor

ndixita commented Jun 6, 2024

/close

Closing this PR in favor of #4678 to avoid confusion

@k8s-ci-robot
Contributor

@ndixita: Closed this PR.

In response to this:

/close

Closing this PR in favor of #4678 to avoid confusion

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes-sigs/prow repository.
