Update pod resource management design and rollout plan #314

Merged · 1 commit · Mar 3, 2017

Conversation

derekwaynecarr (Member):

Update the documentation for pod resource management based on current plan.

Document a rollout plan and tentative timeline for enabling the feature.

@k8s-ci-robot added the "cncf-cla: yes" label (indicates the PR's author has signed the CNCF CLA) on Jan 31, 2017
High level requirements for the design are as follows:
- Do not break existing users. Ideally, there should be no changes to the Kubernetes API semantics.
- Support multiple cgroup managers - systemd, cgroupfs, etc.
### Memory overcommitment
Member Author:

@sjenning -- this section captures the work you are implementing.

* A value of 0 will instruct the `kubelet` to adjust the Burstable and
BestEffort cgroup to restrict memory overcommit by inducing memory
pressure and reclaim if needed at the QoS level cgroup tiers before
inducing pressure at the `ROOT` cgroup.
@sjenning (Contributor, Jan 31, 2017):

I didn't realize this flag would take a value. I was thinking, at least at first, this would do a hard reserve of G and Bu limits against lower tiers. Basically hardcoding this value to 0.

Member Author:

I think a range is a valuable heuristic, any objections?


```
ROOT/Burstable/memory.limit_in_bytes =
Node.Allocatable - {(summation of memory requests of `Guaranteed` pods)*(1-qmo/100)}
```
Contributor:

I don't think this formula is correct. I think it should be

Node.Allocatable - {(summation of memory requests of Guaranteed pods)*(100-qmo/100)}

@derekwaynecarr (Member Author, Jan 31, 2017):

It's possible I have a math error, will double check.

Contributor:

Shouldn't this be
ROOT/Burstable/memory.limit_in_bytes = Node.Allocatable - { (sum (Guaranteed) + sum (Burstable) ) * (1-(qmo/100)) }

```
Node.Allocatable - {(summation of memory requests of all `Guaranteed` and `Burstable` pods)*(1-qmo/100)}
```
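For illustration, a minimal sketch of the arithmetic in the quoted formulas, assuming `qmo` is the overcommit percentage (0 means fully reserve requests, 100 means no reservation); the function and variable names are invented for this example and are not the kubelet's code:

```
package main

import "fmt"

// qosMemoryLimits applies the arithmetic quoted above: the fraction of
// Guaranteed (and Burstable) memory requests withheld from lower QoS tiers
// is (1 - qmo/100).
func qosMemoryLimits(allocatable, guaranteedReq, burstableReq int64, qmo float64) (burstableLimit, bestEffortLimit int64) {
	reserve := 1 - qmo/100
	burstableLimit = allocatable - int64(float64(guaranteedReq)*reserve)
	bestEffortLimit = allocatable - int64(float64(guaranteedReq+burstableReq)*reserve)
	return burstableLimit, bestEffortLimit
}

func main() {
	const gi = int64(1) << 30
	// 16Gi allocatable, 2Gi of Guaranteed requests, 1Gi of Burstable requests, qmo=0.
	bu, be := qosMemoryLimits(16*gi, 2*gi, 1*gi, 0)
	fmt.Println(bu, be) // 15032385536 13958643712 (i.e. 14Gi and 13Gi)
}
```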

Each time a pod is admitted to the `kubelet`, the `kubelet` will increment
Contributor:

s/increment/adjust. Increment makes me think "increase" when we are actually decreasing the limit on lower QoS tier with each added pod.

Member Author:

agreed on the word change.

@vishh vishh self-assigned this Jan 31, 2017
@calebamiles (Contributor):

cc: @ethernetdan, change will require users to drain nodes. You likely want to track the progress of this

@ethernetdan:

@calebamiles looking like 1.6 will require node drains due to this + CRI enablement, we should talk about how to minimize impact


__Note__: The cgroup-root flag would allow the user to configure the root of the QoS cgroup hierarchy. Hence cgroup-root would be redefined as the root of the QoS cgroup hierarchy rather than of containers.
* 01/31/2017 - Discuss the rollout plan in sig-node meeting
* 02/14/2017 - Flip the switch to enable pod level cgroups by default


Made this comment in SIG-node but if we can get this in by the 13th (I'd try to get it in the Friday before to be safe) we can target v1.6.0-alpha.2

container and a BestEffort container is classified as a Burstable pod. The
BestEffort container is not able to consume slack resources from the sibling
Burstable container. It must instead compete for scarce resources at the node
level across all containers in all QoS classes which violates the design.
Contributor:

Even with pod level cgroups, the BestEffort container in this example will compete with other Burstable pods. Remember there are no pod cgroup limits for Burstable pods unless all their containers specify limits, in which case there would be no BestEffort containers in the first place.

Member Author:

that is correct. i had taken this verbiage from the original document. will update that it will allow containers in a pod to share slack resources within its qos tier.

Member Author:

the best effort container gets cpu share time relative to the burstable pods request is the scenario i was most targeting here.

1. Ability to charge any memory usage of memory-backed volumes to the pod, rather than the node, when
an individual container exits.

## Enabling the unified cgroup hierarchy
Contributor:

unified cgroup hierarchy could mean multiple things. What are you referring to here? If it's just about QoS and Pod cgroups, I'd rather refer to it as QoS cgroups.

Member Author:

sure


To enable the unified cgroup hierarchy, the operator must enable the
`--cgroups-per-qos` flag. Once enabled, the `kubelet` will start managing
Contributor:

Kindly note that we require --cgroup-root to not be /. This may change once Node Allocatable changes are made.

Member Author:

i would prefer we do not require the cgroup-root to previously exist, so ideally as part of node allocatable, the kubelet can create that step in the taxonomy. maybe we can just state that we anticipate that the root will not be /.

Member Author:

per node allocatable design, i am keeping the cgroup-root flag defaulted to / since we will create the kubepods cgroup relative to that.

operators may have to choose a particular cgroup driver to ensure
proper system behavior. For example, if operators use the `systemd`
cgroup driver provided by the `docker` runtime, the `kubelet` must
be configured to use the `systemd` cgroup driver.
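To make the constraint concrete, a minimal sketch of the kind of consistency check this implies, under the assumption that the runtime reports its cgroup driver; the function name and error text are illustrative, not the kubelet's actual validation code:

```
package main

import "fmt"

// validateCgroupDriver refuses a mismatched configuration, e.g. a kubelet
// configured for "cgroupfs" against a docker daemon that uses "systemd".
func validateCgroupDriver(kubeletDriver, runtimeDriver string) error {
	if kubeletDriver != runtimeDriver {
		return fmt.Errorf("kubelet cgroup driver %q does not match container runtime cgroup driver %q", kubeletDriver, runtimeDriver)
	}
	return nil
}

func main() {
	fmt.Println(validateCgroupDriver("cgroupfs", "systemd")) // rejected
	fmt.Println(validateCgroupDriver("systemd", "systemd"))  // <nil>
}
```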
Contributor:

It is worth mentioning that with systemd, ownership of a slice must be delegated to the kubelet in order to use the cgroupfs driver.

Member Author:

ack

Member Author:

actually, i think Delegate=true relative to the cgroup root for the hierarchy should go in node allocatable follow-on or that PR that has that design.

Contributor:

Given that we introduce cgroup drivers here, I'd appreciate if you can cross reference this doc from node-allocatable.md or move this section to that doc.

Member Author:

i had hoped my doc could merge first, but ok ;-)

Member Author:

linked to node_allocatable design in "## Enabling QoS and Pod level cgroups" section (so even earlier in the doc).

## Integration with container runtimes

The `kubelet`, when integrating with container runtimes, always provides the
concrete cgroup filesystem name for the pod sandbox.
Contributor:

Concrete or abstract? IIRC, we intended to keep the wire format for CRI to be Abstract and have the runtime shims choose between Abstract and Concrete?

Member Author:

we settled on concrete. the shims are always provided the cgroupfs syntax as found on the host.


For the initial implementation we will only support limits for cpu and memory resources.
The `cgroups-per-qos` flag will be enabled by default, but user's
Contributor:

I'm wondering if we can use FeatureGate instead?


#### Rkt runtime

We want to have rkt create pods under a root QoS class that kubelet specifies, and set pod level cgroup parameters mentioned in this proposal by itself.
Contributor:

Does rkt support pod level cgroups?

Member Author:

I think rkt needs to support it. I think individual container runtime support should be separate from this proposal generally. Looking in the current code in head, I do not see it being used.



#### Add Pod level metrics to Kubelet's metrics provider
Contributor:

Why are you deleting this section? We do intend to get pod level metrics to help with evictions.

Member Author:

i thought that was covered in the core metrics proposal, i can add back some of these items.

Member Author:

i added this to a future enhancements section.

- [ ] Check if parent cgroup exist and error out if they don't.
- [ ] Set top level cgroup limit to resource allocatable until we support QoS level cgroup updates. If cgroup root is not `/` then set node resource allocatable as the cgroup resource limits on cgroup root.
- [ ] Add a NodeResourceAllocatableProvider which returns the amount of allocatable resources on the nodes. This interface would be used both by the Kubelet and ContainerManager.
- [ ] Add top level feasibility check to ensure that pod can be admitted on the node by estimating left over resources on the node.
Contributor:

I believe this is still necessary. We cannot admit a pod until (or unless) the QoS cgroups can accommodate it.

Member Author:

i wanted to avoid enumerating must have work items as part of the design doc. i will update the verbiage in the Memory allocation section to denote that it will do a check on Admit.

Member Author:

i describe the heuristic we will apply in the memory allocation section.

- [ ] Log basic cgroup management ie. creation/deletion metrics
Contributor:

We need this feature. Internally, we have observed high tail latencies for cgroupfs operations. I'd prefer having metrics to help us narrow down such issues.

Member Author:

ok, i will add a section for this in the document.

Member Author:

i added a section for logging.

* opt-in behavior surrounding the feature (`qos-memory-overcommit` support) completed.
* 03/01/2017 - Send an announcement to kubernetes-dev@ about the rollout and potential impact
* 03/22/2017 - Kubernetes 1.6 release
* TBD (1.7?) - Deprecate the old implementation
Contributor:

Is there an old implementation?

Member Author:

i meant this to read: look at deprecating the ability to have pod level cgroups off. will clarify text.

accounting on the node, and introduces a number of code complexities when
trying to build features around QoS.

This design introduces a unified cgroup hierarchy to enable the following:
Contributor:

s/unified/new/ since unified means something else in the context of cgroups


This design introduces a unified cgroup hierarchy to enable the following:

1. Improve enforcement of QoS class on the node.
Contributor:

nit: s/Improve/Enable s/class/classes/

Contributor:

nit: also do we want to say "QoS cgroup" instead of "QoS class" throughout since it is "Pod cgoups" and "Pod Class" doesn't make sense?


1. Simplify resource accounting at the pod level.
1. Allow containers in a pod to share slack resources within its QoS class.
Contributor:

if this is talking about pod level cgroups wouldn't it be "share slack resources within its pod cgroup"?


The `--cgroup-root` flag must have a value specified to use this feature.
The `kubelet` will parent any cgroups it creates below that specified value.
The `--cgroup-root` flag will default to `/` if not specified.
Contributor:

This conflicts with L57. Does the flag have a default or must it be specified?

Member Author:

the flag must have a value, and it defaults to /


A pod can belong to one of the following 3 QoS classes: Guaranteed, Burstable, and BestEffort, in decreasing order of priority.
The `kubelet` will support a flag `--qos-reserve-limits` that takes a
set of percentages per compressible resource that controls how the QoS
Contributor:

s/compressible/incompressible

in a range from 0-100%, where a value of 0 instructs the `kubelet` to
attempt no reservation, and a value of 100 will instruct the `kubelet`
to attempt to reserve the sum of requested resource across all pods
on the node. How the `kubelet` achieves this desired state is resource
Contributor:

nit: maybe something along the lines of "exclude pods from lower QoS classes from using resources requested by higher QoS classes" would be clearer.

specific. The default value per compressible resource if not specified
Contributor:

s/compressible/incompressible

is for no reservation to occur.
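For illustration, a minimal sketch of how such a per-resource percentage value might be parsed, assuming a `memory=100%` style syntax like the flag examples later in this thread; the helper name is hypothetical and resources not listed simply get no reservation, matching the default described above:

```
package main

import (
	"fmt"
	"strconv"
	"strings"
)

// parseQoSReserve turns "memory=100%,cpu=50%" into a map of resource name to
// percentage. Resources that are not listed do not appear in the map.
func parseQoSReserve(value string) (map[string]int64, error) {
	out := map[string]int64{}
	if value == "" {
		return out, nil
	}
	for _, pair := range strings.Split(value, ",") {
		parts := strings.SplitN(pair, "=", 2)
		if len(parts) != 2 {
			return nil, fmt.Errorf("invalid entry %q", pair)
		}
		pct, err := strconv.ParseInt(strings.TrimSuffix(parts[1], "%"), 10, 64)
		if err != nil || pct < 0 || pct > 100 {
			return nil, fmt.Errorf("percentage must be 0-100 in %q", pair)
		}
		out[strings.TrimSpace(parts[0])] = pct
	}
	return out, nil
}

func main() {
	m, _ := parseQoSReserve("memory=100%")
	fmt.Println(m) // map[memory:100]
}
```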
Contributor:

Should the default be 100% reservation? I think this is what Vish was thinking, since 0% is the effective value if you don't use the flag. The flag without a value should do something different than not having the flag at all.

Member Author:

i would assume that if a resource is not enumerated in the flag value, we would do no reservation.


By default, no memory limits are applied to the BestEffort
and Burstable QoS level cgroups unless a `--qos-reserve-limits` value
is specified for memory.
Contributor:

Needs changing if you agree with my previous comment about default reservations

Member Author:

i dont think i agree w/ previous comment.

Member Author:

my opinion is that if no value is provided, we should do no harm, and therefore do nothing. it will have the same literal effect of qos-reserve-limits=memory=0%

@derekwaynecarr (Member Author):

@vishh - please take a look since node allocatable basically requires this as well.

@derekwaynecarr force-pushed the cgroup_rollout branch 3 times, most recently from 67e32e5 to 69f95c0 on February 21, 2017 at 21:59
@derekwaynecarr (Member Author):

@vishh @sjenning @dchen1107 -- all updates have been made. i believe this accurately represents the plan we have PRs in flight to pursue for 1.6. i updated text around how qos level cgroup sandboxes are updated to try to reduce confusion, and minimize risk. ptal.

k8s-github-robot pushed a commit to kubernetes/kubernetes that referenced this pull request Feb 22, 2017
Automatic merge from submit-queue (batch tested with PRs 41349, 41532, 41256, 41587, 41657)

Enable pod level cgroups by default

**What this PR does / why we need it**:
It enables pod level cgroups by default.

**Special notes for your reviewer**:
This is intended to be enabled by default on 2/14/2017 per the plan outlined here:
kubernetes/community#314

**Release note**:
```release-note
Each pod has its own associated cgroup by default.
```
We use the following denotations in the sections below:
Internally, the `kubelet` maintains both an abstract and a concrete name
for its associated cgroup sandboxes. The abstract name follows the traditional
`cgroupfs` style syntax. The concrete name is the name for how the cgroup
Contributor:

In the case of systemd the concrete name is also abstract :)

Member Author:

i am not following, unless you mean the concrete name is abstract by virtue that it encodes the hierarchy at each step. if so, i agree, that is an abstraction in and of itself. that said, it is current reality ;-)

```
pod<UID>/memory.limit_in_bytes = sum(pod.spec.containers.resources.limits[memory])
```
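A minimal sketch of the pod-level limit computation quoted above, with simplified stand-in types rather than the real Kubernetes API structs; a real implementation would only set the pod limit when every container specifies one (as the next comment points out):

```
package main

import "fmt"

type container struct {
	memoryLimitBytes int64 // 0 means no limit specified
}

// podMemoryLimit sums the per-container memory limits, reporting false when
// any container is unbounded so no pod-level limit should be applied.
func podMemoryLimit(containers []container) (int64, bool) {
	var sum int64
	for _, c := range containers {
		if c.memoryLimitBytes == 0 {
			return 0, false
		}
		sum += c.memoryLimitBytes
	}
	return sum, true
}

func main() {
	limit, ok := podMemoryLimit([]container{{256 << 20}, {512 << 20}})
	fmt.Println(limit, ok) // 805306368 true
}
```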

Note: This design enables containers in a pod to optionally share slack compute resources.
Contributor:

Well, unless all containers specify limits, there will be no pod level limits. Once they specify limits, there is no sharing.

Member Author:

removed

```
pod<UID>/cpu.shares = 2
```
Contributor:

Once we set cpu.shares to 2 at the QoS level cgroup, is this setting at the pod level still relevant?

Member Author:

it is less relevant, but it is consistent with what we end up setting for the container cgroup sandbox that it contains.


## QoS level cgroups

The `kubelet` defines a `--cgroup-root` flag that is used to specify the `ROOT`
Contributor:

I'd mention that the recommended configuration is to set --cgroup-root to / to avoid having a deep cgroup hierarchy.

Member Author:

ack

Member Author:

text was added re: deep hierarchies.

allows BestEffort and Burstable pods to potentially consume as many
resources as are presently available on the node.

For compressible resources, this prioritization scheme has little impact.
Contributor:

This is not entirely true since power is limited across CPU sockets and so Burstable or BestEffort can cause performance issues for Guaranteed pods.

Contributor:

Hi @vishh --

You're talking about impact of TDP/HWP and turbo-boost frequency scaling interfering with G pods? Isn't the millicores concept intended to gloss over that level of detail, or is it that we are now concerned with it again in scenarios such as you described?

Now you have got me curious ... Any chance you could provide the content of /proc/cmdline and the output of "turbostat sleep 10" from a hypervisor node along with "numastat -p kvm"?

Member Author:

i rephrased this, but avoided discussing why a pod cpu request should ultimately being measured in watts ;-)

Contributor:

@jeremyeder The discussion here is about providing consistent performance. I agree with @derekwaynecarr that we should be discussing watts per core in a different issue.

pods memory request may not be satisfied if there are active BestEffort
pods consuming all available memory.

The `kubelet` will support a flag `experimental-qos-reserve-requests` that
Contributor:

reserve-requests is a bit confusing. Are we reserving for Guaranteed pods? One could also interpret it as node level reservation.

That's why I preferred overcommit-percentage where the default "" would be 100% of Allocatable.

@sjenning (Contributor, Feb 22, 2017):

I thought we had agreed on this.

You can't "overcommit" a node in the way I think of overcommitting. The scheduler looks at pod resources requests and packs them in up to node Allocatable. In my mind, overcommitment is a scheduler thing i.e. pack more pods on the node assuming not all of them will use their requests all at once.

What we are doing is reserving the requests made by G and Bu pods against use in the lower QoS tiers. Hence qos-reserve-requests.

Member Author:

by definition, we are reserving for G pods in the same manner we are reserving for node allocatable top level cgroup. both have a memory limit set equal to their request. i agree with @sjenning . i think overcommit-percentage means something like a physical cpu/memory to virtual cpu/memory scalar that would skew what the node reports as allocatable back to the scheduler. this is precisely not that. this is attempting to reserve for the qos tier in the same manner we are reserving for the kubepods cgroup sandbox one level up in the hierarchy. it's also experimental in 1.6, so i am not sure how much we want to bikeshed on the name.

Contributor:

You can't "overcommit" a node in the way I think of overcommitting.

Our current QoS doc talks about oversubscribing. May be that's a better term?

What we are doing is reserving the requests made by G and Bu pods against use in the lower QoS tiers. Hence qos-reserve-requests

This is not obvious from the flag name. We use the keyword reservation in other places to indicate System and Kube reservations, for example, where it's a static reservation.
Whereas here, we are referring to a dynamic quantity.

Because the kubernetes scheduler only looks at requests, it inherently oversubscribes (trying a different term) the nodes. This flag is merely controlling the amount of oversubscription.

it's also experimental in 1.6, so i am not sure how much we want to bikeshed on the name.

Having spent so much time on it, we might as well resolve it now among the three of us. I feel this conversation will be helpful in agreeing on the semantic meaning of the some of the terms we use all over the code too - reservation, commitment, limits, etc.

converges to a desired state. Failure to set `cpu.shares` at the QoS level
cgroup would result in `500m` of cpu for a Guaranteed pod having a different
meaning than `500m` of cpu for a Burstable pod in the current hierarchy.
For this reason, we will always set `cpu.shares` for the QoS level sandboxes
Contributor:

I wasn't aware of this change. I was under the assumption that all QoS level settings would be opt-in.

Contributor:

Also, given that we have pod level limits, I don't see QoS level shares limits to have that much of an impact unless a user explicitly opts in to restricting overcommit.

Member Author:

I am not following, absent setting a value:

$ cat /Burstable/cpu.shares
1024

If you have allocatable=4cpu and you run stress to do the following:

kubectl run burstable --image=sjenning/stress --requests=cpu=500m,memory=100Mi --replicas=1 --command -- /usr/bin/stress -c 4 -t 3600

by default, this bursts to consume all 4 cpus, which is fine, that is the point.

but if you do this:

kubectl run guaranteed --image=sjenning/stress --requests=cpu=500m,memory=100Mi --limits=cpu=500m,memory=100Mi --replicas=7 --command -- /usr/bin/stress -c 4 -t 3600

the result is that the burstable pod skews closer to 1 core of usage since the 1024 share evaluation is made relative to the guaranteed pods, and we really want the Burstable cpu share to be 500 so it was given equal time relative to the other pods.

i tested this locally and confirmed this was the case, and manually assigning

echo 500 > /Burstable/cpu.shares

fixes usage to closer to the actual value we requested as expected...

do you disagree that we need to set shares on burstable tier?

Contributor:

do you disagree that we need to set shares on burstable tier?

Nope. All I'm saying is that I'd prefer not setting QoS limits by default in v1.6.

Contributor:

As per offline discussion, given that this will lead to a regression, I agree that it's necessary to have it turned on by default. Apologies for the confusion.
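A minimal sketch of the shares arithmetic behind the experiment above, assuming the conventional 1024-shares-per-core conversion (the manual test wrote the raw millicore value, 500, which serves the same purpose); names and values are illustrative, not the kubelet's code:

```
package main

import "fmt"

const (
	sharesPerCPU = 1024
	milliPerCPU  = 1000
	minShares    = 2 // kernel minimum for cpu.shares
)

// milliCPUToShares converts a summed cpu request in millicores into cpu.shares.
func milliCPUToShares(milliCPU int64) int64 {
	shares := milliCPU * sharesPerCPU / milliPerCPU
	if shares < minShares {
		return minShares
	}
	return shares
}

func main() {
	// The Burstable tier's cpu.shares would track the summed cpu requests of
	// Burstable pods (a single 500m pod -> 512 shares) instead of the kernel
	// default of 1024 the cgroup gets when shares are left unset.
	fmt.Println(milliCPUToShares(500)) // 512
}
```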

A value of `--experimental-qos-reserve-requests=memory=100%` will cause the
`kubelet` to restrict the Burstable and BestEffort cgroups from consuming memory
that was requested by a higher QoS class. This increases the risk
of inducing OOM on BestEffort workloads in favor of increasing memory
Contributor:

and on Burstable too since they cannot consume memory that is otherwise free

resource guarantees for Burstable and Guaranteed workloads. A value of
`--experimental-qos-reserve-requests=memory=0%` will allow a Burstable
Contributor:

As I mentioned earlier, I find reservation misleading since it is not clear what the reservation is meant for.


Since memory is an incompressible resource, it is possible that a QoS
level cgroup sandbox may not be able to reduce memory usage below the
value specified in the heuristic during pod admission and pod termination.
Contributor:

We don't use any heuristic, do we? Limits are based on the policy mentioned above, right?

As a result, the `kubelet` runs a periodic thread to attempt to converge
to this desired state from the above heuristic. If unreclaimable memory
usage has exceeded the desired limit for the sandbox, the `kubelet` will
attempt to set the effective limit near the current usage to put pressure
Contributor:

It is not safe to create a pod level cgroup unless updates on the QoS level memory cgroup have been successful.
It's not obvious what the general design principle is from this text.

Member Author:

@vishh -- does something like the following help? I disagree on the safety argument. Operators most concerned about safety may deploy G workloads first to a node before opening up other workloads for scheduling. For nodes with a fluid set of pods being scheduled, I want to attempt to provide safety, but do no worse than if the reservation feature is disabled.

Use cases:

  1. I want to prioritize access to compressible resources for my system/cluster daemons over end-user pods.
  2. I want to prioritize access to compressible resources for my G workloads over my Bu workloads.
  3. I want to prioritize access to compressible resources for my Bu workloads over my Be workloads.

An operator may choose to prioritize access to a compressible resource over utilization for each step in the hierarchy of workloads described above dependent upon their intended workload. Almost all cluster operators will use node allocatable to enforce the first use case in order to provide reliable operation of the node. It is understood that not all operators may feel the same for G/Bu workloads across the user community. For the users I represent, I know they may opt to deploy important cluster services as G workloads via a DaemonSet and would like a similar resource reservation model as is provided via node allocatable for system/cluster daemons and end-user pods. In addition, I know users I represent would like to run Be workloads and minimize their ability to impact Bu pods. Reliance on eviction, oom_killer, and critical pods is not always sufficient. Eviction still has latency, oom_killer is disruptive, and critical pods as I view them are intended to provide G style guarantees but run with Bu resource requirements.

Design Principle
kubelet with opt-in configuration will attempt to limit the ability for a pod in a lower QoS tier to burst utilization of a compressible resource that was requested by a pod in a higher QoS tier.

Mechanism
Prior to starting a G pod, we will always attempt to update the Bu and Be tiers to have their memory limits reduced based on the incoming G pod memory request. It is possible that we are unable to reduce the Bu and Be tier to their new desired limit if usage already has exceeded that value prior to execution of the G pod. If they have exceeded their usage, we will set the limit at or near their usage to put pressure on the cgroup to prevent further growth. The kubelet will not wait for the QoS cgroup memory limit to converge to the desired state prior to execution of the G pod. This does mean that the G pod could induce an OOM for the kubepods cgroup, but per our QoS design, we would prefer the oom_killer targets the Bu/Be pods first, or ideally, we get the eviction code path to kick in response to kernel notification, and evict the Bu/Be pod. The periodic task that attempts to converge the QoS tier memory limits would then have a better opportunity to converge on the desired state so any future Bu/Be pods that land do not impact the G pod since they would be working under a reduced memory limit. The same logic applies for the Bu pods that get scheduled, but reduces the pertinent QoS cgroup sandbox update to just Be tier.

Best Practices
Operators that want to provide a similar resource reservation model for G pods as we offer via enforcement of node allocatable are encouraged to schedule their G pods via a DaemonSet prior to opening up the node for scheduling of Bu/Be workloads. This ensures that the Bu/Be tiers have had their QoS memory limits appropriately adjusted before taking unbounded workloads on the node.

Future considerations
It's possible some users may want to stall execution of a pod in a higher QoS tier if resource reservations for compressible resources have not converged on the desired state. It can be evaluated based on more experience in the community.

We have discussed potentially allowing end-user pods to schedule in the kube/system cgroup spaces, but the same hierarchy of needs may still be extended. For example, hosted multi-tenant shared-node offerings may want to allow the operators to deploy pods to kube/system cgroup spaces, but not the tenants on a shared node. For those users, offering the same opportunity to reserve requests at each level in the hierarchy is important.
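A minimal sketch of the convergence step described in the Mechanism text above; the types and the reconcile helper are assumptions for illustration, not the kubelet's actual code:

```
package main

import "fmt"

type qosCgroup struct {
	name       string
	usageBytes int64
}

// reconcileMemoryLimit returns the memory limit to write for the tier this
// pass: the desired target when usage is already below it, otherwise a cap
// near current usage so the tier cannot grow while reclaim or eviction
// catches up; a later pass converges toward the target.
func reconcileMemoryLimit(c qosCgroup, desired int64) int64 {
	if c.usageBytes <= desired {
		return desired
	}
	return c.usageBytes
}

func main() {
	bu := qosCgroup{name: "Burstable", usageBytes: 6 << 30}
	fmt.Println(reconcileMemoryLimit(bu, 4<<30)) // pins at ~6Gi this pass
}
```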

Contributor:

Firstly, I'd prefer all this content to exist in the doc. This is kind of what I have been asking for in @sjenning's PRs. Thanks for posting this.

Your design choice is not geared towards predictable tail latencies. Maybe the customers you represent aren't expecting that yet. I do know some customers who care about that. Especially ones who run user facing web services that cannot have unpredictable latencies.

Relying on OOM scores and user space evictions isn't reliable enough yet. What this proposal is attempting to provide is definitely better than the current state, but we should clarify that it is besteffort Quality of Service. Users should know that we do not optimize for predictable tail latencies (yet).

Given that the kernel does not prevent over commitment from a memcg perspective, this is not a correctness issue though.

All memory backed volumes are removed when a pod reaches a terminal state.

The `kubelet` verifies that a pod's cgroup is deleted from the
host before deleting a pod from the API server as part of the
Contributor:

part of what?

Member Author:

graceful deletion ;-)


## Log basic cgroup management

The `kubelet` will log and collect metrics associated with cgroup manipulation.
Contributor:

Are the metrics already in place or is this a TODO?

Member Author:

this is a TODO, i am hoping between @sjenning and me, we can bang a PR out by monday.

we basically will need an instrumented CgroupManager
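A minimal sketch of what an instrumented CgroupManager wrapper could look like; the interface shape and log wording are assumptions for illustration, not the kubelet's actual CgroupManager API:

```
package main

import (
	"log"
	"time"
)

type cgroupManager interface {
	Create(name string) error
	Destroy(name string) error
}

// instrumentedManager wraps another manager and times each operation so
// creation/deletion latencies can be logged or exported as metrics.
type instrumentedManager struct {
	inner cgroupManager
}

func (m instrumentedManager) Create(name string) error {
	start := time.Now()
	err := m.inner.Create(name)
	log.Printf("cgroup create %q took %v (err=%v)", name, time.Since(start), err)
	return err
}

func (m instrumentedManager) Destroy(name string) error {
	start := time.Now()
	err := m.inner.Destroy(name)
	log.Printf("cgroup destroy %q took %v (err=%v)", name, time.Since(start), err)
	return err
}

type noopManager struct{}

func (noopManager) Create(string) error  { return nil }
func (noopManager) Destroy(string) error { return nil }

func main() {
	m := instrumentedManager{inner: noopManager{}}
	_ = m.Create("/kubepods/burstable/pod123")
	_ = m.Destroy("/kubepods/burstable/pod123")
}
```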

@derekwaynecarr force-pushed the cgroup_rollout branch 2 times, most recently from 13ee005 to 648809e on February 23, 2017 at 21:24
@derekwaynecarr (Member Author):

@vishh -- ready for another round....

@vishh (Contributor) commented Feb 24, 2017:

Just a couple more comment threads still open.

@vishh (Contributor) commented Feb 27, 2017:

@dchen1107 @davidopp @erictune @thockin

There has been a debate on naming the QoS over subscription policy introduced by this proposal. I will try to capture the discussion thus far.

  1. Requests are guaranteed. QoS policies let k8s oversubscribe requests in a safe manner.
  2. This proposal is introducing a new QoS level cgroup limits policy that will provide a means to control over subscription of requests / Guaranteed resources at the node level.
  3. The proposed flag name and hence user facing name (via docs) for this feature is --experimental-qos-reserve-requests.
  4. I find the proposed name to not be capturing the over subscription and policy aspects of this feature.
  5. I prefer to name this feature something in the lines of --experimental-qos-over-subscription-policy.
  6. There is a thought that the scheduler does not over subscribe in kubernetes by default because it guarantees requests. I feel that by not scheduling against limits the scheduler is always over subscribing. It is not over subscribing requests, but it is over subscribing capacity.
  7. In the case of the QoS policy feature in this proposal, though, we are attempting to control over subscription of requests, which leads to some confusion.
    Maybe an alternative name would be --experimental-qos-request-over-subscription-policy ?

@derekwaynecarr @sjenning did I miss anything?

EDIT:
It will be helpful to clarify for once what certain keywords mean to be consistent across the project. The ones specific to this proposal are reservations, requests, limits, over-subscription, over commitment, QoS

will attempt to limit the ability for a pod in a lower QoS tier to burst utilization
of a compressible resource that was requested by a pod in a higher QoS tier.

The `kubelet` will support a flag `experimental-qos-reserve-requests` that
Member Author:

to capture what was discussed via slack:

i want this flag to be called --experimental-qos-reserved

Member Author:

flag name updated.

@derekwaynecarr (Member Author):

@vishh -- what is missed is how i view this as doing the exact same thing for the qos tiers as we are doing with the system and kube tiers via system-reserved and kube-reserved, both of which statically reserve resources away from workloads with lower precedence in the hierarchy. the distinction here is that the reservation is not static (because it's based on pods scheduled). the action is still the same.

see use case discussion here for context:
https://github.com/kubernetes/community/pull/314/files#diff-adf1bc7c9f94f4ebefc6894c6ec50fd8R206

@thockin (Member) commented Feb 27, 2017:

I see the word "reserved" has multiple meanings, but I am OK with it. At least, I don't immediately see a better word - reservation feels right here.

This doesn't allow me to express "under-provision Burstable but fully fund Guaranteed", which seems like it might be a useful policy (and sort of maps to Borg tiers)?

This representation (qos-reserved) is hard to wrap one's head around, though. It doesn't mean to overcommit, if I understand. It simply modulates the availability of resources for lower tiers. Do we need a way for the scheduler to understand over-commit, too?


## Design
For example, the cgroup name `/Burstable/pod_123-456` is translated to a
Contributor:

Didn't you want to make all cgroup names lower case?

Member Author:

yes, per our discussion, i think the existing code should follow cgroup v2 naming conventions now rather than later.

see: https://www.kernel.org/doc/Documentation/cgroup-v2.txt
section: 2-6-2. Avoid Name Collisions

i will update the text, and open a pr.

Member Author:

opened kubernetes/kubernetes#42497 to update

node in the cgroup hierarchy below which the `kubelet` should manage individual
cgroup sandboxes. It is strongly recommended that users keep the default
value for `--cgroup-root` as `/` in order to avoid deep cgroup hierarchies. If
`--enforce-node-allocatable` is enabled per [node allocatable](node-allocatable.md),
Contributor:

This is not correct. kubepods is controlled by --cgroups-per-qos flag

Member Author:

fixed.

cluster services as Guaranteed workloads via a `DaemonSet` and would like a similar
resource reservation model as is provided via [node allocatable](node-allocatable)
for system and kubernetes daemons. In addition, depending how operators choose
to leverage BestEffort workloads, they may choose to make a similar decision
@vishh (Contributor, Mar 3, 2017):

This rationale calls for having separate configuration for G and Bu QoS levels. Something to consider before going to GA with this feature.

Member Author:

the rationale was not intended to be read in that manner. as you know, many may choose to deploy cluster daemons in burstable tier as well (GKE does with kube-proxy).

Contributor:

Got it. When I read your description, you try to point out that each use case can exist independent of the other. That's why I felt it is useful to have per QoS configuration.

Member Author:

clarified text to avoid confusion.

to the desired state prior to execution of the pod, but it will always
attempt to cap the existing usage of QoS cgroup sandboxes in lower tiers.
This does mean that the new pod could induce an OOM event at the `ROOT`
cgroup, but ideally per our QoS design, the oom_killer targets a pod
Contributor:

As a next step, we need to enable evictions at the QoS levels too to facilitate QoS enforcement.

Member Author:

added a note to Future enhancements.

reservation model for Guaranteed pods as we offer via enforcement of
node allocatable are encouraged to schedule their Guaranteed pods first
as it will ensure the Burstable and BestEffort tiers have had their QoS
memory limits appropriately ajdusted before taking unbounded workload on
Contributor:

nit: typo: adjusted

Member Author:

fixed

host before deleting a pod from the API server as part of the graceful
deletion process.

This ensures resource consumption associated with those volumes are not
@vishh (Contributor, Mar 3, 2017):

This is not totally true because we do not wait for the pod level cgroup usage to drop to 0 before deleting it.
I feel this is a TODO.

This feature does prevent crash-looping containers from taking up all the memory on the node and causing system OOMs, if node allocatable and evictions are enabled.

Member Author:

removed confusing text.

style syntax into transient slices, and as a result, it must follow `systemd`
conventions for path encoding.
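For illustration, a simplified sketch of this kind of cgroupfs-to-systemd translation; real systemd character escaping (and the kubelet's exact encoding rules) is more involved than this:

```
package main

import (
	"fmt"
	"strings"
)

// toSystemdSlice joins the cgroupfs path components with dashes and appends
// ".slice", which is the general shape of systemd slice naming.
func toSystemdSlice(cgroupfsPath string) string {
	parts := strings.Split(strings.Trim(cgroupfsPath, "/"), "/")
	if len(parts) == 1 && parts[0] == "" {
		return "-.slice" // the root slice
	}
	return strings.Join(parts, "-") + ".slice"
}

func main() {
	fmt.Println(toSystemdSlice("/Burstable/pod_123-456"))
	// Burstable-pod_123-456.slice (before any systemd character escaping)
}
```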

For example, the cgroup name `/Burstable/pod_123-456` is translated to a
Contributor:

nit: I'd appreciate a future PR to rename all names to lowercase.

Member Author:

i thought i caught them all, i will take another pass.

@vishh added the "lgtm" label ("Looks good to me", indicates that a PR is ready to be merged) on Mar 3, 2017
@vishh (Contributor) commented Mar 3, 2017:

LGTM.

Merging this PR because it's the design doc for a feature that is already merged into v1.6.

@vishh vishh merged commit 5390ec4 into kubernetes:master Mar 3, 2017
k8s-github-robot pushed a commit to kubernetes/kubernetes that referenced this pull request Mar 4, 2017
Automatic merge from submit-queue (batch tested with PRs 41919, 41149, 42350, 42351, 42285)

kubelet: enable qos-level memory limits

```release-note
Experimental support to reserve a pod's memory request from being utilized by pods in lower QoS tiers.
```

Enables the QoS-level memory cgroup limits described in kubernetes/community#314

**Note: QoS level cgroups have to be enabled for any of this to take effect.**

Adds a new `--experimental-qos-reserved` flag that can be used to set the percentage of a resource to be reserved at the QoS level for pod resource requests.

For example, `--experimental-qos-reserved="memory=50%"` means that if a Guaranteed pod sets a memory request of 2Gi, the Burstable and BestEffort QoS memory cgroups will have their `memory.limit_in_bytes` set to `NodeAllocatable - (2Gi*50%)` to reserve 50% of the guaranteed pod's request from being used by the lower QoS tiers.

If a Burstable pod sets a request, its reserve will be deducted from the BestEffort memory limit.

The result is that:
- Guaranteed limit matches the root cgroup and is not set by this code
- Burstable limit is `NodeAllocatable - Guaranteed reserve`
- BestEffort limit is `NodeAllocatable - Guaranteed reserve - Burstable reserve`

The only resource currently supported is `memory`; however, the code is generic enough that other resources can be added in the future.
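A worked sketch of the reservation arithmetic described in this note, using the 50% example above; the node sizes are made up for illustration:

```
package main

import "fmt"

func main() {
	const gi = int64(1) << 30
	nodeAllocatable := 16 * gi
	guaranteedRequests := 2 * gi // the Guaranteed pod's memory request
	burstableRequests := 1 * gi  // summed Burstable memory requests
	reservePct := int64(50)      // --experimental-qos-reserved=memory=50%

	// Burstable limit is NodeAllocatable minus the Guaranteed reserve;
	// BestEffort limit additionally subtracts the Burstable reserve.
	burstableLimit := nodeAllocatable - guaranteedRequests*reservePct/100
	bestEffortLimit := nodeAllocatable - (guaranteedRequests+burstableRequests)*reservePct/100

	fmt.Printf("Burstable memory.limit_in_bytes:  %d\n", burstableLimit)  // 15Gi
	fmt.Printf("BestEffort memory.limit_in_bytes: %d\n", bestEffortLimit) // 14.5Gi
}
```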

@derekwaynecarr @vishh
ruebenramirez pushed a commit to ruebenramirez/community that referenced this pull request Apr 22, 2017
Update pod resource management design and rollout plan
shyamjvs pushed a commit to shyamjvs/community that referenced this pull request Sep 22, 2017
MadhavJivrajani pushed a commit to MadhavJivrajani/community that referenced this pull request Nov 30, 2021
Update pod resource management design and rollout plan
danehans pushed a commit to danehans/community that referenced this pull request Jul 18, 2023