
keps: sig-node: initial pod overhead proposal #887

Merged: 2 commits merged into kubernetes:master from pod-overhead on Apr 4, 2019

Conversation


@egernst egernst commented Mar 11, 2019

Initial push for the pod overhead KEP.

@k8s-ci-robot k8s-ci-robot added the cncf-cla: yes, size/L, kind/kep, and sig/node labels on Mar 11, 2019
@egernst egernst changed the title from "keps: sig-node: initial pod overhead proposal" to "[WIP] keps: sig-node: initial pod overhead proposal" on Mar 12, 2019
@k8s-ci-robot k8s-ci-robot added the do-not-merge/work-in-progress label on Mar 12, 2019
Resolved review thread on keps/sig-node/20190226-pod-overhead.md.
```
runtimeHandler:
  type: string
  pattern: '^([a-z0-9]([-a-z0-9]*[a-z0-9])?(\.[a-z0-9]([-a-z0-9]*[a-z0-9])?)*)?$'
+ runtimeCpuReqOverhead:
```
Member:
For the larger audience: Do we have no way to get the quantity behavior into CRDs?
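
For context on the question above: native API types use resource.Quantity, which gives parsing and canonicalization for free, while CRD schemas typically fall back to string fields guarded by a validation pattern like the one shown. A minimal sketch, assuming a hypothetical controller that consumes such a CRD, of validating those strings with the real k8s.io/apimachinery/pkg/api/resource package:

```
package main

import (
	"fmt"

	"k8s.io/apimachinery/pkg/api/resource"
)

func main() {
	// CRD fields validated only by a string pattern still need to be parsed
	// into resource.Quantity by whatever consumes the CRD.
	for _, s := range []string{"250m", "120Mi", "not-a-quantity"} {
		q, err := resource.ParseQuantity(s)
		if err != nil {
			fmt.Printf("%q: invalid quantity: %v\n", s, err)
			continue
		}
		fmt.Printf("%q parsed as %s\n", s, q.String())
	}
}
```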

@derekwaynecarr derekwaynecarr left a comment:

I have questions about eviction, vertical pod autoscaling, and whether this inhibits our ability to do pod-level resource requirements in the future.

for a given runtimeClass. A mutating admission controller is introduced which will update the `Overhead`
field in the workload's `PodSpec` to match what is provided for the selected RuntimeClass, if one is specified.

Kubelet's creation of the pod cgroup will be calculated as the sum of container `ResourceRequirements` fields,
Member:
Do we need to account for overhead in how we handle eviction decisions? Right now, we evict based on usage relative to request. This may prove problematic, as you will potentially want to subtract overhead from the observed usage to make a more accurate decision.

Author (@egernst):

I made this more explicit in the proposal.
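
A rough sketch of the adjustment discussed in this thread: subtract the declared overhead from observed usage before comparing against requests, so sandbox costs are not attributed to the workload. exceedsRequests is a hypothetical helper, not the kubelet's actual eviction code; the types come from k8s.io/api and k8s.io/apimachinery.

```
package eviction

import (
	v1 "k8s.io/api/core/v1"
	"k8s.io/apimachinery/pkg/api/resource"
)

// exceedsRequests reports whether a pod's observed memory usage, minus the
// overhead declared for its RuntimeClass, exceeds the sum of its containers'
// memory requests.
func exceedsRequests(usage resource.Quantity, pod *v1.Pod, overhead v1.ResourceList) bool {
	// Subtract the sandbox overhead so it is not counted against the workload
	// when ranking pods for eviction.
	if oh, ok := overhead[v1.ResourceMemory]; ok {
		usage.Sub(oh)
	}

	var requests resource.Quantity
	for _, c := range pod.Spec.Containers {
		if r, ok := c.Resources.Requests[v1.ResourceMemory]; ok {
			requests.Add(r)
		}
	}
	return usage.Cmp(requests) > 0
}
```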

Resolved review thread on keps/sig-node/20190226-pod-overhead.md.

For scheduling, the pod resource requests are added to the container resource requests.

We don't currently enforce resource limits on the pod cgroup, but this becomes feasible once
Member:
I thought we had a flag to toggle this. The pod-level cgroup takes charges for memory-backed emptyDir volumes across container restart boundaries, so there are many potential sources of things that can get charged to the pod cgroup.
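
To make the scheduling rule quoted above concrete: the effective pod request for a resource would be the sum of the container requests plus the RuntimeClass overhead. A minimal sketch, assuming the Overhead field this KEP proposes on PodSpec (present in later releases of k8s.io/api); this is a hypothetical helper, not the actual scheduler code.

```
package scheduling

import (
	v1 "k8s.io/api/core/v1"
	"k8s.io/apimachinery/pkg/api/resource"
)

// podRequest sketches the proposed accounting: the value used for scheduling
// (and for sizing the pod cgroup) is the sum of the container requests for a
// resource plus the pod's declared overhead, if any.
func podRequest(pod *v1.Pod, name v1.ResourceName) resource.Quantity {
	var total resource.Quantity
	for _, c := range pod.Spec.Containers {
		if r, ok := c.Resources.Requests[name]; ok {
			total.Add(r)
		}
	}
	// Overhead stands in for the PodSpec field this KEP proposes, populated
	// by the RuntimeClass admission controller.
	if oh, ok := pod.Spec.Overhead[name]; ok {
		total.Add(oh)
	}
	return total
}
```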

Member:
Is there any concern with how a vertical pod autoscaler would work with this function? I thought the VPA would set container requests, but external tools like kubectl top will see the overhead charged to the container, so VPA sizing may be skewed. Is the VPA going to be RuntimeClass-aware? It cannot use the same estimation for an image run in a normal container versus a VM-based container.

Author (@egernst):
@derekwaynecarr - for the pod, the overhead would be observed.

Container level statistics are gathered directly from the cgroups in the guest, and do not include any sandbox overheads.

Author (@egernst):
Updated to clarify the scaling scenario.

Member:
I was not sure if this was the case with gVisor, but I followed up with @dchen1107, who confirmed it separately.


## Alternatives [optional]

In order to achieve proper handling of sandbox runtimes, the scheduler/resourceQuota handling needs to take
Member:
I think other alternatives can be discussed.

An alternative option would be to support pod-level resource requirements rather than just overhead. I think pod-level resource requirements are very useful for shared resources (hugepages, or memory when doing emptyDir volumes), and it's possible that doing overhead now may complicate our ability to do that later.

The benefit of overhead is that a Kubernetes service provider can potentially subsidize the charge-back model and eat the cost of the runtime choice, but charge the user for the cpu/memory consumed independent of runtime choice. There are pros and cons to either approach.

Author (@egernst):

Pod-level resource requirements make sense to me. I was originally trying to keep this out of the scope of this KEP, but it is indeed hard to do so. If pod-level resources existed, then adding pod overhead would be a small addition, IMO (probably just a mutating admission controller to augment based on the overhead for the registered RuntimeClass).

Author (@egernst):
Added this as a suggested alternative. TBH, though, I see this as an augmentation rather than an alternative; I think both could exist.

@egernst egernst changed the title from "[WIP] keps: sig-node: initial pod overhead proposal" to "keps: sig-node: initial pod overhead proposal" on Apr 3, 2019
@k8s-ci-robot k8s-ci-robot removed the do-not-merge/work-in-progress label on Apr 3, 2019
@egernst egernst force-pushed the pod-overhead branch 2 times, most recently from 741ecd6 to 72b5dbf on April 3, 2019 16:48

egernst commented Apr 3, 2019

@tallclair, @derekwaynecarr PTAL?

Since my proposal will also impact scheduling, would it make sense to include someone from the scheduling SIG?

Eric Ernst added 2 commits April 4, 2019 11:22
Signed-off-by: Eric Ernst <eric.ernst@intel.com>
to-be-squashed - including in history for review period only

Signed-off-by: Eric Ernst <eric.ernst@intel.com>

egernst commented Apr 4, 2019

...updated to align with RuntimeClass admission controller naming, as proposed in #909

```
Pod {
  Spec PodSpec {
    // Overhead is the resource overhead consumed by the Pod, not including
```
Member:
Rather than "the Pod", can we say "the overhead incurred from the container runtime"?

Member:
Or just "runtime".

pod specify a limit, then the sum of those limits becomes a pod-level limit, enforced through the
pod cgroup.

Users are not expected to manually set the pod resources; if a runtimeClass is being utilized,
Member:
Is there a reason we would not always discard this if no RuntimeClass is configured?
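
A sketch of the limit rule quoted above ("if all containers in a pod specify a limit, then the sum of those limits becomes a pod-level limit"), under the same assumptions as the earlier snippets. podLimit is a hypothetical helper, not kubelet code: a pod-level limit exists only when every container sets one, and the overhead is added on top before it is enforced through the pod cgroup.

```
package podcgroup

import (
	v1 "k8s.io/api/core/v1"
	"k8s.io/apimachinery/pkg/api/resource"
)

// podLimit returns the pod-level limit for a resource and whether one exists.
// If any container omits a limit, the pod cgroup limit stays unset; otherwise
// the declared overhead is added on top of the summed container limits.
func podLimit(pod *v1.Pod, name v1.ResourceName) (resource.Quantity, bool) {
	var total resource.Quantity
	for _, c := range pod.Spec.Containers {
		l, ok := c.Resources.Limits[name]
		if !ok {
			return resource.Quantity{}, false
		}
		total.Add(l)
	}
	if oh, ok := pod.Spec.Overhead[name]; ok {
		total.Add(oh)
	}
	return total, true
}
```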

In the scope of this KEP, the RuntimeClass controller will have a single job: set the pod overhead field in the
workload's PodSpec according to the runtimeClass specified.

It is expected that only the RuntimeClass controller will set Pod.Spec.Overhead. If a value is provided
Member:
I think we should strip the overhead declaration from any pod that has no associated RuntimeClass.
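
A minimal sketch of the controller's single job as described above, including the stripping behavior suggested in this thread. mutateOverhead and getOverhead are hypothetical names, and the Overhead field is the one this KEP proposes.

```
package admission

import v1 "k8s.io/api/core/v1"

// mutateOverhead sketches the RuntimeClass controller's job plus the stripping
// suggested in this thread. getOverhead is a hypothetical lookup of the
// overhead registered for a RuntimeClass.
func mutateOverhead(pod *v1.Pod, getOverhead func(name string) v1.ResourceList) {
	if pod.Spec.RuntimeClassName == nil {
		// No RuntimeClass: strip any user-supplied overhead declaration.
		pod.Spec.Overhead = nil
		return
	}
	// Overwrite whatever the user set with the value registered for the class.
	pod.Spec.Overhead = getOverhead(*pod.Spec.RuntimeClassName)
}
```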

@derekwaynecarr:

Before this moves to implementable, I would like to see graduation criteria and to understand whether this is a separate feature gate from the existing RuntimeClass feature gate. I would also like to prevent users from populating overhead if a RuntimeClass is not associated with a pod.

@derekwaynecarr:

/approve
/lgtm

@k8s-ci-robot k8s-ci-robot added the lgtm label on Apr 4, 2019
@k8s-ci-robot:

[APPROVALNOTIFIER] This PR is APPROVED

This pull-request has been approved by: derekwaynecarr, egernst

The full list of commands accepted by this bot can be found here.

The pull request process is described here

Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

@k8s-ci-robot k8s-ci-robot added the approved label on Apr 4, 2019
@k8s-ci-robot k8s-ci-robot merged commit bb79ab4 into kubernetes:master Apr 4, 2019
@egernst egernst deleted the pod-overhead branch April 4, 2019 20:31
@tallclair tallclair left a comment


Sorry for not getting these comments in before this merged. Would you mind opening a follow-up PR to address these (and some of @derekwaynecarr's open concerns)?

```
@@ -0,0 +1,310 @@
---
title: KEP Template
```
Member (@tallclair):
Please fix.

```
authors:
  - "@egernst"
owning-sig: sig-node
participating-sigs:
```
Member (@tallclair):
sig-scheduling?
sig-autoscaling?
wg-resource-management

```
status: provisional
---

# pod overhead
```
Member (@tallclair):
nit: capitalize


## Table of Contents

Tools for generating: https://github.com/ekalinin/github-markdown-toc
Member (@tallclair):
Please fix (run the tool, paste the output here - I usually drop the bullets for the title & TOC sections)


### Non-Goals

* Making runtimeClass selections
Member (@tallclair):
A few more that come to mind:

* auto-detecting overhead
* per-container overhead
* enforcement of overhead (pod cgroup limits) - actually, maybe this is covered?


## Drawbacks [optional]

This KEP introduceds further complexity, and adds a field the PodSpec which users aren't expected to modify.
Member (@tallclair):
typo

Suggested change:
- This KEP introduceds further complexity, and adds a field the PodSpec which users aren't expected to modify.
+ This KEP introduces further complexity, and adds a field the PodSpec which users aren't expected to modify.

Even if this were to be introduced, there is a benefit in keeping the overhead separate.
- post-pod creation handling of pod events: if runtimeClass definition is removed after a pod is created,
it will be very complicated to calculate which part of the pod resource requirements were associated with
the workloads versus the sandbox overhead.
Member (@tallclair):
Does that matter?

to add a sandbox overhead when applicable.

Pros:
* no changes to the pod spec
Member (@tallclair):
And the user doesn't have the option of setting the field.

Cons:
* handling of the pod overhead is spread out across a few components
* Not user perceptible from a workload perspective.
* very complicated if the runtimeClass policy changes after workloads are running
Member (@tallclair):
It's worth noting that we don't merge the runtime handler into the pod spec. The RuntimeClass is immutable, although it can still be deleted and recreated. We could add special handling to prevent the RuntimeClass from being deleted as long as there are pods running with it.
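
A sketch of that special handling as a validating admission check; rejectDeleteIfInUse and listPods are hypothetical names, not an existing API.

```
package runtimeclass

import (
	"fmt"

	v1 "k8s.io/api/core/v1"
)

// rejectDeleteIfInUse refuses to delete a RuntimeClass while any pod still
// references it, preventing the orphaning problem described above.
func rejectDeleteIfInUse(className string, listPods func() ([]v1.Pod, error)) error {
	pods, err := listPods()
	if err != nil {
		return err
	}
	for _, p := range pods {
		if p.Spec.RuntimeClassName != nil && *p.Spec.RuntimeClassName == className {
			return fmt.Errorf("RuntimeClass %q is in use by pod %s/%s", className, p.Namespace, p.Name)
		}
	}
	return nil
}
```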

runtime choice, but charge the user for the cpu/memory consumed independent of runtime choice.


### Leaving the PodSpec unchanged
Member (@tallclair):
I had been assuming that we would merge the overhead into the PodSpec, but now I'm second-guessing that...

@tallclair:

Also, I forgot to say that this is awesome, especially as your first contribution to Kubernetes! Thanks so much for taking this project on!
