
kep: pod-overhead: clarify handling of Overhead #939

Merged
1 commit merged into kubernetes:master on Apr 19, 2019

Conversation

@egernst egernst commented Apr 6, 2019

Clarify what Overhead is in the pod spec, and behavior when
this is manually defined without a runtimeClass.

Clarify other open items in the proposal based on feedback.

@k8s-ci-robot k8s-ci-robot added kind/kep Categorizes KEP tracking issues and PRs modifying the KEP directory sig/node Categorizes an issue or PR as relevant to SIG Node. cncf-cla: yes Indicates the PR's author has signed the CNCF CLA. size/L Denotes a PR that changes 100-499 lines, ignoring generated files. labels Apr 6, 2019
@egernst egernst (Author) commented Apr 6, 2019

/cc @tallclair

@k8s-ci-robot (Contributor):

@egernst: GitHub didn't allow me to request PR reviews from the following users: mcastelino.

Note that only kubernetes members and repo collaborators can review this PR, and authors cannot review their own PRs.

In response to this:

> I like the % based concept, but need to think more on this before commenting on the suggested API. Based on this initial reaction, I tend to think the "door open" may be a good first step.

> /cc @mcastelino

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.

-Users are not expected to manually set the pod resources; if a runtimeClass is being utilized,
-the manual value will be discarded. See RuntimeController for the proposal for setting these
-resources.
+Users are not expected to manually set `Overhead`; any prior values will be discarded during admission. If runtimeClass
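To make the described admission behavior concrete, here is a minimal sketch in Go (hypothetical types and names, not the actual RuntimeClass admission controller code): any user-supplied Overhead is dropped, and the overhead from the selected runtimeClass, if one is set, is stamped onto the pod.

package admission

import "fmt"

// ResourceList maps a resource name ("cpu", "memory") to a quantity string.
type ResourceList map[string]string

// RuntimeClass carries the admin-declared per-pod overhead.
type RuntimeClass struct {
    Name     string
    Overhead ResourceList
}

// PodSpec holds only the fields relevant to this sketch.
type PodSpec struct {
    RuntimeClassName *string
    Overhead         ResourceList
}

// applyOverhead discards any manually set Overhead and, when a
// runtimeClass is selected, copies its overhead onto the pod.
func applyOverhead(pod *PodSpec, classes map[string]RuntimeClass) error {
    pod.Overhead = nil // manual values are discarded at admission
    if pod.RuntimeClassName == nil {
        return nil // no runtimeClass selected: no overhead is applied
    }
    rc, ok := classes[*pod.RuntimeClassName]
    if !ok {
        return fmt.Errorf("runtimeClass %q not found", *pod.RuntimeClassName)
    }
    pod.Overhead = rc.Overhead
    return nil
}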
@egernst (Author):

@derekwaynecarr had good feedback on the initial PR re: using a subresource, similar to how binding is handled, in order to make sure the end user wouldn't be setting this. Curious about your thoughts, @tallclair

@egernst egernst force-pushed the pod-overhead branch 2 times, most recently from 31cbcff to b556a3f Compare April 6, 2019 05:37
@@ -82,7 +106,7 @@
 introduced which will update the `Overhead` field in the workload's `PodSpec` to
 what is provided for the selected RuntimeClass, if one is specified.

 Kubelet's creation of the pod cgroup will be calculated as the sum of container
-`ResourceRequirements` fields, plus the Overhead associated with the pod.
+`ResourceRequirements.Limits` fields, plus the Overhead associated with the pod.

 The scheduler, resource quota handling, and Kubelet's pod cgroup creation and eviction handling
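As a rough sketch of that sizing rule (simplified to memory in bytes; the real kubelet works with resource.Quantity values and handles CPU as well):

package podcgroup

// container holds a single container's memory limit in bytes.
type container struct {
    memoryLimitBytes int64
}

// podCgroupMemoryBytes implements the rule above: the pod-level cgroup is
// sized as the sum of the containers' limits plus the pod's Overhead.
func podCgroupMemoryBytes(containers []container, overheadBytes int64) int64 {
    total := overheadBytes
    for _, c := range containers {
        total += c.memoryLimitBytes
    }
    return total
}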
@egernst (Author):

TODO: currently not clear how resource quota handling will take Overhead into account. Probably makes sense to make this optional (default to subtract the overhead if applicable to better match the current status quo), and leave it to the administrator to decide.

Similarly, early RFC discussions included augmenting the CRI interface. This should be added:

> I suggest adding the LinuxContainerResources message to the LinuxPodSandboxConfig,
> as an optional field. Unlike the Resources field on the Kubernetes PodSpec, the resources
> field on the LinuxPodSandboxConfig matches the pod-level limits (i.e. the total of pod &
> container limits). The field is only provided when a pod-level limit is enforced.

@tallclair (Member):
CRI extension SGTM.

@tallclair (Member):
For resource quota, I think @derekwaynecarr or @bsalamat might be able to provide some guidance.

Just my 2c - I think a configurable option makes sense, but I still think we should treat the decision of the default behaviour as if there wasn't an option (because in practice very few will change it). I think accounting the overhead to the user makes sense as a default.

@derekwaynecarr (Member):
I prefer to charge the user for the overhead by default.

Member:
+1 to @derekwaynecarr's point

@egernst egernst (Author) commented Apr 10, 2019:
Updated accordingly. PTAL.

@tallclair tallclair left a comment:
Thanks!


@k8s-ci-robot k8s-ci-robot added the kind/api-change Categorizes issue or PR as related to adding, removing, or otherwise changing an API label Apr 8, 2019
@fejta-bot:

This PR may require API review.

If so, when the changes are ready, complete the pre-review checklist and request an API review.

Status of requested reviews is tracked in the API Review project.

@egernst egernst force-pushed the pod-overhead branch 2 times, most recently from 7436d22 to 1450f17 Compare April 10, 2019 03:12
@tallclair tallclair left a comment:

Thanks Eric. Almost there...

The LinuxContainerResources message is added to the LinuxPodSandboxConfig, as an optional field. Unlike
the Resources field on the Kubernetes PodSpec, the resources field in the LinuxPodSandboxConfig matches
the pod-level limits (i.e. the total of pod overhead & container limits). The field is only provided for
pods where each container provides limits (i.e., guaranteed pods and a subset of burstable pods).
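A sketch of the guard described in this hunk (a hypothetical helper, simplified to memory): the pod-level value is computed only when every container declares a limit, and it includes the overhead.

package sandbox

// containerSpec is simplified to a single memory limit; 0 means unset.
type containerSpec struct {
    memoryLimitBytes int64
}

// podLevelMemoryBytes returns overhead plus the summed container limits.
// It returns false when any container lacks a limit, in which case the
// resources field would be omitted from the LinuxPodSandboxConfig.
func podLevelMemoryBytes(containers []containerSpec, overheadBytes int64) (int64, bool) {
    total := overheadBytes
    for _, c := range containers {
        if c.memoryLimitBytes == 0 {
            return 0, false // not all containers set limits; omit the field
        }
        total += c.memoryLimitBytes
    }
    return total, true
}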
@tallclair (Member):
How does Kata handle those other pods? Would having the pod overhead still be useful in those situations?

@egernst egernst (Author) commented Apr 12, 2019:
I think it could be useful to make this information available in either situation, though with no limits defined, we'd just see the overhead then, which wouldn't be useful for sizing.

The provided information would be more useful if we provided the overhead + container requests separately.

As for what is done today:

  • If requests and limits are not provided, the sizing of the virtual machine is based on Kata defaults (default configurable parameters for Kata, currently a single vCPU and 2048 MB of memory). This isn't optimal, obviously, as the resulting best-effort pod will be pretty limited. For realistic performance, Kata users should provide initial resource values for the pod, and rely on HPA or VPA to adjust if necessary.
  • We adjust the VM sizing each time we receive a container spec, and only if limits and/or requests are provided.

Receiving this field, including overhead, will help the runtime know an appropriate initial size up front (at VM creation time), potentially avoiding the need to read each container spec and hotplug CPUs/memory.
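For illustration, the up-front sizing Eric describes might look like the following sketch (assuming the defaults quoted in the comment, 1 vCPU and 2048 MB; this is not Kata's actual implementation):

package vmsize

const (
    defaultVCPUs = 1    // Kata default when no requests/limits are given
    defaultMemMB = 2048 // likewise
)

// initialVMSize returns a starting VM size. With pod-level totals
// (including overhead) available at sandbox creation time, the runtime
// can size the VM once up front instead of hotplugging per container.
func initialVMSize(sumVCPULimits, sumMemLimitsMB int) (vcpus, memMB int) {
    vcpus, memMB = defaultVCPUs, defaultMemMB
    if sumVCPULimits > 0 {
        vcpus = sumVCPULimits
    }
    if sumMemLimitsMB > 0 {
        memMB = sumMemLimitsMB
    }
    return vcpus, memMB
}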

@tallclair tallclair left a comment:
Thanks. LGTM with one question about the CRI API.

totals, as optional fields:

type LinuxPodSandboxConfig struct {
    Overhead *LinuxContainerResources
@tallclair (Member):

Is there a use case for providing overhead to the CRI? Otherwise it seems like just providing a pod total (sum of these 2 fields) would cover the use case of sizing the kata vm?

@egernst egernst (Author) commented Apr 16, 2019:

@tallclair: the use case is being able to determine the non-overhead amount, actually. In many (most?) cases, the overhead could account for hypervisor impact on the host, rather than guest components impact on the workload.

@tallclair (Member):

I see. I think of the runtime as being the source-of-truth for the overhead, and the runtimeClass piece as just communicating that info to Kubernetes. I'm not strictly against it, but an alternative is to just have Kata subtract its known overhead amount.

Also, isn't the guest OS overhead significant? Or is the overhead dominated by the hypervisor?

@egernst (Author):

> I see. I think of the runtime as being the source-of-truth for the overhead, and the runtimeClass piece as just communicating that info to Kubernetes. I'm not strictly against it, but an alternative is to just have Kata subtract its known overhead amount.

It should, agreed, but a system administrator could create their own runtimeClass definitions, and the runtime wouldn't necessarily have visibility into what is being provided 'above.' Consider a scenario where the sysadmin defines two runtimeClasses with differing overheads, because one is for dedicated heavy IO while the other is for compute workloads. The OCI runtime wouldn't necessarily know in which context it is being called. This may not be a strong use case, but for a couple of bytes this helps leave the door open.

> Also, isn't the guest OS overhead significant? Or is the overhead dominated by the hypervisor?

It depends. For a vCPU it's negligible inside the guest, but more important on the host (e.g., consider high-bandwidth IO and the work IO threads would carry out on the host).

@tallclair (Member):

Makes sense. It still seems like the runtime needs some understanding of the overhead to know whether it should be allocated to the VM though. I'm happy to defer to you on this, but I'd like @yujuhong or @derekwaynecarr to sign off too.

@egernst egernst (Author) commented Apr 17, 2019:

Ah, I didn't take the Windows sandbox into consideration here; this CRI API section is a bit Linux-heavy. Wouldn't the WindowsContainerResources only be useful in a Windows equivalent of the LinuxPodSandboxConfig?

RE: the translation from ResourceRequirements to LinuxContainerResources (and the Windows equivalent): I understand your point; not all the fields will be used. TBH, I'm not sure which is worse: not using a couple of fields, or having another structure in the API that is almost the same. I haven't touched this space and would seek input from @tallclair.

@tallclair (Member):

Makes sense, thanks for weighing in Patrick. Here's my proposal: keep the current linux API as Eric has specified it, and add a parallel windows API:

message WindowsPodSandboxConfig {  // A new type, added to the PodSandboxConfig
  Overhead *WindowsContainerResources
  ContainerResources *WindowsContainerResources
}

As an aside, I don't love the field name "ContainerResources", but don't have a better suggestion...

@PatrickLang (Contributor):

A separate WindowsPodSandboxConfig sounds good to me.
And I'm terrible at naming :-|

@yujuhong (Contributor):

Sorry to chime in so late. I don't completely get why separate overhead and container resource fields are required. According to the thread below, they will be summed together for sizing the VM. Did I miss something?

@egernst (Author):

@yujuhong they aren't necessarily both used to size the VM. The final text (not thread) should highlight this. PTAL @ #976

@tallclair (Member):
/assign @tallclair @yujuhong @derekwaynecarr

type LinuxPodSandboxConfig struct {
    Overhead *LinuxContainerResources
    ContainerResources *LinuxContainerResources
}


In summation
VM Size = sum(ContainerResources)
Host Cgroup size = sum(Overhead)

@egernst (Author):

Well, no - host cgroup size = sum(overhead, containerResources).
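Plugging in illustrative numbers makes the two sums concrete (the quantities here are made up):

package example

// Suppose the runtimeClass declares 120 MiB of overhead and the summed
// container limits come to 2048 MiB.
const (
    overheadMiB           = 120
    containerResourcesMiB = 2048
)

// On Linux/Kata, the guest/VM is sized from the container resources alone,
// while the host pod cgroup charges both.
var (
    vmSizeMiB     = containerResourcesMiB               // 2048
    hostCgroupMiB = overheadMiB + containerResourcesMiB // 2168
)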

@PatrickLang (Contributor):

and I'd make the Sandbox Size = Hyper-V partition size = sum(overhead, containerResources.Limit) on Windows.

@yujuhong (Contributor):

Container requests or limits in this case? Do we enforce that requests == limits for sandboxed pods?

@egernst (Author):

Hey @yujuhong -- it is limits. We don't force limits and requests to be equal (guaranteed), but do need a limit defined for appropriate VM sizing.

@derekwaynecarr derekwaynecarr left a comment:

Just the one comment about not modifying the CRI if the use case is not strong. How is the overhead being assigned? Is it a separate subresource on the pod?

@egernst egernst (Author) commented Apr 17, 2019:

> Just the one comment about not modifying the CRI if the use case is not strong. How is the overhead being assigned? Is it a separate subresource on the pod?

@derekwaynecarr - it is assigned through the RuntimeClass controller at admission time.

@derekwaynecarr (Member):

Followed up separately:

The CRI change is useful for hypervisors that don't have hot-swap support. Admission can set the overhead, so a separate subresource is not needed.

/lgtm
/approve

@k8s-ci-robot k8s-ci-robot added lgtm "Looks good to me", indicates that a PR is ready to be merged. approved Indicates a PR has been approved by an approver from all required OWNERS files. labels Apr 17, 2019
- the pod spec can be referenced directly from the scheduler, resourceQuota controller and kubelet, instead of referencing
a runtimeClass object which could have possibly been removed.
The pod cgroup is managed by the Kubelet, so passing the pod-level resource to the CRI implementation
is not strictly necessary. However, some runtimes may wish to take advantage of this information, for
@PatrickLang (Contributor):

Providing overhead across CRI is necessary for Windows as we implement a runtimeclass for Hyper-V. There's no cgroup that a limit can be inherited from, so the sandbox resources need to be determined at creation time. OS overhead will be on the order of hundreds of MB for a Windows sandbox and may vary release to release as Windows inbox features are tuned, added or cut.

@egernst (Author):

Thanks @PatrickLang - with each release version, then, the runtimeClass should be updated. Glad to hear this CRI addition is useful for your use case as well.

authors:
- "@egernst"
owning-sig: sig-node
participating-sigs:
- sig-scheduling
- sig-autoscaling
Contributor:

since CRI is involved, should SIG-Node and SIG-Windows be added?

@k8s-ci-robot k8s-ci-robot removed the lgtm "Looks good to me", indicates that a PR is ready to be merged. label Apr 18, 2019
@egernst egernst (Author) commented Apr 18, 2019:

/lgtm

just pushed a typo fix.

@k8s-ci-robot (Contributor):

@egernst: you cannot LGTM your own PR.

In response to this:

> /lgtm
>
> just pushed a typo fix.


@tallclair (Member):

@egernst Can you add the windows CRI API as discussed? Then LGTM. Thanks!

Updates to pod-overhead based on review discussion:
 - Clarify what Overhead is in the pod spec, and behavior when
 this is manually defined without a runtimeClass.
 - Clarify ResourceQuota changes necessary
 - Add in CRI API change to make pod details available
 - Define feature gate
 - Update runtimeClass definition

Fixes: kubernetes#688

Signed-off-by: Eric Ernst <eric.ernst@intel.com>
@tallclair (Member):

/lgtm

@k8s-ci-robot k8s-ci-robot added the lgtm "Looks good to me", indicates that a PR is ready to be merged. label Apr 19, 2019
@k8s-ci-robot (Contributor):

[APPROVALNOTIFIER] This PR is APPROVED

This pull-request has been approved by: derekwaynecarr, egernst, tallclair

The full list of commands accepted by this bot can be found here.

The pull request process is described here

Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

@k8s-ci-robot k8s-ci-robot merged commit 180a92b into kubernetes:master Apr 19, 2019
- overhead, once added to the spec, stays with the workload, even if runtimeClass is redefined
or removed.
- the pod spec can be referenced directly from scheduler, resourceQuota controller and kubelet,
instead of referencing a runtimeClass object which could have possibly been removed.
Contributor:

I assume the overhead is not mutable? Didn't see it in the KEP. Maybe I've missed it.

Member:

All PodSpec and RuntimeClass fields are immutable by default, but probably worth explicitly stating.

@egernst (Author):

ACK - added to #976


@egernst egernst deleted the pod-overhead branch April 22, 2019 23:02