
kep: pod-overhead: clarify handling of Overhead #939

Merged
1 commit merged into kubernetes:master on Apr 19, 2019

Conversation

@egernst egernst commented Apr 6, 2019

Clarify what Overhead is in the pod spec, and behavior when
this is manually defined without a runtimeClass.

Clarify other open items in the proposal based on feedback.

@k8s-ci-robot k8s-ci-robot added kind/kep Categorizes KEP tracking issues and PRs modifying the KEP directory sig/node Categorizes an issue or PR as relevant to SIG Node. cncf-cla: yes Indicates the PR's author has signed the CNCF CLA. size/L Denotes a PR that changes 100-499 lines, ignoring generated files. labels Apr 6, 2019
@egernst egernst (Author) commented Apr 6, 2019

/cc @tallclair

@k8s-ci-robot (Contributor):

@egernst: GitHub didn't allow me to request PR reviews from the following users: mcastelino.

Note that only kubernetes members and repo collaborators can review this PR, and authors cannot review their own PRs.

In response to this:

> I like the % based concept, but need to think more on this before commenting on the suggested API. Based on this initial reaction, I tend to think the "door open" may be a good first step.

> /cc @mcastelino

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.

-Users are not expected to manually set the pod resources; if a runtimeClass is being utilized,
-the manual value will be discarded. See RuntimeController for the proposal for setting these
-resources.
+Users are not expected to manually set `Overhead`; any prior values will be discarded during admission. If runtimeClass
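To make the described admission behavior concrete, here is a minimal sketch in Go (hypothetical types and names, not the actual RuntimeClass admission controller code): any user-supplied Overhead is dropped, and the overhead from the selected runtimeClass, if one is set, is stamped onto the pod.

package admission

import "fmt"

// ResourceList maps a resource name ("cpu", "memory") to a quantity string.
type ResourceList map[string]string

// RuntimeClass carries the admin-declared per-pod overhead.
type RuntimeClass struct {
    Name     string
    Overhead ResourceList
}

// PodSpec holds only the fields relevant to this sketch.
type PodSpec struct {
    RuntimeClassName *string
    Overhead         ResourceList
}

// applyOverhead discards any manually set Overhead and, when a
// runtimeClass is selected, copies its overhead onto the pod.
func applyOverhead(pod *PodSpec, classes map[string]RuntimeClass) error {
    pod.Overhead = nil // manual values are discarded at admission
    if pod.RuntimeClassName == nil {
        return nil // no runtimeClass selected: no overhead is applied
    }
    rc, ok := classes[*pod.RuntimeClassName]
    if !ok {
        return fmt.Errorf("runtimeClass %q not found", *pod.RuntimeClassName)
    }
    pod.Overhead = rc.Overhead
    return nil
}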
@egernst (Author):

@derekwaynecarr had good feedback on the initial PR re: using a subresource, similar to how binding is handled, in order to make sure the end user wouldn't be setting this. Curious about your thoughts, @tallclair

@egernst egernst force-pushed the pod-overhead branch 2 times, most recently from 31cbcff to b556a3f Compare April 6, 2019 05:37
@@ -82,7 +106,7 @@
 introduced which will update the `Overhead` field in the workload's `PodSpec` to
 what is provided for the selected RuntimeClass, if one is specified.

 Kubelet's creation of the pod cgroup will be calculated as the sum of container
-`ResourceRequirements` fields, plus the Overhead associated with the pod.
+`ResourceRequirements.Limits` fields, plus the Overhead associated with the pod.

 The scheduler, resource quota handling, and Kubelet's pod cgroup creation and eviction handling
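As a rough sketch of that sizing rule (simplified to memory in bytes; the real kubelet works with resource.Quantity values and handles CPU as well):

package podcgroup

// container holds a single container's memory limit in bytes.
type container struct {
    memoryLimitBytes int64
}

// podCgroupMemoryBytes implements the rule above: the pod-level cgroup is
// sized as the sum of the containers' limits plus the pod's Overhead.
func podCgroupMemoryBytes(containers []container, overheadBytes int64) int64 {
    total := overheadBytes
    for _, c := range containers {
        total += c.memoryLimitBytes
    }
    return total
}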
@egernst (Author):

TODO: currently not clear how resource quota handling will take Overhead into account. Probably makes sense to make this optional (default to subtract the overhead if applicable to better match the current status quo), and leave it to the administrator to decide.

Similarly, early RFC discussions included augmenting the CRI interface. This should be added:

> I suggest adding the LinuxContainerResources message to the LinuxPodSandboxConfig,
> as an optional field. Unlike the Resources field on the Kubernetes PodSpec, the resources
> field on the LinuxPodSandboxConfig matches the pod-level limits (i.e. the total of pod &
> container limits). The field is only provided when a pod-level limit is enforced.

@tallclair (Member):
CRI extension SGTM.

@tallclair (Member):
For resource quota, I think @derekwaynecarr or @bsalamat might be able to provide some guidance.

Just my 2c - I think a configurable option makes sense, but I still think we should treat the decision of the default behaviour as if there wasn't an option (because in practice very few will change it). I think accounting the overhead to the user makes sense as a default.

@derekwaynecarr (Member):
I prefer to charge the user for the overhead by default.

Member:
+1 to @derekwaynecarr's point

@egernst egernst (Author) commented Apr 10, 2019:
Updated accordingly. PTAL.

@tallclair tallclair left a comment:
Thanks!


@k8s-ci-robot k8s-ci-robot added the kind/api-change Categorizes issue or PR as related to adding, removing, or otherwise changing an API label Apr 8, 2019
@fejta-bot:

This PR may require API review.

If so, when the changes are ready, complete the pre-review checklist and request an API review.

Status of requested reviews is tracked in the API Review project.

@egernst egernst force-pushed the pod-overhead branch 2 times, most recently from 7436d22 to 1450f17 Compare April 10, 2019 03:12
@tallclair tallclair left a comment:

Thanks Eric. Almost there...

The LinuxContainerResources message is added to the LinuxPodSandboxConfig, as an optional field. Unlike
the Resources field on the Kubernetes PodSpec, the resources field in the LinuxPodSandboxConfig matches
the pod-level limits (i.e. the total of pod overhead & container limits). The field is only provided for
pods where each container provides limits (i.e., guaranteed pods and a subset of burstable pods).
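A sketch of the guard described in this hunk (a hypothetical helper, simplified to memory): the pod-level value is computed only when every container declares a limit, and it includes the overhead.

package sandbox

// containerSpec is simplified to a single memory limit; 0 means unset.
type containerSpec struct {
    memoryLimitBytes int64
}

// podLevelMemoryBytes returns overhead plus the summed container limits.
// It returns false when any container lacks a limit, in which case the
// resources field would be omitted from the LinuxPodSandboxConfig.
func podLevelMemoryBytes(containers []containerSpec, overheadBytes int64) (int64, bool) {
    total := overheadBytes
    for _, c := range containers {
        if c.memoryLimitBytes == 0 {
            return 0, false // not all containers set limits; omit the field
        }
        total += c.memoryLimitBytes
    }
    return total, true
}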
@tallclair (Member):
How does Kata handle those other pods? Would having the pod overhead still be useful in those situations?

@egernst egernst (Author) commented Apr 12, 2019:
I think it could be useful to make this information available in either situation, though with no limits defined, we'd just see the overhead then, which wouldn't be useful for sizing.

The provided information would be more useful if we provided the overhead + container requests separately.

As for what is done today:

  • If requests and limits are not provided, the sizing of the virtual machine is based on Kata defaults (default configurable parameters for Kata, currently a single vCPU and 2048 MB of memory). This isn't optimal, obviously, as the resulting best-effort pod will be pretty limited. For realistic performance, Kata users should provide initial resource values for the pod, and rely on HPA or VPA to adjust if necessary.
  • We adjust the VM sizing each time we receive a container spec, and only if limits and/or requests are provided.

Receiving this field, including overhead, will help the runtime know an appropriate initial size up front (at VM creation time), potentially avoiding the need to read each container spec and hotplug CPUs/memory.
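For illustration, the up-front sizing Eric describes might look like the following sketch (assuming the defaults quoted in the comment, 1 vCPU and 2048 MB; this is not Kata's actual implementation):

package vmsize

const (
    defaultVCPUs = 1    // Kata default when no requests/limits are given
    defaultMemMB = 2048 // likewise
)

// initialVMSize returns a starting VM size. With pod-level totals
// (including overhead) available at sandbox creation time, the runtime
// can size the VM once up front instead of hotplugging per container.
func initialVMSize(sumVCPULimits, sumMemLimitsMB int) (vcpus, memMB int) {
    vcpus, memMB = defaultVCPUs, defaultMemMB
    if sumVCPULimits > 0 {
        vcpus = sumVCPULimits
    }
    if sumMemLimitsMB > 0 {
        memMB = sumMemLimitsMB
    }
    return vcpus, memMB
}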

@tallclair tallclair left a comment:
Thanks. LGTM with one question about the CRI API.

totals, as optional fields:

type LinuxPodSandboxConfig struct {
    Overhead *LinuxContainerResources
@tallclair (Member):

Is there a use case for providing overhead to the CRI? Otherwise it seems like just providing a pod total (sum of these 2 fields) would cover the use case of sizing the kata vm?

@egernst egernst (Author) commented Apr 16, 2019:

@tallclair: the use case is being able to determine the non-overhead amount, actually. In many (most?) cases, the overhead could account for hypervisor impact on the host, rather than guest components impact on the workload.

@tallclair (Member):

I see. I think of the runtime as being the source-of-truth for the overhead, and the runtimeClass piece as just communicating that info to Kubernetes. I'm not strictly against it, but an alternative is to just have Kata subtract its known overhead amount.

Also, isn't the guest OS overhead significant? Or is the overhead dominated by the hypervisor?

@egernst (Author):

> I see. I think of the runtime as being the source-of-truth for the overhead, and the runtimeClass piece as just communicating that info to Kubernetes. I'm not strictly against it, but an alternative is to just have Kata subtract its known overhead amount.

It should, agreed, but a system administrator could create their own runtimeClass definitions, and the runtime wouldn't necessarily have visibility into what is being provided 'above.' Consider a scenario where the sysadmin defines two runtimeClasses with differing overheads, because one is for dedicated heavy IO while the other is for compute workloads. The OCI runtime wouldn't necessarily know in which context it is being called. This may not be a strong use case, but for a couple of bytes this helps leave the door open.

> Also, isn't the guest OS overhead significant? Or is the overhead dominated by the hypervisor?

It depends. For a vCPU it's negligible inside the guest, but more important on the host (e.g., consider high-bandwidth IO and the work IO threads would carry out on the host).

@tallclair (Member):

Makes sense. It still seems like the runtime needs some understanding of the overhead to know whether it should be allocated to the VM though. I'm happy to defer to you on this, but I'd like @yujuhong or @derekwaynecarr to sign off too.

@egernst egernst (Author) commented Apr 17, 2019:

Ah, I didn't take the Windows sandbox into consideration here; this CRI API section is a bit Linux-heavy. Wouldn't the WindowsContainerResources only be useful in a Windows equivalent of the LinuxPodSandboxConfig?

RE: the translation from ResourceRequirements to LinuxContainerResources (and the Windows equivalent): I understand your point; not all the fields will be used. TBH, I'm not sure which is worse: not using a couple of fields, or having another structure in the API that is almost the same. I haven't touched this space and would seek input from @tallclair.

@tallclair (Member):

Makes sense, thanks for weighing in Patrick. Here's my proposal: keep the current linux API as Eric has specified it, and add a parallel windows API:

message WindowsPodSandboxConfig {  // A new type, added to the PodSandboxConfig
  Overhead *WindowsContainerResources
  ContainerResources *WindowsContainerResources
}

As an aside, I don't love the field name "ContainerResources", but don't have a better suggestion...

@PatrickLang (Contributor):

A separate WindowsPodSandboxConfig sounds good to me.
And I'm terrible at naming :-|

@yujuhong (Contributor):

Sorry to chime in so late. I don't completely get why separate overhead and container resource fields are required. According to the thread below, they will be summed together for sizing the VM. Did I miss something?

@egernst (Author):

@yujuhong they aren't necessarily both used to size the VM. The final text (not thread) should highlight this. PTAL @ #976

@tallclair (Member):
/assign @tallclair @yujuhong @derekwaynecarr

type LinuxPodSandboxConfig struct {
    Overhead *LinuxContainerResources
    ContainerResources *LinuxContainerResources
}


In summation
VM Size = sum(ContainerResources)
Host Cgroup size = sum(Overhead)

@egernst (Author):

Well, no - host cgroup size = sum(overhead, containerResources).
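Plugging in illustrative numbers makes the two sums concrete (the quantities here are made up):

package example

// Suppose the runtimeClass declares 120 MiB of overhead and the summed
// container limits come to 2048 MiB.
const (
    overheadMiB           = 120
    containerResourcesMiB = 2048
)

// On Linux/Kata, the guest/VM is sized from the container resources alone,
// while the host pod cgroup charges both.
var (
    vmSizeMiB     = containerResourcesMiB               // 2048
    hostCgroupMiB = overheadMiB + containerResourcesMiB // 2168
)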

@PatrickLang (Contributor):

and I'd make the Sandbox Size = Hyper-V partition size = sum(overhead, containerResources.Limit) on Windows.

@yujuhong (Contributor):

Container requests or limits in this case? Do we enforce that requests == limits for sandboxed pods?

@egernst (Author):

Hey @yujuhong -- it is limits. We don't force limits and requests to be equal (guaranteed), but do need a limit defined for appropriate VM sizing.

@derekwaynecarr derekwaynecarr left a comment:

Just the one comment about not modifying the CRI if the use case is not strong. How is the overhead being assigned? Is it a separate subresource on the pod?

@egernst egernst (Author) commented Apr 17, 2019:

> Just the one comment about not modifying the CRI if the use case is not strong. How is the overhead being assigned? Is it a separate subresource on the pod?

@derekwaynecarr - it is assigned through the RuntimeClass controller at admission time.

@derekwaynecarr (Member):

Followed up separately:

The CRI change is useful for hypervisors that don't have hot-swap support. Admission can set the overhead, so a separate subresource is not needed.

/lgtm
/approve

@k8s-ci-robot k8s-ci-robot added lgtm "Looks good to me", indicates that a PR is ready to be merged. approved Indicates a PR has been approved by an approver from all required OWNERS files. labels Apr 17, 2019
- the pod spec can be referenced directly from the scheduler, resourceQuota controller and kubelet, instead of referencing
a runtimeClass object which could have possibly been removed.
The pod cgroup is managed by the Kubelet, so passing the pod-level resource to the CRI implementation
is not strictly necessary. However, some runtimes may wish to take advantage of this information, for
@PatrickLang (Contributor):

Providing overhead across CRI is necessary for Windows as we implement a runtimeclass for Hyper-V. There's no cgroup that a limit can be inherited from, so the sandbox resources need to be determined at creation time. OS overhead will be on the order of hundreds of MB for a Windows sandbox and may vary release to release as Windows inbox features are tuned, added or cut.

@egernst (Author):

Thanks @PatrickLang - with each release version, then, the runtimeClass should be updated. Glad to hear this CRI addition is useful for your use case as well.

authors:
- "@egernst"
owning-sig: sig-node
participating-sigs:
- sig-scheduling
- sig-autoscaling
Contributor:

since CRI is involved, should SIG-Node and SIG-Windows be added?

@k8s-ci-robot k8s-ci-robot removed the lgtm "Looks good to me", indicates that a PR is ready to be merged. label Apr 18, 2019
@egernst egernst (Author) commented Apr 18, 2019:

/lgtm

just pushed a typo fix.

@k8s-ci-robot (Contributor):

@egernst: you cannot LGTM your own PR.

In response to this:

> /lgtm
>
> just pushed a typo fix.


@tallclair (Member):

@egernst Can you add the windows CRI API as discussed? Then LGTM. Thanks!

Updates to pod-overhead based on review discussion:
 - Clarify what Overhead is in the pod spec, and behavior when
 this is manually defined without a runtimeClass.
 - Clarify ResourceQuota changes necessary
 - Add in CRI API change to make pod details available
 - Define feature gate
 - Update runtimeClass definition

Fixes: kubernetes#688

Signed-off-by: Eric Ernst <eric.ernst@intel.com>
@tallclair (Member):

/lgtm

@k8s-ci-robot k8s-ci-robot added the lgtm "Looks good to me", indicates that a PR is ready to be merged. label Apr 19, 2019
@k8s-ci-robot (Contributor):

[APPROVALNOTIFIER] This PR is APPROVED

This pull-request has been approved by: derekwaynecarr, egernst, tallclair

The full list of commands accepted by this bot can be found here.

The pull request process is described here

Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

@k8s-ci-robot k8s-ci-robot merged commit 180a92b into kubernetes:master Apr 19, 2019
- overhead, once added to the spec, stays with the workload, even if runtimeClass is redefined
or removed.
- the pod spec can be referenced directly from scheduler, resourceQuota controller and kubelet,
instead of referencing a runtimeClass object which could have possibly been removed.
Contributor:

I assume the overhead is not mutable? Didn't see it in the KEP. Maybe I've missed it.

Member:

All PodSpec and RuntimeClass fields are immutable by default, but probably worth explicitly stating.

@egernst (Author):

ACK - added to #976


@egernst egernst deleted the pod-overhead branch April 22, 2019 23:02