
keps: sig-node: initial pod overhead proposal #887

Merged: 2 commits merged into kubernetes:master from pod-overhead on Apr 4, 2019

Conversation


@egernst egernst commented Mar 11, 2019

Initial push for the pod overhead KEP.

@k8s-ci-robot k8s-ci-robot added the cncf-cla: yes, size/L, kind/kep, and sig/node labels on Mar 11, 2019
@egernst egernst changed the title from "keps: sig-node: initial pod overhead proposal" to "[WIP] keps: sig-node: initial pod overhead proposal" on Mar 12, 2019
@k8s-ci-robot k8s-ci-robot added the do-not-merge/work-in-progress label on Mar 12, 2019
Resolved review thread on keps/sig-node/20190226-pod-overhead.md.
```
runtimeHandler:
  type: string
  pattern: '^([a-z0-9]([-a-z0-9]*[a-z0-9])?(\.[a-z0-9]([-a-z0-9]*[a-z0-9])?)*)?$'
+ runtimeCpuReqOverhead:
```
Member:
For the larger audience: Do we have no way to get the quantity behavior into CRDs?
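
For context on the question above: native API types use resource.Quantity, which gives parsing and canonicalization for free, while CRD schemas typically fall back to string fields guarded by a validation pattern like the one shown. A minimal sketch, assuming a hypothetical controller that consumes such a CRD, of validating those strings with the real k8s.io/apimachinery/pkg/api/resource package:

```
package main

import (
	"fmt"

	"k8s.io/apimachinery/pkg/api/resource"
)

func main() {
	// CRD fields validated only by a string pattern still need to be parsed
	// into resource.Quantity by whatever consumes the CRD.
	for _, s := range []string{"250m", "120Mi", "not-a-quantity"} {
		q, err := resource.ParseQuantity(s)
		if err != nil {
			fmt.Printf("%q: invalid quantity: %v\n", s, err)
			continue
		}
		fmt.Printf("%q parsed as %s\n", s, q.String())
	}
}
```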

@derekwaynecarr derekwaynecarr left a comment:

I have questions about eviction, vertical pod autoscaling, and whether this inhibits our ability to do pod-level resource requirements in the future.

for a given runtimeClass. A mutating admission controller is introduced which will update the `Overhead`
field in the workload's `PodSpec` to match what is provided for the selected RuntimeClass, if one is specified.

Kubelet's creation of the pod cgroup will be calculated as the sum of container `ResourceRequirements` fields,
Member:
Do we need to account for overhead in how we handle eviction decisions? Right now, we evict based on usage relative to request. This may prove problematic, as you will potentially want to subtract overhead from the observed usage to make a more accurate decision.

Author (@egernst):

I made this more explicit in the proposal.
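
A rough sketch of the adjustment discussed in this thread: subtract the declared overhead from observed usage before comparing against requests, so sandbox costs are not attributed to the workload. exceedsRequests is a hypothetical helper, not the kubelet's actual eviction code; the types come from k8s.io/api and k8s.io/apimachinery.

```
package eviction

import (
	v1 "k8s.io/api/core/v1"
	"k8s.io/apimachinery/pkg/api/resource"
)

// exceedsRequests reports whether a pod's observed memory usage, minus the
// overhead declared for its RuntimeClass, exceeds the sum of its containers'
// memory requests.
func exceedsRequests(usage resource.Quantity, pod *v1.Pod, overhead v1.ResourceList) bool {
	// Subtract the sandbox overhead so it is not counted against the workload
	// when ranking pods for eviction.
	if oh, ok := overhead[v1.ResourceMemory]; ok {
		usage.Sub(oh)
	}

	var requests resource.Quantity
	for _, c := range pod.Spec.Containers {
		if r, ok := c.Resources.Requests[v1.ResourceMemory]; ok {
			requests.Add(r)
		}
	}
	return usage.Cmp(requests) > 0
}
```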

Resolved review thread on keps/sig-node/20190226-pod-overhead.md.

For scheduling, the pod resource requests are added to the container resource requests.

We don't currently enforce resource limits on the pod cgroup, but this becomes feasible once
Member:
I thought we had a flag to toggle this. The pod-level cgroup takes charges for memory-backed emptyDir volumes across container restart boundaries, so there are many potential sources of things that can get charged to the pod cgroup.
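
To make the scheduling rule quoted above concrete: the effective pod request for a resource would be the sum of the container requests plus the RuntimeClass overhead. A minimal sketch, assuming the Overhead field this KEP proposes on PodSpec (present in later releases of k8s.io/api); this is a hypothetical helper, not the actual scheduler code.

```
package scheduling

import (
	v1 "k8s.io/api/core/v1"
	"k8s.io/apimachinery/pkg/api/resource"
)

// podRequest sketches the proposed accounting: the value used for scheduling
// (and for sizing the pod cgroup) is the sum of the container requests for a
// resource plus the pod's declared overhead, if any.
func podRequest(pod *v1.Pod, name v1.ResourceName) resource.Quantity {
	var total resource.Quantity
	for _, c := range pod.Spec.Containers {
		if r, ok := c.Resources.Requests[name]; ok {
			total.Add(r)
		}
	}
	// Overhead stands in for the PodSpec field this KEP proposes, populated
	// by the RuntimeClass admission controller.
	if oh, ok := pod.Spec.Overhead[name]; ok {
		total.Add(oh)
	}
	return total
}
```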

Member:
Is there any concern with how a vertical pod autoscaler would work with this function? I thought the VPA would set container requests, but external tools like kubectl top will see the overhead charged to the container, so VPA sizing may be skewed. Is the VPA going to be RuntimeClass-aware? It cannot use the same estimation for an image run in a normal container versus a VM-based container.

Author (@egernst):
@derekwaynecarr - for the pod, the overhead would be observed.

Container level statistics are gathered directly from the cgroups in the guest, and do not include any sandbox overheads.

Author (@egernst):
Updated to clarify the scaling scenario.

Member:
I was not sure if this was the case with gVisor, but I followed up with @dchen1107, who confirmed it separately.


## Alternatives [optional]

In order to achieve proper handling of sandbox runtimes, the scheduler/resourceQuota handling needs to take
Member:
I think other alternatives can be discussed.

An alternative option would be to support pod-level resource requirements rather than just overhead. I think pod-level resource requirements are very useful for shared resources (hugepages, or memory when doing emptyDir volumes), and it's possible that doing overhead now may complicate our ability to do that later.

The benefit of overhead is that a Kubernetes service provider can potentially subsidize the charge-back model and eat the cost of the runtime choice, but charge the user for the cpu/memory consumed independent of runtime choice. There are pros and cons to either approach.

Author (@egernst):

Pod-level resource requirements make sense to me. I was originally trying to keep this out of the scope of this KEP, but it is indeed hard to do so. If pod-level resources existed, then adding pod overhead would be a small addition, IMO (probably just a mutating admission controller to augment based on the overhead for the registered RuntimeClass).

Author (@egernst):
Added this as a suggested alternative. TBH, though, I see this as an augmentation rather than an alternative; I think both could exist.

@egernst egernst changed the title from "[WIP] keps: sig-node: initial pod overhead proposal" to "keps: sig-node: initial pod overhead proposal" on Apr 3, 2019
@k8s-ci-robot k8s-ci-robot removed the do-not-merge/work-in-progress label on Apr 3, 2019
@egernst egernst force-pushed the pod-overhead branch 2 times, most recently from 741ecd6 to 72b5dbf on April 3, 2019 16:48

egernst commented Apr 3, 2019

@tallclair, @derekwaynecarr PTAL?

Since my proposal will also impact scheduling, would it make sense to include someone from the scheduling SIG?

Eric Ernst added 2 commits April 4, 2019 11:22
Signed-off-by: Eric Ernst <eric.ernst@intel.com>
to-be-squashed - including in history for review period only

Signed-off-by: Eric Ernst <eric.ernst@intel.com>

egernst commented Apr 4, 2019

...updated to align with RuntimeClass admission controller naming, as proposed in #909

```
Pod {
  Spec PodSpec {
    // Overhead is the resource overhead consumed by the Pod, not including
```
Member:
Rather than "the Pod", can we say "the overhead incurred from the container runtime"?

Member:
Or just "runtime".

pod specify a limit, then the sum of those limits becomes a pod-level limit, enforced through the
pod cgroup.

Users are not expected to manually set the pod resources; if a runtimeClass is being utilized,
Member:
Is there a reason we would not always discard this if no RuntimeClass is configured?
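
A sketch of the limit rule quoted above ("if all containers in a pod specify a limit, then the sum of those limits becomes a pod-level limit"), under the same assumptions as the earlier snippets. podLimit is a hypothetical helper, not kubelet code: a pod-level limit exists only when every container sets one, and the overhead is added on top before it is enforced through the pod cgroup.

```
package podcgroup

import (
	v1 "k8s.io/api/core/v1"
	"k8s.io/apimachinery/pkg/api/resource"
)

// podLimit returns the pod-level limit for a resource and whether one exists.
// If any container omits a limit, the pod cgroup limit stays unset; otherwise
// the declared overhead is added on top of the summed container limits.
func podLimit(pod *v1.Pod, name v1.ResourceName) (resource.Quantity, bool) {
	var total resource.Quantity
	for _, c := range pod.Spec.Containers {
		l, ok := c.Resources.Limits[name]
		if !ok {
			return resource.Quantity{}, false
		}
		total.Add(l)
	}
	if oh, ok := pod.Spec.Overhead[name]; ok {
		total.Add(oh)
	}
	return total, true
}
```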

In the scope of this KEP, the RuntimeClass controller will have a single job: set the pod overhead field in the
workload's PodSpec according to the runtimeClass specified.

It is expected that only the RuntimeClass controller will set Pod.Spec.Overhead. If a value is provided
Member:
I think we should strip the overhead declaration from any pod that has no associated RuntimeClass.
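
A minimal sketch of the controller's single job as described above, including the stripping behavior suggested in this thread. mutateOverhead and getOverhead are hypothetical names, and the Overhead field is the one this KEP proposes.

```
package admission

import v1 "k8s.io/api/core/v1"

// mutateOverhead sketches the RuntimeClass controller's job plus the stripping
// suggested in this thread. getOverhead is a hypothetical lookup of the
// overhead registered for a RuntimeClass.
func mutateOverhead(pod *v1.Pod, getOverhead func(name string) v1.ResourceList) {
	if pod.Spec.RuntimeClassName == nil {
		// No RuntimeClass: strip any user-supplied overhead declaration.
		pod.Spec.Overhead = nil
		return
	}
	// Overwrite whatever the user set with the value registered for the class.
	pod.Spec.Overhead = getOverhead(*pod.Spec.RuntimeClassName)
}
```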

@derekwaynecarr:

Before this moves to implementable, I would like to see graduation criteria and to understand whether this is a separate feature gate from the existing RuntimeClass feature gate. I would also like to prevent users from populating overhead if a RuntimeClass is not associated with a pod.

@derekwaynecarr:

/approve
/lgtm

@k8s-ci-robot k8s-ci-robot added the lgtm label on Apr 4, 2019
@k8s-ci-robot:

[APPROVALNOTIFIER] This PR is APPROVED

This pull-request has been approved by: derekwaynecarr, egernst

The full list of commands accepted by this bot can be found here.

The pull request process is described here

Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

@k8s-ci-robot k8s-ci-robot added the approved label on Apr 4, 2019
@k8s-ci-robot k8s-ci-robot merged commit bb79ab4 into kubernetes:master Apr 4, 2019
@egernst egernst deleted the pod-overhead branch April 4, 2019 20:31
@tallclair tallclair left a comment


Sorry for not getting these comments in before this merged. Would you mind opening a follow-up PR to address these (and some of @derekwaynecarr's open concerns)?

```
@@ -0,0 +1,310 @@
---
title: KEP Template
```
Member (@tallclair):
Please fix.

```
authors:
  - "@egernst"
owning-sig: sig-node
participating-sigs:
```
Member (@tallclair):
sig-scheduling?
sig-autoscaling?
wg-resource-management

```
status: provisional
---

# pod overhead
```
Member (@tallclair):
nit: capitalize


## Table of Contents

Tools for generating: https://github.com/ekalinin/github-markdown-toc
Member (@tallclair):
Please fix (run the tool, paste the output here - I usually drop the bullets for the title & TOC sections)


### Non-Goals

* Making runtimeClass selections
Member (@tallclair):
A few more that come to mind:

* auto-detecting overhead
* per-container overhead
* enforcement of overhead (pod cgroup limits) - actually, maybe this is covered?


## Drawbacks [optional]

This KEP introduceds further complexity, and adds a field the PodSpec which users aren't expected to modify.
Member (@tallclair):
typo

Suggested change:
- This KEP introduceds further complexity, and adds a field the PodSpec which users aren't expected to modify.
+ This KEP introduces further complexity, and adds a field the PodSpec which users aren't expected to modify.

Even if this were to be introduced, there is a benefit in keeping the overhead separate.
- post-pod creation handling of pod events: if runtimeClass definition is removed after a pod is created,
it will be very complicated to calculate which part of the pod resource requirements were associated with
the workloads versus the sandbox overhead.
Member (@tallclair):
Does that matter?

to add a sandbox overhead when applicable.

Pros:
* no changes to the pod spec
Member (@tallclair):
And the user doesn't have the option of setting the field.

Cons:
* handling of the pod overhead is spread out across a few components
* Not user perceptible from a workload perspective.
* very complicated if the runtimeClass policy changes after workloads are running
Member (@tallclair):
It's worth noting that we don't merge the runtime handler into the pod spec. The RuntimeClass is immutable, although it can still be deleted and recreated. We could add special handling to prevent the RuntimeClass from being deleted as long as there are pods running with it.
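
A sketch of that special handling as a validating admission check; rejectDeleteIfInUse and listPods are hypothetical names, not an existing API.

```
package runtimeclass

import (
	"fmt"

	v1 "k8s.io/api/core/v1"
)

// rejectDeleteIfInUse refuses to delete a RuntimeClass while any pod still
// references it, preventing the orphaning problem described above.
func rejectDeleteIfInUse(className string, listPods func() ([]v1.Pod, error)) error {
	pods, err := listPods()
	if err != nil {
		return err
	}
	for _, p := range pods {
		if p.Spec.RuntimeClassName != nil && *p.Spec.RuntimeClassName == className {
			return fmt.Errorf("RuntimeClass %q is in use by pod %s/%s", className, p.Namespace, p.Name)
		}
	}
	return nil
}
```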

runtime choice, but charge the user for the cpu/memory consumed independent of runtime choice.


### Leaving the PodSpec unchanged
Member (@tallclair):
I had been assuming that we would merge the overhead into the PodSpec, but now I'm second-guessing that...

@tallclair:

Also, I forgot to say that this is awesome, especially as your first contribution to Kubernetes! Thanks so much for taking this project on!
