Updating QoS policy to be at the pod level #14943

Merged
merged 3 commits into kubernetes:master from vishh:qos on May 21, 2016

Conversation

@vishh
Member

vishh commented Oct 1, 2015

Release Note

Existing pods might be more susceptible to OOM Kills on the node due to this PR! 
To protect pods from being OOM killed on the node, set `limits` for all resources across all containers in a pod.

Quality of Service will be derived from the entire Pod spec, instead of being derived per-resource from the resource specifications of individual containers.
A Pod is `Guaranteed` iff all its containers have limits == requests for all the first-class resources (cpu, memory as of now).
A Pod is `BestEffort` iff requests & limits are not specified for any resource across all containers.
A Pod is `Burstable` otherwise. 
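
For illustration, here is a minimal, self-contained Go sketch of the pod-level classification described above. The types and the `podQOS` helper are simplified stand-ins, not the actual Kubernetes API structs or helpers, and the sketch assumes that a container with only limits set has its requests defaulted to those limits:

```go
package main

import "fmt"

type Resources struct {
	Requests map[string]string // e.g. {"cpu": "100m", "memory": "128Mi"}
	Limits   map[string]string
}

type Container struct {
	Name      string
	Resources Resources
}

// podQOS derives a single QoS class for the whole pod, per the rules above.
func podQOS(containers []Container) string {
	firstClass := []string{"cpu", "memory"}
	anySet := false
	allGuaranteed := true
	for _, c := range containers {
		for _, r := range firstClass {
			req, hasReq := c.Resources.Requests[r]
			lim, hasLim := c.Resources.Limits[r]
			if hasReq || hasLim {
				anySet = true
			}
			// Guaranteed requires a limit for every first-class resource in
			// every container, with the request either omitted (defaulted to
			// the limit) or equal to the limit.
			if !hasLim || (hasReq && req != lim) {
				allGuaranteed = false
			}
		}
	}
	switch {
	case !anySet:
		return "BestEffort"
	case allGuaranteed:
		return "Guaranteed"
	default:
		return "Burstable"
	}
}

func main() {
	pod := []Container{{
		Name: "app",
		Resources: Resources{
			Requests: map[string]string{"cpu": "100m", "memory": "128Mi"},
			Limits:   map[string]string{"cpu": "100m", "memory": "128Mi"},
		},
	}}
	fmt.Println(podQOS(pod)) // Guaranteed
}
```

Note that a single container leaving any first-class resource without a limit drops the whole pod out of `Guaranteed`.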

@derekwaynecarr

derekwaynecarr Oct 2, 2015

Member

I need to think on this some, as I thought we were going to ignore CPU since it was compressible and ceiling-enforced if the flag was enabled. It had made sense to me to do only a cgroup per memory QoS on the node.

For the end user, this means my OOMScoreAdjust value is based on the lowest level of compute-resource QoS, but from a scheduling perspective I would still be BestEffort/Burstable/Guaranteed on a per-compute-resource basis, correct?

Are there ramifications other than OOMScoreAdjust that I am missing?

@derekwaynecarr

derekwaynecarr Oct 2, 2015

Member

For example, we would continue to render the QoS tier on a per-compute-resource basis when describing a pod?

@derekwaynecarr

derekwaynecarr Oct 2, 2015

Member

Thinking on this more, if memory is what is under system pressure, it does seem right that my OOMScoreAdjust should be based solely on my memory QoS?

Can you elaborate a little more on why we would need a different cgroup for CPU versus memory in the alternate model that is causing you concern?

We can chat more in real life if it's easier.

@vishh

vishh Oct 2, 2015

Member

@derekwaynecarr: My understanding is that the entire system will treat all resources as belonging to the same QoS class. Since we cannot guarantee per-resource isolation at the node level, I don't see why the rest of the system should treat each resource separately.
On the CPU front, just hard-capping won't be enough, I think. The reason is that CPU scheduling is also inferred from CPU shares and hierarchies. The best way to guarantee fairness among QoS classes, and at the same time not starve best-effort tasks, is to place them in hierarchical cgroups.
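
As a rough illustration of the hierarchical-cgroup idea (this is not kubelet code; the paths and share values below are hypothetical, and it assumes cgroup v1 with the cpu controller mounted at /sys/fs/cgroup/cpu):

```go
package main

import (
	"fmt"
	"os"
	"path/filepath"
)

func main() {
	// Hypothetical root for QoS-class cgroups; real node layouts differ.
	root := "/sys/fs/cgroup/cpu/qos-example"

	// Relative cpu.shares per class: Burstable keeps a normal weight, while
	// the Best-Effort subtree gets the minimum weight so its tasks are never
	// starved outright but always lose contention to the higher classes
	// above them in the hierarchy.
	classes := map[string]string{
		"burstable":   "1024",
		"best-effort": "2",
	}

	for name, shares := range classes {
		dir := filepath.Join(root, name)
		if err := os.MkdirAll(dir, 0o755); err != nil {
			fmt.Println("mkdir failed:", err)
			continue
		}
		// Writing cpu.shares on the class cgroup sets the relative CPU
		// weight for everything placed underneath it.
		if err := os.WriteFile(filepath.Join(dir, "cpu.shares"), []byte(shares), 0o644); err != nil {
			fmt.Println("write failed:", err)
		}
	}
}
```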

@vishh

vishh Oct 2, 2015

Member

Consider this example: a container `x` is CPU-guaranteed and memory-burstable. Our current plan is to have hierarchical cgroups for each class. This would result in the following cgroup hierarchies:
cpu: `/sys/fs/cgroup/cpu//docker/x`
memory: `/sys/fs/cgroup/memory/docker/best-effort/x`

Ideally, we don't want the OOM score to kick in, since it affects system stability. If we were to limit memory allocations for lower QoS classes, based on a TBD policy (#13006), we can avoid incurring system memory pressure to a large extent.

@bgrant0607

Member

bgrant0607 commented Oct 3, 2015

(Review comments on docs/design/resource-qos.md; now outdated.)
@davidopp

davidopp Oct 5, 2015

Member

I kinda sympathize with @derekwaynecarr's view that anything of the form "user sets X, but the system treats it as if the user set Y" is asking for confusion. A better approach, which would still enforce the same policy, might be to use validation to require that a user set all resources of a container to the same QoS class, and reject the pod if they are not, rather than allowing the user to set different QoS classes and then treating them all as the minimum.
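
A hedged sketch of what such a validation rule could look like: classify each first-class resource of a container independently and reject the container if the classes disagree. The names and types here are illustrative, not the actual validation code:

```go
package main

import "fmt"

// resourceSpec is a simplified stand-in for a container's request/limit pair
// for one resource; an empty string means "not set".
type resourceSpec struct {
	request string
	limit   string
}

// classOf mirrors the per-resource QoS rules: nothing set is BestEffort, a
// limit with an equal (or omitted, hence defaulted) request is Guaranteed,
// anything else is Burstable.
func classOf(r resourceSpec) string {
	switch {
	case r.request == "" && r.limit == "":
		return "BestEffort"
	case r.limit != "" && (r.request == "" || r.request == r.limit):
		return "Guaranteed"
	default:
		return "Burstable"
	}
}

// validateUniformQoS rejects a container whose resources land in different
// QoS classes, instead of silently demoting the pod to the lowest class.
func validateUniformQoS(resources map[string]resourceSpec) error {
	seen := ""
	for name, r := range resources {
		c := classOf(r)
		if seen == "" {
			seen = c
		} else if c != seen {
			return fmt.Errorf("resource %q is %s, but another resource is %s", name, c, seen)
		}
	}
	return nil // no resources, or all in the same class: accept
}

func main() {
	err := validateUniformQoS(map[string]resourceSpec{
		"cpu":    {request: "100m", limit: "100m"}, // Guaranteed
		"memory": {request: "64Mi"},                // Burstable
	})
	fmt.Println(err) // mixed classes -> rejected with an error
}
```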

Since we want to make Kubernetes as simple as possible for its users we don’t want to require setting [Resources](../design/resource-qos.md) for container by its owner.
On the other hand having Resources filled is critical for scheduling decisions.
Current solution to set up Resources to hardcoded value has obvious drawbacks.
We need to implement a component which will set initial Resources to a reasonable value.

@davidopp

davidopp Oct 5, 2015

Member

This is already implemented as of 1.1. Unfortunately there doesn't seem to be any user or admin documentation yet, just design docs in proposals/. I've asked the autoscaling team to write something, and then we can link to it here.

@vishh

vishh Oct 5, 2015

Member

Acknowledged. cc @piosz

@piosz

piosz Oct 5, 2015

Member

I'll try to do it tomorrow.

@derekwaynecarr

derekwaynecarr Oct 5, 2015

Member

@davidopp - I am not a huge fan of requiring all compute resources to be members of the same QoS class as part of validation. We have a number of things that set resources, and I think requiring this rule would make for a poor user experience. I still need to read the referenced document on unified hierarchy. I agree with @vishh that we want to prevent OOM scenarios more proactively in the future, using some type of memory monitoring that kills containers when under memory pressure. In that scenario, I was less concerned with this change just updating the oom_score_adj value, and more concerned about whether there were other semantics I was missing as part of this change.

@vishh

vishh Oct 5, 2015

Member

Rebased and addressed comments.

PTAL @davidopp @bgrant0607 @derekwaynecarr.

@vishh

vishh Oct 6, 2015

Member

@derekwaynecarr: I assume you are OK with this change in general. I will address your comments in a bit.

@vishh

Member

vishh commented May 20, 2016

@dchen1107 PTAL

@dchen1107

dchen1107 May 20, 2016

Member

LGTM

@dchen1107 dchen1107 added lgtm and removed lgtm labels May 20, 2016

@dchen1107

dchen1107 May 20, 2016

Member

Can you squash the commits? Then we are OK to go. Thanks!

vishh added some commits Oct 1, 2015

Updating QoS policy to be per-pod instead of per-resource.
Signed-off-by: Vishnu kannan <vishnuk@google.com>
Update kubelet to use per-pod QoS policy.
Signed-off-by: Vishnu kannan <vishnuk@google.com>
@vishh

vishh May 20, 2016

Member

Commits squashed. Applying LGTM label as per offline conversation.

@vishh vishh added the lgtm label May 20, 2016

satisfy flags check script by including a few files
Signed-off-by: Vishnu kannan <vishnuk@google.com>

@k8s-merge-robot k8s-merge-robot removed the lgtm label May 20, 2016

@vishh vishh added the lgtm label May 20, 2016

@k8s-merge-robot

k8s-merge-robot May 21, 2016

Contributor

@k8s-bot test this [submit-queue is verifying that this PR is safe to merge]

@k8s-bot

k8s-bot commented May 21, 2016

GCE e2e build/test passed for commit a64fe65.

@k8s-merge-robot

k8s-merge-robot May 21, 2016

Contributor

Automatic merge from submit-queue

@k8s-merge-robot k8s-merge-robot merged commit 46504c2 into kubernetes:master May 21, 2016

4 of 5 checks passed

Submit Queue: Github CI tests are not green.
Jenkins GCE Node e2e: Build finished.
Jenkins GCE e2e: 298 tests run, 122 skipped, 0 failed.
Jenkins unit/integration: 6334 tests run, 24 skipped, 0 failed.
cla/google: All necessary CLAs are signed.

ingvagabund added a commit to ingvagabund/kubernetes that referenced this pull request May 21, 2016

Scheduler: introduce CheckNodeMemoryPressurePredicate, don't schedule pods for nodes that report memory pressure.

Introduce unit test for CheckNodeMemoryPressurePredicate

Following work done in kubernetes#14943

k8s-merge-robot added a commit that referenced this pull request May 22, 2016

Merge pull request #25531 from ingvagabund/introduce-memory-pressure-to-scheduler

Automatic merge from submit-queue

Introduce node memory pressure condition to scheduler

Following the work done by @derekwaynecarr in #21274, this introduces a memory pressure predicate for the scheduler.

Missing:

* write down unit-test
* test the implementation

At the moment this is a heads-up for further discussion of how the new node memory pressure condition should be handled in the generic scheduler.

**Additional info**

* Based on [1], only best effort pods are subject to filtering.
* Based on [2], best effort pods are those pods "iff requests & limits are not specified for any resource across all containers".

[1] https://github.com/derekwaynecarr/kubernetes/blob/542668cc7998fe0acb315a43731e1f45ecdcc85b/docs/proposals/kubelet-eviction.md#scheduler
[2] #14943

girishkalele pushed a commit to girishkalele/kubernetes that referenced this pull request May 24, 2016

Scheduler: introduce CheckNodeMemoryPressurePredicate, don't schedule pods for nodes that report memory pressure.

Introduce unit test for CheckNodeMemoryPressurePredicate

Following work done in kubernetes#14943
@erictune

erictune Jul 2, 2016

Member

@vishh Does this PR require action by the user when upgrading from 1.2.x to 1.3.0? (Think about non-developer users.) If so, please edit your first comment to have a release-note block, like in #28132. If it is just an optional feature, please change the label to just release-note. If it is not a complete feature by itself, then apply "release-note-none" label instead.

@vishh

vishh Jul 2, 2016

Member

@erictune Yes. This PR calls for users to set limits on all the pods they care about before upgrade.
I updated the PR description to match the release note template, similar to #28132.
It is not an optional feature and will affect existing pods.
It is a behavior change in the system.

michelleN pushed a commit to michelleN/community that referenced this pull request Nov 19, 2016

Merge pull request #25531 from ingvagabund/introduce-memory-pressure-to-scheduler

Automatic merge from submit-queue

Introduce node memory pressure condition to scheduler

Following the work done by @derekwaynecarr in kubernetes/kubernetes#21274, this introduces a memory pressure predicate for the scheduler.

Missing:

* write down unit-test
* test the implementation

At the moment this is a heads-up for further discussion of how the new node memory pressure condition should be handled in the generic scheduler.

**Additional info**

* Based on [1], only best effort pods are subject to filtering.
* Based on [2], best effort pods are those pods "iff requests & limits are not specified for any resource across all containers".

[1] https://github.com/derekwaynecarr/kubernetes/blob/542668cc7998fe0acb315a43731e1f45ecdcc85b/docs/proposals/kubelet-eviction.md#scheduler
[2] kubernetes/kubernetes#14943

xingzhou pushed a commit to xingzhou/kubernetes that referenced this pull request Dec 15, 2016

Merge pull request kubernetes#14943 from vishh/qos
Automatic merge from submit-queue

Updating QoS policy to be at the pod level

Quality of Service will be derived from an entire Pod Spec, instead of being derived from resource specifications of individual resources per-container.
A Pod is `Guaranteed` iff all its containers have limits == requests for all the first-class resources (cpu, memory as of now).
A Pod is `BestEffort` iff requests & limits are not specified for any resource across all containers.
A Pod is `Burstable` otherwise. 

Note: Existing pods might be more susceptible to OOM Kills on the node due to this PR! To protect pods from being OOM killed on the node, set `limits` for all resources across all containers in a pod.

xingzhou pushed a commit to xingzhou/kubernetes that referenced this pull request Dec 15, 2016

Scheduler: introduce CheckNodeMemoryPressurePredicate, don't schedule pods for nodes that report memory pressure.

Introduce unit test for CheckNodeMemoryPressurePredicate

Following work done in kubernetes#14943