
kubelet: set low oom_score_adj for containers in critical pods #73758

Merged
merged 1 commit into kubernetes:master from sjenning:priority-based-oom-score-adj on Feb 7, 2019

Conversation

sjenning (Contributor) commented Feb 5, 2019

@dashpole this is really all I was talking about in sig-node.

Currently, the only way to get the lowest oom_score_adj is to set requests/limits on the containers in a Pod such that it lands in the Guaranteed pod QoS tier. This requires placing a memory limit on the Pod, which can actually make the Pod more susceptible to OOM kills: if it exceeds its memory limit, it is killed regardless of its oom_score_adj, because it is the only candidate process in the cgroup under memory pressure.

oom_score_adj is really an expression of priority. Currently, users must express that priority through layers of indirection: container requests/limits determine the pod QoS tier, which in turn determines the oom_score_adj.

We have two reserved critical pod priority classes with which users can explicitly convey priority: system-node-critical and system-cluster-critical.

We should treat this explicit expression as the primary indicator of priority and, therefore, of oom_score_adj.

Set a low oom_score_adj for containers in pods with system-critical priorities
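
For context, here is a minimal sketch of the priority-first selection described above. The -998 Guaranteed value and the Burstable formula appear later in this conversation; the 1000 BestEffort value, the pared-down pod type, and the constant and helper names are assumptions for illustration and are not the kubelet's actual code.

package main

import "fmt"

const (
	guaranteedOOMScoreAdj = -998 // lowest score assigned to pod containers (assumed constant name)
	besteffortOOMScoreAdj = 1000 // highest score, for pods with no memory request (assumed value)
)

// pod is a pared-down stand-in for v1.Pod, carrying only what this example needs.
type pod struct {
	priorityClassName string
	memoryRequest     int64 // bytes
	memoryLimit       int64 // bytes; 0 means unset
}

// isCriticalPod mirrors the idea of checking the two reserved priority classes.
func isCriticalPod(p pod) bool {
	return p.priorityClassName == "system-node-critical" ||
		p.priorityClassName == "system-cluster-critical"
}

// containerOOMScoreAdjust picks a score priority-first, then by QoS-style rules.
func containerOOMScoreAdjust(p pod, memoryCapacity int64) int {
	if isCriticalPod(p) {
		// Critical pods get the lowest score regardless of requests/limits.
		return guaranteedOOMScoreAdj
	}
	if p.memoryLimit != 0 && p.memoryRequest == p.memoryLimit {
		return guaranteedOOMScoreAdj // Guaranteed-style: request == limit
	}
	if p.memoryRequest == 0 {
		return besteffortOOMScoreAdj // BestEffort-style: no request
	}
	// Burstable-style: scale with the fraction of node memory requested.
	return int(1000 - (1000*p.memoryRequest)/memoryCapacity)
}

func main() {
	critical := pod{priorityClassName: "system-cluster-critical", memoryRequest: 1 << 30, memoryLimit: 2 << 30}
	fmt.Println(containerOOMScoreAdjust(critical, 10<<30)) // -998
	burstable := pod{memoryRequest: 1 << 30, memoryLimit: 2 << 30}
	fmt.Println(containerOOMScoreAdjust(burstable, 10<<30)) // 900
}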

sjenning force-pushed the sjenning:priority-based-oom-score-adj branch 3 times, most recently from d110f03 to a38a357 on Feb 5, 2019

dashpole (Contributor) commented Feb 5, 2019

So just to walk through an example:

Pod_1
Req = 1Gi
Limit = 1Gi
QoS = Guaranteed
oom_score_adj = -998

Node Capacity = 10 Gi

Pod_2
Req = 1Gi
Limit = 1.01Gi
QoS = Burstable
oom_score_adj = 1000 - (1000*1Gi) / 10Gi = 900

That is a pretty huge difference.

Given that we have priority, I think we should revisit the decision to set the oom_score_adj so low for Guaranteed pods. I would prefer actually using the burstable formula for all pods except system-node-critical and system-cluster-critical priority pods. This keeps the oom score the same for Burstable and BestEffort, but increases the oom_score_adj for Guaranteed pods to be in line with a Burstable pod with the same requests.

Generally, this issue occurs because we differentiate between QoS tiers in a non-continuous way. As mentioned at sig-node, using the guaranteed QoS tier requires setting CPU and memory limits, which isn't always desirable. Just as eviction doesn't take limits into account, I don't think setting the oom_score_adj should either, as that ends up requiring setting limits just to achieve some notion of priority.

My ideal formula would be something like:

oomScoreAdjust := 1000 - (1000* memoryRequest)/memoryCapacity
if podIsCritical(pod) {
    oomScoreAdjust -= 999
}
return oomScoreAdjust

This way, critical pods are still differentiated by requests, and the only input variable from the pod is the pod's memory request.
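
As a concrete illustration of this proposal (and only of the proposal; it is not what this PR merged), a runnable version of the formula above might look like the following, where the function name and the plain int64 parameters are stand-ins for the kubelet's actual types:

package main

import "fmt"

// proposedOOMScoreAdjust sketches the alternative formula from the comment above:
// every pod uses the Burstable-style formula based on its memory request, and
// critical pods get a flat reduction.
func proposedOOMScoreAdjust(memoryRequest, memoryCapacity int64, critical bool) int64 {
	oomScoreAdjust := 1000 - (1000*memoryRequest)/memoryCapacity
	if critical {
		oomScoreAdjust -= 999
	}
	return oomScoreAdjust
}

func main() {
	// A critical pod requesting 1Gi on a 10Gi node: 900 - 999 = -99.
	fmt.Println(proposedOOMScoreAdjust(1<<30, 10<<30, true))
	// The same pod without a critical priority class stays at 900.
	fmt.Println(proposedOOMScoreAdjust(1<<30, 10<<30, false))
}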

sjenning (Contributor, Author) commented Feb 5, 2019

@dashpole I am not against making the score adjust less discontinuous among the QoS tiers.

However, this PR is more limited in scope, targeted at honoring explicit expression of the pod priority when setting the score adjust.

sjenning force-pushed the sjenning:priority-based-oom-score-adj branch from a38a357 to 34c5c12 on Feb 5, 2019

dashpole (Contributor) commented Feb 5, 2019

> However, this PR is more limited in scope, targeted at honoring explicit expression of the pod priority when setting the score adjust.

In that case, what would you think about something that differentiates between critical pods that have lower/higher requests, rather than setting all critical pods to -998?

I definitely think this is an improvement. Just seeing what else we can accomplish while we have people's attention :).

sjenning force-pushed the sjenning:priority-based-oom-score-adj branch from 34c5c12 to 7dcf1fe on Feb 5, 2019

sjenning (Contributor, Author) commented Feb 6, 2019

/kind bug
/priority important-soon

sjenning (Contributor, Author) commented Feb 6, 2019

/assign @dashpole

I would be willing to review any follow-on PR. I'd just like to keep this one simple and on point.

dashpole (Contributor) commented Feb 6, 2019

sgtm
/lgtm

k8s-ci-robot added the lgtm label on Feb 6, 2019

sjenning (Contributor, Author) commented Feb 6, 2019

/assign @derekwaynecarr

derekwaynecarr (Member) commented Feb 6, 2019

I agree this is better, and we can revisit the heuristic without breaking any backward compatibility in the future.

/approve

k8s-ci-robot (Contributor) commented Feb 6, 2019

[APPROVALNOTIFIER] This PR is APPROVED

This pull-request has been approved by: derekwaynecarr, sjenning

The full list of commands accepted by this bot can be found here.

The pull request process is described here

Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

fejta-bot commented Feb 7, 2019

/retest
This bot automatically retries jobs that failed/flaked on approved PRs (send feedback to fejta).

Review the full test history for this PR.

Silence the bot with an /lgtm cancel or /hold comment for consistent failures.

1 similar comment

k8s-ci-robot merged commit 6796645 into kubernetes:master on Feb 7, 2019

15 of 16 checks passed

pull-kubernetes-local-e2e-containerized: Job triggered
cla/linuxfoundation: sjenning authorized
pull-kubernetes-bazel-build: Job succeeded
pull-kubernetes-bazel-test: Job succeeded
pull-kubernetes-cross: Skipped
pull-kubernetes-e2e-gce: Job succeeded
pull-kubernetes-e2e-gce-100-performance: Job succeeded
pull-kubernetes-e2e-gce-device-plugin-gpu: Job succeeded
pull-kubernetes-godeps: Skipped
pull-kubernetes-integration: Job succeeded
pull-kubernetes-kubemark-e2e-gce-big: Job succeeded
pull-kubernetes-local-e2e: Skipped
pull-kubernetes-node-e2e: Job succeeded
pull-kubernetes-typecheck: Job succeeded
pull-kubernetes-verify: Job succeeded
tide: In merge pool
k8s-ci-robot (Contributor) commented Feb 7, 2019

@sjenning: The following test failed, say /retest to rerun them all:

Test name: pull-kubernetes-local-e2e-containerized
Commit: 7dcf1fe
Rerun command: /test pull-kubernetes-local-e2e-containerized

Full PR test history. Your PR dashboard. Please help us cut down on flakes by linking to an open issue when you hit one in your PR.

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository. I understand the commands that are listed here.

szuecs (Member) commented Feb 12, 2019

Thanks @sjenning, this makes our configurations so much easier; we have already had smaller outages because we forgot to set CPU limits on one pod.
Maybe we can think about having different values for different priority classes. To me it makes the most sense not to kill high-priority pods; that way you can define the ingress router, DNS, etc. as more important than regular pods that also serve important things but whose loss would not break everything.
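
To make that suggestion concrete, a purely hypothetical sketch of per-priority-class oom_score_adj values is shown below. Nothing like this exists in the kubelet; the map, the helper, and the -997 value are invented for illustration, and only the two reserved class names come from this PR.

package main

import "fmt"

// oomScoreAdjByPriorityClass is a hypothetical table mapping priority class names
// to distinct oom_score_adj values, so node-critical components could outrank
// cluster-critical ones, which in turn outrank regular pods.
var oomScoreAdjByPriorityClass = map[string]int{
	"system-node-critical":    -998,
	"system-cluster-critical": -997, // invented value for illustration
}

// oomScoreAdjFor returns the class-specific score if one is defined, otherwise
// it falls back to whatever QoS-based score the caller computed.
func oomScoreAdjFor(priorityClassName string, qosDefault int) int {
	if adj, ok := oomScoreAdjByPriorityClass[priorityClassName]; ok {
		return adj
	}
	return qosDefault
}

func main() {
	fmt.Println(oomScoreAdjFor("system-node-critical", 900)) // -998
	fmt.Println(oomScoreAdjFor("my-ingress-priority", 900))  // 900 (no special mapping)
}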
