QoS proposal #11713

AnanyaKumar · 2015-07-22T22:01:03Z

k8s-bot · 2015-07-22T22:01:26Z

Can one of the admins verify that this patch is reasonable to test? (reply "ok to test", or if you trust the user, reply "add to whitelist")

If this message is too spammy, please complain to ixdy.

AnanyaKumar · 2015-07-22T22:03:09Z

@vishh @bgrant0607 @dchen1107 @erictune @yujuhong @lavalamp PTAL when you have time, and let me know what you think.

cc'ing @vmarmol @rjnagal

davidopp · 2015-07-23T00:43:09Z

Hi. I like this proposal in general, but I have a few comments. I went through a few different approaches to try to give the feedback. This was the best I could come up with, but I'm happy to talk in person because this may still be incomprehensible.

At a high level, I think there are three things to consider changing about your proposal:

Completely ignoring best-effort pods in scheduling is likely to cause problems. The main concern is that you can get into an infinite loop of OOM kill. As a trivial case, consider a cluster with one machine. It is running a top-tier pod that requests and uses 100% of the machine's memory. Now someone submits a best-effort pod. It will schedule onto that machine and immediately be killed once it starts trying to use memory. It will then reschedule onto that machine over and over and over again. I'm not sure it's possible to solve this problem without jumping into using both "request" and "limit." But if you're just going to just use "limit" then I think for now the best you can do is something like this:

(i) When scheduling a top-tier pod, ignore best-effort pods
(ii) When scheduling a best-effort pod that specifies a limit, only schedule it onto the machine if capacity - sum of top-tier limits - sum of best-effort usages - new pod's limit >= 0
(iii) When scheduling a best-effort pod that specifies no limit, do the same as in (ii) but pretend the new pod's limit is some reasonable value (maybe use the default values we're using for memory and CPU when spreading zero-limit pods)

Maybe for CPU this isn't an issue because you wouldn't kill, but I would suggest applying the above policy at least for memory.

(Before someone jumps in and suggests some kind of rate limiter to detect these rescheduling loops -- I do not like the approach of using scheduling rate limiters when there are static approaches to avoid these kinds of problems, such as what I described above.)

It seems there are some downsides to the way you plan to handle CPU shares for best-effort pods, both in the short-term plan and the long-term plan. I think the behavior we really want is something like: top-tier pods are always guaranteed to be able use up to their limit; whatever is "left over" from their usage is distributed among the best-effort pods in a fair way.

As a concrete example, let's say there are three pods
T: top-tier, limit 60%, wants to use 100%
L: best-effort, limit 60%, wants to use 100%
B: best-effort, no limit specified, wants to use 100%

The system would give the following distribution: T gets 60% of the CPU, L gets 40%-epsilon, and B gets epsilon. I don't have a problem with putting all the no-limit best-effort pods in a single cgroup that gets a fixed share, but I think the best-effort pods that do request a limit should not be put into that cgroup.

I didn't see any mention of killing based on exceeding memory limit (as opposed to killing due to OOM killer). We should configure the cgroups for memory so that a pod is killed if it exceeds its memory limit, for the pods that specify limit. For the pods that don't specify limit (which should only be allowed for best-effort pods, I think), we would not kill them but should make them the first to be killed when there is an OOM.

davidopp · 2015-07-23T01:12:30Z

Oh, it just occurred to me that you can't do the thing I suggested in 1) (ii) because we don't report usage as an API endpoint available to the scheduler yet. Until we have that, I think you would still benefit from doing what I suggested there (and just pretend sum of usages is zero). I understand the argument that this hurts utilization, but until we have separate "request" and "limit" I think this is the most reasonable thing to do.

bgrant0607 · 2015-07-23T01:25:26Z

docs/proposals/kubelet-qos.md

+
+In the first iteration, we will stick to a simple policy. Pods that specify CPU and memory limits for every container will be classified as top-tier, all other pods will be classified as best-effort. This policy offers protection to pods that specify resource requirements (from pods that have no limits and can go overboard in resource consumption), which gives them better resource guarantees. 
+
+Extension: In the future, we might add a field to PodSpec in the API, to allow users to specify whether a pod is best-effort or top-tier.


Please remove this line. At this time, I am not in favor of adding an explicit field.

In this proposal, how are best-effort pods distinguished from top-tier pods? The proposal seems to be assuming there can be non-0-limit best-effort pods, unless I'm missing something. (And again, I am assuming this is just talking about limit, not request.)

(And to be less opaque -- I was assuming there had to be such a bit, since I didn't see any other way to distinguish best-effort from top-tier.)

Top-tier pods must fully specify all relevant limits.

Why do I not want a more granular and explicit specification, such as that accepted by lmctfy?

The main reasons are complexity and API expansion.

However, there are also inadequate mechanisms in the upstream kernel to distinguish multiple levels of latency sensitivity and multiple levels of compressibility.

Also, to me, not specifying resources implies a lack of requirement, hence best effort. Besides, the scheduler, Kubelet, and kernel can't guarantee much in the case of unbounded oversubscription, and most such applications disable limit enforcement in order to opportunistically consume resources.

@bgrant0607 I will remove the line.

@davidopp Yes, in my proposal pods that specify all CPU and memory limits are top-tier, all other pods are best-effort.

I think it's still useful to split this discussion into: 1. The policy of deciding whether a pod is best-effort or top-tier (in our case based on whether pods specify limits) and 2. The QoS that services pods based on the label of top-tier and best-effort. That's what I'm saying here, so hypothetically, limits and tiers can be independent. There was a discussion of possible use cases in #147

I'm not a fan of saying that top-tiers = has limits specified and best-effort = no limits specified. This will cause scheduling of best-effort pods to be pretty much random, as we won't have any data which would allow distinguishing them (unless we'll bring requirements into the picture).

Yes, the implication is that we'll implement "request".

You're right of course: "request" not "requirement" - too many similar words around:/ The thing is I don't see the proposal on how to incorporate requests here, and I think this should be a prerequisite to any QoS work.

I'll update the proposal with requests today!

On 24 Jul, 2015, at 2:31 am, Marek Grabowski notifications@github.com wrote:

In docs/proposals/kubelet-qos.md:

Best effort pods will be evicted if a top tier pod needing the memory requirements comes in, or if some other top-tier pod asks for memory.

Best-effort pods will not be allowed to exceed their memory limit.

Extension: Consider distinguishing between best-effort pods that specify memory limits and give them even lower priority. Otherwise a best-effort pod without a memory limit could use up system resources and kill off other best-effort pods.
+- Extension: support other resources like network, disk.

+Resource guarantees for top tier pods
+- Will not change.
+- Extension: Consider giving top-tier pods that do not specify memory limits a lower memory priority than top tier pods that specify memory limits (but a higher priority than best-effort pods)
+
+## Policy
+
+The QoS design does not make assumptions about the “policy” used to label pods as best-effort and top-tier.
+
+In the first iteration, we will stick to a simple policy. Pods that specify CPU and memory limits for every container will be classified as top-tier, all other pods will be classified as best-effort. This policy offers protection to pods that specify resource requirements (from pods that have no limits and can go overboard in resource consumption), which gives them better resource guarantees.
+
+Extension: In the future, we might add a field to PodSpec in the API, to allow users to specify whether a pod is best-effort or top-tier.
You're right of course: "request" not "requirement" - too many similar words around:/ The thing is I don't see the proposal on how to incorporate requests here, and I think this should be a prerequisite to any QoS work.

—
Reply to this email directly or view it on GitHub.

pmorie · 2015-07-23T02:05:48Z

@ncdc @derekwaynecarr

bgrant0607 · 2015-07-23T02:35:54Z

cc @timothysc

bgrant0607 · 2015-07-23T02:46:33Z

cc @eparis

bgrant0607 · 2015-07-23T02:47:11Z

docs/proposals/kubelet-qos.md

+We will introduce two tiers of pods: top-tier pods and best-effort pods. We will not support multiple levels or priorities.
+
+Best-effort pods will be invisible to top-tier pods:
+- **Existence of best-effort pods should not prevent top tier pods from launching**. When deciding whether to launch a top-tier pod, we only look at resource limits of other top-tier pod. Suppose we have a machine with 4GB of memory. Top-tier pod P1 has a 2GB limit, and 4 best-effort pods have a 500MB limit each. Suppose the user launches a new top-tier pod P2 with a 1GB limit. The Kubelet will evict best-effort pods, in unspecified order, to make room for P2. P2 will then be admitted and allocated the requested resources.


We should define compressible and incompressible.

Actually, they are defined in docs/design/resources.md, which this doc should link to.

erictune · 2015-07-23T04:20:16Z

Agree with Brian that you should phrase this proposal in terms of adding "request" fields.

bgrant0607 · 2015-07-23T04:42:22Z

docs/proposals/kubelet-qos.md

+
+On the other hand in the current system, **if all pods have specified limits, cluster utilization can be poor**. In the current implementation, Kubelet does not support over-committing pods. Very often, pods don’t use all the resources that they request, which leads to a lot of wasted resources. For example, we might have 4 pods, each reserving 1GB of memory in a system with 4GB memory, but only using 500MB of memory. Theoretically, we could slot in another 4 pods of this type, but Kubernetes will reject new pods.
+
+In this design document, we propose **implementing two tiers of pods** - one in which we guarantee resources (top-tier pods) and the second wherein we provide resources if they’re available and at a slightly lower quality (best-effort pods). Adding a best-effort tier of pods will increase node’s utilization in a Kubernetes cluster (in Borg, supporting best-effort pods increased utilization by  about 20%). We also implement **a policy of automatically classifying pods that don’t provide limits as best-effort**. This provides much better resource guarantees to pods with limits specified.


By 20% cpu, do you mean "cpu: 200m"?

I think it's a 20% increase in the number of pods that could be squeezed into the same cluster? The statistic was mentioned in the Borg paper. I looked through most of the Borg paper after our discussion, but I didn't analyze the statistics in detail.

derekwaynecarr · 2015-07-23T07:07:55Z

Can you elaborate on the interplay with ResourceQuota? Right now, as I read this proposal if a quota is applied to a namespace, I am only able to run top tier pods. At one point, I thought we discussed ResourceQuota for each tier. I need to think about this some more, but I suspect I do not want the presence of a ResourceQuota to prohibit my namespace to a single tier of pod in that Namespace.

gmarek · 2015-07-23T14:40:32Z

docs/proposals/kubelet-qos.md

+  - Set OOM_SCORE_ADJ to 501
+- Top-tier
+  - Set CPU shares and memory requirements as we do right now
+  - Set OOM_SCORE_ADJ to -501


We need to take care about docker daemon and kubelet. I guess that they should have lower oom_scores than top-tier pods.

Yup they'll have a lower oom_score_adj. I think right now the kubelet/daemon's oom_score_adj is -900 and kube-proxy's oom_score_adj is -899, so they should be pretty well protected from the user containers.

I'm far from Linux expert, but aren't real scores max (whatever we computed, 0)? At least I've never seen anything negative in /proc/pid/oom_score.

That's a good point. There would be an issue if the user launches lots of top-tier pods that don't take up much memory. Then all the top-tier pods, the Kubelet, and the docker daemon would have the same oom_score of 0, and we cannot guarantee what gets evicted. Is this the issue that you're concerned about?

I think we could avoid this by keeping some padding for the kubelet, OS, etc (which we don't do at the moment).

That's exactly my point;)

Do you think this solution is good enough: keep a padding of about Xgb for system processes, the kubelet, docker daemon and kube-proxy. Put all user defined pods inside a cgroup with system memory - X gb limit. Or do you have suggestions on how I can tackle this issue? Thanks!

That's good enough - we probably need to monitor resource usage and possibly resize the cgroup if system daemons will require it, but I guess it's out of scope of your proposal.

erictune · 2015-07-23T17:02:16Z

I don't think we need to introduce priority or tiers at this time (in a way other than by adding request). I've commented on this on #147.

AnanyaKumar · 2015-07-23T21:13:25Z

@davidopp Thank you very much for your useful feedback! If you have time, I'd like to talk to you in person to understand some of your concerns better. I'll post my thoughts on the issues you mention, 1 per comment.

bgrant0607 · 2015-08-06T23:42:35Z

Please rebase and run gendocs. Once shippable and travis pass, we can merge this and move the doc and fix problems in subsequent PRs.

davidopp · 2015-08-07T00:16:29Z

docs/proposals/resource-qos.md

+
+## QoS Classes
+
+In an overcommitted system, containers might eventually have to be killed, for example if the system runs out of CPU or memory resources. Ideally, we should kill containers that are less important. For each resource, we divide containers into 3 QoS classes: *Guaranteed*, *Burstable*, and *Best-Effort*, in decreasing order of priority.


You should define what overcommitted means: sum(request) > machine capacity.

davidopp · 2015-08-07T00:26:55Z

BTW feel free to address my comments in a follow-up PR. No need to block submitting this.

bgrant0607 · 2015-08-07T16:42:44Z

Need to re-run hack/run-gendocs.sh.

k8s-bot · 2015-08-07T17:32:13Z

GCE e2e build/test passed for commit 03490af.

AnanyaKumar · 2015-08-07T18:29:39Z

Let's run Shippable again, there's an integration test test failure with TestUnschedulableNodes, which is probably a race, since the PR doesn't modify any code :)

bgrant0607 · 2015-08-07T18:30:49Z

LGTM. Shippable failure was a flaky integration test.

bgrant0607 · 2015-08-07T18:39:48Z

Restarted

davidopp · 2015-08-07T19:00:53Z

docs/proposals/resource-qos.md

+    - Assuming that a container typically has a single big process, if a burstable container that uses more memory than requested conflicts with a burstable container using less memory than requested, the former will be killed
+    - If burstable containers with multiple processes conflict, then the formula for OOM scores is a heuristic, it will not ensure "Request and Limit" guarantees. This is one reason why control loops will be added in subsequent iterations.
+- Pod infrastructure container
+  - OOM_SCORE_ADJ: -999


Why not -1000?

Personally, I thought it should be -1000 until we have better protection for Kubelet (and -1000 was what was in my initial proposal), but @vmarmol had great reasons for why it should be -999. If it's -1000 we run the risk of system panics, because it tells the OOM killer the process can't be killed. This would be an issue if Docker, for example, uses too much memory.

It depends on the behavior we want: if we think there's no point keeping up the node if Docker dies - then we should set it to -1000. But cluster administrators might want the node running even if Docker dies.

Thanks for the explanation. Maybe mention this in the doc? People might wonder.

AnanyaKumar · 2015-08-07T21:14:12Z

Flaky test failed again :( Doesn't have anything to do with this PR though

QoS proposal

Due to kubernetes#11713

googlebot added the cla: yes label Jul 22, 2015

bgrant0607 mentioned this pull request Jul 23, 2015

QoS tiers #147

Closed

bgrant0607 reviewed Jul 23, 2015
View reviewed changes

gmarek reviewed Jul 23, 2015
View reviewed changes

vishh assigned bgrant0607 Jul 23, 2015

derekwaynecarr mentioned this pull request Aug 6, 2015

Disable limit checking for scheduler #12345

Closed

davidopp reviewed Aug 7, 2015
View reviewed changes

Add qos proposal

03490af

AnanyaKumar force-pushed the qos-proposal branch from a2c65b0 to 03490af Compare August 7, 2015 17:10

bgrant0607 added the lgtm "Looks good to me", indicates that a PR is ready to be merged. label Aug 7, 2015

davidopp reviewed Aug 7, 2015
View reviewed changes

satnam6502 added a commit that referenced this pull request Aug 7, 2015

Merge pull request #11713 from AnanyaKumar/qos-proposal

be5e224

QoS proposal

satnam6502 merged commit be5e224 into kubernetes:master Aug 7, 2015

piosz mentioned this pull request Aug 12, 2015

Initial Resources proposal #12472

Merged

HaiyangDING mentioned this pull request Aug 17, 2015

Support multiple schedulers #11793

Closed

HaiyangDING mentioned this pull request Aug 20, 2015

Replace limits with requests in scheduler documentation. #12979

Merged

HaiyangDING pushed a commit to HaiyangDING/kubernetes that referenced this pull request Aug 24, 2015

Repalce limits with requests in scheduler documentation.

2979ae7

Due to kubernetes#11713

bgrant0607 mentioned this pull request Sep 23, 2016

Explicit pod QoS class. #32468

Closed

xingzhou pushed a commit to xingzhou/kubernetes that referenced this pull request Dec 15, 2016

Repalce limits with requests in scheduler documentation.

66a9ff2

Due to kubernetes#11713


		In the first iteration, we will stick to a simple policy. Pods that specify CPU and memory limits for every container will be classified as top-tier, all other pods will be classified as best-effort. This policy offers protection to pods that specify resource requirements (from pods that have no limits and can go overboard in resource consumption), which gives them better resource guarantees.

		Extension: In the future, we might add a field to PodSpec in the API, to allow users to specify whether a pod is best-effort or top-tier.


		On the other hand in the current system, if all pods have specified limits, cluster utilization can be poor. In the current implementation, Kubelet does not support over-committing pods. Very often, pods don’t use all the resources that they request, which leads to a lot of wasted resources. For example, we might have 4 pods, each reserving 1GB of memory in a system with 4GB memory, but only using 500MB of memory. Theoretically, we could slot in another 4 pods of this type, but Kubernetes will reject new pods.

		In this design document, we propose implementing two tiers of pods - one in which we guarantee resources (top-tier pods) and the second wherein we provide resources if they’re available and at a slightly lower quality (best-effort pods). Adding a best-effort tier of pods will increase node’s utilization in a Kubernetes cluster (in Borg, supporting best-effort pods increased utilization by about 20%). We also implement a policy of automatically classifying pods that don’t provide limits as best-effort. This provides much better resource guarantees to pods with limits specified.


		## QoS Classes

		In an overcommitted system, containers might eventually have to be killed, for example if the system runs out of CPU or memory resources. Ideally, we should kill containers that are less important. For each resource, we divide containers into 3 QoS classes: Guaranteed, Burstable, and Best-Effort, in decreasing order of priority.

QoS proposal #11713

QoS proposal #11713

Conversation

AnanyaKumar commented Jul 22, 2015

k8s-bot commented Jul 22, 2015

AnanyaKumar commented Jul 22, 2015

davidopp commented Jul 23, 2015

davidopp commented Jul 23, 2015

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

pmorie commented Jul 23, 2015

bgrant0607 commented Jul 23, 2015

bgrant0607 commented Jul 23, 2015

Choose a reason for hiding this comment

Choose a reason for hiding this comment

erictune commented Jul 23, 2015

Choose a reason for hiding this comment

Choose a reason for hiding this comment

derekwaynecarr commented Jul 23, 2015

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

erictune commented Jul 23, 2015

AnanyaKumar commented Jul 23, 2015

bgrant0607 commented Aug 6, 2015

Choose a reason for hiding this comment

Choose a reason for hiding this comment

davidopp commented Aug 7, 2015

bgrant0607 commented Aug 7, 2015

k8s-bot commented Aug 7, 2015

AnanyaKumar commented Aug 7, 2015

bgrant0607 commented Aug 7, 2015

bgrant0607 commented Aug 7, 2015

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

AnanyaKumar commented Aug 7, 2015