Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

QoS proposal #11713

Merged
merged 1 commit into from Aug 7, 2015
Merged

QoS proposal #11713

merged 1 commit into from Aug 7, 2015

Conversation

AnanyaKumar
Copy link
Contributor

@AnanyaKumar AnanyaKumar commented Jul 22, 2015

Ref. #147

@k8s-bot
Copy link

k8s-bot commented Jul 22, 2015

Can one of the admins verify that this patch is reasonable to test? (reply "ok to test", or if you trust the user, reply "add to whitelist")

If this message is too spammy, please complain to ixdy.

@AnanyaKumar
Copy link
Contributor Author

AnanyaKumar commented Jul 22, 2015

@vishh @bgrant0607 @dchen1107 @erictune @yujuhong @lavalamp PTAL when you have time, and let me know what you think.

cc'ing @vmarmol @rjnagal

@davidopp
Copy link
Member

davidopp commented Jul 23, 2015

Hi. I like this proposal in general, but I have a few comments. I went through a few different approaches to try to give the feedback. This was the best I could come up with, but I'm happy to talk in person because this may still be incomprehensible.

At a high level, I think there are three things to consider changing about your proposal:

  1. Completely ignoring best-effort pods in scheduling is likely to cause problems. The main concern is that you can get into an infinite loop of OOM kill. As a trivial case, consider a cluster with one machine. It is running a top-tier pod that requests and uses 100% of the machine's memory. Now someone submits a best-effort pod. It will schedule onto that machine and immediately be killed once it starts trying to use memory. It will then reschedule onto that machine over and over and over again. I'm not sure it's possible to solve this problem without jumping into using both "request" and "limit." But if you're just going to just use "limit" then I think for now the best you can do is something like this:

(i) When scheduling a top-tier pod, ignore best-effort pods
(ii) When scheduling a best-effort pod that specifies a limit, only schedule it onto the machine if capacity - sum of top-tier limits - sum of best-effort usages - new pod's limit >= 0
(iii) When scheduling a best-effort pod that specifies no limit, do the same as in (ii) but pretend the new pod's limit is some reasonable value (maybe use the default values we're using for memory and CPU when spreading zero-limit pods)

Maybe for CPU this isn't an issue because you wouldn't kill, but I would suggest applying the above policy at least for memory.

(Before someone jumps in and suggests some kind of rate limiter to detect these rescheduling loops -- I do not like the approach of using scheduling rate limiters when there are static approaches to avoid these kinds of problems, such as what I described above.)

  1. It seems there are some downsides to the way you plan to handle CPU shares for best-effort pods, both in the short-term plan and the long-term plan. I think the behavior we really want is something like: top-tier pods are always guaranteed to be able use up to their limit; whatever is "left over" from their usage is distributed among the best-effort pods in a fair way.

As a concrete example, let's say there are three pods
T: top-tier, limit 60%, wants to use 100%
L: best-effort, limit 60%, wants to use 100%
B: best-effort, no limit specified, wants to use 100%

The system would give the following distribution: T gets 60% of the CPU, L gets 40%-epsilon, and B gets epsilon. I don't have a problem with putting all the no-limit best-effort pods in a single cgroup that gets a fixed share, but I think the best-effort pods that do request a limit should not be put into that cgroup.

  1. I didn't see any mention of killing based on exceeding memory limit (as opposed to killing due to OOM killer). We should configure the cgroups for memory so that a pod is killed if it exceeds its memory limit, for the pods that specify limit. For the pods that don't specify limit (which should only be allowed for best-effort pods, I think), we would not kill them but should make them the first to be killed when there is an OOM.

@bgrant0607 bgrant0607 mentioned this pull request Jul 23, 2015
@davidopp
Copy link
Member

davidopp commented Jul 23, 2015

Oh, it just occurred to me that you can't do the thing I suggested in 1) (ii) because we don't report usage as an API endpoint available to the scheduler yet. Until we have that, I think you would still benefit from doing what I suggested there (and just pretend sum of usages is zero). I understand the argument that this hurts utilization, but until we have separate "request" and "limit" I think this is the most reasonable thing to do.


In the first iteration, we will stick to a simple policy. Pods that specify CPU and memory limits for every container will be classified as top-tier, all other pods will be classified as best-effort. This policy offers protection to pods that specify resource requirements (from pods that have no limits and can go overboard in resource consumption), which gives them better resource guarantees.

Extension: In the future, we might add a field to PodSpec in the API, to allow users to specify whether a pod is best-effort or top-tier.
Copy link
Member

@bgrant0607 bgrant0607 Jul 23, 2015

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Please remove this line. At this time, I am not in favor of adding an explicit field.

Copy link
Member

@davidopp davidopp Jul 23, 2015

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

In this proposal, how are best-effort pods distinguished from top-tier pods? The proposal seems to be assuming there can be non-0-limit best-effort pods, unless I'm missing something. (And again, I am assuming this is just talking about limit, not request.)

Copy link
Member

@davidopp davidopp Jul 23, 2015

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

(And to be less opaque -- I was assuming there had to be such a bit, since I didn't see any other way to distinguish best-effort from top-tier.)

Copy link
Member

@bgrant0607 bgrant0607 Jul 23, 2015

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Top-tier pods must fully specify all relevant limits.

Copy link
Member

@bgrant0607 bgrant0607 Jul 23, 2015

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Why do I not want a more granular and explicit specification, such as that accepted by lmctfy?

The main reasons are complexity and API expansion.

However, there are also inadequate mechanisms in the upstream kernel to distinguish multiple levels of latency sensitivity and multiple levels of compressibility.

Also, to me, not specifying resources implies a lack of requirement, hence best effort. Besides, the scheduler, Kubelet, and kernel can't guarantee much in the case of unbounded oversubscription, and most such applications disable limit enforcement in order to opportunistically consume resources.

Copy link
Contributor Author

@AnanyaKumar AnanyaKumar Jul 23, 2015

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@bgrant0607 I will remove the line.

@davidopp Yes, in my proposal pods that specify all CPU and memory limits are top-tier, all other pods are best-effort.

I think it's still useful to split this discussion into: 1. The policy of deciding whether a pod is best-effort or top-tier (in our case based on whether pods specify limits) and 2. The QoS that services pods based on the label of top-tier and best-effort. That's what I'm saying here, so hypothetically, limits and tiers can be independent. There was a discussion of possible use cases in #147

Copy link
Contributor

@gmarek gmarek Jul 23, 2015

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'm not a fan of saying that top-tiers = has limits specified and best-effort = no limits specified. This will cause scheduling of best-effort pods to be pretty much random, as we won't have any data which would allow distinguishing them (unless we'll bring requirements into the picture).

Copy link
Member

@bgrant0607 bgrant0607 Jul 23, 2015

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yes, the implication is that we'll implement "request".

Copy link
Contributor

@gmarek gmarek Jul 24, 2015

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

You're right of course: "request" not "requirement" - too many similar words around:/ The thing is I don't see the proposal on how to incorporate requests here, and I think this should be a prerequisite to any QoS work.

Copy link
Contributor Author

@AnanyaKumar AnanyaKumar Jul 24, 2015

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'll update the proposal with requests today!

On 24 Jul, 2015, at 2:31 am, Marek Grabowski notifications@github.com wrote:

In docs/proposals/kubelet-qos.md:

    • Best effort pods will be evicted if a top tier pod needing the memory requirements comes in, or if some other top-tier pod asks for memory.
    • Best-effort pods will not be allowed to exceed their memory limit.
    • Extension: Consider distinguishing between best-effort pods that specify memory limits and give them even lower priority. Otherwise a best-effort pod without a memory limit could use up system resources and kill off other best-effort pods.
      +- Extension: support other resources like network, disk.

+Resource guarantees for top tier pods
+- Will not change.
+- Extension: Consider giving top-tier pods that do not specify memory limits a lower memory priority than top tier pods that specify memory limits (but a higher priority than best-effort pods)
+
+## Policy
+
+The QoS design does not make assumptions about the “policy” used to label pods as best-effort and top-tier.
+
+In the first iteration, we will stick to a simple policy. Pods that specify CPU and memory limits for every container will be classified as top-tier, all other pods will be classified as best-effort. This policy offers protection to pods that specify resource requirements (from pods that have no limits and can go overboard in resource consumption), which gives them better resource guarantees.
+
+Extension: In the future, we might add a field to PodSpec in the API, to allow users to specify whether a pod is best-effort or top-tier.
You're right of course: "request" not "requirement" - too many similar words around:/ The thing is I don't see the proposal on how to incorporate requests here, and I think this should be a prerequisite to any QoS work.


Reply to this email directly or view it on GitHub.

@pmorie
Copy link
Member

pmorie commented Jul 23, 2015

@ncdc @derekwaynecarr

@bgrant0607
Copy link
Member

bgrant0607 commented Jul 23, 2015

cc @timothysc

@bgrant0607
Copy link
Member

bgrant0607 commented Jul 23, 2015

cc @eparis

We will introduce two tiers of pods: top-tier pods and best-effort pods. We will not support multiple levels or priorities.

Best-effort pods will be invisible to top-tier pods:
- **Existence of best-effort pods should not prevent top tier pods from launching**. When deciding whether to launch a top-tier pod, we only look at resource limits of other top-tier pod. Suppose we have a machine with 4GB of memory. Top-tier pod P1 has a 2GB limit, and 4 best-effort pods have a 500MB limit each. Suppose the user launches a new top-tier pod P2 with a 1GB limit. The Kubelet will evict best-effort pods, in unspecified order, to make room for P2. P2 will then be admitted and allocated the requested resources.
Copy link
Member

@bgrant0607 bgrant0607 Jul 23, 2015

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We should define compressible and incompressible.

Copy link
Member

@bgrant0607 bgrant0607 Jul 23, 2015

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Actually, they are defined in docs/design/resources.md, which this doc should link to.

@erictune
Copy link
Member

erictune commented Jul 23, 2015

Agree with Brian that you should phrase this proposal in terms of adding "request" fields.


On the other hand in the current system, **if all pods have specified limits, cluster utilization can be poor**. In the current implementation, Kubelet does not support over-committing pods. Very often, pods don’t use all the resources that they request, which leads to a lot of wasted resources. For example, we might have 4 pods, each reserving 1GB of memory in a system with 4GB memory, but only using 500MB of memory. Theoretically, we could slot in another 4 pods of this type, but Kubernetes will reject new pods.

In this design document, we propose **implementing two tiers of pods** - one in which we guarantee resources (top-tier pods) and the second wherein we provide resources if they’re available and at a slightly lower quality (best-effort pods). Adding a best-effort tier of pods will increase node’s utilization in a Kubernetes cluster (in Borg, supporting best-effort pods increased utilization by about 20%). We also implement **a policy of automatically classifying pods that don’t provide limits as best-effort**. This provides much better resource guarantees to pods with limits specified.
Copy link
Member

@bgrant0607 bgrant0607 Jul 23, 2015

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

By 20% cpu, do you mean "cpu: 200m"?

Copy link
Contributor Author

@AnanyaKumar AnanyaKumar Jul 27, 2015

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think it's a 20% increase in the number of pods that could be squeezed into the same cluster? The statistic was mentioned in the Borg paper. I looked through most of the Borg paper after our discussion, but I didn't analyze the statistics in detail.

@derekwaynecarr
Copy link
Member

derekwaynecarr commented Jul 23, 2015

Can you elaborate on the interplay with ResourceQuota? Right now, as I read this proposal if a quota is applied to a namespace, I am only able to run top tier pods. At one point, I thought we discussed ResourceQuota for each tier. I need to think about this some more, but I suspect I do not want the presence of a ResourceQuota to prohibit my namespace to a single tier of pod in that Namespace.

- Set OOM_SCORE_ADJ to 501
- Top-tier
- Set CPU shares and memory requirements as we do right now
- Set OOM_SCORE_ADJ to -501
Copy link
Contributor

@gmarek gmarek Jul 23, 2015

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We need to take care about docker daemon and kubelet. I guess that they should have lower oom_scores than top-tier pods.

Copy link
Contributor Author

@AnanyaKumar AnanyaKumar Jul 23, 2015

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yup they'll have a lower oom_score_adj. I think right now the kubelet/daemon's oom_score_adj is -900 and kube-proxy's oom_score_adj is -899, so they should be pretty well protected from the user containers.

Copy link
Contributor

@gmarek gmarek Jul 23, 2015

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'm far from Linux expert, but aren't real scores max (whatever we computed, 0)? At least I've never seen anything negative in /proc/pid/oom_score.

Copy link
Contributor Author

@AnanyaKumar AnanyaKumar Jul 23, 2015

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

That's a good point. There would be an issue if the user launches lots of top-tier pods that don't take up much memory. Then all the top-tier pods, the Kubelet, and the docker daemon would have the same oom_score of 0, and we cannot guarantee what gets evicted. Is this the issue that you're concerned about?

I think we could avoid this by keeping some padding for the kubelet, OS, etc (which we don't do at the moment).

Copy link
Contributor

@gmarek gmarek Jul 23, 2015

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

That's exactly my point;)

Copy link
Contributor Author

@AnanyaKumar AnanyaKumar Jul 23, 2015

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Do you think this solution is good enough: keep a padding of about Xgb for system processes, the kubelet, docker daemon and kube-proxy. Put all user defined pods inside a cgroup with system memory - X gb limit. Or do you have suggestions on how I can tackle this issue? Thanks!

Copy link
Contributor

@gmarek gmarek Jul 24, 2015

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

That's good enough - we probably need to monitor resource usage and possibly resize the cgroup if system daemons will require it, but I guess it's out of scope of your proposal.

@erictune
Copy link
Member

erictune commented Jul 23, 2015

I don't think we need to introduce priority or tiers at this time (in a way other than by adding request). I've commented on this on #147.

@AnanyaKumar
Copy link
Contributor Author

AnanyaKumar commented Jul 23, 2015

@davidopp Thank you very much for your useful feedback! If you have time, I'd like to talk to you in person to understand some of your concerns better. I'll post my thoughts on the issues you mention, 1 per comment.

@bgrant0607
Copy link
Member

bgrant0607 commented Aug 6, 2015

Please rebase and run gendocs. Once shippable and travis pass, we can merge this and move the doc and fix problems in subsequent PRs.


## QoS Classes

In an overcommitted system, containers might eventually have to be killed, for example if the system runs out of CPU or memory resources. Ideally, we should kill containers that are less important. For each resource, we divide containers into 3 QoS classes: *Guaranteed*, *Burstable*, and *Best-Effort*, in decreasing order of priority.
Copy link
Member

@davidopp davidopp Aug 7, 2015

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

You should define what overcommitted means: sum(request) > machine capacity.

Copy link
Contributor Author

@AnanyaKumar AnanyaKumar Aug 7, 2015

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Done!

@davidopp
Copy link
Member

davidopp commented Aug 7, 2015

BTW feel free to address my comments in a follow-up PR. No need to block submitting this.

@bgrant0607
Copy link
Member

bgrant0607 commented Aug 7, 2015

Need to re-run hack/run-gendocs.sh.

@k8s-bot
Copy link

k8s-bot commented Aug 7, 2015

GCE e2e build/test passed for commit 03490af.

@AnanyaKumar
Copy link
Contributor Author

AnanyaKumar commented Aug 7, 2015

Let's run Shippable again, there's an integration test test failure with TestUnschedulableNodes, which is probably a race, since the PR doesn't modify any code :)

@bgrant0607
Copy link
Member

bgrant0607 commented Aug 7, 2015

LGTM. Shippable failure was a flaky integration test.

@bgrant0607 bgrant0607 added the lgtm Indicates that a PR is ready to be merged. label Aug 7, 2015
@bgrant0607
Copy link
Member

bgrant0607 commented Aug 7, 2015

Restarted

- Assuming that a container typically has a single big process, if a burstable container that uses more memory than requested conflicts with a burstable container using less memory than requested, the former will be killed
- If burstable containers with multiple processes conflict, then the formula for OOM scores is a heuristic, it will not ensure "Request and Limit" guarantees. This is one reason why control loops will be added in subsequent iterations.
- Pod infrastructure container
- OOM_SCORE_ADJ: -999
Copy link
Member

@davidopp davidopp Aug 7, 2015

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Why not -1000?

Copy link
Contributor Author

@AnanyaKumar AnanyaKumar Aug 7, 2015

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Personally, I thought it should be -1000 until we have better protection for Kubelet (and -1000 was what was in my initial proposal), but @vmarmol had great reasons for why it should be -999. If it's -1000 we run the risk of system panics, because it tells the OOM killer the process can't be killed. This would be an issue if Docker, for example, uses too much memory.

It depends on the behavior we want: if we think there's no point keeping up the node if Docker dies - then we should set it to -1000. But cluster administrators might want the node running even if Docker dies.

Copy link
Member

@davidopp davidopp Aug 7, 2015

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks for the explanation. Maybe mention this in the doc? People might wonder.

@AnanyaKumar
Copy link
Contributor Author

AnanyaKumar commented Aug 7, 2015

Flaky test failed again :( Doesn't have anything to do with this PR though

satnam6502 added a commit that referenced this pull request Aug 7, 2015
@satnam6502 satnam6502 merged commit be5e224 into kubernetes:master Aug 7, 2015
HaiyangDING pushed a commit to HaiyangDING/kubernetes that referenced this pull request Aug 24, 2015
xingzhou pushed a commit to xingzhou/kubernetes that referenced this pull request Dec 15, 2016
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
lgtm Indicates that a PR is ready to be merged.
Projects
None yet
Development

Successfully merging this pull request may close these issues.

None yet