QoS proposal #11713
Hi. I like this proposal in general, but I have a few comments. I went through a few different approaches to giving this feedback; this was the best I could come up with, but I'm happy to talk in person in case it's still incomprehensible.
At a high level, I think there are three things to consider changing about your proposal:
(i) When scheduling a top-tier pod, ignore best-effort pods
Maybe for CPU this isn't an issue because you wouldn't kill anything, but I would suggest applying the above policy at least for memory.
(Before someone jumps in and suggests some kind of rate limiter to detect these rescheduling loops -- I do not like scheduling rate limiters when there are static approaches that avoid these kinds of problems, such as the one I described above.)
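The "ignore best-effort pods" fit check could look roughly like this. This is an illustrative sketch, not the actual scheduler code; the `Pod` type, tier names, and `fits_on_node` function are all hypothetical:

```python
from dataclasses import dataclass

# Hypothetical tier labels, not the proposal's actual names.
TOP_TIER = "top"
BEST_EFFORT = "best-effort"

@dataclass
class Pod:
    name: str
    tier: str
    memory_request: int  # bytes

def fits_on_node(pod: Pod, node_capacity: int, running: list[Pod]) -> bool:
    """When placing a top-tier pod, count only non-best-effort pods toward
    the node's committed memory, since best-effort pods can be evicted to
    make room. Otherwise count everything."""
    if pod.tier == TOP_TIER:
        committed = sum(p.memory_request for p in running if p.tier != BEST_EFFORT)
    else:
        committed = sum(p.memory_request for p in running)
    return committed + pod.memory_request <= node_capacity
```

With this check, a top-tier pod can land on a node that best-effort pods have filled up, so the scheduler never has to loop waiting for evictions.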
As a concrete example, let's say there are three pods: a top-tier pod T, a best-effort pod L that specifies a limit, and a best-effort pod B that specifies no limit.
The system would give the following distribution: T gets 60% of the CPU, L gets 40%-epsilon, and B gets epsilon. I don't have a problem with putting all the no-limit best-effort pods in a single cgroup that gets a fixed share, but I think the best-effort pods that do request a limit should not be put into that cgroup.
Oh, it just occurred to me that you can't do the thing I suggested in 1) (ii), because we don't yet expose usage through an API endpoint available to the scheduler. Until we have that, I think you would still benefit from doing what I suggested there (and just pretending the sum of usages is zero). I understand the argument that this hurts utilization, but until we have separate "request" and "limit" I think this is the most reasonable thing to do.
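The trade-off described above can be sketched as two fit checks: the usage-aware one the scheduler can't do yet, and the limits-only fallback that "pretends the sum of usages is zero". Function names and the worked numbers are illustrative:

```python
def fits_usage_aware(new_limit: int, capacity: int,
                     limits: list[int], usages: list[int]) -> bool:
    # What the scheduler could do with a usage API: admit based on
    # observed consumption, packing nodes more tightly.
    return sum(usages) + new_limit <= capacity

def fits_limits_only(new_limit: int, capacity: int,
                     limits: list[int], usages: list[int]) -> bool:
    # The conservative fallback: ignore usage entirely and reserve the
    # full limit of every running pod. Safe, but hurts utilization.
    return sum(limits) + new_limit <= capacity
```

For example, a node with capacity 1000m running pods with limits [600m, 300m] but usage of only [200m, 100m] would accept a 300m pod under the usage-aware check and reject it under the limits-only check.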
Can you elaborate on the interplay with ResourceQuota? As I read the proposal right now, if a quota is applied to a namespace, I am only able to run top-tier pods in it. At one point I thought we discussed a ResourceQuota for each tier. I need to think about this some more, but I suspect I do not want the presence of a ResourceQuota to restrict a Namespace to a single tier of pods.
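To make the per-tier idea concrete, admission could charge each pod against its own tier's quota, so a namespace can mix tiers. This per-tier quota shape is my assumption, not something the proposal specifies:

```python
def admit(pod_tier: str, pod_request: int,
          used: dict, quota: dict) -> bool:
    """Hypothetical per-tier quota check: charge the pod against its own
    tier's quota only, leaving other tiers in the namespace unaffected."""
    if pod_tier not in quota:
        return False  # no quota defined for this tier in the namespace
    return used.get(pod_tier, 0) + pod_request <= quota[pod_tier]

# Example namespace state (milli-CPU); values are illustrative.
quota = {"top": 1000, "best-effort": 500}
used = {"top": 800}
```

Under this scheme a 300m top-tier pod would be rejected (800m + 300m exceeds the 1000m top-tier quota) while a 300m best-effort pod in the same namespace would still be admitted.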