External QoS controller and implement resizing a Pod/Container resource envelope via update in the API server. #28316

Closed
balajismaniam opened this Issue Jun 30, 2016 · 12 comments

balajismaniam commented Jun 30, 2016

Note: Relates to #28315 (Online resource oversubscription and external QoS control).

Background:
An external QoS controller can continuously monitor cluster utilization and application performance and take corrective actions. These actions include the following:

  1. Tag pods and nodes so that only certain pods can be scheduled on some nodes.
  2. Evict pods and tag the nodes appropriately so that a class of pods cannot be scheduled on the same node.
  3. Resource deflation or inflation of jobs running on the node. In Kubernetes, this resource deflation and inflation translates to resizing the pod/container resource requests and limits.

In Kubernetes, the current QoS classifications allow best-effort pods to consume the usage slack (i.e., the difference between the resources allocated and the resources actually used). However, there are no restrictions on which nodes can host best-effort pods or on how many best-effort pods can be scheduled on a node. This can cause performance interference between the pods originally scheduled on the node and the best-effort pods, leading to service-level objective (SLO) violations. The proposed external QoS controller can mitigate such issues.

Requirements:

  1. A mechanism to tag pods and nodes so that only certain pods can be hosted on a node.
  2. A mechanism to evict pods from a node and tag the node using requirement 1.
  3. A mechanism to resize the resource requests and limits of each container within a pod via a pod update.

Usage Examples:

  1. An external agent can tag non-best-effort pods so that only they can be scheduled on certain nodes of the cluster.
  2. An aggressor best-effort pod can be evicted and the node can be tagged until the performance of the victim high-priority pod stabilizes.
  3. A high-priority pod is under-utilizing its requested resources. An external agent can resize (shrink) this pod and allow best-effort pods to be scheduled on the same node (a sketch of this interaction follows this list).
  4. After the best-effort pod completes, an external agent can again resize the high-priority pod back to its original size.
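
Usage examples 3 and 4 amount to updating the resource requests and limits of a running pod's containers through the API server. Below is a minimal sketch of what such an interaction might look like with a recent client-go client; the namespace, pod name, container name, and resource values are placeholders, and the API server does not accept such updates today (enabling them is part of what this issue asks for).

```go
// Sketch only: the proposed resize of a running pod's container resources,
// expressed as a strategic-merge patch. The API server currently rejects
// changes to container resources on pod update; this issue proposes allowing
// them. Namespace, pod name, container name, and values are placeholders.
package main

import (
	"context"

	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
	"k8s.io/apimachinery/pkg/types"
	"k8s.io/client-go/kubernetes"
	"k8s.io/client-go/rest"
)

func main() {
	cfg, err := rest.InClusterConfig()
	if err != nil {
		panic(err)
	}
	client := kubernetes.NewForConfigOrDie(cfg)

	// Shrink the "app" container of a high-priority pod so that best-effort
	// pods fit on the node; the reverse patch would restore the original size.
	patch := []byte(`{"spec":{"containers":[{"name":"app",` +
		`"resources":{"requests":{"cpu":"500m"},"limits":{"cpu":"1"}}}]}}`)
	_, err = client.CoreV1().Pods("default").Patch(
		context.Background(), "high-priority-pod",
		types.StrategicMergePatchType, patch, metav1.PatchOptions{})
	if err != nil {
		panic(err)
	}
}
```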

Possible Implementation:

  1. Requirement 1 can be implemented using admission controllers and the taint/toleration mechanism.
  2. For requirement 2, the pod can be evicted by sending a DELETE request for it to the API server, and the node can then be tagged using requirement 1 (see the sketch below).
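
Below is a minimal sketch of these two steps with a recent client-go client; the node name, pod name, namespace, and taint key are placeholders.

```go
// Sketch only: an external controller taints a node so that new pods without
// a matching toleration cannot be scheduled there (requirement 1), then
// evicts an aggressor best-effort pod by deleting it through the API server
// (requirement 2). Assumes a recent client-go; names and the taint key are
// placeholders.
package main

import (
	"context"

	v1 "k8s.io/api/core/v1"
	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
	"k8s.io/client-go/kubernetes"
	"k8s.io/client-go/rest"
)

func main() {
	cfg, err := rest.InClusterConfig()
	if err != nil {
		panic(err)
	}
	client := kubernetes.NewForConfigOrDie(cfg)
	ctx := context.Background()

	// Requirement 1: tag the node with a NoSchedule taint.
	node, err := client.CoreV1().Nodes().Get(ctx, "node-1", metav1.GetOptions{})
	if err != nil {
		panic(err)
	}
	node.Spec.Taints = append(node.Spec.Taints, v1.Taint{
		Key:    "example.com/qos-isolated",
		Value:  "true",
		Effect: v1.TaintEffectNoSchedule,
	})
	if _, err := client.CoreV1().Nodes().Update(ctx, node, metav1.UpdateOptions{}); err != nil {
		panic(err)
	}

	// Requirement 2: evict the aggressor best-effort pod via a DELETE request.
	err = client.CoreV1().Pods("default").Delete(ctx, "best-effort-pod", metav1.DeleteOptions{})
	if err != nil {
		panic(err)
	}
}
```

Pods that should still be allowed onto the tainted node would carry a matching toleration in spec.tolerations; presumably the admission controller mentioned in item 1 would inject that toleration into the appropriate pods.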

cc @davidopp @ConnorDoyle

davidopp commented Jun 30, 2016

@kubernetes/sig-scheduling @kubernetes/sig-node

vishh commented Jun 30, 2016

This could lead to performance interference between the original pods scheduled on the node and the best-effort pods leading to service-level objective (SLO) violations.

Can you elaborate on the SLO issues? We are continuously improving QoS support.

  • We added out-of-resource eviction based on memory usage and are working towards supporting eviction based on disk usage as well.
  • We are improving cgroups-based isolation by introducing QoS-level and pod-level cgroups.

nqn commented Jul 1, 2016

@vishh These are not SLO issues seen in the current Kubernetes implementation. I think @balajismaniam is referring to a general issue: SLO violations when co-locating latency-critical services with sub-millisecond response-time requirements. These are classic noisy-neighbor problems caused by a mix of OS and hardware resource congestion [1][2]. The use case is scheduling memcached on Kubernetes and safely co-locating it with other workloads (which should be a common pattern). Known issues include L1 to L3 data cache interference, memory bandwidth and network congestion, and CFS scheduling hazards (mostly in on-prem environments).

[1] http://csl.stanford.edu/~christos/publications/2014.mutilate.eurosys.pdf
[2] http://csl.stanford.edu/~christos/publications/2015.heracles.isca.pdf

balajismaniam commented Jul 1, 2016

Thanks, @nqn. @vishh I was referring to the SLO issues that @nqn has described above.

davidopp referenced this issue Jul 12, 2016: Vertical Scaling of Pods #21 (open)

timothysc commented Jul 13, 2016

Same comment on this as it relates to the re-scheduler: there are overlapping requirements.

therc commented Jul 16, 2016

Should something like "A standard way to collect SLI data (e.g. custom metrics)" be an explicit requirement? It's implied somehow in the background: "An external QoS controller can continuously monitor the cluster utilization and application performance". Even if the controller is replaceable, I'd prefer a canonical way to get at the data. That way, there are higher chances of a single controller being enough, without needing to deploy N different ones, potentially interfering with each other.

ConnorDoyle commented Aug 24, 2016

@therc, an API for app metrics would be nice to have, but I'm not sure it needs to block this.

fgrzadkowski commented Sep 12, 2016

DonghuiZhuo commented Mar 17, 2017

@balajismaniam You mentioned that "An aggressor best-effort pod can be evicted and the node can be tagged till the performance of the victim high-priority pod stabilizes."

How do you plan to measure the performance of the victim?

fejta-bot commented Dec 22, 2017

Issues go stale after 90d of inactivity.
Mark the issue as fresh with /remove-lifecycle stale.
Stale issues rot after an additional 30d of inactivity and eventually close.

Prevent issues from auto-closing with an /lifecycle frozen comment.

If this issue is safe to close now please do so with /close.

Send feedback to sig-testing, kubernetes/test-infra and/or @fejta.
/lifecycle stale

fejta-bot commented Jan 21, 2018

Stale issues rot after 30d of inactivity.
Mark the issue as fresh with /remove-lifecycle rotten.
Rotten issues close after an additional 30d of inactivity.

If this issue is safe to close now please do so with /close.

Send feedback to sig-testing, kubernetes/test-infra and/or @fejta.
/lifecycle rotten
/remove-lifecycle stale

fejta-bot commented Feb 20, 2018

Rotten issues close after 30d of inactivity.
Reopen the issue with /reopen.
Mark the issue as fresh with /remove-lifecycle rotten.

Send feedback to sig-testing, kubernetes/test-infra and/or fejta.
/close
