
DaemonSet should respect Pod Affinity and Pod AntiAffinity #29276

Open
lukaszo opened this Issue Jul 20, 2016 · 17 comments


lukaszo (Member) commented Jul 20, 2016

This is a split from #22205 where only node affinity was added to DaemonSets. Pod Affinity and Pod AntiAffinity are still missing in DaemonSets.

It can be implemented in two ways:

  1. Add the InterPodAffinityMatches predicate check to nodeShouldRunDaemonPod in daemoncontroller.go.
  2. Add the InterPodAffinityMatches predicate to GeneralPredicates, which is used by the DaemonSet controller.
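For concreteness, this is the kind of stanza that is accepted in a DaemonSet's pod template today but not evaluated by the DaemonSet controller; the `app: backend` and `app: database` selectors are hypothetical examples, not names from this issue:

```yaml
# Pod-template fragment (hypothetical labels): run the daemon pod only
# on nodes that already host a pod labeled app: backend, and keep it
# off nodes hosting a pod labeled app: database.
affinity:
  podAffinity:
    requiredDuringSchedulingIgnoredDuringExecution:
      - labelSelector:
          matchLabels:
            app: backend
        topologyKey: kubernetes.io/hostname
  podAntiAffinity:
    requiredDuringSchedulingIgnoredDuringExecution:
      - labelSelector:
          matchLabels:
            app: database
        topologyKey: kubernetes.io/hostname
```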

cc @bgrant0607

kargakis (Member) commented May 20, 2017

@lukaszo is this issue fixed?

lukaszo (Member, Author) commented May 21, 2017

@kargakis nope

kow3ns (Member) commented Jul 14, 2017

@lukaszo @kargakis @davidopp

DaemonSet is meant to run one copy of a Pod on every node that matches the NodeSelector. What would PodAffinity/PodAntiAffinity do here?

I can imagine that hard PodAntiAffinity might be used to mean "put a Pod on every Node that matches your NodeSelector except for Nodes that have Pod x", but users could achieve the same functionality with a more restrictive NodeSelector and a more granular labeling scheme.
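The NodeSelector alternative mentioned here would look roughly like the following sketch; the label name and value are hypothetical, and keeping the label accurate is left to the operator (or some out-of-band automation):

```yaml
# DaemonSet pod-template fragment (hypothetical labeling scheme):
# nodes known not to host "Pod x" carry the label runs-pod-x: "false",
# maintained outside the DaemonSet itself.
spec:
  template:
    spec:
      nodeSelector:
        runs-pod-x: "false"
```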

Have we put thought into the semantics of the other forms of Pod Affinity/AntiAffinity?

lukaszo (Member, Author) commented Jul 14, 2017

@kow3ns some use cases are described in the PR #31136

davidopp (Member) commented Jul 14, 2017

@kow3ns it's not an unreasonable question. Thinking out loud here, I could imagine you might want to run one daemon per rack, e.g. to control some rack-level hardware resource. (Pod affinity is harder to justify.)
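The one-daemon-per-rack idea is expressible today without DaemonSets, using a Deployment with hard self-anti-affinity over a rack-level topology key. A minimal sketch, assuming a hypothetical `example.com/rack` node label and image name:

```yaml
apiVersion: apps/v1
kind: Deployment
metadata:
  name: rack-daemon
spec:
  replicas: 4                # at most one pod lands per rack; extras stay pending
  selector:
    matchLabels:
      app: rack-daemon
  template:
    metadata:
      labels:
        app: rack-daemon
    spec:
      affinity:
        podAntiAffinity:
          requiredDuringSchedulingIgnoredDuringExecution:
            - labelSelector:
                matchLabels:
                  app: rack-daemon         # repels copies of itself
              topologyKey: example.com/rack  # hypothetical rack label on nodes
      containers:
        - name: daemon
          image: rack-daemon:1.0           # hypothetical image
```

Unlike a DaemonSet, this requires choosing a replica count and does not automatically cover newly added racks.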

fejta-bot commented Dec 31, 2017

Issues go stale after 90d of inactivity.
Mark the issue as fresh with /remove-lifecycle stale.
Stale issues rot after an additional 30d of inactivity and eventually close.

Prevent issues from auto-closing with an /lifecycle frozen comment.

If this issue is safe to close now please do so with /close.

Send feedback to sig-testing, kubernetes/test-infra and/or @fejta.
/lifecycle stale

bgrant0607 (Member) commented Jan 23, 2018

/remove-lifecycle stale

@bsalamat has been looking at this

dblackdblack commented Jan 23, 2018

+1. This used to work back when affinity was specified via the scheduler.alpha.kubernetes.io/affinity pod annotation. We used this feature for DaemonSets, but it is now broken.

In other words, this is a regression.

@kow3ns kow3ns added this to In Progress in Workloads Feb 27, 2018

fejta-bot commented Apr 30, 2018

Issues go stale after 90d of inactivity.
Mark the issue as fresh with /remove-lifecycle stale.
Stale issues rot after an additional 30d of inactivity and eventually close.

If this issue is safe to close now please do so with /close.

Send feedback to sig-testing, kubernetes/test-infra and/or fejta.
/lifecycle stale

dblackdblack commented Apr 30, 2018

/remove-lifecycle stale

fejta-bot commented Jul 29, 2018

Issues go stale after 90d of inactivity.
Mark the issue as fresh with /remove-lifecycle stale.
Stale issues rot after an additional 30d of inactivity and eventually close.

If this issue is safe to close now please do so with /close.

Send feedback to sig-testing, kubernetes/test-infra and/or fejta.
/lifecycle stale

dblackdblack commented Jul 29, 2018

mitchellh commented Sep 5, 2018

Hello! I'd love to see this as well. Rather than just saying +1, which I know is unhelpful, I'll explain our use case:

We're using a DaemonSet to deploy Consul clients on all nodes. However, clients don't need to run where server agents are running. We'd like to use pod anti-affinity to avoid scheduling the clients where servers are currently running.

We understand there are other obvious shortcomings today, such as what happens when a server agent has to be rescheduled. We're hoping those issues will get resolved as preemption stabilizes.
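For the Consul use case above, the desired (currently unsupported) DaemonSet pod-template fragment would be roughly the following; the `app`/`component` labels are assumptions for illustration, not values taken from any particular Consul deployment:

```yaml
# Keep client pods off any node already running a Consul server pod.
affinity:
  podAntiAffinity:
    requiredDuringSchedulingIgnoredDuringExecution:
      - labelSelector:
          matchLabels:
            app: consul
            component: server
        topologyKey: kubernetes.io/hostname
```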

bsalamat (Contributor) commented Sep 5, 2018

Up until K8s 1.12, DaemonSet (DS) pods used to be scheduled by the DS controller and the DS controller didn't support pod affinity/anti-affinity. In Kubernetes 1.12 (the coming release), DS pods will be scheduled by the default scheduler. With the new change, it is possible to have inter-pod affinity/anti-affinity (IPAA) for DS pods, but the workflow of supporting IPAA is not clear to me. Here are more details:

DS pods continue to be created by the DS controller (in 1.12). The DS controller first determines which nodes should run the DS pods and then creates those pods (with node affinity targeting specific nodes). If we want to support IPAA for DS pods, the DS controller should evaluate the IPAA rules and create pods for nodes that pass IPAA constraints. There are two issues with this approach:

  1. IPAA is complex and may slow down the DS controller.
  2. In the default scheduler, IPAA is evaluated at the time of scheduling pods. If we add the logic to the DS controller, IPAA will be evaluated at the time of creating a DS, and the right number of pods will be created for the right set of nodes. However, that would not be enough. When a pod is removed from a node, the DS controller must evaluate all DS'es with anti-affinity and create new DS pods for such nodes if the anti-affinity rules are satisfied after the removal of the pod. Similarly, when a pod is scheduled on a new node, the DS controller must evaluate all DS'es with affinity and create new DS pods if the affinity rules are satisfied. This would cause a high load on the DS controller in large clusters with high churn.

Given these, I don't think we can support IPAA in the DS controller.
An alternative approach is to let the DS controller create DS pods for all nodes, ignoring the IPAA rules. Since the default scheduler enforces IPAA, some of these pods would remain pending until they become schedulable on their assigned nodes, possibly for the life of the cluster if those nodes never become feasible. This is not a good solution either, as it would cause the scheduler to keep reevaluating these pods. It is also poor UX to have the DS controller create pods that may never be scheduled.

rabbitfang commented Oct 8, 2018

I wanted to mention my use case for this feature. We use New Relic APM for monitoring some of our applications. NR APM charges per-host, not per-pod/per-container. To reduce costs, we would like to run the APM agent only on the nodes where the relevant pods are running (i.e. the pods running applications that use NR APM). One (theoretical) way of doing this would be to add a pod affinity to the New Relic APM DaemonSet to match pods labeled as using the APM; however, that feature obviously isn't supported.

A workaround that I thought of (but have not yet implemented) would be to have a custom controller that labels nodes with new-relic-apm: active whenever relevant pods are scheduled on that node, and removes the label after all the relevant pods are removed from the node. Then we would add a node selector to the DaemonSet to match that label. The DaemonSet would still hold the responsibility for creating the APM agent pods and performing updates, and we wouldn't need to implement our own pod controller.
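With such a controller in place, the DaemonSet side of this workaround reduces to a node selector on the controller-maintained label, e.g.:

```yaml
# DaemonSet pod-template fragment: only schedule the APM agent onto
# nodes the (hypothetical) custom controller has labeled as active.
spec:
  template:
    spec:
      nodeSelector:
        new-relic-apm: active
```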

fejta-bot commented Jan 6, 2019

Issues go stale after 90d of inactivity.
Mark the issue as fresh with /remove-lifecycle stale.
Stale issues rot after an additional 30d of inactivity and eventually close.

If this issue is safe to close now please do so with /close.

Send feedback to sig-testing, kubernetes/test-infra and/or fejta.
/lifecycle stale

dblackdblack commented Jan 7, 2019
