
DaemonSet pods should be scheduled by default scheduler, not DaemonSet controller #42002

Closed
davidopp opened this Issue Feb 23, 2017 · 73 comments

Comments

@davidopp (Member) commented Feb 23, 2017

DaemonSet controller being its own scheduler has caused lots of confusion. Once we have tolerations and preemption, I think we won't need DaemonSet-specific scheduling logic. See also #42001.

cc/ @kubernetes/sig-scheduling-misc @kubernetes/sig-cluster-lifecycle-misc

@davidopp (Member Author) commented Feb 23, 2017 (comment minimized)

@smarterclayton (Contributor) commented Feb 23, 2017 (comment minimized)

@timothysc (Member) commented Feb 24, 2017

The work @jayunit100 was doing to refactor portions of the scheduler so they could be linked externally, without causing a dependency train wreck, was meant to support issues like this.

@davidopp (Member Author) commented Feb 24, 2017

It's much better for DaemonSet to just produce pods like every other controller, and let them be scheduled by the regular scheduler, than to be its own scheduler. There are definitely use cases for linking in scheduler code to a component, but I don't think this is one of them. Even if it were easy to do that, DaemonSet shouldn't be a scheduler.

@kargakis (Member) commented Feb 24, 2017

+1

Trying to keep the DS controller in sync with the scheduler has felt like a nightmare so far.

@timothysc (Member) commented Feb 24, 2017

@davidopp Totally agree...
But then DaemonSet just becomes a special node-based anti-affinity... more of a macro than a primitive, which was my original position in the first place.

@k82cn (Member) commented Feb 25, 2017

+1

@resouer (Member) commented Feb 25, 2017

A naive question: what happens if there's no scheduler running? (For example, when this DS is used to deploy master components.)

@kargakis (Member) commented Feb 25, 2017

Good question. Cluster admins can always run one-off pods and bypass the scheduler manually.

@davidopp (Member Author) commented Feb 26, 2017

This would presumably also fix #35228.

@timothysc (Member) commented Feb 27, 2017

> A naive question: what happens if there's no scheduler running? (For example, when this DS is used to deploy master components.)

@mikedanese proposed one-shot scheduling for bootstrapping about a year ago.

@kargakis (Member) commented Apr 4, 2017

@jbeda @luxas @roberthbailey you guys will be interested in this one

@luxas (Member) commented Apr 4, 2017

Yeah, thanks @kargakis. This was exactly the "fear" @jbeda mentioned.
If we do this, I'd rather have a "bootstrapping mode" that bypasses it (in case this breaks people; we don't know that yet).

@roberthbailey (Member) commented Apr 5, 2017

I'm not sure I fully understand the bootstrapping concerns. You can always bypass the scheduler if you want (by specifying a node explicitly), so why not start with that for bootstrapping and then pivot to using a DS once the scheduler is running?
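For illustration only (not part of the original comment), here is a minimal sketch of the bypass described above: a pod that sets spec.nodeName directly is never handled by the scheduler, so the kubelet on that node just runs it. The node name, image, and pod name are assumptions.

```go
package main

import (
	"fmt"

	v1 "k8s.io/api/core/v1"
	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
)

// bootstrapPod builds a pod that bypasses the scheduler entirely:
// because spec.nodeName is already set, no scheduler is involved and
// the kubelet on that node admits and runs it directly.
func bootstrapPod(nodeName string) *v1.Pod {
	return &v1.Pod{
		ObjectMeta: metav1.ObjectMeta{
			Name:      "bootstrap-scheduler",
			Namespace: "kube-system",
		},
		Spec: v1.PodSpec{
			NodeName: nodeName, // pre-bound: no scheduling step happens
			Containers: []v1.Container{{
				Name:  "kube-scheduler",
				Image: "example.invalid/kube-scheduler:tag", // placeholder image
			}},
		},
	}
}

func main() {
	fmt.Printf("%+v\n", bootstrapPod("master-0"))
}
```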

@jayunit100 (Member) commented Apr 5, 2017

This brings up the question of whether it's possible, in general, to protect certain things from being written to the API server by unauthorized components. I don't know the answer to that question.

@smarterclayton (Contributor) commented Apr 5, 2017

We could also just add a new core controller that handles long-running scheduler failures. However, that's not enough, since you usually need a node, replication controller, and deployment controller in order to run. Failsafe self-hosted components is really a discussion in itself and deserves a longer design. I don't think DaemonSet should be doing its own scheduling either, especially with sophisticated taints in play.

@k82cn (Member) commented Apr 5, 2017

I'm not sure why we support bypassing the scheduler; the scheduler should be the only component that places pods on nodes (static pods are considered outside of scheduling; can we update Node.Allocatable to exclude static pods' resources?).

  1. If a pod is placed by an admin, I doubt they know the cluster in enough detail to make the "right" decision.
  2. If a pod is placed by an external scheduler, that logic should move into the scheduler or into a scheduler extender/plugin.
  3. If there's any requirement the scheduler doesn't meet, I suggest enhancing the scheduler.

@jayunit100 (Member) commented Apr 5, 2017

Are we all in agreement here that the scheduler should be the one doing the scheduling in all cases?

@k82cn (Member) commented Apr 5, 2017

> the scheduler should be the one doing the scheduling in all cases

IMO +1, and it should be part of the "Kubernetes core". I'd like to see others' comments :).

@smarterclayton (Contributor) commented Apr 5, 2017

Having the scheduler do the scheduling means we can take away the bind permission from components other than the scheduler, and then introduce OpenShift's admission controller that requires that permission in order to modify the node name on pods. That prevents "normal" tenants from bypassing scheduling and makes tolerations a viable security mechanism.
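As a rough sketch of what restricting the bind permission could look like in RBAC terms (an illustration, not from the thread): binding a pod to a node goes through the pods/binding subresource, so granting that verb only to the scheduler's identity is one way to express "only the scheduler can place pods". The role name is an assumption.

```go
package rbacsketch

import (
	rbacv1 "k8s.io/api/rbac/v1"
	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
)

// schedulerBindRole grants only the ability to bind pods to nodes
// (the pods/binding subresource). If only the scheduler's service
// account is bound to this role, other components cannot place pods
// directly by creating bindings.
func schedulerBindRole() *rbacv1.ClusterRole {
	return &rbacv1.ClusterRole{
		ObjectMeta: metav1.ObjectMeta{Name: "system:pod-binder"},
		Rules: []rbacv1.PolicyRule{{
			APIGroups: []string{""},             // core API group
			Resources: []string{"pods/binding"}, // the bind subresource
			Verbs:     []string{"create"},
		}},
	}
}
```

An admission controller, as described above, would additionally be needed to stop clients from setting spec.nodeName on pod creation, which skips binding entirely.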

@luxas (Member) commented Apr 6, 2017

The primary issue we discussed in sig-cluster-lifecycle was that currently a pod-network DS is deployed regardless of whether a Node is Ready or NotReady. When we switch to the scheduler, this won't happen.

But I think switching to tolerations for the NotReady condition will fix that issue.
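To make the tolerations point concrete, here is a hedged sketch of the tolerations a pod-network DaemonSet pod could carry so that a scheduler would still place it on NotReady nodes. The taint keys shown are the current node.kubernetes.io ones, which may differ from the names in use when this was written.

```go
package tolerations

import v1 "k8s.io/api/core/v1"

// networkDaemonTolerations lets a DaemonSet pod be scheduled onto (and stay
// on) nodes that are NotReady or report the network as unavailable, which is
// exactly the situation a pod-network daemon has to tolerate in order to fix.
func networkDaemonTolerations() []v1.Toleration {
	return []v1.Toleration{
		{
			Key:      "node.kubernetes.io/not-ready",
			Operator: v1.TolerationOpExists,
			Effect:   v1.TaintEffectNoExecute,
		},
		{
			Key:      "node.kubernetes.io/network-unavailable",
			Operator: v1.TolerationOpExists,
			Effect:   v1.TaintEffectNoSchedule,
		},
	}
}
```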

@timothysc (Member) commented Apr 6, 2017

@davidopp: enqueue for the next SIG meeting.

@davidopp (Member Author) commented Apr 7, 2017

I think it's reasonable to ship by default with "only default scheduler can set NodeName" (assuming there aren't any bootstrapping issues) but I think we need to make it possible for cluster admins to optionally add to the set of components that can set NodeName, so that people can write custom schedulers. (Not all user customizations to the scheduler can be cast into the extender model.)
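As a side note on custom schedulers (an illustration, not part of the original comment): a pod opts into a non-default scheduler through spec.schedulerName, and whichever component watches for that name is the one that ends up setting NodeName via a binding. The scheduler name and image below are assumptions.

```go
package customsched

import (
	v1 "k8s.io/api/core/v1"
	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
)

// podForCustomScheduler returns a pod that the default scheduler will ignore;
// only a scheduler configured to watch for the (hypothetical) name
// "my-scheduler" will try to bind it.
func podForCustomScheduler() *v1.Pod {
	return &v1.Pod{
		ObjectMeta: metav1.ObjectMeta{Name: "custom-scheduled", Namespace: "default"},
		Spec: v1.PodSpec{
			SchedulerName: "my-scheduler", // assumed name of the replacement scheduler
			Containers: []v1.Container{{
				Name:  "app",
				Image: "example.invalid/app:tag", // placeholder image
			}},
		},
	}
}
```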

@wojtek-t (Member) commented Jul 5, 2017

@davidopp - yes, I think we understand it the same way; for me it is "running the scheduling algorithm (in the scheduler itself) against the current state, without applying any changes to the internal state".

So what I was trying to say is that:

  1. currently it's not possible to run multiple "dry-run queries" at the same time (nor simultaneously with regular scheduling);
  2. to make this API at all useful, we would have to enable that;
  3. rate-limiting will solve a bunch of problems, but we need to make sure it's exactly what we want (maybe it is, I don't know yet);
  4. I don't think dry-run will solve the cluster-autoscaler use case, because the autoscaler also runs predicates on artificial nodes it is planning to add for simulation purposes; the scheduler won't be able to do that (because no such node exists).

The other thing that I didn't think through very carefully is that, for the "dry run" mode, none of the use cases I can think of requires running priorities; it's only about predicates. If that's the case, we would need a bit of refactoring to support it (though not that much).
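Below is a self-contained sketch of the predicate-only "dry run" described above, using stand-in types rather than the real scheduler packages (the types and predicate functions are invented for illustration). It answers "which existing nodes could this pod fit on right now?" without mutating any state, which is why priorities never need to run.

```go
package dryrun

// Node and Pod are simplified stand-ins for the scheduler's cached state;
// a real implementation would read the scheduler's node/pod snapshot.
type Node struct {
	Name           string
	AllocatableCPU int64 // millicores
	RequestedCPU   int64 // millicores already requested by pods on the node
	Labels         map[string]string
}

type Pod struct {
	RequestCPU   int64 // millicores
	NodeSelector map[string]string
}

// Predicate is a pure fit check: it performs no writes, so any number of
// dry runs can share the same snapshot of cluster state.
type Predicate func(p Pod, n Node) bool

func fitsResources(p Pod, n Node) bool {
	return n.RequestedCPU+p.RequestCPU <= n.AllocatableCPU
}

func matchesNodeSelector(p Pod, n Node) bool {
	for k, v := range p.NodeSelector {
		if n.Labels[k] != v {
			return false
		}
	}
	return true
}

// DryRun returns the names of nodes on which every predicate passes.
// Priorities (ranking) are deliberately omitted: the use cases discussed
// above only need to know whether and where the pod could fit.
func DryRun(p Pod, nodes []Node, preds []Predicate) []string {
	var feasible []string
	for _, n := range nodes {
		ok := true
		for _, pred := range preds {
			if !pred(p, n) {
				ok = false
				break
			}
		}
		if ok {
			feasible = append(feasible, n.Name)
		}
	}
	return feasible
}
```

As noted in point 4, this cannot cover the cluster-autoscaler case, because the autoscaler evaluates predicates against nodes that do not exist yet.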

@davidopp (Member Author) commented Jul 5, 2017

Good point about the cluster autoscaler; I hadn't thought about that. (You could imagine "dry run" mode allowing you to add nodes, not just a pending pod, but that's starting to get pretty complicated...)

For the others, my point is just that maybe continuing to do sequential scheduling would be OK (mix dry-runs and real scheduling, never both at the same time, always in sequence) if we limit the rate of the dry-run requests.

@wojtek-t (Member) commented Jul 5, 2017

> For the others, my point is just that maybe continuing to do sequential scheduling would be OK (mix dry-runs and real scheduling, never both at the same time, always in sequence) if we limit the rate of the dry-run requests.

Maybe. We just need to list the use cases and check that rate-limiting wouldn't make this unusable because too few requests would be allowed.

@davidopp (Member Author) commented Jul 8, 2017

I'm not sure where we should go with this. If we rely on DaemonSet controller to manage critical system pods (e.g. run kube-proxy on every node), then the system may break if the cluster admin substitutes a scheduler that doesn't schedule the DaemonSet pods as expected.

I discussed this a little with @timothysc, and his perspective (I'll let him elaborate more on it) was that there are many things a cluster admin can do to break things, and they should just make sure that any custom scheduler they use handles DaemonSet pods correctly.

I think the "simulation" question (how components that need to "simulate" where pods will schedule should work in a world of replaceable and multiple schedulers) is a good one but should be discussed in a separate issue (#48657) since it applies to more than just DaemonSet. I think that even if we had a perfect solution for "simulation" of scheduler behavior, we might still want DaemonSet to schedule its own pods, so that critical system pods can always rely on being scheduled in a predictable way regardless of what replacement scheduler might be used.

But I'll let @timothysc chime in with his perspective.

(Also, I came across #29276, which is related to this one.)

@k82cn (Member) commented Jul 10, 2017

> I think that even if we had a perfect solution for "simulation" of scheduler behavior, we might still want DaemonSet to schedule its own pods, so that critical system pods can always rely on being scheduled in a predictable way regardless of what replacement scheduler might be used.

So we're still going to let DaemonSet schedule its own pods?

I think a replacement scheduler should handle pods' attributes well, e.g. mirror pods and critical pods.

@bsalamat (Contributor) commented Jul 10, 2017

This is a tough one, and it's hard to say which approach is better. If I had to decide, I would still let the main scheduler schedule DaemonSet pods. I agree with @timothysc that an admin can do many things to break a system. Even if we let the DaemonSet controller schedule its pods, a substitute scheduler can still break the system badly by, for example, not scheduling ReplicaSet pods. I know that many critical system pods are DaemonSet pods and that the cluster will not function if they are not scheduled properly, but the net outcome for a cluster with a broken scheduler that is supposed to run, for example, a replicated web service is the same whether or not DaemonSet pods are scheduled by the DaemonSet controller. In both cases, the cluster will not perform what it is supposed to do. So it is imperative that the substitute scheduler perform all of its expected work correctly, and part of that "expectation" can be scheduling DaemonSet pods.

@davidopp (Member Author) commented Jul 10, 2017

One thing I forgot that @timothysc had suggested was that we could have scheduler conformance tests that verify that a replacement scheduler meets whatever minimum criteria we have. (This is kind of an automated version of #17208.) This is an excellent idea regardless of what we do with DaemonSet. For example, it could verify that the scheduler implements the predicates that are built into kubelet admission.

I'm still not sure about DaemonSet. If we decide we want it to use the default scheduler, we may have to implement #48657.
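A hedged skeleton of the kind of check a scheduler conformance test could make (the package, helper name, and polling interval are invented; this assumes a recent client-go where Get takes a context): create a plain pod and assert that some scheduler eventually sets spec.nodeName.

```go
package conformance

import (
	"context"
	"fmt"
	"time"

	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
	"k8s.io/client-go/kubernetes"
)

// waitForScheduled polls until the named pod has been bound to a node,
// i.e. spec.nodeName is non-empty. A conformance suite would call this after
// creating pods that exercise tolerations, node affinity, and so on.
func waitForScheduled(ctx context.Context, client kubernetes.Interface, ns, name string, timeout time.Duration) error {
	deadline := time.Now().Add(timeout)
	for time.Now().Before(deadline) {
		pod, err := client.CoreV1().Pods(ns).Get(ctx, name, metav1.GetOptions{})
		if err != nil {
			return err
		}
		if pod.Spec.NodeName != "" {
			return nil // some scheduler bound the pod
		}
		time.Sleep(2 * time.Second)
	}
	return fmt.Errorf("pod %s/%s was not scheduled within %v", ns, name, timeout)
}
```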

@k82cn (Member) commented Jul 11, 2017

+1 to conformance tests for a replacement scheduler.

After going through the discussion again, I'm also not sure about DaemonSet :). I think it depends on our expectations of the replacement scheduler, e.g. will a replacement scheduler support NodeAffinity?

Regarding dry-run, I'd prefer to add some annotations to the Pod instead of exposing a REST endpoint from the default scheduler; that introduces an extra dependency between components (we introduced the HTTPExtender because Golang does not support dynamic library loading).

@erictune (Member) commented Jul 11, 2017

To allow bootstrapping a scheduler pod onto a cluster, there needs to be some type of controller that does not depend on the scheduler itself. Given that there has to be something that does it, I don't see why DaemonSet can't be that thing.

@davidopp (Member Author) commented Jul 11, 2017

Bootstrapping was discussed earlier in the thread (though perhaps not thoroughly), see #42002 (comment)

@k82cn (Member) commented Jul 28, 2017

> To allow bootstrapping a scheduler pod onto a cluster, there needs to be some type of controller that does not depend on the scheduler itself. Given that there has to be something that does it, I don't see why DaemonSet can't be that thing.

If we use DaemonSet for this, we need to enhance DaemonSet to support starting only one daemon (the scheduler) on a set of nodes (the master nodes).
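For reference, a sketch (with an assumed master node label and placeholder image) of the closest today's API gets to this: a DaemonSet whose nodeSelector restricts it to master nodes. Note that it still runs one pod per matching node rather than a fixed count, which is the enhancement being asked for above.

```go
package bootstrapds

import (
	appsv1 "k8s.io/api/apps/v1"
	v1 "k8s.io/api/core/v1"
	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
)

// schedulerDaemonSet runs a scheduler pod on every node carrying the
// (assumed) master label: one per master, not "N of M" as discussed above.
func schedulerDaemonSet() *appsv1.DaemonSet {
	labels := map[string]string{"app": "kube-scheduler"}
	return &appsv1.DaemonSet{
		ObjectMeta: metav1.ObjectMeta{Name: "kube-scheduler", Namespace: "kube-system"},
		Spec: appsv1.DaemonSetSpec{
			Selector: &metav1.LabelSelector{MatchLabels: labels},
			Template: v1.PodTemplateSpec{
				ObjectMeta: metav1.ObjectMeta{Labels: labels},
				Spec: v1.PodSpec{
					NodeSelector: map[string]string{
						"node-role.kubernetes.io/master": "", // assumed master label
					},
					Containers: []v1.Container{{
						Name:  "kube-scheduler",
						Image: "example.invalid/kube-scheduler:tag", // placeholder image
					}},
				},
			},
		},
	}
}
```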

@luxas (Member) commented Jul 28, 2017 (comment minimized)

@k82cn (Member) commented Jul 29, 2017

I think it's usually a subset of the master nodes, e.g. 3 schedulers across 5 master nodes, along with other add-ons, e.g. the dashboard.

@aveshagarwal (Member) commented Sep 25, 2017

@k82cn I am not sure if this has been discussed. How will the scenario where a DS pod's node selector is updated by an admission plugin like PodNodeSelector be handled? See issue #51788.

@k82cn (Member) commented Sep 26, 2017

@aveshagarwal, no, I don't think there's a final decision for now; maybe we can continue the discussion in 1.9. According to the sig-apps backlog, maybe @erictune has some suggestions on how to move forward :).

@erictune (Member) commented Oct 11, 2017

It is not going to be a good user experience if the DaemonSet controller makes one pod for every node and lets the scheduler decide which ones fit. Then you end up with a bunch of pending pods, which is confusing to users, and perhaps extra load on the scheduler, and perhaps confusing for cluster autoscalers.

So, somehow, the DaemonSet controller needs to know (modulo race conditions, taints, etc.) how many pods to create.

Is there some way the DaemonSet controller can know which nodes likely match, make pods for each of those nodes, and then still let the scheduler do a final fit check, etc.?

Also, isn't scheduler performance going to be poor if it has to deal with a large number of pods with anti-affinity?

What can we do?

(BTW, I no longer think the bootstrapping use case matters, so ignore this comment: #42002 (comment))
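One possible shape of the middle ground asked about above (a sketch for illustration, not a decision from this thread): the DaemonSet controller pre-filters nodes it believes match, creates one pod per candidate node, and pins each pod to its node with a required node affinity on metadata.name, so the scheduler still performs the final fit check and binding. The function names and image are invented.

```go
package dspernode

import (
	v1 "k8s.io/api/core/v1"
	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
)

// pinToNode returns a required node affinity that restricts the pod to a
// single node by name, while leaving the actual fit check and binding to
// whichever scheduler processes the pod.
func pinToNode(nodeName string) *v1.Affinity {
	return &v1.Affinity{
		NodeAffinity: &v1.NodeAffinity{
			RequiredDuringSchedulingIgnoredDuringExecution: &v1.NodeSelector{
				NodeSelectorTerms: []v1.NodeSelectorTerm{{
					MatchFields: []v1.NodeSelectorRequirement{{
						Key:      "metadata.name",
						Operator: v1.NodeSelectorOpIn,
						Values:   []string{nodeName},
					}},
				}},
			},
		},
	}
}

// daemonPodFor builds the per-node pod the controller would create for one
// candidate node; only the name and affinity are sketched here.
func daemonPodFor(dsName, nodeName string) *v1.Pod {
	return &v1.Pod{
		ObjectMeta: metav1.ObjectMeta{GenerateName: dsName + "-"},
		Spec: v1.PodSpec{
			Affinity: pinToNode(nodeName),
			Containers: []v1.Container{{
				Name:  "daemon",
				Image: "example.invalid/daemon:tag", // placeholder image
			}},
		},
	}
}
```

With this shape, pods only go pending on the nodes the controller already picked, rather than one pending pod per node in the cluster, and the anti-affinity concern does not arise because each pod targets exactly one node.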

@bsalamat (Contributor) commented Oct 11, 2017

Given all the discussions around this topic, I am fine with letting the DaemonSet controller keep its current behavior and schedule DaemonSet pods. As @erictune said, this has the benefit of reducing scheduler and autoscaler load. It will also provide better guarantees that critical DaemonSet pods will be scheduled in a predictable manner, even in clusters where the default scheduler is replaced by a custom scheduler.

The obvious drawbacks are: 1) the DaemonSet controller needs to keep some of the cluster state in memory, as the scheduler does, to check certain scheduling rules with reasonable throughput; 2) even though the scheduler predicates are a library that the DaemonSet controller can call, there will still be work in maintaining some of the scheduling logic in both the scheduler and the DaemonSet controller. We should be especially careful when adding new predicate functions and make sure they are added to the DaemonSet controller if needed.

@davidopp (Member Author) commented Oct 11, 2017

Just an observation: if we keep the DaemonSet controller scheduling pods, then we will need to add some of the preemption logic to it, since getting rid of the "critical pod rescheduler" is one of the motivations for having a general preemption mechanism. I say "some" because it doesn't need to decide which node to preempt on (it wants to run on every node that meets the other constraints), but it does need to decide the correct minimal set of victims. It also becomes one more component that can delete pods.
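To make the "minimal set of victims" idea concrete, here is a toy sketch (invented types, CPU only): evict the lowest-priority pods first, and only as many as are needed to free room for the incoming daemon pod. Real preemption also has to respect PodDisruptionBudgets, the preemptor's priority, graceful termination, and more.

```go
package preemption

import "sort"

// Victim is a simplified view of a running pod that could be preempted.
type Victim struct {
	Name     string
	Priority int32
	CPU      int64 // millicores requested
}

// minimalVictims returns the smallest prefix of lowest-priority victims whose
// eviction frees at least `needed` millicores, or nil if evicting every
// candidate still would not be enough.
func minimalVictims(candidates []Victim, needed int64) []Victim {
	sorted := append([]Victim(nil), candidates...)
	sort.Slice(sorted, func(i, j int) bool { return sorted[i].Priority < sorted[j].Priority })

	var chosen []Victim
	var freed int64
	for _, v := range sorted {
		if freed >= needed {
			break
		}
		chosen = append(chosen, v)
		freed += v.CPU
	}
	if freed < needed {
		return nil
	}
	return chosen
}
```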

@k82cn (Member) commented Oct 12, 2017

> It is not going to be a good user experience if the DaemonSet controller makes one pod for every node and lets the scheduler decide which ones fit. Then you end up with a bunch of pending pods, which is confusing to users, and perhaps extra load on the scheduler, and perhaps confusing for cluster autoscalers.

IMO, as an admin, I'll plan a DaemonSet based on a "static" setting. For example, I plan to start the DaemonSet on 5 nodes (by nodeSelector) with maxUnavailable set to 3; if it only starts 2 pods because of resources, the rolling upgrade should stay pending (terminating the only 2 available pods would break my QoS/SLA).

> Also, isn't scheduler performance going to be poor if it has to deal with a large number of pods with anti-affinity?

Maybe NodeAffinity :). Regarding scheduler performance, I think there's some work we can do to improve it, e.g. #42002 (comment) vs #42002 (comment)

Our docs say Kubernetes supports multiple and user-provided schedulers, but we did not give any suggestion on how those schedulers work together; there may be several conflicts if that depends on the kubelet.

~~And will a customized scheduler have to handle scheduling features, e.g. affinity, taints/tolerations, preemption? I'm not sure of the impact if it does not follow them :(.~~ Handled by conformance tests :).

@bsalamat (Contributor) commented Oct 19, 2017

As discussed in the mailing lists of sig-scheduling and sig-architecture, the final decision is to keep the current behavior of DaemonSet controller and let it schedule its own pods.

@davidopp (Member Author) commented Oct 19, 2017

In that case I guess we should close this.

BTW the thread @bsalamat refers to is here:
https://groups.google.com/d/msg/kubernetes-sig-scheduling/kMG7yfONwY4/Nx3abXuNAAAJ
