scheduler: move all preCheck to QueueingHint #110175
/assign
Maybe it only happens for a newly created unscheduled Pod.
And Node delete events have the same issue...? I guess we need to run this: kubernetes/pkg/scheduler/eventhandlers.go Lines 92 to 112 in 31a1024
Among the in-tree plugins, only PodTopologySpread uses the Node delete event, and I guess this bug affects its performance. (Even when some Nodes are deleted, the Pods that were rejected by PodTopologySpread aren't moved to activeQ/backoffQ by those delete events.)
Hm, it may be a problem that Pod and Node events don't always trigger this move when they occur. (Even if it's intended behavior, at least we need to write this in the comment of EventsToRegister.)
/retitle scheduler: events in EventsToRegister related to Pod and Node don't always work
(previous title: scheduler: events in EventsToRegister related to Pod doesn't work)
To summarize: the problem is that "events related to Pod or Node don't always move Pods from unschedulableQ to activeQ/backoffQ, and that may surprise custom-plugin developers because this behavior isn't even documented." I think we should move all unschedulable Pods on all Pod and Node events. If we're concerned about performance degradation, we can extend … This problem has a non-small impact, especially in v1.24+ because we changed PodMaxUnschedulableQDuration to 5 min.

Let's see the details... ↓

Assigned Pod's add/update events
We move the unscheduled Pods (rejected by a plugin that registers Pod add/update events) here: kubernetes/pkg/scheduler/internal/queue/scheduling_queue.go Lines 593 to 607 in 31a1024
But the target Pods of this move are only the unschedulable Pods that have an affinity term matching the incoming pod: kubernetes/pkg/scheduler/internal/queue/scheduling_queue.go Lines 660 to 678 in 31a1024
So... users cannot move Pods from unschedulableQ to activeQ/backoffQ with all assignedPod events? I guess we should pass all unscheduled Pods to movePodsToActiveOrBackoffQueue.

Assigned Pod's delete events
No issue on this: kubernetes/pkg/scheduler/eventhandlers.go Line 234 in 31a1024

Non-scheduled Pod's add/update/delete events
We don't move the unschedulable Pods on these events at all. That's a problem... even when a plugin registers Pod add/update/delete events in EventsToRegister, nothing is moved. (In our company, we have a custom plugin for …)

Node's add/update events
We move only the unscheduled Pods that pass the preCheck:
kubernetes/pkg/scheduler/eventhandlers.go Line 70 in 31a1024
kubernetes/pkg/scheduler/eventhandlers.go Line 88 in 31a1024
kubernetes/pkg/scheduler/eventhandlers.go Lines 474 to 483 in 31a1024
The problem that arises here is similar to "Assigned Pod's add/update events".

Node's delete events
We don't move the unschedulable Pods on Node delete events at all... That's a problem. Even when a plugin registers the Node delete event in EventsToRegister, nothing is moved.
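For illustration, here is a minimal, self-contained sketch of the pattern described above (names and signatures are simplified assumptions, not the scheduler's real internals): on a Node add/update event, only the unschedulable Pods that pass a preCheck-style filter are re-queued, so Pods rejected for reasons the filter doesn't know about stay stuck in unschedulableQ.

```go
package main

import (
	"fmt"

	v1 "k8s.io/api/core/v1"
	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
)

// preEnqueueCheck stands in for the scheduler's preCheck: a filter run against
// every unschedulable Pod before it is moved back to activeQ/backoffQ.
// (Simplified, assumed shape; not the exact scheduler-internal type.)
type preEnqueueCheck func(pod *v1.Pod) bool

// moveAllToActiveOrBackoffQueue sketches how the preCheck gates the move: only
// Pods for which the check returns true are re-queued; the rest stay in
// unschedulableQ, which is exactly how a custom plugin's Pods can get stranded.
func moveAllToActiveOrBackoffQueue(unschedulable []*v1.Pod, check preEnqueueCheck, event string) []*v1.Pod {
	var moved []*v1.Pod
	for _, pod := range unschedulable {
		if check != nil && !check(pod) {
			continue
		}
		moved = append(moved, pod)
	}
	fmt.Printf("event %q: moved %d of %d unschedulable Pods\n", event, len(moved), len(unschedulable))
	return moved
}

func main() {
	pods := []*v1.Pod{
		{ObjectMeta: metav1.ObjectMeta{Name: "rejected-by-node-affinity"}},
		{ObjectMeta: metav1.ObjectMeta{Name: "rejected-by-custom-plugin"}},
	}
	// A preCheck that only cares about node-fit style rejections; the Pod
	// rejected by the custom plugin is never moved, no matter how many Node
	// events arrive.
	check := func(pod *v1.Pod) bool { return pod.Name == "rejected-by-node-affinity" }
	moveAllToActiveOrBackoffQueue(pods, check, "NodeAdd")
}
```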
/cc @Huang-Wei @alculquicondor @ahg-g Could you please take a look?
These are all intended optimizations. These basic checks prevent the scheduler from retrying unschedulable pods over and over. So far we haven't had the need to make these checks configurable, as they check well-established scheduling features. +1 on documenting the behavior, but I'm not sure about changing it or making it configurable (that adds complexity that might not be justified). What is it that you are trying to do?
We are trying to remove the periodic flushing of the queue in #87850, right? Without an improvement like the one I suggest, some Pods in unschedulableQ could never move back to activeQ in some cases once we remove flushing. This can happen with some custom plugins, like our company's, and even with in-tree plugins.
cc @kerthcet, who implemented NodeInclusionPolicy.
Can you provide a concrete example? The policy just refers to the topology spread calculations. It doesn't mean that the node is exempt from matching the node affinity in order to accept the pod.
I'm asking what your custom plugin is trying to do. Depending on whether it makes sense, we need to evaluate if it's worth the complexity.
This is actually a problem here: 5 mins would be too long for a background flush. But it doesn't mean that we should simply shorten the interval.
The skew of a newly created node is always 0, so it's still unfit, I think.
Our company's custom plugin is like the coscheduling plugin. Like coscheduling's PreFilter, our plugin rejects Pods in PreFilter when the number of Pods is less than the defined number (a rough sketch of this kind of PreFilter is below). Yes, so maybe the coscheduling plugin has the same issue as ours. The coscheduling plugin also uses the Pod Add event, and I guess that's for non-scheduled Pod creation. (I don't know much about the coscheduling plugin, though.)
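As a rough, hypothetical sketch of the kind of plugin described here (a simplified free function rather than the real framework PreFilterPlugin interface; names like groupPreFilter and the pod-group label are assumptions, not the actual plugin):

```go
package main

import (
	"fmt"

	v1 "k8s.io/api/core/v1"
	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
)

// groupPreFilter is a simplified stand-in for a coscheduling-like PreFilter:
// it rejects the Pod as unschedulable while its group has fewer than minMember
// Pods. The real plugin would implement the scheduler framework's
// PreFilterPlugin interface and return a framework.Status instead of an error.
func groupPreFilter(pod *v1.Pod, podsInGroup, minMember int) error {
	if podsInGroup < minMember {
		return fmt.Errorf("pod %q rejected: group %q has %d pods, needs %d",
			pod.Name, pod.Labels["pod-group"], podsInGroup, minMember)
	}
	return nil
}

func main() {
	pod := &v1.Pod{ObjectMeta: metav1.ObjectMeta{
		Name:   "worker-0",
		Labels: map[string]string{"pod-group": "job-a"},
	}}
	// With only 2 of 3 group members created, the Pod stays in unschedulableQ;
	// it can only become schedulable via a Pod Add event for its group, which
	// is exactly the event this issue says isn't always propagated.
	if err := groupPreFilter(pod, 2, 3); err != nil {
		fmt.Println(err)
	}
}
```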
Oh, that's true 😓 Please forget about that example (the newly created Node one).
Maybe we can find some pointcuts by the … For custom plugins, we may have to do some refactoring to register the filter func together with the event.
So now NodeC is excluded from calculations. Hence the pod fits in either NodeA or NodeB, which were unchanged. Tricky... I'll have to think about it. We really need those quick checks for performance in most scenarios.
/triage accepted
Thanks @sanposhiho for bringing this up. Some comments below:
You're right. Pods failed by PodTopologySpread may not get re-queued immediately. So instead of "pass all unscheduled Pods to movePodsToActiveOrBackoffQueue", I'd like something like:
p.movePodsToActiveOrBackoffQueue(
p.getUnschedulablePodsWithMatchingAffinityTerm(pod),
p.getUnschedulablePodsWithPodTopologySpreadConstraints(pod), /* Not Implemented Yet */
AssignedPodAdd,
)
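For context, the not-yet-implemented helper in that snippet could look roughly like this (a hypothetical sketch with a simplified signature; the real helper would live on the scheduling queue and work on its podInfo bookkeeping rather than on a plain slice):

```go
package queue

import (
	v1 "k8s.io/api/core/v1"
	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
	"k8s.io/apimachinery/pkg/labels"
)

// getUnschedulablePodsWithPodTopologySpreadConstraints is a hypothetical analog
// of getUnschedulablePodsWithMatchingAffinityTerm: it selects the unschedulable
// Pods whose topologySpreadConstraints select the newly added/updated Pod,
// since only those Pods can become schedulable because of the AssignedPodAdd event.
func getUnschedulablePodsWithPodTopologySpreadConstraints(unschedulable []*v1.Pod, pod *v1.Pod) []*v1.Pod {
	var result []*v1.Pod
	for _, up := range unschedulable {
		if up.Namespace != pod.Namespace {
			continue
		}
		for _, c := range up.Spec.TopologySpreadConstraints {
			selector, err := metav1.LabelSelectorAsSelector(c.LabelSelector)
			if err != nil {
				continue
			}
			if selector.Matches(labels.Set(pod.Labels)) {
				result = append(result, up)
				break
			}
		}
	}
	return result
}
```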
It's pretty rare to move pods depending on unschedulable Pods' events. In your particular case, maybe a proactive Activate …
It doesn't sound correlated, does it? As the …
Updated: the example in #110175 (comment) is a valid point...
Can you share some use-cases for moving Pods upon a Node's deletion? Before this, I don't think we had any.
If we do so, we also need to provide a way to pass some functions to decide which Pods should be passed to movePodsToActiveOrBackoffQueue.
Ah~, that's a good idea :)
Maybe like this (similar to #110175 (comment))
So... ignoring some events for performance on the scheduler side (without providing a way to change this) could be inconvenient for some plugins. What I suggested in the previous comment (#110175 (comment)) is to define "which kind of resource's events should be ignored" on the plugin side. We can add a pre-filter function to ClusterEvent:

// obj: the created, updated, or deleted object
// rejectedPod: one of the unschedulable Pods in unschedulableQ
// return value: whether this unschedulable Pod should be moved to activeQ/backoffQ or not.
type PreFilterFn func(obj interface{}, rejectedPod *v1.Pod) bool

type ClusterEvent struct {
	Resource    GVK
	ActionType  ActionType
	Label       string
	// PreFilterFn is the function to check which Pods should be moved from unschedulableQ to activeQ/backoffQ when this event happens.
	PreFilterFn PreFilterFn
}

(current ClusterEvent definition: kubernetes/pkg/scheduler/framework/types.go Lines 80 to 84 in da7f184)
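For example, from the plugin author's side the proposal might be used like this (a hypothetical sketch: the PreFilterFn field doesn't exist today, and the framework's GVK/ActionType types are simplified to strings here):

```go
package sketch

import v1 "k8s.io/api/core/v1"

// PreFilterFn and ClusterEvent mirror the proposal above in simplified form.
type PreFilterFn func(obj interface{}, rejectedPod *v1.Pod) bool

type ClusterEvent struct {
	Resource    string
	ActionType  string
	Label       string
	PreFilterFn PreFilterFn
}

// EventsToRegister as a coscheduling-like custom plugin might implement it:
// on a Pod Add event, re-queue a rejected Pod only when the newly created Pod
// belongs to the same (assumed) pod-group.
func EventsToRegister() []ClusterEvent {
	return []ClusterEvent{{
		Resource:   "Pod",
		ActionType: "Add",
		Label:      "PodAdd",
		PreFilterFn: func(obj interface{}, rejectedPod *v1.Pod) bool {
			newPod, ok := obj.(*v1.Pod)
			if !ok {
				return false
			}
			return newPod.Labels["pod-group"] == rejectedPod.Labels["pod-group"]
		},
	}}
}
```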
Then we can set any pre-filter functions for each plugin. What do you think about this idea?
Nope, that is something the scheduler core needs to implement.
Thanks. It makes sense to me.
A customizable PreCheck approach can certainly satisfy more dynamic cases, but it's inappropriate to associate it with ClusterEvent because (1) ClusterEvent is basically an immutable constant, and (2) we use it to map to plugins. If we go this way, we need to wrap ClusterEvent into another data structure, which is kind of complicated.

To summarize, the following cases don't properly re-queue relevant pods:
For 1 & 2, I've commented above with solutions. (1 needs core changes similar to podAffinity; 2 may leverage the Activate hook.) For 3, with @sanposhiho's examples, I'm convinced this is an area we need to improve. One simple and straightforward solution is to plumb a hook to let the plugin claim "alright, upon Node events, move the pods failed by me anyway". Then once this knob is registered, we move pods failed by that plugin back unconditionally, instead of proceeding with …

So, as a first step, let's ensure we're on the same page and agree on the ^^ summary. Next, we can better evaluate trade-offs and come up with an optimal solution.
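As a very rough sketch of what such an opt-in hook could look like (entirely hypothetical; the interface name and shape are assumptions, nothing like this exists in the framework):

```go
package sketch

// NodeEventRequeuer is a hypothetical knob: a plugin implementing it asks the
// scheduler to skip the preCheck and move the Pods it rejected back to
// activeQ/backoffQ on any Node event it registered for.
type NodeEventRequeuer interface {
	// RequeueOnAnyNodeEvent reports whether Pods failed by this plugin should
	// be re-queued unconditionally when a Node event occurs.
	RequeueOnAnyNodeEvent() bool
}

// shouldSkipPreCheck shows where the queue could consult the knob: if the
// rejecting plugin registered it, the preCheck is bypassed for its Pods.
func shouldSkipPreCheck(rejectingPlugin interface{}) bool {
	r, ok := rejectingPlugin.(NodeEventRequeuer)
	return ok && r.RequeueOnAnyNodeEvent()
}
```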
Thanks for discussing this topic in the SIG meeting. 🙏 So... the benefit of the "customizable PreCheck approach" (#110175 (comment)) isn't only to solve this issue. We can also use this feature for other optimizations, like improving the overall performance of the scheduler by moving only the Pods that are more likely to be schedulable, by defining more detailed events in …
Yes, I watched the SIG meeting recording as well, and the opinion that the "customizable PreCheck approach" adds some implementation complexity certainly makes sense. But... isn't that increase in complexity acceptable, even taking the above into account?

EDIT: It is a bad habit of mine to always write long sentences even though I am not very good at English. 😓
I didn't jump into the thread since it's actively under discussion.
@sanposhiho I'm not sure I fully understand your intention. The Slack thread was discussing the AssignedPod update case, so it falls into case 1 in #110175 (comment), which we already agreed to improve upstream.
Hi @sanposhiho, are you working on this? I think this is quite important, and if you're not working on it, I can follow up.
Sorry for leaving this for a while. I wanted to wait for KEP-3521: Pod Scheduling Readiness (and then forgot to come back here. 😓) @Huang-Wei Do you think it makes sense to add "by which event the pod will be moved back to schedQ/backoffQ" to the interface, like this?
type EnqueuePlugin interface {
Plugin
// event: By which event the pod will be moved back to schedQ/backoffQ.
// If the Pod is moved back to schedQ/backoffQ by flushing, the event will be nil. (Flushing is planned to be removed in https://github.com/kubernetes/kubernetes/issues/87850, though.)
// involvedObj: the object involved in that event.
// For example, if the given event is "Node deleted", involvedObj will be that deleted Node.
PreEnqueue(ctx context.Context, state *CycleState, p *v1.Pod, event framework.ClusterEvent, involvedObj runtime.Object) *Status
}
I know … For example, I believe we can move ↓ this logic into the PodAffinity plugin by implementing it: kubernetes/pkg/scheduler/internal/queue/scheduling_queue.go Lines 660 to 678 in 31a1024
The PodAffinity plugin will remember the Pods rejected by itself in … We can resolve (3) by implementing …
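For instance, a hypothetical sketch of how the PodAffinity plugin could implement that proposed PreEnqueue (assumed semantics: Success means "move the Pod back to activeQ/backoffQ", Unschedulable means "keep it in unschedulableQ"; namespace handling and other details of the real queue logic are omitted):

```go
package sketch

import (
	"context"

	v1 "k8s.io/api/core/v1"
	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
	"k8s.io/apimachinery/pkg/labels"
	"k8s.io/apimachinery/pkg/runtime"
	"k8s.io/kubernetes/pkg/scheduler/framework"
)

// interPodAffinitySketch shows how the PodAffinity plugin might implement the
// PreEnqueue proposed above, taking over the queue's built-in
// getUnschedulablePodsWithMatchingAffinityTerm check. This follows the proposed
// (not existing) interface signature from the comment above.
type interPodAffinitySketch struct{}

func (pl *interPodAffinitySketch) PreEnqueue(ctx context.Context, state *framework.CycleState, p *v1.Pod, event framework.ClusterEvent, involvedObj runtime.Object) *framework.Status {
	addedPod, ok := involvedObj.(*v1.Pod)
	if !ok {
		// Not a Pod event; this plugin has no reason to re-queue the Pod.
		return framework.NewStatus(framework.Unschedulable)
	}
	for _, term := range requiredPodAffinityTerms(p) {
		selector, err := metav1.LabelSelectorAsSelector(term.LabelSelector)
		if err != nil {
			continue
		}
		if selector.Matches(labels.Set(addedPod.Labels)) {
			// The newly added Pod may satisfy this affinity term; re-queue.
			return framework.NewStatus(framework.Success)
		}
	}
	return framework.NewStatus(framework.Unschedulable)
}

func requiredPodAffinityTerms(p *v1.Pod) []v1.PodAffinityTerm {
	if p.Spec.Affinity == nil || p.Spec.Affinity.PodAffinity == nil {
		return nil
	}
	return p.Spec.Affinity.PodAffinity.RequiredDuringSchedulingIgnoredDuringExecution
}
```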
My original concern in this issue is that having logic for specific plugins in the scheduler core may limit out-of-tree plugins' extensibility. Ideally, the scheduler itself should be pure and shouldn't do anything special for any particular plugin's implementation. If we can move all such logic into each plugin, that'd be perfect.
^ is more like a new feature than just a solution to this problem. I created another issue and described my proposal in detail. Let's have a discussion there.
It can be resolved by moving all preCheck into each plugin's QueueingHint. Let me retitle this issue so that the current status is easier to understand.
/retitle scheduler: move all preCheck to QueueingHint
Original issue: EventsToRegister related to Pod and Node don't always work
What happened?
In our custom plugin, we want to move Pods that are unscheduled by the plugin to activeQ/backoffQ when a new Pod is created.
So, we define EventsToRegister like this: … (a registration of this kind is sketched below). But, even when a new Pod is created, the Pod that was unscheduled by our custom plugin isn't moved to backoffQ/activeQ.
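For illustration only (the plugin name is hypothetical and this is not the reporter's original snippet), an EventsToRegister that asks for re-queueing on Pod creation looks roughly like this:

```go
package sketch

import "k8s.io/kubernetes/pkg/scheduler/framework"

type customPlugin struct{}

// EventsToRegister asks the scheduler to move Pods rejected by this plugin
// back to activeQ/backoffQ whenever a Pod is added.
func (pl *customPlugin) EventsToRegister() []framework.ClusterEvent {
	return []framework.ClusterEvent{
		{Resource: framework.Pod, ActionType: framework.Add},
	}
}
```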
What did you expect to happen?
When a new Pod is created, the Pod unscheduled by our custom plugin is moved to backoffQ/activeQ.
How can we reproduce it (as minimally and precisely as possible)?
Anything else we need to know?
If I understand correctly, and if it's not expected behavior, this bug has a non-small impact, especially in v1.24 because we changed PodMaxUnschedulableQDuration to 5 min. #108761
Kubernetes version
v1.23.4, but I think the latest master has the same issue.
Cloud provider
OS version
Install tools
Container runtime (CRI) and version (if applicable)
Related plugins (CNI, CSI, ...) and versions (if applicable)