take PodTopologySpread into consideration when requeueing Pods based on Pod related events #122627
base: master
Conversation
Skipping CI for Draft Pull Request.
This issue is currently awaiting triage. If a SIG or subproject determines this is a relevant issue, they will accept it by applying the triage/accepted label. Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.
/cc @alculquicondor
/hold to go through an approver's review.
Force-pushed from e6e265f to a70b42f.
/retest
test/integration/util/util.go
Outdated
// Note that this function leaks goroutines in the case of timeout; even after this function returns after timeout,
// the goroutine made by this function keeps waiting to pop a pod from the queue.
can you fix this in timeout()?
If we want to fix it, I guess we have to modify schedulingQueue.Pop() to somehow abort waiting (e.g., add ctx to Pop() and abort waiting if ctx is canceled).
I don't have any other idea to solve this leak.
Pop() would be able to quit in another way, I guess: we can use defer schedulingQueue.Close() in each test to signal the queue to quit gracefully, right?
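If it helps, here is a rough sketch of that approach. The podQueue interface and the nextPodWithTimeout helper below are hypothetical stand-ins, not the real scheduling queue API:

```go
package util

import (
	"testing"
	"time"
)

// podQueue is a minimal stand-in for the scheduler's scheduling queue, used only
// in this sketch; the real interface lives in pkg/scheduler/internal/queue.
type podQueue interface {
	Pop() (interface{}, error) // blocks until a pod is available or the queue is closed
	Close()                    // unblocks any goroutine currently waiting in Pop
}

// nextPodWithTimeout is a hypothetical helper illustrating the suggestion above:
// registering Close() as test cleanup lets a goroutine blocked in Pop() return
// when the test ends, instead of leaking and stealing the next popped Pod.
func nextPodWithTimeout(t *testing.T, q podQueue, timeout time.Duration) interface{} {
	t.Helper()
	t.Cleanup(q.Close) // equivalent of "defer schedulingQueue.Close()" for the whole test

	result := make(chan interface{}, 1)
	go func() {
		p, err := q.Pop()
		if err != nil {
			// Pop is assumed to return an error once the queue is closed,
			// so the goroutine simply exits here.
			return
		}
		result <- p
	}()

	select {
	case p := <-result:
		return p
	case <-time.After(timeout):
		return nil // timed out; the goroutine exits once Close() runs at cleanup
	}
}
```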
Actually, we use a different scheduler in every test case, so this leaked goroutine won't be carried over to the next test case.
What I meant here is that NextPod is called multiple times in one test. What confused me when writing tests was:
1. The first call of NextPod times out.
2. Do something, and Pod-A is moved to activeQ.
3. I want to confirm (2) moved Pod-A, and use NextPod again to make sure Pod-A is popped. But the goroutine made in (1) is still alive and pops Pod-A out, so the second call of NextPod returns timeout.
@Huang-Wei addressed your suggestion.
@kubernetes/sig-scheduling-approvers Can anyone take a look?
Force-pushed from 0415147 to f5490a9.
Force-pushed from 01fc67c to d7c4b64.
Several comments.
pkg/scheduler/eventhandlers.go
Outdated
@@ -204,7 +204,19 @@ func (sched *Scheduler) addPodToCache(obj interface{}) {
		logger.Error(err, "Scheduler cache AddPod failed", "pod", klog.KObj(pod))
	}

	sched.SchedulingQueue.AssignedPodAdded(logger, pod)
	// SchedulingQueue.AssignedPodAdded internally pre-filters Pods to move to activeQ while taking only in-tree plugins into consideration.
Can we leave the logic centralized at AssignedPodAdded? Make the function more pure.
I prefer the current implementation because after the SchedulerQueueingHints feature flag goes to GA, AssignedPodAdded is no longer needed and we'll use MoveAllToActiveOrBackoffQueue only. What do you think?
	if utilfeature.DefaultFeatureGate.Enabled(features.SchedulerQueueingHints) {
		sched.SchedulingQueue.MoveAllToActiveOrBackoffQueue(logger, queue.AssignedPodUpdate, oldPod, newPod, nil)
	} else {
		sched.SchedulingQueue.AssignedPodUpdated(logger, oldPod, newPod)
same here
@@ -1085,9 +1089,13 @@ func isPodResourcesResizedDown(pod *v1.Pod) bool {
func (p *PriorityQueue) AssignedPodUpdated(logger klog.Logger, oldPod, newPod *v1.Pod) {
	p.lock.Lock()
	if isPodResourcesResizedDown(newPod) {
		// This case, we don't want to pre-filter Pods by getUnschedulablePodsWithCrossNodeTerm
Plz remove this, let's comment on special case only.
This is a special case actually; people wouldn't have a clue why we don't use getUnschedulablePodsWithCrossNodeTerm if isPodResourcesResizedDown is true.
	nsLabels := interpodaffinity.GetNamespaceLabelsSnapshot(logger, pod.Namespace, p.nsLister)

	var podsToMove []*framework.QueuedPodInfo
	for _, pInfo := range p.unschedulablePods.podInfoMap {
		if pInfo.UnschedulablePlugins.Has(podtopologyspread.Name) {
So this only works for in-tree plugins; for out-of-tree cross-topology plugins, the problem still exists, right? I'm ok with the current implementation, since no feedback/issue has been ticketed about this. Don't want to expand the scope.
We think we can still check the podAffinity here; if it doesn't match, don't append. Agree?
> So this only works for in-tree plugins; for out-of-tree cross-topology plugins, the problem still exists, right?

Yes, right. People have to enable QHint if they want to avoid the same problem in custom scheduler plugins.
> We think we can still check the podAffinity here; if it doesn't match, don't append. Agree?

I took time thinking deeply about this, and found that we should go with the current implementation rather than the suggestion from @kerthcet.
In short: suppose a Pod has both topology spread and pod affinity, and the Pod is rejected only by topology spread. In this case, even if the assigned pod here doesn't match pod-affinity, it could make the Pod schedulable.
More in-depth explanation. Suppose:
- a Pod has both topology spread and pod affinity. The topology spread's maxSkew is 1 and its topology key is the node name.
- node-a has 1 Pod matching topology spread, while node-b has 3 Pods.
- both nodes have a scheduled Pod matching pod affinity.

In this case, an assigned Pod that is on node-a and matches topology spread can make the Pod schedulable, regardless of whether this assigned Pod matches pod affinity or not (because node-a already has another Pod matching pod-affinity).
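To make the scenario concrete, here is a rough sketch using the same test builders that appear later in this PR's tests (st is pkg/scheduler/testing); the names, namespace, and affinity labels are made up for illustration, and the snippet is meant to live inside a test body:

```go
// Sketch only: a pending Pod that has BOTH a topology spread constraint
// (maxSkew=1, topology key "node") and required pod affinity, mirroring the
// builder calls used in this PR's tests.
pendingPod := st.MakePod().Name("p").Namespace("ns1").UID("p").
	SpreadConstraint(1, "node", v1.DoNotSchedule, nil, nil, nil, nil, nil).
	PodAffinityExists("service", "region", st.PodAffinityWithRequiredReq).
	Obj()

// An assigned Pod lands on node-a (the less-crowded topology domain). Even if
// this Pod does NOT match pendingPod's pod affinity, its Pod/Add event can make
// pendingPod schedulable, because node-a already hosts another Pod that matches
// the affinity. That's why the pre-filter must not drop it based on the affinity
// check alone.
assignedPod := st.MakePod().Name("existing").Namespace("ns1").UID("existing").
	Node("node-a").Obj()
```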
@@ -225,7 +237,19 @@ func (sched *Scheduler) updatePodInCache(oldObj, newObj interface{}) {
		logger.Error(err, "Scheduler cache UpdatePod failed", "pod", klog.KObj(oldPod))
	}

	sched.SchedulingQueue.AssignedPodUpdated(logger, oldPod, newPod)
	// SchedulingQueue.AssignedPodUpdated internally pre-filters Pods to move to activeQ while taking only in-tree plugins into consideration.
	// Consequently, if custom plugins that subscribes Pod/Update events reject Pods,
If we only want to handle the podTopologySpread problem here, let's not mention the custom plugins, or it will lead to confusion, like we solved the problem for all plugins, unless you intend to do this.
Can we make the comment more concise? It's a little too long.
So, this PR solves two things:
- Take PodTopologySpread into consideration when requeueing Pods based on Pod related events.
  - This fix is mostly for people who don't have QHint enabled.
- Use MoveAllToActiveOrBackoffQueue only, when the feature gate is enabled (see the sketch after this comment).
  - It's the general fix for the same problem in custom plugins. We have to guard it with the feature flag, otherwise it would negatively impact the throughput. (It's like how we can't remove preCheck until QHint is done.)

Maybe you want me to create a separate PR for the changes in eventhandlers.go?
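For the add path, the gated branch would look roughly like the update-path snippet quoted above. This is only a sketch; queue.AssignedPodAdd below is assumed to be the add-event counterpart of queue.AssignedPodUpdate:

```go
// Sketch of the gated add path in addPodToCache (mirrors the update path above).
if utilfeature.DefaultFeatureGate.Enabled(features.SchedulerQueueingHints) {
	// With QueueingHints, each plugin's hint decides whether the Pod is requeued,
	// so no pre-filtering based on in-tree plugin knowledge is needed here.
	sched.SchedulingQueue.MoveAllToActiveOrBackoffQueue(logger, queue.AssignedPodAdd, nil, pod, nil)
} else {
	// Without QueueingHints, keep the cheaper pre-filtered path.
	sched.SchedulingQueue.AssignedPodAdded(logger, pod)
}
```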
Since there are no plans for cherry-pick, I'm ok fixing both problems in this PR.
Just update the title and description accordingly.
Updated the description.
test/integration/util/util.go
Outdated
@@ -1157,21 +1157,3 @@ func NextPodOrDie(t *testing.T, testCtx *TestContext) *schedulerframework.Queued
	}
	return podInfo
}

// NextPod returns the next Pod in the scheduler queue, with a 5 seconds timeout.
Why remove this? Unused?
Yes, it's unused and it has the goroutine-leak problem mentioned in the comment. So, it's just a cleanup in this PR (I can create another PR, if you want).
Yes, please
Moved this cleanup to #125433
@kerthcet Sorry for being late, I just replied to the comments. (Will have the actual impl changes later.)
@kerthcet Fixed / replied to your comments 🙏
/retest
Force-pushed from bfef5e7 to 4afbd71.
[APPROVALNOTIFIER] This PR is APPROVED. This pull-request has been approved by: sanposhiho. The full list of commands accepted by this bot can be found here. The pull request process is described here.
Needs approval from an approver in each of these files:
Approvers can indicate their approval by writing /approve in a comment.
@alculquicondor @kerthcet Sorry for the long wait on it. Addressed the comments.
/retest
Force-pushed from 4afbd71 to b41a7ab.
/retest
Force-pushed from b41a7ab to 6d73478 (…on Pod related events).
rebased.
// This case, we don't want to pre-filter Pods by getUnschedulablePodsWithCrossTopologyTerm
// because Pod related events maybe make Pods that rejected by NodeResourceFit schedulable.
Suggested change:
// In this case, we don't want to pre-filter Pods by getUnschedulablePodsWithCrossTopologyTerm
// because Pod related events may make Pods that were rejected by NodeResourceFit schedulable.
@@ -1104,9 +1108,13 @@ func isPodResourcesResizedDown(pod *v1.Pod) bool {
func (p *PriorityQueue) AssignedPodUpdated(logger klog.Logger, oldPod, newPod *v1.Pod) {
Does this include changes to Pod status, or only Pod spec?
	if pInfo.UnschedulablePlugins.Has(podtopologyspread.Name) {
		// This Pod may be schedulable now by this Pod event.
		podsToMove = append(podsToMove, pInfo)
		continue
	}

Did you consider checking for the namespace?
@@ -1853,6 +1854,7 @@ func TestPriorityQueue_AssignedPodAdded(t *testing.T) {
	defer cancel()

	affinityPod := st.MakePod().Name("afp").Namespace("ns1").UID("afp").Annotation("annot2", "val2").Priority(mediumPriority).NominatedNodeName("node1").PodAffinityExists("service", "region", st.PodAffinityWithRequiredReq).Obj()
	spreadPod := st.MakePod().Name("tsp").Namespace("ns1").UID("tsp").SpreadConstraint(1, "node", v1.DoNotSchedule, nil, nil, nil, nil, nil).Obj()
Please instead create a new test and rename the existing one to be specific about pod affinity (the comment already talks about pod affinity specifically).
Or maybe it can be converted into a test table.
What type of PR is this?
/kind bug
What this PR does / why we need it:
kubernetes/pkg/scheduler/internal/queue/scheduling_queue.go, line 1044 (at b529e6f)
kubernetes/pkg/scheduler/internal/queue/scheduling_queue.go, line 1069 (at b529e6f)
When assigned Pods are created/updated, cluster events (Pod/Add or Pod/Update) should be delivered to all plugins that subscribe to those events. But actually, getUnschedulablePodsWithMatchingAffinityTerm pre-filters the Pods that receive those events based on the Pods' affinity and pod.Status.Resize, meaning only PodAffinity/NodeResourceFit actually receive them.

The problem here is similar to #110175: among the in-tree plugins, PodAffinity, PodTopologySpread and NodeResourceFit subscribe to Pod add and/or update events, and Pods rejected by PodTopologySpread will never be requeued to activeQ by Pod-related events. The same happens for custom plugins registering Pod add and/or update events.
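For reference, a custom (out-of-tree) plugin subscribes to these events via EventsToRegister. Below is a minimal sketch only: the plugin type and hint function are hypothetical, and the exact EnqueueExtensions signature may differ between scheduler versions.

```go
// Sketch: an out-of-tree plugin registering interest in assigned-Pod add/update
// events, using types from k8s.io/kubernetes/pkg/scheduler/framework.
func (pl *MyTopologyPlugin) EventsToRegister() []framework.ClusterEventWithHint {
	return []framework.ClusterEventWithHint{
		{
			Event: framework.ClusterEvent{Resource: framework.Pod, ActionType: framework.Add | framework.Update},
			// With SchedulerQueueingHints enabled, this hint decides whether a Pod
			// rejected by this plugin gets requeued. Without it, the scheduler's
			// pre-filter (getUnschedulablePodsWithMatchingAffinityTerm) decides,
			// which is the gap described above.
			QueueingHintFn: pl.isSchedulableAfterPodChange, // hypothetical hint function
		},
	}
}
```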
This PR addresses this problem by -
Which issue(s) this PR fixes:
Fixes #122626
Fixes #123480
Special notes for your reviewer:
@kubernetes/sig-scheduling-leads
It might be a candidate for cherry-pick. I'm not sure and don't have a strong opinion.
Does this PR introduce a user-facing change?
Additional documentation e.g., KEPs (Kubernetes Enhancement Proposals), usage docs, etc.: