
take PodTopologySpread into consideration when requeueing Pods based on Pod related events #122627

Open
wants to merge 1 commit into base: master

Conversation

sanposhiho
Member

@sanposhiho sanposhiho commented Jan 7, 2024

What type of PR is this?

/kind bug

What this PR does / why we need it:

When assigned Pods are created or updated, the cluster events (Pod/Add or Pod/Update) should be delivered to all plugins that subscribe to those events. In reality, however, getUnschedulablePodsWithMatchingAffinityTerm pre-filters which Pods receive those events based on the Pods' affinity and pod.Status.Resize, meaning only PodAffinity and NodeResourceFit actually receive them.

The problem here is similar to #110175: among in-tree plugins, PodAffinity, PodTopologySpread and NodeResourceFit subscribe to Pod add and/or update events, yet Pods rejected by PodTopologySpread are never requeued to activeQ by Pod-related events. The same happens for custom plugins that register Pod add and/or update events.


This PR addresses this problem in two ways (a sketch follows this list):

  • Take PodTopologySpread into consideration when requeueing Pods based on Pod-related events.
    • This fix is mostly for people who don't have QueueingHints (QHint) enabled.
  • Use MoveAllToActiveOrBackoffQueue only, when the SchedulerQueueingHints feature gate is enabled.
    • This is the general fix that solves the same problem for custom plugins. We have to guard it with the feature flag, otherwise it would have a negative impact on scheduling throughput (similar to how we can't remove preCheck until the QHint work is done).
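
For reference, here is a minimal, hedged sketch of the second point. The wrapper function name requeueOnAssignedPodUpdate is hypothetical; the queue methods, the event constant, and the feature gate are the names that appear in this PR's diff.

// Hedged sketch, not the exact merged code: how the Pod/Update event handler can branch
// on the SchedulerQueueingHints feature gate. requeueOnAssignedPodUpdate is a made-up name;
// MoveAllToActiveOrBackoffQueue, AssignedPodUpdated, and queue.AssignedPodUpdate are the
// identifiers used in this PR's diff.
package scheduler

import (
	v1 "k8s.io/api/core/v1"
	utilfeature "k8s.io/apiserver/pkg/util/feature"
	"k8s.io/klog/v2"

	"k8s.io/kubernetes/pkg/features"
	"k8s.io/kubernetes/pkg/scheduler/internal/queue"
)

func (sched *Scheduler) requeueOnAssignedPodUpdate(logger klog.Logger, oldPod, newPod *v1.Pod) {
	if utilfeature.DefaultFeatureGate.Enabled(features.SchedulerQueueingHints) {
		// With QueueingHints, each plugin's hint decides whether this event could make
		// a Pod schedulable, so the event can go to all plugins without pre-filtering.
		sched.SchedulingQueue.MoveAllToActiveOrBackoffQueue(logger, queue.AssignedPodUpdate, oldPod, newPod, nil)
		return
	}
	// Without QueueingHints, keep the cheaper pre-filtered path (PodAffinity,
	// PodTopologySpread, NodeResourceFit) to avoid hurting scheduling throughput.
	sched.SchedulingQueue.AssignedPodUpdated(logger, oldPod, newPod)
}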

Which issue(s) this PR fixes:

Fixes #122626
Fixes #123480

Special notes for your reviewer:

@kubernetes/sig-scheduling-leads
It might be a candidate for cherry-pick. I'm not sure and don't have a strong opinion.

Does this PR introduce a user-facing change?

Fix a bug where Pods rejected by PodTopologySpread could be stuck in the Pending state for 5 minutes in a worst-case scenario.
The same problem could happen with custom plugins that have Pod/Add or Pod/Update in EventsToRegister,
which is also solved by this PR, but only when the feature flag SchedulerQueueingHints is enabled.

Additional documentation e.g., KEPs (Kubernetes Enhancement Proposals), usage docs, etc.:


@k8s-ci-robot
Contributor

Skipping CI for Draft Pull Request.
If you want CI signal for your change, please convert it to an actual PR.
You can still manually trigger a test run with /test all

@k8s-ci-robot k8s-ci-robot added do-not-merge/work-in-progress Indicates that a PR should not merge because it is a work in progress. size/M Denotes a PR that changes 30-99 lines, ignoring generated files. kind/bug Categorizes issue or PR as related to a bug. do-not-merge/release-note-label-needed Indicates that a PR should not merge because it's missing one of the release note labels. cncf-cla: yes Indicates the PR's author has signed the CNCF CLA. do-not-merge/needs-sig Indicates an issue or PR lacks a `sig/foo` label and requires one. needs-triage Indicates an issue or PR lacks a `triage/foo` label and requires one. labels Jan 7, 2024
@k8s-ci-robot
Contributor

This issue is currently awaiting triage.

If a SIG or subproject determines this is a relevant issue, they will accept it by applying the triage/accepted label and provide further guidance.

The triage/accepted label can be added by org members by writing /triage accepted in a comment.

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.

@k8s-ci-robot k8s-ci-robot added needs-priority Indicates a PR lacks a `priority/foo` label and requires one. approved Indicates a PR has been approved by an approver from all required OWNERS files. sig/scheduling Categorizes an issue or PR as relevant to SIG Scheduling. release-note Denotes a PR that will be considered when it comes time to generate release notes. and removed do-not-merge/needs-sig Indicates an issue or PR lacks a `sig/foo` label and requires one. do-not-merge/release-note-label-needed Indicates that a PR should not merge because it's missing one of the release note labels. labels Jan 7, 2024
@sanposhiho
Member Author

/cc @alculquicondor

@sanposhiho sanposhiho marked this pull request as ready for review January 7, 2024 03:13
@k8s-ci-robot k8s-ci-robot removed the do-not-merge/work-in-progress Indicates that a PR should not merge because it is a work in progress. label Jan 7, 2024
@sanposhiho
Member Author

/hold to go through an approver's review.

@k8s-ci-robot k8s-ci-robot added the do-not-merge/hold Indicates that a PR should not merge because someone has issued a /hold command. label Jan 7, 2024
@sanposhiho sanposhiho marked this pull request as draft January 7, 2024 03:44
@k8s-ci-robot k8s-ci-robot added the do-not-merge/work-in-progress Indicates that a PR should not merge because it is a work in progress. label Jan 7, 2024
@k8s-ci-robot k8s-ci-robot added size/L Denotes a PR that changes 100-499 lines, ignoring generated files. and removed size/M Denotes a PR that changes 30-99 lines, ignoring generated files. labels Jan 7, 2024
@sanposhiho sanposhiho changed the title from "remove(scheduling queue): remove AssignedPodAdded and AssignedPodUpdated" to "take PodTopologySpread into consideration when requeueing Pods based on Pod related events" Jan 7, 2024
@sanposhiho sanposhiho marked this pull request as ready for review January 7, 2024 04:06
@sanposhiho
Member Author

/retest

Comment on lines 1160 to 1163
// Note that this function leaks goroutines in the case of timeout; even after this function returns after timeout,
// the goroutine made by this function keep waiting to pop a pod from the queue.
Member

can you fix this in timeout()?

Member Author
@sanposhiho sanposhiho Jan 9, 2024

If we want to fix it, I guess we have to modify schedulingQueue.Pop() to somehow abort waiting (e.g., add ctx to Pop() and abort waiting if ctx is canceled).
I don't have any other idea to solve this leak.

Member

Pop() would be able to quit in another way I guess - we can use defer schedulingQueue.Close() in each test to signal the queue to quit gracefully, right?

Member Author
@sanposhiho sanposhiho Jan 14, 2024

Actually, we use a different scheduler in every test case, so this leaked goroutine won't be carried over to the next test case.

What I meant here is that NextPod is called multiple times within a single test. What confused me when writing tests was:

  1. The first call of NextPod times out.
  2. Something happens and Pod-A is moved to activeQ.
  3. I want to confirm that (2) moved Pod-A, so I call NextPod again to make sure Pod-A is popped. But the goroutine created in (1) is still alive and pops Pod-A out, so the second call of NextPod also times out (a simplified sketch of this follows).
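
For clarity, a simplified, hypothetical sketch of why such a timeout-based helper leaks a goroutine; the podQueue interface and nextPodWithTimeout are made-up stand-ins, not the real scheduler queue API.

// Illustration only: a minimal model of the removed NextPod helper and its goroutine leak.
package sketch

import (
	"time"

	v1 "k8s.io/api/core/v1"
)

// podQueue is a simplified stand-in for the scheduler's scheduling queue.
type podQueue interface {
	// Pop blocks until a Pod is available or the queue is closed.
	Pop() (*v1.Pod, error)
}

// nextPodWithTimeout returns the next Pod, or nil after the timeout.
func nextPodWithTimeout(q podQueue, timeout time.Duration) *v1.Pod {
	ch := make(chan *v1.Pod, 1)
	go func() {
		p, err := q.Pop() // may keep blocking long after the caller has returned
		if err != nil {
			return
		}
		ch <- p // buffered, so this send never blocks even if nobody reads it
	}()
	select {
	case p := <-ch:
		return p
	case <-time.After(timeout):
		// The goroutine above is leaked until Pop returns; if a Pod arrives later,
		// that goroutine pops it, which is exactly the confusion in step (3) above.
		return nil
	}
}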

@sanposhiho
Member Author

@Huang-Wei addressed your suggestion

@sanposhiho
Member Author

@kubernetes/sig-scheduling-approvers Can anyone take a look?

@k8s-ci-robot k8s-ci-robot added the needs-rebase Indicates a PR cannot be merged because it has merge conflicts with HEAD. label Mar 19, 2024
@k8s-ci-robot k8s-ci-robot removed the needs-rebase Indicates a PR cannot be merged because it has merge conflicts with HEAD. label Apr 7, 2024
@sanposhiho sanposhiho force-pushed the remove-AssignedPodUpdated branch 2 times, most recently from 01fc67c to d7c4b64, on April 7, 2024 14:39
Member
@kerthcet kerthcet left a comment

Several comments.

@@ -204,7 +204,19 @@ func (sched *Scheduler) addPodToCache(obj interface{}) {
logger.Error(err, "Scheduler cache AddPod failed", "pod", klog.KObj(pod))
}

sched.SchedulingQueue.AssignedPodAdded(logger, pod)
// SchedulingQueue.AssignedPodAdded internally pre-filters Pods to move to activeQ while taking only in-tree plugins into consideration.
Member

Can we leave the logic centralized at AssignedPodAdded? Make the function more pure.

Member Author

I prefer the current implementation because, after the SchedulerQueueingHints feature flag goes GA, AssignedPodAdded will no longer be needed and we'll use MoveAllToActiveOrBackoffQueue only. What do you think?

if utilfeature.DefaultFeatureGate.Enabled(features.SchedulerQueueingHints) {
sched.SchedulingQueue.MoveAllToActiveOrBackoffQueue(logger, queue.AssignedPodUpdate, oldPod, newPod, nil)
} else {
sched.SchedulingQueue.AssignedPodUpdated(logger, oldPod, newPod)
Member

same here

@@ -1085,9 +1089,13 @@ func isPodResourcesResizedDown(pod *v1.Pod) bool {
func (p *PriorityQueue) AssignedPodUpdated(logger klog.Logger, oldPod, newPod *v1.Pod) {
p.lock.Lock()
if isPodResourcesResizedDown(newPod) {
// This case, we don't want to pre-filter Pods by getUnschedulablePodsWithCrossNodeTerm
Member

Please remove this; let's only add comments for special cases.

Member Author

This actually is a special case; without this comment, people wouldn't have a clue why we don't use getUnschedulablePodsWithCrossNodeTerm when isPodResourcesResizedDown is true.

pkg/scheduler/internal/queue/scheduling_queue.go (outdated review thread, resolved)
pkg/scheduler/internal/queue/scheduling_queue.go (outdated review thread, resolved)
nsLabels := interpodaffinity.GetNamespaceLabelsSnapshot(logger, pod.Namespace, p.nsLister)

var podsToMove []*framework.QueuedPodInfo
for _, pInfo := range p.unschedulablePods.podInfoMap {
if pInfo.UnschedulablePlugins.Has(podtopologyspread.Name) {
Member

So this only works for in-tree plugins; for out-of-tree cross-topology plugins, the problem still exists, right? I'm OK with the current implementation, since no feedback/issue has been filed about this. I don't want to expand the scope.

Member

We think we can still check the podAffinity here: if it doesn't match, don't append. Agree?

Member Author

So this only works for in-tree plugins; for out-of-tree cross-topology plugins, the problem still exists, right?

Yes, that's right. People have to enable QHint if they want to avoid the same problem in custom scheduler plugins.

Member Author
@sanposhiho sanposhiho Apr 22, 2024

We think we can still check the podAffinity here: if it doesn't match, don't append. Agree?

I took some time to think deeply about this, and concluded that we should go with the current implementation rather than the suggestion from @kerthcet.

In short: suppose a Pod has both topology spread and pod affinity, and is rejected only by topology spread. In this case, even if the assigned pod here doesn't match the pod affinity, it could still make the pending Pod schedulable.

A more in-depth explanation. Suppose:

  • a Pod has both topology spread and pod affinity; the topology spread's maxSkew is 1 and the topology key is the node name.
  • node-a has 1 Pod matching the topology spread, while node-b has 3 Pods.
  • both nodes already have a scheduled Pod matching the pod affinity.

In this case, an assigned Pod that lands on node-a and matches the topology spread can make the pending Pod schedulable, regardless of whether that assigned Pod matches the pod affinity, because node-a already has another Pod matching the pod affinity.
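
To make this concrete, here is a hedged sketch using the test builders that appear later in this PR (st.MakePod, SpreadConstraint, PodAffinityExists). The pod names, the nil spread selector, and the Node(...) call are illustrative assumptions, not code from this PR.

// Sketch for a test body, not a complete test.
// st is assumed to be k8s.io/kubernetes/pkg/scheduler/testing, as in the PR's tests.
func exampleTopologySpreadCanUnblockAffinityPod() {
	// The pending Pod from the example: topology spread (maxSkew=1, node topology key)
	// plus a required pod affinity term.
	pendingPod := st.MakePod().Name("p").Namespace("ns1").UID("p").
		SpreadConstraint(1, "node", v1.DoNotSchedule, nil, nil, nil, nil, nil).
		PodAffinityExists("service", "region", st.PodAffinityWithRequiredReq).
		Obj()

	// A newly assigned Pod on node-a. If it counts toward the spread constraint there
	// (node-a: 1 -> 2 vs node-b: 3), the skew shrinks and pendingPod can become
	// schedulable even when this Pod does not match pendingPod's pod affinity,
	// because node-a already hosts another Pod that satisfies the affinity term.
	assignedPod := st.MakePod().Name("assigned").Namespace("ns1").UID("assigned").Node("node-a").Obj()

	_ = pendingPod
	_ = assignedPod
}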

@@ -225,7 +237,19 @@ func (sched *Scheduler) updatePodInCache(oldObj, newObj interface{}) {
logger.Error(err, "Scheduler cache UpdatePod failed", "pod", klog.KObj(oldPod))
}

sched.SchedulingQueue.AssignedPodUpdated(logger, oldPod, newPod)
// SchedulingQueue.AssignedPodUpdated internally pre-filters Pods to move to activeQ while taking only in-tree plugins into consideration.
// Consequently, if custom plugins that subscribes Pod/Update events reject Pods,
Member

If we only want to handle the PodTopologySpread problem here, let's not mention custom plugins; otherwise it will be confusing and read as if we solved the problem for all plugins, unless that is your intent.

Member

Can we make the comment more concise? It's a little too long.

Member Author

So, this PR solves two things:

  • Take PodTopologySpread into consideration when requeueing Pods based on Pod-related events.
    • This fix is mostly for people who don't have QHint enabled.
  • Use MoveAllToActiveOrBackoffQueue only, when the feature gate is enabled.
    • This is the general fix that solves the same problem for custom plugins. We have to guard it with the feature flag, otherwise it would have a negative impact on scheduling throughput (similar to how we can't remove preCheck until the QHint work is done).

Maybe you want me to create a separate PR for the changes in eventhandlers.go?

Member

Since there are no plans for cherry-pick, I'm ok fixing both problems in this PR.
Just update the title and description accordingly.

Member Author

Updated the description.

@@ -1157,21 +1157,3 @@ func NextPodOrDie(t *testing.T, testCtx *TestContext) *schedulerframework.Queued
}
return podInfo
}

// NextPod returns the next Pod in the scheduler queue, with a 5 seconds timeout.
Member

Why remove this? Unused?

Member Author
@sanposhiho sanposhiho Apr 15, 2024

Yes, it's unused, and it has the goroutine-leak problem mentioned in the comment, so this is just a cleanup in this PR (I can create another PR if you want).

Member

Yes, please

Member Author

Moved this cleanup to #125433

Member Author
@sanposhiho sanposhiho left a comment

@kerthcet Sorry for the delay, I just replied to the comments. (Actual implementation changes will come later.)

@sanposhiho
Member Author

@kerthcet Fixed / replied to your comments🙏

@alculquicondor
Member

/retest

@k8s-ci-robot
Contributor

[APPROVALNOTIFIER] This PR is APPROVED

This pull-request has been approved by: sanposhiho

The full list of commands accepted by this bot can be found here.

The pull request process is described here

Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

@k8s-ci-robot k8s-ci-robot added the approved Indicates a PR has been approved by an approver from all required OWNERS files. label Jun 11, 2024
@sanposhiho
Member Author

@alculquicondor @kerthcet Sorry for the long wait on this. I've addressed the comments.

@sanposhiho
Member Author

/retest

@k8s-ci-robot k8s-ci-robot added the needs-rebase Indicates a PR cannot be merged because it has merge conflicts with HEAD. label Jun 13, 2024
@k8s-ci-robot k8s-ci-robot removed the needs-rebase Indicates a PR cannot be merged because it has merge conflicts with HEAD. label Jun 13, 2024
@sanposhiho
Member Author

/retest

@k8s-ci-robot k8s-ci-robot added the needs-rebase Indicates a PR cannot be merged because it has merge conflicts with HEAD. label Jun 22, 2024
@k8s-ci-robot k8s-ci-robot removed the needs-rebase Indicates a PR cannot be merged because it has merge conflicts with HEAD. label Jun 23, 2024
@sanposhiho
Member Author

rebased.

Comment on lines +1111 to +1112
// This case, we don't want to pre-filter Pods by getUnschedulablePodsWithCrossTopologyTerm
// because Pod related events maybe make Pods that rejected by NodeResourceFit schedulable.
Member

Suggested change
// This case, we don't want to pre-filter Pods by getUnschedulablePodsWithCrossTopologyTerm
// because Pod related events maybe make Pods that rejected by NodeResourceFit schedulable.
// In this case, we don't want to pre-filter Pods by getUnschedulablePodsWithCrossTopologyTerm
// because Pod related events may make Pods that were rejected by NodeResourceFit schedulable.

@@ -1104,9 +1108,13 @@ func isPodResourcesResizedDown(pod *v1.Pod) bool {
func (p *PriorityQueue) AssignedPodUpdated(logger klog.Logger, oldPod, newPod *v1.Pod) {
Member

Does this include changes to the Pod status, or only the Pod spec?

Comment on lines +1191 to +1272
if pInfo.UnschedulablePlugins.Has(podtopologyspread.Name) {
// This Pod may be schedulable now by this Pod event.
podsToMove = append(podsToMove, pInfo)
continue
}

Member

did you consider checking for the namespace?
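
For illustration only, not part of this PR: PodTopologySpread only counts Pods in the same namespace as the unschedulable Pod, so a namespace check on top of the loop body quoted above could look roughly like this.

// Hypothetical variant of the quoted loop body; an assigned Pod in a different
// namespace cannot change the skew seen by the unschedulable Pod.
if pInfo.UnschedulablePlugins.Has(podtopologyspread.Name) && pInfo.Pod.Namespace == pod.Namespace {
	// This Pod may be schedulable now by this Pod event.
	podsToMove = append(podsToMove, pInfo)
	continue
}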

@@ -1853,6 +1854,7 @@ func TestPriorityQueue_AssignedPodAdded(t *testing.T) {
defer cancel()

affinityPod := st.MakePod().Name("afp").Namespace("ns1").UID("afp").Annotation("annot2", "val2").Priority(mediumPriority).NominatedNodeName("node1").PodAffinityExists("service", "region", st.PodAffinityWithRequiredReq).Obj()
spreadPod := st.MakePod().Name("tsp").Namespace("ns1").UID("tsp").SpreadConstraint(1, "node", v1.DoNotSchedule, nil, nil, nil, nil, nil).Obj()
Member

Please instead create a new test and rename the existing one to be specific about pod affinity (the comment already talks about pod affinity specifically).

Or maybe it can be converted into a test table.
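
A rough, hypothetical sketch of what that table could look like; the struct fields and the runner body are placeholders, and the real setup and assertions would come from the existing TestPriorityQueue_AssignedPodAdded.

// Hypothetical table-driven shape; imports mirror the existing test file
// (testing, v1 "k8s.io/api/core/v1", st "k8s.io/kubernetes/pkg/scheduler/testing").
func TestPriorityQueue_AssignedPodAdded_Table(t *testing.T) {
	tests := []struct {
		name       string
		unschedPod *v1.Pod // Pod waiting in the unschedulable pool
		wantMoved  bool    // should the assigned-Pod Add event requeue it?
	}{
		{
			name: "Pod rejected by PodAffinity is requeued",
			unschedPod: st.MakePod().Name("afp").Namespace("ns1").UID("afp").
				PodAffinityExists("service", "region", st.PodAffinityWithRequiredReq).Obj(),
			wantMoved: true,
		},
		{
			name: "Pod rejected by PodTopologySpread is requeued",
			unschedPod: st.MakePod().Name("tsp").Namespace("ns1").UID("tsp").
				SpreadConstraint(1, "node", v1.DoNotSchedule, nil, nil, nil, nil, nil).Obj(),
			wantMoved: true,
		},
	}
	for _, tt := range tests {
		t.Run(tt.name, func(t *testing.T) {
			// Placeholder: construct the queue, add tt.unschedPod as unschedulable,
			// deliver an assigned-Pod Add event, then assert whether tt.unschedPod
			// ended up in activeQ, mirroring the existing test body.
			_ = tt.unschedPod
			_ = tt.wantMoved
		})
	}
}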

Successfully merging this pull request may close these issues.

  • util.NextPod leaks goroutine
  • Assigned Pod add/update events don't requeue Pods without PodAffinity
5 participants