New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
fix: when PreFilter returns UnschedulableAndUnresolvable, copy the state in all nodes in statusmap #119778
fix: when PreFilter returns UnschedulableAndUnresolvable, copy the state in all nodes in statusmap #119778
Conversation
Please note that we're already in Test Freeze for the Fast forwards are scheduled to happen every 6 hours, whereas the most recent run was: Sat Aug 5 22:30:03 UTC 2023. |
will soon add UTs |
e00983f
to
3320a0a
Compare
/cc @Huang-Wei |
3320a0a
to
e08e6e9
Compare
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
LGTM. Could you incorporate the integration test in (#119777) to this PR?
Sure, added. |
if _, err := state.Read(tokenFilterName); err != nil { | ||
// Should be bug. | ||
// We don't store state only when PreFilter returned Skip. | ||
// In other words, if reaches here, PreFilter returned Skip but somehow Filter is called. | ||
return framework.NewStatus(framework.Error, err.Error()) | ||
} |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Place this to AddPod/RemovePod would be more close to real world usage.
Unresolvable bool | ||
Tokens int | ||
PreFilterStatus *framework.Status | ||
PreFilterResult *framework.PreFilterResult |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
this parameter is not used in any test?
} | ||
|
||
func (fp *tokenFilter) PreFilter(ctx context.Context, state *framework.CycleState, pod *v1.Pod) (*framework.PreFilterResult, *framework.Status) { | ||
if pod.Name == fp.PreFilterTargetPodName && fp.PreFilterStatus.IsSkip() { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think here we should only check is: (not quite a fan to check the pod name if we can achieve this by the following logic)
if fp.PreFilterStatus.Code() == framework.UnschedulableAndUnresolvable || fp.PreFilterStatus.IsSkip() {
return nil, fp.PreFilterStatus
}
|
||
state.Write(tokenFilterName, &stateData{}) | ||
|
||
if pod.Name == fp.PreFilterTargetPodName { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
ditto, I think we're able to get rid of PreFilterTargetPodName
v1.ResourceMemory: *resource.NewQuantity(200, resource.DecimalSI)}, | ||
}, | ||
}), | ||
preemptedPodIndexes: map[int]struct{}{}, |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
if this is nil, it seems the existing test logic is problematic: (w/o the fix, the test would also pass)
The else
loop should be employing a "Eventually" semantics; ohterwise it would always pass. In other words, a) the p
in else block should be obtained from api server and b) we should wait a bit to ensure the preemption doesn't happen. But this is an existing test issue.
// Wait for preemption of pods and make sure the other ones are not preempted.
for i, p := range pods {
if _, found := test.preemptedPodIndexes[i]; found { ... } else {
if p.DeletionTimestamp != nil {
t.Errorf("Didn't expect pod %v to get preempted.", p.Name)
}
}
}
Given the integration test has been wonky for a while, and the UT should suffice to verify the fix. I'm inclined with a separate PR to polish the integration test: including
|
// needed to avoid this copy. | ||
for _, n := range allNodes { | ||
diagnosis.NodeToStatusMap[n.Node().Name] = s | ||
} | ||
// Record the messages from PreFilter in Diagnosis.PreFilterMsg. | ||
msg := s.Message() | ||
diagnosis.PreFilterMsg = msg |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Could this cause the message to be duplicated in the event/log?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Good question. @sanposhiho could you double check the event/log is still rendered properly?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
It's a good catch. I changed type.go so that we won't make duplicates in the message.
@@ -100,8 +100,11 @@ func waitForNominatedNodeName(cs clientset.Interface, pod *v1.Pod) error { | |||
const tokenFilterName = "token-filter" | |||
|
|||
type tokenFilter struct { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Can you add a comment describing this plugin?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Given it's likely this PR will proceed with UT only, I will add comment in #119769.
+1 on a quick fix with UTs that can be merged ASAP. |
The effect here would be that the preemption fails with an error, causing the pod to be re-queued into backoff, instead of unschedulable. Correct? |
In most cases, yes - just unnecessary cycles spent on preemption which should have been avoided. But, with a plugin returning UnschedulableAndResolvable in PreFilter, and due to the regression, preemption may be able to find a node to host the pod by preempting pods (b/c dryrun preemption runs Filter only). In this edge case, it'd cause unexpected disruption to existing pods. |
Hopefully we can merge before the rc.1 tomorrow |
…ate in all nodes in statusmap
9730596
to
b008223
Compare
[APPROVALNOTIFIER] This PR is APPROVED This pull-request has been approved by: sanposhiho The full list of commands accepted by this bot can be found here. The pull request process is described here
Needs approval from an approver in each of these files:
Approvers can indicate their approval by writing |
@alculquicondor @Huang-Wei As suggested,
PTAL again. |
/retest |
/release-note-edit |
@sanposhiho LGTM. Could you polish the release note a bit? like ^^ /lgtm |
LGTM label has been added. Git tree hash: 0b5316f8f51ce76a8524cab2db4a640bdba927ea
|
@Huang-Wei done. |
Let's keep an eye for the end of the code freeze. |
Changelog suggestion Fixed a scheduling bug by ensuring that preemption is skipped when a PreFilter plugin returns `UnschedulableAndUnresolvable` |
…119778-upstream-release-1.26 Automated cherry pick of #119778: fix: when PreFilter returns UnschedulableAndUnresolvable, copy the state in all nodes in statusmap
…119778-upstream-release-1.28 Automated cherry pick of #119778: fix: when PreFilter returns UnschedulableAndUnresolvable, copy the state in all nodes in statusmap
…119778-upstream-release-1.27 Automated cherry pick of #119778: fix: when PreFilter returns UnschedulableAndUnresolvable, copy the state in all nodes in statusmap
What type of PR is this?
/kind bug
/triage accepted
/priority important-soon
What this PR does / why we need it:
When PreFilter returns UnschedulableAndUnresolvable, we don't need to run the preemption, but we run.
This unexpected preemption could cause errors from the preemption by trying to read non-exist PreFilter data.
As @Huang-Wei pointed out, we didn't run the preemption in such cases, but #110894 introduced this bug.
#119777 adds integ test for this scenario and #119780 proves this patch fixing the bug.
Which issue(s) this PR fixes:
Fixes: #119782
Special notes for your reviewer:
We should cherry-pick this to v1.26 and v1.27.
Does this PR introduce a user-facing change?
Additional documentation e.g., KEPs (Kubernetes Enhancement Proposals), usage docs, etc.: