sched: integration test to cover event registration #105337
Conversation
@Huang-Wei: This issue is currently awaiting triage. If a SIG or subproject determines this is a relevant issue, they will accept it by applying the triage/accepted label and provide further guidance. Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.
/hold
[APPROVALNOTIFIER] This PR is APPROVED. This pull-request has been approved by: Huang-Wei. The full list of commands accepted by this bot can be found here. The pull request process is described here.
Needs approval from an approver in each of these files:
Approvers can indicate their approval by writing /approve in a comment. Approvers can cancel approval by writing /approve cancel in a comment.
// Create two Pods that are both unschedulable.
// - Pod1 is a best-efforts Pod, but doesn't have the required toleration.
// - Pod2 has the required toleration, but requests a large amount of CPU resource that the node cannot fit.
I think Pod2 actually has no tolerations, but that doesn't affect the final test result.
Yes, I should have updated the comment. Done just now.
My original thought was to add a toleration for pod2, but it turned out to add no extra code coverage. So I intentionally left it without a toleration, but with an excessive pod request - that covers the usage of preCheckForNode().
// - Pod2 requests a large amount of CPU resource that the node cannot fit.
// Note: Pod2 will fail the tainttoleration plugin b/c that's ordered prior to noderesources.
pod1 := st.MakePod().Namespace(ns).Name("pod1").Container("image").Obj()
pod2 := st.MakePod().Namespace(ns).Name("pod2").Req(map[v1.ResourceName]string{v1.ResourceCPU: "4"}).Obj()
Wrap the pod2 with a Toleration to make sure pod2 will fail with noderesources instead of the taint.
It's intentional. Also explained in #105337 (comment).
If pod2 goes with a matching toleration, it'd fail due to a NodeResources failure, and hence won't be triggered at all. However, in the case above, pod2 also failed due to TaintToleration, and hence looked like it should have been triggered. So why isn't it? It's due to the preCheckForNode() logic, which serves as the last gate deciding whether or not to move a pod. So in this case, we used pod2 to cover the preCheckForNode() logic (see the rough sketch below).
BTW: I will come up with a pod3 case that fails on VolumeBinding, but it hits a tiny bug, so I will raise a bug fix along with that test case.
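For context, here's a rough, purely illustrative sketch of the kind of cheap feasibility gate being described. The real preCheckForNode lives in pkg/scheduler/eventhandlers.go and checks more than this; only the CPU-fit idea relevant to pod2 is sketched, using standard Kubernetes API types.

```go
// Illustrative only: a stand-in for the resource part of the pre-check.
// The idea: before moving an unschedulable Pod back to the active queue on a
// Node event, do a cheap feasibility check so Pods that obviously still cannot
// fit (like pod2's 4-CPU request) are not re-queued.
package main

import (
	"fmt"

	v1 "k8s.io/api/core/v1"
	"k8s.io/apimachinery/pkg/api/resource"
)

// fitsCPU sums the CPU requests of the Pod's containers and compares them
// against the node's allocatable CPU.
func fitsCPU(pod *v1.Pod, allocatableCPU resource.Quantity) bool {
	requested := resource.Quantity{}
	for _, c := range pod.Spec.Containers {
		if cpu, ok := c.Resources.Requests[v1.ResourceCPU]; ok {
			requested.Add(cpu)
		}
	}
	return requested.Cmp(allocatableCPU) <= 0
}

func main() {
	// pod2 from the test requests 4 CPUs.
	pod2 := &v1.Pod{Spec: v1.PodSpec{Containers: []v1.Container{{
		Resources: v1.ResourceRequirements{Requests: v1.ResourceList{
			v1.ResourceCPU: resource.MustParse("4"),
		}},
	}}}}
	// On a node with only 2 allocatable CPUs the pre-check fails, so the Node
	// event does not move pod2 back to the active queue.
	fmt.Println(fitsCPU(pod2, resource.MustParse("2"))) // false
}
```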
Thanks for the explanation. Any idea why the extra check in preCheckForNode is needed, instead of just moving the pod and leaving it to the filters to make the judgement? That logic seems like it was migrated from the kubelet and only covers some basic filtering.
}
// Schedule the Pod manually.
_, fitError := testCtx.Scheduler.Algorithm.Schedule(ctx, nil, fwk, framework.NewCycleState(), podInfo.Pod)
if fitError == nil {
Assert the err is a fitError instead of assuming it's a fitError.
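Roughly, that assertion could look like the sketch below. It reuses the names from the snippet above (testCtx, fwk, podInfo) and assumes the failure surfaces as a *framework.FitError; this is the reviewer's suggestion illustrated, not code from the PR.

```go
// Sketch of the suggested check: verify the concrete error type rather than
// only checking that some error occurred. Assumes the scheduling failure is
// reported as a *framework.FitError.
_, err := testCtx.Scheduler.Algorithm.Schedule(ctx, nil, fwk, framework.NewCycleState(), podInfo.Pod)
fitError, ok := err.(*framework.FitError)
if !ok || fitError == nil {
	t.Fatalf("expected a fit error when scheduling %v, got: %v", podInfo.Pod.Name, err)
}
```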
that's redundant IMO.
test/integration/scheduler/util.go
Outdated
@@ -548,3 +548,18 @@ func nextPodOrDie(t *testing.T, testCtx *testutils.TestContext) *framework.QueuedPodInfo {
	}
	return podInfo
}

// nextPod returns the next Pod in the scheduler queue, with a 15 seconds timeout.
Some comments here might just say that the err is an acceptable result, as the Queue might be empty, to differentiate it from the func nextPodOrDie.
We don't return an error here actually, just a popped pod or nil. Mentioning err would be confusing.
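For readers following along, here is a rough sketch of the difference (not the exact code from the PR): a timeout here is an acceptable outcome and simply yields nil, whereas nextPodOrDie fails the test.

```go
// Rough sketch of nextPod: pop the next Pod from the scheduling queue, but
// treat a timeout as acceptable and return nil instead of failing the test,
// so callers can assert that nothing was moved back to the active queue.
func nextPod(t *testing.T, testCtx *testutils.TestContext) *framework.QueuedPodInfo {
	t.Helper()
	var podInfo *framework.QueuedPodInfo
	// NextPod() blocks, so bound it with the timeout helper (see further down).
	if err := timeout(testCtx.Ctx, time.Second*5, func() {
		podInfo = testCtx.Scheduler.NextPod()
	}); err != nil {
		// Timed out: the queue stayed empty, which is a valid result here.
		return nil
	}
	return podInfo
}
```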
/cc @ahg-g
// It's intended to not start the scheduler's queue, and hence to
// not start any flushing logic. We will pop and schedule the Pods manually later.
is this comment really meant for the CleanupTest call below?
Nope :) It's just that usually we start the scheduler right after SyncInformerFactory().
I will move the comment above.
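For context, the ordering under discussion is roughly the following; the helper names (SyncInformerFactory, CleanupTest) are the ones used elsewhere in these integration tests, but treat the exact calls as an assumption rather than the PR's literal code.

```go
// Illustrative ordering only. Informers are synced as usual, but the scheduler
// is deliberately NOT started (no `go testCtx.Scheduler.Run(testCtx.Ctx)`), so
// neither the scheduling loop nor the queue's flushing goroutines run, and
// Pods stay in the queue until the test pops them via NextPod().
testutils.SyncInformerFactory(testCtx)
defer testutils.CleanupTest(t, testCtx)
```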
test/integration/scheduler/util.go
Outdated
var podInfo *framework.QueuedPodInfo
// NextPod() is a blocking operation. Wrap it in timeout() to avoid relying on
// default go testing timeout (10m) to abort.
if err := timeout(testCtx.Ctx, time.Second*15, func() {
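As an aside, the general shape of such a timeout wrapper is roughly the following; this is a sketch of the pattern, not necessarily the exact helper added in util.go.

```go
// Bound a blocking call: run f in a goroutine and give up once the duration
// elapses or the parent context is cancelled.
func timeout(ctx context.Context, d time.Duration, f func()) error {
	ctx, cancel := context.WithTimeout(ctx, d)
	defer cancel()

	done := make(chan struct{})
	go func() {
		f()
		close(done)
	}()

	select {
	case <-done:
		return nil
	case <-ctx.Done():
		return ctx.Err()
	}
}
```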
So basically the test will run for at least 15 seconds; why not just check the active queue length instead?
There is only a PendingPods() function so far, which cannot tell whether a pod came from activeQ or unschedulableQ/backoffQ. Also, I want to wait for some time instead of doing an immediate check, to detect undesirable pod movements.
We can add a function to return the length of the active queue, but if you want to wait, 15 seconds feels like a lot in this case; I would reduce it to 5.
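Purely as a hypothetical sketch of what such an accessor could look like (the method name ActiveQueueLen and the internal field names are made up for illustration; nothing like this exists in the queue as of this PR):

```go
// Hypothetical: expose the number of Pods currently in the active queue so a
// test could assert on it directly instead of waiting for a pop to time out.
func (p *PriorityQueue) ActiveQueueLen() int {
	p.lock.RLock()
	defer p.lock.RUnlock()
	return p.activeQ.Len()
}
```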
I'd prefer to reduce it to 5 secs.
/lgtm
/unhold
What type of PR is this?
/kind cleanup
/sig scheduling
What this PR does / why we need it:
Add an integration test to cover event registration for core resources.
Which issue(s) this PR fixes:
Fixes #105303.
Special notes for your reviewer:
Does this PR introduce a user-facing change?