New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Update WaitForStableCluster to wait for only system pods to exist #84806
Conversation
/cc @ahg-g |
test/e2e/framework/util.go
Outdated
for len(currentlyNotScheduledPods) != 0 { | ||
|
||
// Wait for all pending pods to be scheduled, and for only system pods to be scheduled | ||
for len(currentlyNotScheduledPods) != 0 && len(scheduledPods) - len (systemPods.Items) != 0 { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
use a variable, something like numNonSystemPods := len(scheduledPods) - len (systemPods.Items)
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Is it possible that some system Pods are not scheduled? For example:
- currentlyNotScheduledPods: 2 regular Pods, 1 system Pods
- scheduledPods: 1 regular Pod, 2 system Pod
- systemPods: 3 Pods (1 not scheduled and 2 scheduled)
In this case, len(scheduledPods) - len (systemPods.Items) == 0.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@Huang-Wei true, this probably isn't a great check. Maybe something more explicit like a separate loop that just waits for len(allNamespacesPods) == len(systemPods)?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
you can do this:
scheduledSystemPods, currentlyNotScheduledSystemPods = e2epod.GetPodsScheduled(masterNodes, systemPods)
and the check would becurrentlyNotScheduledSystemPods != 0 || len(systemPods.Items) != len(allPods.Items)
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@ahg-g updated with this check
test/e2e/framework/util.go
Outdated
scheduledPods, currentlyNotScheduledPods := e2epod.GetPodsScheduled(masterNodes, allPods) | ||
for len(currentlyNotScheduledPods) != 0 { | ||
|
||
// Wait for all pending pods to be scheduled, and for only system pods to be scheduled |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
update the comment: wait for system pods to be scheduled, and for pods in all other namespaces to be deleted
test/e2e/framework/util.go
Outdated
@@ -2405,10 +2405,17 @@ func WaitForStableCluster(c clientset.Interface, masterNodes sets.String) int { | |||
|
|||
} | |||
allPods.Items = currentPods | |||
systemPods, err := c.CoreV1().Pods(metav1.NamespaceSystem).List(metav1.ListOptions{}) | |||
ExpectNoError(err) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
please annotate this error with a description.
[1]
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
done
test/e2e/framework/util.go
Outdated
time.Sleep(2 * time.Second) | ||
|
||
systemPods, err = c.CoreV1().Pods(metav1.NamespaceSystem).List(metav1.ListOptions{}) | ||
ExpectNoError(err) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
[1]
EDIT: although it looks like there are no annotations bellow (and in the rest of the function) too. might be a good time to add them.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
done
test/e2e/framework/util.go
Outdated
@@ -2405,10 +2405,17 @@ func WaitForStableCluster(c clientset.Interface, masterNodes sets.String) int { | |||
|
|||
} | |||
allPods.Items = currentPods | |||
systemPods, err := c.CoreV1().Pods(metav1.NamespaceSystem).List(metav1.ListOptions{}) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
allPods, err := c.CoreV1().Pods(metav1.NamespaceAll).List(metav1.ListOptions{})
this above is already problematic in case of thousands of pods.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@neolit123 could you clarify how I could fix this? It seems this function is already dependent on polling all of the pods in the cluster. I could pull the system pods from that query above, but the system pod query is likely to usually be pretty small, right?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@neolit123 doesn't the list of pods already exist in the informer cache? also the scheduler tests which uses this function does not really create thousands of pods.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
the system pods should be fine, but "all the pods" can block for a while in super-clusters and i wouldn't be surprised that nobody in the wild would run our test suite on such clusters because of that (possibly other reasons too). it feels to me this logic should reworked.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@ahg-g the cache can help.
ok so this function is only used under test/e2e/scheduling/predicates.go
this means that this function has to be moved under test/e2e/scheduling/
ideally.
i was under the impression that given this is part of the framework we are using it elsewhere which doesn't make sense.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
cc @oomichi for an opinion about the move.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
yes, this seems to be used only in the scheduler e2e tests, I agree it should be moved under test/e2e/scheduling
/assign @timothysc @johnSchnake |
b6afa54
to
ed7a68f
Compare
test/e2e/framework/util.go
Outdated
allPods, err := c.CoreV1().Pods(metav1.NamespaceAll).List(metav1.ListOptions{}) | ||
ExpectNoError(err) | ||
ExpectNoError(err, "listing all pods in all namespaces while waiting for stable cluster") | ||
scheduledPods, currentlyNotScheduledPods = e2epod.GetPodsScheduled(masterNodes, allPods) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
this needs to change to
scheduledSystemPods, currentlyNotScheduledSystemPods := e2epod.GetPodsScheduled(masterNodes, systemPods)
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Should we include the logic from above to filter out succeeded and failed pods inside this loop then too? (https://github.com/kubernetes/kubernetes/blob/ed7a68f3bdb1cdbdc51d0ba11b8c53a8dd29cd62/test/e2e/framework/util.go#L2399-L2404) It seems odd that it isn't included already, but if we're making len(allPods.Items)
part of our condition checking we should probably sort out if that is needed
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
moving everything to inside the loop sounds reasonable to me.
test/e2e/framework/util.go
Outdated
_, currentlyNotScheduledSystemPods := e2epod.GetPodsScheduled(masterNodes, systemPods) | ||
|
||
// Wait for system pods to be scheduled, and for pods in all other namespaces to be deleted | ||
for len(currentlyNotScheduledPods) != 0 && (len(currentlyNotScheduledSystemPods) != 0 || len(systemPods.Items) != len(allPods.Items)) { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
why do we need len(currentlyNotScheduledPods) != 0 &&
? I think the remaining part should be sufficient: it tests that all system pods are scheduled, and that all scheduled pods are system pods.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
True, updated
test/e2e/framework/util.go
Outdated
for len(currentlyNotScheduledPods) != 0 { | ||
systemPods, err := c.CoreV1().Pods(metav1.NamespaceSystem).List(metav1.ListOptions{}) | ||
ExpectNoError(err, "listing all pods in kube-system namespace while waiting for stable cluster") | ||
_, currentlyNotScheduledSystemPods := e2epod.GetPodsScheduled(masterNodes, systemPods) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
scheduledSystemPods (and return that at the end)
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
updated
338bb11
to
589e3da
Compare
test/e2e/framework/util.go
Outdated
} | ||
|
||
// ListNamespaceEvents lists the events in the given namespace. | ||
func ListNamespaceEvents(c clientset.Interface, ns string) error { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
there is already DumpEventsInNamespace.
you can either:
- use DumpEventsInNamespace with it's sorting option
- add optional sorting to DumpEventsInNamespace as a boolean flag or a sort.Interface:
https://golang.org/pkg/sort/#Interface - don't use DumpEventsInNamespace and inline the event list usage under scheduling. i don't see ListNamespaceEvents used yet.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This diff is an error in rebasing; I did not add this function... fixing...
/approve please squash your commits. |
} | ||
|
||
} | ||
allPods.Items = currentPods |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
can you factor this out to a function, perhaps allPods := getAllPods()
and use it in both places.
systemPods, err := c.CoreV1().Pods(metav1.NamespaceSystem).List(metav1.ListOptions{}) | ||
framework.ExpectNoError(err, "listing all pods in kube-system namespace while waiting for stable cluster") | ||
scheduledSystemPods, currentlyNotScheduledSystemPods := e2epod.GetPodsScheduled(masterNodes, systemPods) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
ditto, factor out to a function that returns two int
variables and use it in both places, inside and outside the loop:
scheduledSystemPods, notScheduledSystemPods := getSystemPods()
allSystemPods := scheduledSystemPods + notScheduledSystemPods
and replace len(systemPods.Items)
with allSystemPods
f1a5cee
to
aa3d84f
Compare
@ahg-g updated with those functions factored out, please take a look again. I also squashed |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
lgtm, but you need to fix the lint issue
Move WaitForStable cluster to test/e2e/scheduling and update it to wait for only system pods to be ready in a stable cluster.
aa3d84f
to
a19d834
Compare
/lgtm |
[APPROVALNOTIFIER] This PR is APPROVED This pull-request has been approved by: ahg-g, damemi, neolit123 The full list of commands accepted by this bot can be found here. The pull request process is described here
Needs approval from an approver in each of these files:
Approvers can indicate their approval by writing |
/retest |
What type of PR is this?
/kind flake
What this PR does / why we need it:
When waiting for a stable cluster, some pods still needed to be deleted which affects e2e test assumptions about the number of pods on a cluster. This assumes that a stable cluster should have only system pods
Which issue(s) this PR fixes:
Fixes #84787
Special notes for your reviewer:
Does this PR introduce a user-facing change?:
Additional documentation e.g., KEPs (Kubernetes Enhancement Proposals), usage docs, etc.: