
Update WaitForStableCluster to wait for only system pods to exist #84806

Merged: 1 commit, Nov 7, 2019

Conversation

@damemi (Contributor) commented Nov 5, 2019:

What type of PR is this?
/kind flake

What this PR does / why we need it:
When waiting for a stable cluster, some pods may still be pending deletion, which breaks e2e test assumptions about the number of pods in the cluster. This change assumes that a stable cluster should contain only system pods.

Which issue(s) this PR fixes:

Fixes #84787

Special notes for your reviewer:

Does this PR introduce a user-facing change?:

NONE

Additional documentation e.g., KEPs (Kubernetes Enhancement Proposals), usage docs, etc.:


@k8s-ci-robot k8s-ci-robot added release-note-none Denotes a PR that doesn't merit a release note. kind/flake Categorizes issue or PR as related to a flaky test. size/S Denotes a PR that changes 10-29 lines, ignoring generated files. cncf-cla: yes Indicates the PR's author has signed the CNCF CLA. needs-sig Indicates an issue or PR lacks a `sig/foo` label and requires one. needs-priority Indicates a PR lacks a `priority/foo` label and requires one. labels Nov 5, 2019
@damemi (Contributor, Author) commented Nov 5, 2019:

/cc @ahg-g

@k8s-ci-robot k8s-ci-robot added area/e2e-test-framework Issues or PRs related to refactoring the kubernetes e2e test framework area/test sig/testing Categorizes an issue or PR as relevant to SIG Testing. and removed needs-sig Indicates an issue or PR lacks a `sig/foo` label and requires one. labels Nov 5, 2019
for len(currentlyNotScheduledPods) != 0 {

// Wait for all pending pods to be scheduled, and for only system pods to be scheduled
for len(currentlyNotScheduledPods) != 0 && len(scheduledPods) - len(systemPods.Items) != 0 {
Member:

use a variable, something like numNonSystemPods := len(scheduledPods) - len(systemPods.Items)

Member:

Is it possible that some system Pods are not scheduled? For example:

  • currentlyNotScheduledPods: 2 regular Pods, 1 system Pod
  • scheduledPods: 1 regular Pod, 2 system Pods
  • systemPods: 3 Pods (1 not scheduled, 2 scheduled)

In this case, len(scheduledPods) - len(systemPods.Items) == 0.
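
A tiny runnable illustration of that coincidence, using the hypothetical counts from the bullets above:

package main

import "fmt"

func main() {
	// Hypothetical counts from the example above.
	scheduledPods := 3  // 1 regular pod + 2 system pods already scheduled
	systemPodItems := 3 // all system pods, whether scheduled or not
	// The subtraction is 0, so the check passes even though one system pod
	// is still unscheduled and one regular pod is still scheduled.
	fmt.Println(scheduledPods - systemPodItems) // prints 0
}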

Contributor Author:

@Huang-Wei true, this probably isn't a great check. Maybe something more explicit like a separate loop that just waits for len(allNamespacesPods) == len(systemPods)?

Member:

you can do this:

scheduledSystemPods, currentlyNotScheduledSystemPods = e2epod.GetPodsScheduled(masterNodes, systemPods)

and the check would be currentlyNotScheduledSystemPods != 0 || len(systemPods.Items) != len(allPods.Items)

Contributor Author:

@ahg-g updated with this check

scheduledPods, currentlyNotScheduledPods := e2epod.GetPodsScheduled(masterNodes, allPods)
for len(currentlyNotScheduledPods) != 0 {

// Wait for all pending pods to be scheduled, and for only system pods to be scheduled
Member:

update the comment: wait for system pods to be scheduled, and for pods in all other namespaces to be deleted

@@ -2405,10 +2405,17 @@ func WaitForStableCluster(c clientset.Interface, masterNodes sets.String) int {

}
allPods.Items = currentPods
systemPods, err := c.CoreV1().Pods(metav1.NamespaceSystem).List(metav1.ListOptions{})
ExpectNoError(err)
Member:

please annotate this error with a description.
[1]

Contributor Author:

done

time.Sleep(2 * time.Second)

systemPods, err = c.CoreV1().Pods(metav1.NamespaceSystem).List(metav1.ListOptions{})
ExpectNoError(err)
@neolit123 (Member) commented Nov 6, 2019:

[1]
EDIT: although it looks like there are no annotations below (or in the rest of the function) either. might be a good time to add them.

Contributor Author:

done

@@ -2405,10 +2405,17 @@ func WaitForStableCluster(c clientset.Interface, masterNodes sets.String) int {

}
allPods.Items = currentPods
systemPods, err := c.CoreV1().Pods(metav1.NamespaceSystem).List(metav1.ListOptions{})
Member:

allPods, err := c.CoreV1().Pods(metav1.NamespaceAll).List(metav1.ListOptions{})

the line above is already problematic in clusters with thousands of pods.
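
One way to bound that cost would be to page through the pods with the List API's Limit/Continue fields rather than issuing one unbounded call; a minimal sketch, not part of this PR, with a hypothetical helper name and the same pre-context client-go List signature as the other snippets in this thread:

// listAllPodsChunked pages through all pods 500 at a time instead of
// issuing one unbounded List call against the apiserver.
func listAllPodsChunked(c clientset.Interface) ([]v1.Pod, error) {
	var pods []v1.Pod
	opts := metav1.ListOptions{Limit: 500}
	for {
		list, err := c.CoreV1().Pods(metav1.NamespaceAll).List(opts)
		if err != nil {
			return nil, err
		}
		pods = append(pods, list.Items...)
		if list.Continue == "" { // empty continue token means no more pages
			return pods, nil
		}
		opts.Continue = list.Continue
	}
}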

Contributor Author:

@neolit123 could you clarify how I could fix this? This function already depends on polling all of the pods in the cluster. I could pull the system pods from that query above, but the system-pod query is likely to be pretty small anyway, right?

Member:

@neolit123 doesn't the list of pods already exist in the informer cache? Also, the scheduler tests that use this function don't really create thousands of pods.

Member:

the system pods should be fine, but "all the pods" can block for a while in super-clusters, and i wouldn't be surprised if nobody in the wild runs our test suite on such clusters because of that (possibly for other reasons too). it feels to me this logic should be reworked.

Member:

@ahg-g the cache can help.
ok so this function is only used under test/e2e/scheduling/predicates.go
this means that this function has to be moved under test/e2e/scheduling/ ideally.

i was under the impression that, since this is part of the framework, we were using it elsewhere; otherwise keeping it in the framework doesn't make sense.

Member:

cc @oomichi for an opinion about the move.

Member:

yes, this seems to be used only in the scheduler e2e tests, I agree it should be moved under test/e2e/scheduling

@neolit123 (Member) commented:

/assign @timothysc @johnSchnake
for more eyes on this change.

allPods, err := c.CoreV1().Pods(metav1.NamespaceAll).List(metav1.ListOptions{})
ExpectNoError(err)
ExpectNoError(err, "listing all pods in all namespaces while waiting for stable cluster")
scheduledPods, currentlyNotScheduledPods = e2epod.GetPodsScheduled(masterNodes, allPods)
Member:

this needs to change to

scheduledSystemPods, currentlyNotScheduledSystemPods := e2epod.GetPodsScheduled(masterNodes, systemPods)

Contributor Author:

Should we also include the logic from above that filters out succeeded and failed pods inside this loop? (https://github.com/kubernetes/kubernetes/blob/ed7a68f3bdb1cdbdc51d0ba11b8c53a8dd29cd62/test/e2e/framework/util.go#L2399-L2404) It seems odd that it isn't included already, but if we're making len(allPods.Items) part of our condition checking, we should probably sort out whether it's needed.

Member:

moving everything to inside the loop sounds reasonable to me.
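
For reference, the filter under discussion (per the util.go lines linked above) drops pods in a terminal phase before counting; moved inside the loop it would look roughly like this sketch, with names following the snippets in this thread:

// Drop Succeeded/Failed pods so terminal pods don't skew the
// len(systemPods.Items) != len(allPods.Items) comparison.
currentPods := make([]v1.Pod, 0, len(allPods.Items))
for _, pod := range allPods.Items {
	if pod.Status.Phase != v1.PodSucceeded && pod.Status.Phase != v1.PodFailed {
		currentPods = append(currentPods, pod)
	}
}
allPods.Items = currentPods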

_, currentlyNotScheduledSystemPods := e2epod.GetPodsScheduled(masterNodes, systemPods)

// Wait for system pods to be scheduled, and for pods in all other namespaces to be deleted
for len(currentlyNotScheduledPods) != 0 && (len(currentlyNotScheduledSystemPods) != 0 || len(systemPods.Items) != len(allPods.Items)) {
Member:

why do we need len(currentlyNotScheduledPods) != 0 &&? I think the remaining part should be sufficient: it tests that all system pods are scheduled, and that all scheduled pods are system pods.

Contributor Author:

True, updated

for len(currentlyNotScheduledPods) != 0 {
systemPods, err := c.CoreV1().Pods(metav1.NamespaceSystem).List(metav1.ListOptions{})
ExpectNoError(err, "listing all pods in kube-system namespace while waiting for stable cluster")
_, currentlyNotScheduledSystemPods := e2epod.GetPodsScheduled(masterNodes, systemPods)
Member:

capture scheduledSystemPods here (and return it at the end)

Contributor Author:

updated
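
Putting the thread together, the reworked wait loop reads roughly as follows. This is a sketch assembled from the snippets above (it assumes allPods was just listed as shown earlier), not the verbatim merged code:

// Wait for system pods to be scheduled, and for pods in all other
// namespaces to be deleted.
systemPods, err := c.CoreV1().Pods(metav1.NamespaceSystem).List(metav1.ListOptions{})
ExpectNoError(err, "listing all pods in kube-system namespace while waiting for stable cluster")
scheduledSystemPods, currentlyNotScheduledSystemPods := e2epod.GetPodsScheduled(masterNodes, systemPods)

for len(currentlyNotScheduledSystemPods) != 0 || len(systemPods.Items) != len(allPods.Items) {
	time.Sleep(2 * time.Second)

	// Re-list both views of the cluster (the terminal-pod filter discussed
	// above would also run here) and re-evaluate the condition.
	allPods, err = c.CoreV1().Pods(metav1.NamespaceAll).List(metav1.ListOptions{})
	ExpectNoError(err, "listing all pods in all namespaces while waiting for stable cluster")
	systemPods, err = c.CoreV1().Pods(metav1.NamespaceSystem).List(metav1.ListOptions{})
	ExpectNoError(err, "listing all pods in kube-system namespace while waiting for stable cluster")
	scheduledSystemPods, currentlyNotScheduledSystemPods = e2epod.GetPodsScheduled(masterNodes, systemPods)
}

// Per the review above, the function returns the number of scheduled system pods.
return len(scheduledSystemPods)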

@k8s-ci-robot k8s-ci-robot added size/M Denotes a PR that changes 30-99 lines, ignoring generated files. size/L Denotes a PR that changes 100-499 lines, ignoring generated files. needs-rebase Indicates a PR cannot be merged because it has merge conflicts with HEAD. sig/scheduling Categorizes an issue or PR as relevant to SIG Scheduling. and removed size/S Denotes a PR that changes 10-29 lines, ignoring generated files. size/M Denotes a PR that changes 30-99 lines, ignoring generated files. labels Nov 6, 2019
@k8s-ci-robot k8s-ci-robot removed the needs-rebase Indicates a PR cannot be merged because it has merge conflicts with HEAD. label Nov 6, 2019
}

// ListNamespaceEvents lists the events in the given namespace.
func ListNamespaceEvents(c clientset.Interface, ns string) error {
@neolit123 (Member) commented Nov 6, 2019:

there is already DumpEventsInNamespace.
you can either:

  • use DumpEventsInNamespace with its sorting option
  • add optional sorting to DumpEventsInNamespace as a boolean flag or a sort.Interface (see the sketch after this list):
    https://golang.org/pkg/sort/#Interface
  • don't use DumpEventsInNamespace and inline the event list usage under scheduling. i don't see ListNamespaceEvents used yet.
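
As an illustration of the sorting options above, events could be ordered chronologically by LastTimestamp before dumping; a minimal sketch with a hypothetical helper name (assumes the usual "sort", clientset, and metav1 imports used elsewhere in util.go):

// dumpSortedEvents lists a namespace's events, orders them by LastTimestamp,
// and logs them. sort.Slice is a convenience over a hand-written sort.Interface.
func dumpSortedEvents(c clientset.Interface, ns string) error {
	events, err := c.CoreV1().Events(ns).List(metav1.ListOptions{})
	if err != nil {
		return err
	}
	sort.Slice(events.Items, func(i, j int) bool {
		return events.Items[i].LastTimestamp.Before(&events.Items[j].LastTimestamp)
	})
	for _, e := range events.Items {
		Logf("%v %s/%s: %s", e.LastTimestamp, e.InvolvedObject.Kind, e.InvolvedObject.Name, e.Message)
	}
	return nil
}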

Contributor Author:

This diff is an error in rebasing; I did not add this function... fixing...

@neolit123 (Member) commented:

/approve
for framework.

please squash your commits.

}

}
allPods.Items = currentPods
Member:

can you factor this out to a function, perhaps allPods := getAllPods() and use it in both places.

Comment on lines +53 to +82
systemPods, err := c.CoreV1().Pods(metav1.NamespaceSystem).List(metav1.ListOptions{})
framework.ExpectNoError(err, "listing all pods in kube-system namespace while waiting for stable cluster")
scheduledSystemPods, currentlyNotScheduledSystemPods := e2epod.GetPodsScheduled(masterNodes, systemPods)
Member:

ditto, factor out to a function that returns two int variables and use it in both places, inside and outside the loop:

scheduledSystemPods, notScheduledSystemPods := getSystemPods()
allSystemPods := scheduledSystemPods + notScheduledSystemPods

and replace len(systemPods.Items) with allSystemPods
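
A sketch of the helper being requested; the name and exact signature are my guesses, not necessarily what was merged:

// getScheduledAndUnscheduledSystemPods lists kube-system pods and counts how
// many are scheduled vs. not, so both call sites can share the logic.
func getScheduledAndUnscheduledSystemPods(c clientset.Interface, masterNodes sets.String) (scheduled, notScheduled int) {
	systemPods, err := c.CoreV1().Pods(metav1.NamespaceSystem).List(metav1.ListOptions{})
	framework.ExpectNoError(err, "listing all pods in kube-system namespace while waiting for stable cluster")
	scheduledSystemPods, notScheduledSystemPods := e2epod.GetPodsScheduled(masterNodes, systemPods)
	return len(scheduledSystemPods), len(notScheduledSystemPods)
}

Both call sites would then compute allSystemPods := scheduled + notScheduled and compare it to the total pod count, as described above.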

@damemi (Contributor, Author) commented Nov 7, 2019:

@ahg-g updated with those functions factored out, please take a look again. I also squashed the commits.

@ahg-g (Member) left a comment:

lgtm, but you need to fix the lint issue

test/e2e/scheduling/framework.go (review thread outdated; resolved)
Move WaitForStableCluster to test/e2e/scheduling and update it to wait for only system pods to be ready in a stable cluster.
@ahg-g (Member) commented Nov 7, 2019:

/lgtm
/approve

@k8s-ci-robot k8s-ci-robot added the lgtm "Looks good to me", indicates that a PR is ready to be merged. label Nov 7, 2019
@k8s-ci-robot (Contributor) commented:

[APPROVALNOTIFIER] This PR is APPROVED

This pull-request has been approved by: ahg-g, damemi, neolit123

The full list of commands accepted by this bot can be found here.

The pull request process is described here

Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

@k8s-ci-robot k8s-ci-robot added the approved Indicates a PR has been approved by an approver from all required OWNERS files. label Nov 7, 2019
@ahg-g (Member) commented Nov 7, 2019:

/retest

Labels

  • approved: Indicates a PR has been approved by an approver from all required OWNERS files.
  • area/e2e-test-framework: Issues or PRs related to refactoring the kubernetes e2e test framework
  • area/test
  • cncf-cla: yes (the PR's author has signed the CNCF CLA)
  • kind/flake: Categorizes issue or PR as related to a flaky test.
  • lgtm: "Looks good to me", indicates that a PR is ready to be merged.
  • needs-priority: Indicates a PR lacks a `priority/foo` label and requires one.
  • release-note-none: Denotes a PR that doesn't merit a release note.
  • sig/scheduling: Categorizes an issue or PR as relevant to SIG Scheduling.
  • sig/testing: Categorizes an issue or PR as relevant to SIG Testing.
  • size/L: Denotes a PR that changes 100-499 lines, ignoring generated files.
Development

Successfully merging this pull request may close these issues.

WaitForStableCluster in e2e tests waits for pods to get deleted
7 participants