
Update WaitForStableCluster to wait for only system pods to exist #84806

Merged: 1 commit, Nov 7, 2019

Conversation

@damemi (Contributor) commented Nov 5, 2019:

What type of PR is this?
/kind flake

What this PR does / why we need it:
When waiting for a stable cluster, some pods may still be pending deletion, which breaks e2e test assumptions about the number of pods in the cluster. This change assumes that a stable cluster should contain only system pods.

Which issue(s) this PR fixes:

Fixes #84787

Special notes for your reviewer:

Does this PR introduce a user-facing change?:

NONE

Additional documentation e.g., KEPs (Kubernetes Enhancement Proposals), usage docs, etc.:


@k8s-ci-robot k8s-ci-robot added release-note-none Denotes a PR that doesn't merit a release note. kind/flake Categorizes issue or PR as related to a flaky test. size/S Denotes a PR that changes 10-29 lines, ignoring generated files. cncf-cla: yes Indicates the PR's author has signed the CNCF CLA. needs-sig Indicates an issue or PR lacks a `sig/foo` label and requires one. needs-priority Indicates a PR lacks a `priority/foo` label and requires one. labels Nov 5, 2019
@damemi (Contributor, Author) commented Nov 5, 2019:

/cc @ahg-g

@k8s-ci-robot k8s-ci-robot added area/e2e-test-framework Issues or PRs related to refactoring the kubernetes e2e test framework area/test sig/testing Categorizes an issue or PR as relevant to SIG Testing. and removed needs-sig Indicates an issue or PR lacks a `sig/foo` label and requires one. labels Nov 5, 2019
for len(currentlyNotScheduledPods) != 0 {

// Wait for all pending pods to be scheduled, and for only system pods to be scheduled
for len(currentlyNotScheduledPods) != 0 && len(scheduledPods) - len(systemPods.Items) != 0 {
Member:

use a variable, something like numNonSystemPods := len(scheduledPods) - len(systemPods.Items)

Member:

Is it possible that some system Pods are not scheduled? For example:

  • currentlyNotScheduledPods: 2 regular Pods, 1 system Pod
  • scheduledPods: 1 regular Pod, 2 system Pods
  • systemPods: 3 Pods (1 not scheduled, 2 scheduled)

In this case, len(scheduledPods) - len(systemPods.Items) == 0.
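
A tiny runnable illustration of that coincidence, using the hypothetical counts from the bullets above:

package main

import "fmt"

func main() {
	// Hypothetical counts from the example above.
	scheduledPods := 3  // 1 regular pod + 2 system pods already scheduled
	systemPodItems := 3 // all system pods, whether scheduled or not
	// The subtraction is 0, so the check passes even though one system pod
	// is still unscheduled and one regular pod is still scheduled.
	fmt.Println(scheduledPods - systemPodItems) // prints 0
}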

Contributor Author:

@Huang-Wei true, this probably isn't a great check. Maybe something more explicit like a separate loop that just waits for len(allNamespacesPods) == len(systemPods)?

Member:

you can do this:

scheduledSystemPods, currentlyNotScheduledSystemPods = e2epod.GetPodsScheduled(masterNodes, systemPods)

and the check would be currentlyNotScheduledSystemPods != 0 || len(systemPods.Items) != len(allPods.Items)

Contributor Author:

@ahg-g updated with this check

scheduledPods, currentlyNotScheduledPods := e2epod.GetPodsScheduled(masterNodes, allPods)
for len(currentlyNotScheduledPods) != 0 {

// Wait for all pending pods to be scheduled, and for only system pods to be scheduled
Member:

update the comment: wait for system pods to be scheduled, and for pods in all other namespaces to be deleted

@@ -2405,10 +2405,17 @@ func WaitForStableCluster(c clientset.Interface, masterNodes sets.String) int {

}
allPods.Items = currentPods
systemPods, err := c.CoreV1().Pods(metav1.NamespaceSystem).List(metav1.ListOptions{})
ExpectNoError(err)
Member:

please annotate this error with a description.
[1]

Contributor Author:

done

time.Sleep(2 * time.Second)

systemPods, err = c.CoreV1().Pods(metav1.NamespaceSystem).List(metav1.ListOptions{})
ExpectNoError(err)
@neolit123 (Member) commented Nov 6, 2019:

[1]
EDIT: although it looks like there are no annotations below (or in the rest of the function) either. might be a good time to add them.

Contributor Author:

done

@@ -2405,10 +2405,17 @@ func WaitForStableCluster(c clientset.Interface, masterNodes sets.String) int {

}
allPods.Items = currentPods
systemPods, err := c.CoreV1().Pods(metav1.NamespaceSystem).List(metav1.ListOptions{})
Member:

allPods, err := c.CoreV1().Pods(metav1.NamespaceAll).List(metav1.ListOptions{})

the line above is already problematic in clusters with thousands of pods.
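
One way to bound that cost would be to page through the pods with the List API's Limit/Continue fields rather than issuing one unbounded call; a minimal sketch, not part of this PR, with a hypothetical helper name and the same pre-context client-go List signature as the other snippets in this thread:

// listAllPodsChunked pages through all pods 500 at a time instead of
// issuing one unbounded List call against the apiserver.
func listAllPodsChunked(c clientset.Interface) ([]v1.Pod, error) {
	var pods []v1.Pod
	opts := metav1.ListOptions{Limit: 500}
	for {
		list, err := c.CoreV1().Pods(metav1.NamespaceAll).List(opts)
		if err != nil {
			return nil, err
		}
		pods = append(pods, list.Items...)
		if list.Continue == "" { // empty continue token means no more pages
			return pods, nil
		}
		opts.Continue = list.Continue
	}
}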

Contributor Author:

@neolit123 could you clarify how I could fix this? This function already depends on polling all of the pods in the cluster. I could pull the system pods from that query above, but the system-pod query is likely to be pretty small anyway, right?

Member:

@neolit123 doesn't the list of pods already exist in the informer cache? Also, the scheduler tests that use this function don't really create thousands of pods.

Member:

the system pods should be fine, but "all the pods" can block for a while in super-clusters, and i wouldn't be surprised if nobody in the wild runs our test suite on such clusters because of that (possibly for other reasons too). it feels to me this logic should be reworked.

Member:

@ahg-g the cache can help.
ok so this function is only used under test/e2e/scheduling/predicates.go
this means that this function has to be moved under test/e2e/scheduling/ ideally.

i was under the impression that, since this is part of the framework, we were using it elsewhere; otherwise keeping it in the framework doesn't make sense.

Member:

cc @oomichi for an opinion about the move.

Member:

yes, this seems to be used only in the scheduler e2e tests, I agree it should be moved under test/e2e/scheduling

@neolit123 (Member) commented:

/assign @timothysc @johnSchnake
for more eyes on this change.

allPods, err := c.CoreV1().Pods(metav1.NamespaceAll).List(metav1.ListOptions{})
ExpectNoError(err)
ExpectNoError(err, "listing all pods in all namespaces while waiting for stable cluster")
scheduledPods, currentlyNotScheduledPods = e2epod.GetPodsScheduled(masterNodes, allPods)
Member:

this needs to change to

scheduledSystemPods, currentlyNotScheduledSystemPods := e2epod.GetPodsScheduled(masterNodes, systemPods)

Contributor Author:

Should we also include the logic from above that filters out succeeded and failed pods inside this loop? (https://github.com/kubernetes/kubernetes/blob/ed7a68f3bdb1cdbdc51d0ba11b8c53a8dd29cd62/test/e2e/framework/util.go#L2399-L2404) It seems odd that it isn't included already, but if we're making len(allPods.Items) part of our condition checking, we should probably sort out whether it's needed.

Member:

moving everything to inside the loop sounds reasonable to me.
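
For reference, the filter under discussion (per the util.go lines linked above) drops pods in a terminal phase before counting; moved inside the loop it would look roughly like this sketch, with names following the snippets in this thread:

// Drop Succeeded/Failed pods so terminal pods don't skew the
// len(systemPods.Items) != len(allPods.Items) comparison.
currentPods := make([]v1.Pod, 0, len(allPods.Items))
for _, pod := range allPods.Items {
	if pod.Status.Phase != v1.PodSucceeded && pod.Status.Phase != v1.PodFailed {
		currentPods = append(currentPods, pod)
	}
}
allPods.Items = currentPods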

_, currentlyNotScheduledSystemPods := e2epod.GetPodsScheduled(masterNodes, systemPods)

// Wait for system pods to be scheduled, and for pods in all other namespaces to be deleted
for len(currentlyNotScheduledPods) != 0 && (len(currentlyNotScheduledSystemPods) != 0 || len(systemPods.Items) != len(allPods.Items)) {
Member:

why do we need len(currentlyNotScheduledPods) != 0 &&? I think the remaining part should be sufficient: it tests that all system pods are scheduled, and that all scheduled pods are system pods.

Contributor Author:

True, updated

for len(currentlyNotScheduledPods) != 0 {
systemPods, err := c.CoreV1().Pods(metav1.NamespaceSystem).List(metav1.ListOptions{})
ExpectNoError(err, "listing all pods in kube-system namespace while waiting for stable cluster")
_, currentlyNotScheduledSystemPods := e2epod.GetPodsScheduled(masterNodes, systemPods)
Member:

capture scheduledSystemPods here (and return it at the end)

Contributor Author:

updated
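
Putting the thread together, the reworked wait loop reads roughly as follows. This is a sketch assembled from the snippets above (it assumes allPods was just listed as shown earlier), not the verbatim merged code:

// Wait for system pods to be scheduled, and for pods in all other
// namespaces to be deleted.
systemPods, err := c.CoreV1().Pods(metav1.NamespaceSystem).List(metav1.ListOptions{})
ExpectNoError(err, "listing all pods in kube-system namespace while waiting for stable cluster")
scheduledSystemPods, currentlyNotScheduledSystemPods := e2epod.GetPodsScheduled(masterNodes, systemPods)

for len(currentlyNotScheduledSystemPods) != 0 || len(systemPods.Items) != len(allPods.Items) {
	time.Sleep(2 * time.Second)

	// Re-list both views of the cluster (the terminal-pod filter discussed
	// above would also run here) and re-evaluate the condition.
	allPods, err = c.CoreV1().Pods(metav1.NamespaceAll).List(metav1.ListOptions{})
	ExpectNoError(err, "listing all pods in all namespaces while waiting for stable cluster")
	systemPods, err = c.CoreV1().Pods(metav1.NamespaceSystem).List(metav1.ListOptions{})
	ExpectNoError(err, "listing all pods in kube-system namespace while waiting for stable cluster")
	scheduledSystemPods, currentlyNotScheduledSystemPods = e2epod.GetPodsScheduled(masterNodes, systemPods)
}

// Per the review above, the function returns the number of scheduled system pods.
return len(scheduledSystemPods)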

@k8s-ci-robot k8s-ci-robot added size/M Denotes a PR that changes 30-99 lines, ignoring generated files. size/L Denotes a PR that changes 100-499 lines, ignoring generated files. needs-rebase Indicates a PR cannot be merged because it has merge conflicts with HEAD. sig/scheduling Categorizes an issue or PR as relevant to SIG Scheduling. and removed size/S Denotes a PR that changes 10-29 lines, ignoring generated files. size/M Denotes a PR that changes 30-99 lines, ignoring generated files. labels Nov 6, 2019
@k8s-ci-robot k8s-ci-robot removed the needs-rebase Indicates a PR cannot be merged because it has merge conflicts with HEAD. label Nov 6, 2019
}

// ListNamespaceEvents lists the events in the given namespace.
func ListNamespaceEvents(c clientset.Interface, ns string) error {
@neolit123 (Member) commented Nov 6, 2019:

there is already DumpEventsInNamespace.
you can either:

  • use DumpEventsInNamespace with its sorting option
  • add optional sorting to DumpEventsInNamespace as a boolean flag or a sort.Interface (see the sketch after this list):
    https://golang.org/pkg/sort/#Interface
  • don't use DumpEventsInNamespace and inline the event list usage under scheduling. i don't see ListNamespaceEvents used yet.
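
As an illustration of the sorting options above, events could be ordered chronologically by LastTimestamp before dumping; a minimal sketch with a hypothetical helper name (assumes the usual "sort", clientset, and metav1 imports used elsewhere in util.go):

// dumpSortedEvents lists a namespace's events, orders them by LastTimestamp,
// and logs them. sort.Slice is a convenience over a hand-written sort.Interface.
func dumpSortedEvents(c clientset.Interface, ns string) error {
	events, err := c.CoreV1().Events(ns).List(metav1.ListOptions{})
	if err != nil {
		return err
	}
	sort.Slice(events.Items, func(i, j int) bool {
		return events.Items[i].LastTimestamp.Before(&events.Items[j].LastTimestamp)
	})
	for _, e := range events.Items {
		Logf("%v %s/%s: %s", e.LastTimestamp, e.InvolvedObject.Kind, e.InvolvedObject.Name, e.Message)
	}
	return nil
}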

Contributor Author:

This diff is an error in rebasing; I did not add this function... fixing...

@neolit123 (Member) commented:

/approve
for framework.

please squash your commits.

}

}
allPods.Items = currentPods
Member:

can you factor this out to a function, perhaps allPods := getAllPods() and use it in both places.

Comment on lines +53 to +82
systemPods, err := c.CoreV1().Pods(metav1.NamespaceSystem).List(metav1.ListOptions{})
framework.ExpectNoError(err, "listing all pods in kube-system namespace while waiting for stable cluster")
scheduledSystemPods, currentlyNotScheduledSystemPods := e2epod.GetPodsScheduled(masterNodes, systemPods)
Member:

ditto, factor out to a function that returns two int variables and use it in both places, inside and outside the loop:

scheduledSystemPods, notScheduledSystemPods := getSystemPods()
allSystemPods := scheduledSystemPods + notScheduledSystemPods

and replace len(systemPods.Items) with allSystemPods
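
A sketch of the helper being requested; the name and exact signature are my guesses, not necessarily what was merged:

// getScheduledAndUnscheduledSystemPods lists kube-system pods and counts how
// many are scheduled vs. not, so both call sites can share the logic.
func getScheduledAndUnscheduledSystemPods(c clientset.Interface, masterNodes sets.String) (scheduled, notScheduled int) {
	systemPods, err := c.CoreV1().Pods(metav1.NamespaceSystem).List(metav1.ListOptions{})
	framework.ExpectNoError(err, "listing all pods in kube-system namespace while waiting for stable cluster")
	scheduledSystemPods, notScheduledSystemPods := e2epod.GetPodsScheduled(masterNodes, systemPods)
	return len(scheduledSystemPods), len(notScheduledSystemPods)
}

Both call sites would then compute allSystemPods := scheduled + notScheduled and compare it to the total pod count, as described above.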

@damemi (Contributor, Author) commented Nov 7, 2019:

@ahg-g updated with those functions factored out, please take a look again. I also squashed the commits.

@ahg-g (Member) left a comment:

lgtm, but you need to fix the lint issue

test/e2e/scheduling/framework.go (review thread outdated; resolved)
Move WaitForStableCluster to test/e2e/scheduling and update it to wait for only system pods to be ready in a stable cluster.
@ahg-g (Member) commented Nov 7, 2019:

/lgtm
/approve

@k8s-ci-robot k8s-ci-robot added the lgtm "Looks good to me", indicates that a PR is ready to be merged. label Nov 7, 2019
@k8s-ci-robot (Contributor) commented:

[APPROVALNOTIFIER] This PR is APPROVED

This pull-request has been approved by: ahg-g, damemi, neolit123

The full list of commands accepted by this bot can be found here.

The pull request process is described here

Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

@k8s-ci-robot k8s-ci-robot added the approved Indicates a PR has been approved by an approver from all required OWNERS files. label Nov 7, 2019
@ahg-g (Member) commented Nov 7, 2019:

/retest

Labels

  • approved: Indicates a PR has been approved by an approver from all required OWNERS files.
  • area/e2e-test-framework: Issues or PRs related to refactoring the kubernetes e2e test framework
  • area/test
  • cncf-cla: yes (the PR's author has signed the CNCF CLA)
  • kind/flake: Categorizes issue or PR as related to a flaky test.
  • lgtm: "Looks good to me", indicates that a PR is ready to be merged.
  • needs-priority: Indicates a PR lacks a `priority/foo` label and requires one.
  • release-note-none: Denotes a PR that doesn't merit a release note.
  • sig/scheduling: Categorizes an issue or PR as relevant to SIG Scheduling.
  • sig/testing: Categorizes an issue or PR as relevant to SIG Testing.
  • size/L: Denotes a PR that changes 100-499 lines, ignoring generated files.
Development

Successfully merging this pull request may close these issues.

WaitForStableCluster in e2e tests waits for pods to get deleted
7 participants