Bug 1806594: Multi-AZ test should only check nodes the pods are scheduled to #24709

smarterclayton · 2020-03-17T17:25:11Z

The test assumes that all nodes are schedulable when it calculates the set of
nodes, but masters are not, and in many e2e runs the worker nodes span only two
zones. The proper fix is for the e2e suite to take selectors for the nodes that
workloads can schedule to by default, and for the test to derive zones from only
those nodes, but that is a much more invasive change.
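
To illustrate the mismatch described above, here is a minimal sketch, not the e2e framework's code. `Node`, `Taint`, `SchedulableNodes`, and `Zones` are hypothetical, simplified stand-ins for the Kubernetes API types and helpers. It assumes a pod with no tolerations, so any `NoSchedule` or `NoExecute` taint excludes a node.

```go
package main

// Taint is a simplified, hypothetical stand-in for a Kubernetes node taint.
type Taint struct {
	Key    string
	Effect string // e.g. "NoSchedule" or "NoExecute"
}

// Node is a simplified, hypothetical stand-in for a Kubernetes node.
type Node struct {
	Name          string
	Zone          string
	Unschedulable bool
	Taints        []Taint
}

// hasBlockingTaint reports whether a pod without tolerations would be
// kept off this node by one of its taints.
func hasBlockingTaint(n Node) bool {
	for _, t := range n.Taints {
		if t.Effect == "NoSchedule" || t.Effect == "NoExecute" {
			return true
		}
	}
	return false
}

// SchedulableNodes returns the nodes a default workload could land on
// (assumption: the workload carries no tolerations).
func SchedulableNodes(nodes []Node) []Node {
	var out []Node
	for _, n := range nodes {
		if n.Unschedulable || hasBlockingTaint(n) {
			continue
		}
		out = append(out, n)
	}
	return out
}

// Zones returns the set of distinct, non-empty zone names of the given nodes.
func Zones(nodes []Node) map[string]bool {
	zones := map[string]bool{}
	for _, n := range nodes {
		if n.Zone != "" {
			zones[n.Zone] = true
		}
	}
	return zones
}
```

With three tainted masters in three zones and workers in only two of them, `len(Zones(nodes))` is 3 while `len(Zones(SchedulableNodes(nodes)))` is 2. A test that expects spreading across every zone it counted cannot pass in that layout.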

The minimal workaround is to verify spreading only across the nodes the pods
are actually scheduled to, and to error if they are scheduled to only one node
(a multi-AZ test should fail if we are in a single AZ).
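
The workaround can be sketched roughly as follows. This is not the code in this PR; `Pod`, `CheckZoneSpread`, and the `nodeZone` map are hypothetical names. The "at most one pod of difference between zones" tolerance and the explicit single-zone check are assumptions made for illustration.

```go
package main

import "fmt"

// Pod is a hypothetical, simplified pod record: its name and the node
// it was bound to.
type Pod struct {
	Name     string
	NodeName string
}

// CheckZoneSpread verifies spreading using only the nodes that actually
// received pods. nodeZone maps node name to zone name.
func CheckZoneSpread(pods []Pod, nodeZone map[string]string) error {
	podsPerZone := map[string]int{}
	usedNodes := map[string]bool{}
	for _, p := range pods {
		zone, ok := nodeZone[p.NodeName]
		if !ok {
			return fmt.Errorf("pod %s is on node %q, which has no known zone", p.Name, p.NodeName)
		}
		usedNodes[p.NodeName] = true
		podsPerZone[zone]++
	}

	// Error if only one node was scheduled to.
	if len(usedNodes) < 2 {
		return fmt.Errorf("pods were scheduled to %d node(s), need at least 2", len(usedNodes))
	}
	// Assumption: a multi-AZ test should also fail when every used node
	// sits in the same zone.
	if len(podsPerZone) < 2 {
		return fmt.Errorf("pods landed in %d zone(s), a multi-AZ test needs at least 2", len(podsPerZone))
	}

	// Assumption: "spread" means per-zone pod counts differ by at most 1.
	lo, hi := len(pods), 0
	for _, c := range podsPerZone {
		if c < lo {
			lo = c
		}
		if c > hi {
			hi = c
		}
	}
	if hi-lo > 1 {
		return fmt.Errorf("pods are not evenly spread across zones: %v", podsPerZone)
	}
	return nil
}
```

Because zones are derived from the nodes the pods landed on, a master-only zone never enters the count. The trade-off is that a zone with schedulable nodes that received no pods is also not detected, which is why this is described as a workaround and not the full fix.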

This flakes 1 in 4 times on AWS because we run in 2 zones.

openshift-ci-robot · 2020-03-17T17:25:31Z

[APPROVALNOTIFIER] This PR is APPROVED

This pull-request has been approved by: smarterclayton

The full list of commands accepted by this bot can be found here.

The pull request process is described here

Needs approval from an approver in each of these files:

~~OWNERS~~ [smarterclayton]

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

openshift-ci-robot · 2020-03-17T17:28:53Z

@smarterclayton: This pull request references Bugzilla bug 1806594, which is valid. The bug has been moved to the POST state. The bug has been updated to refer to the pull request using the external bug tracker.

3 validation(s) were run on this bug

bug is open, matching expected state (open)
bug target release (4.5.0) matches configured target release for branch (4.5.0)
bug is in the state NEW, which is one of the valid states (NEW, ASSIGNED, ON_DEV, POST, POST)

In response to this:

Bug 1806594: Multi-AZ test should only check nodes the pods are scheduled to

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.

smarterclayton · 2020-03-17T17:28:55Z

/cherrypick release-4.4

openshift-cherrypick-robot · 2020-03-17T17:28:56Z

@smarterclayton: once the present PR merges, I will cherry-pick it on top of release-4.4 in a new PR and assign it to you.

In response to this:

/cherrypick release-4.4

smarterclayton · 2020-03-17T17:28:59Z

/cherrypick release-4.3

openshift-cherrypick-robot · 2020-03-17T17:29:00Z

@smarterclayton: once the present PR merges, I will cherry-pick it on top of release-4.3 in a new PR and assign it to you.

In response to this:

/cherrypick release-4.3

smarterclayton · 2020-03-17T17:32:48Z

4.4 bug is 1814360 and 4.3 bug is 1814363

smarterclayton · 2020-03-17T19:03:33Z

/hold

I may have misdiagnosed the particular failure here in GCP. The general bug is still a bug (masters and workers don't have to match) but this may not fix the GCP flake correctly.

smarterclayton · 2020-03-18T04:49:27Z

/test e2e-gcp

smarterclayton · 2020-03-18T14:33:55Z

/test e2e-aws

smarterclayton · 2020-03-18T14:36:57Z

/test e2e-gcp

soltysh · 2020-03-18T16:48:16Z

Can you make sure we don't name that commit UPSTREAM with a number that points to an issue? It will be confusing during the next rebase.
/cc @marun
since you're doing the current rebase

smarterclayton · 2020-03-19T04:32:04Z

/test e2e-gcp

smarterclayton · 2020-03-22T06:02:56Z

There is no upstream fix yet, so I'd rather reference the issue where you can find the bug than nothing.

smarterclayton · 2020-03-22T06:03:25Z

/test e2e-aws-serial

smarterclayton · 2020-03-22T16:35:21Z

/test e2e-aws-serial

openshift-ci-robot · 2020-03-22T17:01:49Z

@smarterclayton: The following test failed, say /retest to rerun all failed tests:

| Test name | Commit | Details | Rerun command |
| --- | --- | --- | --- |
| ci/prow/e2e-aws-fips | `58b0b9a` | link | `/test e2e-aws-fips` |

Full PR test history. Your PR dashboard. Please help us cut down on flakes by linking to an open issue when you hit one in your PR.

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository. I understand the commands that are listed here.

smarterclayton · 2020-03-22T17:43:24Z

/test e2e-aws-serial

smarterclayton · 2020-03-23T01:04:51Z

/test e2e-aws-serial

marun · 2020-04-01T06:43:19Z

@smarterclayton I've picked the main commit for the rebase PR to try to get the multi-AZ test passing. Will revisit before merge.

soltysh · 2020-04-23T11:36:52Z

@smarterclayton or @damemi can we get some resolution on this one?

openshift-bot · 2020-07-22T13:26:49Z

Issues go stale after 90d of inactivity.

Mark the issue as fresh by commenting /remove-lifecycle stale.
Stale issues rot after an additional 30d of inactivity and eventually close.
Exclude this issue from closing by commenting /lifecycle frozen.

If this issue is safe to close now please do so with /close.

/lifecycle stale

openshift-bot · 2020-08-21T15:38:02Z

Stale issues rot after 30d of inactivity.

Mark the issue as fresh by commenting /remove-lifecycle rotten.
Rotten issues close after an additional 30d of inactivity.
Exclude this issue from closing by commenting /lifecycle frozen.

If this issue is safe to close now please do so with /close.

/lifecycle rotten
/remove-lifecycle stale

openshift-bot · 2020-09-20T17:04:27Z

Rotten issues close after 30d of inactivity.

Reopen the issue by commenting /reopen.
Mark the issue as fresh by commenting /remove-lifecycle rotten.
Exclude this issue from closing again by commenting /lifecycle frozen.

/close

openshift-ci-robot · 2020-09-20T17:04:42Z

@openshift-bot: Closed this PR.

In response to this:

Rotten issues close after 30d of inactivity.

Reopen the issue by commenting /reopen.
Mark the issue as fresh by commenting /remove-lifecycle rotten.
Exclude this issue from closing again by commenting /lifecycle frozen.

/close

openshift-ci-robot · 2020-09-20T17:04:44Z

@smarterclayton: This pull request references Bugzilla bug 1806594. The bug has been updated to no longer refer to the pull request using the external bug tracker.

In response to this:

Bug 1806594: Multi-AZ test should only check nodes the pods are scheduled to

openshift-ci-robot added the `approved` label (Indicates a PR has been approved by an approver from all required OWNERS files.) Mar 17, 2020

openshift-ci-robot requested review from sjenning and tbielawa March 17, 2020 17:25

openshift-ci-robot added the `vendor-update` label (Touching vendor dir or related files) Mar 17, 2020

smarterclayton changed the title ~~UPSTREAM: 89178: Multi-AZ test should only check nodes the pods are scheduled to~~ Bug 1806594: Multi-AZ test should only check nodes the pods are scheduled to Mar 17, 2020

openshift-ci-robot added the `bugzilla/valid-bug` label (Indicates that a referenced Bugzilla bug is valid for the branch this PR is targeting.) Mar 17, 2020

openshift-ci-robot added the `do-not-merge/hold` label (Indicates that a PR should not merge because someone has issued a /hold command.) Mar 17, 2020

DO NOT MERGE: Add debugging output and wait for nodes ready

58b0b9a

openshift-ci-robot requested a review from marun March 18, 2020 16:48

openshift-ci-robot added the `lifecycle/stale` label (Denotes an issue or PR has remained open with no activity and has become stale.) Jul 22, 2020

openshift-ci-robot added the `lifecycle/rotten` label (Denotes an issue or PR that has aged beyond stale and will be auto-closed.) and removed the `lifecycle/stale` label (Denotes an issue or PR has remained open with no activity and has become stale.) Aug 21, 2020

openshift-ci-robot closed this Sep 20, 2020
