Bug 1806594: Multi-AZ test should only check nodes the pods are scheduled to #24709

Closed

smarterclayton
Contributor

The test assumes that all nodes are schedulable when it calculates the node set,
but masters are not, and in many e2e runs the nodes are in only two zones.
The e2e suite needs to be fixed to take selectors for the nodes that workloads
can schedule to by default, and the test should then use only those nodes
to get zones, but that is a much more invasive change.

The minimum workaround is to verify spreading only across the nodes the pods
are actually scheduled to, and to error if the pods land on only one node
(a multi-AZ test should fail if we are in a single AZ).

This flakes 1 in 4 times on AWS because we run in 2 zones.
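
The workaround described above can be sketched roughly as follows. This is not the code in this PR or in the upstream e2e framework: `Pod`, `CheckSpread`, `ErrTooFewNodes`, the `nodeZone` map, and the `maxSkew` tolerance are all hypothetical names and assumptions made for illustration. The sketch derives zones only from the nodes that pods actually landed on, and fails when fewer than two nodes are in use.

```go
package main

import (
	"errors"
	"fmt"
)

// Pod is a hypothetical, simplified stand-in for a Kubernetes pod.
type Pod struct {
	Name     string
	NodeName string // empty means the pod has not been scheduled
}

// ErrTooFewNodes reports that the pods did not land on at least two nodes.
var ErrTooFewNodes = errors.New("pods were scheduled to fewer than two nodes")

// CheckSpread counts pods per zone using only the nodes that pods were
// scheduled to. nodeZone maps node name to zone. maxSkew is an assumed
// tolerance for the difference between the fullest and emptiest zone.
func CheckSpread(pods []Pod, nodeZone map[string]string, maxSkew int) error {
	usedNodes := map[string]bool{}
	podsPerZone := map[string]int{}
	for _, p := range pods {
		if p.NodeName == "" {
			return fmt.Errorf("pod %q is not scheduled", p.Name)
		}
		zone, ok := nodeZone[p.NodeName]
		if !ok {
			return fmt.Errorf("node %q has no known zone", p.NodeName)
		}
		usedNodes[p.NodeName] = true
		podsPerZone[zone]++
	}
	// A multi-AZ test cannot say anything useful about a single node.
	if len(usedNodes) < 2 {
		return ErrTooFewNodes
	}
	minPods, maxPods := len(pods), 0
	for _, n := range podsPerZone {
		if n < minPods {
			minPods = n
		}
		if n > maxPods {
			maxPods = n
		}
	}
	if maxPods-minPods > maxSkew {
		return fmt.Errorf("pods unevenly spread across zones: %v", podsPerZone)
	}
	return nil
}

func main() {
	// The master sits in a third zone but never receives pods,
	// so zone "c" does not take part in the check.
	nodeZone := map[string]string{"worker-a": "a", "worker-b": "b", "master-c": "c"}
	pods := []Pod{
		{Name: "p1", NodeName: "worker-a"},
		{Name: "p2", NodeName: "worker-b"},
		{Name: "p3", NodeName: "worker-a"},
	}
	fmt.Println(CheckSpread(pods, nodeZone, 1))
}
```

Because zones with no scheduled pods never enter `podsPerZone`, an unschedulable master in a third zone no longer counts as an empty zone, which is the failure mode the description attributes to the current test.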

@openshift-ci-robot

[APPROVALNOTIFIER] This PR is APPROVED

This pull-request has been approved by: smarterclayton

The full list of commands accepted by this bot can be found here.

The pull request process is described here

Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

@openshift-ci-robot added the approved label (Indicates a PR has been approved by an approver from all required OWNERS files.) on Mar 17, 2020
@openshift-ci-robot added the vendor-update label (Touching vendor dir or related files) on Mar 17, 2020
@smarterclayton changed the title from "UPSTREAM: 89178: Multi-AZ test should only check nodes the pods are scheduled to" to "Bug 1806594: Multi-AZ test should only check nodes the pods are scheduled to" on Mar 17, 2020
@openshift-ci-robot

@smarterclayton: This pull request references Bugzilla bug 1806594, which is valid. The bug has been moved to the POST state. The bug has been updated to refer to the pull request using the external bug tracker.

3 validation(s) were run on this bug
  • bug is open, matching expected state (open)
  • bug target release (4.5.0) matches configured target release for branch (4.5.0)
  • bug is in the state NEW, which is one of the valid states (NEW, ASSIGNED, ON_DEV, POST, POST)

In response to this:

Bug 1806594: Multi-AZ test should only check nodes the pods are scheduled to

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.

@openshift-ci-robot added the bugzilla/valid-bug label (Indicates that a referenced Bugzilla bug is valid for the branch this PR is targeting.) on Mar 17, 2020
@smarterclayton
Contributor Author

/cherrypick release-4.4

@openshift-cherrypick-robot

@smarterclayton: once the present PR merges, I will cherry-pick it on top of release-4.4 in a new PR and assign it to you.

In response to this:

/cherrypick release-4.4

@smarterclayton
Contributor Author

/cherrypick release-4.3

@openshift-cherrypick-robot

@smarterclayton: once the present PR merges, I will cherry-pick it on top of release-4.3 in a new PR and assign it to you.

In response to this:

/cherrypick release-4.3

@smarterclayton
Contributor Author

4.4 bug is 1814360 and 4.3 bug is 1814363

@smarterclayton
Contributor Author

/hold

I may have misdiagnosed the particular failure here in GCP. The general bug is still a bug (masters and workers don't have to match) but this may not fix the GCP flake correctly.

@openshift-ci-robot added the do-not-merge/hold label (Indicates that a PR should not merge because someone has issued a /hold command.) on Mar 17, 2020
@smarterclayton
Contributor Author

/test e2e-gcp

@smarterclayton
Contributor Author

/test e2e-aws

@smarterclayton
Contributor Author

/test e2e-gcp

@soltysh
Member

soltysh commented Mar 18, 2020

Can you make sure the upstream commit isn't labeled with a number that points to an issue? It will be confusing during the next rebase.
/cc @marun
since you're doing the current rebase

@smarterclayton
Contributor Author

/test e2e-gcp

@smarterclayton
Contributor Author

There is no upstream fix yet, so I'd rather reference the issue where you can find the bug than nothing.

@smarterclayton
Contributor Author

/test e2e-aws-serial

@smarterclayton
Contributor Author

/test e2e-aws-serial

@openshift-ci-robot

openshift-ci-robot commented Mar 22, 2020

@smarterclayton: The following test failed, say /retest to rerun all failed tests:

| Test name | Commit | Details | Rerun command |
| --- | --- | --- | --- |
| ci/prow/e2e-aws-fips | 58b0b9a | link | /test e2e-aws-fips |

Full PR test history. Your PR dashboard. Please help us cut down on flakes by linking to an open issue when you hit one in your PR.

@smarterclayton
Contributor Author

/test e2e-aws-serial

@smarterclayton
Contributor Author

/test e2e-aws-serial

@marun
Contributor

marun commented Apr 1, 2020

@smarterclayton I've picked the main commit for the rebase PR to try to get the multi-AZ test passing. Will revisit before merge.

@soltysh
Member

soltysh commented Apr 23, 2020

@smarterclayton or @damemi can we get some resolution on this one?

@openshift-bot
Contributor

Issues go stale after 90d of inactivity.

Mark the issue as fresh by commenting /remove-lifecycle stale.
Stale issues rot after an additional 30d of inactivity and eventually close.
Exclude this issue from closing by commenting /lifecycle frozen.

If this issue is safe to close now please do so with /close.

/lifecycle stale

@openshift-ci-robot added the lifecycle/stale label (Denotes an issue or PR has remained open with no activity and has become stale.) on Jul 22, 2020
@openshift-bot
Contributor

Stale issues rot after 30d of inactivity.

Mark the issue as fresh by commenting /remove-lifecycle rotten.
Rotten issues close after an additional 30d of inactivity.
Exclude this issue from closing by commenting /lifecycle frozen.

If this issue is safe to close now please do so with /close.

/lifecycle rotten
/remove-lifecycle stale

@openshift-ci-robot added the lifecycle/rotten label (Denotes an issue or PR that has aged beyond stale and will be auto-closed.) and removed the lifecycle/stale label (Denotes an issue or PR has remained open with no activity and has become stale.) on Aug 21, 2020
@openshift-bot
Contributor

Rotten issues close after 30d of inactivity.

Reopen the issue by commenting /reopen.
Mark the issue as fresh by commenting /remove-lifecycle rotten.
Exclude this issue from closing again by commenting /lifecycle frozen.

/close

@openshift-ci-robot

@openshift-bot: Closed this PR.

@openshift-ci-robot

@smarterclayton: This pull request references Bugzilla bug 1806594. The bug has been updated to no longer refer to the pull request using the external bug tracker.

In response to this:

Bug 1806594: Multi-AZ test should only check nodes the pods are scheduled to

Labels
approved: Indicates a PR has been approved by an approver from all required OWNERS files.
bugzilla/valid-bug: Indicates that a referenced Bugzilla bug is valid for the branch this PR is targeting.
do-not-merge/hold: Indicates that a PR should not merge because someone has issued a /hold command.
lifecycle/rotten: Denotes an issue or PR that has aged beyond stale and will be auto-closed.
vendor-update: Touching vendor dir or related files
6 participants