Bug 1985997: Enable static pod fallback logic for SNO, with disruptive e2e test #1198

Merged
merged 4 commits into from Aug 4, 2021

Contributor

@p0lyn0mial p0lyn0mial commented Aug 2, 2021

@openshift-ci openshift-ci bot added the do-not-merge/work-in-progress Indicates that a PR should not merge because it is a work in progress. label Aug 2, 2021
@openshift-ci openshift-ci bot requested review from mfojtik and soltysh August 2, 2021 14:49
@p0lyn0mial p0lyn0mial force-pushed the e2e-sno-disruptive branch 2 times, most recently from b19d13c to 025d0f7 Compare August 3, 2021 10:07
@p0lyn0mial
Contributor Author

/test e2e-gcp-operator-encryption-single-node

@p0lyn0mial
Contributor Author

/test e2e-gcp-operator-encryption-rotation-single-node
/test e2e-gcp-operator-encryption-perf-single-node

}
}

func ensureClusterInGoodState(t testing.TB, cs clientSet, waitPollInterval, waitPollTimeout, mustBeReadyFor time.Duration) {
Contributor

does it need waitPollInterval ?

Contributor Author

do you want to use wait.UntilWithContext ?

Contributor

No, I would just make this a constant: 10s or 30s.

t.Logf("Waiting %s for the cluster to be in a good condition, interval = %v, timeout %v", mustBeReadyFor.String(), waitPollInterval, waitPollTimeout)

return wait.Poll(waitPollInterval, waitPollTimeout, func() (bool, error) {
monitorPod, err := cs.Kube.CoreV1().Pods("openshift-kube-apiserver").Get(context.TODO(), "kube-apiserver-startup-monitor", metav1.GetOptions{})
Contributor

These are static pods, so they have a different name (the node name is appended).
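For illustration, a sketch of how the pod name could be built, assuming the usual mirror-pod naming of `<static pod name>-<node name>`. `staticPodName` and the node name `node-a` are hypothetical and not taken from the PR.

```go
package main

import "fmt"

// staticPodName returns the name under which a static pod shows up in the API:
// the pod name from the manifest with the node name appended.
func staticPodName(base, nodeName string) string {
	return fmt.Sprintf("%s-%s", base, nodeName)
}

func main() {
	// "node-a" is a placeholder; a test would read the node name from the cluster.
	fmt.Println(staticPodName("kube-apiserver-startup-monitor", "node-a"))
}
```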

return false, nil /*retry*/
}
}
// TODO: on an HA cluster we could also check if pods are on the same revision
Contributor

@sttts sttts Aug 3, 2021

I would consider a cluster that is not in progress as good (i.e. nodeStatus has no transitioning node). Why do we need all of this?

Contributor Author

Just playing it safe. Would checking that nodeStatus contains only currentRevision and nodeName be enough?

Contributor

I think it is enough to start with. We also don't want to start from some "dirty" state.
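A rough sketch of the "no transitioning node" check discussed in this thread. The `nodeStatus` struct is a simplified, hypothetical stand-in for the operator's node status type, and the rule "a non-zero target revision that differs from the current revision means the node is in progress" is an assumption made for this example.

```go
package main

import "fmt"

// nodeStatus is a simplified stand-in with only the fields this check needs.
type nodeStatus struct {
	NodeName        string
	CurrentRevision int32
	TargetRevision  int32
}

// noNodeInProgress reports whether no node is moving towards a new revision.
func noNodeInProgress(statuses []nodeStatus) bool {
	for _, s := range statuses {
		if s.TargetRevision != 0 && s.TargetRevision != s.CurrentRevision {
			return false
		}
	}
	return true
}

func main() {
	settled := []nodeStatus{{NodeName: "node-a", CurrentRevision: 7}}
	rolling := []nodeStatus{{NodeName: "node-a", CurrentRevision: 7, TargetRevision: 8}}
	fmt.Println(noNodeInProgress(settled), noNodeInProgress(rolling))
}
```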


t.Logf("Setting UnsupportedConfigOverrides to %v", cfg)
err := retry.OnError(retry.DefaultRetry, func(error) bool { return true }, func() error {
raw, err := json.Marshal(cfg)
Contributor

nit: marshaling once, outside of the loop, is enough

t.Helper()

t.Log("Waiting for StaticPodFallbackRevisionDegraded condition, interval = 20s, timeout = 6min")
err := wait.Poll(20*time.Second, 6*time.Minute, func() (bool, error) {
Contributor

why 6min?

return false, nil /*retry*/
}

if v1helpers.IsOperatorConditionFalse(ckaso.Status.Conditions, "StaticPodFallbackRevisionDegraded") {
Contributor

why not check for True and return true? Seems to be more direct in a loop like this.
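A sketch of the more direct form suggested here: the poll function returns true exactly when the condition is True, and anything else means "retry". The `condition` type and `isConditionTrue` helper are simplified, hypothetical stand-ins for the operator condition type and the `v1helpers` functions used in the test.

```go
package main

import "fmt"

// condition is a simplified stand-in for an operator status condition.
type condition struct {
	Type   string
	Status string // "True", "False" or "Unknown"
}

// isConditionTrue reports whether the named condition is present with status "True".
func isConditionTrue(conds []condition, condType string) bool {
	for _, c := range conds {
		if c.Type == condType {
			return c.Status == "True"
		}
	}
	return false
}

// fallbackDegraded has the shape of a poll condition func:
// done when the condition is True, otherwise keep polling.
func fallbackDegraded(conds []condition) (bool, error) {
	return isConditionTrue(conds, "StaticPodFallbackRevisionDegraded"), nil
}

func main() {
	conds := []condition{{Type: "StaticPodFallbackRevisionDegraded", Status: "True"}}
	done, err := fallbackDegraded(conds)
	fmt.Println(done, err)
}
```

With this shape a missing condition and a False condition are treated the same way, so the loop needs no separate early-return branch.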

require.NoError(t, err)
}

func assertNodeStatus(t testing.TB, cs clientSet) (string, int32) {
Contributor

assert what?

return nodeName, failedRevision
}

func assertKasPodAnnotatedOnNode(t testing.TB, cs clientSet, expectedFailedRevision int32, expectedFallbackReason, expectedFallbackMessage, nodeName string) {
Contributor

I don't think we have to check the message. It makes maintenance hard. We have unit tests for that.

If you really want to be specific, just check for a keyword in the string (strings.Contains).

Contributor Author

That's true as well; there might be cases that would break etcd. I can remove it.
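If the message check were kept, the keyword variant mentioned in this thread could be as small as the sketch below. `messageMentions` is a hypothetical helper, and the sample message and keyword are made up for illustration; they are not the text the operator writes.

```go
package main

import (
	"fmt"
	"strings"
)

// messageMentions checks for a keyword instead of comparing the whole message,
// so rewording the message does not break the test.
func messageMentions(message, keyword string) bool {
	return strings.Contains(message, keyword)
}

func main() {
	// Placeholder message, not taken from the operator.
	msg := "fallback to last-known-good revision took place"
	fmt.Println(messageMentions(msg, "last-known-good"))
}
```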

return filteredKasPods
}

func getDefaultUnsupportedConfig(t testing.TB, cs clientSet) map[string][]string {
Contributor

Not sure about this func. The name is unspecific, and it abstracts something that is better left explicit.

Contributor Author

getPlatformSpecificConfig would be better?

Contributor

I don't think the infra check is something one should hide in a func.

@p0lyn0mial
Contributor Author

/test e2e-aws-operator-disruptive-single-node

@p0lyn0mial p0lyn0mial force-pushed the e2e-sno-disruptive branch 2 times, most recently from d9c6e3b to cb79c1d Compare August 4, 2021 08:24
@p0lyn0mial p0lyn0mial changed the title WIP: implements a basic disruptive test on an SNO cluster Bug 1989633: implements a basic disruptive test on an SNO cluster Aug 4, 2021
@openshift-ci openshift-ci bot added bugzilla/severity-high Referenced Bugzilla bug's severity is high for the branch this PR is targeting. bugzilla/invalid-bug Indicates that a referenced Bugzilla bug is invalid for the branch this PR is targeting. and removed do-not-merge/work-in-progress Indicates that a PR should not merge because it is a work in progress. labels Aug 4, 2021
@openshift-ci
Contributor

openshift-ci bot commented Aug 4, 2021

@p0lyn0mial: This pull request references Bugzilla bug 1989633, which is invalid:

  • expected the bug to be in one of the following states: NEW, ASSIGNED, ON_DEV, POST, POST, but it is ON_QA instead

Comment /bugzilla refresh to re-evaluate validity if changes to the Bugzilla bug are made, or edit the title of this pull request to link to a different bug.

In response to this:

Bug 1989633: implements a basic disruptive test on an SNO cluster

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.

setUnsupportedConfig(t, cs, cfg)

// validate if the fallback condition is reported and the cluster is stable
waitForFallbackDegradedCondition(t, cs, waitForFallbackDegradedConditionTimeout)
Contributor

can't we derive waitForFallbackDegradedConditionTimeout from some per-platform number? E.g. oneNodeRolloutTimeout

Contributor Author

added TODO

@sttts
Contributor

sttts commented Aug 4, 2021

/retest

@sttts sttts changed the title Bug 1985997: implements a basic disruptive test on an SNO cluster Bug 1985997: Enable static pod fallback, with disruptive e2e test Aug 4, 2021
@openshift-ci
Contributor

openshift-ci bot commented Aug 4, 2021

@p0lyn0mial: This pull request references Bugzilla bug 1985997, which is valid.

3 validation(s) were run on this bug
  • bug is open, matching expected state (open)
  • bug target release (4.9.0) matches configured target release for branch (4.9.0)
  • bug is in the state POST, which is one of the valid states (NEW, ASSIGNED, ON_DEV, POST, POST)

Requesting review from QA contact:
/cc @wangke19

In response to this:

Bug 1985997: Enable static pod fallback, with disruptive e2e test


@sttts
Contributor

sttts commented Aug 4, 2021

/lgtm
/approve

@openshift-ci openshift-ci bot added lgtm Indicates that a PR is ready to be merged. approved Indicates a PR has been approved by an approver from all required OWNERS files. labels Aug 4, 2021
@sttts sttts changed the title Bug 1985997: Enable static pod fallback, with disruptive e2e test Bug 1985997: Enable static pod fallback logic for SNO, with disruptive e2e test Aug 4, 2021
@openshift-ci openshift-ci bot removed the lgtm Indicates that a PR is ready to be merged. label Aug 4, 2021
@p0lyn0mial
Contributor Author

pulled openshift/library-go#1177
please retag

@sttts
Contributor

sttts commented Aug 4, 2021

/retest
/lgtm

@openshift-ci openshift-ci bot added the lgtm Indicates that a PR is ready to be merged. label Aug 4, 2021
@openshift-ci
Contributor

openshift-ci bot commented Aug 4, 2021

[APPROVALNOTIFIER] This PR is APPROVED

This pull-request has been approved by: p0lyn0mial, sttts

The full list of commands accepted by this bot can be found here.

The pull request process is described here

Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

@openshift-bot
Contributor

/retest-required

Please review the full test history for this PR and help us cut down flakes.

@openshift-ci
Contributor

openshift-ci bot commented Aug 4, 2021

@p0lyn0mial: The following tests failed, say /retest to rerun all failed tests or /retest-required to rerun all mandatory failed tests:

| Test name | Commit | Details | Rerun command |
| --- | --- | --- | --- |
| ci/prow/e2e-gcp-operator-encryption-perf-single-node | 025d0f7 | link | /test e2e-gcp-operator-encryption-perf-single-node |
| ci/prow/e2e-gcp-operator-encryption-rotation-single-node | 025d0f7 | link | /test e2e-gcp-operator-encryption-rotation-single-node |
| ci/prow/e2e-gcp-operator-encryption-single-node | 025d0f7 | link | /test e2e-gcp-operator-encryption-single-node |
| ci/prow/e2e-aws-single-node | 6f77ec8 | link | /test e2e-aws-single-node |
| ci/prow/e2e-gcp-operator-single-node | 6f77ec8 | link | /test e2e-gcp-operator-single-node |

Full PR test history. Your PR dashboard.


@sttts
Contributor

sttts commented Aug 4, 2021

flakes:

[sig-arch][Early] Managed cluster should start all core operators [Skipped:Disconnected] [Suite:openshift/conformance/parallel] (0s)

fail [github.com/onsi/ginkgo@v4.7.0-origin.0+incompatible/internal/leafnodes/runner.go:113]: Aug  4 16:40:38.918: Some cluster operators are not ready: authentication (Degraded=True OAuthServerDeployment_UnavailablePod: OAuthServerDeploymentDegraded: 1 of 3 requested instances are unavailable for oauth-openshift.openshift-authentication ())

@openshift-ci openshift-ci bot merged commit 61d60ee into openshift:master Aug 4, 2021
@openshift-ci
Contributor

openshift-ci bot commented Aug 4, 2021

@p0lyn0mial: All pull requests linked via external trackers have merged:

Bugzilla bug 1985997 has been moved to the MODIFIED state.

In response to this:

Bug 1985997: Enable static pod fallback logic for SNO, with disruptive e2e test


Labels
approved Indicates a PR has been approved by an approver from all required OWNERS files. bugzilla/severity-high Referenced Bugzilla bug's severity is high for the branch this PR is targeting. bugzilla/valid-bug Indicates that a referenced Bugzilla bug is valid for the branch this PR is targeting. lgtm Indicates that a PR is ready to be merged.