Bug 1985997: Enable static pod fallback logic for SNO, with disruptive e2e test #1198

p0lyn0mial · 2021-08-02T14:49:26Z

Replacing #1195. Enabling https://github.com/openshift/enhancements/blob/master/enhancements/kube-apiserver/startup-monitor.md for single node.

cc @romfreiman @eranco74

p0lyn0mial · 2021-08-03T10:09:14Z

/test e2e-gcp-operator-encryption-single-node

p0lyn0mial · 2021-08-03T10:10:09Z

/test e2e-gcp-operator-encryption-rotation-single-node
/test e2e-gcp-operator-encryption-perf-single-node

sttts · 2021-08-03T10:11:06Z

test/e2e-sno-disruptive/sno_disruptive_test.go

+	}
+}
+
+func ensureClusterInGoodState(t testing.TB, cs clientSet, waitPollInterval, waitPollTimeout, mustBeReadyFor time.Duration) {


Does it need waitPollInterval?

Do you want to use wait.UntilWithContext?

No, I would just make this a constant: 10s or 30s.

sttts · 2021-08-03T10:23:01Z

test/e2e-sno-disruptive/sno_disruptive_test.go

+	t.Logf("Waiting %s for the cluster to be in a good condition, interval = %v, timeout %v", mustBeReadyFor.String(), waitPollInterval, waitPollTimeout)
+
+	return wait.Poll(waitPollInterval, waitPollTimeout, func() (bool, error) {
+		monitorPod, err := cs.Kube.CoreV1().Pods("openshift-kube-apiserver").Get(context.TODO(), "kube-apiserver-startup-monitor", metav1.GetOptions{})


These are static pods, so they have a different name (the node name is appended).
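
A minimal sketch of the naming point above, assuming the usual kubelet convention that a static pod shows up in the API as `<pod name>-<node name>`. The helper `staticPodNameOnNode` and the node name `node-1` are hypothetical and not taken from this PR.

```go
package main

import "fmt"

// staticPodNameOnNode returns the name under which a static pod is visible
// through the API server: the manifest name with the node name appended.
// Hypothetical helper for illustration only; it is not part of the PR.
func staticPodNameOnNode(podName, nodeName string) string {
	return fmt.Sprintf("%s-%s", podName, nodeName)
}

func main() {
	// "node-1" is a made-up node name; a real test would read it from nodeStatus.
	fmt.Println(staticPodNameOnNode("kube-apiserver-startup-monitor", "node-1"))
}
```

With a helper like this, the `Get` call in the poll function would look up the node-specific name rather than the bare manifest name.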

sttts · 2021-08-03T10:24:07Z

test/e2e-sno-disruptive/sno_disruptive_test.go

+				return false, nil /*retry*/
+			}
+		}
+		// TODO: on an HA cluster we could also check if pods are on the same revision


I would consider a cluster that is not in progress as good (i.e. nodeStatus has no transitioning node). Why do we need all of this?

Just playing it safe. Would checking that nodeStatus contains only currentRevision and nodeName be enough?

I think it is enough to start with. We also don't want to start from some "dirty" state.
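
The check agreed on in this thread could be sketched roughly as follows. The `nodeStatus` struct is a simplified, hypothetical stand-in that models only the fields discussed here, and `isSettled` is an illustrative name, not the function from the PR.

```go
package main

import "fmt"

// nodeStatus is a simplified, hypothetical stand-in for the operator's
// per-node status entry; only the fields discussed in the review are modeled.
type nodeStatus struct {
	NodeName           string
	CurrentRevision    int32
	TargetRevision     int32
	LastFailedRevision int32
}

// isSettled reports whether every entry has only NodeName and CurrentRevision
// set: no node is transitioning (TargetRevision is zero) and no node carries
// leftovers from an earlier failed rollout (LastFailedRevision is zero).
func isSettled(statuses []nodeStatus) bool {
	for _, s := range statuses {
		if s.NodeName == "" || s.CurrentRevision == 0 {
			return false
		}
		if s.TargetRevision != 0 || s.LastFailedRevision != 0 {
			return false
		}
	}
	// An empty list means nothing has been rolled out yet, which is not "good".
	return len(statuses) > 0
}

func main() {
	clean := []nodeStatus{{NodeName: "node-1", CurrentRevision: 7}}
	rolling := []nodeStatus{{NodeName: "node-1", CurrentRevision: 7, TargetRevision: 8}}
	fmt.Println(isSettled(clean), isSettled(rolling))
}
```

Rejecting a non-zero LastFailedRevision is what keeps the test from starting in a "dirty" state.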

sttts · 2021-08-03T10:25:12Z

test/e2e-sno-disruptive/sno_disruptive_test.go

+
+	t.Logf("Setting UnsupportedConfigOverrides to %v", cfg)
+	err := retry.OnError(retry.DefaultRetry, func(error) bool { return true }, func() error {
+		raw, err := json.Marshal(cfg)


nit: marshaling once, outside of the loop, is enough

sttts · 2021-08-03T10:25:28Z

test/e2e-sno-disruptive/sno_disruptive_test.go

+	t.Helper()
+
+	t.Log("Waiting for StaticPodFallbackRevisionDegraded condition, interval = 20s, timeout = 6min")
+	err := wait.Poll(20*time.Second, 6*time.Minute, func() (bool, error) {


sttts · 2021-08-03T10:26:09Z

test/e2e-sno-disruptive/sno_disruptive_test.go

+			return false, nil /*retry*/
+		}
+
+		if v1helpers.IsOperatorConditionFalse(ckaso.Status.Conditions, "StaticPodFallbackRevisionDegraded") {


why not check for True and return true? Seems to be more direct in a loop like this.
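
A sketch of the more direct form suggested here. The `condition` type is a simplified, hypothetical stand-in for an operator condition, and `isConditionTrue` only mirrors what a helper of that kind would do; neither is copied from the PR or from library-go.

```go
package main

import "fmt"

// condition is a simplified, hypothetical stand-in for an operator condition.
type condition struct {
	Type   string
	Status string // "True", "False" or "Unknown"
}

// isConditionTrue reports whether the named condition exists with status "True".
func isConditionTrue(conds []condition, condType string) bool {
	for _, c := range conds {
		if c.Type == condType {
			return c.Status == "True"
		}
	}
	return false
}

// fallbackDegraded is the body of one poll iteration: done is true as soon
// as the condition is observed; otherwise the caller retries.
func fallbackDegraded(conds []condition) (done bool, err error) {
	return isConditionTrue(conds, "StaticPodFallbackRevisionDegraded"), nil
}

func main() {
	done, err := fallbackDegraded([]condition{
		{Type: "StaticPodFallbackRevisionDegraded", Status: "True"},
	})
	fmt.Println(done, err)
}
```

Checking for True also treats a missing or Unknown condition as "keep waiting", which checking for False and negating would not.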

sttts · 2021-08-03T10:26:20Z

test/e2e-sno-disruptive/sno_disruptive_test.go

+	require.NoError(t, err)
+}
+
+func assertNodeStatus(t testing.TB, cs clientSet) (string, int32) {


assert what?

sttts · 2021-08-03T10:27:31Z

test/e2e-sno-disruptive/sno_disruptive_test.go

+	return nodeName, failedRevision
+}
+
+func assertKasPodAnnotatedOnNode(t testing.TB, cs clientSet, expectedFailedRevision int32, expectedFallbackReason, expectedFallbackMessage, nodeName string) {


I don't think we have to check the message. It makes maintenance hard, and we have unit tests for that.

If you really want to be specific, just check for a keyword in the string (strings.Contains).

That's true as well; there might be cases that would break etcd. I can remove it.
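
If a message check is kept at all, the keyword variant could look roughly like this. `checkFallbackMessage` and the example message are hypothetical and only illustrate the `strings.Contains` suggestion.

```go
package main

import (
	"fmt"
	"strings"
)

// checkFallbackMessage verifies only that msg mentions keyword, so the test
// keeps passing when the exact wording of the message changes.
func checkFallbackMessage(msg, keyword string) error {
	if !strings.Contains(msg, keyword) {
		return fmt.Errorf("fallback message %q does not mention %q", msg, keyword)
	}
	return nil
}

func main() {
	// Made-up message text, used only to exercise the helper.
	msg := "kube-apiserver did not become ready: timed out waiting for readyz"
	fmt.Println(checkFallbackMessage(msg, "timed out"))
}
```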

sttts · 2021-08-03T10:37:13Z

test/e2e-sno-disruptive/sno_disruptive_test.go

+	return filteredKasPods
+}
+
+func getDefaultUnsupportedConfig(t testing.TB, cs clientSet) map[string][]string {


Not sure about this func. The name is unspecific, and it abstracts something that is better kept explicit.

Would getPlatformSpecificConfig be better?

I don't think the infra check is something one should hide in a func.

p0lyn0mial · 2021-08-04T07:20:09Z

/test e2e-aws-operator-disruptive-single-node

openshift-ci · 2021-08-04T08:26:46Z

@p0lyn0mial: This pull request references Bugzilla bug 1989633, which is invalid:

expected the bug to be in one of the following states: NEW, ASSIGNED, ON_DEV, POST, POST, but it is ON_QA instead

Comment /bugzilla refresh to re-evaluate validity if changes to the Bugzilla bug are made, or edit the title of this pull request to link to a different bug.

In response to this:

Bug 1989633: implements a basic disruptive test on an SNO cluster

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.

sttts · 2021-08-04T09:24:01Z

test/e2e-sno-disruptive/sno_disruptive_test.go

+	setUnsupportedConfig(t, cs, cfg)
+
+	// validate if the fallback condition is reported and the cluster is stable
+	waitForFallbackDegradedCondition(t, cs, waitForFallbackDegradedConditionTimeout)


can't we derive waitForFallbackDegradedConditionTimeout from some per-platform number? E.g. oneNodeRolloutTimeout

added TODO

sttts · 2021-08-04T14:55:20Z

/retest

openshift-ci · 2021-08-04T15:16:04Z

@p0lyn0mial: This pull request references Bugzilla bug 1985997, which is valid.

3 validation(s) were run on this bug

bug is open, matching expected state (open)
bug target release (4.9.0) matches configured target release for branch (4.9.0)
bug is in the state POST, which is one of the valid states (NEW, ASSIGNED, ON_DEV, POST, POST)

Requesting review from QA contact:
/cc @wangke19

In response to this:

Bug 1985997: Enable static pod fallback, with disruptive e2e test

sttts · 2021-08-04T15:16:24Z

/lgtm
/approve

openshift-ci · 2021-08-04T15:17:26Z

@p0lyn0mial: This pull request references Bugzilla bug 1985997, which is valid.

3 validation(s) were run on this bug

bug is open, matching expected state (open)
bug target release (4.9.0) matches configured target release for branch (4.9.0)
bug is in the state POST, which is one of the valid states (NEW, ASSIGNED, ON_DEV, POST, POST)

Requesting review from QA contact:
/cc @wangke19

In response to this:

Bug 1985997: Enable static pod fallback, with disruptive e2e test

openshift-ci · 2021-08-04T15:17:41Z

@p0lyn0mial: This pull request references Bugzilla bug 1985997, which is valid.

3 validation(s) were run on this bug

bug is open, matching expected state (open)
bug target release (4.9.0) matches configured target release for branch (4.9.0)
bug is in the state POST, which is one of the valid states (NEW, ASSIGNED, ON_DEV, POST, POST)

Requesting review from QA contact:
/cc @wangke19

In response to this:

Bug 1985997: Enable static pod fallback, with disruptive e2e test

p0lyn0mial · 2021-08-04T15:54:20Z

pulled openshift/library-go#1177
please retag

sttts · 2021-08-04T17:30:25Z

/retest
/lgtm

openshift-ci · 2021-08-04T17:30:35Z

[APPROVALNOTIFIER] This PR is APPROVED

This pull-request has been approved by: p0lyn0mial, sttts

The full list of commands accepted by this bot can be found here.

The pull request process is described here

Needs approval from an approver in each of these files:

~~OWNERS~~ [sttts]

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

openshift-bot · 2021-08-04T18:09:20Z

/retest-required

Please review the full test history for this PR and help us cut down flakes.

openshift-ci · 2021-08-04T18:17:07Z

@p0lyn0mial: The following tests failed, say /retest to rerun all failed tests or /retest-required to rerun all mandatory failed tests:

Test name	Commit	Details	Rerun command
ci/prow/e2e-gcp-operator-encryption-perf-single-node	`025d0f7`	link	`/test e2e-gcp-operator-encryption-perf-single-node`
ci/prow/e2e-gcp-operator-encryption-rotation-single-node	`025d0f7`	link	`/test e2e-gcp-operator-encryption-rotation-single-node`
ci/prow/e2e-gcp-operator-encryption-single-node	`025d0f7`	link	`/test e2e-gcp-operator-encryption-single-node`
ci/prow/e2e-aws-single-node	`6f77ec8`	link	`/test e2e-aws-single-node`
ci/prow/e2e-gcp-operator-single-node	`6f77ec8`	link	`/test e2e-gcp-operator-single-node`

openshift-bot · 2021-08-04T18:58:26Z

/retest-required

Please review the full test history for this PR and help us cut down flakes.

openshift-bot · 2021-08-04T19:10:21Z

/retest-required

Please review the full test history for this PR and help us cut down flakes.

openshift-bot · 2021-08-04T20:24:24Z

/retest-required

Please review the full test history for this PR and help us cut down flakes.

sttts · 2021-08-04T20:30:15Z

flakes:

[sig-arch][Early] Managed cluster should start all core operators [Skipped:Disconnected] [Suite:openshift/conformance/parallel]

fail [github.com/onsi/ginkgo@v4.7.0-origin.0+incompatible/internal/leafnodes/runner.go:113]: Aug  4 16:40:38.918: Some cluster operators are not ready: authentication (Degraded=True OAuthServerDeployment_UnavailablePod: OAuthServerDeploymentDegraded: 1 of 3 requested instances are unavailable for oauth-openshift.openshift-authentication ())

openshift-ci · 2021-08-04T21:15:49Z

@p0lyn0mial: All pull requests linked via external trackers have merged:

Bugzilla bug 1985997 has been moved to the MODIFIED state.

In response to this:

Bug 1985997: Enable static pod fallback logic for SNO, with disruptive e2e test

openshift-ci bot added the do-not-merge/work-in-progress label Aug 2, 2021

openshift-ci bot requested review from mfojtik and soltysh August 2, 2021 14:49

p0lyn0mial force-pushed the e2e-sno-disruptive branch 2 times, most recently from b19d13c to 025d0f7 Compare August 3, 2021 10:07

sttts reviewed Aug 3, 2021

p0lyn0mial force-pushed the e2e-sno-disruptive branch from 025d0f7 to 646d10a Compare August 3, 2021 15:13

p0lyn0mial force-pushed the e2e-sno-disruptive branch 2 times, most recently from d9c6e3b to cb79c1d Compare August 4, 2021 08:24

p0lyn0mial changed the title ~~WIP: implements a basic disruptive test on an SNO cluster~~ Bug 1989633: implements a basic disruptive test on an SNO cluster Aug 4, 2021

p0lyn0mial force-pushed the e2e-sno-disruptive branch from cb79c1d to c54d014 Compare August 4, 2021 08:28

sttts reviewed Aug 4, 2021

sttts reviewed Aug 4, 2021

p0lyn0mial force-pushed the e2e-sno-disruptive branch from c54d014 to a2514eb Compare August 4, 2021 09:42

p0lyn0mial mentioned this pull request Aug 4, 2021

Bug 1985997: Enable startup-monitor for SNO #1195

Closed

p0lyn0mial force-pushed the e2e-sno-disruptive branch from 6a6110b to c11afe0 Compare August 4, 2021 12:38

p0lyn0mial and others added 2 commits August 4, 2021 15:22

adds a basic e2e test for testing the fallback scenario

441c39d

Enable startup-monitor for SNO

7a6b7c1

p0lyn0mial force-pushed the e2e-sno-disruptive branch from c11afe0 to 6170104 Compare August 4, 2021 13:23

sttts changed the title ~~Bug 1985997: implements a basic disruptive test on an SNO cluster~~ Bug 1985997: Enable static pod fallback, with disruptive e2e test Aug 4, 2021

openshift-ci bot assigned sttts Aug 4, 2021

openshift-ci bot added the lgtm and approved labels Aug 4, 2021

sttts changed the title ~~Bug 1985997: Enable static pod fallback, with disruptive e2e test~~ Bug 1985997: Enable static pod fallback logic for SNO, with disruptive e2e test Aug 4, 2021

p0lyn0mial added 2 commits August 4, 2021 17:52

pin library-go

14f03c2

bump (library-go)

6f77ec8

p0lyn0mial force-pushed the e2e-sno-disruptive branch from 6170104 to 6f77ec8 Compare August 4, 2021 15:53

openshift-ci bot removed the lgtm label Aug 4, 2021

openshift-ci bot added the lgtm label Aug 4, 2021

openshift-ci bot merged commit 61d60ee into openshift:master Aug 4, 2021
