Various e2e test fixes and cleanups #147

mtrmac · 2021-03-09T15:24:55Z

This is a set of fixes to e2e tests

fixing some bugs (or short timeouts) that show up as flakes (e.g. in https://prow.ci.openshift.org/pr-history/?org=openshift&repo=service-ca-operator&pr=143 )
improving the code to make test development easier
a few minor cleanups.

See individual commit messages for details.

Note primarily the commit “Don't abort metrics/service-ca-metrics before the sync finishes”, where I could see an argument that this should be changed in the code instead of the tests.

mtrmac · 2021-03-09T15:29:13Z

As an aside: To determine the need to increase timeouts I have added simple logging of time spent polling in the various larger-timeout cases (around the wait.PollImmediate calls, basically), and that code did appear to work fine (e.g. on failures the reported times match the timeouts, and the further indirectly-caused failures also show plausible time changes).

The curious thing is that with the test runs that do fully succeed, the reported polling times match exactly, down to the nanosecond, across separate test runs on the same cluster. On a rebuilt cluster the times are different from the first cluster, but again all of them match exactly.

Some part of that can probably be explained by low resolution of the macOS clock, still, the consistent timing of cluster’s operation is so spooky that it’s very tempting to think I’m missing something.

mtrmac · 2021-03-09T23:18:13Z

/hold

service-ca-operator/test/e2e/e2e_test.go

Line 179 in 8e1cd9d

time.Sleep(10 * time.Second)

and the other one should be replaced by polling for the expected changes it seems.

mtrmac · 2021-03-12T17:36:03Z

/hold cancel

mtrmac · 2021-03-13T17:35:37Z

/retest

mtrmac · 2021-03-20T16:06:48Z

/retest

marun · 2021-03-23T01:38:40Z

/retest

marun

@stlaz lgtm, feel free to approve.

marun · 2021-03-23T01:48:29Z

test/e2e/e2e_test.go

@@ -461,7 +459,7 @@ func triggerForcedRotation(t *testing.T, client *kubernetes.Clientset, config *r

 	// Trigger a forced rotation by updating the operator config
 	// with a reason.
-	setUnsupportedServiceCAConfig(t, config, "42", customDuration)
+	forceUnsupportedServiceCAConfigRotation(t, config, secret, customDuration)


(No action required) Not sure what this change is buying given this function is only called once in a test run. YAGNI and all.

This is the “Fix failures on repeated e2e tests in the same cluster” commit.

The current behavior is perfectly fine for a CI run; when developing the tests, repeatedly running them on a single cluster, this failure is distracting from any real failures or test bugs.

marun · 2021-03-23T01:49:42Z

test/e2e/e2e_test.go

+			return false, nil
+		}
+		if err != nil {
+			return false, err


(No action required) Classic polling mistake - returning an error exits rather than triggering retry as intended. No blame intended - lots of examples of this in this file. Best practice for handling potentially recoverable failure in a polling function is to log the error and not return it.

Thanks, I was treating the server as “stable/available by assumption” and I didn’t realize that’s a bad assumption for HA code.

I’m not going to change this one example as an exception, but if you’d prefer me to change all of them, I can do that.

stlaz

This PR sums up many sins of the e2e tests in this repo and I'll gladly see it merged.

I only have a single nit about syntax sugar we can use that I'd prefer to be included and I'll bless the PR.

stlaz · 2021-03-29T08:16:17Z

test/e2e/e2e_test.go

+	var forceRotationReason string
+	i := 0
+	for {
+		forceRotationReason = fmt.Sprintf("service-ca-e2e-force-rotation-reason-%d", i)
+		if currentSigningKeySecret.Annotations[api.ForcedRotationReasonAnnotationName] != forceRotationReason {
+			break
+		}
+		i++
+	}


just a little bit of syntax sugar sprinkles, also limiting the scope of i

var forceRotationReason string for i := 0; ; i++ { forceRotationReason = fmt.Sprintf("service-ca-e2e-force-rotation-reason-%d", i) if currentSigningKeySecret.Annotations[api.ForcedRotationReasonAnnotationName] == forceRotationReason { break } }

Thanks, updated.

stlaz · 2021-03-29T08:30:53Z

test/e2e/e2e_test.go

+		if err != nil && errors.IsNotFound(err) {
+			return false, nil
+		}
+		if err != nil {
+			return false, err


(no action required) I usually prefer

if err != nil { if errors.IsNotFound(err) { return false, nil } return false, err }

but I see that that's not the pattern used in this file either 😐

Signed-off-by: Miloslav Trmač <mitr@redhat.com>

Fail immediately if v1.TLSCertKey is missing instead of later reporting a parse error; remove a pointless initialization. Also only look up the data for v1.TLSCertKey once. Signed-off-by: Miloslav Trmač <mitr@redhat.com>

The spec.unsupportedConfigOverrides forceRotation field only reacts to changes of value, so instead of always writing 42 there, write a value different from the last one. The "infinite" loop will have at most 2 iterations because we are comparing the generated name with a single value of an annotation from a loaded secret. (That's racy against concurrent runs of the same e2e test suite, but the test suite is already assuming that it's the only code changing the openshift-service-ca{,-operator}'s state.) Signed-off-by: Miloslav Trmač <mitr@redhat.com>

Don't immediately abort when we observe a 0 value of service_ca_expiry_time_seconds, which is only updated inside the operator's sync loop. Keep trying for the full timeout value. It's tempting to argue that the metric should not be reported as 0 until the value is known, OTOH https://prometheus.io/docs/practices/instrumentation/#avoid-missing-metrics says "export a default value such as 0 for any time series you know may exist in advance." Signed-off-by: Miloslav Trmač <mitr@redhat.com>

If these tests occasionally fail, it's impossible to diagnose that failure without capturing the full data (e.g. to tell whether the data is just empty or corrupt, or whether there really is an unexpected intermediate state where the verification fails). So include the full failing data in the error message. Originally motivated by a > rotate.go:151: Failed to receive output: Get "https://test-service-vyh1o.test-urr50.svc:62133": x509: certificate signed by unknown authority (possibly because of "crypto/rsa: verification error" while trying to verify candidate authority certificate "openshift-service-serving-signer@1615214293") which I have only seen once; hopefully this will help if it ever reoccurs. Signed-off-by: Miloslav Trmač <mitr@redhat.com>

The observed time for the service-ca replacing the canary secret after a forced rotation is commonly 0.5-1.5 minutes, but a time of a bit over 2 minutes was observed, as well as a timeout at 3 minutes. Based on the >2-minute observation, allow for 5 minutes. Signed-off-by: Miloslav Trmač <mitr@redhat.com>

... to account for all the churn when replacing the CA, instead of just the 10-second timeout for updating a secret by a non-rotating CA. Signed-off-by: Miloslav Trmač <mitr@redhat.com>

Let the caller specify secret/ConfigMap keys that should be updated, so that we can use this to observe changes to other keys as well. Should not change (test) behavior. Signed-off-by: Miloslav Trmač <mitr@redhat.com>

... instead of a string specifying an edit indirectly. This will allow editing other secret keys for StatefulSet secrets. Also, rename it to editServingSecretData. Should not change (test) behavior. Signed-off-by: Miloslav Trmač <mitr@redhat.com>

... and hoping that the change will happen in 10 seconds. Also, incindentally, increase the timeout, we've seen the changes not happening in these 10 seconds. The poolling could often be faster if the change happens in less than 10 seconds; we will also get a relevant failure about nothing changing instead of later complaints about unexpected format of data. Signed-off-by: Miloslav Trmač <mitr@redhat.com>

mtrmac · 2021-03-29T21:06:34Z

/retest

stlaz · 2021-03-30T09:57:41Z

/lgtm
thanks for the fixes \o/

openshift-ci-robot · 2021-03-30T09:57:59Z

[APPROVALNOTIFIER] This PR is APPROVED

This pull-request has been approved by: mtrmac, stlaz

The full list of commands accepted by this bot can be found here.

The pull request process is described here

Needs approval from an approver in each of these files:

~~OWNERS~~ [stlaz]

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

openshift-ci-robot requested review from marun and stlaz March 9, 2021 15:25

mtrmac force-pushed the fix-e2e-tests branch from 6ee5d05 to dcc27bf Compare March 9, 2021 17:30

openshift-ci-robot added the do-not-merge/hold Indicates that a PR should not merge because someone has issued a /hold command. label Mar 9, 2021

openshift-ci-robot removed the do-not-merge/hold Indicates that a PR should not merge because someone has issued a /hold command. label Mar 12, 2021

mtrmac mentioned this pull request Mar 13, 2021

WIP: Generate per-pod certificates for StatefulSets #144

Closed

marun reviewed Mar 23, 2021

View reviewed changes

stlaz reviewed Mar 29, 2021

View reviewed changes

mtrmac added 10 commits March 29, 2021 18:53

Remove an incorrect comment

81a6b97

Signed-off-by: Miloslav Trmač <mitr@redhat.com>

Fix checkServiceServingCertSecretData

c2ba8c0

Fail immediately if v1.TLSCertKey is missing instead of later reporting a parse error; remove a pointless initialization. Also only look up the data for v1.TLSCertKey once. Signed-off-by: Miloslav Trmač <mitr@redhat.com>

TestE2E/refresh-ca: use the same CA rotation timeout as elsewhere

6fd3b49

... to account for all the churn when replacing the CA, instead of just the 10-second timeout for updating a secret by a non-rotating CA. Signed-off-by: Miloslav Trmač <mitr@redhat.com>

Generalize pollForSecretChange and pollForConfigMapChange

13ae26b

Let the caller specify secret/ConfigMap keys that should be updated, so that we can use this to observe changes to other keys as well. Should not change (test) behavior. Signed-off-by: Miloslav Trmač <mitr@redhat.com>

mtrmac force-pushed the fix-e2e-tests branch from 768c860 to 8fcb847 Compare March 29, 2021 17:01

openshift-ci-robot assigned stlaz Mar 30, 2021

openshift-ci-robot added the lgtm Indicates that a PR is ready to be merged. label Mar 30, 2021

openshift-ci-robot added the approved Indicates a PR has been approved by an approver from all required OWNERS files. label Mar 30, 2021

openshift-merge-robot merged commit b6ec77b into openshift:master Mar 30, 2021

mtrmac deleted the fix-e2e-tests branch March 30, 2021 10:42

mtrmac mentioned this pull request Sep 10, 2021

[release-4.7] Bug 1966148: Report the operator status as always upgradeable in serviceCA/cluster #160

Closed

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Various e2e test fixes and cleanups #147

Various e2e test fixes and cleanups #147

mtrmac commented Mar 9, 2021

mtrmac commented Mar 9, 2021

mtrmac commented Mar 9, 2021

mtrmac commented Mar 12, 2021

mtrmac commented Mar 13, 2021

mtrmac commented Mar 20, 2021

marun commented Mar 23, 2021

marun left a comment

marun Mar 23, 2021

mtrmac Mar 23, 2021

marun Mar 23, 2021

mtrmac Mar 23, 2021

stlaz left a comment

stlaz Mar 29, 2021

mtrmac Mar 29, 2021

stlaz Mar 29, 2021

mtrmac commented Mar 29, 2021

stlaz commented Mar 30, 2021

openshift-ci-robot commented Mar 30, 2021

Various e2e test fixes and cleanups #147

Various e2e test fixes and cleanups #147

Conversation

mtrmac commented Mar 9, 2021

mtrmac commented Mar 9, 2021

mtrmac commented Mar 9, 2021

mtrmac commented Mar 12, 2021

mtrmac commented Mar 13, 2021

mtrmac commented Mar 20, 2021

marun commented Mar 23, 2021

marun left a comment

Choose a reason for hiding this comment

marun Mar 23, 2021

Choose a reason for hiding this comment

mtrmac Mar 23, 2021

Choose a reason for hiding this comment

marun Mar 23, 2021

Choose a reason for hiding this comment

mtrmac Mar 23, 2021

Choose a reason for hiding this comment

stlaz left a comment

Choose a reason for hiding this comment

stlaz Mar 29, 2021

Choose a reason for hiding this comment

mtrmac Mar 29, 2021

Choose a reason for hiding this comment

stlaz Mar 29, 2021

Choose a reason for hiding this comment

mtrmac commented Mar 29, 2021

stlaz commented Mar 30, 2021

openshift-ci-robot commented Mar 30, 2021