Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Various e2e test fixes and cleanups #147

Merged
merged 10 commits into from
Mar 30, 2021

Conversation

mtrmac
Copy link
Contributor

@mtrmac mtrmac commented Mar 9, 2021

This is a set of fixes to e2e tests

See individual commit messages for details.

Note primarily the commit “Don't abort metrics/service-ca-metrics before the sync finishes”, where I could see an argument that this should be changed in the code instead of the tests.

@mtrmac
Copy link
Contributor Author

mtrmac commented Mar 9, 2021

As an aside: To determine the need to increase timeouts I have added simple logging of time spent polling in the various larger-timeout cases (around the wait.PollImmediate calls, basically), and that code did appear to work fine (e.g. on failures the reported times match the timeouts, and the further indirectly-caused failures also show plausible time changes).

The curious thing is that with the test runs that do fully succeed, the reported polling times match exactly, down to the nanosecond, across separate test runs on the same cluster. On a rebuilt cluster the times are different from the first cluster, but again all of them match exactly.

Some part of that can probably be explained by low resolution of the macOS clock, still, the consistent timing of cluster’s operation is so spooky that it’s very tempting to think I’m missing something.

@mtrmac
Copy link
Contributor Author

mtrmac commented Mar 9, 2021

/hold

time.Sleep(10 * time.Second)
and the other one should be replaced by polling for the expected changes it seems.

@openshift-ci-robot openshift-ci-robot added the do-not-merge/hold Indicates that a PR should not merge because someone has issued a /hold command. label Mar 9, 2021
@mtrmac
Copy link
Contributor Author

mtrmac commented Mar 12, 2021

/hold cancel

@openshift-ci-robot openshift-ci-robot removed the do-not-merge/hold Indicates that a PR should not merge because someone has issued a /hold command. label Mar 12, 2021
@mtrmac
Copy link
Contributor Author

mtrmac commented Mar 13, 2021

/retest

@mtrmac
Copy link
Contributor Author

mtrmac commented Mar 20, 2021

/retest

1 similar comment
@marun
Copy link
Contributor

marun commented Mar 23, 2021

/retest

Copy link
Contributor

@marun marun left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@stlaz lgtm, feel free to approve.

@@ -461,7 +459,7 @@ func triggerForcedRotation(t *testing.T, client *kubernetes.Clientset, config *r

// Trigger a forced rotation by updating the operator config
// with a reason.
setUnsupportedServiceCAConfig(t, config, "42", customDuration)
forceUnsupportedServiceCAConfigRotation(t, config, secret, customDuration)
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

(No action required) Not sure what this change is buying given this function is only called once in a test run. YAGNI and all.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is the “Fix failures on repeated e2e tests in the same cluster” commit.

The current behavior is perfectly fine for a CI run; when developing the tests, repeatedly running them on a single cluster, this failure is distracting from any real failures or test bugs.

return false, nil
}
if err != nil {
return false, err
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

(No action required) Classic polling mistake - returning an error exits rather than triggering retry as intended. No blame intended - lots of examples of this in this file. Best practice for handling potentially recoverable failure in a polling function is to log the error and not return it.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks, I was treating the server as “stable/available by assumption” and I didn’t realize that’s a bad assumption for HA code.

I’m not going to change this one example as an exception, but if you’d prefer me to change all of them, I can do that.

Copy link
Contributor

@stlaz stlaz left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This PR sums up many sins of the e2e tests in this repo and I'll gladly see it merged.

I only have a single nit about syntax sugar we can use that I'd prefer to be included and I'll bless the PR.

Comment on lines 489 to 492
var forceRotationReason string
i := 0
for {
forceRotationReason = fmt.Sprintf("service-ca-e2e-force-rotation-reason-%d", i)
if currentSigningKeySecret.Annotations[api.ForcedRotationReasonAnnotationName] != forceRotationReason {
break
}
i++
}
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

just a little bit of syntax sugar sprinkles, also limiting the scope of i

var forceRotationReason string
for i := 0; ; i++ {
    forceRotationReason = fmt.Sprintf("service-ca-e2e-force-rotation-reason-%d", i)
    if currentSigningKeySecret.Annotations[api.ForcedRotationReasonAnnotationName] == forceRotationReason {
        break
    }
}

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks, updated.

Comment on lines +529 to +528
if err != nil && errors.IsNotFound(err) {
return false, nil
}
if err != nil {
return false, err
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

(no action required) I usually prefer

if err != nil {
    if errors.IsNotFound(err) {
        return false, nil
    }
    return false, err
}

but I see that that's not the pattern used in this file either 😐

Signed-off-by: Miloslav Trmač <mitr@redhat.com>
Fail immediately if v1.TLSCertKey is missing instead of
later reporting a parse error; remove a pointless initialization.

Also only look up the data for v1.TLSCertKey once.

Signed-off-by: Miloslav Trmač <mitr@redhat.com>
The spec.unsupportedConfigOverrides forceRotation field
only reacts to changes of value, so instead of always writing 42
there, write a value different from the last one.

The "infinite" loop will have at most 2 iterations because
we are comparing the generated name with a single value of
an annotation from a loaded secret. (That's racy against concurrent
runs of the same e2e test suite, but the test suite is already assuming
that it's the only code changing the openshift-service-ca{,-operator}'s
state.)

Signed-off-by: Miloslav Trmač <mitr@redhat.com>
Don't immediately abort when we observe a 0 value of
service_ca_expiry_time_seconds, which is only updated inside
the operator's sync loop.

Keep trying for the full timeout value.

It's tempting to argue that the metric should not be reported as 0
until the value is known, OTOH
https://prometheus.io/docs/practices/instrumentation/#avoid-missing-metrics
says "export a default value such as 0 for any time series you know may
exist in advance."

Signed-off-by: Miloslav Trmač <mitr@redhat.com>
If these tests occasionally fail, it's impossible to diagnose
that failure without capturing the full data (e.g. to tell
whether the data is just empty or corrupt, or whether there
really is an unexpected intermediate state where the
verification fails). So include the full failing data in the
error message.

Originally motivated by a
>     rotate.go:151: Failed to receive output: Get "https://test-service-vyh1o.test-urr50.svc:62133": x509: certificate signed by unknown authority (possibly because of "crypto/rsa: verification error" while trying to verify candidate authority certificate "openshift-service-serving-signer@1615214293")

which I have only seen once; hopefully this will help if it ever reoccurs.

Signed-off-by: Miloslav Trmač <mitr@redhat.com>
The observed time for the service-ca replacing the canary secret
after a forced rotation is commonly 0.5-1.5 minutes, but a time of
a bit over 2 minutes was observed, as well as a timeout at 3 minutes.

Based on the >2-minute observation, allow for 5 minutes.

Signed-off-by: Miloslav Trmač <mitr@redhat.com>
... to account for all the churn when replacing the CA,
instead of just the 10-second timeout for updating a secret
by a non-rotating CA.

Signed-off-by: Miloslav Trmač <mitr@redhat.com>
Let the caller specify secret/ConfigMap keys that should be updated,
so that we can use this to observe changes to other keys as well.

Should not change (test) behavior.

Signed-off-by: Miloslav Trmač <mitr@redhat.com>
... instead of a string specifying an edit indirectly.  This will
allow editing other secret keys for StatefulSet secrets.

Also, rename it to editServingSecretData.

Should not change (test) behavior.

Signed-off-by: Miloslav Trmač <mitr@redhat.com>
... and hoping that the change will happen in 10 seconds.
Also, incindentally, increase the timeout, we've seen
the changes not happening in these 10 seconds.

The poolling could often be faster if the change happens in less
than 10 seconds; we will also get a relevant failure about
nothing changing instead of later complaints about unexpected
format of data.

Signed-off-by: Miloslav Trmač <mitr@redhat.com>
@mtrmac
Copy link
Contributor Author

mtrmac commented Mar 29, 2021

/retest

@stlaz
Copy link
Contributor

stlaz commented Mar 30, 2021

/lgtm
thanks for the fixes \o/

@openshift-ci-robot openshift-ci-robot added the lgtm Indicates that a PR is ready to be merged. label Mar 30, 2021
@openshift-ci-robot
Copy link
Contributor

[APPROVALNOTIFIER] This PR is APPROVED

This pull-request has been approved by: mtrmac, stlaz

The full list of commands accepted by this bot can be found here.

The pull request process is described here

Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

@openshift-ci-robot openshift-ci-robot added the approved Indicates a PR has been approved by an approver from all required OWNERS files. label Mar 30, 2021
@openshift-merge-robot openshift-merge-robot merged commit b6ec77b into openshift:master Mar 30, 2021
@mtrmac mtrmac deleted the fix-e2e-tests branch March 30, 2021 10:42
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
approved Indicates a PR has been approved by an approver from all required OWNERS files. lgtm Indicates that a PR is ready to be merged.
Projects
None yet
Development

Successfully merging this pull request may close these issues.

None yet

5 participants