-
Notifications
You must be signed in to change notification settings - Fork 66
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Various e2e test fixes and cleanups #147
Various e2e test fixes and cleanups #147
Conversation
As an aside: To determine the need to increase timeouts I have added simple logging of time spent polling in the various larger-timeout cases (around the The curious thing is that with the test runs that do fully succeed, the reported polling times match exactly, down to the nanosecond, across separate test runs on the same cluster. On a rebuilt cluster the times are different from the first cluster, but again all of them match exactly. Some part of that can probably be explained by low resolution of the macOS clock, still, the consistent timing of cluster’s operation is so spooky that it’s very tempting to think I’m missing something. |
/hold service-ca-operator/test/e2e/e2e_test.go Line 179 in 8e1cd9d
|
/hold cancel |
/retest |
/retest |
1 similar comment
/retest |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@stlaz lgtm, feel free to approve.
@@ -461,7 +459,7 @@ func triggerForcedRotation(t *testing.T, client *kubernetes.Clientset, config *r | |||
|
|||
// Trigger a forced rotation by updating the operator config | |||
// with a reason. | |||
setUnsupportedServiceCAConfig(t, config, "42", customDuration) | |||
forceUnsupportedServiceCAConfigRotation(t, config, secret, customDuration) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
(No action required) Not sure what this change is buying given this function is only called once in a test run. YAGNI and all.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This is the “Fix failures on repeated e2e tests in the same cluster” commit.
The current behavior is perfectly fine for a CI run; when developing the tests, repeatedly running them on a single cluster, this failure is distracting from any real failures or test bugs.
return false, nil | ||
} | ||
if err != nil { | ||
return false, err |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
(No action required) Classic polling mistake - returning an error exits rather than triggering retry as intended. No blame intended - lots of examples of this in this file. Best practice for handling potentially recoverable failure in a polling function is to log the error and not return it.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thanks, I was treating the server as “stable/available by assumption” and I didn’t realize that’s a bad assumption for HA code.
I’m not going to change this one example as an exception, but if you’d prefer me to change all of them, I can do that.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This PR sums up many sins of the e2e tests in this repo and I'll gladly see it merged.
I only have a single nit about syntax sugar we can use that I'd prefer to be included and I'll bless the PR.
test/e2e/e2e_test.go
Outdated
var forceRotationReason string | ||
i := 0 | ||
for { | ||
forceRotationReason = fmt.Sprintf("service-ca-e2e-force-rotation-reason-%d", i) | ||
if currentSigningKeySecret.Annotations[api.ForcedRotationReasonAnnotationName] != forceRotationReason { | ||
break | ||
} | ||
i++ | ||
} |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
just a little bit of syntax sugar sprinkles, also limiting the scope of i
var forceRotationReason string
for i := 0; ; i++ {
forceRotationReason = fmt.Sprintf("service-ca-e2e-force-rotation-reason-%d", i)
if currentSigningKeySecret.Annotations[api.ForcedRotationReasonAnnotationName] == forceRotationReason {
break
}
}
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thanks, updated.
if err != nil && errors.IsNotFound(err) { | ||
return false, nil | ||
} | ||
if err != nil { | ||
return false, err |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
(no action required) I usually prefer
if err != nil {
if errors.IsNotFound(err) {
return false, nil
}
return false, err
}
but I see that that's not the pattern used in this file either 😐
Signed-off-by: Miloslav Trmač <mitr@redhat.com>
Fail immediately if v1.TLSCertKey is missing instead of later reporting a parse error; remove a pointless initialization. Also only look up the data for v1.TLSCertKey once. Signed-off-by: Miloslav Trmač <mitr@redhat.com>
The spec.unsupportedConfigOverrides forceRotation field only reacts to changes of value, so instead of always writing 42 there, write a value different from the last one. The "infinite" loop will have at most 2 iterations because we are comparing the generated name with a single value of an annotation from a loaded secret. (That's racy against concurrent runs of the same e2e test suite, but the test suite is already assuming that it's the only code changing the openshift-service-ca{,-operator}'s state.) Signed-off-by: Miloslav Trmač <mitr@redhat.com>
Don't immediately abort when we observe a 0 value of service_ca_expiry_time_seconds, which is only updated inside the operator's sync loop. Keep trying for the full timeout value. It's tempting to argue that the metric should not be reported as 0 until the value is known, OTOH https://prometheus.io/docs/practices/instrumentation/#avoid-missing-metrics says "export a default value such as 0 for any time series you know may exist in advance." Signed-off-by: Miloslav Trmač <mitr@redhat.com>
If these tests occasionally fail, it's impossible to diagnose that failure without capturing the full data (e.g. to tell whether the data is just empty or corrupt, or whether there really is an unexpected intermediate state where the verification fails). So include the full failing data in the error message. Originally motivated by a > rotate.go:151: Failed to receive output: Get "https://test-service-vyh1o.test-urr50.svc:62133": x509: certificate signed by unknown authority (possibly because of "crypto/rsa: verification error" while trying to verify candidate authority certificate "openshift-service-serving-signer@1615214293") which I have only seen once; hopefully this will help if it ever reoccurs. Signed-off-by: Miloslav Trmač <mitr@redhat.com>
The observed time for the service-ca replacing the canary secret after a forced rotation is commonly 0.5-1.5 minutes, but a time of a bit over 2 minutes was observed, as well as a timeout at 3 minutes. Based on the >2-minute observation, allow for 5 minutes. Signed-off-by: Miloslav Trmač <mitr@redhat.com>
... to account for all the churn when replacing the CA, instead of just the 10-second timeout for updating a secret by a non-rotating CA. Signed-off-by: Miloslav Trmač <mitr@redhat.com>
Let the caller specify secret/ConfigMap keys that should be updated, so that we can use this to observe changes to other keys as well. Should not change (test) behavior. Signed-off-by: Miloslav Trmač <mitr@redhat.com>
... instead of a string specifying an edit indirectly. This will allow editing other secret keys for StatefulSet secrets. Also, rename it to editServingSecretData. Should not change (test) behavior. Signed-off-by: Miloslav Trmač <mitr@redhat.com>
... and hoping that the change will happen in 10 seconds. Also, incindentally, increase the timeout, we've seen the changes not happening in these 10 seconds. The poolling could often be faster if the change happens in less than 10 seconds; we will also get a relevant failure about nothing changing instead of later complaints about unexpected format of data. Signed-off-by: Miloslav Trmač <mitr@redhat.com>
/retest |
/lgtm |
[APPROVALNOTIFIER] This PR is APPROVED This pull-request has been approved by: mtrmac, stlaz The full list of commands accepted by this bot can be found here. The pull request process is described here
Needs approval from an approver in each of these files:
Approvers can indicate their approval by writing |
This is a set of fixes to e2e tests
See individual commit messages for details.
Note primarily the commit “Don't abort metrics/service-ca-metrics before the sync finishes”, where I could see an argument that this should be changed in the code instead of the tests.