New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Bug 2027929: test/e2e/upgrade/adminack: Poll gates for duration of update #26656
Bug 2027929: test/e2e/upgrade/adminack: Poll gates for duration of update #26656
Conversation
@wking: This pull request references Bugzilla bug 2027929, which is valid. The bug has been moved to the POST state. The bug has been updated to refer to the pull request using the external bug tracker. 6 validation(s) were run on this bug
In response to this:
Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository. |
/hold
|
Poking in [1] for [2]: $ curl -s https://gcsweb-ci.apps.ci.l2s4.p1.openshiftapps.com/gcs/origin-ci-test/logs/release-openshift-origin-installer-e2e-aws-upgrade-4.6-to-4.7-to-4.8-to-4.9-ci/1463537431096594432/artifacts/e2e-aws-upgrade/clusterversion.json | jq -r '.items[].status.conditions[] | .lastTransitionTime + " " + .type + "=" + .status + " " + (.reason // "-") + ": " + (.message // "-")' 2021-11-24T16:06:34Z RetrievedUpdates=False NoChannel: The update channel has not been configured. 2021-11-24T16:30:08Z Available=True -: Done applying 4.8.0-0.nightly-2021-11-24-020113 2021-11-24T19:32:46Z Failing=False -: - 2021-11-24T19:03:45Z Progressing=True ClusterOperatorUpdating: Working towards 4.9.0-0.ci-2021-11-24-092816: 205 of 738 done (27% complete), waiting on openshift-apiserver 2021-11-24T19:04:01Z Upgradeable=False AdminAckRequired: Kubernetes 1.22 and therefore OpenShift 4.9 remove several APIs which require admin consideration. Please see the knowledge article https://access.redhat.com/articles/6329921 for details and instructions. With: $ curl -s https://gcsweb-ci.apps.ci.l2s4.p1.openshiftapps.com/gcs/origin-ci-test/logs/release-openshift-origin-installer-e2e-aws-upgrade-4.6-to-4.7-to-4.8-to-4.9-ci/1463537431096594432/artifacts/e2e-aws-upgrade/e2e.log | grep -i 'admin.ack\|Starting upgrade to version' STEP: Building a namespace api object, basename check-for-admin-acks STEP: Setting up admin ack test Nov 24 16:30:38.400: INFO: Admin ack test setup complete Nov 24 16:30:38.477: INFO: Skipping admin ack test. Admin ack is not in this baseline or contains no gates. Nov 24 16:30:59.853: INFO: Starting upgrade to version= image=quay.io/openshift-release-dev/ocp-release:4.7.38-x86_64 attempt=bf6292d3-ef4a-4197-b709-fbae573add46 Nov 24 17:36:50.997: INFO: Starting upgrade to version= image=registry.ci.openshift.org/ocp/release:4.8.0-0.nightly-2021-11-24-020113 attempt=07151322-099a-479f-864f-ef7945aebcd0 Nov 24 19:03:32.205: INFO: Starting upgrade to version= image=registry.ci.openshift.org/ocp/release:4.9.0-0.ci-2021-11-24-092816 attempt=f4750aac-11f3-4d12-a5c6-6860a376bf3e So the admin-ack test-case ran as a one-shot at the beginning of the test, but there was no gate to ack for that 4.6 release. Then the cluster updated through 4.7 to 4.8, where it picked up the ack-4.8-kube-1.22-api-removals-in-4.9 gate. The old, one-shot admin-ack test case was long gone by this time, so the gate remained un-acked, and the requested update to 4.9 hung on the failing Upgradeable=False precondition. With this commit, the non-update admin-ack test-case remains one-shot, but the update admin-ack test-case moves to polling every 10 minutes, and wiggling any new gates it sees that are applicable to the current release (still based on the latest completed history entry). So now the 4.6 -> ... -> 4.9 job will, at some poll after completing 4.8, notice and set the new ack-4.8-kube-1.22-api-removals-in-4.9 gate, and unblock the update to 4.9. [1]: https://prow.ci.openshift.org/view/gs/origin-ci-test/logs/release-openshift-origin-installer-e2e-aws-upgrade-4.6-to-4.7-to-4.8-to-4.9-ci/1463537431096594432 [2]: https://bugzilla.redhat.com/show_bug.cgi?id=2026806
… calls There's no need for the formatting Failf call when the only argument is a string that needs no formatting substitution. Replace: Failf(string) with: Fail(string) to be slightly more efficient and also avoid issues is the string being fed in contains %s or other format-sensitive characters. There's also no need to use Sprintf to build a Failf argument, since Failf can perform its own formating. Replace: Failf(fmt.Sprintf(template, a, b, ...)) with: Failf(template, a, b, ...). There's no framework.Log, so I can't apply the same simplification to framework.Logf. But because our existing Logf(string) calls take a local string literal, there's less risk of a %s or other format-sensitive characters sneaking into the argument.
Instead of returning error strings. This allows us to be slightly more structured, for example, we can now use %w to wrap errors, to set up for Unwrap [1]. But mostly it just feels more idiomatic to see 'err != nil' instead of 'len(errMsg) != 0' when reading the code. [1]: https://pkg.go.dev/fmt#Errorf
…adeable Since we began polling in 4d1a25a (test/e2e/upgrade/adminack: Poll gates for duration of update, 2021-11-29, openshift#26649), there's an increased chance that this test is poking at acks while something else is going on in Upgradeable in parallel. For example [1]: Dec 1 21:01:44.722: FAIL: Error while waiting for Upgradeable to go true, err=timed out waiting for the condition Upgradeable: Status=False, Reason=KubeletMinorVersion_KubeletMinorVersionUnsupportedNextUpgrade, Message="Cluster operator kube-apiserver should not be upgraded between minor versions: KubeletMinorVersionUpgradeable: Kubelet minor versions on 6 nodes will not be supported in the next OpenShift minor version upgrade.". With the incoming logic, this test-case would say "great, nothing in Upgradeable complaining about any acks", without getting hung up on the fact that Upgradeable is complaining about something else (in this case, Kubelet skew, which will shortly resolve once the machine-config gets around to updating nodes and catches them up with the API-server). [1]: https://prow.ci.openshift.org/view/gs/origin-ci-test/logs/release-openshift-origin-installer-e2e-aws-upgrade-4.7-to-4.8-to-4.9-to-4.10-ci/1466135844220833792#1:build-log.txt%3A866
I've pushed 83d4858 -> ac8b790, rebasing on 4.9's current tip and picking up #26661 and #26668. The current mapping from 4.10 -> 4.9 commits is:
No significant manual work, although there were some minor conflicts to resolve, because 4.9 lacks #26650. |
Update job failed, but the polling logic looks appropriate for a 4.9->4.9 update: $ curl -s https://storage.googleapis.com/origin-ci-test/pr-logs/pull/26656/pull-ci-openshift-origin-release-4.9-e2e-gcp-upgrade/1467909644998414336/build-log.txt | grep -i admin.ack
configmap/admin-acks patched
STEP: Building a namespace api object, basename check-for-admin-acks
STEP: Setting up admin ack test
Dec 6 18:12:36.619: INFO: Admin ack test setup complete
Dec 6 18:12:36.688: INFO: Skipping admin ack test. Admin ack is not in this baseline or contains no gates.
Dec 6 18:22:36.728: INFO: Skipping admin ack test. Admin ack is not in this baseline or contains no gates.
Dec 6 18:32:36.727: INFO: Skipping admin ack test. Admin ack is not in this baseline or contains no gates.
Dec 6 18:42:36.732: INFO: Skipping admin ack test. Admin ack is not in this baseline or contains no gates.
Dec 6 18:52:36.726: INFO: Skipping admin ack test. Admin ack is not in this baseline or contains no gates.
Dec 6 19:02:36.729: INFO: Skipping admin ack test. Admin ack is not in this baseline or contains no gates.
Dec 06 18:12:34.000 I ns/openshift-kube-controller-manager pod/kube-controller-manager-ci-op-4h59pd99-0f4e4-rs4n2-master-0 reason/CreatedSCCRanges created SCC ranges for e2e-check-for-admin-acks-5233 namespace |
@wking: The following tests failed, say
Full PR test history. Your PR dashboard. Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository. I understand the commands that are listed here. |
/lgtm |
there's still one nit with the 4.10 test-case, but I think we can punt that one to follow-up work and take this PR back as it stands. /hold cancel |
/approve |
[APPROVALNOTIFIER] This PR is APPROVED This pull-request has been approved by: bparees, jottofar, wking The full list of commands accepted by this bot can be found here. The pull request process is described here
Needs approval from an approver in each of these files:
Approvers can indicate their approval by writing |
/retest-required Please review the full test history for this PR and help us cut down flakes. |
2 similar comments
/retest-required Please review the full test history for this PR and help us cut down flakes. |
/retest-required Please review the full test history for this PR and help us cut down flakes. |
@wking: All pull requests linked via external trackers have merged: Bugzilla bug 2027929 has been moved to the MODIFIED state. In response to this:
Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository. |
/cherrypick release-4.8 |
@wking: new pull request created: #26704 In response to this:
Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository. |
Picking #26649 back to 4.9, but manually because I don't have #26650 in 4.9.