Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Bug 2027929: test/e2e/upgrade/adminack: Poll gates for duration of update #26656

Merged
merged 5 commits into from Dec 8, 2021

Conversation

wking
Copy link
Member

@wking wking commented Dec 1, 2021

Picking #26649 back to 4.9, but manually because I don't have #26650 in 4.9.

@openshift-ci openshift-ci bot added the bugzilla/severity-low Referenced Bugzilla bug's severity is low for the branch this PR is targeting. label Dec 1, 2021
@openshift-ci
Copy link
Contributor

openshift-ci bot commented Dec 1, 2021

@wking: This pull request references Bugzilla bug 2027929, which is valid. The bug has been moved to the POST state. The bug has been updated to refer to the pull request using the external bug tracker.

6 validation(s) were run on this bug
  • bug is open, matching expected state (open)
  • bug target release (4.9.z) matches configured target release for branch (4.9.z)
  • bug is in the state NEW, which is one of the valid states (NEW, ASSIGNED, ON_DEV, POST, POST)
  • dependent bug Bugzilla bug 2026806 is in the state CLOSED (CURRENTRELEASE), which is one of the valid states (VERIFIED, RELEASE_PENDING, CLOSED (ERRATA), CLOSED (CURRENTRELEASE))
  • dependent Bugzilla bug 2026806 targets the "4.10.0" release, which is one of the valid target releases: 4.10.0
  • bug has dependents

In response to this:

Bug 2027929: test/e2e/upgrade/adminack: Poll gates for duration of update

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.

@openshift-ci openshift-ci bot added the bugzilla/valid-bug Indicates that a referenced Bugzilla bug is valid for the branch this PR is targeting. label Dec 1, 2021
@openshift-ci openshift-ci bot requested review from deads2k and dmage December 1, 2021 04:50
@wking
Copy link
Member Author

wking commented Dec 2, 2021

/hold

master/4.10 will also need to drop the current skip.

@openshift-ci openshift-ci bot added the do-not-merge/hold Indicates that a PR should not merge because someone has issued a /hold command. label Dec 2, 2021
wking and others added 5 commits December 6, 2021 09:14
Poking in [1] for [2]:

  $ curl -s https://gcsweb-ci.apps.ci.l2s4.p1.openshiftapps.com/gcs/origin-ci-test/logs/release-openshift-origin-installer-e2e-aws-upgrade-4.6-to-4.7-to-4.8-to-4.9-ci/1463537431096594432/artifacts/e2e-aws-upgrade/clusterversion.json | jq -r '.items[].status.conditions[] | .lastTransitionTime + " " + .type + "=" + .status + " " + (.reason // "-") + ": " + (.message // "-")'
  2021-11-24T16:06:34Z RetrievedUpdates=False NoChannel: The update channel has not been configured.
  2021-11-24T16:30:08Z Available=True -: Done applying 4.8.0-0.nightly-2021-11-24-020113
  2021-11-24T19:32:46Z Failing=False -: -
  2021-11-24T19:03:45Z Progressing=True ClusterOperatorUpdating: Working towards 4.9.0-0.ci-2021-11-24-092816: 205 of 738 done (27% complete), waiting on openshift-apiserver
  2021-11-24T19:04:01Z Upgradeable=False AdminAckRequired: Kubernetes 1.22 and therefore OpenShift 4.9 remove several APIs which require admin consideration. Please see the knowledge article https://access.redhat.com/articles/6329921 for details and instructions.

With:

  $ curl -s https://gcsweb-ci.apps.ci.l2s4.p1.openshiftapps.com/gcs/origin-ci-test/logs/release-openshift-origin-installer-e2e-aws-upgrade-4.6-to-4.7-to-4.8-to-4.9-ci/1463537431096594432/artifacts/e2e-aws-upgrade/e2e.log | grep -i 'admin.ack\|Starting upgrade to version'
  STEP: Building a namespace api object, basename check-for-admin-acks
  STEP: Setting up admin ack test
  Nov 24 16:30:38.400: INFO: Admin ack test setup complete
  Nov 24 16:30:38.477: INFO: Skipping admin ack test. Admin ack is not in this baseline or contains no gates.
  Nov 24 16:30:59.853: INFO: Starting upgrade to version= image=quay.io/openshift-release-dev/ocp-release:4.7.38-x86_64 attempt=bf6292d3-ef4a-4197-b709-fbae573add46
  Nov 24 17:36:50.997: INFO: Starting upgrade to version= image=registry.ci.openshift.org/ocp/release:4.8.0-0.nightly-2021-11-24-020113 attempt=07151322-099a-479f-864f-ef7945aebcd0
  Nov 24 19:03:32.205: INFO: Starting upgrade to version= image=registry.ci.openshift.org/ocp/release:4.9.0-0.ci-2021-11-24-092816 attempt=f4750aac-11f3-4d12-a5c6-6860a376bf3e

So the admin-ack test-case ran as a one-shot at the beginning of the
test, but there was no gate to ack for that 4.6 release.  Then the
cluster updated through 4.7 to 4.8, where it picked up the
ack-4.8-kube-1.22-api-removals-in-4.9 gate.  The old, one-shot
admin-ack test case was long gone by this time, so the gate remained
un-acked, and the requested update to 4.9 hung on the failing
Upgradeable=False precondition.

With this commit, the non-update admin-ack test-case remains one-shot,
but the update admin-ack test-case moves to polling every 10 minutes,
and wiggling any new gates it sees that are applicable to the current
release (still based on the latest completed history entry).  So now
the 4.6 -> ... -> 4.9 job will, at some poll after completing 4.8,
notice and set the new ack-4.8-kube-1.22-api-removals-in-4.9 gate, and
unblock the update to 4.9.

[1]: https://prow.ci.openshift.org/view/gs/origin-ci-test/logs/release-openshift-origin-installer-e2e-aws-upgrade-4.6-to-4.7-to-4.8-to-4.9-ci/1463537431096594432
[2]: https://bugzilla.redhat.com/show_bug.cgi?id=2026806
… calls

There's no need for the formatting Failf call when the only argument
is a string that needs no formatting substitution.  Replace:

  Failf(string)

with:

  Fail(string)

to be slightly more efficient and also avoid issues is the string
being fed in contains %s or other format-sensitive characters.

There's also no need to use Sprintf to build a Failf argument, since
Failf can perform its own formating.  Replace:

  Failf(fmt.Sprintf(template, a, b, ...))

with:

  Failf(template, a, b, ...).

There's no framework.Log, so I can't apply the same simplification to
framework.Logf.  But because our existing Logf(string) calls take a
local string literal, there's less risk of a %s or other
format-sensitive characters sneaking into the argument.
Instead of returning error strings.  This allows us to be slightly
more structured, for example, we can now use %w to wrap errors, to set
up for Unwrap [1].  But mostly it just feels more idiomatic to see
'err != nil' instead of 'len(errMsg) != 0' when reading the code.

[1]: https://pkg.go.dev/fmt#Errorf
…adeable

Since we began polling in 4d1a25a (test/e2e/upgrade/adminack: Poll
gates for duration of update, 2021-11-29, openshift#26649), there's an
increased chance that this test is poking at acks while something else
is going on in Upgradeable in parallel.  For example [1]:

  Dec  1 21:01:44.722: FAIL: Error while waiting for Upgradeable to go true, err=timed out waiting for the condition Upgradeable: Status=False, Reason=KubeletMinorVersion_KubeletMinorVersionUnsupportedNextUpgrade, Message="Cluster operator kube-apiserver should not be upgraded between minor versions: KubeletMinorVersionUpgradeable: Kubelet minor versions on 6 nodes will not be supported in the next OpenShift minor version upgrade.".

With the incoming logic, this test-case would say "great, nothing in
Upgradeable complaining about any acks", without getting hung up on
the fact that Upgradeable is complaining about something else (in this
case, Kubelet skew, which will shortly resolve once the machine-config
gets around to updating nodes and catches them up with the
API-server).

[1]: https://prow.ci.openshift.org/view/gs/origin-ci-test/logs/release-openshift-origin-installer-e2e-aws-upgrade-4.7-to-4.8-to-4.9-to-4.10-ci/1466135844220833792#1:build-log.txt%3A866
@wking
Copy link
Member Author

wking commented Dec 6, 2021

I've pushed 83d4858 -> ac8b790, rebasing on 4.9's current tip and picking up #26661 and #26668. The current mapping from 4.10 -> 4.9 commits is:

  • 05374fd -> ac8b790 Bug 2026806: clusterversionoperator/adminack.go: Check for nil cm map
  • 25d7ccd -> 415403d test/extended/util/openshift/clusterversionoperator: Drop waitForUpgradeable
  • e5d0eae -> b186659 test/extended/util/openshift/clusterversionoperator: Return errors
  • 32c8f30 -> 8246a6a test/extended/util/openshift/clusterversionoperator: Simplify Fail(f) calls
  • 4d1a25a -> 5095fb3 test/e2e/upgrade/adminack: Poll gates for duration of update

No significant manual work, although there were some minor conflicts to resolve, because 4.9 lacks #26650.

@wking
Copy link
Member Author

wking commented Dec 6, 2021

Update job failed, but the polling logic looks appropriate for a 4.9->4.9 update:

$ curl -s https://storage.googleapis.com/origin-ci-test/pr-logs/pull/26656/pull-ci-openshift-origin-release-4.9-e2e-gcp-upgrade/1467909644998414336/build-log.txt | grep -i admin.ack
configmap/admin-acks patched
STEP: Building a namespace api object, basename check-for-admin-acks
STEP: Setting up admin ack test
Dec  6 18:12:36.619: INFO: Admin ack test setup complete
Dec  6 18:12:36.688: INFO: Skipping admin ack test. Admin ack is not in this baseline or contains no gates.
Dec  6 18:22:36.728: INFO: Skipping admin ack test. Admin ack is not in this baseline or contains no gates.
Dec  6 18:32:36.727: INFO: Skipping admin ack test. Admin ack is not in this baseline or contains no gates.
Dec  6 18:42:36.732: INFO: Skipping admin ack test. Admin ack is not in this baseline or contains no gates.
Dec  6 18:52:36.726: INFO: Skipping admin ack test. Admin ack is not in this baseline or contains no gates.
Dec  6 19:02:36.729: INFO: Skipping admin ack test. Admin ack is not in this baseline or contains no gates.
Dec 06 18:12:34.000 I ns/openshift-kube-controller-manager pod/kube-controller-manager-ci-op-4h59pd99-0f4e4-rs4n2-master-0 reason/CreatedSCCRanges created SCC ranges for e2e-check-for-admin-acks-5233 namespace

@openshift-ci
Copy link
Contributor

openshift-ci bot commented Dec 6, 2021

@wking: The following tests failed, say /retest to rerun all failed tests or /retest-required to rerun all mandatory failed tests:

Test name Commit Details Required Rerun command
ci/prow/e2e-gcp-csi ac8b790 link false /test e2e-gcp-csi
ci/prow/e2e-agnostic-cmd ac8b790 link false /test e2e-agnostic-cmd

Full PR test history. Your PR dashboard.

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository. I understand the commands that are listed here.

@jottofar
Copy link
Contributor

jottofar commented Dec 7, 2021

/lgtm

@openshift-ci openshift-ci bot added the lgtm Indicates that a PR is ready to be merged. label Dec 7, 2021
@wking
Copy link
Member Author

wking commented Dec 7, 2021

there's still one nit with the 4.10 test-case, but I think we can punt that one to follow-up work and take this PR back as it stands.

/hold cancel

@openshift-ci openshift-ci bot removed the do-not-merge/hold Indicates that a PR should not merge because someone has issued a /hold command. label Dec 7, 2021
@bparees
Copy link
Contributor

bparees commented Dec 7, 2021

/approve

@openshift-ci
Copy link
Contributor

openshift-ci bot commented Dec 7, 2021

[APPROVALNOTIFIER] This PR is APPROVED

This pull-request has been approved by: bparees, jottofar, wking

The full list of commands accepted by this bot can be found here.

The pull request process is described here

Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

@openshift-ci openshift-ci bot added the approved Indicates a PR has been approved by an approver from all required OWNERS files. label Dec 7, 2021
@bparees bparees added backport-risk-assessed Indicates a PR to a release branch has been evaluated and considered safe to accept. cherry-pick-approved Indicates a cherry-pick PR into a release branch has been approved by the release branch manager. labels Dec 7, 2021
@openshift-bot
Copy link
Contributor

/retest-required

Please review the full test history for this PR and help us cut down flakes.

2 similar comments
@openshift-bot
Copy link
Contributor

/retest-required

Please review the full test history for this PR and help us cut down flakes.

@openshift-bot
Copy link
Contributor

/retest-required

Please review the full test history for this PR and help us cut down flakes.

@openshift-merge-robot openshift-merge-robot merged commit f5f05db into openshift:release-4.9 Dec 8, 2021
@openshift-ci
Copy link
Contributor

openshift-ci bot commented Dec 8, 2021

@wking: All pull requests linked via external trackers have merged:

Bugzilla bug 2027929 has been moved to the MODIFIED state.

In response to this:

Bug 2027929: test/e2e/upgrade/adminack: Poll gates for duration of update

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.

@wking wking deleted the poll-admin-ack branch December 8, 2021 08:16
@wking
Copy link
Member Author

wking commented Dec 16, 2021

/cherrypick release-4.8

@openshift-cherrypick-robot

@wking: new pull request created: #26704

In response to this:

/cherrypick release-4.8

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
approved Indicates a PR has been approved by an approver from all required OWNERS files. backport-risk-assessed Indicates a PR to a release branch has been evaluated and considered safe to accept. bugzilla/severity-low Referenced Bugzilla bug's severity is low for the branch this PR is targeting. bugzilla/valid-bug Indicates that a referenced Bugzilla bug is valid for the branch this PR is targeting. cherry-pick-approved Indicates a cherry-pick PR into a release branch has been approved by the release branch manager. lgtm Indicates that a PR is ready to be merged.
Projects
None yet
Development

Successfully merging this pull request may close these issues.

None yet

6 participants