upgrade/adminack: guarantee one admin ack check post-upgrade #27645

petr-muller · 2023-01-10T14:29:15Z

While looking into OCPBUGS-5505 I discovered that some 4.10->4.11
upgrade job runs perform an Admin Ack check, while others do not. 4.11 has
an `ack-4.11-kube-1.25-api-removals-in-4.12` gate, so these upgrade jobs
sometimes test that `Upgradeable` goes `false` after the upgrade, and
sometimes they do not. This is determined solely by a polling race
condition: the check is executed once every 10 minutes, and we cancel the
polling after the upgrade completes. In some runs we are lucky and manage
to run one check before the cancel; in others we are not, and only check
while still on the base version.

Example job that checked admin acks post-upgrade:
https://prow.ci.openshift.org/view/gs/origin-ci-test/logs/openshift-cluster-version-operator-880-ci-4.11-upgrade-from-stable-4.10-e2e-azure-upgrade/1611444032104304640

$ curl --silent https://gcsweb-ci.apps.ci.l2s4.p1.openshiftapps.com/gcs/origin-ci-test/logs/openshift-cluster-version-operator-880-ci-4.11-upgrade-from-stable-4.10-e2e-azure-upgrade/1611444032104304640/artifacts/e2e-azure-upgrade/openshift-e2e-test/artifacts/e2e.log | grep 'Waiting for Upgradeable to be AdminAckRequired'
Jan  6 21:16:40.153: INFO: Waiting for Upgradeable to be AdminAckRequired ...

Example job that did not check admin acks post-upgrade:
https://prow.ci.openshift.org/view/gs/origin-ci-test/logs/openshift-cluster-version-operator-880-ci-4.11-upgrade-from-stable-4.10-e2e-azure-upgrade/1611444033509396480

$ curl --silent https://gcsweb-ci.apps.ci.l2s4.p1.openshiftapps.com/gcs/origin-ci-test/logs/openshift-cluster-version-operator-880-ci-4.11-upgrade-from-stable-4.10-e2e-azure-upgrade/1611444033509396480/artifacts/e2e-azure-upgrade/openshift-e2e-test/artifacts/e2e.log | grep 'Waiting for Upgradeable to be AdminAckRequired'

Add a guaranteed single check execution after the upgrade, so that admin
ack is always checked at least once with the upgrade target version.
Doing checks after done is signalled has prior art in the alert test.

petr-muller · 2023-01-10T14:30:49Z

/uncc @jottofar @vrutkovsk
/cc @LalatenduMohanty @wking

petr-muller · 2023-01-10T14:31:06Z

/uncc @vrutkovs

wking · 2023-01-10T18:16:31Z

test/e2e/upgrade/adminack/adminack.go

+		// Perform one guaranteed check after the upgrade is complete. We cancel the polled check
+		// above, so we never know whether the poll was lucky to run at least once since the version
+		// was bumped.
+		(&clusterversionoperator.AdminAckTest{Oc: t.oc, Config: t.config}).Test(context.Background())


Do we want to cap the amount of time we're willing to spend on this? `context.WithTimeout(context.Background(), 5*time.Minute)` or similar?

petr-muller · 2023-01-11

I considered that, but I checked the other place that does this one-off check and it also just uses `context.Background()`:

origin/test/extended/adminack/adminack.go

Lines 23 to 26 in 74d8b6d

ctx := context.Background()

adminAckTest := &clusterversionoperator.AdminAckTest{Oc: oc, Config: config}

adminAckTest.Test(ctx)

But yeah, it's probably better to cap that. I'll try to come up with a reasonable upper bound. There seem to be multiple polls with timeouts inside the check, potentially even iterated, so the upper bound needs to cover the expected sum of these.

There are two 3-minute waits for each applicable gate. I don't think we have ever had more than one gate, but with a little future-proofing I set the timeout to 15m to handle a potential case where we have two gates and things somehow go slowly enough to hit the 3-minute timeout on each wait, 12 minutes in total (IMO a totally pathological case).

Huh, I guess I cannot do the check here, because `Test` will terminate while we're still testing in the goroutine.

petr-muller · 2023-01-12T17:24:25Z

The results on upgrade jobs here look sane (nothing looks broken), but there's no gate on 4.13, so the test is not doing much.

test/extended/util/openshift/clusterversionoperator/adminack.go

The `done` signal is either a timeout or "upgrade finished, stop testing". We do not need to perform the last check in the former case. Track versions that we check and when we get the signal, check whether the current version was checked at least once, and if not, check it before terminating.

wking

/lgtm

openshift-ci · 2023-01-13T17:07:47Z

[APPROVALNOTIFIER] This PR is APPROVED

This pull-request has been approved by: petr-muller, wking

Needs approval from an approver in each of these files:

~~test/extended/util/openshift/clusterversionoperator/OWNERS~~ [petr-muller,wking]

openshift-ci · 2023-01-13T19:32:59Z

@petr-muller: The following tests failed, say /retest to rerun all failed tests or /retest-required to rerun all mandatory failed tests:

| Test name | Commit | Details | Required | Rerun command |
| --- | --- | --- | --- | --- |
| ci/prow/e2e-aws-ovn-single-node-serial | `b000fd6` | link | false | `/test e2e-aws-ovn-single-node-serial` |
| ci/prow/e2e-aws-ovn-etcd-scaling | `b000fd6` | link | false | `/test e2e-aws-ovn-etcd-scaling` |
| ci/prow/e2e-gcp-ovn-rt-upgrade | `b000fd6` | link | false | `/test e2e-gcp-ovn-rt-upgrade` |
| ci/prow/e2e-vsphere-ovn-etcd-scaling | `b000fd6` | link | false | `/test e2e-vsphere-ovn-etcd-scaling` |
| ci/prow/e2e-azure-ovn-etcd-scaling | `b000fd6` | link | false | `/test e2e-azure-ovn-etcd-scaling` |

petr-muller · 2023-01-15T22:28:36Z

/cherry-pick release 4.12

openshift-cherrypick-robot · 2023-01-15T22:31:07Z

@petr-muller: cannot checkout release 4.12: error checking out release 4.12: exit status 1. output: error: pathspec 'release 4.12' did not match any file(s) known to git

In response to this:

/cherry-pick release 4.12

petr-muller · 2023-01-15T22:31:31Z

/cherry-pick release-4.12

openshift-cherrypick-robot · 2023-01-15T22:32:15Z

@petr-muller: new pull request created: #27659

In response to this:

/cherry-pick release-4.12

petr-muller · 2023-01-15T22:32:44Z

/cherry-pick release-4.11

openshift-cherrypick-robot · 2023-01-15T22:33:27Z

@petr-muller: new pull request created: #27660

In response to this:

/cherry-pick release-4.11

…ade check openshift#27645 intended to add a guaranteed post-upgrade check, but I overlooked how exactly the polling is implemented and terminated, so the post-upgrade check never actually executed.

Previously the test used `PollImmediateWithContext` for the every-10-minutes check. The `ConditionFunc` never returned `true` or a non-nil `err`, so `PollImmediateWithContext` never terminated by means of the `ConditionFunc`: it was always terminated by the `ctx.Done()` that the framework triggers when the upgrade finishes (or on a test timeout). This means that `PollImmediateWithContext` always terminated with `err=wait.ErrWaitTimeout` and the `Test` method immediately returned, so the "guaranteed" check code was never reached.

Given that our `ConditionFunc` never terminates the polling, we can simplify and use `wait.UntilWithContext` instead, which is a simpler function that implements precisely the desired loop (poll until the context is done).

openshift-ci bot requested review from jottofar and vrutkovs January 10, 2023 14:30

openshift-ci bot requested review from LalatenduMohanty and wking and removed request for jottofar January 10, 2023 14:30

openshift-ci bot removed the request for review from vrutkovs January 10, 2023 14:31

wking reviewed Jan 10, 2023

petr-muller force-pushed the ocpbugs-5505-guaranteed-admin-ack-post-upgrade branch from bbd64c9 to f9a6a2b January 11, 2023 22:30

openshift-ci bot added the approved label Jan 11, 2023

petr-muller force-pushed the ocpbugs-5505-guaranteed-admin-ack-post-upgrade branch from f9a6a2b to b7b9151 January 11, 2023 22:43

wking reviewed Jan 13, 2023

petr-muller force-pushed the ocpbugs-5505-guaranteed-admin-ack-post-upgrade branch from a04c3fd to b000fd6 January 13, 2023 16:55

wking approved these changes Jan 13, 2023

openshift-ci bot assigned wking Jan 13, 2023

openshift-ci bot added the lgtm label Jan 13, 2023

openshift-merge-robot merged commit 832987a into openshift:master Jan 13, 2023

openshift-cherrypick-robot mentioned this pull request Jan 15, 2023

[release-4.12] upgrade/adminack: guarantee one admin ack check post-upgrade #27659

Closed

openshift-cherrypick-robot mentioned this pull request Jan 15, 2023

[release-4.11] upgrade/adminack: guarantee one admin ack check post-upgrade #27660

Closed

petr-muller mentioned this pull request Jan 23, 2023

OCPBUGS-6503: upgrade/adminack: simplify polling and unblock "guaranteed" post-upgrade check #27678

Merged

petr-muller mentioned this pull request Jan 26, 2023

OCPBUGS-6850: [release-4.12] upgrade/adminack: guarantee one admin ack check post-upgrade #27684

Closed

petr-muller mentioned this pull request Jan 26, 2023

OCPBUGS-6851: [release-4.11] upgrade/adminack: guarantee one admin ack check post-upgrade #27685

Merged
