upgrade/adminack: guarantee one admin ack check post-upgrade #27645

petr-muller
Member

While looking into OCPBUGS-5505 I discovered that some 4.10->4.11
upgrade job runs perform an Admin Ack check, while some do not. 4.11 has
an ack-4.11-kube-1.25-api-removals-in-4.12 gate, so these upgrade jobs
sometimes test that Upgradeable goes false after the upgrade, and
sometimes they do not. This is only determined by the polling race
condition: the check is executed once per 10 minutes, and we cancel the
polling after upgrade is completed. This means that in some cases we are
lucky and manage to run one check before the cancel, and sometimes we
are not and only check while still on the base version.

Example job that checked admin acks post-upgrade:
https://prow.ci.openshift.org/view/gs/origin-ci-test/logs/openshift-cluster-version-operator-880-ci-4.11-upgrade-from-stable-4.10-e2e-azure-upgrade/1611444032104304640

$ curl --silent https://gcsweb-ci.apps.ci.l2s4.p1.openshiftapps.com/gcs/origin-ci-test/logs/openshift-cluster-version-operator-880-ci-4.11-upgrade-from-stable-4.10-e2e-azure-upgrade/1611444032104304640/artifacts/e2e-azure-upgrade/openshift-e2e-test/artifacts/e2e.log | grep 'Waiting for Upgradeable to be AdminAckRequired'
Jan  6 21:16:40.153: INFO: Waiting for Upgradeable to be AdminAckRequired ...

Example job that did not check admin acks post-upgrade:
https://prow.ci.openshift.org/view/gs/origin-ci-test/logs/openshift-cluster-version-operator-880-ci-4.11-upgrade-from-stable-4.10-e2e-azure-upgrade/1611444033509396480

$ curl --silent https://gcsweb-ci.apps.ci.l2s4.p1.openshiftapps.com/gcs/origin-ci-test/logs/openshift-cluster-version-operator-880-ci-4.11-upgrade-from-stable-4.10-e2e-azure-upgrade/1611444033509396480/artifacts/e2e-azure-upgrade/openshift-e2e-test/artifacts/e2e.log | grep 'Waiting for Upgradeable to be AdminAckRequired'

Add a guaranteed single check execution after the upgrade, so that admin
ack is always checked at least once with the upgrade target version.
Doing checks after done is signalled has prior art in the alert test.
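
To make the race concrete, here is a minimal simulation, not the actual test code. It assumes the polled check fires at minutes 0, 10, 20, ... of the test, that the cluster starts reporting the target version at some minute, and that polling is cancelled a few minutes later. The function names and the example minute values are hypothetical.

```go
package main

import "fmt"

// polledChecksOnTarget counts the periodic checks that land after the cluster
// reports the target version but before polling is cancelled.
// All arguments are minutes since the start of the test.
func polledChecksOnTarget(interval, bumpedAt, cancelledAt int) int {
	count := 0
	for tick := 0; tick < cancelledAt; tick += interval {
		if tick >= bumpedAt {
			count++
		}
	}
	return count
}

// checksOnTarget optionally adds the single guaranteed check that runs after
// polling has been cancelled.
func checksOnTarget(interval, bumpedAt, cancelledAt int, guaranteed bool) int {
	n := polledChecksOnTarget(interval, bumpedAt, cancelledAt)
	if guaranteed {
		n++
	}
	return n
}

func main() {
	// Lucky run: the tick at minute 60 falls between the bump (58) and the cancel (63).
	fmt.Println(polledChecksOnTarget(10, 58, 63)) // 1
	// Unlucky run: bump at 51, cancel at 57, no tick in between.
	fmt.Println(polledChecksOnTarget(10, 51, 57)) // 0
	// With the guaranteed check, the unlucky run still checks the target version once.
	fmt.Println(checksOnTarget(10, 51, 57, true)) // 1
}
```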

@petr-muller
Member Author

/uncc @jottofar @vrutkovsk
/cc @LalatenduMohanty @wking

@openshift-ci openshift-ci bot requested review from LalatenduMohanty and wking and removed request for jottofar January 10, 2023 14:30
@petr-muller
Member Author

/uncc @vrutkovs

@openshift-ci openshift-ci bot removed the request for review from vrutkovs January 10, 2023 14:31
// Perform one guaranteed check after the upgrade is complete. We cancel the polled check
// above, so we never know whether the poll was lucky to run at least once since the version
// was bumped.
(&clusterversionoperator.AdminAckTest{Oc: t.oc, Config: t.config}).Test(context.Background())
Member

do we want to cap the amount of time we're willing to spend on this? context.WithTimeout(context.Background(), 5*time.Minute) or similar?

Member Author

I considered that but I checked the other place that does this one-off check and it also just does context.Background():

ctx := context.Background()
adminAckTest := &clusterversionoperator.AdminAckTest{Oc: oc, Config: config}
adminAckTest.Test(ctx)

But yeah, it's probably better to cap that. I'll try to come up with a reasonable upper bound: there seem to be multiple polls with timeouts inside the check, potentially even iterated, so the upper bound needs to cover the expected sum of these.
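
For illustration, capping a one-off check with `context.WithTimeout` could look roughly like the sketch below. `runCheck` is a hypothetical stand-in for the admin ack check (it is not the real `AdminAckTest.Test`), and the durations are shortened so the example finishes quickly.

```go
package main

import (
	"context"
	"errors"
	"fmt"
	"time"
)

// runCheck is a hypothetical stand-in for the one-off check: it "works" for
// the given duration unless the context is done first.
func runCheck(ctx context.Context, work time.Duration) error {
	select {
	case <-time.After(work):
		return nil
	case <-ctx.Done():
		return ctx.Err()
	}
}

// runCapped runs the check with an upper bound on how long it may take.
func runCapped(limit, work time.Duration) error {
	ctx, cancel := context.WithTimeout(context.Background(), limit)
	defer cancel() // release the timer even when the check finishes early
	return runCheck(ctx, work)
}

func main() {
	// The check finishes well within the cap.
	fmt.Println(runCapped(50*time.Millisecond, 5*time.Millisecond))

	// The check would run longer than the cap, so the context cuts it off.
	err := runCapped(5*time.Millisecond, 50*time.Millisecond)
	fmt.Println(errors.Is(err, context.DeadlineExceeded))
}
```

With `context.Background()` alone, the second call would simply block until the check returned on its own.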

Member Author

There are two 3-minute waits for each applicable gate. I don't think we have ever had more than one gate, but for a little future-proofing I set the timeout to 15m to handle a potential case where we have two gates and somehow things go slowly enough to hit the three-minute timeout on each wait (IMO a totally pathological case).
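
The arithmetic behind the 15m figure, as a small sketch. The constant and function names are made up for this example; only the numbers (two waits per gate, three minutes each, a 15-minute cap) come from the comment above.

```go
package main

import (
	"fmt"
	"time"
)

const (
	waitsPerGate = 2               // two waits for each applicable gate
	perWait      = 3 * time.Minute // each wait can take up to three minutes
)

// worstCase returns the time spent if every wait for every gate runs into
// its timeout.
func worstCase(gates int) time.Duration {
	return time.Duration(gates*waitsPerGate) * perWait
}

func main() {
	timeout := 15 * time.Minute

	fmt.Println(worstCase(1))           // 6m0s
	fmt.Println(worstCase(2))           // 12m0s
	fmt.Println(timeout - worstCase(2)) // 3m0s of slack with two gates
}
```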

Member Author

Huh I guess I cannot do the check here because Test will terminate while we're still testing in the goroutine.

@petr-muller petr-muller force-pushed the ocpbugs-5505-guaranteed-admin-ack-post-upgrade branch from bbd64c9 to f9a6a2b Compare January 11, 2023 22:30
@openshift-ci openshift-ci bot added the approved Indicates a PR has been approved by an approver from all required OWNERS files. label Jan 11, 2023
@petr-muller petr-muller force-pushed the ocpbugs-5505-guaranteed-admin-ack-post-upgrade branch from f9a6a2b to b7b9151 Compare January 11, 2023 22:43
@petr-muller
Member Author

The results on upgrade jobs here look sane (nothing looks broken), but there's no gate on 4.13, so the test is not doing much.

The `done` signal is either a timeout or "upgrade finished, stop testing". We do not need to perform the last check in the former case. Track versions that we check and when we get the signal, check whether the current version was checked at least once, and if not, check it before terminating.
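
A rough sketch of the bookkeeping described in this commit message. The type and function names are hypothetical and the real test code may be structured differently; the version strings are only examples.

```go
package main

import "fmt"

// doneReason says why the test was told to stop.
type doneReason int

const (
	upgradeFinished doneReason = iota // "upgrade finished, stop testing"
	timedOut                          // the test ran out of time
)

// tracker remembers which versions already had an admin ack check.
type tracker struct {
	checked map[string]bool
}

func newTracker() *tracker {
	return &tracker{checked: map[string]bool{}}
}

// check stands in for running the admin ack check against the given version.
func (t *tracker) check(version string) {
	t.checked[version] = true
}

// finish is called when the done signal arrives. It runs one last check only
// when the upgrade finished and the current version was never checked, and
// reports whether that extra check ran.
func (t *tracker) finish(reason doneReason, current string) bool {
	if reason == timedOut || t.checked[current] {
		return false
	}
	t.check(current)
	return true
}

func main() {
	// Unlucky run: polling only ever saw the base version.
	unlucky := newTracker()
	unlucky.check("4.10")
	fmt.Println(unlucky.finish(upgradeFinished, "4.11")) // true: extra check needed

	// Lucky run: polling already checked the target version.
	lucky := newTracker()
	lucky.check("4.10")
	lucky.check("4.11")
	fmt.Println(lucky.finish(upgradeFinished, "4.11")) // false: nothing to do

	// Timeout: no final check regardless of what was checked.
	late := newTracker()
	late.check("4.10")
	fmt.Println(late.finish(timedOut, "4.11")) // false
}
```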
@petr-muller petr-muller force-pushed the ocpbugs-5505-guaranteed-admin-ack-post-upgrade branch from a04c3fd to b000fd6 Compare January 13, 2023 16:55
Member

@wking wking left a comment

/lgtm

@openshift-ci openshift-ci bot added the lgtm Indicates that a PR is ready to be merged. label Jan 13, 2023
@openshift-ci
Contributor

openshift-ci bot commented Jan 13, 2023

[APPROVALNOTIFIER] This PR is APPROVED

This pull-request has been approved by: petr-muller, wking

@openshift-merge-robot openshift-merge-robot merged commit 832987a into openshift:master Jan 13, 2023
@openshift-ci
Contributor

openshift-ci bot commented Jan 13, 2023

@petr-muller: The following tests failed, say /retest to rerun all failed tests or /retest-required to rerun all mandatory failed tests:

| Test name | Commit | Details | Required | Rerun command |
| --- | --- | --- | --- | --- |
| ci/prow/e2e-aws-ovn-single-node-serial | b000fd6 | link | false | /test e2e-aws-ovn-single-node-serial |
| ci/prow/e2e-aws-ovn-etcd-scaling | b000fd6 | link | false | /test e2e-aws-ovn-etcd-scaling |
| ci/prow/e2e-gcp-ovn-rt-upgrade | b000fd6 | link | false | /test e2e-gcp-ovn-rt-upgrade |
| ci/prow/e2e-vsphere-ovn-etcd-scaling | b000fd6 | link | false | /test e2e-vsphere-ovn-etcd-scaling |
| ci/prow/e2e-azure-ovn-etcd-scaling | b000fd6 | link | false | /test e2e-azure-ovn-etcd-scaling |

@petr-muller
Member Author

/cherry-pick release 4.12

@openshift-cherrypick-robot

@petr-muller: cannot checkout release 4.12: error checking out release 4.12: exit status 1. output: error: pathspec 'release 4.12' did not match any file(s) known to git

In response to this:

/cherry-pick release 4.12

@petr-muller
Member Author

/cherry-pick release-4.12

@openshift-cherrypick-robot

@petr-muller: new pull request created: #27659

In response to this:

/cherry-pick release-4.12

@petr-muller
Member Author

/cherry-pick release-4.11

@openshift-cherrypick-robot

@petr-muller: new pull request created: #27660

In response to this:

/cherry-pick release-4.11

petr-muller added a commit to petr-muller/origin that referenced this pull request Jan 23, 2023
…ade check

openshift#27645 intended to add a guaranteed post-upgrade check, but I overlooked how exactly the polling is implemented and terminated, so the post-upgrade check never actually executed.

Previously the test used `PollImmediateWithContext` for the every-10-minutes check. The `ConditionFunc` never actually returned `true` or a non-nil `err`, so `PollImmediateWithContext` never terminated by means of the `ConditionFunc`: it was always terminated by the `ctx.Done()` that the framework triggers on a finished upgrade (or a test timeout). This means that `PollImmediateWithContext` always terminated with `err=wait.ErrWaitTimeout` and the `Test` method immediately returned, so the "guaranteed" check code was never reached.

Given that our `ConditionFunc` never terminates the polling, we can simplify and use `wait.UntilWithContext` instead, which is a simpler variant that implements precisely the desired loop (poll until the context is done).
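
The control-flow difference can be reproduced with the standard library alone. The sketch below does not use the Kubernetes `wait` package: `pollImmediate`, `until`, and `errWaitTimeout` are simplified, hypothetical stand-ins that only mimic the behavior described above, and `buggyTest` assumes the original `Test` method returned as soon as the poll reported an error.

```go
package main

import (
	"context"
	"errors"
	"fmt"
	"time"
)

// errWaitTimeout is a stand-in for wait.ErrWaitTimeout.
var errWaitTimeout = errors.New("timed out waiting for the condition")

// pollImmediate mimics a condition-based poll: it stops when cond returns
// true or an error, and reports errWaitTimeout when the context is done.
func pollImmediate(ctx context.Context, interval time.Duration, cond func(context.Context) (bool, error)) error {
	ticker := time.NewTicker(interval)
	defer ticker.Stop()
	for {
		if done, err := cond(ctx); err != nil {
			return err
		} else if done {
			return nil
		}
		select {
		case <-ctx.Done():
			return errWaitTimeout
		case <-ticker.C:
		}
	}
}

// until mimics a plain "run f every interval until the context is done" loop.
// It has no result, so there is no error for the caller to bail out on.
func until(ctx context.Context, interval time.Duration, f func(context.Context)) {
	ticker := time.NewTicker(interval)
	defer ticker.Stop()
	for ctx.Err() == nil {
		f(ctx)
		select {
		case <-ctx.Done():
		case <-ticker.C:
		}
	}
}

// buggyTest returns early on the poll error, so the final check is skipped.
func buggyTest(ctx context.Context, interval time.Duration) (finalCheckRan bool) {
	never := func(context.Context) (bool, error) { return false, nil }
	if err := pollImmediate(ctx, interval, never); err != nil {
		return false // always taken: the poll only ever ends via ctx.Done()
	}
	return true
}

// fixedTest loops until the context is done and then reaches the final check.
func fixedTest(ctx context.Context, interval time.Duration) (finalCheckRan bool) {
	until(ctx, interval, func(context.Context) {})
	return true // the guaranteed check would run here
}

// runBoth runs each variant with its own context that expires after timeout.
func runBoth(timeout, interval time.Duration) (buggyRan, fixedRan bool) {
	ctx1, cancel1 := context.WithTimeout(context.Background(), timeout)
	defer cancel1()
	buggyRan = buggyTest(ctx1, interval)

	ctx2, cancel2 := context.WithTimeout(context.Background(), timeout)
	defer cancel2()
	fixedRan = fixedTest(ctx2, interval)
	return buggyRan, fixedRan
}

func main() {
	buggyRan, fixedRan := runBoth(20*time.Millisecond, 5*time.Millisecond)
	fmt.Println(buggyRan, fixedRan) // false true
}
```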