
OCPBUGS-17041: Release Leader Election on Manager Exit #556

Merged
merged 1 commit into openshift:master on Nov 10, 2023

Conversation

awgreene
Contributor

@awgreene awgreene commented Aug 29, 2023

This commit introduces a couple of changes to the package server manager
to improve handoff between running pods during an upgrade or
redeployment.

  1. The package server manager will now voluntarily release its lease on
     manager exit, which speeds up voluntary leader transition: the new
     leader no longer has to wait the LeaseDuration before taking over.

I should note that enabling this setting requires that the binary exit
immediately after the lease is released.

  2. The package server manager deployment has had its .strategy.type
     field updated from RollingUpdate to Recreate, which prevents the new
     pod from attempting to acquire the lease before the current leader
     has shut down and released its lease.

Signed-off-by: Alexander Greene greene.al1991@gmail.com

@openshift-ci-robot openshift-ci-robot added jira/valid-reference Indicates that this PR references a valid Jira ticket of any type. jira/invalid-bug Indicates that a referenced Jira bug is invalid for the branch this PR is targeting. labels Aug 29, 2023
@openshift-ci-robot

@awgreene: This pull request references Jira Issue OCPBUGS-17041, which is invalid:

  • expected the bug to target the "4.14.0" version, but no target version was set

Comment /jira refresh to re-evaluate validity if changes to the Jira bug are made, or edit the title of this pull request to link to a different bug.

The bug has been updated to refer to the pull request using the external bug tracker.

In response to this:

This commit updates the package server manager to voluntarily release its lease on manager exit, which speeds up voluntary leader transition: the new leader doesn't have to wait the LeaseDuration before taking over.

I should note that enabling this setting expects that the binary will immediately exit upon release.

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.

@openshift-ci openshift-ci bot requested review from dinhxuanvu and ncdc August 29, 2023 20:46
@openshift-ci openshift-ci bot added the approved Indicates a PR has been approved by an approver from all required OWNERS files. label Aug 29, 2023
@tmshort
Contributor

tmshort commented Aug 29, 2023

/lgtm
pending successful CI completion

@openshift-ci openshift-ci bot added the lgtm Indicates that a PR is ready to be merged. label Aug 29, 2023
LeaderElection: !disableLeaderElection,
LeaderElectionNamespace: namespace,
LeaderElectionID: leaderElectionConfigmapName,
LeaderElectionReleaseOnCancel: true,
Contributor Author

This is the bulk of the change in the PR. The Go docs describing the field can be found here, and the PR introducing the functionality can be found here.
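For readers unfamiliar with the field, here is a sketch of how LeaderElectionReleaseOnCancel is wired into a controller-runtime manager. This is illustrative, not the PR's exact code; the namespace and lock-name values are placeholders, and running it requires a kubeconfig and cluster.

```go
// Illustrative wiring of LeaderElectionReleaseOnCancel in a
// controller-runtime manager. Values are placeholders, not the PR's code.
package main

import (
	ctrl "sigs.k8s.io/controller-runtime"
)

func main() {
	mgr, err := ctrl.NewManager(ctrl.GetConfigOrDie(), ctrl.Options{
		LeaderElection:          true,
		LeaderElectionNamespace: "example-namespace", // placeholder
		LeaderElectionID:        "example-lock",      // placeholder
		// Release the lease when the manager's context is cancelled,
		// instead of letting it expire after LeaseDuration. Per the Go
		// docs, this is only safe if the binary exits immediately after
		// the manager stops.
		LeaderElectionReleaseOnCancel: true,
	})
	if err != nil {
		panic(err)
	}
	// SetupSignalHandler cancels the context on SIGTERM/SIGINT, which
	// triggers the voluntary lease release on shutdown.
	if err := mgr.Start(ctrl.SetupSignalHandler()); err != nil {
		panic(err)
	}
}
```

The one line that matters here is `LeaderElectionReleaseOnCancel: true`; everything else is standard manager boilerplate.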

Contributor

This will actually make #555 work faster; right now, one has to wait 2-3 minutes for the change to propagate due to the lease used in the leader election.

@tmshort
Contributor

tmshort commented Aug 30, 2023

/retest

@openshift-ci openshift-ci bot removed the lgtm Indicates that a PR is ready to be merged. label Aug 30, 2023
Comment on lines -122 to +126
-      type: RollingUpdate
+      type: Recreate
Contributor Author

@tmshort this will slow the new pod's startup, but it prevents the new pod from attempting to acquire a lease before the old pod has shut down, meaning it won't burn the retry duration; overall this should yield a speed improvement.
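In manifest terms, the change is a one-field switch on the Deployment. An abridged sketch (the Deployment name and metadata are illustrative; only .spec.strategy.type reflects the actual change):

```yaml
apiVersion: apps/v1
kind: Deployment
metadata:
  name: package-server-manager   # illustrative name
spec:
  strategy:
    # Recreate terminates the old pod (which releases its lease on exit)
    # before the replacement pod starts and tries to acquire the lease.
    # RollingUpdate would start the new pod while the old leader still
    # holds the lease.
    type: Recreate
```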

@openshift-ci-robot

@awgreene: This pull request references Jira Issue OCPBUGS-17041, which is invalid:

  • expected the bug to target the "4.14.0" version, but no target version was set

Comment /jira refresh to re-evaluate validity if changes to the Jira bug are made, or edit the title of this pull request to link to a different bug.

In response to this:

This commit introduces a couple of changes to the package server manager to improve handoff between running pods during an upgrade or redeployment. […]


@awgreene
Contributor Author

/retest

1 similar comment
@tmshort
Contributor

tmshort commented Sep 5, 2023

/retest

@awgreene
Contributor Author

/retest
/lgtm

@openshift-ci
Contributor

openshift-ci bot commented Sep 12, 2023

@awgreene: you cannot LGTM your own PR.

In response to this:

/retest
/lgtm


@grokspawn
Contributor

/jira refresh

Contributor

@perdasilva perdasilva left a comment


lgtm

@openshift-ci openshift-ci bot added the lgtm Indicates that a PR is ready to be merged. label Sep 12, 2023
@openshift-ci
Contributor

openshift-ci bot commented Sep 12, 2023

[APPROVALNOTIFIER] This PR is APPROVED

This pull-request has been approved by: awgreene, perdasilva

The full list of commands accepted by this bot can be found here.

The pull request process is described here

Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

@awgreene
Contributor Author

Turns out this is a must-have in 4.14 for telco; I've been asked to merge it into 4.15 and then backport it to 4.14.
/label acknowledge-critical-fixes-only

@openshift-ci openshift-ci bot added the acknowledge-critical-fixes-only Indicates if the issuer of the label is OK with the policy. label Sep 12, 2023
@awgreene
Contributor Author

/jira refresh

@openshift-ci-robot openshift-ci-robot added the jira/severity-critical Referenced Jira bug's severity is critical for the branch this PR is targeting. label Sep 12, 2023
@openshift-ci openshift-ci bot removed the lgtm Indicates that a PR is ready to be merged. label Oct 18, 2023
@awgreene
Contributor Author

/retest

2 similar comments
@tmshort
Contributor

tmshort commented Oct 23, 2023

/retest

@awgreene
Contributor Author

/retest

@ncdc
Contributor

ncdc commented Oct 24, 2023

Looking into the latest e2e flake/failure.

  1. e2e started polling for a succeeded CSV at 15:48:34.396258Z
  2. scheduler assigned webhook pod to a node at 15:49:31.218077. Note this is about 3 seconds before the 1-minute polling expires.
  3. pod starts logging at 15:49:39.065Z, after polling expired
  4. e2e test teardown deletes the namespace
  5. pod tries to grab leader lease 15:49:39.259582 error initially creating leader election record: configmaps "78ab7849.operators.coreos.io" is forbidden: unable to create new content in namespace webhook-e2e-gnlxr because it is being terminated

@awgreene
Contributor Author

The failed test doesn't seem to be related to the changes introduced in this PR, based on the periodic TestGrid results; I'll take a swing at improving that test before attempting to retest this again.

@ncdc
Contributor

ncdc commented Oct 25, 2023

👍 I agree it isn't related. Thanks for looking into test improvements.

@tmshort
Contributor

tmshort commented Oct 25, 2023

I swear I've seen a similar failure in other PRs (not that I can find them).

@ncdc
Contributor

ncdc commented Oct 25, 2023

Updated timings with more detail. I think we may need to extend the polling to more than 1 minute, or we need to start polling later, after the catalog is happy. There's just not enough time to do what we need within 1 minute when we start the polling this early:

  1. 15:48:34.396258Z Start polling
  2. time="2023-10-24T15:48:35Z" level=warning msg="error getting bundle stream" action="refresh cache" err="rpc error: code = Unavailable desc = connection error: desc = \"transport: Error while dialing dial tcp 172.30.207.167:50051: i/o timeout\"" source="{catalog-gxmmh webhook-e2e-gnlxr}"
  3. time="2023-10-24T15:49:26Z" level=info msg="added to bundle, Kind=ClusterServiceVersion" configmap=webhook-e2e-gnlxr/c2b1b338f5c023731fba9484eb4b81d35f81414ced29d9fcd90813ba9a8e8b8 key=webhook-operator.clusterserviceversion.yaml
  4. 2023-10-24T15:49:31.156721Z deployment created

@tmshort
Contributor

tmshort commented Oct 25, 2023

Some of those errors look to be networking-type errors (i.e. i/o timeout), rather than kubernetes errors.

@ncdc
Contributor

ncdc commented Oct 26, 2023

You are correct; however, they are errors that prevent the parts of the test from progressing and demonstrate the need for either more than a 1m poll timeout, or a way to start polling later.

@tmshort
Contributor

tmshort commented Oct 26, 2023

You are correct; however, they are errors that prevent the parts of the test from progressing and demonstrate the need for either more than a 1m poll timeout, or a way to start polling later.

Or retries, if they are not present.

@awgreene
Contributor Author

/retest

@tmshort
Contributor

tmshort commented Oct 27, 2023

Updated timings with more detail. I think we may need to extend the polling to more than 1 minute, or we need to start polling later, after the catalog is happy. There's just not enough time to do what we need within 1 minute when we start the polling this early

So, the catalog should be happy; in this location (webhook_e2e_test.go:685), there is fetchCatalogSourceOnStatus(..., catalogSourceRegistryPodSynced()). fetchCatalogSourceOnStatus() polls for 5m until the catalogSourceRegistryPodSynced() function returns true; since the error is not occurring in these functions, we have to assume the catalog is ready.

The polling timeout (error) is occurring later in awaitCSV() which is logging the progression of the CSV status in the poll loop, but we aren't reaching the desired state. Also, the nature of the Eventually() loop here means we'll never see the "never got correct status" message when the function attempts to end, as Eventually() will terminate the function early. The fetchCSV() and awaitCSV() functions are functionally equivalent here, but have different logging output.

One thing we aren't doing is waiting for the Subscription created via createSubscriptionForCatalog() to reach a good state; instead, we go immediately to awaitCSV(). So I think we need to wait for the subscription to reach a good state before waiting for the CSV.

So, I have already increased the timeout period of the awaitCSV() function, but I think we also need to call fetchSubscription() to make sure the subscription is created and in the correct state. This will both delay the start of the CSV wait and effectively increase the overall timeout.

@awgreene
Contributor Author

Separate failure, we'll need to look into this as well:

End-to-end: [It] Install Plan update catalog for subscription AmplifyPermissions

{Timed out after 60.001s.
Expected
    <bool>: false
to be true failed [FAILED] Timed out after 60.001s.
Expected
    <bool>: false
to be true
In [It] at: /go/src/github.com/openshift/operator-framework-olm/staging/operator-lifecycle-manager/test/e2e/util.go:582 @ 10/27/23 15:57:21.264
}

@tmshort
Contributor

tmshort commented Oct 27, 2023

It's a failure during cleanup after the test completes (it's literally a defer).
That being said, this test code is a mish-mash of Gomega/Ginkgo and assert/require, and this particular test doesn't have BeforeEach/AfterEach for individual test plans, so cleanup behavior may be a bit wonky: it might be trying to clean up a lot, and we aren't sure of the order. If there were an AfterEach, the order would be deterministic.
EDIT: It's also hard to comprehend where Gomega/Ginkgo functions actually are, given everything is a closure.

@awgreene
Contributor Author

#600 merged, which addresses e2e failures.

/retest

@awgreene
Contributor Author

awgreene commented Nov 10, 2023

/unhold

OpenShift CI had applied a /hold label because we had rerun tests 3 times without changing the commit. We believe these were innate issues with our CI and have addressed them in #600.

@openshift-ci openshift-ci bot removed the do-not-merge/hold Indicates that a PR should not merge because someone has issued a /hold command. label Nov 10, 2023
Member

@m1kola m1kola left a comment

From what I read about LeaderElectionReleaseOnCancel - this PR looks reasonable. But I'm a bit hesitant to lgtm it because I'm not well versed in this area.

@grokspawn
Contributor

/lgtm

@openshift-ci openshift-ci bot added the lgtm Indicates that a PR is ready to be merged. label Nov 10, 2023
Contributor

openshift-ci bot commented Nov 10, 2023

@awgreene: all tests passed!



@openshift-merge-bot openshift-merge-bot bot merged commit dc49d8f into openshift:master Nov 10, 2023
12 checks passed
@openshift-ci-robot

@awgreene: Jira Issue OCPBUGS-17041: All pull requests linked via external trackers have merged:

Jira Issue OCPBUGS-17041 has been moved to the MODIFIED state.

In response to this:

This commit introduces a couple of changes to the package server manager to improve handoff between running pods during an upgrade or redeployment. […]


@awgreene
Contributor Author

/cherry-pick release-4.14

@openshift-cherrypick-robot

@awgreene: #556 failed to apply on top of branch "release-4.14":

Applying: Improve Leader Election Hand Off
Using index info to reconstruct a base tree...
M	cmd/package-server-manager/main.go
M	manifests/0000_50_olm_06-psm-operator.deployment.ibm-cloud-managed.yaml
M	manifests/0000_50_olm_06-psm-operator.deployment.yaml
M	scripts/generate_crds_manifests.sh
Falling back to patching base and 3-way merge...
Auto-merging scripts/generate_crds_manifests.sh
Auto-merging manifests/0000_50_olm_06-psm-operator.deployment.yaml
Auto-merging manifests/0000_50_olm_06-psm-operator.deployment.ibm-cloud-managed.yaml
Auto-merging cmd/package-server-manager/main.go
CONFLICT (content): Merge conflict in cmd/package-server-manager/main.go
error: Failed to merge in the changes.
hint: Use 'git am --show-current-patch=diff' to see the failed patch
Patch failed at 0001 Improve Leader Election Hand Off
When you have resolved this problem, run "git am --continue".
If you prefer to skip this patch, run "git am --skip" instead.
To restore the original branch and stop patching, run "git am --abort".

In response to this:

/cherry-pick release-4.14


@openshift-merge-robot
Contributor

Fix included in accepted release 4.15.0-0.nightly-2023-11-13-174800

Labels
approved Indicates a PR has been approved by an approver from all required OWNERS files. jira/severity-critical Referenced Jira bug's severity is critical for the branch this PR is targeting. jira/valid-bug Indicates that a referenced Jira bug is valid for the branch this PR is targeting. jira/valid-reference Indicates that this PR references a valid Jira ticket of any type. lgtm Indicates that a PR is ready to be merged.