
OCPBUGS-17041: Release Leader Election on Manager Exit #556

Merged
merged 1 commit into openshift:master on Nov 10, 2023

Conversation

awgreene
Contributor

@awgreene awgreene commented Aug 29, 2023

This commit introduces a couple of changes to the package server manager
to improve handoff between running pods during an upgrade or
redeployment.

  1. The package server manager will now voluntarily release its lease on
     manager exit, which speeds up voluntary leader transition: the new
     leader no longer has to wait the LeaseDuration before taking over.

I should note that enabling this setting requires that the binary exit
immediately after the lease is released.

  2. The package server manager deployment has had its .strategy.type
     field updated from RollingUpdate to Recreate, which prevents the new
     pod from attempting to acquire the lease before the current leader
     has shut down and released its lease.

Signed-off-by: Alexander Greene greene.al1991@gmail.com

@openshift-ci-robot openshift-ci-robot added jira/valid-reference Indicates that this PR references a valid Jira ticket of any type. jira/invalid-bug Indicates that a referenced Jira bug is invalid for the branch this PR is targeting. labels Aug 29, 2023
@openshift-ci-robot

@awgreene: This pull request references Jira Issue OCPBUGS-17041, which is invalid:

  • expected the bug to target the "4.14.0" version, but no target version was set

Comment /jira refresh to re-evaluate validity if changes to the Jira bug are made, or edit the title of this pull request to link to a different bug.

The bug has been updated to refer to the pull request using the external bug tracker.

In response to this:

This commit updates the package server manager to voluntarily release its lease on manager exit, which speeds up voluntary leader transition: the new leader doesn't have to wait the LeaseDuration before taking over.

I should note that enabling this setting expects that the binary will immediately exit upon release.

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.

@openshift-ci openshift-ci bot requested review from dinhxuanvu and ncdc August 29, 2023 20:46
@openshift-ci openshift-ci bot added the approved Indicates a PR has been approved by an approver from all required OWNERS files. label Aug 29, 2023
@tmshort
Contributor

tmshort commented Aug 29, 2023

/lgtm
pending successful CI completion

@openshift-ci openshift-ci bot added the lgtm Indicates that a PR is ready to be merged. label Aug 29, 2023
LeaderElection: !disableLeaderElection,
LeaderElectionNamespace: namespace,
LeaderElectionID: leaderElectionConfigmapName,
LeaderElectionReleaseOnCancel: true,
Contributor Author

This is the bulk of the change in the PR. The Go docs describing the field can be found here, and the PR introducing the functionality can be found here.
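For readers unfamiliar with the field, here is a sketch of how LeaderElectionReleaseOnCancel is wired into a controller-runtime manager. This is illustrative, not the PR's exact code; the namespace and lock-name values are placeholders, and running it requires a kubeconfig and cluster.

```go
// Illustrative wiring of LeaderElectionReleaseOnCancel in a
// controller-runtime manager. Values are placeholders, not the PR's code.
package main

import (
	ctrl "sigs.k8s.io/controller-runtime"
)

func main() {
	mgr, err := ctrl.NewManager(ctrl.GetConfigOrDie(), ctrl.Options{
		LeaderElection:          true,
		LeaderElectionNamespace: "example-namespace", // placeholder
		LeaderElectionID:        "example-lock",      // placeholder
		// Release the lease when the manager's context is cancelled,
		// instead of letting it expire after LeaseDuration. Per the Go
		// docs, this is only safe if the binary exits immediately after
		// the manager stops.
		LeaderElectionReleaseOnCancel: true,
	})
	if err != nil {
		panic(err)
	}
	// SetupSignalHandler cancels the context on SIGTERM/SIGINT, which
	// triggers the voluntary lease release on shutdown.
	if err := mgr.Start(ctrl.SetupSignalHandler()); err != nil {
		panic(err)
	}
}
```

The one line that matters here is `LeaderElectionReleaseOnCancel: true`; everything else is standard manager boilerplate.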

Contributor

This will actually make #555 work faster; right now, one has to wait 2-3 minutes for the change to propagate due to the lease used in the leader election.

@tmshort
Contributor

tmshort commented Aug 30, 2023

/retest

@openshift-ci openshift-ci bot removed the lgtm Indicates that a PR is ready to be merged. label Aug 30, 2023
Comment on lines -122 to +126
-      type: RollingUpdate
+      type: Recreate
Contributor Author

@tmshort this will slow the new pod's startup, but it prevents the new pod from attempting to acquire a lease before the old pod has shut down, meaning it won't burn the retry duration; overall this should yield a speed improvement.
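In manifest terms, the change is a one-field switch on the Deployment. An abridged sketch (the Deployment name and metadata are illustrative; only .spec.strategy.type reflects the actual change):

```yaml
apiVersion: apps/v1
kind: Deployment
metadata:
  name: package-server-manager   # illustrative name
spec:
  strategy:
    # Recreate terminates the old pod (which releases its lease on exit)
    # before the replacement pod starts and tries to acquire the lease.
    # RollingUpdate would start the new pod while the old leader still
    # holds the lease.
    type: Recreate
```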

@openshift-ci-robot

@awgreene: This pull request references Jira Issue OCPBUGS-17041, which is invalid:

  • expected the bug to target the "4.14.0" version, but no target version was set

Comment /jira refresh to re-evaluate validity if changes to the Jira bug are made, or edit the title of this pull request to link to a different bug.

In response to this:

This commit introduces a couple of changes to the package server manager to improve handoff between running pods during an upgrade or redeployment. […]


@awgreene
Contributor Author

/retest

1 similar comment
@tmshort
Contributor

tmshort commented Sep 5, 2023

/retest

@awgreene
Contributor Author

/retest
/lgtm

@openshift-ci
Contributor

openshift-ci bot commented Sep 12, 2023

@awgreene: you cannot LGTM your own PR.

In response to this:

/retest
/lgtm


@grokspawn
Contributor

/jira refresh

Contributor

@perdasilva perdasilva left a comment


lgtm

@openshift-ci openshift-ci bot added the lgtm Indicates that a PR is ready to be merged. label Sep 12, 2023
@openshift-ci
Contributor

openshift-ci bot commented Sep 12, 2023

[APPROVALNOTIFIER] This PR is APPROVED

This pull-request has been approved by: awgreene, perdasilva

The full list of commands accepted by this bot can be found here.

The pull request process is described here

Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

@awgreene
Contributor Author

Turns out this is a must-have in 4.14 for telco; I've been asked to merge it into 4.15 and then backport it to 4.14.
/label acknowledge-critical-fixes-only

@openshift-ci openshift-ci bot added the acknowledge-critical-fixes-only Indicates if the issuer of the label is OK with the policy. label Sep 12, 2023
@awgreene
Contributor Author

/jira refresh

@openshift-ci-robot openshift-ci-robot added the jira/severity-critical Referenced Jira bug's severity is critical for the branch this PR is targeting. label Sep 12, 2023
@openshift-ci openshift-ci bot removed the lgtm Indicates that a PR is ready to be merged. label Oct 18, 2023
@awgreene
Contributor Author

/retest

2 similar comments
@tmshort
Contributor

tmshort commented Oct 23, 2023

/retest

@awgreene
Contributor Author

/retest

@ncdc
Contributor

ncdc commented Oct 24, 2023

Looking into the latest e2e flake/failure.

  1. e2e started polling for a succeeded CSV at 15:48:34.396258Z
  2. scheduler assigned webhook pod to a node at 15:49:31.218077. Note this is about 3 seconds before the 1-minute polling expires.
  3. pod starts logging at 15:49:39.065Z, after polling expired
  4. e2e test teardown deletes the namespace
  5. pod tries to grab leader lease 15:49:39.259582 error initially creating leader election record: configmaps "78ab7849.operators.coreos.io" is forbidden: unable to create new content in namespace webhook-e2e-gnlxr because it is being terminated

@awgreene
Contributor Author

The failed test doesn't seem to be related to the changes introduced in this PR, based on the periodic TestGrid results; I'll take a swing at improving that test before attempting to retest this again.

@ncdc
Contributor

ncdc commented Oct 25, 2023

👍 I agree it isn't related. Thanks for looking into test improvements.

@tmshort
Contributor

tmshort commented Oct 25, 2023

I swear I've seen a similar failure in other PRs (not that I can find them).

@ncdc
Contributor

ncdc commented Oct 25, 2023

Updated timings with more detail. I think we may need to extend the polling to more than 1 minute, or we need to start polling later, after the catalog is happy. There's just not enough time to do what we need within 1 minute when we start the polling this early:

  1. 15:48:34.396258Z Start polling
  2. time="2023-10-24T15:48:35Z" level=warning msg="error getting bundle stream" action="refresh cache" err="rpc error: code = Unavailable desc = connection error: desc = \"transport: Error while dialing dial tcp 172.30.207.167:50051: i/o timeout\"" source="{catalog-gxmmh webhook-e2e-gnlxr}"
  3. time="2023-10-24T15:49:26Z" level=info msg="added to bundle, Kind=ClusterServiceVersion" configmap=webhook-e2e-gnlxr/c2b1b338f5c023731fba9484eb4b81d35f81414ced29d9fcd90813ba9a8e8b8 key=webhook-operator.clusterserviceversion.yaml
  4. 2023-10-24T15:49:31.156721Z deployment created

@tmshort
Contributor

tmshort commented Oct 25, 2023

Some of those errors look to be networking-type errors (i.e. i/o timeout), rather than kubernetes errors.

@ncdc
Contributor

ncdc commented Oct 26, 2023

You are correct; however, they are errors that prevent the parts of the test from progressing and demonstrate the need for either more than a 1m poll timeout, or a way to start polling later.

@tmshort
Contributor

tmshort commented Oct 26, 2023

You are correct; however, they are errors that prevent the parts of the test from progressing and demonstrate the need for either more than a 1m poll timeout, or a way to start polling later.

Or retries, if they are not present.

@awgreene
Contributor Author

/retest

@tmshort
Contributor

tmshort commented Oct 27, 2023

Updated timings with more detail. I think we may need to extend the polling to more than 1 minute, or we need to start polling later, after the catalog is happy. There's just not enough time to do what we need within 1 minute when we start the polling this early

So, the catalog should be happy; in this location (webhook_e2e_test.go:685), there is fetchCatalogSourceOnStatus(..., catalogSourceRegistryPodSynced()). fetchCatalogSourceOnStatus() polls for 5m until the catalogSourceRegistryPodSynced() function returns true; since the error is not occurring in these functions, we have to assume the catalog is ready.

The polling timeout (error) is occurring later in awaitCSV() which is logging the progression of the CSV status in the poll loop, but we aren't reaching the desired state. Also, the nature of the Eventually() loop here means we'll never see the "never got correct status" message when the function attempts to end, as Eventually() will terminate the function early. The fetchCSV() and awaitCSV() functions are functionally equivalent here, but have different logging output.

One thing we aren't doing is waiting for the Subscription created via createSubscriptionForCatalog() to reach a good state; instead, we go immediately to awaitCSV(). So I think we need to wait for the subscription to reach a good state before waiting for the CSV.

So, I have already increased the timeout period of the awaitCSV() function, but I think we also need to call fetchSubscription() to make sure the subscription is created and in the correct state. This will both delay the start of the CSV wait and effectively increase the overall timeout.

@awgreene
Contributor Author

Separate failure, we'll need to look into this as well:

End-to-end: [It] Install Plan update catalog for subscription AmplifyPermissions

{Timed out after 60.001s.
Expected
    <bool>: false
to be true failed [FAILED] Timed out after 60.001s.
Expected
    <bool>: false
to be true
In [It] at: /go/src/github.com/openshift/operator-framework-olm/staging/operator-lifecycle-manager/test/e2e/util.go:582 @ 10/27/23 15:57:21.264
}

@tmshort
Contributor

tmshort commented Oct 27, 2023

It's a failure during cleanup after the test completes (it's literally a defer).
That being said, this test code is a mish-mash of Gomega/Ginkgo and assert/require, and this particular test doesn't have BeforeEach/AfterEach for individual test plans, so cleanup behavior may be a bit wonky: it might be trying to clean up a lot, and we aren't sure of the order. If there were an AfterEach, the order would be deterministic.
EDIT: It's also hard to comprehend where Gomega/Ginkgo functions actually are, given everything is a closure.

@awgreene
Contributor Author

#600 merged, which addresses e2e failures.

/retest

@awgreene
Contributor Author

awgreene commented Nov 10, 2023

/unhold

OpenShift CI had applied a /hold label because we had rerun tests 3 times without changing the commit. We believe these were innate issues with our CI and have addressed them in #600.

@openshift-ci openshift-ci bot removed the do-not-merge/hold Indicates that a PR should not merge because someone has issued a /hold command. label Nov 10, 2023
Member

@m1kola m1kola left a comment

From what I read about LeaderElectionReleaseOnCancel - this PR looks reasonable. But I'm a bit hesitant to lgtm it because I'm not well versed in this area.

@grokspawn
Contributor

/lgtm

@openshift-ci openshift-ci bot added the lgtm Indicates that a PR is ready to be merged. label Nov 10, 2023
Contributor

openshift-ci bot commented Nov 10, 2023

@awgreene: all tests passed!



@openshift-merge-bot openshift-merge-bot bot merged commit dc49d8f into openshift:master Nov 10, 2023
12 checks passed
@openshift-ci-robot

@awgreene: Jira Issue OCPBUGS-17041: All pull requests linked via external trackers have merged:

Jira Issue OCPBUGS-17041 has been moved to the MODIFIED state.

In response to this:

This commit introduces a couple of changes to the package server manager to improve handoff between running pods during an upgrade or redeployment. […]


@awgreene
Contributor Author

/cherry-pick release-4.14

@openshift-cherrypick-robot

@awgreene: #556 failed to apply on top of branch "release-4.14":

Applying: Improve Leader Election Hand Off
Using index info to reconstruct a base tree...
M	cmd/package-server-manager/main.go
M	manifests/0000_50_olm_06-psm-operator.deployment.ibm-cloud-managed.yaml
M	manifests/0000_50_olm_06-psm-operator.deployment.yaml
M	scripts/generate_crds_manifests.sh
Falling back to patching base and 3-way merge...
Auto-merging scripts/generate_crds_manifests.sh
Auto-merging manifests/0000_50_olm_06-psm-operator.deployment.yaml
Auto-merging manifests/0000_50_olm_06-psm-operator.deployment.ibm-cloud-managed.yaml
Auto-merging cmd/package-server-manager/main.go
CONFLICT (content): Merge conflict in cmd/package-server-manager/main.go
error: Failed to merge in the changes.
hint: Use 'git am --show-current-patch=diff' to see the failed patch
Patch failed at 0001 Improve Leader Election Hand Off
When you have resolved this problem, run "git am --continue".
If you prefer to skip this patch, run "git am --skip" instead.
To restore the original branch and stop patching, run "git am --abort".

In response to this:

/cherry-pick release-4.14


@openshift-merge-robot
Contributor

Fix included in accepted release 4.15.0-0.nightly-2023-11-13-174800

Labels
approved Indicates a PR has been approved by an approver from all required OWNERS files. jira/severity-critical Referenced Jira bug's severity is critical for the branch this PR is targeting. jira/valid-bug Indicates that a referenced Jira bug is valid for the branch this PR is targeting. jira/valid-reference Indicates that this PR references a valid Jira ticket of any type. lgtm Indicates that a PR is ready to be merged.