New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
OCPBUGS-17041: Release Leader Election on Manager Exit #556
Conversation
@awgreene: This pull request references Jira Issue OCPBUGS-17041, which is invalid:
Comment The bug has been updated to refer to the pull request using the external bug tracker. In response to this:
Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository. |
/lgtm |
cmd/package-server-manager/main.go
Outdated
LeaderElection: !disableLeaderElection, | ||
LeaderElectionNamespace: namespace, | ||
LeaderElectionID: leaderElectionConfigmapName, | ||
LeaderElectionReleaseOnCancel: true, |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This will actually make #555 work faster; right now, one has to wait 2-3 minutes for the change to propagate due to the lease used in the leader election.
/retest |
66e390e
to
c3541b2
Compare
c3541b2
to
17b441e
Compare
type: RollingUpdate | ||
type: Recreate |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@tmshort this will slow down the new pod from coming up, but will prevent the new pod from attempting to acquire a lease before the other pod has shut down, meaning that it won't wait the retry duration, which should yield overall speed improvements.
17b441e
to
3722ae2
Compare
@awgreene: This pull request references Jira Issue OCPBUGS-17041, which is invalid:
Comment In response to this:
Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository. |
/retest |
1 similar comment
/retest |
/retest |
@awgreene: you cannot LGTM your own PR. In response to this:
Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository. |
/jira refresh |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
lgtm
[APPROVALNOTIFIER] This PR is APPROVED This pull-request has been approved by: awgreene, perdasilva The full list of commands accepted by this bot can be found here. The pull request process is described here
Needs approval from an approver in each of these files:
Approvers can indicate their approval by writing |
Turns out this is a must have in 4.14 for telco, I've been asked to merge it into 4.15 and then 4.14. |
/jira refresh |
/retest |
2 similar comments
/retest |
/retest |
Looking into the latest e2e flake/failure.
|
The failed test doesn't seem to be related to the changes introduced in this PR based on the periodic test grid results, I'll take a swing at improving that test before attempting to retest this again. |
👍 I agree it isn't related. Thanks for looking into test improvements. |
I swear I've seen a similar failure in other PRs (not that I can find them). |
Updated timings with more detail. I think we may need to extend the polling to more than 1 minute, or we need to start polling later, after the catalog is happy. There's just not enough time to do what we need within 1 minute when we start the polling this early:
|
Some of those errors look to be networking-type errors (i.e. |
You are correct; however, they are errors that prevent the parts of the test from progressing and demonstrate the need for either more than a 1m poll timeout, or a way to start polling later. |
Or retries, if they are not present. |
/retest |
So, the catalog should be happy; in this location (webhook_e2e_test.go:685), there is The polling timeout (error) is occurring later in One thing we aren't doing is waiting for the Subscription to be created via So, I have already increased the timeout period of the |
Separate failure, we'll need to look into this as well:
|
It's a failure during cleanup after the test completes (it's literally a |
#600 merged, which addresses e2e failures. /retest |
/unhold OpenShift CI had applied a |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
From what I read about LeaderElectionReleaseOnCancel
- this PR looks reasonable. But I'm a bit hesitant to lgtm it because I'm not well versed in this area.
/lgtm |
@awgreene: all tests passed! Full PR test history. Your PR dashboard. Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository. I understand the commands that are listed here. |
@awgreene: Jira Issue OCPBUGS-17041: All pull requests linked via external trackers have merged: Jira Issue OCPBUGS-17041 has been moved to the MODIFIED state. In response to this:
Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository. |
@awgreene: #556 failed to apply on top of branch "release-4.14":
In response to this:
Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository. |
/cherry-pick release-4.14 |
@awgreene: #556 failed to apply on top of branch "release-4.14":
In response to this:
Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository. |
Fix included in accepted release 4.15.0-0.nightly-2023-11-13-174800 |
This commit introduces a couple of changes to the package server manager
to improve hand off between running pods during an upgrade or
redeployment.
manager exit which will speed up voluntary leader transition as the new
leader shouldn't have to wait the LeaseDuration time before taking
over.
I should note that enabling this setting expects that the binary will
immediately exit upon release.
field updated from Rolling to Recreate, which will prevent the new pod
from attempting to acquire the lease before the current leader has
shutdown and released its lease.
Signed-off-by: Alexander Greene greene.al1991@gmail.com