enable leader election for control plane operator to prevent race conditions on upgrades #935
3729b2b
to
799ce4c
Compare
Thanks! I'm fine with this change, though I'd like some discussion before proceeding. @relyt0925 How is it that you have two instances of the CPO running simultaneously? Is it because of an upgrade? Is it an HA topology?
Shouldn't we be enforcing leader election for the CPO from the HC controller? @ironcladlou @csrwng @alvaroaleman @sjenning @relyt0925 Is there evidence of this happening at the moment that you can share? If not, I'd prefer to use leader election consistently (so we ensure no simultaneous controllers are ever active, regardless of the deployment replica count) rather than trying to solve a problem we don't have yet.
Yeah, we definitely should be using leader election; I thought we already did.
I agree with @enxebre's assessment: it seems like elections would solve it generically.
I posted evidence in this thread: but yes, it does occur.
Another thought I had was that, given these components aren't servicing client connections, the scope of disruption from using Recreate seems to be a temporary pause in reconciliation. So I don't know that there's any specific advantage to using rolling updates in this case, and they definitely introduce their own complexities.
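To make the trade-off above concrete, here is a toy model of a single-replica rollout. It is only a sketch under stated assumptions: the `event`, `rollout`, and `peakAndTrough` names are hypothetical, and the rolling-update ordering assumes the default surge behavior in which the new pod starts before the old one is stopped. It does not use any Kubernetes API.

```go
package main

import "fmt"

// event is one step in a simplified single-replica rollout (hypothetical type).
type event struct {
	action string // "start" or "stop"
	pod    string
}

// rollout returns the order of pod events for a one-replica deployment.
// Assumption: "RollingUpdate" surges (new pod starts before the old one
// stops), while "Recreate" stops the old pod before starting the new one.
func rollout(strategy string) []event {
	if strategy == "Recreate" {
		return []event{{"stop", "old"}, {"start", "new"}}
	}
	return []event{{"start", "new"}, {"stop", "old"}}
}

// peakAndTrough replays the events, beginning with the old pod running, and
// returns the highest and lowest number of pods running at any point.
func peakAndTrough(events []event) (peak, trough int) {
	running := 1
	peak, trough = running, running
	for _, e := range events {
		if e.action == "start" {
			running++
		} else {
			running--
		}
		if running > peak {
			peak = running
		}
		if running < trough {
			trough = running
		}
	}
	return peak, trough
}

func main() {
	for _, s := range []string{"RollingUpdate", "Recreate"} {
		peak, trough := peakAndTrough(rollout(s))
		fmt.Printf("%s: peak=%d trough=%d\n", s, peak, trough)
	}
}
```

In this model the rolling update peaks at two running instances (the window in which two operators can race), while Recreate never exceeds one but drops to zero, which corresponds to the temporary pause in reconciliation.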
799ce4c
to
2ff65b1
Compare
✔️ Deploy Preview for hypershift-docs ready! 🔨 Explore the source changes: 2ff65b1 🔍 Inspect the deploy log: https://app.netlify.com/sites/hypershift-docs/deploys/61f802b091bfc3000728c9c6 😎 Browse the preview: https://deploy-preview-935--hypershift-docs.netlify.app |
/retitle Enable leader election for control plane operator
[APPROVALNOTIFIER] This PR is APPROVED
This pull-request has been approved by: alvaroaleman, relyt0925
Assuming we are using the version of controller runtime which uses leases by default, you will need to update
I think controller runtime uses
Good catch, updating
2ff65b1
to
a014f73
Compare
/unhold
/lgtm
@relyt0925: all tests passed!
Without leader election, the control plane operator has a race condition whenever two instances are active at the same time. Both instances can modify deployments simultaneously, leaving them in unexpected states (sometimes a merge of the data from the two versions of the operator). This PR enables leader election for the component by default to eliminate this race condition.
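As an illustration of why an election removes the race, the sketch below gates every write behind a shared lease. It is a toy, stdlib-only model and not the controller-runtime or client-go implementation: the `lease` type, `tryAcquire`, `release`, and `reconcile` are hypothetical names, and real leases also expire and are renewed over time, which is omitted here.

```go
package main

import (
	"fmt"
	"sync"
)

// lease is a toy stand-in for a shared lock record (hypothetical; real
// leader election uses an API object with expiry and renewal).
type lease struct {
	mu     sync.Mutex
	holder string
}

// tryAcquire takes the lease if it is free and reports whether id holds it.
func (l *lease) tryAcquire(id string) bool {
	l.mu.Lock()
	defer l.mu.Unlock()
	if l.holder == "" {
		l.holder = id
	}
	return l.holder == id
}

// release gives up the lease if id currently holds it.
func (l *lease) release(id string) {
	l.mu.Lock()
	defer l.mu.Unlock()
	if l.holder == id {
		l.holder = ""
	}
}

// reconcile writes the instance's desired image only while that instance
// holds the lease, and reports whether the write happened.
func reconcile(l *lease, id string, deployment map[string]string, image string) bool {
	if !l.tryAcquire(id) {
		return false // not the leader: skip the write entirely
	}
	deployment["image"] = image
	return true
}

func main() {
	l := &lease{}
	deployment := map[string]string{}

	fmt.Println(reconcile(l, "cpo-old", deployment, "v1")) // old instance leads
	fmt.Println(reconcile(l, "cpo-new", deployment, "v2")) // new instance is blocked
	l.release("cpo-old")                                   // old instance shuts down
	fmt.Println(reconcile(l, "cpo-new", deployment, "v2")) // new instance takes over
	fmt.Println(deployment["image"])
}
```

While the old instance holds the lease, the new one cannot write, so the deployment never ends up as a mix of the two versions' output, regardless of how many replicas are running.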
What this PR does / why we need it:
Which issue(s) this PR fixes (optional, use fixes #<issue_number>(, fixes #<issue_number>, ...) format, where issue_number might be a GitHub issue, or a Jira story):
Fixes # Race condition that can result in unexpected cluster states