
enable leader election for control plane operator to prevent race conditions on upgrades #935

Merged

Conversation

relyt0925
Contributor

@relyt0925 relyt0925 commented Jan 29, 2022

There is a race condition in the control plane operator when leader election is disabled and two instances are active at the same moment. They can simultaneously modify deployments, leaving them in unexpected states (sometimes a merge of the data from the two versions of the operator). This PR enables leader election for the component by default to eliminate this race condition.
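As a rough illustration of the change being described, here is a minimal controller-runtime sketch of enabling leader election when constructing the manager. The election ID and namespace below are assumptions for illustration, not the values used by this PR.

```go
package main

import (
	ctrl "sigs.k8s.io/controller-runtime"
	"sigs.k8s.io/controller-runtime/pkg/manager"
)

func main() {
	// Sketch only: with LeaderElection enabled, only one operator
	// instance reconciles at a time, even while an upgrade briefly
	// runs old and new pods side by side.
	mgr, err := ctrl.NewManager(ctrl.GetConfigOrDie(), manager.Options{
		LeaderElection:          true,
		LeaderElectionID:        "control-plane-operator-leader-elect", // hypothetical ID
		LeaderElectionNamespace: "hypershift",                          // hypothetical namespace
	})
	if err != nil {
		panic(err)
	}
	if err := mgr.Start(ctrl.SetupSignalHandler()); err != nil {
		panic(err)
	}
}
```

A non-leader instance blocks in `Start` until it wins the election, which is exactly what prevents the simultaneous-modification race.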

What this PR does / why we need it:

Which issue(s) this PR fixes (optional, use fixes #<issue_number>(, fixes #<issue_number>, ...) format, where issue_number might be a GitHub issue, or a Jira story:
Fixes # Race condition that can result in unexpected cluster states

Checklist

  • Subject and description added to both commit and PR.
  • Relevant issues have been referenced.
  • This change includes docs.
  • This change includes unit tests.

@relyt0925 relyt0925 force-pushed the shutdown-in-parallel branch 3 times, most recently from 3729b2b to 799ce4c on January 29, 2022 16:53
@enxebre
Member

enxebre commented Jan 31, 2022

Thanks! I'm fine with this change though I'd like some discussion aside before proceeding:

@relyt0925 How is it that you have 2 instances of CPO running simultaneously? Is it because of an upgrade? Is it an HA topology?

For now, we do not want to turn on leader election because it produces aggregate load on the kube-apiserver that can lead to cluster performance degradation at scale.

Shouldn't we be enforcing leader election for the cpo from the hc controller? @ironcladlou @csrwng @alvaroaleman @sjenning

@relyt0925 Is there evidence of this happening atm that you can share? If not, I'd prefer to use leader election consistently (so we ensure no simultaneous controllers are ever active regardless of the deployment replica number) and not to try to solve a problem we don't have yet.

@alvaroaleman
Contributor

Yeah, we definitely should be using leader election, I thought we already did.

@ironcladlou
Contributor

I agree with @enxebre's assessment:

@relyt0925 Is there evidence of this happening atm that you can share? If not, I'd prefer to use leader election consistently (so we ensure no simultaneous controllers are ever active regardless of the deployment replica number) and not to try to solve a problem we don't have yet.

It seems like elections would solve it generically.

@relyt0925
Contributor Author

relyt0925 commented Jan 31, 2022

@ironcladlou
Contributor

Another thought I had was that given these components aren't servicing client connections, the scope of disruption with Recreate seems to be a temporary pause in reconciliation. So I don't know that there's any specific advantage to using rolling updates in this case, and they definitely introduce their own complexities.
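The Recreate alternative mentioned above can be sketched in Go against the apps/v1 API. The `setRecreateStrategy` helper name is hypothetical; it just shows the one field a deployment would need.

```go
package main

import appsv1 "k8s.io/api/apps/v1"

// setRecreateStrategy is a hypothetical helper illustrating the
// trade-off discussed here: with Recreate, the old operator pod is
// fully terminated before the new one starts, so two instances never
// overlap, at the cost of a brief gap in reconciliation.
func setRecreateStrategy(d *appsv1.Deployment) {
	d.Spec.Strategy = appsv1.DeploymentStrategy{
		Type: appsv1.RecreateDeploymentStrategyType,
	}
}
```

Leader election was preferred in this PR because it guards against overlap regardless of replica count or rollout strategy, whereas Recreate only covers the upgrade case.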

@netlify

netlify bot commented Jan 31, 2022

✔️ Deploy Preview for hypershift-docs ready!

🔨 Explore the source changes: 2ff65b1

🔍 Inspect the deploy log: https://app.netlify.com/sites/hypershift-docs/deploys/61f802b091bfc3000728c9c6

😎 Browse the preview: https://deploy-preview-935--hypershift-docs.netlify.app

Contributor

@alvaroaleman alvaroaleman left a comment


/retitle Enable leader election for control plane operator

@relyt0925 relyt0925 changed the title from "use recreate strategy for control plane operator to prevent race cond…" to "enable leader election for control plane operator to prevent race conditions on upgrades" Jan 31, 2022
@openshift-ci openshift-ci bot added the lgtm Indicates that a PR is ready to be merged. label Jan 31, 2022
@openshift-ci
Contributor

openshift-ci bot commented Jan 31, 2022

[APPROVALNOTIFIER] This PR is APPROVED

This pull-request has been approved by: alvaroaleman, relyt0925

The full list of commands accepted by this bot can be found here.

The pull request process is described here

Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

@openshift-ci openshift-ci bot added the approved Indicates a PR has been approved by an approver from all required OWNERS files. label Jan 31, 2022
@enxebre
Member

enxebre commented Jan 31, 2022

Assuming we are using the version of controller-runtime that uses leases by default, you will need to update reconcileControlPlaneOperatorRole to manage them, e.g.:

```go
{
	APIGroups: []string{"coordination.k8s.io"},
	Resources: []string{
		"leases",
	},
	Verbs: []string{"*"},
},
```

@enxebre
Member

enxebre commented Jan 31, 2022

I think controller-runtime uses configmapsleases by default. Can we please make sure we set LeaderElectionResourceLock: "leases" in the CPO?
/hold
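The setting being requested can be sketched as follows; the election ID is a hypothetical placeholder. The `resourcelock.LeasesResourceLock` constant from client-go is the string "leases".

```go
package main

import (
	"k8s.io/client-go/tools/leaderelection/resourcelock"
	ctrl "sigs.k8s.io/controller-runtime"
	"sigs.k8s.io/controller-runtime/pkg/manager"
)

// newManager is a sketch of pinning the lock type to "leases" so the
// CPO does not use the configmapsleases default, matching an RBAC rule
// that only grants access to coordination.k8s.io leases.
func newManager() (manager.Manager, error) {
	return ctrl.NewManager(ctrl.GetConfigOrDie(), manager.Options{
		LeaderElection:             true,
		LeaderElectionID:           "control-plane-operator-leader-elect", // hypothetical ID
		LeaderElectionResourceLock: resourcelock.LeasesResourceLock,       // "leases"
	})
}
```

Pinning the lock type explicitly also keeps behavior stable across controller-runtime upgrades that change the default.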

@openshift-ci openshift-ci bot added the do-not-merge/hold Indicates that a PR should not merge because someone has issued a /hold command. label Jan 31, 2022
@relyt0925
Contributor Author

Good catch, updating.

…ditions on upgrades

There is a race condition in the control plane operator when leader election is disabled and two instances are active at the same moment. They can simultaneously modify deployments, leaving them in unexpected states (sometimes a merge of the data from the two versions of the operator). This PR enables leader election for the component by default to eliminate this race condition.
@openshift-ci openshift-ci bot removed the lgtm Indicates that a PR is ready to be merged. label Jan 31, 2022
@relyt0925
Contributor Author

/unhold

@openshift-ci openshift-ci bot removed the do-not-merge/hold Indicates that a PR should not merge because someone has issued a /hold command. label Jan 31, 2022
@sjenning
Contributor

/lgtm

@openshift-ci openshift-ci bot added the lgtm Indicates that a PR is ready to be merged. label Jan 31, 2022
@openshift-ci
Contributor

openshift-ci bot commented Jan 31, 2022

@relyt0925: all tests passed!

Full PR test history. Your PR dashboard.

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository. I understand the commands that are listed here.

@openshift-merge-robot openshift-merge-robot merged commit 9dae915 into openshift:main Jan 31, 2022