
enable leader election for control plane operator to prevent race conditions on upgrades #935

Merged

Conversation

relyt0925
Contributor

@relyt0925 relyt0925 commented Jan 29, 2022

There is a race condition in the control plane operator when leader election is disabled and two instances are active at the same moment. They can simultaneously modify deployments, leaving them in unexpected states (sometimes a merge of the data from the two versions of the operator). This PR enables leader election for the component by default to eliminate this race condition.
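As a rough illustration of the change being described, here is a minimal controller-runtime sketch of enabling leader election when constructing the manager. The election ID and namespace below are assumptions for illustration, not the values used by this PR.

```go
package main

import (
	ctrl "sigs.k8s.io/controller-runtime"
	"sigs.k8s.io/controller-runtime/pkg/manager"
)

func main() {
	// Sketch only: with LeaderElection enabled, only one operator
	// instance reconciles at a time, even while an upgrade briefly
	// runs old and new pods side by side.
	mgr, err := ctrl.NewManager(ctrl.GetConfigOrDie(), manager.Options{
		LeaderElection:          true,
		LeaderElectionID:        "control-plane-operator-leader-elect", // hypothetical ID
		LeaderElectionNamespace: "hypershift",                          // hypothetical namespace
	})
	if err != nil {
		panic(err)
	}
	if err := mgr.Start(ctrl.SetupSignalHandler()); err != nil {
		panic(err)
	}
}
```

A non-leader instance blocks in `Start` until it wins the election, which is exactly what prevents the simultaneous-modification race.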

What this PR does / why we need it:

Which issue(s) this PR fixes (optional, use fixes #<issue_number>(, fixes #<issue_number>, ...) format, where issue_number might be a GitHub issue, or a Jira story:
Fixes # Race condition that can result in unexpected cluster states

Checklist

  • Subject and description added to both commit and PR.
  • Relevant issues have been referenced.
  • This change includes docs.
  • This change includes unit tests.

@relyt0925 relyt0925 force-pushed the shutdown-in-parallel branch 3 times, most recently from 3729b2b to 799ce4c on January 29, 2022 16:53
@enxebre
Member

enxebre commented Jan 31, 2022

Thanks! I'm fine with this change though I'd like some discussion aside before proceeding:

@relyt0925 How is it that you have 2 instances of CPO running simultaneously? Is it because of an upgrade? Is it an HA topology?

For now, we do not want to turn on leader election because it produces aggregate load on the kube-apiserver that can lead to cluster performance degradation at scale.

Shouldn't we be enforcing leader election for the cpo from the hc controller? @ironcladlou @csrwng @alvaroaleman @sjenning

@relyt0925 Is there evidence of this happening atm that you can share? If not, I'd prefer to use leader election consistently (so we ensure no simultaneous controllers are ever active regardless of the deployment replica number) and not to try to solve a problem we don't have yet.

@alvaroaleman
Contributor

Yeah, we definitely should be using leader election, I thought we already did.

@ironcladlou
Contributor

I agree with @enxebre's assessment:

@relyt0925 Is there evidence of this happening atm that you can share? If not, I'd prefer to use leader election consistently (so we ensure no simultaneous controllers are ever active regardless of the deployment replica number) and not to try to solve a problem we don't have yet.

It seems like elections would solve it generically.

@relyt0925
Contributor Author

relyt0925 commented Jan 31, 2022

@ironcladlou
Contributor

Another thought I had was that given these components aren't servicing client connections, the scope of disruption with Recreate seems to be a temporary pause in reconciliation. So I don't know that there's any specific advantage to using rolling updates in this case, and they definitely introduce their own complexities.
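The Recreate alternative mentioned above can be sketched in Go against the apps/v1 API. The `setRecreateStrategy` helper name is hypothetical; it just shows the one field a deployment would need.

```go
package main

import appsv1 "k8s.io/api/apps/v1"

// setRecreateStrategy is a hypothetical helper illustrating the
// trade-off discussed here: with Recreate, the old operator pod is
// fully terminated before the new one starts, so two instances never
// overlap, at the cost of a brief gap in reconciliation.
func setRecreateStrategy(d *appsv1.Deployment) {
	d.Spec.Strategy = appsv1.DeploymentStrategy{
		Type: appsv1.RecreateDeploymentStrategyType,
	}
}
```

Leader election was preferred in this PR because it guards against overlap regardless of replica count or rollout strategy, whereas Recreate only covers the upgrade case.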

@netlify

netlify bot commented Jan 31, 2022

✔️ Deploy Preview for hypershift-docs ready!

🔨 Explore the source changes: 2ff65b1

🔍 Inspect the deploy log: https://app.netlify.com/sites/hypershift-docs/deploys/61f802b091bfc3000728c9c6

😎 Browse the preview: https://deploy-preview-935--hypershift-docs.netlify.app

Contributor

@alvaroaleman alvaroaleman left a comment


/retitle Enable leader election for control plane operator

@relyt0925 relyt0925 changed the title from "use recreate strategy for control plane operator to prevent race cond…" to "enable leader election for control plane operator to prevent race conditions on upgrades" Jan 31, 2022
@openshift-ci openshift-ci bot added the lgtm Indicates that a PR is ready to be merged. label Jan 31, 2022
@openshift-ci
Contributor

openshift-ci bot commented Jan 31, 2022

[APPROVALNOTIFIER] This PR is APPROVED

This pull-request has been approved by: alvaroaleman, relyt0925

The full list of commands accepted by this bot can be found here.

The pull request process is described here

Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

@openshift-ci openshift-ci bot added the approved Indicates a PR has been approved by an approver from all required OWNERS files. label Jan 31, 2022
@enxebre
Member

enxebre commented Jan 31, 2022

Assuming we are using the version of controller-runtime that uses leases by default, you will need to update reconcileControlPlaneOperatorRole to manage them, e.g.:

```go
{
	APIGroups: []string{"coordination.k8s.io"},
	Resources: []string{
		"leases",
	},
	Verbs: []string{"*"},
},
```

@enxebre
Member

enxebre commented Jan 31, 2022

I think controller-runtime uses configmapsleases by default. Can we please make sure we set LeaderElectionResourceLock: "leases" in the CPO?
/hold
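The setting being requested can be sketched as follows; the election ID is a hypothetical placeholder. The `resourcelock.LeasesResourceLock` constant from client-go is the string "leases".

```go
package main

import (
	"k8s.io/client-go/tools/leaderelection/resourcelock"
	ctrl "sigs.k8s.io/controller-runtime"
	"sigs.k8s.io/controller-runtime/pkg/manager"
)

// newManager is a sketch of pinning the lock type to "leases" so the
// CPO does not use the configmapsleases default, matching an RBAC rule
// that only grants access to coordination.k8s.io leases.
func newManager() (manager.Manager, error) {
	return ctrl.NewManager(ctrl.GetConfigOrDie(), manager.Options{
		LeaderElection:             true,
		LeaderElectionID:           "control-plane-operator-leader-elect", // hypothetical ID
		LeaderElectionResourceLock: resourcelock.LeasesResourceLock,       // "leases"
	})
}
```

Pinning the lock type explicitly also keeps behavior stable across controller-runtime upgrades that change the default.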

@openshift-ci openshift-ci bot added the do-not-merge/hold Indicates that a PR should not merge because someone has issued a /hold command. label Jan 31, 2022
@relyt0925
Contributor Author

Good catch, updating.

…ditions on upgrades

There is a race condition in the control plane operator when leader election is disabled and two instances are active at the same moment. They can simultaneously modify deployments, leaving them in unexpected states (sometimes a merge of the data from the two versions of the operator). This PR enables leader election for the component by default to eliminate this race condition.
@openshift-ci openshift-ci bot removed the lgtm Indicates that a PR is ready to be merged. label Jan 31, 2022
@relyt0925
Contributor Author

/unhold

@openshift-ci openshift-ci bot removed the do-not-merge/hold Indicates that a PR should not merge because someone has issued a /hold command. label Jan 31, 2022
@sjenning
Contributor

/lgtm

@openshift-ci openshift-ci bot added the lgtm Indicates that a PR is ready to be merged. label Jan 31, 2022
@openshift-ci
Contributor

openshift-ci bot commented Jan 31, 2022

@relyt0925: all tests passed!

Full PR test history. Your PR dashboard.

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository. I understand the commands that are listed here.

@openshift-merge-robot openshift-merge-robot merged commit 9dae915 into openshift:main Jan 31, 2022