
✨Add option to use Lease lock if coordination group is available #600

Closed

Conversation

mszostok
Contributor

@mszostok mszostok commented Sep 11, 2019

Description

  • This PR adds a new strategy for the NewResourceLock method: choose the Lease lock if lease.coordination.k8s.io is available; otherwise, use ConfigMaps (a rough sketch of the idea follows this list).
  • Add test coverage for the implemented logic using Ginkgo & Gomega.
  • Remove outdated information about cert generation (cert generation was removed in this PR: ⚠️ split webhook server and manifest generation #300).
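
For illustration only, here is a rough Go sketch of the selection idea (the helper name and error handling are assumptions, not the exact code in this PR): probe the API server via discovery for the coordination.k8s.io group and fall back to ConfigMaps when it is absent.

	package leaderelection

	import (
		"k8s.io/client-go/discovery"
		"k8s.io/client-go/rest"
		"k8s.io/client-go/tools/leaderelection/resourcelock"
	)

	// chooseLockType is a hypothetical helper showing how NewResourceLock could
	// pick the lock backend.
	func chooseLockType(config *rest.Config) (string, error) {
		dc, err := discovery.NewDiscoveryClientForConfig(config)
		if err != nil {
			return "", err
		}
		// ServerResourcesForGroupVersion returns an error when the group/version
		// is not served; treat that as "Leases are not available".
		if _, err := dc.ServerResourcesForGroupVersion("coordination.k8s.io/v1"); err != nil {
			return resourcelock.ConfigMapsResourceLock, nil
		}
		return resourcelock.LeasesResourceLock, nil
	}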

Notes to reviewer:

Fixes: #460


Needs to be synced with #444.

@k8s-ci-robot
Contributor

[APPROVALNOTIFIER] This PR is NOT APPROVED

This pull-request has been approved by: mszostok
To complete the pull request process, please assign pwittrock
You can assign the PR to them by writing /assign @pwittrock in a comment when ready.

The full list of commands accepted by this bot can be found here.

Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

@k8s-ci-robot k8s-ci-robot added the size/L Denotes a PR that changes 100-499 lines, ignoring generated files. label Sep 11, 2019
@k8s-ci-robot k8s-ci-robot added the cncf-cla: yes Indicates the PR's author has signed the CNCF CLA. label Sep 11, 2019
@mszostok mszostok force-pushed the myk-leader-election branch 2 times, most recently from 5e26c0e to fea2ea9 on September 11, 2019 18:44
@mszostok
Contributor Author

I've checked #444 and there is no merge conflict, because the logic I'm adding in this PR lives directly in the NewResourceLock func, so for #444 it should be like a "black box".

One thing that needs to change is this comment:

	// LeaderElectionID determines the name of the configmap that leader election
	// will use for holding the leader lock.

We should remove the mention of ConfigMaps because it's an implementation detail, so in my PR I've changed it to:

	// LeaderElectionID determines the name of the resource lock that leader election
	// will use for holding the leader lock.

If we agree on that, then #444 should be adjusted too.

/cc @DirectXMan12

Review threads (outdated, resolved): pkg/leaderelection/doc.go (2), pkg/leaderelection/leader_election.go (3)
@mszostok mszostok force-pushed the myk-leader-election branch 2 times, most recently from 8efda4e to 13fe693 on September 12, 2019 12:55
- Choose the Lease lock if lease.coordination.k8s.io is available; otherwise, use ConfigMaps.
- Add test coverage for the implemented logic using Ginkgo & Gomega.
- Remove outdated information about cert generation.
@mszostok
Contributor Author

mszostok commented Sep 12, 2019

I had to rebase after merging #591

Contributor

@DirectXMan12 DirectXMan12 left a comment


Should this be configurable?

Should we try to pass in the RESTMapper instead of adding yet another call to discovery?

@mszostok
Contributor Author

mszostok commented Sep 21, 2019

@DirectXMan12 thanks for the comments, please check my answers.

Should we try pass in the RESTMapper instead of adding yet another call to discovery?

TBH, I didn't take this into account because it's only one call to the API server, and only at startup. Using the client allows us to encapsulate the new logic inside NewResourceLock. Personally, I don't see it as a case for optimization, but here is a draft PR with the RESTMapper:
https://github.com/mszostok/controller-runtime/pull/1/files

If you think that's a better approach, I will rewrite the current proposal and adjust the test coverage.
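
For reference, a hedged sketch of what the RESTMapper variant could look like (assumed helper name; not necessarily what the draft PR above does): the manager's existing mapper answers whether the Lease kind resolves, so no separate discovery round-trip is needed.

	package leaderelection

	import (
		"k8s.io/apimachinery/pkg/api/meta"
		"k8s.io/apimachinery/pkg/runtime/schema"
		"k8s.io/client-go/tools/leaderelection/resourcelock"
	)

	// chooseLockTypeFromMapper is a hypothetical helper that reuses a RESTMapper
	// instead of a discovery client.
	func chooseLockTypeFromMapper(mapper meta.RESTMapper) (string, error) {
		_, err := mapper.RESTMapping(schema.GroupKind{
			Group: "coordination.k8s.io",
			Kind:  "Lease",
		}, "v1")
		if meta.IsNoMatchError(err) {
			// The cluster does not serve coordination.k8s.io/v1 Leases.
			return resourcelock.ConfigMapsResourceLock, nil
		}
		if err != nil {
			return "", err
		}
		return resourcelock.LeasesResourceLock, nil
	}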


Should this be configurable?

Good question. The #460 task only said that the new API should be used when available, falling back to ConfigMaps if not, so I didn't make it configurable.

What level of customization were you thinking about?

  1. The user should be able to provide their own NewResourceLock func, like in this PR: ✨ expose NewResourceLock in manager.Options #535?

  2. Add a field LeaderElectionLockType, which would be an enum (see the sketch after this list).

  3. Both 1 & 2.
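
A purely hypothetical sketch of option 2; the field name, type, and values are assumptions for discussion, not part of controller-runtime's actual Options.

	package manager

	// LockType enumerates the resources a leader lock could be stored in.
	type LockType string

	const (
		LockTypeConfigMap LockType = "configmaps"
		LockTypeLease     LockType = "leases"
	)

	// Options (excerpt) showing where the proposed field could live.
	type Options struct {
		// LeaderElectionLockType selects which resource backs the leader lock.
		// An empty value could mean "auto-detect": Lease if available,
		// ConfigMap otherwise.
		LeaderElectionLockType LockType
	}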


One more thing comes to mind that I want to clarify: this implementation will be a bit of a "breaking change", right?

Scenario 1:
A user has a controller with the old implementation, then updates the controller-runtime library. The previous version used the ConfigMap; the new one will use the Lease. The problem is that the old ConfigMap will not be garbage collected, so we will have an orphaned object.

Scenario 2:
A user updates the controller-runtime library and performs a rolling update of the controller Deployment. Problem:

  • The previous version has an old strategy (ConfigMap) - still deployed
  • The new one has a new strategy (Lease) - newly deployed

Both will work at the same time (reconcile process), because the old replica checks the ConfigMap and the new one checks the Lease. So the user needs to know that the controller Deployment should use the Recreate strategy to prevent such a situation.
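
As a hedged illustration of that advice (not part of this PR; the helper name is made up), setting the Deployment strategy to Recreate in Go could look like this:

	package deployutil

	import (
		appsv1 "k8s.io/api/apps/v1"
	)

	// withRecreateStrategy mutates a Deployment so the old ReplicaSet is torn
	// down before the new one starts, preventing the ConfigMap-based and
	// Lease-based leaders from running side by side.
	func withRecreateStrategy(d *appsv1.Deployment) {
		d.Spec.Strategy = appsv1.DeploymentStrategy{
			Type: appsv1.RecreateDeploymentStrategyType,
		}
	}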

So what about the migration path? Should it be automated? Documented?

@fejta-bot

Issues go stale after 90d of inactivity.
Mark the issue as fresh with /remove-lifecycle stale.
Stale issues rot after an additional 30d of inactivity and eventually close.

If this issue is safe to close now please do so with /close.

Send feedback to sig-testing, kubernetes/test-infra and/or fejta.
/lifecycle stale

@k8s-ci-robot k8s-ci-robot added the lifecycle/stale Denotes an issue or PR has remained open with no activity and has become stale. label Dec 20, 2019
@k8s-ci-robot
Contributor

@mszostok: PR needs rebase.

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.

@k8s-ci-robot k8s-ci-robot added the needs-rebase Indicates a PR cannot be merged because it has merge conflicts with HEAD. label Dec 20, 2019
@fejta-bot

Stale issues rot after 30d of inactivity.
Mark the issue as fresh with /remove-lifecycle rotten.
Rotten issues close after an additional 30d of inactivity.

If this issue is safe to close now please do so with /close.

Send feedback to sig-testing, kubernetes/test-infra and/or fejta.
/lifecycle rotten

@k8s-ci-robot k8s-ci-robot added lifecycle/rotten Denotes an issue or PR that has aged beyond stale and will be auto-closed. and removed lifecycle/stale Denotes an issue or PR has remained open with no activity and has become stale. labels Jan 19, 2020
@k8s-ci-robot
Contributor

@mszostok: The following test failed, say /retest to rerun all failed tests:

Test name: pull-controller-runtime-test-master
Commit: 179f36f
Rerun command: /test pull-controller-runtime-test-master

Full PR test history. Your PR dashboard. Please help us cut down on flakes by linking to an open issue when you hit one in your PR.

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository. I understand the commands that are listed here.

DirectXMan12 pushed a commit that referenced this pull request Jan 31, 2020
Fix some phrases in what_is_a_controller.md
@DirectXMan12
Contributor

cc @JoelSpeed (not sure how active you want to be anymore -- lmk if you don't want to be tagged), @GrigoriyMikhalkin, @alvaroaleman (all of whom have had thoughts on LE in the past)

@JoelSpeed
Contributor

This seems sensible to me, glad someone's gotten around to doing this!

What happens during rolling updates when changing from the ConfigMap to ResourceLock implementations? I guess it's pretty unavoidable that the leader election will fail temporarily if two different versions of the controller are running?

@vincepri vincepri added this to the Next milestone Feb 21, 2020
@fejta-bot

Rotten issues close after 30d of inactivity.
Reopen the issue with /reopen.
Mark the issue as fresh with /remove-lifecycle rotten.

Send feedback to sig-testing, kubernetes/test-infra and/or fejta.
/close

@k8s-ci-robot
Contributor

@fejta-bot: Closed this PR.

In response to this:

Rotten issues close after 30d of inactivity.
Reopen the issue with /reopen.
Mark the issue as fresh with /remove-lifecycle rotten.

Send feedback to sig-testing, kubernetes/test-infra and/or fejta.
/close

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.

@leongor

leongor commented May 14, 2020

Is this PR still relevant? Can it be merged?

@alvaroaleman
Member

As Joel mentioned, we need to think about how we migrate existing controllers that currently use a ConfigMap.

@leongor

leongor commented May 14, 2020

I understand, but in the meantime this option can be used by those who do not need to migrate.
In our environment it's desperately needed, as we cannot use a ConfigMap.

@alvaroaleman
Member

I understand, but in the meantime this option can be used by those who do not need to migrate.
In our environment it's desperately needed, as we cannot use a ConfigMap.

Unfortunately not, because with the changes in this PR controllers will automatically use the Lease lock if it's available. This means that if a controller gets updated to this version of c-r and then deployed, there are two replicas that use different mechanisms for leader election, which means they can both hold a leader lock simultaneously. We would very silently break a valid expectation ("if I enable leader election, there will always be at most one active replica"), and that is not an option.

Someone has to come up with a concept for how we can prevent that. Possible approaches include (a rough sketch of the second one follows this list):

  • Default to ConfigMap and only use the Lease lock if explicitly requested
  • Check if there is an existing ConfigMap leader lock in the cluster and, if so, default to using ConfigMap
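
A hedged sketch of that second idea (the helper name and wiring are assumptions): before picking the lock type, look for an existing ConfigMap lock under the configured LeaderElectionID and stay on ConfigMaps if one is found, so old and new replicas keep contending for the same lock.

	package leaderelection

	import (
		"context"

		apierrors "k8s.io/apimachinery/pkg/api/errors"
		metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
		"k8s.io/client-go/kubernetes"
		"k8s.io/client-go/tools/leaderelection/resourcelock"
	)

	// lockTypeForExistingLock is a hypothetical helper; ns and id are the
	// leader-election namespace and LeaderElectionID.
	func lockTypeForExistingLock(ctx context.Context, cs kubernetes.Interface, ns, id string) (string, error) {
		_, err := cs.CoreV1().ConfigMaps(ns).Get(ctx, id, metav1.GetOptions{})
		if err == nil {
			// An existing ConfigMap lock wins: stay compatible with replicas
			// that are still using it.
			return resourcelock.ConfigMapsResourceLock, nil
		}
		if apierrors.IsNotFound(err) {
			return resourcelock.LeasesResourceLock, nil
		}
		return "", err
	}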

@leongor

leongor commented May 14, 2020

I think it's a reasonable approach to stick with ConfigMap as the default and provide an option to switch to Lease.
The only thing that prevents using Lease today is that the newResourceLock field in the manager Options is private and cannot be overridden.

@alvaroaleman
Member

Yeah, generally that sounds reasonable. It is subpar insofar as we basically say the recommended leader lock is not the default one due to these compatibility concerns, but I guess it's the best we can do, and we can add a TODO to the manager to change the default when we break its interface.

@leongor

leongor commented May 14, 2020

Sounds good to me.

@leongor

leongor commented May 17, 2020

So, can we reopen this PR or do we need a new one?

@alvaroaleman
Member

alvaroaleman commented May 17, 2020 via email
