Migrate scheduler, controller-manager and cloud-controller-manager to use LeaseLock #94603

wojtek-t
Member

@wojtek-t wojtek-t commented Sep 8, 2020

Ref #80289

Migrate scheduler, controller-manager and cloud-controller-manager to use LeaseLock

/kind cleanup
/priority important-longterm

@k8s-ci-robot k8s-ci-robot added release-note-none Denotes a PR that doesn't merit a release note. kind/cleanup Categorizes issue or PR as related to cleaning up code, process, or technical debt. size/XS Denotes a PR that changes 0-9 lines, ignoring generated files. priority/important-longterm Important over the long term, but may not be staffed and/or may need multiple releases to complete. cncf-cla: yes Indicates the PR's author has signed the CNCF CLA. needs-sig Indicates an issue or PR lacks a `sig/foo` label and requires one. labels Sep 8, 2020
@k8s-ci-robot k8s-ci-robot added kind/api-change Categorizes issue or PR as related to adding, removing, or otherwise changing an API sig/apps Categorizes an issue or PR as relevant to SIG Apps. sig/scheduling Categorizes an issue or PR as relevant to SIG Scheduling. and removed needs-sig Indicates an issue or PR lacks a `sig/foo` label and requires one. labels Sep 8, 2020
@@ -44,7 +44,7 @@ func RecommendedDefaultLeaderElectionConfiguration(obj *LeaderElectionConfigurat
 		obj.RetryPeriod = metav1.Duration{Duration: 2 * time.Second}
 	}
 	if obj.ResourceLock == "" {
-		// TODO: Migrate to LeaseLock.
+		// TODO(#80289): Migrate to LeaseLock when graduating to v1beta1.
Member Author

@liggitt - who owns that part now?
I would be happy to promote it to v1beta1 in 1.20 to clean this up, but I wanted to check with the owner whether there are any known blockers (or other things we want to fix during this migration).

Member

I'm not sure... cc @mtaufen?

Contributor

I don't think it's a particularly active file, but it's a core dependency so it would be good to stabilize the version. I'm not sure what the blockers would be off the top of my head. I'm happy to help if you want to work on moving it forward.

Member Author

OK - I will try to get back to it in the next couple of weeks.

Contributor

Ok, no rush as I'll be OOO for the next couple weeks anyway.

@wojtek-t
Member Author

wojtek-t commented Sep 8, 2020

/retest

@fejta-bot

This PR may require API review.

If so, when the changes are ready, complete the pre-review checklist and request an API review.

Status of requested reviews is tracked in the API Review project.

@@ -44,7 +44,7 @@ func RecommendedDefaultLeaderElectionConfiguration(obj *LeaderElectionConfigurat
 		obj.RetryPeriod = metav1.Duration{Duration: 2 * time.Second}
 	}
 	if obj.ResourceLock == "" {
-		// TODO: Migrate to LeaseLock.
+		// TODO(#80289): Migrate to LeaseLock when graduating to v1beta1.
Member

Going straight from Endpoints -> Lease isn't safe, right? We have to go Endpoints -> EndpointsLeases -> Lease over three releases, right?

Member

@liggitt liggitt Sep 8, 2020

Separately, I'm not sure we can know when it's safe to migrate this function to LeaseLock. With components we control, we know their rollout cadence and skew support, so we can go endpoints -> endpointsleases -> leases over three versions. With this helper method, we have no idea what the consuming component's release schedule and skew support are.

Member Author

Yes - going Endpoints->Leases isn't safe.

Member Author

done
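
The reasoning in this thread can be sketched as a small model. It assumes, as an illustration rather than a statement about the client-go implementation, that an `endpoints` lock holds only an Endpoints object, a `leases` lock holds only a Lease object, and an `endpointsleases` lock holds both. It also assumes that only adjacent releases run side by side during an upgrade. The helper names `sharesLockObject` and `safeRollout` are hypothetical.

```go
package main

import "fmt"

// lockObjects lists which API objects each resourceLock value is assumed to
// hold. This mapping is an illustrative assumption, not code from client-go.
var lockObjects = map[string][]string{
	"endpoints":       {"Endpoints"},
	"endpointsleases": {"Endpoints", "Lease"},
	"leases":          {"Lease"},
}

// sharesLockObject reports whether two lock types contend on at least one
// common object. If they do not, an old and a new replica could both believe
// they are the leader at the same time.
func sharesLockObject(a, b string) bool {
	for _, x := range lockObjects[a] {
		for _, y := range lockObjects[b] {
			if x == y {
				return true
			}
		}
	}
	return false
}

// safeRollout reports whether every pair of consecutive releases in the path
// shares a lock object, so mixed-version replicas still exclude each other.
func safeRollout(path []string) bool {
	for i := 1; i < len(path); i++ {
		if !sharesLockObject(path[i-1], path[i]) {
			return false
		}
	}
	return true
}

func main() {
	fmt.Println(safeRollout([]string{"endpoints", "leases"}))                    // false
	fmt.Println(safeRollout([]string{"endpoints", "endpointsleases", "leases"})) // true
}
```

Under these assumptions the intermediate `endpointsleases` step is what keeps each upgrade hop safe. For a helper consumed by components with unknown skew support, the "adjacent releases only" assumption cannot be checked, which is the concern raised above.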

@wojtek-t wojtek-t force-pushed the migrate_leader_election_to_leases_todos branch from 5af5011 to f34cb47 Compare September 9, 2020 14:09
@k8s-ci-robot k8s-ci-robot added size/S Denotes a PR that changes 10-29 lines, ignoring generated files. sig/api-machinery Categorizes an issue or PR as relevant to SIG API Machinery. sig/cloud-provider Categorizes an issue or PR as relevant to SIG Cloud Provider. and removed size/XS Denotes a PR that changes 0-9 lines, ignoring generated files. labels Sep 9, 2020
@wojtek-t wojtek-t changed the title TODOs about migrating to LeaseLock in leaderelection Upgrade scheduler, controller-manager and cloud-controller-manager to use LeaseLock Sep 9, 2020
@wojtek-t wojtek-t changed the title Upgrade scheduler, controller-manager and cloud-controller-manager to use LeaseLock Migrate scheduler, controller-manager and cloud-controller-manager to use LeaseLock Sep 9, 2020
@k8s-ci-robot k8s-ci-robot added release-note Denotes a PR that will be considered when it comes time to generate release notes. and removed release-note-none Denotes a PR that doesn't merit a release note. labels Sep 9, 2020
@wojtek-t wojtek-t force-pushed the migrate_leader_election_to_leases_todos branch from f34cb47 to 3ddbb04 Compare September 9, 2020 14:22
@@ -115,6 +116,10 @@ func NewCloudControllerManagerOptions() (*CloudControllerManagerOptions, error)
 // NewDefaultComponentConfig returns cloud-controller manager configuration object.
 func NewDefaultComponentConfig(insecurePort int32) (*ccmconfig.CloudControllerManagerConfiguration, error) {
 	versioned := &ccmconfigv1alpha1.CloudControllerManagerConfiguration{}
+	// Use lease-based leader election to reduce cost.
+	// The default endpoints-leases one has already been used for a couple of releases.
+	versioned.LeaderElection.ResourceLock = resourcelock.LeasesResourceLock
Member

be specific about which release we switched the default to endpointslease. also, why not change the default in SetDefaults_CloudControllerManagerConfiguration instead of here?

Member Author

Added a comment about introducing in 1.17 (same below).

Modifying the defaulting seems backward-incompatible (though in practice it's very unlikely that someone will reuse it for something else...).

Should I really change to modifying the defaulting?
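
The two options discussed here differ in who sees the new value. A minimal sketch, using a hypothetical `LeaderElectionConfig` type and hypothetical function names rather than the real Kubernetes types: presetting the field before defaulting runs changes only the caller that presets it, while changing the defaulting function changes every consumer of that function.

```go
package main

import "fmt"

// LeaderElectionConfig is a hypothetical stand-in for the leader election
// section of a versioned component config.
type LeaderElectionConfig struct {
	ResourceLock string
}

// setDefaults mimics a SetDefaults_* style function: it fills only fields
// that are still unset and leaves explicit values alone.
func setDefaults(cfg *LeaderElectionConfig) {
	if cfg.ResourceLock == "" {
		cfg.ResourceLock = "endpointsleases"
	}
}

// newDefaultComponentConfig mimics the approach in the diff: preset the
// field, then run defaulting, which no longer touches it.
func newDefaultComponentConfig() LeaderElectionConfig {
	cfg := LeaderElectionConfig{ResourceLock: "leases"}
	setDefaults(&cfg)
	return cfg
}

func main() {
	// The component that presets the field gets the new lock type.
	fmt.Println(newDefaultComponentConfig().ResourceLock) // leases

	// Any other caller of the shared defaulting function keeps the old default.
	var other LeaderElectionConfig
	setDefaults(&other)
	fmt.Println(other.ResourceLock) // endpointsleases
}
```

Presetting keeps the change local to one binary, at the cost of the default living outside the defaulting function, which is the trade-off the reviewer is asking about.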

@@ -210,6 +211,10 @@ func NewKubeControllerManagerOptions() (*KubeControllerManagerOptions, error) {
 // NewDefaultComponentConfig returns kube-controller manager configuration object.
 func NewDefaultComponentConfig(insecurePort int32) (kubectrlmgrconfig.KubeControllerManagerConfiguration, error) {
 	versioned := kubectrlmgrconfigv1alpha1.KubeControllerManagerConfiguration{}
+	// Use lease-based leader election to reduce cost.
+	// The default endpoints-leases one has already been used for a couple of releases.
+	versioned.LeaderElection.ResourceLock = resourcelock.LeasesResourceLock
Member

be specific about which release we switched the default to endpointslease. also, why not change the default in SetDefaults_KubeControllerManagerConfiguration instead of here?

@@ -138,6 +138,9 @@ func splitHostIntPort(s string) (string, int, error) {
 func newDefaultComponentConfig() (*kubeschedulerconfig.KubeSchedulerConfiguration, error) {
 	versionedCfg := kubeschedulerconfigv1beta1.KubeSchedulerConfiguration{}
 	versionedCfg.DebuggingConfiguration = *configv1alpha1.NewRecommendedDebuggingConfiguration()
+	// Use lease-based leader election to reduce cost.
+	// The default endpoints-leases one has already been used for a couple of releases.
+	versionedCfg.LeaderElection.ResourceLock = resourcelock.LeasesResourceLock
Member

be specific about which release we switched the default to endpointslease. also, why not change the default in SetDefaults_KubeSchedulerConfiguration instead of here?

@wojtek-t wojtek-t force-pushed the migrate_leader_election_to_leases_todos branch 2 times, most recently from 8940a77 to 0d3216e Compare September 9, 2020 19:10
@k8s-ci-robot k8s-ci-robot added the size/M Denotes a PR that changes 30-99 lines, ignoring generated files. label Sep 10, 2020
@wojtek-t
Member Author

@liggitt - PTAL

@liggitt
Member

liggitt commented Sep 10, 2020

/lgtm

would like an ack from someone on those components' teams:
/cc @ahg-g @cheftako

@k8s-ci-robot k8s-ci-robot added the lgtm "Looks good to me", indicates that a PR is ready to be merged. label Sep 10, 2020
@liggitt
Member

liggitt commented Sep 10, 2020

/approve
/hold for component team acks

@k8s-ci-robot k8s-ci-robot added the do-not-merge/hold Indicates that a PR should not merge because someone has issued a /hold command. label Sep 10, 2020
@k8s-ci-robot
Contributor

[APPROVALNOTIFIER] This PR is APPROVED

This pull-request has been approved by: liggitt, wojtek-t

The full list of commands accepted by this bot can be found here.

The pull request process is described here

Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

@k8s-ci-robot k8s-ci-robot added the approved Indicates a PR has been approved by an approver from all required OWNERS files. label Sep 10, 2020
@ahg-g
Member

ahg-g commented Sep 10, 2020

are we making an exception here in the sense that a CC default value is being changed without increasing the version (to v1beta2 in the scheduler case)? Note that we are planning to make some changes to scheduler's CC and introduce v1beta2, so we could make this change in v1beta2 and keep the old behavior for v1beta1.

@liggitt
Member

liggitt commented Sep 10, 2020

> are we making an exception here in the sense that a CC default value is being changed without increasing the version (to v1beta2 in the scheduler case)? Note that we are planning to make some changes to scheduler's CC and introduce v1beta2, so we could make this change in v1beta2 and keep the old behavior for v1beta1.

that would be fine as well. the controller manager configs aren't exposed as config files yet, so the default change is only affecting the CLI flag defaults
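
The alternative raised in this exchange, tying the new default to a new config version, can be sketched as follows. This illustrates the suggestion only, not what was merged. `SchedulerConfig`, `defaultResourceLock` and `applyDefaults` are hypothetical names, and the `v1beta2` behaviour shown is assumed from the comment above.

```go
package main

import "fmt"

// SchedulerConfig is a hypothetical, heavily reduced stand-in for a versioned
// scheduler component config.
type SchedulerConfig struct {
	APIVersion   string
	ResourceLock string
}

// defaultResourceLock picks the default per config version, so files written
// against the older version keep their old behaviour.
func defaultResourceLock(apiVersion string) string {
	switch apiVersion {
	case "v1beta2": // assumed future version from the discussion
		return "leases"
	default:
		return "endpointsleases"
	}
}

// applyDefaults fills the lock type only when the user left it unset.
func applyDefaults(cfg *SchedulerConfig) {
	if cfg.ResourceLock == "" {
		cfg.ResourceLock = defaultResourceLock(cfg.APIVersion)
	}
}

func main() {
	oldCfg := SchedulerConfig{APIVersion: "v1beta1"}
	newCfg := SchedulerConfig{APIVersion: "v1beta2"}
	applyDefaults(&oldCfg)
	applyDefaults(&newCfg)
	fmt.Println(oldCfg.ResourceLock) // endpointsleases
	fmt.Println(newCfg.ResourceLock) // leases
}
```

As noted in the reply, this distinction matters mainly for the scheduler, because the controller manager configs are not exposed as config files and only their CLI flag defaults are affected.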

@ahg-g
Member

ahg-g commented Sep 10, 2020

/lgtm

@@ -130,7 +130,10 @@ func RecommendedDefaultGenericControllerManagerConfiguration(obj *kubectrlmgrcon
 	}

 	if len(obj.LeaderElection.ResourceLock) == 0 {
-		obj.LeaderElection.ResourceLock = "endpointsleases"
+		// Use lease-based leader election to reduce cost.
+		// We migrated for EndpointsLease lock in 1.17 and starting in 1.20 we
Member

We migrated to EndpointLease lock in 1.17

@cheftako
Member

/lgtm

@wojtek-t
Member Author

Cancelling hold based on Abdullah and Walter lgtms above.

/hold cancel

@k8s-ci-robot k8s-ci-robot removed the do-not-merge/hold Indicates that a PR should not merge because someone has issued a /hold command. label Sep 11, 2020
@wojtek-t wojtek-t added this to the v1.20 milestone Sep 11, 2020
@k8s-ci-robot k8s-ci-robot merged commit d39214a into kubernetes:master Sep 11, 2020
JacobHenner added a commit to JacobHenner/linkerd2 that referenced this pull request Jun 21, 2022
Watch events for objects in the kube-system namespace were previously
ignored. In certain situations, this would cause the destination service
to return invalid (outdated) endpoints for services in kube-system -
including unmeshed services.

It was suggested [1] that kube-system events were ignored to avoid
handling frequent Endpoint updates - specifically from controllers using
Endpoints for leader elections [2]. As of Kubernetes 1.20, these
controllers default to using Leases instead of Endpoints for their
leader elections [3], obviating the need to exclude (or filter) updates
from kube-system. The exclusions have been removed accordingly.

[1]: linkerd#4133 (comment)
[2]: kubernetes/kubernetes#86286
[3]: kubernetes/kubernetes#94603

Signed-off-by: Jacob Henner <code@ventricle.us>
kleimkuhler pushed a commit to linkerd/linkerd2 that referenced this pull request Jul 11, 2022
Watch events for objects in the kube-system namespace were previously ignored.
In certain situations, this would cause the destination service to return
invalid (outdated) endpoints for services in kube-system - including unmeshed
services.

It [was suggested][1] that kube-system events were ignored to avoid handling
frequent Endpoint updates - specifically from [controllers using Endpoints for
leader elections][2]. As of Kubernetes 1.20, these controllers [default to using
Leases instead of Endpoints for their leader elections][3], obviating the need
to exclude (or filter) updates from kube-system. The exclusions have been
removed accordingly.

[1]: #4133 (comment)
[2]: kubernetes/kubernetes#86286
[3]: kubernetes/kubernetes#94603

Signed-off-by: Jacob Henner <code@ventricle.us>