Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Cancellable leader election #57932

Merged

Conversation

@ash2k
Copy link
Member

commented Jan 6, 2018

What this PR does / why we need it:
Adds ability to cancel leader election. Useful in integration tests where the whole app is started and stopped in each test.

Special notes for your reviewer:
I used the context package - it is impossible/hard to achieve the same behaviour with just channels without spawning additional goroutines but it is trivial with context. See acquire() and renew() methods.

Release note:

NONE

/kind enhancement
/sig api-machinery

@caesarxuchao

This comment has been minimized.

Copy link
Member

commented Jan 8, 2018

/assign @cheftako

@ash2k ash2k force-pushed the atlassian:cancellable-leader-election branch from a67cfa9 to b013fef Jan 17, 2018

@ash2k

This comment has been minimized.

Copy link
Member Author

commented Jan 17, 2018

/retest

@ash2k ash2k force-pushed the atlassian:cancellable-leader-election branch from b013fef to ddf6fdb Jan 19, 2018

@ash2k

This comment has been minimized.

Copy link
Member Author

commented Jan 19, 2018

/retest

@ash2k ash2k force-pushed the atlassian:cancellable-leader-election branch from ddf6fdb to fdd468d Feb 1, 2018

@ash2k

This comment has been minimized.

Copy link
Member Author

commented Feb 2, 2018

@timothysc timothysc self-assigned this Feb 5, 2018

@timothysc

This comment has been minimized.

Copy link
Member

commented Feb 5, 2018

@ash2k where is your integration test that makes your case? It's pretty trivial to plumb signal-handlers through the stop channel and others have written code that depends on this external to k8s.

You're basically forcing everyone who uses this to change, so let's be certain to dot all the i's and cross all the t's.

@ash2k

This comment has been minimized.

Copy link
Member Author

commented Feb 6, 2018

@timothysc There seem to be some misunderstanding here. I'm saying once you start the leader election, there is no way to ask it to stop. I.e. there is no way to make the method LeaderElector.Run() return. It only ever returns if it fails to renew the lease. I want to start and stop integration tests that start the whole application (in a Go test) with leader election enabled (because it is code that needs to be tested). But once the application does what it needs to do in the test, the test should be able to tell the application to cleanly shutdown. No way to do it right now.
You seem to be talking about plumbing a channel though to stop the application itself - that is not the problem I'm trying to solve here.

Examples of my integration tests can be found here https://github.com/atlassian/smith/tree/master/it
You probably want to take a look at the SetupApp() function that starts the application for each test and then does clean tear down.

@timothysc

This comment has been minimized.

Copy link
Member

commented Feb 7, 2018

I'm ok with it, but you're changing common code that will affect multiple parties that use the client. I'm not certain how apimachinery folks communicate these changes, and what policies are in place nowadays.

/cc @sttts

@k8s-ci-robot k8s-ci-robot requested a review from sttts Feb 7, 2018

@sttts

This comment has been minimized.

Copy link
Contributor

commented Feb 12, 2018

Would be much easier to review if the conversion from stopCh to context would be separate.

@ash2k

This comment has been minimized.

Copy link
Member Author

commented Feb 12, 2018

It would be harder to implemented the same logic with a stop channel - channels panic if closed more than once unlike the cancel() function that is used to cancel a context. So I would need to have some sort of locking to ensure the channel has been closed only once.
For example the renew() function would need to start an additional goroutine to receive from the stopCh and then close a temporary channel which would signal wait.Until() to stop. And that temporary channel would be used to cancel the wait from inside of the wait.Until(). So need a lock around that - could use sync.Once to do this. Temporary channel is needed because it cannot and must not close the stop channel that was passed to it.
All of that is not needed with a context and it would probably be harder to review.

@sttts

This comment has been minimized.

Copy link
Contributor

commented Feb 12, 2018

All of that is not needed with a context and it would probably be harder to review.

I don't argue against a context :) just that we have two changes mixed into one commit here.

@sttts

This comment has been minimized.

Copy link
Contributor

commented Feb 12, 2018

I don't argue against a context :)

Would like us to move to contexts much more throughout the code base.

@ash2k

This comment has been minimized.

Copy link
Member Author

commented Feb 12, 2018

I understand and do not argue that it's two changes in 1 commit. My point is that it would not be easier to review. But if you insist I'll do it in another branch as an experiment. I'll post a link.
Yes, moving from stop channels to context would be nice.

@sttts

This comment has been minimized.

Copy link
Contributor

commented Feb 12, 2018

But if you insist I'll do it in another branch as an experiment

Two commits in this PR would be fine.

In general, change sgtm.

ash2k added some commits Feb 14, 2018

@ash2k ash2k force-pushed the atlassian:cancellable-leader-election branch from c31d11f to 102090d Jun 7, 2018

@ash2k

This comment has been minimized.

Copy link
Member Author

commented Jun 8, 2018

/retest

@@ -144,7 +145,7 @@ func Run(c *config.CompletedConfig) error {
}
}

run := func(stop <-chan struct{}) {
run := func(runCtx context.Context) {

This comment has been minimized.

Copy link
@mikedanese

mikedanese Jun 8, 2018

Member

same nit as above. s/runCtx/ctx/

This comment has been minimized.

Copy link
@ash2k

ash2k Jun 8, 2018

Author Member

It clashes with ctx, err := CreateControllerContext(c, rootClientBuilder, clientBuilder, runCtx.Done()) below.

This comment has been minimized.

Copy link
@misterikkit

misterikkit Jun 9, 2018

Contributor

I think that in general, variables of type context.Context prefer to be named ctx. Naming a "ControllerContext" struct variable ctx is probably worth changing so that readers don't get confused.

In fact, looking at the definition of ControllerContext, I don't think it is named correctly. It looks more like a config struct. (-:

This comment has been minimized.

Copy link
@ash2k

ash2k Jun 9, 2018

Author Member

Renamed.

@mikedanese

This comment has been minimized.

Copy link
Member

commented Jun 8, 2018

One more comment.

/lgtm
/hold

@misterikkit
Copy link
Contributor

left a comment

It would have been nicer to add new functionality to leaderelect without requiring clients to change in the same PR. Each controller could be updated in a follow-up PR so that you don't get caught up in all the approvals needed.

Since the work is already done here, I won't push for this to be split up. See my comment about how I think it could have been done, though.

@@ -144,7 +145,7 @@ func Run(c *config.CompletedConfig) error {
}
}

run := func(stop <-chan struct{}) {
run := func(runCtx context.Context) {

This comment has been minimized.

Copy link
@misterikkit

misterikkit Jun 9, 2018

Contributor

I think that in general, variables of type context.Context prefer to be named ctx. Naming a "ControllerContext" struct variable ctx is probably worth changing so that readers don't get confused.

In fact, looking at the definition of ControllerContext, I don't think it is named correctly. It looks more like a config struct. (-:

le.maybeReportTransition()
if !succeeded {
glog.V(4).Infof("failed to acquire lease %v", desc)
return
}
le.config.Lock.RecordEvent("became leader")
glog.Infof("successfully acquired lease %v", desc)
close(stop)
}, le.config.RetryPeriod, JitterFactor, true, stop)
once.Do(func() { close(tmpStop) })

This comment has been minimized.

Copy link
@misterikkit

misterikkit Jun 9, 2018

Contributor

What is the meaning of tmpStop in this function? cancellation propagation?

JitterUntil may call this anonymous func multiple times... aha

Okay, tmpStop signals to JitterUntil that it has succeeded. I suggest renaming it to done. I also think that the entire wait library tends to result in unreadable code like this, when we could have just written a for loop.

glog.Fatalf("error running controllers: %v", err)
}
}

runCtx, cancel := context.WithCancel(context.Background())
defer cancel()

This comment has been minimized.

Copy link
@misterikkit

misterikkit Jun 9, 2018

Contributor

defer cancel() is a pretty common pattern with context.WithCancel. I think the expectation is that after creating the context, you are going to call some blocking Run func. If that func starts goroutines, it should pass the context (or child contexts) to them. So if Run ever exits, we will signal to any stray goroutines to clean up and exit.

@@ -146,28 +146,28 @@ type LeaderElector struct {
}

// Run starts the leader election loop
func (le *LeaderElector) Run(stop <-chan struct{}) {
func (le *LeaderElector) Run(ctx context.Context) {

This comment has been minimized.

Copy link
@misterikkit

misterikkit Jun 9, 2018

Contributor

You could also create an overload like this:
func (le *LeaderElector) Run() { le.Run(context.TODO()) }

Then updating clients to use the new thing would be much more graceful.

This comment has been minimized.

Copy link
@mikedanese

mikedanese Jun 11, 2018

Member

Go doesn't support overloading. We discussed adding WithContext style functions to maintain compatibility but the package is documented as having an alpha API. #57932 (review)

If people feel strongly we can do this, but I wouldn't fight for it.

This comment has been minimized.

Copy link
@misterikkit

misterikkit Jun 11, 2018

Contributor

Oops - I am still learning quite a bit about golang.

I am not concerned with backward compatibility, but I thought the author's life would have been easier by temporarily preserving backward compatibility.

@k8s-ci-robot k8s-ci-robot removed the lgtm label Jun 9, 2018

@mikedanese

This comment has been minimized.

Copy link
Member

commented Jun 11, 2018

/hold cancel
/lgtm

@k8s-ci-robot k8s-ci-robot added lgtm and removed do-not-merge/hold labels Jun 11, 2018

@k8s-ci-robot

This comment has been minimized.

Copy link
Contributor

commented Jun 11, 2018

[APPROVALNOTIFIER] This PR is APPROVED

This pull-request has been approved by: ash2k, mikedanese

The full list of commands accepted by this bot can be found here.

The pull request process is described here

Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

@smarterclayton smarterclayton added this to the v1.12 milestone Jun 13, 2018

@smarterclayton

This comment has been minimized.

Copy link
Contributor

commented Jun 13, 2018

This is too late for 1.11 given the potential for regression. I'm marking at as 1.12 to be explicit with that.

@fejta-bot

This comment has been minimized.

Copy link

commented Jun 13, 2018

/retest
This bot automatically retries jobs that failed/flaked on approved PRs (send feedback to fejta).

Review the full test history for this PR.

Silence the bot with an /lgtm cancel comment for consistent failures.

1 similar comment
@fejta-bot

This comment has been minimized.

Copy link

commented Jun 17, 2018

/retest
This bot automatically retries jobs that failed/flaked on approved PRs (send feedback to fejta).

Review the full test history for this PR.

Silence the bot with an /lgtm cancel comment for consistent failures.

@k8s-github-robot

This comment has been minimized.

Copy link
Contributor

commented Jun 20, 2018

[MILESTONENOTIFIER] Milestone Pull Request Labels Incomplete

@ash2k @cheftako @deads2k @fgrzadkowski @mikedanese @pmorie

Action required: This pull request requires label changes. If the required changes are not made within 3 days, the pull request will be moved out of the v1.12 milestone.

priority: Must specify exactly one of priority/critical-urgent, priority/important-longterm or priority/important-soon.

Help
@k8s-github-robot

This comment has been minimized.

Copy link
Contributor

commented Jun 21, 2018

Automatic merge from submit-queue (batch tested with PRs 65256, 64236, 64919, 64879, 57932). If you want to cherry-pick this change to another branch, please follow the instructions here.

@k8s-github-robot k8s-github-robot merged commit 571b9be into kubernetes:master Jun 21, 2018

18 checks passed

Submit Queue Queued to run github e2e tests a second time.
Details
cla/linuxfoundation ash2k authorized
Details
pull-kubernetes-bazel-build Job succeeded.
Details
pull-kubernetes-bazel-test Job succeeded.
Details
pull-kubernetes-cross Skipped
pull-kubernetes-e2e-gce Job succeeded.
Details
pull-kubernetes-e2e-gce-100-performance Job succeeded.
Details
pull-kubernetes-e2e-gce-device-plugin-gpu Job succeeded.
Details
pull-kubernetes-e2e-gke Skipped
pull-kubernetes-e2e-kops-aws Job succeeded.
Details
pull-kubernetes-integration Job succeeded.
Details
pull-kubernetes-kubemark-e2e-gce Job succeeded.
Details
pull-kubernetes-kubemark-e2e-gce-big Job succeeded.
Details
pull-kubernetes-local-e2e Skipped
pull-kubernetes-local-e2e-containerized Skipped
pull-kubernetes-node-e2e Job succeeded.
Details
pull-kubernetes-typecheck Job succeeded.
Details
pull-kubernetes-verify Job succeeded.
Details

@ash2k ash2k deleted the atlassian:cancellable-leader-election branch Jun 21, 2018

@ash2k ash2k referenced this pull request Oct 11, 2018

Merged

Kubernetes 1.12 libraries #35

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
You can’t perform that action at this time.