This repository has been archived by the owner on Apr 4, 2023. It is now read-only.

Perform leaderelection in GenericPilot #186

Merged: 3 commits merged into jetstack:master on Jan 11, 2018

Conversation

@munnerz (Contributor) commented on Jan 9, 2018:

What this PR does / why we need it:

Adds a leaderelection package and implements leader election in the GenericPilot.

Closes #97

Release note:

Add support for leader election in pilots
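
For context, the new leaderelection package wraps the upstream client-go leader election code, which coordinates instances through a lock object in the Kubernetes API. The sketch below shows roughly how that upstream API is driven. It is written against current client-go (a Lease lock and RunOrDie(ctx, ...)); this PR predates that API and uses a ConfigMap lock, and its elector's Run() takes no arguments, as discussed in the review below. The names here (runLeaderElection, the "pilot-leaderelection" lock name) are illustrative, not code from this PR.

package main

import (
    "context"
    "log"
    "os"
    "time"

    metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
    "k8s.io/client-go/kubernetes"
    "k8s.io/client-go/rest"
    "k8s.io/client-go/tools/leaderelection"
    "k8s.io/client-go/tools/leaderelection/resourcelock"
)

// runLeaderElection blocks, calling onLead only if this process wins the
// election for the named lock. Other candidates keep retrying in the
// background and take over if the current leader fails to renew its lease.
func runLeaderElection(ctx context.Context, client kubernetes.Interface, namespace, lockName string, onLead func(context.Context)) error {
    id, err := os.Hostname()
    if err != nil {
        return err
    }

    lock := &resourcelock.LeaseLock{
        LeaseMeta:  metav1.ObjectMeta{Namespace: namespace, Name: lockName},
        Client:     client.CoordinationV1(),
        LockConfig: resourcelock.ResourceLockConfig{Identity: id},
    }

    leaderelection.RunOrDie(ctx, leaderelection.LeaderElectionConfig{
        Lock:          lock,
        LeaseDuration: 15 * time.Second, // how long a lease is valid without renewal
        RenewDeadline: 10 * time.Second, // the leader must renew within this window
        RetryPeriod:   2 * time.Second,  // how often candidates retry acquisition
        Callbacks: leaderelection.LeaderCallbacks{
            OnStartedLeading: onLead,
            OnStoppedLeading: func() { log.Printf("%s: lost leadership", id) },
        },
    })
    return nil
}

func main() {
    cfg, err := rest.InClusterConfig()
    if err != nil {
        log.Fatal(err)
    }
    client := kubernetes.NewForConfigOrDie(cfg)

    err = runLeaderElection(context.Background(), client, "default", "pilot-leaderelection",
        func(ctx context.Context) {
            // Leader-only work (the LeaderElectedSyncFunc, in GenericPilot terms)
            // runs here until the context is cancelled.
            <-ctx.Done()
        })
    if err != nil {
        log.Fatal(err)
    }
}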

@munnerz (Contributor, author) commented on Jan 9, 2018:

/test e2e

@munnerz (Contributor, author) commented on Jan 9, 2018:

@wallrj this is ready for review now.

I think the best way for us to write tests for this is probably with integration tests for GenericPilot itself. The leaderelection package doesn't really need much testing as it is largely just a convenient proxy to the upstream kubernetes package.

We can then provide our own SyncFunc and LeaderElectedSyncFunc, run two instances of GenericPilot and ensure that only one of them receives calls to LeaderElectedSyncFunc.
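
A sketch of that "only one instance leads" assertion, written directly against the upstream elector with a fake clientset rather than against GenericPilot (whose test constructor isn't part of this diff). The lock name, identities, and timings below are assumptions, and it uses current client-go (Lease lock, RunOrDie) rather than the ConfigMap lock used here:

package leaderelection_test

import (
    "context"
    "sync/atomic"
    "testing"
    "time"

    metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
    "k8s.io/client-go/kubernetes/fake"
    "k8s.io/client-go/tools/leaderelection"
    "k8s.io/client-go/tools/leaderelection/resourcelock"
)

// TestOnlyOneLeader starts two electors sharing a single lock and checks that
// only one of them has its OnStartedLeading callback invoked. In a GenericPilot
// integration test this callback would be the instrumented LeaderElectedSyncFunc.
func TestOnlyOneLeader(t *testing.T) {
    client := fake.NewSimpleClientset()
    ctx, cancel := context.WithCancel(context.Background())
    defer cancel()

    var leaders int32
    for _, id := range []string{"pilot-a", "pilot-b"} {
        lock := &resourcelock.LeaseLock{
            LeaseMeta:  metav1.ObjectMeta{Namespace: "test", Name: "pilot-leaderelection"},
            Client:     client.CoordinationV1(),
            LockConfig: resourcelock.ResourceLockConfig{Identity: id},
        }
        go leaderelection.RunOrDie(ctx, leaderelection.LeaderElectionConfig{
            Lock:          lock,
            LeaseDuration: 15 * time.Second,
            RenewDeadline: 10 * time.Second,
            RetryPeriod:   2 * time.Second,
            Callbacks: leaderelection.LeaderCallbacks{
                OnStartedLeading: func(ctx context.Context) { atomic.AddInt32(&leaders, 1) },
                OnStoppedLeading: func() {},
            },
        })
    }

    // Give both electors a few retry periods to race for the lock, then check
    // that exactly one of them became leader.
    time.Sleep(5 * time.Second)
    if n := atomic.LoadInt32(&leaders); n != 1 {
        t.Fatalf("expected exactly one leader, got %d", n)
    }
}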

@wallrj (Member) left a comment:

Thanks @munnerz

Looks good to me.

  • I'm unsure what operations would be performed by the leading pilot, but I guess that will become clear in a follow-up branch.
  • I'd like to see some tests for this, here or in a follow-up branch.
  • I can't see what creates the leader election ConfigMap?

if err != nil {
    glog.V(4).Infof("Leader elector failed with error: %s", err)
} else {
    glog.V(4).Infof("Leader elector unexpectedly exited")

@wallrj (Member) commented:

Log these as errors?

@wallrj (Member) commented:

So as soon as the elector goroutine exits (e.g. with an error), the Run method will exit and close the stop channel.
But there doesn't seem to be anything here that waits for the controller and the process waiter goroutines to finish.
Shouldn't there be a WaitGroup or something here?

@munnerz (Contributor, author) replied:

So we don't explicitly wait for each goroutine to exit right now - instead we rely on the call to stop to perform any required shutdown operations (which may or may not include stopping the controller). The controller itself does get stopped, but not until after the call to stop has returned (the defer close(ctrlStopCh) a few lines above causes this).

We can change this to explicitly wait for the controller to finish 'draining' its workers, and perhaps we should. As we don't currently rely on that, though, I've left it out for now so as not to further hold up process exit. I can foresee us needing to change this in future as you say, since the controller is responsible for reporting Pilot status etc.
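
For reference, the explicit wait being discussed usually looks like the sketch below: each long-lived goroutine is tracked by a sync.WaitGroup, and Run only returns once they have all drained. The names here are hypothetical, not code from this branch.

package main

import "sync"

// run starts the given long-lived workers (controller, process waiter, ...)
// and, once stopCh is closed, waits for every one of them to return instead
// of exiting as soon as the elector goroutine stops.
func run(stopCh <-chan struct{}, workers ...func(<-chan struct{})) {
    var wg sync.WaitGroup
    for _, w := range workers {
        w := w // capture the loop variable (required before Go 1.22)
        wg.Add(1)
        go func() {
            defer wg.Done()
            w(stopCh) // each worker is expected to return once stopCh closes
        }()
    }
    <-stopCh // e.g. closed when the elector exits or a termination signal arrives
    wg.Wait()
}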

    return out
}

func (g *GenericPilot) runElector(stopCh <-chan struct{}) <-chan error {

@wallrj (Member) commented:


The stopCh isn't used.
Pass it to elector.Run?

@munnerz (Contributor, author) replied:

elector.Run does not accept a stopCh (and it is an upstream package) - I included the channel as an argument here for consistency with the other run* methods, and in case we do/can use it in future. It can be removed though since, as you say, it's not used.
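
For completeness, a guess at the shape of such a wrapper: the elector runs in its own goroutine and signals on a channel when it exits, which is what the "failed with error" / "unexpectedly exited" logging above reacts to. The body and the elector field name below are assumptions, not the actual code in this PR.

package pilot

import "k8s.io/client-go/tools/leaderelection"

// GenericPilot is stubbed here; the real type is defined elsewhere in this PR.
type GenericPilot struct {
    elector *leaderelection.LeaderElector // field name assumed
}

// runElector starts the elector in its own goroutine and signals on the
// returned channel when it exits. stopCh is accepted only for consistency
// with the other run* methods; in this client-go version elector.Run() takes
// no stop channel, so it cannot be passed through.
func (g *GenericPilot) runElector(stopCh <-chan struct{}) <-chan error {
    out := make(chan error, 1)
    go func() {
        defer close(out)
        g.elector.Run() // blocks until leadership is acquired and later lost
        out <- nil      // Run returns no error; a nil send just signals the exit
    }()
    return out
}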

id, err := os.Hostname()
if err != nil {
    return fmt.Errorf("error getting hostname: %s", err)
}

@wallrj (Member) commented:


Hostname is unique to the Pod running the pilot process, right?
What will it be? The Pod name?
In which case, it might be better to use the supplied --pilot-name, especially when we get round to writing tests for this.

@munnerz (Contributor, author) replied:

I'm not 100% sure on this one. Pilot name does make more sense; however, if there is a Deployment managing our Pods and the update strategy is set to rollingUpdate, there are instances where two pods can be running at once (and both would have the same pilot name). Typically the hostname is used with the leader election packages as it should be unique (hence why we use Hostname when performing this same leader election in navigator-controller).
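
For what it's worth, the pattern several upstream controllers use is hostname plus a random suffix, which stays unique even if two processes ever end up with the same hostname. A small sketch; the helper name is hypothetical:

package main

import (
    "fmt"
    "os"

    "k8s.io/apimachinery/pkg/util/uuid"
)

// electionIdentity builds the identity recorded in the leader election lock:
// the pod hostname plus a random suffix, so the value is unique per process
// even during a rolling update where two pods briefly coexist.
func electionIdentity() (string, error) {
    host, err := os.Hostname()
    if err != nil {
        return "", fmt.Errorf("error getting hostname: %s", err)
    }
    return fmt.Sprintf("%s_%s", host, uuid.NewUUID()), nil
}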

func (p *Pilot) leaderElectedSyncFunc(pilot *v1alpha1.Pilot) error {
    glog.V(4).Infof("ElasticsearchController: leader elected sync of pilot %q", pilot.Name)
    return nil
}

@wallrj (Member) commented:


What operations might be performed here?
#97 talks about excludeShards, but that seems like something that would be performed by the Pilot that had been selected to leave the cluster by the ES controller.

@munnerz (Contributor, author) replied:

So in the ES pilot case, excludeShards will be handled here. I've got some more work locally to actually do this. It needs to be leader elected as it requires updating the ES API with the new list of nodes that should be excluded. We could do it in each individual Pilot and just hope there are no races, but this way is far more concrete.
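
To make that concrete, a hedged sketch of what the leader-only step could look like against the Elasticsearch cluster settings API. The cluster.routing.allocation.exclude._name setting and the _cluster/settings endpoint are standard Elasticsearch, but the function name, URL handling, and error handling below are assumptions rather than the code on that local branch:

package main

import (
    "bytes"
    "encoding/json"
    "fmt"
    "net/http"
    "strings"
)

// excludeNodes asks Elasticsearch to move shards off the named nodes by
// setting cluster.routing.allocation.exclude._name via the cluster settings
// API. Only the elected leader pilot would call this, so the exclude list is
// always written from a single place.
func excludeNodes(esURL string, nodes []string) error {
    body, err := json.Marshal(map[string]interface{}{
        "transient": map[string]string{
            "cluster.routing.allocation.exclude._name": strings.Join(nodes, ","),
        },
    })
    if err != nil {
        return err
    }

    req, err := http.NewRequest(http.MethodPut, esURL+"/_cluster/settings", bytes.NewReader(body))
    if err != nil {
        return err
    }
    req.Header.Set("Content-Type", "application/json")

    resp, err := http.DefaultClient.Do(req)
    if err != nil {
        return err
    }
    defer resp.Body.Close()
    if resp.StatusCode != http.StatusOK {
        return fmt.Errorf("unexpected status updating exclude list: %s", resp.Status)
    }
    return nil
}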

{
    Name: "LEADER_ELECTION_CONFIG_MAP",
    // TODO: trim the length of this string
    Value: fmt.Sprintf("%s-leaderelection", c.Name),

@munnerz (Contributor, author) commented:


Note to self: this config map name is not unique enough
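
On the TODO above, one hedged option for keeping the name within the 253-character limit on ConfigMap names, while still distinguishing long cluster names, is to truncate and append a short hash. The helper below is hypothetical, not code from this PR:

package main

import (
    "fmt"
    "hash/fnv"
)

const maxNameLength = 253 // maximum length of a ConfigMap name

// leaderElectionConfigMapName derives a ConfigMap name from the cluster name,
// truncating it and appending a short hash of the full value so that long or
// similar cluster names still map to distinct, valid names.
func leaderElectionConfigMapName(clusterName string) string {
    name := fmt.Sprintf("%s-leaderelection", clusterName)
    if len(name) <= maxNameLength {
        return name
    }
    h := fnv.New32a()
    h.Write([]byte(name))
    suffix := fmt.Sprintf("-%08x", h.Sum32())
    return name[:maxNameLength-len(suffix)] + suffix
}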

@munnerz (Contributor, author) commented on Jan 11, 2018:

/test e2e

@wallrj (Member) commented on Jan 11, 2018:

/lgtm

@jetstack-ci-bot (Contributor) commented:

[APPROVALNOTIFIER] This PR is APPROVED

This pull-request has been approved by: wallrj

The full list of commands accepted by this bot can be found here.

Needs approval from an approver in each of these OWNERS Files:

You can indicate your approval by writing /approve in a comment
You can cancel your approval by writing /approve cancel in a comment

@jetstack-ci-bot (Contributor) commented:

/test all [submit-queue is verifying that this PR is safe to merge]

@jetstack-ci-bot (Contributor) commented:

Automatic merge from submit-queue.

@jetstack-ci-bot merged commit 2668110 into jetstack:master on Jan 11, 2018

Successfully merging this pull request may close the following issue:

  • Providing leader election mechanisms to Pilots (#97)