This repository has been archived by the owner on Apr 4, 2023. It is now read-only.

Perform leaderelection in GenericPilot #186

Merged: 3 commits merged into jetstack:master on Jan 11, 2018

Conversation

@munnerz (Contributor) commented on Jan 9, 2018:

What this PR does / why we need it:

Adds a leaderelection package and implements leader election in the GenericPilot.

Closes #97

Release note:

Add support for leader election in pilots
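
For context, the new leaderelection package wraps the upstream client-go leader election code, which coordinates instances through a lock object in the Kubernetes API. The sketch below shows roughly how that upstream API is driven. It is written against current client-go (a Lease lock and RunOrDie(ctx, ...)); this PR predates that API and uses a ConfigMap lock, and its elector's Run() takes no arguments, as discussed in the review below. The names here (runLeaderElection, the "pilot-leaderelection" lock name) are illustrative, not code from this PR.

package main

import (
    "context"
    "log"
    "os"
    "time"

    metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
    "k8s.io/client-go/kubernetes"
    "k8s.io/client-go/rest"
    "k8s.io/client-go/tools/leaderelection"
    "k8s.io/client-go/tools/leaderelection/resourcelock"
)

// runLeaderElection blocks, calling onLead only if this process wins the
// election for the named lock. Other candidates keep retrying in the
// background and take over if the current leader fails to renew its lease.
func runLeaderElection(ctx context.Context, client kubernetes.Interface, namespace, lockName string, onLead func(context.Context)) error {
    id, err := os.Hostname()
    if err != nil {
        return err
    }

    lock := &resourcelock.LeaseLock{
        LeaseMeta:  metav1.ObjectMeta{Namespace: namespace, Name: lockName},
        Client:     client.CoordinationV1(),
        LockConfig: resourcelock.ResourceLockConfig{Identity: id},
    }

    leaderelection.RunOrDie(ctx, leaderelection.LeaderElectionConfig{
        Lock:          lock,
        LeaseDuration: 15 * time.Second, // how long a lease is valid without renewal
        RenewDeadline: 10 * time.Second, // the leader must renew within this window
        RetryPeriod:   2 * time.Second,  // how often candidates retry acquisition
        Callbacks: leaderelection.LeaderCallbacks{
            OnStartedLeading: onLead,
            OnStoppedLeading: func() { log.Printf("%s: lost leadership", id) },
        },
    })
    return nil
}

func main() {
    cfg, err := rest.InClusterConfig()
    if err != nil {
        log.Fatal(err)
    }
    client := kubernetes.NewForConfigOrDie(cfg)

    err = runLeaderElection(context.Background(), client, "default", "pilot-leaderelection",
        func(ctx context.Context) {
            // Leader-only work (the LeaderElectedSyncFunc, in GenericPilot terms)
            // runs here until the context is cancelled.
            <-ctx.Done()
        })
    if err != nil {
        log.Fatal(err)
    }
}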

@munnerz (Contributor, author) commented on Jan 9, 2018:

/test e2e

@munnerz (Contributor, author) commented on Jan 9, 2018:

@wallrj this is ready for review now.

I think the best way for us to write tests for this is probably with integration tests for GenericPilot itself. The leaderelection package doesn't really need much testing as it is largely just a convenient proxy to the upstream kubernetes package.

We can then provide our own SyncFunc and LeaderElectedSyncFunc, run two instances of GenericPilot and ensure that only one of them receives calls to LeaderElectedSyncFunc.
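
A sketch of that "only one instance leads" assertion, written directly against the upstream elector with a fake clientset rather than against GenericPilot (whose test constructor isn't part of this diff). The lock name, identities, and timings below are assumptions, and it uses current client-go (Lease lock, RunOrDie) rather than the ConfigMap lock used here:

package leaderelection_test

import (
    "context"
    "sync/atomic"
    "testing"
    "time"

    metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
    "k8s.io/client-go/kubernetes/fake"
    "k8s.io/client-go/tools/leaderelection"
    "k8s.io/client-go/tools/leaderelection/resourcelock"
)

// TestOnlyOneLeader starts two electors sharing a single lock and checks that
// only one of them has its OnStartedLeading callback invoked. In a GenericPilot
// integration test this callback would be the instrumented LeaderElectedSyncFunc.
func TestOnlyOneLeader(t *testing.T) {
    client := fake.NewSimpleClientset()
    ctx, cancel := context.WithCancel(context.Background())
    defer cancel()

    var leaders int32
    for _, id := range []string{"pilot-a", "pilot-b"} {
        lock := &resourcelock.LeaseLock{
            LeaseMeta:  metav1.ObjectMeta{Namespace: "test", Name: "pilot-leaderelection"},
            Client:     client.CoordinationV1(),
            LockConfig: resourcelock.ResourceLockConfig{Identity: id},
        }
        go leaderelection.RunOrDie(ctx, leaderelection.LeaderElectionConfig{
            Lock:          lock,
            LeaseDuration: 15 * time.Second,
            RenewDeadline: 10 * time.Second,
            RetryPeriod:   2 * time.Second,
            Callbacks: leaderelection.LeaderCallbacks{
                OnStartedLeading: func(ctx context.Context) { atomic.AddInt32(&leaders, 1) },
                OnStoppedLeading: func() {},
            },
        })
    }

    // Give both electors a few retry periods to race for the lock, then check
    // that exactly one of them became leader.
    time.Sleep(5 * time.Second)
    if n := atomic.LoadInt32(&leaders); n != 1 {
        t.Fatalf("expected exactly one leader, got %d", n)
    }
}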

@wallrj (Member) left a comment:

Thanks @munnerz

Looks good to me.

  • I'm unsure what operations would be performed by the leading pilot, but I guess that will become clear in a follow-up branch.
  • I'd like to see some tests for this, here or in a follow-up branch.
  • I can't see what creates the leader election ConfigMap?

if err != nil {
    glog.V(4).Infof("Leader elector failed with error: %s", err)
} else {
    glog.V(4).Infof("Leader elector unexpectedly exited")

@wallrj (Member) commented:

Log these as errors?

@wallrj (Member) commented:

So as soon as the elector goroutine exits (e.g. with an error), the Run method will exit and close the stop channel.
But there doesn't seem to be anything here that waits for the controller and the process waiter goroutines to finish.
Shouldn't there be a WaitGroup or something here?

@munnerz (Contributor, author) replied:

So we don't explicitly wait for each goroutine to exit right now - instead we rely on the call to stop to perform any required shutdown operations (which may or may not include stopping the controller). The controller itself does get stopped, but not until after the call to stop has returned (the defer close(ctrlStopCh) a few lines above causes this).

We can change this to explicitly wait for the controller to finish 'draining' its workers, and perhaps we should. As we don't currently rely on that, though, I've left it out for now so as not to further hold up process exit. I can foresee us needing to change this in future as you say, since the controller is responsible for reporting Pilot status etc.
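
For reference, the explicit wait being discussed usually looks like the sketch below: each long-lived goroutine is tracked by a sync.WaitGroup, and Run only returns once they have all drained. The names here are hypothetical, not code from this branch.

package main

import "sync"

// run starts the given long-lived workers (controller, process waiter, ...)
// and, once stopCh is closed, waits for every one of them to return instead
// of exiting as soon as the elector goroutine stops.
func run(stopCh <-chan struct{}, workers ...func(<-chan struct{})) {
    var wg sync.WaitGroup
    for _, w := range workers {
        w := w // capture the loop variable (required before Go 1.22)
        wg.Add(1)
        go func() {
            defer wg.Done()
            w(stopCh) // each worker is expected to return once stopCh closes
        }()
    }
    <-stopCh // e.g. closed when the elector exits or a termination signal arrives
    wg.Wait()
}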

    return out
}

func (g *GenericPilot) runElector(stopCh <-chan struct{}) <-chan error {

@wallrj (Member) commented:


The stopCh isn't used.
Pass it to elector.Run?

@munnerz (Contributor, author) replied:

elector.Run does not accept a stopCh (and it is an upstream package) - I included the channel as an argument here for consistency with the other run* methods, and in case we do/can use it in future. It can be removed though since, as you say, it's not used.
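
For completeness, a guess at the shape of such a wrapper: the elector runs in its own goroutine and signals on a channel when it exits, which is what the "failed with error" / "unexpectedly exited" logging above reacts to. The body and the elector field name below are assumptions, not the actual code in this PR.

package pilot

import "k8s.io/client-go/tools/leaderelection"

// GenericPilot is stubbed here; the real type is defined elsewhere in this PR.
type GenericPilot struct {
    elector *leaderelection.LeaderElector // field name assumed
}

// runElector starts the elector in its own goroutine and signals on the
// returned channel when it exits. stopCh is accepted only for consistency
// with the other run* methods; in this client-go version elector.Run() takes
// no stop channel, so it cannot be passed through.
func (g *GenericPilot) runElector(stopCh <-chan struct{}) <-chan error {
    out := make(chan error, 1)
    go func() {
        defer close(out)
        g.elector.Run() // blocks until leadership is acquired and later lost
        out <- nil      // Run returns no error; a nil send just signals the exit
    }()
    return out
}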

id, err := os.Hostname()
if err != nil {
    return fmt.Errorf("error getting hostname: %s", err)
}

@wallrj (Member) commented:


Hostname is unique to the Pod running the pilot process, right?
What will it be? The Pod name?
In which case, it might be better to use the supplied --pilot-name, especially when we get round to writing tests for this.

@munnerz (Contributor, author) replied:

I'm not 100% sure on this one. Pilot name does make more sense; however, if there is a Deployment managing our Pods and the update strategy is set to rollingUpdate, there are instances where two pods can be running at once (and both would have the same pilot name). Typically the hostname is used with the leader election packages as it should be unique (hence why we use Hostname when performing this same leader election in navigator-controller).
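
For what it's worth, the pattern several upstream controllers use is hostname plus a random suffix, which stays unique even if two processes ever end up with the same hostname. A small sketch; the helper name is hypothetical:

package main

import (
    "fmt"
    "os"

    "k8s.io/apimachinery/pkg/util/uuid"
)

// electionIdentity builds the identity recorded in the leader election lock:
// the pod hostname plus a random suffix, so the value is unique per process
// even during a rolling update where two pods briefly coexist.
func electionIdentity() (string, error) {
    host, err := os.Hostname()
    if err != nil {
        return "", fmt.Errorf("error getting hostname: %s", err)
    }
    return fmt.Sprintf("%s_%s", host, uuid.NewUUID()), nil
}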

func (p *Pilot) leaderElectedSyncFunc(pilot *v1alpha1.Pilot) error {
    glog.V(4).Infof("ElasticsearchController: leader elected sync of pilot %q", pilot.Name)
    return nil
}

@wallrj (Member) commented:


What operations might be performed here?
#97 talks about excludeShards, but that seems like something that would be performed by the Pilot that had been selected to leave the cluster by the ES controller.

@munnerz (Contributor, author) replied:

So in the ES pilot case, excludeShards will be handled here. I've got some more work locally to actually do this. It needs to be leader elected as it requires updating the ES API with the new list of nodes that should be excluded. We could do it in each individual Pilot and just hope there are no races, but this way is far more concrete.
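
To make that concrete, a hedged sketch of what the leader-only step could look like against the Elasticsearch cluster settings API. The cluster.routing.allocation.exclude._name setting and the _cluster/settings endpoint are standard Elasticsearch, but the function name, URL handling, and error handling below are assumptions rather than the code on that local branch:

package main

import (
    "bytes"
    "encoding/json"
    "fmt"
    "net/http"
    "strings"
)

// excludeNodes asks Elasticsearch to move shards off the named nodes by
// setting cluster.routing.allocation.exclude._name via the cluster settings
// API. Only the elected leader pilot would call this, so the exclude list is
// always written from a single place.
func excludeNodes(esURL string, nodes []string) error {
    body, err := json.Marshal(map[string]interface{}{
        "transient": map[string]string{
            "cluster.routing.allocation.exclude._name": strings.Join(nodes, ","),
        },
    })
    if err != nil {
        return err
    }

    req, err := http.NewRequest(http.MethodPut, esURL+"/_cluster/settings", bytes.NewReader(body))
    if err != nil {
        return err
    }
    req.Header.Set("Content-Type", "application/json")

    resp, err := http.DefaultClient.Do(req)
    if err != nil {
        return err
    }
    defer resp.Body.Close()
    if resp.StatusCode != http.StatusOK {
        return fmt.Errorf("unexpected status updating exclude list: %s", resp.Status)
    }
    return nil
}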

{
    Name: "LEADER_ELECTION_CONFIG_MAP",
    // TODO: trim the length of this string
    Value: fmt.Sprintf("%s-leaderelection", c.Name),

@munnerz (Contributor, author) commented:


Note to self: this config map name is not unique enough
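
On the TODO above, one hedged option for keeping the name within the 253-character limit on ConfigMap names, while still distinguishing long cluster names, is to truncate and append a short hash. The helper below is hypothetical, not code from this PR:

package main

import (
    "fmt"
    "hash/fnv"
)

const maxNameLength = 253 // maximum length of a ConfigMap name

// leaderElectionConfigMapName derives a ConfigMap name from the cluster name,
// truncating it and appending a short hash of the full value so that long or
// similar cluster names still map to distinct, valid names.
func leaderElectionConfigMapName(clusterName string) string {
    name := fmt.Sprintf("%s-leaderelection", clusterName)
    if len(name) <= maxNameLength {
        return name
    }
    h := fnv.New32a()
    h.Write([]byte(name))
    suffix := fmt.Sprintf("-%08x", h.Sum32())
    return name[:maxNameLength-len(suffix)] + suffix
}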

@munnerz (Contributor, author) commented on Jan 11, 2018:

/test e2e

@wallrj (Member) commented on Jan 11, 2018:

/lgtm

@jetstack-ci-bot (Contributor) commented:

[APPROVALNOTIFIER] This PR is APPROVED

This pull-request has been approved by: wallrj

The full list of commands accepted by this bot can be found here.

Needs approval from an approver in each of these OWNERS Files:

You can indicate your approval by writing /approve in a comment
You can cancel your approval by writing /approve cancel in a comment

@jetstack-ci-bot (Contributor) commented:

/test all [submit-queue is verifying that this PR is safe to merge]

@jetstack-ci-bot (Contributor) commented:

Automatic merge from submit-queue.

@jetstack-ci-bot merged commit 2668110 into jetstack:master on Jan 11, 2018

Successfully merging this pull request may close the following issue:

  • Providing leader election mechanisms to Pilots (#97)