Improve HA recovery #6379

Merged
merged 5 commits into from Oct 5, 2016

Contributor

mjs commented Oct 5, 2016

This contains two fixes which improve Juju's handling of an HA primary change.

  1. A timeout is now used when dialling a mongodb server. This greatly reduces the time mgo takes to notice that a mongodb server has gone away, and therefore also speeds up Juju's recovery when the mongodb master changes.
  2. HackLeadership (now called KillWorkers) is now called on all States in the apiserver's StatePool. This stops a leadership API request for a hosted model from blocking apiserver shutdown (observed during HA failovers).

Fixes https://bugs.launchpad.net/juju/+bug/1588224

QA

Bootstrap and use enable-ha to create a 3-node controller. Once it is stable, stop the primary node. Watch the logs to see when the controllers recover. Before these changes it would take minutes for the controllers to recover and for the apiserver to come back up. With these changes the time between the primary being stopped and the apiserver being back up is < 60s.

There are many more improvements that can be made, but they are riskier and more invasive, and will be tackled in future PRs.

mjs added some commits Oct 4, 2016

mongo: Use a timeout when dialing
When dialling a mongodb server, use the timeout from the DialOpts. This
helps mgo to respond faster when a mongodb node goes down, and assists
with HA failover.

When not using a custom dial function, mgo will use the timeout with
net.DialTimeout in the same way.

Also switched to passing the dial function using the DialServer field of
mgo.DialInfo instead of the deprecated Dial field.

Some logging has been tidied up too: messages now refer to mongodb
instead of mongo, and log levels have been adjusted.

Part of the fix for https://bugs.launchpad.net/juju/+bug/1588224

Rename HackLeadership to KillWorkers
The new name better reflects what the method now does.

state: Add StatePool.KillWorkers
Allow the internal workers for each State held by a StatePool to be
killed. This will be used to help ensure the apiserver doesn't get stuck
when shutting down.

apiserver: Call KillWorkers on state pool too
KillWorkers was only being called on the controller State, meaning that
leadership calls for hosted models could still block apiserver
shutdown. This was sometimes preventing controller recovery following an
HA leadership change but could also happen at other times.

Part of the fix for https://bugs.launchpad.net/juju/+bug/1588224
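
The idea behind the KillWorkers commits can be sketched as follows. This is a simplified illustration, not the Juju implementation: the `State` and `StatePool` types below are hypothetical stand-ins that only record whether their workers were killed, and `killAllWorkers`, `NewStatePool`, `Add` and `Killed` are invented helper names.

```go
package main

import "sync"

// State is a hypothetical stand-in for a per-model State. The real type
// owns internal workers; here a flag records that they were killed.
type State struct {
	mu     sync.Mutex
	killed bool
}

// KillWorkers stops this State's internal workers (simulated by a flag).
func (st *State) KillWorkers() {
	st.mu.Lock()
	defer st.mu.Unlock()
	st.killed = true
}

// Killed reports whether KillWorkers has been called on this State.
func (st *State) Killed() bool {
	st.mu.Lock()
	defer st.mu.Unlock()
	return st.killed
}

// StatePool is a hypothetical pool of States keyed by model UUID.
type StatePool struct {
	mu   sync.Mutex
	pool map[string]*State
}

// NewStatePool returns an empty pool.
func NewStatePool() *StatePool {
	return &StatePool{pool: make(map[string]*State)}
}

// Add registers the State for a hosted model.
func (p *StatePool) Add(modelUUID string, st *State) {
	p.mu.Lock()
	defer p.mu.Unlock()
	p.pool[modelUUID] = st
}

// KillWorkers kills the workers of every State held by the pool.
func (p *StatePool) KillWorkers() {
	p.mu.Lock()
	defer p.mu.Unlock()
	for _, st := range p.pool {
		st.KillWorkers()
	}
}

// killAllWorkers shows the shutdown step: kill workers on the controller
// State and on every pooled State, so that no hosted-model request keeps
// the server from stopping.
func killAllWorkers(controller *State, pool *StatePool) {
	controller.KillWorkers()
	pool.KillWorkers()
}
```

Killing only the controller State's workers would leave the pooled States untouched, which corresponds to the situation the commit message describes.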

Looks great

mongo/open.go
if err != nil {
- logger.Debugf("connection failed, will retry: %v", err)
+ logger.Errorf("mongodb connection failed, will retry: %v", err)

wallyworld Oct 5, 2016

Owner

s/error/warning

Errors are supposed to be something a user acts on, IIRC. The fact that it says "will retry" implies a warning is sufficient.


mjs Oct 5, 2016

Contributor

Yep, you're right

mongo: Tweak log levels
Use warning instead of error for mongo connection issues which the user
can't really do much about.
Contributor

mjs commented Oct 5, 2016

$$merge$$

Contributor

jujubot commented Oct 5, 2016

Status: merge request accepted. Url: http://juju-ci.vapour.ws:8080/job/github-merge-juju

jujubot merged commit a73d91e into juju:master Oct 5, 2016

mjs deleted the mjs:1588224-minimal-fix branch Oct 5, 2016
