Use lease management for singular controller workers #7984

Merged
merged 7 commits into juju:develop on Oct 30, 2017

Member

axw commented Oct 29, 2017

Description of change

The aim of this PR is to get rid of Mongo mastership as a means of running singular controller workers, and while doing so, move several more state workers into the machine manifold (txnpruner, dblogpruner).

The "singular" lease manager is updated to manage leases for either the controller or model. The machine manifolds contain a singular worker that runs for all controller agents, and attempts to claim a lease on the controller. The txnpruner, dblogpruner, and externalcontrollerupdater manifolds/workers depend on that to start up.

We introduce two new flags to the machine manifolds: is-controller-flag, and is-responsible-controller-flag. The former is used to indicate that the agent is a controller, and the latter indicates that the agent has claimed responsibility for the controller.
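To illustrate the lease idea described above, here is a minimal sketch of how several controller agents could compete for a single controller lease. It is not the code from this PR: `leaseStore`, `manualClock`, `leaseDuration`, and the `Claim` signature are hypothetical stand-ins, and the real lease manager persists leases in state rather than in memory.

```go
package main

import (
	"sync"
	"time"
)

// leaseDuration is an illustrative value, not one taken from the PR.
const leaseDuration = time.Minute

// manualClock is a hypothetical hand-advanced clock that keeps the
// example deterministic. It is not safe for concurrent use.
type manualClock struct{ now time.Time }

func (c *manualClock) Now() time.Time          { return c.now }
func (c *manualClock) Advance(d time.Duration) { c.now = c.now.Add(d) }

type lease struct {
	holder string
	expiry time.Time
}

// leaseStore is a hypothetical in-memory stand-in for the lease manager.
type leaseStore struct {
	mu     sync.Mutex
	clock  *manualClock
	leases map[string]lease
}

func newLeaseStore(clock *manualClock) *leaseStore {
	return &leaseStore{clock: clock, leases: make(map[string]lease)}
}

// Claim grants or extends the named lease when it is free, expired, or
// already held by claimant. It reports whether claimant now holds it.
func (s *leaseStore) Claim(name, claimant string, d time.Duration) bool {
	s.mu.Lock()
	defer s.mu.Unlock()
	now := s.clock.Now()
	cur, ok := s.leases[name]
	if ok && cur.holder != claimant && now.Before(cur.expiry) {
		return false // another agent holds an unexpired lease
	}
	s.leases[name] = lease{holder: claimant, expiry: now.Add(d)}
	return true
}
```

In this sketch, an agent whose `Claim("controller", "machine-0", leaseDuration)` call returns true would be the one allowed to run the singular workers; the other agents keep retrying and take over once the lease expires.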

QA steps

  1. juju bootstrap localhost
  2. juju enable-ha
    (wait for all)
  3. juju deploy ubuntu
  4. check that only machine 0 in the controller model is running "external-controller-updater", "log-pruner", and "transaction-pruner" (use the juju-engine-report command)
  5. juju ssh -m controller 0 sudo service jujud-machine-0 stop
  6. wait for another agent to claim the controller lease, again checking only one agent runs the workers specified above

Documentation changes

None.

Bug reference

https://bugs.launchpad.net/juju/+bug/1726680

@axw axw changed the title from Jujud singular controller to Use lease management for singular controller workers Oct 29, 2017

api/singular/api.go
}
-// Claim attempts to claim responsibility for model administration for the
-// supplied duration. If the claim is denied, it will return
+// ClaimModel attempts to claim responsibility for administration of the entity

Looks good, love the deleted mongo code.
Might pay to get a second opinion just in case.

+ modelTag names.ModelTag
+}
+
+func (b *stateBackend) ModelTag() names.ModelTag {
@wallyworld

wallyworld Oct 30, 2017

Owner

// ModelTag tells the Facade what models it should consider requests for.

@axw

axw Oct 30, 2017

Member

// ModelTag is part of the Backend interface

apiserver/params/internal.go
- ModelTag string `json:"model-tag"`
- ControllerTag string `json:"controller-tag"`
- Duration time.Duration `json:"duration"`
+ Tag string `json:"tag"`
@wallyworld

wallyworld Oct 30, 2017

Owner

EntityTag perhaps?
Then we have qualified names for both EntityTag and ClaimantTag

cmd/jujud/agent/machine/manifolds.go
+
+ // SingularFlagDuration defines for how long this agent will ask
+ // for controller administration rights.
+ SingularFlagDuration time.Duration
@wallyworld

wallyworld Oct 30, 2017

Owner

The var name here is not very descriptive - when I saw it in the code above, I had to scroll to its definition to have any clue what it was for. Perhaps ControllerAdminAttemptDuration? Or something.

@axw

axw Oct 30, 2017

Member

renamed to ControllerLeaseDuration, hopefully that's more meaningful? ControllerAdminAttemptDuration says nothing to me

cmd/jujud/agent/machine/stateflag.go
+// State, and returns a worker implementing engine.Flag, whose
+// Check method always returns true. This is used for flagging
+// that the machine is a controller/model manager.
+func stateFlagManifold() dependency.Manifold {
@wallyworld

wallyworld Oct 30, 2017

Owner

Could this be called
isControllerManifold()

@axw

axw Oct 30, 2017

Member

ambivalent, but changed it

+ }
+
+ go func() {
+ worker.Wait()
@wallyworld

wallyworld Oct 30, 2017

Owner

Can we log any error here?

@axw

axw Oct 30, 2017

Member

We could but it would be inappropriate. The purpose of this goroutine is just to decref the State when the worker is done.
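A rough sketch of the pattern being discussed, for readers unfamiliar with it: a goroutine whose only job is to release a shared resource once a worker finishes. The names `waiter`, `stubWorker`, and `releaseWhenDone` are hypothetical and not part of the PR; the returned channel exists only so the example can be observed.

```go
package main

// waiter is a hypothetical minimal view of a worker: something
// whose Wait method blocks until the worker has stopped.
type waiter interface {
	Wait() error
}

// stubWorker is a hypothetical worker that stops when done is closed.
type stubWorker struct{ done chan struct{} }

func (w *stubWorker) Wait() error {
	<-w.done
	return nil
}

// releaseWhenDone waits for w to stop and then calls release exactly once.
// The error from Wait is deliberately ignored here: reporting it is the
// job of whoever owns the worker, not of this cleanup goroutine.
// The returned channel is closed after release has run.
func releaseWhenDone(w waiter, release func()) <-chan struct{} {
	finished := make(chan struct{})
	go func() {
		defer close(finished)
		_ = w.Wait()
		release()
	}()
	return finished
}
```

A caller would pass the worker plus a closure that drops its reference to the shared resource, for example `releaseWhenDone(w, func() { refs-- })` with a hypothetical counter.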

some small issues, I'm not sure that many of them need changes, but some comments nonetheless.

+ entity names.Tag,
+) (*API, error) {
+ if !names.IsValidMachine(claimant.Id()) {
+ return nil, errors.NotValidf("claimant tag")
@jameinel

jameinel Oct 30, 2017

Owner

I'm a bit surprised we need this, given claimant is a names.MachineTag how do you create a MachineTag that isn't a valid machine?

@axw

axw Oct 30, 2017

Member

The zero value is the only way you could have an invalid MachineTag.
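The point about zero values can be shown with a small self-contained sketch. `machineTag`, `parseMachineID`, and `checkClaimant` below are hypothetical stand-ins rather than the real `names` package, and the ID pattern is deliberately simplified to plain numeric IDs.

```go
package main

import (
	"errors"
	"regexp"
)

// machineTag is a hypothetical stand-in for names.MachineTag: the id
// field is unexported, so callers outside the package can only obtain
// a populated value through a validating constructor.
type machineTag struct{ id string }

func (t machineTag) Id() string { return t.id }

// validMachineID is a simplified assumption: plain numeric ids only.
var validMachineID = regexp.MustCompile(`^(0|[1-9][0-9]*)$`)

// parseMachineID is the only way to build a non-zero machineTag.
func parseMachineID(id string) (machineTag, error) {
	if !validMachineID.MatchString(id) {
		return machineTag{}, errors.New("invalid machine id")
	}
	return machineTag{id: id}, nil
}

// checkClaimant rejects the zero value, whose id is the empty string.
func checkClaimant(claimant machineTag) error {
	if !validMachineID.MatchString(claimant.Id()) {
		return errors.New("claimant tag not valid")
	}
	return nil
}
```

With these definitions, `var zero machineTag` compiles without going through the constructor, and `checkClaimant(zero)` is the guard that catches it.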

@@ -224,7 +224,7 @@ func AllFacades() *facade.Registry {
reg("Resumer", 2, resumer.NewResumerAPI)
reg("RetryStrategy", 1, retrystrategy.NewRetryStrategyAPI)
- reg("Singular", 1, singular.NewExternalFacade)
+ reg("Singular", 2, singular.NewExternalFacade)
@jameinel

jameinel Oct 30, 2017

Owner

are we not registering version 1 somewhere else? Are we just not worried about older versions because this runs in-process?

@axw

axw Oct 30, 2017

Member

Correct. I rationalised in a commit message, but forgot to carry over to the PR description:

The facade version has been bumped, but the old
version has been dropped entirely. Controllers
are always in sync with both their API client
and API server code, and this is a controller-only
API; there's no point in carrying the old version.

- claimer lease.Claimer
+ auth facade.Authorizer
+ controller names.ControllerTag
+ model names.ModelTag
@jameinel

jameinel Oct 30, 2017

Owner

would it be clearer if the member was 'modeltag' instead of 'model' which sounds like it is the actual Model object not a ModelTag object?

@axw

axw Oct 30, 2017

Member

yep, done

}
+ err = facade.claimer.WaitUntilExpired(leaseId)
@jameinel

jameinel Oct 30, 2017

Owner

This actually seems to have odd semantics if we ever actually passed in more than 1 entity. Namely, we don't wait for all of them in parallel, we wait for them sequentially, which actually means that we might expire the first request by the time we manage to grab the last one, given nothing in here seems to be refreshing any claimed leases during the wait time.
Not your code, so I'm happy enough with a caveat/comment/bug, but this does look like if you had explicitly staggered leases, lease-1 expires in 10s, lease-2 in 25s, and lease-3 actually refreshes before lease-2 is grabbed, so it actually isn't until 55s, then lease-1 has actually expired before we finish grabbing #3.

@axw

axw Oct 30, 2017

Member

Yikes, indeed - hadn't occurred to me. I'll add a comment.
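For reference, one possible way to avoid the sequential-wait problem raised here is to run the blocking waits concurrently. This is only a sketch of that alternative, not what the PR does (the PR adds a comment), and `waitAll` is a hypothetical helper.

```go
package main

import "sync"

// waitAll runs every blocking wait in its own goroutine and returns
// the results in input order, so a slow first wait cannot delay the
// start of the others.
func waitAll(waits []func() error) []error {
	errs := make([]error, len(waits))
	var wg sync.WaitGroup
	for i, wait := range waits {
		wg.Add(1)
		go func(i int, wait func() error) {
			defer wg.Done()
			errs[i] = wait() // each goroutine writes only its own slot
		}(i, wait)
	}
	wg.Wait()
	return errs
}
```

Each element of `waits` would wrap one blocking call, such as a closure around a wait on a single lease. Refreshing leases already claimed while the others are pending is a separate concern that this sketch does not address.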

@@ -1921,45 +1908,6 @@ func (a *MachineAgent) uninstallAgent() error {
return errors.Errorf("uninstall failed: %v", errs)
}
-type MongoSessioner interface {
- MongoSession() *mgo.Session
@jameinel

jameinel Oct 30, 2017

Owner

very nice to see these go

+ Flags: []string{
+ isControllerFlagName,
+ },
+}.Decorate
@jameinel

jameinel Oct 30, 2017

Owner

While these seem cute and functional, and make for terse descriptions of the components to run, it feels a little like they contribute to the 'logging' spew when trying to figure out what is/should be/isn't running right now. Not running because "flag not set" is often not very understandable. Is it supposed to be set? Is something currently running and going to be setting it soon?

@axw

axw Oct 30, 2017

Member

Yes, they are a bit noisy, and not completely clear when debugging. I don't have immediate ideas for improvement, will have to ponder.

cmd/jujud/agent/machine/manifolds.go
+ isResponsibleControllerFlagName = "is-responsible-controller-flag"
+ isControllerFlagName = "is-controller-flag"
+ logPrunerName = "log-pruner"
+ txnPrunerName = "transaction-pruner"
@jameinel

jameinel Oct 30, 2017

Owner

this makes me feel like it's reasonable to break this into sections so we don't have to churn the entire table when we add a flag that is slightly longer.

@@ -53,6 +54,9 @@ func (*ManifoldsSuite) TestManifoldNames(c *gc.C) {
"fan-configurer",
"global-clock-updater",
"host-key-reporter",
+ "is-controller-flag",
+ "is-responsible-controller-flag",
@jameinel

jameinel Oct 30, 2017

Owner

'is-responsible-controller-flag' still feels really hard for me to get my head around. responsible for what?

is-primary-controller
is master controller
is singleton controller

I don't know that I have anything better.

@axw

axw Oct 30, 2017

Member

Yeah, I just copied from the model worker, figured it was better to be consistent. I like "is-primary-controller-flag", will go with that.

worker/dblogpruner/worker.go
+ if pruneTimer == nil {
+ pruneTimer = w.config.Clock.NewTimer(w.config.PruneInterval)
+ pruneCh = pruneTimer.Chan()
+ defer pruneTimer.Stop()
@jameinel

jameinel Oct 30, 2017

Owner

is there a reason to defer creating the pruneTimer until now?
Could we just set up the pruneTimer outside of the loop?
Also, shouldn't we be calling pruneTimer.Reset(w.config.PruneInterval) when the config changes?

@axw

axw Oct 30, 2017

Member

> is there a reason to defer creating the pruneTimer until now?

Yes, the reason is that we don't have any configuration until we get the first watcher notification. I'll comment.

> Also, shouldn't we be calling pruneTimer.Reset(w.config.PruneInterval) when the config changes?

Nope, pruning should happen once every PruneInterval, counted from when we get the first config notification. The interval is statically configured via the manifold config.
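The lazy timer creation discussed in this thread can be sketched as follows. This is a simplified illustration, not the worker's actual loop: `pruneLoop` and its parameters are hypothetical, and `newTick` stands in for creating a timer with the statically configured interval and taking its channel.

```go
package main

// pruneLoop prunes once per tick, but only starts ticking after the
// first configuration notification arrives. Until then pruneCh is nil,
// and receiving from a nil channel blocks forever, so that select case
// is effectively disabled.
func pruneLoop(
	configChanges <-chan struct{}, // watcher notifications
	stop <-chan struct{}, // closed to end the loop
	newTick func() <-chan struct{}, // hypothetical timer factory
	prune func(),
) {
	var pruneCh <-chan struct{}
	for {
		select {
		case <-stop:
			return
		case <-configChanges:
			// Later config changes do not restart the timer,
			// because the interval itself never changes.
			if pruneCh == nil {
				pruneCh = newTick()
			}
		case <-pruneCh:
			prune()
			pruneCh = newTick() // schedule the next prune
		}
	}
}
```

The nil-channel trick is what lets the timer be declared outside the loop body while still being created only once configuration is known.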

axw added some commits Oct 29, 2017

state: update singular to permit controller leases

Update the singular lease manager to permit leases
to be taken out on the controller UUID, as well as
model UUIDs. This will be used to replace the singular
runner currently using Mongo mastership.

{api,apiserver}/singular: accept controller tags

Update the Singular API to accept controller tags
as well as model tags. Also rename the existing
"ControllerTag" to "ClaimantTag". The former predates
what we now know as controller tags, and really means
"machine tag of a controller agent". Renaming to
ClaimantTag gives us some room to allow for unit
agents if we want to extend the use of singular
any more.

The facade version has been bumped, but the old
version has been dropped entirely. Controllers
are always in sync with both their API client
and API server code, and this is a controller-only
API; there's no point in carrying the old version.

worker/dblogpruner: add manifold

Add a manifold for worker/dblogpruner, so that
it can be managed by the machine agent's dependency
engine.

worker/txnpruner: add manifold

Add a manifold for worker/txnpruner, so that
it can be managed by the machine agent's dependency
engine.

worker/externalcontrollerupdater: drop StateName

Drop StateName input from the worker. It was being
used so that we would run the worker only on controller
machines. What we really should be doing though is
running it on only one controller machine at a time.
We'll replace this with an engine housing wrapper.

worker/singular: support claiming controller

Update worker/singular to support claiming for
administration of the controller, and not just
a specific model. This will be used for running
singular workers at the controller level.

Also, drop the old Mongo mastership related code.

cmd/jujud/agent: controller-singular workers

Update the machine agent to support running
singular workers in the machine dependency
engine. We introduce two new flags:

 - is-controller-flag
 - is-responsible-controller-flag

The former is set for all controller machines;
the latter is set for at most one controller
machine at a time, by running a singular worker
claiming a lease for the controller.

The txnpruner, dblogpruner, and
externalcontrollerupdater workers are dependent
on the is-responsible-controller-flag input.

Member

axw commented Oct 30, 2017

$$merge$$

Contributor

jujubot commented Oct 30, 2017

Status: merge request accepted. Url: http://ci.jujucharms.com/job/github-merge-juju

@jujubot jujubot merged commit fd5fb93 into juju:develop Oct 30, 2017

1 check was pending

continuous-integration/jenkins/pr-merge This commit is being built