Restart agent when txn log errors occur; provider configurable txn log size #7382

Merged
merged 2 commits into from May 26, 2017

Conversation

Projects
None yet
4 participants
Owner

wallyworld commented May 24, 2017

Description of change

During scale testing, destroying many models all at once would cause the capped txn log to roll over from underneath the watcher iterator. The watcher would error and things would go pear shaped from there.

We now catch that error in the watcher and bubble it up as a fatal agent error. This causes the agent to restart. The restart has the effect of starting from a known point again.

A small refactoring was done as part of this work to only initialise the database collections during state initialisation at bootstrap time, rather than every time a state instance is opened. This was needed because it was only at init time that the controller config with the max txn log size is available, plus it is dumb to initialise the collections after initialisation.

QA steps

$ juju bootstrap aws --config max-txn-log-size=1M
$ juju controller-config
verify that the max-txn-log-size is shown as 1M
run juju debug-log
create many models and destroy all at once
verify that debug log shows the capped position lost error and the agent should be restarted error and then exits as the agent is killed
run juju debug-log again and see that the logs show the agent has restarted

$ juju bootstrap aws
$ juju controller-config
verify that the max-txn-log-size is shown as 10M

bootstrap with an older client and upgrade
$ juju controller-config
verify that the max-txn-log-size is shown as 10M

Documentation changes

There's a new controller config attribute - max-txn-log-size, defaults to 10M
// MaxTxnLogSize is the maximum size the of capped txn log collection, eg "10M"
MaxTxnLogSize = "max-txn-log-size"

Bug reference

https://bugs.launchpad.net/juju/+bug/1692792

Owner

jameinel commented May 24, 2017

What about after upgrades, etc. There are more times than just bootstrap where we will need to initialize all of the tables.

@@ -83,7 +84,10 @@ func (w *commonWatcher) commonLoop() {
defer wg.Done()
<-w.tomb.Dying()
if err := w.call("Stop", nil); err != nil {
- logger.Errorf("error trying to stop watcher: %v", err)
+ // Don't log an error if a watcher is stopped due to an agent restart.
+ if err.Error() != worker.ErrRestartAgent.Error() {
@howbazaar

howbazaar May 24, 2017

Owner

Why not use errors.Cause(err) != worker.ErrRestartAgent?

@wallyworld

wallyworld May 25, 2017

Owner

Because the error here is a comes from across the wire and just contains a string message

modelTag := names.NewModelTag(args.ControllerModelArgs.Config.UUID())
- st, err := open(modelTag, args.MongoInfo, args.MongoDialOpts, args.NewPolicy, args.Clock, nil)
+ if !names.IsValidModel(modelTag.Id()) {
@howbazaar

howbazaar May 24, 2017

Owner

In what situations would this ever occur?

@wallyworld

wallyworld May 24, 2017

Owner

Not sure since you would expect the uuid to be validated elsewhere. But the code had this check already so I kept it.

state/open.go
@@ -526,18 +558,13 @@ func newState(
}
}()
- // Set up database.
+ schema := allCollections()
@howbazaar

howbazaar May 24, 2017

Owner

Any reason not to put these two lines into the &database initializer?

+ current: make(map[watchKey]int64),
+ request: make(chan interface{}),
+ }
+ if w.iteratorFunc == nil {
@howbazaar

howbazaar May 24, 2017

Owner

This is quite ick.

Couldn't you instead keep New function taking one arg, but having a test method that provider an iterator override?

Having the testing arg part of the public interface is not nice.

@wallyworld

wallyworld May 24, 2017

Owner

We use this pattern elsewhere. But we also use a test New function also in other places. Matter of taste as to how one does DI. Doing it this way makes the dependency explicit. I can go either way.

@@ -441,7 +464,15 @@ func (w *Watcher) sync() error {
}
}
if err := iter.Close(); err != nil {
@howbazaar

howbazaar May 24, 2017

Owner

Isn't there an iter.Err() that we should be looking for, rather than just using Close?

@wallyworld

wallyworld May 24, 2017

Owner

As far as I can tell, you only get the error when closing the iter

state/watcher/watcher_test.go
+type badIter struct {
+ *mgo.Iter
+
+ errorAfter *int
@howbazaar

howbazaar May 24, 2017

Owner

Why a pointer? Since the interface is supported by a *badIter, you can just use a normal int here.

@wallyworld

wallyworld May 24, 2017

Owner

Because of how the badIter was constructed, but I can refactor

Owner

howbazaar commented May 24, 2017

As @jameinel says, can we reconfigure the max txn log size after the controller has been set up?

Owner

wallyworld commented May 25, 2017

Upgrades is a good point. I still think it's bad to, regardless of context, do the collection setup each time a new state is created. eg state.ForModel() should not have to do it. I think it would be better to have the upgrade process explicitly redo the collections as part of the upgrade.

Also, controller settings are immutable at the moment. If/when that changes, again, I'd rather be explicit about messing with the collections, rather than just doing it every time we open a state.

Owner

jameinel commented May 25, 2017

I'm perfectly happy to have an explicit point where we do DB reconfiguration (be it Upgrades or Install, or even user initiated reset and restart). But I wanted to make sure we didn't forget things like upgrades. Especially wrt, how will things function if you upgrade to a version that now has this config, but never had it set.

As far as initialization of collections, I'm actually more concerned that we'll add a collection that, say, has a really important second index, but because we aren't applying the DB schema, we forget to create that index when upgrading. Mongo does generally follow the rule of "ensure this index exists" rather than "create an explicit index" for this precise reason. You can always just ask to make sure the index you want exists.

Owner

wallyworld commented May 25, 2017

The latest revision ensures that each time the agent is restarted, we apply indices, create explicit collections etc. This covers the upgrade case, as well as not unnecessarily doing it each time we simply get a new state from the state pool etc.

Owner

wallyworld commented May 25, 2017

$$merge$$

Contributor

jujubot commented May 25, 2017

Status: merge request accepted. Url: http://juju-ci.vapour.ws:8080/job/github-merge-juju

Contributor

jujubot commented May 25, 2017

Build failed: Tests failed
build url: http://juju-ci.vapour.ws:8080/job/github-merge-juju/10986

wallyworld added some commits May 24, 2017

Address code review comments:
- ensure state collections are initialised each time the agent re-starts so upgrades are handled
- watcher test improvements
Owner

wallyworld commented May 26, 2017

$$merge$$

Contributor

jujubot commented May 26, 2017

Status: merge request accepted. Url: http://juju-ci.vapour.ws:8080/job/github-merge-juju

@jujubot jujubot merged commit 67513bf into juju:develop May 26, 2017

1 check failed

github-check-merge-juju Built PR, ran unit tests, and tested LXD deploy. Use !!.*!! to request another build. IE, !!build!!, !!retry!!
Details
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment