Restart agent when txn log errors occur; provider configurable txn log size #7382

wallyworld · 2017-05-24T09:29:05Z

Description of change

During scale testing, destroying many models at once caused the capped txn log to roll over underneath the watcher iterator. The watcher would then error, and things went pear-shaped from there.

We now catch that error in the watcher and bubble it up as a fatal agent error. This causes the agent to restart. The restart has the effect of starting from a known point again.
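The catch-and-escalate flow described above can be sketched roughly as follows. This is a minimal illustration, not the code from the PR: the `iterator` interface, `fakeIter`, `newFake`, `syncOnce`, and the "CappedPositionLost" marker string are assumptions made for the example, and `ErrRestartAgent` is a local stand-in for the worker error the PR refers to.

```go
package main

import (
	"errors"
	"fmt"
	"strings"
)

// ErrRestartAgent is a local stand-in for the fatal error that makes the
// agent restart (the PR refers to worker.ErrRestartAgent).
var ErrRestartAgent = errors.New("agent should be restarted")

// cappedPositionLost is an assumed marker for the error reported when a
// capped collection rolls over beneath an open cursor.
const cappedPositionLost = "CappedPositionLost"

// iterator is a hypothetical, minimal view of a txn log cursor.
type iterator interface {
	Next(out *int64) bool
	Close() error
}

// syncOnce drains the iterator. A capped-position-lost failure is turned
// into ErrRestartAgent; any other error is returned with context.
func syncOnce(iter iterator) (int, error) {
	seen := 0
	var revno int64
	for iter.Next(&revno) {
		seen++
	}
	if err := iter.Close(); err != nil {
		if strings.Contains(err.Error(), cappedPositionLost) {
			return seen, ErrRestartAgent
		}
		return seen, fmt.Errorf("watcher iteration error: %w", err)
	}
	return seen, nil
}

// fakeIter yields n entries and then reports closeErr from Close.
type fakeIter struct {
	n        int
	closeErr error
}

func (f *fakeIter) Next(out *int64) bool {
	if f.n == 0 {
		return false
	}
	f.n--
	*out = int64(f.n)
	return true
}

func (f *fakeIter) Close() error { return f.closeErr }

// newFake builds a fakeIter; an empty msg means Close succeeds.
func newFake(n int, msg string) *fakeIter {
	if msg == "" {
		return &fakeIter{n: n}
	}
	return &fakeIter{n: n, closeErr: errors.New(msg)}
}

func main() {
	_, err := syncOnce(newFake(3, "CappedPositionLost: cursor overrun"))
	fmt.Println("restart requested:", errors.Is(err, ErrRestartAgent))
}
```

Returning a dedicated sentinel rather than the raw database error lets whatever supervises the worker decide on a restart without knowing anything about the database driver.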

A small refactoring was done as part of this work so that the database collections are initialised only during state initialisation at bootstrap time, rather than every time a state instance is opened. This was needed because the controller config carrying the max txn log size is only available at init time; it is also wasteful to initialise the collections again after initialisation.

QA steps

$ juju bootstrap aws --config max-txn-log-size=1M
$ juju controller-config
verify that the max-txn-log-size is shown as 1M
run juju debug-log
create many models and destroy all at once
verify that the debug log shows the capped position lost error followed by the agent should be restarted error, and that debug-log then exits as the agent is killed
run juju debug-log again and see that the logs show the agent has restarted

$ juju bootstrap aws
$ juju controller-config
verify that the max-txn-log-size is shown as 10M

bootstrap with an older client and upgrade
$ juju controller-config
verify that the max-txn-log-size is shown as 10M

Documentation changes

There's a new controller config attribute - max-txn-log-size, defaults to 10M
// MaxTxnLogSize is the maximum size of the capped txn log collection, eg "10M"
MaxTxnLogSize = "max-txn-log-size"
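To make the "10M" notation concrete, here is a small sketch of how such a value could be turned into a byte count before sizing a capped collection. `parseSizeBytes` is a hypothetical helper written for this example, and treating K/M/G as powers of 1024 is an assumption; the actual parsing used by Juju may differ.

```go
package main

import (
	"fmt"
	"strconv"
	"strings"
)

// parseSizeBytes converts a size such as "10M" or "1G" into bytes.
// Hypothetical helper: suffixes K, M and G are read as powers of 1024,
// and a bare number is taken to be bytes.
func parseSizeBytes(s string) (int64, error) {
	orig := s
	s = strings.TrimSpace(s)
	if s == "" {
		return 0, fmt.Errorf("empty size")
	}
	multipliers := map[byte]int64{'K': 1 << 10, 'M': 1 << 20, 'G': 1 << 30}
	mult := int64(1)
	if m, ok := multipliers[s[len(s)-1]]; ok {
		mult = m
		s = s[:len(s)-1]
	}
	n, err := strconv.ParseInt(s, 10, 64)
	if err != nil || n <= 0 {
		return 0, fmt.Errorf("invalid size %q", orig)
	}
	return n * mult, nil
}

func main() {
	n, err := parseSizeBytes("10M")
	fmt.Println(n, err) // 10485760 <nil>
}
```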

Bug reference

https://bugs.launchpad.net/juju/+bug/1692792

jameinel · 2017-05-24T12:35:29Z

What about after upgrades, etc.? There are more times than just bootstrap when we will need to initialize all of the tables.

howbazaar · 2017-05-24T22:29:38Z

api/watcher/watcher.go

@@ -83,7 +84,10 @@ func (w *commonWatcher) commonLoop() {
 		defer wg.Done()
 		<-w.tomb.Dying()
 		if err := w.call("Stop", nil); err != nil {
-			logger.Errorf("error trying to stop watcher: %v", err)
+			// Don't log an error if a watcher is stopped due to an agent restart.
+			if err.Error() != worker.ErrRestartAgent.Error() {


Why not use errors.Cause(err) != worker.ErrRestartAgent?

Because the error here comes from across the wire and just contains a string message.
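The point about errors crossing the wire can be illustrated with a short sketch. `wireError` and `overTheWire` are hypothetical names that simulate an API client rebuilding an error from its message, and `ErrRestartAgent` is a local stand-in for the worker error; none of this is the PR's own code.

```go
package main

import (
	"errors"
	"fmt"
)

// ErrRestartAgent is a local stand-in for worker.ErrRestartAgent.
var ErrRestartAgent = errors.New("agent should be restarted")

// wireError mimics an error that a client rebuilds from a message string.
type wireError struct{ Message string }

func (e *wireError) Error() string { return e.Message }

// overTheWire simulates serialising an error on the server and
// reconstructing it on the client: only the message survives.
func overTheWire(err error) error {
	return &wireError{Message: err.Error()}
}

// isRestartAgent compares by message, which is all that is left of the
// original error after the round trip.
func isRestartAgent(err error) bool {
	return err != nil && err.Error() == ErrRestartAgent.Error()
}

func main() {
	remote := overTheWire(ErrRestartAgent)
	fmt.Println("identity match:", remote == ErrRestartAgent)
	fmt.Println("errors.Is match:", errors.Is(remote, ErrRestartAgent))
	fmt.Println("message match:", isRestartAgent(remote))
}
```

Identity-based checks fail here because the reconstructed value is a different object of a different type; only the string comparison still recognises it.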

howbazaar · 2017-05-24T22:39:43Z

state/open.go

 	modelTag := names.NewModelTag(args.ControllerModelArgs.Config.UUID())
-	st, err := open(modelTag, args.MongoInfo, args.MongoDialOpts, args.NewPolicy, args.Clock, nil)
+	if !names.IsValidModel(modelTag.Id()) {


In what situations would this ever occur?

Not sure since you would expect the uuid to be validated elsewhere. But the code had this check already so I kept it.

howbazaar · 2017-05-24T22:41:58Z

state/open.go

@@ -526,18 +558,13 @@ func newState(
 		}
 	}()

-	// Set up database.
+	schema := allCollections()


Any reason not to put these two lines into the &database initializer?

howbazaar · 2017-05-24T22:44:49Z

state/watcher/watcher.go

+		current:      make(map[watchKey]int64),
+		request:      make(chan interface{}),
+	}
+	if w.iteratorFunc == nil {


This is quite ick.

Couldn't you instead keep the New function taking one arg, but have a test method that provides an iterator override?

Having the testing arg part of the public interface is not nice.

We use this pattern elsewhere. But we also use a test New function in other places. It's a matter of taste as to how one does DI. Doing it this way makes the dependency explicit. I can go either way.
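For comparison, the reviewer's alternative might look something like the sketch below: the public constructor keeps a single argument, and a separate test-only constructor supplies the override. All names here (`iterFunc`, `newWatcher`, `NewTestWatcher`, `defaultIter`) are illustrative assumptions, and the iterator is reduced to a function returning a string so the example stays self-contained.

```go
package main

import "fmt"

// iterFunc is a hypothetical factory for whatever iterates the txn log.
type iterFunc func() string

// Watcher is a cut-down stand-in for the real watcher type.
type Watcher struct {
	name         string
	iteratorFunc iterFunc
}

// defaultIter represents the production iterator.
func defaultIter() string { return "real iterator" }

// newWatcher is the unexported constructor; a nil f selects the default.
func newWatcher(name string, f iterFunc) *Watcher {
	w := &Watcher{name: name, iteratorFunc: f}
	if w.iteratorFunc == nil {
		w.iteratorFunc = defaultIter
	}
	return w
}

// New is the public constructor and exposes no testing hook.
func New(name string) *Watcher { return newWatcher(name, nil) }

// NewTestWatcher would normally live in a _test.go file so that the
// override never becomes part of the public interface.
func NewTestWatcher(name string, f iterFunc) *Watcher {
	return newWatcher(name, f)
}

func main() {
	fmt.Println(New("w").iteratorFunc())
	fake := func() string { return "fake iterator" }
	fmt.Println(NewTestWatcher("w", fake).iteratorFunc())
}
```

The trade-off is the one discussed above: the explicit-argument style makes the dependency visible at every call site, while the test-constructor style keeps the public signature small.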

howbazaar · 2017-05-24T22:48:51Z

state/watcher/watcher.go

@@ -441,7 +464,15 @@ func (w *Watcher) sync() error {
 		}
 	}
 	if err := iter.Close(); err != nil {


Isn't there an iter.Err() that we should be looking for, rather than just using Close?

As far as I can tell, you only get the error when closing the iter

howbazaar · 2017-05-24T22:50:38Z

state/watcher/watcher_test.go

+type badIter struct {
+	*mgo.Iter
+
+	errorAfter *int


Why a pointer? Since the interface is supported by a *badIter, you can just use a normal int here.

Because of how the badIter was constructed, but I can refactor
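A sketch of the suggested refactor, under the assumption that the methods use pointer receivers: the countdown can then be a plain int because mutations persist on the receiver. `realIter` is a stand-in for the embedded *mgo.Iter, and the simplified `Next`/`Close` signatures are assumptions made so the example runs on its own.

```go
package main

import (
	"errors"
	"fmt"
)

// realIter is a stand-in for the embedded *mgo.Iter in the test code.
type realIter struct{ left int }

func (r *realIter) Next() bool {
	if r.left == 0 {
		return false
	}
	r.left--
	return true
}

func (r *realIter) Close() error { return nil }

// badIter wraps a working iterator and starts failing once errorAfter
// successful calls to Next have been made.
type badIter struct {
	*realIter
	errorAfter int // plain int: pointer receivers keep the updated value
}

func (b *badIter) Next() bool {
	if b.errorAfter <= 0 {
		return false
	}
	b.errorAfter--
	return b.realIter.Next()
}

func (b *badIter) Close() error {
	if b.errorAfter <= 0 {
		return errors.New("simulated iterator failure")
	}
	return b.realIter.Close()
}

func main() {
	b := &badIter{realIter: &realIter{left: 5}, errorAfter: 2}
	n := 0
	for b.Next() {
		n++
	}
	fmt.Println(n, b.Close())
}
```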

howbazaar · 2017-05-24T22:53:55Z

As @jameinel says, can we reconfigure the max txn log size after the controller has been set up?

wallyworld · 2017-05-25T00:13:16Z

Upgrades is a good point. I still think it's bad, regardless of context, to do the collection setup each time a new state is created; e.g. state.ForModel() should not have to do it. I think it would be better to have the upgrade process explicitly redo the collections as part of the upgrade.

Also, controller settings are immutable at the moment. If/when that changes, again, I'd rather be explicit about messing with the collections, rather than just doing it every time we open a state.

jameinel · 2017-05-25T04:00:31Z

I'm perfectly happy to have an explicit point where we do DB reconfiguration (be it Upgrades or Install, or even a user-initiated reset and restart). But I wanted to make sure we didn't forget things like upgrades. Especially: how will things function if you upgrade to a version that now has this config, but never had it set?

As far as initialization of collections, I'm actually more concerned that we'll add a collection that, say, has a really important second index, but because we aren't applying the DB schema, we forget to create that index when upgrading. Mongo does generally follow the rule of "ensure this index exists" rather than "create an explicit index" for this precise reason. You can always just ask to make sure the index you want exists.

wallyworld · 2017-05-25T12:11:22Z

The latest revision ensures that each time the agent is restarted, we apply indices, create explicit collections etc. This covers the upgrade case, as well as not unnecessarily doing it each time we simply get a new state from the state pool etc.
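The shape of that arrangement could be sketched as below: an idempotent schema pass that runs once per agent start, and a state-open path that never touches the schema. The in-memory `db` type, the `schema` table, and the function names are all hypothetical; the collection and index names are illustrative only.

```go
package main

import "fmt"

// db is a hypothetical in-memory stand-in for the controller database.
type db struct {
	indexes      map[string]map[string]bool // collection -> index keys
	schemaPasses int                        // times the schema was applied
}

func newDB() *db { return &db{indexes: make(map[string]map[string]bool)} }

// schema maps each collection to the indexes it needs (illustrative names).
var schema = map[string][]string{
	"txns.log": nil,
	"machines": {"model-uuid"},
}

// ensureSchema is idempotent: it creates only what is missing, so running
// it on every agent start is safe and picks up anything an upgrade added.
func (d *db) ensureSchema() {
	d.schemaPasses++
	for coll, keys := range schema {
		if d.indexes[coll] == nil {
			d.indexes[coll] = make(map[string]bool)
		}
		for _, k := range keys {
			d.indexes[coll][k] = true
		}
	}
}

// startAgent applies the schema once per agent (re)start.
func startAgent(d *db) { d.ensureSchema() }

// openState models fetching a state from the pool; no schema work here.
func openState(d *db) *db { return d }

func main() {
	d := newDB()
	startAgent(d)
	openState(d)
	openState(d)
	startAgent(d) // a restart, e.g. after an upgrade
	fmt.Println(d.schemaPasses, len(d.indexes))
}
```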

wallyworld · 2017-05-25T12:11:31Z

$$merge$$

jujubot · 2017-05-25T12:12:09Z

Status: merge request accepted. Url: http://juju-ci.vapour.ws:8080/job/github-merge-juju

jujubot · 2017-05-25T13:01:56Z

Build failed: Tests failed
build url: http://juju-ci.vapour.ws:8080/job/github-merge-juju/10986


wallyworld · 2017-05-26T00:48:06Z

$$merge$$

jujubot · 2017-05-26T00:50:07Z

Status: merge request accepted. Url: http://juju-ci.vapour.ws:8080/job/github-merge-juju

howbazaar reviewed May 24, 2017

howbazaar approved these changes May 25, 2017

wallyworld force-pushed the capped-txn-log-errors branch from 80678a0 to c5094a8 on May 26, 2017 00:42

wallyworld added 2 commits May 26, 2017 10:47

Restart agent when txn log errors occur; provider configurable txn lo…

b513e6c

…g size

Address code review comments:

217dfd2

- ensure state collections are initialised each time the agent re-starts so upgrades are handled
- watcher test improvements

wallyworld force-pushed the capped-txn-log-errors branch from c5094a8 to 217dfd2 on May 26, 2017 00:47

jujubot merged commit 67513bf into juju:develop May 26, 2017
