Restart agent when txn log errors occur; provide configurable txn log size #7382

Merged
merged 2 commits into juju:develop on May 26, 2017

Conversation

wallyworld
Member

Description of change

During scale testing, destroying many models all at once would cause the capped txn log to roll over from underneath the watcher iterator. The watcher would error and things would go pear shaped from there.

We now catch that error in the watcher and bubble it up as a fatal agent error. This causes the agent to restart. The restart has the effect of starting from a known point again.

A small refactoring was done as part of this work to only initialise the database collections during state initialisation at bootstrap time, rather than every time a state instance is opened. This was needed because the controller config containing the max txn log size is only available at init time, and there is no point re-initialising the collections every time a state is opened after they have already been set up.
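
To illustrate the behaviour, here is a minimal sketch of the idea, not the exact diff in this PR: a helper recognises the mongo "capped position lost" cursor error and the sync pass bubbles it up as worker.ErrRestartAgent. The function names and the exact error matching are illustrative assumptions.

package watcher

import (
	"strings"

	"gopkg.in/mgo.v2"
	"gopkg.in/mgo.v2/bson"

	"github.com/juju/juju/worker"
)

// isCappedPositionLost reports whether an iterator error means the capped
// txn log rolled over from underneath the cursor.
func isCappedPositionLost(err error) bool {
	return err != nil && strings.Contains(err.Error(), "CappedPositionLost")
}

// syncOnce drains one pass of the txn log iterator and maps a capped
// collection rollover to a fatal agent-restart error.
func syncOnce(iter *mgo.Iter) error {
	var doc bson.M
	for iter.Next(&doc) {
		// process txn log entries here...
	}
	if err := iter.Close(); err != nil {
		if isCappedPositionLost(err) {
			// Bubbling this up causes the agent to restart and the
			// watcher to begin again from a known point.
			return worker.ErrRestartAgent
		}
		return err
	}
	return nil
}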

QA steps

$ juju bootstrap aws --config max-txn-log-size=1M
$ juju controller-config
verify that the max-txn-log-size is shown as 1M
run juju debug-log
create many models and destroy them all at once
verify that the debug log shows the "capped position lost" error and the "agent should be restarted" error, and that the log stream then exits as the agent is killed
run juju debug-log again and verify that the logs show the agent has restarted

$ juju bootstrap aws
$ juju controller-config
verify that the max-txn-log-size is shown as 10M

bootstrap with an older client and upgrade
$ juju controller-config
verify that the max-txn-log-size is shown as 10M

Documentation changes

There's a new controller config attribute, max-txn-log-size, which defaults to 10M:
// MaxTxnLogSize is the maximum size of the capped txn log collection, eg "10M"
MaxTxnLogSize = "max-txn-log-size"
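
For illustration, a minimal sketch (using the mgo API directly; the helper name is made up and not code from this PR) of how the configured size, once converted to bytes, can feed into creating the capped txn log collection:

package state

import "gopkg.in/mgo.v2"

// createTxnLogCollection creates the capped txn log collection with the
// configured maximum size in bytes.
func createTxnLogCollection(db *mgo.Database, maxBytes int) error {
	return db.C("txns.log").Create(&mgo.CollectionInfo{
		Capped:   true,
		MaxBytes: maxBytes,
	})
}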

Bug reference

https://bugs.launchpad.net/juju/+bug/1692792

@jameinel
Member

What about after upgrades, etc.? There are more times than just bootstrap when we will need to initialize all of the tables.

@@ -83,7 +84,10 @@ func (w *commonWatcher) commonLoop() {
		defer wg.Done()
		<-w.tomb.Dying()
		if err := w.call("Stop", nil); err != nil {
			logger.Errorf("error trying to stop watcher: %v", err)
			// Don't log an error if a watcher is stopped due to an agent restart.
			if err.Error() != worker.ErrRestartAgent.Error() {
Contributor

Why not use errors.Cause(err) != worker.ErrRestartAgent?

Member Author

Because the error here comes from across the wire and just contains a string message.

modelTag := names.NewModelTag(args.ControllerModelArgs.Config.UUID())
st, err := open(modelTag, args.MongoInfo, args.MongoDialOpts, args.NewPolicy, args.Clock, nil)
if !names.IsValidModel(modelTag.Id()) {
Contributor

In what situations would this ever occur?

Member Author

Not sure since you would expect the uuid to be validated elsewhere. But the code had this check already so I kept it.

state/open.go Outdated
@@ -526,18 +558,13 @@ func newState(
		}
	}()

	// Set up database.
	schema := allCollections()
Contributor

Any reason not to put these two lines into the &database initializer?

		current: make(map[watchKey]int64),
		request: make(chan interface{}),
	}
	if w.iteratorFunc == nil {
Contributor

This is quite ick.

Couldn't you instead keep the New function taking one arg, but have a test method that provides an iterator override?

Having the testing arg as part of the public interface is not nice.

Member Author

We use this pattern elsewhere, but we also use a test New function in other places. It's a matter of taste as to how one does DI. Doing it this way makes the dependency explicit. I can go either way.
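
To illustrate the two styles being discussed, a rough sketch; the names (Watcher, iteratorFunc, the export_test.go helper) are illustrative, not the actual juju watcher API:

package watcher

import mgo "gopkg.in/mgo.v2"

type Watcher struct {
	iteratorFunc func() *mgo.Iter
}

// Style 1 (what this change does): the iterator override is an explicit,
// optional constructor argument, so the dependency is visible at the call site.
func New(logColl *mgo.Collection, iteratorFunc func() *mgo.Iter) *Watcher {
	w := &Watcher{iteratorFunc: iteratorFunc}
	if w.iteratorFunc == nil {
		w.iteratorFunc = func() *mgo.Iter {
			return logColl.Find(nil).Iter()
		}
	}
	return w
}

// Style 2 (the reviewer's suggestion): keep New with a single argument and let
// tests swap the iterator through a helper in export_test.go, for example:
//
//	func SetIteratorFunc(w *Watcher, f func() *mgo.Iter) { w.iteratorFunc = f }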

@@ -441,7 +464,15 @@ func (w *Watcher) sync() error {
		}
	}
	if err := iter.Close(); err != nil {
Contributor

Isn't there an iter.Err() that we should be looking for, rather than just using Close?

Member Author

As far as I can tell, you only get the error when closing the iter

type badIter struct {
	*mgo.Iter

	errorAfter *int
Contributor

Why a pointer? Since the interface is supported by a *badIter, you can just use a normal int here.

Member Author

Because of how the badIter was constructed, but I can refactor
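
A rough sketch of the suggested refactor, with errorAfter as a plain int; the extra field and the Next/Close overrides shown here are illustrative assumptions about the test double, not the PR's exact code:

package watcher_test

import (
	"errors"

	mgo "gopkg.in/mgo.v2"
)

// badIter wraps a real iterator and simulates a failure after a fixed number
// of documents. Because the iterator interface is satisfied by *badIter,
// errorAfter can be a plain int rather than a pointer.
type badIter struct {
	*mgo.Iter

	errorAfter int // documents to return before simulating failure
	failed     bool
}

func (i *badIter) Next(result interface{}) bool {
	if i.errorAfter == 0 {
		i.failed = true
		return false
	}
	i.errorAfter--
	return i.Iter.Next(result)
}

func (i *badIter) Close() error {
	if i.failed {
		return errors.New("simulated capped position lost")
	}
	return i.Iter.Close()
}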

@howbazaar
Contributor

As @jameinel says, can we reconfigure the max txn log size after the controller has been set up?

@wallyworld
Member Author

Upgrades is a good point. I still think it's bad, regardless of context, to do the collection setup each time a new state is created; eg state.ForModel() should not have to do it. I think it would be better to have the upgrade process explicitly redo the collections as part of the upgrade.

Also, controller settings are immutable at the moment. If/when that changes, again, I'd rather be explicit about messing with the collections, rather than just doing it every time we open a state.

@jameinel
Member

I'm perfectly happy to have an explicit point where we do DB reconfiguration (be it Upgrades or Install, or even user-initiated reset and restart). But I wanted to make sure we didn't forget things like upgrades, especially wrt how things will function if you upgrade to a version that now has this config but never had it set.

As far as initialization of collections, I'm actually more concerned that we'll add a collection that, say, has a really important second index, but because we aren't applying the DB schema, we forget to create that index when upgrading. Mongo does generally follow the rule of "ensure this index exists" rather than "create an explicit index" for this precise reason. You can always just ask to make sure the index you want exists.
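
For reference, the idempotent "ensure this index exists" pattern referred to above looks roughly like this with mgo; the collection name and keys are only examples:

package state

import "gopkg.in/mgo.v2"

// ensureImportantIndex can be run on every agent start or upgrade; EnsureIndex
// is effectively a no-op if an equivalent index already exists.
func ensureImportantIndex(db *mgo.Database) error {
	return db.C("statuses").EnsureIndex(mgo.Index{
		Key:        []string{"model-uuid", "updated"},
		Background: true,
	})
}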

@wallyworld
Member Author

The latest revision ensures that each time the agent is restarted, we apply indices, create explicit collections, etc. This covers the upgrade case, while avoiding doing it unnecessarily each time we simply get a new state from the state pool, etc.

@wallyworld
Member Author

$$merge$$

@jujubot
Collaborator

jujubot commented May 25, 2017

Status: merge request accepted. Url: http://juju-ci.vapour.ws:8080/job/github-merge-juju

@jujubot
Collaborator

jujubot commented May 25, 2017

Build failed: Tests failed
build url: http://juju-ci.vapour.ws:8080/job/github-merge-juju/10986

- ensure state collections are initialised each time the agent re-starts so upgrades are handled
- watcher test improvements
@wallyworld
Member Author

$$merge$$

@jujubot
Collaborator

jujubot commented May 26, 2017

Status: merge request accepted. Url: http://juju-ci.vapour.ws:8080/job/github-merge-juju

@jujubot jujubot merged commit 67513bf into juju:develop May 26, 2017