Introduce a single txn log watcher for state pool. #8223

Merged
merged 4 commits into from Dec 14, 2017

Conversation

Projects
None yet
4 participants
Owner

howbazaar commented Dec 14, 2017

Backport of 2.3 work to split up txn.logs watcher.

Back in the dawn of time, there was only a single model, and a single State instance. As the controller concept became a thing, we had additional State instances for each model. A core aspect of how the event model works in Juju is watching of the txns.log collection. As documents are added, updated, and removed by the transaction system, documents are added to the txns.log capped collection. Each document refers to the documents that were touched a single transaction. Each State instance has its own txn watcher. Each one of these tails the txns.log collection to watch for changes. As the number of models grows in a single controller, so does the number of things watching the txns.log collection. If the controller is HA, then there is n x m watchers where n is the number of models and m is the number of API servers. This creates quite a bit of idle load on mongo as the number of models grows above a small number. As numbers approach 200, there is a LOT of load. Especially when actions occur that trigger a bunch of transactions that touch a bunch of documents, such as removing an application. Particularly an application that has a bunch of subordinates.

So... this branch changes how the polling of the database is done. The state pool now has a watcher, that publishes all document changes to a hub that is also owned by the state pool. Whenever the state pool creates a new State instance, the base watcher for that State instance listens to the hub for changes rather than polling the database.

This drastically reduces the mongo load.

QA steps

I verified this by starting with a 2.2.6 controller, deploying 55 units to each of four models, and adding 700 empty models. This showed significant i/o load on the juju.txns.log collection.

After upgrading the controller, the load dropped significantly. From ~1000ms read / 10s to 15ms read / 10s.

Additional QA was done against EC2 with 3500 unit agents.

Bug reference

https://bugs.launchpad.net/juju/+bug/1727973
https://bugs.launchpad.net/juju/+bug/1733708
https://bugs.launchpad.net/juju/+bug/1737255

I'm assuming the hub watcher and txn watcher stuff has been reviewed elsewhere in the original PR?
If so, LGTM, if not, I'll need to look at those bits properly.

systemState: systemState,
pool: make(map[string]*PoolItem),
+ hub: pubsub.NewSimpleHub(nil),
+ }
+ // If systemState is nil, this is clearly a test, and a poorly
@wallyworld

wallyworld Dec 14, 2017

Owner

Can we log a warning so that if we accidentally introduce nil system state in production code, we at least get to see an artefact

state/pool.go
+ // If systemState is nil, this is clearly a test, and a poorly
+ // isolated one. However now is not the time to fix all those broken
+ // tests.
+ if systemState != nil {
@wallyworld

wallyworld Dec 14, 2017

Owner

can we invert this to avoid indentation and jeep the code closer to what it should be without the work around
if systemState == nil {
return pool
}

Mostly approving this based on reviewing it against 2.3 and feeling like at a minimum, it is better than what we had before. There might still be possibilities of pushing filtering higher in the stack (so callbacks are invoked per thread, but filtering is done in, say, a single thread, rather than filter into callback).
Regardless, we need something like this, and this is a good first step.

state/watcher/hubwatcher.go
+func (w *HubWatcher) flush() {
+ // syncEvents are stored first in first out.
+ for i := 0; i < len(w.syncEvents); i++ {
+ e := &w.syncEvents[i]
@jameinel

jameinel Dec 14, 2017

Owner

do we allow for syncEvents to also grow during the loop? Should we put a comment to that effect?
certainly this paradigm supports it changing.

@howbazaar

howbazaar Dec 14, 2017

Owner

Yes, adding comment.

Owner

howbazaar commented Dec 14, 2017

$$merge$$

Contributor

jujubot commented Dec 14, 2017

Status: merge request accepted. Url: http://ci.jujucharms.com/job/github-merge-juju

@jujubot jujubot merged commit edc8d6d into juju:2.2 Dec 14, 2017

1 check passed

continuous-integration/jenkins/pr-merge This commit looks good
Details

@howbazaar howbazaar deleted the howbazaar:2.2-txn-log-watcher-pubsub branch Dec 14, 2017

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment