Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Introduce a single txn log watcher for state pool. #8223

Merged
merged 4 commits into from Dec 14, 2017

Conversation

howbazaar
Copy link
Contributor

Backport of 2.3 work to split up txn.logs watcher.

Back in the dawn of time, there was only a single model, and a single State instance. As the controller concept became a thing, we had additional State instances for each model. A core aspect of how the event model works in Juju is watching of the txns.log collection. As documents are added, updated, and removed by the transaction system, documents are added to the txns.log capped collection. Each document refers to the documents that were touched a single transaction. Each State instance has its own txn watcher. Each one of these tails the txns.log collection to watch for changes. As the number of models grows in a single controller, so does the number of things watching the txns.log collection. If the controller is HA, then there is n x m watchers where n is the number of models and m is the number of API servers. This creates quite a bit of idle load on mongo as the number of models grows above a small number. As numbers approach 200, there is a LOT of load. Especially when actions occur that trigger a bunch of transactions that touch a bunch of documents, such as removing an application. Particularly an application that has a bunch of subordinates.

So... this branch changes how the polling of the database is done. The state pool now has a watcher, that publishes all document changes to a hub that is also owned by the state pool. Whenever the state pool creates a new State instance, the base watcher for that State instance listens to the hub for changes rather than polling the database.

This drastically reduces the mongo load.

QA steps

I verified this by starting with a 2.2.6 controller, deploying 55 units to each of four models, and adding 700 empty models. This showed significant i/o load on the juju.txns.log collection.

After upgrading the controller, the load dropped significantly. From ~1000ms read / 10s to 15ms read / 10s.

Additional QA was done against EC2 with 3500 unit agents.

Bug reference

https://bugs.launchpad.net/juju/+bug/1727973
https://bugs.launchpad.net/juju/+bug/1733708
https://bugs.launchpad.net/juju/+bug/1737255

Copy link
Member

@wallyworld wallyworld left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'm assuming the hub watcher and txn watcher stuff has been reviewed elsewhere in the original PR?
If so, LGTM, if not, I'll need to look at those bits properly.

state/pool.go Outdated
// If systemState is nil, this is clearly a test, and a poorly
// isolated one. However now is not the time to fix all those broken
// tests.
if systemState != nil {
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

can we invert this to avoid indentation and jeep the code closer to what it should be without the work around
if systemState == nil {
return pool
}

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

ok

systemState: systemState,
pool: make(map[string]*PoolItem),
hub: pubsub.NewSimpleHub(nil),
}
// If systemState is nil, this is clearly a test, and a poorly
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Can we log a warning so that if we accidentally introduce nil system state in production code, we at least get to see an artefact

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

sure

Copy link
Member

@jameinel jameinel left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Mostly approving this based on reviewing it against 2.3 and feeling like at a minimum, it is better than what we had before. There might still be possibilities of pushing filtering higher in the stack (so callbacks are invoked per thread, but filtering is done in, say, a single thread, rather than filter into callback).
Regardless, we need something like this, and this is a good first step.

func (w *HubWatcher) flush() {
// syncEvents are stored first in first out.
for i := 0; i < len(w.syncEvents); i++ {
e := &w.syncEvents[i]
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

do we allow for syncEvents to also grow during the loop? Should we put a comment to that effect?
certainly this paradigm supports it changing.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yes, adding comment.

@howbazaar
Copy link
Contributor Author

$$merge$$

@jujubot
Copy link
Collaborator

jujubot commented Dec 14, 2017

Status: merge request accepted. Url: http://ci.jujucharms.com/job/github-merge-juju

@jujubot jujubot merged commit edc8d6d into juju:2.2 Dec 14, 2017
@howbazaar howbazaar deleted the 2.2-txn-log-watcher-pubsub branch December 14, 2017 08:49
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
4 participants