Make status faster (in theory). #7766

Merged
merged 3 commits into juju:2.2 from howbazaar:2.2-faster-status on Aug 22, 2017

Conversation

4 participants
Owner

howbazaar commented Aug 21, 2017

juju status can be slow on big models in big controllers.

This branch attacks the problem from two directions. First, it adds the missing indices on the applications and machines collections. Both of these are fully loaded when status is executed, and in controllers with many models those queries would involve table scans.

Secondly, when the apiserver code iterated over the machines and units, each one made multiple calls to fetch values stored in the statuses collection. Instead of executing those queries one at a time, all the statuses for the model are now loaded at once at the start of the status call.

This reduces the number of calls for status values by 2 per machine and 3 per unit (or 4 per unit if SetStatus has never been called on the application). It also reduces the queries on status history by 1 per unit, by making the full workload version status document available through the cache. As well as the status caching, a single query is made to the db for all the units in the model, rather than one call per application.

So, for a model with 50 applications, 2500 units and 300 machines, this branch reduces the query count by 10648 (or 13148 if no app called SetStatus).
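To make the caching side concrete, here is a minimal, hypothetical sketch of the bulk-load idea in Go using mgo (the statusDoc fields, function name, and collection handle are illustrative, not the branch's actual LoadModelStatus): fetch every status document for the model in one query and key the results by global key, so later lookups hit a map instead of the statuses collection.

import (
    "gopkg.in/mgo.v2"
    "gopkg.in/mgo.v2/bson"
)

// statusDoc is a pared-down stand-in for the real statuses document.
type statusDoc struct {
    GlobalKey string `bson:"_id"`
    Status    string `bson:"status"`
    Info      string `bson:"statusinfo"`
}

// loadModelStatuses fetches all status docs for one model in a single query
// and keys them by global key (e.g. "u#mysql/0", "m#3") for O(1) lookups.
func loadModelStatuses(statuses *mgo.Collection, modelUUID string) (map[string]statusDoc, error) {
    var docs []statusDoc
    if err := statuses.Find(bson.M{"model-uuid": modelUUID}).All(&docs); err != nil {
        return nil, err
    }
    byKey := make(map[string]statusDoc, len(docs))
    for _, doc := range docs {
        byKey[doc.GlobalKey] = doc
    }
    return byKey, nil
}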

QA steps

Run juju status

Owner

jameinel commented Aug 21, 2017

I tried testing this on a controller with a couple thousand unit agents. I got up to around 2800 unit agents, and 'time juju status' when everything was happy was around 6.7s. Directly after 'juju upgrade-juju', all the agents were connecting to the new controller, CPU was generally maxed, and 'time juju status' took 2m50s.

Once things quieted down a bit and the only thing in the log file was now "waking to check leases", 'time juju status' was 4.8, 5.3, 5.6, 5.8, 5.0s.
So a good 30% faster.

It's quite hard for me to test how well 'juju status' performs under actual load (how do you know whether the time is different because the load is different, or because the code is different?).
I did try (I just upgraded back to the 2.2.3 that I had bootstrapped with, though the indexes are likely still there?):
$ time juju status > /dev/null
real 2m2.225s

So I can't say that 2m50s is significantly different from 2m2s (i.e. that your code makes things slower under load), and the variation means I can't really say that it made a concrete difference under load either.
After upgrading back (so I assume still with the indexes, but not the new code path), I saw:
17s, 13s, 4.8s, 6.4s, 5.1s, 5.1s, 5.1s, 4.7s.

So I can't say for sure whether this made a concrete difference or not.

I think the only way to find the difference would be to start over again and run 'juju status' enough times to see the difference through the noise.

One loop looks wrong, a couple other small comments.

+ return noStatus, errors.Annotate(err, "could not fetch model")
+ }
+ if context.status, err = context.model.LoadModelStatus(); err != nil {
+ return noStatus, errors.Annotate(err, "could not load model status values")
@jameinel

jameinel Aug 21, 2017

Owner

Would we want to do this contextually on "len(args.Patterns) <= 0" ? I'm just wondering if "juju status foo" is going to have to load the status of everything first.

@howbazaar

howbazaar Aug 21, 2017

Owner

I think if it proves to be too slow, we could only load matching docs.

apiserver/client/status.go
appUnitMap := make(map[string]*state.Unit)
for _, u := range units {
- appUnitMap[u.Name()] = u
+ if u.ApplicationName() == app.Name() {
@jameinel

jameinel Aug 21, 2017

Owner

3000 units in a model with 190 applications becomes 190*3000 evaluations of this loop.
If you want to do this, we need to first put the units into a map of app name => units (which happens to be what I was specifically testing: ~190 apps, each with 15 units).
That may not be the specific layout, but given you can convert this from O(N*M) to O(N*1) by putting it into a hash table, I think it's better to do so.

@howbazaar

howbazaar Aug 21, 2017

Owner

Iterating over lists is fast, but I have refactored the code.
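For concreteness, a minimal sketch of the map-based grouping being discussed (the helper name is made up): build the app-name => units index once, so each application does an O(1) lookup instead of scanning every unit in the model.

import "github.com/juju/juju/state"

// unitsByApp groups units by their application name in a single pass.
func unitsByApp(units []*state.Unit) map[string][]*state.Unit {
    byApp := make(map[string][]*state.Unit)
    for _, u := range units {
        byApp[u.ApplicationName()] = append(byApp[u.ApplicationName()], u)
    }
    return byApp
}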

machinesMap[id] = hostStatus
cache[id] = hostStatus
for _, machine := range machines[1:] {
parent, ok := cache[state.ParentId(machine.Id())]
if !ok {
- panic("We've broken an assumpution.")
+ logger.Errorf("programmer error, please file a bug, reference this whole log line: %q, %q", id, machine.Id())
@jameinel

jameinel Aug 21, 2017

Owner

yay for not being a panic

@howbazaar

howbazaar Aug 21, 2017

Owner

I know, right.

+ ipAddresses := c.ipAddresses[machineID]
+ spaces := c.spaces[machineID]
+ linkLayerDevices := c.linkLayerDevices[machineID]
+
@jameinel

jameinel Aug 21, 2017

Owner

A lot of this data never gets rendered in tabular format.
I wonder if we should move away from a FullStatus that generates everything and a renderer that filters out some of it.

@@ -1062,10 +1082,24 @@ func populateStatusFromStatusInfoAndErr(agent *params.DetailedStatus, statusInfo
agent.Since = statusInfo.Since
}
+// contextMachine overloads the Status call to use the cached status values,
+// and delegates everything else to the Machine.
+// TODO: cache presence as well.
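For readers unfamiliar with the pattern that comment describes, a minimal sketch (field names illustrative, not the branch's actual code): embedding *state.Machine inherits every method, and only Status is overridden to answer from the preloaded cache.

import (
    "github.com/juju/juju/state"
    "github.com/juju/juju/status"
)

// contextMachine delegates everything to the embedded Machine except Status,
// which returns the value preloaded by the bulk status query.
type contextMachine struct {
    *state.Machine
    cachedStatus status.StatusInfo
}

func (c *contextMachine) Status() (status.StatusInfo, error) {
    return c.cachedStatus, nil
}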
@jameinel

jameinel Aug 21, 2017

Owner

Presence is already held in in-memory structures; I don't think we need yet another cache for it.
I do have code that would allow you to ask about the presence information of multiple agents in one channel send and receive, instead of each agent being a separate request.
I just didn't bother updating it, because all the accesses were thisunit.Status() so we didn't have a way to call the bulk presence function.

@howbazaar

howbazaar Aug 21, 2017

Owner

How efficient would it be to ask for all the unit agent presence in one go?

@@ -246,8 +246,12 @@ func allCollections() collectionSchema {
// -----
// These collections hold information associated with applications.
- charmsC: {},
- applicationsC: {},
+ charmsC: {},
@jameinel

jameinel Aug 21, 2017

Owner

I stopped here for tonight. I may try to give it another look tomorrow.
Would it be reasonable to document why we want a given index somewhere?
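For illustration (the actual change lives in the allCollections schema and is truncated in the quote above), the missing indices presumably cover model-uuid, given the description of per-model queries on the applications and machines collections doing table scans; in standalone mgo terms that would be roughly:

import "gopkg.in/mgo.v2"

// ensureModelUUIDIndexes adds a model-uuid index to the collections that
// status loads in full, so per-model queries no longer table-scan.
func ensureModelUUIDIndexes(db *mgo.Database) error {
    idx := mgo.Index{Key: []string{"model-uuid"}, Background: true}
    for _, name := range []string{"applications", "machines"} {
        if err := db.C(name).EnsureIndex(idx); err != nil {
            return err
        }
    }
    return nil
}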

Looks like a good change - it seems like reducing the number of queries would be worthwhile. Have you done any profiling of this code?

if context.applications, context.units, context.latestCharms, err =
- fetchAllApplicationsAndUnits(c.api.stateAccessor, len(args.Patterns) <= 0); err != nil {
+ fetchAllApplicationsAndUnits(c.api.stateAccessor, context.model, len(args.Patterns) <= 0); err != nil {
@babbageclunk

babbageclunk Aug 22, 2017

Member

I realise this is pre-existing, but the <= here seems needlessly obtuse. There's no way len can return a negative result, right?


+
+// LoadModelStatus retrieves all the status documents for the model
+// at once. Used to primarily speed up status.
+func (m *Model) LoadModelStatus() (*ModelStatus, error) {
@babbageclunk

babbageclunk Aug 22, 2017

Member

I think @jameinel's right - it would be a pity to dramatically slow down status with a filter pattern on a large model. Would it be hard to apply the filters here?

@howbazaar

howbazaar Aug 22, 2017

Owner

I'm not sure to be honest.
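Purely as a hypothetical sketch of what applying the filters here could look like (not code from the branch, and it sidesteps the hard part of translating user-facing status patterns into global keys): when the caller already knows which global keys it needs, the single bulk query can be restricted rather than loading every status in the model.

import (
    "gopkg.in/mgo.v2"
    "gopkg.in/mgo.v2/bson"
)

// loadStatusesFor restricts the bulk status query to a known set of global
// keys; with an empty set it falls back to loading the whole model.
func loadStatusesFor(statuses *mgo.Collection, modelUUID string, globalKeys []string) ([]bson.M, error) {
    query := bson.M{"model-uuid": modelUUID}
    if len(globalKeys) > 0 {
        query["_id"] = bson.M{"$in": globalKeys}
    }
    var docs []bson.M
    err := statuses.Find(query).All(&docs)
    return docs, err
}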

state/status.go
+ return doc, nil
+}
+
+func (m *ModelStatus) transform(doc statusDocWithID) status.StatusInfo {
@babbageclunk

babbageclunk Aug 22, 2017

Member

This would make more sense as a method on statusDocWithID.
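A rough sketch of that suggestion (the statusDocWithID fields and conversion are guesses for illustration, not the PR's actual definitions): hang the conversion off the document type itself rather than off ModelStatus.

import (
    "time"

    "github.com/juju/juju/status"
)

// statusDocWithID is an illustrative stand-in for the real document type.
type statusDocWithID struct {
    ID         string                 `bson:"_id"`
    Status     status.Status          `bson:"status"`
    StatusInfo string                 `bson:"statusinfo"`
    StatusData map[string]interface{} `bson:"statusdata"`
    Updated    int64                  `bson:"updated"`
}

// asStatusInfo converts the raw document into the status value the
// apiserver consumes, as a method on the doc instead of on ModelStatus.
func (doc statusDocWithID) asStatusInfo() status.StatusInfo {
    since := time.Unix(0, doc.Updated)
    return status.StatusInfo{
        Status:  doc.Status,
        Message: doc.StatusInfo,
        Data:    doc.StatusData,
        Since:   &since,
    }
}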

Owner

howbazaar commented Aug 22, 2017

$$merge$$

Contributor

jujubot commented Aug 22, 2017

Status: merge request accepted. Url: http://ci.jujucharms.com/job/github-merge-juju

@jujubot jujubot merged commit 4ba2790 into juju:2.2 Aug 22, 2017

@howbazaar howbazaar deleted the howbazaar:2.2-faster-status branch Aug 22, 2017
