state: Prune status history in batches #7501

Merged
Merged 5 commits into juju:2.2 on Jul 10, 2017

Conversation

Member

babbageclunk commented Jun 14, 2017

Description of change

In a heavily loaded system we saw the status history pruner failing because it was trying to delete too many rows at once - the RemoveAll call would time out.

Change pruning to delete rows in batches of 1000 by id using the mgo bulk API. Use ids rather than time ranges to ensure we can always make progress - in a pathological situation there might be enough status updates in a given time range that the delete still can't complete.

QA steps

  • Bootstrap a controller, setting max-status-history-age to 5m and max-status-history-size to 1M.
  • Deploy some units
  • Generate a lot of status updates by running status-set in a loop under juju run - I used the following script (scp'd to the unit machines) and ran it with juju run --unit x nohup /home/ubuntu/spam-status.bash:
while true; do
      status-set maintenance xxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxx
      status-set active "all good"
done
  • Wait for 5 minutes for the pruner to run, and check that it reports deleting the records by age from each model.
  • Turn off the age constraint by setting max-status-history-age to 0 in the default model.
  • Check that the pruner logs that it's deleting rows by size (but only once since this is across all models).
  • Check the collection size in the mongo shell using db.statuseshistory.dataSize() - this should be within 1 MB of the specified limit (it might be 1.9 MB, since we ask for the size in MB and Mongo gives us an int - 1.9 truncates to 1).

Bug reference

Fixes https://bugs.launchpad.net/juju/+bug/1696509

- }
- _, err = history.RemoveAll(bson.D{
- {"updated", bson.M{"$lt": result.Updated}},
+ toDelete := int(float64(collMB-p.maxSize) / sizePerStatus)

babbageclunk Jun 14, 2017

Member

I'm not sure whether we should bother working out how many we're expecting to delete. It's potentially useful for logging how much there is left to go (if there's a really large amount to prune), but it's misleading (hence the awkward "(estimated)" in the log message). We check the size after each batch and stop once we're under the threshold (this seems important in case something else is also deleting history - like unit.Destroy), so generally the actual number deleted will be fewer than the calculation suggests.

Caveat: some potential tweaks WRT RemoveAll vs Bulk.Remove(), and a manual verification of the impact of this code under load vs the original code under load. All of this looks like a real improvement to me.

Owner

jameinel commented Jun 22, 2017

Were you able to test this?

Member

babbageclunk commented Jun 22, 2017

No, I haven't gotten to this yet - I was away for 4 days and other fires have prevented me from working on it, sorry.

babbageclunk added some commits Jun 12, 2017

Batch pruning from the status history collection
Ensure that we only try to delete 1000 rows at a time. In a large
deployment with lots of status history records the pruner was dying
because its RemoveAll call tried to delete so many rows at once that
it would time out.
Extract batch-deletion to reuse for age pruning
Created a statusHistoryPruner struct so that pruneByAge and pruneBySize could
both use deleteInBatches.
Member

axw commented Jul 7, 2017

I spent a while trying to get agents to hammer the database with enough records that it would trigger the original error, but to no avail. I think it's only likely to happen due to the bug that was preventing records from being pruned in the first place, or if the database is otherwise highly loaded.

So I turned to writing a standalone program that inserted a bunch of records, and then observed the pruner do its thing. Even with 1 million records, the original code worked fine: it took around 20 seconds. With 5 million records, it resulted in an i/o timeout; syslog said the query took ~90 seconds to complete. mongod was averaging about 105% CPU over that time.

With this branch, at 95c0fdb, pruning 5 million records took about 3 minutes, and mongod CPU was about 95%.

Back at 710a8c4 (i.e. using RemoveAll, rather than batching), pruning 5 million records took a little over 2 minutes, and mongod CPU was around 100-105%.

Owner

wallyworld commented Jul 10, 2017

$$merge$$

Contributor

jujubot commented Jul 10, 2017

Status: merge request accepted. Url: http://juju-ci.vapour.ws:8080/job/github-merge-juju

@jujubot jujubot merged commit cd5d2ef into juju:2.2 Jul 10, 2017

1 check failed

github-check-merge-juju Ran tests against PR. Use !!.*!! to request another build. IE, !!build!!, !!retry!!

@babbageclunk babbageclunk deleted the babbageclunk:status-hist-pruner-delete branch Jul 16, 2017
