state: Prune status history in batches #7501
Conversation
```diff
-	}
-	_, err = history.RemoveAll(bson.D{
-		{"updated", bson.M{"$lt": result.Updated}},
+	toDelete := int(float64(collMB-p.maxSize) / sizePerStatus)
```
babbageclunk
Jun 14, 2017
Member
I'm not sure whether we should bother working out how many we're expecting to delete. It's potentially useful for logging how much there is left to go (if there's a really large amount to prune), but it's misleading - hence the awkward "(estimated)" in the log message. We check the size after each batch and stop once we're under the threshold (this seems important in case something else is also deleting history - like unit.Destroy), so in general the real number will be fewer than the calculation.
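To make the trade-off concrete, here is a minimal sketch of the loop shape being described - not the code in this branch. It assumes a collStats-based size check and a caller-supplied deleteBatch helper (both illustrative): the estimate only feeds the log message, and the real stop condition is the size check after each batch.

```go
package prune

import (
	"log"

	"gopkg.in/mgo.v2"
	"gopkg.in/mgo.v2/bson"
)

// pruneToSize keeps deleting batches until the collection drops below
// maxSizeMB. deleteBatch is whatever removes one batch of the oldest
// documents (e.g. by id via the bulk API).
func pruneToSize(db *mgo.Database, history *mgo.Collection, maxSizeMB int,
	sizePerStatus float64, deleteBatch func() error) error {
	for {
		// Ask Mongo how big the collection currently is.
		var stats struct {
			Size int `bson:"size"` // bytes
		}
		if err := db.Run(bson.D{{"collStats", history.Name}}, &stats); err != nil {
			return err
		}
		collMB := stats.Size / (1024 * 1024)
		if collMB <= maxSizeMB {
			// Under the threshold: stop, even if the original estimate said
			// there was more to go (something else, like unit.Destroy, may
			// also be deleting history).
			return nil
		}
		// The estimate only feeds a progress log message; the real stop
		// condition is the size check above.
		toDelete := int(float64(collMB-maxSizeMB) / sizePerStatus)
		log.Printf("pruning about %d statuses (estimated)", toDelete)
		if err := deleteBatch(); err != nil {
			return err
		}
	}
}
```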
jameinel
approved these changes
Jun 14, 2017
Caveat: there are some potential tweaks WRT RemoveAll vs Bulk.Remove(), and a manual verification of the impact of this code under load vs the original code under load would be good, but all of this looks like a real improvement to me.
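For context, the two mgo calls being compared look roughly like this - a sketch assuming an mgo *Collection and an already-gathered slice of ids for one batch, not the code from this branch:

```go
package prune

import (
	"gopkg.in/mgo.v2"
	"gopkg.in/mgo.v2/bson"
)

// removeOlderThan uses RemoveAll: one server-side multi-delete matching a
// query. With millions of matching documents this is the call that can time out.
func removeOlderThan(history *mgo.Collection, cutoff interface{}) error {
	_, err := history.RemoveAll(bson.M{"updated": bson.M{"$lt": cutoff}})
	return err
}

// removeByIDs uses the bulk API: queue one single-document delete per id and
// send them to the server as a single batched operation when Run is called.
func removeByIDs(history *mgo.Collection, ids []interface{}) error {
	if len(ids) == 0 {
		return nil
	}
	bulk := history.Bulk()
	bulk.Unordered()
	for _, id := range ids {
		bulk.Remove(bson.M{"_id": id})
	}
	_, err := bulk.Run()
	return err
}
```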
Were you able to test this?

No, I haven't gotten to this yet - I was away for 4 days and other fires have prevented me from working on it, sorry.
babbageclunk
added some commits
Jun 12, 2017
I spent a while trying to get agents to hammer the database with enough records that it would trigger the original error, but to no avail. I think it's only likely to happen due to the bug that was preventing records from being pruned in the first place, or if the database is otherwise highly loaded. So I turned to writing a standalone program that inserted a bunch of records, and then observed the pruner do its thing.

Even with 1 million records, the original code worked fine: it took around 20 seconds. With 5 million records, it resulted in an i/o timeout; syslog said the query took ~90 seconds to complete. mongod was averaging about 105% CPU over that time. With this branch, at 95c0fdb, pruning 5 million records took about 3 minutes, and mongod CPU was about 95%. Back at 710a8c4 (i.e. using RemoveAll rather than batching), pruning 5 million records took a little over 2 minutes, and mongod CPU was around 100-105%.
$$merge$$

Status: merge request accepted. Url: http://juju-ci.vapour.ws:8080/job/github-merge-juju
babbageclunk commented Jun 14, 2017
Description of change
In a heavily loaded system we saw the status history pruner failing because it was trying to delete too many rows at once - the RemoveAll call would time out.
Change pruning to delete rows in batches of 1000 by id using the mgo bulk API. Use ids rather than time ranges to ensure we can always make progress - in a pathological situation there might be enough status updates in a given time range that the delete still can't complete.
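For anyone following along, a minimal sketch of what that batching could look like with the mgo bulk API. Only the idea of deleting by id and the batch size of 1000 come from the description above; the function and field names are illustrative assumptions, not the PR's actual code.

```go
package prune

import (
	"gopkg.in/mgo.v2"
	"gopkg.in/mgo.v2/bson"
)

// pruneBatch deletes up to batchSize of the oldest status history documents
// by _id using the mgo bulk API, and reports how many it removed. Returning
// the count lets the caller keep looping until a size check says it can stop.
func pruneBatch(history *mgo.Collection, batchSize int) (int, error) {
	// Walk the oldest documents first, fetching only their ids.
	iter := history.Find(nil).
		Sort("updated").
		Select(bson.M{"_id": 1}).
		Limit(batchSize).
		Iter()

	bulk := history.Bulk()
	bulk.Unordered()
	count := 0
	var doc struct {
		ID interface{} `bson:"_id"`
	}
	for iter.Next(&doc) {
		// Deleting by _id keeps every delete a cheap single-document
		// operation, so progress is made even if one time range holds a
		// huge number of status updates.
		bulk.Remove(bson.M{"_id": doc.ID})
		count++
	}
	if err := iter.Close(); err != nil {
		return 0, err
	}
	if count == 0 {
		return 0, nil
	}
	if _, err := bulk.Run(); err != nil {
		return 0, err
	}
	return count, nil
}
```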
QA steps
1. Run `status-set` in a loop under `juju run` - I used the following script (scp'd to the unit machines) and ran it with `juju run --unit x nohup /home/ubuntu/spam-status.bash`:
2. Set `max-status-history-age` to 0 in the default model.
3. Check `db.statuseshistory.dataSize()` - this should be within 1Mb of the specified limit (it might be 1.9 Mb since we ask for the size in Mb and Mongo gives us an int - 1.9 truncates to 1).

Bug reference
Fixes https://bugs.launchpad.net/juju/+bug/1696509