Add another patch for mgo. #8068

Merged
merged 1 commit into from Nov 15, 2017

Owner

jameinel commented Nov 14, 2017

Description of change

This patch changes mgo so that if we find a document whose txn-queue is
growing too large, we abort the transaction rather than letting the queue
grow unbounded. This allows us to recover from a bad transaction in a much
more reasonable manner.

QA steps

Bootstrap a controller from the source created by "releasetests/make-release-tarball.bash"
You can use enable-ha if you so choose.
Go to machine-0 and inject an invalid transaction into a document (I like to use lease documents, because we know we try to touch them every 30s).
Watch the transaction queue grow, but end up capped around 1000.
Remove the bad transaction from the queue.
See that everything actually recovers gracefully.

$ rm -rf tmp.*
$ ./releasetests/make-release-tarball.bash df86098b22d78 .
$ cd tmp.*/RELEASE
$ export GOPATH=$PWD
# I had to hack github.com/juju/juju/version/version.go so that bootstrap would pick this jujud
# and not download it from streams
$ time go install -v github.com/juju/juju/...
$ ./bin/juju bootstrap lxd --debug
$ ./bin/juju enable-ha
$ juju deploy -B cs:~jameinel/ubuntu-lite --to 0 -m controller
$ juju switch controller
# wait for "juju status" to stabilize
$ juju ssh -m controller 0
$$ dialmgo() {
    agent=$(cd /var/lib/juju/agents; echo machine-*)
    pw=$(sudo grep statepassword /var/lib/juju/agents/${agent}/agent.conf | cut '-d ' -sf2)
    /usr/lib/juju/mongo3.2/bin/mongo --ssl -u ${agent} -p $pw --authenticationDatabase admin --sslAllowInvalidHostnames --sslAllowInvalidCertificates localhost:37017/juju
}
$$ dialmgo
> db.leases.find().pretty()
...
{
        "_id" : "9b89c6e0-a321-43c2-894f-7c8eb87a92b1:application-leadership#ubuntu-lite#",
        "namespace" : "application-leadership",
        "name" : "ubuntu-lite",
        "holder" : "ubuntu-lite/0",
        "start" : NumberLong("254150808548"),
        "duration" : NumberLong("60000000000"),
        "writer" : "machine-0",
        "model-uuid" : "9b89c6e0-a321-43c2-894f-7c8eb87a92b1",
        "txn-revno" : NumberLong(8),
        "txn-queue" : [
                "5a0aa29741639d0e1a4e0f44_61a9f62b"
        ]
}
> db.leases.update({name: "ubuntu-lite"}, {$push: {"txn-queue": "5a0aa29741639d0edeadbeef_decafbad"}})
# Watch juju debug-log in another window, see the leadership manager start dying
# and resumer fail to resume transactions
> db.leases.aggregate([{$match: {name: "ubuntu-lite"}}, {$project: {_id: 1, num: {$size: "$txn-queue"}}}])
{ "_id" : "9b89c6e0-a321-43c2-894f-7c8eb87a92b1:application-leadership#ubuntu-lite#", "num" : 490 }
# Watch how many transactions are created
> db.txns.aggregate([{$group: {_id: "$s", c: {$sum: 1}}}, {$sort: {c: -1}}])
{ "_id" : 5, "c" : 2774 }
{ "_id" : 6, "c" : 1816 }
{ "_id" : 2, "c" : 698 }

# Wait for it to hit 1000, and notice that it creates more aborted transactions, and the queue doesn't grow very much:
> db.txns.aggregate([{$group: {_id: "$s", c: {$sum: 1}}}, {$sort: {c: -1}}])
{ "_id" : 5, "c" : 3922 }
{ "_id" : 6, "c" : 1840 }
{ "_id" : 2, "c" : 990 }
> db.leases.aggregate([{$match: {name: "ubuntu-lite"}}, {$project: {_id: 1, num: {$size: "$txn-queue"}}}])
{ "_id" : "9b89c6e0-a321-43c2-894f-7c8eb87a92b1:application-leadership#ubuntu-lite#", "num" : 1000 }

# Now remove the transaction, and see the count start to fall
> db.leases.update({name: "ubuntu-lite"}, {$pull: {"txn-queue": "5a0aa29741639d0edeadbeef_decafbad"}})

Documentation changes

None.

Bug reference

go-mgo/mgo#463
This is more about keeping problems from getting out of hand when they do occur than about fixing the underlying problem.

axw approved these changes Nov 15, 2017

Owner

jameinel commented Nov 15, 2017

$$merge$$

Contributor

jujubot commented Nov 15, 2017

Status: merge request accepted. Url: http://ci.jujucharms.com/job/github-merge-juju

@jujubot jujubot merged commit 5324747 into juju:develop Nov 15, 2017
