Retry upserts that fail with "duplicate key error" #291

Merged
merged 16 commits into from Jul 14, 2016

Conversation

Projects
None yet
4 participants
Contributor

babbageclunk commented Jun 27, 2016

According to the Mongo docs (link below), getting this error from an upsert is expected
behaviour and the client should retry if it happens.
https://docs.mongodb.com/v3.2/reference/method/db.collection.update/#use-unique-indexes

It doesn't seem to be specific to Mongo 3.2, but I think that the performance characteristics have changed in a way that makes it much more common than it is with 2.x.

Looking through the code revealed that removes had the same problem - I verified this by modifying the repro code to do removes instead.

This fixes #277

babbageclunk added some commits Jun 27, 2016

Retry upserts that fail with "duplicate key error"
According to the Mongo docs (link below), getting this error is expected
behaviour and the client should retry if it happens.
https://docs.mongodb.com/v3.2/reference/method/db.collection.update/#use-unique-indexes

This fixes #277
Contributor

niemeyer commented Jun 28, 2016

Any chance we can get a test?

Also, wonder if we should be retrying on mgo itself. Any reason not to?

Contributor

babbageclunk commented Jun 28, 2016

Good point, the retry should definitely be in mgo, rather than mgo.txn. I'll move it. Are you alright with the infinite loop to do the retry? Or should I be doing something a bit cleverer?

Sorry, I should have added a test - it's fiddly to do, but it looks like there are already some other tests trying to reproduce racy things.

babbageclunk added some commits Jun 28, 2016

Retry on duplicate key errors in Collection.Upsert
Also, use IsDup to detect the duplicate key errors - it looks like
that's been accreted through painful experience.
Contributor

babbageclunk commented Jul 5, 2016

Hi, sorry for the delay on this - I've moved the retrying into mgo, in Query.Apply and Collection.Upsert. Unfortunately I've had real trouble reproducing the problem in a test at that level - no matter how I try I can't get the upserts to fail with an 11000 error.

However when I run the txn tests TestTxnQueueStashStressTest was reliably failing and now passes consistently with the change. (I don't really understand why the txn tests aren't run by go test ./... from the mgo directory.)

Contributor

babbageclunk commented Jul 5, 2016

It seems like the -check.v in .travis.yml means the mgo/txn tests aren't being run - is that deliberate? I've added a line to run the txn tests as well.

Contributor

babbageclunk commented Jul 6, 2016

Hmm - that's unfortunate. It looks like TestTxnQueueStressTest fails intermittently on the parent branch as well. I'm going to pull the change to turn on the mgo/txn tests out of this PR into a new one and add a skip for that test until someone can work out what's going wrong.

Contributor

babbageclunk commented Jul 6, 2016

Sigh - no, of course, that doesn't work because this PR fixes TestTxnQueueStashStressTest. So I'm including the skip here.

babbageclunk added some commits Jul 6, 2016

Skip TestTxnQueueStressTest unless -flaky is passed
At the moment it fails about 25% of the time - the reason needs to be
worked out, but it's probably better to get the txn tests running first.
Retry DropAll in txn test setup
It fails sometimes in a way that seems like a timing issue - put a sleep
in before retrying (up to 3 attempts).
Bumping up retries and sleep time
If 10 1-second waits aren't enough then it's likely that the "no
reachable servers" error isn't transient.
Try to avoid port collisions when starting txn tests
Make StartServer dial the server first to ensure it's up before it returns.

Error logging showed that something else was already using port 50017.

Picking a random port still gets collisions :(, so try picking an unused
one instead.
Contributor

babbageclunk commented Jul 6, 2016

To show that this change does fix the duplicate key error, here's a build of a branch that has the txn tests turned on without the upsert retrying - TestTxnQueueStashStressTest fails in the mongo 3.2 test run.

session.go
@@ -2484,7 +2484,15 @@ func (c *Collection) Upsert(selector interface{}, update interface{}) (info *Cha
Flags: 1,
Upsert: true,
}
- lerr, err := c.writeOp(&op, true)
+ var lerr *LastError
+ for {
@niemeyer

niemeyer Jul 8, 2016

Contributor

Let's please start by retrying at most 5 times here and in the other loop. More than that there's a good chance that there's something else wrong.

session.go
return nil, ErrNotFound
+ } else {
+ return nil, err
@niemeyer

niemeyer Jul 8, 2016

Contributor

Please take this one out of the else and after the if blocks. Makes it more clear that this isn't looping around no matter what.

session.go
+ // Retry duplicate key errors on upserts.
+ // https://docs.mongodb.com/v3.2/reference/method/db.collection.update/#use-unique-indexes
+ continue
+ } else if qerr, ok := err.(*QueryError); ok && qerr.Message == "No matching object found" {
@niemeyer

niemeyer Jul 8, 2016

Contributor

Not your problem, but the error message comparison here saddens me.

@rogpeppe

rogpeppe Jul 12, 2016

Contributor

ditto about the else.

txn/flusher.go
@@ -264,8 +264,7 @@ NextDoc:
// Document missing. Use stash collection.
change.Upsert = true
chaos("")
- _, err := f.sc.FindId(dkey).Apply(change, &info)
- if err != nil {
+ if _, err := f.sc.FindId(dkey).Apply(change, &info); err != nil {
@niemeyer

niemeyer Jul 8, 2016

Contributor

Please restore the original form here. No need to touch this file.

@babbageclunk

babbageclunk Jul 10, 2016

Contributor

Oops, that was a holdover from the previous version of the PR. Removed.

txn/mgo_test.go
@@ -17,16 +21,39 @@ type MgoSuite struct {
session *mgo.Session
}
-var mgoaddr = "127.0.0.1:50017"
+const mgoip = "127.0.0.1"
@niemeyer

niemeyer Jul 8, 2016

Contributor

The changes in this file seem unrelated to the fix. Note that most of the mgo tests use fixed ports for the tests, so we have larger problems if that was an issue.

Also, we have a test server package nowadays which could handle this logic all by itself, probably. So if we have to fix it, that'd be a better way.

@babbageclunk

babbageclunk Jul 10, 2016

Contributor

You're right - these changes are only tangentially related to the upsert fix. Would you rather I split them (and the travis config change) out into a separate PR? It would need to get merged after this one, otherwise the txn tests will be failing.

D'oh, I hadn't spotted it, dbtest handles this much more nicely. I'll rewrite txn_test.go to use it and remove mgo_tes.go.

Contributor

babbageclunk commented Jul 10, 2016

Thanks for going through this!

babbageclunk added some commits Jul 10, 2016

Change txn tests to use dbtest instead of mgo_test
This has better handling of used ports and error reporting. Remove
mgo_test.go, it's not needed now. sim_test.go:simulate was relying on
the mgoaddr global set in mgo_test.go, changed it to get the DBServer
passed in.
Install gopkg.in/tomb.v2 in Travis CI
This is needed so the txn tests can use dbtest.
Contributor

babbageclunk commented Jul 10, 2016

This is a spurious failure - the script to start the various mongo processes and set up users failed with couldn't add user: not master
The same commit passed on Travis here: https://travis-ci.org/babbageclunk/mgo/builds/143769207

session.go
+
+ if err == nil {
+ break
+ } else if change.Upsert && IsDup(err) {
@rogpeppe

rogpeppe Jul 12, 2016

Contributor

given that the else is redundant, perhaps leave it out and put the if on a new line?

@babbageclunk

babbageclunk Jul 12, 2016

Contributor

Removing those two elses and making the ifs independent reads much better, thanks

txn/txn_test.go
func (s *S) TestTxnQueueStressTest(c *C) {
+ if !*flaky {
@rogpeppe

rogpeppe Jul 12, 2016

Contributor

Perhaps include a comment as to why this test is flaky and how it fails when it flakes out?

Contributor

rogpeppe commented Jul 12, 2016

LGTM FWIW. What an unfortunate semantic.

Contributor

rogpeppe commented Jul 12, 2016

BTW it would be great if this could land soon, as it fixes a current critical bug in juju.

@@ -2484,7 +2487,15 @@ func (c *Collection) Upsert(selector interface{}, update interface{}) (info *Cha
Flags: 1,
Upsert: true,
}
- lerr, err := c.writeOp(&op, true)
+ var lerr *LastError
+ for i := 0; i < maxUpsertRetries; i++ {
@jameinel

jameinel Jul 14, 2016

Contributor

I believe mgo has an ability to hook in a log function, can we get retries like this into that mechanism so we have an idea how often it might be triggering.
Maybe something like logging only on success if i > 0 ?

babbageclunk added some commits Jul 14, 2016

Review change from rogpeppe
Clarify the retry logic in Query.Apply.
Review change from rogpeppe
Clarify what happens when test fails and what it might mean.
+ // https://docs.mongodb.com/v3.2/reference/method/db.collection.update/#use-unique-indexes
+ if !IsDup(err) {
+ if i > 0 {
+ debugf("upsert retry succeeded after %d failure(s)", i)
@niemeyer

niemeyer Jul 14, 2016

Contributor

The error message is bogus. Not having a dup error does not mean the upsert succeeded.

Please drop the debug message altogether. If we have debug on, we'll see the attempts going through.

+
+ if err == nil {
+ if i > 0 {
+ debugf("upsert retry succeeded after %d failure(s)", i)
@niemeyer

niemeyer Jul 14, 2016

Contributor

Same, let's please drop this.

Contributor

niemeyer commented Jul 14, 2016

Merging, will fix that afterwards.

@niemeyer niemeyer merged commit aee6a64 into go-mgo:v2-unstable Jul 14, 2016

1 check passed

continuous-integration/travis-ci/pr The Travis CI build passed
Details
Contributor

babbageclunk commented Jul 14, 2016

Thanks!

Contributor

babbageclunk commented Jul 14, 2016

I've got a branch with these same changes against the v2 branch. Do you want me to make another PR for that, or will you just merge this across to there?

babbageclunk added a commit to babbageclunk/juju that referenced this pull request Aug 18, 2016

Update mgo dependency
This version includes the fixes for the duplicate key error on upsert,
which means we don't need the patch any more.

go-mgo/mgo#291
go-mgo/mgo#316

@babbageclunk babbageclunk referenced this pull request in juju/juju Aug 18, 2016

Merged

Update mgo dependency #6028

jujubot added a commit to juju/juju that referenced this pull request Aug 22, 2016

Merge pull request #6028 from babbageclunk/update-mgo
Update mgo dependency

This version includes the fixes for the duplicate key error on upsert, which means we don't need the patch any more.

[go-mgo/mgo#291](go-mgo/mgo#291)
[go-mgo/mgo#316](go-mgo/mgo#316)

Add a readme to keep the patches directory alive and explain its use. mgz has made a change to apply_patches.py to ignore any files with a different extension.

(Review request: http://reviews.vapour.ws/r/5477/)
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment