Retry upserts that fail with "duplicate key error" #291
babbageclunk
added some commits
Jun 27, 2016
Any chance we can get a test? Also, I wonder if we should be retrying in mgo itself. Any reason not to?
Good point, the retry should definitely be in mgo, rather than mgo.txn. I'll move it. Are you alright with the infinite loop to do the retry? Or should I be doing something a bit cleverer? Sorry, I should have added a test - it's fiddly to do, but it looks like there are already some other tests trying to reproduce racy things.
babbageclunk
added some commits
Jun 28, 2016
Hi, sorry for the delay on this - I've moved the retrying into mgo, in Query.Apply and Collection.Upsert. Unfortunately I've had real trouble reproducing the problem in a test at that level - no matter how I try, I can't get the upserts to fail with an 11000 error. However, when I ran the txn tests, TestTxnQueueStashStressTest was reliably failing, and it now passes consistently with the change. (I don't really understand why the txn tests aren't run by
It seems like the -check.v in .travis.yml means the mgo/txn tests aren't being run - is that deliberate? I've added a line to run the txn tests as well.
Hmm - that's unfortunate. It looks like TestTxnQueueStressTest fails intermittently on the parent branch as well. I'm going to pull the change to turn on the mgo/txn tests out of this PR into a new one and add a skip for that test until someone can work out what's going wrong.
Sigh - no, of course, that doesn't work because this PR fixes TestTxnQueueStashStressTest. So I'm including the skip here.
babbageclunk
added some commits
Jul 6, 2016
To show that this change does fix the duplicate key error, here's a build of a branch that has the txn tests turned on without the upsert retrying - TestTxnQueueStashStressTest fails in the mongo 3.2 test run.
niemeyer
reviewed
Jul 8, 2016
@@ -2484,7 +2484,15 @@ func (c *Collection) Upsert(selector interface{}, update interface{}) (info *Cha
  Flags: 1,
  Upsert: true,
  }
- lerr, err := c.writeOp(&op, true)
+ var lerr *LastError
+ for {
niemeyer
Jul 8, 2016
Contributor
Let's please start by retrying at most 5 times here and in the other loop. More than that there's a good chance that there's something else wrong.
niemeyer
reviewed
Jul 8, 2016
  return nil, ErrNotFound
+ } else {
+ return nil, err
niemeyer
Jul 8, 2016
Contributor
Please take this one out of the else and after the if blocks. Makes it more clear that this isn't looping around no matter what.
niemeyer
reviewed
Jul 8, 2016
+ // Retry duplicate key errors on upserts.
+ // https://docs.mongodb.com/v3.2/reference/method/db.collection.update/#use-unique-indexes
+ continue
+ } else if qerr, ok := err.(*QueryError); ok && qerr.Message == "No matching object found" {
niemeyer
Jul 8, 2016
Contributor
Not your problem, but the error message comparison here saddens me.
niemeyer
reviewed
Jul 8, 2016
@@ -264,8 +264,7 @@ NextDoc:
  // Document missing. Use stash collection.
  change.Upsert = true
  chaos("")
- _, err := f.sc.FindId(dkey).Apply(change, &info)
- if err != nil {
+ if _, err := f.sc.FindId(dkey).Apply(change, &info); err != nil {
babbageclunk
Jul 10, 2016
Contributor
Oops, that was a holdover from the previous version of the PR. Removed.
niemeyer
reviewed
Jul 8, 2016
@@ -17,16 +21,39 @@ type MgoSuite struct {
  session *mgo.Session
  }
- var mgoaddr = "127.0.0.1:50017"
+ const mgoip = "127.0.0.1"
niemeyer
Jul 8, 2016
Contributor
The changes in this file seem unrelated to the fix. Note that most of the mgo tests use fixed ports for the tests, so we have larger problems if that was an issue.
Also, we have a test server package nowadays which could handle this logic all by itself, probably. So if we have to fix it, that'd be a better way.
babbageclunk
Jul 10, 2016
Contributor
You're right - these changes are only tangentially related to the upsert fix. Would you rather I split them (and the travis config change) out into a separate PR? It would need to get merged after this one, otherwise the txn tests will be failing.
D'oh, I hadn't spotted it, dbtest handles this much more nicely. I'll rewrite txn_test.go to use it and remove mgo_test.go.
Thanks for going through this!
babbageclunk
added some commits
Jul 10, 2016
This is a spurious failure - the script to start the various mongo processes and set up users failed with
rogpeppe
reviewed
Jul 12, 2016
+
+ if err == nil {
+ break
+ } else if change.Upsert && IsDup(err) {
rogpeppe
Jul 12, 2016
Contributor
given that the else is redundant, perhaps leave it out and put the if on a new line?
babbageclunk
Jul 12, 2016
Contributor
Removing those two elses and making the ifs independent reads much better, thanks
rogpeppe
reviewed
Jul 12, 2016
  func (s *S) TestTxnQueueStressTest(c *C) {
+ if !*flaky {
rogpeppe
Jul 12, 2016
Contributor
Perhaps include a comment as to why this test is flaky and how it fails when it flakes out?
LGTM FWIW. What an unfortunate semantic.
BTW it would be great if this could land soon, as it fixes a current critical bug in juju.
jameinel
reviewed
Jul 14, 2016
@@ -2484,7 +2487,15 @@ func (c *Collection) Upsert(selector interface{}, update interface{}) (info *Cha
  Flags: 1,
  Upsert: true,
  }
- lerr, err := c.writeOp(&op, true)
+ var lerr *LastError
+ for i := 0; i < maxUpsertRetries; i++ {
jameinel
Jul 14, 2016
Contributor
I believe mgo has an ability to hook in a log function, can we get retries like this into that mechanism so we have an idea how often it might be triggering.
Maybe something like logging only on success if i > 0 ?
babbageclunk
added some commits
Jul 14, 2016
niemeyer
reviewed
Jul 14, 2016
+ // https://docs.mongodb.com/v3.2/reference/method/db.collection.update/#use-unique-indexes
+ if !IsDup(err) {
+ if i > 0 {
+ debugf("upsert retry succeeded after %d failure(s)", i)
niemeyer
Jul 14, 2016
Contributor
The error message is bogus. Not having a dup error does not mean the upsert succeeded.
Please drop the debug message altogether. If we have debug on, we'll see the attempts going through.
niemeyer
reviewed
Jul 14, 2016
+
+ if err == nil {
+ if i > 0 {
+ debugf("upsert retry succeeded after %d failure(s)", i)
Merging, will fix that afterwards.
niemeyer
merged commit aee6a64
into
go-mgo:v2-unstable
Jul 14, 2016
1 check passed
Thanks!
I've got a branch with these same changes against the v2 branch. Do you want me to make another PR for that, or will you just merge this across to there?
babbageclunk commented Jun 27, 2016
According to the Mongo docs (link below), getting this error from an upsert is expected
behaviour and the client should retry if it happens.
https://docs.mongodb.com/v3.2/reference/method/db.collection.update/#use-unique-indexes
It doesn't seem to be specific to Mongo 3.2, but I think that the performance characteristics have changed in a way that makes it much more common than it is with 2.x.
Looking through the code revealed that removes had the same problem - I verified this by modifying the repro code to do removes instead.
This fixes #277