Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Make _rev generation deterministic (similar to CouchDB) #4642

Closed
willholley opened this issue Dec 8, 2015 · 12 comments

Comments

Projects
None yet
7 participants
@willholley
Copy link
Member

commented Dec 8, 2015

PouchDB currently generates _rev values using a uuid. This presents problems when, for instance, an application has conflict resolution logic which might run on different devices concurrently (because they would both resolve the conflict but, in doing so, generate a new one).

CouchDB has an optimisation here where the _rev is deterministic, based on the contents of the document, attachments, deleted flag and rev history (see http://stackoverflow.com/questions/5954864/how-does-couchdb-calculate-the-revision-number).

It would be useful for PouchDB to have a similarly deterministic rev generation, though it's not really necessary to match the CouchDB one.

@willholley

This comment has been minimized.

Copy link
Member Author

commented Dec 8, 2015

the rev generation code is synchronous so we can't use the (async) md5 module / browser shim as is. Possibly the easiest thing is to add a new module specifically for rev generation which uses the synchronous md5 call under the hood.

@kxepal

This comment has been minimized.

Copy link

commented Dec 8, 2015

Few pitfalls here:

  • JSON in CouchDB represented as list of key-value pairs with the respect of duplicated keys and their initial order:
$ curl -XPUT http://localhost:5984/db1/doc -d '{"foo": "bar"}'
{"ok":true,"id":"doc","rev":"1-4c6114c65e295552ab1019e2b046b10e"}
$ curl -XPUT http://localhost:5984/db2/doc -d '{"foo": "bar", "foo": "baz"}'
{"ok":true,"id":"doc","rev":"1-906f55b140f53d2fdbeae42704285674"}
$ curl -XPUT http://localhost:5984/db3/doc -d '{"foo": "baz", "foo": "bar"}'
{"ok":true,"id":"doc","rev":"1-ef985aeeff74ff50c32f828181c98634"}
  • On attachment update, document represented not in JSON (list of kv pairs), but in binary form.

willholley added a commit that referenced this issue Dec 9, 2015

(#4642) - deterministic revision numbers
Naive implementation of a deterministic revision number. As a first
pass, this uses the md5 hash of the document to generate a new revision,
only in the case that an existing _rev is present (i.e. for doc updates).

The current structure of the document update logic makes it tricky to
incorporate the full rev history, hence we fall back to uuid-based revs
for new documents or revisions following a deletion. A possible
workaround might be to explicitly attach any existing _rev before the
document update logic is called.

This patch still makes client side conflict resolution easier - if
conflicts are resolved concurrently then they should generate the same
rev tree.
@nolanlawson

This comment has been minimized.

Copy link
Member

commented Dec 13, 2015

@willholley We have typically shied away from this for a few reasons:

  1. The code necessary to deterministically generate revisions would add some amount of bloat
  2. It is impossible to exactly replicate CouchDB's deterministic algorithm due to Erlang quirks
  3. Checksumming JSON docs adds a performance cost
  4. The benefit is minimal. In fact the benefit only occurs in cases where two users make exactly the same modification to the same document, in which case it adds a slight performance boost by not needing to replicate that change between the two users.

So basically the cost is fairly high, and the payoff is fairly low.

@daleharvey

This comment has been minimized.

Copy link
Member

commented Dec 13, 2015

Its not just a slight performance boost, if we have 2 (or more) clients who are both trying to do automated conflict resolution then our current strategy can have them race possibly infinitely (they both see the change, both create a new doc to fix it, both those new docs conflict).

If we dont match CouchDB's id generation then even if we have deterministic generaton they can still race if one of the clients is running against couch. Similiarly if a client is running this with 2 versions of Couch they will hit the same problem.

Not making a case for either way yet, the previous patch that didnt get merged did match (one version) of CouchDB's implementation and that was definitely a large amount of code.

@nolanlawson

This comment has been minimized.

Copy link
Member

commented Dec 13, 2015

if we have 2 (or more) clients who are both trying to do automated conflict resolution then our current strategy can have them race possibly infinitely (they both see the change, both create a new doc to fix it, both those new docs conflict).

Then maybe we should push for a strategy that is at least deterministic for all PouchDB implementations? Presumably this probably will only usually occur for clients, not for servers.

Still, I think it has enough of a performance hit that it should be opt-in rather than opt-out.

@willholley

This comment has been minimized.

Copy link
Member Author

commented Dec 15, 2015

I agree that we should not attempt to copy CouchDB rev generation but aim to make rev generation within PouchDB deterministic, at least for the automatic conflict resolution use case.

In terms of the performance, are there specific scenarios to worry about / have tests for? The most common source of performance complaints that I hear is around replication, where rev generation doesn't occur, but any new edits generated by PouchDB are likely going to take a hit. I started a branch at https://github.com/pouchdb/pouchdb/commits/4642-md5-rev-generation with a minimal implementation that could be used to do a quick performance test.

@kxepal

This comment has been minimized.

Copy link

commented Dec 15, 2015

If performance matters, you can try to search for something faster than MD5, collision free and available for you. Technically, you don't have any need in strong and secure hash function for general usage.

One candidate I have on mind: https://github.com/Cyan4973/xxHash
Not sure how is fast JavaScript implementation: https://github.com/pierrec/js-xxhash
But you got the idea.

@nolanlawson

This comment has been minimized.

Copy link
Member

commented Dec 21, 2015

if we have 2 (or more) clients who are both trying to do automated conflict resolution then our current strategy can have them race possibly infinitely (they both see the change, both create a new doc to fix it, both those new docs conflict).

I'll also point out that this is an entirely theoretical outcome, and I've never heard a user report that this kind of situation has actually occurred.

@willholley

This comment has been minimized.

Copy link
Member Author

commented Dec 21, 2015

I opened the ticket following a discussion in Slack where a user had this
problem. The workaround is to try and elect a device to perform the
conflict resolution but that in itself is a tricky problem. I expect this
is going to pop up more as users discover conflicts (experience with
Cloudant suggests that lots of users don't handle them well).

On Mon, 21 Dec 2015 18:50 Nolan Lawson notifications@github.com wrote:

if we have 2 (or more) clients who are both trying to do automated
conflict resolution then our current strategy can have them race possibly
infinitely (they both see the change, both create a new doc to fix it, both
those new docs conflict).

I'll also point out that this is an entirely theoretical outcome, and I've
never heard a user report that this kind of situation has actually occurred.


Reply to this email directly or view it on GitHub
#4642 (comment).

@gnowoel

This comment has been minimized.

Copy link

commented Aug 6, 2017

Running into an infinite loop today, and I'm trying to find a solution.

We know this happened because multiple devices were trying to resolve the same conflict, and in doing so, they all generated a new revision, hence a new conflict.

Since the new revisions are all generated by the same conflict resolution logic, the contents of those revisions should be all the same, with the only exception of the revision IDs.

Keeping this in mind, I figured out a possible solution.

When resolving conflicts of two revisions, if the only difference of those is the revision ID, then we can pick one (rather than generating a new revision) as the winner.

The idea is to avoid generating a new revision ID whenever possible. No more revision IDs, no more conflicts. And the loop will stop.

If you're using the pouch-resolve-conflicts plugin, the resolveFun() function could be written as follows:

import clone from 'lodash/clone';
import isEqual from 'lodash/isEqual';

function resolveFun(a, b) {
  const cloneA = clone(a);
  const cloneB = clone(b);

  // Delete the different revision IDs
  delete cloneA._rev;
  delete cloneB._rev;

  // Also delete the conflict lists just in case
  delete cloneA._conflicts;
  delete cloneB._conflicts;

  // Same content. Pick an existing revision
  if (isEqual(cloneA, cloneB)) {
    return b;
  }

  // Different contents. Resolving conflicts...
};

Of course, you don't have to use pouch-resolve-conflicts or lodash, or even deep comparison. Just try not to produce a new revision ID unless absolutely necessary.

At least in my case, I found this approach is even better than having deterministic revisions. Instead of calculating the MD5 hashes for every doc and every change, now you only do heavy computation on limited docs and limited revisions.

Without deterministic revisions, however, you have to remember to manually resolve conflicts in your project. But that's something you should do anyway.

@stale

This comment has been minimized.

Copy link

commented Nov 29, 2017

This issue has been automatically marked as stale because it has not had recent activity. It will be closed if no further activity occurs. Thank you for your contributions.

@stale stale bot added the wontfix label Nov 29, 2017

@nename0 nename0 added pinned and removed wontfix labels Dec 1, 2017

@garrensmith

This comment has been minimized.

Copy link
Member

commented Mar 5, 2018

I want to reopen this topic for discussion. I have a situation where the same doc could be written to two different pouchdb-server's at the same time and I'm going to get a lot of unnecessary conflicts.

I would be keen for us to move to using a deterministic revision. If we against that, I would like the revision calculation to be changeable, then a user could add in a custom revision generation algorithm that is deterministic if required.

Ideally could we change the rev calcucation to be async as well? I'm not sure how big an impact that would have on all the adapters. I'm guessing we would have to change how all the adapters deal with the parseDocs function.

daleharvey added a commit that referenced this issue Mar 14, 2018

(#4642) - Deterministic revision numbers with md5
* Deterministic revision numbers with md5

This uses md5 to create the the revision number for a document. This
means that each document will have a deterministic revision nubmer.
This makes client side conflict resolution easier and will reduce the
number of conflicts

By default this is turned on, but a user can fallback to generating
random revision numbers by passing `useMd5Rev: false' as an option to a
new PouchDB instance

* add some extra tests

* fixes from review

* fix eslint issue

* only check revExists on newEdits = false

* fix idb-next and rename to deterministic_revs

* fix for same rev for new and old doc
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
You can’t perform that action at this time.