
Possible regression in live+retry replication - processPendingBatch() #5196

Closed
nolanlawson opened this issue May 22, 2016 · 5 comments

@nolanlawson (Member) commented May 22, 2016

Folks are saying that processPendingBatch() has a regression in 5.3.2 where it goes into an infinite loop (I think that's the bug?). The proposed fix is 7092284.

I'm opening a new issue because the other ones (#5157 and #5164) became very long, and I want to focus on finding a test case to reproduce the issue.

Obviously we could just merge 7092284 and be done. But it's disturbing that that change doesn't affect any of the tests. We need a test to identify when and where exactly the existing code fails.

I gave this a try in f289d50 but failed. Any help here would be appreciated.

cc @Ampakinetic @man4j @dharders

@dharders (Contributor) commented May 22, 2016

@nolanlawson it's not an infinite loop in my testing, but close enough from the user's perspective, i.e. it seems to settle after ~10000 xhr requests and several MB of data transfer (even though my db is < 0.7 MB).

I'm not too sure about writing the correct unit test, but to reproduce the behaviour locally (a consolidated sketch follows these steps):

  • I set up a CouchDB server locally (v1.6.1 or v2.0) with a database named testdb
  • Created 1295 docs (like your unit test does), but with:
for (var i = 0; i < 1295; i++) {
  localDB.put({ _id: 'testdoc_' + i }).catch(function (e) { console.log(e); });
}
  • Blow away the browser local db through Google Chrome's Cookie + Site Data delete option
  • Run the below code and inspect the network tab in Chrome (sync should start automatically)
var localDB = new PouchDB('testdb');
localDB.sync('http://192.168.1.180:5984/testdb', {
  live: true,
  retry: true,
  batch_size: 10000,
  batches_limit: 10
}).on(...)
  • Swap out the CDN for a different PouchDB version
<script src="https://cdn.jsdelivr.net/pouchdb/5.3.2/pouchdb.min.js"></script>
<!--<script src="https://cdn.jsdelivr.net/pouchdb/5.2.0/pouchdb.min.js"></script>-->
  • Blow away the browser local db again
  • Reload and inspect the different sync network behaviour (i.e. 5.2.0 = Good, 5.3.2 = Bad)
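
For reference, here is a rough consolidated sketch of the state the steps above set up (data on the remote, nothing in the local db). It is a sketch only, not verified against 5.3.2: it seeds the remote directly with bulkDocs() instead of pushing from a local db and clearing browser storage, and it reuses the remote URL, doc ids, and sync options from the steps above.

// Sketch only: seed the remote CouchDB directly, then sync into a freshly
// created (empty) local db so the first pull has to page through all 1295 docs.
var remoteDB = new PouchDB('http://192.168.1.180:5984/testdb');

var docs = [];
for (var i = 0; i < 1295; i++) {
  docs.push({ _id: 'testdoc_' + i });
}

remoteDB.bulkDocs(docs).then(function () {
  var localDB = new PouchDB('testdb');   // brand new, empty local db
  localDB.sync(remoteDB, {
    live: true,
    retry: true,
    batch_size: 10000,
    batches_limit: 10
  }).on('paused', function (err) {
    // Watch the network tab here: 5.2.0 settles quickly, while 5.3.2 keeps
    // issuing checkpoint GETs (see the next comment).
    console.log('paused', err);
  }).on('error', function (err) {
    console.error(err);
  });
});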

@dharders (Contributor) commented May 22, 2016

Also,

  1. Your unit test assertion at test.retry.js L621, info.doc_count.should.equal(numDocs);, will not really reveal a failure, as the local & remote databases do (eventually) contain the same number of docs (after the long loop magically finishes).
  2. You need to replicate the local db to the server, then destroy() the local db, create a new blank local db, and replicate again to repro the behaviour, i.e. the second replication imitates a client connecting for the first time (destroy() was not in your unit test). Insert at line 618:
}).then(function () {
  return db.destroy();               // destroy local db
}).then(function () {
  db = new PouchDB(dbs.name);        // blank db... imitate a first-time connect!

  // ------ Replicate again, with data in the server db but NOTHING in the local db

  function replicatePromise(fromDB, toDB) {
    return new Promise(function (resolve, reject) {
      var replication = fromDB.replicate.to(toDB, {
        live: true,
        retry: true,
        batches_limit: 10,
        batch_size: 200
      }).on('paused', function (err) {
        if (!err) {
          // replication.cancel();
          resolve();                 // <<-------- required to see the behaviour (after 'paused')
        }
      }).on('complete', resolve)
        .on('error', reject);
    });
  }

  return Promise.all([
    replicatePromise(db, remote),
    replicatePromise(remote, db)
  ]);

Note that I do not cancel the replication like your test does, because the issue arises AFTER the 'paused' events (i.e. several paused -> active -> paused cycles) and cancelling after the first pause hides the behaviour (although I do think the dbs are in sync at the resolve() point).

The performance hit is really the many (unnecessary?) background remote checkpoint xhr GET requests (to _local/{checkpointidXXX}) happening AFTER the 'paused' event. Should this even happen? It doesn't with { live: false }, but it does with { live: true }.

From my observation, the badly behaving response to the above xhr GET has a last_seq value that only increments by 1, whereas the desired behaviour has a last_seq value incrementing by > 1 (i.e. with 7092284).
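
One way to watch this by hand (a rough sketch; the URL below is a placeholder for whichever _local/{checkpointidXXX} request actually shows up in the network tab):

// Hypothetical helper (sketch only): paste in the _local/{checkpoint} URL seen
// in the network tab and log last_seq once a second, to see whether it advances
// by 1 (bad) or jumps by roughly the batch size (good).
function pollCheckpoint(url, intervalMs) {
  return setInterval(function () {
    fetch(url)
      .then(function (res) { return res.json(); })
      .then(function (doc) {
        console.log(new Date().toISOString(), 'last_seq =', doc.last_seq);
      })
      .catch(function (err) { console.error(err); });
  }, intervalMs);
}

var timer = pollCheckpoint('http://192.168.1.180:5984/testdb/_local/CHECKPOINT_ID_HERE', 1000);
// clearInterval(timer) when done.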

How do we automate a test for this?

@nolanlawson (Member, Author) commented May 22, 2016

OK, so the issue is a performance problem and not a correctness problem?

In that case, we should probably write a performance test rather than a regular integration test. We have such a test here, although I admit we don't pay as much attention to our performance tests as we probably ought to.

In my unit test, destroy() is implicitly called between each test (see beforeEach()).

If this is a tricky bug to catch in a test, then I'll be happy to just take everyone's word for it that the fix works, but keep in mind that we'll undoubtedly regress in the future if we can't figure out a way to test it. Maybe the right way to test it is to replicate some documents, then read the checkpoint afterwards?
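
As a sketch of that last idea (assumptions: the checkpoint is an ordinary _local/ doc, and its id would have to be captured somehow, e.g. from the _local/... GETs visible on the wire; checkpointId below is a placeholder):

// Sketch only: read a replication checkpoint doc and return its last_seq.
// checkpointId is hypothetical here; PouchDB derives it internally, so a real
// test would need some way to get hold of it.
function readCheckpoint(db, checkpointId) {
  return db.get('_local/' + checkpointId).then(function (doc) {
    return doc.last_seq;
  });
}

// e.g. after a one-shot pull replication from remote into db has completed,
// something like:
// Promise.all([readCheckpoint(db, checkpointId), remote.info()])
//   .then(function (results) {
//     results[0].should.equal(results[1].update_seq);
//   });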

@nolanlawson (Member, Author) commented May 22, 2016

See: #5199

nolanlawson added a commit that referenced this issue May 22, 2016

@nolanlawson (Member, Author) commented May 22, 2016

fixed in #5199
