WIP - change batching for replication #1963
utils.call(opts.onChange, null, result);
currentBatch.docs.forEach(function () {
  result.docs_written++;
  utils.call(opts.onChange, null, result);
onChange should be called with just one result and never with an error (this was wrong before)
Pull requests with more tests are always appreciated, the current tests
Latest commit (ig3/pouchdb@fe9a127) is pretty much as I had envisioned it. It includes several recent fixes (thanks @calvinmetcalf) not yet committed to master, necessary for tests to pass on my system but not otherwise related to the change to replicate. It would need a rebase after they are committed. All tests are passing on my system, in node and firefox.

changes is called in a loop with limit opts.batch_size. The batches array is limited to length opts.batches_limit. When the limit is exceeded, calls to changes are stopped; when the length falls below the limit, calls to changes resume. This limits the size of the queue and, consequently, maximum memory consumption, aside from the size of the documents in the queue. The next step is to add a limit on the total size of documents in the queue, which would limit memory consumption more strictly. This should be easy.

Initially, calls to changes are not live. After changes completes without changes, the subsequent call to changes is live. When a change is received, calls to changes revert to not live, until the next empty change. This can probably be optimized, but I need to think about it...

Events are emitted for: change, uptodate and outofdate.

Next steps are to write some tests for the new features and updates for the documentation. Any comments or suggestions are most welcome.
utils.call(opts.onChange, null, result);
currentBatch.docs.forEach(function () {
  result.docs_written++;
  returnValue.emit('change');
we probably want to return some details on the change, like the last_seq at least. We also need to add a complete event and an error event.
Also, annoyingly, this does not solve the back-pressure issue: push replication from leveldb to http stalls out around 200k.
  return replicationComplete();
}
if (changes.last_seq > changesOpts.since) {
  if (changesOpts.live) {
maybe set a function-wide flag for this?
How can I reproduce the leveldb to http fault?
Make a db and put 300,000 random documents in it, then try to replicate it. I can write a script that will do that for you if you want.
Try this script: https://gist.github.com/calvinmetcalf/11062744
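The gist itself isn't reproduced here, but a seeding script along these lines is the general idea. This is a hypothetical sketch: seedRandomDocs is an illustrative name, and bulkDocs is injected so the sketch stays independent of any particular adapter.

```javascript
// Write `total` random documents in batches of `batchSize`.
// bulkDocs(docs) -> Promise, e.g. db.bulkDocs.bind(db) with PouchDB.
function seedRandomDocs(total, batchSize, bulkDocs) {
  var written = 0;
  function nextBatch() {
    if (written >= total) { return Promise.resolve(written); }
    var n = Math.min(batchSize, total - written);
    var docs = [];
    for (var i = 0; i < n; i++) {
      docs.push({ _id: 'doc_' + (written + i), value: Math.random() });
    }
    written += n;
    return bulkDocs(docs).then(nextBatch);
  }
  return nextBatch();
}
```

With PouchDB this would be invoked as something like seedRandomDocs(300000, 1000, db.bulkDocs.bind(db)), batching the writes so the seeding itself doesn't exhaust memory.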
in your branch run
I don't know where the problem is yet, but somewhere between 10,000 and 100,000 documents in the source database, performance deteriorates severely. I suspect the problem is reading from the local leveldb database, as performance is degraded before even a few hundred documents are written to the target, while replicating a database of 10,000 documents is quick. Yet replication continues, slowly...
One problem is that the leveldb implementation of changes appears to ignore options.limit. I realize now that this is an undocumented option: it is supported by the http adapter, but perhaps not by others. While I could work around this, it would be wasteful. How easy would it be to make leveldb's changes respect options.limit?
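For illustration, honoring options.limit in a changes reader amounts to counting collected results and stopping early. This is a hypothetical sketch: the iterator interface here (next(cb) yielding a change object or null) is assumed, not the actual leveldb adapter API.

```javascript
// Pull changes from an underlying iterator, stopping once opts.limit
// results have been collected instead of draining the whole feed.
function readChanges(iterator, opts, complete) {
  var results = [];
  function step() {
    if (opts.limit && results.length >= opts.limit) { return finish(); }
    iterator.next(function (change) {
      if (!change) { return finish(); } // feed exhausted
      results.push(change);
      step();
    });
  }
  function finish() {
    var lastSeq = results.length ? results[results.length - 1].seq : (opts.since || 0);
    complete({ results: results, last_seq: lastSeq });
  }
  step();
}
```

The key point for replication is the recomputed last_seq: the next changes call can then resume with since set to the seq of the last delivered change.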
It should respect it
Nope, it isn't getting passed in.
That helps, but there is still a problem. With a source database having 100,000 documents (per db.info), replication begins and proceeds as expected through the first 300 documents. Then changes is called with since = 300; onChange is called three times, then the changes complete function is called. This is with limit = 10. changes is called again with since = 303, and it calls complete without having called onChange, and with last_seq = 0. So it seems something has gone wrong in the leveldb database.
limit doesn't work; I spoke too soon earlier.
Here is a workaround for the limitations of changes in leveldb. I expect it works (not finished testing yet), but it is very slow. Not only does leveldb's changes ignore options.limit, it also doesn't stop when the changes promise is cancelled: it sends all changes up to the latest every time it is called. Nonetheless, memory consumption now appears to be bounded.
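The workaround amounts to truncating on the consumer side. A minimal sketch of that idea (hypothetical helper name; it assumes each change result carries a seq field, as changes results do):

```javascript
// If the adapter ignored opts.limit, cap the results before queueing a
// batch, and recompute last_seq from the truncated slice so the next
// `since` value stays correct.
function capResults(res, limit) {
  if (!limit || res.results.length <= limit) { return res; }
  var sliced = res.results.slice(0, limit);
  return { results: sliced, last_seq: sliced[sliced.length - 1].seq };
}
```

This bounds what the replicator holds in memory, but it is wasteful exactly as described above: the adapter has already read and sent every change past the cap.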
The last update is now past 300,000 documents replicated. Slow but steady.
With fetchGenerationOneRevs disabled, 500,000 documents replicate from leveldb to http (PouchDB on localhost) in under 5 minutes.
Heh, this is why we have performance tests before we introduce optimisations. fetchGenerationOneRevs is purely supposed to be an optimisation for the case where a document has a single revision we pick up; there's probably something wrong if it's slowing things down. As far as I can remember it should be a universal speedup (in the case documents have a single revision).
@ig3 figured out the issue: the test was written poorly. I'm going to open a pull on your pull, which you can merge into your branch, that fixes this.
Opened it: ig3#2
@daleharvey: There seems to be a problem with the Travis tests on firefox. All tests are passing in node and firefox at my end. Tests of the new API features (promise interface and events) would be very helpful, as would tests with batch_size > 1 and a reconsideration of the default batch_size: with a bit more testing, it may be time for it to be increased from 1. With help/guidance from @calvinmetcalf, we have done some testing with larger datasets and exercised the throttling of the changes feed, which seems to be working, but I'm not sure how to test this methodically, particularly without making the tests very slow. It would be nice to be able to add some tests with larger datasets that are not run by default, maybe including Calvin's random data generation.
Maybe I should merge instead of rebase to capture master updates? |
That's pretty exciting to see. I will add a mechanism to have some tests disabled only when they go through saucelabs; for now we will just have to do the larger replication tests in firefox. So don't worry about writing slow tests, we need it for the existing tests anyway.
@ig3 you should be able to do
WOOT!
I put the default batch_size back to 100, which I had been testing with previously.
And rebased to current master.
Ah, so as it turns out a database GUID wouldn't actually help as much as you'd think: if you were replicating and got up to seq 100, and then the source database exploded or something and was reset from a backup at 80, it would still have the same GUID but you'd lose those next 20 changes. A compromise between not wanting to lose the current functionality with read-write databases, but wanting to be able to replicate from read-only databases, might be to flag any put errors from the source db and stop trying to update it after you get one.
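That compromise could be sketched as follows. This is an assumed shape, not the actual checkpointer: writeCheckpoint stands in for the checkpoint put against the source database.

```javascript
// After the first failed checkpoint write (e.g. a permission error from
// a read-only source), stop trying and just warn instead of bailing.
function makeCheckpointer(writeCheckpoint) {
  var sourceReadOnly = false;
  return function checkpoint(seq) {
    if (sourceReadOnly) { return Promise.resolve(); }
    return writeCheckpoint(seq).catch(function (err) {
      sourceReadOnly = true;
      console.warn('checkpointing source failed, continuing without it:', err.message);
    });
  };
}
```

Replication then continues uncheckpointed against the source: a restart would re-scan from an older seq, but nothing is lost.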
utils.call(opts.onChange, null, result);
currentBatch.docs.forEach(function () {
  result.docs_written++;
  returnValue.emit('change', utils.clone(result));
we only need this once per batch; as it is, it gives you a flood of events all at once when you just care about the last one, so maybe just:
result.docs_written += currentBatch.docs.length;
returnValue.emit('change', utils.clone(result));
Various tests require an event per doc updated, and I have taken their requirements as given and definitive of the API requirements thus far. But I have no problem with the event being once per batch, and changing the tests, if that is preferred. Does this need further consideration? Or should I just do it? It's an easy change.
I am not sure about making that change either. changes may group multiple changes into a single document, but you still always get at least one change per document; it's a huge bonus for the changes and replication APIs to be symmetrical. I don't see much wrong with 'flooding' the client side: if it needs to debounce changes to update the UI it can easily do that, whereas if we group the changes we lose functionality.
So replicating with batch size 500 means that you have no events and then suddenly 500 events synchronously, all identical except for the docs_written field.

One key issue is that only the 500th event is actually accurate, as it is describing the state of things right now; the previous 499 events are giving out-of-date information and would not do what you expected if you tried to cancel the replication after hearing one.

A way to think about doing the event once per batch is that it is listing changes to the target database, and since the updates happen in batches, each batch of updates is a change.
So for some reason I was thinking the replication changes were symmetrical to .changes, i.e. that they included doc information, so if you wanted to do something when a doc is updated you could do it via the replication change event. But that's not true, so actually yeah, I agree with the initial comment to batch them; sorry about that.
However, I am completely fine with doing the batch work in a follow-up; not adding extra work to this patch is a good thing.
Yeah, if we want to handle the readonly puts and batch changes in other pulls, then +1 to squash and merge.
It's trivial to emit one change event per batch: that's the way it was until I increased batch_size, at which point I added the noted loop to satisfy the tests. I'll remove the loop: one change event per batch, and change the affected tests.
OK, but you don't need to worry about the read-only puts in this one. We can do that in a separate pull.
That seems reasonable to me and I think would allow replication from a
It'll always be a permission error (and will always be the httpadapter) so
There's an ongoing conversation about how to properly checkpoint read-only databases. I definitely wouldn't worry about resuming the checkpoints, but if we could have basic support in place it would be nice; just warning about errors checkpointing instead of bailing seems OK to me.
Separate issues for change events and read-only source makes sense to me.
Strange, all tests are passing through many iterations on my system. Thoughts on why sync might be failing on Travis?
Try deleting your node_modules folder and reinstalling.
Hmm, retriggered and got the same failure; looks genuine, will try locally.
Reproduces locally for me.
It's because replication is now a promise, we probably need to do a
is the line that breaks,
also fails; going back to opts.complete works. If I try switching to require('lie'); in utils.js I weirdly get an npm error.

Also found a few other problems. The test seems to be broken: it's a little hard to follow, but if I take the changes mock out it still passes in the browser. We modify the db's changes feed, so it's remote that shouldn't get the extra document, but we test that db hasn't got both objects, which it very likely will; even with fixing that I still get it passing.

Also, the test is based on a bug: sync.complete should only be called once, when both replications are complete. The sync object returns both replication objects so they can individually be listened for and cancelled, so I think a test should just cancel one and ensure that sync.complete is called. However, any valid test for this is going to be broken until sync is fixed, which is #1 priority for me once this patch is in, so I suggest skipping it.

Also, since we split out the test.sync file, it is not run in the browser tests :(
PR to add sync back: #2034. I'll skip the failing tests and review + merge once that's green. This is a big improvement, isn't regressing us, and there's a lot of work that's gonna spin off from it. I want to get this all better tested, but that can be followed up so we unblock people.
I agree with that plan.
woot: c1845be
I still can't get a clean test run for this in node (haven't tried the browser yet). It is close to what I had intended, but I'm stymied by a surprising number of leveldb bugs. I'm taking a break now, but thought you might like to consider this.
With all the interference I have lost track a bit, but I believe this puts bounds on memory for both changes and replicate (i.e. it should work even for very large changes feeds). It throttles changes to limit the queue of replicate batches. It signals uptodate, but a bit prematurely: the event should be deferred until the documents are written. I know what I want to do about this, but must get back to other work now.
There is a new option for replication: opts.batches_limit, which is an upper bound on the number of batches queued.
The algorithm is somewhat as Dale and Calvin have discussed, but different in some details.
npm test was running fairly reliably with batch_size = batches_limit = 1 (after fixing a few bugs elsewhere), but is not yet clean with batch_size = batches_limit = 100.
The latest problem appears to be a fault in allDocs on leveldb (#1961). If it can't be sorted soon I'll remove the revision-1 stuff pending a fix, to get this going and stable.
It may be just my bad luck, but at the moment I am feeling there may be a bit too much change and too little testing going on. It sure seems easy for me to find bugs... Anyway, time for a nap and I'll look at it with fresh eyes later.
Comments and suggestions welcome.