New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Catch up to oplog at most once per write fence #4694

Closed
wants to merge 1 commit into
base: devel
from

Conversation

Projects
None yet
7 participants
@OleksandrChekhovskyi
Contributor

OleksandrChekhovskyi commented Jul 7, 2015

Before this change, number of catch-up attempts was N*M, where N is number of writes inside of the fence, and M is number of active observers on affected collections. Every catch up issues yet another query to find the latest oplog entry.

It was extremely inefficient, in terms of both CPU usage and added latency. After executing write-heavy methods, application process was occupied for many seconds doing the same thing over and over again.

This change provides a performance improvement for all kinds of workloads.


CPU profile of write-intensive method before the change:
before

After the change:
after

Before the change there was noticeable several seconds lag after the method invocation (which was caused by waitUntilCaughtUp function), before anything else could be processed. After the change that lag is gone, and application is responsive perceptively immediately after the operation is done.

The most visible effect is achieved when a method does a lot of writes, but this change improves efficiency all over the board.

@glasser

This comment has been minimized.

Member

glasser commented Jul 7, 2015

This is super interesting. Do you have a benchmark or test that we can use to see the performance improvement you're describing?

Meteor.defer(function () {
if (fence._oplogObserveDrivers) {
if (!_.contains(fence._oplogObserveDrivers, self))

This comment has been minimized.

@glasser

glasser Jul 7, 2015

Member

This looks like it will lead to quadratic behavior. I'd rather give OplogObserveDrivers an incrementing ID and use an object to track them as a set.

This comment has been minimized.

@OleksandrChekhovskyi

OleksandrChekhovskyi Jul 8, 2015

Contributor

Indeed. Fixed in new version of the patch.

return;
}
fence._oplogObserveDrivers = [self];

This comment has been minimized.

@glasser

glasser Jul 7, 2015

Member

Not great to just put random fields on the fence. But I can't immediately think of something better.

This comment has been minimized.

@OleksandrChekhovskyi

OleksandrChekhovskyi Jul 8, 2015

Contributor

Not great, but I'm not sure either how to make this better.

self._writesToCommitWhenWeReachSteady.push(write);
}
_.each(fence._oplogObserveDrivers, function (driver) {

This comment has been minimized.

@glasser

glasser Jul 7, 2015

Member

are we 100% confident that fence._oplogObserveDrivers will not be added to once the fence is armed? I think so but want to make sure we've thought this through.

This comment has been minimized.

@OleksandrChekhovskyi

OleksandrChekhovskyi Jul 8, 2015

Contributor

Now it cleans up after itself, so even if it is added again, it will just catch up again. I can't think of a scenario where that could happen though.

try {
var collection = self.rawCollection(collection_name);
collection.insert(replaceTypes(document, replaceMeteorAtomWithMongo),
{safe: true}, callback);
Meteor.refresh({collection: collection_name, id: document._id });

This comment has been minimized.

@glasser

glasser Jul 7, 2015

Member

I don't think this is correct, unless I'm missing something. In the past, we waited until after the database told us that the write had occurred in order to refresh (ie, to wait until oplog is caught up, or do another poll). Now you're sending out that notification immediately upon sending the command to the database. So eg for the polling observe driver, you could do a poll immediately and think you've refreshed before the write actually happens. And for oplog, I guess you're making some assumption about how the arming of the fence relates to the callback being run, which I'm not sure is correct, but since the polling version is definitely incorrect I'm not going to think about this too hard right now.

This comment has been minimized.

@OleksandrChekhovskyi

OleksandrChekhovskyi Jul 8, 2015

Contributor

I was operating on the assumption of 'read your own writes' consistency, but since different connection is used for oplog tailing, this might not be the case indeed. Reverted back to 'refresh on database response' behavior, and moved the fence callback from 'on armed' to 'on ready to fire' to accomodate for the fact that we need to wait for write replies.

@OleksandrChekhovskyi OleksandrChekhovskyi force-pushed the OleksandrChekhovskyi:devel branch from 1dfc1f4 to d3e3877 Jul 8, 2015

@OleksandrChekhovskyi

This comment has been minimized.

Contributor

OleksandrChekhovskyi commented Jul 8, 2015

Patch has been updated to address the feedback.

I have created a simple benchmark that illustrates the problem.
https://github.com/OleksandrChekhovskyi/meteor-overhead-benchmark
Keep creating new pages with a button and observe how slow it can get.

_.each(self.completion_callbacks, function (f) {f(self);});
self.completion_callbacks = [];
self.outstanding_writes++;
_.each(self.before_fire_callbacks, function (f) {f(self);});

This comment has been minimized.

@glasser

glasser Jul 8, 2015

Member

I wonder if we need to think about what happens if one of these throws? We aren't for onAllCommitted callbacks, but (a) that's probably a mistake (b) the onBeforeFire callback in practice is more complex than the only existing onAllCommitted callback (eg, it blocks on network activity).

This comment has been minimized.

@OleksandrChekhovskyi

OleksandrChekhovskyi Jul 9, 2015

Contributor

Added a try/catch to safeguard against unexpected exceptions.

// Register a function to be called before firing the fence.
onBeforeFire: function (func) {
var self = this;
if (self.fired)

This comment has been minimized.

@glasser

glasser Jul 8, 2015

Member

Would it be correct to make this a if (self.armed) check instead? Then you don't have to worry about self.before_fire_callbacks being mutated while being iterated over. (Actually I'm not sure why onAllCommitted doesn't do a if (self.armed) check either.)

This comment has been minimized.

@glasser

glasser Jul 8, 2015

Member

In fact, this makes some of the oplog logic a little easier to understand. Because you clean up the _oplogObserveDrivers field at the end of the onBeforeFire callback, I could worry that somehow another observe driver could come along and register itself again. However, if we ensure that you can't add a second onBeforeFire callback after the fence is armed, and the fence must be armed for the first callback to run, we know we'll at least see an ugly error if this happens.

This comment has been minimized.

@OleksandrChekhovskyi

OleksandrChekhovskyi Jul 9, 2015

Contributor

It will not be correct to check for self.armed instead. Meteor.refresh might get called for a fence after it has been armed in case if method issued async writes (by passing a callback to database methods). That was the reason I have moved it to 'before fire' instead of 'on arm'.

I have added additional code to ensure that everything is re-entrant - if for any reason another write is scheduled on the fence when onBeforeFire callback is being processed, it will attach a new callback and catch up to oplog again. The method has to do really funky stuff to get to that situation though - like doing writes from setTimeout callback and doing 'normal' writes as well.

@@ -59,6 +60,15 @@ _.extend(DDPServer._WriteFence.prototype, {
self._maybeFire();
},
// Register a function to be called before firing the fence.

This comment has been minimized.

@glasser

glasser Jul 8, 2015

Member

Be more explicit: "It's OK for this function to call beginWrite, in which case the fence won't fire until those writes are done as a well. However, the function will only be called once."

(That's OK, right? We're not going to add any more "real" writes while the "catch-up" write is catching up, right?)

This comment has been minimized.

@OleksandrChekhovskyi
@glasser

This comment has been minimized.

Member

glasser commented Jul 8, 2015

Other than these last three comments, this change looks great. Your benchmark is pretty clear that this makes a difference, and it seems correct to me. It makes the WriteFence API a little stranger, but that API is a means to an end. Once you address those I think we should be able to merge this!

Also, if you'd like to add a description to History.md in the top v.NEXT under a new ### Livequery section that would be great.

Oleksandr Chekhovskyi
Catch up to oplog at most once per write fence
Before this change, number of catch-up attempts was N*M, where N is number of
writes inside of the fence, and M is number of active observers on affected collections.
Every catch up issues yet another query to find the latest oplog entry.

It was extremely inefficient, in terms of both CPU usage and added latency.
After executing write-heavy methods, application process was occupied for many seconds
doing the same thing over and over again.

This change provides a performance improvement for all kinds of workloads.

@OleksandrChekhovskyi OleksandrChekhovskyi force-pushed the OleksandrChekhovskyi:devel branch from d3e3877 to 27bdf62 Jul 9, 2015

@OleksandrChekhovskyi

This comment has been minimized.

Contributor

OleksandrChekhovskyi commented Jul 9, 2015

Patch has been updated, and hopefully everything is addressed now.

* Improved server performance by reducing overhead of processing oplog after
database writes. Improvements are most noticeable in case when a method is
doing a lot of writes on collections with plenty of active observers.

This comment has been minimized.

@glasser

glasser Jul 9, 2015

Member

add "#4694" here

@glasser glasser closed this in cb4ba33 Jul 9, 2015

@glasser

This comment has been minimized.

Member

glasser commented Jul 9, 2015

Thanks, merged! This seems like it'll be a great performance increase and I'm excited to see it get out soon.

@rclai

This comment has been minimized.

rclai commented Jul 9, 2015

This is great because I do work with a lot of server-side observers.

@mizzao

This comment has been minimized.

Contributor

mizzao commented Jul 9, 2015

This is great, I first described this issue almost a year ago when we were still on the google groups and I'm very glad that someone fixed it.

https://groups.google.com/forum/#!topic/meteor-talk/Y547Hh2z39Y

Just wanted to point out that the solution proposed there was to "give up" if the oplog observer was too far behind, and just re-query the database to catch up. I'm not sure if that's what @OleksandrChekhovskyi did, but the benchmark seems to have proved something.

I guess this also illustrates why the google group was pretty useless for having these sorts of discussions. (> /dev/null)

@zimme

This comment has been minimized.

Contributor

zimme commented Jul 10, 2015

👍

@sebakerckhof

This comment has been minimized.

Contributor

sebakerckhof commented Jul 11, 2015

@mizzao No, the fallback you describe is already in Meteor since 1.0.4, see:
#2668

Instead, this PR should increase the oplog tailing performance for write-heavy loads, so that meteor doesn't have to use the fallback that fast.

In 1.0.4 there was however another PR merged that should also increase write performance with the oplog: https://github.com/meteor/meteor/pull/3697/commits

@mizzao

This comment has been minimized.

Contributor

mizzao commented Jul 11, 2015

@sebakerckhof excellent, thanks for all those links. I'm very grateful for all of the hardcore performance profilers out there - seems like they are really helping Meteor to grow into a robust and fast platform.

@arunoda

This comment has been minimized.

arunoda commented Jul 11, 2015

This is great. Thanks @OleksandrChekhovskyi

@arunoda

This comment has been minimized.

arunoda commented Jul 12, 2015

@glasser I got some idea while looking at this code base.
This solution pretty great and reduce oplog last entry checks to a few. But, there still an issue if we get a lot of method calls at once. Then updated messages delayed until we ping for oplog.

what if we can process waitUntilCaughtUp in a queue. So, then we can ping oplog once for a lot of waitUntilCaughtUp requests. And it's fairly simple to implement.

Is it sounds good or am I doing talking something wrong?

@glasser

This comment has been minimized.

Member

glasser commented Jul 12, 2015

I'm not 100% sure what you're describing @arunoda but it sounds promising. As long as it doesn't err too far on the side of making the updated message come late because we're not trying for a while. Love to see a PR!

@arunoda

This comment has been minimized.

arunoda commented Jul 12, 2015

It's a random idea. I'll give it a try today. May be not sure that really affect the performance changes. Let's see.

glasser added a commit that referenced this pull request Aug 13, 2015

Don't try to interact with fences that have fired
This bug was introduced by #4694 which switched OplogObserveDriver's
listener from using beginWrite to the new onBeforeFire.  beginWrite
doesn't throw an error when called on a retired fence; onBeforeFire
does.  So don't try to interact with fired fences. (I'm not sure if
there is an importance to the distinction between retired and fired
introduced by dcd2641, but this code should be fine.)

While we're at it, make the error in question (which shouldn't happen)
be delivered to Mongo write callbacks (or thrown), if for no other
reason than that it allows us to test this fix.

Fixes #4839.

glasser added a commit that referenced this pull request Aug 25, 2015

Don't try to interact with fences that have fired
This bug was introduced by #4694 which switched OplogObserveDriver's
listener from using beginWrite to the new onBeforeFire.  beginWrite
doesn't throw an error when called on a retired fence; onBeforeFire
does.  So don't try to interact with fired fences. (I'm not sure if
there is an importance to the distinction between retired and fired
introduced by dcd2641, but this code should be fine.)

While we're at it, make the error in question (which shouldn't happen)
be delivered to Mongo write callbacks (or thrown), if for no other
reason than that it allows us to test this fix.

Fixes #4839.

Conflicts:
	packages/mongo/mongo_livedata_tests.js

@arunoda arunoda referenced this pull request Aug 28, 2015

Closed

ChangeFeed Improvements #5

OleksandrChekhovskyi pushed a commit to Favro/meteor that referenced this pull request Sep 7, 2015

Don't try to interact with fences that have fired
This bug was introduced by meteor#4694 which switched OplogObserveDriver's
listener from using beginWrite to the new onBeforeFire.  beginWrite
doesn't throw an error when called on a retired fence; onBeforeFire
does.  So don't try to interact with fired fences. (I'm not sure if
there is an importance to the distinction between retired and fired
introduced by dcd2641, but this code should be fine.)

While we're at it, make the error in question (which shouldn't happen)
be delivered to Mongo write callbacks (or thrown), if for no other
reason than that it allows us to test this fix.

Fixes meteor#4839.

@mikowals mikowals referenced this pull request Nov 8, 2015

Open

Just a question #1

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment