Catch up to oplog at most once per write fence #4694
This is super interesting. Do you have a benchmark or test that we can use to see the performance improvement you're describing?
Meteor.defer(function () {
if (fence._oplogObserveDrivers) {
if (!_.contains(fence._oplogObserveDrivers, self))
This looks like it will lead to quadratic behavior. I'd rather give OplogObserveDrivers an incrementing ID and use an object to track them as a set.
Indeed. Fixed in new version of the patch.
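The reviewer's suggestion can be sketched roughly as follows. This is only an illustration, not the code from the patch: `Driver`, `registerDriver`, and the `_drivers` property are hypothetical names. Each driver gets an incrementing ID, and the fence tracks drivers in an object keyed by that ID, so a membership check no longer scans an array.

```javascript
// Hypothetical stand-ins for illustration; not Meteor's actual classes.
var nextDriverId = 1;

function Driver(name) {
  this.name = name;
  this._id = nextDriverId++; // unique, incrementing ID per driver
}

// Register a driver on a fence at most once. The object keyed by ID acts as
// a set with constant-time lookups, whereas scanning an array of drivers
// costs time proportional to the number of drivers on every check.
function registerDriver(fence, driver) {
  if (!fence._drivers) fence._drivers = {};
  if (Object.prototype.hasOwnProperty.call(fence._drivers, driver._id)) {
    return false; // already registered on this fence
  }
  fence._drivers[driver._id] = driver;
  return true;
}

var fence = {};
var a = new Driver("a");
var b = new Driver("b");
console.log(registerDriver(fence, a)); // true
console.log(registerDriver(fence, a)); // false
console.log(registerDriver(fence, b)); // true
console.log(Object.keys(fence._drivers).length); // 2
```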
1dfc1f4
to
d3e3877
Patch has been updated to address the feedback. I have created a simple benchmark that illustrates the problem.
_.each(self.completion_callbacks, function (f) {f(self);});
self.completion_callbacks = [];
self.outstanding_writes++;
_.each(self.before_fire_callbacks, function (f) {f(self);});
I wonder if we need to think about what happens if one of these throws? We aren't doing so for onAllCommitted callbacks, but (a) that's probably a mistake, and (b) the onBeforeFire callback is in practice more complex than the only existing onAllCommitted callback (e.g., it blocks on network activity).
Added a try/catch to safeguard against unexpected exceptions.
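The thread does not show how the try/catch was added, so the following is only a minimal sketch of the general idea, with a hypothetical helper named `runCallbacks`: each callback runs inside its own try/catch, so one throwing callback does not prevent the remaining ones from running.

```javascript
// Hypothetical helper for illustration; not the code from the patch.
// Runs every callback, reports failures through onError, and keeps going.
function runCallbacks(callbacks, arg, onError) {
  var failures = 0;
  callbacks.forEach(function (f) {
    try {
      f(arg);
    } catch (e) {
      failures++;
      onError(e); // report the error instead of aborting the loop
    }
  });
  return failures;
}

var calls = [];
var failures = runCallbacks([
  function () { calls.push("first"); },
  function () { throw new Error("boom"); },
  function () { calls.push("third"); }
], null, function (e) { calls.push("error: " + e.message); });

console.log(calls);    // [ 'first', 'error: boom', 'third' ]
console.log(failures); // 1
```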
Other than these last three comments, this change looks great. Your benchmark is pretty clear that this makes a difference, and it seems correct to me. It makes the WriteFence API a little stranger, but that API is a means to an end. Once you address those I think we should be able to merge this! Also, if you'd like to add a description to
Before this change, the number of catch-up attempts was N*M, where N is the number of writes inside the fence and M is the number of active observers on the affected collections. Every catch-up issues yet another query to find the latest oplog entry. This was extremely inefficient in terms of both CPU usage and added latency: after executing write-heavy methods, the application process was occupied for many seconds doing the same thing over and over again. This change provides a performance improvement for all kinds of workloads.
d3e3877
to
27bdf62
Patch has been updated, and hopefully everything is addressed now.
* Improved server performance by reducing overhead of processing oplog after
  database writes. Improvements are most noticeable in case when a method is
  doing a lot of writes on collections with plenty of active observers.
add "#4694" here
Thanks, merged! This seems like it'll be a great performance increase and I'm excited to see it get out soon.
This is great because I do work with a lot of server-side observers.
This is great, I first described this issue almost a year ago when we were still on the google groups and I'm very glad that someone fixed it.
Just wanted to point out that the solution proposed there was to "give up" if the oplog observer was too far behind, and just re-query the database to catch up. I'm not sure if that's what @OleksandrChekhovskyi did, but the benchmark seems to have proved something. I guess this also illustrates why the google group was pretty useless for having these sorts of discussions. (> /dev/null)
👍
@mizzao No, the fallback you describe is already in Meteor since 1.0.4, see: Instead, this PR should increase the oplog tailing performance for write-heavy loads, so that meteor doesn't have to use the fallback that fast. In 1.0.4 there was however another PR merged that should also increase write performance with the oplog: https://github.com/meteor/meteor/pull/3697/commits
@sebakerckhof excellent, thanks for all those links. I'm very grateful for all of the hardcore performance profilers out there - seems like they are really helping Meteor to grow into a robust and fast platform.
This is great. Thanks @OleksandrChekhovskyi
@glasser I got an idea while looking at this code base. What if we process waitUntilCaughtUp in a queue? Then we can ping the oplog once for a lot of Does that sound good, or am I getting something wrong?
I'm not 100% sure what you're describing @arunoda but it sounds promising. As long as it doesn't err too far on the side of making the updated message come late because we're not trying for a while. Love to see a PR!
It's a random idea. I'll give it a try today. I'm not sure it will really affect performance. Let's see.
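One possible reading of the queue idea, sketched under heavy assumptions: callers that ask to catch up within the same tick are queued, and a single oplog query serves the whole batch. Everything here is hypothetical, including `makeBatchedCatchUp`, the `fetchLatest` stand-in for the "find the latest oplog entry" query, and the injectable `defer` scheduler. It is not how Meteor implements waitUntilCaughtUp.

```javascript
// Hypothetical sketch; names and behavior are assumptions, not Meteor's API.
// fetchLatest(cb) stands in for the query that finds the latest oplog entry.
// defer(fn) schedules fn to run later; it defaults to setTimeout(fn, 0).
function makeBatchedCatchUp(fetchLatest, defer) {
  defer = defer || function (fn) { setTimeout(fn, 0); };
  var waiting = [];      // callbacks queued for the next query
  var scheduled = false; // true while a flush is pending

  return function waitUntilCaughtUp(callback) {
    waiting.push(callback);
    if (scheduled) return; // a flush is already pending; just join the queue
    scheduled = true;
    defer(function () {
      // Take the whole queue. The query starts only after every queued
      // caller has registered, so its result is fresh enough for all of them.
      var batch = waiting;
      waiting = [];
      scheduled = false;
      fetchLatest(function (entry) {
        batch.forEach(function (cb) { cb(entry); });
      });
    });
  };
}

// Demo with a fake query and a manual scheduler, so it runs synchronously.
var pings = 0;
function fakeFetchLatest(cb) { pings++; cb({ ts: pings }); }
var tasks = [];
var batchedCatchUp = makeBatchedCatchUp(fakeFetchLatest, function (fn) {
  tasks.push(fn);
});

var served = 0;
for (var i = 0; i < 5; i++) {
  batchedCatchUp(function () { served++; });
}
tasks.forEach(function (fn) { fn(); });
console.log(pings, served); // 1 5
```

The deferral is the trade-off raised in the reply above: batching saves queries, but every caller waits at least until the deferred flush runs, so the delay has to stay short.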
This bug was introduced by #4694 which switched OplogObserveDriver's listener from using beginWrite to the new onBeforeFire. beginWrite doesn't throw an error when called on a retired fence; onBeforeFire does. So don't try to interact with fired fences. (I'm not sure if there is an importance to the distinction between retired and fired introduced by dcd2641, but this code should be fine.) While we're at it, make the error in question (which shouldn't happen) be delivered to Mongo write callbacks (or thrown), if for no other reason than that it allows us to test this fix. Fixes #4839.
This bug was introduced by #4694 which switched OplogObserveDriver's listener from using beginWrite to the new onBeforeFire. beginWrite doesn't throw an error when called on a retired fence; onBeforeFire does. So don't try to interact with fired fences. (I'm not sure if there is an importance to the distinction between retired and fired introduced by dcd2641, but this code should be fine.) While we're at it, make the error in question (which shouldn't happen) be delivered to Mongo write callbacks (or thrown), if for no other reason than that it allows us to test this fix. Fixes #4839. Conflicts: packages/mongo/mongo_livedata_tests.js
Before this change, the number of catch-up attempts was N*M, where N is the number of writes inside the fence and M is the number of active observers on the affected collections. Every catch-up issues yet another query to find the latest oplog entry.
This was extremely inefficient in terms of both CPU usage and added latency. After executing write-heavy methods, the application process was occupied for many seconds doing the same thing over and over again.
This change provides a performance improvement for all kinds of workloads.
CPU profile of write-intensive method before the change:
After the change:
Before the change, there was a noticeable lag of several seconds after the method invocation (caused by the waitUntilCaughtUp function) before anything else could be processed. After the change that lag is gone, and the application is responsive effectively immediately after the operation is done. The most visible effect is achieved when a method does a lot of writes, but this change improves efficiency across the board.
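To make the N*M figure concrete, here is a toy model (not Meteor code) that only counts catch-up attempts. The `simulateCatchUps` function and its `oncePerFence` flag are hypothetical. The "after" branch assumes, following the PR title, that the fence triggers at most one catch-up right before it fires.

```javascript
// Toy counting model; hypothetical, for illustration only.
// numWrites = N (writes inside the fence), numObservers = M (active observers).
function simulateCatchUps(numWrites, numObservers, oncePerFence) {
  var catchUps = 0;
  var fenceNeedsCatchUp = false;
  for (var w = 0; w < numWrites; w++) {
    for (var o = 0; o < numObservers; o++) {
      if (oncePerFence) {
        fenceNeedsCatchUp = true; // only remember that a catch-up is needed
      } else {
        catchUps++; // every observer catches up after every write
      }
    }
  }
  if (fenceNeedsCatchUp) catchUps++; // single catch-up before the fence fires
  return catchUps;
}

console.log(simulateCatchUps(100, 20, false)); // 2000
console.log(simulateCatchUps(100, 20, true));  // 1
```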