
Catch up to oplog at most once per write fence #4694

Closed

Conversation

Contributor

@OleksandrChekhovskyi OleksandrChekhovskyi commented Jul 7, 2015

Before this change, the number of catch-up attempts was N*M, where N is the number of writes inside the fence and M is the number of active observers on the affected collections. Every catch-up issues yet another query to find the latest oplog entry.

This was extremely inefficient in terms of both CPU usage and added latency. After executing write-heavy methods, the application process was occupied for many seconds repeating the same work over and over.

This change provides a performance improvement for all kinds of workloads.


CPU profile of a write-intensive method before the change:
[profile screenshot: before]

After the change:
[profile screenshot: after]

Before the change there was a noticeable multi-second lag after the method invocation (caused by the waitUntilCaughtUp function) before anything else could be processed. After the change that lag is gone, and the application is responsive almost immediately after the operation completes.

The effect is most visible when a method does a lot of writes, but this change improves efficiency across the board.
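The N*M versus M behavior described above can be illustrated with a toy model. This is not the actual Meteor implementation; `simulate` and its parameters are made up purely to show why the number of oplog catch-up queries drops.

```javascript
// Illustrative sketch only: before the change, every write in a fence
// triggered a catch-up in every active observer; after the change, each
// observer catches up once, just before the fence fires.
function simulate(numWrites, numObservers, batched) {
  let catchUps = 0;
  if (!batched) {
    // Before: one oplog query per (write, observer) pair.
    for (let w = 0; w < numWrites; w++)
      for (let o = 0; o < numObservers; o++)
        catchUps++;
  } else {
    // After: writes are only counted; each observer catches up once per fence.
    for (let o = 0; o < numObservers; o++)
      catchUps++;
  }
  return catchUps;
}

console.log(simulate(1000, 50, false)); // 50000 catch-up queries (N*M)
console.log(simulate(1000, 50, true));  // 50 catch-up queries (M)
```

With 1000 writes and 50 observers, the old behavior performs 50,000 catch-up queries while the new behavior performs 50, which matches the multi-second lag disappearing after the patch.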

Member

@glasser glasser commented Jul 7, 2015

This is super interesting. Do you have a benchmark or test that we can use to see the performance improvement you're describing?

Meteor.defer(function () {
  if (fence._oplogObserveDrivers) {
    if (!_.contains(fence._oplogObserveDrivers, self))
Member

@glasser glasser Jul 7, 2015


This looks like it will lead to quadratic behavior. I'd rather give OplogObserveDrivers an incrementing ID and use an object to track them as a set.
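A minimal sketch of this suggestion, using hypothetical names (`nextObserveDriverId` and `registerDriverOnFence` are illustrative, not Meteor's actual API): each driver gets an incrementing ID, and the fence tracks drivers in an object keyed by ID, so membership checks are O(1) instead of an O(n) `_.contains` scan.

```javascript
// Monotonically increasing ID assigned to each driver at construction.
let nextObserveDriverId = 1;

function OplogObserveDriver() {
  this._id = nextObserveDriverId++; // assumed field name for illustration
}

function registerDriverOnFence(fence, driver) {
  // Lazily create the set; keys are driver IDs, values are the drivers.
  if (!fence._oplogObserveDrivers)
    fence._oplogObserveDrivers = {};
  // O(1) and idempotent: re-registering the same driver is a no-op.
  fence._oplogObserveDrivers[driver._id] = driver;
}
```

Registering N drivers on a fence is then O(N) overall, instead of the O(N^2) total cost of scanning an array on each registration.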

Contributor Author

@OleksandrChekhovskyi OleksandrChekhovskyi Jul 8, 2015


Indeed. Fixed in new version of the patch.

Contributor Author

@OleksandrChekhovskyi OleksandrChekhovskyi commented Jul 8, 2015

Patch has been updated to address the feedback.

I have created a simple benchmark that illustrates the problem.
https://github.com/OleksandrChekhovskyi/meteor-overhead-benchmark
Keep creating new pages with a button and observe how slow it can get.

_.each(self.completion_callbacks, function (f) {f(self);});
self.completion_callbacks = [];
self.outstanding_writes++;
_.each(self.before_fire_callbacks, function (f) {f(self);});
Member

@glasser glasser Jul 8, 2015


I wonder if we need to think about what happens if one of these throws? We don't for onAllCommitted callbacks, but (a) that's probably a mistake, and (b) the onBeforeFire callback is in practice more complex than the only existing onAllCommitted callback (e.g., it blocks on network activity).

Contributor Author

@OleksandrChekhovskyi OleksandrChekhovskyi Jul 9, 2015


Added a try/catch to safeguard against unexpected exceptions.
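A sketch of what such a safeguard could look like, reusing the callback-list names visible in the diff context above. `runBeforeFireCallbacks` is a hypothetical helper, not the actual patch: the point is that one throwing callback cannot prevent the remaining callbacks from running or the fence from firing.

```javascript
// Run each before-fire callback inside try/catch so an unexpected
// exception in one callback doesn't abort the others.
function runBeforeFireCallbacks(fence) {
  fence.before_fire_callbacks.forEach(function (f) {
    try {
      f(fence);
    } catch (err) {
      // Log and continue; the fence must still fire for everyone else.
      console.error('exception in onBeforeFire callback:', err);
    }
  });
  fence.before_fire_callbacks = [];
}
```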

Member

@glasser glasser commented Jul 8, 2015

Other than these last three comments, this change looks great. Your benchmark makes it pretty clear that this makes a difference, and it seems correct to me. It makes the WriteFence API a little stranger, but that API is a means to an end. Once you address those comments, I think we should be able to merge this!

Also, if you'd like to add a description to History.md in the top v.NEXT under a new ### Livequery section that would be great.

Contributor Author

@OleksandrChekhovskyi OleksandrChekhovskyi commented Jul 9, 2015

Patch has been updated, and hopefully everything is addressed now.


* Improved server performance by reducing the overhead of processing the oplog
  after database writes. Improvements are most noticeable when a method does
  a lot of writes on collections with many active observers.
Member

@glasser glasser Jul 9, 2015


add "#4694" here

Member

@glasser glasser commented Jul 9, 2015

Thanks, merged! This seems like it'll be a great performance increase and I'm excited to see it get out soon.


@rclai rclai commented Jul 9, 2015

This is great because I do work with a lot of server-side observers.

Contributor

@mizzao mizzao commented Jul 9, 2015

This is great. I first described this issue almost a year ago, when we were still on Google Groups, and I'm very glad that someone fixed it.

https://groups.google.com/forum/#!topic/meteor-talk/Y547Hh2z39Y

Just wanted to point out that the solution proposed there was to "give up" if the oplog observer fell too far behind and simply re-query the database to catch up. I'm not sure if that's what @OleksandrChekhovskyi did, but the benchmark certainly seems to demonstrate an improvement.

I guess this also illustrates why the google group was pretty useless for having these sorts of discussions. (> /dev/null)

Contributor

@zimme zimme commented Jul 10, 2015

👍

Contributor

@sebakerckhof sebakerckhof commented Jul 11, 2015

@mizzao No, the fallback you describe has been in Meteor since 1.0.4, see:
#2668

Instead, this PR should increase oplog tailing performance for write-heavy loads, so that Meteor doesn't have to resort to that fallback as quickly.

In 1.0.4, however, another PR was also merged that should increase write performance with the oplog: https://github.com/meteor/meteor/pull/3697/commits

Contributor

@mizzao mizzao commented Jul 11, 2015

@sebakerckhof Excellent, thanks for all those links. I'm very grateful for all the hardcore performance profilers out there; it seems like they are really helping Meteor grow into a robust and fast platform.


@arunoda arunoda commented Jul 11, 2015

This is great. Thanks @OleksandrChekhovskyi


@arunoda arunoda commented Jul 12, 2015

@glasser I got an idea while looking at this code base.
This solution is pretty great and reduces the last-entry oplog checks to a few. But there is still an issue if we get a lot of method calls at once: the updated messages are delayed until we ping the oplog.

What if we processed waitUntilCaughtUp in a queue? Then we could ping the oplog once for many waitUntilCaughtUp requests. And it should be fairly simple to implement.

Does that sound good, or am I missing something?
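One way this batching idea could be sketched (`flushCatchUps` is a hypothetical stand-in for whatever event would trigger the batch, and none of these names are Meteor's actual API): callers are only enqueued, and a single query for the latest oplog entry resolves every queued waiter at once.

```javascript
// Queue of pending waitUntilCaughtUp callbacks and a counter of how many
// oplog queries were actually issued.
let waiters = [];
let pollCount = 0;

function waitUntilCaughtUp(done) {
  waiters.push(done); // just enqueue; no oplog query yet
}

function flushCatchUps() {
  if (waiters.length === 0) return;
  pollCount++;                        // one oplog query serves every waiter
  const latest = { ts: Date.now() };  // stand-in for the real latest-entry query
  const toNotify = waiters;
  waiters = [];
  toNotify.forEach(function (cb) { cb(latest); });
}
```

However many callers queue up between flushes, each flush costs exactly one oplog query, which is the "ping once for a lot of requests" behavior being proposed.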

Member

@glasser glasser commented Jul 12, 2015

I'm not 100% sure what you're describing, @arunoda, but it sounds promising, as long as it doesn't err too far on the side of delivering the updated message late because we hold off on trying for a while. I'd love to see a PR!


@arunoda arunoda commented Jul 12, 2015

It's a rough idea. I'll give it a try today; I'm not sure yet whether it will really improve performance. Let's see.

glasser added a commit that referenced this issue Aug 13, 2015
This bug was introduced by #4694 which switched OplogObserveDriver's
listener from using beginWrite to the new onBeforeFire.  beginWrite
doesn't throw an error when called on a retired fence; onBeforeFire
does.  So don't try to interact with fired fences. (I'm not sure if
there is an importance to the distinction between retired and fired
introduced by dcd2641, but this code should be fine.)

While we're at it, make the error in question (which shouldn't happen)
be delivered to Mongo write callbacks (or thrown), if for no other
reason than that it allows us to test this fix.

Fixes #4839.
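The guard described in the commit message could look roughly like this (`maybeListenToFence` and the `fired`/`retired` flags are illustrative names, not the actual fix): check the fence's state before registering an onBeforeFire listener, since calling onBeforeFire on a retired fence throws where beginWrite did not.

```javascript
// Only register the onBeforeFire listener while the fence can still fire.
function maybeListenToFence(fence, driver) {
  if (fence.fired || fence.retired) // assumed flag names for illustration
    return false;                   // too late; nothing to wait for
  fence.onBeforeFire(function () {
    driver._waitUntilCaughtUp();    // catch up once, just before firing
  });
  return true;
}
```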
glasser added a commit that referenced this issue Aug 25, 2015
Conflicts:
	packages/mongo/mongo_livedata_tests.js