
Fix multi-op support and add support for oplog refs #1

Open · cheald wants to merge 7 commits into tokumx-compatibility from mashable:tokumx-compatibility
Conversation

cheald commented Apr 18, 2014

I've expanded on my previous fixes and spent a fair amount of time with my head down in TokuMX's oplog. Rather than flattening oplog entries, we treat all oplog entries as potential multi-op entries. When a ref op is encountered, the appropriate refs are found and iterated (using the same technique as the TokuMX server's replication code). This results in full support for TokuMX transactions and multi-ops, while still maintaining compatibility with the vanilla Mongo oplog format.

Additionally, the inner loop issue you previously mentioned has been resolved; checking the river status in the inner loop allows for dashboard interrupts of the river without requiring continual creation of new cursors.

The biggest outstanding issue is the oplog.refs lookup. The semantics are kinda fuzzy, but the general idea is that we will have one or more refs with monotonically-increasing seq numbers. The oplog.refs _id format is {oid: ObjectId(...), seq: num}, but an index only exists on _id. The way the server finds entries is via:

            LOCK_REASON(lockReason, "repl: finding oplog.refs entry to apply");
            Client::ReadContext ctx(rsOplogRefs, lockReason);
            const BSONObj query = BSON("_id" << BSON("$gt" << BSON("oid" << oid << "seq" << seq)));
            if (!Collection::findOne(rsOplogRefs, query, entry, true)) {
                break;
            }

The only thing it does here that I haven't been able to replicate is the fourth parameter to findOne, a requireIndex boolean, which appears to cause the query to be executed in index-only mode. I haven't found a way to replicate that with the driver yet, and explain() shows that the query is not executed in indexOnly mode. Improving that should substantially improve performance, since the _id index is clustered.
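For reference, here's roughly how that lookup can be expressed from the driver side. This is only a minimal sketch against the legacy 2.x Java driver API (DBCollection/BasicDBObject); the hint() is just an attempt to keep the planner on the clustered _id index, not a replication of requireIndex, and the local.oplog.refs naming, the "ops" field, and the processOps() handler are assumptions for illustration.

    import com.mongodb.BasicDBObject;
    import com.mongodb.DB;
    import com.mongodb.DBCollection;
    import com.mongodb.DBCursor;
    import com.mongodb.DBObject;
    import com.mongodb.MongoClient;
    import org.bson.types.ObjectId;

    public class OplogRefsLookup {

        /**
         * Walk the local.oplog.refs entries belonging to the given ref oid,
         * mirroring the server loop quoted above: query {_id: {$gt: {oid, seq}}}
         * and stop as soon as the returned entry belongs to a different oid.
         */
        public static void applyRefs(MongoClient mongo, ObjectId oid) {
            DB local = mongo.getDB("local");
            DBCollection refs = local.getCollection("oplog.refs");

            long seq = 0;
            while (true) {
                DBObject lowerBound = new BasicDBObject("oid", oid).append("seq", seq);
                DBObject query = new BasicDBObject("_id", new BasicDBObject("$gt", lowerBound));

                // hint() keeps the planner on the (clustered) _id index; it does not
                // reproduce the server's requireIndex / index-only behavior.
                DBCursor cursor = refs.find(query)
                        .sort(new BasicDBObject("_id", 1))
                        .hint(new BasicDBObject("_id", 1))
                        .limit(1);
                if (!cursor.hasNext()) {
                    break;
                }
                DBObject entry = cursor.next();
                DBObject id = (DBObject) entry.get("_id");
                if (!oid.equals(id.get("oid"))) {
                    break; // ran past the refs for this transaction
                }
                seq = ((Number) id.get("seq")).longValue();

                // Each refs entry carries the batched operations; field name and
                // handler are placeholders here:
                // processOps((List<DBObject>) entry.get("ops"));
            }
        }
    }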

I've not yet written tests, but since you're working on that, I figured this code might be useful in the course of testing full compatibility.

…e that when we encounter an oplog.refs entry, the river doesn't bomb out on an NPE.

* Fix oplog.refs oid entry handling
* Fix UPDATE_ROW op
* Validate that we got an item before processing it
* Add explicit log message for NPE on namespace fetch
* Fix findAndModify support for TokuMX
kdkeck (Owner) commented Apr 18, 2014

If the oplog.refs query isn't using indexOnly mode, I would consider that a server-side bug and report it to TokuTech.

I'm also surprised that you think continuing to use the same cursor will work; as soon as cursor.hasNext() returns false, I would have expected that it would close the cursor on the server and continue to return false without querying the server again. And even if that's not the case, I would be worried that the client library would at some point start to work that way, if there's not an explicit setting to ask for different behavior. Have you tested that out?


cheald (Author) commented Apr 18, 2014

I haven't tested it explicitly, but while watching the TRACE output in the log I noticed that the river was continually re-playing ops that had already been handled; my guess was that this was due to reopening new cursors (in fact, after all writes to our TokuMX instance had stopped, the river continued to replicate the last chunk of ops over and over). Changing the loop so that it loops around hasNext solves the issue cleanly (and anecdotally, has made the whole shebang feel substantially faster, though that's completely unquantified).

The cursor that gets opened is done so with QUERYOPTION_TAILABLE | QUERYOPTION_AWAITDATA | QUERYOPTION_NOTIMEOUT, so it's not going to behave like a normal cursor. The QUERYOPTION_TAILABLE option in particular prevents the cursor from being closed when it runs out of data, and QUERYOPTION_NOTIMEOUT prevents it from expiring on the server. This is the same kind of cursor used by internal replication to tail the oplog and replay it on secondaries; constantly opening new cursors is obviously non-ideal from a performance perspective.

If you run the Slurper with debug logging, you'll notice that the 500ms timeout debug message is never logged - this is because hasNext never returns false for a tailable cursor! QUERYOPTION_AWAITDATA causes either hasNext or next to block (I'm not quite clear on which, will check the source shortly) until data is available.

If the cursor does get closed, then attempting to read next() from it will raise an exception, pop us out of that inner loop, and restart with a new cursor. You can easily observe this by performing a replica master failover with the river running.
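To make the loop structure concrete, here's the shape of what I'm describing; just a sketch against the legacy 2.x Java driver (Bytes.QUERYOPTION_* constants, DBCursor.addOption), with a hypothetical processOplogEntry() standing in for the river's real per-op handling.

    import com.mongodb.BasicDBObject;
    import com.mongodb.Bytes;
    import com.mongodb.DBCollection;
    import com.mongodb.DBCursor;
    import com.mongodb.DBObject;
    import com.mongodb.MongoClient;
    import org.bson.types.BSONTimestamp;

    public class OplogTailer {

        /**
         * Tail local.oplog.rs starting after lastTimestamp. The outer loop only
         * re-opens a cursor after the inner loop dies (e.g. the cursor was killed
         * by a primary failover); the inner loop just spins on hasNext(), which
         * blocks thanks to QUERYOPTION_AWAITDATA instead of returning false.
         */
        public void tail(MongoClient mongo, BSONTimestamp lastTimestamp) {
            DBCollection oplog = mongo.getDB("local").getCollection("oplog.rs");

            while (true) { // outer loop: one iteration per cursor
                DBObject query = new BasicDBObject("ts", new BasicDBObject("$gt", lastTimestamp));
                DBCursor cursor = oplog.find(query)
                        .addOption(Bytes.QUERYOPTION_TAILABLE)
                        .addOption(Bytes.QUERYOPTION_AWAITDATA)
                        .addOption(Bytes.QUERYOPTION_NOTIMEOUT);
                try {
                    while (cursor.hasNext()) { // inner loop: reuse the same cursor
                        // (the river also checks its own status here, so a
                        // dashboard stop can break out of the loop)
                        DBObject entry = cursor.next();
                        lastTimestamp = (BSONTimestamp) entry.get("ts");
                        processOplogEntry(entry); // hypothetical handler
                    }
                } catch (Exception e) {
                    // cursor was closed out from under us (failover, etc.);
                    // fall through and open a fresh cursor
                } finally {
                    cursor.close();
                }
            }
        }

        private void processOplogEntry(DBObject entry) {
            // placeholder for the river's per-op handling
        }
    }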

kdkeck (Owner) commented Apr 18, 2014

Ah, cool. In that case I would suggest creating and submitting a pull request with just that change to the upstream repository, and I think it would get accepted quickly. (Using those query options was a relatively recent change, made on the advice of the MongoDB team, without such a complete explanation of what they would all do.)


kdkeck (Owner) commented May 14, 2014

So it looks like my updated unit tests all still pass after a merge of this pull request. I'm still a bit concerned, though, about the 'a' flag on oplog entries, when you're reading from a secondary. I think I need to figure out how to write a Toku-specific unit test for that. Probably using a mocked cursor.

kdkeck (Owner) commented May 14, 2014

See this post, if you haven't already, for more info on the issue: http://www.tokutek.com/2014/04/on-tokumx-oplog-tailable-cursors-and-concurrency/

On second thought, that post implies that a cursor with the right flags set may never return an uncommitted entry. Maybe we can just add a verification of that, so we'll see if it's ever violated?

cheald (Author) commented May 14, 2014

I do think it would be a good idea to at least add that assert for truthiness of the "a" flag, even though we expect the cursor to never return an a: false record per that post above.
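Something along these lines is what I have in mind; only a sketch (the slf4j logger and the warn-and-skip handling are placeholders, and the "a" semantics are as described in the post above), not the actual patch:

    import com.mongodb.DBObject;
    import org.slf4j.Logger;
    import org.slf4j.LoggerFactory;

    public class OplogEntryGuard {

        private static final Logger logger = LoggerFactory.getLogger(OplogEntryGuard.class);

        /**
         * TokuMX oplog entries carry an "a" flag. Per the post above, a tailable
         * cursor with the right flags should never hand us an a:false entry, so
         * log loudly if one ever shows up.
         */
        public static boolean isApplied(DBObject entry) {
            Object applied = entry.get("a");
            if (applied instanceof Boolean && !((Boolean) applied)) {
                logger.warn("Encountered oplog entry with a:false: {}", entry);
                return false;
            }
            return true; // vanilla MongoDB entries have no "a" field; treat as applied
        }
    }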

cheald (Author) commented May 14, 2014

FWIW, we have this reading from our master and replicating collections with tens of millions of documents just fine. We've run into an issue when Infinity and NaN are in number fields (valid float values in JavaScript, not so much in Java), but beyond that, it's working beautifully.

kdkeck (Owner) commented May 14, 2014

Your Float issue sounds like a serialization issue that should be raised on the upstream project. Java's float datatype does in fact include infinity and NaN values, per the IEEE spec.
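To illustrate (a trivial, project-agnostic sketch): the values exist natively in Java; it's strict JSON, which has no literal for Infinity or NaN, that typically makes a serializer choke.

    public class NonFiniteDemo {
        public static void main(String[] args) {
            float inf = Float.POSITIVE_INFINITY; // IEEE 754 infinity
            float nan = Float.NaN;               // IEEE 754 NaN

            System.out.println(Float.isInfinite(inf)); // true
            System.out.println(Float.isNaN(nan));      // true

            // The problem is downstream: strict JSON has no literal for Infinity
            // or NaN, so these values fail at serialization time, not in Java's
            // number types.
        }
    }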

cheald (Author) commented May 14, 2014

Oh yeah, it definitely is. I just mean that as a "We aren't having any TokuMX-specific issues".

kdkeck (Owner) commented May 15, 2014

Sadly, when I inserted a check for a=false at the beginning of processOplogEntry, it caused most of the TokuMX tests to fail. So apparently TokuMX does not block the tailable cursor until a=true when you're reading from a secondary. And it's possible that's why Richard reported that the tests failed for him:

richardwilly98#248 (comment)

The fix I've come up with has now been made into a new pull request:

richardwilly98#270
