Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Error in Mongo query in polling observer blocks other queries #2534

Closed
mitar opened this issue Sep 7, 2014 · 8 comments
Closed

Error in Mongo query in polling observer blocks other queries #2534

mitar opened this issue Sep 7, 2014 · 8 comments

Comments

@mitar
Copy link
Contributor

@mitar mitar commented Sep 7, 2014

OK, because of a bug in MongoDB my query was failing. This would not be a problem if that query would not more or less made other parts of the site inoperable.

See this repository. Run it with meteor and open http://127.0.0.1:3000/. You should start getting every 10 seconds something like the following in the console:

I20140907-01:00:39.069(-7)? Exception in queued task: MongoError: positional operator element mismatch
I20140907-01:00:39.069(-7)?     at Object.Future.wait (~/.meteor/packages/meteor-tool/.1.0.28.y8a796++os.osx.x86_64+web.browser+web.cordova/meteor-tool-os.osx.x86_64/dev_bundle/lib/node_modules/fibers/future.js:326:15)
I20140907-01:00:39.070(-7)?     at _.extend._nextObject (packages/mongo/mongo_driver.js:858)
I20140907-01:00:39.070(-7)?     at _.extend.forEach (packages/mongo/mongo_driver.js:892)
I20140907-01:00:39.070(-7)?     at _.extend.getRawObjects (packages/mongo/mongo_driver.js:941)
I20140907-01:00:39.070(-7)?     at _.extend._pollMongo (packages/mongo/polling_observe_driver.js:144)
I20140907-01:00:39.071(-7)?     at Object.task (packages/mongo/polling_observe_driver.js:85)
I20140907-01:00:39.071(-7)?     at _.extend._run (packages/meteor/fiber_helpers.js:145)
I20140907-01:00:39.071(-7)?     at packages/meteor/fiber_helpers.js:123
I20140907-01:00:39.071(-7)?     - - - - -
I20140907-01:00:39.071(-7)?     at Object.toError (~/.meteor/packages/mongo/.1.0.4.cdngnx++os+web.browser+web.cordova/npm/node_modules/mongodb/lib/mongodb/utils.js:110:11)
I20140907-01:00:39.072(-7)?     at ~/.meteor/packages/mongo/.1.0.4.cdngnx++os+web.browser+web.cordova/npm/node_modules/mongodb/lib/mongodb/cursor.js:686:54
I20140907-01:00:39.072(-7)?     at Cursor.close (~/.meteor/packages/mongo/.1.0.4.cdngnx++os+web.browser+web.cordova/npm/node_modules/mongodb/lib/mongodb/cursor.js:969:5)
I20140907-01:00:39.072(-7)?     at commandHandler (~/.meteor/packages/mongo/.1.0.4.cdngnx++os+web.browser+web.cordova/npm/node_modules/mongodb/lib/mongodb/cursor.js:686:21)
I20140907-01:00:39.072(-7)?     at ~/.meteor/packages/mongo/.1.0.4.cdngnx++os+web.browser+web.cordova/npm/node_modules/mongodb/lib/mongodb/db.js:1843:9
I20140907-01:00:39.072(-7)?     at Server.Base._callHandler (~/.meteor/packages/mongo/.1.0.4.cdngnx++os+web.browser+web.cordova/npm/node_modules/mongodb/lib/mongodb/connection/base.js:445:41)
I20140907-01:00:39.073(-7)?     at ~/.meteor/packages/mongo/.1.0.4.cdngnx++os+web.browser+web.cordova/npm/node_modules/mongodb/lib/mongodb/connection/server.js:468:18
I20140907-01:00:39.073(-7)?     at MongoReply.parseBody (~/.meteor/packages/mongo/.1.0.4.cdngnx++os+web.browser+web.cordova/npm/node_modules/mongodb/lib/mongodb/responses/mongo_reply.js:68:5)
I20140907-01:00:39.073(-7)?     at null.<anonymous> (~/.meteor/packages/mongo/.1.0.4.cdngnx++os+web.browser+web.cordova/npm/node_modules/mongodb/lib/mongodb/connection/server.js:426:20)
I20140907-01:00:39.073(-7)?     at emit (events.js:95:17)

OK. There is a bug in my query (it would be really very very helpful if Meteor would output which query is failing, because otherwise it is really hard to debug, I had to patch Meteor for that), but this should not block the rest of the site to work. For example, if you run now (while those errors are shown in the console) in the browser:

Meteor.subscribe('all', function () { console.log(Test.find().fetch()) } );

You will notice this subscription never returns any data. Even if by itself does not use any problematic query.

I have disabled oplog because for the queries where I was getting this problem on production Meteor was using polling as well (maybe because I am using old version of Meteor, pre Blaze).

@n1mmy
Copy link
Contributor

@n1mmy n1mmy commented Sep 11, 2014

Thanks for the good reproduction. We're seeing the same thing. We'll look into it.

Loading

@glasser
Copy link
Member

@glasser glasser commented Sep 11, 2014

My guess: the observeChanges call in _publishCursor is not throwing on this error even though it should. This blocks DDP processing.

I'm not really clear on why we don't defer the _publishCursor calls: I don't see why they should block DDP processing.

Loading

@mitar
Copy link
Contributor Author

@mitar mitar commented Sep 11, 2014

I'm not really clear on why we don't defer the _publishCursor calls: I don't see why they should block DDP processing.

I think I asked the same question here. I think it is the same question.

Loading

@glasser
Copy link
Member

@glasser glasser commented Sep 11, 2014

Well, there are two questions: deferring the publisher handler body, and deferring the publishCursor work.

Loading

@glasser
Copy link
Member

@glasser glasser commented Feb 7, 2015

https://github.com/tmeasday/meteor-1633 is another reproduction, brought over from #1633.

Loading

glasser added a commit that referenced this issue Feb 7, 2015
Now when we get an error from Mongo, the cursor description is included.

Other than by the test-only _resumePolling, _pollMongo is only called at
the top level of an async-queued task, so the previously thrown error
never did anything other than print the exception (from within
_SynchronousQueue) anyway.

Fixes #1633.

(See also #2534.)
@glasser
Copy link
Member

@glasser glasser commented Feb 7, 2015

OK, so there are three related issues here:

(1) The root cause of any problem here: we can't tell the difference between a temporary connection error which we should retry, and a bad query error from Mongo which we should give up on. This leads to observes that never get initial query completed, which is what causes the other problems. We should figure out how to analyze Mongo's errors better to tell the different.

(2) Why does this affect the current DDP connection? Well, if the observeChanges call that hangs is inside a publish function, or implicitly in the _publishCursor call invoked on a cursor returned by the publish function, then the DDP server will never finish processing the current message and will never move on to the next message. So the connection is basically dead. I think that we should probably defer the whole execution of the publish function (and _publishCursor). This will also give the performance advantage that multiple subscriptions made at the same time can be processed in parallel. We will have be careful to ensure that it's OK that if you call subscribe and then a method call, the write fence of the method call may not know about all observes related to the subscription yet, since the publish function may not have run (fully)!

(3) Why does this affect future DDP connections? Specifically, it affects the updated message of methods on all DDP connections (ie, the write fence). The idea behind the write fence is to create a connection between observes that are part of a subscription on a DDP connection and methods that run on that connection. But it's not implemented that directly. It actually connects ALL observes to ALL methods. So, eg, if there's a broken observe on a collection (say, Meteor.users), and another method writes to that collection (as login does), then its write fence will include that broken observe, and thus never fire. This is what is happening in @tmeasday 's reproduction.

To fix this, we could do something where observes have an EnvironmentVariable that is the current subscription (or really, current DDP connection) and WriteFences also know their connection and we only connect WriteFences to observes for the same connection? It's a little tricky due to de-duping though.

Loading

@glasser
Copy link
Member

@glasser glasser commented Feb 9, 2015

As far as fixing the root cause, I am told by the Node Mongo maintainer that we can differentiate between command errors and network errors (in Mongo 2.6 and up) by looking on the error object for a numeric code.

Some of the internal APIs around the ObserveMultiplexer and ObserveDrivers need to be reworked to properly get those errors back to the original observeChanges call(s).

Loading

glasser added a commit that referenced this issue Feb 26, 2015
Test fails on server (passes on client).
glasser added a commit that referenced this issue Feb 26, 2015
I don't think that this properly stops the driver, and in fact "can't
make ObserveMultiplexer ready twice" will be printed every 10 seconds
afterwards

also the "ready twice" error message is misleading

also we should test that the de-dup case here works (ie, that two
duplicated observeChanges calls both get the thrown error)

also we should test something about the error that gets thrown
glasser added a commit that referenced this issue Feb 26, 2015
Fixes #2534.

In the past, a bad query would be retried forever, causing the
observeChanges to hang.

(In the common case of a DDP subscription, this would also cause the
current DDP connection to completely stop reading messages, and would
make methods that do writes to the same collection (on ANY DDP
connection) never get their 'updated' messages.  See
#2534 (comment)
These two secondary problems may still need to be fixed but at least
the root cause should be addressed.)

This only fixes PollingObserveDriver, not OplogObserveDriver, but this
issue typically only affects PollingObserveDriver because we don't
choose to use OplogObserveDriver when minimongo can't understand a
query.
glasser added a commit that referenced this issue Feb 26, 2015
In order to test this, we applied the following diff to
minimongo/selector.js:

--- a/packages/minimongo/selector.js
+++ b/packages/minimongo/selector.js
@@ -591,6 +591,8 @@ ELEMENT_OPERATORS = {
   },
   $in: {
     compileElementSelector: function (operand) {
+      if (operand === null)
+        operand = [];
       if (!isArray(operand))
         throw Error("$in needs an array");

This means that minimongo now allows `$in: null`, and thus
OplogObserveDriver can be used for the query in the new 'bad query'
test, but mongod still returns an error that is processed by the passing
test.  (The client side of the test fails with this temporary change,
because minimongo no longer throws!)

It's hard to see how to test this better, because minimongo should throw
on any queries that mongod throws on.
@glasser
Copy link
Member

@glasser glasser commented Feb 26, 2015

With the fix we just merged, Mitar's reproduction sends a 500 Meteor.Error over the wire and prints this, which contains the name of the sub and its bad selector and the error message:

I20150226-11:44:48.403(-8)? Exception from sub broken id uZPpvd3eSpdk7th55 Error: Exception while polling query {"collectionName":"Test","selector":{"importing.foo":"b"},"options":{"transform":null,"fields":{"jobs":{"$slice":-1},"importing.$":1}}}: Runner error: BadValue positional operator element mismatch
I20150226-11:44:48.403(-8)?     at Object.Future.wait (/Users/glasser/Projects/Meteor/meteor/dev_bundle/server-lib/node_modules/fibers/future.js:398:15)
I20150226-11:44:48.403(-8)?     at [object Object]._.extend.addHandleAndSendInitialAdds (packages/mongo/observe_multiplex.js:55:1)
I20150226-11:44:48.403(-8)?     at [object Object].MongoConnection._observeChanges (packages/mongo/mongo_driver.js:1166:1)
I20150226-11:44:48.403(-8)?     at [object Object].Cursor.observeChanges (packages/mongo/mongo_driver.js:841:1)
I20150226-11:44:48.403(-8)?     at Function.Mongo.Collection._publishCursor (packages/mongo/collection.js:302:1)
I20150226-11:44:48.403(-8)?     at [object Object].Cursor._publishCursor (packages/mongo/mongo_driver.js:822:1)
I20150226-11:44:48.404(-8)?     at [object Object]._.extend._runHandler (packages/ddp/livedata_server.js:986:1)
I20150226-11:44:48.404(-8)?     at [object Object]._.extend._startSubscription (packages/ddp/livedata_server.js:769:1)
I20150226-11:44:48.404(-8)?     at [object Object]._.extend.protocol_handlers.sub (packages/ddp/livedata_server.js:582:1)
I20150226-11:44:48.404(-8)?     at packages/ddp/livedata_server.js:546:1
I20150226-11:44:48.404(-8)?     - - - - -
I20150226-11:44:48.404(-8)?     at [object Object]._.extend._pollMongo (packages/mongo/polling_observe_driver.js:156:1)
I20150226-11:44:48.404(-8)?     at Object.task (packages/mongo/polling_observe_driver.js:85:1)
I20150226-11:44:48.404(-8)?     at [object Object]._.extend._run (packages/meteor/fiber_helpers.js:147:1)
I20150226-11:44:48.404(-8)?     at packages/meteor/fiber_helpers.js:125:1

The subsequent subscription works fine.

Similarly, Tom's reproduction prints one error instead of an infinite number, and the subsequent login works.

Thanks for the great reproductions!

Loading

glasser added a commit that referenced this issue Feb 26, 2015
Test fails on server (passes on client).
@glasser glasser closed this in 2f2ba39 Feb 26, 2015
glasser added a commit that referenced this issue Feb 26, 2015
In order to test this, we applied the following diff to
minimongo/selector.js:

--- a/packages/minimongo/selector.js
+++ b/packages/minimongo/selector.js
@@ -591,6 +591,8 @@ ELEMENT_OPERATORS = {
   },
   $in: {
     compileElementSelector: function (operand) {
+      if (operand === null)
+        operand = [];
       if (!isArray(operand))
         throw Error("$in needs an array");

This means that minimongo now allows `$in: null`, and thus
OplogObserveDriver can be used for the query in the new 'bad query'
test, but mongod still returns an error that is processed by the passing
test.  (The client side of the test fails with this temporary change,
because minimongo no longer throws!)

It's hard to see how to test this better, because minimongo should throw
on any queries that mongod throws on.
glasser added a commit that referenced this issue Feb 26, 2015
glasser added a commit that referenced this issue Feb 27, 2015
This contains a PR we just submitted to help with error handling
for #2534.
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Linked pull requests

Successfully merging a pull request may close this issue.

None yet
3 participants