@elhu elhu commented Jun 27, 2016

Hello!

As discussed in https://jira.mongodb.org/browse/RUBY-1117, we discovered an issue in the Mongo Ruby driver: if a thread dies before it has time to read the response to a query it triggered, the next thread to read from that socket will read back an incorrect response.

We've addressed the issue by ensuring that we return the proper response in Mongo::Server::Connection#deliver: if response_to does not match the request_id of the current query, the next response is read.

In fixing this bug, we also discovered that the class instance variables set in Mongo::Protocol::Message.deserialize_header were not thread-safe. To avoid this problem, we decided to replace them with local variables, since they did not seem to be used anywhere.
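
For illustration only (the exact driver code differs; the unpack below just mirrors the 16-byte wire header, and the field names are ours), the difference between the racy and the thread-safe version is roughly:

# Racy sketch: these are instance variables on the Message *class*, shared
# by every thread deserializing a header at the same time.
def self.deserialize_header(io)
  @length, @request_id, @response_to, @op_code = io.read(16).unpack('l<4')
end

# Thread-safe sketch: return plain values instead; each caller keeps its own.
def self.deserialize_header(io)
  io.read(16).unpack('l<4')
end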

We've tried to provide meaningful specs as well; please let us know if there are things you wish to discuss before this can be merged!

Many thanks to @jarthod for helping me figure out this one.

@estolfo estolfo changed the title from "Ensure the driver returns the response to the current query" to "RUBY-1117 Ensure the driver returns the response to the current query" on Jun 28, 2016

if messages.last.replyable?
# Protection against returning the response to a previous request. See
# RUBY-1117
reply = read until reply && reply.response_to == messages.last.request_id
@estolfo estolfo (Contributor) Jun 28, 2016

I'm wary of putting an until here, as it's not guaranteed that you'll eventually read the reply with a response_to matching the last message's request_id. What I would do instead is pass the expected request_id to #read like this:

messages.last.replyable? ? read(messages.last.request_id) : nil

then I'd change the method signature of Connectable#read to this:

      def read(request_id = nil)
        ensure_connected do |socket|
          Protocol::Reply.deserialize(socket, max_message_size, request_id)
        end
      end

and then the method signature of Message.deserialize to this:

def self.deserialize(io, max_message_size = MAX_MESSAGE_SIZE, expected_request_id = nil)

and raise an Exception in the same way we're raising one for the length:

if expected_request_id && expected_request_id != response_to
  raise Error::UnexpectedResponse.new(expected_request_id, response_to)
end

What do you think? We'll address the scenario in which a socket has a killed thread's unread data in this ticket:
https://jira.mongodb.org/browse/RUBY-1118, so what the until loop accomplishes wouldn't be necessary.

Contributor

This is actually very close to our first attempt at fixing the issue.
The problem with that approach is that we would need to be absolutely certain the socket will also be closed when we read the wrong response from the network.

If not, then we just push back the problem to the next read, indefinitely, and that would not solve anything.

Also, it would be nice to give the process a chance to recover before causing it to fail. If not, it just pushes the error-handling responsibility back onto the app instead of the library. Maybe we could try reading just once more and then abort? It would be stupid to drop the socket and its pending messages if our reply is in there.

IMO the best behavior would be to read until we get our response, and once there is nothing more to read, the request should time out. That is actually linked to https://jira.mongodb.org/browse/RUBY-1120, which is the opposite problem: the response from MongoDB is missed or lost and the client waits indefinitely.

We had a look at the code with @elhu and saw there is a socket timeout option which defaults to 5s and is set as the read and write timeout on the socket using setsockopt. Unfortunately this seems to be broken in Ruby itself: when you do io.read(16), even if the read syscall times out (we see it returning -1 in strace), Ruby keeps waiting (probably with a select or ppoll, as seen in RUBY-1120), which means this timeout has no effect.

If this timeout were working properly, it wouldn't be a problem to read multiple times.
Of course, if you really prefer it this way it's better than nothing, but it means everyone who kills threads or runs on a bad network will get random exceptions in production and broken builds, with no way to improve this.
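
To illustrate what I mean by a working timeout, here is a rough sketch (read_with_deadline is just a made-up helper, not driver code) of enforcing a read deadline without relying on SO_RCVTIMEO:

require 'timeout'

# Made-up helper: read exactly `length` bytes from `io` or raise once the
# deadline passes. IO.select provides the timeout that a plain io.read ignores.
def read_with_deadline(io, length, timeout)
  deadline = Time.now + timeout
  data = ""
  while data.bytesize < length
    begin
      data << io.read_nonblock(length - data.bytesize)
    rescue IO::WaitReadable
      remaining = deadline - Time.now
      raise Timeout::Error, "socket read timed out" if remaining <= 0
      IO.select([io], nil, nil, remaining) or
        raise Timeout::Error, "socket read timed out"
    end
  end
  data
end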

PS: about passing the request_id argument, it doesn't sound to me like it's the deserialize method's responsibility to check this; it is supposed to just turn the wire protocol into a Message object. Same for the read method. So even if this should raise an exception, it looks more appropriate in the deliver method to me (with the response_to added to the message, as it is part of it but was just ignored). But that's just my opinion.

Contributor

Hi Adrien

'The problem with that approach is that we would need to be absolutely certain the socket will also be closed when we read the wrong response from the network.'
That's why I mentioned this ticket: https://jira.mongodb.org/browse/RUBY-1118
With RUBY-1118, the socket will be closed when it is checked out (at a higher level than this PR) and has unread data or when it has been discovered to be closed by some external network error. That way, you won't have the issue of pushing back the problem to the next read, indefinitely.
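
As a rough illustration only (these method names are made up, not the driver's pool API), the idea is something like:

# Sketch: a zero-timeout IO.select tells us whether the socket already has
# unread bytes (e.g. the reply to a killed thread's request) before reuse.
def safe_checkout(pool)
  connection = pool.checkout
  readable, = IO.select([connection.socket], nil, nil, 0)
  if readable && readable.any?
    connection.disconnect!        # drop the tainted socket
    connection = pool.create_connection
  end
  connection
end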

The application creates threads and can potentially manipulate them (by killing them mid-execution), so an error should be raised to notify it that a socket received a previous request's unread response. If the application is getting a lot of these errors, a change in the application itself should probably be made. It doesn't make sense to me that the driver should try to read again an arbitrary number of times if the ids don't match.

I agree that checking the reply's id in the #deliver method would work and probably make more sense. Though there is already a check for length, and we should also include another check on the op_code... If you'd like to add that to this PR as well, that'd be awesome = )

Contributor

Yeah, that would improve the average case but it doesn't fully fix it: if the first query is slow, then when the main thread checks out the orphan socket there's nothing to read yet, so it goes ahead, sends the second query, and reads back the first response.

After thinking about this a bit more, wouldn't the solution be to close the socket when there's an internal exception, before checking it back in to the pool? So when the thread is killed, the socket would simply be dropped because we don't know what crappy state it is in, and the next query would have to reopen it.

Instead of:

def with_connection
  connection = checkout
  yield(connection)
ensure
  checkin(connection) if connection
end

It would look like:

def with_connection
  connection = checkout
  yield(connection)
rescue Exception
  connection.disconnect! if connection
  raise
ensure
  checkin(connection) if connection
end

WDYT ?

I also found that the connection_pool gem everyone uses relies on Thread.handle_interrupt, which seems to help with catching interruptions like kill: https://github.com/mperham/connection_pool/blob/master/lib/connection_pool.rb#L56

I haven't looked much at this and I don't know if it's useful here, as the ensure seems to be called properly in our case; otherwise the socket would be lost and not reused.

+1 for the op_code check.

Contributor

I see what you're saying but if the application is killing threads in a number of places after initiating long-running operations, it should probably change the socket_timeout option to better match the permitted execution time of threads or use the max_time_ms option. You could also close the client if it's a real threat that future operations would read responses from previous ones.
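
For example, something along these lines (a sketch only; I'm assuming the 2.x option names here):

require 'mongo'

# socket_timeout bounds how long the client waits on a socket read (seconds);
# max_time_ms bounds how long the server may spend executing the operation.
client = Mongo::Client.new(['127.0.0.1:27017'],
                           database: 'test',
                           socket_timeout: 5)
client[:users].find(active: true).max_time_ms(500).to_a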

Perhaps this isn't clear in the code but a Connection object is a wrapper around a Socket. In the #with_connection method, the connection does not have to be disconnected when there is an error because the socket is disconnected at a lower level. If you look at the #ensure_connected method in Connectable, the socket is disconnected when there is an exception so that it won't be used again:
https://github.com/mongodb/mongo-ruby-driver/blob/master/lib/mongo/server/connectable.rb#L90-L95

I don't know much about Thread.handle_interrupt but will look into it. Thanks!

Feel free to add the op_code check (it should be pretty simple to add) or I can add it in separately later. Up to you and Vincent = )

Author

@estolfo I'm not sure what you mean regarding the op_code check.
From what I can tell, the op_code of a response is always 1. What check would you want to perform exactly?

I feel like I'm missing something.

Contributor

@estolfo OK, I'm not sure what you mean in your first paragraph: why would the socket_timeout and max_time_ms options help here? To reduce the chance that the query leaks onto the main thread?

About the disconnect!: indeed we missed this one, so actually if we put the raise in deserialize, it should work and recover properly without our change to the connection pool. We just tried it and it works as expected: it raises in case of error and the socket recovers at the next query.

And so with this fix, we would have to disconnect every Mongo socket each time we kill a thread (or EM.stop) if we want to be safe, right?

This should still be improved though, IMO; I'm trying to find if there is a way. Basically, if the read is interrupted, we want to close the socket. The problem is that killing a thread runs every ensure block (effectively checking the connection back in to the pool to be reused) BUT it doesn't run any rescue Exception block, which prevents the socket from being closed. If we can find a way to close the socket also when the thread is killed, and not only on Exception, that would solve this problem entirely. I had a look at handle_interrupt and it seems handy for protecting your code against exceptions (you can defer them so they interrupt your thread at a later point, which is why connection_pool uses it for its pool: https://github.com/mperham/connection_pool/blob/master/lib/connection_pool.rb#L59), but it doesn't seem to do anything with kill, as opposed to what the Ruby docs say.
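
For reference, the pattern I'm talking about looks like this (core Ruby API; acquire_connection/release_connection are just placeholders):

# Async exceptions raised with Thread#raise are deferred while :never is in
# effect, so the ensure below can't be skipped by one. As said above, this
# does NOT protect against Thread#kill.
Thread.handle_interrupt(Exception => :never) do
  conn = acquire_connection              # placeholder
  begin
    Thread.handle_interrupt(Exception => :immediate) do
      conn.do_work                       # interruptible section
    end
  ensure
    release_connection(conn)             # placeholder; always runs
  end
end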

@jarthod jarthod (Contributor) Jun 30, 2016

OK, I actually found a not-so-sexy but very simple and easy fix: since thread interruption runs ensure blocks but not rescue (that's why the socket is checked in but not closed), we simply need to use an ensure block to close the socket. That would change:

def ensure_connected
  ensure_same_process!
  connect!
  begin
    yield socket
  rescue Exception => e
    disconnect!
    raise e
  end
end

To:

def ensure_connected
  ensure_same_process!
  connect!
  begin
    res = yield socket
    success = true
    res
  ensure
    success or disconnect!
  end
end

We confirmed this with our example and our spec suite: it properly closes the socket in case of both Exception AND Thread.kill. This way the dirty socket is never reused. We should of course keep the UnexpectedResponse exception because it protects against regressions or other bugs, but this covers the problem better than simply raising randomly or forcing you to change how you write Ruby code. WDYT?
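
For the record, here is a tiny stand-alone script showing the behaviour this fix relies on (ensure runs on kill, rescue does not):

t = Thread.new do
  begin
    sleep
  rescue Exception
    puts "rescue reached"   # never printed when the thread is killed
  ensure
    puts "ensure reached"   # printed when the thread is killed
  end
end
sleep 0.1   # let the thread block in sleep
t.kill
t.join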

And for the record, handle_interrupt helps with exceptions but not kill, so it might be a good idea to have a look at it to make your code more robust (like connection_pool does), but it doesn't help here.

jarthod commented Jun 30, 2016

I've updated the PR with all of our comments (the exception in deserialize) and with my last suggestion for the Thread.kill fix (using ensure in ensure_connected). Let me know what you think, @estolfo.

estolfo commented Jun 30, 2016

Thanks @jarthod for the updates and thorough tests; I'll take a look and get back to you!

jarthod commented Jul 5, 2016

For the record, we've been using this fix in production (1000+ Mongo queries/sec) for 4 days now, without any issue. I see one of my tests failed randomly on the CI; I'll look into making it more robust if you agree with the fix.

estolfo commented Jul 5, 2016

Hey @jarthod
Thanks for the update. Would you mind looking into why the test failed on Travis?

#
# @since 2.1.0
def initialize(expected_response_to, response_to)
super("Unexpected response. Got response for request ID #{response_to} but expected response for request ID #{expected_response_to}")
Contributor

can we break this up into two lines?

Contributor

sure
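
For instance, one way to split it could be:

def initialize(expected_response_to, response_to)
  super("Unexpected response. Got response for request ID #{response_to} " +
        "but expected response for request ID #{expected_response_to}")
end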

jarthod commented Jul 5, 2016

OK, I addressed the various comments, and the specs are now green.
I'm not sure about the robustness of these specs though, as the failures were totally random and happened with low probability.

estolfo commented Jul 5, 2016

Hey @jarthod, the tests are passing on Travis now. You've completed your updates, correct?

@estolfo estolfo merged commit e65a573 into mongodb:master Jul 5, 2016
jarthod commented Jul 5, 2016

🎉

estolfo commented Jul 5, 2016

thanks to you and your colleagues (especially @elhu) for your work on this = )

elhu commented Jul 5, 2016

Woohoo, great news!
That was a tricky one, hopefully many users can benefit from it.
