@elhu elhu commented Jun 27, 2016

Hello!

As discussed in https://jira.mongodb.org/browse/RUBY-1117, we discovered an issue in the Mongo Ruby driver: if a thread dies before it has time to read the response to a query it triggered, the next thread to read from that socket will read back an incorrect response.

We've addressed the issue by ensuring that we return the proper response in Mongo::Server::Connection#deliver: if response_to does not match the request_id of the current query, the next response is read.

In fixing this bug, we also discovered that the class instance variables set in Mongo::Protocol::Message.deserialize_header were not thread-safe. To avoid this problem, we decided to replace them with local variables, since they did not seem to be used anywhere.
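
For illustration only (the exact driver code differs; the unpack below just mirrors the 16-byte wire header, and the field names are ours), the difference between the racy and the thread-safe version is roughly:

# Racy sketch: these are instance variables on the Message *class*, shared
# by every thread deserializing a header at the same time.
def self.deserialize_header(io)
  @length, @request_id, @response_to, @op_code = io.read(16).unpack('l<4')
end

# Thread-safe sketch: return plain values instead; each caller keeps its own.
def self.deserialize_header(io)
  io.read(16).unpack('l<4')
end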

We've tried to provide meaningful specs as well; please let us know if there are things you wish to discuss before this can be merged!

Many thanks to @jarthod for helping me figure out this one.

@estolfo estolfo changed the title from "Ensure the driver returns the response to the current query" to "RUBY-1117 Ensure the driver returns the response to the current query" on Jun 28, 2016

if messages.last.replyable?
# Protection against returning the response to a previous request. See
# RUBY-1117
reply = read until reply && reply.response_to == messages.last.request_id
@estolfo estolfo (Contributor) Jun 28, 2016

I'm wary of putting an until here, as it's not guaranteed that you'll eventually read the reply with a response_to matching the last message's request_id. What I would do instead is pass the expected request_id to #read like this:

messages.last.replyable? ? read(messages.last.request_id) : nil

then I'd change the method signature of Connectable#read to this:

      def read(request_id = nil)
        ensure_connected do |socket|
          Protocol::Reply.deserialize(socket, max_message_size, request_id)
        end
      end

and then the method signature of Message.deserialize to this:

def self.deserialize(io, max_message_size = MAX_MESSAGE_SIZE, expected_request_id = nil)

and raise an Exception in the same way we're raising one for the length:

if expected_request_id && expected_request_id != response_to
  raise Error::UnexpectedResponse.new(expected_request_id, response_to)
end

What do you think? We'll address the scenario in which a socket has a killed thread's unread data in this ticket:
https://jira.mongodb.org/browse/RUBY-1118, so what the until loop accomplishes wouldn't be necessary.

Contributor

This is actually very close to our first attempt at fixing the issue.
The problem with that approach is that we would need to be absolutely certain the socket will also be closed when we read the wrong response from the network.

If not, then we just push back the problem to the next read, indefinitely, and that would not solve anything.

Also, it would be nice to give the process a chance to recover before causing it to fail. If not, it just pushes the error-handling responsibility back onto the app instead of the library. Maybe we could try reading just once more and then abort? It would be stupid to drop the socket and its pending messages if our reply is in there.

IMO the best behavior would be to read until we get our response, and once there is nothing more to read, the request should time out. That is actually linked to https://jira.mongodb.org/browse/RUBY-1120, which is the opposite problem: the response from MongoDB is missed or lost and the client waits indefinitely.

We had a look at the code with @elhu and saw there is a socket timeout option which defaults to 5s and is set as the read and write timeout on the socket using setsockopt. Unfortunately this seems to be broken in Ruby itself: when you do io.read(16), even if the read syscall times out (we see it returning -1 in strace), Ruby keeps waiting (probably with a select or ppoll, as seen in RUBY-1120), which means this timeout has no effect.

If this timeout were working properly, it wouldn't be a problem to read multiple times.
Of course, if you really prefer it this way it's better than nothing, but it means everyone who kills threads or runs on a bad network will get random exceptions in production and broken builds, with no way to improve this.
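
To illustrate what I mean by a working timeout, here is a rough sketch (read_with_deadline is just a made-up helper, not driver code) of enforcing a read deadline without relying on SO_RCVTIMEO:

require 'timeout'

# Made-up helper: read exactly `length` bytes from `io` or raise once the
# deadline passes. IO.select provides the timeout that a plain io.read ignores.
def read_with_deadline(io, length, timeout)
  deadline = Time.now + timeout
  data = ""
  while data.bytesize < length
    begin
      data << io.read_nonblock(length - data.bytesize)
    rescue IO::WaitReadable
      remaining = deadline - Time.now
      raise Timeout::Error, "socket read timed out" if remaining <= 0
      IO.select([io], nil, nil, remaining) or
        raise Timeout::Error, "socket read timed out"
    end
  end
  data
end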

PS: about passing the request_id argument, it doesn't sound to me like it's the deserialize method's responsibility to check this; it is supposed to just turn the wire protocol into a Message object. Same for the read method. So even if this should raise an exception, it looks more appropriate in the deliver method to me (with the response_to added to the message, as it is part of it but was just ignored). But that's just my opinion.

Contributor

Hi Adrien

'The problem with that approach is that we would need to be absolutely certain the socket will also be closed when we read the wrong response from the network.'
That's why I mentioned this ticket: https://jira.mongodb.org/browse/RUBY-1118
With RUBY-1118, the socket will be closed when it is checked out (at a higher level than this PR) and has unread data or when it has been discovered to be closed by some external network error. That way, you won't have the issue of pushing back the problem to the next read, indefinitely.
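
As a rough illustration only (these method names are made up, not the driver's pool API), the idea is something like:

# Sketch: a zero-timeout IO.select tells us whether the socket already has
# unread bytes (e.g. the reply to a killed thread's request) before reuse.
def safe_checkout(pool)
  connection = pool.checkout
  readable, = IO.select([connection.socket], nil, nil, 0)
  if readable && readable.any?
    connection.disconnect!        # drop the tainted socket
    connection = pool.create_connection
  end
  connection
end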

The application creates threads and can potentially manipulate them (by killing them mid-execution), so an error should be raised to notify it that a socket received a previous request's unread response. If the application is getting a lot of these errors, a change in the application itself should probably be made. It doesn't make sense to me that the driver should try to read again an arbitrary number of times if the ids don't match.

I agree that checking the reply's id in the #deliver method would work and probably make more sense. Though there is already a check for length, and we should also include another check on the op_code... If you'd like to add that to this PR as well, that'd be awesome = )

Contributor

Yeah, that would improve the average case but it doesn't fully fix it: if the first query is slow, then when the main thread checks out the orphan socket there's nothing to read yet, so it goes ahead, sends the second query, and reads back the first response.

After thinking about this a bit more, wouldn't the solution be to close the socket when there's an internal exception, before checking it back in to the pool? So when the thread is killed, the socket would simply be dropped because we don't know what crappy state it is in, and the next query would have to reopen it.

Instead of:

def with_connection
  connection = checkout
  yield(connection)
ensure
  checkin(connection) if connection
end

It would look like:

def with_connection
  connection = checkout
  yield(connection)
rescue Exception
  connection.disconnect! if connection
  raise
ensure
  checkin(connection) if connection
end

WDYT ?

I also found that the connection_pool gem everyone uses relies on Thread.handle_interrupt, which seems to help with catching interruptions like kill: https://github.com/mperham/connection_pool/blob/master/lib/connection_pool.rb#L56

I haven't looked much at this and I don't know if it's useful here, as the ensure seems to be called properly in our case; otherwise the socket would be lost and not reused.

+1 for the op_code check.

Contributor

I see what you're saying but if the application is killing threads in a number of places after initiating long-running operations, it should probably change the socket_timeout option to better match the permitted execution time of threads or use the max_time_ms option. You could also close the client if it's a real threat that future operations would read responses from previous ones.
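
For example, something along these lines (a sketch only; I'm assuming the 2.x option names here):

require 'mongo'

# socket_timeout bounds how long the client waits on a socket read (seconds);
# max_time_ms bounds how long the server may spend executing the operation.
client = Mongo::Client.new(['127.0.0.1:27017'],
                           database: 'test',
                           socket_timeout: 5)
client[:users].find(active: true).max_time_ms(500).to_a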

Perhaps this isn't clear in the code but a Connection object is a wrapper around a Socket. In the #with_connection method, the connection does not have to be disconnected when there is an error because the socket is disconnected at a lower level. If you look at the #ensure_connected method in Connectable, the socket is disconnected when there is an exception so that it won't be used again:
https://github.com/mongodb/mongo-ruby-driver/blob/master/lib/mongo/server/connectable.rb#L90-L95

I don't know much about Thread.handle_interrupt but will look into it. Thanks!

Feel free to add the op_code check (it should be pretty simple to add) or I can add it in separately later. Up to you and Vincent = )

Author

@estolfo I'm not sure what you mean regarding the op_code check.
From what I can tell, the op_code of a response is always 1. What check would you want to perform exactly?

I feel like I'm missing something.

Contributor

@estolfo OK, I'm not sure what you mean in your first paragraph: why would the socket_timeout and max_time_ms options help here? To reduce the chance that the query leaks onto the main thread?

About the disconnect!: indeed we missed this one, so actually if we put the raise in deserialize, it should work and recover properly without our change to the connection pool. We just tried it and it works as expected: it raises in case of error and the socket recovers at the next query.

And so with this fix, we would have to disconnect every Mongo socket each time we kill a thread (or EM.stop) if we want to be safe, right?

This should still be improved though, IMO; I'm trying to find if there is a way. Basically, if the read is interrupted, we want to close the socket. The problem is that killing a thread runs every ensure block (effectively checking the connection back in to the pool to be reused) BUT it doesn't run any rescue Exception block, which prevents the socket from being closed. If we can find a way to close the socket also when the thread is killed, and not only on Exception, that would solve this problem entirely. I had a look at handle_interrupt and it seems handy for protecting your code against exceptions (you can defer them so they interrupt your thread at a later point, which is why connection_pool uses it for its pool: https://github.com/mperham/connection_pool/blob/master/lib/connection_pool.rb#L59), but it doesn't seem to do anything with kill, as opposed to what the Ruby docs say.
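
For reference, the pattern I'm talking about looks like this (core Ruby API; acquire_connection/release_connection are just placeholders):

# Async exceptions raised with Thread#raise are deferred while :never is in
# effect, so the ensure below can't be skipped by one. As said above, this
# does NOT protect against Thread#kill.
Thread.handle_interrupt(Exception => :never) do
  conn = acquire_connection              # placeholder
  begin
    Thread.handle_interrupt(Exception => :immediate) do
      conn.do_work                       # interruptible section
    end
  ensure
    release_connection(conn)             # placeholder; always runs
  end
end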

@jarthod jarthod (Contributor) Jun 30, 2016

OK, I actually found a not-so-sexy but very simple and easy fix: since thread interruption runs ensure blocks but not rescue (that's why the socket is checked in but not closed), we simply need to use an ensure block to close the socket. That would change:

def ensure_connected
  ensure_same_process!
  connect!
  begin
    yield socket
  rescue Exception => e
    disconnect!
    raise e
  end
end

To:

def ensure_connected
  ensure_same_process!
  connect!
  begin
    res = yield socket
    success = true
    res
  ensure
    success or disconnect!
  end
end

We confirmed this with our example and our spec suite: it properly closes the socket in case of both Exception AND Thread.kill. This way the dirty socket is never reused. We should of course keep the UnexpectedResponse exception because it protects against regressions or other bugs, but this covers the problem better than simply raising randomly or forcing you to change how you write Ruby code. WDYT?
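
For the record, here is a tiny stand-alone script showing the behaviour this fix relies on (ensure runs on kill, rescue does not):

t = Thread.new do
  begin
    sleep
  rescue Exception
    puts "rescue reached"   # never printed when the thread is killed
  ensure
    puts "ensure reached"   # printed when the thread is killed
  end
end
sleep 0.1   # let the thread block in sleep
t.kill
t.join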

And for the record, handle_interrupt helps with exceptions but not kill, so it might be a good idea to have a look at it to make your code more robust (like connection_pool does), but it doesn't help here.

jarthod commented Jun 30, 2016

I've updated the PR with all of our comments (the exception in deserialize) and with my last suggestion for the Thread.kill fix (using ensure in ensure_connected). Let me know what you think, @estolfo.

estolfo commented Jun 30, 2016

Thanks @jarthod for the updates and thorough tests; I'll take a look and get back to you!

jarthod commented Jul 5, 2016

For the record, we've been using this fix in production (1000+ Mongo queries/sec) for 4 days now, without any issue. I see one of my tests failed randomly on the CI; I'll look into making it more robust if you agree with the fix.

estolfo commented Jul 5, 2016

Hey @jarthod
Thanks for the update. Would you mind looking into why the test failed on Travis?

#
# @since 2.1.0
def initialize(expected_response_to, response_to)
super("Unexpected response. Got response for request ID #{response_to} but expected response for request ID #{expected_response_to}")
Contributor

can we break this up into two lines?

Contributor

sure
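
For instance, one way to split it could be:

def initialize(expected_response_to, response_to)
  super("Unexpected response. Got response for request ID #{response_to} " +
        "but expected response for request ID #{expected_response_to}")
end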

jarthod commented Jul 5, 2016

OK, I addressed the various comments, and the specs are now green.
I'm not sure about the robustness of these specs though, as the failures were totally random and happened with low probability.

estolfo commented Jul 5, 2016

Hey @jarthod, the tests are passing on Travis now. You've completed your updates, correct?

@estolfo estolfo merged commit e65a573 into mongodb:master Jul 5, 2016
jarthod commented Jul 5, 2016

🎉

estolfo commented Jul 5, 2016

thanks to you and your colleagues (especially @elhu) for your work on this = )

elhu commented Jul 5, 2016

Woohoo, great news!
That was a tricky one, hopefully many users can benefit from it.
