RUBY-1117 Ensure the driver returns the response to the current query #774
Conversation
lib/mongo/server/connection.rb
Outdated
```ruby
if messages.last.replyable?
  # Protection against returning the response to a previous request. See
  # RUBY-1117
  reply = read until reply && reply.response_to == messages.last.request_id
```
I'm wary of putting an until here, as it's not guaranteed that you'll eventually read the reply with a response_to matching the last message's request_id. What I would do instead is pass the expected request_id to #read like this:
```ruby
messages.last.replyable? ? read(messages.last.request_id) : nil
```
then I'd change the method signature of Connectable#read to this:
```ruby
def read(request_id = nil)
  ensure_connected do |socket|
    Protocol::Reply.deserialize(socket, max_message_size, request_id)
  end
end
```
and then the method signature of Message.deserialize to this:
```ruby
def self.deserialize(io, max_message_size = MAX_MESSAGE_SIZE, expected_request_id = nil)
```
and raise an Exception in the same way we're raising one for the length:
```ruby
if expected_request_id && expected_request_id != response_to
  raise Error::UnexpectedResponse.new(expected_request_id, response_to)
end
```
What do you think? We'll address the scenario in which a socket has a killed thread's unread data in this ticket: https://jira.mongodb.org/browse/RUBY-1118, so what the until loop accomplishes wouldn't be necessary.
This is actually very close to our first attempt at fixing the issue.
The problem with that approach is that we would need to be absolutely certain the socket will also be closed when we read the wrong response from the network.
If not, then we just push back the problem to the next read, indefinitely, and that would not solve anything.
Also, it would be nice to give the process a chance to recover before causing it to fail. Otherwise it just pushes the error-handling responsibility back onto the app instead of the library. Maybe we could try reading just once more and then abort? It would be stupid to drop the socket and its pending messages if our reply is in there.
IMO the best behavior would be to read until we get our response, and once there is nothing more to read, the request should time out. That is actually linked to https://jira.mongodb.org/browse/RUBY-1120 IMO, which is the opposite problem: the response from mongodb is missed or lost and the client waits indefinitely.
We had a look at the code with @elhu and saw there is a socket timeout option which defaults to 5s and is set as the read and write timeout on the socket using setsockopt. Unfortunately this seems to be broken in Ruby itself: when you do io.read(16), even if the read syscall times out (we see it returning -1 in strace), Ruby continues waiting (probably with a select or ppoll, as seen in RUBY-1120), which means this timeout has no impact.
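To illustrate, here is a standalone sketch of the behavior described (it uses a plain TCPSocket rather than the driver's socket classes; host and port are placeholders):
```ruby
require 'socket'

sock = TCPSocket.new('127.0.0.1', 27017)   # placeholder host/port
timeval = [5, 0].pack('l_2')               # 5 seconds, 0 microseconds
sock.setsockopt(Socket::SOL_SOCKET, Socket::SO_RCVTIMEO, timeval)

# Even with SO_RCVTIMEO set, IO#read retries after the syscall times out,
# so this call keeps blocking until 16 bytes arrive or the peer closes.
data = sock.read(16)
```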
If this timeout were working properly, it wouldn't be a problem to read multiple times.
Of course if you really prefer it this way it's better than nothing, but it means everyone killing threads or running on a bad network will get random exceptions in production and broken builds, with no way to improve this.
PS: about passing the request_id argument, to me it doesn't sound like it's the deserialize method's responsibility to check this; it is supposed to just go from the wire protocol to a Message object. Same for the read method. So even if this should raise an exception, it looks more appropriate in the deliver method to me (with the response_to added to the message, since it is part of it but was just ignored). But that's just my opinion.
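For illustration, a rough sketch of what such a check in #deliver could look like (not the driver's actual code; it just reuses the names from the snippets above):
```ruby
# Sketch only: assumes #write, #read and Error::UnexpectedResponse as discussed above.
def deliver(messages)
  write(messages)
  return nil unless messages.last.replyable?

  reply = read
  if reply.response_to != messages.last.request_id
    raise Error::UnexpectedResponse.new(messages.last.request_id, reply.response_to)
  end
  reply
end
```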
Hi Adrien
'The problem with that approach is that we would need to be absolutely certain the socket will also be closed when we read the wrong response from the network.'
That's why I mentioned this ticket: https://jira.mongodb.org/browse/RUBY-1118
With RUBY-1118, the socket will be closed when it is checked out (at a higher level than this PR) and has unread data or when it has been discovered to be closed by some external network error. That way, you won't have the issue of pushing back the problem to the next read, indefinitely.
The application creates threads and can potentially manipulate them (by killing them mid-execution), so an error should be raised to notify it that a socket received a previous request's unread response. If the application is getting a lot of these errors, a change in the application itself should probably be made. It doesn't make sense to me that the driver should try to read again an arbitrary number of times if the ids don't match.
I agree that checking the reply's id in the #deliver method would work and probably make more sense. Though there is already a check for length there, and we should also include another check on the op_code... If you'd like to add that as well to this PR, that'd be awesome = )
Yeah, that would improve the average case, but it doesn't fix anything: if the first query is slow, then when the main thread checks out the orphan socket there's nothing to read yet, so it goes ahead, sends the second query, and reads back the first response.
After thinking about this a bit more, wouldn't the solution be to close the socket when there's an internal exception, before checking it back in to the pool? So when the thread is killed, the socket would simply be dropped because we don't know in which crappy state it is, and the next query will have to reopen it.
Instead of:
```ruby
def with_connection
  connection = checkout
  yield(connection)
ensure
  checkin(connection) if connection
end
```
It would look like:
```ruby
def with_connection
  connection = checkout
  yield(connection)
rescue Exception
  connection.disconnect! if connection
  raise
ensure
  checkin(connection) if connection
end
```
WDYT?
I also found that the connection_pool gem used by everyone is using Thread.handle_interrupt, which seems to help with catching interruptions like kill: https://github.com/mperham/connection_pool/blob/master/lib/connection_pool.rb#L56
I haven't looked much at this and I don't know if it's useful here, as the ensure seems to be called properly in our case; otherwise the socket would be lost and not reused.
+1 for the op_code check.
I see what you're saying but if the application is killing threads in a number of places after initiating long-running operations, it should probably change the socket_timeout option to better match the permitted execution time of threads or use the max_time_ms option. You could also close the client if it's a real threat that future operations would read responses from previous ones.
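For example (a sketch; the values are illustrative and the exact option spellings should be checked against the driver's docs):
```ruby
# Cap how long the driver waits on the socket for a single read/write.
client = Mongo::Client.new(['127.0.0.1:27017'],
                           database: 'test',
                           socket_timeout: 5)

# Cap server-side execution time for one operation.
client[:users].find({ name: 'bob' }, max_time_ms: 500).to_a
```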
Perhaps this isn't clear in the code but a Connection object is a wrapper around a Socket. In the #with_connection method, the connection does not have to be disconnected when there is an error because the socket is disconnected at a lower level. If you look at the #ensure_connected method in Connectable, the socket is disconnected when there is an exception so that it won't be used again:
https://github.com/mongodb/mongo-ruby-driver/blob/master/lib/mongo/server/connectable.rb#L90-L95
I don't know much about Thread.handle_interrupt but will look into it. Thanks!
Feel free to add the op_code check (it should be pretty simple to add) or I can add it in separately later. Up to you and Vincent = )
@estolfo I'm not sure what you mean regarding the op_code check.
From what I can tell, the op_code of a response is always 1. What check would you want to perform exactly?
I feel like I'm missing something.
Hey @elhu,
Yeah, it's pretty straight-forward, we'd have to do what the Java driver is doing here: https://github.com/mongodb/mongo-java-driver/blob/24aac956ca2fad7c803688cce47eb6f88a4c3b13/driver-core/src/main/com/mongodb/connection/ReplyHeader.java#L63-L66
Emily
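For reference, a rough Ruby equivalent of that Java check could look like this (a sketch; the constant name and error message are made up here, only the op code value comes from the wire protocol, where OP_REPLY is 1):
```ruby
OP_REPLY_OP_CODE = 1 # OP_REPLY's op code in the wire protocol

def verify_op_code!(op_code)
  unless op_code == OP_REPLY_OP_CODE
    raise "Unexpected reply message op code #{op_code} (expected #{OP_REPLY_OP_CODE})"
  end
end
```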
@estolfo Ok, I'm not sure what you mean in your first paragraph: why would the socket_timeout and max_time_ms options help here? To reduce the chance that the query leaks on the main thread?
About disconnect!: indeed, we missed this one, so actually if we put the raise in deserialize, it should work and recover properly without our change to connection_pool. We just tried it and it works as expected: it raises in case of error and the socket recovers on the next query.
And so with this fix, we would have to disconnect every mongo socket each time we kill a thread (or EM.stop) if we want to be safe, right?
This should still be improved though, IMO. I'm trying to find if there is a way: basically, if the read is interrupted, we want to close the socket. The problem is that killing a thread calls every ensure block (effectively checking the connection back in to the pool to be reused) BUT it doesn't call any rescue Exception block, which prevents the socket from being closed. If we can find a way to close the socket when the thread is killed as well, and not only on Exception, that would solve this problem entirely. I had a look at handle_interrupt and it seems handy to protect your code against exceptions (you can defer them so they interrupt your thread at a later point; that's why concurrent_ruby uses it for its thread pool: https://github.com/mperham/connection_pool/blob/master/lib/connection_pool.rb#L59), but it doesn't seem to do anything with kill, as opposed to what the Ruby doc says.
Ok, I actually found a not-sexy but very simple and easy fix: since thread interruption calls ensure blocks but not rescue (that's why the socket is checked in but not closed), we simply need to use an ensure block to close the socket. That would change:
```ruby
def ensure_connected
  ensure_same_process!
  connect!
  begin
    yield socket
  rescue Exception => e
    disconnect!
    raise e
  end
end
```
To:
```ruby
def ensure_connected
  ensure_same_process!
  connect!
  begin
    res = yield socket
    success = true
    res
  ensure
    success or disconnect!
  end
end
```
We confirmed this with our example and our spec suite: it properly closes the socket in case of both Exception AND Thread.kill. This way the dirty socket is never reused. We should of course keep the UnexpectedResponse exception because it protects against regressions or other bugs, but this covers the problem better than simply raising randomly or forcing you to change how you write Ruby code. WDYT?
And for the record, handle_interrupt helps with Exceptions but not kill, so it might be a good idea to have a look at it to make your code more robust (like connection_pool does), but it doesn't help here.
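For reference, the API in question looks roughly like this (a minimal standalone sketch; as noted, it defers asynchronous exceptions but does not change what Thread#kill does):
```ruby
worker = Thread.new do
  Thread.handle_interrupt(RuntimeError => :never) do
    # A RuntimeError sent with Thread#raise is deferred until this block
    # exits, so the critical section below cannot be cut short mid-way.
    sleep 1
    puts 'critical section finished'
  end
end

sleep 0.1
worker.raise(RuntimeError, 'interrupted') # deferred while inside the :never block
worker.join rescue nil                    # the deferred error surfaces here
```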
…nd/or wrong response being returned. #RUBY-1117
I've updated the MR with all of our comments (The exception in
thanks @jarthod for the updates and thorough tests, I'll take a look and get back to you!
For the record, we've been using this fix in production (1000+ mongo queries/sec) for 4 days now, without any issue. I see one of my tests failed randomly on the CI; I'll see about making it more robust if you agree with the fix.
Hey @jarthod
```ruby
#
# @since 2.1.0
def initialize(expected_response_to, response_to)
  super("Unexpected response. Got response for request ID #{response_to} but expected response for request ID #{expected_response_to}")
```
can we break this up into two lines?
sure
Ok I addressed the various comments, and the specs are now green.
hey @jarthod The tests are passing on travis now. You've completed your updates, correct?
🎉
thanks to you and your colleagues (especially @elhu) for your work on this = )
Woohoo, great news!
Hello!
As discussed in https://jira.mongodb.org/browse/RUBY-1117, we discovered an issue in the Mongo ruby driver: if a thread dies before it has time to read the response to a query it triggered, then the next thread to read from that socket will return an incorrect response.
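A hypothetical reproduction of the scenario (names, timings, and pool size are illustrative; this is not an actual spec from the PR):
```ruby
client = Mongo::Client.new(['127.0.0.1:27017'], database: 'test', max_pool_size: 1)

# A thread sends a slow query, then is killed before it reads the reply.
t = Thread.new { client[:items].find('$where' => 'sleep(100) || true').to_a }
sleep 0.05
t.kill

# The socket goes back into the pool with the first reply still unread;
# without the fix, this query can read the previous query's response.
client[:items].find(name: 'bob').to_a
```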
We've addressed the issue by ensuring that we were returning the proper response in Mongo::Server::Connection#deliver. If response_to does not match the request_id of the current query, the next response is read.
In fixing this bug, we also discovered that the class instance variables set in Mongo::Protocol::Message.deserialize_header were not thread-safe. In order to avoid this problem, we decided to replace them with local variables, since they did not seem to be used anywhere.
We've tried to provide meaningful specs as well; please let us know if there are things you wish to discuss before this can be merged!
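As an aside, a minimal illustration of the class-instance-variable problem (hypothetical code, not the driver's actual deserialize_header):
```ruby
class HeaderExample
  # Not thread-safe: @fields lives on the class object, so two threads
  # deserializing at the same time overwrite each other's values.
  def self.parse_unsafe(io)
    @fields = io.read(16).unpack('l<4')
    @fields
  end

  # Thread-safe: locals are per call, hence per thread.
  def self.parse_safe(io)
    io.read(16).unpack('l<4')
  end
end
```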
Many thanks to @jarthod for helping me figure out this one.