PYTHON-3175 Preemptively cancel in progress operations when SDAM heartbeats timeout #1465
Conversation
```python
timeout = _POLL_TIMEOUT
readable = conn.socket_checker.select(sock, read=True, timeout=timeout)
if conn.cancel_context.cancelled:
    raise _OperationCancelled("operation cancelled")
```
This block is under indented one-level now.
This block still looks under indented. Is that intentional?
The original code has the following if statements all at the same level:

```python
if context.cancelled:
    ...
if readable:
    ...
if timed_out:
    ...
```

This change keeps that consistency, but removes the initial `if context:` check above, which might be causing the confusion.
Thanks for explaining. GitHub's diff still shows this as changed for me even when I set the ignore-whitespace option. Must be a bug on their end.
Could you run a benchmark with:

- 100 threads all running a short-running op (e.g. ping)
- 100 threads all running a long-running op (e.g. a find-one with a sleep)

as well as the EVG benchmark? I'm wondering if there's a cost to adding this polling loop to every connection.
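A minimal harness for that kind of comparison might look like the following. The `run_threads` helper and the stand-in operations are hypothetical; against a real deployment the ops would be actual driver calls (e.g. `client.admin.command("ping")` for the short op).

```python
import threading
import time

N_THREADS = 100
N_OPS = 100


def run_threads(op, n_threads=N_THREADS, n_ops=N_OPS):
    """Run `op` n_ops times in each of n_threads threads and return
    the total wall-clock time for all threads to finish."""
    def worker():
        for _ in range(n_ops):
            op()

    threads = [threading.Thread(target=worker) for _ in range(n_threads)]
    start = time.perf_counter()
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    return time.perf_counter() - start


# Stand-in operations so the sketch runs without a server; swap in real
# driver calls to benchmark an actual deployment.
elapsed_short = run_threads(lambda: None)
elapsed_long = run_threads(lambda: time.sleep(0.001))
print(f"short ops: {elapsed_short:.3f}s, long ops: {elapsed_long:.3f}s")
```

Comparing the two totals before and after the change would show whether the polling loop adds measurable per-operation overhead under thread contention.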
pymongo/monitoring.py (outdated)

```
@@ -874,15 +874,23 @@ class PoolClearedEvent(_PoolEvent):
:param address: The address (host, port) pair of the server this Pool is
  attempting to connect to.
:param service_id: The service_id this command was sent to, or ``None``.
- `service_id`: The service_id this command was sent to, or ``None``.
- '__interrupt_in_use_connections': True if all active connections were interrupted by the Pool during clearing.
```
Remove the leading `__`. Also, is this really the name? It seems comically long. Can we think of a better one?
We could shorten it to `__interrupt_connections`, but then we lose the information that only connections in use are interrupted.
Not sure if we follow the spec this closely, but the spec explicitly calls this flag `interruptInUseConnections`.
Let's rename it `interrupt_connections` throughout, even on the monitoring events.
The spec tests actually explicitly check for the existence of `interruptInUseConnections` in the event, so renaming this breaks the tests.
What we can do in that situation is update the test runner to map the field name. It looks like the test runner ignores the interruptInUseConnections field and only checks for hasServiceId:

mongo-python-driver/test/unified_format.py, lines 836 to 838 (at b1939e1):

```python
elif name == "poolClearedEvent":
    self.test.assertIsInstance(actual, PoolClearedEvent)
    self.assertHasServiceId(spec, actual)
```
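A sketch of what that field mapping could look like. The `PoolClearedEvent` stub, the `match_pool_cleared_event` function, and the snake_case attribute name `interrupt_connections` are assumptions for illustration, not the runner's actual code.

```python
class PoolClearedEvent:
    """Minimal stand-in for pymongo.monitoring.PoolClearedEvent; the
    real event also carries address and service_id fields."""

    def __init__(self, interrupt_connections=False):
        self.interrupt_connections = interrupt_connections


def match_pool_cleared_event(spec, actual):
    """Map the spec's camelCase interruptInUseConnections field onto
    the event's snake_case attribute and compare it when present."""
    assert isinstance(actual, PoolClearedEvent)
    expected = spec.get("interruptInUseConnections")
    if expected is not None:
        assert actual.interrupt_connections == expected


# The spec's camelCase flag is checked against the Python attribute:
match_pool_cleared_event({"interruptInUseConnections": True},
                         PoolClearedEvent(interrupt_connections=True))
```

The mapping keeps the spec tests' camelCase field names intact while letting the driver expose a Pythonic snake_case attribute.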
```python
sock = conn.conn
timed_out = False
# Check if the socket is already closed
if sock.fileno() == -1:
```
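For context on the quoted check: a closed Python socket reports a file descriptor of -1, which is what makes this a cheap liveness test. A small self-contained demonstration (the `is_closed` helper is illustrative):

```python
import socket


def is_closed(sock):
    """CPython sets a closed socket's file descriptor to -1, so
    fileno() distinguishes closed sockets without any syscall."""
    return sock.fileno() == -1


a, b = socket.socketpair()
print(is_closed(a))  # False: still open
a.close()
print(is_closed(a))  # True: fd released on close
b.close()
```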
When would the socket be closed? The comment could be improved by explaining that case (i.e. the when/why, not the how).
We have at least one test that explicitly closes the socket itself rather than the connection. Sounds good, I'll add a comment.
EVG benchmark results: https://performance-analyzer.server-tig.prod.corp.mongodb.com/perf-analyzer-viz?comparison_id=9802c9f9-659e-46b3-97ac-c268f34a581c&selected_tab=data-table&percent_filter=0%7C%7C100&z_filter=0%7C%7C10

A slight decrease for some operations, a slight improvement or no change for others. I ran a very simple local benchmark that created 100 threads and executed either a
Good to go. Thank you. I learned a lot doing this review!
Looking at the perf results it does appear this results in a ~5% decrease. Locally with 100 threads I see a decrease of somewhere between 5-10%:

(benchmark output: before vs. after)

With a single-threaded benchmark the total execution time is about the same, but the user and system CPU time go up by 5%:

(benchmark output: before vs. after)
Here's the single-threaded benchmark:

```python
from pymongo import MongoClient

client = MongoClient()

N_ITERATIONS = 50000


def main():
    for _ in range(N_ITERATIONS):
        client.admin.command('ping')


if __name__ == "__main__":
    main()
```

I don't think this should hold up this PR but it would be good to circle back and see how we can improve. Could you open a new ticket?
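The wall-clock vs. CPU-time distinction above can be measured directly. The `measure` helper and the busy-work payload below are illustrative, not part of the benchmark that was actually run.

```python
import time


def measure(fn):
    """Return (wall_seconds, cpu_seconds) for a callable. Roughly equal
    wall time with higher CPU time suggests extra work per wakeup (such
    as added select() polling), rather than longer waits."""
    wall0, cpu0 = time.perf_counter(), time.process_time()
    fn()
    return time.perf_counter() - wall0, time.process_time() - cpu0


wall, cpu = measure(lambda: sum(i * i for i in range(200_000)))
print(f"wall={wall:.4f}s cpu={cpu:.4f}s")
```

`time.process_time()` counts user plus system CPU time for the process, so it isolates the cost of the polling loop from time spent blocked on the server.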
Still need to update the unified runner to check interruptInUseConnections. It looks like the test runner ignores the interruptInUseConnections field and only checks for hasServiceId:

mongo-python-driver/test/unified_format.py, lines 836 to 838 (at b1939e1):

```python
elif name == "poolClearedEvent":
    self.test.assertIsInstance(actual, PoolClearedEvent)
    self.assertHasServiceId(spec, actual)
```
LGTM!