PYTHON-3175 Preemptively cancel in progress operations when SDAM heartbeats timeout #1465
Conversation
```python
timeout = _POLL_TIMEOUT
readable = conn.socket_checker.select(sock, read=True, timeout=timeout)
if conn.cancel_context.cancelled:
    raise _OperationCancelled("operation cancelled")
```
This block is under indented one-level now.
This block still looks under indented. Is that intentional?
The original code has the following if statements all at the same level:

```python
if context.cancelled:
    ...
if readable:
    ...
if timed_out:
    ...
```

This change keeps that consistency, but removes the initial `if context:` check above, which might be causing the confusion.
Thanks for explaining. GitHub's diff still shows this as changed for me even when I set the ignore-whitespace option. Must be a bug on their end.
Could you run a benchmark with:

- 100 threads all running a short-running op (e.g. ping)
- 100 threads all running a long-running op (e.g. a find-one with a sleep)

as well as the EVG benchmark? I'm wondering if there's a cost to adding this polling loop to every connection.
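A minimal harness for that kind of comparison might look like the following. The `run_threads` helper and the stand-in operations are hypothetical; against a real deployment the ops would be actual driver calls (e.g. `client.admin.command("ping")` for the short op).

```python
import threading
import time

N_THREADS = 100
N_OPS = 100


def run_threads(op, n_threads=N_THREADS, n_ops=N_OPS):
    """Run `op` n_ops times in each of n_threads threads and return
    the total wall-clock time for all threads to finish."""
    def worker():
        for _ in range(n_ops):
            op()

    threads = [threading.Thread(target=worker) for _ in range(n_threads)]
    start = time.perf_counter()
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    return time.perf_counter() - start


# Stand-in operations so the sketch runs without a server; swap in real
# driver calls to benchmark an actual deployment.
elapsed_short = run_threads(lambda: None)
elapsed_long = run_threads(lambda: time.sleep(0.001))
print(f"short ops: {elapsed_short:.3f}s, long ops: {elapsed_long:.3f}s")
```

Comparing the two totals before and after the change would show whether the polling loop adds measurable per-operation overhead under thread contention.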
pymongo/monitoring.py (outdated)

```
@@ -874,15 +874,23 @@ class PoolClearedEvent(_PoolEvent):
:param address: The address (host, port) pair of the server this Pool is
  attempting to connect to.
:param service_id: The service_id this command was sent to, or ``None``.
- `service_id`: The service_id this command was sent to, or ``None``.
- '__interrupt_in_use_connections': True if all active connections were interrupted by the Pool during clearing.
```
Remove the leading `__`. Also, is this really the name? It seems comically long. Can we think of a better one?
We could shorten it to `__interrupt_connections`, but then we lose the information that only connections in use are interrupted.
Not sure if we follow the spec this closely, but the spec explicitly calls this flag `interruptInUseConnections`.
Let's rename it `interrupt_connections` throughout, even on the monitoring events.
The spec tests actually explicitly check for the existence of `interruptInUseConnections` in the event, so renaming this breaks the tests.
What we can do in that situation is update the test runner to map the field name. It looks like the test runner ignores the interruptInUseConnections field and only checks for hasServiceId:

mongo-python-driver/test/unified_format.py, lines 836 to 838 (at b1939e1):

```python
elif name == "poolClearedEvent":
    self.test.assertIsInstance(actual, PoolClearedEvent)
    self.assertHasServiceId(spec, actual)
```
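A sketch of what that field mapping could look like. The `PoolClearedEvent` stub, the `match_pool_cleared_event` function, and the snake_case attribute name `interrupt_connections` are assumptions for illustration, not the runner's actual code.

```python
class PoolClearedEvent:
    """Minimal stand-in for pymongo.monitoring.PoolClearedEvent; the
    real event also carries address and service_id fields."""

    def __init__(self, interrupt_connections=False):
        self.interrupt_connections = interrupt_connections


def match_pool_cleared_event(spec, actual):
    """Map the spec's camelCase interruptInUseConnections field onto
    the event's snake_case attribute and compare it when present."""
    assert isinstance(actual, PoolClearedEvent)
    expected = spec.get("interruptInUseConnections")
    if expected is not None:
        assert actual.interrupt_connections == expected


# The spec's camelCase flag is checked against the Python attribute:
match_pool_cleared_event({"interruptInUseConnections": True},
                         PoolClearedEvent(interrupt_connections=True))
```

The mapping keeps the spec tests' camelCase field names intact while letting the driver expose a Pythonic snake_case attribute.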
```python
sock = conn.conn
timed_out = False
# Check if the socket is already closed
if sock.fileno() == -1:
```
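For context on the quoted check: a closed Python socket reports a file descriptor of -1, which is what makes this a cheap liveness test. A small self-contained demonstration (the `is_closed` helper is illustrative):

```python
import socket


def is_closed(sock):
    """CPython sets a closed socket's file descriptor to -1, so
    fileno() distinguishes closed sockets without any syscall."""
    return sock.fileno() == -1


a, b = socket.socketpair()
print(is_closed(a))  # False: still open
a.close()
print(is_closed(a))  # True: fd released on close
b.close()
```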
When would the socket be closed? The comment could be improved by explaining that case (i.e. the when/why, not the how).
We have at least one test that explicitly closes the socket itself rather than the connection. Sounds good, I'll add a comment.
EVG benchmark results: https://performance-analyzer.server-tig.prod.corp.mongodb.com/perf-analyzer-viz?comparison_id=9802c9f9-659e-46b3-97ac-c268f34a581c&selected_tab=data-table&percent_filter=0%7C%7C100&z_filter=0%7C%7C10

A slight decrease for some operations, a slight improvement or no change for others. I ran a very simple local benchmark that created 100 threads and executed either a
Good to go. Thank you. I learned a lot doing this review!
Looking at the perf results it does appear this results in a ~5% decrease. Locally with 100 threads I see a decrease of somewhere between 5-10%:

(benchmark output: before vs. after)

With a single-threaded benchmark the total execution time is about the same, but the user and system CPU time go up by 5%:

(benchmark output: before vs. after)
Here's the single-threaded benchmark:

```python
from pymongo import MongoClient

client = MongoClient()

N_ITERATIONS = 50000


def main():
    for _ in range(N_ITERATIONS):
        client.admin.command('ping')


if __name__ == "__main__":
    main()
```

I don't think this should hold up this PR but it would be good to circle back and see how we can improve. Could you open a new ticket?
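The wall-clock vs. CPU-time distinction above can be measured directly. The `measure` helper and the busy-work payload below are illustrative, not part of the benchmark that was actually run.

```python
import time


def measure(fn):
    """Return (wall_seconds, cpu_seconds) for a callable. Roughly equal
    wall time with higher CPU time suggests extra work per wakeup (such
    as added select() polling), rather than longer waits."""
    wall0, cpu0 = time.perf_counter(), time.process_time()
    fn()
    return time.perf_counter() - wall0, time.process_time() - cpu0


wall, cpu = measure(lambda: sum(i * i for i in range(200_000)))
print(f"wall={wall:.4f}s cpu={cpu:.4f}s")
```

`time.process_time()` counts user plus system CPU time for the process, so it isolates the cost of the polling loop from time spent blocked on the server.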
Still need to update the unified runner to check interruptInUseConnections. It looks like the test runner ignores the interruptInUseConnections field and only checks for hasServiceId:

mongo-python-driver/test/unified_format.py, lines 836 to 838 (at b1939e1):

```python
elif name == "poolClearedEvent":
    self.test.assertIsInstance(actual, PoolClearedEvent)
    self.assertHasServiceId(spec, actual)
```
LGTM!