Don't wake up destination transaction queue if they're not due for retry. #16223

erikjohnston · 2023-09-02T10:22:00Z

Based on #16221.

When we send an EDU to a large number of hosts (e.g. presence or a receipt in a large room), we instansiate a PerDestinationQueue per host. This has a bunch of overhead, and is pointless for hosts that are marked as "down".

To avoid this, we make sure to only wake up destinations that we will actually try to contact with the new data. This dramatically reduces CPU and memory usages of federation sending in large rooms.

DMRobertson

Would love to see some graphs showing perf gains if you have them!

DMRobertson · 2023-09-04T14:38:35Z

changelog.d/16223.misc

@@ -0,0 +1 @@
+Improve resource usage when sending data to a large number of remote hosts that are marked as "down".


Arguably .feature

DMRobertson · 2023-09-04T14:46:35Z

synapse/storage/databases/main/transactions.py

+        )
+
+        return {
+            row.pop("destination"): DestinationRetryTimings(**row)


I guess it's critical here that the key (and hence the pop) is evaluated before the value!

Yeah, it is a little hacky. I can update it to actually specify each key explicitly?

I think it's fine, it just took me a moment to see what was going on. (And: it's the kind of thing that I'm paranoid might break in a future Python release)

DMRobertson · 2023-09-04T14:48:50Z

synapse/util/retryutils.py

+    retry_timings = await store.get_destination_retry_timings_batch(destinations)
+
+    now = int(clock.time_msec())
+
+    return [
+        destination
+        for destination, timings in retry_timings.items()
+        if timings is None
+        or timings.retry_last_ts + timings.retry_interval <= now + retry_due_within_ms
+    ]


I was gonna say "it might be faster to do this logic in the query"... but I guess that would get in the way of using cachedList to piggyback off the existing cache?

Yeah, exactly

DMRobertson · 2023-09-04T14:49:52Z

tests/handlers/test_presence.py

@@ -909,8 +909,14 @@ def test_send_and_get(self) -> None:

        prev_token = self.queue.get_current_token(self.instance_name)

-        self.queue.send_presence_to_destinations((state1, state2), ("dest1", "dest2"))


Here and below: was the test broken by not having get_success (or similar) here?

Yeah, as calling a coroutine doesn't do anything unless you await on it

DMRobertson · 2023-09-04T14:50:18Z

tests/handlers/test_typing.py

-        self.datastore.get_destination_retry_timings = AsyncMock(return_value=None)
-


Was this just unused?

Yeah, it must have broken the test somehow but I've forgotten how

DMRobertson · 2023-09-04T14:55:23Z

synapse/federation/sender/__init__.py

This seems to add the filtering to

_send_pdu

send_presence_to_destinations

send_read_receipt

Are there other EDUs we should worry about here? (Typing? Device list stuff? to-device messages?)

What about the other methods on FederationSender?

send_edu

send_device_messages

is the point that only the former three handle multiple destinations in one call?

(Oh there is logic below for typing EDUs. But why don't they go via the federation sender too?)

Oh, mainly because a) send_edu and send_device_messages just take a single host, and b) I forgot about device messages 🤦

erikjohnston · 2023-09-04T15:14:56Z

Would love to see some graphs showing perf gains if you have them!

I have a memory graph, but given I deployed it at the start of the weekend I don't think its particularly useful.

DMRobertson · 2023-09-04T15:27:49Z

tests/federation/test_federation_sender.py

-        self.hs.get_federation_sender().send_device_messages("host2")
+        self.get_success(
+            self.hs.get_federation_sender().send_device_messages(["host2"])
+        )


Ugh, more of these. Maybe we should resurrect #14519 , at least for the tests

DMRobertson · 2023-09-04T15:29:13Z

synapse/handlers/device.py

-                        for host in hosts:
-                            self.federation_sender.send_device_messages(
-                                host, immediate=False
-                            )
-                            # TODO: when called, this isn't in a logging context.
-                            # This leads to log spam, sentry event spam, and massive
-                            # memory usage.
-                            # See https://github.com/matrix-org/synapse/issues/12552.
-                            # log_kv(
-                            #     {"message": "sent device update to host", "host": host}
-                            # )
+                        await self.federation_sender.send_device_messages(
+                            hosts, immediate=False
+                        )
+                        # TODO: when called, this isn't in a logging context.
+                        # This leads to log spam, sentry event spam, and massive
+                        # memory usage.
+                        # See https://github.com/matrix-org/synapse/issues/12552.
+                        # log_kv(
+                        #     {"message": "sent device update to host", "host": host}
+                        # )


7 -> 6 levels of indentation :)

DMRobertson

Happy if CI is and you're convinced we haven't missed out any other EDUs.

@evilham

- Add configuration setting for CAS protocol version. Contributed by Aurélien Grimpard. ([\#15816](#15816)) - Suppress notifications from message edits per [MSC3958](matrix-org/matrix-spec-proposals#3958). ([\#16113](#16113)) - Return a `Retry-After` with `M_LIMIT_EXCEEDED` error responses. ([\#16136](#16136)) - Add `last_seen_ts` to the [admin users API](https://matrix-org.github.io/synapse/latest/admin_api/user_admin_api.html). ([\#16218](#16218)) - Improve resource usage when sending data to a large number of remote hosts that are marked as "down". ([\#16223](#16223)) - Fix IPv6-related bugs on SMTP settings, adding groundwork to fix similar issues. Contributed by @evilham and @telmich (ungleich.ch). ([\#16155](#16155)) - Fix a spec compliance issue where requests to the `/publicRooms` federation API would specify `include_all_networks` as a string. ([\#16185](#16185)) - Fix inaccurate error message while attempting to ban or unban a user with the same or higher PL by spliting the conditional statements. Contributed by @leviosacz. ([\#16205](#16205)) - Fix a rare bug that broke looping calls, which could lead to e.g. linearly increasing memory usage. Introduced in v1.90.0. ([\#16210](#16210)) - Fix a long-standing bug where uploading images would fail if we could not generate thumbnails for them. ([\#16211](#16211)) - Fix a long-standing bug where we did not correctly back off from servers that had "gone" if they returned 4xx series error codes. ([\#16221](#16221)) - Update links to the [matrix.org blog](https://matrix.org/blog/). ([\#16008](#16008)) - Document which [admin APIs](https://matrix-org.github.io/synapse/latest/usage/administration/admin_api/index.html) are disabled when experimental [MSC3861](matrix-org/matrix-spec-proposals#3861) support is enabled. ([\#16168](#16168)) - Document [`exclude_rooms_from_sync`](https://matrix-org.github.io/synapse/v1.92/usage/configuration/config_documentation.html#exclude_rooms_from_sync) configuration option. ([\#16178](#16178)) - Prepare unit tests for Python 3.12. ([\#16099](#16099)) - Fix nightly CI jobs. ([\#16121](#16121), [\#16213](#16213)) - Describe which rate limiter was hit in logs. ([\#16135](#16135)) - Simplify presence code when using workers. ([\#16170](#16170)) - Track per-device information in the presence code. ([\#16171](#16171), [\#16172](#16172)) - Stop using the `event_txn_id` table. ([\#16175](#16175)) - Use `AsyncMock` instead of custom code. ([\#16179](#16179), [\#16180](#16180)) - Improve error reporting of invalid data passed to `/_matrix/key/v2/query`. ([\#16183](#16183)) - Task scheduler: add replication notify for new task to launch ASAP. ([\#16184](#16184)) - Improve type hints. ([\#16186](#16186), [\#16188](#16188), [\#16201](#16201)) - Bump black version to 23.7.0. ([\#16187](#16187)) - Log the details of background update failures. ([\#16212](#16212)) - Cache device resync requests over replication. ([\#16241](#16241)) * Bump anyhow from 1.0.72 to 1.0.75. ([\#16141](#16141)) * Bump furo from 2023.7.26 to 2023.8.19. ([\#16238](#16238)) * Bump phonenumbers from 8.13.18 to 8.13.19. ([\#16237](#16237)) * Bump psycopg2 from 2.9.6 to 2.9.7. ([\#16196](#16196)) * Bump regex from 1.9.3 to 1.9.4. ([\#16195](#16195)) * Bump ruff from 0.0.277 to 0.0.286. ([\#16198](#16198)) * Bump sentry-sdk from 1.29.2 to 1.30.0. ([\#16236](#16236)) * Bump serde from 1.0.184 to 1.0.188. ([\#16194](#16194)) * Bump serde_json from 1.0.104 to 1.0.105. ([\#16140](#16140)) * Bump types-psycopg2 from 2.9.21.10 to 2.9.21.11. ([\#16200](#16200)) * Bump types-pyyaml from 6.0.12.10 to 6.0.12.11. ([\#16199](#16199))

erikjohnston added 9 commits September 1, 2023 13:49

Don't reset retry timers on error

ff14b77

Fix up

87ac2ed

Fix up

99d6e1c

Newsfile

badce67

Fix tests

0b68944

Don't wake up destinations

f2149dc

Newsfile

9456f06

Check nulls

55c641e

Add desc

448e7d9

Base automatically changed from erikj/dont_reset_retry_on_403 to develop September 4, 2023 13:04

Merge branch 'develop' into erikj/prune_destinations

89c329e

erikjohnston marked this pull request as ready for review September 4, 2023 14:04

erikjohnston requested a review from a team as a code owner September 4, 2023 14:04

DMRobertson approved these changes Sep 4, 2023

View reviewed changes

erikjohnston added 2 commits September 4, 2023 16:23

Also apply to sending device messages

932cc3e

Update newsfile

cb2295f

erikjohnston requested a review from DMRobertson September 4, 2023 15:24

DMRobertson reviewed Sep 4, 2023

View reviewed changes

DMRobertson approved these changes Sep 4, 2023

View reviewed changes

erikjohnston merged commit d35bed8 into develop Sep 4, 2023
37 checks passed

erikjohnston deleted the erikj/prune_destinations branch September 4, 2023 16:14

DMRobertson mentioned this pull request Sep 11, 2023

Filter out down hosts when retrying fetching device lists #16298

Merged

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Don't wake up destination transaction queue if they're not due for retry. #16223

Don't wake up destination transaction queue if they're not due for retry. #16223

erikjohnston commented Sep 2, 2023

DMRobertson left a comment

DMRobertson Sep 4, 2023

DMRobertson Sep 4, 2023

erikjohnston Sep 4, 2023

DMRobertson Sep 4, 2023

DMRobertson Sep 4, 2023

erikjohnston Sep 4, 2023

DMRobertson Sep 4, 2023

erikjohnston Sep 4, 2023

DMRobertson Sep 4, 2023

erikjohnston Sep 4, 2023

DMRobertson Sep 4, 2023

DMRobertson Sep 4, 2023

erikjohnston Sep 4, 2023

erikjohnston commented Sep 4, 2023

DMRobertson Sep 4, 2023

DMRobertson Sep 4, 2023

DMRobertson left a comment

		@@ -0,0 +1 @@
		Improve resource usage when sending data to a large number of remote hosts that are marked as "down".

		@@ -909,8 +909,14 @@ def test_send_and_get(self) -> None:

		prev_token = self.queue.get_current_token(self.instance_name)

		self.queue.send_presence_to_destinations((state1, state2), ("dest1", "dest2"))

		self.datastore.get_destination_retry_timings = AsyncMock(return_value=None)

Don't wake up destination transaction queue if they're not due for retry. #16223

Don't wake up destination transaction queue if they're not due for retry. #16223

Conversation

erikjohnston commented Sep 2, 2023

DMRobertson left a comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

erikjohnston commented Sep 4, 2023

Choose a reason for hiding this comment

Choose a reason for hiding this comment

DMRobertson left a comment

Choose a reason for hiding this comment