This repository has been archived by the owner on Apr 26, 2024. It is now read-only.

replication timeouts due to message retention purge jobs #16489

Open
verymilan opened this issue Oct 14, 2023 · 4 comments
Labels
A-Message-Retention-Policies Issues related to Synapse's support of MSC1763 (message retention policies) A-Retention Retention rules to delete messages after a certain amount of time O-Occasional Affects or can be seen by some users regularly or most users rarely S-Major Major functionality / product severely impaired, no satisfactory workaround. T-Defect Bugs, crashes, hangs, security vulnerabilities, or other reported issues.

Comments


verymilan commented Oct 14, 2023

Description

Presumably due to #13632, the master process is unable to handle replication requests from workers because of the load from purge jobs. It happily logs updates on the purge job state while clients can no longer connect.

Steps to reproduce

  • enable message retention (a minimal config sketch follows this list)
  • probably be a large instance; the problem seems to need scale to show up
  • wait for the scheduled purge job to execute
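
For reference, a minimal sketch of what "enable message retention" means in homeserver.yaml, assuming the documented retention options; the values below are illustrative, not this server's actual settings:

  retention:
    enabled: true
    default_policy:
      min_lifetime: 1d
      max_lifetime: 1y
    purge_jobs:
      # run frequently for rooms with short lifetimes...
      - longest_max_lifetime: 3d
        interval: 12h
      # ...and less often for rooms with longer lifetimes
      - shortest_max_lifetime: 3d
        interval: 1d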

Homeserver

tchncs.de

Synapse Version

1.94.0

Installation Method

pip (from PyPI)

Database

PostgreSQL

Workers

Multiple workers

Platform

Debian GNU/Linux 12 (bookworm), dedicated

Configuration

draupnir module, presence, retention

Relevant log output

synapse.replication.tcp.client - 352 - INFO - _process_incoming_pdus_in_room_inner-124023-$fbrT_6mck678v_gNV527V0f5Jp4kvbDiQVSeHOmiN2E - Finished waiting for repl stream 'events' to reach 361593234 (event_persister1)
synapse.http.client - 923 - INFO - PUT-890470 - Received response to POST synapse-replication://master/_synapse/replication/fed_send_edu/m.receipt/IjFSBKBxIa: 200
synapse.replication.tcp.client - 332 - INFO - PUT-890470 - Waiting for repl stream 'caches' to reach 416737455 (master); currently at: 416710210
synapse.replication.tcp.client - 342 - WARNING - PUT-890464 - Timed out waiting for repl stream 'caches' to reach 416737417 (master); currently at: 416710510
synapse.replication.tcp.client - 342 - WARNING - PUT-890470 - Timed out waiting for repl stream 'caches' to reach 416737422 (master); currently at: 416710510
synapse.replication.tcp.client - 342 - WARNING - PUT-890470 - Timed out waiting for repl stream 'caches' to reach 416737422 (master); currently at: 416710510
synapse.replication.tcp.client - 342 - WARNING - PUT-890470 - Timed out waiting for repl stream 'caches' to reach 416737422 (master); currently at: 416710510
synapse.replication.tcp.client - 342 - WARNING - PUT-890470 - Timed out waiting for repl stream 'caches' to reach 416737422 (master); currently at: 416710510
synapse.replication.tcp.client - 342 - WARNING - PUT-890470 - Timed out waiting for repl stream 'caches' to reach 416737422 (master); currently at: 416710510
synapse.replication.http._base - 300 - WARNING - GET-2559861 - presence_set_state request timed out; retrying
synapse.replication.http._base - 312 - WARNING - PUT-899550 - fed_send_edu request connection failed; retrying in 1s: ConnectError(<twisted.python.failure.Failure twisted.internet.error.ConnectionLost: Connection to the other side was lost in a non-clean fashion: Connection lost.>)
synapse.http.client - 932 - INFO - PUT-901284 - Error sending request to  POST synapse-replication://master/_synapse/replication/fed_send_edu/m.presence/WCoECfmCdH: ConnectError [Failure instance: Traceback (failure with no frames): <class 'twisted.internet.error.ConnectionLost'>: Connection to the other side was lost in a non-clean fashion: Connection lost.
]
(screenshot attached: SCR-20231014-kdvv)

clokep commented Oct 16, 2023

I think this issue is mostly that "message retention is expensive and must be done on the main process"?


clokep commented Oct 16, 2023

It seems like there are a few issues here, though:

  • Message retention purging being expensive.
  • Message retention being handled by the main process.
  • Replication falling behind due to message retention using all CPU.

I'm unsure of the best approach to solve this, or whether all three need to be addressed.

@DMRobertson DMRobertson added A-Message-Retention-Policies Issues related to Synapse's support of MSC1763 (message retention policies) A-Retention Retention rules to delete messages after a certain amount of time S-Major Major functionality / product severely impaired, no satisfactory workaround. T-Defect Bugs, crashes, hangs, security vulnerabilities, or other reported issues. O-Occasional Affects or can be seen by some users regularly or most users rarely labels Oct 27, 2023

abeluck commented Nov 16, 2023

We want to give our users a message retention promise, but it's currently not possible on large servers with databases in the hundreds of gigabytes.

Message retention purging being expensive.

I'm not sure this is solvable. Purging potentially involves touching a very large number of rows.

It seems to me there are a couple of possibilities:

  1. Leave purge jobs on the primary process, but cap their run time or do some sort of coroutine-style yielding so that replication traffic still gets CPU time. (There is prior art in limiting background jobs to a certain duration.) A rough sketch of this idea follows the list.
  2. Move the purge job to a worker. I know this was tried before but was switched back to the primary process for what seems like a lack of dev time rather than a technical reason (though obviously some communication happens outside GitHub issues).
    • This would require moving the purge_history function mentioned in the above issue to a place where workers can access it, thereby moving the purge job off the main process.
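
A rough sketch of option 1, assuming a batched purge loop: delete rows in bounded chunks, yield to the event loop between chunks so replication requests can still be served, and stop after a time cap so the next scheduled run continues the work. The names purge_event_batch, BATCH_SIZE, and MAX_RUNTIME_SECS are hypothetical stand-ins, not real Synapse APIs.

  import asyncio
  import time

  BATCH_SIZE = 500        # hypothetical: rows deleted per transaction
  MAX_RUNTIME_SECS = 300  # hypothetical: cap on a single purge run

  async def purge_event_batch(room_id: str, up_to_ts: int, limit: int) -> int:
      # Placeholder for the real deletion query; returns how many rows it removed.
      return 0

  async def purge_room_history(room_id: str, up_to_ts: int) -> None:
      # Delete expired events in bounded batches, yielding between them.
      deadline = time.monotonic() + MAX_RUNTIME_SECS
      while True:
          deleted = await purge_event_batch(room_id, up_to_ts, BATCH_SIZE)
          if deleted == 0:
              break  # nothing left to purge for this room
          if time.monotonic() > deadline:
              break  # hit the time cap; the next scheduled run resumes the work
          # Yield control so queued replication requests can be processed.
          await asyncio.sleep(0)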

Since I'm not as familiar with Synapse's internals as the devs, I can't say which is the most attractive. But this issue is very painful for us, and it's difficult to explain to stakeholders that despite the feature being well documented, it in fact breaks the server and has been like that for quite a while.


abeluck commented Nov 28, 2023

The recent commit removing the experimental warning from retention is not a good idea IMO. While it's fabulous that there is no longer a risk of corruption bugs, as this issue establishes there is still a serious bug in the retention feature as it exists: your server will essentially stop working if you have a large number of events that need purging.

fb664 Remove warnings from the docs about using message retention.
