Federation workers continue accepting events after the event persister crashes, losing them #14924
Labels
A-Federation
O-Uncommon: Most users are unlikely to come across this or unexpected workflow
S-Major: Major functionality / product severely impaired, no satisfactory workaround.
T-Defect: Bugs, crashes, hangs, security vulnerabilities, or other reported issues.
Description
My event persister crashes at inopportune times (like 2am), but the federation worker (a combined federation reader and inbound worker in my setup) continues to accept transactions, replying 200 OK to them. My guess is that the worker assumes the persister is online: it sees events in the transaction, queues them, and 200 OKs the transaction, but the persister turns out not to be online, so the replication calls fail.
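For illustration, here is a minimal, self-contained sketch of the suspected sequence. This is not Synapse's actual code; the names (on_incoming_transaction, persist_via_replication) are hypothetical stand-ins for the real federation handler and the HTTP replication call to the event persister worker:

```python
import asyncio


class ReplicationError(Exception):
    """Raised when the call to the event persister worker fails."""


async def persist_via_replication(events):
    # Hypothetical stand-in for the replication call to the event
    # persister worker. If that worker has crashed, this fails *after*
    # the transaction has already been acknowledged.
    raise ReplicationError("event persister is unreachable")


async def on_incoming_transaction(events):
    # Suspected sequence: the worker accepts the transaction and
    # schedules persistence in the background...
    task = asyncio.ensure_future(persist_via_replication(events))
    # ...then returns 200 OK to the sending server immediately.
    response = (200, {})
    try:
        await task
    except ReplicationError:
        # By now the remote server believes the events were delivered,
        # so it never retries them and they are effectively lost.
        pass
    return response


async def main():
    status, _ = await on_incoming_transaction(["$event_a", "$event_b"])
    print(f"sender saw HTTP {status}, but nothing was persisted")


asyncio.run(main())
```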
This has the effect that your server only sees the last 10 events in a room when someone else sends an event (as the server then backfills the prev_events). For rooms where your server is unlikely to receive another event (DMs), your only option is to go around asking people to "resend" or otherwise send a message in the room.
Steps to reproduce
Homeserver
t2l.io
Synapse Version
{"server_version":"1.75.0","python_version":"3.10.6"}
Installation Method
Other (please mention below)
Database
postgresql - no split db, restored from sqlite an eternity ago
Workers
Multiple workers
Platform
Ubuntu droplet on digital ocean
Configuration
Not applicable.
Relevant log output
Anything else that would be useful to know?
It appears the federation worker is aware that it's unable to persist events, and is pulling prev_events for events it's receiving.
My server is installed from git tags with pip, and although it is currently on a faster-joins RC, this problem has been happening for over a year.
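The "last 10 events" symptom is consistent with the limited catch-up a server performs when an incoming event references prev_events it does not have: the federation get_missing_events endpoint returns at most `limit` events (Synapse requests 10), so anything older in the gap is not recovered. Below is a hypothetical sketch of that request; the helper name and the exact default are assumptions on my part, though the endpoint path itself is the real federation API route:

```python
import json


def build_get_missing_events_request(room_id, earliest, latest, limit=10):
    # Real federation API route for fetching missing events.
    path = f"/_matrix/federation/v1/get_missing_events/{room_id}"
    body = {
        # Events we already have (stop walking back once these are reached).
        "earliest_events": earliest,
        # The prev_events of the new event that we are missing.
        "latest_events": latest,
        # At most this many events are returned; anything older stays
        # missing until another event or backfill triggers a new fetch.
        "limit": limit,
    }
    return path, json.dumps(body)


path, body = build_get_missing_events_request(
    "!room:example.org", ["$have"], ["$missing"]
)
print("POST", path)
print(body)
```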