Skip to content
This repository has been archived by the owner on Apr 26, 2024. It is now read-only.

Federation workers continue accepting events after the event persister crashes, losing them #14924

Open
turt2live opened this issue Jan 26, 2023 · 1 comment
Labels
A-Federation O-Uncommon Most users are unlikely to come across this or unexpected workflow S-Major Major functionality / product severely impaired, no satisfactory workaround. T-Defect Bugs, crashes, hangs, security vulnerabilities, or other reported issues.

Comments

@turt2live
Copy link
Member

Description

My event persister crashes at inopportune times like 2am, but the federation worker (a combined reader and inbound worker in my setup) continues to accept transactions, 200 OKing them. My guess is the worker thinks the persister is online, sees events in the transaction, queues the events, 200 OK's the transaction, persister ends up not being online so the replication calls fail.

This has the effect where your server only sees the last 10 events in a room when someone else sends an event (as the server is now backfilling the prev_events). For rooms where your server is not likely to receive another event (DMs), your only option is to go around asking people to "resend" or otherwise send a message in the room.

Steps to reproduce

  • Set up at least a federation inbound worker and event persister
  • Kill your event persister
  • Receive events over federation
  • Restart Synapse to recover your event persister
  • :(

Homeserver

t2l.io

Synapse Version

{"server_version":"1.75.0","python_version":"3.10.6"}

Installation Method

Other (please mention below)

Database

postgresql - no split db, restored from sqlite an eternity ago

Workers

Multiple workers

Platform

Ubuntu droplet on digital ocean

Configuration

Not applicable.

Relevant log output

[federation_reader_1] 2023-01-26 13:59:31,155 - synapse.federation.federation_server - 1221 - INFO - _process_incoming_pdus_in_room_inner-5846- handling received PDU in room !fLeHeojWgBGJlLNdLC:matrix.org: <FrozenEventV3 event_id=$weVjhE57ZCE4mT7MieAEZMP92esMet55ZTy0yhv1OBI, type=m.room.message, state_key=None, outlier=False>
[federation_reader_1] 2023-01-26 13:59:31,234 - synapse.handlers.federation_event - 266 - INFO - _process_incoming_pdus_in_room_inner-5846-$weVjhE57ZCE4mT7MieAEZMP92esMet55ZTy0yhv1OBI- Acquiring room lock to fetch 1 missing prev_events: ['$2fWcbKbztz66GHmXPQf3whytBpwKA1DsLGx656tRuEs']
[federation_reader_1] 2023-01-26 13:59:31,234 - synapse.handlers.federation_event - 272 - INFO - _process_incoming_pdus_in_room_inner-5846-$weVjhE57ZCE4mT7MieAEZMP92esMet55ZTy0yhv1OBI- Acquired room lock to fetch 1 missing prev_events
[federation_reader_1] 2023-01-26 13:59:31,262 - synapse.handlers.federation_event - 689 - INFO - _process_incoming_pdus_in_room_inner-5846-$weVjhE57ZCE4mT7MieAEZMP92esMet55ZTy0yhv1OBI- Requesting missing events between ['$Bm3Fvkhvl8T65KNEmMVVbzgMn7POhmXjvzgrU8QfwIU'] and $weVjhE57ZCE4mT7MieAEZMP92esMet55ZTy0yhv1OBI
[federation_reader_1] 2023-01-26 13:59:31,384 - synapse.http.matrixfederationclient - 299 - INFO - _process_incoming_pdus_in_room_inner-5846-$weVjhE57ZCE4mT7MieAEZMP92esMet55ZTy0yhv1OBI- {POST-O-20667} [matrix.org] Completed request: 200 OK in 0.12 secs, got 8003 bytes - POST matrix://matrix.org/_matrix/federation/v1/get_missing_events/%21fLeHeojWgBGJlLNdLC%3Amatrix.org
[federation_reader_1] 2023-01-26 13:59:31,392 - synapse.handlers.federation_event - 760 - INFO - _process_incoming_pdus_in_room_inner-5846-$weVjhE57ZCE4mT7MieAEZMP92esMet55ZTy0yhv1OBI- Got 10 prev_events
[federation_reader_1] 2023-01-26 13:59:31,426 - synapse.handlers.federation_event - 868 - INFO - _process_incoming_pdus_in_room_inner-5846-$weVjhE57ZCE4mT7MieAEZMP92esMet55ZTy0yhv1OBI-$gIjhjD3MhzlotyC3LeBk-MNoNfxQ2xFItdiUaHJ182c- Processing pulled event <FrozenEventV3 event_id=$gIjhjD3MhzlotyC3LeBk-MNoNfxQ2xFItdiUaHJ182c, type=m.room.member, state_key=@moritzr:element.io, outlier=False>
[federation_reader_1] 2023-01-26 13:59:31,453 - synapse.handlers.federation_event - 1018 - INFO - _process_incoming_pdus_in_room_inner-5846-$weVjhE57ZCE4mT7MieAEZMP92esMet55ZTy0yhv1OBI-$gIjhjD3MhzlotyC3LeBk-MNoNfxQ2xFItdiUaHJ182c- Event $gIjhjD3MhzlotyC3LeBk-MNoNfxQ2xFItdiUaHJ182c is missing prev_events ['$bPUqYUaxmqufnN9eckAUwuOjbhKGNBxncVFqmTJgU64']: calculating state for a backwards extremity
[federation_reader_1] 2023-01-26 13:59:31,453 - synapse.handlers.federation_event - 1048 - INFO - _process_incoming_pdus_in_room_inner-5846-$weVjhE57ZCE4mT7MieAEZMP92esMet55ZTy0yhv1OBI-$gIjhjD3MhzlotyC3LeBk-MNoNfxQ2xFItdiUaHJ182c- Requesting state after missing prev_event $bPUqYUaxmqufnN9eckAUwuOjbhKGNBxncVFqmTJgU64
[federation_reader_1] 2023-01-26 13:59:31,990 - synapse.http.matrixfederationclient - 299 - INFO - _process_incoming_pdus_in_room_inner-5846-$weVjhE57ZCE4mT7MieAEZMP92esMet55ZTy0yhv1OBI-$gIjhjD3MhzlotyC3LeBk-MNoNfxQ2xFItdiUaHJ182c-$bPUqYUaxmqufnN9eckAUwuOjbhKGNBxncVFqmTJgU64- {GET-O-20668} [matrix.org] Completed request: 200 OK in 0.54 secs, got 77629 bytes - GET matrix://matrix.org/_matrix/federation/v1/state_ids/%21fLeHeojWgBGJlLNdLC%3Amatrix.org?event_id=%24bPUqYUaxmqufnN9eckAUwuOjbhKGNBxncVFqmTJgU64
[federation_reader_1] 2023-01-26 13:59:32,959 - synapse.http.matrixfederationclient - 299 - INFO - _process_incoming_pdus_in_room_inner-5846-$weVjhE57ZCE4mT7MieAEZMP92esMet55ZTy0yhv1OBI-$gIjhjD3MhzlotyC3LeBk-MNoNfxQ2xFItdiUaHJ182c-$bPUqYUaxmqufnN9eckAUwuOjbhKGNBxncVFqmTJgU64-$bPUqYUaxmqufnN9eckAUwuOjbhKGNBxncVFqmTJgU64- {GET-O-20672} [matrix.org] Completed request: 200 OK in 0.11 secs, got 935 bytes - GET matrix://matrix.org/_matrix/federation/v1/event/%24bPUqYUaxmqufnN9eckAUwuOjbhKGNBxncVFqmTJgU64
[federation_reader_1] 2023-01-26 13:59:32,963 - synapse.handlers.federation_event - 1609 - INFO - _process_incoming_pdus_in_room_inner-5846-$weVjhE57ZCE4mT7MieAEZMP92esMet55ZTy0yhv1OBI-$gIjhjD3MhzlotyC3LeBk-MNoNfxQ2xFItdiUaHJ182c-$bPUqYUaxmqufnN9eckAUwuOjbhKGNBxncVFqmTJgU64- Fetched 1 events of 1 requested
[federation_reader_1] 2023-01-26 13:59:32,964 - synapse.handlers.federation_event - 1673 - INFO - _process_incoming_pdus_in_room_inner-5846-$weVjhE57ZCE4mT7MieAEZMP92esMet55ZTy0yhv1OBI-$gIjhjD3MhzlotyC3LeBk-MNoNfxQ2xFItdiUaHJ182c-$bPUqYUaxmqufnN9eckAUwuOjbhKGNBxncVFqmTJgU64- Persisting 1 of 1 remaining outliers: ['$bPUqYUaxmqufnN9eckAUwuOjbhKGNBxncVFqmTJgU64']
[federation_reader_1] 2023-01-26 13:59:33,307 - synapse.http.client - 469 - INFO - _process_incoming_pdus_in_room_inner-5846-$weVjhE57ZCE4mT7MieAEZMP92esMet55ZTy0yhv1OBI-$gIjhjD3MhzlotyC3LeBk-MNoNfxQ2xFItdiUaHJ182c-$bPUqYUaxmqufnN9eckAUwuOjbhKGNBxncVFqmTJgU64- Error sending request to  POST http://127.0.0.1:9070/_synapse/replication/fed_send_events/GHqwAIrARt: ConnectionRefusedError Connection refused

Anything else that would be useful to know?

It appears the federation worker is aware that it's unable to persist events, and is pulling prev_events for events it's receiving.

My server is installed from git tags with pip, and although is currently on a faster joins RC this problem has been happening for >1 year.

@squahtx squahtx added A-Federation S-Major Major functionality / product severely impaired, no satisfactory workaround. O-Uncommon Most users are unlikely to come across this or unexpected workflow T-Defect Bugs, crashes, hangs, security vulnerabilities, or other reported issues. labels Jan 26, 2023
@turt2live
Copy link
Member Author

Since this happened again for me, decided to risk it and only restart the event persister instead of all of Synapse: it appears the federation reader does retry its persist requests, but forgets its queue when it restarts. Unfortunately, while waiting for all the retries to go through the federation reader crashed 🤦

Sign up for free to subscribe to this conversation on GitHub. Already have an account? Sign in.
Labels
A-Federation O-Uncommon Most users are unlikely to come across this or unexpected workflow S-Major Major functionality / product severely impaired, no satisfactory workaround. T-Defect Bugs, crashes, hangs, security vulnerabilities, or other reported issues.
Projects
None yet
Development

No branches or pull requests

2 participants