Federation workers continue accepting events after the event persister crashes, losing them #14924
Labels
A-Federation
O-Uncommon: Most users are unlikely to come across this or unexpected workflow
S-Major: Major functionality / product severely impaired, no satisfactory workaround.
T-Defect: Bugs, crashes, hangs, security vulnerabilities, or other reported issues.
Description
My event persister crashes at inopportune times (like 2am), but the federation worker (a combined federation reader and inbound worker in my setup) continues to accept transactions, replying 200 OK to them. My guess is that the worker assumes the persister is online: it sees events in the transaction, queues them, and 200 OKs the transaction, but the persister turns out not to be online, so the replication calls fail.
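For illustration, here is a minimal, self-contained sketch of the suspected sequence. This is not Synapse's actual code; the names (on_incoming_transaction, persist_via_replication) are hypothetical stand-ins for the real federation handler and the HTTP replication call to the event persister worker:

```python
import asyncio


class ReplicationError(Exception):
    """Raised when the call to the event persister worker fails."""


async def persist_via_replication(events):
    # Hypothetical stand-in for the replication call to the event
    # persister worker. If that worker has crashed, this fails *after*
    # the transaction has already been acknowledged.
    raise ReplicationError("event persister is unreachable")


async def on_incoming_transaction(events):
    # Suspected sequence: the worker accepts the transaction and
    # schedules persistence in the background...
    task = asyncio.ensure_future(persist_via_replication(events))
    # ...then returns 200 OK to the sending server immediately.
    response = (200, {})
    try:
        await task
    except ReplicationError:
        # By now the remote server believes the events were delivered,
        # so it never retries them and they are effectively lost.
        pass
    return response


async def main():
    status, _ = await on_incoming_transaction(["$event_a", "$event_b"])
    print(f"sender saw HTTP {status}, but nothing was persisted")


asyncio.run(main())
```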
This has the effect that your server only sees the last 10 events in a room when someone else sends an event (as the server then backfills the prev_events). For rooms where your server is unlikely to receive another event (DMs), your only option is to go around asking people to "resend" or otherwise send a message in the room.
Steps to reproduce
Homeserver
t2l.io
Synapse Version
{"server_version":"1.75.0","python_version":"3.10.6"}
Installation Method
Other (please mention below)
Database
postgresql - no split db, restored from sqlite an eternity ago
Workers
Multiple workers
Platform
Ubuntu droplet on digital ocean
Configuration
Not applicable.
Relevant log output
Anything else that would be useful to know?
It appears the federation worker is aware that it's unable to persist events, and is pulling prev_events for events it's receiving.
My server is installed from git tags with pip, and although it is currently on a faster-joins RC, this problem has been happening for over a year.
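The "last 10 events" symptom is consistent with the limited catch-up a server performs when an incoming event references prev_events it does not have: the federation get_missing_events endpoint returns at most `limit` events (Synapse requests 10), so anything older in the gap is not recovered. Below is a hypothetical sketch of that request; the helper name and the exact default are assumptions on my part, though the endpoint path itself is the real federation API route:

```python
import json


def build_get_missing_events_request(room_id, earliest, latest, limit=10):
    # Real federation API route for fetching missing events.
    path = f"/_matrix/federation/v1/get_missing_events/{room_id}"
    body = {
        # Events we already have (stop walking back once these are reached).
        "earliest_events": earliest,
        # The prev_events of the new event that we are missing.
        "latest_events": latest,
        # At most this many events are returned; anything older stays
        # missing until another event or backfill triggers a new fetch.
        "limit": limit,
    }
    return path, json.dumps(body)


path, body = build_get_missing_events_request(
    "!room:example.org", ["$have"], ["$missing"]
)
print("POST", path)
print(body)
```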