Skip to content
This repository has been archived by the owner on Apr 26, 2024. It is now read-only.

Waiting for streams to catch up when processing HTTP replication can get stuck timing out #15136

Closed
squahtx opened this issue Feb 22, 2023 · 3 comments
Labels
A-Workers Problems related to running Synapse in Worker Mode (or replication) O-Uncommon Most users are unlikely to come across this or unexpected workflow S-Major Major functionality / product severely impaired, no satisfactory workaround. T-Defect Bugs, crashes, hangs, security vulnerabilities, or other reported issues.

Comments

@squahtx
Copy link
Contributor

squahtx commented Feb 22, 2023

#14820 (introduced in v1.76.0) added code to wait for streams to catch up when receiving HTTP replication responses.

We've seen that code hit a failure mode where all waits on the backfill stream start timing out, on a deployment where backfill is very rare. It's not clear how Synapse entered that state or how to reproduce it yet.

@squahtx squahtx added A-Workers Problems related to running Synapse in Worker Mode (or replication) S-Major Major functionality / product severely impaired, no satisfactory workaround. T-Defect Bugs, crashes, hangs, security vulnerabilities, or other reported issues. O-Uncommon Most users are unlikely to come across this or unexpected workflow labels Feb 22, 2023
@richvdh
Copy link
Member

richvdh commented Feb 22, 2023

Typical logs on the event creator:

2023-02-19 03:48:05,720 - synapse.http.client - 463 - INFO - PUT-165- Received response to POST http://event-persister-repl:9092/_synapse/replication/send_events/NGLcoEtsGt: 200
2023-02-19 03:48:05,720 - synapse.replication.tcp.client - 372 - INFO - PUT-165- Waiting for repl stream 'events' to reach 208963421 (event_persister)
2023-02-19 03:48:05,726 - synapse.replication.tcp.client - 384 - INFO - PUT-165- Finished waiting for repl stream 'events' to reach 208963421 (event_persister)
2023-02-19 03:48:05,726 - synapse.replication.tcp.client - 372 - INFO - PUT-165- Waiting for repl stream 'backfill' to reach 7490201 (event_persister)
2023-02-19 03:48:06,689 - synapse.util.caches.lrucache - 212 - INFO - LruCache._expire_old_entries-476- Dropped 0 items from caches
2023-02-19 03:48:10,727 - synapse.replication.tcp.client - 381 - ERROR - PUT-165- Timed out waiting for stream backfill
2023-02-19 03:48:10,734 - synapse.access.http.8009 - 460 - INFO - PUT-165-  <...> - 8009 - <...> Processed request: 5.047sec/0.002sec (0.008sec, 0.000sec) (0.001sec/0.003sec/6) 59B 200 "PUT /_matrix/client/r0/rooms/<room id>/send..."

@richvdh
Copy link
Member

richvdh commented Feb 22, 2023

Logs on the event persister:

2023-02-19 02:11:59,238 - synapse.replication.tcp.resource - 193 - INFO - replication_notifier-50915- Streaming: backfill -> 7490199

... and nothing else until after we restart it and it streams starting at 7490202.

Incidentally, no events with stream ordering -7490200 or -7490201 have been persisted.

So it looks like something decided to allocate those two stream orderings, but then never persisted them?

@reivilibre
Copy link
Contributor

Should be fixed by #15150.

Sign up for free to subscribe to this conversation on GitHub. Already have an account? Sign in.
Labels
A-Workers Problems related to running Synapse in Worker Mode (or replication) O-Uncommon Most users are unlikely to come across this or unexpected workflow S-Major Major functionality / product severely impaired, no satisfactory workaround. T-Defect Bugs, crashes, hangs, security vulnerabilities, or other reported issues.
Projects
None yet
Development

No branches or pull requests

3 participants