This repository has been archived by the owner on Apr 26, 2024. It is now read-only.

Users stop syncing, postgres queries at high CPU #7618

Closed
f0x52 opened this issue Jun 2, 2020 · 15 comments
Labels
T-Other Questions, user support, anything else. X-Needs-Info This issue is blocked awaiting information from the reporter

Comments

f0x52 commented Jun 2, 2020

Description

At seemingly random intervals, sync totally cuts out for my account, while other users on my homeserver are able to send and receive just fine.
During this, there are multiple postgres queries related to my account, pegged at very high CPU usage:
[screenshot: postgres processes at very high CPU usage]

From SELECT datname,query FROM pg_stat_activity where datname='synapse'; it shows that most of these are related to state groups.
The same query is being run 8 times:

WITH RECURSIVE state(state_group) AS (
    VALUES(2380548::bigint)
    UNION ALL
    SELECT prev_state_group FROM state_group_edges e, state s
    WHERE s.state_group = e.state_group
)
SELECT DISTINCT type, state_key,
    last_value(event_id) OVER (
        PARTITION BY type, state_key
        ORDER BY state_group ASC
        ROWS BETWEEN UNBOUNDED PRECEDING AND UNBOUNDED FOLLOWING
    ) AS event_id
FROM state_groups_state
WHERE state_group IN ( SELECT state_group FROM state )
  AND ((type = 'm.room.history_visibility' AND state_key = '')
    OR (type = 'm.room.member' AND state_key = '@f0x:pixie.town'))
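
(For reference, a sketch of how one might also see how long these queries have been running, using only standard pg_stat_activity columns; the one-minute threshold is just an example:)

    -- List active queries on the synapse database that have been running
    -- for more than a minute, longest-running first.
    SELECT pid,
           now() - query_start AS runtime,
           left(query, 80)     AS query_prefix
    FROM pg_stat_activity
    WHERE datname = 'synapse'
      AND state = 'active'
      AND now() - query_start > interval '1 minute'
    ORDER BY runtime DESC;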

Synapse logging does not seem to report anything on this.
A while after Riot drops my sync completely (Connection to server lost), performance seems to stall for other users too, most likely due to postgres being too busy.

Version information

  • Homeserver: pixie.town

If not matrix.org:

  • Version: 1.14.0

  • Install method: pip

  • Platform: Bare-metal Debian sid
f0x52 (Author) commented Jun 2, 2020

after around 15 minutes, these postgres queries finish or get killed (not sure how to check which), and the server returns to normal
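
(A sketch of how one could check which it is and, if needed, kill such a query by hand, using standard Postgres admin functions; 12345 is a placeholder pid taken from pg_stat_activity:)

    -- See how long each active backend on the synapse database has been busy.
    SELECT pid, now() - query_start AS runtime, state
    FROM pg_stat_activity
    WHERE datname = 'synapse' AND state = 'active';

    -- Cancel just the running query on one backend...
    SELECT pg_cancel_backend(12345);
    -- ...or, as a last resort, terminate the whole connection.
    SELECT pg_terminate_backend(12345);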

f0x52 (Author) commented Jun 2, 2020

got a report that sync broke for at least one other user as well, though messages could still be sent from both of our accounts and bridged by a local appservice

f0x52 (Author) commented Jun 2, 2020

I think this could very well be forward-extremity related (#1760), as #ping:maunium.net rapidly accumulates those without Synapse seeming to resolve them. The current count sits at 18, but fluctuates whenever new messages are sent and received in that room. My server does not seem to send dummy events there, even though the count is almost always above 10.
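
(For reference, a sketch of how the per-room extremity counts can be checked directly in the database, assuming the standard Synapse schema where they are tracked in event_forward_extremities:)

    -- Rooms with the most forward extremities, worst first.
    SELECT room_id, count(*) AS extremities
    FROM event_forward_extremities
    GROUP BY room_id
    ORDER BY extremities DESC
    LIMIT 10;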

@f0x52 f0x52 changed the title Sync breaks for single user, postgres queries at high CPU Users stop syncing, postgres queries at high CPU Jun 2, 2020
erikjohnston (Member) commented:

Well, that is suboptimal. Can you do me a favour and run EXPLAIN ANALYZE WITH RECURSIVE state(state_group) AS ( VALUES(2380548::bigint) UNION ALL SELECT prev_state_group FROM state_group_edges e, state s WHERE s.state_group = e.state_group ) SELECT DISTINCT type, state_key, last_value(event_id) OVER ( PARTITION BY type, state_key ORDER BY state_group ASC ROWS BETWEEN UNBOUNDED PRECEDING AND UNBOUNDED FOLLOWING ) AS event_id FROM state_groups_state WHERE state_group IN ( SELECT state_group FROM state ) AND ((type = 'm.room.history_visibility' AND state_key = '') OR (type = 'm.room.member' AND state_key = '@f0x:pixie.town'))? That should tell us why it's slow.

I'm wondering if this is related to #5064

f0x52 (Author) commented Jun 2, 2020

 Unique  (cost=834420.78..834425.13 rows=435 width=81) (actual time=345.008..345.010 rows=2 loops=1)
   CTE state
     ->  Recursive Union  (cost=0.00..269.78 rows=101 width=8) (actual time=343.319..343.901 rows=77 loops=1)
           ->  Result  (cost=0.00..0.01 rows=1 width=8) (actual time=343.316..343.317 rows=1 loops=1)
           ->  Nested Loop  (cost=0.43..26.78 rows=10 width=8) (actual time=0.006..0.007 rows=1 loops=77)
                 ->  WorkTable Scan on state s  (cost=0.00..0.20 rows=10 width=8) (actual time=0.000..0.000 rows=1 loops=77)
                 ->  Index Scan using state_group_edges_idx on state_group_edges e  (cost=0.43..2.65 rows=1 width=16) (actual time=0.006..0.006 rows=1 loops=77)
                       Index Cond: (state_group = s.state_group)
   ->  Sort  (cost=834151.00..834152.09 rows=435 width=81) (actual time=345.007..345.007 rows=2 loops=1)
         Sort Key: state_groups_state.type, state_groups_state.state_key, (last_value(state_groups_state.event_id) OVER (?))
         Sort Method: quicksort  Memory: 25kB
         ->  WindowAgg  (cost=834122.15..834131.94 rows=435 width=81) (actual time=344.963..344.965 rows=2 loops=1)
               ->  Sort  (cost=834122.15..834123.24 rows=435 width=90) (actual time=344.953..344.953 rows=2 loops=1)
                     Sort Key: state_groups_state.type, state_groups_state.state_key, state_groups_state.state_group
                     Sort Method: quicksort  Memory: 25kB
                     ->  Nested Loop  (cost=2.96..834103.09 rows=435 width=90) (actual time=344.395..344.932 rows=2 loops=1)
                           ->  HashAggregate  (cost=2.27..3.28 rows=101 width=8) (actual time=343.971..343.990 rows=77 loops=1)
                                 Group Key: state.state_group
                                 ->  CTE Scan on state  (cost=0.00..2.02 rows=101 width=8) (actual time=343.321..343.929 rows=77 loops=1)
                           ->  Index Scan using state_groups_state_type_idx on state_groups_state  (cost=0.69..8258.37 rows=4 width=90) (actual time=0.008..0.012 rows=0 loops=77)
                                 Index Cond: (state_group = state.state_group)
                                 Filter: (((type = 'm.room.history_visibility'::text) AND (state_key = ''::text)) OR ((type = 'm.room.member'::text) AND (state_key = '@f0x:pixie.town'::text)))
                                 Rows Removed by Filter: 14
 Planning Time: 4.752 ms
 JIT:
   Functions: 26
   Options: Inlining true, Optimization true, Expressions true, Deforming true
   Timing: Generation 2.677 ms, Inlining 66.789 ms, Optimization 168.949 ms, Emission 107.352 ms, Total 345.767 ms
 Execution Time: 491.825 ms
(29 rows)

I will rerun it when everything fails, to see if there's a difference

f0x52 (Author) commented Jun 3, 2020

just happened again; the highest extremity count was 4 and executing the query above still took only ~500ms, but again that query was stuck being executed multiple times according to pg_stat_activity

erikjohnston (Member) commented:

Hmm, that looks a bit odd. I would expect the query to be faster than that, but the query plan looks correct. Is this on a hard drive or SSD?

#7567, which might help performance a little bit there, was recently merged, but I wouldn't expect that to make a massive difference here.

f0x52 (Author) commented Jun 3, 2020

postgres is on a LUKS-encrypted RAID1 over two SSDs
will set up prometheus/grafana to get some more system + synapse insight

erikjohnston (Member) commented:

Thanks!

For context: this is the query that fetches state for an event from the DB, which obviously happens really quite a lot throughout Synapse. Now, these queries are a bit heavyweight, but if you're seeing a lot of them stacking up then that is a problem. Some potential causes could be:

a) your postgres instance has poorer performance than most (either due to hardware or the various perf tuning config options not being set correctly),
b) something is causing Synapse to request more state than usual,
c) your server is busy and the state caches are way too small, or
d) these queries are actually backing up behind another resource-intensive query and are only showing up in the logs because they're relatively common.
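
(A sketch of how one might sanity-check the usual tuning knobs and the buffer-cache hit rate from SQL; the setting names are standard PostgreSQL ones, and the ~99% hit-rate target is just a rule of thumb:)

    -- The settings that most often explain slow state queries.
    SELECT name, setting, unit
    FROM pg_settings
    WHERE name IN ('shared_buffers', 'effective_cache_size',
                   'work_mem', 'random_page_cost');

    -- Rough buffer-cache hit ratio for the synapse database; values well
    -- below ~0.99 suggest Postgres is going to disk for hot data.
    SELECT blks_hit::float / nullif(blks_hit + blks_read, 0) AS cache_hit_ratio
    FROM pg_stat_database
    WHERE datname = 'synapse';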

It might also be worth turning on slow query logging on your postgres instance to see if there is anything else going on, and maybe setting up a postgres exporter for prometheus as well? I'm afraid at this point I don't have any suggestions beyond that :/
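
(Slow query logging can be switched on from SQL without editing postgresql.conf; a sketch, with the one-second threshold being just an example:)

    -- Log every statement that takes longer than 1000 ms (requires superuser).
    ALTER SYSTEM SET log_min_duration_statement = '1000';
    SELECT pg_reload_conf();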

joepie91 commented Jul 10, 2020

It seems that this issue can be reliably reproduced by doing an initial sync, for either @f0x52 or me. The moment I attempt one, Synapse falls over for 5+ minutes, also from the perspective of my Riot client which is doing normal incremental syncs.

deepbluev7 (Contributor) commented Jul 17, 2020

I can second this. For a few months now, initial syncs have been unbearably slow. With or without lazy-loading, they always seem to take more than 5 minutes, which causes my reverse proxy to fail the request and clients to retry until the sync completes. Postgres is basically spinning at 100% the whole time, while Synapse doesn't even use that much CPU. I've also heard from others affected by this, so it doesn't seem to be just me.

For comparison, non-lazy-loading initial syncs used to take less than a minute before this regression, so perf has degraded at least 5x. I didn't join that many rooms in that timeframe...

Half-Shot (Collaborator) commented:

This has recently started affecting me too. WITH RECURSIVE state(state_group) AS ( VALUES(7981870::bigint) ... queries often take over 5 minutes to process.

anoadragon453 (Member) commented:

> This has recently started affecting me too. WITH RECURSIVE state(state_group) AS ( VALUES(7981870::bigint) ... queries often take over 5 minutes to process.

Related to the follow-on discussion in #9182.

@anoadragon453 anoadragon453 added T-Other Questions, user support, anything else. X-Needs-Info This issue is blocked awaiting information from the reporter labels Mar 8, 2021
richvdh (Member) commented Jul 22, 2021

@anoadragon453 can you be clear what the info needed is? I'm not seeing it at a quick glance.

I think this issue might be a duplicate of #5064 (which is hopefully now fixed)?

erikjohnston (Member) commented:

I think this is a duplicate of #5064, so hopefully this will be fixed in v1.39.0.

If you don't want to wait, you can safely run the following on your DB manually:

ALTER TABLE state_groups_state ALTER COLUMN state_group SET (n_distinct = -0.02);
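
(Worth noting, per standard Postgres behaviour: a per-column n_distinct override is only picked up by the planner after the table's statistics are next gathered, so following it with an ANALYZE makes it take effect immediately:)

    -- Re-gather statistics so the planner starts using the new n_distinct estimate.
    ANALYZE state_groups_state;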
