This repository has been archived by the owner on Apr 26, 2024. It is now read-only.

Users stop syncing, postgres queries at high CPU #7618

Closed
f0x52 opened this issue Jun 2, 2020 · 15 comments
Labels
T-Other Questions, user support, anything else. X-Needs-Info This issue is blocked awaiting information from the reporter

Comments

f0x52 commented Jun 2, 2020

Description

At seemingly random intervals, sync totally cuts out for my account, while other users on my homeserver are able to send and receive just fine.
During this, there are multiple postgres queries related to my account, pegged at very high CPU usage:
[screenshot: postgres processes at very high CPU usage]

From SELECT datname,query FROM pg_stat_activity where datname='synapse'; it shows that most of these are related to state groups.
The same query is being run 8 times:

WITH RECURSIVE state(state_group) AS (
    VALUES(2380548::bigint)
    UNION ALL
    SELECT prev_state_group FROM state_group_edges e, state s
    WHERE s.state_group = e.state_group
)
SELECT DISTINCT type, state_key,
    last_value(event_id) OVER (
        PARTITION BY type, state_key
        ORDER BY state_group ASC
        ROWS BETWEEN UNBOUNDED PRECEDING AND UNBOUNDED FOLLOWING
    ) AS event_id
FROM state_groups_state
WHERE state_group IN ( SELECT state_group FROM state )
  AND ((type = 'm.room.history_visibility' AND state_key = '')
    OR (type = 'm.room.member' AND state_key = '@f0x:pixie.town'))
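
(For reference, a sketch of how one might also see how long these queries have been running, using only standard pg_stat_activity columns; the one-minute threshold is just an example:)

    -- List active queries on the synapse database that have been running
    -- for more than a minute, longest-running first.
    SELECT pid,
           now() - query_start AS runtime,
           left(query, 80)     AS query_prefix
    FROM pg_stat_activity
    WHERE datname = 'synapse'
      AND state = 'active'
      AND now() - query_start > interval '1 minute'
    ORDER BY runtime DESC;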

Synapse logging does not seem to report anything on this.
A while after Riot drops my sync completely (Connection to server lost), performance seems to stall for other users too, most likely due to postgres being too busy.

Version information

  • Homeserver: pixie.town

If not matrix.org:

  • Version: 1.14.0

  • Install method: pip

  • Platform: Bare-metal Debian sid
f0x52 (Author) commented Jun 2, 2020

after around 15 minutes, these postgres queries finish or get killed (not sure how to check which), and the server returns to normal
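
(A sketch of how one could check which it is and, if needed, kill such a query by hand, using standard Postgres admin functions; 12345 is a placeholder pid taken from pg_stat_activity:)

    -- See how long each active backend on the synapse database has been busy.
    SELECT pid, now() - query_start AS runtime, state
    FROM pg_stat_activity
    WHERE datname = 'synapse' AND state = 'active';

    -- Cancel just the running query on one backend...
    SELECT pg_cancel_backend(12345);
    -- ...or, as a last resort, terminate the whole connection.
    SELECT pg_terminate_backend(12345);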

f0x52 (Author) commented Jun 2, 2020

got a report that sync broke for at least one other user as well, though messages could still be sent from both of our accounts and bridged by a local appservice

f0x52 (Author) commented Jun 2, 2020

I think this could very well be forward-extremity related (#1760), as #ping:maunium.net rapidly accumulates those without Synapse seeming to resolve them. The current count sits at 18, but fluctuates whenever new messages are sent and received in that room. My server does not seem to send dummy events there, even though the count is almost always above 10.
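
(For reference, a sketch of how the per-room extremity counts can be checked directly in the database, assuming the standard Synapse schema where they are tracked in event_forward_extremities:)

    -- Rooms with the most forward extremities, worst first.
    SELECT room_id, count(*) AS extremities
    FROM event_forward_extremities
    GROUP BY room_id
    ORDER BY extremities DESC
    LIMIT 10;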

@f0x52 f0x52 changed the title Sync breaks for single user, postgres queries at high CPU Users stop syncing, postgres queries at high CPU Jun 2, 2020
erikjohnston (Member) commented:

Well, that is suboptimal. Can you do me a favour and run EXPLAIN ANALYZE WITH RECURSIVE state(state_group) AS ( VALUES(2380548::bigint) UNION ALL SELECT prev_state_group FROM state_group_edges e, state s WHERE s.state_group = e.state_group ) SELECT DISTINCT type, state_key, last_value(event_id) OVER ( PARTITION BY type, state_key ORDER BY state_group ASC ROWS BETWEEN UNBOUNDED PRECEDING AND UNBOUNDED FOLLOWING ) AS event_id FROM state_groups_state WHERE state_group IN ( SELECT state_group FROM state ) AND ((type = 'm.room.history_visibility' AND state_key = '') OR (type = 'm.room.member' AND state_key = '@f0x:pixie.town'))? That should tell us why it's slow.

I'm wondering if this is related to #5064

f0x52 (Author) commented Jun 2, 2020

 Unique  (cost=834420.78..834425.13 rows=435 width=81) (actual time=345.008..345.010 rows=2 loops=1)
   CTE state
     ->  Recursive Union  (cost=0.00..269.78 rows=101 width=8) (actual time=343.319..343.901 rows=77 loops=1)
           ->  Result  (cost=0.00..0.01 rows=1 width=8) (actual time=343.316..343.317 rows=1 loops=1)
           ->  Nested Loop  (cost=0.43..26.78 rows=10 width=8) (actual time=0.006..0.007 rows=1 loops=77)
                 ->  WorkTable Scan on state s  (cost=0.00..0.20 rows=10 width=8) (actual time=0.000..0.000 rows=1 loops=77)
                 ->  Index Scan using state_group_edges_idx on state_group_edges e  (cost=0.43..2.65 rows=1 width=16) (actual time=0.006..0.006 rows=1 loops=77)
                       Index Cond: (state_group = s.state_group)
   ->  Sort  (cost=834151.00..834152.09 rows=435 width=81) (actual time=345.007..345.007 rows=2 loops=1)
         Sort Key: state_groups_state.type, state_groups_state.state_key, (last_value(state_groups_state.event_id) OVER (?))
         Sort Method: quicksort  Memory: 25kB
         ->  WindowAgg  (cost=834122.15..834131.94 rows=435 width=81) (actual time=344.963..344.965 rows=2 loops=1)
               ->  Sort  (cost=834122.15..834123.24 rows=435 width=90) (actual time=344.953..344.953 rows=2 loops=1)
                     Sort Key: state_groups_state.type, state_groups_state.state_key, state_groups_state.state_group
                     Sort Method: quicksort  Memory: 25kB
                     ->  Nested Loop  (cost=2.96..834103.09 rows=435 width=90) (actual time=344.395..344.932 rows=2 loops=1)
                           ->  HashAggregate  (cost=2.27..3.28 rows=101 width=8) (actual time=343.971..343.990 rows=77 loops=1)
                                 Group Key: state.state_group
                                 ->  CTE Scan on state  (cost=0.00..2.02 rows=101 width=8) (actual time=343.321..343.929 rows=77 loops=1)
                           ->  Index Scan using state_groups_state_type_idx on state_groups_state  (cost=0.69..8258.37 rows=4 width=90) (actual time=0.008..0.012 rows=0 loops=77)
                                 Index Cond: (state_group = state.state_group)
                                 Filter: (((type = 'm.room.history_visibility'::text) AND (state_key = ''::text)) OR ((type = 'm.room.member'::text) AND (state_key = '@f0x:pixie.town'::text)))
                                 Rows Removed by Filter: 14
 Planning Time: 4.752 ms
 JIT:
   Functions: 26
   Options: Inlining true, Optimization true, Expressions true, Deforming true
   Timing: Generation 2.677 ms, Inlining 66.789 ms, Optimization 168.949 ms, Emission 107.352 ms, Total 345.767 ms
 Execution Time: 491.825 ms
(29 rows)

I will rerun it when everything fails, to see if there's a difference

f0x52 (Author) commented Jun 3, 2020

just happened again; the highest extremity count was 4 and executing the query above still took only ~500ms, but again that query was stuck being executed multiple times according to pg_stat_activity

erikjohnston (Member) commented:

Hmm, that looks a bit odd. I would expect the query to be faster than that, but the query plan looks correct. Is this on a hard drive or SSD?

#7567, which might help performance a little bit there, was recently merged, but I wouldn't expect that to make a massive difference here.

f0x52 (Author) commented Jun 3, 2020

postgres is on a LUKS-encrypted RAID1 over two SSDs
will set up prometheus/grafana to get some more system + synapse insight

erikjohnston (Member) commented:

Thanks!

For context: this is the query that fetches state for an event from the DB, which obviously happens really quite a lot throughout Synapse. Now, these queries are a bit heavyweight, but if you're seeing a lot of them stacking up then that is a problem. Some potential causes could be:

a) your postgres instance has poorer performance than most (either due to hardware or the various perf tuning config options not being set correctly),
b) something is causing Synapse to request more state than usual,
c) your server is busy and the state caches are way too small, or
d) these queries are actually backing up behind another resource-intensive query and are only showing up in the logs because they're relatively common.
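
(A sketch of how one might sanity-check the usual tuning knobs and the buffer-cache hit rate from SQL; the setting names are standard PostgreSQL ones, and the ~99% hit-rate target is just a rule of thumb:)

    -- The settings that most often explain slow state queries.
    SELECT name, setting, unit
    FROM pg_settings
    WHERE name IN ('shared_buffers', 'effective_cache_size',
                   'work_mem', 'random_page_cost');

    -- Rough buffer-cache hit ratio for the synapse database; values well
    -- below ~0.99 suggest Postgres is going to disk for hot data.
    SELECT blks_hit::float / nullif(blks_hit + blks_read, 0) AS cache_hit_ratio
    FROM pg_stat_database
    WHERE datname = 'synapse';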

It might also be worth turning on slow query logging on your postgres instance to see if there is anything else going on, and maybe setting up a postgres exporter for prometheus as well? I'm afraid at this point I don't have any suggestions beyond that :/
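
(Slow query logging can be switched on from SQL without editing postgresql.conf; a sketch, with the one-second threshold being just an example:)

    -- Log every statement that takes longer than 1000 ms (requires superuser).
    ALTER SYSTEM SET log_min_duration_statement = '1000';
    SELECT pg_reload_conf();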

joepie91 commented Jul 10, 2020

It seems that this issue can be reliably reproduced by doing an initial sync, for either @f0x52 or me. The moment I attempt one, Synapse falls over for 5+ minutes, also from the perspective of my Riot client which is doing normal incremental syncs.

deepbluev7 (Contributor) commented Jul 17, 2020

I can second this. For a few months now, initial syncs have been unbearably slow. With or without lazy-loading, they always seem to take more than 5 minutes, which causes my reverse proxy to fail the request and clients to retry until the sync completes. Postgres is basically spinning at 100% the whole time, while Synapse doesn't even use that much CPU. I've also heard from others affected by this, so it doesn't seem to be just me.

For comparison, non-lazy-loading initial syncs used to take less than a minute before this regression, so perf has degraded at least 5x. I didn't join that many rooms in that timeframe...

Half-Shot (Collaborator) commented:

This has recently started affecting me too. WITH RECURSIVE state(state_group) AS ( VALUES(7981870::bigint) ... queries often take over 5 minutes to process.

anoadragon453 (Member) commented:

> This has recently started affecting me too. WITH RECURSIVE state(state_group) AS ( VALUES(7981870::bigint) ... queries often take over 5 minutes to process.

Related to the follow-on discussion in #9182.

@anoadragon453 anoadragon453 added T-Other Questions, user support, anything else. X-Needs-Info This issue is blocked awaiting information from the reporter labels Mar 8, 2021
richvdh (Member) commented Jul 22, 2021

@anoadragon453 can you be clear what the info needed is? I'm not seeing it at a quick glance.

I think this issue might be a duplicate of #5064 (which is hopefully now fixed)?

erikjohnston (Member) commented:

I think this is a duplicate of #5064, so hopefully this will be fixed in v1.39.0.

If you don't want to wait, you can safely run the following on your DB manually:

ALTER TABLE state_groups_state ALTER COLUMN state_group SET (n_distinct = -0.02);
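
(Worth noting, per standard Postgres behaviour: a per-column n_distinct override is only picked up by the planner after the table's statistics are next gathered, so following it with an ANALYZE makes it take effect immediately:)

    -- Re-gather statistics so the planner starts using the new n_distinct estimate.
    ANALYZE state_groups_state;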
