Skip to content
This repository has been archived by the owner on Apr 26, 2024. It is now read-only.

state_groups_state uses a large sequential scan #5064

Closed
grinapo opened this issue Apr 15, 2019 · 20 comments · Fixed by #10359
Closed

state_groups_state uses a large sequential scan #5064

grinapo opened this issue Apr 15, 2019 · 20 comments · Fixed by #10359
Assignees
Labels
A-Performance Performance, both client-facing and admin-facing z-p2 (Deprecated Label)

Comments

@grinapo
Copy link

grinapo commented Apr 15, 2019

Description

I have caught a query which runs several times for room history, recursively, and therefore take a significant amount of time (tens of minutes) to scroll up.

WITH RECURSIVE state(state_group) AS ( 
	VALUES(64193::bigint) 
		UNION ALL 
	SELECT prev_state_group FROM state_group_edges e, state s WHERE s.state_group = e.state_group ) 
	
SELECT DISTINCT type, state_key, last_value(event_id) 
	OVER ( 
		PARTITION BY type, state_key ORDER BY state_group ASC ROWS BETWEEN UNBOUNDED PRECEDING AND UNBOUNDED FOLLOWING 
	) AS event_id FROM state_groups_state WHERE state_group IN ( 
		SELECT state_group FROM state ) AND ((type = 'm.room.history_visibility' AND state_key = '') OR 
			(type = 'm.room.member' AND state_key = '@grin:grin.hu'))

And the query can be speeded up by

CREATE INDEX state_groups_state_state_key_type ON state_groups_state (state_key, type);

Without it the planner say:

                                 ->  Parallel Seq Scan on state_groups_state  (cost=0.00..284361.03 rows=82 width=83)                                                                                                                                                                                        
                                       Filter: (((type = 'm.room.history_visibility'::text) AND (state_key = ''::text)) OR ((type = 'm.room.member'::text) AND (state_key = '@grin:grin.hu'::text)))

And with it it's

                           ->  Bitmap Heap Scan on state_groups_state  (cost=15.20..792.20 rows=198 width=83)
                                 Recheck Cond: (((state_key = ''::text) AND (type = 'm.room.history_visibility'::text)) OR ((state_key = '@grin:grin.hu'::text) AND (type = 'm.room.member'::text)))
                                 ->  BitmapOr  (cost=15.20..15.20 rows=198 width=0)
                                       ->  Bitmap Index Scan on state_groups_state_state_key  (cost=0.00..4.60 rows=4 width=0)
                                             Index Cond: ((state_key = ''::text) AND (type = 'm.room.history_visibility'::text))
                                       ->  Bitmap Index Scan on state_groups_state_state_key  (cost=0.00..10.50 rows=194 width=0)
                                             Index Cond: ((state_key = '@grin:grin.hu'::text) AND (type = 'm.room.member'::text))

which is 175 times faster. ;-)

Version information

If not matrix.org: matrix.grin.hu

  • Version: 0.99.2

  • Install method: debian pkg + workers

  • Platform: debian linux (testing)

@neilisfragile
Copy link
Contributor

Thanks! We'll take a look at that.

@neilisfragile neilisfragile added A-Performance Performance, both client-facing and admin-facing z-p2 (Deprecated Label) labels Apr 16, 2019
@neilisfragile neilisfragile reopened this May 23, 2019
@richvdh
Copy link
Member

richvdh commented May 30, 2019

I'm really surprised that this makes a difference. Obviously I don't have identical data, but equivalent queries are approximately instant. The query plan should look like:

synapse=# explain WITH RECURSIVE sgs(state_group) AS ( 
    VALUES(2565160::bigint) 
    UNION ALL 
    SELECT prev_state_group FROM state_group_edges e, sgs WHERE sgs.state_group = e.state_group
) 
SELECT DISTINCT type, state_key,
    last_value(event_id) OVER (PARTITION BY type, state_key ORDER BY state_group ASC ROWS BETWEEN UNBOUNDED PRECEDING AND UNBOUNDED FOLLOWING) AS event_id
FROM state_groups_state
WHERE
    state_group IN (SELECT state_group FROM sgs)
    AND ((type = 'm.room.history_visibility' AND state_key = '') OR (type = 'm.room.member' AND state_key = '@richvdh:sw1v.org'));
                                                                                         QUERY PLAN                                                                                          
---------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------
 HashAggregate  (cost=401158.97..401159.55 rows=58 width=79)
   Group Key: state_groups_state.type, state_groups_state.state_key, last_value(state_groups_state.event_id) OVER (?)
   CTE sgs
     ->  Recursive Union  (cost=0.00..849.28 rows=101 width=8)
           ->  Result  (cost=0.00..0.01 rows=1 width=0)
           ->  Nested Loop  (cost=0.42..84.72 rows=10 width=8)
                 ->  WorkTable Scan on sgs sgs_1  (cost=0.00..0.20 rows=10 width=8)
                 ->  Index Scan using state_group_edges_idx on state_group_edges e  (cost=0.42..8.44 rows=1 width=16)
                       Index Cond: (state_group = sgs_1.state_group)
   ->  WindowAgg  (cost=400307.95..400309.25 rows=58 width=79)
         ->  Sort  (cost=400307.95..400308.09 rows=58 width=79)
               Sort Key: state_groups_state.type, state_groups_state.state_key, state_groups_state.state_group
               ->  Nested Loop  (cost=2.83..400306.25 rows=58 width=79)
                     ->  HashAggregate  (cost=2.27..3.28 rows=101 width=8)
                           Group Key: sgs.state_group
                           ->  CTE Scan on sgs  (cost=0.00..2.02 rows=101 width=8)
                     ->  Index Scan using state_groups_state_type_idx on state_groups_state  (cost=0.56..3963.39 rows=1 width=79)
                           Index Cond: (state_group = sgs.state_group)
                           Filter: (((type = 'm.room.history_visibility'::text) AND (state_key = ''::text)) OR ((type = 'm.room.member'::text) AND (state_key = '@richvdh:sw1v.org'::text)))
(19 rows)

@grinapo: could you confirm a few things for me:

  • do you have a (working) state_groups_state_type_idx on state_groups_state(state_group, type, state_key)
  • what does this show you:
    WITH RECURSIVE sgs(state_group) AS ( 
        VALUES(64193::bigint) 
        UNION ALL 
        SELECT prev_state_group FROM state_group_edges e, sgs WHERE sgs.state_group = e.state_group
    )
    select count(*) FROM sgs;
  • Can you show the full query plan for the original query (without the index)?

@grinapo
Copy link
Author

grinapo commented May 30, 2019

  • yes I have the index

  • result is 66

  • with index the full query

synapse=# explain WITH RECURSIVE state(state_group) AS ( 
synapse(#  VALUES(64193::bigint) 
synapse(#   UNION ALL 
synapse(#  SELECT prev_state_group FROM state_group_edges e, state s WHERE s.state_group = e.state_group ) 
synapse-#  
synapse-# SELECT DISTINCT type, state_key, last_value(event_id) 
synapse-#  OVER ( 
synapse(#   PARTITION BY type, state_key ORDER BY state_group ASC ROWS BETWEEN UNBOUNDED PRECEDING AND UNBOUNDED FOLLOWING 
synapse(#  ) AS event_id FROM state_groups_state WHERE state_group IN ( 
synapse(#   SELECT state_group FROM state ) AND ((type = 'm.room.history_visibility' AND state_key = '') OR 
synapse(#    (type = 'm.room.member' AND state_key = '@grin:grin.hu'));
                                                                                             QUERY PLAN                                                                                              
-----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------
 Unique  (cost=2611.37..2611.66 rows=29 width=81)
   CTE state
     ->  Recursive Union  (cost=0.00..836.28 rows=101 width=8)
           ->  Result  (cost=0.00..0.01 rows=1 width=8)
           ->  Nested Loop  (cost=0.29..83.42 rows=10 width=8)
                 ->  WorkTable Scan on state s  (cost=0.00..0.20 rows=10 width=8)
                 ->  Index Scan using state_group_edges_idx on state_group_edges e  (cost=0.29..8.31 rows=1 width=16)
                       Index Cond: (state_group = s.state_group)
   ->  Sort  (cost=1775.09..1775.16 rows=29 width=81)
         Sort Key: state_groups_state.type, state_groups_state.state_key, (last_value(state_groups_state.event_id) OVER (?))
         ->  WindowAgg  (cost=1773.73..1774.39 rows=29 width=81)
               ->  Sort  (cost=1773.73..1773.81 rows=29 width=83)
                     Sort Key: state_groups_state.type, state_groups_state.state_key, state_groups_state.state_group
                     ->  Hash Semi Join  (cost=33.06..1773.03 rows=29 width=83)
                           Hash Cond: (state_groups_state.state_group = state.state_group)
                           ->  Bitmap Heap Scan on state_groups_state  (cost=29.78..1768.26 rows=443 width=83)
                                 Recheck Cond: (((state_key = ''::text) AND (type = 'm.room.history_visibility'::text)) OR ((state_key = '@grin:grin.hu'::text) AND (type = 'm.room.member'::text)))
                                 ->  BitmapOr  (cost=29.78..29.78 rows=443 width=0)
                                       ->  Bitmap Index Scan on state_groups_state_state_key  (cost=0.00..4.68 rows=12 width=0)
                                             Index Cond: ((state_key = ''::text) AND (type = 'm.room.history_visibility'::text))
                                       ->  Bitmap Index Scan on state_groups_state_state_key  (cost=0.00..24.87 rows=431 width=0)
                                             Index Cond: ((state_key = '@grin:grin.hu'::text) AND (type = 'm.room.member'::text))
                           ->  Hash  (cost=2.02..2.02 rows=101 width=8)
                                 ->  CTE Scan on state  (cost=0.00..2.02 rows=101 width=8)
(24 rows)

Dropping the index now gives:

                                                                                             QUERY PLAN                                                                                              
-----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------
 Unique  (cost=637941.55..637941.84 rows=29 width=81)
   CTE state
     ->  Recursive Union  (cost=0.00..836.28 rows=101 width=8)
           ->  Result  (cost=0.00..0.01 rows=1 width=8)
           ->  Nested Loop  (cost=0.29..83.42 rows=10 width=8)
                 ->  WorkTable Scan on state s  (cost=0.00..0.20 rows=10 width=8)
                 ->  Index Scan using state_group_edges_idx on state_group_edges e  (cost=0.29..8.31 rows=1 width=16)
                       Index Cond: (state_group = s.state_group)
   ->  Sort  (cost=637105.27..637105.34 rows=29 width=81)
         Sort Key: state_groups_state.type, state_groups_state.state_key, (last_value(state_groups_state.event_id) OVER (?))
         ->  WindowAgg  (cost=637103.91..637104.57 rows=29 width=81)
               ->  Sort  (cost=637103.91..637103.99 rows=29 width=83)
                     Sort Key: state_groups_state.type, state_groups_state.state_key, state_groups_state.state_group
                     ->  Hash Semi Join  (cost=1003.28..637103.21 rows=29 width=83)
                           Hash Cond: (state_groups_state.state_group = state.state_group)
                           ->  Gather  (cost=1000.00..637098.44 rows=443 width=83)
                                 Workers Planned: 2
                                 ->  Parallel Seq Scan on state_groups_state  (cost=0.00..636054.14 rows=185 width=83)
                                       Filter: (((type = 'm.room.history_visibility'::text) AND (state_key = ''::text)) OR ((type = 'm.room.member'::text) AND (state_key = '@grin:grin.hu'::text)))
                           ->  Hash  (cost=2.02..2.02 rows=101 width=8)
                                 ->  CTE Scan on state  (cost=0.00..2.02 rows=101 width=8)
(21 rows)

@richvdh
Copy link
Member

richvdh commented May 30, 2019

I am at a loss to explain why postgres has decided to seq scan rather than index scan :/

@grinapo
Copy link
Author

grinapo commented May 30, 2019

Here's an explain analyze verbose with the specific parallel timings (no index):

                                                                                                                                   QUERY PLAN                                                                                                                                    
---------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------
 Unique  (cost=639139.28..639139.56 rows=28 width=81) (actual time=20490.851..20490.853 rows=2 loops=1)
   Output: state_groups_state.type, state_groups_state.state_key, (last_value(state_groups_state.event_id) OVER (?)), state_groups_state.state_group
   CTE state
     ->  Recursive Union  (cost=0.00..836.28 rows=101 width=8) (actual time=0.004..106.748 rows=66 loops=1)
           ->  Result  (cost=0.00..0.01 rows=1 width=8) (actual time=0.002..0.002 rows=1 loops=1)
                 Output: '64193'::bigint
           ->  Nested Loop  (cost=0.29..83.42 rows=10 width=8) (actual time=1.612..1.614 rows=1 loops=66)
                 Output: e.prev_state_group
                 ->  WorkTable Scan on state s  (cost=0.00..0.20 rows=10 width=8) (actual time=0.000..0.001 rows=1 loops=66)
                       Output: s.state_group
                 ->  Index Scan using state_group_edges_idx on public.state_group_edges e  (cost=0.29..8.31 rows=1 width=16) (actual time=1.610..1.611 rows=1 loops=66)
                       Output: e.state_group, e.prev_state_group
                       Index Cond: (e.state_group = s.state_group)
   ->  Sort  (cost=638303.00..638303.07 rows=28 width=81) (actual time=20490.850..20490.850 rows=2 loops=1)
         Output: state_groups_state.type, state_groups_state.state_key, (last_value(state_groups_state.event_id) OVER (?)), state_groups_state.state_group
         Sort Key: state_groups_state.type, state_groups_state.state_key, (last_value(state_groups_state.event_id) OVER (?))
         Sort Method: quicksort  Memory: 25kB
         ->  WindowAgg  (cost=638301.70..638302.33 rows=28 width=81) (actual time=20490.778..20490.781 rows=2 loops=1)
               Output: state_groups_state.type, state_groups_state.state_key, last_value(state_groups_state.event_id) OVER (?), state_groups_state.state_group
               ->  Sort  (cost=638301.70..638301.77 rows=28 width=83) (actual time=20490.760..20490.760 rows=2 loops=1)
                     Output: state_groups_state.type, state_groups_state.state_key, state_groups_state.state_group, state_groups_state.event_id
                     Sort Key: state_groups_state.type, state_groups_state.state_key, state_groups_state.state_group
                     Sort Method: quicksort  Memory: 25kB
                     ->  Hash Semi Join  (cost=1003.28..638301.03 rows=28 width=83) (actual time=16058.945..20490.709 rows=2 loops=1)
                           Output: state_groups_state.type, state_groups_state.state_key, state_groups_state.state_group, state_groups_state.event_id
                           Hash Cond: (state_groups_state.state_group = state.state_group)
                           ->  Gather  (cost=1000.00..638296.25 rows=450 width=83) (actual time=4.616..20385.904 rows=3568 loops=1)
                                 Output: state_groups_state.type, state_groups_state.state_key, state_groups_state.event_id, state_groups_state.state_group
                                 Workers Planned: 2
                                 Workers Launched: 2
                                 ->  Parallel Seq Scan on public.state_groups_state  (cost=0.00..637251.25 rows=188 width=83) (actual time=2.375..20372.781 rows=1189 loops=3)
                                       Output: state_groups_state.type, state_groups_state.state_key, state_groups_state.event_id, state_groups_state.state_group
                                       Filter: (((state_groups_state.type = 'm.room.history_visibility'::text) AND (state_groups_state.state_key = ''::text)) OR ((state_groups_state.type = 'm.room.member'::text) AND (state_groups_state.state_key = '@grin:grin.hu'::text)))
                                       Rows Removed by Filter: 8029366
                                       Worker 0: actual time=1.331..20370.677 rows=1192 loops=1
                                       Worker 1: actual time=2.107..20371.285 rows=1193 loops=1
                           ->  Hash  (cost=2.02..2.02 rows=101 width=8) (actual time=106.931..106.932 rows=66 loops=1)
                                 Output: state.state_group
                                 Buckets: 1024  Batches: 1  Memory Usage: 11kB
                                 ->  CTE Scan on state  (cost=0.00..2.02 rows=101 width=8) (actual time=0.008..106.837 rows=66 loops=1)
                                       Output: state.state_group
 Planning Time: 0.737 ms
 Execution Time: 20495.738 ms
(43 rows)

@grinapo
Copy link
Author

grinapo commented May 30, 2019

Forcing set enable_seqscan = off gives a curious result:

                                                                                          QUERY PLAN                                                                                           
-----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------
 Unique  (cost=2303188.77..2303189.05 rows=28 width=81)
   CTE state
     ->  Recursive Union  (cost=0.00..836.28 rows=101 width=8)
           ->  Result  (cost=0.00..0.01 rows=1 width=8)
           ->  Nested Loop  (cost=0.29..83.42 rows=10 width=8)
                 ->  WorkTable Scan on state s  (cost=0.00..0.20 rows=10 width=8)
                 ->  Index Scan using state_group_edges_idx on state_group_edges e  (cost=0.29..8.31 rows=1 width=16)
                       Index Cond: (state_group = s.state_group)
   ->  Sort  (cost=2302352.49..2302352.56 rows=28 width=81)
         Sort Key: state_groups_state.type, state_groups_state.state_key, (last_value(state_groups_state.event_id) OVER (?))
         ->  WindowAgg  (cost=2302351.19..2302351.82 rows=28 width=81)
               ->  Sort  (cost=2302351.19..2302351.26 rows=28 width=83)
                     Sort Key: state_groups_state.type, state_groups_state.state_key, state_groups_state.state_group
                     ->  Nested Loop  (cost=2.83..2302350.52 rows=28 width=83)
                           ->  HashAggregate  (cost=2.27..3.28 rows=101 width=8)
                                 Group Key: state.state_group
                                 ->  CTE Scan on state  (cost=0.00..2.02 rows=101 width=8)
                           ->  Index Scan using state_groups_state_type_idx on state_groups_state  (cost=0.56..22795.51 rows=1 width=83)
                                 Index Cond: (state_group = state.state_group)
                                 Filter: (((type = 'm.room.history_visibility'::text) AND (state_key = ''::text)) OR ((type = 'm.room.member'::text) AND (state_key = '@grin:grin.hu'::text)))
(20 rows)

@grinapo
Copy link
Author

grinapo commented May 30, 2019

And the specific timings with the index, again:

                                                                                                                                   QUERY PLAN                                                                                                                                    
---------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------
 Unique  (cost=2630.65..2630.93 rows=28 width=81) (actual time=2931.790..2931.795 rows=2 loops=1)
   Output: state_groups_state.type, state_groups_state.state_key, (last_value(state_groups_state.event_id) OVER (?)), state_groups_state.state_group
   CTE state
     ->  Recursive Union  (cost=0.00..836.28 rows=101 width=8) (actual time=0.005..381.330 rows=66 loops=1)
           ->  Result  (cost=0.00..0.01 rows=1 width=8) (actual time=0.002..0.002 rows=1 loops=1)
                 Output: '64193'::bigint
           ->  Nested Loop  (cost=0.29..83.42 rows=10 width=8) (actual time=5.767..5.771 rows=1 loops=66)
                 Output: e.prev_state_group
                 ->  WorkTable Scan on state s  (cost=0.00..0.20 rows=10 width=8) (actual time=0.000..0.001 rows=1 loops=66)
                       Output: s.state_group
                 ->  Index Scan using state_group_edges_idx on public.state_group_edges e  (cost=0.29..8.31 rows=1 width=16) (actual time=5.763..5.765 rows=1 loops=66)
                       Output: e.state_group, e.prev_state_group
                       Index Cond: (e.state_group = s.state_group)
   ->  Sort  (cost=1794.37..1794.44 rows=28 width=81) (actual time=2931.789..2931.789 rows=2 loops=1)
         Output: state_groups_state.type, state_groups_state.state_key, (last_value(state_groups_state.event_id) OVER (?)), state_groups_state.state_group
         Sort Key: state_groups_state.type, state_groups_state.state_key, (last_value(state_groups_state.event_id) OVER (?))
         Sort Method: quicksort  Memory: 25kB
         ->  WindowAgg  (cost=1793.07..1793.70 rows=28 width=81) (actual time=2931.759..2931.767 rows=2 loops=1)
               Output: state_groups_state.type, state_groups_state.state_key, last_value(state_groups_state.event_id) OVER (?), state_groups_state.state_group
               ->  Sort  (cost=1793.07..1793.14 rows=28 width=83) (actual time=2931.727..2931.728 rows=2 loops=1)
                     Output: state_groups_state.type, state_groups_state.state_key, state_groups_state.state_group, state_groups_state.event_id
                     Sort Key: state_groups_state.type, state_groups_state.state_key, state_groups_state.state_group
                     Sort Method: quicksort  Memory: 25kB
                     ->  Hash Semi Join  (cost=29.13..1792.39 rows=28 width=83) (actual time=850.017..2931.702 rows=2 loops=1)
                           Output: state_groups_state.type, state_groups_state.state_key, state_groups_state.state_group, state_groups_state.event_id
                           Hash Cond: (state_groups_state.state_group = state.state_group)
                           ->  Bitmap Heap Scan on public.state_groups_state  (cost=25.85..1787.62 rows=449 width=83) (actual time=9.622..2547.708 rows=3570 loops=1)
                                 Output: state_groups_state.state_group, state_groups_state.room_id, state_groups_state.type, state_groups_state.state_key, state_groups_state.event_id
                                 Recheck Cond: (((state_groups_state.state_key = ''::text) AND (state_groups_state.type = 'm.room.history_visibility'::text)) OR ((state_groups_state.state_key = '@grin:grin.hu'::text) AND (state_groups_state.type = 'm.room.member'::text)))
                                 Heap Blocks: exact=3348
                                 ->  BitmapOr  (cost=25.85..25.85 rows=449 width=0) (actual time=8.023..8.023 rows=0 loops=1)
                                       ->  Bitmap Index Scan on state_groups_state_state_key_type  (cost=0.00..4.69 rows=13 width=0) (actual time=4.040..4.041 rows=2011 loops=1)
                                             Index Cond: ((state_groups_state.state_key = ''::text) AND (state_groups_state.type = 'm.room.history_visibility'::text))
                                       ->  Bitmap Index Scan on state_groups_state_state_key_type  (cost=0.00..20.93 rows=437 width=0) (actual time=3.979..3.979 rows=1559 loops=1)
                                             Index Cond: ((state_groups_state.state_key = '@grin:grin.hu'::text) AND (state_groups_state.type = 'm.room.member'::text))
                           ->  Hash  (cost=2.02..2.02 rows=101 width=8) (actual time=381.643..381.644 rows=66 loops=1)
                                 Output: state.state_group
                                 Buckets: 1024  Batches: 1  Memory Usage: 11kB
                                 ->  CTE Scan on state  (cost=0.00..2.02 rows=101 width=8) (actual time=0.008..381.486 rows=66 loops=1)
                                       Output: state.state_group
 Planning Time: 0.978 ms
 Execution Time: 2931.972 ms
(42 rows)

@grinapo
Copy link
Author

grinapo commented May 30, 2019

Without the index the specific timings:

                                                                                                                                   QUERY PLAN                                                                                                                                     
--------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------- 
 Unique  (cost=639024.96..639025.24 rows=28 width=81) (actual time=64536.841..64536.853 rows=2 loops=1)
   Output: state_groups_state.type, state_groups_state.state_key, (last_value(state_groups_state.event_id) OVER (?)), state_groups_state.state_group                                                                                                                              
   CTE state
     ->  Recursive Union  (cost=0.00..836.28 rows=101 width=8) (actual time=0.002..1.128 rows=66 loops=1)
           ->  Result  (cost=0.00..0.01 rows=1 width=8) (actual time=0.001..0.001 rows=1 loops=1)
                 Output: '64193'::bigint
           ->  Nested Loop  (cost=0.29..83.42 rows=10 width=8) (actual time=0.016..0.016 rows=1 loops=66)
                 Output: e.prev_state_group
                 ->  WorkTable Scan on state s  (cost=0.00..0.20 rows=10 width=8) (actual time=0.000..0.000 rows=1 loops=66)
                       Output: s.state_group
                 ->  Index Scan using state_group_edges_idx on public.state_group_edges e  (cost=0.29..8.31 rows=1 width=16) (actual time=0.015..0.016 rows=1 loops=66)                                                                                                           
                       Output: e.state_group, e.prev_state_group
                       Index Cond: (e.state_group = s.state_group)
   ->  Sort  (cost=638188.68..638188.75 rows=28 width=81) (actual time=64536.840..64536.841 rows=2 loops=1)
         Output: state_groups_state.type, state_groups_state.state_key, (last_value(state_groups_state.event_id) OVER (?)), state_groups_state.state_group                                                                                                                        
         Sort Key: state_groups_state.type, state_groups_state.state_key, (last_value(state_groups_state.event_id) OVER (?))
         Sort Method: quicksort  Memory: 25kB
         ->  WindowAgg  (cost=638187.38..638188.01 rows=28 width=81) (actual time=64536.796..64536.813 rows=2 loops=1)
               Output: state_groups_state.type, state_groups_state.state_key, last_value(state_groups_state.event_id) OVER (?), state_groups_state.state_group                                                                                                                    
               ->  Sort  (cost=638187.38..638187.45 rows=28 width=83) (actual time=64536.725..64536.726 rows=2 loops=1)
                     Output: state_groups_state.type, state_groups_state.state_key, state_groups_state.state_group, state_groups_state.event_id                                                                                                                                   
                     Sort Key: state_groups_state.type, state_groups_state.state_key, state_groups_state.state_group
                     Sort Method: quicksort  Memory: 25kB
                     ->  Hash Semi Join  (cost=1003.28..638186.71 rows=28 width=83) (actual time=15003.754..64536.680 rows=2 loops=1)
                           Output: state_groups_state.type, state_groups_state.state_key, state_groups_state.state_group, state_groups_state.event_id                                                                                                                             
                           Hash Cond: (state_groups_state.state_group = state.state_group)
                           ->  Gather  (cost=1000.00..638181.93 rows=449 width=83) (actual time=54.843..64542.430 rows=3574 loops=1)
                                 Output: state_groups_state.type, state_groups_state.state_key, state_groups_state.event_id, state_groups_state.state_group                                                                                                                       
                                 Workers Planned: 2
                                 Workers Launched: 2
                                 ->  Parallel Seq Scan on public.state_groups_state  (cost=0.00..637137.03 rows=187 width=83) (actual time=77.351..64509.144 rows=1191 loops=3)                                                                                                   
                                       Output: state_groups_state.type, state_groups_state.state_key, state_groups_state.event_id, state_groups_state.state_group                                                                                                                 
                                       Filter: (((state_groups_state.type = 'm.room.history_visibility'::text) AND (state_groups_state.state_key = ''::text)) OR ((state_groups_state.type = 'm.room.member'::text) AND (state_groups_state.state_key = '@grin:grin.hu'::text)))  
                                       Rows Removed by Filter: 8035072
                                       Worker 0: actual time=83.634..64505.140 rows=1203 loops=1
                                       Worker 1: actual time=94.049..64501.215 rows=1211 loops=1
                           ->  Hash  (cost=2.02..2.02 rows=101 width=8) (actual time=1.170..1.171 rows=66 loops=1)
                                 Output: state.state_group
                                 Buckets: 1024  Batches: 1  Memory Usage: 11kB
                                 ->  CTE Scan on state  (cost=0.00..2.02 rows=101 width=8) (actual time=0.003..1.154 rows=66 loops=1)
                                       Output: state.state_group
 Planning Time: 0.388 ms
 Execution Time: 64548.342 ms
(43 rows)

@grinapo
Copy link
Author

grinapo commented May 31, 2019

Not exactly the 60second version, but longer than 20ms:

                                                                                                                                 QUERY PLAN                                                                                                                                 
----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------
 Unique  (cost=2317039.29..2317039.57 rows=28 width=81) (actual time=814.957..814.958 rows=1 loops=1)
   Output: state_groups_state.type, state_groups_state.state_key, (last_value(state_groups_state.event_id) OVER (?)), state_groups_state.state_group
   CTE state
     ->  Recursive Union  (cost=0.00..836.28 rows=101 width=8) (actual time=0.006..301.601 rows=66 loops=1)
           ->  Result  (cost=0.00..0.01 rows=1 width=8) (actual time=0.002..0.002 rows=1 loops=1)
                 Output: '64193'::bigint
           ->  Nested Loop  (cost=0.29..83.42 rows=10 width=8) (actual time=4.561..4.564 rows=1 loops=66)
                 Output: e.prev_state_group
                 ->  WorkTable Scan on state s  (cost=0.00..0.20 rows=10 width=8) (actual time=0.000..0.001 rows=1 loops=66)
                       Output: s.state_group
                 ->  Index Scan using state_group_edges_idx on public.state_group_edges e  (cost=0.29..8.31 rows=1 width=16) (actual time=4.558..4.559 rows=1 loops=66)
                       Output: e.state_group, e.prev_state_group
                       Index Cond: (e.state_group = s.state_group)
   ->  Sort  (cost=2316203.01..2316203.08 rows=28 width=81) (actual time=814.955..814.956 rows=1 loops=1)
         Output: state_groups_state.type, state_groups_state.state_key, (last_value(state_groups_state.event_id) OVER (?)), state_groups_state.state_group
         Sort Key: state_groups_state.type, state_groups_state.state_key, (last_value(state_groups_state.event_id) OVER (?))
         Sort Method: quicksort  Memory: 25kB
         ->  WindowAgg  (cost=2316201.70..2316202.33 rows=28 width=81) (actual time=814.931..814.934 rows=1 loops=1)
               Output: state_groups_state.type, state_groups_state.state_key, last_value(state_groups_state.event_id) OVER (?), state_groups_state.state_group
               ->  Sort  (cost=2316201.70..2316201.77 rows=28 width=83) (actual time=814.903..814.904 rows=1 loops=1)
                     Output: state_groups_state.type, state_groups_state.state_key, state_groups_state.state_group, state_groups_state.event_id
                     Sort Key: state_groups_state.type, state_groups_state.state_key, state_groups_state.state_group
                     Sort Method: quicksort  Memory: 25kB
                     ->  Nested Loop  (cost=2.83..2316201.03 rows=28 width=83) (actual time=689.166..814.881 rows=1 loops=1)
                           Output: state_groups_state.type, state_groups_state.state_key, state_groups_state.state_group, state_groups_state.event_id
                           ->  HashAggregate  (cost=2.27..3.28 rows=101 width=8) (actual time=301.888..302.088 rows=66 loops=1)
                                 Output: state.state_group
                                 Group Key: state.state_group
                                 ->  CTE Scan on state  (cost=0.00..2.02 rows=101 width=8) (actual time=0.009..301.720 rows=66 loops=1)
                                       Output: state.state_group
                           ->  Index Scan using state_groups_state_type_idx on public.state_groups_state  (cost=0.56..22932.64 rows=1 width=83) (actual time=7.446..7.766 rows=0 loops=66)
                                 Output: state_groups_state.state_group, state_groups_state.room_id, state_groups_state.type, state_groups_state.state_key, state_groups_state.event_id
                                 Index Cond: (state_groups_state.state_group = state.state_group)
                                 Filter: (((state_groups_state.type = 'm.room.history_visibility'::text) AND (state_groups_state.state_key = ''::text)) OR ((state_groups_state.type = 'm.room.member'::text) AND (state_groups_state.state_key = '@test1:grin.hu'::text)))
                                 Rows Removed by Filter: 35
 Planning Time: 0.849 ms
 Execution Time: 815.128 ms
(37 rows)

@grinapo
Copy link
Author

grinapo commented May 31, 2019

ALTER TABLE state_groups_state ALTER state_key SET STATISTICS 5000;
(and I guess it isn't necessary, but also ALTER TABLE state_groups_state ALTER type SET STATISTICS 5000;)
ANALYZE state_groups_state
And the result is awesome:

                                                                                                                                 QUERY PLAN                                                                                                                                       
----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------      
 Unique  (cost=301413.25..301413.27 rows=2 width=81) (actual time=3.841..3.842 rows=1 loops=1)
   Output: state_groups_state.type, state_groups_state.state_key, (last_value(state_groups_state.event_id) OVER (?)), state_groups_state.state_group                                                                                                                              
   CTE state
     ->  Recursive Union  (cost=0.00..836.28 rows=101 width=8) (actual time=0.004..0.764 rows=66 loops=1)
           ->  Result  (cost=0.00..0.01 rows=1 width=8) (actual time=0.002..0.002 rows=1 loops=1)
                 Output: '64193'::bigint
           ->  Nested Loop  (cost=0.29..83.42 rows=10 width=8) (actual time=0.008..0.010 rows=1 loops=66)
                 Output: e.prev_state_group
                 ->  WorkTable Scan on state s  (cost=0.00..0.20 rows=10 width=8) (actual time=0.000..0.000 rows=1 loops=66)
                       Output: s.state_group
                 ->  Index Scan using state_group_edges_idx on public.state_group_edges e  (cost=0.29..8.31 rows=1 width=16) (actual time=0.007..0.007 rows=1 loops=66)                                                                                                           
                       Output: e.state_group, e.prev_state_group
                       Index Cond: (e.state_group = s.state_group)
   ->  Sort  (cost=300576.97..300576.97 rows=2 width=81) (actual time=3.840..3.840 rows=1 loops=1)
         Output: state_groups_state.type, state_groups_state.state_key, (last_value(state_groups_state.event_id) OVER (?)), state_groups_state.state_group                                                                                                                        
         Sort Key: state_groups_state.type, state_groups_state.state_key, (last_value(state_groups_state.event_id) OVER (?))
         Sort Method: quicksort  Memory: 25kB
         ->  WindowAgg  (cost=300576.91..300576.96 rows=2 width=81) (actual time=3.821..3.823 rows=1 loops=1)
               Output: state_groups_state.type, state_groups_state.state_key, last_value(state_groups_state.event_id) OVER (?), state_groups_state.state_group                                                                                                                    
               ->  Sort  (cost=300576.91..300576.92 rows=2 width=83) (actual time=3.807..3.807 rows=1 loops=1)
                     Output: state_groups_state.type, state_groups_state.state_key, state_groups_state.state_group, state_groups_state.event_id                                                                                                                                   
                     Sort Key: state_groups_state.type, state_groups_state.state_key, state_groups_state.state_group
                     Sort Method: quicksort  Memory: 25kB
                     ->  Nested Loop  (cost=2.83..300576.90 rows=2 width=83) (actual time=1.494..3.795 rows=1 loops=1)
                           Output: state_groups_state.type, state_groups_state.state_key, state_groups_state.state_group, state_groups_state.event_id                                                                                                                             
                           ->  HashAggregate  (cost=2.27..3.28 rows=101 width=8) (actual time=0.874..0.914 rows=66 loops=1)
                                 Output: state.state_group
                                 Group Key: state.state_group
                                 ->  CTE Scan on state  (cost=0.00..2.02 rows=101 width=8) (actual time=0.007..0.821 rows=66 loops=1)
                                       Output: state.state_group
                           ->  Index Scan using state_groups_state_type_idx on public.state_groups_state  (cost=0.56..2975.97 rows=1 width=83) (actual time=0.012..0.043 rows=0 loops=66)                                                                                         
                                 Output: state_groups_state.state_group, state_groups_state.room_id, state_groups_state.type, state_groups_state.state_key, state_groups_state.event_id                                                                                           
                                 Index Cond: (state_groups_state.state_group = state.state_group)
                                 Filter: (((state_groups_state.type = 'm.room.history_visibility'::text) AND (state_groups_state.state_key = ''::text)) OR ((state_groups_state.type = 'm.room.member'::text) AND (state_groups_state.state_key = '@test1:grin.hu'::text)))       
                                 Rows Removed by Filter: 35
 Planning Time: 3.214 ms
 Execution Time: 4.002 ms
(37 rows)

It seems the stats sample has to be raised. Maybe this applies to other tables.... 🙄

@Ralith
Copy link
Contributor

Ralith commented Jun 15, 2019

I just caught a nearly identical query running for a long period, simultaneous with synapse 1.0 memory use exploding by a factor of 4 or so:

state(state_group) AS ( VALUES(3916371::bigint) UNION ALL SELECT prev_state_group FROM state_group_edges e, state s WHERE s.state_group = e.state_group ) SELECT DISTINCT type, state_key, last_value(event_id) OVER ( PARTITION BY type, state_key ORDER BY state_group ASC ROWS BETWEEN UNBOUNDED PRECEDING AND UNBOUNDED FOLLOWING ) AS event_id FROM state_groups_state WHERE state_group IN ( SELECT state_group FROM state ) AND ((type = 'm.room.history_visibility' AND state_key = '') OR (type = 'm.room.member'));

However, the query plan does not seem to involve any sequential scans:

 Unique  (cost=3815577.00..3833301.85 rows=1772485 width=84)
   CTE state
     ->  Recursive Union  (cost=0.00..849.78 rows=101 width=8)
           ->  Result  (cost=0.00..0.01 rows=1 width=0)
           ->  Nested Loop  (cost=0.43..84.78 rows=10 width=8)
                 ->  WorkTable Scan on state s  (cost=0.00..0.20 rows=10 width=8)
                 ->  Index Scan using state_group_edges_idx on state_group_edges e  (cost=0.43..8.45 rows=1 width=16)
                       Index Cond: (state_group = s.state_group)
   ->  Sort  (cost=3814727.22..3819158.43 rows=1772485 width=84)
         Sort Key: state_groups_state.type, state_groups_state.state_key, (last_value(state_groups_state.event_id) OVER (?))
         ->  WindowAgg  (cost=3421247.92..3461128.83 rows=1772485 width=84)
               ->  Sort  (cost=3421247.92..3425679.13 rows=1772485 width=84)
                     Sort Key: state_groups_state.type, state_groups_state.state_key, state_groups_state.state_group
                     ->  Nested Loop  (cost=2.97..3067649.53 rows=1772485 width=84)
                           ->  HashAggregate  (cost=2.27..3.28 rows=101 width=8)
                                 Group Key: state.state_group
                                 ->  CTE Scan on state  (cost=0.00..2.02 rows=101 width=8)
                           ->  Index Scan using state_groups_state_type_idx on state_groups_state  (cost=0.70..30197.25 rows=17549 width=84)
                                 Index Cond: (state_group = state.state_group)
                                 Filter: (((type = 'm.room.history_visibility'::text) AND (state_key = ''::text)) OR (type = 'm.room.member'::text))

@erikjohnston
Copy link
Member

Yup, I think the postgres statistics are wrong for some people using the default postgres sampling sizes. This is more of a DB ops problem than an app problem, so I don't really know what Synapse can do about it. The options as far as I can see are:

  1. Document in the postgres docs that you may need to up the statistic counts, or
  2. Automatically set them for state_groups_state on the assumption its probably fine to do so

@symphorien
Copy link

I was affected by very similar symptoms preventing me from logging in at all. I created the index proposed in the opening post and I can now login.

Before:

matrix-synapse=> explain analyse WITH RECURSIVE state(state_group) AS ( VALUES(23751::bigint) UNION ALL SELECT prev_state_group FROM state_group_edges e, state s WHERE s.state_group = e.state_group
 ) SELECT DISTINCT ON (type, state_key) type, state_key, event_id FROM state_groups_state WHERE state_group IN ( SELECT state_group FROM state )  AND ((type = 'm.room.history_visibility' AND state_
key = '') OR (type = 'm.room.member' AND state_key = '@symphorien:xlumurb.eu')) ORDER BY type, state_key, state_group DESC;
                                                                                            QUERY PLAN
--------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------
 Unique  (cost=384056.64..384057.11 rows=62 width=85) (actual time=147714.835..147714.841 rows=2 loops=1)
   CTE state
     ->  Recursive Union  (cost=0.00..835.53 rows=101 width=8) (actual time=0.006..30.011 rows=29 loops=1)
           ->  Result  (cost=0.00..0.01 rows=1 width=8) (actual time=0.003..0.003 rows=1 loops=1)
           ->  Nested Loop  (cost=0.29..83.35 rows=10 width=8) (actual time=1.028..1.030 rows=1 loops=29)
                 ->  WorkTable Scan on state s  (cost=0.00..0.20 rows=10 width=8) (actual time=0.000..0.001 rows=1 loops=29)
                 ->  Index Scan using state_group_edges_idx on state_group_edges e  (cost=0.29..8.30 rows=1 width=16) (actual time=1.024..1.025 rows=1 loops=29)
                       Index Cond: (state_group = s.state_group)
   ->  Sort  (cost=383221.11..383221.27 rows=62 width=85) (actual time=147714.833..147714.833 rows=2 loops=1)
         Sort Key: state_groups_state.type, state_groups_state.state_key, state_groups_state.state_group DESC
         Sort Method: quicksort  Memory: 25kB
         ->  Hash Semi Join  (cost=1003.28..383219.27 rows=62 width=85) (actual time=108396.455..147714.784 rows=2 loops=1)
               Hash Cond: (state_groups_state.state_group = state.state_group)
               ->  Gather  (cost=1000.00..383213.81 rows=566 width=85) (actual time=20.619..147685.158 rows=1852 loops=1)
                     Workers Planned: 2
                     Workers Launched: 2
                     ->  Parallel Seq Scan on state_groups_state  (cost=0.00..382157.21 rows=236 width=85) (actual time=116.617..147511.018 rows=617 loops=3)
                           Filter: (((type = 'm.room.history_visibility'::text) AND (state_key = ''::text)) OR ((type = 'm.room.member'::text) AND (state_key = '@symphorien:xlumurb.eu'::text)))
                           Rows Removed by Filter: 4541228
               ->  Hash  (cost=2.02..2.02 rows=101 width=8) (actual time=30.104..30.105 rows=29 loops=1)
                     Buckets: 1024  Batches: 1  Memory Usage: 10kB
                     ->  CTE Scan on state  (cost=0.00..2.02 rows=101 width=8) (actual time=0.010..30.053 rows=29 loops=1)
 Planning Time: 1.114 ms
 Execution Time: 147721.242 ms
(24 rows)

with the index:

matrix-synapse=> explain analyse WITH RECURSIVE state(state_group) AS ( VALUES(23751::bigint) UNION ALL SELECT prev_state_group FROM state_group_edges e, state s WHERE s.state_group = e.state_group ) SELECT DISTINCT ON (type, state_key) type, state_key, event_id FROM state_groups_state WHERE state_group IN ( SELECT state_group FROM state )  AND ((type = 'm.room.history_visibility' AND state_key = '') OR (type = 'm.room.member' AND state_key = '@symphorien:xlumurb.eu')) ORDER BY type, state_key, state_group DESC;
                                                                                            QUERY PLAN
--------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------
 Unique  (cost=3067.43..3067.90 rows=62 width=85) (actual time=11.506..11.515 rows=2 loops=1)
   CTE state
     ->  Recursive Union  (cost=0.00..835.53 rows=101 width=8) (actual time=0.005..0.764 rows=29 loops=1)
           ->  Result  (cost=0.00..0.01 rows=1 width=8) (actual time=0.002..0.002 rows=1 loops=1)
           ->  Nested Loop  (cost=0.29..83.35 rows=10 width=8) (actual time=0.018..0.021 rows=1 loops=29)
                 ->  WorkTable Scan on state s  (cost=0.00..0.20 rows=10 width=8) (actual time=0.001..0.001 rows=1 loops=29)
                 ->  Index Scan using state_group_edges_idx on state_group_edges e  (cost=0.29..8.30 rows=1 width=16) (actual time=0.013..0.015 rows=1 loops=29)
                       Index Cond: (state_group = s.state_group)
   ->  Sort  (cost=2231.90..2232.06 rows=62 width=85) (actual time=11.503..11.504 rows=2 loops=1)
         Sort Key: state_groups_state.type, state_groups_state.state_key, state_groups_state.state_group DESC
         Sort Method: quicksort  Memory: 25kB
         ->  Hash Semi Join  (cost=34.34..2230.06 rows=62 width=85) (actual time=6.249..11.475 rows=2 loops=1)
               Hash Cond: (state_groups_state.state_group = state.state_group)
               ->  Bitmap Heap Scan on state_groups_state  (cost=31.05..2224.60 rows=565 width=85) (actual time=2.280..9.421 rows=1852 loops=1)
                     Recheck Cond: (((state_key = ''::text) AND (type = 'm.room.history_visibility'::text)) OR ((state_key = '@symphorien:xlumurb.eu'::text) AND (type = 'm.room.member'::text)))
                     Heap Blocks: exact=1680
                     ->  BitmapOr  (cost=31.05..31.05 rows=565 width=0) (actual time=1.650..1.651 rows=0 loops=1)
                           ->  Bitmap Index Scan on state_groups_state_state_key_type  (cost=0.00..4.57 rows=1 width=0) (actual time=0.973..0.974 rows=1020 loops=1)
                                 Index Cond: ((state_key = ''::text) AND (type = 'm.room.history_visibility'::text))
                           ->  Bitmap Index Scan on state_groups_state_state_key_type  (cost=0.00..26.20 rows=564 width=0) (actual time=0.675..0.675 rows=832 loops=1)
                                 Index Cond: ((state_key = '@symphorien:xlumurb.eu'::text) AND (type = 'm.room.member'::text))
               ->  Hash  (cost=2.02..2.02 rows=101 width=8) (actual time=0.869..0.869 rows=29 loops=1)
                     Buckets: 1024  Batches: 1  Memory Usage: 10kB
                     ->  CTE Scan on state  (cost=0.00..2.02 rows=101 width=8) (actual time=0.009..0.813 rows=29 loops=1)
 Planning Time: 1.200 ms
 Execution Time: 11.696 ms
(26 rows)

@symphorien
Copy link

I still have a problem with

 WITH RECURSIVE state(state_group) AS ( VALUES(17434::bigint) UNION ALL SELECT prev_state_group FROM state_group_edges e, state s WHERE s.state_group = e.state_group ) SELEC
T DISTINCT ON (type, state_key) type, state_key, event_id FROM state_groups_state WHERE state_group IN ( SELECT state_group FROM state )  AND ((type = 'm.room.member')) ORDER BY type, state_key, st
ate_group DESC;
matrix-synapse=> explain analyse WITH RECURSIVE state(state_group) AS ( VALUES(17434::bigint) UNION ALL SELECT prev_state_group FROM state_group_edges e, state s WHERE s.state_group = e.state_group
 ) SELECT DISTINCT ON (type, state_key) type, state_key, event_id FROM state_groups_state WHERE state_group IN ( SELECT state_group FROM state )  AND ((type = 'm.room.member')) ORDER BY type, state
_key, state_group DESC;                                                                                                                                                                              
                                                                            QUERY PLAN                                                                                                               
------------------------------------------------------------------------------------------------------------------------------------------------------------------                                   
 Unique  (cost=792687.00..800218.60 rows=265089 width=85) (actual time=143313.918..143313.953 rows=40 loops=1)                                                                                       
   CTE state                                                                                                                                                                                         
     ->  Recursive Union  (cost=0.00..835.53 rows=101 width=8) (actual time=0.005..89.438 rows=2 loops=1)                                                                                            
           ->  Result  (cost=0.00..0.01 rows=1 width=8) (actual time=0.002..0.002 rows=1 loops=1)                                                                                                    
           ->  Nested Loop  (cost=0.29..83.35 rows=10 width=8) (actual time=44.706..44.709 rows=0 loops=2)                                                                                           
                 ->  WorkTable Scan on state s  (cost=0.00..0.20 rows=10 width=8) (actual time=0.001..0.003 rows=1 loops=2)                                                                                           ->  Index Scan using state_group_edges_idx on state_group_edges e  (cost=0.29..8.30 rows=1 width=16) (actual time=44.695..44.697 rows=0 loops=2)                                                           Index Cond: (state_group = s.state_group)                                                                                                                                     
   ->  Sort  (cost=791851.47..795617.27 rows=1506320 width=85) (actual time=143313.914..143313.922 rows=40 loops=1)                                                                                  
         Sort Key: state_groups_state.state_key, state_groups_state.state_group DESC                                                                                                                 
         Sort Method: quicksort  Memory: 30kB                                                                                                                                                        
         ->  Hash Semi Join  (cost=3.28..493118.48 rows=1506320 width=85) (actual time=551.260..143313.781 rows=40 loops=1)                                                                          
               Hash Cond: (state_groups_state.state_group = state.state_group)                                                                                                                       
               ->  Seq Scan on state_groups_state  (cost=0.00..440339.94 rows=13720933 width=85) (actual time=70.279..138819.420 rows=13722219 loops=1)                                              
                     Filter: (type = 'm.room.member'::text)                                                                                                                                          
                     Rows Removed by Filter: 23456                                                                                                                                                   
               ->  Hash  (cost=2.02..2.02 rows=101 width=8) (actual time=89.464..89.464 rows=2 loops=1)                                                                                              
                     Buckets: 1024  Batches: 1  Memory Usage: 9kB                                                                                                                                    
                     ->  CTE Scan on state  (cost=0.00..2.02 rows=101 width=8) (actual time=0.009..89.447 rows=2 loops=1)                                                                            
 Planning Time: 0.855 ms                                                                                                                                                                             
 Execution Time: 143314.074 ms

Postgresql overestimates the number of rows in the CTE (here, only two).
If I hint that there are at most 3 rows by adding LIMIT 3, then the query is instant:

matrix-synapse=> explain analyse WITH RECURSIVE state(state_group) AS ( VALUES(17434::bigint) UNION ALL SELECT prev_state_group FROM state_group_edges e, state s WHERE s.state_group = e.state_group ) SELECT DISTINCT ON (type, state_key) type, state_key, event_id FROM state_groups_state WHERE state_group IN ( SELECT state_group FROM state LIMIT 3)  AND ((type = 'm.room.member')) ORDER BY type, state_key, state_group DESC;
                                                                           QUERY PLAN                                                                           
----------------------------------------------------------------------------------------------------------------------------------------------------------------
 Unique  (cost=59437.90..59518.05 rows=16030 width=85) (actual time=225.977..226.012 rows=40 loops=1)
   CTE state
     ->  Recursive Union  (cost=0.00..835.53 rows=101 width=8) (actual time=0.004..0.052 rows=2 loops=1)
           ->  Result  (cost=0.00..0.01 rows=1 width=8) (actual time=0.001..0.002 rows=1 loops=1)
           ->  Nested Loop  (cost=0.29..83.35 rows=10 width=8) (actual time=0.020..0.021 rows=0 loops=2)
                 ->  WorkTable Scan on state s  (cost=0.00..0.20 rows=10 width=8) (actual time=0.001..0.001 rows=1 loops=2)
                 ->  Index Scan using state_group_edges_idx on state_group_edges e  (cost=0.29..8.30 rows=1 width=16) (actual time=0.015..0.016 rows=0 loops=2)
                       Index Cond: (state_group = s.state_group)
   ->  Sort  (cost=58602.37..58642.45 rows=16030 width=85) (actual time=225.975..225.984 rows=40 loops=1)
         Sort Key: state_groups_state.state_key, state_groups_state.state_group DESC
         Sort Method: quicksort  Memory: 30kB
         ->  Nested Loop  (cost=435.42..57482.80 rows=16030 width=85) (actual time=170.865..225.905 rows=40 loops=1)
               ->  HashAggregate  (cost=0.10..0.13 rows=3 width=8) (actual time=0.068..0.074 rows=2 loops=1)
                     Group Key: state.state_group
                     ->  Limit  (cost=0.00..0.06 rows=3 width=8) (actual time=0.009..0.059 rows=2 loops=1)
                           ->  CTE Scan on state  (cost=0.00..2.02 rows=101 width=8) (actual time=0.007..0.057 rows=2 loops=1)
               ->  Bitmap Heap Scan on state_groups_state  (cost=435.33..19107.46 rows=5343 width=85) (actual time=112.880..112.890 rows=20 loops=2)
                     Recheck Cond: ((state_group = state.state_group) AND (type = 'm.room.member'::text))
                     Heap Blocks: exact=3
                     ->  Bitmap Index Scan on state_groups_state_type_idx  (cost=0.00..433.99 rows=5343 width=0) (actual time=112.861..112.862 rows=20 loops=2)
                           Index Cond: ((state_group = state.state_group) AND (type = 'm.room.member'::text))
 Planning Time: 1.776 ms
 Execution Time: 226.158 ms
(23 rows)

@symphorien
Copy link

I had more luck with this patch:

diff --git a/synapse/storage/databases/state/bg_updates.py b/synapse/storage/databases/state/bg_updates.py
index 139085b67..2b426b660 100644
--- a/synapse/storage/databases/state/bg_updates.py
+++ b/synapse/storage/databases/state/bg_updates.py
@@ -88,6 +88,7 @@ class StateGroupBackgroundUpdateStore(SQLBaseStore):
             # Temporarily disable sequential scans in this transaction. This is
             # a temporary hack until we can add the right indices in
             txn.execute("SET LOCAL enable_seqscan=off")
+            txn.execute("SET LOCAL enable_bitmapscan=off")
 
             # The below query walks the state_group tree so that the "state"
             # table includes all state_groups in the tree. It then joins

@erikjohnston
Copy link
Member

I think what is happening here is that if the server is in very large rooms the statistics collector ends up thinking that there are few distinct state groups, and so when we come to query e.g. 50 of them the query planner ends up thinking that "50 state groups is a majority of the table, and so we may as well scan the entire table".

If people who run into this can report what n_distinct is using the following query, that would be useful:

SELECT attname, inherited, n_distinct, correlation
FROM pg_stats 
WHERE tablename = 'state_groups_state';

It should either be very large, or negative.

Two solutions here are:

  1. Add an index on state_group, which allows postgres to have a much more accurate estimate of the number of distinct state groups there are (as a BTree index will track the number of distinct state groups there are).
  2. Manually set n_distinct to something like -1000, where negative numbers mean "number of distinct state groups is proportional to number of rows in the table", using ALTER TABLE state_groups_state ALTER COLUMN state_group SET (n_dinstict = -1000) (though I haven't tested this).

@erikjohnston
Copy link
Member

Add an index on state_group, which allows postgres to have a much more accurate estimate of the number of distinct state groups there are (as a BTree index will track the number of distinct state groups there are).

Actually, I'm not sure that is true. It may be true for unique indices, and creating the index likely updates the n_distinct value to a sane value.

@symphorien
Copy link

First of all, it looks like I'm not affected by slowness anymore, even dropping the patch mentioned above.

About your query, here are my numbers:

matrix-synapse=# SELECT attname, inherited, n_distinct, correlation
matrix-synapse-# FROM pg_stats        
matrix-synapse-# WHERE tablename = 'state_groups_state';
   attname   | inherited | n_distinct | correlation  
-------------+-----------+------------+--------------                    
state_group | f         |      10013 |   0.10115419
room_id     | f         |         93 |   -0.1697783                                                                                                                                   
type        | f         |         21 |   0.99679154
state_key   | f         |      71505 | -0.056733944      
event_id    | f         |     123828 |  -0.27821112          
                                                                                                                                
(5 rows)      

@Half-Shot
Copy link
Collaborator

#10359 made a night/day difference to my disk IO and CPU usage. Seems like that fixed the problem for me.

@richvdh
Copy link
Member

richvdh commented Oct 19, 2021

A note for others finding this problem: somebody else had ended up (via a mistake with pg_restore) with duplicate data in state_group_edges - which meant that the recursive query on state_group_edges was returning the square of the number of rows it should have done.

(likewise, a loop in state_group_edges would be a massive problem)

Sign up for free to subscribe to this conversation on GitHub. Already have an account? Sign in.
Labels
A-Performance Performance, both client-facing and admin-facing z-p2 (Deprecated Label)
Projects
None yet
Development

Successfully merging a pull request may close this issue.

7 participants