This repository has been archived by the owner on Apr 26, 2024. It is now read-only.

synapse 1.47 big jump in load due to remove_hidden_devices_from_device_inbox #11401

Closed
2 tasks done
skepticalwaves opened this issue Nov 19, 2021 · 13 comments
Labels
S-Major Major functionality / product severely impaired, no satisfactory workaround. T-Defect Bugs, crashes, hangs, security vulnerabilities, or other reported issues. X-Release-Blocker Must be resolved before making a release

Comments

@skepticalwaves

skepticalwaves commented Nov 19, 2021

Edit: Tracking expected mitigations in:

See discussion below for context


Description

After upgrading to synapse-1.47, via https://github.com/spantaleev/matrix-docker-ansible-deploy, my server experienced a huge increase in both CPU and IO load.

Upon examining the Synapse Prometheus/Grafana stats, I found the following entries:

A large increase in master_0_background_updates in DB transactions:
(screenshot attached)

And in particular in master-0_remove_hidden_device_from_inbox:
(screenshot attached)

It may be related that the HTTP pusher distribution also changed oddly:
(screenshot attached)

Steps to reproduce

  • upgraded from 1.46 to 1.47

Version information

@dklimpel
Contributor

There is a background job to clean hidden devices from device_inbox.
Depending on the size of the table, this may take a while.
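
For illustration only, the general shape of such a batched cleanup looks roughly like this (table and column names are taken from the query plan quoted later in this thread; the real background update is implemented in Python inside Synapse and tracks its progress via stream_id, so treat this purely as a sketch):

```sql
-- Hypothetical sketch of one batch of the cleanup, not Synapse's actual code.
-- Each batch removes up to LIMIT to-device messages belonging to hidden
-- devices, resuming from the stream position reached by the previous batch.
DELETE FROM device_inbox
WHERE (stream_id, user_id, device_id) IN (
    SELECT stream_id, user_id, device_id
    FROM device_inbox
    WHERE stream_id >= 10000          -- progress marker from the previous batch
      AND (device_id, user_id) IN (
          SELECT device_id, user_id FROM devices WHERE hidden = true
      )
    ORDER BY stream_id
    LIMIT 100                         -- batch size
);
```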

@skepticalwaves
Author

Would that be this entry?

> Fix a long-standing bug where messages in the device_inbox table for deleted devices would persist indefinitely. Contributed by @dklimpel and @JohannesKleine. (#10969, #11212)

@dklimpel
Contributor

That job, for deleted devices, runs afterwards.

The job for hidden devices is:

> Delete to_device messages for hidden devices that will never be read, reducing database size. (#11199)

@skepticalwaves
Author

The last time a change expected to produce unexpected load was published, it was documented:
https://matrix-org.github.io/synapse/develop/upgrade#re-indexing-of-events-table-on-postgres-databases

@skepticalwaves
Author

Two days later, I still have elevated load from this specific process.

@richvdh richvdh changed the title from synapse 1.47 big jump in load due to remove_hidden_device_from_inbox to synapse 1.47 big jump in load due to remove_hidden_devices_from_device_inbox on Nov 22, 2021
@richvdh
Member

richvdh commented Nov 22, 2021

we've observed a similar problem on one of the hosts on EMS.

> expected to produce unexpected load

unfortunately this unexpected load is unexpected.

@richvdh
Member

richvdh commented Nov 22, 2021

I had some success by reducing MINIMUM_BACKGROUND_BATCH_SIZE to 1. It's possible to do that either by editing the source and restarting, or via the manhole with hs.get_datastores().databases[0].updates.MINIMUM_BACKGROUND_BATCH_SIZE = 1 (which won't persist over a restart).

@matrix-org/synapse-core: any reason not to reduce MINIMUM_BACKGROUND_BATCH_SIZE ?

@richvdh richvdh added S-Major Major functionality / product severely impaired, no satisfactory workaround. T-Defect Bugs, crashes, hangs, security vulnerabilities, or other reported issues. labels Nov 22, 2021
@erikjohnston
Member

> I had some success by reducing MINIMUM_BACKGROUND_BATCH_SIZE to 1. It's possible to do that either by editing the source and restarting, or via the manhole with hs.get_datastores().databases[0].updates.MINIMUM_BACKGROUND_BATCH_SIZE = 1 (which won't persist over a restart).
>
> @matrix-org/synapse-core: any reason not to reduce MINIMUM_BACKGROUND_BATCH_SIZE ?

The reason for having a limit is to make sure a decent amount of progress is made at each step, with a conscious trade-off that it's better to consume more resources and have the background update finish in a sensible time frame than to effectively never have it finish at all. I don't have a particular objection to removing the minimum, but I do worry it's not going to help that much if the underlying queries are sloooooooooow.

In this particular case it looks like the query is going slow:

```
matrix=> explain SELECT device_id, user_id, stream_id
                FROM device_inbox
                WHERE
                    stream_id >= 10000
                    AND (device_id, user_id) IN (
                        SELECT device_id, user_id FROM devices WHERE hidden = true
                    )
                ORDER BY stream_id
                LIMIT 100;
                                                                   QUERY PLAN                                                                    
-------------------------------------------------------------------------------------------------------------------------------------------------
 Limit  (cost=1001.29..240940.97 rows=100 width=47)
   ->  Gather Merge  (cost=1001.29..7016749017.28 rows=2924380 width=47)
         Workers Planned: 2
         ->  Nested Loop  (cost=1.27..7016410471.24 rows=1218492 width=47)
               ->  Parallel Index Scan using device_inbox_stream_id_user_id on device_inbox  (cost=0.70..2394883832.61 rows=1236491797 width=47)
                     Index Cond: (stream_id >= 10000)
               ->  Index Scan using device_uniqueness on devices  (cost=0.56..3.74 rows=1 width=38)
                     Index Cond: ((user_id = device_inbox.user_id) AND (device_id = device_inbox.device_id))
                     Filter: hidden
(9 rows)
```

I think this is due to the DB having very few hidden devices, so it's walking many rows of the device_inbox table before it finds an entry for a hidden device. We probably want to change it to find hidden devices first and then delete the rows associated with those devices.
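
A minimal sketch of that reworked approach, reusing the table and column names from the plan above (the actual change that later landed in #11421/#11422 may differ in detail):

```sql
-- Sketch only: drive the cleanup from the (typically tiny) set of hidden
-- devices instead of scanning device_inbox in stream_id order.

-- Step 1: enumerate the hidden devices.
SELECT user_id, device_id FROM devices WHERE hidden = true;

-- Step 2: for each such device, delete its pending to-device messages.
-- (:user_id / :device_id are placeholders filled in per device by the caller.)
DELETE FROM device_inbox
WHERE user_id = :user_id AND device_id = :device_id;
```

Either way, the point is to start from devices, where hidden devices are few, rather than from device_inbox, where the row estimate in the plan above runs to over a billion rows.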

@richvdh
Member

richvdh commented Nov 23, 2021

> a conscious trade-off that it's better to consume more resources and have the background update finish in a sensible time frame than to effectively never have it finish at all.

If we have a slow query which needs optimising, it feels like it's better that it goes slowly (and we can optimise it in the next release) than that it takes out the entire homeserver by forging on and doing 100 rows anyway.

> I think this is due to the DB having very few hidden devices, so it's walking many rows of the device_inbox table before it finds an entry for a hidden device. We probably want to change it to find hidden devices first and then delete the rows associated with those devices.

That's a good idea, though I think we might have a similar problem with remove_deleted_devices_from_device_inbox, which isn't going to be as amenable to the same technique.
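
For context, a rough sketch of why the deleted-devices case is harder (again purely illustrative, not Synapse's actual query): there is no small set of devices to enumerate first, because the device rows are gone, so the cleanup has to check device_inbox rows against devices with something like an anti-join:

```sql
-- Sketch only: delete to-device messages whose device no longer exists.
-- Unlike the hidden-device case, there is no small set of "deleted devices"
-- to drive the query from, so candidate device_inbox rows must be checked
-- one by one against the devices table.
DELETE FROM device_inbox AS di
WHERE NOT EXISTS (
    SELECT 1
    FROM devices AS d
    WHERE d.user_id = di.user_id
      AND d.device_id = di.device_id
);
```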

@richvdh richvdh added the X-Release-Blocker Must be resolved before making a release label Nov 23, 2021
@richvdh
Member

richvdh commented Nov 24, 2021

Tasks to do to resolve this: (edit, see PR description)

This is with @babolivier; he has the details of ideas for how to achieve them :)

@richvdh
Member

richvdh commented Nov 26, 2021

I think we can close this now that #11421 and #11422 have landed. (For those afflicted: try 1.48.0rc1; it should be much better.)

@richvdh richvdh closed this as completed Nov 26, 2021
@babolivier
Contributor

Ah yeah I meant to close it but forgot, thanks.

@skepticalwaves
Author

The load finally stopped and I see the queries have been optimized.
