
Some Synapse instances have been hammering their database after v1.66.0 -> v1.68.0 update #13942

Closed
jaywink opened this issue Sep 29, 2022 · 10 comments · Fixed by #13958 or #13972
Labels
  • A-Database: DB stuff like queries, migrations, new/remove columns, indexes, unexpected entries in the db
  • A-Performance: Performance, both client-facing and admin-facing
  • O-Occasional: Affects or can be seen by some users regularly or most users rarely
  • S-Major: Major functionality / product severely impaired, no satisfactory workaround.
  • T-Defect: Bugs, crashes, hangs, security vulnerabilities, or other reported issues.
  • X-Regression: Something broke which worked on a previous release
  • X-Release-Blocker: Must be resolved before making a release

Comments

@jaywink
Member

jaywink commented Sep 29, 2022

Description

Some EMS-hosted Synapse instances are hammering their database after upgrading from v1.66.0 to v1.68.0. The host we're concentrating on here is ecf6bc70-0bd7-11ec-8fb7-11c2f603e85f-live (an EMS internal host ID; please check with the EMS team for real hostnames).

The offending query is:

SELECT c.state_key FROM current_state_events as c
            /* Get the depth of the event from the events table */
            INNER JOIN events AS e USING (event_id)
            WHERE c.type = ? AND c.room_id = ? AND membership = ?
            /* Sorted by lowest depth first */
            ORDER BY e.depth ASC

The background update running at the time was event_push_backfill_thread_id, if relevant.

Graphs:

IOPS increased at the upgrade. The initial plateau at 4K was due to the database being limited to 4K IOPS. It now has 10K and has continued to consistently hammer the database in the ~7 hours since the upgrade.

[screenshot]

Degraded event send times, especially when constrained to 4K IOPS, which the host had been running fine with for a long time.

[screenshot]

Stateres worst-case times seem to reflect the database usage; presumably just a side effect of a busy DB?

[screenshot]

DB usage for background jobs had a rather massive spike for notify_interested_appservices_ephemeral right after upgrade.

[screenshot]

Taking that away from the graph, we see DB usage for background jobs is higher across the board since the upgrade.

[screenshot]

DB transactions:

[screenshot]

Cache eviction seems to indicate we should raise the get_local_users_in_room cache size, as entries are being evicted heavily due to size. However, this was the case pre-upgrade as well.

[screenshot]

Appservice transactions have not changed during this time by a large factor (3 bridges):

[screenshot]

A few other affected hosts, found manually:

  • 01bbd800-4670-11e9-8324-b54a9efc8abc-live

[screenshot]

  • db0718c0-2480-11e9-83c4-ad579ecfcc33-live

[screenshot]

Time-of-day changes in traffic have been ruled out; all these issues started at the upgrade, with no other changes to the hosting or deployment stack. There are probably more hosts affected by the DB usage increase.


Also discussed in backend internal.

Steps to reproduce

Upgrade from v1.66.0 to v1.68.0.

Homeserver

ecf6bc70-0bd7-11ec-8fb7-11c2f603e85f-live, 01bbd800-4670-11e9-8324-b54a9efc8abc-live, db0718c0-2480-11e9-83c4-ad579ecfcc33-live

Synapse Version

v1.68.0

Installation Method

Other (please mention below)

Platform

EMS-flavour Docker images built from upstream images, running on a Kubernetes cluster.

Relevant log output

-

Anything else that would be useful to know?

No response

@reivilibre
Contributor

That query was touched recently by #13575.

@DMRobertson
Contributor

DMRobertson commented Sep 29, 2022

$ git log --oneline -S get_users_in_room v1.66.0..v1.68.0
  e124b24ad 1.68.0rc1
  906cead9c Update docstrings to explain the impact of partial state (#13750)
  cdf7fb737 1.67.0rc1
  1a209efdb Update `get_users_in_room` mis-use to get hosts with dedicated `get_current_hosts_in_room` (#13605)
  d58615c82 Directly lookup local membership instead of getting all members in a room first (`get_users_in_room` mis-use) (#13608)
  c807b814a Use dedicated `get_local_users_in_room` to find local users when calculating `join_authorised_via_users_server` of a `/make_join` request (#13606)

#13605 #13608 #13606

On the affected hosts we saw that Postgres used a heap scan rather than an index scan for the aggressive query. Forcing an index-only scan (SET enable_bitmapscan to off) made the query quicker (500ms -> 300ms) for one trial set of parameters. It's unclear if those are representative, and also unclear whether the index-only scan used more or less file IO.
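
For reference, a rough sketch of how that plan comparison can be reproduced against a copy of the database (the connection string and query parameters below are placeholders, not values from the affected host):

# Compare the Postgres plan for the hot query with and without bitmap scans
# allowed. Run this against a replica or snapshot, not the live database.
import psycopg2

EXPLAIN_SQL = """
EXPLAIN (ANALYZE, BUFFERS)
SELECT c.state_key FROM current_state_events AS c
    INNER JOIN events AS e USING (event_id)
    WHERE c.type = %s AND c.room_id = %s AND membership = %s
    ORDER BY e.depth ASC
"""

# Placeholder parameters; substitute a room that reproduces the slow query.
PARAMS = ("m.room.member", "!someroom:example.com", "join")

with psycopg2.connect("dbname=synapse") as conn:
    with conn.cursor() as cur:
        for setting in ("on", "off"):
            # enable_bitmapscan is a standard Postgres planner setting;
            # turning it off nudges the planner towards the index-only scan.
            cur.execute(f"SET enable_bitmapscan TO {setting}")
            cur.execute(EXPLAIN_SQL, PARAMS)
            print(f"--- enable_bitmapscan = {setting} ---")
            print("\n".join(row[0] for row in cur.fetchall()))

Comparing the Buffers: lines between the two plans should give a rough idea of how much file IO each plan does.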

But this might be barking up the wrong tree. Are we perhaps calling get_users_in_room in more circumstances now? Do we need to up its cache?

@anoadragon453
Member

Note that on matrix.org the cache factor specifically for get_users_in_room is between 7 and 20 across most workers. The global cache factor is typically either 1 or 2. This may help explain a discrepancy between matrix.org and these hosts.

@DMRobertson added the A-Performance, S-Major, T-Defect, X-Regression, A-Database, O-Occasional and X-Release-Blocker labels on Sep 29, 2022
@DMRobertson
Contributor

That query was touched recently by #13575.

By trying different versions we've confirmed that #13575 contributed to the performance regression, and we haven't seen any evidence of other queries being suddenly more expensive.

We did this by trying 51d732d and its parent 4f6de33 on a test server. (I tried to revert 51d732d on top of 1.68.0 for testing, but there were merge conflicts and I wasn't sure how to handle them; I could use advice from @MadLittleMods on that one, and on how to proceed here more generally.)

Anyway, here's the graph of database IO (bigger = more IO credits available) from the test server.

[screenshot]

Seems convincing to me.

@MadLittleMods
Contributor

MadLittleMods commented Sep 29, 2022

#13605 #13608 #13606

[...]

Are we perhaps calling get_users_in_room in more circumstances now? Do we need to up its cache?

From the PRs you linked, we use get_users_in_room a lot less now. It was being used as a crutch to find the current hosts in the room, but we have a dedicated, optimised function for that, get_current_hosts_in_room, which it was switched to. And in some cases, we didn't need to use get_users_in_room or get_current_hosts_in_room at all.
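
To illustrate the shape of those changes (a simplified sketch, not the actual diffs; the store methods are the ones discussed in this issue, but the surrounding code is made up):

# Illustrative sketch only -- not the actual diffs from those PRs.
from typing import Iterable, List


def hosts_from_user_ids(user_ids: Iterable[str]) -> List[str]:
    # User IDs look like "@alice:example.com"; the host is the domain part.
    return sorted({user_id.split(":", 1)[1] for user_id in user_ids})


async def hosts_in_room_before(store, room_id: str) -> List[str]:
    # Old crutch: fetch every member of the room (local and remote), then
    # derive the set of remote hosts from the user IDs ourselves.
    users = await store.get_users_in_room(room_id)
    return hosts_from_user_ids(users)


async def hosts_in_room_after(store, room_id: str) -> List[str]:
    # Dedicated, optimised lookup that works on hosts directly.
    return list(await store.get_current_hosts_in_room(room_id))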

#13575 does make the get_users_in_room query more complicated by adding a join. Some ideas for removing the join are in that same thread.

You can see from the matrix.org graphs that we call it a lot less (it's not even a top DB transaction by rate anymore) and the total time is way down, but it does take a bit more time now when it runs. (The switch happens with the dramatic change in the graphs around Sept 6th.)

https://grafana.matrix.org/d/000000012/synapse?orgId=1&from=1661891889156&to=1664483889157&viewPanel=11


The giant spike in notify_interested_services_ephemeral (from the graphs in the description) would account for a lot more get_users_in_room calls down the line though 🐠


Also, as a note, we will always use get_users_in_room to calculate get_current_hosts_in_room if you're using SQLite, but I think the EMS hosts are using Postgres, judging from the comment above:

if users is None and isinstance(self.database_engine, Sqlite3Engine):
    # If we're using SQLite then let's just always use
    # `get_users_in_room` rather than funky SQL.
    users = await self.get_users_in_room(room_id)

@MadLittleMods
Contributor

MadLittleMods commented Sep 29, 2022

Summarizing some more discussion in backend internal:

We don't know exactly which upstream callers account for all of these get_users_in_room calls, but everything kinda points to this being appservice related.

This also makes sense since one of the affected hosts is Element One which has a lot of bridged appservice users.

I suspect we have some more get_users_in_room mis-uses in the appservice code where we should be using get_local_users_in_room, since an appservice only needs to worry about users on its own server. Fixing these mis-uses doesn't fix get_users_in_room performance itself, but using get_local_users_in_room would be more performant than whatever we end up changing with get_users_in_room anyway.

@clokep pointed out a couple of get_users_in_room mis-uses already.

I'll work on a PR now to change these over ⏩
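
Schematically, the kind of change involved looks something like this (a simplified sketch, not the actual PR diff):

# Illustrative sketch only -- not the actual PR diff.
from typing import Callable, List


async def local_members_before(
    store, is_mine_id: Callable[[str], bool], room_id: str
) -> List[str]:
    # Fetch everyone in the room across the federation, then keep only the
    # users that belong to this homeserver.
    users = await store.get_users_in_room(room_id)
    return [user_id for user_id in users if is_mine_id(user_id)]


async def local_members_after(store, room_id: str) -> List[str]:
    # `get_local_users_in_room` reads the small `local_current_membership`
    # table, so it avoids the join onto `events` and never looks at remote
    # members at all.
    return await store.get_local_users_in_room(room_id)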

MadLittleMods added a commit that referenced this issue Sep 29, 2022
… users

`get_local_users_in_room` is way more performant since it looks at a single
table (`local_current_membership`) and is looking through way less
data since it only worries about the local users in the room instead
of everyone in the room across the federation.

Fix #13942

Related to:

 - #13605
 - #13608
 - #13606
MadLittleMods added a commit that referenced this issue Sep 29, 2022
Spawning from looking into `get_users_in_room` while investigating
#13942 (comment).

See #13575 (comment)
for the original exploration around finding `get_users_in_room` mis-uses.
@jaywink
Member Author

jaywink commented Sep 30, 2022

everything kinda points to this being appservice related.

The host I used for testing also has a WhatsApp bridge with lots of remote users, as well as a Signal bridge. The other host mentioned (db0718c0-2480-11e9-83c4-ad579ecfcc33-live) is a rather large host with a Telegram bridge. Sorry I failed to mention these in the original issue description even though appservices came to mind early on; I didn't realize db0718c0-2480-11e9-83c4-ad579ecfcc33-live has a Telegram bridge and thus didn't feel there was a connection after all.

One other large host I looked at (8e71e000-3607-11ea-8fb7-3d4a5344b740-live) has no appservices. It shows a slight baseline increase in IOPS, though nothing on the level of the others. This may be why we're not seeing this issue all around in our monitoring stack. However, I just want to make the point that it looks like non-appservice hosts are also affected by the changes, though at a much smaller level.

[screenshot]

It doesn't look like this is affecting apdex, but it may for hosts where the RDS instance is struggling to provide the extra query capacity.

[screenshot]

@erikjohnston
Member

I have to say I'm leaning towards backing out the change to get_users_in_room entirely. Yes, there are some mis-uses going around, but there are legitimate places where we call get_users_in_room, and I don't see why those won't be just as affected by the change.

It's also worth noting that Beeper (cc @Fizzadar) have been trying to rip out all the joins onto the events table for performance reasons.

I'd vote we add a separate get_ordered_hosts_in_room that returns the ordered list of hosts, and only use that when we actually care. That would also allow us in future to be more clever with the ordering of hosts (e.g. ordering by perceived health of the remote, etc.).
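
A rough sketch of what that split could look like (the name get_ordered_hosts_in_room follows the suggestion above; the real implementation may differ):

# Illustrative sketch; `get_ordered_hosts_in_room` is the name suggested
# above and may not match whatever actually lands.
from typing import List, Set


async def hosts_to_send_to(store, room_id: str) -> Set[str]:
    # No ordering needed: this can be answered from `current_state_events`
    # alone, without joining onto the `events` table.
    return set(await store.get_current_hosts_in_room(room_id))


async def backfill_candidates(store, room_id: str, limit: int = 5) -> List[str]:
    # Ordering needed (oldest members first), so only this path pays for the
    # ORDER BY e.depth query.
    hosts = await store.get_ordered_hosts_in_room(room_id)
    return list(hosts)[:limit]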

@erikjohnston
Member

But also I should say: I'm surprised the change has caused this much problem, so this is not anyone's fault!

erikjohnston added a commit that referenced this issue Sep 30, 2022
Fixes #13942. Introduced in #13575.

Basically, let's only get the ordered set of hosts out of the DB if we need an ordered set of hosts. Since we split the function up, the caching won't be as good, but I think it will still be fine, as e.g. multiple backfill requests for the same room will hit the cache.
@Fizzadar
Contributor

As Erik mentioned, we (Beeper) also have this issue with RDS IOPS starvation. Joins seem to be particularly expensive in terms of IOPS (both on an index and not). I can't find the source right now, but IIRC doing joins effectively acts as an IOPS multiplier, which may be why this change hurt so much (multiplied by the AS code paths utilising the function heavily).

MadLittleMods added a commit that referenced this issue Oct 1, 2022
…de` (#13960)

Spawning from looking into `get_users_in_room` while investigating #13942 (comment).

See #13575 (comment) for the original exploration around finding `get_users_in_room` mis-uses.

Related to the following PRs where we also cleaned up some `get_users_in_room` mis-uses:

 - #13605
 - #13608
 - #13606
 - #13958