
/sync performance slow since v0.10.0 #2777

Closed
raidokaldma opened this issue Oct 7, 2022 · 15 comments
Labels
C-Sync-API, T-Task, X-Performance

Comments

@raidokaldma

Background information

  • v0.10.1
  • Monolith
  • Postgres
  • Running in Docker

Description

I'm running Dendrite in Docker and noticed a huge difference in performance when doing initial sync when switching from v0.9.9 to v0.10.0 or v0.10.1

Response times in my server:
With docker image matrixdotorg/dendrite-monolith:v0.9.9

# With existing filter
$ curl -o /dev/null -s -w 'Total: %{time_total}s\n' 'https://[my-domain]/_matrix/client/r0/sync?filter=7&timeout=0&_cacheBuster=1665145519640&access_token=[redacted]'
Total: 1,113253s

# With inline filter
$ curl -o /dev/null -s -w 'Total: %{time_total}s\n' 'https://[my-domain]/_matrix/client/r0/sync?filter=%7B%22room%22%3A%7B%22timeline%22%3A%7B%22limit%22%3A100%7D%7D%7D&timeout=0&access_token=[redacted]'
Total: 6,983575s

With docker image matrixdotorg/dendrite-monolith:v0.10.1

# With existing filter
$ curl -o /dev/null -s -w 'Total: %{time_total}s\n' 'https://[my-domain]/_matrix/client/r0/sync?filter=7&timeout=0&_cacheBuster=1665145519640&access_token=[redacted]'
Total: 5,740234s

# With inline filter
$ curl -o /dev/null -s -w 'Total: %{time_total}s\n' 'https://[my-domain]/_matrix/client/r0/sync?filter=%7B%22room%22%3A%7B%22timeline%22%3A%7B%22limit%22%3A100%7D%7D%7D&timeout=0&access_token=[redacted]'
Total: 15,450809s
@Mic92

Mic92 commented Oct 8, 2022

I can confirm. With weechat-matrix I am no longer able to log in for this reason.

@neilalexander
Contributor

How are your database connections configured? Are you using the global database section or per-component? If so, what are your max_open_conns values?

@raidokaldma
Author

I'm using one global database connection pool with the following config:

  database:
    connection_string: "postgresql://[redacted]"
    max_open_conns: 10
    max_idle_conns: 5
    conn_max_lifetime: -1

@neilalexander
Contributor

With a total connection count of 10 shared across the entirety of Dendrite, I'm not at all surprised to hear things are slow. Are you using such a low limit for a specific reason?

@raidokaldma
Author

I did an experiment and set max_open_conns: 90, as in the default config, and ran the same requests, but I did not notice any difference in response times.

The server I'm running does not get a lot of traffic, and I'm almost the only one running queries, so the available connections do not seem to be exhausted. Judging from the metrics that PostgreSQL reports, Dendrite uses 5 connections or fewer.
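
(As a cross-check, the number of connections Dendrite is holding open can be read straight from Postgres; a minimal sketch, assuming the Dendrite database is simply named "dendrite":)

-- Count the connections currently open against the Dendrite database,
-- grouped by state. The database name below is an assumption; adjust to your setup.
SELECT state, count(*)
FROM pg_stat_activity
WHERE datname = 'dendrite'
GROUP BY state;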

@Mic92

Mic92 commented Oct 11, 2022

After increasing per-component connections to 100, I was able to log in again: Mic92/dotfiles@9f6de28
I don't consistently see that many connections being used, so I am not sure if they are only needed for a short amount of time.

@raidokaldma
Author

raidokaldma commented Oct 19, 2022

I created a database dump, imported the data into a new database and enabled statement logging in Postgres for that database only. Then I ran Docker containers with versions 0.9.1 and 0.10.3 (latest). Here's what I found:

matrixdotorg/dendrite-monolith:v0.9.1

curl -o /dev/null -sS -w 'Total: %{time_total}s\n' 'http://localhost:8008/_matrix/client/r0/sync?filter=7&timeout=0&access_token=[redacted]'
Total: 0,846131s

With a single HTTP request, Dendrite made 858 database queries, the top 10 being:

441 SELECT event_type_nid, event_state_key_nid, event_nid FROM roomserver_events WHERE event_nid = ANY($1) AND (CARDINALITY($2::bigint[]) = 0 OR event_type_nid = ANY($2)) AND (CARDINALITY($3::bigint[]) = 0 OR event_state_key_nid = ANY($3)) ORDER BY event_type_nid, event_state_key_nid ASC
 42 SELECT COUNT(*) FROM ( SELECT DISTINCT ON (room_id, user_id) room_id, user_id, membership FROM syncapi_memberships WHERE room_id = $1 AND stream_pos <= $2 ORDER BY room_id, user_id, stream_pos DESC) t WHERE t.membership = $3
 25 SELECT content FROM account_data WHERE localpart = $1 AND room_id = $2 AND type = $3
 21 SELECT topological_position, stream_position FROM syncapi_output_room_events_topology WHERE event_id = $1
 21 SELECT state_snapshot_nid, state_block_nids FROM roomserver_state_snapshots WHERE state_snapshot_nid = ANY($1) ORDER BY state_snapshot_nid ASC
 21 SELECT state_block_nid, event_nids FROM roomserver_state_block WHERE state_block_nid = ANY($1) ORDER BY state_block_nid ASC
 21 SELECT room_version, room_nid, state_snapshot_nid, latest_event_nids FROM roomserver_rooms WHERE room_id = $1
 21 SELECT room_nid, room_version FROM roomserver_rooms WHERE room_nid = ANY($1)
 21 SELECT latest_event_nids, state_snapshot_nid FROM roomserver_rooms WHERE room_nid = $1
 21 SELECT event_type, event_type_nid FROM roomserver_event_types WHERE event_type = ANY($1)

matrixdotorg/dendrite-monolith:v0.10.3 (latest)

curl -o /dev/null -sS -w 'Total: %{time_total}s\n' 'http://localhost:8008/_matrix/client/r0/sync?filter=7&timeout=0&access_token=[redacted]'
Total: 13,016279s

With a single HTTP request, Dendrite made 3504 database queries, the top 10 being:

2056 SELECT event_type_nid, event_state_key_nid, event_nid FROM roomserver_events WHERE event_nid = ANY($1) AND (CARDINALITY($2::bigint[]) = 0 OR event_type_nid = ANY($2)) AND (CARDINALITY($3::bigint[]) = 0 OR event_state_key_nid = ANY($3)) ORDER BY event_type_nid, event_state_key_nid ASC
 185 SELECT room_nid, room_version FROM roomserver_rooms WHERE room_nid = ANY($1)
 185 SELECT event_nid, room_nid FROM roomserver_events WHERE event_nid = ANY($1)
 185 SELECT event_nid, event_json FROM roomserver_event_json WHERE event_nid = ANY($1) ORDER BY event_nid ASC
 185 SELECT event_nid, event_id FROM roomserver_events WHERE event_nid = ANY($1)
 167 SELECT event_nid, state_snapshot_nid FROM roomserver_events WHERE event_id = $1
 127 SELECT state_block_nid, event_nids FROM roomserver_state_block WHERE state_block_nid = ANY($1) ORDER BY state_block_nid ASC
  42 SELECT state_snapshot_nid, state_block_nids FROM roomserver_state_snapshots WHERE state_snapshot_nid = ANY($1) ORDER BY state_snapshot_nid ASC
  42 SELECT room_version, room_nid, state_snapshot_nid, latest_event_nids FROM roomserver_rooms WHERE room_id = $1
  42 SELECT event_state_key, event_state_key_nid FROM roomserver_event_state_keys WHERE event_state_key = ANY($1)

EDIT: I additionally tested with the database (from the same dump) and the Dendrite container (0.10.3) running on the same host, and noticed that HTTP request times were quite OK (around 0.6 s). If Dendrite and PostgreSQL run on different hosts, network latency becomes an issue and request times are as slow as described above.

Any thoughts or suggestions on what to try next?
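
(For reference, the per-database statement logging used for the analysis above can be toggled with something like the following; the database name is a placeholder for whatever the Dendrite database is called:)

-- Log every statement for this database only (database name is a placeholder).
ALTER DATABASE dendrite SET log_statement = 'all';
-- Revert to the server-wide default once the analysis is done.
ALTER DATABASE dendrite RESET log_statement;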

neilalexander added a commit that referenced this issue Nov 1, 2022
This optimizes history visibility checks by (mostly) avoiding database
hits.
Possibly solves #2777

Co-authored-by: Neil Alexander <neilalexander@users.noreply.github.com>
@S7evinK
Contributor

S7evinK commented Nov 1, 2022

@raidokaldma mind giving main/latest a try?

@raidokaldma
Author

raidokaldma commented Nov 2, 2022

Edit: I removed my previous response, which was incorrect.

Since last time I have removed the previous database and created a new one, so the response times are not 1:1 comparable with my earlier comments, but regardless, here is what I found:

I had matrixdotorg/dendrite-monolith:v0.10.4 running, so I did a measurement to get the baseline response time:

curl -o /dev/null -sS -w 'Total: %{time_total}s\n' 'http://localhost:8008/_matrix/client/r0/sync?filter=4&timeout=0&access_token=[omitted]'
Total: 13,093631s

Then I ran matrixdotorg/dendrite-monolith:v0.10.6, which is the latest right now, and repeated the same request:

curl -o /dev/null -sS -w 'Total: %{time_total}s\n' 'http://localhost:8008/_matrix/client/r0/sync?filter=4&timeout=0&access_token=[omitted]'
Total: 6,367924s

The improvement is noticeable indeed, but it is still slower than what I had before 0.10.0.

That being said, I'm doing other sorts of optimization in my JS code (matrix-js-sdk) to work around the slowness, for example by making use of IndexedDBStore. So the current performance is not a big problem any more.

Also, a big thank you for investing the time and shipping improvements.


@kegsay kegsay added the X-Needs-Info label and removed the needs investigation label Dec 5, 2022
@kegsay
Member

kegsay commented Dec 6, 2022

@raidokaldma thank you very much for your detailed and comprehensive reporting. I have a few follow-up questions:

  • How large is the account being synced in terms of number of rooms on the account?
  • What is the timeline limit set to in the stored filter?
  • Would you be interested in trying out experimental branches which may improve sync performance?

Thanks!

@kegsay kegsay added the C-Sync-API, T-Task and X-Performance labels and removed the X-Needs-Info label Dec 6, 2022
@raidokaldma
Author

Hey, thanks for dedicating time to the issue.

I still have my database dump. To answer your questions in order:

1. A summary of what my database contains:

select relname, n_live_tup from pg_stat_all_tables where schemaname = 'public' order by relname;
┌──────────────────────────────────────────────┬────────────┐
│                   relname                    │ n_live_tup │
├──────────────────────────────────────────────┼────────────┤
│ db_migrations                                │         18 │
│ federationsender_blacklist                   │          0 │
│ federationsender_inbound_peeks               │          0 │
│ federationsender_joined_hosts                │        216 │
│ federationsender_notary_server_keys_json     │          0 │
│ federationsender_notary_server_keys_metadata │          0 │
│ federationsender_outbound_peeks              │          0 │
│ federationsender_queue_edus                  │          0 │
│ federationsender_queue_json                  │          0 │
│ federationsender_queue_pdus                  │          0 │
│ keydb_server_keys                            │          0 │
│ keyserver_cross_signing_keys                 │          6 │
│ keyserver_cross_signing_sigs                 │          6 │
│ keyserver_device_keys                        │        589 │
│ keyserver_key_changes                        │         92 │
│ keyserver_one_time_keys                      │          0 │
│ keyserver_stale_device_lists                 │          0 │
│ mediaapi_media_repository                    │          0 │
│ mediaapi_thumbnail                           │          0 │
│ roomserver_event_json                        │       2098 │
│ roomserver_event_state_keys                  │         93 │
│ roomserver_event_types                       │         11 │
│ roomserver_events                            │       2098 │
│ roomserver_invites                           │        237 │
│ roomserver_membership                        │        329 │
│ roomserver_previous_events                   │       2001 │
│ roomserver_published                         │          0 │
│ roomserver_redactions                        │          0 │
│ roomserver_room_aliases                      │         64 │
│ roomserver_rooms                             │         97 │
│ roomserver_state_block                       │       1985 │
│ roomserver_state_snapshots                   │       1986 │
│ syncapi_account_data_type                    │        256 │
│ syncapi_backward_extremities                 │          0 │
│ syncapi_current_room_state                   │        791 │
│ syncapi_filter                               │         26 │
│ syncapi_ignores                              │          0 │
│ syncapi_invite_events                        │       1021 │
│ syncapi_memberships                          │        470 │
│ syncapi_notification_data                    │        273 │
│ syncapi_output_room_events                   │       2098 │
│ syncapi_output_room_events_topology          │       2098 │
│ syncapi_peeks                                │          0 │
│ syncapi_presence                             │         12 │
│ syncapi_receipts                             │        157 │
│ syncapi_relations                            │          0 │
│ syncapi_send_to_device                       │          0 │
│ userapi_account_datas                        │        256 │
│ userapi_accounts                             │         92 │
│ userapi_daily_stats                          │          0 │
│ userapi_daily_visits                         │       2917 │
│ userapi_devices                              │        583 │
│ userapi_key_backup_versions                  │          0 │
│ userapi_key_backups                          │          0 │
│ userapi_login_tokens                         │          0 │
│ userapi_notifications                        │         34 │
│ userapi_openid_tokens                        │          1 │
│ userapi_profiles                             │         92 │
│ userapi_pushers                              │          0 │
│ userapi_threepids                            │          0 │
└──────────────────────────────────────────────┴────────────┘
2. The timeline limit in the stored filter is 8:

# From the dendrite DB:
select filter from syncapi_filter where id = 4;
{"account_data":{},"presence":{},"room":{"account_data":{},"ephemeral":{},"state":{},"timeline":{"limit":8}}}
3. Sure, if I have time. It would be nice if experimental builds were available as a Docker image too.

@S7evinK
Contributor

S7evinK commented Feb 4, 2023

@raidokaldma it's not yet merged, but PR #2961 should significantly reduce the number of queries and be much more performant. Your filter might cause problems because you're fetching the whole state for each room; enabling lazy loading will drastically improve the response time with that PR.
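
(For illustration, the stored filter quoted in the earlier comment could enable lazy loading roughly like this; lazy_load_members is the standard Matrix client-server filter field, and this is a sketch rather than a recommendation taken from the PR:)

{
  "account_data": {},
  "presence": {},
  "room": {
    "account_data": {},
    "ephemeral": {},
    "state": { "lazy_load_members": true },
    "timeline": { "limit": 8 }
  }
}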

S7evinK added a commit that referenced this issue Feb 7, 2023
Should fix the following issues, or at least make them a lot less severe, when using Postgres:

The main issue behind #2911: the client gives up after a certain time, causing a cascade of context errors, because the response couldn't be built up fast enough. This mostly happens on accounts with many rooms, due to the inefficient way we're getting recent events and the current state.

For #2777: the queries for getting the membership events for history visibility were being executed for each room (I think 185?), resulting in a whopping 2k queries for membership events. (Getting the state snapshot -> block NIDs -> actual wanted membership event.)

Both should now be better by:
- Using a LATERAL join to get all recent events for all joined rooms in one go (TODO: maybe do the same for room summary, current state, etc.)
- If we're lazy loading on initial syncs, we no longer fetch the whole current state just to drop the majority of it because we're lazy loading members - instead we add a filter to exclude membership events on the first call to `CurrentState`.
- Using an optimized query to get the membership events needed to calculate history visibility

---------

Co-authored-by: kegsay <kegan@matrix.org>
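
(To illustrate the LATERAL-join idea mentioned in that commit message, a minimal sketch; the table and column names below are simplified placeholders rather than Dendrite's exact schema:)

-- For one user, fetch the most recent timeline events of every joined room
-- in a single round trip instead of one query per room.
SELECT r.room_id, recent.*
FROM (
    SELECT DISTINCT room_id
    FROM syncapi_memberships
    WHERE user_id = $1 AND membership = 'join'
) AS r
CROSS JOIN LATERAL (
    SELECT o.*
    FROM syncapi_output_room_events AS o
    WHERE o.room_id = r.room_id
    ORDER BY o.id DESC
    LIMIT $2
) AS recent;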
@raidokaldma
Author

I finally found time to test the latest changes. In my case an initial sync on v0.10.8 took 5s, while with v0.11.1 the same request responds in 0.7s.

The difference is noticeable.

Thank you @S7evinK!

Feel free to close the issue if no further work is planned.

@kegsay
Member

kegsay commented Feb 13, 2023

We will be adding in more comprehensive profiling via runtime/trace over the next few weeks, so we should hopefully be able to identify this sort of thing much more easily. Closing this.

@kegsay kegsay closed this as completed Feb 13, 2023