This repository has been archived by the owner on Apr 26, 2024. It is now read-only.

Unable to join some large rooms due to high RAM consumption #7339

Closed
rihardsk opened this issue Apr 23, 2020 · 18 comments
Labels
A-Performance Performance, both client-facing and admin-facing z-p2 (Deprecated Label)

Comments

@rihardsk

Description

Joining some large rooms, such as #freenode_#haskel:matrix.org (1.4k members), fails because Synapse eats up all the available memory and is forcefully stopped. I've configured my system to limit Synapse to 3.5 GB of RAM. Upon joining, Synapse first spends some time processing (high CPU usage, RAM usage close to the ~500 MB baseline); after a while, RAM consumption starts to climb steadily until it reaches the 3.5 GB mark and the process has to be killed.
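For context, that limit is enforced at the OS level rather than by Synapse itself. On systemd-based systems (including NixOS) one common way to do this is a cgroup memory cap on the Synapse unit; a minimal sketch, assuming the unit is called matrix-synapse.service (the actual unit name and path on NixOS may differ):

```ini
# /etc/systemd/system/matrix-synapse.service.d/memory.conf
# Hypothetical drop-in; unit name and path are assumptions.
[Service]
MemoryMax=3584M
```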

Here are the logs from the moment of joining the room up until synapse getting killed
ram-crash.redacted.log
The request for joining the room comes in at line 28. At line 564, Synapse stopped printing anything to the logs and just maintained high CPU usage, with RAM consumption growing steadily for about a minute, until it was killed.

Joining other large rooms, e.g. #matrix:matrix.org (3.2k members) and #synapse:matrix.org (1.2k members), works fine (I didn't monitor RAM consumption when joining those, but the same limits were in place). Someone on #synapse:matrix.org reported joining a room with ~20k people with RAM consumption peaking around ~1.1 GB, which leads me to suspect that I might be seeing something abnormal in my case. Am I?

Other than this issue, Synapse seems to be working fine. I'm willing to reproduce this and do some profiling if necessary.

Steps to reproduce

  • try to join #freenode_#haskel:matrix.org
  • watch as synapse's RAM consumption grows to > 3.5 GB

Version information

  • Homeserver: my private homeserver

  • Version: 1.12.1

  • Install method: NixOS

  • Platform: latest NixOS master branch (commit 01c8795673ecff6272895a085e9fe1ffa3b33620) running on a rockpro64 sbc (with a custom patched kernel).
@babolivier babolivier added z-p2 (Deprecated Label) A-Performance Performance, both client-facing and admin-facing labels Apr 27, 2020
@babolivier
Contributor

babolivier commented Apr 27, 2020

The "size" of a Matrix room isn't described by its number of users but the number of state events (e.g. joins, leaves, kicks, bans, changes of name, topic, power level rules, join rules, etc.) in its history. To summarise, there is a component to Matrix called the state resolution algorithm that's in charge of resolving clashes between two servers that got out of sync regarding what state a given room currently is. This algorithm works through the whole state of the room, and needs to load most (if not all) state events in that room in memory to work. This is what's making Synapse so hungry on RAM when trying to join a large room, because it needs to retrieve and authenticate every state event, which can be expensive for old rooms. If you're interested, how exactly this algorithm works has been explained recently on the matrix.org website: https://matrix.org/docs/guides/implementing-stateres

IIRC this is also the reason why some rooms can't be joined from small homeservers on modular.im.

The above is more a point of context and details than "it has a reason so it's not an issue" (because it definitely is an issue), and I don't think there's an open issue about that on this repo so I'll keep that one open to track the status of this.

@c7hm4r

c7hm4r commented Sep 17, 2020

@babolivier: This algorithm [...] needs to load most (if not all) state events in that room in memory to work.

Every algorithm can be implemented using little RAM, possibly at the cost of more I/O to persistent storage (such as a DB) and lower speed. This is a tradeoff decision.

The current implementation decisions prevent users of cheap hardware (for homeservers) from joining larger rooms. IMO this is a bug, isn't it?

If the algorithm implementation were tied more closely to the DB, and the DB implemented caching appropriately, memory usage would probably adapt automatically to the amount of available memory, and with plenty of RAM it might not be much slower.

Another idea: repeatedly check available free memory during execution of the algorithm, and if the requirement can't be met, abort cleanly, send an error message to the user, and fall back to some (maybe less secure) alternative, instead of hoping the OOM killer does the right thing (after a phase in which the whole system nearly freezes).
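A minimal sketch of that last idea, assuming the third-party psutil library and a hypothetical processing loop (none of this is existing Synapse code):

```python
import psutil

class MemoryBudgetExceeded(Exception):
    """Raised to abort an expensive operation cleanly, before the system
    starts thrashing and the OOM killer steps in."""

def check_memory_budget(min_free_bytes: int = 512 * 1024 * 1024) -> None:
    # psutil.virtual_memory().available estimates how much memory can be
    # handed out without pushing the system into swap.
    available = psutil.virtual_memory().available
    if available < min_free_bytes:
        raise MemoryBudgetExceeded(
            f"only {available} bytes available, wanted {min_free_bytes}"
        )

# Hypothetical use inside a long-running resolution loop:
# for batch in state_event_batches:
#     check_memory_budget()   # fail with a clean error instead of OOMing
#     process(batch)
```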

@mxvin

mxvin commented Oct 27, 2020

What I'm thinking is: why don't we just delegate this chore to the homeserver where the room resides?
Say I want to join a room on homeserver X. My HS just tells X "Hey, I want to join room "xyz"" and records the fact that we joined room "xyz" on X. Then, for event history sync and future traffic such as texts, media, etc., our homeserver simply proxies everything straight from homeserver X (I guess media delivery already uses this kind of approach).

Why does every server that wants to join such a room need to process every state event and all of that logic? I think that could be bypassed.
Federation is the core of Matrix. With this approach, anyone, even with very modest hardware like a Raspberry Pi, could spin up their own homeserver and join any room they like.

@immanuelfodor

Bootstrapping room state quickly from a data/DB sync: I like the idea.

@auscompgeek
Contributor

@mxvin a room is replicated to all homeservers that participate in that room; rooms don't live on a single server as they do in XMPP.

@lqdev

lqdev commented Nov 21, 2020

Similar issues for me, though I'm not sure mine are caused by RAM consumption: I used htop to track the processes, and RAM almost never goes above 500 MB.

Currently running a homeserver on a Raspberry Pi 4 B with 4 GB RAM. Initially, I was running on a Raspberry Pi 3 with 1 GB RAM. I've been able to join rooms like Element Android (2.5k members) and Synapse Admins (719). I'm using an SQLite DB at the moment.

Trying to join a room like Matrix HQ (7.8k), though, takes an extremely long time. Eventually my server crashes and I get a 502 Bad Gateway error.

@ptman
Contributor

ptman commented Nov 21, 2020

@lqdev First, switch from SQLite to PostgreSQL. You shouldn't federate with SQLite.
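For anyone following along, the switch boils down to pointing the database section of homeserver.yaml at PostgreSQL (and porting the data, e.g. with the synapse_port_db script). Roughly, with placeholder credentials:

```yaml
# homeserver.yaml -- database section for PostgreSQL (placeholder values)
database:
  name: psycopg2
  args:
    user: synapse_user
    password: CHANGEME
    database: synapse
    host: localhost
    cp_min: 5
    cp_max: 10
```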

@lqdev

lqdev commented Nov 22, 2020

Thanks @ptman I'll give that a try.

@lqdev

lqdev commented Nov 22, 2020

@ptman federation is a bit snappier after migrating to Postgres, thanks for the suggestion. I'm still intermittently running into issues, though; I'm guessing that's partly because I have everything running on an RPi. To clarify, it's large bridged rooms that I have trouble with (e.g. #techlore:matrix.org), so I can see how that might be an issue.

@c7hm4r

c7hm4r commented Feb 1, 2021

It seems that memory consumption is much lower with Synapse 1.26. Now my server can join rooms with a complexity between 20 and 30, but the largest rooms on matrix.org are still prohibitive.

@jkufner

jkufner commented Feb 22, 2021

Memory usage is certainly a problem. A server's memory usage should not depend on the number of historical events in a room.

Ideally, memory consumption should be constant. If there is session state or an event queue per client, then it should be linear in the number of clients; other than that, it should be possible to run the server in constant memory space. We have a powerful SQL database available; Synapse should use it.

Anyway, if a large room is defined by its number of events, can we make a state snapshot from time to time and then synchronise from the last snapshot? That way we could throw away (or lazy-load) the history before the snapshot, and every room would become a small room. The snapshot might be a hash of the state or something like that, not necessarily representing the complete state. If a client wants the earlier history, it could be provided on demand (nobody reads it all anyway).
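A rough sketch of what such a snapshot identifier could look like (this is the proposal above, not an existing Synapse feature; the state-map shape is assumed):

```python
import hashlib
import json

def state_snapshot_id(resolved_state):
    """Hash a resolved state map, assumed to map (type, state_key) pairs
    to event IDs, into a stable identifier. Two servers holding the same
    snapshot ID could skip re-processing the history behind it."""
    canonical = json.dumps(
        sorted((etype, skey, event_id)
               for (etype, skey), event_id in resolved_state.items())
    )
    return hashlib.sha256(canonical.encode("utf-8")).hexdigest()
```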

@ptman
Contributor

ptman commented Feb 22, 2021

@jkufner complexity (resource use) does not depend on the number of events (messages, attachments, etc.) but on the number of state events (related to e.g. federation and permission calculation): https://github.com/matrix-org/synapse/blob/master/synapse/storage/databases/main/events_worker.py#L1072
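For reference, the "complexity" score mentioned earlier in this thread is essentially the current state-event count scaled by a constant; a sketch of the v1 metric as I understand it from the linked code:

```python
def room_complexity_v1(num_current_state_events: int) -> float:
    # The v1 complexity divides the current state-event count by 500,
    # so a room with 10,000 current state events scores 20.0.
    return num_current_state_events / 500.0
```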

@ptman
Contributor

ptman commented Feb 22, 2021

#8659

@jkufner

jkufner commented Feb 22, 2021

@ptman OK, sorry for the inaccuracy; the argument still stands, however.

@AnInternetTroll

Any update on this?

@c7hm4r

c7hm4r commented Jun 8, 2021

@erikjohnston
Member

This should hopefully be significantly improved in the upcoming v1.36.0 release. I'm going to close this for now; if people still see issues after updating, feel free to open a new issue.

@gabrix73

I installed the Debian 11 matrix-synapse-py3 package on a CX21 server with 2 CPUs, 4 GB of RAM and 40 GB of disk. It's not really a small homeserver or a Raspberry Pi 4.
I am the only user, and I'm trying to explore just one room.
Loading #matrix:matrix.org from clients takes the server's RAM up to 70%, and CPU usage similarly.
I assume that normal use of my server, with more users and a large list of rooms of various kinds loaded, remains impossible.
