after upgrading to 1.7.2 my synapse does OOM each minute #6587
can you please dm me the logs from when it was leaking memory to
The server is receiving a big federation transaction from matrix.org (with id 1576842288415) which contains 50 PDUs and 6 EDUs. The PDUs have lots of prev events; it tries to go off and locate the missing events, but something gets wedged and it simply never completes. Therefore the request retries, stacks up, and OOMs.

a) We shouldn't be wedging.

b) I've suggested to @pedro-nonfree that he tries leaving some of the rooms mentioned in the bad transaction, in case that unsticks things until we can figure out what's going wrong.
Thanks @ara4n. Apart from finding the bug or the specific rooms that are hitting this memory leak, Synapse itself should avoid memory leaks that make it explode every minute. (Hope that a) and b) deal with it.)
As the affected room is public I would like to show it here: it is the room freenode/#openwrt-devel through the matrix.org bridge to IRC (matrix room: !sAEmVCXWUHBxWiYCmu:matrix.org)
For reference, if we receive a transaction that we've already processed, we will return the cached response. But there is no check for whether we're currently processing a given transaction.
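A rough sketch of the kind of check I mean, under the assumption that this is how it ought to work (the class and method names are illustrative, not Synapse's actual code):

```python
# Hypothetical sketch, not Synapse's real implementation: dedupe incoming
# federation transactions both against a cache of finished responses and
# against transactions that are still being processed.
import asyncio


class TransactionDeduplicator:
    def __init__(self):
        self._response_cache = {}  # transaction_id -> finished response
        self._in_flight = {}       # transaction_id -> asyncio.Task

    async def handle(self, transaction_id, process):
        """process is an async callable that does the real work once."""
        # If we've already finished this transaction, return the cached response.
        if transaction_id in self._response_cache:
            return self._response_cache[transaction_id]

        # If the sending server retried before we responded, wait for the
        # existing work instead of starting the same processing again.
        if transaction_id in self._in_flight:
            return await asyncio.shield(self._in_flight[transaction_id])

        task = asyncio.ensure_future(process())
        self._in_flight[transaction_id] = task
        try:
            response = await task
            self._response_cache[transaction_id] = response
            return response
        finally:
            self._in_flight.pop(transaction_id, None)
```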
I don't think that's correct: the relevant code is here.
Having had a look at @pedro-nonfree's logs: I don't believe it is correct to conclude that the reason for the OOM is that we are trying to process the same transaction several times in parallel: the code I linked above appears to be doing its job correctly.

Rather, it looks like the code processing the incoming transaction is just wedging, without doing any logging. The last thing it logs in every case (apart from the calling server dropping the connection) is a successful response to a call to

I'll have to think a bit more about this. In the meantime @pedro-nonfree: if you'd like to try running synapse with debug logging enabled (essentially, change the root log level to DEBUG in your log config), that would be helpful.
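For reference, this is roughly what the root-log-level change looks like in plain Python logging terms. Synapse itself reads an equivalent dictConfig from the YAML file named by `log_config` in homeserver.yaml, so treat this only as an illustration of the shape of the change, not the actual config file:

```python
# Illustration only: Synapse's log config lives in a YAML file (the one named
# by `log_config` in homeserver.yaml), which follows the standard Python
# logging dictConfig schema. Raising the root level to DEBUG there roughly
# corresponds to this.
import logging.config

logging.config.dictConfig({
    "version": 1,
    "formatters": {
        "precise": {
            "format": "%(asctime)s - %(name)s - %(levelname)s - %(message)s",
        },
    },
    "handlers": {
        "console": {
            "class": "logging.StreamHandler",
            "formatter": "precise",
        },
    },
    "root": {
        # the important change: DEBUG instead of INFO
        "level": "DEBUG",
        "handlers": ["console"],
    },
})
```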
thanks - please share logs asap (privately) once you've repro'd an OOM. be aware that DEBUG logs can contain personal data though.
I'd recommend setting
right, so the final entries in @pedro-nonfree's logs are:
so I'd guess that attempting to load 27637 events into memory at once is causing an OOM. This is happening because the

I can't really believe that there are 22327 events in the auth chain for a single event, so something seems off there.

I now think that the fact this started happening after the upgrade to 1.7.2 is a coincidence: rather it was triggered by us deploying #6556 (or something similar) on matrix.org.
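Not a fix for the bad data itself, but the kind of bounded loading that would keep memory flat even when a remote hands back tens of thousands of event ids might look roughly like this (`fetch_events_in_batches` and `get_events` are made-up names, not Synapse's actual helpers):

```python
# Illustrative only: fetch events referenced by a huge /state_ids response in
# bounded batches rather than materialising all of them at once.
async def fetch_events_in_batches(event_ids, get_events, batch_size=100):
    """get_events is an async callable taking a list of ids and returning events."""
    results = {}
    for i in range(0, len(event_ids), batch_size):
        batch = event_ids[i:i + batch_size]
        for event in await get_events(batch):
            results[event.event_id] = event
    return results
```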
I've raised #6597 to track the problem with matrix.org's response. |
@richvdh identified which room was causing the problems. I tried to leave it but could not; he kicked me from that room and the OOM cycle stopped. So the problem was identified on the matrix.org server.

I suggest that the other servers in the federation should be ready to receive insane state_ids responses: we can think of this as an attack vector, where a malicious user/server invites

Lessons learned: it was a good idea to have the matrix service mostly isolated. That's a big warning for people who run the server together with other unrelated services.
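Something along these lines on the receiving side would at least turn such a response into a loud, bounded error instead of an OOM. The thresholds and names are invented for illustration; this is not what Synapse currently does:

```python
# Hypothetical defensive check on a remote /state_ids response.
MAX_STATE_IDS = 10_000       # illustrative limits, not real Synapse settings
MAX_AUTH_CHAIN_IDS = 10_000


class SuspiciousStateIdsResponse(Exception):
    """Raised when a remote server returns an implausibly large response."""


def check_state_ids_response(state_ids, auth_chain_ids):
    if len(state_ids) > MAX_STATE_IDS or len(auth_chain_ids) > MAX_AUTH_CHAIN_IDS:
        raise SuspiciousStateIdsResponse(
            "remote returned %d state ids and %d auth chain ids"
            % (len(state_ids), len(auth_chain_ids))
        )
```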
I'm going to close this in favour of #6597 |
Here is a long-running (1.5 years) matrix instance running the official matrix repos.
(other times I saw `postgres: 11/main: synapse_user synapse ::1(38138) SELECT`)

And here is the dmesg (it starts with the end of one OOM and ends with the start of the next, so you have an idea of the time between two) [0]
This started when I upgraded to 1.7.2 on 2019-12-21 at 02:40:35 [1], but the first memory leak started at 2019-12-21 12:29:04; then I decided to upgrade from stretch (debian oldstable) to buster (debian stable).
Then I downgraded to 1.7.1 and I still (?) have the memory leak.

Then I downgraded to 1.7.0 and the memory leak continues, with the same timing: the first OOM requires a little more time, but then it happens again as usual.

My configuration overrides the default one (it uses postgresql and the ldap provider) [3]
Thanks to Aaron and realitygaps in the synapse matrix channel for their initial support, which was crucial to start this bug report.
[0]
[1] (from /var/log/apt/history.log)

[2]
https://github.com/matrix-org/synapse/pull/6576/files
which in debian packaging is here: /opt/venvs/matrix-synapse/lib/python3.5/site-packages/synapse/handlers/federation.py
[3]