Forward extremities accumulate and lead to poor performance #1760
Comments
This was referenced Jan 5, 2017
ara4n added the federation-meltdown label on Jan 5, 2017
ara4n (Member) · Jan 5, 2017
More logging for resolving state groups was added in #1767 which will hopefully help explain this
This was referenced Jan 8, 2017
ara4n added the p1 label on Jan 9, 2017
ara4n (Member) · Jan 15, 2017
As a workaround for people with seriously fragmented rooms (e.g. @Half-Shot has 209 extremities in #mozilla_#rust:matrix.org atm):
delete from event_forward_extremities where
room_id in (select room_id from event_forward_extremities group by room_id having count(*)>1) and
event_id not in
(select max(event_id) from event_forward_extremities where
room_id in (select room_id from event_forward_extremities group by room_id having count(*)>1)
group by room_id);
...is a dangerous and risky and not-really-recommended solution which will remove all but the newest extremity from rooms with multiple extremities. If it leaves the 'wrong' extremity for the room, bad things could happen, however. It's useful if your server is so hosed that you can't otherwise send dummy messages into the room to heal it. It should be run whilst the server is shut down. So far we haven't seen it make problems worse; only better.
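To see concretely what this delete does before pointing it at a live database, here is the same statement run against a throwaway in-memory sqlite table (a sketch: the schema is reduced to the two columns the query touches, and the row values are invented):

```python
import sqlite3

# Throwaway database; the real event_forward_extremities table lives in
# synapse's postgres/sqlite store and has more columns than shown here.
con = sqlite3.connect(":memory:")
con.execute("CREATE TABLE event_forward_extremities (event_id TEXT, room_id TEXT)")
con.executemany(
    "INSERT INTO event_forward_extremities VALUES (?, ?)",
    [("$1", "!frag"), ("$2", "!frag"), ("$3", "!frag"), ("$9", "!ok")],
)

# The workaround query: in every room with more than one extremity, delete all
# extremities except the one with the greatest event_id.
con.execute("""
    DELETE FROM event_forward_extremities WHERE
      room_id IN (SELECT room_id FROM event_forward_extremities
                  GROUP BY room_id HAVING count(*) > 1)
      AND event_id NOT IN
        (SELECT max(event_id) FROM event_forward_extremities
         WHERE room_id IN (SELECT room_id FROM event_forward_extremities
                           GROUP BY room_id HAVING count(*) > 1)
         GROUP BY room_id)
""")

rows = sorted(con.execute("SELECT event_id, room_id FROM event_forward_extremities"))
print(rows)  # [('$3', '!frag'), ('$9', '!ok')] - one extremity left per room
```

The fragmented room `!frag` is reduced to a single extremity, while the healthy room `!ok` is untouched.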
erikjohnston (Member) · Jan 16, 2017
select max(event_id) from event_forward_extremities
won't necessarily give you the latest event id. It may sort of work given that synapse sends out events with an auto-incrementing integer at the front, but that won't be true across different servers.
To get the latest you'd need to compare the stream_orderings.
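A toy illustration of the pitfall (not synapse code; the IDs and orderings are invented): string comparison of event IDs has nothing to do with arrival order, whereas stream_ordering is an integer assigned on receipt.

```python
# Hypothetical (event_id, stream_ordering) pairs. Only locally-created synapse
# events start with an auto-incrementing integer; a remote server's IDs follow
# its own scheme, so comparing event_id strings is not a chronological order.
extremities = [
    ("$99:example.org", 1001),     # remote event, oldest by stream_ordering
    ("$143985:matrix.org", 1002),
    ("$143999:matrix.org", 1003),  # actually the latest event received
]

latest_by_string = max(extremities, key=lambda e: e[0])[0]
latest_by_stream = max(extremities, key=lambda e: e[1])[0]

print(latest_by_string)  # '$99:example.org' - wrong: '9' sorts after '1'
print(latest_by_stream)  # '$143999:matrix.org' - the event received last
```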
erikjohnston (Member) · Jan 16, 2017
DELETE FROM event_forward_extremities AS e
USING (
SELECT DISTINCT ON (room_id)
room_id,
last_value(event_id) OVER w AS event_id
FROM event_forward_extremities
NATURAL JOIN events
WINDOW w AS (
PARTITION BY room_id
ORDER BY stream_ordering
range between unbounded preceding and unbounded following
)
ORDER BY room_id, stream_ordering
) AS s
WHERE
s.room_id = e.room_id
AND e.event_id != s.event_id
AND e.room_id = '!jpZMojebDLgJdJzFWn:matrix.org';

...is probably more how you can do it on postgres
ara4n referenced this issue on Jan 16, 2017: Freenode channels should be re-introduced to the main Matrix.org room dir #2936 (Open)
richvdh changed the title from "Synapse's memory usage temporarily spikes by ~1GB when performing state group resolution." to "Forward extremities accumulate and lead to poor performance" on Feb 22, 2017
richvdh (Member) · Feb 22, 2017
I've been looking at this over the last few days, as it appears to be a common cause of poor performance for many people. Conclusions so far follow.
There are two principal causes for the accumulation of extremities:
The first is your server being offline, or unreachable by other servers in the federation. This can lead to a gap in the room DAG. Your server will make an attempt to backfill when it receives events after a gap, but will cap this to 10 events, and the backfill attempt may not succeed. To some extent, this situation is to be expected. However, it is particularly nasty because the accumulation of extremities makes your server perform poorly, which makes it slow to respond to federation requests, which makes other servers more likely to consider your server offline and stop trying to send to it - thus exacerbating the problem.
The second cause is a rejected event. If your server receives an event over federation which it believes was forbidden under the auth rules of the room, it will reject it. However, if other servers in the federation accept it, then it will become part of the DAG as they see it; this means that your server will see a gap in the DAG, and the rejected event's predecessor will become a forward_extremity. This problem is also self-perpetuating, because a rejected event also causes the homeserver's view of the room state to be reset (#1935), which can lead to more rejections (and hence more forward extremities) down the line.
This second cause shouldn't really happen, because we don't expect to see rejections unless someone is doing something nefarious, because all HSes should agree on which events are allowed in the DAG. It clearly is happening though, so my current investigation is focussed on trying to pin down why. I'd also like to do something about #1935, such that when a rejection does happen (through incompetence or malice), it doesn't completely mess everything up thereafter.
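The mechanics behind the first cause can be sketched in a few lines (illustrative only, not synapse's actual implementation): a forward extremity is an event that no other known event lists in its prev_events, so a gap in the DAG strands the event before the gap as an extra extremity.

```python
def forward_extremities(events):
    """events: dict of event_id -> list of prev_event_ids known to this server.
    An event is a forward extremity if no known event references it."""
    referenced = {prev for prevs in events.values() for prev in prevs}
    return sorted(e for e in events if e not in referenced)

# A healthy linear history has exactly one extremity.
linear = {"A": [], "B": ["A"], "C": ["B"]}
print(forward_extremities(linear))  # ['C']

# Suppose the server missed event D (a child of C) while offline, then received
# E (a child of D). E points at an event the server doesn't have, so C is never
# "used up" and both C and E remain forward extremities until the gap is filled.
gapped = {"A": [], "B": ["A"], "C": ["B"], "E": ["D"]}
print(forward_extremities(gapped))  # ['C', 'E']
```

Each unfilled gap adds another dangling extremity, which is why extremities accumulate while backfill is failing.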
ara4n referenced this issue on Feb 23, 2017: hide state redaction events from clients as an emergency temporary measure #1937 (Closed)
richvdh (Member) · Mar 16, 2017
It clearly is happening though, so my current investigation is focussed on trying to pin down why
The rejections appeared to stem from the fact that the state of the room was out of sync from the very start - it looked like events were received over federation while the join was still in progress, and a race condition meant that the state ended up in an invalid, uh, state. Hopefully this will be fixed by #2016.
turt2live (Member) · Aug 6, 2017
This seems to have gotten worse (at least for me) over the last week or so. Every second day I'm having to clear extremities from t2bot.io just to keep the thing running in a reasonable fashion. No apparent consistency between rooms, just 25+ extremities for 10-15 rooms after a couple days.
MacLemon referenced this issue on Aug 31, 2017: Cannot leave unjoined room in “Hotel California”. #2432 (Open)
vberger · Sep 18, 2017
It still seems to be quite a problem (it is for me at least).
If that interests anyone, I monitor the evolution of these extremities with this SQL query (it is a little big because it retrieves the canonical id of offending rooms as well):
SELECT f.count, concat(f.alias, ' (', f.room_id, ')')
FROM (
SELECT t.room_id, t.count, se.event_id, e.content::json->'alias' AS alias
FROM (
SELECT room_id, count(*)
FROM event_forward_extremities
GROUP BY room_id HAVING count(*)>1
) t
LEFT OUTER JOIN current_state_events AS se
ON se.room_id = t.room_id AND se.type = 'm.room.canonical_alias'
LEFT OUTER JOIN events AS e
ON se.event_id = e.event_id
) f;
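The same monitoring query can be exercised on a throwaway sqlite database (a sketch with invented rows; the real schema keeps the alias inside events.content as JSON, hence the content::json->'alias' in the postgres version above, so it is flattened to a plain column here):

```python
import sqlite3

# Reduced stand-ins for synapse's tables; real schemas have many more columns.
con = sqlite3.connect(":memory:")
con.executescript("""
    CREATE TABLE event_forward_extremities (event_id TEXT, room_id TEXT);
    CREATE TABLE current_state_events (event_id TEXT, room_id TEXT, type TEXT);
    CREATE TABLE events (event_id TEXT, alias TEXT);
""")
con.executemany("INSERT INTO event_forward_extremities VALUES (?, ?)",
                [("$1", "!frag"), ("$2", "!frag"), ("$9", "!ok")])
con.execute("INSERT INTO current_state_events VALUES "
            "('$a', '!frag', 'm.room.canonical_alias')")
con.execute("INSERT INTO events VALUES ('$a', '#rust:matrix.org')")

# Count extremities per offending room, then join out to the canonical alias.
rows = con.execute("""
    SELECT t.n, t.room_id, e.alias
    FROM (SELECT room_id, count(*) AS n FROM event_forward_extremities
          GROUP BY room_id HAVING count(*) > 1) AS t
    LEFT OUTER JOIN current_state_events AS se
      ON se.room_id = t.room_id AND se.type = 'm.room.canonical_alias'
    LEFT OUTER JOIN events AS e ON se.event_id = e.event_id
""").fetchall()
print(rows)  # [(2, '!frag', '#rust:matrix.org')] - only the fragmented room shows up
```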
vberger · Sep 18, 2017
From what I see, the worst offenders seem to be IRC-bridged rooms with a high join/part turnover, such as #mozilla_#rust:matrix.org, #mozilla_#rust-offtopic:matrix.org, and #haskell:matrix.org.
Ralith (Contributor) · Sep 20, 2017
I haven't had a serious breakdown or runaway forward extremity accumulation while on #rust for several months, FWIW. It seems that either there was a specific event that triggered it which hasn't recurred in that room, or at least some of the causes have been addressed.
vberger · Sep 21, 2017
I had no catastrophic accumulation, but these rooms sat at around 60-80 extremities. I finally got around to just leaving them, and I must say, my HS has been much more responsive since I did.
qbit · Sep 27, 2017
I just had what I assume was this issue. I had multiple rooms with >6 (35 max) extremities. Synapse became completely unresponsive:

84847 _synapse 64 0 1548M 1114M onproc/0 - 18.4H 61.33% python2.7

I have SYNAPSE_CACHE_FACTOR=0.02. This seems to happen to me about every other week, and I am not in any large channels (confirmed by clearing the cache on all my clients and looking at the rooms I am in). I am also the only user on the homeserver.
Edit: Just hit it again. Looks like one of the room IDs that keeps accumulating extremities is the Matrix HQ room. Is there a way to remove users from rooms via the db? Maybe I can remove them via a cron job?
rogerbraun · Nov 17, 2017
I experience this with the #haskell room on freenode. Is there any way to reset the room or delete it? The extremities come back as soon as I delete them.
ara4n (Member) · Jan 9, 2018
We just had to run this on matrix.org after lots of freenode membership churn seemingly fragmented lots of DAGs, causing a feedback loop where subsequent freenode joins got slower and slower, making freenode grind to a halt.
For the record, the query used was:
BEGIN;
SET enable_seqscan=off;
DELETE FROM event_forward_extremities AS e
USING (
SELECT DISTINCT ON (room_id)
room_id,
last_value(event_id) OVER w AS event_id
FROM event_forward_extremities
NATURAL JOIN events
WINDOW w AS (
PARTITION BY room_id
ORDER BY stream_ordering
range between unbounded preceding and unbounded following
)
ORDER BY room_id, stream_ordering
) AS s,
(
select room_id from event_forward_extremities group by room_id having count(*)>1
) AS x
WHERE
s.room_id = e.room_id
AND e.event_id != s.event_id
AND e.room_id = x.room_id;
COMMIT;

...which took a few minutes to run.
|
We just had to run this on matrix.org after lots of freenode membership churn seemingly fragmented lots of DAGs, causing a feedback loop where subsequent freenode joins got slower and slower, making freenode grind to a halt. For the record, the query used was: BEGIN;
SET enable_seqscan=off;
DELETE FROM event_forward_extremities AS e
USING (
SELECT DISTINCT ON (room_id)
room_id,
last_value(event_id) OVER w AS event_id
FROM event_forward_extremities
NATURAL JOIN events
WINDOW w AS (
PARTITION BY room_id
ORDER BY stream_ordering
range between unbounded preceding and unbounded following
)
ORDER BY room_id, stream_ordering
) AS s,
(
select room_id from event_forward_extremities group by room_id having count(*)>1
) AS x
WHERE
s.room_id = e.room_id
AND e.event_id != s.event_id
AND e.room_id = x.room_id;
COMMIT;...which took a few minutes to run. |
turt2live (Member) · Jan 9, 2018
On the receiving end of many of those membership events, I've also seen extremities skyrocket. Under normal load, extremities accumulate slowly; the last day or so, however, has caused fairly major outages on my end :(
Valodim (Contributor) · Jan 18, 2018
Guys, this is a major problem.
I've been running a synapse instance for a year and a half now (some 50 active users, joining the typical huge channels), and the general experience is that everything is mostly fine as long as no forward extremities accumulate; but as soon as they do (5+), it comes out of nowhere, grinds everything to a halt, and needs manual intervention.
Really, for the first two to three months my impression of admin complexity was "just apt-get upgrade once in a while, you're good. no advanced skills necessary". This has since changed to "better know about these sql statements from that issue on the tracker, or your hs is bound to blow up sooner or later". One admin of a major HS I had talked to told me they'd pretty much just regularly schedule downtimes to run the above "dangerous and risky and not-really-recommended" query. For myself I mostly hope that I'll not be asleep when this happens so I can handle things fast enough to minimize downtimes for my users. Still this is unacceptable reliability for what people want to use as messenger, not to mention the admin load.
I don't know how much this shows on matrix.org, since it's in a special position, but for other HSes I cannot overstate how much of an impact this has on maintaining a synapse instance. Really, please allocate more time to this. There are workable suggestions above; maybe send dummy events to channels when there are more than a few extremities, which is pretty much what I end up doing manually every once in a while.
ara4n (Member) · Jan 19, 2018
Totally agreed. We're actually working on it currently as part of the current wave of synapse optimisation work triggered by running out of CPU on matrix.org last week (https://twitter.com/matrixdotorg/status/951403752522682369 etc)
The internal discussion lives at https://riot.im/develop/#/room/#matrix-core:matrix.org/$1516300707383452uirNl:matrix.org (which is mainly me failing to follow the discussion, but hey)
ndarilek referenced this issue on Feb 1, 2018: SQLite to Postgres migration script has errors #2222 (Open)
turt2live (Member) · Feb 27, 2018
fwiw, I'm still seeing serious performance issues even with that last batch of work. The extremity count gets higher before making an impact though (around 20-30ish per room, ~300 overall).
richvdh (Member) · Feb 28, 2018
I think i'm right in saying that the conclusion from the last wave of work on this was that we should just speed up state resolution so that it doesn't matter that the DAG is fragmented. @richvdh did this end up being #2864, and did it work?
Yes it did, and I think it helped a bit (though probably not so much on matrix.org where the extremity count tends to be fairly low anyway).
The situation is that now we should only have to do state resolution when we get a state event, rather than on every event. (State resolution gets more expensive with more extremities, and with a larger room state.) For rooms where the traffic is dominated by messages rather than state, I would hope #2864 would make a reasonable difference. For rooms where there are a lot of state changes, it probably won't help much.
vberger · Feb 28, 2018
Maybe this is a stupid idea, but would it make sense to have the HS send some kind of "noop event" in the rooms when the number of extremities gets high?
Just an event with no content (is that even possible?), just for the sake of merging the extremities.
richvdh (Member) · Feb 28, 2018
I thought I'd written this down somewhere, but I can't see it anywhere, so here goes:
The main reason not to have every server in the room start sending out "noop" events whenever it sees more than a few extremities is a concern that, in trying to fix things for itself, it will actually make the problem worse for everyone else. We would probably end up with quite a lot of these events and we could end up making the problem worse overall.
It's something that might be worth experimenting with a bit though.
richvdh (Member) · Feb 28, 2018
Ok yes that was discussed at https://riot.im/develop/#/room/#matrix-core:matrix.org/$1516300707383452uirNl:matrix.org, but that conversation got a bit sidetracked by deciding to do #2864 instead. As I wrote there though:
anyway if all the HSes in a room suddenly start generating events which heal their particular holes in the DAG, then the chances are that half of those events won't make it to half the other servers, and we end up in more of a mess than we started
Erik also suggested a solution in which we ignore all but the most recent component of connected events in the graph when calculating the current state of a room, and when sending new events (which is effectively what the query above does). The concern is that this would make it easy for somebody to take over a room by deliberately introducing a split in the DAG and giving themselves admin powers in the new component.
richvdh (Member) · Apr 6, 2018
#1760 (comment) isn't great because it takes several minutes to run, and it has to be done while synapse is down otherwise the results get overwritten by the cache.
An alternative approach is to select the extremities to delete into a temporary table while synapse is running, and then just shut it down to do the actual delete:
BEGIN;
SET enable_seqscan=off;
SELECT e.event_id INTO extrems_to_delete FROM event_forward_extremities AS e,
(
SELECT DISTINCT ON (room_id)
room_id,
last_value(event_id) OVER w AS event_id
FROM event_forward_extremities
NATURAL JOIN events
WINDOW w AS (
PARTITION BY room_id
ORDER BY stream_ordering
range between unbounded preceding and unbounded following
)
ORDER BY room_id, stream_ordering
) AS s,
(
select room_id from event_forward_extremities group by room_id having count(*)>1
) AS x
WHERE
s.room_id = e.room_id
AND e.event_id != s.event_id
AND e.room_id = x.room_id;
COMMIT;
-- shut synapse down here
DELETE FROM event_forward_extremities WHERE event_id IN (SELECT event_id FROM extrems_to_delete);
-- start it up again
DROP TABLE extrems_to_delete;
ara4n commented Jan 5, 2017 (edited by richvdh, May 24, 2017)
TLDR: To determine if you are affected by this problem, run the following query:
Any rows showing a count of more than a handful (say 10) are cause for concern. You can probably gain some respite by running the query at #1760 (comment) for each room_id, and then restarting synapse.
Whilst investigating the cause of heap usage spikes in synapse, correlating jumps in RSZ with logs showed that 'resolving state for !curbaf with 49 groups' loglines took ages to execute and would temporarily take loads of heap (resulting in a permanent hike in RSZ, as python is bad at reclaiming heap).
On looking at the groups being resolved, it turned out that these were the extremities of the current room: whenever synapse queries the current room state, it has to merge these all together, and the implementation of that merge is currently very slow. To clear the extremities, one has to talk in the room (each message 'heals' up to 10 extremities, as the max prev-events for a message is 10).
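The 'healing' behaviour can be sketched as follows (a toy model, not synapse code): each new event names up to 10 of the current extremities as its prev_events, replacing them all with itself.

```python
MAX_PREV_EVENTS = 10  # synapse caps prev_events per message at 10

def send_message(extremities, new_event_id):
    # The new event's prev_events are the first MAX_PREV_EVENTS extremities;
    # those stop being extremities, and the new event becomes one.
    return extremities[MAX_PREV_EVENTS:] + [new_event_id]

exts = [f"$old{i}" for i in range(25)]  # a badly fragmented room
for msg in ("$msg1", "$msg2", "$msg3"):
    exts = send_message(exts, msg)
    print(len(exts))  # 16, then 7, then 1: three messages heal 25 extremities
```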
Problems here are: