Forward extremities accumulate and lead to poor performance #1760

ara4n · 2017-01-05T02:56:26Z

TLDR: To determine if you are affected by this problem, run the following query:

select room_id, count(*) c from event_forward_extremities group by room_id order by c desc limit 20;

Any rows showing a count of more than a handful (say 10) are cause for concern. You can probably gain some respite by running the query at #1760 (comment).

Whilst investigating the cause of heap usage spikes in synapse, correlating jumps in RSZ with logs showed that 'resolving state for !curbaf with 49 groups' loglines took ages to execute and would temporarily take loads of heap (resulting in a permenant hike in RSZ, as python is bad at reclaiming heap).

On looking at the groups being resolved, it turns out that these were the extremities of the current room, and whenever the synapse queries the current room state, it has to merge these all together, whose implementation is currently very slow. To clear the extremities, one has to talk in the room (each message 'heals' 10 extremities, as max prev-events for a message is 10).

Problems here are:

Why are we accumulating so many extremities? I assume it's whenever there's some downtime the graph breaks, leaving a dangling node.
Is there a way to stop them accumulating by healing or discarding them on launch (e.g. by sending a null healing event into the room)?
Why is state resolution so incredibly heavy? There should hardly be any conflicting state here, unless the bifurcation has been going on for months. Is it because to auth potential conflicts we have to load all auth events, which include every m.room.member
Logs of a state resolution happening from arasphere at DEBUG show lots of thrashing on the rejections table too.
We're also seeing ominous pauses in the logging of requests which resolve state, as if there's some lock we're contending for. (This might be the same as Ralith's HS OOMed after repeatedly loading events in get_current_user_in_room #1774)
Can we just insert dummy nodes in our local copy of the DAG after doing a successful state resolution, to avoid having to constantly re-calculate it or rely on naive caching?

The text was updated successfully, but these errors were encountered:

ara4n · 2017-01-05T15:13:17Z

More logging for resolving state groups was added in #1767 which will hopefully help explain this

ara4n · 2017-01-15T01:49:57Z

As a workaround for people with seriously fragmented rooms (e.g. @Half-Shot has 209 extremities in #mozilla_#rust:matrix.org atm):

delete from event_forward_extremities where
    room_id in (select room_id from event_forward_extremities group by room_id having count(*)>1) and
    event_id not in 
        (select max(event_id) from event_forward_extremities where
            room_id in (select room_id from event_forward_extremities group by room_id having count(*)>1)
         group by room_id);

...is a dangerous and risky and not-really-recommended solution which will remove all but the newest extremities from rooms with multiple extremities. If it leaves the 'wrong' extremity for the room, bad things could happen, however.

it's useful if your server is so hosed you can't otherwise send dummy messages into the room to heal it. it should be run whilst the server is shutdown. so far we haven't seen it make problems worse; only better.

erikjohnston · 2017-01-16T10:08:23Z

select max(event_id) from event_forward_extremities

won't necessarily give you the latest event id. It may sort of work given that synapse sends out events with an auto-incrementing integer at the front, but that won't be true across different servers.

To get the latest you'd need to compare the stream_orderings.

erikjohnston · 2017-01-16T10:18:18Z

DELETE FROM event_forward_extremities AS e
USING ( 
    SELECT DISTINCT ON (room_id)
    room_id,
    last_value(event_id) OVER w AS event_id
    FROM event_forward_extremities
    NATURAL JOIN events
    WINDOW w AS (
        PARTITION BY room_id
        ORDER BY stream_ordering
        range between unbounded preceding and unbounded following
    )
    ORDER BY room_id, stream_ordering
) AS s
WHERE
    s.room_id = e.room_id
    AND e.event_id != s.event_id
    AND e.room_id = '!jpZMojebDLgJdJzFWn:matrix.org';

is probably more how you can do it on postgres

richvdh · 2017-02-22T21:56:51Z

I've been looking at this over the last few days, as it appears to be a common cause of poor performance for many people. Conclusions so far follow.

There are two principal causes for the accumulation of extremities:

The first is your server being offline, or unreachable by other servers in the federation. This can lead to a gap in the room DAG. Your server will make an attempt to backfill when it receives events after a gap, but will cap this to 10 events, and the backfill attempt may not succeed. To some extent, this situation is to be expected. However, it is particularly nasty because the accumulation of extremities makes your server perform poorly, which makes it slow to respond to federation requests, which makes other servers more likely to consider your server offline and stop trying to send to it - thus exacerbating the problem.

The second cause is a rejected event. If your server receives an event over federation which it believes was forbidden under the auth rules of the room, it will reject it. However, if other servers in the federation accept it, then it will become part of the DAG as they see it; this means that your server will see a gap in the DAG, and the rejected event's predecessor will become a forward_extremity. This problem is also self-perpetuating, because a rejected event also causes the homeserver's view of the room state to be reset (#1935), which can lead to more rejections (and hence more forward extremities) down the line.

This second cause shouldn't really happen, because we don't expect to see rejections unless someone is doing something nefarious, because all HSes should agree on which events are allowed in the DAG. It clearly is happening though, so my current investigation is focussed on trying to pin down why. I'd also like to do something about #1935, such that when a rejection does happen (through incompetence or malice), it doesn't completely mess everything up thereafter.

richvdh · 2017-03-16T10:52:38Z

It clearly is happening though, so my current investigation is focussed on trying to pin down why

The rejections appeared to stem from the fact that the state of the room was out of sync from the very start - it looked like events were received over federation while the join was still in progress, and a race condition meant that the state ended up in an invalid, uh, state. Hopefully this will be fixed by #2016.

turt2live · 2017-08-06T20:33:05Z

This seems to have gotten worse (at least for me) over the last week or so. Every second day I'm having to clear extremities from t2bot.io just to keep the thing running in a reasonable fashion. No apparent consistency between rooms, just 25+ extremities for 10-15 rooms after a couple days.

elinorbgr · 2017-09-18T16:56:03Z

It still seems to be quite a problem (it is for me at least).

If that interests anyone, I monitor the evolution of these extremities with this SQL query (it is a little big because it retrieves the canonical id of offending rooms as well):

SELECT f.count, concat(f.alias, ' (', f.room_id, ')')
    FROM (
        SELECT t.room_id, t.count, se.event_id, e.content::json->'alias' AS alias
        FROM (
            SELECT room_id, count(*) 
            FROM event_forward_extremities
            GROUP BY room_id HAVING count(*)>1
        ) t 
        LEFT OUTER JOIN current_state_events AS se
            ON se.room_id = t.room_id AND se.type = 'm.room.canonical_alias'
        LEFT OUTER JOIN events AS e
            ON se.event_id = e.event_id
    ) f;

elinorbgr · 2017-09-18T16:58:23Z

From what I see, worst offenders seem to be IRC-bridged rooms with a high join/part turnover. Such as #mozilla_#rust:matrix.org, #mozilla_#rust-offtopic:matrix.org, and #haskell:matrix.org

Ralith · 2017-09-20T21:00:37Z

I haven't had a serious breakdown or runaway forward extremity accumulation while on #rust for several months, FWIW. It seems that either there was a specific event that triggered it which hasn't recurred in that room, or at least some of the causes have been addressed.

elinorbgr · 2017-09-21T07:57:19Z

I had no catastrophic accumulation, but these rooms sat at around 60-80 extremities.

I finally got around just leaving these rooms, and I must tell, my HS is much more responsive since I've done that.

qbit · 2017-09-27T00:34:22Z

I just had what I assume was this issue. I had multiple rooms with >6 (35 max) extremities. Synapse became completely unresponsive: 84847 _synapse 64 0 1548M 1114M onproc/0 - 18.4H 61.33% python2.7 - I have SYNAPSE_CACHE_FACTOR=0.02. This seems to happen to me about every other week and I am not in any large channels (confirmed by clearing cache on all my clients and looking at rooms I am in). I am also the only user on the homeserver.

Edit: Just hit it again. Looks like one of the room IDs that are getting extremities the matrix hq room. Is there a way to remove users from rooms via the db? Maybe I can remove them via a cron job?

rogerbraun · 2017-11-17T13:37:47Z

I experience this with the #haskell room on freenode. Is there any way to reset the room or delete it? The extremities come back as soon as I delete them.

ara4n · 2018-01-09T12:27:51Z

We just had to run this on matrix.org after lots of freenode membership churn seemingly fragmented lots of DAGs, causing a feedback loop where subsequent freenode joins got slower and slower, making freenode grind to a halt.

For the record, the query used was:

BEGIN;
SET enable_seqscan=off;
DELETE FROM event_forward_extremities AS e
USING ( 
    SELECT DISTINCT ON (room_id)
    room_id,
    last_value(event_id) OVER w AS event_id
    FROM event_forward_extremities
    NATURAL JOIN events
    WINDOW w AS (
        PARTITION BY room_id
        ORDER BY stream_ordering
        range between unbounded preceding and unbounded following
    )
    ORDER BY room_id, stream_ordering
) AS s,
    (
    select room_id from event_forward_extremities group by room_id having count(*)>1 
) AS x
WHERE
    s.room_id = e.room_id
    AND e.event_id != s.event_id
    AND e.room_id = x.room_id;
COMMIT;

...which took a few minutes to run.

turt2live · 2018-01-09T15:53:15Z

On the receiving end of many of those membership events, I've also seen extremities skyrocket. Under normal load, extremities accumulate slowly, however the last day or so has caused fairly major outages on my end :(

Valodim · 2018-01-18T10:37:07Z

Guys, this is a major problem.

I've been running a synapse instance for a year and a half now (some 50 active users, joining the typical huge channels), and the general experience is that everything is mostly fine as long as no forward extremities are accumulated, but as soon as it happens (5+), it comes out of nowhere and, grinds everything to a halt and needs manual intervention.

Really, for the first two to three months my impression of admin complexity was "just apt-get upgrade once in a while, you're good. no advanced skills necessary". This has since changed to "better know about these sql statements from that issue on the tracker, or your hs is bound to blow up sooner or later". One admin of a major HS I had talked to told me they'd pretty much just regularly schedule downtimes to run the above "dangerous and risky and not-really-recommended" query. For myself I mostly hope that I'll not be asleep when this happens so I can handle things fast enough to minimize downtimes for my users. Still this is unacceptable reliability for what people want to use as messenger, not to mention the admin load.

I don't know how much this shows on matrix.org since it's kind of in a special position, but for other HSes I cannot overstate how much of an impact this has on maintaining a synapse instance. Really, please allocate more time to this. There are workable suggestions above, maybe send dummy events to channels when there is more than a few extremities - this is pretty much what I end up doing manually every once in a while.

kroeckx · 2019-08-13T22:13:31Z

So I was getting around 24 for a few weeks now for #mozilla_#rust:matrix.org, and today it changed to 74, and things are really slow again. This is using 1.2.1.

ara4n · 2019-08-14T00:25:42Z

@kroeckx can you turn on the cleanup_extremities_with_dummy_events: true button and see if that fixes it?

kroeckx · 2019-08-14T20:06:02Z

I already ran the query. I've now enabled it. I still see 2 rooms with 4 extremities.

kroeckx · 2019-08-15T13:32:43Z

So the rust room grew to 6 again so far.

ara4n · 2019-08-15T13:43:00Z

The way that option works is to only kick in and suppress the extremities if they grow beyond 10. Are you seeing perf issues currently?

kroeckx · 2019-08-15T13:47:00Z

I guess this is still normal. I'll let you know if I ever see it higher than 10 again.

kroeckx · 2019-08-22T20:13:26Z

So I see the values change over time, but they stay below 10.

ara4n · 2019-08-22T21:39:22Z

this is as expected. the question is whether you hit perf problems or if 10 is an adequate threshold.

kroeckx · 2019-08-22T22:16:16Z

I can't say I see performance differences. The only time I could clearly see a difference is when it was much higher.

richvdh · 2019-08-26T10:37:32Z

So it sounds like the new setting is correctly keeping rooms below 10 extremities, so it is no longer causing a performance problem?

MurzNN · 2019-09-01T08:35:15Z

On our not-too-large public ru-matrix.org server, with Synapse v1.3.1 having too much cpu/load usage, extremities count are too large:

!iEiJZbwrOzEkZNjsYf:matrix.org	 - 613
!YynUnYHpqlHuoTAjsp:matrix.org	 - 357
!TKuijUGqELNcsYRdMa:tchncs.de	 - 309

and more than 12 rooms with >100 count! After cleaning up with sql queries, there are becomes one room with 2 count, all other with 1, after restart groves to 5-7 count, and cpu/mem/load groves down dramatically, thanks! Does we need to repeat this cleanups periodically?

aaronraimist · 2019-09-01T15:32:49Z

@MurzNN Yes or you could enable the experimental cleanup_extremities_with_dummy_events: true in your homeserver config to automatically clean them up. @ara4n can you add that to the TLDR in the issue?

neilisfragile · 2019-10-04T14:01:57Z

We no longer consider #5480 to be experimental and have enabled it by default in 1.4.0. So death by extremity buildup should be a thing of the past.

Feedback very much welcome. I'll leave this issue open for now in case folks still have problems.

richvdh · 2019-12-19T10:45:42Z

I think we can consider this closed now.

benhylau · 2019-12-19T16:20:17Z

Glad this one didn't make it to 2020

PC-Admin · 2020-11-30T03:45:59Z

I seem to be getting these extremities still, I cleared them out about a month ago and since then these new forward extremities have accumulated:

                 room_id                  | c  
------------------------------------------+----
 !vUXcVeSXkoAGbIREoj:matrix.org           | 99
 !vJPDJLWMJbPnKlwyzE:matrix.org           | 66
 !jvKJuwwWRGDzBhLurm:matrix.org           | 42
 !ifEXJOtimDZnbaYObc:matrix.org           | 26
 !xGmquSnSozzjgUDBIB:matrix.org           | 16
 !BfPzMJuQMaOmBzFOdD:matrix.org           | 11
 !awLQkSooHfMgAfXwCj:matrix.org           | 10
 !bzQWZFKZWVJEvEdVqE:matrix.org           | 10
 !YaQDIfXXPIjxTtkKZv:matrix.org           | 10
 !lxLFDpQJHZtqePeIUK:matrix.org           | 10
 !DBSOFOSAeAPWhqLARZ:matrix.org           | 10
 !bVLqshyvYOtNWfudcg:matrix.org           | 10
 !Blade_Sparxx_Pats_King_Anubis:ponies.im | 10
 !dseyIKgGxDaWVXpeHf:matrix.org           | 10
 !MFogdGJfnZLrDmgkBN:matrix.org           | 10
 !jlAFXBRrJBvyWsnEsc:matrix.org           | 10
 !KIwsdPeEnTTmPnMpMv:fam-ribbers.com      | 10
 !kfXkmFVLeQDBFexcew:matrix.org           | 10
 !geNjbnNYhhankxHbDh:matrix.org           | 10
 !YJPnqWJENzArNZWzIz:mozilla.org          | 10

My Synapse version: 1.23.0

turt2live · 2020-12-02T04:00:36Z

fwiw, a query to help identify the room names and such of those rooms is:

select rss.room_id, rss.name, (select count(*) from event_forward_extremities as efe where efe.room_id = rss.room_id) as extremities, rss.join_rules, rsc.current_state_events, rsc.joined_members, rsc.invited_members, rsc.left_members, rsc.local_users_in_room from room_stats_state as rss join room_stats_current as rsc on rsc.room_id = rss.room_id where rss.room_id in (select room_id from event_forward_extremities group by room_id order by count(*) desc limit 20);

It's not pretty, but it does at least give a bit more context if one is trying to determine if a particular room even needs to be on the server.

dklimpel · 2021-01-10T15:23:09Z

There is a related PR #9062

MurzNN · 2021-02-03T11:15:12Z

PR is merged, cool! But how to reveal rooms, that have too much extremities, via admin API call, instead of old-school raw SQL query?

This was referenced Jan 5, 2017

Server ended up doing state group resolution on pretty much every CS request. #1761

Closed

prev_events are not capped at 10 seemingly #1759

Closed

ara4n added the federation-meltdown label Jan 5, 2017

This was referenced Jan 8, 2017

export metrics that characterise heap spikes #1779

Closed

Ralith's HS OOMed after repeatedly loading events in get_current_user_in_room #1774

Closed

ara4n added the p1 label Jan 9, 2017

erikjohnston mentioned this issue Jan 13, 2017

Optimise state resolution #1808

Closed

ara4n mentioned this issue Jan 16, 2017

Freenode channels should be re-introduced to the main Matrix.org room dir element-hq/element-web#2936

Closed

richvdh changed the title ~~Synapse's memory usage temporarily spikes by ~1GB when performing state group resolution.~~ Forward extremities accumulate and lead to poor performance Feb 22, 2017

ara4n mentioned this issue Feb 22, 2017

Stuck in "empty" #matrix-dev:matrix.org room #1933

Closed

ara4n mentioned this issue Feb 23, 2017

hide state redaction events from clients as an emergency temporary measure #1937

Closed

seanenck mentioned this issue Apr 10, 2017

Root causing performance issues #2114

Closed

MacLemon mentioned this issue Aug 31, 2017

Cannot leave unjoined room in “Hotel California”. #2432

Closed

MilkManzJourDaddy mentioned this issue Dec 5, 2017

FAQ & Glossary: Forward Extremities MilkManzJourDaddy/matrix-org#14

Open

chr-1x mentioned this issue Sep 13, 2019

I have a large number of rejected events from matrix.org #6018

Closed

ara4n mentioned this issue Sep 22, 2019

Thoughts on improving chat systems eeeeeta/blog-comments#5

Open

richvdh closed this as completed Dec 19, 2019

richvdh mentioned this issue Jan 16, 2020

No forward extremities left! #6708

Closed

f0x52 mentioned this issue Jun 2, 2020

Users stop syncing, postgres queries at high CPU #7618

Closed

richvdh mentioned this issue Dec 10, 2020

event_forward_extremities are not correctly cleared by persistence of successor events #8913

Open

jaywink mentioned this issue Jan 11, 2021

Add forward extremities endpoint to rooms admin API #9062

Merged

awesome-manuel mentioned this issue May 4, 2021

Add a new tab to rooms with forward extremities Awesome-Technologies/synapse-admin#107

Merged

clokep mentioned this issue Feb 2, 2022

high CPU load by "matrix- _process_incoming_pdus_in_room_inner" #11818

Closed

DMRobertson added A-Federation and removed z-federation-meltdown labels Aug 25, 2022

Forward extremities accumulate and lead to poor performance #1760

Forward extremities accumulate and lead to poor performance #1760

Comments

ara4n commented Jan 5, 2017 • edited by richvdh Loading

ara4n commented Jan 5, 2017

ara4n commented Jan 15, 2017 • edited Loading

erikjohnston commented Jan 16, 2017

erikjohnston commented Jan 16, 2017 • edited by richvdh Loading

richvdh commented Feb 22, 2017

richvdh commented Mar 16, 2017

turt2live commented Aug 6, 2017

elinorbgr commented Sep 18, 2017 • edited by ara4n Loading

elinorbgr commented Sep 18, 2017

Ralith commented Sep 20, 2017

elinorbgr commented Sep 21, 2017

qbit commented Sep 27, 2017 • edited Loading

rogerbraun commented Nov 17, 2017

ara4n commented Jan 9, 2018 • edited Loading

turt2live commented Jan 9, 2018

Valodim commented Jan 18, 2018 • edited Loading

kroeckx commented Aug 13, 2019

ara4n commented Aug 14, 2019

kroeckx commented Aug 14, 2019

kroeckx commented Aug 15, 2019

ara4n commented Aug 15, 2019

kroeckx commented Aug 15, 2019

kroeckx commented Aug 22, 2019

ara4n commented Aug 22, 2019

kroeckx commented Aug 22, 2019

richvdh commented Aug 26, 2019

MurzNN commented Sep 1, 2019

aaronraimist commented Sep 1, 2019 • edited Loading

neilisfragile commented Oct 4, 2019 • edited by ara4n Loading

richvdh commented Dec 19, 2019

benhylau commented Dec 19, 2019

PC-Admin commented Nov 30, 2020

turt2live commented Dec 2, 2020

dklimpel commented Jan 10, 2021

MurzNN commented Feb 3, 2021 • edited Loading

ara4n commented Jan 5, 2017 •

edited by richvdh

Loading

ara4n commented Jan 15, 2017 •

edited

Loading

erikjohnston commented Jan 16, 2017 •

edited by richvdh

Loading

elinorbgr commented Sep 18, 2017 •

edited by ara4n

Loading

qbit commented Sep 27, 2017 •

edited

Loading

ara4n commented Jan 9, 2018 •

edited

Loading

Valodim commented Jan 18, 2018 •

edited

Loading

aaronraimist commented Sep 1, 2019 •

edited

Loading

neilisfragile commented Oct 4, 2019 •

edited by ara4n

Loading

MurzNN commented Feb 3, 2021 •

edited

Loading