
high CPU load by "matrix- _process_incoming_pdus_in_room_inner" #11818

Closed
zxyz opened this issue Jan 24, 2022 · 7 comments
Labels
X-Needs-Info This issue is blocked awaiting information from the reporter

Comments

@zxyz

zxyz commented Jan 24, 2022

Description

Since around November 24th last year I've been observing strange metrics in Synapse.
That was the day I upgraded Synapse from 1.45.1 to 1.46 (the only change that day).

CPU load went up, together with some other metrics (I'm referring to the Synapse Grafana dashboard; see below for screenshots):

  • DB transactions by total txn time -> matrix- get_auth_chain_difference
  • Number of events in federation staging area
  • Age of oldest event in staging area
  • DB usage by background jobs (including scheduling time) -> matrix- _process_incoming_pdus_in_room_inner

What I tried:

  • I disabled presence for a while -> no change
  • nothing noticeable in the logs (at least to me)
  • I compressed state, vacuumed Postgres and purged old media files
  • no change after a container restart or reboot
  • no change between night and day (fewer users at night)
  • no change after several updates

Version information

  • Debian 11
  • docker using the amazing matrix-docker-ansible-deploy playbook
  • in November when it started it was running synapse 1.46 AND postgres 13.x.
  • Current Synapse version is 1.50.2, postgres 14.1.
  • DB storage is on SSDs, 8 CPU cores, 24G RAM
  • no room complexity set

Graphs

[screenshot: CPU usage]

[screenshot: CPU usage by background jobs]

  • red: matrix- _process_incoming_pdus_in_room_inner
  • orange: federation_transaction_transmission_loops
  • blue: matrix_persists_events

[screenshot: cache misses]
[screenshot: average DB transaction time]

  • red: "state resolve events"
  • violet: persist_events: get new state after events
  • orange: resolve state groups for events

[screenshot: federation staging area]
[screenshot: oldest events in staging area]
[screenshot: resolve events]

Please let me know if you'd like to have some other screenshots or information.

Thanks a lot!

@zxyz
Author

zxyz commented Jan 25, 2022

Actually, there is a day/night difference to be seen:
[screenshot: CPU usage over day/night]

It looks like it correlates a bit with the PDU and EDU rates:

[screenshot: PDU/EDU rates]

I adjusted the cache factor from 0.5 to 1.0. It didn't change anything.
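
(For reference, a minimal sketch of that change in homeserver.yaml, assuming the caches.global_factor setting; the matrix-docker-ansible-deploy playbook exposes its own variable for this, which isn't shown here.)

# homeserver.yaml -- raise the global cache factor from 0.5 to 1.0
caches:
  global_factor: 1.0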

@jaywink
Member

jaywink commented Jan 27, 2022

I'm seeing similar things in EMS land. Four customer hosts, three of them in one particular Libera room. All show graphs similar to the ones below. All of them seem to start having issues with a large spike in inbound federation, causing things to lock up.

CPU spikes and stays pinned:

[screenshot]

Federation spike to start things:
[screenshot]

The background jobs causing the issue are _process_incoming_pdus_in_room_inner and Lock._renew:

[screenshot]

The database is doing lots of renew_lock, confirmed to be the most expensive thing on the database server side:

[screenshot]

It seems to be persisting something

[screenshot]

I restarted one of the Synapse instances and the "lock" seems to disappear.

[screenshot]

The lock renew activity disappears, replaced by actually persisting events a few spiky times, then flattening out.

[screenshot]

Something got locked up and it cleared on restart?

All four of these customer hosts are on Synapse v1.51.0 with an otherwise low amount of activity/users. I've pinged the internal backend room with the customer names in case logs would be helpful. Times in the graphs above are UTC+2.

@clokep
Contributor

clokep commented Feb 1, 2022

What EMS was seeing sounds a little different, since I'm assuming @zxyz has restarted the service at some point since this started?

It might be useful to grep the logs for "biggest rooms for state-res" after configuring the synapse.state.metrics logger for debug logging. I would hope that would show whether it is all due to a single room, but it might not.
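
A minimal sketch of what that looks like in the log config YAML (assuming the standard Python dictConfig layout that Synapse's log config uses):

# log config YAML -- enable the periodic "biggest rooms for state-res" lines
loggers:
  synapse.state.metrics:
    level: DEBUG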

@clokep clokep added the X-Needs-Info This issue is blocked awaiting information from the reporter label Feb 1, 2022
@zxyz
Author

zxyz commented Feb 2, 2022

Thanks @clokep for the hint. Synapse and also the VM have been restarted/upgraded several times.

I let the debug log run overnight. There are around 300 entries containing `state-res`; here is an excerpt:

1 biggest rooms for state-res by DB time: ['!OGEhHVWSdvArJzumhm:matrix.org (159.379s)']
3 biggest rooms for state-res by CPU time: ['!OGEhHVWSdvArJzumhm:matrix.org (86.4182s)
2 biggest rooms for state-res by DB time: ['!OGEhHVWSdvArJzumhm:matrix.org (163.432s)'
1 biggest rooms for state-res by CPU time: ['!OGEhHVWSdvArJzumhm:matrix.org (73.8072s)']
1 biggest rooms for state-res by CPU time: ['!OGEhHVWSdvArJzumhm:matrix.org (64.4513s)']                          
1 biggest rooms for state-res by DB time: ['!OGEhHVWSdvArJzumhm:matrix.org (139.385s)']                          
1 biggest rooms for state-res by CPU time: ['!OGEhHVWSdvArJzumhm:matrix.org (32.9124s)']
1 biggest rooms for state-res by DB time: ['!OGEhHVWSdvArJzumhm:matrix.org (72.2949s)']

"!OGEhHVWSdvArJzumhm:matrix.org" is Matrix HQ if I'm not mistaken.

The other rooms in the "3 biggest rooms" entries had much lower values for DB and CPU; I'm not posting them here because I don't know whether they're private rooms and I don't want to reveal anything private :).

We were already considering setting "complexity: 200.0" on the server to basically just kick out "Matrix HQ" because of its excessive resource usage (state...). So maybe I'll try this now. Or are there other options?
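
For reference, a sketch of what that would look like in homeserver.yaml, assuming the limit_remote_rooms option (the error message is just illustrative):

# homeserver.yaml -- refuse to join remote rooms above a complexity threshold
limit_remote_rooms:
  enabled: true
  complexity: 200.0
  complexity_error: "This room is too complex for this homeserver."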

But with your synapse.state.metrics hint I found #8612 and checked for room extremities with

select room_id, count(*) c from event_forward_extremities group by room_id order by c desc limit 20;

There are a few rooms with a count of 3 or 4 and one with 5. But there is one room with a count of 109 (!) (a federated room at privacytools.io). My Prometheus data goes back 90 days and this room has had a count of 109 the whole time. This room also shows higher state-res times than most others in the debug log:

689 - DEBUG - sentinel - 2 biggest rooms for state-res by CPU time: ['!OGEhHVWSdvArJzumhm:matrix.org (32.7982s)', '!<ROOMID>:privacytools.io (15.8225s)']
689 - DEBUG - sentinel - 2 biggest rooms for state-res by DB time: ['!OGEhHVWSdvArJzumhm:matrix.org (78.7121s)', '!<ROOMID>:privacytools.io (5.36405s)']
689 - DEBUG - sentinel - 2 biggest rooms for state-res by CPU time: ['!OGEhHVWSdvArJzumhm:matrix.org (68.9037s)', '!<ROOMID>:privacytools.io (6.03624s)']
689 - DEBUG - sentinel - 2 biggest rooms for state-res by DB time: ['!OGEhHVWSdvArJzumhm:matrix.org (157.548s)', '!<ROOMID>:privacytools.io (1.8746s)']

But should/can I do anything about this room?

Thanks a lot, much appreciated!

PS: Synapse was upgraded to 1.51.0 in the meantime; no change.

@clokep
Contributor

clokep commented Feb 2, 2022

"!OGEhHVWSdvArJzumhm:matrix.org" is Matrix HQ if I'm not mistaken.

> We were already considering setting "complexity: 200.0" on the server to basically just kick out "Matrix HQ" because of its excessive resource usage (state...). So maybe I'll try this now. Or are there other options?

Note that I don't believe setting that now will have your users leave Matrix HQ. You would want to delete the room afterward: https://matrix-org.github.io/synapse/develop/admin_api/rooms.html#delete-room-api (please read through that first before just running the commands!)

> But with your synapse.state.metrics hint I found #8612 and checked for room extremities with
>
> select room_id, count(*) c from event_forward_extremities group by room_id order by c desc limit 20;

> But should/can I do anything about this room?

See the forward extremities admin API: https://matrix-org.github.io/synapse/develop/admin_api/rooms.html#forward-extremities-admin-api. You might want to delete the forward extremities (this pretty much creates an event to simplify the overall DAG, which should help with state res). It mentions reading through #1760 though; we should probably move the important bits of that into our documentation. 😢
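
As a hedged sketch, those calls would look something like the following (room ID elided as above; please double-check against the linked docs before running the DELETE):

# list the forward extremities of the problem room
curl --header "Authorization: Bearer <api-token>" -XGET 'http://localhost:8008/_synapse/admin/v1/rooms/!<ROOMID>:privacytools.io/forward_extremities'

# delete the extra forward extremities for that room
curl --header "Authorization: Bearer <api-token>" -XDELETE 'http://localhost:8008/_synapse/admin/v1/rooms/!<ROOMID>:privacytools.io/forward_extremities'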

@zxyz
Author

zxyz commented Feb 3, 2022

Thanks a lot @clokep, deleting the forward extremities did the trick!

> Note that I don't believe setting that now will have your users leave Matrix HQ. You would want to delete the room afterward: https://matrix-org.github.io/synapse/develop/admin_api/rooms.html#delete-room-api (please read through that first before just running the commands!)

I set complexity: 200.0 and restarted Synapse, but to my surprise I was still able to join Matrix HQ (with an account that had never been in the room before), which has a complexity > 800 (if I'm correct that complexity = state events / 500).

Any ideas why this still worked?

Then I deleted the room with:

curl --header "Authorization: Bearer <api-token>" -XDELETE 'http://localhost:8008/_synapse/admin/v1/rooms/!OGEhHVWSdvArJzumhm:matrix.org' --data '{"block": true, "purge": true}'

Which failed after ~ 15 minutes with {"errcode":"M_UNKNOWN","error":"Internal server error"}. I re-ran the command and it finished without output. Trying to join Matrix HQ failed with "Failed to join room This room has been blocked on this server" -> cool!

Re-running the command again results in {"kicked_users":[],"failed_to_kick_users":[],"local_aliases":[],"new_room_id":null}

But the data is still there:

SELECT rss.name, s.room_id, count(s.room_id) FROM state_groups_state s
LEFT JOIN room_stats_state rss USING (room_id)
GROUP BY s.room_id, rss.name
ORDER BY count(s.room_id) DESC
LIMIT 1;

yields:

name          |            room_id             |  count
-----------------------+--------------------------------+----------
                       | !OGEhHVWSdvArJzumhm:matrix.org | 43861911

So now I'd have to delete it by hand in the database (any hints on how best to do this)?

> See the forward extremities admin API: https://matrix-org.github.io/synapse/develop/admin_api/rooms.html#forward-extremities-admin-api. You might want to delete the forward extremities (this pretty much creates an event to simplify the overall DAG, which should help with state res). It mentions reading through #1760 though; we should probably move the important bits of that into our documentation. 😢

I deleted the "forward extremities" in the database with the Postgres command from [1], and the CPU load normalized, as did "federation staging area events", "oldest events in staging area" and GC activity.
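
For anyone landing here later: the exact query is in the linked comment. As a rough illustration of the idea only (keep just the newest forward extremity for the affected room), it is something along these lines; stop Synapse and take a database backup before running anything like this.

-- illustrative only, NOT the exact query from [1]:
-- drop all but the most recently persisted forward extremity of one room
DELETE FROM event_forward_extremities
WHERE room_id = '!<ROOMID>:privacytools.io'
  AND event_id NOT IN (
    SELECT e.event_id
    FROM events e
    JOIN event_forward_extremities f USING (event_id)
    WHERE e.room_id = '!<ROOMID>:privacytools.io'
    ORDER BY e.stream_ordering DESC
    LIMIT 1
  );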

Tbh I still don't understand what "forward extremities" are and wasn't able to enlighten myself with web research. I'd be really thankful if someone could enlighten me here! :)

Having this documented would be really nice! 🐝

I'm closing the ticket. Thanks so much for your help @clokep! 💜

[1] #1760 (comment)

@zxyz zxyz closed this as completed Feb 3, 2022
@clokep
Contributor

clokep commented Feb 3, 2022

> Note that I don't believe setting that now will have your users leave Matrix HQ. You would want to delete the room afterward: https://matrix-org.github.io/synapse/develop/admin_api/rooms.html#delete-room-api (please read through that first before just running the commands!)
>
> I set complexity: 200.0 and restarted Synapse, but to my surprise I was still able to join Matrix HQ (with an account that had never been in the room before), which has a complexity > 800 (if I'm correct that complexity = state events / 500).
>
> Any ideas why this still worked?

The setting only applies to new rooms; this allows admins to join rooms that are more complex and still have them work.

> Which failed after ~ 15 minutes with {"errcode":"M_UNKNOWN","error":"Internal server error"}.

It probably would have been better to use the v2 version of that endpoint, which is async.
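
A sketch of the v2 call, assuming the endpoints in the current admin API docs (it returns a delete_id right away and runs the deletion in the background):

# start an asynchronous room deletion and get a delete_id back
curl --header "Authorization: Bearer <api-token>" -XDELETE 'http://localhost:8008/_synapse/admin/v2/rooms/!OGEhHVWSdvArJzumhm:matrix.org' --data '{"block": true, "purge": true}'

# poll the status of the deletion(s) for that room
curl --header "Authorization: Bearer <api-token>" -XGET 'http://localhost:8008/_synapse/admin/v2/rooms/!OGEhHVWSdvArJzumhm:matrix.org/delete_status'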

> But the data is still there:
>
> So now I'd have to delete it by hand in the database (any hints on how best to do this)?

Sounds like something in the purging broke, unfortunately. I don't have any ideas for a good way to poke at that. Most of the tables have a room_id column, which should make it not too bad though.
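
As a heavily hedged illustration of what that could look like for the leftover state tables (this is not a complete manual purge; other tables such as state_group_edges reference the room only indirectly, so take a backup and keep Synapse stopped):

-- illustrative only: clear leftover room state for the already-blocked room
BEGIN;
DELETE FROM state_groups_state WHERE room_id = '!OGEhHVWSdvArJzumhm:matrix.org';
DELETE FROM state_groups WHERE room_id = '!OGEhHVWSdvArJzumhm:matrix.org';
COMMIT;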

> Tbh I still don't understand what "forward extremities" are and wasn't able to enlighten myself with web research. I'd be really thankful if someone could enlighten me here! :)

I'm glad it worked for you! Our docs have a bit of info on what a forward extremity is.

> I'm closing the ticket. Thanks so much for your help @clokep! 💜

You're welcome!
