
high CPU load by "matrix- _process_incoming_pdus_in_room_inner" #11818

Closed
zxyz opened this issue Jan 24, 2022 · 7 comments
Labels
X-Needs-Info This issue is blocked awaiting information from the reporter

Comments

@zxyz

zxyz commented Jan 24, 2022

Description

Since around November 24th last year I've been observing strange metrics in Synapse.
That was the day I upgraded Synapse from 1.45.1 to 1.46 (the only change that day).

CPU load went up, together with some other metrics (I'm referring to the Synapse Grafana dashboard; see below for screenshots):

  • DB transactions by total txn time -> matrix- get_auth_chain_difference
  • Number of events in federation staging area
  • Age of oldest event in staging area
  • DB usage by background jobs (including scheduling time) -> matrix- _process_incoming_pdus_in_room_inner

What I tried:

  • I disabled presence for a while -> no change
  • nothing noticeable in the logs (at least to me)
  • I compressed state, vacuumed Postgres and purged old media files
  • no change after a container restart or reboot
  • no change between night and day (fewer users at night)
  • no change after several updates

Version information

  • Debian 11
  • docker using the amazing matrix-docker-ansible-deploy playbook
  • in November when it started it was running synapse 1.46 AND postgres 13.x.
  • Current Synapse version is 1.50.2, postgres 14.1.
  • DB storage is on SSDs, 8 CPU cores, 24G RAM
  • no room complexity set

Graphs

[screenshot: CPU usage]

[screenshot: CPU usage by background jobs]

  • red: matrix- _process_incoming_pdus_in_room_inner
  • orange: federation_transaction_transmission_loops
  • blue: matrix_persists_events

[screenshot: cache misses]
[screenshot: average DB transaction time]

  • red: "state resolve events"
  • violet: persist_events: get new state after events
  • orange: resolve state groups for events

[screenshot: federation staging area]
[screenshot: oldest events in staging area]
[screenshot: resolve events]

Please let me know if you'd like to have some other screenshots or information.

Thanks a lot!

@zxyz
Author

zxyz commented Jan 25, 2022

Actually, there is a day/night difference to be seen:
[screenshot: CPU usage over day/night]

It looks like it correlates a bit with the PDU and EDU rates:

[screenshot: PDU/EDU rates]

I adjusted the cache factor from 0.5 to 1.0. It didn't change anything.
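
(For reference, a minimal sketch of that change in homeserver.yaml, assuming the caches.global_factor setting; the matrix-docker-ansible-deploy playbook exposes its own variable for this, which isn't shown here.)

# homeserver.yaml -- raise the global cache factor from 0.5 to 1.0
caches:
  global_factor: 1.0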

@jaywink
Member

jaywink commented Jan 27, 2022

I'm seeing similar things in EMS land. Four customer hosts, three of them in one particular Libera room. All show graphs similar to the ones below. All of them seem to start having issues with a large spike in inbound federation, causing things to lock up.

CPU spikes and stays pinned:

[screenshot]

Federation spike to start things:
[screenshot]

The background jobs causing the issue are _process_incoming_pdus_in_room_inner and Lock._renew:

[screenshot]

The database is doing lots of renew_lock, confirmed to be the most expensive thing on the database server side:

[screenshot]

It seems to be persisting something

[screenshot]

I restarted one of the Synapse instances and the "lock" seems to disappear.

[screenshot]

The lock renew activity disappears, replaced by actually persisting events a few spiky times, then flattening out.

[screenshot]

Something got locked up and it cleared on restart?

All four of these customer hosts are on Synapse v1.51.0 with an otherwise low amount of activity/users. I've pinged the internal backend room with the customer names in case logs would be helpful. Times in the graphs above are UTC+2.

@clokep
Contributor

clokep commented Feb 1, 2022

What EMS was seeing sounds a little different, since I'm assuming @zxyz has restarted the service at some point since this started?

It might be useful to grep the logs for "biggest rooms for state-res" after configuring the synapse.state.metrics logger for debug logging. I would hope that would show whether it is all due to a single room, but it might not.
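
A minimal sketch of what that looks like in the log config YAML (assuming the standard Python dictConfig layout that Synapse's log config uses):

# log config YAML -- enable the periodic "biggest rooms for state-res" lines
loggers:
  synapse.state.metrics:
    level: DEBUG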

@clokep clokep added the X-Needs-Info This issue is blocked awaiting information from the reporter label Feb 1, 2022
@zxyz
Author

zxyz commented Feb 2, 2022

Thanks @clokep for the hint. Synapse and also the VM have been restarted/upgraded several times.

I let the debug log run overnight. There are around 300 entries containing `state-res`; here is an excerpt:

1 biggest rooms for state-res by DB time: ['!OGEhHVWSdvArJzumhm:matrix.org (159.379s)']
3 biggest rooms for state-res by CPU time: ['!OGEhHVWSdvArJzumhm:matrix.org (86.4182s)
2 biggest rooms for state-res by DB time: ['!OGEhHVWSdvArJzumhm:matrix.org (163.432s)'
1 biggest rooms for state-res by CPU time: ['!OGEhHVWSdvArJzumhm:matrix.org (73.8072s)']
1 biggest rooms for state-res by CPU time: ['!OGEhHVWSdvArJzumhm:matrix.org (64.4513s)']                          
1 biggest rooms for state-res by DB time: ['!OGEhHVWSdvArJzumhm:matrix.org (139.385s)']                          
1 biggest rooms for state-res by CPU time: ['!OGEhHVWSdvArJzumhm:matrix.org (32.9124s)']
1 biggest rooms for state-res by DB time: ['!OGEhHVWSdvArJzumhm:matrix.org (72.2949s)']

"!OGEhHVWSdvArJzumhm:matrix.org" is Matrix HQ if I'm not mistaken.

The other rooms in the "3 biggest rooms" entries had much lower values for DB and CPU; I'm not posting them here because I don't know whether they're private rooms and I don't want to reveal anything private :).

We were already considering setting "complexity: 200.0" on the server to basically just kick out "Matrix HQ" because of its excessive resource usage (state...). So maybe I'll try this now. Or are there other options?
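
For reference, a sketch of what that would look like in homeserver.yaml, assuming the limit_remote_rooms option (the error message is just illustrative):

# homeserver.yaml -- refuse to join remote rooms above a complexity threshold
limit_remote_rooms:
  enabled: true
  complexity: 200.0
  complexity_error: "This room is too complex for this homeserver."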

But with your synapse.state.metrics hint I found #8612 and checked for room extremities with

select room_id, count(*) c from event_forward_extremities group by room_id order by c desc limit 20;

There are a few rooms with a count of 3 or 4 and one with 5. But there is one room with a count of 109 (!) (a federated room at privacytools.io). My Prometheus data goes back 90 days and this room has had a count of 109 the whole time. This room also shows higher state-res times than most others in the debug log:

689 - DEBUG - sentinel - 2 biggest rooms for state-res by CPU time: ['!OGEhHVWSdvArJzumhm:matrix.org (32.7982s)', '!<ROOMID>:privacytools.io (15.8225s)']
689 - DEBUG - sentinel - 2 biggest rooms for state-res by DB time: ['!OGEhHVWSdvArJzumhm:matrix.org (78.7121s)', '!<ROOMID>:privacytools.io (5.36405s)']
689 - DEBUG - sentinel - 2 biggest rooms for state-res by CPU time: ['!OGEhHVWSdvArJzumhm:matrix.org (68.9037s)', '!<ROOMID>:privacytools.io (6.03624s)']
689 - DEBUG - sentinel - 2 biggest rooms for state-res by DB time: ['!OGEhHVWSdvArJzumhm:matrix.org (157.548s)', '!<ROOMID>:privacytools.io (1.8746s)']

But should/can I do anything about this room?

Thanks a lot, much appreciated!

PS: Synapse was upgraded to 1.51.0 in the meantime; no change.

@clokep
Contributor

clokep commented Feb 2, 2022

"!OGEhHVWSdvArJzumhm:matrix.org" is Matrix HQ if I'm not mistaken.

> We were already considering setting "complexity: 200.0" on the server to basically just kick out "Matrix HQ" because of its excessive resource usage (state...). So maybe I'll try this now. Or are there other options?

Note that I don't believe setting that now will have your users leave Matrix HQ. You would want to delete the room afterward: https://matrix-org.github.io/synapse/develop/admin_api/rooms.html#delete-room-api (please read through that first before just running the commands!)

> But with your synapse.state.metrics hint I found #8612 and checked for room extremities with
>
> select room_id, count(*) c from event_forward_extremities group by room_id order by c desc limit 20;

> But should/can I do anything about this room?

See the forward extremities admin API: https://matrix-org.github.io/synapse/develop/admin_api/rooms.html#forward-extremities-admin-api. You might want to delete the forward extremities (this pretty much creates an event to simplify the overall DAG, which should help with state res). It mentions reading through #1760 though; we should probably move the important bits of that into our documentation. 😢
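
As a hedged sketch, those calls would look something like the following (room ID elided as above; please double-check against the linked docs before running the DELETE):

# list the forward extremities of the problem room
curl --header "Authorization: Bearer <api-token>" -XGET 'http://localhost:8008/_synapse/admin/v1/rooms/!<ROOMID>:privacytools.io/forward_extremities'

# delete the extra forward extremities for that room
curl --header "Authorization: Bearer <api-token>" -XDELETE 'http://localhost:8008/_synapse/admin/v1/rooms/!<ROOMID>:privacytools.io/forward_extremities'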

@zxyz
Author

zxyz commented Feb 3, 2022

Thanks a lot @clokep, deleting the forward extremities did the trick!

> Note that I don't believe setting that now will have your users leave Matrix HQ. You would want to delete the room afterward: https://matrix-org.github.io/synapse/develop/admin_api/rooms.html#delete-room-api (please read through that first before just running the commands!)

I set complexity: 200.0 and restarted Synapse, but to my surprise I was still able to join Matrix HQ (with an account that had never been in the room before), which has a complexity > 800 (if I'm correct that complexity = state events / 500).

Any ideas why this still worked?

Then I deleted the room with:

curl --header "Authorization: Bearer <api-token>" -XDELETE 'http://localhost:8008/_synapse/admin/v1/rooms/!OGEhHVWSdvArJzumhm:matrix.org' --data '{"block": true, "purge": true}'

Which failed after ~ 15 minutes with {"errcode":"M_UNKNOWN","error":"Internal server error"}. I re-ran the command and it finished without output. Trying to join Matrix HQ failed with "Failed to join room This room has been blocked on this server" -> cool!

Re-running the command again results in {"kicked_users":[],"failed_to_kick_users":[],"local_aliases":[],"new_room_id":null}

But the data is still there:

SELECT rss.name, s.room_id, count(s.room_id) FROM state_groups_state s
LEFT JOIN room_stats_state rss USING (room_id)
GROUP BY s.room_id, rss.name
ORDER BY count(s.room_id) DESC
LIMIT 1;

yields:

name          |            room_id             |  count
-----------------------+--------------------------------+----------
                       | !OGEhHVWSdvArJzumhm:matrix.org | 43861911

So now I'd have to delete it by hand in the database (any hints on how best to do this)?

> See the forward extremities admin API: https://matrix-org.github.io/synapse/develop/admin_api/rooms.html#forward-extremities-admin-api. You might want to delete the forward extremities (this pretty much creates an event to simplify the overall DAG, which should help with state res). It mentions reading through #1760 though; we should probably move the important bits of that into our documentation. 😢

I deleted the "forward extremities" in the database with the Postgres command from [1], and the CPU load normalized, as did "federation staging area events", "oldest events in staging area" and GC activity.
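
For anyone landing here later: the exact query is in the linked comment. As a rough illustration of the idea only (keep just the newest forward extremity for the affected room), it is something along these lines; stop Synapse and take a database backup before running anything like this.

-- illustrative only, NOT the exact query from [1]:
-- drop all but the most recently persisted forward extremity of one room
DELETE FROM event_forward_extremities
WHERE room_id = '!<ROOMID>:privacytools.io'
  AND event_id NOT IN (
    SELECT e.event_id
    FROM events e
    JOIN event_forward_extremities f USING (event_id)
    WHERE e.room_id = '!<ROOMID>:privacytools.io'
    ORDER BY e.stream_ordering DESC
    LIMIT 1
  );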

Tbh I still don't understand what "forward extremities" are and wasn't able to enlighten myself with web research. I'd be really thankful if someone could enlighten me here! :)

Having this documented would be really nice! 🐝

I'm closing the ticket. Thanks so much for your help @clokep! 💜

[1] #1760 (comment)

@zxyz zxyz closed this as completed Feb 3, 2022
@clokep
Contributor

clokep commented Feb 3, 2022

> Note that I don't believe setting that now will have your users leave Matrix HQ. You would want to delete the room afterward: https://matrix-org.github.io/synapse/develop/admin_api/rooms.html#delete-room-api (please read through that first before just running the commands!)
>
> I set complexity: 200.0 and restarted Synapse, but to my surprise I was still able to join Matrix HQ (with an account that had never been in the room before), which has a complexity > 800 (if I'm correct that complexity = state events / 500).
>
> Any ideas why this still worked?

The setting only applies to new rooms; this allows admins to join rooms that are more complex and still have them work.

> Which failed after ~ 15 minutes with {"errcode":"M_UNKNOWN","error":"Internal server error"}.

It probably would have been better to use the v2 version of that endpoint, which is async.
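
A sketch of the v2 call, assuming the endpoints in the current admin API docs (it returns a delete_id right away and runs the deletion in the background):

# start an asynchronous room deletion and get a delete_id back
curl --header "Authorization: Bearer <api-token>" -XDELETE 'http://localhost:8008/_synapse/admin/v2/rooms/!OGEhHVWSdvArJzumhm:matrix.org' --data '{"block": true, "purge": true}'

# poll the status of the deletion(s) for that room
curl --header "Authorization: Bearer <api-token>" -XGET 'http://localhost:8008/_synapse/admin/v2/rooms/!OGEhHVWSdvArJzumhm:matrix.org/delete_status'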

> But the data is still there:
>
> So now I'd have to delete it by hand in the database (any hints on how best to do this)?

Sounds like something in the purging broke, unfortunately. I don't have any ideas for a good way to poke at that. Most of the tables have a room_id column, which should make it not too bad though.
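
As a heavily hedged illustration of what that could look like for the leftover state tables (this is not a complete manual purge; other tables such as state_group_edges reference the room only indirectly, so take a backup and keep Synapse stopped):

-- illustrative only: clear leftover room state for the already-blocked room
BEGIN;
DELETE FROM state_groups_state WHERE room_id = '!OGEhHVWSdvArJzumhm:matrix.org';
DELETE FROM state_groups WHERE room_id = '!OGEhHVWSdvArJzumhm:matrix.org';
COMMIT;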

> Tbh I still don't understand what "forward extremities" are and wasn't able to enlighten myself with web research. I'd be really thankful if someone could enlighten me here! :)

I'm glad it worked for you! Our docs have a bit of info on what a forward extremity is.

> I'm closing the ticket. Thanks so much for your help @clokep! 💜

You're welcome!
