This repository has been archived by the owner on Dec 13, 2023. It is now read-only.

"No state group for unknown or outlier event" error in federation #13026

Closed

mytbk opened this issue Jun 11, 2022 · 17 comments
Labels
A-Corruption Things that have led to unexpected state in Synapse or the database S-Minor Blocks non-critical functionality, workarounds exist. T-Defect Bugs, crashes, hangs, security vulnerabilities, or other reported issues. X-Needs-Info This issue is blocked awaiting information from the reporter

Comments

@mytbk

mytbk commented Jun 11, 2022

Description

After updating Synapse to 1.60.0, I found the "No state group for unknown or outlier event" runtime error in the journal (see the log output section below).

There are a lot of these errors; they appear about once per minute.

Steps to reproduce

Start the Synapse service and view the log with journalctl; the runtime errors show up repeatedly.

Homeserver

matrixim.cc

Synapse Version

{"server_version":"1.60.0","python_version":"3.10.5"}

Installation Method

Other (please mention below)

Platform

Arch Linux matrix-synapse package 1.60.0-1.

Relevant log output

Jun 11 02:19:58 matrixim synapse[3229]: 2022-06-11 02:19:58,370 - synapse.federation.sender.per_destination_queue - 374 - ERROR - federation_transaction_transmission_loop-1388 - TX [matrix.lokiberra.tk] Failed to send transaction
Jun 11 02:19:58 matrixim synapse[3229]: Traceback (most recent call last):
Jun 11 02:19:58 matrixim synapse[3229]:   File "/usr/lib/python3.10/site-packages/twisted/internet/defer.py", line 1661, in _inlineCallbacks
Jun 11 02:19:58 matrixim synapse[3229]:     result = current_context.run(gen.send, result)
Jun 11 02:19:58 matrixim synapse[3229]: StopIteration: []
Jun 11 02:19:58 matrixim synapse[3229]: During handling of the above exception, another exception occurred:
Jun 11 02:19:58 matrixim synapse[3229]: Traceback (most recent call last):
Jun 11 02:19:58 matrixim synapse[3229]:   File "/usr/lib/python3.10/site-packages/synapse/federation/sender/per_destination_queue.py", line 279, in _transaction_transmission_loop
Jun 11 02:19:58 matrixim synapse[3229]:     await self._catch_up_transmission_loop()
Jun 11 02:19:58 matrixim synapse[3229]:   File "/usr/lib/python3.10/site-packages/synapse/federation/sender/per_destination_queue.py", line 494, in _catch_up_transmission_loop
Jun 11 02:19:58 matrixim synapse[3229]:     hosts = await self._state.get_hosts_in_room_at_events(
Jun 11 02:19:58 matrixim synapse[3229]:   File "/usr/lib/python3.10/site-packages/synapse/state/__init__.py", line 258, in get_hosts_in_room_at_events
Jun 11 02:19:58 matrixim synapse[3229]:     entry = await self.resolve_state_groups_for_events(room_id, event_ids)
Jun 11 02:19:58 matrixim synapse[3229]:   File "/usr/lib/python3.10/site-packages/synapse/util/metrics.py", line 113, in measured_func
Jun 11 02:19:58 matrixim synapse[3229]:     r = await func(self, *args, **kwargs)
Jun 11 02:19:58 matrixim synapse[3229]:   File "/usr/lib/python3.10/site-packages/synapse/state/__init__.py", line 421, in resolve_state_groups_for_events
Jun 11 02:19:58 matrixim synapse[3229]:     state_groups = await self.state_store.get_state_group_for_events(event_ids)
Jun 11 02:19:58 matrixim synapse[3229]:   File "/usr/lib/python3.10/site-packages/synapse/storage/state.py", line 868, in get_state_group_for_events
Jun 11 02:19:58 matrixim synapse[3229]:     return await self.stores.main._get_state_group_for_events(event_ids)
Jun 11 02:19:58 matrixim synapse[3229]:   File "/usr/lib/python3.10/site-packages/twisted/internet/defer.py", line 1661, in _inlineCallbacks
Jun 11 02:19:58 matrixim synapse[3229]:     result = current_context.run(gen.send, result)
Jun 11 02:19:58 matrixim synapse[3229]:   File "/usr/lib/python3.10/site-packages/synapse/storage/databases/main/state.py", line 331, in _get_state_group_for_events
Jun 11 02:19:58 matrixim synapse[3229]:     raise RuntimeError("No state group for unknown or outlier event %s" % e)
Jun 11 02:19:58 matrixim synapse[3229]: RuntimeError: No state group for unknown or outlier event $j6Qxtlh-qcXquAz1rwD_9apbSIFn3x8D-9I5zTsJruA

Anything else that would be useful to know?

No response

@richvdh
Member

richvdh commented Jun 13, 2022

possibly related to #12507.

@erikjohnston
Member

This looks like a duplicate of #12507 to me.

@mytbk
Author

mytbk commented Jun 14, 2022

1.61.0 still has this issue.
From the traceback, this doesn't seem to be the same as #12507, although they have the same exception.

@mytbk
Author

mytbk commented Jun 14, 2022

Also, the query select * from event_forward_extremities efe left join events e using (event_id) where e.event_id is null; in #12507 (comment) returns 0 rows.
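
For readability, that same check from #12507 (it looks for forward extremities with no corresponding row in events) is:

select *
from event_forward_extremities efe
left join events e using (event_id)
where e.event_id is null;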

@richvdh richvdh reopened this Jun 14, 2022
@richvdh
Member

richvdh commented Jun 14, 2022

not the same as #12507 then!

@mytbk
Author

mytbk commented Jun 14, 2022

In 1.61.0 I observe other exceptions that also emit "No state group for unknown or outlier event"; I don't know what is happening. Some of them begin with "Exception: Error fetching missing prev_events for ...", and some are StopIteration exceptions.

Here's the log covering roughly the first two minutes after a restart.
1.txt

@erikjohnston erikjohnston added S-Minor Blocks non-critical functionality, workarounds exist. T-Defect Bugs, crashes, hangs, security vulnerabilities, or other reported issues. labels Jun 15, 2022
@mweinelt
Contributor

@mytbk Curious, are you using retention settings and/or https://github.com/matrix-org/rust-synapse-compress-state/?

@mytbk
Author

mytbk commented Jun 26, 2022

@mweinelt

My retention settings:

retention:
  enabled: true

  default_policy:
    min_lifetime: 1d
    max_lifetime: 1y

  allowed_lifetime_min: 1d
  allowed_lifetime_max: 1y

  purge_jobs:
    - longest_max_lifetime: 3d
      interval: 12h
    - shortest_max_lifetime: 3d
      interval: 1d

I also used rust-synapse-compress-state before.

@FrankNagel

I may be in a similar situation: I have configured a default retention time of 240d on my server (and I run rust-synapse-compress-state as a cron job every night). Some days ago I joined !OGEhHVWSdvArJzumhm:matrix.org (#matrix.matrix.org) and two days later the room stopped syncing. I investigated the forward extremities of this room and got two different results from the API and from querying the DB directly:

API:

# curl --header "Authorization: Bearer XXX" 'https://matrix.uni-marburg.de/_synapse/admin/v1/rooms/!OGEhHVWSdvArJzumhm:matrix.org/forward_extremities' | jq .
{
  "count": 5,
  "results": [
    {
      "event_id": "$w__T5VCKv34SVaoVF4LoIynZDo5QqJ4mgzCq9XCaaAg",
      "state_group": 79056,
      "depth": 466487,
      "received_ts": 1656433481438
    },
    {
      "event_id": "$FJU7WQNE9c1QcZYZvh5p9o_f1mnXBPuoYbccaCWZNbk",
      "state_group": 80290,
      "depth": 466977,
      "received_ts": 1656600524544
    },
    {
      "event_id": "$m9S8nYfhe-MBDODoxTR-33HRd8-jtt35KoVUV7AbGuQ",
      "state_group": 79705,
      "depth": 465281,
      "received_ts": 1656529102606
    },
    {
      "event_id": "$3uLkuz0dKwartpR4_qTPEKezBtp31a0goxBZXfnU9nw",
      "state_group": 79711,
      "depth": 463617,
      "received_ts": 1656348839094
    },
    {
      "event_id": "$X-1osxxtdtlvcre3QZEcIooCMl3qGE4epxL5ahiLQ8E",
      "state_group": 79709,
      "depth": 463617,
      "received_ts": 1656348838539
    }
  ]
}

DB:

synapse=# select to_timestamp(e.origin_server_ts/1000), e.* readable_os_ts from events as e join event_forward_extremities efe on e.event_id = efe.event_id where e.room_id = '!OGEhHVWSdvArJzumhm:matrix.org' order by origin_server_ts DESC;
    
          to_timestamp      | topological_ordering |                   event_id                   |          type          |            room_id             | content | unrecognized_keys | processed | outlier | depth  | origin_server_ts |  received_ts  |             sender             | contains_url | instance_name | stream_ordering |          state_key          | rejection_reason
    ------------------------+----------------------+----------------------------------------------+------------------------+--------------------------------+---------+-------------------+-----------+---------+--------+------------------+---------------+--------------------------------+--------------+---------------+-----------------+-----------------------------+------------------
     2022-06-30 16:47:57+02 |               466977 | $FJU7WQNE9c1QcZYZvh5p9o_f1mnXBPuoYbccaCWZNbk | m.room.member          | !OGEhHVWSdvArJzumhm:matrix.org |         |                   | t         | f       | 466977 |    1656600477278 | 1656600524544 | @drdrjojo:matrix.org           | f            | master        |         3353182 | @drdrjojo:matrix.org        |
     2022-06-28 18:23:20+02 |               466487 | $w__T5VCKv34SVaoVF4LoIynZDo5QqJ4mgzCq9XCaaAg | org.matrix.dummy_event | !OGEhHVWSdvArJzumhm:matrix.org |         |                   | t         | f       | 466487 |    1656433400757 | 1656433481438 | @nagelfr:matrix.uni-marburg.de | f            | master        |         3347260 |                             |
     2022-06-25 15:15:33+02 |               465281 | $m9S8nYfhe-MBDODoxTR-33HRd8-jtt35KoVUV7AbGuQ | m.room.message         | !OGEhHVWSdvArJzumhm:matrix.org |         |                   | t         | f       | 465281 |    1656162933269 | 1656529102606 | @ninchuka:envs.net             | f            | master        |         3350666 |                             |
     2022-06-20 08:44:44+02 |               463617 | $X-1osxxtdtlvcre3QZEcIooCMl3qGE4epxL5ahiLQ8E | m.room.member          | !OGEhHVWSdvArJzumhm:matrix.org |         |                   | t         | f       | 463617 |    1655707484365 | 1656348838539 | @seastar8170:matrix.org        | f            | master        |        -1157546 | @seastar8170:matrix.org     |
     2022-06-20 08:44:25+02 |               463617 | $3uLkuz0dKwartpR4_qTPEKezBtp31a0goxBZXfnU9nw | m.room.member          | !OGEhHVWSdvArJzumhm:matrix.org |         |                   | t         | f       | 463617 |    1655707465432 | 1656348839094 | @NormietoTechie:libera.chat    | f            | master        |        -1158279 | @NormietoTechie:libera.chat |
     2021-10-16 22:25:50+02 |               361271 | $bCFEYaBSTOOLr8326cbXpy6V7I-N3E2YxlbC5ddpZ1s | m.room.member          | !OGEhHVWSdvArJzumhm:matrix.org |         |                   | t         | t       | 361271 |    1634415950431 | 1656348857068 | @testuser88:schoeneberge.eu    | f            | master        |        -1175678 | @testuser88:schoeneberge.eu |
    (6 rows)

The DB shows one more entry than the API, namely $bCFEYaBSTOOLr8326cbXpy6V7I-N3E2YxlbC5ddpZ1s.

Every event I receive for #matrix.matrix.org ends up failing with an exception like the following:

2022-07-06 10:37:16,905 - synapse.federation.federation_server - 1156 - ERROR - _process_incoming_pdus_in_room_inner-74 - Failed to handle PDU $_tsx0pE8rlFcqDf7kvOnsa6yccs45Sd732OUdqATPBY
Traceback (most recent call last):
  File "/opt/venvs/matrix-synapse/lib/python3.9/site-packages/twisted/internet/defer.py", line 1660, in _inlineCallbacks
    result = current_context.run(gen.send, result)
StopIteration: [{'event_id': '$3uLkuz0dKwartpR4_qTPEKezBtp31a0goxBZXfnU9nw', 'state_group': 79711}, {'event_id': '$FJU7WQNE9c1QcZYZvh5p9o_f1mnXBPuoYbccaCWZNbk', 'state_group': 80290}, {'event_id': '$X-1osxxtdtlvcre3QZEcIooCMl3qGE4epxL5ahiLQ8E', 'state_group': 79709}, {'event_id': '$m9S8nYfhe-MBDODoxTR-33HRd8-jtt35KoVUV7AbGuQ', 'state_group': 79705}, {'event_id': '$w__T5VCKv34SVaoVF4LoIynZDo5QqJ4mgzCq9XCaaAg', 'state_group': 79056}]

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
  File "/opt/venvs/matrix-synapse/lib/python3.9/site-packages/synapse/handlers/federation_event.py", line 236, in on_receive_pdu
    await self._get_missing_events_for_pdu(
  File "/opt/venvs/matrix-synapse/lib/python3.9/site-packages/synapse/handlers/federation_event.py", line 687, in _get_missing_events_for_pdu
    await self._process_pulled_events(origin, missing_events, backfilled=False)
  File "/opt/venvs/matrix-synapse/lib/python3.9/site-packages/synapse/handlers/federation_event.py", line 725, in _process_pulled_events
    await self._process_pulled_event(origin, ev, backfilled=backfilled)
  File "/opt/venvs/matrix-synapse/lib/python3.9/site-packages/synapse/handlers/federation_event.py", line 781, in _process_pulled_event
    await self._process_received_pdu(
  File "/opt/venvs/matrix-synapse/lib/python3.9/site-packages/synapse/handlers/federation_event.py", line 1112, in _process_received_pdu
    await self._check_for_soft_fail(event, state_ids, origin=origin)
  File "/opt/venvs/matrix-synapse/lib/python3.9/site-packages/synapse/handlers/federation_event.py", line 1636, in _check_for_soft_fail
    state_sets_d = await self._state_storage_controller.get_state_groups_ids(
  File "/opt/venvs/matrix-synapse/lib/python3.9/site-packages/synapse/storage/controllers/state.py", line 103, in get_state_groups_ids
    event_to_groups = await self.get_state_group_for_events(event_ids)
  File "/opt/venvs/matrix-synapse/lib/python3.9/site-packages/synapse/storage/controllers/state.py", line 341, in get_state_group_for_events
    return await self.stores.main._get_state_group_for_events(event_ids)
  File "/opt/venvs/matrix-synapse/lib/python3.9/site-packages/twisted/internet/defer.py", line 1660, in _inlineCallbacks
    result = current_context.run(gen.send, result)
  File "/opt/venvs/matrix-synapse/lib/python3.9/site-packages/synapse/storage/databases/main/state.py", line 376, in _get_state_group_for_events
    raise RuntimeError("No state group for unknown or outlier event %s" % e)
RuntimeError: No state group for unknown or outlier event $bCFEYaBSTOOLr8326cbXpy6V7I-N3E2YxlbC5ddpZ1s

The above exception was the direct cause of the following exception:

Traceback (most recent call last):
  File "/opt/venvs/matrix-synapse/lib/python3.9/site-packages/synapse/federation/federation_server.py", line 1146, in _process_incoming_pdus_in_room_inner
    await self._federation_event_handler.on_receive_pdu(
  File "/opt/venvs/matrix-synapse/lib/python3.9/site-packages/synapse/handlers/federation_event.py", line 240, in on_receive_pdu
    raise Exception(
Exception: Error fetching missing prev_events for $_tsx0pE8rlFcqDf7kvOnsa6yccs45Sd732OUdqATPBY: No state group for unknown or outlier event $bCFEYaBSTOOLr8326cbXpy6V7I-N3E2YxlbC5ddpZ1s

As you can see, the event mentioned in the traceback is the one extra forward extremity in the database.

My working hypothesis is that bad things start to happen once you have forward extremities outside the retention window, which may or may not have something to do with the use of rust-synapse-compress-state.
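
If that hypothesis holds, a query along these lines (only a sketch, assuming the events schema shown in the dump above, where outlier is a boolean column) should surface forward extremities whose event row is flagged as an outlier, which is exactly what the RuntimeError complains about:

-- sketch: list forward extremities whose event row is an outlier
select efe.room_id, efe.event_id,
       to_timestamp(e.origin_server_ts/1000) as origin_ts
from event_forward_extremities efe
join events e using (event_id)
where e.outlier
order by e.origin_server_ts;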

@FrankNagel

First I tried increasing the retention time, which did not resolve my issue. But a

synapse=# delete from event_forward_extremities where event_id = '$bCFEYaBSTOOLr8326cbXpy6V7I-N3E2YxlbC5ddpZ1s';

got the room to sync again.

@mweinelt
Contributor

mweinelt commented Jul 6, 2022

I can confirm that this got rooms working for me as well.

@mytbk
Author

mytbk commented Aug 8, 2022

I can also see that deleting rows from event_forward_extremities improves this.

@richvdh
Member

richvdh commented Aug 8, 2022

For those who find that deleting rows from event_forward_extremities helps: this sounds very much like #12507.

@mytbk: I'm confused: first you said that you didn't have any rows in event_forward_extremities without corresponding entries in events, leading me to think you had a different problem from #12507, but now you say that deleting rows from event_forward_extremities has helped? What exactly did you delete?

@richvdh richvdh added the X-Needs-Info This issue is blocked awaiting information from the reporter label Aug 8, 2022
@FrankNagel

Not in my case. At one point I used an inner join in one of my queries to get additional information about the events in question.

SELECT to_timestamp(e.origin_server_ts/1000),
       e.*
FROM EVENTS AS e
JOIN event_forward_extremities efe ON e.event_id = efe.event_id
WHERE e.room_id = '!OGEhHVWSdvArJzumhm:matrix.org'
ORDER BY origin_server_ts DESC;

The problematic event did exist in the events table.

@richvdh
Member

richvdh commented Aug 8, 2022

@FrankNagel please can you open a separate issue explaining exactly what you did and what you know? It's very confusing to try to track several different problems with superficially related symptoms on a single issue. We can always merge them later if we realise they have a similar root cause.

@mytbk
Author

mytbk commented Aug 8, 2022

@richvdh The query in #12507 select * from event_forward_extremities efe left join events e using (event_id) where e.event_id is null; still returns 0 rows in my database.
But from the exception Exception: Error fetching missing prev_events for $1659962077115813SkFhX:matrix.org: No state group for unknown or outlier event $162515984840861XTSNa:matrix.org, I can run the query select * from event_forward_extremities where event_id = '$162515984840861XTSNa:matrix.org'; and it returns one row.
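
As a follow-up diagnostic (a sketch only, assuming the events schema shown earlier in this thread), it may be worth checking whether that extremity is itself flagged as an outlier, which is what the RuntimeError complains about:

-- sketch: inspect the offending forward extremity's event row
select e.event_id, e.room_id, e.outlier,
       to_timestamp(e.origin_server_ts/1000) as origin_ts
from event_forward_extremities efe
join events e using (event_id)
where efe.event_id = '$162515984840861XTSNa:matrix.org';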

@richvdh
Member

richvdh commented Aug 8, 2022

@mytbk ok, I suspect this is due to the retention jobs, see #13476

@richvdh richvdh closed this as completed Aug 8, 2022
@MadLittleMods MadLittleMods added the A-Corruption Things that have led to unexpected state in Synapse or the database label Jun 26, 2023