Skip to content
This repository has been archived by the owner on Apr 26, 2024. It is now read-only.

Initial /sync is returning 404 due to "unknown room" when processing an event. #7065

Closed
ara4n opened this issue Mar 11, 2020 · 6 comments
Closed

Comments

@ara4n
Copy link
Member

ara4n commented Mar 11, 2020

My account is currently unable to initial /sync and is broken with:

$ grep GET-17237 initial_synchrotron4.log
2020-03-11 00:13:02,176 - synapse.storage.data_stores.main.events_worker - 592 - WARNING - GET-17237 - Room !YkSMpYmmunCikXWFkB:matrix.org for event $3AJgBnxl5BLJs3Sp0L3GWojCeItDi98uVpmuqWH-vJc is unknown
2020-03-11 00:13:02,177 - synapse.http.server - 81 - INFO - GET-17237 - <XForwardedForRequest at 0x7fb07b58ad68 method='GET' uri='/_matrix/client/r0/sync?filter=139&timeout=0' clientproto='HTTP/1.1' site=8108> SynapseError: 404 - Could not find event $3AJgBnxl5BLJs3Sp0L3GWojCeItDi98uVpmuqWH-vJc
2020-03-11 00:13:02,178 - synapse.access.http.8108 - 302 - INFO - GET-17237 - 92.234.32.44 - 8108 - {@matthew:matrix.org} Processed request: 3.956sec/0.000sec (2.098sec, 0.621sec) (0.203sec/1.574sec/2) 101B 404 "GET /_matrix/client/r0/sync?filter=139&timeout=0 HTTP/1.1" "Riot/0.11.0 (iPhone; iOS 13.3.1; Scale/3.00)" [1 dbevts]

The $...-vJc event is an invite event for my user to a room I hadn't yet accepted.

This circumstantially looks related to the new redaction stuff in #6875 - perhaps in conjunction with stuck invite bugs, or failing to set room_version on unaccepted invites, or something else...

@ara4n
Copy link
Member Author

ara4n commented Mar 11, 2020

I've got back in by /ignore-ing 5 users who had sent me invites which were stuck (meaning I couldn't reject them, as the server didn't consider me to be in the room, as the inviting user had previously parted in). The rooms in question had a NULL in the room_version column of the rooms table.

However, rooms with null room_versions surely shouldn't cause the whole /sync to fail.

It's also worth noting that rejecting other invites (e.g. !qkoxVSPwURByZOmIPQ:matrix.org) fails with: "No create event in auth events"

@richvdh
Copy link
Member

richvdh commented Mar 11, 2020

I think we've figured out what is happening here.

Let me emphasise first of all that !YkSMpYmmunCikXWFkB:matrix.org was slightly weird in that it had no active members: the original creator had left, and @ara4n was not a full member in that he had yet to accept his invite.

As part of erik's work on fixing UISIs (no really, this all starts there), he wanted to arrange things so that we would clear out current_state_events for rooms where the server was no longer participating in the room (see #6655).

However, at that time, the way we figured out the version of a particular room was to look up its create event via CSE. We still need to know the versions of rooms where there are no active members (as this example shows), so #6729 added a room_version column. That update should have applied to this room, and left it with a non-null room_version.

A later background update, added in #6802, then clears out CSE for rooms such as this one where there are no active members.

Later, we needed to know the room version for each event when we read it from the database (in case we later needed to redact it following the room-specific event-redaction algorithm): #6874, #6875, etc. This also meant that we wanted to ensure that rooms.room_version was fully-populated (as opposed to being lazily populated via a background update). Interestingly, #6847 therefore ensures that any remaining rooms are populated at the point of upgrade, again based on CSE.

However, the fly in this ointment is that there doesn't seem to be anything that ensures the add_rooms_room_version_column bg update (from #6729) actually happens before delete_old_current_state_events (from #6802). Obviously, if they run in the wrong order, or in parallel, then #6729 won't work properly, leaving null room_versions in rooms.

So, any homeserver which went straight from pre-#6729 to post-#6847 should be ok, but any servers which ran the code between #6802 and #6847, which includes matrix.org (and anyone else who ran synapse 1.10 or any of its RCs) may have suffered this data corruption.

A final irony here: it's not the first time that delete_old_current_state_events running out-of-order has bitten us: see also #6923.

Fixing it

We still have the create events, so it should be easy enough to construct yet another database update to try to populate rooms.room_version.

We should probably also add the missing constraint between add_rooms_room_version_column and delete_old_current_state_events for completeness, but it's worth reiterating that #6847 fixes the problem anyway.

@richvdh
Copy link
Member

richvdh commented Mar 11, 2020

However, rooms with null room_versions surely shouldn't cause the whole /sync to fail.

well, we could maybe code around this somehow, to make the room be ignored instead of throwing an error, but I'm generally sceptical of such measures. It's easy to clutter the code with defensive mechanisms which obfuscate how it is meant to work, and furthermore it can hide bugs until they come and bite us in the backside later on.

In short: I'm a fan of failing early wherever possible rather than struggling on in the face of bad data.

@richvdh
Copy link
Member

richvdh commented Mar 11, 2020

(incidentally: if you need a workaround in the meantime, poking the right room_version into the table is probably a much easier one than ignoring users and whathaveyou.)

@ara4n
Copy link
Member Author

ara4n commented Mar 12, 2020

yeah. i only confirmed the missing room_version pattern after ignoring the stuck invite users.

worth noting that i have other stuck invites i can’t reject which do have room_version entries, which fail with missing auth events.

richvdh pushed a commit that referenced this issue Mar 16, 2020
Fixes #7065 

This is basically the same as #6847 except it tries to populate events from `state_events` rather than `current_state_events`, since the latter might have been cleared from the state of some rooms too early, leaving them with a `NULL` room version.
@richvdh
Copy link
Member

richvdh commented Mar 16, 2020

fixed by #7070

@richvdh richvdh closed this as completed Mar 16, 2020
phil-flex pushed a commit to phil-flex/synapse that referenced this issue Apr 15, 2020
Fixes matrix-org#7065 

This is basically the same as matrix-org#6847 except it tries to populate events from `state_events` rather than `current_state_events`, since the latter might have been cleared from the state of some rooms too early, leaving them with a `NULL` room version.
Sign up for free to subscribe to this conversation on GitHub. Already have an account? Sign in.
Labels
None yet
Projects
None yet
Development

No branches or pull requests

2 participants