This repository has been archived by the owner on Apr 26, 2024. It is now read-only.

Events sent immediately after joining can be incorrectly soft-failed #10066

Open
MadLittleMods opened this issue May 25, 2021 · 4 comments
Labels
A-Federation A-Soft-Failure O-Uncommon Most users are unlikely to come across this or unexpected workflow S-Critical Blocks development, potential data loss, more than 25% of users possibly affected, no workarounds. T-Defect Bugs, crashes, hangs, security vulnerabilities, or other reported issues.

Comments

@MadLittleMods
Contributor

MadLittleMods commented May 25, 2021

Description

The Gitter bridge seems to have reproduced a race condition in federation: because the appservice joins and sends a message in quick succession, the message can outrun the join, which causes the message to be soft_failed.

To see an actual reproduction, there is a missing message in the !SrkiFczPSWmYrlSNYF:matrix.org room for matrix.org homeserver users ($mMJdjGPC2GaoWK3pLApa_O2vBoeuQwYtP6iY6XzbrfY). This is tracked on Gitter by https://gitlab.com/gitterHQ/webapp/-/issues/2770#note_583925093

The event exists in the gitter.im homeserver database and is visible on the gitter.im homeserver, where the Gitter bridge appservice operates. The message is also visible to any new homeserver that joins the room and backfills the messages. It is missing only on the matrix.org homeserver.

The message also exists in the matrix.org homeserver database but is soft_failed (thanks to @richvdh for checking the database).

@Half-Shot has also seen this happen a few times with the IRC bridge.

As @leonerd (in #1444) and @Half-Shot mentioned, this is probably not a problem for normal users, because the time they take to join and then send a message is some number of seconds. Appservices, on the other hand, can join and send an event in quick succession (almost instantly).
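To make the race concrete, here is a minimal toy model in Python. It is only a sketch under simplifying assumptions and is not Synapse's actual code: `ToyReceiver`, `Event`, and `deliver` are hypothetical names, and the real soft-fail check (authorising an event against the room's current state) is reduced to "is the sender already known to be joined?".

```python
from dataclasses import dataclass, field


@dataclass
class Event:
    event_id: str
    sender: str
    type: str  # "m.room.member" (treated as a join here) or "m.room.message"


@dataclass
class ToyReceiver:
    """Hypothetical stand-in for the receiving homeserver (not Synapse code)."""

    joined: set = field(default_factory=set)
    visible: list = field(default_factory=list)
    soft_failed: list = field(default_factory=list)

    def receive(self, event: Event) -> None:
        if event.type == "m.room.member":
            # Simplification: every membership event is an accepted join.
            self.joined.add(event.sender)
            self.visible.append(event.event_id)
        elif event.sender in self.joined:
            self.visible.append(event.event_id)
        else:
            # The sender is not in the room according to the state the
            # receiver currently has, so the event is stored but hidden.
            self.soft_failed.append(event.event_id)


def deliver(events):
    """Feed events to a fresh receiver in the given arrival order."""
    receiver = ToyReceiver()
    for event in events:
        receiver.receive(event)
    return receiver


join = Event("$join", "@bridge_user:hs2", "m.room.member")
message = Event("$msg", "@bridge_user:hs2", "m.room.message")

in_order = deliver([join, message])  # join arrives first
raced = deliver([message, join])     # message outruns the join
```

In this model the same two events give different results depending only on arrival order: `raced` ends up with the message in `soft_failed` and only the join visible, which matches what was observed on matrix.org.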

Related issues:

Steps to reproduce

  1. Create a room on HS1, !my-room:hs1
  2. Set up HS2 with an appservice
  3. Join and send a message through the appservice to !my-room:hs1
    • The appservice is probably not necessary to reproduce. Just send a join and a message event from HS2
  4. Check whether the message is visible on HS1
  5. Since it's a race condition, this probably does not always reproduce
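
The join-then-send part of the steps above could be scripted roughly as follows. This sketch only builds the two client-server API requests (join, then send) and does not send them; the base URL `https://hs2.example`, the transaction ID, and the helper name `build_repro_requests` are assumptions for illustration. The paths use the `v3` prefix from the current Matrix client-server spec (older servers used `r0`), and an access token for a user on HS2 would still be needed to actually issue them.

```python
from urllib.parse import quote


def build_repro_requests(base_url, room_id, txn_id):
    """Return (method, url, json_body) tuples for a join followed by a send."""
    # Room IDs contain "!" and ":", so percent-encode them for use in a path.
    room = quote(room_id, safe="")
    join = ("POST", f"{base_url}/_matrix/client/v3/join/{room}", {})
    send = (
        "PUT",
        f"{base_url}/_matrix/client/v3/rooms/{room}"
        f"/send/m.room.message/{quote(txn_id, safe='')}",
        {"msgtype": "m.text", "body": "sent right after joining"},
    )
    return [join, send]


# Hypothetical values; issue the two requests back to back to provoke the race.
requests_to_send = build_repro_requests("https://hs2.example", "!my-room:hs1", "txn1")
```

Sending the second request immediately after the first returns, with no delay in between, is what mimics the appservice behaviour described in the issue.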

Version information

  • Homeserver: gitter.im -> matrix.org
@MadLittleMods MadLittleMods added the T-Defect Bugs, crashes, hangs, security vulnerabilities, or other reported issues. label May 25, 2021
@richvdh
Member

richvdh commented May 25, 2021

I think this might be the same as #6536 (though with a much better description)?

@MadLittleMods MadLittleMods changed the title Federation race condition when joining and sending a message from bridge appservice Federation race condition when joining and sending a message from bridge appservice (soft_failed) May 25, 2021
@MadLittleMods MadLittleMods added the S-Critical Blocks development, potential data loss, more than 25% of users possibly affected, no workarounds. label May 25, 2021
@MadLittleMods
Contributor Author

@richvdh #6536 does look like a good culprit! Should we close one in favor of the other?

I've marked it as S-Critical because it causes data loss (message is not visible). Feel free to re-prioritize more appropriately though.

@richvdh richvdh changed the title Federation race condition when joining and sending a message from bridge appservice (soft_failed) events sent immediately after joining can be incorrectly soft-failed May 26, 2021
@richvdh
Member

richvdh commented May 26, 2021

I closed #6536, since this has a bunch more detail

@callahad callahad added the P3 (OBSOLETE: use S- labels.) Approved backlog: not yet scheduled, will accept patches label May 27, 2021
@MadLittleMods
Contributor Author

Another reproduction at https://gitter.im/matrix-org/gitter?at=60c0ecae84c2f15b796e33a2 which was sent to !OvgDmYaPOFRRWnIdQe:half-shot.uk as $r3aSycTMNPGi6gKK5-N7jw8fAEt3I2Q1lfW0LdwUyns but failed to federate over to matrix.org.

You can only see their join event on matrix.org, https://matrix.to/#/!OvgDmYaPOFRRWnIdQe:half-shot.uk/$wB_7bSk-IGb0Czes_BCV3W3DIRPsxeGw-7ZTSbvkaus?via=half-shot.uk&via=matrix.org&via=gitter.im

-- https://gitlab.com/gitterHQ/webapp/-/issues/2770#note_597176264

MadLittleMods added a commit that referenced this issue Jun 9, 2021
Spawned from missing messages we were seeing on `matrix.org` from a
federated Gitter bridged room, https://gitlab.com/gitterHQ/webapp/-/issues/2770.
The underlying issue in Synapse is tracked by #10066
where the message and join event race and the message is `soft_failed` before the
`join` event reaches the remote federated server.

Less soft_failed events = better and usually this should only trigger for events
where people are doing bad things and trying to fuzz and fake everything.
richvdh pushed a commit that referenced this issue Jun 11, 2021
@MadLittleMods MadLittleMods changed the title events sent immediately after joining can be incorrectly soft-failed Events sent immediately after joining can be incorrectly soft-failed Sep 2, 2022
@MadLittleMods MadLittleMods added the O-Uncommon Most users are unlikely to come across this or unexpected workflow label Sep 2, 2022
@DMRobertson DMRobertson added A-Soft-Failure and removed P3 (OBSOLETE: use S- labels.) Approved backlog: not yet scheduled, will accept patches labels Sep 6, 2022