Skip to content
This repository has been archived by the owner on Apr 26, 2024. It is now read-only.

Track when the pulled event signature fails #13815

Merged

Conversation

MadLittleMods
Copy link
Contributor

@MadLittleMods MadLittleMods commented Sep 14, 2022

Track when the pulled event signature fails. Because we're doing the recording in _check_sigs_and_hash_for_pulled_events_and_fetch (previously named _check_sigs_and_hash_and_fetch), this means we will track signature failures for backfill, get_room_state, get_event_auth, and get_missing_events (all pulled event scenarios). And we also record signature failures from get_pdu.

Part of #13700

Part of #13676 and #13356

This PR will be especially important for #13816 so we can avoid the costly _get_state_ids_after_missing_prev_event down the line when /messages calls backfill.

Dev notes

backfill (federation_client)
_check_sigs_and_hash_and_fetch
_check_sigs_and_hash_and_fetch_one
	_check_sigs_and_hash
	get_pdu

get_pdu
get_pdu_from_destination_raw
_check_sigs_and_hash

Other usages of _check_sigs_and_hash. It looks like it's all processing events that come from another homeserver.

on_PUT (/send/<transaction_id>/)
on_incoming_transaction
_on_incoming_transaction_inner
_handle_incoming_transaction
_handle_pdus_in_txn
process_pdus_for_room
process_pdu
_handle_received_pdu
_check_sigs_and_hash


on_send_join_request (on_PUT /send_join) or on_send_leave_request (on_PUT /send_leave) or on_send_knock_request (on_PUT /send_knock)
_on_send_membership_event
_check_sigs_and_hash


on_invite_request (on_PUT /invite)
_check_sigs_and_hash


handle_new_client_event or _handle_request
_persist_event
persist_and_notify_client_event
send_invite
send_invite
_check_sigs_and_hash

send_join
_check_sigs_and_hash_and_fetch_one
_check_sigs_and_hash

backfill or get_room_state or get_event_auth or get_missing_events
_check_sigs_and_hash_and_fetch
_check_sigs_and_hash_and_fetch_one
_check_sigs_and_hash

Pull Request Checklist

  • Pull request is based on the develop branch
  • Pull request includes a changelog file. The entry should:
    • Be a short description of your change which makes sense to users. "Fixed a bug that prevented receiving messages from other servers." instead of "Moved X method from EventStore to EventWorkerStore.".
    • Use markdown where necessary, mostly for code blocks.
    • End with either a period (.) or an exclamation mark (!).
    • Start with a capital letter.
    • Feel free to credit yourself, by adding a sentence "Contributed by @github_username." or "Contributed by [Your Name]." to the end of the entry.
  • Pull request includes a sign off
  • Code style is correct
    (run the linters)

@MadLittleMods MadLittleMods added T-Enhancement New features, changes in functionality, improvements in performance, or user-facing enhancements. A-Messages-Endpoint /messages client API endpoint (`RoomMessageListRestServlet`) (which also triggers /backfill) labels Sep 14, 2022
@@ -116,6 +122,9 @@ async def _check_sigs_and_hash(
"event_id": pdu.event_id,
}
)
await self._store.record_event_failed_pull_attempt(
pdu.room_id, pdu.event_id, "Event content has been tampered with"
)
return redacted_event

spam_check = await self.spam_checker.check_event_for_spam(pdu)
Copy link
Contributor Author

@MadLittleMods MadLittleMods Sep 14, 2022

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

In the future, we will potentially record a failed attempt when the spam checker soft-fails an event. Part of #13700

MadLittleMods added a commit that referenced this pull request Sep 15, 2022
…e is invalid

Related to
 - #13622
    - #13635
 - #13676

Follow-up to #13815
which tracks event signature failures.

This PR aims to stop us from trying to
`_get_state_ids_after_missing_prev_event` after
we already know that the prev_event will fail
from a previous attempt

To explain an exact scenario around `/messages` -> backfill, we call `/backfill` and first check the signatures of the 100 events. We see bad signature for `$luA4l7QHhf_jadH3mI-AyFqho0U2Q-IXXUbGSMq6h6M` and `$zuOn2Rd2vsC7SUia3Hp3r6JSkSFKcc5j3QTTqW_0jDw` (both member events). Then we process the 98 events remaining that have valid signatures but one of the events references `$luA4l7QHhf_jadH3mI-AyFqho0U2Q-IXXUbGSMq6h6M` as a `prev_event`. So we have to do the whole `_get_state_ids_after_missing_prev_event` rigmarole which pulls in those same events which fail again because the signatures are still invalid.

 - `backfill`
    - `outgoing-federation-request` `/backfill`
    - `_check_sigs_and_hash_and_fetch`
       - `_check_sigs_and_hash_and_fetch_one` for each event received over backfill
          - ❗ `$luA4l7QHhf_jadH3mI-AyFqho0U2Q-IXXUbGSMq6h6M` fails with `Signature on retrieved event was invalid.`: `unable to verify signature for sender domain xxx: 401: Failed to find any key to satisfy: _FetchKeyRequest(...)`
          - ❗ `$zuOn2Rd2vsC7SUia3Hp3r6JSkSFKcc5j3QTTqW_0jDw` fails with `Signature on retrieved event was invalid.`: `unable to verify signature for sender domain xxx: 401: Failed to find any key to satisfy: _FetchKeyRequest(...)`
   - `_process_pulled_events`
      - `_process_pulled_event` for each validated event
         - ❗ Event `$Q0iMdqtz3IJYfZQU2Xk2WjB5NDF8Gg8cFSYYyKQgKJ0` references `$luA4l7QHhf_jadH3mI-AyFqho0U2Q-IXXUbGSMq6h6M` as a `prev_event` which is missing so we try to get it
            - `_get_state_ids_after_missing_prev_event`
               - `outgoing-federation-request` `/state_ids`
               - ❗ `get_pdu` for `$luA4l7QHhf_jadH3mI-AyFqho0U2Q-IXXUbGSMq6h6M` which fails the signature check again
               - ❗ `get_pdu` for `$zuOn2Rd2vsC7SUia3Hp3r6JSkSFKcc5j3QTTqW_0jDw` which fails the signature check

With this PR, we no longer call the burdensome `_get_state_ids_after_missing_prev_event`
because the signature failure will count as an attempt before we try to run this.
)

return valid_state_events, valid_auth_events

@trace
async def _check_sigs_and_hash_and_fetch(
async def _check_sigs_and_hash_for_pulled_events_and_fetch(
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Renamed this so we stick to using this only for pulled event scenarios

Comment on lines +64 to +66
record_failure_callback: Optional[
Callable[[EventBase, str], Awaitable[None]]
] = None,
Copy link
Contributor Author

@MadLittleMods MadLittleMods Sep 24, 2022

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

For context, see #13815 (comment) for why we're using a callback pattern here.

@MadLittleMods MadLittleMods marked this pull request as ready for review September 24, 2022 03:57
@MadLittleMods MadLittleMods requested a review from a team as a code owner September 24, 2022 03:57
await _check_sigs_on_pdu(self.keyring, room_version, pdu)
try:
await _check_sigs_on_pdu(self.keyring, room_version, pdu)
except Exception as exc:
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This except clause seems too broad; I'd worry it might swallow errors arising from bugs rather than solely invalid signatures.

Can we narrow down the interesting types of Exceptions that we care about here?

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

e.g. this will catch CancelledErrors which sounds dodgy (I don't know if cancellation is enabled for backfill, but it doesn't sound like a good reason to record a failed pull attempt)

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It looks like we can just worry about InvalidEventSignatureError here 👍

But all of the downstream places in _check_sigs_on_pdu do the same except Exception as e: 🤷

Copy link
Contributor Author

@MadLittleMods MadLittleMods Sep 30, 2022

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

What we decide here, we should also apply to #13814

Is there a way we can make the errors we care about opt-out instead of opt-in? It seems like it's going to suck to keep some of these error lists up to date whenever the downstream functions change and raise different errors.

Maybe it's better to be opt-in to handle just what you know but as an example at the following spot, we probably need to look at FederationError, RuntimeError, InvalidResponseError from a quick glance at some of the downstream code and this list seems like it will morph under us since there is so much going on in that block. Is there a good way to get a general sense of all exception types that a code path can raise (doesn't need to be comprehensive, maybe info just from docstrings)?

except Exception as exc:
await self._store.record_event_failed_pull_attempt(
event.room_id, event_id, str(exc)
)
raise exc

(created a draft PR to track this point, #13969, and will update with what we think about here)

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I can't imagine opt-out is easier — I don't think RuntimeError, ValueError, AssertionError, KeyError etc should be included, for example. These represent programming errors rather than the remote not acting according to the way we expect.

Tracking the set of exceptions that some code can throw is difficult in Python, unfortunately.

I would suggest trying some cases of where you want to track a failed pull attempt, perhaps even adding tests that reproduce these cases. I can think of at least these classes of problems:

  • DNS errors
  • TCP-level Connection errors
  • TLS problems such as untrusted self-signed certs
  • Homeserver not on federation whitelist (? what should we do here?)
  • Bad response code
  • Response isn't JSON
  • JSON response isn't in an expected format (e.g. doesn't include the right keys)... bearing in mind that our validation is probably rubbish and this might currently manifest itself as a KeyError (IMO this should be fixed so it sounds more like a validation error)

I realise this sounds a bit painful but we ought to figure out what we should do for these cases. Should a network problem actually be tracked against the event at all; won't the existing backoff behaviour be sufficient there? ....

Perhaps, rather than accepting that the code underneath is always going to be messy, try and improve the downstream situation — can we define superclasses for some of these interesting groups of exceptions a e.g. TemporaryNetworkError exception class and make relevant exceptions derived from those or get re-raised as those?
We can then add to the docstrings of the downstream functions what exceptions they're allowed to raise for these kinds of circumstances.

Just some thoughts, realise it may not necessarily be easy..

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I don't think RuntimeError, ValueError, AssertionError, KeyError etc should be included, for example. These represent programming errors rather than the remote not acting according to the way we expect.

Fair enough on RuntimeError and friends, I saw this case and during my quick scan and seemed relevant (it's not) although if it was, would be a good candidate to make it throw the proper error type 👍

I would suggest trying some cases of where you want to track a failed pull attempt, perhaps even adding tests that reproduce these cases

Having these tests would be great. We'd want to apply them to /event/{eventId}, /event_auth/{roomId}/{eventId}, /state/{roomId}?event_id=$xxx, /state_ids/{roomId}?event_id=$xxx (although related and probably covered by the existing backoff and NotRetryingDestination in the next point below). For /backfill and /get_missing_events, we can't track server/network problems because there is no specific event we're requesting but we can track whether the events in their responses are rejected, etc.

Should a network problem actually be tracked against the event at all; won't the existing backoff behaviour be sufficient there? ....

In general, if anything goes wrong with trying to pull an event from another server, we should backoff and do it in the background next time (#13623).

The existing backoff underlying all of the federation requests (matrixfederationclient.py and NotRetryingDestination) seems like it could be sufficient for network related errors. I haven't experienced network problems being the cause of /messages being slow so I haven't put as much thought in this area.

Homeserver not on federation whitelist (? what should we do here?)

Seems like we shouldn't be asking homeservers if they aren't on our whitelist in the first place. I guess we just let FederationDeniedError handle this and continue to the next destination which accomplishes this though.


Thanks for the thoughtful response on this @reivilibre! Really helping me form the right thoughts around all of this!

pdu: EventBase,
record_failure_callback: Optional[
Callable[[EventBase, str], Awaitable[None]]
] = None,
) -> EventBase:
"""Checks that event is correctly signed by the sending server.
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It would be good to see a description of at least record_failure_callback — I think the other 2 are fairly self-evident so not fussed about those particularly.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

(done in a later commit; in future please consider collapsing those into the same commit so that commit by commit review is possible for mere mortals like me who can't track the full thing ;-))

Copy link
Contributor Author

@MadLittleMods MadLittleMods Sep 30, 2022

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

My PR's are definitely not commit by commit review friendly. I find working that way in the first place is impossible.

And rebasing and splitting multiple changes in the same file is a super chore. I'd rather split stuff by PR which feel free to call out if you see something to split out.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

If you can make them a bit more CbC friendly, I would really appreciate it to make reviews less daunting if anything. (Maybe I'm the only one though? Definitely not a requirement, but just a preference from me especially when I'm not familiar with the area of code)

Since you seem to have the idea that it's difficult to do: My technique for things like this is to use lazygit (a TUI git frontend but with good mouse support too), which lets you stage changes line-by-line and easily amend (shift+A) them into earlier commits. It also makes it easy to reorder commits etc. I think some other tools support this kind of thing as well, but lazygit is the best I've found for easily crafting the commits into a meaningful arrangement.
I think you may be referring to git's interactive rebasing and interactive staging; you can definitely do this a lot more efficiently than that :).

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

P.S. I find splitting by commits especially useful to avoid conflating 'style' changes (e.g. moving methods, renaming methods) and semantic changes. I think asking you to put these in separate PRs would be too much, though and it would certainly hurt the latency of review.

Copy link
Contributor Author

@MadLittleMods MadLittleMods Sep 30, 2022

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

What are the team thoughts around rebasing? Anyone annoyed/worried about references changing after they already pulled the branch?

If you're just looking for some of the low hanging fruit of pulling out some style changes or merging some subsequent changes back to the original chunk, that seems a lot more do-able. If we're talking about building up iteratively commit by commit, I have a lot more apprehension to put effort into doing that.

I assume you don't want me to rebase things together from different review cycles so you can easily review the things that have changed.

What about merge commits? I find it a lot harder to rebase around merge commits and would prefer to just rebase my branch than merge develop if I started doing this. But then the rebase would rewrite all of the history including stuff you already reviewed. Do you have a good way to see what's new even though I rebased on the latest develop? (does that even matter to you)


A note around rebased history: I really like the GitLab diff versioning so even if you rebase, you can still reference old diff. I used to use this all the time over there.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

A note around rebased history: I really like the GitLab diff versioning so even if you rebase, you can still reference old diff. I used to use this all the time over there.

I have similar experiences with GL and reviewboard.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

What are the team thoughts around rebasing? Anyone annoyed/worried about references changing after they already pulled the branch?

I believe the rough consensus is: "don't rebase after someone has started to review the changes", because doing so resets the "I have viewed this file" in Github's review UI.

See #13815 (comment)

So wes can avoid things like `CancelledError` which is a valid error
but probably should not count as an error
erikjohnston pushed a commit that referenced this pull request Oct 3, 2022
* Maybe not catch all errors to avoid things in the nature-of CancelledError

See #13815 (comment)

* Remove general exception tracking

* Add changelog
@MadLittleMods MadLittleMods merged commit 70a4317 into develop Oct 3, 2022
@MadLittleMods MadLittleMods deleted the madlittlemods/13700-track-when-event-signature-fails branch October 3, 2022 19:53
@MadLittleMods
Copy link
Contributor Author

Thanks for the review @reivilibre 🐹

squahtx pushed a commit that referenced this pull request Oct 4, 2022
Synapse 1.69.0rc1 (2022-10-04)
==============================

Please note that legacy Prometheus metric names are now deprecated and will be removed in Synapse 1.73.0.
Server administrators should update their dashboards and alerting rules to avoid using the deprecated metric names.
See the [upgrade notes](https://matrix-org.github.io/synapse/v1.69/upgrade.html#upgrading-to-v1690) for more details.

Features
--------

- Allow application services to set the `origin_server_ts` of a state event by providing the query parameter `ts` in [`PUT /_matrix/client/r0/rooms/{roomId}/state/{eventType}/{stateKey}`](https://spec.matrix.org/v1.4/client-server-api/#put_matrixclientv3roomsroomidstateeventtypestatekey), per [MSC3316](matrix-org/matrix-spec-proposals#3316). Contributed by @lukasdenk. ([\#11866](#11866))
- Allow server admins to require a manual approval process before new accounts can be used (using [MSC3866](matrix-org/matrix-spec-proposals#3866)). ([\#13556](#13556))
- Exponentially backoff from backfilling the same event over and over. ([\#13635](#13635), [\#13936](#13936))
- Add cache invalidation across workers to module API. ([\#13667](#13667), [\#13947](#13947))
- Experimental implementation of [MSC3882](matrix-org/matrix-spec-proposals#3882) to allow an existing device/session to generate a login token for use on a new device/session. ([\#13722](#13722), [\#13868](#13868))
- Experimental support for thread-specific receipts ([MSC3771](matrix-org/matrix-spec-proposals#3771)). ([\#13782](#13782), [\#13893](#13893), [\#13932](#13932), [\#13937](#13937), [\#13939](#13939))
- Add experimental support for [MSC3881: Remotely toggle push notifications for another client](matrix-org/matrix-spec-proposals#3881). ([\#13799](#13799), [\#13831](#13831), [\#13860](#13860))
- Keep track when an event pulled over federation fails its signature check so we can intelligently back-off in the future. ([\#13815](#13815))
- Improve validation for the unspecced, internal-only `_matrix/client/unstable/add_threepid/msisdn/submit_token` endpoint. ([\#13832](#13832))
- Faster remote room joins: record _when_ we first partial-join to a room. ([\#13892](#13892))
- Support a `dir` parameter on the `/relations` endpoint per [MSC3715](matrix-org/matrix-spec-proposals#3715). ([\#13920](#13920))
- Ask mail servers receiving emails from Synapse to not send automatic replies (e.g. out-of-office responses). ([\#13957](#13957))

Bugfixes
--------

- Send push notifications for invites received over federation. ([\#13719](#13719), [\#14014](#14014))
- Fix a long-standing bug where typing events would be accepted from remote servers not present in a room. Also fix a bug where incoming typing events would cause other incoming events to get stuck during a fast join. ([\#13830](#13830))
- Fix a bug introduced in Synapse v1.53.0 where the experimental implementation of [MSC3715](matrix-org/matrix-spec-proposals#3715) would give incorrect results when paginating forward. ([\#13840](#13840))
- Fix access token leak to logs from proxy agent. ([\#13855](#13855))
- Fix `have_seen_event` cache not being invalidated after we persist an event which causes inefficiency effects like extra `/state` federation calls. ([\#13863](#13863))
- Faster room joins: Fix a bug introduced in 1.66.0 where an error would be logged when syncing after joining a room. ([\#13872](#13872))
- Fix a bug introduced in 1.66.0 where some required fields in the pushrules sent to clients were not present anymore. Contributed by Nico. ([\#13904](#13904))
- Fix packaging to include `Cargo.lock` in `sdist`. ([\#13909](#13909))
- Fix a long-standing bug where device updates could cause delays sending out to-device messages over federation. ([\#13922](#13922))
- Fix a bug introduced in v1.68.0 where Synapse would require `setuptools_rust` at runtime, even though the package is only required at build time. ([\#13952](#13952))
- Fix a long-standing bug where `POST /_matrix/client/v3/keys/query` requests could result in excessively large SQL queries. ([\#13956](#13956))
- Fix a performance regression in the `get_users_in_room` database query. Introduced in v1.67.0. ([\#13972](#13972))
- Fix a bug introduced in v1.68.0 bug where Rust extension wasn't built in `release` mode when using `poetry install`. ([\#14009](#14009))
- Do not return an unspecified `original_event` field when using the stable `/relations` endpoint. Introduced in Synapse v1.57.0. ([\#14025](#14025))
- Correctly handle a race with device lists when a remote user leaves during a partial join. ([\#13885](#13885))
- Correctly handle sending local device list updates to remote servers during a partial join. ([\#13934](#13934))

Improved Documentation
----------------------

- Add `worker_main_http_uri` for the worker generator bash script. ([\#13772](#13772))
- Update URL for the NixOS module for Synapse. ([\#13818](#13818))
- Fix a mistake in sso_mapping_providers.md: `map_user_attributes` is expected to return `display_name`, not `displayname`. ([\#13836](#13836))
- Fix a cross-link from the registration admin API to the `registration_shared_secret` configuration documentation. ([\#13870](#13870))
- Update the man page for the `hash_password` script to correct the default number of bcrypt rounds performed. ([\#13911](#13911), [\#13930](#13930))
- Emphasize the right reasons when to use `(room_id, event_id)` in a database schema. ([\#13915](#13915))
- Add instruction to contributing guide for running unit tests in parallel. Contributed by @ashfame. ([\#13928](#13928))
- Clarify that the `auto_join_rooms` config option can also be used with Space aliases. ([\#13931](#13931))
- Add some cross references to worker documentation. ([\#13974](#13974))
- Linkify urls in config documentation. ([\#14003](#14003))

Deprecations and Removals
-------------------------

- Remove the `complete_sso_login` method from the Module API which was deprecated in Synapse 1.13.0. ([\#13843](#13843))
- Announce that legacy metric names are deprecated, will be turned off by default in Synapse v1.71.0 and removed altogether in Synapse v1.73.0. See the upgrade notes for more information. ([\#14024](#14024))

Internal Changes
----------------

- Speed up creation of DM rooms. ([\#13487](#13487), [\#13800](#13800))
- Port push rules to using Rust. ([\#13768](#13768), [\#13838](#13838), [\#13889](#13889))
- Optimise get rooms for user calls. Contributed by Nick @ Beeper (@Fizzadar). ([\#13787](#13787))
- Update the script which makes full schema dumps. ([\#13792](#13792))
- Use shared methods for cache invalidation when persisting events, remove duplicate codepaths. Contributed by Nick @ Beeper (@Fizzadar). ([\#13796](#13796))
- Improve the `synapse.api.auth.Auth` mock used in unit tests. ([\#13809](#13809))
- Faster Remote Room Joins: tell remote homeservers that we are unable to authorise them if they query a room which has partial state on our server. ([\#13823](#13823))
- Carry IdP Session IDs through user-mapping sessions. ([\#13839](#13839))
- Fix the release script not publishing binary wheels. ([\#13850](#13850))
- Raise issue if complement fails with latest deps. ([\#13859](#13859))
- Correct the comments in the complement dockerfile. ([\#13867](#13867))
- Create a new snapshot of the database schema. ([\#13873](#13873))
- Faster room joins: Send device list updates to most servers in rooms with partial state. ([\#13874](#13874), [\#14013](#14013))
- Add comments to the Prometheus recording rules to make it clear which set of rules you need for Grafana or Prometheus Console. ([\#13876](#13876))
- Only pull relevant backfill points from the database based on the current depth and limit (instead of all) every time we want to `/backfill`. ([\#13879](#13879))
- Faster room joins: Avoid waiting for full state when processing `/keys/changes` requests. ([\#13888](#13888))
- Improve backfill robustness by trying more servers when we get a `4xx` error back. ([\#13890](#13890))
- Fix mypy errors with canonicaljson 1.6.3. ([\#13905](#13905))
- Faster remote room joins: correctly handle remote device list updates during a partial join. ([\#13913](#13913))
- Complement image: propagate SIGTERM to all workers. ([\#13914](#13914))
- Update an innaccurate comment in Synapse's upsert database helper. ([\#13924](#13924))
- Update mypy (0.950 -> 0.981) and mypy-zope (0.3.7 -> 0.3.11). ([\#13925](#13925), [\#13993](#13993))
- Use dedicated `get_local_users_in_room(room_id)` function to find local users when calculating users to copy over during a room upgrade. ([\#13960](#13960))
- Refactor language in user directory `_track_user_joined_room` code to make it more clear that we use both local and remote users. ([\#13966](#13966))
- Revert catch-all exceptions being recorded as event pull attempt failures (only handle what we know about). ([\#13969](#13969))
- Speed up calculating push actions in large rooms. ([\#13973](#13973), [\#13992](#13992))
- Enable update notifications from Github's dependabot. ([\#13976](#13976))
- Prototype a workflow to automatically add changelogs to dependabot PRs. ([\#13998](#13998), [\#14011](#14011), [\#14017](#14017), [\#14021](#14021), [\#14027](#14027))
- Fix type annotations to be compatible with new annotations in development versions of twisted. ([\#14012](#14012))
- Clear out stale entries in `event_push_actions_staging` table. ([\#14020](#14020))
- Bump versions of GitHub actions. ([\#13978](#13978), [\#13979](#13979), [\#13980](#13980), [\#13982](#13982), [\#14015](#14015), [\#14019](#14019), [\#14022](#14022), [\#14023](#14023))
MadLittleMods added a commit that referenced this pull request Oct 15, 2022
…ure is invalid (#13816)

While #13635 stops us from doing the slow thing after we've already done it once, this PR stops us from doing one of the slow things in the first place.

Related to
 - #13622
    - #13635
 - #13676

Part of #13356

Follow-up to #13815 which tracks event signature failures.

With this PR, we avoid the call to the costly `_get_state_ids_after_missing_prev_event` because the signature failure will count as an attempt before and we filter events based on the backoff before calling `_get_state_ids_after_missing_prev_event` now.

For example, this will save us 156s out of the 185s total that this `matrix.org` `/messages` request. If you want to see the full Jaeger trace of this, you can drag and drop this `trace.json` into your own Jaeger, https://gist.github.com/MadLittleMods/4b12d0d0afe88c2f65ffcc907306b761

To explain this exact scenario around `/messages` -> backfill, we call `/backfill` and first check the signatures of the 100 events. We see bad signature for `$luA4l7QHhf_jadH3mI-AyFqho0U2Q-IXXUbGSMq6h6M` and `$zuOn2Rd2vsC7SUia3Hp3r6JSkSFKcc5j3QTTqW_0jDw` (both member events). Then we process the 98 events remaining that have valid signatures but one of the events references `$luA4l7QHhf_jadH3mI-AyFqho0U2Q-IXXUbGSMq6h6M` as a `prev_event`. So we have to do the whole `_get_state_ids_after_missing_prev_event` rigmarole which pulls in those same events which fail again because the signatures are still invalid.

 - `backfill`
    - `outgoing-federation-request` `/backfill`
    - `_check_sigs_and_hash_and_fetch`
       - `_check_sigs_and_hash_and_fetch_one` for each event received over backfill
          - ❗ `$luA4l7QHhf_jadH3mI-AyFqho0U2Q-IXXUbGSMq6h6M` fails with `Signature on retrieved event was invalid.`: `unable to verify signature for sender domain xxx: 401: Failed to find any key to satisfy: _FetchKeyRequest(...)`
          - ❗ `$zuOn2Rd2vsC7SUia3Hp3r6JSkSFKcc5j3QTTqW_0jDw` fails with `Signature on retrieved event was invalid.`: `unable to verify signature for sender domain xxx: 401: Failed to find any key to satisfy: _FetchKeyRequest(...)`
   - `_process_pulled_events`
      - `_process_pulled_event` for each validated event
         - ❗ Event `$Q0iMdqtz3IJYfZQU2Xk2WjB5NDF8Gg8cFSYYyKQgKJ0` references `$luA4l7QHhf_jadH3mI-AyFqho0U2Q-IXXUbGSMq6h6M` as a `prev_event` which is missing so we try to get it
            - `_get_state_ids_after_missing_prev_event`
               - `outgoing-federation-request` `/state_ids`
               - ❗ `get_pdu` for `$luA4l7QHhf_jadH3mI-AyFqho0U2Q-IXXUbGSMq6h6M` which fails the signature check again
               - ❗ `get_pdu` for `$zuOn2Rd2vsC7SUia3Hp3r6JSkSFKcc5j3QTTqW_0jDw` which fails the signature check
Fizzadar added a commit to beeper/synapse-legacy-fork that referenced this pull request Oct 18, 2022
NOTE: this is absolutely *not* safe for Beeper usage as-is. I have merged
all of the Python code in but all our customizations to the base rules
and push rule evaluator are not yet present in the new Rust module. This
will fail tests as-is and future commits will re-apply our changes in
Rust.

Synapse 1.69.0 (2022-10-17)
===========================

Please note that legacy Prometheus metric names are now deprecated and will be removed in Synapse 1.73.0.
Server administrators should update their dashboards and alerting rules to avoid using the deprecated metric names.
See the [upgrade notes](https://matrix-org.github.io/synapse/v1.69/upgrade.html#upgrading-to-v1690) for more details.

No significant changes since 1.69.0rc4.

Synapse 1.69.0rc4 (2022-10-14)
==============================

Bugfixes
--------

- Fix poor performance of the `event_push_backfill_thread_id` background update, which was introduced in Synapse 1.68.0rc1. ([\matrix-org#14172](matrix-org#14172), [\matrix-org#14181](matrix-org#14181))

Updates to the Docker image
---------------------------

- Fix docker build OOMing in CI for arm64 builds. ([\matrix-org#14173](matrix-org#14173))

Synapse 1.69.0rc3 (2022-10-12)
==============================

Bugfixes
--------

- Fix an issue with Docker images causing the Rust dependencies to not be pinned correctly. Introduced in v1.68.0 ([\matrix-org#14129](matrix-org#14129))
- Fix a bug introduced in Synapse 1.69.0rc1 which would cause registration replication requests to fail if the worker sending the request is not running Synapse 1.69. ([\matrix-org#14135](matrix-org#14135))
- Fix error in background update when rotating existing notifications. Introduced in v1.69.0rc2. ([\matrix-org#14138](matrix-org#14138))

Internal Changes
----------------

- Rename the `url_preview` extra to `url-preview`, for compatability with poetry-core 1.3.0 and [PEP 685](https://peps.python.org/pep-0685/). From-source installations using this extra will need to install using the new name. ([\matrix-org#14085](matrix-org#14085))

Synapse 1.69.0rc2 (2022-10-06)
==============================

Deprecations and Removals
-------------------------

- Deprecate the `generate_short_term_login_token` method in favor of an async `create_login_token` method in the Module API. ([\matrix-org#13842](matrix-org#13842))

Internal Changes
----------------

- Ensure Synapse v1.69 works with upcoming database changes in v1.70. ([\matrix-org#14045](matrix-org#14045))
- Fix a bug introduced in Synapse v1.68.0 where messages could not be sent in rooms with non-integer `notifications` power level. ([\matrix-org#14073](matrix-org#14073))
- Temporarily pin build-system requirements to workaround an incompatibility with poetry-core 1.3.0. This will be reverted before the v1.69.0 release proper, see [\matrix-org#14079](matrix-org#14079). ([\matrix-org#14080](matrix-org#14080))

Synapse 1.69.0rc1 (2022-10-04)
==============================

Features
--------

- Allow application services to set the `origin_server_ts` of a state event by providing the query parameter `ts` in [`PUT /_matrix/client/r0/rooms/{roomId}/state/{eventType}/{stateKey}`](https://spec.matrix.org/v1.4/client-server-api/#put_matrixclientv3roomsroomidstateeventtypestatekey), per [MSC3316](matrix-org/matrix-spec-proposals#3316). Contributed by @lukasdenk. ([\matrix-org#11866](matrix-org#11866))
- Allow server admins to require a manual approval process before new accounts can be used (using [MSC3866](matrix-org/matrix-spec-proposals#3866)). ([\matrix-org#13556](matrix-org#13556))
- Exponentially backoff from backfilling the same event over and over. ([\matrix-org#13635](matrix-org#13635), [\matrix-org#13936](matrix-org#13936))
- Add cache invalidation across workers to module API. ([\matrix-org#13667](matrix-org#13667), [\matrix-org#13947](matrix-org#13947))
- Experimental implementation of [MSC3882](matrix-org/matrix-spec-proposals#3882) to allow an existing device/session to generate a login token for use on a new device/session. ([\matrix-org#13722](matrix-org#13722), [\matrix-org#13868](matrix-org#13868))
- Experimental support for thread-specific receipts ([MSC3771](matrix-org/matrix-spec-proposals#3771)). ([\matrix-org#13782](matrix-org#13782), [\matrix-org#13893](matrix-org#13893), [\matrix-org#13932](matrix-org#13932), [\matrix-org#13937](matrix-org#13937), [\matrix-org#13939](matrix-org#13939))
- Add experimental support for [MSC3881: Remotely toggle push notifications for another client](matrix-org/matrix-spec-proposals#3881). ([\matrix-org#13799](matrix-org#13799), [\matrix-org#13831](matrix-org#13831), [\matrix-org#13860](matrix-org#13860))
- Keep track when an event pulled over federation fails its signature check so we can intelligently back-off in the future. ([\matrix-org#13815](matrix-org#13815))
- Improve validation for the unspecced, internal-only `_matrix/client/unstable/add_threepid/msisdn/submit_token` endpoint. ([\matrix-org#13832](matrix-org#13832))
- Faster remote room joins: record _when_ we first partial-join to a room. ([\matrix-org#13892](matrix-org#13892))
- Support a `dir` parameter on the `/relations` endpoint per [MSC3715](matrix-org/matrix-spec-proposals#3715). ([\matrix-org#13920](matrix-org#13920))
- Ask mail servers receiving emails from Synapse to not send automatic replies (e.g. out-of-office responses). ([\matrix-org#13957](matrix-org#13957))

Bugfixes
--------

- Send push notifications for invites received over federation. ([\matrix-org#13719](matrix-org#13719), [\matrix-org#14014](matrix-org#14014))
- Fix a long-standing bug where typing events would be accepted from remote servers not present in a room. Also fix a bug where incoming typing events would cause other incoming events to get stuck during a fast join. ([\matrix-org#13830](matrix-org#13830))
- Fix a bug introduced in Synapse v1.53.0 where the experimental implementation of [MSC3715](matrix-org/matrix-spec-proposals#3715) would give incorrect results when paginating forward. ([\matrix-org#13840](matrix-org#13840))
- Fix access token leak to logs from proxy agent. ([\matrix-org#13855](matrix-org#13855))
- Fix `have_seen_event` cache not being invalidated after we persist an event which causes inefficiency effects like extra `/state` federation calls. ([\matrix-org#13863](matrix-org#13863))
- Faster room joins: Fix a bug introduced in 1.66.0 where an error would be logged when syncing after joining a room. ([\matrix-org#13872](matrix-org#13872))
- Fix a bug introduced in 1.66.0 where some required fields in the pushrules sent to clients were not present anymore. Contributed by Nico. ([\matrix-org#13904](matrix-org#13904))
- Fix packaging to include `Cargo.lock` in `sdist`. ([\matrix-org#13909](matrix-org#13909))
- Fix a long-standing bug where device updates could cause delays sending out to-device messages over federation. ([\matrix-org#13922](matrix-org#13922))
- Fix a bug introduced in v1.68.0 where Synapse would require `setuptools_rust` at runtime, even though the package is only required at build time. ([\matrix-org#13952](matrix-org#13952))
- Fix a long-standing bug where `POST /_matrix/client/v3/keys/query` requests could result in excessively large SQL queries. ([\matrix-org#13956](matrix-org#13956))
- Fix a performance regression in the `get_users_in_room` database query. Introduced in v1.67.0. ([\matrix-org#13972](matrix-org#13972))
- Fix a bug introduced in v1.68.0 bug where Rust extension wasn't built in `release` mode when using `poetry install`. ([\matrix-org#14009](matrix-org#14009))
- Do not return an unspecified `original_event` field when using the stable `/relations` endpoint. Introduced in Synapse v1.57.0. ([\matrix-org#14025](matrix-org#14025))
- Correctly handle a race with device lists when a remote user leaves during a partial join. ([\matrix-org#13885](matrix-org#13885))
- Correctly handle sending local device list updates to remote servers during a partial join. ([\matrix-org#13934](matrix-org#13934))

Improved Documentation
----------------------

- Add `worker_main_http_uri` for the worker generator bash script. ([\matrix-org#13772](matrix-org#13772))
- Update URL for the NixOS module for Synapse. ([\matrix-org#13818](matrix-org#13818))
- Fix a mistake in sso_mapping_providers.md: `map_user_attributes` is expected to return `display_name`, not `displayname`. ([\matrix-org#13836](matrix-org#13836))
- Fix a cross-link from the registration admin API to the `registration_shared_secret` configuration documentation. ([\matrix-org#13870](matrix-org#13870))
- Update the man page for the `hash_password` script to correct the default number of bcrypt rounds performed. ([\matrix-org#13911](matrix-org#13911), [\matrix-org#13930](matrix-org#13930))
- Emphasize the right reasons when to use `(room_id, event_id)` in a database schema. ([\matrix-org#13915](matrix-org#13915))
- Add instruction to contributing guide for running unit tests in parallel. Contributed by @ashfame. ([\matrix-org#13928](matrix-org#13928))
- Clarify that the `auto_join_rooms` config option can also be used with Space aliases. ([\matrix-org#13931](matrix-org#13931))
- Add some cross references to worker documentation. ([\matrix-org#13974](matrix-org#13974))
- Linkify urls in config documentation. ([\matrix-org#14003](matrix-org#14003))

Deprecations and Removals
-------------------------

- Remove the `complete_sso_login` method from the Module API which was deprecated in Synapse 1.13.0. ([\matrix-org#13843](matrix-org#13843))
- Announce that legacy metric names are deprecated, will be turned off by default in Synapse v1.71.0 and removed altogether in Synapse v1.73.0. See the upgrade notes for more information. ([\matrix-org#14024](matrix-org#14024))

Internal Changes
----------------

- Speed up creation of DM rooms. ([\matrix-org#13487](matrix-org#13487), [\matrix-org#13800](matrix-org#13800))
- Port push rules to using Rust. ([\matrix-org#13768](matrix-org#13768), [\matrix-org#13838](matrix-org#13838), [\matrix-org#13889](matrix-org#13889))
- Optimise get rooms for user calls. Contributed by Nick @ Beeper (@Fizzadar). ([\matrix-org#13787](matrix-org#13787))
- Update the script which makes full schema dumps. ([\matrix-org#13792](matrix-org#13792))
- Use shared methods for cache invalidation when persisting events, remove duplicate codepaths. Contributed by Nick @ Beeper (@Fizzadar). ([\matrix-org#13796](matrix-org#13796))
- Improve the `synapse.api.auth.Auth` mock used in unit tests. ([\matrix-org#13809](matrix-org#13809))
- Faster Remote Room Joins: tell remote homeservers that we are unable to authorise them if they query a room which has partial state on our server. ([\matrix-org#13823](matrix-org#13823))
- Carry IdP Session IDs through user-mapping sessions. ([\matrix-org#13839](matrix-org#13839))
- Fix the release script not publishing binary wheels. ([\matrix-org#13850](matrix-org#13850))
- Raise issue if complement fails with latest deps. ([\matrix-org#13859](matrix-org#13859))
- Correct the comments in the complement dockerfile. ([\matrix-org#13867](matrix-org#13867))
- Create a new snapshot of the database schema. ([\matrix-org#13873](matrix-org#13873))
- Faster room joins: Send device list updates to most servers in rooms with partial state. ([\matrix-org#13874](matrix-org#13874), [\matrix-org#14013](matrix-org#14013))
- Add comments to the Prometheus recording rules to make it clear which set of rules you need for Grafana or Prometheus Console. ([\matrix-org#13876](matrix-org#13876))
- Only pull relevant backfill points from the database based on the current depth and limit (instead of all) every time we want to `/backfill`. ([\matrix-org#13879](matrix-org#13879))
- Faster room joins: Avoid waiting for full state when processing `/keys/changes` requests. ([\matrix-org#13888](matrix-org#13888))
- Improve backfill robustness by trying more servers when we get a `4xx` error back. ([\matrix-org#13890](matrix-org#13890))
- Fix mypy errors with canonicaljson 1.6.3. ([\matrix-org#13905](matrix-org#13905))
- Faster remote room joins: correctly handle remote device list updates during a partial join. ([\matrix-org#13913](matrix-org#13913))
- Complement image: propagate SIGTERM to all workers. ([\matrix-org#13914](matrix-org#13914))
- Update an innaccurate comment in Synapse's upsert database helper. ([\matrix-org#13924](matrix-org#13924))
- Update mypy (0.950 -> 0.981) and mypy-zope (0.3.7 -> 0.3.11). ([\matrix-org#13925](matrix-org#13925), [\matrix-org#13993](matrix-org#13993))
- Use dedicated `get_local_users_in_room(room_id)` function to find local users when calculating users to copy over during a room upgrade. ([\matrix-org#13960](matrix-org#13960))
- Refactor language in user directory `_track_user_joined_room` code to make it more clear that we use both local and remote users. ([\matrix-org#13966](matrix-org#13966))
- Revert catch-all exceptions being recorded as event pull attempt failures (only handle what we know about). ([\matrix-org#13969](matrix-org#13969))
- Speed up calculating push actions in large rooms. ([\matrix-org#13973](matrix-org#13973), [\matrix-org#13992](matrix-org#13992))
- Enable update notifications from Github's dependabot. ([\matrix-org#13976](matrix-org#13976))
- Prototype a workflow to automatically add changelogs to dependabot PRs. ([\matrix-org#13998](matrix-org#13998), [\matrix-org#14011](matrix-org#14011), [\matrix-org#14017](matrix-org#14017), [\matrix-org#14021](matrix-org#14021), [\matrix-org#14027](matrix-org#14027))
- Fix type annotations to be compatible with new annotations in development versions of twisted. ([\matrix-org#14012](matrix-org#14012))
- Clear out stale entries in `event_push_actions_staging` table. ([\matrix-org#14020](matrix-org#14020))
- Bump versions of GitHub actions. ([\matrix-org#13978](matrix-org#13978), [\matrix-org#13979](matrix-org#13979), [\matrix-org#13980](matrix-org#13980), [\matrix-org#13982](matrix-org#13982), [\matrix-org#14015](matrix-org#14015), [\matrix-org#14019](matrix-org#14019), [\matrix-org#14022](matrix-org#14022), [\matrix-org#14023](matrix-org#14023))

# -----BEGIN PGP SIGNATURE-----
#
# iQFEBAABCgAuFiEEBTGR3/RnAzBGUif3pULk7RsPrAkFAmNNMOMQHGVyaWtAbWF0
# cml4Lm9yZwAKCRClQuTtGw+sCVnjB/9jpJRVnicteEpDfVX9iLo2qfIfcO/GhUJK
# pJhv4yuY9whAldvJpmNw2f9tfUbAMcvrjlFvNrjihWmXcAGFazC6i3fNBjPgZW2e
# Sxsuuy8xc9X/OqH2EUpHtNZQX3FfSbdBS93Z62ZO3R8tEbCQvjw6FXBdjjjf5uLO
# y5Lsx94+41FJYOhs1Kt4fN92B9WMACR6e/O1YcsDjIXsoZI3uqO1h8filbQIZee7
# DTATE7eIPtShs2Ezaaeuc7tZGVDyPvgWIbuxuT6OGx20zmuChYJgIcVaD1me4UzJ
# i9bVigtpYN0eUxuWnjLf7YC6Ys/Y9wZ7/lhdgaBwdbQKEJdpi+S4
# =JWaO
# -----END PGP SIGNATURE-----
# gpg: Signature made Mon Oct 17 11:39:31 2022 BST
# gpg:                using RSA key 053191DFF4670330465227F7A542E4ED1B0FAC09
# gpg:                issuer "erik@matrix.org"
# gpg: Can't check signature: No public key

# Conflicts:
#	docker/Dockerfile
#	pyproject.toml
#	synapse/_scripts/update_synapse_database.py
#	synapse/handlers/message.py
#	synapse/handlers/receipts.py
#	synapse/logging/context.py
#	synapse/push/baserules.py
#	synapse/push/bulk_push_rule_evaluator.py
#	synapse/push/push_rule_evaluator.py
#	synapse/replication/http/send_event.py
#	synapse/rest/admin/users.py
#	synapse/rest/client/read_marker.py
#	synapse/rest/client/receipts.py
#	synapse/rest/client/room.py
#	synapse/storage/_base.py
#	synapse/storage/databases/main/__init__.py
#	synapse/storage/databases/main/cache.py
#	synapse/storage/databases/main/events.py
#	synapse/storage/databases/main/receipts.py
#	tests/push/test_push_rule_evaluator.py
netbsd-srcmastr pushed a commit to NetBSD/pkgsrc that referenced this pull request Oct 29, 2022
Synapse 1.69.0 (2022-10-17)
===========================

Please note that legacy Prometheus metric names are now deprecated and will be removed in Synapse 1.73.0.
Server administrators should update their dashboards and alerting rules to avoid using the deprecated metric names.
See the [upgrade notes](https://matrix-org.github.io/synapse/v1.69/upgrade.html#upgrading-to-v1690) for more details.


Deprecations and Removals
-------------------------

- Remove the `complete_sso_login` method from the Module API which was
  deprecated in Synapse
  1.13.0. ([\#13843](matrix-org/synapse#13843))

- Announce that legacy metric names are deprecated, will be turned off
  by default in Synapse v1.71.0 and removed altogether in Synapse
  v1.73.0. See the upgrade notes for more
  information. ([\#14024](matrix-org/synapse#14024))

- Deprecate the `generate_short_term_login_token` method in favor of
  an async `create_login_token` method in the Module
  API. ([\#13842](matrix-org/synapse#13842))


Features
--------

- Allow application services to set the `origin_server_ts` of a state
  event by providing the query parameter `ts` in [`PUT
  /_matrix/client/r0/rooms/{roomId}/state/{eventType}/{stateKey}`](https://spec.matrix.org/v1.4/client-server-api/#put_matrixclientv3roomsroomidstateeventtypestatekey),
  per
  [MSC3316](matrix-org/matrix-spec-proposals#3316). Contributed
  by
  @lukasdenk. ([\#11866](matrix-org/synapse#11866))

- Allow server admins to require a manual approval process before new
  accounts can be used (using
  [MSC3866](matrix-org/matrix-spec-proposals#3866)). ([\#13556](matrix-org/synapse#13556))

- Exponentially backoff from backfilling the same event over and
  over. ([\#13635](matrix-org/synapse#13635),
  [\#13936](matrix-org/synapse#13936))

- Add cache invalidation across workers to module
  API. ([\#13667](matrix-org/synapse#13667),
  [\#13947](matrix-org/synapse#13947))

- Experimental implementation of
  [MSC3882](matrix-org/matrix-spec-proposals#3882)
  to allow an existing device/session to generate a login token for
  use on a new
  device/session. ([\#13722](matrix-org/synapse#13722),
  [\#13868](matrix-org/synapse#13868))

- Experimental support for thread-specific receipts
  ([MSC3771](matrix-org/matrix-spec-proposals#3771)). ([\#13782](matrix-org/synapse#13782),
  [\#13893](matrix-org/synapse#13893),
  [\#13932](matrix-org/synapse#13932),
  [\#13937](matrix-org/synapse#13937),
  [\#13939](matrix-org/synapse#13939))

- Add experimental support for [MSC3881: Remotely toggle push
  notifications for another
  client](matrix-org/matrix-spec-proposals#3881). ([\#13799](matrix-org/synapse#13799),
  [\#13831](matrix-org/synapse#13831),
  [\#13860](matrix-org/synapse#13860))

- Keep track when an event pulled over federation fails its signature
  check so we can intelligently back-off in the
  future. ([\#13815](matrix-org/synapse#13815))

- Improve validation for the unspecced, internal-only
  `_matrix/client/unstable/add_threepid/msisdn/submit_token`
  endpoint. ([\#13832](matrix-org/synapse#13832))

- Faster remote room joins: record _when_ we first partial-join to a
  room. ([\#13892](matrix-org/synapse#13892))

- Support a `dir` parameter on the `/relations` endpoint per
  [MSC3715](matrix-org/matrix-spec-proposals#3715). ([\#13920](matrix-org/synapse#13920))

- Ask mail servers receiving emails from Synapse to not send automatic
  replies (e.g. out-of-office
  responses). ([\#13957](matrix-org/synapse#13957))
Sign up for free to subscribe to this conversation on GitHub. Already have an account? Sign in.
Labels
A-Messages-Endpoint /messages client API endpoint (`RoomMessageListRestServlet`) (which also triggers /backfill) T-Enhancement New features, changes in functionality, improvements in performance, or user-facing enhancements.
Projects
None yet
Development

Successfully merging this pull request may close these issues.

3 participants