Skip to content
This repository has been archived by the owner on Apr 26, 2024. It is now read-only.

Fix a long-standing bug which meant that rate limiting was not restrictive enough in some cases. #13018

Merged
merged 7 commits into from
Jun 15, 2022

Conversation

reivilibre
Copy link
Contributor

The issue here comes from the way that it does the accounting for users, with these

        #   * How many times an action has occurred since a point in time
        #   * The point in time
        #   * The rate_hz of this particular entry. This can vary per request

(actions, since, _) tuples.

When updating one of these tuples, we should either:

  1. increment the action count and keep since the same: (actions + 1, since, _); or
  2. calculate how many of those actions have been forgiven, but then reset the time to now: (actions + 1 - forgiven_actions, now(), _)

(For (1), we rely on the tuples being pruned when they expire in order to avoid a problem where: you could do an action, wait a long time and then have a lot of tokens accumulated.)

The code used to do a hybrid: it would forgive your actions after sufficient time had passed but wouldn't update the timestamp that it was tracking them since.
This means that you can 'earn' extra burst tokens with a bit of careful timing.

Review is commit-by-commit with both a test illustrating the current behaviour and then being changed to characterise the intended behaviour.

@@ -265,23 +266,51 @@ def test_multiple_actions(self):
)
)
self.assertTrue(allowed)
# The time allowed is the current time because we could still repeat the action
# once.
self.assertEqual(10.0, time_allowed)
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I don't like arguing with existing tests (it's like the good ol' 'compiler bug!' excuse) but I do think this test is wrong. Please check carefully and let me know if you dis/agree.

@reivilibre reivilibre marked this pull request as ready for review June 14, 2022 11:57
@reivilibre reivilibre requested a review from a team as a code owner June 14, 2022 11:57
@DMRobertson DMRobertson self-assigned this Jun 14, 2022
@DMRobertson
Copy link
Contributor

I'm struggling to grok what the rate limiter is trying to do in the first place. Putting this back in the queue.

@DMRobertson DMRobertson removed their assignment Jun 14, 2022
@reivilibre
Copy link
Contributor Author

I'm struggling to grok what the rate limiter is trying to do in the first place. Putting this back in the queue.

If this is useful, I'd say that it tries to implement a token bucket. Of course, it's possible the author had a totally different conceptual model in mind but the token bucket is my native interpretation of 'rate limiting with burstiness' and I'm pretty sure it lines up to what the code is trying to do

@reivilibre reivilibre self-assigned this Jun 15, 2022
@reivilibre reivilibre enabled auto-merge (squash) June 15, 2022 13:41
@reivilibre reivilibre merged commit 0dbdc39 into develop Jun 15, 2022
@reivilibre reivilibre deleted the rei/ratelimit_bug branch June 15, 2022 14:11
Gnuxie added a commit to matrix-org/mjolnir that referenced this pull request Jul 6, 2022
matrix-org/synapse#13018

Rate limiting in Synapse used to reset the burst count and remove
the backoff when you were spamming continuously, now it doesn't.
Ideally we'd rewrite the rate limiting logic to back off for longer
than suggested so we could get burst again, but for now
lets just unblock CI by reducing the number of events we send in these
tests.
Gnuxie added a commit to matrix-org/mjolnir that referenced this pull request Jul 6, 2022
* Remove debug leftovers from a test.

This is really terrible and has meant whenever anyone has run `yarn test:integration` they have only been running this test.
💀💀💀
https://www.youtube.com/watch?v=jmX-tzSOFE0

* Set a default timeout for integration tests that is 5 minutes long.

Seriously, I don't think there is much to gain by making people guess
a reasnoble time for a test to complete in all the time, especially
with how much Synapse changes in response time and all of the machines
involved in running these tests.

* Warn when giving up on being throttled

* For some reason it takes longer for events to appear in /state

no i am not going to track down why yet.

* Rate limiting got a lot more aggresive.

matrix-org/synapse#13018

Rate limiting in Synapse used to reset the burst count and remove
the backoff when you were spamming continuously, now it doesn't.
Ideally we'd rewrite the rate limiting logic to back off for longer
than suggested so we could get burst again, but for now
lets just unblock CI by reducing the number of events we send in these
tests.
Gnuxie added a commit to matrix-org/mjolnir that referenced this pull request Aug 16, 2022
It doesn't work. No there really isn't more to say about it
you're welcome to dispute it if you're going to do the work investigating. I'm not.

We used to have a test here that tested whether Mjolnir was going to carry out a redact order the default limits in a reasonable time scale.
Now I think that's never going to happen without writing a new algorithm for respecting rate limiting.
Which is not something there is time for.

matrix-org/synapse#13018

Synapse rate limits were broken and very permitting so that's why the current hack worked so well.
Now it is not broken, so our rate limit handling is.

b850e45

Honestly I don't think we can expect anyone to be able to use Mjolnir under default rate limits.

well, it's not quite simple as broken, but it is broken. With the default level in synapse (which is what matrix.org uses) it is struggling to redact 15 messages within 5 minutes. that means 5 messages over the burst count. This is ofc ontop mjolnir sending reactions / responding to replies (which isn't much but... enough to mess with the rate limiter since ofc, Synapse tells requests to wait x amount of time before trying again, but that doesn't help for concurrent requests since ofc there's only 1 slot available at that future time.  This means Synapse just wacks everything with exponentially longer shit without many (or any?) events going through
it used to be fine
because rate limiting in synapse used to be a lot more liberal because it was "broken" or something, that's not me saying it's broken that's just what synapse devs say which is probably true.
if all requests went into a queue then yeah you could eliminate one problem
but that's a lot of work and i don't think we should be doing it
cos no one uses mjolnir like this anyways
Gnuxie added a commit to matrix-org/mjolnir that referenced this pull request Aug 16, 2022
The implementation is rubbish, as it doesn't avoid the exponential backoff

Remove default rate limit testing.

It doesn't work. No there really isn't more to say about it
you're welcome to dispute it if you're going to do the work investigating. I'm not.

We used to have a test here that tested whether Mjolnir was going to carry out a redact order the default limits in a reasonable time scale.
Now I think that's never going to happen without writing a new algorithm for respecting rate limiting.
Which is not something there is time for.

matrix-org/synapse#13018

Synapse rate limits were broken and very permitting so that's why the current hack worked so well.
Now it is not broken, so our rate limit handling is.

b850e45

Honestly I don't think we can expect anyone to be able to use Mjolnir under default rate limits.

well, it's not quite simple as broken, but it is broken. With the default level in synapse (which is what matrix.org uses) it is struggling to redact 15 messages within 5 minutes. that means 5 messages over the burst count. This is ofc ontop mjolnir sending reactions / responding to replies (which isn't much but... enough to mess with the rate limiter since ofc, Synapse tells requests to wait x amount of time before trying again, but that doesn't help for concurrent requests since ofc there's only 1 slot available at that future time.  This means Synapse just wacks everything with exponentially longer shit without many (or any?) events going through
it used to be fine
because rate limiting in synapse used to be a lot more liberal because it was "broken" or something, that's not me saying it's broken that's just what synapse devs say which is probably true.
if all requests went into a queue then yeah you could eliminate one problem
but that's a lot of work and i don't think we should be doing it
cos no one uses mjolnir like this anyways
Fizzadar added a commit to beeper/synapse-legacy-fork that referenced this pull request Aug 23, 2022
Synapse 1.62.0 (2022-07-05)
===========================

No significant changes since 1.62.0rc3.

Authors of spam-checker plugins should consult the [upgrade notes](https://github.com/matrix-org/synapse/blob/release-v1.62/docs/upgrade.md#upgrading-to-v1620) to learn about the enriched signatures for spam checker callbacks, which are supported with this release of Synapse.

Synapse 1.62.0rc3 (2022-07-04)
==============================

Bugfixes
--------

- Update the version of the [ldap3 plugin](https://github.com/matrix-org/matrix-synapse-ldap3/) included in the `matrixdotorg/synapse` DockerHub images and the Debian packages hosted on `packages.matrix.org` to 0.2.1. This fixes [a bug](matrix-org/matrix-synapse-ldap3#163) with usernames containing uppercase characters. ([\matrix-org#13156](matrix-org#13156))
- Fix a bug introduced in Synapse 1.62.0rc1 affecting unread counts for users on small servers. ([\matrix-org#13168](matrix-org#13168))

Synapse 1.62.0rc2 (2022-07-01)
==============================

Bugfixes
--------

- Fix unread counts for users on large servers. Introduced in v1.62.0rc1. ([\matrix-org#13140](matrix-org#13140))
- Fix DB performance when deleting old push notifications. Introduced in v1.62.0rc1. ([\matrix-org#13141](matrix-org#13141))

Synapse 1.62.0rc1 (2022-06-28)
==============================

Features
--------

- Port the spam-checker API callbacks to a new, richer API. This is part of an ongoing change to let spam-checker modules inform users of the reason their event or operation is rejected. ([\matrix-org#12857](matrix-org#12857), [\matrix-org#13047](matrix-org#13047))
- Allow server admins to customise the response of the `/.well-known/matrix/client` endpoint. ([\matrix-org#13035](matrix-org#13035))
- Add metrics measuring the CPU and DB time spent in state resolution. ([\matrix-org#13036](matrix-org#13036))
- Speed up fetching of device list changes in `/sync` and `/keys/changes`. ([\matrix-org#13045](matrix-org#13045), [\matrix-org#13098](matrix-org#13098))
- Improve URL previews for sites which only provide Twitter Card metadata, e.g. LWN.net. ([\matrix-org#13056](matrix-org#13056))

Bugfixes
--------

- Update [MSC3786](matrix-org/matrix-spec-proposals#3786) implementation to check `state_key`. ([\matrix-org#12939](matrix-org#12939))
- Fix a bug introduced in Synapse 1.58 where Synapse would not report full version information when installed from a git checkout. This is a best-effort affair and not guaranteed to be stable. ([\matrix-org#12973](matrix-org#12973))
- Fix a bug introduced in Synapse 1.60 where Synapse would fail to start if the `sqlite3` module was not available. ([\matrix-org#12979](matrix-org#12979))
- Fix a bug where non-standard information was required when requesting the `/hierarchy` API over federation. Introduced
  in Synapse v1.41.0. ([\matrix-org#12991](matrix-org#12991))
- Fix a long-standing bug which meant that rate limiting was not restrictive enough in some cases. ([\matrix-org#13018](matrix-org#13018))
- Fix a bug introduced in Synapse 1.58 where profile requests for a malformed user ID would ccause an internal error. Synapse now returns 400 Bad Request in this situation. ([\matrix-org#13041](matrix-org#13041))
- Fix some inconsistencies in the event authentication code. ([\matrix-org#13087](matrix-org#13087), [\matrix-org#13088](matrix-org#13088))
- Fix a long-standing bug where room directory requests would cause an internal server error if given a malformed room alias. ([\matrix-org#13106](matrix-org#13106))

Improved Documentation
----------------------

- Add documentation for how to configure Synapse with Workers using Docker Compose. Includes example worker config and docker-compose.yaml. Contributed by @Thumbscrew. ([\matrix-org#12737](matrix-org#12737))
- Ensure the [Poetry cheat sheet](https://matrix-org.github.io/synapse/develop/development/dependencies.html) is available in the online documentation. ([\matrix-org#13022](matrix-org#13022))
- Mention removed community/group worker endpoints in upgrade.md. Contributed by @olmari. ([\matrix-org#13023](matrix-org#13023))
- Add instructions for running Complement with `gotestfmt`-formatted output locally. ([\matrix-org#13073](matrix-org#13073))
- Update OpenTracing docs to reference the configuration manual rather than the configuration file. ([\matrix-org#13076](matrix-org#13076))
- Update information on downstream Debian packages. ([\matrix-org#13095](matrix-org#13095))
- Remove documentation for the Delete Group Admin API which no longer exists. ([\matrix-org#13112](matrix-org#13112))

Deprecations and Removals
-------------------------

- Remove the unspecced `DELETE /directory/list/room/{roomId}` endpoint, which hid rooms from the [public room directory](https://spec.matrix.org/v1.3/client-server-api/#listing-rooms). Instead, `PUT` to the same URL with a visibility of `"private"`. ([\matrix-org#13123](matrix-org#13123))

Internal Changes
----------------

- Add tests for cancellation of `GET /rooms/$room_id/members` and `GET /rooms/$room_id/state` requests. ([\matrix-org#12674](matrix-org#12674))
- Report login failures due to unknown third party identifiers in the same way as failures due to invalid passwords. This prevents an attacker from using the error response to determine if the identifier exists. Contributed by Daniel Aloni. ([\matrix-org#12738](matrix-org#12738))
- Merge the Complement testing Docker images into a single, multi-purpose image. ([\matrix-org#12881](matrix-org#12881), [\matrix-org#13075](matrix-org#13075))
- Simplify the database schema for `event_edges`. ([\matrix-org#12893](matrix-org#12893))
- Clean up the test code for client disconnection. ([\matrix-org#12929](matrix-org#12929))
- Remove code generating comments in configuration. ([\matrix-org#12941](matrix-org#12941))
- Add `Cross-Origin-Resource-Policy: cross-origin` header to content repository's thumbnail and download endpoints. ([\matrix-org#12944](matrix-org#12944))
- Replace noop background updates with `DELETE` delta. ([\matrix-org#12954](matrix-org#12954), [\matrix-org#13050](matrix-org#13050))
- Use lower isolation level when inserting read receipts to avoid serialization errors. Contributed by Nick @ Beeper. ([\matrix-org#12957](matrix-org#12957))
- Reduce the amount of state we pull from the DB. ([\matrix-org#12963](matrix-org#12963))
- Enable testing against PostgreSQL databases in Complement CI. ([\matrix-org#12965](matrix-org#12965), [\matrix-org#13034](matrix-org#13034))
- Fix an inaccurate comment. ([\matrix-org#12969](matrix-org#12969))
- Remove the `delete_device` method and always call `delete_devices`. ([\matrix-org#12970](matrix-org#12970))
- Use a GitHub form for issues rather than a hard-to-read, easy-to-ignore template. ([\matrix-org#12982](matrix-org#12982))
- Move [MSC3715](matrix-org/matrix-spec-proposals#3715) behind an experimental config flag. ([\matrix-org#12984](matrix-org#12984))
- Add type hints to tests. ([\matrix-org#12985](matrix-org#12985), [\matrix-org#13099](matrix-org#13099))
- Refactor macaroon tokens generation and move the unsubscribe link in notification emails to `/_synapse/client/unsubscribe`. ([\matrix-org#12986](matrix-org#12986))
- Fix documentation for running complement tests. ([\matrix-org#12990](matrix-org#12990))
- Faster joins: add issue links to the TODO comments in the code. ([\matrix-org#13004](matrix-org#13004))
- Reduce DB usage of `/sync` when a large number of unread messages have recently been sent in a room. ([\matrix-org#13005](matrix-org#13005), [\matrix-org#13096](matrix-org#13096), [\matrix-org#13118](matrix-org#13118))
- Replaced usage of PyJWT with methods from Authlib in `org.matrix.login.jwt`. Contributed by Hannes Lerchl. ([\matrix-org#13011](matrix-org#13011))
- Modernize the `contrib/graph/` scripts. ([\matrix-org#13013](matrix-org#13013))
- Remove redundant `room_version` parameters from event auth functions. ([\matrix-org#13017](matrix-org#13017))
- Decouple `synapse.api.auth_blocking.AuthBlocking` from `synapse.api.auth.Auth`. ([\matrix-org#13021](matrix-org#13021))
- Add type annotations to `synapse.storage.databases.main.devices`. ([\matrix-org#13025](matrix-org#13025))
- Set default `sync_response_cache_duration` to two minutes. ([\matrix-org#13042](matrix-org#13042))
- Rename CI test runs. ([\matrix-org#13046](matrix-org#13046))
- Increase timeout of complement CI test runs. ([\matrix-org#13048](matrix-org#13048))
- Refactor entry points so that they all have a `main` function. ([\matrix-org#13052](matrix-org#13052))
- Refactor the Dockerfile-workers configuration script to use Jinja2 templates in Synapse workers' Supervisord blocks. ([\matrix-org#13054](matrix-org#13054))
- Add headers to individual options in config documentation to allow for linking. ([\matrix-org#13055](matrix-org#13055))
- Make Complement CI logs easier to read. ([\matrix-org#13057](matrix-org#13057), [\matrix-org#13058](matrix-org#13058), [\matrix-org#13069](matrix-org#13069))
- Don't instantiate modules with keyword arguments. ([\matrix-org#13060](matrix-org#13060))
- Fix type checking errors against Twisted trunk. ([\matrix-org#13061](matrix-org#13061))
- Allow MSC3030 `timestamp_to_event` calls from anyone on world-readable rooms. ([\matrix-org#13062](matrix-org#13062))
- Add a CI job to check that schema deltas are in the correct folder. ([\matrix-org#13063](matrix-org#13063))
- Avoid rechecking event auth rules which are independent of room state. ([\matrix-org#13065](matrix-org#13065))
- Reduce the duplication of code that invokes the rate limiter. ([\matrix-org#13070](matrix-org#13070))
- Add a Subject Alternative Name to the certificate generated for Complement tests. ([\matrix-org#13071](matrix-org#13071))
- Add more tests for room upgrades. ([\matrix-org#13074](matrix-org#13074))
- Pin dependencies maintained by matrix.org to [semantic version](https://semver.org/) bounds. ([\matrix-org#13082](matrix-org#13082))
- Correctly report prometheus DB stats for `get_earliest_token_for_stats`. ([\matrix-org#13085](matrix-org#13085))
- Fix a long-standing bug where a finished logging context would be re-started when Synapse failed to persist an event from federation. ([\matrix-org#13089](matrix-org#13089))
- Simplify the alias deletion logic as an application service. ([\matrix-org#13093](matrix-org#13093))
- Add type annotations to `tests.test_server`. ([\matrix-org#13124](matrix-org#13124))
Sign up for free to subscribe to this conversation on GitHub. Already have an account? Sign in.
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

None yet

3 participants