Skip to content

Conversation

@ara4n
Copy link
Member

@ara4n ara4n commented Apr 26, 2024

Rendered

Written with my SCT lead hat on.

Hopefully fixes matrix-org/matrix-spec#117 (at last)

@ara4n ara4n added the proposal A matrix spec change proposal label Apr 26, 2024
@ara4n ara4n added the kind:core MSC which is critical to the protocol's success label Apr 26, 2024
Comment on lines +58 to +59
no other events to be sent), but queued in such a way to prioritise servers with lower `retry_after` value first,
ensuring alive servers are prioritised over likely dead servers.
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I fear this "prioritise servers with lower retry_after" bit is non-trivial to implement, and, if you're saying we have to attempt connect to all servers anyway, probably doesn't bring that much value? We're still going to max out our CPU for 5 minutes attempting to connect to a bunch of dead servers; whether that happens before or after sending our first message to the "recently good" servers doesn't seem likely to make much difference to me.

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I suppose it would lower the mean (noticeable) federation latency as "dead servers" are only tried last. For synapse, wouldn't this be a matter of SELECT * FROM destinations ORDER BY retry_after ASC;? (or using a subquery with the same effect)

The spec currently does not specify anything about how federation retry schedules: as part of this change, we propose
explicitly adding that:

* Servers SHOULD follow exponential or geometric backoff schedules (and MUST NOT retry linearly)
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

What does this imply about having a backoff limit at which the server will keep retrying at that rate once reached?

Comment on lines +52 to +53
The spec currently does not specify anything about how federation retry schedules: as part of this change, we propose
explicitly adding that:
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Would whether servers persist backoffs between restarts be up to the implementation, or would you like to specify that as well?

Comment on lines +10 to +12
rapidly as possible (both SRV and .well-known lookups, and then /send attempts) to identify alive servers. Meanwhile,
dead servers (e.g. domains which no longer run Matrix servers) will be
[hammered](https://mastodon.matrix.org/@mnot@techpolicy.social/112319234007365786) by connection attempts.
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This sounds like the real problem, not the load on new servers. Is an alternative to have a way to automatically remove people that are on dead servers? Maybe using MSC4049?

@turt2live turt2live added needs-implementation This MSC does not have a qualifying implementation for the SCT to review. The MSC cannot enter FCP. kind:maintenance MSC which clarifies/updates existing spec and removed kind:core MSC which is critical to the protocol's success labels Apr 30, 2024
Comment on lines +52 to +53
The spec currently does not specify anything about how federation retry schedules: as part of this change, we propose
explicitly adding that:
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

it's unclear to me that this is a required measure given the problem statement. Why do we need to strongly specify retry behaviour instead of adding implementation guidance to the existing spec?

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Implementation requirements:

  • Server producing values
  • Server consuming values, and making decisions based on those values
  • Evidence of the rate limit strategy being applied as described

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

kind:maintenance MSC which clarifies/updates existing spec needs-implementation This MSC does not have a qualifying implementation for the SCT to review. The MSC cannot enter FCP. proposal A matrix spec change proposal

Projects

None yet

Development

Successfully merging this pull request may close these issues.

Persist/seed retry schedule to speed up initial servers (SPEC-258)

7 participants