-
Notifications
You must be signed in to change notification settings - Fork 423
MSC4136: Shared retry hints #4136
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
base: main
Are you sure you want to change the base?
Conversation
| no other events to be sent), but queued in such a way to prioritise servers with lower `retry_after` value first, | ||
| ensuring alive servers are prioritised over likely dead servers. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I fear this "prioritise servers with lower retry_after" bit is non-trivial to implement, and, if you're saying we have to attempt connect to all servers anyway, probably doesn't bring that much value? We're still going to max out our CPU for 5 minutes attempting to connect to a bunch of dead servers; whether that happens before or after sending our first message to the "recently good" servers doesn't seem likely to make much difference to me.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I suppose it would lower the mean (noticeable) federation latency as "dead servers" are only tried last. For synapse, wouldn't this be a matter of SELECT * FROM destinations ORDER BY retry_after ASC;? (or using a subquery with the same effect)
| The spec currently does not specify anything about how federation retry schedules: as part of this change, we propose | ||
| explicitly adding that: | ||
|
|
||
| * Servers SHOULD follow exponential or geometric backoff schedules (and MUST NOT retry linearly) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
What does this imply about having a backoff limit at which the server will keep retrying at that rate once reached?
| The spec currently does not specify anything about how federation retry schedules: as part of this change, we propose | ||
| explicitly adding that: |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Would whether servers persist backoffs between restarts be up to the implementation, or would you like to specify that as well?
| rapidly as possible (both SRV and .well-known lookups, and then /send attempts) to identify alive servers. Meanwhile, | ||
| dead servers (e.g. domains which no longer run Matrix servers) will be | ||
| [hammered](https://mastodon.matrix.org/@mnot@techpolicy.social/112319234007365786) by connection attempts. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This sounds like the real problem, not the load on new servers. Is an alternative to have a way to automatically remove people that are on dead servers? Maybe using MSC4049?
| The spec currently does not specify anything about how federation retry schedules: as part of this change, we propose | ||
| explicitly adding that: |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
it's unclear to me that this is a required measure given the problem statement. Why do we need to strongly specify retry behaviour instead of adding implementation guidance to the existing spec?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Implementation requirements:
- Server producing values
- Server consuming values, and making decisions based on those values
- Evidence of the rate limit strategy being applied as described
Rendered
Written with my SCT lead hat on.
Hopefully fixes matrix-org/matrix-spec#117 (at last)