MSC3618: Add proposal to simplify federation /send response (#3618)
Conversation
A significant benefit is that homeserver implementations no longer need to block
the `/send` request in order to wait for the events to be processed for their error
results. This can potentially allow homeserver implementations to remove head-of-line
blocking from `/send` by maintaining durable queues for incoming federation events and
processing them on a per-room basis.
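Not part of the MSC text, but to make the per-room idea above concrete, here is a minimal Go sketch (all names hypothetical; queues kept in memory rather than durable, purely for brevity) of fanning incoming PDUs out to one worker per room so that a slow event in one room does not hold up events for others:

```go
package federation

import "sync"

// PDU is a placeholder for a parsed federation event.
type PDU struct {
	RoomID string
	Raw    []byte
}

// roomQueues fans incoming PDUs out to one worker per room. A real
// implementation would back these queues with durable storage; this sketch
// keeps them in memory.
type roomQueues struct {
	mu     sync.Mutex
	queues map[string]chan PDU
}

func newRoomQueues() *roomQueues {
	return &roomQueues{queues: make(map[string]chan PDU)}
}

// Enqueue hands the PDU to the worker for its room, starting one if needed.
// The /send handler can return 200 as soon as every PDU has been enqueued.
func (r *roomQueues) Enqueue(p PDU) {
	r.mu.Lock()
	q, ok := r.queues[p.RoomID]
	if !ok {
		q = make(chan PDU, 64)
		r.queues[p.RoomID] = q
		go func() {
			for pdu := range q {
				processPDU(pdu) // signature checks, auth, state resolution, etc.
			}
		}()
	}
	r.mu.Unlock()
	q <- p
}

func processPDU(PDU) { /* full event processing happens off the request path */ }
```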
This may be covered by other threads, in which case, apologies, but I don't really understand why removing the `pdus` section is a prerequisite for removing this head-of-line blocking. Indeed, Synapse has already done this.
If, as a server, you're happy that the sender will not re-send any of the events in the transaction (ie, under this MSC, you're going to return a 200), all you need to do currently is parse enough of the transaction to calculate the event IDs. (Or is that work you're hoping to avoid? It seems like, by the time you've validated the shape of the body, you're most of the way to calculating event IDs.)
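For illustration, a rough Go sketch of "calculate the event IDs while validating the shape of the body". The redaction and canonical-JSON helpers are stubs; a real implementation must follow the spec's redaction and canonical JSON algorithms and pick the right base64 alphabet per room version.

```go
package federation

import (
	"crypto/sha256"
	"encoding/base64"
	"encoding/json"
)

// eventID computes an event ID from a raw PDU. In room versions 4 and later
// the event ID is the unpadded URL-safe base64 of the SHA-256 reference hash
// of the event; the redaction and canonical-JSON steps are stubbed out below,
// since that is where the remaining work lives.
func eventID(rawPDU json.RawMessage) (string, error) {
	var event map[string]interface{}
	if err := json.Unmarshal(rawPDU, &event); err != nil {
		return "", err // malformed PDU: the transaction fails shape validation anyway
	}
	redacted := redactForHashing(event)
	canonical, err := canonicalJSON(redacted)
	if err != nil {
		return "", err
	}
	hash := sha256.Sum256(canonical)
	return "$" + base64.RawURLEncoding.EncodeToString(hash[:]), nil
}

// Stubs for the spec's redaction and canonical JSON algorithms.
func redactForHashing(e map[string]interface{}) map[string]interface{} { return e }
func canonicalJSON(e map[string]interface{}) ([]byte, error)           { return json.Marshal(e) }
```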
I guess I agree that `pdus` is kinda pointless if no implementations actually read the field (other than for logging), but that doesn't necessarily mean that we should prevent implementations from making use of it by getting rid of it.
The head-of-line blocking is that you have to wait for the PDUs to be processed before you can generate the `/send` response body, because the response body expects there to be event IDs and error results (the PDU Processing Result). Meanwhile, the remote server is waiting for this request to finish before sending you the next one. One event in the transaction which takes abnormally long to process will therefore slow down that server's ability to send you events from other rooms until it is done.
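For reference, my reading of the current response shape that forces this wait, sketched as Go types; this `pdus` map is what the MSC would drop:

```go
package federation

// My reading of the current 200 response for PUT
// /_matrix/federation/v1/send/{txnId}: a map keyed by event ID, which is why
// the body cannot be written until those IDs (and their results) are known.
type PDUProcessingResult struct {
	// Human-readable description of the problem, present only when the PDU
	// was rejected or failed processing.
	Error string `json:"error,omitempty"`
}

type SendResponse struct {
	PDUs map[string]PDUProcessingResult `json:"pdus"`
}
```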
If what you are saying is correct, Synapse just unmarshals the transaction, generates the event IDs, returns those with empty PDU Processing Results and never reports problems anyway?
In which case, given that implementations don't actually care what's in the `pdus` key at all (except for seemingly writing it to the log), it seems pointless to have an API shape which implies we should know the result of the PDU being processed before being able to respond.
By removing that burden, it becomes possible for implementations to not have to return anything other than a confirmation that they "accepted" the transaction. They can just queue up the work and return 200 and then deal with it in their own time. Or servers can continue to do things like they do today, where they don't queue and just do things and block while doing it, which is also fine.
> you have to wait for the PDUs to be processed

I think you need to define "processed" better. There are a lot of steps to "processing" a PDU, and you absolutely don't have to do all of them before you can reply to `/send`.
> If what you are saying is correct, Synapse just unmarshals the transaction, generates the event IDs, returns those with empty PDU Processing Result and never reports problems anyway?
Synapse also checks the signatures at this stage, for the record. And it writes the events to a queue in persistent storage in case it gets restarted before processing is completed.
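A minimal sketch, assuming a hypothetical `federation_staging` table, of the "write to a queue in persistent storage before replying" approach described above:

```go
package federation

import (
	"context"
	"database/sql"
)

// stageTransaction persists received PDUs to a staging table inside a single
// database transaction, so that a restart after the 200 has been sent does not
// lose them. Table and column names here are illustrative, not a real schema.
func stageTransaction(ctx context.Context, db *sql.DB, origin, txnID string, pdus [][]byte) error {
	tx, err := db.BeginTx(ctx, nil)
	if err != nil {
		return err
	}
	defer tx.Rollback() // no-op once Commit has succeeded

	for _, pdu := range pdus {
		if _, err := tx.ExecContext(ctx,
			`INSERT INTO federation_staging (origin, txn_id, pdu) VALUES ($1, $2, $3)`,
			origin, txnID, pdu,
		); err != nil {
			return err
		}
	}
	// Only after the commit succeeds is it safe to acknowledge the transaction.
	return tx.Commit()
}
```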
> I think you need to define "processed" better.
Well, I suppose the spec needs to define "processed" better, sure! 😀 In my mind, an event is fully processed once it has been unmarshalled, signature checked, authed and the forward extremities updated (if needed). I am not really sure if there is any text elsewhere that suggests a better definition.
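Spelling out that working definition as a sketch (helper names are hypothetical; only the first step is needed to produce the event IDs that the current response shape asks for):

```go
package federation

type event struct{ /* parsed PDU fields omitted */ }

// processEvent illustrates the working definition of "fully processed" above.
func processEvent(raw []byte) error {
	ev, err := unmarshalAndValidateShape(raw) // parse JSON, check the event's shape
	if err != nil {
		return err
	}
	if err := checkSignatures(ev); err != nil { // verify the origin server's signature
		return err
	}
	if err := checkAuthRules(ev); err != nil { // auth against auth events / current room state
		return err
	}
	return updateForwardExtremities(ev) // adjust the room's forward extremities if needed
}

// Hypothetical helpers, stubbed for the sketch.
func unmarshalAndValidateShape(raw []byte) (*event, error) { return &event{}, nil }
func checkSignatures(*event) error                         { return nil }
func checkAuthRules(*event) error                          { return nil }
func updateForwardExtremities(*event) error                { return nil }
```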
Again I'd say: once you've deserialised the JSON and checked that the object therein is the right shape, calculating the event id is a really small amount of effort. So are you saying:
- you don't want to have to validate the request body before you return a 200, or:
- you don't agree that it's easy to calculate the event id while you're doing that validation?
ok, fair.
I think there's certainly grounds for the spec to be clearer about what it means by "processed" here (and tbh this starts to get into https://github.com/matrix-org/matrix-doc/issues/1646). I don't think Synapse's behaviour (and hence what you're proposing for Dendrite) is precluded by the spec, but I agree the current wording and examples are suggestive of more comprehensive "processing". (This is probably mostly an artifact of the spec being written based on what Synapse happened to do at the time, rather than being properly designed and thought about.) If you'd like to propose PRs for the spec wording that clarify that, I for one would be delighted.
As for whether we should return the event IDs: if you want to drive this MSC through on the grounds that the PDU Processing Result is pointless and annoying, I won't object. But I don't agree it is necessary to fix the head-of-line blocking problem, so any references to that should be removed from the MSC.
I removed the mention of head-of-line blocking from the MSC.
it still says things like "the receiving homeserver no longer needs to block the `/send` request"
If we're waiting for a PDU Processing Result, then we're blocking waiting for the event "processing" to be done so that we know what the PDU Processing Result is.
Is there a better word you have in mind instead of "blocking"? To me it feels like delaying responding until we know the PDU Processing Result is "blocking".
The spec's intent was to say "process the event enough to populate `pdus`" - anything beyond that is implementation detail. My understanding of Synapse's latest handling of transactions is that it does the absolute bare minimum and queues the remaining handling elsewhere, speeding up the response time. This behaviour should be acceptable in the eyes of the spec today.
Regardless of it being mentioned in this MSC, the spec would likely be updated to clarify that how to "process" an event is an implementation detail, and that a server may block the response as a way to slow down inbound transactions (if it wants to do that, such as in cases where its internal pending queue is too large).
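A sketch of "block the response to slow down inbound transactions", using a bounded in-memory queue purely for illustration; the empty-object response body is an assumption about the MSC's simplified shape, and `parseTransaction` is a hypothetical helper.

```go
package federation

import "net/http"

// pending is a bounded queue of raw PDUs awaiting full processing.
var pending = make(chan []byte, 1024)

// sendHandler: when the queue is full, enqueueing blocks, which delays the 200
// and therefore delays the sender's next transaction.
func sendHandler(w http.ResponseWriter, r *http.Request) {
	pdus, err := parseTransaction(r) // validate the shape of the body
	if err != nil {
		http.Error(w, "bad transaction", http.StatusBadRequest)
		return
	}
	for _, pdu := range pdus {
		select {
		case pending <- pdu:
			// queued; the heavy processing happens elsewhere
		case <-r.Context().Done():
			return // the sender gave up while we were applying backpressure
		}
	}
	w.WriteHeader(http.StatusOK)
	w.Write([]byte("{}")) // assuming the MSC's simplified response is an empty object
}

// parseTransaction is a hypothetical helper that unmarshals the transaction body.
func parseTransaction(r *http.Request) ([][]byte, error) { return nil, nil }
```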
> The sending server must wait and retry for a 200 OK response before sending a
> transaction with a different txnId to the receiving server.

With this proposal, blocking becomes optional rather than required. Servers that do not
want to durably persist transactions before processing them can continue to perform all
work in-memory by continuing to block on `/send` as is done today. Additionally, a server
that is receiving too many transactions from a given homeserver may wish to block for
an arbitrary period of time for rate-limiting purposes, but this is not necessarily
required.
I'm confused. We're talking about the sending server (ie, the "client" of `/send` from the REST POV) here, but you've written:

> Servers that do not want to durably persist transactions before processing them ...

Isn't processing a transaction something that happens on the receiving side? Generally your text here only seems to make sense from the receiving side.
The wording "The sending server" comes from the spec, which yes, is the REST client in this case. My wording is talking about the "receiving server" and I've updated the MSC to be clearer with that.
I still don't really understand how all this text about whether receiving servers want to block before replying to `/send` is relevant to the quoted text about senders waiting for an OK.
So let's say you have servers A and B. The spec states that if A sends something to B, it must wait for a 200 response from B on `txnID 1` before it can send `txnID 2`. This implies that, between A and B, only one `/send` can be happening at a time and they happen serially. Therefore if B spends a long time doing whatever before writing the response body and 200 code for `txnID 1`, A has to wait for that response before making a new request to send `txnID 2`.
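For completeness, the sender-side rule being described, as a rough sketch with hypothetical helpers:

```go
package federation

import "time"

type transaction struct {
	TxnID string
	// PDUs and EDUs omitted
}

// sendLoop sketches the sender-side rule above: transactions to a given
// destination go out one at a time, and the next txnID is not sent until the
// previous one has received a 200.
func sendLoop(destination string, queue <-chan transaction) {
	for txn := range queue {
		for {
			if err := putSendTransaction(destination, txn); err == nil {
				break // got a 200; only now may the next txnID be sent
			}
			time.Sleep(5 * time.Second) // a real sender would back off exponentially
		}
	}
}

// putSendTransaction would PUT /_matrix/federation/v1/send/{txnId}; stubbed here.
func putSendTransaction(destination string, txn transaction) error { return nil }
```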
sure, I got all that. But your paragraph starts with a single sentence that says "we're going to relax the constraint that only one `/send` can happen at a time", but then instead of explaining why, you then talk about something completely different. I must be completely misunderstanding something here.
To be clear, I am not trying to remove the constraint that `/send`s happen in serial. I am trying to reduce the amount of time it takes to respond to a `/send` so that we can increase the throughput and still be definitively spec-compliant.
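(To put rough numbers on the throughput point, purely for illustration: if fully processing a transaction takes 500ms and transactions between a pair of servers are serial, that pair is capped at roughly 2 transactions per second; if the receiver can acknowledge in 10ms and do the heavy processing afterwards, the ceiling rises to roughly 100 per second, without touching the serial constraint.)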
> To be clear, I am not trying to remove the constraint that /sends happen in serial.

right, that's the confusion then. It's hard to read this any other way:

> The sending server must wait and retry for a 200 OK response before sending a
> transaction with a different txnId to the receiving server.
>
> With this proposal, blocking becomes optional rather than required.
I updated the wording here to:

> With this proposal, the receiving server needing to block the `/send` response to wait for PDU Processing Results becomes optional rather than required.

Does this read better?
Re. implementation: the implementation PR is matrix-org/dendrite#2088.
Conduit currently fails to understand the response, see #3618 (comment).
Rendered.
This is a proposal to simplify the response of the federation `/send` endpoint.
Implementation PR: matrix-org/dendrite#2088.
Preview: https://pr3618--matrix-org-previews.netlify.app