MSC3618: Add proposal to simplify federation `/send` response #3618

neilalexander · 2022-01-04T17:05:11Z

Rendered.

This is a proposal to simplify the response of the federation /send endpoint.

Implementation PR: matrix-org/dendrite#2088.

Preview: https://pr3618--matrix-org-previews.netlify.app

proposals/3618-simplify-federation-send.md

richvdh · 2022-01-05T13:33:19Z

proposals/3618-simplify-federation-send.md

+A significant benefit is that homeserver implementations no longer need to block
+the `/send` request in order to wait for the events to be processed for their error
+results. This can potentially allow homeserver implementations to remove head-of-line
+blocking from `/send` by maintaining durable queues for incoming federation events and
+processing them on a per-room basis.


This may be covered by other threads, in which case, apologies, but I don't really understand why removing the pdus section is a prerequisite for removing this head-of-line blocking. Indeed, synapse has already done this.

If, as a server, you're happy that the sender will not re-send any of the events in the transaction (ie, under this MSC, you're going to return a 200), all you need to do currently is parse enough of the transaction to calculate the event ids. (Or is that work you're hoping to avoid? It seems like, by the time you've validated the shape of the body, you're most of the way to calculating event IDs.)

I guess I agree that pdus is kinda pointless if no implementations actually read the field (other than for logging), but that doesn't necessarily mean that we should prevent implementations making use of it by getting rid of it.

The head-of-line blocking is that you have to wait for the PDUs to be processed before you can generate the /send response body, because the response body is expected to contain event IDs and error results (the PDU Processing Result). Meanwhile, the remote server is waiting for this request to finish before sending you the next one. One event in the transaction which takes abnormally long to process will therefore slow down that server's ability to send you events from other rooms until it is done.
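For reference, a rough sketch of the two response shapes being compared. The `pdus` map keyed by event ID, with an optional `error` string per entry, is the PDU Processing Result from the federation spec. The event IDs and error text are invented, and the empty-object response is an assumption based on the implementation note in this PR about not returning a "pdus" key. `failed_events` is a hypothetical helper, not part of any homeserver.

```python
import json

# Current shape: one PDU Processing Result per event ID. An empty object
# means no error was reported; "error" carries a human-readable message.
# The event IDs and error text are illustrative only.
current_response = {
    "pdus": {
        "$event_a": {},
        "$event_b": {"error": "You are not allowed to send this event"},
    }
}

# Assumed shape under this proposal: no "pdus" key at all.
proposed_response = {}


def failed_events(response: dict) -> list:
    """Return the event IDs for which the receiver reported an error."""
    return [
        event_id
        for event_id, result in response.get("pdus", {}).items()
        if "error" in result
    ]


print(json.dumps(current_response, indent=2))
print(failed_events(current_response))
print(failed_events(proposed_response))
```

A sender that only logs these results loses nothing when the key is absent, which is the situation described in this thread.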

If what you are saying is correct, Synapse just unmarshals the transaction, generates the event IDs, returns those with empty PDU Processing Result and never reports problems anyway?

In which case, given that implementations don't actually care what's in the "pdus" key at all (except for seemingly writing to the log), it seems pointless to have an API shape which implies we should know the result of the PDU being processed before being able to respond.

By removing that burden, it becomes possible for implementations to not have to return anything other than a confirmation that they "accepted" the transaction. They can just queue up the work and return 200 and then deal with it in their own time. Or servers can continue to do things like they do today, where they don't queue and just do things and block while doing it, which is also fine.
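A minimal sketch of the "queue up the work and return 200" approach, assuming an in-memory `deque` per room as a stand-in for a durable queue. `handle_send`, `process_room` and `room_queues` are hypothetical names for illustration and are not taken from Dendrite or Synapse.

```python
from collections import defaultdict, deque

# Hypothetical stand-in for durable per-room storage of incoming PDUs.
room_queues = defaultdict(deque)


def handle_send(transaction: dict) -> tuple:
    """Accept a transaction: enqueue each PDU by room, then acknowledge.

    No per-event result is computed here, so the response does not wait
    on signature checks, auth, or any other per-event work.
    """
    for pdu in transaction.get("pdus", []):
        room_queues[pdu["room_id"]].append(pdu)
    return 200, {}


def process_room(room_id: str) -> list:
    """Drain one room's queue independently of every other room."""
    processed = []
    pending = room_queues[room_id]
    while pending:
        pdu = pending.popleft()
        processed.append(pdu["type"])  # placeholder for real event handling
    return processed


status, body = handle_send({
    "pdus": [
        {"room_id": "!a:example.org", "type": "m.room.message"},
        {"room_id": "!b:example.org", "type": "m.room.member"},
    ]
})
print(status, body)
```

A server that prefers the current behaviour could instead call `process_room` inside `handle_send` before returning, which corresponds to the blocking option described above.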

> you have to wait for the PDUs to be processed

I think you need to define "processed" better. There are a lot of steps to "processing" a PDU, and you absolutely don't have to do all of them before you can reply to /send.

> If what you are saying is correct, Synapse just unmarshals the transaction, generates the event IDs, returns those with empty PDU Processing Result and never reports problems anyway?

Synapse also checks the signatures at this stage, for the record. And it writes the events to a queue in persistent storage in case it gets restarted before processing is completed.

> I think you need to define "processed" better.

Well, I suppose the spec needs to define "processed" better, sure! 😀 In my mind, an event is fully processed once it has been unmarshalled, signature checked, authed and the forward extremities updated (if needed). I am not really sure if there is any text elsewhere that suggests a better definition.

Again I'd say: once you've deserialised the JSON and checked that the object therein is the right shape, calculating the event id is a really small amount of effort. So are you saying:

you don't want to have to validate the request body before you return a 200, or:

you don't agree that it's easy to calculate the event id while you're doing that validation?

ok, fair.

I think there's certainly grounds for the spec to be clearer about what it means by "processed" here (and tbh this starts to get into https://github.com/matrix-org/matrix-doc/issues/1646). I don't think Synapse's behaviour (and hence what you're proposing for Dendrite) is precluded by the spec, but I agree the current wording and examples are suggestive of more comprehensive "processing". (This is probably mostly an artifact of the spec being written based on what Synapse happened to do at the time, rather than being properly designed and thought about.) If you'd like to propose PRs for the spec wording that clarify that, I for one would be delighted.

As for whether we should return the event IDs: if you want to drive this MSC through on the grounds that PDU Processing Result is pointless and annoying, I won't object. But I don't agree it is necessary to fix the head-of-line blocking problem, so any references to that should be removed from the MSC.

I removed the mention of head-of-line blocking from the MSC.

it still says things like "the receiving homeserver no longer needs to block the the /send request"

If we're waiting for a PDU Processing Result, then we're blocking waiting for the event "processing" to be done so that we know what the PDU Processing Result is.

Is there a better word you have in mind instead of "blocking"? To me it feels like delaying responding until we know the PDU Processing Result is "blocking".

The spec's intent was to say "process the event enough to populate pdus" - anything beyond that is implementation detail. My understanding of Synapse's latest handling of transactions is that it does the absolute bare minimum and queues the remaining handling elsewhere, speeding up the response time. This behaviour should be acceptable in the eyes of the spec today.

Regardless of whether it is mentioned in this MSC, the spec would likely be updated to clarify that how to "process" an event is an implementation detail, and that the server should use blocking the response as a way to slow down inbound transactions (if it wants to do that, such as when its internal pending queue is too large).
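One way a receiving server might use a delayed response as backpressure, sketched under invented assumptions: the `soft_limit` and `max_delay` values and the linear ramp are illustrative choices, and `response_delay` is a hypothetical function rather than anything defined by the spec or an existing implementation.

```python
def response_delay(pending: int, soft_limit: int = 1000, max_delay: float = 30.0) -> float:
    """Return how many seconds to hold the /send response before replying.

    Below the soft limit the server replies immediately. Above it, the
    delay grows linearly with the overload and is capped at max_delay.
    """
    if pending <= soft_limit:
        return 0.0
    overload = (pending - soft_limit) / soft_limit
    return min(max_delay, overload * max_delay)


for pending in (500, 1500, 5000):
    print(pending, response_delay(pending))
```

Because the sender waits for a 200 before sending its next transaction, holding the response is enough to slow that sender down without rejecting anything.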

richvdh · 2022-01-05T13:36:01Z

proposals/3618-simplify-federation-send.md

+> The sending server must wait and retry for a 200 OK response before sending a
+> transaction with a different txnId to the receiving server.
+
+With this proposal, blocking becomes optional rather than required. Servers that do not
+want to durably persist transactions before processing them can continue to perform all
+work in-memory by continuing to block on `/send` as is done today. Additionally, a server
+that is receiving too many transactions from a given homeserver may wish to block for
+an arbitrary period of time for rate-limiting purposes, but this is not necessarily
+required.


I'm confused. We're talking about the sending server (ie, the "client" of /send from the REST POV) here, but you've written:

> Servers that do not want to durably persist transactions before processing them ...

isn't processing a transaction something that happens on the receiving side? Generally your text here only seems to make sense from the receiving side.

The wording "The sending server" comes from the spec, which yes, is the REST client in this case. My wording is talking about the "receiving server" and I've updated the MSC to be clearer with that.

I still don't really understand how all this text about whether receiving servers want to block before replying to /send is relevant to the quoted text about senders waiting for an OK.

So let's say you have servers A and B. The spec states that if A sends something to B, it must wait for a 200 response from B on txnID 1 before it can send txnID 2. This implies that, between A and B, only one /send can be happening at a time and they happen serially. Therefore if B spends a long time doing whatever before writing the response body and 200 code for txnID 1, A has to wait for that response before making a new request to send txnID 2.
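A toy timing model of this serial behaviour between A and B. The delays are invented numbers, and `completion_times` is a hypothetical helper written only to make the arithmetic concrete.

```python
def completion_times(response_delays: list) -> list:
    """Return when A receives each 200, given how long B takes to answer
    each transaction and assuming A sends one transaction at a time."""
    finished = []
    clock = 0.0
    for delay in response_delays:
        clock += delay  # A cannot send the next txn until this one returns
        finished.append(clock)
    return finished


# txnID 1 contains one slow event; txnIDs 2 and 3 carry events for other rooms.
blocking = completion_times([5.0, 0.25, 0.25])

# If B replies as soon as it has accepted each transaction, every reply is fast.
acknowledge_only = completion_times([0.25, 0.25, 0.25])

print(blocking)
print(acknowledge_only)
```

In the first case txnID 2 cannot complete until 5.25 seconds have passed, even though its own handling took a quarter of a second.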

sure, I got all that. But your paragraph starts with a single sentence that says "we're going to relax the constraint that only one /send can happen at a time", but then instead of explaining why, you then talk about something completely different. I must be completely misunderstanding something here.

To be clear, I am not trying to remove the constraint that /sends happen in serial. I am trying to reduce the amount of time it takes to respond to a /send so that we can increase the throughput and still be definitively spec-compliant.

> To be clear, I am not trying to remove the constraint that /sends happen in serial.

right, that's the confusion then. It's hard to read this any other way:

> > The sending server must wait and retry for a 200 OK response before sending a
> > transaction with a different txnId to the receiving server.
>
> With this proposal, blocking becomes optional rather than required.

I updated the wording here to:

> With this proposal, the receiving server needing to block the /send response to wait for
> PDU Processing Results becomes optional rather than required.

Does this read better?


neilalexander · 2022-01-12T11:43:08Z

re. implementation, dendrite.neilalexander.dev is currently running matrix-org/dendrite@5ac5702e which tests not returning a "pdus" key in the /send response. It has been running for a few hours and everything seems to be fine with incoming federation still.

Implementation PR: matrix-org/dendrite#2088.

timokoesters · 2022-01-16T09:16:44Z

Conduit currently fails to understand the response, see #3618 (comment)

neilalexander added 2 commits January 4, 2022 17:04

Add proposal to simplify federation /send response

e2df3fd

Use MSC number 3618 from PR

924bded

neilalexander changed the title ~~Add proposal to simplify federation /send response~~ MSC3618: Add proposal to simplify federation /send response Jan 4, 2022

turt2live reviewed Jan 4, 2022


Wrapping

7f4b1cb

deepbluev7 reviewed Jan 4, 2022


timokoesters reviewed Jan 4, 2022


ShadowJonathan reviewed Jan 5, 2022


Clarifications on blocking and head-of-line issues

e0999d6

richvdh reviewed Jan 5, 2022


neilalexander added 2 commits January 5, 2022 15:06

Update wording around "receiving homeservers"

d73e570

Address @richvdh review comment by removing HoL wording

33bff5d

neilalexander added a commit to matrix-org/dendrite that referenced this pull request Jan 12, 2022

Don't return pdus key on federation /send (as per matrix-org/matr…

0005f5c

…ix-spec-proposals#3618)

neilalexander mentioned this pull request Jan 12, 2022

MSC3618: Simplified /send response matrix-org/dendrite#2088

Closed

turt2live removed the needs-implementation label Jan 12, 2022

Update 3618-simplify-federation-send.md

dad0c7d

ShadowJonathan mentioned this pull request Jan 17, 2022

MSC3618 ruma/ruma#814

Merged

turt2live removed the proposal-in-review label May 5, 2022
