Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

MSC3618: Add proposal to simplify federation /send response #3618

Open
wants to merge 7 commits into
base: main
Choose a base branch
from

Conversation

neilalexander
Copy link
Contributor

@neilalexander neilalexander commented Jan 4, 2022

Rendered.

This is a proposal to simplify the response of the federation /send endpoint.

Implementation PR: matrix-org/dendrite#2088.

Preview: https://pr3618--matrix-org-previews.netlify.app

@neilalexander neilalexander changed the title Add proposal to simplify federation /send response MSC3618: Add proposal to simplify federation /send response Jan 4, 2022
@turt2live turt2live added kind:maintenance MSC which clarifies/updates existing spec needs-implementation This MSC does not have a qualifying implementation for the SCT to review. The MSC cannot enter FCP. proposal A matrix spec change proposal proposal-in-review s2s Server-to-Server API (federation) labels Jan 4, 2022
Comment on lines 34 to 38
A significant benefit is that homeserver implementations no longer need to block
the `/send` request in order to wait for the events to be processed for their error
results. This can potentially allow homeserver implementations to remove head-of-line
blocking from `/send` by maintaining durable queues for incoming federation events and
processing them on a per-room basis.
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This may be covered by other threads, in which case, apologies, but I don't really understand why removing the pdus section this is a prerequisite for removing this head-of-line blocking. Indeed, synapse has already done this.

If, as a server, you're happy that the sender will not re-send any of the events in the transaction (ie, under this MSC, you're going to return a 200), all you need to do currently is parse enough of the transaction to calculate the event ids. (Or is that work you're hoping to avoid? It seems like, by the time you've validated the shape of the body, you're most of the way to calculating event IDs)

I guess I agree that pdus is kinda pointless if no implementations actually read the field (other than for logging), but that doesn't necessarily mean that we should prevent implementations making use of it by getting rid of it.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The head-of-line blocking is that you have to wait for the PDUs to be processed before you can generate the /send response body because the response body expects for there to be event IDs and error results (the PDU Processing Result). Meanwhile, while that is happening, the remote server is waiting for this request to finish before sending you the next one. One event in the transaction which takes abnormally long to process will therefore slow down that server's ability to send you events from other rooms until it is done.

If what you are saying is correct, Synapse just unmarshals the transaction, generates the event IDs, returns those with empty PDU Processing Result and never reports problems anyway?

In which case, given that implementations don't actually care what's in the "pdus" key at all (except for seemingly writing to the log), it seems pointless to have an API shape which implies we should know the result of the PDU being processed before being able to respond.

By removing that burden, it becomes possible for implementations to not have to return anything other than a confirmation that they "accepted" the transaction. They can just queue up the work and return 200 and then deal with it in their own time. Or servers can continue to do things like they do today, where they don't queue and just do things and block while doing it, which is also fine.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

you have to wait for the PDUs to be processed

I think you need to define "processed" better. There are a lot of steps to "processing" an PDU, and you absolutely don't have to do all of them before you can reply to /send.

If what you are saying is correct, Synapse just unmarshals the transaction, generates the event IDs, returns those with empty PDU Processing Result and never reports problems anyway?

Synapse also checks the signatures at this stage, for the record. And it writes the events to a queue in persistent storage in case it gets restarted before processing is completed.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think you need to define "processed" better.

Well, I suppose the spec needs to define "processed" better, sure! 😀 In my mind, an event is fully processed once it has been unmarshalled, signature checked, authed and the forward extremities updated (if needed). I am not really sure if there is any text elsewhere that suggests a better definition.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Again I'd say: once you've deserialised the JSON and checked that the object therein is the right shape, calculating the event id is a really small amount of effort. So are you saying:

  • you don't want to have to validate the request body before you return a 200, or:
  • you don't agree that it's easy to calculate the event id while you're doing that validation?

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

ok, fair.

I think there's certainly grounds for the spec to be clearer about what it means by "processed" here (and tbh this starts to get into https://github.com/matrix-org/matrix-doc/issues/1646). I don't think Synapse's behaviour (and hence what you're proposing for Dendrite) is precluded by the spec, but I agree the current wording and examples are suggestive of more comprehensive "processing". (This is probably mostly an artifact of the spec being written based on what Synapse happened to do at the time, rather than being properly designed and thought about.) If you'd like to propose PRs for the spec wording that clarify that, I for one would be delighted.

As for whether we should return the event IDs: if you want to drive this MSC through on the grounds that PDU Processing Result is pointless and annoying, I won't object. But I don't agree it is necessary to fix the head-of-line blocking problem, so any references to that should be removed from the MSC.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I removed the mention of head-of-line blocking from the MSC.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

it still says things like "the receiving homeserver no longer needs to block the the /send request"

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

If we're waiting for a PDU Processing Result, then we're blocking waiting for the event "processing" to be done so that we know what the PDU Processing Result is.

Is there a better word you have in mind instead of "blocking"? To me it feels like delaying responding until we know the PDU Processing Result is "blocking".

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The spec's intent was to say "process the event enough to populate pdus" - anything beyond that is implementation detail. My understanding of Synapse's latest handling of transactions is that it does the absolute bare minimum and queues the remaining handling elsewhere, speeding up the response time. This behaviour should be acceptable in the eyes of the spec today.

Regardless of it being mentioned in this MSC, the spec would likely update to clarify that it's an implementation detail for how to "process" an event and that the server should use blocking the response as a way to slow down inbound transactions (if they want to do that, such as in cases where their internal pending queue is too large).

Comment on lines 46 to 54
> The sending server must wait and retry for a 200 OK response before sending a
> transaction with a different txnId to the receiving server.

With this proposal, blocking becomes optional rather than required. Servers that do not
want to durably persist transactions before processing them can continue to perform all
work in-memory by continuing to block on `/send` as is done today. Additionally, a server
that is receiving too many transactions from a given homeserver may wish to block for
an arbitrary period of time for rate-limiting purposes, but this is not necessarily
required.
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'm confused. We're talking about the sending server (ie, the "client" of /send from the REST POV) here, but you've written:

Servers that do not want to durably persist transactions before processing them ...

isn't processing a transaction something that happens on the receiving side? Generally your text here only seems to make sense from the receiving side.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The wording "The sending server" comes from the spec, which yes, is the REST client in this case. My wording is talking about the "receiving server" and I've updated the MSC to be clearer with that.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I still don't really understand how all this text about whether receiving servers want to block before replying to /send is relevant to the quoted text about senders waiting for an OK.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

So let's say you have servers A and B. The spec states that if A sends something to B, it must wait for a 200 response from B on txnID 1 before it can send txnID 2. This implies that, between A and B, only one /send can be happening at a time and they happen serially. Therefore if B spends a long time doing whatever before writing the response body and 200 code for txnID 1, A has to wait for that response before making a new request to send txnID 2.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

sure, I got all that. But your paragraph starts with a single sentence that says "we're going to relax the constraint that only one /send can happen at a time", but then instead of explaining why, you then talk about something completely different. I must be completely misunderstanding something here.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

To be clear, I am not trying to remove the constraint that /sends happen in serial. I am trying to reduce the amount of time it takes to respond to a /send so that we can increase the throughput and still be definitively spec-compliant.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

To be clear, I am not trying to remove the constraint that /sends happen in serial.

right, that's the confusion then. It's hard to read this any other way:

The sending server must wait and retry for a 200 OK response before sending a
transaction with a different txnId to the receiving server.

With this proposal, blocking becomes optional rather than required.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I updated the wording here to:

With this proposal, the receiving server needing to block the /send response to wait for
PDU Processing Results becomes optional rather than required.

Does this read better?

neilalexander added a commit to matrix-org/dendrite that referenced this pull request Jan 12, 2022
@neilalexander
Copy link
Contributor Author

neilalexander commented Jan 12, 2022

re. implementation, dendrite.neilalexander.dev is currently running matrix-org/dendrite@5ac5702e which tests not returning a "pdus" key in the /send response. It has been running for a few hours and everything seems to be fine with incoming federation still.

Implementation PR: matrix-org/dendrite#2088.

@turt2live turt2live removed the needs-implementation This MSC does not have a qualifying implementation for the SCT to review. The MSC cannot enter FCP. label Jan 12, 2022
@timokoesters
Copy link

Conduit currently fails to understand the response, see #3618 (comment)

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
kind:maintenance MSC which clarifies/updates existing spec proposal A matrix spec change proposal s2s Server-to-Server API (federation)
Projects
None yet
Development

Successfully merging this pull request may close these issues.

6 participants