MSC3360: Server Status #3360

Open
wants to merge 1 commit into
base: old_master

daenney commented Aug 25, 2021

This proposal aims to provide a channel through which a client can get status information about its homeserver so that it can provide more useful context to users when problems occur or provide advance notice for upcoming maintenance.

daenney changed the title from "MSCNNNN: Server Status" to "MSC3360: Server Status" Aug 25, 2021
daenney added the client-server, kind:feature, and proposal labels Aug 25, 2021

eras commented Aug 25, 2021

Looks good!

However, it seems there's no MSC about signaling server decommissioning? If there were to be such a feature, I think this MSC could include it.

In that case the endpoint would be used not only by the server's clients, but also by other servers. Something along the lines of: this instance is down for good, don't bother calling me again ;). It would be possible to arrange this response even with a static web server. For this to work, the endpoint would need to be available to the world (or at the very least to its peers), not just to the clients of the server.

Maybe its effects could be determined by the other server: for example, it could clear the decommissioning status if it sees inbound federation traffic from the server?

I hear it's a real-world issue that decommissioned servers get incoming requests for long periods of time after having been terminated. Probably not a very severe issue, but an issue nevertheless. It also consumes resources on other servers that keep contacting it.


daenney commented Aug 25, 2021

> Looks good!
>
> However, it seems there's no MSC about signaling server decommissioning? If there were to be such a feature, I think this MSC could include it.
>
> In that case the endpoint would be used not only by the server's clients, but also by other servers. Something along the lines of: this instance is down for good, don't bother calling me again ;). It would be possible to arrange this response even with a static web server. For this to work, the endpoint would need to be available to the world (or at the very least to its peers), not just to the clients of the server.

This is an interesting idea. I'm wondering whether that would be better served by some additional information under the well-known instead (which is typically already handled by a front-end returning a static response). Since the server is decommissioned, it feels odd to me personally to have an API endpoint for it. I believe servers periodically check /.well-known/matrix/server for the servers they're talking to and read m.server from it, so I'm wondering whether we should add something there to signal this instead.

I do foresee some potential issues with that though. Let's suppose you own "my-awesome-domain.com", hosted a Matrix deployment on it and eventually decommissioned it, signalling that to the federation with a hypothetical "hey, this deployment is permanently gone" marker. Eventually you let the domain registration lapse because you no longer have any use for it, and someone else registers it and wants to run a Matrix deployment on it again. If this "server is permanently gone" status is stored by all servers it previously talked to, the new owner of the domain will never be able to federate with those servers again. If it's not stored but periodically checked, then once the domain changes hands the new owner would need to keep publishing this information if they don't want to receive random Matrix traffic, which also feels odd.

This feels more like we need to clarify something in the s2s spec about how, and for how long, homeservers should back off, and at which point it's OK for them to give up entirely.

Also, if servers don't want to receive any traffic any more, wouldn't it be sufficient for the server admin to force all users on their homeserver to leave all rooms before decommissioning the deployment? Since they'd no longer be participating in any room, there should be no reason for anyone to continue to federate with them.

> Maybe its effects could be determined by the other server: for example, it could clear the decommissioning status if it sees inbound federation traffic from the server?
>
> I hear it's a real-world issue that decommissioned servers get incoming requests for long periods of time after having been terminated. Probably not a very severe issue, but an issue nevertheless. It also consumes resources on other servers that keep contacting it.

proposals/3360-server-status.md (outdated)
}
```

### Retrieval of status events: `GET /_matrix/client/r0/server/status`

For best reliability, it would be better if this could be hosted entirely off the server's domain. For example, GitHub uses https://www.githubstatus.com/. Maybe it makes sense to advertise the preferred URL to the client, with the client expected to cache that URL until it gets a newer one?

The obvious downside is that this doesn't help a new client. In that case, maybe this URL could be used as a fallback if no better URL is known (or if the better URL is returning an error).
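
A minimal sketch of that lookup order in Python, assuming the client keeps the last advertised URL in a local cache. The helper names (`candidate_urls`, `fetch_first_available`) and the injected `fetch` callable are hypothetical and not part of the proposal; only the `/_matrix/client/r0/server/status` path comes from the MSC text under review.

```python
from typing import Callable, List, Optional

# Path of the in-spec endpoint from the MSC text under review.
STATUS_PATH = "/_matrix/client/r0/server/status"


def candidate_urls(homeserver_base: str, cached_url: Optional[str]) -> List[str]:
    """Return the URLs to try: the cached preferred URL first (if any),
    then the endpoint on the homeserver itself as the fallback."""
    urls: List[str] = []
    if cached_url:
        urls.append(cached_url)
    fallback = homeserver_base.rstrip("/") + STATUS_PATH
    if fallback not in urls:
        urls.append(fallback)
    return urls


def fetch_first_available(
    urls: List[str], fetch: Callable[[str], dict]
) -> Optional[dict]:
    """Try each URL in order and return the first successful response.

    `fetch` is a caller-supplied function (hypothetical) that returns the
    parsed response or raises OSError on a network or HTTP failure.
    """
    for url in urls:
        try:
            return fetch(url)
        except OSError:
            continue  # fall through to the next candidate
    return None
```

A fresh client has no cached URL, so `candidate_urls` yields only the homeserver endpoint, which matches the fallback behaviour described above.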

Author

I agree, and I've been trying to figure out a way to make that happen. The most obvious place to me would be to add this to either the client well-known response, or capabilities. Something like an m.server_status_endpoint. I'm not sure what's most appropriate, though I'm leaning towards the capabilities endpoint.

This does open up a somewhat interesting can of worms if someone were to point to a different location, say myhomeserverstatus.com/matrix_server_status, but someone else gains control of that domain (a lapsed registration for example). Not sure what to do about that.

> but someone else gains control of that domain

I don't think this is worth worrying about. You can already make this argument about having the Matrix server and the federation domain on different domains. In practice it is the site operator's choice, and if they choose to use multiple domains they need to be committed to maintaining them. Plus the downside isn't that bad: it just shows an informational message and can always be undone by the homeserver operator (once they get around to it).

Author

I'm struggling a bit with where to properly put this. The capabilities endpoint is pretty clear that we shouldn't advertise support for unstable features, so it would fit better in /versions's unstable_features object. However, that one is only expected to be a feature: boolean mapping, which wouldn't let us point to a URL.

That brings me back to putting it in the well-known instead, which sort of feels appropriate for this since it allows us to direct C2S vs S2S requests at different hosts, but it feels weird to use it to offload a single endpoint.

So, where do we put this/how do we advertise it?

I think well-known sounds reasonable. It may even be preferred, since it is often on different infrastructure than the server itself. It means that even first-time users or fresh clients could find the status page, which is a very nice property.

Author

Mulling this over a bit, I was initially thinking about extending the .well-known/matrix/client response. But it kept feeling inappropriate to redirect a single endpoint that way and potentially opening up the door for that discovery document growing and growing.

So I find myself wondering if maybe it's preferable to have something like .well-known/matrix/server-status instead that would return an {m.status_endpoint: ""}?
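
As a rough sketch of how a client might read such a document, assuming it is plain JSON with a single `m.status_endpoint` key as floated above. The function name is hypothetical, and the HTTPS-only check is an assumption of this sketch, not something the thread specifies.

```python
import json
from typing import Optional
from urllib.parse import urlsplit


def status_endpoint_from_well_known(body: str) -> Optional[str]:
    """Extract `m.status_endpoint` from the body of a hypothetical
    `.well-known/matrix/server-status` document.

    Returns None when the document is not valid JSON, the key is missing
    or empty, or the value is not an absolute https URL (the https
    requirement is an assumption made for this sketch).
    """
    try:
        doc = json.loads(body)
    except json.JSONDecodeError:
        return None
    if not isinstance(doc, dict):
        return None
    url = doc.get("m.status_endpoint")
    if not isinstance(url, str) or not url:
        return None
    try:
        parts = urlsplit(url)
    except ValueError:
        return None
    if parts.scheme != "https" or not parts.netloc:
        return None
    return url
```

A client that gets None back would simply keep using whatever endpoint it already knows about.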

I like .well-known/matrix/server-status

Contributor

I suggested .well-known/matrix/status in #spec before arriving at this MSC; I think both could enhance each other.

Regardless, I think this concern should be noted under the "Potential Issues" section, or at least the drawback of this endpoint being on the same domain should be noted.

(Other than that, this is a solid proposal imo 👍)

turt2live added the needs-implementation label Aug 30, 2021
Signed-off-by: Daniele Sluijters <daenney@users.noreply.github.com>
```

### Retrieval of status events: `GET /_matrix/client/r0/server/status`

Contributor

I believe it'd help if a suggested value for caching/re-request time is also noted here.

I'm suggesting:

  • 6 hours in normal conditions
  • every 5 minutes when connection issues are present
    (With exponential backoff to 30 minutes max if the endpoints don't work, to hold off stampeding herds)
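
The schedule above could be sketched roughly as follows in Python. The function and constant names are hypothetical, and doubling the delay per consecutive failure is an assumption of this sketch; the comment only fixes the 6 hour, 5 minute and 30 minute figures.

```python
NORMAL_INTERVAL_S = 6 * 60 * 60   # 6 hours in normal conditions
DEGRADED_INTERVAL_S = 5 * 60      # 5 minutes while connection issues are present
MAX_BACKOFF_S = 30 * 60           # backoff ceiling of 30 minutes


def next_poll_delay(connection_issues: bool, consecutive_failures: int) -> int:
    """Return the number of seconds to wait before the next status request.

    `consecutive_failures` counts back-to-back failed requests to the
    status endpoint itself. The delay starts at 5 minutes and doubles
    with each further failure (an assumed factor), capped at 30 minutes.
    """
    if consecutive_failures > 0:
        delay = DEGRADED_INTERVAL_S * 2 ** (consecutive_failures - 1)
        return min(delay, MAX_BACKOFF_S)
    if connection_issues:
        return DEGRADED_INTERVAL_S
    return NORMAL_INTERVAL_S
```

A real client would probably also add some random jitter to each delay so that many clients do not retry in lockstep; that is left out here to keep the sketch deterministic.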

6 participants