This repository has been archived by the owner on Apr 26, 2024. It is now read-only.

Consider if "502 Bad Gateway" is the right status code for unavailable federated servers #5837

Open
michaelkaye opened this issue Aug 8, 2019 · 2 comments
Labels
T-Enhancement New features, changes in functionality, improvements in performance, or user-facing enhancements. z-p2 (Deprecated Label)

Comments

@michaelkaye
Contributor

michaelkaye commented Aug 8, 2019

Description:

We have seen a number of load balancers that treat server failures such as 502s as indicating a problem with the server as a whole, rather than one scoped to the specific URL that was requested.

This has empirically led to flip-flopping behaviour: when a functioning homeserver tries to contact a non-functioning homeserver over federation, a run of failing requests has led to the functioning homeserver itself being marked as unavailable, so the load balancer has stopped routing traffic to it.

This is definitely an issue with the load balancers rather than Synapse, but they may be outside our control, or the control of our Synapse administrators, so we may still wish to address this.
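To make the failure mode concrete, here is a minimal sketch of the kind of passive health check described above. It is an assumption for illustration only, not the logic of any particular load balancer: the class name, the threshold of three consecutive failures, and the recovery rule are all hypothetical.

```python
from dataclasses import dataclass


@dataclass
class PassiveHealthCheck:
    """Hypothetical load-balancer logic that counts consecutive 5xx responses."""

    max_failures: int = 3  # assumed threshold, not taken from any real product
    consecutive_failures: int = 0
    healthy: bool = True

    def observe(self, path: str, status: int) -> None:
        # The path is deliberately ignored: every 5xx is attributed to the
        # whole backend, which is the behaviour this issue describes.
        if 500 <= status <= 599:
            self.consecutive_failures += 1
            if self.consecutive_failures >= self.max_failures:
                self.healthy = False
        else:
            self.consecutive_failures = 0
            self.healthy = True


check = PassiveHealthCheck()

# Three requests that each depend on a remote homeserver that is down.
for _ in range(3):
    check.observe("/federated-endpoint", 502)

# The local homeserver is working, but it is now marked as unavailable.
print(check.healthy)
```

Under these assumptions, a single successful response marks the backend healthy again, which is one way the flip-flopping could arise.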

@michaelkaye
Contributor Author

michaelkaye commented Aug 8, 2019

A thought:

Another API style (this is a large change) that would go some way towards negating this would be to respond to this type of request with:

HTTP 200 OK "You talked to your server at the transport layer" with a body which is a machine-readable equivalent of "I was unable to contact the third party server(s) I needed to fulfil this request".

That would then mean that 401 or 403 type errors would solely be in relation to a failure to authenticate to your own HS, and 5xx series errors would solely indicate an issue with your HS.

The downside is that we don't follow HTTP status code purism. Not following it may not be a bad thing.
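A rough sketch of what such a response could look like, purely as an illustration. The function name, the field names, and the `EXAMPLE_REMOTE_UNAVAILABLE` code are hypothetical and are not part of Synapse or the Matrix specification.

```python
import json


def remote_unavailable_response(remote_server: str) -> tuple[int, str]:
    """Build the proposed reply: 200 at the HTTP layer, failure described in the body."""
    body = {
        "ok": False,  # hypothetical field
        "errcode": "EXAMPLE_REMOTE_UNAVAILABLE",  # hypothetical, not a real Matrix error code
        "error": f"Unable to contact {remote_server} to fulfil this request",
        "remote_server": remote_server,  # hypothetical field
    }
    return 200, json.dumps(body)


status, raw = remote_unavailable_response("remote.example.org")
payload = json.loads(raw)

# A load balancer only sees the 200; the client reads the body to learn
# that the remote server could not be reached.
print(status, payload["errcode"])
```

With this shape, clients would have to inspect the body rather than the status code to detect federation failures, which is the cost of departing from status code purism.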

@ara4n
Member

ara4n commented Aug 8, 2019

in practice this rendered my acct barely usable today because CF marked the whole backend down because a /groups request returned 502 due to a remote HS being down.


4 participants