This repository has been archived by the owner on Dec 13, 2023. It is now read-only.

Is Synapse backing off on everything it should? #5406

Closed
aaronraimist opened this issue Jun 9, 2019 · 5 comments

Comments

@aaronraimist
Contributor

Continued in a new issue as requested

See #5113 (comment) and #5113 (comment)

what makes you think that those do not cause backoff?

I'd be happy to be wrong, but my understanding is that I should see a log line like

synapse.util.retryutils - 214 - INFO - federation_transaction_transmission_loop-482 - Connection to matrix.th3v01d.net was unsuccessful (<class 'synapse.api.errors.RequestSendFailed'>(Failed to send request: DNSLookupError: DNS lookup failed: no results for hostname lookup: *******.)); backoff now 86400000

if it were backing off from those errors, but I don't see any backoff line for the errors I mentioned.

There is a line like

synapse.http.matrixfederationclient - 502 - DEBUG - federation_transaction_transmission_loop-794 - {PUT-O-758} [***********] Waiting 78.30996245200296s before re-sending...

Is that the same thing? It never seems to go above ~85 seconds, so I don't think it is.

@aaronraimist
Contributor Author

Examples from the last hour of logs:

DNSLookupError('no results for hostname lookup: im.srizbi.com.srizbi.com',)

Jun 09 15:52:52 ubuntu matrix-synapse[26430]: 2019-06-09 15:52:52,642 - synapse.http.matrixfederationclient - 482 - WARNING - federation_transaction_transmission_loop-558 - {PUT-O-616} [srizbi.com] Request failed: PUT matrix://srizbi.com/_matrix/federation/v1/send/1560095475160: DNSLookupError('no results for hostname lookup: im.srizbi.com.srizbi.com',)
Jun 09 15:52:52 ubuntu matrix-synapse[26430]: 2019-06-09 15:52:52,643 - synapse.http.matrixfederationclient - 502 - DEBUG - federation_transaction_transmission_loop-558 - {PUT-O-616} [srizbi.com] Waiting 3.3683070020577666s before re-sending...

In the last hour, my server has logged [srizbi.com] Request failed... 22 times.


RequestTransmissionFailed:[Error([('SSL routines', 'CONNECT_CR_CERT', 'certificate verify failed')],)]

Jun 09 16:13:53 ubuntu matrix-synapse[29832]: 2019-06-09 16:13:53,382 - synapse.http.matrixfederationclient - 482 - WARNING - federation_transaction_transmission_loop-999 - {PUT-O-1192} [matrix.wina.be] Request failed: PUT matrix://matrix.wina.be/_matrix/federation/v1/send/1560095829267: RequestTransmissionFailed:[Error([('SSL routines', 'CONNECT_CR_CERT', 'certificate verify failed')],)]
Jun 09 16:13:53 ubuntu matrix-synapse[29832]: 2019-06-09 16:13:53,383 - synapse.http.matrixfederationclient - 502 - DEBUG - federation_transaction_transmission_loop-999 - {PUT-O-1192} [matrix.wina.be] Waiting 3.4019895098393143s before re-sending...

My server logged [matrix.wina.be] Request failed: 21 times.


ConnectError('Address not available',)

Jun 09 16:23:54 ubuntu matrix-synapse[10774]: 2019-06-09 16:23:54,831 - synapse.http.matrixfederationclient - 482 - WARNING - federation_transaction_transmission_loop-672 - {PUT-O-371} [productionservers.net] Request failed: PUT matrix://productionservers.net/_matrix/federation/v1/send/1560097335772: ConnectError('Address not available',)
Jun 09 16:23:54 ubuntu matrix-synapse[10774]: 2019-06-09 16:23:54,832 - synapse.http.matrixfederationclient - 502 - DEBUG - federation_transaction_transmission_loop-672 - {PUT-O-371} [productionservers.net] Waiting 50.39709599439885s before re-sending...
Jun 09 16:25:12 ubuntu matrix-synapse[12436]: 2019-06-09 16:25:12,220 - synapse.http.matrixfederationclient - 401 - INFO - federation_transaction_transmission_loop-311 - {PUT-O-170} [productionservers.net] Sending request: PUT matrix://productionservers.net/_matrix/federation/v1/send/1560097491755; timeout 60.000000s
Jun 09 16:25:28 ubuntu matrix-synapse[12436]: 2019-06-09 16:25:28,495 - synapse.http.federation.matrix_federation_agent - 339 - INFO - federation_transaction_transmission_loop-311 - Error fetching https://productionservers.net/.well-known/matrix/server: An error occurred while connecting: 113: Host is unreachable.

My server logged [productionservers.net] Request failed: 27 times, in addition to 28 instances of [productionservers.net] Sending request: PUT matrix://productionservers.net/_matrix/federation/v1/send/...; timeout 60.000000s.


The 405 case looks like:

Jun 09 15:50:21 ubuntu matrix-synapse[24430]: 2019-06-09 15:50:21,817 - synapse.http.matrixfederationclient - 433 - INFO - federation_transaction_transmission_loop-907 - {PUT-O-860} [decentralizing.world] Got response headers: 405 Not Allowed
Jun 09 15:50:21 ubuntu matrix-synapse[24430]: 2019-06-09 15:50:21,817 - synapse.http.matrixfederationclient - 517 - WARNING - federation_transaction_transmission_loop-907 - {PUT-O-860} [decentralizing.world] Request failed: PUT matrix://decentralizing.world/_matrix/federation/v1/send/1560095300424: HttpResponseException("405: b'Not Allowed'",)
Jun 09 15:50:21 ubuntu matrix-synapse[24430]: 2019-06-09 15:50:21,818 - synapse.federation.sender.transaction_manager - 123 - INFO - federation_transaction_transmission_loop-907 - TX [decentralizing.world] {1560095300424} got 405 response
Jun 09 15:50:21 ubuntu matrix-synapse[24430]: 2019-06-09 15:50:21,818 - synapse.federation.persistence - 89 - DEBUG - federation_transaction_transmission_loop-907 - Invoked 'delivered' with args: self=<synapse.federation.persistence.TransactionActions..., transaction=(Transaction, {'unrecognized_keys': {}, 'transacti..., response_code=405, response_dict=b'<html>\r\n<head><title>405 Not Allowed</title></...
Jun 09 15:50:21 ubuntu matrix-synapse[24430]: 2019-06-09 15:50:21,818 - synapse.federation.sender.transaction_manager - 130 - DEBUG - federation_transaction_transmission_loop-907 - TX [decentralizing.world] {1560095300424} Marked as delivered

If you need more info let me know.

@ara4n
Member

ara4n commented Jun 10, 2019

(see #5414, #5413 and ye olde #1404)

@richvdh
Member

richvdh commented Jun 12, 2019

Something that may help to understand the logs:

Each federation request is tried several times. There are two potential per-request retry schedules, a 'regular' one, which tries 4 times with about 2s between requests, and a 'long' one, which tries 11 times with about 60s between requests. The long retry schedule is only used for /federation/v1/send requests. The per-request retry schedules are documented here.
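The two schedules described above can be sketched in Python. The attempt counts and base intervals come from the description in this comment ("4 times with about 2s", "11 times with about 60s"); the jitter factor is an assumption to account for the varying delays seen in the logs (e.g. 3.37s and 78.3s), not Synapse's actual constants:

```python
import random

def retry_delays(schedule):
    """Yield the sleep (in seconds) before each re-send of one request.

    Illustrative only: the 'regular' schedule makes 4 attempts roughly
    2s apart; the 'long' schedule (used only for /federation/v1/send)
    makes 11 attempts roughly 60s apart. The jitter range is a guess.
    """
    if schedule == "long":
        attempts, base = 11, 60.0
    else:
        attempts, base = 4, 2.0
    # One sleep between each pair of consecutive attempts, jittered so
    # retries against many destinations don't all fire at once.
    for _ in range(attempts - 1):
        yield base * random.uniform(0.5, 2.0)
```

Under this sketch, a single /send request can produce up to 11 "Request failed" log lines on its own before the per-destination backoff (below) is ever touched.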

Only once those 4 or 11 attempts have failed, is the per-destination backoff incremented. The backoff starts at 10 seconds [edit 2021/02/09: I think this should be 10 minutes], and then increases by roughly a factor of 5 for each failure, up to a max of 24h [edit 2021/02/09: as of #6026, infinite]. The way the backoff works is that, while the backoff is in place, we won't make any more requests. Once the backoff expires, the next request that we would send will be attempted, and will be retried several times following the per-request retry algorithm.
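A minimal sketch of that per-destination backoff, using the figures given above (10 minutes to start, per the 2021 edit, multiplied by roughly 5 per failure, capped at 24h); the constant names and function are hypothetical, not Synapse's source:

```python
# Hypothetical constants matching the description above.
MIN_RETRY_INTERVAL_MS = 10 * 60 * 1000        # 10 minutes
RETRY_MULTIPLIER = 5                           # "roughly a factor of 5"
MAX_RETRY_INTERVAL_MS = 24 * 60 * 60 * 1000   # 24 hours

def next_retry_interval(current_interval_ms):
    """Return the backoff to apply after a whole retry cycle fails.

    While this interval is in force, no requests are sent to the
    destination at all; when it expires, the next request goes through
    the normal per-request retry schedule again.
    """
    if current_interval_ms == 0:
        return MIN_RETRY_INTERVAL_MS
    return min(current_interval_ms * RETRY_MULTIPLIER, MAX_RETRY_INTERVAL_MS)
```

Starting from zero, consecutive failures yield 600000, 3000000, 15000000, 75000000, then the 86400000 cap, which matches the "backoff now 86400000" log line quoted earlier in this issue.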

The reason for the two levels of "backoff" is to distinguish transient errors (the destination server was being restarted) from actual "server is down" situations.

It's also important to understand that some requests (including /key/v1/server, as per #5414) are deliberately excluded from the backoff schedule.

I don't claim that algorithm is perfect: I explain it only in the hope that it will give some clues as to what you are seeing in the logs. What I hope to show is that the retry schedule is at least working as designed.

So, I hope that explains why grepping for 'Request failed' does not give the full picture: 28 hits over an hour seems quite reasonable when each request is attempted 10 times. If you still have concerns, we probably need to see more comprehensive logs, rather than just excerpts.

The 405 handling does seem wrong and I have raised #5442.

@aaronraimist
Contributor Author

Alright. I can understand trying to send these for a few days, but it seems like most of these are coming from servers which no longer exist or don't intend to upgrade their certificates any time soon. To solve #5113, I believe Synapse should stop all requests, not just the ones included in the backoff schedule. So the blacklist system probably does need a more comprehensive change.

I'll see if I can set up something like https://github.com/grafana/loki so we can more easily see how often each server is being sent requests, rather than manually digging through days/weeks of logs.

I'll let you decide if you want to close this in favor of #5113.

@neilisfragile
Contributor

Okay, I'll close this one in favour of #5113.
